Selective Data Reduction in Gas Chromatography/Infrared Spectrometry


Dongjin Pyo* and Hyundu Shin

Department of Chemistry, Kangwon National University, Chuncheon 200-701, Korea
Received November 28, 2000

As gas chromatography/infrared spectrometry (GC/IR) becomes routinely available, methods must be developed to deal with the large amount of data produced. We demonstrate computer methods that quickly search through a large data file, locating those spectra that display a spectral feature of interest. Based on a modified library search routine, these selective data reduction methods retrieve all or nearly all of the compounds of interest, while rejecting the vast majority of unrelated compounds. To overcome the shifting problem of IR spectra, a search method that moves the average pattern was designed. In this moving pattern search, the average pattern of a particular functional group was not held stationary, but was allowed to move slightly to the right and left.

Keywords: Selective data reduction, Infrared data, Pattern search, Fingerprint region.

Introduction

One of the great challenges of modern analytical chemistry is to try to make sense of data as fast as instruments can spew it forth. To that end, we have been interested for some time in developing computerized methods of selective data reduction. The goal of this method is the rapid sorting of hundreds of spectra to find the relatively few which may be important enough in a given analysis to warrant further attention. For example, if a complex mixture of unknown compounds is to be analyzed, the sample might be chromatographed and the GC effluent directed into a continuously scanning spectrometer. After 30 minutes of analysis, 900 spectra might have been collected and stored in a data file. Let us assume, in this particular case, that the analyst is interested in finding and identifying all of the chlorinated compounds in the mixture. If the GC effluent had been detected with a chlorine selective detector, such as the electron-capture detector (ECD), the chlorinated compounds would have been easy to locate; but the ECD contributes little structural information for the identification of unknowns. In the present case, the spectrometer provides full spectral data, useful for identification, assuming the spectra of the chlorinated compounds can be located. Each of the 900 spectra could be visually examined, searching for spectral indications of the presence of chlorine: a tedious and time-consuming approach. The data system could perform a library search on each of the 900 spectra; again, time-consuming and inefficient. If the data system was sufficiently sophisticated, it could perform some sort of artificial intelligence or automated interpretation scheme on each of the 900 spectra, but the time involved would surely be prohibitive. The sensible approach is to somehow select some of the 900 spectra for further attention, but to select them in a way that maximizes the probability of selecting chlorinated compounds.

All of the usual identification routines (manual interpretation, library search, or automated interpretation) could be applied to this smaller subset. This kind of data reduction technique is very valuable in GC/IR analysis since, in most cases, only a small fraction of the data is of interest in any particular analysis. A computer can be employed to selectively reduce the volume of data, thereby improving the efficiency of the analysis. This selection is the process we refer to as selective data reduction.1

Selective data reduction is any data processing method that somehow selects certain data as being more “important” or more “interesting” to the analyst. The selection of important data can be based on a number of criteria. The selection criteria often used are the total intensity of the spectra, or the presence of some spectral feature of interest. Commercial spectrometers frequently have simple routines built into the data systems: peak finding routines are just selective data reduction using spectral intensity as a criterion. Routines based on spectral features vary with the type of spectrometer used. In GC-MS, one has mass chromatograms2 or the presence of isotope clusters3,4 to use as indicators of compound class. In the example above, the presence of chlorine isotope clusters can be used to indicate chlorinated compounds.4

In GC/IR, “chemigrams”5 are used to indicate the presence of absorptions of interest. These are plots of integrated absorbance in a defined spectral window. Rather than looking at 900 spectra, with the use of chemigrams, the analyst sorts out only those that are likely to contain a particular functional group. Although useful, chemigrams are not always very selective, in that they show only the integrated absorbance over a chosen frequency window.
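The limitation of an integrated-absorbance criterion can be illustrated with a minimal Python sketch. It assumes a spectrum is held as a list of absorbances on an evenly spaced, descending wavenumber axis; the function names, the axis parameters (start at 4000 cm-1, 2 cm-1 spacing) and the toy spectra are illustrative assumptions, not the routine of any commercial data system.

```python
def window_indices(high_cm, low_cm, start_cm=4000.0, step_cm=2.0):
    """Return (first, last) list indices covering high_cm..low_cm inclusive.

    Assumes a descending, evenly spaced wavenumber axis (illustrative values).
    """
    first = int(round((start_cm - high_cm) / step_cm))
    last = int(round((start_cm - low_cm) / step_cm))
    return first, last


def chemigram_score(spectrum, high_cm, low_cm, start_cm=4000.0, step_cm=2.0):
    """Integrated (summed) absorbance inside the chosen window."""
    first, last = window_indices(high_cm, low_cm, start_cm, step_cm)
    return sum(spectrum[first:last + 1])


# Toy spectra on a 4000-3400 cm-1 axis (301 points at 2 cm-1 spacing).
n_points = 301
sharp_band = [0.0] * n_points
sharp_band[220] = 0.8               # one sharp band at 3560 cm-1
broad_hump = [0.004] * n_points     # weak, featureless absorbance

for name, spec in (("sharp band", sharp_band), ("broad hump", broad_hump)):
    print(name, round(chemigram_score(spec, 3800, 3400), 3))
```

In this toy case the two spectra give nearly the same integrated value in the 3800-3400 cm-1 window although their shapes are entirely different, which is the kind of ambiguity a pattern-based criterion is meant to resolve.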

In this work we show that the use of patterns of absorbances provides a much more selective criterion for data reduction. Computer algorithms have been written to search through hundreds of spectra, retrieving only those that display the pattern of interest, and these algorithms have great potential for the analysis of GC/IR data. Although our routines are based on the presence of spectral patterns, they are distinct from pattern recognition methods in both purpose and approach. Pattern recognition statistically sorts a large database into a number of clusters, and assigns a spectrum to a compound class based on the nearness of some metric representing the spectrum to one of the clustered units. Our approach seeks only to reduce the number of spectra which must be further interpreted by the analyst, and so looks only for similarity within a defined spectral window. The database is not really required and, in fact, one need not know in advance what functional group is responsible for the pattern of interest.

We have also developed computer algorithms to evaluate the selectivity of the chemigram method, i.e., how selectively chemigram algorithms pick up a specific functional group by monitoring the absorbances over a specified IR spectral region. Chemigrams have been used for specific functional group detection since their development in 1979,5 just as specific ion detection is used in GC/MS. Even though selectivity is vital in a chemigram-type approach, it has never been evaluated statistically. The algorithms we developed can give good evaluations for different functional groups, and at different threshold values for each functional group. Finding an optimum threshold value is very important because it affects the selectivity of the chemigram, i.e., how many members of a particular compound class are recovered and how many members of other functional groups are eliminated. In particular, when the chemigram is used for quantitative analysis of trace components, its selectivity increases the sensitivity of the analysis by making it insensitive to interfering spectral contributions.

Our algorithms have been compared with chemigram algorithms throughout the mid-IR region (4000-400 cm-1). A greater degree of selectivity was observed than with chemigram algorithms, especially in the O-H stretching and carbonyl stretching regions. For example, carboxylic acids absorb in both the O-H stretching and carbonyl stretching regions. One hundred eighty five spectra of carboxylic acids were averaged in the 3800-3400 cm-1 window. The resulting pattern showed a single sharp band centered around 3560 cm-1. The same 100 spectra from the database were again considered; this time, MSQ was calculated for each, as a measure of the similarity of the spectrum to the average pattern for carboxylic acids.

The results were striking. Four carboxylic acids (100%) had the lowest MSQ values: 3-chlorobutyric acid, 0.14; butyric acid, 0.26; heptanoic acid, 0.30; isobutyric acid, 0.40. The next lowest MSQ was for 2-bromo-p-cresol, with an MSQ of 1.70. Not only were the acids located as best matching the average pattern, but there was a large distance between the worst acid (MSQ = 0.40) and the next closest non-acid (MSQ = 1.70). When the experiment was repeated on the entire database, more than 92% of the carboxylic acid spectra had MSQ values of 1.5 or less; fewer than 4% of the non-carboxylic acids had an MSQ of 1.5 or less.

Experimental Section

The computer programs described were written in FORTRAN and run on an IBM 370 computer. The database chosen was the EPA Vapor Phase Collection of 3300 spectra, available from Dr. James de Haseth at the University of Georgia. The spectra were stored in digital format at 1600 bpi on nine-track magnetic tapes.

The format of the vapor phase tapes is the following: each file represents one spectral entry. There is an end-of-file mark between each file and there is an end-of-file mark at the beginning of each tape. There are thirteen records in each file. Record 1 is the header record and is 1148 bytes in length. Records 2 through 5 contain the reference interferogram, with 2058 bytes in each. Record 10 is the first part of the spectrum, with 2058 bytes. Record 11 is the second part of the spectrum, with 1646 bytes. Records 12 and 13 contain the inverse Fourier transform of the spectrum. In records 11, 12 the first two bytes contain an integral serial number corresponding to the serial number in the header record. Each spectrum is divided up into records of 1024 data points (20480 bytes). The ordinates are expressed to 0.002 absorbance units from 0.000 to 1.998. In order to save space, the absorbance units are expressed as integers (0 to 1998) by multiplying each ordinate by 1000. The spectra were measured at 2 cm-1 resolution from 4000 cm-1 to 450 cm-1. The header record includes compound name, formula, molecular weight, Chemical Abstracts Service (CAS) registry number, melting point, boiling point, Wiswesser Line Notation (WLN), etc. More detail on the format of the records has appeared in the literature.6
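The integer encoding of the ordinates can be undone as in the following minimal Python sketch. Only the scaling by 1000 and the 0 to 1998 range are taken from the description above; the function name and the range check are assumptions added for illustration, and reading the tape records themselves is not attempted.

```python
def decode_ordinates(stored):
    """Convert stored integer ordinates (0..1998) back to absorbance units.

    The factor of 1000 follows the tape description; rejecting out-of-range
    values is an added safeguard, not part of the documented format.
    """
    decoded = []
    for value in stored:
        if not 0 <= value <= 1998:
            raise ValueError(f"ordinate {value} outside the 0-1998 range")
        decoded.append(value / 1000.0)
    return decoded


print(decode_ordinates([0, 2, 750, 1998]))
```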

The basic strategy of our method was to use spectra from the database to identify patterns of absorbance that characterize certain functional groups, and then to search for those patterns in a series of ‘unknown’ spectra. Representatives of a functional group were identified by computer searching the Wiswesser Line Notation (WLN) in the database header records. The list of spectra retrieved by WLN was checked against the compound names to avoid coding errors. An “average spectrum” was calculated by taking the mean absorbance of all the normalized spectra at each frequency interval (2 cm-1) throughout the range. Since the goal of the project is rapid screening, only a small portion of the full IR range was used, a portion chosen surrounding a characteristic band of that functional group. For example, when searching for carboxylic acids or alcohols, the O-H stretching region was used (3800-3400 cm-1).
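The averaging step can be sketched in Python as follows. The text states only that the spectra were normalized, so scaling each window segment to a maximum absorbance of 1.0 is an assumption made here for illustration, as are the function names and the toy data.

```python
def normalize(segment):
    """Scale a spectral segment so its largest absorbance is 1.0.

    Assumption: normalization to the maximum within the window; the
    normalization actually used is not specified in the text.
    """
    peak = max(segment)
    if peak <= 0:
        return [0.0 for _ in segment]
    return [a / peak for a in segment]


def average_pattern(segments):
    """Mean absorbance at each frequency interval over normalized segments."""
    normalized = [normalize(seg) for seg in segments]
    count = len(normalized)
    return [sum(values) / count for values in zip(*normalized)]


# Three toy segments covering the same window with the same number of points.
class_segments = [
    [0.0, 0.2, 0.4, 0.2, 0.0],
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
]
pattern = average_pattern(class_segments)
print(pattern)
```

Normalizing before averaging keeps a few intense spectra from dominating the pattern, so that the average reflects band shape and position and not concentration.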

The average spectrum was considered to represent the functional group. Other spectra were then tested in the same frequency window to see if they exhibited the same pattern of absorbance as the average. A score was assigned to each spectrum, reflecting the degree of similarity to the average. Since this process is similar to a library search routine, except in that it is applied only to a small region of the spectrum, we used the same metric reported in the literature for library searching.

In most of the work described herein, the “difference squared” metric7,8 was used.

$$M_{SQ} = \sum_i (S_i - R_i)^2$$

where MSQ is the similarity indicator and Si and Ri are the absorbance values of the sample and reference spectra in a frequency interval i. Clearly, the smaller the value of MSQ, the better the match between the unknown and the reference (or average) spectrum; a perfect match would give MSQ = 0. ΣSi, the quantity used for the chemigram search, means the sum of the absorbance values Si of the sample spectrum over the frequency intervals i.
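As a minimal sketch of the difference-squared metric, the following Python code scores a few toy window segments against a reference pattern and ranks them. The function names and the data are hypothetical; the absolute MSQ values reported in this paper depend on the scaling of the absorbances, which is not reproduced here.

```python
def msq(sample, reference):
    """Difference-squared metric: sum over intervals of (S_i - R_i)^2."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must cover the same window")
    return sum((s - r) ** 2 for s, r in zip(sample, reference))


def rank_by_msq(candidates, reference):
    """Return (name, MSQ) pairs sorted from best match to worst."""
    scored = [(name, msq(segment, reference)) for name, segment in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


reference = [0.0, 0.5, 1.0, 0.5, 0.0]
candidates = {
    "close match": [0.0, 0.5, 0.9, 0.5, 0.0],
    "flat baseline": [0.1, 0.1, 0.1, 0.1, 0.1],
    "perfect match": [0.0, 0.5, 1.0, 0.5, 0.0],
}
ranking = rank_by_msq(candidates, reference)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```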

In the moving pattern search, a continuous incremental comparison of the average pattern was used. In the multiple patterns search, multiple regions were searched and a decision was made on the basis of the multiple search results.
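One way to read the moving pattern search is sketched below in Python: the pattern is compared with the spectrum at a series of point offsets on either side of its nominal position, and the lowest MSQ is kept. The function names, the handling of the spectrum edges and the choice of maximum shift are assumptions for illustration; the original FORTRAN routine is not reproduced here.

```python
def msq(sample, reference):
    """Difference-squared metric between two equal-length segments."""
    return sum((s - r) ** 2 for s, r in zip(sample, reference))


def moving_pattern_search(spectrum, pattern, start, max_shift):
    """Return (lowest MSQ, shift) over shifts of -max_shift..+max_shift points.

    `start` is the index in `spectrum` where the unshifted pattern begins.
    Shifts that would run off either end of the spectrum are skipped.
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        hi = lo + len(pattern)
        if lo < 0 or hi > len(spectrum):
            continue
        score = msq(spectrum[lo:hi], pattern)
        if best is None or score < best[0]:
            best = (score, shift)
    return best


pattern = [0.0, 0.5, 1.0, 0.5, 0.0]
# Same band shape, displaced two points from where the pattern expects it.
spectrum = [0.0] * 6 + [0.5, 1.0, 0.5] + [0.0] * 6
stationary = msq(spectrum[3:8], pattern)
moving = moving_pattern_search(spectrum, pattern, start=3, max_shift=5)
print(stationary, moving)
```

In this toy case the stationary comparison penalizes the displaced band heavily, while the moving search recovers a perfect match at a shift of two points. The cost is one MSQ evaluation per offset, so the search time grows with the number of offsets tried.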

The speed of our search algorithms is about the same as that of chemigrams; both take about 100 s to search 1000 spectra. When the moving pattern search is employed, it takes about three times longer than the stationary pattern search or the chemigram-type search.

Results and Discussion

One of the most powerful functions of infrared spectroscopy is establishing conclusively the identity of two samples that have identical spectra when determined in the same medium. The region 1500-600 cm-1 contains many absorptions caused by bending vibrations as well as absorptions caused by C-C, C-O, C-N and C-Cl stretching vibrations. As there are many more bending vibrations in a molecule than stretching vibrations, this region of the spectrum is particularly rich in absorption bands and shoulders. For this reason, it is frequently called the fingerprint region. Small differences in the structure and constitution of a molecule result in significant changes in the distribution of absorption peaks in this region of the spectrum. As a consequence, a close match between two spectra in this fingerprint region usually gives strong evidence for the identity of the compounds yielding the spectra. Because many bending vibrations give rise to absorptions in this region, the bands are composites of these various interactions and depend upon the overall skeletal structure of the molecule. Exact interpretation of spectra in this region is seldom possible because of their complexity; on the other hand, it is this complexity that leads to uniqueness and the consequent usefulness of the region for final identification purposes.

A number of important group frequencies are to be found in the fingerprint region. These include the aromatic ring stretching vibrations at 1620 to 1470 cm-1, the C-O-C stretching vibration in ethers and esters at about 1200 cm-1, the C-Cl stretching vibration at 800 to 600 cm-1, the C-F stretching vibration at 1400 to 1000 cm-1, C-O vibrations and C-C vibrations at 1250 to 1050 cm-1, and the C-H bending vibration in the range of 1000 to 670 cm-1.9

When the spectrum of an unknown material is obtained, there are some questions that the analyst can ask immediately: Does it contain a carbonyl group? Is it an alcohol? Is it an acid? Is it aromatic? If so, what is the substitution type? Answers to questions such as these give many clues for chemical work that can lead to the conclusive identification of the compound. Modern GC/IR instruments produce an enormous number of unknown spectra in a short time. It is notable that these questions can be answered fairly quickly by examining selected fractions of the spectra from those large amounts of data. This is possible by monitoring certain spectral windows as the chromatographic fractionation progresses. To test selective data reduction by chemigrams and by our approach in the fingerprint region, we selected aromatic compounds as the first compound class. This group is of great importance in many practical applications, and one of the best group frequencies for recognizing the presence of aromatic ring structures occurs in the 1620-1470 cm-1 region.

The average of 100 spectra of aromatic ring containing compounds from the database is shown in Figure 1. The pattern consists of two peaks, which occur near 1586 and 1495 cm-1. The moving pattern search method was used with a 20 cm-1 window and a 2 cm-1 increment. The search results are shown in Figure 2. More than 76% of the aromatic compound spectra had a lowest MSQ value of 16.0 or less; fewer than 15% of the non-aromatic compounds had a lowest MSQ of 16.0 or less.

These results were compared with the stationary pattern search (Figure 3) and the chemigrams (Figure 4). At the threshold level that finds 15% of the non-aromatic compounds, chemigrams found 74% of the aromatic compounds and the stationary pattern search found 75%, while the moving pattern search found 77%. In the fingerprint region, the pattern search results were much less satisfactory than in the O-H stretching vibration region or the C=O stretching vibration region, although still better than the chemigram result. It should also be noted that the moving pattern search did not make a big difference in this region. These undesirable results came from the fact that, in the fingerprint region, the spectra showed much variation in peak positions, peak intensities and peak widths and, as a result, did not show a constant pattern of absorbances. Examining the non-aromatic compounds that were found and the aromatic compounds that were not found is helpful in understanding these results.
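The comparison at a fixed false-positive level can be sketched in Python as follows. The scores are invented toy values, not data from this study, and the function names are hypothetical. The sketch applies to MSQ-type scores, where a lower score means a better match; for a chemigram-type score the comparison would be reversed, since a higher integrated absorbance indicates the functional group.

```python
def percent_at_or_below(scores, threshold):
    """Percentage of scores that fall at or below the threshold."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if s <= threshold)
    return 100.0 * hits / len(scores)


def threshold_for_false_positives(non_class_scores, target_percent):
    """Largest observed score that admits at most target_percent of non-class spectra."""
    best = None
    for candidate in sorted(set(non_class_scores)):
        if percent_at_or_below(non_class_scores, candidate) <= target_percent:
            best = candidate
        else:
            break
    return best


# Toy lowest-MSQ values (illustrative only).
aromatic_scores = [2.0, 5.0, 9.0, 14.0, 30.0]
non_aromatic_scores = [8.0, 20.0, 25.0, 40.0, 55.0, 60.0, 70.0, 80.0, 90.0, 95.0]

threshold = threshold_for_false_positives(non_aromatic_scores, target_percent=20.0)
recovered = percent_at_or_below(aromatic_scores, threshold)
print(threshold, recovered)
```

Fixing the percentage of unrelated compounds that pass and then comparing the percentage of class members recovered puts the different search methods on a common footing.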

The spectra of the non-aromatic compounds that were assigned an MSQ below 16.0 were examined to see why they had been found. In most cases, these were nitrogen-containing heterocyclic ring compounds such as 4-lutidine, 2-methoxypyridine, 2,3-dihydroindole, and 6-methoxyquinoline. Our average pattern was generated only from carbocyclic aromatic compounds, i.e., heterocyclic aromatic compounds were not included. However, these results seem reasonable because it is known10 that C=C bands in nitrogen heterocyclic compounds absorb infrared light at the same frequencies as in the carbocyclic aromatic compounds.

Figure 1. The average spectrum of 100 carbon aromatic compounds (1620-1470 cm-1). Boxed portion shows region used for comparison.

Figure 2. The moving pattern search results of aromatic compounds (1620-1470 cm-1). Percentage A: percentage of non-aromatics ( ) with an MSQ value less than the threshold shown; Percentage B: percentage of aromatics (---) with an MSQ value less than the threshold shown.

Figure 3. The stationary pattern search results of aromatic compounds (1620-1470 cm-1). Percentage A: percentage of non-aromatics (___) with an MSQ value less than the threshold shown; Percentage B: percentage of aromatics (---) with an MSQ value less than the threshold shown.

Figure 4. The chemigram search results of aromatic compounds (1620-1470 cm-1). Percentage A: percentage of non-aromatics ( ) with a ΣSi value less than the threshold shown; Percentage B: percentage of aromatics (——) with a ΣSi value greater than the threshold shown.

Figure 5. The spectrum of 2,2-dichloroacetophenone. Boxed portion shows region used for comparison.

Similarly, the spectra of the aromatic compounds that were not assigned an MSQ below 16.0 were examined. In many cases, the problem was a very low intensity, i.e., almost no intensity. These include isobutyl benzoate and the ethyl ester of 4-phenyl-2-butanone-p-toluic acid. Other cases, such as 2,2-dichloroacetophenone (Figure 5) and m-nitrotoluene, showed a different pattern, i.e., the interval between the two peaks was large. It should be noted that none of these negative interferences would be reduced by employing the moving pattern search.

Figure 6. The spectrum of α,α,α,4-tetrachlorotoluene. Boxed portion shows region used for comparison.

Another serious problem in our algorithms was found in this experiment. In the case of α,α,α,4-tetrachlorotoluene (Figure 6), surprisingly, even though the two absorption peaks appeared at exactly the same positions as in the average pattern of aromatic compounds, a very high MSQ value was assigned. The difference between the two patterns was the peak width: α,α,α,4-tetrachlorotoluene showed very narrow peaks while the average pattern showed somewhat broad peaks. This illustrates very well that our algorithms have no tolerance for variation in peak width.
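The sensitivity to peak width can be reproduced with synthetic data, as in the following Python sketch. The Gaussian band shape, the widths and the peak positions are arbitrary assumptions chosen for illustration and are not fitted to the spectra discussed here.

```python
import math


def gaussian_doublet(n_points, centers, width):
    """Two unit-height Gaussian bands sampled at integer point indices."""
    return [
        sum(math.exp(-((i - c) ** 2) / (2.0 * width ** 2)) for c in centers)
        for i in range(n_points)
    ]


def msq(sample, reference):
    """Difference-squared metric between two equal-length segments."""
    return sum((s - r) ** 2 for s, r in zip(sample, reference))


centers = (20, 60)
average_like = gaussian_doublet(80, centers, width=6.0)  # broad bands
narrow = gaussian_doublet(80, centers, width=1.5)        # same positions, narrow
same_width = gaussian_doublet(80, centers, width=6.0)    # same positions and width

print("narrow peaks, same positions:", round(msq(narrow, average_like), 2))
print("same width, same positions:", round(msq(same_width, average_like), 2))
```

Because the metric sums point-by-point differences, the flanks of the broad bands contribute to the score even when the peak maxima coincide, and shifting the pattern cannot remove that contribution.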

References

1. Andregg, R. J. Amer Lab. 1985, 17(9), 29.

2. Hites, R. A.; Biemann, K. Anal. Chem. 1970, 42, 855.

3. Andregg, R. J. J. Chromatog. 1983, 275, 154.

4. LaBrosse, J. L.; Andregg, R. J. J. Chromatog. 1984, 315, 83.

5. Mattson, D. R.; Julian, R. L. J. Chromatog. Sci. 1979, 17, 416.

6. Griffiths, P. R.; Azarraga, L. V.; de Haseth, J. A.; Hannah, R. W. Appl. Spectros. 1979, 33, 543.

7. Pyo, D. Vibrational Spectroscopy 1993, 5, 263.

8. Pyo, D.; Lee, J. Vibrational Spectroscopy 1994, 8, 61.

9. Silverstein, R. M.; Morrill, T. C. Spectrometric Identification of Organic Compounds, 4th ed.; Wiley: New York, 1981; pp 166-172.

10. Nakanishi, K.; Solomon, P. H. Infrared Absorption Spectroscopy, 2nd ed.; Holden-Day: San Francisco, 1977.