Noise reduction of fast, repetitive GC/MS measurements using principal component analysis (PCA)


Analytica Chimica Acta 401 (1999)

M. Statheropoulos a,*, A. Pappa a, P. Karamertzanis a, H.L.C. Meuzelaar b

a Department of Chemical Engineering, Sector 1, National Technical University of Athens (NTUA), 9 Iroon Polytechniou Street, Athens, Greece
b Center for Micro Analysis and Reaction Chemistry, University of Utah, Salt Lake City, USA

* Corresponding author. E-mail address: stathero@orfeas.chemeng.ntua.gr (M. Statheropoulos).

Received 9 March 1999; received in revised form 3 June 1999; accepted 9 June 1999

Abstract

Principal component analysis (PCA) was applied to the noise reduction of low ppb level benzene, toluene, ethyl benzene and xylene (BTEX) measurements obtained with a fast, repetitive gas chromatography/mass spectrometry (GC/MS) system. The first three principal components (PCs), accounting for approximately 60–80% of the total variance in the original data, could be attributed to chemical components, whilst the remaining PCs were found to be due to noise. Reconstruction of the data from the first three PCs resulted in noise reduction with improved signal fidelity. The results of PCA were comparable with those achieved by a Fourier transform method.

© 1999 Elsevier Science B.V. All rights reserved.

Keywords: Noise reduction; Principal component analysis (PCA); Roving gas chromatography/mass spectrometry (GC/MS)

1. Introduction

A roving gas chromatography/mass spectrometry (GC/MS) system, using a zero-emission electric vehicle and equipped with a differential GPS, has been used for monitoring and mapping low ppb concentration level benzene, toluene, ethyl benzene and xylene (BTEX) type volatile organic compounds (VOCs) in the direct neighborhood of a gas station [1–3]. Concentrations of VOCs in ambient air are usually very low and, in some cases, can be masked by various sources of noise. Thus, the evaluation of low intensity signals can be greatly facilitated by subtraction of the noise.

Noise reduction can be achieved by various smoothing or filtering techniques. A number of noise reduction techniques are known, such as Gaussian filtering, the Savitzky–Golay filter, polynomial filters and Fourier transform methods [4–6]. Lee et al. [7] evaluated principal component analysis (PCA) as a digital filter to improve the overall quality of GC/MS data on a test mixture of low molecular weight solvents. A marked increase in the signal-to-noise ratio (by a factor of 2 to 100) was achieved.

The primary target of the present work was to study the effectiveness of PCA in the noise reduction of low level outdoor BTEX measurements by fast, repetitive GC/MS. PCA examines the degree of correlation between variables, while noise is presumed to control random, i.e. non-correlated, signal fluctuations. PCA is used to determine the underlying intrinsic dimensionality of the data. On removing the least significant PCs, which are attributed to the noise, and reconstructing the original data set, one expects reduced noise. The results of PCA on noise reduction are compared to those of a fast Fourier transform (FFT) method.

It should be mentioned that there are ambiguities in the literature about the definition of noise and how it is measured. These ambiguities can be critical when the S/N ratio has to be determined [8]. Our noise estimates were based on the peak-to-peak noise, $N_{\text{peak-to-peak}}$.

2. Theoretical

Multivariate data analysis (MDA) is an established set of techniques for examining the relationships among multiple variables [9]. The term multiple refers to many variables and/or linear combinations of variables. PCA is an MDA technique that is used whenever it is necessary to form new variables which are linear combinations of the original variables. The new variables are referred to as principal components (PCs) and are orthogonal (i.e. not correlated) to each other. The PCs are extracted so that the first PC accounts for the maximum variance of the data, the second PC accounts for the maximum of the residual variance, and so on. Generally, only a handful of PCs is needed to account for most of the variance of the original data set, and for this reason PCA is generally known as a data reduction technique.

In PCA, the square dispersion matrix V is given as follows:

$$V = \frac{1}{c-1} D^{T} D$$

where D is the original or pretreated data set of size c × n (objects × variables). Well-behaved dispersion matrices can often be produced by pretreatment of the original data. Frequently used pretreatments include mean centering and standardization. For the square matrix V, the eigenvalues and eigenvectors are calculated.
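The dispersion matrix and its eigen-decomposition can be sketched in a few lines. The following is only an illustration under stated assumptions: it uses Python with NumPy rather than the software used in this work, the data matrix is random synthetic data of the same 24 × 74 shape, and the variable names are chosen here for readability.

```python
import numpy as np

# Synthetic stand-in for a data matrix D with c objects (rows) and
# n variables (columns). These are random numbers, not measured data.
rng = np.random.default_rng(0)
c, n = 24, 74
D = rng.normal(size=(c, n))

# Pretreatment: mean centering (subtract the mean of each column).
D_centered = D - D.mean(axis=0)

# Square dispersion matrix V = D^T D / (c - 1), of size n x n.
V = D_centered.T @ D_centered / (c - 1)

# V is symmetric, so eigh is appropriate; it returns eigenvalues in
# ascending order, so they are re-sorted from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column k is the direction of PC k+1

print("dispersion matrix shape:", V.shape)
print("largest eigenvalue:", eigenvalues[0])
```

With mean centering as the pretreatment, V computed this way coincides with the sample covariance matrix of the columns of D.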
The results of PCA are given as the matrices S (scores matrix) and L (loadings matrix). S is a c × f matrix (objects × PCs), where f is the number of the first significant PCs, and L is an f × n matrix (PCs × variables). By post-multiplying S with L, the mean centered data matrix R is calculated:

$$R = SL$$

R is a c × n matrix. To obtain the complete reconstructed matrix of the original data, $D_{\mathrm{rec}}$, the pretreatment of the data has to be taken into account. Noise reduction is based on reconstruction of the data using a limited number of PCs, which account for most of the variance of the original data and can be attributed to components other than white noise, i.e. to chemical components. On the other hand, by using only those PCs that are attributed to noise, the noise matrix N is constructed.

3. Experimental

3.1. Roving GC/MS system

The fast mobile (roving) GC/MS system used [2] has an Enviroprobe (Femtoscan Corporation type) inlet system and a Hewlett-Packard MSD type mass analyzer. Single ion monitoring (SIM) was used for recording the mass peaks at m/z 78 (benzene), 91 (toluene) and 106 (xylene or ethyl benzene). The pulsed air sampling duration of the Enviroprobe inlet was 400 ms, and the sampling frequency was one sample per 15 s.

3.2. PCA

Twenty-four repetitive measurements (sampling points), each consisting of 74 scans and taken at intervals of 15 s (corresponding to 1776 scans over a period of 6 min), for the mass peaks at m/z 78, 91 and 106 were used as raw data. The size of the raw data matrix D was 24 × 74: 24 objects (SIM profiles, one per sampling point) and 74 variables (measurement points within each SIM profile). Three such matrices (D) were produced, one for each mass peak recorded.

Two-dimensional contour plots (pseudo-3D plots) were constructed for the mass peaks at m/z 78, 91 and 106 in order to better visualize the measurement data. The X-axis represents the number of scans during the chromatographic analysis time (15 s) of each measurement, the Y-axis the SIM profile number, and the third dimension is a gray scale related to the intensity of the respective signals (abundance in arbitrary units).

Fig. 1. Contour plots of raw data for mass peaks at m/z 78 (a), 91 (b) and 106 (c). Stripe A can be attributed to the air peak, B to benzene, C to toluene and D to isomers of xylene and/or ethyl benzene.

Fig. 2. Reconstructed contour plots and the subtracted noise plot obtained through PCA for mass peaks at m/z 78 (a), 91 (b) and 106 (c). Stripe A can be attributed to the air peak, B to benzene, C to toluene and D to isomers of xylene and/or ethyl benzene.

Fig. 3. Ninth and 14th SIM profiles after reconstruction by PCA for mass peak signals at m/z 78 (a), 91 (b) and 106 (c). The less intense line corresponds to the raw data.

Mean centering was used as the pretreatment for PCA, which was run with PONTOS, an in-house developed multivariate data analysis software package for spectroscopic data [10].
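The reconstruction procedure described above (mean centering, scores S, loadings L, R = SL, then restoring the column means) can be illustrated with a short sketch. This is not the PONTOS implementation: it is a minimal Python/NumPy example that obtains the PCs from a singular value decomposition of the centered data, and it runs on synthetic data (three Gaussian peaks plus white noise) invented here purely for illustration. The function name `pca_denoise` and the helper `rms` are hypothetical.

```python
import numpy as np


def pca_denoise(D, f):
    """Split D into a reconstruction from the first f PCs and a noise matrix."""
    column_means = D.mean(axis=0)        # pretreatment: mean centering
    centered = D - column_means
    # Thin SVD of the centered data; the rows of Vt are the PC directions.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    S = U[:, :f] * s[:f]                 # scores matrix, c x f
    L = Vt[:f, :]                        # loadings matrix, f x n
    R = S @ L                            # mean centered reconstruction, c x n
    D_rec = R + column_means             # undo the mean centering
    N = (U[:, f:] * s[f:]) @ Vt[f:, :]   # noise matrix from the remaining PCs
    return D_rec, N, S, L


# Synthetic stand-in for one 24 x 74 SIM data matrix (not measured data):
# three Gaussian peaks whose heights vary between profiles, plus white noise.
rng = np.random.default_rng(1)
scans = np.arange(74)
peaks = np.stack([np.exp(-0.5 * ((scans - p) / 2.0) ** 2) for p in (9, 15, 50)])
heights = rng.uniform(5.0, 10.0, size=(24, 3))
clean = heights @ peaks
D = clean + rng.normal(scale=0.5, size=clean.shape)

D_rec, N, S, L = pca_denoise(D, f=3)


def rms(x):
    """Root-mean-square value of an array."""
    return float(np.sqrt(np.mean(x ** 2)))


print("RMS error before:", rms(D - clean))
print("RMS error after: ", rms(D_rec - clean))
```

Because the PCs are orthogonal, the noise matrix built from the discarded PCs equals D minus the reconstruction, so `np.allclose(D - D_rec, N)` holds in this sketch.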

3.3. FFT analysis

In FFT analysis, the signal is converted from the time domain to the frequency domain. For this purpose, the entire data set of roving GC/MS measurements is arranged as a single vector for each mass peak. Under Fourier transformation [11], this produces 1776 different magnitude coefficients (the "frequency spectrum"). By rejecting the frequencies related to noise, increased signal-to-noise ratios are obtained, resulting in a smoother spectrum. MATLAB version 4.2 (MathWorks) was used for the Fourier transform.

Fig. 4. Frequency spectrum obtained by FFT for the mass peak signals at m/z 78. The first value is not included.

4. Results and discussion

In Fig. 1, the contour plots of the mass peaks at m/z 78 (a), 91 (b) and 106 (c) are presented. In Fig. 1a, two dark colour stripes (A and B) appear at 1.5 s (scan 9) and 3 s (scan 15), respectively, whereas in Fig. 1b, two dark colour stripes (A and C) are clearly present at 1.5 s (scan 9) and 5.5 s (scan 27), respectively. In addition, a less intense colour stripe (D) is present at 10.5 s (scan 50). It seems that the less intense stripe at 10.5 s in Fig. 1c can be attributed to D. By close examination of the mass spectra of the individual compounds in Fig. 1a–c, stripes B, C and D can be attributed to benzene, toluene and the various isomers of xylene and/or ethyl benzene, respectively. Finally, the dark colour stripe (A) at 1.5 s, present in all of Fig. 1a–c, can be attributed to the air peak. It should be mentioned that the level of noise, expressed as light colours, is considered significant. This becomes especially important in the case of stripe D (xylene isomers or ethyl benzene, low S/N ratio).

4.1. PCA results

PCA resulted in 24 PCs. Using the scree plot criterion [12], three PCs were selected for describing the dimensionality of the data.
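The quantity inspected in a scree plot, the variance carried by each PC, can be computed as in the sketch below. It is a hedged illustration only: it uses Python with NumPy, and the data are synthetic (three peaks plus white noise, invented here), so the printed percentages are not those of the measurements.

```python
import numpy as np

# Synthetic 24 x 74 matrix (not measured data): three Gaussian peaks with
# profile-dependent heights, plus white noise.
rng = np.random.default_rng(2)
scans = np.arange(74)
peaks = np.stack([np.exp(-0.5 * ((scans - p) / 2.0) ** 2) for p in (9, 15, 50)])
D = rng.uniform(5.0, 10.0, size=(24, 3)) @ peaks
D = D + rng.normal(scale=0.2, size=D.shape)

# Mean centering, then the singular values of the centered data.
centered = D - D.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Eigenvalues of the dispersion matrix: squared singular values / (c - 1).
eigenvalues = singular_values ** 2 / (D.shape[0] - 1)
fraction = eigenvalues / eigenvalues.sum()   # variance fraction per PC
cumulative = np.cumsum(fraction)             # running total over the PCs

for k in range(6):
    print(f"PC {k + 1}: {fraction[k]:6.1%} of variance, "
          f"{cumulative[k]:6.1%} cumulative")
```

In a scree plot these eigenvalues are drawn against the PC number, and the number of PCs is read off where the curve levels out; in this synthetic example the drop occurs after the third PC by construction.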
The first three PCs accounted for 59% (m/z 78), 70% (m/z 91) and 76% (m/z 106) of the total variance, respectively. In Fig. 2, the contour plots of the data reconstructed using the first three PCs, as well as the extracted noise, are presented for the mass peaks at m/z 78 (a), 91 (b) and 106 (c). It should be noted that the same pattern for noise is produced either (a) when the data are reconstructed using the least significant PCs (PC 4 to PC 24) or (b) when the reconstructed matrix $D_{\mathrm{rec}}$ (reconstructed through the first three PCs) is subtracted from the original data matrix D. This provides a useful check of the mathematical procedures used.

It appears that, by this method, a significant amount of noise is subtracted, whilst the fidelity of the signal is retained. This is shown in Fig. 3, which presents the ninth (object 9) and 14th (object 14) SIM profiles reconstructed by PCA (mass peak at m/z 78 in Fig. 3a, m/z 91 in Fig. 3b and m/z 106 in Fig. 3c). An average increase in the S/N ratio of 130% was observed for all the data. In addition, stripe D is more clearly shown in the reconstructed plots.

Finally, a remarkable data volume reduction is obtained by PCA. For the reconstruction of the initial data after PCA, the scores matrix S (24 × 3) and the loadings matrix L (3 × 74) are necessary. That corresponds to (24 × 3) + (74 × 3) = 98 × 3 values, whereas the initial data volume is 24 × 74 = 1776 values. Because the data are mean centered, the reconstructed matrix R = SL must be corrected by adding the mean of each column of the original data D to the R matrix, resulting in the final reconstructed matrix $D_{\mathrm{rec}}$. Thus, 74 more values (the number of columns of matrix D) are needed. Therefore, the percentage reduction in data volume achieved by PCA is

$$\left[\frac{(24 \times 74) - (98 \times 3 + 74)}{24 \times 74}\right] \times 100 \approx 80\%$$

4.2. FFT results

Fig. 4 presents the magnitudes of the frequencies of the data for the mass peak at m/z 78 (frequency spectrum), resulting from FFT. A periodicity of 24 is obvious, as well as the expected symmetry around the central frequency of 889. Thus, peaks appear at frequencies that are multiples of 24, i.e. 24, 48, 72. This is due to the signal formation, which has a period of 1776/24 = 74 points, in agreement with the roving GC/MS measurements. Assuming that the noise is completely random (white noise with a flat spectrum), retention of only the frequencies 24K, 24K ± 1 and 24K ± 2 (Fig. 5) can bring about signal reconstruction.

For comparison, the results of noise reduction using both techniques on the ninth SIM profile of the mass peak at m/z 78 are presented in Fig. 6. It seems that both techniques are capable of subtracting comparable amounts of noise while retaining and enhancing the signal profile. Nonetheless, PCA seems to give slightly better results with regard to overall spectrum quality after noise reduction. It should be emphasized that PCA subtracts random peaks that are not correlated (white noise), while the FFT method used cannot subtract noise that has the same frequencies as the signal.

Fig. 5. Reconstructed contour plot and subtracted noise through FFT for the mass peak signals at m/z 78.

Fig. 6. Ninth SIM profile of mass peak signals at m/z 78 for raw data and after noise reduction by PCA and FFT.

5. Conclusions

PCA has proved to be an efficient tool for the noise reduction of roving GC/MS measurements. PCA extracts the white noise and increases the S/N ratio (an average increase of 130%). In addition, PCA achieves a high degree of data compression (80%), which is important when data transfer by telemetry is needed. PCA results were comparable to those of the FFT method used in this work.

References

[1] A. Pappa, M. Statheropoulos, D. Theodossiou, H.L.C. Meuzelaar, Mathematical filtering of noise on roving GC/MS measurements, in: Poster Presentation at Field Screening Europe, September 30–October, Karlsruhe, Germany.
[2] W. McClennen, C. Vaughn, P. Cole, S.A. Sheya, D. Wager, H.L.C. Meuzelaar, N.S. Arnold, Roving GC/MS: mapping gradients in time and space, in: Proceedings of the Specialist Workshop on Field-Portable Chromatography and Spectrometry, June 3–5, 1996, Snowbird, Utah.
[3] S. Arnold, W.H. McClennen, H.L.C. Meuzelaar, Anal. Chem. 63 (1991) 299.
[4] X.Y. Sun, H. Singh, B. Millier, C.H. Warren, W.A. Aue, J. Chromatogr. A 687 (1994).
[5] B. Barak, Anal. Chem. 67 (1995).
[6] R.E. Synovec, E.S. Yeung, Anal. Chem. 58 (1986).
[7] T.A. Lee, L.M. Headley, J.K. Hardy, Anal. Chem. 63 (1991) 357.
[8] P. Foley, J.G. Dorsey, Chromatographia 18 (1984) 503.
[9] J.F. Hair, R.E. Anderson, R.L. Tatham, Multivariate Data Analysis, 2nd ed., Macmillan, New York.
[10] M. Statheropoulos, H.L.C. Meuzelaar, N. Vassiliadis, Multivariate Data Analysis Techniques for Spectroscopic Data: the PONTOS Case (software ver. 1.2 and manual), Center for Microanalysis and Reaction Chemistry, The University of Utah.
[11] J.G. Proakis, D.G. Manolakis, Introduction to Digital Signal Processing, Macmillan, New York.
[12] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.