
Analytica Chimica Acta 484 (2003) 75–91

SIMPLISMA applied to two-dimensional wavelet compressed ion mobility spectrometry data


Guoxiang Chen¹, Peter de B. Harrington
Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Ohio University, Athens, OH 45701-2979, USA

Received 2 May 2002; received in revised form 31 January 2003; accepted 7 March 2003

Abstract

A modified SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) algorithm, referred to as real-time (RT) SIMPLISMA, has been combined with two-dimensional (2D) wavelet compression (WC2). This tool was evaluated with datasets of drugs and bacteria that were acquired from two different ion mobility spectrometers and with published reference data that comprised Raman, FTIR microscopy, near-infrared (NIR) and mass spectral data. RTSIMPLISMA is amenable to real-time modeling and is able to determine the number of components automatically. The 2D wavelet compression, which compresses both the acquisition and drift time dimensions of the measurement, was applied to the datasets prior to RTSIMPLISMA modeling. RTSIMPLISMA models obtained from the compressed data were wavelet transformed back to the uncompressed representation. The effects of wavelet filter types and compression levels were investigated. The relative root-mean-square errors (RRMSE) of reconstruction, which measure the relative difference between the models extracted with and without 2D compression, were used to evaluate the effects of compression on self-modeling. The results showed that satisfactory models could be obtained when a dataset was compressed to 1/256 of its original size. © 2003 Elsevier Science B.V. All rights reserved.
Keywords: Multivariate curve resolution; SIMPLISMA; Multidimensional wavelet compression; Ion mobility spectrometry; Drugs of abuse; Bacteria

1. Introduction

Advancing computer technology makes it possible to acquire large amounts of data from instruments at increasing rates. Typically, large quantities of data accumulated during the acquisition process have to be stored in a computer prior to data analysis, which leads to a high storage burden and low processing efficiency.
Corresponding author. Tel.: +1-740-517-8458; fax: +1-740-593-0148. E-mail address: peter.harrington@ohio.edu (P.d.B. Harrington).
¹ Present address: Metara, Inc., 1225 E Arques Avenue, Sunnyvale, CA 94085-4701, USA.

If the results of the data analysis are going to be used to correct or refine the measurement, then the processing must occur during the course of the measurement. This attribute is important for the construction of an intelligent instrument. Real-time (RT) methods have been developed to model data as they are acquired from an ion mobility spectrometer [1]. Real-time modeling can alleviate storage burdens and provide a global perspective of the measurement process. However, a key issue for real-time processing is that the algorithms must be computationally efficient so that the processing does not lag behind the data acquisition. The demand for processing power of many algorithms may increase linearly or geometrically with respect to spectrum

0003-2670/03/$ – see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0003-2670(03)00306-4


G. Chen, P. de B. Harrington / Analytica Chimica Acta 484 (2003) 75–91

and resolution element numbers. If the algorithm consumes too large a share of computer resources, the data acquisition may be deprived of resources so that data stored on the acquisition board are overwritten before they are read by the computer. The problem is especially severe for processes with high acquisition rates.

Data compression is advantageous because it reduces the size and computational burden of the data without losing important chemical information. Instead, noise is lost during compression, and processing of compressed data is much more efficient. There are several areas of analytical chemistry for which compression has become important. High resolution and multidimensional measurements can generate very large datasets that are cumbersome to manipulate even on state-of-the-art workstations. Miniaturized sensors may be much smaller than handheld computers or portable computer equipment and require embedded processors with limited memory, power, and processing capabilities. Wireless communication of sensors to data stations may be bandwidth limited. Real-time chemometric processing and modeling requires fast computation and can become infeasible when a dataset grows too large. Therefore, data compression applications have been revitalized in analytical chemistry [2–5].

The wavelet transform (WT) is a popular tool for compressing and denoising data and has two benefits: fast implementation and multiresolution capability [6–11]. The WT technique has been applied to compressing absorbance [12–15] and ion mobility spectra [5], and to denoising absorbance spectra [12,16], chromatograms [17], and electrochemical signals [18,19]. The wavelet transform has been used with other chemometric approaches such as multicalibration [20–22], pattern recognition [23,24], multivariate curve resolution [25–27], and artificial neural networks [28,29]. The WT techniques can be classified into two categories, the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). With respect to data compression, the DWT retains sufficient spectral information and can be implemented much faster than the CWT [30]. The DWT was used in this work. Compared to one-dimensional (1D) compression, multidimensional compression affords much greater compression while maintaining the signal quality.

Two-dimensional (2D) compression of sensor data was first demonstrated with Fourier compression [2]. A 2D discrete sine transform was used to compress large sets of ion mobility spectrometry (IMS) data so that they would be amenable to PCA [31]. Two-dimensional wavelet compression (WC2) was the logical progression and was applied to ion mobility spectra [5] and to near-infrared (NIR) spectroscopy for monitoring wood chips [4]. Multidimensional compression is especially desirable for real-time processing of ion mobility spectra, where the instrument acquires data very rapidly, up to thousands of spectra per minute.

Ion mobility spectrometry has been widely used in the detection of chemical warfare agents [32], environmental pollutants [33], explosives [34], drugs of abuse [35], and other volatile samples [36]. IMS offers the advantages of portability, low cost, and high sensitivity. However, these advantages can be offset by the relatively low peak capacity of IMS compared to gas chromatography and mass spectrometry. Consequently, overlapping peaks occur frequently in the analysis of complex samples. SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) has been demonstrated as a useful tool for IMS in that overlapping peaks can be mathematically separated. As a result, the selectivity and sensitivity of the method can be improved [37–40].

SIMPLISMA is an efficient multivariate curve resolution method developed by Windig and Guilment [41]. The method has been used to decompose complex data into simple models in a variety of applications [42–45]. SIMPLISMA has been implemented as a real-time algorithm [1]. Although the self-modeling algorithm was modified for real-time implementation, time constraints remained an obstacle for large datasets because the algorithm execution time increased significantly as the number of acquired spectra increased. SIMPLISMA has been applied to wavelet compressed IMS data [27]. The compression was one-dimensional and the IMS data were compressed in the drift time direction to 1/16 of their original size.

The drug and bacteria ion mobility spectra used in this paper came from chemically diverse samples. In addition, the two ion mobility spectrometers differed in operating principle. The Ionscan used a pinhole inlet and an ion shutter, while the Itemiser used a membrane inlet and a field-free region to


trap ions prior to injection. The drug sample was a mixture of cocaine and heroin, which has the street name of speed-ball. Mixed drug abuse has been on the rise [46,47]. The bacteria data were provided by Buxton and co-workers [48]. Bacterial cells are not volatile. Identification of bacteria by IMS was achieved by in situ derivatization, in which bacterial lipids were thermally hydrolyzed with the desorber heater and the tetramethylammonium hydroxide (TMAH) derivatizing reagent. TMAH was added to the IMS sample disk to hydrolyze and methylate lipids in a procedure similar to the one used by mass spectrometrists [49]. The bacterium studied was the food-borne pathogen Bacillus cereus, which can be fatal for individuals with compromised immune systems.

Wavelet compression was applied to ion mobility spectra in both the drift time and acquisition time dimensions to afford 2D wavelet compression [50]. The real-time SIMPLISMA algorithm is a revised version of the real-time and recursive SIMPLISMA algorithm that was reported previously [1]. The modification yielded a more reliable estimate of the number of components to include in the model. RTSIMPLISMA was used to resolve concentration profiles and spectra from the 2D compressed IMS data. The RTSIMPLISMA model was transformed to the uncompressed representation using the inverse wavelet transform. This paper reports the effects of wavelet filter and compression level. The relative root-mean-square error (RRMSE) for RTSIMPLISMA, which measures the relative difference between the models obtained with and without 2D compression, was used to evaluate the reconstruction fidelity. The results indicate that satisfactory models can be obtained when the data matrix is reduced to four parts-per-thousand of its original size, which allowed faster implementation of SIMPLISMA, improved the SIMPLISMA models by removing noise from the data, and lowered the storage requirement. These benefits are essential for real-time modeling of large datasets.

2. Theory

The pyramid algorithm [51] is commonly used to implement the DWT. A number of tutorials can be found elsewhere [7,10,11,30]. A brief description of the relevant wavelet procedure follows. The forward DWT is a multi-level transform that recursively bifurcates data into smooth and detail representations by passing the data through low- and high-pass filters. The process can be stopped at any level to afford a partial transform. When the optimal transform level is selected, the detail coefficients are discarded and only the smooth part is retained for the inverse DWT, by which the spectrum can be reconstructed. Data compression is achieved because the length of the smooth part is halved at each level. By applying this compression to each row of X, the spectra are compressed in the row direction. The columns of the compressed data matrix are then compressed to furnish the 2D compressed matrix $X_C$. In the 2D compression notation, $l_r \times l_c$, $w_r$–$w_c$ refers to $l_r$-level compression using wavelet type $w_r$ applied to the rows, i.e. the drift time dimension, and $l_c$-level compression using wavelet type $w_c$ applied to the columns, i.e. the acquisition time dimension. The compression efficiency is evaluated with the compression factor (CF), which is the ratio of the number of points retained in the compressed matrix ($N$) to the original data size ($N_0$):

$$\mathrm{CF} = \frac{N}{N_0} \qquad (1)$$

The theory of SIMPLISMA has been addressed elsewhere [41]. The objective of SIMPLISMA is to extract the pure component information from the data matrix X, which is modeled as the product of a matrix of concentration profiles (C) and a matrix of component spectra (S):

$$X = C S^{\mathrm{T}} \qquad (2)$$
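The 2D compression and the compression factor of Eq. (1) can be illustrated with a minimal sketch. It assumes the Haar filter for brevity, which is not one of the daublet, coiflet or symmlet filters studied in this work, and the function names `haar_smooth`, `compress_2d` and `compression_factor` are hypothetical. Random synthetic data that follow the bilinear model of Eq. (2) stand in for measured spectra.

```python
import numpy as np

def haar_smooth(x, levels):
    """Apply `levels` Haar low-pass steps along the last axis, keeping only
    the smooth coefficients (the detail coefficients are discarded)."""
    x = np.asarray(x, dtype=float)
    for _ in range(levels):
        if x.shape[-1] % 2:
            raise ValueError("length must be divisible by 2 at every level")
        # Orthonormal Haar low-pass filter: scaled sums of neighbouring points.
        x = (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)
    return x

def compress_2d(X, l_r, l_c):
    """Compress rows (drift time) by l_r levels, then columns
    (acquisition time) by l_c levels."""
    rows_done = haar_smooth(X, l_r)           # each row (spectrum) shortened
    return haar_smooth(rows_done.T, l_c).T    # each column shortened

def compression_factor(X, XC):
    """Eq. (1): points retained divided by original points."""
    return XC.size / X.size

# Synthetic bilinear data X = C S^T (Eq. 2): 512 spectra x 1024 drift times.
rng = np.random.default_rng(0)
C = rng.random((512, 3))
S = rng.random((1024, 3))
X = C @ S.T

XC = compress_2d(X, 4, 4)
CF = compression_factor(X, XC)
print(XC.shape, CF)   # (32, 64) 0.00390625

# Because the transform is linear, the compressed matrix is still bilinear.
CC = haar_smooth(C.T, 4).T   # concentration profiles in the wavelet domain
SC = haar_smooth(S.T, 4).T   # spectra in the wavelet domain
print(np.allclose(XC, CC @ SC.T))   # True
```

A 4 × 4 compression halves each dimension four times, so the matrix shrinks to 1/256 of its size. The last check shows why self-modeling can, in principle, be carried out directly on the compressed matrix: a linear transform of the rows and columns leaves the bilinear structure intact.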

SIMPLISMA estimates C with pure variables, which correspond to components. A pure variable is the resolution element (i.e. drift time) in the dataset that has a unique variation profile and relatively large variance with respect to intensity. The concentration profiles (C) comprise the resolution elements that furnish the largest purity values. For SIMPLISMA [41], the purity ($p_{ij}$) of a candidate variable is defined as:

$$p_{ij} = e_j w_{ij} = \frac{\sigma_j}{\mu_j + \alpha}\, w_{ij} \qquad (3)$$

for which $i$ is the index of the component, and $\mu_j$ and $\sigma_j$ are the mean and the standard deviation of the jth candidate variable, respectively. The term $e_j$ may be


recognized as an expression for the relative standard error of the jth column of X. The influence of noise is removed with the term $\alpha$, termed the damping factor. Usually, $\alpha$ is 5% of the maximum peak intensity of the mean spectrum of the dataset. The $w_{ij}$ term is the weight that characterizes the linear independence of the jth candidate variable with respect to the previously extracted $i - 1$ components. The weight is calculated with the determinant of the correlation-about-the-origin matrix in SIMPLISMA [41]. Another method to measure the independence weight is the Gram-Schmidt method, which has been addressed elsewhere [52,53]. The offline SIMPLISMA builds models with a predefined number of components ($n_c$). However, a fixed number of components is incompatible with real-time analysis, in which the components change dynamically in the acquired data. A real-time algorithm has been reported previously that did not need to predefine $n_c$ [1]. The algorithm used the Gram-Schmidt method to calculate the independence weights because it simplifies the calculations and suits real-time implementation [52]. The real-time algorithm is further modified to afford RTSIMPLISMA. First, the calculation of purity is modified to:
$$p_{ij} = \left[\sum_{k=1}^{n_s} (x_{kj} - \bar{x}_j)^2\right] w_{ij} \qquad (4)$$

The index $k$ is the spectrum number. The first part of the equation, which excludes the damping factor of Eq. (3), is proportional to the variance of variable $j$, whereas the standard deviation was used in the algorithm in [1]. Second, RTSIMPLISMA revised the threshold method for better determination of $n_c$. As described in [1], the relative purity is defined as the ratio of the purity of a selected variable ($p_i$, $i \geq 2$) to that of the first pure variable ($p_1$), denoted as $\rho_i$:

$$\rho_i = \frac{p_i}{p_1} \qquad (5)$$

The relative purity of the first pure variable is 1. For the remaining pure variables, the relative purities are less than 1 and decrease with selection order because the variables with the largest purities are always selected first. The new pure variable threshold ($\rho_0$) was used to determine whether a candidate variable should be included in the model. The threshold method is largely dependent on the selection of $\rho_0$. Inappropriate $\rho_0$ values can lead to insufficient curve resolution or the inclusion of spurious components in the model. To alleviate this problem, a new thresholding method was developed in this work, which is based on the following observations. First, the standard deviation of the concentration profile of a real component should be greater than three times the noise level of the dataset, which is estimated from the data points within 1.5–3.0 ms of drift time for IMS data. Second, the purity of the last real component should be less than one percent of that of the first pure variable, if the first rule was satisfied. Finally, the difference of the relative purities of two adjacent components in terms of $\log \rho$ should be greater than a threshold ($\rho_0$). After the pure variables are found, the component spectra are obtained by:

$$S = X^{\mathrm{T}} C (C^{\mathrm{T}} C)^{-1} \qquad (6)$$

Each column of the extracted spectra S is normalized to unit vector length by dividing the spectrum by the square root of its sum of squares. The normalized S is used to generate new concentration profiles by:

$$C = X S (S^{\mathrm{T}} S)^{-1} \qquad (7)$$

The normalization of the spectra removes model ambiguity and gives the concentration profiles units of intensity. The readout from the IMS instruments is a voltage, so the unit of intensity is the volt. Concentration profiles will be scale-invariant if normalization to vector length is used when SIMPLISMA is applied to the same dataset represented in the wavelet domain. Instead of using the raw data as input, WC2-RTSIMPLISMA directly processes the compressed data $X_C$. The matrix $X_C$ in the wavelet domain can be decomposed into concentration profiles and component spectra in the wavelet domain, denoted as $C_C$ and $S_C$, respectively:

$$X_C = C_C S_C^{\mathrm{T}} \qquad (8)$$
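The modeling steps described in this section (variance-based purity, relative purity, and the two least-squares steps that yield normalized spectra and concentration profiles) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Gram-Schmidt weight is defined elsewhere [52,53], so the weight used here (the fraction of each column left unexplained by the pure variables already selected) is an assumption, and the function names `select_pure_variables` and `resolve` are hypothetical. The number of components is passed in rather than determined by thresholding.

```python
import numpy as np

def select_pure_variables(X, n_max):
    """Pick up to n_max pure variables (columns of X).

    Returns the column indices and the relative purities p_i / p_1.
    """
    X = np.asarray(X, dtype=float)
    # Sum of squared deviations per column, proportional to the variance.
    ss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    norm2 = (X ** 2).sum(axis=0)
    norm2[norm2 == 0.0] = 1.0          # avoid division by zero for empty columns
    R = X.copy()                       # columns with selected variables projected out
    picked, purities = [], []
    for _ in range(n_max):
        # ASSUMED weight: fraction of each column not yet explained by the
        # already selected pure variables (1 for every column at the start).
        w = (R ** 2).sum(axis=0) / norm2
        p = ss * w                     # variance-based purity
        j = int(np.argmax(p))
        if p[j] <= 0.0:
            break                      # nothing independent is left
        picked.append(j)
        purities.append(p[j])
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)     # Gram-Schmidt step: remove direction q
    purities = np.array(purities)
    return picked, purities / purities[0]

def resolve(X, pure_idx):
    """Least-squares spectra, unit-length normalization, then new profiles."""
    C = X[:, pure_idx]                           # pure-variable columns as profiles
    S = X.T @ C @ np.linalg.inv(C.T @ C)         # S = X^T C (C^T C)^-1
    S = S / np.linalg.norm(S, axis=0)            # each spectrum to unit length
    C = X @ S @ np.linalg.inv(S.T @ S)           # C = X S (S^T S)^-1
    return C, S

# Noise-free synthetic mixture: three Gaussian peaks with different profiles.
t = np.linspace(0.0, 20.0, 200)                  # drift time axis (ms)
k = np.arange(100.0)                             # spectrum number
S_true = np.stack([np.exp(-0.5 * ((t - c) / 0.8) ** 2)
                   for c in (6.0, 10.0, 14.0)], axis=1)
C_true = np.stack([np.exp(-k / 30.0),
                   np.exp(-0.5 * ((k - 40.0) / 8.0) ** 2),
                   np.exp(-0.5 * ((k - 60.0) / 8.0) ** 2)], axis=1)
X = C_true @ S_true.T

idx, rel = select_pure_variables(X, 3)
C, S = resolve(X, idx)
print(idx, np.round(rel, 3))
print(np.allclose(C @ S.T, X))   # True: three components reproduce the data
```

With this weight, the relative purities can only decrease from one selected variable to the next, which matches the ordering described above. The same two functions could be applied unchanged to a compressed matrix, since only the matrix dimensions differ.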

The RTSIMPLISMA model in the wavelet domain is inversely transformed to the uncompressed representation. The relative root-mean-square error of the RTSIMPLISMA spectra (RRMSES) and the relative root-mean-square error of the RTSIMPLISMA concentration profiles (RRMSEC) are used to assess the model accuracy. RRMSES compares the reconstructed WC2-RTSIMPLISMA component spectra ($\hat{S}$) with the RTSIMPLISMA spectra without compression ($S$):

$$\mathrm{RRMSES} = \sqrt{\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{n_x} (\hat{s}_{ij} - s_{ij})^2}{\sum_{i=1}^{n_c}\sum_{j=1}^{n_x} s_{ij}^2}} \qquad (9)$$

Likewise, RRMSEC is calculated by:

$$\mathrm{RRMSEC} = \sqrt{\frac{\sum_{i=1}^{n_c}\sum_{k=1}^{n_s} (\hat{c}_{ik} - c_{ik})^2}{\sum_{i=1}^{n_c}\sum_{k=1}^{n_s} c_{ik}^2}} \qquad (10)$$

This assessment approach reflects the relative errors of the RTSIMPLISMA models and makes the errors from different datasets comparable.
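Both error measures, Eqs. (9) and (10), share one form, so a single helper covers them. The sketch below assumes the errors are returned as fractions (multiply by 100 for percent); the function name `rrmse` is hypothetical.

```python
import numpy as np

def rrmse(reconstructed, reference):
    """Relative root-mean-square error between a reconstructed model matrix
    and the reference model obtained without compression."""
    reconstructed = np.asarray(reconstructed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    squared_error = ((reconstructed - reference) ** 2).sum()
    return np.sqrt(squared_error / (reference ** 2).sum())

# Tiny worked example: one spectrum with two points.
reference = np.array([[3.0, 4.0]])        # sum of squares = 25
reconstructed = np.array([[3.0, 4.5]])    # squared error = 0.25
print(round(rrmse(reconstructed, reference), 6))   # 0.1
```

Passing spectra gives RRMSES and passing concentration profiles gives RRMSEC. Dividing by the sum of squares of the reference is what makes values from datasets with different signal levels comparable.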

3. Experimental

Two different ion mobility spectrometers were used, which were interfaced with personal computers through data acquisition devices (National Instruments, TX, USA). Homemade virtual instrument (VI) programs were implemented in LabVIEW 6.02 (National Instruments, TX, USA) to acquire data from the IMS instruments. The average baseline signal was subtracted from the original signals before storing. The baseline region was located from 1.5 to 3 ms of the IMS spectrum, a region where usually no peaks occur.

The first spectrometer was an ion trap mobility spectrometer (ITMS), the ITEMISER contraband detection and identification system (Ion Track Instruments, Inc., Wilmington, MA, USA). The ITMS system was interfaced with a laptop with a single Pentium III 850 MHz processor and 384 MB memory through a PCMCIA card (Type DAQCard-AI-16XE-50). The operating system was Windows 2000-SP2. The drug dataset was collected in positive polarity, which is the conventional mode of analysis for drugs. The acquisition rate was 80 kHz, and each spectrum consisted of 1500 points.

The second spectrometer was a Barringer Ionscan 350 (Barringer Instruments, Inc., New Jersey, USA). The Ionscan was interfaced with a single processor PII 200 MHz and 64 MB RAM computer through a data acquisition board (Type AT-MIO-16XE-10). The operating system was Windows 98 Second Edition. The Ionscan was operated in positive polarity.

Datasets were collected for this instrument by placing a small amount of a prepared sample solution on a sample filter. The sample filter was placed in a filter cartridge that is heated by the desorber to vaporize the sample into the instrument inlet. A single run for this instrument was limited to 20 s. The acquisition rate was 80 kHz, and each spectrum consisted of 1600 points.

Cocaine (Sigma; Lot 97H1018) and heroin (Lipomed, Inc., Cambridge, MA, USA), both in the form of freebase, were prepared in absolute ethanol. The concentrations were 0.02 and 0.20 mg/ml for cocaine and heroin, respectively. The drug mixture was prepared by adding 50 µl of each drug solution to an Eppendorf tube (Brinkmann Instruments, Inc., Westbury, NY, USA). A 10 µl aliquot of the mixture solution was placed on a sample trap for narcotics mode (Ion Track Instruments, Inc., Wilmington, MA, USA). The sample trap was exposed to air to evaporate the ethanol, yielding samples comprising 0.1 µg of cocaine and 1.0 µg of heroin. Approximately 100 blank spectra were collected before the sample trap was placed into the thermal desorption unit of the ITMS. The data acquisition was halted when the ITMS returned to baseline response.

Freeze-dried B. cereus was purchased from American Type Culture Collection, Manassas, VA, USA. Specimens were rehydrated with brain–heart infusion broth with 3% NaCl. Several drops of the broth were used to inoculate a brain–heart infusion agar plate. The plate was left at room temperature for 24 h. B. cereus cells were placed on an IMS sample disk along with 1 µl of 0.1 M TMAH. The sample disk was placed above the desorber heater on the Ionscan to thermally hydrolyze the sample at 300 °C. The resulting volatile compounds were introduced into the Ionscan by the carrier gas. The Ionscan system used nicotinamide as an internal calibrant. The data acquisition stopped when the instrument returned to baseline response.
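The drug amounts deposited on the sample trap follow from the stock concentrations and volumes given above. The short calculation below assumes that the two 50 µl portions mix to a total of 100 µl; the function name `mass_on_trap_ug` is hypothetical.

```python
def mass_on_trap_ug(stock_mg_per_ml, stock_volume_ul, total_volume_ul, aliquot_ul):
    """Mass (micrograms) of one drug in the aliquot placed on the sample trap."""
    # 1 mg/ml equals 1 microgram per microlitre, so no unit conversion is needed.
    mass_in_mixture_ug = stock_mg_per_ml * stock_volume_ul
    return mass_in_mixture_ug * aliquot_ul / total_volume_ul

cocaine_ug = mass_on_trap_ug(0.02, 50.0, 100.0, 10.0)
heroin_ug = mass_on_trap_ug(0.20, 50.0, 100.0, 10.0)
print(f"cocaine: {cocaine_ug:.1f} ug, heroin: {heroin_ug:.1f} ug")
# cocaine: 0.1 ug, heroin: 1.0 ug
```

The tenfold difference in stock concentration carries through directly to the tenfold difference in deposited mass.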
All programs were written at Ohio University and compiled in Borland C++ 5.02. MATLAB programs were written to perform statistical calculations. The programs were run on a desktop PC with a 1.2 GHz processor and 512 MB RAM. The operating system was Windows 2000-SP2. All calculations used single-precision (32 bit) floating-point arithmetic.
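The baseline correction described at the start of this section can be sketched as below. The sketch assumes that the first acquired point corresponds to a drift time of zero, so that at 80 kHz the 1.5 to 3 ms window maps to point indices 120 to 240; the function name `subtract_baseline` and the synthetic spectrum are hypothetical. The standard deviation of the same window is returned as a simple noise estimate.

```python
import numpy as np

def subtract_baseline(spectrum, rate_khz=80.0, start_ms=1.5, stop_ms=3.0):
    """Subtract the mean of the baseline window and return the corrected
    spectrum together with the standard deviation of that window."""
    spectrum = np.asarray(spectrum, dtype=float)
    # At 80 kHz one point is acquired every 1/80 ms, so 1.5 ms -> index 120
    # and 3.0 ms -> index 240 (ASSUMES the first point is at drift time zero).
    i0 = int(round(start_ms * rate_khz))
    i1 = int(round(stop_ms * rate_khz))
    window = spectrum[i0:i1]
    return spectrum - window.mean(), window.std(ddof=1)

# Synthetic 1500-point spectrum: constant offset, noise, one artificial peak.
rng = np.random.default_rng(1)
raw = 0.25 + 0.005 * rng.standard_normal(1500)   # volts
raw[800] += 1.0
corrected, noise = subtract_baseline(raw)
print(abs(corrected[120:240].mean()) < 1e-12)    # True: window now averages zero
```

Using a peak-free window for both the offset and the noise estimate keeps the correction independent of the analyte signal.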


4. Results and discussion

Two diverse datasets were used that represent the traditional application of drug detection and a newer application of characterizing bacteria. These datasets also had different signal and noise levels, and were useful for evaluating the WC2-RTSIMPLISMA algorithm. Because the wavelet compression (WC) algorithm used can only process data with dyadic length, the datasets were culled to retain 1024 spectra for the B. cereus data (i.e. bacterial data) and 512 spectra for the cocaine–heroin data (i.e. drug data), both of which had 1024 points in the drift time dimension. The raw data in this context refers to the culled datasets, given in Figs. 1 and 2 as 3D surface plots, in comparison to the data compressed and reconstructed by WC. SIMPLISMA accurately modeled the raw datasets with a damping factor α of 0.3 of the maximum peak intensity of the mean spectrum for the bacterial data and 0.05 for the drug data unless otherwise stated. The bacterial dataset required a larger value for α because it was noisier than the drug dataset, a consequence of the spectra being obtained from the Ionscan. The noise level of the drug dataset was 5.5 × 10⁻³ V (0.21% of the maximum reagent ion peak (RIP) intensity) and that of the bacterial dataset was 5.4 × 10⁻² V (1.3% of

the maximum RIP intensity). The SIMPLISMA models are given in Figs. 3 and 4, respectively, where the components are ordered by purity. In the spectral plots, the lower abscissa corresponds to drift time and the upper abscissa to the reduced mobilities, which are calculated using cocaine (1.16 cm² V⁻¹ s⁻¹) [54] and nicotinamide (1.86 cm² V⁻¹ s⁻¹) [55] as the calibrant ion for the drug and bacterial data, respectively. Negative peaks may occur in the spectra that indicate correlations among the pure variables. The drug dataset yielded a three-component model that comprised the ammonium reagent ion peak, the cocaine peak, and the heroin peak. The reagent ions are formed from ammonia, which is an internal dopant in the ITMS. This ion suppresses signals arising from substances with lower proton affinities and transfers charge to drugs that have proton affinities comparable to that of ammonia. Each IMS spectrum is closed in that the ion current (i.e. the spectral intensities) integrates to a constant value. Because ionization occurs through charge transfer reactions, the RIP decreases concomitantly with the increase of analyte peaks (Figs. 3A and 4A). In Fig. 3B, the three small peaks from 10 to 13 ms in the cocaine spectrum may be related to cluster ions that formed during the analysis. The bacterial dataset was more complex and the SIMPLISMA model comprised four

Fig. 1. The cocaine–heroin dataset comprised 1024 spectra displayed as a 3D surface (acquired from ITEMISER ITMS in positive mode).


Fig. 2. The Bacillus cereus dataset comprised 1024 spectra displayed as a 3D surface (acquired from Barringer Ionscan 350 spectrometer in positive mode with in situ derivatization using TMAH).

components that corresponded to the nicotinamide reactant ions, peaks pertaining to the TMAH derivatizing agent, and two peaks that corresponded to bacteria. Fig. 4A gives the concentration profiles. The TMAH profile increases rapidly because it is a volatile compound. The slow increase in height of the bacterial peaks reflects the reaction rate of the thermal hydrolysis/methylation of the lipids in B. cereus cells. The extracted SIMPLISMA spectra of all four components are given in Fig. 4B.

The logarithm of the relative purities with respect to component number is given in Fig. 5. The relative purity curves of determinant-based SIMPLISMA are given in Fig. 5A, and those of Gram-Schmidt SIMPLISMA in Fig. 5B (drug data) and Fig. 5C (bacterial data). In the figures, SIMPLISMA-gs, RTSIMPLISMA-s, and RTSIMPLISMA-v correspond to the purity calculation method using Eq. (3), the method reported in [1] using the standard deviation, and the one using Eq. (4), respectively. WC2 refers to 4 × 4 daublet 14–daublet 4 2D wavelet compression. From Fig. 5, the determinant-based SIMPLISMA differs from the Gram-Schmidt-based SIMPLISMA in that the former does not have a clear transition point where the slope of the relative purity curve changes, which makes it unsuitable for accurately determining the number of components ($n_c$) in the model. For the latter, a threshold based on calculating log ρ (the logarithm of the relative purity) after the transition point in the relative purity curves discloses the number of components in the model. The components before the transition point are real components and furnish larger purity values, while those afterwards correspond to spurious components. The transition points for the drug dataset for all four of the Gram-Schmidt methods are apparent in Fig. 5. The transition point for RTSIMPLISMA using variance is the most apparent. The relative purity curves for the raw and the 2D compressed drug data are similar for the first six points, which suggests that the compression has an insignificant effect on the model convergence when noise is relatively low in the data. However, the curve of the compressed bacterial data diverges from that of the raw data at the fifth component. Generally, the first spurious component corresponds to the distribution of noise across the spectra. The 2D wavelet compression removed high frequency noise from the bacterial dataset and reduced the relative purity of the spurious components, which enhances the difference in relative purities between the chemical and the spurious components. Therefore, RTSIMPLISMA-v prevailed over the other methods and was selected for further study. As discussed earlier, RTSIMPLISMA uses a threshold $\rho_0$ to determine if the transition point is


Fig. 3. SIMPLISMA models from the raw cocaine–heroin dataset (three-component model). (A) Concentration profiles; (B) component spectra.


Fig. 4. SIMPLISMA models from the raw Bacillus cereus dataset (four-component model). (A) Concentration profiles; (B) component spectra.


Fig. 5. Relative purity curves. (A) Determinant-based SIMPLISMA; (B) Gram-Schmidt-based SIMPLISMA for the drug dataset; (C) Gram-Schmidt-based SIMPLISMA for the bacterial dataset. (SIMPLISMA-gs: SIMPLISMA using Gram-Schmidt method for raw data; RTSIMPLISMA-s: RTSIMPLISMA using standard deviation for purity calculation for raw data; RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for raw data; WC2-RTSIMPLISMA-v: RTSIMPLISMA using variance for purity calculation for 2D compressed data; the highlighted points indicate transition points.)


Fig. 6. Percent correct number of components with respect to the threshold $\rho_0$.


reached. To investigate the threshold value, RTSIMPLISMA with different threshold values was applied to 1458 datasets that were generated by 4 × 4 2D compression of both the drug and bacterial datasets with 27 types of wavelets in each dimension (2 datasets × 27 × 27 wavelet combinations). The 27 wavelets included 15 from the Daubechies family (daublet 2, 4, ..., 30), 5 from the coiflet family (coiflet 1, 2, ..., 5), and 7 from the symmlet family (symmlet 4, 5, ..., 10). The drug dataset was compressed to 64 × 64 points and the bacterial dataset to 32 × 64 points. The compression reduced each dataset to 1/256 of its original size. The results of percent correct $n_c$ with respect to $\rho_0$ in Fig. 6 reveal that the optimal $\rho_0$ is located between 0.45 and 0.65, in which the average percent correct $n_c$ is 98.7% for the drug dataset and 75.1% for the bacterial dataset. Compared to the drug dataset, the percent correct $n_c$ for the bacterial dataset is lower and more sensitive to the change of $\rho_0$. First, the bacterial dataset is noisier and more complex. From the relative purity curve of RTSIMPLISMA-v in Fig. 5C, the transition point is not as clearly defined as it was for the drug dataset in Fig. 5B. Second, the 2D compressions using some of the wavelet filters altered the relative purity curves of the bacterial dataset more than those of the drug data. In other words, those 2D compressions distorted the dataset and changed its chemical rank. For example, the 4 × 4 daublet 8–symmlet 8 compression reduced the number of components in the bacterial dataset to 3, in which the calibration peak and the TMAH peak could not be separated and were modeled as the same component. In practice, it is more informative to resolve extra components than to underestimate the number of components. Although the percent correct $n_c$ is the highest when $\rho_0$ is equal to 0.55, the value 0.5 was selected as the optimal $\rho_0$ because it yielded the largest percentage of the correct number of components $n_c$. The log ρ criterion yielded similar results, while a lower $\rho_0$ criterion tended to overestimate the number of components.

RTSIMPLISMA was applied to several spectral reference datasets [44,56] without compression. The Windig datasets include Raman spectra following the reaction of tetramethyl orthosilicate in aqueous methanol over time, FTIR microscopy spectra of a polymer laminate, NIR spectra of mixtures of five solvents, and time-resolved mass spectra of a mixture of three photographic color coupling compounds.
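The role of the threshold can be illustrated with a small sketch. The text does not spell out the decision rule completely, so the code below implements only one possible reading: the number of components is taken as the position of the last drop between adjacent log relative purities that exceeds the threshold. The base-10 logarithm, the fallback value of 1, the function name `count_components` and the relative purities in the example are all assumptions made for illustration, and the noise-level and one-percent rules are omitted.

```python
import numpy as np

def count_components(rel_purity, threshold=0.5):
    """One possible reading of the threshold rule: the model ends at the last
    component that is followed by a drop in log10 relative purity larger than
    `threshold`. Returns 1 when no drop qualifies (arbitrary fallback)."""
    log_rho = np.log10(np.asarray(rel_purity, dtype=float))
    drops = log_rho[:-1] - log_rho[1:]        # drop after component 1, 2, ...
    large = np.flatnonzero(drops > threshold)
    return int(large[-1]) + 1 if large.size else 1

# Hypothetical relative purities with a clear transition after component 3.
rel = [1.0, 0.6, 0.3, 0.002, 0.0015, 0.0012]
print(count_components(rel, 0.5))   # 3
print(count_components(rel, 0.1))   # 4
```

Under this reading a lower threshold can only increase the count, which is consistent with the observation that a lower threshold tends to overestimate the number of components.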

The threshold $\rho_0$ was 0.5 for all of the datasets except the NIR spectra, for which it was 0.45. The rule that requires the standard deviation of a pure variable to be larger than three times the noise level was not used. The correct $n_c$ was found for the Raman, NIR ($\rho_0$ = 0.45) and FTIR microscopy datasets. For the time-resolved mass spectra, four components were obtained from RTSIMPLISMA, while the reference reported three [44]. Our result is in agreement with the results obtained with the simplified Borgen method [57] and subspace comparisons [58]. WC2-RTSIMPLISMA has also been successfully applied to modeling other IMS datasets [59].

The effects of wavelet type and compression level on WC2-RTSIMPLISMA were evaluated using compression levels that ranged from 1 to 6 and 27 wavelet filters applied to one of the sample and drift time dimensions of the drug dataset. RTSIMPLISMA was applied to the one-dimensional compressed data. The RRMSES and RRMSEC were calculated for each level and each wavelet filter. In total, 162 (i.e. 27 × 6) RRMSES and 162 RRMSEC values were obtained for each dimension. Two-factor analysis of variance (ANOVA) was used to evaluate the results. The reference statistic was obtained for a 5% significance level. Both RRMSES and RRMSEC were used for ANOVA. The ratios F/Fcrit are reported in Table 1. This statistic indicates the significance of a factor's contribution to the total variation. The F/Fcrit ratios for the wavelet filter and compression level factors were calculated for the two dimensions. The compression level has a greater impact on the RRMSES and RRMSEC than the wavelet type. To further investigate the effects of compression levels on modeling accuracy, the 27 wavelet filters were applied to the drug and bacterial datasets in both dimensions. Compression levels were varied from 2 to
Table 1
Contribution of wavelet type and compression level to the variation of RRMSES and RRMSEC for the drug dataset (values are F/Fcrit)

Source of variation    Row compression       Column compression
                       RRMSES    RRMSEC      RRMSES    RRMSEC
Wavelet type             2.0       1.3         1.1       1.3
Compression level       58.5      17.4         6.5      12.5
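The F/Fcrit statistic in Table 1 can be illustrated with a minimal sketch. It assumes a two-factor ANOVA without replication (one RRMSE value per filter/level cell), which is consistent with the 27 × 6 layout described in the text but is not stated explicitly by the authors. The function names are hypothetical, and the critical values are passed in by the caller because they must be looked up for the 5% significance level and the relevant degrees of freedom (under this assumption, 26 and 130 for the filter factor and 5 and 130 for the level factor).

```python
def two_factor_anova(table):
    """Two-factor ANOVA without replication (assumed design).

    table[i][j] holds one response (e.g. an RRMSE) for level i of factor A
    (rows, e.g. wavelet filter) and level j of factor B (columns, e.g.
    compression level). Returns the F statistics (F_A, F_B).
    """
    a = len(table)          # number of levels of factor A
    b = len(table[0])       # number of levels of factor B
    grand = sum(sum(row) for row in table) / (a * b)
    row_means = [sum(row) / b for row in table]
    col_means = [sum(table[i][j] for i in range(a)) / a for j in range(b)]

    # Sums of squares for each factor, the total, and the residual
    ss_a = b * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_a - ss_b

    df_a, df_b = a - 1, b - 1
    ms_error = ss_error / (df_a * df_b)
    return (ss_a / df_a) / ms_error, (ss_b / df_b) / ms_error


def significance_ratios(table, f_crit_a, f_crit_b):
    """Return (F_A / Fcrit_A, F_B / Fcrit_B); Fcrit values are supplied by the caller."""
    f_a, f_b = two_factor_anova(table)
    return f_a / f_crit_a, f_b / f_crit_b


# Toy 2 x 2 table, not data from the paper
toy = [[1.0, 2.0], [3.0, 5.0]]
print(two_factor_anova(toy))  # (25.0, 9.0)
```

A ratio above 1 means that the factor is significant at the chosen level; the larger the ratio, the larger the share of the variation attributed to that factor.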


Table 2
Compression levels, compression factor (CF), percent correct nc, average RRMSES (%) ± 95% confidence interval, minimum RRMSES, and the corresponding wavelet type for different compression levels for the drug dataset

Compression levels   CF      Percent correct   Average RRMSES   Minimum RRMSES   Wavelet for minimum RRMSES
(row × column)               nc (%)            (%)              (%)              Row          Column
2 × 4                1/64    100                2.55 ± 0.09      1.07            Daublet 16   Daublet 8
3 × 4                1/128   100                4.76 ± 0.16      1.36            Daublet 22   Daublet 2
4 × 2                1/64     96               22.30 ± 1.70      6.40            Daublet 14   Coiflet 5
4 × 3                1/128    97               22.21 ± 1.69      6.28            Daublet 14   Coiflet 5
4 × 4                1/256    97               22.18 ± 1.68      6.29            Daublet 14   Symmlet 6
4 × 5                1/512    78               22.16 ± 1.87      6.29            Daublet 14   Symmlet 4
5 × 4                1/512    69               46.82 ± 1.83     23.18            Daublet 10   Symmlet 7

5 for drift and acquisition time dimensions, respectively. The RRMSES results for level patterns that include a level 4 compression are given in Tables 2 and 3. The average RRMSES for each compression level pattern was calculated by averaging the RRMSESs of those models, among the 729 (27 × 27) wavelet filter combinations, that determined nc correctly. The confidence interval of the average RRMSES was obtained from the t-statistic at a 5% significance level. The percent correct nc of each level pattern is calculated by dividing the number of models with correct nc by the total number of models. The compression factor, the minimum RRMSES for each compression level and the corresponding wavelet type combination are given in Tables 2 and 3. With the same row compression level, the average RRMSESs are approximately equal for different column compression levels, while the RRMSESs differ considerably for different row compression levels, suggesting the row compression level is a more important

factor with respect to the reconstruction errors. This finding can be explained by the pure variable selection mechanism of RTSIMPLISMA. In raw datasets, pure variables are selected from nx variables, i.e. 1024. Row compression, i.e. compression in the drift time direction, decreases the number of variables in the variable pool, while column compression does not change it. As a result, row compression may significantly change the selected concentration profiles (C), which can furnish a larger RRMSEC. The errors in C propagate to the spectra S and may increase RRMSES, because S is calculated from C. Therefore, two factors contribute to RRMSE: one is the compression itself and the other is the change in the concentration profiles. The effect of the latter is more significant than the former at low compression levels (less than level 5). Consequently, the wavelet filter affects row compression and spectral reconstruction errors more significantly than column compression. The spectral dimension is

Table 3
Compression levels, percent correct nc, average RRMSES (%) ± 95% confidence interval, minimum RRMSES, and the corresponding wavelet type for different compression levels for the bacterial dataset

Compression levels   Percent correct   Average RRMSES   Minimum RRMSES   Wavelet for minimum RRMSES
(row × column)       nc (%)            (%)              (%)              Row          Column
2 × 4                96                 9.26 ± 0.43      2.61            Daublet 24   Daublet 4
3 × 4                91                 9.42 ± 0.40      3.25            Symmlet 9    Daublet 4
4 × 2                76                19.39 ± 1.16     10.32            Daublet 22   Daublet 4
4 × 3                83                21.08 ± 1.27     10.34            Daublet 22   Daublet 6
4 × 4                77                22.00 ± 1.26     10.39            Daublet 22   Daublet 4
4 × 5                25                31.50 ± 2.16     11.81            Daublet 22   Daublet 2
5 × 4                54                60.01 ± 2.52     38.51            Daublet 12   Daublet 4


Fig. 7. Reconstructed RTSIMPLISMA models from the 4 × 4 daublet 14–daublet 4 compressed drug dataset. (A) Concentration profiles; (B) component spectra.

more important for optimizing compression level and wavelet filters than the acquisition time dimension. The bacterial dataset yields higher average RRMSESs and a lower percent correct nc than the drug dataset because it is noisier. Wavelet compression removes noise from the data. RTSIMPLISMA can remove noise to some extent but not as well as


Fig. 8. Reconstructed RTSIMPLISMA models from the 4 × 4 daublet 14–daublet 4 compressed bacterial dataset. (A) Concentration profiles; (B) component spectra.


wavelet compression. Therefore, higher noise levels may contribute to the higher RRMSES. Alternatively, the higher uncertainty in selecting pure variables from noisy data may also lead to a greater RRMSES. In addition, the greater the compression level, the lower the percent correct nc, because greater compression may alter the RTSIMPLISMA models. The compression level and the wavelet filter were selected to balance computation time against reconstruction error. For the compression level, computational efficiency correlates with reduced dataset size, because SIMPLISMA has a greater computational burden than the wavelet compression. Summarizing both tables, 4 × 4 compression is selected as the optimal compression level for IMS data, for which the RRMSESs are acceptable (around 6 and 10% for the drug and bacterial datasets, respectively). With the 4 × 5 compression, the percent correct nc is reduced considerably in comparison to 4 × 4 compression although the minimum RRMSESs are similar, which suggests that the compression error increases considerably from level 4 to level 5. The optimal wavelet filter pair was daublet 14–daublet 4 instead of either daublet 14–symmlet 6 or daublet 22–daublet 4, which yielded the minimum RRMSES in the tables. This choice arises from two factors. First, shorter wavelet filters are more computationally efficient. For example, the daublet 22 filter has 22 coefficients, while the daublet 4 filter has four coefficients. Second, with the daublet 14–daublet 4 filter pair, the RRMSESs for the drug and bacterial datasets are 6.60 and 11.58%, respectively, which are similar to the corresponding minimum RRMSES. The WC2-RTSIMPLISMA models with RRMSES less than 10% matched well with the original RTSIMPLISMA models. However, 10% is not a strict threshold. The RRMSES criterion should be used in combination with a comparative inspection of the specific models.
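The relation between compression levels and dataset size can be made concrete with a short sketch. It assumes that compression keeps only the approximation coefficients after each dyadic decomposition step, and it swaps in the two-coefficient Haar filter purely for brevity; the daublet 14 and daublet 4 filters preferred here are longer and are not implemented. All function names and the 32 × 64 test matrix are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_approx(signal, levels):
    """Keep only the Haar approximation coefficients after `levels` steps."""
    out = list(signal)
    for _ in range(levels):
        if len(out) % 2:
            raise ValueError("length must be divisible by 2**levels")
        out = [(out[i] + out[i + 1]) / SQRT2 for i in range(0, len(out), 2)]
    return out


def haar_expand(coeffs, levels):
    """Invert haar_approx with all detail coefficients set to zero."""
    out = list(coeffs)
    for _ in range(levels):
        out = [c / SQRT2 for c in out for _copy in range(2)]
    return out


def compress_2d(data, row_levels, col_levels):
    """Compress within each row (drift time), then within each column (acquisition time)."""
    rows = [haar_approx(r, row_levels) for r in data]
    cols = [haar_approx(c, col_levels) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def expand_2d(comp, row_levels, col_levels):
    """Transform a compressed matrix back to the uncompressed representation."""
    cols = [haar_expand(c, col_levels) for c in zip(*comp)]
    rows = [list(r) for r in zip(*cols)]
    return [haar_expand(r, row_levels) for r in rows]


# Hypothetical matrix: 32 spectra (rows) x 64 drift-time points (columns)
data = [[float(i + j) for j in range(64)] for i in range(32)]
small = compress_2d(data, 4, 4)
factor = (len(small) * len(small[0])) / (len(data) * len(data[0]))
print(len(small), len(small[0]), factor)  # 2 4 0.00390625
```

Each level halves one dimension, so a 4 × 4 pattern leaves 1/16 of each dimension and 1/256 of the data. By the same arithmetic, a level 4 row compression of spectra with 1024 drift-time points leaves 64 candidate variables for pure variable selection.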
When the models from the compressed data are transformed back to the uncompressed representation, the reconstructed models (Figs. 7 and 8) are comparable with the original SIMPLISMA models. The reconstructed models differ from the original ones in that most of the noise is removed, yet they characterize the same changes in the analytical signal. Note that in Fig. 8A the bacterial concentration profiles of the WC2-RTSIMPLISMA model vary notably from those obtained from the raw spectra, but the spectra in Fig. 8B correspond well. The variation of the concentration profiles was inconsequential.
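The comparison between reconstructed and original models can be sketched as a relative root-mean-square difference. The normalization used below (the sum of squares of the model obtained without compression) is an assumption made for illustration; the function name is hypothetical, and the sketch presumes that the components of both models are already matched and on the same scale.

```python
import math


def rrmse_percent(reference, reconstructed):
    """Relative RMS difference (%) between two equally sized matrices.

    `reference` is the model from uncompressed data and `reconstructed`
    is the model transformed back from compressed data (assumed layout:
    lists of rows).
    """
    if len(reference) != len(reconstructed):
        raise ValueError("matrices must have the same number of rows")
    diff_sq = 0.0
    ref_sq = 0.0
    for ref_row, rec_row in zip(reference, reconstructed):
        if len(ref_row) != len(rec_row):
            raise ValueError("rows must have the same length")
        for r, x in zip(ref_row, rec_row):
            diff_sq += (r - x) ** 2
            ref_sq += r ** 2
    return 100.0 * math.sqrt(diff_sq / ref_sq)


# Toy example, not data from the paper
print(rrmse_percent([[3.0, 4.0]], [[3.0, 4.0]]))  # 0.0
```

Applied to the component spectra this would give an RRMSES-type value, and applied to the concentration profiles an RRMSEC-type value.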

5. Conclusions
The WC2-RTSIMPLISMA method was developed. Sets of drug and bacterial positive ion mobility spectra were used to evaluate this method. RTSIMPLISMA automatically determines the number of components by locating the transition point of the relative purity curve. This method was evaluated with several published reference datasets. Accurate models were obtained from 2D wavelet compressed data using RTSIMPLISMA. Optimal models were obtained with a 4 × 4 daublet 14–daublet 4 compression, for which the row and column dimensions of the dataset were each compressed to 1/16 of their original size and the resultant dataset was compressed to 1/256 of its original size. Key chemical information was retained in the reconstructed SIMPLISMA models for the IMS spectra. Compared to compressing the spectra, compressing the acquisition time dimension has less influence on the model accuracy. This result is expected because SIMPLISMA models depend on the pure variables, and if the spectral dimension is overly compressed, pure variables may no longer exist. Future work will focus on the real-time implementation of the WC2-RTSIMPLISMA algorithm.

Acknowledgements
This work was presented in part at the 51st Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy in New Orleans, LA. The Center for Intelligent Chemical Instrumentation at Ohio University is thanked for supporting the conference trip. The Research Corporation is thanked for the Research Opportunity Award. Ohio University is thanked for the support of the Donald R. Clippinger Fellowship. The Federal Aviation Administration and Ion Track Instruments, Inc. are thanked for the donation of instruments. Tricia Buxton is thanked for the bacterial dataset. Matt Rainsberg, Preshious Rearden,


Mariela Ochoa, Tricia Buxton, and Libo Cao are thanked for their helpful comments.

References
[1] G. Chen, P.B. Harrington, Appl. Spectrosc. 55 (2001) 621.
[2] C. Cai, P.B. Harrington, D.M. Davis, Anal. Chem. 69 (1997) 4249.
[3] P.B. Harrington, T.L. Isenhour, Anal. Chem. 60 (1988) 2687.
[4] J. Trygg, N. Kettaneh-Wold, L. Wallbäcks, J. Chemomet. 15 (2001) 299.
[5] A. Urbas, P.B. Harrington, Anal. Chim. Acta 446 (2001) 393.
[6] B. Walczak, Wavelets in Chemistry, Elsevier, Amsterdam, 2000.
[7] B. Walczak, D.L. Massart, Trends Anal. Chem. 16 (1997) 451.
[8] K. Jetter, U. Depczynski, K. Molt, A. Niemöller, Anal. Chim. Acta 420 (2000) 169.
[9] A.K.M. Leung, F.T. Chau, J.B. Gao, Chemomet. Intell. Lab. Syst. 43 (1998) 165.
[10] B.K. Alsberg, A.M. Woodward, D.B. Kell, Chemomet. Intell. Lab. Syst. 37 (1997) 215.
[11] B. Walczak, D.L. Massart, Chemomet. Intell. Lab. Syst. 36 (1997) 81.
[12] F. Ehrentreich, L. Sümmchen, Anal. Chem. 73 (2001) 4364.
[13] H.L. Ho, W.K. Cham, F.T. Chau, J.Y. Wu, Comput. Chem. 23 (1999) 85.
[14] F.T. Chau, J.B. Gao, T.M. Shih, J. Wang, Appl. Spectrosc. 51 (1997) 649.
[15] A.K.M. Leung, F.T. Chau, J.B. Gao, T.M. Shih, Chemomet. Intell. Lab. Syst. 43 (1998) 69.
[16] B.K. Alsberg, A.M. Woodward, M.K. Winson, J. Rowland, D.B. Kell, Analyst 122 (1997) 645.
[17] J. Lasa, I. Sliwka, J. Rosiek, K. Wal, Chem. Anal. 46 (2001) 529.
[18] H. Chen, Anal. Chim. Acta 346 (1997) 319.
[19] S. Wu, L. Nie, J. Wang, X. Lin, L. Zheng, L. Rui, J. Electroanal. Chem. 508 (2001) 11.
[20] L. Eriksson, J. Trygg, E. Johansson, R. Bro, S. Wold, Anal. Chim. Acta 420 (2000) 181.
[21] J. Trygg, S. Wold, Chemomet. Intell. Lab. Syst. 42 (1998) 209.
[22] S. Ren, L. Gao, Talanta 50 (2000) 1163.
[23] B. Walczak, B. van den Bogaert, D.L. Massart, Anal. Chem. 68 (1996) 1742.
[24] A.W. Mehay, C. Cai, P.B. Harrington, Appl. Spectrosc. 56 (2002) 223.
[25] B. Walczak, D.L. Massart, Chemomet. Intell. Lab. Syst. 38 (1997) 39.
[26] X. Zhang, J. Zheng, H. Gao, Anal. Chim. Acta 443 (2001) 117.
[27] P.B. Harrington, P.J. Rauch, C. Cai, Anal. Chem. 73 (2001) 3247.
[28] C. Cai, P.B. Harrington, J. Chem. Inf. Comput. Sci. 39 (1999) 874.
[29] E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski, J. Brower, Anal. Chem. 69 (1997) 1392.
[30] R. Polikar, http://engineering.rowan.edu/polikar/WAVELETS/WTtutorial.html (accessed October 2001).
[31] P.B. Harrington, L. Hu, Appl. Spectrosc. 52 (1998) 1328.
[32] G.R. Asbury, C. Wu, W.F. Siems, H.H. Hill, Anal. Chim. Acta 404 (2000) 273.
[33] S. Sielemann, J.I. Baumbach, H. Schmidt, P. Pilzecker, Field Anal. Chem. Technol. 4 (2000) 157.
[34] G.R. Asbury, J. Klasmeier, H.H. Hill, Talanta 50 (2000) 1291.
[35] C. Wu, W.F. Siems, H.H. Hill, Anal. Chem. 72 (2000) 396.
[36] S. Sielemann, J.I. Baumbach, H. Schmidt, P. Pilzecker, Anal. Chim. Acta 431 (2001) 293.
[37] P.B. Harrington, E.S. Reese, P.J. Rauch, L. Hu, D.M. Davis, Appl. Spectrosc. 51 (1997) 808.
[38] E.S. Reese, P.B. Harrington, J. Forensic Sci. 44 (1999) 68.
[39] L.A. Shaw, P.B. Harrington, Spectroscopy 15 (2000) 40.
[40] T.L. Buxton, P.B. Harrington, Anal. Chim. Acta 434 (2001) 269.
[41] W. Windig, J. Guilment, Anal. Chem. 63 (1991) 1425.
[42] D.S. Smith, J.R. Kramer, Anal. Chim. Acta 416 (2000) 211.
[43] A. Garrido Frenich, J.R. Torres-Lapasió, K. De Braekeleer, D.L. Massart, J.L.M. Martínez Vidal, M.M. Galera, J. Chromatogr. A 855 (1999) 487.
[44] W. Windig, Chemomet. Intell. Lab. Syst. 36 (1997) 3.
[45] K. De Braekeleer, F. Cuesta-Sanchez, P.A. Hailey, D.C.A. Sharp, A.J. Pettman, D.L. Massart, J. Pharm. Biomed. Anal. 17 (1998) 141.
[46] R.L. Collins, P.L. Ellickson, R.M. Bell, J. Subst. Abuse 10 (1998) 233.
[47] A.J. Roberts, I.Y. Polis, L.H. Gold, Eur. J. Pharmacol. 326 (1997) 119.
[48] P.B. Harrington, T.L. Buxton, G. Chen, Int. J. Ion Mobility Spectrom. 4 (2001) 148.
[49] S. DeLuca, E.W. Sarver, P.B. Harrington, K.J. Voorhees, Anal. Chem. 62 (1990) 1465.
[50] A.A. Urbas, P.B. Harrington, Anal. Chim. Acta 446 (2001) 393.
[51] S. Mallat, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989) 674.
[52] P.J. Rauch, P.B. Harrington, D.M. Davis, Chemomet. Intell. Lab. Syst. 39 (1997) 175.
[53] F. Cuesta-Sánchez, M.S. Khots, D.L. Massart, Anal. Chim. Acta 285 (1994) 181.
[54] G.A. Eiceman, Z. Karpas, Ion Mobility Spectrometry, CRC Press, Boca Raton, FL, 1994.
[55] T.L. Buxton, Ph.D. Dissertation, Ohio University, Athens, OH, 2002.
[56] ftp://ftp.clarkson.edu/pub/hopkepk/Chemdata/Windig (accessed December 2001).
[57] B.-V. Grande, R. Manne, Chemomet. Intell. Lab. Syst. 50 (2000) 19.
[58] H. Shen, Y. Liang, O.M. Kvalheim, R. Manne, Chemomet. Intell. Lab. Syst. 51 (2000) 49.
[59] G. Chen, Ph.D. Dissertation, Ohio University, Athens, OH, 2003.
