Alignment and Preprocessing for Data Analysis Preprocessing tools for chromatography Basics of alignment GC FID (D) data and issues PCA F Ratios GC MS (D) data and issues PCA F Ratios PARAFAC Piecewise Alignment GUI (available online) synoveclab.chem.washington.edu/downloads.htm Email for username/password
Tools for Analysis: Classification Principal Component Analysis (PCA) Signal Retention Time PCA Loadings` Score PC X Scores + Error DCS Retention Time Score PC Degree of Class Separation (DCS) Score PC DCS DCS D s A AB, s B D, ( X X ) ( Y Y ) AB A B A B Score PC
Why Align? Reduction in classification PCA Increase in uncertainty for quantification PARAFAC Misalignment occurs frequently Daily instrument variation causes misalignment Correction is necessary to apply these methods
Retention Time Precision & PCA Replicates with limited shifting Loadings Scores Signal PCA _ Score PC DCS=3 Retention Time Retention Time Score PC Replicates with large shifting Loadings Scores Signal PCA _ Score PC DCS=3 Retention Time Retention Time Score PC
Basics of Alignment Types of alignment algorithms Cross correlation coefficient Correlation Optimized Warping (COW) *Piecewise alignment Alignment Parameters Window Size Shift Target Selection PCA Correlation Coefficient Windowed Target
Alignment Algorithms Cross correlation coefficient Move the entire chromatogram to maximize correlation Correlation Optimized Warping (COW) Separate the chromatogram into windows Warp and move the windows to optimize the correlation Find the best alignment path to correct the data *Piecewise alignment Separate the chromatogram into windows Shift the windows to optimize the correlation Find the best alignment path to correct the data
Alignment Parameters Parameter Description Effect Too Small Effect Too Large Determining correct values Window Size, W Window Size ~ pk width Relative movement high, difficult to determine quality of alignment Insufficient flexibility to correct peak to peak shifting Alignment Metric Shift, L <shift<=maximum shift L t R Insufficient movement of segments Increases time Alignment Metric
Target Selection *Global Approach All chromatograms are initially collected *User chosen target PCA optimized target Scores are produced for every sample The sample with the minimum distance from the center is the target *Maximum correlation target Calculate the product of each chromatograms correlation to the others The maximum correlation is the target Online Approach An initial target is set Chromatograms are aligned as they are collected The target changes as new chromatograms are collected
Target Selection (Maximum Correlation).6.55 Sample # 9 is the target.5 6 5 9 8 7 6 5 4 3 Type Type Type 3 Product Correlation.45.4.35.3.5..5 Signal 4 3 8.5 8. 8.5 8.. 5 5 8 Sample # Target Type Type Type 3 Signal 6 4 4 6 8 Retention Time, min 8.5 8. 8.5 8. Retention Time, min
Alignment of GC FID (D) Data and Issues Removal of artifacts and solvent peaks Baseline correction and normalization Alignment Improving PCA F Ratio
Preprocessing Tools for Chromatography Noise filtering Median filter Baseline correction Normalization Alignment
Experimental GC FID Separation 4 diesel sample types 9 minute separation 7 replicate injections over 5 days
Noise Filtering (Solvent, Spikes, and Outliers) 4 GC-FID Separation.8.6 Noise Spike Signal.4. Signal 8 6 4 Noise Spike.8.6.4..8.6 7 7. 7.4 7.6 7.8 Median Filtered Data (minimize spike) 4 6 8 Retention Time, min Signal.4..8.6.4 7 7. 7.4 7.6 7.8 Retention Time, min
Noise Filtering (Solvent, Spikes, and Outliers) GC-FID Separation Solvent.5 x 8 Scores Plot Solvent Dominated Info Signal 8 6 PC.5 -.5 4 - -.5 Outlier (low signal) - 4 6 8 Retention Time, min -.5.8..4.6.8. x 9 PC
Noise Filtering (Solvent, Spikes, and Outliers) 4 GC-FID Separation 5 x 6 Scores Plot 4 3 Signal 8 6 PC Outlier (Low Signal) 4 - Outlier Low Signal 4 6 8 Retention Time, min - -3.5.5 3 3.5 4 4.5 5 x 7 PC
Noise Filtering (Solvent, Spikes, and Outliers) 4 GC-FID Separation 5 x 6 Scores Plot 4 3 Signal 8 6 PC No Outlier (Baseline / Normalization on PC) 4 - - No Outlier 4 6 8 Retention Time, min -3 3.4 3.6 3.8 4 4. 4.4 4.6 x 7 PC
Noise Filtering (Solvent, Spikes, and Outliers).5 x -9 Baseline Corrected and Normalized Data 8 x -4 6 Scores Plot Normalized Signal.5.5 PC 4 - Tight Clustering in PCA Subtract offset line from data Then divide by total signal -.5 4 6 8 Retention Time, min -4 6. 6.3 6.4 6.5 6.6 6.7 6.8 x -3 PC
Experimental GC Separation 3 gasoline sample types 5 minute separation 6 replicate injections over days Misalignment is due to day to day instrument variation
Alignment GUI Initially blank to guide the user through alignment
Alignment GUI Load from file Load from workspace
Load & View Data Sizes Alignment GUI Zoom & Pan Align Data Choose or Estimate Parameters Choose or Estimate Target
Alignment GUI Data before alignment Estimated target and parameters Now save data to workspace or.mat file Data after alignment
PCA Classification of Gasoline x -8 3 3 x -3 Scores Plot Signal.5.5 PC - - We want better separation of groups -3-4.5 8.5 8. 8.5 8. 8.5 8.3 8.35 Retention Time, min -5.38.385.39.395.4.45.4 PC
Fisher Ratio Method (F Ratio) For each mass channel calculate Fisher Ratio at each point in D space, Sample Fisher Ratio = btwclass w ithinclass Fisher Ratios GC MS Separations Sample Col Sample 3 Col Works well for samples that have large amounts of within class variance Works best when comparing a small number of sample classes
Fisher Ratio Method (F Ratio) x -8 Selected Chromatographic Region 8 6 No alignment Signal 3.5.5 F-Ratio 8 4 8 6 4 5 5 Retention Time, min 7 6 After alignment.5 8.5 8. 8.5 8. 8.5 8.3 8.35 F-Ratio 5 4 3 Select Top Ten Retention Time, min 5 5 Retention Time, min
Fisher Ratio Method (F Ratio) 7 x -8 Regions of interest 6.5 x -3 Scores Plot 5 Signal 4 3 PC.5 Tight clustering of sample types.5 -.5-5 5 Retention Time, min - 5 6 7 8 9 x -3 PC
Summary Removal of solvents or artifacts is essential Baseline correction is an important step Alignment is essential for improving classification The F Ratio algorithm can further improve classification
Alignment of GC MS (D) Data and Issues Removal of artifacts and solvent peaks Baseline correction and normalization Alignment Improving PCA F Ratio PARAFAC
Experimental GC MS Separation 3 gasoline sample types 5 minute separation 6 replicate injections over days Misalignment is due to day to day instrument variation
Alignment GUI Initially blank to guide the user through alignment
Alignment GUI Load from file Load from workspace
Load & View Data Sizes Alignment GUI Zoom & Pan Align Data Choose or Estimate Parameters Choose or Estimate Target
Alignment GUI Data before alignment Estimated target and parameters Now save data to workspace or.mat file Data after alignment
Alignment Sample Target TIC Total Ion Current Shift Function (TIC SF) 5 Alignment Shift 5 4 Retention Time Retention Time
Alignment Total Ion Current Shift Function (TIC SF) Shift GC MS Data 8 6 4 Sample Target Retention Time GC MS Data Apply alignment 8 6 4 4 Retention Time 4 Retention Time
Alignment x -8.5 TIC Signal 7 x -8 6 5 4 3 TIC Signal.5.5 x -8.5 7. 7.5 7. 7.5 7.3 Align TIC Signal.5-5 5 Retention Time, min.5 7. 7.5 7. 7.5 7.3 Retention Time, min
Classification of Full MS Data.5 x -8 3.5 x -3 Scores Plot TIC Signal.5 PC.5 -.5 - -.5 PCA w/ full MS.5 -.3.35.4.45.5 PC 7.5 7. 7.5 7. 7.5 Retention Time, min
Fisher Ratio Method For each mass channel calculate Fisher Ratio at each point in D space, Fisher Ratio = btwclass w ithinclass GC MS Separations Sample Col Sample Col m/z m/z Fisher Ratios for Samples m/z Sum Fisher Ratios along mass channel dimension Sample 3 m/z Col Col Col Works well for samples that have large amounts of within class variance Works best when comparing a small number of sample classes
Simulated Example Using F Ratios for PCA TIC W = 4 L = Align TIC Scores GC-MS Data D F Ratios PCA F-Ratios m/z Shows differences between all samples Retention Time
F Ratios 4 GC MS Fisher Ratios 7 6 7 x -8 GC MS Separation 5 4 6 8 3 5 6 TIC Signal 4 3 4 5 5 7 GC MS Fisher Ratios 6 9 8 5 4 7 3-5 5 6 Retention Time, min 5 4 Top Fisher Ratios 4.8 5 5. 5.4 5.6
Procedure Select masses using mass spectral information Select time regions using F Ratios Combine to reduce the data set Improve results of PCA
Classification of Full MS Data 7 x -8 6 5 4 Regions of interest (with MS values) Scores Plot of Nearest Samples.5 x -4.5 Signal 3 PC -.5 - -.5 Clear Separation on st PC - - 5 5 -.5.8.9...3.4 x -3 PC Retention Time, min
Quantification of GC MS Data Use aligned, baseline corrected and normalized data Use PARAFAC of small regions for analysis Match values to mass spectra Peak Sums Peak Profiles
Quantification Target analyte Parallel Factor Analysis (PARAFAC) isolates the pure component peak and mass spectral information from overlapping peaks and background for both identification and quantification. Data Matrix = Analyte + Analyte +... Noise x x R = z z + + + y y E PARAFAC Results Compound : malate, TMS # of factors: 3 Chosen comp.: #3 Match value : 948 Peak sum : 44388 Peak height : 37688 Intensity Col. Time Intensity Col. Time
PARAFAC for GC MS Data Sample Peak Profile 8 6 8 4 6 4 4 4 Sample Stack Samples Run PARAFAC...4.6 Mass Spectrum 8 6 4 Retention Time Sample Analyte Concentration Sample 4 Sample
Experimental GC MS Separation 3 gasoline sample types 5 minute separation 6 replicate injections over days Misalignment is due to day to day instrument variation Looking for isobutyl benzene in the gasoline
Isobutyl Benzene Spectrum 9 8 7 Signal 6 5 4 3 4 6 8 4 6 Mass
PARAFAC Results for GC MS Data 6 x -9 5 8 Selected Analysis Region for PARAFAC 7 Profile 6 Qualitative Info 9 x -3 5 4 3 Peak 3 5 5 5 3 35 3.5 x -3 TIC Signal 4 3 Quantitative Info 3.5.5.5 Peak Sum 3 5 5 7 x -5 7.4 7.6 7.8 7. 7. 7.4 7.6 7.8 Retention Time, min Qualitative Info 6 5 4 3 Mass Spectrum 4 6 8 4 6
PARAFAC Results for GC MS Data Best match to isobutyl benzene Component Component Component 3.9.9.9.8.7.8.8 MV IBB =43 MV.7 IBB =788.7 MV IBB =9.6.5.4.3.. 4 6 8 4 6.6.5.4.3.. 4 6 8 4 6.6.5.4.3.. 4 6 8 4 6
Summary Mass spectral chromatographic data should be aligned An available alignment algorithm is able to align D and D data F Ratio methods can be used to improve classification PARAFAC can be effectively used to separate overlapped analytes