Functional Data Analysis of MALDI TOF Protein Spectra
Dean Billheimer
dean.billheimer@vanderbilt.edu
Department of Biostatistics, Vanderbilt University
Vanderbilt Ingram Cancer Center
FDA for MALDI TOF MS p.1/43

Outline
- Overview of MALDI TOF Mass Spectrometry
- Characteristics of Spectral Signals
- Standard Analysis and Some Problems
- Analysis of Spectra as Functions
- Analysis of Glioma Proteins
- Extending FDA for Mass Spectra (coming attractions)
- Summary

MALDI TOF Mass Spectrometry
- Emerging as a key technology in proteomics (Nobel prize 2002).
- Proposed for cancer screening, diagnosis, treatment.
- Tremendous promise for protein profiling.
Matrix Assisted Laser Desorption Ionization: method of generating ions from large biomolecules (proteins!)
- Chemical matrix is added to sample to enhance ion formation.
- Pulsed laser light vaporizes/ionizes biomolecules from sample.
- Electric field accelerates ions and directs them into the mass analyzer.
Time Of Flight: separates ions based on size (mass/charge).
- Small molecules are fast: short travel time.
- Large molecules are slow: long travel time.
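
The size separation can be made concrete with the idealized linear time-of-flight relation: an ion of mass m and charge ze accelerated through a potential U gains kinetic energy zeU = mv^2/2, so over a field-free drift length L the flight time is t = L sqrt(m / (2zeU)). The sketch below is not from the talk. The 20 kV voltage and 1 m drift length are illustrative assumptions, and real instruments are calibrated empirically against known masses.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit, kilograms

def flight_time(mass_da, charge=1, voltage=20e3, drift_m=1.0):
    """Idealized flight time (s) of an ion in a linear TOF analyzer.

    voltage and drift_m are illustrative assumptions, not instrument settings.
    """
    mass_kg = mass_da * DALTON
    speed = math.sqrt(2.0 * charge * E_CHARGE * voltage / mass_kg)
    return drift_m / speed

def mass_to_charge(time_s, voltage=20e3, drift_m=1.0):
    """Invert the relation: m/z (Da per unit charge) from a flight time."""
    return 2.0 * E_CHARGE * voltage * (time_s / drift_m) ** 2 / DALTON

t_small = flight_time(6000.0)    # a 6 kDa ion
t_large = flight_time(66000.0)   # a 66 kDa ion arrives later
```

Because flight time grows with the square root of m/z, the 66 kDa ion takes sqrt(11) times as long as the 6 kDa ion under these assumptions.
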

MALDI TOF MS Schematic
[Figure: schematic showing the laser, the sample and matrix, the ion beam, the time-of-flight analyzer, and the detector.]

MALDI-TOF Spectrum - Normal White Matter
[Figure: intensity (0 to 40000) versus mass/charge (10000 to 50000) for one normal white matter spectrum.]

MALDI-TOF Spectra - Normal White Matter
[Figure: intensity (0 to 40000) versus mass/charge (10000 to 50000) for two normal white matter spectra, Normal 1 and Normal 2.]

Pros/Cons of MALDI TOF MS
Advantages
- Can be used for tissue, serum or other biological samples.
- Measures proteins directly.
- Proteins remain intact (vs. other methods).
- Allows measurement of many proteins simultaneously.
Disadvantages
- Signal can be complicated.
- Molecules are identified only by mass/charge.
- Ion detection is mass dependent: 10-fold more efficient at 6 kDa than at 66 kDa.
- Resolution is mass dependent.

Characteristics of Spectral Signals
Fundamental Premise: At a given m/z, the mean intensity is proportional to the relative amount of protein at that m/z. (see graph)
This may be difficult to detect in individual spectra because of nuisance variation:
- sample matrix heterogeneity (intensity)
- chemical noise, protein fragments, salts, fats (baseline)
- detector output characteristics and sensitivity
- other sources of error (noise)
Need good signal normalization! (see graph)

Statistical Issues of MALDI TOF Spectra
- Highly multivariate!
- Structured signal: intensity is a function of mass/charge.
- Variance (and higher moments) related to intensity (and m/z).
- Nuisance variation (for each spectrum)
  - baseline adjustment
  - intensity scaling
- Model identification issues.
- Incidental parameter problem (Neyman and Scott, 1948)

Survey of Standard Analysis of MALDI Spectra
Within each spectrum
- Smoothing ("de-noising") and baseline correction
- Mass assignment (registration, calibration)
- Intensity normalization (nonlinear transformation)
- Peak detection from smoothed spectrum to create a peak list
Across multiple spectra
- Peak binning: identify homologous peaks (nearby m/z values).
- Use binned peak list intensities in a classification/clustering algorithm to segregate (known) biological samples.
- Test classifier on independent data to assess predictive performance.
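
The peak-list steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the software used in practice: `detect_peaks` (local maxima above a fixed threshold) and `bin_peaks` (grouping sorted peaks whose m/z values lie within a tolerance of the previous peak) are hypothetical helpers, and the two toy spectra are made up.

```python
def detect_peaks(mz, y, threshold):
    """Return (m/z, intensity) at local maxima that exceed a threshold."""
    return [(mz[i], y[i]) for i in range(1, len(y) - 1)
            if y[i] > threshold and y[i - 1] < y[i] >= y[i + 1]]

def bin_peaks(peak_lists, tol):
    """Group peaks across spectra: a peak joins the current bin when its
    m/z is within tol of the previous peak in m/z order."""
    tagged = sorted((m, inten, s) for s, peaks in enumerate(peak_lists)
                    for m, inten in peaks)
    bins = []
    for m, inten, s in tagged:
        if bins and m - bins[-1][-1][0] <= tol:
            bins[-1].append((m, inten, s))
        else:
            bins.append([(m, inten, s)])
    return bins

# Toy data: the same peak appears one channel apart in the two spectra.
mz = [2520 + 2 * i for i in range(11)]
spec1 = [5, 6, 40, 7, 5, 6, 5, 30, 6, 5, 5]
spec2 = [5, 5, 6, 35, 6, 5, 5, 5, 6, 5, 5]
peaks = [detect_peaks(mz, s, threshold=20) for s in (spec1, spec2)]
bins = bin_peaks(peaks, tol=2.0)
```

With `tol=2.0` the peaks at 2524 and 2526 share a bin; with `tol=1.0` they are split into separate bins, so the binned peak list depends directly on the accuracy of mass assignment and on the tolerance chosen.
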

Concerns with Standard Analysis
Within a spectrum
- Mass registration is subject to error (magnitude increases with distance from control points).
- Smoothing goals and criteria are unclear (usually done by the software shipped with the spectrometer).
- What is baseline? (how defined?)
- Peak detection: How is a peak defined? Often based on S/N (but both of these change with m/z).
More fundamental concern
- Assumes all relevant information is captured by peak location and intensity.
- Huge data reduction, hence loss of information. (see graph)

More Concerns...
Combine information across multiple spectra
- Errors in peak detection and/or mass assignment lead to binning problems. (see graph)
- Tends to omit small peaks that are consistently expressed.
Classification algorithm
- Ignores the ordering inherent in the data (m/z scale).
- Ignores all inference goals except classification/clustering.
Each step proceeds conditionally on all preceding steps (no acknowledgement of uncertainty).

Brief Introduction to Functional Data Analysis (Ramsay and Silverman, 1997)
Functional data: the fundamental unit of observation is a curve (function)
- patient's hormone profile (through time)
- electrical potential of a neuron measured through time
- spectra (mass, Raman, fluorescence, and otherwise)
IDEA: We are measuring a function (often at discrete sample points), and would like to treat the function as the observation.
ADVANTAGE: We are incorporating into the analysis methods structural constraints (e.g., continuity, smoothness) that are present in the data.

Steps in FDA
- Data representation: convert sample points to functional form
  - select a functional basis (e.g., B-spline, Fourier, wavelet)
  - project sample points onto basis space
  - ensuing calculations involve the basis coefficients
  - same methods as smoothing (but not the goal)
- Data registration or feature alignment
- Data display
- Calculation of summary statistics
- Statistical modeling
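
The data representation step amounts to a least-squares projection of the sampled values onto a chosen basis. The sketch below is a minimal pure-Python illustration, assuming linear B-splines (hat functions) on uniformly spaced knots and a small Gaussian-elimination solver. The helper names, knot positions, and synthetic data are all hypothetical, and a real analysis would use a higher-order B-spline basis from an FDA library.

```python
def hat_basis(t, knots):
    """Evaluate linear B-spline (hat) functions, one per knot, at point t."""
    h = knots[1] - knots[0]          # uniform knot spacing assumed
    return [max(0.0, 1.0 - abs(t - k) / h) for k in knots]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def project(ts, ys, knots):
    """Least-squares coefficients of the samples (ts, ys) on the hat basis."""
    rows = [hat_basis(t, knots) for t in ts]
    k = len(knots)
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(gram, rhs)

def evaluate(coefs, t, knots):
    """Evaluate the fitted function at t."""
    return sum(c * b for c, b in zip(coefs, hat_basis(t, knots)))

knots = [7600.0, 7700.0, 7800.0, 7900.0, 8000.0]
ts = [7600.0 + 10.0 * i for i in range(41)]
ys = [0.01 * (t - 7600.0) + 2.0 for t in ts]   # synthetic linear "spectrum"
coefs = project(ts, ys, knots)
```

After projection, the 41 sampled values are summarized by 5 coefficients, and all later calculations can work on `coefs` rather than on the raw samples.
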
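
Descriptive Statistics
Let $x_i(t)$, $i = 1, \dots, N$, be an observed function.
- The estimated mean function: $\bar{x}(t) = N^{-1} \sum_{i=1}^{N} x_i(t)$
- The estimated variance function: $\mathrm{var}_X(t) = (N-1)^{-1} \sum_{i=1}^{N} [x_i(t) - \bar{x}(t)]^2$
- Covariance and correlation functions:
  $\mathrm{cov}_X(s,t) = (N-1)^{-1} \sum_{i=1}^{N} [x_i(s) - \bar{x}(s)][x_i(t) - \bar{x}(t)]$
  $\mathrm{corr}_X(s,t) = \mathrm{cov}_X(s,t) / \sqrt{\mathrm{var}_X(s)\,\mathrm{var}_X(t)}$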
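
On a common grid of sample points these summaries reduce to column-wise calculations. The following is a small sketch, assuming each curve is stored as a list of values at the same grid points and using the usual N - 1 divisor; the function names and the three toy curves are illustrative only.

```python
import math

def mean_function(curves):
    """Pointwise mean of N curves sampled on a common grid."""
    n = len(curves)
    return [sum(col) / n for col in zip(*curves)]

def cov_function(curves):
    """Sample covariance surface cov(s, t) with divisor N - 1."""
    n = len(curves)
    mean = mean_function(curves)
    resid = [[x - m for x, m in zip(c, mean)] for c in curves]
    p = len(mean)
    return [[sum(r[s] * r[t] for r in resid) / (n - 1) for t in range(p)]
            for s in range(p)]

def corr_function(curves):
    """Correlation surface corr(s, t) = cov(s, t) / sqrt(var(s) var(t))."""
    cov = cov_function(curves)
    sd = [math.sqrt(cov[t][t]) for t in range(len(cov))]
    return [[cov[s][t] / (sd[s] * sd[t]) for t in range(len(cov))]
            for s in range(len(cov))]

curves = [[1.0, 2.0, 9.0],
          [2.0, 4.0, 7.0],
          [3.0, 6.0, 5.0]]
mean = mean_function(curves)
var = [row[i] for i, row in enumerate(cov_function(curves))]   # diagonal of cov
corr = corr_function(curves)
```

In this toy example the first two grid points rise together across curves (correlation 1) while the third falls as the first rises (correlation -1), the kind of structure an autocorrelation plot of spectra displays.
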

A Functional Linear Model
Usual Linear Model: $y = X\beta + \varepsilon$, where $X$ is an $n \times p$ design matrix and $\beta$ is a $p$-vector of unknown coefficients.
The usual parameter estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
In a functional model (FANOVA): $y(t) = X\beta(t) + \varepsilon(t)$, where $y$, $\beta$, and $\varepsilon$ are functions, but $X$ is the same as before.
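
For a one-way layout the functional estimator can be computed pointwise without any matrix algebra: with a cell-means design matrix, the least-squares solution at each t is the vector of group means. The sketch below assumes that coding and also returns the pointwise F statistic. The function name and toy data are hypothetical, and the code assumes nonzero within-group variation at every grid point.

```python
def fanova_one_way(curves, groups):
    """Pointwise one-way functional ANOVA.

    curves: equal-length lists of intensities, one per spectrum.
    groups: group label for each spectrum.
    Returns (group mean functions, pointwise F statistics).
    """
    labels = sorted(set(groups))
    n, g, p = len(curves), len(labels), len(curves[0])
    members = {lab: [c for c, k in zip(curves, groups) if k == lab]
               for lab in labels}
    # Cell-means coding: the least-squares estimate is the group mean at each t.
    means = {lab: [sum(col) / len(cs) for col in zip(*cs)]
             for lab, cs in members.items()}
    grand = [sum(col) / n for col in zip(*curves)]
    f_stat = []
    for t in range(p):
        ss_between = sum(len(members[lab]) * (means[lab][t] - grand[t]) ** 2
                         for lab in labels)
        ss_within = sum((c[t] - means[lab][t]) ** 2
                        for lab in labels for c in members[lab])
        # F on (g - 1, n - g) degrees of freedom at this grid point.
        f_stat.append((ss_between / (g - 1)) / (ss_within / (n - g)))
    return means, f_stat

curves = [[1.0, 5.0], [3.0, 6.0], [5.0, 5.0], [7.0, 6.0]]
groups = ["normal", "normal", "tumor", "tumor"]
means, f_stat = fanova_one_way(curves, groups)
```

Plotting `f_stat` against the grid gives an F curve. With 4 groups and 35 spectra the degrees of freedom would be (3, 31).
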

Basis Function Representation
Represent the observations via basis function expansion:
$y_i(t) = \sum_{k=1}^{K} c_{ik}\,\phi_k(t)$,
where $\phi_k$ are basis functions covering the domain of $t$, and $c_{ik}$ are coefficients.
More compactly, $y = C\Phi$, where $C$ is the matrix of basis function coefficients.
Now the FANOVA estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top C\Phi$.
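
The practical consequence is that the functional regression reduces to an ordinary regression on the basis coefficients. The short derivation below spells this out; the dimensions n, p, K and the symbol B are notation assumed here for illustration.

```latex
\begin{aligned}
\mathbf{y}(t) &= C\,\boldsymbol{\phi}(t),
  \qquad C \in \mathbb{R}^{n \times K},\quad
  \boldsymbol{\phi}(t) = (\phi_1(t), \dots, \phi_K(t))^\top \\
\hat{\boldsymbol{\beta}}(t)
  &= (X^\top X)^{-1} X^\top \mathbf{y}(t)
   = \underbrace{(X^\top X)^{-1} X^\top C}_{B \,\in\, \mathbb{R}^{p \times K}}\;\boldsymbol{\phi}(t)
   = B\,\boldsymbol{\phi}(t)
\end{aligned}
```

Each row of B holds the basis coefficients of one estimated coefficient function, so the estimate is computed once from C and then evaluated at any t through the basis.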

Other (*easy*) Operations in FDA
- Functional principal components analysis
- Functional linear modeling
  - Functional ANOVA: observations and parameters are functions (standard design matrix)
  - Scalar response variable and functional independent variable
  - All model terms are functional
- Functional canonical correlation
- Differential operators and analysis
** Thanks to Jim Ramsay for making available code for FDA.

Glioma Protein Analysis
- Glioma is a type of tumor found in the brain's white matter (infiltrating tumor cells).
- Four stages defined by tissue pathology.
- Stage progression not well understood.
- Compare resected tumor tissue with normal white matter from lobectomy patients.
- Interest in identifying protein markers of stage.

Analysis of Brain Tissue Mass Spectra
- Data from normal and tumor tissue specimens.
- Tissue cross section mounted to MALDI plate (IMS prep).
- Mass (per charge) range from 2000 to 50000 Da/z.
- Focus on limited mass range: 7600 to 8000 Da/z.
- 35 patients (7 normal, 8 grade II, 9 grade III, 11 grade IV).
- Use B-spline basis with 120 basis functions.
Thanks to Sarah Schwarz in Vanderbilt MSRC for providing data.

Spectrum Normalization
- Piecewise linear baseline correction
- Scaling by regression against standard spectrum
- Global Box-Cox transformation based on sampling replicate spectra
The normalized intensity is obtained from the raw intensity using three quantities: a baseline correction, a scaling coefficient, and the Box-Cox parameter (held fixed in the following analysis).
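
The three normalization steps can be sketched as follows. The order of operations, the through-origin regression used for scaling, the baseline anchor points, the reference spectrum, and the Box-Cox parameter value are all assumptions made for illustration rather than the talk's formula or settings.

```python
import math

def subtract_baseline(mz, y, anchors):
    """Subtract a piecewise linear baseline through (m/z, level) anchors."""
    out = []
    for m, v in zip(mz, y):
        for (m0, b0), (m1, b1) in zip(anchors, anchors[1:]):
            if m0 <= m <= m1:
                base = b0 + (b1 - b0) * (m - m0) / (m1 - m0)
                break
        else:
            raise ValueError("m/z outside anchor range")
        out.append(v - base)
    return out

def scale_to_standard(y, standard):
    """Divide by the slope of a through-origin regression of y on standard."""
    slope = sum(a * b for a, b in zip(y, standard)) / sum(b * b for b in standard)
    return [v / slope for v in y]

def box_cox(y, lam):
    """Box-Cox transform of positive values; the log is the lam -> 0 limit."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

mz = [7600.0, 7700.0, 7800.0, 7900.0, 8000.0]
raw = [14.0, 22.0, 50.0, 18.0, 6.0]
anchors = [(7600.0, 10.0), (8000.0, 2.0)]   # hypothetical baseline anchors
standard = [2.0, 7.0, 22.0, 7.0, 2.0]        # hypothetical reference spectrum
corrected = subtract_baseline(mz, raw, anchors)
scaled = scale_to_standard(corrected, standard)
normalized = box_cox(scaled, lam=0.5)        # lam chosen only for illustration
```

Here the raw spectrum is exactly twice the reference after baseline removal, so scaling recovers the reference before the power transform is applied. Box-Cox requires positive values, so baseline-corrected intensities at or below zero would need separate handling.
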

Autocorrelation of Spectra
[Figure: autocorrelation of spectra, with mass/charge from 7600 to 8000 on both axes and a correlation scale from -0.5 to 1.0.]

Functional Analysis of Variance
[Figure: pointwise F statistic on (3, 31) degrees of freedom versus mass/charge (7600 to 8000), with reference levels labeled 0.05, 0.01, and 0.001.]

Group Means
[Figure: group mean normalized intensity versus mass/charge (7600 to 8000) for Normal, Grade 2, Grade 3, and Grade 4, with levels labeled 0.01 and 0.001.]

Key Points from Glioma Protein Spectra Analysis
- Identify regions exhibiting differential protein expression.
- Some of these regions would be difficult to find via peak selection.
- Autocorrelation plot suggests method for identifying different forms of a single protein.

Next New Thing
Currently the following steps are performed sequentially:
1. smooth (or de-noise) spectrum
2. estimate and remove baseline
3. normalize
4. peak selection
5. do actual analysis
Each step depends on all preceding steps:
- any error is propagated forward
- any uncertainty is ignored
Instead, try simultaneous modeling of the (believed) components of spectra.

Spectrum Decomposition
[Figure: a spectrum decomposed into baseline, group-specific signal, and spectrum-specific components.]

Spectrum Decomposition via Bayesian Inference
Baseline
- nuisance background (in each spectrum)
- smooth
- monotone non-increasing
- non-negative
Group Specific Signal
- peaks common to a group of interest
- combine information across multiple spectra
- non-negative: represent peaks when present, zero otherwise
Spectrum Specific Signal
- subject- or spectrum-specific unexplained variation
- no substantial prior information
- to aid identification, may prefer mean zero for each spectrum
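
One way to write the three components and their constraints as a single model is given below. The additive form, the indices (i for spectrum, j for group), the symbols b, g, s, and the error term are assumptions introduced for illustration; the talk does not state the model in this notation.

```latex
\begin{aligned}
y_{ij}(t) &= b_{ij}(t) + g_j(t) + s_{ij}(t) + \varepsilon_{ij}(t), \\
b_{ij}(t) &\ge 0, \quad b_{ij}(t_1) \ge b_{ij}(t_2) \ \text{for } t_1 < t_2
  && \text{(smooth, monotone non-increasing baseline)} \\
g_j(t) &\ge 0
  && \text{(group-specific peaks, zero where no peak is present)} \\
\int s_{ij}(t)\,dt &= 0
  && \text{(spectrum-specific signal, mean zero within each spectrum)}
\end{aligned}
```

In a Bayesian treatment the constraints would enter through the priors on b, g, and s, and all components would be estimated jointly rather than in sequence.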

MCMC Baseline Estimate of Mass Frauda
[Figure: MCMC baseline estimate, y (0 to 12) versus x (0 to 150).]

Peaks and Spectrum Effects
[Figure: baseline corrected signal estimate for MS Frauda, y (0 to 10) versus x (0 to 150).]

Corrected Signal with Peaks
[Figure: corrected signal with peaks, y (0 to 10) versus x (0 to 150).]

Parallel Approaches to Inference
- VAMPIRE: cluster of 110 Linux-based processors (Beowulf)
- Currently: "embarrassingly parallel" problems
  - Code: combination of C, R, and job scheduling languages
  - Point-wise mixed-model analysis (Bayesian inference, using MCMC)
- Next steps:
  - combine FDA with component-wise Bayesian model
  - implement ScaLAPACK behind language
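
A point-wise analysis is embarrassingly parallel because each m/z value can be processed without reference to any other. The sketch below shows only that structure: `analyze_point` is a hypothetical stand-in (a mean and a range) for the per-point mixed-model MCMC fit, which is not reproduced here, and a thread pool stands in for the cluster's job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_point(intensities):
    """Stand-in for the per-m/z analysis: returns (mean, range).

    The real per-point step would be a model fit; this placeholder only
    demonstrates that each point is handled independently.
    """
    return (sum(intensities) / len(intensities),
            max(intensities) - min(intensities))

def pointwise_analysis(spectra, workers=4):
    """Apply analyze_point independently at every m/z index."""
    columns = list(zip(*spectra))        # one tuple of intensities per m/z
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_point, columns))   # input order preserved

spectra = [[1.0, 4.0, 2.0],
           [3.0, 4.0, 6.0]]
results = pointwise_analysis(spectra)
```

Threads in CPython do not speed up CPU-bound work, so a real run would distribute the columns across processes or cluster jobs; the independence of the per-point tasks is what makes that distribution straightforward.
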

Summary
- Protein analysis by MS has tremendous potential for cancer screening, diagnosis, and treatment.
- Functional data approach is a natural fit to MS data.
  - identified expression differences that would be difficult to find with peak detection approaches
  - inference limitations
  - computational challenges
- Good normalization is key to quantitative analysis.
  - Theory of Normalization (w/ B. LaFleur)
- Proteomics = Proteo-metrics
  - All problems reduce to quantitation.
  - Adherence to statistical principles is important!
dean.billheimer@vanderbilt.edu

Quantitation of MALDI Spectra
[Figure: MALDI TOF MS calibration experiment (Bucknall, et al. 2002). Peak intensity ratio (0 to 3.0) versus concentration of rat met GH (50 to 200 nmol); fitted line with slope 1.17 and intercept of magnitude 0.14, r = 0.998.]

Unnormalized MALDI Spectra
[Figure: MALDI TOF MS calibration experiment with no normalization (Bucknall, et al. 2002). Peak intensity (0 to 25000) versus concentration of rat met GH (50 to 200 nmol); fitted line y = 65.77x + 755.07, r = 0.83.]

Spectrum 1
[Figure: intensity (0 to 700) versus mass/charge (2520 to 2600).]

Spectrum 1 with Peak Detection
[Figure: intensity (0 to 700) versus mass/charge (2520 to 2600), with detected peaks marked.]

Spectrum 1 Peaks Only
[Figure: detected peaks only, intensity (0 to 700) versus mass/charge (2520 to 2600).]

Spectrum 1
[Figure: intensity (0 to 700) versus mass/charge (2520 to 2600).]

Spectrum 1 with Peak Detection
[Figure: intensity (0 to 700) versus mass/charge (2520 to 2600), with detected peaks marked.]

Spectrum 2
[Figure: intensity (0 to 600) versus mass/charge (2520 to 2600).]

Spectrum 2 with Peak Detection
[Figure: intensity (0 to 600) versus mass/charge (2520 to 2600), with detected peaks marked.]

Peaks from Spectra 1 and 2
[Figure: detected peaks from both spectra, intensity (0 to 600) versus mass/charge (2520 to 2600).]