Functional Data Analysis of MALDI TOF Protein Spectra


1 Functional Data Analysis of MALDI TOF Protein Spectra
Dean Billheimer
Department of Biostatistics, Vanderbilt University
Vanderbilt Ingram Cancer Center
FDA for MALDI TOF MS p.1/43
2 Outline
Overview of MALDI TOF Mass Spectrometry
Characteristics of Spectral Signals
Standard Analysis and Some Problems
Analysis of Spectra as Functions
Analysis of Glioma Proteins
Extending FDA for Mass Spectra (coming attractions)
Summary
3 MALDI TOF Mass Spectrometry
Emerging as a key technology in proteomics (Nobel prize 2002).
Proposed for cancer screening, diagnosis, treatment.
Tremendous promise for protein profiling.
Matrix Assisted Laser Desorption Ionization: a method of generating ions from large biomolecules (proteins!).
  Chemical matrix is added to sample to enhance ion formation.
  Pulsed laser light vaporizes/ionizes biomolecules from sample.
  Electric field accelerates ions and directs them into the mass analyzer.
Time Of Flight: separates ions based on size (mass/charge).
  Small molecules are fast (short travel time).
  Large molecules are slow (long travel time).
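The dependence of travel time on mass/charge can be illustrated with the textbook idealization of a linear TOF analyzer: an ion of charge z accelerated through a potential U gains kinetic energy z·e·U, so its flight time over a field-free tube of length L is t = L·sqrt(m / (2·z·e·U)). The sketch below is only an illustration; the tube length and accelerating voltage are assumed values, not settings from the talk.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da, charge=1, tube_length_m=1.0, accel_voltage=20_000.0):
    """Idealized linear TOF: z*e*U = m*v**2 / 2, hence t = L / v.

    tube_length_m and accel_voltage are hypothetical instrument settings.
    """
    mass_kg = mass_da * DALTON
    energy_j = charge * ELEMENTARY_CHARGE * accel_voltage
    velocity = math.sqrt(2.0 * energy_j / mass_kg)
    return tube_length_m / velocity


# Singly charged ions at the two masses mentioned in the talk (6 kDa and 66 kDa).
t_small = flight_time(6_000)
t_large = flight_time(66_000)
ratio = t_large / t_small  # equals sqrt(66 / 6) under this idealization
```

Under these assumptions flight time grows with the square root of mass/charge, which is why equally spaced time samples are not equally spaced on the mass/charge axis.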
4 MALDI TOF MS Schematic
[Figure: laser, sample and matrix, ion beam, time-of-flight analyzer, detector]
5 MALDI TOF Spectrum: Normal White Matter
[Figure: intensity versus mass/charge]
6 MALDI TOF Spectra: Normal White Matter
[Figure: intensity versus mass/charge for several normal white matter spectra]
7 Pros/Cons of MALDI TOF MS
Advantages:
  Can be used for tissue, serum or other biological samples.
  Measures proteins directly. Proteins remain intact (vs. other methods).
  Allows measurement of many proteins simultaneously.
Disadvantages:
  Signal can be complicated.
  Molecules are identified only by mass/charge.
  Ion detection is mass dependent: 10-fold more efficient at 6 kDa than at 66 kDa.
  Resolution is mass dependent.
8 Characteristics of Spectral Signals
Fundamental Premise: At a given mass/charge, the mean intensity is proportional to the relative amount of protein at that mass/charge. (see graph)
This may be difficult to detect in individual spectra because of nuisance variation:
  sample matrix heterogeneity (intensity)
  chemical noise, protein fragments, salts, fats (baseline)
  detector output characteristics and sensitivity
  other sources of error (noise)
Need good signal normalization! (see graph)
9 Statistical Issues of MALDI TOF Spectra
Highly multivariate!
Structured signal: intensity is a function of mass/charge.
Variance (and higher moments) related to intensity (and mass/charge).
Nuisance variation (for each spectrum):
  baseline adjustment
  intensity scaling
Model identification issues.
Incidental parameter problem (Neyman and Scott, 1948).
10 Survey of Standard Analysis of MALDI Spectra
Within each spectrum:
  Smoothing ("de-noising") and baseline correction
  Mass assignment (registration, calibration)
  Intensity normalization (nonlinear transformation)
  Peak detection from smoothed spectrum to create a "peak list"
Across multiple spectra:
  Peak binning: identify homologous peaks (nearby mass/charge values)
  Use binned peak list intensities in a classification/clustering algorithm to segregate (known) biological samples
  Test classifier on independent data to assess predictive performance
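The peak detection and peak binning steps of the standard pipeline can be sketched as follows. This is a minimal illustration on synthetic data, not the vendor software or the procedure used in the talk; the prominence threshold, the binning tolerance, and the helper name bin_peaks are assumptions made for the example.

```python
import numpy as np
from scipy.signal import find_peaks


def bin_peaks(peak_mz, tol):
    """Group peak locations whose gaps are at most `tol`; return bin centers.

    Hypothetical helper: a new bin starts whenever the gap to the previous
    peak location exceeds `tol`.
    """
    ordered = np.sort(np.asarray(peak_mz, dtype=float))
    if ordered.size == 0:
        return []
    bins = [[ordered[0]]]
    for value in ordered[1:]:
        if value - bins[-1][-1] <= tol:
            bins[-1].append(value)
        else:
            bins.append([value])
    return [float(np.mean(b)) for b in bins]


# Synthetic spectrum on a 1 Da/z grid with two Gaussian peaks (assumed shapes).
mz = np.linspace(7600.0, 8000.0, 401)
intensity = (np.exp(-0.5 * ((mz - 7700.0) / 5.0) ** 2)
             + 0.5 * np.exp(-0.5 * ((mz - 7900.0) / 5.0) ** 2))

# Peak list for one spectrum: indices of local maxima with enough prominence.
peak_idx, _ = find_peaks(intensity, prominence=0.1)

# Peak lists pooled from several spectra, then binned across spectra.
pooled = [7700.0, 7700.4, 7899.8, 7900.0, 7950.0]
centers = bin_peaks(pooled, tol=1.0)
```

The binning tolerance is exactly the kind of tuning choice the following slides worry about: a small mass assignment error can split one homologous peak into two bins or merge two distinct peaks into one.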
11 Concerns with Standard Analysis
Within a spectrum:
  Mass registration is subject to error (magnitude increases with distance from control points).
  Smoothing goals and criteria are unclear (usually done by the software shipped with the spectrometer).
  What is baseline? (How is it defined?)
Peak detection:
  How is a peak defined? Often based on S/N (but both of these change with mass/charge).
More fundamental concern: assumes all relevant information is captured by peak location and intensity (huge data reduction, loss of information). (see graph)
12 More Concerns...
Combine information across multiple spectra:
  Errors in peak detection and/or mass assignment lead to binning problems. (see graph)
  Tends to omit small peaks that are consistently expressed.
Classification algorithm:
  Ignores the ordering inherent in the data (mass/charge scale).
  Ignores all inference goals except classification/clustering.
Each step proceeds conditionally on all preceding steps (no acknowledgement of uncertainty).
13 Brief Introduction to Functional Data Analysis
(Ramsay and Silverman, 1997)
Functional data: the fundamental unit of observation is a curve (function).
  patient's hormone profile (through time)
  electrical potential of a neuron measured through time
  spectra (mass, Raman, fluorescence, and otherwise)
IDEA: We are measuring a function (often at discrete sample points), and would like to treat the function as the observation.
ADVANTAGE: We are incorporating into the analysis methods structural constraints (e.g., continuity, smoothness) that are present in the data.
14 Steps in FDA
Data representation: convert sample points to functional form
  select a functional basis (e.g., B-spline, Fourier, wavelet)
  project sample points onto basis space
  ensuing calculations involve the basis coefficients
  same methods as smoothing (but not the goal)
Data registration or feature alignment
Data display
Calculation of summary statistics
Statistical modeling
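The data representation step can be sketched with a least-squares projection of one sampled spectrum onto a cubic B-spline basis. This is an illustration only: it uses SciPy's make_lsq_spline rather than the FDA code credited in the talk, and the synthetic spectrum, the number of basis functions, and the equally spaced knots are assumptions.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline


def project_onto_bspline(mz, intensity, n_basis=20, degree=3):
    """Least-squares fit of sampled points onto n_basis B-splines.

    Knot placement (equally spaced interior knots) is an assumption made
    for this sketch.
    """
    n_interior = n_basis - degree - 1
    interior = np.linspace(mz[0], mz[-1], n_interior + 2)[1:-1]
    # Full knot vector: each boundary knot repeated degree + 1 times.
    knots = np.r_[[mz[0]] * (degree + 1), interior, [mz[-1]] * (degree + 1)]
    return make_lsq_spline(mz, intensity, knots, k=degree)


# Synthetic "spectrum": one broad peak plus noise on the 7600 to 8000 Da/z window.
rng = np.random.default_rng(0)
mz = np.linspace(7600.0, 8000.0, 400)
truth = np.exp(-0.5 * ((mz - 7800.0) / 40.0) ** 2)
noisy = truth + rng.normal(scale=0.02, size=mz.size)

spline = project_onto_bspline(mz, noisy, n_basis=20)
coefficients = spline.c   # basis coefficients used in later calculations
fitted = spline(mz)       # functional form evaluated back on the grid
```

After this step each spectrum is summarized by its coefficient vector, and the later calculations operate on those coefficients rather than on the raw sample points.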
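15 Descriptive Statistics
Let $x_i(t)$, $i = 1, \ldots, N$, be an observed function.
The estimated mean function:
$$\bar{x}(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t)$$
The estimated variance function:
$$\mathrm{var}_X(t) = \frac{1}{N-1} \sum_{i=1}^{N} \left[ x_i(t) - \bar{x}(t) \right]^2$$
Covariance and correlation functions:
$$\mathrm{cov}_X(s, t) = \frac{1}{N-1} \sum_{i=1}^{N} \left[ x_i(s) - \bar{x}(s) \right] \left[ x_i(t) - \bar{x}(t) \right]$$
$$\mathrm{corr}_X(s, t) = \frac{\mathrm{cov}_X(s, t)}{\sqrt{\mathrm{var}_X(s)\, \mathrm{var}_X(t)}}$$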
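When the curves are sampled on a common grid, the mean, variance, covariance, and correlation functions reduce to pointwise array operations. The sketch below assumes an N-by-T array with one sampled curve per row and a divisor of N - 1; the random input is only a stand-in for real spectra.

```python
import numpy as np


def functional_summaries(curves):
    """Pointwise summaries for curves sampled on a shared grid.

    curves: array of shape (N, T), one observed function per row.
    Returns the mean function, variance function, covariance surface,
    and correlation surface.
    """
    n_curves = curves.shape[0]
    mean = curves.mean(axis=0)
    centered = curves - mean
    cov = centered.T @ centered / (n_curves - 1)   # cov[s, t]
    var = np.diag(cov)                             # var[t] = cov[t, t]
    corr = cov / np.sqrt(np.outer(var, var))
    return mean, var, cov, corr


rng = np.random.default_rng(0)
curves = rng.normal(size=(10, 5))   # stand-in data: 10 curves, 5 grid points
mean, var, cov, corr = functional_summaries(curves)
```

The correlation surface computed this way is the kind of quantity displayed in an autocorrelation plot of spectra.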
16 A Functional Linear Model
Usual Linear Model:
$$y = X\beta + \epsilon$$
where $X$ is an $N \times p$ design matrix and $\beta$ is a vector of unknown coefficients. The usual parameter estimator is
$$\hat{\beta} = (X'X)^{-1} X'y$$
In a functional model (FANOVA):
$$y(t) = X\beta(t) + \epsilon(t)$$
where $y$, $\beta$, and $\epsilon$ are functions, but $X$ is the same as before.
17 Basis Function Representation
Represent the observations via basis function expansion
$$y_i(t) = \sum_{k=1}^{K} c_{ik}\, \phi_k(t)$$
where $\phi_k$ are basis functions covering the range of $t$, and $c_{ik}$ are coefficients. More compactly,
$$y(t) = C\phi(t)$$
where $C$ is the matrix of basis function coefficients. Now the FANOVA estimator is
$$\hat{\beta}(t) = (X'X)^{-1} X' C \phi(t)$$
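Because the design matrix is the same as in the usual linear model, the FANOVA estimator can be computed by ordinary least squares applied to the basis coefficient matrix. The sketch below assumes a cell-means coding (one indicator column per group) and uses random numbers in place of real coefficients; the group sizes 7, 8, 9, 11 and the 120 basis functions match the glioma example, but everything else is hypothetical.

```python
import numpy as np


def fanova_coefficients(design, coef_matrix):
    """Solve for (X'X)^{-1} X' C by least squares.

    design:      (N, p) design matrix X
    coef_matrix: (N, K) basis coefficients C, one row per observed curve
    Returns a (p, K) array; row j holds the basis coefficients of beta_j(t).
    """
    beta_coef, *_ = np.linalg.lstsq(design, coef_matrix, rcond=None)
    return beta_coef


# 35 curves in four groups, each curve represented by 120 basis coefficients.
groups = np.repeat(np.arange(4), [7, 8, 9, 11])
design = np.eye(4)[groups]               # assumed cell-means coding
rng = np.random.default_rng(0)
coef_matrix = rng.normal(size=(35, 120)) # stand-in for fitted coefficients

beta_coef = fanova_coefficients(design, coef_matrix)
```

With this coding each row of beta_coef is simply the average coefficient vector of one group; multiplying by the basis functions evaluated on a grid gives the estimated group mean functions.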
18 Other (* easy *) Operations in FDA
Functional principal components analysis
Functional linear modeling:
  Functional ANOVA: observations and parameters are functions (standard design matrix)
  Scalar response variable and functional independent variable
  All model terms are functional
Functional canonical correlation
Differential operators and analysis
** Thanks to Jim Ramsay for making available code for FDA.
19 Glioma Protein Analysis
Glioma is a type of tumor found in the brain's white matter (infiltrating tumor cells).
Four stages defined by tissue pathology.
Stage progression not well understood.
Compare resected tumor tissue with normal white matter from lobectomy patients.
Interest in identifying protein markers of stage.
20 Analysis of Brain Tissue Mass Spectra
Data from normal and tumor tissue specimens.
Tissue cross section mounted to MALDI plate (IMS prep).
Mass (per charge) range from 2000 to [upper limit missing] Da/z.
Focus on limited mass range: 7600 to 8000 Da/z.
35 patients (7 normal, 8 grade II, 9 grade III, 11 grade IV).
Use B-spline basis with 120 basis functions ([count missing] data values).
Thanks to Sarah Schwarz in Vanderbilt MSRC for providing data.
21 Spectrum Normalization
Piecewise linear baseline correction.
Scaling by regression against standard spectrum.
Global Box-Cox transformation based on sampling replicate spectra.
[Equation: normalized spectrum expressed in terms of a baseline correction, a scaling coefficient, and the Box-Cox parameter, which is held at a fixed value in the following analysis]
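The three normalization steps can be sketched as below. The exact formula from the slide is not recoverable, so every detail here is an assumption: the baseline is interpolated through the minimum of each of a few equal-width segments, the scaling coefficient is the slope of a no-intercept regression on the standard spectrum, and the Box-Cox transform uses its usual definition. The function names are hypothetical.

```python
import numpy as np


def piecewise_linear_baseline(mz, intensity, n_segments=8):
    """Assumed baseline rule: connect the minimum of each segment linearly."""
    edges = np.linspace(0, len(mz), n_segments + 1).astype(int)
    anchors = np.array([lo + np.argmin(intensity[lo:hi])
                        for lo, hi in zip(edges[:-1], edges[1:])])
    return np.interp(mz, mz[anchors], intensity[anchors])


def scale_to_standard(spectrum, standard):
    """Divide by the slope of a through-the-origin regression on `standard`."""
    slope = np.dot(standard, spectrum) / np.dot(standard, standard)
    return spectrum / slope


def box_cox(values, lam):
    """Box-Cox transform of positive values; lam = 0 gives the logarithm."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam


def normalize(mz, intensity, standard, lam, floor=1e-6):
    """Baseline-correct, scale, then transform (order assumed for the sketch)."""
    corrected = intensity - piecewise_linear_baseline(mz, intensity)
    scaled = scale_to_standard(corrected, standard)
    # Clip to a small positive floor so the power transform is defined.
    return box_cox(np.clip(scaled, floor, None), lam)
```

A single global Box-Cox parameter, as the slide describes, means `lam` is chosen once from replicate spectra and then shared by all spectra, rather than tuned per spectrum.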
22 Autocorrelation of Spectra
[Figure: autocorrelation of spectra]
23 Functional Analysis of Variance
[Figure: F statistic (3, 31) versus mass/charge]
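An F statistic curve with (3, 31) degrees of freedom is what a one-way ANOVA gives for 35 curves in 4 groups, evaluated at each mass/charge value. The sketch below computes such a pointwise F curve on stand-in data; it is a plain pointwise calculation, not necessarily how the plotted curve was produced, and the 5% reference level ignores the multiplicity of testing along the whole curve.

```python
import numpy as np
from scipy.stats import f as f_dist


def pointwise_f(curves, groups):
    """One-way ANOVA F statistic at every grid point.

    curves: (N, T) array of sampled curves; groups: length-N label array.
    Returns the F curve and its degrees of freedom (k - 1, N - k).
    """
    labels = np.unique(groups)
    n_curves, n_groups = curves.shape[0], len(labels)
    grand_mean = curves.mean(axis=0)
    ss_between = np.zeros(curves.shape[1])
    ss_within = np.zeros(curves.shape[1])
    for label in labels:
        block = curves[groups == label]
        group_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean) ** 2).sum(axis=0)
    df1, df2 = n_groups - 1, n_curves - n_groups
    return (ss_between / df1) / (ss_within / df2), (df1, df2)


# Group sizes from the glioma data: 7 normal, 8 grade II, 9 grade III, 11 grade IV.
groups = np.repeat(np.arange(4), [7, 8, 9, 11])
rng = np.random.default_rng(1)
curves = rng.normal(size=(35, 50))   # stand-in for normalized spectra

f_curve, dof = pointwise_f(curves, groups)
critical = f_dist.ppf(0.95, *dof)    # pointwise 5% reference level
```

Regions where the F curve rises well above the reference level are candidates for differential expression, including regions that contain no distinct peak.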
24 Group Means
[Figure: normalized intensity versus mass/charge for the Normal, Grade 2, Grade 3, and Grade 4 groups]
25 Key Points from Glioma Protein Spectra Analysis
Identify regions exhibiting differential protein expression.
Some of these regions would be difficult to find via peak selection.
Autocorrelation plot suggests method for identifying different forms of a single protein.
26 Next New Thing
Currently the following steps are performed sequentially:
  1. smooth (or de-noise) spectrum
  2. estimate and remove baseline
  3. normalize
  4. peak selection
  5. do actual analysis
Each step depends on all preceding steps:
  any error is propagated forward
  any uncertainty is ignored
Instead, try simultaneous modeling of the (believed) components of spectra.
27 Spectrum Decomposition
[Figure: a spectrum decomposed into baseline, group-specific signal, and spectrum-specific signal]
28 Spectrum Decomposition via Bayesian Inference
Baseline: nuisance background (in each spectrum)
  smoooooth
  monotone non-increasing
  non-negative
Group-Specific Signal: peaks common to a group of interest
  combine information across multiple spectra
  non-negative
  represent peaks when present, zero otherwise
Spectrum-Specific Signal: subject- or spectrum-specific unexplained variation
  no substantial prior information
  to aid identification, may prefer mean zero for each spectrum
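The constraints on the three components can be made concrete by simulating spectra that satisfy them. The sketch below only generates data with an additive structure (baseline plus group-specific signal plus spectrum-specific term); it does not perform the Bayesian inference, and the additive form, the exponential baseline, the Gaussian peak, and the noise level are all assumptions for illustration.

```python
import numpy as np


def simulate_spectra(n_spectra=5, n_points=200, seed=0):
    """Simulate spectra as baseline + group signal + spectrum-specific term."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, n_points)

    # Baseline: smooth, monotone non-increasing, non-negative, one per spectrum.
    scale = rng.uniform(0.5, 1.5, size=(n_spectra, 1))
    baseline = scale * np.exp(-3.0 * grid)

    # Group-specific signal: a non-negative peak shared by the whole group,
    # exactly zero away from the peak.
    peak = np.exp(-0.5 * ((grid - 0.6) / 0.02) ** 2)
    group_signal = np.where(peak > 1e-3, peak, 0.0)

    # Spectrum-specific term: unexplained variation, centered per spectrum.
    specific = rng.normal(scale=0.05, size=(n_spectra, n_points))
    specific -= specific.mean(axis=1, keepdims=True)

    spectra = baseline + group_signal + specific
    return grid, baseline, group_signal, specific, spectra


grid, baseline, group_signal, specific, spectra = simulate_spectra()
```

Simulated data of this kind is one way to check whether a fitting procedure can separate the components, since the true baseline and group signal are known.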
29 MCMC Baseline Estimate of Mass Frauda
[Figure: axes x and y]
30 Peaks and Spectrum Effects
[Figure: Baseline Corrected Signal Estimate for MS Frauda; axes x and y]
31 Corrected Signal with Peaks
[Figure: axes x and y]
32 Parallel Approaches to Inference
VAMPIRE: cluster of 110 Linux-based processors (Beowulf)
Currently: "embarrassingly parallel" problems
  Code: combination of C, R, and job scheduling languages
  Pointwise mixed-model analysis (Bayesian inference, using MCMC)
Next steps:
  combine FDA with componentwise Bayesian model
  implement ScaLAPACK behind [language name missing] language
33 Summary
Protein analysis by MS has tremendous potential for cancer screening, diagnosis, and treatment.
Functional data approach is a natural fit to MS data:
  identified expression differences that would be difficult to find with peak detection approaches
  inference limitations
  computational challenges
Good normalization is key to quantitative analysis.
  Theory of Normalization (w/ B. LaFleur)
Proteomics = Proteo-metrics
  All problems reduce to quantitation.
  Adherence to statistical principles is important!
34 Quantitation of MALDI Spectra
MALDI TOF MS Calibration Experiment (Bucknall, et al. 2002)
[Figure: peak intensity ratio versus concentration of rat met GH (nmol); fitted-line annotation as transcribed: y = 1.17x 0.14, r value missing]
35 Unnormalized MALDI Spectra
MALDI TOF MS Calibration Experiment, No Normalization (Bucknall, et al. 2002)
[Figure: peak intensity versus concentration of rat met GH (nmol); fitted-line annotation as transcribed: y = 65.77x, r = 0.83]
36 Spectrum 1
[Figure: intensity versus mass/charge]
37 Spectrum 1 with Peak Detection
[Figure: intensity versus mass/charge]
38 Spectrum 1 Peaks Only
[Figure: intensity versus mass/charge]
39 Spectrum 1
[Figure: intensity versus mass/charge]
40 Spectrum 1 with Peak Detection
[Figure: intensity versus mass/charge]
41 Spectrum 2
[Figure: intensity versus mass/charge]
42 Spectrum 2 with Peak Detection
[Figure: intensity versus mass/charge]
43 Peaks from Spectra 1 and 2
[Figure: intensity versus mass/charge]