Statistical Analysis Strategies for Shotgun Proteomics Data


 Julius Hoover
 1 years ago
 Views:
Transcription
1 Statistical Analysis Strategies for Shotgun Proteomics Data Ming Li, Ph.D. Cancer Biostatistics Center Vanderbilt University Medical Center
2 Ayers Institute Biomarker Pipeline normal shotgun proteome analysis compare inventories, identify differences cancer no vs. targeted quantitative analysis yes Labelfree MRM in tissue Is it really there? Is there a difference? LCMRMMS method or Y Y ELISA vs. evaluate in clinical context sensitivity specificity performance metrics Discovery few(<20) samples or sample pools low throughput identify up to ~5,000 proteins inventory differences ~150 proteins Verification ~ samples higher throughput screen candidates identify high priority candidates Validation hundreds+ of samples highest throughput validate 110 candidates clinical validation trial
3 The Big Picture Proteomics Tools Mass Spectrometry Database Bioinformatics Proteinseparation Techniques Research Goal Biomarker Discovery High Throughput Data (MALDI MS, Shotgun, etc) Challenge on Data Analysis
4 Outline Background Shotgun Proteomics Techniques and Data The Challenges for Statistical Analysis Statistical Analysis Strategies Methods and Models Case Study and Preliminary Results Discussion and Future Work
5 Background on Shotgun Proteomics What is Shotgun Proteomics? A method of identifying proteins in complex mixtures using a certain separation/digestion method combined with mass spectrometry. More specifically, the proteins in the mixture are digested and the resulting peptides are separated (by HPLC, SCX, IEF, etc), tandem mass spectrometry (MS/MS) is then used to identify the peptides (LC MS/MS) and matched back to proteins.
6 General Flow in Proteomic Analysis Proteins SDSPAGE or Prep IEF or HPLC Digest Reverse phase LC Tandem LC Protein Fraction 1 Peptide Fraction 1 data Protein Fraction 2 Peptide Fraction 2 Protein Fraction 3 Peptide Fraction 3 MS/MS Analysis Database search algorithms Protein Fraction n Peptide Fraction n Identification From Liebler, Introduction to Proteomics, 2002.
7 Shotgun 101 Design Analysis Integration Nesvizhskii et al.
8 LC separation MS MSMS fragmentation m/z 2 4 m/z m/z 3 1 m/z m/z
9 dm_200392_tpepc_std #587 RT: AV: 1 NL: 1.95E7 T: + c d Full ms2 [ ] b 2 + H 3 N b 1 b 2 b 3 b 4 b 5 b 6 b 7 b A V A G C A G A R O HN O NH O NH O HS NH O NH O NH O H N + H 3 N NH O N H HN CO 2 H y y 6 Relative Abundance y 8 y 7 y 6 y 5 y 4 y 3 y 2 y 1 y b b 3 b b 6 y y b 7 b m/z y 4 m/z
10 dm_200392_tpepc_std #587 RT: AV: 1 NL: 1.95E7 T: + c d Full ms2 [ ] Relative Abundance m/z Actual MSMS scan Precursor peptide [M+H] + = Get database sequences that match precursor peptide mass AVAGCAGAR CVAAGAAGR VGGACAAAR etc Generate virtual MSMS spectra AVAGCAGAR CVAAGAAGR VGGACAAAR b2 y2b3 y3 b4 y4 b5 y5 b6 y6 b2 y2b3y3 b4 y4 b5 y5 b6 y6 b2 y2 b3 y3 b4 y4 b5 y5 b6 y6 Compare virtual spectra to real spectrum dm_200392_tpepc_std #587 RT: AV: 1 NL: 1.95E7 T: + c d Full ms2 [ ] Scoring Peptide score Relative Abundance m/z  Detect matches between theoretical b and yions and actual spectrum ions  Compute correlation scores  Rank hits AVAGCAGAR 2.56 CVAAGAAGR 0.57 VGGACAAAR 0.32
11 From Shotgun Proteomics to Biomarker Discovery A Protein/Peptide FrequencyBased Analysis Approach Compare spectral identifications between groups; The unit of measurement is the number of times a peptide observed in a single LCMS/MS round of analysis that are matched to the peptide sequences; The counts reflect the abundance of the protein from which these peptides are derived.
12 Shotgun Data and Statistical Challenge Normal Cancer Protein Sample 1 Sample m Sample 1 Sample m # # # # : : : : : : : : #N : : : : : :
13 Statistical Analysis Goal Provide investigators a winner list of proteins for further study; The winners are the potential biomarkers of diagnostics, prognostics or therapeutic; The winners are selected by appropriate statistical models and procedures.
14 Statistical Analysis Strategy Model the Count Data Poisson Regression Model Quasilikelihood Poisson Model (GEE method) Rate Model (Poisson Model with Offset) Handle Small Sample Size Provide Appropriate Test Statistics Permutation Test Deal with Multiple Comparison Issues Frequentist Approach (FDR) Empirical Bayes Approach (LOCFDR)
15 Poisson Model for Count Data Poisson Regression Model (Poisson GLM) Y is Poisson random variable with mean u, then P( Y = y) = E( Y ) = Var( Y ) = u Log link function: log (u)=( )=η= x T β (For our application: log (u)=( )=η= β 0 +(Group) β 1 ) By NewtonRaphson algorithm, get MLE of β and the 95%CI Gstatistic and Pearson s χ 2 statistics e u y u! y μ
16 Graphical Presentation of Poisson Distribution May not be Flexible for Empirical Fitting Purpose. Poisson Distributions: Poisson( λ ) Probability λ=5 λ=10 λ=15 λ= N o te : P o is s o n( λ ) has m ean λ and SD x λ
17 Extend Poisson Model Generalize Poisson Model by Allowing Dispersion Var (Y) = φ E(Y) = φu φ =1 (no dispersion), regular Poisson GLM is appropriate φ >1 (overdispersion) or φ <1 (underdispersion), dispersion), the standard error estimation is not reliable, adjust standard errors by φ. Quasilikelihood Poisson Model A more flexible modeling technique when over/under dispersion occurs in count data No strong idea about the appropriate distributional form of the outcome variable while can specify the link and variance function for the model
18 Under/OverDispersion for Shotgun Data Esitmate Dispersion Density Estimated Dispersion: Dp Ideal Dispersion Distribution: N(1, σ 2 )
19 QuasiLikelihood Poisson Model More details about quasilikelihood, define a score, U i : Then U i = Y i u ϕv( u ) i i EU ( ) = 0; VU ( ) = i i 1 ϕv( u ) ' i ϕ ( i) ( i i) ϕ ( i) 1 E 2 i ϕ i ϕ i U V u Y u V u E = = u [ V( u )] V( u ) i
20 Derive Quasilikelihood These properties are shared by the derivatives of the loglikelihood, likelihood, l, which suggest we can use U in place of l, define: yi t Q = i dt ϕv () t Then define the log quasilikelihood for all n observations as: Q n = i = 1 Q i
21 Generalized Estimation Equation (GEE) Method Quasilikelihood approach can be adapted for repeated measures and/or longitudinal experiment designs for shotgun proteomics research Generalized Estimation Equation (GEE) methods can be used for estimations Mixed effect models for nonnormal normal responses A multivariate analogue of the equations for the quasi likelihood models i T ui 1 ( ) Var( Yi; βα, ) ( Yi ui) = 0 β
22 Rate Model for Normalizing Shotgun Data Why do We Need Rate Model? The number of events observed may depend on a size variable that determines the number of opportunities for the events to occur For example: in some run of LCMS/MS, the sample density might be different; we need normalization before comparison Can be done within Poisson/QuasiPoisson GLM by modeling the effect of the size variable
23 Rate Model for Normalizing Shotgun Data Rate Model Log (u/w)= η = x β Log (u) = log(w)+ x β In this manner, we are modeling the rate of spectral observing while still maintaining the count response for the Poisson/QuasiPoisson model. We fix the coefficient as one by using an offset.
24 Handle Small Sample Size Issues with Small Sample Size: Asymptotic Distribution May not be Hold The χ 2 distribution is only an approximation that becomes more accurate as the sample size increases Not possible to say exactly how large we need, but N 5 N 5 is often suggested Lack of Power if the Effect is not Strong In the real world, due to financial/time constraints, the number of runs for shotgun data is usually small.
25 Sample Size for Poisson Power
26 Handle Small Sample Size Some Solutions Appropriate Test Statistics: D s D L ~ χ 2 ls Model S: Log(u)= )=log( log(w)+ β 0 Model L: log(u)=log( )=log(w)+)+ β 0 +(Group Group) β 1 Permutation Test: Reference distribution is obtained from the data P = {# T _perm > T _obs } / {# total permutation}
27 Statistical Analysis Strategy Model the Count Data Poisson Regression Model Quasilikelihood Poisson Model (GEE method) Rate Model (Poisson Model with Offset) Handle Small Sample Size Provide Appropriate Test Statistics Permutation Test Deal with Multiple Comparison Issues Frequentist Approach (FDR) Empirical Bayes Approach (LOCFDR)
28 Deal with Multiple Comparisons Why Worry About It? Selection Bias: : researchers tend to select significant ones to support their conclusions Inflate the False Positive Rate: : unadjusted Pvalues from the singleinference inference procedure result in increased type I error.
29 Multiple Tests Increase Chance of False Positive Probability of Having At Least One False Positive Probability o + * α = 0.1 ooooooooooooooooooooooooooooooooooooooo α = α = 0.01 o o ooo + o oo o o o + ++ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Number of Tests
30 Deal with Multiple Comparisons Family wise error rate (FWER) methods Too conservative and less application values Not suitable for largescale simultaneous hypothesis testing problems arouse from high through technologies A desirable error rate to control might be the expected proportion of errors among the rejected hypotheses.
31 Approach 1: False Discovery Rate Define the FDR to be the expectation of Q,, and control it under a value q * V V FDR = E( Q) = E( ) = E( ) V + S R True Null hypothesis Nontrue Null hypothesis Declared Declared Nonsignificant Significant U V T S mr R R Total m 0 m m 0 m From Benjamin & Hochberg, Controlling the False Discovery Rate, 1995
32 False Discovery Rate Controlling Procedure Sorted Pvalues q c = 0.1 q c = 0.2 q c =
33 Approach 2: Local False Discovery Rate Define Local FDR Null hypothesis: H 1, H 2,,, H N Test statistic: z 1, z 2,, z N p 0 =Pr{Null} f 0 (z)) density if null p 1 =Pr{nonnull} null} f 1 (z)) density if nonnull null Mixture density: f (z)) = p 0 f 0 (z)+ p 1 f 1 (z) Local FDR: fdr(z)= )=Pr{null z}= p 0 f 0 (z)) /f/ (z) From Efron, Local False Discovery Rates, 2005
34 How to Apply Local False Discovery Rate Histogram of Summary Statistics with Fitted M ixture Density Frequency MLE: delta: sigma: p0: 0.928
35 FDR vs. LOCFDR Frequentist Approach FDR LOCFDR Empirical Bayes method Works on Pvalues P values (null hypothesis tail area) Works on the test statistics Efron: In practice, FDR and LOCFDR can be combined, using BenjaminiHochberg algorithm to identify nonnull null cases, say with q=0.1, but also providing individual LOCFDR values for those cases.
36 Statistical Analysis Strategy Model the Count Data Poisson Regression Model Quasilikelihood Poisson Model Rate Model (Poisson Model with Offset) Handle Small Sample Size Provide Appropriate Test Statistics Permutation Test Deal with Multiple Comparison Issues Frequentist Approach (FDR) Empirical Bayes Approach (LOCFDR)
37 Case Study A Global Shotgun Proteomic Analysis of Colorectal Carcinoma Biological Materials RKO Cell Lines Rectal Adenocarcinoma Specimen Mass Spectrometry Method Peptides Separated by Strong Cation Exchange (SCX) and Isoelectric Focusing (IEF) LTQQrbitrap Qrbitrap Mass Spectrometer (Thermo Electron, San Joase,, CA) Database Searching and Filtering MyriMatch Search Algorithm; IPI Human Database 3.31; IDPicker 2.0
38 Case Study : Data Analysis Preliminary Results Protein.order Protein CAR_Counts RKO_Counts Poisson_Pvalue Quasi_Pvalue Pois_Pvalue_FD R 1513 IPI E E E E IPI E E IPI E E IPI E E E E IPI E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E IPI E E E E07 Quasi_Pvalue_FDR
39 Case Study Data Analysis Preliminary Results
40 Discussion and Future Study A framework Flexible to be Extend Repeat measurement; trend test, and etc. Quality Control of the Data Involve in the early step of data generation process Simulation Study Evaluating the Methods Add the Current Work to the pipeline of the Proteomic Research
41 Acknowledgement Rob Slebos Dan Liebler Pierre Massion Takefumi Kikuchi David Carbone Will Gray Yu Shyr Cancer Biostatistics Center Ayer Institute GI SPORE Lung SPORE
42 Thank You!
False Discovery Rate Control with Groups
False Discovery Rate Control with Groups James X. Hu, Hongyu Zhao and Harrison H. Zhou Abstract In the context of largescale multiple hypothesis testing, the hypotheses often possess certain group structures
More informationTutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
More informationAnalyzing Data with GraphPad Prism
Analyzing Data with GraphPad Prism A companion to GraphPad Prism version 3 Harvey Motulsky President GraphPad Software Inc. Hmotulsky@graphpad.com GraphPad Software, Inc. 1999 GraphPad Software, Inc. All
More informationAdaptive linear stepup procedures that control the false discovery rate
Biometrika (26), 93, 3, pp. 491 57 26 Biometrika Trust Printed in Great Britain Adaptive linear stepup procedures that control the false discovery rate BY YOAV BENJAMINI Department of Statistics and Operations
More information11. Analysis of Casecontrol Studies Logistic Regression
Research methods II 113 11. Analysis of Casecontrol Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationDesign of Experiments (DOE)
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. Design
More informationData Quality Assessment: A Reviewer s Guide EPA QA/G9R
United States Office of Environmental EPA/240/B06/002 Environmental Protection Information Agency Washington, DC 20460 Data Quality Assessment: A Reviewer s Guide EPA QA/G9R FOREWORD This document is
More informationStatistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation
Statistical Inference in TwoStage Online Controlled Experiments with Treatment Selection and Validation Alex Deng Microsoft One Microsoft Way Redmond, WA 98052 alexdeng@microsoft.com Tianxi Li Department
More informationA direct approach to false discovery rates
J. R. Statist. Soc. B (2002) 64, Part 3, pp. 479 498 A direct approach to false discovery rates John D. Storey Stanford University, USA [Received June 2001. Revised December 2001] Summary. Multiplehypothesis
More informationA Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package
A Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package William J Browne, Mousa Golalizadeh Lahi* & Richard MA Parker School of Clinical Veterinary
More informationGene Selection for Cancer Classification using Support Vector Machines
Gene Selection for Cancer Classification using Support Vector Machines Isabelle Guyon+, Jason Weston+, Stephen Barnhill, M.D.+ and Vladimir Vapnik* +Barnhill Bioinformatics, Savannah, Georgia, USA * AT&T
More informationPASIF A Framework for Supporting Smart Interactions with Predictive Analytics
PASIF A Framework for Supporting Smart Interactions with Predictive Analytics by Sarah Marie Matheson A thesis submitted to the School of Computing in conformity with the requirements for the degree of
More informationEVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION. Carl Edward Rasmussen
EVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NONLINEAR REGRESSION Carl Edward Rasmussen A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate
More informationACTUARIAL MODELING FOR INSURANCE CLAIM SEVERITY IN MOTOR COMPREHENSIVE POLICY USING INDUSTRIAL STATISTICAL DISTRIBUTIONS
i ACTUARIAL MODELING FOR INSURANCE CLAIM SEVERITY IN MOTOR COMPREHENSIVE POLICY USING INDUSTRIAL STATISTICAL DISTRIBUTIONS OYUGI MARGARET ACHIENG BOSOM INSURANCE BROKERS LTD P.O.BOX 7454700200 NAIROBI,
More informationMetaAnalysis Notes. Jamie DeCoster. Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 354870348
MetaAnalysis Notes Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 354870348 Phone: (205) 3484431 Fax: (205) 3488648 September 19, 2004
More informationStandards for the Design, Conduct, and Evaluation of Adaptive Randomized Clinical Trials
Standards for the Design, Conduct, and Evaluation of Adaptive Randomized Clinical Trials Michelle A. Detry, PhD, 1 Roger J. Lewis, MD, PhD, 14 Kristine R. Broglio, MS, 1 Jason T. Connor, PhD, 1,5 Scott
More informationStrong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach
J. R. Statist. Soc. B (2004) 66, Part 1, pp. 187 205 Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach John D. Storey,
More informationThe InStat guide to choosing and interpreting statistical tests
Version 3.0 The InStat guide to choosing and interpreting statistical tests Harvey Motulsky 19902003, GraphPad Software, Inc. All rights reserved. Program design, manual and help screens: Programming:
More informationMascot Search Results FAQ
Mascot Search Results FAQ 1 We had a presentation with this same title at our 2005 user meeting. So much has changed in the last 6 years that it seemed like a good idea to revisit the topic. Just about
More information8 INTERPRETATION OF SURVEY RESULTS
8 INTERPRETATION OF SURVEY RESULTS 8.1 Introduction This chapter discusses the interpretation of survey results, primarily those of the final status survey. Interpreting a survey s results is most straightforward
More information1 Theory: The General Linear Model
QMIN GLM Theory  1.1 1 Theory: The General Linear Model 1.1 Introduction Before digital computers, statistics textbooks spoke of three procedures regression, the analysis of variance (ANOVA), and the
More informationWas This Review Helpful to You? It Depends! Context and Voting Patterns in Online Content
Was This Review Helpful to You? It Depends! Context and Voting Patterns in Online Content Ruben Sipos Dept. of Computer Science Cornell University Ithaca, NY rs@cs.cornell.edu Arpita Ghosh Dept. of Information
More informationwww.rmsolutions.net R&M Solutons
Ahmed Hassouna, MD Professor of cardiovascular surgery, AinShams University, EGYPT. Diploma of medical statistics and clinical trial, Paris 6 university, Paris. 1A Choose the best answer The duration
More informationSupporting Keyword Search in Product Database: A Probabilistic Approach
Supporting Keyword Search in Product Database: A Probabilistic Approach Huizhong Duan 1, ChengXiang Zhai 2, Jinxing Cheng 3, Abhishek Gattani 4 University of Illinois at UrbanaChampaign 1,2 Walmart Labs
More informationA 61MillionPerson Experiment in Social Influence and Political Mobilization
Supplementary Information for A 61MillionPerson Experiment in Social Influence and Political Mobilization Robert M. Bond 1, Christopher J. Fariss 1, Jason J. Jones 2, Adam D. I. Kramer 3, Cameron Marlow
More informationNCSS Statistical Software
Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, twosample ttests, the ztest, the
More informationQualityBased Pricing for Crowdsourced Workers
NYU Stern Research Working Paper CBA1306 QualityBased Pricing for Crowdsourced Workers Jing Wang jwang5@stern.nyu.edu Panagiotis G. Ipeirotis panos@stern.nyu.edu Foster Provost fprovost@stern.nyu.edu
More informationFixedEffect Versus RandomEffects Models
CHAPTER 13 FixedEffect Versus RandomEffects Models Introduction Definition of a summary effect Estimating the summary effect Extreme effect size in a large study or a small study Confidence interval
More informationSources of Validity Evidence
Sources of Validity Evidence Inferences made from the results of a selection procedure to the performance of subsequent work behavior or outcomes need to be based on evidence that supports those inferences.
More information