Statistical Analysis Strategies for Shotgun Proteomics Data




Statistical Analysis Strategies for Shotgun Proteomics Data Ming Li, Ph.D. Cancer Biostatistics Center Vanderbilt University Medical Center

Ayers Institute Biomarker Pipeline
Workflow: normal vs. cancer samples -> shotgun proteome analysis (compare inventories, identify differences) -> targeted quantitative analysis (label-free MRM in tissue; LC-MRM-MS method or ELISA) -> evaluation in clinical context (sensitivity, specificity, performance metrics).
Questions asked along the way: Is it really there? Is there a difference?
- Discovery: few (<20) samples or sample pools; low throughput; identify up to ~5,000 proteins; inventory differences of ~150 proteins.
- Verification: ~10-200 samples; higher throughput; screen 100-200 candidates; identify high-priority candidates.
- Validation: hundreds+ of samples; highest throughput; validate 1-10 candidates; clinical validation trial.

The Big Picture
- Proteomics tools: mass spectrometry, database, bioinformatics, protein-separation techniques.
- Research goal: biomarker discovery.
- High-throughput data (MALDI MS, shotgun, etc.) create the challenge for data analysis.

Outline
- Background
  - Shotgun proteomics techniques and data
  - The challenges for statistical analysis
- Statistical analysis strategies
  - Methods and models
  - Case study and preliminary results
- Discussion and future work

Background on Shotgun Proteomics
What is shotgun proteomics?
A method of identifying proteins in complex mixtures by combining a separation/digestion step with mass spectrometry. More specifically, the proteins in the mixture are digested, the resulting peptides are separated (by HPLC, SCX, IEF, etc.), and tandem mass spectrometry (LC-MS/MS) is used to identify the peptides, which are then matched back to proteins.

General Flow in Proteomic Analysis
Proteins -> protein fractions 1 to n (SDS-PAGE, preparative IEF, or HPLC) -> digest -> peptide fractions 1 to n -> reverse-phase LC or tandem LC -> MS/MS analysis -> data -> database search algorithms -> identification.
From Liebler, Introduction to Proteomics, 2002.

Shotgun 101 Design Analysis Integration Nesvizhskii et al.

[Figure: LC separation of peptides 1 to 4, followed by MS and MS-MS fragmentation; every spectrum is plotted against m/z.]

[Figure: annotated MS-MS spectrum of peptide AVAGCAGAR. Scan dm_200392_tpepc_std #587, RT 20.55, NL 1.95E7, full ms2 388.37@40.00 [95.00-790.00]; relative abundance against m/z (100 to 600).]
Theoretical fragment ions shown on the slide:
b-ions   b1 72.087   b2 171.219   b3 242.298   b4 299.350   b5 401.487   b6 472.566   b7 529.618   b8 600.696
y-ions   y1 175.211  y2 246.290   y3 303.341   y4 374.420   y5 476.557   y6 533.609   y7 604.688   y8 703.820

[Figure: the same MS-MS scan (dm_200392_tpepc_std #587) annotated with the database search steps.]
Database search workflow:
1. Start from the actual MS-MS scan; the precursor peptide has [M+H]+ = 775.8.
2. Get database sequences that match the precursor peptide mass: AVAGCAGAR, CVAAGAAGR, VGGACAAAR, etc.
3. Generate virtual MS-MS spectra (b2 to b6 and y2 to y6 ions) for each candidate.
4. Compare the virtual spectra to the real spectrum.
5. Scoring (peptide score):
   - Detect matches between theoretical b- and y-ions and actual spectrum ions
   - Compute correlation scores
   - Rank hits

Peptide      Score
AVAGCAGAR    2.56
CVAAGAAGR    0.57
VGGACAAAR    0.32
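The matching step can be illustrated with a small sketch. It is not the correlation score used on the slide: it swaps in a plain shared-peak count, uses standard monoisotopic residue masses (the slide appears to use average masses), and takes a 0.6 Da tolerance as an assumption. The peak list is transcribed from the labels on the slide's spectrum; the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; only the residues needed here).
RESIDUE = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "C": 103.00919, "R": 156.10111}
WATER = 18.01056
PROTON = 1.00728

# Peak m/z labels transcribed from the slide's MS-MS spectrum.
OBSERVED = [
    129.1, 143.0, 143.9, 169.8, 170.9, 171.9, 196.3, 197.0, 232.5, 242.0,
    246.2, 271.1, 299.2, 303.3, 304.1, 357.2, 374.2, 375.2, 400.0, 402.2,
    403.2, 445.1, 459.7, 477.0, 517.3, 530.2, 534.3, 535.2, 558.0, 587.4,
    605.3, 606.3, 607.5, 615.3,
]


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions + y_ions


def shared_peak_count(peptide, peaks, tol=0.6):
    """Count theoretical ions that have an observed peak within tol Da."""
    return sum(
        any(abs(ion - peak) <= tol for peak in peaks)
        for ion in fragment_ions(peptide)
    )


candidates = ["CVAAGAAGR", "VGGACAAAR", "AVAGCAGAR"]
scores = {pep: shared_peak_count(pep, OBSERVED) for pep in candidates}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for peptide, score in ranked:
    print(peptide, score)
```

Even with this crude score, the candidate order agrees with the ranking on the slide; a real search engine would also weigh peak intensities and ion series continuity.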

From Shotgun Proteomics to Biomarker Discovery
A protein/peptide frequency-based analysis approach:
- Compare spectral identifications between groups.
- The unit of measurement is the number of times a peptide is observed, that is, the number of spectra matched to its sequence, in a single LC-MS/MS run.
- The counts reflect the abundance of the protein from which the peptides are derived.

Shotgun Data and Statistical Challenge

           Normal                      Cancer
Protein    Sample 1  ...  Sample m     Sample 1  ...  Sample m
#1         57        ...  65           108       ...  160
#2         20        ...  38           12        ...  9
#3         85        ...  67           8         ...  22
#4         0         ...  0            3         ...  1
...        ...       ...  ...          ...       ...  ...
#N         70        ...  23           12        ...  18

Statistical Analysis Goal
- Provide investigators with a "winner list" of proteins for further study.
- The winners are potential diagnostic, prognostic, or therapeutic biomarkers.
- The winners are selected by appropriate statistical models and procedures.

Statistical Analysis Strategy
- Model the count data
  - Poisson regression model
  - Quasi-likelihood Poisson model (GEE method)
  - Rate model (Poisson model with offset)
- Handle small sample size
  - Provide appropriate test statistics
  - Permutation test
- Deal with multiple comparison issues
  - Frequentist approach (FDR)
  - Empirical Bayes approach (LOCFDR)

Poisson Model for Count Data
Poisson regression model (Poisson GLM):
- If Y is a Poisson random variable with mean u, then
  $P(Y = y) = \frac{e^{-u} u^{y}}{y!}$, with $E(Y) = \mathrm{Var}(Y) = u$.
- Log link function: $\log(u) = \eta = x^{T}\beta$
  (for our application: $\log(u) = \eta = \beta_0 + \beta_1 \cdot \mathrm{Group}$).
- The Newton-Raphson algorithm gives the MLE of $\beta$ and its 95% CI.
- Goodness of fit: G-statistic and Pearson's $\chi^2$ statistic.
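A minimal sketch of the Newton-Raphson fit for the two-group model, assuming NumPy is available. The counts are the protein #1 row of the example table; the function name `fit_poisson_glm`, the starting value, and the Wald-type interval (estimate ± 1.96 SE) are illustrative choices, not the authors' implementation.

```python
import numpy as np


def fit_poisson_glm(X, y, n_iter=25, tol=1e-10):
    """Fit a log-link Poisson GLM by Newton-Raphson; return (beta, se)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only fit (column 0 is the intercept)
    for _ in range(n_iter):
        u = np.exp(X @ beta)              # fitted means
        score = X.T @ (y - u)             # gradient of the log-likelihood
        info = X.T @ (X * u[:, None])     # Fisher information X' diag(u) X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    u = np.exp(X @ beta)
    info = X.T @ (X * u[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se


# Protein #1 from the example table: two normal runs, two cancer runs.
counts = np.array([57, 65, 108, 160])
group = np.array([0, 0, 1, 1])            # 0 = normal, 1 = cancer
X = np.column_stack([np.ones(len(counts)), group])

beta, se = fit_poisson_glm(X, counts)
ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])  # Wald 95% CI
print("beta:", beta)
print("95% CI:", ci)
```

For this design the fit reproduces the group means: exp(β0) is the mean normal count and exp(β0 + β1) is the mean cancer count.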

Graphical Presentation of Poisson Distribution
The Poisson distribution may not be flexible enough for empirical fitting.
[Figure: Poisson(λ) probability distributions for λ = 5, 10, 15, 20; probability (0.00 to 0.20) against counts from 0 to 40.]
Note: Poisson(λ) has mean λ and SD √λ.

Extend Poisson Model
Generalize the Poisson model by allowing dispersion: Var(Y) = φ E(Y) = φu
- φ = 1 (no dispersion): the regular Poisson GLM is appropriate.
- φ > 1 (over-dispersion) or φ < 1 (under-dispersion): the Poisson standard errors are not reliable and must be adjusted using φ (multiplied by √φ).
Quasi-likelihood Poisson model:
- A more flexible modeling technique when over- or under-dispersion occurs in count data.
- It does not require a specific distributional form for the outcome variable; only the link and variance functions of the model need to be specified.
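As a rough illustration of the adjustment, the dispersion can be estimated by the Pearson chi-square statistic divided by the residual degrees of freedom, a common estimator that the slides do not name explicitly. The sketch below assumes NumPy and reuses the protein #1 counts; for a two-group log-link model the fitted means are just the group means, so no iterative fit is needed here.

```python
import numpy as np


def pearson_dispersion(y, fitted, n_params):
    """Estimate phi as Pearson chi-square / (n - number of parameters)."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    chi2 = np.sum((y - fitted) ** 2 / fitted)
    return chi2 / (len(y) - n_params)


counts = np.array([57, 65, 108, 160])     # protein #1: normal, normal, cancer, cancer
group = np.array([0, 0, 1, 1])

# Fitted means of the two-group Poisson model are the group means.
mean_normal = counts[group == 0].mean()
mean_cancer = counts[group == 1].mean()
fitted = np.where(group == 0, mean_normal, mean_cancer)

phi = pearson_dispersion(counts, fitted, n_params=2)

# Poisson standard error of the group effect, then the quasi-Poisson version.
se_poisson = np.sqrt(1 / counts[group == 0].sum() + 1 / counts[group == 1].sum())
se_quasi = np.sqrt(phi) * se_poisson
print(f"phi = {phi:.3f}, Poisson SE = {se_poisson:.4f}, quasi-Poisson SE = {se_quasi:.4f}")
```

With only four runs the estimate of φ is very noisy; the point of the sketch is the mechanics of the rescaling, not the value itself.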

Under/Over-Dispersion for Shotgun Data
[Figure: density of the estimated dispersion (Dp), compared with the ideal dispersion distribution N(1, σ²); dispersion axis from 0 to 6.]

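Quasi-Likelihood Poisson Model
More details about quasi-likelihood. Define a score $U_i$:
$U_i = \frac{Y_i - u_i}{\phi V(u_i)}$
Then
$E(U_i) = 0, \qquad \mathrm{Var}(U_i) = \frac{1}{\phi V(u_i)}$
$-E\left(\frac{\partial U_i}{\partial u_i}\right) = -E\left[\frac{-\phi V(u_i) - (Y_i - u_i)\,\phi V'(u_i)}{[\phi V(u_i)]^{2}}\right] = \frac{1}{\phi V(u_i)}$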

Derive Quasi-likelihood
These properties are shared by the derivative of the log-likelihood l, which suggests that we can use U in place of that derivative. Define:
$Q_i = \int_{y_i}^{u_i} \frac{y_i - t}{\phi V(t)}\, dt$
Then define the log quasi-likelihood for all n observations as:
$Q = \sum_{i=1}^{n} Q_i$
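As a worked example (an illustration, not taken from the slides), take the Poisson variance function $V(t) = t$. The integral then has a closed form, and up to terms that do not involve $u_i$ it is the Poisson log-likelihood divided by $\phi$:

```latex
Q_i = \int_{y_i}^{u_i} \frac{y_i - t}{\phi\, t}\, dt
    = \frac{1}{\phi}\Big[\, y_i \log t - t \,\Big]_{t = y_i}^{t = u_i}
    = \frac{1}{\phi}\big( y_i \log u_i - u_i \big)
      - \frac{1}{\phi}\big( y_i \log y_i - y_i \big)
```

Maximizing $Q$ over $\beta$ therefore gives the same estimates as the ordinary Poisson GLM; $\phi$ only rescales the standard errors.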

Generalized Estimating Equations (GEE) Method
- The quasi-likelihood approach can be adapted to repeated-measures and/or longitudinal experimental designs in shotgun proteomics research.
- Generalized estimating equations (GEE) can be used for estimation, as can mixed-effects models for non-normal responses.
- The GEE are a multivariate analogue of the estimating equations for quasi-likelihood models:
$\sum_i \left(\frac{\partial u_i}{\partial \beta}\right)^{T} \mathrm{Var}(Y_i; \beta, \alpha)^{-1} (Y_i - u_i) = 0$

Rate Model for Normalizing Shotgun Data
Why do we need a rate model?
- The number of events observed may depend on a size variable that determines the number of opportunities for the events to occur.
- For example, the sample density may differ between LC-MS/MS runs, so the counts need normalization before comparison.
- This can be done within the Poisson/quasi-Poisson GLM by modeling the effect of the size variable.

Rate Model for Normalizing Shotgun Data
Rate model, with size variable w:
$\log(u / w) = \eta = x^{T}\beta$
$\log(u) = \log(w) + x^{T}\beta$
In this manner we model the rate at which spectra are observed while keeping the count as the response of the Poisson/quasi-Poisson model. The coefficient of log(w) is fixed at one by entering it as an offset.
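The offset can be sketched as follows, assuming NumPy. The counts are the protein #1 row of the example table; the size variable `w` (here imagined as the total number of identified spectra per run) is hypothetical, as is the function name `fit_poisson_offset`.

```python
import numpy as np


def fit_poisson_offset(X, y, log_w, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of log(u) = log(w) + X beta, with log(w) as offset."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    log_w = np.asarray(log_w, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / np.exp(log_w).sum())  # start at the pooled rate
    for _ in range(n_iter):
        u = np.exp(log_w + X @ beta)                 # offset enters with coefficient 1
        step = np.linalg.solve(X.T @ (X * u[:, None]), X.T @ (y - u))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


counts = np.array([57, 65, 108, 160])        # protein #1 spectral counts
w = np.array([1000, 1200, 2000, 2400])       # hypothetical total spectra per run
group = np.array([0, 0, 1, 1])               # 0 = normal, 1 = cancer
X = np.column_stack([np.ones(len(counts)), group])

beta = fit_poisson_offset(X, counts, np.log(w))
rate_normal = np.exp(beta[0])
rate_cancer = np.exp(beta[0] + beta[1])
print(f"rate (normal) = {rate_normal:.4f}, rate (cancer) = {rate_cancer:.4f}")
```

With these made-up run sizes, most of the apparent difference in raw counts disappears once the counts are expressed as rates, which is exactly the situation the normalization is meant to catch.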

Handle Small Sample Size
Issues with small sample size:
- The asymptotic distribution may not hold. The χ² distribution is only an approximation that becomes more accurate as the sample size increases. It is not possible to say exactly how large a sample is needed, but N ≥ 5 is often suggested.
- Lack of power if the effect is not strong.
In the real world, financial and time constraints usually keep the number of runs for shotgun data small.

Sample Size for Poisson Power

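Handle Small Sample Size
Some solutions:
- Appropriate test statistics: $D_S - D_L \sim \chi^2_{l-s}$, where $D_S$ and $D_L$ are the deviances of the smaller and larger models, with s and l parameters.
  Model S: $\log(u) = \log(w) + \beta_0$
  Model L: $\log(u) = \log(w) + \beta_0 + \beta_1 \cdot \mathrm{Group}$
- Permutation test: the reference distribution is obtained from the data.
  $P = \#\{T_{perm} > T_{obs}\} \,/\, \#\{\text{total permutations}\}$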
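A minimal sketch of an exhaustive permutation test, using only the Python standard library. The counts are made up for illustration, the test statistic (absolute difference in group means) is a stand-in for whichever model-based statistic is used in practice, and the comparison uses "greater than or equal" rather than the strict inequality on the slide so that the observed labeling counts itself and the P-value is never zero.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def exhaustive_permutation_pvalue(group_a, group_b):
    """Two-sided permutation P-value over all relabelings of the pooled runs."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    t_obs = abs(mean(group_a) - mean(group_b))
    n_extreme = 0
    n_total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        perm_a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        t_perm = abs(mean(perm_a) - mean(perm_b))
        n_total += 1
        if t_perm >= t_obs - 1e-12:   # small slack guards against rounding
            n_extreme += 1
    return n_extreme / n_total


# Hypothetical spectral counts for one protein, three runs per group.
normal = [20, 38, 25]
cancer = [12, 9, 14]
p_value = exhaustive_permutation_pvalue(normal, cancer)
print(p_value)
```

With three runs per group there are only 20 relabelings, so the smallest attainable two-sided P-value is 2/20 = 0.1; this illustrates how small sample sizes limit what a permutation test can show.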


Deal with Multiple Comparisons
Why worry about it?
- Selection bias: researchers tend to select the significant results that support their conclusions.
- Inflated false positive rate: unadjusted P-values from single-inference procedures lead to an increased type I error rate.

Multiple Tests Increase Chance of False Positive
[Figure: probability of having at least one false positive (0.0 to 1.0) against the number of tests (0 to 50), with curves for α = 0.1, α = 0.05, and α = 0.01.]
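The curves in this figure follow from a one-line formula if the tests are assumed independent (an assumption made here for illustration): the probability of at least one false positive among m tests at level α is 1 − (1 − α)^m. The function name below is hypothetical.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """P(at least one type I error) for n_tests independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for alpha in (0.1, 0.05, 0.01):
    probs = [prob_at_least_one_false_positive(alpha, m) for m in (1, 10, 50)]
    print(alpha, [round(p, 3) for p in probs])
```

At α = 0.05, fifty independent tests already give a better than 90% chance of at least one false positive.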

Deal with Multiple Comparisons
Family-wise error rate (FWER) methods:
- Are too conservative and of limited practical value here.
- Are not suitable for the large-scale simultaneous hypothesis testing problems that arise from high-throughput technologies.
A more desirable error rate to control may be the expected proportion of errors among the rejected hypotheses.

Approach 1: False Discovery Rate
Define the FDR to be the expectation of Q, and control it under a value q*:
$FDR = E(Q) = E\left(\frac{V}{V + S}\right) = E\left(\frac{V}{R}\right)$

                             Declared          Declared
                             non-significant   significant   Total
True null hypotheses         U                 V             m0
Non-true null hypotheses     T                 S             m - m0
Total                        m - R             R             m

From Benjamini & Hochberg, Controlling the False Discovery Rate, 1995

False Discovery Rate Controlling Procedure
[Figure: sorted P-values (0.0 to 1.0) against their rank (0 to 250), with cutoff lines for q_c = 0.1, 0.2, and 0.3.]
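The controlling procedure can be sketched as the Benjamini-Hochberg step-up rule: sort the m P-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k hypotheses with the smallest P-values. The sketch assumes NumPy; the P-values are made up and the function name is hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of P-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m       # k * q / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting the bound
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject


pvals = [0.035, 0.6, 0.001, 0.03]                  # hypothetical P-values
rejected = benjamini_hochberg(pvals, q=0.05)
print(rejected)
```

Note the step-up behavior: 0.03 exceeds its own threshold (2 × 0.05 / 4 = 0.025) but is still rejected, because the larger P-value 0.035 falls below the threshold at rank 3.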

Approach 2: Local False Discovery Rate
Define the local FDR:
- Null hypotheses: $H_1, H_2, \ldots, H_N$
- Test statistics: $z_1, z_2, \ldots, z_N$
- $p_0 = \Pr\{\text{null}\}$, with density $f_0(z)$ if null
- $p_1 = \Pr\{\text{non-null}\}$, with density $f_1(z)$ if non-null
- Mixture density: $f(z) = p_0 f_0(z) + p_1 f_1(z)$
- Local FDR: $fdr(z) = \Pr\{\text{null} \mid z\} = p_0 f_0(z) / f(z)$
From Efron, Local False Discovery Rates, 2005

How to Apply Local False Discovery Rate
[Figure: histogram of summary statistics with fitted mixture density; frequency (0 to 80) against the statistic (-2 to 6).]
MLE: delta = 0.059, sigma = 1.006, p0 = 0.928
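To make the definition concrete, the sketch below evaluates fdr(z) = p0·f0(z)/f(z) with a normal null density using the estimates reported on the slide (delta, sigma, p0). The non-null density f1 is not given in the slides, so a N(3, 1) density is assumed purely for illustration; in practice the mixture density f is estimated from the histogram of the statistics rather than from an assumed f1.

```python
import math


def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, p0, null_mean, null_sd, alt_mean, alt_sd):
    """fdr(z) = p0 * f0(z) / f(z) for a two-component normal mixture."""
    f0 = normal_pdf(z, null_mean, null_sd)
    f1 = normal_pdf(z, alt_mean, alt_sd)
    f = p0 * f0 + (1.0 - p0) * f1
    return p0 * f0 / f


# Null parameters from the slide; the alternative N(3, 1) is an assumption.
params = dict(p0=0.928, null_mean=0.059, null_sd=1.006, alt_mean=3.0, alt_sd=1.0)
for z in (0.0, 2.0, 3.0, 5.0):
    print(z, round(local_fdr(z, **params), 4))
```

Statistics near zero get a local FDR close to one (almost certainly null), while large statistics get a local FDR close to zero.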

FDR vs. LOCFDR

FDR                                                   LOCFDR
Frequentist approach                                  Empirical Bayes method
Works on the P-values (null hypothesis tail areas)    Works on the test statistics

Efron: In practice, FDR and LOCFDR can be combined, using the Benjamini-Hochberg algorithm to identify non-null cases, say with q = 0.1, but also providing individual LOCFDR values for those cases.


Case Study
A Global Shotgun Proteomic Analysis of Colorectal Carcinoma
- Biological materials: RKO cell lines; rectal adenocarcinoma specimen.
- Mass spectrometry method: peptides separated by strong cation exchange (SCX) and isoelectric focusing (IEF); LTQ-Orbitrap mass spectrometer (Thermo Electron, San Jose, CA).
- Database searching and filtering: MyriMatch search algorithm; IPI Human Database 3.31; IDPicker 2.0.

Case Study: Data Analysis
Preliminary Results

Protein.order  Protein      CAR_Counts  RKO_Counts  Poisson_Pvalue  Quasi_Pvalue  Pois_Pvalue_FDR  Quasi_Pvalue_FDR
1513           IPI00176193  342         2           5.72E-119       5.13E-12      8.73E-117        1.41E-08
768            IPI00020501  2267        37          0               1.84E-11      0                2.53E-08
2601           IPI00744256  2276        37          0               2.38E-11      0                2.18E-08
2648           IPI00784458  671         1           1.68E-237       3.50E-11      5.13E-235        2.40E-08
2398           IPI00465084  1326        1           0               9.68E-11      0                5.32E-08
2579           IPI00654755  759         2           9.84E-272       9.69E-11      3.38E-269        4.44E-08
280            IPI00007765  43          533         2.34E-84        1.21E-10      1.69E-82         4.76E-08
1325           IPI00072917  842         1           2.38E-297       2.61E-10      1.09E-294        8.96E-08
842            IPI00022200  847         1           3.23E-299       2.69E-10      1.78E-296        8.20E-08
2029           IPI00302592  1432        428         3.08E-183       2.73E-10      7.69E-181        7.49E-08
2438           IPI00473011  266         2           5.25E-91        3.62E-10      4.51E-89         9.05E-08
2319           IPI00418471  778         1           9.31E-279       5.95E-10      3.65E-276        1.36E-07
2509           IPI00554648  440         1           9.45E-158       7.26E-10      1.85E-155        1.53E-07
1597           IPI00216135  506         105         4.89E-87        9.86E-10      3.73E-85         1.93E-07
1983           IPI00299301  283         1           6.25E-101       1.26E-09      6.61E-99         2.31E-07
21             IPI00000230  574         101         3.69E-108       1.95E-09      4.41E-106        3.35E-07
2377           IPI00455050  547         101         1.93E-100       2.06E-09      1.97E-98         3.33E-07
871            IPI00022463  374         1           2.65E-132       2.51E-09      4.55E-130        3.84E-07
1757           IPI00220709  704         139         7.98E-124       2.56E-09      1.29E-121        3.69E-07

Case Study Data Analysis Preliminary Results

Discussion and Future Study
- The framework is flexible and can be extended to repeated measurements, trend tests, etc.
- Quality control of the data: get involved in the early steps of the data generation process.
- Simulation study to evaluate the methods.
- Add the current work to the pipeline of proteomic research.

Acknowledgement
Rob Slebos, Dan Liebler, Pierre Massion, Takefumi Kikuchi, David Carbone, Will Gray, Yu Shyr
Cancer Biostatistics Center, Ayers Institute, GI SPORE, Lung SPORE

Thank You!