Protein Prospector and Ways of Calculating Expectation Values




Aenoch J. Lynn; Robert J. Chalkley; Peter R. Baker; Mark R. Segal; and Alma L. Burlingame
University of California, San Francisco, San Francisco, CA 94143-0446

Introduction

With the production of large, multidimensional chromatography tandem mass spectrometry datasets, it has become essential to characterize the reliability of results. In addition, publication guidelines for mass spectrometry experiments will require a measure of the statistical significance of a peptide assignment [1]. The most commonly reported measure of significance is an expectation value, or e-value, which represents how many random matches would be expected to achieve a given score or greater in a search of a given size. Conventional e-value thresholds are 0.05 or 0.01, with 0.05 commonly used by other database search engines (Mascot [2], OMSSA [3] and X!Tandem [4]). We discuss the various methods of calculating an e-value and their relative merits.

Expectation Values

The probability value (p-value) is the probability that an event will occur at random. The expectation value (e-value) is the expected number of times an event will occur at random in a given set of trials:

e-value = p-value * number of trials

For mass spectrometry, an e-value is the number of times a given peptide score (or greater) will be achieved by incorrect matches in a database search. If a peptide assignment has an e-value of 0.01, then one would expect 1 peptide to match at random from a database 100 times the size.

e-values can be calculated by:
A. A theoretical calculation of the chance that a given number of peak masses, out of a total number of peaks, match at random [2,3].
B. Fitting the incorrect (null) results from a database search to a distribution and using this distribution to calculate a p-value [4,5].

How to Calculate e-values

A. Theoretical Chances of Matching Peaks
What is the probability of 15 out of 25 masses matching to a random (incorrect) assignment?
- Potentially fast.
- Fails to account for database factors (amino acid frequencies).
- Reliable e-values require understanding and accounting for all factors that contribute to random matching of peaks; not all factors are understood.

B. Modeling the Incorrect Distribution
Model the incorrect (random) distribution to determine p-values (and thereby e-values) for a corresponding score.
- Applicable to any scoring scheme, without requiring an understanding of the factors contributing to peptide fragmentation and measurement error.
- Limited by the distribution family chosen for modeling.
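As a rough illustration of approach A and of the relation e-value = p-value * number of trials, the sketch below treats each of the 25 peaks as matching independently with the same probability, so the chance of 15 or more matches is a binomial tail. The per-peak probability (p_single) and the number of candidate peptides (n_candidates) are hypothetical values chosen for illustration, and the independence assumption is exactly the kind of simplification that ignores database factors. This is not the calculation performed by any particular search engine.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of at least k successes in n independent trials, each with probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def e_value(p_value, n_trials):
    """Expected number of random occurrences: e-value = p-value * number of trials."""
    return p_value * n_trials


# Hypothetical inputs (assumptions, not values from the poster).
p_single = 0.1         # assumed chance that one peak matches at random
n_candidates = 50_000  # assumed number of candidate peptides compared to the spectrum

p_value = binomial_tail(15, 25, p_single)
print(f"p-value = {p_value:.3e}, e-value = {e_value(p_value, n_candidates):.3e}")
```

Changing p_single shows how strongly the result depends on the assumed random-match probability, which is the weakness of the theoretical approach noted above.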

Linear Tail-Fit

- Use the top fraction of the scores (1%) to model the distribution [5]; this requires large numbers of incorrect matches.
- Plot log(survival) vs log(score) and use linear regression to estimate p-values for a given peptide score.
- Reasonably accurate for e-values between .1 and .1.
- log(survival) vs log(score) is not always linear; for Prospector, log(survival) vs score is more linear.
- Sensitive to homologous peptides matching and skewing the upper end of the tail.
- Relies upon extrapolation.
- High mass-accuracy spectra or species-restricted database searches may not return sufficient numbers of incorrect matches for this method to work.

Calculating Linear Tail-Fit

[Figure: four panels. Score vs Frequency; Score vs Cumulative Frequency (survival); Log Survival vs Log Score; log(survival) vs log(score) for the Top 1% Scores.]
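The tail-fit steps can be sketched in a few lines. This is a minimal illustration, not the Protein Prospector implementation: it regresses log10(survival) on the raw score (the form described above as more linear for Prospector), uses a rank-based survival estimate, and takes the tail fraction as a parameter. The function names, the simulated scores and the chosen fraction are all assumptions.

```python
import math
import random


def tail_fit(scores, top_fraction):
    """Least-squares fit of log10(survival) = intercept + slope * score on the top scores."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two scores")
    k = min(n, max(2, int(n * top_fraction)))
    xs = ranked[:k]
    # Rank-based survival: the i-th best score is reached by (i + 1) of n matches.
    ys = [math.log10((i + 1) / n) for i in range(k)]
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0.0:
        raise ValueError("tail scores are all identical; cannot fit a line")
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def tail_e_value(score, intercept, slope, n_candidates):
    """Extrapolate the fitted line to a score and convert the p-value to an e-value."""
    p_value = min(1.0, 10.0 ** (intercept + slope * score))
    return p_value * n_candidates


# Simulated incorrect-match scores (hypothetical data for illustration only).
rng = random.Random(0)
null_scores = [rng.expovariate(0.2) for _ in range(5000)]
intercept, slope = tail_fit(null_scores, top_fraction=0.1)
print(f"slope = {slope:.3f}, e-value at score 60 = "
      f"{tail_e_value(60.0, intercept, slope, len(null_scores)):.3e}")
```

Because the e-value for a high-scoring match is read off the line beyond the fitted points, the result is an extrapolation, which is the limitation listed above.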

Survival Curves

[Figure: log10(survival) vs Protein Prospector score.]
Survival curves for 44 spectra searched against SwissProt, using the top 1% of the scores, as used in the linear tail-fit.

Non-Linear Survival Tails

[Figure: two panels, Spectrum 1 and Spectrum 2; x-axis log10(score).]
Plots of log(survival) vs. log(score) for the top 1 (red regression line) and top 1% (green regression line) of scoring peptides for two spectra. For log(score), the tail-fits do not appear very linear and are sensitive to the percentile selected for the cutoff.

Linear Survival Tails

[Figure: two panels, log(survival) vs score.]
Plots of log(survival) vs. score for the top 1% of scoring peptides for two spectra. In this region the survival tails are linear and are less sensitive to the percentile selected for the cut-off.

Model Distributions

Fit the null (incorrect) peptide scores to statistical distributions:
- Extreme value distribution (Method of Moments; Closed Form Maximum Likelihood)
- Poisson distribution
- Gamma

Less sensitive than the Tail-Fit method to having few data points (able to model high mass-accuracy MS data).
Assumes that the distribution of scores (except for the correct match) is random.

Use quantile-quantile plots (Q-Q plots) to determine the appropriateness of a model distribution.
- A quantile is the fraction of points below a given value.
- Plot the quantiles from the incorrect distribution against the quantiles of the model.
- If the data are both from the same distribution, they will fall along the 45-degree line of the plot.
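A method-of-moments fit and the points for a Q-Q plot can be sketched as follows. The sketch assumes the extreme value distribution meant here is the Gumbel (type I, maximum) form, whose mean is mu + gamma * beta and whose variance is (pi * beta)^2 / 6; the function names, the plotting positions (i + 0.5) / n and the simulated scores are illustrative choices, not taken from the poster.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments estimates (mu, beta) for a Gumbel (maximum) distribution."""
    beta = statistics.stdev(scores) * math.sqrt(6.0) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_survival(score, mu, beta):
    """P(X >= score) under the fitted model, used as the p-value for that score."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


def qq_points(scores, mu, beta):
    """Pairs (model quantile, observed score); points near y = x indicate a good fit."""
    n = len(scores)
    points = []
    for i, observed in enumerate(sorted(scores)):
        q = (i + 0.5) / n  # plotting position, strictly between 0 and 1
        points.append((mu - beta * math.log(-math.log(q)), observed))
    return points


def sample_gumbel(rng, mu, beta):
    """Inverse-transform sample; u is clamped away from 0 and 1 to keep the logs finite."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return mu - beta * math.log(-math.log(u))


# Simulated null scores (hypothetical data for illustration only).
rng = random.Random(0)
null_scores = [sample_gumbel(rng, 20.0, 3.0) for _ in range(5000)]
mu, beta = fit_gumbel_moments(null_scores)
p_value = gumbel_survival(45.0, mu, beta)
print(f"mu = {mu:.2f}, beta = {beta:.2f}, e-value at score 45 = {p_value * len(null_scores):.3e}")
```

Plotting the pairs returned by qq_points and comparing them with the 45-degree line gives the visual check described above.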

Modeling Using Extreme Value Distribution

[Figure: two panels. Experimental and model distributions (x-axis: peptide score); Q-Q plot of the experimental distribution against the model.]
Distribution of peptide scores from a single spectrum in a Prospector search. The red trace is the extreme value model of the experimental data. The plot of the peptide score quantiles against the extreme value distribution quantiles shows the appropriateness of using the distribution to model the experimental data.

Different Estimators

[Figure: distributions of -10 log(e-value) for linear tail fit, closed form maximum likelihood, and method of moments.]
Plot of all e-values from the top scoring peptide assignment from a Protein Prospector search, using three methods to calculate the e-values. The left-most peak in each distribution corresponds to cases where the top scoring peptide is not significantly different from a random, incorrect match.
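For comparison with the method of moments, a maximum likelihood fit can be sketched as below. The poster does not spell out its "closed form" maximum likelihood estimator, so this sketch uses the textbook likelihood equations for a Gumbel (maximum) distribution instead: beta solves beta = mean(x) - sum(x * w) / sum(w) with w = exp(-x / beta), and mu = -beta * log(mean(w)). The root is found by bisection; the function names and simulated data are assumptions.

```python
import math
import random


def fit_gumbel_mle(scores, iterations=60):
    """Maximum likelihood estimates (mu, beta) for a Gumbel (maximum) distribution."""
    n = len(scores)
    x_min = min(scores)
    shifted = [x - x_min for x in scores]  # all >= 0, so exp(-d / beta) never overflows
    mean_shifted = sum(shifted) / n
    if mean_shifted == 0.0:
        raise ValueError("all scores are identical")

    def g(beta):
        # Likelihood equation for beta, written so that its root is the estimate.
        weights = [math.exp(-d / beta) for d in shifted]
        weighted_mean = sum(d * w for d, w in zip(shifted, weights)) / sum(weights)
        return beta - mean_shifted + weighted_mean

    # g(lo) < 0 and g(hi) > 0 by construction, so bisection brackets the root.
    lo, hi = 1e-9 * mean_shifted, 2.0 * mean_shifted
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    mean_weight = sum(math.exp(-d / beta) for d in shifted) / n
    mu = x_min - beta * math.log(mean_weight)
    return mu, beta


def gumbel_e_value(score, mu, beta, n_candidates):
    """e-value = survival probability at the score * number of candidates."""
    p_value = -math.expm1(-math.exp(-(score - mu) / beta))
    return p_value * n_candidates


# Simulated null scores (hypothetical data for illustration only).
rng = random.Random(0)
null_scores = [20.0 - 3.0 * math.log(-math.log(min(max(rng.random(), 1e-12), 1.0 - 1e-12)))
               for _ in range(2000)]
mu, beta = fit_gumbel_mle(null_scores)
print(f"mu = {mu:.2f}, beta = {beta:.2f}, "
      f"e-value at score 45 = {gumbel_e_value(45.0, mu, beta, len(null_scores)):.3e}")
```

Running both estimators on the same null scores and comparing the resulting e-values for the top match is one way to reproduce the kind of comparison shown in the figure.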

Comparing Methods to Calculate e-values

[Figure: distributions of -10 log(e-value) for Linear Tail Fit, Method Of Moments, and Maximum Likelihood; left column, SwissProt database; right column, randomized SwissProt database.]
Results of Protein Prospector database searches comparing three methods for calculating e-values. The plots on the left are the e-values of the top scoring peptide from each of 327 spectra searched against the SwissProt database. The plots on the right are the same spectra searched against a randomized SwissProt database. For a random database, the -10 log(e-value) should be centered around 0.

The Linear Tail-Fit overestimates the reliability of the top peptide assignments, while the Method of Moments and Maximum Likelihood underestimate the reliability. Using the distribution of e-values from the search of the random database, a false positive rate can be calculated for the database search.

Conclusions

- The scores of the incorrect peptide assignments follow the extreme value distribution.
- Protein Prospector scores are better suited to linear tail-fits of log(survival) vs score.
- Linear Tail-Fit estimation of e-values overestimates the reliability of the assignment.
- Method of Moments and Maximum Likelihood methods of fitting extreme value distributions accurately model the observed data, but underestimate the reliability of the assignment.
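The randomized-database check can be sketched as follows. The poster does not give its formula for the false positive rate, so this sketch takes one simple reading: the fraction of top hits from the randomized-database search whose e-value passes a chosen threshold. The -10 * log10 transform follows the axis label of the plots; the function names, the threshold and the e-values listed are hypothetical.

```python
import math
import statistics


def neg10_log10(e_value):
    """Transform an e-value to the -10 * log10 scale used on the plot axis."""
    return -10.0 * math.log10(e_value)


def false_positive_rate(random_db_e_values, threshold):
    """Fraction of randomized-database top hits that would be accepted at the threshold."""
    accepted = sum(1 for e in random_db_e_values if e <= threshold)
    return accepted / len(random_db_e_values)


def center_of_transformed(random_db_e_values):
    """Median of -10 * log10(e-value); near 0 suggests well-calibrated e-values."""
    return statistics.median(neg10_log10(e) for e in random_db_e_values)


# Hypothetical e-values of the top hit per spectrum from a randomized-database search.
random_db_e_values = [0.02, 0.4, 0.8, 1.0, 1.3, 2.5, 3.0, 0.9]
print("false positive rate at 0.05:", false_positive_rate(random_db_e_values, 0.05))
print("median of -10 log10(e-value):", center_of_transformed(random_db_e_values))
```

A median well above 0 on the transformed scale would indicate an estimator that overstates reliability, and a median well below 0 one that understates it, matching the comparison described above.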

Future Work

- Add the ability to calculate e-values into Protein Prospector. Method for calculation to be determined.
- Use e-values to improve Protein Prospector's Discriminant Score.

Acknowledgements

NIH NCRR grant RR01614
Vincent Coates Foundation

References

1. Bradshaw, R. A. (2005). Revised draft guidelines for proteomic data publications. Molecular & Cellular Proteomics 4(9):1223-5.
2. Perkins, D. N., Pappin, D. J., et al. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18):3551-67.
3. Geer, L. Y., et al. (2004). Open Mass Spectrometry Search Algorithm. J Proteome Research 3:958-964.
4. Craig, R. and Beavis, R. C. (2004). TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20(9):1466-7.
5. Fenyo, D. and Beavis, R. C. (2003). A Method for Assessing the Statistical Significance of MS Based Protein IDs Using General Scoring Schemes. Anal Chem 75:768-774.