Data Extraction and Analysis for LC-MS Based Proteomics

1 Data Extraction and Analysis for LC-MS Based Proteomics. Instructors: Gordon Anderson, Charles Ansong, Matthew Monroe, and Ashoka Polpitiya. Pacific Northwest National Laboratory, Richland, WA 99354

2 Course Outline Part I: Introduction and Overview of Label-Free Quantitative Proteomics (Anderson) Goals Data and Tools Availability Quantitative Proteomics: Historical Perspective Part II: Feature Discovery in LC-MS Datasets (Monroe and Polpitiya) Break Part III: Biological Application of the AMT tag Approach (Ansong) AMT tag Software Demo Panel Discussion Questions Future Directions

3 Course Goals. Understand the reasons for developing and applying an LC-MS-based approach to proteomics. Discuss considerations of experimental design for larger-scale experiments. Develop a sense of the source of information, its relative complexity, and the algorithms required to make use of this approach. See (and participate in) a demonstration of the critical tools applied to real data. Learn where to get more information.

4 Pacific Northwest National Laboratory Environmental Molecular Sciences Laboratory Washington Wine Country

5 Pacific Northwest National Laboratory and EMSL PNNL performs basic and applied research to deliver energy, environmental, and national security solutions for our nation. W.R. Wiley Environmental Molecular Sciences Laboratory EMSL Mission The W.R. Wiley Environmental Molecular Sciences Laboratory (EMSL), a national scientific user facility at Pacific Northwest National Laboratory, provides integrated experimental and computational resources for discovery and technological innovation in the environmental molecular sciences to support the needs of DOE and the nation. The Guest House at PNNL for EMSL Users To find out more and request access to the resource:

6 History/Evolution of PNNL Proteomics MS and Separations Based Technology Development Group EMSL User Program First AMT Paper Automated UPLC NIH-NCRR Supports cutting edge applications ICR-2LS made public LCMS WARP 4-column routine use Decon2LS DAnTE MultiAlign Pre-generated AMT tag databases for public use Interactomics AMT approach conceived and used PRISM architecture online DOE-BER support for production VIPER Automated NIH-NIA Biodefense Proteomics Research Center AMT tag pipeline tools made public DeconMSn SMART IMS Integrated Top-down/Bottom-up

7 AMT tag Approach Publication Trends. Peer-reviewed applications, reviews, and software specific to the AMT tag approach. [Chart: publications (number) by year.] Publications from PNNL and collaborators. Excludes non-AMT tag application papers and broader technology development papers.

8 PRISM Data Trends. Organisms: 145. Prepared samples: >60,000. LC-MS(/MS) analyses: >134,000. Automated software analyses: >350,000. Data files: 139 TB. Data in SQL Server databases: 1.4 TB. Over 1.5 billion mass spectra acquired. [Charts: organisms studied; datasets acquired (instrumental analyses); TB data stored in PRISM.] PRISM: Proteomics Research Information Storage and Management

9 Proteomics Informatics Architecture modular and loosely coupled for flexibility Web interface STARSuite Extractor Export tools Q Export MTS Explorer DMS MTS DAnTE Data Capture Manager Manager Manager Manager Manager Integrated & Automated LC- MS(/MS) Control SEQUEST X!Tandem InsPecT Peptide MASIC SICs NET Conversion Elution time alignment Decon2LS De-isotoping VIPER MultiAlign matching Data Archive PRISM: G.R. Kiebel et. al. Proteomics 2006, 6,

10 Motivations for Label-free LC-MS Proteomics Throughput, sensitivity, and sampling efficiency Compared to LC-MS/MS based approaches Shortcomings with chemical/labeling methods Multiple species need to be sampled for each peptide Potentially more sample preparation steps or increased cost Multiple analyses still required for statistical assessment New challenges for experimental design Statistical blocking and sample order randomization Helps to minimize the effects of systematic bias

11 Shotgun or MuDPIT Proteomics Complex mixture of proteins Upstream separations Parent MS spectra Tandem MS spectra SEQUEST, X!Tandem, or InsPecT with filtering LC-MS/MS C M. P. Washburn, D. Wolters, and J. R. Yates. Nature Biotechnology 2001, 19 (3),

12 LC-MS Information Funnel. Biological sample analyzed by LC-MS; ions detected at instrument; deisotoped (charge and monoisotopic mass determined); LC-MS features observed in adjacent spectra with a defined chromatographic peak shape; identified LC-MS features that match a peptide in an AMT tag database; LC-MS features that are observed in multiple, related datasets at roughly the same mass and time; biological knowledge.

13 The Need/Use for Increased Throughput Replicate analysis to account for natural biological and normal analytical variation Mutant WT smpb Hfq rpoe himd crp slya hnr phop/q Etc Biological Rep. Sample Prep Cell Fraction Analytical Rep. X4 Contrasting Conditions 1080 analyses for 15 mutants using biological pooling 360 analyses

14 Accurate Mass and Time Tag Approach SEQUEST, X!Tandem, or InsPecT results ing Calculate exact mass observed elution time High-throughput LC-FTICR-MS with AMT tags Complex samples μlc- FTICR-MS -Matched Results Example: V.A. Petyuk, et al. Genome Research 2007, 17 (3), Compare abundances across samples

15 Considerations for Large Scale Studies The need for blocking and randomization Column effects (PNNL operates 4 column systems) Elution time variability, potential for carryover, and stationary phase life span Electrospray emitters, wear, clogging, etc. Mass Spectrometer Calibration, detector response, tuning, etc. Samples Oxidation, degradation, and other chemical modifications QA/QC to assess system performance
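
As an illustration of blocking with randomized run order, the sketch below assigns every condition once to each block and shuffles the order within the block. The condition names, the number of blocks, and the function name are hypothetical; a block could stand for one LC column or one acquisition day, and this is only one simple way to lay out such a design.

```python
import random

def randomized_block_order(conditions, n_blocks, seed=0):
    """Return a run order as (block, condition) pairs.

    Each block contains every condition exactly once (blocking), and the
    order inside each block is shuffled (randomization).
    """
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    run_order = []
    for block in range(1, n_blocks + 1):
        shuffled = list(conditions)
        rng.shuffle(shuffled)
        run_order.extend((block, condition) for condition in shuffled)
    return run_order

# Hypothetical conditions; four blocks, e.g. one per LC column.
conditions = ["WT", "mutant_A", "mutant_B", "mutant_C"]
order = randomized_block_order(conditions, n_blocks=4)
for block, condition in order:
    print(block, condition)
```

Because each block holds a complete set of conditions, a systematic effect tied to the block (for example a column with shifted elution times) affects all conditions equally rather than being confounded with one of them.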

16 Accurate Mass and Time (AMT) Tag Data Processing Pipeline Complex Protein Mixture to Enable High Throughput Data Tryptic ion Peptide Mixture LC-MS/MS Measurements Extensive Fractionation LC-MS/MS Datasets SEQUEST X!Tandem InsPecT MASIC DeconMSn Peptide Identification Predicted Peptide s s Peptide AMT tag tag Database DAnTE Normalization Protein Select Appropriate High Throughput Proteomics LC-MS Measurements LC-MS Datasets Lists WARP net alignment/ mass calibration Masses and and NETs Diagnostic MultiAlign across samples QA/QC trends Decon2LS tope VIPER SMART Target unidentified J.S. Zimmer et. al. Mass. Spectrom. Rev. 2006, 25 (3),

17 Recent Examples of Successful Applications using LC-MS Proteomics Approaches NIA: Salmonella infecting host cells; small sample quantities whole proteome coverage L. Shi, J.N. Adkins, et al., J. of Biological Chem. 2006, 281, of Voxels from mouse brains to reveal protein abundance patterns in brain structures V.A. Petyuk, et al. Genome Research 2007, 17 (3), The Mammary Epithelial Cell Secretome and Its Regulation by Signal Transduction Pathways J.K. Jacobs, et al. J. Proteome Res. 2008, 7 (2), of purified viral particles of Monkeypox and Vaccinia viruses N.P. Manes, et al. J. Proteome Res. 2008, 7 (3),

18 Course Related Software & Data PNNL's Data and Software Distribution Website PNNL's NCRR website Salmonella Typhimurium data resource

19 Selected Software Resources (Magnus Palmblad) (European consortium) (ISB) (Tobias Kind with Oliver Fiehn) (Phil Andrews and Jayson Falkner) (Broad Institute) (FHCRC)

20 Quantitative Proteomics Historical Perspective Microbes sequence GBPs Proteomics publications 2.9 GBPs 15,000 Sequenced microbes Quantitative Proteomics publications SEQUEST TIGR, NCBI GeneBank Human genome project * PGF AMT MudPIT 13 organisms sequenced JGI formed 1 st organism genome Protein prophet, decoy strategies Peptide prophet * Separations with accurate mass MS, 1996

21 Proteomics Workflow Cells / Tissue Sample Processing Instrument LC-MS LC-MS/MS Feature Extraction Identification Quantitative Purification Fractionation Protein extraction ion Labeling Spiking TOF Ion Traps Q-TOF TOF-TOF FTICR Orbitrap M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster, "Quantitative mass spectrometry in proteomics: a critical review," Anal. Bioanal. Chem. 2007, 389 (4), Number of proteins in sample identified quantified Protein concentration

22 Identification Strategies. Proteomics: MS (Peptide Mass Fingerprinting, or PMF), for low-complexity mixtures; MS/MS (Peptide Fragment Fingerprinting, or PFF), with a comprehensive tool set available; Accurate Mass and Time (AMT) tag approach, which requires a database of peptide masses and LC elution times and offers high throughput. Validation: peptide confidence; peptide-to-protein assignment; protein identification confidence. Metabolomics: identification tools are less mature; accurate mass can be used to determine molecular formula; structural determination by manual analysis of MS/MS spectra or NMR analysis.

23 Pros and Cons of PMF/PFF Strategies R. Matthiesen, "Methods, algorithms and tools in computational proteomics: a practical point of view," Proteomics 2007, 7 (16),

24 Quantitation Strategies Proteomics Label based (Relative / Absolute) Metabolic labeling Chemical labeling Enzymatic labeling Label free (Relative / Absolute) Peptide to protein rollup Degenerate peptide problem Normalization methods Metabolomics Primarily label free approaches Does not suffer from the rollup challenge

25 Quantitation Strategies M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster, "Quantitative mass spectrometry in proteomics: a critical review," Anal. Bioanal. Chem. 2007, 389 (4),

26 Quantitation Strategies Target Name of method or reagent Isotopes Metabolic stable-isotope labeling None N-labeling ( N-ammonium salt) 15 N Stable isotope labeling by amino acids in cell culture (SILAC) D, 13 C, 15 N Culture-derived isotope tags (CDIT) D, 13 C, 15 N Bioorthogonal noncanonical amino acid tagging (BONCAT) No isotope Isotope tagging by chemical reaction Sulfhydryl Isotope-coded affinity tagging (ICAT) D 13 Cleavable ICAT C 13 Catch-and-release (CAR) C Acrylamide D Isotope-coded reduction off of a chromatographic support (ICROC) D 2-vinyl-pyridine N-t-butyliodoacetamide Iodoacetanilide HysTag Solid-phase ICAT Visible isotope-coded affinity tags (VICAT) D D D D D C, C and Acid-labile isotope-coded extractants (ALICE) D 13 Solid phase mass tagging C Amines Tandem mass tag (TMT) D Succinic anhydride D N-acetoxysuccinamide D N-acetoxysuccinamide: In-gel Stable-Isotope Labeling (ISIL) D Acetic anahydride D Proprionic anhydride D Nicotinoyloxy succinimide (Nic-NHS) D Isotope-coded protein labeling (ICPL,Nic-NHS) D Phenyl isocyanate Isotope-coded n-terminal sulfonation (ICens) 4-sulphophenyl D or 13 C 13 C isothiocyanate (SPITC) Sulfo-NHS-SS-biotin and 13C,D3-methyl iodide 13 C and D Formaldehyde D Isobaric tag for realtive and absolute quantification (itraq) C, N and Lysines N-terminus protein N-terminus peptide Benzoic acid labeling (BA part of ANIBAL) Guanidination (O-methyl-isourea) mass-coded abundance tagging (MCAT) Guanidination (O-methyl-isourea) Quantitation using enhanced sequence tags (QUEST) 2-Methoxy-4,5-1H-imidazole Differentially isotope-coded N-terminal protein sulphonation (SPITC) N-terminal stable-isotope labelling of tryptic peptides (pentafluorophenyl-4-anilino-4-oxobutanoate) Carboxyl Methyl esterification D Ethyl esterification D C-terminal isotope-coded tagging using sulfanilic acid (SA) Aniline labeling (ANI part of ANIBAL) Indole 2-nitrobenzenesulfenyl chloride (NBSCI) 15 N 18 O 13 C No isotope C and N No isotope D 13 C D or 13 C 
13 C 13 C 13 C Target Name of method or reagent Isotopes Stable-isotope incorporation via enzyme reaction C-terminus peptide Proteolytic 18 O-labeling (H2 18 O) 18 O Quantitative cysteinyl-peptide enrichment technology (QCET) Absolute quantification None Absolute quantification (AQUA) D, 13 C, 15 N Multiplexed absolute quantification (QCAT) D, 13 C, 15 N Multiplexed absolute quantification using concatenated signature D, 13 C, 15 N (QconCAT) Stable Isotope Standards and Capture by Anti-Peptide Antibodies D, 13 C, 15 N (SISCAPA) Label-free quantification None XIC-based quantification No isotope Spectrum sampling (SpS) No isotope Protein abundance index (PAI) No isotope Exponentially modified protein abundance index (empai) No isotope Probabilistic peptide scores (PMSS) No isotope A. Panchaud, M. Affolter, P. Moreillon, and M. Kussmann, "Experimental and computational approaches to quantitative proteomics: status quo and outlook," J. Proteomics 2008, 71 (1), O

27 Validation Measurement validation Peptide/Protein Identification Confidence algorithms Statistical models Quantitation Less mature than identification confidence Functional validation Western blots Gene knockout Protein assays Protein chemistry However, all measure something different

28 Active Software Development to Address Challenges Large array of available tools No universal analysis workflow Tool functional categories Peptide Identification confidence SMART, epic (PNNL active research) Quantitation A. Polpitiya et al., "DAnTE: a statistical tool for quantitative analysis of -omics data," Bioinformatics 2008, 24 (13), Data management / metadata capture Workflow automation

29 Community Development a) Semi-commercial or must contact author b) Freely available on the internet c) Commercial or not available d) Applied Biosystems e) Bioinformatics Solutions R. Matthiesen, "Methods, algorithms and tools in computational proteomics: a practical point of view," Proteomics 2007, 7 (16),

30 Software Platforms for Label-free Quantitation PNNL Pipeline PEPPeR msinspect SuperHirn CRAWDAD Lab PNNL Broad Institute FHCRC IMSB (Swiss) Univ. Wash. Feature Picker Decon2LS/Viper Mapquant msinspect SuperHirn CRAWDAD (or any other) Method Spectrum deisotoping then clustering Image then deisotoping Wavelet decomposition then de- Spectrum deisotoping then merging m/z channel binning RT Normalization, then linear or LCMSWARP Relative, then linear, or LOESS (exp) isotoping Iterative nonlinear transformation LOESS modeling Dynamic time warping m/z recalibration Yes (dynamic) Yes (quadratic) No No No Assignment of s to Statistical Evaluation of assignment Unidentified Feature Recognition Runs on AMT database, normalized elution times Mass shift decoy and/or Bayesian Statistics Stored in database for later analysis Windows with GUI AMT database, relative elution order (Landmarks) Bayesian Statistics Data-dependent tolerance-based clustering Web-based (Linux or Windows install bases) AMT database through user interaction Yes, but not well documented at present Yes, for differences only if they exist No No No User specified tolerance-based clustering Tolerance-based merging, heuristics Difference mapping only Java with GUI Linux Linux/Windows

31 Confident Quantitative Results Credible results require Rigorous statistical models Validation Measurements Functions Full disclosure of procedures and methods Dissemination Data Custom analysis software tools Data standards and release policies are critical HUPO Proteomics Standards Initiative: L. Martens and H. Hermjakob, "Proteomics data validation: why all must provide data," Mol. Biosyst. 2007, 3 (8),

32 References A. Panchaud, M. Affolter, P. Moreillon, and M. Kussmann, "Experimental and computational approaches to quantitative proteomics: status quo and outlook," J. Proteomics 2008, 71 (1), A. Honda, Y. Suzuki, and K. Suzuki, "Review of molecular modification techniques for improved detection of biomolecules by mass spectrometry," Anal. Chim. Acta. 2008, 623 (1), T.O. Metz, J.S. Page, E.S. Baker, K.Q. Tang, J. Ding, Y.F. Shen, and R.D. Smith, "High-resolution separations and improved ion production and transmission in metabolomics," Trac-Trends in Analytical Chemistry 2008, 27 (3), L. Martens and H. Hermjakob, "Proteomics data validation: why all must provide data," Mol. Biosyst. 2007, 3 (8), B.J. Webb-Robertson and W.R. Cannon, "Current trends in computational inference from mass spectrometry-based proteomics," Brief Bioinform. 2007, 8 (5), T.O. Metz, Q. Zhang, J.S. Page, Y. Shen, S.J. Callister, J.M. Jacobs, and R.D. Smith, "The future of liquid chromatography-mass spectrometry (LC-MS) in metabolic profiling and metabolomic studies for biomarker discovery," Biomark. Med. 2007, 1 (1),

33 References R. Matthiesen, "Methods, algorithms and tools in computational proteomics: a practical point of view," Proteomics 2007, 7 (16), M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster, "Quantitative mass spectrometry in proteomics: a critical review," Anal. Bioanal. Chem. 2007, 389 (4), W. Urfer, M. Grzegorczyk, and K. Jung, "Statistics for proteomics: a review of tools for analyzing experimental data," Proteomics 2006, 6 Suppl 2, P. Hernandez, M. Muller, and R.D. Appel, "Automated protein identification by tandem mass spectrometry: issues and strategies," Mass Spectrom. Rev. 2006, 25 (2), J. Peng, J.E. Elias, C.C. Thoreen, L.J. Licklider, and S.P. Gygi, "Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome," J. Proteome Res. 2003, 2 (1), S.A. Gerber, J. Rush, O. Stemman, M.W. Kirschner, and S.P. Gygi, "Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS," Proc. Natl. Acad. Sci. USA 2003, 100 (12), R. Aebersold and M. Mann, "Mass spectrometry-based proteomics," Nature. 2003, 422 (6928),

34 References R.D. Smith, G.A. Anderson, M.S. Lipton, L. Pasa-Tolic, Y. Shen, T.P. Conrads, T.D. Veenstra, and H.R. Udseth, "An accurate mass tag strategy for quantitative and high-throughput proteome measurements," Proteomics 2002, 2 (5), R.D. Smith, L. Pasa-Tolic, M.S. Lipton, P.K. Jensen, G.A. Anderson, Y. Shen, T.P. Conrads, H.R. Udseth, R. Harkewicz, M.E. Belov, C. Masselon, and T.D. Veenstra, "Rapid quantitative measurements of proteomes by Fourier transform ion cyclotron resonance mass spectrometry," Electrophoresis 2001, 22 (9), T.P. Conrads, K. Alving, T.D. Veenstra, M.E. Belov, G.A. Anderson, D.J. Anderson, M.S. Lipton, L. Pasa-Tolic, H.R. Udseth, W.B. Chrisler, B.D. Thrall, and R.D. Smith, "Quantitative analysis of bacterial and mammalian proteomes using a combination of cysteine affinity tags and 15 N-metabolic labeling," Anal. Chem. 2001, 73 (9), T.P. Conrads, G.A. Anderson, T.D. Veenstra, L. Pasa-Tolic, and R.D. Smith, "Utility of accurate mass tags for proteome-wide protein identification," Anal. Chem. 2000, 72 (14), J.K. Nicholson, J.C. Lindon, and E. Holmes, "Metabonomics: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data," Xenobiotica 1999, 29 (11),

35 Part II: LC-MS Feature Discovery. Part I: Introduction and Overview of Label-Free Quantitative Proteomics (Anderson). Part II: Feature Discovery in LC-MS Datasets (Monroe and Polpitiya): Structure of LC-MS data; Feature discovery in individual spectra (deisotoping); Feature definition over elution time; Identifying LC-MS features using an AMT tag database; Extending the AMT tag approach for feature-based analyses; Estimating confidence of identified LC-MS features; Quantitative analysis with DAnTE. Break. Part III: Biological Application of the AMT tag Approach (Ansong). AMT tag Software Demo. Panel Discussion.

36 Feature Discovery in LC-MS Datasets. Several stages of processing are required to extract biological knowledge from raw LC-MS data. [Figure: two-dimensional views (m/z vs. scan #) of an LC-MS dataset in different stages of data processing: raw data; deisotoping (monoisotopic masses in each mass spectrum); elution profile discovery (LC-MS features characterized).]

37 Structure of LC-MS Data. Mass spectra capture the changing composition of peptides eluting from a chromatographic column. A complex peptide mixture on a column is separated by liquid chromatography over a period of time. The changing composition of the mobile phase causes different peptides to elute at different times. The components eluting from the column are sampled continuously by sequential mass spectra. [Figure: % mobile phase B vs. elution time (min), with example FTMS full-scan spectra (relative abundance vs. m/z) from dataset kolker_19oct04_pegasus_0804-4_ft100k-res at scans #265, #498, and #991, and a two-dimensional view of m/z vs. scan #.]

38 Structure of LC-MS Data. Each compound is observed as an isotopic pattern in a mass spectrum. The pattern depends on the compound's chemical composition, its charge, and the resolution of the instrument. [Figure: theoretical isotopic profile (intensity vs. m/z) for peptide VKHPSEIVNVGDEINVK, parent protein ribosomal protein S1, annotated with m/z, charge, monoisotopic mass (Da), and elution range (scans).]

39 Structure of LC-MS Data. A mass spectrum of a complex mixture contains overlaid distributions of several different compounds. [Figure: a single scan, intensity vs. m/z.]

40 Structure of LC-MS Data. With LC as the first dimension, each compound is observed over multiple spectra, showing a three-dimensional pattern of m/z, elution time, and abundance. Goal: infer the mass, elution time, and intensity of the compounds that are present in the LC-MS dataset. Compounds are termed LC-MS features since they are inferred from a 3D pattern, yet their identity is unknown. [Figure: m/z vs. elution time (scan) for peptide VKHPSEIVNVGDEINVK, parent protein ribosomal protein S1, annotated with m/z, charge, monoisotopic mass (Da), and elution range (scans).]

41 Feature Discovery in Individual Spectra. Deisotoping: the process of converting a mass spectrum (m/z and intensity) into a list of species (mass, abundance, and charge). [Figure: deisotoping a mass spectrum (intensity vs. m/z) of 4 overlapping species into a table of charge, monoisotopic MW, and abundance.]

42 Deisotoping an Isotopic Distribution. The Decon2LS deisotoping algorithm [1] compares theoretical isotopic patterns with observed patterns. Steps shown: the observed spectrum is passed to a charge detection algorithm [2] (here, charge = 2, with an average mass); Averagine [3] gives an estimated empirical formula (here, C83H124N23O25S1); Mercury [4] generates the theoretical spectrum; the theoretical and observed spectra are compared to give a fitness value. [1] Horn, D.M., Zubarev, R.A., McLafferty, F.W. Automated Reduction and Interpretation of High Resolution Electrospray Mass Spectra of Large Molecules. J. Am. Soc. Mass Spectrom. 2000, 11. [2] Senko, M. W.; Beu, S. C.; McLafferty, F. W. Automated assignment of charge states from resolved isotopic peaks for multiply charged ions. J. Am. Soc. Mass Spectrom. 1995, 6. [3] Senko, M. W.; Beu, S. C.; McLafferty, F. W. Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J. Am. Soc. Mass Spectrom. 1995, 6. [4] Rockwood, A. L.; Van Orden, S. L.; Smith, R. D. Rapid Calculation of Isotope Distributions. Anal. Chem. 1995, 67.

43 Deisotoping an Isotopic Distribution. Patterson (autocorrelation) algorithm to detect the charge of a peak in a complex spectrum. Averagine used to guess an average empirical formula for a given mass by scaling an average-residue formula in C, H, N, O, and S (e.g., C83H124N23O25S for the example mass). Mercury algorithm used to compute the theoretical isotopic distribution for that formula. Fitness (fit) functions to quantitate the quality of the match between theoretical and observed profiles. For additional details, see the slides presented at 2007 US HUPO.
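
The averagine step can be sketched as follows. This is a minimal illustration, not the Decon2LS implementation: the per-residue averagine coefficients are the commonly quoted values attributed to Senko et al. (1995), the average atomic masses are standard values, and adjusting the hydrogen count to absorb the leftover mass is one common variant. The input mass of 1876 Da is an illustrative choice, not a value taken from the slide.

```python
# Averagine composition per average residue (assumed literature values).
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
# Average atomic masses in Da.
AVG_ATOMIC_MASS = {"C": 12.0107, "H": 1.00794, "N": 14.0067,
                   "O": 15.9994, "S": 32.065}

def averagine_formula(avg_mass):
    """Estimate an empirical formula for a peptide of the given average mass."""
    unit_mass = sum(AVERAGINE[e] * AVG_ATOMIC_MASS[e] for e in AVERAGINE)
    n_units = avg_mass / unit_mass  # number of average residues
    # Scale and round the heavy atoms first.
    formula = {e: round(AVERAGINE[e] * n_units) for e in ("C", "N", "O", "S")}
    heavy_mass = sum(formula[e] * AVG_ATOMIC_MASS[e] for e in formula)
    # Use hydrogens to make up the remaining mass.
    formula["H"] = max(0, round((avg_mass - heavy_mass) / AVG_ATOMIC_MASS["H"]))
    return formula

example = averagine_formula(1876.0)  # illustrative mass
print(example)
```

The estimated formula would then be handed to an isotope-distribution calculator (Mercury on the slides) to produce the theoretical profile that is scored against the observed peaks.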

44 16O/18O Mixtures. Overlapping isotope patterns are separated by 4 Da. This creates challenges for deisotoping, particularly for charge states of 3+ or higher. [Figure: intensity vs. m/z, with peak spacings (d) annotated.]

45 Isotopic Composition. Deviation from natural abundances: in 13C-, 15N-depleted media, the isotopic composition of atoms is different from that found in nature. E.g., sulfur isotopes predominate in the distribution at right. Contrast with an isotopic distribution of a peptide with similar mass and charge (16+) but a natural atomic distribution (below). [Figures: intensity vs. m/z for a sulfur-containing peptide and for the natural-abundance comparison, with peak spacings (d) annotated.]

46 Isotopic Composition. Decon2LS supports changing the isotope composition.

48 Feature Definition over Elution Time. Deisotoping collapses the original data into data lists with columns: scan num, charge, abundance, mz, fit, average mw, monoiso mw, most abu. mw, fwhm, signal noise. Goal: given a series of deisotoped mass spectra, group related data across elution time. Look for repeated monoisotopic mass values in sequential spectra, allowing for missing data. Can also look for the expected chromatographic peak shape.
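
A minimal sketch of grouping deisotoped peaks across elution time is shown below. It is a simple greedy pass, not the clustering used by VIPER: a peak joins an existing group when its monoisotopic mass agrees within a ppm tolerance and no more than a few scans are missing since the group was last seen. The tolerance values, the function name, and the example peaks are all hypothetical.

```python
def group_peaks(peaks, ppm_tol=10.0, max_gap=2):
    """Group (scan, monoisotopic_mass, abundance) peaks into features.

    A peak is added to a feature if its mass is within ppm_tol of the
    feature's mean mass and at most max_gap scans were skipped.
    """
    features = []
    for scan, mass, abundance in sorted(peaks):
        target = None
        for feature in features:
            gap_ok = 0 <= scan - feature["last_scan"] <= max_gap + 1
            ppm = abs(mass - feature["mass"]) / feature["mass"] * 1e6
            if gap_ok and ppm <= ppm_tol:
                target = feature
                break
        if target is None:
            target = {"mass": mass, "last_scan": scan, "peaks": []}
            features.append(target)
        target["peaks"].append((scan, mass, abundance))
        target["last_scan"] = scan
        # Keep the feature mass as the mean of its member masses.
        target["mass"] = sum(p[1] for p in target["peaks"]) / len(target["peaks"])
    return features

# Hypothetical deisotoped peaks: (scan, monoisotopic mass in Da, abundance).
peaks = [(1, 1000.000, 5.0), (2, 1000.002, 8.0), (4, 1000.001, 6.0),
         (1, 1500.000, 3.0), (9, 1000.000, 2.0)]
features = group_peaks(peaks)
for f in features:
    print(round(f["mass"], 4), [p[0] for p in f["peaks"]])
```

In this example the peak at scan 4 still joins the first group even though scan 3 is missing, while the peak at scan 9 starts a new group because the gap is too long.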

49 Feature Definition over Elution Time. Can visualize deisotoped data in two dimensions by plotting monoisotopic mass vs. time. Color is based on the charge of the original data point seen. Monoisotopic Mass = (m/z x charge) - (proton mass x charge), with proton mass approximately 1.00728 Da. [Figure: mass vs. time.]
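
The conversion from an observed m/z and charge to a neutral monoisotopic mass can be sketched as below, assuming positive ions formed by protonation. The function names are illustrative.

```python
PROTON_MASS = 1.00727646688  # Da

def monoisotopic_mass(mz, charge):
    """Neutral monoisotopic mass from the monoisotopic m/z and charge."""
    return mz * charge - PROTON_MASS * charge

def mz_from_mass(mass, charge):
    """Inverse relation: m/z expected for a neutral mass at a given charge."""
    return (mass + PROTON_MASS * charge) / charge

# The same neutral mass appears at different m/z for different charge states.
for z in (1, 2, 3):
    print(z, round(mz_from_mass(1876.0, z), 4))
```

This is why peaks from the 2+ and 3+ charge states of one peptide land on the same horizontal line in a mass vs. time plot even though they sit at different m/z values in the raw spectra.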

50 Feature Definition over Elution Time. Zoom-in view of species. The same species in multiple spectra need to be grouped together. Related peaks are found using a weighted Euclidean distance that considers mass, abundance, elution time, and isotopic fit. In this example, 6 separate groups are determined.
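
A weighted Euclidean distance over those four attributes might look like the sketch below. The weights, the choice of ppm and log10 scaling, and all names are assumptions made for illustration; they are not VIPER's defaults.

```python
import math

# Hypothetical weights; a real tool would expose these as tunable options.
WEIGHTS = {"mass_ppm": 1.0, "log_abundance": 0.1, "scan": 0.01, "fit": 0.1}

def peak_vector(mass, abundance, scan, fit, ref_mass):
    """Put a deisotoped peak on comparable scales before measuring distance."""
    return {
        "mass_ppm": (mass - ref_mass) / ref_mass * 1e6,  # mass offset in ppm
        "log_abundance": math.log10(abundance),
        "scan": float(scan),
        "fit": fit,
    }

def weighted_distance(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance between two peak vectors."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))

ref = 1000.0
p1 = peak_vector(1000.000, 1e6, 100, 0.05, ref)
p2 = peak_vector(1000.002, 2e6, 101, 0.06, ref)  # same species, next scan
p3 = peak_vector(1000.050, 1e6, 100, 0.05, ref)  # 50 ppm away in mass
d12 = weighted_distance(p1, p2)
d13 = weighted_distance(p1, p3)
print(round(d12, 3), round(d13, 3))
```

Peaks whose pairwise distance falls under a chosen cutoff would be placed in the same group; the weights decide how much a mass disagreement counts relative to a difference in scan, abundance, or fit.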

51 Feature Definition over Elution Time. Feature detail. Median mass, in Da (more tolerant to outliers than the average). Elution time: scan 1757 (0.363 NET, aka normalized elution time). Abundance: 1.7x10^7 counts (area under the 2+ SIC). See both 2+ and 3+ data; stats typically come from the most abundant charge state. [Figure: monoisotopic mass vs. scan number by charge, with ppm scale; selected ion chromatograms, abundance (counts) vs. scan number, for 2+ data, 3+ data, and both.]

52 Feature Definition over Elution Time. Second example: an LC-MS feature eluting over 7.5 minutes. The clustering algorithm allows for missing data, which is common with chromatographic tailing.

53 Feature Definition over Elution Time. Second example, feature detail. Median mass, in Da. Elution time: scan 1809 (0.380 NET). Abundance: 8.7x10^7 counts (area under the 3+ SIC). See both 2+ and 3+ data, though 3+ is more prevalent. [Figure: monoisotopic mass vs. scan number by charge, with ppm scale; selected ion chromatograms, abundance (counts) vs. scan number, for 2+ data, 3+ data, and both.]

54 Feature Definition over Elution Time. Example: S. Typhimurium dataset on an 11T FTICR. 100-minute LC-MS analysis (3360 mass spectra). 67 cm, 150 μm I.D. column with 5 μm C18 particles. 78,641 deisotoped peaks, grouped into 5910 LC-MS features.

55 Isotopic Pairs Processing. Paired features typically have identical sequences, with and without an isotopic label; e.g., 16O/18O pairs have 4 Da spacing due to two 18O atoms. [Figure: control (16O water) and perturbed (18O water) samples; LC-FTICR-MS.]

56 Isotopic Pairs Processing. Paired feature example: 16O/18O data. Compute AR using the ratio of areas, or compute AR scan-by-scan and then average the AR values (members must co-elute). Pair #424 (charge used = 2): AR = 1.78 (light area / heavy area), or AR = 1.34 ± 0.2 (scan-by-scan). Pair #460 (charge used = 2): AR = 0.13 (light area / heavy area), or AR = 0.12 ± 0.02 (scan-by-scan). [Figures: monoisotopic mass (Da) and abundance vs. scan number for each pair.]
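
The two ways of computing AR can be compared with a short sketch. It assumes the light and heavy members have abundances sampled at the same scans, approximates each area by a plain sum over scans, and uses made-up abundance values; the function names are illustrative.

```python
from statistics import mean, stdev

def ar_from_areas(light, heavy):
    """AR as light area / heavy area, with area taken as the sum over scans."""
    return sum(light) / sum(heavy)

def ar_scan_by_scan(light, heavy):
    """AR computed in each scan, then averaged; returns (mean, stdev)."""
    ratios = [l / h for l, h in zip(light, heavy) if h > 0]
    return mean(ratios), stdev(ratios)

# Hypothetical co-eluting abundances for one light/heavy pair.
light = [1.0, 4.0, 2.0]
heavy = [1.0, 2.0, 2.0]
area_ar = ar_from_areas(light, heavy)
scan_ar, scan_sd = ar_scan_by_scan(light, heavy)
print(area_ar, round(scan_ar, 3), round(scan_sd, 3))
```

The area-based ratio is dominated by the most intense scans, while the scan-by-scan average weights every scan equally and also yields a spread, which is one reason the two values reported for a pair can differ.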

57 Feature Definition over Elution Time. Numerous options in VIPER for clustering data to form LC-MS features and for finding paired features.

59 Assembling an AMT tag Database. Accurate Mass and Time (AMT) tag: a unique peptide sequence whose monoisotopic mass and normalized elution time are accurately known. AMT tags also track any modified residues in the peptide. AMT tag database: a collection of AMT tags. AMT tag approach articles: R.D. Smith et al., Proteomics 2002, 2; J.S. Zimmer, M.E. Monroe et al., Mass Spec. Reviews 2006, 25; L. Shi, J.N. Adkins, et al., J. of Biological Chem. 2006, 281.

60 Assembling an AMT tag Database. What can we use an AMT tag database for? Query LC-MS/MS data to answer questions: How many distinct peptides were observed passing filter criteria? Which peptides were observed most often by LC-MS/MS? How many proteins had 2 or more partially or fully tryptic peptides? Correlate LC-MS features to the AMT tags: analyze multiple, related samples by LC-MS using a high mass accuracy mass spectrometer (e.g., a time course study, 5 data points with 3 points per sample). Characterize the LC-MS features: deisotope to obtain monoisotopic mass and charge; cluster in the time dimension to obtain abundance information. Match to AMT tags to identify peptides: align in mass and time dimensions; match mass and time of LC-MS features to mass and time of AMT tags.
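
The final matching step can be sketched as a tolerance search in the mass and NET dimensions. This brute-force version is only an illustration: the tolerances, identifiers, and values are hypothetical, and it assumes the dataset has already been aligned in mass and time.

```python
def match_features(features, amt_tags, ppm_tol=5.0, net_tol=0.02):
    """Return (feature_id, tag_id) pairs that agree in both mass and NET.

    features and amt_tags are lists of (id, monoisotopic_mass, net).
    """
    matches = []
    for feature_id, f_mass, f_net in features:
        for tag_id, t_mass, t_net in amt_tags:
            ppm_error = (f_mass - t_mass) / t_mass * 1e6
            if abs(ppm_error) <= ppm_tol and abs(f_net - t_net) <= net_tol:
                matches.append((feature_id, tag_id))
    return matches

# Hypothetical AMT tags and LC-MS features.
amt_tags = [(1, 1000.5000, 0.30), (2, 1500.7500, 0.45)]
features = [("A", 1000.5020, 0.305),  # close in mass and NET to tag 1
            ("B", 1500.7500, 0.60),   # right mass, wrong elution time
            ("C", 2000.0000, 0.10)]   # no tag nearby
matches = match_features(features, amt_tags)
print(matches)
```

Feature B shows why the time dimension matters: its mass agrees with tag 2, but the elution time does not, so it is left unidentified rather than assigned a peptide.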

61 Assembling an AMT tag Database. Characterizing AMT tags: analyze samples by LC-MS/MS, with 10-minute to 180-minute LC separations. Obtain 1000's of MS/MS fragmentation spectra for each sample. Analyze spectra using SEQUEST, X!Tandem, InsPecT, etc. Collate results into a list of peptide and protein matches.

62 Assembling an AMT tag Database. AMT tag example: R.VKHPSEIVNVGDEINVK.V, observed in dataset #19 of an SCX fractionation series. 3+ species; matches 30 b/y ions; X!Tandem hyperscore = 80; X!Tandem Log(E_Value) = -5.9. [Figure: MS/MS spectrum (relative abundance vs. m/z) from A_STM_019_110804_19_LTQ_16Dec04_Earth_ #11195, with annotated b and y ions.]

63 Assembling an AMT tag Database. AMT tag example: R.VKHPSEIVNVGDEINVK.V, observed in dataset #19 of an SCX fractionation series. 3+ species; matches 30 b/y ions; X!Tandem hyperscore = 80; X!Tandem Log(E_Value) = -5.9.

64 Assembling an AMT tag Database. Align related datasets using elution times of observed peptides. One option: utilize a NET prediction algorithm to create a theoretical dataset to align against. NET prediction uses the position and ordering of amino acid residues to predict normalized elution time. [Table: Peptide, X!Tandem Log(E_Value), Elution Time, Predicted NET; example peptides: R.AARPAKYSYVDENGETK.T, R.LVHGEEGLVAAKR.I, R.GIIKVGEEVEIVGIK.E, K.RFNDDGPILFIHTGGAPALFAYHPHV, K.KTGVLAQVQEALKGLDVR.E, R.KVAAQIPNGSTLFIGTTPEAVAHALLGHSNLR.I, R.TFAISPGHMNQLRAESIPEAVIAGASALVLTSYLVR.C.] K. Petritis, L.J. Kangas, P.L. Ferguson, et al., Analytical Chemistry 2003, 75; K. Petritis, L.J. Kangas, B. Yan, et al., Analytical Chemistry 2006, 78.

65 Assembling an AMT tag Database. Align related datasets using elution times of observed peptides. One option: utilize a NET prediction algorithm to create a theoretical dataset to align against. NET prediction uses the position and ordering of amino acid residues to predict normalized elution time. The alignment yields NET values based on observed elution times: Observed NET = Slope x (Observed Elution Time) + Intercept. Example: 506 unique peptides used for alignment, with a Log(E_Value) threshold of -6. [Figure: predicted NET vs. elution time (minutes), with linear fit and R^2.] For VKHPSEIVNVGDEINVK, the observed NET is 0.303.
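
The linear relation between elution time and NET can be fitted by ordinary least squares, as sketched below. The elution times and predicted NET values are invented for illustration, and the function names are hypothetical; a real alignment would use the hundreds of confidently identified peptides in the dataset.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def observed_net(elution_time, slope, intercept):
    """Observed NET = Slope x (Observed Elution Time) + Intercept."""
    return slope * elution_time + intercept

# Hypothetical data: elution times (minutes) and predicted NET values.
times = [10.0, 20.0, 30.0, 40.0]
predicted_net = [0.1, 0.2, 0.3, 0.4]
slope, intercept = fit_line(times, predicted_net)
print(round(slope, 4), round(intercept, 4), round(observed_net(25.0, slope, intercept), 4))
```

Once the slope and intercept are known for a dataset, every peptide observed in it can be assigned an observed NET, which puts elution times from different runs on a common scale.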

66 Assembling an AMT tag Database. AMT tag example: R.VKHPSEIVNVGDEINVK.V, observed in 7 (of 25) LC-MS/MS datasets in the SCX fractionation series, with X!Tandem hyperscores including 80, 69, 74, and 77 and an observed NET for each observation. Compute the monoisotopic mass (Da) and the average observed NET (with StDev).

67 Assembling an AMT tag Database. Mass and Time Tag Database: a repository for AMT tags. Stores mass, elution time, modified residues, and supporting information for each AMT tag. Allows samples of unknown composition to be matched quickly and efficiently, without needing to perform tandem MS. Assembled by analyzing a control set of samples, cataloging each peptide identification until subsequent analyses no longer provide new identifications. Table columns: MT Tag, Peptide, LC-MS/MS Obs. Count, Calculated Monoisotopic Mass, Average Observed NET, Observed NET StDev. Example peptides: MTGRELKPHDR, SSALNTLTNQK, HRDLLGATNPTLR, WVKVDGWDNFER, MYGHLKGEVAQER.

68 Assembling an AMT tag s High Throughput Proteomics LC-MS Datasets Lists Non-'d Mini AMT tag Database constructed from a relatively small number of datasets e.g. 25 SCX fractionation samples from S. Typhimurium, each analyzed by LC-MS/MS and then by X!Tandem Protein database: S_typhimurium_LT2_ proteins and 1.4 million residues >STM1834 putative YebN family transport protein (yebn) {Salmonella typhimurium LT2} MFAGGSDVFNGYPGQDVVMHFTATVLLAFGMSMDAFAASIGKGATLHKPKFSEALRTGLI FGAVETLTPLIGWGLGILASKFVLEWNHWIAFVLLIFLGGRMIIEGIRGGSDEDETPLRR HSFWLLVTTAIATSLDAMAVGVGLAFLQVNIIATALAIGCATLIMSTLGMMIGRFIGPML GKRAEILGGVVLIGIGVQILWTHFHG >STM S rrna m1g745 methyltransferase (rrma) {Salmonella typhimurium LT2} MSFTCPLCHQPLTQINNSVICPQRHQFDVAKEGYINLLPVQHKRSRDPGDSAEMMQARRA FLDAGHYQPLRDAVINLLRERLDQSATAILDIGCGEGYYTHAFAEALPGVTTFGLDVAKT AIKAAAKRYSQVKFCVASSHRLPFADASMDAVIRIYAPCKAQELARVVKPGGWVVTATPG PHHLMELKGLIYDEVRLHAPYTEQLDGFTLQQSTRLAYHMQLTAEAAVALLQMTPFAWRA RPDVWEQLAASAGLSCQTDFNLHLWQRNR

69 Assembling an AMT tag Database: Relationships. Minimum information required: a single table with mass and normalized elution time (NET): T_Mass_Tags (PK Mass_Tag_; Peptide; Monoisotopic_Mass; NET). Expanded schema: T_Mass_Tags (PK Mass_Tag_; Peptide; Monoisotopic_Mass); T_Mass_Tags_to_Protein_Map (PK,FK1 Mass_Tag_; PK,FK2 Ref_); T_ (PK Ref_; Reference; Description); T_Mass_Tags_NET (PK,FK1 Mass_Tag_; Avg_GANET; Cnt_GANET; StD_GANET). PK := Primary Key; FK := Foreign Key.

70 More info Assembling an AMT tag Microsoft Access Relationships Full schema to track individual peptide observations T Description T_ T_Mass_Tags T_Mass_Tags_to_Protein_Map PK Job Dataset Dataset_ Dataset_Created_DMS Dataset_Acq_Time_Start Dataset_Acq_Time_End Dataset_Scan_Count Experiment Campaign Organism Instrument_Class Instrument _Tool Parameter_File_Name Settings_File_Name Organism Name Protein_Collection_List Protein_Options_List Completed ResultType Separation_Sys_Type ScanTime_NET_Slope ScanTime_NET_Intercept ScanTime_NET_RSquared ScanTime_NET_Fit PK,FK1 PK FK1 FK2 T_Score_Discriminant Peptide_ Peptide_ T_Score_Sequest PK,FK1 _ Scan_Number Number_Of_Scans Charge_State MH Multiple_ Peptide Mass_Tag_ GANET_Obs Scan_Time Apex _Area _SN_Ratio Peptide_ XCorr DelCn Sp DelM PK,FK1 T_Score_XTandem Peptide_ PK Hyperscore Log_EValue DeltaCn2 Y_Score Y_Ions B_Score B_Ions DelM Intensity d_score Mass_Tag_ Peptide Monoisotopic_Mass Multiple_ Created Last_Affected Number_Of_ Peptide_Obs_Count_Passing_ High_d_Score High_Peptide_Prophet_Probability Mod_Count Mod_Description PMT_Quality_Score PK,FK1 T_Mass_Tags_NET Mass_Tag_ Min_GANET Max_GANET Avg_GANET Cnt_GANET StD_GANET StdError_GANET PNET PK Ref_ T_ PK,FK1 PK,FK2 Reference Description Protein_Sequence Protein_Residue_Count Monoisotopic_Mass Protein_Collection_ Last_Affected Mass_Tag_ Ref_ Mass_Tag_Name Cleavage_State Fragment_Number Fragment_Span Residue_Start Residue_End Repeat_Count Terminus_State Missed_Cleavage_Count V Set_Overview_Ex _Type _Set_ Extra_Info _Set_Name _Set_Description Peptide_Prophet_FScore Peptide_Prophet_Probability

71 More info Assembling an AMT tag Example data T_Mass_Tags Mass_Tag_ Peptide Monoisotopic_Mass VKHPSEIVNVGDEINVK T_Mass_Tags_NET Mass_Tag_ Avg_GANET Cnt_GANET StD_GANET E-03 T_ Mass Tag Scan Charge Peptide_ Peptide Job Number State R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V R.VKHPSEIVNVGDEINVK.V T_Score_XTandem Peptide_ Hyperscore Log(E_Value)

72 Assembling an AMT tag Database: Processing steps. Data flow: Thermo-Finnigan LTQ .Raw files -> MS/MS spectra files -> peptide results -> tab-delimited text files -> summarized result files -> Microsoft Access. (1) Convert to .dta files or a single _Dta.txt file using DeconMSn.exe. DeconMSn is similar to Thermo's Extract_MSn but has better support for data from LTQ-Orbitrap or LTQ-FT instruments. (2) Process the _Dta.txt file with X!Tandem, or the .dta files with SEQUEST. (3) Use the Peptide File Extractor to convert SEQUEST .Out files to Synopsis (_Syn.txt) files. (4) Convert X!Tandem .XML output files or SEQUEST _Syn.txt files to tab-delimited files using the Peptide Hit Results Processor (PHRP) application. (5) Align datasets using the MT Creator application. (6) Load into the database using MT Creator.

73 DeconMSn. Determines the monoisotopic mass and charge state of each parent ion chosen for fragmentation on a hybrid LC-MS/MS instrument, using Decon2LS algorithms. Replacement for the Extract_MSn.exe tool provided with SEQUEST and Bioworks.

74 More info Assembling an AMT tag Peptide Hit Results Processor (PHRP) relationships Results_Info Result_To_Seq_Map Seq_Info PK FK1 Result_ Unique_Seq_ Group_ Scan Charge Peptide_MH Peptide_Hyperscore Peptide_Expectation_Value_Log(e) Multiple_Protein_Count Peptide_Sequence DeltaCn2 y_score y_ions b_score b_ions Delta_Mass Peptide_Intensity_Log(I) PK,FK1 PK,FK2 PK,FK1 Unique_Seq_ Result_ Mod_Details Unique_Seq_ Mass_Correction_Tag Position PK PK,FK1 PK Unique_Seq_ Mod_Count Mod_Description Monoisotopic_Mass Seq_to_Protein_Map Unique_Seq_ Protein_Name Cleavage_State Terminus_State Protein_Expectation_Value_Log(e) Protein_Intensity_Log(I)

75 MT Creator. The MT Creator application allows external researchers to align multiple LC-MS/MS analyses and create a standalone AMT tag database.

76 Assembling an AMT tag Database. Database histograms, filtered on Log(E_Value) ≤ -2: Peptide Mass Histogram (frequency vs. peptide mass); X!Tandem Hyperscore Histogram (frequency vs. hyperscore); NET Histogram (frequency vs. normalized elution time).

77 AMT tag Database Growth Trend. Trend for the Mini AMT tag Database: 25 SCX fractionation datasets of a single growth condition (peptide count vs. dataset count, filtered on Log(E_Value) ≤ -2). Trend for a Mature AMT tag Database: 521 different samples from several different growth conditions (peptide count vs. dataset count, filtered on Peptide Prophet Probability). The slope of each curve decreases as more datasets are added and fewer new peptides are seen.

78 Identifying LC-MS Features. VIPER software: visualize and find features in LC-MS data; match features to peptides (AMT tags); graphical user interface and automated analysis mode.

79 Identifying LC-MS Features: Steps. (1) Load LC-MS peak lists from Decon2LS data. (2) Feature definition over elution time. (3) Select AMT tags to match against. (4) Optionally, find paired features (e.g. 16O/18O pairs). (5) Align LC-MS features to AMT tags using LCMSWarp. (6) Broad AMT tag search. (7) Search tolerance refinement. (8) Final AMT tag search. (9) Report results.

80 Identifying LC-MS Features. AMT tag database selection: connect to the Mass Tag System (MTS) if inside PNNL, or use a standalone Microsoft Access database.

81 Alignment using LCMSWarp. Align the scan number (i.e. elution time) of features to the NETs of peptides in a given AMT tag database. Match the mass and NET of AMT tags to the mass and scan number of MS features. Use the LCMSWarp algorithm to find the optimal alignment that gives the most matches. Plots: AMT tags (calculated monoisotopic mass vs. average observed NET) and LC-MS features (monoisotopic mass vs. observed scan number).

82 Alignment using LCMSWarp. LCMSWarp computes a similarity score from conserved local mass and retention time patterns. Plot: score vs. scan number; best score at scan = 1113, shift = 113. N. Jaitly, M.E. Monroe, et al., Analytical Chemistry 2006, 78.

83 Alignment using LCMSWarp. Similarity scores between LC-MS features and AMT tags are used to generate a score graph of similarity. The best alignment is found using a dynamic programming algorithm that determines the transformation function with maximum likelihood. Figure (S. Typhimurium on 11T): heatmap of similarity score between LC-MS features and AMT tags (z-score representation), AMT tag NET vs. MS scan number, with the alignment function overlaid. N. Jaitly, M.E. Monroe, et al., Analytical Chemistry 2006, 78.

84 Alignment using LCMSWarp. The transformation function is used to convert from scan number to NET; features centered at the same scan number get the same observed NET value. When matching LC-MS features to AMTs, we search +/- a NET tolerance, which effectively allows LC-MS features to shift slightly in elution time. Plots: AMT tag NET vs. LC-MS feature scan; LC-MS feature NET vs. LC-MS feature scan.

85 Alignment using LCMSWarp. NET residual: the difference between the NET of an LC-MS feature and the NET of its matching AMT tag; it indicates the quality of alignment between features and AMT tags. This data shows nearly linear alignment between features and AMTs, but the algorithm can easily account for non-linear trends. Plots (S. Typhimurium on 11T): AMT tag NET vs. MS scan number; NET residuals if a linear mapping is used; NET residuals after LCMSWarp.

86 Alignment using LCMSWarp. Non-linear alignment example #1: identical LC separation system, but with column flow irregularities. Plots (S. Typhimurium on 9T): AMT tag NET vs. MS scan number; NET residuals if a linear mapping is used; NET residuals after LCMSWarp.

87 Alignment using LCMSWarp. Non-linear alignment example #2: AMT tag database from C18 LC-MS/MS analyses using an ISCO-based LC (exponential dilution gradient); the LC-MS analysis used C18 LC-MS via an Agilent linear gradient pump. Plots (S. oneidensis on LTQ-Orbitrap): NET residuals if a linear mapping is used; NET residuals after LCMSWarp.

88 Alignment using LCMSWarp. Non-linear alignment example #3: AMT tag database from C18 LC-MS/MS analyses using an ISCO-based LC; the LC-MS analysis used C18 LC-MS via an Agilent linear gradient pump. Plots (QC Standards, 12 protein digest, on LTQ-Orbitrap): NET residuals if a linear mapping is used; NET residuals after LCMSWarp.

89 Alignment using LCMSWarp. LCMSWarp is fast and robust. The previous method used least-squares regression, iterating through a large range of guesses (slow, and often gave poor alignment). LCMSWarp requires that a reasonable number of LC-MS features match the AMT tag database. Examples: S. Typhimurium on 11T matched against 18,617 S. Typhimurium PMTs; S. Typhimurium on 11T matched against 65,193 S. oneidensis PMTs.

90 Alignment using LCMSWarp. In addition to aligning data in time, we can also recalibrate the masses of the LC-MS features. This is possible because mass and time values are available for both LC-MS features and AMT tags. Two options for mass recalibration: bulk linear correction, or piece-wise correction via LCMSWarp. Visualize mass differences using a mass error histogram or a mass residual plot.

91 Mass Error Histogram. List of binned mass error values, where mass error is the difference between a feature's mass and the matching AMT tag's mass. Bin the values to generate a histogram. A background false positive level is typically observed. Table columns: LC-MS Feature Mass (Da), AMT tag Mass (Da), Delta Mass (Da), Mass Error (ppm). Match tolerances: mass ±25 ppm; NET ±0.05 NET. Histogram: count (LC-MS features) vs. mass error (ppm), annotated with likely true positive identifications (peak) and likely false positive identifications (background).

92 Mass Calibration. Option 1: bulk linear correction. Use the location of the peak in the mass error histogram to adjust the masses of all features. Shift by a ppm mass; the absolute shift amount increases as monoisotopic mass increases. Histogram peak (count of LC-MS features vs. mass error in ppm): center of mass 11.6 ppm; width 2 ppm at 60% of max; height 404 counts/bin. Shift all masses by -11.6 ppm: $\Delta m = -11.6\ \text{ppm} \times m_{\text{old}} / (1 \times 10^{6}\ \text{ppm/Da})$.
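The bulk linear correction amounts to subtracting a fixed ppm offset from every feature mass. A minimal sketch, assuming the offset is the center of the mass error histogram peak (11.6 ppm in the slide's example); the function names are hypothetical.

```python
def ppm_to_da(ppm, mass):
    """Convert a ppm offset into an absolute offset (Da) at the given mass."""
    return ppm * mass / 1e6


def bulk_linear_correction(masses, histogram_peak_ppm):
    """Shift every mass by minus the histogram peak position.

    The absolute shift grows with mass because the offset is expressed in ppm.
    """
    return [m - ppm_to_da(histogram_peak_ppm, m) for m in masses]
```

With a peak at 11.6 ppm, a feature at 1000 Da moves by -0.0116 Da, while a feature at 3000 Da moves by -0.0348 Da.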

93 Mass Calibration. Option 2: piece-wise correction via LCMSWarp. Use smoothing splines to determine a smooth calibration curve as a function of scan number. Plots (S. Typhimurium on 11T): mass residuals, i.e. mass error (ppm) vs. MS scan number, before and after correction.

94 Mass Calibration. Option 2: piece-wise correction via LCMSWarp. Use a smoothing spline calibration as a function of m/z. LCMSWarp uses a hybrid correction based on both mass error vs. time and mass error vs. m/z. Plots (S. Typhimurium on 11T): mass residuals, i.e. mass error (ppm) vs. m/z, before and after correction.

95 Mass Calibration. Comparison of methods: linear correction and the three LCMSWarp options (mass error vs. m/z, mass error vs. time, hybrid). With better recalibration the mass error histogram gets taller, narrower, and more symmetric. Not all datasets show the same trends, but hybrid mass recalibration is generally superior. Plots: bin count vs. mass error (ppm) for S. Typhimurium on 11T and for S. oneidensis on LTQ-FT, each comparing LCMSWarp_Hybrid, LCMSWarp_vs_time, LCMSWarp_vs_mz, and Linear Correction.

96 Identifying LC-MS Features. Match features to LC-MS/MS identifications. AMT tag database: S. Typhimurium, from 25 LC-MS/MS analyses; 18,617 AMT tags, all fully or partially tryptic. Look for AMT tags within a broad mass and NET window, e.g. ±25 ppm and ±0.05 NET of each feature. Result for S. Typhimurium on 11T FTICR: of 5,934 features, 4,678 have a match, matching 6,242 of the 18,617 AMT tags.
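The broad search can be sketched as a tolerance test on mass (in ppm) and NET. This is an illustrative brute-force version, not VIPER's implementation; features and AMT tags are assumed to be `(monoisotopic_mass, NET)` tuples, and the function names are hypothetical.

```python
def ppm_error(feature_mass, tag_mass):
    """Signed mass error of a feature relative to an AMT tag, in ppm."""
    return (feature_mass - tag_mass) / tag_mass * 1e6


def match_features(features, amt_tags, ppm_tol=25.0, net_tol=0.05):
    """Return (feature_index, tag_index) pairs that fall within both tolerances.

    A feature may match several AMT tags, and a tag may match several features.
    """
    matches = []
    for fi, (f_mass, f_net) in enumerate(features):
        for ti, (t_mass, t_net) in enumerate(amt_tags):
            mass_ok = abs(ppm_error(f_mass, t_mass)) <= ppm_tol
            net_ok = abs(f_net - t_net) <= net_tol
            if mass_ok and net_ok:
                matches.append((fi, ti))
    return matches
```

The nested loop is quadratic; for databases with tens of thousands of tags, sorting the tags by mass and using a binary search on the mass window would be the usual refinement.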

97 Search Tolerance Refinement. Mass error and NET error histograms can be used to determine optimal search tolerances. Examine the distribution of errors to determine the optimal tolerance using an expectation maximization algorithm. Example result: ±1.76 ppm.

98 Identifying LC-MS Features. Repeat the search with the final search tolerances. Plot: 5,934 features (axis: observed NET).

99 Identifying LC-MS Features. Repeat the search with the final search tolerances: of 5,934 features, 3,866 have matches. Match tolerances shown: mass ±25 ppm; NET ±0.05 NET.

100 Identifying LC-MS Features. Repeat the search with the final search tolerances: of 5,934 features, 3,866 have matches; 3,958 out of 18,617 AMT tags matched using ±1.76 ppm. Match tolerance (mass): ±1.76 ppm.

101 Identifying LC-MS Features. Caveat: a given feature can match more than one AMT tag, so a measure of ambiguity is needed. Example with match tolerances of mass ±4 ppm and NET ±0.02 NET: one feature has three candidate AMT tags (T.RALMQLDEALRPSLR.S, K.DLETIVGLQTDAPLKR.A, R.SIGIAPDVLICRGDRAI.P). The three candidate matches have Δmass values of 2.8 ppm, 0.17 ppm, and -3.5 ppm. Plot: monoisotopic mass vs. NET.

102 Identifying LC-MS Features. Normalized distance between feature $i$ and AMT tag $j$: $d_{ij}^2 = \frac{(m_i - m_j)^2}{\sigma_{mj}^2} + \frac{(t_i - t_j)^2}{\sigma_{tj}^2}$, with $\sigma_{mj} = 4$ ppm and $\sigma_{tj}$ the NET standard deviation. Match probability over the $N$ candidate AMT tags: $p_{ij} = \frac{(\sigma_{mj}\sigma_{tj})^{-1}\exp(-d_{ij}^2/2)}{\sum_{k=1}^{N}(\sigma_{mk}\sigma_{tk})^{-1}\exp(-d_{ik}^2/2)}$. Match tolerances: mass ±4 ppm; NET ±0.02 NET. Table columns: AMT Tag, Mass (Da), NET, $d_{ij}^2$, numerator, $p_{ij}$, with the sum of the numerators as the denominator. K.K. Anderson, M.E. Monroe, and D.S. Daly. Proteome Science 2006, 4, 1.
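The probability formula on this slide can be sketched as follows. The mass standard deviation of 4 ppm comes from the slide; the NET standard deviation did not survive in this transcript, so `sigma_net=0.025` is a placeholder assumption, as is the function name. The same standard deviations are used for every candidate, so the prefactors cancel here, but they are kept to mirror the formula.

```python
import math


def slic_scores(feature, candidates, sigma_mass_ppm=4.0, sigma_net=0.025):
    """Relative match probability p_ij of one feature against each candidate tag.

    feature and candidates are (monoisotopic_mass, NET) tuples.
    sigma_net is an assumed placeholder value, not taken from the source.
    """
    f_mass, f_net = feature
    weights = []
    for t_mass, t_net in candidates:
        dm_ppm = (f_mass - t_mass) / t_mass * 1e6
        dt = f_net - t_net
        d2 = (dm_ppm / sigma_mass_ppm) ** 2 + (dt / sigma_net) ** 2
        weights.append(math.exp(-d2 / 2.0) / (sigma_mass_ppm * sigma_net))
    total = sum(weights)
    if total == 0.0:
        # All candidates are too far away to score numerically.
        return [0.0 for _ in weights]
    return [w / total for w in weights]
```

A score near 1 indicates that one candidate dominates; scores split across several candidates flag an ambiguous match.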

103 Identifying LC-MS Features. VIPER reports a score (SLiC Score) that measures the uniqueness of each match. Table columns: AMT Tag, Peptide, Mass (Da), NET, SLiC Score, Average XCorr, Avg Disc Score. Candidate peptides: T.RALMQLDEALRPSLR.S; K.DLETIVGLQTDAPLKR.A; R.SIGIAPDVLICRGDRAI.P. Plot: monoisotopic mass vs. NET. K.K. Anderson, M.E. Monroe, and D.S. Daly. Proteome Science 2006, 4, 1.

104 Search Tolerance Refinement. Effect of search tolerances on the mass error histogram. If the mass error plot is not centered at 0, then narrow mass windows exclude valid data. Decreasing the mass and/or NET tolerance reduces the background false positive level. Plots: mass error histograms with linear mass correction and with LCMSWarp mass correction (count vs. mass error in ppm), each at ±25 ppm, ±0.05 NET; ±25 ppm, ±0.02 NET; ±3 ppm, ±0.02 NET; and ±1.5 ppm, ±0.02 NET.

105 Automated Analysis. Automated processing using VIPER. Processing steps and parameters are defined in a .ini file. Separate .ini files are used for 14N/15N pairs and 16O/18O pairs.

106 Results. Browsable result folders for visual QC of each dataset (S. Typhimurium on 11T FTICR). 2D plot metrics: reasonable number of matches; NET range 0 to 1 (panels: data searched, data with matches). Mass error histogram metrics: well defined, symmetric mass error peak centered at 0 ppm (panels: mass errors before refinement, mass errors after refinement).

107 Results. Browsable result folders for visual QC of each dataset (S. Typhimurium on 11T FTICR). NET error histogram metrics: well defined, symmetric NET error peak centered at 0 (panels: NET errors before refinement, NET errors after refinement). Chromatogram metrics: narrow peaks evenly distributed throughout the separation window (panels: Total Ion Chromatogram (TIC), Base Peak Intensity (BPI) Chromatogram).

108 Results. Browsable result folders for visual QC of each dataset (S. Typhimurium on 11T FTICR). NET alignment surface metrics: should show a smooth, bright yellow, diagonal line. NET residual metrics: data after recalibration should be narrowly distributed around zero.

109 Results. Outputs: peptide identification list (and thus proteins) identified in each sample; peptide abundance estimates (relative abundance); confidence metrics. Next, peptide abundances (and/or protein abundances) need to be compared between samples to reveal biological information. Workflow: complex samples, μLC-FTICR-MS, matched results, compare abundances across multiple proteomes.

111 Independent Dataset Processing. Individual LC-MS datasets are aligned to an AMT tag database sequentially. Identified peptides are compared after independently processing each dataset. Diagram: LC-MS features from Exp. 1, Exp. 2, ..., Exp. i, ..., Exp. N are each aligned individually to the AMT tags from LC-MS/MS.

112 Independent Dataset Processing. LC-MS features without matches may represent useful information, but are effectively ignored. Diagram: all LC-FTICR-MS features; AMT tags from LC-MS/MS; identified features.

113 Independent Dataset Processing. Independent processing of each dataset results in more missing data. For example, if a low abundance peptide is not identified as an LC-MS feature in a given dataset, then that peptide identification is missing from that dataset. Lower abundance features suffer more, but are not the only casualties. Chart: peptide detection reproducibility in 5 replicate datasets; 43% of peptides are seen in 5 of 5 datasets and 14% in 4 of 5, with the remainder seen in 3 or fewer.

114 Extended AMT tag Method. Find common features based on mass and time patterns in all datasets first (with or without the AMT tag database). Align the resulting groups of features to the AMT tag database using statistics from a larger number of features. Diagram: LC-MS features from Exp. 1, Exp. 2, ..., Exp. i, ..., Exp. N are aligned to one another, then to the AMT tags from LC-MS/MS.

115 Extended AMT tag Method. Align all datasets to a common baseline. Figure: alignment functions and score plots for the alignment of 4 datasets against an arbitrary baseline run. N. Jaitly, M.E. Monroe, et al., Analytical Chemistry 2006, 78.

116 Extended AMT tag Method. Example alignment of 5 LC-MS datasets: mass section before LC alignment. Plot: mass vs. scan #, Dataset 1 to Dataset 5; symbols denote charge (O: charge = 1; +: charge = 2; Δ: charge = 3; a fourth symbol: charge >= 4).

117 Extended AMT tag Method. Example alignment of 5 LC-MS datasets: mass section after LC alignment. Plot: mass vs. scan #, Dataset 1 to Dataset 5; symbols denote charge (O: charge = 1; +: charge = 2; Δ: charge = 3; a fourth symbol: charge >= 4).

118 Extended AMT tag Method. Cluster similar features (using mass and retention time) across all LC-MS datasets, rather than analyzing each dataset separately and then collating results. Abundance profiles can then be obtained for features seen in multiple datasets. Figure: mass vs. scan number, showing clustered feature groups.

119 Extended AMT tag Method. Identify clustered features by aligning the mass and elution time of clusters to AMT tags in the database. Peptide detection reproducibility comparison (5 replicate datasets): with independent processing, 43% of peptides are seen in 5 of 5 datasets (other shares: 19%, 14%, 13%, 11%); with cluster first, then match to AMT tags, 74% are seen in 5 of 5 datasets (other shares: 11%, 6%, 5%, 4%). Fewer missing values are observed with the clustered feature approach.

120 MultiAlign. Represents the next version of the feature identification process. Along with MT Creator, it provides a standalone, redistributable version of the AMT tag process.

121 MultiAlign. A wizard takes the user through analysis setup. It can perform single dataset alignment to an AMT tag database, or multiple dataset alignment to a baseline dataset or database.

122 MultiAlign. Dataset summary pages and an overview are available for visual QC.

123 LC-MS Feature Discovery. Similar approaches and software tools for high resolution LC-MS: CRAWDAD (G.L. Finney et al., Analytical Chemistry 2008, 80); msInspect (M. Bellew et al., Bioinformatics 2006, 22); PEPPeR (J. Jaffe et al., Mol. Cell. Proteomics 2006, 5); SpecArray, comprising Pep3D, mzxml2dat, PepList, PepMatch, and PepArray (X.-J. Li et al., Mol. Cell. Proteomics 2005, 4); SuperHIRN (L.N. Mueller et al., Proteomics 2007, 7); Surromed label-free quantitation software, MassView (W. Wang et al., Analytical Chemistry 2003, 75); XCMS, for metabolite profiling (C.A. Smith et al., Analytical Chemistry 2006, 78).

124 LC-MS Feature Discovery. Similar approaches and software tools for low resolution LC-MS: Signal maps software (A. Prakash et al., Mol. Cell. Proteomics 2006, 5); informatics platform for global proteomic profiling using LC-MS (D. Radulovic et al., Mol. Cell. Proteomics 2004, 3); Computational Proteomics Analysis System, CPAS (A. Rauch et al., J. Proteome Research 2006, 5).

126 Accurate Mass and Time (AMT) tag pipeline. Features in an LC-MS dataset are matched to a database of peptides previously identified by LC-MS/MS analyses, using specified mass and elution time tolerances. LC-MS dataset features: (monoisotopic mass, elution time, abundance), plotted as monoisotopic mass vs. scan #. AMT tag database entries: (peptide sequence, monoisotopic mass, elution time), plotted as monoisotopic mass vs. normalized elution time (NET). A match requires Δmass < mass tolerance and ΔNET < elution time tolerance.

127 Current Method for Assessing and Controlling Rate of Random Matches. Decoy database searching is often used to assess the rate of random AMT tag matches: shift the database (or the features) by some mass and re-match to the database; the number of matches reflects the random rate of errors. The error rate is controlled by building more stringent LC-MS/MS databases (higher XCorr, hyperscore, Mascot score, Peptide Prophet score, etc.) and by reducing mass and elution time tolerances. The overall method consists of iteratively controlling and assessing error to choose optimal parameters.
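The decoy idea can be sketched by re-running the match after shifting all feature masses. The 11 Da shift mirrors the decoy shift mentioned later in this deck; the function names and the simple ratio used as a rate estimate are assumptions made for illustration.

```python
def count_matched_features(features, amt_tags, ppm_tol, net_tol, mass_shift=0.0):
    """Count features with one or more AMT tags inside both tolerances.

    features and amt_tags are (monoisotopic_mass, NET) tuples; mass_shift (Da)
    is added to each feature mass before matching.
    """
    count = 0
    for f_mass, f_net in features:
        shifted = f_mass + mass_shift
        if any(abs(shifted - t_mass) / t_mass * 1e6 <= ppm_tol
               and abs(f_net - t_net) <= net_tol
               for t_mass, t_net in amt_tags):
            count += 1
    return count


def estimate_random_match_rate(features, amt_tags, ppm_tol, net_tol,
                               decoy_shift_da=11.0):
    """Return (real matches, decoy matches, decoy/real ratio)."""
    real = count_matched_features(features, amt_tags, ppm_tol, net_tol)
    decoy = count_matched_features(features, amt_tags, ppm_tol, net_tol,
                                   mass_shift=decoy_shift_da)
    rate = decoy / real if real else 0.0
    return real, decoy, rate
```

Repeating the estimate while tightening `ppm_tol` and `net_tol` shows the trade-off the slide describes: fewer decoy matches, but also fewer real ones.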

128 Challenges with Current Method. Balancing false positives and false negatives is difficult. Building a more confident LC-MS/MS database decreases background false positives, but increases false negatives; reducing mass and elution time tolerances has a similar effect. Manually chosen parameters may look suitable but in reality be sub-optimal. Each identification is either accepted or rejected, whereas in reality some identifications are better than others: higher MS/MS confidence, lower mass and elution time differences, etc.

129 Statistical Method for MS/MS Identification Confidence. Peptide Prophet: a statistical model to estimate the probability that an MS/MS spectrum is correctly identified. Uses a linear function (F-Score) of result metrics from SEQUEST to calculate an overall value representing the confidence of identifications. Example peptide: ARHGGEDYVFSLLTGYCEPPTGVSLR. Parameters used: XCorr, DelCN, SpRank, dM, with coefficients $c_1$ to $c_4$: $F(x_1, x_2, \ldots, x_s) = c_0 + \sum_{i=1}^{s} c_i x_i$. A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, "Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search," Analytical Chemistry 2002, 74 (20).

130 Peptide Prophet F-Score Distributions. The overall F-Score distribution is bimodal, with distinct distributions for correct and incorrect matches. The probability that an identification is correct can be computed from the relative probabilities of coming from the correct or the incorrect distribution. Plot: relative frequency vs. F score for all hits, correct hits, and incorrect hits.

131 Metrics Associated with a Candidate Identification from AMT tag Pipeline. Each match between an LC-MS feature and a peptide AMT tag is described by a mass error and an LC NET error, plus metrics related to the MS/MS spectra that identified the AMT tag. Feature fields: mass, scan, aligned NET. AMT tag fields: peptide, NET, mass, discriminant score (Peptide Prophet). Example peptide: TETQEKNPLPSKETIEQEK, reported with its Δmass (ppm) and ΔNET.

132 Distribution of Mass and LC Normalized Elution Time (NET) Differences. True and false matches to LC-MS features show different mass and LC NET error distributions. Plot: LC NET error vs. mass error (ppm), for charge 2 and charge >2; true matches are centrally distributed, false matches are randomly distributed.

133 Estimating the Probability a Match is Correct. The probability that a match is correct depends on where its mass and LC NET error values lie on the two-dimensional distribution. Example: mass error = 1 ppm. Probability of correct match = (height of central part) / (height of central part + height of random part) = $h_+ / (h_+ + h_-)$.

134 Estimating the Probability a Match is Correct. The probability that a match is correct depends on where its mass and LC NET error values lie on the two-dimensional distribution. Example: mass error = 2 ppm, where $h_+ \approx 0$. Probability of correct match = (height of central part) / (height of central part + height of random part) = $h_+ / (h_+ + h_-)$.

135 Statistical Method for Assignment of Relative Truth (SMART) Score. A SMART score estimates the probability that an LC-MS feature has been identified by an AMT tag. It combines the mass and LC NET error between the LC-MS feature and the AMT tag with the probability that the MS/MS identification in the AMT tag is correct. Assumes independence of mass error, NET error, and F-score. $p(\text{peptide match} \mid m, net, Fscore) = \frac{p(m, net \mid \text{peptide match})\, p(Fscore \mid \text{peptide match})\, p(\text{peptide match})}{p(m, net \mid \text{peptide match})\, p(Fscore \mid \text{peptide match})\, p(\text{peptide match}) + p(m, net \mid \text{random match})\, p(Fscore \mid \text{random match})\, p(\text{random match})}$. Algorithm developed by Navdeep Jaitly.

136 Data Model and Model Fitting. Data to model: mass errors, NET errors, F-score distribution. An expectation maximization algorithm is used to find optimal parameters for the distributions. Error type and distribution used for model fitting: real mass errors, ppm errors distributed normally (truncated); random mass errors, Da errors distributed normally (truncated); real NET errors, distributed normally; random NET errors, uniform background; real F-scores, normally distributed as a function of mass; random F-scores, gamma distribution. $p(\text{peptide match} \mid m, net, Fscore) = \frac{p(m, net \mid \text{peptide match})\, p(Fscore \mid \text{peptide match})\, p(\text{peptide match})}{p(m, net \mid \text{peptide match})\, p(Fscore \mid \text{peptide match})\, p(\text{peptide match}) + p(m, net \mid \text{random match})\, p(Fscore \mid \text{random match})\, p(\text{random match})}$. Algorithm developed by Navdeep Jaitly.
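The Bayes-rule form of the SMART score can be sketched with simplified distributions. This is not the published algorithm: true mass and NET errors are taken as zero-centered normals, random errors as uniform over the search window (the slide lists a truncated normal for random mass errors), and the F-score densities are passed in as numbers. All parameter values and names are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def smart_like_posterior(mass_err_ppm, net_err, f_true_density, f_random_density,
                         prior_true=0.5, sigma_mass=2.0, sigma_net=0.01,
                         mass_window=25.0, net_window=0.05):
    """Posterior probability of a true match, assuming independent terms.

    True match: normal mass and NET errors centered at zero.
    Random match: uniform errors over +/- mass_window and +/- net_window
    (a simplifying assumption). f_random_density must be positive.
    """
    like_true = (normal_pdf(mass_err_ppm, 0.0, sigma_mass)
                 * normal_pdf(net_err, 0.0, sigma_net)
                 * f_true_density)
    like_random = ((1.0 / (2.0 * mass_window))
                   * (1.0 / (2.0 * net_window))
                   * f_random_density)
    numerator = like_true * prior_true
    denominator = numerator + like_random * (1.0 - prior_true)
    return numerator / denominator
```

In a full implementation the distribution parameters and the prior would be fitted from the data, for example with expectation maximization as the slide states, rather than fixed in advance.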

137 Data Model Example: Mass Error Distribution. Figure (Salmonella Typhimurium protein extract sample): regular matches (real and random distributions) and decoy (11 Da shifted) matches (random distribution); density vs. mass error (ppm), with quantile-quantile plots of data vs. model for mass error in ppm and in Da. N. Jaitly, J.N. Adkins, and R.D. Smith, preliminary results.

138 Data Model Example: NET Error Distribution. Figure (Salmonella Typhimurium protein extract sample): regular matches (real and random distributions) and decoy (11 Da shifted) matches (random distribution); density vs. NET error, with quantile-quantile plots of data vs. model for NET error. N. Jaitly, J.N. Adkins, and R.D. Smith, preliminary results.

139 Performance Curves. Match LC-MS features to AMT tags and compute the probability that the identified feature is indeed the MS/MS-derived AMT tag. Accept matches whose computed probabilities are greater than a specified threshold. Balance sensitivity and specificity by changing score thresholds. Plot (QC sample matched to a database with 8000 decoy proteins): # of false (decoy) peptides vs. # of matched standard peptides. A higher probability threshold gives fewer false positives (~0); a lower probability threshold yields more true positives, but at the cost of false positives; in the trade-off region, reducing the probability threshold results in an accelerating number of false positives. A high number of true positives is achieved with very few false positives. N. Jaitly, J.N. Adkins, and R.D. Smith, preliminary results.

140 Performance Curves. Salmonella Typhimurium protein extract sample: SMART compared to typical filter criteria. Plot: estimated FDR vs. # of matches for SMART; Pep Prophet > .99, 4 ppm, 0.05 NET; Pep Prophet > .99, 2 ppm, 0.02 NET; Pep Prophet > .999, 1 ppm, 0.01 NET. N. Jaitly, J.N. Adkins, and R.D. Smith, preliminary results.

141 Summary. The Statistical Method for Assignment of Relative Truth (SMART) score is a model to estimate confidence of peak matching. SMART provides a measure to prioritize acceptable matches using one number, by defining a probability score combining disparate information. SMART allows calculation of FDR for identifications and estimates the tradeoff between false negatives and false positives. Evaluation shows good correlation with the observed number of correct answers. [Figure: estimated FDR vs. number of matches.]

143 Data Analysis: Quantitative protein inference from peptide data. Complications: multiple, possibly inconsistent peptide measurements for the same protein; systematic abundance variation within and between conditions; how should we use information from blocking and randomization of experiments?; high rate of missingness in peptide measurements. Need to combine off-the-shelf statistical methods (clustering, ANOVA, PCA) and novel solutions.

144 Infer Protein Abundances from Peptide Abundances. Multiple peptides are observed for each protein. For example, a protein with 4 peptides: 1. SADLNVDSIISYWK 2. LLLTSTGAGIVIK 3. LIVGFPAYGHTFILSDPSK 4. IPELSQSLDYIQVMTYDLHDPK. Plot peptide abundance across all datasets (for 4 conditions: Condition 1, Condition 2, Control, Condition 3). Outlier detection and normalization need to be performed before meaningful abundance information can be inferred.

145 Outlier Detection. [Figure: pairwise correlation heatmap with dataset names on both axes and one outlier dataset marked; color legend with overlaid histogram of correlation values.]

146 Normalization. [Figure: scatter plots of Dataset 2 vs. Dataset 1 before and after normalization; the "before" panel shows systematic bias.]

147 Infer Protein Abundances from Peptide Abundances. Scale peptide abundances to an automatically chosen optimal reference peptide for each protein. Estimate relative protein abundance using scaled peptides. [Figure: raw peptide abundances vs. dataset (for 1 protein) across Condition 1, Condition 2, Control, and Condition 3; scaled peptide abundances for this protein's 4 peptides, with the median protein abundance shown as a dark black line.] Peptides: 1. SADLNVDSIISYWK 2. LLLTSTGAGIVIK 3. LIVGFPAYGHTFILSDPSK 4. IPELSQSLDYIQVMTYDLHDPK
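A reference-peptide rollup of this kind can be sketched as below. The slides do not state how the "optimal" reference peptide is chosen, so this sketch assumes the peptide with the most observed values is the reference; the function name, the shift-by-median-difference scaling, and the example abundances (log scale) are all illustrative assumptions.

```python
from statistics import median


def rollup_reference(peptides):
    """Roll peptide log abundances up to one protein profile.

    peptides: dict of peptide name -> list of log abundances, one per dataset,
              with None for missing values (hypothetical input format).
    """
    def n_observed(values):
        return sum(v is not None for v in values)

    # Assumed rule: the reference is the peptide with the most observations
    # (ties resolved alphabetically by name).
    ref_name = max(sorted(peptides), key=lambda k: n_observed(peptides[k]))
    ref = peptides[ref_name]

    scaled = {}
    for name, values in peptides.items():
        # Shift each peptide by its median difference to the reference,
        # computed over datasets where both are observed.
        diffs = [r - v for r, v in zip(ref, values)
                 if r is not None and v is not None]
        shift = median(diffs) if diffs else 0.0  # no overlap: leave unscaled
        scaled[name] = [None if v is None else v + shift for v in values]

    # Protein abundance per dataset = median of the scaled peptides.
    protein = []
    for j in range(len(ref)):
        column = [s[j] for s in scaled.values() if s[j] is not None]
        protein.append(median(column) if column else None)
    return ref_name, scaled, protein


# Made-up log abundances for three peptides across three datasets.
peptides = {
    "pepA": [10.0, 11.0, 12.0],
    "pepB": [8.0, 9.0, None],
    "pepC": [13.0, None, 15.0],
}
ref_name, scaled, protein = rollup_reference(peptides)
print(ref_name, protein)
```

Working on the log scale turns multiplicative differences in peptide response into additive shifts, which is why a single offset per peptide is enough in this sketch.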

148 DAnTE: Data Analysis Tool Extension. Statistical tool designed to address analysis needs associated with quantitative bottom-up proteomics data. Capture experimental design through factors: biological conditions, biological replicates, technical replicates.

149 Data Analysis. Challenge: with thousands of peptides/proteins in hundreds of samples, how does one mine the data for significant trends of interest? General steps in DAnTE: Outliers: identify bad datasets. Normalization: remove any systematic variations due to instrument/sample processing etc. Rollup: infer protein abundances from peptide abundances. ANOVA: look for statistically significant changes. Cluster: explore trends/patterns.

150 Where does DAnTE fit in? [Diagram: the AMT tag pipeline (VIPER, MultiAlign) supplies peptides and peptide abundances; these and other tabular data (raw abundances, exp. ratio values, spectral counts) go into DAnTE, which leads to biological interpretations.]

151 DAnTE supports a wide array of interactive analyses. Data loading: peptide abundance, peptide-protein relations, factors, spectral counts. Variance stabilization: log2 or log10. Bias (additive/multiplicative). Normalization (within a factor): linear regression, local regression (LOESS), quantile. Normalization (across factors): central tendency, median absolute deviation (MAD). Statistical tests: ANOVA/non-parametric, mixed models. Visualization: PCA / PLS, heatmaps (hierarchical, k-means). Impute missing data: KNNimpute, SVDimpute, other. Investigative plots: histograms, boxplots, correlation diagrams, MA. Infer proteins from peptides: R, Z, Q, other. ANOVA results. Save session. Easily extendable to plug in more modules. Statistical environment.

152 Example Dataset. Proteomics analysis of two Salmonella regulatory knock-out mutants, Hfq and SmpB. Relevant factor is mutant type: wild type (WT), hfq knock-out mutant, smpB knock-out mutant. Three biological replicates for each mutant.

153 Outline of an Analysis. Load data. Log transform. Diagnostic plots. Define factors. Normalization within a factor: linear regression, LOESS (LOcal regrESSion), quantile. Normalization across factors: MAD, central tendency. ANOVA. Save the results to a session file (.dnt file).

154 Load Data. Experiments: MultiAlign quantitation data. Protein-peptide relationships. Peptide abundances.

155 Diagnostic Plots: Normality. Distribution of abundance values in a single dataset; should see a Normal distribution. [Figure: histogram of abundance (probability vs. abundance) and normal Q-Q plot (sample quantiles vs. theoretical Normal quantiles).]
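The points of a normal Q-Q plot like the one on this slide can be computed with the standard library alone. This is a sketch under assumptions: the function name, the plotting positions (i + 0.5) / n, and the log abundance values are illustrative and not taken from DAnTE.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    # Fit a Normal distribution using the sample mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Pair each ordered value with the fitted quantile at position (i + 0.5) / n.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


# Made-up log abundances for one dataset.
log_abundances = [18.2, 19.1, 19.8, 20.0, 20.3, 20.9, 21.7]
points = normal_qq_points(log_abundances)
for theoretical, sample in points:
    print(round(theoretical, 2), sample)
```

If the log-transformed abundances are close to Normal, the pairs fall near the line y = x; strong curvature at either end suggests skew or heavy tails.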

156 Diagnostic Plots: Correlations. Assess repeatability of replicate datasets. Compare biological conditions. [Figure: pairwise correlation heatmap of the hfq, smpB, and WT datasets, with dataset names on both axes; color legend with overlaid histogram of correlation values.]
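A pairwise correlation check of this kind can be sketched as follows. The flagging rule (median correlation to the other datasets below a cutoff), the cutoff of 0.8, the function names, and the data are assumptions made for illustration; the slides show the heatmap but do not specify a numeric rule.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(datasets):
    """datasets: dict of dataset name -> list of log abundances."""
    names = list(datasets)
    return {a: {b: pearson(datasets[a], datasets[b]) for b in names}
            for a in names}


def flag_outliers(corr, min_median_r=0.8):
    """Flag datasets whose median correlation to the others is low."""
    flagged = []
    for name, row in corr.items():
        others = [r for other, r in row.items() if other != name]
        if median(others) < min_median_r:
            flagged.append(name)
    return flagged


# Made-up log abundances: three consistent replicates and one bad run.
datasets = {
    "rep1": [20.0, 22.0, 24.0, 26.0, 28.0],
    "rep2": [20.5, 21.8, 24.3, 25.9, 28.2],
    "rep3": [19.8, 22.1, 23.9, 26.2, 27.9],
    "bad": [25.0, 21.0, 27.0, 20.0, 24.0],
}
corr = correlation_matrix(datasets)
print(flag_outliers(corr))
```

The median is used instead of the mean so that one bad dataset does not drag down the score of the good replicates it is compared against.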

157 Normalization: LOESS. [Figure: Dataset 2 vs. Dataset 1 scatter plots and MA plots (log ratio vs. average log intensity), raw and after LOESS; the raw panels show systematic bias.]
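The idea behind the raw and after-LOESS panels can be sketched with a plain local linear regression on the MA scale. This is a simplified stand-in, assuming tricube weights and no robustness iterations; DAnTE's own LOESS may differ in its details. All function names, the span of 0.75, and the data are illustrative.

```python
def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(x, y, x0, span=0.75):
    """Weighted straight-line fit of y on x around x0, evaluated at x0."""
    k = min(len(x), max(3, round(span * len(x))))  # points in the window
    h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12  # window half-width
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess_normalize(d1, d2, span=0.75):
    """Remove intensity-dependent bias of d2 relative to d1 (log abundances)."""
    m = [b - a for a, b in zip(d1, d2)]            # M: log ratio
    a_avg = [(a + b) / 2 for a, b in zip(d1, d2)]  # A: average log intensity
    trend = [local_linear_fit(a_avg, m, a0, span) for a0 in a_avg]
    # Subtract the fitted trend so the corrected log ratios centre on zero.
    return [b - t for b, t in zip(d2, trend)]


# Made-up data: dataset 2 carries a bias that grows with intensity.
d1 = [float(v) for v in range(10, 20)]
d2 = [1.2 * v - 1.0 for v in d1]
normalized = loess_normalize(d1, d2)
print([round(v, 3) for v in normalized])
```

Because the simulated bias is linear in the average intensity, the local fit recovers it almost exactly and the normalized values return to those of dataset 1; real data would leave residual scatter around zero.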

158 Inferring Protein Abundances from Peptide Abundances. Three algorithms for rolling up abundances. Additional algorithms can be added as needed. [Figure: raw peptide abundances vs. dataset (for 1 protein) across hfq, smpB, and WT; scaled peptide abundances for this protein's 28 peptides, with the median protein abundance shown as a dark black line.]

159 Showing Significant Changes. ANOVA identified 361 proteins with significant abundance changes between the mutants (with an FDR of 5%). Hierarchical clustering groups related proteins. [Figure: heatmap of protein abundances across the hfq, smpB, and WT datasets.]
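Selecting proteins at a 5% FDR from a list of ANOVA p-values can be sketched with the Benjamini-Hochberg step-up procedure. Note that this is a swapped-in, commonly used method chosen for illustration: the slides mention q-values and an FDR of 5% but do not say which procedure was applied. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its line.
        if pvalues[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Made-up per-protein ANOVA p-values.
pvalues = [0.041, 0.001, 0.212, 0.008, 0.06]
significant = benjamini_hochberg(pvalues, fdr=0.05)
print(significant)
```

With these values only the two smallest p-values pass, even though 0.041 is below 0.05 on its own, which shows how the correction tightens the cutoff when many proteins are tested.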

160 DAnTE Feature List. Data loading with peptide-protein group information; spectral count. Log transform. Factor definitions. Normalization: linear regression, LOESS, quantile normalization, median absolute deviation (MAD) adjustment, mean centering. Missing value imputation: simple (mean/median of the sample, substitute a constant); advanced (row mean within a factor, kNN method, SVDimpute). Save tables / factors / session. Plots: histograms, QQ plots, boxplots, correlation plots, MA plots, PCA/PLS plots, protein rollup plots, heatmaps. Rolling up to proteins: reference peptide based scaling (R), Z-score averaging (Z), Q. Statistics: ANOVA, provisions for unbalanced data, random effects (multi-level) models (REML), normality test (Shapiro-Wilk), non-parametric methods (Wilcoxon, Kruskal-Wallis tests), q-values.
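Of the imputation options listed, "row mean within a factor" is simple enough to sketch directly. The function name, the input layout (one row of abundances plus a factor label per dataset), and the choice to leave a value missing when its whole factor level is unobserved are assumptions for illustration.

```python
def impute_row_mean_within_factor(row, factor_levels):
    """Fill missing values with the mean of the same row within each factor level.

    row: abundances for one peptide or protein, None where missing.
    factor_levels: factor level label for each dataset, same length as row.
    """
    observed = {}
    for value, level in zip(row, factor_levels):
        if value is not None:
            observed.setdefault(level, []).append(value)

    filled = []
    for value, level in zip(row, factor_levels):
        if value is None and level in observed:
            # Replace with the mean of the observed values at this level.
            value = sum(observed[level]) / len(observed[level])
        filled.append(value)  # stays None if the whole level is missing
    return filled


# Made-up example: one value missing in WT, all values missing in hfq.
row = [10.0, None, 12.0, None, None, None]
levels = ["WT", "WT", "WT", "hfq", "hfq", "hfq"]
filled = impute_row_mean_within_factor(row, levels)
print(filled)
```

Leaving the fully missing level untouched matters here: a protein absent from every hfq replicate may be absent for biological reasons, and filling it in would hide that.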

161 New Significance Test for Proteins with Missing Peptide Abundance Values. Missing values in peptides can bias statistical tests for significance (ANOVA). Missingness is modeled as a mixture of two probability distributions: too low to detect (left censored) and completely at random. Use a likelihood estimation to find the parameters of the mixture. Perform a likelihood ratio test to compute Pr where P1 and P2 are proteins with peptides having missing values. [Figure: proportion of values missing in N datasets vs. peptide abundance (intensity) for P1 and P2, showing a random component and an intensity-dependent component.]

162 Further Details. Help file. Manuscript in Bioinformatics: Bioinformatics 2008; doi: /bioinformatics/btn217

163 Part III: Biological Application Part I: Introduction and Overview of Label-Free Quantitative Proteomics (Anderson) Part II: Feature Discovery in LC-MS Datasets (Monroe and Polpitiya) Break Part III: Biological Application of the AMT tag Approach (Ansong) Salmonella Typhimurium AMT tag Software Demo Panel Discussion

164 Biological Application of the AMT tag Approach Salmonella Typhimurium projects to be discussed: Global role of translational regulation in control of gene expression Identification of targets for therapeutic intervention

165 The Central Dogma of Biology Information flow

166 Post-transcriptional Regulation in Bacteria. Little is known about how RNA binding proteins facilitate the global control of gene expression at the post-transcriptional level. Bacterial regulation of translation is mediated by relatively few identified protein factors, primarily those encoded by hfq and smpB. Identifying proteins regulated by Hfq and SmpB is an important step in understanding the impact of post-transcriptional gene regulation in Enterobacteria.

167 Post-transcriptional Regulation in Bacteria: Global identification of Hfq targets in Salmonella. Adapted from Sittka et al. PLoS Genetics 2008. Hfq strongly binds to RNA molecules. Binding is somewhat non-specific. Possibility of spurious results introduced by tagging and/or co-IP. Pyrosequencing to detect Hfq-associated RNA. Determine protein changes by global proteomics. Determine transcript changes on DNA microarrays. Identify proteins translationally regulated.

168 Salmonella Typhimurium Growth in Host-free Culture Conditions. Laboratory conditions: growth in Luria-Bertani broth (LB) to log phase (Log); growth in LB to stationary phase (Stat). Intracellular environment mimic: Stat phase cells transferred to acidic minimal media for 4 hr, then harvested (MgM Shock); Stat phase cells diluted 1:100 into acidic minimal media, grown for 4 hr, then harvested (MgM Dilution).

169 Experimental Design: The Need/Use for Increased Throughput. Replicate analysis to account for natural biological and normal analytical variation. Design factors: mutant (wild-type, SmpB, Hfq), biological replicate, cell fraction, analytical replicate, x4 growth conditions. 216 global proteome analyses for 2 mutants. LC-MS analysis order blocked and randomized. Ansong PLoS ONE 2009, in press.
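A blocked and randomized LC-MS run order can be sketched as below. The blocking scheme is an assumption made for illustration (each block holds one analysis of every strain and growth condition combination, shuffled within the block); the slides do not describe how the actual blocks were defined. The function name and seed are hypothetical.

```python
import random
from itertools import product


def blocked_run_order(samples, n_blocks, seed=0):
    """Return (block, sample) pairs with sample order shuffled inside each block."""
    rng = random.Random(seed)  # fixed seed so the run order is reproducible
    order = []
    for block in range(1, n_blocks + 1):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        order.extend((block, sample) for sample in shuffled)
    return order


# Strains and growth conditions named in the slides; the block count is made up.
strains = ["WT", "hfq", "smpB"]
conditions = ["Log", "Stat", "MgM Shock", "MgM Dilution"]
samples = list(product(strains, conditions))
order = blocked_run_order(samples, n_blocks=3)
for block, (strain, condition) in order[:5]:
    print(block, strain, condition)
```

Blocking keeps every condition represented in each stretch of instrument time, while shuffling within a block keeps slow instrument drift from lining up with any one strain or condition.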

170 Accurate Mass and Time Tag Approach. Starting point: biological sample, protein extraction. AMT tag database generation: complex samples fractionated, analyzed by LC-MS/MS, then analyzed using SEQUEST (Adkins MCP 2006 is the source of the AMT tag database). High-throughput quantitative proteomics: peptides identified from LC-MS peaks by matching LC-MS features to AMT tags; protein abundance value calculation via peptide rollup.

171 AMT tag Database Overview. AMT tag database: 3 Salmonella Typhimurium growth conditions; analyzed both unfractionated and SCX fractionated samples for each condition; 932 LC-MS/MS analyses; primarily used two Thermo LTQ mass spectrometers; 65 days of instrument acquisition time; data processed with MASIC and SEQUEST; 74,000 AMT tags pass filters. LC-MS data: four growth conditions for three mutants; 216 LC-MS analyses over 16 days of instrument time on a Thermo LTQ-Orbitrap mass spectrometer (note: equivalent LC-MS/MS analyses would require 300 days); data processed with Decon2LS and MultiAlign; identified 27,753 AMT tags with ~1% FDR.

172 Diagnostic Plots: Normality. Distribution of abundance values in a single dataset; should see a Normal distribution. [Figure: histogram of abundance (probability vs. abundance) and normal Q-Q plot (sample quantiles vs. theoretical Normal quantiles).]

173 Normalization: LOESS. Correlation plot (scatter style) of all Salmonella peptide identifications from one biological replicate of WT correlated to a second biological replicate of WT. [Figure: raw and after-LOESS panels; the raw panel shows systematic bias.]

174 Replicate Reproducibility / Outlier Removal. Pearson's pairwise correlation plot of all Salmonella peptide identifications from each triplicate analysis, correlated to itself and to every other analysis. Each replicate analysis has a strong correlation to the other two replicate analyses for each sample, indicating reproducibility between replicate runs.

175 Peptide Abundance Overview. ~20,000 peptides clustered. Represents ~1600 proteins; ~36% coverage of the Salmonella proteome. Biological replicates have good agreement. [Figure: heatmap with columns hfq Dil, hfq Shock, hfq Log, hfq Stat, smpB Dil, smpB Shock, smpB Log, smpB Stat, WT Dil, WT Shock, WT Log, WT Stat.] Ansong PLoS ONE 2009, in press.

176 Subset of Proteins Showing Significant Changes. Salmonella WT and hfq mutant grown in LB to Stat phase. ANOVA between the two groups. 270 proteins show significant changes at a false discovery rate of 5%. 10 groups via K-means clustering. [Figure: heatmap across the hfq and WT datasets.] Ansong PLoS ONE 2009, in press.

177 Functional Categories. [Table: for each functional category, total of observed proteome, Hfq regulated, SmpB regulated, % Hfq regulated, % SmpB regulated.] Hfq-regulated proteins: SmpB-regulated proteins: 189. Categories: Amino acid biosynthesis; Biosynthesis of cofactors, prosthetic groups, and carriers; Cell envelope; Cellular processes; Central intermediary metabolism; DNA metabolism; Energy metabolism; Fatty acid and phospholipid metabolism; Hypothetical proteins; Mobile and extrachromosomal element functions; Protein fate; Protein synthesis; Purines, pyrimidines, nucleosides, and nucleotides; Regulatory functions; Signal transduction; Transcription; Transport and binding proteins; Unclassified; Unknown function. Ansong PLoS ONE 2009, in press.

178 Validating the Proteomics Dataset. Regulation (Reg.) of each protein as reported in the literature (Sittka et al. or Figueroa-Bossi et al.) and in this study. 25/26 proteins are in general agreement with literature. Ansong PLoS ONE 2009, in press.
Name | Function | Reg.: literature | Reg.: this study
CarA | carbamoyl-phosphate synthetase, glutamine-hydrolysing small subunit | (-) | (-)
Upp | undecaprenyl pyrophosphate synthetase (di-trans,poly-cis-decaprenylcistrans | (-) | (-)
Dps | stress response DNA-binding protein; starvation induced resistance to H2O2; | (-) | (-)
FliC | flagellin, filament structural protein | (-) | (-)
LuxS | quorum sensing protein, produces autoinducer - acyl-homoserine lactone-sign | (-) | (-)
SipA | cell invasion protein | (-) | (+)
FadA | 3-ketoacyl-CoA thiolase; (thiolase I, acetyl-CoA transferase), small (beta) sub | (-) | (-)
OsmY | hyperosmotically inducible periplasmic protein, RpoS-dependent stationary ph | (-) | (-)
RpsD | 30S ribosomal subunit protein S4 | (-) | (-)
SurA | peptidyl-prolyl cis-trans isomerase, survival protein | (+) | (+)
HtrA | periplasmic serine protease Do, heat shock protein | (+) | (+)
Tpx | thiol peroxidase | (+) | (+)
OppA | ABC superfamily, oligopeptide transport protein with chaperone properties | (+) | (+)
GlpQ | glycerophosphodiester phosphodiesterase, periplasmic | (+) | (+)
STM2494 | putative inner membrane or exported | (+) | (+)
GreA | transcription elongation factor, cleaves 3' nucleotide of paused mRNA | (+) | (+)
FkpA | FKBP-type peptidyl-prolyl cis-trans isomerase (rotamase) | (+) | (+)
DppA | ABC superfamily, dipeptide transport protein | (+) | (+)
AphA | non-specific acid phosphatase/phosphotransferase, class B | (+) | (+)
GlnH | ABC superfamily (bind_prot), glutamine high-affinity transporter | (+) | (+)
MglB | ABC superfamily (peri_perm), galactose transport protein | (+) | (+)
GlpK | glycerol kinase | (+) | (+)
KatE | catalase hydroperoxidase HPII(III), RpoS dependent | (-) | (-)
PduD | Propanediol utilization: dehydratase, medium subunit | (+) | (+)
PduO | Propanediol utilization: B12 related | (+) | (+)
AphA | non-specific acid phosphatase/phosphotransferase, class B | (+) | (+)

179 Alteration of Protein Expression Mediated Post-transcriptionally. Integrating information from transcriptional analysis. Hfq-regulated proteins: Hfq-regulated transcripts: 492 SmpB-regulated proteins: SmpB-regulated transcripts: 370. Ansong PLoS ONE 2009, in press. Published global transcriptional analyses of Hfq-regulated genes (organism: % Hfq-regulated transcripts): Salmonella Typhimurium (1): 20%; Pseudomonas aeruginosa (2): 15%; E. coli (3): 6%; Vibrio cholerae (4): 6%. References: 1 Sittka et al.; 2 Sonnleitner et al.; 3 Guisbert et al.; 4 Ding et al. 2004.

180 Validation of Translational Regulation in Salmonella: The propanediol utilization (pdu) operon. pdu is required for replication in macrophages. Deletion of hfq or smpB reduces Pdu protein levels, but mRNA levels show no change. This suggests a role for translational regulation in Salmonella pathogenesis. [Figure panels: LC-MS analysis of protein levels (stat_hfq, stat_smpb, stat_wt); confirmatory western blot analysis of Pdu; qRT-PCR analysis of mRNA levels.] Ansong PLoS ONE 2009, in press.

181 Additional Benefit of the AMT tag Approach: Increased depth of coverage increases confidence metrics. Protein: PSLT046, putative carbonic anhydrase.
LTQ-Orbitrap LC-MS/MS datasets, SEQUEST analysis results (columns: PeptideSequence, Unique Peptide Count):
K.IVGSMYHLTGGKVEFFEV
K.NVELTIENIRK.N 6 2
K.VVLVIGHTR.C 6 14
R.FRENRPAKHDYLAQK.R 6 3
R.KGSNYDFVDAVAR.K 6 1
R.VAGNISNR.D
Roll up to protein level: hfq_count 38, wt_count 1.
AMT tag analysis of the same LTQ-Orbitrap datasets (columns: PeptideSequence, Unique Peptide Count, hfq_abundance, wt_abundance):
R.KGSNYDFVDAVAR.K
R.KGSNYDFVDAVARK.N
R.KNVELTIENIRK.N
R.FRENRPAKHDYLAQK.R
R.APAEIVLDAGIGETFNSR.V
K.VVLVIGHTR.C
R.KNVELTIENIR.K
K.IVGSMYHLTGGKVEFFEV
A.ASLSKEERDGMTPDAVIEHFK.Q
A.ASLSKEERDGMTPDAVIEHFKQGNLR.F
R.NSIAGQYPAAVILSCSR.A
K.NSPVLKQLEDEKKIK.I
Roll up to protein level (columns: hfq_abundance, wt_abundance, Fold Change).

182 S. Typhimurium: Public health burden. Bacterial pathogen with a broad host range. Humans: causes salmonellosis (severe form of food poisoning); potentially life threatening to infants, elderly, and immunocompromised. 40,000+ reported salmonellosis cases/year in the U.S. Financial cost ~$3 billion/year (Tauxe 1986, USDA 07/08). Isolates becoming resistant to frontline antibiotics (fluoroquinolone drugs, Lowry 2006). Imperative to identify new targets for therapeutic intervention.

183 Understanding Regulation of Virulence in Salmonella. Hypothesis: knock out regulatory proteins involved in pathogenesis; the commonly regulated proteins represent the best targets for therapeutics. [Diagram: WT and mutants M1 to M5.]

184 Experimental Design: The Need/Use for Increased Throughput. Replicate analysis to account for natural biological and normal analytical variation. Design factors: mutant (WT, smpB, hfq, rpoE, himD, crp, slyA, hnr, phoP/Q, etc.), biological replicate, sample prep, cell fraction, analytical replicate, x4 contrasting conditions. 1080 analyses for 15 mutants; using biological pooling, 360 analyses.

185 Accurate Mass and Time Tag Approach. Starting point: biological sample, protein extraction. AMT tag database generation: complex samples fractionated, analyzed by LC-MS/MS, then analyzed using SEQUEST. High-throughput quantitative proteomics: peptides identified from LC-MS peaks by matching LC-MS features to AMT tags; protein abundance value calculation.

186 AMT tag Database Overview. AMT tag database: 4 Salmonella Typhimurium growth conditions; analyzed both unfractionated and SCX fractionated samples for each condition; 1349 LC-MS/MS analyses; primarily used two Thermo LTQs and one LTQ-Orbitrap; 77 days of instrument time; data processed with MASIC and SEQUEST; 70,000 AMT tags pass filters. LC-MS data: four growth conditions for three mutants; 409 LC-MS analyses over 10 days of instrument time on a Thermo LTQ-Orbitrap mass spectrometer (30 minute separations) (note: equivalent LC-MS/MS analyses would require 450 days); data processed with Decon2LS and VIPER; identified 16,367 AMT tags with ~1% FDR.

187 Replicate Reproducibility / Outlier Removal. Abundance profiles for ~700 proteins. Pearson's pairwise correlation plot of all Salmonella peptide identifications from each triplicate analysis, correlated to itself and to every other analysis. Each replicate analysis has a strong correlation to the other two replicate analyses for each sample, indicating reproducibility between replicate runs.

188 Conclusion. The AMT tag approach and the associated informatics pipeline enable Systems Biology experiments that would be impractical using shotgun proteomics and/or chemical/labeling methods. [Diagram: lipidomics, transcriptomics, metabolomics, and proteomics integrated with computational modeling.]

189 AMT tag Software Demo Part I: Introduction and Overview of Label-Free Quantitative Proteomics (Anderson) Part II: Feature Discovery in LC-MS Datasets (Monroe and Polpitiya) Break Part III: Biological Application of the AMT tag Approach (Ansong) AMT tag Software Demo MT Creator VIPER DAnTE See the supplied DVD for the software installers for the software that will be used in the live demo Panel Discussion

190 Example Data for the AMT tag Software Demo. Salmonella Typhimurium, LC-MS/MS: grown in LB (Luria-Bertani) up to log phase; mini AMT tag database, composed of 25 SCX fractions analyzed by LC-MS/MS; mass and time tag database composed from searches using X!Tandem (Log E_Value -2); linear alignment of datasets for the AMT tag database. LC-FTICR-MS analysis (LTQ-Orbitrap): knock-out mutant sample, grown and prepared in the same conditions; non-linear alignment and peak matching to the database.

191 AMT tag Software Demo Thanks to the many developers, beta testers, and users Note: PNNL is always looking for good and knowledgeable informatics staff and post-docs. See us afterward for more information, or visit

192 Funding for Tool Development. DOE Office of Biological and Environmental Research. NIH: National Center for Research Resources; National Institute of Allergy and Infectious Diseases; National Cancer Institute; National Institute of General Medical Sciences; National Institute of Diabetes and Digestive and Kidney Diseases.

193 See the supplied DVD for the installers of the software that will be used in the live demo.
MT Creator
  \Software_Installers\MTCreator\MTCreatorInstall.msi
  Data file: \Software_Installers\MTCreator\data\DatasetsDescription.txt
VIPER
  \Software_Installers\VIPER\VIPER_Installer.msi
  \Software_Installers\VIPER\LCMSFeatureer (Install this after installing Viper).msi
  Data file: \VIPER_Data\MT_S_typhimurium_X347\Job219616_LTQ-Orbitrap\Job gel
DAnTE
  \Software_Installers\DAnTE\R_Installers\R win32.exe
  \Software_Installers\DAnTE\R_Installers\RSrv250_pl1.exe
  \Software_Installers\DAnTE\DAnTE_Standalone_Installer\DAnTESetup.exe
  Data file: \DAnTE_Data\USHUPO2009_DAnTE_data.dnt
Note: Live demo will not be until the end of the course.