Practical Analysis of Proteome Data Using Bioinformatics and Statistics


Simon Barkow-Oesterreicher, Functional Genomics Center Zurich
Dr. Jonas Grossmann, Functional Genomics Center Zurich

Outline
- Challenges in proteomics data analysis
- Protein identification --> visualization and validation
- Scaffold software
- More than one search engine
- Quantitative proteomics
- Beyond protein lists --> pathway mapping, over-representation

Challenges in Proteomics
- Samples are usually very complex
  -> proteins differ widely (size, 3D structure, chemical groups)
  -> dynamic range (different abundances) of proteins (e.g. Rubisco in plants makes up to 50% of the total protein amount in green tissues)
- Unlike in transcriptomics, only the most abundant proteins are detected
- Because of this complexity, samples are usually fractionated (no clear cut)
- The random component in DDA (data-dependent acquisition) experiments makes reproducibility challenging
- Genomic sequence and annotation (predicted proteins) are essential
- Mass spectrometers are complex machines and do not always perform equally well (day-to-day variation)

Protein Identification Algorithms (using protein sequence databases)
Wet lab:
- protein of interest --> trypsin --> peptides of convenient size
- 1st MS --> MS spectrum --> selection
- fragmentation (b-ions, y-ions) --> 2nd MS --> MS/MS spectrum
In silico:
- genome sequence (>Ath_Chr1 ACGTTTAG GAGTTAGG ACCACCA) --> gene prediction
- protein sequences
  >At1g1120: MDASISTOK ADELIKAPPL EISTK
  >At1g1110: MDASISTALK ADELIKAPPL EISTK
- in silico tryptic peptides
  MPVCLLSTVK, ELIK, APPLEISTK
  MDASISTALK, ADELIK, APPLEISTK
- in silico theoretical spectrum
Scheme for protein identification: comparing the measured MS/MS spectrum with the theoretical spectra gives the peptide identification, which is followed by protein inference.
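The in silico digestion step of the scheme can be sketched as follows. This is a minimal illustration, not the implementation used by any search engine: the function name `tryptic_digest` is hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin rule, assumed here because the slide does not spell it out. The input is the At1g1110 example sequence from the slide.

```python
import re


def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence into fully tryptic peptides.

    Assumed rule: cleave C-terminal to K or R, but not when the
    following residue is P. Missed cleavages are not modelled.
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    # The split point at the very end of the sequence yields an empty piece.
    return [piece for piece in pieces if piece]


# Example protein taken from the slide (At1g1110).
peptides = tryptic_digest("MDASISTALKADELIKAPPLEISTK")
print(peptides)
```

A real search engine would additionally allow missed cleavages and modifications before computing the theoretical spectra.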

Peptide Identification
An example: human mitogen-activated protein kinase 8 (MAPK8), 427 aa
- MS-compatible peptides: tryptic and within the mass range of the MS
- observed peptides: good flight properties
- proteotypic peptides: unambiguous and frequently observed
Sources: Nat Rev Mol Cell Biol, 6(7):577-83, 2005; Nat Biotechnol, 25:125-131, 2007
Notes: Take human MAPK8 as the example protein. First check which tryptic peptides fall within the mass range of the MS (colored means MS-compatible). Next, see which peptides are actually observed because they have good flight properties. Finally, see which are unambiguous and frequently observed.

Output after peptide identification step
- An incomplete list of peptides which were presumably in the sample
- The identified peptides point to the corresponding proteins
- Some peptides are ambiguous (protein inference problem)
- Some proteins are identified with several peptides, others with only a single peptide
- Peptides and proteins each carry a score that indicates how well they are identified

Why validate?
- Every database search generates false positives and false negatives
- Downstream steps can cost a lot of time and money
- Get the most accurate protein hit list with a known false discovery rate (FDR)

                     Search algorithm prediction
                     True               False
Reality   True       True Positive      False Negative
          False      False Positive     True Negative

FPR vs FDR
- False positive rate (FPR): e.g. FPR = 5% means that on average 5% of the truly negative (null) features in the study will be called positive.
  Example: 10500 features in total, 500 truly positive and 10000 truly negative. An FPR of 5% means 500 false positives, i.e. 50% of all features called positive (if all 500 true positives are also called).
- False discovery rate (FDR): e.g. FDR = 5% means that among all the features called positive, 5% are truly negative (false positives) on average.
  Example: 500 positives, 25 of them false positives (5%).
Source: Storey and Tibshirani, PNAS 100 (16): 9440 (2003)
Note: FDR and FPR are often used interchangeably in the proteomics community, which causes confusion; hence the definitions in words given here.
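The difference between the two rates is only the denominator, which a few lines of Python make explicit. The function names are illustrative, and the counts are the ones from the slide's two examples (with the assumption, for the first example, that all 500 true positives are called).

```python
def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """FPR = FP / (FP + TN): share of truly negative features called positive."""
    return false_pos / (false_pos + true_neg)


def false_discovery_rate(false_pos: int, true_pos: int) -> float:
    """FDR = FP / (FP + TP): share of called positives that are wrong."""
    return false_pos / (false_pos + true_pos)


# Example 1: 10000 truly negative features, 500 of them called positive,
# plus 500 true positives (assumed to be all called).
fpr_1 = false_positive_rate(false_pos=500, true_neg=9500)
fdr_1 = false_discovery_rate(false_pos=500, true_pos=500)

# Example 2: 500 features called positive, 25 of them false.
fdr_2 = false_discovery_rate(false_pos=25, true_pos=475)

print(f"example 1: FPR = {fpr_1:.0%}, FDR = {fdr_1:.0%}")
print(f"example 2: FDR = {fdr_2:.0%}")
```

The first example shows why a 5% FPR can still leave half of the hit list wrong when true positives are rare.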

Validation of Peptide Identification & Protein inference
[Figure: workflow with PeptideProphet (peptide validation) and ProteinProphet (protein inference); the figure is annotated "Issue #1"]
From Nesvizhskii et al., Anal. Chem. 2003, 75, 4646-4658
Simon Barkow & Jonas Grossmann, FGCZ Proteomics

Peptide validation by algorithm
- Key question: how to determine which identifications are valid
- Typical method: accept all identifications above a chosen discriminant score of a search engine (e.g. Mascot Ion Score)
- Choosing a threshold is problematic because it depends on the sample, the search database, etc.
- Use a validation algorithm that is based on experience: PeptideProphet

Histogram of scores
Once the discriminant scores for all the spectra in a sample are calculated, PeptideProphet makes a histogram of these discriminant scores. For example, in the sample shown here, 70 spectra have scores around 2.5.
[Figure: histogram; x-axis: discriminant score (D), y-axis: number of spectra in each bin]

Mixture of distributions
This histogram shows the distributions of correct and incorrect matches. PeptideProphet assumes that these distributions are standard statistical distributions. Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
The correct and incorrect matches shown here were validated manually in a sample with a known set of 18 proteins.
[Figure: histogram with fitted "incorrect" and "correct" curves; x-axis: discriminant score (D), y-axis: number of spectra in each bin]

Bayesian statistics
Once the correct and incorrect distributions are drawn, PeptideProphet uses Bayesian statistics to compute the probability p(+|D) that a match is correct, given a discriminant score D.
[Figure: histogram with "incorrect" and "correct" curves; x-axis: discriminant score (D), y-axis: number of spectra in each bin]

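Probability of a correct match
The statistical formula looks fierce, but relating it to the histogram shows that the probability of a score of 2.5 being correct is the number of correct matches in the 2.5 bin divided by the total number of spectra in that bin.
[Figure: histogram with "incorrect" and "correct" curves; x-axis: discriminant score (D), y-axis: number of spectra in each bin]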
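The Bayesian step can be sketched with a two-component mixture. This is an illustration under stated assumptions, not PeptideProphet's actual model: both components are taken to be normal distributions, and the means, standard deviations and the prior fraction of correct matches are made-up values rather than fitted ones. The quantity computed is p(+|D) = p(D|+) p(+) / (p(D|+) p(+) + p(D|-) p(-)).

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_correct(score: float,
                 prior_correct: float = 0.3,        # assumed p(+)
                 correct=(4.0, 1.5),                # assumed (mean, sd) of correct matches
                 incorrect=(-1.0, 1.0)) -> float:   # assumed (mean, sd) of incorrect matches
    """Posterior probability p(+|D) that a match with score D is correct."""
    weight_correct = prior_correct * normal_pdf(score, *correct)
    weight_incorrect = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return weight_correct / (weight_correct + weight_incorrect)


for d in (-1.0, 2.5, 5.0):
    print(f"D = {d:4.1f}  p(+|D) = {prob_correct(d):.4f}")
```

In terms of the histogram, the two weights are the heights of the "correct" and "incorrect" curves at score D, so the posterior is the correct share of that bin.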


How to get even more confidence?
- Compare the peptide patterns seen in each replicate for the same protein
- Manually examine the spectrum for critical or characteristic fragment ions (especially for single hits)
- Compare scores from various search engines (Mascot, SEQUEST, X!Tandem, etc.)
- Compare other characteristics of the identified peptides (NTT, NMC, ...)

PeptideProphet features
- Combines database search scores
- Number of tryptic termini (NTT)
- Number of missed cleavage sites (NMC)
- Mass difference between theoretical mass and measured mass
- Peptide retention time (expected vs measured)
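Two of the listed features, NTT and NMC, can be derived from the peptide and its position in the protein. The sketch below is a simplified illustration, not PeptideProphet code: the function name is hypothetical, only the first occurrence of the peptide is considered, and the proline exception is applied to internal sites only.

```python
TRYPTIC_RESIDUES = "KR"


def ntt_and_nmc(protein: str, peptide: str) -> tuple[int, int]:
    """Return (number of tryptic termini, number of missed cleavage sites)."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)

    # N-terminus is tryptic if the peptide starts the protein or follows K/R.
    n_term_ok = start == 0 or protein[start - 1] in TRYPTIC_RESIDUES
    # C-terminus is tryptic if the peptide ends the protein or ends in K/R.
    c_term_ok = end == len(protein) or peptide[-1] in TRYPTIC_RESIDUES
    ntt = int(n_term_ok) + int(c_term_ok)

    # Missed cleavages: internal K/R that is not followed by P.
    nmc = sum(
        1
        for i, residue in enumerate(peptide[:-1])
        if residue in TRYPTIC_RESIDUES and peptide[i + 1] != "P"
    )
    return ntt, nmc


protein = "MDASISTALKADELIKAPPLEISTK"
for pep in ("MDASISTALK", "ADELIKAPPLEISTK", "DELIK"):
    print(pep, ntt_and_nmc(protein, pep))
```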

Scaffold Workflow

Experimental Design
Three hierarchies:
1. Sample Category: disease vs. control, treated vs. untreated, etc.
2. Biosample: drop of blood, tissue sample, etc.
3. MS Sample: each individual spot (MALDI), or one LC fraction

Scaffold Sample Window
- Overview for comparisons
- Lists and summarizes the proteins identified in each biosample or MS sample
- Identification probability
- Number of unique peptides on which the identification is based
- Percentage of the total spectra that this number represents
- Number of unique spectra associated with this protein

Scaffold Protein Window
- All information about a single protein
- Sequence coverage for this and similar proteins
- Protein sequence, with identified peptides highlighted in yellow and modifications highlighted in green
- The spectra used to identify each peptide
- Lots of data about the peptides that can be reviewed to gain confidence

Scaffold Quantify Window
- View spectral count numbers for biosamples (same color) and categories (different color)
- Scatterplots pane shows the degree of error associated with the spectral count
- Venn diagram shows the relationship between categories of proteins, unique peptides, or unique spectra identifications
- GO (Gene Ontology) mesh terms pane

Scaffold Statistics Window
- Check whether your data meets Scaffold's assumptions
- Statistical information for each MS sample in your analysis
- Relationship between peptide and protein probabilities
- Histogram demonstrating correct and incorrect peptide assignments (used by PeptideProphet)
- Scatterplot comparing two or more search engine results

Search Algorithms
- MASCOT
- SEQUEST
- X!TANDEM
- OMSSA
- Spectrum Mill
All of them can be combined with Scaffold

Why is the overlap small?
The search engines identify different spectra because each program has different strengths:
- SEQUEST: considers intensities
- X!Tandem: semi-tryptic, no neutral loss fragments
- Mascot: probability-based scoring
[Venn diagram of the spectra identified by SEQUEST, X!Tandem and Mascot; region labels: 9%, 22%, 4%, 34%, 19%, 7%, 5%]

Decoy searches applicable everywhere >sp Q4U9M9 104K_THEAN 104 kda microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1 MKFLVLLFNILCLFPILGADELVMSPIPTTDVQPKVTFDINSEVSSGPLYLNPVEMAGVK YLQLQRQPGVQVHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLK EGDQWAPIPEDQYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTP KNGHICKMVYDKNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLD DKYAPISVQGYVATIPKLKDFAEPYHPIILDISDIDYVNFYLGDATYHDPGFKIVPKTPQ CITKVVDGNEVIYESSNPSVECVYKVTYYDKKNESMLRLDLNHSPPSYTSYYAKREGVWV TSTYIDLEEKIEELQDHRSTELDVMFMSDKDLNVVPLTNGNLEYFMVTPKPHRDIIIVFD GSEVLWYYEGLENHLVCTWIYVTEGAPRLVHLRVKDRIPQNTDIYMVKFGEYWVRISKTQ YTQEIKKLIKKSKKKLPSIEEEDSDKHGGPPKGPEPPTGPGHSSSESKEHEDSKESKEPK EHGSPKETKEGEVTKKPGPAKEHKPSKIPVYTKRPEFPKKSKSPKRPESPKSPKRPVSPQ RPVSPKSPKRPESLDIPKSPKRPESPKSPKRPVSPQRPVSPRRPESPKSPKSPKSPKSPK VPFDPKFKEKLYDSYLDKAAKTKETVTLPPVLPTDESFTHTPIGEPTAEQPDDIEPIEES VFIKETGILTEEVKTEDIHSETGEPEEPKRPDSPTKHSPKPTGTHPSMPKKRRRSDGLAL STTDLESEAGRILRDPTGKIVTMKRSKSFDDLTTVREKEHMGAEIRKIVVDDDGTEADDE DTHPSKEKHLSTVRRRRPRPKKSSKSSKPRKPDSAFVPSIIFIFLVSLIVGIL 26

Decoy searches applicable everywhere >sp REV_Q4U9M9 REV_104K_THEAN 104 kda microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1 LIGVILSVLFIFIISPVFASDPKRPKSSKSSKKPRPRRRRVTSLHKEKSPHTDEDDAETG DDDVVIKRIEAGMHEKERVTTLDDFSKSRKMTVIKGTPDRLIRGAESELDTTSLALGDSR RRKKPMSPHTGTPKPSHKTPSDPRKPEEPEGTESHIDETKVEETLIGTEKIFVSEEIPEI DDPQEATPEGIPTHTFSEDTPLVPPLTVTEKTKAAKDLYSDYLKEKFKPDFPVKPSKPSK PSKPSKPSEPRRPSVPRQPSVPRKPSKPSEPRKPSKPIDLSEPRKPSKPSVPRQPSVPRK PSKPSEPRKPSKSKKPFEPRKTYVPIKSPKHEKAPGPKKTVEGEKTEKPSGHEKPEKSEK SDEHEKSESSSHGPGTPPEPGKPPGGHKDSDEEEISPLKKKSKKILKKIEQTYQTKSIRV WYEGFKVMYIDTNQPIRDKVRLHVLRPAGETVYIWTCVLHNELGEYYWLVESGDFVIIID RHPKPTVMFYELNGNTLPVVNLDKDSMFMVDLETSRHDQLEEIKEELDIYTSTVWVGERK AYYSTYSPPSHNLDLRLMSENKKDYYTVKYVCEVSPNSSEYIVENGDVVKTICQPTKPVI KFGPDHYTADGLYFNVYDIDSIDLIIPHYPEAFDKLKPITAVYGQVSIPAYKDDLLQFYK NGIMGRDDIVFINLLLLKLGRFFGIVSTVYENYLAKFIRINKDYVMKCIHGNKPTFVVMK ISHQFSSVMEYKYNEHQFSLNLSFFSETHIQQRLQQLRALYQDEPIPAWQDGEKLFFILD PDELLEVYAMYPVENQTVIACTYLPMEENEWIVIDGEVVKHVQVGPQRQLQLYKVGAMEV PNLYLPGSSVESNIDFTVKPQVDTTPIPSMVLEDAGLIPFLCLINFLLVLFKM 26
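Building a reversed decoy entry like the one shown is a one-line string reversal plus a renamed header. The sketch below is illustrative: the function name and the `REV_` prefix handling are assumptions modelled on the entry shown on the slide, and the header is emitted without the description fields. The test sequence is the last 15 residues of the forward protein.

```python
def reversed_decoy(accession: str, name: str, sequence: str,
                   prefix: str = "REV_") -> tuple[str, str]:
    """Return (FASTA header, reversed sequence) for a decoy entry."""
    header = f">sp {prefix}{accession} {prefix}{name}"
    return header, sequence[::-1]


# C-terminal end of the forward entry Q4U9M9 from the slide.
header, decoy = reversed_decoy("Q4U9M9", "104K_THEAN", "SIIFIFLVSLIVGIL")
print(header)
print(decoy)
```

Reversal keeps the amino acid composition and length of every protein, so the decoy database has the same size as the forward database.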

1) Sequest & TPP, no decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        3176            9771        20627
single hits        1148            -           -
REV proteins       -               -           -
REV single hits    -               -           -
Total: 3176 proteins
[Pie chart "Overall ath 801": 36% / 64%]

2) Sequest & TPP, with decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        2840            8994        18662
single hits        952             -           -
REV proteins       103             104         126
REV single hits    102             -           -
FDR                3.76%           1.17%       0.68%
Total: 2943 proteins
FDR calculation, e.g. for peptides: 104 / (8994 - 104)
[Pie chart "Overall ath 801": 32% / 3% / 0% / 64%]

Notes: The regular procedure takes only one search engine into account (sometimes even without a decoy database) and uses the TPP for statistical evaluation. The difference between the decoy and non-decoy searches: a different fit of the probability function makes the cutoff slightly more stringent, which results in fewer peptide identifications.
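The FDR row of the second table can be reproduced with the formula given on the slide, REV / (fw - REV). The function name below is illustrative; other decoy-based FDR estimators exist, and this sketch only mirrors the slide's calculation.

```python
def decoy_fdr(forward_hits: int, reverse_hits: int) -> float:
    """FDR estimate as used on the slide: REV / (fw - REV)."""
    return reverse_hits / (forward_hits - reverse_hits)


# (forward, reverse) counts from the table of the decoy search.
counts = {
    "proteins": (2840, 103),
    "peptides": (8994, 104),
    "MS/MS": (18662, 126),
}

for level, (fw, rev) in counts.items():
    print(f"{level:9s} FDR = {100 * decoy_fdr(fw, rev):.2f}%")
```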

Decoy searches - Limitations
- Decoy searches can be applied everywhere, BUT calculating an FDR only makes sense if a large number of proteins is identified (more than ~200)
- If the calculated FDR is very high, there is a good chance that some search parameters are wrong or that some PTMs are not specified
- Reversed databases are favored over scrambled ones
- A low FDR doesn't mean perfect results

Quantitative Proteomics - my critical view
- Is what everybody is looking for
- Is what many people claim to do
- Is definitely the right way to go in the future
- Is absolutely necessary for Systems Biology
- Is essential to really understand the dynamics of the proteome
- Is not really straightforward

Quantitative Proteomics - What is it?
- Find relative changes of protein abundance between 2 similar samples (wild type vs mutant, condition_1 vs condition_2)
- Determine absolute protein concentrations in a sample (to draw conclusions about copy numbers and translation efficiency) -> AQUA peptides
- Find regulatory proteins and elucidate regulatory pathways

Quantitative Proteomics - How can it be achieved?
- Labeling strategies for differential expression (ICAT, iTRAQ, TMT, SILAC --> wet lab)
- Label-free approaches for differential expression (--> software solutions)
- Targeted approaches (SRM, MRM --> mass spec approach)

Quantitative Proteomics (differential expression)
Label strategy:
- only ONE run is acquired
- sample prep solution: iTRAQ/TMT, ICAT, SILAC
- -> sample prep is the problematic part
Label-free:
- 2 individual runs are acquired
- software solution: Progenesis, SuperHirn
- -> aligning and run-to-run variation are the problematic parts

ICAT
- labels have different weights
- Quantification is done on the MS1 level

iTRAQ
- all labels have the same weight --> all parent ions are the same
- Quantification is done on the MS/MS level

Beyond Protein Lists and Quantitation - what else?
- Check for over/under-representation of GO terms
- Functional categorization
- Project regulated proteins onto a metabolic pathway map

Principle of Over-representation Analysis - an easy example
The principle:
- organism with 1000 genes
- binned in 5 equal categories with 200 genes each
- GO categories 1-5: transcription, translation, energy delivery, nutrient uptake, degradation
The researcher decides to do proteomics (brute force):
- 200 genes are identified --> 1/5th of all
- statistically you would expect to find approx. 40 genes for each category
In fact you find about 100 genes from the GO category energy delivery
---> the category energy delivery is significantly enriched
---> different statistics can be applied
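One of the statistics that can be applied to this example is a one-sided hypergeometric test, sketched below with the numbers from the slide. The choice of test and the function names are assumptions for illustration; the slide only states that different statistics are possible.

```python
from math import comb


def expected_hits(population: int, category_size: int, sample_size: int) -> float:
    """Expected number of category members in a random sample."""
    return sample_size * category_size / population


def enrichment_p_value(population: int, category_size: int,
                       sample_size: int, observed: int) -> float:
    """P(X >= observed) when drawing sample_size genes without replacement."""
    upper = min(category_size, sample_size)
    favourable = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, sample_size)


# 1000 genes, 200 per category, 200 identified, 100 from "energy delivery".
print("expected:", expected_hits(1000, 200, 200))
print("p-value :", enrichment_p_value(1000, 200, 200, 100))
```

The p-value depends directly on the population size, which is why the choice of background matters so much for proteomics data.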

Principle of ORA - in the case of proteomics
- The number of measured and identified proteins is still far from complete
- Over-representation analysis allows you to find pathways or systems which are regulated or involved in a certain context -> but it is important to select the correct background/universe
Principle:
- all genes of an organism are binned in categories
- categories are related to gene function (e.g. Gene Ontology categories)
- compare your identifications to randomly drawn genes
Background problem:
- take as background only those proteins ever identified in this species, or
- take all identified proteins as background and the proteins which seem to be regulated as the genes of interest (targets), e.g. in an iTRAQ experiment
Tools:
- R package --> topGO
- Web --> GOTreeMachine (bioinfo.vanderbilt.edu/gotm/)

Scenario (from HTP proteomics)
- Arabidopsis thaliana: the model plant ---> ~28 000 genes
- Single plant cells in liquid culture
- Grown in sugar-containing solution with weekly subculturing
- One part grown in the dark (cardboard box)
- One part grown in long-day conditions (16 h light)
- Extensive LTQ MS analysis --> 800 LC-MS runs (fractionation & replicates)
- A total of 7983 proteins identified from all samples (~30% of all genes encoded in the genome) --> background
- 6547 from the cell cultures that were kept in the dark
- 6474 from the cell cultures that were illuminated

[Figure: GO biological_process term graphs (root GO:0008150) for the Dark and Light samples. Two panels are shown for the proteins from CC_dark: background (BG) = full universe of GO, and BG = only proteins identified in cell culture (CC). Terms in the graphs include metabolic process (GO:0008152), cellular process (GO:0009987), localization (GO:0051179), transport (GO:0006810), translation (GO:0006412) and protein transport (GO:0015031).]

Projection onto Metabolic Pathway Maps
Same data (e.g. with the MapMan software (Golm))
[Figure: pathway maps for Dark and Light; legend: only found in light, found in both, only found in dark]

Q & A

Hands on: your turn now, feel free to ask

Scaffold hands on - Example One
- Load your own data with Scaffold before we continue
- Also use X!Tandem to search
- Have a look at the results
- Is it valid to calculate an FDR?
- How high is your FDR?

More from Scaffold Q+ hands on... with iTRAQ data

Scenario: Mouse data
- Liver tissue iTRAQ data (Swiss mouse: standard diet vs high fat diet)
- Mouse decoy database search with Mascot -> dat files
- Labels: 116 -> high fat diet; 114, 115, 117 -> standard diet
- Check reproducibility (standard diet vs standard diet)
- Find proteins which are regulated in high fat diet / standard diet

Task with Scaffold Q+
- How consistent are the peptides of the same protein?
- Find confident thresholds for proteins being over/under-expressed
- Which proteins in this example do you consider to be over/under-expressed?
- Can you make sense of these proteins?
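One way to approach the threshold question is to use a standard diet vs standard diet ratio as an empirical null and call a protein regulated only if its high fat / standard ratio lies outside that range. The sketch below is an illustration of that idea, not Scaffold Q+ functionality: the function names, the 95% quantile and the reporter intensities are all made up for the example.

```python
import math


def log2_ratios(numerator: list[float], denominator: list[float]) -> list[float]:
    """Per-protein log2 ratio of two reporter channels."""
    return [math.log2(a / b) for a, b in zip(numerator, denominator)]


def null_cutoff(null_ratios: list[float], quantile: float = 0.95) -> float:
    """Absolute log2 ratio below which `quantile` of the null ratios fall."""
    ranked = sorted(abs(r) for r in null_ratios)
    index = min(len(ranked) - 1, math.ceil(quantile * len(ranked)) - 1)
    return ranked[index]


def regulated(test_ratios: list[float], cutoff: float) -> list[int]:
    """Indices of proteins whose absolute log2 ratio exceeds the cutoff."""
    return [i for i, r in enumerate(test_ratios) if abs(r) > cutoff]


# Hypothetical reporter intensities for five proteins.
ch114 = [100.0, 200.0, 150.0, 80.0, 120.0]   # standard diet
ch115 = [110.0, 190.0, 150.0, 85.0, 118.0]   # standard diet
ch116 = [105.0, 400.0, 148.0, 30.0, 121.0]   # high fat diet

cutoff = null_cutoff(log2_ratios(ch115, ch114))
hits = regulated(log2_ratios(ch116, ch114), cutoff)
print("cutoff:", round(cutoff, 3), "regulated protein indices:", hits)
```

With real data, the null distribution would come from all proteins in the standard/standard comparison, and peptide-level consistency should be checked before trusting a protein ratio.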

What should come out: only 2 quant categories
[Figure: "Histogram 2 Categories Liver Ex4"; frequency (0 to 300) versus log2(ratio) (-1.4 to 1.4) for the series StDiet/StDiet and HighFatDiet/StDiet]

What should come out: 4 quant categories
[Figure: "Histogram 4 Categories Liver Ex4"; frequency (0 to 400) versus log2(ratio) (-2 to 2) for the series ratio_2 (st/st), ratio_3 (high fat / st) and ratio_4 (st/st)]

Regulated Proteins: The List
- 2 ways of making sense of this data
- Take the intersection of the 2 lists (should be the most confident)
[Venn diagram: 4 categories, 48 regulated proteins; 2 categories, 44 regulated proteins; 37 in the intersection]

Make sense out of lists: this does make sense!!

Paint it on Reactome maps


Scaffold Similarity Window
- Review and control the peptide/protein mapping
- View protein groups in which peptides are shared
- Check or uncheck the valid box for a peptide sequence
- Peptides identified in particular protein groups are color coded