Practical Analysis of Proteome Data Using Bioinformatics and Statistics
Simon Barkow-Oesterreicher, Functional Genomics Center Zurich
Dr. Jonas Grossmann, Functional Genomics Center Zurich
Outline
- Challenges in proteomics data analysis
- Protein identification --> visualization and validation
- Scaffold software
- More than one search engine
- Quantitative proteomics
- Beyond protein lists --> pathway mapping, over-representation
Challenges in Proteomics
- Samples are usually very complex
  -> proteins differ widely (size, 3D structure, chemical groups)
  -> wide dynamic range of protein abundances (e.g. Rubisco makes up to 50% of the total protein amount in green plant tissues)
- Unlike in transcriptomics, only the most abundant proteins are detected
- Because of this complexity, samples are usually fractionated (the fractions are not clear-cut)
- The random component in DDA (data-dependent acquisition) experiments makes reproducibility challenging
- Genomic sequence and annotation (predicted proteins) are essential
- Mass spectrometers are complex machines and do not always perform equally well (day-to-day variation)
Protein Identification Algorithms (using protein sequence databases)
Scheme for protein identification
Wet lab:
- protein of interest --> trypsin --> peptides of convenient size
- 1st MS --> MS spectrum --> selection and fragmentation (b-ions, y-ions) --> 2nd MS --> MS/MS spectrum
In silico:
- genome sequence (>Ath_Chr1 ACGTTTAG GAGTTAGG ACCACCA) --> gene prediction --> protein sequences
- protein sequences: >At1g1120 MDASISTOK ADELIKAPPL EISTK and >At1g1110 MDASISTALK ADELIKAPPL EISTK
- in silico tryptic peptides (MPVCLLSTVK, ELIK, APPLEISTK; MDASISTALK, ADELIK, APPLEISTK) --> in silico theoretical spectrum
Comparing the MS/MS spectrum with the theoretical spectrum gives the peptide identification; assigning the identified peptides to proteins is the protein inference step.
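The in silico tryptic digest in the scheme can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular search engine: it assumes the common rule "cleave after K or R, but not before P", allows no missed cleavages, and the function name `tryptic_digest` is made up for this example. The input is the At1g1110 example sequence from the slide.

```python
import re


def tryptic_digest(protein: str) -> list[str]:
    """Return fully tryptic peptides (no missed cleavages).

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    """
    # Zero-width split: a position preceded by K/R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    # A cleavage site at the very end leaves an empty string; drop it.
    return [p for p in pieces if p]


peptides = tryptic_digest("MDASISTALKADELIKAPPLEISTK")
print(peptides)  # ['MDASISTALK', 'ADELIK', 'APPLEISTK']
```

The three peptides match the tryptic peptides listed for this protein in the scheme.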
Peptide Identification
An example: human mitogen-activated protein kinase 8 (MAPK8), 427 aa
- MS-compatible peptides: tryptic and within the mass range of the MS
- observed peptides: good flight properties
- proteotypic peptides: unambiguous and frequently observed
Sources: Nat Rev Mol Cell Biol, 6(7):577-83, 2005; Nat Biotechnol, 25:125-131, 2007
Notes: Taking human MAPK8 as an example, we first check which tryptic peptides fall in the mass range of the MS (colored means MS-compatible), next which peptides are actually observed because they have good flight properties, and finally which are unambiguous and frequently observed.
Output after the peptide identification step
- An incomplete list of peptides which were presumably in the sample
- The identified peptides point to the corresponding proteins
- Some peptides are ambiguous (protein inference problem)
- Some proteins are identified with several peptides, others with only a single peptide
- Both the peptides and the proteins carry a score that reflects how well they are identified
Why validate?
- Every database search generates false positives and false negatives
- Downstream steps can cost a lot of time and money
- Get the most accurate protein hit list with a known false discovery rate (FDR)

                      Search algorithm prediction
                      True              False
Reality    True       True Positive     False Negative
           False      False Positive    True Negative
FPR vs FDR
False positive rate (FPR): e.g. FPR = 5% means that on average 5% of the truly negative features in the study will be called positive.
  Example: 10500 features in total, 500 truly positive and 10000 truly negative; FPR = 5% means 500 false positives (50% of all features called positive).
False discovery rate (FDR): e.g. FDR = 5% means that among all the features called positive, 5% are truly negative on average.
  Example: 500 features called positive, 25 of them false positives (5%).
Source: Storey and Tibshirani, PNAS 100 (16): 9440 (2003)
Note: in the proteomics community FDR and FPR are often used for the same thing, which causes confusion, hence the definitions in words here.
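The two numerical examples on this slide can be checked with a short calculation. The sketch below only restates the definitions; the function names are chosen for this illustration, and it assumes, as the slide's example does, that all 500 truly positive features are called positive.

```python
def false_positive_rate(false_positives: int, truly_negative: int) -> float:
    """FPR: fraction of the truly negative features that are called positive."""
    return false_positives / truly_negative


def false_discovery_rate(false_positives: int, called_positive: int) -> float:
    """FDR: fraction of the features called positive that are truly negative."""
    return false_positives / called_positive


# Example 1: 10000 truly negative features, 500 truly positive, FPR fixed at 5%.
false_pos = round(0.05 * 10000)       # 500 false positives
called = 500 + false_pos              # 1000 features called positive
print(false_positive_rate(false_pos, 10000))  # 0.05
print(false_discovery_rate(false_pos, called))  # 0.5

# Example 2: 500 features called positive, 25 of them false positives.
print(false_discovery_rate(25, 500))  # 0.05
```

The same 5% therefore means very different things: under an FPR of 5% half of the hit list in example 1 is wrong, under an FDR of 5% only one hit in twenty is.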
Validation of Peptide Identification & Protein Inference
- Peptide Prophet
- Protein Prophet
Issue #1
From Nesvizhskii et al., Anal. Chem. 2003, 75, 4646-4658
Simon Barkow & Jonas Grossmann, FGCZ Proteomics
Peptide validation by algorithm
- Key question: how to determine which identifications are valid
- Typical method: accept all identifications above a chosen discriminant score of a search engine (e.g. Mascot Ion Score)
- Choosing a threshold is problematic because it depends on the sample, the search database, etc.
- Use a validation algorithm that is based on experience: PeptideProphet
Histogram of scores
Once the discriminant scores for all the spectra in a sample are calculated, PeptideProphet makes a histogram of these discriminant scores. For example, in the sample shown here, 70 spectra have scores around 2.5.
[Figure: histogram; x-axis: discriminant score (D); y-axis: number of spectra in each bin]
Mixture of distributions
This histogram shows the distributions of correct and incorrect matches, validated manually in a sample with a known set of 18 proteins. PeptideProphet assumes that these distributions are standard statistical distributions. Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
[Figure: histogram with fitted "incorrect" and "correct" curves; x-axis: discriminant score (D); y-axis: number of spectra in each bin]
Bayesian statistics
Once the correct and incorrect distributions are drawn, PeptideProphet uses Bayesian statistics to compute the probability p(+|D) that a match is correct, given a discriminant score D.
[Figure: histogram with fitted "incorrect" and "correct" curves; x-axis: discriminant score (D); y-axis: number of spectra in each bin]
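The Bayesian step can be written down for a generic two-component mixture: p(+|D) = p(D|+) p(+) / (p(D|+) p(+) + p(D|-) p(-)). The sketch below is an illustration of that formula only, not PeptideProphet's actual code. It assumes, purely for simplicity, that both fitted distributions are normal, and all parameter values (means, standard deviations, prior) are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution, used here for both mixture components."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(d: float, prior_correct: float,
                      correct: tuple[float, float],
                      incorrect: tuple[float, float]) -> float:
    """p(+|D) from Bayes' rule; each component is given as (mean, sd)."""
    weighted_correct = prior_correct * normal_pdf(d, *correct)
    weighted_incorrect = (1.0 - prior_correct) * normal_pdf(d, *incorrect)
    return weighted_correct / (weighted_correct + weighted_incorrect)


# Hypothetical fitted parameters: incorrect matches centered at -1, correct at 3.
for score in (-1.0, 1.0, 2.5):
    p = posterior_correct(score, prior_correct=0.5,
                          correct=(3.0, 1.0), incorrect=(-1.0, 1.0))
    print(f"D = {score:4.1f}  p(+|D) = {p:.3f}")
```

In histogram terms the posterior is simply the height of the "correct" curve at D divided by the summed height of both curves at D.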
Probability of a correct match
The statistical formula looks fierce, but relating it to the histogram shows that the probability of a score of 2.5 being correct is
[Figure: histogram with fitted "incorrect" and "correct" curves; x-axis: discriminant score (D); y-axis: number of spectra in each bin]
How to get even more confidence?
- Compare the peptide patterns seen in each replicate for the same protein
- Manually examine the spectrum for critical or characteristic fragment ions (especially for single hits)
- Compare scores from various search engines (Mascot, SEQUEST, X!Tandem, etc.)
- Compare other characteristics of the identified peptides (NTT, NMC, ...)
Peptide Prophet features
- Combines database search scores
- Number of tryptic termini (NTT)
- Number of missed cleavage sites (NMC)
- Mass difference between theoretical mass and measured mass
- Peptide retention time (expected vs measured)
Scaffold Workflow 18
Experimental Design
Three hierarchies:
1. Sample Category: disease vs. control, treated vs. untreated, etc.
2. Biosample: drop of blood, tissue sample, etc.
3. MS Sample: each individual spot (MALDI), or one LC fraction
Scaffold Sample Window
- Overview for comparisons
- Lists and summarizes the proteins identified in each biosample or MS sample
- Identification probability
- Number of unique peptides on which the identification is based
- Percentage of the total spectra that this number represents
- Number of unique spectra associated with this protein
Scaffold Protein Window
- All information about a single protein
- Sequence coverage for this and similar proteins
- Peptide sequence, with identified peptides highlighted in yellow and modifications highlighted in green
- The spectra used to identify each peptide
- Lots of data about the peptides that can be reviewed to gain confidence
Scaffold Quantify Window
- View spectral count numbers for biosamples (same color) and categories (different color)
- Scatterplots pane shows the degree of error associated with the spectral count
- Venn diagram shows the relationship between categories of proteins, unique peptides, or unique spectra identifications
- GO (Gene Ontology) terms pane
Scaffold Statistics Window
- Check whether your data meets Scaffold's assumptions
- Statistical information for each MS sample in your analysis
- Relationship between peptide and protein probabilities
- Histogram demonstrating correct and incorrect peptide assignments (used by PeptideProphet)
- Scatterplot comparing two or more search engine results
Search Algorithms
- MASCOT
- SEQUEST
- X!TANDEM
- OMSSA
- Spectrum Mill
All of them can be combined with Scaffold
Why is the overlap small?
The programs identify different spectra because each of them has different strengths:
- SEQUEST: considers intensities
- X!Tandem: semi-tryptic, no neutral loss fragments
- Mascot: probability-based scoring
[Figure: Venn diagram of the spectra identified by SEQUEST, X!Tandem and Mascot; region percentages 9%, 22%, 4%, 34%, 19%, 7%, 5%]
Decoy searches applicable everywhere >sp Q4U9M9 104K_THEAN 104 kda microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1 MKFLVLLFNILCLFPILGADELVMSPIPTTDVQPKVTFDINSEVSSGPLYLNPVEMAGVK YLQLQRQPGVQVHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLK EGDQWAPIPEDQYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTP KNGHICKMVYDKNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLD DKYAPISVQGYVATIPKLKDFAEPYHPIILDISDIDYVNFYLGDATYHDPGFKIVPKTPQ CITKVVDGNEVIYESSNPSVECVYKVTYYDKKNESMLRLDLNHSPPSYTSYYAKREGVWV TSTYIDLEEKIEELQDHRSTELDVMFMSDKDLNVVPLTNGNLEYFMVTPKPHRDIIIVFD GSEVLWYYEGLENHLVCTWIYVTEGAPRLVHLRVKDRIPQNTDIYMVKFGEYWVRISKTQ YTQEIKKLIKKSKKKLPSIEEEDSDKHGGPPKGPEPPTGPGHSSSESKEHEDSKESKEPK EHGSPKETKEGEVTKKPGPAKEHKPSKIPVYTKRPEFPKKSKSPKRPESPKSPKRPVSPQ RPVSPKSPKRPESLDIPKSPKRPESPKSPKRPVSPQRPVSPRRPESPKSPKSPKSPKSPK VPFDPKFKEKLYDSYLDKAAKTKETVTLPPVLPTDESFTHTPIGEPTAEQPDDIEPIEES VFIKETGILTEEVKTEDIHSETGEPEEPKRPDSPTKHSPKPTGTHPSMPKKRRRSDGLAL STTDLESEAGRILRDPTGKIVTMKRSKSFDDLTTVREKEHMGAEIRKIVVDDDGTEADDE DTHPSKEKHLSTVRRRRPRPKKSSKSSKPRKPDSAFVPSIIFIFLVSLIVGIL 26
Decoy searches applicable everywhere >sp REV_Q4U9M9 REV_104K_THEAN 104 kda microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1 LIGVILSVLFIFIISPVFASDPKRPKSSKSSKKPRPRRRRVTSLHKEKSPHTDEDDAETG DDDVVIKRIEAGMHEKERVTTLDDFSKSRKMTVIKGTPDRLIRGAESELDTTSLALGDSR RRKKPMSPHTGTPKPSHKTPSDPRKPEEPEGTESHIDETKVEETLIGTEKIFVSEEIPEI DDPQEATPEGIPTHTFSEDTPLVPPLTVTEKTKAAKDLYSDYLKEKFKPDFPVKPSKPSK PSKPSKPSEPRRPSVPRQPSVPRKPSKPSEPRKPSKPIDLSEPRKPSKPSVPRQPSVPRK PSKPSEPRKPSKSKKPFEPRKTYVPIKSPKHEKAPGPKKTVEGEKTEKPSGHEKPEKSEK SDEHEKSESSSHGPGTPPEPGKPPGGHKDSDEEEISPLKKKSKKILKKIEQTYQTKSIRV WYEGFKVMYIDTNQPIRDKVRLHVLRPAGETVYIWTCVLHNELGEYYWLVESGDFVIIID RHPKPTVMFYELNGNTLPVVNLDKDSMFMVDLETSRHDQLEEIKEELDIYTSTVWVGERK AYYSTYSPPSHNLDLRLMSENKKDYYTVKYVCEVSPNSSEYIVENGDVVKTICQPTKPVI KFGPDHYTADGLYFNVYDIDSIDLIIPHYPEAFDKLKPITAVYGQVSIPAYKDDLLQFYK NGIMGRDDIVFINLLLLKLGRFFGIVSTVYENYLAKFIRINKDYVMKCIHGNKPTFVVMK ISHQFSSVMEYKYNEHQFSLNLSFFSETHIQQRLQQLRALYQDEPIPAWQDGEKLFFILD PDELLEVYAMYPVENQTVIACTYLPMEENEWIVIDGEVVKHVQVGPQRQLQLYKVGAMEV PNLYLPGSSVESNIDFTVKPQVDTTPIPSMVLEDAGLIPFLCLINFLLVLFKM 26
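A reversed decoy entry like the one shown here can be generated by reading the target sequence backwards and tagging the identifiers. The sketch below is one simple way to do this; the function name and the default `REV_` prefix argument are choices made for this illustration (the prefix itself is taken from the slide), and no particular database tool is implied.

```python
def make_reversed_decoy(accession: str, sequence: str,
                        prefix: str = "REV_") -> tuple[str, str]:
    """Return (decoy accession, reversed sequence) for one target protein."""
    return prefix + accession, sequence[::-1]


# The target entry Q4U9M9 ends in ...SLIVGIL, so its decoy starts with LIGVILS...
decoy_accession, decoy_sequence = make_reversed_decoy("Q4U9M9", "SLIVGIL")
print(decoy_accession, decoy_sequence)  # REV_Q4U9M9 LIGVILS
```

Reversal keeps the length and amino acid composition of every protein, so the decoy database has the same size and composition as the target database.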
1) Sequest & TPP, no decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        3176            9771        20627
single hits        1148            -           -
REV proteins       -               -           -
REV single hits    -               -           -
Total: 3176 proteins
[Figure: pie chart "Overall ath 801"; segments 36%, 64%]

2) Sequest & TPP, with decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        2840            8994        18662
single hits        952             -           -
REV proteins       103             104         126
REV single hits    102             -           -
FDR                3.76%           1.17%       0.68%
Total: 2943 proteins
Peptide FDR: 104 / (8994 - 104)
[Figure: pie chart "Overall ath 801"; segments 32%, 3%, 0%, 64%]

Notes: The regular procedure takes only one search engine into account (sometimes even without a decoy database) and uses the TPP for statistical evaluation. The difference between the decoy and non-decoy searches: a different fit of the probability function makes the cutoff slightly more stringent, which results in fewer peptide identifications.
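The FDR row of the second table follows from the formula given on the slide, REV / (fw - REV). The sketch below just applies that formula to the three columns; the function name is made up for this illustration, and other FDR estimators exist that divide differently.

```python
def decoy_fdr(forward_hits: int, reverse_hits: int) -> float:
    """FDR estimate as used on the slide: REV / (fw - REV)."""
    return reverse_hits / (forward_hits - reverse_hits)


# (forward hits, reverse hits) for the decoy search with PeptideProphet > 0.9.
counts = {
    "proteins": (2840, 103),
    "peptides": (8994, 104),
    "MS/MS": (18662, 126),
}
for level, (fw, rev) in counts.items():
    print(f"{level:9s} FDR = {100 * decoy_fdr(fw, rev):.2f}%")
```

Running this reproduces the 3.76%, 1.17% and 0.68% shown in the table.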
Decoy searches - Limitations
- Decoy searches can be applied everywhere, BUT calculating an FDR only makes sense if a large number of proteins is identified (more than ~200)
- If the calculated FDR is very high, there is a good chance that some search parameters are wrong or that some PTMs are not specified
- Reversed databases are favored over scrambled ones
- A low FDR doesn't mean perfect results
Quantitative Proteomics - my critical view
- Is what everybody is looking for
- Is what many people claim to do
- Is definitely the right way to go in the future
- Is absolutely necessary for Systems Biology
- Is essential to really understand the dynamics of the proteome
- Is not really straightforward
Quantitative Proteomics - What is it?
- Find relative changes of protein abundance between 2 similar samples (wild type vs mutant, condition_1 vs condition_2)
- Determine absolute protein concentrations in a sample (to draw conclusions about copy numbers and translation efficiency) -> AQUA peptides
- Find regulatory proteins and elucidate regulatory pathways
Quantitative Proteomics - How can it be achieved?
- Labeling strategies for differential expression (ICAT, iTRAQ, TMT, SILAC --> wet lab)
- Label-free approaches for differential expression (--> software solutions)
- Targeted approaches (SRM, MRM --> mass spec approach)
Quantitative Proteomics (differential expression)
- Label strategy: only ONE run is acquired; sample prep solution (iTRAQ/TMT, ICAT, SILAC)
  -> the problematic part is the sample prep
- Label-free: 2 individual runs are acquired; software solution (Progenesis, SuperHirn)
  -> the problematic parts are the alignment and the run-to-run variation
ICAT
- Labels have different weights
- Quantification is done at the MS1 level
iTRAQ
- All labels have the same weight --> all parent ions are the same
- Quantification is done at the MS/MS level
Beyond Protein Lists and Quantitation - what else?
- Check for over/under-representation of GO terms
- Functional categorization
- Project regulated proteins onto a metabolic pathway map
Principle of Over-representation Analysis - an easy example
The principle
- organism with 1000 genes
- binned in 5 equal categories of 200 genes each
- GO categories 1-5: transcription, translation, energy delivery, nutrient uptake, degradation
The researcher decides to do proteomics (brute force)
- 200 genes are identified --> 1/5th of all
- statistically you would expect to find approx. 40 genes from each category
In fact you find about 100 genes from the GO category energy delivery
---> the category energy delivery is significantly enriched
---> different statistics can be applied
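One of the statistics that can be applied to this example is the hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below uses only the numbers from the slide; the choice of this test and the function name are assumptions of this illustration, since the slide leaves the statistic open.

```python
from math import comb


def hypergeom_upper_tail(universe: int, in_category: int,
                         drawn: int, observed: int) -> float:
    """P(X >= observed) when `drawn` genes are sampled without replacement
    from `universe` genes, of which `in_category` belong to the category."""
    favourable = sum(
        comb(in_category, i) * comb(universe - in_category, drawn - i)
        for i in range(observed, min(in_category, drawn) + 1)
    )
    return favourable / comb(universe, drawn)


universe, in_category, drawn, observed = 1000, 200, 200, 100

expected = drawn * in_category / universe
p_value = hypergeom_upper_tail(universe, in_category, drawn, observed)
print(f"expected by chance: {expected:.0f} genes, observed: {observed}")
print(f"p-value for enrichment: {p_value:.3g}")
```

With 40 genes expected and 100 observed the p-value is vanishingly small, which is what "significantly enriched" means here.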
Principle of ORA - in the case of proteomics
- The number of measured and identified proteins is still far from complete
- Over-representation analysis allows finding pathways or systems which are regulated or involved in a certain context -> but it is important to select the correct background/universe
Principle:
- all genes of an organism are binned in categories
- categories are related to gene function (e.g. Gene Ontology categories)
- compare your identifications to randomly drawn genes
Background problem:
- take as background only those proteins ever identified in this species
- or take all identified proteins as background and those proteins which seem to be regulated as genes of interest (targets), e.g. in an iTRAQ experiment
Tools:
- R package --> topGO
- Web --> GOTreeMachine (bioinfo.vanderbilt.edu/gotm/)
Scenario (from HTP proteomics)
- Arabidopsis thaliana: the model plant ---> ~28 000 genes
- Plant cells in liquid culture (cell culture, CC)
- Grown in sugar-containing solution with weekly subculturing
- One part grown in the dark (cardboard box)
- One part grown in long-day conditions (16 h light)
- Extensive LTQ MS analysis --> 800 LC-MS runs (fractionation & replicates)
- A total of 7983 proteins identified from all samples (~30% of all genes encoded in the genome) --> background
- 6547 from the cell cultures that were kept in the dark
- 6474 from the cell cultures that were illuminated
GO:0006082: Dark vs. Light
[Figure: two GO graphs (root GO:0008150 biological_process) of terms enriched among the proteins from CC_dark. Left: background is the full universe of GO. Right: background is only the proteins identified in CC. The node labels are truncated GO term names such as "GO:0008152 metabolic process", "GO:0009987 cellular process" and "GO:0006412 translation".]
Projection onto Metabolic Pathway Maps
Same data (e.g. MapMan Software (Golm)), Dark vs. Light
[Figure legend: only found in light / found in both / only found in dark]
Q & A 41
Hands on: your turn now, feel free to ask
Scaffold hands-on - Example One
- Load your own data with Scaffold before we continue
- Also use X!Tandem to search
- Have a look at the results
- Is it valid to calculate an FDR?
- How high is your FDR?
More from Scaffold Q+: hands-on with iTRAQ data
Scenario: Mouse data
- Liver tissue iTRAQ data (Swiss mouse: standard diet vs high fat diet)
- Mouse decoy database search with Mascot -> dat-files
- Labels: 116 -> high fat diet; 114, 115, 117 -> standard diet
- Check reproducibility (standard diet vs standard diet)
- Find proteins which are regulated in high fat diet / standard diet
Task with Scaffold Q+
- How consistent are the peptides of the same protein?
- Find confident thresholds for proteins being over/under-expressed
- Which proteins in this example do you consider over/under-expressed?
- Can you make sense out of these proteins?
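One possible way to find such thresholds is to use the standard diet vs standard diet ratios as an empirical null distribution and call a protein regulated only if its high fat / standard log2 ratio falls outside mean +/- 2 standard deviations of that null. The sketch below illustrates this idea only; the cutoff of 2 SD, the function names and the numbers are all hypothetical, and this is not how Scaffold Q+ computes its statistics.

```python
import math
import statistics


def log2_ratio(intensity_a: float, intensity_b: float) -> float:
    """log2 of the reporter-ion intensity ratio a/b."""
    return math.log2(intensity_a / intensity_b)


def null_thresholds(null_log2_ratios: list[float],
                    n_sd: float = 2.0) -> tuple[float, float]:
    """Lower/upper cutoffs from control-vs-control ratios (assumed rule: mean +/- n_sd * SD)."""
    mean = statistics.mean(null_log2_ratios)
    sd = statistics.stdev(null_log2_ratios)
    return mean - n_sd * sd, mean + n_sd * sd


def classify(value: float, lower: float, upper: float) -> str:
    """Label a log2 ratio relative to the thresholds."""
    if value > upper:
        return "up"
    if value < lower:
        return "down"
    return "unchanged"


# Hypothetical st/st log2 ratios (e.g. 115/114) for three proteins.
lower, upper = null_thresholds([-0.1, 0.0, 0.1])
print(lower, upper)
# Hypothetical high fat / standard reporter intensities (116 vs 114).
print(classify(log2_ratio(2000.0, 1000.0), lower, upper))  # up
print(classify(log2_ratio(1030.0, 1000.0), lower, upper))  # unchanged
```

The wider the st/st distribution, the larger a high fat / standard change has to be before it counts as regulation.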
What should come out: only 2 quant categories
[Figure: "Histogram 2 Categories Liver Ex4"; x-axis: log2(ratio), from -1.4 to 1.4; y-axis: Frequency, from 0 to 300; series: StDiet/StDiet, HighFatDiet/StDiet]
What should come out: 4 quant categories
[Figure: "Histogram 4 Categories Liver Ex4"; x-axis: log2(ratio), from -2 to 2; y-axis: Frequency, from 0 to 400; series: ratio_2 (st/st), ratio_3 (high fat / st), ratio_4 (st/st)]
Regulated Proteins: The List
2 ways of making sense out of this data:
- 4 categories: 48 regulated proteins
- 2 categories: 44 regulated proteins
Take the intersection of those 2 lists (should be most confident): 37 proteins
Make sense out of Lists: this does make sense!! 50
Paint it on Reactome-maps 51
Scaffold Similarity Window
- Review and control the peptide/protein mapping
- View protein groups in which peptides are shared
- Check or uncheck the valid box for a peptide sequence
- Peptides identified in particular protein groups are color coded