Introduction to Proteomics

Why Proteomics?
Same genome, different proteome
[Figure: Black Swallowtail - larvae and butterfly]

Biological Complexity
Yeast - a simple proteome
  6,113 proteins = 344,855 tryptic peptides
  is proteomics the most significant analytical challenge?
Qualitative analysis by LC/MS/MS
  need at least one peptide with at least 8 amino acids
Quantitative analysis by LC/MS
  need at least one peptide pair from a differential expression study
In theory, complete coverage of the yeast proteome at the protein level requires the analysis of only 6,113 peptides
In reality, one needs multiple peptides from each protein
Replicates needed for statistics at each time point and biological condition
  4X+ for cell lines and inbred animals, 15-100+ for human samples
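The peptide counts above come from cutting every protein at its tryptic cleavage sites. As a minimal sketch, assuming the common rule that trypsin cleaves after K or R unless the next residue is P, and using a made-up toy sequence (not a real yeast protein), the digest and the "at least 8 amino acids" filter could look like this:

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Assumption: complete digestion with no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide that does not end in K or R
        peptides.append(sequence[start:])
    return peptides


def identifiable(peptides, min_length=8):
    """Keep peptides long enough for identification (8 residues per the slide)."""
    return [p for p in peptides if len(p) >= min_length]


# Hypothetical toy sequence, chosen only to exercise the rules.
toy_protein = "AAAKPGGGRCCCCCCCCKDD"
peptides = tryptic_digest(toy_protein)
print(peptides)
print(identifiable(peptides))
```

Here the K followed by P is not cleaved, so the toy protein yields three peptides, two of which pass the length filter.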

Biologically Relevant Information - key information required
Changes as a result of biological challenge (disease, drug)
  expression level change?
  change in post-translational modification(s)?
  pattern of changes?
Biological function(s) of identified proteins
  protein complex component?
  role in signaling/regulatory pathway?
  sub-cellular localization?
  multiple functions (different compartments, splice variants)
Multiple types of information required to understand biological function
  integration of genomics and proteomics

Types of Proteomics
Interaction Proteomics
  protein-protein associations
Expression Proteomics
  protein quantification

Basics of Protein Identification
  same for proteomics as other protein MS studies
  fundamental difference is scale of analysis
[Figure: protein identification workflow. Labels: GENE PRODUCT I.D., PROTEOME, PROTEIN, DIGEST, MALDI MS, MALDI MS/MS, HPLC, ESI MS/MS, db, GENOME, EST]

High Throughput Proteomics
Success in proteomics is the analysis of all of the biologically relevant proteins
  obtaining high proteome coverage is more important than high throughput
  samples represent months of prior work by collaborators
  proteomic results will lead to months of work by collaborators
Need detailed analysis, not just high throughput
Obtaining high quality data is the key to success in proteomics
  High Information Content

What is High Information Content in Proteomics?
High sample throughput
  number of samples replicated
    meaningful analytical statistics
  number of biological samples/group
    meaningful biological statistics
High proteome coverage
  number of proteins identified
High protein coverage
  number of peptides identified
  PTMs, splice variants
High information content
  improved biological knowledge
  improved impact of proteomics on biology

Interaction Proteomics - protein interaction networks
Characterization of protein interactions in order to understand the function of individual proteins, protein complexes, and protein interaction networks
[Figure: Multi-subunit Complexes. Nuclear receptor, - ligand (repressed) and + ligand (activated). Labels: Sin3, NCoR/SMRT, HDAC-1, p/cip, CBP/p300, NCoA, P/CAF]
[Figure: Biochemical Pathway. Labels: Gβ, Gγ, STE20, STE11, STE7, FUS3, KSS1. Homologue labels: PAK, MAPKKK, MAPKK, MAPK]

Interaction Proteomics Using Mass Spectrometry
Endogenous or exogenous bait protein
  plasmid expressing epitope-tagged bait protein
Isolate protein complex (bait)
Inspect on 1-D gel
In-gel or solution protease digest
Analyze components by LC/MS/MS
Database search: match peptide sequences to protein sequences
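The final database-search step can be illustrated with a deliberately simplified sketch. It assumes peptide sequences have already been derived from the MS/MS spectra and matches them to protein sequences by exact substring lookup. Real search engines score spectra against candidate peptides, so this is only an illustration; all protein names and sequences below are hypothetical.

```python
def match_peptides(peptides, proteins):
    """Map each protein name to the sequenced peptides found within it.

    peptides: list of peptide sequences (assumed already read from MS/MS data)
    proteins: dict of protein name -> protein sequence
    """
    hits = {}
    for name, sequence in proteins.items():
        found = [p for p in peptides if p in sequence]
        if found:
            hits[name] = found
    return hits


# Hypothetical toy database and peptide list.
proteins = {
    "bait": "MKTAYIAKQRQISFVK",
    "prey": "MSHFSRQLEERLGLIEVQ",
    "background": "MAAAGGGK",
}
peptides = ["TAYIAK", "QLEER", "QISFVK"]

for name, found in match_peptides(peptides, proteins).items():
    print(name, found)
```

Proteins with no matching peptide are left out of the result, which mirrors the idea that only components supported by peptide evidence are reported for the isolated complex.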

Interaction Proteomics - improving information content via double-tagging strategy
  minimize non-specific protein binding
[Figure: tagged construct. Labels: TEV protease cleavage site, HIS tag, TARGET, HA tag]
[Figure: workflow. Cell expression, 1st affinity column (nickel agarose beads), TEV protease cleavage, 2nd affinity column (coupled HA11 beads), elution]

Systematic Identification of Protein Complexes in Saccharomyces cerevisiae by Mass Spectrometry
Nature, 415, 180, 2002
[Figure labels: Kss1 = MAP kinase; Cdc28 complex]
Blue arrows = known interactions
Red arrows = new interactions

Functional organization of the yeast proteome by systematic analysis of protein complexes
Nature, 415, 141, 2002
Figure 4: The protein complex network, and grouping of connected complexes. Links were established between complexes sharing at least one protein.
Cellular roles:
  red - cell cycle
  dark green - signalling
  dark blue - transcription, DNA maintenance, chromatin structure
  pink - protein and RNA transport
  orange - RNA metabolism
  light green - protein synthesis and turnover
  brown - cell polarity and structure
  violet - intermediate and energy metabolism
  light blue - membrane biogenesis and traffic
The lower panel is an example of a complex linked to two other complexes by shared components. It illustrates the connection between the protein and complex levels of organization.
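The linking rule in the figure caption (two complexes are connected when they share at least one protein) is easy to express in code. The sketch below is not the published analysis; the complex and protein names are hypothetical placeholders.

```python
from itertools import combinations


def link_complexes(complexes):
    """Return {(complex_a, complex_b): shared_proteins} for every pair of
    complexes that share at least one protein.

    complexes: dict of complex name -> set of protein names
    """
    links = {}
    for a, b in combinations(sorted(complexes), 2):
        shared = complexes[a] & complexes[b]
        if shared:
            links[(a, b)] = shared
    return links


# Hypothetical complexes: B is linked to both A and C by shared components.
complexes = {
    "complex_A": {"P1", "P2", "P3"},
    "complex_B": {"P3", "P4"},
    "complex_C": {"P4", "P5"},
    "complex_D": {"P6"},
}

for pair, shared in link_complexes(complexes).items():
    print(pair, sorted(shared))
```

In this toy network, complex_B plays the role described for the lower panel: one complex linked to two others through shared components, while complex_D stays unconnected.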

Expression Proteomics - 2D gels and mass spectrometry platform
1D- or 2D-gel
  Separation & Quantitation
  Automated Spot Cutting
  Automated In-gel Digestion
  spots 1, 2, 3 ... 550 (...and so on, maybe 500 times)
MALDI/MS
  peptide mass only (peptide mass fingerprint)
  Database search
LC/MS/MS
  peptide mass and sequence information
  Database search
1 spot or band comprises many proteins
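A peptide mass fingerprint search compares the measured peptide masses from one spot with the theoretical peptide masses of each database protein. The following is a rough sketch under several assumptions: observed values are neutral monoisotopic masses, peptides are unmodified, the score is a simple match count, and the tolerance of 50 ppm is an arbitrary illustrative choice. The two-protein database is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed, peptides, tol_ppm=50.0):
    """Number of observed masses within tol_ppm of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in peptides]
    return sum(
        any(abs(m - t) / t * 1e6 <= tol_ppm for t in theoretical)
        for m in observed
    )


def best_protein(observed, database, tol_ppm=50.0):
    """Name of the database entry whose peptides explain the most masses."""
    return max(database, key=lambda n: count_matches(observed, database[n], tol_ppm))


# Hypothetical database: protein name -> tryptic peptides.
database = {"protA": ["GASPK", "VTCLR"], "protB": ["NDQEK", "MHFYR"]}
# Simulated measurement of protA with a 10 ppm error on the first peptide.
observed = [peptide_mass("GASPK") * (1 + 10e-6), peptide_mass("VTCLR")]
print(best_protein(observed, database))
```

Because only masses are compared, a spot containing several proteins produces an ambiguous fingerprint, which is one reason the LC/MS/MS route with sequence information is listed alongside it.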

2D Gel Electrophoresis
Most powerful analytical method for protein separation
  multi-dimensional separation technique
    IEF and PAGE
  very sensitive detection schemes
    SyproRuby fluorescence staining
  rugged quantitation
  useful for visualization of PTM changes
Sometimes described as low throughput
  not true - can be higher throughput than non-gel based proteomics
    parallel processing of gels
    analysis of only proteins undergoing expression change
Challenges with 2D gels
  specialized gels required for extreme proteins
    pI, MW, hydrophobicity

Expression Proteomics
Differential protein expression as a function of gene knockout
[Figure: 2D gels, wild type vs. knockout. Labels: missing spot, down-regulated spot]

Improved Information Content - reproducibility & meaningful biological statistics
[Figure: 2D PAGE, pH 4 to pH 7]
Improved sample throughput by running gel samples in parallel
  only analyze biologically relevant proteins undergoing expression change

Improved Sample Throughput - automated spot cutting
  cuts blots, wet and dry gels
  visible and fluorescent
[Figure: spot cutter. Labels: CCD camera, cutting head, barcode reader, LED (excitation 480 nm, emission 520 nm LP), gel, cutting head cleaner]

Expression Proteomics - industrialized (GeneProt, Geneva)
40+ ion trap MS/MS systems with multiplexed LCs
  2 LCs per MS for improved throughput
1,462 processor server
GeneProt's approach
  two serum/plasma samples, 5 liters each
  60,000 LC fractions analyzed by MS
  320,000 2D gel spots analyzed by MS
  50+ mass spectrometers
  1,462 processor server
  6 months

Expression Proteomics - developing methods
Isotope coding
  qualitative and quantitative analysis by MS and MS/MS
  different isotope codes
    isotope coded affinity tags (cys labeling)
    18O tagging via digestion in 18O water
    CH3 and CD3 ester formation
Direct LC/MS
  peptide mass mapping
  ion mapping
  AMRTs
Graphic from M. Mann editorial, Quantitative Proteomics, Nature Biotechnology, 17, 1999.
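With isotope coding, each peptide appears in the MS spectrum as a light/heavy pair, and the intensity ratio of the pair reports relative abundance between the two samples. As a minimal sketch, assuming the pairs have already been detected and that a protein-level value is taken as the median of its peptide log2 ratios (one common choice, not necessarily the one used in the work presented here):

```python
import math
from statistics import median


def protein_log2_ratio(pairs):
    """Median log2(heavy / light) over the peptide pairs of one protein.

    pairs: list of (light_intensity, heavy_intensity) tuples.
    Pairs with a non-positive intensity are skipped.
    """
    ratios = [math.log2(heavy / light) for light, heavy in pairs
              if light > 0 and heavy > 0]
    if not ratios:
        raise ValueError("no usable light/heavy pairs")
    return median(ratios)


# Hypothetical intensities for three peptide pairs from one protein.
pairs = [(100.0, 200.0), (50.0, 100.0), (10.0, 40.0)]
print(protein_log2_ratio(pairs))
```

Using the median rather than the mean keeps a single outlying pair, such as the third one here, from dominating the protein-level estimate.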

MALDI/MS Data

Improved Proteome Coverage - use of multiple analysis methods
How can analytical coverage of the proteome be improved?
Use both non-gel based and gel-based approaches, which are complementary
  Non-gel based: shotgun proteomics
    total digest of complex protein mixture
    multidimensional chromatography and tandem mass spectrometry (LC/LC/MS/MS)
Use both ESI/MS/MS and MALDI/MS/MS

Improved Proteome Coverage - use of multiple analysis methods - shotgun proteomics and 2D gels
Shotgun proteomics (LC/LC/MS/MS)
  minimal bias against specific protein classes
    total proteolytic digestion yields tractable peptides from
      very large and very small proteins
      acidic and basic proteins (typical 2D gel pI 4-7)
      hydrophobic proteins (e.g. membrane proteins)
  qualitative information contained in MS/MS spectrum
  quantitative information contained in MS spectrum
    significantly facilitated with use of isotope coded tags
2D gel may provide higher throughput
  batch processing of samples in parallel
  analysis of only those spots changing in expression

Improved Proteome Coverage - shotgun proteomics using LC/LC/MS/MS - analysis of nuclear fraction of cancer cell line
[Figure: workflow. Tissue, tissue fractions (nuclear fraction), LC 1 fractions (IEX, 40 fractions), LC 2 fractions (RP, 3 hr. gradient), MS 1, MS 2, database search; results correlation from LC 2 results to LC 1 results to tissue fraction results to tissue knowledge]
60,000+ peptides analyzed by MS and MS/MS
1,500 peptides per fraction
60-400 proteins per fraction
1,898 proteins identified in nuclear extract - 120 hour MS/MS acquisition
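The totals on this slide are consistent with simple multiplication of the per-fraction figures: 40 IEX fractions, each run on a 3 hour RP gradient, with about 1,500 peptides per fraction. The slide does not state that the totals were derived this way, so the sketch below is only a consistency check; the function name is a hypothetical helper.

```python
def lc_lc_budget(n_fractions, gradient_hours, peptides_per_fraction):
    """Return (total acquisition hours, total peptides analyzed) for a
    two-dimensional LC run where every first-dimension fraction gets its
    own second-dimension gradient."""
    total_hours = n_fractions * gradient_hours
    total_peptides = n_fractions * peptides_per_fraction
    return total_hours, total_peptides


hours, peptides = lc_lc_budget(n_fractions=40, gradient_hours=3,
                               peptides_per_fraction=1500)
print(hours, "hours of MS/MS acquisition")
print(peptides, "peptides analyzed")
```

The same calculation makes the cost of deeper fractionation explicit: doubling the number of first-dimension fractions doubles instrument time for a single sample.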

Improved Proteome Coverage - multiple ionization sources for LC/MS/MS
Post LC column split between ESI and MALDI
Option 1 - both systems in automated MS/MS mode
  more comprehensive direct analysis
  better proteome coverage
Option 2 - targeted analysis
  initial ESI/MS/MS analysis
  process and interpret data
  follow-up MALDI/MS/MS analysis
    only sample peptides not interrogated by ESI/MS/MS

Improved Proteome Coverage - Probot LC/MALDI interface
  post-column split to on-line ESI/MS/MS and MALDI plates
[Figure: Labels: Nanoscale LC/LC, Splitter 1 (80% / 20%), ESI/MS/MS, MALDI plate spotting, MALDI/MS/MS]

Improved Proteome Coverage - use of multiple ionization sources for MS/MS
  51 mitochondrial ribosomal proteins
LC/ESI/MS/MS: 78% identified; 8 unique (16%)
LC/MALDI/MS/MS: 84% identified; 11 unique (22%)
Both: 32
Significant overlap between the two datasets (63%)
Significant additional information obtained (37%)
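The percentages on this slide follow from the three counts (8 found only by ESI, 32 found by both, 11 found only by MALDI) out of 51 proteins. A short sketch with placeholder protein identifiers reproduces them with set operations; the identifiers are hypothetical and only the counts come from the slide.

```python
TOTAL = 51  # mitochondrial ribosomal proteins considered

# Placeholder identifiers; only the set sizes are taken from the slide.
esi_only = {f"esi_{i}" for i in range(8)}
shared = {f"both_{i}" for i in range(32)}
maldi_only = {f"maldi_{i}" for i in range(11)}

esi = esi_only | shared      # everything identified by LC/ESI/MS/MS
maldi = maldi_only | shared  # everything identified by LC/MALDI/MS/MS


def pct(count, total=TOTAL):
    """Percentage of the total, rounded to the nearest whole number."""
    return round(100 * count / total)


summary = {
    "ESI identified": pct(len(esi)),
    "MALDI identified": pct(len(maldi)),
    "ESI unique": pct(len(esi - maldi)),
    "MALDI unique": pct(len(maldi - esi)),
    "overlap": pct(len(esi & maldi)),
    "additional": pct(len(esi ^ maldi)),
}
for label, value in summary.items():
    print(f"{label}: {value}%")
```

The "additional information" figure is the symmetric difference of the two sets: the 19 proteins that would have been missed by relying on the other ionization source alone.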

Conclusions
Proteomics is a much bigger challenge than most people realize
  easy to get high level proteins, essentially impossible to get complete coverage of complex mixtures with current technologies
  genes were easy
Multiple integrated approaches are needed to provide high information content to biological studies
  genomic and proteomic methods
  multiple proteomic methods
    gel-based and non-gel based studies
    sample depletion and fractionation methods needed
    multiple analytical methods for protein ID
This integrated approach yields a large amount of data
  true success requires integration of different datasets
  requires significant bioinformatic resources
Final step
  transforming data into biological knowledge