Pep-Miner: A Novel Technology for Mass Spectrometry-Based Proteomics
Ilan Beer
Haifa Research Lab
Dec 10, 2002

Pep-Miner's Location in the Life Sciences World
- The post-genome era - the age of the proteome begins
  - Proteins (rather than genes) are the real players in biology
- Fundamental activities
  - Identifying/cataloguing proteins in various tissues
  - Comparing cells in different conditions (e.g. healthy/diseased)
  - Major goals: drug discovery, diagnostics
- Mass spectrometry: leading proteomics technology
  - Rapidly replacing older technologies (gel, Edman, ...)
  - Well-defined interface between wet biology and computing

Location (cont.)
- Pep-Miner: novel proteome analysis technology
  - High-throughput algorithms for processing mass spec data
  - Considerable data size reduction
  - Improved analysis quality
    - Better peptide identification and peptide mixture comparison

Mass Spectrometry Basics
1. Proteins are extracted from cells and digested into peptides, or peptides are extracted directly from cells. [Figure label: Peptide mixture]
2. Peptides elute from the HPLC column ordered by hydrophobicity; the mixture becomes simpler. [Figure label: HPLC]
3. The mixture is sampled every few seconds by the mass spectrometer; peptide masses are shown as an MS spectrum. [Figure: MS spectrum, m/z axis 300 to 2000]
4. Peptides of selected masses are fragmented into sub-peptides; fragment masses are shown as an MS/MS spectrum. [Figure: MS/MS spectrum, m/z axis 300 to 1000]
5. Peptides are identified based on their MS and MS/MS spectra. [Figure label: Identification program]

Major Problem: Huge Amounts of Data
- A mass spectrometer produces a lot of data
  - More than 1000 MS and 1000 MS/MS spectra/hour
  - 10-100 MB/hour
- One day in a big pharma/biotech with 100 mass specs
  - 100 x 24 x 1000 = 2.4 M MS/MS spectra to analyze
  - 100 x 24 x 100 MB = 240 GB to store
- Problems
  - Long analysis time: seconds to minutes per spectrum
  - Enormous computing power is needed
  - Limited long-term storage capability of valuable data
  - Data is hard to manage

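The daily figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the slide's own numbers (100 instruments, 1000 MS/MS spectra and up to 100 MB per instrument-hour). The per-spectrum analysis times of 1 s and 60 s are illustrative assumptions chosen to bracket "seconds to minutes", and the function names are hypothetical.

```python
# Figures from the slide: 100 instruments running 24 h/day, each producing
# 1000 MS/MS spectra and (at the upper end) 100 MB per hour.
INSTRUMENTS = 100
HOURS_PER_DAY = 24
MSMS_SPECTRA_PER_HOUR = 1000
MB_PER_HOUR = 100


def daily_load(instruments=INSTRUMENTS, hours=HOURS_PER_DAY,
               spectra_per_hour=MSMS_SPECTRA_PER_HOUR, mb_per_hour=MB_PER_HOUR):
    """Return (MS/MS spectra per day, storage per day in GB)."""
    spectra = instruments * hours * spectra_per_hour
    storage_gb = instruments * hours * mb_per_hour / 1000  # 1 GB = 1000 MB here
    return spectra, storage_gb


def cpu_days(spectra, seconds_per_spectrum):
    """CPU-days needed to analyze `spectra` at a fixed time per spectrum."""
    return spectra * seconds_per_spectrum / 86_400


spectra, storage_gb = daily_load()
print(f"{spectra:,} MS/MS spectra/day, {storage_gb:.0f} GB/day")
# Assumed analysis times: 1 s and 60 s per spectrum.
for seconds in (1, 60):
    print(f"{seconds:>2} s/spectrum: {cpu_days(spectra, seconds):,.1f} CPU-days per day of data")
```

Under these assumptions, one day of acquisition costs tens to over a thousand CPU-days of analysis, which is why analyzing every raw spectrum does not scale.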
More Problems
- Most peptides remain unidentified
  - Mass spectrometers have limited accuracy, sensitivity, resolution, mass range
  - Chemical/biological contamination
  - Unknown fragmentation rules (intensity)
  - Incomplete protein/gene libraries
  - Modifications and mutations
- Identification-based mixture comparison is rendered meaningless
- Pep-Miner technology addresses these problems

Pep-Miner: A Novel Proteome Analysis Technology
- Clustering of MS/MS data
  - All similar spectra in all mass spec runs across a project are grouped into a cluster
  - A cluster represents one peptide
  - A representative spectrum is computed for each cluster (average), replacing the raw spectra
- Sources of multiplicity
  - A peptide may be fragmented more than once in a run
  - Repeated runs of the same material
  - Many proteins, and hence peptides, are common to the various cell types and the various conditions

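To make the clustering idea concrete, here is a minimal sketch, not Pep-Miner's actual algorithm. The slides do not state the similarity measure, so this example assumes cosine similarity between spectra binned on the m/z axis, a greedy one-pass assignment, and a representative computed as the renormalized average of the members. The bin width, threshold, and all function names are hypothetical.

```python
import math
from collections import defaultdict

BIN_WIDTH = 1.0       # assumed m/z bin width
SIM_THRESHOLD = 0.8   # assumed cosine-similarity cutoff for joining a cluster


def normalize(vec):
    """Scale a sparse vector (bin -> intensity) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def bin_spectrum(peaks, bin_width=BIN_WIDTH):
    """Convert (m/z, intensity) peaks into a unit-length sparse vector."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return normalize(binned)


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def representative(members):
    """Average the member vectors bin by bin, then renormalize."""
    total = defaultdict(float)
    for vec in members:
        for k, v in vec.items():
            total[k] += v
    return normalize({k: v / len(members) for k, v in total.items()})


def cluster_spectra(spectra, threshold=SIM_THRESHOLD):
    """Greedily assign each spectrum to its most similar cluster, if any."""
    clusters = []
    for peaks in spectra:
        vec = bin_spectrum(peaks)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["rep"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [vec], "rep": vec})
        else:
            best["members"].append(vec)
            best["rep"] = representative(best["members"])
    return clusters


# Made-up peak lists: the first two resemble each other, the third does not.
spectra = [
    [(500.2, 10.0), (650.4, 5.0), (800.1, 8.0)],
    [(500.3, 9.0), (650.5, 6.0), (800.2, 7.0)],
    [(420.0, 10.0), (710.0, 4.0)],
]
clusters = cluster_spectra(spectra)
print([len(c["members"]) for c in clusters])
```

A production system would likely also compare precursor masses and retention times before comparing fragment spectra; those checks are omitted here to keep the sketch short.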
Cluster Example
[Figure: four MS/MS spectra, m/z axis 500 to 1000. Panels: A. Member 1 (score 74); B. Member 2 (score 59); C. Member 3 (score 67); D. Combined (score 88)]
- Panels A-C: three examples out of 52 MS/MS spectra that have been grouped into one cluster
- Panel D: combined spectrum of the cluster

Work Flow
1. Read mass spectrometer files
2. Find clusters
3. Compute cluster representative spectra (reps); delete raw spectra
4. Store cluster information and reps in database
5. Send reps to identification and store results in database
6. Show analysis results (database view) to user
7. Read more mass spectrometer files, possibly as soon as they are created
8. Try distributing new spectra among existing clusters; create new clusters for the rest
9. Compute reps
10. Send only new reps to identification
11. Show updated database view to user

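The incremental part of this workflow (distribute new spectra among existing clusters, create clusters for the rest, identify only the new reps) might be sketched as follows. This is an illustration under stated assumptions, not the Pep-Miner implementation: spectra are assumed to arrive already binned as dictionaries mapping an m/z bin to an intensity, similarity is assumed to be cosine with a 0.8 cutoff, the database is replaced by an in-memory list, and `identify` is a hypothetical callback standing in for the identification program.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse spectra (bin -> intensity)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Project:
    """In-memory stand-in for the project database of clusters and reps."""

    def __init__(self, threshold=0.8):  # assumed similarity cutoff
        self.threshold = threshold
        self.clusters = []  # each: {"sum": dict, "count": int, "identification": ...}

    def rep(self, cluster_id):
        """Representative spectrum: bin-wise average of the cluster members."""
        cluster = self.clusters[cluster_id]
        return {k: v / cluster["count"] for k, v in cluster["sum"].items()}

    def add_run(self, spectra):
        """Assign spectra to clusters; return ids of clusters created by this run."""
        new_ids = []
        for spec in spectra:
            best_id, best_sim = None, self.threshold
            for cluster_id in range(len(self.clusters)):
                sim = cosine(spec, self.rep(cluster_id))
                if sim >= best_sim:
                    best_id, best_sim = cluster_id, sim
            if best_id is None:
                self.clusters.append(
                    {"sum": dict(spec), "count": 1, "identification": None})
                new_ids.append(len(self.clusters) - 1)
            else:
                # Only a running sum and a count are kept; the raw spectrum is dropped.
                cluster = self.clusters[best_id]
                for k, v in spec.items():
                    cluster["sum"][k] = cluster["sum"].get(k, 0.0) + v
                cluster["count"] += 1
        return new_ids

    def identify_new(self, new_ids, identify):
        """Send only the reps of newly created clusters to identification."""
        for cluster_id in new_ids:
            self.clusters[cluster_id]["identification"] = identify(self.rep(cluster_id))


project = Project()
first = project.add_run([{500: 10.0, 650: 5.0}, {500: 9.0, 650: 6.0}, {420: 7.0}])
second = project.add_run([{500: 10.0, 650: 5.5}, {300: 4.0, 910: 2.0}])
sent = []
project.identify_new(second, lambda rep: sent.append(rep) or "unidentified")
print(first, second, len(sent))
```

Keeping a running sum and count per cluster lets the rep be updated without retaining raw spectra, which matches the "delete raw spectra" step. In the second run only one of the two spectra triggers an identification call, because the other is absorbed by an existing cluster.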
Major Benefits
- Enormous saving in time and space
  - Data size reduction (x40-x100)
    - Allows storing larger amounts of valuable data for future analysis
  - Reduced analysis time (x40-x100)
    - Only cluster reps are analyzed
    - Much of the new data is categorized rather than analyzed => saving increases as the project progresses
- Improved data quality
  - Reps have better signal/noise ratio and accuracy than raw spectra
    - Increased identification yield
    - More accurate data -> faster analysis

More Benefits
- Peptide mixture comparison rendered meaningful
  - No need for prior identification
  - Rely on spectrum similarity only
  - Further analysis can focus on interesting differences only
- Data management by humans becomes easy
  - Whole project in one view
  - 10^3 clusters to inspect rather than 10^5 raw spectra
- Knowledge acquisition and exploitation
  - Biological knowledge is accumulated in a database
  - Analysis quality improves as the project progresses: the tool learns from the analyzed data (e.g. retention time prediction and sequence motifs)

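One way to picture identification-free mixture comparison: once every spectrum has a cluster id, two conditions can be compared by which clusters they populate. The sketch below assumes a simple presence/absence comparison; the slides say only that the comparison relies on spectrum similarity, so the criterion, the function name, and the sample data are all hypothetical.

```python
def compare_mixtures(assignments, cond_a, cond_b):
    """Compare two conditions by cluster membership alone.

    assignments: iterable of (cluster_id, condition) pairs, one per spectrum.
    Returns the cluster ids seen only in cond_a, only in cond_b, and in both.
    """
    in_a, in_b = set(), set()
    for cluster_id, condition in assignments:
        if condition == cond_a:
            in_a.add(cluster_id)
        elif condition == cond_b:
            in_b.add(cluster_id)
    return {
        "only_" + cond_a: sorted(in_a - in_b),
        "only_" + cond_b: sorted(in_b - in_a),
        "shared": sorted(in_a & in_b),
    }


# Made-up assignments: cluster 0 appears in both conditions,
# cluster 1 only in healthy cells, cluster 2 only in diseased cells.
assignments = [
    (0, "healthy"), (0, "diseased"),
    (1, "healthy"),
    (2, "diseased"), (2, "diseased"),
]
result = compare_mixtures(assignments, "healthy", "diseased")
print(result)
```

Clusters that appear in only one condition are candidates for the "interesting differences" that further analysis can focus on; no peptide sequence is needed to find them.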
Publications
- Barnea E, Beer I, Patoka R, Ziv T, Kessler O, Tzehoval E, Eisenbach L, Zavazava N, Admon A. Analysis of endogenous peptides bound by soluble MHC class I molecules: a novel approach for identifying tumor-specific antigens. European Journal of Immunology, January 2002.
- Beer I, Barnea E, Ziv T, Admon A. Application of Liquid Chromatography Tandem Mass-Spectrometry Data Clustering to Peptide Analysis. Submitted.