Isobaric Tag based MS Quantification Algorithms Analysis and Implementation

Transcription

1 Isobaric Tag based MS Quantification Algorithms Analysis and Implementation Master s degree in Proteomics and Bioinformatics Written by Sankar Martial Supervisors: Nicolas Budin 1, Pierre-Alain Binz 1 Academic year 2007/2008 This thesis was submitted as part of the requirements for the Master s degree in Proteomics and Bioinformatics from the University of Geneva. 1 Geneva Bioinformatics (GeneBio) SA 25 avenue de Champel 1206 Geneva Switzerland 0

2 Contents Abstract... 2 Acknowledgements Introduction GeneBio SA Organisation Biological Context Quantitative Proteomics Global View Isobaric Tagging Experimental Application Experimental Design: Principle of Replicate Analysis Methods: Study Quantification Workflow Introduction Experimental samples Tested Software Presentation Software comparison Discussion Establish Quantification WF Introduction Description Validation of the algorithms Discussion Application of the Quantification Workflows Alireza collaboration: Peptides Ratios-based Quantification Approach applied to Characterize Daptomycin Resistance in Staphylococcus aureus Loic Dayon Collaboration CSF Analysis by TMT 6-plex CSF micro-dialysis Conclusion Reference

3 Abstract One single gene gives rise to several proteins. This well-known sentence illustrates all of the complexity of the proteome compared to the genome and the transcriptome. As does genomics, proteomics provides a large toolbox of experimental methods to achieve the quantification of proteins. In contrast, the analytical means to obtain a reliable and trusted value of protein relative abundances are less developed. Although more and more tools are released to assess protein ratios, a quick review of proteomics papers reveals that quantitative data analysis is still performed manually. For this purpose, three algorithms for isobaric tag-based quantitative analysis were implemented. They were successfully applied on experimental datasets provided by the Biomedical Proteomics Research Group (BPRG) and the Clinical Proteomics Research Group (CPRG) of the University of Geneva. Finally, these algorithms were implemented in the quantification module of the Phenyx software. 2

4 Acknowledgements It was an immense privilege to be guided and to be under the tutelage of my supervisors Nicolas Budin and Pierre-Alain Binz. I am very grateful to Alexandre Masselot for having given me the opportunity to do this training course at GeneBio, for his availability and his advice. I would like to thank Nasri Nahas for having permitted me to carry out this training under the optimal conditions. Many thanks are due to Olivier Evalet, Yann Mauron, Roman Mylonas and Ivan Topolsky for their support and their good mood. I am very thankful to Alireza Vaezzadeh and Loic Dayon for their collaboration, I thank very much David Bouyssié who did an essential previous work on the reporter ion peaks extraction. Finally, I would like to thank all the members of GeneBio for having welcoming me during one year. 3

5 1. Introduction 1.1. GeneBio SA Geneva Bioinformatics (GeneBio) SA is a bioinformatics company founded in November It was created quasi simultaneously with the Swiss Institute of Bioinformatics (SIB). One of the main activities of GeneBio is to act as the privileged commercial arm of the SIB, and therefore bring to market developments done at the SIB in order to provide back revenues to help further developments. Its first product line started in 1998 with the Swiss-Prot database. Swiss-2DPAGE (a 2D gel database), Prosite (a database of protein domains, families and functional sites) and Melanie (a 2D gel analysis software) soon followed. GeneBio also develops and commercialises proper specialized and innovative databases and software on biological molecules. These include Phenyx, a renowned software platform for the identification and characterization of proteins and peptides from mass spectrometry data. Another example is SmileMS, the latest GeneBio software and also developed in collaboration with the SIB. It is a unique platform for the identification and analysis of small molecules by mass spectrometry. Located in Geneva, a centre of excellence in the field of proteomics, GeneBio now has between 15 and 20 employees, including a majority of biologists and computer scientists Organisation My training course was achieved within the Phenyx development team. It's a bioinformatics training involving a binomial supervision. Dr Pierre-Alain-Binz has taken the responsibility of the scientific aspects of my project, giving me advice and orientation in proteomics and data analysis. Dr Nicolas Budin directed the informatics part of the project. He has managed the whole development side of the project, initiating me into R language, Java language, and to reliable methods of software development. The mass-spectrometric-based quantification universe is wide. From the beginning, it was decided that only itraq quantification would be covered in my work. To deepen my knowledge of the master courses about Mass-Spectrometric-based quantification in proteomics, I started the training by reading papers on itraq reagents and principles of quantification. Later, in order to familiarize with MS-identification and quantification, I focused on testing some of the available quantification tools. These steps permitted me to handle biological data, to see and understand the basis of large amounts of data analysis and to feel the span of the perspectives that the mass-spectrometric-based quantitative analysis offers. Subsequently, I have developed my own tools. Using the R Language, I have replicated the workflow of some of the studied software. Finally, the time came to offer quantification mean to the Phenyx users. The implementation was done in Java with the inclusion of calls to a few crucial statistical steps. 4

6 In order to obtain real users feedback, to collect what was needed in quantification, to see how the data was analysed, to obtain quantification materials and to have an idea of how the developed tools should behave when faced with real data, a collaboration was carried out with the Biochemical Proteomics Research Group and the Clinical Proteomics Group of the Geneva University Biological Context Proteomics analysis proposes a large toolbox of analytical methods, instruments and algorithms to identify and characterize proteins. The majority of the published proteomics studies are limited to the identification of the proteins expressed in a biological system. However, this is not sufficient to answer to most biological questions. Does a protein behave significantly different between two samples? Does a protein exhibit time-dependent change? Which proteins behave similarly in the experiment? Thus, quantitative answers are more and more required and populate an increasing number of publications. Initially, quantitative and comparative proteome analysis was performed with 2D-PAGE. Due to some limitations (low dynamic range, bias against membrane and soluble proteins), gelfree methods have complemented and are gradually supplanting gel-based quantitative proteomics. Specific techniques are used to address this issue. One solution is based on the employment of stable isotopes. Isotope label can be incorporated to the process in three ways: metabolically (during cell growth), enzymatically, and chemically. Chemical incorporation of the stable isotope has produced the most of the quantitative proteome data mainly due to its chemical versatility and because it allows the analysis of any biological sample (in contrast to metabolic). Due to the high amount of data, manual analysis is laborious but remains persistent in the scientific community. Thus, many software tools are made available to support data analysis. Several are open-source, with their own assets and caveats. Most of them are able to handle identification results from one or more different search engine (Mascot, Phenyx, SEQUEST, X!Tandem...). Recently, Mascot (a major player in the identification market) proposed its own quantification module. Phenyx need to meet the concurrent demands of quantitative high throughput MS data by proposing its own quantification module. Following the six month s work of the Master Student David Bouyssié in 2006 on the quantitative data extraction module for mainly SILAC and itraq, my main objective was to add the missing downstream analyses pieces of the puzzle and provide a complete quantification pipeline for itraq methodology. 5

7 2. Quantitative Proteomics 2.1. Global View Several methods exist to assess the protein abundance in a sample. Classical methods such as western blotting, fluorophores and radioactivity are widely used due to their sensitivity and dynamic range. However, these methods are not generally appropriate due to some constraints for large scale screening, and particularly for biomarker discovery (non-targetted experiments). Wu et al. has compared MS-based stable isotope labelling methods (itraq, cicat) and DIGE (Differential Gel Electrophoresis) and has shown that these problems can be overcomed by MS-based approaches [32]. MS-based strategy coupled with separation methods (2DE or LC) is currently the more efficient mean to perform the identification of a complex mixture of protein. However, due to the fact that protolytic peptides exhibit a wide range of physico-chemical properties (size, charge, hydrophocity...), the relationship between the amount of proteins and the signal intensities is complex. Therefore mass spectrometry is not inherently quantitative, when seen as a tool for absolute quantitation. Therefore, relative quantitation is preferred, where peptides are compared between experiments data points. This can be achieved in a numbers of ways. Thus, high throughput assessment of change in protein expression is usually performed by stable isotope labelling of peptides and proteins either metabolically, enzymatically ( 16 0, 18 0 incorporation by proteolysis) or chemically using external reagents (Figure 1). 6

8 Figure 1. Overview of quantification in Proteomics, [41] Metabolic Labelling involves in-vivo incorporation of the stable isotope during cell growth and division. One of the most widely used approaches is the SILAC (Stable Isotope Labelling Amino acid in Cell culture) approach, which was introduced by Mann and co-workers in 2002 [33]. In these methods, the heavy amino acid is incorporated during the protein synthesis. The main advantage is that no multiple steps in the labelling protocol are needed and the experimental error does not affect the ratio [34] as shown in However, this approach is almost exclusively applicable to cell An other approach is chemical labelling. For example, several papers studied fluids basing on isobaric tag, itraq [4] or TMT [7]. The principle of chemical tagging rests on the reactivity of the N-ter and side chains of lysine (itraq, TMT) and cysteine (ICAT). ICAT (Isotope Coded Affinity Tag) is the most known of the Cys tagging strategies. Chemical tags were designed to simultaneously allow the enrichment of a subfration (Cys peptides) from a complex mixture of proteins and the quantification of the selected peptides at the precursor ions level. Gygi et al. [35] developed this approach in which cysteine residues are specifically derivatized with a reagent containing either zero or eight deuterium atoms as well as a biotin group for affinity purification of cysteine-derivatized peptides and subsequent MS analysis. Modified versions of ICAT are emerged to solve problems of elution (cicat) and fragment loss (VICAT...). Wu et al. highlighted drawbacks and weaknesses of ICAT methods compared with other chemical tagging approaches [32]. Although ICAT analysis yields to good results when performed on simple or moderately complex sample, the cysteine-specificity leads to a loss of sensitivity when analyzing a complex protein mixture. 7

9 Figure 2. Common quantitative mass spectrometry workflows, Boxes in blue and yellow represent two experimental conditions. Horizontal lines indicate when samples are combined. Dashed lines indicate points at which experimental variation and thus quantification errors can occur [30]. Group of labelling reagents which targets the peptide N-terminus and the epsilon-amino group of lysine residues are the most sensitive. Most of the time, this is realized via the very specific N-hydroxysuccinimide (NHS) chemistry or other active esters and acid anhydride as in, e.g., the isotope coded protein label (ICPL), isotope tags for relative and absolute quantification (itraq) [19], tandem mass tags (TMT) [8],... ICPL, ICAT methods and most of the aforementioned chemical modification techniques, relative quantification is achieved by integration of MS signal over heavy and light labels. TMT and itraq introduce the concept of the isobaric tag. Isobaric tags labelled peptides co-migrate in liquid chromatography separations. The different tag can be distinguished by the mass spectrometer only upon peptide fragmentation. This permits the simultaneous determination of both identity and relative abundance of peptide in tandem-mass spectra. ITRAQ and TMT are described more in details below and some examples of application are provided and summarised (Figure 5). 8

10 2.2. Isobaric Tagging itraq ITRAQ reagents are amine specific stable isotope labels. Up to eight biological samples can be labelled simultaneously. Structure of the reagent is supplied in It consists of three groups: the reporter group and the balance group which form the isotobaric tag (145 Da) and the PRG (Peptide Reactive Group which reacts with the peptide primary amine group). The reporter group contains the charge, and gives strong signature ions in MS/MS. The balance group changes according to the reporter group. It undergoes a neutral loss during MS/MS. The basic itraq experimental workflow is displayed in Figure 4. To begin, proteins from one sample are digested using trypsin. As digestion results, N-ter pepides are ready to be derivatized with the samplespecific reagent via an acetylation reaction. Thanks to the isobaric nature of these reagents, one peak is obtained in MS that greatly simplify the MS spectrum. After CID, the balance group is loss, which leads to one peak for each reporter ion (Figure 4) in the region of low mass of the MS/MS spectrum (Figure 5). 9

11 Figure 3. itraq reagent structure Figure 4. Schematic itraq workflow; each sample is labelled with one of the eight itraq reagents and then pooled prior to MS analysis. 10

12 itraq reagents allow multiplexed quantification of up to eight samples (cell, tissue, serum). Moreover, it permits PTM analysis. Aggarwal et al. show that the reagent does not interfere negatively with the fragmentation to the extent that peptides length and amino acid content are similar to those obtained using other MS approaches [36]. Furthermore, itraq is a highly sensitive approach. Wu et al. demonstrated that itraq covers a large part of the E.coli proteomes. It helps to identify proteins across extreme pi and MW, it detects a great number of fragment peptides per protein and low abundance proteins are more often discerned. This high sensitivity can be explained by two factors. First, itraq is a global tagging reagent on all primary amine, contrary to ICAT (labels only cysteine). The second one is related the reactivity of lysine which leads to a stronger signal in MALDI-MS. Quantification relies on daughter ions generated during CID. However, a potential pitfall inherent to the Timed Ion Selector (TIS) resolution of the MALDI-TOF/TOF may affect quantification accuracy. Quantification relies on daughter ions generated during CID. TIS allows precursor ion and his fragment to pass through a gate in order to reach the detector and contribute in this way to the quantification ratio [32].A second limitation is that experimental variations can occur during tryptic digestion. In itraq workflow (Figure 4), digestion is prior to the labelling and the mix, contrary to ICAT where the labelling is prior to mix and digestion. This may introduce a potential source of error, especially in sample handling and variable degrees of tryptic digestion between two or more samples [32]. Table 1. Advantages/Disadvantage of itraq reagents Advantages Parallel proteomics: 8-plexing. Analyze proteins from cell, tissues or serum. PTM analysis. itraq reagent don't interfere negatively Disadvantage Mass spectrometer interference could hinder itraq reliability. itraq labelling after the tryptic digestion. with fragmentation. High sensitivity. TMT Tandem Mass Tag uses exactly the same approach as itraq. TMT tags are however heavier (TMT 6-plex: 126 to 131 Da). 11

13 2.3. Experimental Application The isobaric labelling approaches have been successfully applied to a variety of experiments and to various samples (prokaryotic and eukaryotic samples including Escherichia coli, yeast, human saliva, human fibroblasts and mammary epithelial cells...) [34,36]. The itraq approach can be applied for various purposes. For example, most of time when searching to characterize proteins from a specific signalling pathway, parallel proteomics approaches such as itraq 4-plex or TMT 6-plex are commonly used in order to obtain time courses profiles. Schmelzle et al. used itraq to label 4 samples of adipocytes stimulated for insulin at different time. Then, time course profiles have been plotted and proteins displaying the same behaviour in the same fashion are clustered. Unknown protein Glu-4 functionality was discovered in this manner [20]. Another study carried out by Zhang et al. used itraq for the same purpose. Profiles were made and clustered using Spotfire tm and the methods of SOM matrix [27]. Biomarkers can also be discovered using isobaric tags. Dayon et al and Choe et al analyzed CSF samples, using respectively TMT 6-plex [5] and itraq 4-plex [7]. Moreover, Desouza et al. identified five potential markers for endometrial cancer with itraq reagents and a set of four proteins using ICAT [9]. Cong et al. compared proteomes of human fibroblasts in four different biological states: replicatively senescent (under permanent growth arrest), stress-induced prematurely senescent, quiescent and young replicating, to identify the signature proteins of each biological state [6]. Figure 5 summarizes the purpose of quantitative proteomics using the isobaric tags. Figure 5. Common application of isobaric tags in proteomics and related analysis workflows. 12

14 2.4. Experimental Design: Principle of Replicate Analysis All these experiences can be declined in replicates [37]. Indeed, random variation when working with isobaric tags has many origins. The source of variation is a function of time, manpower, instrument, subject, subject condition, preparation process, etc... By definition, the variation is a measure of the spread around the expected value. This can be measured in three different forms: experimental, technical and biological variations. Typically, experimental replicates are the actual itraq replicates. Two or more experimental itraq sets serves to label the same samples. Technical replicates are used to assess of the consistency of a measure over repeated test of sample from a same biological source under identical conditions. It eliminates errors from sample preparation and it is very important to establish the significance of the protein expression (ANOVA, t-test, LPE test). Biological replicates are used to estimate the random biological variability associated with the test subject, by repeating the creation of the test subject under the same conditions (Figure 6). Figure 6. Schematic views of the relationship between technical, experimental and biological replicates in itraq experiments. A1 and A2 are two different samples under the same conditions. [37]. 13

15 3. Methods: Study Quantification Workflow 3.1. Introduction An important part of my training course was a prospective work. How is the quantification performed in proteomics? What are the tools? What are the best existing tools? What is the difference between them? Analysing manually a large dataset of MS identification results is time-consuming and not precise (some methods are impossible or difficult to perform manually such as outliers detection, quantile-quantile plot...). However, manual analysis is still widely used. More precise results can be obtained by computer-based data treatment. Thus, several Quantification tools (Q-tools) exist with their own properties (Table 2). As shown in Figure 7, the Applied Bioscience software, ProQuant tm, remains the most used Q-tool in the scientific community. In this part, I tested three tools to familiarize myself with tandem MS quantification; Mascot tm 's Q-tool, the itraq-specific Q-tool of the Trans-Proteomics Pipeline (TPP) and i- Tracker tm developed at the Cambridge University by Shadford et al [21]. Data from isobaric tag labelled samples was difficult to find. To overcome this problem, I utilized the on-line database, Peptide Atlas, which is closely linked to the TPP [23], as well as collaboration with the BPRG, which permitted me to obtain additional data (a description of it can be found in 3.2 materials). I tried to present the tested tools, to highlight their advantages and their limitations, to assess the quality of the quantification results by comparing the protein ratios and finally to determine a reliable quantification workflow. Figure 7. Pie chart of the number of publication by quantification tools. 14 publications have been read. Scientists still prefer to quantify manually or use the official itraq software ProQuant provided by Applied Bioscience. 14

16 Table 2. Summary of software available for quantification in isobaric-tag-based reagents. Tool Name Acad / Source Environment Comment Ref links Commercial ProteinPilot Applied Biosystems commercial w indow s easily distinguish protein isoforms, protein [23] subsets, and suppress false positives; and visualize peptide-protein associations and ProQuant Applied Biosystems commercial w indow s simultaneously quantitate and identify itraq [9] reagent-labeled peptides from MS/MS spectra. Pride Wizard Multi-Q MFPaQ Manchester Centre for integrative systems biology Institute of Information Science and Institute of Chemistry, Academia Sinica, Taiw an Institut de Pharmacologie et de Biologie Structurale, Toulouse, France academic w indow s Submission of mass spectrometry data and Mascot identifications to generate a valid PRIDE XML file. It also includes the facility to add itraq labels, allow ing quantitation data to be added to the PRIDE XML. academic web server itraq quantitation performed in a mascot way. [5] /Multi-Q-Web/ academic w indow s Takes mascot (DAT) result files as input for parsing and Analyst Wiff files for quantification. Quant University of Wurzburg, academic linux/w indow s Offers data and results visualization (boxplot, error [21] German plot), error estimation, Libra ISB academic linux/w indow s Quantification module of the TPP. (cf below [11] (cygw in) for more information) i-tracker Cranfield University,UK academic linux/windows (cf below for more information) [21] Mascot-Quanti matrix science commercial linux/w indow s (cf below for more information) w w w.matrixscience.com/ [22, 28] [20] w w.mcisb.org/resources/pridewizard/index.html vc/ sashimi/trunk/trans_proteomic_pipeline/sr c/quantitation/libra/docs/libra_info.html 3.2. Experimental samples ABRF Data. A sample from the ABRF 2006 study was used. It contains eight proteins spiked in ratio ranging from 1:1 to 1:76 marked with itraq 4-plex (114.1 to 117.1). Theoretical ratio can be found in Appendix 1. Peptides are identified by tandem mass spectrometry analysis with MALDI TOF-TOF (ABI 4700). Whitehead Data. Proteomic analysis was conducted at three time points (30, 40 and 60 min) for both control and γ irradiated cultures of Halobacterium salinarium strain NRC-1. Relative quantification was achieved using shotgun isobaric tagging with itraq reagents (Applied Biosystems, Foster City, CA). Quantification is achieved upon tandem MS, which fragments the itraq reagents unevenly to release daughter products of differing mass (m/z 114, 115, 116 and 117). For direct comparison across multiple runs a common reference sample derivatized with the 114 mass tag was included in each four-plex experiment. (More information is provided by Whitehead et al. [26]). This data were obtained from Peptide Atlas 15

17 3.3. Tested Software Presentation I-Tracker Three tools were compared; i-tracker developed by Shadford & al.[21], the itraq quantification tools packaged with the TTP pipeline [22], Libra and the quantification module of the 2.2 version of the Mascot sofware. The main goal of the i-tracker software is to calculate ratios from non-centroid MSMS peaks lists in a format linked to the results of protein identification tools i.e. Mascot and SEQUEST. The i-tracker process is detailed in Figure 8. The user can define an arbitrary intensity threshold as an unique filter. Such a threshold can lead to the loss of quantifiable peptides. Moreover, the purity correction coefficients can be entered (the itraq reagents are not completely pure and manufacturers therefore provide a correction factor to avoid peaks overlapping (cf Appendix 2 table 2)). Finally, the user can enter an ion tolerance to collect the reporter peaks areas. In its results, in addition to peptide ratios, i-tracker displays a table of quantization errors for each ratio. This error provides an interesting indication of the confidence we can give to a ratio, especially for ones calculated from very low abundance ions. Advantages and limitations of i-tracker can be found in Table 3. I-Tracker is limited to quantification at the peptide level. The i-tracker outputs a csv-formatted result files. Thus, Excel macros or parsing functions are easily applicable. A second disadvantage is that only samples labelled with itraq 4-plex reagents can be analysed. The main advantage of i-tracker is the quantisation error, which gives an indication of the confidence to give in a ratio of two peaks of low abundances. INPUT : itracker Algorithm : OUTPUT : - Non Centroid MS spectra (.mgf,.dta) - Ion intensity threshold - Purity correction - Reporter Peak Range - Reporter ion peak collection - Reporter ion area calculation - Purity correction - Peak normalisation (sum of all reporter intensities) - Under threshold checking - Ratio calculation - Quantisation error calculation - Relative errors of each reporter ion - Indicative Errors - Ratio for each reporter ion Figure 8. Scheme of i-tracker process 16

18 Table 3. Summary table of the advantages and limitations of i-tracker. Advantages Limitations Algorithm & source code for i-tracker are freely available Relative error, indicating the confidence to give to a ratio especially for low peaks ions Linked to other protein identification software Specific to itraq 4-plex, don't care about the others isobaric tag methods (itraq 8-plex, TMT...) Does not compute protein quantification 17

19 A. B. Figure 9. Screen shot of i-tracker OUTPUT. A. i-tracker outputs a.csv files containing the reporter-normalised area Norm and UT? which reports a flag UT if the ion peaks area is under the user-entered threshold. B. i-tracker outputs a.csv files containing the peptide ratios, with a matrix containing the quantisation errors. 18

20 TPP- Libra Libra is the itraq quantification module of the Trans-Proteomic Pipeline. The TPP is a collection of tools (Figure 10) for MS-based proteomics developed at the Seattle Proteome Center (SPC). It contains software for the quantification (Libra for itraq, XPRESS for ICAT...), converter (search engine format to pepxml...), validation and probability assignment (ProteinProphet, PeptideProphet Libra is part of the TPP pipeline and therefore relies on it for all quantification preprocessing steps (Figure 11). The Libra input data consists in (1) a database related to analyzed data placed in /dbase/, (2) a pepxml file, (3) a mzxml file and (4) a condition file. The most suitable database was found at The pepxml is the standard format for representing identification results. It stores information concerning PeptideProphet validation and quantification, and it references the mzxml file. The latter is the bedrock of the TPP. It is an open data format for storage and exchange of mass spectroscopy data, developed at the SPC/Institute for Systems Biology. It provides a standard container for MS and MS/MS proteomics data. Several converters are available to convert raw files (proprietary file formats from the most of vendors) to mzxml format. For example, the T2DExtractor tm to convert raw files from ABI instruments developed at the University of Michigan or Wolf tm, which converts MassLynx native acquisition files. Most of the time, converters must be run in the computer where the data acquisition instrument's software is installed. The last required file is an XML configuration file supplied by the condition.xml generator ( It contains all the requisite parameters to parse the mzxml file in order to extract reporter peaks intensities: the reagent M/Z values (for itraq 4-plex: to 117.1), mass tolerance, isotopic correction coefficients (provided by applied), a method of centroiding, a method of normalization (normalization against sum of intensities, against the most intense peaks...) and a minimum intensity threshold. Figure 10. Scheme of the software involved in the Trans-Proteomic Pipeline. Identification results from SEQUEST, Phenyx, Mascot and X!Tandem can be imported. ( 19

21 Figure 11. Scheme of the quantification within the TPP. XPRESS, ASAP and Libra are the three quantification modules within the TPP. Libra performs the protein quantification in a simple way. Usually, protein quantification is derived from the group of peptides associated with the protein. As summarized in the Figure 12, Libra applies a normalisation based on the sum of the reporter intensities. Then, the normalised intensities are averaged over all peptides of a protein. Normalized values of intensities differing by more than two sigma from the mean are considered as outliers and removed. The average of the reporter intensities of the protein is recalculated and the 1-sigma standard errors are calculated using the standard deviation. The software gives in output the average values of the reporters or the ratio if a reporter has been set in denominator (Figure 13). To resume, Libra employs a simple but accurate method to compute the protein ratio. Owing to converters packaged within the TPP (i.e. out2xml), Libra handles identification results from many search-engines (Mascot, SEQUEST...). Moreover, it can send (via protxml format) the quantification results through various post-quantification tools (Figure 11) such as Cytoscape, SBEAMS... The algorithm can also quantify samples labelled by the itraq 8-plex reagents. However, other isobaric tags are not taken into account for the moment. Moreover, this tool requires the installation of the whole TPP and it is not very user-friendly since all the TPP software only have command line interface (Table 4). A detailed Libra tutorial can be found in supplementary data

22 INPUT : Libra Algorithm : OUTPUT : PepXML, MzXML, m/z tolerance * Ion intensity treshold * Purity correction * Methods of centroiding * Methods of normalisation. * Reporter ion peak collection Apply purity correction Peak normalisation, Means of each reporter channel Outliers Removal Ratio calculation Protein Ratio Standard error Figure 12. Scheme of Libra process. (*) Contained in a configuration.xml file. Table 4. Summary table of the advantages and limitations of Libra. Advantages Limitations Simple but precise way to quantify Quantify protein identified by various search engines TPP pipeline allows an easy link with post Quantification tools (Cytoscape, SBAMS...) Can be extended to itraq 8-plex Maybe too many options displayed in the interface Only command line interface (converters, Libra...) Specific to itraq, no other isobaric tag can be quantified 21

23 Figure 13. Screenshot of the protxmlviewer shown protxml files which contain the results of the quantification. Means and SD of each reporter are displayed. 22

24 Mascot quantification module Mascot 2.2 includes a quantification module that computes ratios of identified proteins. This module covers most of the quantification methods i.e. reporter-based (itraq, TMT), precursor-based (ICAT, SILAC, Absolute Quantification...) and Label Free. All these methods are classified in protocols. Reporter protocol takes into account samples labelled with the most of the isobaric tag (itraq, TMT, ExacTag) except the AMT. All of the information required for the isobaric quantification is contained in the peak list, which is needed in input. Reagents and the MS/MS tolerance can be set in the interface. All the other parameters are set in the XML configuration file. The configuration.xml file encapsulates all users parameters. There are many different parameters, split into groups. For example, the group Methods contains the methods used to calculate the ratio and the significance level used for the statistical test (default 0.05). The group Component contains the information concerning each reporter (average and mono-isotopic mass, values of impurities correction...). In Ratio users can define each ratio they want to display (numerator, denominator). The group Quality contains the filter parameters, on intensity, on score, on expect value. Outlier and Normalization include specific methods. Mascot implements Grubbs, Rosner, Dixon detection methods, and three types of are available; summing intensities, median and geometric mean. Finally, Mascot performs the quantification following protein identification. It displays a summary box containing all the protein ratios (Figure 15A). A detailed view is also provided for each protein (Figure 15B). Each peptide ratio is given and the protein ratio value is displayed in a box coupled with a measure of spread, generally a geometric standard deviation (Figure 15B). Figure 14 summarizes the Mascot Process. Table 6 shows a comparison between the theoretical ratios of the proteins of the ABRF dataset and the ratios found in Mascot. Parameters are summarised in the Figure 1 of Appendix 3. Mascot fails to quantify three proteins. Carbonic anhydrase was not identified whereas betacasein and ribonuclease was identified but not quantified. The quantification cannot be performed due to the outlier s removal option. When the option is set to none, these two proteins become quantifiable. Due to its number of parameters and its very user-friendly interface, Mascot seems to be a very complete tool for quantification (Table 5). However, importing identification results from other search-engine is impossible. 23

25 INPUT : Mascot Algorithm : OUTPUT : Peak lists m/z tolerance * Significance threshold Impurities correction Reporter mass Threshold on peptide score Threshold on peptide maximum expect Ion intensity treshold Methods of integration Methods of normalization Methods of outliers removal Set numerator, denominator Reporter ion peaks collection Apply filters Peaks normalization Peptide ratio calculation Outliers Removal Ratio calculation Significance changes Quantification summary Box Individual proteins: peptide ratios, summary box Indication of the protein change Figure 14. Scheme of Mascot process. (*) Settable in the interface, all of the other parameters are contained in the configuration.xml file. Table 5. Summary Table of the advantages and limitations of Mascot. Advantages Limitations Correction of sample variability Strong outliers detection test Significance change Covers almost all of the isobaric tag-based quantification Clear interface Coupled to Mascot identification Display only the essentials parameters Bias in the ratio to the impurities correction Only compatible with Mascot identification results 24

26 Figure 15. Screenshots of Mascot result page, A/ Summary box of the quantified protein. B/ Protein summary box containing the type of methods to averaged peptides ratio, the number of peptides and the geometric SD. Below this box, the details of ratio at the peptide level. Table 6. Comparison of the theoretical protein ratio of ABRF sample and the Mascot ones. Protein Name AC Theoric Ratio Mascot Ratio Beta casein P :4 NQ* Catalase bovine liver C 1345 P :5 0.4 Glycogen phosphorylase rabbit P : Carbonic anhydrase I / 3:1 NI* Peroxidase horseradish P 6782 P : Ribonuclease A bovine R5500 P :1 NQ* Bovine serum albumin P02769_CHAIN0 1: Lactoperoxidase P :1 1.1 (*) NQ means that the protein is identified but Not Quantifiable. NI means that the proteins are Not Identified. 25

27 3.4. Software comparison Comparison of the different tools can be made at several levels; at the level of the quantification results and at the level of the algorithm. Comparison of the Quantification Results In a first step, quantification results have been compared. ABRF data have been used in order to compare peptide ratios found in i-tracker and Mascot. Because i-tracker handles noncentroid data, the peak list file has been modified in order to add some peaks for each itraq reporter (in the reporter m/z tolerance interval). Three peptides matches assigned to protein hits (observed mass: , , ) are used in input to i-tracker and then in Mascot. Parameters used in i-tracker and Mascot are shown in Figure 1 and Figure 2 of Appendix 3. Results provided in Table 7 show two peptides differentially expressed. The variation can be explained by the two methods of integration, area calculation and sum of reporter peak intensities. Libra and Mascot results have been compared using the Whitehead data [26]. Since no protein was spiked in this experiment we cannot conclude about which tools give the best result. However, protein ratios seem to follow the same profile for both tools (Table 8). Table 7. Result comparison table for three peptide matches Peptide m/z i-tracker ratio 117/114 Mascot ratio 117/ / Table 8. Result comparison table Libra VS Mascot Libra Mascot Protein Name Diphosphomevalonate 0.74+/ / / decarboxylase Adenylosuccinate synthetase 0.69+/ / / Proteasome-activating nucleotidase 2 Fumarate hydratase 0.64+/ / / / / / Heme biosynthesis protein 1.64+/-99.99* 2.18+/-99.99* 2.07+/-99.99* Vng6208c 1.10+/ / / (*) means that only one peptide is used to calculate the ratio, standard deviation is infinite. In Mascot bold numbers indicate the ratio is significantly different from 1. 26

28 A quantitative analysis can be subdivided in 5 cardinal steps. A pre-processing step in other words by which methods of integration the reporters are extracted (summing intensities of the profile or calculating the area under the profile curves) is performed. This is followed by a filtering step. We can imagine various filters; the most observed ones are a threshold on intensities, on score, and on p-value ( expect ). The normalization step is generally applied to correct systematic biases or to avoid giving too much weight to one reporter. This step is followed by the outliers removal step. Finally, the protein ratio can be estimated. The quantification workflow of each tool is compared for each of these steps. The three tested tools have their own ways to achieve the quantification. The i-tracker makes the quantification at the peptide level. The user must therefore manually calculate protein ratios. INPUT/OUTPUT First, we compare the INPUT file formats. Mascot quantification is clearly paired with the Mascot identification, in the extent that quantification of proteins which are identified using other identification software are impossible. On the contrary, i-tracker permits the importing of SEQUEST and Mascot results, and Libra can handle many types of identification tools results (Figure 10) due to the availability of tools such as Out2XML and Mascot2XML, which convert, respectively,.out file format from SEQUEST and.dat file format from Mascot in pepxml format. In addition to pepxml file, Libra inputs RAW files converted to mzxml format (for the conversion of raw to mzxml, converters need access to the computer where the instrumentspecific software for data acquisition is installed). In OUTPUT, Mascot releases a.dat file. Libra encapsulates its results in a protxml format file, which is read with the protxml viewer tool and exported to post-quantification tools (Cystoscape, SpotFire etc ). I-Tracker chooses to produce two types of.csv file. Output style 1 is designed to be human-readable when imported into programs such as MS Excel as a comma-separated variable file. It is strictly ordered so automated parsing is also straightforward. Output style 2 is designed to allow very easy basic analysis within programs such as MS Excel. All information is outputted on a one row per spectrum basis and thus all human-readability is lost, but to the gain of being able to run functions and macros more easily. Reporter Peaks Collection There are two ways to collect peaks. Mascot allows the choice between both via the configuration editor. Libra employs the sum of intensities of the peaks profiles whereas i-tracker used the trapezoid approximation for calculating the area under a curve. Filters After reporter ion peaks collection comes the filtering process. Many filters can be applied. All of them may involve data loss, especially when choosing a threshold on intensity, since peptides that present weak intensity are removed. 27

29 Type of Quantification Workflow We can now talk about the quantification workflow of each tool. Libra and Mascot have two ways to compute the protein ratio. Mascot computes an average of peptides ratios whereas Libra computes a ratio of averaged peptide reporter intensities. Outliers Removal In a quantification workflow, outliers removal (as normalisation) is a crucial step that is always in the quantification workflow although their implementation varies among the Q-tools. Mascot implements three methods for outlier removal. Dixon's r11 test, also referred to as N9, is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range [27, 28]. It is applicable to values between 4 and 100. For a greater number of values, Rosner's test is applied [24]. Grubbs detection is applicable for values between 3 and 100 [26, 25]. Libra decides that a value of intensity is an outlier if it is outside of the range: ]µ - 2 * σ, µ + 2 * σ, [ (1) Where µ is the means and σ is the standard deviation. Care must be taken when outliers are blindly removed. A rigorous analysis would be to compare data with or without outliers to see to what extend the conclusions are qualitatively different. Normalisation A second important thing when working with large biological dataset is normalisation. Normalisation always takes place at peptide level. To remove variability of the sample incorporated during the experimental procedure, and based on the assumption that differential proteins in a biological sample are in a minority, Mascot proposes two methods to normalise the data so as to make the average of the whole population ratios across the entire data set equal to one. This can be done via the median or the geometric mean. Another method is the sum; totals intensities for each reporter across the entire data set are made equal. In contrast, for Libra and i- Tracker, no variability correction is implemented. However, in order to not give too much confidence in one reporter, both tools normalise on the sum of all reporter intensities. Determination of protein abundance and measure of spread Libra implements a simple ratio of the reporter means and displays a standard error calculated from the numerator standard deviation. Due to the fact that Mascot treats the peptide ratios, it allows several methods for averaging them; median, geometric mean, and weighted mean associated with a geometric standard deviation. I-Tracker does not provide the ratio at the protein level. 28

30 Bonus Several functionalities are software specific. I-Tracker for example is the only tool that displays a quantisation error. This value can serve as warning against placing too high confidence on reported ratios when these have been based on peaks with low ion counts [21]. Err(1,2) = (100 * ((0.5 / Peak1Max) + (0.5 / Peak2Max)) (2) Mascot provides an interesting and robust indication of the relative protein fold change. It employs a one-sample t-test. The null hypothesis H 0 is that the estimation of the ratios x is equal to one. If H 0 is rejected, x is significantly different from 1. The protein ratio is reported in bold Discussion Software comparison reveals that some steps are always found. Filters, outliers removal and estimation of the ratio are essentials in a quantitative analysis. Each tool has its own way to estimate the ratio. Although they display a good estimation of the relative protein abundance, some of them are not user-friendly (i-tracker, Libra). Moreover, they don t take into account the experimental design of the experiment (i.e. replicates analysis, time course...). Finally, no mean to visualize the protein expression is implemented. A complete summary of the tools that I tested can be found in table 1 of Appendix 4. 29

31 4. Establish Quantification WF 4.1. Introduction Based on the previous study, three methods of quantification were implemented using the R language. This programming language is very well appropriate when analysing large amount of data for quantification to the extent that it offers an environment for statistical programming and In fact, descriptive statistics methods and inference tests are all ready when installing R. There are many additional packages that are easy to use and well documented. Moreover, R is able to output graphs and charts and provides therefore an attractive way to represent quantitative results. Librus, an intensity-based method similar to Libra, and Mascat, a peptide ratio based method that resembles Mascot were first implemented. Furthermore, a novel quantification algorithm, named QI (Quantification Isobaric), was developed. QI workflow is a least-square regression-based workflow. According to Bantscheff et al. [30], linear regression could be a good alternative to the filter on intensity and the inherent data loss that such a filter involves. In fact, making the difference between (1) a weak intensity from a low abundance peptide and (2) a weak intensity from the background noise is difficult. A leastsquare regression line is a straight line that passes through the data so that the sum of the square of the vertical distance data points from the line is as small as possible. So, the advantages are double. First of all, we avoid useless loss of peptide matches by applying an arbitrary threshold on intensity, and secondly, we obtain an easy method in order to visualise the protein quantification. The idea of this method is to create a linear model and coerce the regression line to pass through the origin. The slope of the line is an estimation of the protein ratio. The R- squared gives an indication of the data spread (Figure 20). Then, strong methods of outliers and influential detection are used. These methods are part of regression diagnostics. As highlighted in part 3.5, several statistical steps are common to all quantification approaches. These steps are shown in dark blue in the Figure 16 and detailed in the next part. The validation of the algorithms was effected by using two sets of spiked proteins. The ratios of the proteins contained in the ABRF sample (described in 3.2) were compared between the three methods. Then a dataset (provided by Loïc Dayon) obtained from samples containing 4 spiked proteins [7] was used to measure the root mean square deviation (RMSD) between the expected theoretical ratio and the obtained ratios for each quantification algorithm. 30

32 Figure 16. Steps of the three approaches implemented; steps in common among Librus, Mascat and QI are shown in dark blue Description Signal Extraction The data were pre-processed using the LabelMS2Extractor (LMS2E), a tool designed to extract the intensities of peptides labelled with isobaric tags. The INPUT data consist of (1) the Phenyx identification result (pidres.xml), (2) the path of the configuration file that contains information about the mass of the reagents, and the values for impurities correction and the mass tolerance. Reporters are integrated using the method of sum of intensities along the range defined by the reagent mass and the user-entered mass tolerance. Missing values of reporter intensities are replaced by 0. Another way to compute reporter signals is to integrate the area under the reagent profile curve. The former method of integration is however more accurate and can handle cases where only one peak is detected in the reporter mass range [2]. The OUTPUT is a csv file (Appendix 2) that contains all of the quantifiable peptide matches, i.e. all of the peptides that have been labelled. 31

33 Filters In all implemented quantification methods, the first step consists of the filtering of the LMS2E results table. Filters can act at peptide level or at protein level. Various filters can be imagined. At protein level, a threshold on the number of peptide matches is implemented. The greater the number of peptides per proteins, the more accurate is the ratio. At peptides level, several filters are implemented; a threshold on the minimum intensity value, on the minimum score, the maximum p-value, and a filter on proteotypic peptides i.e. when a peptide matches for several proteins, the peptide is removed. Thus, only unique peptides are taken into consideration when performing the protein quantification. Normalisation/Correction step The second crucial step is the normalisation of the dataset. Librus implements the normalisation of the peptide intensity, as Libra does. Mascat implements the three normalisation methods of the Mascot quantification module. In fact, the Librus normalisation permits one to not give too much confidence to one reporter whereas the Mascat normalisation corrects the systematic bias that occurs during the sample preparation. Outliers removal In Librus, if a normalized value of intensity is more than 2σ from the mean of the signals of one reporter, it is considered as an outlier and removed. Mascot uses specific outliers detection tests (Grubbs, Rosner, Dixon). (For more details see part 3.4). QI bases its workflow on strong regression diagnostic methods. The DFFITS and studentized residuals are important techniques to the detection of outliers and influential points in a regression analysis. The DFFITS of an observation is a measure of the influence of this observation on its own predicted value. The studentized (standardized) residuals are adjusted by dividing them by an estimate of their standard deviation. The Table 9 shows the number of peptides remaining after outliers removal. Due to its tolerance to a large range of values, the Grubbs detection test has been chosen for Mascot. For Libra, the methods described above are used. For QI, if a DFFITS value is farther than m* standard deviation σ from the average of the DFFITS, the peptide is considered as an outlier and removed. We use the factor m equal to 2, 2.5 and 3. Authors recommend 2 [31] but the number of lost peptides becomes important (Table 9). As a reliable alternative, I propose a factor m equal to 2.5. The number of peptides per protein is greater. Moreover, the resulting ratios are closer to the theoretical ratios (Table 10). The Figure 17 shows the effect of such a method. Regression lines have been plotted before and after outliers removal (m = 2). 32

34 Table 9. Comparison of the number of remaining peptides for each methods of outliers removal. AC # peptides initial # peptides Grubbs # peptides +/- 2σ # peptides DFFITS ( m = 2 ) # peptides DFFITS ( m = 2.5 ) # peptides DFFITS ( m = 3 ) "P00432" "P00433_CHAIN0" "P00489" "P02769_WOSIG0" "P80025_CHAIN0" Table 10. Quantification results comparison for each workflow. AC Theoretical Ratio Mascat Ratio Librus Ratio QI Ratio ( m = 2) QI Ratio ( m = 2.5) QI Ratio for (m = 3) "P00432" 1:5 0.5* 0.47* * 0.36 "P00433_CHAIN0" 1: "P00489" 76:1 6.51* 5.27* * 3.86 P02769_WOSIG0 1: * 1.1 "P80025_CHAIN0" 1:1 1, *

35 A. B. Figure 17. Linear regression for four proteins, equations of the regression line and regression coefficient are provided, A/ before outliers removal, B/ after outliers removal. The outliers removal step improves the regression coefficient by diminishing the spread. 34

36 Protein ratio calculation and measure of spread Libra performs a ratio of averaged intensities and gives a standard error. Mascat averages the peptide ratios by some descriptive statistics methods (median and MAD, geometric mean and geometric SD, weighted mean and weighted SD ). Being based on a regression model, QI gives the ratio by least square estimation. OUTPUT and Visualization The three implemented approaches provide csv sheets and graphs to visualize the quantification. Each workflow has its own presentation of the results and specific graphs. Thus, Mascat displays a summary table containing the ratio and an indication of the protein fold change (Figure 18A), and a protein box containing all quantitative information of one protein is provided by the R package (Figure 18B). In addition to histogram and density distribution of the peptides ratio, boxplots are used to display the peptide ratio distribution for each quantified protein (Figure 18C). A boxplot is a graph of statistical summary, with the outliers plotted individually. The quartiles are spammed by the central box (first and third quartiles) and the median is displayed by the central line (bold). Observations plotted outside of the range 1.5*IQR are suspected as possible outliers and the whiskers show the largest and smallest peptide ratios that are not considered as outliers. Librus provides a matrix of result (Figure 19A), and a protein summary box that contains all information relative to one protein (Figure 19B). 35

37 A. B. C. Figure 18. Mascat output, csv files and boxplots A. summary box, contains proteins for the whole run, B. protein box displays information for one protein; ratios are calculated with the median and the MAD, Median Absolute Deviation. C. Standard boxplot of peptide ratios. Bold horizontal line is the median; outliers are visible as isolated points. 36

38 A. B. Figure 19. Librus output, A. all ratios for one protein are displayed in a matrix of ratios. B. individual information is summarized in the protein box (mean of normalised reporter intensities, SD, SE and the number of peptide matches). 37

39 QI workflow proposes individual protein boxes containing the ratios values calculated from the slope of the regression lines (Figure 20), and quantitative scatter plots of the peptides intensities numerator VS denominator with the fitted line (Figure 17A, 17B). A standard scatter plot displays the relationship between two quantitative variables. The x and y axis represent the denominator and nominator intensity values, respectively. A scatterplot provides information about the form, the direction and the strength of a relationship. In case of quantification by regression analysis, some of this information will be constant. The relationship is linear and peptide intensity values always show a straight-line pattern. The direction of the relationship is always unambiguous and the slope of the fitted line (that corresponds to the ratio) must be positive. It means that, in terms of regression, the association is positive. The strength of the relationship between the two variables is given by the coefficient of correlation. The reported standard error gives an estimation of the standard deviation of the slope. It is inversely proportional to the reliability of the computed ratio. Finally, the R-squared is provided. It gives a better feeling of the strength of the association between the points and the regression line than the correlation coefficient. In fact it can be interpreted as a measure of the spread. The closer the regression coefficient is to one, the better the regression line describes the data, and the better the slope is an estimation of the ratio. Figure 20. Qi output, individual box summary for one protein. It contains the slope that corresponds to an estimation of the protein ratio, the standard error for this coefficient, the regression coefficient and the number of peptide matches. A large number of peptides does not guarantee a precise estimation of the ratio. Quality of the data set It is generally admitted that high throughput shotgun proteomics data are log-normally distributed [2, 29]. In the case of peptide quantification, skewed distribution can be observed. Indeed, the ratios have log values lower or greater than zero, but very seldom have large values. Usually, they vary around zero. So, a non-normal distribution may indicate that the values are meaningless, or an experimental error occurred. However, this may also indicate that the sample contains very differential proteins. Log-normal distribution is required for an outliers detection test in Mascat Workflow and is therefore also checked before a data analysis is started. The test of normality is performed by the Shapiro-Wilk test. The null hypothesis is that the sample is taken from a normal distribution. This hypothesis is rejected if the critical p-value for the test statistic W is less than This test loses its reliability for a sample size greater than 2000 values. 38

40 This size threshold is almost always exceeded when working with high throughput proteomics data. To overcome this issue, we advise to always look at the peptide ratio distribution in Mascat (Figure 21A) and the reporter intensity distribution in Librus (Figure 21B), or the data quantilequantile plot (Figure 21C). A. C. B. Figure 21. Data quality assessment, A/ Mascat workflow, Distribution of the peptide ratios. Dotted red curves are the density distribution after normalisation (Ali Data). B/ Librus workflow, superposition of the reporter intensities density distribution (Loic Data, Ante-Mortem labelled reporters) C/ Mascat workflow, normal quantile-quantile plot of peptides ratio 117/114 (Ali Data). 39

41 Implementation The implementation was done using R version The program files, documentation and example script are contained in supplementary data 1. Reporter peaks were collected using the LMS2E that performed the integration by sum of intensities and corrected the reporter impurities. Librus workflow can be applied using the script librus.r. It executes the wrapper function librus that performs the quantification. It calls the function of normalisaton, firstnorm, the outliers removal function outlier; getratio calculated the ratio and the standard error. Mascat workflow is called by the wrapper mascat found in the module Mascatfunc.R. This workflow is devised in four modules. The wrapper calls the script Mascat_norm.R which contains normalisation function, the script Mascat_outliers.R which contains the outliers detection test (this script needs the installation of an additional library), outliers which clusters a collection of some tests commonly used for identifying outliers and Mascat_changes.R which contains statistical test for significance change. The script Mascat_displays.R performs the export of the quantification results and charts. The last workflow, QI, is designed as a single module Qifunc.R. It implements a wrapper qi, which calls sloperatio, a subroutine that creates the linear regression model and displays the regression summary. The summary contains the regression statistics of the model, the slope, the standard error, and the regression coefficient. Moreover the wrapper calls regression diagnostic methods by the function diagnrm. No additional libraries are needed for dffits and studentized residues. R default distribution contains all regression diagnostic functions. 40

42 Table 11. Summary of the methods used in the three implemented workflows. 41

43 4.3. Validation of the algorithms The ratios for the five spiked proteins of the ABRF data are compared in the table 10. It shows that ratios resulting from Mascat, Librus and QI are closed to the expected theoretical values. This therefore validates our three quantification algorithms. A more precise study was performed to determine which quantification workflow provides the more accurate ratios. For this purpose, we used a data set supplied by Loic Dayon that contains a mixture of albumin (ALBU) from bovine serum, myoglobin (MYO) from horse heart, -lactoglobulin (LACB) from bovine milk, and lysozyme (LYS) from hen egg in equal weight. This dataset was originally used to determine the coefficient for impurities correction for the TMT reagents, Dayon et Al. for more information [7]. Due to its robustness, root mean square deviation was chosen to calculate the deviation from expected theoretical ratio of 1:2:3:3:5:10. As summarized in Figure 22, RMSD for each expected ratios are combined for all the experiments. The profiles clearly show that the higher a ratio, the greater is its error. To obtain a better indication of the relative accuracy of a quantitative approach, a zoom is performed on the array of low (2:1, 3:1) and high (5:1, 10:1) theoretical ratio (Figure 23). This shows that for a relatively weak ratio, the approaches seem to behave equally (Figure 23A). However for ratios higher than 3:1, the workflow based on peptide ratio, Mascat presents the largest error in the extent that its errors calculated from all the jobs are higher than a RMSD value of 2.5 (Figure 23B). On the contrary, only one job presents a big deviation for the intensity-based workflow - Librus. Moreover except for this latter, Librus' error profiles are closely clustered whatever the expected ratios. The regression-based approach generally gives low error but the job profiles are more dispersed, almost parallel. More information is found in supplementary data 7. Detailed profiles per job can be seeing in Appendix 5. 42

44 Figure 22. Superposition of the RMSD profile for each job. 43

45 A. Root Mean Square Deviation for Librus, Mascat & QI Librus Librus Librus Librus Librus Mascat Mascat Mascat Mascat Mascat QI QI QI QI QI ,5 3 2,5 2 RMSD 1,5 1 0,5 0 5:1 10:1 Theoretical Ratio B. Root Mean Square Deviation for Librus, Mascat & QI Librus Librus Librus Librus Librus Mascat Mascat Mascat Mascat Mascat QI QI QI QI QI ,9 0,8 0,7 0,6 0,5 RMSD 0,4 0,3 0,2 0,1 0 2:1 3:1 Theoretical Ratio Figure 23. RMSD on low abundance and high abundance ratios, for weak ratio error seems to be relatively equal. However, when a protein is too differential, Mascat presents ratio with the highest deviations. 44

46 4.4. Discussion Three algorithms were developed and validated using the R language Each of them proposes their own statistical procedure (Appendix 6). By measuring the RMSD between the expected theoretical ratios and the estimated ratios from the different algorithms, we conclude that Librus or to a less extent, QI give the more accurate ratios. This corroborates the observations of Carillo et al. [39]. However, care must be taken by giving too much confidence to this conclusion. Indeed, the two analysed datasets are from samples containing only spiked proteins; the normality assumption is therefore not respected. This assumption is important especially in Mascat where the outliers detection test requires that the peptide ratios of the whole dataset are normally distributed. This could explain the fact that Mascat shows the biggest variations from the expected ratios. Consequently, to say that one of these algorithms gives the more accurate ratios, a study should be done on a dataset from real biological samples where some spiked proteins would be incorporated. The next studies show the application of the algorithms into real biological contexts. 45

47 5. Application of the Quantification Workflows 5.1. Alireza collaboration: Peptides Ratios-based Quantification Approach applied to Characterize Daptomycin Resistance in Staphylococcus aureus. A two-month work with Alireza Vaezzadeh, a PHD student from the Biomedical Proteomics Research Group (BPRG), has been carried out. My part of the job consisted of an analysis of MS/MS Data obtained from quantitative MS based proteomic experiments on Staphylococcus S.aureus Introduction S. aureus, also known as golden staph, is the most common cause of staph infections. It is a spherical bacterium, frequently living on the skin or in the nose of a person where it appears in grape-like clusters when viewed through a microscope. Figure 24. Microscopic image of Staphylococcus aureus (ATCC 25923). Gram staining, magnification:1,000. It infects tissues causing furuncles and severe diseases like Staphylococcal scalded skin syndrome (SSSS) in the infant. In order to stop the rising incidence of this infection, Daptomycin was approved by the FDA (Food and Drug Administration) in 2003 for the treatment of complicated skin and soft tissue infections caused by susceptible strains of S. aureus, including methicillin-resistant S.aureus (MRSA) strains, and other gram-positive bacteria. Despite significant efforts over the past 20 years, the mode of action of Daptomycin remains mysterious. Furthermore the bacterium seems to develop resistance to this antistaphylococcal agent. The exact mechanism of the resistance is not clearly known. As methods for gaining insight into this domain, Proteomic and Transcriptomic approaches have been developed. Transcriptional profiles were performed using a customized and extensively validated oligoarray by Dr. Patrice François at the Geneva University Hospitals. Protein MS-based quantification was performed using itraq on membrane-enriched extracts. A list of differentially expressed proteins was obtained using the Mascat Workflow. 46

48 Materials Experimental Workflow. Three strains were analysed in this study: 616 (initial patient isolate), 629 (first isolate breaking through Daptomycin therapy and demonstrating decreased Daptomycin susceptibility but still within the susceptible range, termed the transitional strain ), and 701 (subsequently isolated during Daptomycin therapy and non-susceptible to Daptomycin). Quantitative-MS based proteomic experiments were performed in practical triplicates. Samples were prepared according to manufacturer s protocol (Applied Biosystems, Framingham MA). More details concerning the Experiments can be found in Supplementary Data 2. For the first proteomic experiment (PR1), strain 601 was labelled with itraq 114, strain 629 with itraq 116 and 701 with itraq 117. For the second practical replicate experiment (PR2), strain tags were crossed and strain 616 was labelled with itraq 117, strain 629 with itraq 114 and 701 with itraq 116. Finally for practical replicate three (PR3), strain 616 was labelled with itraq 116, strain 629 with itraq 117 and 701 with itraq 114. This experimental design is shown in Table 12. Table 12. Experimental design for differential comparison of S.aureus strains with dissimilar Daptomycin susceptibility. The itraq tags are crossed in each practical replicate (PR) Daptomycin Susceptible Transitional Strain Daptomycin Nonsusceptible PR 1 itraq 114 itraq 116 itraq 117 PR 2 itraq 117 itraq 114 itraq 116 PR 3 itraq 116 itraq 117 itraq Methods itraq quantification. Although several quantification software packages exist, none of them allow an easy handling of inter-run replicate and can import the data processed from Phenyx. The quantification values are extracted with the LabelMS2Extractor from reporter peaks intensities in a mass range of +/- 0.1Da.Then the Mascat methods (steps are detailed Appendix 6) was applied with the following parameters (Table 13). Data were analysed via R version A series of various filters was implemented on the data: Thresholds on p value (>1e-7), Z score (<6) and intensities (<2000). In addition, peptides present in more than one protein were removed. Finally, Proteins with less than 2 peptides were excluded. In order to reduce artifactual variation and to focus only on biological variation, a correction factor, based on the sum of the intensities of each reporter, was applied on peptide ratios. Outliers were removed using the Grubbs detection test on each protein. Each time a value was removed, the test was repeated. Then, the median of the ratios was computed to display the protein ratio. To give an indication of the significance of relative fold change, one parameter student-test was employed for a significance level of If the assumption of normality [Shapiro-Wilk test, n>3] was not respected a non-parametric Wilcoxon test was performed instead. Finally proteins were reported as differentially expressed if (1) their three replicate ratios were significantly different from one and (2).For the final selection, only proteins, which were designated as significant in all three replicates and had a coefficient of variation fewer than 40% were taken into account. data. The R script, ali_analysis.r, used for the ratio calculations can be found in supplementary 47

49 Table 13. Quantification parameters Reporter extraction moz tolerance 0.1 Da Impurities Correction No Quantification Workflow Mascat Filters * Intensities < 50 * z-score < 6 Normalisation Outliers Ratio calculation * p-value >10^-7 Sum of intensities Grubbs Median Results In the first practical replicate (PR1), 2'803 unique peptides corresponding to 565 proteins were identified from the bacterial membrane fraction. In the second experiment (PR2), peptides corresponding to 511 proteins were identified and finally in the last replicate (PR3) unique peptides corresponding to 495 proteins were identified. A total of 728 proteins (3 248 peptides) were identified from combining all replicates (Figure 26). Almost all peptides (95%) produced intense signals from the reporter fragment ions at 114.1, and Da. However, relative quantification was performed only on 347 proteins commonly identified in all three replicates. Pair-wise comparisons of the strains in the three practical replicates are shown in Figure 27A as 3D scatter plots and in Figure 27B as 2D distribution plots. Figure 25. Proteins identified in three practical replicates (PR). In total 728 proteins corresponding to 3'248 peptides were identified. 48

50 Figure 26. A. 3D scatter plots of pair-wise comparisons of S.aureus strains in replicates with different susceptibilities towards Daptomycin. B. 2D distribution plots of the logarithmic expression values in three practical replicates (PR). Only proteins with similar expressions in three replicates were considered significant. In comparing the Daptomycin susceptible strain 616 to transitional strain 629, only 68 proteins were considered significantly differentially expressed in all three replicates. None of the proteins were out of the ratio difference window of -0.5 to 0.5 (log10 of protein expression ratio). However, 46 proteins showed a down regulation of less than -0.3 fold change with a coefficient of variation (CV) of less than 21%. Comparison of the transitional strain 629 to non-susceptible strain 701 resulted in identification of 14 proteins but none of them had a ratio change in the window -0.3 to 0.3. Comparison of the two strains of 616 of 701 resulted in a list of 60 proteins with 31 proteins having expression values lower than -0.3 and CVs of lower than 37% (cf supplementary data 3). Similar results were obtained with the transcriptomic approach. 49

51 Discussion A number of interesting observations emanated from our study. From a biological point of view, we can notice that most of the identified under-expressed proteins between the susceptible and transitional strains were similar to those observed with a down-regulation from the susceptible to the non-susceptible strains. The majority of these proteins were involved in metabolism or were ribosomal or transcriptional factors. Two proteins (Q99UV7, Q7A869) involved in the ion transportation were also identified and showed a reduction in expression. L.Cui & all observed a correlation between membranes thickness and Daptomycin susceptibility. The identification and quantification of proteins that contribute to the synthesis or the metabolism of the cell membrane like Q99V41, Q7A7A5, Q7A5K8, Q7A619, Q7A5D5, Q7A6K0 and P64003 tend to confirm this observation. Moreover, this study highlights several critical points in a quantification analysis. First, an automated, computer-based analysis for quantization of large amount of proteins is more reliable than manual. Indeed, another analysis that was manually carried out was not in agreement with transcriptomic results. Owing to the filtering and outlier removal steps, the correction of experimental variation is provided during the normalization step. Secondly, Mascat provides an indication of change that directly gives an indication of the protein expression (cf Supplementary Data 4). Third, this collaboration brought to our attention complex experimental designs involving inter-run replicates (experimental replicates) and lead to their implementation and support in the quantification workflows. Combined proteomic and transcriptomic analyses allowed for obtaining a global view of complex processes involving differentially regulated factors contributing to antibiotic resistance. This combined information is essential for the global integration of the data. Several potential proteins implicated in Daptomycin resistance were identified. However, their implication has to be confirmed by targeted investigation with conventional molecular biology techniques. In view of the results found in the literature and additional information obtained in this study, we showed that our data appeared particularly relevant and that multiple mechanisms are mobilized by Staphylococcus aureus to produce resistance towards antibiotics. 50

52 5.2. Loic Dayon Collaboration CSF Analysis by TMT 6-plex Introduction A second collaboration has been carried out on a TMT 6-plex dataset, generously provided by Loic Dayon. This study consisted of the quantitative analysis of proteins obtained by shotgun proteomic approach on Cerebrospinal fluid samples using TMT. The CSF is a clear fluid found in the brain chambers (ventricles), spinal canal, and spinal cord. It is secreted by the choroids plexus, a vascular part in the ventricles of the brain. It acts as a shock absorber to protect the brain against injury. It contains electrolytes, glucose, and low proteins concentration. Chemical labelling seems to Aggarwal et al show that isobaric tags are highly sensitive and can identify low abundant proteins. In this study, relative quantification on ante-mortem and post-mortem CSF was done [7]. Due to the disruption of the Blood-Brain Barrier (BBB) a few hours after death, post-mortem CSF and ante-mortem CSF are very different in composition. In a first step, I performed a quantification analysis using Librus (a quantification workflow based on the TPP-Libra) and compared the results with the published ratio values [7]. However, I found sometimes-dissimilar protein ratios. In order to address this issue, an investigation was carried out on two possible sources of bias in the protein ratio: missing values of intensities and impurities correction. Generally, missing values are due to problems in the detection of weak signal from low abundance peptides. Instruments sometimes fail to detect the signal and even if the detection is successful, the peaks intensity may be too low to be distinguished from the background noise [38]. When looking at the Phenyx job, we can notice that a lot of peptides contain missing events for one specific reporter. Thus, I tried to assess the influence of these missing events in the quantification results by running our three quantification approaches on the data. Results showed that an approach based on linear regression is a relatively good alternative to overcome this issue. A last study has been performed on the impurity correction. In fact, TMT or itraq reagents differ in the isotopic compositions of nitrogen, carbon and oxygen but have identical masses. Due to isotopic contamination in tags, peak overlapping occurs, i.e. the peak area for each reporter ion has some contribution from other reporter ions. To correct this bias, the manufacturer ABI (itraq)) provides a datasheet which indicates the percentages of each reporter ion reagent that differs by -2, -1, +1, +2 Da from the quoted mass. I-Tracker, proquant and most of the quantification software implement this correction. Purity correction may however lead to biased estimation of the protein fold change, especially when low abundance peptides are detected. 51

53 Materials CSF Collection. Post-Mortem (PM) CSF samples from different patients (n = 4) were collected by ventricular puncture at autopsy, 6 h after death on average. Control Ante-Mortem (AM) CSF samples were collected by routine diagnostic lumbar puncture from living healthy patients (n = 4). Clinical data of deceased and living patients have been previously reported. Each patient or patient s relatives gave informed consent prior to enrolment. The local institutional ethical committee board approved the clinical protocol. Experimental Workflow. After immunoaffinity depletion, triplicates of AM and PM CSF pooled samples were reduced, alkylated, digested by trypsin, and labelled, respectively, with the six isobaric variants of the TMT (with reporter ions from m/z ) to Th). The samples were pooled and fractionated by SCX chromatography. After RP-LC separation, peptides were identified and quantified by MS/MS analysis with MALDI TOF/TOF and ESI-Q- TOF. Spiking of Protein. β-lactoglobulin (LACB) from bovine milk (90%) was purchased from Sigma (St. Louis, MO). (cf Dayon & al. [7] For more details about the Experimental Procedure). 52

54 Data analysis Protein Quantification Reporter peaks are extracted using the LabelMS/MSExtractor. They are collected with a m/z tolerance of No isotopic correction is applied due to the bias generated on the low peaks intensities. Protein quantification was performed using the Librus method. (Appendix 6). Peptides are filtered according to the signal intensity (< 50), sum of intensity (< 300), and the quality of the labelling. The protein is removed if a protein does not contain at least two peptides. A parameters summary can be found in table 14. The R script CSF6plex_analysis.R can be found in supplementary data 6. Table 14. Quantification parameters Reporter extraction moz tolerance 0.15 Da Impurities correction Quantification Workflow Librus No Filters * intensities < 50 * quality of the labelling * sum of reporter per peptides <500 * min number of peptides : 2 Normalisation Sum of peptides reporter intensities Outliers Ratio calculation µ+/-2*σ Σ PM CSF reporter mean / Σ AM CSF reporter mean Error Estimation To obtain a more reliable comparison with Loic's published result, no filter on intensity is applied; the only filter applied is the one that assesses the quality of the labelling. Proteins are removed if a protein does not contain at least 2 peptides. The Librus quantification method has been used and the ratio of the mean label 131 m/z on 126 m/z reported. A confidence interval on the estimated ratio was then obtained. An average value for the noise was first estimated. According to Loic s opinion and a brief glance at the peak list, the background noise was estimated to 50 counts. This value was then used to determine the maximum and minimum values of each reporter intensity (Table 15). Finally, the following crossing pairs of ratio were used to compute the interval [Label6MAX/Label1MIN, Label6MIN/Label1MAX] (Figure 27). 53

55 Figure 27. Error Model, Total error was obtained by crossing the pair s max-min values of the two labels. Table 15. Quantification parameters Reporter extraction moz tolerance 0.15 Da Impurities Correction No Quantification Workflow Librus Filters Normalisation Outliers * quality of the labelling Sum of peptides reporter intensities µ+/-2*σ Ratio calculation Reporter 131 mean / Reporter 126 mean 54

56 Missing Events or Missing Values of Intensities (MVI) No filter on intensity is applied for Librus as well as for Mascat or QI. Nevertheless, a threshold on peptide number is set to two. A normalisation on the sum of intensities is performed in Mascat to avoid systematic bias due to experimental variation. Then, a removal step was applied on data to remove outliers (Mascat: Grubbs detection test, QI: DFFITS approaches, which specifically removed peptides outside the range [mean (DFFTIS) +/- 2.5 sigma]). Protein ratios were calculated by the median in Mascat and were given by the slope of the regression line in QI (Table 16). The R script CSF6plex_MVI_analysis.R can be found in supplementary data 6. Table 16. Quantification parameters Reporter extraction moz tolerance 0.15 Impurities Correction No Quantification Workflow Librus Mascat QI Filters Normalisation Quality of the labelling Sum of peptides reporter intensities Sum of intensities No Outliers µ+/-2*σ Grubbs detection DFFITS Ratio calculation Reporter 131 mean / Reporter 126 mean Median of peptide ratios 131/126 Slope of regression line Results and Discussion Protein Quantification Before the analysis of biological data, it is important to know the goal of the experiment and where the data comes from. This study is designed for biomarkers discovery. CSF from living healthy patients and CSF from dead patients are collected from four patients and pooled on two samples AM and PM, respectively. It is important to notice that PM CSF is collected 6 hours after death; consequently a lot of plasma specific proteins (albumin, immunoglobulin...) and cytoplasmic proteins are unloaded in the CSF because of the disruption of the blood-brain barrier. A pooled sample of AM CSF and a pooled sample of PM CSF were spiked with the same amount of LACB and each divided into three samples. The six resulting samples were depleted of albumin, transferrin, IgG, IgA, antitrypsin, and haptoglobin. They were reduced, alkylated, digested by trypsin, and labelled with TMT. The three AM CSF samples were, respectively, labelled with TMT with reporter at m/z) 126.1, 128.1,and Phenyx software identified 1246 peptides corresponding to 220 proteins. After the filtering step describes in the Data Analysis Section, 89 proteins identified by a total of 722 peptides are quantified. Histogram and normal quantile-quantile plot of the shotgun proteomics data generally give a mean to assess the quality of the experiment. It is commonly admitted that biological data 55

57 follows mostly a log-normal distribution. In the case of this study, the observation of the discontinuity in the negative region of the density distribution curves and by the dump on the points distribution on the normal quantile-quantile plot can be explained by the fact that, despite the filtering step, there remains a large amount of large values (Figure 28B). However, as displayed by the red dotted line, the median of the peptide quantification is close to 0 (Figure 28 29B). Figure 28. A. Density distribution of log transformed peptide ratios, the density curves are shown in red and the dotted red line is the median. B. quantile-quantile plot of peptide ratios AM/PM. The y-axis denotes the sorted logtransformed quantile and the x-axis the normal quantile. Taking into account the original authors assumptions (filters, normalization...) and the experimental design (replicate), the data has been analysed with a modified version of the Librus Package that processes in a way close to TPP-Libra methods. Protein ratio and statistic about proteins are displayed by the Librus default Export (cf supplementary data 8 and 9). The resulting list of protein (cf supplementary data 10) is compared with the published ones. As displayed in table 17, differences seem huge for several proteins. Indeed, the filter on reporter peaks intensity removed a lot of peptides. Because a lot of low abundant proteins are present, the probability to encounter MVIs is large. In the original authors' analysis, MVIs are replaced by 0 and incorporated in the downstream calculation. In my analysis, the MVIs are replaced by 0 and removed because a threshold on intensity was fixed to 50. Such threshold leads to a considerable loss of quantification materials and explains most of the ratio differences. The ratio differences can also be explained by the impurities correction. Though the influence of the systematic bias incorporated by the impurities correction is low when working with data sets containing few weak signals, the bias can be enhanced in this data set where weak reporter peaks signals are abundant. In the next study we try to give an alternative to this bias. Table 17. Comparison of ratios for six proteins 56

58 AC Librus Publi A1A A2A A4D A6H8M A6ND A6NDP A6NI A6NMS O14656_CHAIN_ Error Estimation The correction of impurities can incorporate a bias in the downstream step of quantification, especially when calculating a ratio from low peaks intensity (Figure 29). After correction, values can become negative, and impact the ratio. The computation of a ratio using a negative value is impossible. The filter on the reporter intensity tends to limits this problem. However, Q-tools and researchers generally replace these values by 0 or by an estimation of the background noise. In this study we compute an error indication and a ratio interval instead of a biased estimation of the protein fold change. Figure 29. Impurites correction model. The green vertical bars indicate the true peptide intensity for four samples labelled with itraq reagents. Blue bars are the intensities after correction. Low intensity is close to the red line background noise. The reporter 116 peaks will give a negative value. It was important to notice that the majority of the proteins display a differential ratio, as shown in supplementary Data 5, and by the bars in the Figure 30. Furthermore, the more the protein ratio is extreme, the larger is the error on the ratio, as mentioned in the part 4.4. The percentage of errors tends to increase linearly for highly differential proteins. Few of them displays a ratio higher than 10, but the error on this estimation is higher than 15%. 57

59 < Figure 30. Bar plot of the number of proteins. the black line shows the evolution of the error in function of category of ratio. The bars give the number of proteins in function of the categories of ratio. A comparison of Librus results and the published one are summarized in Table 18 and viewed in Figure 31. Three proteins present ratios that do not fall into the interval [Label6MAX/Label1MIN, Label6MIN/Label1MAX]. However, the interval seems to correctly frame the majority of ratios. The impurities of isotope represent a source of systematic error that can efficiently be corrected. The impact of the systematic error on the computed ratios is high when the percentage of low intensity peptides is large in the dataset. This correction can however lead to null and/or negative intensities in a low intensity dataset and therefore may lead to wrong ratios. We therefore replaced the isotopic correction by a ratio confidence interval. 58