OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

Thang V. Pham and Connie R. Jimenez

OncoProteomics Laboratory, Cancer Center Amsterdam, VU University Medical Center
De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands
{t.pham,c.jimenez}@vumc.nl

Abstract. We present a software package for the analysis of MALDI-TOF mass spectrometry data. The software is designed to facilitate a complete exploratory workflow: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification. The software supports various external tools for these tasks. We also pay special attention to the iterative nature of a typical analysis. Finally, we present two proteomics studies where the software has been used for data analysis.

Keywords: data analysis, differential analysis, biomarker discovery, MALDI-TOF, mass spectrometry, OplAnalyzer, proteomics.

1 Introduction

Mass spectrometry is an attractive method in proteomics research because of its ability to identify and quantify a large number of proteins in complex biological samples [1]. However, the pre-processing and analysis of mass spectrometry data are fast becoming a bottleneck in the discovery process. This paper describes OplAnalyzer, a software platform developed in our laboratory that supports the pre-processing and analysis of proteomics mass spectrometry data. Specifically, we deal with MALDI-TOF mass spectrometry, a standard high-throughput platform that can potentially be used for various diagnostic purposes.

A typical analysis involves a number of tasks: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification [2]. Instead of integrating all these components into a single tool for a complete analysis, we have developed a flexible platform that accommodates various existing tools for the different tasks. Our design also supports the interactive nature of the analysis process.

Currently, the software supports the analysis of MALDI-TOF MS-1 data only. Tools for the analysis of MS/MS data with protein identification, as well as data from another mass spectrometry platform, namely LC-FTMS, are under active development.

P. Perner and O. Salvetti (Eds.): MDA 2008, LNAI 5108. © Springer-Verlag Berlin Heidelberg 2008

The analysis workflow and the system are described in Section 2. In Section 3 we present two proteomics studies where the software has been employed for data analysis.

2 The System

Fig. 1 shows a typical workflow in proteomics mass spectrometry data analysis. The four main steps are data pre-processing, sample grouping, exploratory analysis, and batch processing.

Fig. 1. An analysis workflow: (a) data pre-processing; (b) sample grouping; (c) exploratory analysis (differential analysis, classification, visualization); (d) batch processing.

2.1 Data Pre-processing

The data pre-processing step includes the preparation of metadata and the processing of raw mass spectrometry signals, which consists of peak detection, alignment, normalization, and deisotoping. To facilitate the use of existing tools, we define a common data format between this step and the subsequent steps, based simply on tab-separated text.

For our instrument, a 4800 MALDI-TOF/TOF mass spectrometer (Applied Biosystems, Foster City, USA), we found that the MarkerView software (Applied Biosystems) works well for data produced in the reflectron mode. For data produced in the linear mode, we have implemented a new method. To detect peaks in an individual spectrum, we search for locations of maximal value within a local m/z window. The size of the window is 11 discrete sampling points. This method is similar to the peak detection method employed in [4].
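The local-maximum search described above can be illustrated with a minimal Python sketch. It is not the OplAnalyzer implementation: the function name `detect_peaks`, the centred window, the truncation of the window at the spectrum edges, and the rejection of tied maxima are all assumptions made for illustration, and no noise threshold is applied.

```python
def detect_peaks(intensities, window=11):
    """Return indices whose value is the unique maximum of the window centred on them.

    Assumptions (not from the paper): the window is centred on the candidate
    point, it is truncated at the spectrum edges, and flat regions (ties)
    are not reported as peaks.
    """
    values = list(intensities)
    half = window // 2  # 5 sampling points on each side for window=11
    n = len(values)
    peaks = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = values[lo:hi]
        centre = values[i]
        # Keep i only if it is the strict, unique maximum of its window.
        if centre == max(segment) and segment.count(centre) == 1:
            peaks.append(i)
    return peaks


# Hypothetical toy spectrum with two well-separated peaks.
spectrum = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 5, 2, 0]
peak_indices = detect_peaks(spectrum)
```

A real spectrum would normally be smoothed or thresholded before this step, since every small local bump that dominates its window is otherwise reported.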

To find peaks that are common to all spectra, we apply peak detection to the mean spectrum, analogously to [5]. Subsequently, peaks in an individual spectrum are aligned to this set of common peaks as follows. For each common peak, its value in an individual spectrum is that of the closest detected peak in that spectrum, provided the distance between the common peak and the closest peak (along the m/z axis) is less than 5 Da. (A better choice is likely to be based on the actual mass accuracy of the measurement and on the m/z value.) If there is no such peak, the value is simply set to the value of the spectrum at the m/z location of the common peak. Fig. 2 illustrates the procedure. By visual inspection, we found that the quality of our alignment method is comparable to that of the more computationally expensive clustering method in [4] (data not shown).

Fig. 2. Peak alignment. For each common peak p_M in the mean spectrum, the closest peak p_I in each individual spectrum is located. If the distance d between the two peaks is less than 5 Da, the value at point A is registered for the common peak p_M in this particular spectrum. Otherwise, the value at B is registered.

2.2 Sample Grouping

Typically, researchers are interested in several comparisons in each experiment, for example, comparisons based on gender, age, and clinical outcomes. Also, in an interactive analysis the user might want to modify the sample groups, for instance to include or exclude certain samples. To enable efficient sample grouping, we define a text-based sample selection based on metadata. The strategy is easy to use and particularly suited to batch processing. For example, to specify two groups, Healthy, consisting of samples from healthy individuals, and Cancer, consisting of samples from cancer patients before treatment, the selection is as follows.
Healthy:Cancer-type=Healthy;Cancer:Cancer-type=NSCLC,Time=PreTx
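As an illustration of how such a selection string could be interpreted, the following Python sketch parses it into named groups and applies it to sample metadata. The grammar is inferred from the single example above and is an assumption: groups are separated by `;`, the group name precedes the first `:`, conditions are separated by `,`, and all conditions of a group must hold. The function names and the metadata records are hypothetical.

```python
def parse_selection(selection):
    """Parse 'Name:key=value,key=value;Name2:...' into {name: {key: value}}."""
    groups = {}
    for group_spec in selection.split(";"):
        name, _, conditions = group_spec.partition(":")
        criteria = {}
        for condition in conditions.split(","):
            key, _, value = condition.partition("=")
            criteria[key.strip()] = value.strip()
        groups[name.strip()] = criteria
    return groups


def select_samples(metadata, criteria):
    """Return the IDs of samples whose metadata match every criterion (logical AND)."""
    return [
        sample_id
        for sample_id, fields in metadata.items()
        if all(fields.get(key) == value for key, value in criteria.items())
    ]


groups = parse_selection(
    "Healthy:Cancer-type=Healthy;Cancer:Cancer-type=NSCLC,Time=PreTx"
)

# Hypothetical metadata table: sample ID -> metadata fields.
metadata = {
    "s1": {"Cancer-type": "Healthy", "Time": "NA"},
    "s2": {"Cancer-type": "NSCLC", "Time": "PreTx"},
    "s3": {"Cancer-type": "NSCLC", "Time": "EndTx"},
}
healthy_ids = select_samples(metadata, groups["Healthy"])
cancer_ids = select_samples(metadata, groups["Cancer"])
```

Because the selection is a plain string, it can be stored alongside a batch script, which fits the batch-processing use mentioned above.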

2.3 Exploratory Analysis

For data analysis we exploit existing tools in Matlab (The MathWorks, Inc.). A typical first step is unsupervised analysis with principal component analysis (PCA) using all peptide intensities. Here all data points are projected onto a two- or three-dimensional space for visualization. The projection does not use any group label information. The purpose is two-fold. First, one can observe whether the data are clustered in a low-dimensional space according to group labels. Second, one can detect possible outliers or unusual patterns in the data by visual inspection.

For differential analysis, we provide interfaces for the t-test, the Mann-Whitney U test, and the Kruskal-Wallis test. The p-values can be adjusted for multiple testing. The peptides are further subjected to intensity filtering, requiring that the median intensity of at least one group be greater than 80 units and that the fold change of the median intensities of the two groups be greater than 1.5. (The numbers can be tuned for each study.) Fig. 3 depicts a screenshot of the result of a comparative study.

Fig. 3. A screenshot of the output of the statistical testing module

The candidate peaks are examined visually by spectra overlay. Again, we use the visualization capability of Matlab for this purpose.

Finally, we provide classification model selection with the support vector machine [3]. A grid search is used to find the optimal parameter values. For each point in the grid, the generalization error is estimated either by leave-one-out cross validation or by repeatedly splitting the data at random into two partitions, one for training and one for testing. The grid point with the lowest estimated generalization error is selected as our model for classification.
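The intensity filter applied after differential testing can be sketched in Python as follows. This is an illustrative reading of the rule, not the toolbox code: the function name is hypothetical, and the fold change is assumed to be symmetric (larger median divided by smaller median), which the text does not state explicitly.

```python
from statistics import median


def passes_intensity_filter(group_a, group_b, min_median=80.0, min_fold=1.5):
    """Check the two filtering conditions for one peptide peak.

    group_a, group_b: intensities of the peak in the two sample groups.
    Assumption: fold change = larger median / smaller median.
    """
    med_a, med_b = median(group_a), median(group_b)
    # Condition 1: at least one group median must exceed the intensity threshold.
    if max(med_a, med_b) <= min_median:
        return False
    low, high = sorted((med_a, med_b))
    # A non-positive smaller median makes the ratio unbounded; since the larger
    # median already exceeds the (positive) threshold, treat it as passing.
    if low <= 0:
        return True
    # Condition 2: the fold change of the medians must exceed the threshold.
    return high / low > min_fold


# Hypothetical intensities for three peaks.
strong_change = passes_intensity_filter([100, 120, 110], [50, 60, 55])
small_change = passes_intensity_filter([100, 120, 110], [90, 95, 100])
low_intensity = passes_intensity_filter([10, 20, 30], [60, 70, 65])
```

Both thresholds are exposed as parameters because, as noted above, the numbers can be tuned for each study.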

2.4 Batch Processing

We consider batch processing an important step in data analysis, especially with regard to the reproducibility of figures and other results. In addition, batch processing helps produce a large number of figures of peptide peaks in a convenient format for visual examination. Again, we make use of the scripting capability of Matlab for this purpose.

3 Examples

In the following, we describe two studies where the current software has been employed for data analysis.

3.1 Time-Course MALDI-TOF-MS Serum Peptide Profiling of Non-small Cell Lung Cancer Patients Treated with Bortezomib, Cisplatin and Gemcitabine

This study performs serum peptide profiling of non-small cell lung cancer (NSCLC) patients treated with gemcitabine, cisplatin and bortezomib combinations before, during, and at the end of treatment to discover peptide patterns associated with treatment-related effects and clinical outcomes [7]. Fig. 4 shows a three-dimensional PCA plot of the serum peptide spectra of 13 healthy individuals and the pre-treatment serum spectra of 27 NSCLC patients.

Fig. 4. Principal component analysis (PCA) of the healthy versus NSCLC comparison

Here, the MarkerView software was used for pre-processing, resulting in 682 peptide peaks per raw spectrum. The Mann-Whitney U test was carried out on each of the 682 peptides, resulting in 47 differential peptides. Fig. 5(a) shows the spectra overlay of the eight most differential peaks in the healthy versus NSCLC comparison. Fig. 5(b) shows a heatmap of the 47 differential peaks.

Fig. 5. (a) Spectra overlay of the eight most differential peaks in the healthy (red) versus NSCLC (blue) comparison according to p-values of the Mann-Whitney U test. All peaks have a p-value less than [...]. (b) Heatmap of the 47 differential peaks in the healthy versus NSCLC comparison, shown in the natural log scale. The peaks are ordered by median fold change between the two groups.

We carried out classification analysis using a support vector machine. A grid search for parameters was employed to find the best model according to leave-one-out cross validation (LOOCV). Using all 682 peptides, a LOOCV accuracy of 93% was achieved. When the 47 peptides selected by the Mann-Whitney U test were used, the LOOCV accuracy was 98% with 100% sensitivity and 96% specificity.

The software has also been used for a large number of other comparisons, such as gender, age, short versus long progression-free survival, and clinical treatment responses.
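The leave-one-out estimate used for model selection above can be sketched in Python. To keep the example self-contained, a nearest-centroid classifier stands in for the support vector machine; this substitution, the function names, and the toy data are assumptions for illustration and do not reproduce the study's classifier or results. In a grid search, `loocv_accuracy` would be evaluated once per parameter setting and the setting with the highest accuracy kept.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (not the SVM of the paper): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centre):
        return sum((a - b) ** 2 for a, b in zip(x, centre))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loocv_accuracy(X, y):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        if nearest_centroid_predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical two-feature toy data with well-separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
accuracy = loocv_accuracy(X, y)
```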

3.2 Breast Cancer Study with MALDI-TOF Mass Spectrometry Data of Serum Samples

This study is part of the international competition on mass spectrometry proteomic diagnosis [8][9]. The dataset consists of 153 mass spectra of blood samples drawn from control individuals and patients with breast cancer. The aim is to construct a classification rule separating the two groups with a low generalization error.

For this dataset, the baseline correction had been performed by the competition organizer. We used the software to perform further pre-processing: peak detection and alignment. Fig. 6 shows an example of the result of the pre-processing algorithm.

Fig. 6. Mean spectrum and detected peaks in the [...] Da range (axes: m/z versus transformed intensity)

Again, a Mann-Whitney U test was performed to select features that discriminate the two classes significantly. Furthermore, the Benjamini-Hochberg false discovery rate correction [6] was employed to correct for multiple testing. This resulted in, on average, 117 peaks with a false discovery rate of less than 1%. Fig. 7 shows the distribution of the values of the 16 most discriminative peaks.

We employed a grid search with exponential spacing to find the optimal parameter values for support vector machine model selection. The generalization error was estimated by averaging over 200 runs of randomly splitting the given data into two partitions, where the size of the test set is roughly a tenth of the size of the whole dataset. The feature selection was performed within each random split, so that fair estimates of classification accuracy were obtained. The final accuracy on a separate validation set of 78 samples is 83%.
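The Benjamini-Hochberg correction [6] used for feature selection in this study can be written in a few lines of Python. The sketch below computes adjusted p-values with the standard step-up procedure; the function name and the example p-values are hypothetical, and the code is not taken from the toolbox.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    For the i-th smallest p-value (rank i of m), the raw adjustment is
    p * m / i; a running minimum from the largest rank downwards enforces
    monotonicity of the adjusted values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for four peaks.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)

# Peaks kept at a chosen false discovery rate threshold (here 3%, for illustration).
selected = [i for i, q in enumerate(q_values) if q < 0.03]
```

When feature selection of this kind is combined with cross validation, it has to be repeated inside every training partition, as the study does, so that the test samples never influence which peaks are chosen.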

Fig. 7. Top 16 differential peaks (one panel per peak, labeled by m/z)

4 Summary

The paper has introduced a software toolbox for the pre-processing and statistical analysis of MALDI-TOF mass spectrometry data. Our current development focuses on support for the analysis of MS/MS data with protein identification and of data from another mass spectrometry platform, namely LC-FTMS.

References

1. Jimenez, C.R., Piersma, S., Pham, T.V.: High-throughput and targeted in-depth mass spectrometry-based approaches for biofluid profiling and biomarker discovery. Biomarkers in Medicine 1(4) (2007)
2. Villanueva, J., Martorella, A.J., Lawlor, K., Philip, J., Fleisher, M., Robbins, R.J., Tempst, P.: Serum peptidome patterns that distinguish metastatic thyroid carcinoma from cancer-free controls are unbiased by gender and age. Mol. Cell Proteomics 5 (2006)
3. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1999)
4. Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G., Koong, A., Le, Q.-T.: Sample classification from protein mass spectroscopy, by peak probability contrasts. Bioinformatics 20(17) (2004)
5. Karpievitch, Y.V., Hill, E.G., Smolka, A.J., Morris, J.S., Coombes, K.R., Baggerly, K.A., Almeida, J.S.: PrepMS: TOF MS data graphical preprocessing tool. Bioinformatics 23(2) (2007)
6. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57 (1995)
7. Voortman, J., Pham, T.V., Knol, J.C., Giaccone, G., Jimenez, C.R.: Time-course MALDI-TOF-MS serum peptide profiling of non-small cell lung cancer patients treated with bortezomib, cisplatin and gemcitabine. In: Proceedings of American Society of Clinical Oncology (ASCO) 2008 Annual Meeting, Chicago, USA (2008)
8. Mertens, B.: International competition on mass spectrometry proteomic diagnosis. Statistical Applications in Genetics and Molecular Biology 7(2), Article 1 (2008)
9. Pham, T.V., van de Wiel, M.A., Jimenez, C.R.: Support vector machine approach to separate control and breast cancer serum samples. Statistical Applications in Genetics and Molecular Biology 7(2), Article 11 (January 2008)