Briefings in Bioinformatics Advance Access published June 9, 2010


doi: /bib/bbq019

Protein mass spectra data analysis for clinical biomarker discovery: a global review

Pascal Roy, Caroline Truntzer, Delphine Maucort-Boulch, Thomas Jouve and Nicolas Molinari

Submitted: 12th February 2010; Received (in revised form): 9th May 2010

Abstract

The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. In recent years there has been a growing interest in using high throughput technologies for the detection of such biomarkers. In particular, mass spectrometry appears to be an exciting tool with great potential. However, to extract any benefit from the massive potential of clinical proteomic studies, appropriate methods, improvement and validation are required. To better understand the key statistical points involved in such studies, this review presents the main steps of protein mass spectra data analysis, from the pre-processing of the data to the identification and validation of biomarkers.

Keywords: clinical proteomics; statistics; pre-processing; biomarker identification; validation

INTRODUCTION

The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical research. A biomarker is defined as a biological constituent whose value differs between groups. Depending on the study design, these groups can be either diagnostic (diseased or healthy subjects) or prognostic (relapse or event free cases) groups. Currently, clinical research is making use of new high throughput technologies, like transcriptomics or proteomics. Proteomics is a rapidly developing field among other omics disciplines, focusing on large biological datasets. While transcriptomics focuses on RNA level data, proteomics is concerned with the next biological level in the central dogma of molecular biology, namely proteins.
Proteins are the actual effectors of biological functions, and the measure of their expression level is directly connected with their activity. Proteins are less influenced by down- and up-regulation than RNA and might offer a better proxy for biological activity. Furthermore, proteins are exported out of the cell and can be detected in various biological fluids that are easily sampled, like blood. Nevertheless, it should also be pointed out that protein expression levels and activities are not exactly correlated.

Mass spectrometry (MS) is a technology recently used for the separation and large-scale detection of proteins present in a complex biological mixture. This technology is increasingly being used in proteomic clinical research for the identification of new biomarkers and offers interesting insight into biological samples containing large numbers of proteins, such as plasma or tumour extracts. Surface enhanced laser desorption/ionization time-of-flight (SELDI-TOF) initially, and matrix assisted laser desorption/ionization time-of-flight (MALDI-TOF) thereafter, have been the most commonly used MS instruments to achieve these clinical objectives. MS analysis relies on identifying proteins by their time-of-flight (TOF), which is related to their mass (m) to charge (z) ratio, usually written m/z. The MS output is a mass spectrum, associating TOF with signal intensities. These intensities are related to the concentrations of proteins in the processed sample. The MS signal resulting from SELDI-TOF or MALDI-TOF measurements is contaminated by different sources of technical variation that can be removed by a prior pre-processing step. Whether this method is quantitative is still questioned by some authors [1]. This aspect deserves further investigation before judging the potential of MS for quantitative analysis [2, 3].

MS nevertheless appears to be an exciting tool with great potential [4]. Despite numerous studies stating that MS methods are a powerful approach for detecting diseases in medicine and biology, they still require appropriate methods, improvement and validation [5–7]. This review presents the data analysis steps of MS experiments in the context of high throughput identification studies. Section 2 presents the specificity of high throughput proteomics and the pre-processing steps. Once the signal has been cleaned in this way, Section 3 deals with biomarker identification and validation. A discussion is proposed in Section 4.

Corresponding author. Nicolas Molinari, Laboratoire de Biostatistique, IURC, 641 avenue Gaston Giraud, Montpellier, France; BESPIM, CHU de Nîmes, France. E-mail: nicolas.molinari@inserm.fr

Pascal Roy, MD, PhD, is Professor of Biostatistics, Hospices civils de Lyon, Service de Biostatistique, Lyon, F-69003, France; Université de Lyon, Lyon; CNRS, UMR 5558, Pierre-Bénite, F-69310, France, interested in diagnostic and prognostic modelling in clinical research.

Caroline Truntzer works as Research Engineer in a Clinical Proteomic Platform, CLiPP, CHU Dijon, where she manages the statistical team. She holds a PhD in Biostatistics and is interested in specific issues related to the statistical analysis of high-dimensional data.

Delphine Maucort-Boulch is Associate Professor in the Equipe Biostatistique Santé. She is MD, PhD in Biostatistics, and mainly interested in modelling and the predictive properties of models.

Thomas Jouve received a PhD in Biostatistics from the University of Lyon, France, and works on power issues in biomarker detection with high-throughput technologies.

Nicolas Molinari is Associate Professor of Biostatistics, EA 2415 Université Montpellier 1, University Hospital of Nîmes. He is interested in specific issues related to statistical analysis in clinical research.

© The Author. Published by Oxford University Press. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

PRE-PROCESSING STEPS

Data acquisition leads to various sources of experimental noise.
As a consequence, the observed signal (I) for sample i can be decomposed into several components, described in the following equation: the biological signal of interest (S) weighted by a normalization factor (k), a baseline (B) corresponding to systematic artifacts usually attributed to clusters of ionized matrix molecules hitting the detector or to detector overload, and random noise (N) from the electronic measuring system or of chemical origin:

$$I_i(t) = k \cdot S_i(t) + B_i(t) + N_i(t),$$

where t refers to TOF values (related to m/z). The aim of the pre-processing steps is to isolate the true signal S. The order in which these steps should be performed remains an open question, and different positions have been taken by different authors. The order of pretreatment steps and the interactions between methods were investigated by Arneberg et al. [8] through an original modelling approach. The interested reader is referred to this article for more details. A summary of the entire analytical workflow is presented in Figure 1.

Currently, there is no consensus as to which algorithms to use during the pre-processing steps; this is still an active field of research and a wide variety of algorithms have been proposed. Morris et al. provided early guidelines for the pre-processing of MS data that are now well established [9]. More recently, Cruz et al. [10] and Yang et al. [11] have proposed an extensive comparison of some popular and current pre-processing methods. Most of these methods are available as packages, for example in the Bioconductor repository, for the open-source R software. A succinct description of the main and most recent methods for each of the pre-processing steps is provided below.

Figure 1: Workflow for MS data analysis: from raw spectra to identification of biomarkers and classification.
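To make the additive decomposition above concrete, the following minimal sketch simulates one spectrum as I(t) = k·S(t) + B(t) + N(t). The Gaussian peak shape, the exponential baseline, the noise level and every numerical value are assumptions chosen for illustration only; they are not taken from the reviewed methods.

```python
import numpy as np

def simulate_spectrum(t, peak_centers, peak_heights, k=1.0, noise_sd=0.05, seed=0):
    """Return (observed, signal, baseline, noise) with observed = k*S + B + N."""
    rng = np.random.default_rng(seed)
    width = 5.0  # assumed constant peak width, in time steps
    # S(t): biological signal, modelled here as a sum of Gaussian peaks
    signal = np.zeros_like(t, dtype=float)
    for center, height in zip(peak_centers, peak_heights):
        signal += height * np.exp(-0.5 * ((t - center) / width) ** 2)
    # B(t): smooth, slowly decaying baseline (assumed exponential form)
    baseline = 2.0 * np.exp(-t / 300.0)
    # N(t): additive random noise (assumed Gaussian)
    noise = rng.normal(0.0, noise_sd, size=t.shape)
    observed = k * signal + baseline + noise
    return observed, signal, baseline, noise

t = np.arange(1000, dtype=float)
observed, signal, baseline, noise = simulate_spectrum(
    t, peak_centers=[200, 500, 800], peak_heights=[3.0, 1.5, 2.0], k=0.8
)
```

Pre-processing can then be read as an attempt to recover `signal` from `observed` alone, which is why simulated spectra of this kind are convenient for checking a pipeline.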

Noise filtering

The most commonly adopted method is the use of the wavelet transform [12, 13]. For this purpose, spectra are first decomposed into wavelet bases, usually using an undecimated discrete wavelet transform (UDWT), and in this way expressed through detail and approximation coefficients. By shrinking small detail coefficients to zero, noise is then removed from the original spectrum signal. Two possibilities are offered for this thresholding: (i) hard thresholding replaces coefficients below a given threshold with zeroes, (ii) soft thresholding shrinks coefficients, possibly to zero. Hard thresholding might distort the signal more than soft thresholding, but it remains a simpler approach with good performance. Du et al. [14] proposed the use of continuous wavelet transforms that simultaneously filter out noise and baseline. Later, Kwon et al. [15] proposed a novel wavelet strategy that accommodates errors that vary across the mass spectrum. Recently, Mostacci et al. [16] proposed an effective multivariate denoising method combining wavelets and principal component analysis (PCA), which takes into account common structures within the signals.

Baseline correction

Two main approaches can be described for removing the baseline: the baseline is considered either (i) as the part of the signal remaining after features of interest have been removed [6], or (ii) as some kind of smooth curve underlying the spectrum [12, 17, 18]. Malyarenko et al. [19] offered a time-series perspective on the problem. In their article, the baseline arises from a constant offset and a slowly decaying charge, plus some shift after detector overload events. Another original approach was developed by Dijkstra et al. [20]. These authors set up a mixture model to deconvolve each signal component of a spectrum. An appropriate component in this mixture model corresponds to the baseline.
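The two thresholding rules described under noise filtering can be sketched as follows. This is only an illustration of the rules themselves, applied to a made-up vector of detail coefficients; the wavelet decomposition and reconstruction are not shown, and the threshold value is arbitrary.

```python
import numpy as np

def hard_threshold(coeffs, thr):
    """Set coefficients with magnitude at or below thr to zero; keep the rest unchanged."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(coeffs) > thr, coeffs, 0.0)

def soft_threshold(coeffs, thr):
    """Shrink every coefficient toward zero by thr; small ones become exactly zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

# Hypothetical detail coefficients, for illustration only
detail = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
kept_hard = hard_threshold(detail, thr=1.0)
kept_soft = soft_threshold(detail, thr=1.0)
```

The difference between the two rules lies in the surviving coefficients: hard thresholding leaves them untouched, whereas soft thresholding reduces each of them by the threshold.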
Alignment of spectra

Due to the physical principles underlying MS, a shift on the m/z axis can appear. Alignment corresponds to an adjustment of the time-of-flight axes of the observed spectra so that the features of the spectra are aligned with one another. In fact, for a reliable identification of local features, one needs to carefully associate each feature of interest with a specific m/z. The easiest way to perform alignment would be to allow spectra to shift a few time-steps to the left or to the right so that their correlation with other spectra is maximized. Nevertheless, this method may be too simplistic as it does not take into account the differences in shift that may occur along the m/z axis. More elaborate algorithms were thus developed for stronger misalignment. Based on a set of peaks in a reference spectrum, Jeffries et al. [21] proposed stretching each spectrum so that the distances between peaks from the reference and from the other spectra are minimized. The same year, Wong et al. [22] developed a strategy that also aligns spectra using selected local features (usually peaks from the average spectrum) as anchors. In the proposed method, spectra are locally shifted by inserting or deleting some points on the m/z axis. In 2006, Pratapa et al. [23] proposed an original method for the alignment of repetitions of mass spectra, i.e. multiple spectra for the same sample. All spectra are thought of as variations of one and the same latent spectrum, which is inferred by a Hidden Markov Model (HMM). Later, Antoniadis et al. [12] proposed an alignment based on landmarks in the wavelet framework. More recently, Kong et al. [24] proposed a Bayesian approach that uses the expectation maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a population of patients, and Feng et al. [25] modelled the m/z shift using an integrated Markov chain shifting (IMS) method.
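The simplest strategy mentioned at the start of this section, a single global shift chosen to maximize correlation with a reference, can be sketched as below. The function name, the `max_shift` parameter and the use of the Pearson correlation on the overlapping region are assumptions made for this illustration; none of the more elaborate cited algorithms is reproduced here.

```python
import numpy as np

def best_shift(spectrum, reference, max_shift=10):
    """Return the integer shift s such that spectrum[i] best matches reference[i + s]."""
    spectrum = np.asarray(spectrum, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n = len(reference)
    best_s, best_r = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        # Compare only the overlapping region, so that nothing wraps around.
        if s >= 0:
            a, b = spectrum[: n - s], reference[s:]
        else:
            a, b = spectrum[-s:], reference[: n + s]
        r = np.corrcoef(a, b)[0, 1]
        if r > best_r:
            best_s, best_r = s, r
    return best_s

# Hypothetical example: two Gaussian peaks, with the spectrum displaced by 3 time steps
t = np.arange(200, dtype=float)
reference = np.exp(-0.5 * ((t - 100) / 4) ** 2) + 0.5 * np.exp(-0.5 * ((t - 150) / 4) ** 2)
spectrum = np.exp(-0.5 * ((t - 97) / 4) ** 2) + 0.5 * np.exp(-0.5 * ((t - 147) / 4) ** 2)
shift = best_shift(spectrum, reference)
```

Because a single value of `shift` is applied to the whole axis, this sketch shares the limitation noted above: it cannot correct shifts that vary along the m/z axis.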
Normalization

A normalization step is necessary to ensure that spectra are comparable on the intensity scale. The idea is to consider that the total amount of protein is roughly the same between samples. In other words, only a small proportion of all proteins in the sample are differentially expressed. Total ion current (TIC) is a useful proxy for the total amount of protein. It is related to the total number of ion collisions with the detector and is output by the mass spectrometer. Each of the intensities in the spectrum is divided by the TIC. This method is debatable [26], and more robust but less intuitive approaches are being studied. A good comparison study of these methods is available [27].

Peak detection

Once the true signal is identified, peak detection aims to identify locations in the m/z range that correspond to peptides and to quantify the corresponding intensities. Several approaches were proposed for this purpose [6], with only the main strategies being described here. The most intuitive strategy, initially developed by Yasui et al. [28], relies on local maxima, that is to say points with higher intensity than the other points in a neighbourhood. Once identified, these maxima may be filtered so as to keep only peaks that emerge from the noise and (ideally) really correspond to a peptide. For example, a local maximum may be declared a meaningful peak when one or all of the following filtering criteria exceed a user-defined threshold: signal-to-noise ratio, intensity and area [29]. Other authors have proposed to move the peak finding problem to the wavelet coefficient space; the goal here is to find high coefficients that cluster together at a similar position for different scales and that thus correspond to peaks [12, 14, 30]. Another interesting strategy that has been proposed is to use model-based criteria that propose model functions to fit peaks [31].

The above methods subsequently require peak alignment to decide which peaks in different samples correspond to the same biological input [29]. Moreover, only peaks that appear in a large enough proportion of collected samples will be considered in the further analysis, leading to the choice of an arbitrary threshold proportion. To avoid such arbitrary decision making, Morris et al. [9] proposed to identify peaks on the mean spectrum. This has additional advantages: it filters out some instrument noise, plus some noisy peaks that appear in too few spectra.

Finding features of interest

Using peaks as features of interest requires estimating the intensities (as a function of the number of molecules hitting the detector) associated with these peaks. Peaks are actually two-dimensional (2D) features with a height and a width.
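As an illustration of the TIC normalization and the local-maxima strategy described above, here is a minimal sketch. The window size and the intensity threshold are hypothetical user choices, the TIC is approximated by the sum of the recorded intensities, and only the intensity criterion is implemented (not signal-to-noise ratio or area).

```python
import numpy as np

def tic_normalize(intensities):
    """Divide each intensity by the total ion current, taken here as the sum of intensities."""
    intensities = np.asarray(intensities, dtype=float)
    return intensities / intensities.sum()

def local_maxima_peaks(intensities, half_window=2, min_intensity=0.0):
    """Return indices that are strict maxima of their neighbourhood and pass the threshold."""
    x = np.asarray(intensities, dtype=float)
    peaks = []
    for i in range(half_window, len(x) - half_window):
        window = x[i - half_window : i + half_window + 1]
        is_strict_max = x[i] == window.max() and np.count_nonzero(window == x[i]) == 1
        if is_strict_max and x[i] >= min_intensity:
            peaks.append(i)
    return peaks

# Hypothetical toy spectrum, for illustration only
raw = np.array([0.0, 1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 0.0, 0.0])
normalized = tic_normalize(raw)
peaks = local_maxima_peaks(normalized, half_window=2, min_intensity=0.5)
```

Points closer than `half_window` to either end of the spectrum are never reported, which is a simplification of this sketch rather than a property of the published methods.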
The simplest approach is to use peak heights as intensities. However, as can easily be verified on a spectrum, peaks are narrow for low m/z but get broader with increasing m/z values. Computing the area under the peak is therefore another proxy for intensity. For a given peak, this is not necessarily a complicated step, but it requires evaluating the peak width, which requires some care. For close or overlapping peaks (on the m/z axis) corresponding to different peptides, estimating the intensity can be a really hard step. If such overlapping peaks are visible in the spectrum, several relatively recent methods [20, 32] using distribution mixtures offer the possibility to deconvolve features, thus enhancing the resolution for each peak and enabling a better intensity reading [33].

At this step, each spectrum is characterized by a finite, common number of peaks. Using adapted statistical methods, biomarker discovery analysis then aims to detect which of these peaks are associated with the factors of interest. Another approach considers spectra as functional data and analyses wavelet coefficients rather than detected peaks [9, 12, 18]. Although promising, this latter approach seems little used and merits further investigation.

BIOMARKER IDENTIFICATION AND VALIDATION

Biological constituents associated with either diagnostic or prognostic groups are tested in diagnostic or prognostic studies, respectively. These studies correspond to the first phases of biomarker development formalized by Pepe et al. [34] in the context of early detection of cancer. Guidelines have also been proposed to encourage transparent and complete reporting of prognostic studies, so that the contribution of their conclusions to scientific knowledge can be better understood [35]. Generally, the value distributions of constituents overlap between the groups. In classical clinical studies, one or a few biological constituents are tested.
These candidate biomarkers correspond to a priori hypotheses derived from a biological pathway. Recent developments in molecular biology have led to high throughput technologies, and consequently to the question of how best to design studies. A sequential approach has been adopted to discover new biomarkers. The first step corresponds to identification studies that aim to select a list of candidate biomarkers among a large number of biological constituents, and to estimate the strength of association between those candidate biomarkers and disease status or outcome. The second step corresponds to validation studies designed to retain, among previously selected candidates, the confirmed biomarkers, and to re-estimate the strength of association between those biomarkers and disease status or outcome.

Classification refers to the assignment of each spectrum, i.e. of each sample or, equivalently, each individual, to a group. Given a serum sample, the aim of classification is to allocate the patient to a specific diagnostic or prognostic group. Though tightly related to biomarker discovery, profiling, similar to the signature concept in transcriptomics, has appeared as a new strategy for classifying mass spectra. The idea is to use a protein profile, defined as a list of intensities for different m/z positions, instead of a unique protein concentration. Such a profile can be used in two different ways: (i) as a set of proteins that must (or could later) be individually identified, and (ii) as a set of distinguishing features without searching for biological support for these. This can be an efficient approach in clinical applications.

Identification studies

The aim of identification studies is the selection of candidate biomarkers from a large number of biological constituents. In the proteomic context, a biomarker can be defined as a protein that differentiates samples from different groups. Several methods have been used or developed in order to identify candidate biomarkers. These methods have already been discussed in the transcriptomic field. This search for biomarkers is tightly linked with the well-known curse of dimensionality that exists in all high-throughput methods, where the number of variables (peak intensities) exceeds the number of samples. Intuitively, the dimension of the regression space must be reduced, and two approaches have been proposed to do this. In the first approach, the subspace is defined by selecting certain variables within the matrix corresponding to features of interest, whereas in the second, the subspace is defined by components. While the former directly focuses on the identification of features that exhibit differences in intensities between the groups, the latter evaluates the importance of features embedded in a classification algorithm.

Feature selection

When two groups are compared, simple approaches consider each feature individually.
The aim is to compare intensities between the groups. Each test tells whether the two groups can be distinguished based solely on information from one particular feature. This corresponds to univariate statistical tests, such as the classical Student t-test. The observed test value is compared to the test distribution under the null hypothesis, H0, of equality between mean expression levels in the two groups. The significance of the test is calculated from the cumulative distribution function. Since features are never considered simultaneously, correlations cannot be taken into account and information from a set of features might be redundant.

Biomarker identification is performed with control of type-1 and type-2 error rates adapted to the large number of statistical tests performed (Table 1). The control of the false discovery rate (FDR), introduced by Benjamini and Hochberg [36],

$$\mathrm{FDR} = E(Q), \qquad Q = \begin{cases} V/R, & \text{if } R > 0 \\ 0, & \text{if } R = 0 \end{cases}$$

V being the number of false positive tests and R the total number of rejected null hypotheses, is frequently involved in controlling type-1 error [37]. Storey and Tibshirani [38] proposed to use the q value, i.e. the expected proportion of false positives among all features as or more extreme than the observed one, as an FDR-based measure of significance. They also provided an estimation of $\hat{\pi}_0$, the proportion of features following the null hypothesis. To control type-2 error, the proportion of detected genes of interest,

$$\mathrm{Power} = \frac{E(S)}{m_1},$$

has been proposed as a definition of power in the context of multiple testing [39, 40].

Beyond biomarker identification, clinical studies aim to estimate the strength of the association between the candidate biomarker level and disease status or disease outcome. For parametric models, variable selection results from parameter estimation. Because of the selection mechanism involved in identification studies, the strength of association is commonly over-estimated.
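As an illustration of FDR control in the sense defined above, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to a vector of p-values, one per feature. The function name, the default level `alpha=0.05` and the example p-values are assumptions for illustration; the q value of Storey and Tibshirani is not implemented here.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array, True where H0 is rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks the p-values
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        rejected[order[: k + 1]] = True            # reject every hypothesis up to that rank
    return rejected

# Hypothetical p-values, e.g. from one t-test per peak
p_values = [0.041, 0.001, 0.6, 0.008]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

In the notation of Table 1, the number of `True` entries is R; V, the number of false positives among them, remains unknown in a real study.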
Table 1: Classical outcome for a binary decision

                 Conclusion
Truth            Accept H0    Reject H0    Total
H0 true          U            V            m0
H0 wrong         T            S            m1
Total            m - R        R            m

The optimistic bias in the estimated strength of association is easily understood: only variables for which the corresponding test values exceed an a priori defined threshold are selected. Over-estimation of the strength of association and false discoveries are both explained by this well-known regression to the mean phenomenon [41]. Optimism increases when the proportion of features of interest decreases and is reduced by increasing the number of samples [42]. Whereas multivariate modelling is useful for identifying redundancy in the information provided by the selected features while adjusting for confounders, it does not correct the optimism bias linked to the selection process. The parameter estimation bias results in an over-estimation of the predictive ability of statistical models when applied to new samples.

More sophisticated methods for the selection of predictors exist, jointly known as penalized regression. The idea behind these techniques is to constrain the regression coefficients. Several constraints, known as L1 or L2 penalizations, have been proposed to shrink the parameter estimates [43–45]. These penalizations down-weight or filter out weak predictors, and often improve prediction accuracy. The extension of the threshold gradient descent (TGD) method [46] to the survival model [47] is similar.

Classification algorithms

Given Y, the vector of disease status or outcome status, dimension reduction techniques aim to explain Y with information from X, the matrix of features of interest. X-based reduction methods, i.e. unsupervised methods, only use information from X, while X- and Y-based methods, i.e. supervised methods, also use information from Y. A typical example of X-based methods is PCA, where components define the subspace of X that maximizes the projected X-variability. The supervised X- and Y-based methods define the subspace that maximizes the projected X- and Y-covariability. Several methods can be used, and a priori knowledge of that structure may guide the choice of the analysis method [48].
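The X-based reduction performed by PCA can be sketched with a singular value decomposition, which remains usable when the number of features exceeds the number of samples. The function name and the simulated matrix below are assumptions for illustration; no supervised (X- and Y-based) method is shown.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X (samples) onto the first principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)               # share of X-variability per component
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical data: 20 samples described by 50 peak intensities
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
scores, components, explained = pca_scores(X, n_components=3)
```

The rows of `scores` are the sample coordinates in the reduced subspace, which could then be passed to a classifier in place of the full intensity matrix.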
Partial least squares regression (PLS) is an example of a dimension reduction technique that directly builds the components most associated with Y [49]. Linear discriminant analysis (LDA) first requires dimension reduction, such as PCA (PCA-LDA) or PLS (PLS-LDA). A subset of the first components is usually sufficient to capture most of the data covariance, and the optimal number of components can be chosen by cross-validation, as proposed by Boulesteix in the case of PLS + DA. Wu et al. [50] proposed a comparison of certain classification algorithms for MS data. They compared LDA and quadratic discriminant analysis (QDA) to other supervised learning algorithms: k-nearest neighbours (KNN), bagging, boosting, random forest (RF) and support vector machines (SVM). KNN uses a vote of the nearest samples (in a space defined by intensities for different features) to choose a class for a new sample. Bagging, boosting and RF are methods to aggregate classifiers such as classification and regression trees (CART). Hastie et al. [45] present learning algorithms and their properties in detail.

Combined methods

These methods combine feature selection and classification algorithms. Classification algorithms involve a large set of potential biomarkers in classifier building. In the combined methods, the biomarkers that most help in classification are retained and used in the classifier, while the others are discarded. Classification and regression trees are an example, as well as their combination with bagging, boosting or random forests. Sparse PLS combines PLS regression and variable selection into a one-step procedure [51]. Reynes et al. [52] used a genetic algorithm (GA) to select features that contribute to a voting rule, whereas Koomen et al. [53] combined a GA, used to select features that contribute to a large Mahalanobis distance between groups, with an FDR-controlled Student's t-test.

Non-peak-based classification

Some classification methods were developed that do not rely on the peak concept.
Instead, these methods use every single point of the spectra to look for regions that can distinguish different groups of samples. Indeed, the search for parts of the spectrum containing useful information can be performed under the guidance of variables related to clinical events. The pioneering study by Petricoin et al. [54] is a typical example of this strategy, using every point of the spectrum as a potential predictor. They developed an algorithm that combined a genetic algorithm with a self-organizing map to select a few points on the m/z axis as the best set of group predictors. Li et al. [55] combined a GA for feature selection and KNN to perform classification in a restricted subspace. Tong et al. [56] used the same initial idea. They consider each point of a spectrum as a potential feature of interest, in the same way as Petricoin et al., but select m/z positions using CART associated in a decision forest (a specific tree aggregator with constraints on tree heterogeneity and quality).

Validation studies

Splitting the study sample into a learning set and a test set has been proposed as a process of internal validation. The classifier is trained on the learning set and later validated on the test set. Such an internal validation is a required evaluation procedure to avoid over-fitting, i.e. to avoid the use of characteristics of the learning set that are not of interest but rather specific to this particular set. The use of one particular learning set among the 2^n possible splits, and the power limitation of the procedure, result in a preference for the leave-k-out or bootstrap methods [57–60]. When the statistical analysis of the identification study first requires a selection of differential biological constituents and then a classification algorithm, the test set must be excluded from all pre-processing and biomarker identification steps. Hilario et al. [61] drew attention to the potential misuse of cross-validation techniques. They advise against performing the pre-processing steps on the full dataset before the split between training and test sets is decided, since doing so can result in artificially over-estimated biomarker performance. Even if an internal validation has been performed, external validation is a necessary step of biomarker validation.

Pepe et al. [62] underlined the difference between the strength of association between a biomarker level and disease status or outcome, as estimated by the odds-ratio, and the ability of the biomarker to classify subjects. They proposed to use receiver operating characteristic (ROC) curves to estimate the classification performance of a biomarker or its incremental value in addition to the usual diagnostic or prognostic clinical and biological factors.
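As a concrete illustration of summarizing classification performance with an ROC curve, the area under the curve (AUC) can be computed as the proportion of diseased/non-diseased pairs that the biomarker ranks correctly. The sketch below assumes that higher biomarker values indicate disease and counts ties as one half; the function name and the toy data are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of the two groups."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)   # True = diseased, False = non-diseased
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both groups must contain at least one sample")
    # Difference for every (diseased, non-diseased) pair; ties count one half.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Hypothetical biomarker intensities for two healthy and two diseased subjects
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to a biomarker with no discriminating ability and 1.0 to perfect separation of the groups. To limit the optimism discussed in this section, such a value would need to be computed on samples that played no part in selecting the biomarker.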
The ROC-based approach has been extended to the analysis of positive and negative predictive values using predictive receiver operating characteristic (PROC) curves [63].

Statistical power

The power of transcriptomic studies has been shown to decrease as the number of non-differentially expressed genes considered in the study increases [39]. This theoretical result applies to proteomics, although the smaller number of tests performed in proteomics leads to a smaller decrease in power. In addition to power calculations developed in the context of transcriptomic analysis [39, 40], Jouve et al. [64] have underlined that the high instrumental variability encountered in MS, together with FDR control, builds a detrimental synergy leading to low statistical power for usual MS study sample sizes. Larger identification studies and reduced measurement variability are both needed.

DISCUSSION

Proteomic technologies have recently been adapted to the new field of clinical proteomics. Our article deals with the different steps involved in proteomic data analysis. These methods assume that the biological sample is of high quality. It is therefore quite useful to add a pre-analytical step concerning sample preparation. The pivotal article by Petricoin et al. [54] emphasized the impact of pre-analytical factors in proteomic studies and the predominant role of good study design in achieving better reproducibility and hence biologically meaningful results [5]. The various origins of errors and biases were well identified in the pre-analytical and analytical steps, leading to a good discussion of the conditions necessary to avoid them (temperature handling, time delay or freeze/thaw cycles). The quality of the pre-analytical steps, as well as the number of samples available, is crucial, regardless of the technology involved later on.
Inadequate sample quality will affect both the fractionation steps, such as the depletion of major proteins, which allow the investigation of low-concentration constituents, and the MS analysis. Albrethsen [65] reviewed reproducibility in several protein profiling studies using MALDI-TOF instruments and highlighted the several sources of technical variation. Villanueva et al. [66] and Callesen et al. [67] showed that an optimized serum sample preparation method makes it possible to overcome the challenge of reproducibility. It was also demonstrated that high throughput MS may be a promising biomarker discovery tool as long as analytical sources of variation are identified and controlled [68].

In addition to these sample quality considerations, which could benefit in the future from good quality control, experiments must be designed so as to avoid potential sources of bias and confounding factors; inadequate study designs may lead to confounding between technical factors and the biological factor of interest. Two strategies have to be employed to this end: blocking and randomizing. The first avoids systematic bias due to block effects by balancing the samples within each block (plate) with respect to the study factors of interest. The second avoids unknown or unexpected bias, such as position on the plate or tip differences. The addition of technical replicates to biological replicates in study designs has recently been discussed [69]. Technical replicates allow technical variability to be controlled, and knowledge about biological and technical variability can then be used for sample size calculations [70].

Today, cancer clinicians would like to combine proteomic profiles and classical clinical variables in the same models to improve the assessment of cancer prognosis. The first intuitive way to do this is to include classical clinical and omic variables in the same models [71–73]. However, this approach is not satisfactory since it does not take into account that the effects of omic variables are overestimated relative to those of the classical clinical variables [42, 74]. In the very recent literature, certain authors have proposed models that take this optimism into account. These models simultaneously construct a classifier combining both types of variables and determine whether omic data provide additional predictive value. In 2008, Binder and Schumacher [75] proposed an offset-based boosting approach in the context of survival data. Shortly afterwards, Boulesteix et al. [76] and Le Cao et al. [77] proposed a more general approach, in the sense that it was not limited to survival data. This approach is based on PLS, RF and the pre-validation strategy suggested by Tibshirani and Efron [74]. In a more recent article, Boulesteix et al. [78] proposed a permutation-based procedure to test the additional predictive value of high-dimensional molecular data. In addition, Le Cao et al. [79] recently proposed a joint analysis of multiple datasets from different technology platforms.
More precisely, they proposed a regularized version of canonical correlation analysis to analyse correlations between two data sets, and a sparse version of PLS regression that performs simultaneous variable selection in the two types of data sets.

In this work, we chose to focus on the analysis of proteomic data from MS, which provides an efficient tool for discovering new biomarkers in a large number of samples. One potential drawback of MS is that the data generated are not directly quantitative: samples analysed with MS are highly complex, which limits the use of MS as a quantification tool. Once selected and statistically validated, candidate markers have to be identified and quantitatively validated. Tandem MS (MS/MS) serves this purpose. The technique involves multiple steps of MS selection in which each peptide of interest (candidate marker) is isolated and fragmented. A sequence search engine such as MASCOT then tries to match the observed spectra with known peptides whose sequences are stored in dedicated databases. This label-free technique also provides a direct mass spectrometric signal intensity for the identified peptides. To reduce sample complexity and thus allow better quantification, proteins are often first isolated through separation techniques such as liquid chromatography (LC). Such label-free techniques allow a direct comparative quantification of proteins [80, 81]. These methods require heavy analytical processes and, for this reason, are not yet extensively used on large-sample datasets. In addition to this biological aspect, label-free proteomics generates more data than standard MS, raising additional statistical questions for which analytical tools are still immature [82–84]. Nevertheless, label-free proteomics appears promising, and current work is evaluating its use in large clinical proteomics studies.
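The database-matching idea mentioned above can be illustrated with a deliberately simplified sketch. It only compares observed peptide masses with theoretical monoisotopic masses computed from sequences, whereas real search engines such as MASCOT score fragment-ion (MS/MS) spectra with probabilistic models. The function names, the toy database and the 20 ppm tolerance are assumptions made for illustration and do not come from the review.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added for the peptide termini


def peptide_mass(sequence):
    """Theoretical monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def match_masses(observed, database, tol_ppm=20.0):
    """For each observed mass, list database peptides within tol_ppm.

    `database` maps a (hypothetical) peptide name to its sequence.
    """
    results = []
    for mass in observed:
        hits = [
            name
            for name, seq in database.items()
            if abs(peptide_mass(seq) - mass) / mass * 1e6 <= tol_ppm
        ]
        results.append((mass, hits))
    return results


# Toy database with made-up entries, for illustration only.
TOY_DB = {"pep_A": "PEPTIDE", "pep_B": "SAMPLER"}

for mass, hits in match_masses([799.36, 802.40, 900.0], TOY_DB):
    print(mass, hits)
```

Because leucine and isoleucine have the same mass, a mass-only comparison like this one cannot tell them apart, which is one reason why fragment spectra and dedicated scoring are used in practice.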
Some alternative methods to standard MS for differential proteomics have appeared in recent years, such as isobaric tags for relative and absolute quantification (iTRAQ) and difference gel electrophoresis (2D-DIGE). These methods aim at both selecting and quantifying candidate biomarkers. Briefly, the iTRAQ technique uses four isobaric amine-specific tags to determine relative protein levels in up to four samples simultaneously. 2D-DIGE is a 2D gel separation technology for proteins in which the samples to be compared are labelled with different dyes (Cy3 and Cy5), enabling signal detection at different emission wavelengths. Interested readers may consult Wu et al. [85], who provided a comparative study of these methods. Although interesting, these two techniques cannot currently be used in large-scale studies, because both are limited by the low number of samples that can be analysed simultaneously: many hundreds of samples can be analysed simultaneously with MS, whereas fewer than 10 can be analysed using 2D-DIGE or iTRAQ. This limitation results in a lack of reproducibility for the former and in costs (in both material and time) that are too high for well-powered discovery studies for the latter. This is a drawback for biomarker discovery in large-scale studies.
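The blocking and randomizing strategies recommended earlier in this discussion can be sketched in a few lines. The example below is a minimal illustration, not a procedure taken from the cited works: it assumes a single study factor (case or control), treats each plate as a block, and uses hypothetical names (`blocked_randomization`, `n_plates`, `seed`).

```python
import random
from collections import defaultdict


def blocked_randomization(samples, n_plates, seed=0):
    """Assign samples to plates (blocks), balanced for the study factor.

    `samples` is a list of (sample_id, group) pairs. Returns one list per
    plate, each holding (sample_id, group) pairs in randomized run order.
    """
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    plates = [[] for _ in range(n_plates)]
    offset = 0
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)  # randomizing: which sample goes to which plate
        # Blocking: deal each group round-robin so that group counts per
        # plate differ by at most one.
        for i, sample_id in enumerate(ids):
            plates[(offset + i) % n_plates].append((sample_id, group))
        offset = (offset + len(ids)) % n_plates  # spread any leftovers

    for plate in plates:
        rng.shuffle(plate)  # randomizing: position / run order on the plate
    return plates


# Illustrative design: 12 cases and 12 controls over 3 plates.
samples = [(f"case_{i}", "case") for i in range(12)]
samples += [(f"ctrl_{i}", "control") for i in range(12)]
for index, plate in enumerate(blocked_randomization(samples, n_plates=3, seed=42)):
    print(index, plate)
```

With this layout, a plate effect cannot be confounded with the case/control contrast, and the within-plate shuffle guards against position-related biases such as those mentioned above.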

Proteomics represents, in our opinion, a key field for the future of clinical research. However, the methodologies involved can still be improved. Such improvement can only result from close collaboration between biostatisticians, computer scientists, biologists and clinicians.

Key Points

Efficient pre-processing is an essential prerequisite for retrieving meaningful proteomic biological information from raw spectra and reaching meaningful clinical conclusions.

The identification and validation of pertinent biomarkers require large, well-designed studies.

Methodology improvement would benefit from close collaboration between biostatisticians, computer scientists, biologists and clinicians.

References

1. Diamandis EP. How are we going to discover new cancer biomarkers? A proteomic approach for bladder cancer. Clin Chem 2004;50:
2. Semmes OJ, Feng Z, Adam BL, et al. Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin Chem 2005;51:
3. West-Nørager M, Kelstrup CD, Schou C, et al. Unravelling in vitro variables of major importance for the outcome of mass spectrometry-based serum proteomics. J Chromatogr B Analyt Technol Biomed Life Sci 2007;847:
4. Lin Z, Jenson SD, Lim MS, et al. Application of SELDI-TOF mass spectrometry for the identification of differentially expressed proteins in transformed follicular lymphoma. Mod Pathol 2004;17:
5. Baggerly KA, Morris JS, Coombes KR. Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004;20:
6. Coombes KR, Baggerly KA, Morris JS. Pre-Processing Mass Spectrometry Data, Fundamentals of Data Mining in Genomics and Proteomics. Boston: Kluwer,
7. Hu J, Coombes KR, Morris JS, et al. The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief Funct Genomic Proteomic 2005;3:
8. Arneberg R, Rajalahti T, Flikka K, et al. Pretreatment of mass spectral profiles: application to proteomic data. Anal Chem 2007;79:
9. Morris JS, Coombes KR, Koomen J, et al. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 2005;21:
10. Cruz-Marcelo A, Guerra R, Vannucci M, et al. Comparison of algorithms for pre-processing of SELDI-TOF mass spectrometry data. Bioinformatics 2008;24:
11. Yang C, He Z, Yu W. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics 2009;10:
12. Antoniadis A, Bigot J, Lambert-Lacroix S, et al. Nonparametric pre-processing methods and inference tools for analyzing time-of-flight mass spectrometry data. Curr Anal Chem 2007;3:
13. Coombes KR, Tsavachidis S, Morris JS, et al. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics 2005;5:
14. Du P, Kibbe WA, Lin SM. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics 2006;22:
15. Kwon D, Vannucci M, Song JJ, et al. A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics 2008;8:
16. Mostacci E, Truntzer C, Cardot H, et al. Multivariate denoising methods combining wavelets and PCA for mass spectrometry data. Proteomics. In Press.
17. Fung ET, Enderwick C. ProteinChip clinical proteomics: computational challenges and solutions. Biotechniques 2002;34(Suppl 8):
Tan CS, Ploner A, Quandt A, et al. Finding regions of significance in SELDI measurements for identifying protein biomarkers. Bioinformatics 2006;22:
Li X. PROcess: Ciphergen SELDI-TOF Processing. R package version
Malyarenko DI, Cooke WE, Adam BL, et al. Enhancement of sensitivity and resolution of surface-enhanced laser desorption/ionization time-of-flight mass spectrometric records for serum peptides using time-series analysis techniques. Clin Chem 2005;51:
Dijkstra M, Roelofsen H, Vonk RJ, et al. Peak quantification in surface-enhanced laser desorption/ionization by using mixture models. Proteomics 2006;6:
Jeffries N. Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics 2005;21:
Wong JWH, Cagney G, Cartwright HM. SpecAlign - processing and alignment of mass spectra datasets. Bioinformatics 2005;21:
Pratapa PN, Patz EF, Hartemink AJ. Finding diagnostic biomarkers in proteomic spectra. Pac Symp Biocomput. New Jersey: World Scientific, 2006;
Kong X, Reilly CA. Bayesian approach to the alignment of mass spectra. Bioinformatics 2009;25:
Feng Y, Ma W, Wang Z, et al. Alignment of protein mass spectrometry data by integrated Markov chain shifting method. Statistics and Its Interface 2009;2:
Cairns DA, Thompson D, Perkins DN, et al. Proteomic profiling using mass spectrometry - does normalising by total ion current potentially mask some biological differences? Proteomics 2008;8:
Meuleman W, Engwegen J, Gast M, et al. Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data. BMC Bioinformatics 2008;9:
Yasui Y, Pepe M, Thompson ML, et al. A data analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003;4:
29. Coombes KR, Fritsche J, Clarke C, et al. Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin Chem 2003;49:
30. Randolph TW, Mitchell BL, McLerran DF, et al. Quantifying peptide signal in MALDI-TOF mass spectrometry data. Mol Cell Proteomics 2005;4:
31. Lange E, Gropl C, Reinert K, et al. High accuracy peak picking of proteomics data using wavelet techniques. Pacific Symposium on Biocomputing, Maui, Hawaii, USA. New Jersey: World Scientific, 2006;
32. Noy K, Fasulo D. Improved model-based, platform-independent feature extraction for mass spectrometry. Bioinformatics 2007;23:
33. Du P, Angeletti RH. Automatic deconvolution of isotope-resolved mass spectra using variable selection and quantized peptide mass distribution. Anal Chem 2006;78:
34. Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker development for early detection of cancer. JNCI 2001;93:
35. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). JNCI 2005;97:
36. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B 1995;57:
37. Dudoit S, Shaffer J, Boldrick J. Multiple hypothesis testing in microarray experiments. Stat Sci 2003;18:
38. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2004;100:
39. Lee M-LT, Whitmore GA. Power and sample size for DNA microarray studies. Stat Med 2002;21:
40. Pawitan Y, Murthy KR, Michiels S, et al. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 2005;21(20):
41. Truntzer C, Maucort-Boulch D, Roy P. Consequences of regression to the mean in the high dimensional setting. Submitted manuscript.
42. Truntzer C, Maucort-Boulch D, Roy P. Comparative optimism in models involving both classical clinical and gene expression information. BMC Bioinformatics 2008;9:
43. Efron B, Hastie T, Johnstone I, et al. Least angle regression. Ann Stat 2004;32:
44. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc B 1996;58:
45. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 3rd edn. New York: Springer Science,
46. Friedman J, Popescu B. Gradient directed regularization for linear regression and classification. Technical Report, Stanford University, Department of Statistics,
47. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 2005;21:
48. Truntzer C, Mercier C, Estève J, et al. Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data. BMC Bioinformatics 2007;8:
49. Boulesteix AL. PLS dimension reduction for classification with microarray data. Stat Appl Genet Mol Biol 2004;3:Article
50. Wu B, Abbott T, Fishman D, et al. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003;19:
51. Lê Cao KA, Rossouw D, Robert-Granié C, et al. A sparse PLS for variable selection when integrating omics data. Stat Appl Genet Mol Biol 2008;7:Article
52. Reynes C, Sabatier R, Molinari N, et al. A new genetic algorithm in proteomics: feature selection for SELDI-TOF data. Comput Stat Data Anal 2008;52:
53. Koomen JM, Shih LN, Coombes KR, et al. Plasma protein profiling for diagnosis of pancreatic cancer reveals the presence of host response proteins. Clin Cancer Res 2005;11:
54. Petricoin EF, Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002;359:
55. Li L, Umbach DM, Terry P, et al. Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 2004;20:
56. Tong W, Xie Q, Hong H, et al. Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence. Environ Health Perspect 2004;112:
57. Efron B, Tibshirani R. An Introduction to the Bootstrap. Monographs on Statistics & Applied Probability. New York: Chapman & Hall/CRC,
58. Shao J, Tu D. The Jackknife and Bootstrap. New York: Springer,
59. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000;29:
60. Harrell F. Regression Modelling Strategy with Applications in Linear Models, Logistic Regression and Survival Model. New York: Springer,
61. Hilario M, Kalousis A, Pellegrini C, et al. Processing and classification of protein mass spectra. Mass Spectrom Rev 2006;25:
62. Pepe MS, Janes H, Longton G, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 2004;159:
63. Shiu SY, Gatsonis C. The predictive receiver operating characteristic curve for the joint assessment of the positive and negative predictive values. Phil Trans R Soc A 2008;366:
64. Jouve T, Maucort-Boulch D, Ducoroy P, et al. Statistical power in mass-spectrometry proteomic studies. Submitted manuscript.
65. Albrethsen J. Reproducibility in protein profiling by MALDI-TOF mass spectrometry. Clin Chem 2007;53:
66. Villanueva J, Lawlor K, Toledo-Crow R, et al. Automated serum peptide profiling. Nat Protoc 2006;1:
67. Callesen AK, Vach W, Jørgensen PE, et al. Reproducibility of mass spectrometry based protein profiles for diagnosis of breast cancer across clinical studies: a systematic review. J Proteome Res 2008;7:
68. Hu J, Coombes KR, Morris JS, et al. The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief Funct Genomic Proteomic 2005;3:
69. Oberg A, Vitek O. Statistical design of quantitative mass spectrometry-based proteomic profiling experiments. J Proteome Res 2009;8:
70. Cairns DA, Barrett JH, Billingham LJ, et al. Sample size determination in clinical proteomic profiling experiments using mass spectrometry for class comparison. Proteomics 2009;9:
71. Dettling M, Bühlmann P. Finding predictive gene groups from microarray data. J Multivar Anal 2004;90(1):
72. Gevaert O, Smet FD, Timmerman D, et al. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics 2006;22:
73. Li L. Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information. Bioinformatics 2006;22:
74. Tibshirani R, Efron B. Pre-validation and inference in microarrays. Stat Appl Genet Mol Biol 2007;1:
75. Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics 2008;9:
76. Boulesteix A, Porzelius C, Daumer M. Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics 2008;24:
77. Le Cao K, Meugnier E, McLachlan GJ. Integrative mixture of experts to combine clinical factors and gene markers. Bioinformatics, to appear.
78. Boulesteix A, Hothorn T. Testing the additional predictive value of high-dimensional molecular data. BMC Bioinformatics 2010;11:
79. Le Cao K, Gonzalez I, Dejean S. IntegrOmics: an R package to unravel relationships between two omics data sets. Bioinformatics 2009;25:
80. Wang M, You J, Bemis KG, et al. Label-free mass spectrometry-based protein quantification technologies in proteomic analysis. Brief Funct Genomics Proteomics 2008;7:
81. Wong JWH, Sullivan MJ, Cagney G. Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments. Brief Bioinform 2008;9:
82. Karpievitch YV, Taverner T, Adkins JN, et al. Normalization of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 2009;25:
83. Yu T, Park Y, Johnson JM, Jones DP. apLCMS - adaptive processing of high-resolution LC/MS data. Bioinformatics 2009;25:
84. Pham TV, Piersma SR, Warmoes M, et al. On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. Bioinformatics 2010;26:
85. Wu WW, Wang G, Baek SJ, Shen RF. Comparative study of three proteomic quantitative methods, DIGE, cICAT, and iTRAQ, using 2D gel- or LC-MALDI TOF/TOF. J Proteome Res 2006;5:651–8.


More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Pep-Miner: A Novel Technology for Mass Spectrometry-Based Proteomics

Pep-Miner: A Novel Technology for Mass Spectrometry-Based Proteomics Pep-Miner: A Novel Technology for Mass Spectrometry-Based Proteomics Ilan Beer Haifa Research Lab Dec 10, 2002 Pep-Miner s Location in the Life Sciences World The post-genome era - the age of proteome

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Introduction to Proteomics

Introduction to Proteomics Introduction to Proteomics Åsa Wheelock, Ph.D. Division of Respiratory Medicine & Karolinska Biomics Center asa.wheelock@ki.se In: Systems Biology and the Omics Cascade, Karolinska Institutet, June 9-13,

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Proteomics in Practice

Proteomics in Practice Reiner Westermeier, Torn Naven Hans-Rudolf Höpker Proteomics in Practice A Guide to Successful Experimental Design 2008 Wiley-VCH Verlag- Weinheim 978-3-527-31941-1 Preface Foreword XI XIII Abbreviations,

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Statistical Analysis Strategies for Shotgun Proteomics Data

Statistical Analysis Strategies for Shotgun Proteomics Data Statistical Analysis Strategies for Shotgun Proteomics Data Ming Li, Ph.D. Cancer Biostatistics Center Vanderbilt University Medical Center Ayers Institute Biomarker Pipeline normal shotgun proteome analysis

More information

MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification

MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification Gold Standard for Quantitative Data Processing Because of the sensitivity, selectivity, speed and throughput at which MRM assays can

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, bjgrant@med.umich.edu) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Software Approaches for Structure Information Acquisition and Training of Chemistry Students

Software Approaches for Structure Information Acquisition and Training of Chemistry Students Software Approaches for Structure Information Acquisition and Training of Chemistry Students Nikolay T. Kochev, Plamen N. Penchev, Atanas T. Terziyski, George N. Andreev Department of Analytical Chemistry,

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

using ms based proteomics

using ms based proteomics quantification using ms based proteomics lennart martens Computational Omics and Systems Biology Group Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University Ghent, Belgium

More information

MRMPilot Software: Accelerating MRM Assay Development for Targeted Quantitative Proteomics

MRMPilot Software: Accelerating MRM Assay Development for Targeted Quantitative Proteomics MRMPilot Software: Accelerating MRM Assay Development for Targeted Quantitative Proteomics With Unique QTRAP and TripleTOF 5600 System Technology Targeted peptide quantification is a rapidly growing application

More information

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Rulex s Logic Learning Machines successfully meet biomedical challenges.

Rulex s Logic Learning Machines successfully meet biomedical challenges. Rulex s Logic Learning Machines successfully meet biomedical challenges. Rulex is a predictive analytics platform able to manage and to analyze big amounts of heterogeneous data. With Rulex, it is possible,

More information

Accurate calibration of on-line Time of Flight Mass Spectrometer (TOF-MS) for high molecular weight combustion product analysis

Accurate calibration of on-line Time of Flight Mass Spectrometer (TOF-MS) for high molecular weight combustion product analysis Accurate calibration of on-line Time of Flight Mass Spectrometer (TOF-MS) for high molecular weight combustion product analysis B. Apicella*, M. Passaro**, X. Wang***, N. Spinelli**** mariadellarcopassaro@gmail.com

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

diagnosis through Random

diagnosis through Random Convegno Calcolo ad Alte Prestazioni "Biocomputing" Bio-molecular diagnosis through Random Subspace Ensembles of Learning Machines Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini DSI Dipartimento

More information

In-Depth Qualitative Analysis of Complex Proteomic Samples Using High Quality MS/MS at Fast Acquisition Rates

In-Depth Qualitative Analysis of Complex Proteomic Samples Using High Quality MS/MS at Fast Acquisition Rates In-Depth Qualitative Analysis of Complex Proteomic Samples Using High Quality MS/MS at Fast Acquisition Rates Using the Explore Workflow on the AB SCIEX TripleTOF 5600 System A major challenge in proteomics

More information

Session 1. Course Presentation: Mass spectrometry-based proteomics for molecular and cellular biologists

Session 1. Course Presentation: Mass spectrometry-based proteomics for molecular and cellular biologists Program Overview Session 1. Course Presentation: Mass spectrometry-based proteomics for molecular and cellular biologists Session 2. Principles of Mass Spectrometry Session 3. Mass spectrometry based proteomics

More information

A Common Processing and Statistical Frame for Label-Free Quantitative Proteomic Analyses

A Common Processing and Statistical Frame for Label-Free Quantitative Proteomic Analyses Oksana Riba Grognuz: Label-Free Quantitative Proteomics i A Common Processing and Statistical Frame for Label-Free Quantitative Proteomic Analyses Master s Thesis in Proteomics and Bioinformatics By Oksana

More information

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov Data Integration Lectures 16 & 17 Lectures Outline Goals for Data Integration Homogeneous data integration time series data (Filkov et al. 2002) Heterogeneous data integration microarray + sequence microarray

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey Molecular Genetics: Challenges for Statistical Practice J.K. Lindsey 1. What is a Microarray? 2. Design Questions 3. Modelling Questions 4. Longitudinal Data 5. Conclusions 1. What is a microarray? A microarray

More information

LC-MS/MS for Chromatographers

LC-MS/MS for Chromatographers LC-MS/MS for Chromatographers An introduction to the use of LC-MS/MS, with an emphasis on the analysis of drugs in biological matrices LC-MS/MS for Chromatographers An introduction to the use of LC-MS/MS,

More information

OpenMS A Framework for Quantitative HPLC/MS-Based Proteomics

OpenMS A Framework for Quantitative HPLC/MS-Based Proteomics OpenMS A Framework for Quantitative HPLC/MS-Based Proteomics Knut Reinert 1, Oliver Kohlbacher 2,Clemens Gröpl 1, Eva Lange 1, Ole Schulz-Trieglaff 1,Marc Sturm 2 and Nico Pfeifer 2 1 Algorithmische Bioinformatik,

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

Protein Prospector and Ways of Calculating Expectation Values

Protein Prospector and Ways of Calculating Expectation Values Protein Prospector and Ways of Calculating Expectation Values 1/16 Aenoch J. Lynn; Robert J. Chalkley; Peter R. Baker; Mark R. Segal; and Alma L. Burlingame University of California, San Francisco, San

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Data Mining and Machine Learning in Bioinformatics

Data Mining and Machine Learning in Bioinformatics Data Mining and Machine Learning in Bioinformatics PRINCIPAL METHODS AND SUCCESSFUL APPLICATIONS Ruben Armañanzas http://mason.gmu.edu/~rarmanan Adapted from Iñaki Inza slides http://www.sc.ehu.es/isg

More information

Personalized Predictive Medicine and Genomic Clinical Trials

Personalized Predictive Medicine and Genomic Clinical Trials Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations

More information

Building a Collaborative Informatics Platform for Translational Research: Prof. Yike Guo Department of Computing Imperial College London

Building a Collaborative Informatics Platform for Translational Research: Prof. Yike Guo Department of Computing Imperial College London Building a Collaborative Informatics Platform for Translational Research: An IMI Project Experience Prof. Yike Guo Department of Computing Imperial College London Living in the Era of BIG Big Data : Massive

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information