msmseda LC-MS/MS Exploratory Data Analysis



Similar documents
Tutorial for proteome data analysis using the Perseus software platform

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Gene Expression Analysis

DeCyder Extended Data Analysis module Version 1.0

Quantitative proteomics background

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Hierarchical Clustering Analysis

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Exploratory data analysis (Chapter 2) Fall 2011

Multivariate Analysis of Ecological Data

Statistical Analysis Strategies for Shotgun Proteomics Data

Normalization of RNA-Seq

EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Principal Component Analysis

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

How To Cluster

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Un (bref) aperçu des méthodes et outils de fouilles et de visualisation de données «omics»

Automated Biosurveillance Data from England and Wales,

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

MarkerView Software for Metabolomic and Biomarker Profiling Analysis

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

Data, Measurements, Features

Practical Differential Gene Expression. Introduction

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

MASCOT Search Results Interpretation

MTH 140 Statistics Videos

Statistical issues in the analysis of microarray data

ProteinScape. Innovation with Integrity. Proteomics Data Analysis & Management. Mass Spectrometry

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

How To Run Statistical Tests in Excel

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Alignment and Preprocessing for Data Analysis

Fairfield Public Schools

Session 1. Course Presentation: Mass spectrometry-based proteomics for molecular and cellular biologists

The Scheduled MRM Algorithm Enables Intelligent Use of Retention Time During Multiple Reaction Monitoring

BioVisualization: Enhancing Clinical Data Mining

Exercise 1.12 (Pg )

Analysis of Bead-summary Data using beadarray

Exploratory Data Analysis

ProteinPilot Report for ProteinPilot Software

Increasing the Multiplexing of High Resolution Targeted Peptide Quantification Assays

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Quality Assessment of Exon and Gene Arrays

Proteomic Analysis using Accurate Mass Tags. Gordon Anderson PNNL January 4-5, 2005

Design of Experiments (DOE)

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Diagrams and Graphs of Statistical Data

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Data Exploration Data Visualization

A Streamlined Workflow for Untargeted Metabolomics

GENEGOBI : VISUAL DATA ANALYSIS AID TOOLS FOR MICROARRAY DATA

Learning Objectives:

Exploratory data analysis for microarray data

2. Filling Data Gaps, Data validation & Descriptive Statistics

Application Note # LCMS-81 Introducing New Proteomics Acquisiton Strategies with the compact Towards the Universal Proteomics Acquisition Method

ALLEN Mouse Brain Atlas

430 Statistics and Financial Mathematics for Business

SAS Software to Fit the Generalized Linear Model

Geostatistics Exploratory Analysis

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Scatter Plots with Error Bars

Classification and Regression by randomforest

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

The Open2Dprot Proteomics Project for n-dimensional Protein Expression Data Analysis

Introduction. Chapter Before you start Formulation

ASSURING THE QUALITY OF TEST RESULTS

Protein Protein Interaction Networks

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Environmental Remote Sensing GEOG 2021

Analysis of Variance. MINITAB User s Guide 2 3-1

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Data Visualization in R


Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF

Introduction to Proteomics

Paper No 19. FINALTERM EXAMINATION Fall 2009 MTH302- Business Mathematics & Statistics (Session - 2) Ref No: Time: 120 min Marks: 80

Step-by-Step Guide to Basic Expression Analysis and Normalization

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Package GEOquery. August 18, 2015

How To Use Statgraphics Centurion Xvii (Version 17) On A Computer Or A Computer (For Free)

Mass Spectrometry Signal Calibration for Protein Quantitation

Statistics in Medicine Research Lecture Series CSMC Fall 2014

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Analyzing data. Thomas LaToza D: Human Aspects of Software Development (HASD) Spring, (C) Copyright Thomas D. LaToza

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Exploratory Data Analysis with MATLAB

Exploratory Data Analysis

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Transcription:

msmseda LC-MS/MS Exploratory Data Analysis Josep Gregori, Alex Sanchez, and Josep Villanueva Vall Hebron Institute of Oncology & Statistics Dept. Barcelona University josep.gregori@gmail.com October 13, 2014 1 Introduction In biomarker discovery, label-free differential proteomics is based on comparing the expression of proteins between different biological conditions [1] [2]. The experimental design involved for these experiments requires of randomization and blocking [12]. Working with balanced blocks in which all biological conditions are measured by a number of technical and/or biological replicates, and within the shortest time window possible, helps to reduce experimental bias. Unfortunately, there are many factors that may bias the results in a systematic manner: different operators, different chromatographic columns, different protein digestions, an eventual repair of the LC-MS system, and different laboratory environment conditions. Blocking and randomization help to control to some extent these factors. However, even using the best experimental design some uncontrolled variables may still interfere with differential proteomics experiments. These uncontrolled variables may be responsible for the manifestation of batch effects. Effects which are usually evidenced when samples do not cluster by their biological condition using multidimensional unsupervised techniques such as Principal Components Analysis (PCA) or Hierarchical Clustering (HC) [3]. The most benign consequence of batch effects is an increase in the observed variability with a decreased sensitivity to detect biological differences. In the worst scenario, batch effects can mask completely the underlying biology in the experiment. Exploratory Data Analysis (EDA) helps in evidencing confounding factors and eventual outliers. In front of any -omics experiment it is always wise to perfom an EDA before any differential expression statistical analysis [4] [5]. The results of this exploratory analysis, visualized by PCA maps, HC trees, and heatmaps, will inform about the extent of batch effects and the presence of putative outliers. As a result the analyst may decide on the inclusion of a blocking factor to take into account the batch effects, or about the exclusion of a bad conditioned sample. 1

2 An example LC-MS/MS dataset The dataset of this example [5] is the result of an spiking experiment, showing LC- MS/MS data obtained in ideal conditions, and optimal to detect batch effects. Samples of 500 micrograms of a standard yeast lysate are spiked either with 200fm or 600fm of a complex mix of 48 human proteins (UPS1, Sigma-Aldrich ). The measures were done in two different runs, separated by a year time. The first run consisted of four replicates of each condition, and the second run consisted of three replicates of each condition. The dataset consists in an instance of the MSnSet class, defined in the MSnbase package [6], a S4 class [7] [8]. This MSnSet object contains a spectral counts (SpC) matrix in the assaydata slot, and factors treatment and batch in the phenodata slot. (See also the expressionset vignette [9]) > library(msmseda) > data(msms.dataset) > msms.dataset MSnSet (storagemode: lockedenvironment) assaydata: 697 features, 14 samples element names: exprs protocoldata: none phenodata samplenames: 502.1 502.2... U6.0302.3 (14 total) varlabels: treat batch varmetadata: labeldescription featuredata: none experimentdata: use 'experimentdata(object)' pubmedids: http://www.ncbi.nlm.nih.gov/pubmed/22588121 Annotation: - - - Processing information - - - MSnbase version: 1.8.0 > dim(msms.dataset) [1] 697 14 > head(pdata(msms.dataset)) treat batch 502.1 U200 2502 502.2 U200 2502 502.3 U200 2502 502.4 U200 2502 502.1 U600 2502 502.2 U600 2502 > table(pdata(msms.dataset)$treat) 2

U200 U600 7 7 > table(pdata(msms.dataset)$batch) 0302 2502 6 8 > table(pdata(msms.dataset)$treat, pdata(msms.dataset)$batch) 0302 2502 U200 3 4 U600 3 4 The aim of the exploratory data analysis, in this case, is to evidence the existence of batch effects between the two runs. And eventually to check the opportunity of a simple batch effects correction. Before proceeding to the EDA, a data pre-processing is required to solve NAs and to remove improper rows in the spectral counts matrix. The NAs, common when joining datasets in which not exactly the same proteins are identified, should be substituted by 0. By improper rows to be removed, we mean: i) the rows with all zeroes, which could come from the subsetting from a bigger SpC matrix. ii) The rows belonging to artefactual identifications of -R proteins. > e <- pp.msms.data(msms.dataset) > processingdata(e) - - - Processing information - - - Subset [697,14][675,14] Mon Oct 13 20:46:56 2014 Applied pp.msms.data preprocessing: Mon Oct 13 20:46:56 2014 MSnbase version: 1.8.0 > dim(e) [1] 675 14 > setdiff(featurenames(msms.dataset), featurenames(e)) [1] "YER160C-R" "YJR066W-R" "YEL061C-R" "YJL190C (1)" "YLL057C" [6] "YBR189W" "YDR227W" "YER103W" "YGR029W" "YNR032C-A" [11] "YMR172W" "YJL200C" "YDL126C" "YGL173C" "YML037C" [16] "YHR102W" "YMR165C" "YJL138C (1)" "YBL046W" "YJL123C" [21] "YOR123C" "YGR276C" 3

3 SpC distribution A first glance to the contents of the spectral counts matrix is given by the distribution of SpC by sample, including the number of proteins identified and the total spectral counts by sample. > tfvnm <- count.stats(e) > { cat("\nsample statistics after removing NAs and -R:\n\n") cat("spc matrix dimension:",dim(e),"\n\n") print(tfvnm) } Sample statistics after removing NAs and -R: SpC matrix dimension: 675 14 proteins counts min lwh med hgh max 502.1 590 5398 0 2 3 8.0 183 502.2 592 5501 0 2 3 7.0 205 502.3 586 5477 0 1 3 8.0 202 502.4 586 5251 0 1 3 7.0 203 502.1 582 5692 0 1 3 8.5 194 502.2 577 5686 0 1 3 8.0 208 502.3 578 5552 0 1 3 8.0 215 502.4 560 5601 0 1 3 8.0 217 U2.0302.1 512 5629 0 1 2 7.0 409 U2.0302.2 499 5840 0 0 2 7.0 384 U2.0302.3 513 5726 0 1 2 7.0 364 U6.0302.1 491 5975 0 0 2 7.5 395 U6.0302.2 474 5739 0 0 2 7.5 358 U6.0302.3 474 5891 0 0 2 8.0 355 Three graphical means will contribute to visualize this distribution: i) A barplot of total SpC by sample scaled to the median. Ideally when the same amount of total protein is measured in each sample, the same total number of spectral counts will be observed. A good quality experiment will show all the bars near 1. ii) A set of SpC boxplots by sample. As most proteins in a sample may show very low signal, 0 or 1 SpC, resulting in sparce SpC vectors, more informative boxplots are obtained when removing the values below a given threshold. Here the proteins with SpC below minspc are excluded of a sample. The boxplots show the distribution of the remaining log 2 transformed SpC values by sample. iii) Superposed SpC density plots by sample. As before the proteins with SpC below minspc are excluded of each sample and the SpC are log 2 transformed. 4

> layout(mat=matrix(1:2,ncol=1),widths=1,heights=c(0.35,0.65)) > spc.barplots(exprs(e),fact=pdata(e)$treat) > spc.boxplots(exprs(e),fact=pdata(e)$treat,minspc=2, main="ups1 200fm vs 600fm") 1.0 0.8 0.6 0.4 0.2 0.0 502.1 502.2 502.3 502.4 502.1 502.2 502.3 502.4 U2.0302.1 U2.0302.2 U2.0302.3 U6.0302.1 U6.0302.2 U6.0302.3 UPS1 200fm vs 600fm log2 SpC 2 4 6 8 10 min SpC >= 2 U200 U600 502.1 502.2 502.3 502.4 502.1 502.2 502.3 502.4 U2.0302.1 U2.0302.2 U2.0302.3 U6.0302.1 U6.0302.2 U6.0302.3 Figure 1: A) Total SpC by sample scaled to the median. B) Samples SpC boxplots. 5

> spc.densityplots(exprs(e),fact=pdata(e)$treat,minspc=2, main="ups1 200fm vs 600fm") UPS1 200fm vs 600fm Density 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 min SpC >= 2 U200 U600 0 2 4 6 8 10 log2 SpC Figure 2: Samples SpC density plots. 6

4 Principal Components Analysis A plot on the two principal components of the SpC matrix visualizes the clustering of samples. Ideally the samples belonging to the same condition should cluster together. Any mixing of samples of different conditions may indicate the influence of some confounding factors. PC2 6.67 % var 100 50 0 50 100 100 0 100 200 PC1 PCA 84.14 batch % var PC2 6.67 % var 100 50 0 50 100 100 0 100 200 PC1 84.14 % var Figure 3: PCA plot on PC1/PC2, showing confounding. The labels are colored by treatment level on top, and by batch number bellow. Labels themselves are selfexplicative of treatment condition and batch. > facs <- pdata(e) > snms <- substr(as.character(facs$treat),1,2) > snms <- paste(snms,as.integer(facs$batch),sep=".") > pcares <- counts.pca(e) 7

> smpl.pca <- pcares$pca > { cat("principal components analisis on the raw SpC matrix\n") cat("variance of the first four principal components:\n\n") print(summary(smpl.pca)$importance[,1:4]) } Principal components analisis on the raw SpC matrix Variance of the first four principal components: PC1 PC2 PC3 PC4 Standard deviation 163.82011 46.11827 32.26952 22.75380 Proportion of Variance 0.84136 0.06668 0.03265 0.01623 Cumulative Proportion 0.84136 0.90804 0.94068 0.95691 Note how in these plots the samples tend to cluster by batch instead of by treatment, this is evidence of a confounding factor. Something uncontrolled contributes globally to the results in a higer extend than the treatment itself. 5 Hierarchical clustering The hierarchical clustering of samples offers another view of the same phenomenon: > counts.hc(e,facs=pdata(e)[, "treat", drop = FALSE]) HC treat U2.0302.3 U2.0302.2 U2.0302.1 U6.0302.3 U6.0302.2 U6.0302.1 502.4 502.3 502.2 502.1 502.4 502.3 502.2 502.1 400 300 200 100 0 Figure 4: Hirearchical clustering of samples, showing confounding. The labels are colored by treatment level. 8

> counts.hc(e,facs=pdata(e)[, "batch", drop = FALSE]) HC batch U2.0302.3 U2.0302.2 U2.0302.1 U6.0302.3 U6.0302.2 U6.0302.1 502.4 502.3 502.2 502.1 502.4 502.3 502.2 502.1 400 300 200 100 0 Figure 5: Hirearchical clustering of samples, showing confounding. The labels are colored by batch number. 6 Heatmap A heatmap may be more informative than the dendrogram of samples, in the sense that it allows to identify the proteins most sensitive to this confounding. In this case we need a heatmap heigh enough to allow room for all the protein names. The function counts.heatmap provides two heatmaps, the first one is fitted on an A4 page and offers a general view, the second one is provided in a separate pdf file and is drawn with 3mm high rows to allow a confortable identification of each protein and its expression profile in the experiment. 9

> counts.heatmap(e,etit="ups1",fac=pdata(e)[, "treat"]) U6.0302.2 U6.0302.3 U6.0302.1 U2.0302.2 U2.0302.3 U2.0302.1 502.4 502.3 502.4 502.3 502.1 502.2 502.2 502.1 Figure 6: Global view heatmap. The column color bar is colored as per treatment levels. 7 Batch effects correction When counfounding is detected due to batch effects, as in this case, and the runs are balanced in the two conditions to be compared, blocking may help in the reduction of the residual variance, improving the sensitivity of the statistical tests. When dealing with spectral counts the usual model for the differential expression tests is a Generalized Linear Model (GLM) [10] based on the Poisson distribution, the negative binomial, or the quasi-likelihood, and these models admit blocking as the usual ANOVA [11] when dealing with normally distributed continous data. The visualization of the influence of a batch effects correction, in the exploratory data analysis step, is easily carried out by the so called mean centering approach [4], by which the centers of each batch are made to coincide concealing the observed bias between the different batches. The PCA on the batch mean centered expression matrix will then show the level of improvement. > data(msms.dataset) 10

> msnset <- pp.msms.data(msms.dataset) > spcm <- exprs(msnset) > fbatch <- pdata(msnset)$batch > spcm2 <- batch.neutralize(spcm, fbatch, half=true, sqrt.trans=true) > ### Plot the PCA on the two first PC, and colour by treatment level > ### to visualize the improvement. > exprs(msnset) <- spcm2 > facs <- pdata(e) > snms <- substr(as.character(facs$treat),1,2) > snms <- paste(snms,as.integer(facs$batch),sep=".") > par(mar=c(4,4,0.5,2)0.1) > counts.pca(msnset, facs=facs$treat, do.plot=true, snms=snms) PC2 23.14 % var 60 40 20 0 20 40 50 0 50 PC1 39.76 % var Figure 7: PCA plot on the batch mean centered expression matrix, showing now a better clustering by treatment level. This plots shows a clear improvement in the clustering of samples by treatment condition, and suggest that a model with batch as block factor will give better results than a model just including the treatment factor. The incidence of the correction may be evaluated as: > ### Incidence of the correction > summary(as.vector(spcm-spcm2)) Min. 1st Qu. Median Mean 3rd Qu. Max. -115.60000-0.47570-0.00694 0.21000 0.88060 137.90000 > plot(density(as.vector(spcm-spcm2))) 11

8 Dispersion The simplest distribution used to explain the observed counts in sampling is the Poisson distribution. With this distribution the variance is equal to the mean, so that the dispersion coefficient -the ratio variance to mean- is one. When there are other sources of variation, apart of the sampling, such as the usual variability among biological replicates, we talk of overdispersion. In this situation the coefficient of dispersion is greater than one and the Poisson distribution is unable to explain this extra source of variance. Alternative GLM models able to explain overdispersion are based on the negative binomial distribution, or on the quasilikelihood [10]. An EDA of a dataset based on counts should include an exploration of the residual coefficients of dispersion for each of the factors in the experimental design. This will help in deciding the model to use in the inference step. The disp.estimates function plots the distribution of residual dispersion coefficients, and the scatterplot of residual variances vs mean SpC for each of the factors in the parameter facs, if this parameter is NULL then the factors are taken as default from the phenodata slot of the MSnSet object. > dsp <- disp.estimates(e) > signif(dsp,4) 0.25 0.5 0.75 0.9 0.95 0.99 1 treat 0.4471 0.7500 1.111 1.677 2.474 12.97 67.210 batch 0.2619 0.4264 0.689 1.279 2.071 5.64 8.273 This function returns silently the quartiles and the quantiles at 0.9, 0.95 and 0.99 of the residual dispersion of each factor. With technical replicates it is not uncommon to observe most of the dispersion coefficients lower that one. This is a situation of underdispersion, most likely due to the competitive sampling at the MS system entrance. 12

Density 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Log SpC variance 1 0 1 2 N = 675 Bandwidth = 0.07795 Residual dispersion factor batch 1 0 1 2 3 Log SpC mean Figure 8: Residual dispersion density plot, and residual variance vs mean scatterplot in log10 scale, of the batch factor. 13

9 Informative features In the border between EDA and inference we may explore the number and distribution of informative features. We mean by informative features those proteins showing a high enough signal in the most abundant condition and with an absolute log fold change between treatment levels above a given threshold. The following plot shows in red the features with signal not bellow 2 SpC in the most abundant condition, and with minlfc of 1. To improve the plot two transformations are available, either log2 or sqrt, although none is also accepted. > spc.scatterplot(spcm2,facs$treat,trans="sqrt",minspc=2,minlfc=1, main="ups1 200fm vs 600fm") UPS1 200fm vs 600fm mean SpC(U600) 0 5 10 15 0 5 10 15 mean SpC(U200) Figure 9: Scatterplot with informative features in red, showing the borders of fold change 2 and 0.5 as blue dash-dot lines. 14

References [1] Mallick P., Kuster B. Proteomics: a pragmatic perspective. Nat Biotechnol 2010;28:695-709. [2] Neilson K.A., Ali N.A., Muralidharan S., Mirzaei M., Mariani M., Assadourian G., et al. Less label, more free: approaches in label-free quantitative mass spectrometry. Proteomics 2011;11:535-53. [3] Husson F., Le S., Pages J. Exploratory Multivariate Analysis by Example Using R. CRC Press; 2010. [4] Luo J., Schumacher M., Scherer A., Sanoudou D., Megherbi D., Davison T., et al. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 2010, 10(4): 278-291 [5] Gregori J., Villareal L., Mendez O., Sanchez A., Baselga J., Villanueva J., Batch effects correction improves the sensitivity of significance tests in spectral countingbased comparative discovery proteomics, Journal of Proteomics, 2012, 75, 3938-3951 [6] Laurent Gatto and Kathryn S. Lilley, MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation, Bioinformatics 28(2), 288-289 (2012). [7] Chambers J.M. Software for data analysis: programming with R, 2008 Springer [8] Genolini C. A (Not So) Short Introduction to S4 (2008) [9] Falcon S., Morgan M., Gentleman R. An Introduction to Bioconductor s Expression- Set Class (2007) [10] Agresti A., Categorical Data Analysis, Wiley-Interscience, Hoboken NJ, 2002 [11] Kutner M.H., Nachtsheim C.J., Neter J., Li W., Applied Linear Statistical Models, Fifth Edition, McGraw-Hill Intl. Edition 2005 [12] Quinn G.P., Keough M.J. Experimental Design and Data Analysis for Biologists Cambridge University Press; 1st Edition, Cambridge 2002. 15