Estimating batch effect in Microarray data with Principal Variance Component Analysis(PVCA) method



Similar documents
netresponse probabilistic tools for functional network analysis

Creating a New Annotation Package using SQLForge

GEOsubmission. Alexandre Kuhn May 3, 2016

HowTo: Querying online Data

Using the generecommender Package

R4RNA: A R package for RNA visualization and analysis

Analysis of Bead-summary Data using beadarray

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor

EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq

Practical Differential Gene Expression. Introduction

Introduction to robust calibration and variance stabilisation with VSN

Gene2Pathway Predicting Pathway Membership via Domain Signatures

mmnet: Metagenomic analysis of microbiome metabolic network

Monocle: Differential expression and time-series analysis for single-cell RNA-Seq and qpcr experiments

HMMcopy: A package for bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

Visualising big data in R

Gene Expression Analysis

Exploratory data analysis for microarray data

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

How To Cluster

Principal Component Analysis

SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants

Subspace Analysis and Optimization for AAM Based Face Alignment

A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package.

Tutorial for proteome data analysis using the Perseus software platform

Lecture 5 : The Poisson Distribution

Exploratory data analysis (Chapter 2) Fall 2011

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Factor Analysis. Chapter 420. Introduction

Scatter Plots with Error Bars

an introduction to VISUALIZING DATA by joel laumans

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

Assessing Measurement System Variation

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

3-Step Competency Prioritization Sequence

Introduction to Principal Components and FactorAnalysis

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

A Toolbox for Bicluster Analysis in R

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Tutorial on Exploratory Data Analysis

Exercise with Gene Ontology - Cytoscape - BiNGO

A Demonstration of Hierarchical Clustering

Systat: Statistical Visualization Software

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Directions for using SPSS

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Component Ordering in Independent Component Analysis Based on Data Power

Supervised Feature Selection & Unsupervised Dimensionality Reduction

# load in the files containing the methyaltion data and the source # code containing the SSRPMM functions

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Quality Assessment of Exon and Gene Arrays

Linear Discriminant Analysis

Normalization of RNA-Seq

Package survpresmooth

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

STAT355 - Probability & Statistics

ALLEN Mouse Brain Atlas

Classification and Regression by randomforest

Simple Linear Regression Inference

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Feature Selection for Stock Market Analysis

Package tagcloud. R topics documented: July 3, 2015

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Chapter 6: Constructing and Interpreting Graphic Displays of Behavioral Data

Computational Assignment 4: Discriminant Analysis

Package GEOquery. August 18, 2015

Statistical Models in R

Getting started with qplot


Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Categorical Data Visualization and Clustering Using Subjective Factors

Guidance for Industry

Dimensionality Reduction: Principal Components Analysis

CS171 Visualization. The Visualization Alphabet: Marks and Channels. Alexander Lex [xkcd]

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Common factor analysis

Principles of Data Visualization

Multivariate Normal Distribution

VALIDATION OF ANALYTICAL PROCEDURES: TEXT AND METHODOLOGY Q2(R1)

Package smoothhr. November 9, 2015

Principal components analysis

CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

GUIDELINES FOR THE VALIDATION OF ANALYTICAL METHODS FOR ACTIVE CONSTITUENT, AGRICULTURAL AND VETERINARY CHEMICAL PRODUCTS.

Mass Spectrometry Signal Calibration for Protein Quantitation

Principal Component Analysis

Transcription:

Estimating batch effect in Microarray data with Principal Variance Component Analysis(PVCA) method Pierre R. Bushel, Jianying Li April 16, 2015 Contents 1 Introduction 1 2 Installation 2 3 An example run on published Golub data 2 4 Session 3 1 Introduction Often times batch effects are present in microarray data due to any number of factors, including e.g. a poor experimental design or when the gene expression data is combined from different studies with limited standardization. To estimate the variability of experimental effects including batch, a novel hybrid approach known as principal variance component analysis (PVCA) has been developed. The approach leverages the strengths of two very popular data analysis methods: first, principal component analysis (PCA) is used to efficiently reduce data dimension with maintaining the majority of the variability in the data, and variance components analysis (VCA) fits a mixed linear model using factors of interest as random effects to estimate and partition the total variability. The PVCA approach can be used as a screening tool to determine which sources of variability 1

(biological, technical or other) are most prominent in a given microarray data set. Using the eigenvalues associated with their corresponding eigenvectors as weights, associated variations of all factors are standardized and the magnitude of each source of variability (including each batch effect) is presented as a proportion of total variance. Although PVCA is a generic approach for quantifying the corresponding proportion of variation of each effect, it can be a handy assessment for estimating batch effect before and after batch normalization. The pvca package implements the method described in the book Batch Effects and Noise in Microarray Experiment, chapter 12 Principal Variance Components Analysis: Estimating Batch Effects in Microarray Gene Expression Data : Jianying Li, Pierre R Bushel, Tzu-Ming Chu, and Russell D Wolfinger 2010 The pvca method was applied in the paper: Michael J Boedigheimer, Russell D Wolfinger, Michael B Bass, Pierre R Bushel, Jeff W Chou, Matthew Cooper, J Christopher Corton, Jennifer Fostel, Susan Hester, Janice S Lee, Fenglong Liu, Jie Liu, Hui- Rong Qian, John Quackenbush, Syril Pettit and Karol L Thompson11 2008 Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories BMC Genomics 2008, June 12, 9:285 2 Installation Simply skip this section if one has been familiar with the usual Bioconductor installation process. Assume that a recent version of R has been correctly installed. Install the packages from the Bioconductor repository, using the bioclite utility. Within R console, type > source("http://bioconductor.org/bioclite.r") > bioclite("pvca") Installation using the bioclite utility automatically handles the package dependencies. The pvca package depends on the packages like lme4 etc., which can be installed when pvca package is stalled. 3 An example run on published Golub data We use Golub dataset in the package golubesets as an example to illustrate the PVCA batch effect estimation procedure. This dataset contains 7129 genes from Microarray data on 72 samples from a leukemia study. It is a merged dataset and we are testing 2

variability from each factor and their two-ways interactions. The pacakge performs PVCA on this merged data and produces an estimation of the proportion (as possible batch effect) each factor and interaction contribute. The figure shows the proportion in a bar chart.. > library(golubesets) > library(pvca) > data(golub_merge) > pct_threshold <- 0.6 > batch.factors <- c("all.aml", "BM.PB", "Source") > pvcaobj <- pvcabatchassess (Golub_Merge, batch.factors, pct_threshold) > We can plot the source of potential batch effect in proporting as shown in Figure 1. > bp <- barplot(pvcaobj$dat, xlab = "Effects", + ylab = "Weighted average proportion variance", + ylim= c(0,1.1),col = c("blue"), las=2, + main="pvca estimation bar chart") > axis(1, at = bp, labels = pvcaobj$label, xlab = "Effects", cex.axis = 0.5, las=2) > values = pvcaobj$dat > new_values = round(values, 3) > text(bp,pvcaobj$dat,labels = new_values, pos=3, cex = 0.8) > 4 Session > print(sessioninfo()) R version 3.2.0 (2015-04-16) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C 3

PVCA estimation bar chart 1.0 Weighted average proportion variance 0.8 0.6 0.4 0.673 0.2 0.143 0.0 0.031 0.009 0.044 0.057 0.044 BM.PB:Source ALL.AML:Source ALL.AML:BM.PB Source Effects BM.PB ALL.AML Residual Figure 1: The bar chart shows the proportion of batch effect from possible source. 4

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grdevices utils datasets methods [8] base other attached packages: [1] pvca_1.8.0 golubesets_1.9.1 Biobase_2.28.0 [4] BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] Rcpp_0.11.5 lattice_0.20-31 MASS_7.3-40 [4] grid_3.2.0 nlme_3.1-120 affy_1.46.0 [7] BiocInstaller_1.18.1 zlibbioc_1.14.0 minqa_1.2.4 [10] affyio_1.36.0 nloptr_1.0.4 limma_3.24.0 [13] Matrix_1.2-0 preprocesscore_1.30.0 splines_3.2.0 [16] lme4_1.1-7 tools_3.2.0 vsn_3.36.0 5