Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A. Irizarry, Department of Biostatistics, Johns Hopkins University


Outline
- Overview
- Bioconductor Project
- Example 1: Gene Annotation
- Example 2: Preprocessing Affymetrix Array Data

Contact Information
e-mail: rafa@jhu.edu
Personal webpage: http://www.biostat.jhsph.edu/~ririzarr
Department webpage: http://www.biostat.jhsph.edu/
Bioinformatics Program: http://www.biostat.jhsph.edu/bioinfo
Bioconductor: http://www.bioconductor.org

Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> Image analysis
-> Preprocessing (normalization)
-> Estimation, testing, clustering, discrimination
-> Biological verification and interpretation

Bioconductor
- Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data.
- The project was started in the Fall of 2001 and includes 23 core developers in the US, Europe, and Australia.
- R and the R package system are used to design and distribute software.
- ArrayAnalyzer: commercial port of Bioconductor packages in S-Plus.

R
What sorts of things is R good at?
- Many statistical and machine learning algorithms
- Good visualization capabilities
- Possible to write scripts that can be reused
- R is largely platform independent: Unix, Windows, OSX
- R has an active user community
- It's open source and free!
- R is a real computer language
- Supports many data technologies: XML, DBI, SOAP
- Interacts with other languages: C, Perl, Python, Java
- Sophisticated package creation and distribution system
- S-Plus is a commercial implementation of the S language, and R is an open source implementation

Gene Annotation
Example: metadata package hgu95av2, mappings between different gene IDs.
AffyID: 41046_s_at
ACCNUM: X95808
GENENAME: zinc finger protein 261
SYMBOL: ZNF261
LOCUSID: 9203
MAP: Xq13.1
PMID: 10486218, 9205841, 8817323
GO: GO:0003677, GO:0007275, GO:0016021
+ many other mappings
- Assemble and process genomic annotation data from public repositories.
- Build annotation data packages or XML data documents.
- Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed.
- Process and store query results, e.g., search PubMed abstracts.
- Generate HTML reports of analyses.
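The idea of an annotation metadata package can be illustrated with a small lookup table. The sketch below is a plain Python stand-in, not the Bioconductor hgu95av2 interface: the dictionary layout and the helper names `lookup` and `reverse_map` are assumptions made for illustration, and only the single record shown on the slide is included.

```python
# Toy stand-in for an annotation metadata package (hypothetical layout):
# one record per Affymetrix probe-set ID, using the values from the slide.
ANNOTATION = {
    "41046_s_at": {
        "ACCNUM": "X95808",
        "GENENAME": "zinc finger protein 261",
        "SYMBOL": "ZNF261",
        "LOCUSID": "9203",
        "MAP": "Xq13.1",
        "PMID": ["10486218", "9205841", "8817323"],
        "GO": ["GO:0003677", "GO:0007275", "GO:0016021"],
    }
}


def lookup(affy_id, field):
    """Return one annotation field for a probe-set ID, or None if unknown."""
    record = ANNOTATION.get(affy_id)
    return None if record is None else record.get(field)


def reverse_map(field):
    """Invert a single-valued field, e.g. gene symbol -> probe-set IDs."""
    inverted = {}
    for affy_id, record in ANNOTATION.items():
        inverted.setdefault(record[field], []).append(affy_id)
    return inverted


print(lookup("41046_s_at", "SYMBOL"))
print(reverse_map("LOCUSID"))
```

A real annotation package holds one such mapping per identifier type for every probe set on the chip; the reverse mapping is what lets results keyed by probe set be reported by gene symbol or LocusLink ID.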

Preprocessing Illustrative example: Detecting differentially expressed genes

Affymetrix GeneChip Design
Reference sequence (5' to 3'): TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT
Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)
Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)
NSB = nonspecific binding, SB = specific binding.
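The two probes on this slide differ only at the middle (13th) base, where the mismatch carries the complement of the perfect-match base. A minimal sketch of that relationship follows; the function name `mismatch_probe` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


pm = "GTACTACCCAGTCTTCCGGAGGCTA"
mm = mismatch_probe(pm)
print(mm)  # GTACTACCCAGTGTTCCGGAGGCTA
```

Applied to the PM sequence on the slide, this reproduces the MM sequence shown there.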

Preprocessing
- Typically we want one measure of expression for each gene on each array
- About 20K genes, each represented by 11 probe pairs (PM & MM intensities)
- Obtain an expression measure for each gene on each array by summarizing these pairs
- Background adjustment and normalization are important issues
- Affymetrix offers MAS 5.0 as a solution

Software Infrastructure
[Diagram: the AffyBatch class. Experimental data: probe intensities (probes x arrays, from CEL files). Annotation: covariate information (arrays x covariates, MIAME). Meta data: probe properties (CDF packages).]

Why normalize? Compliments of Ben Bolstad

Default Procedure (MAS 5.0)
$\text{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM_j^{*})\}$
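To make the summary step concrete, here is a sketch of a one-step Tukey biweight average applied to log2 probe-pair differences. It assumes the commonly described tuning constants (c = 5, epsilon = 0.0001) and uses made-up intensities in which every PM exceeds its MM, so the MAS 5.0 adjustment of MM is not reproduced here.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    total_w = 0.0
    total_wx = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)          # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w


# Hypothetical probe-pair intensities for one probe set (not real data).
pm = [520.0, 610.0, 480.0, 5000.0, 550.0]
mm = [120.0, 150.0, 100.0, 200.0, 130.0]
log_diffs = [math.log2(p - m) for p, m in zip(pm, mm)]
print(tukey_biweight(log_diffs))
```

The fourth pair is far from the others on the log scale and receives weight zero, so the summary stays close to the bulk of the probe pairs.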

Sometimes MM is larger than PM

Especially for large PM


Can this be improved?


Why so much noise?
- The default algorithm seems to be inspired by the following deterministic model for background: $PM = O + N + S$, $MM = O + N$, so $PM - MM = S$ (O: optical noise, N: nonspecific binding, S: specific signal)
- And a multiplicative error model for signal (they take the log before averaging)

Deterministic model is wrong
- Does MM measure nonspecific binding?
- Look at yeast DNA hybridized to a human chip
- Look at the PM vs. MM log-scale scatter plot
- $R^2$ is only 0.5

Stochastic Model (additive background / multiplicative error)
- $PM = O_{PM} + N_{PM} + S$, $MM = O_{MM} + N_{MM}$
- $\log(N_{PM}), \log(N_{MM}) \sim$ Bivariate Normal ($\rho \approx 0.7$)
- $S = \exp(s + a + \varepsilon)$, where $s$ is the quantity of interest (log-scale expression)
- $E[PM - MM] = S$, but $\mathrm{Var}[\log(PM - MM)] \sim 1/S^2$ (can be very large)
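A small simulation can illustrate why subtracting MM is noisy at low signal under this kind of model. The sketch below is a simplification: the signal S is held fixed rather than drawn as exp(s + a + ε), the optical noise is a constant, and all parameter values (mu, sd, optical, the two signal levels) are hypothetical choices, not estimates from data.

```python
import math
import random


def simulate_pairs(signal, n=5000, rho=0.7, mu=5.0, sd=1.0, optical=30.0, seed=0):
    """Draw (PM, MM) pairs with correlated log-normal nonspecific binding."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # z2 has unit variance and correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pm = optical + math.exp(mu + sd * z1) + signal
        mm = optical + math.exp(mu + sd * z2)
        pairs.append((pm, mm))
    return pairs


def summarize(pairs):
    """Fraction of pairs with MM >= PM, and variance of log2(PM - MM) for the rest."""
    diffs = [pm - mm for pm, mm in pairs]
    frac_mm_larger = sum(d <= 0 for d in diffs) / len(diffs)
    logs = [math.log2(d) for d in diffs if d > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / (len(logs) - 1)
    return frac_mm_larger, var


frac_low, var_low = summarize(simulate_pairs(signal=10.0))
frac_high, var_high = summarize(simulate_pairs(signal=10000.0))
print(frac_low, var_low)
print(frac_high, var_high)
```

With a weak signal a large share of pairs have MM at least as large as PM, and the log of the remaining differences is highly variable; with a strong signal both problems largely disappear.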

Does it make a difference?
Ranks: 1, 270, 2074, 3063, 3935, 4639, 4652, 5149, 5372, 5947, 6448, 6870, 7037, 7549, 8429, 9721

RMA: Our first attempt
Ranks: 1, 2, 3, 4, 6, 7, 10, 16, 45, 56, 58, 88, 406, 999, 1643, 2739

Can RMA be improved?
RMA attenuates signal slightly to achieve gains in precision.
Method    Slope
MAS 5.0   0.69
RMA       0.61

Probe Specific Effect
- To improve RMA we needed to account for probe-specific background effects
- Our first attempt was to use GC-content
- Others have noticed probe-specific SB effects
- We can extend these ideas to NSB

Predict NSB with sequence
- Fit a simple linear model to yeast-on-human data to obtain base/position effects (Naef and Magnasco)
- Call these affinities and use them to obtain parameters for the background model
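One way to read "base/position effects" is as an additive score: each base at each position along the 25-mer contributes one coefficient, and the probe affinity is their sum. The sketch below only evaluates such a score; it does not fit the linear model, and the coefficient table is a hypothetical placeholder with a single nonzero entry, not fitted values.

```python
PROBE_LENGTH = 25


def affinity(sequence, effects):
    """Sum the base/position effects along a probe sequence.

    effects[base][k] is the contribution of `base` at position k (0-based).
    """
    return sum(effects[base][k] for k, base in enumerate(sequence))


# Placeholder coefficients (NOT fitted): every effect is zero except
# a C at the middle position, chosen only to show how position matters.
toy_effects = {base: [0.0] * PROBE_LENGTH for base in "ACGT"}
toy_effects["C"][12] = 1.0

pm = "GTACTACCCAGTCTTCCGGAGGCTA"
mm = "GTACTACCCAGTGTTCCGGAGGCTA"
print(affinity(pm, toy_effects), affinity(mm, toy_effects))
```

Because PM and MM differ at the middle base, any model with a middle-position effect assigns them different affinities, which is one reason MM need not behave as a pure background measurement.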

Does it help?
- Accuracy of expression measures improves
- Precision is a bit worse, but not bad

Also explains the MM thing (MM sometimes larger than PM)

Acknowledgements
Ben Bolstad, Leslie Cope, Sandrine Dudoit, Laurent Gautier, Robert Gentleman, Wolfgang Huber, Christina Kendziorski, James MacDonald, Francisco Martínez-Murillo, Felix Naef, Marcelo Magnasco, Forrest Spencer, Terry Speed, Jean Yang, Zhijin Wu

Supplemental Slides

Does it help?

Other Good Uses: RMA
- This background adjustment is used to define an alternative algorithm: the Robust Multi-array Analysis (RMA)
- Quantile normalization is used
- To combine the various probe intensities, a log-scale probe-level additive model is fit robustly: $\log_2(PM^{*}_{ij}) = a_i + b_j + \varepsilon_{ij}$
- RMA = estimate of $a_i$ for chip $i$
- $b_j$ represents the probe effect
- Default robust procedure is median polish
- More details: Irizarry et al., Biostatistics (2003)
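The two steps named on this slide can be sketched in a few lines. This is an illustration in plain Python, not the Bioconductor implementation: ties in quantile normalization are broken arbitrarily, the number of median polish sweeps is fixed, and the small data matrices are made up.

```python
from statistics import median


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def median_polish(y, n_iter=10):
    """Fit y[i][j] ~ overall + row[i] + col[j] by alternating median sweeps."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):                      # sweep out row medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [r - m for r in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):                      # sweep out column medians
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Rows are arrays (chips) i, columns are probes j of one probe set, log2 scale.
log_pm = [[7.0, 8.0, 9.0, 8.5, 7.5],
          [8.0, 9.0, 10.0, 9.5, 8.5],
          [9.0, 10.0, 11.0, 10.5, 9.5]]
overall, row_eff, col_eff, resid = median_polish(log_pm)
expression = [overall + r for r in row_eff]  # estimate of a_i for each chip
print(expression)
```

In this reading, the row effects play the role of the chip-level expression values and the column effects play the role of the probe effects, with the probe effects centered to have median zero.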

The Probe Effect

Other pseudo-chip images
[Figure panels: weights; residuals; positive residuals; negative residuals]

Why background correct?

Practical Consequences


Why use log?
[Figure panels: original scale; log scale]

Why can we not ignore NSB?
- The data shown are from a calibration experiment
- NSB causes bias
- $(E_1 + K)/(E_2 + K) \approx E_1/E_2$ if $E_1, E_2$ are large
- $(E_1 + K)/(E_2 + K) \approx 1$ if $E_1, E_2$ are small
- We are faced with a bias/variance trade-off problem
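A short numerical example shows the attenuation. The background level K = 100 and the expression values are hypothetical; in both cases the true fold change is 2.

```python
def observed_fold_change(e1, e2, k):
    """Fold change seen when a common background k is added to both signals."""
    return (e1 + k) / (e2 + k)


K = 100.0  # hypothetical nonspecific-binding background
low = observed_fold_change(2.0, 1.0, K)            # about 1.01
high = observed_fold_change(20000.0, 10000.0, K)   # about 1.99
print(low, high)
```

At low expression the observed ratio is pulled almost all the way to 1, while at high expression it stays close to the true value of 2.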

Probe effect
- This strong probe effect will result in very high correlation between replicates. Do not get too excited.
- Look at the correlation or variance of relative expression (log FC) instead.

Alternative background adjustment
- Use this stochastic model
- Minimize the MSE: $E\left[\left\{\log\left(\hat{s}/s\right)\right\}^2 \mid S > 0, PM, MM\right]$
- To do this we need to specify distributions for the different components
- Notice this is probe-specific, so we need to borrow strength
* These parametric distributions were chosen to provide a closed form solution

Alternative background adjustment
- Model observed PM as the sum of a signal intensity S and a background intensity B: $PM = B + S$
- For convenience*, it is assumed that S is Exponential($\alpha$) and B is Normal($\mu$, $\sigma^2$), with S and B independent
- Background-adjusted PM are then $E[S \mid PM]$
- Because expectation minimizes MSE, we avoid exaggerated variance
- Plug-in estimates of $\alpha$, $\mu$, and $\sigma^2$ are used
- Notice we can use only PM and make arrays half as expensive
* These parametric distributions were chosen to provide a closed form solution
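The closed form can be sketched directly. Assuming α is the rate of the exponential (mean 1/α) and B is an unrestricted normal, the posterior of S given PM = x is a normal with mean a = x − μ − σ²α and standard deviation σ, truncated at zero, so its mean is a + σ·φ(a/σ)/Φ(a/σ). The parameter values below are hypothetical, and the estimation of α, μ, and σ is not shown.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    # erfc keeps some precision in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def background_adjust(pm, alpha, mu, sigma):
    """E[S | PM = pm] for S ~ Exponential(rate alpha), B ~ Normal(mu, sigma^2)."""
    a = pm - mu - sigma ** 2 * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)


# Hypothetical parameter values, for illustration only.
alpha, mu, sigma = 0.01, 100.0, 20.0
for pm in (50.0, 100.0, 200.0, 1000.0):
    print(pm, background_adjust(pm, alpha, mu, sigma))
```

The adjusted value is always positive, even when PM is below the background mean, and for large PM it approaches PM − μ − σ²α, so the log can be taken without the variance blow-up of a plain subtraction.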

Spike-in Experiment
- Replicate RNA was hybridized to various arrays
- Some probe sets were spiked in at different concentrations across the different arrays
- This gives us a way to assess precision and accuracy

Spike-in Experiment
Spike-in concentration of each probe set (columns 1-16) on each array (rows A-T):
Array    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
A        0 0.25  0.5    1    2    4    8   16   32   64  128    0  512 1024  256   32
B     0.25  0.5    1    2    4    8   16   32   64  128  256 0.25 1024    0  512   64
C      0.5    1    2    4    8   16   32   64  128  256  512  0.5    0 0.25 1024  128
D        1    2    4    8   16   32   64  128  256  512 1024    1 0.25  0.5    0  256
E        2    4    8   16   32   64  128  256  512 1024    0    2  0.5    1 0.25  512
F        4    8   16   32   64  128  256  512 1024    0 0.25    4    1    2  0.5 1024
G        8   16   32   64  128  256  512 1024    0 0.25  0.5    8    2    4    1    0
H       16   32   64  128  256  512 1024    0 0.25  0.5    1   16    4    8    2 0.25
I       32   64  128  256  512 1024    0 0.25  0.5    1    2   32    8   16    4  0.5
J       64  128  256  512 1024    0 0.25  0.5    1    2    4   64   16   32    8    1
K      128  256  512 1024    0 0.25  0.5    1    2    4    8  128   32   64   16    2
L      256  512 1024    0 0.25  0.5    1    2    4    8   16  256   64  128   32    4
M      512 1024    0 0.25  0.5    1    2    4    8   16   32  512  128  256   64    8
N      512 1024    0 0.25  0.5    1    2    4    8   16   32  512  128  256   64    8
O      512 1024    0 0.25  0.5    1    2    4    8   16   32  512  128  256   64    8
P      512 1024    0 0.25  0.5    1    2    4    8   16   32  512  128  256   64    8
Q     1024    0 0.25  0.5    1    2    4    8   16   32   64 1024  256  512  128   16
R     1024    0 0.25  0.5    1    2    4    8   16   32   64 1024  256  512  128   16
S     1024    0 0.25  0.5    1    2    4    8   16   32   64 1024  256  512  128   16
T     1024    0 0.25  0.5    1    2    4    8   16   32   64 1024  256  512  128   16

NSB: Practical Consequences
- The data shown here come from spike-in experiments used for calibration
- NSB causes fold-change attenuation at low expression levels
- $(E_1 + K)/(E_2 + K) \approx E_1/E_2$ if $E_1, E_2$ are large
- $(E_1 + K)/(E_2 + K) \approx 1$ if $E_1, E_2$ are small