Software and Methods for the Analysis of Affymetrix GeneChip Data Rafael A Irizarry Department of Biostatistics Johns Hopkins University
Outline Overview Bioconductor Project Example 1: Gene Annotation Example 2: Preprocessing Affymetrix Array Data
Contact Information
e-mail: rafa@jhu.edu
Personal webpage: http://www.biostat.jhsph.edu/~ririzarr
Department webpage: http://www.biostat.jhsph.edu/
Bioinformatics Program: http://www.biostat.jhsph.edu/bioinfo
Bioconductor: http://www.bioconductor.org
Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Preprocessing (Normalization) → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation
Bioconductor Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data. The project was started in the Fall of 2001 and includes 23 core developers in the US, Europe, and Australia. R and the R package system are used to design and distribute software. ArrayAnalyzer: Commercial port of Bioconductor packages in S-Plus.
R What sorts of things is R good at? Many statistical and machine learning algorithms. Good visualization capabilities. Possible to write scripts that can be reused. R is largely platform independent: Unix; Windows; OSX. R has an active user community. It's open source and free! R is a real computer language. Supports many data technologies: XML, DBI, SOAP. Interacts with other languages: C; Perl; Python; Java. Sophisticated package creation and distribution system. S-Plus is a commercial implementation of the S language and R is an open source implementation.
Gene Annotation Example: Metadata package hgu95av2, mappings between different gene IDs.
AffyID: 41046_s_at
ACCNUM: X95808
GENENAME: zinc finger protein 261
PMID: 10486218, 9205841, 8817323
LOCUSID: 9203
MAP: Xq13.1
SYMBOL: ZNF261
GO: GO:0003677, GO:0007275, GO:0016021
+ many other mappings
Assemble and process genomic annotation data from public repositories. Build annotation data packages or XML data documents. Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed. Process and store query results: e.g., search PubMed abstracts. Generate HTML reports of analyses.
Preprocessing Illustrative example: Detecting differentially expressed genes
Affymetrix GeneChip Design
Reference sequence (5' to 3'): TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT
Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)
Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)
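The relationship between the three sequences on this slide can be checked with a short sketch. It assumes the usual GeneChip convention that the MM probe is the PM probe with its middle (13th) base replaced by its complement, and that the PM probe is shown aligned base-by-base against the reference (so a plain complement, not a reverse complement, is compared). The function names are illustrative and not part of any package.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Base-by-base complement of a sequence (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)


def mismatch_probe(pm):
    """Return the MM probe: the PM probe with its middle base complemented."""
    mid = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


# Sequences copied from the slide.
REFERENCE = "TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"
PM = "GTACTACCCAGTCTTCCGGAGGCTA"
MM = "GTACTACCCAGTGTTCCGGAGGCTA"

print("MM derived from PM matches slide:", mismatch_probe(PM) == MM)
print("PM pairs with a stretch of the reference:", complement(PM) in REFERENCE)
```

The single-base change is what is meant to let MM capture nonspecific binding while leaving most of the specific signal to PM.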
Preprocessing. Typically we want one measure of expression for each gene on each array. 20K genes, each represented by 11 probe pairs of probe intensities (PM & MM). Obtain an expression measure for each gene on each array by summarizing these pairs. Background adjustment and normalization are important issues. Affymetrix offers MAS 5.0 as a solution.
Software Infrastructure (AffyBatch Class) [Diagram] Experimental data: probe intensities (CEL files), probes × arrays; covariate information, arrays × covariates; MIAME. Annotation: metadata (CDF packages), probes × properties.
Why normalize? Compliments of Ben Bolstad
Default Procedure (MAS 5.0) $\text{signal} = \text{TukeyBiweight}\{\log(PM_j - MM_j^{*})\}$
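A minimal sketch of this kind of summary, assuming it is a one-step Tukey biweight average of log(PM − MM*) over the probe pairs of a probe set, where MM* is an adjusted mismatch value that stays below PM. The tuning constant `c = 5`, the small `eps`, and the use of base-2 logs are assumptions made here for illustration; how MM* is computed is not covered, so the caller supplies it.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (a robust average)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)  # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def signal_summary(pm, mm_star):
    """Robust average of log2(PM_j - MM*_j); requires PM_j > MM*_j for every j."""
    logs = [math.log2(p - m) for p, m in zip(pm, mm_star)]
    return tukey_biweight(logs)


# The outlier (100) gets zero weight, so the estimate stays near the bulk.
print(tukey_biweight([1, 2, 3, 4, 100]))
```

Because values far from the median receive zero weight, one aberrant probe pair does not dominate the probe-set summary the way it would in a plain mean.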
Sometimes MM larger than PM
Especially for large PM
Default Procedure (MAS 5.0) $\text{signal} = \text{TukeyBiweight}\{\log(PM_j - MM_j^{*})\}$
Can this be improved?
Why so much noise? The default algorithm seems to be inspired by the following deterministic model for background: $PM = O + N + S$, $MM = O + N$, so $PM - MM = S$, and by a multiplicative error model for signal (they take the log before averaging).
Deterministic model is wrong. Do MM measure nonspecific binding? Look at yeast DNA hybridized to a human chip. Look at the PM, MM log-scale scatter plot: $R^2$ is only 0.5.
Stochastic Model (additive background/multiplicative error) $PM = O_{PM} + N_{PM} + S$, $MM = O_{MM} + N_{MM}$; $\log(N_{PM}), \log(N_{MM}) \sim$ Bivariate Normal ($\rho \approx 0.7$); $S = \exp(s + a + \varepsilon)$, where $s$ is the quantity of interest (log-scale expression). $E[PM - MM] = S$, but $\mathrm{Var}[\log(PM - MM)] \sim 1/S^2$ (can be very large).
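The consequence of this model can be illustrated with a small simulation. This is only a sketch: every numeric scale (the mean and spread of the nonspecific binding and optical terms, the size of the multiplicative error) is a hypothetical choice made for illustration, and the probe effect a is set to zero. Only the correlation of 0.7 comes from the slide.

```python
import math
import random
from statistics import pstdev


def simulate_pairs(signal, n=5000, rho=0.7, seed=0):
    """Draw (PM, MM) pairs from the additive-background model.

    All numeric scales below are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Correlated standard normals give bivariate normal log(N_PM), log(N_MM).
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        n_pm = math.exp(5 + z1)   # nonspecific binding on the PM probe
        n_mm = math.exp(5 + z2)   # nonspecific binding on the MM probe
        o_pm = rng.gauss(30, 3)   # optical noise, PM
        o_mm = rng.gauss(30, 3)   # optical noise, MM
        s = signal * math.exp(rng.gauss(0, 0.1))  # multiplicative error on signal
        pairs.append((o_pm + n_pm + s, o_mm + n_mm))
    return pairs


def summarize(signal):
    """Return (fraction of pairs with MM >= PM, SD of log2(PM - MM) where defined)."""
    diffs = [pm - mm for pm, mm in simulate_pairs(signal)]
    frac_nonpositive = sum(d <= 0 for d in diffs) / len(diffs)
    sd_log = pstdev([math.log2(d) for d in diffs if d > 0])
    return frac_nonpositive, sd_log


for signal in (50, 500, 5000):
    frac, sd = summarize(signal)
    print(f"S ~ {signal}: MM >= PM in {frac:.1%} of pairs, "
          f"SD of log2(PM - MM) = {sd:.2f}")
```

Under these assumptions the subtraction PM − MM is unbiased on the original scale, yet for small S many differences are non-positive and the log of the remaining ones is very noisy.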
Does it make a difference? Ranks: 1, 270, 2074, 3063, 3935, 4639, 4652, 5149, 5372, 5947, 6448, 6870, 7037, 7549, 8429, 9721
RMA: Our first attempt. Ranks: 1, 2, 3, 4, 6, 7, 10, 16, 45, 56, 58, 88, 406, 999, 1643, 2739
Can RMA be improved? RMA attenuates signal slightly to achieve gains in precision. Slope by method: MAS 5.0, 0.69; RMA, 0.61.
Probe Specific Effect To improve RMA we needed to account for probe-specific background effects Our first attempt was to use GC-content Others have noticed probe-specific SB effects We can extend these ideas to NSB
Predict NSB with sequence. Fit a simple linear model to yeast-on-human data to obtain base/position effects (Naef and Magnasco)
Predict NSB with sequence Fit simple linear model to yeast on human data to obtain base/position effects Call these affinities and use them to obtain parameters for background model
Does it help? Accuracy of expression measures improves. Precision is a bit worse, but not bad.
Also explains MM thing
Acknowledgements Ben Bolstad Leslie Cope Sandrine Dudoit Laurent Gautier Robert Gentleman Wolfgang Huber Christina Kendziorski James MacDonald Francisco Martínez-Murillo Felix Naef Marcelo Magnasco Forrest Spencer Terry Speed Jean Yang Zhijin Wu
Supplemental Slides
Does it help?
Other Good Uses: RMA. This background adjustment is used to define an alternative algorithm: the Robust Multi-array Analysis (RMA). Quantile normalization is used. To combine the various probe intensities, a log-scale probe-level additive model is fit robustly: $\log_2(PM^{*}_{ij}) = a_i + b_j + \varepsilon_{ij}$. RMA = estimate of $a_i$ for chip $i$; $b_j$ represents the probe effect. The default robust procedure is median polish. More details: Irizarry et al. Biostatistics (2003)
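The robust fit of the additive model can be sketched with Tukey's median polish. This is an illustrative pure-Python version, not the Bioconductor implementation: it assumes the input is a chips-by-probes matrix of already background-adjusted, normalized log2 intensities for one probe set, and the function name, iteration cap, and tolerance are choices made here.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-8):
    """Fit value[i][j] = overall + row_eff[i] + col_eff[j] + residual[i][j].

    Rows are chips (i) and columns are probes (j). Returns
    (overall, row_eff, col_eff, residuals).
    """
    resid = [list(row) for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def chip_expression(matrix):
    """Per-chip summary: overall level plus the chip (row) effect."""
    overall, row_eff, _, _ = median_polish(matrix)
    return [overall + r for r in row_eff]


# Toy data: chip levels 7, 9, 8 and probe effects -1, 0, 1, plus one outlier.
toy = [[a + b for b in (-1.0, 0.0, 1.0)] for a in (7.0, 9.0, 8.0)]
toy[0][0] += 100.0
print(chip_expression(toy))
```

In the toy matrix the outlying cell ends up in the residuals rather than in the chip estimate, which is the reason for fitting the model with medians instead of means.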
The Probe Effect
Other pseudo-chip images Weights Residuals Positive Residuals Negative Residuals
Why background correct?
Practical Consequences
Why use log? Original scale Log scale
Why can't we ignore NSB? The data shown are from a calibration experiment. NSB causes bias: $(E_1 + K)/(E_2 + K) \approx E_1/E_2$ if $E_1, E_2$ are large; $(E_1 + K)/(E_2 + K) \approx 1$ if $E_1, E_2$ are small. We are faced with a bias/variance trade-off problem.
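A quick numeric check of the two regimes, using a hypothetical background level K = 100 and a true two-fold change in both cases. The numbers are made up for illustration only.

```python
def observed_fold_change(e1, e2, k):
    """Fold change seen when a constant background k is added to both values."""
    return (e1 + k) / (e2 + k)


K = 100.0  # hypothetical background level

# True fold change is 2 in both cases.
high = observed_fold_change(2000.0, 1000.0, K)  # expression well above background
low = observed_fold_change(2.0, 1.0, K)         # expression well below background

print(f"high expression: observed fold change = {high:.3f}")
print(f"low expression:  observed fold change = {low:.3f}")
```

With these values the high-expression ratio stays close to 2 while the low-expression ratio collapses toward 1, which is the attenuation the slide describes.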
Probe effect. This strong probe effect will result in very high correlation between replicates. Do not get too excited. Look at the correlation or variance of relative expression (log FC) instead.
Alternative background adjustment. Use this stochastic model. Minimize the MSE: $E\left[\{\log(\hat{s}/s)\}^2 \mid S > 0, PM, MM\right]$. To do this we need to specify distributions for the different components. Notice this is probe-specific, so we need to borrow strength. * These parametric distributions were chosen to provide a closed form solution
Alternative background adjustment. Model observed PM as the sum of a signal intensity $S$ and a background intensity $B$: $PM = B + S$. For convenience*, it is assumed that $S$ is Exponential($\alpha$) and $B$ is Normal($\mu, \sigma^2$), with $S$ and $B$ independent. Background-adjusted PM are then $E[S \mid PM]$. Because expectation minimizes MSE, we avoid exaggerated variance. Plug-in estimates of $\alpha$, $\mu$, and $\sigma^2$ are used. Notice we can use only PM and make arrays half as expensive. * These parametric distributions were chosen to provide a closed form solution
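Under the assumptions stated on this slide, the conditional expectation has a closed form. Treating $\alpha$ as the rate of the exponential (an assumption about the parameterization) and leaving $B$ untruncated, the density of $S$ given $PM = x$ is proportional to $e^{-\alpha s}\,\phi((x - s - \mu)/\sigma)$ for $s \ge 0$, which is a Normal($a, \sigma^2$) truncated at zero with $a = x - \mu - \sigma^2\alpha$. Its mean is $a + \sigma\,\phi(a/\sigma)/\Phi(a/\sigma)$. The sketch below evaluates that expression; published implementations may include an additional term, and the parameter values in the example are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_signal(pm, alpha, mu, sigma):
    """E[S | PM = pm] for S ~ Exponential(rate alpha), B ~ Normal(mu, sigma^2)."""
    a = pm - mu - sigma ** 2 * alpha
    z = a / sigma
    # Mean of a Normal(a, sigma^2) truncated to s >= 0.
    return a + sigma * normal_pdf(z) / normal_cdf(z)


# Hypothetical plug-in values, for illustration only.
ALPHA, MU, SIGMA = 0.01, 100.0, 20.0
for pm in (50.0, 100.0, 200.0, 1000.0):
    print(pm, round(expected_signal(pm, ALPHA, MU, SIGMA), 2))
```

The adjusted value is always positive, even when PM falls below the background mean, so the log can be taken without the blow-up in variance that subtraction produces.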
Spike-in Experiment Replicate RNA was hybridized to various arrays Some probe-sets were spiked in at different concentrations across the different arrays This gives us a way to assess precision and accuracy
Spike-in Experiment
Rows: arrays A to T. Columns: probesets 1 to 16.

Array     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
A         0  0.25   0.5     1     2     4     8    16    32    64   128     0   512  1024   256    32
B      0.25   0.5     1     2     4     8    16    32    64   128   256  0.25  1024     0   512    64
C       0.5     1     2     4     8    16    32    64   128   256   512   0.5     0  0.25  1024   128
D         1     2     4     8    16    32    64   128   256   512  1024     1  0.25   0.5     0   256
E         2     4     8    16    32    64   128   256   512  1024     0     2   0.5     1  0.25   512
F         4     8    16    32    64   128   256   512  1024     0  0.25     4     1     2   0.5  1024
G         8    16    32    64   128   256   512  1024     0  0.25   0.5     8     2     4     1     0
H        16    32    64   128   256   512  1024     0  0.25   0.5     1    16     4     8     2  0.25
I        32    64   128   256   512  1024     0  0.25   0.5     1     2    32     8    16     4   0.5
J        64   128   256   512  1024     0  0.25   0.5     1     2     4    64    16    32     8     1
K       128   256   512  1024     0  0.25   0.5     1     2     4     8   128    32    64    16     2
L       256   512  1024     0  0.25   0.5     1     2     4     8    16   256    64   128    32     4
M       512  1024     0  0.25   0.5     1     2     4     8    16    32   512   128   256    64     8
N       512  1024     0  0.25   0.5     1     2     4     8    16    32   512   128   256    64     8
O       512  1024     0  0.25   0.5     1     2     4     8    16    32   512   128   256    64     8
P       512  1024     0  0.25   0.5     1     2     4     8    16    32   512   128   256    64     8
Q      1024     0  0.25   0.5     1     2     4     8    16    32    64  1024   256   512   128    16
R      1024     0  0.25   0.5     1     2     4     8    16    32    64  1024   256   512   128    16
S      1024     0  0.25   0.5     1     2     4     8    16    32    64  1024   256   512   128    16
T      1024     0  0.25   0.5     1     2     4     8    16    32    64  1024   256   512   128    16
NSB: Practical Consequences. The data shown here come from spike-in experiments used for calibration. NSB causes fold-change attenuation at low expression levels: $(E_1 + K)/(E_2 + K) \approx E_1/E_2$ if $E_1, E_2$ are large; $(E_1 + K)/(E_2 + K) \approx 1$ if $E_1, E_2$ are small.