Row Quantile Normalisation of Microarrays




W. B. Langdon
Departments of Mathematical Sciences and Biological Sciences
University of Essex, CO4 3SQ

Technical Report CES-484, ISSN: 1744-8050
23 June 2008

Abstract

Variation in tissue sample preparation leads to variation across the transcriptome, not just between experiments but also between individual microarrays. Normalisation is essential before data from different arrays can be compared. Quantile normalisation can be used to force data from a single GeneChip to take a given distribution. However quantile normalisation can be blind to the consistent spatial variation we note in thousands of Affymetrix high-density oligonucleotide arrays (HDONAs) from NCBI GEO. We propose a simple, computationally efficient normalisation technique which takes the spatial aspect into account. BioConductor R code is included.

Keywords: Normalisation, Gene expression, High-density oligonucleotide array, data mining, standard chip, MIAME

1 Introduction

While high-density oligonucleotide arrays have created biological data on an industrial scale, the cost of such GeneChips means few arrays are used in each experiment. Unfortunately it is difficult to control the dose of total mRNA. Instead some form of normalisation of the data collected by different arrays is necessary. Early normalisation schemes typically took a number of readings from different samples and adjusted them to have the same mean, usually by linear rescaling. It was realised that rescaling could also give the data the same standard deviation. With modern computers it is possible to adjust data so that not only the first and second moments but the whole distribution is the same. This is known as quantile normalisation (Bolstad et al. 2003). Typically quantile normalisation is applied across all the chips used in a particular experiment.
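The early mean and standard deviation matching described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the intensities are made-up numbers, and rescale_to is a hypothetical helper name, not a function from BioConductor or from this report.

```r
# Toy intensities for two arrays (made-up numbers, not real GeneChip data).
array_a <- c(120, 150, 90, 300, 210)
array_b <- c(260, 310, 170, 640, 400)

# Linearly rescale x so that it has the requested mean and standard deviation.
# rescale_to() is a hypothetical helper name used only for this sketch.
rescale_to <- function(x, target_mean, target_sd) {
  (x - mean(x)) / sd(x) * target_sd + target_mean
}

# Bring array_b onto the scale of array_a.
b_scaled <- rescale_to(array_b, mean(array_a), sd(array_a))

mean(b_scaled) - mean(array_a)  # effectively zero
sd(b_scaled) - sd(array_a)      # effectively zero
```

Only the first two moments are matched here. The shape of the distribution of array_b is unchanged, which is the limitation that quantile normalisation removes.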
However the advent of large public repositories of GeneChip experiments opens the way to normalisation against far larger numbers of chips. Normalisation against all published experiments using the same chip has been demonstrated (Langdon et al. 2007b). With a reference chip, created by averaging across all of GEO (Barrett et al. 2007), a bioinformatician can quantile normalise a single chip. Further, this can be done quickly using R.

Figure 1 shows 1 354 896 mRNA concentration measurements provided by a single Affymetrix HG-U133 2+ GeneChip. The horizontal spatial banding is obvious. (Figure 2 plots it numerically.) Figure 3 shows the banding carries over to individual probe performance. C and G base pairs have three main hydrogen bonds, whereas A and T pairs have two. This means probes with many Cs or Gs bind more tightly than those with more As and Ts. (This is confirmed by Figure 4.) The number of each type of base in HG-U133 2+ probes follows a similar pattern to that which would be produced by a random choice, see Figure 6.

Figure 1: Example Homo Sapiens HG-U133 +2 GeneChip (prior to normalisation, yellow is higher). The banding of signal intensities is clear. Note perfect match (PM) probes are adjacent to their mismatch (MM) partner in the next row. However the horizontal banding occurs regardless of PM/MM. (In terms of spatial defects, this CEL file is the median of 2757 HG-U133 2+ downloaded from NCBI's GEO.)

The probes are deliberately placed on the chip such that probes with similar sequences are adjacent. This increases the feature size on the 100 photolithographic masks required for each GeneChip. Having larger holes in the masks, caused by adjacent probes having the same base at the same depth, tends to reduce the importance of edge effects. This in turn eases manufacture (Pease et al. 2001; Feldman and Pevzner 1994). Assorting the probes by sequence gives the non-random layout of the probes shown in Figure 5. It is this assorting that is the underlying cause of the banding in both probe intensity and performance. Until now this strong spatial feature has been ignored by common normalisation tools.

2 Row Quantile Normalisation

Recently we have applied quantile normalisation not just to a collection of GeneChips relating to one experiment but to thousands of public chips (Langdon et al. 2007a). This allows a biologist to normalise a single chip against a public baseline (i.e. GEO), rather than their own much smaller private experimental set. Normalisation can therefore be done safely for a small number of chips, even just one.

Quantile normalisation works by placing measured data in rank order. The reference distribution (in our case a robust average taken across thousands of GeneChips of the same type in GEO) is also sorted into rank order. To normalise a GeneChip, the sorted probe intensities are simply replaced by the corresponding value with the same rank from the reference distribution, i.e. the smallest measured probe intensity is replaced with the smallest average value in GEO.
The probe with the second smallest value is replaced with the second smallest average value in GEO, and so on, until the largest measured probe intensity has been replaced by the largest average probe value from GEO. This can be done efficiently in R. Note that with ordinary quantile normalisation a probe can be normalised to the average value of any probe, and that probe may be anywhere on the GeneChip.
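The rank replacement just described can be sketched as follows. This is a minimal illustration, assuming the chip and the reference distribution hold the same number of values. The name quantile_normalise and the numbers are hypothetical, chosen only for this example.

```r
# quantile_normalise() is a hypothetical helper, not a BioConductor function.
#   values    : probe intensities from one chip (numeric vector)
#   reference : reference distribution with the same number of values
quantile_normalise <- function(values, reference) {
  stopifnot(length(values) == length(reference))
  sorted_ref <- sort(reference)
  # Rank 1 is the smallest value. Tied values share the lowest rank,
  # so equal inputs always receive equal outputs.
  sorted_ref[rank(values, ties.method = "min")]
}

chip      <- c(500, 20, 75, 75, 1200)    # made-up intensities
reference <- c(64, 128, 256, 512, 1024)  # made-up reference values

normalised <- quantile_normalise(chip, reference)
normalised  # 512 64 128 128 1024
```

The smallest chip value (20) receives the smallest reference value (64) and the largest (1200) receives the largest (1024). Nothing in this version records where on the chip each value came from.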

Figure 2: Average intensity by row (dashed line). In contrast average signal across rows is close to 150. Variation is about a factor of two, both along rows and across them. (Same data as Figure 1.) [Plot: intensity (geometric mean) against row number; series labelled "Vertical+200" and "Row".]

Figure 3: Performance of probes across HG-U133 +2 GeneChips. Highest correlation between probes in the same probeset (2757 GEO HG-U133 +2). Yellow is higher. Areas without probesets are white. The banding of signal performance is clear.

Figure 4: Distributions of average HG-U133 2+ probe intensity vs. number of Cs and Gs. On average (mode, solid line) probes with many C+G are approximately twice as bright as those with few. [Plot axes: count, geometric mean probe intensity, number of guanine and cytosine.]

Figure 5: Number of Cs and Gs in probes across HG-U133 +2 GeneChips. Yellow is higher. Areas without probesets are white.
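Figure 6 compares the observed number of Cs and Gs per probe with a random model of the same mean (11.9). One way to sketch such a model in R is a binomial distribution over the 25 bases of an Affymetrix probe. This assumes every base is independently C or G with the same probability, which may not be exactly how the dashed line in the figure was drawn.

```r
probe_length <- 25    # Affymetrix probes are 25 bases long
mean_gc      <- 11.9  # mean number of Cs and Gs per probe, from Figure 6

# Assumption: each base is independently C or G with probability p_gc.
p_gc <- mean_gc / probe_length

gc_count <- 0:probe_length
prob     <- dbinom(gc_count, size = probe_length, prob = p_gc)

# Most likely number of Cs and Gs under this random model.
gc_count[which.max(prob)]

# Expected fraction of probes with 16 or more Cs and Gs.
sum(prob[gc_count >= 16])
```

Multiplying prob by the number of probes on the chip would give expected counts on the same scale as the figure.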

Figure 6: The distribution of the number of Cs and Gs in HG-U133 2+ probes is approximated by a random distribution (dashed line) with the same mean (11.9). (Same data as Figure 5.) [Plot: number of HG-U133 2+ probes against number of guanine and cytosine in probe; series labelled "HG-U133 2+" and "Binomial m=11.9".]

Row quantile normalisation is the same as regular quantile normalisation, except we insist that a probe's normalised value must come from a probe in the same row. With modern GeneChips, this still leaves the sort algorithm hundreds or thousands of values to choose from. Using the same row automatically takes advantage of the way Affymetrix lays out probes on its GeneChips (see previous section). This ensures every probe is normalised to the average value of a probe which has a similar probe sequence to it. Row normalisation is readily and efficiently implemented in R. See code fragments in Figure 7.

References

Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Research, 35(Database issue), D760-D765.

Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193.

Feldman, W. and Pevzner, P. A. (1994). Gray code masks for sequencing by hybridization. Genomics, 23, 233-235.

Langdon, W. B., da Silva Camargo, R., and Harrison, A. P. (2007a). Spatial defects in 5896 HG-U133A GeneChips. In J. Dopazo, A. Conesa, F. Al Shahrour, and D. Montener, editors, Critical Assessment of Microarray Data, Valencia. Presented at EMERALD Workshop.

Langdon, W. B., Upton, G. J. G., da Silva Camargo, R., and Harrison, A. P. (2007b). A survey of spatial defects in Homo Sapiens Affymetrix GeneChips. Submitted.

Pease, R. F., McGall, G., Goldberg, M. J., Rava, R. P., Fodor, S. P. A., Goss, V., Stryer, L., and Winkler, J. L. (2001). Printing molecular library arrays. US Patent 6,239,273. Affymetrix, Inc. (Santa Clara, CA). Filed: October 26, 1999.

    # Sort each chip row separately (a chip row is one value of the second index).
    sortbyrow = function(mat) { apply(mat, 2, sort); }

    # The second argument is called ref here so that it does not shadow
    # the R function rank().
    quantilebyrow = function(mat, ref) {
      # rather than sort, use rank to ensure two probes with the same value
      # before normalisation have the same value after normalisation.
      r = apply(mat, 2, rank, ties.method="min");
      ans = array(NA, dim=c(nrow(ref), ncol(ref)));
      for(j in seq(from=1, to=ncol(r))) {
        ans[,j] = ref[r[,j], j];
      }
      return(ans);
    }

    qdistribution <- sortbyrow(normean);
    library(affy);
    a <- ReadAffy(filenames=filenames, sd=FALSE);
    mat <- quantilebyrow(array(exprs(a), dim=c(S,S)), qdistribution);

Figure 7: R code fragment showing the row quantile normalisation of a single GeneChip (filenames) against the whole of GEO. The reference distribution, previously calculated from all the GeneChips in GEO of the same type, is stored in the S × S array normean. (For an HG-U133 2+, S = 1164.) The R function sortbyrow orders the array's rows independently. The individual probe values, given by ReadAffy and exprs(a), must also be stored in an S × S array. quantilebyrow creates an array of indexes r into the sorted reference distribution ref. For each row, it calculates a vector of S values which are taken from ref via the rank indexes in r. (The R [,j] syntax implies all elements of an array with second index j, which is why each chip row is processed with apply(mat, 2, ...).)
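The effect of restricting normalisation to a row can be seen on a toy example. The sketch below follows the same idea as Figure 7 but is self-contained and does not need the affy package. The name row_quantile_normalise and the two small arrays are hypothetical. As in Figure 7, each chip row is assumed to be held as one column of the R matrix (fixed second index).

```r
# Hypothetical helper following the idea of Figure 7. Chip rows are held as
# the columns of an R matrix and each column is normalised on its own.
row_quantile_normalise <- function(chip, reference) {
  stopifnot(identical(dim(chip), dim(reference)))
  sorted_ref <- apply(reference, 2, sort)             # sort each chip row separately
  ranks <- apply(chip, 2, rank, ties.method = "min")  # rank within each chip row
  out <- sapply(seq_len(ncol(chip)),
                function(j) sorted_ref[ranks[, j], j])
  dim(out) <- dim(chip)
  out
}

# Made-up example: two chip rows (matrix columns) of three probes each.
reference <- matrix(c(100, 300, 200,    # first chip row of the reference
                       10,  30,  20),   # second chip row of the reference
                    nrow = 3)
chip      <- matrix(c(  5,   9,   7,    # first chip row, dim on this chip
                      800, 700, 900),   # second chip row, bright on this chip
                    nrow = 3)

normalised <- row_quantile_normalise(chip, reference)
normalised
# first chip row  -> 100 300 200
# second chip row ->  20  10  30
```

Although the second chip row holds the brightest raw values, its normalised values are drawn only from the second row of the reference. Ordinary quantile normalisation would instead have given those three probes the three largest reference values on the whole chip.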