Basics of microarrays Petter Mostad 2003
Why microarrays? Microarrays work by hybridizing strands of DNA in a sample against complementary DNA in spots on a chip. Expression analysis: measures relative amounts of mRNA in a tissue sample, testing all genes at the same time; alternatives: Northern blot, qPCR. SNP analysis: genomic DNA.
The transcriptome. Genome -> Transcriptome -> Proteome. In a cell: about 300,000 transcripts, representing genes at different frequencies. Highly regulated turnover of transcripts: lifetimes range from minutes to weeks, depending on sequence and structure. Alternative splicing. Connection between expression profile and protein profile?
Alternatives to hybridization. Alternative ways to measure the transcription profile: EST, SAGE. For a small number of genes: qPCR (advantages/disadvantages).
Basic technology. Several technologies: Affymetrix chips, cDNA chips, oligo-chips. All are based on complementary strands of DNA hybridizing against each other. Fluorescence indicates the amount of hybridization.
cDNA chips. Non-proprietary technology. Clones are made based on expressed RNA. Probes are based on sequences a few hundred bases long. Specialized chips. Two dyes. Cheaper per chip, but expensive to set up. Often noisy data.
cDNA: EST libraries, printing, labelling. RNA sample -> cDNA -> random library. Clones from the library are sequenced -> ESTs. Sequence analysis of ESTs -> gene identities. Public EST libraries. cDNA clones are amplified by PCR and put on well plates. A robot spots PCR products onto glass slides. UV treatment. Two parallel samples are labelled with two dyes (Cy3, Cy5). Labelling is performed as a reverse transcription (giving cDNA).
cDNA: hybridization and scanning. Hybridization of target + probe. Scan at Cy3 and Cy5 wavelengths.
Affymetrix. Patented technology, expensive chips. Based on a system with PM (perfect match) and MM (mismatch) probes. Sequences synthesized on chip. About 20 probes per gene, currently randomly placed. Sequences optimized to reduce cross-hybridization. Have had some quality problems, secrecy problems. Seems to be less noisy than cDNA.
Oligo-arrays. Probes are about 50 bases long. Avoids the Affymetrix patent. May work better than cDNA chips.
Amplification? All technologies require a minimal amount of RNA to work (1 microgram mRNA?). Sometimes there is too little (human samples, samples where you want purity of cell types...). Amplification, using PCR, is an alternative. It introduces noise in the data (in a semi-systematic way).
Planning of microarray experiments. Using cDNA or Affymetrix? What kind of cDNA chip? Reference sample? Pooling? Dye swap? How many repetitions are necessary? What kind of data analysis?
Statistical issues connected to cDNA chips. Experimental design: array printing; what to hybridize. Low-level analysis: image processing; visualization; normalization; quality measures. Data analysis: ranking differentially expressed genes; assigning significance to ranking; classification (discrimination and clustering).
Experimental design. Questions: Pooling of samples? Reference sample? Which samples to hybridize against which? How many arrays? Tips: Two different sample types => compare directly. Several types compared to wild type => use wild type as reference. Saturated designs, loop designs. Complexity => use reference. Dye-swap when appropriate. Deciding factors: aim of experiment; availability of types of sample material.
Image Analysis. Purposes: extract R and G for each spot; assess quality. Scanning: avoid spot saturation; do not use several scans. Finding spot foreground pixels: histogram method; fit a circle; seeded region growing. Finding background: pixels within the bounding box that are not foreground; two concentric circles; valleys; morphological opening. Subtracting background from foreground: estimate foreground with the average over pixels; estimate background with the median over pixels; ignore spots with resulting negative values. Handling of background has a big impact!
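The background-subtraction step on this slide can be sketched in a few lines. This is a minimal illustration, not a real image-analysis pipeline: the function name `spot_intensity` and the pixel values are hypothetical, and the pixel lists are assumed to have been produced already by one of the segmentation methods listed above.

```python
from statistics import mean, median

def spot_intensity(foreground_pixels, background_pixels):
    """Background-corrected intensity for one spot in one channel.

    Foreground is summarised by the mean and background by the median,
    as on the slide. Returns None when the corrected value is not
    positive, so that the spot can be ignored downstream (its logarithm
    would be undefined).
    """
    corrected = mean(foreground_pixels) - median(background_pixels)
    return corrected if corrected > 0 else None

# Hypothetical pixel values. The median keeps the single bright
# background pixel (4000) from inflating the background estimate.
bright = spot_intensity([900, 1100, 1000], [100, 120, 4000])
faint = spot_intensity([90, 110, 100], [150, 160, 170])
```

Here `bright` keeps a positive corrected value, while `faint` ends up below its background and is dropped.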
Graphical Presentation. Images of microarrays; overlays. Plotting $M = \log R - \log G$ versus $A = \frac{1}{2}(\log G + \log R)$ (ignore spots with negative R or G). Boxplots of M values. Spatial plots.
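The M and A values used in these plots can be computed as follows. This is a small sketch: the slide only says "log", so base 2 is an assumption here, and the function names are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot: log-ratio and average log-intensity."""
    log_r = math.log2(red)    # base 2 is an assumption
    log_g = math.log2(green)
    return log_r - log_g, 0.5 * (log_r + log_g)

def ma_table(spots):
    """Compute (M, A) for each (R, G) pair, skipping non-positive values."""
    return [ma_values(r, g) for r, g in spots if r > 0 and g > 0]

# Hypothetical background-corrected intensities; the second spot is dropped.
table = ma_table([(1024, 256), (-5, 100), (64, 64)])
```

A spot with M = 0 has equal intensity in both channels; with base-2 logarithms, M = 2 corresponds to a four-fold higher red signal.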
Normalization. Simplest: subtract the mean or median of non-regulated genes: $M := M - c$. Intensity-dependent: $M := M - c(A)$. Print-tip-dependent: $M := M - c_i(A)$, with a separate curve for each print tip $i$. Scale normalization of M. Use of control spots. Sample pool titration series. Spiking.
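The first two normalizations can be sketched as below. The global version follows the slide directly. For the intensity-dependent version, c(A) would normally be a smooth curve fitted to the M-versus-A plot; the code instead uses a median within equal-sized bins of A, which is only a crude, dependency-free stand-in chosen for illustration. All function names and values are hypothetical.

```python
import math
from statistics import median

def normalize_global(m_values):
    """M := M - c, with c the median of all M-values."""
    c = median(m_values)
    return [m - c for m in m_values]

def normalize_by_intensity(m_values, a_values, n_bins=2):
    """M := M - c(A), with c(A) approximated by a per-bin median.

    Spots are sorted by A and split into n_bins groups of (nearly)
    equal size. This is a rough substitute for a fitted smooth curve.
    """
    if not m_values:
        return []
    order = sorted(range(len(m_values)), key=lambda i: a_values[i])
    size = math.ceil(len(order) / n_bins)
    out = [0.0] * len(m_values)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        c = median(m_values[i] for i in idx)
        for i in idx:
            out[i] = m_values[i] - c
    return out

# Hypothetical data in which M drifts upwards with intensity A.
m = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
a = [6.0, 7.0, 8.0, 12.0, 13.0, 14.0]
flat = normalize_by_intensity(m, a, n_bins=2)
```

A print-tip-dependent version would apply the same function separately to the spots of each print tip.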
Dependence on signal strength
Spatial dependence of signal
Variation between regions of the arrays
Quality Measures. Array quality: intensities span the whole range; saturation avoided; check control spots; background mostly below signal; check slide images for spatial effects. Spot quality, single spots: check spot parameters (area, perimeter, standard deviation, background variability, etc.). Spot quality, comparing repeated spots: reject outlier M-values; use a spot quality measure as a weight.
Hypothesis generation versus hypothesis testing. Hypothesis generation: Methods may suggest that a gene is up- or down-regulated. Methods may suggest new relationships between genes. Suggestions may not be reproduced by another experiment; all results must be verified by other methods. Hypothesis testing: Example: testing whether a gene is significantly up-regulated. Reproducible conclusions. Fewer methods available. In general, requires repetitions of experiments, or serious assumptions.
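Ranking differentially expressed genes. Assuming repeated comparison of two different sample types, with $\bar{M}$ the mean and $s$ the standard deviation of a gene's M-values over $n$ replicates, and $a$ a penalty constant: Simplest: rank by $\bar{M}$. Next choice: rank by $t = \bar{M} / (s / \sqrt{n})$. Penalized t-statistic (Lönnstedt, Speed): $t = \bar{M} / \sqrt{(a + s^2) / n}$. Penalized t-statistic (Efron): $t = \bar{M} / ((a + s) / \sqrt{n})$.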
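The statistics on this slide can be compared on a toy example. In this sketch the gene names, M-values, and the penalty a = 0.5 are all made up; the cited authors estimate a from the data, which is not attempted here.

```python
from math import sqrt
from statistics import mean, stdev

def t_ordinary(m):
    """t = mean(M) / (s / sqrt(n))."""
    return mean(m) / (stdev(m) / sqrt(len(m)))

def t_penalized_variance(m, a):
    """t = mean(M) / sqrt((a + s^2) / n): penalty added to the variance."""
    return mean(m) / sqrt((a + stdev(m) ** 2) / len(m))

def t_penalized_sd(m, a):
    """t = mean(M) / ((a + s) / sqrt(n)): penalty added to the standard deviation."""
    return mean(m) / ((a + stdev(m)) / sqrt(len(m)))

def rank_genes(data, statistic):
    """Gene names sorted by decreasing absolute value of the statistic."""
    return sorted(data, key=lambda g: abs(statistic(data[g])), reverse=True)

# Hypothetical replicated M-values: g1 has a tiny effect but, by chance,
# an even tinier standard deviation; g2 has a large, noisier effect.
data = {"g1": [0.20, 0.20, 0.21], "g2": [2.0, 1.5, 2.5]}

by_t = rank_genes(data, t_ordinary)
by_penalized = rank_genes(data, lambda m: t_penalized_sd(m, a=0.5))
```

The ordinary t-statistic puts g1 first because its denominator is close to zero; the penalty keeps the denominator away from zero, so g2 moves to the top.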
Finding significantly differentially expressed genes. Problem: multiple testing. Assuming normally distributed M-values and independence, use a t-distribution probability plot. Controlling the family-wise error rate: using re-sampling in a repeated experiment with a reference sample (Dudoit). Estimating the false discovery rate by using re-sampling: SAM (Tibshirani).
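To make the two error rates concrete, here is a sketch that contrasts them on a list of p-values. It does not implement the re-sampling methods named on the slide. It swaps in two simpler textbook procedures that target the same error rates: the Bonferroni correction for the family-wise error rate, and the Benjamini-Hochberg step-up procedure for the false discovery rate. The p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Family-wise error rate control: reject when p <= alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """False discovery rate control (step-up procedure).

    Find the largest rank k such that the k-th smallest p-value is
    <= alpha * k / n, then reject the k smallest p-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            cutoff = rank
    rejected = [False] * n
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical per-gene p-values.
p = [0.001, 0.02, 0.03, 0.5]
fwer_calls = bonferroni(p)
fdr_calls = benjamini_hochberg(p)
```

With thousands of genes on a chip, the family-wise threshold becomes very strict, which is one reason false discovery rate methods are attractive for microarray data.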
Classification. Identification of different cell types or conditions, or identification of different gene types. Supervised learning (discriminant analysis; using learning sets) versus unsupervised learning (clustering). Clustering methods may be overused. Simple methods (linear discriminant methods, nearest neighbour, classification trees) often perform as well as more complex methods.
Clustering. Example: the data is a time series of transcription profiles; cluster the genes according to behaviour. Clustering starts with defining similarity between all pairs of genes (e.g., distance in some space). Hierarchical clustering. Dendrograms. Linkage methods. The K-means method. Example of hypothesis generation: Tavazoie et al. (1999) used clusters of genes to identify probable regulatory sequences upstream of them.
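As an illustration of the K-means method on time-series profiles, the following is a bare-bones sketch. Euclidean distance is assumed as the dissimilarity between genes, the starting centres are passed in explicitly to keep the run deterministic, and the four short profiles are invented.

```python
from math import dist

def assign(profiles, centers):
    """Index of the nearest centre (Euclidean distance) for each profile."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in profiles]

def k_means(profiles, initial_centers, n_iter=20):
    """Alternate between assigning genes to centres and updating the centres."""
    centers = [list(c) for c in initial_centers]
    for _ in range(n_iter):
        labels = assign(profiles, centers)
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centre
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign(profiles, centers), centers

# Hypothetical three-point time series: two rising genes, two falling genes.
profiles = [[0.0, 1.0, 2.0], [0.0, 1.1, 2.1], [2.0, 1.0, 0.0], [2.1, 1.0, 0.1]]
labels, centers = k_means(profiles, initial_centers=[profiles[0], profiles[2]])
```

The result depends on the number of clusters and on the starting centres, which is worth keeping in mind when clusters are used for hypothesis generation.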
Clustering of genes and samples
Self-organising maps
Principal Components Analysis. The principal components can be viewed as the axes of a better coordinate system for the data, better in the sense that the data is maximally spread out along the first principal components. The principal components correspond to eigenvectors of the covariance matrix of the data. The eigenvalues represent the part of the total variance explained by each of the principal components.
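The link between eigenvectors, eigenvalues, and explained variance can be shown with a small dependency-free sketch. It finds only the first principal component, using power iteration on the covariance matrix; this is one simple way to get the leading eigenvector, chosen here to avoid external libraries, and a full analysis would use a proper eigendecomposition. The data points are made up.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, n_iter=200):
    """Leading eigenvector of the covariance matrix and its share of variance.

    Power iteration started from the all-ones vector; assumes the data has
    nonzero variance and that this start is not orthogonal to the answer.
    """
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue belonging to the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical 2-D data, spread mainly along the first axis.
points = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
direction, explained = first_principal_component(points)
```

For these points the first component lies along the first axis, and its eigenvalue accounts for most of the total variance.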
Principal component analysis of expression data
Good software: BioConductor, a package using R. Ref. on statistics: Smyth, Yang, Speed: Statistical Issues in cDNA Microarray Data Analysis.