Microarray Data Analysis Workshop MedVetNet Workshop, DTU 2008 Comparative genomic hybridization: Because arrays are more than just a tool for expression analysis Carsten Friis (with several slides from H. Willenbrock)
Outline Introduction to comparative genomic hybridization (CGH) and array CGH Data analysis approaches Breakpoint detection Loss and gain analysis Real data example: Comparative genomic profiling of bacterial strains
Comparative Genomic Hybridization Study types: Gain or loss of genetic material To find variations in the genetic material Purposes: Study of chromosomal aberrations often found in cancer and developmental abnormalities Study of variations in the baseline sequence in a microbial population (microbial comparative genomics)
Genetic Alterations and Disease A Variety of Genetic Alterations Underlie Developmental Abnormalities and Disease Inappropriate gene activation or inactivation can be caused by: Mutation Epigenetic gene silencing (e.g. addition of methyl groups) Reciprocal translocation (exchange of fragments between two nonhomologous chromosomes) Gain or loss of genetic material Any of the above may lead to an oncogene activation or to inactivation of a tumor suppressor
Detecting structural abnormalities Albertson and Pinkel, Human Molecular Genetics, 2003
Microarrays for copy number analysis BAC arrays Affymetrix SNP chip (500 K) Representational oligonucleotide microarray analysis (ROMA) Whole genome tiling arrays Own design (NimbleGen/NimbleExpress)
Array CGH: Array CGH maps DNA copy number alterations to positions in the genome. [Figure labels: Test Genomic DNA; Reference Genomic DNA; Cot-1 DNA; Gain of DNA copies in tumor; Loss of DNA copies in tumor. Axes: Ratio against Position on Sequence]
Structural abnormalities (*HSR: homogeneously staining region) Albertson and Pinkel, Human Molecular Genetics, 2003
Advantages over Expression Arrays Hybridization of DNA to the microarray (DNA is much more stable than RNA) Little normalization is necessary Use of spatial coherence in the analysis Only 1 sample is necessary to draw conclusions about that sample; biological replicates are still necessary to draw general conclusions regarding a certain biological subtype Results may be easier to interpret and to correlate with sample phenotypes, e.g. loss of an oncogene repressor -> certain cancer subtype
Outline Introduction to comparative genomic hybridization (CGH) and array CGH Data analysis approaches Breakpoint detection Loss and gain analysis Real data example: Comparative genomic profiling of bacterial strains
Analysis of array CGH Goal: To partition the clones into sets with the same copy number and to characterize the genomic segments in terms of copy number. Biological model: genomic rearrangements lead to gains or losses Sizable contiguous parts of the genome, possibly spanning entire chromosomes Or, alternatively, to focal high-level amplifications
Copy Number Profiles of a Tumor
Varying genomic complexity Breakpoints
Observed clone value and spatial coherence: It is useful to exploit the physical dependence of nearby clones, which translates into copy number dependence. [Figure labels: N(-0.3, 0.08^2); N(0.6, 0.1^2)]
Expected log2 ratio: A function of copy number change, normal cell contamination and ploidy. [Figure: expected log2 ratios for reference ploidy = 2 and reference ploidy = 3 at tumor proportions of 100%, 50% and 10%]
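The dependence on copy number, normal cell contamination and ploidy can be sketched numerically. The formula below is an assumption made for illustration, not taken from the slide: tumor cells carry `copy_number` copies of the locus, contaminating normal cells carry 2, and the ratio is taken against the sample-wide average copy number implied by `ploidy`. The function name and parameters are hypothetical.

```python
import math


def expected_log2_ratio(copy_number, tumor_fraction, ploidy=2.0):
    """Expected log2 ratio of one locus in a tumor/normal cell mixture.

    Assumed model (illustrative only): the signal at the locus is a weighted
    average of tumor cells (copy_number copies) and normal cells (2 copies),
    divided by the same weighted average computed with the overall ploidy.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    locus = tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2.0
    baseline = tumor_fraction * ploidy + (1.0 - tumor_fraction) * 2.0
    if locus <= 0.0:
        # Homozygous deletion in a pure tumor sample: no signal expected.
        return float("-inf")
    return math.log2(locus / baseline)


if __name__ == "__main__":
    # A single-copy gain shrinks towards 0 as normal cell contamination rises.
    for fraction in (1.0, 0.5, 0.1):
        print(fraction, round(expected_log2_ratio(3, fraction), 2))
```

Under this assumed model, contamination compresses every ratio towards 0, which is why a fixed gain/loss threshold behaves differently from sample to sample.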
Simulation of Array CGH Data Real biological variation considered: Breast cancer data used as model data: Segment length and copy number are taken from the empirical distribution observed in breast cancer data (DNAcopy segmentation). Mixture of cells (sample is not pure): Each sample was assigned a value, P_t (proportion of tumor cells), between 0.3 and 0.7 from a uniform distribution. Experimental noise is Gaussian: Standard deviations were drawn from a uniform distribution between 0.1 and 0.2 to imitate real data, where the noise may vary between experiments. Cancer subtypes are heterogeneous: Certain aberrations characteristic of a cancer subtype may only exist in a percentage of the patients with that cancer subtype. Thus, in each sample, segments with copy number alterations (copy number not 2) were removed at random with probability 30%. Willenbrock and Fridlyand; Bioinformatics 2005
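The simulation steps listed on this slide can be sketched as follows. This is a minimal illustration, not the published simulation code: the segment lengths and copy numbers are passed in by hand (the original drew them from an empirical breast cancer distribution, which is not reproduced here), the mixture formula assumes a diploid baseline, and the function name `simulate_profile` is hypothetical.

```python
import math
import random


def simulate_profile(segments, rng):
    """Simulate log2 ratios for one sample.

    segments: list of (n_clones, copy_number) tuples, supplied by the caller
              (hypothetical input standing in for the empirical distribution).
    rng:      a random.Random instance, so runs are reproducible.
    Returns (log2_ratios, true_copy_numbers, tumor_fraction, noise_sd).
    """
    tumor_fraction = rng.uniform(0.3, 0.7)  # P_t, proportion of tumor cells
    noise_sd = rng.uniform(0.1, 0.2)        # per-sample Gaussian noise level
    log2_ratios, truth = [], []
    for n_clones, copy_number in segments:
        # Heterogeneity: drop an aberration with probability 30%.
        if copy_number != 2 and rng.random() < 0.3:
            copy_number = 2
        # Assumed mixture model against a diploid (2-copy) baseline.
        mixed = tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2.0
        mean = math.log2(mixed / 2.0)
        for _ in range(n_clones):
            log2_ratios.append(rng.gauss(mean, noise_sd))
            truth.append(copy_number)
    return log2_ratios, truth, tumor_fraction, noise_sd


if __name__ == "__main__":
    rng = random.Random(2008)
    ratios, truth, p_t, sd = simulate_profile([(40, 2), (15, 3), (25, 1)], rng)
    print(len(ratios), round(p_t, 2), round(sd, 2))
```

Because the true copy number of every clone is returned alongside the noisy ratios, any segmentation method can be scored against a known truth.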
Comparison Scheme: Use of simulated data, where the truth is known and the noise is controlled. [Figure legend: true breakpoint; falsely predicted breakpoint]
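With known breakpoints, predictions can be scored by sensitivity and false discovery rate. The sketch below is one simple way to do this and is an assumption, not the scoring used in the workshop: each predicted breakpoint is matched greedily to at most one unmatched true breakpoint within `tolerance` clones. The function name and the tolerance parameter are hypothetical.

```python
def score_breakpoints(true_bps, predicted_bps, tolerance=0):
    """Return (sensitivity, false_discovery_rate) for predicted breakpoints.

    A prediction counts as a true positive if it lies within `tolerance`
    clones of a true breakpoint that has not been matched already.
    """
    matched = set()
    true_positives = 0
    for predicted in predicted_bps:
        hit = next(
            (t for t in true_bps
             if t not in matched and abs(t - predicted) <= tolerance),
            None,
        )
        if hit is not None:
            matched.add(hit)
            true_positives += 1
    sensitivity = true_positives / len(true_bps) if true_bps else float("nan")
    if predicted_bps:
        fdr = (len(predicted_bps) - true_positives) / len(predicted_bps)
    else:
        fdr = 0.0  # no predictions, hence no false discoveries
    return sensitivity, fdr


if __name__ == "__main__":
    print(score_breakpoints([10, 50, 80], [10, 52, 70], tolerance=2))
```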
Methods for Segmentation HMM: Hidden Markov Model (aCGH package): Fits HMMs in which any state is reachable from any other state (Fridlyand et al, JMVA, 2004). CBS: Circular binary segmentation (DNAcopy package): Ternary splits of the chromosomes into contiguous regions of equal copy number; the significance of the proposed splits is assessed using a permutation reference distribution (Olshen et al, Biostatistics, 2004). GLAD: Gain and Loss Analysis of DNA (GLAD package): Detects chromosomal breakpoints by estimating a piecewise constant function that is based on adaptive weights smoothing (Hupe et al, Bioinformatics, 2004). Willenbrock and Fridlyand; Bioinformatics 2005
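To make the idea of split-and-permute segmentation concrete, here is a deliberately simplified sketch. It is plain recursive binary segmentation with a permutation test, not the circular variant implemented in the DNAcopy package, and all names and default parameters (`best_split`, `segment`, `n_perm`, `alpha`, `min_len`) are assumptions chosen for illustration.

```python
import math
import random


def best_split(values):
    """Return (index, statistic) of the split with the largest scaled mean shift."""
    n = len(values)
    total = sum(values)
    best_index, best_stat = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        diff = left_sum / i - (total - left_sum) / (n - i)
        stat = abs(diff) * math.sqrt(i * (n - i) / n)
        if stat > best_stat:
            best_index, best_stat = i, stat
    return best_index, best_stat


def segment(values, offset=0, n_perm=200, alpha=0.01, min_len=4, rng=None):
    """Recursively split `values`; return sorted breakpoint positions.

    A split is accepted when its statistic is rarely matched after shuffling
    the clone order (permutation reference distribution).
    """
    rng = rng or random.Random(0)
    if len(values) < 2 * min_len:
        return []  # too short to split further
    split, stat = best_split(values)
    if split is None:
        return []  # constant signal, nothing to split
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_split(shuffled)[1] >= stat:
            exceed += 1
    if (exceed + 1) / (n_perm + 1) > alpha:
        return []  # split not significant
    left = segment(values[:split], offset, n_perm, alpha, min_len, rng)
    right = segment(values[split:], offset + split, n_perm, alpha, min_len, rng)
    return left + [offset + split] + right


if __name__ == "__main__":
    print(segment([0.0] * 20 + [1.0] * 20))
```

Each breakpoint is reported as the index of the first clone of the right-hand segment. For real analyses the published packages named on the slide remain the reference implementations.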
Breakpoint Detection Accuracy
Conclusions so far (across signal-to-noise levels): CBS consistently has the best performance; HMM has the highest FDR; GLAD is the least sensitive
Outline Introduction to comparative genomic hybridization (CGH) and array CGH Data analysis approaches Breakpoint detection Loss and gain analysis Application of segmentation to testing Real data example: Comparative genomic profiling of bacterial strains
Merging segments Note that all procedures operate on individual chromosomes, which results in a large number of segments with mean values close to each other. Additional challenge: reduce the number of segments by merging those that are likely to correspond to the same copy number. This will facilitate inference of altered regions
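The merging step can be illustrated with a small sketch. It is an assumption-laden simplification: segment levels are merged whenever their means differ by less than a fixed `threshold`, which stands in for whatever statistical criterion a real merging procedure applies. The function name `merge_levels` and the default threshold are hypothetical.

```python
def merge_levels(segments, threshold=0.1):
    """Merge segment means that are close to each other, genome-wide.

    segments: list of (mean_log2_ratio, n_clones) tuples, in genome order.
    Returns a list with the merged level assigned to each input segment.
    """
    # Visit segments from lowest to highest mean so neighbours are similar.
    order = sorted(range(len(segments)), key=lambda i: segments[i][0])
    levels = []  # each entry: [weighted_sum, n_clones, member_indices]
    for i in order:
        mean, n_clones = segments[i]
        if levels and abs(mean - levels[-1][0] / levels[-1][1]) < threshold:
            levels[-1][0] += mean * n_clones
            levels[-1][1] += n_clones
            levels[-1][2].append(i)
        else:
            levels.append([mean * n_clones, n_clones, [i]])
    merged = [0.0] * len(segments)
    for weighted_sum, n_clones, members in levels:
        for i in members:
            merged[i] = weighted_sum / n_clones  # clone-weighted level mean
    return merged


if __name__ == "__main__":
    # Two chromosomes, each with a near-normal segment and a gained segment.
    print(merge_levels([(0.02, 50), (0.58, 20), (-0.03, 50), (0.62, 20)]))
```

Here four segment means collapse into two levels, so segments on different chromosomes that plausibly share a copy number end up with the same value.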
Merging For estimating actual copy number levels from segmentations
Segmentation and Merging
ROC Curves Identification of copy number alterations for varying thresholds
Using segmentation for testing (phenotype association studies) Case: Find clones (or whole segments) that differ significantly in copy number between two cancer subtypes. Task: Investigate whether incorporating spatial information (segmentation) into testing for differential copy number increases detection power. Data type: Samples with either of 2 different phenotypes (e.g. 2 different cancer subtypes) How: Comparison of sensitivity and specificity using: 1. Original test statistic (no use of spatial information) 2. Segmented T-statistic derived from original log2 ratios 3. T-statistic computed from segmented log2 ratios
Testing samples (original values) Red: Truly different clones
Correction for multiple testing? standard p-value cutoff for alpha=0.05 => Many false positives
The maxT Multiple Testing Correction By repeating random class assignment and testing, e.g. 100 times, the following permutation reference distribution of the maximum absolute test statistic is obtained (maxT distribution). We wish to control the family-wise error rate (FWER) at alpha=0.05 (5% chance of at least 1 false positive). Therefore, the cutoff should be such that only in 5% of the random cases will we get a false positive (95th percentile): cutoff = 5. [Figure labels: standard significance threshold; maxT multiple testing corrected threshold]
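The permutation procedure described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: a Welch-type t-statistic is used as the per-clone test (the slide does not say which variant was used), class labels are 0/1, and the names `t_statistic` and `maxt_cutoff` are hypothetical.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Welch-type t-statistic for two groups (each needs >= 2 values)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        if mean_a == mean_b:
            return 0.0
        return math.copysign(math.inf, mean_a - mean_b)
    return (mean_a - mean_b) / se


def maxt_cutoff(data, labels, n_perm=100, alpha=0.05, rng=None):
    """Cutoff on |t| controlling the FWER via the maxT permutation distribution.

    data:   list of clones, each a list of log2 ratios (one per sample).
    labels: list of 0/1 class labels, one per sample.
    """
    rng = rng or random.Random(0)

    def max_abs_t(current_labels):
        best = 0.0
        for clone in data:
            a = [v for v, lab in zip(clone, current_labels) if lab == 0]
            b = [v for v, lab in zip(clone, current_labels) if lab == 1]
            best = max(best, abs(t_statistic(a, b)))
        return best

    shuffled = list(labels)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random class assignment, group sizes kept
        null_max.append(max_abs_t(shuffled))
    null_max.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_max[index]  # (1 - alpha) percentile of the maxT distribution


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0.0, 0.15) for _ in range(12)] for _ in range(200)]
    labels = [0] * 6 + [1] * 6
    print(round(maxt_cutoff(data, labels), 2))
```

A clone is then called significant only if its observed |t| exceeds the returned cutoff, which is stricter than the per-clone threshold because it accounts for the maximum over all clones.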
Testing samples (original values) maxT p-value cutoff for alpha=0.05 standard p-value cutoff for alpha=0.05
Testing: Segmenting test statistics Reference
Testing segmented samples 1. Segmentation of individual samples
Testing segmented samples Reference 2. T-statistic from segmented individual samples
Detecting regions with differential copy number Willenbrock and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
Outline Introduction to comparative genomic hybridization (CGH) and array CGH Data analysis approaches Breakpoint detection Loss and gain analysis Real data example: Comparative genomic profiling of bacterial strains
Real Data Example: Comparative genomic profiling of several Escherichia coli strains The microarray design included probes for: 7 known E. coli strains 39 known E. coli bacteriophages 104 known E. coli virulence genes Experimentally: 2 sequenced control strains (W3110 and EDL933), 3 replicates 2 non-sequenced strains (D1 and 3538), 3 replicates Bacteriophage: φ3538 (Δstx2::cat), 2 replicates Willenbrock et al.; J. Bacteriology 2006
Comparative Genomic Profiling: Challenges Ratio problems: some genes might be present in the query strain but not in the known reference strain Single channel microarrays or dual channel microarrays? In this case, we used an Affymetrix single channel custom-made array (NimbleExpress) Partly present genes versus similar but distinct genes
The 7 E. coli strains included on the microarray Very high similarity between the two K-12 strains and between the two O157:H7 strains. Percentage of homologues for E. coli genomes in columns found in E. coli genomes in rows. Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
BLAST Atlas Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Hybridization Atlases Probe hybridizations for experiments (samples) result in a similar pattern as expected from the BLAST atlas Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Mapping the phage Φ3538 (Δstx2::cat) Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Zoom of phage Φ3538 (Δstx2::cat) The hybridization pattern is very similar for the phage, strain 3538 and strain D1 Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Hierarchical Cluster Analysis D1 is very similar to the K-12 type strains (W3110 + MG1655) [Figure label: K-12]
E. coli virulence genes D1 is probably still a commensal strain (commensal: an organism participating in a symbiotic relationship from which it benefits while the other is unaffected) Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Summary Comparative genomic profiling of two E. coli strains: O175:H16 D1 and O157:H7 3538 Identification of virulence genes and phage elements Conclusions: D1 is similar to the K-12 type strains Characterization of D1 and 3538 genes: Identification of a number of genes involved in DNA transfer and recombination
Summary Numerous methods have been introduced for segmentation of DNA copy number data and breakpoint identification. Important to benchmark against existing methods (however, only feasible if the software is publicly available) Currently, CBS (DNAcopy package) has the best overall performance Merging of segmentation results improves copy number phenotype characterization Study types: Study of copy number in cancer samples Comparison of bacterial strains Etc.