
Integrative Analysis of Genomic Copy Number and Gene Expression Data in Metastatic Prostate Cancer
Elise Chang, Agilent Technologies
Elise_chang@agilent.com

Agenda
- Introduction
- Features of Copy Number Workflow
- Case study: Integrative Analysis of Genomic Copy Number and Gene Expression Data in Metastatic Prostate Cancer
(Slide graphic labels: SNPs, CNVs, CNPs, CNVRs)

Copy Number Variation: Understanding the Relevance to Human Diseases
- Copy number variation (CNV): DNA segments in which copy number varies between two or more genomes
- Ranges from 1 kb to millions of DNA bases in size
- CNVs have been associated with susceptibility to disease, complex behavioral traits, and other phenotypic variability
- Identifying significant CNVs is important for understanding the underlying mechanisms of disease and disease susceptibility

Supported Array Platforms
- Affymetrix:
  - 100K (50K Xba, 50K Hind)
  - 500K (250K Nsp, 250K Sty)
  - SNP 5.0
  - SNP 6.0
- Illumina:
  - GenomeStudio outputs for all SNP/CNV arrays
  - The GeneSpring GX plug-in for GenomeStudio exports data in a format that GeneSpring GX supports. Copy the plug-in located in INSTALLDIR\app\Illumina\GX.Genotyping.Export.dll to Genomestudio\modules\BSGT\ReportPlugins\
  - Instructions for installation are in section 26.4.1 of the manual.

Supported Arrays
- Affymetrix
  - Technology available on Agilent server
  - Experiment creation involves importing the CEL files, summarization and normalization
  - GX11 computes log ratio, CN and LOH
  - GX11 uses the CN values to get ASCN and PSCN and to run GISTIC
- Illumina
  - Technology created on the fly
  - Experiment creation involves import from GenomeStudio
  - Log ratios, CN values and LOH are imported from GenomeStudio
  - GX11 uses the CN values to get ASCN and PSCN and to run GISTIC

Experimental Designs
Identification of variation requires comparison to a reference DNA source, a reference dataset or a reference genome sequence. This is important for Affymetrix experiment creation.
1. Analysis against a reference: The control is generated from a pool of individuals. All test samples are then compared against a common, pooled control, also known as the reference.
  - HapMap samples are packaged as the Standard Reference
  - A Custom Reference can be created
2. Paired Analysis: Control and test DNA are from the same individual
  - Pairing is defined during experiment grouping

Custom Reference Creation
- Menu: Tools > Create Custom Reference
- Typically 30-40 reference samples are needed for accurate genotype calls on non-reference samples
- Once a Custom Reference is created, it is saved for future experiment creation

Reference Creation
References contain:
- Averaged summarised intensities for probe sets from PLIER
- For Affymetrix 50/100K Set:
  - Statistics from BRLMM
- For 250/500K Set and SNP 5.0 Affymetrix arrays:
  - Statistics from BirdSeed Algorithm
  - Clusters from BirdSeed Algorithm (and median and s.d. of clusters)
- For SNP 6.0 Affymetrix arrays:
  - Statistics from BirdSeed Algorithm
  - Clusters from BirdSeed Algorithm (and median and s.d. of clusters)
  - Clusters from CANARY (and median and s.d. of clusters)

Experimental Set-up for Paired Normal Design
For paired-normal experimental designs, two parameters must be specified:
- Group indicates a set of paired samples
- Condition indicates which sample(s) to use as reference (Normal) for the test sample(s) (Tumor)
The parameters must be named Group and Condition for GeneSpring GX to recognize the experiment as a paired design. An interpretation using Group and Condition must be used for Copy Number Computation.

Copy Number Analysis Workflow in GeneSpring GX 11
1. QC / Batch Correction *
2. Copy Number Analysis (CN, LOH, ASCN, Log ratio)
3. GISTIC for identification of statistically common CN variation within a set of samples
4. Filter for regions of interest
5. Biological contextualization of genes in regions of interest
* The QC / Batch Correction step is not available for the Illumina workflow

Quality Control on Samples
This window should look familiar to current GeneSpring GX users.

Quality Control Tools - PCA and Batch Effect
- PCA: identifies potential sample outliers
- Batch Effect: identifies and corrects for systematic error when different samples are processed on different days or under different conditions

Batch Correction
- Select the interpretation that groups samples into their respective batches
- Minimum samples per batch: minimum number of samples a batch must contain to be considered for correction
- P-value: t-test p-value cutoff for each probe
- Percentage of bad batches allowed: if the percentage of bad batches is below the user-specified value, do not perform correction for the probe
- Each batch is t-tested against a pool of all remaining batches.
- Correction for each flagged entity is performed using a reference batch.

Copy Number Computation
Copy Number Analysis: (CN, LOH, ASCN, Log ratio, LOD score)

Copy Number Analysis for Affymetrix Data
The computation produces:
(1) Log ratio values
  - Against Reference design: normalized intensity of sample / normalized intensity of reference
  - Paired design: normalized intensity of Case / normalized intensity of Control
(2) Genomic copy number
  - Circular Binary Segmentation to identify segments
  - Log ratio values used to estimate genomic copy number
  - Confidence value given as the negative log10 of the p-value
(3) Allele-specific copy number (ASCN) information
  - Fawkes algorithm used to assign allele-specific copy number using SNP probes
(4) Parent-specific copy number (PSCN) information
(5) Loss of Heterozygosity (LOH)
  - Hidden Markov Model (HMM) used to calculate the LOH score

Log Ratio and Copy Number Computation
- The type of copy number computation (paired or against reference) is determined by the interpretation selected
- First, log2 ratios are calculated for every probe:
  - Against Reference design: normalized intensity of sample / normalized intensity of reference
  - Paired design: normalized intensity of Case / normalized intensity of Control
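The per-probe log2 ratio can be illustrated with a short sketch. This is not GeneSpring GX code: the function names, probe IDs and intensity values are hypothetical, and the same function serves both designs because only the denominator changes (pooled reference or paired Control).

```python
import math


def log2_ratio(test_intensity, reference_intensity):
    """Log2 of normalized test intensity over normalized reference intensity."""
    if test_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("normalized intensities must be positive")
    return math.log2(test_intensity / reference_intensity)


def probe_log_ratios(test, reference):
    """Per-probe log2 ratios; both dicts map probe ID -> normalized intensity."""
    return {
        probe: log2_ratio(test[probe], reference[probe])
        for probe in test
        if probe in reference
    }


# Hypothetical normalized intensities for three probes (paired design:
# Case = tumor, Control = normal from the same individual).
tumor = {"probe_1": 800.0, "probe_2": 400.0, "probe_3": 200.0}
normal = {"probe_1": 400.0, "probe_2": 400.0, "probe_3": 400.0}
ratios = probe_log_ratios(tumor, normal)
```

A positive ratio suggests a gain relative to the control and a negative ratio a loss; a ratio of zero means equal intensity.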

Copy Number Computation: Circular Binary Segmentation
- Smooths outliers
- Finds change points in each sample, using a statistic to identify a segment break
- Validates each change point using a t-test with p-value cut-off < 0.002002
- Outputs are segment break points and the mean log ratio for each segment
(Figure: segment break points)
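The idea of finding change points and reporting segment means can be sketched with a much simpler stand-in. The code below is plain recursive binary segmentation driven by a two-sample t statistic; it is not the circular variant, it does no outlier smoothing, and it replaces the p-value validation with a hypothetical t-statistic threshold (6.0, chosen arbitrarily for the example). All names and data are illustrative.

```python
import math
from statistics import mean


def t_statistic(left, right):
    """Absolute pooled-variance two-sample t statistic."""
    n1, n2 = len(left), len(right)
    m1, m2 = mean(left), mean(right)
    ss = sum((x - m1) ** 2 for x in left) + sum((x - m2) ** 2 for x in right)
    pooled_var = ss / (n1 + n2 - 2)
    if pooled_var == 0:
        return math.inf if m1 != m2 else 0.0
    return abs(m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def best_split(values):
    """Return (index, t) of the split that best separates two segment means."""
    best_idx, best_t = None, 0.0
    for i in range(2, len(values) - 1):  # at least 2 probes on each side
        t = t_statistic(values[:i], values[i:])
        if t > best_t:
            best_idx, best_t = i, t
    return best_idx, best_t


def segment(values, threshold=6.0, offset=0):
    """Split recursively; return (start, end_exclusive, mean_log_ratio) tuples."""
    idx, t = best_split(values)
    if idx is None or t < threshold:
        return [(offset, offset + len(values), mean(values))]
    return (segment(values[:idx], threshold, offset)
            + segment(values[idx:], threshold, offset + idx))


# Hypothetical log ratios: six near-zero probes followed by six deleted probes.
log_ratios = [0.0, 0.05, -0.05, 0.02, -0.02, 0.01,
              -0.6, -0.55, -0.65, -0.58, -0.62, -0.6]
segments = segment(log_ratios)
```

For this toy input the sketch reports one break point between the sixth and seventh probes, with one segment mean near 0 and one near -0.6.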

Copy Number Computation
Once segments are identified by CBS, copy numbers and confidence scores need to be assigned to them.
- Copy Number:
  - The HapMap dataset is used to generate a median map
  - Using the BirdSeed and CANARY outputs, the median and s.d. of log ratios across all probes are calculated for each possible copy number (0, 1, 2, 3, 4)
  - Log ratios for segments from CBS are compared to the median map and copy numbers are assigned
  - Homozygous and hemizygous deletions are given CN values of 0 and 1
  - Amplifications are given CN values of 3 and 4
- Copy Number Confidence:
  - Copy numbers between 1.5 and 2.5 are assigned a p-value of 1
  - For any other copy number, a t-test against zero of the log ratios is performed, with multiple testing correction
  - The negative logarithm to the base 10 of the final p-value is reported as the confidence
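The confidence rule can be sketched as follows. This is a minimal illustration rather than GeneSpring GX code: the function name is hypothetical, the corrected p-value is taken as an input instead of being computed, and treating the 1.5 to 2.5 range as inclusive is an assumption.

```python
import math


def copy_number_confidence(copy_number, corrected_p_value):
    """Confidence = -log10(p); near-diploid calls are assigned p = 1."""
    # Assumption: the 1.5-2.5 range is inclusive at both ends.
    if 1.5 <= copy_number <= 2.5:
        corrected_p_value = 1.0
    if not 0.0 < corrected_p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    if corrected_p_value == 1.0:
        return 0.0
    return -math.log10(corrected_p_value)


diploid_conf = copy_number_confidence(2.0, 0.0001)   # p forced to 1
deletion_conf = copy_number_confidence(1.0, 0.001)   # hypothetical p-value
```

Under this rule a segment called as near-diploid always has confidence 0, and smaller p-values give larger confidence scores.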

Copy Number Computation: Median Map
Mean log ratio that is mapped to each assigned copy number, by array:

Copy Number   Genome-Wide Human   Genome-Wide Human   Human Mapping 500K   Human Mapping 500K
Assigned      SNP Array 6.0       SNP Array 5.0       Array Set - NSP      Array Set - STY
4.0            0.5531951           0.54314524          0.5104986            0.54314524
3.5            0.43365917          0.4216105           0.39650044           0.39650044
3.0            0.31824413          0.30864272          0.26924038           0.28693026
2.5            0.16928099          0.16363965          0.13422728           0.15135522
2.0            0.0                 0.0                 0.0                  0.0
1.5           -0.22511256         -0.2103804          -0.18339391          -0.18339391
1.0           -0.48062363         -0.44733366         -0.36318222          -0.36318222
0.5           -0.73515093         -0.68273795         -0.57555604          -0.57555604
0.0           -1.4098581          -1.2451344          -0.9485139           -0.9485139

Mapping 100K array set: same as Genome-Wide Human SNP Array 6.0
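One way to picture how a segment's mean log ratio is compared against the median map is a nearest-value lookup. The sketch below uses the Genome-Wide Human SNP Array 6.0 values from this slide; the nearest-value rule itself and the function name are assumptions made for illustration, since the slides do not state the exact comparison used.

```python
# Median map for Genome-Wide Human SNP Array 6.0:
# assigned copy number -> mean log ratio.
SNP6_MEDIAN_MAP = {
    4.0: 0.5531951,
    3.5: 0.43365917,
    3.0: 0.31824413,
    2.5: 0.16928099,
    2.0: 0.0,
    1.5: -0.22511256,
    1.0: -0.48062363,
    0.5: -0.73515093,
    0.0: -1.4098581,
}


def assign_copy_number(segment_mean, median_map=SNP6_MEDIAN_MAP):
    """Return the copy number whose mapped log ratio is closest to the segment mean.

    Assumption: nearest-value matching stands in for the actual comparison.
    """
    return min(median_map, key=lambda cn: abs(median_map[cn] - segment_mean))


cn_deleted = assign_copy_number(-0.5)   # close to the CN 1.0 entry
cn_gained = assign_copy_number(0.3)     # close to the CN 3.0 entry
```

Because the map contains half-integer entries, this lookup yields fractional as well as discrete copy numbers in the range 0-4.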

Copy Number Analysis
- Log ratios are smoothed to give CN values.
- CN segments are created using the Circular Binary Segmentation (CBS) algorithm.
- Fractional as well as discrete CN values are assigned, in the range of 0-4
(Figure labels: CN values, log ratios)

CN computation
- Paired Analysis:
  1. Condition-Type Interpretation
  2. Each tumor is paired against the Normal of its group
  3. All Normals are compared against the reference
- All samples against reference comparison
Only one set of CN Analysis results can be stored.

Allele-specific Copy Number
- Given a segment with copy number = 3, which allele was duplicated?
- Example output: AAB = A:2, B:1

Parent-specific Copy Number
Consider a section of a chromosome with haplotypes over SNP positions 1-5:
  ChrCopy1: A1 B2 A3 B4 B5
  ChrCopy2: A1 A2 B3 A4 B5
Suppose Copy1 gets duplicated 2 additional times (CN of region = 4), so that ChrCopy1 is present three times and ChrCopy2 once. The ASCN become:
  A1:4 B1:0 and PSCN = 4-0
  A2:1 B2:3 and PSCN = 3-1
  A3:3 B3:1 and PSCN = 3-1
  A4:1 B4:3 and PSCN = 3-1
  A5:0 B5:4 and PSCN = 4-0
PSCN is a measure of allelic imbalance
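The worked example can be reproduced by counting alleles per position. This sketch only illustrates the bookkeeping when the haplotypes are already known; in practice ASCN and PSCN are inferred from probe data, and the function name and input layout here are hypothetical.

```python
from collections import Counter


def ascn_and_pscn(haplotypes_with_copies):
    """Per-position allele counts (ASCN) and (major, minor) pair (PSCN).

    haplotypes_with_copies: list of (haplotype string, number of copies).
    """
    length = len(haplotypes_with_copies[0][0])
    results = []
    for pos in range(length):
        counts = Counter()
        for haplotype, copies in haplotypes_with_copies:
            counts[haplotype[pos]] += copies
        a, b = counts["A"], counts["B"]
        results.append({"A": a, "B": b, "pscn": (max(a, b), min(a, b))})
    return results


# Haplotypes from the slide: Copy1 is present 3 times, Copy2 once (CN = 4).
chr_copy1 = "ABABB"
chr_copy2 = "AABAB"
result = ascn_and_pscn([(chr_copy1, 3), (chr_copy2, 1)])
```

Positions where both haplotypes carry the same allele give a PSCN of 4-0, and positions where they differ give 3-1, matching the values listed above.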

Copy Number Computation for Illumina Arrays
- Copy Number, Log ratio, and LOH scores are calculated in GenomeStudio and imported into GeneSpring GX
- The following are computed in GeneSpring GX:
  - ASCN information
  - PSCN information

Analysis and Filtering
Once you have identified regions of genomic alteration in individual samples, how can you find meaningful events in groups of samples?
- Find Common Genomic Variant Regions
- Filter By Regions
- Identify Copy Neutral LOH
- Filter By PSCN

Finding Common Genomic Variant Regions Across a Set of Samples
Genomic Identification of Significant Targets in Cancer (GISTIC)

Find Common Genomic Variant Regions
- Many tumour samples have large numbers of chromosomal aberrations.
- GISTIC was developed to distinguish meaningful (driver) mutation events from random background somatic (passenger) events
- Driver mutations are functionally important events that confer advantageous biological properties on the tumour, allowing it to initiate, grow or persist, and are more likely to drive cancer pathogenesis
- GISTIC can also be applied to non-cancer datasets where you want to find common genomic variant regions

Common Genomic Variant Regions
- Choose Fine or Coarse Mode
- Amplified Regions
- Deleted Regions

Common Variation Results
- Once GISTIC has identified aberrant regions, it uses the biological genome to find overlapping genes for amplified and deleted segments
- For each probeset within the region, the upstream and downstream 1000 bases are scanned and the genes are identified
- Genes overlapping the significant regions are identified and stored in the Project Navigator

Use of Filters to identify genomic landscape prevalent in metastatic prostate cancer

Results Analysis

Biological Contextualization of Copy Number Data

Case Study

Integrative Analysis of Metastatic Prostate Cancer
- Prostate cancer is the most common cancer in men.
- Primary tumors are thought to be composed of multiple genetically distinct cancer cell clones.
- Both primary and metastatic prostate cancers are heterogeneous in nature, posing therapeutic challenges.

Datasets Used
- Expression: GSE6919
  - 24 metastatic samples from 4 patients and 18 normal samples
- Genomic Copy Number: GSE14996
  - 58 metastatic locations from 14 patients and 16 subject-paired non-cancerous samples
  - Liu et al, Nat Med. 2009 May;15(5):559-65

Copy Number Analysis in Prostate Cancer Samples

Expression Analysis in Prostate Cancer Samples

PCA - Genotyping Data
- Shape by Condition: Tumor, Normal
- Color by Patient; Color by Patient Group

PCA - Expression Data
- Groups: Normal, Metastatic
- QC using PCA shows separation of the Normal and the Metastatic samples of GSE6919

Histogram view of data tracks in Genome Browser showing deletions as green blocks and amplifications as red blocks
- Published data: Chr. 6 deletion, Patient #17
- Chromosome 6: validated in GX11

Joint Analysis of Gene Expression and Genomic Copy Number Data in Metastatic Prostate Cancer
(Diagram labels: Copy Number; Gene Expression; Prostate Cancer Studies; Controlled for regions and metastatic tissues)

Deletions present in chr.6 of patient 17: An Integrative Analysis

Analysis workflow
- Expression:
  - T-test
  - FC 2.0, p-value: 0.05
  - Differentially expressed: 441 entities
- Genotyping:
  - Standard Reference
  - Copy Number computation
  - Filters
- Genome Browser
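The expression branch of the workflow filters on fold change and p-value. A minimal sketch of that filter, assuming the t-test results are already available as (log2 fold change, p-value) pairs: the function name, the p-values and the GENE_X / GENE_Y entries are hypothetical, and whether the cut-offs are inclusive is an assumption. Only the 2.15-fold down-regulation of PLAGL1 comes from the slides.

```python
import math


def differentially_expressed(stats, fold_change=2.0, p_cutoff=0.05):
    """Genes passing both the fold-change and the p-value cut-off.

    stats: dict mapping gene -> (log2 fold change, t-test p-value).
    """
    log2_cutoff = math.log2(fold_change)
    return {
        gene
        for gene, (log2_fc, p_value) in stats.items()
        if abs(log2_fc) >= log2_cutoff and p_value <= p_cutoff
    }


# PLAGL1 is 2.15-fold down-regulated in metastasis; other values are invented.
stats = {
    "PLAGL1": (-math.log2(2.15), 0.01),   # p-value is hypothetical
    "GENE_X": (0.4, 0.001),               # significant but below FC 2.0
    "GENE_Y": (1.8, 0.20),                # large change but not significant
}
selected = differentially_expressed(stats)
```

Working on the log2 scale lets a single absolute-value test cover both up- and down-regulation.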

Deletion of PLAGL1
- Genomic Data: deletion of PLAGL1
- Expression Data: 2.15-fold down-regulation of PLAGL1 in metastasis

PLAGL1
- Candidate tumor suppressor gene with anti-proliferative activities
- Zinc finger protein with transactivation and DNA-binding activity
- Splice variants allow differential regulation of apoptosis induction and cell cycle arrest
- Frequently deleted in many solid tumors: breast, ovarian and renal cell carcinomas
- Also known as LOT or Lost On Transformation

PLAGL1 network analysis

First order expansion of PLAGL1 network and overlay with FC data

TCF21
- Genomic Data: TCF21 CN = 2; no genomic aberration of TCF21
- Expression Data: down-regulation of expression levels of TCF21

TCF21
- First order expansion of the PLAGL1 network identified TCF21, a tumor suppressor gene, as down-regulated in the expression analysis.
- The CN of TCF21 remains at 2, unlike that of PLAGL1.
- TCF21 is known to be frequently silenced epigenetically in head and neck cancer. Consistent with this, TCF21 did not show any deletion in the samples examined, raising the possibility that TCF21 could be epigenetically regulated in prostate cancer.
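The reasoning that separates PLAGL1 from TCF21 combines two measurements per gene: copy number and expression change. A minimal sketch of that decision rule follows; the function name and category labels are hypothetical, the PLAGL1 copy number of 1 and the TCF21 fold change are invented for illustration, and only the PLAGL1 fold change (2.15 down) and the TCF21 copy number (2) come from the slides.

```python
def classify_gene(copy_number, signed_fold_change, fc_cutoff=2.0):
    """Label a gene by combining copy number with signed expression fold change.

    signed_fold_change: negative for down-regulation (e.g. -2.15).
    """
    if signed_fold_change > -fc_cutoff:
        return "not down-regulated"
    if copy_number < 2:
        return "down-regulated with copy number loss"
    if copy_number == 2:
        return "down-regulated without copy number change"
    return "down-regulated despite copy number gain"


genes = {
    "PLAGL1": (1, -2.15),   # CN of 1 is an assumed value for a deletion
    "TCF21": (2, -2.5),     # fold change is hypothetical
}
labels = {name: classify_gene(cn, fc) for name, (cn, fc) in genes.items()}
```

A gene in the second category, like TCF21 here, is a candidate for regulation by a mechanism other than deletion, such as epigenetic silencing; the rule itself cannot confirm that mechanism.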

Conclusions
1. Using GX11, we could validate the presence of ERG-TMPRSS2 in several of the metastatic prostate cancer samples.
2. Significant aberrations found in PTEN, FGF18 and TRIB3 by GISTIC indicate that these could be driver mutations of prostate cancer.
3. Additional candidates were identified by combined use of filters to identify amplified regions and regions of allelic imbalance.
4. Integrative analysis using expression and genotyping data has identified PLAGL1, a candidate tumor suppressor gene, and TCF21, a tumor suppressor gene, as having a possible role in prostate cancer.
5. PLAGL1 deletion, though present in a small percentage of the population, is an early event, occurring at a pre-metastatic stage.