Segmentation-Based Detection of Mosaic Chromosomal Abnormality in Bladder Cancer Cells Using Whole Genome SNP Arrays

Weiyin Zhou
Cancer Genomics Research Laboratory (CGR), Leidos Biomedical Research, Inc.
Texas A&M Master of Science Degree Candidate, Statistics
Texas A&M Department of Statistics, College Station, TX
February

Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Alan Dabney, for his excellent guidance, understanding, patience, and encouragement. I would also like to thank my other committee members, Dr. William B. Smith and Dr. Bruce Lowe, for taking time from their busy schedules to review my work, and Ms. Penny Jackson and Kim Ritchie for their procedural and technical support. My gratitude extends to my current employer, Leidos Biomedical Research, Inc., which paid the majority of my tuition under its education assistance program. I would like to thank my husband, Dr. Lisheng Cai, my son, Robert, and my daughter, Kimberly, for their love and support. They cheered me up throughout the entire process. Quite naturally, my parents laid the foundation for all this effort through years of caring and teaching.

Abstract

The purpose of this project is to investigate the relationship between mosaic chromosomal abnormality and bladder cancer. DNA from 3,239 individuals, consisting of 1,673 bladder cancer cases and 1,566 cancer-free controls, was examined for evidence of mosaicism of the autosomes using genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. DNA samples were extracted from blood or buccal (mouth) samples and were genotyped on the Illumina Infinium HumanHap 610 Quad SNP array. Two segmentation-based methods were used to detect three types of mosaic events: mosaic duplication (gain), mosaic deletion (loss), and mosaic copy-neutral loss of heterozygosity (CNLOH). A total of 193 such events, defined as being > 0.5 Mb in size, were observed in the autosomes of 163 individuals (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were more common in individuals with bladder cancer (5.86%) than in cancer-free individuals (4.12%). Mosaic chromosomal abnormalities were statistically significantly positively associated with bladder cancer in males (OR = 1.55; P = 0.014). In cancer-free males, the frequency of mosaic chromosomal abnormalities increased with age, from 3.39% under 60 years to 7.51% between 76 and 89 years (P = 0.035).

Contents

1. Introduction
2. Data Description
3. Illumina SNP Genotyping and Normalization
4. Mosaic Chromosomal Abnormalities Detection Methods
   4.1. Log R Ratio and B Allele Frequency
   4.2. Re-estimate Log R Ratio and B Allele Frequency
   4.3. Segmentation Methods
   4.4. Mosaic Events Calling Methods
   4.5. Proportion of Mosaicism
   4.6. Examples of Mosaic Events
5. Data Analysis
Results and Discussion
   Characteristics of Mosaic Events
   Mosaic Chromosomal Abnormalities and Age at DNA
   Mosaic Chromosomal Abnormalities and Gender
   Mosaic Chromosomal Abnormalities and Cancer Risk
      Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects
      Mosaic Chromosomal Abnormalities and Cancer Risk for Male Only
      Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only
Conclusions and Further Studies
References
Appendix A: SAS Code
Appendix B: Normalization Pipeline
Appendix C: GADA Segmentation Pipeline
Appendix D: BAF Segmentation Pipeline

1. Introduction

Genetic mosaicism is defined as the coexistence of cells with different genetic composition within an individual, despite that individual being the product of a single fertilization. It is caused by a postzygotic event during development and can occur in both somatic cells (affecting only non-reproductive cells) and germline cells (with the potential of being passed on to offspring). Mosaicism can be caused by DNA mutations, epigenetic alterations of DNA, chromosomal abnormalities, and the spontaneous reversion of inherited mutations [1,2]. Somatic mosaicism has been established as a cause of mental retardation, birth defects, spontaneous abortion, and cancer [3-7]. The unequal distribution of DNA to daughter cells during mitosis (chromosome instability) may lead to aneuploidy, to the duplication or deletion of chromosomes or chromosome segments, and to reciprocal duplication and deletion events that appear as copy-neutral loss of heterozygosity or acquired uniparental disomy. Mosaic chromosomal abnormalities have been defined as the presence of both cells with normal karyotypes and cells with large structural genomic events resulting in alteration of copy number or loss of heterozygosity, in distinct and detectable subpopulations of cells [8].

The development of microarray technology has had a significant impact on the genetic analysis of human disease. Whole-genome single nucleotide polymorphism (SNP) genotyping arrays have become an important tool for discovering variants that contribute to human diseases and phenotypes. The two most common applications of this technology are genome-wide association studies (GWAS) and copy number variant (CNV) analysis. SNP arrays let researchers genotype samples with hundreds of thousands to millions of markers, providing dense genome-wide coverage for both association testing and copy number detection.
Data from genome-wide association studies have mainly been used to test for association between single SNPs and disease status. They also provide an opportunity to detect chromosomal variation and to investigate the association of mosaicism with disease status. In this project, the SNP microarray data generated on the Illumina Infinium HumanHap 610 Quad SNP array for the bladder cancer GWAS were used to uncover mosaic genomic copy number gains, losses, and copy-neutral loss of heterozygosity in the autosomes of 5% of subjects. Two segmentation-based algorithms were used to detect 193 mosaic events of > 0.5 Mb in size. The types of chromosomal abnormalities detected were characterized, and the relationship between chromosomal abnormalities and cancer risk, age at DNA collection, and gender was investigated. The frequency of mosaic chromosomal abnormalities was positively associated with bladder cancer in male subjects, and it increased with age in cancer-free individuals.

2. Data Sets Description

The data used for this project consist of 3,239 individuals in bladder cancer genome-wide association studies (GWAS) from three cohorts: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the Prostate, Lung, Colorectal, Ovarian Cancer Screening Trial (PLCO), and the Cancer Prevention Study-II (CPS-II). The subjects are summarized by study and case-control status in Table 1. There were 1,673 cases that had been diagnosed with urothelial cell carcinoma of the bladder and 1,566 controls that were cancer-free. There were 2,734 males and 505 females; gender by case-control status is summarized in Table 2. The mean age at DNA withdrawal is 67 years for all subjects (range years, s.d. = 8.83). The mean age is 67 years for males (range 21-89, s.d. = 8.94) and 69 years for females (range s.d. = 7.89). The mean age is 66 years for cases (range 21-87, s.d. = 10.2) and 68 years for controls (range s.d. = 6.99).

The study was approved by the institutional ethics committees of each participating hospital and the institutional review board (IRB) of the National Cancer Institute (NCI, USA). Written informed consent was obtained from all individuals. DNA was extracted from peripheral blood (58.6%) and mouthwash samples (41.4%). Genomic DNA was screened and analyzed at the National Cancer Institute according to the sample handling process of the Cancer Genomics Research Laboratory (CGR), Division of Cancer Epidemiology and Genetics (DCEG), before being genotyped on the HumanHap 610 Quad BeadChip (Illumina, Inc.) via the Infinium Assay. Overall, 3.3% of samples were run in duplicate to check reproducibility, with a SNP genotype concordance rate greater than 99.98% between technical duplicates. The completion rate, defined as the proportion of non-missing genotypes for a sample, was calculated by dividing the number of called SNP probes by the total number of SNP probes on the array, using the GLU qc.summary module. The overall completion rate for the study samples is 97.87%. The distribution of the completion rates by sample and by locus is shown in Figure 1. A brief summary of the sample/locus counts at the 100th, 99th, 95th, 90th, and 50th percentiles is provided as an inset in Figure 1.

Ancestry was estimated for the 3,239 study subjects using a set of population-informative SNPs [9] and data from HapMap build 27. The SNPs used are common to the commercially available Affymetrix 500K, Illumina 317K, and 550K chips. Admixture coefficients were estimated for each subject using the GLU struct.admix module, with the HapMap CEU, YRI, and ASA (JPT+CHB) samples as the fixed reference populations. A total of 3,205 subjects were found to have European ancestry. The remaining 34 subjects were found to have less than 80% European ancestry, as shown in Figure 2 and summarized in Table 3.
Table 1 Summary of Study Cohorts by Cancer Status
Columns: Cancer Status | ATBC | PLCO | CPSII | Total
Rows: Bladder Cancer; Control; Total

Table 2 Summary of Gender by Cancer Status
Columns: Cancer Status | Male | Female | Total
Rows: Bladder Cancer; Control; Total

Table 3 Summary of Population Structure by Cancer Status
Columns: Study Cohort | Cancer Status | Imputed Ancestry (CEU | ADMIXED CEU | ASA | ASA,CEU | CEU,YRI | YRI) | Total
Rows: ATBC (Bladder Cancer, Control); PLCO (Bladder Cancer, Control); CPS (Bladder Cancer, Control); Total

Figure 1 Bladder completion rate by sample (left) and by locus (right).

Figure 2 Population structure

3. Illumina SNP Genotyping and Normalization

SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species. SNPs are one of the most common types of genetic variation. A SNP is a single base pair mutation at a specific site in DNA, usually consisting of two alleles that make up the individual's genotype. Illumina DNA Analysis BeadChips using the Infinium Assay give researchers genome-wide access for analyzing genetic variation. The Infinium assay uses two color channels, so the data consist of two intensity values (X, Y) for each SNP, one intensity channel for each of the two fluorescent dyes associated with the two alleles of the SNP. The alleles measured by the X channel (Cy5 dye) are called the A allele, whereas the alleles measured by the Y channel (Cy3 dye) are called the B allele. Each SNP is analyzed independently to identify genotypes.

Illumina's standard normalization algorithm is the first step in SNP genotyping data analysis. The intensity data are normalized using Illumina's self-normalization algorithm, which draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. Normalized values are then used for standard genotype calling, loss of heterozygosity (LOH) analysis, and copy number (CN) analysis. In a diploid genome without CNVs, the three possible genotype calls are AA, AB, and BB. The raw signal intensity values measured for the A and B alleles are subject to Illumina's standard five-step normalization procedure, which determines six parameters: offset_x, offset_y, theta, shear, scale_x, and scale_y. The normalization algorithm is designed to adjust for nominal intensity variations observed in the two color channels, background differences between the two channels, possible crosstalk between the dyes, and global intensity differences, and to scale the data [10].
Figure 3 depicts the five steps of the normalization process.

Step 1: Outlier removal (Figure 3-A). Outlier SNPs are removed from consideration during the normalization parameter estimation. They are not excluded from downstream analysis.
Step 2: Background estimation (offset_x, offset_y) (Figure 3-B). Candidate homozygote A alleles are identified along the X-axis and candidate homozygote B alleles along the Y-axis. A straight line is fit to each of the homozygote A and homozygote B groups. The offset_x and offset_y parameters are the intercepts of these two lines. The points are corrected for translation.
Step 3: Rotational estimation (theta) (Figure 3-C). A set of control points is identified along the X-axis and a straight line is fit to them. The theta parameter is the angle between this line and the X-axis and defines the amount of rotation in the data. The points are corrected for rotation.
Step 4: Shear estimation (shear) (Figure 3-D). A set of control points is identified along the Y-axis and a straight line is fit to them. The shear parameter is the angle between this line and the Y-axis. The points are corrected for shear.
Step 5: Scaling estimation (scale_x, scale_y) (Figure 3-E). A statistical method is used to determine the scale_x and scale_y parameters.

Figure 3-F shows the final set of normalized data points. The points along the X-axis represent AA genotypes, the points along the Y-axis represent BB genotypes, and the points along the 45-degree line represent AB genotypes.

Illumina then uses these six estimated parameter values to convert raw coordinates (X raw, Y raw) to normalized coordinates (X norm, Y norm) for each SNP, representing the experiment-wide normalized signal intensity on the A and B alleles, respectively.

Figure 3 Five-step normalization procedure. Figure 3-F shows the final normalized data points for a particular SNP. The points on the X-axis represent AA genotypes. The points along the Y-axis represent BB genotypes. The points along the approximate 45-degree line are AB genotypes.

To visualize the data after normalization, the genotyping data are transformed from Cartesian coordinates (Figure 4 left) to polar coordinates (Figure 4 right). The Cartesian plot uses the X axis to represent the intensity of the A allele and the Y axis to represent the intensity of the B allele. The polar plot uses the X axis to represent normalized theta (the angle deviation from pure A signal, where 0 represents pure A signal and 1.0 represents pure B signal) and the Y axis to represent R, the distance of the point from the origin. Theta and R are calculated as:

Theta = (2/pi) * arctan(Y norm / X norm)
R = X norm + Y norm

where X norm and Y norm represent the transformed normalized signals from alleles A and B for a particular locus (SNP).
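The Cartesian-to-polar transformation above can be illustrated with a minimal Python sketch. The function name `to_polar` is hypothetical, and the use of `atan2` (which equals arctan(Y norm / X norm) for non-negative intensities and avoids division by zero) is an assumption made for the example rather than a detail taken from GenomeStudio.

```python
import math


def to_polar(x_norm, y_norm):
    """Return (theta, r) for one SNP from normalized A and B intensities.

    Hypothetical helper: theta is scaled so that 0 means pure A signal
    and 1 means pure B signal; r is the total intensity X norm + Y norm.
    """
    # atan2 matches arctan(y/x) for non-negative inputs and handles x == 0.
    theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    r = x_norm + y_norm
    return theta, r


if __name__ == "__main__":
    for x, y in [(1.0, 0.0), (0.6, 0.6), (0.0, 1.0)]:
        print((x, y), to_polar(x, y))
```

A point with equal A and B intensity lands at theta = 0.5, which is where the AB cluster is expected before any cluster-position correction.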

Figure 4 SNP Graphs: Cartesian Coordinates (left) and Polar Coordinates (right). The graphs display all samples for the currently selected SNP. Samples are colored according to their genotype. In the right graph, for this particular SNP (ID = rs ), 420 samples (red cluster) are called as AA, 124 samples (purple cluster) as AB, and 13 samples (blue cluster) as BB.

4. Mosaic Chromosomal Abnormalities Detection Methods

4.1. Log R Ratio and B Allele Frequency

The main goal of this project is to investigate the relationship between mosaic chromosomal abnormalities and bladder cancer risk by first identifying regions of the genome that are aberrant in copy number, more specifically, mosaic copy number variation on autosomal chromosomes in bladder cancer and cancer-free subjects. The detection of autosomal mosaic events was based on assessment of allelic imbalance and copy number changes. The chromosomal abnormalities were detected using two Infinium high-density assay outputs: the log R ratio (LRR) and the B allele frequency (BAF). The LRR and BAF values were originally developed on the Illumina platform. For Illumina SNP arrays, the LRR and BAF values can be directly calculated and exported from Illumina's GenomeStudio software.

The log R ratio (LRR) is the normalized measure of total signal intensity and provides data on relative copy number. For each SNP, let the normalized signal intensities for the A and B alleles be denoted X norm and Y norm, respectively. The R value is then calculated as R observed = X norm + Y norm, a normalized measure of total signal intensity. The log R ratio is calculated as LRR = log2(R observed / R expected), where R expected is computed by linear interpolation of the genotype clusters (Figure 5 left). The three cluster positions are generated from a large set of samples that passed the completion rate cutoff.
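The LRR calculation described above can be sketched as follows. This is an illustration under stated assumptions, not GenomeStudio's implementation: the function names, the `clusters` argument (a list of hypothetical (theta, R) cluster centers for AA, AB, and BB), and the choice to clamp R expected outside the outer clusters are all assumptions made for the example.

```python
import math


def expected_r(theta, clusters):
    """Linearly interpolate the expected R at a sample's theta.

    clusters: [(theta_AA, r_AA), (theta_AB, r_AB), (theta_BB, r_BB)],
    ordered by increasing theta. Values outside the outer cluster
    centers are clamped (an assumption made for this sketch).
    """
    (t_aa, r_aa), (t_ab, r_ab), (t_bb, r_bb) = clusters
    if theta <= t_aa:
        return r_aa
    if theta >= t_bb:
        return r_bb
    if theta < t_ab:
        lo, hi = (t_aa, r_aa), (t_ab, r_ab)
    else:
        lo, hi = (t_ab, r_ab), (t_bb, r_bb)
    frac = (theta - lo[0]) / (hi[0] - lo[0])
    return lo[1] + frac * (hi[1] - lo[1])


def log_r_ratio(x_norm, y_norm, theta, clusters):
    """LRR = log2(R observed / R expected) for one SNP."""
    r_observed = x_norm + y_norm
    return math.log2(r_observed / expected_r(theta, clusters))


if __name__ == "__main__":
    # Hypothetical cluster centers for a single SNP.
    clusters = [(0.05, 1.0), (0.50, 1.2), (0.95, 1.0)]
    print(log_r_ratio(0.6, 0.6, 0.5, clusters))  # intensity as expected
    print(log_r_ratio(0.9, 0.9, 0.5, clusters))  # elevated intensity
```

An LRR near zero indicates that the observed total intensity matches the reference clusters, while positive or negative values point to gain or loss of signal.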
The LRR value for a SNP is a measure of the difference between the signal intensity of the test sample and that of a pool of reference samples with the same SNP genotype. Since LRR is the logged ratio of observed probe intensity to expected intensity, deviation from zero is evidence of copy number change.

The B allele frequency (BAF), derived from the ratio of allelic probe intensities, is the proportion of hybridized sample that carries the B allele as designated by the Infinium Assay. The B allele frequency can also be referred to as copy angle or allelic composition. It shows the relative presence of each of the two alternative nucleotides, A and B, at each SNP locus profiled. The BAF for a sample is the theta value for a SNP corrected for cluster position, where θ = (2/pi) * arctan(Y norm / X norm). The BAF value is calculated from θ and the cluster positions θ AA, θ AB, and θ BB, which are the θ values for the three genotype clusters generated from a large set of samples that passed the completion rate cutoff (Figure 5 right). In the right figure, D1 = (θ - θ AB) and D2 = (θ AB - θ BB). In a normal sample, discrete BAFs of 0.0, 0.5, and 1.0 are expected for each locus, representing the AA, AB, and BB genotypes. Deviations from this expectation are indicative of aberrant copy number. For example, a locus with BAF = 1/3 might indicate that 1 copy of the B allele and 2 copies of the A allele are present in the sample, because 1/(1+2) = 1/3. Analyzing both the LRR and BAF metrics provides strong resolution for detecting true copy number changes and allelic imbalance (Table 4).

Figure 5 Log R Ratios (LRR) and Allelic Intensity Ratio (BAF).

Table 4 Summary of Copy Numbers, Genotypes, Expected LRR, and Expected BAF.
Total Copy Numbers | CNV Genotypes | Expected LRR | Expected BAFs
Deletion of Two Copies | Null | < 0 | N/A
Deletion of One Copy | A, B | < 0 | 0, 1
Normal Copy | AA, AB, BB | 0 | 0, 0.5, 1
Copy-Neutral LOH | AA, BB | 0 | 0, 1
Single Copy Duplication (Trisomy) | AAA, AAB, ABB, BBB | > 0 | 0, 1/3, 2/3, 1
Double Copy Duplication | AAAA, AAAB, AABB, ABBB, BBBB | > 0 | 0, 1/4, 2/4, 3/4, 1
Mosaic Deletion | mixed (AA, AB, BB) and (A, B) | < 0 | 4 BAF bands
Mosaic Copy-Neutral LOH | mixed (AA, AB, BB) and (AA, BB) | 0 | 4 BAF bands
Mosaic Duplication | mixed (AA, AB, BB) and (AAA, AAB, ABB, BBB) | > 0 | 0, > 1/3, < 2/3, 1
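The relationship between theta, the cluster positions, and BAF in section 4.1 can be illustrated with a short sketch. It assumes the commonly used piecewise-linear interpolation between the three cluster centers; the exact equation applied by GenomeStudio may differ, and the function names and example cluster positions are hypothetical.

```python
def baf_from_theta(theta, theta_aa, theta_ab, theta_bb):
    """Map a sample's theta to a B allele frequency in [0, 1].

    Assumed form: piecewise-linear interpolation so that the AA, AB,
    and BB cluster centers map to 0, 0.5, and 1 respectively.
    """
    if theta < theta_aa:
        return 0.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    if theta < theta_bb:
        return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
    return 1.0


def expected_baf(copies_a, copies_b):
    """Expected BAF for a genotype with the given allele copy counts."""
    return copies_b / (copies_a + copies_b)


if __name__ == "__main__":
    # Hypothetical cluster positions for one SNP.
    t_aa, t_ab, t_bb = 0.05, 0.50, 0.95
    print(baf_from_theta(0.50, t_aa, t_ab, t_bb))  # AB cluster center
    print(expected_baf(2, 1))                      # AAB, as in the text
```

The second helper reproduces the worked example in the text: one B copy and two A copies give an expected BAF of 1/3, one of the heterozygous bands listed for trisomy in Table 4.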

Genetic mosaicism is the presence of cells within an organism that have a different genetic composition despite being the product of a single fertilization event. For this project, three mosaic types were investigated: mosaic deletion (loss), mosaic copy-neutral LOH, and mosaic duplication (gain), defined as follows.

Mosaic deletion is the coexistence of cells with normal copy number and cells with a deletion of one copy. It is characterized by LRR < 0 and two heterozygous BAF bands.
Mosaic copy-neutral LOH is the coexistence of cells with normal copy number and cells with copy-neutral LOH. It is characterized by LRR = 0 and two heterozygous BAF bands.
Mosaic duplication is the coexistence of cells with normal copy number and cells with a duplication of one copy. It is characterized by LRR > 0 and two heterozygous BAF bands between 1/3 and 2/3. Note that if the two heterozygous BAF bands are at exactly 1/3 (AAB) and 2/3 (ABB), the event is a pure (non-mosaic) duplication of one copy (trisomy).

4.2. Re-estimate LRR and BAF

LRR and BAF were estimated by the GenomeStudio software. However, two sources of bias are not overcome by Illumina's five-step normalization method: dye bias and GC/CpG wave bias. There is an asymmetry in the detection of the two alleles for each SNP, caused by a bias between the two dyes used in the Infinium II assay that remains after Illumina's normalization. The dye intensity bias can reduce precision in estimating copy number and allelic imbalance. GC/CpG waves can be present when incorrectly quantified DNA is used in the Infinium assay, or in regions of high or low GC content. The presence of GC/CpG waves creates artificial gains and losses in signal intensities for SNPs and may lead to spurious copy-number variation calls.
A four-step custom software pipeline was applied to the data exported from GenomeStudio (called genotype, genotype call quality score, genotype probe intensities (X norm, Y norm), log R ratio (LRR), and B allele frequency (BAF) for each assay) to normalize the data further and re-estimate LRR and BAF.

Step 1: Quantile normalization [11] was applied to the X norm and Y norm values generated by Illumina's GenomeStudio software, yielding X qnorm and Y qnorm. This procedure removes dye bias and reduces the asymmetry in the detection of the two alleles for each SNP, which influences both allelic proportions and copy number estimates.
Step 2: Genotype-specific cluster centers (AA, AB, BB) were re-estimated for each SNP using X qnorm and Y qnorm values from assays with completion rate and genotype quality score greater than predefined thresholds, so that only SNPs from high-quality samples were used to generate each cluster position.
Step 3: A GC/CpG wave correction model was applied to each genotyped sample to obtain the GC/CpG-corrected allelic composition theta = (2/pi) * arctan(Y qnorm / X qnorm) and total intensity R, which was estimated as a linear combination of X qnorm, Y qnorm, and the GC content in probes [12]. GC/CpG correction reduces the wavy patterns of signal intensities and improves the accuracy of copy-number variation detection.
Step 4: Finally, LRR and BAF were recomputed using the resulting quantile-normalized and GC/CpG-corrected values, as described in [13].
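Step 1 above can be illustrated with a minimal sketch of quantile normalization applied to two intensity vectors. It assumes the X and Y channels of a single assay are normalized against each other by replacing each value with the mean of the two channels at the same rank; the function name is hypothetical, and the project's actual pipeline (implemented in GLU) may define the reference distribution differently.

```python
def quantile_normalize_pair(x, y):
    """Quantile-normalize two equal-length intensity vectors.

    Each value is replaced by the mean of the two sorted vectors at the
    same rank, so both outputs share one distribution. Ties are ranked
    in their original order (a simplification for this sketch).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    order_x = sorted(range(n), key=lambda i: x[i])
    order_y = sorted(range(n), key=lambda i: y[i])
    # Reference distribution: per-rank mean of the two channels.
    reference = [(x[i] + y[j]) / 2.0 for i, j in zip(order_x, order_y)]
    x_q = [0.0] * n
    y_q = [0.0] * n
    for rank, (i, j) in enumerate(zip(order_x, order_y)):
        x_q[i] = reference[rank]
        y_q[j] = reference[rank]
    return x_q, y_q


if __name__ == "__main__":
    # Toy example: the Y channel is systematically brighter than X.
    x_q, y_q = quantile_normalize_pair([1.0, 3.0, 2.0], [30.0, 10.0, 20.0])
    print(x_q, y_q)
```

After normalization the two channels have identical sorted values, which is the property that removes a global dye-intensity asymmetry while preserving each probe's rank within its channel.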

The reduction in variance of the LRR values after applying the above four steps is demonstrated in Figure 6 for one cluster group of Illumina HumanHap610 assays from this project.

Figure 6 Variance of log2 R ratio (LRR) before and after the normalization procedure for one cluster group of Illumina HumanHap610 assays.

The reduction in GC/CpG waves is evident in Figure 7.1 (sample without any chromosomal abnormality) and Figure 7.2 (duplication), which plot the signal intensity patterns before and after wave adjustment for two samples from this project.

Figure 7.1 Pre-normalization (left) and post-normalization (right) for a subject without chromosomal abnormality. Each dot in the figure represents one SNP. Red dots represent B allele frequency (BAF, scale on the right side), while black dots show LRR values (LRR, scale on the left side). Three red bands represent BAF values for the AA, AB, and BB genotypes along the entire chromosome. One black band in the middle (overlapping the red AB BAF band) represents LRR values along the entire chromosome. There are wavy patterns with peaks and troughs in the LRR values across the entire chromosome 13 for the pre-normalized data.

Figure 7.2 Pre-normalization (left) and post-normalization (right) for a subject with duplication. There are wavy patterns with peaks and troughs in the LRR values across the entire chromosome 17 for the pre-normalized data.

4.3. Segmentation Methods

In this project, two open-source packages, Genomic Alteration Detection Analysis (GADA) [14] and BAF Segmentation [15], were applied to the same data set to detect breakpoints on each chromosome. GADA uses the Sparse Bayesian Learning (SBL) segmentation algorithm, and BAF Segmentation uses the Circular Binary Segmentation (CBS) algorithm [16]. The mosaic events detected in the samples by the two methods were then combined.

Two large mosaicism studies were conducted by two independent research groups, the Gene-Environment Association Studies consortium (GENEVA) and the Cancer Genome Research Lab (CGR), and the results from both groups were published in Nature Genetics in May 2012 [8,17]. Two lung cancer data sets, from the Environment and Genetics in Lung Cancer Etiology Study (EAGLE) and the Prostate, Lung, Colorectal, Ovarian Cancer Screening Trial (PLCO), were used by both groups. The GENEVA group used the CBS algorithm and the CGR group used the SBL algorithm. The resulting mosaic events were then compared between the groups. The concordance rate was 75%, and each group detected mosaic events that the other group missed. To minimize the false-negative rate, this project used both segmentation algorithms to detect mosaic chromosomal abnormalities and then combined the results.

The main steps implemented with the Genomic Alteration Detection Analysis (GADA) software [14] are as follows. Breakpoint detection is based on the SBL (Sparse Bayesian Learning) algorithm. The method detects segments where the B deviation is different from 0. The B deviation is the deviation of the observed BAF value from the expected BAF value of 0.5 for heterozygous SNPs. Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o A Sparse Bayesian Learning (SBL) model was used to discover the most likely genomic locations and magnitudes for a CNV segment. The sparseness hyperparameter (aalpha) controls the SBL prior distribution, which is uninformative about the location and amplitude of the CNV breakpoints but imposes a penalty on the number of CNV breakpoints. A higher aalpha implies that fewer breakpoints are expected a priori and results in fewer true CNVs being detected, but also in fewer false positives.
o Backward Elimination (BE) is used to rank the statistical significance of each breakpoint obtained from SBL and to sequentially remove the least significant breakpoints, using two parameters, T and MinSegLen. The T argument is the critical value of the BE algorithm for the statistical score t m associated with breakpoint m. Breakpoints with t m lower than T are discarded. The score t m is the difference between the sample averages of the probes falling on the left and right segments, divided by a pooled estimate of the standard error. T can be efficiently adjusted to control the False Discovery Rate (FDR). The argument MinSegLen indicates the number of consecutive probes (SNP markers) with a BAF deviation different from 0 that each CNV segment must contain. As T and MinSegLen increase, the number of CNV breakpoints decreases.

The main steps implemented with the BAF Segmentation software [15] are as follows. Breakpoint detection is based on the CBS (Circular Binary Segmentation) algorithm [16]. The method detects segments where mbaf is different from 0.5, since the expected BAF is 0.5 for the AB genotype in a diploid genome without CNVs. BAF data are reflected into mbaf along the 0.5 axis by the transformation mbaf = abs(baf - 0.5) + 0.5, where abs stands for taking the absolute value. Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o Convert BAF data to mbaf, so that homozygous SNPs (AA and BB) are positioned at 1 and heterozygous SNPs without CNVs are positioned at 0.5.
o The homozygous SNPs are uninformative for determination of the total copy number. Remove homozygous SNPs from the mbaf profile based on a fixed mbaf threshold. SNPs above the threshold are considered non-informative and removed.
o Triplet filtering is next applied to the mbaf-threshold-filtered data to further improve the removal. For each SNP, the absolute sum of the difference in mbaf between the investigated SNP and the preceding and succeeding SNPs was calculated and added to the SNP's distance from the 0.5 baseline. For a SNP with index i:
triplet sum[i] = abs(mbaf[i - 1] - mbaf[i]) + abs(mbaf[i + 1] - mbaf[i]) + mbaf[i]
o Triplet sums are compared against a threshold. SNPs with triplet sums above the threshold are considered outliers and removed. The triplet filtering is designed to remove non-informative homozygous SNPs that, due to experimental noise, obtain mbaf values lower than the mbaf threshold.
o The Circular Binary Segmentation model was applied to the mbaf profiles after removal of non-informative homozygous SNPs to discover the most likely genomic locations and magnitudes for a CNV segment. The total number of breakpoints is controlled by alpha, the significance level for accepting change-points.

4.4. Mosaic Event Type Calling Method

Each event was assigned a copy-number state based on the median LRR value for the segment:
State = mosaic duplication (gain) if median(LRR) > 0.2 s LRR
State = mosaic deletion (loss) if median(LRR) < -0.2 s LRR
State = mosaic copy-neutral LOH otherwise
where s LRR is the standard deviation of the segment LRR values.
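The mbaf transformation, the removal of homozygous SNPs by a fixed threshold, and the event-type calling rule can be sketched together as follows. This is a simplified illustration, not the BAF Segmentation or GADA code: the function names are hypothetical, the default threshold of 0.97 is an illustrative value, the triplet filter is omitted, and the use of the sample standard deviation for s LRR is an assumption.

```python
from statistics import median, stdev


def to_mbaf(baf):
    """Reflect a BAF value about the 0.5 axis: mbaf = abs(baf - 0.5) + 0.5."""
    return abs(baf - 0.5) + 0.5


def informative_snps(bafs, mbaf_threshold=0.97):
    """Return (index, mbaf) pairs for SNPs at or below the mbaf threshold.

    SNPs above the threshold are treated as homozygous and dropped.
    The default threshold is illustrative, not taken from the project.
    """
    kept = []
    for i, baf in enumerate(bafs):
        m = to_mbaf(baf)
        if m <= mbaf_threshold:
            kept.append((i, m))
    return kept


def call_state(segment_lrr):
    """Assign gain, loss, or CNLOH from the LRR values in one segment."""
    med = median(segment_lrr)
    s_lrr = stdev(segment_lrr)  # sample standard deviation (assumption)
    if med > 0.2 * s_lrr:
        return "gain"
    if med < -0.2 * s_lrr:
        return "loss"
    return "cnloh"


if __name__ == "__main__":
    print(informative_snps([0.0, 0.48, 0.55, 1.0, 0.35]))
    print(call_state([0.10, 0.12, 0.11, 0.09]))
    print(call_state([-0.05, 0.05, -0.04, 0.04]))
```

Scaling the cutoff by the segment's own standard deviation means that noisier segments need a larger median shift in LRR before they are called as gain or loss rather than copy-neutral LOH.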

After each segmentation method was applied to the same data set, the output file contained the start and end of each detected segment, the chromosome, the median LRR, and the standard deviation of the LRR within the segment. For each sample, adjacent events were merged if the event types were identical and the distance between segments was less than 1 Mb. After merging, events shorter than 0.5 Mb were excluded, as the false-positive rate increased rapidly for smaller events. Most of the false positives were due to noisy data (high LRR and BAF variance) or to non-mosaic CNVs being detected as potentially mosaic.

4.5. Proportion of Mosaicism

For each segment identified by SBL/CBS, a Gaussian mixture model with 2-4 Gaussian components was fit to the normalized BAF values of the segment, and the best-fitting model was chosen using the Akaike information criterion (AIC). The 2-4 components represent 2-4 possible BAF bands. A two-component model (two BAF bands, representing AA and BB or A and B) best fits segments with complete loss of heterozygosity, or with mosaic copy-neutral LOH or loss at mosaic proportions of nearly 100%. A three-component model (three BAF bands, for AA, AB, and BB) should be the best fit for segments that are normal or have very low mosaic proportions. For segments where a two- or three-component model was chosen, mosaic proportions were assigned manually when a manual review of the combined LRR and BAF plot showed sufficient evidence of mosaicism. Segments where the four-component model was the best fit (four BAF bands: AA/A, BB/B, AB/A, and AB/B for mosaic deletion; AA/AA, BB/BB, AB/AA, and AB/BB for mosaic CNLOH; AA/AAA, BB/BBB, AB/AAB, and AB/ABB for mosaic duplication; see the last three rows of Table 4) were assigned mosaic proportions based on the inferred state and the location of the estimated heterozygote BAF bands (mu 1, mu 2).
Here mu 1 and mu 2 are the means of the BAF values across the segment for each of the two heterozygote BAF bands. The mosaic proportions were calculated from the inferred mosaic state and the location of the estimated heterozygote BAF bands (mu 1, mu 2) with formulas similar to [18]:

D = mu 1 - mu 2
Proportion of cells with a deletion = 2D / (1 + D)
Proportion of cells with a duplication = 2D / (1 - D)
Proportion of cells with copy-neutral loss of heterozygosity = D

4.6. Example of Mosaic Duplication, Deletion, and CNLOH

Figure 8a is the LRR and BAF plot for a normal sample. Figures 8b-g are LRR and BAF plots of six representative mosaic chromosomal abnormalities of different types selected from this project. The plots show the log R ratio (LRR) (black dots, scale on the left side) and B allele frequency (BAF) (red dots, scale on the right side) values along the entire chromosome carrying the rearrangement in the selected samples.
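The event merging and size filtering applied before section 4.5, and the mosaic proportion formulas of section 4.5, can be sketched as follows. The function names and the dictionary layout of an event are hypothetical, positions are assumed to be in base pairs, and keeping events of exactly 0.5 Mb is an assumption, since the text only states that smaller events were excluded.

```python
def merge_and_filter(events, max_gap=1_000_000, min_size=500_000):
    """Merge adjacent same-type events of one sample, then drop short ones.

    events: list of dicts with keys "chrom", "start", "end", "state".
    Adjacent events are merged when they share chromosome and state and
    the gap between them is less than max_gap.
    """
    merged = []
    for ev in sorted(events, key=lambda e: (e["chrom"], e["start"])):
        last = merged[-1] if merged else None
        if (last is not None
                and last["chrom"] == ev["chrom"]
                and last["state"] == ev["state"]
                and ev["start"] - last["end"] < max_gap):
            last["end"] = max(last["end"], ev["end"])
        else:
            merged.append(dict(ev))
    return [e for e in merged if e["end"] - e["start"] >= min_size]


def mosaic_proportion(mu_1, mu_2, state):
    """Proportion of abnormal cells from the two heterozygote BAF bands.

    mu_1 is the upper band mean and mu_2 the lower band mean, so that
    D = mu_1 - mu_2 is non-negative.
    """
    d = mu_1 - mu_2
    if state == "loss":
        return 2.0 * d / (1.0 + d)
    if state == "gain":
        return 2.0 * d / (1.0 - d)
    if state == "cnloh":
        return d
    raise ValueError("unknown state: " + state)


if __name__ == "__main__":
    # Half of the cells carry a one-copy deletion: bands at 2/3 and 1/3.
    print(mosaic_proportion(2 / 3, 1 / 3, "loss"))
    # Half of the cells carry a one-copy duplication: bands at 0.6 and 0.4.
    print(mosaic_proportion(0.6, 0.4, "gain"))
```

The two printed cases can be checked by hand: with a fraction p of cells affected, the heterozygote bands sit at 1/(2 - p) and (1 - p)/(2 - p) for a deletion and at (1 + p)/(2 + p) and 1/(2 + p) for a duplication, and substituting the resulting D into the formulas returns p.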

Figure 8a Example of one subject with normal copy number for chromosome 13. Each dot in the figure represents one SNP. Red dots represent B allele frequency (BAF, scale on the right side), while black dots show log R ratio values (LRR, scale on the left side). Three red bands represent BAF values for the AA (bottom red band), AB (middle red band), and BB (top red band) genotypes across the entire chromosome 13. One black band in the middle (overlapping the red AB BAF band) represents LRR values (around 0) along the entire chromosome 13.

Figure 8b Interstitial mosaic duplication on the p arm of chromosome 16, characterized by increased log R ratio (mean of LRR within segment (blue line) > 0) and abnormal heterozygous BAF. The vertical gray lines indicate the breakpoint(s) of the event segment. A non-mosaic trisomy would have a wider BAF split, at 1/3 (AAB) and 2/3 (ABB), and a larger elevation of LRR.

Figure 8c Mosaic duplication of the entire chromosome 8. It is characterized by increased Log R ratio (mean of LRR within the segment > 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8c is less than in Figure 8b because it has a narrower split in the intermediate heterozygous BAF bands along with a smaller increase in LRR.

Figure 8d Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire q arm of chromosome 1. It is characterized by unchanged Log R ratio (mean of LRR within the segment close to 0) and abnormal heterozygous BAF. The p arm is in the normal state. A non-mosaic CNLOH would have only two BAF bands (AA and BB) and LRR close to 0.

Figure 8e Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire chromosome 14. It is characterized by unchanged Log R ratio (mean of LRR within the segment close to 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8e is greater than in Figure 8d because it has a wider split in the intermediate BAF bands.

Figure 8f Two small interstitial mosaic heterozygous deletions on the p arm of chromosome 2. They are characterized by decreased Log R ratio (mean of LRR within the segment < 0) and abnormal heterozygous BAF. A non-mosaic heterozygous deletion would have no intermediate BAF bands and a larger decrease in LRR.

Figure 8g Large mosaic heterozygous deletion on the q arm of chromosome 9. It is characterized by decreased Log R ratio (mean of LRR within the segment < 0) and abnormal heterozygous BAF. The mosaic deletion in Figure 8g has a smaller proportion of cells containing the deletion than the one in Figure 8f because it has a narrower split in the intermediate BAF bands along with a smaller decrease in LRR.

5. Data Analysis

Analysis started by loading the sample intensity files (two files per sample, for the red and green channels) into Illumina's GenomeStudio software. The intensity data were normalized using Illumina's five-step self-normalization procedure (see Section 3, Illumina SNP Genotyping and Normalization), which draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. Data on called genotype, genotype call quality score, raw and normalized genotype probe intensities, LRR, and BAF for each assay were exported from GenomeStudio in its Genotype Final Report (GFR) format. The array data set in the GFR file was then converted into a high-performance binary file format (GDAT) using the GLU software package ( ) developed at the Cancer Genomics Research Laboratory (CGR). A GC/CpG model file (GCM file) was generated using a copy of the reference genome UCSC hg18 and the Illumina binary manifest file Human610-Quadv1_B.bpm. Within GDAT, a four-step custom software pipeline (see Section 4.2, Re-estimate Log R Ratio and B Allele Frequency) was implemented. The information in the GCM file was used for GC/CpG correction. The LRR and BAF were re-estimated from the quantile-normalized and GC/CpG-corrected values and written directly into the GDAT file as a new data table. All of these procedures were implemented using the GLU software package.
The renormalized LRR and BAF values from qualifying assays (completion rate >= 90%) were then analyzed using two custom software pipelines built on the GADA and BAF Segmentation packages to detect whole-chromosome and large segmental events greater than 0.5 Mb in size, a threshold chosen to minimize false discoveries (see Section 4.3, Segmentation Methods). We applied the GADA method with the following parameter settings: the SBL sparseness hyperparameter used to discover the total number of breakpoints, aalpha = 0.85; the critical value of the backward elimination algorithm for the statistical score associated with a breakpoint, T statistic = 10; and the minimum number of SNPs each CNV segment must contain, MinSegLen = 200. We applied the BAF Segmentation method with the following parameter settings: the threshold in mBAF for calling regions of mosaic events based on segmented mBAF values, ai_threshold = 0.56 (default); the minimum number of SNPs a segmented region must contain to be called a mosaic event, ai_size = 45; the threshold in mBAF for removing putatively non-informative SNPs, informative_threshold = 0.97 (default); the threshold for the triplet filtering used to improve removal of putatively non-informative homozygous SNPs, triplet_threshold = 0.8 (default); and the significance level for accepting changepoints when using CBS to identify breakpoints of genomic regions, alpha = 0.001. For each sample, adjacent events were merged if the event types were identical and the distance between segments was less than 1 Mbp. After merging, events of length < 0.5 Mbp were excluded. All events were then plotted. Based on manual review of each plot, false-positive calls were also excluded from the analysis because they are not mosaic events; these included calls due to noisy assay data, non-mosaic copy-number variants, and loss of heterozygosity caused by hemizygous deletion (deletion of one copy), identity by descent (IBD), or uniparental disomy (UPD). The segment boundaries were manually corrected for some of the events.
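The merge-then-filter step described above (merge adjacent same-type events closer than 1 Mbp, then drop events shorter than 0.5 Mbp) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the project: the `Event` record, its field names, and the function name are hypothetical, and the input is assumed to hold the events of a single sample.

```python
from dataclasses import dataclass
from typing import List

MERGE_GAP_BP = 1_000_000   # merge when the gap between segments is < 1 Mbp
MIN_EVENT_BP = 500_000     # exclude events of length < 0.5 Mbp


@dataclass
class Event:
    """Hypothetical record for one detected segment of one sample."""
    chrom: str
    start: int
    end: int
    event_type: str  # e.g. "gain", "loss", "cnloh"


def merge_and_filter(events: List[Event]) -> List[Event]:
    """Merge adjacent same-type events, then drop events that are too short."""
    merged: List[Event] = []
    for ev in sorted(events, key=lambda e: (e.chrom, e.start)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.chrom == ev.chrom
            and last.event_type == ev.event_type
            and ev.start - last.end < MERGE_GAP_BP
        ):
            # Extend the previous event instead of starting a new one.
            last.end = max(last.end, ev.end)
        else:
            merged.append(Event(ev.chrom, ev.start, ev.end, ev.event_type))
    return [ev for ev in merged if ev.end - ev.start >= MIN_EVENT_BP]
```

Filtering after merging matters: two short fragments of the same event can each fall below 0.5 Mbp yet pass the size threshold once joined.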
Each detected event was classified as mosaic duplication (gain), mosaic deletion (loss), or mosaic copy-neutral loss of heterozygosity. The mosaic proportion of abnormal cells was estimated (see Section 4.5, Proportion of Mosaicism). The magnitude of the BAF difference for single-copy duplication events was one-third of that for copy-neutral LOH or single-copy deletion events, which reduced the sensitivity for calling duplication events. For mosaic duplication events, only events with a proportion of abnormal cells <= 0.9 were kept, because above 0.9 it is difficult to distinguish reliably between mosaic and non-mosaic duplication. To view the characteristics of the mosaic events, they were plotted by proportion of abnormal cells and LRR using Microsoft Office Excel (Figure 9). Two circular genomic plots, one for bladder cancer cases and one for controls, each with three tracks of mosaic events for autosomes 1 to 22, were generated using the Circos software ( ) (Figure 10). Plots of the frequency of mosaic events by age and cancer status, for all subjects and for males only, were generated using Microsoft Office Excel (Figure 12). Logistic regression models were fit using the SAS software package to determine the relationship between individuals having mosaic event(s) and their age at DNA collection, gender, and cancer diagnosis.

6. Results and Discussion

6.1. Characteristics of Mosaic Events

We observed 193 mosaic segments larger than 0.5 Mb on autosomal chromosomes in 163 individuals, for an overall frequency of individuals with mosaicism of 5%. Of these, 118 mosaic events (61.14%) were from bladder cancer individuals and 75 mosaic events (38.86%) were from cancer-free controls. Mosaic autosomal abnormalities were more common in the bladder cancer individuals (98/1673 = 5.86%) than in cancer-free persons (65/1577 = 4.12%). The chromosome most frequently carrying events was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2 and 4 for control individuals (4.15%). With cases and controls combined, the chromosomes most frequently carrying events were chromosomes 2, 10, and 17 (8.29%) (Table 5), which may imply instability of these three chromosomes. The most frequent type of event observed was mosaic duplication (55.96%), whereas mosaic deletion and mosaic CNLOH constituted 12.44% and % of mosaic events, respectively (Table 6). The segment size was largest for CNLOH and smallest for mosaic duplication. Median lengths were 0.82 Mb for mosaic duplications, 2.32 Mb for mosaic deletions, and Mb for mosaic CNLOHs. The abnormal cell proportions were between 20.88% and 89.86% for mosaic duplication, between 25.45% and 96.64% for mosaic deletion, and between 3.84% and 95% for mosaic CNLOH (Figure 9).

Table 5 Frequency of Mosaic Chromosomal Events by Chromosome and Case-Control Status.
Columns: Chromosome; Mosaic Chromosome Count (Bladder Cancer, Control, Total); Mosaic Chromosome Frequency (%) (Bladder Cancer, Control, Total).

Table 6 Frequency of Mosaic Chromosomal Events by Event Type and Location.
Columns: Event location; Mosaic chromosome count (Gain, Loss, CN LOH, Total); Mosaic chromosome frequency (%) (Gain, Loss, CN LOH, Total).
Rows: Chromosome; Telomeric p; Telomeric q; Interstitial; Telomeric (p + q); Total.

Figure 9 Characteristics of mosaic events. Mosaic events plotted by proportion of abnormal cells (P) and LRR for 193 events in 163 individuals. A blue dot represents the P and LRR values of a mosaic duplication event, a green dot a mosaic CNLOH event, and a red dot a mosaic deletion event.

Of the mosaic chromosomal events detected by the GADA and BAF Segmentation methods, 4.66% spanned the entire chromosome. These included 4 whole-chromosome mosaic trisomy events on chromosomes 8, 12, 18, and 21, 3 of which were carried by one subject, and 5 whole-chromosome mosaic CNLOH events on chromosomes 6, 9, 18, and 19 (2 events), carried by 4 different subjects. No whole-chromosome mosaic deletion event was detected (Figure 10). We found that 9.33% of mosaic chromosomal events began at the p-arm telomere and 15.03% ended at the q-arm telomere. Most mosaic chromosomal events were interstitial (70.98%), spanning no telomere. The majority of telomeric events (p + q) were mosaic copy-neutral LOH (27 / 47 = 57.45%), followed by mosaic duplication (17 / 47 = 36.17%). The majority of interstitial events were mosaic duplication (87 / 137 = 63.5%), followed by mosaic CNLOH (29 / 137 = 21%) (Table 6).

Sixteen individuals (9 bladder cancer cases and 7 controls) had mosaic events on at least two chromosomes. Among control individuals, the greatest number of mosaic chromosomal events observed in a single subject was 4, in a subject from the ATBC cohort study; the events covered the whole of chromosomes 12, 18, and 19 and the entire q arm of chromosome 21, and all of them were mosaic duplications (Figure 11). Among bladder cancer individuals, the greatest number of mosaic chromosomal events observed in a single subject was 11, in a subject from the PLCO cohort study; the events were located on 9 different chromosomes, including 2 events each on chromosomes 12 and 13, and all of them were mosaic copy-neutral LOH with a very high degree of mosaicism.

Figure 10 Circular plots displaying the genomic location of mosaic events. The outer rings are autosomes 1 to 22. The yellow track shows mosaic copy-neutral LOH events, the blue track mosaic duplication events, and the red track mosaic deletion events. The left plot shows events detected in bladder cancer subjects. The right plot shows events detected in cancer-free controls.

Figure 11 Mosaic duplication across the entire chromosomes 12, 18, and 19 and the entire q arm of chromosome 21 in a control individual. It is characterized by increased Log R ratio (mean of LRR within the segment > 0) and abnormal heterozygous BAF.

6.2. Mosaic Chromosomal Abnormalities and Age at DNA

We examined the effect of increasing age on the frequency of mosaic events across the three cohort studies, which predominantly included individuals over the age of 60. Twenty-seven individuals had missing ages. For the remaining 3212 subjects, the frequency of control individuals with mosaic events increased with age, from 3.21% for those under 60 to 4.53% for those between the ages of to 5.96% for those between the ages of 76 and 89 years (P = 0.11). The frequency of bladder cancer individuals with mosaic events was almost constant with age, from 6.95% for those under 60 to 7.24% for those between the ages of 66-70, then to 6.14% for those between the ages of 76 and 89 (P = 0.32). The frequency of mosaic events was higher in bladder cancer individuals than in control individuals in the first four age groups. However, the frequency of mosaic events was very similar for bladder cancer individuals and control individuals between the ages of 76 and 89 (Figure 12 Top).

The frequency of male control individuals with mosaic events increased with age, from 3.39% for those under 60 to 4.41% for those between the ages of to 7.51% for those between the ages of 76 and 89 years (P = 0.035). The frequency of male bladder cancer individuals with mosaic events was almost constant with age, from 7.75% for those under 60 to 6.81% for those between the ages of 66-70, then to 6.81% for those between the ages of 76 and 89 (P = 0.26). The frequency of mosaic events was higher in male bladder cancer individuals than in male control individuals in the first four age groups. However, the frequency of mosaic events was lower for male bladder cancer individuals than for male control individuals between the ages of 76 and 89 (Figure 12 Bottom). For females, there were no mosaic events among those under age 60 and very few events in the other four age categories, so no reliable summary can be given for each age category (Table 7).

Figure 12 Frequency of mosaic events by age and cancer status for all individuals (Top) and males only (Bottom).

Table 7 Mosaic Event Counts by Five Age Categories and Cancer Status.
Columns: AGE; then, for each of Male + Female, Male, and Female: Bladder Cancer (Yes, No) and Control (Yes, No).
Rows: one per age category, followed by Total.

Mosaic Chromosomal Abnormalities and Gender

The effect of gender on the frequency of mosaic events across the three cohort studies was also examined separately for bladder cancer cases and controls. Among bladder cancer cases, mosaic events were more frequent in males (6.13%) than in females (4.22%). Among controls, the frequency of mosaic events was almost the same for males (4.08%) and females (4.48%). Logistic regression models of mosaic status on gender, adjusting for age, were fit to the data for (1) controls and (2) cases. For controls, the OR = with P-value = . For cases, the OR = with P-value = . A logistic regression model of mosaic status on gender, adjusting for age and cancer status, was also fit to all subjects and gave OR = with P-value = . We therefore did not observe a significant gender effect in any of the three models (Table 8).

Table 8 Frequency of Mosaic Events by Gender and Cancer Status.
Columns: Mosaic event frequency (%) (Male, Female); Adjusted Logistic Model (OR, 95% CI, P-value).
Rows: Bladder Cancer; Control; Overall.

Mosaic Chromosomal Abnormalities and Cancer Risk

Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer, several logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2) cancer status + gender; (3) cancer status + age; (4) cancer status + age + gender as predictors, to test for partial independence of cancer status and mosaic status, controlling for gender and/or age. For each model, cancer status = 1 if bladder cancer and = 0 if control; gender = 1 if male and = 0 if female; and age was treated as a continuous variable.
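For model (1), whose only predictor is the binary cancer status, the fitted odds ratio coincides with the crude odds ratio of the 2x2 table of mosaic status by cancer status. The sketch below computes that crude odds ratio with a Wald confidence interval on the log scale. It is an illustration only: the function name is hypothetical, the counts are the ones quoted in Section 6.1 (98 of 1673 cases and 65 of 1577 controls with mosaic events), and the result is not claimed to reproduce the SAS output in Table 9.1.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int,
                    z: float = 1.959964) -> Tuple[float, float, float]:
    """Crude odds ratio and Wald CI from a 2x2 table.

    a: cases with a mosaic event      b: cases without
    c: controls with a mosaic event   d: controls without
    z: standard normal quantile (default gives a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts as quoted in Section 6.1 (assumed to be the analysis denominators).
cases_with, cases_total = 98, 1673
controls_with, controls_total = 65, 1577
crude_or, ci_low, ci_high = odds_ratio_wald(
    cases_with, cases_total - cases_with,
    controls_with, controls_total - controls_with,
)
```

Models (2) to (4) adjust for gender and age, so their odds ratios require an actual logistic fit and cannot be read off a single 2x2 table.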
There is modest evidence of a positive relationship between mosaic events and bladder cancer: OR = 1.44, 95% CI = ; P = 0.027 for model (1); OR = 1.43, 95% CI = ; P = 0.029 for model (2); OR = 1.45, 95% CI = 2.00; P = 0.025 for model (3); and OR = 1.44, 95% CI = ; P = 0.026 for model (4). All four models show very similar P-values for cancer status. There is no significant evidence of a gender effect or an age effect in the models that include gender and/or age as predictors (Table 9.1). All four main-effect models fit the data adequately. The P-value = from Pearson's goodness-of-fit test for model (2), and the P-values = and from the Hosmer-Lemeshow goodness-of-fit test for models (3) and (4). Age is a continuous variable, which causes models (3) and (4) to have a very large number of unique profiles (Table 9.2). In this situation the data are too sparse for the Pearson and deviance goodness-of-fit tests, and the Hosmer-Lemeshow goodness-of-fit test is more suitable because it is designed for a large number of settings of the predictors. All four models have very similar AIC. The difference in -2 Log L between model (1) and any of the other three models is < 3.84 = $\chi^2_{1,\,0.05}$, so none of the more complex models significantly improves upon the simplest model, model (1). However, we are interested not only in whether there is evidence of a cancer status effect but also in the age and gender effects on mosaic status, so model (4) (Table 9.3) gives more meaningful and interpretable results for answering our scientifically as well as statistically important questions.

To test for equality of the odds ratios between cancer status and mosaic status across ages, we added an interaction term, using cancer_status + age + gender + cancer_status*age as predictors. There is borderline evidence that the odds ratios for bladder cancer and mosaic events differ across ages (P = 0.054). It is therefore reasonable to assume that the log(OR) between the cancer status levels at a given age x is close to a linear function of x. This model fits the data very well and has the smallest AIC value among all tested models (Table 9.4).
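The nested-model comparison described above (the difference in -2 Log L set against the chi-square critical value 3.84) can be sketched without any statistics library, because the upper tail of a chi-square distribution with 1 degree of freedom has a closed form in terms of the complementary error function. This is a minimal sketch for a single added parameter; the function names are hypothetical, and the -2 Log L values in the example are placeholders rather than values from Table 9.2.

```python
import math
from typing import Tuple


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


def lr_test_1df(neg2logl_reduced: float, neg2logl_full: float) -> Tuple[float, float]:
    """Likelihood-ratio statistic and P-value when the full model adds one parameter."""
    statistic = neg2logl_reduced - neg2logl_full
    return statistic, chi2_sf_1df(statistic)


# Placeholder -2 Log L values, for illustration only.
statistic, p_value = lr_test_1df(1290.0, 1288.5)
improves_fit = statistic > 3.84  # critical value at the 0.05 level, 1 df
```

When a comparison adds more than one parameter, the degrees of freedom and the critical value change accordingly, so this one-parameter shortcut no longer applies as written.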
We also examined whether DNA source and study were significant predictors of mosaic status, using a univariate analysis for each nominal variable. Notably, DNA source (58.6% from blood, 41.4% from buccal samples) was not a significant predictor (P-value = 0.578). The study (ATBC, PLCO, CPS) was also not a significant predictor (P-value = 0.377).

Table 9.1 Summary Logistic Regression Models on All Subjects. Setting: presence of mosaic event = 1; case vs control; male vs female.
Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); GENDER (Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq).
mosaic_event=status+error: OR 1.44 ( )
mosaic_event=status+gender+error: OR 1.43 ( )
mosaic_event=status+age+error: OR 1.45 ( )
mosaic_event=status+gender+age+error: OR 1.44 ( )
mosaic_event=status+age+gender+status*age+error

Table 9.2 Summary Goodness-of-Fit Results for Logistic Regression Models on All Subjects. Setting: presence of mosaic event = 1; case vs control; male vs female.
Columns: Model; -2 Log L; AIC; Pearson GOF (Pr > ChiSq); HL GOF (Pr > ChiSq).
Rows: mosaic_event=status+error; mosaic_event=status+gender+error; mosaic_event=status+age+error; mosaic_event=status+gender+age+error; mosaic_event=status+age+gender+status*age+error.

Table 9.3 SAS Output From Logistic Regression Main Effect Model on All Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept (Pr > ChiSq <.0001); AGE_DNA; SEX_Code; STATUS_Code.

Table 9.4 SAS Output From Logistic Regression Model With Interaction Term on All Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept (Pr > ChiSq <.0001); STATUS_Code; AGE_DNA; SEX_Code; STATUS_Code*AGE_DNA.

To reduce the number of predictor profiles when age was in the model, we divided age into five age groups: <=60, 61-65, 66-70, 71-75, and >=76. We then fit a logistic regression of mosaic status on cancer status, gender, and age group to test for partial independence of mosaic status and cancer status, controlling for gender and the 5 age groups. Again there is modest evidence of a positive association between bladder cancer and mosaic status, with P-value = 0.025, which is only slightly smaller than the P-value obtained with age as a continuous variable (P-value = 0.026). There was no significant difference between any of the last four age groups and the first age group (Table 9.5).

Table 9.5 SAS Output From Logistic Regression Model With Age Split Into Five Groups on All Subjects.
Type 3 Analysis of Effects. Columns: Effect; DF; Wald Chi-Square; Pr > ChiSq. Rows: STATUS_Code; SEX_Code; AGE_GROUP.
Analysis of Maximum Likelihood Estimates. Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq. Rows: Intercept (Pr > ChiSq <.0001); STATUS_Code; SEX_Code; four AGE_GROUP levels, the last being >=76.
Odds Ratio Estimates. Columns: Effect; Point Estimate; 95% Wald Confidence Limits. Rows: STATUS_Code; SEX_Code; four AGE_GROUP contrasts, each vs the <=60 group, the last being >=76.

Mosaic Chromosomal Abnormalities and Cancer Risk for Men Only

Bladder cancer is the fourth most common cancer diagnosed in men. Men are about 3 to 4 times more likely than women to get bladder cancer during their lifetime. Overall, the chance that a man will develop this cancer during his life is about 1 in 26. For women, the chance is about 1 in 90 [19].

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in males only, two logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2) cancer status + age as predictors, to test for partial independence of mosaic status and cancer status, controlling for age. For each model, cancer status = 1 if bladder cancer and = 0 if control, and age was treated as a continuous variable. There is modest evidence of a positive relationship between mosaic events and bladder cancer risk for male individuals: OR = 1.53, 95% CI = ; P = 0.017 for model (1) and OR = 1.55, 95% CI = ; P = 0.014 for model (2) (Table 10.1). There is no significant evidence of an age effect in model (2) (Table 10.2). To test for equality of the odds ratios between mosaic events and bladder cancer across ages for male individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is significant evidence of unequal odds ratios between mosaic events and bladder cancer across ages (P = 0.017). This model fits the data very well and has the smallest AIC value among all tested models (Table 10.3).

Table 10.1 Summary Logistic Regression Models on Male Subjects. Setting: presence of mosaic event = 1; case vs control.
Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq); HL GOF (Pr > ChiSq).
Mosaic_event=status+error: OR 1.53 ( )
Mosaic_event=status+age+error: OR 1.55 ( )
Mosaic_event=status+age+status*age+error

Table 10.2 SAS Output From Logistic Regression Main Effect Model on Male Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept (Pr > ChiSq <.0001); STATUS_Code; AGE_DNA.

Table 10.3 SAS Output From Logistic Regression Model With Interaction Term on Male Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept (Pr > ChiSq <.0001); STATUS_Code; AGE_DNA; STATUS_Code*AGE_DNA.

Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in females only, two logistic regression models were fit to the data with the mosaic event as the response variable and (1) cancer status; (2) cancer status + age as predictors, to test for partial independence of mosaic event and cancer status, controlling for age. For each model, cancer status = 1 if bladder cancer and = 0 if control, and age was treated as a continuous variable. There is no significant evidence of an association between mosaic events and cancer risk for female individuals: OR = 0.94, 95% CI = ; P = 0.887 for model (1) and OR = 0.93, 95% CI = ; P = 0.865 for model (2) (Table 11.1). There is no significant evidence of an age effect in model (2) (Table 11.2). To test for equality of the odds ratios between mosaic events and cancer status across ages for female individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is no significant evidence of unequal odds ratios between mosaic events and bladder cancer across ages (P = 0.27) (Table 11.3).

Table 11.1 Summary Logistic Models on Female Subjects. Setting: presence of mosaic event = 1; case vs. control.
Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq); HL GOF (Pr > ChiSq).
mosaic_event=status+error: OR 0.94 ( )
mosaic_event=status+age+error: OR 0.93 ( )
mosaic_event=status+age+status*age+error

Table 11.2 SAS Output From Logistic Regression Main Effect Model on Female Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept; STATUS_Code; AGE_DNA.

Table 11.3 SAS Output From Logistic Regression Model With Interaction Term on Female Subjects. Analysis of Maximum Likelihood Estimates.
Columns: Parameter; DF; Estimate; Standard Error; Wald Chi-Square; Pr > ChiSq.
Rows: Intercept; STATUS_Code; AGE_DNA; STATUS_Code*AGE_DNA.

Conclusion and Further Studies

In this project, DNA from 3,239 individuals, comprising 1,673 bladder cancer cases and 1,566 cancer-free controls, was examined for evidence of mosaicism of the autosomes using Illumina genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. We observed 193 mosaic duplication, mosaic deletion, and mosaic copy-neutral loss of heterozygosity (CNLOH) events of size > 0.5 Mb in the autosomes of 163 study subjects (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were statistically significantly positively associated with bladder cancer in males (OR = 1.55; P = 0.014) but not in females. The frequency of mosaicism increased with age in male control subjects, from 3.39% in individuals under age 60 to 7.51% in those between 76 and 89 years old (P = 0.035). Mosaic autosomal abnormalities were more common in the bladder cancer individuals (5.86%) than in cancer-free persons (4.12%). Mosaic events were more frequent in males (6.13%) than in females (4.22%) among bladder cancer individuals, but similar in males (4.08%) and females (4.48%) among cancer-free persons. The most frequent class of autosomal abnormality detected was mosaic duplication, representing 55.96% of mosaic events. The autosome most frequently carrying mosaic events was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2 and 4 for control individuals (4.15%). With cases and controls combined, the chromosomes most frequently carrying mosaic events were chromosomes 2, 10, and 17 (8.29%).
Mosaicism in older cancer-free male individuals suggests that age-related genomic instability could be due to increased rates of somatic mutation or diminished capacity for genomic maintenance, such as with telomere attrition, leading to proliferation of somatically altered cell populations [20].

This project could be extended in several directions if additional information becomes available. For all subjects with mosaic events, it would be very interesting to assess the characteristics and behavior of mosaic events over time and to determine whether these individuals' number of mosaic events or proportion of mosaicism changes over time. In this project, we calculated the proportion of mosaicism for each mosaic event but did not go further. With additional data points collected relative to the time of diagnosis in cases, we could investigate the hypothesis of a positive association between the proportion of mosaicism and the severity of the cancer (early to later stage). Very recently, we accidentally identified one subject who participated in two studies (Non-Hodgkin's Lymphoma (NHL) and ovarian cancer). We had genotype data from two of her DNA samples, drawn at age 59 (NHL) and at age 63 (NHL + ovarian cancer). We performed the mosaic chromosomal abnormality analysis on both samples and found mosaic events on multiple chromosomes. At age 59, mosaic events were observed on chromosomes 3, 8, 10, 13, 20, and X (Figure 13 left). At age 63, mosaic events were observed on chromosomes 3, 4, 8, 9, 13, 20, and X (Figure 13 right). All of the mosaic events seen at age 59 were also observed at age 63, with an increased proportion of mosaicism, except the mosaic event on chromosome 10, which was observed at age 59 but not at age 63. The disappearance of this event may be the result of cancer treatment. Two new events were detected on chromosomes 4 and 9 at age 63, which may be due to later-stage NHL or may be specific to ovarian cancer.

Figure 13 Mosaic events observed on multiple chromosomes in a female subject with Non-Hodgkin's Lymphoma and ovarian cancer. The left figures are for the DNA sample drawn at age 59 (NHL). The right figures are for the DNA sample drawn at age 63 (NHL + ovarian cancer).

It may also be possible to determine the developmental origin (germline or somatic cells) of observed mosaic events if blood, tumor tissue, and normal tissue data are available. A germline mutation is one that is passed on to offspring. Examples of germline mutations linked to cancer are those that occur in cancer susceptibility genes, increasing a person's risk for the disease. Somatic mutations are not passed on to the next generation. By distinguishing the origin of a mutation, we may be able to discover cancer susceptibility genes such as the well-known BRCA1 and BRCA2 genes for breast cancer. We recently performed mosaic analysis on several TCGA (The Cancer Genome Atlas) SNP data sets. Figure 14 shows a possible germline deletion at chr7: for subject 1087, who had glioblastoma multiforme (GBM) and had blood and tumor samples taken at the same time. The same mutation was present in both the blood and the tumor sample, which means it is not a tumor-specific mutation. We also observed some mutations that were present in the blood sample but not in the tumor sample, which implies that the mutation in blood is a somatic mutation rather than a germline mutation (Figure 15).

Figure 14 Deletion at chromosome 7 ( ) detected in the blood sample (Top) and the primary solid tumor sample (Bottom) of one GBM subject from TCGA data. The deletion in the blood sample is very likely a germline mutation because the same event was present in both DNA sources.

Figure 15 Mosaic CNLOH of the whole chromosome 3 detected in the blood sample (Top) of ovarian cancer subject 1877 from TCGA data. The bottom figure shows the mutation detected in the primary solid tumor sample of the same subject. The mutation in the blood sample is a somatic mutation because the mutation types differ markedly between the two DNA sources.