Segmentation-Based Detection of Mosaic Chromosomal Abnormality in Bladder Cancer Cells Using Whole Genome SNP Arrays


Weiyin Zhou
Cancer Genomics Research Laboratory (CGR), Leidos Biomedical Research, Inc.
Texas A&M Master of Science Degree Candidate, Statistics
Texas A&M Department of Statistics, College Station, TX
February

Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Alan Dabney, for his excellent guidance, understanding, patience, and encouragement. I would also like to thank my other committee members, Dr. William B. Smith and Dr. Bruce Lowe, for taking time from their busy schedules to review my work, and Ms. Penny Jackson and Kim Ritchie for their procedural and technical support. My gratitude extends to my current employer, Leidos Biomedical Research, Inc., which paid the majority of my tuition under its education assistance program. I would like to thank my husband, Dr. Lisheng Cai, my son, Robert, and my daughter, Kimberly, for their love and support; they cheered me up throughout the entire process. Quite naturally, my parents laid the foundation for all this effort through years of caring and teaching.

Abstract

The purpose of this project is to investigate the relationship between mosaic chromosomal abnormality and bladder cancer. DNA from 3,239 individuals, comprising 1,673 bladder cancer cases and 1,566 cancer-free controls, was examined for evidence of mosaicism of the autosomes using genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. DNA samples were extracted from blood or buccal (mouth) samples and were genotyped on the Illumina Infinium HumanHap 610 Quad SNP array. Two segmentation-based methods were used to detect three types of mosaic events: mosaic duplication (gain), mosaic deletion (loss), and mosaic copy-neutral loss of heterozygosity (CNLOH). A total of 193 such events, each defined as being > 0.5 Mb in size, were observed in the autosomes of 163 individuals (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were more common in individuals with bladder cancer (5.86%) than in cancer-free individuals (4.12%). Mosaic chromosomal abnormalities were statistically significantly and positively associated with bladder cancer in males (OR = 1.55; P = 0.014). In cancer-free males, the frequency of mosaic chromosomal abnormalities increased with age, from 3.39% under 60 years to 7.51% between 76 and 89 years (P = 0.035).

Contents

1. Introduction
2. Data Description
3. Illumina SNP Genotyping and Normalization
4. Mosaic Chromosomal Abnormalities Detection Methods
   4.1. Log R Ratio and B Allele Frequency
   4.2. Re-estimate Log R Ratio and B Allele Frequency
   4.3. Segmentation Methods
   4.4. Mosaic Events Calling Methods
   4.5. Proportion of Mosaicism
   4.6. Examples of Mosaic Events
5. Data Analysis
Results and Discussion
   Characteristics of Mosaic Events
   Mosaic Chromosomal Abnormalities and Age at DNA
   Mosaic Chromosomal Abnormalities and Gender
   Mosaic Chromosomal Abnormalities and Cancer Risk
      Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects
      Mosaic Chromosomal Abnormalities and Cancer Risk for Male Only
      Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only
Conclusions and Further Studies
References
Appendix A: SAS Code
Appendix B: Normalization Pipeline
Appendix C: GADA Segmentation Pipeline
Appendix D: BAF Segmentation Pipeline

1. Introduction

Genetic mosaicism is defined as the coexistence of cells with different genetic composition within an individual, despite that individual being the product of a single fertilization. It is caused by a postzygotic event during development and can occur in both somatic cells (affecting only non-reproductive cells) and germline cells (with the potential of being passed on to any offspring). Mosaicism can be caused by DNA mutations, epigenetic alterations of DNA, chromosomal abnormalities, and the spontaneous reversion of inherited mutations [1,2]. Somatic mosaicism has been established as a cause of mental retardation, birth defects, spontaneous abortion, and cancer [3-7]. The unequal distribution of DNA to daughter cells upon mitosis (chromosome instability) may lead to aneuploidy, the duplication or deletion of chromosomes or segments of chromosomes, and reciprocal duplication and deletion events that appear as copy-neutral loss of heterozygosity or acquired uniparental disomy. Mosaic chromosomal abnormalities have been defined as the presence of both cells with normal karyotypes and cells with large structural genomic events resulting in alteration of copy number or loss of heterozygosity, in distinct and detectable subpopulations of cells [8].

The development of microarray technology has had a significant impact on the genetic analysis of human disease. Whole-genome single nucleotide polymorphism (SNP) genotyping arrays have become an important tool for discovering variants that contribute to human diseases and phenotypes. The two most common applications of this technology are genome-wide association studies (GWAS) and copy number variant (CNV) analysis. SNP arrays let researchers genotype samples with hundreds of thousands to millions of markers, providing dense genome-wide coverage with up-to-date content for both association testing and copy number detection.
Data from genome-wide association studies have been used to test for association between single SNPs and disease status. They also provide an opportunity to detect chromosomal variation and to investigate the association of mosaicism with disease status. In this project, the SNP microarray data generated on the Illumina Infinium HumanHap 610 Quad SNP array for the bladder cancer GWAS were subsequently used to uncover mosaic genomic copy number gains, losses, and copy-neutral loss of heterozygosity in the autosomes of 5% of subjects. Two segmentation-based algorithms were used to detect 193 mosaic events of > 0.5 Mb in size. The types of chromosomal abnormalities detected were characterized, and the relationship between chromosomal abnormalities and cancer risk, age at DNA collection, and gender was investigated. The frequency of mosaic chromosomal abnormalities was positively associated with bladder cancer in male subjects, and it increased with age in cancer-free individuals.

2. Data Sets Description

The data used for this project consist of 3,239 individuals from bladder cancer genome-wide association studies (GWASs) drawn from three cohorts: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), and the Cancer Prevention Study-II (CPS-II). The study cohorts are summarized by case and control status in Table 1. There were 1,673 cases that had been diagnosed with urothelial cell carcinoma of the bladder and 1,566 controls that were cancer-free. There were 2,734 males and 505 females; gender is summarized by case and control status in Table 2. The mean age at DNA withdrawal is 67 years for all subjects (range years, s.d. = 8.83). The mean age is 67 years for males (range 21-89, s.d. = 8.94) and 69 years for females (range s.d. = 7.89). The mean age is 66 years for cases (range 21-87, s.d. = 10.2) and 68 years for controls (range s.d. =

6.99). The study was approved by the institutional ethics committees of each participating hospital and the institutional review board (IRB) of the National Cancer Institute (NCI, USA). Written informed consent was obtained from all individuals.

DNA was extracted from peripheral blood (58.6%) and mouthwash samples (41.4%). Genomic DNA was screened and analyzed at the National Cancer Institute according to the sample handling process of the Cancer Genomics Research Laboratory (CGR), Division of Cancer Epidemiology and Genetics (DCEG), before being genotyped on the HumanHap 610 Quad BeadChip (Illumina, Inc.) via the Infinium assay. Overall, 3.3% of samples were run in duplicate to check reproducibility; the SNP genotype call concordance rate between technical duplicates was greater than 99.98%. The completion rate, defined as the proportion of non-missing genotypes for a sample, was calculated by dividing the number of SNP probes with called genotypes by the total number of SNP probes on the array, using the GLU qc.summary module ( ). The overall completion rate for the study samples is 97.87%. The distribution of the completion rates by sample and by locus is shown in Figure 1. A brief summary of the sample and locus counts at the 100th, 99th, 95th, 90th, and 50th percentiles is provided as an inset in Figure 1.

Ancestry was estimated for the 3,239 study subjects using a set of population-informative SNPs [9] and data from HapMap build 27. The SNPs used are common to the commercially available Affymetrix 500K and Illumina 317K and 550K chips. Admixture coefficients were estimated for each subject using the GLU struct.admix module, with the HapMap CEU, YRI, and ASA (JPT+CHB) samples as the fixed reference populations. A total of 3,205 subjects were inferred to have European ancestry. The remaining 34 subjects had less than 80% European ancestry, as shown in Figure 2 and summarized in Table 3.
Table 1 Summary of Study Cohorts by Cancer Status
Columns: Cancer Status, ATBC, PLCO, CPSII, Total
Rows: Bladder Cancer, Control, Total

Table 2 Summary of Gender by Cancer Status
Columns: Cancer Status, Male, Female, Total
Rows: Bladder Cancer, Control, Total

Table 3 Summary of Population Structure by Cancer Status
Columns: Study Cohort, Cancer Status, Imputed Ancestry (CEU; ADMIXED CEU; ASA; ASA,CEU; CEU,YRI; YRI), Total
Rows: ATBC (Bladder Cancer, Control), PLCO (Bladder Cancer, Control), CPS (Bladder Cancer, Control), Total

Figure 1 Bladder completion rate by sample (left) and by locus (right).

Figure 2 Population structure.

3. Illumina SNP Genotyping and Normalization

SNP genotyping is the measurement of genetic variation at single nucleotide polymorphisms (SNPs) between members of a species. SNPs are one of the most common types of genetic variation. A SNP is a single base pair mutation at a specific site in DNA, usually consisting of two alleles that make up the individual's genotype. Illumina DNA Analysis BeadChips using the Infinium assay give researchers genome-wide access for analyzing genetic variation. The Infinium assay uses two color channels, so the data consist of two intensity values (X, Y) for each SNP. There is one intensity channel for each of the two fluorescent dyes associated with the two alleles of the SNP. The alleles measured by the X channel (Cy5 dye) are called the A allele, whereas the alleles measured by the Y channel (Cy3 dye) are called the B allele. Each SNP is analyzed independently to identify genotypes.

Illumina's standard normalization algorithm is implemented as the first step in SNP genotyping data analysis. The intensity data are normalized using Illumina's self-normalization algorithm, which draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. Normalized values are then used for standard genotype calling and for the analysis of loss of heterozygosity (LOH) and copy number (CN). In a diploid genome without CNVs, the three possible genotype calls are AA, AB, and BB. The raw signal intensity values measured for the A and B alleles are subjected to Illumina's standard five-step normalization procedure, which determines six parameters: offset_x, offset_y, theta, shear, scale_x, and scale_y. The normalization algorithm is designed to adjust for nominal intensity variations observed in the two color channels, background differences between the two channels, possible crosstalk between the dyes, and global intensity differences, and to scale the data [10].
Figure 3 depicts the five steps of the normalization process.

Step 1: Outlier removal (Figure 3-A). Outlier SNPs are removed from consideration during normalization parameter estimation. They are not excluded from downstream analysis.
Step 2: Background estimation (offset_x, offset_y) (Figure 3-B). Candidate homozygote A alleles are identified along the X-axis and candidate homozygote B alleles along the Y-axis. A straight line is fit to each of the two groups. The offset_x and offset_y parameters are the intercepts of these two lines. The points are then corrected for translation.
Step 3: Rotational estimation (theta) (Figure 3-C). A set of control points is identified along the X-axis, and a straight line is fit to the control points. The theta parameter is the angle between this line and the X-axis and defines the amount of rotation in the data. The points are then corrected for rotation.
Step 4: Shear estimation (shear) (Figure 3-D). A set of control points is identified along the Y-axis, and a straight line is fit to the control points. The shear parameter is the angle between this line and the Y-axis. The points are then corrected for shear.
Step 5: Scaling estimation (scale_x, scale_y) (Figure 3-E). A statistical method is used to determine the scale_x and scale_y parameters.

Figure 3-F shows the final set of normalized data points. Points along the X-axis represent AA genotypes, points along the Y-axis represent BB genotypes, and points along the 45-degree line represent AB genotypes.

Illumina then uses these six estimated parameter values to convert raw coordinates (X_raw and Y_raw) to normalized coordinates (X_norm and Y_norm) for each SNP, representing the experiment-wide normalized signal intensity on the A and B alleles, respectively.

Figure 3 Five-step normalization procedure. Figure 3-F shows the final normalized data points for a particular SNP. The points on the X-axis represent AA genotypes. The points along the Y-axis represent BB genotypes. The points along approximately 45 degrees are AB genotypes.

To visualize the data after normalization, the genotyping data are transformed from Cartesian coordinates (Figure 4, left) to polar coordinates (Figure 4, right). The Cartesian plot uses the X axis to represent the intensity of the A allele and the Y axis to represent the intensity of the B allele. The polar plot uses the X axis to represent normalized theta (the angle deviation from pure A signal, where 0 represents pure A signal and 1.0 represents pure B signal) and the Y axis to represent R, the distance of the point from the origin. Theta and R are calculated by the equations:

theta = (2/pi) * arctan(Y_norm / X_norm)
R = X_norm + Y_norm

where X_norm and Y_norm represent the transformed normalized signals from alleles A and B for a particular locus (SNP).
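The transformation above can be sketched in a few lines of Python. This is only an illustration of the two equations; the function name `to_polar` is hypothetical and is not part of GenomeStudio or GLU.

```python
import math


def to_polar(x_norm, y_norm):
    """Convert normalized A/B intensities to (theta, R) for one SNP.

    theta = (2/pi) * arctan(y_norm / x_norm), so 0 is pure A and 1 is pure B.
    R     = x_norm + y_norm, the total normalized intensity.
    """
    # atan2 equals arctan(y/x) for non-negative intensities and
    # also handles x_norm == 0 without a division error.
    theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    r = x_norm + y_norm
    return theta, r


if __name__ == "__main__":
    # A balanced heterozygote sits near theta = 0.5.
    print(to_polar(0.5, 0.5))
```

Using `atan2` rather than a plain division is a design choice for robustness at homozygous BB points, where X_norm can be zero.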

Figure 4 SNP graphs: Cartesian coordinates (left) and polar coordinates (right). Each graph displays all samples for the currently selected SNP. Samples are colored according to their genotype. In the right graph, for this particular SNP (ID = rs ), 420 samples (red cluster) are called AA, 124 samples (purple cluster) are called AB, and 13 samples (blue cluster) are called BB.

4. Mosaic Chromosomal Abnormalities Detection Methods

4.1. Log R Ratio and B Allele Frequency

The main goal of this project is to investigate the relationship between mosaic chromosomal abnormalities and bladder cancer risk by first identifying regions of the genome that are aberrant in copy number, more specifically, mosaic copy number variation on the autosomes in bladder cancer and cancer-free subjects. The detection of autosomal mosaic events was based on assessment of allelic imbalance and copy number changes. The chromosomal abnormalities were detected using two outputs of the Infinium high-density assay: the log R ratio (LRR) and the B allele frequency (BAF). LRR and BAF were originally developed for the Illumina platform. For Illumina SNP arrays, the LRR and BAF values can be calculated and exported directly from Illumina's GenomeStudio software.

The log R ratio (LRR) is the normalized measure of total signal intensity and provides data on relative copy number. For each SNP, let the normalized signal intensities for the A and B alleles be denoted X_norm and Y_norm, respectively. The R value is then calculated as R_observed = X_norm + Y_norm, a normalized measure of total signal intensity. The log R ratio is calculated as LRR = log2(R_observed / R_expected), where R_expected is computed by linear interpolation of the genotype clusters (Figure 5, left). The three cluster positions are generated from a large set of samples that passed the completion rate cutoff.
The LRR value for a SNP is a measure of the difference between the signal intensity of the test sample and that of a pool of reference samples with the same SNP genotype. Since LRR is the log ratio of observed to expected probe intensity, deviation from zero is evidence of copy number change.

The B allele frequency (BAF), derived from the ratio of allelic probe intensities, is the proportion of hybridized sample that carries the B allele as designated by the Infinium assay. The B allele frequency is also referred to as copy angle or allelic composition. It shows the relative presence of each of the two alternative nucleotides A and B at

each SNP locus profiled. The BAF for a sample is the theta value for a SNP corrected for cluster position, where θ = (2/pi) * arctan(Y_norm / X_norm). The BAF value is calculated from θ and the cluster positions θ_AA, θ_AB, and θ_BB, which are the θ values for the three genotype clusters generated from a large set of samples that passed the completion rate cutoff (Figure 5, right). In the right figure, D1 = (θ - θ_AB) and D2 = (θ_AB - θ_BB). In a normal sample, discrete BAFs of 0.0, 0.5, and 1.0 are expected at each locus, representing the AA, AB, and BB genotypes. Deviations from this expectation are indicative of aberrant copy number. For example, if a locus has a BAF of 1/3, this might indicate that 1 copy of the B allele and 2 copies of the A allele are present in the sample, because 1/(1+2) = 1/3. Analyzing both the LRR and BAF metrics provides strong resolution for detecting true copy number changes and allelic imbalance (Table 4).

Figure 5 Log R Ratios (LRR) and Allelic Intensity Ratio (BAF).

Table 4 Summary of Copy Numbers, Genotypes, Expected LRR, and Expected BAF.

Total Copy Numbers CNV | Genotypes | Expected LRR | Expected BAFs
Deletion of Two Copies | Null | < 0 | N/A
Deletion of One Copy | A, B | < 0 | 0, 1
Normal Copy | AA, AB, BB | 0 | 0, 0.5, 1
Copy-Neutral LOH | AA, BB | 0 | 0, 1
Single Copy Duplication (Trisomy) | AAA, AAB, ABB, BBB | > 0 | 0, 1/3, 2/3, 1
Double Copy Duplication | AAAA, AAAB, AABB, ABBB, BBBB | > 0 | 0, 1/4, 2/4, 3/4, 1
Mosaic Deletion | mixed (AA, AB, BB) and (A, B) | < 0 | 4 BAF bands
Mosaic Copy-Neutral LOH | mixed (AA, AB, BB) and (AA, BB) | 0 | 4 BAF bands
Mosaic Duplication | mixed (AA, AB, BB) and (AAA, AAB, ABB, BBB) | > 0 | 0, > 1/3, < 2/3, 1
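As an illustration of how θ and R can be turned into BAF and LRR, the sketch below assumes piecewise-linear interpolation between the three genotype cluster centers and clamps values that fall outside the AA and BB clusters. The function names and the boundary handling are assumptions made for this example, not GenomeStudio's implementation.

```python
import math


def _lerp(theta, t0, v0, t1, v1):
    """Linear interpolation of a value between two cluster centers."""
    return v0 + (v1 - v0) * (theta - t0) / (t1 - t0)


def baf_from_theta(theta, t_aa, t_ab, t_bb):
    """Map theta to BAF so that the AA, AB, BB centers give 0, 0.5, 1."""
    if theta <= t_aa:          # assumption: clamp below the AA cluster
        return 0.0
    if theta >= t_bb:          # assumption: clamp above the BB cluster
        return 1.0
    if theta < t_ab:
        return _lerp(theta, t_aa, 0.0, t_ab, 0.5)
    return _lerp(theta, t_ab, 0.5, t_bb, 1.0)


def expected_r(theta, aa, ab, bb):
    """Interpolate R_expected; each cluster is a (theta_center, r_center) pair."""
    if theta <= aa[0]:
        return aa[1]
    if theta >= bb[0]:
        return bb[1]
    if theta < ab[0]:
        return _lerp(theta, aa[0], aa[1], ab[0], ab[1])
    return _lerp(theta, ab[0], ab[1], bb[0], bb[1])


def log_r_ratio(r_observed, r_expected):
    """LRR = log2(R_observed / R_expected)."""
    return math.log2(r_observed / r_expected)


if __name__ == "__main__":
    # Hypothetical cluster centers for one SNP.
    aa, ab, bb = (0.05, 1.0), (0.50, 1.2), (0.95, 1.0)
    theta, r_obs = 0.50, 1.8
    print(baf_from_theta(theta, aa[0], ab[0], bb[0]))
    print(log_r_ratio(r_obs, expected_r(theta, aa, ab, bb)))
```

With these hypothetical centers, a heterozygous SNP whose total intensity is 1.5 times the expected value yields a positive LRR, in line with the duplication rows of Table 4.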

Genetic mosaicism is the presence of cells within an organism that have a different genetic composition despite being the product of a single fertilization event. For this project, three mosaic types were investigated: mosaic deletion (loss), mosaic copy-neutral LOH, and mosaic duplication (gain), defined as follows:

Mosaic deletion is the coexistence of cells with normal copy number and cells with a deletion of one copy. It is characterized by LRR < 0 and two heterozygous BAF bands.
Mosaic copy-neutral LOH is the coexistence of cells with normal copy number and cells with copy-neutral LOH. It is characterized by LRR = 0 and two heterozygous BAF bands.
Mosaic duplication is the coexistence of cells with normal copy number and cells with a duplication of one copy. It is characterized by LRR > 0 and two heterozygous BAF bands between 1/3 and 2/3. Note that if the two heterozygous BAF bands are at 1/3 (AAB) and 2/3 (ABB), the event is a pure duplication of one copy (trisomy).

4.2. Re-estimate LRR and BAF

LRR and BAF were estimated by the GenomeStudio software. However, two sources of bias are not overcome by Illumina's five-step normalization method: dye bias and GC/CpG wave bias. Dye bias is an asymmetry in the detection of the two alleles for each SNP, caused by a bias between the two dyes used in the Infinium II assay that remains after Illumina's normalization. It can reduce precision in estimating copy number and allelic imbalance. GC/CpG waves can be present when incorrectly quantified DNA is used in the Infinium assay, or in regions of high or low GC content. The presence of GC/CpG waves creates artificial gains and losses in signal intensities for SNPs and may lead to spurious copy-number variation calls.
A four-step custom software pipeline was applied to the data exported from GenomeStudio, which contain the called genotype, genotype call quality score, genotype probe intensities (X_norm, Y_norm), log R ratio (LRR), and B allele frequency (BAF) for each assay, to normalize the data further and re-estimate LRR and BAF.

Step 1: Quantile normalization [11] was applied to the X_norm and Y_norm values generated by Illumina's GenomeStudio software, resulting in X_qnorm and Y_qnorm. This procedure removes dye bias and reduces the asymmetry in the detection of the two alleles for each SNP, which influences both allelic proportions and copy number estimates.
Step 2: Genotype-specific cluster centers (AA, AB, BB) were re-estimated for each SNP using X_qnorm and Y_qnorm values from assays with completion rate and genotype quality score greater than predefined thresholds, so that only SNPs from high-quality samples were used to generate each cluster position.
Step 3: A GC/CpG wave correction model was applied to each genotyped sample to obtain the GC/CpG-corrected allelic composition theta = (2/pi) * arctan(Y_qnorm / X_qnorm) and the total intensity R, which was estimated as a linear combination of (X_qnorm, Y_qnorm, GC content in probes) [12]. GC/CpG correction reduces the wavy patterns of signal intensities and improves the accuracy of copy-number variation detection.
Step 4: Finally, LRR and BAF were recomputed using the resulting quantile-normalized and GC/CpG-corrected values, as described in [13].
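Step 1 can be illustrated with a generic two-channel quantile normalization. The sketch below forces the X and Y intensities of one sample onto a common distribution by replacing each value with the mean of the two channels at the same rank. It is an assumption that this matches the spirit of the procedure in [11]; details such as per-bead-pool stratification are not reproduced, and the function name is hypothetical.

```python
def quantile_normalize_pair(x, y):
    """Quantile-normalize two equally long intensity lists against each other.

    Returns (x_q, y_q), where both outputs share the same set of values
    (the rank-wise means) while each keeps its own original ordering.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")

    # Indices of each channel ordered by increasing intensity.
    x_order = sorted(range(n), key=lambda i: x[i])
    y_order = sorted(range(n), key=lambda i: y[i])

    x_q = [0.0] * n
    y_q = [0.0] * n
    for rank in range(n):
        i, j = x_order[rank], y_order[rank]
        target = (x[i] + y[j]) / 2.0   # common value for this rank
        x_q[i] = target
        y_q[j] = target
    return x_q, y_q


if __name__ == "__main__":
    # Toy example in which the Y channel is systematically brighter.
    x = [0.1, 0.9, 0.5]
    y = [0.3, 1.5, 0.2]
    print(quantile_normalize_pair(x, y))
```

After normalization the two channels have identical empirical distributions, which is what removes a global dye asymmetry.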

The reduction in variance of the LRR values after applying the four steps above is demonstrated in Figure 6 for one cluster group of Illumina HumanHap610 assays from this project.

Figure 6 Variance of log2 R ratio (LRR) before and after the normalization procedure for one cluster group of Illumina HumanHap610 assays.

The reduction in GC/CpG waves is evident in Figure 7.1 (a sample without any chromosomal abnormality) and Figure 7.2 (duplication), which plot the signal intensity patterns before and after wave adjustment for two samples from this project.

Figure 7.1 Pre-normalization (left) and post-normalization (right) for a subject without chromosomal abnormality. Each dot in the figure represents one SNP. Red dots represent B allele frequency (BAF, scale on the right side), while black dots show LRR values (LRR, scale on the left side). Three red bands represent BAF values for the AA, AB, and BB genotypes along the entire chromosome. One black band in the middle (overlapping the red AB BAF band) represents LRR values along the entire chromosome. There are wavy patterns with peaks and troughs in the LRR values across the entire chromosome 13 in the pre-normalized data.

Figure 7.2 Pre-normalization (left) and post-normalization (right) for a subject with duplication. There are wavy patterns with peaks and troughs in the LRR values across the entire chromosome 17 in the pre-normalized data.

4.3. Segmentation Methods

In this project, two open-source packages, Genomic Alteration Detection Analysis (GADA) [14] and BAF Segmentation [15], were applied to the same data set to detect breakpoints on each chromosome. GADA uses the Sparse Bayesian Learning (SBL) segmentation algorithm, and BAF Segmentation uses the Circular Binary Segmentation (CBS) algorithm [16]. The mosaic events detected by the two methods were then combined.

Two large mosaicism studies were conducted by two independent research groups, the Gene-Environment Association Studies consortium (GENEVA) and the Cancer Genomics Research Laboratory (CGR); the results from both groups were published in Nature Genetics in May 2012 [8,17]. Two lung cancer data sets, from the Environment and Genetics in Lung Cancer Etiology study (EAGLE) and the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), were used by both groups. The GENEVA group used the CBS algorithm and the CGR group used the SBL algorithm. The resulting mosaic events were then compared between the groups. The concordance rate was 75%; each group detected mosaic events that the other group missed. To minimize the false negative rate, this project used both segmentation algorithms to detect mosaic chromosomal abnormalities and then combined the results.

The main steps implemented in the Genomic Alteration Detection Analysis (GADA) software [14] are as follows. Breakpoint detection is based on the SBL (Sparse Bayesian Learning) algorithm. The method detects segments where the B deviation is different from 0. The B deviation is the deviation of the observed BAF value from the expected BAF value of 0.5 for heterozygous SNPs. Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o A Sparse Bayesian Learning (SBL) model was used to discover the most likely genomic locations and magnitudes for a CNV segment. The sparseness hyperparameter controls the SBL prior distribution, which is uninformative about the location and amplitude of the CNV breakpoints but imposes a penalty

on the number of CNV breakpoints. A higher value of the sparseness hyperparameter (aAlpha) implies that fewer breakpoints are expected a priori, which results in fewer true CNVs detected but also fewer false positives.
o Backward Elimination (BE) is used to rank the statistical significance of each breakpoint obtained from SBL and to sequentially remove the least significant breakpoints using two parameters, T and MinSegLen. The T argument is the critical value of the BE algorithm for the statistical score t_m associated with breakpoint m. Breakpoints with t_m lower than T are discarded. The score t_m is the difference between the sample averages of the probes falling on the left and right segments, divided by a pooled estimate of the standard error. T can be adjusted efficiently to control the False Discovery Rate (FDR). The MinSegLen argument indicates the number of consecutive probes (SNP markers) with a BAF deviation different from 0 that each CNV segment must contain. As T and MinSegLen increase, the number of CNV breakpoints decreases.

The main steps implemented in the BAF Segmentation software [15] are as follows. Breakpoint detection is based on the CBS (Circular Binary Segmentation) algorithm [16]. The method detects segments where mbaf is different from 0.5, since the expected BAF is 0.5 for the AB genotype in a diploid genome without CNVs. BAF data are reflected into mbaf along the 0.5 axis by the transformation mbaf = abs(baf - 0.5) + 0.5, where abs denotes the absolute value. Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o Convert BAF data to mbaf, so that homozygous SNPs (AA and BB) are positioned at 1 and heterozygous SNPs without CNVs are positioned at 0.5.
o Homozygous SNPs are uninformative for determination of the total copy number. Remove homozygous SNPs from the mbaf profile based on a fixed mbaf threshold: SNPs above the threshold are considered non-informative and removed.
o Triplet filtering is next applied to the mbaf-threshold-filtered data to further improve the removal. For each SNP, the sum of the absolute differences in mbaf between the investigated SNP and the preceding and succeeding SNPs is calculated and added to the SNP's distance from the 0.5 baseline. For a SNP with index i: triplet sum[i] = abs(mbaf[i - 1] - mbaf[i]) + abs(mbaf[i + 1] - mbaf[i]) + mbaf[i]
o Triplet sums are compared against a threshold. SNPs with triplet sums above the threshold are considered outliers and removed. The triplet filtering is designed to remove non-informative homozygous SNPs that, due to experimental noise, obtain mbaf values lower than the mbaf threshold.
o Apply the Circular Binary Segmentation model to the mbaf profiles after removal of non-informative homozygous SNPs to discover the most likely genomic locations and magnitudes for a CNV segment. The total number of breakpoints is controlled by alpha, the significance level for accepting change-points.

4.4. Mosaic Event Type Calling Method

Each event was assigned a copy-number state based on the median LRR value for the segment:

State = mosaic duplication (gain) if median(LRR) > 0.2 s_LRR
State = mosaic deletion (loss) if median(LRR) < -0.2 s_LRR
State = mosaic copy-neutral LOH otherwise

where s_LRR is the standard deviation of the segment LRR values.
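The mbaf preprocessing and the state-calling rule described above can be sketched as follows. This is a simplified illustration, not the BAF Segmentation source code: the function names and the threshold values in the demo (0.97 and 0.8) are hypothetical, keeping the first and last SNP in the triplet filter is an assumption, and the loss rule uses the threshold -0.2 * s_LRR.

```python
from statistics import median, stdev


def to_mbaf(baf):
    """Reflect BAF around 0.5: mbaf = abs(baf - 0.5) + 0.5."""
    return [abs(b - 0.5) + 0.5 for b in baf]


def threshold_filter(mbaf, mbaf_threshold):
    """Return indices of SNPs kept after removing likely homozygotes."""
    return [i for i, m in enumerate(mbaf) if m <= mbaf_threshold]


def triplet_filter(mbaf, triplet_threshold):
    """Return indices of SNPs whose triplet sum does not exceed the threshold."""
    kept = []
    last = len(mbaf) - 1
    for i, m in enumerate(mbaf):
        if i == 0 or i == last:
            kept.append(i)  # assumption: end points have no full triplet, keep them
            continue
        triplet_sum = abs(mbaf[i - 1] - m) + abs(mbaf[i + 1] - m) + m
        if triplet_sum <= triplet_threshold:
            kept.append(i)
    return kept


def call_state(segment_lrr):
    """Assign gain / loss / cnloh from the median and s.d. of segment LRR."""
    med = median(segment_lrr)
    s_lrr = stdev(segment_lrr)
    if med > 0.2 * s_lrr:
        return "gain"
    if med < -0.2 * s_lrr:
        return "loss"
    return "cnloh"


if __name__ == "__main__":
    baf = [0.02, 0.45, 0.55, 0.98, 0.52]
    mbaf = to_mbaf(baf)
    informative = [mbaf[i] for i in threshold_filter(mbaf, 0.97)]
    print(informative, triplet_filter(informative, 0.8))
    print(call_state([0.30, 0.35, 0.25, 0.32]))
```

Returning indices rather than values keeps the link to genomic positions, which a segmentation step needs afterwards.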

After each segmentation method was applied to the same data set, the output file contained the start and end of each detected segment, the chromosome, the median LRR, and the standard deviation of the LRR within the segment. For each sample, adjacent events were merged if their event types were identical and the distance between the segments was less than 1 Mb. After merging, events smaller than 0.5 Mb were excluded, as the false-positive rate increased rapidly for events of smaller size. Most of the false positives were due to noisy data (high LRR and BAF variance) or to non-mosaic CNVs detected as potentially mosaic.

4.5. Proportion of Mosaicism

For each segment identified by SBL/CBS, a Gaussian mixture model with 2-4 Gaussian components was fit to the normalized BAF values of the segment, and the best-fitting model was chosen using the Akaike information criterion (AIC). The 2-4 components represent 2-4 possible BAF bands. A two-component model (2 BAF bands, representing AA and BB or A and B) fits best for segments that have complete loss of heterozygosity, or copy-neutral LOH or loss with mosaic proportions of nearly 100%. A three-component model (3 BAF bands for AA, AB, and BB) should fit best for segments that are normal or have very low mosaic proportions. For segments where a two- or three-component model was chosen, mosaic proportions were assigned manually when a manual review of the combined LRR and BAF plot showed sufficient evidence of mosaicism. Segments where the four-component model was the best fit (4 BAF bands: AA/A, BB/B, AB/A, and AB/B for mosaic deletion; AA/AA, BB/BB, AB/AA, and AB/BB for mosaic CNLOH; AA/AAA, BB/BBB, AB/AAB, and AB/ABB for mosaic duplication; see the last three rows of Table 4) were assigned mosaic proportions based on the inferred state and the location of the estimated heterozygote BAF bands (mu_1, mu_2).
mu_1 and mu_2 are the means of the BAF values across the segment for each of the two heterozygote BAF bands. The mosaic proportions were calculated based on the inferred mosaic state and the location of the estimated heterozygote BAF bands (mu_1, mu_2) with formulas similar to [18]:

D = mu_1 - mu_2
Proportion of cells with a deletion = 2D / (1 + D)
Proportion of cells with a duplication = 2D / (1 - D)
Proportion of cells with copy-number-neutral loss of heterozygosity = D

4.6. Example of Mosaic Duplication, Deletion, and CNLOH

Figure 8a is the LRR and BAF plot for a normal sample. Figures 8b-g are LRR and BAF plots of six representative mosaic chromosomal abnormalities of different types selected from this project. The plots show the log R ratio (LRR) (black dots, scale on the left side) and B allele frequency (BAF) (red dots, scale on the right side) values along the entire chromosome carrying the rearrangements in the selected samples.
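The merging and size filter of Section 4.4 and the proportion formulas of Section 4.5 can be sketched together. In this illustration the event tuples, state labels, and function names are hypothetical; mu_upper and mu_lower stand for mu_1 and mu_2, with mu_1 taken as the upper heterozygote band so that D is positive.

```python
def merge_events(events, max_gap=1_000_000, min_size=500_000):
    """Merge same-type events closer than max_gap, then drop small events.

    events: list of (start_bp, end_bp, state) for one chromosome of one sample.
    """
    merged = []
    for start, end, state in sorted(events):
        if merged and merged[-1][2] == state and start - merged[-1][1] < max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), state)
        else:
            merged.append((start, end, state))
    # Events shorter than min_size (0.5 Mb) are excluded after merging.
    return [e for e in merged if e[1] - e[0] >= min_size]


def mosaic_proportion(mu_upper, mu_lower, state):
    """Proportion of abnormal cells from the split D between the het BAF bands."""
    d = mu_upper - mu_lower
    if state == "loss":
        return 2 * d / (1 + d)
    if state == "gain":
        return 2 * d / (1 - d)
    if state == "cnloh":
        return d
    raise ValueError(f"unknown state: {state}")


if __name__ == "__main__":
    events = [(0, 300_000, "gain"), (800_000, 1_200_000, "gain"),
              (5_000_000, 5_100_000, "loss")]
    print(merge_events(events))
    # Half of the cells carrying a one-copy deletion puts the het bands at 1/3 and 2/3.
    print(mosaic_proportion(2 / 3, 1 / 3, "loss"))
```

As a check on the deletion formula: if a fraction p of cells has lost one copy, the heterozygote bands sit at 1/(2 - p) and (1 - p)/(2 - p), so D = p/(2 - p) and p = 2D/(1 + D).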

Figure 8a Example of one subject with normal copy number for chromosome 13. Each dot in the figure represents one SNP. Red dots represent B allele frequency (BAF, scale on the right side), while black dots show log R ratio values (LRR, scale on the left side). Three red bands represent BAF values for the AA (bottom red band), AB (middle red band), and BB (top red band) genotypes across the entire chromosome 13. One black band in the middle (overlapping the red AB BAF band) represents LRR values (around 0) along the entire chromosome 13.

Figure 8b Interstitial mosaic duplication on the p arm of chromosome 16, characterized by increased log R ratio (mean of LRR within the segment (blue line) > 0) and abnormal heterozygous BAF. The vertical gray lines indicate the breakpoint(s) of the event segment. A non-mosaic trisomy would have a wider BAF split, at 1/3 (AAB) and 2/3 (ABB), and a larger elevation of LRR.

Figure 8c Mosaic duplication of the entire chromosome 8. It is characterized by an increased Log R ratio (mean LRR within the segment > 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8c is lower than in Figure 8b because it has a narrower split in the intermediate heterozygous BAF bands along with a smaller increase in LRR.

Figure 8d Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire q arm of chromosome 1. It is characterized by an unchanged Log R ratio (mean LRR within the segment close to 0) and abnormal heterozygous BAF. The p arm is in the normal state. A non-mosaic CNLOH would have only two BAF bands (AA and BB) and an LRR close to 0.

Figure 8e Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire chromosome 14. It is characterized by an unchanged Log R ratio (mean LRR within the segment close to 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8e is greater than in Figure 8d because it has a wider split in the intermediate BAF bands.

Figure 8f Two small interstitial mosaic heterozygous deletions on the p arm of chromosome 2. They are characterized by a decreased Log R ratio (mean LRR within the segment < 0) and abnormal heterozygous BAF. A non-mosaic heterozygous deletion would have no intermediate BAF bands and a larger decrease in LRR.

Figure 8g Large mosaic heterozygous deletion on the q arm of chromosome 9. It is characterized by a decreased Log R ratio (mean LRR within the segment < 0) and abnormal heterozygous BAF. The mosaic deletion in Figure 8g has a smaller proportion of cells containing the deletion than the ones in Figure 8f because it has a narrower split in the intermediate BAF bands along with a smaller decrease in LRR.

5. Data Analysis

Analysis started by loading the sample intensity files (two files per sample, for the red and green channels) into Illumina's GenomeStudio software. The intensity data were normalized using Illumina's five-step self-normalization procedure (see section 3, Illumina SNP Genotyping and Normalization), which draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. The called genotype, genotype call quality score, raw and normalized probe intensities, LRR, and BAF for each assay were exported from GenomeStudio in its Genotype Final Report (GFR) format. The array data in the GFR file were then converted into a high-performance binary file format (GDAT) using the GLU software package, which was developed at the Cancer Genomics Research Laboratory (CGR). A GC/CpG model file (GCM file) was generated from a copy of the UCSC hg18 reference genome and the Illumina binary manifest file Human610-Quadv1_B.bpm. Within GDAT, a four-step custom software pipeline (see section 4.2, Re-estimate Log R Ratio and B Allele Frequency) was implemented. The information in the GCM file was used for GC/CpG correction. The LRR and BAF were re-estimated from the quantile-normalized and GC/CpG-corrected values and written directly into the GDAT file as a new data table. All of these procedures were implemented using the GLU software package.
The renormalized LRR and BAF values from qualifying assays (completion rate >= 90%) were then analyzed using two custom software pipelines, built on the GADA and BAF Segmentation packages, to detect whole-chromosome and large segmental events. Only events greater than 0.5 Mb in size were considered, to minimize false discoveries (see section 4.3, Segmentation Methods).

We applied the GADA method with the following parameter settings: the SBL sparseness hyperparameter, which controls the total number of breakpoints discovered, aalpha = 0.85; the critical value of the backward elimination algorithm for the statistical score associated with a breakpoint, T statistic = 10; and the minimum number of SNPs each CNV segment must contain, MinSegLen = 200.

We applied the BAF Segmentation method with the following parameter settings: the mBAF threshold for calling regions of mosaic events based on segmented mBAF values, ai_threshold = 0.56 (default); the minimum number of SNPs a segmented region must contain to be called a mosaic event, ai_size = 45; the mBAF threshold for removing putatively non-informative SNPs, informative_threshold = 0.97 (default); and the threshold for the triplet filtering used to improve removal of putatively non-informative homozygous SNPs, triplet_threshold = 0.8 (default). CBS was used to identify the breakpoints of genomic regions, with a significance level for accepting changepoints of alpha = 0.001.

For each sample, adjacent events were merged if their event types were identical and the distance between the segments was less than 1 Mb. After merging, events shorter than 0.5 Mb were excluded. All events were then plotted. Based on manual review of each plot, the following were also excluded from analysis because they are not mosaic events: false-positive calls due to noisy assay data; non-mosaic copy-number variants; and loss of heterozygosity due to hemizygous deletion (deletion of one copy), identity by descent (IBD), or uniparental disomy (UPD). The segment boundaries of some events were manually corrected.
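The merge-and-filter step described above (merge adjacent events of the same type that are less than 1 Mb apart, then drop events shorter than 0.5 Mb) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the `Event` record and the function name are hypothetical, positions are assumed to be in base pairs, and "adjacent" is taken to mean consecutive on the same chromosome after sorting by start position.

```python
from dataclasses import dataclass

MERGE_GAP_BP = 1_000_000   # merge if the gap between segments is < 1 Mb
MIN_SIZE_BP = 500_000      # exclude merged events shorter than 0.5 Mb


@dataclass
class Event:
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "gain", "loss", "cnloh"


def merge_and_filter(events: list[Event]) -> list[Event]:
    """Merge same-type neighbours closer than MERGE_GAP_BP, then size-filter."""
    merged: list[Event] = []
    for ev in sorted(events, key=lambda e: (e.chrom, e.start)):
        last = merged[-1] if merged else None
        if (last is not None and last.chrom == ev.chrom
                and last.kind == ev.kind
                and ev.start - last.end < MERGE_GAP_BP):
            last.end = max(last.end, ev.end)  # extend the previous event
        else:
            # copy so the caller's objects are never modified
            merged.append(Event(ev.chrom, ev.start, ev.end, ev.kind))
    return [e for e in merged if e.end - e.start >= MIN_SIZE_BP]


if __name__ == "__main__":
    calls = [
        Event("chr1", 1_000_000, 1_300_000, "gain"),
        Event("chr1", 1_800_000, 2_100_000, "gain"),   # 0.5 Mb gap: merged
        Event("chr1", 5_000_000, 5_200_000, "loss"),   # 0.2 Mb: dropped
        Event("chr2", 1_000_000, 1_400_000, "gain"),   # 0.4 Mb: dropped
    ]
    for e in merge_and_filter(calls):
        print(e)
```

Filtering after merging, rather than before, matters here: two short calls that individually fall below 0.5 Mb can still form one reportable event once they are joined.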
Each detected event was classified as mosaic duplication (gain), mosaic deletion (loss), or mosaic copy-neutral loss of heterozygosity. The proportion of abnormal cells was estimated for each event (see section 4.5, Proportion of Mosaicism). The magnitude of the BAF difference for single-copy duplication events is one-third of that for copy-neutral LOH or deletion events, which reduces the sensitivity for calling duplication events. For mosaic duplication events, only those with a proportion of abnormal cells <= 0.9 were kept, because above 0.9 it is difficult to distinguish mosaic from non-mosaic duplication reliably. To view the characteristics of the mosaic events, they were plotted by proportion of abnormal cells and LRR using Microsoft Office Excel (Figure 9). Two circular genomic plots, one for bladder cancer cases and one for controls, each with three tracks of mosaic events on autosomes 1 to 22, were generated using the Circos software (Figure 10). Plots of the frequency of mosaic events by age and cancer status, for all subjects and for males only, were generated using Microsoft Office Excel (Figure 12). Logistic regression models were fit using the SAS software package to determine the relationship between having mosaic event(s) and age at DNA collection, gender, and cancer diagnosis.

6. Results and Discussion

6.1. Characteristics of Mosaic Events

We observed 193 mosaic segments larger than 0.5 Mb on the autosomes of 163 individuals, for an overall frequency of individuals with mosaicism of 5%. Of these, 118 mosaic events (61.14%) were from bladder cancer individuals and 75 (38.86%) were from cancer-free controls. Mosaic autosomal abnormalities were more common in bladder cancer individuals (98/1673 = 5.86%) than in cancer-free persons (65/1577 = 4.12%). The chromosome most frequently carrying an event was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2 and 4 for control individuals (4.15%). Combining cases and controls, the chromosomes most frequently carrying an event were chromosomes 2, 10, and 17 (8.29%) (Table 5), which may imply instability of these three chromosomes. The most frequent type of event was mosaic duplication (55.96%), whereas mosaic deletion and mosaic CNLOH constituted 12.44% and 31.61% of mosaic events, respectively (Table 6). The segment size was largest for CNLOH and smallest for mosaic duplication. Median lengths were 0.82 Mb for mosaic duplications, 2.32 Mb for mosaic deletions, and Mb for mosaic CNLOHs. The abnormal cell proportions were between 20.88% and 89.86% for mosaic duplication, between 25.45% and 96.64% for mosaic deletion, and between 3.84% and 95% for mosaic CNLOH (Figure 9).

Table 5 Frequency of Mosaic Chromosomal Events by Chromosome and Case-Control Status. Columns: Chromosome; Mosaic Chromosome Count (Bladder Cancer, Control, Total); Mosaic Chromosome Frequency (%) (Bladder Cancer, Control, Total). Final row: Total.

Table 6 Frequency of Mosaic Chromosomal Events by Event Type and Location. Columns: Event location; Mosaic chromosome count (Gain, Loss, CN LOH, Total); Mosaic chromosome frequency (%) (Gain, Loss, CN LOH, Total). Rows: Chromosome, Telomeric p, Telomeric q, Interstitial, Telomeric (p + q), Total.

Figure 9 Characteristics of mosaic events. Mosaic events are plotted by proportion of abnormal cells (P) and LRR for 193 events in 163 individuals. A blue dot represents the P and LRR values of a mosaic duplication event, a green dot a mosaic CNLOH event, and a red dot a mosaic deletion event.

Of the mosaic chromosomal events detected by the GADA and BAF methods, 4.66% spanned an entire chromosome. These included 4 whole-chromosome mosaic trisomy events, on chromosomes 8, 12, 18, and 21, with 3 of the 4 carried by one subject, and 5 whole-chromosome mosaic CNLOH events, on chromosomes 6, 9, 18, and 19 (2 events), carried by 4 different subjects. No whole-chromosome mosaic deletion event was detected (Figure 10). We found that 9.33% of mosaic chromosomal events began at the p-arm telomere and 15.03% ended at the q-arm telomere. Most mosaic chromosomal events were interstitial (70.98%), spanning no telomere. The majority of telomeric events (p + q) were mosaic copy-neutral LOH (27 / 47 = 57.45%), followed by mosaic duplication (17 / 47 = 36.17%). The majority of interstitial events were mosaic duplication (87 / 137 = 63.5%), followed by mosaic CNLOH (29 / 137 = 21%) (Table 6).

Sixteen individuals (9 bladder cancer cases and 7 controls) had mosaic events on at least two chromosomes. Among control individuals, the greatest number of mosaic chromosomal events observed in a single subject was 4, in a subject from the ATBC cohort study; the events covered the whole of chromosomes 12, 18, and 19 and the entire q arm of chromosome 21, and all of them were mosaic duplications (Figure 11). Among bladder cancer individuals, the greatest number of mosaic chromosomal events observed in a single subject was 11, in a subject from the PLCO cohort study; the events were located on 9 different chromosomes, including 2 events each on chromosomes 12 and 13, and all of them were mosaic copy-neutral LOH with a very high degree of mosaicism.
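The location categories used above (whole chromosome, telomeric p, telomeric q, interstitial) can be sketched as a small classifier. This is a hedged illustration only: the function name, the use of the first and last assayed SNP positions as stand-ins for the telomeres, and the tolerance parameter are all assumptions, since the text does not state how "reaching a telomere" was decided.

```python
from collections import Counter


def classify_location(start: int, end: int,
                      first_snp: int, last_snp: int, tol: int = 0) -> str:
    """Classify an event by which chromosome ends it reaches.

    first_snp / last_snp: positions of the first and last assayed SNP on the
    chromosome (assumed proxies for the p-arm and q-arm telomeres).
    tol: allowed distance in bp from those positions (assumption).
    """
    touches_p = start - first_snp <= tol
    touches_q = last_snp - end <= tol
    if touches_p and touches_q:
        return "whole chromosome"
    if touches_p:
        return "telomeric p"
    if touches_q:
        return "telomeric q"
    return "interstitial"


if __name__ == "__main__":
    # Hypothetical chromosome with SNPs from 10 kb to 100 Mb.
    first, last = 10_000, 100_000_000
    events = [
        (10_000, 100_000_000),      # spans everything
        (10_000, 30_000_000),       # starts at the p end
        (60_000_000, 100_000_000),  # ends at the q end
        (40_000_000, 41_000_000),   # touches neither end
    ]
    tally = Counter(classify_location(s, e, first, last) for s, e in events)
    for label, n in tally.items():
        print(f"{label}: {n} ({100 * n / len(events):.2f}%)")
```

The percentages reported in the text are frequencies of this kind, taken over all 193 events.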

Figure 10 Circular plots displaying the genomic location of mosaic events. The outer rings are autosomes 1 to 22. The yellow track shows mosaic copy-neutral LOH events, the blue track mosaic duplication events, and the red track mosaic deletion events. The left plot shows events detected in bladder cancer subjects. The right plot shows events detected in cancer-free controls.

Figure 11 Mosaic duplication across the entire chromosomes 12, 18, and 19 and the entire q arm of chromosome 21 in a control individual. It is characterized by an increased Log R ratio (mean LRR within the segment > 0) and abnormal heterozygous BAF.

6.2. Mosaic Chromosomal Abnormalities and Age at DNA

We examined the effect of increasing age on the frequency of mosaic events across the three cohort studies, which predominantly included individuals over the age of 60. Twenty-seven individuals had missing ages. For the remaining 3212 subjects, the frequency of control individuals with mosaic events increased with age from 3.21% for those under 60 to 4.53% for those between the ages of to 5.96% for those between the ages of 76 and 89 years (P = 0.11). The frequency of bladder cancer individuals with mosaic events was almost constant with age, from 6.95% for those under 60 to 7.24% for those between the ages of 66-70, then to 6.14% for those between the ages of 76 and 89 (P = 0.32). The frequency of mosaic events was higher in bladder cancer individuals than in control individuals in the first four age groups. However, it was very similar in bladder cancer individuals and control individuals between the ages of 76 and 89 (Figure 12 Top).

The frequency of male control individuals with mosaic events increased with age from 3.39% for those under 60 to 4.41% for those between the ages of to 7.51% for those between the ages of 76 and 89 years (P = 0.035). The frequency of male bladder cancer individuals with mosaic events was almost constant with age, from 7.75% for those under 60 to 6.81% for those between the ages of 66-70, then to 6.81% for those between the ages of 76 and 89 (P = 0.26). The frequency of mosaic events was higher in male bladder cancer individuals than in male control individuals in the first four age groups. However, it was lower in male bladder cancer individuals than in male control individuals between the ages of 76 and 89 (Figure 12 Bottom). For females, there were no mosaic events under age 60 and very few events in the other four age categories, so no reliable summary can be given for each age category (Table 7).

Figure 12 Frequency of mosaic events by age and cancer status for all individuals (Top) and males only (Bottom).

Table 7 Mosaic Event Counts by Five Age Categories and Cancer Status. Columns: AGE; then, for Male + Female, Male, and Female separately, Yes/No counts for Bladder Cancer and for Control. Final row: Total.

6.3. Mosaic Chromosomal Abnormalities and Gender

We also examined the effect of gender on the frequency of mosaic events across the three cohort studies, separately for bladder cancer cases and controls. Among bladder cancer cases, mosaic events were more frequent in males than in females (male = 6.13%, female = 4.22%). Among controls, the frequency of mosaic events was almost the same for males and females (male = 4.08%, female = 4.48%). Logistic regression models of mosaic status on gender, adjusting for age, were fit for (1) controls and (2) cases. For control, the OR = with P-value = For case, the OR = with P-value = A logistic regression model of mosaic status on gender, adjusting for age and cancer status, was also fit to all subjects and gave OR = with P-value = We therefore did not observe a significant gender effect in any of the three models (Table 8).

Table 8 Frequency of Mosaic Events by Gender and Cancer Status. Columns: Mosaic event frequency (%) (Male, Female); Adjusted Logistic Model (OR, 95% CI, P-value). Rows: Bladder Cancer, Control, Overall.

6.4. Mosaic Chromosomal Abnormalities and Cancer Risk

6.4.1. Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer, several logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2) cancer status + gender; (3) cancer status + age; (4) cancer status + age + gender as predictors, to test for partial independence of cancer status and mosaic status controlling for gender and/or age. In each model, cancer status = 1 for bladder cancer and 0 for control; gender = 1 for male and 0 for female; and age was a continuous variable.

There is modest evidence of a positive relationship between mosaic events and bladder cancer: OR = 1.44, 95% CI = ; P = 0.027 for model (1); OR = 1.43, 95% CI = ; P = 0.029 for model (2); OR = 1.45, 95% CI = 2.00; P = 0.025 for model (3); and OR = 1.44, 95% CI = ; P = 0.026 for model (4). All four models show very similar P-values for cancer status. There is no significant evidence of a gender effect or an age effect in the models that include gender and/or age as predictors (Table 9.1).

All four main-effect models fit the data adequately. The P-value = from Pearson's goodness-of-fit test for model (2) and P-values = and from the Hosmer-Lemeshow goodness-of-fit test for models (3) and (4). Because age is a continuous variable, models (3) and (4) have a very large number of unique profiles (Table 9.2). In this situation the data are too sparse for the Pearson and deviance goodness-of-fit tests, and the Hosmer-Lemeshow goodness-of-fit test is more suitable, since it is designed for a large number of predictor settings. All four models have very similar AIC values. The difference in -2 Log L between model (1) and each of the other three models is less than 3.84, the critical value of $\chi^2_1$ at the 0.05 significance level, so none of the more complex models significantly improves upon the simplest model, model (1). However, we are interested not only in whether there is evidence of a cancer status effect on mosaic status but also in the age and gender effects, so model (4) (Table 9.3) gives more meaningful and interpretable results for our scientific and statistical questions.

To test for equality of the odds ratios between cancer status and mosaic status across ages, we added an interaction term, using cancer_status + age + gender + cancer_status*age as predictors. The evidence that the odds ratio between bladder cancer and mosaic events differs across ages is almost significant (P = 0.054). It is therefore reasonable to assume that the log(OR) between the cancer status levels at a given age x is close to a linear function of x. This model fits the data very well and has the smallest AIC value among all tested models (Table 9.4).
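Two of the quantities discussed above can be checked by hand with a short Python sketch. It is an illustration under stated assumptions, not a reproduction of the SAS analysis: the function names are hypothetical, and the 2x2 counts are taken from the frequencies printed in section 6.1 (98 of 1673 cases and 65 of 1577 controls with a mosaic event). A logistic model with a single binary predictor has the sample odds ratio as its estimate, so the unadjusted odds ratio corresponds to model (1); the second function gives the P-value of a likelihood ratio test on 1 degree of freedom, which is where the 3.84 threshold comes from.

```python
import math


def odds_ratio_wald(a: int, b: int, c: int, d: int, z: float = 1.959964):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: cases with event, b: cases without,
    c: controls with event, d: controls without.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


def lrt_pvalue_1df(delta_neg2_loglik: float) -> float:
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(delta_neg2_loglik / 2))


if __name__ == "__main__":
    # Counts as printed in section 6.1 (assumed to be the analysed totals).
    or_, low, high = odds_ratio_wald(98, 1673 - 98, 65, 1577 - 65)
    print(f"OR = {or_:.2f}, 95% Wald CI = ({low:.2f}, {high:.2f})")
    print(f"P-value at a -2 Log L difference of 3.84: {lrt_pvalue_1df(3.84):.3f}")
```

With these counts the unadjusted odds ratio comes out at about 1.45, close to the value reported for model (1); the small difference may reflect the exact sample counts used in the fitted models.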
We also examined whether DNA source and study were significant predictors of mosaic status, using univariate analysis for each nominal variable. Notably, DNA source (58.6% from blood, 41.4% from buccal) was not a significant predictor (P-value = 0.578). The study (ATBC, PLCO, CPS) was also not a significant predictor (P-value = 0.377).

Table 9.1 Summary Logistic Regression Models on All Subjects. Setting: presence of mosaic event = 1; case vs control; male vs female. Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); GENDER (Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq). Rows: mosaic_event=status+error (OR 1.44); mosaic_event=status+gender+error (OR 1.43); mosaic_event=status+age+error (OR 1.45); mosaic_event=status+gender+age+error (OR 1.44); mosaic_event=status+age+gender+status*age+error.

Table 9.2 Summary Goodness-of-Fit Results for Logistic Regression Models on All Subjects. Setting: presence of mosaic event = 1; case vs control; male vs female. Columns: Model; -2 Log L; AIC; Pearson GOF (Pr > ChiSq); HL GOF (Pr > ChiSq). Rows: mosaic_event=status+error; mosaic_event=status+gender+error; mosaic_event=status+age+error; mosaic_event=status+gender+age+error; mosaic_event=status+age+gender+status*age+error.

Table 9.3 SAS Output From Logistic Regression Main Effect Model on All Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept (Pr > ChiSq <.0001), AGE_DNA, SEX_Code, STATUS_Code.

Table 9.4 SAS Output From Logistic Regression Model With Interaction Term on All Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept (Pr > ChiSq <.0001), STATUS_Code, AGE_DNA, SEX_Code, STATUS_Code*AGE_DNA.

To reduce the number of unique predictor profiles when age was in the model, we divided age into five age groups: <=60, 61-65, 66-70, 71-75, and >=76. We fit a logistic regression of mosaic status on cancer status, gender, and age_group to test for partial independence of mosaic status and cancer status controlling for gender and the 5 age groups. Again there is modest evidence of a positive association between bladder cancer and mosaic status, with P-value = 0.025, which is only slightly smaller than the P-value when age is a continuous variable (P-value = 0.026). There was no significant difference between any of the last four age groups and the first age group (Table 9.5).

Table 9.5 SAS Output From Logistic Regression Model with Age Split to Five Groups on All Subjects. Type 3 Analysis of Effects. Columns: Effect, DF, Wald Chi-Square, Pr > ChiSq. Effects: STATUS_Code, SEX_Code, AGE_GROUP. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept (Pr > ChiSq <.0001), STATUS_Code, SEX_Code, and four AGE_GROUP levels.

Odds Ratio Estimates. Columns: Effect, Point Estimate, 95% Wald Confidence Limits. Effects: STATUS_Code, SEX_Code, and each of the four later AGE_GROUP levels (the last being >=76) vs the first age group.

6.4.2. Mosaic Chromosomal Abnormalities and Cancer Risk for Men Only

Bladder cancer is the fourth most common cancer diagnosed in men. Men are about 3 to 4 times more likely to get bladder cancer during their lifetime than women. Overall, the chance that a man will develop this cancer during his life is about 1 in 26. For women, the chance is about 1 in 90 [19].

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in males only, two logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2) cancer status + age as predictors, to test for partial independence of mosaic status and cancer status controlling for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age was a continuous variable. There is modest evidence of a positive relationship between mosaic events and bladder cancer risk for male individuals: OR = 1.53, 95% CI = ; P = 0.017 for model (1) and OR = 1.55, 95% CI = ; P = 0.014 for model (2) (Table 10.1). There is no significant evidence of an age effect in model (2) (Table 10.2).

To test for equality of the odds ratios between mosaic events and bladder cancer across ages for male individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is significant evidence that the odds ratio between mosaic events and bladder cancer differs across ages (P = 0.017). This model fits the data very well and has the smallest AIC value among all tested models (Table 10.3).

Table 10.1 Summary Logistic Regression Models on Male Subjects. Setting: presence of mosaic event = 1; case vs control. Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq); HL GOF (Pr > ChiSq). Rows: Mosaic_event=status+error (OR 1.53); Mosaic_event=status+age+error (OR 1.55); Mosaic_event=status+age+status*age+error.

Table 10.2 SAS Output From Logistic Regression Main Effect Model on Male Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept (Pr > ChiSq <.0001), STATUS_Code, AGE_DNA.

Table 10.3 SAS Output From Logistic Regression Model With Interaction Term on Male Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept (Pr > ChiSq <.0001), STATUS_Code, AGE_DNA, STATUS_Code*AGE_DNA.

6.4.3. Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in females only, two logistic regression models were fit to the data with the mosaic event as the response variable and (1) cancer status; (2) cancer status + age as predictors, to test for partial independence of mosaic event and cancer status controlling for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age was a continuous variable. There is no significant evidence of an association between mosaic events and cancer risk for female individuals: OR = 0.94, 95% CI = ; P = 0.887 for model (1) and OR = 0.93, 95% CI = ; P = 0.865 for model (2) (Table 11.1). There is no significant evidence of an age effect in model (2) (Table 11.2).

To test for equality of the odds ratios between mosaic events and cancer status across ages for female individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is no significant evidence that the odds ratio between mosaic events and bladder cancer differs across ages (P = 0.27) (Table 11.3).

Table 11.1 Summary Logistic Models on Female Subjects. Setting: presence of mosaic event = 1; case vs. control. Columns: Model; STATUS (OR, 95% Wald CI, Pr > ChiSq); AGE (Pr > ChiSq); STATUS*AGE (Pr > ChiSq); HL GOF (Pr > ChiSq). Rows: mosaic_event=status+error (OR 0.94); mosaic_event=status+age+error (OR 0.93); mosaic_event=status+age+status*age+error.

Table 11.2 SAS Output From Logistic Regression Main Effect Model on Female Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept, STATUS_Code, AGE_DNA.

Table 11.3 SAS Output From Logistic Regression Model with Interaction Term on Female Subjects. Analysis of Maximum Likelihood Estimates. Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq. Parameters: Intercept, STATUS_Code, AGE_DNA, STATUS_Code*AGE_DNA.

7. Conclusion and Further Studies

In this project, DNA from 3,239 individuals (1,673 bladder cancer cases and 1,566 cancer-free controls) was examined for evidence of mosaicism of the autosomes, using Illumina genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. We observed 193 mosaic duplication, mosaic deletion, and mosaic copy-neutral loss of heterozygosity (CNLOH) events larger than 0.5 Mb in the autosomes of 163 study subjects (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were statistically significantly and positively associated with bladder cancer in males (OR = 1.55; P = 0.014) but not in females. The frequency of mosaicism increased with age in male control subjects, from 3.39% in individuals under age 60 to 7.51% in those between 76 and 89 years old (P = 0.035). Mosaic autosomal abnormalities were more common in bladder cancer individuals (5.86%) than in cancer-free persons (4.12%). Mosaic events were more frequent in males than in females among bladder cancer individuals (male = 6.13%, female = 4.22%) but similar among cancer-free persons (male = 4.08%, female = 4.48%). The most frequent class of autosomal abnormality detected was mosaic duplication, representing 55.96% of mosaic events. The autosome most frequently carrying a mosaic event was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2 and 4 for control individuals (4.15%). Combining cases and controls, the chromosomes most frequently carrying a mosaic event were chromosomes 2, 10, and 17 (8.29%).
Mosaicism in older cancer-free male individuals suggests that age-related genomic instability could be due to increased rates of somatic mutation or a diminished capacity for genomic maintenance, such as with telomere attrition, leading to proliferation of somatically altered cell populations [20].

This project could be extended in several directions if additional information becomes available. For all subjects with mosaic events, it would be very interesting to assess the character and behavior of the mosaic events over time, and to determine whether these individuals' number of mosaic events or proportion of mosaicism changes over time. In this project we calculated the proportion of mosaicism for each mosaic event but did not go further. If additional data points were collected relative to the time of diagnosis in cases, we could investigate the hypothesis of a positive association between the proportion of mosaicism and the severity of the cancer (early to later stage).

Very recently, we accidentally identified one subject who had participated in two studies (Non-Hodgkin's Lymphoma (NHL) and ovarian cancer). We had her genotype data from two DNA samples, drawn at age 59 (NHL) and at age 63 (NHL + ovarian cancer). We ran the mosaic chromosomal abnormality analysis on both samples and found mosaic events on multiple chromosomes. At age 59, mosaic events were observed on chromosomes 3, 8, 10, 13, 20, and X (Figure 13 left). At age 63, mosaic events were observed on chromosomes 3, 4, 8, 9, 13, 20, and X (Figure 13 right). All of the mosaic events seen at age 59 were again observed at age 63, with an increased proportion of mosaicism, except the event on chromosome 10, which was observed at age 59 but not at age 63. The disappearance of this event may be the result of cancer treatment. Two new events were detected at age 63, on chromosomes 4 and 9, which may be due to later-stage NHL or may be specific to the ovarian cancer.

Figure 13 Mosaic events observed on multiple chromosomes in a female subject with Non-Hodgkin's Lymphoma and ovarian cancer. The left figures are for the DNA sample drawn at age 59 (NHL). The right figures are for the DNA sample drawn at age 63 (NHL + ovarian cancer).

It may also be possible to determine the developmental origin (germline or somatic cells) of observed mosaic events if blood, tumor tissue, and normal tissue data are available. A germline mutation is one that is passed on to offspring. Examples of germline mutations linked to cancer are those that occur in cancer susceptibility genes, increasing a person's risk for the disease. Somatic mutations are not passed on to the next generation. By distinguishing the origin of a mutation, we may be able to discover cancer susceptibility genes such as the well-known BRCA1 and BRCA2 genes for breast cancer. We recently ran the mosaic analysis on several TCGA (The Cancer Genome Atlas) SNP data sets. Figure 14 shows a possible germline deletion at chr7: for subject 1087, who had glioblastoma multiforme (GBM), with blood and tumor samples taken at the same time. The same mutation was present in both the blood and the tumor sample, which means it is not a tumor-specific mutation. We also observed some mutations that were present in the blood sample but not in the tumor sample, which implies that the mutation in blood is a somatic mutation rather than a germline mutation (Figure 15).

Figure 14 Deletion on chromosome 7 detected in the blood sample (Top) and the primary solid tumor sample (Bottom) of one GBM subject from TCGA data. The deletion in the blood sample is very likely a germline mutation because the same event was present in both DNA sources.

Figure 15 Mosaic CNLOH of the whole chromosome 3 detected in the blood sample (Top) of ovarian cancer subject 1877 from TCGA data. The bottom figure shows the mutation detected in the primary solid tumor sample of the same subject. The mutation in the blood sample is a somatic mutation because the mutation types differ greatly between the two DNA sources.


More information

GWAS Data Cleaning. GENEVA Coordinating Center Department of Biostatistics University of Washington. January 13, 2016.

GWAS Data Cleaning. GENEVA Coordinating Center Department of Biostatistics University of Washington. January 13, 2016. GWAS Data Cleaning GENEVA Coordinating Center Department of Biostatistics University of Washington January 13, 2016 Contents 1 Overview 2 2 Preparing Data 3 2.1 Data formats used in GWASTools............................

More information

ASSIsT: An Automatic SNP ScorIng Tool for in and out-breeding species Reference Manual

ASSIsT: An Automatic SNP ScorIng Tool for in and out-breeding species Reference Manual ASSIsT: An Automatic SNP ScorIng Tool for in and out-breeding species Reference Manual Di Guardo M, Micheletti D, Bianco L, Koehorst-van Putten HJJ, Longhi S, Costa F, Aranzana MJ, Velasco R, Arús P, Troggio

More information

Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource

Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource Information for researchers Interim Data Release, 2015 1 Introduction... 3 1.1 UK Biobank... 3

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Genomic Selection in. Applied Training Workshop, Sterling. Hans Daetwyler, The Roslin Institute and R(D)SVS

Genomic Selection in. Applied Training Workshop, Sterling. Hans Daetwyler, The Roslin Institute and R(D)SVS Genomic Selection in Dairy Cattle AQUAGENOME Applied Training Workshop, Sterling Hans Daetwyler, The Roslin Institute and R(D)SVS Dairy introduction Overview Traditional breeding Genomic selection Advantages

More information

CHROMOSOMES Dr. Fern Tsien, Dept. of Genetics, LSUHSC, NO, LA

CHROMOSOMES Dr. Fern Tsien, Dept. of Genetics, LSUHSC, NO, LA CHROMOSOMES Dr. Fern Tsien, Dept. of Genetics, LSUHSC, NO, LA Cytogenetics is the study of chromosomes and their structure, inheritance, and abnormalities. Chromosome abnormalities occur in approximately:

More information

Step by Step Guide to Importing Genetic Data into JMP Genomics

Step by Step Guide to Importing Genetic Data into JMP Genomics Step by Step Guide to Importing Genetic Data into JMP Genomics Page 1 Introduction Data for genetic analyses can exist in a variety of formats. Before this data can be analyzed it must imported into one

More information

Combining Data from Different Genotyping Platforms. Gonçalo Abecasis Center for Statistical Genetics University of Michigan

Combining Data from Different Genotyping Platforms. Gonçalo Abecasis Center for Statistical Genetics University of Michigan Combining Data from Different Genotyping Platforms Gonçalo Abecasis Center for Statistical Genetics University of Michigan The Challenge Detecting small effects requires very large sample sizes Combined

More information

Overview of Genetic Testing and Screening

Overview of Genetic Testing and Screening Integrating Genetics into Your Practice Webinar Series Overview of Genetic Testing and Screening Genetic testing is an important tool in the screening and diagnosis of many conditions. New technology is

More information

Chapter 9 Patterns of Inheritance

Chapter 9 Patterns of Inheritance Bio 100 Patterns of Inheritance 1 Chapter 9 Patterns of Inheritance Modern genetics began with Gregor Mendel s quantitative experiments with pea plants History of Heredity Blending theory of heredity -

More information

The following chapter is called "Preimplantation Genetic Diagnosis (PGD)".

The following chapter is called Preimplantation Genetic Diagnosis (PGD). Slide 1 Welcome to chapter 9. The following chapter is called "Preimplantation Genetic Diagnosis (PGD)". The author is Dr. Maria Lalioti. Slide 2 The learning objectives of this chapter are: To learn the

More information

Analysis of FFPE DNA Data in CNAG 2.0 A Manual

Analysis of FFPE DNA Data in CNAG 2.0 A Manual Analysis of FFPE DNA Data in CNAG 2.0 A Manual Table of Contents: I. Background P.2 II. Installation and Setup a. Download/Install CNAG 2.0 P.3 b. Setup P.4 III. Extract Mapping 500K FFPE Data P.7 IV.

More information

Core Facility Genomics

Core Facility Genomics Core Facility Genomics versatile genome or transcriptome analyses based on quantifiable highthroughput data ascertainment 1 Topics Collaboration with Harald Binder and Clemens Kreutz Project: Microarray

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

Chromosomes, Mapping, and the Meiosis Inheritance Connection

Chromosomes, Mapping, and the Meiosis Inheritance Connection Chromosomes, Mapping, and the Meiosis Inheritance Connection Carl Correns 1900 Chapter 13 First suggests central role for chromosomes Rediscovery of Mendel s work Walter Sutton 1902 Chromosomal theory

More information

Fact Sheet 14 EPIGENETICS

Fact Sheet 14 EPIGENETICS This fact sheet describes epigenetics which refers to factors that can influence the way our genes are expressed in the cells of our body. In summary Epigenetics is a phenomenon that affects the way cells

More information

UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory. April, 2015

UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory. April, 2015 UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory April, 2015 1 Contents Overview... 3 Rare Variants... 3 Observation... 3 Approach... 3 ApoE

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

MiSeq: Imaging and Base Calling

MiSeq: Imaging and Base Calling MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please

More information

Gene Mapping Techniques

Gene Mapping Techniques Gene Mapping Techniques OBJECTIVES By the end of this session the student should be able to: Define genetic linkage and recombinant frequency State how genetic distance may be estimated State how restriction

More information

Review of Fundamental Mathematics

Review of Fundamental Mathematics Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Genetics Lecture Notes 7.03 2005. Lectures 1 2

Genetics Lecture Notes 7.03 2005. Lectures 1 2 Genetics Lecture Notes 7.03 2005 Lectures 1 2 Lecture 1 We will begin this course with the question: What is a gene? This question will take us four lectures to answer because there are actually several

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Online Supplement to Polygenic Influence on Educational Attainment. Genotyping was conducted with the Illumina HumanOmni1-Quad v1 platform using

Online Supplement to Polygenic Influence on Educational Attainment. Genotyping was conducted with the Illumina HumanOmni1-Quad v1 platform using Online Supplement to Polygenic Influence on Educational Attainment Construction of Polygenic Score for Educational Attainment Genotyping was conducted with the Illumina HumanOmni1-Quad v1 platform using

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

Name: Class: Date: ID: A

Name: Class: Date: ID: A Name: Class: _ Date: _ Meiosis Quiz 1. (1 point) A kidney cell is an example of which type of cell? a. sex cell b. germ cell c. somatic cell d. haploid cell 2. (1 point) How many chromosomes are in a human

More information

Mendelian inheritance and the

Mendelian inheritance and the Mendelian inheritance and the most common genetic diseases Cornelia Schubert, MD, University of Goettingen, Dept. Human Genetics EUPRIM-Net course Genetics, Immunology and Breeding Mangement German Primate

More information

Basics of Marker Assisted Selection

Basics of Marker Assisted Selection asics of Marker ssisted Selection Chapter 15 asics of Marker ssisted Selection Julius van der Werf, Department of nimal Science rian Kinghorn, Twynam Chair of nimal reeding Technologies University of New

More information

GenomeStudio Data Analysis Software

GenomeStudio Data Analysis Software GenomeStudio Analysis Software Illumina has created a comprehensive suite of data analysis tools to support a wide range of genetic analysis assays. This single software package provides data visualization

More information

Jitter Measurements in Serial Data Signals

Jitter Measurements in Serial Data Signals Jitter Measurements in Serial Data Signals Michael Schnecker, Product Manager LeCroy Corporation Introduction The increasing speed of serial data transmission systems places greater importance on measuring

More information

Heredity. Sarah crosses a homozygous white flower and a homozygous purple flower. The cross results in all purple flowers.

Heredity. Sarah crosses a homozygous white flower and a homozygous purple flower. The cross results in all purple flowers. Heredity 1. Sarah is doing an experiment on pea plants. She is studying the color of the pea plants. Sarah has noticed that many pea plants have purple flowers and many have white flowers. Sarah crosses

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

GAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters

GAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters GAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters Michael B Miller , Michael Li , Gregg Lind , Soon-Young

More information

5 GENETIC LINKAGE AND MAPPING

5 GENETIC LINKAGE AND MAPPING 5 GENETIC LINKAGE AND MAPPING 5.1 Genetic Linkage So far, we have considered traits that are affected by one or two genes, and if there are two genes, we have assumed that they assort independently. However,

More information

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each

More information

Statistics. Measurement. Scales of Measurement 7/18/2012

Statistics. Measurement. Scales of Measurement 7/18/2012 Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Report on the Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Background and Goals of the Workshop June 5 6, 2012 The use of genome sequencing in human research is growing

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis Goal: This tutorial introduces several websites and tools useful for determining linkage disequilibrium

More information

Basic Principles of Forensic Molecular Biology and Genetics. Population Genetics

Basic Principles of Forensic Molecular Biology and Genetics. Population Genetics Basic Principles of Forensic Molecular Biology and Genetics Population Genetics Significance of a Match What is the significance of: a fiber match? a hair match? a glass match? a DNA match? Meaning of

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )

More information

Next Generation Sequencing: Technology, Mapping, and Analysis

Next Generation Sequencing: Technology, Mapping, and Analysis Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University gbenson@bu.edu http://tandem.bu.edu/ The Human Genome Project took

More information

GenomeStudio Data Analysis Software

GenomeStudio Data Analysis Software GenomeStudio Data Analysis Software Illumina has created a comprehensive suite of data analysis tools to support a wide range of genetic analysis assays. This single software package provides data visualization

More information

Two Correlated Proportions (McNemar Test)

Two Correlated Proportions (McNemar Test) Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

More information

Lecture 3: Mutations

Lecture 3: Mutations Lecture 3: Mutations Recall that the flow of information within a cell involves the transcription of DNA to mrna and the translation of mrna to protein. Recall also, that the flow of information between

More information

GAIA: Genomic Analysis of Important Aberrations

GAIA: Genomic Analysis of Important Aberrations GAIA: Genomic Analysis of Important Aberrations Sandro Morganella Stefano Maria Pagnotta Michele Ceccarelli Contents 1 Overview 1 2 Installation 2 3 Package Dependencies 2 4 Vega Data Description 2 4.1

More information

Section 5.0 : Horn Physics. By Martin J. King, 6/29/08 Copyright 2008 by Martin J. King. All Rights Reserved.

Section 5.0 : Horn Physics. By Martin J. King, 6/29/08 Copyright 2008 by Martin J. King. All Rights Reserved. Section 5. : Horn Physics Section 5. : Horn Physics By Martin J. King, 6/29/8 Copyright 28 by Martin J. King. All Rights Reserved. Before discussing the design of a horn loaded loudspeaker system, it is

More information

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA

Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA Page 1 of 5 Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA Genetics Exercise: Understanding how meiosis affects genetic inheritance and DNA patterns

More information

CHROMOSOMES AND INHERITANCE

CHROMOSOMES AND INHERITANCE SECTION 12-1 REVIEW CHROMOSOMES AND INHERITANCE VOCABULARY REVIEW Distinguish between the terms in each of the following pairs of terms. 1. sex chromosome, autosome 2. germ-cell mutation, somatic-cell

More information

A guide to the analysis of KASP genotyping data using cluster plots

A guide to the analysis of KASP genotyping data using cluster plots extraction sequencing genotyping extraction sequencing genotyping extraction sequencing genotyping extraction sequencing A guide to the analysis of KASP genotyping data using cluster plots Contents of

More information

Factors for success in big data science

Factors for success in big data science Factors for success in big data science Damjan Vukcevic Data Science Murdoch Childrens Research Institute 16 October 2014 Big Data Reading Group (Department of Mathematics & Statistics, University of Melbourne)

More information

Interpret software. User guide. version 11

Interpret software. User guide. version 11 Interpret software User guide version 11 This protocol booklet and its contents are Oxford Gene Technology (Operations) Limited 2008. All rights reserved. Reproduction of all or any substantial part of

More information

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

More information

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays Exiqon Array Software Manual Quick guide to data extraction from mircury LNA microrna Arrays March 2010 Table of contents Introduction Overview...................................................... 3 ImaGene

More information

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

A trait is a variation of a particular character (e.g. color, height). Traits are passed from parents to offspring through genes.

A trait is a variation of a particular character (e.g. color, height). Traits are passed from parents to offspring through genes. 1 Biology Chapter 10 Study Guide Trait A trait is a variation of a particular character (e.g. color, height). Traits are passed from parents to offspring through genes. Genes Genes are located on chromosomes

More information

Chromosomes, Karyotyping, and Abnormalities (Learning Objectives) Learn the components and parts of a metaphase chromosome.

Chromosomes, Karyotyping, and Abnormalities (Learning Objectives) Learn the components and parts of a metaphase chromosome. Chromosomes, Karyotyping, and Abnormalities (Learning Objectives) Learn the components and parts of a metaphase chromosome. Define the terms karyotype, autosomal and sex chromosomes. Explain how many of

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

Hardy-Weinberg Equilibrium Problems

Hardy-Weinberg Equilibrium Problems Hardy-Weinberg Equilibrium Problems 1. The frequency of two alleles in a gene pool is 0.19 (A) and 0.81(a). Assume that the population is in Hardy-Weinberg equilibrium. (a) Calculate the percentage of

More information

Commonly Used STR Markers

Commonly Used STR Markers Commonly Used STR Markers Repeats Satellites 100 to 1000 bases repeated Minisatellites VNTR variable number tandem repeat 10 to 100 bases repeated Microsatellites STR short tandem repeat 2 to 6 bases repeated

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

BioBoot Camp Genetics

BioBoot Camp Genetics BioBoot Camp Genetics BIO.B.1.2.1 Describe how the process of DNA replication results in the transmission and/or conservation of genetic information DNA Replication is the process of DNA being copied before

More information

Genomic instability in cancers and cancer predispositions. Popova Tatiana Inserm U830 Institut Curie
