Methodology for Copy Number Variant Detection from High Throughput DNA Exome Sequencing and Application to the Genetic Mapping of Rare Genetic Disorders


Methodology for Copy Number Variant Detection from High Throughput DNA Exome Sequencing and Application to the Genetic Mapping of Rare Genetic Disorders
Case Presentation 3
Word Count
Michael Epstein
CoMPLEX, University College London
Supervisors: Dr Vincent Plagnol, UCL Genetics Institute; Prof. Nick Wood, UCL Institute of Neurology
MRes Modelling Biological Complexity,

Contents
1 Abstract
2 Background
  2.1 Genomic Variation
  2.2 Copy Number Variation
  2.3 Detecting Copy Number Variation
      Comparative Genomic Hybridisation and SNP detection
      CNV Discovery Strategies using Next Generation Sequencing
  2.4 Focus of the Case Presentation
3 Materials and Methods
  3.1 Generating Read Depths with ReadDepthMapper
  3.2 Data Generation
      Exome Sequencing
      Available Exome Sequences
      Program Execution
  3.3 Data Analysis
4 Results
5 Discussion
Bibliography

1 Abstract

Large structural variations such as Copy Number Variations (CNVs) are pervasive in the human genome and are thought to have a large influence on a variety of Mendelian and somatic genetic disorders. This Case Presentation reviews how increasing numbers of CNVs in the human genome are being detected, particularly within the context of emergent Next Generation Sequencing (NGS) technology. The project describes the development of ReadDepthMapper, a C++ command line tool that generates read depths from BAM formatted NGS data. The tool is then applied to 263 exome sequences available at the UCL Genetics Institute with the aim of uncovering evidence for large structural deletions. The results revealed a potential monosomy in chromosome 7 in one individual, which suggests a biological explanation for this individual's rare genetic disorder.

2 Background

2.1 Genomic Variation

Genomic variation can be defined as variation in genotype within or between species. It is the key driver of varying phenotypes between individuals. Discovering and understanding the various types of genomic variation, particularly in reference to the human genome, is therefore key to unlocking the reasons behind phenotypic differences between individuals. It also helps to elucidate the role of genetic variation in human disease. This is particularly important in complex diseases, where individual genetic factors exhibit limited penetrance or a number of genomic factors combine to contribute to the disease.

Exploring genomic variation typically involves classifying variants based on their size. For example, large differences between individuals, or even between genomes in different cells of the same individual, can be detected visually as chromosomal abnormalities at the microscopic level. At the smallest end of the variant spectrum, differences between individual genomes have been uncovered at the base level. Variations as small as a difference in the nucleotide at a given position between genomic sequences, known as Single Nucleotide Polymorphisms (SNPs), are a well established source of variation. Also at the base scale, small insertions and deletions of a small number of

nucleotides, known as indels, also contribute to the polymorphism of genotypes.

Analysing the relative importance of different classes of structural variation can be distilled into two issues. The first is estimating, for a particular class of variation, how much of the overall variation observed in a genome it accounts for. The second is establishing what phenotypic impact that class of variation has. A natural follow-on is establishing the role that different types of variant play in causing human disease.

Until recently, much of the important phenotypic variation seen in the human genome was thought to be the result of small base level mutations or indels [1]. For example, recent developments in large scale sequencing and SNP discovery have allowed Genome Wide Association (GWA) studies to identify SNPs which have a statistically significant association with phenotypic traits such as height [2] and with complex diseases [3, 4]. However, the total level of phenotypic variation explained by these statistically significant loci falls well short of prior estimates of genetic heritability for complex diseases and phenotypes. This has led to the search for the "missing heritability" and for other explanations that account for a greater proportion of the observed heritability of these traits.

2.2 Copy Number Variation

Additional attention has recently been focused on a different class of genetic variant. These structural variants are called Copy Number Variations (CNVs) and are defined as polymorphisms greater than a kilobase in size which are present at a variable number of copies when compared to a reference genome [5]. CNVs encompass insertions, deletions and inversions of stretches of DNA as well as more complex multisite variants. Such variation covers around 12% of the human genome [5], encompassing a greater proportion

of nucleotides in the human genome than SNPs. In addition to covering a wide area of the typical human genome, CNV regions (CNVRs) are also thought to account for a sizeable variety of phenotypes. CNVRs span almost 1,500 genes [5] and are likely to have a significant impact on disease, as initial estimates suggested that 14.5% of genes registered in Online Mendelian Inheritance in Man (OMIM) were subject to CNV. Gene ontology analysis suggests that large portions of the exome that are under copy number variation tend to code for proteins involved in the immune response and inflammation. For example, a low copy number of the gene CCL3L1, which encodes an HIV-1-suppressive chemokine, is associated with an accelerated rate of HIV progression [4]. Causal CNVs have been well established for rare neurological diseases such as Williams-Beuren syndrome, and CNVs have also been associated with spectrum disorders such as autism and with other psychiatric illnesses such as schizophrenia [4].

In addition to germline CNV, somatic and inherited variations are recognised as important factors in the development of cancer genomes, such as structural chromosomal rearrangements and the development of fusion transcripts which dysregulate the expression and activity of genes that control the cell cycle. For example, [6] establish that CNV fusion transcripts such as CHD7-PVT1 are likely to have occurred early in tumorigenesis in small cell lung cancer, on the basis of their amplified copy number. Common somatic CNVs have also been found in other solid tumours such as prostate and colorectal cancers [4].

2.3 Detecting Copy Number Variation

Comparative Genomic Hybridisation and SNP detection

Early efforts to discover and categorise copy number variation relied on a technique called Comparative Genomic Hybridisation (CGH), which tests the relative frequencies of DNA regions from a test and a reference sample on a hybridisation array. The colour of the fluorescence on the array indicates gains or losses in copy number against the reference. In one such study [7], a targeted microarray containing 1,986 nonredundant BACs was constructed to encompass a total of 130 recombination hotspots, as defined by the presence of common intra-chromosomal duplications in the genome. Hybridising 47 lymphoblastoid cell lines revealed 119 regions of significant copy number polymorphism, of which 73 were previously unreported at the time. This suggested an important role for segmental duplications in defining rearrangement hotspots, highlighting genomic regions likely to contribute to CNV polymorphism.

Another early method of measuring copy number variation uses SNP genotypes, for example from the extensive and high quality collection of SNPs generated by the International HapMap Project [8, 9]. In suspected CNV regions, algorithms can be used to detect signatures from a contiguous run of SNPs which could suggest potential deletions. For example, [10] analysed transmission genotypes from parent-child trios from the International HapMap Project. The authors examined SNP calls in offspring that appear to be incompatible with Mendelian inheritance. This arises, for example, where a maternal deletion has been inherited but the SNP genotyping method has miscalled the offspring as homozygous for the remaining paternal allele. Using this technique, areas of CNV deletion within the human genome were revealed.

There are drawbacks to both CGH and SNP approaches. For example, with CGH, the

size of the putative CNV regions that can be examined is limited by the insert size, which is the length of the DNA sequence used to make up the probe. Also, the coverage of regions is physically limited by the density of the array, and CGH cannot detect inversions, only changes in the copy number of a probe sequence. With SNP analyses, the detectable size and breakpoint resolution of CNVs vary with the available density of SNPs in the regions being studied.

CNV Discovery Strategies using Next Generation Sequencing

Both CGH and SNP approaches to detecting CNVs predate the development of Next Generation Sequencing (NGS). High throughput technologies allow high genomic coverage of a sample with paired-end reads. Paired reads are generated by fragmenting genomic DNA samples into pieces of a predefined length, known as the insert size. Both the forward and reverse template strands are sequenced from opposite ends. The resulting set of sequences forms a library of paired reads a fixed distance apart, which aids alignment to a reference genome. These paired reads can be used with various techniques to detect CNVs between the sample and a reference.

Paired reads were originally produced by traditional Sanger sequencing, but the advent of NGS technologies such as Illumina, ABI SOLiD and Roche 454 allows faster generation of paired end reads at a greater depth of coverage and at a lower sequencing cost per base. The read lengths of these technologies, however, are quite short by Sanger standards, typically ranging from 35bp to 400bp. This makes them difficult to arrange into contigs and scaffolds, and direct assembly of a sample genome from its millions of paired reads therefore remains extremely difficult.

(a) Normal read: the distance between the two reads is the same in the reference and the sample.
(b) Deletion: the distance between the reads as mapped to the genome is greater than the insert size of the sample read. This indicates that some content which is present in the reference has been deleted from the sample.
(c) Insertion: the distance between the reads as mapped to the genome is smaller than the insert size of the sample. This indicates that some sequence has been inserted into the sample.
(d) Inversion: the order and insert size of two paired reads are the same. However, one read of each pair changes orientation, indicating that some of the sequence within the two outer reads has become inverted in the sample relative to the reference.
(e) Linked signature: two adjacent paired reads map to distal segments of the genome. These linked signatures can be used to discover linked insertions and to identify the content which has been inserted.
(f) Everted duplication: two interlinked paired reads map in the same orientation to the reference but in a different order; this signifies a tandem duplication of DNA sequence in the sample.

Figure 2.1: Paired Read Mapping signatures (Reproduced from [11])

Instead, detecting genetic variation requires mapping paired reads from a sample to a reference genome, and CNVs are investigated using Paired-End Mapping or Depth of Coverage strategies, both explained below.

Paired-End Mapping (PEM)

The limited insert size of the sequenced pairs precludes direct genome assembly and direct comparison with other completed genomes. Instead, the reads are mapped to a reference assembly and then examined for common mapping signatures which suggest insertion, deletion, inversion or other more complicated CNV events. These signatures are illustrated and described in Figure 2.1.

Additional information is given by split or unmapped reads. For example, a split read, where two fragments of a read map to different regions of the genome, indicates a deletion and also its breakpoint. Similarly, a truncated mapping, where only a fraction of the whole read maps to the genome, reveals the sequence and size of an inserted element. An unmapped read, disregarding quality control issues, can indicate the insertion of novel genomic sequence, as it cannot be mapped to the reference.

For practical paired read mapping analysis, methods are required to detect the signatures outlined in Figure 2.1. A single occurrence of a signature at a given genomic location is unlikely to represent a true structural variant because of the imperfect nature of base calling, the potential for chimeric reads and incorrect alignment to the reference. Also, while the insert size of a library can be set, it carries some variability. Hence small indels could be wrongly inferred from variation in insert size if individual read pairs are taken at face value.
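As an illustration of the deletion and insertion signatures in Figure 2.1 (b) and (c), the following Python sketch labels a read pair by comparing its mapped span with the library insert size. It is not taken from any of the cited studies: the ReadPair record, the classify_pair function and the tolerance of three standard deviations are assumptions made for this example, and read orientation (which would be needed to detect inversions) is ignored.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one aligned read pair."""
    chrom1: str
    start1: int   # leftmost mapped position of the first read
    chrom2: str
    end2: int     # rightmost mapped position of the second read


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a pair by comparing its mapped span with the insert size.

    A span well above the insert size hints at a deletion in the sample;
    a span well below it hints at an insertion. The n_sd tolerance allows
    for the natural variability of the library insert size.
    """
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    span = pair.end2 - pair.start1
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


pairs = [
    ReadPair("chr7", 1000, "chr7", 1310),  # span 310, close to the insert size
    ReadPair("chr7", 5000, "chr7", 6200),  # span 1200, much larger
    ReadPair("chr7", 9000, "chr7", 9120),  # span 120, much smaller
]
labels = [classify_pair(p, mean_insert=300, sd_insert=30) for p in pairs]
print(labels)  # ['concordant', 'deletion', 'insertion']
```

A single label of this kind is weak evidence for the reasons given above, which is why practical methods require several supporting pairs at the same location before calling a variant.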

Notwithstanding these issues, a pioneering study by [12] used PEM techniques to discover CNVs using 3kb insert reads sequenced with NGS technology. Their approach used sequences from two individuals (one female European and one female African) to identify similar and divergent structural variants greater than 3kb in size. Their analysis revealed that the number of structural variants was greater than previously thought, and that many of them could potentially affect gene function. This was a result of the smaller paired read insert size, which enabled the study to detect smaller variants and provide greater breakpoint precision than previous studies which utilised CGH.

In an alternative use of paired end short reads, [13] developed Pindel, an algorithm that uses pattern growth to map the breakpoints of large deletions and medium sized insertions. The pattern growth process is used to map the fragments of the second read of a pair where the first read has been uniquely mapped to the genome. Deletions are detected from the unmapped second read as the reference sequence located between the portion mapped from its 3′ end and the portion mapped from its 5′ end. For insertions, the inserted sequence is the segment of the unmapped read, between its 5′ and 3′ pattern fragments, which cannot be mapped to the genome.

Both [12, 13] use a clustering strategy which groups paired read mappings from discordant pairs (where only one end of the read is mapped to the reference genome) to build up a catalogue of evidence to support a particular type of structural variant at a given genomic location. The minimum number of read pairs required to form a reliable cluster and the distance at which a read pair is considered discordant are both key parameters which determine the sensitivity and specificity of CNV clusters.
Sequencing coverage depth also affects these parameters: greater coverage depth means that fewer mate pairs are required for a given level of specificity for a variant call, and shortens the distance beyond which a paired read can be considered discordant, as greater coverage

depth increases the signal to noise ratio obtained from the sample. The continuing development of sophisticated clustering approaches will be key to detecting CNVs with increased breakpoint precision and confidence.

Depth of Coverage (DOC)

One of the key advantages of NGS technology is the vastly improved depth of coverage. This provides multiple samplings of a given genomic segment, which increases the signal to noise ratio of the sample sequencing. Furthermore, if it can be assumed that every base in the genome is equally likely to be sequenced, then the number of reads mapping to a given stretch of DNA can be assumed to follow a Poisson distribution whose mean is proportional to the number of times this region appears in the sample genome. Therefore, a duplicated sequence in a sample would be revealed by a higher than expected number of reads mapping to the reference sequence. Similarly, a sequence which has been deleted from the sample would have fewer reads mapping to the reference and hence exhibit a lower read depth.

Depth of coverage measures are good at detecting large CNV events, assuming a good level of genomic coverage. This is because significant differences in read depth are less likely to occur by chance for large insertions and deletions than for smaller CNVs. However, DOC cannot detect where duplications have been inserted into the genome, only that the duplications exist. Also, novel inserted sequences cannot be detected as they, by definition, will not map to the reference genome.

Studies which have used depth of coverage to investigate CNVs typically segment read depth measurements mapped to the reference genome into windows. Each window aims to have a uniform read depth within it, while the depth of a given window

contrasts sharply with adjacent windows. Windowing provides the length of sequence with a consistent read depth and also suggests the breakpoints of the window. Based on its read depth, the window can indicate a gain, a loss or no CNV event.

Studies such as [6] used reads from their sample which mapped correctly to the reference in terms of insert size and orientation to build a picture of copy number changes across the genome. They subsequently adapted a circular binary segmentation algorithm to generate statistical predictions of both the raw copy number and the breakpoints of copy number variation. This method helped unravel the structure of complex amplified elements of the cancer genomes of two individuals, leading to putative suggestions of the evolutionary timeline of the cancer genome.

The current challenge of read depth analysis is determining robust methods for segmenting depth of coverage into windows which delineate different copy number events at an accepted significance level. Recent developments to address this problem include [14], which pioneered a technique called Event-Wise Testing to merge read depths across 100bp windows into contiguous larger regions which have statistically increased or decreased read depth. The results suggest that this read depth technique is able to detect CNV signatures which PEM based approaches find difficult to detect, such as segmental duplications. [15] developed a technique called CNV-Seq, conceptually derived from CGH, which uses NGS data to compare two genomic samples to a reference genome. [16] presented SegSeq, an algorithm to segment regions of equal copy number from NGS data. SegSeq merges candidate 100kb windows to generate variable window sizes, setting a false discovery rate for 10 genome wide false positive segments.
The technique was used to detect copy number variations in tumour cell lines on a par with existing microarray technologies but with greater breakpoint precision, typically to within 1kb.
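The Poisson reasoning behind depth of coverage analysis can be made concrete with a short Python sketch. It does not reproduce Event-Wise Testing, CNV-Seq or SegSeq: it simply counts reads in fixed 100bp windows and flags windows whose count is improbably low under a Poisson model whose mean is the average window count. The function names, the window size and the significance threshold alpha are all assumptions for this example, and the direct summation used for the Poisson tail is only suitable for modest read depths.

```python
import math


def window_counts(read_positions, region_length, window=100):
    """Count the reads whose position falls in each fixed-width window."""
    n_windows = math.ceil(region_length / window)
    counts = [0] * n_windows
    for pos in read_positions:
        if 0 <= pos < region_length:
            counts[pos // window] += 1
    return counts


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def flag_low_windows(counts, alpha=1e-3):
    """Indices of windows with significantly fewer reads than expected."""
    lam = sum(counts) / len(counts)  # expected reads per window
    return [i for i, c in enumerate(counts) if poisson_cdf(c, lam) < alpha]


# Toy data: five windows of 100bp, with window 2 depleted of reads.
positions = []
for w in range(5):
    n_reads = 2 if w == 2 else 20
    positions += [w * 100 + (j * 5) % 100 for j in range(n_reads)]

counts = window_counts(positions, region_length=500)
print(counts)                    # [20, 20, 2, 20, 20]
print(flag_low_windows(counts))  # [2]
```

Testing each window independently, as here, is the simplest possible choice; the segmentation methods cited above instead merge adjacent windows so that larger events gain statistical support.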

2.4 Focus of the Case Presentation

The aim of this case presentation is to use CNV detection techniques to discover cases of monosomy in 263 sample exome sequences available at the UCL Genetics Institute (UGI). Monosomy is a class of aneuploidy, or abnormal chromosome number, in which only one chromosome of a pair is present in a diploid organism. Partial monosomy is where part of a chromosome is deleted, leaving only one copy of that stretch of the genome, on the other chromosome. In humans, this can lead to many disorders such as Cri du chat, a rare genetic disorder due to the deletion of part of chromosome 5, and myelodysplastic syndrome, a disease of the blood and bone marrow linked with partial monosomy of chromosome 7.

Although monosomy and partial monosomy are large structural events, this project aims to detect them using CNV approaches. However, the use of exome sequencing precludes the use of paired read mapping strategies to discover copy number variants. This is because only a small and interspersed portion of the genome is being sequenced. As a result, the breakpoints of deleted and inserted exons are difficult to locate using split read techniques, as split reads are unlikely to fall on these reduced boundaries. A read depth approach is therefore a better method for investigating possible partial monosomy in exome sequencing samples, as it is a good technique for detecting large deletion events.

The specific focus of this case presentation is twofold. The first aim is to develop a computational tool to efficiently parse sequenced paired reads and generate a read depth profile across defined genomic regions. The second is to use the read depths generated by the tool to detect monosomy signatures across the 263 exome sequences held at the UGI using a simple statistical procedure.

3 Materials and Methods

3.1 Generating Read Depths with ReadDepthMapper

A tool called ReadDepthMapper was developed in C++ in order to generate read depths for sample sequences. Initially, the program parses a file containing a set of regions, typically a set of exons as defined by the CCDS, or a set of chromosomes. It then merges small regions into larger regions if two regions are very close to each other (less than 50bp apart by default). This is necessary because noise in the capture and sequencing process limits the resolution that the read depth technique can provide. The names of the regions which are concatenated are noted so that the user remains informed as to which regions have been merged.

The program then accepts a list of paired read files, which are stored in the compressed binary version of the Sequence Alignment/Map format (BAM) [17]. Using the Samtools C API as a third party library within ReadDepthMapper, the files are parsed in turn. Each read pair is examined to check that both ends map to the same chromosome and that the distance between the reads is not significantly greater than the insert size of the reads. If the read pair meets these criteria, the midpoint of the paired read is

considered to be the location of the read. If the location of the read lies within a region specified in the merged region file, a count is added to that region. After all paired read files have been parsed, the data structure holding the read depths is printed out to a directory location specified by the user. The user can also choose to print out the names of the regions (for example, exon names), which is useful if small exons have been merged, for the reasons explained above.

Development of the program utilised the unit test framework UnitTest++ in order to test the key algorithms of the program, such as the sorting and ordering of regions and the printing of the read depth analysis. The unit tests can be run as part of the Makefile compilation of the program. The source code is held in an online svn repository, access to which is available from the author.

In order to enable the parallelisation of read depth generation, a scripting tool was developed to concatenate ReadDepthMapper output files. This allows a large number of input files with a common region file to be broken up and run concurrently, with the resulting files being merged once all processes have finished.

3.2 Data Generation

Exome Sequencing

Exome capture is a strategy to sequence the coding elements of a target genome in order to uncover genes or gene variants associated with genetic disorders. It involves the capture of the exons in the genome, which constitute about 1% of the human genome

but are estimated to harbour 85% of the mutations with large effects on disease related traits [18]. Exome sequencing avoids the cost and complexity of whole genome sequencing by deep sequencing just a small portion of the human genome while retaining the high coverage depth available through NGS. It has recently been applied to discover the candidate gene for Miller syndrome, a rare Mendelian disorder of previously unknown cause [19].

The exomes used in this project were captured by hybridisation. The Agilent SureSelect method of genome capture is shown in Figure 3.1 below. Agilent SureSelect protocols use a biotinylated RNA library to capture the sequences of interest from a genomic sample. After the targeted genomic sample has hybridised to the RNA library, streptavidin coated magnetic beads attach to the biotin to allow the captured sequences to be preferentially removed. This leaves the unbound genomic sample, which is not of interest, to be discarded. The captured elements can then be cleaned, amplified and sequenced.

Figure 3.1: Agilent SureSelect Exome Capture (taken from [20])

Available Exome Sequences

There were 263 exome sequences available for analysis at the UGI, all obtained from

individuals with diseases suspected to be caused by rare genetic variants. These samples can be sub-classed into different study cohorts as shown in Table 3.1 below.

Source | Samples | Capture Technology | Primary Capture Reason
UCL Institute of Neurology | 223 | Agilent 50MB | Investigation of early onset Alzheimer's and early dementia
UCL Institute of Ophthalmology | 5 | Agilent 38MB | Investigation of early onset blindness
QMUL School of Medicine and Dentistry | 11 | Agilent 38MB | Investigation of rare forms of dermatological disease
Cambridge Infectious Disease | 12 | Agilent 38MB | Investigation of tuberculosis resistance
QMUL Blizard Institute of Cell and Molecular Science | 12 | Agilent 38MB | Investigation into rare forms of bone marrow failure

Table 3.1: Samples used for Project Monosomy Screening

As can be seen in Table 3.1, the exome capture technologies differed between subgroups. The exomes from the Institute of Neurology were captured with 50MB Agilent SureSelect technology whereas the other samples were captured with 38MB Agilent SureSelect technology. The main difference between the two captures is that the 50MB capture includes additional validated exomic content from the GENCODE project. It also includes all exons annotated in the consensus CDS database, as well as small non-coding RNAs from miRBase and Rfam.

Program Execution

The 263 samples were examined for evidence of monosomy. A region file was generated in which each chromosome was defined as one region, so that read depth counts for each of the 263 samples were generated on a per chromosome basis. Given the large number of samples, the analysis was parallelised by dividing the samples into groups of at most ten and submitting 27 jobs to the cluster at the UGI. The results files were concatenated using the script utility described in Section 3.1 to produce a read depth for each sample and chromosome in a single file.
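The region merging and midpoint counting described in Section 3.1, together with the batching of samples used here, can be sketched in Python as follows. This is not the ReadDepthMapper source, which is written in C++ against the Samtools C API: the function names, the tuple layouts, the '+' separator for merged names and the max_span cut-off of 1000bp are all hypothetical choices made for illustration.

```python
import bisect


def merge_regions(regions, max_gap=50):
    """Merge regions on one chromosome that lie less than max_gap apart.

    regions: list of (start, end, name). Names of merged regions are
    joined with '+' so it stays visible which regions were combined.
    """
    merged = []
    for start, end, name in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            p_start, p_end, p_name = merged[-1]
            merged[-1] = (p_start, max(p_end, end), p_name + "+" + name)
        else:
            merged.append((start, end, name))
    return merged


def count_midpoints(merged, pairs, chrom, max_span=1000):
    """Count read pairs whose midpoint falls inside each merged region.

    pairs: list of (chrom1, pos1, chrom2, pos2). Pairs that map to
    different chromosomes or span more than max_span are skipped.
    """
    starts = [r[0] for r in merged]
    counts = {r[2]: 0 for r in merged}
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom or chrom2 != chrom:
            continue
        if abs(pos2 - pos1) > max_span:
            continue
        mid = (pos1 + pos2) // 2
        i = bisect.bisect_right(starts, mid) - 1  # last region starting <= mid
        if i >= 0 and mid <= merged[i][1]:
            counts[merged[i][2]] += 1
    return counts


def batches(items, size=10):
    """Split a list of samples into groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


regions = [(100, 200, "exon1"), (230, 300, "exon2"), (1000, 1100, "exon3")]
merged = merge_regions(regions)
pairs = [
    ("chr7", 120, "chr7", 320),    # midpoint 220, inside exon1+exon2
    ("chr7", 900, "chr7", 1200),   # midpoint 1050, inside exon3
    ("chr7", 150, "chr8", 400),    # ends on different chromosomes
    ("chr7", 100, "chr7", 5000),   # span far above the insert size
    ("chr7", 400, "chr7", 700),    # midpoint 550, outside every region
]
print(merged)  # [(100, 300, 'exon1+exon2'), (1000, 1100, 'exon3')]
print(count_midpoints(merged, pairs, "chr7"))
print(len(batches(list(range(263)))))  # 27
```

Splitting 263 samples into groups of at most ten gives 26 full groups and one group of three, which matches the 27 jobs reported above.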

3.3 Data Analysis

Data analysis was performed in R on the read depths generated as described in Section 3.2. A straightforward statistical analysis was performed which involved calculating a z-score for each chromosome in each sample. This z-score was used to highlight autosomes which had a statistically significant lack of reads compared to the expected number of reads for that chromosome. Such a z-score would suggest evidence of monosomy in that sample chromosome.

The z-scores were computed as follows. The total number of reads for each chromosome over all samples was generated. Then, the fraction of these reads that each sample contributes for a given chromosome was calculated. The mean of these fractions across the chromosomes of each sample was then computed. For a given sample, the expected number of reads and its variance for each chromosome were calculated using the binomial formulae for the expectation and variance, np and np(1 − p) respectively. Here, n represents the number of reads across all samples for that chromosome, p is the fraction of chromosomal reads expected to belong to the sample (its mean fraction across chromosomes) and hence 1 − p is the fraction of chromosomal reads mapping to other samples. The binomial distribution is appropriate for each sample chromosome, as a read in a given chromosome across all the samples either maps to a particular sample (success) or maps to that chromosome in another sample (failure). Given the large number of total reads, the number of reads for a given sample chromosome can be assumed to approximate a normal distribution, with the binomial mean and the square root of the binomial variance used as parameter estimates for µ and σ. The z-score for each chromosome is then calculated from the observed number of reads, the expected number of reads for that chromosome calculated from the sample

fraction mean and the estimated variance in the reads for that chromosome.
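The z-score procedure can be sketched in Python as follows. The original analysis was carried out in R and its code is not reproduced here, so this is one reading of the description rather than the author's script: the function name, the nested dictionary layout and the toy read counts are assumptions, and p is taken to be the sample's mean fraction of reads across chromosomes.

```python
import math


def chromosome_z_scores(reads):
    """Compute a z-score per sample and chromosome from read counts.

    reads[sample][chrom] is the number of reads for that sample mapping
    to that chromosome. Returns z[sample][chrom].
    """
    samples = list(reads)
    chroms = list(reads[samples[0]])
    # n: total reads per chromosome over all samples
    n = {c: sum(reads[s][c] for s in samples) for c in chroms}
    z = {}
    for s in samples:
        # fraction of each chromosome's reads contributed by this sample
        fractions = [reads[s][c] / n[c] for c in chroms]
        p = sum(fractions) / len(fractions)  # mean fraction for the sample
        z[s] = {}
        for c in chroms:
            expected = n[c] * p                   # binomial mean, np
            sd = math.sqrt(n[c] * p * (1.0 - p))  # sqrt of np(1 - p)
            z[s][c] = (reads[s][c] - expected) / sd
    return z


# Toy counts (invented): sample C has half the usual reads on chr7.
reads = {
    "A": {"chr1": 1000, "chr2": 1000, "chr7": 1000},
    "B": {"chr1": 1000, "chr2": 1000, "chr7": 1000},
    "C": {"chr1": 1000, "chr2": 1000, "chr7": 500},
}
z = chromosome_z_scores(reads)
lowest = min((z[s][c], s, c) for s in z for c in z[s])
print(lowest[1], lowest[2])  # C chr7
```

With only three samples the depleted chromosome distorts the totals for every sample, so the other scores are far from zero; with 263 samples a single affected individual would have much less influence on n and on the other samples' scores.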