Gyper: A graph-based HLA genotyper using aligned DNA sequences


Hannes Pétur Eggertsson
Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
University of Iceland
2015


GYPER: A GRAPH-BASED HLA GENOTYPER USING ALIGNED DNA SEQUENCES

Hannes Pétur Eggertsson

60 ECTS thesis submitted in partial fulfillment of a Magister Scientiarum degree in computational engineering

Advisors
Bjarni Vilhjálmur Halldórsson (decode genetics)
Páll Melsted (University of Iceland)

Faculty Representative
Daníel Fannar Guðbjartsson

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Reykjavik, September 2015

Gyper: A graph-based HLA genotyper using aligned DNA sequences
Gyper
60 ECTS thesis submitted in partial fulfillment of a M.Sc. degree in computational engineering

Copyright 2015 Hannes Pétur Eggertsson
All rights reserved

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Hjarðarhaga , Reykjavík, Reykjavik
Iceland
Telephone:

Bibliographic information:
Hannes Pétur Eggertsson, 2015, Gyper: A graph-based HLA genotyper using aligned DNA sequences, M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland.

Printing: Háskólaprent, Fálkagata 2, 107 Reykjavík
Reykjavik, Iceland, September 2015

Contents

List of Figures
List of Tables
Acknowledgments

1 Introduction
  Genetics
  The human genome
  Genotyping
  The HLA gene family
  Gyper

2 Background
  Next-generation sequencing
  Phred score
  Data formats
  FASTA format
  SAM and BAM format
  VCF
  The HLA reference alleles format
  Current DNA sequencing genotypers

3 Methods
  Preprocessing the data
  Fetching the HLA reference alleles
  Regions with relevant reads
  Multiple sequence alignment
  Constructing a reference partial order graph
  Graph implementation
  Extending the POG
  Aligning sequences to the POG
  Algorithm
  Backtracking
  Genotyping constraints
  Parameters
  Read clipping
  Minimum sequence length
  Mismatches
  Zygosity factor
  Parameter training
  Implementation

4 Results
  Preprocessing the data
  Coverage read depth
  Filtering the BAM files
  Bias introduction
  Training of parameters
  Quality threshold training
  Minimum sequence length training
  Zygosity factor training
  Verification
  decode's samples
  Genomes exome samples
  Genomes WGS samples
  Comparison with other DNA sequencing data genotypers
  Accuracy
  Time

5 Conclusions
  Summary
  Future work

A HLA genotype call results for 1000G exome samples
  A.1 HLA-A exome results
  A.2 HLA-B exome results
  A.3 HLA-C exome results

B HLA genotype call results for 1000G WGS samples
  B.1 HLA-A WGS results
  B.2 HLA-B WGS results
  B.3 HLA-C WGS results

References

List of Figures

1.1 The flow of information within an eukaryotic cell system. The coding sequence for the proteins are the red, green, and blue regions. Exons are spliced together to form RNA which is translated to protein.

1.2 Overview of Gyper's genotyping pipeline. An individual is sampled and sequenced. The sequenced reads are aligned to the human reference genome. The HLA allele references are fetched from an external database. A partial order graph is created which stores all the alleles in a single graph for each gene. Finally, we align the sequenced reads to the graph and genotype the individual.

2.1 A read pair. The two reads are read from one end of each fragment strand, in opposite directions.

2.2 Two chromosome strands. A is always bonded with T and C is bonded with G. The reverse complement of the sequence GATACCC is GGGTATC.

2.3 Example FASTA file entry for a sequence from the human reference genome. Here the sequence ID is chr6_ meaning the sequence is from chromosome 6 and is showing bases located at to . The sequence description is omitted. Following the header is the sequencing data split by 50 characters per line.

2.4 An example of a reference allele format.

3.1 Modified version of Gyper's pipeline (Figure 1.2). The scope of the three main steps is highlighted.

3.2 IMGT/HLA XML wrapper example output.

3.3 Flow chart for finding relevant positions of the human genome using one allele. Reads overlapping the exons are simulated and mapped to the human reference genome.

3.4 Example input and output of our MSA. Bases colored green and red are indels (insertions or deletions) and mismatches, respectively.

3.5 Create a partial order reference graph using three example exon sequences: GATA, -AT-, and CATA. Blue edges show edges we traversed through, green labels represent changed or new labels on edges. Red and yellow nodes represent new and old nodes, respectively. a) Two nodes are created, an initial node on level 0 and a final node on level Length(sequences[0]) + 1, which is here 5. b) The sequence GATA is added to the graph. c) The sequence -AT- is added to the graph. Note that no new nodes need to be created, only edges. We change the bit string for the edge going from A to T so it includes this sequence. d) The sequence CATA is added to the graph. The new C node will be on the same level as the G node.

3.6 Extended graph from figure 3.5. Here we have added three intron references to the graph. We use the previous initial node as a final node for the new extension. The red nodes are the new intron nodes. The intron sequences we have added are: TTA, GTA, and -TA. Note that the edges on the new nodes do not store a bit string like the other exon nodes.

3.7 Alignment of the sequence ACAT to the graph from figure 3.6. The numbers below each node denote their topological sort order and blue edges are the path of the alignment.

3.8 Alleles A and B are represented as sets. Each set has the reads that allele explains. The number of reads explained by either allele A or B is $S_{A,B} = S_{A \setminus B} + S_{A \cap B} + S_{B \setminus A}$.

3.9 Two examples of the distributions of reads (crosses) among two alleles, A and B. a) It is likely that the individual is homozygous even though $S_A = 16$ and $S_{A,B} = 17$. The read inside the $B \setminus A$ region is likely an error. b) Here however, $S_{A \setminus B}$ and $S_{B \setminus A}$ are relatively similar and thus we would rather expect that the individual is heterozygous.

4.1 Coverage plot for HLA-DQA1.

4.2 Coverage plot for HLA-DQB1.

4.3 Coverage plot for HLA-DRB1.

4.4 Coverage plot for HLA-A.

4.5 Coverage plot for HLA-B.

4.6 Coverage plot for HLA-C.

4.7 File sizes of BAM files used in our training set before and after filtering. The files are ordered in ascending order by their file size before filtering. The file sizes are in MiB and we use logarithmic scale.

4.8 The weighted average impute INFO score for the two threshold qualities: The mismatch quality threshold ρ (blue) and the clipping threshold τ (red). Note that the axes do not start at zero.

4.9 The weighted average impute INFO score for different minimum sequence lengths. Note that the axes do not start at zero. The results show that changing the minimum sequence length is insignificant.

4.10 The weighted average impute INFO score for different values of the zygosity factor β. Note that the axes do not start at zero.

List of Tables

1.1 The number of alleles for the six most important HLA genes known to the IMGT/HLA database.

The fraction of reads mapped to the optimal locations, suboptimal locations, and no locations of the human genome reference.
Checking for bias introduction in our filtered BAM files. Only the most common Icelandic alleles were used in this analysis. Larger numbers mean greater unwanted bias.
Number of individuals in each verification dataset.
Gyper's 2 digit genotype call accuracy compared to decode's verification data.
Gyper's 4 digit genotype call accuracy compared to decode's verification data.
Gyper's 2 digit impute accuracy compared to decode's verification data.
Gyper's 4 digit impute accuracy compared to decode's verification data.
Gyper's 2 digit exome accuracy compared to Erlich et al. [2011].
Gyper's 4 digit exome accuracy compared to Erlich et al. [2011].
Gyper's 2 digit low coverage WGS accuracy compared to Erlich et al. [2011].
Gyper's 4 digit low coverage WGS accuracy compared to Erlich et al. [2011].
OptiType's 4 digit call accuracy on 1000 Genomes exon dataset compared to Erlich et al. [2011].
OptiType's 4 digit WGS genotype calling accuracy compared to Erlich et al. [2011].

A1 Gyper's called HLA-A genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011].
A2 Gyper's called HLA-B genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011].
A3 Gyper's called HLA-C genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011].
B1 Gyper's called HLA-A genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011].
B2 Gyper's called HLA-B genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011].
B3 Gyper's called HLA-C genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011].

Acknowledgments

First, I would like to thank my father, Eggert Guðjónsson, and mother, Bryndís Helga Hannesdóttir. Not only have they passed down to me my splendid genes, but they have also shown me love and support all my life, for which I am deeply grateful. I am truly privileged to have them as my parents. I am also greatly thankful to my girlfriend and soulmate, Bryndís Tryggvadóttir, for her support; even in the toughest of times you have helped me believe in myself.

I also want to thank the University of Iceland and its teachers for an amazing job of guiding me. It seems absurd to get such a world-class level of education without needing to pile up student loans. The teachers at the university have really given me the drive for excellence and helped me achieve my dreams.

During the work on this thesis I have been completely astonished by how much trust and how many facilities decode and its helpful employees have given me. Thank you, everyone at decode who has given me this opportunity. A special thanks goes to my advisors, Bjarni V. Halldórsson and Páll Melsted, who have given me nothing but patience and constructive feedback. I hope to be able to work with you in the future.

Thank you, all!


Abstract

The major histocompatibility complex plays an important role in the immune system of thousands of species. The human leukocyte antigen (HLA) is the human version of the complex and is located on the short arm of chromosome 6. Identifying an individual's HLA genotype can give valuable information for medical applications. Several techniques already exist to HLA type an individual accurately; however, they remain expensive and time consuming. Recently, though, there has been a breakthrough in developing methods which use next-generation sequencing (NGS) data for this purpose, due to its high availability. Using these methods we can genotype individuals purely by computation on sequencing data. However, these NGS methods remain somewhat time consuming, often requiring hours or days.

We introduce Gyper, new open-source software which genotypes individuals for HLA using NGS data in a matter of seconds. Gyper's speed is obtained by selecting a small subset of reads to consider and aligning them to the references in a partial order graph. Using Gyper we genotyped about 4,000 Icelanders in decode's dataset for the six major HLA genes. The resulting data was imputed for more than Icelanders. Comparing those results with our verification data showed over 96% accuracy. Additionally, we genotyped individuals from the 1000 Genomes project for the HLA class I genes, and Gyper's accuracy was always equal to or higher than that of other HLA genotypers. These results show that Gyper can provide impressively fast, yet reliable, genotyping results for a wide range of applications.


Ágrip

The major histocompatibility complex is a group of genes that plays a key role in the immune system of thousands of organisms. In humans it is called the human leukocyte antigen (HLA) and is located on the short arm of chromosome 6. Knowing which alleles an individual carries is important for the field of medicine. Various methods exist that can determine an individual's HLA alleles with high accuracy, but these methods are expensive and time consuming. Recently many methods have emerged that use DNA sequencing data for the same purpose, owing to the wide availability of such data. The main advantage of such methods is that they require only a computer and DNA sequencing data. Their drawback is that they are time consuming, often needing hours or even days to process the data.

We introduce Gyper, new open-source software that finds an individual's HLA alleles from sequencing data in only a few tens of seconds. Gyper's speed is obtained by examining only a small, but important, part of the sequencing data and comparing it with the alleles in an efficient way. We used Gyper to find the alleles of about Icelanders at six important HLA genes. Using a relationship network we inferred HLA alleles for about Icelanders. Verification showed that Gyper identified the correct allele in over 96% of cases. To compare our results with other programs, we also found the alleles of individuals from the 1000 Genomes project. In all cases Gyper's accuracy was the same or higher. Our results show that Gyper manages to be very fast, yet accurate, in determining individuals' alleles for the HLA region, and that it has many potential applications.


1 Introduction

1.1 Genetics

The flow of genetic information within a biological system was first explained in 1956 by Francis Crick [Crick, 1956]. His explanation is called the central dogma of molecular biology [Crick, 1970]. It explains how deoxyribonucleic acid (DNA) stores the genetic data of every organism, how DNA is transcribed to ribonucleic acid (RNA), and how RNA is translated to protein.

Comparing the genetic structure of any two individuals of the same species reveals a few differences. These differences can be very small, such as a polymorphism of only a single nucleotide (SNP), or much larger, such as the duplication of a whole chromosome (e.g. Down's syndrome). We can identify these differences and associate them with observed properties, such as behavior, morphology, and diseases. We call the observed properties phenotypes.

In 1953, James Watson and Francis Crick discovered the structure of DNA. It is shaped like a double helix [Watson and Crick, 1953], where the two helices run in opposite directions. On each helix the genetic information is stored in four different chemical bases: adenine (A), cytosine (C), guanine (G), and thymine (T). The bases are interconnected with hydrogen bonds, where adenine bonds with thymine and guanine bonds with cytosine. Each pair of connected bases forms a base pair.

Genes are made up of hundreds to tens of thousands of base pairs acting as instructions to make biological molecules called proteins. Genes can acquire mutations in their DNA sequence, which results in different variants of the same gene, called alleles. Different alleles can encode the same protein or different versions of the protein, or they might even be unable to encode the protein at all. The coding sequence of a gene is the region which is translated to protein.

Humans have eukaryotic cells, which are cells with a nucleus containing the DNA. In eukaryotic cells the coding sequence is not continuous; it is split among several parts called exons. The exons are separated by introns. On either end of a gene there is an untranslated region which marks the beginning and end of the RNA reading frame. A polymerase transcribes the DNA into RNA, and the exon sequences are spliced together. There are also untranslated regions on each end of the RNA. Finally, protein is created by translating the RNA. Figure 1.1 illustrates the process for a eukaryotic gene with three exons.

Figure 1.1: The flow of information within an eukaryotic cell system. The coding sequence for the proteins are the red, green, and blue regions. Exons are spliced together to form RNA which is translated to protein.

1.2 The human genome

The human genome contains more than three billion base pairs, which are stored on 22 pairs of chromosomes plus a single pair of sex chromosomes, hence 46 chromosomes in total. In every pair of chromosomes, one came from the father and one from the mother. Humans are estimated to have 20,000-25,000 protein-encoding genes [Consortium, 2004]. Genes that have a similar function are said to be in the same gene family.

An approximate assembly of the human genome has been created and is called the human reference genome. One such assembly is released by the Genome Reference Consortium (GRC). Its most recent release is called GRCh38.p4 (build 38, patch 4). Genome browsers have been created for viewing reference genomes. One is the UCSC Genome Browser. They provide a good way to view and search the genome together with various useful information, such as the location of genes and genetic variations found in the reference [Kent et al., 2002].

One of the main use cases of the human reference genome is local sequence alignment. In such alignments DNA sequences, which are sampled from an individual, are aligned to the human reference genome. If a sequence can be aligned to the reference it is said to be mapped to that location. Otherwise, if the sequence does not align to the reference anywhere, the sequence is unmapped. Aligning sequences this way is computationally much easier than doing a whole-genome assembly of the individual. A commonly used local alignment tool is BWA [Li and Durbin, 2009].

1.3 Genotyping

A genotype is the combination of an individual's two alleles. The act of estimating the genotype is called genotyping. In the field of bioinformatics, one major topic is genome-wide association studies for humans. In these studies human genotypes are associated with diseases and other phenotypes. Much of decode's current work is performing association studies on the Icelandic population.

A wide range of genotyping methods exist, each with their pros and cons. Many methods are expensive, time consuming, and require advanced laboratory instruments. A cheaper option is to use DNA sequencing data. GATK [McKenna et al., 2010] is an example of a generic DNA sequencing data genotyper. In summary, GATK extracts aligned sequences that were mapped to a certain location of the genome to predict the genotype of an individual. For most genes GATK's predictions are accurate. However, highly variable genes often cause problems because the human reference genome is unable to represent them well. Examples of such genes are the ones in the human leukocyte antigen (HLA) gene family.
1.4 The HLA gene family

The HLA gene family contains over 200 known genes in three different classes: I, II, and III. It is the human version of the major histocompatibility complex (MHC) and is located on the short arm of chromosome 6. In class I there are three main genes: HLA-A, HLA-B, and HLA-C. The proteins produced from these genes are on the surface of most cells. These proteins are bound to protein chains called peptides that have been exported from within the cell. The proteins from genes in HLA class I display these peptides to the immune system, and if the immune system recognizes the peptides as foreign it can react, for example by triggering the infected cell to self-destruct. In HLA class II there are six main genes: HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRA, and HLA-DRB1.

The HLA allele sequences are available in the IMGT/HLA database [Robinson et al., 2015]. As of release , August 2015, there are more than 13,000 known HLA alleles and the number is increasing fast. With such a high number of alleles, genotyping is a very tough challenge.

Certain HLA genotypes have previously been associated with diseases, such as type I diabetes [Kikuoka et al., 2001] and celiac disease [Kaukinen et al., 2002], which are both autoimmune diseases. In another recent study some HLA class II alleles have been associated with multiple sclerosis (MS) [Moutsianas et al., 2015]. Furthermore, many medical operations depend heavily on matching HLA genotypes between a patient and their donor, such as bone marrow transplantation [Hansen et al., 1980] and umbilical cord blood stem cell transplantation [Gluckman et al., 1999]. The best outcome of such transplants is produced when the donor is a sibling who is HLA identical to the patient. Unfortunately, usually no such donor is available because there is only a 25% chance that two siblings receive the same alleles from their parents. In these cases a transplant from a well-matched unrelated donor is required, which most often has acceptable results [Beatty et al., 1985].

In recent years many new methods have been created that use sequencing data to genotype HLA. Such methods have reduced the cost of genotyping and require nothing but a computer and sequencing data. Currently one of the most promising HLA genotypers using sequencing data is OptiType [Szolek et al., 2014]. OptiType genotypes the three main class I genes using an integer linear programming (ILP) algorithm. Its results show good accuracy. However, this method is still rather time consuming, requiring hours or days to compute.

1.5 Gyper

Gyper is a novel open-source genotyper which uses sequencing data to genotype individuals. The motivation behind Gyper is to create a genotyper which genotypes highly variable genes in an accurate and fast manner. It uses aligned DNA sequencing data. The name Gyper is an abbreviation of Graph genotyper.

In this initial release Gyper supports six HLA genes. They are the three main class I genes and three class II genes: HLA-DQA1, HLA-DQB1, and HLA-DRB1. Overall these six genes account for 12,534 alleles or 93.5% of all known HLA alleles (Table 1.1). This means that the number of possible genotypes is enormous. The HLA-B gene alone has almost eight million possible allele combinations.

The speed is achieved by storing the allele references in a partial order graph (POG) and aligning only relevant reads to it. A partial order graph is a directed acyclic graph made up of nodes (vertices) and edges. Each node stores a single DNA base while the edges contain information about which reference alleles traverse that edge. We make use of the fact that sequencing data is usually stored aligned and indexed, meaning we can quickly get the reads that have been mapped to certain locations of the genome. Considering only a small subset of reads vastly improves the speed of the genotyping, but carries a risk of not taking all relevant reads into account and therefore missing potentially valuable information.

Table 1.1: The number of alleles for the six most important HLA genes known to the IMGT/HLA database.
Gene: HLA-A HLA-B HLA-C HLA-DQA1 HLA-DQB1 HLA-DRB1
#Alleles: 3,192 3,977 2, ,764

We estimate the genotype of an individual by counting how many reads can be aligned to each allele reference. Figure 1.2 shows an overview of Gyper's pipeline. Two alleles need to be chosen, because we each have two copies of every chromosome. If an individual has the same allele on both chromosomes, the individual is said to be homozygous for that gene; otherwise the individual is heterozygous. One key challenge we faced was determining this zygosity.

Our method takes into account that reads from sequencing machines will often contain errors. For each base these machines estimate the likelihood of an error. We crop reads with low quality ends to reduce the number of errors in the data. Also, we allow mismatches in reads if the mismatching base is very likely to contain an error.

Gyper has several parameters that were optimized using a training dataset. The training dataset was gathered from about 4,000 Icelandic people. After training, we verified Gyper using both a verification dataset from decode and a widely used verification dataset for individuals in the 1000 Genomes project.
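
The claim in section 1.5 that HLA-B has almost eight million allele combinations can be checked with a short calculation. The sketch below assumes that a genotype is an unordered pair of alleles and that homozygous pairs count; the function name is illustrative and not part of Gyper.

```python
def genotype_count(n_alleles: int) -> int:
    """Number of unordered allele pairs, homozygous pairs included.

    With n alleles there are n heterozygous-or-homozygous choices that
    reduce to n * (n + 1) / 2 distinct unordered pairs.
    """
    return n_alleles * (n_alleles + 1) // 2


# 3,977 is the HLA-B allele count given in Table 1.1.
hla_b_alleles = 3977
print(genotype_count(hla_b_alleles))  # 7910253
```

Under these assumptions HLA-B alone yields 7,910,253 candidate genotypes, which is why scoring every pair naively against every read is expensive.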

Figure 1.2: Overview of Gyper's genotyping pipeline. An individual is sampled and sequenced. The sequenced reads are aligned to the human reference genome. The HLA allele references are fetched from an external database. A partial order graph is created which stores all the alleles in a single graph for each gene. Finally, we align the sequenced reads to the graph and genotype the individual.
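
To make the partial order graph of section 1.5 more concrete, here is a toy sketch in Python. It is not Gyper's implementation: the `(level, base)` node keys, the `^` and `$` sentinel nodes, and the use of an integer as the per-edge bit string are all assumptions made for illustration. The input is a set of equal-length aligned sequences in which `-` marks a gap.

```python
from collections import defaultdict


def build_pog(aligned):
    """Build a toy partial order graph from equal-length aligned sequences.

    Nodes are (level, base) pairs, so each node stores a single base.
    A '-' is a gap and creates no node. Each edge maps to an integer bit
    mask in which bit i is set when sequence i traverses that edge.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    source, sink = (0, "^"), (length + 1, "$")  # hypothetical sentinels
    edges = defaultdict(int)
    for i, seq in enumerate(aligned):
        prev = source
        for level, base in enumerate(seq, start=1):
            if base == "-":
                continue  # gaps are skipped, which yields a longer edge
            node = (level, base)
            edges[(prev, node)] |= 1 << i
            prev = node
        edges[(prev, sink)] |= 1 << i
    return dict(edges)


pog = build_pog(["GATA", "-AT-", "CATA"])
for (u, v), mask in sorted(pog.items()):
    print(u, "->", v, format(mask, "03b"))
```

In this sketch the edge from the A node on level 2 to the T node on level 3 carries the mask `111`, because all three example sequences pass through it, while the edge from the initial node directly to that A node carries only the bit of the gapped sequence.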

2 Background

2.1 Next-generation sequencing

Over the last several decades, many different sequencing methods have been developed. These methods determine the order of the nucleotide bases of a small DNA fragment. One widely used method, including at decode, is called next-generation sequencing (NGS). The machinery used is produced by Illumina, which uses a sequencing-by-synthesis approach [Bentley et al., 2008]. Here we describe the process used by those machines.

First, a sample of DNA (e.g. from blood) is obtained from an individual and labeled. Then, the DNA is randomly sheared into fragments of various lengths. The average fragment length can be set to different values, but in decode's dataset it is typically around 500 bases and almost always smaller than 1000 bases. To each end of the fragments the four types of bases are added in the mixture, each fluorescently labeled with a different color and attached with a blocking group.

Figure 2.1: A read pair. The two reads are read from one end of each fragment strand, in opposite directions.

The four bases then compete to be the next base on the template DNA strand that is being sequenced. When one base has been attached, all other non-incorporated molecules are washed away. After each synthesis step, a photograph of the incorporated base is taken. For each base the likelihood of an error is estimated. The blocking group is then removed using a chemical process. This process is repeated until we have sequenced a certain number of read pairs.

The reads of a read pair have different directions (Figure 2.1). Each end of a DNA strand is either a three-prime (3') or a five-prime (5') end. The direction of a read can either be from 3' to 5' or from 5' to 3'. We also know roughly the distance between the two reads, because each starts at one end of the fragment. Both reads are reading the same chromosome, but different chromosome strands. Recall that A complements T, and C complements G. That means one strand will be the reverse of the other with A changed to T, T changed to A, C changed to G, and G changed to C (Figure 2.2). One string is said to be the reverse complement of the other.

Figure 2.2: Two chromosome strands. A is always bonded with T and C is bonded with G. The reverse complement of the sequence GATACCC is GGGTATC.

The total number of read pairs can vary, but generally the aim is to have 30x coverage or more. The coverage is the average number of times each base pair is sequenced. With too low a coverage, multiple locations are likely not to be covered by any reads. Since a single base is added in each step, the most common errors we expect are mismatches; it is highly unlikely that we get errors in the form of insertions or deletions. An insertion is when an extra DNA base is added to the read by mistake, and a deletion is when a base is mistakenly not read by the sequencer.

What distinguishes this method is that it is fast and cheap, but the reads will be short, generally between 75 and 150 bases. At decode the read lengths used are 100, 120, or 150 base pairs. With today's Illumina sequencing machines we can produce hundreds or thousands of sequences concurrently. In 2010, the cost of sequencing one million bases ranged from $0.05 to $0.15 USD and the required time was up to 11 days [Vliet, 2010]. In January 2014, Illumina released a sequencing machine capable of sequencing an individual for less than $1,000 with 30x coverage, or about $0.01 per one million bases, in three days [Hayden, 2014].

Each sequence is an extremely small fraction of the human genome, and we have no information on whether our read contains any mistakes or where the read was located, only that it was on some random chromosome at some random location. We can never even be sure that all bases are included in our reads; some locations might still be completely missing. Assembling these reads is a process called whole-genome assembly and it is a very difficult problem.
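
The reverse-complement relation and the coverage and cost figures in this section can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the genome length is rounded to three billion bases as an assumption.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def mean_coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_length


GENOME_LENGTH = 3_000_000_000  # assumed round figure for the human genome

print(reverse_complement("GATACCC"))  # GGGTATC

# 900 million reads of 100 bases each give 30x coverage.
print(mean_coverage(900_000_000, 100, GENOME_LENGTH))  # 30.0

# $1,000 for 30x coverage, expressed per one million sequenced bases.
cost_per_million_bases = 1000 / (30 * GENOME_LENGTH / 1_000_000)
print(round(cost_per_million_bases, 4))  # 0.0111
```

The last figure agrees with the "about $0.01 per one million bases" quoted for the 2014 machine.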

The problem can be looked at like this: We have many identical jigsaw puzzles cut in various ways. We remove a number of pieces, bend some of the others so they no longer fit, and then try to solve the puzzle. To make the problem easier we can use a human reference genome. This can be thought of as another very similar, completed jigsaw puzzle. The idea is to compare the pieces we have with the completed puzzle to get a better idea of where they can fit, changing the problem to an alignment problem. This task is still very computationally heavy and involves a lot of guessing. It is especially tricky for locations of the genome where variability is high, such as the HLA region. Moreover, there are many regions with high similarity to the HLA genes, which makes the task even harder.

2.2 Phred score

The Phred quality score is a scaled quality score given to each base of a read. It is commonly used in sequencing technology. It measures the probability of a sequencing error for each base in a read. The Phred quality $Q$ is a logarithmic transformation of the probability of a sequencing error $P(e)$, calculated using the formula:

$$Q = -10 \log_{10} P(e) \tag{2.1}$$

For example, if the probability of error is 1% (i.e. 99% accuracy) then the quality score given is 20. Solving for $P(e)$ we get

$$P(e) = 10^{-Q/10} \tag{2.2}$$

The quality $Q$ from Illumina machines ranges from 0 to 41 and is represented as an ASCII character $c$ [Johnson, 2013]. The conversion from $Q$ to $c$ is done by adding 33 to $Q$ and interpreting the result as an ASCII character code. Character 33 in the ASCII table is "!", which is the lowest possible quality. It is also the first printable character in the table if we exclude the white space character. The lowest possible quality is never really used, though, as it means there is a 100% probability of an error. The highest quality character possible for Illumina machines is "J", the 74th character in the ASCII table, used for a quality score of 41. Using equation 2.2, this corresponds to an error probability of $10^{-4.1}$, or roughly 0.008%.

The quality values are stored as ASCII characters for space efficiency. For example, if the quality value 41 was stored as the string "41" we would need two bytes instead of one.
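
Equations 2.1 and 2.2 and the ASCII encoding described above can be sketched as follows. The function names and the `PHRED_OFFSET` constant are illustrative choices; only the offset value of 33 comes from the text.

```python
import math

PHRED_OFFSET = 33  # added to Q to obtain the ASCII code, as described above


def phred_quality(p_error: float) -> float:
    """Equation 2.1: Q = -10 * log10(P(e))."""
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Equation 2.2: P(e) = 10^(-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def quality_to_char(q: int) -> str:
    """Encode an integer quality score as a single ASCII character."""
    return chr(q + PHRED_OFFSET)


def char_to_quality(c: str) -> int:
    """Decode a single ASCII quality character back to its score."""
    return ord(c) - PHRED_OFFSET


print(round(phred_quality(0.01)))  # a 1% error probability gives Q = 20
print(quality_to_char(0), quality_to_char(41))  # ! J
print(error_probability(41))  # about 7.9e-05, i.e. roughly 0.008%
```

Decoding a whole quality string is then a matter of applying `char_to_quality` to each character, for example `[char_to_quality(c) for c in "IIJ!"]`.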

28 2 Background 2.3 Data formats FASTA format The FASTA format is a common format for storing both nucleotide and amino acid sequences. It was created by William Pearson and was used in his program with the same name. Since then it has become the industry standard in bioinformatics for raw sequencing data [Pearson and Lipman, 1988]. It has a wide range of use cases. Frequently, it is used to store sequenced reads or even the whole human genome. Pearson did not have any particular specifications for the format, but here we will discuss how it is most commonly used. >chr6_ GGTATGCCTGTATATACAAATGTTCCAGAATCTGAAAAAATCCAAAGTTC AAAACATATCTAGTCCCAGGCATTTCAGATAAGGGATACTCTGTGTGTGT GTGTGTGTTTGTGTGTGTGTGTGTGTGTGTGTGTATGAATTTTGAGAGTG TTGTTTATTTTTATTTTGTAAATACAAGGTCTTGCTCTGTCACCCAGGCT GGAGATCAGTAGCATGATCACATTTCACTGCTGCTTTGAACTCTGACTCA AGGAATTCTCCCTCCTACCTCAGCCTCCCAAGTAGGTAGGACTCCCAAGT AGGTGGCGTACACCACCATGCCTGGCTAATTTATTTTATTTTTTCTAAAG Figure 2.3: Example FASTA file entry for a sequence from the human reference genome. Here the sequence ID is chr6_ meaning the sequence is from chromosome 6 and is showing bases located at to The sequence description is omitted. Following the header is the sequencing data split by 50 characters per line. Each entry in a FASTA file has a single header line which always starts with the greaterthan symbol >. It is followed by the sequence ID that cannot contain any white spaces. Optionally, the sequence ID is then followed by a white space and then a description of the sequence. The sequence ID often contains some useful information about the sequence. In the next line after each header is its sequencing data. There, each nucleotide is stored as a single character. For example we store the DNA chemical bases adenine, guanine, cytosine, and thymine as A, G, C, and T respectively. 
For easier readability, it is recommended that no line exceeds 80 characters, so sequences longer than 80 characters need to be split over multiple lines [Genestudio, 2015]. However, since a header must fit on a single line, headers may exceed the 80 character limit. In addition, every sequence line has the same length except perhaps the last one. This restriction lets FASTA readers know where the newline characters can be, which slightly improves reading performance. The sequence continues until the next header line or the end of the file (EOF) is reached. An example entry in a FASTA file is shown in figure 2.3.
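The rules above (a header line starting with >, an ID without whitespace, an optional description, and sequence lines that continue until the next header or the end of the file) can be illustrated with a small reader. This is a minimal sketch written for this text, not the reader used by Gyper; the function name and the toy entries are made up for the example.

```python
def read_fasta(lines):
    """Yield (sequence_id, description, sequence) for each FASTA entry."""
    seq_id, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            header = line[1:].split(None, 1)  # ID, then optional description
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line.strip())
    # The end of the input closes the last entry.
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


example = """>seq1 toy entry
GGTATGCCTG
TATATACAAA
>seq2
ACGT
""".splitlines()

entries = list(read_fasta(example))
```

The reader joins the split sequence lines back together, so the line width used in the file does not affect the result.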

Files using the FASTA format have the .fa or .fasta extensions. The FASTA format has some extensions, such as the FASTQ format, which also includes a quality score for each base. FASTQ files have the .fq or .fastq extensions. To make searches in big FASTA files faster the files are often indexed, and the index is saved in a FASTA index file with the extension .fa.fai or .fasta.fai. For example, SAMTools [Li, 2011] indexes FASTA files.

SAM and BAM format

For storing aligned reads the sequence alignment/map (SAM) format is often used. The SAM format has two sections: an optional header section and an alignment section. Lines in the header section always start with the @ character. In the alignment section each sequence is stored in one line. Each line has 11 mandatory TAB-delimited fields. If the value of a field is unavailable the field must still be present, with the value 0 if the field contains a number or * if it contains a string. There are also optional fields, stored as key-value pairs in the format TAG:TYPE:VALUE, where TAG is the key. Many tags are predefined, such as the RG tag, which stores the read group of the sequence.

A companion format to SAM is the binary alignment/map (BAM) format, which stores exactly the same data as SAM but is encoded in binary and compressed using the BGZF library. The compression favors performance over compression ratio [Li et al., 2009]. SAM and BAM files use the .sam and .bam extensions, respectively. Most commonly they are used to store next-generation read alignments.

The SAM specifications are constantly being updated by the SAM/BAM format specification working group, which is a part of the SAMTools project group. SAMTools is a software package and library for working with SAM and BAM files. It provides many tools for SAM/BAM files such as conversion from and to other formats, filtering, compression, decompression, sorting, indexing, merging files, and more [Li, 2011].

CRAM is another companion format to SAM.
It allows for a highly efficient reference-based compression of SAM files based on Fritz et al. [2011]. The files are stored with a .cram extension. The reference-based compression algorithm is capable of storing the data in smaller files than BAM does.

VCF

The variant call format (VCF) is a format to store genetic variation data. FASTA files work very well when displaying sequences with no variations, such as the human genome reference, but often it is necessary to be able to view variations of a reference. The VCF

format tries to do this in an efficient manner. It shares many similarities with SAM/BAM files. The variations supported include single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants. One reference genome is used and variations are stored as alternative sequences to the reference. Files using the VCF format are usually saved with the .vcf extension. Similar to SAM/BAM files, VCF files have a header section and a data section. Lines in the header section start with the character # or ## depending on whether the line stores the data column names or meta information, respectively. The meta information is stored as key=value pairs. VCF files have eight mandatory columns; a missing value is written as a . character. The files are usually stored compressed using the BGZF library [Danecek et al., 2011].

2.4 The HLA reference alleles format

The HLA reference alleles are represented using a specific naming format (Figure 2.4). First the gene family is defined, followed by the gene name. The family is HLA for the six genes supported by Gyper. Each type, subtype, synonymous substitution and non-coding substitution will then have a unique set of 2 digit numbers. Sometimes the name will have a suffix character that represents a change in the protein expression.

Figure 2.4: An example of a reference allele format.

Every allele has at least 4 digits, so the family, gene, type, and subtype fields are mandatory. Beyond that, the fields are only used when needed. Sometimes there are more than a hundred different subtypes. In these cases the subtypes have 3 digits. However, alleles with 3 digit subtypes do not have increased resolution. For example, the allele HLA-A*02:102 is considered to have a 4 digit resolution, not 5. Different types and subtypes mean that the two alleles will produce different proteins. This means the exon DNA sequences of the alleles are different.
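The naming rules above can be made concrete with a small parser that splits an allele name into its fields and reports the resolution. This is a sketch written for illustration only: the regular expression, the function name, and the returned dictionary are assumptions of this example and are not taken from Gyper or from the IMGT/HLA tools. Each colon-separated field counts as two digits of resolution, even when it is written with three digits.

```python
import re

# Assumed layout: FAMILY-GENE*field:field[:field[:field]][suffix]
ALLELE_PATTERN = re.compile(
    r"^(?P<family>[A-Z]+)-(?P<gene>[A-Z0-9]+)\*"
    r"(?P<fields>\d{2,3}(?::\d{2,3}){1,3})"
    r"(?P<suffix>[A-Z]?)$"
)


def parse_allele(name):
    """Split an allele name into family, gene, numeric fields, and suffix."""
    match = ALLELE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised allele name: {name!r}")
    fields = match.group("fields").split(":")
    return {
        "family": match.group("family"),
        "gene": match.group("gene"),
        "fields": fields,
        # Two digits of resolution per field, regardless of field width.
        "resolution": 2 * len(fields),
        "suffix": match.group("suffix"),
    }


four_digit = parse_allele("HLA-A*02:102")
eight_digit = parse_allele("HLA-C*17:01:01:01")
```

Here HLA-A*02:102 is reported with 4 digit resolution although its subtype field has three digits, in line with the convention described above.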
However, it is possible that the exon DNA sequences of two alleles differ but still translate to the same protein. These substitutions are called synonymous substitutions because

they do not affect the translated protein. Alleles whose exon sequences differ but whose translated proteins are the same are distinguished at 6 digit resolution. Finally, the 8 digit resolution distinguishes substitutions in the non-coding regions, that is variations of introns. Genotyping techniques that use HLA proteins are only capable of 4 digit resolution typing. For most applications we are only interested in genotyping for different proteins, therefore 4 digit resolution is sufficient. By using DNA sequencing data we are capable of genotyping with up to 8 digit resolution.

2.5 Current DNA sequencing genotypers

Before Gyper, other programs were created with the same purpose of genotyping using sequencing data. Gyper is highly influenced by one named OptiType [Szolek et al., 2014]. OptiType genotypes using an integer linear programming (ILP) algorithm. Its authors tested the algorithm on a broad range of sequencing data: whole genome sequencing data, RNA data, and exome sequencing data, which only contains exon data. Their results are very good: comparison to verification data shows 97% accuracy at 4 digit resolution. In our study we compared Gyper to OptiType, as their algorithm has been shown to have better or equally good accuracy in comparison to other HLA genotyping programs.

HLAminer [Warren et al., 2012] is a widely used program for genotyping with sequencing data. Its focus is on Illumina shotgun sequencing data. In summary, the method involves doing an HLA assembly using a tool called TASR, and then comparing the assembly to the reference alleles using a hash table filled with every 15-nucleotide word, or 15-mer, encountered. The program genotypes very quickly, but in comparison to the other tools its accuracy is not very high. For example, its accuracy was reported to be about 15% lower than that of OptiType [Szolek et al., 2014].
ATHLATES [Liu et al., 2013] is another tool that uses HLA assembly to genotype. Its method relies on accurate recovery of the exon sequences via the assembly. It uses many of the ideas behind HLAminer but improves on them and delivers much better results. The reported accuracy is 74 out of 75 allelic pairs, or about 99% overall accuracy. The OptiType authors compared OptiType with ATHLATES and both programs showed a similar accuracy. However, the sample size was very low, only 3 genes typed for 11 individuals [Szolek et al., 2014].

Recently, an HLA genotyper was created by Major et al. [2013]. It filters the DNA sequencing data and then matches those reads to the exon sequences of the HLA alleles. In the alignment, reads with too many mismatches or any indels are discarded. The method achieved a good HLA call accuracy of 94.2% for an exome dataset. However, on the same dataset OptiType achieved even better results.


3 Methods

Figure 3.1: Modified version of Gyper's pipeline (Figure 1.2). The scopes of the three main steps are highlighted.

Our method has three main steps:

(1) Preprocess the data (Section 3.1).
(2) Create the partial order graph (Section 3.2).
(3) Align reads to the partial order graph to genotype the individual (Section 3.3).

Figure 3.1 shows the scopes of these steps. The preprocessing step is released as a separate contribution. Gyper's scope only covers the creation of the partial order graph and the alignment to it. In section 3.4 we discuss the different adjustable parameters Gyper has, and in section 3.5 how we train these parameters. Finally, in section 3.6 we touch on the implementation and availability of Gyper.

3.1 Preprocessing the data

The preprocessing is released separately from Gyper because it depends on external libraries, which we did not want to add as dependencies. Furthermore, the processes in this step are exclusive to the HLA genes.

Fetching the HLA reference alleles

The HLA reference alleles were fetched from the IMGT/HLA database [Robinson et al., 2015]. The database is released in various formats: FASTA, flat files, MSF, PIR, and XML. We used the XML database. To reduce the size of the database, a few alleles serve as templates for other alleles. A template allele has information about its full sequence with all features (exons, introns and untranslated regions) and references itself as its template. Each non-template allele only specifies the features that differ from its template allele. This structure makes it tedious to fetch the allele sequences for all non-template alleles manually. To simplify the process we created an XML wrapper which depends on RapidXML, a fast XML DOM parser library in C++ [Kalicinski, 2015]. Our wrapper fetches the features of all alleles and outputs them in a FASTA file. Figure 3.2 shows an example output. In the output FASTA files the first two letters of the header are the feature identification code. The allowed codes are 5P, P3, E[0-9], and I[0-9]. Features with 5P and P3 are the untranslated regions closer to the 5' and 3' strand ends, respectively. E[0-9] represents exons 0 through 9 and I[0-9] introns 0 through 9. The number of exons differs among genes.

>E5_HLA-DQB1*03:02:01
GACCTCAAGGGCCTCCACCAGCAG
>I5_HLA-DQB1*03:02:01
GTGATATTTCAGCCATGAGCCAGTGTGGGGGGGCACAGGTGTAAGAGGGAAGA...

Figure 3.2: IMGT/HLA XML wrapper example output.

The direction of the reference alleles is from 5' to 3'. To get the full sequence of an allele the features should be concatenated in the following order: 5P, E1, I1, E2, ..., EN, P3, where N is the total number of exons for the given gene.
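The concatenation order can be written down directly. The following Python sketch is an illustration only: the function names are hypothetical, the feature codes 5P and P3 are assumed spellings of the two untranslated-region codes, and the toy sequences are made up.

```python
def feature_order(n_exons):
    """Return the feature codes in 5' to 3' order: 5P, E1, I1, ..., EN, P3."""
    order = ["5P"]
    for i in range(1, n_exons + 1):
        order.append(f"E{i}")
        if i < n_exons:
            order.append(f"I{i}")  # an intron follows every exon but the last
    order.append("P3")
    return order


def full_sequence(features, n_exons):
    """Concatenate the features of one allele into its full sequence.

    `features` maps a feature code such as "E1" to that feature's sequence.
    """
    return "".join(features[code] for code in feature_order(n_exons))


toy_allele = {"5P": "AA", "E1": "CC", "I1": "GG", "E2": "TT", "P3": "AC"}
sequence = full_sequence(toy_allele, n_exons=2)
```

A gene with N exons therefore contributes N exon features, N - 1 intron features, and the two untranslated regions.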

Regions with relevant reads

The HLA gene cluster is only a very small part of the whole human genome, and thus only a very small portion of the sequenced reads is relevant to HLA genotyping. We could expect a huge reduction in genotyping time if we only needed to consider those reads. However, finding the relevant reads is difficult.

One solution is to use every read pair that has been aligned to the gene's position on the reference genome, as GATK does [McKenna et al., 2010]. Our concern is that such an approach cannot give accurate results for the HLA region due to its high variability. Reads from the HLA region are often misaligned since the human reference genome cannot represent those regions well.

Another solution would be to take every sequenced read into account, as OptiType does [Szolek et al., 2014]. Their method considers every read pair by first mapping them to all HLA alleles, and then uses all the mapped reads to genotype the individual. Their assumption is that the true genotype can explain the most reads. But we have two concerns. First, taking every read into account takes a very long time, hours or days using whole genome sequencing (WGS) data. Only a very small portion of the reads are even relevant to the genotyper. Second, using reads that are mapped elsewhere risks biasing the results. For example, if a sequence outside the HLA region is very similar to one or more alleles but not all, those particular alleles will be biased to score higher than the other alleles. This results in the genotyper predicting the alleles most similar to the reference too often.

Regions of interest

We propose a different method. Our method makes use of the fact that reads are usually stored aligned in alignment files (SAM/BAM files).
All read pairs belong to one of three categories:

- Both reads are mapped to the reference genome, at locations $l_1$ and $l_2$. We usually expect them to be relatively close to each other, and we assume $0 \le l_2 - l_1 \le \ldots$
- One read is unmapped, but the other read in the pair is mapped to location $l$. By convention, both reads are marked to be located at $l$.
- Both reads are unmapped. The reads are both marked as unmapped.

Our goal is to find regions of the genome which are likely to have HLA relevant reads mapped to them. The following three steps explain our process:

(1) Simulate reads that overlap the alleles' exons.
(2) Map the simulated reads to the human reference genome.

(3) Check where the simulated reads map.

Figure 3.3 shows the process for a single allele. The aligner will map a read to the correct location, map it to a suboptimal location outside HLA, or not map it to any position of the genome. We are only interested in reads overlapping the exons, since the exons determine the first 6 digits of the HLA genotype. All aligned positions are extracted and used as the regions of interest. When genotyping, we only use reads located inside the regions of interest. We hope that filtering the data this way will decrease the overall computational time of the genotyping without decreasing its accuracy significantly.

Figure 3.3: Flow chart for finding relevant positions of the human genome using one allele. Reads overlapping the exons are simulated and mapped to the human reference genome.

Algorithm

Algorithm 1 shows the pseudocode behind this technique. For step (1) we gather all sequences for both exons and introns from our fetched reference HLA alleles. As we are only interested in the exons, we look for reads overlapping the exons completely or partly. In our case we simulated reads of length 100 base pairs, so we used 100 bases from each surrounding intron. If, for example, the exon is 20 base pairs long, the complete sequence will be 100 intron base pairs, 20 exon base pairs, and then another 100 intron base pairs, for a total of 220 base pairs. Reads are simulated using a read simulator called Mason, which is provided as part of the C++ library SeqAn [Döring et al., 2008, Holtgrewe, 2010]. Mason simulates sequencing

data in a realistic manner, so some sequences will contain errors. Mason was run with the Illumina machine settings, so the chance of a mismatch is much higher than that of an insertion or deletion.

Step (2) is by far the most computationally intensive step, requiring each simulated fragment to be mapped to the human reference genome. In this step we explicitly call the Burrows-Wheeler Aligner (BWA) to map the fragments [Li and Durbin, 2009]. The BWA-MEM algorithm was chosen for the task. Its output is a SAM file with our simulated reads aligned to the human reference genome. We discarded reads that BWA could not align to the reference. Steps (1) and (2) were implemented in a single C++ program.

Step (3) requires sorting all mapped positions from the previous step and keeping them in two sorted lists, one for all the start positions and one for all the end positions. The end position is required because sometimes BWA will only align a partial read. By looping through both lists we know at any point what the coverage depth is. The coverage depth of a base is the number of reads that overlap that base. The algorithm starts at the start position and checks the coverage of each base until it has reached the final position. The first item in the start position list is the start position, and the last item in the end position list is the final position. Most bases of the genome have no coverage depth, but we can always skip to the next item in the start position list when that occurs.

Checking for bias introduction

When extracting reads outside the HLA region we might be introducing a bias to the results for some of the HLA alleles, for example if the suboptimal regions contain sequences which are very similar to some alleles, but not all. If that happens those alleles will have a biased score. We can correct for this bias by adding a parameter in Gyper which lowers the scores of the biased alleles.
To measure the bias we simulate read pairs from all suboptimal locations we found and use them in Gyper. We can then estimate how often this event happens. Reads are simulated with Mason using Illumina settings with 3000x coverage depth. Results can be found in section

Multiple sequence alignment

In the section on fetching the HLA reference alleles we parsed all features of the HLA alleles. Aligning the sequences in a multiple sequence alignment (MSA) is beneficial to the algorithm that creates the partial order graph for two reasons: First, all the sequences have the same length. Second,

Input: All reference alleles for each genotype (alleles), number of random fragments (n), and length of intron sequences to use (intronlength).
Output: All locations of the genome and their coverage depth, excluding all locations with no depth.

Step 1
  exons ← Fetch all exons using our IMGT/HLA XML parser
  Add sequences from surrounding introns of up to length intronlength on each end of the exons
  fragments ← Generate n random simulated reads from exons
Step 2
  startpositions, endpositions ← Arrays of zeros of length n
  for i ← 0 to n − 1 do
    startpositions[i] ← Map(fragments[i])
    endpositions[i] ← startpositions[i] + Length(fragments[i])
  end
Step 3
  startpositions ← Sort(startpositions)
  endpositions ← Sort(endpositions)
  location, depth ← Empty arrays
  i, j, k, d ← 0
  while k < endpositions[n − 1] do
    if i < n and startpositions[i] ≤ k then
      i++
      d++
      continue
    end
    if endpositions[j] ≤ k then
      j++
      d−−
      continue
    end
    if d is 0 then
      k ← startpositions[i]
    else
      Add k to location
      Add d to depth
      k++
    end
  end

Algorithm 1: Extracting regions from the genome which are likely to contain misaligned HLA reads. We do that by first simulating HLA reads, mapping them to the genome, and determining which locations they were mapped to.
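Step 3 of Algorithm 1, the sweep over the two sorted position lists, can be sketched in Python as follows. This is an illustrative reimplementation and not the code shipped with the preprocessing tool. It assumes half-open intervals (the end position is the first base not covered), guards the index into the start list, and stops once k reaches the last end position, so that it never reads past the end of either list. The function name and the toy intervals are made up.

```python
def coverage_depth(intervals):
    """Return (locations, depths) for every base covered by at least one read.

    `intervals` holds (start, end) pairs, with the end position exclusive.
    """
    if not intervals:
        return [], []
    starts = sorted(start for start, _ in intervals)
    ends = sorted(end for _, end in intervals)
    n = len(intervals)
    locations, depths = [], []
    i = j = depth = 0
    k = starts[0]
    while k < ends[-1]:
        if i < n and starts[i] <= k:
            i += 1          # a read starts at or before k
            depth += 1
            continue
        if ends[j] <= k:
            j += 1          # a read has ended before k
            depth -= 1
            continue
        if depth == 0:
            k = starts[i]   # nothing covers k: jump to the next read start
        else:
            locations.append(k)
            depths.append(depth)
            k += 1
    return locations, depths


locations, depths = coverage_depth([(10, 13), (11, 14), (20, 22)])
```

Because uncovered stretches are skipped in one jump, the running time depends on the number of reads and covered bases rather than on the length of the genome.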

most sequences will share a base, so we can use a single node to represent a base on most or all alleles. MSA has been used before for this purpose, for example by Dilthey et al. [2015].

We align the sequences by inserting gaps into them. In the graph, gaps are represented by the absence of a node, so adding gaps will not increase the number of nodes in the graph. In fact, by aligning the sequences we can often use fewer nodes to represent the sequences. So the MSA effectively reduces the number of nodes required for the graph.

However, an optimal MSA is a massive problem for this case. The worst case has close to 4,000 different sequences with an average length of over two thousand base pairs ($N \approx 4000$, $L > 2000$). Optimal MSA has been proven to be an NP-hard problem [Wang and Jiang, 1994], with a complexity of $O(L^N)$. So an optimal MSA would take an absurdly long time for our case and is not necessary. An approximate MSA is more appropriate. We use MUSCLE, which does an approximate MSA with high accuracy. The complexity of MUSCLE's MSA algorithm is $O(N^3 L + N L^2)$ [Edgar, 2004].

Figure 3.4 shows an example of our data before and after the MSA. The input and output are both in FASTA format, and the output sometimes contains dashes which represent gaps. Specifically, the sequences shown are the 4th intron for alleles HLA-C*15:02:01, HLA-C*16:01:01, and HLA-C*17:01:01:01. For clearer representation we have colored mismatches red and indels green. Of these three reference alleles, HLA-C*15:02:01 and HLA-C*16:01:01 only have one base mismatch between them. Between HLA-C*16:01:01 and HLA-C*17:01:01:01, however, there are 2 mismatched bases and another 3 bases inserted into HLA-C*17:01:01:01.

Input:
>I4_HLA-C*15:02:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTGGAGCCCTTCAGCTGGGT
CAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG
>I4_HLA-C*16:01:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTGGAGCCCTTCAGCCGGGT
CAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG
>I4_HLA-C*17:01:01:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTTCTGGAGCCCTTCAGCCG
GGTCAGGGCTGAGGCTTGGGTGTAAGGGCCCCTCACCTTCCCCTCCTTTCCCAG

Output:
>I4_HLA-C*15:02:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTC---CTGGAGCCCTTCAGCTG
GGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG
>I4_HLA-C*16:01:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTC---CTGGAGCCCTTCAGCCG
GGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG
>I4_HLA-C*17:01:01:01
GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTTCTGGAGCCCTTCAGCCG
GGTCAGGGCTGAGGCTTGGGTGTAAGGGCCCCTCACCTTCCCCTCCTTTCCCAG

Figure 3.4: Example input and output of our MSA. Bases colored green and red are indels (insertions or deletions) and mismatches, respectively.
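The mismatch and indel counts quoted for these three alleles can be checked column by column on the aligned output. The sketch below is an illustration only; the function name is made up, and to keep it short it uses excerpts of the aligned rows of Figure 3.4 (the last 24 columns of each first line joined with the first 30 columns of each second line) rather than the full sequences.

```python
def compare_aligned(a, b, gap="-"):
    """Count mismatch columns and indel columns between two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = indels = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue            # both rows have a gap: nothing to compare
        if x == gap or y == gap:
            indels += 1         # a base in one row faces a gap in the other
        elif x != y:
            mismatches += 1
    return mismatches, indels


# Excerpts of the aligned rows of Figure 3.4.
c1502 = "AGTC---CTGGAGCCCTTCAGCTG" + "GGTCAGGGCTGAGGCTTGGGGGTCAGGGCC"
c1601 = "AGTC---CTGGAGCCCTTCAGCCG" + "GGTCAGGGCTGAGGCTTGGGGGTCAGGGCC"
c1701 = "AGTCCTTCTGGAGCCCTTCAGCCG" + "GGTCAGGGCTGAGGCTTGGGTGTAAGGGCC"

first_pair = compare_aligned(c1502, c1601)
second_pair = compare_aligned(c1601, c1701)
```

Columns where both rows carry a gap are skipped, since in a multiple alignment such gaps only exist because of a third sequence.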

3.2 Constructing a reference partial order graph

Typically sequences are aligned to a single reference genome. For genes with high structural and sequence diversity this can lead to poor characterization of such regions. To represent these genes, such as the HLA genes, we used a partial order graph (POG).

Graph implementation

In the POG we store both nodes (vertices) and directed edges. By doing an MSA we ensure that every feature has the same length. The features are added to the graph one by one. Each node has an integer that stores the level and a single DNA base value, that is A, T, G, or C. We surround each feature with nodes that have no DNA base. The level corresponds to the location of that base in the sequence. No two nodes connected by a direct path can have the same level. If $n_1$ and $n_2$ are two connected nodes in the graph, the edge will always be directed to the node with the higher level. This means that the last node of the graph, the one with no outgoing edges, has the highest level. The first node of the graph, the one with no incoming edges, has level 0.

Each edge stores a reference to both nodes it connects. Furthermore, if the edge is inside an exon it also stores a bit string. The bit string has a length equal to the number of references used. The purpose of these bit strings will be discussed when we align sequences to the graph and genotype in section 3.3. When an edge is created all bits are initialized to 0 except for the one representing the exon sequence that is being added. When adding another sequence to the graph which shares a path with an earlier sequence, we flip the corresponding bit to 1. By storing it this way we never need more than ceil(r/8) bytes of memory per edge, where r is the number of references used in the graph. If, for example, we had 1000 references and 10,000 edges we would only need 1.25 megabytes to store this information. Additionally, we only need to store it for exons.
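The memory estimate follows from rounding the number of references up to whole bytes per edge. The short Python sketch below reproduces the arithmetic; the function name is hypothetical and the calculation ignores any overhead of the surrounding data structures.

```python
import math


def bitstring_bytes(n_references, n_exon_edges):
    """Bytes needed when each exon edge stores one bit per reference."""
    bytes_per_edge = math.ceil(n_references / 8)
    return bytes_per_edge * n_exon_edges


# 1000 references -> 125 bytes per edge; 10,000 edges -> 1,250,000 bytes.
total = bitstring_bytes(1000, 10_000)
```

With 1000 references each edge needs 125 bytes, so 10,000 edges need 1,250,000 bytes, which is the 1.25 megabytes quoted above.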
Algorithm 2 shows the pseudocode behind this method and figure 3.5 shows an example of how it creates a graph for three exon sequences: GATA, -AT-, and CATA. We add exon and intron sequences separately because edge creation is handled differently: exons have a bit string while introns do not. There is a large amount of missing and unreliable intron data in the IMGT/HLA database, compared to the exons. So instead of trying to reuse data from other introns we simply allow reads to align freely within the intron. With such free alignment there is no need for the bit strings, so they are omitted on introns.

Input: FASTA file with aligned sequences of some feature.
Output: Partial order graph we can use as a reference.

graph ← empty partial order graph
sequences ← read sequences from FASTA file
startnode ← new node with level 0
endnode ← new node with level Length(sequences[0]) + 1
Add startnode and endnode to graph
for s ← 0 to Length(sequences) − 1 do
  sequence ← sequences[s]
  previous ← startnode
  for pos ← 0 to Length(sequence) − 1 do
    if sequence[pos] is a gap then
      continue
    end
    next ← node with letter sequence[pos] and level pos + 1
    if next does not exist in graph then
      Add next to graph
    end
    if no edge exists from previous to next then
      Add edge from previous to next with bit s flipped on
    else
      Flip bit s on for the edge from previous to next
    end
    previous ← next
  end
  if no edge exists from previous to endnode then
    Add edge from previous to endnode with bit s flipped on
  else
    Flip bit s on for the edge from previous to endnode
  end
end

Algorithm 2: Creating a reference partial order graph for a single exon. Creating a graph for an intron is similar, but then we do not store the bit string on edges, hence there is no need to create or modify them.
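The construction can be tried out on the three example exon sequences GATA, -AT-, and CATA. The following Python sketch is an illustration under our own assumptions and not Gyper's C++ implementation: nodes are represented as (level, base) tuples, the two surrounding nodes carry no base, and each edge's bit string is a plain integer whose bit i belongs to the i-th sequence.

```python
def build_exon_graph(sequences, gap="-"):
    """Build a partial order graph from equal-length aligned sequences.

    Returns (nodes, edges). `edges` maps (from_node, to_node) to an integer
    used as a bit string with one bit per input sequence.
    """
    length = len(sequences[0])
    start, end = (0, None), (length + 1, None)
    nodes = {start, end}
    edges = {}
    for index, sequence in enumerate(sequences):
        if len(sequence) != length:
            raise ValueError("sequences must be aligned to the same length")
        bit = 1 << index
        previous = start                 # every sequence starts at level 0
        for pos, base in enumerate(sequence):
            if base == gap:
                continue                 # a gap is the absence of a node
            node = (pos + 1, base)
            nodes.add(node)
            edge = (previous, node)
            edges[edge] = edges.get(edge, 0) | bit
            previous = node
        edge = (previous, end)
        edges[edge] = edges.get(edge, 0) | bit
    return nodes, edges


nodes, edges = build_exon_graph(["GATA", "-AT-", "CATA"])
```

All three sequences share the edge from A on level 2 to T on level 3, so its bit string is 111, while the edge from the first node to C is used only by the third sequence and holds 100.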

Figure 3.5: Creating a partial order reference graph using three example exon sequences: GATA, -AT-, and CATA. Blue edges show edges we traversed through, green labels represent changed or new labels on edges. Red and yellow nodes represent new and old nodes, respectively. a) Two nodes are created, an initial node on level 0 and a final node on level Length(sequences[0]) + 1, which is here 5. b) The sequence GATA is added to the graph. c) The sequence -AT- is added to the graph. Note that no new nodes need to be created, only edges. We change the bit string for the edge going from A to T so it includes this sequence. d) The sequence CATA is added to the graph. The new C node will be on the same level as the G node.

Extending the POG

Figure 3.6 shows how the partial order graph can be extended with three intron sequences: TTA, -TA, and GTA. In our implementation we extend the graph by adding sequences connecting to the lowest level node. So when creating a graph for the HLA genes we add features in reversed order: first the 3' untranslated end of the allele, then the last exon, then the last intron, and so on until we have added the 5' untranslated region. The nodes connecting the features are always free to traverse through, so they do not need to store any DNA base. We keep track of the level of these nodes. When we are aligning to the graph we can check the level of the node we are aligning to. So at any point in the alignment, we know if we are aligning to an exon or an intron.

Figure 3.6: Extended graph from figure 3.5. Here we have added three intron references to the graph. We use the previous initial node as a final node for the new extension. The red nodes are the new intron nodes. The intron sequences we have added are: TTA, GTA, and -TA. Note that the edges on the new nodes do not store a bit string like the other exon nodes.

3.3 Aligning sequences to the POG

When aligning sequences to our graph the goal is to determine which pair of alleles can explain the most reads. Our assumption is that this pair of alleles is the individual's true allele pair. We have a backtracker which keeps track of both the matches and the read's path through the graph. The match is simply an array of boolean values whose length equals the length of the read. This array initially has all bits set to 0, which are flipped if a match is found. The backtracker also stores which node is the previous node of the match, so we know how the match traversed through the graph. Both arrays have the same size as the length of the read we are aligning. In our alignment we are free to start anywhere and end anywhere; it is a semi-global alignment where both ends of the references are free. Generally we would need our two arrays to have a length of the read length plus one, but since we can start anywhere the top boolean will always be true, so there is no need to store it.

Since the graph is acyclic we can always find a topological sorting of the nodes, meaning that if there is a node $n_1$ which depends on the results of node $n_2$, then $n_2$ will never depend on $n_1$ or on any node that depends on $n_1$. Also there is one, and only one, node that does not depend on any other nodes. That node is the first node in our topological sort.

Algorithm

We use a dynamic programming algorithm to align sequences to the graph, as shown in algorithm 3. Our algorithm requires O(nm) time in the worst case, where n is the number of edges and m is the length of the sequence. It visits every edge of the graph and compares the DNA base of the sequence to the target node. When traversing through the graph it is always guaranteed that we have already calculated the current node's dependencies. When matches are found we only need to change a boolean value in the array and store a reference to the previous node.
The aligner returns a list of nodes where the sequence was matched, because the sequence can be aligned to more than one location. If the two reads of a read pair are aligned to locations that are very far from each other, we discard that read pair. The highest distance allowed between the reads was chosen to be 800 base pairs. If a read is aligned to multiple locations we need to choose the best one to use. We chose the best distance between two reads to be 350 base pairs. These two values were estimated from the 99.99th percentile and the most common insert size of deCODE's BAM files, respectively.

Figure 3.7 shows an example of an alignment of the read ACAT to a graph. The graph

matches the read to nodes 4 (A), 5 (no base), 7 (C), 8 (A), and 9 (T). Here, a reference to the 9th node will be the only node in the output list. Then, when the alignment has finished, we backtrack from that node only. Backtracking has a complexity of O(m) in the worst case.

Backtracking

The backtracking algorithm uses the backtracker to determine which reference alleles can explain the aligned read. It picks a node from the alignment algorithm and starts backtracking there. Initially it has a bit string of length equal to the number of reference alleles used, with all bits flipped to 1. Then, as the backtracking algorithm travels backwards through the graph, it performs a bitwise AND operation for every edge with a bit string. That is, if we are traversing inside an exon we perform the AND operation. What we end with is a bit string whose bits are only flipped on if the read followed the corresponding reference allele exactly. In other words we say that those reference alleles explain the read. If however we are traversing through an intron we do not know which reference allele created that edge on the graph, and we would rather say that any reference can explain the read, for the reasons we discussed before.

Continuing with the previous example (Figure 3.6), the backtracking of the sequence ACAT would generate the following calculation:

111 AND 111 AND 100 AND 100 = 100

The convention when using bit strings is to say that the rightmost bit is the first one. So the bit string 100 means that only the third reference explains the read. The exon of the third reference was CATA, which ACAT overlaps. ACAT does not overlap the other two exons. If however we aligned the read TTA, the read maps to nodes 1, 3, and 4. Since these nodes are not connected to an edge with a bit string we do not require any AND operations and simply have the bit string:

111

Here we simply say that every allele can explain the read.
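The two calculations above can be reproduced with integers used as bit strings. This Python sketch is for illustration only: the function names are hypothetical, exon edges are given by their bit strings, and intron edges, which store no bit string, are represented by None.

```python
def explaining_alleles(edge_bitstrings, n_references):
    """AND together the bit strings of the exon edges on a read's path."""
    result = (1 << n_references) - 1   # start with every reference allowed
    for bits in edge_bitstrings:
        if bits is not None:           # intron edges do not restrict anything
            result &= bits
    return result


def reference_indices(bitstring, n_references):
    """List the 0-based references whose bit is on (rightmost bit first)."""
    return [i for i in range(n_references) if (bitstring >> i) & 1]


# ACAT: three exon edges (111, 100, 100) and one intron edge.
acat = explaining_alleles([0b111, 0b100, 0b100, None], n_references=3)
# TTA: the path only uses intron edges.
tta = explaining_alleles([None, None], n_references=3)
```

The result for ACAT is 100, so only the third reference (index 2 when counting from zero) explains the read, while the result for TTA stays at 111.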

Input: A sequenced read from an individual who is being genotyped for gene gene.
Output: A backtracker we can use to find all references that explain the read, and an array of nodes where alignments end.

graph ← a partial order graph for gene
order ← TopologicalSort(graph)
backtracker ← array storing, for each node in graph, both match (true or false) and previousnode
nodes ← empty array
for source in order do
    for edge in edges directed from source do
        target ← edge's target
        if target stores dna then
            if read[0] == target.dna then
                backtracker[target].match[0] ← true
                backtracker[target].previousnode[0] ← source
            end
            pos ← 1
            while pos is smaller than the length of the read do
                if backtracker[source].match[pos − 1] and read[pos] == target.dna then
                    backtracker[target].match[pos] ← true
                    backtracker[target].previousnode[pos] ← source
                end
                pos ← pos + 1
            end
            if backtracker[target].match[length(read) − 1] then
                add target to nodes
            end
        else
            backtracker[target].match ← array of true values
            backtracker[target].previousnode ← array of source
        end
    end
end

Algorithm 3: Aligning a single sequence to the reference graph.
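As a rough illustration of Algorithm 3, the following Python sketch performs exact matching of a read along the paths of a small directed acyclic graph and then backtracks from an end node. It is a simplified variant written under several assumptions, not Gyper's implementation: nodes without a base simply forward the match state of their predecessors, mismatches are not allowed, and the graph (`BASES`, `EDGES`) is a made-up toy example rather than the graph of figure 3.6. All function and variable names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def align(read, bases, edges):
    """Return (end nodes, previous-node table) for exact matches of `read`.

    bases maps node -> one-letter base, or None for a node without a base.
    edges is a list of (source, target) pairs of a directed acyclic graph.
    """
    preds = {node: [] for node in bases}
    for source, target in edges:
        preds[target].append(source)
    order = TopologicalSorter(preds).static_order()  # predecessors come first

    m = len(read)
    match = {node: [False] * m for node in bases}
    prev = {node: [None] * m for node in bases}
    ends = []
    for node in order:
        base = bases[node]
        for p in preds[node]:
            for pos in range(m):
                if match[node][pos]:
                    continue                     # keep the first path found
                if base is None:
                    ok = match[p][pos]           # forward state, consume nothing
                else:
                    ok = pos > 0 and match[p][pos - 1] and read[pos] == base
                if ok:
                    match[node][pos] = True
                    prev[node][pos] = p
        if base is not None:
            if read[0] == base:
                match[node][0] = True            # a read may start at any node
            if match[node][m - 1]:
                ends.append(node)
    return ends, prev


def backtrack(end, prev, bases, read_length):
    """Follow the previous-node table from an end node back to the start."""
    path, node, pos = [end], end, read_length - 1
    while prev[node][pos] is not None:
        previous = prev[node][pos]
        if bases[node] is not None:
            pos -= 1                             # this node consumed one base
        node = previous
        path.append(node)
    return path[::-1]


# Toy graph: T -> A -> (no base | G) -> C -> A -> T
BASES = {0: "T", 1: "A", 2: None, 3: "G", 4: "C", 5: "A", 6: "T"}
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6)]

ends, prev = align("ACAT", BASES, EDGES)
print(ends, backtrack(ends[0], prev, BASES, 4))  # [6] [1, 2, 4, 5, 6]
```

The read ACAT can only pass through the node without a base, because the alternative branch would require a G after the first A.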

Figure 3.7: Alignment of the sequence ACAT to the graph from figure 3.6. The numbers below each node denote their topological sort order, and blue edges mark the path of the alignment.

Genotyping constraints

When genotyping we estimate how likely it is that an individual has a particular allele, depending on how many reads that allele can explain. Both a read and its complementary sequence are aligned to the graph, since we do not know the read's direction relative to the reference. We use the following constraints:

- Each individual has either one or two different alleles. Everyone has two copies of chromosome 6, so each individual can carry two different variants.
- A read needs to be continuous, meaning we never add gaps to it or to the reference while aligning. Under some strict circumstances we may allow a mismatch between the read and the reference. We allow this but no other types of errors, since the most frequent errors in Illumina read data are mismatches [Hoffmann et al., 2009].
- Since both reads in a read pair come from the same chromosome, we perform a bitwise AND operation on the bit strings of the two reads. All non-paired reads are discarded.

We believe that with these constraints we can create a model that accurately predicts the correct genotype from Illumina next-generation sequencing data. To further improve the model we include some parameters that we train using in-house tools, which are discussed later.

3.4 Parameters

Using the graph, we need to create a heuristic that maximizes the accuracy of the program without sacrificing heavily on computational time or memory. The parameters are: read clipping, minimum sequence length, mismatches, and a zygosity factor.

Read clipping

The quality of a read tends to drop near its ends when using Illumina machines and other technologies based on sequencing by synthesis [Fuller et al., 2009]. To counter this issue we clip the reads near the ends. Illumina produces a quality score for each base in the form of a Phred value. We use this Phred value to clip the reads. We pick a Phred value threshold τ, and if a base at a read end has a Phred value lower than τ, we remove it. We repeat this process until we find a base with a Phred value equal to or higher than τ. This process is applied to both ends of the read. If we choose a high τ value, we could be removing valuable information from the reads. But if we choose a low τ value, we are likely to have more errors at the read's ends. This parameter needs to be trained to balance these two effects.
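The clipping rule above can be sketched as follows. This is a minimal illustration rather than Gyper's code; the function name `clip_read` and the representation of qualities as a list of integers are assumptions made for the example.

```python
def clip_read(seq: str, quals: list[int], tau: int) -> tuple[str, list[int]]:
    """Remove bases with Phred value below tau from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < tau:     # clip the left end
        start += 1
    while end > start and quals[end - 1] < tau:   # clip the right end
        end -= 1
    return seq[start:end], quals[start:end]


clipped_seq, clipped_quals = clip_read("ACGTAC", [10, 30, 15, 35, 32, 5], tau=20)
print(clipped_seq, clipped_quals)  # CGTA [30, 15, 35, 32]
```

Note that the low-quality base in the middle of the read (Phred 15) is kept: only the ends are clipped, and clipping stops at the first base that reaches the threshold.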

For read clipping we chose the following values to train with: τ = 20, 25, 30, 35, 37. A value of 20 produces almost no clipping of the reads, while 37 clips many entire reads away. We opted for very few values for this parameter because every value produces different reads, and each read needs to be aligned, which is a very expensive operation.

Minimum sequence length

Since we are clipping the reads, we need another parameter that limits the minimum length of the clipped reads. We call this parameter λ. If we allowed all short reads to be aligned to the graph, we would more often find matches in multiple locations. Furthermore, reads from other regions would be more likely to align. At the same time it is probably not a good idea to reject all reads that have been clipped, because we would then end up with very few reads.

We can very cheaply try many different values when training λ. We trained using the following minimum sequence lengths: λ = 10, 20, 30, 40, 50, 60, 70, 80, 90. Our method aligns a read to the graph only if the length of the read is at least the lowest minimum sequence length, which in our case is 10. Then, when the alignment has finished, we add the results to the total results of every minimum sequence length that is lower than or equal to the cutoff length. For instance, if the length of a read is 40 we use 40 as the cutoff length and add the results to the total results for 40, 30, 20, and 10. However, if a read has length 9 we reject that read entirely. The computational time therefore does not scale with the number of parameter values we try. It only depends on the smallest minimum sequence length we use.

Mismatches

In section 3.4 we mentioned that we can allow mismatches in the aligner in some scenarios. We allow a mismatch when all of the following conditions are met:

- The current node has one outgoing edge.
- The current node has one incoming edge.
- The quality of the read is below a given threshold, ρ.

The parameter we need to train is the quality threshold ρ. The reason for the first condition is that if we allowed more than one outgoing edge, a single alignment could map to multiple paths in the graph. If we allow separate paths, we are likely to face many cases where a single read can start at some node and end at multiple other places. When this happens there is no good way to decide which of these paths is the most correct one. To avoid this issue we simply forbid mismatches on nodes with more than one outgoing edge. The second condition is not mandatory. However, it gives us the nice property that the alignment to the graph is symmetric. We chose to train the mismatch quality threshold using these values: ρ = 20, 25, 30, 35, and 37. Again, we only chose five values since each value requires a separate alignment to the graph and is thus computationally expensive.

Zygosity factor

When deciding which alleles explain the most reads, a plain count will always heavily favor heterozygous over homozygous solutions. Two different alleles will always be able to explain at least as many reads as a single allele. In previous work, such as [Szolek et al., 2014], this issue was resolved by using a constant factor that increases the scores of homozygous results. We believe this method fails to provide a good genotype if both true alleles are very similar. In other words, if an individual's true alleles are very similar, then the score of each of those two alleles is expected to be almost as high as the score of their combination. To counter this issue we use a weighted average between the number of reads two alleles can explain together and the average number of reads they can explain individually.

More formally, we have selected n alleles in a set $L = \{a_1, a_2, \ldots, a_n\}$. We also have m read pairs from sequencing an individual. Given a read pair $r_j$ and an allele $a_i$, where $1 \le j \le m$ and $1 \le i \le n$, we define

\[ C_{r_j,a_i} = \begin{cases} 1, & \text{if allele } a_i \text{ can explain the read pair } r_j \\ 0, & \text{otherwise} \end{cases} \tag{3.1} \]

If a read pair is explained by an allele, the hit counter for that allele and read pair is 1, and otherwise 0. Read pair $r_j$ can be explained by any number of alleles:

\[ 0 \le \sum_{i=1}^{n} C_{r_j,a_i} \le n \tag{3.2} \]

The total score $S_A$ of an allele $A \in L$ for given read pairs $r_1, r_2, \ldots, r_j, \ldots, r_m$ is

\[ S_A = \sum_{j=1}^{m} C_{r_j,A} \tag{3.3} \]

Since each individual has two chromosomes, we need a method to convert the single allele score $S_A$ to a combined allele score, which we call $S_{A,B}$. If we have two alleles $A, B \in L$, we only require that one of them can explain the read. We calculate their heterozygous score as

\[ S_{A,B} = \sum_{j=1}^{m} \max\left(C_{r_j,A}, C_{r_j,B}\right) \tag{3.4} \]
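Equations 3.3 and 3.4 translate directly into code. The sketch below assumes, purely for illustration, that the hit counters are stored as one list of 0/1 values per allele (one entry per read pair); the allele names and counts are made up.

```python
def single_score(C, allele):
    """S_A of equation 3.3: number of read pairs the allele explains."""
    return sum(C[allele])


def pair_score(C, a, b):
    """S_{A,B} of equation 3.4: read pairs explained by at least one allele."""
    return sum(max(x, y) for x, y in zip(C[a], C[b]))


# Hypothetical hit counters for five read pairs
C = {
    "A": [1, 1, 1, 0, 0],
    "B": [1, 0, 1, 1, 0],
}

print(single_score(C, "A"))     # 3
print(single_score(C, "B"))     # 3
print(pair_score(C, "A", "B"))  # 4
print(pair_score(C, "A", "A"))  # 3
```

The last line shows that pairing an allele with itself gives back its single allele score, and the pair score of two different alleles can never be lower than either single score.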

Here we note that for a homozygous genotype, which is when A = B in equation 3.4, we get

\[ S_{A,A} = \sum_{j=1}^{m} \max\left(C_{r_j,A}, C_{r_j,A}\right) = \sum_{j=1}^{m} C_{r_j,A} = S_A \tag{3.5} \]

Individuals that have the same allele A on both chromosomes simply get a score $S_{A,A} = S_A$. This scoring scheme heavily favors heterozygous scores because $S_{A,B} \ge S_A$ for any two alleles A and B. Instead, we decide the zygosity of the individual by using the number of reads explained by one allele but not the other:

\[ S_{A \setminus B} = \sum_{j=1}^{m} \max\left(C_{r_j,A} - C_{r_j,B}, 0\right) \tag{3.6} \]

We also define $S_{A \cap B}$ to be the number of reads explained by both allele A and allele B:

\[ S_{A \cap B} = \sum_{j=1}^{m} \max\left(C_{r_j,A} + C_{r_j,B} - 1, 0\right) \tag{3.7} \]

Figure 3.8 shows the relation among the scores S. If $S_{A \setminus B}$ is much larger than $S_{B \setminus A}$, it is likely that the individual is homozygous with two A alleles. However, if $S_{A \setminus B}$ and $S_{B \setminus A}$ are relatively similar, then it is likely that the individual is heterozygous.

Figure 3.8: Alleles A and B are represented as sets. Each set contains the reads that the allele explains. The number of reads explained by either allele A or B is $S_{A,B} = S_{A \setminus B} + S_{A \cap B} + S_{B \setminus A}$.

We can use these results to decide zygosity using a constant β:

\[ \frac{S_{A \setminus B} - S_{B \setminus A}}{S_{A \setminus B} + S_{B \setminus A}} \begin{cases} > \beta, & \text{homozygous solution} \\ \le \beta, & \text{heterozygous solution} \end{cases} \tag{3.8} \]

where $0 \le \beta \le 1$. Figure 3.9 further explains our reasoning. A cross represents a read. If allele A can explain the read then the cross is in set A.
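The zygosity rule of equations 3.6 to 3.8 can be sketched as below. The helper names are hypothetical, and two choices are assumptions of this sketch rather than something the text specifies: the absolute difference is used so that the result does not depend on which allele is passed first, and `None` is returned when both alleles explain exactly the same reads, because the ratio is then undefined. The usage lines rebuild hit counters from the read counts given for figure 3.9 and use β = 0.5 as an example value.

```python
def only_first(ca, cb):
    """S_{A\\B} of equation 3.6: read pairs explained by A but not by B."""
    return sum(max(x - y, 0) for x, y in zip(ca, cb))


def shared(ca, cb):
    """S_{A∩B} of equation 3.7: read pairs explained by both alleles."""
    return sum(max(x + y - 1, 0) for x, y in zip(ca, cb))


def call_zygosity(ca, cb, beta):
    """Apply equation 3.8; None if the two alleles explain the same reads."""
    a_only, b_only = only_first(ca, cb), only_first(cb, ca)
    if a_only + b_only == 0:
        return None
    ratio = abs(a_only - b_only) / (a_only + b_only)
    return "homozygous" if ratio > beta else "heterozygous"


def indicators(a_only, both, b_only):
    """Build 0/1 hit counters from the three region sizes of figure 3.8."""
    ca = [1] * a_only + [1] * both + [0] * b_only
    cb = [0] * a_only + [1] * both + [1] * b_only
    return ca, cb


print(call_zygosity(*indicators(8, 8, 1), 0.5))  # homozygous   (7/9 > 0.5)
print(call_zygosity(*indicators(6, 8, 5), 0.5))  # heterozygous (1/11 <= 0.5)
```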

Figure 3.9: Two examples of the distribution of reads (crosses) among two alleles, A and B. a) $S_{A \setminus B} = 8$, $S_{A \cap B} = 8$, $S_{B \setminus A} = 1$. It is likely that the individual is homozygous even though $S_A = 16$ and $S_{A,B} = 17$. The read inside the B\A region is likely an error. b) $S_{A \setminus B} = 6$, $S_{A \cap B} = 8$, $S_{B \setminus A} = 5$. Here, however, $S_{A \setminus B}$ and $S_{B \setminus A}$ are relatively similar, and thus we would rather expect the individual to be heterozygous.

Now we can scale the heterozygous solutions down using β. First, we assume that A has the highest score, $S_{max} = S_A \ge S_B$. This also means that $S_{A \setminus B} \ge S_{B \setminus A}$, because $S_A - S_B = S_{A \setminus B} - S_{B \setminus A}$. Furthermore, we can see from figure 3.8 that $S_{A \setminus B} + S_{B \setminus A} = S_A + S_B - 2S_{A \cap B}$. Assuming $S_{A \setminus B} + S_{B \setminus A} > 0$, the condition for a heterozygous solution is

\[ S_A - S_B - \beta\left(S_A + S_B - 2S_{A \cap B}\right) \le 0 \tag{3.9} \]

To compare scores between homozygous and heterozygous solutions we introduce a scaled score $\tilde{S}_{A,B}$. If an individual is heterozygous, we want

\[ \tilde{S}_{A,B} \ge S_A \tag{3.10} \]

One solution to equations 3.9 and 3.10 is

\[ \tilde{S}_{A,B} = \beta S_{A,B} + (1 - \beta)\,\frac{S_A + S_B}{2} \tag{3.11} \]

Equation 3.11 has a nice feature: it makes scaling unnecessary for homozygous solutions, because

\[ \tilde{S}_{A,A} = \beta S_{A,A} + (1 - \beta)\,\frac{S_A + S_A}{2} = S_A \tag{3.12} \]

Therefore, we only need to scale heterozygous solutions. If we use β = 1 there is no scaling, $\tilde{S}_{A,B} = S_{A,B}$, and heterozygous solutions are favored. If β = 0 the scoring scheme favors homozygous solutions. We suggest β = 0.5 as a middle ground, but training this parameter is computationally cheap. Using individuals in deCODE's dataset we trained β using 9 different values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. We expect our scoring scheme to solve, at least to some extent, the problem of having two highly similar alleles. Given all this, we formulate our optimization problem as

\[ S_{max} = \max_{A,B \in L} \tilde{S}_{A,B} = \max_{A,B \in L} \left( \beta S_{A,B} + (1 - \beta)\,\frac{S_A + S_B}{2} \right) \tag{3.13} \]
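To make the optimization in equation 3.13 concrete, the following sketch computes the scaled score for every unordered pair of alleles and returns the best pair. It is an illustration only: the exhaustive search, the function names, and the storage of hit counters as 0/1 lists are assumptions of this example, and the two data sets are rebuilt from the read counts given for figure 3.9.

```python
from itertools import combinations_with_replacement


def scaled_score(C, a, b, beta):
    """Weighted score of equation 3.11 for the allele pair (a, b)."""
    s_a, s_b = sum(C[a]), sum(C[b])
    s_ab = sum(max(x, y) for x, y in zip(C[a], C[b]))
    return beta * s_ab + (1 - beta) * (s_a + s_b) / 2


def best_genotype(C, beta):
    """Pair of alleles with the highest scaled score (equation 3.13)."""
    pairs = combinations_with_replacement(sorted(C), 2)
    return max(pairs, key=lambda pair: scaled_score(C, pair[0], pair[1], beta))


# Figure 3.9 a): 8 reads only in A, 8 shared, 1 only in B
case_a = {"A": [1] * 16 + [0], "B": [0] * 8 + [1] * 9}
# Figure 3.9 b): 6 reads only in A, 8 shared, 5 only in B
case_b = {"A": [1] * 14 + [0] * 5, "B": [0] * 6 + [1] * 13}

print(scaled_score(case_a, "A", "B", 0.5))  # 14.75, below S_A = 16
print(best_genotype(case_a, 0.5))           # ('A', 'A')
print(best_genotype(case_b, 0.5))           # ('A', 'B')
print(best_genotype(case_a, 1.0))           # ('A', 'B'), no scaling
```

With β = 1 the unscaled pair score wins even in case a), which illustrates why a plain count favors heterozygous calls.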

3.5 Parameter training

To train the parameters we used an in-house program that uses imputation to determine the correct allele of an individual. When imputing we use the relationship data among individuals in the dataset to extend the results to the Icelandic population. The program requires a list of alleles, the most likely allele, and the likelihoods of all combinations of alleles on the Phred scale. The data was stored in VCF format.

We define $e_{A,B}$ to be the event that a read is explained by some alleles $A, B \in L$ which are not the true genotype. The probability of that event occurring is $P(e_{A,B}) = \epsilon$. In general, we estimate the number of such events as

\[ d = S_{max} - \tilde{S}_{A,B} \tag{3.14} \]

where $\tilde{S}_{A,B}$ is the scaled score of equation 3.11. We assume all such events are mutually independent, so the probability of the event occurring d times is in general

\[ P(e_{A,B}) = \epsilon^d \tag{3.15} \]

The score difference is not always an integer, but it does roughly correspond to the number of mismatched read pairs, so we believe it is a reasonable metric to use. The probability of each genotype is then estimated to be

\[ P_{A,B} = P(e_{A,B})\,P_{max} = \epsilon^d P_{max} \tag{3.16} \]

Our imputation tool uses a reversed Phred score, meaning that the allele with a score of 0 is the most likely one. So instead of using the probability of error P(e), we use $P(e_{A,B})$ in equation 2.1. We also put a limit on the Phred score so that it is never higher than 255, which means an allele is never less likely to be true than $10^{-25.5}$. The Phred score is then calculated using

\[ \mathrm{Pred}_{A,B} = \min\left(-10 \log_{10}\left(P(e_{A,B})\right), 255\right) \tag{3.17} \]

We arbitrarily chose ε = 1%, so we can simplify equation 3.17 to

\[ \mathrm{Pred}_{A,B} = \min\left(20d, 255\right) \tag{3.18} \]

In section 3.4 we defined the four parameters that we train. In total our input space is large: its size is 2,025 for each gene for a single individual. The most computationally expensive part is the alignment to the partial order graph, and only two parameters require separate alignments. These two parameters are the clipping and mismatch thresholds, which have a combined input space of 25. They are the dominant factors in computational time.

As our training set we used a dataset of 3,894 Icelanders who have been sequenced at deCODE. These individuals have all passed the in-house quality control tests with high scores. The quality control score is based on sequencing depth coverage, contamination, and more. If we assume that the time to genotype an individual for one gene is 40 CPU seconds, then the time required to genotype 3,894 individuals for six genes using 25 different parameter settings is about 270 CPU days. The parameter training was therefore carried out on multiple nodes in deCODE's computer cluster. The results contain about 47.3 million data points, which were then imputed for more than 150,000 chip-genotyped Icelanders using an in-house imputation tool.

The tool evaluates an INFO score which indicates how well the genotypes fit with the relationship data of the individuals. For example, say that Gyper predicts a father to have two HLA-A*01:01 alleles and the mother to have the HLA-A*02:01 and HLA-A*03:01 alleles. If Gyper predicted that their child has two HLA-A*68:01 alleles, the INFO score would be lowered because this is essentially impossible. Moreover, even if Gyper predicted the child to have HLA-A*01:01 and HLA-A*02:01, that might also be false if it has already been determined that the mother passed her chromosome 6 with the HLA-A*03:01 allele to the child. Therefore, in this particular example the only possible genotype of the child, given that the parents were genotyped correctly, is HLA-A*01:01 and HLA-A*03:01.

The in-house imputation tool outputs for each genotype its minor allele frequency (MAF) and an INFO score estimating how well the genotyper predicted this allele. We weight each INFO score by its MAF and find the average over all genotypes. Our assumption is that whichever combination of parameters provides the highest weighted average INFO score is the best combination.

3.6 Implementation

Gyper is implemented in C++ and depends on both SeqAn [Döring et al., 2008] and Boost. The project is open-source and maintained on GitHub. The program is licensed under the simplified BSD license. The required SeqAn version has not been released yet, but its development is ongoing. Until it is released, it is possible to use SeqAn's development branch on GitHub instead.
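Returning to the parameter training of section 3.5, the reversed Phred score of equations 3.17 and 3.18 and the size of the training workload can be checked with a short calculation. The function name and the constants' variable names are made up for this sketch; the numbers themselves are the ones stated in the text.

```python
import math


def reversed_phred(d, eps=0.01, cap=255.0):
    """Equation 3.17 with P(e) = eps**d, written as 10 * d * -log10(eps)."""
    return min(10.0 * d * -math.log10(eps), cap)


print(reversed_phred(0))     # 0.0, the most likely genotype
print(reversed_phred(2.5))   # about 50, i.e. 20 * d for eps = 1%
print(reversed_phred(13))    # 255.0, the cap applies from d = 12.75

# Parameter grid from section 3.4
taus = [20, 25, 30, 35, 37]
lambdas = list(range(10, 100, 10))
rhos = [20, 25, 30, 35, 37]
betas = [round(0.1 * k, 1) for k in range(1, 10)]

settings = len(taus) * len(lambdas) * len(rhos) * len(betas)
alignments = len(taus) * len(rhos)          # only tau and rho need realignment
individuals, genes, cpu_seconds = 3894, 6, 40

cpu_days = individuals * genes * alignments * cpu_seconds / 86400
data_points = individuals * genes * settings

print(settings, alignments)   # 2025 25
print(round(cpu_days, 1))     # 270.4
print(data_points)            # 47312100
```

Writing the score as a product of d and the logarithm of ε avoids computing ε to the power d, which would underflow to zero for large score differences.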

4 Results

4.1 Preprocessing the data

Coverage read depth

The coverage depth of the six most important genes was computed and plotted for 60 million simulated reads, 10 million reads for each gene. We found the location of each gene using RefSeq [Pruitt et al., 2014]. Table 4.1 shows the fraction of reads that mapped to optimal and suboptimal locations, and the fraction of unmapped reads.

Table 4.1: The fraction of reads mapped to the optimal locations, suboptimal locations, and no locations of the human genome reference. Columns: Gene, Optimal locations (%), Suboptimal locations (%), Unmapped reads (%). Rows: HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1, Total.

We saw a high number of reads that were mapped to suboptimal locations, which are locations outside the gene. Out of the 60 million simulated reads, roughly 17% mapped to these locations. The suboptimal locations were usually close to the gene and almost always on chromosome 6. Interestingly, in a few cases the suboptimal locations of one gene were inside the region of another gene. This shows how similar the HLA genes are to each other. A little less than 1% of the simulated reads could not be mapped anywhere on the genome. These reads were discarded.

Figure 4.1: Coverage plot for HLA-DQA1.

The HLA-DQA1 gene had 21.33% of its simulated reads mapped to suboptimal locations (Figure 4.1). It was the gene with the most unmapped reads, 3.42%. The suboptimal locations are all on chromosome 6, and most of them were mapped to a position 100,000 bases after the gene. When we say that a position is after a gene, we mean that the position is greater than the position of the gene. The length of the gene is estimated to be about six thousand base pairs.

Figure 4.2: Coverage plot for HLA-DQB1.

The coverage plot for the HLA-DQB1 gene is similar to the plot for HLA-DQA1. However, the HLA-DQB1 gene has far fewer reads mapped to a suboptimal location, 11.45% (Figure 4.2). In fact, HLA-DQB1 had the fewest reads mapped to suboptimal locations of all the genes in this study. The gene's region is set to be around 7,000 base pairs in length. Most of the mismapped reads were mapped after the gene, roughly 95,000 base pairs from it.

Figure 4.3: Coverage plot for HLA-DRB1.

HLA-DRB1 was the biggest gene we studied, covering close to 11,000 base pairs. For this gene 24.11% of the simulated reads were mapped outside the gene's region (Figure 4.3). Most of them were mapped roughly 60,000 bases before the gene. This gene had the most reads mapped to a suboptimal location of all the genes we studied. Interestingly, some reads mapped to chromosome 3 on locations 125,606, ,608,865. However, their depth is below the minimum depth coverage threshold of 50 reads, and they are therefore not shown in figure 4.3.

Figure 4.4: Coverage plot for HLA-A.

The HLA-A gene had 13.89% of its reads mapped outside the gene's specified position in the genome (Figure 4.4), including many locations with a very low depth. Most of the mismapped reads are, however, about 55,000 bases before the gene. The gene's total length is 3,414 base pairs. The other HLA class I genes are of similar size.

Figure 4.5: Coverage plot for HLA-B.

HLA-B had 16.12% of its reads mapped outside its region (Figure 4.5). The gene's length is 3,340 base pairs. Most of these reads in fact map inside HLA-C's region, which is located about 85,000 base pairs before the HLA-B gene.

Figure 4.6: Coverage plot for HLA-C.

Finally, the HLA-C gene had 15.41% of its reads mapped to suboptimal locations (Figure 4.6). Most of them mapped inside the HLA-B gene. The length of HLA-C is 3,387 base pairs.

Filtering the BAM files

Using the locations found in our coverage read depth experiment, we can now filter our alignment files, which are stored as BAM files. As mentioned before, our training set consists of 3,894 individuals, and we filtered reads from their BAM files. The average size of the BAM files before filtering was 67,220 ± 17,980 MiB, where the largest and smallest files were 158,180 and 2,110 MiB, respectively. After filtering, we created six new BAM files, one for each gene we were interested in. The largest and smallest file sizes of all six BAM files after reductions were 59.0 and MiB, respectively. The average size of all six files was 23.1 ± 6.2 MiB, or a size reduction of ± 0.006% (Figure 4.7). This huge size reduction sped up the process of genotyping significantly.

Figure 4.7: File sizes of the BAM files used in our training set before and after filtering. The files are sorted in ascending order by their file size before filtering. The file sizes are in MiB and shown on a logarithmic scale.

Bias introduction

By using alignments from regions outside each gene's region of the genome we introduce a bias in our model, as discussed earlier. Using simulated data with 5000x coverage, we summarize our results in table 4.2. On average, 102 read pairs were misaligned when using Gyper. Therefore, we estimate that one read pair is mismapped from suboptimal locations for every 50x of coverage depth. So for most HLA genes the introduced bias is negligible. The gene with the highest bias was HLA-DQA1, where both HLA-DQA1*02:01 and HLA-DQA1*03:01 misaligned 1651 reads. That corresponds to roughly one misaligned read pair for every 3x of coverage depth. This might cause some measurable bias for these two alleles. In the initial version of Gyper this bias was not taken into account.
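The two coverage figures quoted above follow from dividing the simulated coverage by the number of misaligned read pairs. The short check below uses a hypothetical helper name and only the numbers stated in the text.

```python
def coverage_per_misaligned_pair(coverage_depth, misaligned_pairs):
    """Coverage depth at which one misaligned read pair is expected."""
    return coverage_depth / misaligned_pairs


average = coverage_per_misaligned_pair(5000, 102)    # all genes on average
worst = coverage_per_misaligned_pair(5000, 1651)     # the two HLA-DQA1 alleles

print(round(average, 1))  # 49.0, roughly one pair per 50x
print(round(worst, 2))    # 3.03, roughly one pair per 3x
```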

Table 4.2: Checking for bias introduction in our filtered BAM files. Only the most common Icelandic alleles were used in this analysis. Larger numbers mean greater unwanted bias. For each of the genes HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, and HLA-DRB1, the table lists an allele a and its score $S_a$.

4.2 Training of parameters

We trained the following four parameters: (1) read clipping threshold τ, (2) minimum sequence length λ, (3) mismatch threshold ρ, and (4) zygosity factor β. In total, 3,894 individuals from the Icelandic population were genotyped with 2,025 different parameter settings. Then, using an in-house imputation tool, we imputed data for all individuals in deCODE's database and calculated the INFO score described in section 3.5. The highest weighted average imputation INFO score we got was %. It was achieved when using the following parameters: read clipping threshold τ = 30, minimum sequence length λ = 60 base pairs, mismatch threshold ρ = 25, and zygosity factor β = 0.6. These parameters were set as the default parameters of Gyper, and they were used in all further runs in this study.

A sensitivity analysis was performed to investigate how each parameter affected the score. In the analysis we changed only one parameter at a time and kept the others at their optimal values.

Quality threshold training

Figure 4.8: The weighted average impute INFO score for the two quality thresholds: the mismatch quality threshold ρ (blue) and the clipping threshold τ (red). Note that the axes do not start at zero.

Figure 4.8 shows how sensitive the INFO score is to the two quality thresholds ρ and τ. This quality score is the Q from equation 2.1. Overall these two parameters had an insignificant effect, especially the mismatch threshold ρ, whose worst weighted average score was only 0.027% lower than the highest score. That happened when ρ = 30. For the clipping threshold τ the same metric was 0.184%, at τ = 37. The most time-consuming task of Gyper was aligning sequences to the graph, so even though the clipping threshold did not improve the weighted average INFO score much, it significantly improved the speed of genotyping. For the mismatch threshold, however, Gyper gains no speed from allowing mismatches. In fact, allowing them very slightly increases the time and memory needed to align a read. So if either of those two things is an issue, it is possible to disable mismatches with minimal loss of accuracy. Using a very high quality threshold value ρ makes sure mismatches are never allowed. Nevertheless, in our runs we use ρ = 25 and τ = 30, as these values give the highest weighted average impute INFO score.

Minimum sequence length training

The minimum sequence length λ had a similarly small effect on the weighted average INFO score as the mismatch threshold ρ. The difference between the highest scoring minimum sequence length and the lowest scoring one was 0.037%, where the lowest scoring value was λ = 90 (Figure 4.9). In deCODE's data the read lengths are 100, 125, or 150 base pairs, so the maximum cropping allowed is 40-60% of the original length. If using data with much longer or much shorter reads, it is probably a good idea to change λ accordingly. As with the clipping threshold, the higher the minimum sequence length we choose, the quicker Gyper can compute the genotype. This is because a higher value decreases the number of reads that need to be aligned to the graph, since fewer reads satisfy the constraint. Therefore choosing a higher λ can lead to a slight speed increase, but in our runs we use the value that gives the highest weighted average INFO score, λ = 60.

Figure 4.9: The weighted average impute INFO score for different minimum sequence lengths. Note that the axes do not start at zero. The results show that the effect of changing the minimum sequence length is insignificant.

Zygosity factor training

Finally, the zygosity factor β had a substantial effect on the weighted impute INFO score. The scores for the values we tested ranged from % to %, a difference of 6.962%. The β value with the lowest score of the ones we tried was β = 0.1. The choice of β does not affect the computational time needed to genotype, so we recommend using Gyper with the default β = 0.6, as it performed best in our case.

Figure 4.10: The weighted average impute INFO score for different values of the zygosity factor β. Note that the axes do not start at zero.

4.3 Verification

Gyper's accuracy was verified using three different verification datasets: an in-house WGS dataset, the 1000 Genomes exome dataset, and the 1000 Genomes low coverage WGS dataset (Table 4.3). The in-house sample dataset is not publicly available, but the 1000 Genomes samples are available on their FTP site [1000Genomes, 2015]. Samples in the 1000 Genomes exome dataset have at least 20x coverage. Exomes are the part of the genome formed by exons. Exons only account for about 1% of the genome, so alignment files storing such data are much smaller at the same read coverage. Samples in the 1000 Genomes low coverage WGS dataset have at least 3x coverage. The 1000 Genomes datasets have been verified by Erlich et al. [2011] for all the HLA class I genes: HLA-A, HLA-B, and HLA-C. It is a widely used verification dataset for many HLA genotypers and is sometimes called the gold standard for HLA genotypers. We genotyped the class I genes and compared Gyper to other HLA genotypers. The called genotypes of the 1000 Genomes samples are shown in appendices A and B.

Table 4.3: Number of individuals in each verification dataset. Columns: Gene, deCODE WGS (sequenced), deCODE WGS (imputed), 1000 Genomes exome, 1000 Genomes WGS. Rows: HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1, Total.

The calling accuracy is the fraction of alleles Gyper calls correctly. Gyper calls the genotype with score $S_{max}$. However, in some cases more than one genotype has the score $S_{max}$, which leads to ambiguous results. In this case it is undefined which genotype Gyper calls. We believe a better quality measure is the coefficient of determination, $r^2$. When calculating $r^2$ we use Gyper's probability of a genotype, P. Additionally, we checked how often the zygosity called by Gyper matched the experimentally determined zygosity. Both $r^2$ and the zygosity calling accuracy are calculated only at 4 digit resolution.

deCODE's samples

deCODE has a large dataset with samples taken from the Icelandic population. Thousands of them have been sequenced using Illumina machines, aligned to the human genome, and stored in BAM files. A portion of them has also been genotyped for the six most important HLA genes using laboratory genotyping methods. The class I genes HLA-B and HLA-C were genotyped with a 2 digit resolution. The other four genes, HLA-A, HLA-DQA1, HLA-DQB1, and HLA-DRB1, were typed with a 4 digit resolution. Overall 3600 genes have been genotyped using this method, and these were used as a verification dataset for Gyper. Genotyping individuals this way is expected to have a high accuracy, but it is costly and time consuming.

For deCODE's dataset we did two kinds of tests. In the first, Gyper genotyped sequencing files for all individuals that have both been sequenced and are part of the verification data. Unfortunately, that is only the case for 18.85% of the individuals in the verification data. In the second, we genotyped the same 3,894 individuals and imputed data for the other individuals in the dataset. An important use case of Gyper is to be able to impute its output for a large population, which can then be used in association studies. This allows us to use a much larger portion of deCODE's verification data, 93.30%. We cannot use the entire verification data because the imputation is unable to determine an individual's genotype if the genotypes of its relatives are unknown.

Table 4.4: Gyper's 2 digit genotype call accuracy compared to deCODE's verification data. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy. Rows: HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1, All genes.

Table 4.5: Gyper's 4 digit genotype call accuracy compared to deCODE's verification data. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, $r^2$. Rows: HLA-A, HLA-DQA1, HLA-DQB1, HLA-DRB1, All genes.

We genotyped all individuals in our sequenced dataset using only their respective BAM files. Since each individual has two alleles, Gyper's prediction could have 0, 1, or 2 errors for each individual when compared with the verification data. We use the number of correctly predicted alleles divided by the total number of alleles predicted to estimate Gyper's accuracy. The overall genotype call accuracy of Gyper was 97.6% and 94.8% using 2 and 4 digit resolutions, respectively (Tables 4.4 and 4.5). Zygosity was correctly called in 94.2% of cases. For the imputed data Gyper's accuracy was 96.8% and 96.1% for 2 and 4 digit resolution, respectively (Tables 4.6 and 4.7), and the zygosity call accuracy was 97.1%. In tables 4.5 and 4.7 the HLA-B and HLA-C genes are excluded because only their first 2 digits are known.

Table 4.6: Gyper's 2 digit impute accuracy compared to deCODE's verification data. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy. Rows: HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1, All genes.

Table 4.7: Gyper's 4 digit impute accuracy compared to deCODE's verification data. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, $r^2$. Rows: HLA-A, HLA-DQA1, HLA-DQB1, HLA-DRB1, All genes.

1000 Genomes exome samples

In total, 180 exome BAM files were fetched from the 1000 Genomes FTP site and genotyped for the three main HLA class I genes. The samples were taken from individuals with ancestry from all over the world. Gyper's genotype call accuracy was 99.3% and 97.9% using 2 and 4 digit resolutions, respectively (Tables 4.8 and 4.9). For all genes $r^2 > 0.95$, and zygosity calling was correct in all 540 cases.
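The accuracy measure used in these tables, the number of correctly predicted alleles divided by the total number of alleles, can be sketched as follows. The helper names are hypothetical and the three example genotypes are invented for illustration; they are not taken from any of the verification datasets. Alleles are compared as unordered pairs, and a 2 digit comparison keeps only the first field of the allele name.

```python
from collections import Counter


def truncate(allele, digits):
    """'HLA-A*02:01' -> 'HLA-A*02' for digits=2, unchanged for digits=4."""
    return ":".join(allele.split(":")[: digits // 2])


def allele_errors(called, truth, digits=4):
    """Number of wrong alleles (0, 1, or 2) in one called genotype."""
    c = Counter(truncate(a, digits) for a in called)
    t = Counter(truncate(a, digits) for a in truth)
    return 2 - sum((c & t).values())    # multiset intersection = matched alleles


def accuracy(calls, truths, digits=4):
    """Correctly predicted alleles divided by all predicted alleles."""
    correct = sum(2 - allele_errors(c, t, digits) for c, t in zip(calls, truths))
    return correct / (2 * len(calls))


def zygosity_matches(called, truth, digits=4):
    """True if called and true genotype agree on homo- vs heterozygosity."""
    homo_called = len({truncate(a, digits) for a in called}) == 1
    homo_truth = len({truncate(a, digits) for a in truth}) == 1
    return homo_called == homo_truth


# Invented example genotypes for three individuals
calls = [("HLA-A*01:01", "HLA-A*02:01"),
         ("HLA-A*03:01", "HLA-A*03:01"),
         ("HLA-A*24:02", "HLA-A*02:05")]
truths = [("HLA-A*02:01", "HLA-A*01:01"),
          ("HLA-A*03:01", "HLA-A*11:01"),
          ("HLA-A*24:02", "HLA-A*02:01")]

print([allele_errors(c, t, 4) for c, t in zip(calls, truths)])  # [0, 1, 1]
print([allele_errors(c, t, 2) for c, t in zip(calls, truths)])  # [0, 1, 0]
print(accuracy(calls, truths, 4), accuracy(calls, truths, 2))
print([zygosity_matches(c, t) for c, t in zip(calls, truths)])  # [True, False, True]
```

The third individual shows why the 2 digit accuracy can be higher than the 4 digit accuracy: HLA-A*02:05 and HLA-A*02:01 agree in their first field only.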

Table 4.8: Gyper's 2 digit exome accuracy compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy. Rows: HLA-A, HLA-B, HLA-C, All genes.

Table 4.9: Gyper's 4 digit exome accuracy compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, r². Rows: HLA-A, HLA-B, HLA-C, All genes.

1000 Genomes WGS samples

We also verified Gyper using 20 low coverage WGS alignment files obtained from the 1000 Genomes project. These files have at least 3x non-duplicated aligned coverage. Here, the accuracy of Gyper was 96.7% and 95.0% for the 2 and 4 digit comparisons, respectively (Tables 4.10 and 4.11). Zygosity was correctly called for 98.3% of the individuals.

Table 4.10: Gyper's 2 digit low coverage WGS accuracy compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy. Rows: HLA-A, HLA-B, HLA-C, All genes.

Table 4.11: Gyper's 4 digit low coverage WGS accuracy compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, r². Rows: HLA-A, HLA-B, HLA-C, All genes.

Over all datasets, Gyper correctly predicted 9,145 out of 9,410 alleles at 2 digit resolution, an accuracy of 97.2%. At 4 digit resolution we predicted 3,284 out of 3,408 alleles (96.3%) correctly.
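The 4 digit tables also report r². Its exact definition is not restated in this section, so the Python sketch below rests on an assumption: that r² is the squared Pearson correlation between the called and the verified dosage (0, 1, or 2 copies) of a given allele across individuals. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def dosage(genotype: Tuple[str, str], allele: str) -> int:
    """Number of copies (0, 1, or 2) of `allele` in a two-allele genotype."""
    return sum(1 for a in genotype if a == allele)


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; NaN if either input has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


# Toy data: called and verified genotypes for four individuals.
called = [("01:01", "03:01"), ("02:01", "03:01"), ("02:01", "02:01"), ("01:01", "01:01")]
verified = [("01:01", "03:01"), ("02:01", "03:01"), ("02:01", "02:01"), ("01:01", "02:01")]
called_dosage = [dosage(g, "02:01") for g in called]
verified_dosage = [dosage(g, "02:01") for g in verified]
r2_allele = r_squared(called_dosage, verified_dosage)
```

Unlike call accuracy, a measure of this kind is sensitive to how often an allele occurs: a single wrong call on a rare allele lowers r² far more than a wrong call on a common one.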

4.4 Comparison with other DNA sequencing data genotypers

Several other HLA genotypers are publicly available, as discussed in section 2.5. One of the best current HLA genotypers is OptiType, and our focus is to compare Gyper to it in terms of both accuracy and time.

Accuracy

OptiType's accuracy was measured by both genotype and zygosity calling on samples from the 1000 Genomes project. On both datasets, Gyper showed the same or slightly better calling accuracy than OptiType.

1000 Genomes exome dataset

According to the calling results in the OptiType article [Szolek et al., 2014], its 4 digit accuracy on the exome dataset was 97.8% (Table 4.12), while Gyper's accuracy was marginally higher at 97.9%. The exome dataset had previously been typed by Major et al. [2013] with an accuracy of 93.9%. OptiType typed 1056 out of 1080 alleles correctly, and Gyper typed a single allele more correctly. Compared to Gyper, OptiType's accuracy is higher on the HLA-B gene but lower on the other two genes.

Table 4.12: OptiType's 4 digit call accuracy on the 1000 Genomes exome dataset compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, Gyper's accuracy. Gyper's accuracy by row: HLA-A 97.5%, HLA-B 96.7%, HLA-C 99.4%, All genes 97.9%.

Gyper's zygosity calling accuracy was 100.0% for these samples, compared to OptiType's 98.5%.

1000 Genomes WGS dataset

We also compared Gyper to OptiType on WGS data with low coverage. This dataset had previously been genotyped with HLAminer with 80.2% accuracy at 4 digit resolution [Warren et al., 2012]. Meanwhile, both OptiType and Gyper achieved 95% genotype calling accuracy (Table 4.13).

Table 4.13: OptiType's 4 digit WGS genotype calling accuracy compared to Erlich et al. [2011]. Columns: Gene, 0 errors, 1 error, 2 errors, Correct alleles, Accuracy, Gyper's accuracy. Gyper's accuracy by row: HLA-A 95.0%, HLA-B 92.5%, HLA-C 97.5%, All genes 95.0%.

Furthermore, both Gyper and OptiType called zygosity correctly in 59 of 60 cases (98.3%).

Time

The main feature of Gyper is its efficiency when the user stores their WGS reads in an alignment file in SAM/BAM format. Storing reads as alignments is now widespread and has become the industry standard. Raw reads in FASTQ files are hard to work with because they provide no context for the user. OptiType only supports FASTQ files, which means that users who are interested in, say, the HLA-A genotype of an individual, and who store their reads only as alignments, will need to:

1. Sort the SAM/BAM using read name.
2. Convert SAM/BAM to two separate FASTQ files.
3. Preprocess both FASTQ files using a read mapper.
4. Run OptiType on the preprocessed files.

When we tested this on an in-house 90 GiB indexed BAM file, the process took a couple of days. Meanwhile, using Gyper on the same computer, we genotyped the individual in less than 40 seconds. Despite this massive time difference, Gyper has been shown to be comparably or more accurate in calling the correct HLA genotypes.
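Much of this time difference comes from how the relevant reads are located. The Python sketch below is a toy model of that idea and not Gyper's code: a coordinate-sorted alignment file is represented as a list of (start position, read name) pairs, and the index is reduced to a sorted list of start positions searched with binary search. Real BAM indexes are more elaborate, only start positions are considered here, and the names `fetch_region` and `scan_all` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Read = Tuple[int, str]  # (alignment start position, read name)


def fetch_region(sorted_reads: List[Read], starts: List[int],
                 begin: int, end: int) -> List[Read]:
    """Reads starting in [begin, end), found by binary search on sorted starts."""
    lo = bisect_left(starts, begin)
    hi = bisect_left(starts, end)
    return sorted_reads[lo:hi]


def scan_all(reads: List[Read], begin: int, end: int) -> List[Read]:
    """Same result without an index: every read in the file has to be inspected."""
    return [read for read in reads if begin <= read[0] < end]


# Toy "alignment file", already sorted by position.
reads = [(100, "r1"), (250, "r2"), (300, "r3"), (900, "r4"), (1500, "r5")]
starts = [position for position, _ in reads]
region_reads = fetch_region(reads, starts, 250, 1000)
```

Both functions return the same reads, but the indexed lookup touches only the requested slice, whereas a FASTQ-based workflow has to process every read before any HLA read can be selected.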


5 Conclusions

5.1 Summary

Gyper is a fast HLA genotyper that outperformed all other publicly available HLA genotypers in terms of speed. Its high speed is due to the fact that Gyper only uses a very small subset of reads, those believed to be relevant to the genotyping. Gyper requires sorted and indexed alignment files to fetch these reads quickly.

Even though we genotype with such a small portion of the reads, we can still report that Gyper is one of the most accurate HLA genotypers publicly available. When compared with OptiType, which has been reported to be an accurate HLA genotyper, Gyper's genotype and zygosity call accuracy was higher than OptiType's. We also measured Gyper's accuracy using the coefficient of determination, r², on 4 digit resolution genotypings. Gyper achieved r² > 0.8 for six HLA genes and r² > 0.95 for the three main HLA class I genes, using WGS and exome samples, respectively. The previous methods we compared ourselves with did not measure their accuracy this way. We believe it is a better quality score than genotype call accuracy.

Gyper's high accuracy is achieved by creating partial order graphs for all the different alleles and then aligning read pairs to them. Aligning all read pairs independently to every available reference allele, of which there can be up to 4000, is extremely time consuming. Instead, we create a single graph and align the read pairs to that, resulting in much faster typing. Partial order graphs have proven to be a good way to represent variation, but we had not seen them used for our purpose before. They allow us to add a wide variety of constraints to the genotyper to extract as much information as we possibly can. Having such a quick genotyper lets us optimize the parameters for these constraints by training at a large scale.

Gyper is very extensible: it was created as a generic genotyper that is not restricted to HLA types. Furthermore, it can easily be extended to genotype any genomic variation (e.g. SNPs, insertions, and deletions). We are certain that it can be used in a wide variety of applications.
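The idea of replacing thousands of separate references with a single graph can be illustrated with a small Python sketch. It is not Gyper's implementation. We assume, for illustration only, that the alleles are given as rows of a multiple sequence alignment with '-' marking gaps, and that bases sharing both a column and a letter are merged into one node. The function name `build_pog` and the toy allele names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Node = Tuple[int, str]  # (alignment column, base)


def build_pog(aligned: Dict[str, str]) -> Tuple[Dict[Node, Set[Node]], Dict[Node, Set[str]]]:
    """Merge aligned sequences into one graph.

    Returns (edges, labels): edges maps a node to its successors, and labels
    maps a node to the set of alleles whose sequence passes through it.
    """
    edges: Dict[Node, Set[Node]] = defaultdict(set)
    labels: Dict[Node, Set[str]] = defaultdict(set)
    for name, row in aligned.items():
        previous = None
        for column, base in enumerate(row):
            if base == "-":
                continue  # a gap adds no node; the path skips this column
            node = (column, base)
            labels[node].add(name)
            if previous is not None:
                edges[previous].add(node)
            previous = node
    return dict(edges), dict(labels)


# Three toy alleles: a reference, one with a deletion, and one with a SNP.
alignment = {"allele1": "ACGT", "allele2": "AC-T", "allele3": "ATGT"}
pog_edges, pog_labels = build_pog(alignment)
```

Because every edge points from a lower column to a higher one, the graph has no cycles, which is what makes it a partial order. Shared sequence is stored once (four of the five nodes here are used by more than one allele), so a read aligned to the graph is compared against all alleles at the same time.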

5.2 Future work

Even with the present results, we believe Gyper can be extended to be even faster and more accurate. We have not investigated in depth why Gyper fails for some individuals, but we think a big part of it is due to the bias introduction discussed in an earlier section. Furthermore, the experimental genotypings have been found to be inaccurate for the 1000 Genomes samples [Erlich et al., 2011].

One feature that would increase Gyper's speed is saving graphs to disk and loading them again. In the current implementation a new graph is created every time Gyper is run. This feature does not have a high priority because the creation usually takes 2 seconds or less, but loading the graph directly from disk would be faster. Another feature that would improve Gyper's speed is making nodes store k-mers instead of single DNA bases. That way, we could easily create a hashmap that maps k-mers to nodes in order to quickly find potential alignments in the graph. Currently, when aligning a read we check every node in the graph. The alignment is by far the most time consuming part of Gyper, so this change would improve the typing speed significantly. Even though Gyper's top priority is accuracy, implementing these two speed improvements would allow us to train the parameters on a much larger scale, in terms of both the number of genes and the number of individuals.

To improve accuracy even further, we speculate that allowing some mismatches and indels in intron sequences could help. That would make up for the fact that an individual can have a sequence that is missing or wrong in the IMGT/HLA database. It is very possible that there are some rare variants in the Icelandic population which are missing from the database. If this is the case, it could have a big impact on the genotyping.

As for extensibility, Gyper currently only supports SAM and BAM files, so another future task is adding support for the increasingly popular CRAM files. Also, Gyper can easily be extended to support more genes, SNPs, indels, or any other genetic variants. Hopefully, we can add these features to Gyper to make it an even more appealing DNA genotyper.
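The proposed k-mer lookup can be sketched in Python as follows. This is only an illustration of the idea, not a design for Gyper: a linear sequence stands in for the graph, offsets into that sequence stand in for node identifiers, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(sequence: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer to the offsets at which it occurs in the sequence."""
    index: Dict[str, List[int]] = defaultdict(list)
    for offset in range(len(sequence) - k + 1):
        index[sequence[offset:offset + k]].append(offset)
    return dict(index)


def candidate_starts(read: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Reference offsets where the read may start, best supported first.

    Each exact k-mer match (seed) votes for the start offset it implies, so
    only these candidates need a full alignment instead of every position.
    """
    votes: Dict[int, int] = defaultdict(int)
    for j in range(len(read) - k + 1):
        for offset in index.get(read[j:j + k], ()):
            votes[offset - j] += 1
    return sorted(votes, key=lambda start: (-votes[start], start))


reference = "ACGTACGGTCA"
kmer_index = build_kmer_index(reference, 3)
starts_for_read = candidate_starts("CGGTC", kmer_index, 3)
```

Here all three k-mers of the read vote for the same start offset, so a single candidate position is checked rather than every position in the reference. In a graph, the same lookup would return candidate nodes instead of offsets.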

77 A HLA genotype call results for 1000G exome samples A.1 HLA-A exome results Table A1: Gyper s called HLA-A genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01:01 02:01:01:01 03:01 02: NA :01:01 02:01:01:01 02:01 32: NA :01:02:01 02:01:01:01 02:01 68: NA :01:113 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01 03:01:01:01 25:01 03: NA :02:01:01 01:01:01:01 24:02 01: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 03: NA :01:01 02:01:01:01 32:01 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 03:01:01:01 03:01 26: NA :01:01:01 01:01:01:01 02:01 01: NA :01:01:01 01:01:01:01 01:01 11: NA :01:01:01 01:01:01:01 01:01 01: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 02:01:01:01 24:02 02: NA :02:01:01 02:01:01:01 29:02 02: NA :01:01 11:01:01:01 25:01 11: NA :01:01:01 02:06:01:01 02:06 26: NA :01:01:01 01:01:01:01 02:01 01: NA :01:01:01 02:01:01:01 03:01 02: NA :01:02:01 01:01:01:01 01:01 31: NA :01:01:01 01:01:01:01 01:01 02: NA :01:01:01 01:01:01:01 01:01 11: NA :01:01 11:01:01:01 11:01 32: NA :01:01 02:01:01:01 02:01 25: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :01:02:01 02:01:01:01 31:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 23:01:01 23:01 24:

78 A HLA genotype call results for 1000G exome samples NA :02:01:01 01:01:01:01 24:02 01: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 03: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 24:02:01:01 29:02 24: NA :01:01:01 02:01:01:01 03:01 02: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01:01 01:01:01:01 01:01 03: NA :02:01:01 03:01:01:01 24:02 03: NA :02:01:01 02:01:01:01 02:01 24: NA :01:01:01 01:01:01:01 11:01 01: NA :02 33:01:01 33:01 66: NA :01 23:01:01 24:24 74: NA :01 36:01 36:01 36: NA :01 26:01:01:01 26:01 74: NA :01:01 23:01:01 23:01 30: NA :02:01:01 33:03:01 33:03 68: NA :02:01:01 03:01:01:01 03:01 29: NA :01 03:01:01:01 03:01 74: NA :01 03:01:01:01 03:01 36: NA :01:01:01 01:01:01:01 01:01 03: NA :03:01 24:02:01:01 24:02 33: NA :01:02:01 30:01:01 30:01 68: NA :01:01:01 02:01:01:01 02:07 11: NA :03:01 11:01:01:01 11:01 33: NA :02:01:01 11:01:01:01 11:01 24: NA :01:06 02:03:01 02:03 68: NA :07:01 02:07:01 02:07 02: NA :01:01:01 02:07:01 02:07 11: NA :01:01:01 02:06:01:01 02:06 11: NA :02:01:01 02:01:01:01 02:01 24: NA :01:01:01 24:02:01:01 24:02 26: NA :01:02:01 11:01:01:01 11:01 33: NA :03:01 02:01:01:01 02:01 33: NA :01:01 02:01:01:01 02:01 30: NA :03:01 02:01:01:01 02:01 33: NA :06:01:01 02:01:01:01 02:01 02: NA :02:01:01 03:02:01 03:02 24: NA :01:02:01 24:02:01:01 24:02 31: NA :07:01 02:07:01 02:07 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 11:01:01:01 11:01 11: NA :02:01:01 11:01:01:01 11:01 24: NA :01:02:01 24:03:01 24:02 31: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01 02:06:01:01 02:06 30: NA :03:01 02:01:01:01 02:03 02: NA :01:02:01 24:02:01:01 24:02 31: NA :01:01:01 02:03:01 02:03 03: NA :01:04 11:01:01:01 11:01 11: NA :06:01:01 01:01:01:01 01:01 02: NA :01:01 02:01:01:01 02:01 32: NA :02:01:01 11:01:01:01 11:01 24: NA :02:01:01 03:01:01:01 03:01 24: NA :01:01 24:02:01:01 24:02 30:

79 A.1 HLA-A exome results NA :01:01 30:01:01 30:01 32: NA :03:01 11:01:01:01 11:01 33: NA :01:01:01 11:01:01:01 11:01 11: NA :01:01:01 11:01:01:01 11:01 11: NA :02:01:01 11:01:01:01 11:01 24: NA :01:01:01 02:01:01:01 02:01 11: NA :01 34:02:01 34:02 36: NA :01 33:03:01 33:03 74: NA :01:01 29:02:01:01 29:02 30: NA :02:01:01 03:01:01:01 03:01 68: NA :02:01:01 30:02:01:01 30:02 30: NA :01 30:01:01 30:01 36: NA :01 34:02:01 34:02 74: NA :02:01:01 24:02:01:01 24:02 24: NA :03:01 24:02:01:01 24:02 33: NA :07:01 02:06:01:01 02:07 02: NA :02:01:01 02:06:01:01 02:06 24: NA :02:01 33:03:01 31:01 33: NA :02:01:01 02:06:01:01 02:06 24: NA :01:01 26:01:01:01 26:01 30: NA :03:01 24:02:01:01 24:02 26: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 24:02:01:01 24:02 31: NA :01:02:01 11:01:01:01 11:01 31: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 11:01:01:01 11:01 24: NA :02:01:01 24:02:01:01 24:02 24: NA :02:01:01 11:01:01:01 11:01 24: NA :03:01 26:01:01:01 26:01 26: NA :01:02:01 26:02:01 26:02 31: NA :07:01 02:06:01:01 02:07 02: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 11:01:01:01 11:01 31: NA :01:02:01 24:02:01:01 24:02 31: NA :01:01:01 24:02:01:01 24:02 26: NA :02:01:01 02:06:01:01 02:06 24: NA :02:01:01 24:02:01:01 24:02 24: NA :03:01 24:02:01:01 24:02 33: NA :01:01:01 02:01:01:01 02:01 26: NA :06:01:01 02:01:01:01 02:01 02: NA :03:01 33:03:01 33:03 33: NA :01:02:01 03:01:01:01 03:01 31: NA :01:02:01 11:01:01:01 11:02 31: NA :01:02:01 02:01:01:01 02:01 31: NA :02:01 02:07:01 02:07 26: NA :03:01 24:02:01:01 24:02 33: NA :02:01:01 02:01:01:01 02:01 24: NA :03:01 24:02:01:01 24:02 33: NA :01:02:01 24:02:01:01 24:02 31: NA :03:01 24:02:01:01 24:02 33: NA :03:01 24:02:01:01 24:02 26: NA :02:01:01 02:06:01:01 02:06 24: NA :03:01 24:04 24:04 33: NA :02:01:01 24:02:01:01 24:02 24: NA :01:01:01 02:06:01:01 02:06 11:

80 A HLA genotype call results for 1000G exome samples NA :03:01 02:01:01:01 02:01 33: NA :03:01 02:07:01 02:07 26: NA :02:01:01 24:02:01:01 24:02 24: NA :02:01:01 36:01 36:01 68: NA :01:01 23:01:01 23:01 30: NA :01 02:01:01:01 02:01 36: NA :01 23:01:01 23:01 36: NA :01:01 03:01:01:01 03:01 30: NA :03:01 23:01:01 23:01 33: NA :01 24:02:01:01 23:01 36: NA :01:01 02:02:01 02:02 23: NA :01 33:03:01 33:03 36: NA :02:01:01 66:02 66:03 68: NA :01:01 02:01:01:01 02:01 30: NA :01:01 02:01:01:01 02:01 23: NA :01 02:11:01 02:11 36: NA :01:01 02:02:01 02:02 23: NA :03:01 32:01:01 32:01 33: NA :01:01:01 03:01:01:01 03:01 26: NA :01 33:03:01 33:03 36: NA :01:01 03:01:01:01 03:01 23: NA :01:01:01 26:01:01:01 26:01 68: NA :01:01 02:02:01 02:02 23: NA :01:01:01 30:02:01:01 30:02 68: NA :02:01:01 23:01:01 23:01 30: NA :03:01 03:01:01:01 03:01 33: NA :02:01:01 02:05:01 02:05 30: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 30:02:01:01 30:02 68: NA :01:01 03:01:01:01 03:01 30: NA :01:01 03:01:01:01 03:01 33: NA :03:01 30:01:01 30:01 33: NA :03:01 30:01:01 30:01 33: NA :01 30:01:01 30:01 36: NA :02:01:01 02:01:01:01 02:01 68: NA :02:01:01 30:01:01 30:01 68:

81 A.2 HLA-B exome results A.2 HLA-B exome results Table A2: Gyper s called HLA-B genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01 07:02:01 07:02 57: NA :02:01 08:01:01 40:02 08: NA :02:01:01 40:01:02 44:02 40: NA :02:01:01 07:02:01 44:02 07: NA :01:01 08:01:01 08:01 57: NA :06:02 08:01:01 39:06 08: NA :01:01:01 08:01:01 08:01 18: NA :01:02 08:01:01 40:01 08: NA :01:01 44:02:01:01 44:02 15: NA :02:01 14:01:01 14:02 14: NA :01:01 07:02:01 08:01 07: NA :02:01 27:05:02 40:02 27: NA :01:01 27:05:02 27:05 57: NA :02:01 07:02:01 07:02 07: NA :01:01:01 08:01:01 35:01 08: NA :01:01:01 07:02:01 51:01 07: NA :01:01 08:01:01 57:01 08: NA :02:01 08:01:01 08:01 13: NA :02:01 07:02:01 07:02 07: NA :05:02 07:02:01 07:02 27: NA :01:01:01 15:01:01:01 18:01 15: NA :01:01 35:01:01:01 35:01 38: NA :02:01 07:02:01 07:02 07: NA :02:01:01 35:01:01:01 35:01 44: NA :01:02 08:01:01 08:01 40: NA :02:01:01 08:01:01 08:01 44: NA :01:01:01 50:01:01 51:01 50: NA :03:01 40:02:01 40:02 44: NA :02:01:01 40:01:02 44:02 40: NA :01:01 44:02:01:01 49:01 44: NA :01:01 07:02:01 08:01 07: NA :01:01:01 07:02:01 15:01 07: NA :01:01 07:02:01 08:01 07: NA :01:01:01 44:03:01 44:03 51: NA :01:01 07:02:01 07:02 08: NA :01:01:01 44:02:01:01 51:01 44: NA :01:01 07:02:01 08:01 07: NA :02:01:01 44:02:01:01 44:02 44: NA :01:01 44:03:01 44:03 57: NA :02:01 07:02:01 07:02 07: NA :01:01 08:01:01 08:01 55: NA :01:01 08:01:01 08:01 14: NA :01:01:01 07:02:01 39:06 07: NA :01:01:01 35:01:01:01 35:01 51: NA :01:01:01 08:01:01 56:01 08: NA :01:01 14:01:01 14:01 78: NA :01:01:01 14:02:01 14:03 58: NA :10:01 15:03:01 15:03 39:

82 A HLA genotype call results for 1000G exome samples NA :01:01 15:03:01 15:03 53: NA :01:01 15:03:01 15:03 42: NA :01:01 42:01:01 42:01 53: NA :01:01 15:10:01 15:10 53: NA :01:01:01 49:01:01 49:01 51: NA :01:01 15:03:01 15:03 49: NA :01:01:01 07:02:01 07:02 35: NA :01:01:01 40:01:02 40:01 58: NA :18:01 13:01:01 13:02 15: NA :01:01 38:02:01 38:02 46: NA :01:01:01 46:01:01 46:01 58: NA :01:02 40:01:02 40:01 40: NA :01:01 15:02:01 15:02 38: NA :01:01 13:01:01 13:01 46: NA :01:01 40:02:01 40:02 46: NA :01:01 07:05:06 07:05 54: NA :01:01 48:01:01 48:01 67: NA :01:01:01 39:01:01:01 39:01 59: NA :01:01:01 51:01:01:01 51:01 52: NA :01:01:01 48:01:01 48:01 51: NA :02:01 13:01:01 13:01 13: NA :01:01:01 08:01:01 35:01 08: NA :11:01 15:01:01:01 15:01 15: NA :01:01:01 40:01:02 40:01 51: NA :01:01 15:01:01:01 15:01 54: NA :01:01 46:01:01 46:01 54: NA :01:02 15:01:01:01 15:01 40: NA :01:01:01 48:01:01 52:01 81: NA :01:01:01 40:01:02 40:01 52: NA :06:01:01 15:18:01 15:18 40: NA :01:02 37:01:01 37:01 40: NA :06:01:01 13:02:01 13:02 40: NA :02:01 15:01:01:01 15:01 38: NA :01:01:01 48:01:01 48:01 51: NA :02:01 35:03:01 35:03 38: NA :01:01:01 46:01:01 46:01 51: NA :01:01 40:06:01:01 40:06 57: NA :01:01 40:01:02 40:01 48: NA :18:01 15:01:01:01 15:01 15: NA :01:01:01 35:03:01 35:03 51: NA :01:02 13:02:01 13:02 40: NA :01:01:01 13:02:01 13:02 52: NA :01:01:01 46:01:01 46:01 58: NA :01:01 54:01:01 54:01 54: NA :01:01:01 40:06:01:01 40:06 51: NA :01:01 40:01:02 40:01 46: NA :01:02 15:01:01:01 15:01 67: NA :01:02 44:03:01 44:03 52: NA :01:01 35:01:01:01 35:01 49: NA :01:01 18:01:01:01 18:01 49: NA :03:01 13:01:01 13:02 57: NA :03:01 15:03:01 15:03 57: NA :01:02 42:01:01 42:01 52: NA :01:01 53:01:01 53:01 53: NA :01:01:01 46:01:01 46:01 52:

83 A.2 HLA-B exome results NA :03:01 35:01:01:01 35:01 44: NA :01:01 35:01:01:01 35:01 46: NA :01:01:01 40:02:01 40:02 51: NA :03:01 15:01:01:01 15:01 44: NA :01:01:01 52:01:01:01 52:01 52: NA :02:01 07:02:01 07:02 13: NA :02:01 07:02:01 07:02 40: NA :01:01:01 40:01:02 40:01 59: NA :01:01:01 07:02:01 07:02 51: NA :02:01 40:01:01 40:01 40: NA :02:01 15:01:01:01 15:01 40: NA :01:01 52:01:01:01 52:01 67: NA :01:01 40:02:01 40:06 54: NA :01:01 51:01:01:01 51:01 54: NA :02:01 35:01:01:01 35:01 40: NA :01:01:01 40:06:01:01 40:06 56: NA :01:01:01 15:01:01:01 15:01 35: NA :01:01:01 40:02:01 40:02 51: NA :01:01:01 40:01:02 40:01 51: NA :01:01:01 48:01:01 48:01 52: NA :01:01 15:18:01 15:18 54: NA :02:01 40:02:01 40:02 40: NA :01:01:01 40:02:01 40:02 59: NA :07:01 15:01:01:01 15:07 46: NA :01:02 39:01:03 39:01 40: NA :01:01:01 07:02:01 07:02 15: NA :03:01 44:03:01 44:03 44: NA :02:09 40:02:01 40:02 44: NA :01:02 13:01:01 13:01 40: NA :01:01:01 51:01:01:01 51:01 51: NA :01:01 40:06:01:01 40:06 46: NA :03:01 40:02:01 40:02 44: NA :01:01:01 40:02:01 40:02 52: NA :01:01:01 44:03:01 44:03 52: NA :02:01 07:02:01 07:02 40: NA :03:01 15:07:01 15:07 44: NA :01:01:01 15:01:01:01 15:01 39: NA :01:01 51:01:02 52:01 54: NA :01:01 40:06:01:01 40:06 46: NA :01:01 54:01:01 54:01 67: NA :01:03 39:01:01:01 39:01 39: NA :01:02 58:01:01:01 58:01 67: NA :01:01 35:01:01:01 35:01 46: NA :01:01:01 35:01:01:01 35:01 35: NA :01 42:01:01 42:01 81: NA :01:01 35:01:01:01 35:01 53: NA :01:01 51:01:01:01 51:01 53: NA :02:01 07:02:01 07:02 07: NA :01:01 18:01:01:01 18:01 42: NA :24:01 07:02:01 07:02 39: NA :01:01 35:01:01:01 35:01 49: NA :01:01:01 07:02:01 07:02 58: NA :01:01:01 13:01:01 13:02 35: NA :01:01 42:01:01 42:01 53: NA :01:01 07:02:01 07:02 45:

84 A HLA genotype call results for 1000G exome samples NA :01:01:01 07:02:01 07:02 58: NA :01:02 35:01:01:01 35:01 52: NA :01:01 52:01:02 52:01 53: NA :01:01:01 53:01:01 53:01 58: NA :01:01:01 15:10:01 15:10 56: NA :01:01 18:01:01:01 18:01 53: NA :02:01 52:01:02 52:01 57: NA :01:01:01 15:10:01 15:10 51: NA :01:01 52:01:02 52:01 53: NA :03:01 35:01:01:01 35:01 44: NA :01:02 07:02:01 07:02 52: NA :01:01:01 15:16:01 15:16 18: NA :01:01:01 18:01:01:01 18:01 58: NA :01:01 42:01:01 42:01 53: NA :10:01 08:01:01 08:01 15: NA :01:02 15:10:01 15:10 52: NA :01:01:01 15:10:01 15:10 58: NA :01:01 42:02:01:01 42:02 45: NA :01:01 41:01:01 41:04 42: NA :03:01 53:01:01 53:01 57: NA :01:02 35:01:01:01 35:01 52: NA :03:01 35:01:01:01 35:01 57:

85 A.3 HLA-C exome results A.3 HLA-C exome results Table A3: Gyper s called HLA-C genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :02:01:01 06:02:01:01 07:02 06: NA :01:01:01 02:02:02:01 02:02 07: NA :04:01 03:04:01:01 03:04 07: NA :02:01:01 05:01:01:01 05:01 07: NA :01:01:01 06:02:01:01 07:01 06: NA :02:01:01 07:01:01:01 07:02 07: NA :03:01:01 07:01:01:01 07:01 12: NA :01:01:01 03:04:01:01 03:04 07: NA :01:01:01 03:04:01:01 05:01 03: NA :02:01:01 08:02:01:01 08:02 08: NA :02:01:01 07:01:01:01 07:01 07: NA :04:01 02:02:02:01 02:02 07: NA :02:01:01 02:02:02:01 02:02 06: NA :02:01:01 07:02:01:01 07:02 07: NA :01:01:01 04:01:01:01 04:01 07: NA :02:01:01 07:02:01:01 15:02 07: NA :01:01:01 06:02:01:01 06:02 07: NA :01:01:01 06:02:01:01 07:01 06: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 01:02:01 07:02 01: NA :03:01:01 03:03:01 12:03 03: NA :03:01:01 04:01:01:01 04:01 12: NA :02:01:01 07:02:01:01 07:02 07: NA :04:01 04:01:01:01 04:01 07: NA :01:01:01 03:04:01:01 07:01 03: NA :01:01:01 05:01:01:01 07:01 05: NA :02:01:01 06:06 15:02 06: NA :01:01 15:02:01:01 15:02 16: NA :01:01:01 03:04:01:01 03:04 05: NA :01:01:01 05:01:01:01 05:01 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01:01 03:03:01 03:03 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01 04:01:01:01 14:02 04: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01 05:01:01:01 14:02 05: NA :02:01:01 07:01:01:01 07:01 07: NA :01:01:01 05:01:01:01 05:01 05: NA :01:01 06:02:01:01 16:01 06: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01:01 07:01:01:01 07:01 08: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01 04:01:01:01 04:01 14: NA :01:01:01 01:02:01 01:02 07: NA :01:01 08:02:01:01 08:02 16: NA :02:01:01 07:01:01:01 07:01 08: NA :03:01:01 02:10 02:10 12:

86 A HLA genotype call results for 1000G exome samples NA :01:01:01 02:10 02:10 07: NA :01:01:01 02:10 02:10 17: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01 07:01:02 07:01 16: NA :01:02 02:10 02:10 07: NA :05:02 04:01:01:01 04:01 15: NA :02:01:01 03:02:02:01 03:02 07: NA :04:01 06:02:01:01 06:02 07: NA :02:01:01 01:02:01 01:02 07: NA :02:02:01 01:02:01 01:02 03: NA :02:01:01 04:01:01:01 04:01 07: NA :02:12 08:01:01 08:01 12: NA :04:01:01 01:02:01 03:04 01: NA :02:01:01 01:02:01 01:02 15: NA :02:01:01 01:02:01 01:02 07: NA :01:01 07:02:01:01 08:01 07: NA :02:01:01 01:02:01 01:02 07: NA :02:01 12:02:02 12:02 14: NA :02:01:01 08:03:01 08:03 15: NA :02:01:01 03:03:01 03:04 06: NA :02:01:01 03:03:01 03:03 07: NA :03:01 03:03:01 03:03 03: NA :02:01:01 07:02:01:01 07:02 15: NA :03:01 01:02:01 01:02 03: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 01:02:01 01:02 15: NA :02:02 08:01:01 08:01 12: NA :02:02 07:02:01:01 07:02 12: NA :01:01 08:01:01 08:01 08: NA :02:01:01 04:01:01:01 04:01 06: NA :01:01 06:02:01:01 06:02 08: NA :02:01:01 03:03:01 03:03 07: NA :02:01:01 08:01:01 08:01 15: NA :03:01:01 07:02:01:01 07:02 12: NA :02:01 07:02:01:01 07:02 14: NA :01:01 06:02:01:01 08:01 06: NA :02:01:01 04:01:01:01 04:01 07: NA :01:01 01:02:01 01:02 08: NA :02:01 12:03:01:01 12:03 14: NA :02:01:01 03:04:01:01 03:04 06: NA :02:02 06:02:01:01 06:02 12: NA :02:02:01 01:02:01 01:02 03: NA :02:01 01:02:01 01:02 01: NA :02:01 08:01:01 08:01 14: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :01:01 16:01:01 16:01 16: NA :01:02 04:01:01:01 04:01 07: NA :01:02 02:02:02:01 02:02 07: NA :01:02 03:02:02:01 03:02 07: NA :02:01:01 02:10 02:10 08: NA :01:01:01 16:01:01 16:01 17: NA :01:01:01 04:01:01:01 04:01 04: NA :02:02 01:03 01:03 12:

87 A.3 HLA-C exome results NA :03 03:03:01 03:03 14: NA :03:01 01:02:01 01:02 03: NA :02:01 03:04:01:01 03:04 14: NA :03 07:02:01:01 07:02 14: NA :02:02 12:02:02 12:02 12: NA :02:01:01 06:02:01:01 06:02 07: NA :02:01:01 03:04:01:01 03:04 07: NA :04:01:01 01:02:01 01:02 03: NA :02:01 07:02:01:01 07:02 14: NA :04:01:01 03:04:01:01 03:04 03: NA :04:01:01 03:03:01 03:03 03: NA :02:02 07:02:01:01 07:02 12: NA :01:01 01:02:01 01:02 08: NA :02:01 01:02:01 01:02 01: NA :04:01:01 03:03:01 03:03 03: NA :01:01 04:01:01:01 04:01 08: NA :03:01 03:03:01 03:03 03: NA :02:01 03:04:01:01 03:04 14: NA :02:01 07:02:01:01 07:02 14: NA :02:02 08:03:01 08:03 12: NA :04:01 01:02:01 01:02 07: NA :02:01:01 03:04:01:01 03:04 07: NA :03:01 01:02:01 01:02 03: NA :03:01 01:02:01 01:02 03: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :03 14:03 14:03 14: NA :02:01 05:01:01:01 05:01 14: NA :04:01:01 03:03:01 03:03 03: NA :02:01 14:02:01 14:02 14: NA :01:01 01:02:01 01:02 08: NA :03 03:03:01 03:03 14: NA :02:01:01 12:02:02 12:02 15: NA :03 12:02:02 12:02 14: NA :02:01:01 03:04:01:01 03:04 07: NA :03 03:03:01 03:03 14: NA :02:01:01 03:03:01 03:03 07: NA :02:02 01:02:01 01:02 12: NA :01:01 01:03 01:03 08: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 03:02:02:01 03:02 07: NA :03:01 01:02:01 01:02 03: NA :03:01 03:03:01 03:03 03: NA :01:01:01 08:04:01 08:04 17: NA :02:01:01 04:01:01:01 04:01 08: NA :01:01 06:02:01:01 06:02 16: NA :05:02 07:02:01:01 07:02 15: NA :01:01:01 02:02:02:01 02:02 17: NA :02:01:01 07:01:01:01 07:01 07: NA :01:02 04:01:01:01 04:01 07: NA :02:02 07:01:01:01 07:01 07: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01 07:02:01:01 07:02 16:

88 A HLA genotype call results for 1000G exome samples NA :02:01:01 07:01:01:01 07:01 07: NA :01:01 07:01:01:01 07:01 16: NA :01:01 04:01:01:01 04:01 16: NA :01:01:01 03:02:02:01 03:02 04: NA :04:01 01:02:01 01:02 08: NA :01:01:01 04:01:01:01 04:01 05: NA :01 16:01:01 16:01 18: NA :01:01 04:01:01:01 04:01 16: NA :01:01 04:01:01:01 04:01 16: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01 07:02:01:01 07:02 16: NA :02:01 02:02:02:01 02:02 14: NA :02:01:01 07:01:01:01 07:01 08: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01:01 03:02:02:01 03:02 07: NA :01:01 08:04:01 08:04 16: NA :04:01 03:02:02:01 03:02 08: NA :01:01:01 16:01:01 16:01 17: NA :01:01:01 17:01:01:01 17:01 17: NA :01 04:01:01:01 04:01 18: NA :01:01 04:01:01:01 04:01 16: NA :01 04:01:01:01 04:01 18:

89 B HLA genotype call results for 1000G WGS samples B.1 HLA-A WGS results Table B1: Gyper s called HLA-A genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01:01 02:01:01:01 03:01 02: NA :01:01 02:01:01:01 32:01 02: NA :02:01:01 02:01:01:01 29:02 02: NA :01:01:01 01:01:01:01 02:01 01: NA :02:01:01 23:01:01 23:01 24: NA :01 34:02:01 34:02 74: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 11:01:01:01 11:01 31: NA :02:01:01 02:06:01:01 02:06 24: NA :01:01:01 02:01:01:01 02:01 26: NA :06:01:01 02:01:01:01 02:01 02: NA :03:23 33:03:01 33:03 33: NA :01:02:01 02:01:01:01 02:01 31: NA :04 02:01:01:01 02:01 24: NA :02:01:01 36:01 36:01 68: NA :01 23:01:01 23:01 36: NA :02:01:01 66:02 66:03 68: NA :01:01:01 03:01:01:01 03:01 26: NA :01:01 02:02:01 02:02 23: NA :02:01:01 02:05:01 02:05 30:

90 B HLA genotype call results for 1000G WGS samples B.2 HLA-B WGS results Table B2: Gyper s called HLA-B genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01 07:02:01 07:02 57: NA :02:01 40:02:01 40:02 27: NA :05:02 07:02:01 07:02 27: NA :02:01 07:02:01 07:02 07: NA :01:01:01 44:03:01 44:03 51: NA :01:01 53:01:01 53:01 53: NA :01:02 13:01:04 40:06 54: NA :01:01:01 40:01:02 40:01 51: NA :02:01 40:02:01 40:02 40: NA :01:02 39:01:03 39:01 40: NA :01:01:01 07:02:01 07:02 15: NA :03:01 44:03:01 44:03 44: NA :01:01:01 51:01:01:01 51:01 51: NA :01:01:01 40:02:01 40:02 52: NA :01 08:01:01 42:01 81: NA :01:01 35:01:01:01 35:01 49: NA :01:01 42:01:01 42:01 53: NA :01:01:01 15:10:01 15:10 56: NA :01:01 52:01:02 52:01 53: NA :01:01:01 18:01:01:01 18:01 58:

91 B.3 HLA-C WGS results B.3 HLA-C WGS results Table B3: Gyper s called HLA-C genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :02:01:01 06:02:01:01 07:02 06: NA :04:01 02:02:02:01 02:02 07: NA :02:01:01 01:02:01 07:02 01: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01 04:01:01:01 14:02 04: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01:01 01:02:01 01:02 08: NA :02:01 07:02:61 07:02 14: NA :02:01:01 03:04:01:01 03:04 07: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :03 14:03 14:03 14: NA :02:01 14:02:01 14:02 14: NA :02:01:01 12:02:02 12:02 15: NA :01:01:01 08:04:01 08:04 17: NA :01:02 04:01:01:01 04:01 07: NA :01:01:01 04:01:01:01 04:01 17: NA :04:01 01:02:01 01:02 08: NA :01:01 04:01:01:01 04:01 16: NA :02:01:01 07:01:01:01 07:01 08:


93 References 1000Genomes. How to access 1000 genomes data. DataAccess, [Online; Accessed: ]. Patrick G. Beatty, Reginald A. Clift, Eric M. Mickelson, Brenda B. Nisperos, Nancy Flournoy, Paul J. Martin, Jean E. Sanders, Patricia Stewart, C. Dean Buckner, Rainer Storb, E. Donnall Thomas, and John A. Hansen. Marrow transplantation from related donors other than hla-identical siblings. New England Journal of Medicine, 313(13): , David R. Bentley, Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, Dirk J. Evers, Colin L. Barnes, and Helen R. Bignell et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53 9, International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931 45, Francis Crick. On protein synthesis. Symp. Soc. Exp. Biol., XII: , Francis Crick. Central dogma of molecular biology. Nature, 227: , Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, and 1000 Genomes Project Analysis Group. The variant call format and vcftools. Bioinformatics, pages btr330v1 btr330, Alexander Dilthey, Charles Cox, Zamin Iqbal, Matthew R Nelson, and Gil McVean. Improved genome inference in the mhc using a population reference graph. Nature Genetics, 47: , Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9:11, Robert C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32(5): , Rachel L. Erlich, Xiaoming Jia, Scott Anderson, Eric Banks, Xiaojiang Gao, Mary Carrington, Namrata Gupta, Mark A. DePristo, Matthew R. Henn, Niall J. Lennon, and Paul I.W. de Bakker. Next-generation sequencing for hla typing of class i loci. 

Structure and Function of DNA

Structure and Function of DNA Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four

More information

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!! DNA Replication & Protein Synthesis This isn t a baaaaaaaddd chapter!!! The Discovery of DNA s Structure Watson and Crick s discovery of DNA s structure was based on almost fifty years of research by other

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism )

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Biology 1406 Exam 3 Notes Structure of DNA Ch. 10 Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Proteins

More information

SUPPLEMENTARY METHODS

SUPPLEMENTARY METHODS SUPPLEMENTARY METHODS Description of parameter selection for the automated calling algorithm The first analyses of the HLA data were performed with the haploid cell lines described by Horton et al. (1).

More information

Basic Concepts of DNA, Proteins, Genes and Genomes

Basic Concepts of DNA, Proteins, Genes and Genomes Basic Concepts of DNA, Proteins, Genes and Genomes Kun-Mao Chao 1,2,3 1 Graduate Institute of Biomedical Electronics and Bioinformatics 2 Department of Computer Science and Information Engineering 3 Graduate

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources 1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools

More information

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes

More information

DNA, RNA, Protein synthesis, and Mutations. Chapters 12-13.3

DNA, RNA, Protein synthesis, and Mutations. Chapters 12-13.3 DNA, RNA, Protein synthesis, and Mutations Chapters 12-13.3 1A)Identify the components of DNA and explain its role in heredity. DNA s Role in heredity: Contains the genetic information of a cell that can

More information

Genomes and SNPs in Malaria and Sickle Cell Anemia

Genomes and SNPs in Malaria and Sickle Cell Anemia Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

12.1 The Role of DNA in Heredity

12.1 The Role of DNA in Heredity 12.1 The Role of DNA in Heredity Only in the last 50 years have scientists understood the role of DNA in heredity. That understanding began with the discovery of DNA s structure. In 1952, Rosalind Franklin

More information

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results

More information

To be able to describe polypeptide synthesis including transcription and splicing

To be able to describe polypeptide synthesis including transcription and splicing Thursday 8th March COPY LO: To be able to describe polypeptide synthesis including transcription and splicing Starter Explain the difference between transcription and translation BATS Describe and explain

More information

Next Generation Sequencing: Technology, Mapping, and Analysis

Next Generation Sequencing: Technology, Mapping, and Analysis Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected] http://tandem.bu.edu/ The Human Genome Project took

More information

Translation Study Guide

Translation Study Guide Translation Study Guide This study guide is a written version of the material you have seen presented in the replication unit. In translation, the cell uses the genetic information contained in mrna to

More information

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment Tutorial for Windows and Macintosh Preparing Your Data for NGS Alignment 2015 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) 1.734.769.7249

More information

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each

More information

Genetics Module B, Anchor 3

Genetics Module B, Anchor 3 Genetics Module B, Anchor 3 Key Concepts: - An individual s characteristics are determines by factors that are passed from one parental generation to the next. - During gamete formation, the alleles for

More information

Transcription and Translation of DNA

Transcription and Translation of DNA Transcription and Translation of DNA Genotype our genetic constitution ( makeup) is determined (controlled) by the sequence of bases in its genes Phenotype determined by the proteins synthesised when genes

More information

Introduction to NGS data analysis

Introduction to NGS data analysis Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High

More information

GenBank, Entrez, & FASTA

GenBank, Entrez, & FASTA GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,

More information

MiSeq: Imaging and Base Calling

MiSeq: Imaging and Base Calling MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please

More information

SAP HANA Enabling Genome Analysis

SAP HANA Enabling Genome Analysis SAP HANA Enabling Genome Analysis Joanna L. Kelley, PhD Postdoctoral Scholar, Stanford University Enakshi Singh, MSc HANA Product Management, SAP Labs LLC Outline Use cases Genomics review Challenges in

More information

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University HiCOMB 2014 May 19 th, Phoenix, Arizona 1 Outline

More information

Lab # 12: DNA and RNA

Lab # 12: DNA and RNA 115 116 Concepts to be explored: Structure of DNA Nucleotides Amino Acids Proteins Genetic Code Mutation RNA Transcription to RNA Translation to a Protein Figure 12. 1: DNA double helix Introduction Long

More information

Protein Synthesis. Page 41 Page 44 Page 47 Page 42 Page 45 Page 48 Page 43 Page 46 Page 49. Page 41. DNA RNA Protein. Vocabulary

Protein Synthesis. Page 41 Page 44 Page 47 Page 42 Page 45 Page 48 Page 43 Page 46 Page 49. Page 41. DNA RNA Protein. Vocabulary Protein Synthesis Vocabulary Transcription Translation Translocation Chromosomal mutation Deoxyribonucleic acid Frame shift mutation Gene expression Mutation Point mutation Page 41 Page 41 Page 44 Page

More information

2. True or False? The sequence of nucleotides in the human genome is 90.9% identical from one person to the next. False (it s 99.

2. True or False? The sequence of nucleotides in the human genome is 90.9% identical from one person to the next. False (it s 99. 1. True or False? A typical chromosome can contain several hundred to several thousand genes, arranged in linear order along the DNA molecule present in the chromosome. True 2. True or False? The sequence

More information

MAKING AN EVOLUTIONARY TREE

MAKING AN EVOLUTIONARY TREE Student manual MAKING AN EVOLUTIONARY TREE THEORY The relationship between different species can be derived from different information sources. The connection between species may turn out by similarities

More information

Chapter 6 DNA Replication

Chapter 6 DNA Replication Chapter 6 DNA Replication Each strand of the DNA double helix contains a sequence of nucleotides that is exactly complementary to the nucleotide sequence of its partner strand. Each strand can therefore

More information

Forensic DNA Testing Terminology

Forensic DNA Testing Terminology Forensic DNA Testing Terminology ABI 310 Genetic Analyzer a capillary electrophoresis instrument used by forensic DNA laboratories to separate short tandem repeat (STR) loci on the basis of their size.

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

RNA and Protein Synthesis

RNA and Protein Synthesis Name lass Date RN and Protein Synthesis Information and Heredity Q: How does information fl ow from DN to RN to direct the synthesis of proteins? 13.1 What is RN? WHT I KNOW SMPLE NSWER: RN is a nucleic

More information

A Tutorial in Genetic Sequence Classification Tools and Techniques

A Tutorial in Genetic Sequence Classification Tools and Techniques A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide

More information

a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled

a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled Biology 101 Chapter 14 Name: Fill-in-the-Blanks Which base follows the next in a strand of DNA is referred to. as the base (1) Sequence. The region of DNA that calls for the assembly of specific amino

More information

BioBoot Camp Genetics

BioBoot Camp Genetics BioBoot Camp Genetics BIO.B.1.2.1 Describe how the process of DNA replication results in the transmission and/or conservation of genetic information DNA Replication is the process of DNA being copied before

More information

Data Analysis for Ion Torrent Sequencing

Data Analysis for Ion Torrent Sequencing IFU022 v140202 Research Use Only Instructions For Use Part III Data Analysis for Ion Torrent Sequencing MANUFACTURER: Multiplicom N.V. Galileilaan 18 2845 Niel Belgium Revision date: August 21, 2014 Page

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Lectures 1 and 8 15. February 7, 2013. Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

Lectures 1 and 8 15. February 7, 2013. Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling Lectures 1 and 8 15 February 7, 2013 This is a review of the material from lectures 1 and 8 14. Note that the material from lecture 15 is not relevant for the final exam. Today we will go over the material

More information

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless

More information

13.2 Ribosomes & Protein Synthesis

13.2 Ribosomes & Protein Synthesis 13.2 Ribosomes & Protein Synthesis Introduction: *A specific sequence of bases in DNA carries the directions for forming a polypeptide, a chain of amino acids (there are 20 different types of amino acid).

More information

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three Chem 121 Chapter 22. Nucleic Acids 1. Any given nucleotide in a nucleic acid contains A) two bases and a sugar. B) one sugar, two bases and one phosphate. C) two sugars and one phosphate. D) one sugar,

More information

Overview of Eukaryotic Gene Prediction

Overview of Eukaryotic Gene Prediction Overview of Eukaryotic Gene Prediction CBB 231 / COMPSCI 261 W.H. Majoros What is DNA? Nucleus Chromosome Telomere Centromere Cell Telomere base pairs histones DNA (double helix) DNA is a Double Helix

More information

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes Page 1 of 22 Introduction Indiana students enrolled in Biology I participated in the ISTEP+: Biology I Graduation Examination

More information

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of

More information

Version 5.0 Release Notes

Version 5.0 Release Notes Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com

More information

Frequently Asked Questions Next Generation Sequencing

Frequently Asked Questions Next Generation Sequencing Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided

More information

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

An example of bioinformatics application on plant breeding projects in Rijk Zwaan An example of bioinformatics application on plant breeding projects in Rijk Zwaan Xiangyu Rao 17-08-2012 Introduction of RZ Rijk Zwaan is active worldwide as a vegetable breeding company that focuses on

More information

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very

More information

Module 1. Sequence Formats and Retrieval. Charles Steward

Module 1. Sequence Formats and Retrieval. Charles Steward The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.

More information

Simplifying Data Interpretation with Nexus Copy Number

Simplifying Data Interpretation with Nexus Copy Number Simplifying Data Interpretation with Nexus Copy Number A WHITE PAPER FROM BIODISCOVERY, INC. Rapid technological advancements, such as high-density acgh and SNP arrays as well as next-generation sequencing

More information

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik Leading Genomics Diagnostic harma Discove Collab Shanghai Cambridge, MA Reykjavik Global leadership for using the genome to create better medicine WuXi NextCODE provides a uniquely proven and integrated

More information

Introduction to next-generation sequencing data

Introduction to next-generation sequencing data Introduction to next-generation sequencing data David Simpson Centre for Experimental Medicine Queens University Belfast http://www.qub.ac.uk/research-centres/cem/ Outline History of DNA sequencing NGS

More information

Teacher Guide: Have Your DNA and Eat It Too ACTIVITY OVERVIEW. http://gslc.genetics.utah.edu

Teacher Guide: Have Your DNA and Eat It Too ACTIVITY OVERVIEW. http://gslc.genetics.utah.edu ACTIVITY OVERVIEW Abstract: Students build an edible model of DNA while learning basic DNA structure and the rules of base pairing. Module: The Basics and Beyond Prior Knowledge Needed: DNA contains heritable

More information

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE AP Biology Date SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE LEARNING OBJECTIVES Students will gain an appreciation of the physical effects of sickle cell anemia, its prevalence in the population,

More information

DNA and Forensic Science

DNA and Forensic Science DNA and Forensic Science Micah A. Luftig * Stephen Richey ** I. INTRODUCTION This paper represents a discussion of the fundamental principles of DNA technology as it applies to forensic testing. A brief

More information

Replication Study Guide

Replication Study Guide Replication Study Guide This study guide is a written version of the material you have seen presented in the replication unit. Self-reproduction is a function of life that human-engineered systems have

More information

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation PN 100-9879 A1 TECHNICAL NOTE Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation Introduction Cancer is a dynamic evolutionary process of which intratumor genetic and phenotypic

More information

Human Leukocyte Antigens - HLA

Human Leukocyte Antigens - HLA Human Leukocyte Antigens - HLA Human Leukocyte Antigens (HLA) are cell surface proteins involved in immune function. HLA molecules present antigenic peptides to generate immune defense reactions. HLA-class

More information

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University Genotyping by sequencing and data analysis Ross Whetten North Carolina State University Stein (2010) Genome Biology 11:207 More New Technology on the Horizon Genotyping By Sequencing Timeline 2007 Complexity

More information

Mitochondrial DNA Analysis

Mitochondrial DNA Analysis Mitochondrial DNA Analysis Lineage Markers Lineage markers are passed down from generation to generation without changing Except for rare mutation events They can help determine the lineage (family tree)

More information

Searching Nucleotide Databases

Searching Nucleotide Databases Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames

More information

History of DNA Sequencing & Current Applications

History of DNA Sequencing & Current Applications History of DNA Sequencing & Current Applications Christopher McLeod President & CEO, 454 Life Sciences, A Roche Company IMPORTANT NOTICE Intended Use Unless explicitly stated otherwise, all Roche Applied

More information

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here A Complete Example of Next- Gen DNA Sequencing Read Alignment Presentation Title Goes Here 1 FASTQ Format: The de- facto file format for sharing sequence read data Sequence and a per- base quality score

More information

agucacaaacgcu agugcuaguuua uaugcagucuua

agucacaaacgcu agugcuaguuua uaugcagucuua RNA Secondary Structure Prediction: The Co-transcriptional effect on RNA folding agucacaaacgcu agugcuaguuua uaugcagucuua By Conrad Godfrey Abstract RNA secondary structure prediction is an area of bioinformatics

More information

DNA Sequencing Data Compression. Michael Chung

DNA Sequencing Data Compression. Michael Chung DNA Sequencing Data Compression Michael Chung Problem DNA sequencing per dollar is increasing faster than storage capacity per dollar. Stein (2010) Data 3 billion base pairs in human genome Genomes are

More information

SNP Essentials The same SNP story

SNP Essentials The same SNP story HOW SNPS HELP RESEARCHERS FIND THE GENETIC CAUSES OF DISEASE SNP Essentials One of the findings of the Human Genome Project is that the DNA of any two people, all 3.1 billion molecules of it, is more than

More information

Gene Models & Bed format: What they represent.

Gene Models & Bed format: What they represent. GeneModels&Bedformat:Whattheyrepresent. Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically,

More information

13.4 Gene Regulation and Expression

13.4 Gene Regulation and Expression 13.4 Gene Regulation and Expression Lesson Objectives Describe gene regulation in prokaryotes. Explain how most eukaryotic genes are regulated. Relate gene regulation to development in multicellular organisms.

More information

14.3 Studying the Human Genome

14.3 Studying the Human Genome 14.3 Studying the Human Genome Lesson Objectives Summarize the methods of DNA analysis. State the goals of the Human Genome Project and explain what we have learned so far. Lesson Summary Manipulating

More information

Control of Gene Expression

Control of Gene Expression Control of Gene Expression What is Gene Expression? Gene expression is the process by which informa9on from a gene is used in the synthesis of a func9onal gene product. What is Gene Expression? Figure

More information

Text file One header line meta information lines One line : variant/position

Text file One header line meta information lines One line : variant/position Software Calling: GATK SAMTOOLS mpileup Varscan SOAP VCF format Text file One header line meta information lines One line : variant/position ##fileformat=vcfv4.1! ##filedate=20090805! ##source=myimputationprogramv3.1!

More information

Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d. 13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both

More information

CCR Biology - Chapter 9 Practice Test - Summer 2012

CCR Biology - Chapter 9 Practice Test - Summer 2012 Name: Class: Date: CCR Biology - Chapter 9 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Genetic engineering is possible

More information

Analysis of ChIP-seq data in Galaxy

Analysis of ChIP-seq data in Galaxy Analysis of ChIP-seq data in Galaxy November, 2012 Local copy: https://galaxy.wi.mit.edu/ Joint project between BaRC and IT Main site: http://main.g2.bx.psu.edu/ 1 Font Conventions Bold and blue refers

More information

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky Overview Summary Research Paper 1 Research Paper 2 Research Paper 3 Current Progress Software Designs to

More information

PRACTICE TEST QUESTIONS

PRACTICE TEST QUESTIONS PART A: MULTIPLE CHOICE QUESTIONS PRACTICE TEST QUESTIONS DNA & PROTEIN SYNTHESIS B 1. One of the functions of DNA is to A. secrete vacuoles. B. make copies of itself. C. join amino acids to each other.

More information

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS) Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS) A typical RNA Seq experiment Library construction Protocol variations Fragmentation methods RNA: nebulization,

More information

DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences

DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences DNA and the Cell Anastasios Koutsos Alexandra Manaia Julia Willingale-Theune Version 2.3 English version ELLS European Learning Laboratory for the Life Sciences Anastasios Koutsos, Alexandra Manaia and

More information

Practical Guideline for Whole Genome Sequencing

Practical Guideline for Whole Genome Sequencing Practical Guideline for Whole Genome Sequencing Disclosure Kwangsik Nho Assistant Professor Center for Neuroimaging Department of Radiology and Imaging Sciences Center for Computational Biology and Bioinformatics

More information
