Gyper: A graph-based HLA genotyper using aligned DNA sequences

Transcription

1 Gyper: A graph-based HLA genotyper using aligned DNA sequences Hannes Pétur Eggertsson Faculty of Industrial Engineering, Mechanical Engineering and Computer Science University of Iceland 2015

2

3 GYPER: A GRAPH-BASED HLA GENOTYPER USING ALIGNED DNA SEQUENCES Hannes Pétur Eggertsson 60 ECTS thesis submitted in partial fulfillment of a Magister Scientiarum degree in computational engineering Advisors Bjarni Vilhjálmur Halldórsson (decode genetics) Páll Melsted (University of Iceland) Faculty Representative Daníel Fannar Guðbjartsson Faculty of Industrial Engineering, Mechanical Engineering and Computer Science School of Engineering and Natural Sciences University of Iceland Reykjavik, September 2015

4 Gyper: A graph-based HLA genotyper using aligned DNA sequences Gyper 60 ECTS thesis submitted in partial fulfillment of a M.Sc. degree in computational engineering Copyright 2015 Hannes Pétur Eggertsson All rights reserved Faculty of Industrial Engineering, Mechanical Engineering and Computer Science School of Engineering and Natural Sciences University of Iceland Hjarðarhaga , Reykjavík, Reykjavik Iceland Telephone: Bibliographic information: Hannes Pétur Eggertsson, 2015, Gyper: A graph-based HLA genotyper using aligned DNA sequences, M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland. Printing: Háskólaprent, Fálkagata 2, 107 Reykjavík Reykjavik, Iceland, September 2015

5 Contents List of Figures List of Tables Acknowledgments iii vii ix 1 Introduction Genetics The human genome Genotyping The HLA gene family Gyper Background Next-generation sequencing Phred score Data formats FASTA format SAM and BAM format VCF The HLA reference alleles format Current DNA sequencing genotypers Methods Preprocessing the data Fetching the HLA reference alleles Regions with relevant reads Multiple sequence alignment Constructing a reference partial order graph Graph implementation Extending the POG Aligning sequences to the POG Algorithm Backtracking Genotyping constraints Parameters Read clipping Minimum sequence length i

6 Contents Mismatches Zygosity factor Parameter training Implementation Results Preprocessing the data Coverage read depth Filtering the BAM files Bias introduction Training of parameters Quality threshold training Minimum sequence length training Zygosity factor training Verification decode s samples Genomes exome samples Genomes WGS samples Comparison with other DNA sequencing data genotypers Accuracy Time Conclusions Summary Future work A HLA genotype call results for 1000G exome samples 59 A.1 HLA-A exome results A.2 HLA-B exome results A.3 HLA-C exome results B HLA genotype call results for 1000G WGS samples 71 B.1 HLA-A WGS results B.2 HLA-B WGS results B.3 HLA-C WGS results References 75 ii

7 List of Figures 1.1 The flow of information within an eukaryotic cell system. The coding sequence for the proteins are the red, green, and blue regions. Exons are spliced together to form RNA which is translated to protein Overview of Gyper s genotyping pipeline. An individual is sampled and sequenced. The sequenced reads are aligned to the human reference genome. The HLA allele references are fetched from an external database. A partial order graph is created which stores all the alleles in a single graph for each gene. Finally, we align the sequenced reads to the graph and genotype the individual A read pair. The two reads are read from one end of each fragment strands in opposite direction Two chromosome strands. A is always bonded with T and C is bonded with G. The reverse complementary of the sequence GATACCC is GGGTATC Example FASTA file entry for a sequence from the human reference genome. Here the sequence ID is chr6_ meaning the sequence is from chromosome 6 and is showing bases located at to The sequence description is omitted. Following the header is the sequencing data split by 50 characters per line An example of a reference allele format Modified version of Gyper s pipeline (Figure 1.2). The scope of the three main steps are highlighted IMGT/HLA XML wrapper example output Flow chart for finding relevant positions of the human genome using one allele. Reads overlapping the exons are simulated and mapped to the human reference genome iii

8 LIST OF FIGURES 3.4 Example input and output of our MSA. Bases colored green and red are indels (insertions or deletions) and mismatches, respectively Create a partial order reference graph using three example exon sequences: GATA, -AT-, and CATA. Blue edges show edges we traversed through, green labels represent changed or new labels on edges. Red and yellow nodes represent new and old nodes, respectively. a) Two nodes are created, initial node on level 0 and a final node on level Length(sequences[0]) + 1, which is here 5. b) The sequence GATA is added to the graph. c) The sequence -AT- is added to the graph. Note that no new nodes need to be created, only edges. We change the bit string for the edge going from A to T so it includes this sequence. d) The sequence CATA is added to the graph. The new C node will be on the same level as the G node Extended graph from figure 3.5. Here we have added three intron references to the graph. We use the previous initial node as a final node for the new extension. The red nodes are the new intron nodes. The intron sequences we have added are: TTA, GTA, and -TA. Note that the edges on the new nodes do not store a bit string like the other exon nodes Alignment of the sequence ACAT to the graph from figure 3.6. The numbers below each node denotes their topological sort order and blue edges are the path of the alignment Alleles A and B are represented as sets. Each set has the reads that allele explains. The number of reads explained by either allele A or B is S A,B = S A\B + S A B + S B\A Two examples of the distributions of reads (crosses) among two alleles, A and B. a) It is likely that the individual is homozygous even though S A = 16 and S A,B = 17. The read inside the B\A region is likely an error. b) Here however, S A\B and S B\A are relatively similar and thus we would rather expect that the individual is heterozygous Coverage plot for HLA-DQA Coverage plot for HLA-DQB Coverage plot for HLA-DRB Coverage plot for HLA-A Coverage plot for HLA-B Coverage plot for HLA-C iv

9 LIST OF FIGURES 4.7 File sizes of BAM files used in our training set before and after filtering. The files are ordered in ascending order by their file size before filtering. The file sizes are in MiB and we use logarithmic scale The weighted average impute INFO score for the two threshold qualities: The mismatch quality threshold ρ (blue) and the clipping threshold τ (red). Note that the axes do not start at zero The weighted average impute INFO score for different minimum sequence length. Note that the axes do not start at zero. The results show that changing the minimum sequence length is insignificant The weighted average impute INFO score for different values of the zygosity factor β. Note that the axes do not start at zero v

10 LIST OF FIGURES vi

11 List of Tables 1.1 The number of alleles for the six most important HLA genes known to the IMGT/HLA database The fraction of reads mapped to the optimal locations, suboptimal locations, and no locations of the human genome reference Checking for bias introduction our filtered BAM files. Only the most common Icelandic alleles were used in this analysis. Larger numbers mean greater unwanted bias Number of individuals in each verification dataset Gyper s 2 digit genotype call accuracy compared to decode s verification data Gyper s 4 digit genotype call accuracy compared to decode s verification data Gyper s 2 digit impute accuracy compared to decode s verification data Gyper s 4 digit impute accuracy compared to decode s verification data Gyper s 2 digit exome accuracy compared to Erlich et al. [2011] Gyper s 4 digit exome accuracy compared to Erlich et al. [2011] Gyper s 2 digit low coverage WGS accuracy compared to Erlich et al. [2011] Gyper s 4 digit low coverage WGS accuracy compared to Erlich et al. [2011] OptiType s 4 digit call accuracy on 1000 Genomes exon dataset compared to Erlich et al. [2011] OptiType s 4 digit WGS genotype calling accuracy compared to Erlich et al. [2011] vii

12 LIST OF TABLES A1 A2 A3 Gyper s called HLA-A genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011] Gyper s called HLA-B genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011] Gyper s called HLA-C genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011] B1 B2 B3 Gyper s called HLA-A genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011] Gyper s called HLA-B genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011] Gyper s called HLA-C genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011] viii

13 Acknowledgments First, I would like to thank my father, Eggert Guðjónsson, and mother, Bryndís Helga Hannesdóttir. Not only have they passed down to me my splendid genes, but they have also showed me love and support all my life which I am deeply grateful for. It is truly privileged to have them as my parents. I am also greatly thankful my girlfriend and soulmate Bryndís Tryggvadóttir, her support for me, even in the toughest of times you have helped me believe in myself. I also want to thank University of Iceland and their teachers for an amazing job of guiding me. It is seems absurd to get such a world class level of education while not needing to pile up student loans. The teachers at the university have really given me the strive for excellence and helped me achieve my dreams. During the work of this thesis I ve been completely stumped by how much trust and facility decode and its helpful employees have given me. Thank you everyone at decode who have given me this opportunity. A special thanks goes to my advisors, Bjarni V. Halldórsson and Páll Melsted, who have given me nothing but patience and constructive feedback. I hope to be able to work with you in the future. Thank you, all! ix

14 x

15 Abstract The major histocompatibility complex has an important role for the immune system in thousands of species. The human leukocyte antigen (HLA) is the human version of the complex and is located on the short arm of chromosome 6. Identifying an individual s HLA genotype can give valuable information for medical applications. Several techniques already exist to HLA type an individual accurately, however they remain expensive and time consuming. Recently though, there has been a breakthrough in developing methods which use Next-Generation Sequencing (NGS) data for this purpose, due to its high availability. Using these methods we can genotype individuals using purely computation on sequencing data. However, these NGS methods remain somewhat time consuming, often requiring hours or days. We introduce Gyper, a new open-source software which genotypes individuals for HLA using NGS data in a matter of seconds. Gyper s speed is obtained by selecting a small subset of reads to consider and align them to the references in a partial order graph. Using Gyper we genotyped about 4,000 Icelander in decode s dataset for the six major HLA genes. The resulting data was imputed for more than Icelanders. Comparing those results with our verification data showed over 96% accuracy. Additionally we genotyped individuals from the 1000 Genomes project for the HLA class I genes and Gyper s accuracy was always equal or higher than other HLA genotypers. These results show that Gyper can provide an impressively fast, yet reliable, genotyping results for a wide range of applications. xi

16 xii

17 Ágrip Major histocompatibility complex er hópur gena sem spilar lykilhlutverk í ónæmiskerfi þúsunda lífvera. Í manninum er hann kallaður human leukocyte antigen (HLA) og er staðsettur á styttri armi litnings 6. Vitneskja um hvaða genasamsætu einstaklingur hefur er mikilvægt fyrir svið læknisfræðinnar. Ýmsar leiðir eru til staðar sem geta til um tegund HLA genasamsætu einstaklings með hárri nákvæmni en þessar aðferðir eru dýrar og tímafrekar. Nýlega hafa margar aðferðir sprottið upp sem nota DNA raðgreiningargögn í sama tilgangi vegna mikils aðgengis að slíkum gögnum. Helsti kostur slíkra aðferða er að þær krefjast aðeins tölvu og DNA raðgreiningargagna. Galli þeirra er að þær eru tímafrekar, oft þarf klukkustundir eða jafnvel daga til að vinna úr gögnunum. Við kynnum Gyper, nýjan opinn hugbúnað sem finnur HLA genasamsætu einstaklings með raðgreiningargögnum á aðeins nokkrum tugum sekúnda. Hraði Gyper fæst með því að skoða aðeins lítinn, en mikilvægan, hlut af raðgreiningargögnunum og bera saman við genasamsæturnar á hagkvæman hátt. Við notuðum Gyper til að finna genasamsætu um Íslendinga á sex mikilvægum HLA genum. Með tengslaneti ályktuðum við HLA genasamsætur fyrir um Íslendinga. Sannprófun sýndi að Gyper náði að tilgreina rétta genasamsætu í yfir 96% tilfella. Til að bera niðurstöður okkar saman við önnur forrit, þá fundum við einnig genasamsætur einstaklinga úr 1000 Genomes verkefninu. Í öllum tilfellum var nákvæmni Gyper sú sama eða hærri. Niðurstöður okkar sýna að Gyper nær að vera mjög hraðvirkur, en á sama tíma nákvæmur, í tilgreiningu sinni á genasamsætum einstaklinga fyrir HLA svæðið og á sér mikla notkunarmöguleika. xiii

18 xiv

19 1 Introduction 1.1 Genetics The flow of genetic information within a biological system was first explained in 1956 by Francis Crick [Crick, 1956]. His explanation is called the central dogma of molecular biology [Crick, 1970]. It explains how deoxyribonucleic acid (DNA) stores genetic data about every organism and how it can be transcripted to ribonucleic acid (RNA), and how RNA is translated to protein. If one would compare the genetic structure of any two individuals of the same species you will find a few differences. These differences can be very small, such as polymorphism of only a single nucleotide (SNP), or much larger, such as duplication of whole chromosomes (e.g. down s syndrome). We can identify these differences and associate them with observed properties, such as behavior, morphology, and diseases. We call the observed properties phenotypes. In 1953, James Watson and Francis Crick discovered the structure of the DNA. It is shaped like a double helix [Watson and Crick, 1953], where each helix has the opposite direction. On each helix the genetic information is stored in four different chemical bases: Adenine (A), cytosine (C), guanine (G), and thymine (T). The bases are interconnected with hydrogen bonds, where adenine bonds with thymine and guanine bonds with cytosine. Each of the connected bases form a base pair. Genes are made up of hundreds to tens of thousands of base pairs acting as instructions to make biological molecules called proteins. The genes can acquire mutations in their DNA sequence which results in different variants of the same gene, called alleles. Different alleles can either encode for the same protein, different versions of the protein, or they might even be unable to encode for the protein at all. The coding sequence of a gene is the region which is translated to protein. Humans have eukaryotic cells, which are cells with a nucleus containing the DNA. In eukaryotic cells the coding sequence is not continuous, it is split among several parts called exons. The exons are separated by introns. On either end of a gene there is an untranslated region which marks the beginning and end of the RNA reading frame. A polymerase transcribes the DNA into a RNA and splices the exon sequences. There are also untranslated regions on each end of the RNA. Finally, protein is created by translating the RNA. Figure 1.1 1

20 1 Introduction illustrates the process for an eukaryotic gene with three exons. Figure 1.1: The flow of information within an eukaryotic cell system. The coding sequence for the proteins are the red, green, and blue regions. Exons are spliced together to form RNA which is translated to protein. 1.2 The human genome The human genome contains more than three billion base pairs which are stored on 22 pairs of chromosomes plus a single pair of sex chromosomes, hence 46 chromosomes total. In every pair of chromosomes, one came from the father and one from the mother. In humans there estimated to be 20,000-25,000 protein-encoding genes [Consortium, 2004]. Genes, that have a similar function, are said to be in the same gene family. An approximate assembly of the human genome has been created and is called the human reference genome. One such assembly is released by the Genome Reference Consortium (GRC). Its most recent release is called GRCh38p4 (build 38, patch 4). Genome browsers have been created for viewing reference genomes. One is the UCSC 2

21 1.3 Genotyping Genome Browser. They provide a good way to view and search the genome with various useful information, such as the location of genes and genetic variations found in the reference [Kent et al., 2002]. One of the main use cases of the human reference genome is using it in a local sequencing alignment. In such alignments DNA sequences, which are sampled from an individual, are aligned to the human reference genome. If the sequence can be aligned to the reference it is said to be mapped to that location. Otherwise, if the sequence does not align to the reference anywhere, the sequence is unmapped. Aligning sequences this way is computationally much easier than doing a whole-genome assembly of the individual. A commonly used local alignment tool is called BWA [Li and Durbin, 2009]. 1.3 Genotyping Genotype is the combination of individual s two alleles. The act of estimating the genotype is called genotyping. In the field of bioinformatics, one major topic is called genome-wide association studies for humans. In these studies the human genotypes are associated with diseases and other phenotypes. Much of decode s current work is performing association studies on the Icelandic population. A wide range of genotyping methods exist, each with their pros and cons. Many methods are expensive, time consuming, and require advanced laboratory instruments. Another cheaper option is to use DNA sequencing data. GATK [McKenna et al., 2010] is an example of a generic DNA sequencing data genotyper. In summary, GATK extracts aligned sequences that were mapped to a certain location of the genome to predict the genotype of an individual. For most genes GATK s predictions are accurate. However, highly variable genes often cause problems because the human reference genome is unable to represent them well. Examples of such genes are the ones in human leukocyte antigen (HLA) gene family. 1.4 The HLA gene family The HLA gene family contains over 200 known genes in three different classes: I, II, and III. It is the human version of the major histocompatibility complex (MHC) and is located on the short arm of chromosome 6. In class I there are three main genes: HLA-A, HLA-B, and HLA-C. The proteins produced from these genes are on the surface of most cells. These proteins are bound to protein chains called peptides that have been exported from within the cell. The proteins from genes in HLA class I display these peptides to the immune system and if the immune system recognizes the peptides as foreign it can react to it, such as triggering the infected cell to self destruct. In HLA class II there 3

22 1 Introduction are six main genes: HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRA, and HLA-DRB1. The HLA allele sequences are available in the IMGT/HLA database [Robinson et al., 2015]. As of release , August 2015, there are more than 13,000 known HLA alleles and increasing fast. With such a high number of alleles, genotyping is a very tough challenge. Previously, certain HLA genotypes have been associated with diseases. Such as type I diabetes [Kikuoka et al., 2001] and celiac disease [Kaukinen et al., 2002], which are both autoimmune diseases. In another recent study some HLA class II alleles have been associated with multiple sclerosis (MS) disease [Moutsianas et al., 2015]. Furthermore, many medical operations depend heavily on matching HLA genotypes between a patient and its donor, such as bone marrow transplantation [Hansen et al., 1980], and umbilical cord blood stem cell transplantation [Gluckman et al., 1999]. The best outcome of such transplants are produced when the donor is a sibling which is HLA identical to the patient. Unfortunately though, usually no such donor is available because there is only 25% chance that two siblings receive the same alleles from their parents. In these cases a transplant from a well-matched unrelated donor is required and most often has acceptable results [Beatty et al., 1985]. In recent years many new methods have been created to use sequencing data to genotype HLA. Such methods have reduced the cost of genotyping and require nothing but a computer and sequencing data. Currently one of the most promising HLA genotyper using sequencing data is OptiType [Szolek et al., 2014]. OptiType genotypes for the three main class I genes using an integer linear programming (ILP) algorithm. Its results show good accuracy. However, this method is still rather time consuming, requiring hours or days to compute. 1.5 Gyper Gyper is a novel open-source genotyper which uses sequencing data to genotype individuals. The motivation behind Gyper is to create a genotyper which genotypes highly variable genes in an accurate and fast manner. It uses aligned DNA sequencing data. The name Gyper is an abbreviation of Graph genotyper. In this initial release Gyper supports six HLA genes. They are the three main class I genes and three class II genes: HLA-DQA1, HLA-DQB1, and HLA-DRB1. Overall these six genes account for 12,534 alleles or 93.5% of all known HLA alleles (Table 1.1). This means that number of genotypes is enormous. The HLA-B gene has almost eight million allele combinations possible. The speed is achieved by storing the allele references in a partial order graph (POG) and align only relevant reads to it. A partial order graph is a directed acyclic graph made up of nodes (vertices) and edges. Each node stores a single DNA base while the edges contain 4

23 1.5 Gyper Table 1.1: The number of alleles for the six most important HLA genes known to the IMGT/HLA database. Gene HLA-A HLA-B HLA-C HLA-DQA1 HLA-DQB1 HLA-DRB1 #Alleles 3,192 3,977 2, ,764 information about which reference allele traverses through that edge. We make use of the fact that sequencing data is usually stored aligned and indexed, meaning we can quickly get reads that have been mapped to certain locations of the genome. Considering only a small subset of reads will vastly improve the speed of the genotyping, but has a risk of not taking all relevant reads into account and therefore missing potentially valuable information. We estimate the genotype of an individual by counting how many reads can be aligned to each allele reference. Figure 1.2 shows an overview of Gyper s pipeline. Two alleles need to be chosen, because we each have two chromosomes. If an individual has the same allele on both chromosomes, it is said to be homozygous for that gene otherwise it is heterozygous. One key challenge we faced was to determine this zygosity. Our method takes into account that reads from sequencing machines will often contain errors. For each base these machines estimate the likelihood of an error. We crop reads with low quality ends to reduce the number of errors in the data. Also, we allow mismatches in reads if that base is very likely to contain an error. Gyper has several parameters that were optmized using a training dataset. The training dataset was gathered from about 4,000 Icelandic people. After training, we verified Gyper using both a verification dataset from decode and a widely used verification dataset for individuals in the 1000 Genomes project. 5

24 1 Introduction Figure 1.2: Overview of Gyper s genotyping pipeline. An individual is sampled and sequenced. The sequenced reads are aligned to the human reference genome. The HLA allele references are fetched from an external database. A partial order graph is created which stores all the alleles in a single graph for each gene. Finally, we align the sequenced reads to the graph and genotype the individual. 6

25 2 Background 2.1 Next-generation sequencing Over the last several decades, many different sequencing methods have been developed. These methods determine the order of the nucleotide bases of a small DNA fragment. One widely used method, including at decode, is called next-generation sequencing (NGS). The machinery used is produced by Illumina which uses a synthetic approach to sequence individuals [Bentley et al., 2008]. Here we will describe the process used by those machines. First, a sample of DNA is obtained and labeled from an individual (e.g. blood). Then, the DNA is randomly sheared into fragments of various length. The average fragment length can be set to different values but in decode s dataset this length is typically around 500 bases and is almost always smaller than 1000 bases. To each end of the fragments the four types of bases are added in the mixture, each fluorescently labeled with a different color and attached with a blocking group. Figure 2.1: A read pair. The two reads are read from one end of each fragment strands in opposite direction. The four bases then compete for being the next base on the template DNA strand that is being sequenced. When one base has been attached all other non-incorporated molecules are washed away. After each synthesis, a photograph of the incorporated base is taken. For each base the likelihood of an error is estimated. The blocking group is then removed using a chemical process. This process is repeated until we have sequenced a certain number of read pairs. 7

26 2 Background The reads of a read pair have different directions (Figure 2.1). On each end of a DNA strand there can either be a three-prime (3 ) or a five-prime (5 ). The direction of a read can either be from 3 to 5 or from 5 to 3. Also, we have an idea of the length between the two reads because they will both start on each end of the fragment. Both reads are reading the same chromosome, but different chromosome strands. Recall that A complements with T, and C with G. That means one strand will be the reverse of the other with A changed to T, T changed to A, C changed to G, and G changed to C. (Figure 2.2). One string is said to be the reverse complement of the other. Figure 2.2: Two chromosome strands. A is always bonded with T and C is bonded with G. The reverse complementary of the sequence GATACCC is GGGTATC. The total number of read pairs can vary, but generally it is aimed to have 30x coverage or more. The coverage is the average number of times each base pair is sequenced. Having a too low coverage means multiple locations are likely to be not covered by any reads. Since we are adding a single base in each step the most common error we might expect are mismatches, it is highly unlikely that we get errors in the form in insertions or deletions. Insertion is when an extra DNA base is added to the read by mistake and a deletion is when a base is mistakenly not read by the sequencer. What distinguishes this method is that it is fast and cheap but the reads will be short, generally between 75 and 150 bases. In decode the read lengths used are 100, 120, or 150 base pairs. With today s Illumina sequencing machines we can produce hundreds or thousands of sequences concurrently. In 2010, the cost of sequencing one million bases ranged from $0.05 to $0.15 USD and the required time is up to 11 days [Vliet, 2010]. In January 2014, Illumina released a sequencing machine capable of sequencing an individual for less then $1,000 with 30x coverage or about $0.01 per one million bases in three days [Hayden, 2014]. Each sequence is an extremely small fraction of the human genome and we have no information if our read contains any mistakes or where the read was located, only that it was on some random chromosome at some random location. We can never even be sure that all bases are included in our reads, some locations might still be completely missing. Assembling these reads is a process called whole-genome assembly and it is a very difficult problem. 8

27 2.2 Phred score The problem can be looked at like this: We have many identical jigsaw puzzles cut in various ways. We remove a bunch of puzzles and also bend some of them around so they will not fit anymore and then try to solve the puzzle. To make the problem easier we can use a human reference genome. This can be thought as another very similar completed jigsaw puzzle. The idea is to compare the pieces we have with the completed puzzle to get a better idea where they can fit, changing the problem to an alignment problem. This task is still very computationally heavy and involves a lot of guessing. It is especially tricky for locations of the genome where variability is high, such as the HLA region. Moreover, there are many regions with high similarity to the HLA genes, which makes the task even harder. 2.2 Phred score Phred quality score is a scaled quality score given to each base pair of a read. It is a commonly used in sequencing technology. It measures the probability of sequencing errors of each base pair in a read. The Phred quality Q is the log value of the probability of sequencing errors P (e) calculated using the formula: Q = 10 log 10 (P (e)) (2.1) For example, if the probability of error is 1% (i.e. 99% accuracy) then the quality score given is 20. Solving for P (e) we get P (e) = 10 Q/10 (2.2) The quality Q from Illumina machines range from 0 to 41 and are represented as an ASCII character c [Johnson, 2013]. The conversion from Q to c can be done by adding 33 to Q and convert the bits from that number to an ASCII character. Character 33 in the ASCII table is! which is the lowest possible quality. Additionally, it is the first printable character in the table if we exclude the white space character. Using the lowest possible quality is never really done though, as it means there is 100% probability of an error. The highest quality character possible for Illumina machines is J, the 74th character in the ASCII table used for a quality score of 41. Using equation 2.2 it equals % probability of an error. The quality values are stored as ASCII characters for space efficiency. For example, if the quality value 41 was stored as the string 41 we would need two bytes instead of one. 9

28 2 Background 2.3 Data formats FASTA format The FASTA format is a common format for storing both nucleotide and amino acid sequences. It was created by William Pearson and was used in his program with the same name. Since then it has become the industry standard in bioinformatics for raw sequencing data [Pearson and Lipman, 1988]. It has a wide range of use cases. Frequently, it is used to store sequenced reads or even the whole human genome. Pearson did not have any particular specifications for the format, but here we will discuss how it is most commonly used. >chr6_ GGTATGCCTGTATATACAAATGTTCCAGAATCTGAAAAAATCCAAAGTTC AAAACATATCTAGTCCCAGGCATTTCAGATAAGGGATACTCTGTGTGTGT GTGTGTGTTTGTGTGTGTGTGTGTGTGTGTGTGTATGAATTTTGAGAGTG TTGTTTATTTTTATTTTGTAAATACAAGGTCTTGCTCTGTCACCCAGGCT GGAGATCAGTAGCATGATCACATTTCACTGCTGCTTTGAACTCTGACTCA AGGAATTCTCCCTCCTACCTCAGCCTCCCAAGTAGGTAGGACTCCCAAGT AGGTGGCGTACACCACCATGCCTGGCTAATTTATTTTATTTTTTCTAAAG Figure 2.3: Example FASTA file entry for a sequence from the human reference genome. Here the sequence ID is chr6_ meaning the sequence is from chromosome 6 and is showing bases located at to The sequence description is omitted. Following the header is the sequencing data split by 50 characters per line. Each entry in a FASTA file has a single header line which always starts with the greaterthan symbol >. It is followed by the sequence ID that cannot contain any white spaces. Optionally, the sequence ID is then followed by a white space and then a description of the sequence. The sequence ID often contains some useful information about the sequence. In the next line after each header is its sequencing data. There, each nucleotide is stored as a single character. For example we store the DNA chemical bases adenine, guanine, cytosine, and thymine as A, G, C, and T respectively. For easier readability, it is recommended that each line will not exceed 80 characters, thus sequences longer than 80 characters need to be split into multiple lines [Genestudio, 2015]. However, since headers are only allowed to be in a single line they can break the 80 character restriction. In addition, every line has the same length except perhaps the last one. This restriction makes it possible for FASTA readers to be able to know where the last location of the newline character can be, thus improving reading performances slightly. The sequence continues until the next header line or the end-of-file (EOF) file descriptor has been reached. An example entry in a FASTA file is shown in figure

29 2.3 Data formats Files using the FASTA format have the.fa or.fasta extensions. The FASTA format has some extensions, such as the FASTQ format which also includes a quality score for each base. FASTQ files have the.fq or.fastq extensions. To make searches in big FASTA files faster the files are often indexed and saved in FASTA index files. They have the extensions.fa.fai or.fasta.fai. For example, SAMTools [Li, 2011] indexes FASTA files SAM and BAM format For storing aligned reads the sequence alignment/map (SAM) format is often used. The SAM format has two sections: An optional header section and an alignment section. Lines in the header section always start with the In the alignment section each sequence is stored in one line. In each line there are 11 TAB delimited mandatory fields. If these fields are unavailable they must still be defined with the values 0 if the field contains a number or * if it contains a string. There are also optional fields that can be used by storing key-value pairs in the format of TAG:TYPE:VALUE where tag is the key. Many tags are predefined such as the RG tag which stores the read group of the sequence. A companion format to SAM is the binary alignment/map (BAM), it stores the exactly same data as SAM but is compressed using the BGZF library and encoded to binary. The compression is focused on performance rather than high compression [Li et al., 2009]. SAM and BAM files are stored in.sam and.bam files, respectively. Most commonly they are used in storing next-generation read alignments. The SAM specifications are constantly being updated by the SAM/BAM format specification working group which is a part of the SAMTools project group. SAMTools is a software package and library that can work with SAM and BAM files. They provide many tools for SAM/BAM files such as conversion from and to other formats, filtering, compression, decompression, sorting, indexing, merging files, and more [Li, 2011]. CRAM is another companion format to SAM. It allows for a highly efficient referencebased compression of SAM files based on Fritz et al. [2011]. The files are stored with a.cram extension. The reference-based compression algorithm is capable of storing the data in smaller files than BAM does VCF The variant call format (VCF) is a format to store genetic variation data. FASTA files work very well when displaying sequences with no variations, such as the human genome reference, but often it is necessary to be able to view variations of a reference. The VCF 11

30 2 Background format tries to do this in an efficient manner. It shares many similarities with SAM/BAM files. The variations supported include single nucleotide polymorphism (SNP), insertions, deletions, and structural variants. One reference genome is used and then variations are stored as alternative sequence to the reference. Files using the VCF format are usually saved with the.vcf extension. Similar to SAM/BAM files VCF files have a header section and data section. Lines in the header section start with the character # or ## depending on if the line stores data columns or meta information, respectively. The meta information is stored as key=value pairs. VCF files have eight mandatory columns which can be omitted with a. character. They are usually stored compressed using the BGZF library [Danecek et al., 2011]. 2.4 The HLA reference alleles format The HLA reference alleles are represented using a specific naming format. First the gene family is be defined, followed by the gene name. The family is HLA for the six genes supported by Gyper. Each type, subtype, synonymous substitution and non-coding substitution will then have an unique set of 2 digit numbers. Sometimes the name will have a suffix character that represents a change in the protein expression. (Figure 2.4) Figure 2.4: An example of a reference allele format. Every allele has at least 4 digits so the family, gene, type, and subtype fields are mandatory. Beyond that, the fields are only used when needed. Sometimes there are more than a hundred different subtypes. In these cases the subtypes have 3 digits. However, alleles with 3 digit subtypes do not have increased resolution. For example the allele HLA-A*02:102 is considered having a 4 digit resolution, not 5. Different types and subtypes mean that the two alleles will produce different proteins. This means the exon DNA sequences of the alleles are different. However, it is possible that the exon DNA sequences are not the same but they will still both translate to synonymous proteins. These substitutions are called synonymous substitutions because 12

31 2.5 Current DNA sequencing genotypers they do not affect the translated protein. Sequences that differ only in exon sequences but not in translated proteins are distinguished by the 6 digit resolutions. Finally, the 8 digit resolution will distinguish substitutions in the non-coding region, that is variations of introns. Genotyping techniques that use HLA proteins are only capable of 4 digit resolution typing. For most applications we are only interested in genotyping for different proteins, therefore 4 digit resolution is sufficient. By using DNA sequencing data we are capable of genotyping with up to 8 digit resolution. 2.5 Current DNA sequencing genotypers Before Gyper, other programs have been created with same purpose of genotyping using sequencing data. Gyper is highly influenced by one named OptiType [Szolek et al., 2014]. OptiType genotypes using an integer linear programming (ILP) algorithm. They tested their algorithm on a broad range of sequencing data: Whole genome sequencing data, RNA data, and exome sequencing data - which only contains exon data. Their results are very good, their comparison to verification data are showing 97% accuracy with 4 digit resolution. In our study we compared Gyper to OptiType, as their algorithm has shown to have better or equally good accuracy in comparison to other HLA genotyping programs. HLAminer [Warren et al., 2012] is a widely used program for genotyping with sequencing data. Their focus is on Illumina shotgun sequencing data. In summary, their method involves doing a HLA assembly using a tool called TASR. Then comparing the assembly to the reference alleles using a hash table filled with every 15-nucleotide word, or 15-mers, encountered. The program genotypes very quickly but in comparison to the other tools their accuracy is not very high. For example, their accuracy compared to OptiType was reported to be about 15% lower [Szolek et al., 2014]. ATHLATES [Liu et al., 2013] is another tool that uses HLA assembly to genotype. Their methods rely on accurate recovery of the exon sequences via the assembly. It uses many of the ideas behind HLAminer but improves them and deliver a much better results. Their reported accuracy 74 out of 75 allelic pairs or about 99% overall accuracy. OptiType compared themselves with ATHLATES and both programs showed a similar accuracy. However, the sample size was very low, only 3 genes typed for 11 individuals [Szolek et al., 2014]. Recently, a HLA genotyper was created by Major et al. [2013]. It filters the DNA sequencing data and then matches those reads to the exon sequences of the HLA alleles. In the alignment they discard reads with too many mismatches or any indels. Their method achieved a good HLA call accuracy of 94.2% for an exome dataset. However, on the same dataset OptiType achieved even better results. 13

32 2 Background 14

33 3 Methods Figure 3.1: Modified version of Gyper s pipeline (Figure 1.2). The scope of the three main steps are highlighted. Our method has three main steps: Preprocess the data (Section 3.1). Create the partial order graph (Section 3.2). Align reads to the partial order graph to genotype the individual (Section 3.3). Figure 3.1 declares the scopes of these steps. The preprocessing step is released as a separate contribution. Gyper s scope only covers the creation of partial order graph and alignment to it. In section 3.4 we discuss the different adjustable parameters Gyper has and in section 3.5 how we train these parameters. Finally, in section 3.6 we will touch on the implementation and availability of Gyper. 15

34 3 Methods 3.1 Preprocessing the data The preprocessing is released separately from Gyper because it depends on external libraries, which we did not want to add as dependencies. Furthermore, the processes in this step are exclusive to the HLA genes Fetching the HLA reference alleles The HLA reference alleles were fetched from the IMGT/HLA database [Robinson et al., 2015]. The database is released in various formats: FASTA, flat files, MSF, PIR, and XML. We used the XML database. To reduce the size of the database, there are few alleles that work as a template for other alleles. Template alleles have information about its full sequence with all features (exons, introns and untranslated regions) and references itself as a template. Each non-template allele only specifies features that differ from its template allele. This structure made it tedious to fetch the allele sequences for all non-template alleles manually. To simplify the process we created a XML wrapper which depends on RapidXML, a fast XML DOM parser library in C++ [Kalicinski, 2015]. Our wrapper fetches features of all alleles and outputs them in a FASTA file. Figure 3.2 shows an example output. In the output FASTA files the first two letters of the header is the feature identification code. The allowed codes are 5P, P 3, E[0 9], and I[0 9]. Features with 5P and P 3 are the untranslated regions closer to the 5 and 3 strand ends, respectively. E[0 9] represents exons 0 through 9 and I[0 9] introns 0 through 9. The number of exons differ among genes. >E5_HLA-DQB1*03:02:01 GACCTCAAGGGCCTCCACCAGCAG >I5_HLA-DQB1*03:02:01 GTGATATTTCAGCCATGAGCCAGTGTGGGGGGGCACAGGTGTAAGAGGGAAGA... Figure 3.2: IMGT/HLA XML wrapper example output. The direction of the reference alleles are from 5 to 3. To get the full sequence of the allele the features should be concatenated in the following order: 5P, E1, I1, E2,..., EN, P 3. Where N is the total number of exons for a given gene. 16

35 3.1 Preprocessing the data Regions with relevant reads The HLA gene cluster is only a very small part of the whole human genome, and thus only a very small portion of the sequenced reads is relevant to HLA genotyping. We could expect a huge reduction in the time taken to perform the genotyping, if we would only need to consider those reads. However, finding the relevant reads is difficult. One solution is to use every read pair that has been aligned to the gene s position on the reference genome, such as GATK does [McKenna et al., 2010]. Our concern is that such an approach is unable to have accurate results for the HLA region due to its high variability. Reads from the HLA region are often misaligned since the human reference genome cannot represent those regions well. Another solution would be to take every sequenced read into account, such as OptiType does [Szolek et al., 2014]. Their method involves considering every read pair by first mapping them to all HLA alleles, and then use all the mapped reads to genotype the individual. Their assumption is that the true genotype can explain the most reads. But we have two concerns. First, taking every read into account takes a very long time, hours or days using whole genome sequencing (WGS) data. Only a very small portion of the reads are even relevant to the genotyper. Second, using reads that are mapped elsewhere risks biasing the results. For example if a sequence outside the HLA is very similar to one or more alleles but not all, those particular alleles will be biased to score higher than the other alleles. This results in the genotyper predicting the alleles most similar to the reference too often. Regions of interest We propose a different method. Our method makes use of the fact the reads are usually stored aligned in alignment files (SAM/BAM files). All read pairs belong in one of three categories: Both reads are mapped to the reference genome at locations l 1 and l 2, and usually we expect them to be relatively close to each other, we assume 0 l 2 l One read is unmapped, but the other read in the pair is mapped to location l. By convention, both reads are marked to be located at l. Both reads are unmapped. The reads are both marked as unmapped. Our goal is to find regions of the genome which are likely to have HLA relevant reads mapped to them. The following three steps explain our process: (1) Simulate reads that overlap the alleles exons. (2) Map the simulated reads to the human reference genome. 17

36 3 Methods (3) Check where the simulated reads map. Figure 3.3 shows the process for a single allele. The aligner will map to a correct location, a suboptimal location outside HLA, or not map the read to any position of the genome. We are only interested in reads overlapping the exons, since the exons determine the first 6 digits of the HLA genotype. All aligned positions are extracted and used as the regions of interest. When genotyping, we only use reads located inside the regions of interest. We hope that filtering the data this way will decrease the overall computational time of the genotyping without decreasing its accuracy significantly. Figure 3.3: Flow chart for finding relevant positions of the human genome using one allele. Reads overlapping the exons are simulated and mapped to the human reference genome. Algorithm Algorithm 1 shows the pseudo code behind this technique. For step (1) we gather all sequences for both exons and introns from our fetched reference HLA alleles. As we are only interested in the exons, we look for reads overlapping the exons completely or partly. In our case we simulated reads of length 100 base pairs so we used 100 bases from each intron. If, for example, the exon is of length 20 base pairs, the complete sequence will be 100 intron base pairs, 20 exon base pairs, and then another 100 base pairs sequence for a total of 220 base pairs. Reads are simulated using a read simulator called Mason, which is provided as part of the C++ library SeqAn [Döring et al., 2008, Holtgrewe, 2010]. Mason simulates sequencing 18

37 3.1 Preprocessing the data data in a realistic manner, so some sequences will contain errors. Mason was run with the Illumina machine settings so the chance of a mismatch will be much higher than an insert or deletion. Step (2) is by far the most computationally intensive step, requiring each simulated fragment to be mapped to the human reference genome. In this step we explicitly call the Burrows-Wheeler Aligner (BWA) to map the fragments [Li and Durbin, 2009]. The BWA- MEM algorithm was chosen for the task. Its output is a SAM file with our simulated reads aligned to the human reference genome. We discarded reads that the BWA could not align to the reference. Steps (1) and (2) were implemented in a single C++ program. Step (3) requires sorting all mapped positions from the previous step and keeping them in two sorted lists, one for all the starting positions and one for all the end positions. The end position is required because sometimes BWA will only align a partial read. By looping through both lists we can know at any point what the coverage depth is. The coverage depth of a base is the number of reads that overlap that base. The algorithm starts on the start position, and checks the coverage of each base until we have reached the final position. The first item in the starting position list is the start position, and the last item in the end position list is the final position. Most bases of the genome have no coverage depth, but we can always skip to the next item in the start position list when that occurs. Checking for bias introduction When extracting reads outside the HLA region we might be introducing a bias to the results for some of the HLA alleles. For example, if the suboptimal regions contains sequences which are very similar to some alleles, but not all. If that happens those alleles will have a biased score. We can correct for this bias by adding a parameter in Gyper which lowers the scores of the biased alleles. To measure the bias we simulate read pairs from all suboptimal locations we found and use them in Gyper. We can then get an estimate of how often this event happened. Reads are simulated with Mason using Illumina settings with 3000x coverage depth. Results can be found in section Multiple sequence alignment In section we parsed all features of the HLA alleles. Aligning the sequences in a multiple sequence alignment (MSA) will be beneficial to the algorithm that creates the partial order graph for two reasons: First, all the sequences have the same length. Second, 19

38 3 Methods Input : All reference alleles for each genotype (alleles), number of random fragments (n), and length of intron sequences to use (intronlength). Output: All locations of the genome and their coverage depth, excluding all location with no depth. 1 exons Fetch all exons using our IMGT/HLA XML parser. ; Step 1 2 Add sequences from surrounding introns of up to length intronlength on each end of the exons.; 3 fragments Generate n random simulated reads from exons.; 4 startpositions, endpositions Array with all zeros of length n ; Step 2 5 for i 0 to n 1 do 6 startpositions[i] Map(fragments[i]); 7 endpositions[i] startpositions[i] + Length(fragments[i]) 8 end 9 startpositions Sort(startPositions) ; Step 3 10 endpositions Sort(endPositions); 11 location, depth Empty array; 12 i, j, k, d 0; 13 while k endpositions[n-1] do 14 if startpositions[i] k then 15 i + +; 16 d + +; 17 continue 18 end 19 if endpositions[j] k then 20 j + +; 21 d ; 22 continue 23 end 24 if d is 0 then 25 k startpositions[i] 26 end 27 else 28 Add k to location; 29 Add d to depth; 30 k + +; 31 end 32 end Algorithm 1: Extracting regions from the genome which are likely to contain misaligned HLA reads. We do that by first simulate HLA reads, map them to the genome, and determine which locations they were mapped to. 20

39 3.1 Preprocessing the data most sequences will share a base, so we can use a single node to represent a base on most or all alleles. MSA has been used before for this purpose, for example Dilthey et al. [2015]. We align the sequences by inserting gaps into the them. In the graph, gaps are represented by the absence of a node so adding gaps will not increase the number of nodes in the graph. In fact, by aligning the sequences we can often use fewer nodes to represent the sequences. So the MSA is effectively reducing the number of nodes required for the graph. However, an optimal MSA is massive for this case. The worst case has close to 4,000 different sequences with an average length of over two thousand base pairs (N 4000, L > 2000). Optimal MSA has proven to be a NP-hard problem [Wang and Jiang, 1994] and with a complexity of O(L N ). So an optimal MSA would take absurdly long time for our case and is not necessary. Approximate MSA is more appropriate. We use MUSCLE, which does an approximate MSA with high accuracy. The complexity of MUSCLE s MSA algorithm is O(N 3 L + NL 2 ) [Edgar, 2004]. Figure 3.4 shows an example of our data before and after the MSA. The input and output are both in FASTA format, and the output sometimes contains dashes which represent gaps. Specifically, these are sequences shown are the 4th intron for alleles HLA- C*15:02:01, HLA-C*16:01:01, and HLA-C*17:01:01:01. For clearer representation we have colored mismatches as red, and indels as green. For these three reference alleles, HLA-C*15:02:01 and HLA-C*16:01:01 only have one base mismatch between them. Sequences HLA-C*16:01:01 and HLA-C*17:01:01:01 however, there are 2 bases mismatched and another 3 bases inserted into HLA-C*17:01:01:01. 21

40 3 Methods Input: >I4_HLA*15:02:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTGGAGCCCTTCAGCTGGGT CAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG >I4_HLA-C*16:01:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTGGAGCCCTTCAGCCGGGT CAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG >I4_HLA-C*17:01:01:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTTCTGGAGCCCTTCAGCCG GGTCAGGGCTGAGGCTTGGGTGTAAGGGCCCCTCACCTTCCCCTCCTTTCCCAG Output: >I4_HLA-C*15:02:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTC---CTGGAGCCCTTCAGCTG GGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG >I4_HLA-C*16:01:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTC---CTGGAGCCCTTCAGCCG GGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCCCCTCCTTTCCCAG >I4_HLA-C*17:01:01:01 GTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAGTCCTTCTGGAGCCCTTCAGCCG GGTCAGGGCTGAGGCTTGGGTGTAAGGGCCCCTCACCTTCCCCTCCTTTCCCAG Figure 3.4: Example input and output of our MSA. Bases colored green and red are indels (insertions or deletions) and mismatches, respectively. 22

41 3.2 Constructing a reference partial order graph 3.2 Constructing a reference partial order graph Typically sequences are aligned to a single reference genome. For genes with high structural and sequence diversity this can lead to poor characterization of such regions. To represent these genes, such as the HLA genes, we used a partial order graph (POG) Graph implementation In the POG we both store nodes (vertices) and directed edges. By doing a MSA we ensure that every feature has the same length. The features are added to the graph one by one. Each node has an integer that stores the level and a single DNA base value, that is A, T, G, or C. We surround each feature with a node that has no DNA base. The level corresponds to the location of that base in the sequence. No two nodes connected by a direct path can have the same level. If n 1 and n 2 are two connected nodes in the graph the edge will always be directed to the node with the higher level. This means that the last node of the graph, the one with no outgoing edges, has the highest level. The first node of the graph, the one with no incoming edges, has a level 0. Each edge stores reference to both nodes it connects with. Furthermore, if the edge is inside an exon it also stores a bit string. The bit string has a length equal to the number of references used. The purpose of these bit strings will be discussed when we align sequences to the graph and genotype in section 3.3. When an edge is created all bits are initialized to 0 except for the one representing the exon that is being added. When adding another sequence to the graph which shares a path with an earlier sequence, we flip the corresponding bit to 1. By storing it this way we never need more than ceil(r/8) bytes memory per edge, where r is the number of references used in the graph. If, for example, we had 1000 references and 10,000 edges we would only need 1.25 megabytes to store this information. Additionally, we only need to store it for exons. Algorithm 2 shows the pseudocode behind this method and figure 3.5 shows an example how it creates a graph for three exon sequences: GATA, -AT-, and CATA. We add exons and introns sequences separately because edge creations are handled differently, exons have a bit string while introns do not. There is high amount of missing and unreliable intron data in the IMGT/HLA database, compared to the exons. So instead of trying to reuse data from other introns we simply allow reads to align freely within the intron. With such free alignment there is no need for the bit strings, so they are omitted on introns. 23

42 3 Methods Input : Fasta file with aligned sequences of some feature. Output: Partial order graph we can use as a reference. 1 graph empty partial order graph; 2 sequences read sequences from Fasta file; 3 previous new node with level 0.; 4 endnode new node with level Length(sequences[0]) + 1.; 5 Add previous and endnode to graph.; 6 for sequence in sequences do 7 for pos in Length(sequence) do 8 if sequence[pos] is a gap then 9 continue; 10 end 11 node node with letter sequence[pos] and level pos + 1.; 12 if node exists in graph then 13 next node; 14 if No edge exists from previous to next then 15 Add Edge between previous and next with bit pos flipped on. 16 end 17 else 18 Flip bit pos on for edge from previous to next. 19 end 20 end 21 else 22 Add node to graph.; 23 Add Edge between previous and next with bit pos flipped on. 24 end 25 previous next; 26 end 27 if No edge exists between previous and endnode then 28 Add Edge between previous and endnode with bit pos flipped on. 29 end 30 else 31 Flip bit pos on for edge from previous to next. 32 end 33 end Algorithm 2: Creating a reference partial order graph for a single exon. Creating a graph for an intron is similar but then we do not store the bit string on edges, hence no need to create or modify them. 24

43 3.2 Constructing a reference partial order graph Levels a) b) G A T A 001 c) G A T A d) G A T A C 101 Figure 3.5: Create a partial order reference graph using three example exon sequences: GATA, -AT-, and CATA. Blue edges show edges we traversed through, green labels represent changed or new labels on edges. Red and yellow nodes represent new and old nodes, respectively. a) Two nodes are created, initial node on level 0 and a final node on level Length(sequences[0]) + 1, which is here 5. b) The sequence GATA is added to the graph. c) The sequence -AT- is added to the graph. Note that no new nodes need to be created, only edges. We change the bit string for the edge going from A to T so it includes this sequence. d) The sequence CATA is added to the graph. The new C node will be on the same level as the G node. 25

44 3 Methods Extending the POG Figure 3.6 shows how the partial order graph can be extended with three intron sequences: TTA, -TA, and GTA. In our implementation we extend the graph by adding sequences connecting to the lowest level node. So when creating a graph for the HLA genes we add features in reversed order: First the 3 untranslated end of the allele, then the last exon, then the last intron, and so on until we have added the 5 untranslated region. The nodes connecting the features are always free to traverse through so they do not need to store any DNA base. We keep track of the level of these nodes. When we are aligning to the graph we can check the level of the node we are aligning to. So at any point in the alignment, we know if we are aligning to an exon or an intron. 26

45 3.2 Constructing a reference partial order graph T A T G G A T A C Figure 3.6: Extended graph from figure 3.5. Here we have added three intron references to the graph. We use the previous initial node as a final node for the new extension. The red nodes are the new intron nodes. The intron sequences we have added are: TTA, GTA, and -TA. Note that the edges on the new nodes do not store a bit string like the other exon nodes. 27

46 3 Methods 3.3 Aligning sequences to the POG When aligning sequences to our graph the goal is to determine which pair of alleles can explain the most reads. Our assumption is that this pair of alleles is the individual s true allele pair. We have a backtracker which both to keeps track of the read s path through the graph. The match is simply an array of boolean values equal to the length of the read. This array initially has all bits set to 0, which are flipped if a match is found. The backtracker also stores which node is the previous node of the match, so we know how the match traversed through the graph. The size of the two arrays are the same as the length of the read we are aligning. In our alignment we are free to start anywhere and end anywhere, it is a semi-global alignment where both of the ends of the references are free. Generally we would need to have our two arrays equal to the length of the read plus one, but since we can start anywhere the top boolean will always be true so there is no need to store that. Since the graph is acyclic we can always find topological sorting of the nodes, meaning if there is a node n 1 which depends on the results of node n 2, n 2 will never depend on n 1 or any of n 1 s dependencies. Also there is one, and only one, node that does not depend on any other nodes. That node is the first node in our topological sort Algorithm We use a dynamic programming algorithm to align sequences to the graph as shown in algorithm 3. Our algorithm requires O(nm) time in the worst case, where n is the number of edges and m is the length of the sequence. It visits every edge on the graph and compares the DNA base of the sequence to the target node. When traversing through the graph it is always guaranteed that we have already calculated the current node s dependencies. When matches are found we only need to change a boolean value in the array and store a reference to the previous node. The aligner can find a list of nodes where the sequence was matched, because the sequence can be aligned to more than one location. If we align both reads in a read pair to a location that is very far from each other we discard that read pair. The highest distance between reads allowed is arbitrarily chosen to be 800 base pairs. If the read is aligned to multiple locations we need to choose the best one to use. We chose the best distance between two reads to be 350 base pairs. These two values were estimated from the 99.99% highest and the most common insert sizes of decode s BAM files, respectively. Figure 3.7 shows an example of an alignment of the read ACAT to a graph. The graph 28

47 3.3 Aligning sequences to the POG matches the read to nodes 4 (A), 5 (no base), 7 (C), 8 (A), and 9 (T). Here, a reference to the 9th node will be the only node in the output list. Then, when the alignment has finished, we backtrack from that node only. Backtracking has a complexity of O(m) in the worst case Backtracking The backtracking algorithm uses the backtracker to determine which reference alleles can explain the aligned read. It picks a node from the alignment algorithm and starts backtracking there. Initially it has a bit string of length equal to the amount of reference alleles used with all bits flipped to 1. Then, as the backtracking algorithm travels backwards through the graph it will perform a bitwise AND operation for every edge with a bit string. That is, if we are traversing inside an exon we will perform the AND operation. What we end with is a bit string whose bits are only flipped on if the read followed the corresponding reference allele exactly. In other words we say that those reference allele explain the read. If however we are traversing through an intron we do not know which reference allele created that edge on the graph, we would rather say that any reference can explain the read for the reasons we discussed before. Continuing with the previous example (Figure 3.6) the backtracking of the sequence ACAT would generate the following calculations: 111 AND 111 AND 100 AND 100 = 100 The convention when using bit string is to say they the rightmost bit is the first one. So the bit string 100 means that only the third reference explains the read. The exon of the third reference was CATA which ACAT overlaps. ACAT does not overlap the other two exons. If however we aligned the read TTA, the read maps to nodes 1, 3, and 4. Since these nodes are not connected to an edge with a bit string we will not require any AND operations and simply have the bit string: 111 Here we simply say that every allele can explain the read. 29

48 3 Methods Input : A sequenced read from an individual who is being genotyped for gene gene. Output: backtracker we can use to find all references that explain the read and an array of nodes where alignments end at. 1 graph a partial order graph for gene.; 2 order TopologicalSort(graph); 3 backtracker array for each node in graph storing both match (true or false) and previousnode.; 4 nodes empty array. 5 for source in order do 6 for edge in edges directed from source do 7 target edge s target.; 8 if target stores dna then 9 if read[0] == target.dna then 10 backtracker[target].match[0] = true; 11 backtracker[target].previousnode(0) = source; 12 end 13 pos = 1; 14 while pos is smaller than the length of the read do 15 if backtracker[source].match and 16 read[pos] == target.dna then 17 backtracker[target].match[pos] = true; 18 backtracker[target].previousnode(pos) = source; 19 end 20 Increment pos by 1.; 21 end 22 if backtracker[target].match[length(read)] then 23 Add target to nodes. 24 end 25 end 26 else 27 backtracker[target].match array of true values.; 28 backtracker[target].previousnode array of source.; 29 end 30 end 31 end Algorithm 3: Aligning a single sequence to the reference graph. 30

49 3.3 Aligning sequences to the POG T 1 A 4 T 3 0 G G A 8 T 9 A C 7 Figure 3.7: Alignment of the sequence ACAT to the graph from figure 3.6. The numbers below each node denotes their topological sort order and blue edges are the path of the alignment. 31

50 3 Methods Genotyping constraints When genotyping we estimate how likely it is that an individual has a particular allele depending on how many reads that allele can explain. Read and its complementary are both aligned to the graph, since we do not know the read s direction compared to the reference. We use the following constraints: Each individual can either have one or two different alleles. Everyone has two strings of chromosome 6 so each can have two different variations. A read needs to be continuous, meaning we can never add gaps to it or add gaps to the reference while aligning. Under some strict circumstances we may allow a mismatch between the read and reference. We allow this but no other types of errors since the most frequent errors in Illumina read data are mismatches [Hoffmann et al., 2009]. Since each read in a read pair is from the same chromosome, we do an AND bitwise operation on both reads bit strings. All non-paired reads are discarded. We believe using these constraints we can create a model that can accurately predict the correct genotype from Illumina next-generation data. To further improve the model we include some parameters we wish to train using in-house tools which will be discussed later. 3.4 Parameters Using the graph we need to create a heuristic that maximizes accuracy of the program without sacrificing heavily on computational time or memory. The parameters are: Read clipping, minimum sequence length, mismatches, and a zygosity factor Read clipping The quality of the read tends to drop near the end of the reads when using Illumina machine, and other technologies using sequencing by synthesis [Fuller et al., 2009]. To counter this issue we introduce clipping of the reads near the ends. Illumina produces a read quality base check which gives each base a Phred value. We use this Phred value to clip the reads. A Phred value threshold τ was picked and we say that if the bases on the read ends are lower than τ, we remove them. We repeat this process until we find a base which has a Phred value equal or higher than τ. This process is applied on both ends of the read. If we choose a high τ value, we could be removing valuable information from the reads. But if we choose a low τ value, we are likely to have more errors on the read s ends. This parameter needs to be trained to balance these two traits. 32

51 3.4 Parameters For read clipping we chose the following values to train with: τ = 20, 25, 30, 35, 37. A value of 20 will produce almost no clipping on the reads while 37 clips many entire reads away. We are opting for very few different values to test for this parameter since for every different value we will produce different reads. We need to align each read, which is a very expensive operation Minimum sequence length Since we are clipping the reads, we need another parameter that limits the minimum length of the clipped reads. We call this parameter λ. Allowing all small reads to be aligned to the graph, we would more often find matches in multiple locations. Furthermore, many reads from other regions are more likely to be aligned. At the same time it is probably not a good idea to reject all reads that have been clipped. Then we would end up with very few reads. We can very cheaply try many different values when training λ. We trained using the following minimum sequence lengths: λ = 10, 20, 30, 40, 50, 60, 70, 80, 90. Our method involves aligning a read to the graph only if the length of the read is at least the same as the lowest minimum sequence length, which is in our case 10. Then, when the alignment has finished, we copy the results and add them to the total results for each minimum sequences length which is lower or equal to the cutoff length. For instance, if the length of a read is 40 we use 40 as the cutoff length and add the results to the total results for 40, 30, 20, and 10. However, if a read has the length 9 we reject that read entirely. The computational time therefore does not scale with the number of parameter values we try. It only depends on the length of the smallest minimum sequence we use Mismatches In section 3.4 we mentioned that we can allow mismatches in the aligner in some scenarios. We chose the following scenario to allow mismatches when all the following conditions are met: The current node has one outgoing edge. The current node has one incoming edge. The quality of the read is below a given threshold, ρ. The parameter we need to train is the quality threshold ρ. The reason for first condition is because if we allow more than one then a single alignment could map to multiple paths on the graph. If we allow separate paths we are likely to face multiple cases where a single read can start at some node and end at multiple other places. When this happens there is 33

52 3 Methods no good way to decide which of these paths are the most correct one. To avoid this issue we simply forbid mismatches on nodes with more than one outgoing edge. The second condition is not mandatory. However, it gives us the nice feature of having a symmetric alignment to the graph. We chose to train the read clipping quality using these values: ρ = 20, 25, 30, 35, and 37. Again, we only chose those five values since each value requires an alignment to the graph, and thus is computationally expensive Zygosity factor When deciding which alleles will explain the most reads it will always heavily favor heterozygous over homozygous. Two different alleles will always be able to explain at least as many reads as only one allele would. In previous work, such as [Szolek et al., 2014], this issue was resolved by using a constant factor that increases the scores of homozygous results. We believe this method fails to provide good genotype if both true alleles are very similar. In other words if an individual s true alleles are very similar, then the score of those two alleles are expected to be almost as high as the score of them in a combination. To counter this issue we used weighted average between how many reads two alleles could explain and the average of how many reads they could explain individually. In a more formal manner we can say that we have selected n alleles in a set L = {a 1, a 2,..., a n }. We also have m read pairs from sequencing an individual. Given a read pair r j and an allele a i where 1 j m and 1 i n, we define C rj,a i = { 1, if allele ai can explain the read pair r j 0, otherwise (3.1) If some single read pair is explained by some allele then we say that a hit counter for that allele and read pair is 1, but otherwise 0. Read pair r j can be explained by any number of alleles. n 0 C rj,a i n (3.2) i=1 If the total score of an allele A L is S A for given read pairs r 1, r 2,..., r j,..., r m is m S A = C rj,a (3.3) j=1 Since each individual has two chromosomes we need a method to convert the single allele score, S A, to a combined allele score we call S A,B. If we have two alleles A, B L we only require that one allele can explain the read. We calculate their heterozygous score as m ( ( S A,B = max Crj,A, C rj,b)) (3.4) j=1 34

53 3.4 Parameters Here we note that if the score is homozygous, which is when A = B in equation 3.3 we get m ( ( S A,A = max Crj,A, C rj,a)) m = C rj,a = S A (3.5) j=1 Individuals that have the same allele A on both chromosomes simply get a score S A,A = S A. This scoring scheme heavily favors heterozygous scores because S A,B S A for any two alleles A and B. Instead, we decide the zygosity of the individual by using the number of reads explained by allele one, but not the other. S A\B = j=1 m ( ( max Crj,A C rj,b, 0 )) (3.6) j=1 We also define S A B to be the number of reads explained both by allele A and B. S A B = m ( ( max Crj,A + C rj,b 1, 0 )) (3.7) j=1 Figure 3.8 shows the relation among scores S. If S A\B is much larger than S B\A it is likely that the individual is homozygous with two A alleles. However, if S A\B and S B\A are relatively similar then it is likely that the individual is heterozygous. A B S A\B S A B S B\A Figure 3.8: Alleles A and B are represented as sets. Each set has the reads that allele explains. The number of reads explained by either allele A or B is S A,B = S A\B + S A B + S B\A. We can use these results to decide zygosity using a constant β. S A\B S B\A S A\B + S B\A { > β, homozygous solution β, heterozygous solution (3.8) Where 0 β 1. Figure 3.9 further explains our reasoning. A cross represents a read. If allele A can explain the read then the cross is in set A. 35

54 3 Methods a) A B b) A B S A\B = 8 S A B = 8 S B\A = 1 S A\B = 6 S A B = 8 S B\A = 5 Figure 3.9: Two examples of the distributions of reads (crosses) among two alleles, A and B. a) It is likely that the individual is homozygous even though S A = 16 and S A,B = 17. The read inside the B\A region is likely an error. b) Here however, S A\B and S B\A are relatively similar and thus we would rather expect that the individual is heterozygous. Now we can scale the heterozygous solutions down using β. First, we assume that A has the highest score, S max = S A S B. This also means that S A\B S B\A because S A S B = S A\B S B\A. Furthermore, we can see from figure 3.8 that S A\B + S B\A = S A + S B S A B. Assuming S A\B + S B\A > 0, the condition for a heterozygous solution is S A S B β(s A + S B 2S A B ) 0 (3.9) To compare scores between homozygous and heterozygous solutions we introduce a scaled score S A,B. If an individual is heterozygous, we want One solution to equations 3.9 and 3.10 is: S A,B S A (3.10) S A,B = βs A,B + (1 β) S A + S B (3.11) 2 Equation 3.11 has a nice feature, it makes scaling unnecessary for homozygous solutions because S A,A = βs A,A + (1 β) S A + S A = S A (3.12) 2 Therefore, we only need to scale heterozygous solutions. If we use β = 1 then there would not be any scaling, S A,B = S A,B, and heterozygous solutions would be favored. If β = 0 the scoring scheme would favor homozygous solutions. We suggest β = 0.5 as a middle ground but training this parameter is computationally cheap. Using individuals in decode s dataset we trained β using 9 different values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. We expect our scoring scheme, at least to some extent, to solve the problem of having two highly similar alleles. Given all this, we formulate our optimization problem as S max = max A,B L S A,B = max A,B L ( βs A,B + (1 β) S A + S B 2 ) (3.13) 36

55 3.5 Parameter training 3.5 Parameter training To train the data we used an in-house program that uses imputation to determine the correct allele of an individual. When imputing we use the relation data among individuals in the dataset to expand the results for the Icelandic population. The program requires a list of alleles, the most likely allele, and the likelihoods of all combinations of alleles on the Phred scale. The data was stored in VCF format. We define e A,B to be the event that a read is explained by some alleles A, B L, which is not the true genotype. The probability of that event occurring is P (e A,B ) = ɛ. In general, we estimate that the number of such events are d = S max S A,B (3.14) We assume all such events are mutually independent of each other so the binomial probability of it occuring d-times are in general: P (e A,B ) = ɛ d (3.15) The score number is not always an integer but it does roughly correspond to the number of mismatched read pairs, so we believe it is a reasonable metric to use. The probability of each genotype is then estimated to be P A,B = P (e A,B )P max = ɛ d P max (3.16) Our imputation tool uses a reversed Phred score, meaning that the allele with a score of 0 is the most likely one. So instead of using the probability of error P (e), we use instead P (e A,B ) in equation 2.1. We also put a limit on the Phred score so it is never higher than 255, which means an allele is never less likely to be true than The Phred score is then calculated using: Pred A,B = min ( 10 log 10 (P (e A,B )), 255) (3.17) We arbitrarily chose ɛ = 1%, so we can simplify equation 3.17 to Pred A,B = min (20d, 255) (3.18) In section we defined the 4 parameters trained. In total our input space is large, its total size is 2,025 for each gene for a single individual. The most computationally expensive part is the alignment to the partial order graph, and only two parameters required separate alignments. These two parameters are the clipping and mismatch thresholds which have a combined input space of 25. They are the dominant computational time factors. In our training set we used a dataset of 3,894 Icelanders who have been sequenced at decode. These individuals allhave passed the in-house quality control tests with high 37

56 3 Methods scores. The quality control score is based on sequencing depth coverage, contamination, and more. If we assume that the time to genotype an individual for one gene is 40 CPU seconds, then the time required to genotype 3,894 for six genes using 25 different parameters is about 270 CPU days. The parameter training was therefore carried on multiple nodes in decode s computer cluster. The results contain about 47.3 million data points, which where then imputed for more than 150,000 chip genotyped Icelanders using an in-house imputation tool. The tool evaluates an INFO score which indicates how well the genotypes fit into with the relational data of the individuals. For example, let s say that Gyper predicts a father to have the two HLA-A*01:01 alleles and the mother to have HLA-A*02:01 and HLA- A*03:01 alleles. If Gyper predicted that their child has two HLA-A*68:01 alleles, the INFO score will be lowered because this is essentially impossible. Moreover, even if Gyper predicted the child to have HLA-A*01:01 and HLA-A*02:01, that might also be false if it has already been determined that the mother passed her chromosome 6 with the HLA*03:01 allele to the child. Therefore, in this particular example case the only possible genotype of the child, given that the parents were genotyped correctly, is HLA-A*01:01 and HLA-A*03:01. The in-house imputation tool outputs for each genotype its minor allele frequency (MAF) and an INFO score estimating how well the genotyper predicted this allele correctly. We weight each INFO score by its MAF and find the average over all genotypes. Our assumption is that whichever combination of parameters that provide the highest weighted average INFO score is the best combination. 3.6 Implementation Gyper is implemented in C++ and depends on both SeqAn [Döring et al., 2008] and Boost The project is open-source and maintained on Github at The program is licenced under the simplified BSD license. SeqAn has not been released yet but its development is ongoing. Before it has been released, it is possible to use SeqAn s development branch on Github instead. 38

57 4 Results 4.1 Preprocessing the data Coverage read depth The coverage depth of the six most important genes was computed and plotted for 60 million simulated reads, 10 million reads for each gene. We found the location of each gene using RefSeq [Pruitt et al., 2014]. Table 4.1 shows the fraction of reads that mapped to optimal and suboptimal locations, and the fraction of unmapped reads. Table 4.1: The fraction of reads mapped to the optimal locations, suboptimal locations, and no locations of the human genome reference. Gene Optimal locations (%) Suboptimal location (%) Unmapped reads (%) HLA-A HLA-B HLA-C HLA-DQA HLA-DQB HLA-DRB Total We did see a high number of reads that were mapped to suboptimal locations, which are locations outside the gene. Out of the 60 million reads simulated, roughly 17% mapped to these locations. The suboptimal locations were usually close to the gene and almost always on chromosome 6. Interestingly, in few cases we saw the suboptimal locations of one gene being inside the region of another gene. This shows how much similarity is among the HLA genes. A little less than 1% of the simulated reads could not be mapped anywhere on the genome. These reads were discarded. 39

58 4 Results Figure 4.1: Coverage plot for HLA-DQA1. The HLA-DQA1 gene had 21.33% of its simulated reads mapped to suboptimal locations (Figure 4.1). It was the gene with the most unmapped reads, 3.42%. The suboptimal locations are all on chromosome 6 and most of them were mapped to a position 100,000 bases after the gene. When we say that a position is after a gene, we mean that the position is greater than the position of the gene. This length of the gene is estimated to be about six thousand base pairs. Figure 4.2: Coverage plot for HLA-DQB1. The coverage plot for the HLA-DQB1 gene is similar to the plot for HLA-DQA1. However, the HLA-DQB1 gene has a lot fewer reads mapped to a suboptimal location, 11.45% 40

59 4.1 Preprocessing the data Figure 4.3: Coverage plot for HLA-DRB1. (Figure 4.2). In fact HLA-DQB1 had the fewest reads mapped to suboptimal locations out of all the genes in this study. The gene s region is set to be around 7,000 base pairs in length. Most of the mismapped reads were mapped after the gene roughly 95,000 base pairs from it. Figure 4.4: Coverage plot for HLA-A. HLA-DRB1 was the biggest gene we studied, covering close to 11,000 base pairs. For this gene 24.11% of the simulated reads were mapped outside the gene s region (Figure 4.3). Most of them were mapped roughly 60,000 bases before the gene. This gene had the most reads mapped to a suboptimal location out of all the genes we studied. Interestingly, some reads mapped to chromosome 3 on locations 125,606, ,608,865. However, its 41

60 4 Results Figure 4.5: Coverage plot for HLA-B. depth is below the minimum depth covering threshold of 50 reads and are therefore not shown in the figure 4.3. Figure 4.6: Coverage plot for HLA-C. The HLA-A gene had 13.89% of its reads mapped outside the gene s specified position of the genome (Figure 4.4) including many locations with a very low depth. Most of the mismapped reads are however about 55,000 bases before the gene. The gene s total length is 3,414 base pairs. The other HLA class I genes are of similar size too. HLA-B had 16.12% reads mapped outside its region (Figure 4.5). The gene s length is 3,340 base pairs. Most of the reads are in fact coming from inside HLA-C s region, which 42

61 4.1 Preprocessing the data is located about 85,000 base pairs before the HLA-B gene. Finally, the HLA-C gene had 15.41% of its reads mapped to suboptimal locations (Figure 4.6). Most of them mapped inside the HLA-B gene. The length of HLA-C is 3,387 base pairs Filtering the BAM files Using our coverage read depth experiment we can now use the locations found there to filter our alignment files. The alignment files are stored in BAM files. As mentioned before our training set consists of 3,894 individuals and we filtered read from their BAM files. The average size of the BAM files before filtering was 67,220 ± 17,980 MiB, where the largest and smallest files were 158,180 and 2,110 MiB, respectively. After filtering, we created six new BAM files where each of them was filtered for each gene we were interested in. The largest and smallest file sizes of all six BAM files after reductions were 59.0 and MiB, respectively. Average size of all the six files were 23.1 ± 6.2 MiB, or a size reduction of ± 0.006% (Figure 4.7). This huge size reduction sped up the process of genotyping significantly. Figure 4.7: File sizes of BAM files used in our training set before and after filtering. The files are ordered in ascending order by their file size before filtering. The file sizes are in MiB and we use logarithmic scale. 43

62 4 Results Bias introduction By using alignments from regions outside each gene s region of the genome we are introducing a bias in our model as discussed in section Using simulated data with 5000x coverage we summarize our results in table 4.2. On average, 102 read pairs were misaligned when using Gyper. Therefore, we estimate that one read pair is mismapped from suboptimal locations for every 50x coverage depth. So for most HLA genes the bias introduction is neglectable. The gene with the highest bias was HLA-DQA1, where both HLA-DQA1*02:01 and HLA- DQA1*03:01 misaligned 1651 reads. That corresponds to roughly one misaligned read pair for every 3x coverage depth. This might cause some measurable bias for these two alleles. In the initial version of Gyper this bias was not taken into account. 44

63 4.1 Preprocessing the data Table 4.2: Checking for bias introduction our filtered BAM files. Only the most common Icelandic alleles were used in this analysis. Larger numbers mean greater unwanted bias. HLA-A HLA-B HLA-C HLA-DQA1 HLA-DQB1 HLA-DRB1 a S a a S a a S a a S a a S a a S a 01: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

64 4 Results 4.2 Training of parameters We trained the following four parameters: (1) Read clipping threshold τ, (2) minimum sequence length λ, (3) mismatch threshold ρ, and (4) zygosity factor β. Total 3,894 individuals were genotyped from the Icelandic population with a total of 2,025 different parameter settings. Then, using an in-house imputation tool we imputed data for all individuals in decode s database and calculated the INFO score described in section 3.5. The highest weighted average imputational INFO score we got was %. It was achieved when using the following parameters: Read clipping threshold τ = 30, minimum sequence base pair length λ = 60, mismatches threshold score ρ = 25, and zygosity factor β = 0.6. These parameters were set to be the default parameters of Gyper and they were used in all further runs in this study. A sensitivity analysis was performed to investigate how each parameter affected the score. In the analysis we only changed one parameter at a time and kept the other at their optimal values Quality threshold training Figure 4.8: The weighted average impute INFO score for the two threshold qualities: The mismatch quality threshold ρ (blue) and the clipping threshold τ (red). Note that the axes do not start at zero. Figure 4.8 shows how sensitive the INFO score is to the two quality thresholds ρ and τ. This quality score is the Q from equation 2.1. Overall these two parameters had an 46

65 4.2 Training of parameters insignificant effect, especially the mismatch threshold ρ which had the worst weighted average score of only 0.027% lower than the highest score. That happened when ρ = 30. For the clipping threshold τ that same metric was 0.184% when τ = 37. The most time consuming task of Gyper was aligning sequences to the graph so even if the clipping threshold did not improve the weighted average INFO score much, it significantly improved the speed of genotyping. For the mismatch threshold however, Gyper does not benefit any speed increase from allowing mismatches. In fact, it very slightly increases the time and memory needed to align a read. So if either of those two things are an issue, it is possible not to use mismatches and maintain minimal damage to accuracy. Using a very high quality threshold value ρ makes sure mismatches are never allowed. Nevertheless, in our runs we will use ρ = 25 and τ = 30 as it gives us the highest weighted average impute INFO score Minimum sequence length training The minimum sequence length λ also had a similar significance on the weighted average INFO score as the mismatch threshold ρ. The difference between the highest scoring minimum sequence length and the lowest scoring one was 0.037%, where the lowest scoring value was with λ = 90 (Figure 4.9). In decode s data the read lengths are either 100, 125 or 150 base pairs, so the maximum cropping allowed is 40-60% of the original length. If using data with much higher or much lower length of reads, it is probably a good idea to change λ accordingly. Similar to the clipping constant the higher minimum sequence length we choose, the quicker Gyper can compute the genotype. This is due to the fact that using a higher value will decrease the number of reads that need to be aligned to the graph because fewer reads will satisfy the constraint. Therefore choosing a higher λ can lead to some slight speed increase, but in our runs we are going to use the one that gives us the highest weighted average INFO score, λ =

66 4 Results Figure 4.9: The weighted average impute INFO score for different minimum sequence length. Note that the axes do not start at zero. The results show that changing the minimum sequence length is insignificant Zygosity factor training Finally, the zygosity factor β had a substantial effect on the weighted impute INFO score. The scores for the values we tested their scores ranged from % to %, with a difference of 6.962%. The β value with the lowest score of the ones we tried was β = 0.1. The choice of a β value does not affect the computational time to genotype. So we would recommend using Gyper with the default β = 0.6, as it performed best in our case. 48

67 4.2 Training of parameters Figure 4.10: The weighted average impute INFO score for different values of the zygosity factor β. Note that the axes do not start at zero. 49

68 4 Results 4.3 Verification Gyper s accuracy was verified using three different verification datasets: An in-house WGS dataset, 1000 Genomes exome dataset, and 1000 Genomes low coverage WGS dataset (Table 4.3). The in-house sample dataset is not publicly available but 1000 Genomes samples are available on their FTP site [1000Genomes, 2015]. Samples in the 1000 Genomes exome dataset have at least 20x coverage. Exomes are the part of the genome that are formed by exons. Exons only account for about 1% of the genomes so alignment files storing such data are much smaller using the same read coverage. Samples in 1000 Genomes low coverage WGS dataset have at least 3x coverage. The 1000 Genomes datasets have been verified by Erlich et al. [2011] for all the HLA class I genes: HLA-A, HLA-B, and HLA-C. It is a widely used verification dataset for many HLA genotypers and sometimes called the gold standard for HLA genotypers. We genotyped the class I genes and compared Gyper to other HLA genotypers. The called genotypes of the 1000 Genomes samples are shown in appendices A and B. Table 4.3: Number of individuals in each verification dataset. Gene decode WGS (sequenced) decode WGS (imputed) 1000 Genomes exome 1000 Genomes WGS HLA-A HLA-B HLA-C HLA-DQA HLA-DQB HLA-DRB Total The calling accuracy is the fraction of alleles Gyper correctly calls. Gyper calls the genotype with S max. However, in some cases more than one genotype has a score S max which leads to ambiguous results. In this case it is undefined which genotype Gyper calls. We believe a better quality measurement is to use coefficient of determination, r 2. When calculating r 2 we use Gyper s probability of a genotype, P. Additionally, we checked how often Gyper s called the zygosity of the samples matched the experimentally determined zygosity. Both r 2 and the zygosity calling accuracy are only calculated using 4 digit resolution. 50

69 4.3 Verification decode s samples decode has a large dataset with samples taken from the Icelandic population. Thousands of them have been sequenced using Illumina machines, aligned to the human genome, and stored in BAM files. They also have genotyped a portion of them for the six most important HLA genes using laboratory genotyping methods. The class I genes HLA-B and HLA-C were genotyped with a 2 digit resolution. The other four genes HLA-A, HLA-DQA1, HLA-DQB1, and HLA-DRB1 were typed with a 4 digit resolution. Overall 3600 genes have been genotyped using this method, which were used as a verification dataset for Gyper. Genotyping individuals this way is expected to have a high accuracy but they are costly and time consuming. For decode s dataset we did two kinds of tests. One is where Gyper genotyped sequencing files for all individuals that have both been sequenced and are part of the verification data. Unfortunately, that is only the case for 18.85% of the individuals in the verification data. The other, was to genotype the same 3,894 individuals and imputed data for other individuals in the dataset. An important use case of Gyper is to be able impute its output for a large population which can then be used in association studies. This allows us to use a much larger portion of decode s verification data, 93.30%. We cannot use the entire verification data because the imputation is unable to determine an individual s genotype if the genotypes of the its relatives are unknown. Table 4.4: Gyper s 2 digit genotype call accuracy compared to decode s verification data. Gene 0 errors 1 error 2 errors Correct alleles Accuracy HLA-A of % HLA-B of % HLA-C of % HLA-DQA of % HLA-DQB of % HLA-DRB of % All genes of % Table 4.5: Gyper s 4 digit genotype call accuracy compared to decode s verification data. Gene 0 errors 1 error 2 errors Correct alleles Accuracy r 2 HLA-A of % HLA-DQA of % HLA-DQB of % HLA-DRB of % All genes of % We genotyped all individuals in our sequenced dataset and genotyped them using only their respective BAM files. Since each individual has two alleles Gyper s prediction could 51

70 4 Results have 0, 1, or 2 errors for each individual when comparing with the verification. We use the number of correctly predicted alleles divided by the total number of alleles predicted to estimate Gyper s accuracy. The overall genotype call accuracy of Gyper was 97.6% and 94.8% using 2 and 4 digit resolutions, respectively (Tables 4.4,4.5). Zygosity was correctly called in 94.2% cases. For the imputed data Gyper s accuracy was 96.8% and 96.1% for 2 and 4 digit resolution, respectively (Tables 4.6,4.7) and the zygosity call accuracy was 97.1%. In tables 4.5 and 4.7 the HLA-B and HLA-C genes are excluded because only has the first 2 digits are known. Table 4.6: Gyper s 2 digit impute accuracy compared to decode s verification data. Gene 0 errors 1 error 2 errors Correct alleles Accuracy HLA-A of % HLA-B of % HLA-C of % HLA-DQA of % HLA-DQB of % HLA-DRB of % All genes of % Table 4.7: Gyper s 4 digit impute accuracy compared to decode s verification data. Gene 0 errors 1 error 2 errors Correct alleles Accuracy r 2 HLA-A of % HLA-DQA of % HLA-DQB of % HLA-DRB of % All genes of % Genomes exome samples Total 180 exome BAM files were fetched from the 1000 Genomes FTP site and genotyped for the three main HLA class I genes. The samples were taken from individuals from with ancestry from all over the world. Gyper s genotype call accuracy was 99.3% and 97.9% using 2 and 4 digit resolutions, respectively (Table 4.8,4.9). For all genes r 2 > 0.95 and zygosity calling was correct in all 540 cases. 52

71 4.3 Verification Table 4.8: Gyper s 2 digit exome accuracy compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy HLA-A of % HLA-B of % HLA-C of % All genes of % Table 4.9: Gyper s 4 digit exome accuracy compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy r 2 HLA-A of % HLA-B of % HLA-C of % All genes of % Genomes WGS samples We also verified Gyper using 20 low coverage WGS alignment files obtained from the 1000 Genomes project. These files have at least 3x non duplicated aligned coverage. Here, the accuracy of Gyper was 96.7% and 95.0% for the 2 and 4 digit comparisons, respectively (Tables 4.10 and 4.11). Zygosity was correctly called for 98.3% of the individuals. Table 4.10: Gyper s 2 digit low coverage WGS accuracy compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy HLA-A of % HLA-B of % HLA-C of % All genes of % Table 4.11: Gyper s 4 digit low coverage WGS accuracy compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy r 2 HLA-A of % HLA-B of % HLA-C of % All genes of % Over all datasets Gyper managed to predict 9,145 out of 9,410 alleles total at the two digit resolution, having an accuracy of 97.2%. With 4 digit resolution we predicted 3,284 out of 3,408 alleles (96,3%) correctly. 53

72 4 Results 4.4 Comparison with other DNA sequencing data genotypers Several other HLA genotypers are publicly available as discussed in section 2.5. One of the best current HLA genotyper is OptiType and our focus is to compare Gyper to it, both in terms of accuracy and time Accuracy OptiType measured its accuracy with both by genotype and zygosity calling samples from the 1000 Genomes project. In both datasets Gyper showed the same or slightly better calling accuracy compared to OptiType Genomes exome dataset Using OptiType s calling results from their article [Szolek et al., 2014] their 4 digit accuracy on the exome dataset was 97.8% (Table 4.12) while Gyper s accuracy was only barely higher at 97.9%. The exome dataset was had previously been typed by Major et al. [2013] with an accuracy of 93.9%. OptiType typed 1056 alleles correctly out of 1080 alleles total while Gyper had only a single correct allele more. Compared to Gyper their accuracy on the HLA-B gene is higher but lower for the other two genes. Table 4.12: OptiType s 4 digit call accuracy on 1000 Genomes exon dataset compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy Gyper s accuracy HLA-A of % 97.5% HLA-B of % 96.7% HLA-C of % 99.4% All genes of % 97.9% Gyper s zygosity calling accuracy was 100.0% for these samples, compared to OptiType s 98.5%. 54

73 4.4 Comparison with other DNA sequencing data genotypers 1000 Genomes WGS dataset Additionally we also compared Gyper to OptiType on WGS data with low coverage. This dataset had been genotyped before with HLAminer with 80.2% accuracy on 4 digit resolution [Warren et al., 2012]. Meanwhile, both OptiType and Gyper managed to achieve 95% genotype calling accuracy (Table 4.13). Table 4.13: OptiType s 4 digit WGS genotype calling accuracy compared to Erlich et al. [2011]. Gene 0 errors 1 error 2 errors Correct alleles Accuracy Gyper s accuracy HLA-A of % 95.0% HLA-B of % 92.5% HLA-C of % 97.5% All genes of % 95.0% Furthermore, both Gyper and OptiType called zygosity correctly in 59 of 60 cases (98.3%) Time The main feature of Gyper is its efficiency for the case where the user stores their WGS reads in an alignment file using SAM/BAM format. Storing reads aligned is now widely used and has become the industry standard. Raw reads in FASTQ files are hard to work with because they provide no context for the user. OptiType only supports FASTQ files, which means users who are interested in say, the HLA-A genotype of an individual, and store their reads only alignment will need to: 1. Sort the SAM/BAM using read name. 2. Convert SAM/BAM to two separate FASTQ files. 3. Preprocess both FASTQ files using a read mapper. 4. Run OptiType on the preprocessed files. Testing it on an in-house 90 GiB indexed BAM file this process took a couple of days. Meanwhile, using Gyper on the same computer we genotyped the individual in less than 40 seconds. Even for this massive time difference, Gyper has shown to be comparatively or more accurate in calling the correct HLA genotypes. 55

74 4 Results 56

75 5 Conclusions 5.1 Summary Gyper is a fast HLA genotyper, it dominated all other publicly available HLA genotypers in terms of speed. Its high speed is due to the fact that Gyper only uses a very small subset of reads which are believed to be relevant to the genotyping. Gyper requires sorted and indexed alignment files to fetch these reads quickly. Even though we only genotype with such a small portion of reads, we can still report that Gyper is one of the most accurate HLA genotypers publicly available. When comparing with OptiType, which has been reported to be an accurate HLA genotyper, Gyper s genotype and zygosity call accuracy was higher than OptiType s. We also measured Gyper s accuracy using the coefficient of determination, r 2, with 4 digit resolution genotypings. Gyper achieved r 2 > 0.8 for six HLA genes and r 2 > 0.95 for the three main HLA class I genes using WGS and exome samples, respectively. Previous methods, we compared ourselves with, did not measure their accuracy this way. We believe it is a better quality score than genotype call accuracy. Gyper s high accuracy is achieved by smartly creating partial order graphs for all the different alleles, and then aligning read pairs to them. Aligning all read pairs independently to every reference allele available, which can sometimes go up to 4000 different references, is extremely time consuming. Instead we create a single graph and align the read pairs to that, resulting in a much faster typing. Partial order graphs have proven to be a good way to represent variation, but we had not seen them used for our purpose before. They allow us to add a wide variety of constraints to the genotyper to extract as much information as we possibly can. By having such a quick genotyper we are able to optimize the parameters for these constraints by training at a large scale. Gyper is very extensible, it is created as a generic genotyper that is not restricted to genotyping for HLA types. Furthermore, it can easily be extended to genotype for any genomic structural variation (e.g. SNP, insertions, and deletions). We are certain that it can be used in a wide variety of applications. 57

76 5 Conclusions 5.2 Future work Even with the present results we still believe Gyper can be extended to be even faster and more accurate. We have not investigated deeply why Gyper is failing for some individuals, but we think a big part of it is due to the bias introduction discussed in section Furthermore, the experimental genotypings have been found to be inaccurate for the 1000 Genomes samples [Erlich et al., 2011]. One feature that would increase Gyper s speed is to save and load graphs from disk. In the current implementation a new graph is created every time Gyper is run. This feature does not have a high priority because the creation usually takes 2 seconds or less, but loading the graph directly from disk would be faster. Another feature to improve Gyper s speed is to make nodes store k-mers instead of single DNA bases. That way, we could easily create a hashmap which maps k-mers to node to find quickly potential alignments in the graph. Currently, when aligning a read we check every node on the graph. The alignment is by far the most time consuming part of Gyper, so with this change the typing speed would be improved significantly. Even though Gyper s top priority is accuracy, implementing these two speed improving features can allow us to train the parameters on a much larger scale, both in terms of number of genes and individuals. To improve accuracy even further we speculate that allowing some mismatches and indels on intron sequences could help. That would make up for the fact that an individual can have a sequence that is missing or wrong in the IMGT/HLA database. It is very possible that there are some rare variants in the Icelandic population which is missing from the database. If this is the case, it could have a big impact on the genotyping. As for extensibility, Gyper currently only supports SAM and BAM files so another future task is adding support for the increasingly popular CRAM files. Also, Gyper can easily be extended to add support fore more genes, SNPs, indels, or any other genetic variants. Hopefully, we can add these features to Gyper to make it an even more appealing DNA genotyper. 58

77 A HLA genotype call results for 1000G exome samples A.1 HLA-A exome results Table A1: Gyper s called HLA-A genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01:01 02:01:01:01 03:01 02: NA :01:01 02:01:01:01 02:01 32: NA :01:02:01 02:01:01:01 02:01 68: NA :01:113 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01 03:01:01:01 25:01 03: NA :02:01:01 01:01:01:01 24:02 01: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 03: NA :01:01 02:01:01:01 32:01 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 03:01:01:01 03:01 26: NA :01:01:01 01:01:01:01 02:01 01: NA :01:01:01 01:01:01:01 01:01 11: NA :01:01:01 01:01:01:01 01:01 01: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 02:01:01:01 24:02 02: NA :02:01:01 02:01:01:01 29:02 02: NA :01:01 11:01:01:01 25:01 11: NA :01:01:01 02:06:01:01 02:06 26: NA :01:01:01 01:01:01:01 02:01 01: NA :01:01:01 02:01:01:01 03:01 02: NA :01:02:01 01:01:01:01 01:01 31: NA :01:01:01 01:01:01:01 01:01 02: NA :01:01:01 01:01:01:01 01:01 11: NA :01:01 11:01:01:01 11:01 32: NA :01:01 02:01:01:01 02:01 25: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :01:02:01 02:01:01:01 31:01 02: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 23:01:01 23:01 24:

78 A HLA genotype call results for 1000G exome samples NA :02:01:01 01:01:01:01 24:02 01: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 01:01:01:01 01:01 03: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 24:02:01:01 29:02 24: NA :01:01:01 02:01:01:01 03:01 02: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01:01 01:01:01:01 01:01 03: NA :02:01:01 03:01:01:01 24:02 03: NA :02:01:01 02:01:01:01 02:01 24: NA :01:01:01 01:01:01:01 11:01 01: NA :02 33:01:01 33:01 66: NA :01 23:01:01 24:24 74: NA :01 36:01 36:01 36: NA :01 26:01:01:01 26:01 74: NA :01:01 23:01:01 23:01 30: NA :02:01:01 33:03:01 33:03 68: NA :02:01:01 03:01:01:01 03:01 29: NA :01 03:01:01:01 03:01 74: NA :01 03:01:01:01 03:01 36: NA :01:01:01 01:01:01:01 01:01 03: NA :03:01 24:02:01:01 24:02 33: NA :01:02:01 30:01:01 30:01 68: NA :01:01:01 02:01:01:01 02:07 11: NA :03:01 11:01:01:01 11:01 33: NA :02:01:01 11:01:01:01 11:01 24: NA :01:06 02:03:01 02:03 68: NA :07:01 02:07:01 02:07 02: NA :01:01:01 02:07:01 02:07 11: NA :01:01:01 02:06:01:01 02:06 11: NA :02:01:01 02:01:01:01 02:01 24: NA :01:01:01 24:02:01:01 24:02 26: NA :01:02:01 11:01:01:01 11:01 33: NA :03:01 02:01:01:01 02:01 33: NA :01:01 02:01:01:01 02:01 30: NA :03:01 02:01:01:01 02:01 33: NA :06:01:01 02:01:01:01 02:01 02: NA :02:01:01 03:02:01 03:02 24: NA :01:02:01 24:02:01:01 24:02 31: NA :07:01 02:07:01 02:07 02: NA :01:01:01 02:01:01:01 02:01 02: NA :01:01:01 11:01:01:01 11:01 11: NA :02:01:01 11:01:01:01 11:01 24: NA :01:02:01 24:03:01 24:02 31: NA :02:01:01 01:01:01:01 01:01 24: NA :01:01 02:06:01:01 02:06 30: NA :03:01 02:01:01:01 02:03 02: NA :01:02:01 24:02:01:01 24:02 31: NA :01:01:01 02:03:01 02:03 03: NA :01:04 11:01:01:01 11:01 11: NA :06:01:01 01:01:01:01 01:01 02: NA :01:01 02:01:01:01 02:01 32: NA :02:01:01 11:01:01:01 11:01 24: NA :02:01:01 03:01:01:01 03:01 24: NA :01:01 24:02:01:01 24:02 30:

79 A.1 HLA-A exome results NA :01:01 30:01:01 30:01 32: NA :03:01 11:01:01:01 11:01 33: NA :01:01:01 11:01:01:01 11:01 11: NA :01:01:01 11:01:01:01 11:01 11: NA :02:01:01 11:01:01:01 11:01 24: NA :01:01:01 02:01:01:01 02:01 11: NA :01 34:02:01 34:02 36: NA :01 33:03:01 33:03 74: NA :01:01 29:02:01:01 29:02 30: NA :02:01:01 03:01:01:01 03:01 68: NA :02:01:01 30:02:01:01 30:02 30: NA :01 30:01:01 30:01 36: NA :01 34:02:01 34:02 74: NA :02:01:01 24:02:01:01 24:02 24: NA :03:01 24:02:01:01 24:02 33: NA :07:01 02:06:01:01 02:07 02: NA :02:01:01 02:06:01:01 02:06 24: NA :02:01 33:03:01 31:01 33: NA :02:01:01 02:06:01:01 02:06 24: NA :01:01 26:01:01:01 26:01 30: NA :03:01 24:02:01:01 24:02 26: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 24:02:01:01 24:02 31: NA :01:02:01 11:01:01:01 11:01 31: NA :01:01:01 02:01:01:01 02:01 02: NA :02:01:01 11:01:01:01 11:01 24: NA :02:01:01 24:02:01:01 24:02 24: NA :02:01:01 11:01:01:01 11:01 24: NA :03:01 26:01:01:01 26:01 26: NA :01:02:01 26:02:01 26:02 31: NA :07:01 02:06:01:01 02:07 02: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 11:01:01:01 11:01 31: NA :01:02:01 24:02:01:01 24:02 31: NA :01:01:01 24:02:01:01 24:02 26: NA :02:01:01 02:06:01:01 02:06 24: NA :02:01:01 24:02:01:01 24:02 24: NA :03:01 24:02:01:01 24:02 33: NA :01:01:01 02:01:01:01 02:01 26: NA :06:01:01 02:01:01:01 02:01 02: NA :03:01 33:03:01 33:03 33: NA :01:02:01 03:01:01:01 03:01 31: NA :01:02:01 11:01:01:01 11:02 31: NA :01:02:01 02:01:01:01 02:01 31: NA :02:01 02:07:01 02:07 26: NA :03:01 24:02:01:01 24:02 33: NA :02:01:01 02:01:01:01 02:01 24: NA :03:01 24:02:01:01 24:02 33: NA :01:02:01 24:02:01:01 24:02 31: NA :03:01 24:02:01:01 24:02 33: NA :03:01 24:02:01:01 24:02 26: NA :02:01:01 02:06:01:01 02:06 24: NA :03:01 24:04 24:04 33: NA :02:01:01 24:02:01:01 24:02 24: NA :01:01:01 02:06:01:01 02:06 11:

80 A HLA genotype call results for 1000G exome samples NA :03:01 02:01:01:01 02:01 33: NA :03:01 02:07:01 02:07 26: NA :02:01:01 24:02:01:01 24:02 24: NA :02:01:01 36:01 36:01 68: NA :01:01 23:01:01 23:01 30: NA :01 02:01:01:01 02:01 36: NA :01 23:01:01 23:01 36: NA :01:01 03:01:01:01 03:01 30: NA :03:01 23:01:01 23:01 33: NA :01 24:02:01:01 23:01 36: NA :01:01 02:02:01 02:02 23: NA :01 33:03:01 33:03 36: NA :02:01:01 66:02 66:03 68: NA :01:01 02:01:01:01 02:01 30: NA :01:01 02:01:01:01 02:01 23: NA :01 02:11:01 02:11 36: NA :01:01 02:02:01 02:02 23: NA :03:01 32:01:01 32:01 33: NA :01:01:01 03:01:01:01 03:01 26: NA :01 33:03:01 33:03 36: NA :01:01 03:01:01:01 03:01 23: NA :01:01:01 26:01:01:01 26:01 68: NA :01:01 02:02:01 02:02 23: NA :01:01:01 30:02:01:01 30:02 68: NA :02:01:01 23:01:01 23:01 30: NA :03:01 03:01:01:01 03:01 33: NA :02:01:01 02:05:01 02:05 30: NA :01:01:01 01:01:01:01 01:01 02: NA :02:01:01 30:02:01:01 30:02 68: NA :01:01 03:01:01:01 03:01 30: NA :01:01 03:01:01:01 03:01 33: NA :03:01 30:01:01 30:01 33: NA :03:01 30:01:01 30:01 33: NA :01 30:01:01 30:01 36: NA :02:01:01 02:01:01:01 02:01 68: NA :02:01:01 30:01:01 30:01 68:

81 A.2 HLA-B exome results A.2 HLA-B exome results Table A2: Gyper s called HLA-B genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01 07:02:01 07:02 57: NA :02:01 08:01:01 40:02 08: NA :02:01:01 40:01:02 44:02 40: NA :02:01:01 07:02:01 44:02 07: NA :01:01 08:01:01 08:01 57: NA :06:02 08:01:01 39:06 08: NA :01:01:01 08:01:01 08:01 18: NA :01:02 08:01:01 40:01 08: NA :01:01 44:02:01:01 44:02 15: NA :02:01 14:01:01 14:02 14: NA :01:01 07:02:01 08:01 07: NA :02:01 27:05:02 40:02 27: NA :01:01 27:05:02 27:05 57: NA :02:01 07:02:01 07:02 07: NA :01:01:01 08:01:01 35:01 08: NA :01:01:01 07:02:01 51:01 07: NA :01:01 08:01:01 57:01 08: NA :02:01 08:01:01 08:01 13: NA :02:01 07:02:01 07:02 07: NA :05:02 07:02:01 07:02 27: NA :01:01:01 15:01:01:01 18:01 15: NA :01:01 35:01:01:01 35:01 38: NA :02:01 07:02:01 07:02 07: NA :02:01:01 35:01:01:01 35:01 44: NA :01:02 08:01:01 08:01 40: NA :02:01:01 08:01:01 08:01 44: NA :01:01:01 50:01:01 51:01 50: NA :03:01 40:02:01 40:02 44: NA :02:01:01 40:01:02 44:02 40: NA :01:01 44:02:01:01 49:01 44: NA :01:01 07:02:01 08:01 07: NA :01:01:01 07:02:01 15:01 07: NA :01:01 07:02:01 08:01 07: NA :01:01:01 44:03:01 44:03 51: NA :01:01 07:02:01 07:02 08: NA :01:01:01 44:02:01:01 51:01 44: NA :01:01 07:02:01 08:01 07: NA :02:01:01 44:02:01:01 44:02 44: NA :01:01 44:03:01 44:03 57: NA :02:01 07:02:01 07:02 07: NA :01:01 08:01:01 08:01 55: NA :01:01 08:01:01 08:01 14: NA :01:01:01 07:02:01 39:06 07: NA :01:01:01 35:01:01:01 35:01 51: NA :01:01:01 08:01:01 56:01 08: NA :01:01 14:01:01 14:01 78: NA :01:01:01 14:02:01 14:03 58: NA :10:01 15:03:01 15:03 39:

82 A HLA genotype call results for 1000G exome samples NA :01:01 15:03:01 15:03 53: NA :01:01 15:03:01 15:03 42: NA :01:01 42:01:01 42:01 53: NA :01:01 15:10:01 15:10 53: NA :01:01:01 49:01:01 49:01 51: NA :01:01 15:03:01 15:03 49: NA :01:01:01 07:02:01 07:02 35: NA :01:01:01 40:01:02 40:01 58: NA :18:01 13:01:01 13:02 15: NA :01:01 38:02:01 38:02 46: NA :01:01:01 46:01:01 46:01 58: NA :01:02 40:01:02 40:01 40: NA :01:01 15:02:01 15:02 38: NA :01:01 13:01:01 13:01 46: NA :01:01 40:02:01 40:02 46: NA :01:01 07:05:06 07:05 54: NA :01:01 48:01:01 48:01 67: NA :01:01:01 39:01:01:01 39:01 59: NA :01:01:01 51:01:01:01 51:01 52: NA :01:01:01 48:01:01 48:01 51: NA :02:01 13:01:01 13:01 13: NA :01:01:01 08:01:01 35:01 08: NA :11:01 15:01:01:01 15:01 15: NA :01:01:01 40:01:02 40:01 51: NA :01:01 15:01:01:01 15:01 54: NA :01:01 46:01:01 46:01 54: NA :01:02 15:01:01:01 15:01 40: NA :01:01:01 48:01:01 52:01 81: NA :01:01:01 40:01:02 40:01 52: NA :06:01:01 15:18:01 15:18 40: NA :01:02 37:01:01 37:01 40: NA :06:01:01 13:02:01 13:02 40: NA :02:01 15:01:01:01 15:01 38: NA :01:01:01 48:01:01 48:01 51: NA :02:01 35:03:01 35:03 38: NA :01:01:01 46:01:01 46:01 51: NA :01:01 40:06:01:01 40:06 57: NA :01:01 40:01:02 40:01 48: NA :18:01 15:01:01:01 15:01 15: NA :01:01:01 35:03:01 35:03 51: NA :01:02 13:02:01 13:02 40: NA :01:01:01 13:02:01 13:02 52: NA :01:01:01 46:01:01 46:01 58: NA :01:01 54:01:01 54:01 54: NA :01:01:01 40:06:01:01 40:06 51: NA :01:01 40:01:02 40:01 46: NA :01:02 15:01:01:01 15:01 67: NA :01:02 44:03:01 44:03 52: NA :01:01 35:01:01:01 35:01 49: NA :01:01 18:01:01:01 18:01 49: NA :03:01 13:01:01 13:02 57: NA :03:01 15:03:01 15:03 57: NA :01:02 42:01:01 42:01 52: NA :01:01 53:01:01 53:01 53: NA :01:01:01 46:01:01 46:01 52:

83 A.2 HLA-B exome results NA :03:01 35:01:01:01 35:01 44: NA :01:01 35:01:01:01 35:01 46: NA :01:01:01 40:02:01 40:02 51: NA :03:01 15:01:01:01 15:01 44: NA :01:01:01 52:01:01:01 52:01 52: NA :02:01 07:02:01 07:02 13: NA :02:01 07:02:01 07:02 40: NA :01:01:01 40:01:02 40:01 59: NA :01:01:01 07:02:01 07:02 51: NA :02:01 40:01:01 40:01 40: NA :02:01 15:01:01:01 15:01 40: NA :01:01 52:01:01:01 52:01 67: NA :01:01 40:02:01 40:06 54: NA :01:01 51:01:01:01 51:01 54: NA :02:01 35:01:01:01 35:01 40: NA :01:01:01 40:06:01:01 40:06 56: NA :01:01:01 15:01:01:01 15:01 35: NA :01:01:01 40:02:01 40:02 51: NA :01:01:01 40:01:02 40:01 51: NA :01:01:01 48:01:01 48:01 52: NA :01:01 15:18:01 15:18 54: NA :02:01 40:02:01 40:02 40: NA :01:01:01 40:02:01 40:02 59: NA :07:01 15:01:01:01 15:07 46: NA :01:02 39:01:03 39:01 40: NA :01:01:01 07:02:01 07:02 15: NA :03:01 44:03:01 44:03 44: NA :02:09 40:02:01 40:02 44: NA :01:02 13:01:01 13:01 40: NA :01:01:01 51:01:01:01 51:01 51: NA :01:01 40:06:01:01 40:06 46: NA :03:01 40:02:01 40:02 44: NA :01:01:01 40:02:01 40:02 52: NA :01:01:01 44:03:01 44:03 52: NA :02:01 07:02:01 07:02 40: NA :03:01 15:07:01 15:07 44: NA :01:01:01 15:01:01:01 15:01 39: NA :01:01 51:01:02 52:01 54: NA :01:01 40:06:01:01 40:06 46: NA :01:01 54:01:01 54:01 67: NA :01:03 39:01:01:01 39:01 39: NA :01:02 58:01:01:01 58:01 67: NA :01:01 35:01:01:01 35:01 46: NA :01:01:01 35:01:01:01 35:01 35: NA :01 42:01:01 42:01 81: NA :01:01 35:01:01:01 35:01 53: NA :01:01 51:01:01:01 51:01 53: NA :02:01 07:02:01 07:02 07: NA :01:01 18:01:01:01 18:01 42: NA :24:01 07:02:01 07:02 39: NA :01:01 35:01:01:01 35:01 49: NA :01:01:01 07:02:01 07:02 58: NA :01:01:01 13:01:01 13:02 35: NA :01:01 42:01:01 42:01 53: NA :01:01 07:02:01 07:02 45:

84 A HLA genotype call results for 1000G exome samples NA :01:01:01 07:02:01 07:02 58: NA :01:02 35:01:01:01 35:01 52: NA :01:01 52:01:02 52:01 53: NA :01:01:01 53:01:01 53:01 58: NA :01:01:01 15:10:01 15:10 56: NA :01:01 18:01:01:01 18:01 53: NA :02:01 52:01:02 52:01 57: NA :01:01:01 15:10:01 15:10 51: NA :01:01 52:01:02 52:01 53: NA :03:01 35:01:01:01 35:01 44: NA :01:02 07:02:01 07:02 52: NA :01:01:01 15:16:01 15:16 18: NA :01:01:01 18:01:01:01 18:01 58: NA :01:01 42:01:01 42:01 53: NA :10:01 08:01:01 08:01 15: NA :01:02 15:10:01 15:10 52: NA :01:01:01 15:10:01 15:10 58: NA :01:01 42:02:01:01 42:02 45: NA :01:01 41:01:01 41:04 42: NA :03:01 53:01:01 53:01 57: NA :01:02 35:01:01:01 35:01 52: NA :03:01 35:01:01:01 35:01 57:

85 A.3 HLA-C exome results A.3 HLA-C exome results Table A3: Gyper s called HLA-C genotype on 1000 Genomes exome dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :02:01:01 06:02:01:01 07:02 06: NA :01:01:01 02:02:02:01 02:02 07: NA :04:01 03:04:01:01 03:04 07: NA :02:01:01 05:01:01:01 05:01 07: NA :01:01:01 06:02:01:01 07:01 06: NA :02:01:01 07:01:01:01 07:02 07: NA :03:01:01 07:01:01:01 07:01 12: NA :01:01:01 03:04:01:01 03:04 07: NA :01:01:01 03:04:01:01 05:01 03: NA :02:01:01 08:02:01:01 08:02 08: NA :02:01:01 07:01:01:01 07:01 07: NA :04:01 02:02:02:01 02:02 07: NA :02:01:01 02:02:02:01 02:02 06: NA :02:01:01 07:02:01:01 07:02 07: NA :01:01:01 04:01:01:01 04:01 07: NA :02:01:01 07:02:01:01 15:02 07: NA :01:01:01 06:02:01:01 06:02 07: NA :01:01:01 06:02:01:01 07:01 06: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 01:02:01 07:02 01: NA :03:01:01 03:03:01 12:03 03: NA :03:01:01 04:01:01:01 04:01 12: NA :02:01:01 07:02:01:01 07:02 07: NA :04:01 04:01:01:01 04:01 07: NA :01:01:01 03:04:01:01 07:01 03: NA :01:01:01 05:01:01:01 07:01 05: NA :02:01:01 06:06 15:02 06: NA :01:01 15:02:01:01 15:02 16: NA :01:01:01 03:04:01:01 03:04 05: NA :01:01:01 05:01:01:01 05:01 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01:01 03:03:01 03:03 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01 04:01:01:01 14:02 04: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01 05:01:01:01 14:02 05: NA :02:01:01 07:01:01:01 07:01 07: NA :01:01:01 05:01:01:01 05:01 05: NA :01:01 06:02:01:01 16:01 06: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 07:01:01:01 07:01 07: NA :02:01:01 07:01:01:01 07:01 08: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01 04:01:01:01 04:01 14: NA :01:01:01 01:02:01 01:02 07: NA :01:01 08:02:01:01 08:02 16: NA :02:01:01 07:01:01:01 07:01 08: NA :03:01:01 02:10 02:10 12:

86 A HLA genotype call results for 1000G exome samples NA :01:01:01 02:10 02:10 07: NA :01:01:01 02:10 02:10 17: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01 07:01:02 07:01 16: NA :01:02 02:10 02:10 07: NA :05:02 04:01:01:01 04:01 15: NA :02:01:01 03:02:02:01 03:02 07: NA :04:01 06:02:01:01 06:02 07: NA :02:01:01 01:02:01 01:02 07: NA :02:02:01 01:02:01 01:02 03: NA :02:01:01 04:01:01:01 04:01 07: NA :02:12 08:01:01 08:01 12: NA :04:01:01 01:02:01 03:04 01: NA :02:01:01 01:02:01 01:02 15: NA :02:01:01 01:02:01 01:02 07: NA :01:01 07:02:01:01 08:01 07: NA :02:01:01 01:02:01 01:02 07: NA :02:01 12:02:02 12:02 14: NA :02:01:01 08:03:01 08:03 15: NA :02:01:01 03:03:01 03:04 06: NA :02:01:01 03:03:01 03:03 07: NA :03:01 03:03:01 03:03 03: NA :02:01:01 07:02:01:01 07:02 15: NA :03:01 01:02:01 01:02 03: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 01:02:01 01:02 15: NA :02:02 08:01:01 08:01 12: NA :02:02 07:02:01:01 07:02 12: NA :01:01 08:01:01 08:01 08: NA :02:01:01 04:01:01:01 04:01 06: NA :01:01 06:02:01:01 06:02 08: NA :02:01:01 03:03:01 03:03 07: NA :02:01:01 08:01:01 08:01 15: NA :03:01:01 07:02:01:01 07:02 12: NA :02:01 07:02:01:01 07:02 14: NA :01:01 06:02:01:01 08:01 06: NA :02:01:01 04:01:01:01 04:01 07: NA :01:01 01:02:01 01:02 08: NA :02:01 12:03:01:01 12:03 14: NA :02:01:01 03:04:01:01 03:04 06: NA :02:02 06:02:01:01 06:02 12: NA :02:02:01 01:02:01 01:02 03: NA :02:01 01:02:01 01:02 01: NA :02:01 08:01:01 08:01 14: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :01:01 16:01:01 16:01 16: NA :01:02 04:01:01:01 04:01 07: NA :01:02 02:02:02:01 02:02 07: NA :01:02 03:02:02:01 03:02 07: NA :02:01:01 02:10 02:10 08: NA :01:01:01 16:01:01 16:01 17: NA :01:01:01 04:01:01:01 04:01 04: NA :02:02 01:03 01:03 12:

87 A.3 HLA-C exome results NA :03 03:03:01 03:03 14: NA :03:01 01:02:01 01:02 03: NA :02:01 03:04:01:01 03:04 14: NA :03 07:02:01:01 07:02 14: NA :02:02 12:02:02 12:02 12: NA :02:01:01 06:02:01:01 06:02 07: NA :02:01:01 03:04:01:01 03:04 07: NA :04:01:01 01:02:01 01:02 03: NA :02:01 07:02:01:01 07:02 14: NA :04:01:01 03:04:01:01 03:04 03: NA :04:01:01 03:03:01 03:03 03: NA :02:02 07:02:01:01 07:02 12: NA :01:01 01:02:01 01:02 08: NA :02:01 01:02:01 01:02 01: NA :04:01:01 03:03:01 03:03 03: NA :01:01 04:01:01:01 04:01 08: NA :03:01 03:03:01 03:03 03: NA :02:01 03:04:01:01 03:04 14: NA :02:01 07:02:01:01 07:02 14: NA :02:02 08:03:01 08:03 12: NA :04:01 01:02:01 01:02 07: NA :02:01:01 03:04:01:01 03:04 07: NA :03:01 01:02:01 01:02 03: NA :03:01 01:02:01 01:02 03: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :03 14:03 14:03 14: NA :02:01 05:01:01:01 05:01 14: NA :04:01:01 03:03:01 03:03 03: NA :02:01 14:02:01 14:02 14: NA :01:01 01:02:01 01:02 08: NA :03 03:03:01 03:03 14: NA :02:01:01 12:02:02 12:02 15: NA :03 12:02:02 12:02 14: NA :02:01:01 03:04:01:01 03:04 07: NA :03 03:03:01 03:03 14: NA :02:01:01 03:03:01 03:03 07: NA :02:02 01:02:01 01:02 12: NA :01:01 01:03 01:03 08: NA :02:01:01 01:02:01 01:02 07: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 03:02:02:01 03:02 07: NA :03:01 01:02:01 01:02 03: NA :03:01 03:03:01 03:03 03: NA :01:01:01 08:04:01 08:04 17: NA :02:01:01 04:01:01:01 04:01 08: NA :01:01 06:02:01:01 06:02 16: NA :05:02 07:02:01:01 07:02 15: NA :01:01:01 02:02:02:01 02:02 17: NA :02:01:01 07:01:01:01 07:01 07: NA :01:02 04:01:01:01 04:01 07: NA :02:02 07:01:01:01 07:01 07: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01 07:02:01:01 07:02 16:

88 A HLA genotype call results for 1000G exome samples NA :02:01:01 07:01:01:01 07:01 07: NA :01:01 07:01:01:01 07:01 16: NA :01:01 04:01:01:01 04:01 16: NA :01:01:01 03:02:02:01 03:02 04: NA :04:01 01:02:01 01:02 08: NA :01:01:01 04:01:01:01 04:01 05: NA :01 16:01:01 16:01 18: NA :01:01 04:01:01:01 04:01 16: NA :01:01 04:01:01:01 04:01 16: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01 07:02:01:01 07:02 16: NA :02:01 02:02:02:01 02:02 14: NA :02:01:01 07:01:01:01 07:01 08: NA :01:01:01 04:01:01:01 04:01 17: NA :01:01:01 03:02:02:01 03:02 07: NA :01:01 08:04:01 08:04 16: NA :04:01 03:02:02:01 03:02 08: NA :01:01:01 16:01:01 16:01 17: NA :01:01:01 17:01:01:01 17:01 17: NA :01 04:01:01:01 04:01 18: NA :01:01 04:01:01:01 04:01 16: NA :01 04:01:01:01 04:01 18:

89 B HLA genotype call results for 1000G WGS samples B.1 HLA-A WGS results Table B1: Gyper s called HLA-A genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01:01 02:01:01:01 03:01 02: NA :01:01 02:01:01:01 32:01 02: NA :02:01:01 02:01:01:01 29:02 02: NA :01:01:01 01:01:01:01 02:01 01: NA :02:01:01 23:01:01 23:01 24: NA :01 34:02:01 34:02 74: NA :02:01:01 24:02:01:01 24:02 24: NA :01:02:01 11:01:01:01 11:01 31: NA :02:01:01 02:06:01:01 02:06 24: NA :01:01:01 02:01:01:01 02:01 26: NA :06:01:01 02:01:01:01 02:01 02: NA :03:23 33:03:01 33:03 33: NA :01:02:01 02:01:01:01 02:01 31: NA :04 02:01:01:01 02:01 24: NA :02:01:01 36:01 36:01 68: NA :01 23:01:01 23:01 36: NA :02:01:01 66:02 66:03 68: NA :01:01:01 03:01:01:01 03:01 26: NA :01:01 02:02:01 02:02 23: NA :02:01:01 02:05:01 02:05 30:

90 B HLA genotype call results for 1000G WGS samples B.2 HLA-B WGS results Table B2: Gyper s called HLA-B genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :01:01 07:02:01 07:02 57: NA :02:01 40:02:01 40:02 27: NA :05:02 07:02:01 07:02 27: NA :02:01 07:02:01 07:02 07: NA :01:01:01 44:03:01 44:03 51: NA :01:01 53:01:01 53:01 53: NA :01:02 13:01:04 40:06 54: NA :01:01:01 40:01:02 40:01 51: NA :02:01 40:02:01 40:02 40: NA :01:02 39:01:03 39:01 40: NA :01:01:01 07:02:01 07:02 15: NA :03:01 44:03:01 44:03 44: NA :01:01:01 51:01:01:01 51:01 51: NA :01:01:01 40:02:01 40:02 52: NA :01 08:01:01 42:01 81: NA :01:01 35:01:01:01 35:01 49: NA :01:01 42:01:01 42:01 53: NA :01:01:01 15:10:01 15:10 56: NA :01:01 52:01:02 52:01 53: NA :01:01:01 18:01:01:01 18:01 58:

91 B.3 HLA-C WGS results B.3 HLA-C WGS results Table B3: Gyper s called HLA-C genotype on 1000 Genomes low coverage WGS dataset compared to Erlich et al. [2011]. Sample Allele 1 Allele 2 Verified allele 1 Verified allele 2 2 digit matches 4 digit matches NA :02:01:01 06:02:01:01 07:02 06: NA :04:01 02:02:02:01 02:02 07: NA :02:01:01 01:02:01 07:02 01: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01 04:01:01:01 14:02 04: NA :01:01:01 04:01:01:01 04:01 04: NA :01:01:01 01:02:01 01:02 08: NA :02:01 07:02:61 07:02 14: NA :02:01:01 03:04:01:01 03:04 07: NA :02:01:01 07:02:01:01 07:02 07: NA :02:01:01 04:01:01:01 04:01 07: NA :03 14:03 14:03 14: NA :02:01 14:02:01 14:02 14: NA :02:01:01 12:02:02 12:02 15: NA :01:01:01 08:04:01 08:04 17: NA :01:02 04:01:01:01 04:01 07: NA :01:01:01 04:01:01:01 04:01 17: NA :04:01 01:02:01 01:02 08: NA :01:01 04:01:01:01 04:01 16: NA :02:01:01 07:01:01:01 07:01 08:

92 B HLA genotype call results for 1000G WGS samples 74

93 References 1000Genomes. How to access 1000 genomes data. DataAccess, [Online; Accessed: ]. Patrick G. Beatty, Reginald A. Clift, Eric M. Mickelson, Brenda B. Nisperos, Nancy Flournoy, Paul J. Martin, Jean E. Sanders, Patricia Stewart, C. Dean Buckner, Rainer Storb, E. Donnall Thomas, and John A. Hansen. Marrow transplantation from related donors other than hla-identical siblings. New England Journal of Medicine, 313(13): , David R. Bentley, Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, Dirk J. Evers, Colin L. Barnes, and Helen R. Bignell et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53 9, International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931 45, Francis Crick. On protein synthesis. Symp. Soc. Exp. Biol., XII: , Francis Crick. Central dogma of molecular biology. Nature, 227: , Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, and 1000 Genomes Project Analysis Group. The variant call format and vcftools. Bioinformatics, pages btr330v1 btr330, Alexander Dilthey, Charles Cox, Zamin Iqbal, Matthew R Nelson, and Gil McVean. Improved genome inference in the mhc using a population reference graph. Nature Genetics, 47: , Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9:11, Robert C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32(5): , Rachel L. Erlich, Xiaoming Jia, Scott Anderson, Eric Banks, Xiaojiang Gao, Mary Carrington, Namrata Gupta, Mark A. DePristo, Matthew R. Henn, Niall J. Lennon, and Paul I.W. de Bakker. Next-generation sequencing for hla typing of class i loci. BMC Genomics, 12:42,

94 REFERENCES Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney. Efficient storage of high throughput dna sequencing data using reference-based compression. Genome Res, 21: , Carl W. Fuller, Lyle R. Middendorf, Steven A. Benner, George M. Church, Timothy Harris, Xiaohua Huang, Stevan B. Jovanovich, John R. Nelson, Jeffery A. Schloss, David C. Schwartz, and Dmitri V. Vezenov. The challenges of sequencing by synthesis. Nat Biotechnol, 27(11): , Genestudio. Format notes: Fasta [Online; Accessed: ]. E. Gluckman, V. Rocha, and C. Chastang. Peripheral stem cells in bone marrow transplantation. cord blood stem cell transplantation. Baillieres Best Pract Res Clin Haematol, 12(1-2):279 92, John A. Hansen, Reginald A. Clift, E. Donnall Thomas, C. Dean Buckner, Rainer Storb, and Eloise R. Giblett. Transplantation of marrow from an unrelated donor to a patient with acute leukemia. N Engl J Med, 303: , Erika Check Hayden. Is the $1,000 genome for real? web.archive.org/web/ / is-the genome-for-real , [Online; Accessed: ]. Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia M. Sharma, Philipp Khaitovich, Jörg Vogel, Peter F. Stadler, and Jörg Hackermüller. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol, 5(9):e , Manuel Holtgrewe. Mason - A read simulator for second generation sequencing data. Technical report, Institut für Mathematik und Informatik, Freie Universität Berlin, Matt Johnson. Understanding and pre-processing raw illumina data, Marcin Kalicinski. Rapidxml. //rapidxml.sourceforge.net/, [Online; Accessed: ]. K. Kaukinen, J. Partanen, M. Mäki, and P. Collin. Hla-dq typing in the diagnosis of celiac disease. The American Journal of Gastroenterology, 97(3):695 9, W. James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and David Haussler. The human genome browser at ucsc. Genome Res, 12(6): , N. Kikuoka, S. Sugihara, T. Yanagawa, A. Ikezaki, H.S. Kim, H. Matsuoka, Y. Kobayashi, K. Wataki, S. Konda, H. Sato, S. Miyamoto, N. Sasaki, T. Sakamaki, H. Niimi, and M. Murata. Cytotoxic t lymphocyte antigen 4 gene polymorphism confers susceptibility to type 1 diabetes in japanese children: analysis of association with hla genotypes and autoantibodies. Clin Endocrinol, 55(5): ,

95 REFERENCES Heng Li. A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21): , Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows wheeler transform. Bioinformatics, 25: , Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and samtools. Bioinformatics, 25(16): , Chang Liu, Xiao Yang, Brian Duffy, Thalachallour Mohanakumar Robi D. Mitra, Michael C. Zody, and John D. Pfeifer. Athlates: accurate typing of human leukocyte antigen through exome sequencing. Nucleic Acids Res, 41(14):e142, Endre Major, Krisztina Rigó, Tim Hague, Attila Bérces, and Szilveszter Juhos. Hla typing from 1000 genomes whole genome and whole exome illumina data. PLoS ONE, 8(11): e78410, Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The genome analysis toolkit: A mapreduce framework for analyzing next-generation dna sequencing data. Genome Research, 20(9): , Loukas Moutsianas, Luke Jostins, Ashley H. Beecham, Alexander T. Dilthey, Dionysia K. Xifara, Maria Ban, Tejas S. Shah, Nikolaos A. Patsopoulos, Lars Alfredsson, Carl A. Anderson, and et al. Class ii hla interactions modulate genetic risk for multiple sclerosis. Nature, Online, William R. Pearson and David J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America., 85: , K.D. Pruitt, G.R. Brown, S.M. Hiatt, F. Thibaud-Nissen, A. Astashyn, O. Ermolaeva, C.M. Farrell, J. Hart, M.J. Landrum, K.M. McGarvey, M.R. Murphy, and N.A. O Leary et al. Refseq: an update on mammalian reference sequences. Nucleic Acids Res, 42: , James Robinson, Jason A. Halliwell, James D. Hayhurst, Paul Flicek, Peter Parham, and Steven G. E. Marsh. The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Research, 43:D , Andras Szolek, Benjamin Schubert, Christopher Mohr, Marc Sturm, Magdalena Feldhahn, and Oliver Kohlbacher. Optitype: precision hla typing from next-generation sequencing data. Bioinformatics, 30(23):3310 6, Arnoud H.M. Van Vliet. Next generation sequencing of microbial transcriptomes: challenges and opportunities. FEMS Microbiol Lett, 302(1):1 7,

96 REFERENCES L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1: , René L. Warren, Gina Choe, Douglas J. Freeman, Mauro Castellarin, Sarah Munro, Richard Moore, and Robert A. Holt. Derivation of hla types from shotgun sequence datasets. Genome Med, 4:95, James D. Watson and Francis H.C. Crick. A structure for deoxyribose nucleic acid. Nature, 171: ,