Overview of Genome Assembly Techniques

Sun Kim and Haixu Tang

The most common laboratory mechanism for reading DNA sequences (e.g., gel electrophoresis) can determine a sequence of up to approximately 1,000 nucleotides at a time (see footnote 1). However, an organism's genome is much larger; for example, the human genome consists of approximately 3 billion nucleotides. The most commonly used and most cost-effective process is shotgun sequencing, which physically breaks multiple copies (or clones) of a target DNA molecule into short, readable fragments and then reassembles the short fragments to reconstruct the target DNA sequence. The assembly of short fragments in shotgun sequencing was originally done by hand, but manual assembly is undesirable because it is error prone and not cost-effective. Automatic fragment assembly has been studied for a long time [1-11]. Various sequence assemblers contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13].

Footnote 1: Recently, several promising new experimental techniques have been developed. Note that we survey new sequencing technology in Part I.

6.1 Genome Sequencing by Shotgun-Sequencing Strategy

Almost all large-scale sequencing projects employ the shotgun strategy that assembles (deduces) the target DNA sequence from a set of short DNA
fragments determined from DNA pieces randomly sampled from the target sequence. The set of short DNA fragments, called shotgun reads, is assembled into a set of contigs, or sets of aligned fragments, by a computer program called a fragment assembler. Fragment assembly is a conceptually simple procedure that generates longer sequences by detecting overlapping fragments. If fragment assembly could be done perfectly, genome sequencing would be a simple problem. However, a genomic sequence contains extensive repetitive sequences (repeats, for short), which can easily mislead the fragment assembly process (see Figure 6.1).

A useful technique to overcome the difficulty caused by repeats is to sequence both ends of a clone, generating two fragment reads per clone. Since the insert size of the clone is known, we know the approximate distance between the two fragments. This technique was developed by Hood and his colleagues [14]. The fragment-matching information is often referred to as mate-pair information, and it has become essential for large-scale shotgun sequencing. The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, when the assembly can be judged correct or incorrect based on the clone-length information (see Figure 6.2). One strategy for using the mate-pair information effectively is to assemble contigs as accurately as possible by detecting potentially misassembled contigs, and then to apply the mate-pair information only to contigs that are likely to be assembled correctly.
See Sections 7.4 and 7.5 for the strategy.

Figure 6.1 Effect of repeats in fragment assembly (figure label: "Assembly is wrong due to repeats"). Since assembly is based on overlapping regions between fragments, repeats can easily mislead the assembly process, putting all five fragments from two repeat copies into one contig in this figure.

A Procedure for Whole-Genome Shotgun (WGS) Sequencing

In general, assembly of shotgun reads generates a large number of contigs, some of which are misassembled, probably because of repetitive sequences in the target DNA. As a result, genome sequencing is usually carried out in multiple steps and, unfortunately, there is no consensus on the steps of all genome-sequencing
projects. We describe a general procedure for genome sequencing below, but we emphasize that the procedures used at genome-sequencing centers differ in their details.

Figure 6.2 (Upper panel) Mate-pair information. Two fragments, f1 and f2, are read from the two ends of the same clone, which has an approximate size of 2 kb. (Bottom panel) The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, which results in either a correct or an incorrect assembly based on the clone-length information. (Figure labels: approximate length in bps; f1 read from one end, f2 read from the other end; 2 kb; 6 kb.)

1. Fragment readout: The sequence of each fragment is determined using automatic base-calling software. Phred [15, 16] is the most widely used program.
2. Trimming vector sequences: Shotgun reads often contain part of the vector sequence, which has to be removed before sequence assembly.
3. Trimming low-quality sequences: Shotgun reads contain poor-quality base calls, and removing or masking out these low-quality base calls often leads to more accurate sequence assembly. However, this step is optional and some sequencing centers do not mask out low-quality
base calls, relying on the fragment assembler to use quality values to decide true fragment overlaps.
4. Fragment assembly: The shotgun data are input to a fragment assembler that automatically generates a set of aligned fragments called contigs. A survey of fragment assembly algorithms is in Chapter 7.
5. Assembly validation: Some contigs assembled in the previous step may be misassembled because of repeats. Since we do not have a priori knowledge of the repeats in the target DNA, it is very difficult to verify the correctness of each contig, and this step is largely done manually. There are recent algorithmic developments on automatic verification of contig assemblies (see Section 6.4).
6. Scaffolding contigs: Contigs need to be oriented and ordered. Mate-pair information is the primary information source for this step, so this step is not achievable if the input shotgun data were not prepared by reading both ends of clones (see Section 6.5 for more details).
7. Finishing: Assuming that all contigs are assembled correctly and are oriented and ordered correctly, we can close the gaps between contigs by sequencing the specific regions that correspond to the positions of the gaps.

6.2 Trimming Vector and Low-Quality Sequences

DNA characters in a fragment are determined from a chromatogram, which can be viewed as a plot that shows the possibilities of each of the four DNA characters. A base call is a DNA character determined from a chromatogram, and this process is done automatically by a computer program. Phred [15], the most widely used base-calling program, generates numeric values that denote the confidence level of each base call. The quality value of a base is $q = -10 \log_{10}(p)$, where $p$ is the estimated error probability for the base [15]. Thus a sequencing machine generates two types of output files, one for the DNA fragment sequence and another for the base-call-quality values of the fragment.
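As a small illustration of the quality value, the following sketch converts between an error probability and a Phred-style quality score using the standard definition q = -10 log10(p). The function names are hypothetical and are not part of Phred itself.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Return the quality value q = -10 * log10(p) for an error probability p."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_prob)


def error_probability(quality: float) -> float:
    """Invert the definition: p = 10 ** (-q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # A base called with a 1-in-100 chance of error, and the inverse mapping.
    print(round(phred_quality(0.01), 3))
    print(error_probability(30))
```

Under this definition, each increase of 10 in the quality value corresponds to a tenfold decrease in the estimated error probability.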
Using this information, the next step is to trim vector sequences and identify low-quality regions.

The Trimming Vector and Low-Quality Sequences Problem

Input: A set of fragment reads with base-call-quality information; a set of vector sequences.
Output: A set of fragment reads with vector sequences trimmed, and with information on the start and end positions of good-quality regions.

The problem is relatively simple, but it requires a carefully written computer program to handle issues related to fragment trimming, such as vector sequence removal, identification of low-quality regions, and identification of contaminant reads. Chou and Holmes [16] wrote a suite of programs called LUCY for trimming vector and low-quality sequences from each fragment read. The design goal of LUCY is to process fragments so that the trimmed fragments have the best overall quality, rather than to optimize individual base-quality values. LUCY operates in multiple steps.

Quality Region Determination

The good-quality region of a fragment is determined as follows:
1. LUCY first removes low-quality regions from both ends, since the beginning and end of each sequence are typically of low quality. This is done by scanning a fragment from its left end to identify the first window of 10 bases with an error probability rate of 0.02 or less. Similarly, it finds the first such window from the right end.
2. The next step is to identify regions with high error rates. This is done by scanning the remaining fragment region with a window size of 50 bases and a maximum error probability rate of 0.08, and then with a window size of 10 bases and a maximum error probability rate of 0.3. Any clean range of fewer than 100 bases is discarded.
3. Each remaining clean range is further examined against two parameters: the overall maximum error probability (0.025) and the maximum error probability rate of the two consecutive bases at the ends (0.02). Note that this step looks at the entire sequence range rather than a window.

Vector Splice Site Trimming

This step requires two input files, one with the whole vector sequence and another with two splice-site template sequences, upstream and downstream from the insertion point on the vector.
Since vector splice sites are usually at the beginning of a fragment, where the quality of bases is low, simple sequence matching is not guaranteed to find them. Note that vector splice sites may lie outside the good-quality region of a fragment; these splice sites are still sought because splice-site information is useful (e.g., for estimating clone length). To deal with low-quality bases, LUCY uses three consecutive windows of 40, 60, and 100 bases. Vector sequences are matched with minimum match lengths of 8, 12, and 16 bases within the three windows, respectively.
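The first step of the quality region determination (finding the first acceptable 10-base window from each end) can be sketched as follows. This is a minimal illustration, not LUCY's implementation: it assumes that the "error probability rate" of a window is the mean per-base error probability, and the function name and the half-open return range are hypothetical choices.

```python
from typing import Optional, Sequence, Tuple


def trim_low_quality_ends(
    error_probs: Sequence[float],
    window: int = 10,
    max_error: float = 0.02,
) -> Optional[Tuple[int, int]]:
    """Return the half-open range [start, end) kept after end trimming.

    Assumption: a window is acceptable when the mean error probability of
    its bases is at most max_error. Returns None if no window qualifies.
    """
    n = len(error_probs)
    if n < window:
        return None

    def window_ok(start: int) -> bool:
        return sum(error_probs[start:start + window]) / window <= max_error

    # First acceptable window scanning from the left end.
    left = next((i for i in range(n - window + 1) if window_ok(i)), None)
    if left is None:
        return None
    # First acceptable window scanning from the right end (exists, since
    # the window found from the left is also a candidate here).
    right = next(i for i in range(n - window, -1, -1) if window_ok(i))
    return left, right + window


if __name__ == "__main__":
    # Noisy bases at both ends around a clean core of 30 bases.
    probs = [0.2] * 5 + [0.01] * 30 + [0.3] * 4
    print(trim_low_quality_ends(probs))
```

The later steps (the 50-base and 10-base scans inside the kept range, and the whole-range checks) would follow the same windowed pattern with the thresholds listed above.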

Contaminant Detection

Contaminants can come from many sources, including Escherichia coli or human DNA, which can be identified easily by sequence comparison methods. The real challenge is to identify contaminants from the cloning vectors themselves. Two common contaminants are vector inserts, formed when a vector splices with another vector, and short inserts, in which case most of a fragment read is vector sequence. Detection proceeds in two steps:
1. The first step is to prepare a sequence tag pool of, say, 10 bases from a full-length vector sequence.
2. Each fragment is converted into tags and searched against the contaminant tag pools. A contaminant sequence is detected by counting the number of matched tags. Since the tag-matching step is performed after trimming low-quality regions, tag matching is done only for good-quality regions, so matched tags are of very high confidence.

6.3 Fragment Assembly

Given a set of shotgun reads with vector and low-quality sequences trimmed, a fragment assembly program is used to assemble the reads to reconstruct the target sequence.

The Fragment Assembly Problem

Input: A set of fragment reads with vector sequences trimmed and with information on the start and end positions of good-quality regions; a set of base-call-quality information for each fragment.
Output: A set of contigs, each of which is a set of aligned fragments; a set of consensus sequences, one consensus sequence per contig.

Typically, a fragment assembly program generates many sets of assembled fragments instead of a single contiguous sequence. The two major reasons are repeats in the target sequence and low-coverage regions in the shotgun data. See Chapter 7 for details. Most fragment assembly algorithms employ the overlap-layout-consensus approach.

Overlap-Layout-Consensus Approach

The most widely used overlap-layout-consensus approach, pioneered by Peltola et al.
[11], consists of three major steps: (1) identification of candidate overlaps, (2) fragment layout, and (3) consensus sequence generation from the layout. The first step is achieved using string pattern matching techniques, which generate possible overlaps between fragments. The second and third steps involve
building models, implicit or explicit, for computing the layout of fragments and generating consensus sequences by enumerating the search space based on the model. Many successful sequence assembly algorithms have been developed based on this paradigm [1-3, 5, 7-9, 12]. There are also other approaches explicitly based on graph theory [6, 17]. These sequence assemblers contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13]. However, sequence assemblers typically generate a large number of contigs rather than single contiguous sequences, because of repetitive sequences and technical difficulties encountered at different stages of a genome-sequencing project. For example, the four most widely used assemblers generated 149 to more than 300 contigs for the N. meningitidis genome of 2.18 Mb [6]. The complete determination of the target sequence from the set of contigs requires a significant amount of work, which is called finishing.

6.4 Assembly Validation

Repeats in the genome can easily lead to misassembly of contigs. Thus it is very important to validate the contig assembly before scaffolding contigs or finishing gaps between contigs. The most accurate method to detect misassembled contigs is to perform wet-lab experiments. However, this is time consuming and requires carefully designed experiments.

The Assembly Validation Problem

Input: A set of contigs with fragment alignment information; mate-pair information (some methods, e.g., the information theoretic probabilistic approach in this section, do not require this information).
Output: For each base position, a prediction of whether the position is assembled correctly or not.
Rouchka and States [18] proposed a computational technique to design wet-lab experiments for contig assembly validation, including high clone coverage maps, multiple complete digest mapping, optical restriction mapping, and ordered shotgun sequencing. Recently, several computational techniques that do not use wet-lab experiments have been developed. These techniques can be implemented as separate computational tools [19-22] or embedded in assemblers. The assembly validation techniques used in the sequence assemblers are reviewed in Sections 7.3, 7.4, and 7.5. Another interesting approach is to compare sequence assemblies from two or more fragment assembly programs to detect misassembled regions and to obtain a higher-quality assembly (e.g., [23]).

TAMPA

Dew et al. [19] developed a sequence-assembly validation method that utilizes mate-pair data to evaluate and compare assemblies. The basic assumption is that the lengths of mate pairs from a clone library follow a Gaussian distribution with mean µ and standard deviation σ, which can be observed in plots of clone mate lengths in the final curated assembly. Thus a mate pair is unsatisfied if the distance between its two reads falls outside the range µ ± 3σ. TAMPA is a computational geometry-based approach to detecting assembly breakpoints by exploiting the constraints that mate pairs impose on each other. They classified mate pairs according to four assembly problems: insertion of incorrect sequences between a mate pair, deletion of sequences between the sequences of a mate pair, inversion between two or more mate pairs, and transposition of mate pairs. The effects of the four assembly problems are stretched (insertion or transposition), compressed (deletion or transposition), and (anti)normal (inversion) mate pairs.

Compression/Expansion Statistics

Zimin and Yorke [22] developed compression/expansion (CE) statistics to identify misassembled regions (i.e., assembly regions that are either compressed or expanded because of repeats). The basic idea is again to assume that insert lengths between two mate fragments are distributed according to a Gaussian distribution with a mean and a variance. Given a contig, a global mean and a global variance of insert lengths are estimated. Then a sample mean and a sample standard deviation for a given library at a given base position in the contig are computed as follows. The sample mean is the average length of the inserts that cover the given position. The sample standard deviation is estimated as

$$\text{sample standard deviation} = \frac{\text{global standard deviation}}{\sqrt{N}}$$

where $N$ is the number of inserts that cover the given base position.
Using the sample and global means and the sample standard deviation, the CE statistic is computed as

$$C = \frac{\text{sample mean} - \text{global mean}}{\text{sample standard deviation}}$$

The CE statistic is negative at a collapsed region and positive at an expanded region. The thresholds for collapsed and expanded regions are empirically determined as 4 and 4.7 in magnitude, respectively. Using the CE statistic, Zimin et al. developed a method to compare and correct (reconcile) misassembled regions using two different assemblies.
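The CE computation at a single base position can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the sign convention for the thresholds (C below -4 flagged as collapsed, C above 4.7 flagged as expanded) is an assumption based on the statement that C is negative at collapsed regions.

```python
import math
from typing import Sequence


def ce_statistic(
    covering_insert_lengths: Sequence[float],
    global_mean: float,
    global_std: float,
) -> float:
    """CE statistic for the inserts that cover one base position."""
    n = len(covering_insert_lengths)
    if n == 0:
        raise ValueError("no inserts cover this position")
    sample_mean = sum(covering_insert_lengths) / n
    # Standard deviation of the sample mean under the Gaussian assumption.
    sample_std = global_std / math.sqrt(n)
    return (sample_mean - global_mean) / sample_std


def classify_region(c: float, collapsed: float = -4.0, expanded: float = 4.7) -> str:
    """Label a position from its CE value (threshold signs are an assumption)."""
    if c < collapsed:
        return "collapsed"
    if c > expanded:
        return "expanded"
    return "ok"


if __name__ == "__main__":
    # Sixteen inserts that all appear 500 bp shorter than the library mean.
    c = ce_statistic([1500.0] * 16, global_mean=2000.0, global_std=200.0)
    print(c, classify_region(c))
```

Because the sample standard deviation shrinks with the square root of N, a modest but consistent shift in insert length becomes significant once enough inserts cover the position.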

Clone Coverage Analysis

Sequence assembly validation based on clone coverage can be used to detect large-scale misassemblies, especially collapsed repeats [20, 24]. This approach works in three steps:
1. Contigs are oriented and ordered.
2. The estimated lengths for all library clone types are computed, and clones are classified into two classes: bad clones, whose lengths deviate greatly from the expected clone length, and good clones, whose lengths are within an acceptable range of the expected clone length.
3. A good-minus-bad clone coverage plot is computed for each contig by subtracting the number of bad clones from the number of good clones.
The basic idea is simple: any region where more bad clones than good clones are aligned is likely to be misassembled.

An Information Theoretic Probabilistic Approach

This approach [21] identifies misassembled regions using entropy plots that are computed from statistics on the number of patterns per fragment. To compute the entropy of fragments, we need to construct a probability model that measures how much each aligned fragment contributes to misassembly. The probability function $\mathrm{prob}(f_i)$ is built using the fragment distribution, a measure used for repeat handling in a sequence assembler called AMASS [3]. From the probability model, we compute the entropy at base position $p$ in a contig as

$$\mathrm{entropy}(p) = -\sum_{p-\delta \,\le\, \mathrm{pos}(f_i) \,\le\, p+\delta} \mathrm{prob}(f_i)\,\log\bigl(\mathrm{prob}(f_i)\bigr)$$

where $\mathrm{pos}(f_i)$ denotes the left-end position of $f_i$ in the contig and $\delta$ is a user-input parameter (by default, it is the same as the window size used for the fragment distribution calculation). Figure 6.3 shows how the entropy plot can detect misassembled regions.

6.5 Scaffold Generation

Some sequence assembly packages include a scaffold generation module that generates scaffolds of assembled contigs [5, 6, 9, 25-28]. There are also separate packages such as GigAssembler [26] and Bambus [28], which are surveyed in this section.

[Figure 6.3 panels: (a) "C8 fragment coverage", (b) "Contig8.good bad", (c) "C8 entropy"; horizontal axis: base position; vertical axis: coverage.]
Figure 6.3 (a) The fragment coverage, (b) the clone coverage plot, and (c) the entropy plot for contig 8 generated by Phrap (version 2001). There is a misassembled region from 89,415 to 90,332 where the fragment coverage is not distinctly high, but the valleys in the clone coverage plot and the peaks in the entropy plot are distinct, effectively identifying the misassembled region.
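The good-minus-bad clone coverage of the kind plotted in panel (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published tool: clones are given as half-open (start, end) intervals in contig coordinates, a single library is considered, and a clone is called good when its length is within a hypothetical `max_deviation` of the expected length.

```python
from typing import List, Sequence, Tuple


def good_minus_bad_coverage(
    contig_length: int,
    clones: Sequence[Tuple[int, int]],
    expected_length: int,
    max_deviation: int,
) -> List[int]:
    """Per-base count of good clones minus bad clones covering each position."""
    diff = [0] * (contig_length + 1)
    for start, end in clones:
        is_good = abs((end - start) - expected_length) <= max_deviation
        sign = 1 if is_good else -1
        # Difference array: add at the start, cancel just past the end.
        diff[start] += sign
        diff[end] -= sign
    coverage = []
    running = 0
    for delta in diff[:contig_length]:
        running += delta
        coverage.append(running)
    return coverage


def suspicious_positions(coverage: Sequence[int]) -> List[int]:
    """Positions where more bad clones than good clones are aligned."""
    return [i for i, value in enumerate(coverage) if value < 0]


if __name__ == "__main__":
    clones = [(0, 4), (1, 5), (3, 9)]  # the last clone is stretched
    cov = good_minus_bad_coverage(10, clones, expected_length=4, max_deviation=1)
    print(cov)
    print(suspicious_positions(cov))
```

A valley below zero in this profile corresponds to the kind of region highlighted in the clone coverage plot.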

The Scaffold Generation Problem

Input: A set of contigs; mate-pair information; physical/genetic map (optional); expressed sequence tags (EST) (optional).
Output: A set of linearly ordered contigs with optional gap-size information between adjacent contigs.

In a sense, scaffolding generates a linear order of contigs by orienting and ordering them. Issues related to scaffolding are:
1. Mate-pair information is erroneous. Some mate pairs come from chimeric clones. More seriously, mate-pair information from fragments aligned at wrong places can easily confuse contig orientation and ordering.
2. There are typically several types of clone libraries that differ in length, say, 2 kb, 10 kb, 40 kb, 100 kb, and so on. In general, mate-pair information from shorter clones is more accurate than that from longer clones. How to utilize mate-pair information of different quality is not trivial. Bambus is one of the hierarchical scaffolding methods that utilize mate-pair information from clones of different lengths in a hierarchical fashion.
3. There are external information sources that can be utilized for scaffolding contigs, such as physical/genetic maps, alignment information obtained by aligning contigs to already finished genomes, and conservation of gene synteny.

Bambus

The main design philosophy of Bambus is to provide a stand-alone scaffolding package so that it can be coupled with other fragment assembly packages and users can easily control the parameters for scaffolding contigs. Note that recent assemblers, such as the Celera Whole Genome Assembler and Arachne, embed a scaffolding module, but these scaffolding modules are tightly coupled with the specific assemblers. In this section, we explain the steps of Bambus and discuss how Bambus deals with the main issues of scaffolding contigs.

Edge Bundling to Handle Errors in Mate Pairs

Mate-pair information is very important in orienting and ordering contigs.
However, some mate pairs are incorrect because of misassembled contigs or fragments from chimeric clones. Intuitively, the more mate pairs between two contigs that are consistent in orienting and ordering the two contigs, the higher the confidence that these mate pairs are correct. This problem can be formally defined and solved by finding the largest clique (i.e., a fully
connected subgraph) in the interval graph induced by the inter-contig gap ranges for the links. The cluster with the most links is chosen, and all links in other clusters are given invalid orientation tags. Bambus allows the user to specify different redundancies to be used for contig links, depending on the confidence in the data. For example, links from shorter clones, say, of 2 kb, require fewer bundled links, while links from longer clones, say, of 100 kb, require more bundled links before they are used for contig orientation. The output of this step is a set of contig edges between contig pairs. The remaining task is to orient and order contigs using these contig edges.

Contig Orientation

The contig orientation problem is to find a consistent orientation for all contigs. This is sometimes challenging. Consider three contigs, A, B, and C. The orientation of a contig, say B, with respect to another contig, say A, can be either A B, A rc(B), rc(A) B, or rc(A) rc(B), where rc(A) and rc(B) represent the reverse complements of A and B, respectively. Suppose that contig edges impose the orientations A rc(B) and A rc(C). In addition, suppose that a contig edge imposes the orientation B rc(C). Then the three-contig orientation is not consistent: the first two edges place B and C in the same orientation relative to each other, whereas the third requires opposite orientations. Note that there are clones of different lengths, so bundled edges are also of different lengths. If there is an error in the bundled edges, the situation described above can happen. Based on the principle of parsimony, we can formulate the contig orientation problem as computing the minimum number of contig edges that must be removed to make the orientation of all contigs consistent. Unfortunately, this problem is NP-hard [4]. Thus a greedy heuristic algorithm is used in practice.
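One possible greedy heuristic for the orientation problem is sketched below. It is not the Bambus implementation: it is a generic approach that processes edges from the highest weight down and rejects any edge that contradicts the orientations already fixed, using a union-find structure with a parity bit. The edge weight (for example, the number of bundled links) and all names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (contig_a, contig_b, same_orientation, weight); same_orientation is False
# when the edge asks for one contig to be reverse complemented.
Edge = Tuple[str, str, bool, int]


def greedy_orient(
    contigs: Iterable[str], edges: Iterable[Edge]
) -> Tuple[Dict[str, bool], List[Edge]]:
    """Return ({contig: is_reversed}, rejected_edges)."""
    parent = {c: c for c in contigs}
    flip = {c: False for c in parent}  # orientation relative to the parent

    def find(c: str) -> Tuple[str, bool]:
        """Return the component root and whether c is flipped relative to it."""
        flipped = False
        while parent[c] != c:
            flipped ^= flip[c]
            c = parent[c]
        return c, flipped

    rejected: List[Edge] = []
    for edge in sorted(edges, key=lambda e: -e[3]):
        a, b, same, _weight = edge
        root_a, flip_a = find(a)
        root_b, flip_b = find(b)
        if root_a == root_b:
            # Both contigs are already oriented; keep the edge only if it agrees.
            if (flip_a == flip_b) != same:
                rejected.append(edge)
        else:
            # Merge components so that a and b satisfy the edge.
            parent[root_b] = root_a
            flip[root_b] = flip_a ^ flip_b ^ (not same)
    orientation = {c: find(c)[1] for c in parent}
    return orientation, rejected


if __name__ == "__main__":
    # The inconsistent three-contig example: A rc(B), A rc(C), B rc(C).
    edges = [("A", "B", False, 3), ("A", "C", False, 2), ("B", "C", False, 1)]
    print(greedy_orient(["A", "B", "C"], edges))
```

On the three-contig example, the two heavier edges fix B and C as reverse complemented relative to A, and the lightest edge is rejected as inconsistent. A greedy pass of this kind does not guarantee the minimum number of removed edges.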
In general, the greedy contig orientation algorithm works well, since the edge-bundling step generates contig edges of high accuracy.

Contig Ordering

The contig-ordering problem is to embed contigs on a line while preserving the gap lengths suggested by the bundled edges. This can be formulated as a problem of topological ordering of contigs subject to length constraints between contigs. An optimization formulation would be to find a topological ordering with a minimum number of edges removed. This problem is also known to be NP-hard. Bambus uses an expand-contract greedy heuristic for contig ordering. The first step, expand, anchors the first unplaced contig with edges at their maximum allowable length and then traverses the contig graph in a breadth-first manner to fill in the range. As this is a greedy placement of contigs, any contig with an inconsistent ordering is not placed. After the expand step, contigs are brought back and placed as close as possible to the midpoint of the range defined by the length constraints of the edges. This contraction step allows placement of as many contigs as possible. The resulting ordering may not be
consistent, meaning that two contigs may occupy the same space. This ambiguous placement can be helpful for the final genome-finishing step.

Hierarchical Scaffolding

Contig edges from smaller insert libraries have significantly fewer errors than those from longer insert libraries. Thus how to utilize clone edges of different accuracies is not trivial. Bambus generates scaffolds of contigs in a hierarchical fashion, starting with contig edges from the smallest libraries, say, of 2 kb, and then adding edges of lower quality from larger insert libraries. The quality of contig edges is evaluated not only by the library length but also by the number of edges connecting two contigs, since confirmation from two independent edges is an indication of higher quality in connecting contigs.

Untangling

Given a scaffold of contigs, there can be contigs that are involved in multiple paths of contigs. In this case, it may be desirable to untangle those contigs to convert an ambiguous scaffold into a single linear stretch. The Bambus untangler resolves an ambiguous scaffold by iteratively finding the longest non-self-overlapping path in a greedy fashion. If a contig is involved in multiple potential paths, it may be desirable to break the contig into multiple pieces and then test whether single linear nonoverlapping stretches of contigs can be generated. Bambus plans to incorporate this in a future release.

GigAssembler

GigAssembler [26] was used for the human genome assembly in the public Human Genome Project [13]. It generated scaffolds using contigs, map, mRNA, EST, and BAC end data. The overview of the scaffold-generation process is as follows.
1. Decontaminating and repeat masking the sequence. RepeatMasker [29] is used to mask known repeats and contaminants from bacteria, vectors, and others.
2. Aligning mRNA, EST, BAC end, and paired-plasmid reads against the initial sequence contigs.
3. Creating an input directory structure using map and other data. For the human genome, they used Washington University's map data. A directory is created for each chromosome and a subdirectory for each fingerprint clone contig.
4. For each fingerprint clone contig, aligning the initial sequence contigs within that contig against each other.
5. Using the GigAssembler program within each fingerprint clone contig to merge overlapping initial sequence contigs, and to order and orient the resulting sequence contigs into scaffolds.
6. Combining the contig assemblies into full chromosome assemblies.

Preprocessing: Alignment of mRNA, ESTs, BAC Ends, and Paired Reads

Contigs are oriented and ordered by aligning mRNA, ESTs, BAC ends, and paired reads to contigs using a program called pslayout. It reports all matches above a certain minimal quality between query sequences and database sequences. To compute alignments, it collects candidate matching regions by using 10-mer indices in a set of overlapping 500-base regions of the query sequence. Then the candidate matching regions are aligned, tolerating intron regions in particular when aligning mRNA and ESTs. The resulting aligned regions are combined using a dynamic programming algorithm. To reduce the effects of repeats, two techniques are used. The first technique is to use repeat sequences (repeats are not masked, although they are detected by RepeatMasker). The second technique is to maintain "near best" matches only, which means that matches (even good ones) are discarded if they fall too far below the best match.

Assembly and Ordering of Contigs

GigAssembler operates in many steps. The task is to generate consensus sequences determined by contigs and their ordering. Contigs are ordered and merged gradually into larger ones by building rafts, barges, raft-ordering graphs, and bridge graphs. Next we describe the algorithm in more detail.
1. Build merged sequence contigs, called rafts, from overlapping initial sequence contigs. A score is assigned to each aligning pair, and then the alignments are processed from the best scoring to the worst.
2. Build sequenced clone contigs, called barges, from overlapping clones.
Barges are constructed in a greedy fashion, where the clone overlap is the sum of all initial sequence contig overlaps. Each clone is assigned a coordinate in the resulting barge.
3. Once the orientation and order of clones are determined while constructing barges, rafts (merged sequence contigs) can be ordered using a raft-ordering graph. This is a directed graph with two types of nodes, rafts and sequenced clone endpoints. To understand what is happening at this stage, see Figure 6.4.
4. Rafts are bridged with mRNAs, ESTs, paired-plasmid reads, BAC end pairs, and ordering information from the sequencing centers. The resulting graph is called a bridge graph. Bridges are added to the ordering graph one at a time, starting with the best scoring. The score
function for bridges is based on the type of information. mRNA information is given the highest weight, then paired-plasmid reads, information provided by the sequencing centers, ESTs, and BAC end matches, in that order.

[Figure 6.4 labels: clones A, B, C with initial contigs a1, a2, b1, b2, c1, c2; ordering-graph nodes As, Bs, Ae, Cs, Be, Ce.]
Figure 6.4 How to build a raft-ordering graph. Six initial contigs (a1, a2, b1, b2, c1, c2) are aligned to three clones (A, B, C) (top figure), an ordering graph of clone starts and ends is given (middle figure), and the final raft-ordering graph is obtained after adding rafts to the ordering graph (bottom figure). From the top figure, we can construct three rafts from a1, b1, a2, b2, c1, and c2, based on their overlaps. Then an ordering graph of clone starts and ends can be constructed based on the positions of the clone starts and ends, as in the middle figure. The node names As and Ae denote the start and the end of clone A, respectively. Finally, the three rafts are added to the ordering graph, as in the bottom figure.

5. Walk the bridge graph to get an ordering of rafts. Each bridge is walked in the order of the default coordinates assigned, with the constraint that if a raft has predecessors, all the predecessors must be walked before the raft is walked.
6. A sequence path through each raft is built in a greedy fashion, starting with the longest, most finished initial sequence contig that passes through each section of the raft.
7. Build the final sequence for the fingerprint clone contig by inserting the appropriate number of Ns between raft sequence paths.

6.6 Finishing

Finishing is labor intensive and constitutes a major bottleneck in any genome-sequencing project. The input to the finishing stage is a set of oriented and ordered contigs. However, as we discussed in previous sections, it is still challenging to verify the correctness of contigs and to generate scaffolds of contigs. Because of these difficulties, there are two different views on pursuing genome-level sequencing, one in favor of the whole-genome shotgun strategy [30] and another in favor of a hierarchical strategy involving only smaller-scale shotgun sequencing [31] (see Section 6.7 for more discussion). In summary, more effort is needed in developing frameworks for genome sequencing as well as component tools such as scalable, reliable sequence assemblers and contig assembly-validation methods. As one of the initial efforts, the AMOS project aims at developing open-source whole-genome assembly software for the genome-sequencing community [32].

6.7 Three Strategies for Whole-Genome Sequencing

The first whole bacterial genome, H. influenzae, was sequenced at TIGR in 1995 using the whole-genome shotgun strategy [33]. Since then, the whole-genome shotgun strategy has been successfully used for many genomes, including human [12]. Basically, the longer the DNA region, the more repeats it contains, which clearly becomes a major hurdle to genome sequencing. There are three different strategies for employing the shotgun strategy in whole-genome sequencing.
1. The whole-genome shotgun strategy applies the shotgun-sequencing strategy at the whole-genome level. The advantage of this approach is that it is the most cost-effective, since shotgun data can be prepared in a single step at the whole-genome level. The human genome assembly by Celera was achieved using this strategy [12].
2. The hierarchical approach uses libraries of different insert sizes. Libraries of larger insert size are further split into libraries of smaller insert size while keeping track of which library the current library is a descendant of. This often requires a high-resolution genetic map and a low-resolution physical map prior to the whole-genome assembly. The shotgun strategy is applied when a library becomes small enough so that the
17 Overview of Genome Assembly Techniques 95 current assembly algorithm could determine the target sequence with confidence. Since the library hierarchy information can be easily used to produce the target sequence, this approach may produce more accurate genome sequences. The major drawback of this approach is time and cost of genome sequencing. The human genome consortium used this approach [13]. 3. The hybrid approach, called pooled genomic indexing a technique pioneered at the Baylor College of Medicine, employs both the hierarchical and the whole-genome shotgun approaches, but without physical and genetic mapping information. This approach combines two different types of shotgun reads, one from the whole-genome shotgun and the other from the shotgun sequencing of individual BACs. BACs are generated by using large insert BAC clones and a minimum tiling path of BACs that covers the whole genome is computed. Then shotgun strategy with a low coverage is applied to a set of selected BACs from the tiling path. In parallel, the whole-genome shotgun strategy is applied to generate another set of shotgun data at the whole-genome level. These two shotgun data sets are combined to determine the whole-genome sequence. The brown Norway rat genome was assembled with this strategy [34]. 6.8 Discussion In this section, we briefly summarize techniques for genome sequencing. For more information, readers may refer to several review articles on genome sequencing [35 37]. Although genome sequencing still remains an open problem, recent advances in computational techniques make it possible to sequence very large eukaryotic genomes such as Drosophila melanogaster [27] and the human genome [12, 13]. We discuss some of the recent trends in genome-sequencing strategies next. 1. To achieve very large-scale genome sequencing, it is necessary to have shotgun data of very high quality (see [27]). Otherwise, is not possible to distinguish repeats from errors in the shotgun data. 
What is really interesting is that recent assemblers attempt to correct errors in the input shotgun data before sequence assembly [5, 6]. There is no guarantee to correct errors without knowing the target sequence. Indeed, EULER, a genome assembly package, names this procedure as data corruption instead of error correction. Nonetheless, this is a promising technique that works for large-scale sequence assembly.

2. Repeat boundaries are identified before sequence assembly, and then contigs are assembled up to the boundaries [5, 27]. As with error correction, there is no guarantee that repeat boundaries are identified correctly without knowing the target sequence, but this is another promising technique.

3. Computational techniques to ensure the correctness of contig assembly are becoming more important. Correctness can be checked using the characteristics of the shotgun data (i.e., random sampling; see footnote 2). The characteristics of the data can be utilized in two ways: at the fragment level [2, 5, 9, 27] and through an approach based on pattern statistics [21].

4. Mate-pair information and base-call-quality values have become essential data for genome sequencing.

A Thought on an Exploratory Genome Sequencing Framework

As hinted in Section 6.5.2, contigs that are involved in multiple paths of contigs may be broken into smaller pieces, which can then be tested to see whether linear paths can be generated. To realize this idea, methods for assembly validation that detect misassembled regions in contigs are much needed (see Section 6.4). We tested this idea on several bacterial genomes, including Agrobacterium [38]. The schematic overview of a genome-sequencing framework [39] developed at DuPont is depicted in Figure 6.5. This approach can be viewed as a hypothesis generation and validation paradigm in search of a set of correctly assembled contigs and their ordering. All decisions made at user interaction points are hypotheses that will be subsequently tested with larger clones in the next step. This approach was successful in assembling several genome sequences. For example, 502 contigs in the Phrap assembly of the Agrobacterium shotgun data were grouped and ordered into only 15 sets of contigs (the largest set longer than 2 Mb) using a Web interface in a single iteration of our genome-sequencing framework. Note that there are four replicons in the Agrobacterium genome. As more accurate assembly validation methods are developed, this technique might be useful for automating the sequencing of microbial genomes by embedding sequence assembly modules into an assembly package, such as Minimus [40].

2. There are regions where sampling is biased for biological reasons. However, random sampling can be assumed for the shotgun data as a whole.
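The grouping of contigs by clone linkage information described above can be illustrated with a small sketch. It covers only the grouping step, not the ordering or the validation, and it is not the code of the framework in [39]. The function name `group_contigs`, the input layout (one contig pair per clone whose two end reads fall in different contigs), and the `min_links` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def group_contigs(contigs, mate_pair_links, min_links=2):
    """Group contigs connected by enough mate-pair (clone) links.

    contigs: iterable of contig names.
    mate_pair_links: iterable of (contig_a, contig_b) pairs, one per clone
        whose two end reads fall in different contigs (hypothetical input;
        every name must also appear in `contigs`).
    min_links: minimum number of clones required to trust a connection
        (an illustrative threshold, not a value from the chapter).
    Returns a list of sorted groups, largest group first.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count supporting clones per unordered contig pair, ignoring self-links.
    support = Counter(
        tuple(sorted(pair)) for pair in mate_pair_links if pair[0] != pair[1]
    )
    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in parent:
        groups[find(c)].append(c)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c4", "c5")]
print(group_contigs(["c1", "c2", "c3", "c4", "c5"], links))
# prints [['c1', 'c2', 'c3'], ['c4'], ['c5']]
```

In this toy input, c4 and c5 stay separate because only one clone links them. In the hypothesis-and-validation view described above, each merged group is a hypothesis that would still have to be tested with larger clones in the next iteration.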

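The first trend in Section 6.8, correcting errors in the shotgun reads before assembly, can be illustrated with a generic k-mer counting heuristic. This is a sketch only, not the procedure implemented in EULER or Arachne. The function names, the value of `k`, and the `min_count` threshold are assumptions, and reverse complements are ignored for brevity. The idea is that a k-mer seen only rarely across all reads is more likely to contain a sequencing error than one seen many times.

```python
from collections import Counter


def kmers(read, k):
    """Yield all substrings of length k of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def weak_kmer_positions(reads, k=4, min_count=2):
    """Flag k-mers that occur fewer than min_count times across all reads.

    Returns a dict mapping read index -> start positions of weak k-mers.
    The values of k and min_count are illustrative assumptions.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    flagged = {}
    for idx, read in enumerate(reads):
        positions = [
            i for i, km in enumerate(kmers(read, k)) if counts[km] < min_count
        ]
        if positions:
            flagged[idx] = positions
    return flagged


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
print(weak_kmer_positions(reads))
# prints {3: [0, 1, 2]}
```

The last read differs from the others by a single base, so all three of its 4-mers are rare and get flagged. A true variant between repeat copies, or a thinly sampled region, would be flagged in the same way, which matches the caution above that correction cannot be guaranteed without knowing the target sequence.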
Figure 6.5 A framework for genome sequencing. This framework searches for a set of correctly assembled contigs and their ordering in an iterative fashion. (Diagram labels, in order: a set of DNA fragments; sequence assembler; a set of contigs; assembly validation; split contigs; clone linkage information; ordering contigs; a set of groups of contigs; large clone linkage information; ordering groups; a set of groups of groups.)

Acknowledgments

Sun Kim was supported in part by a Career DBI from the National Science Foundation (United States) and a grant from the Korea Institute of Science and Technology Information. We thank the anonymous reviewer for his or her valuable comments.

References

[1] Green, P.,
[2] Sutton, G., et al., TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects, Genome Science and Technology, Vol. 1, 1995, pp
[3] Kim, S., and A. M. Segre, AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly, Journal of Computational Biology, Vol. 6, No. 4,
[4] Kececioglu, J. D., and E. W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica, Vol. 13, 1995.

[5] Batzoglou, S., et al., Arachne: A Whole-Genome Shotgun Assembler, Genome Research, Vol. 12, No. 1, 2002, pp
[6] Pevzner, P. A., et al., An Eulerian Path Approach to DNA Fragment Assembly, PNAS, Vol. 98, 2001, pp
[7] Huang, X., A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics, Vol. 14,
[8] Huang, X., An Improved Sequence Assembly Program, Genomics, Vol. 33,
[9] Huang, X., and A. Madan, CAP3: A DNA Sequence Assembly Program, Genome Research, Vol. 9, No. 9, 1999, pp
[10] Huang, X., et al., PCAP: A Whole-Genome Assembly Program, Genome Research, Vol. 13, 2003, pp
[11] Peltola, H., et al., SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model, Nucleic Acids Research, Vol. 12, No. 1, Pt. 1, 1984, pp
[12] Venter, J. C., et al., The Sequence of the Human Genome, Science, Vol. 291, 2001, pp
[13] Lander, E. S., et al., Initial Sequencing and Analysis of the Human Genome, Nature, Vol. 409, 2001, pp
[14] Roach, J. C., Pairwise End Sequencing: A Unified Approach to Genome Mapping and Sequencing, Genomics, Vol. 26, 1995, pp
[15] Ewing, B., et al., Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment, Genome Research, Vol. 8, 1998, pp
[16] Chou, H. H., and M. H. Holmes, DNA Sequence Quality Trimming and Vector Removal, Bioinformatics, Vol. 17, No. 12, 2001, pp
[17] Idury, R., and M. S. Waterman, A New Algorithm for DNA Sequence Assembly, Journal of Computational Biology, Vol. 2, No. 2, 1995, pp
[18] Rouchka, E. C., and D. J. States, Sequence Assembly Validation by Multiple Restriction Digest Fragment Coverage Analysis, Proc. of Intelligent Systems for Molecular Biology (ISMB), 1998, pp
[19] Dew, I. M., et al., A Tool for Analyzing Mate Pairs in Assemblies (TAMPA), Journal of Computational Biology, Vol. 12, No. 5, 2005, pp
[20] Kim, S., et al., Enumerating Repetitive Sequences from Pairwise Sequence Matches, manuscript, DuPont Central Research,
[21] Kim, S., et al., A Probabilistic Approach to Sequence Assembly Validation, ACM SIGKDD Workshop on Data Mining in Bioinformatics (BioKDD2001), 2001, pp
[22] Zimin, R., and J. A. Yorke, Assembly Reconciliation Method, umd.edu/reconciliation.htm,
[23] Shatkay, H., et al., ThurGood: Evaluating Assembly-to-Assembly Mapping, Journal of Computational Biology, Vol. 11, No. 5, 2004, pp

[24] Kim, S., et al., A Computational Approach to Sequence Assembly Validation, manuscript, DuPont Central Research,
[25] She, X., et al., Shotgun Sequence Assembly and Recent Segmental Duplications Within the Human Genome, Nature, Vol. 431, 2004, pp
[26] Kent, W. J., and D. Haussler, Assembly of the Working Draft of the Human Genome with GigAssembler, Genome Research, Vol. 11, 2001, pp
[27] Myers, G., et al., A Whole-Genome Assembly of Drosophila, Science, Vol. 287, 2000, pp
[28] Pop, M., et al., Hierarchical Scaffolding with Bambus, Genome Research, Vol. 14, No. 1, 2004, pp
[29] Smit, A., Repeat Masker,
[30] Weber, J. L., and E. Myers, Human Whole-Genome Shotgun Sequencing, Genome Research, Vol. 7, 1997, pp
[31] Olson, M., and P. Green, A Quality-First Credo for the Human Genome Project, Genome Research, Vol. 8, No. 5, 1998, pp
[32] AMOS: A Modular Open-Source Assembler,
[33] Fleischmann, R. D., et al., Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd., Science, Vol. 269, No. 5223, 1995, pp
[34] Gibbs, R. A., et al., (Rat Genome Sequencing Project Consortium), Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution, Nature, Vol. 428, No. 6982, 2004, pp
[35] Pop, M., et al., Genome Sequence Assembly: Algorithms and Issues, IEEE Computer, Vol. 35, No. 7, 2002, pp
[36] Pop, M., et al., Shotgun Sequence Assembly, Advances in Computers, Vol. 60, June
[37] Batzoglou, S., Algorithmic Challenges in Mammalian Genome Sequence Assembly: Special Review, in D. M. Jordel, P. Little, and S. Subramaniam, (eds.), Encyclopedia of Genomics, Proteomics, and Bioinformatics, New York: John Wiley & Sons,
[38] Wood, D. W., et al., The Genome of Agrobacterium Tumefaciens C58: Insights into the Evolution and Biology of a Natural Genetic Engineer, Science, 2001, pp
[39] Kim, S., The AMASS Genome Sequencing Package, Advances in Genome Biology and Technology Conference, February
[40] Sommer, D. D., et al., Minimus: A Fast, Lightweight Genome Assembler, BMC Bioinformatics, Vol. 8, February 26, 2007, p. 64.