Overview of Genome Assembly Techniques
Sun Kim and Haixu Tang
Chapter 6 of Genome Sequencing Technology and Algorithms

The most common laboratory mechanism for reading DNA sequences (e.g., gel electrophoresis) can determine a sequence of up to approximately 1,000 nucleotides at a time (see footnote 1). However, an organism's genome is much larger; for example, the human genome consists of approximately 3 billion nucleotides. The most commonly used and most cost-effective process is shotgun sequencing, which physically breaks multiple copies (or clones) of a target DNA molecule into short, readable fragments and then reassembles the short fragments to reconstruct the target DNA sequence. The assembly of short fragments in shotgun sequencing was originally done by hand, but manual assembly is error prone and not cost-effective. Automatic fragment assembly has been studied for a long time [1-11]. Various sequence assemblers have contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13].

Footnote 1: Recently, several promising new experimental techniques have been developed. New sequencing technology is surveyed in Part I.

6.1 Genome Sequencing by Shotgun-Sequencing Strategy

Almost all large-scale sequencing projects employ the shotgun strategy that assembles (deduces) the target DNA sequence from a set of short DNA
fragments determined from DNA pieces randomly sampled from the target sequence. The set of short DNA fragments, called shotgun reads, is assembled into a set of contigs, or sets of aligned fragments, using a computer program called a fragment assembler. Fragment assembly is a conceptually simple procedure that generates longer sequences by detecting overlapping fragments. If fragment assembly could be done perfectly, genome sequencing would be a simple problem. However, a genomic sequence contains extensive repetitive sequences (repeats, for short), which can easily mislead the fragment assembly process (see Figure 6.1).

A useful technique to overcome the difficulty caused by repeats is to sequence both ends of a clone, generating two fragment reads per clone. Since the insert size of the clone is known, we know the approximate distance between the two fragments. This technique was developed by Hood and his colleagues [14]. The fragment matching information is often referred to as mate-pair information, and it has become essential for large-scale shotgun sequencing. The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, to judge whether an assembly is consistent or inconsistent with the clone lengths (see Figure 6.2). One strategy to use the mate-pair information effectively is to assemble contigs as accurately as possible by detecting potentially misassembled contigs, and then to utilize the mate-pair information using only contigs that are likely to be assembled correctly.
See Sections 7.4 and 7.5 for the strategy.

[Figure 6.1: Effect of repeats in fragment assembly. Since assembly is based on overlapping regions between fragments, repeats can easily mislead the assembly process, putting all five fragments from two repeat copies into one contig in this figure. Panel label: "Assembly is wrong due to repeats".]

A Procedure for Whole-Genome Shotgun (WGS) Sequencing

In general, assembly of shotgun reads generates a large number of contigs, and some of them are misassembled, most likely due to repetitive sequences in the target DNA. As a result, genome sequencing is usually carried out in multiple steps and, unfortunately, there is no consensus on the steps of all genome-sequencing
projects. We describe a general procedure for genome sequencing below, but we emphasize that the procedures used at genome-sequencing centers differ in their details.

[Figure 6.2: (Upper panel) Mate-pair information. Two fragments, f1 and f2, are read from the two ends of the same clone, which has an approximate size of 2 kb. (Bottom panel) The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, to judge whether an assembly is consistent or inconsistent with the clone length. Panel labels: approximate length in bps; 2 kb; 6 kb.]

1. Fragment readout: The sequence of each fragment is determined using automatic base-calling software. Phred [15, 16] is the most widely used program.
2. Trimming vector sequences: Shotgun reads often contain part of the vector sequences, which have to be removed before sequence assembly.
3. Trimming low-quality sequences: Shotgun reads contain poor-quality base calls, and removing or masking out these low-quality base calls often leads to more accurate sequence assembly. However, this step is optional, and some sequencing centers do not mask out low-quality
base calls, relying on the fragment assembler to utilize quality values to decide true fragment overlaps.
4. Fragment assembly: The shotgun data are input to a fragment assembler that automatically generates sets of aligned fragments called contigs. A survey of fragment assembly algorithms is given in Chapter 7.
5. Assembly validation: Some contigs assembled in the previous steps may be misassembled due to repeats. Since we do not have a priori knowledge of the repeats in the target DNA, it is very difficult to verify the correctness of the assembly of each contig, and this step is largely done manually. There are recent algorithmic developments on automatic verification of contig assemblies (see Section 6.4).
6. Scaffolding contigs: Contigs need to be oriented and ordered. The mate-pair information is the primary information source for this step; thus this step is not achievable if the input shotgun data were not prepared by reading both ends of clones (see Section 6.5 for more details).
7. Finishing: Assuming that all contigs are assembled correctly and that contigs are oriented and ordered correctly, we can close gaps between two contigs by sequencing the specific regions that correspond to the positions of the gaps.

6.2 Trimming Vector and Low-Quality Sequences

DNA characters in a fragment are determined from a chromatogram, which can be viewed as a plot that shows the likelihood of each of the four DNA characters. A base call is a DNA character determined from the chromatogram, and this process is done automatically by a computer program. Phred [15], the most widely used base-calling program, generates numeric values to denote the confidence level of each base call. The quality value of a base is $q = -10 \log_{10}(p)$, where p is the estimated error probability for the base [15]. Thus a sequencing machine generates two types of output files, one for the DNA fragment sequence and another for the base-call-quality values of the DNA fragment.
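As a small illustration of the quality value, the sketch below converts between an estimated error probability p and the standard Phred-scaled quality q = -10 log10(p). The function names are hypothetical and chosen only for this example; they are not part of Phred itself.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Quality value q = -10 * log10(p) for an estimated error probability p."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_prob)


def error_probability(quality: float) -> float:
    """Inverse mapping: p = 10 ** (-q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # An error probability of 1 in 100 corresponds to a quality value near 20.
    print(phred_quality(0.01))
    print(error_probability(30.0))
```

Under this definition, each increase of 10 in the quality value corresponds to a tenfold decrease in the estimated error probability.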
Using this information, the next step is to trim vector sequences and identify low-quality regions.

The Trimming Vector and Low-Quality Sequences Problem

Input: A set of fragment reads with base-call-quality information; a set of vector sequences.
Output: A set of fragment reads with vector sequences trimmed, and with information on the start and end positions of good-quality regions.
The problem is relatively simple, but it requires a carefully written computer program to handle issues related to fragment trimming, such as vector sequence removal, identification of low-quality regions, and identification of contaminant reads. Chou and Holmes [16] wrote a suite of programs called LUCY for trimming vector and low-quality sequences from each fragment read. The design goal of LUCY is to process fragments so that the trimmed fragments have the best overall quality, rather than to optimize individual base-quality values. LUCY operates in multiple steps.

Quality Region Determination

The good-quality region of a fragment is determined as follows:
1. LUCY first removes low-quality regions from both ends, since the beginning and end of each sequence are typically of low quality. This is done by scanning a fragment from its left end to identify the first window of 10 bases with an error probability rate of 0.02 or less. Similarly, it finds the first such window from the right end.
2. The next step is to identify regions with high error rates. This is done by scanning the remaining fragment region with a window size of 50 bases and a maximum error probability rate of 0.08, and then with a window size of 10 bases and a maximum error probability rate of 0.3. Any clean range of fewer than 100 bases is discarded.
3. Each remaining clean range is further examined against two parameters: the overall maximum error probability (0.025) and the maximum error probability rate of two consecutive bases at the ends (0.02). Note that this step looks at the entire sequence range rather than a window.

Vector Splice Site Trimming

This step requires two input files, one with the whole vector sequence and another with two splice-site template sequences, upstream and downstream of the insertion point on the vector.
Since vector splice sites are usually at the beginning of a fragment, where the quality of bases is low, simple sequence matching is not guaranteed to find them. Note that vector splice sites may lie outside the good-quality region of a fragment; they are still sought, since splice-site information is useful (e.g., for estimating clone length). To deal with sequences with low-quality bases, LUCY uses three consecutive windows of 40, 60, and 100 bases. Vector sequences are matched with minimum match lengths of 8, 12, and 16 bases within the three windows, respectively.
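The end-trimming scan from the quality region determination step (find the first window of 10 bases with an error probability rate of 0.02 or less, from each end) can be sketched as follows. This is a minimal illustration and not LUCY's implementation: the function names are hypothetical, and the "error probability rate" of a window is assumed here to be the mean of the per-base error probabilities in that window.

```python
from typing import List, Optional, Tuple


def first_clean_window(error_probs: List[float], window: int = 10,
                       max_rate: float = 0.02) -> Optional[int]:
    """Return the start index of the first window whose mean error
    probability is at most max_rate, or None if no such window exists."""
    for start in range(len(error_probs) - window + 1):
        mean_error = sum(error_probs[start:start + window]) / window
        if mean_error <= max_rate:
            return start
    return None


def trim_ends(error_probs: List[float], window: int = 10,
              max_rate: float = 0.02) -> Optional[Tuple[int, int]]:
    """Return (left, right) so that error_probs[left:right] begins and ends
    with a clean window; None if the read has no clean window at all."""
    left = first_clean_window(error_probs, window, max_rate)
    if left is None:
        return None
    # Scan the reversed read to find the first clean window from the right end.
    from_right = first_clean_window(error_probs[::-1], window, max_rate)
    if from_right is None:
        return None
    right = len(error_probs) - from_right  # exclusive end index
    return left, right


if __name__ == "__main__":
    # Hypothetical read: noisy ends around a clean middle.
    probs = [0.2] * 5 + [0.001] * 30 + [0.3] * 4
    print(trim_ends(probs))
```

The later steps (the 50-base and 10-base scans inside the remaining region, and the whole-range checks) would reuse the same windowed-mean idea with the thresholds listed above.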
Contaminant Detection

Contaminants can come from many sources, including Escherichia coli or human DNA, which can be identified easily by sequence comparison methods. The real challenge is to identify contaminants from the cloning vectors themselves. Two common contaminants are vector inserts, which are formed by a vector's splicing with another vector, and short inserts, in which case most of a fragment read is vector sequence.
1. The first step is to prepare a pool of sequence tags of, say, 10 bases from a full-length vector sequence.
2. Each fragment is converted into tags and searched against the contaminant tag pool. A contaminant sequence is detected by counting the number of matched tags. Since the tag-matching step is performed after trimming low-quality regions, tag matching is done only for good-quality regions, so matched tags are of very high confidence.

6.3 Fragment Assembly

Given a set of shotgun reads with vector and low-quality sequences trimmed, a fragment assembly program is used to assemble the reads to reconstruct the target sequence.

The Fragment Assembly Problem

Input: A set of fragment reads with vector sequences trimmed and with information on the start and end positions of good-quality regions; a set of base-call-quality information for each fragment.
Output: A set of contigs, each of which is a set of aligned fragments; a set of consensus sequences, one for each contig.

Typically, a fragment assembly program generates many sets of assembled fragments instead of a single contiguous sequence. Two major reasons are repeats in the target sequence and low-coverage regions in the shotgun data. See Chapter 7 for details. Most fragment assembly algorithms employ the overlap-layout-consensus approach.

Overlap-Layout-Consensus Approach

The most widely used overlap-layout-consensus approach, pioneered by Peltola et al.
[11], consists of three major steps: (1) identification of candidate overlaps, (2) fragment layout, and (3) consensus sequence generation from the layout. The first step is achieved using string pattern matching techniques, generating possible overlaps between fragments. The second and third steps involve
building models, implicit or explicit, for computing the layout of fragments and generating consensus sequences by enumerating the search space based on the model. Many successful sequence assembly algorithms have been developed based on this paradigm [1-3, 5, 7-9, 12]. There are also other approaches explicitly based on graph theory [6, 17]. These sequence assemblers contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13]. However, sequence assemblers typically generate a large number of contigs rather than a single contiguous sequence, due to repetitive sequences and technical difficulties encountered at different stages of a genome-sequencing project. For example, the four most widely used assemblers generated from 149 to more than 300 contigs for the 2.18-Mb N. meningitidis genome [6]. The complete determination of the target sequence from the set of contigs requires a significant amount of work, which is called finishing.

6.4 Assembly Validation

Repeats in the genome can easily lead to misassembly of contigs. Thus it is very important to validate contig assembly before scaffolding contigs or finishing gaps between contigs. The most accurate method to detect such misassembled contigs is to perform wet-lab experiments. However, this is time consuming and requires carefully designed experiments.

The Assembly Validation Problem

Input: A set of contigs with fragment alignment information; mate-pair information (some methods, e.g., the information theoretic probabilistic approach in this section, do not require this information).
Output: For each base position, a prediction of whether the position is assembled correctly.
Rouchka and States [18] proposed a computational technique to design wet-lab experiments for contig assembly validation, including high clone coverage maps, multiple complete digest mapping, optical restriction mapping, and ordered shotgun sequencing. Recently, several computational techniques that do not require wet-lab experiments have been developed. These techniques can be implemented as separate computational tools [19-22] or embedded in assemblers. The assembly validation techniques used in sequence assemblers are reviewed in Sections 7.3, 7.4, and 7.5. Another interesting approach is to compare sequence assemblies from two or more fragment assembly programs to detect misassembled regions and to obtain a higher-quality assembly (e.g., [23]).
TAMPA

Dew et al. [19] developed a sequence-assembly validation method that utilizes mate-pair data to evaluate and compare assemblies. The basic assumption is that the lengths of mate pairs from a clone library follow a Gaussian distribution with mean µ and standard deviation σ, which can be observed in plots of clone mate lengths in the final curated assembly. Thus a mate pair is unsatisfied if the distance between its two reads falls outside the range µ ± 3σ. TAMPA is a computational geometry-based approach to detecting assembly breakpoints by exploiting the constraints that mate pairs impose on each other. Unsatisfied mate pairs are attributed to four assembly problems: insertion of incorrect sequence between a mate pair, deletion of sequence between a mate pair, inversion between two or more mate pairs, and transposition of mate pairs. The effects of the four assembly problems on mate pairs are stretched (insertion or transposition), compressed (deletion or transposition), and (anti)normal (inversion).

Compression/Expansion Statistics

Zimin and Yorke [22] developed compression/expansion (CE) statistics to identify misassembled regions (i.e., assembly regions that are either compressed or expanded due to repeats). The basic idea is again to assume that insert lengths between two mate fragments follow a Gaussian distribution with a mean and a variance. Given a contig, a global mean and a global variance of insert lengths are estimated. Then a sample mean and a sample standard deviation for a given library at a given base position in the contig are computed as follows. The sample mean is the average length of the inserts that cover the given position. The sample standard deviation is estimated as

$$\text{sample standard deviation} = \frac{\text{global standard deviation}}{\sqrt{N}}$$

where N is the number of inserts that cover the given base position.
Using the sample and global means and the sample standard deviation, the CE statistic is computed as

$$C = \frac{\text{sample mean} - \text{global mean}}{\text{sample standard deviation}}$$

The CE statistic is negative at a collapsed (compressed) region and positive at an expanded region. The thresholds for collapsed and expanded regions are empirically determined as -4 and 4.7, respectively. Using the CE statistic, Zimin et al. developed a method to compare and correct (reconcile) misassembled regions using two different assemblies.
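A minimal sketch of the CE computation at a single base position follows. It assumes the sample standard deviation is the global standard deviation divided by the square root of N, and that the collapsed threshold is negative (-4), since the statistic is negative at collapsed regions. The function names and the example insert lengths are hypothetical.

```python
import math
from typing import List


def ce_statistic(insert_lengths: List[float], global_mean: float,
                 global_sd: float) -> float:
    """CE statistic for the inserts covering one base position."""
    n = len(insert_lengths)
    if n == 0:
        raise ValueError("at least one insert must cover the position")
    sample_mean = sum(insert_lengths) / n
    # Standard error of the mean under the Gaussian assumption.
    sample_sd = global_sd / math.sqrt(n)
    return (sample_mean - global_mean) / sample_sd


def classify(ce: float, collapsed: float = -4.0, expanded: float = 4.7) -> str:
    """Label a position using the empirical thresholds quoted in the text
    (the sign of the collapsed threshold is an assumption)."""
    if ce <= collapsed:
        return "collapsed"
    if ce >= expanded:
        return "expanded"
    return "ok"


if __name__ == "__main__":
    # Hypothetical library with global mean 2,000 bp and standard deviation 200 bp.
    covering = [1500.0] * 16
    print(classify(ce_statistic(covering, 2000.0, 200.0)))
```

Because the denominator shrinks with the square root of N, a modest but consistent shift in insert length becomes significant once enough inserts cover the position.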
Clone Coverage Analysis

Sequence assembly validation based on clone coverage can be used to detect large-scale misassemblies, especially collapsed repeats [20, 24]. This approach works in three steps:
1. Contigs are oriented and ordered.
2. The estimated lengths for all library clone types are computed, and clones are classified into two classes: bad clones, whose lengths deviate substantially from the expected clone length, and good clones, whose lengths are within an acceptable range of the expected clone length.
3. A good-minus-bad clone coverage plot is computed for each contig by subtracting the number of bad clones from the number of good clones.
The basic idea is simple: any region where more bad clones are aligned than good clones is likely to be misassembled.

An Information Theoretic Probabilistic Approach

This approach [21] identifies misassembled regions using entropy plots that are computed using statistics on the number of patterns per fragment. To compute the entropy of fragments, we need to construct a probability model that measures how much each aligned fragment contributes to misassembly. The probability function for a fragment f_i is built using the fragment distribution, a measure used for repeat handling in a sequence assembler called AMASS [3]. From the probability model, we compute the entropy at base position p in a contig as

$$\mathrm{entropy}(p) = -\sum_{p-\delta \,\le\, \mathrm{pos}(f_i) \,\le\, p+\delta} \mathrm{prob}(f_i)\,\log\bigl(\mathrm{prob}(f_i)\bigr)$$

where pos(f_i) denotes the left-end position of f_i in the contig and δ is a user-input parameter (by default, it is the same as the window size used for the fragment distribution calculation). Figure 6.3 shows how the entropy plot can detect misassembled regions.

6.5 Scaffold Generation

Some sequence assembly packages include a scaffold generation module that generates scaffolds of assembled contigs [5, 6, 9, 25-28]. There are also separate packages such as GigAssembler [26] and Bambus [28], which are surveyed in this section.
[Figure 6.3: (a) The fragment coverage plot ("C8 fragment coverage"), (b) the clone coverage plot ("Contig8.good bad"), and (c) the entropy plot ("C8 entropy"), each plotted against base position, for contig 8 generated by Phrap (version 2001). There is a misassembled region from 89,415 to 90,332 where the fragment coverage is not distinctly high, but the valleys in the clone coverage plot and the peaks in the entropy plot are distinct, effectively identifying the misassembled region.]
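The good-minus-bad clone coverage used in clone coverage analysis (the kind of plot shown in panel (b) of Figure 6.3) can be sketched as follows. This is an illustration under stated assumptions, not the method of [20, 24]: clones are given as hypothetical (start, end) intervals on a contig with an exclusive end, and a clone is called good when its length is within a user-chosen tolerance of the expected clone length.

```python
from typing import List, Tuple

Clone = Tuple[int, int]  # (start, end) on the contig, end exclusive


def good_minus_bad_coverage(clones: List[Clone], contig_length: int,
                            expected_length: int, tolerance: int) -> List[int]:
    """Per-base count of good clones minus bad clones."""
    # Difference array: +1/-1 at the clone start, undone at the clone end.
    diff = [0] * (contig_length + 1)
    for start, end in clones:
        is_good = abs((end - start) - expected_length) <= tolerance
        sign = 1 if is_good else -1
        diff[max(start, 0)] += sign
        diff[min(end, contig_length)] -= sign
    coverage = []
    running = 0
    for pos in range(contig_length):
        running += diff[pos]
        coverage.append(running)
    return coverage


def suspicious_positions(coverage: List[int]) -> List[int]:
    """Positions where more bad clones than good clones are aligned."""
    return [pos for pos, value in enumerate(coverage) if value < 0]


if __name__ == "__main__":
    # Toy contig of 10 bases; expected clone length 4, tolerance 1.
    clones = [(0, 4), (1, 5), (3, 10), (6, 10)]
    cov = good_minus_bad_coverage(clones, 10, 4, 1)
    print(cov)
    print(suspicious_positions(cov))
```

Valleys below zero in the resulting profile correspond to regions that are candidates for misassembly.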
The Scaffold Generation Problem

Input: A set of contigs; mate-pair information; physical/genetic map (optional); expressed sequence tags (ESTs) (optional).
Output: A set of linearly ordered contigs with optional gap-size information between adjacent contigs.

In a sense, scaffolding generates a linear arrangement of contigs by orienting and ordering them. Issues related to scaffolding are:
1. Mate-pair information can be erroneous. Some mate pairs come from chimeric clones. More seriously, mate-pair information from fragments aligned at wrong places can easily confuse contig orientation and ordering.
2. There are typically several types of clone libraries that differ in length, say, 2 kb, 10 kb, 40 kb, 100 kb, and so on. In general, mate-pair information from shorter clones is more accurate than that from longer clones. How to utilize mate-pair information of different quality is not trivial. Bambus is one of the hierarchical scaffolding methods that utilize mate-pair information from clones of different lengths in a hierarchical fashion.
3. There are external information sources that can be utilized for scaffolding contigs, such as physical/genetic maps, alignment information obtained by aligning contigs to already finished genomes, and conservation of gene synteny.

Bambus

The main design philosophy of Bambus is to provide a stand-alone scaffolding package so that it can be coupled with other fragment assembly packages and users can easily control parameters for scaffolding contigs. Note that recent assemblers, such as the Celera Whole Genome Assembler and Arachne, embed a scaffolding module, but these scaffolding modules are tightly coupled with the specific assemblers. In this section, we explain the steps of Bambus, discussing how Bambus deals with the main issues of scaffolding contigs.

Edge Bundling to Handle Errors in Mate Pairs

Mate-pair information is very important in orienting and ordering contigs.
However, some mate pairs are incorrect due to misassembly of contigs or fragments from chimeric clones. Intuitively, the more mate pairs between two contigs that are consistent in terms of orienting and ordering the two contigs, the higher the confidence that those mate pairs are correct. This problem can be formally defined and solved by finding the largest clique (i.e., a fully
connected subgraph) in the interval graph induced by the inter-contig gap ranges for the links. The cluster with the most links is chosen, and all links in other clusters are given invalid orientation tags. Bambus allows the user to specify different redundancies to be used for contig links, depending on the confidence in the data. For example, links from shorter clones, say, of 2 kb, require fewer links in a bundle, while links from longer clones, say, of 100 kb, require more links in a bundle before they are used for contig orientation. The output of this step is a set of contig edges between contig pairs. The remaining task is to orient and order contigs using these contig edges.

Contig Orientation

The contig orientation problem is to find a consistent orientation for all contigs. This is sometimes challenging. Consider three contigs, A, B, and C. The orientation of a contig, say B, with respect to another contig, say A, can be either A B, A rc(B), rc(A) B, or rc(A) rc(B), where rc(A) and rc(B) represent the reverse complements of A and B, respectively. Suppose that contig edges impose the orientations A rc(B) and A rc(C). In addition, suppose that a contig edge imposes the orientation B rc(C). Then the three-contig orientation is not consistent. Note that there are clones of different lengths, and thus bundled edges are also of different lengths. If there are errors in the bundled edges, the situation described above can happen. Based on the principle of parsimony, we can formulate the contig orientation problem as computing the minimum number of contig edges that must be removed to obtain a consistent orientation for all contigs. Unfortunately, this problem is NP-hard [4]. Thus a greedy heuristic algorithm is used in practice.
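One possible greedy heuristic for this orientation problem is sketched below. It is not claimed to be the heuristic Bambus uses: it simply processes contig edges from the heaviest to the lightest and discards any edge that contradicts the orientations already implied, using a union-find structure that tracks relative orientation (parity). The edge format and the use of bundle size as the weight are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# (contig_u, contig_v, same_orientation, weight); the weight is assumed to be
# the number of mate-pair links bundled into the edge.
Edge = Tuple[str, str, bool, int]


class ParityUnionFind:
    """Union-find that also stores each node's orientation relative to its root."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}
        self.parity: Dict[str, int] = {}  # 0 = same as parent, 1 = reversed

    def find(self, x: str) -> Tuple[str, int]:
        self.parent.setdefault(x, x)
        self.parity.setdefault(x, 0)
        acc = 0
        while self.parent[x] != x:
            acc ^= self.parity[x]
            x = self.parent[x]
        return x, acc

    def union(self, u: str, v: str, relation: int) -> bool:
        """Impose orientation(u) XOR orientation(v) == relation.
        Returns False if this contradicts earlier constraints."""
        root_u, par_u = self.find(u)
        root_v, par_v = self.find(v)
        if root_u == root_v:
            return (par_u ^ par_v) == relation
        self.parent[root_v] = root_u
        self.parity[root_v] = par_u ^ par_v ^ relation
        return True


def greedy_orient(edges: List[Edge]) -> Tuple[Dict[str, str], List[Edge]]:
    """Return an orientation per contig and the list of rejected edges."""
    uf = ParityUnionFind()
    rejected: List[Edge] = []
    for edge in sorted(edges, key=lambda e: -e[3]):
        u, v, same, _ = edge
        if not uf.union(u, v, 0 if same else 1):
            rejected.append(edge)
    orientation = {c: ("rc" if uf.find(c)[1] else "forward") for c in uf.parent}
    return orientation, rejected


if __name__ == "__main__":
    # The inconsistent three-contig example from the text.
    edges = [("A", "B", False, 5), ("A", "C", False, 4), ("B", "C", False, 1)]
    print(greedy_orient(edges))
```

On the three-contig example, the two heavier edges fix B and C as reverse complements relative to A, so the lighter edge that asks B and C to have opposite orientations is the one removed.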
In general, the greedy contig orientation algorithm works well, since the edge-bundling step generates contig edges of high accuracy.

Contig Ordering

The contig-ordering problem is to embed contigs on a line while preserving the gap lengths suggested by bundled edges. This can be formulated as a problem of topological ordering of contigs subject to length constraints between contigs. An optimization formulation would be to find a topological ordering with a minimum number of edges removed. This problem is also known to be NP-hard. Bambus uses an expand-contract greedy heuristic for contig ordering. The first step, expand, anchors the first unplaced contig with edges at their maximum allowable length and then traverses the contig graph in a breadth-first manner to fill in the range. As this is a greedy placement of contigs, any contig with inconsistent ordering is not placed. After the expand step, contigs are brought back and placed as close as possible to the midpoint of the range defined by the length constraints of the edges. This contraction step allows placement of as many contigs as possible. The resulting ordering may not be
consistent, meaning that two contigs may occupy the same space. This ambiguous placement can be helpful for the final genome-finishing step.

Hierarchical Scaffolding

Contig edges from smaller insert libraries have significantly fewer errors than those from larger insert libraries. Thus how to utilize edges of different accuracies is not trivial. Bambus generates scaffolds of contigs in a hierarchical fashion, starting with contig edges from the smallest libraries, say, of 2 kb, and then adding lower-quality edges from larger insert libraries. The quality of contig edges is evaluated not only by the library length but also by the number of edges connecting two contigs, since confirmation from two independent edges is an indication of higher quality in connecting contigs.

Untangling

Given a scaffold of contigs, there can be contigs that are involved in multiple paths of contigs. In this case, it may be desirable to untangle those contigs to convert an ambiguous scaffold into a single linear stretch. The Bambus untangler resolves an ambiguous scaffold by iteratively finding the longest non-self-overlapping path in a greedy fashion. If a contig is involved in multiple potential paths, it may be desirable to break the contig into multiple pieces and then test whether single linear nonoverlapping stretches of contigs can be generated. The Bambus developers plan to incorporate this in a future release.

GigAssembler

GigAssembler [26] was used for the human genome assembly in the public Human Genome Project [13]. It generated scaffolds using contigs, map, mRNA, EST, and BAC end data. The overview of the scaffold-generation process is as follows.
1. Decontaminating and repeat masking the sequence. RepeatMasker [29] is used to mask known repeats and contaminants from bacteria, vectors, and others.
2. Aligning mRNA, EST, BAC end, and paired-plasmid reads against initial sequence contigs.
3. Creating an input directory structure using map and other data. For the human genome, Washington University's map data were used. A directory is created for each chromosome and a subdirectory for each fingerprint clone contig.
4. For each fingerprint clone contig, aligning the initial sequence contigs within that contig against each other.
5. Using the GigAssembler program within each fingerprint clone contig to merge overlapping initial sequence contigs, and to order and orient the resulting sequence contigs into scaffolds.
6. Combining the contig assemblies into full chromosome assemblies.

Preprocessing: Alignment of mRNA, ESTs, BAC Ends, and Paired Reads

Contigs are oriented and ordered by aligning mRNA, ESTs, BAC ends, and paired reads to contigs using a program called psLayout. It reports all matches above a certain minimal quality between query sequences and database sequences. To compute alignments, it collects candidate matching regions by using 10-mer indices in a set of overlapping 500-base regions of the query sequence. Then the candidate matching regions are aligned, tolerating intron regions in the case of mRNA and EST alignments. The resulting aligned regions are combined using a dynamic programming algorithm. To reduce the effects of repeats, two techniques are used. The first technique is to use repeat sequences (repeats are not masked although they are detected by RepeatMasker). The second technique is to keep only "near best" matches, which means discarding matches (even good ones) if they are below the best match.

Assembly and Ordering of Contigs

GigAssembler operates in many steps. The task is to generate consensus sequences determined by contigs and their ordering. Contigs are ordered and merged gradually into larger ones by building rafts, barges, raft-ordering graphs, and bridge graphs. Next we describe the algorithm in more detail.
1. Build merged sequence contigs, called rafts, from overlapping initial sequence contigs. A score is assigned to each aligning pair, and then the alignments are processed from the best scoring to the worst.
2. Build sequenced clone contigs, called barges, from overlapping clones.
Barges are constructed in a greedy fashion, where the clone overlap is the sum of all initial sequence contig overlaps. Each clone is assigned a coordinate in the resulting barge.
3. Once the orientation and order of clones are determined while constructing barges, rafts (merged sequence contigs) can be ordered using a raft-ordering graph. This is a directed graph with two types of nodes, rafts and sequenced clone endpoints. To understand what is happening at this stage, see Figure 6.4.
4. Rafts are bridged with mRNAs, ESTs, paired-plasmid reads, BAC end pairs, and ordering information from the sequencing centers. The resulting graph is called a bridge graph. Bridges are added one at a time to the ordering graph, starting with the best scoring. The score
15 Overview of Genome Assembly Techniques 93 AAAA AAAAAAAAAA a1a1a1a1 a2a2a2a2 BBBBBBBBBBBBBBBBBBBBBBB b1b1b1b1b1b1b1 b2b2b2b2 CCCCCCCCCCCCCCCCCCC c1c1c1c1c1c1 c2c2c2 As Bs Ae Cs Be Ce As Bs Ae Cs Be Ce Figure 6.4 How to build a raft-ordering graph. Six initial contigs (a1, a2, b1, b2, c1, c2) are aligned to three clones (A, B, C) (top figure), an ordering graph of clone starts and ends is given (middle figure), and the final raft-ordering graph after adding in rafts to the ordering graph (bottom figure). Form the top figure, we can construct three rafts, a1, b1, a2, b2, c1, and c2, based on their overlaps. Then an ordering graph of clone starts and ends can be constructed based on the positions of clone start and end positions as in the middle figure. The node names As and Ae denotes the start and the end of a clone A, respectively. Finally, the three rafts are added to the ordering graph as in the bottom figure. function for bridges is based on the type of information. mrna information is given the highest weight, then paired-plasmid reads, information provided by the sequencing centers, ESTs, and BAC end matches, in that order. 5. Walk the bridge graph to get an ordering of rafts. Each bridge is walked in the order of the default coordinates assigned with a constraint that if a raft has predecessors, all the predecessors must be walked before the raft is walked. 6. A sequence path through each raft is built in a greedy fashion, starting with the longest, most finished initial sequence contig that passes though each section of the raft.
7. Build the final sequence for the fingerprint clone contig by inserting the appropriate number of Ns between raft sequence paths.

6.6 Finishing

Finishing is labor intensive and constitutes a major bottleneck in any genome-sequencing project. The input to the finishing stage is a set of oriented and ordered contigs. However, as discussed in previous sections, it is still challenging to verify the correctness of contigs and to generate scaffolds of contigs. Because of these difficulties, there are two different views on how to pursue genome-level sequencing: one in favor of the whole-genome shotgun strategy [30] and another in favor of a hierarchical strategy involving only smaller-scale shotgun sequencing [31] (see Section 6.7 for more discussion). In summary, more effort is needed to develop frameworks for genome sequencing as well as component tools such as scalable, reliable sequence assemblers and contig assembly-validation methods. As one of the initial efforts, the AMOS project aims at developing open-source whole-genome assembly software for the genome-sequencing community [32].

6.7 Three Strategies for Whole-Genome Sequencing

The first whole bacterial genome, H. influenzae, was sequenced at TIGR in 1995 using the whole-genome shotgun strategy [33]. Since then, the whole-genome shotgun strategy has been used successfully for many genomes, including human [12]. In general, the longer the DNA region, the more repeats it contains, which is a major hurdle for genome sequencing. There are three different strategies for employing the shotgun strategy in whole-genome sequencing.

1. The whole-genome shotgun strategy applies the shotgun-sequencing strategy at the whole-genome level. The advantage of this approach is that it is the most cost-effective, since shotgun data can be prepared in a single step at the whole-genome level. The human genome assembly by Celera was achieved using this strategy [12].

2. The hierarchical approach uses libraries of different insert sizes. Libraries of larger insert size are further split into libraries of smaller insert size while keeping track of which library each one descends from. This often requires a high-resolution genetic map and a low-resolution physical map prior to the whole-genome assembly. The shotgun strategy is applied when a library becomes small enough that the current assembly algorithm can determine the target sequence with confidence. Since the library hierarchy information can be easily used to produce the target sequence, this approach may produce more accurate genome sequences. The major drawback of this approach is the time and cost of genome sequencing. The human genome consortium used this approach [13].

3. The hybrid approach, called pooled genomic indexing, a technique pioneered at the Baylor College of Medicine, employs both the hierarchical and the whole-genome shotgun approaches, but without physical and genetic mapping information. This approach combines two different types of shotgun reads, one from the whole-genome shotgun and the other from the shotgun sequencing of individual BACs. BACs are generated using large-insert BAC clones, and a minimum tiling path of BACs that covers the whole genome is computed. Then the shotgun strategy is applied at low coverage to a set of selected BACs from the tiling path. In parallel, the whole-genome shotgun strategy is applied to generate another set of shotgun data at the whole-genome level. These two shotgun data sets are combined to determine the whole-genome sequence. The brown Norway rat genome was assembled with this strategy [34].

6.8 Discussion

In this section, we briefly summarize techniques for genome sequencing. For more information, readers may refer to several review articles on genome sequencing [35-37]. Although genome sequencing still remains an open problem, recent advances in computational techniques make it possible to sequence very large eukaryotic genomes such as Drosophila melanogaster [27] and the human genome [12, 13]. We discuss some of the recent trends in genome-sequencing strategies next.

1. To achieve very large-scale genome sequencing, it is necessary to have shotgun data of very high quality (see [27]). Otherwise, it is not possible to distinguish repeats from errors in the shotgun data. Notably, recent assemblers attempt to correct errors in the input shotgun data before sequence assembly [5, 6]. Without knowing the target sequence, there is no guarantee that errors are corrected properly. Indeed, EULER, a genome assembly package, calls this procedure data corruption rather than error correction. Nonetheless, this is a promising technique that works for large-scale sequence assembly.
2. Repeat boundaries are identified before sequence assembly, and contigs are then assembled up to the boundaries [5, 27]. As with error correction, there is no guarantee that repeat boundaries are identified correctly without knowing the target sequence, but this is another promising technique.

3. Computational techniques to ensure the correctness of contig assembly are becoming more important. Correctness can be checked using the characteristics of the shotgun data (i.e., random sampling; see footnote 2). There are two ways to utilize these characteristics: on the fragment level [2, 5, 9, 27] and through an interesting approach based on pattern statistics [21].

4. Mate-pair information and base-call quality values have become essential data for genome sequencing.

A Thought on an Exploratory Genome Sequencing Framework

As hinted in Section 6.5.2, contigs that are involved in multiple paths of contigs may be broken into smaller pieces, which can then be tested to see whether linear paths can be generated. To realize this idea, methods for assembly validation that detect misassembled regions in contigs are much needed (see Section 6.4). We tested this idea on several bacterial genomes, including Agrobacterium [38]. A schematic overview of a genome-sequencing framework [39] developed at DuPont is depicted in Figure 6.5. This approach can be viewed as a hypothesis generation and validation paradigm in search of a set of correctly assembled contigs and their ordering. All decisions made at user interaction points are hypotheses that are subsequently tested with larger clones in the next step. This approach was successful in assembling several genome sequences. For example, 502 contigs in the Phrap assembly of the Agrobacterium shotgun data were grouped and ordered into only 15 sets of contigs (the largest set longer than 2 Mb) using a Web interface in a single iteration of our genome-sequencing framework.
Note that there are four replicons in the Agrobacterium genome. As more accurate assembly validation methods are developed, this technique might be useful for automating the sequencing of microbial genomes by embedding sequence assembly modules into an assembly package such as Minimus [40].

2. There are regions where sampling is biased due to biological reasons. However, random sampling can be assumed for the shotgun data as a whole.
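The grouping of contigs by clone linkage information described above can be illustrated with a small sketch. This is not the DuPont framework itself, whose internals the text does not give; it is a minimal example that assumes each clone link is reported as a pair of contig names (one per clone whose two end reads fall in different contigs) and that two contigs belong to the same group when at least `min_support` clones connect them. The function name `group_contigs`, the parameter `min_support`, and the input layout are all hypothetical.

```python
from collections import Counter


def group_contigs(contigs, clone_links, min_support=2):
    """Group contigs joined by at least `min_support` clone links.

    contigs     -- iterable of contig names
    clone_links -- iterable of (contig_a, contig_b) pairs, one per clone
                   whose end reads land in two different contigs
    Returns a list of groups (sorted lists), largest group first.
    All names and the threshold are illustrative assumptions.
    """
    clone_links = list(clone_links)

    # Union-find forest: every contig starts as its own group.
    parent = {c: c for c in contigs}
    for a, b in clone_links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    def find(c):
        # Walk to the root, halving the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count how many clones support each unordered contig pair.
    support = Counter(tuple(sorted((a, b))) for a, b in clone_links if a != b)

    # Merge only pairs with enough independent clone support.
    for (a, b), count in support.items():
        if count >= min_support:
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), []).append(c)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


contigs = ["c1", "c2", "c3", "c4", "c5"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c2"), ("c4", "c5")]
print(group_contigs(contigs, links))
# [['c1', 'c2', 'c3'], ['c4'], ['c5']]
```

Requiring more than one supporting clone treats each merge as a hypothesis that needs some evidence before it is accepted; a single link, such as the one between c4 and c5 here, is left for a later iteration with larger clones. Ordering the contigs within each group is a separate step that this sketch does not attempt.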
[Figure 6.5 flowchart labels, in order: a set of DNA fragments; sequence assembler; a set of contigs; assembly validation; split contigs; clone linkage information; ordering contigs; a set of groups of contigs; large clone linkage information; ordering groups; a set of groups of groups.]

Figure 6.5 A framework for genome sequencing. This framework searches for a set of correctly assembled contigs and their ordering in an iterative fashion.

Acknowledgments

Sun Kim was supported in part by a Career DBI from the National Science Foundation (United States) and a grant from the Korea Institute of Science and Technology Information. We thank the anonymous reviewer for his or her valuable comments.

References

[1] Green, P.,
[2] Sutton, G., et al., TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects, Genome Science and Technology, Vol. 1, 1995, pp
[3] Kim, S., and A. M. Segre, AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly, Journal of Computational Biology, Vol. 6, No. 4,
[4] Kececioglu, J. D., and E. W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica, Vol. 13, 1995.
[5] Batzoglou, S., et al., Arachne: A Whole-Genome Shotgun Assembler, Genome Research, Vol. 12, No. 1, 2002, pp
[6] Pevzner, P. A., et al., An Eulerian Path Approach to DNA Fragment Assembly, PNAS, Vol. 98, 2001, pp
[7] Huang, X., A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics, Vol. 14,
[8] Huang, X., An Improved Sequence Assembly Program, Genomics, Vol. 33,
[9] Huang, X., and A. Madan, CAP3: A DNA Sequence Assembly Program, Genome Research, Vol. 9, No. 9, 1999, pp
[10] Huang, X., et al., PCAP: A Whole-Genome Assembly Program, Genome Res., Vol. 13, 2003, pp
[11] Peltola, H., et al., SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model, Nucleic Acids Res., Vol. 12, No. 1, Pt. 1, 1984, pp
[12] Venter, J. C., et al., The Sequence of the Human Genome, Science, Vol. 291, 2001, pp
[13] Lander, E. S., et al., Initial Sequencing and Analysis of the Human Genome, Nature, Vol. 409, 2001, pp
[14] Roach, J. C., Pairwise End Sequencing: A Unified Approach to Genome Mapping and Sequencing, Genomics, Vol. 26, 1995, pp
[15] Ewing, B., et al., Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment, Genome Research, Vol. 8, 1998, pp
[16] Chou, H. H., and M. H. Holmes, DNA Sequence Quality Trimming and Vector Removal, Bioinformatics, Vol. 17, No. 12, 2001, pp
[17] Idury, R., and M. S. Waterman, A New Algorithm for DNA Sequence Assembly, Journal of Computational Biology, Vol. 2, No. 2, 1995, pp
[18] Rouchka, E. C., and D. J. States, Sequence Assembly Validation by Multiple Restriction Digest Fragment Coverage Analysis, Proc. of Intelligent Systems for Molecular Biology (ISMB), 1998, pp
[19] Dew, I. M., et al., A Tool for Analyzing Mate Pairs in Assemblies (TAMPA), J. Comput. Biol., Vol. 12, No. 5, 2005, pp
[20] Kim, S., et al., Enumerating Repetitive Sequences from Pairwise Sequence Matches, manuscript, DuPont Central Research,
[21] Kim, S., et al., A Probabilistic Approach to Sequence Assembly Validation, ACM SIGKDD Workshop on Data Mining in Bioinformatics (BioKDD2001), 2001, pp
[22] Zimin, R., and J. A. Yorke, Assembly Reconciliation Method, umd.edu/reconciliation.htm,
[23] Shatkay, H., et al., ThurGood: Evaluating Assembly-to-Assembly Mapping, Journal of Computational Biology, Vol. 11, No. 5, 2004, pp
[24] Kim, S., et al., A Computational Approach to Sequence Assembly Validation, manuscript, DuPont Central Research,
[25] She, X., et al., Shotgun Sequence Assembly and Recent Segmental Duplications Within the Human Genome, Nature, Vol. 431, 2004, pp
[26] Kent, W. J., and D. Haussler, Assembly of the Working Draft of the Human Genome with GigAssembler, Genome Res., Vol. 11, 2001, pp
[27] Myers, G., et al., A Whole-Genome Assembly of Drosophila, Science, Vol. 287, 2000, pp
[28] Pop, M., et al., Hierarchical Scaffolding with Bambus, Genome Res., Vol. 14, No. 1, 2004, pp
[29] Smit, A., Repeat Masker,
[30] Weber, J. L., and E. Myers, Human Whole-Genome Shotgun Sequencing, Genome Research, Vol. 7, 1997, pp
[31] Olson, M., and P. Green, A Quality-First Credo for the Human Genome Project, Genome Research, Vol. 8, No. 5, 1998, pp
[32] AMOS: A Modular Open-Source Assembler,
[33] Fleischmann, R. D., et al., Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd., Science, Vol. 269, No. 5223, 1995, pp
[34] Gibbs, R. A., et al., (Rat Genome Sequencing Project Consortium), Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution, Nature, Vol. 428, No. 6982, 2004, pp
[35] Pop, M., et al., Genome Sequence Assembly: Algorithms and Issues, IEEE Computer, Vol. 35, No. 7, 2002, pp
[36] Pop, M., et al., Shotgun Sequence Assembly, Advances in Computers, Vol. 60, June
[37] Batzoglou, S., Algorithmic Challenges in Mammalian Genome Sequence Assembly: Special Review, in D. M. Jordel, P. Little, and S. Subramaniam, (eds.), Encyclopedia of Genomics, Proteomics, and Bioinformatics, New York: John Wiley & Sons,
[38] Wood, D. W., et al., The Genome of Agrobacterium Tumefaciens C58: Insights into the Evolution and Biology of a Natural Genetic Engineer, Science, 2001, pp
[39] Kim, S., The AMASS Genome Sequencing Package, Advances in Genome Biology and Technology Conference, February
[40] Sommer, D. D., et al., Minimus: A Fast, Lightweight Genome Assembler, BMC Bioinformatics, Vol. 8, February 26, 2007, p. 64.
Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006
Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF
Tutorial for Proteomics Data Submission Katalin F. Medzihradszky Robert J. Chalkley UCSF Why Have Guidelines? Large-scale proteomics studies create huge amounts of data. It is impossible/impractical to
