Comparison of DNA Sequences with Protein Sequences
|
|
|
- Millicent Hodge
- 9 years ago
- Views:
Transcription
1 GENOMICS 46, (1997) ARTICLE NO. GE Comparison of DNA Sequences with Protein Sequences William R. Pearson,*,1 Todd Wood,* Zheng Zhang, and Webb Miller *Department of Biochemistry, University of Virginia, Charlottesville, Virginia 22908; and Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania Received April 1, 1997; accepted August 25, ), the traditional divisions grew at a rate of 40% The FASTA package of sequence comparison pro- per year, from 163 million to 317 million bases, but the grams has been expanded to include FASTX and entire GenBank grew from 173 million to 463 million FASTY, which compare a DNA sequence to a protein bases, because of the growth of the EST database. At sequence database, translating the DNA sequence in the end of 1996, about 40% of the bases in GenBank three frames and aligning the translated DNA se- were determined by high-throughput EST or genomic quence to each sequence in the protein database, sequencing. allowing gaps and frameshifts. Also new are TFASTX The DNA sequences produced by single-pass EST and TFASTY, which compare a protein sequence to a sequencing and high-throughput sequencing may be DNA sequence database, translating each sequence in of lower quality than traditional finished GenBank the DNA database in six frames and scoring alignsequences, which are typically based on multiple sements with gaps and frameshifts. FASTX and TFASTX allow only frameshifts between codons, while FASTY quence reads from both strands of the DNA template. and TFASTY allow substitutions or frameshifts within As a result, EST sequences are more likely to contain a codon. We examined the performance of FASTX and errors that produce frameshifts when translated into FASTY using different gap-opening, gap-extension, protein. Frameshift errors can be especially troubleframeshift, and nucleotide substitution penalties. In some in searches with single-pass EST sequences, begeneral, FASTX and FASTY perform equivalently cause these sequences are very likely to contain pro- when query sequences contain 0 10% errors. We also tein-coding regions, which are much more effectively evaluated the statistical estimates reported by FASTX identified by protein, rather than DNA, sequence com- and FASTY. These estimates are quite accurate, except parison. when an out-of-frame translation produces a low-complexity In an earlier paper (Zhang et al., 1997), we reported protein sequence. We used FASTX to scan the the development of rapid algorithms for comparing a Mycoplasma genitalium, Haemophilus influenzae, and translated DNA sequence to a protein sequence within Methanococcus jannaschii genomes for unidentified or a band, for incorporation into the optimization stage of misidentified protein-coding genes. We found at least the FASTA program (Pearson, 1990), and of a full 9 new protein-coding genes in the three genomes and Smith Waterman translated-dna protein alignment. at least 35 genes with potentially incorrect boundaries. In this paper, we consider two methods for aligning a 1997 Academic Press translated-dna sequence to a protein sequence and evaluate how well the two approaches identify distantly INTRODUCTION related sequences in the presence of DNA se- quence errors. We also examine the accuracy of statistical estimates produced for translated-dna protein se- Advances in automated sequencing technologies quence alignment. The estimates can be quite accurate, have dramatically increased the rate of DNA sequence but sometimes high scores are produced between unreproduction and inclusion in the GenBank and EMBL lated sequences because of simple-sequence, low-com- DNA sequence databases. Indeed, although the tradiplexity regions that are produced by translating the tional GenBank divisions have been growing at a relaincorrect reading frames. This problem can be avoided tively constant exponential rate, there have been draby searching databases from which simple-sequence rematic increases in the amount of expressed sequence gions have been removed with seg (Wootton, 1994). tag (EST) data over the past 2 years. For example, between Releases 81 (February 1994) and 99 (February MATERIALS AND METHODS 1 To whom correspondence should be addressed at the Department of Biochemistry, Jordan Hall Box 440, University of Virginia, Char- Sequence databases. Sequence similarity searches were performed lottesville, VA Telephone: (804) Fax: (804) 924- using a slightly modified version (PIR39b) of the annotated PIR [email protected]. database described earlier (Pearson, 1995). The database was modified /97 $25.00 Copyright 1997 by Academic Press All rights of reproduction in any form reserved. 24
2 DNA PROTEIN COMPARISON 25 to join some clearly related superfamilies (W.R.P, manuscript in preparation; the PIR39b database is available for downloading from ftp. virginia.edu:/pub/fasta). Two sequences from each of 46 families of proteins were used for these tests. The cdna sequences, and their corresponding open reading frames, encoding these 92 sequences were identified in GenBank and used to evaluate DNA protein sequence comparison with FASTX and FASTY. The annotations in the PIR39b database allow us to identify all of the sequences in the database that are related to members of the 46 query sequence families and, conversely, to identify the highest scoring unrelated sequences. The cdna sequences that encode the 92 query protein sequences, and their corresponding open reading frames (ORFs), were mutated to simulate the errors encountered in high-throughput EST sequencing. Based on the past experience of the Washington University St. Louis EST sequencing project (Hillier et al., 1996), the frequency of errors (substitutions, insertions, and deletions) in high-throughput EST sequencing ranges from about 2.5 to 7.5%. Mutations were cre- ated at random locations at approximately 1, 2, 5, or 10% of the positions, with the total number of mutations distributed among deletions, insertions, and substitutions according to the ratio 1:1.54:1.86. This ratio is derived from observed discrepancies between 5 ESTs and mrna sequences (Fig. 3 and Table 6 in Hillier et al., 1996; Table 6 has the labels for insertions and deletions reversed, L. Hillier, pers. comm.). Once these percentages are calculated, a normally distributed random number with mean and variance equal to the number of each type of change is drawn, and that number of substitutions, insertions, and deletions are introduced at random positions. Comparison of search performance. Scoring parameters and comparison algorithms (FASTX, FASTY) were compared using methods described earlier (Pearson, 1995). Briefly, an equivalence number (the number of related sequences missed at a similarity score that balances the number of related sequences at or below the score with the number of unrelated sequences above) is calculated for searches with each of the 92 query sequences for each combination of gap penalties and frameshift penalties or for each algorithm. The performance of one search condition is compared to that of another by comparing the equivalence numbers for the 92 searches and re- cording a / or 0 depending on which search condition performed better (ties are ignored). The sign test is then used to determine if the distribution of / s and 0 s is significantly different from the null hypothesis that differences in performance are the result of random variation. Differences in performance are summarized by a z value ; comparisons with z values ú2 are statistically significant at the 0.05 level. Characterization of bacterial genomes. To evaluate the boundaries of genes annotated for the Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii genomes, ORF sequences and their flanking nucleotides were extracted from the appropriate genome sequence. For the M. jannaschii genome, this was done for ORFs that differed in length by at least 20 amino acid residues when compared to homologues detected by a FASTA search of the translated ORFs (provided by The Institute for Genome Research; TIGR). ORFs and their flanking sequences were extracted from the complete M. jannaschii genome using the program ext. As input, ext requires a file containing a list of ORF names and positions. ext produces files containing original ORFs as well as files containing the ORFs with 250, 500, and 1000 flanking nucleotides at both 5 and 3 ends (for a total of 500, 1000, and 2000 additional nucleotides). These sequences were searched against Swiss-Prot using FASTX, and the results were examined for a combination of increase in alignment length with a decrease in expectation value when flanking nucleotides were included. Sequences that met these criteria were individually examined for the presence of low-complexity matches. A second method of detecting incorrect ORF boundary assignments was also employed for these bacterial genomes. Sequences from regions greater than 60 nucleotides in length between ORFs were ex- tracted and searched individually. This method also allows for the identification of ORFs that were not identified in the initial genomic analysis. These sequences were extracted using a modified version FIG. 1. FASTX and FASTY alignments. An asterisk indicates a frameshift. of ext called exg, which takes genome positions from command-line input. The FASTX search results were screened for matches that scored an expectation value of 0.01 or lower. Those matches were further screened individually both to omit low-complexity matches and to distinguish between previously unidentified ORFs and exten- sions of known ORFs. RESULTS Two Notions of a DNA Protein Alignment We call the DNA sequence given as input the determined DNA sequence, to emphasize that it is deter- mined by some experimental procedure and is subject to uncertainty. Both of our approaches to DNA protein alignment determine, at least implicitly, a hypothe- sized coding region, or HCR, and align the conceptual translation of the HCR and the given amino acid se- quence. The differences between the two approaches are confined to their methods of constructing an HCR. The approach taken by FASTX can be described as follows. A quasicodon of the determined DNA sequence is any three consecutive nucleotides; quasicodons are numbered 2, 3,..., N 01 according to their middle nucleotide, where N is the length of the determined sequence. An allowable list of quasicodons is any list that begins with quasicodon 2 and ends with quasico- don N 01 and such that quasicodon i õ N 01 is always followed by quasicodon i / 2, i / 3, or i / 4. An allow- able list of quasicodons determines the HCR obtained by concatenating the quasicodons one after the other. A FASTX alignment consists of an allowable list of qua- sicodons plus a protein alignment of the translated HCR and the given amino acid sequence. Less formally, a FASTX alignment can be depicted as follows. Translate the determined DNA sequence in each of the three reading frames and place each amino acid below the central nucleotide of its quasicodon, as pictured in Fig. 1A. In essence, a FASTX alignment positions the entries of the original amino acid se- quence underneath these translations, separating each entry by between one and three blanks, with the possi- ble introduction of gaps. A somewhat more general approach to DNA protein alignments is taken by FASTY. A codonification of a nucleotide sequence B Å b 1 b 2... b N is a sequence c 1 c 2...c L, where (i) each c k is either a nucleotide sequence
3 26 PEARSON ET AL. FASTX alignment. This fact alone does not guarantee that FASTY is superior to FASTX in all purposes; FASTY improves alignment scores for unrelated sequences as well as for homologous sequences, so under FIG. 2. Two choices of hypothesized codon for FASTX. A C has certain circumstances it can be less effective at distinbeen inserted between the first and the second positions in the second guishing related sequences from unrelated sequences. of three codons. FASTX is restricted to picking either CTA or GCT for its HCR. FASTY is able to pick the true codon. There are a variety of ways in which one might define an alignment of a DNA sequence and an amino acid sequence, some of which have been explored previously with 2 Éc k É 4 or a single dash character (Éc k É (Gish and States, 1993; Hein, 1994; Hein and Stovdenotes the length of c k ) and (ii) concatenating all the baek, 1994; Knecht, 1995; Guan and Uberbacher, 1996; non-dash c k sequences, in that order, gives B. A FASTY Huang and Zhang, 1996; Peltola et al., 1986). In genalignment between a determined DNA sequence B and eral, there is a tradeoff between (1) the completeness an amino acid sequence A consists of a codonification, of the set of sequencing errors that are directly modc 1 c 2... c L,ofBwith each c k placed above an amino eled by the underlying set of alignments and (2) the acid or a dash character, and where (1) if c k appears execution time required to compute an optimal alignover a dash, then Éc k É Å 3 and (2) removing all dashes ment. The FASTX approach, which models only from the second row gives A. Figure 1B depicts a frameshift errors at codon boundaries, is similar to FASTY alignment. techniques developed by Guan and Uberbacher (1996), In a FASTY alignment, each non-dash c k corresponds although judging by their published description, our to a codon of the HCR according to the following rules. algorithm (Zhang et al., 1997) is more efficient than If c k is deleted (i.e., appears over a dash), then the codon theirs. The basic idea behind FASTY was described by is just c k. Otherwise, letting a denote the amino acid Peltola et al. (1986), though we have added a number aligned to c k, the codon c is chosen to maximize the of improvements, including use of modern protein- BLOSUM50 score of (the conceptual translation of) c alignment scores, modeling of base miscalls, and imand a minus the penalty for converting c k to c. This plementation techniques (Zhang et al., 1997) needed conversion penalty deducts a fixed frameshift penalty to make it competitive in execution time with FASTX. for every nucleotide inserted or deleted and a fixed mis- It is even possible to develop a rigorous alignment algocall penalty for every nucleotide substitution. In partic- rithm that, in some precise sense, models all possible ular, a frameshift penalty is assessed whenever Éc k É sequencing errors (Zhang et al., 1997), but its execution equals either 2 or 4. time is prohibitive. In addition to directly capturing base miscalls (i.e., sequencing errors that cause an incorrect base to be reported), FASTY can correctly determine a wide range Searching with FASTX of frameshift errors caused by the sequencing process. Programs for comparing translated DNA sequences In contrast, FASTX can determine only a more limited to protein sequences, allowing frameshifts, have been class of frameshift errors. With FASTX, insertion of an added to the FASTA package (Pearson, 1996). FASTX erroneous nucleotide must occur between two codons, and FASTY compare a DNA sequence to a protein seand a nucleotide can be skipped (deleted) only if it oc- quence database, translating the DNA sequence in eicurs at both the last position of a codon and the first ther the three forward or the three reverse frames. position of the subsequent codon. For instance, when FASTX allows for frameshifts between codons, while a sequencing error inserts a nucleotide into a codon, FASTY allows for frameshifts within codons as well. FASTX is forced to pick one of the two codons obtained The TFASTX and TFASTY programs compare a proby dropping a base from one end or the other of the four- tein sequence to a DNA sequence database, translating nucleotide sequence (assuming that FASTX correctly each sequence in the database in three forward and analyzes the codons on either side). See Fig. 2. three reverse frames. Computation of an optimal alignment of course pre- FASTX, FASTY, TFASTX, and TFASTY use the supposes that a score is assigned to every possible same four steps that FASTA uses for calculating a simialignment. The score of a FASTX or FASTY alignment larity score: (1) using a lookup table to rapidly find is defined as the score of the alignment between the regions with shared identical pairs of residues (ktup Å translated HCR and the given amino acid sequence 2) or single shared identities (ktup Å 1), (2) rescanning (assessed using BLOSUM-matrix substitution scores, the regions using a BLOSUM50 scoring matrix, (3) plus gap-open and gap-extension penalties) minus pen- joining high-scoring regions that do not overlap, and alties for the difference between the determined DNA (4) calculating an optimal Smith Waterman score in sequence and the HCR. (Only frameshift penalties are a16(ktup Å 2)- or 32 (ktup Å 1)-residue-wide band assessed for FASTX.) FASTX and FASTY are guaran- (Chao et al., 1992) centered around the best initial reteed to compute an alignment that is optimal among all alignments of their respective types. Thus, the FASTY alignment will always score at least as high as the gion. FASTX and FASTY differ from FASTA by using three protein sequences, from each of the three frames, for the initial lookup process (step 1). Step 3 is modified
4 DNA PROTEIN COMPARISON 27 (a) to cause high-scoring regions in adjacent reading must have diverged more than 2 Byr ago. In contrast, frames to appear to be adjacent in the sequence, so DNA sequence searches with the housefly glutathione they can be more easily joined, and (b) to allow small transferase cdna against the appropriate GenBank divisions overlaps (10 residues) between joined regions. Step 4 is (primate/rodent, plant, or bacterial) do not find any changed to produce a band-limited DNA protein local of the mammalian, yeast, or bacterial homologues (E() ú alignment score (Zhang et al., 1997) as outlined above. 10). Thus, FASTX/Y searches are only about two-fold less FASTX and FASTY use a full Smith Waterman local sensitive than FASTA searches with the encoded protein DNA protein alignment in linear space for the final sequence, but dramatically more sensitive than the correalignment and alignment score (Zhang et al., 1997). sponding DNA sequence similarity search. TFASTX/Y use a similar strategy, but instead of augmenting Figure 5 shows alignments between a mouse class-mu the query-sequence lookup table, the library glutathione transferase, to which bases were added and sequence is encoded as two separate three-frame translations, deleted, and the correct GTM1_MOUSE protein sequence one forward and one reverse. Again, steps 3 using FASTX (Fig. 5A) or FASTY (Fig. 5B). Both algo- and 4 are modified for DNA protein comparison and rithms produce alignments that extend from the begin- TFASTX/Y provide a full Smith Waterman alignment, ning of the protein sequence to the end, but the FASTY without limits on gap size, for the final display. alignment does a better job of maximizing the number Figures 3 5 show the output produced by a typical of identities (91.7% identity for FASTY, 59.6% for FASTX search. As with other programs in the FASTA FASTX, but note that the z score, which indicates statistipackage, the distribution of observed and expected sim- cal significance, is slightly lower for the FASTY alignment). ilarity scores (Fig. 3) and the expectation value [E() value] of the highest scoring unrelated sequence (Fig. 4) can be used to evaluate the accuracy of the expectation Effective Search Parameters for FASTX and FASTY values calculated by FASTX, FASTY, TFASTX, Before evaluating the relative performances of and TFASTY. In this search of the SwissProt database FASTX and FASTY, we first searched an annotated with a housefly glutathione transferase cdna se- PIR39 database to identify good combinations of gapopen, quence, there is excellent agreement between the numbers gap-extension, and frameshift penalties in the of observed and expected similarity scores (Fig. presence of sequence errors. Open reading frames from 3), particularly for scores from 80 to 120 (all the scores 92 query sequences from 46 protein families used in ú120 come from glutathione transferase sequences). previous studies (Pearson, 1995) were mutated and Likewise, the expectation value for the highest scoring used to evaluate search performance with different gap unrelated sequence should be Ç1; in this example the and frameshift penalties (Fig. 6). In general, high highest scoring unrelated sequence has an E() of 1.6. frameshift penalties are most effective when the error Figure 4 also shows the expectation values for a rate is 0, as expected, since no frameshifts should occur. FASTA search of the same SwissProt database using the However, even with modest amounts of errors (substiprotein sequence as the query. In general, the related- tutions, insertions, and deletions) (1 and 2%), searches sequence expectation values for the protein protein comparison with frameshift penalties of 015 are significantly more are about one-half the values calculated from effective than searches with higher penalties. Very the translated DNA protein sequence comparison. The similar results are seen when the entire cdna se- reduced statistical significance results from the increased quence is used (data not shown). The error rates encountered probability of producing a high score from an unrelated in high-quality EST sequences are about sequence when three longer protein sequences (three 1.4% substitutions, 1.1% insertions, and 0.7% deletions translation frames that include a 3 untranslated region) (Hillier et al., 1996, Table 6, corrected), so the 2 5% are compared. FASTX and FASTY compare either the error rates shown in Figs. 6 and 7 are the most realistic. forward or the reverse three-frame translation to the protein FASTX and FASTY perform very similarly when the database; if both forward and reverse frames were best scoring parameters are used (Fig. 7A). However, compared, the statistical significance would be decreased in general one will not know the error rate in advance another twofold because of the effectively larger number and would prefer to use gap and frameshift penalties of comparisons performed. that perform well overall, for example, 015/02/020 for cdna to protein sequence comparison can be surpris- FASTX and 015/02/025/030 for FASTY. With these ingly sensitive. The translated housefly glutathione penalties, FASTX performs a bit better on ORF se- transferase cdna sequence shares statistically signifi- quences (Fig. 7A), otherwise the two programs perform cant similarity with human and rodent class-theta enzymes about the same. Although some of the differences in (GTT1_HUMAN, GTT1_RAT), which must have performance are statistically significant, they are not diverged from a common ancestor more than 500 Myr very dramatic. Figure 7B reports the total number of ago, and with yeast URE2_YEAST glutamine repressor related sequences missed in all 92 searches as a function protein and various plant glutathione transferases of error rate. From this perspective, differences in (GTH3_ARATH, GTH3_MAIZE), which diverged more performance between FASTX and FASTY, or even than 1 Byr ago, as well as several bacterial homologs ORFs and cdna sequences, are far less significant (DCMA_METS1, DCMA_METSP, SSPA_HAEIN), which than changes in the error rate.
5 28 PEARSON ET AL. FIG. 3. Distribution of FASTX similarity scores. A housefly glutathione transferase cdna sequence (GenBank Accession No. X73574) was used to search the SwissProt protein database (Release 34) using the FASTX program. The BLOSUM50 matrix (Henikoff and Henikoff, 1992) was used with a penalty of 015 for the first residue in a gap, 03 for each additional residue, and 030 for a frameshift. ÅÅÅ denotes the number of sequences in the database obtaining the similarity score shown; * indicates the number of sequences expected to obtain a similarity score. Statistical Estimates from DNA Protein Comparisons FASTX/Y and TFASTX/Y also report estimates of the statistical significance of the similarity scores; these estimates are based on the relationship between the mean unrelated sequence similarity score and library sequence and the average variance of the unrelated scores (W.R.P, in preparation). For protein sequences, the statistical estimates are quite accurate; on average about 1/10th of the sequences have expectation values with probabilities 0.1, 1/2 have expectation values
6 DNA PROTEIN COMPARISON 29 FIG. 4. FASTX search high scoring sequences. FASTX similarity scores from the search in Fig. 3 for high-scoring related and unrelated protein sequences are shown. Unrelated sequences are highlighted in italics. The statistical significance, in the form of an E() for each similarity score is shown (FASTX). In addition, expectation values for the same search with the translated GTT2_MUSDO protein sequence are shown (FASTA). random query sequences found a highest scoring alignment with expectation values from õ10 08 to These values are much lower than expected by chance, and six more had E() õ Likewise, when unshuf- fled ORF sequences were used, 6 queries had highest scoring unrelated sequences with E() õ , and 20 of 90 sequences had E() õ 0.05 for the highest scoring unrelated sequence. However, about half of the query with probabilities 0.5, and so on. Statistical estimates for FASTX searches are sometimes less accurate, because out-of-frame translations can sometimes have low-complexity amino acid sequence runs that produce statistically significant similarity scores with similar regions in the protein database. Thus, when 90 ORF sequences were shuffled to produce random sequences, and these sequences were used to search PIR39b, 6
7 30 PEARSON ET AL. FIG. 5. Translated DNA protein alignments FASTX and FASTY. Alignments between mgstm1.e05, a modified copy of the GTM1_MOUSE cdna sequence with Ç5% mutations (mgstm1.e05 is 94.2% identical to the correct cdna), and its encoded protein sequence, using FASTX and FASTY. The BLOSUM50 matrix was used with a penalty of 015 for the first residue in a gap, 02 for each additional residue, 015 for a frameshift, and 015 for a nucleotide substitution (FASTY only). sequences had E() ú 0.5 for the highest scoring unre- the points will fall on a diagonal line with a slope of 1. lated sequences, suggesting that for many of these sequences, If many of the points fall above the diagonal, then there the statistical estimates were accurate. The are fewer sequences with low probabilities than ex- accuracy of the statistical estimates can be judged by pected and thus the estimates are conservative. If a quantile quantile plot (Fig. 8). The expectation value many points fall below the diagonal, as occurs in Fig. of the highest unrelated sequence score from each of 90 8A, then low probabilities have been assigned too fre- ORF query sequences (or 90 random sequences derived quently, and a calculated probability of (or even from the 90 ORF sequences) was sorted from lowest lower) happens far more often than the expected 0.1% value to highest, converted to a Poisson probability of the time. Thus, a very low expectation value is not value using the equation P(E) Å 1 0 e 0E, and plotted necessarily statistically significant. against the cumulative fraction of sequences examined. Examination of the high-scoring, statistically significant If there is perfect agreement between the probability FASTX alignments showed that in every case, of a high score and the number of times it occurs, then FASTX had produced an out-of-frame translation that
8 DNA PROTEIN COMPARISON 31 FIG. 6. Effective scoring penalties open reading frames. The z values for the difference in performance with respect to the best performing gap and frameshift penalties. The values across the top (e.g., 012/01/015) indicate the cost of the first residue in a gap, each additional gapped residue, and a frameshift. The first-gap value is slightly different from the penalties defined in the description of the alignment algorithms. Thus, a first-gapped residue cost of 012 and a gap extension cost of 01 are equivalent to a gap open penalty of 11 and a gap extension penalty of 1. Differences in performance are presented as z values for a sign test of the equivalence numbers; z values greater than 2 are significant at the 0.05 level. Positive z values indicate that the parameters at the top of the column performed better; negative values indicate that the best parameter combination for that error rate performed better. (A) Searches with FASTX. The best parameters for FASTX were 015/02/020, 015/01/015, 015/02/015, 015/01/015, and 015/02/015, for 0, 1, 2, 5, and 10% errors in ORF sequences. (B) Searches with FASTY. All the searches shown used a substitution cost of 030. Searches done with a substitution cost of 015 were no better, but not significantly different. The best FASTY parameters on ORFs were 015/02/025/030 (the last value is the substitution cost), 015/01 /020/030, 015/02/020/030, 015/02/015/030, and 012/02/015/030. The original FASTA package (Pearson and Lipman, 1988) provided TFASTA, a program to compare a pro- tein sequence to a DNA sequence library, calculating six scores, one for each of the three forward and the three reverse reading frames. TFASTX and TFASTY yielded a protein domain with a repetitive, low-complexity sequence. To evaluate the statistical estimates in the absence of these low-complexity matches, the seg program (Wootton, 1994) was used to strip the PIR39b database of these regions (they were replaced by x s ). When low-complexity regions are removed from the protein database, the statistical estimates calculated both for random sequences and for the highest scoring unrelated sequence are reliable and somewhat conservative (Fig. 8B). Thus, FASTX can calculate accurate statistical significance estimates for the local similarity scores if low-complexity regions are removed from the protein sequence database. However, when one scans an unmodified database, scores that appear significant should be examined carefully for alignment of low-complexity regions. TFASTX/Y, an Alternative to TFASTA
9 32 PEARSON ET AL. somal protein L13, E() õ ; GMGLYO glyoxylase, E() õ ; SMAPHOAA alkaline phosphatase, E() õ 0.03). However, additional searches with these DNA sequences using FASTX revealed that they share strongly significant similarity with many members of the glutathione transferase family, and thus are likely to be homologous to the housefly query sequence. (SMAPHOAA clearly encodes an alkaline phosphatase starting at nucleotide 667, but also encodes a glutathione transferase family member from residue 2 to 511.) The highest scoring sequences in the GenBank divisions used for this example that are clearly unrelated to GTT2_MUSDO are PSEPUOH109 [E() õ 0.39], which encodes a haloacetate dehalogenase, and BMOEF1BP [E() õ 0.57], which encodes a EF1b elongation factor that does not appear to be related to the EF1g elongation factors that are homologous to glutathione transferases (Koonin et al., 1994). Thus, for this query sequence, which does not contain low-complexity regions, the statistical estimates appear quite accurate. A comprehensive evaluation of the statistics of comparisons to DNA databases will require a carefully annotated DNA sequence database. Identification of Genes in Bacterial Genome Sequences FIG. 7. FASTX versus FASTY. (A) The best scoring parameters FASTX and FASTY can also be used to analyze gefor FASTX and FASTY searches using either ORF or cdna query nomes that lack long exons in protein-coding genes. As sequences were compared for each error rate best. Differences in of this writing, full genomic sequences of six prokaryperformance are presented as z values; positive z values indicate otes and one eukaryote are available. Here, we present that FASTY performed better; negative values support FASTX. The best ORF parameters are shown in Fig. 6. For cdnas, the best results of preliminary analyses of three prokaryotic ge- penalties were 015/01/025, 015/02/015, 015/02/015, 015/02/ nomes sequenced at TIGR: M. jannaschii (Bult et al., 015, and 012/03/015 (FASTX) and 015/01/025/030, 015/02/025/ 1996), H. influenzae (Fleischmann et al., 1995), and M. 030, 015/01/020/030, 015/03/015/030, and 012/03/015/030 (FASTY). Searches with a good general purpose combination of parameters, 015/02/020 for FASTX and 015/02/025/030 for FASTY, are also shown. (B) The total number of related sequences missed using FASTX or FASTY with error rate. Filled symbols show the results of searches using the best search parameters for the error rate; open symbols show the results with the general purpose parameters used in (A). use the same DNA protein alignment algorithms as FASTX and FASTY to provide two alignment scores for each DNA sequence; one from the forward and one from the reverse-complement sequence. TFASTA statistics are based on the best of the six scores produced from each library sequence; TFASTX statistics are based on the best of the forward and reverse similarity scores. In the example shown in Fig. 9, TFASTX does not perform substantially better than TFASTA in identifying distantly related DNA sequences, although both TFASTA and TFASTX perform substantially better than a DNA DNA comparison. However, TFASTX provides much more informative alignments when se- quencing errors or short introns are present. A quick glance at the names of the protein sequences in Fig. 9 suggests that many significant scores were calculated for unrelated sequences (e.g., S75161 ribo- FIG. 8. Statistics of FASTX and FASTY scores. Probability of the score of the highest scoring unrelated sequence versus the cumu- lative fraction of sequences examined. Searches were performed with 90 ORF query sequences (open symbols) or 90 random sequences (filled symbols) derived from ORF sequences against either the unmodified PIR39b database (A) or the PIR39b database with lowcomplexity sequences removed (B). Searches were performed with the three parameter sets shown. As many as 7 sequences with P() values õ0.001 (some õ ) of the 90 query sequences used for each test set are below the bottom of the plot in A. FASTX searches used parameters of 015/02/020; FASTY searches used 015/02/025/030.
10 DNA PROTEIN COMPARISON 33 FIG. 9. TFASTX high-scoring sequences. High-scoring sequences is a search of the Primate, Invertebrate, Plant, and Bacterial sections of GenBank (Release 99) using a housefly glutathione transferase (GTT2_MUSDO). Unrelated sequences are highlighted in italics. (In cases in which the sequence was not clearly labeled as a glutathione transferase, additional searches were done with FASTX or FASTA against the SwissProt protein sequence database, and homology was established if the GenBank sequence obtained E() values õ10 06 with many glutathione transferase family members. Also shown are the expectation values calculated from a TFASTA search. genitalium (Fraser et al., 1995). The sequences were downloaded via ftp from TIGR s World Wide Web site ( These genomes were chosen for this preliminary study strictly because of the similarity of their on-line documentation. Identification of protein-coding regions of genomic
11 34 PEARSON ET AL. TABLE 1 Modified Gene Boundaries Based on Extended Alignments ORF Extended ORF Name Match E() Length E() Length Start Stop Haemophilus influenzae HI0097 FBP_HAEIN HI0117 MLTA_ECOLI 2.2 e HI0153 DCUB_ECOLI 1.1 e H10187 YIGT_ECOLI 6.2 e e HI0200 SELD_HAEIN HI0218 T1R1_ECOLI HI0220 ARCB_ECOLI HI0342 NAPF_ECOLI 4.4 e HI0498 POT2_HAEIN HI0537 UREF_BACSB 1.2 e e HI0603 HEMX_ECOLI 3.3 e HI0635 Y712_HAEIN HI0686 GLPT_ECOLI 7.2 e H10723 TRKH_ECOLI H10962 SYI_HAEIN 1 e HI0976 YWFM_BACSU e HI1042 METH_ECOLI 5.6 e HI1318 IF3_HAEIN HI1383 PSTS_HAEIN HI1390 YFRC_PROVU e HI1460 YADA_YEREN a a 2.8 e HI1537 LICA_HAEIN HI1653 TLDD_HAEIN HI1721 YI5B_ECOLI Methanococcus jannaschii MJ0165 PUR6_METTH 1.3 e e MJ0910 BCHD_RHOCA 1.8 e e MJ1209 MTH1_HAEPA 5.6 e e MJ1268 GLTL_ECOLI 3.2 e e MJ1325 CADF_STAAU 8.9 e e MJ1328 MTHC_HAEIN 9.7 e MJ1339 IF2_BACST 2.6 e e MJ1353 FDHA_METFO Mycoplasma genitalium MG434 PYRH_ECOLI 9.9 e e Note. Summary of FASTX search results using extended ORFs. All FASTX searches were done against SwissProt Release 34. a With HI1460 ORF as a query, YADA_YEREN is not detected as a match with E() õ 100. sequences can suffer from a number of potential errors. Perhaps the most obvious source of error is mistakes in the sequence itself. Specifically, deletions and insertions resulting in frameshift errors can result in incorrect ORF assignment or in failure to identify the correct homologue. Another source of error in genomic analysis is the failure to identify the correct boundaries of an ORF. This may occur, for example, when the computer has a choice between two different start codons and chooses the wrong one. A third source of error is the failure to identify an ORF entirely. All of these errors can potentially be remedied by searching different combinations of genomic sequence with FASTX. The obvious limitation of these methods is that FASTX can only identify errors in sequences that have homologues in the protein databases. In the case of M. jannaschii, this may be a limiting factor, since less than 50% of its ORFs have identifiable homologues (Bult et al., 1996; Kyrpides et al., 1996). Preliminary results of potential ORF boundary extensions are shown in Table 1. In most cases, it is difficult to determine from the FASTX alignments alone whether detected frameshifts and internal stop codons are the result of sequencing errors or are actual mutations in the original sequence. For instance, MJ1209 matches a methylase from Haemophilus parainfluenzae (MTH1_HAEPA) with an expectation value of ; however, the alignment is 83 amino acid residues short of the full length of MTH1_HAEPA and contains five internal stop codons. This extended ORF may well be a pseudogene. In other instances, the detected exten- sions are almost certainly the result of sequencing or record-keeping errors. For example, HI1653 matches TLDD_HAEIN with 100% identity over the full length of the match except for one frameshift. These exten- sions can also aid in revealing previously unknown homologues. HI0117 is currently described as No database match in the on-line documentation at the TIGR
12 DNA PROTEIN COMPARISON 35 TABLE 2 New Genes between Open Reading Frames Fragment Name Match E() Length Start Stop Haemophilus influenzae HI YD38_HAEIN Methanococcus jannaschii MJ RS14_METVA 7.8 e MJ YACA_BACSU 1.5 e MJ CAPM_STAAU 7.2 e MJ YCT2_BACFI 1.6 e MJ DRAG_RHORU 1.2 e Mycoplasma genitalium MG291.1 YQGF_HAEIN MG335.1 Y060_MYCGE MG334.2 YOXG_BACSU 4.7 e Note. FASTX searches that revealed previously unknown genes. Fragment E() values refer to the entire intergenic fragment, not just the alignment region given. Length corresponds to the length of the alignment in amino acid residues. WWW site, but here we show that if extended down- ú500 Myr ago could be detected by DNA sequence stream, HI0117 matches a membrane-bound lytic mu- comparison. rein transglycosylase A precursor from Escherichia coli FASTX and FASTY provide two slightly different (MLTA_ECOLI) with an expectation value of 0. strategies for aligning the codons of a DNA sequence Potential genes not previously identified are summa- with a protein sequence. In our tests, FASTY is about rized in Table 2. Only the clearest examples are shown 27% slower than FASTX. FASTX and FASTY show here; more than 60 additional intragenic regions con- very similar performance tests with the error rates tain translation frames with significant similarity to shown here. In other tests with much higher deletion an entry in SwissProt, but the basis for the similarity rates (5 20% deletion only, rather than the 0.35% was less clearcut (typically the sequence similarity was 3.5% deletions shown here), FASTY was significantly much shorter than expected). Identification of potential better (data not shown). However, in all of our tests start and stop codons for these ORFs has not been com- with mixtures of insertions, deletions, and substitupleted at this time; thus, the start and stop positions tions, there was little difference in performance belisted are for alignments only. For consistency, the tween FASTX and FASTY. Thus, FASTX is preferred naming of the new genes follows the naming convention for its speed, although FASTY is capable of producing of TIGR; for example, the gene found between HI1462 more accurate alignments. and HI1463 is here named HI This particular Unlike conventional protein similarity searches, ORF is 100% identical to the DNA sequence of HI1338 which are only moderately affected by changes in gapand yet remained unidentified both in the initial analy- penalty values (Pearson, 1995), the best frameshift sis (Fleischmann et al., 1995) and in the analysis of penalties are dependent on the expected error level. Robison et al. (1996). In the M. jannaschii genome, a Thus, for error-free data, a high frameshift penalty ribosomal protein S14 homologue (MJ0469.1) has been is appropriate; but if there is 2 5% error, lower identified among a large cluster of other ribosomal pro- frameshift penalties are more effective. For general teins (MJ0465 MJ0477). Note, also, that the results purpose searching on high quality databases, such as in Tables 1 and 2 are based on searches of SwissProt. the genomic scans summarized here, a high frameshift Searches of the PIR or GenPept databases may reveal penalty (025 to 030) is effective. For EST searches, additional ORFs or ORF extensions. where errors are expected, a lower penalty (015 to 020) may be more appropriate. DISCUSSION Accurate estimates of statistical significance are essential for automatic large-scale sequence identification. FASTX, FASTY, TFASTX, and TFASTY can provide FASTX and FASTY can provide accurate esti- a sensitive tool for rapidly characterizing DNA selow-complexity mates if they are used to search databases from which quence similarities based on translated DNA protein regions have been removed (Fig. 8). sequence comparisons. Translated DNA protein sethe Likewise, TFASTX can produce reliable estimates if quence comparisons are almost as sensitive as protein protein query sequence does not contain low-com- database searches (Figs. 4, 8) and dramatically more plexity regions. Future modifications to FASTX/Y and sensitive than DNA database searches. None of the TFASTX/Y may include a seg-style screening of the relationships in Fig. 4 based on ancestors shared sequences used in the initial lookup table, to attempt
13 36 PEARSON ET AL. to avoid detection of out-of-frame low-complexity runs D. M., Phillips, C. A., Merrick, J. M., Tomb, J.-F., Dougherty, B. A., in the initial comparison phase. This may make it pos- Bott, K. F., Hu, P.-C., Lucier, T. S., Peterson, S. N., Smith, H. O., III, C. A. H., and Venter, J. C. (1995). The minimal gene complesible to search unmodified protein databases with reli- ment of Mycoplasma genitalium. Science 270: able statistics. Gish, W., and States, D. J. (1993). Identification of protein coding We were surprised to learn that so many gene-bound- regions by database similarity search. Nature Genet. 3: ary assignments in the recently determined bacterial Guan, X., and Uberbacher, E. (1996). Alignments of DNA and protein genomes were suspect, and that additional genes with sequences containing frameshift errors. Comp. Appl. Biosci. 12: very significant similarity to known proteins could be found in intergenic regions. If only 50 70% of the genes Hein, J. (1994). An algorithm combining DNA and protein alignment. J. Theor. Biol. 167: in M. jannaschii have clear homologues in the protein Hein, J., and Stovbaek, J. (1994). Genomic alignment. J. Mol. Evol. databases, then we may have detected only about 1/2 38: of the errors. We suspect that many of these errors Henikoff, S., and Henikoff, J. G. (1992). Amino acid substitutions occurred because a full Smith Waterman DNA pro- matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: tein alignment was not performed on initial gene as signments. Even if FASTX is not used initially to char- Hillier, L., Lennon, G., Becker, M., Bonaldo, M. F., Chiapelli, B., acterize open reading frames, it should be used in a Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy, M., Le, M., Le, N., final step to test for potential extensions of the trans- Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, lated DNA alignment. L., Rohlfing, T., Schellenberg, K., Soares, M. B., Tan, F., Thierry- Meg, J., Trevaskis, E., Underwood, K., Wohldman, P., Waterston, ACKNOWLEDGMENTS R., Wilson, R., and Marra, M. (1996). Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6: This work was supported by grants from the National Library of Huang, X., and Zhang, J. (1996). Methods for comparing a DNA Medicine (LM04969) to W.R.P. and (LM05110) to W.M. sequence with a protein sequence. Comp. Appl. Biosci. 12: Knecht, L. (1995). Pairwise alignment with scoring on tuples. In REFERENCES Lecture Notes in Computer Science: Combinatorial Pattern Matching, Vol. 937, pp , Springer-Verlag, Berlin. Bult, C. J., White, O., Olsen, G. J., Zhou, L., Fleischmann, R. D., Koonin, E. V., Mushegian, A. R., Tatusov, R. L., Altschul, S. F., Bry- Sulton, G. G., Blake, J. A., Fitzgerald, L. M., Clayton, R. A., Go- ant, S. H., Bork, P., and Valencia, A. (1994). Eukaryotic translation cayne, J. D., Kerlavage, A. R., Dougherty, B. A., Tomb, J.-F., Ad- elongation factor 1 gamma contains a glutathione transferase doams, M. D., Reisch, C. I., Overbeek, R., Kirkness, E. F., Weinstock, main Study of a diverse, ancient protein superfamily using motif K. G., Merrick, J. M., Glodek, A., Scott, J. L., Geoghagen, N. S. M., search and structural modeling. Protein Sci. 3: Weidman, J. F., Fuhrmann, J. L., Nguyen, D., Utterback, T. R., Kyrpides, N. C., Olsen, G. J., Klenk, H.-P., White, O., and Woese, Kelley, J. M., Peterson, J. D., Sadow, P. W., Hanna, M. C., Cotton, C. R. (1996). Methanococcus jannaschii genome: Revisited. Microb. M. D., Roberts, K. M., Hurst, M. A., Kaine, B. P., Borodovsky, M., Comp. Genomics 1: Klenk, H.-P., Fraser, C. M., Smith, H. O., Woese, C. R., and Venter, Pearson, W. R. (1990). Rapid and sensitive sequence comparison with J. C. (1996). Complete genome sequence of the methanogenic arch- FASTP and FASTA. In Methods in Enzymology (R. F. Doolittle, aeon, Methanococcus jannaschii. Science 273: Ed.), Vol. 183, pp , Academic Press San Diego. Chao, K.-M., Pearson, W. R., and Miller, W. (1992). Aligning two Pearson, W. R. (1995). Comparison of methods for searching protein sequences within a specified diagonal band. Comp. Appl. Biosci. sequence databases. Protein Sci. 4: : Pearson, W. R. (1996). Effective protein sequence comparison. Meth- Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkods Enzymol. 266: ness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, Pearson, W. R., and Lipman, D. J. (1988). Improved tools for biologi- B. A., Merrick, J. M., McKenney, K., Sutton, G., FitzHugh, W., cal sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444 Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L.-I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., amd M. D. Cotton, E. H., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Peltola, H., Soderlund, H., and Ukkonen, E. (1986). Algorithms for Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhr- the search of amino acid patterns in nucleic acid sequences. Nucleic mann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Acids Res. 14: Small, K. V., Fraser, C. M., Smith, H. O., and Venter, J. C. (1995). Robison, K., Gilbert, W., and Church, G. M. (1996). More Haemophi- Whole-genome random sequencing and assembly of Haemophilus lus and Mycoplasma genes. Science 271: influenzae Rd. Science 269: Wootton, J. C. (1994). Non-globular domains in protein sequences: Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, Automated segmentation using complexity measures. Comput. R. A., Fleischmann, R. D., Bult, C. J., Kerlavage, A. R., Sutton, G., Chem. 18: Kelley, J. M., Fritchman, J. L., Weidman, J. F., Small, K. V., San- Zhang, Z., Pearson, W., and Miller, W. (1997). Aligning a DNA sequence dusky, M., Fuhrmann, J., Nguyen, D., Utterback, T. R., Saudek, with a protein sequence. J. Comput. Biol. 4:
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Clone Manager. Getting Started
Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST
Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some
Gene Models & Bed format: What they represent.
GeneModels&Bedformat:Whattheyrepresent. Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically,
Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004
Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence
DNA Sequencing Overview
DNA Sequencing Overview DNA sequencing involves the determination of the sequence of nucleotides in a sample of DNA. It is presently conducted using a modified PCR reaction where both normal and labeled
1 Mutation and Genetic Change
CHAPTER 14 1 Mutation and Genetic Change SECTION Genes in Action KEY IDEAS As you read this section, keep these questions in mind: What is the origin of genetic differences among organisms? What kinds
Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?
Optimization 1 Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Where to begin? 2 Sequence Databases Swiss-prot MSDB, NCBI nr dbest Species specific ORFS
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs
BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs Richard J. Edwards 2008. Contents 1. Introduction... 2 1.1. Version...2 1.2. Using this Manual...2 1.3. Why use BUDAPEST?...2
The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:
Module 3F Protein Synthesis So far in this unit, we have examined: How genes are transmitted from one generation to the next Where genes are located What genes are made of How genes are replicated How
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
Yale Pseudogene Analysis as part of GENCODE Project
Sanger Center 2009.01.20, 11:20-11:40 Mark B Gerstein Yale Illustra(on from Gerstein & Zheng (2006). Sci Am. (c) Mark Gerstein, 2002, (c) Yale, 1 1Lectures.GersteinLab.org 2007bioinfo.mbb.yale.edu Yale
From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains
Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes
SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications
Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:
Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47 5 BLAST and FASTA This lecture is based on the following, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid and Sensitive Protein
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript
DNA Sequence Alignment Analysis
Analysis of DNA sequence data p. 1 Analysis of DNA sequence data using MEGA and DNAsp. Analysis of two genes from the X and Y chromosomes of plant species from the genus Silene The first two computer classes
Network Protocol Analysis using Bioinformatics Algorithms
Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe [email protected] ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol
DNA Insertions and Deletions in the Human Genome. Philipp W. Messer
DNA Insertions and Deletions in the Human Genome Philipp W. Messer Genetic Variation CGACAATAGCGCTCTTACTACGTGTATCG : : CGACAATGGCGCT---ACTACGTGCATCG 1. Nucleotide mutations 2. Genomic rearrangements 3.
Guide for Bioinformatics Project Module 3
Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first
Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism )
Biology 1406 Exam 3 Notes Structure of DNA Ch. 10 Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Proteins
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
Focusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
How To Check For Differences In The One Way Anova
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need
Global and Discovery Proteomics Lecture Agenda
Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global
13.4 Gene Regulation and Expression
13.4 Gene Regulation and Expression Lesson Objectives Describe gene regulation in prokaryotes. Explain how most eukaryotic genes are regulated. Relate gene regulation to development in multicellular organisms.
Biological Sciences Initiative. Human Genome
Biological Sciences Initiative HHMI Human Genome Introduction In 2000, researchers from around the world published a draft sequence of the entire genome. 20 labs from 6 countries worked on the sequence.
A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative
RESTRICTION DIGESTS Based on a handout originally available at
RESTRICTION DIGESTS Based on a handout originally available at http://genome.wustl.edu/overview/rst_digest_handout_20050127/restrictiondigest_jan2005.html What is a restriction digests? Cloned DNA is cut
Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University
Genotyping by sequencing and data analysis Ross Whetten North Carolina State University Stein (2010) Genome Biology 11:207 More New Technology on the Horizon Genotyping By Sequencing Timeline 2007 Complexity
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
Consensus alignment server for reliable comparative modeling with distant templates
W50 W54 Nucleic Acids Research, 2004, Vol. 32, Web Server issue DOI: 10.1093/nar/gkh456 Consensus alignment server for reliable comparative modeling with distant templates Jahnavi C. Prasad 1, Sandor Vajda
Genetomic Promototypes
Genetomic Promototypes Mirkó Palla and Dana Pe er Department of Mechanical Engineering Clarkson University Potsdam, New York and Department of Genetics Harvard Medical School 77 Avenue Louis Pasteur Boston,
Current Motif Discovery Tools and their Limitations
Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.
DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!
DNA Replication & Protein Synthesis This isn t a baaaaaaaddd chapter!!! The Discovery of DNA s Structure Watson and Crick s discovery of DNA s structure was based on almost fifty years of research by other
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
BIOINFORMATICS TUTORIAL
Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.
MASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
MUTATION, DNA REPAIR AND CANCER
MUTATION, DNA REPAIR AND CANCER 1 Mutation A heritable change in the genetic material Essential to the continuity of life Source of variation for natural selection New mutations are more likely to be harmful
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
Amino Acids and Their Properties
Amino Acids and Their Properties Recap: ss-rrna and mutations Ribosomal RNA (rrna) evolves very slowly Much slower than proteins ss-rrna is typically used So by aligning ss-rrna of one organism with that
Genome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
CCR Biology - Chapter 9 Practice Test - Summer 2012
Name: Class: Date: CCR Biology - Chapter 9 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Genetic engineering is possible
RNA and Protein Synthesis
Name lass Date RN and Protein Synthesis Information and Heredity Q: How does information fl ow from DN to RN to direct the synthesis of proteins? 13.1 What is RN? WHT I KNOW SMPLE NSWER: RN is a nucleic
MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,
BioBoot Camp Genetics
BioBoot Camp Genetics BIO.B.1.2.1 Describe how the process of DNA replication results in the transmission and/or conservation of genetic information DNA Replication is the process of DNA being copied before
Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.
13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both
INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B
INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE ICH HARMONISED TRIPARTITE GUIDELINE QUALITY OF BIOTECHNOLOGICAL PRODUCTS: ANALYSIS
Module 10: Bioinformatics
Module 10: Bioinformatics 1.) Goal: To understand the general approaches for basic in silico (computer) analysis of DNA- and protein sequences. We are going to discuss sequence formatting required prior
Human Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/
CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu [email protected] 1. Introduction
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
Specific problems. The genetic code. The genetic code. Adaptor molecules match amino acids to mrna codons
Tutorial II Gene expression: mrna translation and protein synthesis Piergiorgio Percipalle, PhD Program Control of gene transcription and RNA processing mrna translation and protein synthesis KAROLINSKA
Frequently Asked Questions Next Generation Sequencing
Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided
2006 7.012 Problem Set 3 KEY
2006 7.012 Problem Set 3 KEY Due before 5 PM on FRIDAY, October 13, 2006. Turn answers in to the box outside of 68-120. PLEASE WRITE YOUR ANSWERS ON THIS PRINTOUT. 1. Which reaction is catalyzed by each
Error Tolerant Searching of Uninterpreted MS/MS Data
Error Tolerant Searching of Uninterpreted MS/MS Data 1 In any search of a large LC-MS/MS dataset 2 There are always a number of spectra which get poor scores, or even no match at all. 3 Sometimes, this
ProSightPC 3.0 Quick Start Guide
ProSightPC 3.0 Quick Start Guide The Thermo ProSightPC 3.0 application is the only proteomics software suite that effectively supports high-mass-accuracy MS/MS experiments performed on LTQ FT and LTQ Orbitrap
European Medicines Agency
European Medicines Agency July 1996 CPMP/ICH/139/95 ICH Topic Q 5 B Quality of Biotechnological Products: Analysis of the Expression Construct in Cell Lines Used for Production of r-dna Derived Protein
Biological Databases and Protein Sequence Analysis
Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to
4.2.1. What is a contig? 4.2.2. What are the contig assembly programs?
Table of Contents 4.1. DNA Sequencing 4.1.1. Trace Viewer in GCG SeqLab Table. Box. Select the editor mode in the SeqLab main window. Import sequencer trace files from the File menu. Select the trace files
BCH401G Lecture 39 Andres
BCH401G Lecture 39 Andres Lecture Summary: Ribosome: Understand its role in translation and differences between translation in prokaryotes and eukaryotes. Translation: Understand the chemistry of this
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
Gene Finding CMSC 423
Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher
a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled
Biology 101 Chapter 14 Name: Fill-in-the-Blanks Which base follows the next in a strand of DNA is referred to. as the base (1) Sequence. The region of DNA that calls for the assembly of specific amino
Information, Entropy, and Coding
Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
CCR Biology - Chapter 8 Practice Test - Summer 2012
Name: Class: Date: CCR Biology - Chapter 8 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. What did Hershey and Chase know
Human Genome and Human Genome Project. Louxin Zhang
Human Genome and Human Genome Project Louxin Zhang A Primer to Genomics Cells are the fundamental working units of every living systems. DNA is made of 4 nucleotide bases. The DNA sequence is the particular
DnaSP, DNA polymorphism analyses by the coalescent and other methods.
DnaSP, DNA polymorphism analyses by the coalescent and other methods. Author affiliation: Julio Rozas 1, *, Juan C. Sánchez-DelBarrio 2,3, Xavier Messeguer 2 and Ricardo Rozas 1 1 Departament de Genètica,
4. DNA replication Pages: 979-984 Difficulty: 2 Ans: C Which one of the following statements about enzymes that interact with DNA is true?
Chapter 25 DNA Metabolism Multiple Choice Questions 1. DNA replication Page: 977 Difficulty: 2 Ans: C The Meselson-Stahl experiment established that: A) DNA polymerase has a crucial role in DNA synthesis.
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals Xiaohui Xie 1, Jun Lu 1, E. J. Kulbokas 1, Todd R. Golub 1, Vamsi Mootha 1, Kerstin Lindblad-Toh
Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006
Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm
Resource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
