goldminer Tutorial Introduction

Genomic sequencing and large-scale gene expression studies often produce a large number of sequence fragments. A major challenge in bioinformatics is to identify the function of these sequence fragments, a process commonly known as sequence annotation. This tutorial outlines the fundamental concepts of sequence annotation, its computational aspects and, in particular, how to annotate a set of ESTs (expressed sequence tags) by using goldminer, developed in Dr. Xuhua Xia's lab at the University of Ottawa.

There are two major categories of computational methods for sequence annotation. The first is based on known genes in molecular databases and uses homology searches. The best representatives of this category are FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990; Altschul et al. 1997). The second, best represented by GENSCAN (Burge and Karlin 1997), is based on known gene structures and recognizes patterns by using two types of computational methods: neural network algorithms and hidden Markov models. Existing gene-finding software often combines both approaches, e.g., GeneMark (Hayes and Borodovsky 1998), GLIMMER (Salzberg et al. 1998), Orpheus (Frishman et al. 1998), Projector (Meyer and Durbin 2004) and YACOP (Tech and Merkl 2003).

The first category of gene-finding methods used to be unimportant when there were few genes in the gene dictionary. However, the gene dictionary has expanded dramatically in the last decades, and it is now rare for a new gene sequence to find no match in the public databases. Many sequence annotation platforms are now based mainly on the first category of methods, especially platforms designed for EST annotation (Ayoubi et al. 2002; Davila et al. 2005; Koski et al. 2005; Mao et al. 2003; Martin et al. 2004; Paquola et al. 2003).
Gene annotation in large-scale gene expression studies is similar to genome annotation in that both involve a large number of sequence fragments to be annotated and both search against databases of known genes. To search against protein databases, one also needs to translate the nucleotide sequences in six frames (three frames in the input strand and another three in the complementary strand).

There are a few unique features in EST annotation. First, an actively transcribed gene (say X) tends to have more copies of its RNA, and consequently will be cloned more often, than an inactive gene (say Y). However, the ESTs for the gene X RNA need not be identical: they may cover different, overlapping parts of the transcript (Fig. 1). For this reason, an EST annotation system is expected to have a contig assembly function. goldminer implements the contig assembly algorithms developed by Huang (1992) with a few modifications to improve performance.

1      ATTTAATTAAACCACGGTAAGCC
2                   ACGGTAAGCCTCAACCTTTTCC
3                                                  CAATGCTGCT
    TTGATTTAATTAAACCACGGTAAGCCTCAACCTTTTCCATATGTGGTCAATGCTGCTTCC
    AACUAAAUUAAUUUGGUGCCAUUCGGAGUUGGAAAAGGUAUACACCAGUUACGACGAAGG
4                             AGUUGGAAAAGGUAUAC

EST1+ EST2+ EST4- : ATTTAATTAAACCACGGTAAGCCTCAACCTTTTCCATATG

Fig. 1. A schematic illustration of an RNA sequence together with its reverse-transcribed complementary strand (shown directly beneath it). Four ESTs are derived from the RNA, with three collinear with the RNA and the fourth collinear with the complementary strand. Contig assembly will join ESTs 1, 2 and 4, with the resulting contig named concisely as EST1+ EST2+ EST4- to indicate that the contig is assembled from EST1, EST2 and the complementary strand of EST4. Correct functional annotation will show that all four ESTs map to the same gene, leading to the correct conclusion that the RNA was cloned four times.

Large-scale gene expression studies typically involve printing the ESTs onto a microarray chip.
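As an aside on the six-frame translation mentioned above, the following is a minimal Python sketch of the idea, not goldminer's own code. It assumes the standard genetic code (translation table 1), treats U as T, and marks any codon containing an ambiguous base as X. The function names (normalize, reverse_complement, translate, six_frames) are hypothetical and chosen only for this illustration.

```python
# Minimal sketch of six-frame translation (illustrative only; not goldminer code).
# Assumption: the standard genetic code, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(seq):
    """Upper-case the sequence and write RNA (U) as DNA (T)."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'->3' direction."""
    return normalize(seq).translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = normalize(seq)
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate three frames of the input strand (+) and three of its complement (-)."""
    forward = normalize(seq)
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCC").items():
        print(name, protein)
```

Each of the six protein strings can then be used as a separate query against a protein database. A trailing partial codon is simply dropped in this sketch.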
If the RNA from the highly expressed gene X is represented by 100 ESTs, then one naturally does not want to print all these ESTs into separate microarray cells because it would be too wasteful. Contig assembly and EST annotation allow one to know whether a certain gene has many EST representatives so that one can choose which ESTs to print.

Now suppose that you study gene expression in goldfish brains and have already accumulated a large number of ESTs. How are you going to know what gene products these ESTs code for? Naturally you would first want to search your sequences against a goldfish sequence database (if there is one). If only a few goldfish genes have been sequenced, then you should search your goldfish sequences against the genome of a closely related species, such as the zebrafish genome. If there are still goldfish sequences without a good match in the zebrafish genome, then you should search against genomic databases of other vertebrate species.

Sometimes a good match may mean nothing, e.g., a match against another EST or against an unannotated putative sequence. For such sequences, you will need to search against databases of protein functional classification such as pfam (Bateman et al. 1999; Bateman et al. 2004), SMART (Letunic et al. 2004), and COG (Tatusov et al. 2003). While these databases represent outstanding bioinformatics advances in protein functional classification, there are three major problems with searching against them. First, protein families in each of these databases are overlapping subsets of proteins with known functions (Fig. 2). It would have been much nicer to cross-validate and merge all these databases together, with the inclusion of all proteins with known functions. Second, a search against these databases may yield multiple matches in different protein families, and you are left wondering which one represents the more accurate functional classification. Third, searching against these databases is typically slow, which handicaps a large-scale study of gene expression with thousands of ESTs.

The Conserved Domain Database (CDD) was created to overcome (or at least alleviate) these three problems (Marchler-Bauer et al. 2005). The database imports protein families from pfam, SMART and COG, with cross-database validation and re-classification to increase the accuracy of the functional classification. It includes other proteins with known functions that are curated at NCBI but not included in other databases.
To increase the searching speed, CDD uses the RPS-BLAST search engine, whose speed is augmented by pre-computation of much of the output.

[Figure 2: diagram of circles labeled pfam, SMART and COG, together with a region labeled "Unincluded proteins with known function", all inside an outermost circle.]

Fig. 2. SMART, COG and pfam include subsets of proteins with known functional classification. The size of the circles does not reflect the size of individual databases. The outermost circle represents CDD.

The web API for CDD has not yet been formally released, and goldminer is the first software package for the scientific community that automates the search against CDD. In this tutorial you will learn how to install goldminer and how to use it to annotate a sample set of EST sequences from goldfish. The goldminer program is large because I packed a zebrafish CDS database with it so that one can work with the sample EST sequences from goldfish by local BLASTing. In the latter part of the tutorial, you will learn how to retrieve the zebrafish CDS sequences from GenBank and how to create a local BLAST database.

Objectives:

1. Learn to do quick and dirty sequence annotation by automated local BLASTing.
2. Learn to automate the slow but accurate and informative functional annotation against protein family databases such as the Conserved Domain Database (CDD) and pfam.
3. Learn to automate database search and annotation against other NCBI-hosted databases.
4. Assemble contigs from sequence fragments.
5. Gain experience in creating local BLAST databases.

Procedures:

Note: Please ignore the installation step below if goldminer is already installed on your computer. This tutorial is written not only for you, but also for others who need to do the installation themselves.

1. Install Goldminer from http://dambe.bio.uottawa.ca/goldminer.asp. Unless your computer is extremely old, all you need to do is click the Goldminer.msi file, click the Run button, and then click the Next button a few times in response to the ensuing dialog boxes. The default installation directory is C:\Program Files\Goldminer. Under this directory, three subdirectories are created during the installation process:
   a. Plate directory, which contains a single sample file, CaNCBI.FAS, with 42 sequences from 42 goldfish mRNAs. You may put your own sequences into this same directory.
   b. BLASTDB directory, which contains the sample zebrafish CDS BLAST files for you to practice local BLAST with the CaNCBI.FAS file.
   c. ESTDB directory, which contains files with annotated sequences. (This directory may be missing in your installation.)
2. Sequence annotation by local BLAST.
   a. Open EST sequence file
      i. Click Start > All Programs > Goldminer to start the program.
      ii. Click Tools > Options to set the program defaults (you do not need to do this if you use the Goldminer defaults). The EST plate directory is where you should store your unannotated sequence files. The default is GoldminerDir\Plate (where GoldminerDir is the Goldminer installation directory, C:\Program Files\Goldminer by default).
         The EST database directory stores files containing annotated or partially annotated sequences, and the default is GoldminerDir\ESTDB. The BLAST program directory is where the BLAST programs are located, and you are advised to leave it as the default. The BLAST database directory is where you store your personal local BLAST databases, and the default is GoldminerDir\BLASTDB. The default input file format is FASTA, but you can set it to another format. A simple guideline for sequence naming is to use a combination of plate ID and well ID, e.g.,

            >A1
            AACACAGGUUUA......

         where A1 designates the coordinates on the cloning plate. There are two types of plate in current use: the 96-well plate (rows labeled A to H and columns numbered 1 to 12) and the 384-well plate (rows labeled A to P and columns numbered 1 to 24). Of course one does not have to use the coordinates as the sequence name, but the naming convention helps associate the sequence with its physical location.
      iii. Click File > Open plate files to read in the CaNCBI file. Goldminer can recognize many different sequence formats, but FASTA is the most frequently used sequence format in gene expression studies, hence the FASTA default.
      iv. The sequences will be displayed. At this point, we do not know what genes these sequences are, and most columns are blank.
   b. Local BLAST. All these searches take time. If you have many sequence fragments to annotate and you need to know their approximate functions quickly, then speed is a prime consideration. Remote searching is always slow, so one should do remote BLAST only with the subset of sequences that do not find matches by local BLAST. For this reason you should always create and install local databases to facilitate your search. You are advised to always do local BLAST first so that only a small fraction of the sequences will then be searched against remote BLAST databases at NCBI.
      This reduces the chance of overloading the NCBI BLAST server.
      i. Click BLAST > ReBlast against genomic DB.
      ii. A dialog appears. In the EST option, click ReBlast All ESTs. In the bottom frame, leave the default unchanged, i.e., Blast against local database.
      iii. Specify the local database by clicking the Browse button. If you keep the default, you will see the zebrafish.rna file. Double-click it to select it.
      iv. Set other BLAST parameters if necessary. Leave them as default if you do not know what they mean.
      v. Click the Done button to start local BLASTing. Once the BLASTing is finished (it may take quite a while depending on your computer speed), you will see the output, with some ESTs annotated with zebrafish genes.
      vi. Many of the ESTs have now been annotated against zebrafish genes, with highly significant e-values. You may note that sequence A2 has no match.
   c. A few hidden functions
      i. Now right-click anywhere in the Matched gene column and click Find. In the dialog, enter casein kinase 1 (without quotes) and click OK. You will find three genes (D5-D7) highlighted in red (you may have to scroll down to see them). If the sequences are from your own cloning experiment, this would mean that the transcript of the casein kinase 1 gene has been cloned multiple times, and they all match the same zebrafish casein kinase 1 gene (NM_152951.1). This provides useful information in two ways. First, the casein kinase 1 gene in goldfish must be highly expressed in the brain tissue. Second, if you are studying gene expression by spotted cDNA microarray, then there is no need to spot these replicate clones of the casein kinase 1 transcripts into multiple sets of probe cells. One set of probe cells is sufficient.
      ii. Now right-click the GeneID entry for sequence D5 (i.e., NM_152951.1) and click GenBank Sequence. The annotated zebrafish casein kinase 1 gene is displayed for you to obtain further information about the gene.
      iii. You may also right-click the GeneID entry for sequence D5 (i.e., NM_152951.1) and then click Show HSP (HSP stands for high-scoring sequence pair) to see the details of the matched segments.
      iv. Click File > Save to save the sequences.
          Whenever possible, provide an informative file name and save the file to your own personal working directory.
3. Remote search against the CDD database hosted at NCBI.
   a. Click Func.Pred. > CD Search and set the parameters (which are self-explanatory) in the ensuing dialog box. If you do not know them, just use the defaults. In particular, you should not change the URL for CDD hosted at NCBI unless (in the very unlikely case) you have a local mirror of the NCBI databases.
   b. You will be asked to specify a translation table. This is because the CDD database is a protein database and we need to translate our nucleotide sequences into protein sequences. All known translation tables have been implemented in Goldminer. For our sequences, the first (Standard) translation table should be used. Goldminer will then translate each sequence in three frames and search all of them against the CDD database.
   c. The checkbox Use complementary sequence is for searching the database with the complement of the EST sequences. For the first run, you should leave it cleared. If the input sequences find no match, then you can run the search again with this check box checked.
   d. Click the Submit button to start. It may take a long time to finish, depending on how heavily loaded the CDD server is. A progress bar is implemented.
   e. Once the search is complete, most sequences will have been functionally annotated. A few of them will find no match. These may represent genes new to science.
   f. Click File > Save annotation with sequences to save the sequence annotations. Given the long waiting time that you have suffered through, it would be silly not to save the results in a secure directory.
   g. Click Func.Pred. > CD Search again. In the ensuing dialog box, select CD-Search sequences with no match, and check the Use the complementary strand check box.
   h. Click the Submit button to start searching the CDD database using the complementary strand of those ESTs with no match.
   i. Once the search is complete, click File > Save annotation with sequences to save the sequence annotations.
   j. Right-clicking anything in the CDSID column and then clicking CD-Search Gene will take you to the CDD seed protein in its function group. For example, right-clicking the first CDSID, i.e., DEAD, will take you to the full annotation of the DEAD/DEAH box helicase at the CDD server.
4. Search against a pfam server. Searching against a remote pfam server is slow. If you have many ESTs and really have to use pfam, then you should have a locally installed pfam server. Searching against pfam may not yield anything new after you have already searched against the CDD database.
   a. Click Func.Pred. > pfam and set the parameters. If you do not know them, just use the defaults.
   b. Click the Submit button to start. It may take a long time to finish, so a progress bar is implemented.
   c. Once the search is complete, most sequences will have been functionally annotated. A few of them will find no match, which may be due to wrong translation.
   d. Click File > Save annotation with sequences to save the sequences.
   e. Click Func.Pred. > pfam again. In the ensuing dialog box, select CD-Search sequences with no match, and check the Use the complementary strand check box.
   f. Click the Submit button to start.
   g. Click File > Save annotation with sequences to save your file.
   h. Right-clicking anything in the pfamid column and then clicking pfam Gene will take you to the pfam seed protein in its function group. For example, right-clicking the first pfamid, i.e., DEAD, will take you to the full annotation of the DEAD/DEAH box helicase at the pfam server.
5. Remote BLAST against NCBI database: For sequences with no match after searching against all the databases for protein functional classification, your last resort is to search against GenBank in the hope of getting a match with some information for functional inference. It is often impractical to store all databases locally because of the sheer amount of disk space needed and because it is very difficult to keep updating these terabyte-size databases. So we will take advantage of the regularly updated databases maintained at NCBI.
   a. Click BLAST > ReBlast against genomic DB.
   b. A dialog appears. In the EST option, set the option to ReBlast ESTs with e-value greater than 0.01 (or smaller). In the bottom frame, choose the option to Blast against NCBI databases. What is an e-value? What does the default e-value of 0.01 mean?
   c. Specify the NCBI database (or just leave the default of nr, which stands for non-redundant) and set other BLAST parameters if necessary (or just use the default values).
   d. Click the BLAST button to start BLASTing against the chosen NCBI database. Note that the NCBI BLAST server often needs to handle thousands of queries per hour and is prone to being flooded. We could be selfish and send all queries to BLAST quickly, but selfishness is incompatible with a civilized society.
      So Goldminer will send only one query EST at a time and does not send another until the first has been processed. This guarantees that NCBI will never identify us as bad citizens (or the Goldminer programmer as an inconsiderate scientist). You can leave Goldminer to do its job and go about other business. Because of the slowness, a progress bar is implemented to alleviate your frustration (no progress bar is implemented for the local BLAST, which is fairly fast).
   e. Once the BLASTing is over, those sequences that did not have matches or had only poor matches may find new matches or stay the same as before. You may note that A2, which did not have a match after local BLAST, now has a good match. At this point there is still no information on functional classification.
   f. Click File > Save to save the sequences.
6. Create local BLAST file for local BLASTing
   a. Launch your WWW browser to http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide
   b. Type in Carassius auratus (which is the Latin name for goldfish) or more general taxonomic terms in the search box.
   c. Click Limit and set the Limit to dropdown box to Organism.
   d. Click Go to search the goldfish sequences. You will get a list of at least 700 goldfish sequences.
   e. In the Display dropdown box, choose Fasta.
   f. In the Send to dropdown box, choose File, and save the sequences to Goldfish.FAS. The file can reside in any directory, but for consistency we will save it in the same directory as the goldminer program.
   g. Back in goldminer, click BLAST > Format genomic BLAST DB. In the ensuing dialog box, enter Goldfish or anything meaningful in the Title for database file and Base name for BLAST file boxes. Click Add files to be formated and browse to where you have saved the Goldfish.FAS file. Highlight the file name so that it will appear in the File name textbox. Click OK.
   h. Click the Go button to format the file. The display panel will tell you that the formatting is complete.
   i. To use this new local goldfish BLAST file for local BLASTing, repeat steps 2.a-b, except that the local BLAST file will be Goldfish.nsq instead of zebrafish.rna.nsq.
7. Contig assembly. The contig assembly function is independent of the functional annotation. It is a bad idea to assemble the sequences first and then perform functional annotation on the reduced number of sequences. This is because two overlapping ESTs may NOT necessarily be from the same transcribed RNA and may belong to different protein families.
   a. Click Sequence > Contig Assembly.
   b. A dialog box appears. Here is a brief explanation in case you do not know the meaning of the contig assembly options:
      i. Sequence quality options: The beginning and end of a sequence fragment are less reliable than the middle section and may have many base-calling errors. All automatic DNA sequencers come with base-calling software that will perform an analysis of base-calling quality and let you know the bases that are inferred with little confidence. Some base-calling software may allow you to set the option to trim off the unreliable ends. In that case you should change the default 20 and 500 to 1 and a number greater than the length of the longest sequence fragment, respectively. In other words, you are telling Goldminer that every base in the input sequences is good.
      ii. Alignment parameters: The gap open penalty of 0 specifies local sequence alignment. The Gap extension and Mismatch score are the penalties against gap extension and mismatch. For sequences with no base-calling error, there should be no gap or mismatch, and gaps and mismatches should be penalized severely (hence -6 and -6, respectively, by default). Base-calling errors increase the chance of having gaps and mismatches in local sequence alignment, hence the reduced penalty for both (-2 and -3, respectively, by default).
      iii. Decision parameters: These are parameters for heuristic string matching algorithms that increase the speed of computation. You may leave them as default.
   c. Now click the Go button. The contig assembly will be performed automatically.
   d. For sequence fragments that have been merged into one, the new sequence name will be in the form of SeqName1+ SeqName2+ SeqName3- ..., meaning that EST1 has its 3'-end overlapping the 5'-end of EST2, which in turn has its 3'-end overlapping the 5'-end of the complementary strand of EST3 (the - sign indicates the complementary strand; + means the original input sequence), and so on. If one sequence is entirely embedded in another sequence, then the former is omitted from the new name. If the result is from the sample file, CaNCBI.fas, then H4 is entirely embedded inside A10 and H4 will not appear in the name of the assembled contig.
   e. Keep in mind that an assembled contig is a hypothesized neighbor relationship among the ESTs and may not be correct. Look at the detailed output instead of believing the output blindly.
8. A few miscellaneous items:
   a. The column width can be user-resized.
   b. The last column is for custom annotation.
   c. Clicking the top-left cell highlights the entire sheet. Clicking a column or row heading highlights the entire column or row, respectively.
9. There are a number of functions accessible from the popup menu:
   a. If a sequence tag (e.g., Seq1 above) does not have a sequence entry but you obtained the sequence later and wish to add it, just left-click the sequence name (in the first column) to highlight the entire row and then right-click to access the popup menu. Click 'Change sequence' to add the new sequence information.
   b. To append an entry: Right-click to access the popup menu and then click 'Append a row'.
   c. To copy an entire sheet, a column, a row or a cell (e.g., to EXCEL), first select it, then right-click to access the popup menu, and then click Copy.

References:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215:403-410.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.
Ayoubi P, Jin X, Leite S, Liu X, Martajaja J, Abduraham A, Wan Q, Yan W, Misawa E, Prade RA (2002) PipeOnline 2.0: automated EST processing and functional data sorting. Nucleic Acids Res 30:4761-9.
Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res 27:260-2.
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR (2004) The Pfam protein families database. Nucleic Acids Res 32:D138-41.
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.
Davila AM, Lorenzini DM, Mendes PN, Satake TS, Sousa GR, Campos LM, Mazzoni CJ, Wagner G, Pires PF, Grisard EC, Cavalcanti MC, Campos ML (2005) GARSA: genomic analysis resources for sequence annotation. Bioinformatics 21:4302-3.
Frishman D, Mironov A, Mewes HW, Gelfand M (1998) Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 26:2941-7.
Hayes WS, Borodovsky M (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 8:1154-71.
Huang XQ (1992) A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps. Genomics 14:18-25.
Koski LB, Gray MW, Lang BF, Burger G (2005) AutoFACT: an automatic functional annotation and classification tool. BMC Bioinformatics 6:151.
Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res 32:D142-4.
Mao C, Cushman JC, May GD, Weller JW (2003) ESTAP--an automated system for the analysis of EST data. Bioinformatics 19:1720-2.
Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Bryant SH (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33:D192-6.
Martin DM, Berriman M, Barton GJ (2004) GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5:178.
Meyer IM, Durbin R (2004) Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res 32:776-83.
Paquola AC, Nishyiama MY, Jr., Reis EM, da Silva AM, Verjovski-Almeida S (2003) ESTWeb: bioinformatics services for EST sequencing projects. Bioinformatics 19:1587-8.
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26:544-8.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
Tech M, Merkl R (2003) YACOP: Enhanced gene prediction obtained by a combination of existing methods. In Silico Biol 3:441-51.