Genome Explorer For Comparative Genome Analysis

Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2

Abstract

Genome Explorer brings together the tools required to build and compare phylogenies from both sequence and gene order data. It also allows hypothesis testing through simultaneous simulation of chromosomal and sequence evolution. It was written specifically to make the tools easier to use for those familiar with a windows-style environment: the popup windows and wizards that collect data and set parameters to run the programs all share a common design and are logical and intuitive to use. Many of the available tools are useful on their own but are frequently used as part of more complex analyses. Genome Explorer anticipates both uses and therefore has a modular design that allows each tool to be run independently, preserving its full range of functionality and leaving users fully in control of their data. All classes that run bioinformatics programs implement a common interface, all parameter collection panels are derived from a single class, and the parameters for each program are stored in a customised object. This design makes Genome Explorer easily extendable, enables tools to be readily chained into 'pipelines' and allows all outfiles to be preserved when tools are run consecutively.

Introduction

Many bioinformatics tools are traditionally considered by bench scientists to be UNIX programs, accessible via X-windows connections to site-provided UNIX boxes for which they need an account and password. If this is too daunting, they visit websites to perform BLAST searches and wait for the results to be sent by e-mail. This need not be the case. Most tools have a Windows version operated from a command line similar to that of their UNIX counterpart. These tools may be downloaded free as executables and are very easy to set up. Genome Explorer, though written in Java for portability, is intended for use on a PC running the third party components locally.
With more complete genomes becoming available, and organisations such as EBI and TIGR providing easy ftp access to data through their websites, downloading sequence information directly relevant to an area of research and storing it on a PC is a simple procedure. This would mean, for example, that BLAST results could be instantaneous.

Genome Explorer aims to facilitate phylogeny generation from 'raw data' by providing a single entry point to the different programs required. A typical user may have a single sequence in fasta format that needs to be compared to a database of sequences in search of homologues (BLAST (Altschul et al. 1997)), the homologues extracted (by parsing the BLAST report and then using the BLAST program FASTACMD) and aligned (CLUSTALW (Thompson et al. 1994)), the alignment edited and saved, the final alignment resampled in preparation for bootstrapping (PHYLIP (Felsenstein 1993) seqboot), and finally a phylogeny constructed (a further three PHYLIP programs). Increasingly, users are investigating phylogeny based on gene order, in which case the raw data might simply be two genomes with chromosomes in fasta format. These can be compared for homologues (BLAST), which allows gene-order files to be created (by a custom program in Genome Explorer) and passed to a program (CHROMTREE (Dicks 2000)) that produces a maximum likelihood gene order tree or, alternatively, a distance matrix for input to PHYLIP.

These are processes that many biologists go through on a regular basis. Bench scientists are used to looking at BLAST reports to assess the quality of the matches; they like to edit multiple sequence alignments manually to ensure the gaps are sensible or realistic; and they like to have "quick and dirty" phylogenetic trees that give them an idea of the direction their research is taking. Genome Explorer takes into account the way biologists use bioinformatics software in the course of their research and makes the whole process easier for them, without presuming to dictate which infiles, outfiles, or pieces of information are particularly relevant. For example, the modularity of Genome Explorer's functions lets any program be run individually, with the user specifying infiles and outfiles, though many programs have also been chained into 'pipelines'. These allow the user to set parameters for several programs that will then be run sequentially without further user interaction. The pipelines simply chain programs so that all the outfiles exist as if the programs had been run one at a time. The user can therefore take data produced at any point in a pipeline, check the output, and then potentially rerun the next program in the chain with different parameters.

1 Computational Biology Research, John Innes Centre, Norwich, NR4 7HU, UK
2 National Collection of Yeast Cultures, Institute of Food Research, Norwich, NR4 7EG, UK
3 Genome Explorer has been developed using BBSRC funding, and therefore only uses software that is freely available for academic use.

System Design

Genome Explorer is designed with a two-layer architecture: the user interface and the working programs. There is no data layer because all data is input from files in a recognised format (e.g. fasta, PHYLIP, CLUSTALW alignment) and output to files. This puts users firmly in control of their own data and enables them to look at results produced by tools with which they are familiar, while removing the problems associated with running those tools.
All parameter-collecting Graphical User Interfaces (GUIs) extend the BioProgramPanel class, which provides many useful methods for creating common GUI components (browse buttons and list boxes for input files, for example) and for checking the data entered by the user, as well as several abstract methods that are implemented uniquely in each child class. One of these (the getparameters method) returns an object containing all the parameter information entered by the user. This object is used to construct an instance of the class that will actually perform the analysis.

Every tool linked to Genome Explorer is run from a class implementing the RunBioProgram interface. All of these classes expect a parameters object at construction containing all the relevant information for the program they will run. Various interface methods ensure that all programs (whether accessed externally via a command line, or implemented directly by the method author) can be run from within the main GUI and their progress reported to the user. It is therefore simple to create an instance of an object to run one program from within another. In this way, programs that are commonly run sequentially by the user can be run consecutively from a single GUI, with outfiles from one program being used as input files for the next. Figure 1 shows the modularity of the system, emphasising file formats.

The RunBioProgram interface is designed to allow objects that implement it to be interrogated about their progress. The loadnextinfo method loads the data required for the next run of a program. This usually involves checking that the input file exists, reformatting it if necessary, checking that the output file can be written, and updating internal variables holding the input filename, input file size, and input file number. It returns TRUE if everything is set for runnext to be called. The runnext method processes the data in some way, provided that loadnextinfo completed successfully.
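
The relationship between a parameters object, a class implementing RunBioProgram, and the calling GUI might be sketched in Java as follows. This is only an illustration under stated assumptions: the method signatures are guesses based on the method names in the text (including the morefilestorun method the paper describes alongside loadnextinfo and runnext), and CountParameters, FastaRecordCounter and RunnerSketch are hypothetical names for a toy tool that counts the records in each fasta input file. They are not part of Genome Explorer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Assumed shape of the interface; the real signatures are not given in the text. */
interface RunBioProgram {
    boolean loadNextInfo();    // prepare the next input; true if runNext may be called
    void runNext();            // process the input prepared by loadNextInfo
    boolean moreFilesToRun();  // true while loadNextInfo still has inputs to load
    String getMessage();       // progress text for the GUI
}

/** Hypothetical parameters object, as a panel's getparameters method might return. */
final class CountParameters {
    final List<Path> inputFiles;

    CountParameters(List<Path> inputFiles) {
        this.inputFiles = new ArrayList<>(inputFiles);
    }
}

/** Toy tool: counts fasta records in each input and writes one outfile per input. */
final class FastaRecordCounter implements RunBioProgram {
    private final CountParameters params;
    private final List<Path> outFiles = new ArrayList<>();
    private int loaded = 0;   // number of inputs loaded so far
    private Path current;     // input prepared by the last loadNextInfo call

    FastaRecordCounter(CountParameters params) { this.params = params; }

    @Override public boolean loadNextInfo() {
        if (!moreFilesToRun()) return false;
        current = params.inputFiles.get(loaded++);
        return Files.isReadable(current);
    }

    @Override public void runNext() {
        try {
            long records = 0;
            for (String line : Files.readAllLines(current, StandardCharsets.UTF_8)) {
                if (line.startsWith(">")) records++;
            }
            // Name the outfile after its input so one run never overwrites another.
            Path out = Paths.get(current.toString() + ".count");
            Files.write(out, Long.toString(records).getBytes(StandardCharsets.UTF_8));
            outFiles.add(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public boolean moreFilesToRun() { return loaded < params.inputFiles.size(); }

    @Override public String getMessage() {
        return "file " + loaded + " of " + params.inputFiles.size() + ": " + current;
    }

    List<Path> outFiles() { return outFiles; }
}

final class RunnerSketch {
    /** Driver loop a GUI could use for any RunBioProgram implementation. */
    static void runAll(RunBioProgram program) {
        while (program.moreFilesToRun()) {
            if (program.loadNextInfo()) {
                program.runNext();
                System.out.println(program.getMessage());
            }
        }
    }

    /** Runs the toy tool on one temporary fasta file and returns the outfile's text. */
    static String demo() {
        try {
            Path in = Files.createTempFile("demo", ".fasta");
            Files.write(in, ">a\nACGT\n>b\nGGCC\n".getBytes(StandardCharsets.UTF_8));
            FastaRecordCounter tool =
                new FastaRecordCounter(new CountParameters(Collections.singletonList(in)));
            runAll(tool);
            return new String(Files.readAllBytes(tool.outFiles().get(0)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because runAll depends only on the interface, the same loop would serve a wrapper around an external command-line program as well as a tool implemented directly in Java, which is the property the design relies on when chaining tools.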
The morefilestorun method returns TRUE if loadnextinfo needs to be run again and can therefore be used as a condition to loop calls to the loadnextinfo and runnext methods. The getmessage, getcurrentfilesize, getcurrentfilename, totalnumbertorun, and currentnumber methods can be used to construct accurate progress reports to relay to the user. Detailed progress reports that show the size of the file being run give the user confidence to wait patiently for results. The writeoutfilelinkhtml method must return the path to an html file presenting links to all the files output by this RunBioProgram instance.

Figure 1 - Modularity of Genome Explorer - input and output file formats. This diagram shows the different file formats used as input and produced as output by the various programs available in Genome Explorer. Many of the programs are chained into pipelines for the user's convenience, but output files generated at every stage are preserved so that parts of the chain can be rerun if desired. Custom components are named in italic.
* DIANA: Defined Interval Amino acid Numerating Algorithm (Michelitsch and Weissman 2000)

Components

Genome Explorer uses BLAST programs (including those for creating blastable databases from fasta files and then retrieving sequences by their ID) for sequence comparisons, CLUSTALW for multiple sequence alignment, CHROMTREE for maximum likelihood gene order phylogeny generation and gene order distance matrix construction, and the PHYLIP suite for phylogeny generation. Only the PHYLIP programs required by current users in our laboratory have been included, because PHYLIP is menu driven and has no command line interface and is therefore more difficult to incorporate than other programs.

A program must often be run independently for every input file that needs to be processed. This is not a problem for users familiar with a scripting language, but it is time consuming and limiting for the majority of users who lack such expertise. Genome Explorer parameter objects take arrays of input file paths, and the desired program is executed on each file in turn, always using the same set of parameters. Genome Explorer renames generic outfiles to something meaningful so that they are not continually overwritten, and allows the main GUI to access a message describing the program's progress so that the user does not become impatient. Finally, as Genome Explorer is a file-based system, an outfile is available to the user as soon as it has been written; there is no need to wait for the rest of the input files to be processed.

Custom Components

Custom software has been written to provide utility functions and to help link one tool to the next. Several search functions enable the user to perform simple text-based searches on a fasta file of sequences. These can be used to characterise amino acid or nucleotide content or to search for particular combinations of amino acids. Though computationally simple, these searches provide the sort of information that biologists are often curious about. A program is provided to parse multiple BLAST reports and summarise the information in a single file. It can also be used to generate files of "hit ids" to be retrieved from the BLAST database and written to a fasta file. A simple multiple sequence alignment editor with basic search and identity percentage functions has also been included.

The simulation functions in Genome Explorer are provided by EVOLVE. EVOLVE simulates the evolution of a phylogeny of species from a single seed genome. The mechanisms of evolution mimic those that occur naturally, but are controlled to a large extent by user-defined parameters. A wide range of evolutionary events is represented, including point mutations, chromosomal rearrangements based on misalignment of identical sequences, and polyploidy. Speciation is simulated by making an identical copy of the genome and allowing it to evolve independently. The user can specify the rates of evolution, the final number of species in the phylogeny and the number of generations to evolve for.
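
To make the idea of speciation by copying concrete, a toy Java sketch is given here. It is not EVOLVE's algorithm: it models only point mutations on a single nucleotide string, triggers speciation at a fixed interval, and every name in it (ToyEvolve, simulate, mutate, speciationInterval, randomSeed) is hypothetical. It simply shows how a mutation rate, a final species count and a number of generations could drive a simulation that starts from one seed genome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy illustration only; EVOLVE itself models many more event types. */
final class ToyEvolve {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    /**
     * Point mutation: each site is redrawn at random with probability rate.
     * The redrawn base may equal the original, so the realised change rate is lower.
     */
    static String mutate(String genome, double rate, Random rng) {
        StringBuilder sb = new StringBuilder(genome);
        for (int i = 0; i < sb.length(); i++) {
            if (rng.nextDouble() < rate) {
                sb.setCharAt(i, BASES[rng.nextInt(BASES.length)]);
            }
        }
        return sb.toString();
    }

    /**
     * Evolves a population of genomes from a single seed genome.
     * Every speciationInterval generations one randomly chosen species is copied,
     * until finalSpecies genomes exist; each copy then mutates independently.
     */
    static List<String> simulate(String seed, int finalSpecies, int generations,
                                 double rate, int speciationInterval, long randomSeed) {
        Random rng = new Random(randomSeed);
        List<String> species = new ArrayList<>();
        species.add(seed);
        for (int g = 1; g <= generations; g++) {
            for (int i = 0; i < species.size(); i++) {
                species.set(i, mutate(species.get(i), rate, rng));
            }
            if (g % speciationInterval == 0 && species.size() < finalSpecies) {
                // Speciation: an identical copy that evolves on its own from now on.
                species.add(species.get(rng.nextInt(species.size())));
            }
        }
        return species;
    }

    public static void main(String[] args) {
        for (String genome : simulate("ACGTACGTACGT", 4, 10, 0.05, 2, 42L)) {
            System.out.println(genome);
        }
    }
}
```

Fixing the random seed makes a run repeatable, which matters when simulated phylogenies are used to compare software, since the same "true" history can then be regenerated on demand.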
EVOLVE can output a treefile based on the true phylogeny (with distance measured in generations), gene order data, and/or a fasta file of sequences for every chromosome, depending on user requirements. EVOLVE can therefore be used to generate phylogenies with restricted parameters for use in hypothesis testing and software development.

Pipelines

BLAST searches are of great utility to biologists and often need to be performed against a custom fasta file of sequences (to assist in sequence alignment, for example). To do this, the fasta file must first be converted to a blastable database using a program provided with BLAST. If the sequences are not named in the right way, BLAST will be unable to index them, and therefore unable to retrieve sequences by their identity at a later stage. Genome Explorer allows a user to elect to BLAST against a fasta file, and creates the blastable database as part of the process, renaming sequences to conform to indexing conventions if required. It also parses the BLAST reports to create a list of "hit ids" and retrieves them from the database, outputting a fasta file of all hits plus the query sequence, ready for the user to align in CLUSTALW. This extra functionality is hidden behind a single checkbox on the parameter-collecting GUI. All of these functions are available individually, so users can traverse the pipeline manually, or "hop" into it at any point.

To generate gene order files describing the gene order of several chromosomes or species, a fasta file must be provided for each chromosome, with the sequences written in the right order. A custom component of Genome Explorer then parses these files into a single fasta file, renaming each sequence to identify its species, chromosome and position. It then uses BLAST tools to create a blastable database and search for homologues to each gene. It outputs a homology file (a Genome Explorer defined format) detailing which genes are homologous. Another program then parses the homology file into a gene order file. This file may then be input to CHROMTREE and subsequently PHYLIP to estimate a gene order phylogeny. These processes are kept distinct so that the user can deliberately introduce errors in the gene order during parsing. This function enables software developers to test the robustness of algorithms that generate distance matrices from gene order files.

PHYLIP programs can be run consecutively to build phylogenies. Each program takes a single infile, and parameters are set via a text menu. Genome Explorer links to the most popular PHYLIP programs from a single parameters GUI. From there the user can select the programs they wish to run consecutively, set parameters for each, and select any number of input files. Every input file is then run through the entire chain of selected programs. All text that would have appeared on screen had the user run PHYLIP manually is saved in a file, so users can reassure themselves that the parameters they set were actually used.

Future Work

A very simple procedure for adding functions or third party software to Genome Explorer is under development. At present, it is easy to write a parameters object, a parameter-collecting GUI and a class to process the data. We now plan to make it easier to integrate them into the main Genome Explorer GUI through the use of a wrapper class.

Availability of software

The software is available by contacting [email protected]

Acknowledgements

This work was supported by BBSRC Grant Ref. No. 99/A1/G/05563.

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25,
Dicks, J. (2000) CHROMTREE: Maximum likelihood estimation of chromosomal phylogenies. In Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families (eds. D. Sankoff and J.H. Nadeau). Kluwer Academic Press, Dordrecht. pp
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.
Felsenstein, J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, 5,
Michelitsch, M.D. and Weissman, J.S. (2000) A census of glutamine/asparagine-rich regions: Implications for their conserved function and the prediction of novel prions. PNAS, 97 (22),
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22,