Bioinformatics and escience

W.A. Gray (1) and C. Thompson (2)

(1) Cardiff University, School of Computer Science, PO Box 916, Cardiff CF24 3XF
(2) BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH

Abstract

This paper gives a brief overview of the diverse field of bioinformatics, identifying research themes. It then introduces the BBSRC funded escience pilot projects against this overview by presenting their biological aims. Areas of the escience programme addressed by these projects are identified to determine the contribution they are expected to make to escience. This covers contributions to the developing Grid middleware and associated escience standards as well as their bioinformatics goals.

Acknowledgement

The authors thank the staff and PIs of the six projects who contributed material on which this paper is based. These are:

1) (e-protein) A distributed pipeline for structural-based proteome annotation using GRID technology Prof MJE Sternberg (PI) Imperial College, University College London, European Bioinformatics Institute [www.e-protein.org]
2) (BioSimGRID) A GRID database for biomolecular simulations Prof MSP Sansom (PI) Oxford University, Southampton University, Birkbeck College, York University, Nottingham University, Birmingham University [www.biosimgrid.org]
3) (e-htpx) An escience resource for high throughput protein crystallography Dr C Nave (PI) CLRC Daresbury Laboratory, Cambridge University, Cardiff University, European Bioinformatics Institute, York University, Oxford University [www.e-htpx.org]
4) (BDWorld) A problem solving environment for global biodiversity: prototype and demonstrator Prof FA Bisby (PI) Reading University, Cardiff University, Southampton University, Natural History Museum [www.bdworld.org]
5) GRID-enabled modelling tools and databases for neuroinformatics Dr N Goddard (PI) Edinburgh University {jointly funded with MRC} [www.axiope.org and www.anc.ed.ac.uk]
6) (BASIS) Biology of ageing escience integration and simulation
system Prof TBL Kirkwood (PI) Newcastle University [www.basis.ncl.ac.uk]

1. Introduction

In May the BBSRC held a meeting of its escience pilot projects. It was agreed at the meeting that a paper should be presented at this All Hands Meeting covering the six pilot projects, their bioinformatics goals, and how they will contribute to the aims of the UK escience programme. The intention is to show the types of bioinformatics research enabled by an escience approach and how these projects will drive and contribute to the escience developments occurring in parallel in other research disciplines across the UK escience programme.

Bioinformatics can be described as the derivation of knowledge from computer analysis of biological data. This is a simplistic view, as the discipline includes a wide range of scientific investigation. It is not a homogeneous domain, so there is a wide range of opinions as to what it comprises. In its broadest sense it is the application of informatics techniques to biological data in a research or application environment. These techniques come from a number of disciplines, including computer science, statistics and mathematics. This list is by no means exhaustive, as researchers also utilise techniques from engineering, physics and other scientific disciplines.

In their Strategic Plan 2003-2008 [1] the BBSRC recognise the growing importance of bioinformatics in their domain when they state: "Genome sequencing and post-genomic technologies provide researchers with massive amounts of data. As a consequence biology is becoming more quantitative. Large experimental data sets will increasingly allow computer (in silico) simulation of biological systems." This points to the central role bioinformatics will play in the next generation of bioscience research.

The BBSRC identify in [1] the following areas as important to their research agenda in non-clinical bioscience: Integrative biology, Sustainable agriculture, The Healthy organism, and Bioscience for industry. Within these areas there are different levels of biological structure, from individual molecules, cells and tissues/organs through populations to microbes, plants and animals. Research is needed both within these levels and across them. This research underpins the growth in bioinformatics because it generates large amounts of data, which need to be analysed, integrated and used in simulations to test new ideas. The second strategic objective in [1] recognises that bioscience is increasingly dependent on the development and use of e-tools, because of the data being collected at the omics level of research. This should lead to more productive in silico research and to the development of new bioinformatics tools.
These tools will be needed in areas such as data mining, pattern recognition, model building and data sharing across the vertical and horizontal biological structure levels.

Traditional bioscience research also needs access to the data collections made over the centuries by institutes such as the Natural History Museum. These collections must be prepared in machine-readable formats that allow investigations which shed new light on biodiversity and on the effects on it of changes in climate, agricultural policy and government policy. It is important that UK research in bioinformatics links with other efforts at national and international levels, e.g. the GBIF (Global Biodiversity Information Facility) initiative in biodiversity.

2. The Pilot Projects

When the pilot project proposals were submitted the BBSRC had identified four theme areas for these projects: Genomics, Structural studies, Cellular processes, and Biodiversity. The successful projects covered topics within these areas, although there was no specific pilot project in the genomics area.

The BASIS project at Newcastle University [2] is concerned with utilising emerging Grid middleware technology to develop a system supporting a research community that is investigating quantitatively the biology of ageing at the cell, tissue and organism levels. It is creating new tools that use SBML (Systems Biology Markup Language) to help a researcher build computer-based models of ageing processes and test them. At the tissue level these models will be based on fibroblasts, gut and brain, with experiments being conducted to determine the effect of random cell death in a tissue. Users will be able to set up experiments involving the creation of new models or the modification of existing models, which will let them gain a deeper understanding of the effects of tissue ageing on organisms.
It is intended that this facility will be available to a distributed community of researchers who will share their results, models and analytic tools. The project thus aims to create a facility that allows simulation by system modelling both within a structure level and across levels. The development team is working closely with another local team who are developing Grid middleware for accessing data held in databases in the OGSA-DAI project. Their requirements are therefore informing the development of this middleware, and they are utilising it in the development of the system.

e-protein aims to provide a structure-based annotation of proteins in major genomes, which can be disseminated to other researchers. Alternative annotation approaches will be investigated to identify improvements in the methods of annotation. The results will be used to build local databases holding structural and functional annotation of sequence data, which can be linked with relevant bioinformatics data resources at other sites. Improving protein modelling is a prime aim of the project, so that better function predictions can be made. The system is intended to be available to a research community collaborating in their investigations, so that they can investigate and test alternative structure models. It will have a workflow-based interface which makes use of the ICENI middleware and its rich metadata structure for describing software tools. This middleware is being developed in a related Grid project which involves several of the e-protein investigators. As in BASIS, this project will be informing the development of the Grid middleware it requires.

e-htpx is addressing the problem of unifying the procedures of protein structure determination so that they can be accessed through a single interface, allowing structural biologists to create models from the data generated by high throughput protein crystallography. This will involve creating new structure determination software which can take advantage of HPC facilities, so that the results of the structure determination can be delivered on the same time scale as data collection. Data generated in these experiments will be stored at the EBI as an available resource for the research community. The project will involve giving users access to instruments, data collections and analytic tools.
This project is primarily concerned with building structural models within a structure level. As the system is expected to have a number of industrial users, an important concern is the authentication of users to protect the system against unauthorised use. The development team will be consulting potential industrial users to determine their requirements in this area, and will use the outcome to see whether those requirements can be supported by Grid facilities.

BioSimGrid aims to allow the results of multiple biomolecular simulations to be compared, so that the structure of proteins and nucleic acids can be better understood. Its users will have access to large quantities of simulation data, which will be integrated in further simulation experiments testing theories in structural biology within structure levels, with the capability to reuse this data in cross-level modelling of structures. This data will require curation. These secondary analysis experiments will need data mining services to locate relevant data within the databases, as well as data analysis tools. Some of the research community using this system will be working in commercial organisations in the pharmaceutical industry. User authentication will therefore be needed before access is allowed to some of the data and tools, as commercial confidentiality must be protected. The project consequently has an interest in investigating mechanisms for distributed authorisation and accounting. There is also a requirement to link the simulation data with other biological and structural data held in national repositories, to allow the development of richer, more sophisticated models.

BDWorld [3] is creating a problem solving environment in which researchers can locate appropriate analytic tools and data resources held at different sites in the environment. These tools and data can then be linked in a workflow which produces results relevant to an investigation into a biodiversity problem at the species level.
This may be a question such as: what will be the effect on a species' distribution if global warming occurs, or could this plant become invasive if it is introduced as an agricultural crop in a region? The system utilises a partial catalogue of life and other biotic and abiotic data, such as climate envelopes, in three exemplar studies to prove the concept: biodiversity richness analysis; bioclimatic modelling and climate change; and phylogenetic analysis and biogeography. This will involve the system linking heterogeneous legacy and current data collections so that it can interoperate on this data using a variety of software tools with different data format expectations. It is intended that this work will link with the GBIF system being developed in an international effort, as its resources will complement GBIF's. The system must be able to evolve by adding new data collections and software tools to its distributed resources. This requires legacy resources to be wrapped as they join, so that they are consistent with the standards for data within BDWorld. The system is primarily concerned with the microbe, plant and animal level of biology, although it will be able to support some lower level analyses. At the moment a basic BDWorld system is being built, but its design is such that it will be able to evolve: by incorporating new tools and data resources; by adding ontologies which will help users discover the resources they need; and by incorporating more sophisticated display tools. The designers are aware that Grid middleware is being created in parallel with their development of BDWorld, and they are ensuring that BDWorld will be able to take advantage of appropriate Grid middleware when it has reached a suitable stage in its development.

The neuroinformatics project [4] is investigating how the brain functions. It is intended that the developed system will allow neuroscientists to work collaboratively, sharing their data and software tools. Research in this area needs to be undertaken at different biological levels and across the levels. The challenge is to allow the researchers to create their models collaboratively and conduct experiments on them. This involves being able to locate appropriate data so that it can be linked in the models or utilised by the models. This data, and the data produced by the models, must be available for use in future experiments. The research is being undertaken in collaboration with scientists at the Newcastle escience centre, who are looking after the database and Grid middleware aspects in a separate project. The prime concern of this project at the moment is creating data models and software tools that enable heterogeneous data to be easily shared and analysed by its user community.
The design team recognise that in the next phase of development the system will need to provide sophisticated visualisation tools and ontologies. They are therefore concentrating on creating a basic system environment that can evolve by adding such tools in the future.

3. Pilot Project e-science themes

These pilot projects display a number of escience themes. They are all aiming to support collaborative working within a research community whose members need to share data, results and software tools. This means they will need to support the discovery of relevant data sources, data and their descriptions, so that the different tools can analyse and share data. When accessing legacy data, they will need to overcome heterogeneity in data that was prepared over time for different purposes, and to develop new extended standards for the metadata describing this data so that its provenance can be established and stored.

Many of the projects are creating results which have to be stored so that other analytic tools can use this data in the future. This implies that the data will need to be curated, with provenance showing how it was created and which tools created it. There will also be a need to use and store descriptive data, so that the representation of the data is understood by researchers and can be interpreted by software tools.

There is some need for High Performance Computing (HPC), but it is not a major requirement of the projects. It arises when complex models are being built and the results of analysing and using the models are needed in real time for further analysis. e-htpx is the only project seeing this as a prime requirement, although the others may need some access to these facilities in the future.

All of the projects will be creating new software tools which need to be made available within the system for other users. These tools must be engineered so that they can link with existing tools and utilise the data available in the Grid system.
This means that there must be descriptions of these tools which enable them to be linked in analytic chains that can be executed by workflow engines. Several of the projects need workflow engines behind their user interface, to allow a user to create and execute workflows which perform the required analysis.

Most of the projects are aiming to support the building of structural models at different biological levels, which can be used to determine the functioning of a biological system and the effect of change on the system. As bioinformatics expands through the Grid, this is likely to become a growing area: researchers will create more and more complex models that are not limited to one biological level but interact across the levels to determine the effect of substructure change on the higher level structures. This growth in model complexity will also be reflected in a growing diversity in the data sources used in the models, as researchers investigate more fully the causes of change and evolution.

There is little emphasis at the moment on the need for sophisticated presentation tools which allow users to present information in different and more imaginative ways. This is probably because modelling is currently done mainly at a single level. Another reason could be that the structural modelling of biological systems is a relatively new technique; as it matures, modellers will require more sophisticated displays of the results from these complex models to make the outcomes easier for users to understand. It also reflects the current state of development of the systems, where these facilities are seen as a second stage of development and an unnecessary luxury until the basic systems are working.

Two projects are investigating the use of Grid authentication techniques. That only two do so reflects which projects currently have industrial links, rather than a lack of concern for authentication in the field. Again, as the field matures and this type of analysis becomes more accepted, this requirement will increase.

4. Expected effect on escience

It is clear that the pilot projects have fairly ambitious bioinformatics goals and that they do not see themselves developing middleware per se for the Grid, but co-operating with the projects that are developing the middleware. The e-protein and neuroinformatics projects are working closely with research groups that are developing middleware in separate projects and will utilise this software as it becomes available.
The e-protein team are working closely with the team developing the ICENI middleware at Imperial College, and the neuroinformatics project has a close link with the team developing the OGSA-DAI middleware at Newcastle University. These projects will have a direct influence on the development of these pieces of escience middleware. The other projects are keeping themselves informed of middleware developments and will utilise appropriate middleware when it is in a stable enough form. Until then they are likely to use alternative non-generic, limited-capability software that is already available or that they have developed themselves to meet their needs. However, they all intend to take full advantage of Grid middleware when it is stable.

All the projects have major data handling challenges, and one of the major contributions from this research programme should be insight into the future metadata requirements and standards in Grid environments. This covers the description of data and software as well as the provenance and curation of data. A major issue running through all the projects is the interoperation of data held in different data collections. Considerable insight should be gained from these pilots into how to describe and hold data so that this task is facilitated, especially with respect to the wrapping of legacy systems so that they can enter new environments easily.

Although it is not a major feature at the moment, these systems will need metadata repositories and ontologies to help users identify the resources they require among those held in the environments. These will be needed to help the researchers create the workflows that will carry out their analyses of the data. At the moment some of the projects are investigating or creating embryonic workflow engines, which will be used to execute the analytic chains of software tools that users create to specify the required analysis. This work should further inform the development of these workflow engines.
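
As an illustration only, the idea of tool descriptions that let a workflow engine link tools into an analytic chain, while recording provenance for the result, can be sketched in Python. The ToolDescription schema, the format labels and the two toy tools below are assumptions made for this sketch; they are not taken from ICENI, OGSA-DAI or any of the pilot projects.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class ToolDescription:
    """Hypothetical metadata record describing one software tool."""
    name: str
    input_format: str    # label for the data representation the tool expects
    output_format: str   # label for the data representation the tool produces
    run: Callable[[Any], Any]


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one step that contributed to a result."""
    tool: str
    input_format: str
    output_format: str


def check_chain(tools: List[ToolDescription]) -> None:
    """Raise ValueError if two adjacent tools have incompatible formats."""
    for upstream, downstream in zip(tools, tools[1:]):
        if upstream.output_format != downstream.input_format:
            raise ValueError(
                f"{upstream.name} produces {upstream.output_format!r} "
                f"but {downstream.name} expects {downstream.input_format!r}"
            )


def run_chain(tools: List[ToolDescription],
              data: Any) -> Tuple[Any, List[ProvenanceRecord]]:
    """Validate the chain, run each tool in order, and collect provenance."""
    check_chain(tools)
    provenance: List[ProvenanceRecord] = []
    for tool in tools:
        data = tool.run(data)
        provenance.append(
            ProvenanceRecord(tool.name, tool.input_format, tool.output_format)
        )
    return data, provenance


def _clean(names: List[str]) -> List[str]:
    # Normalise whitespace and case, drop blanks and duplicates.
    return sorted({n.strip().lower() for n in names if n.strip()})


# Two toy tools standing in for real analytic tools.
clean_names = ToolDescription("clean_names", "raw-names", "clean-names", _clean)
count_species = ToolDescription("count_species", "clean-names", "richness", len)

if __name__ == "__main__":
    raw = [" Quercus robur", "quercus robur ", "Fagus sylvatica", ""]
    richness, steps = run_chain([clean_names, count_species], raw)
    print("richness:", richness)
    for step in steps:
        print(step)
```

In this sketch the format labels play the role of the descriptive metadata: the chain is rejected before execution if one tool's output does not match the next tool's expected input, and the list of provenance records shows which tools produced the stored result. A real system would need far richer descriptions than a single label per tool.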
These projects all intend to develop basic systems which can evolve to meet future, as yet unknown, requirements. This will be an important feature of their system architectures, and the development of these pilots should give more insight into the best ways of building systems with this capability. This will be important in the development of sophisticated Grid systems, as it will not be affordable to recreate such systems from scratch.

5. Conclusions

The bioinformatics pilot projects will make meaningful contributions to the escience programme in creating the metadata standards required for bioinformatics data and tools. They will contribute to the definition of data curation and provenance standards. They do not intend to make a direct contribution to the development of the middleware required for the Grid, but their use of it as it evolves will inform its development. They will also identify new middleware requirements. In the future this research is expected to inform the development of the next generation of ontologies and data/resource discovery tools, and of more sophisticated presentation tools such as result visualisation.

However, the major contribution of these pilots will be as catalysts which encourage more bioinformatics research by demonstrating what can be achieved by collaborative in silico data experimentation and analysis in bioscience. The pilots will also create the basic systems which will allow the next generation of researchers to fully exploit this capability. This will enable the field to grow and will support the collaborative working needed to build and exploit the next generation of systems biology models.

References

1. World Class Bioscience, Strategic Plan 2003-2008, BBSRC, Swindon (2003)
2. Kirkwood TBL et al: Towards an e-biology of ageing: integrating theory and data, Nature Reviews Molecular Cell Biology 4, 243-49 (2003)
3. Bisby FA: Biodiversity Informatics, in Business (quarterly magazine of the BBSRC), 24-25, July 2003
4. Goddard N, Cannon R and Howell F: Axiope Tools for Data Management and Data Sharing, accepted by J Neuroinformatics to be published (2003)