Bioinformatics and eScience

W.A. Gray (1) and C. Thompson (2)
(1) Cardiff University, School of Computer Science, PO Box 916, Cardiff CF24 3XF
(2) BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH

Abstract

This paper gives a brief overview of the diverse field of bioinformatics and identifies its research themes. It then introduces the BBSRC-funded eScience pilot projects against this overview by presenting their biological aims. The areas of the eScience programme addressed by these projects are identified to determine the contribution they are expected to make to eScience. This covers their contributions to the developing Grid middleware and associated eScience standards as well as their bioinformatics goals.

Acknowledgement

The authors thank the staff and PIs of the six projects who contributed material on which this paper is based. These are:

1) (e-Protein) A distributed pipeline for structural-based proteome annotation using GRID technology. Prof MJE Sternberg (PI). Imperial College, University College London, European Bioinformatics Institute [www.e-protein.org]
2) (BioSimGRID) A GRID database for biomolecular simulations. Prof MSP Sansom (PI). Oxford University, Southampton University, Birkbeck College, York University, Nottingham University, Birmingham University [www.biosimgrid.org]
3) (e-HTPX) An eScience resource for high throughput protein crystallography. Dr C Nave (PI). CLRC Daresbury Laboratory, Cambridge University, Cardiff University, European Bioinformatics Institute, York University, Oxford University [www.e-htpx.org]
4) (BDWorld) A problem solving environment for global biodiversity: prototype and demonstrator. Prof FA Bisby (PI). Reading University, Cardiff University, Southampton University, Natural History Museum [www.bdworld.org]
5) GRID-enabled modelling tools and databases for neuroinformatics. Dr N Goddard (PI). Edinburgh University (jointly funded with MRC) [www.axiope.org and www.anc.ed.ac.uk]
6) (BASIS) Biology of ageing eScience integration and simulation system. Prof TBL Kirkwood (PI). Newcastle University [www.basis.ncl.ac.uk]

1. Introduction

In May the BBSRC held a meeting of its eScience pilot projects. It was agreed at the meeting that a paper should be presented at this All Hands Meeting covering the six pilot projects, setting out their bioinformatics goals and how they will contribute to the aims of the UK eScience programme. The intention is to show the types of bioinformatics research enabled by an eScience approach and how these projects will drive and contribute to the eScience developments occurring in parallel in other research disciplines across the UK eScience programme.

Bioinformatics can be described as the derivation of knowledge from computer analysis of biological data. This is a simplistic view, as the discipline includes a wide range of scientific investigation and is not a homogeneous domain. There is consequently a wide range of opinions as to what it comprises. In its broadest sense it is the application of informatics techniques to biological data in a research or application environment. These informatics techniques can come from a number of disciplines including computer science, statistics and mathematics. This list is by no means exhaustive, as researchers also utilise techniques from engineering, physics and other scientific disciplines.

In their Strategic Plan 2003-2008 [1] the BBSRC recognise the growing importance of bioinformatics in their domain when they state: "Genome sequencing and post-genomic technologies provide researchers with massive amounts of data. As a consequence biology is becoming more quantitative. Large experimental data sets will increasingly allow computer (in silico) simulation of biological systems." This recognises the growing importance of bioinformatics in the next generation of research in bioscience.

The BBSRC identify in [1] the following areas as important to their research agenda in non-clinical bioscience: Integrative biology, Sustainable agriculture, The Healthy organism, and Bioscience for industry. It is recognised that there are different levels of biological structure within these areas, from individual molecules, cells and tissues/organs through populations to microbes, plants and animals. Research needs to be done in a number of ways both within these levels and across them. This research underpins the growth in bioinformatics because it generates large amounts of data, which need to be analysed, integrated and used in simulations to test new ideas.

The second strategic objective in [1] recognises that bioscience is increasingly dependent on the development and use of e-tools because of the data being collected at the omics level of research. This should lead to more productive in silico research and the development of new bioinformatics tools.
These tools will be needed in areas such as data mining, pattern recognition, model building and data sharing across the vertical and horizontal biological structure levels.

Traditional bioscience research also needs access to the data collections made over the centuries by institutes such as the Natural History Museum. These collections must be prepared in machine-readable formats that allow investigations which shed new light on biodiversity and on the effects on it of changes in climate, agricultural policy and government policy. It is important that UK research in bioinformatics links with other efforts at national and international levels, e.g. the GBIF (Global Biodiversity Information Facility) initiative in biodiversity.

2. The Pilot Projects

When the pilot project proposals were submitted, the BBSRC had identified four theme areas for these projects: Genomics, Structural studies, Cellular processes, and Biodiversity. The successful projects covered topics within these areas, although there was no specific pilot project in the genomics area.

The BASIS project at Newcastle University [2] is concerned with utilising emerging Grid middleware technology to develop a system supporting a research community that is investigating quantitatively the biology of ageing at the cell, tissue and organism levels. It is creating new tools based on SBML (Systems Biology Markup Language) which will help a researcher build computer-based models of ageing processes and test them. At the tissue level these will be based on fibroblasts, gut and brain, with experiments being conducted to determine the effect of random cell death in a tissue. Users will be able to set up experiments involving the creation of new models or the modification of existing models, which let them gain a deeper understanding of the effects of tissue ageing on organisms.
It is intended that this facility will be available to a distributed community of researchers who will share their results, models and analytic tools. Thus this project aims to create a facility which allows simulation by system modelling within a structure level and across levels. The development team is working closely with another local team who are developing Grid middleware in the OGSA-DAI project for accessing data held in databases. Thus their requirements are informing the development of this middleware and they are utilising it in the development of the system.

e-Protein aims to provide a structure-based annotation of proteins in major genomes, which can be disseminated to other researchers. It is intended that alternative annotation approaches will be investigated to identify improvements in the methods of annotation. This will be used to build local databases holding structural and functional annotation of sequence data, which can be linked with relevant bioinformatics data resources at other sites. The improvement of protein modelling is a prime aim of the project so that better function predictions can be made. The team intend that the system will be available to a research community collaborating in their investigations so that they can investigate and test alternative structure models. The system will have a workflow-based interface which makes use of the ICENI middleware and its rich metadata structure for describing software tools. This middleware is being developed in a related Grid project which involves several of the e-Protein investigators. As in BASIS, this project will be informing the development of the Grid middleware it requires.

e-HTPX is addressing the problem of unifying the procedures of protein structure determination so that they can be accessed through a single interface which allows structural biologists to create models from the data generated by high throughput protein crystallography. This will involve creating new structure determination software which can take advantage of HPC facilities so that the results of the structure determination can be delivered on the same time scale as data collection. Data generated in these experiments will be stored at the EBI as an available resource for the research community. The project will involve giving users access to instruments, data collections and analytic tools.
This project is primarily concerned with building structural models within a structure level. As the system is expected to have a number of industrial users, an important concern is the authentication of users to protect the system against unauthorised use. The development team will be consulting potential industrial users to determine their requirements in this important area. These requirements will then be used to assess whether they can be supported by Grid facilities.

BioSimGrid aims to allow comparisons to be made of the results of multiple biomolecular simulations so that the structure of proteins and nucleic acids can be better understood. Its users will have access to large quantities of simulation data, which will be integrated in further simulation experiments that test theories in structural biology within structure levels, with the capability to reuse this data in cross-level modelling of structures. This data will require curation. These secondary analysis experiments will need data mining services to locate relevant data within these databases, as well as data analysis tools. Some of the research community using this system will be working in commercial organisations in the pharmaceutical industry. This introduces the need for user authentication before access is allowed to some of the data and tools, as commercial confidentiality will need to be protected. It also means that the project has an interest in investigating mechanisms for distributed authorisation and accounting. There is also a requirement to link the simulation data with other biological and structural data held in national repositories to allow the development of richer, more sophisticated models.

BDWorld [3] is creating a problem solving environment in which researchers can locate appropriate analytic tools and data resources held at different sites in the environment. These tools and data can then be linked in a workflow which produces results relevant to an investigation into a biodiversity problem at the species level.
This may be a question such as what the effect on a species distribution will be if global warming occurs, or whether a plant could become invasive if it is introduced as an agricultural crop in a region. The system utilises a partial catalogue of life and other biotic and abiotic data, such as climate envelopes, in three exemplar studies to prove the concept: biodiversity richness analysis; bioclimatic modelling and climate change; and phylogenetic analysis and biogeography. This will involve the system linking heterogeneous legacy and current data collections so that it can interoperate on this data using a variety of software tools with different data format expectations. It is intended that this work will link with the GBIF system being developed in an international effort, as its resources will complement GBIF's. The system must be able to evolve by adding new data collections and software tools to its distributed resources. This requires wrapping of legacy resources as they join so that they are consistent with the standards for data within BDWorld. The system is primarily concerned with the microbe, plant and animal level of biology, although it will be able to support some lower level analyses. At the moment a basic BDWorld system is being built, but its design is such that it will be able to evolve: by incorporating new tools and data resources; by adding ontologies which will help users discover the resources they need; and by incorporating more sophisticated display tools. The designers are aware that Grid middleware is being created in parallel with their development of BDWorld, and they are ensuring that it will be able to take advantage of appropriate Grid middleware when this has reached a suitable stage in its development.

The neuroinformatics project [4] is investigating how the brain functions. It is intended that the developed system will allow neuroscientists to work collaboratively, sharing their data and software tools. Research in this area needs to be undertaken at different biological levels and across the levels. The challenge is to allow the researchers to create their models collaboratively and conduct experiments on them. This involves being able to locate appropriate data so that it can be linked in the models or utilised by the models. This data, and the data produced by the models, must be available for use in future experiments. This research is being undertaken in collaboration with scientists at the Newcastle eScience centre, who are looking after the database and Grid middleware aspects in a separate project. The prime concern of this project at the moment is creating data models and software tools that enable heterogeneous data to be easily shared and analysed by its user community.
The design team recognise that in the next phase of development the system will need to provide sophisticated visualisation tools and ontologies. Thus they are concentrating on creating a basic system environment that can evolve by adding such tools in the future.

3. Pilot Project e-Science themes

These pilot projects display a number of eScience themes. They are all aiming to support collaborative working within a research community whose members need to share data, results and software tools. This means they will need to support discovery of relevant data sources, data and their descriptions so that the different tools can analyse and share data. When accessing legacy data, they will need to overcome the heterogeneity in representation that arises because the data was prepared over time for different purposes, and to develop new extended metadata standards that allow the provenance of this data to be established and stored.

Many of the projects are creating results which have to be stored so that other analytic tools can use this data in the future. This implies that the data will need to be curated, with provenance showing how it was created and the tools that created it. There will also be a need to use and store descriptive data so that the representation of the data is understood by researchers and can be interpreted by software tools.

There is some need for High Performance Computing (HPC) but it is not a major requirement of the projects. It occurs when complex models are being built and the results of analysing and using the models are needed in real time for further analysis. e-HTPX is the only project seeing this as a prime requirement, although the others may need some access to these facilities in the future.

All of the projects will be creating new software tools which need to be made available within the system for other users. These tools must be engineered so that they can link with existing tools and utilise the data available in the Grid system.
This means that there must be descriptions of these tools which enable them to be linked in analytic chains that can be executed by workflow engines. Several of the projects need workflow engines for their user interface to allow a user to create and execute workflows which perform the required analysis.

Most of the projects are aiming to support the building of structural models at different biological levels, which can be used to determine the functioning of a biological system and the effect of change on the system. It is clear that as bioinformatics expands through the Grid this will become a growing area, as researchers create more and more complex models that are not limited to one biological level but interact across the levels to determine the effect of substructure change on the higher level structures. This growth in model complexity will also be reflected in a growing diversity in the data sources used in the models as researchers investigate more fully the causes of change and evolution.

There is little emphasis at the moment on the need for sophisticated presentation tools which allow users to present information in different and more imaginative ways. This is probably because modelling is mainly done at a single level at the moment. Another reason could be that the structural modelling of biological systems is a relatively new technique; as it matures, the modellers will require more sophisticated displays of the results from these complex models to make it easier for users to understand the outcomes. It is also a feature of the current state of development of the systems, where these facilities are seen as a second stage of development and an unnecessary luxury until the basic systems are working.

Two projects are investigating the use of Grid authentication techniques. That only two are doing so reflects which projects have industrial links at the moment, rather than a lack of concern about authentication in the area. Again, as the field matures and this type of analysis becomes more accepted, there will be an increase in this requirement.

4. Expected effect on eScience

It is clear that the pilot projects have fairly ambitious bioinformatics goals and that they do not see themselves developing middleware per se for the Grid, but co-operating with the projects that are developing the middleware. The e-Protein and neuroinformatics projects are working closely with research groups that are developing middleware in separate projects and will utilise this software as it becomes available.
The e-Protein team are working closely with the team developing the ICENI middleware at Imperial College, and the neuroinformatics project has a close link with the team developing the OGSA-DAI middleware at Newcastle University. These projects will have a direct influence on the development of these pieces of eScience middleware. The other projects are keeping themselves informed of middleware developments and will utilise appropriate middleware when it is in a stable enough form. Until then they are likely to use alternative non-generic, limited-capability software that is already available or that they have developed themselves to meet their needs. However, they all intend to take full advantage of Grid middleware when it is stable.

All the projects have major data handling challenges, and one of the major contributions from this research programme should be insight into the future metadata requirements and standards in Grid environments. This covers the description of data and software as well as the provenance and curation of data. A major issue through all the projects is the interoperation of data held in different data collections. Considerable insight should be gained from these pilots as to how to describe and hold data so that this task is facilitated, especially with respect to the wrapping of legacy systems so that they can enter new environments easily.

Although it is not a major feature at the moment, these systems will need metadata repositories and ontologies to help users identify the resources they require within the environments. These will be needed to help the researchers create the workflows that will carry out their analyses of the data. At the moment some of the projects are investigating or creating embryonic workflow engines, which will be used to execute the analytic chains of software tools that users create to specify the required analysis. This work should further inform the development of these workflow engines.
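
The idea of analytic chains driven by tool descriptions can be illustrated with a short Python sketch. It is not taken from any of the pilot projects or from the ICENI or OGSA-DAI middleware: the ToolDescription and ProvenanceRecord classes, their field names, the format labels and the two toy tools are all hypothetical. Under those assumptions, it shows how a minimal description of each tool (its input and output format) lets an engine check that tools can be linked before executing them, and how a provenance record can be kept for each step.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ToolDescription:
    """Hypothetical metadata describing one analytic tool."""
    name: str
    input_format: str
    output_format: str
    run: Callable[[Any], Any]


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of which tool produced a result at which step."""
    step: int
    tool: str
    input_format: str
    output_format: str


def check_chain(chain: List[ToolDescription]) -> None:
    """Raise ValueError if two adjacent tools have incompatible formats."""
    for upstream, downstream in zip(chain, chain[1:]):
        if upstream.output_format != downstream.input_format:
            raise ValueError(
                f"{upstream.name} produces {upstream.output_format!r} but "
                f"{downstream.name} expects {downstream.input_format!r}"
            )


def execute_chain(
    chain: List[ToolDescription], data: Any
) -> Tuple[Any, List[ProvenanceRecord]]:
    """Validate the chain, run each tool in order, and log provenance."""
    check_chain(chain)
    provenance: List[ProvenanceRecord] = []
    for step, tool in enumerate(chain, start=1):
        data = tool.run(data)
        provenance.append(
            ProvenanceRecord(step, tool.name, tool.input_format, tool.output_format)
        )
    return data, provenance


def parse_fasta(text: str) -> Dict[str, str]:
    """Toy tool: parse FASTA-style text into {sequence name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def gc_content(records: Dict[str, str]) -> Dict[str, float]:
    """Toy tool: fraction of G and C bases in each non-empty sequence."""
    return {
        name: (seq.count("G") + seq.count("C")) / len(seq)
        for name, seq in records.items()
        if seq
    }


# The format labels below are illustrative, not a real standard.
PARSE = ToolDescription("parse_fasta", "fasta-text", "sequence-dict", parse_fasta)
GC = ToolDescription("gc_content", "sequence-dict", "gc-table", gc_content)

if __name__ == "__main__":
    result, provenance = execute_chain([PARSE, GC], ">seq1\nGGCC\n>seq2\nATGC\n")
    print(result)  # {'seq1': 1.0, 'seq2': 0.5}
    for record in provenance:
        print(record)
```

Checking the formats before execution means an ill-formed chain such as [GC, PARSE] is rejected without running any tool. A real engine would replace the simple string labels with the richer metadata and ontologies discussed above.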
These projects all intend to develop basic systems which can evolve to meet future, as yet unknown, requirements. This will be an important feature of their system architectures, and the development of these pilots should give us more insight into the best ways of building systems with this capability. This will be important in the development of sophisticated Grid systems, as we will not be able to afford to recreate such systems from scratch.

5. Conclusions

The bioinformatics pilot projects will make meaningful contributions to the eScience programme in creating the metadata standards required for bioinformatics data and tools. They will contribute to the definition of data curation and provenance standards. They do not intend to make a direct contribution to the development of the middleware required for the Grid, but their use of it as it evolves will inform its development. They will also identify new middleware requirements. It is clear that in the future this research will inform the development of the next generation of ontologies and data/resource discovery tools, and of more sophisticated presentation tools such as result visualisation.

However, the major contribution of these pilots will be as catalysts which encourage more bioinformatics research by demonstrating what can be achieved by collaborative in silico data experimentation and analysis in bioscience. The pilots will also create the basic systems which will allow the next generation of researchers to fully exploit this capability. This will enable the field to grow and support the collaborative working needed to build and exploit the next generation of systems biology models.

References

1. World Class Bioscience, Strategic Plan 2003-2008, BBSRC, Swindon (2003)
2. Kirkwood TBL et al: Towards an e-biology of ageing: integrating theory and data, Nature Reviews Molecular Cell Biology 4, 243-49 (2003)
3. Bisby FA: Biodiversity Informatics, in Business (quarterly magazine of the BBSRC), 24-25, July 2003
4. Goddard N, Cannon R and Howell F: Axiope Tools for Data Management and Data Sharing, accepted by J Neuroinformatics to be published (2003)