Genomic Data at the British Oceanographic Data Centre: a data management project for the NERC Marine and Freshwater Microbial Biodiversity Thematic Programme. Gwen Moncoiffé, British Oceanographic Data Centre, Liverpool, UK. gmon@bodc.ac.uk, www.bodc.ac.uk
What is BODC? A national facility for the curation and distribution of data related to the marine environment. BODC is the UK National Marine Data Centre and one of NERC's seven designated data centres. We deal with biological, chemical, physical and geophysical data, and our databases contain measurements of over 15,000 different variables. A total of 35 staff, including: data scientists, most with direct experience of marine data collection and analysis and with scientific qualifications in physics, chemistry, biogeochemistry, biology and geophysics; and IT specialists, responsible for IT infrastructure maintenance and development (e.g. reformatting, screening and visualisation software, database maintenance, web interface, GIS).
BODC data collections. The National Oceanographic Database (NODB): a collection of marine data series (continuous profiles, time series, etc.) originating mainly from UK research establishments. The Project Database: used to assemble continuous-profile and discrete-measurement data and metadata collected during large multi-disciplinary fieldwork experiments. It contains data from deep ocean, coastal and estuarine work. It is essentially a marine environmental database system, with data from the marine atmosphere, the water column and marine sediments, but it can also be used for inland fieldwork activities and mesocosm experiments.
BODC Data Banking Process. Archiving: the accession system is a catalogue and safe archive of all information and data files received. Databanking: (1) quality control, (2) parameter coding, (3) documentation. The National Oceanographic Database (NODB) holds data series: time series, continuous profile series, ship continuous underway measurements, etc. Inventories, metadata and documentation are fully managed under Oracle; datacycles are stored as files in BODC's standardised format (NetCDF equivalent). The Project Database holds an inventory of discrete or integrative data collection sampling events and a collection of discrete measurements: data related to water or core sampling, net tows, CTD and other profiling instruments, incubations, etc. Metadata and data are stored in Oracle; full-text documentation is stored externally. The BODC Parameter Dictionary is a multi-disciplinary parameter dictionary for data markup.
BODC's new web site: www.bodc.ac.uk
How did we become involved with genomic data? In October 2001, BODC became the main Project Data Centre for the UK NERC thematic programme Marine and Freshwater Microbial Biodiversity (M&FMB), a 5-year thematic programme (2000-2005) with 3 main funding phases and 34 funded projects: 2000 (mainly marine), 2001 (freshwater and comparative) and 2003 (application and exploitation phase). Fieldwork: one M&FMB-funded cruise in the Indian Ocean for co-ordinated marine fieldwork; freshwater fieldwork activities at Priest Pot, a shallow pond in the English Lake District; and project-specific fieldwork and sample collection initiatives.
M&FMB: a complex data management project. Subject areas: culturability, biofilms and cell signalling; biogeochemistry; bioactives and natural products; genetic diversity and community structure; virus ecology and exploitation; and underpinning environmental measurements for co-ordinated studies at Priest Pot, Cumbria, UK and in the Indian Ocean. The data span several contrasts: marine / freshwater; experimental / environmental; water column / sediment; quantitative / semi-quantitative / non-quantitative; molecular / biotic / abiotic.
Why BODC? Extensive experience in managing data from large multi-disciplinary fieldwork programmes (NERC and EU-funded). We maintain one of the UK's most comprehensive databases of environmental measurements. We have built up expertise and infrastructure for the management and quality control of data from ship-based multi-disciplinary fieldwork experiments. Our flexible database system allows us to ingest and integrate new types of measurements, instruments and parameters relatively easily.
Roles of BODC as M&FMB Project Data Centre. Our main roles were: (1) to quality control, collate and document non-molecular data collected during marine fieldwork activities, in particular the M&FMB-funded cruise AMBITION; (2) to collaborate with CEH Windermere/Lancaster on the management and collation of freshwater data, focusing on Priest Pot fieldwork; (3) to collate information and maintain a central inventory of sites sampled, data collected and resources generated by M&FMB-funded projects; (4) to preserve and catalogue any digital material submitted to us for long- or short-term archival by M&FMB scientists. Molecular data (mainly sequencing data) were outside BODC's remit.
Molecular data management issues. Molecular data were to be looked after by individual research teams and, where appropriate, submitted to public molecular databases such as GenBank. As a pilot project, BODC started investigating the possibility of linking GenBank records to the sampling metadata and environmental measurements in our database. Having obtained a series of GenBank accession numbers, we realised that, under the current arrangements, there was a high risk of data from genetic samples becoming permanently detached from their environmental context and sampling metadata. With that context lost, so too was the possibility of re-using the data in a context different from that originally intended by the originator.
Example of a GenBank record from an M&FMB marine environmental DNA sample
LOCUS       AY125386                1444 bp    DNA     linear   ENV 25-FEB-2003
DEFINITION  Uncultured Synechococcus sp. clone A315026 16S ribosomal RNA gene,
            partial sequence.
ACCESSION   AY125386
VERSION     AY125386.1  GI:28557446
KEYWORDS    ENV.
SOURCE      uncultured Synechococcus sp.
  ORGANISM  uncultured Synechococcus sp.
            Bacteria; Cyanobacteria; Chroococcales; Synechococcus;
            environmental samples.
REFERENCE   1  (bases 1 to 1444)
  AUTHORS   Zubkov,M.V., Fuchs,B.M., Tarran,G.A., Burkill,P.H. and Amann,R.
  TITLE     High rate of uptake of organic nitrogen compounds by
            Prochlorococcus cyanobacteria as a key to their dominance in
            oligotrophic oceanic waters
  JOURNAL   Appl. Environ. Microbiol. 69 (2), 1299-1304 (2003)
   PUBMED   12571062
REFERENCE   2  (bases 1 to 1444)
  AUTHORS   Zubkov,M.V., Fuchs,B.M., Tarran,G.A., Burkill,P.H. and Amann,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-JUN-2002) Plymouth Marine Laboratory, PL1 3DH,
            Plymouth PL1 3DH, United Kingdom
FEATURES             Location/Qualifiers
     source          1..1444
                     /organism="uncultured Synechococcus sp."
                     /mol_type="genomic DNA"
                     /db_xref="taxon:154535"
                     /clone="A315026"
                     /environmental_sample
     rRNA            <1..>1444
                     /product="16S ribosomal RNA"
ORIGIN
        1 agagtttgat cctggctcag gatgaacgct ggcggcgtgc ttaacacatg caagtcgaac
What this GenBank record does not tell us: Where? When? How? Depth? Environmental conditions? Methodology? Cruise? Project? Programme?
Problems with GenBank records for environmental samples. No standardisation or control of the quality and adequacy of the background information provided. No information on methodology. No possibility to link the genetic sequence back to information held in our environmental database without time-consuming detective work and reference to the data originator. No possibility to search for or filter genetic sequences based on precise environmental or methodological criteria. Example record:
LOCUS       AY907763                1522 bp    DNA     linear   ENV 03-AUG-2005
DEFINITION  Uncultured bacterium clone A315022 16S ribosomal RNA gene, partial
            sequence.
ACCESSION   AY907763
VERSION     AY907763.1  GI:62549130
KEYWORDS    ENV.
SOURCE      uncultured bacterium
  ORGANISM  uncultured bacterium
            Bacteria; environmental samples.
REFERENCE   1  (bases 1 to 1522)
  AUTHORS   Fuchs,B.M., Woebken,D., Zubkov,M.V., Burkill,P. and Amann,R.
  TITLE     Molecular identification of picoplankton populations in
            contrasting waters of the Arabian Sea
  JOURNAL   Aquat. Microb. Ecol. 39, 145-157 (2005)
REFERENCE   2  (bases 1 to 1522)
  AUTHORS   Fuchs,B.M., Woebken,D., Zubkov,M.V., Burkill,P. and Amann,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-JAN-2005) Molecular Ecology, Max Planck Institute
            for Marine Microbiology, Celsiusstr. 1, Bremen 28359, Germany
FEATURES             Location/Qualifiers
     source          1..1522
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Arabian Sea water"
                     /db_xref="taxon:77133"
                     /clone="A315022"
                     /environmental_sample
     rRNA            <1..>1522
                     /product="16S ribosomal RNA"
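The gaps in the AY907763 record can be made concrete with a small check on its "source" feature. The following is a minimal sketch in Python, not part of the BODC system: the list of context qualifiers (`isolation_source`, `lat_lon`, `collection_date`) is an assumption chosen for illustration, and the helper names `parse_qualifiers` and `missing_context` are hypothetical.

```python
import re

# Qualifiers that would tie a sequence to its sampling context.
# ASSUMPTION: this list is illustrative; the slides do not prescribe one.
CONTEXT_QUALIFIERS = ("isolation_source", "lat_lon", "collection_date")

# "source" feature of the AY907763 record discussed above.
SOURCE_FEATURE = '''
     source          1..1522
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Arabian Sea water"
                     /db_xref="taxon:77133"
                     /clone="A315022"
                     /environmental_sample
'''


def parse_qualifiers(feature_text):
    """Return {name: value} for /name="value" pairs; bare /name flags map to True."""
    qualifiers = {}
    for match in re.finditer(r'/(\w+)(?:="([^"]*)")?', feature_text):
        name, value = match.group(1), match.group(2)
        qualifiers[name] = value if value is not None else True
    return qualifiers


def missing_context(qualifiers):
    """List the context qualifiers that the record does not provide."""
    return [name for name in CONTEXT_QUALIFIERS if name not in qualifiers]


quals = parse_qualifiers(SOURCE_FEATURE)
missing = missing_context(quals)
print("isolation_source:", quals.get("isolation_source"))
print("missing context qualifiers:", missing)
```

Even where a qualifier is present, as with `isolation_source` here, its free-text value ("Arabian Sea water") gives no position, time or depth, which is the point the slide makes about searching and filtering.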
Need to link molecular and environmental database records for environmental studies. Environmental molecular/genomic data submitted to public repositories (e.g. GenBank) are publicly and easily accessible, BUT their value for future research is potentially greatly reduced when they cannot readily be linked to their environmental, spatial, temporal and methodological context. Submission of molecular data to public repositories cannot be considered a sufficient condition for proper data stewardship. Molecular data or their unique identifiers must also be submitted to the data centre responsible for the collation and quality control of the non-molecular data and metadata. Preferably, this should be done during the lifetime of the project in order to minimise the risk of information being lost.
Pilot project for M&FMB: priorities. Focus on the AMBITION cruise first and then Priest Pot fieldwork. Focus on DNA/RNA sequencing data. Simple integration into our Project Database. Achievable within our remaining M&FMB funding.
Design of the genetic data tables: main features. Include a sample information and inventory table, and a genetic data table with a link to GenBank. The sample inventory system is populated with all the known genetic samples collected and is fully integrated into our environmental database. Allow for search and retrieval of environmental data corresponding to a sample even if we do not hold the molecular data or GenBank accession number. Enable search of available DNA/RNA samples, and of the person to contact, based on criteria selected from our database. Ensure that we can apply our quality control checks to the sample metadata provided by the originator and provide feedback on potential problems or errors. Records in the genetic data tables are created once data become available through GenBank or via direct submission.
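The two-table design described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual BODC schema: the production system runs on Oracle, whereas this uses Python's built-in `sqlite3` as a stand-in, and every table name, column name and the contact placeholder are hypothetical. The position, time and depth for AY125386 are the values quoted later in this presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sampling_event (      -- stands in for the existing environmental inventory
    event_id   INTEGER PRIMARY KEY,
    sampled_at TEXT,               -- ISO 8601, GMT
    latitude   REAL,               -- degrees north
    longitude  REAL,               -- degrees east
    depth_m    REAL
);
CREATE TABLE genetic_sample (      -- sample information and inventory table
    sample_id  INTEGER PRIMARY KEY,
    event_id   INTEGER NOT NULL REFERENCES sampling_event(event_id),
    material   TEXT,               -- e.g. DNA or RNA
    contact    TEXT                -- person to ask about the sample
);
CREATE TABLE genetic_data (        -- row created only once data become available
    sample_id  INTEGER NOT NULL REFERENCES genetic_sample(sample_id),
    accession  TEXT UNIQUE         -- GenBank accession number
);
""")

# One sampling event with two samples; only the first has a GenBank record.
conn.execute("INSERT INTO sampling_event VALUES (1, '2001-09-09T00:05:00Z', 3.8, 67.0, 50.4)")
conn.execute("INSERT INTO genetic_sample VALUES (1, 1, 'DNA', 'contact placeholder')")
conn.execute("INSERT INTO genetic_sample VALUES (2, 1, 'RNA', 'contact placeholder')")
conn.execute("INSERT INTO genetic_data VALUES (1, 'AY125386')")


def context_for_accession(accession):
    """Return (time, latitude, longitude, depth) for a GenBank accession, or None."""
    return conn.execute(
        """SELECT e.sampled_at, e.latitude, e.longitude, e.depth_m
           FROM genetic_data d
           JOIN genetic_sample s ON s.sample_id = d.sample_id
           JOIN sampling_event e ON e.event_id = s.event_id
           WHERE d.accession = ?""",
        (accession,),
    ).fetchone()


# Samples stay searchable even when no sequence has been deposited yet.
unsequenced = conn.execute(
    """SELECT s.sample_id, s.material
       FROM genetic_sample s
       LEFT JOIN genetic_data d ON d.sample_id = s.sample_id
       WHERE d.accession IS NULL"""
).fetchall()

print(context_for_accession("AY125386"))
print(unsequenced)
```

Keeping the accession in a separate table from the sample inventory is what lets a sample be registered, quality controlled and searched before any molecular data exist for it.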
Linking genetic information into the Oracle RDBMS project database
Linking genetic information to our Oracle RDBMS project database fields
Further development. Initially designed for GenBank-type data (nucleotide or protein sequences), but can easily be adapted to handle any other non-quantitative product by a simple link to a uniquely recognised external reference number, e.g. a gel image ID, a public repository reference number, a scientist's own data record identifier or an external database record.
Access to quality-controlled information and environmental data for genetic sequences and samples. Sequence AY125386 from GenBank originates from a water sample collected on 09/09/2001 at 00:05 GMT, at latitude 3.8 deg N and longitude 67.0 deg E, and at a depth of 50.4 m. The following parameters, measured by the ship's on-board instruments and collected by other researchers, are also available for this same water sample: temperature and salinity profiles; meteorological conditions; nitrate, nitrite, ammonium, silicate and phosphate nutrient concentration and distribution; oxygen concentration and distribution; phytoplankton pigment composition, concentration and distribution; pico-, nano- and micro-planktonic community biomass and composition, including prokaryotes and eukaryotes; primary production and nitrogen uptake and recycling measurements; other gene sequences, with links to other available molecular products (e.g. gel images) or DNA samples with a contact name; other measurements derived from molecular products, including semi-quantitative distribution data for targeted organisms.
ACCESS interface
M&FMB programme data management: challenges. Lack of in-house knowledge of genetic and molecular techniques. Absence of metadata standards for referencing molecular methodologies and data. Complexity, and sometimes loose usage, of some of the specific vocabulary used by scientists to describe their data (not specific to the genomic community, but it made our job more difficult). General lack of trust in the data centre from the molecular biologists: data were not provided to the data centre unless already published, and reassurance about IP rights protection and release clauses had no effect. Lack of interest in providing information about the data collected and the methodology used: the data are still viewed in the narrow context of the scientists' own research interests.
Linking genetic data to their environmental context: other initiatives and acknowledgements. Sapelo Island Microbial Observatory (SIMO) database (Sheldon et al., Georgia, US-NSF). MICRO-MAR database (Rodriguez-Valera et al., Alicante, MIRACLE EU Project). Thanks to: Michael Hughes, BODC, for genetic data banking and design of the ACCESS form interface; Dawn Field, CEH Oxford, workshop organiser.