Genomic Data at the British Oceanographic Data Centre




Genomic Data at the British Oceanographic Data Centre
A data management project for the NERC Marine and Freshwater Microbial Biodiversity Thematic Programme
Gwen Moncoiffé, British Oceanographic Data Centre, Liverpool, UK
gmon@bodc.ac.uk
www.bodc.ac.uk

What is BODC?
- A national facility for the curation and distribution of data related to the marine environment
- BODC is the UK National Marine Data Centre and one of NERC's seven designated data centres
- We deal with biological, chemical, physical and geophysical data, and our databases contain measurements of over 15,000 different variables
- A total of 35 staff, including:
  - data scientists: most have direct experience of marine data collection and analysis, and scientific qualifications in physics, chemistry, biogeochemistry, biology or geophysics
  - IT specialists: IT infrastructure maintenance and development, e.g. reformatting, screening and visualisation software, database maintenance, web interface, GIS

BODC data collections
- The National Oceanographic Database (NODB): a collection of marine data series (continuous profiles, time series, ...) originating mainly from UK research establishments
- The Project Database:
  - used to assemble continuous-profile and discrete-measurement data and metadata collected during large multi-disciplinary fieldwork experiments
  - contains data from deep ocean, coastal and estuarine work
  - essentially a marine environmental database system with data from the marine atmosphere, the water column and marine sediments
  - can also be used for inland fieldwork activities and mesocosm experiments

BODC Data Banking Process

Archiving
- Accession system: catalogue and safe archive of all information and data files received

Databanking
1. Quality control
2. Parameter coding
3. Documentation

The National Oceanographic Database (NODB)
- Data series: time series, continuous profile series, ship continuous underway measurements, etc.
- Inventories, metadata and documentation fully managed under Oracle; datacycles stored as files in BODC's standardised format (NetCDF equivalent)

The Project Database
- Inventory of discrete or integrative sampling events and a collection of discrete measurements: data related to water or core sampling, net tows, CTD and other profiling instruments, incubations, etc.
- Metadata and data stored in Oracle; full-text documentation stored externally

BODC Parameter Dictionary
- A multi-disciplinary parameter dictionary for data markup

BODC new web site www.bodc.ac.uk

How did we become involved with genomic data?
- In October 2001, BODC became the main Project Data Centre for the UK NERC thematic programme Marine and Freshwater Microbial Biodiversity (M&FMB)
- 5-year thematic programme, 2000-2005
- 3 main funding phases, 34 funded projects: 2000 (mainly marine), 2001 (freshwater and comparative) and 2003 (application and exploitation phase)
- Fieldwork:
  - one M&FMB-funded cruise in the Indian Ocean for co-ordinated marine fieldwork
  - freshwater fieldwork activities at Priest Pot, a shallow pond in the English Lake District
  - project-specific fieldwork and sample collection initiatives

M&FMB: a complex data management project

Research themes:
- Culturability, biofilms and cell signalling
- Biogeochemistry
- Bioactives and natural products
- Genetic diversity and community structure
- Virus ecology and exploitation
- Underpinning environmental measurements for co-ordinated studies at Priest Pot, Cumbria, UK and in the Indian Ocean

Dimensions of the data:
- marine / freshwater
- experimental / environmental
- water column / sediment
- quantitative / semi-quantitative / non-quantitative
- molecular / biotic / abiotic

Why BODC?
- Extensive experience in managing data from large multi-disciplinary fieldwork programmes (NERC and EU-funded)
- We maintain one of the UK's most comprehensive databases of environmental measurements
- Built-up expertise and infrastructure for the management and quality control of data from ship-based multi-disciplinary fieldwork experiments
- A flexible database system which allows us to ingest and integrate new types of measurements, instruments and parameters relatively easily

Roles of BODC as M&FMB Project Data Centre

Our main roles were:
1) to quality control, collate and document non-molecular data collected during marine fieldwork activities, in particular the M&FMB-funded cruise AMBITION
2) to collaborate with CEH Windermere/Lancaster on the management and collation of freshwater data, focusing on Priest Pot fieldwork
3) to collate information and maintain a central inventory of sites sampled, data collected and resources generated by M&FMB-funded projects
4) to preserve and catalogue any digital material submitted to us by M&FMB scientists for long- or short-term archival

Molecular data (mainly sequencing data) were outside BODC's remit.

Molecular data management issues
- Molecular data would be looked after by individual research teams and, where appropriate, submitted to public molecular databases such as GenBank
- As a pilot project, BODC started investigating the possibility of linking GenBank records to the sampling metadata and environmental measurements held in our database
- Having obtained a series of GenBank accession numbers, we realised that under the current arrangements there was a high risk of data from genetic samples becoming permanently detached from their environmental context and sampling metadata
- Once that context is lost, so is the possibility of re-using the data for purposes other than those originally intended by the originator

Example of a GenBank record from an M&FMB marine environmental DNA sample

LOCUS       AY125386     1444 bp    DNA     linear   ENV 25-FEB-2003
DEFINITION  Uncultured Synechococcus sp. clone A315026 16S ribosomal RNA gene,
            partial sequence.
ACCESSION   AY125386
VERSION     AY125386.1  GI:28557446
KEYWORDS    ENV.
SOURCE      uncultured Synechococcus sp.
  ORGANISM  uncultured Synechococcus sp.
            Bacteria; Cyanobacteria; Chroococcales; Synechococcus;
            environmental samples.
REFERENCE   1  (bases 1 to 1444)
  AUTHORS   Zubkov,M.V., Fuchs,B.M., Tarran,G.A., Burkill,P.H. and Amann,R.
  TITLE     High rate of uptake of organic nitrogen compounds by
            Prochlorococcus cyanobacteria as a key to their dominance in
            oligotrophic oceanic waters
  JOURNAL   Appl. Environ. Microbiol. 69 (2), 1299-1304 (2003)
  PUBMED    12571062
REFERENCE   2  (bases 1 to 1444)
  AUTHORS   Zubkov,M.V., Fuchs,B.M., Tarran,G.A., Burkill,P.H. and Amann,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-JUN-2002) Plymouth Marine Laboratory, PL1 3DH,
            Plymouth PL1 3DH, United Kingdom
FEATURES             Location/Qualifiers
     source          1..1444
                     /organism="uncultured Synechococcus sp."
                     /mol_type="genomic DNA"
                     /db_xref="taxon:154535"
                     /clone="A315026"
                     /environmental_sample
     rRNA            <1..>1444
                     /product="16S ribosomal RNA"
ORIGIN
        1 agagtttgat cctggctcag gatgaacgct ggcggcgtgc ttaacacatg caagtcgaac

Questions that GenBank record AY125386 (the M&FMB marine environmental DNA sample shown above) leaves unanswered:
Where? When? How? Depth? Environmental conditions? Methodology? Cruise? Project? Programme?

Problems with GenBank records for environmental samples
- No standardisation or control of the quality and adequacy of the background information provided
- No information on methodology
- No possibility to link the genetic sequence back to information held in our environmental database without time-consuming detective work and reference to the data originator
- No possibility to search for or filter genetic sequences based on precise environmental or methodological criteria

LOCUS       AY907763     1522 bp    DNA     linear   ENV 03-AUG-2005
DEFINITION  Uncultured bacterium clone A315022 16S ribosomal RNA gene,
            partial sequence.
ACCESSION   AY907763
VERSION     AY907763.1  GI:62549130
KEYWORDS    ENV.
SOURCE      uncultured bacterium
  ORGANISM  uncultured bacterium
            Bacteria; environmental samples.
REFERENCE   1  (bases 1 to 1522)
  AUTHORS   Fuchs,B.M., Woebken,D., Zubkov,M.V., Burkill,P. and Amann,R.
  TITLE     Molecular identification of picoplankton populations in
            contrasting waters of the Arabian Sea
  JOURNAL   Aquat. Microb. Ecol. 39, 145-157 (2005)
REFERENCE   2  (bases 1 to 1522)
  AUTHORS   Fuchs,B.M., Woebken,D., Zubkov,M.V., Burkill,P. and Amann,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-JAN-2005) Molecular Ecology, Max Planck Institute
            for Marine Microbiology, Celsiusstr. 1, Bremen 28359, Germany
FEATURES             Location/Qualifiers
     source          1..1522
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Arabian Sea water"
                     /db_xref="taxon:77133"
                     /clone="A315022"
                     /environmental_sample
     rRNA            <1..>1522
                     /product="16S ribosomal RNA"
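The gap described on this slide can be checked mechanically. The following is a minimal sketch, not BODC's tooling: it scans the feature table of a GenBank flat-file record for quoted qualifiers and reports which contextual ones are absent. The checklist in CONTEXT_QUALIFIERS and the function names are assumptions made for illustration; the qualifier names follow INSDC feature-table conventions.

```python
import re

# Hypothetical checklist of contextual qualifiers (an assumption for this sketch).
CONTEXT_QUALIFIERS = ("isolation_source", "lat_lon", "collection_date")

def quoted_qualifiers(flatfile_text):
    """Return {qualifier: value} for every /key="value" pair in the text."""
    return dict(re.findall(r'/(\w+)="([^"]*)"', flatfile_text))

def missing_context(flatfile_text):
    """List the checklist qualifiers that the record does not provide."""
    present = quoted_qualifiers(flatfile_text)
    return [q for q in CONTEXT_QUALIFIERS if q not in present]

# Feature table of record AY907763, as shown on the slide.
RECORD = '''
FEATURES             Location/Qualifiers
     source          1..1522
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Arabian Sea water"
                     /db_xref="taxon:77133"
                     /clone="A315022"
                     /environmental_sample
     rRNA            <1..>1522
                     /product="16S ribosomal RNA"
'''

print(missing_context(RECORD))  # ['lat_lon', 'collection_date']
```

For this record the only contextual hint is the free-text isolation source; position and sampling date are not recoverable from the record itself.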

Need to link molecular and environmental database records for environmental studies
- Environmental molecular/genomic data submitted to public repositories (e.g. GenBank) are publicly and easily accessible, BUT:
  - their value for future research is potentially greatly reduced when they are not readily linkable to their environmental, spatial, temporal and methodological context
  - submission of molecular data to public repositories cannot be considered a sufficient condition for proper data stewardship
- Molecular data or their unique identifiers must also be submitted to the data centre responsible for the collation and quality control of the non-molecular data and metadata
- Preferably, this should be done during the lifetime of the project in order to minimise the risk of information being lost

Pilot project for M&FMB: priorities
- Focus on the AMBITION cruise first and then on Priest Pot fieldwork
- Focus on DNA/RNA sequencing data
- Simple integration into our Project Database
- Achievable within our remaining M&FMB funding

Design of the genetic data tables: main features
- Include a sample information and inventory table, and a genetic data table with a link to GenBank
- Sample inventory system populated with all the known genetic samples collected, fully integrated into our environmental database
- Allow for search and retrieval of environmental data corresponding to a sample even if we do not hold the molecular data or GenBank accession number
- Enable search of available DNA/RNA samples and the person to contact, based on criteria selected from our database
- Ensure that we can apply our quality control checks to the sample metadata provided by the originator and provide feedback on potential problems or errors
- Records in the genetic data tables are created once data become available through GenBank or via direct submission
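The two-table design listed above can be sketched as follows. This is an illustration only, using SQLite from the Python standard library rather than the Oracle system the slides describe; every table and column name is hypothetical. The first sample uses the position, time and depth quoted elsewhere in this talk for sequence AY125386 (its cruise is assumed here to be AMBITION); the second sample is invented purely to show a sample with no sequence yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Inventory of every known genetic sample (hypothetical column names).
CREATE TABLE sample_inventory (
    sample_id     INTEGER PRIMARY KEY,
    cruise        TEXT NOT NULL,
    sampled_utc   TEXT NOT NULL,
    latitude_deg  REAL NOT NULL,
    longitude_deg REAL NOT NULL,
    depth_m       REAL,
    contact       TEXT
);
-- One row per molecular product, created only once it becomes available.
CREATE TABLE genetic_data (
    genetic_id    INTEGER PRIMARY KEY,
    sample_id     INTEGER NOT NULL REFERENCES sample_inventory(sample_id),
    external_ref  TEXT NOT NULL,              -- e.g. a GenBank accession
    ref_type      TEXT NOT NULL DEFAULT 'GenBank'
);
""")

conn.executemany(
    "INSERT INTO sample_inventory VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # Values quoted in the talk for AY125386; cruise name is an assumption.
        (1, "AMBITION", "2001-09-09T00:05Z", 3.8, 67.0, 50.4, "originator (placeholder)"),
        # Invented sample with no sequence submitted yet.
        (2, "AMBITION", "2001-09-10T00:00Z", 4.0, 67.0, 10.0, "originator (placeholder)"),
    ],
)
conn.execute(
    "INSERT INTO genetic_data (sample_id, external_ref) VALUES (?, ?)",
    (1, "AY125386"),
)

# LEFT JOIN keeps samples that have no accession number yet.
rows = conn.execute("""
    SELECT s.sample_id, s.depth_m, g.external_ref
    FROM sample_inventory AS s
    LEFT JOIN genetic_data AS g ON g.sample_id = s.sample_id
    ORDER BY s.sample_id
""").fetchall()
print(rows)  # [(1, 50.4, 'AY125386'), (2, 10.0, None)]
```

Keeping the inventory separate from the genetic data table is what lets a sample be searched and matched to environmental data before any sequence has been deposited.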

Linking genetic information into the Oracle RDBMS project database

Linking genetic information to our Oracle RDBMS project database fields

Further development
- Initially designed for GenBank-type data (nucleotide or protein sequences)
- Can easily be adapted to handle any other non-quantitative product through a simple link to a uniquely recognised external reference number, e.g. a gel image ID, a public repository reference number, a scientist's own data record identifier or an external database record

Access to quality controlled information and environmental data for genetic sequences and samples

Sequence AY125386 from GenBank originates from:
- a water sample collected on 09/09/2001 at 00:05 GMT, at latitude 3.8 deg N and longitude 67.0 deg E, and at a depth of 50.4 m

The following parameters, measured by the ship's on-board instruments and collected by other researchers, are also available for this same water sample:
- temperature and salinity profiles; meteorological conditions
- nitrate, nitrite, ammonium, silicate and phosphate nutrient concentration and distribution
- oxygen concentration and distribution
- phytoplankton pigment composition, concentration and distribution
- pico-, nano- and micro-planktonic community biomass and composition, including prokaryotes and eukaryotes
- primary production, nitrogen uptake and recycling measurements
- other gene sequences, links to other available molecular products (e.g. gel images) or DNA samples with contact name
- other measurements derived from molecular products, including semi-quantitative distribution data for targeted organisms
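The lookup described above, from an accession number to its sampling context and to the list of co-located measurements, might look like the sketch below. It is an assumption-laden illustration in SQLite, not the Oracle or ACCESS implementation shown in the talk: the table names, column names and the function context_for are all hypothetical, and only parameter names (no measured values) are stored.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding the one sample quoted in the talk."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY, sampled_utc TEXT,
        latitude_deg REAL, longitude_deg REAL, depth_m REAL
    );
    CREATE TABLE sequence_link (accession TEXT PRIMARY KEY, sample_id INTEGER);
    CREATE TABLE parameter_available (sample_id INTEGER, parameter TEXT);
    """)
    conn.execute("INSERT INTO sample VALUES (1, '2001-09-09T00:05Z', 3.8, 67.0, 50.4)")
    conn.execute("INSERT INTO sequence_link VALUES ('AY125386', 1)")
    names = ("temperature", "salinity", "nitrate", "nitrite",
             "ammonium", "silicate", "phosphate")
    conn.executemany("INSERT INTO parameter_available VALUES (1, ?)",
                     [(n,) for n in names])
    return conn

def context_for(conn, accession):
    """Return sampling context and available parameters, or None if unknown."""
    row = conn.execute("""
        SELECT s.sample_id, s.sampled_utc, s.latitude_deg, s.longitude_deg, s.depth_m
        FROM sequence_link AS l
        JOIN sample AS s ON s.sample_id = l.sample_id
        WHERE l.accession = ?
    """, (accession,)).fetchone()
    if row is None:
        return None
    params = [r[0] for r in conn.execute(
        "SELECT parameter FROM parameter_available WHERE sample_id = ? ORDER BY parameter",
        (row[0],))]
    return {"sampled_utc": row[1], "latitude_deg": row[2],
            "longitude_deg": row[3], "depth_m": row[4], "parameters": params}

ctx = context_for(build_demo_db(), "AY125386")
print(ctx["depth_m"], ctx["parameters"])
```

An accession that has not been linked to a sample simply returns None, which mirrors the situation the talk warns about: the sequence exists publicly but its context cannot be retrieved.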

ACCESS interface


M&FMB programme data management: challenges
- Lack of in-house knowledge of genetic and molecular techniques
- Absence of metadata standards for referencing molecular methodologies and data
- Complexity, and sometimes loose usage, of the specific vocabulary scientists use to describe their data (not specific to the genomic community, but it made our job more difficult)
- General lack of trust in the data centre among molecular biologists:
  - data were not provided to the data centre unless already published
  - reassurance about IP rights protection and release clauses had no effect
- Lack of interest in providing information about the data collected and the methodology used: the data are still viewed in the narrow context of the scientists' own research interests

Linking genetic data to their environmental context: other initiatives and acknowledgements

Other initiatives:
- Sapelo Island Microbial Observatory (SIMO) database (Sheldon et al., Georgia, US-NSF)
- MICRO-MAR database (Rodriguez-Valera et al., Alicante, MIRACLE EU Project)

Thanks to:
- Michael Hughes, BODC: genetic data banking and design of the ACCESS form interface
- Dawn Field, CEH Oxford: workshop organiser