COPO: Collaborative Open Plant Omics. Rob Davey, Data Infrastructure and Algorithms Group Leader, robert.davey@tgac.ac.uk




COPO: Collaborative Open Plant Omics
Rob Davey
Data Infrastructure and Algorithms Group Leader
robert.davey@tgac.ac.uk
@froggleston

Acknowledgements
Toni Etuk, Felix Shaw
Oxford e-Research Centre: Susanna Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Alfie Abdul-Rahman
Warwick: Jim Beynon, Katherine Denby, Ruth Bastow
EMBL-EBI: Paul Kersey
TGAC: Vicky Schneider, Tanya Dickie, Emily Angiolini, Matt Drew

Recently awarded BBSRC BBR grant
- TGAC, Univ. Oxford, Univ. Warwick, EMBL-EBI
- Supported by GARNet, iPlant, Eagle Genomics
Empower bioscience plant researchers to:
1. Enable standards-compliant data collection, curation and integration
2. Enhance access to data analysis and visualisation pipelines
3. Facilitate data sharing and publication to promote reuse
Train plant researchers in best practice for data sharing and producing citable Research Objects

(Good) Science is founded on reproducibility
Reproducibility depends on:
- reducing reinvention ("friction")*
- describing methods and data
- maximising benefit to the researcher
Describing methods is well established through traditional publishing
Data description is sorely under-represented and under-used
Benefits are often opaque
- Fear of being scooped, loss of control, reputation, etc.
* http://cameronneylon.net/blog/network-enabled-research/

What prevents plant scientists from openly depositing their data and metadata?
Lack of interoperability between:
- metadata annotation services
- data repository services
- data analysis services
- data publishing services
Researchers might not:
- be aware that the services exist
- have the expertise to use them
- see the value in properly describing their data

Data: Sample, Sequence, Genome, Proteome, Metabolome, Imaging
Code: GitHub, Bitbucket, Zenodo
Analysis: Galaxy, iPlant, Bioconductor, Taverna, local code/services
Publication: figshare, Scientific Data, Dryad, F1000, PeerJ, GigaScience
Beyond the PDF: Utopia, GitHub
Training: Materials, examples, workshops, bootcamps

It's not because these services don't exist!
Clearly, barriers exist between the scientist and the service
Infrastructure can help by:
- wiring existing services together
- improving access to services
- facilitating collaboration
- raising the profile of the benefits of open science
How do we collaborate successfully to make this happen?
- Mapping services with Application Programming Interfaces

Grace signs into COPO with her ORCID iD
- This signs her into all other services as required
She starts a new COPO Profile
She uploads to the COPO platform:
- Three FASTQs (two Illumina HiSeq2500, one PacBio P6-C4) representing her velociraptor sequencing reads
She tells COPO to push her data to a Galaxy server and run a workflow, producing:
- An assembly of the reads from ALLPATHS-LG v51551
- A draft automated annotation from RAST v33-1
The interface prompts her to add metadata to her data in order to deposit them in the public repositories
- Metadata fields will be shown based on data, and redundant fields will be merged automatically
- Sample name, sample organism, data type, sequencer used, software name, software version...
She clicks Upload, and everything is submitted
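The "redundant fields will be merged automatically" step in the scenario above can be illustrated with a minimal Python sketch. It is not COPO's implementation: the target names and their required-field lists are hypothetical placeholders (the field names are taken from the slide), and real repository schemas are considerably richer.

```python
# Hypothetical required-field lists per deposition target.
# These are illustrative placeholders, not real repository schemas.
REQUIRED_FIELDS = {
    "sequence_archive": [
        "sample_name", "sample_organism", "data_type", "sequencer_used",
    ],
    "assembly_archive": [
        "sample_name", "sample_organism", "software_name", "software_version",
    ],
}


def merged_fields(targets, schemas=REQUIRED_FIELDS):
    """Return every field required by the chosen targets exactly once,
    in first-seen order, so the user is asked for each value only once."""
    seen = []
    for target in targets:
        for field in schemas[target]:
            if field not in seen:
                seen.append(field)
    return seen


def missing_fields(metadata, targets, schemas=REQUIRED_FIELDS):
    """List the merged fields that the user has not filled in yet."""
    return [f for f in merged_fields(targets, schemas) if not metadata.get(f)]
```

Calling `merged_fields(["sequence_archive", "assembly_archive"])` yields six fields rather than eight, because the two shared fields (sample name and organism) are prompted for only once.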

Single sign-on (SSO), e.g. ORCID
Deposit multi-omics data in one go
- No context-switching between services
Run and deposit analytical workflows
- Describe software used, versions
- Pull into platforms, e.g. Galaxy, iPlant
- Support virtualisation, e.g. iPlant Atmosphere, Docker, Amazon AWS
Data is well-described, open, and everything has DOIs
- Finding and integrating data is improved greatly
- Make suggestions to users based on their data/workflows
Programmatic access to all layers
REPRODUCIBILITY

Not just raw/processed data is valuable
COPO supports submission of supplementary data to Figshare
- PDFs (posters, papers)
- CSV/Excel
- movies/images (size permitting)
Zenodo/GitHub releases for code DOIs
- Marked up with ENCODE Digital Curation Center's software metadata descriptors, for example

What have we achieved so far?
TGAC infrastructure to support brokering of data
- iRODS and web server virtual machines
- High speed transfer: Aspera links to EBI
Prototype user interface for multi-omics data submissions
- OAuth2 support ("sign in with" ORCID, Google, Twitter)
Developing JSON specification for COPO objects
- Easily stored in document-based databases, e.g. MongoDB
Interconversion between ISA formats
- ISATab (CSV based) to JSON, and vice versa
- Linked Data specifications
Community interactions
- MetaboLights group at EBI
- Setting up this workshop!
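To give a feel for the table-to-JSON interconversion mentioned above, here is a deliberately simplified Python sketch using only the standard library. It is not the ISA tools' converter or COPO's JSON specification: it assumes a single tab-delimited table with unique column headers, whereas real ISA-Tab files can repeat headers and span several linked files. The function names are illustrative.

```python
import csv
import io
import json


def table_to_json(text):
    """Convert a tab-delimited table (header row first) to a JSON array
    with one object per data row. Assumes column headers are unique."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return json.dumps(list(reader), indent=2)


def json_to_table(doc):
    """Convert the JSON array back to tab-delimited text, taking the
    column order from the keys of the first record."""
    rows = json.loads(doc)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(rows[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The resulting list of flat JSON objects is the kind of record a document-based database such as MongoDB can store directly, which is the motivation the slide gives for a JSON representation.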

COPO will:
Facilitate easy, relevant data description to:
- Submit data and metadata to multiple public repositories
  - The reasons most of you are here
  - What are the barriers for you and your data?
- Facilitate access to workflows used to analyse the data, e.g. to GigaDB, Scientific Data
  - This will form part of another COPO workshop