COPO: Collaborative Open Plant Omics
Rob Davey, Data Infrastructure and Algorithms Group Leader
robert.davey@tgac.ac.uk, @froggleston

Acknowledgements
- Toni Etuk, Felix Shaw
- Oxford e-Research Centre: Susanna Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Alfie Abdul-Rahman
- Warwick: Jim Beynon, Katherine Denby, Ruth Bastow
- EMBL-EBI: Paul Kersey
- TGAC: Vicky Schneider, Tanya Dickie, Emily Angiolini, Matt Drew

Recently awarded BBSRC BBR grant
- TGAC, Univ. Oxford, Univ. Warwick, EMBL-EBI
- Supported by GARNet, iPlant, Eagle Genomics
Empower bioscience plant researchers to:
1. Enable standards-compliant data collection, curation and integration
2. Enhance access to data analysis and visualisation pipelines
3. Facilitate data sharing and publication to promote reuse
Train plant researchers in best practice for data sharing and producing citable Research Objects

(Good) science is founded on reproducibility
Reproducibility depends on:
- reducing reinvention ("friction")*
- describing methods and data
- maximising benefit to the researcher
Describing methods is well established through traditional publishing
Data description is sorely under-represented and under-used
Benefits are often opaque
- Fear of being scooped, loss of control, reputation, etc.
* http://cameronneylon.net/blog/network-enabled-research/

What prevents plant scientists from openly depositing their data and metadata?
Lack of interoperability between:
- metadata annotation services
- data repository services
- data analysis services
- data publishing services
Researchers might not:
- be aware that the services exist
- have the expertise to use them
- see the value in properly describing their data

- Data: Sample, Sequence, Genome, Proteome, Metabolome, Imaging
- Code: GitHub, BitBucket, Zenodo
- Analysis: Galaxy, iPlant, Bioconductor, Taverna, local code/services
- Publication: figshare, Scientific Data, Dryad, F1000, PeerJ, GigaScience
- Beyond the PDF: Utopia, GitHub
- Training: Materials, examples, workshops, bootcamps

It's not because these services don't exist!
Clearly, barriers exist between the scientist and the service
Infrastructure can help by:
- wiring existing services together
- improving access to services
- facilitating collaboration
- raising the profile of the benefits of open science
How do we collaborate successfully to make this happen?
- Mapping services with Application Programming Interfaces

Grace signs into COPO with her ORCID iD
- This signs her into all other services as required
She starts a new COPO Profile
She uploads to the COPO platform:
- Three FASTQs (two Illumina HiSeq2500, one PacBio P6-C4) representing her velociraptor sequencing reads
She tells COPO to push her data to a Galaxy server and run a workflow, producing:
- An assembly of the reads from ALLPATHS-LG v51551
- A draft automated annotation from RAST v33-1
The interface prompts her to add metadata to her data in order to deposit them in the public repositories
- Metadata fields will be shown based on data, and redundant fields will be merged automatically
- Sample name, sample organism, data type, sequencer used, software name, software version...
She clicks Upload, and everything is submitted
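The idea of showing metadata fields based on the data, and merging redundant ones, can be sketched as follows. This is only an illustration: the data type names, the `REQUIRED_FIELDS` table and the `fields_to_prompt` function are hypothetical and are not COPO's actual schema or API.

```python
# Hypothetical mapping from data type to the metadata fields a repository
# might require. The names are illustrative, not COPO's real schema.
REQUIRED_FIELDS = {
    "sequence_reads": ["sample name", "sample organism", "data type", "sequencer used"],
    "assembly": ["sample name", "sample organism", "data type",
                 "software name", "software version"],
    "annotation": ["sample name", "sample organism", "data type",
                   "software name", "software version"],
}


def fields_to_prompt(data_types):
    """Return each required field once, in first-seen order, so the user
    is asked for shared fields (e.g. sample name) only a single time."""
    seen = set()
    merged = []
    for data_type in data_types:
        for field in REQUIRED_FIELDS[data_type]:
            if field not in seen:
                seen.add(field)
                merged.append(field)
    return merged


if __name__ == "__main__":
    # Grace's submission: raw reads, an assembly and a draft annotation.
    for field in fields_to_prompt(["sequence_reads", "assembly", "annotation"]):
        print(field)
```

In this sketch Grace would be prompted for six fields rather than fourteen, because fields shared between her reads, assembly and annotation are asked for only once.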

- Single sign-on (SSO), e.g. ORCID
- Deposit multi-omics data in one go
- No context-switching between services
- Run and deposit analytical workflows
- Describe software used, versions
- Pull into platforms, e.g. Galaxy, iPlant
- Support virtualisation, e.g. iPlant Atmosphere, Docker, Amazon AWS
- Data is well-described, open, and everything has DOIs
- Finding and integrating data is improved greatly
- Make suggestions to users based on their data/workflows
- Programmatic access to all layers
REPRODUCIBILITY

Not just raw/processed data is valuable
COPO supports submission of supplementary data to Figshare
- PDFs (posters, papers)
- CSV/Excel
- movies/images (size permitting)
Zenodo/GitHub releases for code
- DOIs
- Marked up with, for example, ENCODE or Digital Curation Centre software metadata descriptors

What have we achieved so far?
- TGAC infrastructure to support brokering of data
  - iRODS and web server virtual machines
  - High-speed transfer: Aspera links to EBI
- Prototype user interface for multi-omics data submissions
  - OAuth2 support ("sign in with" ORCID, Google, Twitter)
- Developing JSON specification for COPO objects
  - Easily stored in document-based databases, e.g. MongoDB
- Interconversion between ISA formats
  - ISA-Tab (CSV-based) to JSON, and vice versa
  - Linked Data specifications
- Community interactions
  - MetaboLights group at EBI
  - Setting up this workshop!
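As a rough illustration of the tabular-to-JSON interconversion mentioned above, the sketch below round-trips a small delimited table through JSON using only the Python standard library. It is not the COPO or ISA tools converter: the function names are hypothetical, the tab delimiter is an assumption, and real ISA-Tab files can repeat column headers, which this simple dictionary-per-row approach would not preserve.

```python
import csv
import io
import json


def table_to_json(text, delimiter="\t"):
    """Convert a delimited table (header row plus data rows) into a JSON
    array of objects, one object per row, keyed by column header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return json.dumps(list(reader), indent=2)


def json_to_table(document, delimiter="\t"):
    """Convert a JSON array of flat objects back into a delimited table.
    Column order is taken from the keys of the first object."""
    rows = json.loads(document)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative sample table; the values are made up for the example.
    sample = (
        "Source Name\tCharacteristics[organism]\tSample Name\n"
        "plant1\tArabidopsis thaliana\tsample1\n"
    )
    as_json = table_to_json(sample)
    print(as_json)
    print(json_to_table(as_json))
```

The resulting list of flat objects is the kind of structure that could be stored directly in a document-based database such as MongoDB.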

COPO will:
- Facilitate easy, relevant data description to:
  - Submit data and metadata to multiple public repositories
  - The reasons most of you are here
  - What are the barriers for you and your data?
- Facilitate access to workflows used to analyse the data, e.g. to GigaDB, Scientific Data
  - This will form part of another COPO workshop