PROTEOMEXCHANGE AN INTERNATIONAL INFRASTRUCTURE FOR OPEN PROTEOMICS DATA

Henning Hermjakob, Team Leader, Proteomics Services, European Bioinformatics Institute, hhe@ebi.ac.uk

Introduction to proteomics

[Figure; labels: Metadata, Raw data, Results]

- Rapidly developing instrumentation and data processing approaches
- Multitude of significantly different workflows
- Complex output data types in many different file formats
- Results depend not only on experiment and instrumentation, but also on the analysis approach
- The same data can be re-analysed meaningfully with a different question in mind
- Dataset size varies from < 100 MB to > 4 TB
- No strong tradition of open data (currently)

Data deposition is incomplete: 27%

Proteomics Data Deposition Requirements 2010
In particular, novel protein sequences should be deposited in UniProt (www.uniprot.org); molecular interactions in an IMEx partner database (imex.sf.net); and protein identification data in PRIDE (www.ebi.ac.uk/pride), World-2DPAGE (www.expasy.org/world-2dpage/), or a comparable database.
If a manuscript is accepted by the journal, all mass spectra contributing to the described work must be deposited in electronic form by the time of publication at a publicly accessible site that is independent of the authors' control.

ProteomeXchange: Data deposition, but where?
Questions about submission:
- Which repository should I submit to?
- Should I submit to more than one?
- Do I need to submit raw data?
Questions about deposited data:
- How do I find all datasets on chromatin?
- Do PRIDE and PeptideAtlas both have this dataset? Do they interpret it differently?
Questions about repository stability:
- Peptidome closed in 2011
- Tranche closed in 01/2013
- Will the data remain publicly available?

ProteomeXchange data flow
[Figure: data flow diagram. Labels: Receiving repositories; PRIDE (MS/MS data); MassIVE (MS/MS data); PASSEL (SRM data); ProteomeCentral; PeptideAtlas; COPaKB; GPMDB; UniProt/neXtProt; Other DBs; Journals; Researcher's results; Reprocessed results; Results; Raw data*; Metadata / Manuscript]

ProteomeXchange: 1620 datasets up until 8 January 2015
Origin: 322 USA, 197 Germany, 148 United Kingdom, 91 Netherlands, 85 France, 81 China, 80 Switzerland, 61 Canada, 48 Belgium, 47 Spain, 45 Denmark, 42 Australia, 40 Japan, 37 Sweden, 28 Austria, 22 India, 21 Norway, 21 Taiwan, 20 Ireland, 20 Finland, 17 Italy, 14 Brazil, 13 Republic of Korea, 13 Russia, 10 Israel, 9 Singapore
Type: 526 PRIDE complete, 982 PRIDE partial, 63 PeptideAtlas/PASSEL complete, 24 MassIVE, 25 reprocessed
Datasets/year: 2012: 102, 2013: 527, 2014: 963, 2015: 28
Publicly accessible: 814 datasets, 50% of all (90% PRIDE, 8% PASSEL, 2% MassIVE)
Top species, studied by at least 10 datasets: 712 Homo sapiens, 193 Mus musculus, 65 Saccharomyces cerevisiae, 61 Arabidopsis thaliana, 35 Rattus norvegicus, 34 Escherichia coli, 17 Bos taurus, 17 Glycine max, 17 Mycobacterium tuberculosis, 16 Drosophila melanogaster, 14 Oryza sativa; ~310 species in total
Data volume: ~71 TB in total; ~160,000 files in total; PXD000320-324: ~5 TB; PXD000065: ~1.4 TB
Vizcaíno et al., Nat. Biotechnol. 2013

Will my data still be there in five years?
- Databases depend on continued funding
  - Tranche repository ceased operations recently: serious data loss
  - Peptidome ceased operations in 2011: no data loss, data still available from NCBI ftp and from PRIDE (Csordas A, et al. From Peptidome to PRIDE: Public proteomics data migration at a large scale. Proteomics. 2013 Mar 27.)
- ProteomeXchange
  - PRIDE and PeptideAtlas have been around since 2005
- PRIDE
  - Institutional funding to ensure basic operations while needed by the community
  - Hardware support: two independent London data centers, eight-year UK support
  - Wellcome Trust PRIDE funding just renewed: from 1/1/2014 for four years
- New ProteomeXchange partners:
  - MassIVE (Nuno Bandeira, UCSD): imported all recoverable Tranche data, joined April 2014
  - Beijing Proteomics Center: might join in the future
- Active collaboration is key for mutual backup in case of funding loss

Complete versus partial submissions in PRIDE
Complete submission:
- MS/MS data.
- Processed results can be converted to the PSI standard mzIdentML or to PRIDE XML.
Partial submission:
- Any type of data (except SRM, which goes to PASSEL), e.g. top down, data independent acquisition, MS Imaging (to come), etc.
- Processed results cannot be converted to a data standard.
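The routing rules on this slide can be written down as a small decision function. This is only a minimal sketch of the rules as stated above: `Route`, `submission_route`, and its parameters are hypothetical names introduced for illustration and are not part of any PRIDE or ProteomeXchange tool.

```python
from enum import Enum


class Route(Enum):
    """Possible destinations, named after the categories on the slide."""
    PRIDE_COMPLETE = "PRIDE complete submission"
    PRIDE_PARTIAL = "PRIDE partial submission"
    PASSEL = "PASSEL (SRM data)"


def submission_route(data_type: str, results_convertible: bool) -> Route:
    """Hypothetical helper that applies the slide's rules.

    data_type: free-text experiment type, e.g. "MS/MS", "SRM", "top down".
    results_convertible: True if the processed results can be converted
        to mzIdentML or PRIDE XML (assumed to be known by the submitter).
    """
    kind = data_type.strip().lower()
    if kind == "srm":
        # SRM data is not submitted to PRIDE; it goes to PASSEL.
        return Route.PASSEL
    if kind == "ms/ms" and results_convertible:
        # MS/MS data with results in a supported standard format.
        return Route.PRIDE_COMPLETE
    # Everything else is accepted as a partial submission.
    return Route.PRIDE_PARTIAL


if __name__ == "__main__":
    print(submission_route("MS/MS", True).value)
    print(submission_route("top down", False).value)
    print(submission_route("SRM", False).value)
```

The function only encodes the distinction drawn on the slide; a real submission involves further checks that are not described here.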

Complete vs partial submissions: processed results
For complete submissions, the spectra can be connected with the processed identification results, and both can be visualized.
[Figure; panel labels: Complete, Partial]

Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - command line
- 10 to 50 times faster than FTP
- File transfer speed is no longer reported as a problem; instead, we get positive feedback

2013: The rise of public proteomics data availability
2014: The rise of proteomics data re-use
Most downloaded datasets (hits / files ≈ complete downloads):
- PXD000561: 153512 / 100 ≈ 1500. A draft map of the human proteome. Kim et al., Nature, 2014. PMID: 24870542
- PXD000851: 111587 / 45 ≈ 2480. Membrane proteomic analysis of colorectal cancer tissue. Kume et al., MCP, 2014. PMID: 24687888
- PXD000865: 51639 / 46 ≈ 1122. Mass spectrometry based draft of the human proteome. Wilhelm et al., Nature, 2014. PMID: 24870543
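The "complete downloads" column is the number of file hits divided by the number of files in the dataset. A minimal sketch of that arithmetic follows, using the hit and file counts listed above; the function name `complete_downloads` is a hypothetical label, and the slide's own figures are rounded, so the computed values differ slightly.

```python
# Hit and file counts copied from the slide above.
DATASETS = {
    "PXD000561": (153512, 100),
    "PXD000851": (111587, 45),
    "PXD000865": (51639, 46),
}


def complete_downloads(hits: int, files: int) -> float:
    """Estimate complete dataset downloads as file hits per file."""
    if files <= 0:
        raise ValueError("files must be a positive number")
    return hits / files


# One estimate per dataset identifier.
estimates = {
    pxd: complete_downloads(hits, files)
    for pxd, (hits, files) in DATASETS.items()
}

if __name__ == "__main__":
    for pxd, estimate in estimates.items():
        print(f"{pxd}: about {estimate:.0f} complete downloads")
```

This treats every hit as one file of a full download, which is an approximation: partial downloads of a dataset also count as hits.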

ProteomeXchange data re-use
- Kuester et al.: inclusion of PX data (among others) in [ Wilhelm et al., 2014, Nature ]
- GPMDB: 15 of 20 "Dataset of the week" entries in 2014 were based on PX datasets [ http://www.thegpm.org/dsotw_2014.html ]
- PeptideAtlas regularly re-processes PX data [ http://www.peptideatlas.org ]
- COPaKB processes relevant cardiovascular PX datasets [ http://www.heartproteome.org/copa/ ]
- PeptideShaker has a direct re-processing link to PRIDE [ Vaudel M, et al. Nature Biotechnology 2015 ]
- PRIDE Cluster integrates across PX datasets [ Griss J, et al. Nat Methods. 2013 ]
- PRIDE download volume in 2014: 150 TB; the most popular datasets were downloaded > 1000 times

The role of DOIs
- Assigning DOIs to PXD datasets was suggested mainly by editors in the ProteomeXchange stakeholder group
- Implemented for complete datasets only, as an incentive for authors to generate complete datasets
- Actual implementation was straightforward
- Usage so far is negligible: the DOI resolution report states 26 resolutions for the top-ranking dataset, PXD001677, over the last six months

PubMed interaction
- On successful submission of a dataset, we ask the submitter to add the PXD identifier to the abstract
- Searching PubMed for PXD00* returns 232 hits
- Valuable for identifying newly released publications
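Because the identifier appears verbatim in the abstract, it can be picked out of free text with a simple pattern. The sketch below assumes identifiers have the form "PXD" followed by six digits, as in the examples in this talk; the function name `find_pxd_ids` and the sample abstract are hypothetical and for illustration only.

```python
import re
from typing import List

# Assumption: a PXD identifier is "PXD" followed by exactly six digits,
# matching the identifiers shown in this talk (e.g. PXD000561).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_ids(text: str) -> List[str]:
    """Return the distinct PXD identifiers in text, in order of appearance."""
    found: List[str] = []
    for match in PXD_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    # Made-up abstract text, not taken from a real publication.
    abstract = (
        "The mass spectrometry data have been deposited to ProteomeXchange "
        "with identifier PXD000764. PXD000764 contains 12 runs."
    )
    print(find_pxd_ids(abstract))
```

The word boundaries keep the pattern from matching longer strings that merely contain an identifier-like substring.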

Citation
- On successful submission of a dataset, we ask the submitter to cite the PXD identifier
- Reasonably well adopted by the community

Outreach and dissemination
Example dataset: PXD000764
- Title: Discovery of new CSF biomarkers for meningitis in children
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014

Perspective: A multi-omics DDI

http://metabolomexchange.org - Coordination of Standards in Metabolomics

From ProteomeXchange to a Multi-omics Data Discovery Index
Aim: Develop an infrastructure for integrated data discovery across
- ProteomeXchange
- MetabolomeXchange
- European Genotype Phenotype archive
Challenges:
- Common technical infrastructure (XML, RDF, ...): feasible
- Common metadata representation (attributes, ontologies, ...): hard
Demonstrator for findability across omics:
- Findability across eight repositories in two continents and six organisations
- Findability across open and controlled access repositories
Collaborations:
- BD2K consortium collaboration with the DDI bioCADDIE project
- BD2K / EU ELIXIR collaboration

Acknowledgements
ProteomeXchange partners, particularly:
- Eric Deutsch, ISB, Seattle
- Andy Jones, U Liverpool
- Lennart Martens, U Gent
- Pierre-Alain Binz, SIB, Geneva
- Martin Eisenacher, MPC, Bochum
- Ruedi Aebersold, ETH Zurich
- Juan Pablo Albar, CSIC, Madrid
- Laurent Gatto, U Cambridge
- Nuno Bandeira, UCSD
- Peipei Ping, UCLA
Reactome:
- Antonio Fabregat Mundo
- Steve Jupe
- Phani Garapati
- Lincoln Stein, OICR, Canada
- Peter D'Eustachio, NYU
- Guanming Wu, OICR
- Joel Weiser, OICR
- Bijay Jassal, OICR
PRIDE team:
- Juan Antonio Vizcaino
- Rui Wang
- Florian Reisinger
- Attila Csordas
- Tobias Ternent
- Jose Dianes
- Yasset Perez-Riverol
- Noemi del Toro Ayllon
- Johannes Griss
Editors:
- Mike Dunn, Proteomics
- Achim Kraus, Proteomics
- Ralph Bradshaw, MCP
- Bill Hancock, JPR
Funding:
- NIH BD2K Centers of Excellence for Big Data Computing, grant number 1U54GM114833-01
- NHLBI Proteomics Center Award HHSN268201000035C
- Wellcome Trust: PRIDE
- EU FP7: ProteomeXchange, PSIMEX
- BBSRC: PROCESS
- EMBL-EBI
All data providers!

?
If the Human Genome Project had not followed an open data release policy, what would we be searching our spectra against today?
proteomexchange.org
psidev.info