PROTEOMEXCHANGE AN INTERNATIONAL INFRASTRUCTURE FOR OPEN PROTEOMICS DATA

Henning Hermjakob
Team Leader Proteomics Services, European Bioinformatics Institute
hhe@ebi.ac.uk

Introduction to proteomics

[Figure: introduction to proteomics workflow; labels: Metadata, Raw data, Results]

Introduction to proteomics
- Rapidly developing instrumentation and data processing approaches
- Multitude of significantly different workflows
- Complex output data types in many different file formats
- Results depend not only on experiment and instrumentation, but also on the analysis approach
- The same data can be re-analysed meaningfully with a different question in mind
- Dataset size varies from < 100 MB to > 4 TB
- No strong tradition of open data (currently)

Data deposition is incomplete: 27%

Proteomics Data Deposition Requirements 2010
"In particular, novel protein sequences should be deposited in UniProt (www.uniprot.org); molecular interactions in an IMEx partner database (imex.sf.net); and protein identification data in PRIDE (www.ebi.ac.uk/pride), World-2DPAGE (www.expasy.org/world-2dpage/), or a comparable database. If a manuscript is accepted by the journal, all mass spectra contributing to the described work must be deposited in electronic form by the time of publication at a publicly accessible site that is independent of the authors' control."

ProteomeXchange: Data deposition, but where?
Questions about submission:
- Which repository should I submit to?
- Should I submit to more than one?
- Do I need to submit raw data?
Questions about deposited data:
- How do I find all datasets on chromatin?
- Do PRIDE and PeptideAtlas both have this dataset?
- Do they interpret it differently?
Questions about repository stability:
- Peptidome closed in 2011
- Tranche closed in 01/2013
- Will the data remain publicly available?

ProteomeXchange data flow
[Figure: data flow diagram. Labels: researcher's results (Results, Raw data*, Metadata / Manuscript); receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; Journals; PeptideAtlas, COPaKB, GPMDB, UniProt/neXtProt, other DBs; reprocessed results (Raw data*, Metadata)]

ProteomeXchange: 1620 datasets up until 8 January 2015
- Origin: 322 USA, 197 Germany, 148 United Kingdom, 91 Netherlands, 85 France, 81 China, 80 Switzerland, 61 Canada, 48 Belgium, 47 Spain, 45 Denmark, 42 Australia, 40 Japan, 37 Sweden, 28 Austria, 22 India, 21 Norway, 21 Taiwan, 20 Ireland, 20 Finland, 17 Italy, 14 Brazil, 13 Republic of Korea, 13 Russia, 10 Israel, 9 Singapore
- Type: 526 PRIDE complete, 982 PRIDE partial, 63 PeptideAtlas/PASSEL complete, 24 MassIVE, 25 reprocessed
- Datasets/year: 2012: 102; 2013: 527; 2014: 963; 2015: 28
- Publicly accessible: 814 datasets, 50% of all (90% PRIDE, 8% PASSEL, 2% MassIVE)
- Top species studied by at least 10 datasets: 712 Homo sapiens, 193 Mus musculus, 65 Saccharomyces cerevisiae, 61 Arabidopsis thaliana, 35 Rattus norvegicus, 34 Escherichia coli, 17 Bos taurus, 17 Glycine max, 17 Mycobacterium tuberculosis, 16 Drosophila melanogaster, 14 Oryza sativa; ~310 species in total
- Data volume: total ~71 TB; number of all files ~160,000; PXD000320-324: ~5 TB; PXD000065: ~1.4 TB
Vizcaíno et al., Nat. Biotechnol. 2013
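As a quick consistency check on the figures above, the per-type and per-year counts can be summed and compared with the stated total of 1620. The sketch below uses only numbers quoted on this slide; the variable names are illustrative and not part of any ProteomeXchange tool.

```python
# Counts copied from the slide (snapshot of 8 January 2015).
total_datasets = 1620

by_type = {
    "PRIDE complete": 526,
    "PRIDE partial": 982,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 24,
    "reprocessed": 25,
}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 28}
public_datasets = 814

# Both breakdowns should add up to the stated total.
type_sum = sum(by_type.values())
year_sum = sum(by_year.values())

# Share of datasets that are publicly accessible.
public_share = public_datasets / total_datasets

print(type_sum, year_sum, f"{public_share:.1%}")
```

Both breakdowns add up to 1620, and 814 of 1620 is just over 50%, in line with the slide.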

Will my data still be there in five years?
- Databases depend on continued funding
- Tranche repository ceased operations recently: serious data loss
- Peptidome ceased operations in 2011: no data loss, data still available from NCBI ftp and from PRIDE (Csordas A, et al. From Peptidome to PRIDE: Public proteomics data migration at a large scale. Proteomics. 2013 Mar 27.)
ProteomeXchange:
- PRIDE and PeptideAtlas have been around since 2005
PRIDE:
- Institutional funding to ensure basic operations while needed by community
- Hardware support: two independent London data centers, eight year UK support
- Wellcome Trust PRIDE funding just renewed: from 1/1/2014 for four years
New ProteomeXchange partners:
- MassIVE (Nuno Bandeira, UCSD): imported all recoverable Tranche data; joined April 2014
- Beijing Proteomics Center: might join in the future
Active collaboration is key for mutual backup in case of funding loss

Complete versus partial submissions in PRIDE
- Complete submission: MS/MS data. Processed results can be converted to the PSI standard mzIdentML or to PRIDE XML.
- Partial submission: any type of data (except SRM, which goes to PASSEL), e.g. top-down, data-independent acquisition, MS imaging (to come). Processed results cannot be converted to a data standard.
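The rule on this slide can be written down as a small decision function. This is only a sketch of the rule as stated here: the function name, the string labels for experiment types and the return values are made up for illustration and do not correspond to the PX Submission tool.

```python
def classify_submission(experiment_type: str, results_convertible: bool) -> str:
    """Map a dataset to the submission route described on the slide.

    experiment_type: label chosen for this sketch, e.g. "MS/MS", "SRM", "top-down".
    results_convertible: True if the processed results can be converted
        to mzIdentML or PRIDE XML.
    """
    if experiment_type == "SRM":
        # SRM data is not submitted to PRIDE; it goes to PASSEL.
        return "PASSEL"
    if experiment_type == "MS/MS" and results_convertible:
        return "PRIDE complete"
    # Everything else, including MS/MS without convertible results.
    return "PRIDE partial"


for case in [("MS/MS", True), ("MS/MS", False), ("top-down", False), ("SRM", False)]:
    print(case, "->", classify_submission(*case))
```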

Complete vs partial submissions: processed results
For complete submissions, the spectra can be connected with the processed identification results, and both can be visualized.
[Figure: side-by-side panels labelled Complete and Partial]

Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- command line
- 10-50x faster than ftp
File transfer speed is no longer reported as a problem; instead we get positive feedback.

2013: The rise of public proteomics data availability
2014: The rise of proteomics data re-use
Columns: PXD identifier; hits / files = complete downloads; dataset title; publication
- PXD000561; 153512 / 100 = 1500; A draft map of the human proteome; Kim et al., Nature, 2014. PMID: 24870542
- PXD000851; 111587 / 45 = 2480; Membrane proteomic analysis of colorectal cancer tissue; Kume et al., MCP, 2014. PMID: 24687888
- PXD000865; 51639 / 46 = 1122; Mass spectrometry based draft of the human proteome; Wilhelm et al., Nature, 2014. PMID: 24870543
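The "complete downloads" column is an estimate: file hits divided by the number of files in the dataset. A minimal sketch of that arithmetic, using the hit and file counts quoted above (the function and variable names are illustrative):

```python
# (accession, file hits, number of files) as quoted on the slide.
datasets = [
    ("PXD000561", 153512, 100),
    ("PXD000851", 111587, 45),
    ("PXD000865", 51639, 46),
]


def complete_downloads(hits: int, files: int) -> float:
    """Estimate full-dataset downloads as file hits per file."""
    return hits / files


estimates = {acc: complete_downloads(hits, files) for acc, hits, files in datasets}
for acc, value in estimates.items():
    print(f"{acc}: ~{value:.0f} complete downloads")
```

The computed ratios are about 1535, 2480 and 1123, so the slide's values (1500, 2480, 1122) appear to be rounded or truncated.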

ProteomeXchange data re-use
- Kuester et al.: inclusion of PX data (among others) in [ Wilhelm et al., 2014, Nature ]
- GPMDB: 15 of 20 "Dataset of the week" entries in 2014 were based on PX datasets [ http://www.thegpm.org/dsotw_2014.html ]
- PeptideAtlas regularly re-processes PX data [ http://www.peptideatlas.org ]
- COPaKB processes relevant cardiovascular PX datasets [ http://www.heartproteome.org/copa/ ]
- PeptideShaker has a direct re-processing link to PRIDE [ Vaudel M, et al. Nature Biotechnology 2015 ]
- PRIDE Cluster integrates across PX datasets [ Griss J, et al. Nat Methods. 2013 ]
- PRIDE download volume in 2014: 150 TB; the most popular datasets were downloaded > 1000 times

The role of DOIs
- Mainly editors from the ProteomeXchange stakeholder group suggested assigning DOIs to PXD datasets
- Implemented for complete datasets only, as an incentive for authors to generate complete datasets
- Actual implementation was straightforward
- Usage so far is negligible: the DOI resolution report states 26 resolutions for the top-ranking PXD001677 over the last six months

Pubmed interaction
- On successful submission of a dataset, we ask the submitter to add the PXD identifier to the abstract
- A Pubmed search for PXD00* returns 232 hits
- Valuable for identifying newly released publications
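Once identifiers appear in abstracts, they can be picked out mechanically. The sketch below assumes the "PXD" plus six digits form seen throughout this talk (e.g. PXD000561); the abstract text and the helper name are made up for illustration.

```python
import re
from typing import List

# Assumed accession format: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_accessions(text: str) -> List[str]:
    """Return unique PXD accessions in order of first appearance."""
    seen: List[str] = []
    for match in PXD_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical abstract text, written for this example only.
abstract = (
    "The mass spectrometry data have been deposited to ProteomeXchange "
    "with identifiers PXD000561 and PXD000865 (see also PXD000561)."
)
print(find_pxd_accessions(abstract))
```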

Citation
- On successful submission of a dataset, we ask the submitter to cite the PXD identifier
- Reasonably well adopted by the community

Outreach and dissemination
Example dataset: PXD000764
- Title: Discovery of new CSF biomarkers for meningitis in children
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014

Perspective: A multi-omics DDI

http://metabolomexchange.org - Coordination of Standards in Metabolomics


From ProteomeXchange to a Multi-omics Data Discovery Index
Aim: develop an infrastructure for integrated data discovery across
- ProteomeXchange
- MetabolomeXchange
- European Genome-phenome Archive
Challenges:
- Common technical infrastructure (XML, RDF, ...): feasible
- Common metadata representation (attributes, ontologies, ...): hard
Demonstrator for findability across omics:
- Findability across eight repositories in two continents and six organisations
- Findability across open and controlled access repositories
- BD2K consortium collaboration with DDI BioCADDIE project
- BD2K EU Elixir collaboration

Acknowledgements
ProteomeXchange partners, particularly: Eric Deutsch, ISB, Seattle; Andy Jones, U Liverpool; Lennart Martens, U Gent; Pierre-Alain Binz, SIB, Geneva; Martin Eisenacher, MPC, Bochum; Ruedi Aebersold, ETH Zurich; Juan Pablo Albar, CSIC, Madrid; Laurent Gatto, U Cambridge; Nuno Bandeira, UCSD; Peipei Ping, UCLA
Reactome: Antonio Fabregat Mundo; Steve Jupe; Phani Garapati; Lincoln Stein, OICR, Canada; Peter D'Eustachio, NYU; Guanming Wu, OICR; Joel Weiser, OICR; Bijay Jassal, OICR
PRIDE team: Juan Antonio Vizcaíno; Rui Wang; Florian Reisinger; Attila Csordas; Tobias Ternent; Jose Dianes; Yasset Perez Riverol; Noemi del Toro Ayllon; Johannes Griss
Editors: Mike Dunn, Proteomics; Achim Kraus, Proteomics; Ralph Bradshaw, MCP; Bill Hancock, JPR
Funding: NIH BD2K Centers of Excellence for Big Data Computing grant number 1U54GM114833-01; NHLBI Proteomics Center Award HHSN268201000035C; Wellcome Trust PRIDE; EU FP7 ProteomeXchange, PSIMEX; BBSRC PROCESS; EMBL-EBI
All data providers!

If the Human Genome Project had not followed an open data release policy, what would we be searching our spectra against today?
proteomexchange.org
psidev.info