FROM DATA REPOSITORIES TO DATA JOURNALS: PUBLISHING AND SHARING HEALTH-RELATED DATA
Big Data in Health Care, Oct 28th, 2015
Andrew L. Hufton, Managing Editor, Scientific Data, Nature Publishing Group
What can publishers do to incentivize data sharing?
Remove barriers to sharing: Nature titles and Scientific Data all explicitly allow pre-publication sharing of data and article preprints. Publication of data articles will not compromise the novelty of subsequent research articles. But many scientists are not aware of these policies, and policies vary across publishers.
Fundamental sharing policy for Nature and the Nature research journals: "An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. A condition of publication in a Nature journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors [and] in the submitted manuscript." See http://www.nature.com/authors/policies/availability.html
Some problems with sharing upon request: It relies heavily on trust. Data associated with published works disappear at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014). Datasets not referenced in a manuscript are essentially invisible (a.k.a. "dark data"). Data producers do not get appropriate credit for their work.
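To give a feel for what a loss of ~17% per year implies, here is a minimal sketch. It assumes the slide's figure can be treated as a constant annual loss rate that compounds year on year, which is a simplification and not necessarily how the cited study modelled it. The function names are illustrative only.

```python
import math

# Assumption: the slide's ~17% per year, treated as a constant compounding loss rate.
ANNUAL_LOSS = 0.17


def fraction_remaining(years, annual_loss=ANNUAL_LOSS):
    """Share of datasets still obtainable after `years` under constant annual loss."""
    return (1.0 - annual_loss) ** years


def years_until(fraction, annual_loss=ANNUAL_LOSS):
    """Years until only `fraction` of datasets remain, under the same assumption."""
    return math.log(fraction) / math.log(1.0 - annual_loss)


if __name__ == "__main__":
    for y in (1, 5, 10, 20):
        print(f"after {y:2d} years: {fraction_remaining(y):.1%} remaining")
    print(f"years until half are gone: {years_until(0.5):.1f}")
```

Under this simplified model, half of the datasets would be unobtainable after roughly 3.7 years.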
Data-access practices strengthened (see editorial in Nature, Nov 2014): Clear preference for sharing large datasets via public repositories. Enforce data deposition in fields where there is strong community consensus. List of public data repositories now maintained by Scientific Data. Encourage authors to publish Data Descriptors at Scientific Data.
Data journals: rewarding those who share above and beyond existing standards. Data must be well described before others can use it and benefit from it. Scientists who share data in a reusable manner deserve credit through citable publications. Several journals now offer data paper article types, including GigaScience, F1000Research, Earth System Science Data, and Biodiversity Data Journal.
Launched in May 2014
Get credit for sharing your data: Publications will be indexed and citable. Open access: Authors select from three Creative Commons licenses for the main Data Descriptor; each publication is supported by CC0 metadata. Focused on data reuse: All the information others need to reuse the data; no interpretative analysis or hypothesis testing. Peer-reviewed: Rigorous peer review focused on technical data quality and reuse value. Promoting community data repositories: Not a new data repository; data are stored in community data repositories.
Clear data sharing policies: Data must be deposited in an approved data repository before manuscript submission, prior to peer review. If datasets are private, they must be made accessible to editors and referees in a secure and confidential manner. Authors must agree to release data to the public, without undue restrictions, at the time of publication. Reasonable controls are allowed for datasets with human privacy restrictions.
Data Descriptor: (1) Article, or narrative component (PDF and HTML). (2) Experimental metadata, or structured component (in-house curated, machine-readable formats).
Data Descriptor: Focus on data reuse. Detailed descriptions of the methods and technical analyses supporting the quality of the measurements. Does not contain tests of new scientific hypotheses. Sections: Title; Abstract; Background & Summary; Methods; Data Records; Technical Validation; Usage Notes; Figures & Tables; References; Data Citations.
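The section list above can be used as a simple checklist when drafting a Data Descriptor. The sketch below is a hypothetical helper, not a Scientific Data tool: it takes the headings found in a draft and reports which of the listed sections are absent. It assumes, for illustration, that every listed section is expected in every manuscript, which the slide does not state.

```python
# Section names as listed on the slide; the helper itself is hypothetical.
DATA_DESCRIPTOR_SECTIONS = [
    "Title",
    "Abstract",
    "Background & Summary",
    "Methods",
    "Data Records",
    "Technical Validation",
    "Usage Notes",
    "Figures & Tables",
    "References",
    "Data Citations",
]


def missing_sections(draft_headings):
    """Return the listed sections that do not appear among the draft's headings.

    Matching ignores case and surrounding whitespace; order follows the slide.
    """
    present = {heading.strip().lower() for heading in draft_headings}
    return [s for s in DATA_DESCRIPTOR_SECTIONS if s.lower() not in present]


if __name__ == "__main__":
    draft = ["Title", "Abstract", "Methods", "data records", "References"]
    for section in missing_sections(draft):
        print("missing:", section)
```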
Find the right repository for your data: Browse our recommended data repositories online. We currently list more than 80 repositories across the biological, physical and social sciences. We advise authors on the best place to store their data.
Sharing of human data, and particularly clinically-derived datasets
Clinical researchers support sharing: Sharing de-identified data via repositories should be required (236 respondents, 74%). Investigators should share de-identified data on request (229 respondents, 72%). Source: Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS. Sharing of clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570.
Data on (reasonable) request, issues: Meta-analyses fail to launch when insufficient IPD (individual participant data) are available: unanswered requests and refusals to share (Systematic Reviews 2014, 3:97, doi:10.1186/2046-4053-3-97). Poor availability of psychological research data, with only 64/249 datasets available (American Psychologist, Vol 61(7), Oct 2006, 726-728, doi:10.1037/0003-066x.61.7.726). Data received from 1/10 authors publishing in PLOS Medicine and PLOS Clinical Trials (PLoS ONE 4(9): e7078, doi:10.1371/journal.pone.0007078).
Better way to publish clinical data? Working group (Dec 2014) to produce guidelines [1] on publishing descriptions of non-public clinical data. Goal: to connect data-on-request services with a trusted repository and journal article. Expected benefits: publication and permanence; peer review of data and article; discoverability (e.g. PubMed); a new option for negative/unpublished data?; robust links with repositories. [1] Hrynaszkiewicz, I., Khodiyar, V., Hufton, A. & Sansone, S. A. Publishing descriptions of non-public clinical datasets: guidance for researchers, repositories, editors and funding organisations. BioRxiv http://dx.doi.org/10.1101/021667 (2015).
Controlled access meets open data: what should open science journals expect? A well-documented path for other researchers, including competitors, to gain access to data. Mechanisms for peer reviewers to access anonymized data. Repositories and sharing platforms that support controlled access. Terms of use that allow others to share and publish new findings and analyses.
Restricted access Data Descriptor: http://www.nature.com/articles/sdata201531
Clear explanations of how to request restricted data: http://www.nature.com/articles/sdata201531
Data hosted at Harvard Dataverse: http://dx.doi.org/10.7910/dvn/25833
Sharing open derivatives of restricted data: Term co-occurrence data derived from 20 million electronic health records. Co-occurrence data openly shared via Dryad.
Thanks! Now launched!
Managing Editor, Scientific Data: Andrew L. Hufton
Honorary Academic Editor: Susanna-Assunta Sansone
Advisory Panel and Editorial Board including senior researchers, funders, librarians and curators
Visit nature.com/scientificdata
Email scientificdata@nature.com
Tweet @ScientificData