Data Publishing Workflows with Dataverse
|
|
|
- Sibyl Holland
- 10 years ago
- Views:
Transcription
1 Data Publishing Workflows with Dataverse Mercè Crosas, Ph.D. Director of Data Science Institute for Quantitative Social Science, Harvard University MIT, May 6, 2014
2 Intro to our Data Science Team and Projects
3 Data Science at the Institute for Quantitative Social Science
4 Combines Expertise Researchers Statistical Innovation Data Curation & Stewardship Information Scientists Data Science Applications and Tools Tool Building & Computer Science Software Engineers
5 With a Team of 20 Mercè Crosas, Director of Data Science Gary King, Director of IQSS Cris Rothfuss, Excutive Director Statistics and Analytics James Honaker Christine Choirat Vito d Orazio QA Kevin Condon Elda Sotiri Software Development Gustavo Durand Robert Treacy Ellen Kraffmiller Michael Bar-Sinai Leonid Andreev Phil Durbin Steve Kraffmiller Xiangqing Yang Raman Prasad (BARI) Data Curation and Archiving Sonia Barbosa Eleni Castro Dwayne Liburd Usability and UI Elizabeth Quigley Michael Heppler
6 Two widely-used Frameworks Developed in the last Decade A framework that allows analysts to use and interpret a large body of R statistical models from heterogeneous contributors through a common interface. A data publishing framework that allows researchers to share, preserve, cite and analyze data, while keeping control and gaining credit for their data.
7 New Tools that Integrate with our Initial Work An interactive web interface that allows users at all levels of statistical expertise to explore their data and appropriately construct statistical models. Integrates with Zelig and Dataverse. A framework that allows data contributors to set a level of sensitivity for their dataset based on legal regulations, which defines how the data can be stored and shared. Integrates with Dataverse. In collaboration with NSF Privacy Tools project
8 Expanding in other Areas A web application that assists researchers to discover new clusters to categorize large document sets, leveraging all the clustering methods in the literature. An application that provides a continuous integration build solution for R packages shared in Git to archived published code in CRAN.
9 Support Throughout the Research Cycle Develop Quantitative Methods Analyze Quantitative Datasets Publish Data Cite Data from Published Results Analyze Unstructured Text Share Sensitive Data Explore, reanalyze and reuse data Develop > Analyze > Share > Explore > Validate & Reuse
10 Current Research Interests and Efforts Reproducible and Reusable Science: encourage open data and methodological transparency, and promote and enable data citation (with Dataverse, Zelig and SolaFide) Computationally Assisted Exploration: with Consilience and SolaFide, assist researchers to understand and discover new insights from their data Interdisciplinary Quantitative Scientific Scope: our tools and research frameworks address broad methodological issues in quantitative science and are often employed in other domains When Data are Not Open: solutions to preserve privacy, while still providing science the fundamental ability to learn, access and replicate findings, with DataTags and PrivateZelig Large-Scale Data Sets: will handle large-scale data sets, as Big Data science reaches all disciplines: Consilience for millions of text documents, and Zelig and Dataverse to handle TB-PB-scale data sets.
11 Harvard Dataverse
12 The Harvard Dataverse Repository! In collaboration with the Harvard Library, Harvard hosts a Dataverse instance free and open to all researchers.! It currently holds > 53,000 datasets, with 735,000 files.! Find or deposit data at:
13 Collaborations with MIT! Membership through the Harvard-MIT Data Center (e.g., statistics training, access to ICPSR collection)! The MIT Libraries Dataverse disseminates data purchased by the MIT Libraries (with Kate McNeill):! MIT faculty and research groups are already disseminating their data through the Harvard Dataverse! Research collaborations (with Micah Altman):! Integration of Publications with Data (Funded by Sloan): Privacy Tools for Sharing Research Data (Funded by NSF):
14 Dataverse 4.0 Target release date: June 23 New UI New rich, faceted search New data file ingest (excel, CSV, R, Stata, SPSS) New metadata for social sciences, astronomy, biomedical sciences. Integration with SolaFide.
15 SolaFide Demo
16 Data Publishing Workflows
17 Data Publishing Guidelines Three pillars to Data Publishing:! A trusted data repository to guarantee long-term access! A formal data citation*! Sufficient information to understand and reuse the data (metadata, documentation, code) * Data Citation Principles:
18 A Rigorous Publishing Workflow Release Version 1 Authors, Title, Year, DOI Repository, UNF, V1 A Published Dataset cannot be deleted (only deaccession, if legally needed) Push Version 1.1: small metadata change; citation doesn t change Push Version 2: big metadata change, or file change; citation changes Authors, Title, Year, DOI Repository, UNF, V2
19 Workflows that Integrate with Journals 1. Publish a dataset to your Dataverse, then provide the Data Citation to the journal. 2. Contribute to a journal Dataverse: 1. Add dataset to Journal Dataverse as a draft. 2. Journal Editor reviews it, and approves it for release. 3. Dataset is published with Data Citation and link from journal article to the data. 3. Seamless Integration between journal system and Dataverse.
20 OJS and Dataverse Integration! Sloan funded project to integrate PKP s Open Journal System with the Dataverse software.! Pilot with ~ 50 journals! OJS Dataverse plugin now available with latest OJS release!
21 Detailed System Integration! XML file: AtomPub "entry" with Dublin Core Terms (e.g., title, creator)! Zip file: All data files associated with that dataset.! HTTP header "In-Progress: false" to publish datasets.! Support HTTP verbs: GET, PUT, POST, and DELETE.! XML file: Deposit Receipt! HTTP status code: 200, 201, 204, 404, 405, 406, 412, 415 Client can query repository (server) any time to get status
22 Deposit API based on SWORD! Follows SWORD2 specifications! SWORD is known and supported within academic publishing; a profile of the AtomPub standard.! The SWORD project provides client libraries for Python, Java, Ruby, and PHP:! OJS uses the PHP client library! OSF uses the Python client library! DataUp and DVN-R built a custom Dataverse client
23 How it differs from SWORD! Dataverse does not use SWORD download API:! Use own Data API! Plan to add this support in the future! Add XML attribute to pass article citation from client:! Allow DCterms:isReferencedby to contain attributes such as HoldingsURI to link back to article from Dataverse! This is now part of the SWORD PHP client library! Use In-Progress: false to indicate that dataset is ready to be published (In SWORD spec means deposit complete)
24 Support for Metadata Standards! A core or citation metadata that applies to all datasets Supported currently by Data Deposit API! Extensible metadata blocks for specific domains:! Social sciences:! Maps to DDI schema;! file metadata extracted from tabular data file! Astronomy:! Maps to VO schema;! partially extracted from FITS file! Biomedical sciences:! Maps to ISA-tab schema! Controlled vocabularies maps to EFO, OBI, and Ontology of Clinical Research! Extended and managed using SKOS (support taxonomies within the framework of the semantic web)
25
26
27
28
29
30 Upcoming
31 Expanding to Larger and More Types of Data! Sharing sensitive data with DataTags and Secure Dataverse! Integration with other systems:! OSF! DataUp! WorldMap! DataBridge! ORCID! DASH (at Harvard)! Expand to Larger data sets
32 DataTags: For Sharing Sensitive Data
33
34 THANKS Twitter: mercecrosas (Beta)
Open Journal Systems and Dataverse Integration-- Helping Journals to Upgrade Data Publication for Reusable Research
Open Journal Systems and Dataverse Integration-- Helping Journals to Upgrade Data Publication for Reusable Research Micah Altman Director of Research, MIT Libraries http://informatics.mit.edu
Using Dataverse Virtual Archive Technology for Research Data Management. Jonathan Crabtree Thu-Mai Christian Amanda Gooch
Using Dataverse Virtual Archive Technology for Research Data Management Jonathan Crabtree Thu-Mai Christian Amanda Gooch H. W. Odum Institute Archive Services The Howard W. Odum Institute was founded in
INTRODUCTION TO THE DATAVERSE NETWORK
INTRODUCTION TO THE DATAVERSE NETWORK JANUARY 7, 2015 Jonathan Crabtree Assistant Director of Computing and Archival Research THE ODUM INSTITUTE FOR RESEARCH IN SOCIAL SCIENCE 228 DAVIS LIBRARY, CB# 3355
OpenAIRE Research Data Management Briefing paper
OpenAIRE Research Data Management Briefing paper Understanding Research Data Management February 2016 H2020-EINFRA-2014-1 Topic: e-infrastructure for Open Access Research & Innovation action Grant Agreement
How To Useuk Data Service
Publishing and citing research data Research Data Management Support Services UK Data Service University of Essex April 2014 Overview While research data is often exchanged in informal ways with collaborators
Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer
Survey of Canadian and International Data Management Initiatives By Diego Argáez and Kathleen Shearer on behalf of the CARL Data Management Working Group (Working paper) April 28, 2008 Introduction Today,
Get More Value from Your Reference Data Make it Meaningful with TopBraid RDM
Get More Value from Your Reference Data Make it Meaningful with TopBraid RDM Bob DuCharme Data Governance and Information Quality Conference June 9 TopQuadrant Company Focus: TopQuadrant was founded in
data.bris: collecting and organising repository metadata, an institutional case study
Describe, disseminate, discover: metadata for effective data citation. DataCite workshop, no.2.. data.bris: collecting and organising repository metadata, an institutional case study David Boyd data.bris
Metadata driven framework for the Canada Research Data Centre Network
Metadata driven framework for the Canada Research Data Centre Network IASSIST 2010 Session A4: DDI3 Tools Pascal Heus, Metadata Technology North America [email protected] http://www.metadatatechnology.com
Research Data Archival Guidelines
Research Data Archival Guidelines LEROY MWANZIA RESEARCH METHODS GROUP APRIL 2012 Table of Contents Table of Contents... i 1 World Agroforestry Centre s Mission and Research Data... 1 2 Definitions:...
Checklist for a Data Management Plan draft
Checklist for a Data Management Plan draft The Consortium Partners involved in data creation and analysis are kindly asked to fill out the form in order to provide information for each datasets that will
Documenting the research life cycle: one data model, many products
Documenting the research life cycle: one data model, many products Mary Vardigan, 1 Peter Granda, 2 Sue Ellen Hansen, 3 Sanda Ionescu 4 and Felicia LeClere 5 Introduction Technical documentation for social
Data Management Resources at UNC: The Carolina Digital Repository and Dataverse Network
Data Management Resources at UNC: The Carolina Digital Repository and Dataverse Network November 16, 2010 Data Management Short Course Series Sponsored by the Odum Institute and the UNC Libraries Campus
Stewarding Big Data: Perspectives on Public Access to Federally Funded Scientific Research Data
Stewarding Big Data: Perspectives on Public Access to Federally Funded Scientific Research Data Big Data and Big Challenges for Law and Legal Information Georgetown Law Library January 30, 2013 William
Johns Hopkins University Data Management Services
Johns Hopkins University Data Management Services Dave Fearon, Betsy Gunia, and Tim DiLauro [email protected] http://dmp.data.jhu.edu/ JHU Data Management Services How we started: NSF s Data Management
DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM
DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM Introduction The Institute of Museum and Library Services (IMLS) is committed to expanding public access to federally funded research, data, software,
Databases & Data Infrastructure. Kerstin Lehnert
+ Databases & Data Infrastructure Kerstin Lehnert + Access to Data is Needed 2 to allow verification of research results to allow re-use of data + The road to reuse is perilous (1) 3 Accessibility Discovery,
Canadian National Research Data Repository Service. CC and CARL Partnership for a national platform for Research Data Management
Research Data Management Canadian National Research Data Repository Service Progress Report, June 2016 As their digital datasets grow, researchers across all fields of inquiry are struggling to manage
D5.5 Initial EDSA Data Management Plan
Project acronym: Project full : EDSA European Data Science Academy Grant agreement no: 643937 D5.5 Initial EDSA Data Management Plan Deliverable Editor: Other contributors: Mandy Costello (Open Data Institute)
Building Semantic Content Management Framework
Building Semantic Content Management Framework Eric Yen Computing Centre, Academia Sinica Outline What is CMS Related Work CMS Evaluation, Selection, and Metrics CMS Applications in Academia Sinica Concluding
How To Write A Blog Post On Globus
Globus Software as a Service data publication and discovery Kyle Chard, University of Chicago Computation Institute, [email protected] Jim Pruyne, University of Chicago Computation Institute, [email protected]
SHared Access Research Ecosystem (SHARE)
SHared Access Research Ecosystem (SHARE) June 7, 2013 DRAFT Association of American Universities (AAU) Association of Public and Land-grant Universities (APLU) Association of Research Libraries (ARL) This
REACCH PNA Data Management Plan
REACCH PNA Data Management Plan Regional Approaches to Climate Change (REACCH) For Pacific Northwest Agriculture 875 Perimeter Drive MS 2339 Moscow, ID 83844-2339 http://www.reacchpna.org [email protected]
How to avoid building a data swamp
How to avoid building a data swamp Case studies in Hadoop data management and governance Mark Donsky, Product Management, Cloudera Naren Korenu, Engineering, Cloudera 1 Abstract DELETE How can you make
Tools for Researchers
Tools for Researchers Microsoft Research & the Scholarly Information Ecosystem Lee Dirks & Alex Wade Directors, Education & Scholarly Communication Microsoft External Research Microsoft Corporation Microsoft
Interagency Science Working Group. National Archives and Records Administration
Interagency Science Working Group 1 National Archives and Records Administration Establishing Trustworthy Digital Repositories: A Discussion Guide Based on the ISO Open Archival Information System (OAIS)
Selecting a Taxonomy Management Tool. Wendi Pohs InfoClear Consulting #SLATaxo
Selecting a Taxonomy Management Tool Wendi Pohs InfoClear Consulting #SLATaxo InfoClear Consulting What do we do? Content Analytics Strategy and Implementation, including: Taxonomy/Ontology development
A Capability Maturity Model for Scientific Data Management
A Capability Maturity Model for Scientific Data Management 1 A Capability Maturity Model for Scientific Data Management Kevin Crowston & Jian Qin School of Information Studies, Syracuse University July
Building integration environment based on OAI-PMH protocol. Novytskyi Oleksandr Institute of Software Systems NAS Ukraine [email protected].
Building integration environment based on OAI-PMH protocol Novytskyi Oleksandr Institute of Software Systems NAS Ukraine [email protected] Roadmap What is OAI-PMH? Requirements for infrastructure Step by
Summary of Responses to the Request for Information (RFI): Input on Development of a NIH Data Catalog (NOT-HG-13-011)
Summary of Responses to the Request for Information (RFI): Input on Development of a NIH Data Catalog (NOT-HG-13-011) Key Dates Release Date: June 6, 2013 Response Date: June 25, 2013 Purpose This Request
DDI Lifecycle: Moving Forward Status of the Development of DDI 4. Joachim Wackerow Technical Committee, DDI Alliance
DDI Lifecycle: Moving Forward Status of the Development of DDI 4 Joachim Wackerow Technical Committee, DDI Alliance Should I Wait for DDI 4? No! DDI Lifecycle 4 is a long development process DDI Lifecycle
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
Geospatial Data Stewardship at an Interdisciplinary Data Center
Geospatial Data Stewardship at an Interdisciplinary Data Center Robert R. Downs, PhD Senior Digital Archivist and Senior Staff Associate Officer or Research Acting Head of Cyberinfrastructure and Informatics
Best Practices for Data Management. RMACC HPC Symposium, 8/13/2014
Best Practices for Data Management RMACC HPC Symposium, 8/13/2014 Presenters Andrew Johnson Research Data Librarian CU-Boulder Libraries Shelley Knuth Research Data Specialist CU-Boulder Research Computing
Good Data Practices VIREC Cyber Seminar Series. September 9 13, 2013
Good Data Practices VIREC Cyber Seminar Series September 9 13, 2013 1 Acknowledgements Initial concept development work by Michael Berbaum, PhD, University of Illinois at Chicago Review of existing public
Connecting Basic Research and Healthcare Big Data
Elsevier Health Analytics WHS 2015 Big Data in Health Connecting Basic Research and Healthcare Big Data Olaf Lodbrok Managing Director Elsevier Health Analytics [email protected] t +49 89 5383 600
Service Road Map for ANDS Core Infrastructure and Applications Programs
Service Road Map for ANDS Core and Applications Programs Version 1.0 public exposure draft 31-March 2010 Document Target Audience This is a high level reference guide designed to communicate to ANDS external
Affiliation: University of Massachusetts Amherst, University Libraries
Date: January 12, 2012 Name: Marilyn S Billings Email: [email protected] Affiliation: University of Massachusetts Amherst, University Libraries City, State: Amherst, MA Summary: Thank you for
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Research Data Management Guide
Research Data Management Guide Research Data Management at Imperial WHAT IS RESEARCH DATA MANAGEMENT (RDM)? Research data management is the planning, organisation and preservation of the evidence that
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences WP11 Data Storage and Analysis Task 11.1 Coordination Deliverable 11.2 Community Needs of
Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness
Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness Melanie Dulong de Rosnay Fellow, Science Commons and Berkman Center for Internet & Society at Harvard University This article
Copyright 2013 Splunk Inc. Introducing Splunk 6
Copyright 2013 Splunk Inc. Introducing Splunk 6 Safe Harbor Statement During the course of this presentation, we may make forward looking statements regarding future events or the expected performance
Multi-domain Research Data Description
Multi-domain Research Data Description Fostering the participation of researchers in a ontology-based data management environment João Aguiar Castro Faculdade de Engenharia da Universidade do Porto / INESC
General concepts: DDI
General concepts: DDI Irena Vipavc Brvar, ADP SEEDS Kick-off meeting, Lausanne, 4. - 6. May 2015 How to describe our survey What we learned so far: If we want to use data at some point in the future,
Ganzheitliches Datenmanagement
Ganzheitliches Datenmanagement für Hadoop Michael Kohs, Senior Sales Consultant @mikchaos The Problem with Big Data Projects in 2016 Relational, Mainframe Documents and Emails Data Modeler Data Scientist
How To Write An Nccwsc/Csc Data Management Plan
Guidance and Requirements for NCCWSC/CSC Plans (Required for NCCWSC and CSC Proposals and Funded Projects) Prepared by the CSC/NCCWSC Working Group Emily Fort, Data and IT Manager for the National Climate
Integration of Polish National Bibliography within the repository platform for science and humanities
Marcin Roszkowski Integration of Polish National Bibliography within the repository platform for science and humanities The best thing to do to your data will be thought of by somebody else W3C LLD Agenda
Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing.
Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing. Dr Liz Lyon, UKOLN, University of Bath Introduction and Objectives UKOLN is undertaking
PRINCIPLES OF RESEARCH DATA MANAGEMENT AT THE RECTOR S DECISION, 9 SEPTEMBER 2014
PRINCIPLES OF RESEARCH DATA MANAGEMENT AT THE UNIVERSITY OF JYVÄSKYLÄ RECTOR S DECISION, 9 SEPTEMBER 2014 EXPERT OPINION OF THE ETHICAL COMMITTEE, 8 MAY 2014 APPROVED AT THE SCIENCE COUNCIL MEETING OF
The astronomical Virtual Observatory : lessons learnt, looking forward. Françoise Genova - Forum VO-PDC d après ADASS XXI, Paris, nov.
The astronomical Virtual Observatory : lessons learnt, looking forward Examples taken from the European view, but other projects have followed similar paths The VO aim Enable seamless access to the wealth
James Hardiman Library. Digital Scholarship Enablement Strategy
James Hardiman Library Digital Scholarship Enablement Strategy This document outlines the James Hardiman Library s strategy to enable digital scholarship at NUI Galway. The strategy envisages the development
Data Driven Discovery In the Social, Behavioral, and Economic Sciences
Data Driven Discovery In the Social, Behavioral, and Economic Sciences Simon Appleford, Marshall Scott Poole, Kevin Franklin, Peter Bajcsy, Alan B. Craig, Institute for Computing in the Humanities, Arts,
#mstrworld. No Data Left behind: 20+ new data sources with new data preparation in MicroStrategy 10
No Data Left behind: 20+ new data sources with new data preparation in MicroStrategy 10 MicroStrategy Analytics Agenda Product Workflows Different Data Import Processes Product Demonstrations Data Preparation
Mission-Critical Database with Real-Time Search for Big Data
Mission-Critical Database with Real-Time Search for Big Data February 17, 2012 Slide 1 Overview About MarkLogic Why MarkLogic Case Studies Technology and Features Slide 2 About MarkLogic 10 years in business
In ediscovery and Litigation Support Repositories MPeterson, June 2009
XAM PRESENTATION (extensible TITLE Access GOES Method) HERE In ediscovery and Litigation Support Repositories MPeterson, June 2009 Contents XAM Introduction XAM Value Propositions XAM Use Cases Digital
Department of the Interior Open Data FY14 Plan
Department of the Interior Open Data FY14 Plan November 30, 2013 Contents Overview Introduction DOI Open Data Objectives Transparency Governance and Culture Change Open Data Plan FY14 Quarterly Schedule
Test Data Management Concepts
Test Data Management Concepts BIZDATAX IS AN EKOBIT BRAND Executive Summary Test Data Management (TDM), as a part of the quality assurance (QA) process is more than ever in the focus among IT organizations
Horizon2020 Data Management Plans. Ma4 Harrison BGS
Horizon2020 Data Management Plans Ma4 Harrison BGS Data Management plan What is a Data Management Plan? A data management plan (DMP) describes what data that will be created, the standards used to describe
Globus Research Data Management: Introduction and Service Overview
Globus Research Data Management: Introduction and Service Overview Kyle Chard [email protected] Ben Blaiszik [email protected] Thank you to our sponsors! U. S. D E P A R T M E N T OF ENERGY 2 Agenda
DSpace: An Institutional Repository from the MIT Libraries and Hewlett Packard Laboratories
DSpace: An Institutional Repository from the MIT Libraries and Hewlett Packard Laboratories MacKenzie Smith, Associate Director for Technology Massachusetts Institute of Technology Libraries, Cambridge,
Digital Asset Management Developing your Institutional Repository
Digital Asset Management Developing your Institutional Repository Manny Bekier Director, Biomedical Communications Clinical Instructor, School of Public Health SUNY Downstate Medical Center Why DAM? We
Data Management Plan Template Guidelines
Data Management Plan Template Guidelines This sample plan is provided to assist grant applicants in creating a data management plan, if required by the agency receiving the proposal. A data management
Data Management Considerations for the Data Life Cycle
Data Management Considerations for the Data Life Cycle NRC STS Panel 2011 November 17, 2011, Washington DC Peter Fox (RPI) [email protected], [email protected] Tetherless World Constellation http://tw.rpi.edu
The Foundations of Successful Reference Data Management
TopQuadrant Webcast with Malcolm Chisholm March 18, 2015 The Foundations of Successful Reference Data Management Introduction of Agenda and Speakers Today s Program I. Foundations of Successful Ref. Data
