Databases & Data Infrastructure. Kerstin Lehnert

Similar documents
The data forest. Application. Application Application DATA. Office of Research

Data Management Considerations for the Data Life Cycle

Deep Carbon Observatory Data Science Interim Report, Dec

USGS Community for Data Integration

Integrated Data Management System for Critical Zone Observatories

OpenAIRE Research Data Management Briefing paper

ODUM INSTITUTE ARCHIVE SERVICES OVERVIEW IASSIST 2015

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 ST CENTURY SCIENCE, ENGINEERING, AND EDUCATION (CIF21) $100,070,000 -$32,350,000 / %

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 ST CENTURY SCIENCE, ENGINEERING, AND EDUCATION (CIF21)

NOAA Environmental Data Management Update for Unidata SAC

Global Scientific Data Infrastructures: The Big Data Challenges. Capri, May, 2011

Public Access Plan. U.S. Department of Energy July 24, 2014 ENERGY.GOV

BIG DATA Funding Opportunities

Towards a National Data Infrastructure. Sharon Neal Chemistry Division National Science Foundation shneal@nsf.gov

Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing.

CYBERINFRASTRUCTURE FRAMEWORK $143,060,000 FOR 21 ST CENTURY SCIENCE, ENGINEERING, +$14,100,000 / 10.9% AND EDUCATION (CIF21)

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

DATA STEWARDSHIP from a geoscience and academic perspective

Digital Asset Management Developing your Institutional Repository

GEOG 482/582 : GIS Data Management. Lesson 10: Enterprise GIS Data Management Strategies GEOG 482/582 / My Course / University of Washington

Research Data Alliance: Current Activities and Expected Impact. SGBD Workshop, May 2014 Herman Stehouwer

Workforce Demand and Career Opportunities in University and Research Libraries

Canadian National Research Data Repository Service. CC and CARL Partnership for a national platform for Research Data Management

A NEW STRATEGIC DIRECTION FOR NTIS

NOAA Environmental Data Management

NITRD: National Big Data Strategic Plan. Summary of Request for Information Responses

Managing Idaho s Landscapes for Ecosystem Services (MILES) Idaho EPSCoR Research Data Management Plan Version 5.0 (October 12, 2015)

EXCOM 2015 Tromsø, Norway August 2015 SCADM Report (Standing Committee on Antarctic Data Management)

EUDAT. Towards a pan-european Collaborative Data Infrastructure. Willem Elbers

Summary of Responses to the Request for Information (RFI): Input on Development of a NIH Data Catalog (NOT-HG )

CAREER TRACKS PHASE 1 UCSD Information Technology Family Function and Job Function Summary

Service Road Map for ANDS Core Infrastructure and Applications Programs

From data management policy to implementation: opportunities and challenges for libraries

A NEW STRATEGIC DIRECTION FOR NTIS

Checklist for a Data Management Plan draft

Boulder Creek Critical Zone Observatory Data Management Plan

How To Teach Data Science

How To Write A Blog Post On Globus

Environment Canada Data Management Program. Paul Paciorek Corporate Services Branch May 7, 2014

Workprogramme

Data Wrangling: From the Wild to the Lake

James Hardiman Library. Digital Scholarship Enablement Strategy

Data at NIST: A View from the Office of Data and Informatics

In 2014, the Research Data Purdue University

DDI Lifecycle: Moving Forward Status of the Development of DDI 4. Joachim Wackerow Technical Committee, DDI Alliance

A Capability Maturity Model for Scientific Data Management

Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness

Long Term Preservation of Earth Observation Space Data. Preservation Workflow

Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer

EUROPEAN COMMISSION Directorate-General for Research & Innovation. Guidelines on Data Management in Horizon 2020

Compute Canada & CANARIE Response to TC3+ Consultation Paper: Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship

The Arctic Observing Network and its Data Management Challenges Florence Fetterer (NSIDC/CIRES/CU), James A. Moore (NCAR/EOL), and the CADIS team

Overcoming the Technical and Policy Constraints That Limit Large-Scale Data Integration

Research Data Management

Capitalizing on Big Data

NASA s Big Data Challenges in Climate Science

UNH Strategic Technology Plan

SHARING RESEARCH DATA POLICY, INFRASTRUCTURE, PEOPLE

Research Data Services at London s Global University. UCL Research Data Services RECODE Meeting 14 th Jan 2015

EUDAT. Towards a pan-european Collaborative Data Infrastructure

How To Useuk Data Service

Strategic Vision. for Stewarding the Nation s Climate Data. Our. NOAA s National Climatic Data Center

Data Management Plans - How to Treat Digital Sources

Data Publishing Workflows with Dataverse

Open Ontology Repository Initiative

Report of the DTL focus meeting on Life Science Data Repositories

February 22, 2013 MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES

Open Access to Manuscripts, Open Science, and Big Data

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE Data fusion, semantic alignment, distributed queries

Data-Intensive Science and Scientific Data Infrastructure

Computational Science and Informatics (Data Science) Programs at GMU

Response from Oxford University Press, USA

Data Governance Workshop, Final Report Arlington, VA December 14-15, 2011

Science Gateways in the US. Nancy Wilkins-Diehr

Big Data to Knowledge (BD2K)

Local Loading. The OCUL, Scholars Portal, and Publisher Relationship

Data management plan

DRIVER Providing value-added services on top of Open Access institutional repositories

REACCH PNA Data Management Plan

HP SOA Systinet software

The mission of academic libraries is to support research, education,

Information and Communications Technology Strategy

SURFsara Data Services

31 December Dear Sir:

University-wide Data Management Services: Cross-campus Collaborations at Indiana University

How To Manage Research Data At Columbia

Career Tracks- Information Technology Family

Best Practices for Data Management. RMACC HPC Symposium, 8/13/2014

Strategic Plan University Libraries Virginia Tech

SCOOP Data Management: A Standards-based Distributed System for Coastal Data and Modeling

Service Oriented Architecture (SOA) An Introduction

A Policy Framework for Canadian Digital Infrastructure 1

Metadata Hierarchy in Integrated Geoscientific Database for Regional Mineral Prospecting

Invenio: A Modern Digital Library for Grey Literature

EarthCube Governance. Community

Nevada NSF EPSCoR Track 1 Data Management Plan

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Quality Assurance of Research Data:

Organic Data Publishing: A Novel Approach to Scientific Data Sharing

From Stored Knowledge to Smart Knowledge

Transcription:

+ Databases & Data Infrastructure Kerstin Lehnert

+ Access to Data is Needed 2 to allow verification of research results to allow re-use of data

+ The road to reuse is perilous (1) 3 Accessibility Discovery, long-term access, permissions Usability understand what was measured and how (materials and methods), computations that were applied, presentation of data (units, symbols, etc.) ability to apply standard tools to all file formats Motivation Professional benefits vs effort and economic burden of publication; policies (1) Rees, Jonathan (2010): Recommendations for independent scholarly publication of data sets. Creative Commons Working Paper

+ From Databases to Data 4 Infrastructure Data-driven science creates new requirements for data: Data need to discoverable. Data need to be persistently and reliably accessible. Data need to be curated and reviewed for quality assurance. Data need to be unambiguously identifiable & located. Data need to be citable. Data need to be interoperable. This requires the development of a data infrastructure. Trusted repositories instead of informal databases.

+ From Databases to Data 5 Infrastructure Technological Infrastructure Workforce Management Models Distributed versus centralized databases Control, oversight From: Arzberger et al., Science, 2004 Financial Support Legal & Policy Framework Open Access policies Policy enforcement Cultural & Behavioral Changes Data sharing Data citation

+ Where Are We Now? 6 Few data repositories fulfill the requirements. National data centers (NCDC, NGDC, NSIDC, etc.) Domain-specific data facilities: IRIS, BCO-DMO, IEDA (MGDS, EarthChem), etc. Most databases don t, but at least provide access to data. local, no standards single point of failure not persistent Many gaps in coverage Too much dark legacy data From: http://www.elsevier.com/about/content-innovation/database-linking

+ How Can We Advance? 7 Infrastructure Sustained and comprehensive repositories Tools and workflows for data management, including data publication Best practices, standards Incentives for data sharing Credit (data citation, bibliometrics for data) Better science Policy enforcement Funding agencies Publications

+ Advancing Data Infrastructure 8 CIF21 & EarthCube Community building Building Blocks development Interoperability Growing number of initiatives & organizations to develop and implement best practices and policies Research Data Alliance CODATA/World Data Systems BRDI New approaches to data publication & citation

+ EarthCube 9 Transform the conduct of research in geosciences by supporting the development of community-guided cyberinfrastructure to integrate data and information for knowledge management across the Geosciences.

+ 10 source: B. Ransom, NSF, 2012

+ EarthCube Progress 11 2 year planning and community building phase charrettes community & concept awards domain end-user workshops Started initial developments: Research Coordination Networks funded in summer 2013 Building Blocks & Conceptual Designs funded in summer 2013 CINERGI Inventory of CI resources Test Enterprise Governance starting Sept 15 Closer collaboration of data facilities Web Services Interop Building Block project Consortium of Data Facilities (workshop coming up) The D8 and/or D20 concept

+ Guidelines & Best Practices 12 data publication open access data attribution & citation data standards trustworthiness of repositories The Research Data Alliance aims to accelerate and facilitate research data sharing and exchange. ensuring the long-term stewardship and provision of quality-assessed data and data services to the international science community and other stakeholders.

+ Data Publication: Options 13 Conventional publication Institutional Repositories Data Paper Disciplinary Repositories

+ Role of Data Repositories 14 Ensure long-term preservation Data documentation (catalog metadata) Persistent & unique identification Sustainable infrastructure & business models Ensure Usability (Disciplinary Repositories!) Adopt and/or develop community-based standards for documenting: Provenance of data (collection strategies, procedures and underlying assumptions) Data precision, errors, workflows for data quality assurance Comply with standards for data representation (formats, semantics, etc.) QA/QC of datasets and metadata Science-driven tools for data search & access Standards-based interfaces for programmatic access Integrate data for analysis

+ Domain-specific Repositories: Linking 15 Stakeholders Science Community Links to publications & bibliometrics Publishers, editors Data policies Enforcement Funding agencies Norms & needs Domain-specific Resource Collections Systems & services Tool development Interoperability Computer Sciences Long-term archiving Libraries, data centers

+ Data Publication Process 16 Example: IEDA Submission Review Publication Integration Data IEDA Data Managers IEDA Metadata Catalog IEDA Data Managers Synthesis databases DOI linking [XML] Manuscript Editors Journal Portal

+ EarthChem Standards for Data 17 Publication Following recommendations of the Editors Roundtable (Policy Statement released in 2009: www.earthchem.org/editors) complete disclosure of data used in a publication full documentation of data provenance & quality (uncertainty) unique identification of samples geospatial & taxonomic information about samples Currently reviewing policy statement to align with emerging best practices and new publication capabilities submission of data to repositories as part of editorial process data review by repositories

+ DOI to allow proper citation 18 Link to publications Link to funding source

+ Linking Data & Publications 19

+ 20 Linking Data & Publications

+ Multi-Disciplinary Data Science Implementation Integrated Applications Discovery visualizations Semantic interoperability Analytics and mining Slide: Courtesy of Peter Schematic for Deep Carbon Virtual Observatory and Interoperability Fox, RPI (July 2012) Global Census, Virtual Mineral Laboratory,... 21 Application-level mediation: vocabulary, mapping to science and data terms Software, Tools & Apps Deep Energy/ Life Applications Semantic interoperability Physics/ Chemistry Models. Res/Flux Applications Semantic query, hypothsis and inference Query, access and use of data Semantic mediation: physics, chemistry, mineral, emission data - ChemML, Data Repositories GVP MINDAT EOS EarthChem Metadata, schema, data......... Emission/ Compositions

+ What makes data useful? 22 Knowing that I can trust the numbers.

+ Data Quality Standards 23 Science Norms Standards Technology Tools

+ 24 Data Citation

+ Polar Data Infrastructure 25 Don t reinvent Many data types already have well-established repositories, standards, best practices, community governance Integration of polar data into appropriate disciplinary repositories will augment their quality and usage Polar data are diverse, difficult to cover all data types with the appropriate level of expertise Fill obvious gaps Leverage existing data infrastructure Follow standards for data publication, metadata, and repository trustworthiness

+ The CI Vision 26 Enable new forms of scholarship that are information-intensive data-intensive distributed collaborative multi-disciplinary From Elmagarmid et al. (2008): Community-Cyberinfrastructure-Enabled Discovery in Science and Engineering

+ Opportunities for Polar CI 27 Leverage EarthCube developments Build capabilities that can be used by other communities Adopt & adapt developments that are useful Use EarthCube Resources Stakeholder Alignment survey Lists of existing resources Gap analyses Use cases / science scenarios Social network (EarthCube MatchMaker) Experiences of community building & governance

+ Big Data Sample-based 28 Long Tail Data Geospatial Grids & Vectors Categorical

+ Recurring Themes of CI Gaps 29 Data (& samples): access, coverage, integration, standards Models: dynamic, shared, linked Interdisciplinary conceptual frameworks Data analysis tools: visualization, multivariate analysis, statistics Data management support: workflows, software, education Knowledge: limitations and uses of data and models across, within and between disciplines Community: collaboration, shared knowledge of existing resources

+ 30 Enriched Links (under development)

+ CZOData II Architecture 31 Leverage existing systems & capabilities Local CZOs Local CZO Web Site CZO Central Data Management System CZO Central Harvester CZO-ISGN Registra on System CZO Central Hydro System (w/wfs & CUAHSI HIS Service Interface) Open- Topography (LiDAR) CUAHSI HIS CZO Main Web Portal CZO Main Web Site CZO Data Discovery Portal Time Series Data Display & Access Tool CZO Display Files Shared Vocabulary System SESAR CZchemDB Data Access Tool Data Management Tools DataONE Interface CZO Metadata Catalog (w/csw Service Interface) CZchemDB System (w/earthchem Service Interface) EarthChem DataONE Standardsbased Web Service Clients Local CZOs CZO Central Coordina on Func ons Build integrative components CZO Central Data Repositories Non-CZO Integrated Data & Discovery Sites Clients

+ Data Infrastructure for Polar 32 Sciences Data Diversity: A unique situation? Many disciplines Big Data versus Long Tail International

+ Disciplinary Repositories 33 Ensure Usability Develop/promote community-based data reporting standards Provenance of data, data precision, errors, etc. Work with publishers, editors, professional societies Align with other data & interoperability standards Provide services for persistent data identification (DOI), data attribution and citation, long-term archiving, etc. Advance Access provide science-driven tools for data search & access provide programmatic interfaces for cross-disciplinary use links to publications

+ Data Infrastructure 34 Acquisition Access Analysis Archiving

+ The Foundation: Data 35 Open access to a global, distributed knowledge base of scientific data and information, including legacy data Seamless integration of data within disciplines across disciplines with tools (visualization, analysis) and models