Rosetta: A white paper on the challenges of sharing observational datasets



The Challenges of Sharing Observational Data

Observational studies are a critical component of geosciences education and research. Scientists and students often collect observational data in the field using dataloggers, which interface with various types of instrumentation to sample and store data at the observation site. The data are held in the datalogger until they can be collected during a site visit or transmitted over a network.

The data obtained from dataloggers are almost always in an ASCII-based format, such as a comma-separated value (CSV) file, facilitating use by scientists. Each type of datalogger produces a unique form of ASCII file; sometimes the differences between files produced by different dataloggers are quite significant. The result is that N ASCII-based data files will likely have N different layouts (even within the same project). This presents an immediate roadblock to outside users attempting to integrate the collected datasets into their own scientific workflows.

Data providers often transfer basic knowledge of the dataset to others by providing a README file, as the ASCII files collected from the field often lack even the most basic metadata (such as the units of the data being sampled). The format and quality of the README files vary greatly between projects (and sometimes even within the same project), and even the best README files may omit important metadata. While a particular ASCII file layout may work well in the data provider's scientific workflow (i.e., analysis and visualization tools), other users must spend significant effort to get the data into a format that fits their workflow. Furthermore, data collected from the field often end up in a spreadsheet file, which presents another hurdle for users who do not do analysis using spreadsheet software.

The plethora of different ASCII and spreadsheet formats makes it very difficult to provide standard web services to access these datasets. For example, aggregating multiple files or spatially and/or temporally subsetting datasets quickly becomes impractical, because special code must be written to allow a data portal to handle each ASCII format or spreadsheet layout. (The sketch below illustrates the kind of per-layout parsing code this imposes on data users.)

This white paper describes the approach taken by the Rosetta project at the Unidata Program Center to improving the quality and accessibility of observational datasets while minimizing disruption to existing scientific workflows.
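To make the problem concrete, here is a minimal, hypothetical sketch of reading one project's datalogger output in Python with pandas. The file name, column layout, preamble length, and fill value are invented for illustration; the point is that column meanings and units live only in a separate README, so each new layout needs its own hand-written configuration.

    import pandas as pd

    # Hypothetical datalogger export: no header row, units documented only in a README.
    # A different logger (or project) would need a different column list and skiprows value.
    COLUMNS = ["timestamp", "air_temp", "rel_humidity", "battery_voltage"]

    df = pd.read_csv(
        "site01_logger_2014.dat",   # illustrative file name
        names=COLUMNS,
        skiprows=4,                  # this logger writes 4 lines of preamble; others write 0 or 2
        parse_dates=["timestamp"],
        na_values=[-9999],           # fill value taken from the README, not from the file itself
    )

    # Units (degC? degF? percent?) are not recorded in the file, so they must be
    # tracked by hand before the data can be combined with anyone else's.
    print(df.head())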

Possible Solutions

The barriers discussed above can be addressed and greatly reduced by requesting that datasets meet certain quality standards if they are to be shared. (Note that sharing of data is required for NSF-funded projects.) This section describes three data formats of varying quality, along with some pros and cons of using each format to share observational data.

User Defined ASCII format (the status quo: each project defines its own format)
  Pros:
  - Easy to read and write
  - Similar steps used to access data across a wide range of tools/languages
  - Works well in the data provider's existing scientific workflow
  Cons:
  - N sample files are likely to have N different layouts
  - Reading these files can become time consuming
  - Basic metadata often not included in the data file
  - Difficult (and sometimes impossible) to provide standard services (aggregation, subsetting, etc.) that allow other users to efficiently access the data

Standard Layout ASCII format
  Pros:
  - Same pros as the User Defined ASCII format above, plus:
  - Standard file layout, allowing users to know what to expect when dealing with a new dataset
  - Easy to visually inspect the file with a text editor or spreadsheet software
  - Metadata included with the file in a standard layout (self-describing file)
  Cons:
  - Relies on data providers to consistently follow a specification, leading to frustration for the data provider unless he or she is an expert in the specification
  - Difficult, although easier than with the User Defined ASCII format, to implement standard services (aggregation, subsetting, etc.)

A Standard, self-describing, machine-independent data format (a minimal sketch of such a file follows this list)
  Pros:
  - Easy to provide services for end users (aggregation of files, subsetting, etc.)
  - Metadata embedded in a standard way (self-describing file), which removes the need for a separate README file
  Cons:
  - Steep learning curve to read and write these formats
  - Often requires extra software to be installed
  - More time required to use data across several tools (each tool/language has a different API)
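For concreteness, here is a minimal sketch of what a standard, self-describing, machine-independent file might look like when written with the netCDF4-python library. The file name, station, variable names, and attribute values are invented for illustration; the point is that units and descriptive metadata travel inside the file itself.

    import numpy as np
    from netCDF4 import Dataset

    # Illustrative only: a tiny self-describing netCDF file for one station's air temperature.
    with Dataset("site01_airtemp.nc", "w") as nc:
        nc.title = "Air temperature at hypothetical site 01"
        nc.Conventions = "CF-1.6"

        nc.createDimension("time", None)
        time = nc.createVariable("time", "f8", ("time",))
        time.units = "hours since 2014-01-01 00:00:00"
        time.standard_name = "time"

        temp = nc.createVariable("air_temperature", "f4", ("time",))
        temp.units = "degC"                      # units live in the file, not in a README
        temp.standard_name = "air_temperature"
        temp.long_name = "2 m air temperature"

        time[:] = np.arange(24)
        temp[:] = 20.0 + np.random.randn(24)     # placeholder data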

A Path Forward: A Data Format Transformation Service

It seems clear from the possible format-based solutions above that a combination of the pros of the Standard Layout ASCII format and of the standard, self-describing, machine-independent data format would be ideal. However, both of these solutions have cons related to either the reading or the writing of each format. Therefore, we are proposing a solution in which the burden of writing to a standard, self-describing, machine-independent data format is removed from the data provider, and the pain of reading from such a format is removed from the data user: a Data Format Transformation Service.

The Data Format Transformation Service should:
- Give data providers an easy, front-end, wizard-like interface to collect the metadata and parsing information needed to transform their User Defined ASCII format into a standard, self-describing, machine-independent data format, with no more effort than that needed to write a README file (that is, the process would require no coding or special expertise on the data provider's part). A sketch of the kind of information such a wizard might collect follows this list.
- Give data users an easy way to retrieve the data in either a Standard Layout ASCII format file or a standard, self-describing, machine-independent data format file, as dictated by their own scientific workflow.

This would give data portals the ability to:
- Use the standard, self-describing, machine-independent data format to serve data with rich services such as aggregation and subsetting.
- Automatically extract metadata from the standard, self-describing, machine-independent data format to create an automatically generated, README-like file with a standard level of quality.
- Provide their users with access to a wide variety of datasets in a format that works for their scientific workflow, enhancing the reusability of data.
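As a purely hypothetical illustration of the wizard idea, the metadata and parsing information a provider supplies could be captured in a small structured description like the one below. This is not Rosetta's actual template format; the field names and values are invented to show the kind of information needed to turn a bare CSV into a self-describing file.

    import json

    # Hypothetical translation template: everything a transformation service would need
    # to convert one project's CSV layout into a CF-style self-describing file.
    template = {
        "delimiter": ",",
        "header_lines_to_skip": 4,
        "columns": {
            0: {"name": "time", "units": "hours since 2014-01-01 00:00:00"},
            1: {"name": "air_temperature", "units": "degC",
                "standard_name": "air_temperature"},
            2: {"name": "relative_humidity", "units": "percent"},
        },
        "global_metadata": {
            "title": "Hypothetical site 01 meteorological observations",
            "creator_name": "A. Researcher",
            "latitude": 71.3, "longitude": -156.8, "altitude_m": 12.0,
        },
        "missing_value": -9999,
    }

    # The wizard would gather this interactively; here it is simply saved alongside the data.
    with open("site01_template.json", "w") as f:
        json.dump(template, f, indent=2)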

Benefits of Rosetta:
- Rosetta uses Unidata's Common Data Model and the netCDF-Java I/O Service Provider (IOSP) layer, the latter of which provides the ability to read from and write to virtually any data format via the creation of an IOSP plugin. The flexibility afforded by the IOSP layer means that more transformation services can be added, with relative ease, as the community requests them.
- Greater interoperability is enabled by the choice of the netCDF format, conforming to the CF conventions, as the standard, self-describing, machine-independent data format.
- Rosetta offers a solution to part of the long-tail, dark-data problem, in which investigators face the challenge of sharing data that may not conform to any standard.
- Rosetta employs an open-source development model and uses standard web technologies (Spring, JavaScript, Java).
- Rosetta uses transparent development infrastructure, such as GitHub (for open-source code management) and the Jira ticketing system, which exposes the status of user-submitted bugs, internal and external feature requests, and release timelines.
- Rosetta serves the Unidata Community, a diverse association of colleges and universities, including instructors, researchers, and students. The NASA, NOAA, NSF, UCAR, and NCAR communities, as well as others, help inform Unidata, as these institutions are important sources of information and advice about events and conditions that can profoundly affect the program.

Current state of Rosetta:
- Conversion of simple datalogger output (in a User Defined ASCII format or a spreadsheet format such as .xls or .xlsx) into standard, self-describing, machine-independent (netCDF) files that are compliant with the Climate and Forecast (CF) 1.6 standard specification. Efforts so far have been based on the needs of the Advanced Cooperative Arctic Data and Information Service (ACADIS) community.
- A limited test server is online, with plans to open testing to a wider audience with the next release.

Short term plans:
- Expand the number of Discrete Sampling Geometries (DSGs) supported by Rosetta, as specified in the CF-1.6 standard (a brief DSG illustration follows this list).
- Expand the number of input and output formats for conversion, including output of a Standard Layout ASCII format and a spreadsheet-based standard layout. (Standards for the ASCII and spreadsheet layouts will be developed as part of a longer-term goal, described below.)
- Create an API, with guidance from the ACADIS group, for data publishing from Rosetta to various data portals.
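CF-1.6 Discrete Sampling Geometries define how point, time-series, profile, and trajectory observations are encoded in netCDF. As a rough illustration, the attributes below are the markers that distinguish a CF-1.6 timeSeries DSG file of the kind Rosetta produces; the file name and station values are invented, and real files carry additional coordinates and metadata.

    from netCDF4 import Dataset

    # Minimal sketch of the CF-1.6 timeSeries DSG markers for a single-station file.
    with Dataset("site01_dsg.nc", "w") as nc:
        nc.Conventions = "CF-1.6"
        nc.featureType = "timeSeries"            # CF-1.6 DSG feature type

        nc.createDimension("time", None)
        nc.createDimension("name_strlen", 16)

        station = nc.createVariable("station_name", "S1", ("name_strlen",))
        station.cf_role = "timeseries_id"        # identifies the station for DSG-aware tools
        station.long_name = "station name"

        time = nc.createVariable("time", "f8", ("time",))
        time.units = "hours since 2014-01-01 00:00:00"

        lat = nc.createVariable("lat", "f4")
        lat.units = "degrees_north"
        lon = nc.createVariable("lon", "f4")
        lon.units = "degrees_east"

        temp = nc.createVariable("air_temperature", "f4", ("time",))
        temp.units = "degC"
        temp.coordinates = "time lat lon"        # ties the data variable to its coordinates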

Long term plans:
- Create and advocate for standard ASCII and spreadsheet representations of the CF-1.6 DSGs, based on the work described in the short-term plan above.
- Create a version of Rosetta suitable for local installation, for use with datasets that are too large to be efficiently transformed over the network.
- Expose THREDDS services, such as the NetCDF Subset Service, through Rosetta, so that these services can be used on local files without running a full THREDDS Data Server.

Development of Rosetta in light of Unidata's strategic plan:

The Rosetta group's work supports the following Unidata funding proposal focus areas:

1. Enable widespread, efficient access to geoscience data. The initial goal of Rosetta is to transform unstructured ASCII data files into the netCDF format; once in this format, standard tools, such as the THREDDS Data Server, the IDV, Python, and other analysis packages, can take advantage of these datasets with relative ease.

2. Develop and provide open-source tools for effective use of geoscience data. Although the primary goal of Rosetta is to get data into the netCDF format, the transformation process does not stop there. The Rosetta group realizes that not everyone knows how to work with netCDF files, and some may feel more comfortable working with other formats. Therefore, Rosetta includes the ability to transform from one format to another (e.g., netCDF to .xls), thereby reducing data friction.

3. Provide cyberinfrastructure leadership in data discovery, access, and use. Metadata contained in a netCDF file (no longer locked away in a separate README file) can be automatically extracted, facilitating the discovery of the data in these files (a brief sketch of such extraction follows this list). Additionally, the Rosetta development plan includes the creation of standard ASCII and spreadsheet representations of the CF-1.6 DSGs.

4. Build, support, and advocate for the diverse geoscience community. Promote the use of standard formats in the dissemination of data, while allowing the flexibility to transform into other formats, as needed, to enable users to "do science." For commonly used formats, such as a User Defined ASCII format or an unstructured spreadsheet, create and advocate for the use of standard representations based on the CF-1.6 DSGs.
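To illustrate the metadata-extraction point in focus area 3, here is a minimal sketch, again using the netCDF4-python library, of pulling the global and per-variable attributes out of a converted file so a portal could index them for discovery. The file name is the hypothetical one from the earlier examples.

    from netCDF4 import Dataset

    # Sketch: harvest discovery metadata from a self-describing netCDF file.
    with Dataset("site01_airtemp.nc") as nc:
        global_attrs = {name: nc.getncattr(name) for name in nc.ncattrs()}
        variables = {
            var_name: {attr: var.getncattr(attr) for attr in var.ncattrs()}
            for var_name, var in nc.variables.items()
        }

    # A portal could index these dictionaries directly, or render them as a README-like summary.
    print(global_attrs)
    for name, attrs in variables.items():
        print(name, attrs)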