Ontology Summit 2014 Track D: Tackling the Variety Problem in Big Data Summary

Similar documents

Ontology Summit 2014 Session 05 Track D: Tackling the Variety Problem in Big Data I

IO Informatics The Sentient Suite

2014 Ontology Summit & Symposium Big Data and Semantic Web Meet Applied Ontology

Training Management System for Aircraft Engineering: indexing and retrieval of Corporate Learning Object

Open Ontology Repository Initiative

In 2014, the Research Data Purdue University

An analysis of Big Data ecosystem from an HCI perspective.

CiteSeer x in the Cloud

SHIFTING (CLOUD) GEARS

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE Data fusion, semantic alignment, distributed queries

Data Analytics, Management, Security and Privacy (Priority Area B)

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS

CS Matters in Maryland CS Principles Course

Cloud computing based big data ecosystem and requirements

Web-Based Genomic Information Integration with Gene Ontology

RFI Summary: Executive Summary

I. INTRODUCTION NOESIS ONTOLOGIES SEMANTICS AND ANNOTATION

Data Management Models For On-boarding and Ingestion Information Requirements

Finding a Good Ontology: The Open Ontology Repository Initiative

Amit Sheth & Ajith Ranabahu, Presented by Mohammad Hossein Danesh

Annotea and Semantic Web Supported Collaboration

Towards Event Sequence Representation, Reasoning and Visualization for EHR Data

An experience with Semantic Web technologies in the news domain

Action Plan towards Open Access to Publications

Collaborations between Official Statistics and Academia in the Era of Big Data

Business Process Models as Design Artefacts in ERP Development

BIG. Big Data Analysis John Domingue (STI International and The Open University) Big Data Public Private Forum

The Importance of Bioinformatics and Information Management

Novel Data Extraction Language for Structured Log Analysis

Business Intelligence: Recent Experiences in Canada

Databases & Data Infrastructure. Kerstin Lehnert

Is Big Data a Big Deal? What Big Data Does to Science

Case Study: Semantic Integration as the Key Enabler of Interoperability and Modular Architecture for Smart Grid at Long Island Power Authority (LIPA)

Lightweight Data Integration using the WebComposition Data Grid Service

UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications

Overcoming the Technical and Policy Constraints That Limit Large-Scale Data Integration

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Semantically Enhanced Web Personalization Approaches and Techniques

Master s Program in Information Systems

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies

Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics

Semantic Business Process Management Lectuer 1 - Introduction

Summary of Responses to the Request for Information (RFI): Input on Development of a NIH Data Catalog (NOT-HG )

Signed metadata: method and application

Bringing Business Objects into ETL Technology

Computational Science and Informatics (Data Science) Programs at GMU

Research Data Alliance: Current Activities and Expected Impact. SGBD Workshop, May 2014 Herman Stehouwer

Big Data Challenges in Bioinformatics

Building Semantic Content Management Framework

The Scientific Data Mining Process

Technical. Overview. ~ a ~ irods version 4.x

Business rules and science

IBM Software IBM Business Process Management Suite. Increase business agility with the IBM Business Process Management Suite

A Capability Maturity Model for Scientific Data Management

Smart Financial Data: Semantic Web technology transforms Big Data into Smart Data

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Work Package 13.5: Authors: Paul Flicek and Ilkka Lappalainen. 1. Introduction

Good morning. It is a pleasure to be with you here today to talk about the value and promise of Big Data.

BUSINESS RULES CONCEPTS... 2 BUSINESS RULE ENGINE ARCHITECTURE By using the RETE Algorithm Benefits of RETE Algorithm...

Disributed Query Processing KGRAM - Search Engine TOP 10

Data Management Plans & the DMPTool. IAP: January 26, 2016

Rosanne: islands of structure in unstructured data

Knowledgent White Paper Series. Developing an MDM Strategy WHITE PAPER. Key Components for Success

The Data Quality Continuum*

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

BIG DATA WITHIN THE LARGE ENTERPRISE 9/19/2013. Navigating Implementation and Governance

Information Governance

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Big Data & Security. Aljosa Pasic 12/02/2015

Data Science at U of U

Towards a Domain-Specific Framework for Predictive Analytics in Manufacturing. David Lechevalier Anantha Narayanan Sudarsan Rachuri

RecordPoint Overview

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE Interrogation d'entrepôts distribués et hétérogènes

National Snow and Ice Data Center A brief overview and data management projects

Information Technology for KM

doing a literature review

Introduction to Big Data Science

Linked Science as a producer and consumer of big data in the Earth Sciences

Standards of Quality and Effectiveness for Professional Teacher Preparation Programs APPENDIX A

Writing in Social Work

eresearch Australasia 2007

Describing Web Services for user-oriented retrieval

PAPER Data retrieval in the PURE CRIS project at 9 universities

Publishing Linked Data Requires More than Just Using a Tool

Effective Management and Exploration of Scientific Data on the Web. Lena Strömbäck Linköping University

Application of OASIS Integrated Collaboration Object Model (ICOM) with Oracle Database 11g Semantic Technologies

ABSTRACT. 1. Introduction

White Paper Accessing Big Data

Oracle Forms and SOA: Software development approach for advanced flexibility An Oracle Forms Community White Paper

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Application Design: Issues in Expert System Architecture. Harry C. Reinstein Janice S. Aikins

Data Integration Hub for a Hybrid Paper Search

bigdata Managing Scale in Ontological Systems

Text and data analytics for social network mining

Digital Library Content and Course Management Systems: Issues of Interoperation

Report of the DTL focus meeting on Life Science Data Repositories

Semantic Variability Modeling for Multi-staged Service Composition

THE e-knowledge BASED INNOVATION SEMINAR

CALIFORNIA S TEACHING PERFORMANCE EXPECTATIONS (TPE)

Transcription:

Ontology Summit 2014 Track D: Tackling the Variety Problem in Big Data Summary Ken Baclawski Anne Thessen Track D Co-Champions April 28, 2014 1

The Potential of Big Data Could address important social and commercial needs. Traditional techniques are inadequate Cost, feasibility, ethical concerns New techniques are much better, but Still have ethical issues Privacy has become a prominent issue The variety problem is typically handled with ad hoc techniques April 28, 2014 2

Emergence of Big Data Recognition of the problem in early 1980s: originally called the information onslaught [1] Different fields experience the problem at different times. The transition from information scarcity to information abundance in a field appears to occur abruptly. Organizations have difficulty coping with the transition. Many organizations are still based on the information scarcity model. April 28, 2014 3

Characteristics of Big Data Large amounts of data (volume) Rapid pace of data acquisition (velocity) Complexity of the data (variety) Complexity of individual data items Both structured and unstructured data Complexity of data from one source Multiple sources April 28, 2014 4

Technology for Big Data Development of new storage and indexing strategies for handling volume and velocity Map Reduce was developed in 1994. [2] Development of techniques for handling variety Schema mapping Controlled vocabularies Knowledge representations Ontologies and semantic technologies Connection between these two? Surprisingly little collaboration and communication. A notable exception is the early work starting in 1992 on representing biological research papers. [3] April 28, 2014 5

Session Speakers Eric Chan - Enabling OODA Loop with Information Technology Nathan Wilson - The Semantic Underpinnings of EOL TraitBank Ruth Duerr - Semantics and the SSIII Project Mark Fox - Variety in Big Data: A Cities Perspective Malcolm Chisolm - Data Governance to Manage Variety in Big Data Dan Brickley - Schema.org, FOAF and Linked Data: Lessons for Web-scale vocabulary deployment Rosario Uceda-Sosa - Big Data, Open Data and the Smart City April 28, 2014 6

Speakers Eric Chan (Oracle) Tool development addressing many aspects of the variety problem including provenance, metadata, and hypothesis generation and workflow (OODA loop) Dan Brickley (Google) Examines the Variety problem on the Web and provides an overview of some lessons learned from schema.org, FOAF and Linked Data deployment of RDF-based vocabularies. April 28, 2014 7

Speakers Malcolm Chisolm (AskGet.com) Data governance is a collection of disciplines that ensure data is managed adequately in an enterprise. Malcolm will discuss what is involved in data governance and its implications for managing Variety in Big Data. Ruth Duerr (National Snow and Ice Data Center) Experiences with using ontologies to address very heterogeneous sources of data for snow and ice, including satellite data, model output, point observations, social science data (interviews with Alaskan community members), and many other kinds of data. April 28, 2014 8

Speakers Nathan Wilson (Encyclopedia of Life) Experiences with using semantic technology in the tree of life, a notoriously messy ontology Mark Fox (University of Toronto) More than half the world population live in cities and the proportion is growing, so cities are an enormous source of data. However, it is not just the amount of data that is daunting, but the enormous variety not only within a single city but also among the thousands of different cities. Rosario Uceda-Sosa (IBM) Continues the examination of the Variety problem for city data and proposes some solutions. April 28, 2014 9

Why is semantics important? Misunderstanding the data can result in invalid or misrepresented analyses. A recognized problem well before Big Data. John P. A. Ioannidis claimed that most published scientific research findings are false. [4] [5] Many forms of bias can invalidate results Bias and observer effects are difficult to recognize Standard experimental procedures do not eliminate bias. Could Big Data exacerbate this problem? April 28, 2014 10

What can ontologies do to help? Background knowledge of the domain The structure of the data Annotation of data and metadata Provenance of the data Transformations Analyses Interpretations Data processing workflows Privacy concerns Hypothesis generation and their workflows April 28, 2014 11

Still more opportunities for helping... Data governance (Malcolm Chisholm) Data integration (Mark Fox and Rosario Uceda-Sosa) Schema.org, FOAF, Linked Data (Dan Brickley) April 28, 2014 12

Harvest from data partners Rather than build it yourself, make use of collaborators. You still have a lot of work to do converting, formalizing the input and integrating the sources. But the result can be very high quality and it has a builtin user community. Nathan Wilson uses input from the EOL community. Mark Fox and Rosario Uceda-Sosa gave an example of harvesting data and metadata from city data sources. Ruth Duerr combined input from native communities in the Arctic. April 28, 2014 13

Modular Development It is much easier to reuse a smaller ontology Part of the Track A theme on reuse. One can combine several of them to create an ontology that satisfies most of your requirements. Ruth Duerr used this technique to develop an ontology for sea ice. April 28, 2014 14

Formalize existing informal models Eric Chan presented a formalization of the OODA loop and then extended and generalized it. Ruth Duerr presented a formalization of existing techniques for describing sea ice conditions and indigenous knowledge. Mark Fox and Rosario Uceda-Sosa presented formalizations of existing informal data models for city data. April 28, 2014 15

Develop ontology with extension points Eric Chan presented a framework for observation and decision making. The framework has explicit extension points to allow ease of reuse. April 28, 2014 16

Involve communities Similar to the Harvest from data partners use case Community involvement is important even if the communities do not directly contribute to the ontology. Ruth Duerr described this technique in her presentation on the ontology of sea ice. April 28, 2014 17

Governance framework This framework uses relatively deep inference using axioms and rules to ensure data quality. The OOR initiative emphasized governance and business rules for ensuring quality. Malcolm Chisholm discussed the important characteristics of governance frameworks. April 28, 2014 18

Bridge axioms Use axioms both at the data and metadata levels to bridge the gap between the semantics of data from different sources. Mark Fox and Rosario Uceda-Sosa both used this technique in their work on smart cities. April 28, 2014 19

Other Use Cases Vocabulary pipeline Dan Brickley Presentation Pattern matching Dan Brickley Presentation Information ecosystem Rosario Uceda-Sosa Presentation April 28, 2014 20

Some Challenges Little collaboration between the communities Big Data focus on volume and velocity, assuming someone else will handle variety Tool incompatibility Incompatibility between statistical and logical techniques (hybrid reasoning gap) April 28, 2014 21

Bibliography 1. D. Kerr, K. Braithwaite, N. Metropolis, D. Sharp, G.-C. Rota (Ed). Science, Computers and the Information Onslaught. Academic Press. (1984) 2. K. Baclawski and J.E. Smith. High-performance, distributed information retrieval. Northeastern University, College of Computer Science. (1994) 3. K. Baclawski. Data/knowledge bases for biological papers and techniques. Northeastern University, College of Computer Science. NSF grant. (1992) 4. Ioannidis JPA, Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124 (2005) 5. Critical Data Conference: Secondary Use of Big Data from Critical Care http://criticaldata.mit.edu/ (2014) 6. K. Baclawski and T. Niu. Ontologies for Bioinformatics. (2005) 7. K. Baclawski. Bayesian network development. In International Workshop on Software Methodologies, Tools and Techniques, pages 18-48. Keynote address. (September, 2004) April 28, 2014 22