Ontology Summit 2014 Session 05 Track D: Tackling the Variety Problem in Big Data I




Ken Baclawski and Anne Thessen, Track D Co-Champions
February 13, 2014

Session Outline
  - Ken Baclawski: Introduction
  - Eric Chan (Oracle): Enabling OODA Loop with Information Technology
  - Nathan Wilson (Encyclopedia of Life): The Semantic Underpinnings of EOL TraitBank
  - Ruth Duerr (National Snow and Ice Data Center): Semantics and the SSIII Project
  - Anatoly Levenchuk: Hackathon (Track E) Announcement
  - Open Panel Discussion

History of Big Data
  - The problem was recognized in the early 1980s, when it was originally called the "information onslaught". [1]
  - Different fields experience the problem at different times.
  - The transition from information scarcity to information abundance in a field appears to occur abruptly.
  - Organizations have difficulty coping with the transition.
  - Many organizations are still based on the information scarcity model.

Characteristics of Big Data
  - Large amounts of data (volume)
  - Rapid pace of data acquisition (velocity)
  - Complexity of the data (variety)
    - Complexity of individual data items
    - Both structured and unstructured data
    - Complexity of data from one source
    - Multiple sources

History of Big Data
  - Development of new storage and indexing strategies for handling volume and velocity
    - Map Reduce was developed in 1994. [2]
  - Development of techniques for handling variety
    - Schema mapping
    - Controlled vocabularies
    - Knowledge representations
    - Ontologies and semantic technologies
  - Connection between these two?
    - Surprisingly little collaboration and communication.
    - A notable exception is the early work starting in 1992 on representing biological research papers. [3]

The Potential of Big Data
  - Could address important social and commercial needs.
  - Traditional techniques are inadequate
    - Cost, feasibility, ethical concerns
  - New techniques are much better, but
    - Still have ethical issues
    - Privacy has become a prominent issue
    - The variety problem is typically handled with ad hoc techniques

Why is semantics important?
  - Misunderstanding the data can result in invalid or misrepresented analyses.
  - A recognized problem well before Big Data.
    - John P. A. Ioannidis claimed that most published scientific research findings are false. [4] [5]
    - Many forms of bias can invalidate results
    - Bias and observer effects are difficult to recognize
    - Standard experimental procedures do not eliminate bias.
  - Could Big Data exacerbate this problem?

What can ontologies do to help?
  - Background knowledge of the domain
  - The structure of the data
  - Annotation of data and metadata
  - Provenance of the data
  - Transformations
  - Analyses
  - Interpretations
  - Data processing workflows
  - Privacy concerns
  - Hypothesis generation and their workflows

Some Challenges
  - Little collaboration between the communities
    - Big Data focus on volume and velocity, assuming someone else will handle variety
  - Tool incompatibility
  - Incompatibility between statistical and logical techniques (hybrid reasoning gap)

Statistical versus Logical Reasoning
  - There is no intrinsic incompatibility.
    - Probability has been axiomatized
    - A formal ontology of probability and statistics is possible
    - One can formally represent statistical statements in such an ontology [6]
  - Bayesian networks are especially powerful.
    - Can be developed using the same techniques used for developing ontologies [7]

Bayesian Network Specification

  Flu:  Pr(Flu) = 0.0001
  Cold: Pr(Cold) = 0.01

  Perceives Fever (Discrete Random Variable)

    Condition             TRUE    FALSE
    Flu = F & Cold = F    0.01    0.99
    Flu = F & Cold = T    0.1     0.9
    Flu = T & Cold = F    0.9     0.1
    Flu = T & Cold = T    0.95    0.05

  Temperature (Continuous Random Variable)

    Condition             Mean    Std Dev
    Flu = F & Cold = F    37      0.5
    Flu = F & Cold = T    37.5    1
    Flu = T & Cold = F    39      2
    Flu = T & Cold = T    39.5    2.5
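The specification on this slide can be written down directly as data. The following is a minimal Python sketch, assuming (as the two tables suggest) that Flu and Cold are independent root nodes and that Perceives Fever and Temperature each depend only on Flu and Cold. The names PRIOR, PERCEIVES_FEVER, TEMPERATURE, and the helper functions are illustrative and do not come from the slides.

```python
from itertools import product

# Prior probabilities of the two root nodes, taken from the slide.
PRIOR = {"flu": 0.0001, "cold": 0.01}

# Pr(Perceives Fever = TRUE | Flu, Cold), keyed by (flu, cold).
PERCEIVES_FEVER = {
    (False, False): 0.01,
    (False, True): 0.1,
    (True, False): 0.9,
    (True, True): 0.95,
}

# (mean, standard deviation) of Temperature given (flu, cold).
TEMPERATURE = {
    (False, False): (37.0, 0.5),
    (False, True): (37.5, 1.0),
    (True, False): (39.0, 2.0),
    (True, True): (39.5, 2.5),
}


def parent_prob(flu, cold):
    """Joint probability of one (flu, cold) configuration.

    Assumption: Flu and Cold are independent, since the slide gives
    each of them an unconditional prior.
    """
    p_flu = PRIOR["flu"] if flu else 1.0 - PRIOR["flu"]
    p_cold = PRIOR["cold"] if cold else 1.0 - PRIOR["cold"]
    return p_flu * p_cold


def prob_perceives_fever():
    """Marginal Pr(Perceives Fever = TRUE), summed over both parents."""
    return sum(
        parent_prob(f, c) * PERCEIVES_FEVER[(f, c)]
        for f, c in product([False, True], repeat=2)
    )


def expected_temperature():
    """Marginal mean of Temperature (a mixture of the four conditional means)."""
    return sum(
        parent_prob(f, c) * TEMPERATURE[(f, c)][0]
        for f, c in product([False, True], repeat=2)
    )


print(prob_perceives_fever())
print(expected_temperature())
```

Under these assumptions the marginal probability of perceiving a fever comes out at roughly 0.011, dominated by the healthy and cold-only cases because flu is so rare a priori.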

Bayesian Network Inference

  [Diagram: network with nodes Flu, Cold, Perceives Fever, and Temperature; labels distinguish the Query (Inferred RV) from the Evidence (Observed RV).]

  - Inference is performed by observing some random variables (evidence) and inferring some others (query).
  - The evidence can be a value or a probability distribution.
  - The answer to the query is the marginal distribution.
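The inference step can be illustrated with exact enumeration over the two unobserved parents. This is a hedged sketch rather than the method used in the talk: it assumes the probabilities from the specification slide, assumes Perceives Fever and Temperature are conditionally independent given Flu and Cold, treats Flu as the query, and handles only evidence given as a single observed value (not as a probability distribution). The function names posterior_flu and normal_pdf are illustrative.

```python
import math
from itertools import product

# Numbers from the specification slide.
PRIOR_FLU, PRIOR_COLD = 0.0001, 0.01

# Pr(Perceives Fever = TRUE | Flu, Cold), keyed by (flu, cold).
P_FEVER = {
    (False, False): 0.01,
    (False, True): 0.1,
    (True, False): 0.9,
    (True, True): 0.95,
}

# (mean, standard deviation) of Temperature given (flu, cold).
TEMP = {
    (False, False): (37.0, 0.5),
    (False, True): (37.5, 1.0),
    (True, False): (39.0, 2.0),
    (True, True): (39.5, 2.5),
}


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_flu(perceives_fever=None, temperature=None):
    """Return Pr(Flu = TRUE | evidence) by enumerating Flu and Cold.

    Evidence that is None is treated as unobserved and marginalized out.
    Assumption: the two evidence nodes are conditionally independent
    given their parents, so their likelihoods multiply.
    """
    weight = {True: 0.0, False: 0.0}
    for flu, cold in product([False, True], repeat=2):
        w = (PRIOR_FLU if flu else 1.0 - PRIOR_FLU) * (
            PRIOR_COLD if cold else 1.0 - PRIOR_COLD
        )
        if perceives_fever is not None:
            p = P_FEVER[(flu, cold)]
            w *= p if perceives_fever else 1.0 - p
        if temperature is not None:
            mean, std = TEMP[(flu, cold)]
            w *= normal_pdf(temperature, mean, std)
        weight[flu] += w
    # Normalizing the weights gives the marginal distribution of the query.
    return weight[True] / (weight[True] + weight[False])


print(posterior_flu())                                         # no evidence: the prior
print(posterior_flu(perceives_fever=True))                     # discrete evidence only
print(posterior_flu(perceives_fever=True, temperature=39.5))   # both kinds of evidence
```

With no evidence the function returns the prior for Flu. Adding the observed fever raises the posterior, and adding a high measured temperature raises it further, which is the behavior the slide describes: observed variables update the marginal distribution of the query variable.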

Speakers
  - Eric Chan (Oracle)
    - Tool development addressing many aspects of the variety problem including provenance, metadata, and hypothesis generation and workflow (OODA loop)
  - Nathan Wilson (Encyclopedia of Life)
    - Experiences with using semantic technology in the tree of life, a notoriously messy ontology

Speakers
  - Ruth Duerr (National Snow and Ice Data Center)
    - Experiences with using ontologies to address very heterogeneous sources of data for snow and ice, including satellite data, model output, point observations, social science data (interviews with Alaskan community members), and many other kinds of data.

Track D Wiki Pages
  - Synthesis Page
    http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologySummit2014_Tackling_Variety_In_BigData_Synthesis
  - Community Input Page
    http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologySummit2014_Tackling_Variety_In_BigData_CommunityInput

Bibliography
  1. D. Kerr, K. Braithwaite, N. Metropolis, D. Sharp, G.-C. Rota (Eds.). Science, Computers and the Information Onslaught. Academic Press. (1984)
  2. K. Baclawski and J.E. Smith. High-performance, distributed information retrieval. Northeastern University, College of Computer Science. (1994)
  3. K. Baclawski. Data/knowledge bases for biological papers and techniques. Northeastern University, College of Computer Science. NSF grant. (1992)
  4. J.P.A. Ioannidis. Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124 (2005)
  5. Critical Data Conference: Secondary Use of Big Data from Critical Care. http://criticaldata.mit.edu/ (2014)
  6. K. Baclawski and T. Niu. Ontologies for Bioinformatics. (2005)
  7. K. Baclawski. Bayesian network development. In International Workshop on Software Methodologies, Tools and Techniques, pages 18-48. Keynote address. (September, 2004)