Data collection architecture for Big Data




Data collection architecture for Big Data: a framework for a research agenda
(Research in progress - ERP Sense Making of Big Data)
Wout Hofman, May 2015, BDEI workshop

Big Data success stories bias our thinking: proprietary, closed solutions

Problem statement
- Large-scale, controlled, open implementation of data analytics/data innovation by organisations is lacking
- From offline to real-time
- Big Data versus data driven innovation - volume, variety, velocity, veracity (, value)
- Collection, homogenisation, and integration are time-consuming
- (Too) many (un)structured (linked) open data sets
- No clear data governance rules and data policies supported by interventions
- Unknown features of data sets (quality, etc.)
- Data with different technical formats (5-star model?)
- Embedded data semantics
- API based data sharing platforms
- Research focuses on solving individual issues; an architecture is lacking

From offline to real-time - impact on IT architecture
- Descriptive - what happened (also known as supply chain visibility in logistics)
- Diagnostics - why did it happen (e.g. supply chain resilience)
- Predictive - what will happen (e.g. resilience in terms of too late, waiting queues, (Demanes case))
- Prescriptive analytics - how can we make it happen (prevention, etc.) (Gartner)
But also:
- Anomaly detection - combining the past with descriptive analytics (e.g. risk analysis)
- Query evaluation - search and find appropriate data

The data value chain (Esmeijer, Bakker & Munck, 2015)

Processing is considered as a sequence of steps:
- Data generation and collection (inventory of data sources, quality features, etc.)
- Data preparation (filtering, cleaning, verification, annotation)
- Data integration
- Data storage (local databases, cloud storage, ...)
- Data analytics (multi-view clustering, deep learning)
- Data visualisation
- Data driven action
- Data governance and security
Lacking: data collection policy

Data generation and collection
- (Too) many (un)structured (linked) open data sets
- No clear data governance rules and data policies supported by interventions
- Data with different technical formats (5-star model?)
- Embedded data semantics
- API based data sharing platforms
- No standards for metadata > no (automatic) annotation (taken from Zaveri et al.):
  - Contextual (completeness, amount, relevancy)
  - Trust (believability, verifiability, reputation, provenance, licensing)
  - Representation (conciseness, consistency, understandability, interpretability, versatility)
  - Intrinsic (accuracy, objectivity, validity, conciseness, interlinking, consistency)
  - Dynamicity (timeliness, currency, volatility)
  - Accessibility (availability, performance, security, response time)
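As an illustration only, the quality dimensions listed above could be captured in a machine-readable annotation record. The sketch below is a minimal assumption of what such a record might look like; the class name DatasetAnnotation, the score range [0, 1], and the method names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

# Dimension groups and their features as listed on the slide (after Zaveri et al.).
QUALITY_DIMENSIONS = {
    "contextual": {"completeness", "amount", "relevancy"},
    "trust": {"believability", "verifiability", "reputation", "provenance", "licensing"},
    "representation": {"conciseness", "consistency", "understandability",
                       "interpretability", "versatility"},
    "intrinsic": {"accuracy", "objectivity", "validity", "conciseness",
                  "interlinking", "consistency"},
    "dynamicity": {"timeliness", "currency", "volatility"},
    "accessibility": {"availability", "performance", "security", "response time"},
}


@dataclass
class DatasetAnnotation:
    """Hypothetical annotation record for one data set."""
    dataset_id: str
    # Maps (dimension, feature) to a score; the [0, 1] scale is an assumption.
    scores: dict = field(default_factory=dict)

    def annotate(self, dimension, feature, score):
        """Store a score after checking the feature against the known dimensions."""
        if feature not in QUALITY_DIMENSIONS.get(dimension, set()):
            raise ValueError(f"unknown feature {dimension}/{feature}")
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must lie in [0, 1]")
        self.scores[(dimension, feature)] = score

    def missing(self):
        """Return the (dimension, feature) pairs that have not been annotated yet."""
        all_pairs = {(d, f) for d, fs in QUALITY_DIMENSIONS.items() for f in fs}
        return all_pairs - set(self.scores)
```

A record like this would make it possible to tell which features of a data set are still unknown before it is used in an analysis.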

Data preparation and integration
- Data quality features: completeness, conciseness, correctness, and consistency
- Quality improvement:
  - annotation
  - automatic detection and repair
  - comparing data sets of different resources
- Homogenisation
- Matching and linking of data sets
- OWL is considered for semantics
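Two of the quality features named above, completeness and conciseness, can be estimated with simple counts. The sketch below is one possible reading and not a definition taken from the slides: it assumes records are plain dictionaries, treats None and the empty string as missing, and measures conciseness as the share of records that are unique on a set of key fields. The function names are hypothetical.

```python
def completeness(records, required_fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for name in required_fields
        if record.get(name) not in (None, "")
    )
    return filled / total


def conciseness(records, key_fields):
    """Fraction of records that are unique with respect to the key fields."""
    if not records:
        return 1.0
    keys = {tuple(record.get(name) for name in key_fields) for record in records}
    return len(keys) / len(records)
```

Correctness and consistency usually need a reference source or explicit rules, so they are left out of this sketch.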

Data governance and policies
- Open data
- Community data
- Bilateral data
- Internal data
- Data ownership and stewardship
- Applying privacy-enhanced technologies (e.g. IAA, attribute based access control, homomorphic encryption, ...)
(Eckartz, Hofman & van Veenstra, 2014)
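To make the four data categories concrete, here is a minimal sketch of an attribute based access decision. The attributes (organisation, communities, partners) and the deny-by-default rule are assumptions chosen for illustration; the slides do not prescribe a policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    organisation: str
    communities: frozenset = frozenset()


@dataclass(frozen=True)
class Dataset:
    owner: str
    category: str                      # "open", "community", "bilateral" or "internal"
    community: str = ""                # only used for community data
    partners: frozenset = frozenset()  # only used for bilateral data


def may_access(requester, dataset):
    """Decide access from requester and data set attributes (hypothetical rules)."""
    if dataset.category == "open":
        return True
    if requester.organisation == dataset.owner:
        return True
    if dataset.category == "community":
        return dataset.community in requester.communities
    if dataset.category == "bilateral":
        return requester.organisation in dataset.partners
    return False  # internal data or an unknown category: deny by default
```

For example, may_access(Requester("B"), Dataset(owner="A", category="internal")) evaluates to False, while the owner "A" is always granted access to its own data.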

Towards an architecture
[Figure: architecture overview. Labels in the figure: Data Usage (visualisation dashboard/analytics); data semantics; source registry; Data Collection; subscription; Source Interface; distributed (open) data sources]

[Figure: detailed architecture. Labels in the figure:
- Data user (e.g. analytics, visualisation dashboard, (complex) event processing): Data Analytics Dashboard, Modelling tools, Query formulation, Data Workflow, Semantic Model(s)
- Data collection: Connectivity Adapter, Interface support, Subscription management, Subscription registry, Registry, Data linking, Data fusion, Data manipulation, Link evaluation, Query decomposition, Audit trail, Transformation, Anonymization/Filtering, Data cleansing, Source adapters, Temporary Store, Subscription protocol with events (state changes)
- Data Source Adapter / Data Provision: security APIs, SPARQL endpoint, Provision adapters, Source Registration, Subscription management, Subscription registry, Identification & authentication, Access Control, Transformation, Anonymization/Filtering, Data cleansing, Audit trail, Data governance rules & interventions, Source Annotation, Profiling
- Data Source (open, closed, (un)structured)]
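The interplay of source adapter, subscription registry, and audit trail in the figure might be sketched as follows. This is a guess at how the components could cooperate, not the architecture's actual interface: the class SubscriptionRegistry, its method names, and the callback-based delivery are all assumptions.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """Hypothetical registry that routes source events to subscribed data users."""

    def __init__(self):
        self._adapters = {}                    # source id -> transformation callable
        self._subscribers = defaultdict(list)  # source id -> list of callbacks
        self.audit_trail = []                  # (source id, number of deliveries)

    def register_source(self, source_id, adapter):
        """Register a source with the adapter that transforms its raw events."""
        self._adapters[source_id] = adapter

    def subscribe(self, source_id, callback):
        """Subscribe a data user to events (state changes) of a registered source."""
        if source_id not in self._adapters:
            raise KeyError(f"unknown source: {source_id}")
        self._subscribers[source_id].append(callback)

    def publish(self, source_id, raw_event):
        """Transform a raw event with the source adapter and deliver it."""
        event = self._adapters[source_id](raw_event)
        callbacks = self._subscribers[source_id]
        self.audit_trail.append((source_id, len(callbacks)))
        for callback in callbacks:
            callback(event)


# Usage: one sensor source, one subscriber that collects the events.
registry = SubscriptionRegistry()
registry.register_source("sensor-1", lambda raw: {"temperature": float(raw["t"])})
received = []
registry.subscribe("sensor-1", received.append)
registry.publish("sensor-1", {"t": "21.5"})
```

Anonymisation, filtering, and access control would sit in the same path as the adapter call; they are omitted here to keep the sketch short.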

Research questions (rephrased)
1. How can privacy-enhanced technologies, semantics, and annotations of datasets improve large-scale, automatic data analytics?
2. What is the minimal required information to automatically integrate any dataset into a common format?

Privacy-enhanced technologies, semantics, and annotation to improve precision and recall of datasets
- Annotation and metadata
- Semantics and technical representation of a dataset
- Privacy-enhanced technologies: data governance, policies, and semantics
- Data collection policy: how to search and find appropriate data (appropriate: according to semantics and metadata with particular quality features)
- Query decomposition
- Automatic data workflow composition
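Finding "appropriate" data according to semantics and quality metadata can be pictured as a filter over a source registry. The sketch below assumes a registry held as a list of dictionaries with the hypothetical keys "id", "concepts", and "quality"; none of these names come from the slides.

```python
def select_sources(registry, concept, min_quality):
    """Return ids of sources that cover a concept and meet all quality thresholds."""
    matches = [
        source["id"]
        for source in registry
        if concept in source["concepts"]
        and all(
            source["quality"].get(feature, 0.0) >= threshold
            for feature, threshold in min_quality.items()
        )
    ]
    return sorted(matches)


# Usage with a small, made-up registry.
example_registry = [
    {"id": "s1", "concepts": {"vessel", "port"}, "quality": {"completeness": 0.9}},
    {"id": "s2", "concepts": {"vessel"}, "quality": {"completeness": 0.5}},
    {"id": "s3", "concepts": {"truck"}, "quality": {"completeness": 1.0}},
]
selected = select_sources(example_registry, "vessel", {"completeness": 0.8})
```

A source without a score for a requested feature is treated as not meeting the threshold, which is a conservative assumption.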

Minimal required information to automatically transform and integrate datasets for analytics
- Syntax transformation
- Ontology learning: text mining, NLP, etc.; networked ontology construction
- Semantic transformation: ontology matching and linking
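The difference between syntax transformation and semantic transformation can be shown on a very small scale. In the sketch below, parsing CSV text into records stands in for the syntax step, and renaming columns through a hand-written term map stands in for the result of ontology matching. The map TERM_MAP and the function name are hypothetical; real ontology matching would derive such a mapping rather than hard-code it.

```python
import csv
import io

# Hypothetical mapping from source column names to terms of a shared model.
TERM_MAP = {"temp": "temperature", "ts": "timestamp", "loc": "location"}


def csv_to_records(text, term_map=TERM_MAP):
    """Parse CSV text (syntax step) and rename columns to shared terms (semantic step)."""
    reader = csv.DictReader(io.StringIO(text))
    # Columns without a mapping keep their original name.
    return [{term_map.get(key, key): value for key, value in row.items()} for row in reader]
```

Values stay strings in this sketch; converting them to typed values would require datatype information in the data set's metadata.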

Thank you for your attention. Questions?