Avigdor Gal Technion Israel Institute of Technology

Similar documents
ICT Perspectives on Big Data: Well Sorted Materials

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Exploratory Data Analysis with #codemash

Automating Big Data Management, by DISIT Lab Distributed [Systems and Internet, Data Intelligence] Technologies Lab Prof. Ph.D. Eng.

Information Visualization WS 2013/14 11 Visual Analytics

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Big Data and Analytics: Challenges and Opportunities

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

Introduction to Data Mining

COMP9321 Web Application Engineering

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

What the Hell is Big Data?

Big Systems, Big Data

Smarter Planet evolution

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20

Interactive Visual Data Analysis in the Times of Big Data

Big Data a threat or a chance?

Introducing Oracle Exalytics In-Memory Machine

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Lecture 21: NoSQL III. Monday, April 20, 2015

Big Data Analytics. Chances and Challenges. Volker Markl

GEOG 482/582 : GIS Data Management. Lesson 10: Enterprise GIS Data Management Strategies GEOG 482/582 / My Course / University of Washington

Machine Learning and Cloud Computing. trends, issues, solutions. EGI-InSPIRE RI

Doing Multidisciplinary Research in Data Science

User Modeling in Big Data. Qiang Yang, Huawei Noah s Ark Lab and Hong Kong University of Science and Technology 杨 强, 华 为 诺 亚 方 舟 实 验 室, 香 港 科 大

Collaborations between Official Statistics and Academia in the Era of Big Data

Strategies For Setting Up Your Organisation For Success With Big Data. Kevin Long Business Development Director Teradata

Big Data, Physics, and the Industrial Internet! How Modeling & Analytics are Making the World Work Better."

Big Data Governance Certification Self-Study Kit Bundle

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Big Data R&D Initiative

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

What is Big Data? BCS Aberdeen Branch 6 November 2014

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Amplify Serviceability and Productivity by integrating machine /sensor data with Data Science

Big Data Systems CS 5965/6965 FALL 2014

OLAP. Data Mining Decision

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

Introduction. A. Bellaachia Page: 1

What do we do at Cimigo?

A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems

Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition

Sunnie Chung. Cleveland State University

The Big Data Paradigm Shift. Insight Through Automation

Challenges for Data Driven Systems

Indian Journal of Science The International Journal for Science ISSN EISSN Discovery Publication. All Rights Reserved

Topics in basic DBMS course

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases. Andreas Züfle

Data Visualization and Team Collaboration. Michael Paulos Business Analyst Marketing, Cannery Casino Resorts June 5, :30-2:15 PM

BIG DATA CHALLENGES AND PERSPECTIVES

Introduction to Big Data the four V's

Big Data: calling for a new scope in the curricula of Computer Science. Dr. Luis Alfonso Villa Vargas

MACHINE LEARNING BASICS WITH R

Big Data & Security. Aljosa Pasic 12/02/2015

A Presenta*on from Big Data 22 February 2013

Big Data Governance Certification Self-Study Kit Bundle

The University of Jordan

International Journal of Innovative Research in Computer and Communication Engineering

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Database Systems. Lecture 1: Introduction

Framework and key technologies for big data based on manufacturing Shan Ren 1, a, Xin Zhao 2, b

Unified access to all your data points. with Apache MetaModel

BIG Big Data Public Private Forum

Social Influence Analysis in Social Networking Big Data: Opportunities and Challenges. Presenter: Sancheng Peng Zhaoqing University

Data Warehouse: Introduction

Veterinary epidemiology in the era of big data: Value, Volume and Velocity of information extraction

Big Data in Pictures: Data Visualization

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

BIG DATA IN BUSINESS ENVIRONMENT

Professor, D.Sc. (Tech.) Eugene Kovshov MSTU «STANKIN», Moscow, Russia

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Data Centric Computing Revisited

Big Data Explained. An introduction to Big Data Science.

Applying Semantics to Unstructured Data (Big and Getting Bigger)

BIG DATA FUNDAMENTALS

Big Graph Processing: Some Background

Marko Grobelnik Jozef Stefan Institute

Cloud Big Data Architectures

Virtual Parking Management. Real-Time PARCS Monitoring

Jay Buckingham Dynamic Signal

Smart Financial Data: Semantic Web technology transforms Big Data into Smart Data

Data Refinery with Big Data Aspects

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

What do Analytics, Tom Cruise and Bob Dylan have in Common?

Deep Learning Meets Heterogeneous Computing. Dr. Ren Wu Distinguished Scientist, IDL, Baidu

A Berkeley View of Big Data

SEMI AUTOMATIC DATA CLEANING FROM MULTISOURCES BASED ON SEMANTIC HETEROGENOUS

Big Data and Semantic Web in Manufacturing. Nitesh Khilwani, PhD Chief Engineer, Samsung Research Institute Noida, India

Big Data in the context of Preservation and Value Adding

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

How To Make Sense Of Data With Altilia

Information Management course

The Challenges of Geospatial Analytics in the Era of Big Data

Big Data and Government: What s the Big Deal? John Kreisa Chief Strategic Marketing Officer Hortonworks

Transcription:

Avigdor Gal Technion Israel Institute of Technology

Tutorial Big data integration Applications of big data integration Current challenges and future research directions

Big data is a game changer From Theory to Systems: empirical evaluation counts From Systems to : large scale empirical evaluation counts

Volume: No Longer the Size of a Teacup Volume Table : Cross Table Big data may be a single dataset with a lot of data

Volume: No Longer the Size of a Teacup Table : Cross Table Big data may be a single dataset with a lot of data

Velocity: Replacing a Teacup with a Tea Hose Volume Velocity Table : Cross Table Big data may be data that rapidly changes

Velocity: Replacing a Teacup with a Tea Hose Table : Cross Table Big data may be data that rapidly changes

Velocity: Replacing a Teacup with a Tea Hose Table : Cross Table Big data may be data that rapidly changes

Velocity: Replacing a Teacup with a Tea Hose Table : Cross Table Big data may be data that rapidly changes

Variety: When One Tea Type is Just not Enough Volume Velocity Variety Table : Cross Table Big data may be a small dataset with many different schemata

Variety: When One Tea Type is Just not Enough Table : Cross Table Big data may be a small dataset with many different schemata

Veracity: Is it Coffee or Black Tea with Milk? Volume Velocity Variety Veracity Table : Cross Table Big data may be data with varying levels of trustworthiness

Veracity: Is it Coffee or Black Tea with Milk? Table : Cross Table Big data may be data with varying levels of trustworthiness

Gathering: where and when to expect the fountain to burst Volume Velocity Variety Veracity Gathering Signal and Event Processing Table : Cross Table

Gathering: where and when to expect the fountain to burst Table : Cross Table

Management: Not your typical DBA anymore Volume Velocity Variety Veracity Gathering Managing Cloud Computing, NoSQL, NewSQL Table : Cross Table

Analytics: When Analysis Explodes Multi-Dimensionally Volume Velocity Variety Veracity Gathering Managing Analyzing Table : Cross Table & Process Mining ML, IR, NLP

Visualization: The Machine Offering to Mankind Volume Velocity Variety Veracity Gathering Managing Analyzing Visualizing Table : Cross Table User Experience

Visualization: The Machine Offering to Mankind Table : Cross Table

Cross Table Volume Velocity Variety Gathering Managing Analyzing Visualizing Veracity Table : Cross Table

What is? is the task of integrating multiple data sources into a single data source. is a management task in the Cross Table. Two major tasks of data integration are schema matching and entity resolution.

Schema Matching What is Schema Matching? Ancient history: heterogeneity of schemata Different DBAs, different names Granularity matters Schema matching is the process of creating attribute correspondences among multiple schemata

Schema Matching What is Schema Matching? Ancient history: heterogeneity of schemata Different DBAs, different names Granularity matters Schema matching is the process of creating attribute correspondences among multiple schemata Existing Work Formal Models: uncertain schema matching Algorithmic & Heuristic solutions: string, value, structure-based Empirical benchmarks: University applications, Web forms, Ontology matching competition (OAEI)

Textbook Example for Schema Matching Id name ZIP Income r 1 Green 51519 30K r 2 Green 51518 32K r 3 Peter 30528 40K r 4 Peter 30528 40K Table : SM Simple Example Id firstname lastname Address Salary r 1 John Green CARTER LAKE IA 51519 30,000 r 2 Sarah Green CARTER LAKE IA 51518 32,000K r 3 Peter Smith CLEVELAND GA 30528 40,000 r 4 Peter Smith CLEVELAND GA 30528 40,000 Table : SM Simple Example 2

Entity Resolution What is Entity Resolution? Real world data is dirty Typographical errors and missing values Different date formats and terminology Multiple representations of the same real-world object Multi-dimensional data aspects: temporal, spatial,... ER is the process of determining when different entity representations refer to the same entity.

Entity Resolution What is Entity Resolution? Real world data is dirty Typographical errors and missing values Different date formats and terminology Multiple representations of the same real-world object Multi-dimensional data aspects: temporal, spatial,... ER is the process of determining when different entity representations refer to the same entity. Existing work Formal Models and Languages Algorithmic solutions Comparative empirical analysis of solutions: FEBRL

Textbook Example for Entity Resolution Id name ZIP Income r 1 Green 51519 30K r 2 Green 51518 32K r 3 Peter 30528 40K r 4 Peter 30528 40K r 5 Gtee 51519 55K r 6 Howard 51519 30K Table : ER Simple Example

+ = Volume Velocity Variety Veracity Gathering Managing Analyzing Visualizing ER ER SM SM & ER Table : Cross Table

+ = Volume Velocity Variety Veracity Gathering Managing Analyzing Visualizing ER ER SM SM & ER

: not Your Typical Anymore Urban Traffic Management

: not Your Typical Anymore Traffic Flow

: not Your Typical Anymore Bus Log Bus Model s ω_2 ω_3 ω_i ω_{n-1} d

: not Your Typical Anymore Challenges Volume: 23 Million records per month ( 4GB) Velocity: 770,000 new records per day (an event each 2-6 seconds) Variety: Homogeneous Veracity: GPS locations

Big data The ability to take data to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it that s going to be a hugely important skill in the next decades. (Hal Varian, Google s Chief Economist) integration integration has been the basis of data understanding and processing for many years now. With big data joining in, the impact of data integration is not diminishing. Rather, it changes shape while remaining dominant.

Challenges Volume Compute data integration faster, by using parallelization. Velocity Create incremental computation methods for data integration. Variety Extend evaluation models to support data integration with minimal or no human input in the loop. Veracity Quantified uncertainty management for data integration.

Thank You Avigdor Gal Technion Israel Institute of Technology