Virtual Data Integration



Similar documents
Data Integration. Helena Galhardas Paulo Carreira DEI IST

Data Integration and Network Marketing

Query Processing in Data Integration Systems

Data Integration. May 9, Petr Kremen, Bogdan Kostov

Grid Data Integration based on Schema-mapping

A Tutorial on Data Integration

Shuffling Data Around

Magic Sets and their Application to Data Integration

Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Data Quality in Information Integration and Business Intelligence

DATA INTEGRATION CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

A View Integration Approach to Dynamic Composition of Web Services

XML Data Integration in OGSA Grids

Schema Mediation in Peer Data Management Systems

Database Systems. Lecture Handout 1. Dr Paolo Guagliardo. University of Edinburgh. 21 September 2015

Amit Sheth & Ajith Ranabahu, Presented by Mohammad Hossein Danesh

CSE 562 Database Systems

Enterprise Modeling and Data Warehousing in Telecom Italia

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Exercises for Principles of Data Integration

Data Grids. Lidan Wang April 5, 2007

The Piazza peer data management system

A View Integration Approach to Dynamic Composition of Web Services

Data Cleaning and Transformation - Tools. Helena Galhardas DEI IST

Comparing Data Integration Algorithms

Software Concepts. Uniprocessor Operating Systems. System software structures. CIS 505: Software Systems Architectures of Distributed Systems

MDM for the Enterprise: Complementing and extending your Active Data Warehousing strategy. Satish Krishnaswamy VP MDM Solutions - Teradata

Semantic Description of Distributed Business Processes

Enabling Collaboration Using the Biomedical Informatics Research Network (BIRN):

Query Reformulation over Ontology-based Peers (Extended Abstract)

Requirements for Context-dependent Mobile Access to Information Services

IST722 Data Warehousing

A STUDY ON DATA MINING INTEGRATION ON CLOUD CHALLENGES AND ITS SOLUTIONS

Index Selection Techniques in Data Warehouse Systems

Distributed Databases

Schema Mediation and Query Processing in Peer Data Management Systems

Reference Architecture, Requirements, Gaps, Roles

Data Management in Peer-to-Peer Data Integration Systems

Data and beyond

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Data Integration: A Theoretical Perspective

Grid Data Integration Based on Schema Mapping

Consistent Answers from Integrated Data Sources

A Survey on Data Warehouse Constructions, Processes and Architectures

EHR Standards Landscape

Piazza: Data Management Infrastructure for Semantic Web Applications

Virtual Data Integration

Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)

Configuring Sites and Understanding AD replication. Dante Villarroel Saavedra

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

View-based Data Integration

CHAPTER-6 DATA WAREHOUSE

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

Data integration and reconciliation in Data Warehousing: Conceptual modeling and reasoning support

EDG Project: Database Management Services

Efficient Query Optimization for Distributed Join in Database Federation

Potential standardization items for the cloud computing in SC32

Integrating Heterogeneous Data Sources Using XML

Data Integration of Bioinformatics and Web-Based Software Development

DATABASES AND THE GRID

Integration and Coordination in in both Mediator-Based and Peer-to-Peer Systems

Week Days or Week Ends - Flexible. Online Instructor Led/ Class room

GEOG 482/582 : GIS Data Management. Lesson 10: Enterprise GIS Data Management Strategies GEOG 482/582 / My Course / University of Washington

DATA WAREHOUSING II. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 23

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Implementing a Data Warehouse with Microsoft SQL Server

HyperSpaces: A distributed event-driven shared memory model without destructive modications

Integration of Heteregeneous Bio-Medical Databases: A Federated Approach using Semantic Schemas

MASTER DATA MANAGEMENT:

Navigational Plans For Data Integration

An ontology-based mediation architecture for E-commerce applications

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

ESS EA TF Item 2 Enterprise Architecture for the ESS

An introduction to data integration. Prof. Letizia Tanca Technologies for Information Systems

Integration of Distributed Healthcare Records: Publishing Legacy Data as XML Documents Compliant with CEN/TC251 ENV13606

Efficient Query Processing for Data Integration

Lecture Data Warehouse Systems

How To Understand Data Integration

OntoPIM: How to Rely on a Personal Ontology for Personal Information Management

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

USING SCHEMA AND DATA INTEGRATION TECHNIQUE TO INTEGRATE SPATIAL AND NON-SPATIAL DATA : DEVELOPING POPULATED PLACES DB OF TURKEY (PPDB_T)

FIPA agent based network distributed control system

Data-Warehouse-, Data-Mining- und OLAP-Technologien

JOURNAL OF OBJECT TECHNOLOGY

The Service Revolution software engineering without programming languages

Adding Semantics to Business Intelligence

Distributed Database Design (Chapter 5)

Implement a Data Warehouse with Microsoft SQL Server 20463C; 5 days

Some Research Challenges for Big Data Analytics of Intelligent Security

BIRN Update. Carl Kesselman

Sizing Logical Data in a Data Warehouse A Consistent and Auditable Approach

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Enterprise Application Designs In Relation to ERP and SOA

Accessing Data Integration Systems through Conceptual Schemas (extended abstract)

Exploring the Synergistic Relationships Between BPC, BW and HANA

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Report on the Dagstuhl Seminar Data Quality on the Web

Transcription:

Virtual Data Integration Helena Galhardas Paulo Carreira DEI IST (based on the slides of the course: CIS 550 Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Agenda Terminology Conjunctive queries and Datalog Virtual Data Integration Architecture

Building a Data Integration System Create a middleware mediator or data integration system over the sources Can be warehoused (a data warehouse) or virtual Presents a uniform query interface and schema Abstracts away multitude of sources; consults them for relevant data Unifies different source data formats (and possibly schemas) Sources are generally autonomous, not designed to be integrated Sources may be local DBs or remote web sources/services Sources may require certain input to return output (e.g., web forms) binding patterns describe these

Logical components of a virtual integration system Source descriptions Specify the properties of the sources the system needs to know to use their data. Main component is semantic mappings that specify how attributes in the sources correspond to attributes in the mediated schema, how to resolve differences in how data values are specified in different sources Other info is whether sources are complete Mediated Schema Is built for the data integration application and contains only the aspects of the domain relevant to the application. Most probably will contain a subset of the attributes seen in sources Wrappers Programs whose role is to send queries to a data source, receive answers and possibly apply some basic transformation to the answer

Example: a data integration scenario

Semantic mappings Semantic mappings in the source descriptions describe: the relationship between the sources and the mediated schema, for example: mapping of source S1 states it contains movies; the attribute name in Movies maps to attribute title in the mediator Movie relation; the Actors relation in the mediated schema is a projection of the Movies source on the attributes name and actors Whether the sources are complete, for example: Source S2 may not contain all the movie showing times in the entire country

Components of a data integration system

Query reformulation (1) Rewrite the user query that was posed in terms of the relations in the mediated schema, into queries referring to the schemas of data sources Result is called a logical query plan: Set of queries that refer to the schemata of the data sources and whose combination will yield the answer to the original query

Query reformulation (2) Ex: SELECT title, starttime!! FROM Movie, Plays!! WHERE Movie.title = Plays.movie!! AND location= New York!! AND director= Woody Allen Tuples for Movie can be obtained from source S1 but attribute title needs to be reformulated to name Tuples for Plays can be obtained from S2 or S3. Since S3 is complete for showings in NY, we choose it Since source S3 requires the title of a movie as input and the title is not specified in the query, the query plan must first access S1 and then feed the movie titles returned from S1 as inputs to S3.

Query optimization Acepts a logical query plan as input and produces a physical query plan Ex: Specifies the exact order in which sources are accessed, when results are combined, which algorithms are used for performing operations on the data The optimizer will decide the join algorithm to combine results from S1 and S3. It may stream tuples from S1 and input them into S3, or it may batch them up before sending them to S3

Query execution Responsible for the execution of the physical query plan Dispatches the queries to the individual sources through the wrappers and combines the results as specified by the query plan. Also may ask the optimizer to reconsider its plan based on its monitoring of the plan s progress (e.g., if source S3 is slow)