Data Integration and Data Provenance. Profa. Dra. Cristina Dutra de Aguiar Ciferri cdac@icmc.usp.br




Outline
- Data Integration
  - Schema Integration
  - Instance Integration
  - Our Work
- Data Provenance
  - Basic Concepts
  - The PrInt Model

Outline
- Index Structures
  - Biological databases
  - Similarity search of complex data
  - Spatial data warehouses
  - Similarity search over data warehouses of images
- Mining of Medical Data

Data Integration
- Schema Level Integration
  - specification of mappings that describe the semantic relationships among schemas from heterogeneous data sources
- Instance Level Integration
  - identification of which entities from heterogeneous sources refer to the same entity in the real world
  - resolution of value conflicts

Schema Level Integration
- Semantic Relativism
  - conflicts between two or more representations arise because different users model the same piece of the real world in different ways, according to their perceptions
  - Types of conflict to identify
    - name conflicts, including homonyms and synonyms
    - semantic conflicts
    - structural conflicts
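As a rough illustration of name conflicts, the sketch below compares two schemas attribute by attribute. The two schemas and the concept labels attached to each attribute are hypothetical and are assumed to be supplied by a designer; the slides do not prescribe any detection procedure.

```python
# Hypothetical schemas: attribute name -> the real-world concept it denotes.
# The concept labels are an assumption made for this sketch.
schema_a = {
    "title": "article title",
    "pages": "page range",
    "venue": "conference name",
}
schema_b = {
    "name": "article title",
    "pages": "page count",
    "conference": "conference name",
}


def name_conflicts(a, b):
    """Return (synonyms, homonyms) between two annotated schemas.

    Synonyms: different attribute names that denote the same concept.
    Homonyms: the same attribute name that denotes different concepts.
    """
    synonyms = sorted(
        (name_a, name_b)
        for name_a, concept_a in a.items()
        for name_b, concept_b in b.items()
        if concept_a == concept_b and name_a != name_b
    )
    homonyms = sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
    return synonyms, homonyms


synonyms, homonyms = name_conflicts(schema_a, schema_b)
print("synonyms:", synonyms)
print("homonyms:", homonyms)
```

Here `pages` is reported as a homonym because the two hypothetical sources use the same name for a page range and a page count.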

Types of Schema
- Global Schema (or Mediated Schema)
  - integration of several heterogeneous local schemas into a homogeneous global schema
  - development of mappings that describe the semantic relationships between the mediated schema and the schemas of the sources
- Federated Schema
  - there is no homogeneous global schema
  - there are several heterogeneous local schemas, each one related to a given source
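One simple way to picture the mappings between a mediated schema and its sources is a per-source renaming table. The sketch below is only that: the source names, the local attribute names, and the `to_mediated` helper are hypothetical, and real mappings are usually richer than one-to-one renamings.

```python
# Attributes of a hypothetical mediated (global) schema for articles.
MEDIATED_ATTRS = ("title", "pages", "conference")

# Hypothetical mappings: for each source, mediated attribute -> local attribute.
MAPPINGS = {
    "source_a": {"title": "paper_title", "pages": "page_range", "conference": "venue"},
    "source_b": {"title": "name", "pages": "pp"},  # source_b stores no conference
}


def to_mediated(source, record):
    """Rewrite a local record into the mediated schema.

    Mediated attributes that the source does not cover are set to None.
    """
    mapping = MAPPINGS[source]
    return {
        attr: (record.get(mapping[attr]) if attr in mapping else None)
        for attr in MEDIATED_ATTRS
    }


record_a = {
    "paper_title": "Distributed query processing in a relational database system",
    "page_range": "169-180",
    "venue": "ACM SIGMOD",
}
print(to_mediated("source_a", record_a))
```

In a federated setting there would be no `MEDIATED_ATTRS`; each source would keep its own schema and be queried as such.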

Example of Structural Mappings
[Figure: diagrams of structural correspondences between schema elements. Legend: cor = corresponds to; a = association; atrib = attribute; C, W, X, Y, Z = classes]
SPACCAPIETRA, S.; PARENT, C. View Integration: A Step Forward in Solving Structural Conflicts. IEEE Transactions on Knowledge and Data Engineering, v.6, n.2, p.258-274, 1994

Instance Level Integration
- Reference Reconciliation (or Entity Resolution)
  - automatically detect references to the same real-world entity and group them into a cluster of similar entities
- Value Conflict Resolution
  - resolve the differences among attribute values of entities that refer to the same real-world entity
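Value conflict resolution can be sketched as fusing one cluster of already reconciled records into a single record. The policy used below (most frequent non-null value, ties broken by the longest value) is an assumption chosen for illustration, not a method taken from these slides; the two records are the conference examples that appear in the slides.

```python
from collections import Counter


def resolve(cluster):
    """Fuse records that refer to the same real-world entity.

    Assumed policy: per attribute, keep the most frequent non-null value;
    on a tie, keep the longest one (taken as the most informative).
    """
    attrs = {attr for record in cluster for attr in record}
    fused = {}
    for attr in attrs:
        values = [r[attr] for r in cluster if r.get(attr) is not None]
        if not values:
            fused[attr] = None
            continue
        counts = Counter(values)
        fused[attr] = max(counts, key=lambda v: (counts[v], len(str(v))))
    return fused


c1 = {"name": "ACM Conference on Management of Data", "year": "1978", "location": "Austin, Texas"}
c2 = {"name": "ACM SIGMOD", "year": "1978", "location": None}
print(resolve([c1, c2]))
```

Other policies (trusting one source over another, keeping the most recent value, asking the user) fit the same function shape.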

Reference Reconciliation

Examples of entities from the class article (a)
  a1 = ({"Distributed query processing in a relational database system"}, {"169-180"}, {p1, p2, p3}, {c1})
  a2 = ({"Distributed query processing in a relational database system"}, {"169-180"}, {p4, p5, p6}, {c2})

Examples of entities from the class person (p)
  p1 = ({"Robert S. Epstein"}, null, {p2, p3}, null)
  p2 = ({"Michael Stonebraker"}, null, {p1, p3}, null)
  p3 = ({"Eugene Wong"}, null, {p1, p2}, null)
  p4 = ({"Epstein, R.S."}, null, {p5, p6}, null)
  p5 = ({"Stonebraker, M."}, null, {p4, p6}, null)
  p6 = ({"Wong, E."}, null, {p4, p5}, null)
  p7 = ({"Eugene Wong"}, {"eugene@berkeley.edu"}, null, {p8})
  p8 = (null, {"stonebraker@csail.mit.edu"}, null, {p7})
  p9 = ({"mike"}, {"stonebraker@csail.mit.edu"}, null, null)

Examples of entities from the class conference (c)
  c1 = ({"ACM Conference on Management of Data"}, {"1978"}, {"Austin, Texas"})
  c2 = ({"ACM SIGMOD"}, {"1978"}, null)

Class schemas
  article: title, pages, *authors, *conference
  person: name, email, *authors, *emailcontact
  conference: name, year, location

Reference Reconciliation

Grouping of entities of the class article (a)
  grouping 1 = {a1, a2}

Grouping of entities of the class person (p)
  grouping 2 = {p1, p4}
  grouping 3 = {p2, p5, p8, p9}
  grouping 4 = {p3, p6, p7}

Grouping of entities of the class conference (c)
  grouping 5 = {c1, c2}

DONG, X.; HALEVY, A.; MADHAVAN, J. Reference Reconciliation in Complex Information Spaces. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), p.85-96, 2005
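The person groupings can be reproduced on this small example with a deliberately simple sketch. It is not the algorithm of Dong, Halevy and Madhavan, which also exploits the associations between references; the three matching rules below (same last name and first initial, same email, email local part equal to a last name) are assumptions made for illustration, and only the name and email attributes are used.

```python
from itertools import combinations

# Person references from the slides, reduced to (name, email).
people = {
    "p1": ("Robert S. Epstein", None),
    "p2": ("Michael Stonebraker", None),
    "p3": ("Eugene Wong", None),
    "p4": ("Epstein, R.S.", None),
    "p5": ("Stonebraker, M.", None),
    "p6": ("Wong, E.", None),
    "p7": ("Eugene Wong", "eugene@berkeley.edu"),
    "p8": (None, "stonebraker@csail.mit.edu"),
    "p9": ("mike", "stonebraker@csail.mit.edu"),
}


def name_key(name):
    """Return (last name, first initial) in lowercase, or None if unavailable."""
    if not name:
        return None
    if "," in name:  # "Last, First" form
        last, first = (part.strip() for part in name.split(",", 1))
    else:  # "First [Middle] Last" form
        tokens = name.split()
        if len(tokens) < 2:
            return None
        first, last = tokens[0], tokens[-1]
    if not first or not last:
        return None
    return last.lower(), first[0].lower()


def same_person(a, b):
    """Assumed matching rules; see the lead-in for their limits."""
    (name_a, email_a), (name_b, email_b) = a, b
    key_a, key_b = name_key(name_a), name_key(name_b)
    if key_a and key_a == key_b:
        return True
    if email_a and email_a == email_b:
        return True
    for key, email in ((key_a, email_b), (key_b, email_a)):
        if key and email and email.split("@")[0].lower() == key[0]:
            return True
    return False


def reconcile(entities):
    """Group entity ids by taking the transitive closure of same_person."""
    parent = {eid: eid for eid in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(entities, 2):
        if same_person(entities[a], entities[b]):
            parent[find(a)] = find(b)

    groups = {}
    for eid in entities:
        groups.setdefault(find(eid), []).append(eid)
    return sorted(sorted(group) for group in groups.values())


print(reconcile(people))
```

Rules this loose would merge distinct people who share a last name and initial on realistic data, which is one reason approaches such as the cited one also use co-author and email-contact associations as evidence.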

Our Work
- The Academic Data Reconciler Tool
  - semi-automates the identification of
    - corresponding objects
    - inconsistencies
  - helps the user to
    - eliminate inconsistencies
    - complete data
    - exchange data
- The Academic Data Reconciler Tool for Reference Reconciliation

Academic Data Reconciler Tool (ADR)
- Functional aspects
  - visualization of objects from two documents
    - side-by-side visualization
  - synchronization of objects
  - editing of incomplete and erroneous data
  - data exchange between objects from different documents

Interface

Visualization

Synchronization

Editing

Data Exchange

ADR for Reference Reconciliation
[Figure labels: integrated entity; similar entities that belong to the same grouping]