Dependencies Revisited for Improving Data Quality



Similar documents
M C. Foundations of Data Quality Management. Wenfei Fan University of Edinburgh. Floris Geerts University of Antwerp

Repair Checking in Inconsistent Databases: Algorithms and Complexity

A Fast and Efficient Method to Find the Conditional Functional Dependencies in Databases

Logical and categorical methods in data transformation (TransLoCaTe)

Big Data Cleaning. Nan Tang. Qatar Computing Research Institute, Doha, Qatar

XML Data Integration

Improving Data Excellence Reliability and Precision of Functional Dependency

Data Quality: From Theory to Practice

Data Integration and ETL Process

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Edinburgh Research Explorer

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS

Peer Data Exchange. ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY, Pages 1 0??.

A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment

Data Integration and ETL Process

Data Quality in Information Integration and Business Intelligence

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms

Clean Answers over Dirty Databases: A Probabilistic Approach

Consistent Query Answering in Databases Under Cardinality-Based Repair Semantics

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

Schema mapping and query reformulation in peer-to-peer XML data integration system

Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques

Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

Data Integration and Data Provenance. Profa. Dra. Cristina Dutra de Aguiar Ciferri

KEYWORD SEARCH IN RELATIONAL DATABASES

Data Wrangling: The Elephant in the Room of Big Data. Norman Paton University of Manchester

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

A Framework and Architecture for Quality Assessment in Data Integration

Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)

A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

Comprehensive Data Quality with Oracle Data Integrator. An Oracle White Paper Updated December 2007

A Tutorial on Data Integration

Data Integration: A Theoretical Perspective

Security Issues for the Semantic Web

Normalization Theory for XML

Data Quality Problems beyond Consistency and Deduplication

BUSINESS RULES CONCEPTS... 2 BUSINESS RULE ENGINE ARCHITECTURE By using the RETE Algorithm Benefits of RETE Algorithm...

Query Processing in Data Integration Systems

CHAPTER-6 DATA WAREHOUSE

XML enabled databases. Non relational databases. Guido Rotondi

Customer Classification And Prediction Based On Data Mining Technique

Security and privacy for multimedia database management systems

AN EFFICIENT EXTENSION OF CONDITIONAL FUNCTIONAL DEPENDENCIES IN DISTRIBUTED DATABASES

Data Validation with OWL Integrity Constraints

MULTI AGENT-BASED DISTRIBUTED DATA MINING

Integrating Pattern Mining in Relational Databases

Query Answering in Peer-to-Peer Data Exchange Systems

Unsupervised Bayesian Data Cleaning Techniques for Structured Data. Sushovan De

Big Data Management Assessed Coursework Two Big Data vs Semantic Web F21BD

Neighborhood Data and Database Security

Web-Based Genomic Information Integration with Gene Ontology

Efficiently Identifying Inclusion Dependencies in RDBMS

Performance Analysis, Data Sharing, Tools Integration: New Approach based on Ontology

Report on the Dagstuhl Seminar Data Quality on the Web

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Business Intelligence: Recent Experiences in Canada

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Data Preprocessing. Week 2

Transcription:

Dependencies Revisited for Improving Data Quality Wenfei Fan University of Edinburgh & Bell Laboratories Wenfei Fan Dependencies Revisited for Improving Data Quality 1 / 70

Real-world data is often dirty Ms. Stone, according to our database records you are supposed to be dead Wenfei Fan Dependencies Revisited for Improving Data Quality 2 / 70

Real-world data is often dirty Ms. Stone, according to our database records you are supposed to be dead US: Pentagon asked 275 dead/wounded officers to re-enlist UK: there are 81 million national insurance numbers but only 60 million people eligible Australia: 500,000 dead people retain active medicare cards In a database of 500,000 customers, 120,000 records become invalid within a year death, divorce, marriage, move typical data error rate in industry: 1% 5%, up to 30%... Errors, conflicts and inconsistencies Wenfei Fan Dependencies Revisited for Improving Data Quality 2 / 70

Dirty data is costly Poor data costs US companies $600 billion annually; Erroneously priced data in retail databases costs US customers $2.5 billion each year; World-wide losses from payment card fraud reached $4.84 billion in 2006; 30% 80% of the development time for data cleaning in a data integration project; and don t forget dirty data about WMD in Iraq The market for data quality tools is growing at 17% annually 7% average of IT segments Wenfei Fan Dependencies Revisited for Improving Data Quality 3 / 70

Research activities Statistics, management, and computer science Error correction (data imputation): to localize tuples that violate a given set of semantic rules, and fix erroneous values in the tuples that are identified as violations of the rules. Object identification: to identify tuples from one or more relations that refer to the same real-world object. Profiling: to infer and discover meta-data (constraints or semantic rules) from sample data. Data integration: to resolve conflicts in the sources via object identification; quality-driven query processing by explicitly taking into account the quality of data from various sources,... Approaches: probabilistic, empirical, rule-based, and logic-based,... Wenfei Fan Dependencies Revisited for Improving Data Quality 4 / 70

A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70

A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Dependencies considered for data cleaning: often traditional functional dependencies, inclusion dependencies... designed for improving the quality of schema Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70

A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Dependencies considered for data cleaning: often traditional functional dependencies, inclusion dependencies... designed for improving the quality of schema Revising traditional dependencies, for improving data quality Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70

Dependencies revisited To capture data inconsistencies: adding conditions To identify objects: incorporating similarity Revision of static analyses: satisfiability, implication, axiomatizability, dependency propagation To find practical use of dependencies in data quality tools Develop appropriate constraint languages for improving data quality: revising equality-generating dependencies (EGDs) and tuple-generating dependencies (TGDs) Integrate data repairing and object identification: reasoning about conditional dependencies and matching dependencies taken together Repairing algorithms with performance guarantee Consistent querying answering for conditional dependencies Wenfei Fan Dependencies Revisited for Improving Data Quality 66 / 70

Interactions with other lines of research Incomplete information: Representation systems for all repairs? Tuple completeness via extensions of TGDs: missing tuples in addition to missing values. S. Abiteboul, L. Segoufin, and V. Vianu. Representing and querying XML with incomplete information. TODS 31(1): 208-254, 2006. Wenfei Fan Dependencies Revisited for Improving Data Quality 67 / 70

Interactions with other lines of research Incomplete information: Representation systems for all repairs? Tuple completeness via extensions of TGDs: missing tuples in addition to missing values. S. Abiteboul, L. Segoufin, and V. Vianu. Representing and querying XML with incomplete information. TODS 31(1): 208-254, 2006. Date exchange and data integration: Adding context to schema matching: CINDs, extending TGDs with conditions (order(title, price; type = book ) book(title, price)) R. Fagin. Inverting schema mappings. In PODS, 2007. Coping with unreliable data sources M. Lenzerini. Data integration: A theoretical perspective. In PODS, 2002. Wenfei Fan Dependencies Revisited for Improving Data Quality 67 / 70

Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006. Dependencies for probabilistic data: soft keys? Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70

Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006. Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70

Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006. Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis XML data cleaning: XML constraints for data cleaning: more intriguing Repairing algorithms and consistent query answering S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and repairing inconsistent XML data. In WISE 2005. Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70

Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006. Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis XML data cleaning: XML constraints for data cleaning: more intriguing Repairing algorithms and consistent query answering S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and repairing inconsistent XML data. In WISE 2005. A rich source of questions Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70