Dependencies Revisited for Improving Data Quality
|
|
|
- Shon Barber
- 10 years ago
- Views:
Transcription
1 Dependencies Revisited for Improving Data Quality Wenfei Fan University of Edinburgh & Bell Laboratories Wenfei Fan Dependencies Revisited for Improving Data Quality 1 / 70
2 Real-world data is often dirty Ms. Stone, according to our database records you are supposed to be dead Wenfei Fan Dependencies Revisited for Improving Data Quality 2 / 70
3 Real-world data is often dirty Ms. Stone, according to our database records you are supposed to be dead US: Pentagon asked 275 dead/wounded officers to re-enlist UK: there are 81 million national insurance numbers but only 60 million people eligible Australia: 500,000 dead people retain active medicare cards In a database of 500,000 customers, 120,000 records become invalid within a year death, divorce, marriage, move typical data error rate in industry: 1% 5%, up to 30%... Errors, conflicts and inconsistencies Wenfei Fan Dependencies Revisited for Improving Data Quality 2 / 70
4 Dirty data is costly Poor data costs US companies $600 billion annually; Erroneously priced data in retail databases costs US customers $2.5 billion each year; World-wide losses from payment card fraud reached $4.84 billion in 2006; 30% 80% of the development time for data cleaning in a data integration project; and don t forget dirty data about WMD in Iraq The market for data quality tools is growing at 17% annually 7% average of IT segments Wenfei Fan Dependencies Revisited for Improving Data Quality 3 / 70
5 Research activities Statistics, management, and computer science Error correction (data imputation): to localize tuples that violate a given set of semantic rules, and fix erroneous values in the tuples that are identified as violations of the rules. Object identification: to identify tuples from one or more relations that refer to the same real-world object. Profiling: to infer and discover meta-data (constraints or semantic rules) from sample data. Data integration: to resolve conflicts in the sources via object identification; quality-driven query processing by explicitly taking into account the quality of data from various sources,... Approaches: probabilistic, empirical, rule-based, and logic-based,... Wenfei Fan Dependencies Revisited for Improving Data Quality 4 / 70
6 A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70
7 A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Dependencies considered for data cleaning: often traditional functional dependencies, inclusion dependencies... designed for improving the quality of schema Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70
8 A principled approach based on data dependencies A promising approach, logic-based Capturing a fundamental part of the semantics of data: inconsistencies emerge as violations of dependencies Reasoning about the semantics of the data: inference systems, analysis algorithms,... Semantic profiling: discovery of dependencies for error correction and object identification... Dependencies considered for data cleaning: often traditional functional dependencies, inclusion dependencies... designed for improving the quality of schema Revising traditional dependencies, for improving data quality Wenfei Fan Dependencies Revisited for Improving Data Quality 5 / 70
9 Dependencies revisited To capture data inconsistencies: adding conditions To identify objects: incorporating similarity Revision of static analyses: satisfiability, implication, axiomatizability, dependency propagation To find practical use of dependencies in data quality tools Develop appropriate constraint languages for improving data quality: revising equality-generating dependencies (EGDs) and tuple-generating dependencies (TGDs) Integrate data repairing and object identification: reasoning about conditional dependencies and matching dependencies taken together Repairing algorithms with performance guarantee Consistent querying answering for conditional dependencies Wenfei Fan Dependencies Revisited for Improving Data Quality 66 / 70
10 Interactions with other lines of research Incomplete information: Representation systems for all repairs? Tuple completeness via extensions of TGDs: missing tuples in addition to missing values. S. Abiteboul, L. Segoufin, and V. Vianu. Representing and querying XML with incomplete information. TODS 31(1): , Wenfei Fan Dependencies Revisited for Improving Data Quality 67 / 70
11 Interactions with other lines of research Incomplete information: Representation systems for all repairs? Tuple completeness via extensions of TGDs: missing tuples in addition to missing values. S. Abiteboul, L. Segoufin, and V. Vianu. Representing and querying XML with incomplete information. TODS 31(1): , Date exchange and data integration: Adding context to schema matching: CINDs, extending TGDs with conditions (order(title, price; type = book ) book(title, price)) R. Fagin. Inverting schema mappings. In PODS, Coping with unreliable data sources M. Lenzerini. Data integration: A theoretical perspective. In PODS, Wenfei Fan Dependencies Revisited for Improving Data Quality 67 / 70
12 Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, Dependencies for probabilistic data: soft keys? Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70
13 Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70
14 Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis XML data cleaning: XML constraints for data cleaning: more intriguing Repairing algorithms and consistent query answering S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and repairing inconsistent XML data. In WISE Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70
15 Interactions with other lines of research Probabilistic data management: Consistent query answering P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, Dependencies for probabilistic data: soft keys? Provenance. P. Buneman: Curated Dabatases Provenance via dependency propagation analysis XML data cleaning: XML constraints for data cleaning: more intriguing Repairing algorithms and consistent query answering S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and repairing inconsistent XML data. In WISE A rich source of questions Wenfei Fan Dependencies Revisited for Improving Data Quality 68 / 70
M C. Foundations of Data Quality Management. Wenfei Fan University of Edinburgh. Floris Geerts University of Antwerp
Foundations of Data Quality Management Wenfei Fan University of Edinburgh Floris Geerts University of Antwerp SYNTHESIS LECTURES ON DATA MANAGEMENT #29 & M C Morgan & claypool publishers ABSTRACT Data
Repair Checking in Inconsistent Databases: Algorithms and Complexity
Repair Checking in Inconsistent Databases: Algorithms and Complexity Foto Afrati 1 Phokion G. Kolaitis 2 1 National Technical University of Athens 2 UC Santa Cruz and IBM Almaden Research Center Oxford,
A Fast and Efficient Method to Find the Conditional Functional Dependencies in Databases
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 3, Issue 5 (August 2012), PP. 56-61 A Fast and Efficient Method to Find the Conditional
Logical and categorical methods in data transformation (TransLoCaTe)
Logical and categorical methods in data transformation (TransLoCaTe) 1 Introduction to the abbreviated project description This is an abbreviated project description of the TransLoCaTe project, with an
Big Data Cleaning. Nan Tang. Qatar Computing Research Institute, Doha, Qatar [email protected]
Big Data Cleaning Nan Tang Qatar Computing Research Institute, Doha, Qatar [email protected] Abstract. Data cleaning is, in fact, a lively subject that has played an important part in the history of data
XML Data Integration
XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42 Introduction Data Integration and Query Answering A data integration
Improving Data Excellence Reliability and Precision of Functional Dependency
Improving Data Excellence Reliability and Precision of Functional Dependency Tiwari Pushpa 1, K. Dhanasree 2, Dr.R.V.Krishnaiah 3 1 M.Tech Student, 2 Associate Professor, 3 PG Coordinator 1 Dept of CSE,
Data Quality: From Theory to Practice
Data Quality: From Theory to Practice Wenfei Fan School of Informatics, University of Edinburgh, and RCBD, Beihang University [email protected] ABSTRACT Data quantity and data quality, like two sides
Data Integration and ETL Process
Data Integration and ETL Process Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, second
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
Edinburgh Research Explorer
Edinburgh Research Explorer CerFix: A System for Cleaning Data with Certain Fixes Citation for published version: Fan, W, Li, J, Ma, S, Tang, N & Yu, W 2011, 'CerFix: A System for Cleaning Data with Certain
Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange
Data Integration and Exchange L. Libkin 1 Data Integration and Exchange Traditional approach to databases A single large repository of data. Database administrator in charge of access to data. Users interact
INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS
INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS Tadeusz Pankowski 1,2 1 Institute of Control and Information Engineering Poznan University of Technology Pl. M.S.-Curie 5, 60-965 Poznan
Peer Data Exchange. ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY, Pages 1 0??.
Peer Data Exchange Ariel Fuxman 1 University of Toronto Phokion G. Kolaitis 2 IBM Almaden Research Center Renée J. Miller 1 University of Toronto and Wang-Chiew Tan 3 University of California, Santa Cruz
A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment
DOI: 10.15415/jotitt.2014.22021 A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment Rupali Gill 1, Jaiteg Singh 2 1 Assistant Professor, School of Computer Sciences, 2 Associate
Data Integration and ETL Process
Data Integration and ETL Process Krzysztof Dembczyński Institute of Computing Science Laboratory of Intelligent Decision Support Systems Politechnika Poznańska (Poznań University of Technology) Software
Data Quality in Information Integration and Business Intelligence
Data Quality in Information Integration and Business Intelligence Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies
Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms
Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms Irina Astrova 1, Bela Stantic 2 1 Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn,
Clean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento [email protected] Ariel Fuxman University of Toronto [email protected] Renée J. Miller University
Consistent Query Answering in Databases Under Cardinality-Based Repair Semantics
Consistent Query Answering in Databases Under Cardinality-Based Repair Semantics Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada Joint work with: Andrei Lopatenko (Google,
CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University
CS 377 Database Systems Database Design Theory and Normalization Li Xiong Department of Mathematics and Computer Science Emory University 1 Relational database design So far Conceptual database design
Schema mapping and query reformulation in peer-to-peer XML data integration system
Control and Cybernetics vol. 38 (2009) No. 1 Schema mapping and query reformulation in peer-to-peer XML data integration system by Tadeusz Pankowski Institute of Control and Information Engineering Poznań
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques Sean Thorpe 1, Indrajit Ray 2, and Tyrone Grandison 3 1 Faculty of Engineering and Computing,
Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza
Data Integration Maurizio Lenzerini Universitá di Roma La Sapienza DASI 06: Phd School on Data and Service Integration Bertinoro, December 11 15, 2006 M. Lenzerini Data Integration DASI 06 1 / 213 Structure
CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model
CS2Bh: Current Technologies Introduction to XML and Relational Databases Spring 2005 The Relational Model CS2 Spring 2005 (LN6) 1 The relational model Proposed by Codd in 1970. It is the dominant data
Data Integration and Data Provenance. Profa. Dra. Cristina Dutra de Aguiar Ciferri [email protected]
Data Integration and Data Provenance Profa. Dra. Cristina Dutra de Aguiar Ciferri [email protected] Outline n Data Integration q Schema Integration q Instance Integration q Our Work n Data Provenance q
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
Data Wrangling: The Elephant in the Room of Big Data. Norman Paton University of Manchester
Data Wrangling: The Elephant in the Room of Big Data Norman Paton University of Manchester Data Wrangling Definitions: a process of iterative data exploration and transformation that enables analysis [1].
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 [email protected]
A Framework and Architecture for Quality Assessment in Data Integration
A Framework and Architecture for Quality Assessment in Data Integration Jianing Wang March 2012 A Dissertation Submitted to Birkbeck College, University of London in Partial Fulfillment of the Requirements
Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)
Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM) Extended Abstract Ioanna Koffina 1, Giorgos Serfiotis 1, Vassilis Christophides 1, Val Tannen
A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT
A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT Mohamed M. Hafez 1, Ali H. El-Bastawissy 1 and Osman M. Hegazy 1 1 Information Systems Dept., Faculty of Computers and Information,
PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.
PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software
Comprehensive Data Quality with Oracle Data Integrator. An Oracle White Paper Updated December 2007
Comprehensive Data Quality with Oracle Data Integrator An Oracle White Paper Updated December 2007 Comprehensive Data Quality with Oracle Data Integrator Oracle Data Integrator ensures that bad data is
A Tutorial on Data Integration
A Tutorial on Data Integration Maurizio Lenzerini Dipartimento di Informatica e Sistemistica Antonio Ruberti, Sapienza Università di Roma DEIS 10 - Data Exchange, Integration, and Streaming November 7-12,
Data Integration: A Theoretical Perspective
Data Integration: A Theoretical Perspective Maurizio Lenzerini Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza Via Salaria 113, I 00198 Roma, Italy [email protected] ABSTRACT
Security Issues for the Semantic Web
Security Issues for the Semantic Web Dr. Bhavani Thuraisingham Program Director Data and Applications Security The National Science Foundation Arlington, VA On leave from The MITRE Corporation Bedford,
Normalization Theory for XML
Normalization Theory for XML Marcelo Arenas Department of Computer Science Pontificia Universidad Católica de Chile [email protected] 1 Introduction Since the beginnings of the relational model, it was
Data Quality Problems beyond Consistency and Deduplication
Data Quality Problems beyond Consistency and Deduplication Wenfei Fan 1, Floris Geerts 2, Shuai Ma 3, Nan Tang 4, and Wenyuan Yu 1 1 University of Edinburgh, UK, [email protected],[email protected]
BUSINESS RULES CONCEPTS... 2 BUSINESS RULE ENGINE ARCHITECTURE... 4. By using the RETE Algorithm... 5. Benefits of RETE Algorithm...
1 Table of Contents BUSINESS RULES CONCEPTS... 2 BUSINESS RULES... 2 RULE INFERENCE CONCEPT... 2 BASIC BUSINESS RULES CONCEPT... 3 BUSINESS RULE ENGINE ARCHITECTURE... 4 BUSINESS RULE ENGINE ARCHITECTURE...
Query Processing in Data Integration Systems
Query Processing in Data Integration Systems Diego Calvanese Free University of Bozen-Bolzano BIT PhD Summer School Bressanone July 3 7, 2006 D. Calvanese Data Integration BIT PhD Summer School 1 / 152
CHAPTER-6 DATA WAREHOUSE
CHAPTER-6 DATA WAREHOUSE 1 CHAPTER-6 DATA WAREHOUSE 6.1 INTRODUCTION Data warehousing is gaining in popularity as organizations realize the benefits of being able to perform sophisticated analyses of their
15.00 15.30 30 XML enabled databases. Non relational databases. Guido Rotondi
Programme of the ESTP training course on BIG DATA EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS Rome, 5 9 May 2014 Istat Piazza Indipendenza 4, Room Vanoni
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
Security and privacy for multimedia database management systems
Multimed Tools Appl (2007) 33:13 29 DOI 10.1007/s11042-006-0096-1 Security and privacy for multimedia database management systems Bhavani Thuraisingham Published online: 1 March 2007 # Springer Science
AN EFFICIENT EXTENSION OF CONDITIONAL FUNCTIONAL DEPENDENCIES IN DISTRIBUTED DATABASES
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 3, Mar 2014, 17-24 Impact Journals AN EFFICIENT EXTENSION OF CONDITIONAL
Data Validation with OWL Integrity Constraints
Data Validation with OWL Integrity Constraints (Extended Abstract) Evren Sirin Clark & Parsia, LLC, Washington, DC, USA [email protected] Abstract. Data validation is an important part of data integration
MULTI AGENT-BASED DISTRIBUTED DATA MINING
MULTI AGENT-BASED DISTRIBUTED DATA MINING REECHA B. PRAJAPATI 1, SUMITRA MENARIA 2 Department of Computer Science and Engineering, Parul Institute of Technology, Gujarat Technology University Abstract:
Integrating Pattern Mining in Relational Databases
Integrating Pattern Mining in Relational Databases Toon Calders, Bart Goethals, and Adriana Prado University of Antwerp, Belgium {toon.calders, bart.goethals, adriana.prado}@ua.ac.be Abstract. Almost a
Query Answering in Peer-to-Peer Data Exchange Systems
Query Answering in Peer-to-Peer Data Exchange Systems Leopoldo Bertossi and Loreto Bravo Carleton University, School of Computer Science, Ottawa, Canada {bertossi,lbravo}@scs.carleton.ca Abstract. The
Unsupervised Bayesian Data Cleaning Techniques for Structured Data. Sushovan De
Unsupervised Bayesian Data Cleaning Techniques for Structured Data by Sushovan De A Dissertation Presented in Partial Fulfillment of the Requirement for the Degree Doctor of Philosophy Approved May 2014
Big Data Management Assessed Coursework Two Big Data vs Semantic Web F21BD
Big Data Management Assessed Coursework Two Big Data vs Semantic Web F21BD Boris Mocialov (H00180016) MSc Software Engineering Heriot-Watt University, Edinburgh April 5, 2015 1 1 Introduction The purpose
Neighborhood Data and Database Security
Neighborhood Data and Database Security Kioumars Yazdanian, FrkdCric Cuppens e-mail: yaz@ tls-cs.cert.fr - cuppens@ tls-cs.cert.fr CERT / ONERA, Dept. of Computer Science 2 avenue E. Belin, B.P. 4025,31055
Web-Based Genomic Information Integration with Gene Ontology
Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, [email protected] Abstract. Despite the dramatic growth of online genomic
Efficiently Identifying Inclusion Dependencies in RDBMS
Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany [email protected]
Performance Analysis, Data Sharing, Tools Integration: New Approach based on Ontology
Performance Analysis, Data Sharing, Tools Integration: New Approach based on Ontology Hong-Linh Truong Institute for Software Science, University of Vienna, Austria [email protected] Thomas Fahringer
Report on the Dagstuhl Seminar Data Quality on the Web
Report on the Dagstuhl Seminar Data Quality on the Web Michael Gertz M. Tamer Özsu Gunter Saake Kai-Uwe Sattler U of California at Davis, U.S.A. U of Waterloo, Canada U of Magdeburg, Germany TU Ilmenau,
ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001
ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel
Business Intelligence: Recent Experiences in Canada
Business Intelligence: Recent Experiences in Canada Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies 2 Business Intelligence
Chapter 1: Introduction. Database Management System (DBMS) University Database Example
This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information
Data Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
