Unique column combinations

Arvid Heise
Guest lecture in Data Profiling and Data Cleansing (Prof. Dr. Felix Naumann)

Agenda
- Introduction and problem statement
  - Unique column combinations
  - Exponential search space
  - Null values
- General pruning techniques
- Discovery algorithms
  - Apriori
  - HCA
  - DUCC
  - Gordian

Unique column combinations
- Relational model: dataset R with schema S
- Unique column combination $K \subseteq S$: $\forall r_i, r_j \in R,\ i \neq j: r_i[K] \neq r_j[K]$
- In the following, they are called uniques
- Examples: all primary keys, all unique constraints

    A  B  C
    a  1  x
    b  2  x
    c  2  y

- Uniques: {A, AB, AC, BC, ABC}
- Non-uniques: {B, C}
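The definition can be tried out directly on the example table. The following is a minimal sketch; the function name `is_unique` and the dictionary-based row representation are illustrative choices, not part of the lecture.

```python
from typing import Sequence


def is_unique(rows: Sequence[dict], columns: Sequence[str]) -> bool:
    """Return True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)  # projection r[K]
        if key in seen:
            return False  # two rows share the same projection
        seen.add(key)
    return True


# Example relation from the slide.
R = [
    {"A": "a", "B": 1, "C": "x"},
    {"A": "b", "B": 2, "C": "x"},
    {"A": "c", "B": 2, "C": "y"},
]

for combo in (["A"], ["B"], ["C"], ["B", "C"]):
    print("".join(combo), is_unique(R, combo))
```

A single pass with a hash set is enough here because a combination is non-unique as soon as one projection repeats.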

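Minimal uniques
- We are mostly interested in minimal uniques
- $K \subseteq S$ is a minimal unique if $\mathit{unique}(K)$ and $\forall K' \subseteq S: \mathit{unique}(K') \Rightarrow K' \not\subset K$
- Removal of any column leads to a non-unique combination
- For the previous example: minimal uniques {A, BC}; redundant: {AB, AC, ABC}
- Minimal uniques are candidates for primary keys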

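Maximal non-uniques
- Analogously, we can define maximal non-uniques
- $K \subseteq S$ is a maximal non-unique if $K$ is non-unique and $\forall K' \subseteq S: \mathit{non\text{-}unique}(K') \Rightarrow K \not\subset K'$
- Adding any column leads to a unique combination

    A  B  C
    a  1  x
    a  2  x
    a  2  y

- Maximal non-uniques: {AB, AC}; redundant: {A, B, C}
- May be a data quality problem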
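Both definitions can be checked by brute force on the two small example tables. This is only a sketch for tiny schemas (it enumerates the whole lattice); the names `classify`, `slide5` and `slide6` are made up for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """True if the projection of rows onto cols contains no duplicate."""
    projected = [tuple(r[c] for c in cols) for r in rows]
    return len(set(projected)) == len(projected)


def classify(rows, schema):
    """Return (minimal uniques, maximal non-uniques) by full enumeration."""
    uniques, non_uniques = [], []
    for k in range(1, len(schema) + 1):
        for combo in combinations(schema, k):
            target = uniques if is_unique(rows, combo) else non_uniques
            target.append(frozenset(combo))
    # Minimal unique: no proper subset is unique.
    minimal = [u for u in uniques if not any(v < u for v in uniques)]
    # Maximal non-unique: no proper superset is non-unique.
    maximal = [n for n in non_uniques if not any(n < m for m in non_uniques)]
    return minimal, maximal


slide5 = [dict(zip("ABC", t)) for t in [("a", 1, "x"), ("b", 2, "x"), ("c", 2, "y")]]
slide6 = [dict(zip("ABC", t)) for t in [("a", 1, "x"), ("a", 2, "x"), ("a", 2, "y")]]

print(classify(slide5, "ABC"))
print(classify(slide6, "ABC"))
```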

Applications
- Learning characteristics about a new data set
- Database management
  - Finding a primary key
  - Finding unique constraints
- Query optimization
  - Cardinality estimations for joins
- Finding duplicates / data quality issues
  - If expected unique column combinations are not unique
  - Or with approximate uniques

Exponential search space
Lattice of all column combinations over the columns A to E, by level:

    ABCDE
    ABCD ABCE ABDE ACDE BCDE
    ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
    AB AC AD AE BC BD BE CD CE DE
    A B C D E

Result of algorithm
[Figure: the lattice over A to E with each node marked as unique, non-unique, minimal unique, or maximal non-unique]

TPC-H lineitem
[Figure: lattices of unique and non-unique combinations for 8, 9, and 10 columns]

Size of the lattice
Number of nodes per level of the lattice over A to E:

    Size 5 (ABCDE):      $\binom{5}{5} = 1$
    Size 4 (ABCD, ...):  $\binom{5}{4} = 5$
    Size 3 (ABC, ...):   $\binom{5}{3} = 10$
    Size 2 (AB, ...):    $\binom{5}{2} = 10$
    Size 1 (A, ...):     $\binom{5}{1} = 5$

Computational feasibility
- For a lattice over n columns: $\binom{n}{k}$ combinations of size k
- All combinations: $2^n - 1$ (let's ignore the $-1$ for the remaining slides)
- Largest solution set: $\binom{n}{n/2}$ minimal uniques, all of size $n/2$
- Verifying minimality also requires checking all combinations of size $n/2 - 1$
- Adding a column doubles the search space

Brute forcing Uniprot
- Data set about proteins with 223 columns
- Combinations: ~$1.3 \cdot 10^{67}$
- Largest solution: ~$7.2 \cdot 10^{65}$
- There are roughly $10^{50}$ atoms on earth
- Assuming all uniques are of size 1 to 9: $\binom{223}{9} + \binom{223}{8} \approx 3.3 \cdot 10^{15}$ combinations to check
- 1 ms verification time results in about 100,000 years (100 ka) of processing time
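The orders of magnitude on this slide can be reproduced with integer arithmetic. The sketch below assumes that the "size 1 to 9" estimate counts the combinations of size 9 plus those of size 8 (needed for the minimality check), and that a year has 365.25 days; both are assumptions made for the calculation, not statements from the lecture.

```python
from math import comb

n = 223  # number of columns in the Uniprot data set

all_combinations = 2**n - 1          # every non-empty column combination
largest_solution = comb(n, n // 2)   # size of the middle level of the lattice
candidates = comb(n, 9) + comb(n, 8)  # assumed reading of the slide's estimate

seconds = candidates * 1e-3           # 1 ms per uniqueness check
years = seconds / (365.25 * 24 * 3600)

print(f"all combinations : {all_combinations:.2e}")
print(f"largest solution : {largest_solution:.2e}")
print(f"checks (size 8-9): {candidates:.2e}")
print(f"processing time  : {years:,.0f} years")
```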

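Null values
- Null values have a wide range of interpretations
  - Unknown (birth day)
  - Non-applicable (driver license number for kids)
  - Undefined (result of integration/outer join)
- What is the minimal unique for the following data set?

    A  B     C     D
    a  1     x     1
    b  2     y     2
    c  3     z     5
    d  3     NULL  5
    e  NULL  NULL  5

  (The empty cells were lost in extraction; the NULL positions shown here are inferred from the results stated on the following slides.)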

Handling null values #1
- Depends on the actual application
- To find primary keys
  - Remove all columns with null values
  - Result for the example data set: {A}

Handling null values #2
- Depends on the actual application
- To define unique constraints
  - SQL defines grouping for null: null != null
  - Result for the example data set: {A, C} (so CD is unique as well)
  - A column of nulls is unique!

Handling null values #3
- Depends on the actual application
- To define unique constraints
  - SQL defines distinctness for null: null = null
  - Result for the example data set: {A, BC}
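The three strategies can be compared side by side. The sketch below assumes the example table has nulls in column C of rows d and e and in column B of row e; the extracted slide text lost the empty cells, so this placement is a reconstruction that happens to reproduce the three stated results. The function names and the `nulls_equal` flag are illustrative.

```python
from itertools import combinations

NULL = None
SCHEMA = "ABCD"
# Assumed reconstruction of the example table (null positions inferred).
ROWS = [
    ("a", 1, "x", 1),
    ("b", 2, "y", 2),
    ("c", 3, "z", 5),
    ("d", 3, NULL, 5),
    ("e", NULL, NULL, 5),
]


def is_unique(rows, cols, nulls_equal):
    """Uniqueness check on column indices under one of two null semantics."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in cols)
        if not nulls_equal and any(v is NULL for v in key):
            continue  # null != null: such a key never equals another key
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, cols, nulls_equal):
    """Enumerate by size; skip supersets of uniques that were already found."""
    found = []
    for k in range(1, len(cols) + 1):
        for combo in combinations(cols, k):
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo, nulls_equal):
                found.append(combo)
    return ["".join(SCHEMA[i] for i in combo) for combo in found]


all_cols = list(range(len(SCHEMA)))
null_free = [i for i in all_cols if all(r[i] is not NULL for r in ROWS)]

print("#1 drop null columns:", minimal_uniques(ROWS, null_free, True))
print("#2 null != null     :", minimal_uniques(ROWS, all_cols, False))
print("#3 null = null      :", minimal_uniques(ROWS, all_cols, True))
```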

Pruning with uniques
- Pruning: inferring the type of a combination without actual verification
- If A is unique, all supersets of A must be unique

Pruning effect of a pair
[Figure: the lattice over A to E; a minimal unique pair and all of its supersets are marked as unique]

Pruning with uniques #2
- Pruning: inferring the type of a combination without actual verification
- If A is unique, all supersets of A must be unique
- Finding a unique column prunes half of the lattice
  - Remove the column from the initial data set and restart
- Finding a unique column pair removes a quarter of the lattice
  - In general, the lattice over the combination is removed
- The pruning power of a combination is reduced by prior findings
  - AB prunes a quarter
  - BC additionally prunes only one eighth
  - ABC and its supersets (one eighth) were already pruned by AB
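The fractions above can be verified by counting supersets in the five-column lattice. This is a small counting sketch; like the slides, it divides by $2^n$ and ignores the empty combination. The helper names are illustrative.

```python
from itertools import combinations


def lattice(columns):
    """All non-empty column combinations as frozensets."""
    return [frozenset(c)
            for k in range(1, len(columns) + 1)
            for c in combinations(columns, k)]


def supersets(node, nodes):
    """Every lattice node that contains `node` (including `node` itself)."""
    return {n for n in nodes if node <= n}


nodes = lattice("ABCDE")
total = 2 ** 5  # the slides ignore the -1 for the empty combination

pruned_by_a = supersets(frozenset("A"), nodes)
pruned_by_ab = supersets(frozenset("AB"), nodes)
pruned_by_bc = supersets(frozenset("BC"), nodes)
new_from_bc = pruned_by_bc - pruned_by_ab  # ABC and its supersets are already gone

print(len(pruned_by_a) / total)    # 0.5
print(len(pruned_by_ab) / total)   # 0.25
print(len(new_from_bc) / total)    # 0.125
```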

Pruning both ways
[Figure: the lattice over A to E; supersets of minimal uniques are marked as unique and subsets of maximal non-uniques are marked as non-unique]

Pruning on-the-fly
- Materialization of the lattice is infeasible
  - Only possible for few columns
  - Nodes cannot simply be removed when a unique is discovered
- Prune on-the-fly
  - Enumerate nodes as before
  - Skip a node that has been pruned
- Depending on the approach, that might be challenging
  - Might require an efficient index structure
  - Often: candidate generation

Discovery algorithms
Unique column combination discovery:
- Column-based: bottom-up, hybrid, top-down
- Row-based
Algorithms covered: Apriori, HCA, DUCC, Gordian

Column-based algorithms
- Traverse through the lattice
- Check for uniqueness; different approaches are possible
  - Use a database back end and a distinctness query: SELECT COUNT(DISTINCT A, B, C) FROM R, then compare with the number of rows
  - Position list indexes (explained later)
  - For now, the check is a black box
- Prune the lattice accordingly
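A database-backed check could look like the following sketch, which uses Python's built-in `sqlite3` module and an in-memory table. It sidesteps engine differences in multi-column `COUNT(DISTINCT ...)` by counting the rows of a `SELECT DISTINCT` subquery instead. The function name `is_unique_sql` is illustrative, and table and column names are interpolated into the SQL string, so they must come from a trusted source.

```python
import sqlite3


def is_unique_sql(conn, table, columns):
    """Compare the number of distinct projections with the row count."""
    cols = ", ".join(columns)
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return distinct == total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER, C TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [("a", 1, "x"), ("b", 2, "x"), ("c", 2, "y")],
)

print(is_unique_sql(conn, "R", ["A"]))
print(is_unique_sql(conn, "R", ["B"]))
print(is_unique_sql(conn, "R", ["B", "C"]))
```

Note that `SELECT DISTINCT` treats two nulls as equal, so this check corresponds to the null = null semantics.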

Apriori-based
- C. Giannella and C. M. Wyss. "Finding minimal keys in a relation instance." (1999).
- Actually does not use much of the apriori idea
- Basic idea: using the state of combinations of size k, we need to visit only unpruned combinations of size k+1
  - Start with columns
  - Check pairs of non-unique columns
  - Check triples of non-unique pairs
  - Terminate if no new combinations can be enumerated

Candidate generation
- Do not generate too many duplicate combinations
  - ABC, ABD, ACD, and BCD could all point to ABCD
- Apriori: prefix-based generation
  - Generate a combination of size n only from two combinations of size n-1 whose first n-2 columns match
  - Only ABC and ABD can generate ABCD
- Still redundant verifications
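The bottom-up traversal with prefix-based candidate generation can be sketched as follows. This is not the code of Giannella and Wyss; it is a minimal illustration in which columns are integer indices, and it additionally requires every (k-1)-subset of a candidate to be non-unique so that every reported unique is minimal. That extra subset test is an assumption added here for simplicity.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Black-box uniqueness check on column indices."""
    projected = [tuple(r[i] for i in cols) for r in rows]
    return len(set(projected)) == len(projected)


def apriori_minimal_uniques(rows, n_cols):
    minimal_uniques = []
    level = [(i,) for i in range(n_cols)]  # start with single columns
    while level:
        non_unique = []
        for combo in level:
            if is_unique(rows, combo):
                minimal_uniques.append(combo)
            else:
                non_unique.append(combo)
        non_unique.sort()
        known_non_unique = set(non_unique)
        level = []
        # Prefix-based generation: join two non-uniques that differ only
        # in their last column, e.g. ABC + ABD -> ABCD.
        for a, b in combinations(non_unique, 2):
            if a[:-1] != b[:-1]:
                continue
            candidate = a + (b[-1],)
            # Added assumption: keep the candidate only if all of its
            # subsets of size k-1 are non-unique, so it can still be minimal.
            subsets = combinations(candidate, len(candidate) - 1)
            if all(s in known_non_unique for s in subsets):
                level.append(candidate)
    return minimal_uniques


rows = [("a", 1, "x"), ("b", 2, "x"), ("c", 2, "y")]
print(apriori_minimal_uniques(rows, 3))  # column indices: 0 = A, 1 = B, 2 = C
```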

Apriori visualized
[Figure: the lattice over A to E, traversed level by level from the bottom; nodes are marked as unique, non-unique, minimal unique, or maximal non-unique]

Characteristics of Apriori
- Works well for small uniques
  - Bottom-up checks columns first
- Best case: all columns are unique, which takes n checks
- Worst case: no uniques at all (at least one duplicate row), which takes $2^n$ checks
- Apriori is exponential in n

Extensions
- Top-down
  - Start from the top and go down
  - Performs better if the solution set is high up
  - Candidate pruning becomes more tricky
- Hybrid
  - Combine bottom-up and top-down
  - Interleave checks
  - Works well if the solution set has many small and many large combinations
  - Worst case: solution set in the middle

Histogram-Count-based Apriori (HCA)
- Ziawasch Abedjan and Felix Naumann. "Advancing the discovery of unique column combinations." Proceedings of the International Conference on Information and Knowledge Management. 2011.
- Extension of bottom-up Apriori
- More sophisticated candidate generation
- Uses histograms for pruning
- Finds and uses functional dependencies on-the-fly

HCA candidate generation
- Maintains a sorted list of non-uniques
  - Avoids duplicate generation of combinations
- Prunes non-minimal uniques efficiently
  - Example: ABC is unique, ABD is non-unique
  - ABD would generate ABCD
  - HCA performs a quick minimality check with bitsets
- Hybrid approach
  - At least checks whether the remaining columns contain duplicates

Statistics
- Prunes column combinations that cannot be unique
  - A and B each contain the same value for 4/7 of the data
  - C contains the same value for 5/7 of the data
  - AC cannot be unique (4 + 5 - 7 = 2, so at least two rows share both frequent values); AB might be (not very likely)
- Especially viable if there are already indices

    A  B  C
    1  A  U
    1  A  U
    1  A  U
    1  A  U
    2  B  U
    3  C  V
    4  D  W
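One way to read the example is as a pigeonhole argument on the most frequent value of each column: if the two largest value groups must overlap in at least two rows, the pair cannot be unique. The sketch below implements exactly that reading; it is an interpretation of the slide, not necessarily the precise rule used in the HCA paper, and the function names are made up.

```python
from collections import Counter


def max_frequency(values):
    """Size of the largest group of equal values in a column."""
    return max(Counter(values).values())


def cannot_be_unique(col_x, col_y):
    """Pigeonhole test: the two most frequent values must co-occur in at
    least f_x + f_y - n rows; two or more such rows form a duplicate."""
    n = len(col_x)
    return max_frequency(col_x) + max_frequency(col_y) - n >= 2


A = [1, 1, 1, 1, 2, 3, 4]
B = list("AAAABCD")
C = list("UUUUUVW")

print("AC pruned:", cannot_be_unique(A, C))  # 4 + 5 - 7 = 2
print("AB pruned:", cannot_be_unique(A, B))  # 4 + 4 - 7 = 1
```

The test only needs per-column histograms, which is why it is cheap when indices already exist. A pair that survives it (such as AB here) still has to be verified.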

Functional dependencies
- Functional dependency: the value of one column determines the value of another
  - Birthday -> age
- Intuition:
  - If A->B and A non-unique, B must be non-unique
  - If A->B and B unique, A must also be unique
  - If A->B and AC non-unique, BC must be non-unique
  - If A->B and BC unique, AC must also be unique
- FD A->B can be found with the histograms of AB and A
  - If the FD holds, both histograms have the same distinctness counts
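A functional dependency A -> B holds exactly when A and AB have the same number of distinct values, which makes the check a comparison of two counts. The sketch below illustrates this with a made-up table; the column names and rows are hypothetical example data.

```python
def distinct_count(rows, cols):
    """Number of distinct projections of rows onto cols."""
    return len({tuple(r[c] for c in cols) for r in rows})


def holds_fd(rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs does not split any lhs group."""
    return distinct_count(rows, lhs) == distinct_count(rows, lhs + rhs)


# Hypothetical example data.
people = [
    {"name": "Ann", "birthday": "1990-05-01", "age": 24},
    {"name": "Bob", "birthday": "1990-05-01", "age": 24},
    {"name": "Cid", "birthday": "1985-03-12", "age": 29},
    {"name": "Dee", "birthday": "1984-11-30", "age": 29},
]

print(holds_fd(people, ["birthday"], ["age"]))  # birthday determines age
print(holds_fd(people, ["age"], ["birthday"]))  # age does not determine birthday

# Inference from the slide: birthday -> age and birthday non-unique
# implies that age is non-unique, without checking age itself.
birthday_unique = distinct_count(people, ["birthday"]) == len(people)
age_unique = distinct_count(people, ["age"]) == len(people)
print(birthday_unique, age_unique)
```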

Analysis of HCA
- Works well on data sets with small numbers of columns
  - Quickly converges for many small combinations
- Efficient pruning
  - Saves distinctness checks for many pairs
- For larger combinations, statistics become less important
  - At some point, it has to try all combinations
  - Suffers from the same general complexity as Apriori

DUCC
- Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Anja Jentzsch, and Felix Naumann. "Scalable Discovery of Unique Column Combinations." In preparation.
- Done during an internship at QCRI
- Basic idea: random walk through the lattice
  - Pick a random superset if the current combination is non-unique
  - Pick a random subset otherwise
  - Lazy pruning with previously visited nodes

[Figure: random walk through the lattice over A to E. Legend: minimum unique column combination candidate, minimum unique column combination, maximum non-unique column combination candidate, maximum non-unique column combination, pruned. Visited nodes: 10 out of 26]

Position List Index (PLI)
- Incorporates row-based pruning
- Intuition: the number of duplicates decreases when going up
  - Many unnecessary rows are checked again and again
- Keep track of duplicates with an inverted index

    row  A  B
    r1   a  1
    r2   a  2
    r3   a  1
    r4   b  3
    r5   b  2

  - A: a -> {r1, r2, r3}, b -> {r4, r5}
  - B: 1 -> {r1, r3}, 2 -> {r2, r5}
- We don't need the actual values
  - A: {{r1, r2, r3}, {r4, r5}}
  - B: {{r1, r3}, {r2, r5}}

PLI intersection
- Initial PLIs
  - A = {{r1, r2, r3}, {r4, r5}} = {A1, A2}
  - B = {{r1, r3}, {r2, r5}} = {B1, B2}
- Build(A): r1 -> A1, r2 -> A1, r3 -> A1, r4 -> A2, r5 -> A2
- Probe(B)
  - (A1, B1) -> {r1, r3}
  - (A1, B2) -> {r2}
  - (A2, B2) -> {r5}
- Consolidate(AB): AB = {{r1, r3}}
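The build, probe, and consolidate steps translate into a few lines of code. The following is a sketch with illustrative function names (`build_pli`, `intersect`); rows are numbered from 1 to match r1 to r5 on the slide.

```python
from collections import defaultdict


def build_pli(column):
    """Group row ids by value and keep only clusters with duplicates."""
    clusters = defaultdict(list)
    for row_id, value in enumerate(column, start=1):
        clusters[value].append(row_id)
    return [c for c in clusters.values() if len(c) > 1]


def intersect(pli_a, pli_b):
    """PLI of the combined columns from the PLIs of the two inputs."""
    # Build: map every row in a cluster of A to the index of that cluster.
    cluster_of = {row: i for i, cluster in enumerate(pli_a) for row in cluster}
    # Probe: group rows of B by the pair (cluster in A, cluster in B).
    groups = defaultdict(list)
    for j, cluster in enumerate(pli_b):
        for row in cluster:
            if row in cluster_of:
                groups[(cluster_of[row], j)].append(row)
    # Consolidate: singleton groups are no longer duplicates.
    return [g for g in groups.values() if len(g) > 1]


A = ["a", "a", "a", "b", "b"]
B = [1, 2, 1, 3, 2]

pli_a = build_pli(A)
pli_b = build_pli(B)
pli_ab = intersect(pli_a, pli_b)

print(pli_a)   # [[1, 2, 3], [4, 5]]
print(pli_b)   # [[1, 3], [2, 5]]
print(pli_ab)  # [[1, 3]]
```

A column combination is unique exactly when its PLI is empty, so AB in this example is non-unique because rows r1 and r3 agree on both columns.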

Analysis of PLI
- Space complexity: $n \cdot \mathrm{sizeof(long)} + \frac{n}{2} \cdot \mathrm{sizeof(array)}$
- Intersection time complexity: O(n + n)
  - Hash the bigger PLI and probe the smaller PLI
- If there is enough main memory
  - Keep the PLIs of the columns in main memory
  - Going up in the lattice requires only probing the current PLI
  - Becomes increasingly fast when going up: <1 ms for most combinations
- Going down
  - Unfortunately, the PLI does not help
  - Start from scratch

Experiments
[Figure: experimental results on Uniprot, 100k rows (DUCC with null = null)]

Analysis of DUCC
- Runtime mainly depends on the size of the solution set
- Worst case: solution set in the middle
- Aggressive pruning may lead to loss of minimal uniques!
  - Gordian's final step can be used to plug these holes

Scaling up and out
- Scalability is a major design goal of DUCC
- Random walk is well suited for parallelization
  - Little coordination overhead
- Threads/workers share findings through an event bus
  - Uniques/non-uniques
  - Holes in the graph
- Lock-free to avoid bottlenecks
  - Only a memory barrier in the local event bus

Gordian
- Yannis Sismanis et al. "GORDIAN: efficient and scalable discovery of composite keys." Proceedings of the International Conference on Very Large Data Bases. 2006.
- Row-based algorithm
- Builds a prefix tree while reading the data
- Determines maximal non-uniques
- Computes minimal uniques from maximal non-uniques

Prefix tree
[Figure: prefix tree built from the rows of the data set]

Calculating minimal uniques
[Figure: the lattice over A to E; minimal uniques are derived from the maximal non-uniques]

Analysis of Gordian
- According to the paper, polynomial in the number of tuples for data with a Zipfian distribution of values
  - Can abort the scan as soon as a duplicate has been found
- Worst case
  - Exponential in the number of columns
  - All data needs to be stored in memory
- Computing minimal uniques from maximal non-uniques
  - O( uniques 2 columns )
  - Can be sped up with a presorted list
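The last step rests on a simple observation: a combination is unique exactly when it is not contained in any maximal non-unique. The sketch below uses that observation in a brute-force enumeration by size. It is exponential in the number of columns and is not Gordian's actual procedure; the function name is made up for illustration.

```python
from itertools import combinations


def minimal_uniques_from_max_non_uniques(schema, max_non_uniques):
    """Derive minimal uniques from the set of maximal non-uniques."""
    max_non_uniques = [frozenset(m) for m in max_non_uniques]
    minimal = []
    for k in range(1, len(schema) + 1):
        for combo in combinations(schema, k):
            candidate = frozenset(combo)
            if any(u <= candidate for u in minimal):
                continue  # superset of a known unique, hence not minimal
            # Unique iff it escapes every maximal non-unique.
            if all(not candidate <= m for m in max_non_uniques):
                minimal.append(candidate)
    return minimal


# Maximal non-uniques of the two three-column example tables.
print(minimal_uniques_from_max_non_uniques("ABC", ["AB", "AC"]))
print(minimal_uniques_from_max_non_uniques("ABC", ["B", "C"]))
```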

Outlook
- Finding primary keys
  - Uniqueness is a necessary criterion
  - No null values
  - Include other features: name includes "id", number of columns
- Approximate uniques
  - 99.9% of the data unique
  - Useful to detect data errors
  - Gordian, HCA, and DUCC can be easily modified
  - Heuristics with sampling
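An approximate unique can be sketched as a combination whose share of distinct projections reaches a threshold. Defining "99.9% of the data unique" as the ratio of distinct values to rows is an assumption made here; other definitions (for example, the share of rows that have no duplicate) are equally possible. The data below is synthetic.

```python
def uniqueness_ratio(rows, cols):
    """Fraction of distinct projections among all rows (assumed definition)."""
    projected = [tuple(r[c] for c in cols) for r in rows]
    return len(set(projected)) / len(projected)


def is_approximately_unique(rows, cols, threshold=0.999):
    return uniqueness_ratio(rows, cols) >= threshold


# Synthetic data: 2000 rows, exactly one of which repeats an existing id.
clean = [{"id": i} for i in range(1999)] + [{"id": 0}]
# Synthetic data: 1000 rows, five of which repeat an existing id.
dirty = [{"id": i} for i in range(995)] + [{"id": 0}] * 5

print(uniqueness_ratio(clean, ["id"]), is_approximately_unique(clean, ["id"]))
print(uniqueness_ratio(dirty, ["id"]), is_approximately_unique(dirty, ["id"]))
```

Combinations that pass the threshold but are not strictly unique point to the rows worth inspecting for data errors.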