On-line Data De-duplication. Ιωάννης Κρομμύδας

1 On-line Data De-duplication Ιωάννης Κρομμύδας

2 Roadmap
- Data Cleaning Importance
- Data Cleaning Tasks
- Challenges for Fuzzy Matching
- Baseline Method (fuzzy match data cleaning)
- Improvements: online data cleaning using qgram tries

4 Data Cleaning Importance
Data cleaning is critical in many industries and across a wide variety of applications:
- marketing communications
- customer matching
- merging information systems
- medical records

5 Data Cleaning Importance
The efficiency of every information processing infrastructure is greatly affected by the quality of the data residing in its databases.
Poor data quality has a variety of causes:
- data entry errors (e.g., typing mistakes)
- multiple conventions for recording database fields (e.g., company names, addresses)

6 Data Cleaning Importance
Poor data quality has a significant impact on a variety of business issues:
- customer relationship management
- inability to retrieve a customer record during a service call
- billing errors
- distribution delays

8 Data Cleaning Tasks
One of the most important tasks in data cleaning is de-duplicating records:
- detection of multiple representations of a single entity
The problem is straightforward for numerical values, but very hard for string values and for combinations of them in one attribute:
- names (first, middle, last name), addresses, etc.

9 Data Cleaning Tasks
- For company names, it is common to see Microsoft, Micorsoft, Microsoft Inc. and Microsoft Corporation used in different records to represent the same entity.
- A simple equality or even substring comparison on names or addresses will not properly identify them as the same entity, leading to a variety of potential business problems.
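A minimal sketch of why equality and substring tests fall short on the four spellings above. The character-level edit distance used here is only one possible string distance, chosen for illustration; the function and variable names are not from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


names = ["Microsoft", "Micorsoft", "Microsoft Inc.", "Microsoft Corporation"]
canonical = "Microsoft"

exact = [n == canonical for n in names]        # only the first spelling matches
substring = [canonical in n for n in names]    # the typo "Micorsoft" is missed
distances = [edit_distance(canonical, n) for n in names]

for row in zip(names, exact, substring, distances):
    print(row)
```

Equality finds one of the four records and the substring test still misses the typo, while a distance-based comparison gives every spelling a graded score that a threshold can act on.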

10 Data Cleaning Tasks
Two possible modes of de-duplication:
- Detection of exact duplicates, which requires a typical join operation
- Fuzzy matching, which entails the detection of inexact duplicates and presents a trade-off between accuracy, efficiency and storage overhead

12 Challenges for Fuzzy Matching
Assume a clean reference relation R and a stream S of possibly dirty tuples that we check against R for fuzzy duplicates.
Task: first try an exact match, else try a fuzzy match
Issues:
- Accuracy of the identification
- An appropriate similarity function
- Avoiding a comparison of every stream record with every tuple in R

13 Challenges for Fuzzy Matching
Fig. 1. Template for using Fuzzy Match [CGGM03]

14 Challenges for Fuzzy Matching
Given the similarity function and an input tuple, the result of a fuzzy match operation could be one of the following:
- the reference tuple closest to the input tuple
- the closest K reference tuples, enabling users to choose one among them if necessary
- K or fewer tuples whose similarity to the input tuple exceeds a user-specified minimum similarity threshold
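The three result modes can be sketched with a single function that takes K and an optional threshold. This is an illustrative sketch only: `fuzzy_match`, `jaccard` and the sample data are hypothetical, and the toy Jaccard similarity stands in for whatever similarity function is actually used.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lower-cased word sets (an assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def fuzzy_match(u, reference, sim, k=1, min_sim=None):
    """Return up to k (score, tuple) pairs, best first.

    k=1 gives the closest tuple, k>1 the closest K tuples, and a
    min_sim value keeps only tuples whose similarity exceeds it.
    """
    scored = sorted(((sim(u, v), v) for v in reference),
                    key=lambda pair: pair[0], reverse=True)
    if min_sim is not None:
        scored = [pair for pair in scored if pair[0] > min_sim]
    return scored[:k]


reference = ["boeing company", "bon corporation", "companions"]
u = "beoing company"

closest = fuzzy_match(u, reference, jaccard)                        # mode 1
top_two = fuzzy_match(u, reference, jaccard, k=2)                   # mode 2
above = fuzzy_match(u, reference, jaccard, k=3, min_sim=0.2)        # mode 3
print(closest, top_two, above)
```

This version scans every reference tuple, which is exactly the cost the indexed methods on the later slides try to avoid.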

16 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003:
- adopt a probabilistic approach in order to return the closest K reference tuples with high probability
- propose a fuzzy match similarity function (fms) that explicitly considers IDF token weights and input errors while comparing tuples

17 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003:
- preprocess the reference relation to build an index relation, called the error tolerant index (ETI) relation, for retrieving a small set of candidate reference tuples at run time
- retrieve with high probability a superset of the K reference tuples closest to the input tuple

18 Baseline Method (Fuzzy Match Data Cleaning)
- The similarity between an input tuple and a reference tuple can be described as the cost of transforming the former into the latter
  - a low transformation cost denotes high similarity
- Transformation operations are applied to the set of tokens contained in the attributes of a tuple
- The set of tokens contained in attribute i of tuple v is denoted by tok(v[i])
  - if v[i] = Boeing Company, then tok(v[i]) = {Boeing, Company}

19 Baseline Method (Fuzzy Match Data Cleaning)
Each transformation operation is associated with a cost that depends on the weight of the transformed token:
$$w(t, i) = \mathrm{IDF}(t, i) = \log \frac{|R|}{\mathrm{freq}(t, i)}$$
where freq(t, i) denotes the frequency of token t in column i, i.e., the number of tuples v in R such that tok(v[i]) contains t.
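A small sketch of the IDF token weights, assuming whitespace tokenization and lower-casing (neither is specified on the slide). The reference relation and the names `tok` and `idf_weights` are made up for illustration.

```python
import math
from collections import Counter


def tok(value: str) -> set:
    """Token set of an attribute value (whitespace split, lower-cased)."""
    return set(value.lower().split())


def idf_weights(reference, column):
    """w(t, i) = log(|R| / freq(t, i)) for every token t seen in column i."""
    freq = Counter()
    for v in reference:
        freq.update(tok(v[column]))  # each tuple counts a token at most once
    size = len(reference)
    return {t: math.log(size / f) for t, f in freq.items()}


# Hypothetical reference relation with columns (name, city).
R = [
    ("Boeing Company", "Seattle"),
    ("Bon Company", "Seattle"),
    ("Grandview Hotel", "Tacoma"),
    ("Grandview Inn", "Tacoma"),
]

weights = idf_weights(R, column=0)
print(weights["boeing"], weights["company"])
```

Rare tokens such as boeing receive a higher weight than frequent ones such as company, so an error in a rare token costs more.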

20 Baseline Method (Fuzzy Match Data Cleaning)
Let u be an input tuple and v a reference tuple. The costs of the operations that transform u into v are defined in the following table:

operation         | description                                   | cost
token replacement | replaces t_1 in tok(u[i]) by t_2 in tok(v[i]) | ed(t_1, t_2) · w(t_1, i)
token insertion   | inserts a token t into u[i]                   | c_ins · w(t, i), with c_ins between 0 and 1
token deletion    | deletes a token t from u[i]                   | w(t, i)

21 Baseline Method (Fuzzy Match Data Cleaning)
- The transformation cost tc(u[i], v[i]) is the cost of the minimum-cost sequence of operations that transforms u[i] into v[i].
- The cost tc(u, v) of transforming u into v is the sum over all columns i of the costs tc(u[i], v[i]):
$$tc(u, v) = \sum_i tc(u[i], v[i])$$

22 Baseline Method (Fuzzy Match Data Cleaning)
The fuzzy match similarity function fms(u, v) between an input tuple u and a reference tuple v is defined in terms of the transformation cost tc(u, v) as:
$$fms(u, v) = 1 - \min\left(\frac{tc(u, v)}{w(u)},\ 1.0\right)$$
- w(u) is the sum of the weights of all tokens in the token set tok(u)
- tok(u) denotes the multiset union of the sets tok(a_1), ..., tok(a_n) of tokens from the tuple u[a_1, ..., a_n]
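One way to compute tc and fms is a dynamic program over the token sequences of each column, in the style of edit distance. The sketch below rests on several assumptions that the slides do not state: ed is the character edit distance normalized by the longer token, c_ins = 0.5, tokens are taken in order, and the weights for boeing and wa are invented (the other weights come from the deck's worked example).

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ed(t1, t2):
    """Edit distance normalized to [0, 1] (normalization is an assumption)."""
    longest = max(len(t1), len(t2))
    return edit_distance(t1, t2) / longest if longest else 0.0


def tc_column(u_tokens, v_tokens, w, c_ins):
    """Minimum cost of turning u_tokens into v_tokens."""
    m, n = len(u_tokens), len(v_tokens)
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                      # delete leading input tokens
        cost[i][0] = cost[i - 1][0] + w(u_tokens[i - 1])
    for j in range(1, n + 1):                      # insert leading reference tokens
        cost[0][j] = cost[0][j - 1] + c_ins * w(v_tokens[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t1, t2 = u_tokens[i - 1], v_tokens[j - 1]
            cost[i][j] = min(cost[i - 1][j] + w(t1),                    # deletion
                             cost[i][j - 1] + c_ins * w(t2),            # insertion
                             cost[i - 1][j - 1] + ed(t1, t2) * w(t1))   # replacement
    return cost[m][n]


def fms(u, v, w, c_ins=0.5):
    cols = [((a or "").lower().split(), (b or "").lower().split())
            for a, b in zip(u, v)]
    tc = sum(tc_column(a, b, w, c_ins) for a, b in cols)
    w_u = sum(w(t) for a, _ in cols for t in a)
    return 1.0 - min(tc / w_u, 1.0) if w_u else 0.0


# Weights for "boeing" and "wa" are hypothetical.
WEIGHTS = {"beoing": 0.5, "boeing": 0.5, "company": 0.25,
           "seattle": 1.0, "wa": 0.75, "98004": 2.0}


def w(token):
    return WEIGHTS.get(token, 1.0)


u = ("Beoing Company", "Seattle", None, "98004")
v = ("Boeing Company", "Seattle", "WA", "98004")
print(fms(u, v, w))
```

Under these assumptions the only costs are the beoing to boeing replacement and the insertion of the missing state token, so the similarity stays close to 1.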

23 Baseline Method (Fuzzy Match Data Cleaning)
The K-fuzzy Match Problem:
- Given a reference relation R, a minimum similarity threshold c (0 < c < 1) and an input tuple u, find the set FM(u) of fuzzy matches of at most K tuples from R
Naïve algorithm:
- scan the reference relation R, comparing each tuple with u
Proposed method:
- build an index on the reference relation for quickly retrieving a superset of the target fuzzy matches (pre-processing phase)
- this index relation is called the Error Tolerant Index (ETI); it is indexed using standard B+ trees to perform fast exact lookups
- building the ETI requires an approximation of fms, denoted fms_apx

24 Baseline Method (Fuzzy Match Data Cleaning)
Reference Relation (not indexable) -> pre-processing -> Error Tolerant Index (standard database relation, but indexable) -> Candidate Set (superset of FM(u))
The approximation of fms (fms_apx) is a pared-down version of fms:
- it ignores the ordering among tokens in the input and reference tuples
  - [beoing company, seattle, wa, 98004] and [company beoing, seattle, wa, 98004] are identical to fms_apx
- closeness between two tokens is measured through the similarity between sets of substrings called qgram sets

25 Baseline Method (Fuzzy Match Data Cleaning)
Estimating fms_apx requires computing the min-hash signature of each token and the min-hash similarity sim_mh between two tokens.
Min-hash similarity:
- U: the universe of strings over an alphabet Σ
- h_i: U -> N, i = 1, ..., H: H hash functions mapping elements of U uniformly and randomly to the set of natural numbers N
- S: a set of strings
- the min-hash signature mh(S) of S is the vector [mh_1(S), ..., mh_H(S)], where the i-th coordinate mh_i(S) is defined as:
$$mh_i(S) = \arg\min_{a \in S} h_i(a)$$
- the min-hash similarity between two tokens t_1 and t_2, where QG(t) denotes the qgram set of token t, is:
$$sim_{mh}(t_1, t_2) = \frac{1}{H} \sum_{i=1}^{H} I[mh_i(QG(t_1)) = mh_i(QG(t_2))]$$
- I[X] denotes an indicator variable over a boolean X (I[X] = 1 if X is true, else 0)
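A sketch of min-hash signatures over qgram sets. The seeded MD5 hash family and the rule that a token no longer than q is its own single qgram are assumptions made for the example, as are all function names.

```python
import hashlib


def qgrams(token: str, q: int = 3) -> set:
    """QG(t): the set of length-q substrings of the token."""
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def h(i: int, s: str) -> int:
    """i-th hash function, simulated by salting MD5 with the index i."""
    return int(hashlib.md5(f"{i}:{s}".encode("utf-8")).hexdigest(), 16)


def min_hash_signature(strings: set, H: int = 4) -> list:
    """mh(S) = [mh_1(S), ..., mh_H(S)], mh_i(S) = argmin over a in S of h_i(a)."""
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]


def sim_mh(t1: str, t2: str, q: int = 3, H: int = 4) -> float:
    """Fraction of coordinates on which the two signatures agree."""
    sig1 = min_hash_signature(qgrams(t1, q), H)
    sig2 = min_hash_signature(qgrams(t2, q), H)
    return sum(a == b for a, b in zip(sig1, sig2)) / H


print(min_hash_signature(qgrams("boeing")))
print(sim_mh("beoing", "boeing"), sim_mh("boeing", "seattle"))
```

Tokens with disjoint qgram sets can never agree on a coordinate, while tokens that share many qgrams tend to agree on a proportion of coordinates close to their set overlap.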

26 Baseline Method (Fuzzy Match Data Cleaning)
Let u, v be two tuples and let d_q = (1 - 1/q) be an adjustment term. fms_apx is defined as:
$$fms^{apx}(u, v) = \frac{1}{w(u)} \sum_i \sum_{t \in tok(u[i])} w(t) \cdot \max_{r \in tok(v[i])} \left( \frac{2}{q}\, sim_{mh}(QG(t), QG(r)) + d_q \right)$$
E.g.:
- Input tuple u: [Company Beoing, Seattle, NULL, 98004]
- Reference tuple v: [Boeing Company, Seattle, WA, 98004]
- q = 3, H = 2
- token weights: company: 0.25, beoing: 0.5, seattle: 1.0, 98004: 2.0; total weight = 3.75
- Suppose the min-hash signatures are [oei, ing], [com, pan], [sea, ttl], [wa], [980, 004]
- The score from matching beoing with boeing is w(beoing) · (2/3 · 0.5 + (1 - 1/3)) = w(beoing)
- Since every other token matches exactly with a reference token, fms_apx(u, v) = 3.75/…
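The per-token term of fms_apx can be checked numerically for the beoing example. In this sketch the input signature [eoi, ing] is an assumption (the slide lists only one set of signatures); it agrees with [oei, ing] on one of the H = 2 coordinates. The helper names are hypothetical.

```python
def sim_mh(sig_t, sig_r):
    """Fraction of min-hash coordinates on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_t, sig_r)) / len(sig_t)


def token_score(weight, sig_t, reference_sigs, q):
    """w(t) * max over reference tokens r of (2/q * sim_mh + d_q)."""
    d_q = 1 - 1 / q
    best = max(2 / q * sim_mh(sig_t, sig_r) + d_q for sig_r in reference_sigs)
    return weight * best


# Signatures of the reference tokens in the name column, from the slide.
reference_sigs = [["oei", "ing"], ["com", "pan"]]
beoing_sig = ["eoi", "ing"]  # assumed signature of the input token "beoing"

score = token_score(0.5, beoing_sig, reference_sigs, q=3)
print(score)  # w(beoing) * (2/3 * 0.5 + 2/3)
```

With sim_mh = 0.5 the bracket evaluates to 1, so the misspelled token still contributes its full weight under the approximation.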

27 Baseline Method (Fuzzy Match Data Cleaning)
Error Tolerant Index (ETI):
- enables, for each input tuple u, the efficient retrieval of a candidate set S of reference tuples whose similarity with u is greater than the minimum similarity threshold
- fms_apx is measured by comparing the min-hash signatures of the tokens in tok(u) and tok(v)
- to determine the candidate set, we need to efficiently identify, for each token t in tok(u), the set of reference tuples that share min-hash qgrams with t
- the ETI holds each qgram s along with the list of tids of all reference tuples with tokens whose min-hash signatures contain s

28 Baseline Method (Fuzzy Match Data Cleaning)
ETI schema: [QGram, Coordinate, Column, Frequency, Tid-list]
For each tuple e in the ETI:
- e[tid-list] contains the tids of all reference tuples containing at least one token t in the field e[column] whose e[coordinate]-th min-hash coordinate is e[qgram]
- e[frequency] stores the number of tids in e[tid-list]
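A sketch of how an ETI with this schema could be built and probed in memory. It is illustrative only: a Python dict keyed by (qgram, coordinate, column) stands in for the B+-tree-indexed relation, the salted MD5 hash family is an assumption, and the reference tuples are invented.

```python
import hashlib
from collections import defaultdict


def qgrams(token, q=3):
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def h(i, s):
    return int(hashlib.md5(f"{i}:{s}".encode("utf-8")).hexdigest(), 16)


def min_hash_signature(token, q=3, H=2):
    grams = qgrams(token, q)
    return [min(grams, key=lambda a: h(i, a)) for i in range(H)]


def build_eti(reference, q=3, H=2):
    """Map (qgram, coordinate, column) -> {frequency, tid_list}."""
    tid_sets = defaultdict(set)
    for tid, row in reference.items():
        for column, value in enumerate(row):
            for token in (value or "").lower().split():
                for coord, gram in enumerate(min_hash_signature(token, q, H)):
                    tid_sets[(gram, coord, column)].add(tid)
    return {key: {"frequency": len(tids), "tid_list": sorted(tids)}
            for key, tids in tid_sets.items()}


def candidate_tids(eti, token, column, q=3, H=2):
    """Tids of reference tuples sharing a min-hash coordinate with the token."""
    tids = set()
    for coord, gram in enumerate(min_hash_signature(token.lower(), q, H)):
        entry = eti.get((gram, coord, column))
        if entry:
            tids.update(entry["tid_list"])
    return tids


# Hypothetical reference relation keyed by tid.
reference = {
    1: ("Boeing Company", "Seattle", "WA", "98004"),
    2: ("Bon Corporation", "Seattle", "WA", "98014"),
}
eti = build_eti(reference)
print(candidate_tids(eti, "seattle", column=1))
print(candidate_tids(eti, "boeing", column=0))
```

Each lookup is an exact match on the key, which is why the structure can sit in an ordinary indexed relation.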

30 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm
- goal: reduce the number of lookups against the reference relation by using the ETI effectively
- fetches tid-lists by looking up in the ETI all q-grams in the min-hash signatures of all tokens in u

31 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm
1) For each token t in tok(u), compute its IDF weight w(t)
2) Determine the min-hash signature mh(t) of each token
3) Using the ETI, determine the candidate set S of reference tuples as per fms_apx
4) Fetch the tuples in S from the reference relation and test them as per fms
5) Among the tuples that pass the test, return the K tuples with the highest similarity scores

33 Improvements: Online Data Cleaning using qgram tries
Proposed method for cleaning a stream of incoming tuples before their insertion into a database table. It uses:
- Word Index
  - a structure similar to the ETI
  - holds information about the attribute values stored in the reference table
  - is used to retrieve clean words that probably match the input attribute values of a tuple
- Qgram Trie
  - stores the retrieved clean words
  - held in main memory

34 Improvements: Online Data Cleaning using qgram tries
The Word Index consists of five fields:
- qgram: a sequence of Q characters
- coordinate: the position at which the qgram occurs within a string value
- column: the string-valued attribute that holds the value
- code-list: the list of ids of the words that contain the qgram at the position given by the coordinate field
- frequency: the number of words in the code-list
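A sketch of a Word Index with these five fields, held in a Python dict for illustration. Zero-based positions, the column name "name" and the helper names are assumptions; words shorter than Q simply produce no entries in this version.

```python
from collections import defaultdict


def positional_qgrams(word, q=3):
    """(qgram, position) pairs of a word."""
    return [(word[i:i + q], i) for i in range(len(word) - q + 1)]


def build_word_index(words_by_column, q=3):
    """Map (qgram, coordinate, column) -> {code_list, frequency}."""
    code_lists = defaultdict(list)
    for column, words in words_by_column.items():
        for word_id, word in words.items():
            for gram, pos in positional_qgrams(word, q):
                code_lists[(gram, pos, column)].append(word_id)
    return {key: {"code_list": ids, "frequency": len(ids)}
            for key, ids in code_lists.items()}


def candidate_words(index, value, column, q=3):
    """Ids of clean words sharing a qgram with the value at the same position."""
    ids = set()
    for gram, pos in positional_qgrams(value, q):
        entry = index.get((gram, pos, column))
        if entry:
            ids.update(entry["code_list"])
    return ids


clean_words = {"name": {1: "Ric", 2: "Rica", 3: "Ricus"}}
index = build_word_index(clean_words)

print(index[("Ric", 0, "name")])
print(candidate_words(index, "Ricuss", "name"))
```

The candidate ids returned by the lookup are the clean words that would then be loaded into the in-memory qgram trie.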

35 Improvements: Online Data Cleaning using qgram tries
Qgram trie
- a root labeled null
- word-prefix subtrees as the children of the root
- a header table
Qgram trie node
- qgram: the qgram represented by the node
- count: the number of clean words represented by the portion of the path reaching this node
- node-link: a link to the next node in the trie carrying the same qgram, or null if there is none
- category-list: the list of ids of the words that share this node in the trie representation
Header table
- qgram
- head of node-link: points to the first node in the trie carrying the qgram
E.g., the qgram trie built in memory when the clean words Ric, Rica and Ricus, with ids 1, 2 and 3 respectively, are retrieved
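A sketch of the trie and header table for the three example words, assuming Q = 3 and that each word is inserted as the sequence of its consecutive qgrams. The class and attribute names follow the field names above but are otherwise illustrative.

```python
class Node:
    def __init__(self, qgram):
        self.qgram = qgram        # None for the root
        self.count = 0            # clean words whose path reaches this node
        self.node_link = None     # next node carrying the same qgram
        self.category_list = []   # ids of the words sharing this node
        self.children = {}


class QgramTrie:
    def __init__(self, q=3):
        self.q = q
        self.root = Node(None)
        self.header = {}          # qgram -> first node carrying it

    def insert(self, word_id, word):
        node = self.root
        for i in range(len(word) - self.q + 1):
            gram = word[i:i + self.q]
            child = node.children.get(gram)
            if child is None:
                child = Node(gram)
                node.children[gram] = child
                self._link(child)
            child.count += 1
            child.category_list.append(word_id)
            node = child

    def _link(self, node):
        """Append a new node to the node-link chain of its qgram."""
        head = self.header.get(node.qgram)
        if head is None:
            self.header[node.qgram] = node
            return
        while head.node_link is not None:
            head = head.node_link
        head.node_link = node


trie = QgramTrie()
for word_id, word in [(1, "Ric"), (2, "Rica"), (3, "Ricus")]:
    trie.insert(word_id, word)

ric = trie.root.children["Ric"]
print(ric.count, ric.category_list, sorted(ric.children))
```

All three words share the Ric node, which then branches into ica for Rica and icu followed by cus for Ricus. Since no qgram occurs on two different branches here, every node-link in this example stays null.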

36 Improvements: Online Data Cleaning using qgram tries
Matching procedure
- Candidate words that share qgrams in the same positions with the input value are stored in the qgram trie
- The qgram trie is searched according to the qgram sequence of the input value
  - all trie paths holding subsequences of the qgram sequence extracted from the possibly dirty input value are searched
  - matching scores between the input value and the clean words are stored in a score table
- The set of clean words whose similarity with the input word is above a similarity threshold is returned

37 Improvements: Online Data Cleaning using qgram tries
Input: attribute value u, Word Index
Output: K closest words to u
1. Select a qgram subsequence s of the input value u
   a. Find the first qgram q of s in the header table
      i. Access all nodes holding q
      ii. Search all possible trie paths whose nodes hold the qgram subsequence s beginning with q
      iii. Update the score table in case of a successful match
   b. Check for unselected qgram subsequences of u
      i. if unselected qgram subsequences of u exist, repeat step 1
      ii. else go to step 2
2. Sort the score table
3. Return the K most similar words according to their score
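The search can be sketched as follows. Two simplifications are assumptions of this sketch and not taken from the slides: the candidate words are passed in directly instead of being fetched through the Word Index, and a word's score is the length of the longest run of consecutive input qgrams found on a trie path through the word.

```python
def qgram_seq(word, q=3):
    return [word[i:i + q] for i in range(len(word) - q + 1)]


def build_trie(words, q=3):
    """Return the header table of a qgram trie built from {id: word}."""
    root = {"children": {}, "ids": [], "link": None}
    header, tails = {}, {}
    for word_id, word in words.items():
        node = root
        for gram in qgram_seq(word, q):
            child = node["children"].get(gram)
            if child is None:
                child = {"children": {}, "ids": [], "link": None}
                node["children"][gram] = child
                if gram in tails:
                    tails[gram]["link"] = child   # extend the node-link chain
                else:
                    header[gram] = child          # first node with this qgram
                tails[gram] = child
            child["ids"].append(word_id)
            node = child
    return header


def k_closest(u, words, k, q=3):
    header = build_trie(words, q)
    seq = qgram_seq(u, q)
    scores = {}
    for start in range(len(seq)):                 # step 1: next subsequence
        node = header.get(seq[start])             # 1.a: first qgram in header
        while node is not None:                   # 1.a.i: all nodes holding it
            cur, depth, pos = node, 0, start
            while cur is not None:                # 1.a.ii: follow the path
                depth += 1
                for word_id in cur["ids"]:        # 1.a.iii: update score table
                    scores[word_id] = max(scores.get(word_id, 0), depth)
                pos += 1
                cur = cur["children"].get(seq[pos]) if pos < len(seq) else None
            node = node["link"]
    ranked = sorted(scores.items(), key=lambda p: (-p[1], p[0]))   # step 2
    return [(words[word_id], score) for word_id, score in ranked[:k]]  # step 3


words = {1: "Ric", 2: "Rica", 3: "Ricus"}
result = k_closest("Ricuss", words, k=3)
print(result)
```

For the input Ricuss this scoring rule gives Ric and Rica a score of 1 each, because only the qgram Ric matches, and ranks Ricus first with three consecutive matching qgrams.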

38 Improvements: Online Data Cleaning using qgram tries
input value: Ricuss
qgram sequence: {Ric, icu, cus, uss}

clean word | word id | score
Ric        | 1       | 1
Rica       | 2       | 1
Ricus

39 Improvements: Online Data Cleaning using qgram tries
Each tuple is classified as one of the following:
- Clean
  - detected duplicate (i.e., a record already exists in the reference relation)
  - new (a respective record did not previously exist in the database)
- Not resolved, because there are many candidates and manual attention is needed

40 Improvements: Online Data Cleaning using qgram tries
Experimental parameters & measures
- measures (y-axis)
  - time to complete
  - number of comparisons
  - I/O activity
  - precision and recall (percentage of successful corrections and missed corrections)
  - memory used, hard disk space needed
  - time to generate any auxiliary structures
- varied parameters
  - the data set size and the stream size
  - noise level

41 That's all folks

42 Challenges for Fuzzy Matching
- To ensure high data quality, incoming data tuples must be validated and undergo a cleaning procedure
- In many situations, clean tuples must match acceptable tuples in reference tables
  - for example, the product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation