A Survey: Detection of Duplicate Record


Dewendra Bharambe 1, Susheel Jain 2, Anurag Jain 3
1,2,3 Computer Science Dept., Radharaman Institute of Technology & Science, Bhopal (M.P.), India

Abstract

The problem of identifying approximately duplicate records in a database is an essential step in the data cleaning and data integration process. Real-world entities often have two or more representations in databases. When dealing with large amounts of data, it is important to have a well-defined and tested mechanism for filtering out duplicate results, so that results stay relevant to the queries; for example, a dynamic web page that displays query results together with advertisements relevant to the query should not show the same record twice. Duplicate records exist in the query results of many web databases, especially when duplicates are defined on only some of the fields in a record. Records that are exactly the same can be detected with exact matching techniques, but a system that integrates and compares the query results returned from multiple web databases must match records from different sources that refer to the same real-world entity. In this paper, we analyze the literature on duplicate record detection. We cover the similarity metrics commonly used to detect similar field entries, present an extensive set of algorithms that can detect approximately duplicate records in a database, and cover techniques for improving the efficiency and scalability of approximate duplicate detection. We conclude with coverage of existing tools and a brief discussion of the big open problems in the area.

Keywords: Record matching, duplicate detection, data cleaning, data integration, data deduplication, entity matching.

Manuscript received November 08. Dewendra Bharambe, Computer Science Department, RITS, Bhopal, M.P. (dewendra.bharambe@gmail.com). Susheel Jain, Assistant Professor, Computer Science Department, RITS, Bhopal, M.P. (jain_susheel65@yahoo.com). Anurag Jain, H.O.D., Computer Science Department, RITS, Bhopal, M.P. (anurag.akjain@gmail.com).

I. INTRODUCTION

Duplicate record detection finds records that represent the same entity and merges them into a single record. A bibliographic database, for instance, sometimes contains multiple records representing the same item, and duplicate record detection must find those records and unify them to keep the database clean. The underlying technique of duplicate record detection is record matching: because a record usually consists of a set of fields, record matching consists of measuring the similarity of the values of corresponding fields between two records and combining the field similarities into a record similarity. Databases frequently contain field values and records that refer to the same entity but are not syntactically identical. Variations in representation can arise from typographical errors, misspellings, and abbreviations, as well as from the integration of multiple data sources. The main purpose of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In slightly ironic fashion, the same problem has multiple names across research communities. In the database community, the problem is described as merge/purge [1], data deduplication [2], and instance identification [3]; in the AI community, the same problem is described as database hardening [4] and name matching [5].
The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. In this paper we use the term duplicate record detection.

II. DATA PREPARATION

Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. The data preparation stage is the first step in this process; during it, data entries are stored in a uniform manner in the database, resolving the structural heterogeneity problem. The data preparation stage includes a parsing step, a data transformation step, and a standardization step. These steps improve the quality of the in-flow data and make the data comparable and more usable.

Fig. 1. Steps in data preparation: parsing, data transformation, and data standardization.
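To make these three steps concrete, the following minimal Python sketch parses a raw name field into components, applies simple transformations, and standardizes the format. The specific rules (title stripping, letter-only filtering, capitalization) are illustrative assumptions, not those of any particular system.

import re

def standardize_name(raw):
    # Transformation: lowercase and strip honorific titles
    # (the title list here is an illustrative assumption).
    s = raw.strip().lower()
    s = re.sub(r"\b(dr|mr|mrs|ms|prof)\.?\s+", "", s)
    # Transformation: keep letters and spaces only.
    s = re.sub(r"[^a-z\s]", " ", s)
    # Parsing: isolate the individual name components.
    tokens = s.split()
    # Standardization: one canonical format for comparison.
    return " ".join(t.capitalize() for t in tokens)

# Two differently formatted entries now compare equal:
assert standardize_name("DR. John  SMITH,") == standardize_name("john smith")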

Parsing is the first component of data preparation; it locates, identifies, and isolates individual data elements in the source files. Parsing makes it easier to correct, standardize, and match data because it allows the comparison of individual components rather than of long, complex strings of data. For example, the appropriate parsing of name and address components into consistent packets of information is a crucial part of the data cleaning process.

Data transformation refers to simple conversions that can be applied to the data in order for them to conform to the data types of their corresponding domains. This type of conversion focuses mainly on one field at a time, without any consideration of the values in related fields. Examples of data transformation are: 1) conversion of a data element from one data type to another; 2) renaming of a field from one name to another; 3) range checking, which involves examining the data in a field to ensure that it falls within the expected range, usually a numeric or date range.

Data standardization refers to the process of standardizing the information represented in certain fields to a specific content format. It is used for information that can be stored in many ways in various data sources and must be converted to a uniform representation before the duplicate detection process starts. Without standardization, many duplicate entries could be erroneously designated as non-duplicates because common identifying information cannot be compared. One of the most common standardization applications involves address information.

After the data preparation phase, the data are typically stored in tables with comparable fields. The next step is to identify which fields should be compared; for example, it would not be meaningful to compare the contents of the field Last Name with the field Address. Even after parsing, data standardization, and identification of similar fields, it is not trivial to match duplicate records: misspellings and different conventions for recording the same information still result in different, multiple representations of a unique object in the database.

III. TECHNIQUES TO MATCH INDIVIDUAL FIELDS

Typographical variation of string data is one of the most common causes of mismatches in database entries. Hence, duplicate detection typically relies on string comparison techniques to deal with typographical variations. Multiple methods have been developed for this task, based on the various types of errors they handle.

A) Character-Based Similarity Metrics

These metrics are designed to handle typographical errors well. In this section we cover the following metrics.

1) Jaro-Winkler distance

The Jaro distance measure, proposed by Jaro, was developed for comparing names gathered in the U.S. Census. It is specifically designed to compare surnames and given names; it is not meant to compare whole names. Given two strings s1 and s2, calculating the Jaro distance includes the following steps:

i. Compute the lengths of the two strings, |s1| and |s2|.
ii. Find the number m of common characters in the two strings; a character s1[i] is common with s2[j] if s1[i] = s2[j] and the positions i and j differ by no more than max(|s1|, |s2|)/2 - 1.
iii. Find the number of transpositions; a transposition occurs when a common character of one string is out of order with respect to the corresponding common characters of the other string.

The Jaro distance is then

Jaro(s1, s2) = (1/3) * (m/|s1| + m/|s2| + (m - t)/m),

where m is the number of matching characters and t is half the number of transpositions.
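The following is a minimal Python sketch of the Jaro measure as defined above; the Winkler variant additionally boosts the score of strings sharing a common prefix, which this sketch omits.

def jaro(s1, s2):
    if not s1 or not s2:
        return 0.0
    # Matching window from step ii: max(|s1|, |s2|)/2 - 1.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    flags1 = [False] * len(s1)
    flags2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):               # step ii: common characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                              # step iii: transpositions
    for i, matched in enumerate(flags1):
        if matched:
            while not flags2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2                                  # t is half the out-of-order count
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0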
2) Smith-Waterman distance

Smith and Waterman described an extension of edit distance and affine gap distance in which mismatches at the beginning and the end of strings have lower costs than mismatches in the middle. For example, this method can correctly identify "Rowling J" as being similar to "Rowling James": when the two strings are compared, the common substring "Rowling" is extracted, and the algorithm assigns low cost to the characters at the start and end. If the resulting cost is less than a threshold value, the strings are classified as similar, and otherwise as dissimilar. The threshold value is user-configurable and depends on the algorithm being used for duplicate detection.

3) Q-gram distance

A Q-gram is a substring of a text where the length of the substring is Q. The idea is to break each string into tokens of Q-grams, compare the two token sets, and count the number of matches between the strings. Additional padding is added at the start and end of each string so that the leading and trailing characters are not ignored in the comparison. Q-gram algorithms are not strictly phonetic matching, since they do not operate on the phonetic characteristics of words; instead, Q-grams can be thought of as computing the distance, or amount of difference, between two words. The Q-gram algorithm [6] is highly useful because it can match misspelled or mutated words even if they are phonetically disparate. As an example, comparing the two strings Smith and Zemith with Q = 2 (the length of the substring) yields the following Q-grams:

SMITH:  #S SM MI IT TH H#
ZEMITH: #Z ZE EM MI IT TH H#

Example of Q-grams when Q = 2.

We get a match of four Q-grams in this case (MI, IT, TH, and H#); the higher the number of matches, the more similar the words. This approach includes the first and last letters of the word (in contrast to, e.g., the Smith-Waterman distance).

4) Affine gap distance

The edit distance metric does not work well when matching strings that have been truncated (e.g., "John R. Smith" versus "Jonathan Richard Smith"). The affine gap distance metric [7] offers a solution to this problem by introducing two extra edit operations: open gap and extend gap. The cost of extending a gap is usually smaller than the cost of opening one, and this results in smaller cost penalties for gap mismatches than the equivalent cost under the edit distance metric. Using dynamic programming, the affine gap distance of two strings s1 and s2 can be computed in O(|s1| * |s2|) time. Later work, building on learnable models proposed for edit distance, describes how to train an edit distance model with affine gaps.

5) Edit distance

The edit distance between two strings s1 and s2 is the minimum number of single-character edit operations needed to transform s1 into s2. There are three types of edit operations: i) insert a character into the string; ii) delete a character from the string; iii) replace one character with a different character. In the simplest form, each edit operation has unit cost. The basic dynamic programming algorithm [8] for computing the edit distance between two strings takes O(|s1| * |s2|) time. There are also algorithms that detect in O(k * max{|s1|, |s2|}) time whether two strings are within edit distance k; notice that if ||s1| - |s2|| > k, then by definition the two strings do not match within distance k, so such algorithms target the non-trivial case ||s1| - |s2|| <= k. Extensions of the original edit distance model allow different costs for different edit operations.
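Two of the metrics above, Q-gram overlap and unit-cost edit distance, can be sketched in a few lines of Python. This is a minimal illustration of the definitions in this section, not an optimized implementation.

from collections import Counter

def qgrams(s, q=2, pad="#"):
    # Pad both ends so leading and trailing characters are not ignored.
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_matches(s1, s2, q=2):
    # Multiset overlap of Q-grams, as in the Smith/Zemith example.
    c1, c2 = Counter(qgrams(s1.lower(), q)), Counter(qgrams(s2.lower(), q))
    return sum((c1 & c2).values())

def edit_distance(s1, s2):
    # Basic O(|s1| * |s2|) dynamic program with unit-cost operations.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (c1 != c2)))    # replace
        prev = cur
    return prev[-1]

# qgram_matches("Smith", "Zemith") == 4; edit_distance("Smith", "Zemith") == 2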
B) Numeric Similarity Metrics

While numerous approaches have been developed for comparing strings, the methods for comparing numeric values are relatively primitive. In most cases where it makes sense to compare numbers, a basic comparison is sufficient, and queries can be developed to extract numeric data with ease.

There has been continuing research on using cosine similarity and other algorithms to analyze numeric data. For example, numeric data can be compared with primitive operators such as equals and greater-than, which can also be used to calculate the difference between two numeric values.

C) Token-Based Similarity Metrics

Character-based similarity metrics handle typographical errors well, but typographical conventions often lead to the rearrangement of words (e.g., "John Smith" versus "Smith John"). In such cases, character-level metrics fail to capture the similarity of the entities.

1) Q-grams with tf.idf

The WHIRL approach can be extended to handle spelling errors by using Q-grams, instead of words, as tokens. In this setting, a spelling error minimally affects the set of common Q-grams of two strings, so two strings such as "Gateway Communications" and "Comunications Gateway" have high similarity under this metric despite the block move and the spelling error. This metric also handles the insertion and deletion of words nicely: the string "Gateway Communications" matches the string "Communications Gateway International" with high similarity, since the Q-grams of the word "International" appear often in the relation and therefore have low weight.

2) Atomic strings

A basic algorithm matches text fields based on atomic strings. An atomic string is a sequence of alphanumeric characters delimited by punctuation characters. Two atomic strings match if they are equal or if one is a prefix of the other. Under this algorithm, the similarity of two fields is the number of their matching atomic strings divided by their average number of atomic strings.

3) WHIRL

WHIRL adopts from information retrieval the cosine similarity combined with the tf.idf weighting scheme to compute the similarity of two fields. Cohen separates each string s into words, and each word w is assigned a weight

v_s(w) = log(tf_w + 1) * log(idf_w),

where tf_w is the number of times w appears in the field and idf_w = |D| / n_w, with n_w the number of records in the database D that contain w. The weight for a word w in a field is high if w appears a large number of times in the field (large tf_w) and w is a sufficiently rare term in the database (large idf_w). For example, in a collection of company names, relatively infrequent terms such as "AT&T" or "IBM" will have higher weights than more frequent terms such as "Inc." The cosine similarity of s1 and s2 is defined as

sim(s1, s2) = sum_w v_s1(w) * v_s2(w) / (||v_s1|| * ||v_s2||).

The cosine similarity metric works well for a large variety of entries and is insensitive to the location of words, thus allowing natural word moves and swaps (e.g., "John Smith" is equivalent to "Smith John").
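A minimal sketch of this weighting and similarity computation in Python follows. The log-based tf.idf weighting shown here is one common variant and is an assumption; it does not reproduce WHIRL's exact weighting details.

import math
from collections import Counter

def tfidf_vectors(fields):
    # Build one normalized tf.idf vector per string field in the database.
    tokenized = [f.lower().split() for f in fields]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(fields)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        v = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors

def cosine(v1, v2):
    # Dot product of normalized sparse vectors = cosine similarity.
    return sum(x * v2.get(w, 0.0) for w, x in v1.items())

# The vectors for "John Smith" and "Smith John" are identical,
# so word order does not affect the similarity.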
D) Phonetic Similarity Metrics

Character-level and token-based similarity metrics focus on the string-based representation of database records. However, strings may be phonetically similar even if they are not similar at the character or token level. For example, the word Kageonne is phonetically similar to Cajun despite the fact that the string representations are very different. Phonetic similarity metrics try to address such issues and match such strings.

1) Metaphone and Double Metaphone

Philips [9] suggested the Metaphone algorithm as a better alternative to Soundex; it uses 16 consonant sounds that can describe a large number of sounds used in many English and non-English words. Double Metaphone [10] is an improved version of Metaphone that refines some of the encoding choices made in the initial Metaphone and allows multiple encodings for names that have various possible pronunciations. For such cases, all possible encodings are tested when trying to retrieve similar names. The introduction of multiple phonetic encodings greatly enhances matching performance with rather small overhead.

2) Soundex

Soundex can be described as a hashing mechanism for English words. It converts a word into a four-character string consisting of the first letter of the word and three numbers calculated by the hash function. This code describes how a word sounds and can thus be used to compare and find similar-sounding words. The steps for deriving the American Soundex code [11] are given below.

1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
2. Replace the remaining consonants (after the first letter) with digits as follows:
   b, f, p, v -> 1
   c, g, j, k, q, s, x, z -> 2
   d, t -> 3
   l -> 4
   m, n -> 5
   r -> 6
3. Code two adjacent letters with the same number as a single number.
4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using the above steps we can derive, for example, the Soundex code E235 for Eastman, E236 for Easterman, and W235 for Westminster. By comparing the generated Soundex codes we can group words that sound similar: Eastman and Easterman can be grouped together, as their Soundex codes are nearer to each other than to the Soundex code of Westminster.
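A minimal Python sketch of these four steps follows; it implements the simplified American Soundex rules exactly as listed above (standard implementations add further special cases, e.g. for h and w between consonants, that the steps above omit).

def soundex(name):
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"),
                           ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
        for c in letters:
            codes[c] = digit
    name = name.lower()
    first = name[0].upper()                      # step 1: retain the first letter
    digits = [codes.get(c, "") for c in name]    # step 2: consonants to digits
    out = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:                      # step 3: collapse adjacent repeats
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]    # step 4: pad with 0s, length 4

# soundex("Eastman") == "E235", soundex("Easterman") == "E236",
# soundex("Westminster") == "W235"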
3) Oxford Name Compression Algorithm (ONCA)

ONCA [12] is a two-stage technique designed to overcome most of the unsatisfactory features of pure Soundex-ing while retaining the convenient four-character fixed-length format. In the first stage, ONCA uses a British version of the NYSIIS method of compression. In the second stage, the transformed and partially compressed name is Soundexed in the usual way. This two-stage technique has been used successfully for grouping similar names together.

4) New York State Identification and Intelligence System (NYSIIS)

The NYSIIS system, proposed by Taft [13], differs from Soundex in that it retains information about the position of vowels in the encoded word by converting most vowels to the letter A. Furthermore, NYSIIS does not use numbers to replace letters; instead, it replaces consonants with other, phonetically similar letters, thus returning a purely alphabetic code (no numeric component). Usually, the NYSIIS code for a surname is based on a maximum of nine letters of the full alphabetical name, and the NYSIIS code itself is limited to six characters. Taft [13] compared Soundex with NYSIIS using a name database of New York State and concluded that NYSIIS is more accurate than Soundex for locating surnames. The NYSIIS encoding system is still used today by the New York State Division of Criminal Justice Services.

E) Summary

Various techniques have been applied for matching fields with string data in duplicate detection. Character-based similarity metrics handle typographical errors, while token-based similarity metrics work well for rearranged strings that have the same meaning. Phonetic similarity metrics try to match strings that are phonetically similar, and numeric similarity metrics are used to capture the similarity of numeric data.

IV. FINDING DUPLICATE RECORDS

We have seen various methods that can be used to compare strings or individual fields, with a metric quantifying their similarity or lack of it. When these methods are applied to real-world situations, where the data are multivariate and the number of fields is as dynamic as the data itself, the problem of duplicate detection becomes more complicated. Numerous papers and approaches address this issue; they can be classified as follows.

A) Rule-Based Techniques

A special case of distance-based approaches [14], [15] is the use of rules to define whether two records are the same or not. Rule-based approaches can be considered as distance-based techniques where the distance of two records is either 0 or 1. It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy. However, the required tuning demands extremely high manual effort from the human experts, and this effort makes the deployment of such systems difficult in practice. Currently, the typical approach is to use a system that generates matching rules from training data and then to manually tune the automatically generated rules.

In one such framework, the mapping transformation standardizes data, the matching transformation finds pairs of records that probably refer to the same real object, the clustering transformation groups together matching pairs with a high similarity value, and finally the merging transformation collapses each individual cluster into a tuple of the resulting data source.

B) Active-Learning-Based Techniques

A problem with supervised learning techniques is the requirement for a large number of training examples. While it is easy to create a large number of training pairs that are either clearly non-duplicates or clearly duplicates, it is very difficult to generate the ambiguous cases that would help create a highly accurate classifier. Active-learning methods address this by creating multiple classifiers, trained using slightly different data or parameters, to detect ambiguous cases and then ask the user for feedback. The key innovation in this work is the creation of several redundant functions and the concurrent exploitation of their conflicting decisions in order to discover new kinds of inconsistencies among duplicates in the data set.

C) Probabilistic Matching Models

Let A and B be two tables with n comparable fields. In the duplicate detection problem, each tuple pair is assigned to one of two classes, M and U. The class M contains the record pairs that represent the same entity ("match"), and the class U contains the record pairs that represent two different entities ("non-match"). Each tuple pair is represented as a random vector x = [x_1, ..., x_n]^T with n components that correspond to the n comparable fields of A and B. Newcombe et al. [23], [24] were the first to recognize duplicate detection as a Bayesian inference problem, and this formulation is now very common in the duplicate detection literature. The comparison vector x is the input to a decision rule that assigns x to U or to M. The assumption is that x is a random vector whose density function differs for the two classes.

1) Naive Bayes rule

Conditional independence assumption: p(x_i | M) and p(x_j | M) are independent if i is not equal to j.
Goal: to compute the distributions p(x | M) and p(x | U).
Under the naive Bayes rule,

p(x | M) = prod_i p(x_i | M) and p(x | U) = prod_i p(x_i | U).

Using a training set of pre-labeled record pairs, the values of p(x_i | M) and p(x_i | U) are computed.
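A minimal sketch of this estimation in Python follows, using binary comparison vectors; the Laplace smoothing is an added assumption to keep the estimates (and the likelihood ratio used in the decision rules below) finite.

import math

def train_naive_bayes(labeled_pairs, n_fields):
    # labeled_pairs: iterable of (x, label), where x is a tuple of 0/1
    # field-match indicators and label is "M" or "U".
    counts = {"M": [0] * n_fields, "U": [0] * n_fields}
    totals = {"M": 0, "U": 0}
    for x, label in labeled_pairs:
        totals[label] += 1
        for i, xi in enumerate(x):
            counts[label][i] += xi
    # p(x_i = 1 | class), with Laplace smoothing (an assumption).
    return {lbl: [(c + 1) / (totals[lbl] + 2) for c in counts[lbl]]
            for lbl in counts}

def log_likelihood_ratio(x, p):
    # log l(x) = log p(x|M) - log p(x|U) under conditional independence.
    def log_px(probs):
        return sum(math.log(pi if xi else 1.0 - pi)
                   for xi, pi in zip(x, probs))
    return log_px(p["M"]) - log_px(p["U"])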
2) Winkler methodology

Because conditional independence is often not a reasonable assumption, Winkler [16] suggested estimating p(x | M) and p(x | U) using the expectation-maximization (EM) algorithm. Winkler identified five conditions under which the unsupervised EM algorithm works well, namely: i) the data contain a relatively large percentage of matches; ii) the matching pairs are well separated from the other classes; iii) the rate of typographical errors is low; iv) there are sufficiently many redundant identifiers to overcome errors in other fields of the record; and v) the estimates computed under the conditional independence assumption result in good classification performance. Winkler showed that this unsupervised EM approach works well even when a limited number of interactions are allowed between the variables. It is interesting to note that the results under the independence assumption are not considerably worse than those of the EM model that allows variable interactions.

3) The Bayes decision rule for minimum error

Assumption: x is a comparison vector randomly drawn from the comparison space that corresponds to a record pair.
Goal: to determine whether the pair belongs to M or to U.
Decision rule:

x in M if p(M | x) >= p(U | x); otherwise x in U.  (1)

Decision rule (1) states that if the probability of the match class M given the comparison vector x is larger than the probability of the non-match class U, then x is classified to M, and vice versa.

Writing the rule in terms of the class-conditional densities gives the Bayes decision rule:

x in M if l(x) = p(x | M) / p(x | U) >= p(U) / p(M); otherwise x in U.  (2)

The ratio l(x) = p(x | M) / p(x | U) is called the likelihood ratio, and the ratio p(U) / p(M) denotes the threshold value of the likelihood ratio for the decision. Equation (2) is known as the Bayes test for minimum error; it can be shown that the Bayes test results in the smallest probability of error and is, in that respect, an optimal classifier. This holds only when the distributions p(x | M), p(x | U) and the priors p(U), p(M) are known.

4) Binary model

i) The probabilistic model can also be used without training data.
ii) Jaro introduced a binary model for the values of x_i such that x_i = 1 if field i matches, and x_i = 0 otherwise.
iii) He suggested calculating the probabilities p(x_i = 1 | M) using an expectation-maximization (EM) algorithm; the probabilities p(x_i = 1 | U) can be calculated by taking random pairs of records.

D) Unsupervised Learning

One way to avoid manual labeling of the comparison vectors is to use clustering algorithms to group together similar comparison vectors. The idea behind most unsupervised learning approaches for duplicate detection is that similar comparison vectors correspond to the same class; this idea has its roots in the probabilistic model. One approach uses a bootstrapping technique based on clustering to learn matching models. The basic idea, also known as co-training [17], is to use very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels. Each entry of the comparison vector corresponds to the result of a field comparison; the comparison space is then partitioned into clusters using the AutoClass [18] clustering tool. The basic premise is that each cluster contains comparison vectors with similar characteristics, so all the record pairs in a cluster belong to the same class (matches, non-matches, or possible matches).

E) Supervised and Semi-Supervised Learning

Supervised learning systems rely on the existence of training data in the form of record pairs pre-labeled as matching or not. One set of supervised learning techniques treats each record pair (a, b) independently, similarly to the probabilistic techniques. The well-known CART algorithm [19] generates classification and regression trees; a linear discriminant algorithm generates a linear combination of the parameters for separating the data according to their classes; and a vector quantization approach is a generalization of nearest-neighbor algorithms. The transitivity assumption can sometimes result in inconsistent decisions: for example, (a, b) and (a, c) can be considered matches but (b, c) not. Partitioning such inconsistent graphs with the goal of minimizing inconsistencies is an NP-complete problem.

F) Bigram Indexing

The bigram indexing (BI) method, as implemented in the Febrl [20] record linkage system, allows for fuzzy blocking. The basic idea is that each blocking key value is converted into a list of bigrams (substrings containing two characters), and sub-lists of all possible permutations are built using a threshold (between 0.0 and 1.0). The resulting bigram sub-lists are sorted and inserted into an inverted index, which is used to retrieve the corresponding record numbers in a block. The number of sub-lists created for a blocking key value depends both on the length of the value and on the threshold.
The lower the threshold, the shorter the sub-lists, but also the more sub-lists there will be per blocking key value, resulting in more (and smaller) blocks in the inverted index. In the information retrieval field, bigram indexing has been found to be robust to small typographical errors in documents.
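The following Python sketch illustrates the idea; the sub-list length rule (all combinations of ceil(n * threshold) bigrams out of n) follows the description above, but details such as key formatting are assumptions rather than Febrl's exact behavior.

import math
from itertools import combinations

def bigram_keys(value, threshold=0.8):
    # Convert a blocking key value into bigrams, then form all sorted
    # sub-lists of length ceil(n * threshold); each becomes an index key.
    bigrams = [value[i:i + 2] for i in range(len(value) - 1)]
    k = max(1, math.ceil(len(bigrams) * threshold))
    return {" ".join(sorted(sub)) for sub in combinations(bigrams, k)}

def build_inverted_index(records):
    # Inverted index: sub-list key -> set of record numbers in that block.
    index = {}
    for rec_id, value in enumerate(records):
        for key in bigram_keys(value.lower()):
            index.setdefault(key, set()).add(rec_id)
    return index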

G) Distance-Based Techniques

Active learning techniques require some training data or some human effort to create the matching models. In the absence of such training data, or of the ability to get human input, supervised and active learning techniques are not appropriate. One way of avoiding the need for training data is to define a distance metric for records that does not need tuning through training data. Distance-based approaches [14] that conflate each record into one big field, however, may ignore important information that can be used for duplicate detection. A simple approach is to measure the distance between individual fields, using the appropriate distance metric for each field, and then compute the weighted distance between the records. In this case, the problem is the computation of the weights, and the overall setting becomes very similar to the probabilistic setting. One of the problems of distance-based techniques is the need to define an appropriate value for the matching threshold. In the presence of training data, it is possible to find the appropriate threshold value; however, that would nullify the major advantage of distance-based techniques, which is the ability to operate without training data.
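The weighted-distance idea can be sketched as follows; the field weights and threshold shown in the usage comment are hypothetical, and the jaro function from the earlier character-based sketch is reused as one possible per-field metric.

def record_distance(rec1, rec2, field_metrics, weights):
    # Combine per-field distances (each in [0, 1]) with field weights.
    return sum(w * field_metrics[f](rec1[f], rec2[f])
               for f, w in weights.items())

# Hypothetical usage, reusing jaro() from the character-based sketch:
#   metrics = {"name": lambda a, b: 1 - jaro(a, b),
#              "city": lambda a, b: 1 - jaro(a, b)}
#   d = record_distance(rec_a, rec_b, metrics, {"name": 0.7, "city": 0.3})
#   is_duplicate = d < 0.25   # the matching threshold must still be chosen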
H) Summary

Finding duplicate records in a web database is a complicated task, and several approaches address it. As the name suggests, probabilistic matching uses likelihood ratio theory to classify pairs as duplicates. The supervised and semi-supervised approaches need training data in the form of record pairs pre-labeled as matching or not, while unsupervised learning requires no formal training data. Active learning methods are used in duplicate detection systems to find ambiguous pairs. In distance-based techniques, the distances between individual fields are measured using appropriate distance metrics, while rule-based approaches use rules to define whether two records are similar or not.

V. DUPLICATE DETECTION TOOLS

In this section we review duplicate detection packages, focusing on tools that have an open architecture and allow users to understand the underlying mechanics of the matching mechanisms.

The Febrl system (Freely Extensible Biomedical Record Linkage) is an open source data cleaning toolkit with two main components: the first deals with data standardization, and the second performs the actual duplicate detection. The data standardization relies mainly on hidden Markov models (HMMs); therefore, Febrl typically requires training to correctly parse the database entries. For duplicate detection, Febrl implements a variety of string similarity metrics, such as Jaro, edit distance, and Q-gram distance. Finally, Febrl supports phonetic encoding (Soundex, NYSIIS, and Double Metaphone) to detect similar names; since phonetic similarity is sensitive to errors in the first letter of a name, Febrl also computes the phonetic similarity of the reversed name string, sidestepping the first-letter sensitivity problem.

TAILOR [21] is a flexible record matching toolbox that allows users to apply different duplicate detection methods to their data sets. The flexibility of using multiple models is useful when users do not know which duplicate detection model will perform most effectively on their particular data. TAILOR follows a layered design, separating comparison functions from the duplicate detection logic. Furthermore, the execution strategies that improve efficiency are implemented in a separate layer, making the system more extensible than systems that rely on monolithic designs. Finally, TAILOR reports statistics, such as estimated accuracy and completeness, which can help users better understand the quality of a given duplicate detection execution over a new data set.

WHIRL is a duplicate record detection system available for free for academic and research use. WHIRL uses the tf.idf token-based similarity metric to identify similar strings within two lists. The Flamingo Project is a similar tool providing a simple string matching facility that takes as input two string lists and returns the string pairs that are within a pre-specified edit distance threshold. WizSame by WizSoft is also a product that allows the discovery of duplicate records in a database; its matching algorithm is very similar to Soft TF.IDF: two records match if they contain a significant fraction of identical or similar words, where similar words are those within edit distance one.

Bigmatch [22] is the duplicate detection program used by the US Census Bureau. It relies on blocking strategies to identify potential matches between the records of two relations and scales well for very large data sets. The only requirement is that one of the two relations should fit in memory, and it is possible to fit even relations with 100 million records in memory. The main goal of Bigmatch is not to perform sophisticated duplicate detection, but rather to generate a set of candidate pairs that can then be processed by more sophisticated duplicate detection algorithms.

Finally, we should note that many database vendors (Oracle, IBM, and Microsoft) currently do not provide sufficient tools for duplicate record detection. Most of their efforts until now have focused on creating easy-to-use ETL tools that can standardize database records and fix minor errors, mainly in the context of address data.

Another typical function of the tools provided today is the ability to use reference tables to standardize the representation of entities that are well known to have multiple representations.

VI. FUTURE DIRECTIONS AND CONCLUSIONS

In this paper we have presented a comprehensive survey of the existing techniques for detecting non-identical duplicate entries in database records. As database systems become more and more commonplace, data cleaning is going to be the cornerstone for correcting errors in systems that accumulate vast amounts of errors on a daily basis. Despite the breadth and depth of the presented techniques, we believe that there is still room for substantial improvement in the current state of the art.

First of all, it is currently unclear which metrics and techniques represent the state of the art. The lack of standardized, large-scale benchmarking data sets is a big obstacle for the further development of the field, as it is almost impossible to convincingly compare new techniques with existing ones. A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they can evaluate their methods during the development process. Along with benchmark and evaluation data, various systems need some form of training data to produce the initial matching model. Although small data sets are available, we are not aware of large-scale, validated data sets that could be used as benchmarks. Winkler highlights techniques for deriving data sets that are properly anonymized and are still useful for duplicate record detection purposes.

Currently there are two main approaches to duplicate record detection. Research in databases emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records; such techniques typically do not rely on the existence of training data and emphasize efficiency over effectiveness. Research in machine learning and statistics, on the other hand, aims to develop more sophisticated matching techniques that rely on probabilistic models. An interesting direction for future research is to develop techniques that combine the best of both worlds.

Most of the duplicate detection systems available today offer various algorithmic approaches for speeding up the duplicate detection process. The changing nature of the duplicate detection process also requires adaptive methods that detect different patterns for duplicate detection and automatically adapt themselves over time. For example, a background process could monitor the incoming data and any data sources that need to be merged or matched, and decide, based on the observed errors, whether a revision of the duplicate detection process is necessary. Another related challenge is to develop methods that permit the user to estimate the proportion of errors expected in a data cleaning project. Finally, large amounts of structured information are now derived from unstructured text and from the Web; this information is typically imprecise and noisy, and duplicate record detection techniques are crucial for improving the quality of the extracted data.

REFERENCES

[1] M.A. Hernandez and S.J. Stolfo, "Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, Jan.
[2] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf.
Knowledge Discovery and Data Mining (KDD '02).
[3] Y.R. Wang and S.E. Madnick, "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems," Proc. Fifth IEEE Int'l Conf. Data Eng. (ICDE '89).
[4] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00).
[5] M. Bilenko, R.J. Mooney, W.W. Cohen, P. Ravikumar, and S.E. Fienberg, "Adaptive Name Matching in Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, Sept./Oct.
[6] E. Ukkonen, "Approximate String-Matching with Q-grams and Maximal Matches," Theoretical Computer Science, vol. 92, no. 1, 1992.
[7] M.S. Waterman, T.F. Smith, and W.A. Beyer, "Some Biological Sequence Metrics," Advances in Math., vol. 20, no. 4.
[8] G. Navarro, "A Guided Tour to Approximate String Matching," ACM Computing Surveys, vol. 33, no. 1.
[9] L. Philips, "Hanging on the Metaphone," Computer Language Magazine, vol. 7, no. 12, Dec. 1990.
[10] L. Philips, "The Double Metaphone Search Algorithm," C/C++ Users J., vol. 18, no. 5, June.
[11]

[12] L.E. Gill, "OX-LINK: The Oxford Medical Record Linkage System," Proc. Int'l Record Linkage Workshop and Exposition.
[13] R.L. Taft, "Name Search Techniques," Technical Report Special Report No. 1, New York State Identification and Intelligence System, Albany, N.Y., Feb.
[14] M. Bilenko and R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proc. ACM SIGKDD.
[15] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. ACM SIGKDD.
[16] W.E. Winkler, "The State of Record Linkage and Current Research Problems," Technical Report, Statistical Research Report Series RR99/04, US Bureau of the Census, Washington, D.C.
[17] A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," COLT '98: Proc. 11th Ann. Conf. Computational Learning Theory.
[18] P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press.
[19] J. Cho, N. Shivakumar, and H. Garcia-Molina, "Finding Replicated Web Collections," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00).
[20] P. Christen, T. Churches, and M. Hegland, "Febrl: A Parallel Open Source Data Linkage System," Advances in Knowledge Discovery and Data Mining, Springer.
[21] M.G. Elfeky, A.K. Elmagarmid, and V.S. Verykios, "TAILOR: A Record Linkage Tool Box," Proc. 18th IEEE Int'l Conf. Data Eng. (ICDE '02).
[22] W.E. Yancey, "Bigmatch: A Program for Extracting Probable Matches from a Large File for Record Linkage," Technical Report, Statistical Research Report Series RRC2002/01, US Bureau of the Census, Washington, D.C., Mar.
[23] H.B. Newcombe, J.M. Kennedy, S. Axford, and A. James, "Automatic Linkage of Vital Records," Science, vol. 130, no. 3381, Oct.
[24] H.B. Newcombe and J.M. Kennedy, "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information," Comm. ACM, vol. 5, no. 11, Nov.

Dewendra Bharambe is an M.Tech. scholar in Computer Science Engineering at R.I.T.S., Bhopal, under R.G.T.U., Bhopal, India. He is working as a lecturer at J.T. Mahajan College of Engineering, Faizpur.

Susheel Jain is an Assistant Professor in the Computer Science Department of R.I.T.S., Bhopal, M.P. He received his M.Tech. in Software Engineering from Gautam Buddh Technical University, Lucknow, India.

Anurag Jain is the H.O.D. of the Computer Science Department of R.I.T.S., Bhopal, M.P. He received his M.Tech. in Computer Science and Engineering from Barkatullah University, Bhopal, India.


Privacy Aspects in Big Data Integration: Challenges and Opportunities Privacy Aspects in Big Data Integration: Challenges and Opportunities Peter Christen Research School of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

More information

Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Data De-duplication: A Review

Data De-duplication: A Review Data De-duplication: A Review Gianni Costa and Alfredo Cuzzocrea and Giuseppe Manco and Riccardo Ortale Abstract The wide exploitation of new techniques and systems for generating, collecting and storing

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc.

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc. Copyright 2015 Pearson Education, Inc. Technology in Action Alan Evans Kendall Martin Mary Anne Poatsy Eleventh Edition Copyright 2015 Pearson Education, Inc. Technology in Action Chapter 9 Behind the

More information

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

Selective dependable storage services for providing security in cloud computing

Selective dependable storage services for providing security in cloud computing Selective dependable storage services for providing security in cloud computing Gade Lakshmi Thirupatamma*1, M.Jayaram*2, R.Pitchaiah*3 M.Tech Scholar, Dept of CSE, UCET, Medikondur, Dist: Guntur, AP,

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016 Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

More information

An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

An Improving Genetic Programming Approach Based Deduplication Using KFINDMR An Improving Genetic Programming Approach Based Deduplication Using KFINDMR P.Shanmugavadivu #1, N.Baskar *2 # Department of Computer Engineering, Bharathiar University Sri Ramakrishna Polytechnic College,

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Advanced record linkage methods: scalability, classification, and privacy

Advanced record linkage methods: scalability, classification, and privacy Advanced record linkage methods: scalability, classification, and privacy Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University

More information

Web Data Extraction: 1 o Semestre 2007/2008

Web Data Extraction: 1 o Semestre 2007/2008 Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008

More information

Effective Data Mining Using Neural Networks

Effective Data Mining Using Neural Networks IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 6, DECEMBER 1996 957 Effective Data Mining Using Neural Networks Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu,

More information

Creating Relational Data from Unstructured and Ungrammatical Data Sources

Creating Relational Data from Unstructured and Ungrammatical Data Sources Journal of Artificial Intelligence Research 31 (2008) 543-590 Submitted 08/07; published 03/08 Creating Relational Data from Unstructured and Ungrammatical Data Sources Matthew Michelson Craig A. Knoblock

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor

More information

Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences

Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Nicoletta Cibella 1, Gervasio-Luis Fernandez 2, Marco Fortini 1, Miguel Guigò 2, Francisco Hernandez 2,

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Load Distribution in Large Scale Network Monitoring Infrastructures

Load Distribution in Large Scale Network Monitoring Infrastructures Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti

More information

Scalable Parallel Clustering for Data Mining on Multicomputers

Scalable Parallel Clustering for Data Mining on Multicomputers Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model Twinkle Patel, Ms. Ompriya Kale Abstract: - As the usage of credit card has increased the credit card fraud has also increased

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey

Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 12 December 2014, Page No. 9766-9773 Duplicate Detection Algorithm In Hierarchical Data Using Efficient

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Collective Entity Resolution In Relational Data

Collective Entity Resolution In Relational Data Collective Entity Resolution In Relational Data Indrajit Bhattacharya and Lise Getoor Department of Computer Science University of Maryland, College Park, MD 20742, USA Abstract An important aspect of

More information