A Survey: Detection of Duplicate Record


Dewendra Bharambe 1, Susheel Jain 2, Anurag Jain 3
1,2,3 Computer Science Dept., Radharaman Institute of Technology & Science, Bhopal (M.P.), India

Abstract

The problem of identifying approximately duplicate records in a database is an essential step in the data cleaning and data integration process. Real-world entities often have two or more representations in databases. When dealing with large amounts of data, it is important to have a well-defined and tested mechanism for filtering out duplicate results, so that results stay relevant to the queries; for example, a dynamic web page that displays query results together with advertisements relevant to the query should not show the same record twice. Duplicate records exist in the query results of many web databases, especially when duplicates are defined on only some of the fields in a record. Records that are exactly the same can be detected with exact matching techniques, but a system that integrates and compares the query results returned from multiple web databases must match records from different sources that refer to the same real-world entity. In this paper, we analyze the literature on duplicate record detection. We cover the similarity metrics commonly used to detect similar field entries, present an extensive set of algorithms that can detect approximately duplicate records in a database, and cover techniques for improving the efficiency and scalability of approximate duplicate detection. We conclude with coverage of existing tools and a brief discussion of the big open problems in the area.

Keywords: Record matching, duplicate detection, data cleaning, data integration, data deduplication, entity matching.

Manuscript received November 08. Dewendra Bharambe, Computer Science Department, RITS, Bhopal, M.P. (dewendra.bharambe@gmail.com). Susheel Jain, Assistant Professor, Computer Science Department, RITS, Bhopal, M.P. (jain_susheel65@yahoo.com). Anurag Jain, H.O.D., Computer Science Department, RITS, Bhopal, M.P. (anurag.akjain@gmail.com).

I. INTRODUCTION

Duplicate record detection finds records that represent the same entity and merges them into a single record. A bibliographic database, for instance, sometimes contains multiple records representing the same item, and duplicate record detection must find those records and unify them to keep the database clean. The underlying technique of duplicate record detection is record matching: because a record usually consists of a set of fields, record matching consists of measuring the similarity of the values of corresponding fields between two records and combining the field similarities into a record similarity. Databases frequently contain field values and records that refer to the same entity but are not syntactically identical. Variations in representation can arise from typographical errors, misspellings, and abbreviations, as well as from the integration of multiple data sources. The main purpose of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In slightly ironic fashion, the same problem has multiple names across research communities. In the database community, the problem is described as merge/purge [1], data deduplication [2], and instance identification [3]; in the AI community, the same problem is described as database hardening [4] and name matching [5].
The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. In this paper we use the term duplicate record detection.

II. DATA PREPARATION

Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. The data preparation stage is the first step in this process; during it, data entries are stored in a uniform manner in the database, resolving the structural heterogeneity problem. The data preparation stage includes a parsing step, a data transformation step, and a standardization step. These steps improve the quality of the in-flow data and make the data comparable and more usable.

Fig. 1. Steps in data preparation: parsing, data transformation, and data standardization.
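To make these three steps concrete, the following minimal Python sketch parses a raw name field into components, applies simple transformations, and standardizes the format. The specific rules (title stripping, letter-only filtering, capitalization) are illustrative assumptions, not those of any particular system.

import re

def standardize_name(raw):
    # Transformation: lowercase and strip honorific titles
    # (the title list here is an illustrative assumption).
    s = raw.strip().lower()
    s = re.sub(r"\b(dr|mr|mrs|ms|prof)\.?\s+", "", s)
    # Transformation: keep letters and spaces only.
    s = re.sub(r"[^a-z\s]", " ", s)
    # Parsing: isolate the individual name components.
    tokens = s.split()
    # Standardization: one canonical format for comparison.
    return " ".join(t.capitalize() for t in tokens)

# Two differently formatted entries now compare equal:
assert standardize_name("DR. John  SMITH,") == standardize_name("john smith")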

Parsing is the first component of data preparation; it locates, identifies, and isolates individual data elements in the source files. Parsing makes it easier to correct, standardize, and match data because it allows the comparison of individual components rather than of long, complex strings of data. For example, the appropriate parsing of name and address components into consistent packets of information is a crucial part of the data cleaning process.

Data transformation refers to simple conversions that can be applied to the data in order for them to conform to the data types of their corresponding domains. This type of conversion focuses mainly on one field at a time, without any consideration of the values in related fields. Examples of data transformation are: 1) conversion of a data element from one data type to another; 2) renaming of a field from one name to another; 3) range checking, which involves examining the data in a field to ensure that it falls within the expected range, usually a numeric or date range.

Data standardization refers to the process of standardizing the information represented in certain fields to a specific content format. It is used for information that can be stored in many ways in various data sources and must be converted to a uniform representation before the duplicate detection process starts. Without standardization, many duplicate entries could be erroneously designated as non-duplicates because common identifying information cannot be compared. One of the most common standardization applications involves address information.

After the data preparation phase, the data are typically stored in tables with comparable fields. The next step is to identify which fields should be compared; for example, it would not be meaningful to compare the contents of the field Last Name with the field Address. Even after parsing, data standardization, and identification of similar fields, it is not trivial to match duplicate records: misspellings and different conventions for recording the same information still result in different, multiple representations of a unique object in the database.

III. TECHNIQUES TO MATCH INDIVIDUAL FIELDS

Typographical variation of string data is one of the most common causes of mismatches in database entries. Hence, duplicate detection typically relies on string comparison techniques to deal with typographical variations. Multiple methods have been developed for this task, based on the various types of errors they handle.

A) Character-Based Similarity Metrics

These metrics are designed to handle typographical errors well. In this section we cover the following metrics.

1) Jaro-Winkler distance

The Jaro distance measure, proposed by Jaro, was developed for comparing names gathered in the U.S. Census. It is specifically designed to compare surnames and given names; it is not meant to compare whole names. Given two strings s1 and s2, calculating the Jaro distance includes the following steps:

i. Compute the lengths of the two strings, |s1| and |s2|.
ii. Find the number m of common characters in the two strings; a character s1[i] is common with s2[j] if s1[i] = s2[j] and the positions i and j differ by no more than max(|s1|, |s2|)/2 - 1.
iii. Find the number of transpositions; a transposition occurs when a common character of one string is out of order with respect to the corresponding common characters of the other string.

The Jaro distance is then

Jaro(s1, s2) = (1/3) * (m/|s1| + m/|s2| + (m - t)/m),

where m is the number of matching characters and t is half the number of transpositions.
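The following is a minimal Python sketch of the Jaro measure as defined above; the Winkler variant additionally boosts the score of strings sharing a common prefix, which this sketch omits.

def jaro(s1, s2):
    if not s1 or not s2:
        return 0.0
    # Matching window from step ii: max(|s1|, |s2|)/2 - 1.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    flags1 = [False] * len(s1)
    flags2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):               # step ii: common characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                              # step iii: transpositions
    for i, matched in enumerate(flags1):
        if matched:
            while not flags2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2                                  # t is half the out-of-order count
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0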
2) Smith-Waterman distance

Smith and Waterman described an extension of edit distance and affine gap distance in which mismatches at the beginning and the end of strings have lower costs than mismatches in the middle. For example, this method can correctly identify "Rowling J" as being similar to "Rowling James": when the two strings are compared, the common substring "Rowling" is extracted, and the algorithm assigns low cost to the characters at the start and end. If the resulting cost is less than a threshold value, the strings are classified as similar, and otherwise as dissimilar. The threshold value is user-configurable and depends on the algorithm being used for duplicate detection.

3) Q-gram distance

A Q-gram is a substring of a text where the length of the substring is Q. The idea is to break each string into tokens of Q-grams, compare the two token sets, and count the number of matches between the strings. Additional padding is added at the start and end of each string so that the leading and trailing characters are not ignored in the comparison. Q-gram algorithms are not strictly phonetic matching, since they do not operate on the phonetic characteristics of words; instead, Q-grams can be thought of as computing the distance, or amount of difference, between two words. The Q-gram algorithm [6] is highly useful because it can match misspelled or mutated words even if they are phonetically disparate. As an example, comparing the two strings Smith and Zemith with Q = 2 (the length of the substring) yields the following Q-grams:

SMITH:  #S SM MI IT TH H#
ZEMITH: #Z ZE EM MI IT TH H#

Example of Q-grams when Q = 2.

We get a match of four Q-grams in this case (MI, IT, TH, and H#); the higher the number of matches, the more similar the words. This approach includes the first and last letters of the word (in contrast to, e.g., the Smith-Waterman distance).

4) Affine gap distance

The edit distance metric does not work well when matching strings that have been truncated (e.g., "John R. Smith" versus "Jonathan Richard Smith"). The affine gap distance metric [7] offers a solution to this problem by introducing two extra edit operations: open gap and extend gap. The cost of extending a gap is usually smaller than the cost of opening one, and this results in smaller cost penalties for gap mismatches than the equivalent cost under the edit distance metric. Using dynamic programming, the affine gap distance of two strings s1 and s2 can be computed in O(|s1| * |s2|) time. Later work, building on learnable models proposed for edit distance, describes how to train an edit distance model with affine gaps.

5) Edit distance

The edit distance between two strings s1 and s2 is the minimum number of single-character edit operations needed to transform s1 into s2. There are three types of edit operations: i) insert a character into the string; ii) delete a character from the string; iii) replace one character with a different character. In the simplest form, each edit operation has unit cost. The basic dynamic programming algorithm [8] for computing the edit distance between two strings takes O(|s1| * |s2|) time. There are also algorithms that detect in O(k * max{|s1|, |s2|}) time whether two strings are within edit distance k; notice that if ||s1| - |s2|| > k, then by definition the two strings do not match within distance k, so such algorithms target the non-trivial case ||s1| - |s2|| <= k. Extensions of the original edit distance model allow different costs for different edit operations.
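Two of the metrics above, Q-gram overlap and unit-cost edit distance, can be sketched in a few lines of Python. This is a minimal illustration of the definitions in this section, not an optimized implementation.

from collections import Counter

def qgrams(s, q=2, pad="#"):
    # Pad both ends so leading and trailing characters are not ignored.
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_matches(s1, s2, q=2):
    # Multiset overlap of Q-grams, as in the Smith/Zemith example.
    c1, c2 = Counter(qgrams(s1.lower(), q)), Counter(qgrams(s2.lower(), q))
    return sum((c1 & c2).values())

def edit_distance(s1, s2):
    # Basic O(|s1| * |s2|) dynamic program with unit-cost operations.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (c1 != c2)))    # replace
        prev = cur
    return prev[-1]

# qgram_matches("Smith", "Zemith") == 4; edit_distance("Smith", "Zemith") == 2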
B) Numeric Similarity Metrics

While numerous approaches have been developed for comparing strings, the methods for comparing numeric values are relatively primitive. In most cases where it makes sense to compare numbers, a basic comparison is sufficient, and queries can be developed to extract numeric data with ease.

There has been continuing research on using cosine similarity and other algorithms to analyze numeric data. For example, numeric data can be compared with primitive operators such as equals and greater-than, which can also be used to calculate the difference between two numeric values.

C) Token-Based Similarity Metrics

Character-based similarity metrics handle typographical errors well, but typographical conventions often lead to the rearrangement of words (e.g., "John Smith" versus "Smith John"). In such cases, character-level metrics fail to capture the similarity of the entities.

1) Q-grams with tf.idf

The WHIRL approach can be extended to handle spelling errors by using Q-grams, instead of words, as tokens. In this setting, a spelling error minimally affects the set of common Q-grams of two strings, so two strings such as "Gateway Communications" and "Comunications Gateway" have high similarity under this metric despite the block move and the spelling error. This metric also handles the insertion and deletion of words nicely: the string "Gateway Communications" matches the string "Communications Gateway International" with high similarity, since the Q-grams of the word "International" appear often in the relation and therefore have low weight.

2) Atomic strings

A basic algorithm matches text fields based on atomic strings. An atomic string is a sequence of alphanumeric characters delimited by punctuation characters. Two atomic strings match if they are equal or if one is a prefix of the other. Under this algorithm, the similarity of two fields is the number of their matching atomic strings divided by their average number of atomic strings.

3) WHIRL

WHIRL adopts from information retrieval the cosine similarity combined with the tf.idf weighting scheme to compute the similarity of two fields. Cohen separates each string s into words, and each word w is assigned a weight

v_s(w) = log(tf_w + 1) * log(idf_w),

where tf_w is the number of times w appears in the field and idf_w = |D| / n_w, with n_w the number of records in the database D that contain w. The weight for a word w in a field is high if w appears a large number of times in the field (large tf_w) and w is a sufficiently rare term in the database (large idf_w). For example, in a collection of company names, relatively infrequent terms such as "AT&T" or "IBM" will have higher weights than more frequent terms such as "Inc." The cosine similarity of s1 and s2 is defined as

sim(s1, s2) = sum_w v_s1(w) * v_s2(w) / (||v_s1|| * ||v_s2||).

The cosine similarity metric works well for a large variety of entries and is insensitive to the location of words, thus allowing natural word moves and swaps (e.g., "John Smith" is equivalent to "Smith John").
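A minimal sketch of this weighting and similarity computation in Python follows. The log-based tf.idf weighting shown here is one common variant and is an assumption; it does not reproduce WHIRL's exact weighting details.

import math
from collections import Counter

def tfidf_vectors(fields):
    # Build one normalized tf.idf vector per string field in the database.
    tokenized = [f.lower().split() for f in fields]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(fields)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        v = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors

def cosine(v1, v2):
    # Dot product of normalized sparse vectors = cosine similarity.
    return sum(x * v2.get(w, 0.0) for w, x in v1.items())

# The vectors for "John Smith" and "Smith John" are identical,
# so word order does not affect the similarity.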
D) Phonetic Similarity Metrics

Character-level and token-based similarity metrics focus on the string-based representation of database records. However, strings may be phonetically similar even if they are not similar at the character or token level. For example, the word Kageonne is phonetically similar to Cajun despite the fact that the string representations are very different. Phonetic similarity metrics try to address such issues and match such strings.

1) Metaphone and Double Metaphone

Philips [9] suggested the Metaphone algorithm as a better alternative to Soundex; it uses 16 consonant sounds that can describe a large number of sounds used in many English and non-English words. Double Metaphone [10] is an improved version of Metaphone that refines some of the encoding choices made in the initial Metaphone and allows multiple encodings for names that have various possible pronunciations. For such cases, all possible encodings are tested when trying to retrieve similar names. The introduction of multiple phonetic encodings greatly enhances matching performance with rather small overhead.

2) Soundex

Soundex can be described as a hashing mechanism for English words. It converts a word into a four-character string consisting of the first letter of the word and three numbers calculated by the hash function. This code describes how a word sounds and can thus be used to compare and find similar-sounding words. The steps for deriving the American Soundex code [11] are given below.

1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
2. Replace the remaining consonants (after the first letter) with digits as follows:
   b, f, p, v -> 1
   c, g, j, k, q, s, x, z -> 2
   d, t -> 3
   l -> 4
   m, n -> 5
   r -> 6
3. Code two adjacent letters with the same number as a single number.
4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using the above steps we can derive, for example, the Soundex code E235 for Eastman, E236 for Easterman, and W235 for Westminster. By comparing the generated Soundex codes we can group words that sound similar: Eastman and Easterman can be grouped together, as their Soundex codes are nearer to each other than to the Soundex code of Westminster.
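A minimal Python sketch of these four steps follows; it implements the simplified American Soundex rules exactly as listed above (standard implementations add further special cases, e.g. for h and w between consonants, that the steps above omit).

def soundex(name):
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"),
                           ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
        for c in letters:
            codes[c] = digit
    name = name.lower()
    first = name[0].upper()                      # step 1: retain the first letter
    digits = [codes.get(c, "") for c in name]    # step 2: consonants to digits
    out = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:                      # step 3: collapse adjacent repeats
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]    # step 4: pad with 0s, length 4

# soundex("Eastman") == "E235", soundex("Easterman") == "E236",
# soundex("Westminster") == "W235"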
3) Oxford Name Compression Algorithm (ONCA)

ONCA [12] is a two-stage technique designed to overcome most of the unsatisfactory features of pure Soundex-ing while retaining the convenient four-character fixed-length format. In the first stage, ONCA uses a British version of the NYSIIS method of compression. In the second stage, the transformed and partially compressed name is Soundexed in the usual way. This two-stage technique has been used successfully for grouping similar names together.

4) New York State Identification and Intelligence System (NYSIIS)

The NYSIIS system, proposed by Taft [13], differs from Soundex in that it retains information about the position of vowels in the encoded word by converting most vowels to the letter A. Furthermore, NYSIIS does not use numbers to replace letters; instead, it replaces consonants with other, phonetically similar letters, thus returning a purely alphabetic code (no numeric component). Usually, the NYSIIS code for a surname is based on a maximum of nine letters of the full alphabetical name, and the NYSIIS code itself is limited to six characters. Taft [13] compared Soundex with NYSIIS using a name database of New York State and concluded that NYSIIS is more accurate than Soundex for locating surnames. The NYSIIS encoding system is still used today by the New York State Division of Criminal Justice Services.

E) Summary

Various techniques have been applied for matching fields with string data in duplicate detection. Character-based similarity metrics handle typographical errors, while token-based similarity metrics work well for rearranged strings that have the same meaning. Phonetic similarity metrics try to match strings that are phonetically similar, and numeric similarity metrics are used to capture the similarity of numeric data.

IV. FINDING DUPLICATE RECORDS

We have seen various methods that can be used to compare strings or individual fields, with a metric quantifying their similarity or lack of it. When these methods are applied to real-world situations, where the data are multivariate and the number of fields is as dynamic as the data itself, the problem of duplicate detection becomes more complicated. Numerous papers and approaches address this issue; they can be classified as follows.

A) Rule-Based Techniques

A special case of distance-based approaches [14], [15] is the use of rules to define whether two records are the same or not. Rule-based approaches can be considered as distance-based techniques where the distance of two records is either 0 or 1. It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy. However, the required tuning demands extremely high manual effort from the human experts, and this effort makes the deployment of such systems difficult in practice. Currently, the typical approach is to use a system that generates matching rules from training data and then to manually tune the automatically generated rules.

In one such framework, the mapping transformation standardizes data, the matching transformation finds pairs of records that probably refer to the same real object, the clustering transformation groups together matching pairs with a high similarity value, and finally the merging transformation collapses each individual cluster into a tuple of the resulting data source.

B) Active-Learning-Based Techniques

A problem with supervised learning techniques is the requirement for a large number of training examples. While it is easy to create a large number of training pairs that are either clearly non-duplicates or clearly duplicates, it is very difficult to generate the ambiguous cases that would help create a highly accurate classifier. Active-learning methods address this by creating multiple classifiers, trained using slightly different data or parameters, to detect ambiguous cases and then ask the user for feedback. The key innovation in this work is the creation of several redundant functions and the concurrent exploitation of their conflicting decisions in order to discover new kinds of inconsistencies among duplicates in the data set.

C) Probabilistic Matching Models

Let A and B be two tables with n comparable fields. In the duplicate detection problem, each tuple pair is assigned to one of two classes, M and U. The class M contains the record pairs that represent the same entity ("match"), and the class U contains the record pairs that represent two different entities ("non-match"). Each tuple pair is represented as a random vector x = [x_1, ..., x_n]^T with n components that correspond to the n comparable fields of A and B. Newcombe et al. [23], [24] were the first to recognize duplicate detection as a Bayesian inference problem, and this formulation is now very common in the duplicate detection literature. The comparison vector x is the input to a decision rule that assigns x to U or to M. The assumption is that x is a random vector whose density function differs for the two classes.

1) Naive Bayes rule

Conditional independence assumption: p(x_i | M) and p(x_j | M) are independent if i is not equal to j.
Goal: to compute the distributions p(x | M) and p(x | U).
Under the naive Bayes rule,

p(x | M) = prod_i p(x_i | M) and p(x | U) = prod_i p(x_i | U).

Using a training set of pre-labeled record pairs, the values of p(x_i | M) and p(x_i | U) are computed.
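A minimal sketch of this estimation in Python follows, using binary comparison vectors; the Laplace smoothing is an added assumption to keep the estimates (and the likelihood ratio used in the decision rules below) finite.

import math

def train_naive_bayes(labeled_pairs, n_fields):
    # labeled_pairs: iterable of (x, label), where x is a tuple of 0/1
    # field-match indicators and label is "M" or "U".
    counts = {"M": [0] * n_fields, "U": [0] * n_fields}
    totals = {"M": 0, "U": 0}
    for x, label in labeled_pairs:
        totals[label] += 1
        for i, xi in enumerate(x):
            counts[label][i] += xi
    # p(x_i = 1 | class), with Laplace smoothing (an assumption).
    return {lbl: [(c + 1) / (totals[lbl] + 2) for c in counts[lbl]]
            for lbl in counts}

def log_likelihood_ratio(x, p):
    # log l(x) = log p(x|M) - log p(x|U) under conditional independence.
    def log_px(probs):
        return sum(math.log(pi if xi else 1.0 - pi)
                   for xi, pi in zip(x, probs))
    return log_px(p["M"]) - log_px(p["U"])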
2) Winkler methodology

Because conditional independence is often not a reasonable assumption, Winkler [16] suggested estimating p(x | M) and p(x | U) using the expectation-maximization (EM) algorithm. Winkler identified five conditions under which the unsupervised EM algorithm works well, namely: i) the data contain a relatively large percentage of matches; ii) the matching pairs are well separated from the other classes; iii) the rate of typographical errors is low; iv) there are sufficiently many redundant identifiers to overcome errors in other fields of the record; and v) the estimates computed under the conditional independence assumption result in good classification performance. Winkler showed that this unsupervised EM approach works well even when a limited number of interactions are allowed between the variables. It is interesting to note that the results under the independence assumption are not considerably worse than those of the EM model that allows variable interactions.

3) The Bayes decision rule for minimum error

Assumption: x is a comparison vector randomly drawn from the comparison space that corresponds to a record pair.
Goal: to determine whether the pair belongs to M or to U.
Decision rule:

x in M if p(M | x) >= p(U | x); otherwise x in U.  (1)

Decision rule (1) states that if the probability of the match class M given the comparison vector x is larger than the probability of the non-match class U, then x is classified to M, and vice versa.

Writing the rule in terms of the class-conditional densities gives the Bayes decision rule:

x in M if l(x) = p(x | M) / p(x | U) >= p(U) / p(M); otherwise x in U.  (2)

The ratio l(x) = p(x | M) / p(x | U) is called the likelihood ratio, and the ratio p(U) / p(M) denotes the threshold value of the likelihood ratio for the decision. Equation (2) is known as the Bayes test for minimum error; it can be shown that the Bayes test results in the smallest probability of error and is, in that respect, an optimal classifier. This holds only when the distributions p(x | M), p(x | U) and the priors p(U), p(M) are known.

4) Binary model

i) The probabilistic model can also be used without training data.
ii) Jaro introduced a binary model for the values of x_i such that x_i = 1 if field i matches, and x_i = 0 otherwise.
iii) He suggested calculating the probabilities p(x_i = 1 | M) using an expectation-maximization (EM) algorithm; the probabilities p(x_i = 1 | U) can be calculated by taking random pairs of records.

D) Unsupervised Learning

One way to avoid manual labeling of the comparison vectors is to use clustering algorithms to group together similar comparison vectors. The idea behind most unsupervised learning approaches for duplicate detection is that similar comparison vectors correspond to the same class; this idea has its roots in the probabilistic model. One approach uses a bootstrapping technique based on clustering to learn matching models. The basic idea, also known as co-training [17], is to use very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels. Each entry of the comparison vector corresponds to the result of a field comparison; the comparison space is then partitioned into clusters using the AutoClass [18] clustering tool. The basic premise is that each cluster contains comparison vectors with similar characteristics, so all the record pairs in a cluster belong to the same class (matches, non-matches, or possible matches).

E) Supervised and Semi-Supervised Learning

Supervised learning systems rely on the existence of training data in the form of record pairs pre-labeled as matching or not. One set of supervised learning techniques treats each record pair (a, b) independently, similarly to the probabilistic techniques. The well-known CART algorithm [19] generates classification and regression trees; a linear discriminant algorithm generates a linear combination of the parameters for separating the data according to their classes; and a vector quantization approach is a generalization of nearest-neighbor algorithms. The transitivity assumption can sometimes result in inconsistent decisions: for example, (a, b) and (a, c) can be considered matches but (b, c) not. Partitioning such inconsistent graphs with the goal of minimizing inconsistencies is an NP-complete problem.

F) Bigram Indexing

The bigram indexing (BI) method, as implemented in the Febrl [20] record linkage system, allows for fuzzy blocking. The basic idea is that each blocking key value is converted into a list of bigrams (substrings containing two characters), and sub-lists of all possible permutations are built using a threshold (between 0.0 and 1.0). The resulting bigram sub-lists are sorted and inserted into an inverted index, which is used to retrieve the corresponding record numbers in a block. The number of sub-lists created for a blocking key value depends both on the length of the value and on the threshold.
The lower the threshold, the shorter the sub-lists, but also the more sub-lists there will be per blocking key value, resulting in more (and smaller) blocks in the inverted index. In the information retrieval field, bigram indexing has been found to be robust to small typographical errors in documents.
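The following Python sketch illustrates the idea; the sub-list length rule (all combinations of ceil(n * threshold) bigrams out of n) follows the description above, but details such as key formatting are assumptions rather than Febrl's exact behavior.

import math
from itertools import combinations

def bigram_keys(value, threshold=0.8):
    # Convert a blocking key value into bigrams, then form all sorted
    # sub-lists of length ceil(n * threshold); each becomes an index key.
    bigrams = [value[i:i + 2] for i in range(len(value) - 1)]
    k = max(1, math.ceil(len(bigrams) * threshold))
    return {" ".join(sorted(sub)) for sub in combinations(bigrams, k)}

def build_inverted_index(records):
    # Inverted index: sub-list key -> set of record numbers in that block.
    index = {}
    for rec_id, value in enumerate(records):
        for key in bigram_keys(value.lower()):
            index.setdefault(key, set()).add(rec_id)
    return index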

G) Distance-Based Techniques

Active learning techniques require some training data or some human effort to create the matching models. In the absence of such training data, or of the ability to get human input, supervised and active learning techniques are not appropriate. One way of avoiding the need for training data is to define a distance metric for records that does not need tuning through training data. Distance-based approaches [14] that conflate each record into one big field, however, may ignore important information that can be used for duplicate detection. A simple approach is to measure the distance between individual fields, using the appropriate distance metric for each field, and then compute the weighted distance between the records. In this case, the problem is the computation of the weights, and the overall setting becomes very similar to the probabilistic setting. One of the problems of distance-based techniques is the need to define an appropriate value for the matching threshold. In the presence of training data, it is possible to find the appropriate threshold value; however, that would nullify the major advantage of distance-based techniques, which is the ability to operate without training data.
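The weighted-distance idea can be sketched as follows; the field weights and threshold shown in the usage comment are hypothetical, and the jaro function from the earlier character-based sketch is reused as one possible per-field metric.

def record_distance(rec1, rec2, field_metrics, weights):
    # Combine per-field distances (each in [0, 1]) with field weights.
    return sum(w * field_metrics[f](rec1[f], rec2[f])
               for f, w in weights.items())

# Hypothetical usage, reusing jaro() from the character-based sketch:
#   metrics = {"name": lambda a, b: 1 - jaro(a, b),
#              "city": lambda a, b: 1 - jaro(a, b)}
#   d = record_distance(rec_a, rec_b, metrics, {"name": 0.7, "city": 0.3})
#   is_duplicate = d < 0.25   # the matching threshold must still be chosen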
H) Summary

Finding duplicate records in a web database is a complicated task, and several approaches address it. As the name suggests, probabilistic matching uses likelihood ratio theory to classify pairs as duplicates. The supervised and semi-supervised approaches need training data in the form of record pairs pre-labeled as matching or not, while unsupervised learning requires no formal training data. Active learning methods are used in duplicate detection systems to find ambiguous pairs. In distance-based techniques, the distances between individual fields are measured using appropriate distance metrics, while rule-based approaches use rules to define whether two records are similar or not.

V. DUPLICATE DETECTION TOOLS

In this section we review duplicate detection packages, focusing on tools that have an open architecture and allow users to understand the underlying mechanics of the matching mechanisms.

The Febrl system (Freely Extensible Biomedical Record Linkage) is an open source data cleaning toolkit with two main components: the first deals with data standardization, and the second performs the actual duplicate detection. The data standardization relies mainly on hidden Markov models (HMMs); therefore, Febrl typically requires training to correctly parse the database entries. For duplicate detection, Febrl implements a variety of string similarity metrics, such as Jaro, edit distance, and Q-gram distance. Finally, Febrl supports phonetic encoding (Soundex, NYSIIS, and Double Metaphone) to detect similar names; since phonetic similarity is sensitive to errors in the first letter of a name, Febrl also computes the phonetic similarity of the reversed name string, sidestepping the first-letter sensitivity problem.

TAILOR [21] is a flexible record matching toolbox that allows users to apply different duplicate detection methods to their data sets. The flexibility of using multiple models is useful when users do not know which duplicate detection model will perform most effectively on their particular data. TAILOR follows a layered design, separating comparison functions from the duplicate detection logic. Furthermore, the execution strategies that improve efficiency are implemented in a separate layer, making the system more extensible than systems that rely on monolithic designs. Finally, TAILOR reports statistics, such as estimated accuracy and completeness, which can help users better understand the quality of a given duplicate detection execution over a new data set.

WHIRL is a duplicate record detection system available for free for academic and research use. WHIRL uses the tf.idf token-based similarity metric to identify similar strings within two lists. The Flamingo Project is a similar tool providing a simple string matching facility that takes as input two string lists and returns the string pairs that are within a pre-specified edit distance threshold. WizSame by WizSoft is also a product that allows the discovery of duplicate records in a database; its matching algorithm is very similar to Soft TF.IDF: two records match if they contain a significant fraction of identical or similar words, where similar words are those within edit distance one.

Bigmatch [22] is the duplicate detection program used by the US Census Bureau. It relies on blocking strategies to identify potential matches between the records of two relations and scales well for very large data sets. The only requirement is that one of the two relations should fit in memory, and it is possible to fit even relations with 100 million records in memory. The main goal of Bigmatch is not to perform sophisticated duplicate detection, but rather to generate a set of candidate pairs that can then be processed by more sophisticated duplicate detection algorithms.

Finally, we should note that many database vendors (Oracle, IBM, and Microsoft) currently do not provide sufficient tools for duplicate record detection. Most of their efforts until now have focused on creating easy-to-use ETL tools that can standardize database records and fix minor errors, mainly in the context of address data.

Another typical function of the tools provided today is the ability to use reference tables to standardize the representation of entities that are well known to have multiple representations.

VI. FUTURE DIRECTIONS AND CONCLUSIONS

In this paper we have presented a comprehensive survey of the existing techniques for detecting non-identical duplicate entries in database records. As database systems become more and more commonplace, data cleaning is going to be the cornerstone for correcting errors in systems that accumulate vast amounts of errors on a daily basis. Despite the breadth and depth of the presented techniques, we believe that there is still room for substantial improvement in the current state of the art.

First of all, it is currently unclear which metrics and techniques represent the state of the art. The lack of standardized, large-scale benchmarking data sets is a big obstacle for the further development of the field, as it is almost impossible to convincingly compare new techniques with existing ones. A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they can evaluate their methods during the development process. Along with benchmark and evaluation data, various systems need some form of training data to produce the initial matching model. Although small data sets are available, we are not aware of large-scale, validated data sets that could be used as benchmarks. Winkler highlights techniques for deriving data sets that are properly anonymized and are still useful for duplicate record detection purposes.

Currently there are two main approaches to duplicate record detection. Research in databases emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records; such techniques typically do not rely on the existence of training data and emphasize efficiency over effectiveness. Research in machine learning and statistics, on the other hand, aims to develop more sophisticated matching techniques that rely on probabilistic models. An interesting direction for future research is to develop techniques that combine the best of both worlds.

Most of the duplicate detection systems available today offer various algorithmic approaches for speeding up the duplicate detection process. The changing nature of the duplicate detection process also requires adaptive methods that detect different patterns for duplicate detection and automatically adapt themselves over time. For example, a background process could monitor the incoming data and any data sources that need to be merged or matched, and decide, based on the observed errors, whether a revision of the duplicate detection process is necessary. Another related challenge is to develop methods that permit the user to estimate the proportion of errors expected in a data cleaning project. Finally, large amounts of structured information are now derived from unstructured text and from the Web; this information is typically imprecise and noisy, and duplicate record detection techniques are crucial for improving the quality of the extracted data.

REFERENCES

[1] M.A. Hernandez and S.J. Stolfo, "Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, Jan.
[2] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf.
Knowledge Discovery and Data Mining (KDD '02).
[3] Y.R. Wang and S.E. Madnick, "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems," Proc. Fifth IEEE Int'l Conf. Data Eng. (ICDE '89).
[4] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00).
[5] M. Bilenko, R.J. Mooney, W.W. Cohen, P. Ravikumar, and S.E. Fienberg, "Adaptive Name Matching in Information Integration," IEEE Intelligent Systems, vol. 18, no. 5, Sept./Oct.
[6] E. Ukkonen, "Approximate String-Matching with Q-grams and Maximal Matches," Theoretical Computer Science, vol. 92, no. 1, 1992.
[7] M.S. Waterman, T.F. Smith, and W.A. Beyer, "Some Biological Sequence Metrics," Advances in Math., vol. 20, no. 4.
[8] G. Navarro, "A Guided Tour to Approximate String Matching," ACM Computing Surveys, vol. 33, no. 1.
[9] L. Philips, "Hanging on the Metaphone," Computer Language Magazine, vol. 7, no. 12, Dec. 1990.
[10] L. Philips, "The Double Metaphone Search Algorithm," C/C++ Users J., vol. 18, no. 5, June.
[11]

[12] L.E. Gill, "OX-LINK: The Oxford Medical Record Linkage System," Proc. Int'l Record Linkage Workshop and Exposition.
[13] R.L. Taft, "Name Search Techniques," Technical Report Special Report No. 1, New York State Identification and Intelligence System, Albany, N.Y., Feb.
[14] M. Bilenko and R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proc. ACM SIGKDD.
[15] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. ACM SIGKDD.
[16] W.E. Winkler, "The State of Record Linkage and Current Research Problems," Technical Report, Statistical Research Report Series RR99/04, US Bureau of the Census, Washington, D.C.
[17] A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," COLT '98: Proc. 11th Ann. Conf. Computational Learning Theory.
[18] P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press.
[19] J. Cho, N. Shivakumar, and H. Garcia-Molina, "Finding Replicated Web Collections," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00).
[20] P. Christen, T. Churches, and M. Hegland, "Febrl: A Parallel Open Source Data Linkage System," Advances in Knowledge Discovery and Data Mining, Springer.
[21] M.G. Elfeky, A.K. Elmagarmid, and V.S. Verykios, "TAILOR: A Record Linkage Tool Box," Proc. 18th IEEE Int'l Conf. Data Eng. (ICDE '02).
[22] W.E. Yancey, "Bigmatch: A Program for Extracting Probable Matches from a Large File for Record Linkage," Technical Report, Statistical Research Report Series RRC2002/01, US Bureau of the Census, Washington, D.C., Mar.
[23] H.B. Newcombe, J.M. Kennedy, S. Axford, and A. James, "Automatic Linkage of Vital Records," Science, vol. 130, no. 3381, Oct.
[24] H.B. Newcombe and J.M. Kennedy, "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information," Comm. ACM, vol. 5, no. 11, Nov.

Dewendra Bharambe is an M.Tech. scholar in Computer Science Engineering at R.I.T.S., Bhopal, under R.G.T.U., Bhopal, India. He is working as a lecturer at J.T. Mahajan College of Engineering, Faizpur.

Susheel Jain is an Assistant Professor in the Computer Science Department of R.I.T.S., Bhopal, M.P. He received his M.Tech. in Software Engineering from Gautam Buddh Technical University, Lucknow, India.

Anurag Jain is the H.O.D. of the Computer Science Department of R.I.T.S., Bhopal, M.P. He received his M.Tech. in Computer Science and Engineering from Barkatullah University, Bhopal, India.


Privacy Aspects in Big Data Integration: Challenges and Opportunities Privacy Aspects in Big Data Integration: Challenges and Opportunities Peter Christen Research School of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

More information

Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Data De-duplication: A Review

Data De-duplication: A Review Data De-duplication: A Review Gianni Costa and Alfredo Cuzzocrea and Giuseppe Manco and Riccardo Ortale Abstract The wide exploitation of new techniques and systems for generating, collecting and storing

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc.

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc. Copyright 2015 Pearson Education, Inc. Technology in Action Alan Evans Kendall Martin Mary Anne Poatsy Eleventh Edition Copyright 2015 Pearson Education, Inc. Technology in Action Chapter 9 Behind the

More information

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

Selective dependable storage services for providing security in cloud computing

Selective dependable storage services for providing security in cloud computing Selective dependable storage services for providing security in cloud computing Gade Lakshmi Thirupatamma*1, M.Jayaram*2, R.Pitchaiah*3 M.Tech Scholar, Dept of CSE, UCET, Medikondur, Dist: Guntur, AP,

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016 Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

More information

An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

An Improving Genetic Programming Approach Based Deduplication Using KFINDMR An Improving Genetic Programming Approach Based Deduplication Using KFINDMR P.Shanmugavadivu #1, N.Baskar *2 # Department of Computer Engineering, Bharathiar University Sri Ramakrishna Polytechnic College,

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Advanced record linkage methods: scalability, classification, and privacy

Advanced record linkage methods: scalability, classification, and privacy Advanced record linkage methods: scalability, classification, and privacy Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University

More information

Web Data Extraction: 1 o Semestre 2007/2008

Web Data Extraction: 1 o Semestre 2007/2008 Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008

More information

Effective Data Mining Using Neural Networks

Effective Data Mining Using Neural Networks IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 6, DECEMBER 1996 957 Effective Data Mining Using Neural Networks Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu,

More information

Creating Relational Data from Unstructured and Ungrammatical Data Sources

Creating Relational Data from Unstructured and Ungrammatical Data Sources Journal of Artificial Intelligence Research 31 (2008) 543-590 Submitted 08/07; published 03/08 Creating Relational Data from Unstructured and Ungrammatical Data Sources Matthew Michelson Craig A. Knoblock

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor

More information

Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences

Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Nicoletta Cibella 1, Gervasio-Luis Fernandez 2, Marco Fortini 1, Miguel Guigò 2, Francisco Hernandez 2,

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Load Distribution in Large Scale Network Monitoring Infrastructures

Load Distribution in Large Scale Network Monitoring Infrastructures Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti

More information

Scalable Parallel Clustering for Data Mining on Multicomputers

Scalable Parallel Clustering for Data Mining on Multicomputers Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model Twinkle Patel, Ms. Ompriya Kale Abstract: - As the usage of credit card has increased the credit card fraud has also increased

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey

Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 12 December 2014, Page No. 9766-9773 Duplicate Detection Algorithm In Hierarchical Data Using Efficient

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Collective Entity Resolution In Relational Data

Collective Entity Resolution In Relational Data Collective Entity Resolution In Relational Data Indrajit Bhattacharya and Lise Getoor Department of Computer Science University of Maryland, College Park, MD 20742, USA Abstract An important aspect of

More information