A Comparison of String Distance Metrics for Name-Matching Tasks
|
|
|
- Anna Richard
- 9 years ago
- Views:
Transcription
1 A Comparison of String Distance Metrics for Name-Matching Tasks William W. Cohen Pradeep Ravikumar Stephen E. Fienberg Center for Automated Center for Computer and Learning and Discovery Communications Security Department of Statistics Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University Pittsburgh PA 523 Pittsburgh PA 523 Pittsburgh PA 523 Abstract Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community. Introduction The task of matching entity names has been explored by a number of communities, including statistics, databases, and artificial intelligence. Each community has formulated the problem differently, and different techniques have been proposed. In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (969). This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching. Fellegi and Sunter propose using largely unsupervised methods for this task, based on a featurebased representation of pairs which is manually designed and to some extent problem-specific. These proposals have been, by and large, adopted by subsequent researchers, although often with elaborations of the underlying statistical model (Jaro 989; 995; Winkler 999; Larsen 999; Belin & Rubin 997). These methods have been used to match individuals and/or families between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial census; or between administrative records and survey data bases, e.g., in the creation of an anonymized research data base combining tax information from the Internal Revenue Service and data from the Current Population Survey. In the database community, some work on record matching has been based on knowledge-intensive approaches Copyright c 23, American Association for Artificial Intelligence ( All rights reserved. (Hernandez & Stolfo 995; Galhardas et al. 2; Raman & Hellerstein 2). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 997; 996), and in previous work, we proposed use of the TFIDF distance metric for the same purpose (Cohen 2). In the AI community, supervised learning has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 998; Bilenko & Mooney 22) and combining the results of different distance functions (Tejada, Knoblock, & Minton 2; Cohen & Richman 22; Bilenko & Mooney 22). More recently, probabilistic object identification methods have been adapted to matching tasks (Pasula et al. 22). In these communities there has been more emphasis on developing autonomous matching techniques which require little or or no configuration for a new task, and less emphasis on developing toolkits of methods that can be applied to new tasks by experts. Recently, we have begun implementing an open-source, Java toolkit of name-matching methods which includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names. This experimental use of string distance metrics, while similar to previous experiments in the database and AI communities, is a substantial departure from their usual use in statistics. In statistics, databases tend to have more structure and specification, by design. Thus the statistical literature on probabilistic record linkage represents pairs of entities not by pairs of strings, but by vectors of match features such as names and categories for variables in survey databases. By developing appropriate match features, and appropriate statistical models of matching and non-matching pairs, this approach can achieve better matching performance (at least potentially). The use of string distances considered here is most useful for matching problems with little prior knowledge, or illstructured data. Better string distance metrics might also be useful in the generation of match features in more structured database situations.
2 Methods Edit-distance like functions Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use this terms interchangably, depending on which interpretation is most natural. One important class of distance functions are edit distances, in which distance is the cost of best sequence of edit operations that convert s to t. Typical edit operations are character insertion, deletion, and substitution, and each operation much be assigned a cost. We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex well-tuned distance function, we also consider the Monger-Elkan distance function (Monge & Elkan 996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 998) with particular cost parameters, and scaled to the interval [,]. A broadly similar metric, which is not based on an editdistance model, is the Jaro metric (Jaro 995; 989; Winkler 999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings s = a... a K and t = b... b L, define a character a i in s to be common with t there is a b j = a i in t such that i H j i + H, where H = min( s, t ) 2. Let s = a... a K be the characters in s which are common with t (in the same order they appear in s) and let t = b... b L be analogous; now define a transposition for s, t to be a position i such that a i b i. Let T s,t be half the number of transpositions for s and t. The Jaro similarity metric for s and t is Jaro(s, t) = 3 ( s s + t t + s T s,t s A variant of this due to Winkler (999) also uses the length P of the longest common prefix of s and t. Letting P = max(p, 4) we define Jaro-Winkler(s, t) = Jaro(s, t) + P ( Jaro(s, t)) The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names.) Token-based distance functions Two strings s and t can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets S and T is simply S T S T. TFIDF or Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions. ) cosine similarity, which is widely used in the information retrieval community can be defined as TFIDF(S, T ) = V (w, S) V (w, T ) w S T where TF w,s is the frequency of word w in S, N is the size of the corpus, IDF w is the inverse of the fraction of names in the corpus that contain w, V (w, S) = log(tf w,s + ) log(idf w ) and V (w, S) = V (w, S)/ w V (w, S) 2. Our implementation measures all document frequency statistics from the complete corpus of names to be matched. Following Dagan et al (999), a token set S can be viewed as samples from an unknown distribution P S of tokens, and a distance between S and T can be computed based on these distributions. Inspired by Dagan et al we considered the Jensen-Shannon distance between P S and P T. Letting KL(P Q) be the Kullback-Lieber divergence and letting Q(w) = 2 (P S(w) + P T (w)), this is simply Jensen-Shannon(S, T ) = 2 (KL(P S Q) + KL(P T Q)) Distributions P S were estimated using maximum likelihood, a Dirichlet prior, and a Jelenik-Mercer mixture model (Lafferty & Zhai 2). The latter two cases require parameters; for Dirichlet, we used α =, and for Jelenik-Mercer, we used the mixture coefficient λ =. From the record-linkage literature, a method proposed by Fellegi and Sunter (969) can be easily extended to a token distance. As notation, let A and B be two sets of records to match, let C = A B, let D = A B, and for X = A, B, C, D let P X (w) be the empirical probability of a name containing the word w in the set X. Also let e X be the empirical probability of an error in a name in set X; e X, be the probability of a missing name in set X; e T be the probability of two correct but differing names in A and B; and let e = e A + e B + e T + e A, + e B,. Fellegi and Sunter propose ranking pairs s, t by the odds ratio log( Pr(M s,t) Pr(U s,t) ) where M is the class of matched pairs and U is the class of unmatched pairs. Letting AGREE(s, t, w) denote the event s and t both agree in containing word w, Fellegi and Sunter note that under certain assumptions Pr(M AGREE(s, t, w)) P C (w)( e) Pr(U AGREE(s, t, w)) P A (w) P B (w)( e) If we make two addition assumptions suggested by Fellegi and Sunter, and assume that (a) matches on a word w are independent, and (b) that P A (w) P B (w) P C (w) P D (w), we find that the incremental score for the odds ratio associated with AGREE(s, t, w) is simply log P D (w). In information retrieval terms, this is a simple un-normalized IDF weight. Unfortunately, updating the log-odds score of a pair on discovering a disagreeing token w is not as simple. Estimates are provided by Fellegi and Sunter for
3 P (M AGREE(s, t, w)) and P (U AGREE(s, t, w)), but the error parameters e A, e B,... do not cancel out instead one is left with a constant penalty term, independent of w. Departing slightly from this (and following the intuition that disagreement on frequent terms is less important) we introduce a variable penalty term of k log P D (w), where k is a parameter to be set by the user. In the experiments we used k =, and call this method SFS distance (for Simplified Fellegi-Sunter). Hybrid distance functions Monge and Elkan propose the following recursive matching scheme for comparing two long strings s and t. First, s and t are broken into substrings s = a... a K and t = b... b L. Then, similarity is defined as sim(s, t) = K K L max j= sim (A i, B j ) i= where sim is some secondary distance function. We considered an implementation of this scheme in which the substrings are tokens; following Monge and Elkan, we call this a level two distance function. We experimented with level two similarity functions which used Monge-Elkan, Jaro, and Jaro-Winkler as their base function sim. We also considered a soft version of TFIDF, in which similar tokens are considered as well as tokens in S T. Again let sim be a secondary similarity function. Let CLOSE(θ, S, T ) be the set of words w S such that there is some v T such that dist (w, v) > θ, and for w CLOSE(θ, S, T ), let D(w, T ) = max v T dist(w, v). We define SoftTFIDF(S, T ) = w CLOSE(θ,S,T ) V (w, S) V (w, T ) D(w, T ) In the experiments, we used Jaro-Winkler as a secondary distance and θ =. Blocking or pruning methods In matching or clustering large lists of names, it is not computationally practical to measure distances between all pairs of strings. In statistical record linkage, it is common to group records by some variable which is known a priori to be usually the same for matching pairs. In census work, this grouping variable often names a small geographic region, and perhaps for this reason the technique is usually called blocking. Since this paper focuses on the behavior of string matching tools when little prior knowledge is available, we will use here knowledge-free approaches to reducing the set of string pairs to compare. In a matching task, there are two sets A and B. We consider as candidates all pairs of strings (s, t) A B such that s and t share some substring v which appears in at most a fraction f of all names. In a clustering task, there is one set C, and we consider all candidates (s, t) C C such that s t, and again s and t share some not-too-frequent substring v. For purely token-based Name Src M/C #Strings #Tokens animal M 579 3,6 bird M 377,977 bird2 M 982 4,95 bird3 M bird4 M 79 4,68 business M 239,526 game M 9 5,6 park M 654 3,425 fodorzagrat 2 M 863,846 ucdfolks 3 M census 4 M 84 5,765 UVA 3 C 6 6,845 coraatdv 5 C 96 47,52 Table : Datasets used in experiments. Column 2 indicates the source of the data, and column 3 indicates if it is a matching (M) or clustering (C) problem. Original sources are. (Cohen 2) 2. (Tejada, Knoblock, & Minton 2) 3. (Monge & Elkan 996) 4. William Winkler (personal communication) 5. (McCallum, Nigam, & Ungar 2) communication) methods, the substring v must be a token, and otherwise, it must be a character 4-gram. Using inverted indices this set of pairs can be enumerated quickly. For the moderate-size test sets considered here, we used f =. On the matching datasets above, the token blocker finds between 93.3% and.% of the correct pairs, with an average of 98.9%. The 4-gram blocker also finds between 93.3% and.% of the correct pairs, with an average of 99.%. Data and methodology Experiments The data used to evaluate these methods is shown in Table. Most been described elsewhere in the literature. The coraatdv dataset includes the fields author, title, date, and venue in a single string. The census dataset is a synthetic, census-like dataset, from which only textual fields were used (last name, first name, middle initial, house number, and street). To evaluate a method on a dataset, we ranked by distance all candidate pairs from the appropriate grouping algorithm. We computed the non-interpolated average precision of this ranking, the maximum F score of the ranking, and also interpolated precision at the eleven recall levels.,,...,,.. The non-interpolated average precision of a ranking containing N pairs for a task with m correct matches is N c(i)δ(i) m r= i, where c(i) is the number of correct pairs ranked before position i, and δ(i) = if the pair at rank i is correct and otherwise. Interpolated precision at recall r is c(i) the max i i, where the max is taken over all ranks i such that c(i) m r. Maximum F is max i> F (i), where F (i) is the harmonic mean of recall at rank i (i.e., c(i)/m) and precision at rank i (i.e., c(i)/i).
4 MaxF of TFIDF vs Jaccard vs Jensen-Shannon AvgPrec of TFIDF vs Jaccard vs Jensen-Shannon TFIDF Jensen-Shannon SFS Jaccard MaxF of other AvgPrec of other Figure : Relative performance of token-based measures. Left, max F of methods on matching problems, with points above the line y = x indicating better performance of TFIDF. Middle, same for non-interpolated average precision. Right, precision-recall curves averaged over all matching problems. Smoothed versions of Jensen-Shannon (not shown) are comparable in performance to the unsmoother version. Monge-Elkan Levenstein Smith-Waterman Jaro Jaro-Winkler max F of Monge-Elkan vs Levenstein vs SmithWaterman avgprec of Monge-Elkan vs Levenstein vs SmithWaterman max F of other distance metric Figure 2: Relative performance of edit-distance measures. Left and middle, points above (below) the line y = x indicating better (worse) performance for Monge-Elkan, the system performing best on average. Results for matching We will first consider the matching datasets. As shown in Figure, TFIDF seems generally the best among the tokenbased distance metrics. It does slightly better on average than the others, and is seldom much worse than any other method with respect to interpolated average precision or maximum F. As shown in Figure 2, the situation is less clear for the edit-distance based methods. The Monge-Elkan method does best on average, but the Jaro and Jaro-Winkler methods are close in average performance, and do noticably better than Monge-Elkan on several problems. The Jaro variants are also substantially more efficient (at least in our implementation), taking about / the time of the Monge-Elkan method. (The token-based methods are faster still, averaging less than / the time of the Jaro variants.) As shown in Figure 3, SoftTFIDF is generally the best among the hybrid methods we considered. In general, the time for the hybrid methods is comparable to using the underlying string edit distance. (For instance, the average matching time for SoftTFIDF is about 3 seconds on these problems, versus about 2 seconds for the Jaro-Winkler method, and.2 seconds for pure token-based TFIDF.) Finally, Figure 4 compares the three best performing editdistance like methods, the two best token-based methods, and the two best hybrid methods, using a similar methodology. Generally speaking, SoftTFIDF is the best overall distance measure for these datasets. Results for clustering It should be noted that the test suite of matching problems is dominated by problems from one source eight of the eleven test cases are associated with the WHIRL project and a different distribution of problems might well lead to quite different results. For instance, while the token-based methods perform well on average, they perform poorly on the census-like dataset, which contains many misspellings. As an additional experiment, we evaluated the four bestperforming distance metrics (SoftTFIDF, TFIDF, SFS, and Level 2 Jaro-Winkler) on the two clustering problems, which are taken from sources other than the WHIRL project. Table 2 shows MaxF and non-interpolated average precision for each method on each problem. SoftTFIDF again slightly outperforms the other methods on both of these tasks. UVA CoraATDV Method MaxF AvgPrec MaxF AvgPrec SoftTFIDF TFIDF SFS Level2 J-W Table 2: Results for selected distance metrics on two clustering problems.
5 max F of SoftTFIDF vs Level2 Levenstein vs Level2 Monge-Elkan vs Level2 Jaro avgprec of SoftTFIDF vs Level2 Levenstein vs Level2 Monge-Elkan vs Level2 Jaro SoftTFIDF Level2 Jaro-Winkler Level2 Jaro Level2 Levenstein Level2 Monge-Elkan max F of other distance metric Figure 3: Relative performance of hybrid distance measures on matching problems, relative to the SoftTFIDF metric, which performs best on average. max F of SoftTFIDF avgprec of SoftTFIDF SoftTFIDF Level2 Jaro-Winkler TFIDF SFS Monge-Elkan Jaro Jaro-Winkler max F of other distance metric Figure 4: Relative performance of best distance measures of each type on matching problems, relative to the SoftTFIDF metric. Learning to combine distance metrics Another type of hybrid distance function can be obtained by combining other distance metrics. Following previous researchers (Tejada, Knoblock, & Minton 2; Cohen & Richman 22; Bilenko & Mooney 22) we used a learning scheme to combine several of the distance functions above. Specifically, we represented pairs as feature vectors, using as features the numeric scores of Monge-Elkan, Jaro-Winkler, TFIDF, SFS, and SoftTFIDF. We then trained a binary SVM classifier (using SVM Light (Joachims 22)) using these features, and used its confidence in the match class as a score. Figure 5 shows the results of a three-fold cross-validation on nine of the matching problems. The learned combination generally slightly outperforms the individual metrics, including SoftTFIDF, particularly at extreme recall levels. Note, however, that the learned metric uses much more information: in particular, in most cases it has been trained on several thousand labeled candidate pairs, while the other metrics considered here require no training data. Concluding remarks Recently, we have begun implementing an open-source, Java toolkit of name-matching methods. This toolkit includes a variety of different techniques, as well as the infrastructure to combine techniques readily, and evaluate them systematically on test data. Using this toolkit, we conducted a comparison of several string distances on the tasks of matching and clustering lists of entity names. Many of these were techniques previously proposed in the literature, and some are novel hybrids of previous methods. We compared these accuracy of these methods for use in an automatic matching scheme, in which pairs of names are proposed by a simple grouping method, and then ranked according to distance. Used in this way, we saw that the TFIDF ranking performed best among several token-based distance metrics, and that a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 995; 989) and later extended by Winkler (Winkler 999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro- Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper. Acknowledgements The preparation of this paper was supported in part by National Science Foundation Grant No. EIA-3884 to the National Institute of Statistical Sciences and by a contract from the Army Research Office to the Center for Computer
6 max F of learned metric vs SoftTFIDF avgprec of learned metric vs SoftTFIDF SoftTFIDF Learned metric max F of other distance metric Figure 5: Relative performance of best distance measures of each type on matching problems, relative to a learned combination of the same measures. and Communications Security with Carnegie Mellon University. References Belin, T. R., and Rubin, D. B A method for calibrating false-match rates in record linkage. In Record Linkage 997: Proceedings of an International Workshop and Exposition, U.S. Office of Management and Budget (Washington). Bilenko, M., and Mooney, R. 22. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report Technical Report AI 2-296, Artificial Intelligence Lab, University of Texas at Austin. Available from Cohen, W. W., and Richman, J. 22. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-22). Cohen, W. W. 2. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 8(3): Dagan, I.; Lee, L.; and Pereira, F Similarity-based models of word cooccurrence probabilities. Machine Learning 34(-3). Durban, R.; Eddy, S. R.; Krogh, A.; and Mitchison, G Biological sequence analysis - Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press. Fellegi, I. P., and Sunter, A. B A theory for record linkage. Journal of the American Statistical Society 64:83 2. Galhardas, H.; Florescu, D.; Shasha, D.; and Simon, E. 2. An extensible framework for data cleaning. In ICDE, 32. Hernandez, M., and Stolfo, S The merge/purge problem for large databases. In Proceedings of the 995 ACM SIGMOD. Jaro, M. A Advances in record-linkage methodology as applied to matching the 985 census of Tampa, Florida. Journal of the American Statistical Association 84: Jaro, M. A Probabilistic linkage of large public health data files (disc: P ). Statistics in Medicine 4: Joachims, T. 22. Learning to Classify Text Using Support Vector Machines. Kluwer. Lafferty, J., and Zhai, C. 2. A study of smoothing methods for language models applied to ad hoc information retrieval. In 2 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Larsen, M Multiple imputation analysis of records linked using mixture models. In Statistical Society of Canada Proceedings of the Survey Methods Section, Statistical Society of Canada (McGill University, Montreal). McCallum, A.; Nigam, K.; and Ungar, L. H. 2. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, Monge, A., and Elkan, C The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Monge, A., and Elkan, C An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 997 workshop on data mining and knowledge discovery. Pasula, H.; Marthi, B.; Milch, B.; Russell, S.; and Shpitser, I. 22. Identity uncertainty and citation matching. In Advances in Neural Processing Systems 5. Vancouver, British Columbia: MIT Press. Raman, V., and Hellerstein, J. 2. Potter s wheel: An interactive data cleaning system. In The VLDB Journal, Ristad, E. S., and Yianilos, P. N Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 2(5): Tejada, S.; Knoblock, C. A.; and Minton, S. 2. Learning object identification rules for information integration. Information Systems 26(8): Winkler, W. E The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/4. Available from
Record Deduplication By Evolutionary Means
Record Deduplication By Evolutionary Means Marco Modesto, Moisés G. de Carvalho, Walter dos Santos Departamento de Ciência da Computação Universidade Federal de Minas Gerais 31270-901 - Belo Horizonte
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping
Appears in Proceedings of the 5th International Conference on Data Mining (ICDM-2005) pp. 58-65, Houston, TX, November 2005 Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
DATABASES play an important role in today s IT-based
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 1, JANUARY 2007 1 Duplicate Record Detection: A Survey Ahmed K. Elmagarmid, Senior Member, IEEE, Panagiotis G. Ipeirotis, Member, IEEE
Quality and Complexity Measures for Data Linkage and Deduplication
Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au
1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
How To Find An Alias On Email From A Computer (For A Free Download)
Email Alias Detection Using Social Network Analysis Ralf Hölzer Information Networking Institute Carnegie Mellon University Pittsburgh, PA 1513 [email protected] Bradley Malin School of Computer Science
Introducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems
A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems Agusthiyar.R, 1, Dr. K. Narashiman 2 Assistant Professor (Sr.G), Department of Computer Applications,
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and Günter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Development and User Experiences of an Open Source Data Cleaning, Deduplication and Record Linkage System
Development and User Experiences of an Open Source Data Cleaning, Deduplication and Record Linkage System Peter Christen School of Computer Science The Australian National University Canberra ACT 0200,
Private Record Linkage with Bloom Filters
To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Statistical Feature Selection Techniques for Arabic Text Categorization
Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Creating Probabilistic Databases from Duplicated Data
VLDB Journal manuscript No. (will be inserted by the editor) Creating Probabilistic Databases from Duplicated Data Oktie Hassanzadeh Renée J. Miller Received: 14 September 2008 / Revised: 1 April 2009
Comparison of Standard and Zipf-Based Document Retrieval Heuristics
Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Mining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
A Genetic Programming Approach for Record Deduplication
A Genetic Programming Approach for Record Deduplication L.Chitra Devi 1, S.M.Hansa 2, Dr.G.N.K.Suresh Babu 3 Research Scholars, Dept. of Computer Applications, GKM College of Engineering and Technology,
Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. [email protected]
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and Günter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Segmentation and Classification of Online Chats
Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 [email protected] Abstract One method for analyzing textual chat
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
Machine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
Revenue Optimization with Relevance Constraint in Sponsored Search
Revenue Optimization with Relevance Constraint in Sponsored Search Yunzhang Zhu Gang Wang Junli Yang Dakan Wang Jun Yan Zheng Chen Microsoft Resarch Asia, Beijing, China Department of Fundamental Science,
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
A Neural-Networks-Based Approach for Ontology Alignment
A Neural-Networks-Based Approach for Ontology Alignment B. Bagheri Hariri [email protected] H. Abolhassani [email protected] H. Sayyadi [email protected] Abstract Ontologies are key elements
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt
TF-IDF David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt Administrative Homework 3 available soon Assignment 2 available soon Popular media article
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Enhancing Quality of Data using Data Mining Method
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2, ISSN 25-967 WWW.JOURNALOFCOMPUTING.ORG 9 Enhancing Quality of Data using Data Mining Method Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Febrl A Freely Available Record Linkage System with a Graphical User Interface
Febrl A Freely Available Record Linkage System with a Graphical User Interface Peter Christen Department of Computer Science, The Australian National University Canberra ACT 0200, Australia Email: [email protected]
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Efficient Cluster Detection and Network Marketing in a Nautural Environment
A Probabilistic Model for Online Document Clustering with Application to Novelty Detection Jian Zhang School of Computer Science Cargenie Mellon University Pittsburgh, PA 15213 [email protected] Zoubin
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining [email protected] Outline Predictive modeling methodology k-nearest Neighbor
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey
Large Scale Learning to Rank
Large Scale Learning to Rank D. Sculley Google, Inc. [email protected] Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
Application of Clustering and Association Methods in Data Cleaning
Proceedings of the International Multiconference on ISBN 978-83-60810-14-9 Computer Science and Information Technology, pp. 97 103 ISSN 1896-7094 Application of Clustering and Association Methods in Data
Introduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
Incorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Equity forecast: Predicting long term stock price movement using machine learning
Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long
Query Recommendation employing Query Logs in Search Optimization
1917 Query Recommendation employing Query Logs in Search Optimization Neha Singh Department of Computer Science, Shri Siddhi Vinayak Group of Institutions, Bareilly Email: [email protected] Dr Manish
Discovering Structured Event Logs from Unstructured Audit Trails for Workflow Mining
Discovering Structured Event Logs from Unstructured Audit Trails for Workflow Mining Liqiang Geng 1, Scott Buffett 1, Bruce Hamilton 1, Xin Wang 2, Larry Korba 1, Hongyu Liu 1, and Yunli Wang 1 1 IIT,
Using Data Mining Methods to Predict Personally Identifiable Information in Emails
Using Data Mining Methods to Predict Personally Identifiable Information in Emails Liqiang Geng 1, Larry Korba 1, Xin Wang, Yunli Wang 1, Hongyu Liu 1, Yonghua You 1 1 Institute of Information Technology,
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
A Non-Linear Schema Theorem for Genetic Algorithms
A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland
Cross-Domain Collaborative Recommendation in a Cold-Start Context: The Impact of User Profile Size on the Quality of Recommendation
Cross-Domain Collaborative Recommendation in a Cold-Start Context: The Impact of User Profile Size on the Quality of Recommendation Shaghayegh Sahebi and Peter Brusilovsky Intelligent Systems Program University
Modeling System Calls for Intrusion Detection with Dynamic Window Sizes
Modeling System Calls for Intrusion Detection with Dynamic Window Sizes Eleazar Eskin Computer Science Department Columbia University 5 West 2th Street, New York, NY 27 [email protected] Salvatore
Identifying Best Bet Web Search Results by Mining Past User Behavior
Identifying Best Bet Web Search Results by Mining Past User Behavior Eugene Agichtein Microsoft Research Redmond, WA, USA [email protected] Zijian Zheng Microsoft Corporation Redmond, WA, USA [email protected]
Conditional Random Fields: An Introduction
Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE
www.arpapress.com/volumes/vol13issue3/ijrras_13_3_18.pdf NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE Hebah H. O. Nasereddin Middle East University, P.O. Box: 144378, Code 11814, Amman-Jordan
