An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework
Nishand.K, Ramasami.S, T.Rajendran

Abstract: Record linkage is an important process in data integration, used for merging, matching and duplicate removal across several databases that refer to the same entities. Deduplication is the process of removing duplicate records from a single database. In recent years, data cleaning and standardization have become important steps in data mining tasks. Due to the complexity of today's databases, finding matching records even within a single database is a challenging task. Indexing techniques are used to implement record linkage and deduplication efficiently. In this paper, three indexing techniques, namely blocking indexing, sorting indexing and bigram indexing, are used with a modification of the existing techniques that reduces the variance in the quality of the blocking results. In addition to the indexing techniques, six comparison functions and two classifiers are used. There is potential for large performance speed-ups and better accuracy when indexing techniques are combined with comparison and classification techniques.

Keywords: Record linkage, indexing techniques, data matching, blocking, Febrl framework

I. INTRODUCTION

With many businesses, government agencies and research projects collecting massive amounts of data, techniques that allow efficient processing, analyzing and mining of large databases have in recent years attracted interest from both academia and industry. An increasingly important task in the data preparation phase of many data mining projects is linking or matching records relating to the same entity from several databases, as information from multiple sources often needs to be integrated and combined in order to enrich data and allow more detailed data mining studies. The aim of such linkages is to match and aggregate all records relating to the same entity, such as a patient, a customer, a business, a consumer product, a bibliographic citation, or a genome sequence.

Record linkage and deduplication can be used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and effort in data acquisition. Record linkage can also help to enrich data that is used for pattern detection in data mining systems. Businesses routinely deduplicate and link their data sets to compile mailing lists, while within taxation offices and departments of social security, record linkage and deduplication can be used to identify people who register for benefits multiple times or who work while collecting unemployment money. Another application of current interest is the use of data linkage in crime and terror detection. Security agencies and crime investigators increasingly rely on the ability to quickly access files for a particular individual, which may help to prevent crimes by early intervention.

Manuscript received May. Nishand.K, II-ME CSE, Department of Computer Science and Engineering; Ramasami.S, Assistant Professor, Department of Computer Science and Engineering; Dr. T.Rajendran, Dean, Department of Computer Science and Engineering.

The problem of finding records that relate to the same entities does not only apply to databases that contain information about people.
Other types of entities that sometimes need to be matched include records about businesses, consumer products, publications and bibliographic citations, web pages, web search results, or genome sequences. In bioinformatics, for example, record linkage techniques can help find genome sequences in large data collections that are similar to a new, unknown sequence. In the field of information retrieval, it is important to remove duplicate documents (such as web pages and bibliographic citations) from the results returned by search engines, in digital libraries, or in automatic text indexing systems. Another application of growing interest is finding and comparing consumer products from different online stores. Because product descriptions often vary slightly, matching them becomes challenging.

Removing duplicate records in a single database is a crucial step. Deduplication can be achieved more efficiently by using indexing techniques. One or more (blocking) indexes need to be built with the aim of grouping together records that potentially match, thus reducing the huge number of possible comparisons. While this grouping should reduce the number of comparisons made as much as possible, it is important that no potential match is overlooked because of the indexing process. After the indexes are built, records within the same index block are compared using field comparison functions, resulting in a weight vector for each compared record pair. These weight vectors are then given to a classifier that decides whether a record pair constitutes a match, a non-match or a possible match. In this paper the blocking method, sorting indexing and bigram indexing are used with changes to the existing system. In addition to the indexing techniques, six comparison functions and two classifiers are used to increase efficiency.

1.1 RECORD LINKAGE PROCESS

Fig. 1. Outline of the record linkage process.
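As a rough illustration of the process outlined in Fig. 1 (indexing, field comparison, classification), the following sketch shows how the stages fit together. This is not Febrl's actual API; block_key, compare_fields and classify are hypothetical stand-ins supplied by the caller.

    # Hedged sketch of the generic linkage flow described above (Python).
    def link(records_a, records_b, block_key, compare_fields, classify):
        # Indexing: group record identifiers by their blocking key value.
        blocks = {}
        for source, records in (("a", records_a), ("b", records_b)):
            for rec_id, rec in records.items():
                blocks.setdefault(block_key(rec), {"a": [], "b": []})[source].append(rec_id)
        # Comparison and classification: only pairs in the same block are compared.
        links, possibles = [], []
        for block in blocks.values():
            for id_a in block["a"]:
                for id_b in block["b"]:
                    weight_vector = compare_fields(records_a[id_a], records_b[id_b])
                    status = classify(weight_vector)  # 'link', 'non-link' or 'possible'
                    if status == "link":
                        links.append((id_a, id_b))
                    elif status == "possible":
                        possibles.append((id_a, id_b))
        return links, possibles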
Record linkage techniques are used to link data records relating to the same entities, such as patients or customers. Fig. 1 shows the outline of record linkage. Record linkage can be used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and effort in data acquisition for research studies.

If a unique entity identifier or key is available in all of the data sets to be linked, then the problem of linking at the entity level is trivial: a simple join operation in SQL, or its equivalent in other data management systems, is all that is required. However, if no unique key is shared by all of the data sets, then various record linkage techniques need to be used. No matter which technique is used, a number of issues need to be addressed when linking data. Often, data is recorded or captured in various formats, and data items may be missing or contain errors. A pre-processing phase that aims to clean and standardize the data is therefore an essential first step in every linkage process. Data sets may also contain duplicate entries, in which case linkage may need to be applied within a data set to de-duplicate it before linkage with other files is attempted.

The process of linking records has various names in different user communities. While epidemiologists and statisticians speak of record linkage, the process is often referred to as data matching or as the object identity problem by computer scientists, whereas it is sometimes called merge/purge processing or list washing in commercial processing of customer databases or mailing lists. Historically, the statistical and the computer science communities have developed their own techniques, and until recently few cross-references could be found.

Computer-assisted record linkage goes back as far as the 1950s. At that time, most linkage projects were based on ad-hoc heuristic methods. The basic ideas of probabilistic record linkage were introduced by Newcombe and Kennedy in 1962, while the theoretical foundation was provided by Fellegi and Sunter in 1969. Using frequency counts to derive agreement and disagreement probabilities, each pair of fields of each compared record pair is assigned a match weight, and critical values of the sum of these match weights are used to designate a pair of records as either a link, a possible link or a non-link. Possible links are those pairs for which human oversight, also known as clerical review, is needed to decide their final linkage status. In theory, the person undertaking this clerical review has access to additional data (or may be able to seek it out) which enables them to resolve the linkage status. In practice, often no additional data is available and the clerical review process becomes one of applying human judgement or common sense to the decision based on the available data. One of the aims of the FEBRL project is to automate (and possibly improve upon) this process through the use of machine learning and data mining techniques.

To reduce the number of comparisons (potentially each record in one data set has to be compared with every record in a second data set), blocking techniques are typically used. The data sets are split into smaller blocks using blocking variables, such as the postcode or the Soundex encoding of surnames. Only records within the same blocks are then compared.
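The probabilistic match weights mentioned above are commonly computed, for each field, from the probability m that the field agrees for a true match and the probability u that it agrees for a non-match. A hedged sketch follows; the m- and u-probabilities below are illustrative values, not estimates from this paper.

    from math import log2

    # Per-field (m, u) probabilities; these numbers are made up for illustration.
    FIELD_PROBS = {"surname": (0.95, 0.01), "postcode": (0.90, 0.05)}

    def match_weight(field, agrees):
        # Agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u)).
        m, u = FIELD_PROBS[field]
        return log2(m / u) if agrees else log2((1 - m) / (1 - u))

    # Agreement on surname contributes a large positive weight (about +6.6),
    # disagreement a negative one (about -4.3):
    print(match_weight("surname", True), match_weight("surname", False))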
To deal with typographical variations and data entry errors, approximate string comparison functions are often used for names and addresses. These comparators usually return a score between 0.0 (two strings are completely different) and 1.0 (two strings are the same).

The terms data cleaning, data standardization, data scrubbing, data pre-processing and ETL (extraction, transformation and loading) are used synonymously to refer to the general task of transforming the source data (often derived from operational, transactional information systems) into clean and consistent sets of records which are suitable for record linkage or for loading into a data warehouse. The meaning of the term standardization in this context is quite different from its use in epidemiology and statistics, where it usually refers to a method of dealing with the confounding effects of age. The main task of data standardization in record linkage is the resolution of inconsistencies in the way information is represented or encoded in the data. Inconsistencies can arise through typographical or other data capture errors, the use of different code sets or abbreviations, and differences in record layouts.

The FEBRL system is implemented in an object-oriented design with a handful of modules, each containing routines for specific tasks. Record linkage consists of two main steps. The first deals with data cleaning and standardization, while the second performs the actual linkage (or deduplication). The user needs to specify various settings in order to perform a cleaning/standardization and/or a linkage/deduplication/geocoding process.

II. INDEXING

The aim of indexing is to reduce the potentially huge number of comparisons (every record in one data set with all records in another data set) by eliminating comparisons between records that obviously are not matches. In other words, indexing reduces the large search space by forming groups of records that are likely to be matches. Indexing can also be seen as a clustering method that brings together records that are similar, so that only these records need to be compared using the more expensive (i.e. compute-intensive) field comparison functions.

2.1 BLOCKING INDEXING

Currently the FEBRL system contains several indexing methods, including the traditional blocking method used in many record linkage systems. Indexes are normally built while a data set is being standardized. After an index is built, a compaction step builds index data structures that can return the blocks more efficiently. In the following example, three indexes are defined and a traditional blocking index is initialized. The first index is based on the Soundex encoding of the reversed surname field values, with a maximal code length of three; the second index is based on a combination of the first two characters of the given name field (the truncate method is used for this) concatenated with the values in the postcode field (the method is direct, which means the postcode values are not encoded or modified in any way); and the third index is based on the concatenation of the first two digits of the postcode with the NYSIIS encoding of the surname values.
    hosp_block_def = [[('surname', 'soundex', 3, 'reverse')],
                      [('givenname', 'truncate', 2), ('postcode', 'direct')],
                      [('postcode', 'truncate', 2), ('surname', 'nysiis')]]

    hospital_index = BlockingIndex(name='HospIndex',
                                   dataset=tmpdata,
                                   block_def=hosp_block_def)
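To make the resulting blocking key values concrete, here is a hedged sketch of how the first two index definitions above derive keys from a record. The soundex function below is a simplified version for illustration only; Febrl's own encoding functions handle more edge cases.

    def soundex(s, maxlen=4):
        # Simplified Soundex: keep the first letter, encode the remaining
        # consonants as digits, and drop immediate repeats.
        codes = {}
        for chars, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5"), ("r", "6")):
            for c in chars:
                codes[c] = digit
        s = s.lower()
        out, prev = s[0], codes.get(s[0], "")
        for ch in s[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                out += code
            prev = code
        return (out + "0" * maxlen)[:maxlen]

    record = {"surname": "baxter", "givenname": "peter", "postcode": "2600"}

    # First index: Soundex of the reversed surname, maximal code length 3.
    key1 = soundex(record["surname"][::-1], maxlen=3)      # 'r32'
    # Second index: first two given-name characters plus the raw postcode.
    key2 = record["givenname"][:2] + record["postcode"]    # 'pe2600'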
2.2 SORTING INDEXING

The sorting index extends the idea of the classical blocking index in that the values of the blocks (e.g. the Soundex encodings of surnames) are sorted alphabetically and a sliding window is moved over these sorted blocks. When a sorting index is initialized, one additional argument needs to be given: the window size, a positive integer that gives the size of the sliding window. For example, assume there are six blocks in our index (shown are the blocking variable values and the corresponding record numbers in the blocks) and a sliding window of size 3:

    a123: [4,12,89,99]
    a129: [6,32,54,84,91]
    a245: [1,39]
    a689: [3,17,21,35,49,76,87,93]
    a911: [2,42,66]
    b111: [8]

While with the blocking index only the records within one block are compared with the records in the corresponding block of the second data set, with the sorting index and window size 3 larger blocks are formed by combining three consecutive blocks:

    [a123,a129,a245]: [1,4,6,12,32,39,54,84,89,91,99]
    [a129,a245,a689]: [1,3,6,17,21,32,35,39,49,54,76,84,87,91,93]
    [a245,a689,a911]: [1,2,3,17,21,35,39,42,49,66,76,87,93]
    [a689,a911,b111]: [2,3,8,17,21,35,42,49,66,76,87,93]

The idea behind this is that neighboring blocks might contain records with similar values in the blocking variables due to errors in the original values. Note that if the window size is set to 1, the sorting index becomes equivalent to the blocking index.

2.3 BIGRAM INDEXING

The aim of this technique is to index the database(s) such that records that have a similar, not just the same, blocking key value (BKV) are inserted into the same block. The basic idea is to create variations of each BKV using bigrams, and to insert record identifiers into more than one block. This index implements a data structure based on bigrams and allows fuzzy blocking. After an index has been built, the values of the blocking variables are converted into a list of bigrams, which is sorted alphabetically (and duplicate bigrams are removed), and sub-lists are built from all possible combinations using a user-provided threshold (a number between 0.0 and 1.0). The resulting bigram sub-lists are inserted into an inverted index, i.e. the record numbers in the blocks are inserted into dictionaries for each bigram sub-list. This inverted index is then used to retrieve the blocks.

A blocking key value baxter results in the bigram list (ba, ax, xt, te, er). With a threshold of 0.8, the following sub-lists of length 4 (calculated as the length of the bigram list times the threshold: 5 x 0.8 = 4) are inserted into the inverted index:

    (ax, xt, te, er)
    (ba, xt, te, er)
    (ba, ax, te, er)
    (ba, ax, xt, er)
    (ba, ax, xt, te)

All record numbers whose records contain the blocking key value baxter are thus inserted into five blocks, increasing the number of record pair comparisons compared to standard blocking. The number of sub-lists created for a blocking key value depends on both the length of the value and the threshold. The lower the threshold, the shorter the sub-lists, but also the more sub-lists there are per blocking key value, resulting in more (smaller) blocks in the inverted index.
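A minimal sketch of this sub-list construction (assuming the sub-list length is simply the rounded-down product of bigram count and threshold; Febrl's implementation differs in detail):

    from itertools import combinations

    def bigram_sublists(value, threshold):
        # Alphabetically sorted, duplicate-free bigram list of the value.
        bigrams = sorted(set(value[i:i + 2] for i in range(len(value) - 1)))
        k = max(1, int(len(bigrams) * threshold))  # sub-list length
        return [list(c) for c in combinations(bigrams, k)]

    # 'baxter' has five bigrams, so a 0.8 threshold yields the five
    # sub-lists of length 4 described above (in sorted order).
    for sub in bigram_sublists("baxter", 0.8):
        print(sub)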
III. RECORD COMPARATOR

A Record Comparator is constructed using a list of field comparators. Once the field comparison functions have been initialized, they can be inserted into a field_comparisons list which is then given to a record comparator. When comparing records in a linkage or deduplication process, each field comparator in the field_comparisons list calculates a weight, and all weights for a record pair are stored in a weight vector. Additional information stored with a weight vector is the pair of unique record identifiers of the two compared records. The weight vectors are then used to compute the matching status of the record pair in the classifier. The Record Comparator is initialized using references to the two data sets to be linked (which can be the same in case of a deduplication task) and a list of field comparison functions. The comparison techniques used are the following.

3.1 EXACT STRING COMPARISON

This field comparison function compares the two fields (or field lists) given to it as strings and returns the agreement weight if they are the same and the disagreement weight if they differ.

3.2 TRUNCATED STRING COMPARISON

This field comparison function allows the comparison of strings that are truncated at a certain position using the argument max_string_length. Only the first max_string_length characters of both strings are compared. Similar to the exact string comparison function, this field comparator compares the two fields (or field lists) given to it as strings and returns the agreement weight if the truncated strings are the same and the disagreement weight if they differ.

3.3 APPROXIMATE STRING COMPARISON

Approximate string comparison is an important feature for successful weight calculation when comparing strings from names and addresses. Instead of simply returning an agreement or disagreement weight, approximate string comparators allow for partial agreement if strings are not exactly but almost the same, which can be due to typographical and other errors. Various algorithms for approximate string comparison have been developed, both in the statistical record linkage community and in the computer science and natural language processing communities. All implemented string comparison functions return a value between 0.0 (two strings are completely different) and 1.0 (two strings are the same). The approximate string comparison method is selected with the compare_method argument. The following methods are currently implemented:

jaro: The Jaro string comparator is commonly used in record linkage software. It computes the number of common characters in two strings, the lengths of both strings, and the number of transpositions to compute a similarity measure between 0.0 and 1.0.

winkler: The Winkler comparator is based on the Jaro comparator but takes into account the fact that typographical errors occur
more often towards the end of words, and thus gives an increased value to characters in agreement at the beginning of the strings. The partial agreement weight is therefore increased if the beginnings of the two strings are the same.

bigram: Bigrams are the two-character substrings of a string; for example, 'peter' contains the bigrams 'pe', 'et', 'te' and 'er'. In the bigram string comparator, the number of common bigrams in the two strings is counted and divided by the average number of bigrams in the two strings, to calculate a similarity measure between 0.0 and 1.0.

editdist: The edit distance algorithm (also known as the Levenshtein distance) counts the minimum number of insertions, deletions and substitutions that have to be made to transform one string into the other. This number is then divided by the length of the longer string to get a similarity measure between 0.0 and 1.0.

seqmatch: This approximate string comparator is implemented in the Python standard library module difflib. It is based on an algorithm developed by Ratcliff and Obershelp in the 1980s, and uses pattern matching to compute a similarity measure between 0.0 and 1.0.

3.4 ENCODED STRING COMPARISON

Phonetic name encoding is traditionally used to create blocking variables in the record linkage process, but it can also be used to compare strings. The encoded string comparison function compares the two fields (or field lists) given to it as encoded strings, and returns the agreement weight if both strings are encoded the same way; otherwise the disagreement weight is returned.

3.5 KEYING DIFFERENCE COMPARISON

This field comparator compares the two fields (or field lists) given to it as strings character-wise, with a maximum number of different characters that can be tolerated. This number is set with the argument max_key_diff. If the number of different characters is larger than zero but equal to or smaller than max_key_diff, the partial agreement weight is calculated. If the number of key differences is larger than max_key_diff, the disagreement weight is returned. This field comparator can also be used to compare numerical fields, such as dates of birth, telephone numbers, etc.

3.6 NUMERIC COMPARISON WITH PERCENTAGE TOLERANCE

This field comparator is for numeric fields, where a given maximal percentage difference can be tolerated. This percentage is set by the argument max_perc_diff as a value between 0 and 100. The default value is 0, i.e. no difference is tolerated. The agreement weight is returned if the numbers are the same, and the disagreement weight if the percentage difference is larger than the max_perc_diff value. If the percentage difference between the two values is larger than zero but equal to or smaller than max_perc_diff, the partial agreement weight is calculated.

IV. CLASSIFICATION

The last step in a record linkage process, after records have been compared and weight vectors have been calculated, is the classification of record pairs into links, non-links, or, if this decision should be made by human review, possible links. The two classifiers used are the Fellegi and Sunter classifier and a flexible classifier.

4.1 FELLEGI AND SUNTER CLASSIFIER

The classical Fellegi and Sunter classifier simply sums all the weights in a weight vector, and then uses two thresholds to classify a record pair into one of the three classes: links, non-links or possible links.
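A hedged sketch tying the comparator and classifier steps together: a simplified bigram field comparator produces partial agreement weights, and the summed weight vector is classified against two thresholds. The weights and thresholds below are illustrative values, not Febrl's defaults.

    def bigram_sim(s1, s2):
        # Common bigrams divided by the average number of bigrams (simplified).
        b1 = [s1[i:i + 2] for i in range(len(s1) - 1)]
        b2 = [s2[i:i + 2] for i in range(len(s2) - 1)]
        common = sum(1 for b in b1 if b in b2)
        return common / ((len(b1) + len(b2)) / 2.0)

    def field_weight(sim, agree=1.0, disagree=-1.0):
        # Interpolate between the disagreement and agreement weights.
        return disagree + sim * (agree - disagree)

    def fellegi_sunter(weight_vector, lower=-1.0, upper=1.5):
        total = sum(weight_vector)
        if total > upper:
            return "link"
        if total < lower:
            return "non-link"
        return "possible link"

    vec = [field_weight(bigram_sim("peter", "petra")),    # partial agreement
           field_weight(bigram_sim("baxter", "baxter"))]  # full agreement
    print(fellegi_sunter(vec))   # 'possible link' with these thresholds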
4.2 FLEXIBLE CLASSIFIER

The flexible classifier allows different methods to be used to calculate the final matching weight for a weight vector. Similar to the Fellegi and Sunter classifier, two thresholds are used to classify a record pair into one of the three classes: links, non-links or possible links. The results of a classification are stored in a data structure. Instead of simply summing all weights in a weight vector, this classifier allows a flexible definition of the final weight calculation by defining tuples containing a function and the elements of the weight vector to which the function is applied. The final weight is then calculated using another function that needs to be defined by the user.

V. EXPERIMENTAL EVALUATION

In previous work, indexing techniques alone were used for record linkage and deduplication. In this paper, indexing techniques are used together with comparison functions and classifiers, all implemented in the FEBRL framework. The datasets used are real data and artificially generated data; because it is difficult to identify deviations in the results on real datasets alone, artificial datasets are also used. The artificial data are generated using the Febrl framework. This generator first creates original records based on frequency tables that contain real name and address values, as well as other personal attributes, followed by the generation of duplicates of these records based on random modifications such as inserting, deleting or substituting characters, and swapping, removing, inserting, splitting or merging words. The types and frequencies of these modifications are also based on real characteristics. The true match status of all record pairs is known. The original and duplicate records were then stored into one file each to facilitate their linkage.

Four measures are used to assess the complexity of the indexing step and the quality of the resulting candidate record pairs. The total numbers of matched and non-matched record pairs are denoted by nM and nN, respectively, with nM + nN = nA x nB for the linkage of two databases, and nM + nN = nA(nA - 1)/2 for the deduplication of one database. The numbers of true matched and true non-matched record pairs generated by an indexing technique are denoted by sM and sN, respectively, with sM + sN <= nM + nN. The reduction ratio, RR = 1.0 - (sM + sN)/(nM + nN), measures the reduction of the comparison space, i.e. the fraction of record pairs that are removed by an indexing technique. The higher the RR value, the fewer candidate record pairs are generated. However, the reduction ratio does not take the quality of the generated candidate record pairs into account (how many are true matches or not).
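A small sketch of these pair counts and the reduction ratio (the numbers below are made up for illustration):

    def total_pairs(n_a, n_b=None):
        # Full comparison space: nA x nB for a linkage of two databases,
        # nA(nA - 1)/2 for the deduplication of one database.
        return n_a * n_b if n_b is not None else n_a * (n_a - 1) // 2

    def reduction_ratio(s_m, s_n, n_m, n_n):
        # RR = 1.0 - (sM + sN) / (nM + nN)
        return 1.0 - (s_m + s_n) / float(n_m + n_n)

    # Illustrative linkage of two databases of 1,000 records each, where
    # indexing keeps 50,000 candidate pairs including 900 true matches.
    n_total = total_pairs(1000, 1000)      # 1,000,000 record pairs
    print(reduction_ratio(900, 49100, 1000, n_total - 1000))   # 0.95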
Fig. 2 shows the evaluation of the given data set.

Fig. 2. Evaluation page showing the matching weight histogram and the quality and complexity measures for a linkage.

VI. CONCLUSION AND FUTURE WORK

Record linkage and deduplication are important steps in the pre-processing phase of many data mining projects, and are also important for improving data quality before data is loaded into data warehouses. Different record linkage techniques have been presented along with comparators and classifiers, which are used to separate record pairs into matches, non-matches and possible matches. The indexing techniques presented in this paper are heuristic approaches that aim to split the records in a database (or databases) into (possibly overlapping) blocks such that matches are inserted into the same block and non-matches into different blocks. While future work in the area of indexing for record linkage and deduplication should include the development of more efficient and more scalable indexing techniques, the ultimate goal of such research is to develop techniques that generate blocks such that it can be proven that 1) all comparisons between records within a block have a certain minimum similarity with each other, and 2) the similarity between records in different blocks is below this minimum similarity.

REFERENCES

[1] T. Churches, P. Christen, K. Lim, and J.X. Zhu, "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9, 2002.
[2] P. Christen, "Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), 2008.
[3] L. Gu and R. Baxter, "Decision Models for Record Linkage," Selected Papers from AusDM, LNCS 3755, Springer, 2006.
[4] S. Yan, D. Lee, M.Y. Kan, and C.L. Giles, "Adaptive Sorted Neighborhood Methods for Efficient Record Linkage," Proc. Seventh ACM/IEEE-CS Joint Conf. Digital Libraries (JCDL '07), 2007.
[5] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB '01), 2001.
[6] J.I. Maletic and A. Marcus, "Data Cleansing: Beyond Integrity Analysis," Proc. Fifth Conf. Information Quality (IQ 2000), 2000.
[7] L. Jin, C. Li, and S. Mehrotra, "Efficient Record Linkage in Large Data Sets," Proc. Eighth Int'l Conf. Database Systems for Advanced Applications (DASFAA '03), 2003.
[8] C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), 1995.

Nishand.K completed his Bachelor of Engineering in the Computer Science stream at Anna University, Coimbatore, and is pursuing a Master of Engineering in the Computer Science stream at Anna University, Chennai. He has presented two papers at international conferences. His areas of interest include data mining, cloud computing and networks.

S.Ramasami received his Bachelor of Technology degree in the Information Technology stream and his Master of Engineering degree in Computer Science and Engineering, both from Anna University, Chennai. He is currently doing his research in cloud computing. He has published one paper in an international journal and five papers at national conferences. His areas of interest include cloud computing and network security.

Dr. T.Rajendran completed his PhD degree in 2012 at Anna University, Chennai, in the Department of Information and Communication Engineering. He is now working as the Dean of the Department of CSE & IT at Angel College of Engineering and Technology, Tirupur, Tamilnadu, India. His research interests include distributed systems, web services, network security, SOA and web technology. He is a life member of ISTE & CSI. He has published more than 51 articles in international/national journals and conferences. He has visited Dhurakij Pundit University in Thailand to present his research paper at an international conference. He was honored with the Best Professor Award 2012 by the ASDF Global Awards 2012, Pondicherry.