On-line Data De-duplication. Ιωάννης Κρομμύδας
2 Roadmap
- Data Cleaning Importance
- Data Cleaning Tasks
- Challenges for Fuzzy Matching
- Baseline Method (fuzzy match data cleaning)
- Improvements: online data cleaning using qgram tries
4 Data Cleaning Importance
Data cleaning is critical for many industries over a wide variety of applications:
- marketing communications
- customer matching
- merging information systems
- medical records
5 Data Cleaning Importance
The efficiency of every information processing infrastructure is greatly affected by the quality of the data residing in its databases.
Poor data quality has a variety of causes:
- data entry errors (e.g., typing mistakes)
- multiple conventions for recording database fields (e.g., company names, addresses)
6 Data Cleaning Importance
Poor data quality has a significant impact on a variety of business issues:
- customer relationship management
- inability to retrieve a customer record during a service call
- billing errors
- distribution delays
8 Data Cleaning Tasks
One of the most important tasks in data cleaning is record de-duplication: detecting multiple representations of a single entity.
The problem is straightforward for numerical values, but very hard for string values and for attributes that combine several of them:
- names (first, middle, last name), addresses, etc.
9 Data Cleaning Tasks
Considering company names, it is common to see "Microsoft", "Micorsoft", "Microsoft Inc." and "Microsoft Corporation" used in different records to represent the same entity.
A simple equality or even substring comparison on names or addresses will not identify them as the same entity, leading to a variety of potential business problems.
10 Data Cleaning Tasks
Two possible modes of de-duplication:
- Detection of exact duplicates, which requires a typical join operation
- Fuzzy matching, which entails the detection of inexact duplicates and presents a trade-off between accuracy, efficiency and storage overhead
12 Challenges for Fuzzy Matching
Assume a clean reference relation R and a stream S of possibly dirty tuples that we check against R for fuzzy duplicates.
Task: first try an exact match; if that fails, try a fuzzy match.
Issues:
- accuracy of the identification
- an appropriate similarity function
- avoiding a comparison of every stream record with every tuple in R
13 Challenges for Fuzzy Matching
[Figure] Fig. 1. Template for using Fuzzy Match [CGGM03]
14 Challenges for Fuzzy Matching
Given the similarity function and an input tuple, the result of a fuzzy match operation could be one of the following:
- the reference tuple closest to the input tuple
- the closest K reference tuples, enabling users, if necessary, to choose one among them
- K or fewer tuples whose similarity to the input tuple exceeds a user-specified minimum similarity threshold
16 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003:
- adopt a probabilistic approach in order to return the closest K reference tuples with high probability
- propose a fuzzy match similarity function (fms) that explicitly considers IDF token weights and input errors while comparing tuples
17 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003:
- preprocess the reference relation to build an index relation, called the error tolerant index (ETI) relation, for retrieving a small set of candidate reference tuples at run time
- retrieve, with high probability, a superset of the K reference tuples closest to the input tuple
18 Baseline Method (Fuzzy Match Data Cleaning)
The similarity between an input tuple and a reference tuple can be described as the cost of transforming the former into the latter:
- a low transformation cost for an input tuple denotes high similarity
- transformation operations are applied to the set of tokens contained in the attributes of a tuple
- the set of tokens in attribute i of tuple v is denoted by tok(v[i]); e.g., if v[i] = Boeing Company, then tok(v[i]) = {Boeing, Company}
19 Baseline Method (Fuzzy Match Data Cleaning)
Each transformation operation is associated with a cost that depends on the weight of the transformed token:
$$w(t, i) = IDF(t, i) = \log \frac{|R|}{freq(t, i)}$$
where freq(t, i) denotes the frequency of token t in column i, i.e. the number of tuples v in R such that tok(v[i]) contains t.
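
The weight definition above can be sketched in a few lines of Python. This is only an illustration: the helper names `tok` and `idf_weights`, the lowercasing of tokens, whitespace tokenization and the three-row reference relation are assumptions, not part of the slides.

```python
import math
from collections import Counter

def tok(value):
    """Split an attribute value into lowercase whitespace-delimited tokens (assumed tokenizer)."""
    return value.lower().split()

def idf_weights(reference, column):
    """Return {token: log(|R| / freq(token, column))} for one column of R."""
    n = len(reference)
    freq = Counter()
    for row in reference:
        # freq counts tuples that contain the token, so each tuple contributes once
        freq.update(set(tok(row[column])))
    return {t: math.log(n / f) for t, f in freq.items()}

# Hypothetical reference relation with two columns (name, city)
R = [
    ("Boeing Company", "Seattle"),
    ("Bon Corporation", "Seattle"),
    ("Companions Company", "Tacoma"),
]
w = idf_weights(R, 0)
print(w["boeing"], w["company"])
```

Rare tokens such as `boeing` receive a larger weight than frequent ones such as `company`, which is what makes an error in a rare token more expensive.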
20 Baseline Method (Fuzzy Match Data Cleaning)
Let u be an input tuple and v a reference tuple. The costs of the operations that transform u into v are defined in the following table:

operation         | description                                   | cost
token replacement | replaces t_1 in tok(u[i]) by t_2 in tok(v[i]) | ed(t_1, t_2) · w(t_1, i)
token insertion   | inserts a token t into u[i]                   | c_ins · w(t, i), with c_ins a constant between 0 and 1
token deletion    | deletes a token t from u[i]                   | w(t, i)
21 Baseline Method (Fuzzy Match Data Cleaning)
The transformation cost tc(u[i], v[i]) is the cost of the minimum-cost transformation sequence for transforming u[i] into v[i].
The cost tc(u, v) of transforming u into v is the sum, over all columns i, of the costs tc(u[i], v[i]):
$$tc(u, v) = \sum_{i} tc(u[i], v[i])$$
22 Baseline Method (Fuzzy Match Data Cleaning)
The fuzzy match similarity function fms(u, v) between an input tuple u and a reference tuple v is defined in terms of the transformation cost tc(u, v) as:
$$fms(u, v) = 1 - \min\left(\frac{tc(u, v)}{w(u)},\ 1.0\right)$$
- w(u) is the sum of the weights of all tokens in the token set tok(u)
- the token set tok(u) denotes the multiset union of the sets tok(a_1), …, tok(a_n) of tokens from the tuple u[a_1, …, a_n]
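
As a rough sketch of how tc and fms fit together, the following Python computes the per-column cost with a dynamic program over token sequences, in the style of string edit distance. Several things here are assumptions rather than the authors' definitions: ed is taken to be Levenshtein distance divided by the longer token length, weights come from a single dictionary shared by all columns, c_ins defaults to 0.5, and the weight of `wa` is invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def ed(a, b):
    """Assumed normalization: edit distance divided by the longer length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

def tc_column(u_tokens, v_tokens, w, c_ins=0.5):
    """Minimum cost of transforming the token list u_tokens into v_tokens."""
    m, n = len(u_tokens), len(v_tokens)
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = cost[i - 1][0] + w[u_tokens[i - 1]]          # deletions only
    for j in range(1, n + 1):
        cost[0][j] = cost[0][j - 1] + c_ins * w[v_tokens[j - 1]]  # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t1, t2 = u_tokens[i - 1], v_tokens[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + w[t1],                    # delete t1
                cost[i][j - 1] + c_ins * w[t2],            # insert t2
                cost[i - 1][j - 1] + ed(t1, t2) * w[t1],   # replace t1 by t2
            )
    return cost[m][n]

def fms(u, v, w, c_ins=0.5):
    """u and v are lists of token lists, one per column."""
    tc = sum(tc_column(ui, vi, w, c_ins) for ui, vi in zip(u, v))
    w_u = sum(w[t] for ui in u for t in ui)
    return 1.0 - min(tc / w_u, 1.0)

# Token weights; the value for "wa" is hypothetical
weights = {"beoing": 0.5, "boeing": 0.5, "company": 0.25,
           "seattle": 1.0, "wa": 0.75, "98004": 2.0}
u = [["beoing", "company"], ["seattle"], [], ["98004"]]
v = [["boeing", "company"], ["seattle"], ["wa"], ["98004"]]
score = fms(u, v, weights)
print(round(score, 4))
```

In this example the only costs are the replacement of `beoing` by `boeing` and the insertion of `wa`, so the score stays close to 1.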
23 Baseline Method (Fuzzy Match Data Cleaning)
The K-fuzzy match problem: given a reference relation R, a minimum similarity threshold c (0 < c < 1) and an input tuple u, find the set FM(u) of fuzzy matches, consisting of at most K tuples from R.
Naïve algorithm: scan the reference relation R, comparing each tuple with u.
Proposed method:
- build an index on the reference relation for quickly retrieving a superset of the target fuzzy matches (pre-processing phase)
- this index relation is called the Error Tolerant Index (ETI); it is indexed using standard B+ trees to perform fast exact lookups
- preparing the ETI requires an approximation of fms, denoted fms^apx
24 Baseline Method (Fuzzy Match Data Cleaning)
Reference Relation (not indexable) → pre-processing → Error Tolerant Index (standard database relation, but indexable) → Candidate Set (superset of FM(u))
The approximation of fms (fms^apx) is a pared-down version of fms:
- it ignores the ordering among tokens in the input and reference tuples: [beoing company, seattle, wa, 98004] and [company beoing, seattle, wa, 98004] are identical to fms^apx
- closeness between two tokens is measured through the similarity between their sets of substrings, called qgram sets
25 Baseline Method (Fuzzy Match Data Cleaning)
Estimating fms^apx requires computing token min-hash signatures mh_i and the min-hash similarity sim_mh between two tokens.
Min-hash similarity:
- U: the universe of strings over an alphabet Σ
- $h_i : U \to N$, i = 1, …, H: H hash functions mapping elements of U uniformly and randomly to the set of natural numbers N
- S: a set of strings
- the min-hash signature mh(S) of S is the vector [mh_1(S), …, mh_H(S)], where the i-th coordinate mh_i(S) is defined as:
$$mh_i(S) = \arg\min_{a \in S} h_i(a)$$
- let I[X] denote an indicator variable over a boolean X (I[X] = 1 if X is true, else 0); the min-hash similarity between two tokens t_1 and t_2 is:
$$sim_{mh}(t_1, t_2) = \frac{1}{H} \sum_{i=1}^{H} I\left[ mh_i(QG(t_1)) = mh_i(QG(t_2)) \right]$$
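
The definitions above can be tried out with a small Python sketch. The choice of seeded SHA-1 digests as the hash functions h_i, the default H = 4, and the rule that a token shorter than q is its own single qgram are all assumptions made for illustration.

```python
import hashlib

def qgrams(token, q=3):
    """QG(token): the set of q-length substrings (the token itself if it is shorter than q)."""
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}

def h(i, s):
    """i-th hash function: a seeded SHA-1 digest read as an integer (assumed choice)."""
    return int.from_bytes(hashlib.sha1(f"{i}:{s}".encode()).digest(), "big")

def minhash_signature(strings, H=4):
    """[mh_1(S), ..., mh_H(S)], where mh_i(S) is the element of S minimizing h_i."""
    return [min(strings, key=lambda a: h(i, a)) for i in range(H)]

def sim_mh(t1, t2, q=3, H=4):
    """Fraction of signature coordinates on which the two tokens agree."""
    sig1 = minhash_signature(qgrams(t1, q), H)
    sig2 = minhash_signature(qgrams(t2, q), H)
    return sum(a == b for a, b in zip(sig1, sig2)) / H

print(minhash_signature(qgrams("boeing")))
print(sim_mh("beoing", "boeing"))
```

Because each coordinate is an actual qgram of the token, two tokens can only agree on a coordinate if they share that qgram, which is what lets an index on qgrams retrieve similar tokens.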
26 Baseline Method (Fuzzy Match Data Cleaning)
Let u, v be two tuples and let $d_q = (1 - 1/q)$ be an adjustment term. Then fms^apx is defined as:
$$fms^{apx}(u, v) = \frac{1}{w(u)} \sum_{i} \sum_{t \in tok(u[i])} w(t) \cdot \max_{r \in tok(v[i])} \left( \frac{2}{q}\, sim_{mh}(QG(t), QG(r)) + d_q \right)$$
Example:
- Input tuple u: [Company Beoing, Seattle, NULL, 98004]
- Reference tuple v: [Boeing Company, Seattle, WA, 98004]
- q = 3, H = 2
- Token weights: company: 0.25, beoing: 0.5, seattle: 1.0, 98004: 2.0 (total weight = 3.75)
- Suppose the min-hash signatures are [oei, ing], [com, pan], [sea, ttl], [wa], [980, 004]
- The score from matching beoing with boeing is w(beoing) · ((2/3) · 0.5 + (1 − 1/3)) = w(beoing)
- Since every other token matches exactly with a reference token, fms^apx(u, v) = 3.75/…
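
The per-token term of fms^apx can be checked numerically. The sketch below assumes that exactly one of the H = 2 signature coordinates of beoing and boeing agrees, so that sim_mh = 0.5; the function name `token_score` is hypothetical.

```python
q = 3
d_q = 1 - 1 / q  # adjustment term

def token_score(weight, sim):
    """Contribution of one input token: w(t) * ((2/q) * sim_mh + d_q)."""
    return weight * ((2 / q) * sim + d_q)

w_beoing = 0.5
score_beoing = token_score(w_beoing, sim=0.5)
print(score_beoing)
```

With these values the bracketed factor is 1/3 + 2/3 = 1, so the misspelled token contributes its full weight despite matching only half of the signature.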
27 Baseline Method (Fuzzy Match Data Cleaning)
Error Tolerant Index (ETI):
- enables, for each input tuple u, the efficient retrieval of a candidate set S of reference tuples whose similarity to u is greater than the minimum similarity threshold
- fms^apx is measured by comparing the min-hash signatures of the tokens in tok(u) and tok(v)
- to determine the candidate set, we need to efficiently identify, for each token t in tok(u), the set of reference tuples sharing min-hash qgrams with t
- the ETI holds each qgram s along with the list of tids of all reference tuples containing tokens whose min-hash signatures contain s
28 Baseline Method (Fuzzy Match Data Cleaning)
ETI schema: [QGram, Coordinate, Column, Frequency, Tid-list]
For each tuple e in the ETI:
- e[Tid-list] contains the tids of all reference tuples containing at least one token t in the field e[Column] whose e[Coordinate]-th min-hash coordinate is e[QGram]
- e[Frequency] stores the number of tids in e[Tid-list]
30 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm:
- goal: reduce the number of lookups against the reference relation by using the ETI effectively
- fetches tid-lists by looking up in the ETI all q-grams in the min-hash signatures of all tokens in u
31 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm:
1) For each token t in tok(u), compute its IDF weight w(t)
2) Determine the min-hash signature mh(t) of each token
3) Using the ETI, determine the candidate set S of reference tuples as per fms^apx
4) Fetch the tuples in S from the reference relation, and test them as per fms
5) Among the tuples that pass the test, return the K tuples with the K highest similarity scores
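
The five steps can be sketched end to end in Python. This is a simplified illustration rather than the published algorithm: the ETI is an in-memory dictionary instead of a B+-tree-indexed relation, token weights default to 1.0, the candidate filter is a hypothetical `cutoff` on the accumulated weight share, and a Jaccard overlap of token sets stands in for the full fms test.

```python
import hashlib
from collections import defaultdict

Q, H = 3, 2  # qgram length and signature size, as in the slides' example

def tokens(value):
    """Lowercase whitespace tokens; None (NULL) yields no tokens."""
    return (value or "").lower().split()

def signature(token):
    """Min-hash signature: for each of H seeded hashes, the minimizing qgram."""
    grams = {token[i:i + Q] for i in range(len(token) - Q + 1)} or {token}
    def h(i, g):
        return hashlib.sha1(f"{i}:{g}".encode()).hexdigest()
    return [min(grams, key=lambda g: h(i, g)) for i in range(H)]

def build_eti(reference):
    """Map (qgram, coordinate, column) -> tid-list; frequency is len(tid-list)."""
    eti = defaultdict(set)
    for tid, row in reference.items():
        for column, value in enumerate(row):
            for token in tokens(value):
                for coordinate, qgram in enumerate(signature(token)):
                    eti[(qgram, coordinate, column)].add(tid)
    return eti

def jaccard(u, v):
    """Stand-in for fms (assumption): Jaccard overlap of the tuples' token sets."""
    a = {t for value in u for t in tokens(value)}
    b = {t for value in v for t in tokens(value)}
    return len(a & b) / len(a | b)

def fuzzy_match(u, reference, eti, weights, similarity=jaccard,
                c=0.4, K=1, cutoff=0.25):
    # Steps 1-2: token weights (taken as given) and min-hash signatures.
    total = sum(weights.get(t, 1.0) for value in u for t in tokens(value))
    score = defaultdict(float)
    for column, value in enumerate(u):
        for token in tokens(value):
            for coordinate, qgram in enumerate(signature(token)):
                # Step 3: each ETI hit credits the tid with a share of w(t).
                for tid in eti.get((qgram, coordinate, column), ()):
                    score[tid] += weights.get(token, 1.0) / H
    candidates = [tid for tid, s in score.items() if s / total >= cutoff]
    # Step 4: fetch the candidates and test them with the similarity function.
    scored = [(similarity(u, reference[tid]), tid) for tid in candidates]
    passed = [(s, tid) for s, tid in scored if s >= c]
    # Step 5: keep the K most similar tuples.
    return sorted(passed, reverse=True)[:K]

# Hypothetical reference relation keyed by tid
reference = {
    1: ("Boeing Company", "Seattle", "WA", "98004"),
    2: ("Bon Corporation", "Seattle", "WA", "98014"),
    3: ("Companions Inc", "Seattle", "WA", "98024"),
}
eti = build_eti(reference)
u = ("Company Beoing", "Seattle", None, "98004")
result = fuzzy_match(u, reference, eti, weights={})
print(result)
```

Only the tuples reached through ETI lookups are ever compared with u, which is the point of the index: the reference relation itself is not scanned.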
33 Improvements: Online Data Cleaning using qgram tries
Proposed method for cleaning a stream of incoming tuples before their insertion into a database table. It uses:
- Word Index
  - a structure similar to the ETI
  - holds information about the attribute values stored in the reference table
  - is used to retrieve clean words that probably match the input attribute values of a tuple
- Qgram Trie
  - stores the retrieved clean words
  - is held in main memory
34 Improvements: Online Data Cleaning using qgram tries
The Word Index consists of five fields:
- qgram: a sequence of Q characters
- coordinate: the position at which the qgram occurs within a string value
- column: the string-valued attribute that holds the value
- code-list: the list of ids of the words that include the qgram at the position denoted by the coordinate field
- frequency: the number of words in the code-list
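
A Word Index with these five fields could be sketched as a dictionary keyed by (qgram, coordinate, column), with the code-list as the value and the frequency as its length. The 0-based positions, the column name `name` and the helper names are assumptions made for illustration.

```python
from collections import defaultdict

def positional_qgrams(word, q=3):
    """Return (qgram, position) pairs, with positions counted from 0 (assumed convention)."""
    return [(word[i:i + q], i) for i in range(len(word) - q + 1)]

def build_word_index(words_by_column, q=3):
    """Map (qgram, coordinate, column) -> code-list of word ids."""
    index = defaultdict(list)
    for column, words in words_by_column.items():
        for word_id, word in words.items():
            for qgram, position in positional_qgrams(word, q):
                index[(qgram, position, column)].append(word_id)
    return index

# Clean words of a hypothetical "name" column, keyed by word id
index = build_word_index({"name": {1: "Ric", 2: "Rica", 3: "Ricus"}})
code_list = index[("Ric", 0, "name")]
frequency = len(code_list)
print(code_list, frequency)
```

Looking up the positional qgrams of an input value in such an index yields the clean words that share a qgram at the same position, which are the candidates loaded into the qgram trie.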
35 Improvements: Online Data Cleaning using qgram tries
A qgram trie consists of:
- a root labeled null
- word-prefix subtrees as the children of the root
- a header table
Each qgram trie node holds:
- qgram: the qgram represented by the node
- count: the number of clean words represented by the portion of the path reaching this node
- node-link: a link to the next node in the trie carrying the same qgram, or null if there is none
- category-list: the list of ids of the words that share this node in the trie representation
Each header table entry holds:
- qgram
- head of node-link: points to the first node in the trie carrying the qgram
Example: the qgram trie built in memory when the clean words Ric, Rica and Ricus, with ids 1, 2 and 3 respectively, are retrieved.
36 Improvements: Online Data Cleaning using qgram tries
Matching procedure:
- candidate words sharing common qgrams in the same positions with the input value are stored in the qgram trie
- the qgram trie is searched according to the qgram sequence of the input value; the search covers all trie paths that hold subsequences of the qgram sequence extracted from the possibly dirty input value
- matching scores between the input value and the clean words are stored in a score table
- the set of clean words whose similarity with the input word is above a similarity threshold is returned
37 Improvements: Online Data Cleaning using qgram tries
Input: attribute value u, Word Index
Output: K closest words to u
1. Select a qgram subsequence s of input value u
   a. Find the first qgram q of s in the header table
      i. Access all nodes holding q
      ii. Search all possible trie paths whose nodes hold the qgram subsequence s beginning with q
      iii. Update the score table in case of a successful match
   b. Check for unselected qgram subsequences of u
      i. If unselected qgram subsequences of u exist, repeat step 1
      ii. Else go to step 2
2. Sort the score table
3. Return the K most similar words according to their score
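
A minimal sketch of the trie and the search loop follows. The slides do not spell out how a match is scored, so the rule used here (the longest run of consecutive input qgrams found along a word's path) is an assumption, as are the class and method names; node-links are modeled as a list of nodes per header entry.

```python
from collections import defaultdict

class Node:
    def __init__(self, qgram):
        self.qgram = qgram
        self.count = 0       # clean words whose path passes through this node
        self.word_ids = []   # category-list
        self.children = {}

class QgramTrie:
    def __init__(self, q=3):
        self.q = q
        self.root = Node(None)           # root labeled null
        self.header = defaultdict(list)  # qgram -> nodes carrying it (node-links)

    def qgram_seq(self, word):
        return [word[i:i + self.q] for i in range(len(word) - self.q + 1)]

    def insert(self, word_id, word):
        node = self.root
        for g in self.qgram_seq(word):
            child = node.children.get(g)
            if child is None:
                child = Node(g)
                node.children[g] = child
                self.header[g].append(child)
            child.count += 1
            child.word_ids.append(word_id)
            node = child

    def match(self, value, k=1):
        """Return the k best (word_id, score) pairs under the assumed scoring rule."""
        seq = self.qgram_seq(value)
        scores = defaultdict(int)
        for start, g in enumerate(seq):           # step 1: next qgram subsequence
            for node in self.header.get(g, ()):   # 1.a.i: all nodes holding q
                depth, i = 0, start
                while node is not None:           # 1.a.ii: follow the path
                    depth += 1
                    for word_id in node.word_ids:  # 1.a.iii: update score table
                        scores[word_id] = max(scores[word_id], depth)
                    i += 1
                    node = node.children.get(seq[i]) if i < len(seq) else None
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))  # step 2
        return ranked[:k]                                                # step 3

trie = QgramTrie()
for word_id, word in [(1, "Ric"), (2, "Rica"), (3, "Ricus")]:
    trie.insert(word_id, word)
ranking = trie.match("Ricuss", k=3)
print(ranking)
```

Under this scoring rule, Ric and Rica each share only the qgram Ric with the input, while Ricus shares a longer run, so it ranks first.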
38 Improvements: Online Data Cleaning using qgram tries
Input value: Ricuss
Qgram sequence: {Ric, icu, cus, uss}

clean word | word id | score
Ric        | 1       | 1
Rica       | 2       | 1
Ricus      | …       | …
39 Improvements: Online Data Cleaning using qgram tries
Each tuple is classified as one of the following:
- Clean
  - detected duplicate (i.e., a record exists in the reference relation)
  - new (a respective record did not previously exist in the database)
- Not resolved, because there are many candidates and manual attention is needed
40 Improvements: Online Data Cleaning using qgram tries
Experimental parameters & measures
Measures (y-axis):
- time to complete
- number of comparisons
- I/O activity
- precision and recall (percentage of successful corrections and missed corrections)
- memory used, hard disk space needed
- time to generate any auxiliary structures
Varied parameters:
- the data set size and the stream size
- noise level
41 That's all, folks
42 Challenges for Fuzzy Matching
To ensure high data quality, incoming data tuples must be validated and undergo a cleaning procedure.
In many situations, clean tuples must match acceptable tuples in reference tables.
For example, the product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
More informationThe Set Data Model CHAPTER 7. 7.1 What This Chapter Is About
CHAPTER 7 The Set Data Model The set is the most fundamental data model of mathematics. Every concept in mathematics, from trees to real numbers, is expressible as a special kind of set. In this book,
More informationQuery Processing C H A P T E R12. Practice Exercises
C H A P T E R12 Query Processing Practice Exercises 12.1 Assume (for simplicity in this exercise) that only one tuple fits in a block and memory holds at most 3 blocks. Show the runs created on each pass
More information1. Physical Database Design in Relational Databases (1)
Chapter 20 Physical Database Design and Tuning Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 1. Physical Database Design in Relational Databases (1) Factors that Influence
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationCHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology
More informationDATA STRUCTURES USING C
DATA STRUCTURES USING C QUESTION BANK UNIT I 1. Define data. 2. Define Entity. 3. Define information. 4. Define Array. 5. Define data structure. 6. Give any two applications of data structures. 7. Give
More informationHunting for the Root Cause of Robotic VoIP
Hunting for the Root Cause of Robotic VoIP Avaya Labs Research November 2007 / PNW Meeting Robotic Voice at Avaya. User complaints of robotic voice. Little data about the problem. Problem is intermittent.
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 7, July 23 ISSN: 2277 28X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Greedy Algorithm:
More information2) What is the structure of an organization? Explain how IT support at different organizational levels.
(PGDIT 01) Paper - I : BASICS OF INFORMATION TECHNOLOGY 1) What is an information technology? Why you need to know about IT. 2) What is the structure of an organization? Explain how IT support at different
More informationApproximate Search Engine Optimization for Directory Service
Approximate Search Engine Optimization for Directory Service Kai-Hsiang Yang and Chi-Chien Pan and Tzao-Lin Lee Department of Computer Science and Information Engineering, National Taiwan University, Taipei,
More informationFacebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
More informationAdvanced Oracle SQL Tuning
Advanced Oracle SQL Tuning Seminar content technical details 1) Understanding Execution Plans In this part you will learn how exactly Oracle executes SQL execution plans. Instead of describing on PowerPoint
More informationAnalysis of Algorithms I: Binary Search Trees
Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary
More informationEmail Image Control. Administrator Guide
Email Image Control Administrator Guide Image Control Administrator Guide Documentation version: 1.0 Legal Notice Legal Notice Copyright 2013 Symantec Corporation. All rights reserved. Symantec, the Symantec
More informationClustering on Large Numeric Data Sets Using Hierarchical Approach Birch
Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global
More informationAn Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationEmail Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
More information2. Basic Relational Data Model
2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that
More informationCIS 631 Database Management Systems Sample Final Exam
CIS 631 Database Management Systems Sample Final Exam 1. (25 points) Match the items from the left column with those in the right and place the letters in the empty slots. k 1. Single-level index files
More informationLeveraging Aggregate Constraints For Deduplication
Leveraging Aggregate Constraints For Deduplication Surajit Chaudhuri Anish Das Sarma Venkatesh Ganti Raghav Kaushik Microsoft Research Stanford University Microsoft Research Microsoft Research surajitc@microsoft.com
More informationChange Color for Export from Light Green to Orange when it Completes with Errors (31297)
ediscovery 5.3.1 Service Pack 8 Release Notes Document Date: July 6, 2015 2015 AccessData Group, Inc. All Rights Reserved Introduction This document lists the issues addressed by this release. All known
More informationQuiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?
Database Indexes How costly is this operation (naive solution)? course per weekday hour room TDA356 2 VR Monday 13:15 TDA356 2 VR Thursday 08:00 TDA356 4 HB1 Tuesday 08:00 TDA356 4 HB1 Friday 13:15 TIN090
More informationInside the PostgreSQL Query Optimizer
Inside the PostgreSQL Query Optimizer Neil Conway neilc@samurai.com Fujitsu Australia Software Technology PostgreSQL Query Optimizer Internals p. 1 Outline Introduction to query optimization Outline of
More informationEfficient Data Structures for Decision Diagrams
Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...
More informationReport on the Train Ticketing System
Report on the Train Ticketing System Author: Zaobo He, Bing Jiang, Zhuojun Duan 1.Introduction... 2 1.1 Intentions... 2 1.2 Background... 2 2. Overview of the Tasks... 3 2.1 Modules of the system... 3
More informationEfficient Iceberg Query Evaluation for Structured Data using Bitmap Indices
Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil
More informationE-mail Listeners. E-mail Formats. Free Form. Formatted
E-mail Listeners 6 E-mail Formats You use the E-mail Listeners application to receive and process Service Requests and other types of tickets through e-mail in the form of e-mail messages. Using E- mail
More informationBinary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child
Binary Search Trees Data in each node Larger than the data in its left child Smaller than the data in its right child FIGURE 11-6 Arbitrary binary tree FIGURE 11-7 Binary search tree Data Structures Using
More informationPerformance Tuning for the Teradata Database
Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document
More informationOracle EXAM - 1Z0-117. Oracle Database 11g Release 2: SQL Tuning. Buy Full Product. http://www.examskey.com/1z0-117.html
Oracle EXAM - 1Z0-117 Oracle Database 11g Release 2: SQL Tuning Buy Full Product http://www.examskey.com/1z0-117.html Examskey Oracle 1Z0-117 exam demo product is here for you to test the quality of the
More informationData Quality Aware Query System
Data Quality Aware Query System By Naiem Khodabandehloo Yeganeh in fulfilment of the Degree of Doctorate of Philosophy School of Information Technology and Electrical Engineering April 2012 Examiner s
More information