On-line Data De-duplication. Ιωάννης Κρομμύδας

1 On-line Data De-duplication Ιωάννης Κρομμύδας

2 Roadmap
- Data Cleaning Importance
- Data Cleaning Tasks
- Challenges for Fuzzy Matching
- Baseline Method (fuzzy match data cleaning)
- Improvements: online data cleaning using qgram tries

4 Data Cleaning Importance
Data cleaning is critical for many industries over a wide variety of applications:
- marketing communications
- customer matching
- merging information systems
- medical records

5 Data Cleaning Importance
- The efficiency of every information processing infrastructure is greatly affected by the quality of the data residing in its databases.
- Poor data quality has a variety of causes:
  - data entry errors (e.g., typing mistakes)
  - multiple conventions for recording database fields (e.g., company names, addresses)

6 Data Cleaning Importance
Poor data quality has a significant impact on a variety of business issues:
- customer relationship management
- inability to retrieve a customer record during a service call
- billing errors
- distribution delays

8 Data Cleaning Tasks
- One of the most important tasks in data cleaning is to de-duplicate records:
  - detection of multiple representations of a single entity
- The problem is straightforward for numerical values, but it is very hard for string values and for combinations of them within an attribute:
  - names (first, middle and last name), addresses, etc.

9 Data Cleaning Tasks
- Considering company names, it is common to see Microsoft, Micorsoft, Microsoft Inc. and Microsoft Corporation used in different records to represent the same entity.
- A simple equality or (even) substring comparison on names or addresses will not properly identify them as the same entity, leading to a variety of potential business problems.
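
A minimal sketch of the point made on this slide, using the four spellings of the company name. The plain Levenshtein distance and the threshold of 2 are illustrative assumptions, not part of the presented method; they only show which variants each kind of comparison catches.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb
        prev = curr
    return prev[-1]


names = ["Microsoft", "Micorsoft", "Microsoft Inc.", "Microsoft Corporation"]
reference = "Microsoft"

exact = [n for n in names if n == reference]                    # equality test
substring = [n for n in names if reference in n]                # substring test
close = [n for n in names if edit_distance(n, reference) <= 2]  # assumed threshold

print(exact)      # only the correctly spelled, unsuffixed name
print(substring)  # misses the misspelling "Micorsoft"
print(close)      # catches the misspelling but misses the suffixed variants
```

No single character-level test recovers all four variants, which is the motivation for a token-aware similarity function.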

10 Data Cleaning Tasks
Two possible modes of de-duplication:
- Detection of exact duplicates, which requires a typical join operation
- Fuzzy matching, which entails the detection of inexact duplicates
  - presents a trade-off between accuracy, efficiency and storage overheads

12 Challenges for Fuzzy Matching
- Assume a clean reference relation R and a stream S of possibly dirty tuples, which we check against R for fuzzy duplicates.
- Task: first try an exact match, else try a fuzzy match
- Issues:
  - Accuracy of the identification
  - An appropriate similarity function
  - Avoiding a comparison of every stream record with every tuple in R

13 Challenges for Fuzzy Matching
Fig. 1. Template for using Fuzzy Match [CGGM03]

14 Challenges for Fuzzy Matching
Given the similarity function and an input tuple, the result of a fuzzy match operation could be one of the following:
- the reference tuple closest to the input tuple
- the closest K reference tuples, enabling users, if necessary, to choose one among them
- K or fewer tuples whose similarity to the input tuple exceeds a user-specified minimum similarity threshold

16 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003
- adopt a probabilistic approach in order to return the closest K reference tuples with high probability
- propose a fuzzy match similarity function (fms) that explicitly considers IDF token weights and input errors while comparing tuples

17 Baseline Method (Fuzzy Match Data Cleaning)
Chaudhuri et al., SIGMOD 2003
- preprocess the reference relation to build an index relation, called the error tolerant index (ETI) relation, for retrieving at run time a small set of candidate reference tuples
- retrieve with high probability a superset of the K reference tuples closest to the input tuple

18 Baseline Method (Fuzzy Match Data Cleaning)
- Similarity between an input tuple and a reference tuple can be described as the cost of transforming the former into the latter
  - low transformation costs of input tuples denote high similarity
- Transformation operations are applied to the set of tokens included in the attributes of a tuple
- The set of tokens included in attribute i of tuple v is denoted by tok(v[i])
  - if v[i] = Boeing Company, then tok(v[i]) = {Boeing, Company}

19 Baseline Method (Fuzzy Match Data Cleaning)
Each transformation operation is associated with a cost that depends on the weight of the transformed token:

$$w(t, i) = \mathrm{IDF}(t, i) = \log \frac{|R|}{\mathrm{freq}(t, i)}$$

where freq(t, i) denotes the frequency of token t in column i, i.e., the number of tuples v in R such that tok(v[i]) contains t.
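
The weight definition above can be sketched as follows. The three-row reference relation, the whitespace tokenizer and the lowercasing are illustrative assumptions; the slides do not specify how values are split into tokens.

```python
import math
from collections import Counter


def tok(value: str) -> set[str]:
    """Token set of one attribute value (assumed: lowercase, split on whitespace)."""
    return set(value.lower().split())


def idf_weights(reference: list[tuple[str, ...]], column: int) -> dict[str, float]:
    """w(t, i) = log(|R| / freq(t, i)) for every token t seen in column i."""
    freq = Counter()
    for row in reference:
        freq.update(tok(row[column]))  # a tuple counts a token at most once
    return {t: math.log(len(reference) / f) for t, f in freq.items()}


# Hypothetical reference relation R with columns (name, city).
R = [
    ("Boeing Company", "Seattle"),
    ("Bon Corporation", "Seattle"),
    ("Companions", "Seattle"),
]

name_weights = idf_weights(R, 0)
city_weights = idf_weights(R, 1)
print(name_weights["boeing"])   # rare token, weight log(3)
print(city_weights["seattle"])  # token present in every tuple, weight 0
```

Rare tokens carry high weight, so an error in a rare token costs more than an error in a token that appears in most tuples.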

20 Baseline Method (Fuzzy Match Data Cleaning)
Let u be an input tuple and v a reference tuple. The cost of each operation used to transform u into v is defined in the following table:

| Operation | Description | Cost |
|---|---|---|
| token replacement | replaces t1 in tok(u[i]) by t2 in tok(v[i]) | ed(t1, t2) · w(t1, i) |
| token insertion | inserts a token t into u[i] | c_ins · w(t, i), with c_ins between 0 and 1 |
| token deletion | deletes a token t from u[i] | w(t, i) |

21 Baseline Method (Fuzzy Match Data Cleaning)
- The transformation cost tc(u[i], v[i]) is the cost of the minimum-cost transformation sequence for transforming u[i] into v[i].
- The cost tc(u, v) of transforming u into v is the sum over all columns i of the costs tc(u[i], v[i]):

$$tc(u, v) = \sum_i tc(u[i], v[i])$$

22 Baseline Method (Fuzzy Match Data Cleaning)
The fuzzy match similarity function fms(u, v) between an input tuple u and a reference tuple v can be defined in terms of the transformation cost tc(u, v) as:

$$fms(u, v) = 1 - \min\left(\frac{tc(u, v)}{w(u)},\ 1\right)$$

- w(u) is the sum of the weights of all tokens in the token set tok(u)
- the token set tok(u) denotes the multiset union of the sets tok(a_1), ..., tok(a_n) of tokens from the tuple u[a_1, ..., a_n]
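
A minimal sketch of tc and fms as defined on the preceding slides, computing the minimum-cost token transformation per column with a dynamic program. Several choices here are assumptions made only for illustration: ed is taken to be edit distance normalised by the longer token, weights are looked up per token without a column index, unseen tokens get a default weight of 0.5, c_ins is 0.5, and the weight of "wa" is invented. The sketch also assumes u contains at least one token with positive weight.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ed(t1: str, t2: str) -> float:
    """Edit distance normalised to [0, 1] by the longer token (assumed)."""
    longest = max(len(t1), len(t2))
    return edit_distance(t1, t2) / longest if longest else 0.0


def tc_column(u_tokens, v_tokens, w, c_ins):
    """Minimum cost of turning the token sequence u_tokens into v_tokens."""
    n, m = len(u_tokens), len(v_tokens)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):                       # delete all input tokens
        D[a][0] = D[a - 1][0] + w(u_tokens[a - 1])
    for b in range(1, m + 1):                       # insert all reference tokens
        D[0][b] = D[0][b - 1] + c_ins * w(v_tokens[b - 1])
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            t1, t2 = u_tokens[a - 1], v_tokens[b - 1]
            D[a][b] = min(D[a - 1][b] + w(t1),              # token deletion
                          D[a][b - 1] + c_ins * w(t2),      # token insertion
                          D[a - 1][b - 1] + ed(t1, t2) * w(t1))  # replacement
    return D[n][m]


def fms(u, v, w, c_ins=0.5):
    """fms(u, v) = 1 - min(tc(u, v) / w(u), 1)."""
    u_cols = [col.lower().split() for col in u]
    v_cols = [col.lower().split() for col in v]
    tc = sum(tc_column(ut, vt, w, c_ins) for ut, vt in zip(u_cols, v_cols))
    w_u = sum(w(t) for ut in u_cols for t in ut)
    return 1.0 - min(tc / w_u, 1.0)


# Hypothetical token weights; unseen tokens (e.g. "beoing") default to 0.5.
WEIGHTS = {"boeing": 0.5, "company": 0.25, "seattle": 1.0, "wa": 0.75, "98004": 2.0}


def w(token: str) -> float:
    return WEIGHTS.get(token, 0.5)


u = ("Beoing Company", "Seattle", "", "98004")    # empty string stands for NULL
v = ("Boeing Company", "Seattle", "WA", "98004")
print(fms(u, v, w))
```

In this example the only costs are replacing beoing by boeing and inserting the missing token wa, so the similarity stays well above what an unrelated tuple would score.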

23 Baseline Method (Fuzzy Match Data Cleaning)
- The K-fuzzy match problem: given a reference relation R, a minimum similarity threshold c (0 < c < 1) and an input tuple u, find the set FM(u) of fuzzy matches of at most K tuples from R
- Naïve algorithm: scan the reference relation R, comparing each tuple with u
- Proposed method: build an index on the reference relation for quickly retrieving a superset of the target fuzzy matches (pre-processing phase)
  - this index relation is called the Error Tolerant Index (ETI)
  - it is indexed using standard B+ trees to perform fast exact lookups
  - preparing the ETI requires an approximation of fms, called fms_apx

24 Baseline Method (Fuzzy Match Data Cleaning)
Diagram: Reference Relation (not indexable), pre-processing, Error Tolerant Index (standard database relation, but indexable), Candidate Set (superset of FM(u))
- The approximation of fms (fms_apx) is a pared-down version of fms
  - it ignores the ordering among tokens in the input and reference tuples
  - [beoing company, seattle, wa, 98004] and [company beoing, seattle, wa, 98004] are identical to fms_apx
  - in fms_apx, closeness between two tokens is measured through the similarity between sets of substrings called qgram sets

25 Baseline Method (Fuzzy Match Data Cleaning)
Estimating fms_apx requires computing token min-hash signatures and the min-hash similarity sim_mh between two tokens.

Min-hash similarity:
- U: the universe of strings over an alphabet Σ
- h_i : U → N, i = 1, ..., H: H hash functions mapping elements of U uniformly and randomly to the set of natural numbers N
- S: a set of strings. The min-hash signature mh(S) of S is the vector [mh_1(S), ..., mh_H(S)], where the i-th coordinate mh_i(S) is defined as:

$$mh_i(S) = \arg\min_{a \in S} h_i(a)$$

$$sim_{mh}(t_1, t_2) = \frac{1}{H} \sum_{i=1}^{H} I\left[\, mh_i(QG(t_1)) = mh_i(QG(t_2)) \,\right]$$

Here QG(t) is the qgram set of token t, and I[X] denotes an indicator variable over a boolean X (I[X] = 1 if X is true, else 0).
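
The two definitions above can be sketched as follows. The hash family is a stand-in: salting SHA-256 with the coordinate index is an assumption chosen only to get H deterministic, roughly independent hash functions. Keeping tokens shorter than q as a single qgram is likewise an assumption.

```python
import hashlib


def qgrams(token: str, q: int = 3) -> set[str]:
    """QG(t): the set of q-grams of a token (short tokens kept whole, assumed)."""
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def h(i: int, a: str) -> int:
    """Stand-in for the i-th hash function h_i: salt the string with i, then hash."""
    digest = hashlib.sha256(f"{i}:{a}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(S: set[str], H: int) -> list[str]:
    """mh(S) = [mh_1(S), ..., mh_H(S)], mh_i(S) = argmin over a in S of h_i(a)."""
    # The string itself breaks (practically impossible) hash ties deterministically.
    return [min(S, key=lambda a: (h(i, a), a)) for i in range(H)]


def sim_mh(t1: str, t2: str, H: int = 16, q: int = 3) -> float:
    """Fraction of signature coordinates on which the two tokens agree."""
    sig1 = minhash_signature(qgrams(t1, q), H)
    sig2 = minhash_signature(qgrams(t2, q), H)
    return sum(a == b for a, b in zip(sig1, sig2)) / H


print(minhash_signature(qgrams("boeing"), H=2))
print(sim_mh("beoing", "boeing"))   # the two tokens share only the qgram "ing"
print(sim_mh("boeing", "seattle"))  # disjoint qgram sets never agree
```

With ideal hash functions, each coordinate agrees with probability equal to the Jaccard similarity of the two qgram sets, so sim_mh estimates that similarity from H small comparisons.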

26 Baseline Method (Fuzzy Match Data Cleaning)
Let u, v be two tuples and let d_q = (1 - 1/q) be an adjustment term. fms_apx is defined as:

$$fms^{apx}(u, v) = \frac{1}{w(u)} \sum_i \sum_{t \in tok(u[i])} w(t) \cdot \max_{r \in tok(v[i])} \left( \frac{2}{q}\, sim_{mh}(QG(t), QG(r)) + d_q \right)$$

E.g.:
- Input tuple u: [Company Beoing, Seattle, NULL, 98004]
- Reference tuple v: [Boeing Company, Seattle, WA, 98004]
- q = 3, H = 2
- token weights: company: 0.25, beoing: 0.5, seattle: 1.0, 98004: 2.0; total weight = 3.75
- Suppose the min-hash signatures are [oei, ing], [com, pan], [sea, ttl], [wa], [980, 004]
- The score from matching beoing with boeing is: w(beoing) · (2/3 · 0.5 + (1 - 1/3)) = w(beoing)
- Since every token matches exactly with a reference token, fms_apx(u, v) = 3.75/ q

27 Baseline Method (Fuzzy Match Data Cleaning)
Error Tolerant Index (ETI)
- enables, for each input tuple u, the efficient retrieval of a candidate set S of reference tuples whose similarity is greater than the minimum similarity threshold
- fms_apx is measured by comparing the min-hash signatures of tokens in tok(u) and tok(v)
  - to determine the candidate set, we need to efficiently identify, for each token t in tok(u), the set of reference tuples sharing min-hash qgrams with t
- holds each qgram s along with the list of all tids of reference tuples with tokens whose min-hash signatures contain s

28 Baseline Method (Fuzzy Match Data Cleaning)
- ETI schema: [QGram, Coordinate, Column, Frequency, Tid-list]
- For each tuple e in the ETI:
  - e[tid-list] contains the list of tids of all reference tuples containing at least one token t in the field e[column] whose e[coordinate]-th min-hash coordinate is e[qgram]
  - e[frequency] stores the number of tids included in e[tid-list]
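
A minimal in-memory sketch of building a relation with the schema above. The slides describe the ETI as a standard database relation indexed with B+ trees; the Python dict keyed by (qgram, coordinate, column) is only a stand-in for that. The reference rows, the salted SHA-256 hash family and the handling of short tokens are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def qgrams(token: str, q: int = 3) -> set[str]:
    """Qgram set of a token; tokens shorter than q are kept whole (assumed)."""
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def minhash_signature(grams: set[str], H: int) -> list[str]:
    """Coordinate i is the qgram that minimises a hash salted with i (assumed family)."""
    def h(i: int, gram: str) -> str:
        return hashlib.sha256(f"{i}:{gram}".encode("utf-8")).hexdigest()
    return [min(grams, key=lambda g: (h(i, g), g)) for i in range(H)]


def build_eti(reference, H: int = 2, q: int = 3) -> dict:
    """Map (qgram, coordinate, column) to its frequency and sorted tid-list."""
    tid_sets = defaultdict(set)
    for tid, row in enumerate(reference):
        for column, value in enumerate(row):
            for token in value.lower().split():
                signature = minhash_signature(qgrams(token, q), H)
                for coordinate, gram in enumerate(signature):
                    tid_sets[(gram, coordinate, column)].add(tid)
    return {key: {"frequency": len(tids), "tid_list": sorted(tids)}
            for key, tids in tid_sets.items()}


# Hypothetical reference relation; the position in the list serves as the tid.
R = [
    ("Boeing Company", "Seattle", "WA", "98004"),
    ("Bon Corporation", "Seattle", "WA", "98014"),
    ("Companions", "Seattle", "WA", "98024"),
]

eti = build_eti(R)
for key in sorted(eti):
    print(key, eti[key])
```

At match time, the same signature computation is applied to each input token and the resulting (qgram, coordinate, column) keys are looked up to collect candidate tids.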

30 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm
- goal: reduce the number of lookups against the reference relation by effectively using the ETI
- fetches tid-lists by looking up in the ETI all q-grams in the min-hash signatures of all tokens in u

31 Baseline Method (Fuzzy Match Data Cleaning)
Basic Algorithm
1) For each token t in tok(u), compute its IDF weight w(t)
2) Determine the min-hash signature mh(t) of each token
3) Using the ETI, determine the candidate set S of reference tuples as per fms_apx
4) Fetch the tuples in S from the reference relation, and test them as per fms
5) Among the tuples that pass the test, return the K tuples with the K highest similarity scores

33 Improvements: Online Data Cleaning using qgram tries
- Proposed method for cleaning a stream of incoming tuples before their insertion into a database table
- Uses:
  - Word Index
    - a structure similar to the ETI
    - holds information about the attribute values stored in the reference table
    - is used for the retrieval of clean words that probably match the input attribute values of a tuple
  - Qgram Trie
    - stores the retrieved clean words
    - held in main memory

34 Improvements: Online Data Cleaning using qgram tries
The Word Index consists of five fields:
- qgram: corresponds to a sequence of Q characters
- coordinate: represents the position at which the corresponding qgram occurs within a string value
- column: indicates the string-valued attribute that holds the specific value
- code-list: contains a word-id list created from the words that include the qgram in the position denoted by the coordinate field
- frequency: represents the number of words belonging to the code-list
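
A minimal sketch of the five-field structure described above, together with the lookup it is meant to support. The dict-based layout, the function names and the column name "name" are hypothetical; the clean words and ids reuse the Ric, Rica, Ricus example from this deck.

```python
from collections import defaultdict


def positional_qgrams(word: str, q: int = 3) -> list[tuple[str, int]]:
    """(qgram, position) pairs of a word; words shorter than q yield none."""
    return [(word[i:i + q], i) for i in range(len(word) - q + 1)]


def build_word_index(words_by_column: dict, q: int = 3) -> dict:
    """Map (qgram, coordinate, column) to its code-list and frequency."""
    code_lists = defaultdict(list)
    for column, words in words_by_column.items():
        for word_id, word in sorted(words.items()):
            for gram, position in positional_qgrams(word, q):
                code_lists[(gram, position, column)].append(word_id)
    return {key: {"code_list": ids, "frequency": len(ids)}
            for key, ids in code_lists.items()}


def candidate_words(value: str, column: str, index: dict, q: int = 3) -> set[int]:
    """Ids of clean words sharing at least one qgram with value at the same position."""
    ids = set()
    for gram, position in positional_qgrams(value, q):
        entry = index.get((gram, position, column))
        if entry is not None:
            ids.update(entry["code_list"])
    return ids


# Clean words of one hypothetical string-valued attribute, keyed by word id.
reference_words = {"name": {1: "Ric", 2: "Rica", 3: "Ricus"}}

index = build_word_index(reference_words)
print(index[("Ric", 0, "name")])                 # shared by all three words
print(candidate_words("Ricuss", "name", index))  # ids of the retrieved clean words
```

Only the words returned by such a lookup need to be loaded into the in-memory qgram trie for scoring.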

35 Improvements: Online Data Cleaning using qgram tries
- Qgram trie
  - root labeled null
  - word-prefix subtrees as the children of the root
  - header table
- Qgram trie node
  - qgram: registers the qgram represented by the node
  - count: number of clean words represented by the portion of the path reaching this node
  - node-link: links to the next node in the trie carrying the same qgram, or null if there is none
  - category-list: word-id list of the words that share this node in the trie representation
- Header table
  - qgram
  - head of node-link: points to the first node in the trie carrying the qgram
- E.g., the qgram trie built in memory if the clean words Ric, Rica and Ricus, with ids 1, 2 and 3 respectively, are retrieved
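
A minimal sketch of the node and header-table layout listed above, built for the clean words Ric, Rica and Ricus. The `children` dict and the `_last` helper are implementation conveniences that the slide does not mention, and treating category-list as the ids of all words whose path passes through the node is one reading of the description.

```python
class Node:
    def __init__(self, qgram):
        self.qgram = qgram        # qgram represented by this node (None for the root)
        self.count = 0            # number of clean words whose path reaches this node
        self.node_link = None     # next node in the trie carrying the same qgram
        self.category_list = []   # ids of the words that share this node
        self.children = {}        # assumption: child lookup by qgram


class QgramTrie:
    def __init__(self, q: int = 3):
        self.q = q
        self.root = Node(None)    # root labeled null
        self.header = {}          # qgram -> first node carrying it (head of node-link)
        self._last = {}           # qgram -> last node in its chain (helper, assumed)

    def insert(self, word: str, word_id: int) -> None:
        node = self.root
        for i in range(len(word) - self.q + 1):
            gram = word[i:i + self.q]
            child = node.children.get(gram)
            if child is None:
                child = Node(gram)
                node.children[gram] = child
                if gram in self.header:
                    self._last[gram].node_link = child  # extend the node-link chain
                else:
                    self.header[gram] = child
                self._last[gram] = child
            child.count += 1
            child.category_list.append(word_id)
            node = child


trie = QgramTrie()
for word_id, word in [(1, "Ric"), (2, "Rica"), (3, "Ricus")]:
    trie.insert(word, word_id)

ric = trie.header["Ric"]
print(ric.count, ric.category_list, sorted(ric.children))
# prints: 3 [1, 2, 3] ['ica', 'icu']
```

The three words share the node for Ric, below which the paths for Rica and Ricus branch.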

36 Improvements: Online Data Cleaning using qgram tries
Matching procedure
- Candidate words sharing common qgrams in the same positions with the input value are stored in the qgram trie
- The qgram trie is searched according to the qgram sequence of the input value
  - all paths of the trie holding subsequences of a specific qgram sequence extracted from the possibly dirty input value
- Matching scores between the input value and the clean words are stored in a score table
- The set of clean words whose similarity with the input word is above a similarity threshold is returned

37 Improvements: Online Data Cleaning using qgram tries
Input: attribute value u, Word Index
Output: K closest words to u
1. Select a qgram subsequence s of input value u
   a. Find the first qgram q of s in the header table
      i. Access all nodes holding q
      ii. Search all possible paths of the trie with nodes holding the qgram subsequence s beginning with q
      iii. Update the score table in case of a successful match
   b. Check for unselected qgram subsequences of u
      i. if unselected qgram subsequences of u exist, repeat step 1
      ii. else go to step 2
2. Sort the score table
3. Return the K most similar words according to their score

38 Improvements: Online Data Cleaning using qgram tries
- input value: Ricuss
- qgram sequence: {Ric, icu, cus, uss}

| clean word | word id | score |
|---|---|---|
| Ric | 1 | 1 |
| Rica | 2 | 1 |
| Ricus | | |
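
A simplified sketch of the matching procedure, run on the input value Ricuss against the clean words Ric, Rica and Ricus. The scoring rule is an assumption: a word's score is the length of the longest run of consecutive input qgrams matched along a trie path that the word shares. The slides do not spell out the scoring function, so the value this sketch assigns to Ricus comes from the assumed rule only. The `children` dict and all function names are hypothetical.

```python
def qgram_sequence(word: str, q: int = 3) -> list[str]:
    return [word[i:i + q] for i in range(len(word) - q + 1)]


class Node:
    def __init__(self, qgram):
        self.qgram = qgram
        self.node_link = None     # next node carrying the same qgram
        self.category_list = []   # ids of the words sharing this node
        self.children = {}        # assumption: child lookup by qgram


def build_header(words: dict, q: int = 3) -> dict:
    """Build the trie for the clean words and return its header table."""
    root, header, last = Node(None), {}, {}
    for word_id, word in sorted(words.items()):
        node = root
        for gram in qgram_sequence(word, q):
            child = node.children.get(gram)
            if child is None:
                child = node.children[gram] = Node(gram)
                if gram in header:
                    last[gram].node_link = child
                else:
                    header[gram] = child
                last[gram] = child
            child.category_list.append(word_id)
            node = child
    return header


def match(value: str, header: dict, K: int, q: int = 3) -> list[tuple[int, int]]:
    """Return up to K (word id, score) pairs, best score first."""
    seq = qgram_sequence(value, q)
    scores = {}                                   # the score table
    for start in range(len(seq)):                 # step 1: pick a subsequence
        node = header.get(seq[start])             # step 1a: header-table lookup
        while node is not None:                   # 1a.i: every node holding the qgram
            current, pos, depth = node, start, 0
            while current is not None:            # 1a.ii: follow the matching path
                depth += 1
                for word_id in current.category_list:   # 1a.iii: update scores
                    scores[word_id] = max(scores.get(word_id, 0), depth)
                pos += 1
                current = current.children.get(seq[pos]) if pos < len(seq) else None
            node = node.node_link
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))  # step 2
    return ranked[:K]                                                # step 3


header = build_header({1: "Ric", 2: "Rica", 3: "Ricus"})
print(match("Ricuss", header, K=3))
```

Under this rule Ric and Rica each score 1, matching the rows of the table that survive in the transcription, because only their first qgram appears in the input.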

39 Improvements: Online Data Cleaning using qgram tries
Each tuple is classified as one of the following:
- Clean
  - detected duplicate (i.e., a record exists in the reference relation)
  - new (a respective record did not previously exist in the database)
- Not-resolved, because there are many candidates and manual attention is needed

40 Improvements: Online Data Cleaning using qgram tries
Experimental parameters & measures
- measures (y-axis)
  - time to complete
  - number of comparisons
  - IO activities
  - precision and recall (percentage of successful corrections and missed corrections)
  - memory used, hard disk needed
  - time to generate any auxiliary structures
- varied parameters
  - the data set size and the stream size
  - noise level

41 That's all folks

42 Challenges for Fuzzy Matching
- To ensure high data quality, incoming data tuples must be validated and undergo a cleaning procedure
- In many situations, clean tuples must match acceptable tuples in reference tables
- For example, the product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation


More information

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group Medical Information-Retrieval Systems Dong Peng Medical Informatics Group Outline Evolution of medical Information-Retrieval (IR). The information retrieval process. The trend of medical information retrieval

More information

Performing Queries Using PROC SQL (1)

Performing Queries Using PROC SQL (1) SAS SQL Contents Performing queries using PROC SQL Performing advanced queries using PROC SQL Combining tables horizontally using PROC SQL Combining tables vertically using PROC SQL 2 Performing Queries

More information

Chapter 12 File Management. Roadmap

Chapter 12 File Management. Roadmap Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access

More information

Chapter 12 File Management

Chapter 12 File Management Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access

More information

root node level: internal node edge leaf node CS@VT Data Structures & Algorithms 2000-2009 McQuain

root node level: internal node edge leaf node CS@VT Data Structures & Algorithms 2000-2009 McQuain inary Trees 1 A binary tree is either empty, or it consists of a node called the root together with two binary trees called the left subtree and the right subtree of the root, which are disjoint from each

More information

Inverted Indexes: Trading Precision for Efficiency

Inverted Indexes: Trading Precision for Efficiency Inverted Indexes: Trading Precision for Efficiency Yufei Tao KAIST April 1, 2013 After compression, an inverted index is often small enough to fit in memory. This benefits query processing because it avoids

More information

Data Mining on Streams

Data Mining on Streams Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Chapter 8: Structures for Files. Truong Quynh Chi tqchi@cse.hcmut.edu.vn. Spring- 2013

Chapter 8: Structures for Files. Truong Quynh Chi tqchi@cse.hcmut.edu.vn. Spring- 2013 Chapter 8: Data Storage, Indexing Structures for Files Truong Quynh Chi tqchi@cse.hcmut.edu.vn Spring- 2013 Overview of Database Design Process 2 Outline Data Storage Disk Storage Devices Files of Records

More information

Physical Database Design and Tuning

Physical Database Design and Tuning Chapter 20 Physical Database Design and Tuning Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 1. Physical Database Design in Relational Databases (1) Factors that Influence

More information

KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

More information

A Comparison of Dictionary Implementations

A Comparison of Dictionary Implementations A Comparison of Dictionary Implementations Mark P Neyer April 10, 2009 1 Introduction A common problem in computer science is the representation of a mapping between two sets. A mapping f : A B is a function

More information

1Z0-117 Oracle Database 11g Release 2: SQL Tuning. Oracle

1Z0-117 Oracle Database 11g Release 2: SQL Tuning. Oracle 1Z0-117 Oracle Database 11g Release 2: SQL Tuning Oracle To purchase Full version of Practice exam click below; http://www.certshome.com/1z0-117-practice-test.html FOR Oracle 1Z0-117 Exam Candidates We

More information

Text Analytics Illustrated with a Simple Data Set

Text Analytics Illustrated with a Simple Data Set CSC 594 Text Mining More on SAS Enterprise Miner Text Analytics Illustrated with a Simple Data Set This demonstration illustrates some text analytic results using a simple data set that is designed to

More information

Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

More information

The Set Data Model CHAPTER 7. 7.1 What This Chapter Is About

The Set Data Model CHAPTER 7. 7.1 What This Chapter Is About CHAPTER 7 The Set Data Model The set is the most fundamental data model of mathematics. Every concept in mathematics, from trees to real numbers, is expressible as a special kind of set. In this book,

More information

Query Processing C H A P T E R12. Practice Exercises

Query Processing C H A P T E R12. Practice Exercises C H A P T E R12 Query Processing Practice Exercises 12.1 Assume (for simplicity in this exercise) that only one tuple fits in a block and memory holds at most 3 blocks. Show the runs created on each pass

More information

1. Physical Database Design in Relational Databases (1)

1. Physical Database Design in Relational Databases (1) Chapter 20 Physical Database Design and Tuning Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 1. Physical Database Design in Relational Databases (1) Factors that Influence

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology

More information

DATA STRUCTURES USING C

DATA STRUCTURES USING C DATA STRUCTURES USING C QUESTION BANK UNIT I 1. Define data. 2. Define Entity. 3. Define information. 4. Define Array. 5. Define data structure. 6. Give any two applications of data structures. 7. Give

More information

Hunting for the Root Cause of Robotic VoIP

Hunting for the Root Cause of Robotic VoIP Hunting for the Root Cause of Robotic VoIP Avaya Labs Research November 2007 / PNW Meeting Robotic Voice at Avaya. User complaints of robotic voice. Little data about the problem. Problem is intermittent.

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 7, July 23 ISSN: 2277 28X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Greedy Algorithm:

More information

2) What is the structure of an organization? Explain how IT support at different organizational levels.

2) What is the structure of an organization? Explain how IT support at different organizational levels. (PGDIT 01) Paper - I : BASICS OF INFORMATION TECHNOLOGY 1) What is an information technology? Why you need to know about IT. 2) What is the structure of an organization? Explain how IT support at different

More information

Approximate Search Engine Optimization for Directory Service

Approximate Search Engine Optimization for Directory Service Approximate Search Engine Optimization for Directory Service Kai-Hsiang Yang and Chi-Chien Pan and Tzao-Lin Lee Department of Computer Science and Information Engineering, National Taiwan University, Taipei,

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

Advanced Oracle SQL Tuning

Advanced Oracle SQL Tuning Advanced Oracle SQL Tuning Seminar content technical details 1) Understanding Execution Plans In this part you will learn how exactly Oracle executes SQL execution plans. Instead of describing on PowerPoint

More information

Analysis of Algorithms I: Binary Search Trees

Analysis of Algorithms I: Binary Search Trees Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary

More information

Email Image Control. Administrator Guide

Email Image Control. Administrator Guide Email Image Control Administrator Guide Image Control Administrator Guide Documentation version: 1.0 Legal Notice Legal Notice Copyright 2013 Symantec Corporation. All rights reserved. Symantec, the Symantec

More information

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

An Information Retrieval using weighted Index Terms in Natural Language document collections

An Information Retrieval using weighted Index Terms in Natural Language document collections Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

2. Basic Relational Data Model

2. Basic Relational Data Model 2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

More information

CIS 631 Database Management Systems Sample Final Exam

CIS 631 Database Management Systems Sample Final Exam CIS 631 Database Management Systems Sample Final Exam 1. (25 points) Match the items from the left column with those in the right and place the letters in the empty slots. k 1. Single-level index files

More information

Leveraging Aggregate Constraints For Deduplication

Leveraging Aggregate Constraints For Deduplication Leveraging Aggregate Constraints For Deduplication Surajit Chaudhuri Anish Das Sarma Venkatesh Ganti Raghav Kaushik Microsoft Research Stanford University Microsoft Research Microsoft Research surajitc@microsoft.com

More information

Change Color for Export from Light Green to Orange when it Completes with Errors (31297)

Change Color for Export from Light Green to Orange when it Completes with Errors (31297) ediscovery 5.3.1 Service Pack 8 Release Notes Document Date: July 6, 2015 2015 AccessData Group, Inc. All Rights Reserved Introduction This document lists the issues addressed by this release. All known

More information

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)? Database Indexes How costly is this operation (naive solution)? course per weekday hour room TDA356 2 VR Monday 13:15 TDA356 2 VR Thursday 08:00 TDA356 4 HB1 Tuesday 08:00 TDA356 4 HB1 Friday 13:15 TIN090

More information

Inside the PostgreSQL Query Optimizer

Inside the PostgreSQL Query Optimizer Inside the PostgreSQL Query Optimizer Neil Conway neilc@samurai.com Fujitsu Australia Software Technology PostgreSQL Query Optimizer Internals p. 1 Outline Introduction to query optimization Outline of

More information

Efficient Data Structures for Decision Diagrams

Efficient Data Structures for Decision Diagrams Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...

More information

Report on the Train Ticketing System

Report on the Train Ticketing System Report on the Train Ticketing System Author: Zaobo He, Bing Jiang, Zhuojun Duan 1.Introduction... 2 1.1 Intentions... 2 1.2 Background... 2 2. Overview of the Tasks... 3 2.1 Modules of the system... 3

More information

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil

More information

E-mail Listeners. E-mail Formats. Free Form. Formatted

E-mail Listeners. E-mail Formats. Free Form. Formatted E-mail Listeners 6 E-mail Formats You use the E-mail Listeners application to receive and process Service Requests and other types of tickets through e-mail in the form of e-mail messages. Using E- mail

More information

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child Binary Search Trees Data in each node Larger than the data in its left child Smaller than the data in its right child FIGURE 11-6 Arbitrary binary tree FIGURE 11-7 Binary search tree Data Structures Using

More information

Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

More information

Oracle EXAM - 1Z0-117. Oracle Database 11g Release 2: SQL Tuning. Buy Full Product. http://www.examskey.com/1z0-117.html

Oracle EXAM - 1Z0-117. Oracle Database 11g Release 2: SQL Tuning. Buy Full Product. http://www.examskey.com/1z0-117.html Oracle EXAM - 1Z0-117 Oracle Database 11g Release 2: SQL Tuning Buy Full Product http://www.examskey.com/1z0-117.html Examskey Oracle 1Z0-117 exam demo product is here for you to test the quality of the

More information

Data Quality Aware Query System

Data Quality Aware Query System Data Quality Aware Query System By Naiem Khodabandehloo Yeganeh in fulfilment of the Degree of Doctorate of Philosophy School of Information Technology and Electrical Engineering April 2012 Examiner s

More information