# Experimental Comparison of Set Intersection Algorithms for Inverted Indexing

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 ITAT 213 Proceedings, CEUR Workshop Proceedings Vol. 13, pp Series ISSN , c 213 V. Boža Experimental Comparison of Set Intersection Algorithms for Inverted Indexing Vladimír Boža Department of Computer Science, Faculty of Mathematics, Physics and Informatics Comenius University Mlynská dolina, Bratislava, Slovakia Abstract: The set intersection problem is one of the main problems in document retrieval. Query consists of two keywords, and for each of keyword we have a sorted set of document IDs containing it. The goal is to retrieve the set of document IDs containing both keywords. We perform an experimental comparison of and a new algorithm by Cohen and Porat (LATIN21), which has a better theoretical time complexity. We show that the new algorithm has often worse performance than the trivial one on real data. We also propose a variant of the Cohen and Porat algorithm with a similar complexity but better empirical performance. Finally, we investigate influence of document ordering on query time. 1 Introduction In the set intersection problem, we are given a collection of sets A 1,A 2,...,A d. Our goal is to preprocess them and then answer queries of the following type: For given i, j, find the intersection of sets A i and A j. This problem appears in various areas. For example, set intersection is needed in conjuctive queries in relational databases, and Ng, Amir, Pevzner [7] use set intersection to match mass spectra against a protein database. However, perhaps the most important application is in document retrieval, where the goal is to maintain data structures over a set of documents. The most typical data structure is the inverted index, which stores for each word the set of documents containing that word. Such an index allows us to easily retrieve documents containing a particular query word. When we want to retrieve documents which contain two or more given words, we can do set intersection of corresponding document sets from the inverted index. Classical algorithms for set intersection are merging and binary search. Merging identifies common elements by iterating through both sorted lists, as in the final phase of the merge sort algorithm. If we denote the length of the smaller set as m and the larger set as n, then the time complexity of merging is O(m + n), which is good when the lengths of the two sets are almost same. If m is much smaller than n, it is better to search for each element of the smaller set in the larger set by binary search, in O(mlgn) total time. There is a better algorithm originally introduced by Bentley and Yao [8], called galloping search with time complexity O(m lg(n/m)). This algorithm has good time complexity when the lengths of the sets are similar and also when the shorter set is much shorter than longer one. We will describe this algorithm in the next section. All previous algorithms get the two sorted sets on input without any additional preprocessing. However in inverted indexing, sets for all keywords are known in advance, and perhaps some preprocessing of these sets could speed up query processing. In particular, the length of the output can be much smaller than the length of the shorter set, and it would be desirable to have an algorithm, which would not depend linearly on m. The first step in this direction is a recent algorithm by Cohen and Porat [1], which uses linear memory to store all sets and processes each intersection query in time O( No+o), where o is the length of output and N is the sum of the sizes of all sets. We will denote this algorithm as the fast set intersection algorithm. However, it is not clear, whether this theoretical improvement is really useful in practice. In this article, we compare the query times of the fast set intersection algorithm and the galloping search on a dataset consisting of a sample of English Wikipedia articles with a set of twowords queries from TREC Terabyte 26 query stream. We also present a variant of the fast set intersection algorithm with a similar time and memory complexity but better empirical performance. Previous experimental comparisons of set intersection algorithms [5, 6] consider only different variants of galloping search or other algorithms that do not use set preprocessing. Yan et al. [2] observe that better query times can be achieved via better document ordering. In our experiments, we also compare query times of random document ordering and document ordering based on simple clustering scheme for all tested algorithms. 2 Algorithms In this section we descibe the three algorithms, which we will compare. We will denote the two input sets as A, B, where A = n, B = m, m n. We will denote the total number of sets as s and the total size of sets as N. 2.1 The ([8]) is a simple modification of the binary search. We try to find each element of B in A; formally for each B[i] we will find index k i such that A[k i ] B[i] and A[k i + 1] > B[i]. We will use several improvements. First, if j > i, then k j k i. This means that

2 Experimental Comparison of Set Intersection Algorithms 59 when we are searching for k i+1, we need to search only in range k i,k i + 1,...,n. Secondly before doing binary search we will find the smallest p {1,2,4,8,...} such that A[k i + p] B[i + 1]. This limits the range of the binary search when difference between k i and k i+1 is small. When using this modifications the algorithm achieves time complexity O(m lg(n/m)). Pseudocode of this algorithm is given below: low:=1 for i := 1 to m: diff := 1 while low + diff <= n and A[low + diff] < B[i]: diff *= 2 high := min(n, low + diff) k = binary_search(a, low, high) if A[k] == B[i]: output B[i] low = k This tree will have at most O(lgN) levels. At each level, we need O(N) bits for intersection matrices. This means we need O(N lgn) bits, which is O(N) in term of words. During query answering we will start traversing the tree starting in the root node. In each node we will check whether both sets are large. If not, we will answer query using the galloping search. If both sets are large, we will look into intersection matrix. If sets do not have intersection in this node, we stop the traversal in this node. Otherwise, we will propagate the query to the children of that node. We also need to check whether the element kept in the node belongs to the intersection. It can be shown that the time complexity of a set intersection query is at most O(( No + o)lgn). It can be also shown that it is never worse than time complexity of the galloping search [1]. Note that the original article used hash tables instead of the galloping search. 2.2 The fast set intersection algorithm We now briefly describe the data structure used by fast set intersection algorithm [1]. We build tree where each node handles some subsets of the original sets. The cost of a node is the sum of the sizes of all the subsets it handles. The root handles all original sets, so it costs N. Definition 1. Let d be a node with costs x. A large set in d is one of the x biggest sets in node d. Note the the original article defined large set in a slighly different way. Their large set was a set with at least x elements. It is clear, that every set with at least x elements is a large set by our definition. A set intersection matrix is a matrix that stores for each pair of sets a bit indicating whether their intersection is non-empty. For x sets this matrix needs O(x 2 ) bits of space and therefore node with cost c needs O(c) bits of memory. For each node of tree we will construct a set intersection matrix for all the large sets in that node. Now we need to describe how to build the tree. We will use a top-down approach. We will start with root node and in each node we will divide the sets and propagate them to its children. Only large sets are propagated down to the node children. We will call this sets a propagated group. Let d be a node with cost x and G its propagated group. The G costs at most x. Let E be the set of all elements in the sets of G. We will try to split E into two disjoint sets E 1,E 2. For a given set S G the child will handle S E 1 and the right child will handle S E 2. We want the each child to cost at most x/2. This is sometimes impossible to achieve. We will fix this by keeping one element of E in d. We add elements to E 1 until adding another element would make the left child cost more than x/2. The next element will be kept in d. All other elements will go to E 2 and the right child will cost at most x/ Our set intersection The set intersection matrix usually contains many ones and only few zeroes. We can use the space better by instead storing intervals in with intersection of two sets is empty. We take N biggest sets and call them large sets. We will call other sets as small sets. Note that the size of a small set is at most N. Now we will do a preprocessing for intersections of the large sets. Definition 2. Let A,B be two large sets, where B A. The empty interval is a sequence B[i],B[i + 1],...,B[ j] of elements of set B such that: For each k such that i k j: B[k] / A. i = 1 or B[i 1] A. j = n or B[ j + 1] A. The size of this interval is j i + 1. Now we will find and store k largest empty intervals from all intersections of large sets (note that we can store zero, one or more than one intervals for some pairs of sets). Note that if k > N, then the smallest stored empty interval has size at most N. We will answer set a intersection query as follows. If any of the sets is small, we will use the galloping search. This gives us query time O(mlg(n/m)), but since m N, the query time can be written as O( N lg(n/m)). If both sets are large, then we will again use the galloping search, but we will ignore empty intervals found for the given intersection. We will show two things about time complexity of the query in our algorithm. First, if k = N, then the query time complexity is bounded by O(o N). The memory complexity in this case is O(N). Secondly, if we put k = N lgn, the average query

3 6 V. Boža time complexity of this approach is not worse than the time complexity of the fast set intersection. This is because the number of empty intervals is the same as the number of bits in the set intersection matrices of the fast set intersection algorithm and our empty intervals allows us to skip search for more elements than the zeroes in the set intersection matrices. Thus average query time complexity of our approach is O(( No+o)lgn). The memory complexity is O(N lgn), which is little bit higher. The complexity of our algorithm does not look as promising as complexity of the fast set intersection algorithm, but this algorithm allows time-memory tradeoff. We can store any number of empty intervals as we want. In our experiements we set this number to achieve same memory consumption as fast set intersection algorithm. 3 Document ordering In the document retrieval we can choose arbitrary IDs for individual documents, and thus influence the ordering of elements in the input sets. There are several proposed heuristics for document ordering; most of them try to order documents for achieving better index compression ([3], [4]). But good document ordering improves query time [2]. This happens because similar document are closer together in the sets and during the galloping search we will make smaller jumps. In our work, we will use the k-scan algorithm [3]. To describe this algorithm, we first need to define similarity of documents. Definition 3. The Jaccard similarity of two sets A,B is given by: A B J(A,B) = A B The document is a set of terms. For calculating distance we will only consider N terms occuring in the largest number of documents (the large sets from the previous sections). The similarity of two documents is the Jaccard similarity of their sets. The k-scan algorithm tries to find an ordering of documents by partioning them into k clusters. Let d be the number of documents. This algorithm has k iterations. In each iteration, it first picks a cluster center and then chooses among the unassigned documents the d/k 1 ones most similar to the cluster center. Also it picks the cluster center for the next iteration, which is the d/k-th most similar document. The cluster center for the first iteration is picked randomly. If we assume that document similarity can be computed in time s, then time complexity of this algorithm is O(kds). In our experiments we use k = 1. 4 Experimental setup We implemented all algorithms in C++ and run our experiment on a computer with Intel i7 92 CPU, 12 GB RAM and Ubuntu Linux. We compiled our code with g using -O3 optimizations. Fast set intersection Figure 1: Comparison of query times for galloping search (x-axis) and fast set intersection (y-axis) with random document ordering 4.1 Documents We took all articles from the English Wikipedia [9] and divided each article into paragraphs. We have used paragraphs instead of documents, because otherwise we would only have a few big documents and the difference between algorithms would be hard to measure. Then we sampled 6.5 millions of paragraphs and took them as documents. The total size of index N (the sum of size of all sets) was 313 millions word-document pairs. 4.2 Queries We have used query log from TREC Terabyte 26 query stream [1]. We only consider two word queries from the log. This gives us approximately 14 queries. We ran each query 1 times and measured the average time in seconds. 5 Experimental results 5.1 vs. fast set interserction vs. our algorithm We will first show comparision between all algorithms using random document order. Results are shown in Figures 1, 2. As we can see the fast set intersection algorithm introduces significant overhead in query processing time for large queries. Our hypothesis is that this is because this algorithm does not use the caching in an optimal way. On the other hand our set intersection algorithm introduces a small improvement in query processing time. The average improvement is around 14%.

4 Experimental Comparison of Set Intersection Algorithms 61 Our set intersection Figure 2: Comparison of query times for galloping search (x-axis) and ours set intersection (y-axis) with random document ordering We also explored individual algorithm using more detailed statistics. The set intersection matrices of fast set intersection algorithms contained 15% of zeroes. There are 54 (39%) queries where both sets are large in root node. Only in 79 of these queries we found zero in the set intersection matrices. We have also measured how many elements of the smaller set we can skip due to zeroes in set intersection matrices. As we can see from the histogram in Figure 3, there are some queries where the output size is zero and all elements are skipped. But overall there are only few queries where we skipped more than half of the smaller set. Most of the time we skip only few percent of the smaller set. In our algorithm, there are 825 queries where we encounter an empty interval stored for the two sets. Again we measured fraction of skipped elements due to empty intervals with respect to the size of the smaller set and plotted histogram of this fractions (see Figure 4). We see that this histogram looks quite better than the previous one. 5.2 Document ordering effects The overall effect of document ordering on the query time is shown in Figures 5, 6, 7. The average improvement of query time for galloping search is 18%, for fast set intersection 27% and for our set intersection 22%. It is worth noting that the set intersections matrices contain 25% zeroes when using document ordering based on k-scans, compared to 15% with random document order. We had 21 queries which encoutered zero in some set intersection matrix in the fast set intersection algorithm. This approximatelly three times more than when we used random document ordering. Histogram of the fraction of skipped elements is in Figure 8. In this histogram we see Figure 3: Histogram of fraction of skipped elements from the smaller set in queries for fast intersection algorithm with random document ordering. Zero is ommited due to its big size (1278 queries) Figure 4: Histogram of fraction of skipped elements from the smaller set in queries for our intersection algorithm with random document ordering. Zero is ommited due to its big size (124 queries). the same problems as in random document ordering the number of queries with 8 9 percent fraction is zero. On the other hand, we gained a lot of queries where we eliminated around 1% of work. In our algorithm, there are 15 queries where we encounter an empty interval. Histogram of the fraction of skipped elements is in Figure 9. Its shape is similar to histogram when using random document ordering. Finally, in Figures 1, 11 we see a comparision of running times of algorithms when using document ordering based on k-scans. We still see significant slowdown for

5 62 V. Boža.2.2 Clustered document ordering Clustered document ordering Random document ordering Random document ordering Figure 5: Comparison of query times for galloping search using random document order(x-axis) and document order based on k-scan algorithm (y-axis) Figure 7: Comparison of query times for our set intersection using random document order(x-axis) and document order based on k-scan algorithm (y-axis).2 6 Clustered document ordering Random document ordering Figure 6: Comparison of query times for fast set intersection using random document order(x-axis) and document order based on k-scan algorithm (y-axis) fast set intersection. The speedup for our set intersection was 19% which is similar to speedup for random document ordering. Finally, in Figure 12 we see a comparison of galloping search using random document ordering and our algorithm using better document ordering. Combination of these two factors leads to average improvement around 35%. 5.3 Preprocessing time and memory consumption We now briefly sumarize proprocessing time and memory consumption of our algorithms. Using only inverted index and galloping search took 2 GB of memory and needed 4 Figure 8: Histogram of fraction of skipped elements from the smaller set in queries for fast intersection algorithm with k-scan document ordering. Zero is ommited due to its big size (1239 queries). minutes for preprocessing. The fast set intersection algorithm required 7 GB of memory and 9 minutes of preprocessing. Our algorithm required 7 GB of memory and 2.5 hours of preprocessing. 6 Conclusion We examined three different algorithms for set intersection. The experimental result can be summarized as follows: Fast set intersection algorithm does not lead to better

6 Experimental Comparison of Set Intersection Algorithms Our set intersection Figure 9: Histogram of fraction of skipped elements from the smaller set in queries for our intersection algorithm with k-scan document ordering. Zero is ommited due to its big size (122 queries). Fast set intersection Figure 1: Comparison of query times for galloping search (x-axis) and fast set intersection (y-axis) with document ordering based on k-scans Figure 11: Comparison of query times for galloping search (x-axis) and ours set intersection (y-axis) with document ordering based on k-scans Our set intersection Figure 12: Comparison of query times for galloping search with random document ordering (x-axis) and ours set intersection with document ordering based on k-scans (yaxis) empirical performance on real data. We can achieve some speedup using our algorithm but this speed up is not big. Our algorithm is slighly better at eliminating useless work than fast set intersection. Fast set intersection algorithm in most cases eliminates less then 1% of work. It is interesting question whether more careful implementation of fast set intersection algorithm can lead to better query times. We also investigated effect of document ordering on query times. We showed that better document ordering leads to greater improvement than using a different algorithm. 7 Acknowledgements This research was supported by VEGA grant 1/185/12. Author would also like to thank Brona Brejova for many usefull comments.

7 64 V. Boža References [1] Cohen, Hagai, and Ely Porat: Fast set intersection and twopatterns matching. LATIN 21: Theoretical Informatics. Springer Berlin Heidelberg, [2] Yan, Hao, Shuai Ding, and Torsten Suel: Inverted index compression and query processing with optimized document ordering. Proceedings of the 18th international conference on World wide web. ACM, 29. [3] F. Silvestri, S. Orlando, and R. Perego.: Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf. [4] Shieh, W. Y., Chen, T. F., Shann, J. J. J., and Chung, C. P. (23).: Inverted file compression through document identifier reassignment. Information processing & management, 39(1), [5] Culpepper, J. Shane, and Alistair Moffat: Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29.1 (21): 1. [6] Barbay, Jérémy, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger: An experimental investigation of set intersection algorithms for text searching. Journal of Experimental Algorithmics (JEA) 14 (29): 7. [7] Ng, Julio, Amihood Amir, and Pavel A. Pevzner.: Blocked pattern matching problem and its applications in proteomics. Research in Computational Molecular Biology. Springer Berlin Heidelberg, 211. [8] Jon Louis Bentley and Andrew Chi-chih Yao: An almost optimal algorithm for unbounded searching. Information Processing Letters - IPL, vol. 5, no. 3, pp , [9] enwiki-latest-pages-articles.xml.bz2 [1]

### Storage Management for Files of Dynamic Records

Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science

### Optimization of SQL Queries in Main-Memory Databases

Optimization of SQL Queries in Main-Memory Databases Ladislav Vastag and Ján Genči Department of Computers and Informatics Technical University of Košice, Letná 9, 042 00 Košice, Slovakia lvastag@netkosice.sk

### 2 Exercises and Solutions

2 Exercises and Solutions Most of the exercises below have solutions but you should try first to solve them. Each subsection with solutions is after the corresponding subsection with exercises. 2.1 Sorting

### Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

### Efficient Parallel Block-Max WAND Algorithm

Efficient Parallel Block-Max WAND Algorithm Oscar Rojas, Veronica Gil-Costa 2,3, and Mauricio Marin,2 DIINF, University of Santiago, Chile 2 Yahoo! Labs Santiago, Chile 3 CONICET,University of San Luis,

### International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.

RESEARCH ARTICLE ISSN: 2321-7758 GLOBAL LOAD DISTRIBUTION USING SKIP GRAPH, BATON AND CHORD J.K.JEEVITHA, B.KARTHIKA* Information Technology,PSNA College of Engineering & Technology, Dindigul, India Article

### Binary Codes for Nonuniform

Binary Codes for Nonuniform Sources (dcc-2005) Alistair Moffat and Vo Ngoc Anh Presented by:- Palak Mehta 11-20-2006 Basic Concept: In many applications of compression, decoding speed is at least as important

### CSE 326: Data Structures B-Trees and B+ Trees

Announcements (4//08) CSE 26: Data Structures B-Trees and B+ Trees Brian Curless Spring 2008 Midterm on Friday Special office hour: 4:-5: Thursday in Jaech Gallery (6 th floor of CSE building) This is

### Techniques Covered in this Class. Inverted List Compression Techniques. Recap: Search Engine Query Processing. Recap: Search Engine Query Processing

Recap: Search Engine Query Processing Basically, to process a query we need to traverse the inverted lists of the query terms Lists are very long and are stored on disks Challenge: traverse lists as quickly

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

### SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

### A Catalogue of the Steiner Triple Systems of Order 19

A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of

### Performance Evaluation of Improved Web Search Algorithms

Performance Evaluation of Improved Web Search Algorithms Esteban Feuerstein 1, Veronica Gil-Costa 2, Michel Mizrahi 1, Mauricio Marin 3 1 Universidad de Buenos Aires, Buenos Aires, Argentina 2 Universidad

### International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 11, November 2015 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### Multimedia Communications. Huffman Coding

Multimedia Communications Huffman Coding Optimal codes Suppose that a i -> w i C + is an encoding scheme for a source alphabet A={a 1,, a N }. Suppose that the source letter a 1,, a N occur with relative

### A Set Intersection Algorithm Via x-fast Trie

A Set Intersection Algorithm Via x-fast Trie Bangyu Ye* National Engineering Laboratory for Information Security Technologies, Institute of Information Engineering, Chinese Academy of Sciences, Beijing,

### Efficient Data Replication Scheme based on Hadoop Distributed File System

, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

### IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

### Rethinking SIMD Vectorization for In-Memory Databases

SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Persistent Binary Search Trees

Persistent Binary Search Trees Datastructures, UvA. May 30, 2008 0440949, Andreas van Cranenburgh Abstract A persistent binary tree allows access to all previous versions of the tree. This paper presents

### Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

### Name: Shu Xiong ID:

Homework #1 Report Multimedia Data Compression EE669 2013Spring Name: Shu Xiong ID: 3432757160 Email: shuxiong@usc.edu Content Problem1: Writing Questions... 2 Huffman Coding... 2 Lempel-Ziv Coding...

### Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

### A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

### R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions

### Linear programming approach for online advertising

Linear programming approach for online advertising Igor Trajkovski Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Rugjer Boshkovikj 16, P.O. Box 393, 1000 Skopje,

### In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating

### Email Spam Detection Using Customized SimHash Function

International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

### Load Balancing on a Grid Using Data Characteristics

Load Balancing on a Grid Using Data Characteristics Jonathan White and Dale R. Thompson Computer Science and Computer Engineering Department University of Arkansas Fayetteville, AR 72701, USA {jlw09, drt}@uark.edu

### Keywords web applications, scalability, database access

Toni Stojanovski, Member, IEEE, Ivan Velinov, and Marko Vučković toni.stojanovski@eurm.edu.mk; vuckovikmarko@hotmail.com Abstract ASP.NET web applications typically employ server controls to provide dynamic

### Algorithms and Data Structures

Algorithm Analysis Page 1 BFH-TI: Softwareschule Schweiz Algorithm Analysis Dr. CAS SD01 Algorithm Analysis Page 2 Outline Course and Textbook Overview Analysis of Algorithm Pseudo-Code and Primitive Operations

### Load Balancing in Structured Peer to Peer Systems

Load Balancing in Structured Peer to Peer Systems Dr.K.P.Kaliyamurthie 1, D.Parameswari 2 1.Professor and Head, Dept. of IT, Bharath University, Chennai-600 073. 2.Asst. Prof.(SG), Dept. of Computer Applications,

### Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search

Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search Weixiong Rao [1,4] Lei Chen [2] Pan Hui [2,3] Sasu Tarkoma [4] [1] School of Software Engineering [2] Department of Comp. Sci.

### An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Olumide O. Owolabi Department of Computer Science. University of Abuja, Abuja, Nigeria

Volume 6, Issue 5, May 206 ISSN: 2277 28X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Indexed Method

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### 1 2-3 Trees: The Basics

CS10: Data Structures and Object-Oriented Design (Fall 2013) November 1, 2013: 2-3 Trees: Inserting and Deleting Scribes: CS 10 Teaching Team Lecture Summary In this class, we investigated 2-3 Trees in

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

B-Trees Algorithms and data structures for external memory as opposed to the main memory B-Trees Previous Lectures Height balanced binary search trees: AVL trees, red-black trees. Multiway search trees:

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

### An Evaluation of Self-adjusting Binary Search Tree Techniques

SOFTWARE PRACTICE AND EXPERIENCE, VOL. 23(4), 369 382 (APRIL 1993) An Evaluation of Self-adjusting Binary Search Tree Techniques jim bell and gopal gupta Department of Computer Science, James Cook University,

### Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

### Improved Single and Multiple Approximate String Matching

Improved Single and Multiple Approximate String Matching Kimmo Fredriksson Department of Computer Science, University of Joensuu, Finland Gonzalo Navarro Department of Computer Science, University of Chile

### A Study of Local Optima in the Biobjective Travelling Salesman Problem

A Study of Local Optima in the Biobjective Travelling Salesman Problem Luis Paquete, Marco Chiarandini and Thomas Stützle FG Intellektik, Technische Universität Darmstadt, Alexanderstr. 10, Darmstadt,

### Lecture 7: Approximation via Randomized Rounding

Lecture 7: Approximation via Randomized Rounding Often LPs return a fractional solution where the solution x, which is supposed to be in {0, } n, is in [0, ] n instead. There is a generic way of obtaining

### Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and

### Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

### Discrete Mathematics, Chapter 3: Algorithms

Discrete Mathematics, Chapter 3: Algorithms Richard Mayr University of Edinburgh, UK Richard Mayr (University of Edinburgh, UK) Discrete Mathematics. Chapter 3 1 / 28 Outline 1 Properties of Algorithms

### A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

### Index Construction. Merging. Merging. Information Retrieval. ! Simple in-memory indexer. ! Indexing

Information Retrieval INFO 4300 / CS 4300! Indexing Inverted indexes Compression Index construction Ranking model Index Construction! Simple in-memory indexer t! t! t! Merging Merging! Merging addresses

### MARiO: Multi Attribute Routing in Open Street Map

MARiO: Multi Attribute Routing in Open Street Map Franz Graf Matthias Renz Hans-Peter Kriegel Matthias Schubert Institute for Informatics, Ludwig-Maximilians-Universität München, Oettingenstr. 67, D-80538

### Analysis of Compression Algorithms for Program Data

Analysis of Compression Algorithms for Program Data Matthew Simpson, Clemson University with Dr. Rajeev Barua and Surupa Biswas, University of Maryland 12 August 3 Abstract Insufficient available memory

### Faster deterministic integer factorisation

David Harvey (joint work with Edgar Costa, NYU) University of New South Wales 25th October 2011 The obvious mathematical breakthrough would be the development of an easy way to factor large prime numbers

### Storage Management for High Energy Physics Applications

Abstract Storage Management for High Energy Physics Applications A. Shoshani, L. M. Bernardo, H. Nordberg, D. Rotem, and A. Sim National Energy Research Scientific Computing Division Lawrence Berkeley

### Symbol Tables. Introduction

Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

### I/O-Efficient Spatial Data Structures for Range Queries

I/O-Efficient Spatial Data Structures for Range Queries Lars Arge Kasper Green Larsen MADALGO, Department of Computer Science, Aarhus University, Denmark E-mail: large@madalgo.au.dk,larsen@madalgo.au.dk

### A Practical Performance Comparison of Parallel Sorting Algorithms on Homogeneous Network of Workstations

A Practical Performance Comparison of Parallel Sorting Algorithms on Homogeneous Network of Workstations Haroon Rashid and Kalim Qureshi* COMSATS Institute of Information Technology, Abbottabad, Pakistan

### The PRAM Model and Algorithms. Advanced Topics Spring 2008 Prof. Robert van Engelen

The PRAM Model and Algorithms Advanced Topics Spring 2008 Prof. Robert van Engelen Overview The PRAM model of parallel computation Simulations between PRAM models Work-time presentation framework of parallel

### Tables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n)

Hash Tables Tables so far set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n) Worst O(lg n) O(lg n) O(lg n) Table naïve array implementation

### Information Retrieval. Exercises Skip lists Positional indexing Permuterm index

Information Retrieval Exercises Skip lists Positional indexing Permuterm index Faster merging using skip lists Recall basic merge Walk through the two postings simultaneously, in time linear in the total

### Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### recursion, O(n), linked lists 6/14

recursion, O(n), linked lists 6/14 recursion reducing the amount of data to process and processing a smaller amount of data example: process one item in a list, recursively process the rest of the list

### ICS 434 Advanced Database Systems

ICS 434 Advanced Database Systems Dr. Abdallah Al-Sukairi sukairi@kfupm.edu.sa Second Semester 2003-2004 (032) King Fahd University of Petroleum & Minerals Information & Computer Science Department Outline

### Load Balancing in Structured Peer to Peer Systems

Load Balancing in Structured Peer to Peer Systems DR.K.P.KALIYAMURTHIE 1, D.PARAMESWARI 2 Professor and Head, Dept. of IT, Bharath University, Chennai-600 073 1 Asst. Prof. (SG), Dept. of Computer Applications,

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### 12 Abstract Data Types

12 Abstract Data Types 12.1 Source: Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT).

### Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

### Query Processing, optimization, and indexing techniques

Query Processing, optimization, and indexing techniques What s s this tutorial about? From here: SELECT C.name AS Course, count(s.students) AS Cnt FROM courses C, subscription S WHERE C.lecturer = Calders

### Inverted Indexes Compressed Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 40

Inverted Indexes Compressed Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 40 Compressed Inverted Indexes It is possible to combine index compression and

### An Efficiency Keyword Search Scheme to improve user experience for Encrypted Data in Cloud

, pp.246-252 http://dx.doi.org/10.14257/astl.2014.49.45 An Efficiency Keyword Search Scheme to improve user experience for Encrypted Data in Cloud Jiangang Shu ab Xingming Sun ab Lu Zhou ab Jin Wang ab

### Security-Aware Beacon Based Network Monitoring

Security-Aware Beacon Based Network Monitoring Masahiro Sasaki, Liang Zhao, Hiroshi Nagamochi Graduate School of Informatics, Kyoto University, Kyoto, Japan Email: {sasaki, liang, nag}@amp.i.kyoto-u.ac.jp

### Mobile Storage and Search Engine of Information Oriented to Food Cloud

Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

### CS420/520 Algorithm Analysis Spring 2010 Lecture 04

CS420/520 Algorithm Analysis Spring 2010 Lecture 04 I. Chapter 4 Recurrences A. Introduction 1. Three steps applied at each level of a recursion a) Divide into a number of subproblems that are smaller

### On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

### Fast Evaluation of Union-Intersection Expressions

Fast Evaluation of Union-Intersection Expressions Philip Bille Anna Pagh Rasmus Pagh IT University of Copenhagen Data Structures for Intersection Queries Preprocess a collection of sets S 1,..., S m independently

### Cluster analysis and Association analysis for the same data

Cluster analysis and Association analysis for the same data Huaiguo Fu Telecommunications Software & Systems Group Waterford Institute of Technology Waterford, Ireland hfu@tssg.org Abstract: Both cluster

### Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

, pp.91-100 http://dx.doi.org/10.14257/ijhit.2014.7.4.09 Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format Jingjing Zheng 1,* and Ting Wang 1, 2 1,* Parallel Software and Computational

### Data Deduplication in Slovak Corpora

Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### Chap 3 Huffman Coding

Chap 3 Huffman Coding 3.1 Overview 3.2 The Huffman Coding Algorithm 3.4 Adaptive Huffman Coding 3.5 Golomb Codes 3.6 Rice Codes 3.7 Tunstall Codes 3.8 Applications of Huffman Coding 1 3.2 The Huffman Coding

### MATHEMATICAL ENGINEERING TECHNICAL REPORTS. The Best-fit Heuristic for the Rectangular Strip Packing Problem: An Efficient Implementation

MATHEMATICAL ENGINEERING TECHNICAL REPORTS The Best-fit Heuristic for the Rectangular Strip Packing Problem: An Efficient Implementation Shinji IMAHORI, Mutsunori YAGIURA METR 2007 53 September 2007 DEPARTMENT

### This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

### INDEXING BIOMEDICAL STREAMS IN DATA MANAGEMENT SYSTEM 1. INTRODUCTION

JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 9/2005, ISSN 1642-6037 Michał WIDERA *, Janusz WRÓBEL *, Adam MATONIA *, Michał JEŻEWSKI **,Krzysztof HOROBA *, Tomasz KUPKA * centralized monitoring,

### Caching XML Data on Mobile Web Clients

Caching XML Data on Mobile Web Clients Stefan Böttcher, Adelhard Türling University of Paderborn, Faculty 5 (Computer Science, Electrical Engineering & Mathematics) Fürstenallee 11, D-33102 Paderborn,

### Mining Social Network Graphs

Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand

### Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Image Search by MapReduce

Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

### Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

### Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

### An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi

International Conference on Applied Science and Engineering Innovation (ASEI 2015) An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi Institute of Computer Forensics,