Experimental Comparison of Set Intersection Algorithms for Inverted Indexing
Vladimír Boža
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, Bratislava, Slovakia

Abstract: The set intersection problem is one of the main problems in document retrieval. A query consists of two keywords, and for each keyword we have a sorted set of IDs of the documents containing it. The goal is to retrieve the set of IDs of documents containing both keywords. We perform an experimental comparison of the galloping search and a new algorithm by Cohen and Porat (LATIN 2010), which has a better theoretical time complexity. We show that the new algorithm often has worse performance than the trivial one on real data. We also propose a variant of the Cohen and Porat algorithm with a similar complexity but better empirical performance. Finally, we investigate the influence of document ordering on query time.

1 Introduction

In the set intersection problem, we are given a collection of sets A_1, A_2, ..., A_d. Our goal is to preprocess them and then answer queries of the following type: for given i, j, find the intersection of sets A_i and A_j. This problem appears in various areas. For example, set intersection is needed in conjunctive queries in relational databases, and Ng, Amir and Pevzner [7] use set intersection to match mass spectra against a protein database. However, perhaps the most important application is in document retrieval, where the goal is to maintain data structures over a set of documents. The most typical data structure is the inverted index, which stores for each word the set of documents containing that word. Such an index allows us to easily retrieve the documents containing a particular query word. When we want to retrieve documents which contain two or more given words, we intersect the corresponding document sets from the inverted index.

Classical algorithms for set intersection are merging and binary search. Merging identifies common elements by iterating through both sorted lists, as in the final phase of the merge sort algorithm. If we denote the length of the smaller set as m and the length of the larger set as n, then the time complexity of merging is O(m + n), which is good when the lengths of the two sets are almost the same. If m is much smaller than n, it is better to search for each element of the smaller set in the larger set by binary search, in O(m lg n) total time. There is a better algorithm originally introduced by Bentley and Yao [8], called galloping search, with time complexity O(m lg(n/m)). This algorithm has a good time complexity both when the lengths of the sets are similar and when the shorter set is much shorter than the longer one. We will describe this algorithm in the next section.
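To make the two classical baselines concrete, the following is a minimal C++ sketch of both approaches. It is an illustration only, under the usual assumption that both input lists are sorted and duplicate-free; the function names (intersect_merge, intersect_binary) are ours, not from the paper.

    #include <vector>
    #include <algorithm>
    #include <cstdint>

    // Merge-style intersection: O(m + n), walks both sorted lists once.
    std::vector<uint32_t> intersect_merge(const std::vector<uint32_t>& a,
                                          const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }   // common element
        }
        return out;
    }

    // Binary-search intersection: O(m lg n), looks up every element of the
    // smaller set b in the larger set a.
    std::vector<uint32_t> intersect_binary(const std::vector<uint32_t>& a,
                                           const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        for (uint32_t x : b)
            if (std::binary_search(a.begin(), a.end(), x))
                out.push_back(x);
        return out;
    }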
All previous algorithms get the two sorted sets on input without any additional preprocessing. However, in inverted indexing the sets for all keywords are known in advance, and perhaps some preprocessing of these sets could speed up query processing. In particular, the length of the output can be much smaller than the length of the shorter set, and it would be desirable to have an algorithm whose running time does not depend linearly on m. The first step in this direction is a recent algorithm by Cohen and Porat [1], which uses linear memory to store all sets and processes each intersection query in time O(√(N·o) + o), where o is the length of the output and N is the sum of the sizes of all sets. We will denote this algorithm as the fast set intersection algorithm. However, it is not clear whether this theoretical improvement is really useful in practice.

In this article, we compare the query times of the fast set intersection algorithm and the galloping search on a dataset consisting of a sample of English Wikipedia articles, with a set of two-word queries from the TREC Terabyte 2006 query stream. We also present a variant of the fast set intersection algorithm with a similar time and memory complexity but better empirical performance. Previous experimental comparisons of set intersection algorithms [5, 6] consider only different variants of the galloping search or other algorithms that do not use set preprocessing. Yan et al. [2] observe that better query times can be achieved via better document ordering. In our experiments, we also compare query times of random document ordering and of document ordering based on a simple clustering scheme for all tested algorithms.

2 Algorithms

In this section we describe the three algorithms which we will compare. We will denote the two input sets as A and B, where |A| = n, |B| = m, m ≤ n. We will denote the total number of sets as s and the total size of all sets as N.

2.1 The galloping search

The galloping search ([8]) is a simple modification of the binary search. We try to find each element of B in A; formally, for each B[i] we find an index k_i such that A[k_i] ≤ B[i] and A[k_i + 1] > B[i]. We use several improvements. First, if j > i, then k_j ≥ k_i. This means that when we are searching for k_{i+1}, we only need to search in the range k_i, k_i + 1, ..., n. Secondly, before doing the binary search we find the smallest p ∈ {1, 2, 4, 8, ...} such that A[k_i + p] ≥ B[i + 1]. This limits the range of the binary search when the difference between k_i and k_{i+1} is small. With these modifications the algorithm achieves time complexity O(m lg(n/m)). Pseudocode of this algorithm is given below:

    low := 1
    for i := 1 to m:
        diff := 1
        while low + diff <= n and A[low + diff] < B[i]:
            diff *= 2
        high := min(n, low + diff)
        k := binary_search(A, B[i], low, high)
        if A[k] == B[i]:
            output B[i]
        low := k

Here binary_search(A, B[i], low, high) returns the largest index k in the range [low, high] such that A[k] ≤ B[i].
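For reference, the following is a compact C++ rendering of the galloping search sketched above. It is our own illustrative version, not the author's implementation; it uses 0-based indexing and std::lower_bound for the bounded binary search.

    #include <vector>
    #include <algorithm>
    #include <cstdint>

    // Galloping-search intersection: for each element of the smaller set b,
    // gallop ahead in the larger set a and finish with a bounded binary search.
    // Runs in O(m lg(n/m)) for |b| = m, |a| = n.
    std::vector<uint32_t> intersect_galloping(const std::vector<uint32_t>& a,
                                              const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        size_t low = 0;
        for (uint32_t x : b) {
            // Grow the window exponentially until a[low + diff] >= x (or we run out of a).
            size_t diff = 1;
            while (low + diff < a.size() && a[low + diff] < x)
                diff *= 2;
            size_t high = std::min(a.size(), low + diff + 1);
            // Binary search only inside the narrowed window [low, high).
            size_t k = std::lower_bound(a.begin() + low, a.begin() + high, x) - a.begin();
            if (k < a.size() && a[k] == x)
                out.push_back(x);
            low = k;   // later elements of b can only match at or after position k
        }
        return out;
    }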
2.2 The fast set intersection algorithm

We now briefly describe the data structure used by the fast set intersection algorithm [1]. We build a tree where each node handles some subsets of the original sets. The cost of a node is the sum of the sizes of all the subsets it handles. The root handles all original sets, so it costs N.

Definition 1. Let d be a node with cost x. A large set in d is one of the √x biggest sets in node d.

Note that the original article defined a large set in a slightly different way: their large set was a set with at least √x elements. It is clear that every set with at least √x elements is also a large set by our definition.

A set intersection matrix is a matrix that stores for each pair of sets a bit indicating whether their intersection is non-empty. For x sets this matrix needs O(x^2) bits of space, and therefore a node with cost c, which has √c large sets, needs O(c) bits of memory. For each node of the tree we construct a set intersection matrix for all the large sets in that node.

Now we need to describe how to build the tree. We use a top-down approach: we start with the root node, and in each node we divide the sets and propagate them to its children. Only large sets are propagated down to the node's children; we call these sets the propagated group. Let d be a node with cost x and G its propagated group. The group G costs at most x. Let E be the set of all elements occurring in the sets of G. We try to split E into two disjoint sets E_1, E_2. For a given set S ∈ G, the left child will handle S ∩ E_1 and the right child will handle S ∩ E_2. We want each child to cost at most x/2. This is sometimes impossible to achieve, so we fix it by keeping one element of E in d. We add elements to E_1 until adding another element would make the left child cost more than x/2. The next element is kept in d. All other elements go to E_2, and the right child will cost at most x/2.

This tree has at most O(lg N) levels. At each level, we need O(N) bits for the intersection matrices. This means we need O(N lg N) bits in total, which is O(N) in terms of words.

During query answering we traverse the tree starting in the root node. In each node we check whether both sets are large. If not, we answer the query using the galloping search. If both sets are large, we look into the intersection matrix. If the sets do not have an intersection in this node, we stop the traversal at this node. Otherwise, we propagate the query to the children of that node. We also need to check whether the element kept in the node belongs to the intersection. It can be shown that the time complexity of a set intersection query is at most O((√(N·o) + o) lg N). It can also be shown that it is never worse than the time complexity of the galloping search [1]. Note that the original article used hash tables instead of the galloping search.
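The C++ fragment below sketches the per-node query logic just described. It is only schematic and rests on several assumptions of ours: a Node type holding the node's piece of each set, a rank table saying which sets are large in the node, and a bitset-backed intersection matrix. It is not the paper's implementation.

    #include <vector>
    #include <algorithm>
    #include <optional>
    #include <cstdint>

    struct Node {
        std::vector<std::vector<uint32_t>> piece;   // piece[s] = part of set s handled here
        std::vector<int> large_rank;                // large_rank[s] >= 0 iff set s is large here
        std::vector<bool> matrix;                   // row-major bit matrix over the large sets
        size_t num_large = 0;
        std::optional<uint32_t> kept;               // element kept in this node, if any
        Node* left = nullptr;
        Node* right = nullptr;

        bool maybe_intersect(int s1, int s2) const {   // intersection-matrix lookup
            return matrix[large_rank[s1] * num_large + large_rank[s2]];
        }
    };

    std::vector<uint32_t> intersect_galloping(const std::vector<uint32_t>&,
                                              const std::vector<uint32_t>&); // Section 2.1

    // Collect the intersection of sets s1 and s2 restricted to the part handled by node.
    void query(const Node* node, int s1, int s2, std::vector<uint32_t>& out) {
        if (!node) return;
        bool both_large = node->large_rank[s1] >= 0 && node->large_rank[s2] >= 0;
        if (!both_large) {                           // a small set: fall back to galloping search
            auto r = intersect_galloping(node->piece[s1], node->piece[s2]);
            out.insert(out.end(), r.begin(), r.end());
            return;
        }
        if (!node->maybe_intersect(s1, s2)) return;  // matrix says the intersection is empty here
        if (node->kept) {                            // test the element kept in this node
            uint32_t e = *node->kept;
            if (std::binary_search(node->piece[s1].begin(), node->piece[s1].end(), e) &&
                std::binary_search(node->piece[s2].begin(), node->piece[s2].end(), e))
                out.push_back(e);
        }
        query(node->left, s1, s2, out);              // otherwise recurse into both children
        query(node->right, s1, s2, out);
    }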
2.3 Our set intersection algorithm

The set intersection matrix usually contains many ones and only a few zeroes. We can use the space better by instead storing intervals in which the intersection of two sets is empty. We take the √N biggest sets and call them large sets; we call the other sets small sets. Note that the size of a small set is at most √N. We then preprocess the intersections of the large sets.

Definition 2. Let A, B be two large sets, where |B| ≤ |A|. An empty interval is a sequence B[i], B[i+1], ..., B[j] of elements of the set B such that:
- for each k such that i ≤ k ≤ j: B[k] ∉ A,
- i = 1 or B[i − 1] ∈ A,
- j = |B| or B[j + 1] ∈ A.
The size of this interval is j − i + 1.

We find and store the k largest empty intervals over all intersections of large sets (note that we may store zero, one, or more than one interval for some pairs of sets). Note that if k > N, then the smallest stored empty interval has size at most √N.

We answer a set intersection query as follows. If any of the sets is small, we use the galloping search. This gives us query time O(m lg(n/m)), but since m ≤ √N, the query time can be written as O(√N lg(n/m)). If both sets are large, we again use the galloping search, but we skip the empty intervals stored for the given pair.

We will show two things about the time complexity of a query in our algorithm. First, if k = N, then the query time complexity is bounded by O(o·√N), and the memory complexity in this case is O(N). Secondly, if we set k = N lg N, the average query time complexity of this approach is not worse than the time complexity of the fast set intersection. This is because the number of empty intervals is the same as the number of bits in the set intersection matrices of the fast set intersection algorithm, and our empty intervals allow us to skip the search for more elements than the zeroes in the set intersection matrices do. Thus the average query time complexity of our approach is O((√(N·o) + o) lg N). The memory complexity is O(N lg N), which is a little higher. The complexity of our algorithm does not look as promising as the complexity of the fast set intersection algorithm, but our algorithm allows a time-memory tradeoff: we can store as many empty intervals as we wish. In our experiments we set this number so as to match the memory consumption of the fast set intersection algorithm.
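The C++ fragment below is a rough sketch, under our own assumptions, of how a stored list of empty intervals could be used during a query on two large sets: intervals stored for the pair (encoded here as half-open index ranges into the smaller set, an encoding we chose for the example) are skipped, and only the remaining elements are looked up in the larger set. The names and the interval encoding are ours, not the paper's. A real implementation would keep the galloping cursor in a across the chunks instead of restarting it; the copies below are only for brevity.

    #include <vector>
    #include <utility>
    #include <cstdint>

    // Empty intervals for one pair of large sets, as half-open index ranges
    // [begin, end) into the smaller set b, sorted by begin.
    using EmptyIntervals = std::vector<std::pair<size_t, size_t>>;

    std::vector<uint32_t> intersect_galloping(const std::vector<uint32_t>&,
                                              const std::vector<uint32_t>&); // Section 2.1

    // Intersect b (smaller) with a (larger), skipping the elements of b that lie
    // inside a stored empty interval; those elements are known not to occur in a.
    std::vector<uint32_t> intersect_with_skips(const std::vector<uint32_t>& a,
                                               const std::vector<uint32_t>& b,
                                               const EmptyIntervals& skips) {
        std::vector<uint32_t> out;
        size_t pos = 0;
        for (const auto& [s, e] : skips) {
            // Elements b[pos .. s) are not covered by any stored interval: search them.
            std::vector<uint32_t> chunk(b.begin() + pos, b.begin() + s);
            auto r = intersect_galloping(a, chunk);
            out.insert(out.end(), r.begin(), r.end());
            pos = e;                      // skip the stored empty interval entirely
        }
        std::vector<uint32_t> tail(b.begin() + pos, b.end());
        auto r = intersect_galloping(a, tail);
        out.insert(out.end(), r.begin(), r.end());
        return out;
    }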
3 Document ordering

In document retrieval we can choose arbitrary IDs for individual documents, and thus influence the ordering of elements in the input sets. There are several proposed heuristics for document ordering; most of them try to order documents so as to achieve better index compression ([3], [4]). But a good document ordering also improves query time [2]. This happens because similar documents are closer together in the sets, and during the galloping search we make smaller jumps.

In our work, we use the k-scan algorithm [3]. To describe this algorithm, we first need to define the similarity of documents.

Definition 3. The Jaccard similarity of two sets A, B is given by

    J(A, B) = |A ∩ B| / |A ∪ B|.

A document is viewed as a set of terms. For calculating the similarity we only consider the √N terms occurring in the largest number of documents (the large sets from the previous sections). The similarity of two documents is the Jaccard similarity of their term sets.

The k-scan algorithm tries to find an ordering of documents by partitioning them into k clusters. Let d be the number of documents. The algorithm has k iterations. In each iteration, it first picks a cluster center and then chooses among the unassigned documents the d/k − 1 ones most similar to the cluster center. It also picks the cluster center for the next iteration, which is the d/k-th most similar document. The cluster center for the first iteration is picked randomly. If we assume that document similarity can be computed in time s, then the time complexity of this algorithm is O(kds). In our experiments we use k = 1.
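As an illustration, the Jaccard similarity of two documents represented as sorted vectors of term IDs can be computed with a single merge-style pass; this snippet is our own sketch, not code from the paper. In the k-scan loop, this function would be evaluated between the current cluster center and every unassigned document, and the documents ranked by the resulting score.

    #include <vector>
    #include <cstdint>

    // Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| for two sorted term-ID sets.
    double jaccard(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
        size_t i = 0, j = 0, common = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;
            else if (b[j] < a[i]) ++j;
            else { ++common; ++i; ++j; }
        }
        size_t uni = a.size() + b.size() - common;   // |A ∪ B|
        return uni == 0 ? 0.0 : static_cast<double>(common) / uni;
    }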
4 Experimental setup

We implemented all algorithms in C++ and ran our experiments on a computer with an Intel i7 920 CPU, 12 GB RAM and Ubuntu Linux. We compiled our code with g++ using -O3 optimizations.

4.1 Documents

We took all articles from the English Wikipedia [9] and divided each article into paragraphs. We used paragraphs instead of whole documents, because otherwise we would only have a few big documents and the difference between the algorithms would be hard to measure. We then sampled 6.5 million of these paragraphs and took them as documents. The total size of the index N (the sum of the sizes of all sets) was 313 million word-document pairs.

4.2 Queries

We used the query log from the TREC Terabyte 2006 query stream [10]. We only consider two-word queries from the log, which gives us approximately 14 queries. We ran each query 1 time and measured the average time in seconds.
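To connect the setup back to the algorithms, a two-word query is answered by looking up both postings lists in the inverted index and intersecting them. The tiny driver below is our own illustration of that flow; the map-based index and the function names are assumptions made for the example, not the paper's code.

    #include <string>
    #include <unordered_map>
    #include <vector>
    #include <cstdint>

    // Inverted index: word -> sorted list of IDs of documents containing it.
    using InvertedIndex = std::unordered_map<std::string, std::vector<uint32_t>>;

    std::vector<uint32_t> intersect_galloping(const std::vector<uint32_t>&,
                                              const std::vector<uint32_t>&); // Section 2.1

    // Answer a two-word conjunctive query: documents containing both words.
    std::vector<uint32_t> answer_query(const InvertedIndex& index,
                                       const std::string& w1, const std::string& w2) {
        auto it1 = index.find(w1), it2 = index.find(w2);
        if (it1 == index.end() || it2 == index.end()) return {};
        const auto& a = it1->second;
        const auto& b = it2->second;
        // Pass the longer list first and the shorter one second, as in Section 2.
        return (a.size() >= b.size()) ? intersect_galloping(a, b)
                                      : intersect_galloping(b, a);
    }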
5 Experimental results

5.1 Galloping search vs. fast set intersection vs. our algorithm

We first show a comparison of all algorithms using a random document order. The results are shown in Figures 1 and 2. As we can see, the fast set intersection algorithm introduces a significant overhead in query processing time for large queries. Our hypothesis is that this is because the algorithm does not use the CPU cache in an optimal way. On the other hand, our set intersection algorithm yields a small improvement in query processing time; the average improvement is around 14%.

Figure 1: Comparison of query times for the galloping search (x-axis) and the fast set intersection (y-axis) with random document ordering.

Figure 2: Comparison of query times for the galloping search (x-axis) and our set intersection (y-axis) with random document ordering.

We also explored the individual algorithms using more detailed statistics. The set intersection matrices of the fast set intersection algorithm contained 15% zeroes. There are 54 (39%) queries where both sets are large in the root node, and only in 79 of these queries did we find a zero in the set intersection matrices. We also measured how many elements of the smaller set we can skip due to zeroes in the set intersection matrices. As we can see from the histogram in Figure 3, there are some queries where the output size is zero and all elements are skipped. But overall there are only a few queries where we skipped more than half of the smaller set; most of the time we skip only a few percent of the smaller set.

In our algorithm, there are 825 queries where we encounter an empty interval stored for the two sets. Again we measured the fraction of skipped elements due to empty intervals with respect to the size of the smaller set and plotted a histogram of these fractions (see Figure 4). This histogram looks considerably better than the previous one.

Figure 3: Histogram of the fraction of skipped elements from the smaller set in queries for the fast intersection algorithm with random document ordering. Zero is omitted due to its big size (1278 queries).

Figure 4: Histogram of the fraction of skipped elements from the smaller set in queries for our intersection algorithm with random document ordering. Zero is omitted due to its big size (124 queries).

5.2 Document ordering effects

The overall effect of document ordering on the query time is shown in Figures 5, 6 and 7. The average improvement of query time for the galloping search is 18%, for the fast set intersection 27% and for our set intersection 22%. It is worth noting that the set intersection matrices contain 25% zeroes when using document ordering based on k-scan, compared to 15% with random document order.

Figure 5: Comparison of query times for the galloping search using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

Figure 6: Comparison of query times for the fast set intersection using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

Figure 7: Comparison of query times for our set intersection using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

We had 21 queries which encountered a zero in some set intersection matrix in the fast set intersection algorithm, approximately three times more than when we used random document ordering. A histogram of the fraction of skipped elements is in Figure 8. In this histogram we see the same problem as with random document ordering: the number of queries in the 80-90 percent range is zero. On the other hand, we gained a lot of queries where we eliminated around 100% of the work.

Figure 8: Histogram of the fraction of skipped elements from the smaller set in queries for the fast intersection algorithm with k-scan document ordering. Zero is omitted due to its big size (1239 queries).

In our algorithm, there are 15 queries where we encounter an empty interval. A histogram of the fraction of skipped elements is in Figure 9; its shape is similar to the histogram for random document ordering.

Figure 9: Histogram of the fraction of skipped elements from the smaller set in queries for our intersection algorithm with k-scan document ordering. Zero is omitted due to its big size (122 queries).

Finally, in Figures 10 and 11 we see a comparison of the running times of the algorithms when using document ordering based on k-scan. We still see a significant slowdown for the fast set intersection. The speedup for our set intersection was 19%, which is similar to the speedup for random document ordering. Finally, in Figure 12 we see a comparison of the galloping search using random document ordering and our algorithm using the better document ordering; the combination of these two factors leads to an average improvement of around 35%.

Figure 10: Comparison of query times for the galloping search (x-axis) and the fast set intersection (y-axis) with document ordering based on k-scan.

Figure 11: Comparison of query times for the galloping search (x-axis) and our set intersection (y-axis) with document ordering based on k-scan.

Figure 12: Comparison of query times for the galloping search with random document ordering (x-axis) and our set intersection with document ordering based on k-scan (y-axis).
5.3 Preprocessing time and memory consumption

We now briefly summarize the preprocessing time and memory consumption of our algorithms. Using only the inverted index and the galloping search took 2 GB of memory and needed 4 minutes of preprocessing. The fast set intersection algorithm required 7 GB of memory and 9 minutes of preprocessing. Our algorithm required 7 GB of memory and 2.5 hours of preprocessing.

6 Conclusion

We examined three different algorithms for set intersection. The experimental results can be summarized as follows. The fast set intersection algorithm does not lead to better empirical performance on real data. We can achieve some speedup using our algorithm, but this speedup is not big. Our algorithm is slightly better at eliminating useless work than the fast set intersection; the fast set intersection algorithm in most cases eliminates less than 1% of the work. It is an interesting question whether a more careful implementation of the fast set intersection algorithm can lead to better query times. We also investigated the effect of document ordering on query times. We showed that a better document ordering leads to a greater improvement than using a different algorithm.

7 Acknowledgements

This research was supported by VEGA grant 1/185/12. The author would also like to thank Broňa Brejová for many useful comments.
References

[1] Cohen, Hagai, and Ely Porat: Fast set intersection and two-patterns matching. LATIN 2010: Theoretical Informatics. Springer Berlin Heidelberg, 2010.
[2] Yan, Hao, Shuai Ding, and Torsten Suel: Inverted index compression and query processing with optimized document ordering. Proceedings of the 18th International Conference on World Wide Web. ACM, 2009.
[3] Silvestri, F., S. Orlando, and R. Perego: Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf., 2004.
[4] Shieh, W. Y., T. F. Chen, J. J. J. Shann, and C. P. Chung: Inverted file compression through document identifier reassignment. Information Processing & Management, 39(1), 2003.
[5] Culpepper, J. Shane, and Alistair Moffat: Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS), 29(1), 2010.
[6] Barbay, Jérémy, Alejandro López-Ortiz, Tyler Lu, and Alejandro Salinger: An experimental investigation of set intersection algorithms for text searching. Journal of Experimental Algorithmics (JEA), 14, 2009.
[7] Ng, Julio, Amihood Amir, and Pavel A. Pevzner: Blocked pattern matching problem and its applications in proteomics. Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2011.
[8] Bentley, Jon Louis, and Andrew Chi-chih Yao: An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3), 1976.
[9] English Wikipedia dump, enwiki-latest-pages-articles.xml.bz2.
[10] TREC Terabyte 2006 query stream.
