# Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

Save this PDF as:

Size: px
Start display at page:

Download "Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing"

## Transcription

3 universe U. Informally, we say that H is locality-sensitive if for any two points p and q, the probability that p and q collide under a random choice of hash function depends only on the distance between p and q. We use the notation p(t) to denote the probability of collision between two points within distance t. Several such families are known in the literature, see [8] for an overview. In this paper we use the locality-sensitive hash families for the angular distance between two unit vectors, defined in [13]. Specifically, let t [, π] be the angle between unit vectors p and q. p q t can be calculated as acos( ). The hash functions in the p q family are parametrized by a unit vector a. Each such function h a, when applied on a vector v, returns either 1 or 1, depending on the value of the dot product between a and v. Specifically, we have h a(v) = sign(a v). Previous work shows [13] that, for a random choice of a, the probability of collision satisfies p(t) = 1 t/π. That is, the probability of collision P [h a(v 1) = h a(v 2)] ranges between 1 and as the angle t between v 1 & v 2 ranges between and π. Basic LSH: To find the R-near neighbors, the basic LSH algorithm concatenates a number of functions h H into one hash function g. In particular, for k specified later, we define a family G of hash functions g(v) = (h 1(v),..., h k (v)), where h i H. For an integer L, the algorithm chooses L functions g 1,..., g L from G, independently and uniformly at random. The algorithm then creates L hash arrays, one for each function g j. During preprocessing, the algorithm stores each data point p P into buckets g j(v), for all j = 1,..., L. Since the total number of buckets may be large, the algorithm retains only the non-empty buckets by resorting to standard hashing. To answer a query q, the algorithm evaluates g 1(q),..., g L(q), and looks up the points stored in those buckets in the respective hash arrays. For each point p found in any of the buckets, the algorithm computes the distance from q to p, and reports the point v if the distance is at most R. The parameters k and L are chosen to satisfy the requirement that a near neighbor is reported with a probability at least 1 δ. See Section 7 for the details. The time for evaluating the g i functions for a query point q is O(DkL) in general. For the angular hash functions, each of the k bits output by a hash function g i involves computing a dot product of the input vector with a random vector defining a hyperplane. Each dot product can be computed in time proportional to the number of non-zeros NNZ rather than D. Thus, the total time is O(NNZkL). All-pairs LSH hashing: To reduce the time to evaluate functions g i for the query q, we reuse some of the functions h (i) j in a manner similar to [7]. Specifically, we define functions u i in the following manner. Suppose k is even and m L. Then, for i = 1... m, let u i = (h (i) 1,..., h(i) k/2 ), where each h(i) j is drawn uniformly at random from the family H. Thus u i are vectors each of k/2 functions (bits, in our case) drawn uniformly at random from the LSH family H. Now, define functions g i as g i = (u a, u b ), where 1 a < b m. Note that we obtain L = m(m 1)/2 functions g i. The time for evaluating the g i functions for a query point q is reduced to O(Dkm + L) = O(Dk L + L). This is because we need only to compute m functions u i, i.e., mk individual functions h, and then just concatenate all pairs of functions u i. For the angular functions the time is further improved to O(NNZkm + L). In the rest of the paper, we use LSH to mean the faster all-pairs version of LSH. In addition to reducing the computation time from O(DkL) to O(Dk L + L), the above method has significant benefits when combined with 2-level hashing, as described in more detail in Section Specifically, the points can be first partitioned using the first chunk of k/2 bits, and the partitioning can be refined further using the second chunk of k/2 bits. In this approach the first level partitioning is done only m times, while the second level hashing is done on much smaller sets. This significantly reduces the number of cache misses during the table construction phase. To the best of our knowledge, the cache-efficient implementation of the all-pairs hashing is a new contribution of this paper. Hash function for Twitter search: In the case of finding nearest neighbors for searching Twitter feeds, each tweet is represented as a sparse high-dimensional vector in the vocabulary space of all words. See Section 8 for further description. 4. OVERALL PLSH SYSTEM In this work, we focus on handling nearest neighbor queries on high-volume data such as Twitter feeds. In the case of Twitter feeds, there is a stream of incoming data that needs to be inserted into the LSH data structures. These structures need to support very fast query performance with very low insertion overheads. We also need to retire the oldest data periodically so that capacity is made available for incoming new data. % static M Nodes: 5 % static, % streaming % static Node Node 1 Node i Node i+m-1 Node 99 Q q A Queries Q A 1 Coordinator q 1 q j i i t Q A 99 Insertions Figure 1: Overall system for LSH showing queries and insert operations. All nodes hold part of the data and participate in queries with a coordinator node handling the merging of query results. For inserts, we use a rolling window of M nodes (Node i to i + M 1) that have a streaming structure to handle inserts in round-robin fashion. These streaming structures are periodically merged into static structures. The figure shows a snapshot when the static structures are 5% filled. When they are completely filled, the window moves to nodes i + M to i + 2M 1. Figure 1 presents an overview of how our system handles highvolume data streams. Our system consists of multiple nodes, each storing a portion of the original data. We store data in-memory for processing speed; hence the total capacity of a node is determined by its memory size. As queries arrive from different clients, they are broadcast by the coordinator to all nodes, with each node querying its data. The individual query responses from each structure are concatenated by the coordinator node and sent back to the user. PLSH includes a number of optimizations designed to promote efficient hash table construction and querying. First, in order to speed up construction of static hash tables (that are not updated dynamically as new data arrives) we developed a 2-level hashing approach, that when coupled with the all-pairs LSH method described in Section 3 significantly reduces the time to construct hash

10 is minimized, subject to P (R, k, m) 1 δ (7.3) (L N + 2 k L) 4 Memory in bytes (7.4) This can be done by enumerating k = 1, 2,... k max, and for each k selecting the smallest value of m satisfying Equation 7.3. The values of E[#unique] and E[#collisions] can be estimated from a given dataset using equations 7.1 and 7.2 through sampling. We use a random set of queries and data points for generating these estimates. For small amounts of data, we can set k max to 4, as for larger k the collision probability for two points within distance R would be less than p(t) k = For large amounts of data, k max is determined by the amount of RAM in the system. The storage required for the hash tables increases with L, which in turn increases super-linearly with k. The L N term in Equations 7.4 dominates memory requirements. For million points in a machine with 64GB of main memory, we can only store 16 hash tables (excluding other data structures). Typically, we want to store about hash tables. This fixes the maximum value of m to be 44 and the largest k to satisfy Equation 7.3 then is 16. Enumeration can proceed as earlier and the best value of (k, m) is chosen. 8. EVALUATION We now evaluate the performance of our algorithm on an Intel Xeon processor E5-267 based system with 8 cores running at 2.6 GHz. Each core has a 64 KB L1 cache and a 256 KB L2 cache, and supports 2-way Simultaneous Multi-Threading (SMT). All cores share a 2MB last level cache. Bandwidth to main memory is 32 GB/s. Our system has 64 GB memory on each node and runs RHEL 6.1. We use the Intel Composer XE 23 compiler for compiling our code 1. Our cluster has nodes connected through Infiniband (IB) and Gigabit Ethernet (GigE). All our multi node experiments use MPI to communicate over IB. Benchmarks and Performance Evaluation: We run PLSH on 1.5 billion tweets collected from September 22 to February 23. These tweets were cleaned by removing non-alphabet characters, duplicates and stop words. Each tweet is encoded as a sparse vector in a 5,-dimensional space corresponding to the size of the vocabulary used. In order to give more importance to less common words, we use an Inverse Document Frequency (IDF) score that gives greater weight to less frequently occurring words. Finally we normalize each tweet vector to transform it into a unit vector. For single node experiments, we use about.5 million tweets. The optimal LSH parameters were selected using our performance model. We use the following parameters: k = 16, m = 4, L = 78, D = 5,, R =.9, δ =.1. For queries, we use a random subset of tweets from the database. -length queries are possible if the tweet is entirely composed of special characters, unicode characters, numerals, words that are not part of the vocabulary etc. Since these queries will not find any meaningful matches, we ignore these queries. Even though we use a random subset of the input data for querying, we have found empirically that queries generated from user-given text snippets perform equally well. 8.1 PLSH vs exact algorithms 1 Intel s compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #284 Algorithm # distance computations Runtime Exhaustive search,579, ms Inverted index 847,27.9 > ms PLSH 12, ms Table 2: Runtime comparison between PLSH and other deterministic nearest neighbor algorithms. Inverted index excludes time to generate candidate matches, only including the time for distance computations. In order to prove empirically that PLSH does indeed perform significantly better than other algorithms, we perform a comparison against an exhaustive search and one based on an inverted index. The exhaustive search algorithm calculates the distance from a query point to all the points in the input data and reports only those points that lie within a distance R from the query. An inverted index is a data structure that works by finding a set of documents that contain a particular word. Given a query text, the inverted index is used to get the set of all documents (tweets) that contain at least one of the words in the document. These candidate points are filtered using the distance criterion. Both the exhaustive search and inverted index are deterministic algorithms. LSH, in contrast, is a randomized algorithm with a high probability of finding neighbors. Table 2 gives the average number of distance computations that need to be performed for each of the above mentioned algorithms for a query (averaged from a set of queries) and their runtimes on.5 million tweets on a single node. Since we expect that all these runtimes are dominated by the data lookups involved in distance computations, it is clear that PLSH performs much better than both these techniques. For inverted index, we do not include the time to generate the candidate matches (this would involve lookups into the inverted index structure), whereas for PLSH we include all times (hash table lookups and distance calculations). Even assuming lower bounds for inverted index, PLSH is about 15 faster than inverted index and 81 faster than exhaustive search while achieving 92% accuracy. Note that all algorithms have been parallelized to use multiple cores to execute queries. 8.2 Effect of optimizations We now show the breakdown of different optimizations performed for improving the performance of LSH as described in Section 5. Figure 4 shows the effect of the performance optimizations applied to the creation (initialization) of LSH, as described in Section 5.1. The unoptimized version performs hash table creation using a separate k-bit key for each of the L tables. Starting from an unoptimized implementation (1-level partitioning), we achieve a 3.7 improvement through the use of our new 2-level hash table and optimized PLSH algorithm, as well as use of shared hash tables and code vectorization. All the versions are parallelized on 16 threads. Figure 5 provides a breakdown of the effects of the optimizations on LSH query performance described in Section 5.2. The series of optimizations applied to the unoptimized implementation include the usage of bitvector for removing duplicate candidates, optimizing the sparse dot product calculation, enabling prefetching of data and the usage of large pages to avoid TLB misses. The unoptimized implementation uses the C++ STL set to remove duplicates and uses the unoptimized sparse dot calculation (Section 5.2.3). Compared to this, our final version gives a speedup of Performance Model Validation In this section, we show that the performance model proposed in Section 7 corresponds closely to the real world performance of PLSH creation and querying. Figure 6 compares the estimated and real runtimes of PLSH. Some of the error comes from the estimates of E[#collisions] and E[#unique] through sampling and other errors from inaccurately modeling the PLSH component kernels. We

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Rethinking SIMD Vectorization for In-Memory Databases

SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

### FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

### Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

### CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 17 by Kuo-pao Yang

CHAPTER 6 Memory 6.1 Memory 313 6.2 Types of Memory 313 6.3 The Memory Hierarchy 315 6.3.1 Locality of Reference 318 6.4 Cache Memory 319 6.4.1 Cache Mapping Schemes 321 6.4.2 Replacement Policies 333

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### Benchmarking Cassandra on Violin

Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

### Data Warehousing und Data Mining

Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

### In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

### Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

### Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages

### Memory Management. Main memory Virtual memory

Memory Management Main memory Virtual memory Main memory Background (1) Processes need to share memory Instruction execution cycle leads to a stream of memory addresses Basic hardware CPU can only access

### A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### High-Volume Data Warehousing in Centerprise. Product Datasheet

High-Volume Data Warehousing in Centerprise Product Datasheet Table of Contents Overview 3 Data Complexity 3 Data Quality 3 Speed and Scalability 3 Centerprise Data Warehouse Features 4 ETL in a Unified

### Operating Systems: Internals and Design Principles. Chapter 12 File Management Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles Chapter 12 File Management Seventh Edition By William Stallings Operating Systems: Internals and Design Principles If there is one singular characteristic

### EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

### MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services

MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services Jiansheng Wei, Hong Jiang, Ke Zhou, Dan Feng School of Computer, Huazhong University of Science and Technology,

### Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

### CH 7. MAIN MEMORY. Base and Limit Registers. Memory-Management Unit (MMU) Chapter 7: Memory Management. Background. Logical vs. Physical Address Space

Chapter 7: Memory Management CH 7. MAIN MEMORY Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation adapted from textbook slides Background Base and Limit Registers

### Using In-Memory Computing to Simplify Big Data Analytics

SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

### Whitepaper: performance of SqlBulkCopy

We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis

### Physical Data Organization

Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

### Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

### Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

### File Management. Chapter 12

Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution

### Chapter 9: Memory Management

Chapter 9: Memory Management Background Logical versus Physical Address Space Overlays versus Swapping Contiguous Allocation Paging Segmentation Segmentation with Paging Operating System Concepts 9.1 Background

### FAWN - a Fast Array of Wimpy Nodes

University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

### Big Data & Scripting storage networks and distributed file systems

Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node

### Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

### EMC DATA DOMAIN SISL SCALING ARCHITECTURE

EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily losing ground

### Chapter 13. Disk Storage, Basic File Structures, and Hashing

Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible Hashing

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

### Chapter 12 File Management. Roadmap

Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access

### Chapter 12 File Management

Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access

### Last Class: Memory Management. Recap: Paging

Last Class: Memory Management Static & Dynamic Relocation Fragmentation Paging Lecture 12, page 1 Recap: Paging Processes typically do not use their entire space in memory all the time. Paging 1. divides

### PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

### Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### ! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions Basic Steps in Query

### Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

### Operating Systems Memory Management

Operating Systems Memory Management ECE 344 ECE 344 Operating Systems 1 Memory Management Contiguous Memory Allocation Paged Memory Management Virtual Memory ECE 344 Operating Systems 2 Binding of Instructions

### Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files

### Agenda. Memory Management. Binding of Instructions and Data to Memory. Background. CSCI 444/544 Operating Systems Fall 2008

Agenda Background Memory Management CSCI 444/544 Operating Systems Fall 2008 Address space Static vs Dynamic allocation Contiguous vs non-contiguous allocation Background Program must be brought into memory

### Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

### Tableau Server 7.0 scalability

Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different

### Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

### Record Storage and Primary File Organization

Record Storage and Primary File Organization 1 C H A P T E R 4 Contents Introduction Secondary Storage Devices Buffering of Blocks Placing File Records on Disk Operations on Files Files of Unordered Records

### Operating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University

Operating Systems CSE 410, Spring 2004 File Management Stephen Wagner Michigan State University File Management File management system has traditionally been considered part of the operating system. Applications

### Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Slide 13-1 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible

### Cloud Computing at Google. Architecture

Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

### Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

### Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

### File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

CS341: Operating System Lect 36: 1 st Nov 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati File System & Device Drive Mass Storage Disk Structure Disk Arm Scheduling RAID

### GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

### Hardware Configuration Guide

Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...

### Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

### Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

### InfiniteGraph: The Distributed Graph Database

A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

### CHAPTER 1 INTRODUCTION

1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

### SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

### How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

### Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

### Hypertable Architecture Overview

WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

### Memory Management Outline. Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging

Memory Management Outline Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging 1 Background Memory is a large array of bytes memory and registers are only storage CPU can

### Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

### Design and Implementation of a Storage Repository Using Commonality Factoring. IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen

Design and Implementation of a Storage Repository Using Commonality Factoring IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen Axion Overview Potentially infinite historic versioning for rollback and

OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

### ECE 4750 Computer Architecture. T16: Address Translation and Protection

ECE 4750 Computer Architecture Topic 16: Translation and Protection Christopher Batten School of Electrical and Computer Engineering Cornell University! http://www.csl.cornell.edu/courses/ece4750! ECE

### In-Memory Databases MemSQL

IT4BI - Université Libre de Bruxelles In-Memory Databases MemSQL Gabby Nikolova Thao Ha Contents I. In-memory Databases...4 1. Concept:...4 2. Indexing:...4 a. b. c. d. AVL Tree:...4 B-Tree and B+ Tree:...5

### Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Chapter 13 Disk Storage, Basic File Structures, and Hashing Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files

### Chapter 2 Data Storage

Chapter 2 22 CHAPTER 2. DATA STORAGE 2.1. THE MEMORY HIERARCHY 23 26 CHAPTER 2. DATA STORAGE main memory, yet is essentially random-access, with relatively small differences Figure 2.4: A typical

### Search Engines. Stephen Shaw 18th of February, 2014. Netsoc

Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

### Chapter 8: Memory Management!

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still

### Fast Matching of Binary Features

Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

### DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

### Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

### VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

### An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

### SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

PLATFORM Top Ten Questions for Choosing In-Memory Databases Start Here PLATFORM Top Ten Questions for Choosing In-Memory Databases. Are my applications accelerated without manual intervention and tuning?.

### Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations

Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations Alonso Inostrosa-Psijas, Roberto Solar, Verónica Gil-Costa and Mauricio Marín Universidad de Santiago,

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Dudu Lazarov, Gil David, Amir Averbuch School of Computer Science, Tel-Aviv University Tel-Aviv 69978, Israel Abstract

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### The Sierra Clustered Database Engine, the technology at the heart of

A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

### Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

### GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

### Operating Systems Concepts

Operating Systems Concepts MODULE 7: MEMORY MANAGEMENT Andrzej Bednarski, Ph.D. student Department of Computer and Information Science Linköping University, Sweden E-mail: andbe@ida.liu.se URL: http://www.ida.liu.se/~andbe

### Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

### Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing