I/O-Efficient Data Structures for Colored Range and Prefix Reporting

Similar documents
Infrastructures for big data

SAS Education Providing knowledge through global training and certification. SAS Foundation. Kursöversikt 2010

Annual Report H I G H E R E D U C AT I O N C O M M I S S I O N - PA K I S TA N

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Data storage Tree indexes

New Approach of Computing Data Cubes in Data Warehousing

Longest Common Extensions via Fingerprinting

On the Use of Compression Algorithms for Network Traffic Classification

Binary search tree with SIMD bandwidth optimization using SSE

SAS Data Integration SAS Business Intelligence

Performance Tuning for the Teradata Database

TIBCO Spotfire Metrics Modeler User s Guide. Software Release 6.0 November 2013

Properties of Stabilizing Computations

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables

Lecture 2: Universality

Indexing Big Data. Michael A. Bender. ingest data ?????? ??? oy vey

Memory Systems. Static Random Access Memory (SRAM) Cell

Storing Measurement Data

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

A Searching Strategy to Adopt Multi-Join Queries

A Practical Method to Diagnose Memory Leaks in Java Application Alan Yu

Chapter 6: Episode discovery process

Using In-Memory Computing to Simplify Big Data Analytics

Analysis of Compression Algorithms for Program Data

External Memory Geometric Data Structures

2) What is the structure of an organization? Explain how IT support at different organizational levels.

Persistent Data Structures and Planar Point Location

How Efficient can Memory Checking be?

The Halting Problem is Undecidable

Physical Data Organization

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Interface Programmera mot interface Johan Eliasson Johan Eliasson Interface kan bryta beroendekedjor Skriv generell kod «Type» Class2 Interface

Memory Characterization to Analyze and Predict Multimedia Performance and Power in an Application Processor

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

How Efficient can Memory Checking be?

Introduction to Parallel Programming and MapReduce

MOC 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

Design of Practical Succinct Data Structures for Large Data Collections

Persistent Data Structures

Topological Properties

Longest Common Extensions via Fingerprinting

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Efficient Data Structures for the Factor Periodicity Problem

Introduction to Virtual Machines

A Time Efficient Algorithm for Web Log Analysis

Scalable Prefix Matching for Internet Packet Forwarding

The Classical Architecture. Storage 1 / 36

Hands-On Microsoft Windows Server 2008

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari

4D WebSTAR 5.1: Performance Advantages

Storage in Database Systems. CMPSCI 445 Fall 2010

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

ZooKeeper. Table of contents

Performance Evaluation of Improved Web Search Algorithms

1 File Processing Systems

Arithmetic Coding: Introduction

CSE 4351/5351 Notes 7: Task Scheduling & Load Balancing

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip

This presentation covers virtual application shared services supplied with IBM Workload Deployer version 3.1.

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

1. Nondeterministically guess a solution (called a certificate) 2. Check whether the solution solves the problem (called verification)

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

Distance Degree Sequences for Network Analysis

History-Independent Cuckoo Hashing

Rackspace Cloud Databases and Container-based Virtualization

Loop Invariants and Binary Search

Fujitsu Interstage Big Data Parallel Processing Server V1.0

Capacity Planning Process Estimating the load Initial configuration

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

- Easy to insert & delete in O(1) time - Don t need to estimate total memory needed. - Hard to search in less than O(n) time

Higher Computing Science Course Assessment Specification (C716 76)

Base Conversion written by Cathy Saxton

Increasing Flash Throughput for Big Data Applications (Data Management Track)

Cloud Computing at Google. Architecture

Learning Outcomes. COMP202 Complexity of Algorithms. Binary Search Trees and Other Search Trees

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Unit Storage Structures 1. Storage Structures. Unit 4.3

Models and Techniques for Proving Data Structure Lower Bounds

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

æ A collection of interrelated and persistent data èusually referred to as the database èdbèè.

Fusion iomemory iodrive PCIe Application Accelerator Performance Testing

CHAPTER 6 DATABASE MANAGEMENT SYSTEMS. Learning Objectives

DATABASE MANAGEMENT SYSTEM

Improving In-Memory Database Index Performance with Intel R Transactional Synchronization Extensions

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Parallel Computing. Benson Muite. benson.

Avoid a single point of failure by replicating the server Increase scalability by sharing the load among replicas

IBM RATIONAL PERFORMANCE TESTER

In-Place Computing. For Big and Complex Data

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Cache Oblivious Search Trees via Binary Trees of Small Height

Lecture 17: Virtual Memory II. Goals of virtual memory

Semantics of UML class diagrams

Operating System and Process Monitoring Tools

C H A P T E R 1 Introducing Data Relationships, Techniques for Data Manipulation, and Access Methods

Transcription:

I/O-Efficient Data Structures for Colored Range and Prefix Reporting Kasper Green Larsen, MADALGO, Aarhus University Rasmus Pagh, IT University of Copenhagen Presenter: Yakov Nekrich 1

Motivating example Sanningen finns på marken/men ingen vågar ta den./sanningen ligger på gatan./ Ingen gör den till sin. Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga. Store collection of documents (= sets of words). Given a string p, return the IDs of all documents that contain p as a word. Optimal solution: Inverted index, precomputing the answer for each p. 2

Motivating example Sanningen finns på marken/men ingen vågar ta den./sanningen ligger på gatan./ Ingen gör den till sin. Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga. Store collection of documents (= sets of words). Given a string p, return the IDs of all documents that contain p as prefix of a word. Optimal solution: Topic of this talk.. colored prefix reporting 3

Simple data structures Sanningen finns på marken/men ingen vågar ta den./sanningen ligger på gatan./ Ingen gör den till sin. Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga. Inverted list for each prefix: Too much space for fixed alphabet size. Inverted list for each word: Too much time if many documents have many different words with prefix p. 4

Relation to range reporting [GHJS 03] p* words in lexicographic order 5

Relation to range reporting [GHJS 03] p* words in lexicographic order 6

Relation to range reporting [GHJS 03] p* y-coord = x-coord of previous point of same color words in lexicographic order 7

Relation to range reporting [GHJS 03] p* 3-sided range query captures only first occurrence of color words in lexicographic order 8

New result We can store a subset of n points from [n] [n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os. Parameters: k = number of points returned B = number of points in a memory block (assume query string fits in one block) 9

New result Optimal time and space We can store a subset of n points from [n] [n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os. Parameters: k = number of points returned B = number of points in a memory block (assume query string fits in one block) 9

New result Optimal time and space We can store a subset of n points from [n] [n], using linear space, such that 3-sided range queries can be answered in O(1+k/B) I/Os. Parameters: k = number of points returned B = number of points in a memory block (assume query string fits in one block) Implies optimal colored prefix reporting 9

Model of computation I/O model with Ɵ(B log n) bits per block. Memory is a sequence of blocks. Cost model: Number of block retrievals (I/Os). O(1) blocks can be stored in cache and accessed at no cost. 10

Model of computation Many other papers: Block stores B items. I/O model with Ɵ(B log n) bits per block. Memory is a sequence of blocks. Cost model: Number of block retrievals (I/Os). O(1) blocks can be stored in cache and accessed at no cost. 10

Selected previous results Reference Space overhead Search time Model Arge et al. 99 O(1) O(log B (n)) Comparison Nekrich 07 O(1) log B log B...(n) Unrestricted Nekrich 07 log B *(n) O(log B *(n)) Unrestricted Nekrich 07 (log B (n)) 2 O(1) Unrestricted Here O(1) O(1) Unrestricted data structures for 3-sided range reporting 11

High-level description 1. Place points in a binary priority search tree points to report points in range 12

High-level description 2. Search for points near the fringe using O(1) searches in smaller base data structures. points to report points in range 13

High-level description 3. Read blocks containing remaining points (easy). points to report points in range 14

Base data structure Core technical contribution of paper. Able to handle point sets of size poly(b log n) optimally. Main technique: Use of tabulation and fusion trees allow us to make a dynamic data structure partially persistent free of charge when the number of updates to it is small. Refer to paper for more details. 15

Indivisibility assumption Assumption often used to show lower bounds: Each block contains at most B points/items; reading a block is required to report an item. Our data structure is among few to break this assumption (with the dictionary of Iacono and Pǎtraşcu, previous presentation). 16

Indivisibility assumption Assumption often used to show lower bounds: Each block contains at most B points/items; reading a block is required to report an item. Our data structure is among few to break this assumption (with the dictionary of Iacono and Pǎtraşcu, previous presentation). Open problem: Is there a nontrivial lower bound under the indivisibility assumption? 16

Memory models There may be alternatives to the cache-oriented I/O model: Unlike conventional processors that rely on the hardware to automatically bring data and instructions close to the processor with a hierarchy of hardware caches, the Cell Broadband Engine requires the programmer to create a shopping list of the data that the program requires. from ibm.com 17

Scatter-I/O model One I/O operation can read or write any set of B words in memory. In this stronger model, we give a much simpler data structure for colored prefix search. 18

Scatter-I/O model One I/O operation can read or write any set of B words in memory. In this stronger model, we give a much simpler data structure for colored prefix search. Open problem: Can notoriously I/O-difficult graph problems such as BFS and DFS be efficiently solved in this model? 18

Conclusion Theoretically optimal solutions in the I/O model for colored prefix/range reporting. In fact, optimal solution to 3-sided range reporting in two dimensions. 19

Conclusion Theoretically optimal solutions in the I/O model for colored prefix/range reporting. In fact, optimal solution to 3-sided range reporting in two dimensions. Main open problem: Efficient extension to top-k searches, where only the k highest ranked colors are reported. 19