Sublinear Algorithms for Big Data. Part 4: Random Topics



Similar documents
LECTURE 16. Readings: Section 5.1. Lecture outline. Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Big Data & Scripting Part II Streaming Algorithms

Zabin Visram Room CS115 CS126 Searching. Binary Search

Data stream summaries

B669 Sublinear Algorithms for Big Data

9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1

Fundamentals of Analyzing and Mining Data Streams

Point Biserial Correlation Tests

Data Mining on Streams

Cross Validation. Dr. Thomas Jensen Expedia.com

Mining Data Streams. Chapter The Stream Data Model

Data Mining Practical Machine Learning Tools and Techniques

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Lecture 13: Validation

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Data Mining. Nonlinear Classification

MODELING RANDOMNESS IN NETWORK TRAFFIC

Sampling Based Range Partition Methods for Big Data Analytics

CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting

Master's Theory Exam Spring 2006

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

A. OPENING POINT CLOUDS. (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor)

1: Meditech Information Systems On-line Help 2: Links to Micromedex 3: Print page view 4: Lock Session: temporarily minimize the session and lock it

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

Binary Heaps. CSE 373 Data Structures

ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE

Monte Carlo testing with Big Data

Data Streams A Tutorial

Distributed Computing over Communication Networks: Maximal Independent Set

Active Queue Management

6.2 Normal distribution. Standard Normal Distribution:

Summary Data Structures for Massive Data

SMALL INDEX LARGE INDEX (SILT)

Pulling a Random Sample from a MAXQDA Dataset

Continuous Monitoring of Top-k Queries over Sliding Windows

Offline sorting buffers on Line

An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams

BUSINESS ANALYTICS. Data Pre-processing. Lecture 3. Information Systems and Machine Learning Lab. University of Hildesheim.

Query Selectivity Estimation for Uncertain Data

Bayes and Naïve Bayes. cs534-machine Learning

Seed Distributions for the NCAA Men's Basketball Tournament: Why it May Not Matter Who Plays Whom*

USING OUTLOOK WEB ACCESS

Nessus Cloud User Registration

Using Data-Oblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science

Are You Wasting PPC Budget?

This presentation introduces you to the Decision Governance Framework that is new in IBM Operational Decision Manager version 8.5 Decision Center.

BINARY OPTIONS MENTOR USER MANUAL

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

B490 Mining the Big Data. 2 Clustering

Chapter 13: Query Processing. Basic Steps in Query Processing

Using the Complete W-2 efile Service Center

Inference of Probability Distributions for Trust and Security applications

Call and Put. Options. American and European Options. Option Terminology. Payoffs of European Options. Different Types of Options

Approximate Counts and Quantiles over Sliding Windows

Bright House Networks Home Security and Automation Mobile Application. Quick Start Guide

Co-Location-Resistant Clouds

Bright House Networks Home Security and Control Mobile Application

In-Network Coding for Resilient Sensor Data Storage and Efficient Data Mule Collection

Data Storage - II: Efficient Usage & Errors

Data Structures. Algorithm Performance and Big O Analysis

Lectures 6&7: Image Enhancement

Frontier Tandem. Administrator User Guide. Version 2.4 January 28, 2013

Local Area Networks transmission system private speedy and secure kilometres shared transmission medium hardware & software

Fact Sheet In-Memory Analysis

Polarization codes and the rate of polarization

MATH 140 Lab 4: Probability and the Standard Normal Distribution

Big Data. Lecture 6: Locality Sensitive Hashing (LSH)

Chapter 20: Data Analysis

How to submit a Service Request

CPSC 491. Today: Source code control. Source Code (Version) Control. Exercise (e.g., no git, subversion, cvs, etc.)

CamCard Business User Guide

Streaming Algorithms

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Congestion Control Overview

How To Manage Project Management

System Monitoring and Reporting

Decision Trees from large Databases: SLIQ

Group Testing a tool of protecting Network Security

TEI of Crete, Chania Annex

Maintaining Variance and k Medians over Data Stream Windows

Knowledge Discovery and Data Mining

Transcription:

Sublinear Algorithms for Big Data. Part 4: Random Topics. Qin Zhang

Topic 3: Random sampling in distributed data streams (based on a paper with Cormode, Muthukrishnan and Yi, PODS 10, JACM 12)

Distributed streaming. Motivated by database/networking applications: sensor networks, network monitoring, environment monitoring, cloud computing. Prior approaches: adaptive filters [Olston, Jiang, Widom, SIGMOD 03]; a generic geometric approach [Scharfman et al., SIGMOD 06]; prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05].

Reservoir sampling [Waterman ??; Vitter 85]. Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items: every subset of size s has equal probability to be the sample. When the i-th item arrives: with probability s/i, use it to replace an item in the current sample chosen uniformly at random; with probability 1 - s/i, throw it away.
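The update rule above fits in a few lines of Python. This is a minimal sketch, not from the original slides; `reservoir_sample` is a name chosen here:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform sample of size s, without replacement.

    The first s items fill the reservoir; when the i-th item arrives
    (i > s), it replaces a uniformly chosen reservoir item with
    probability s/i, and is thrown away otherwise.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)
        elif random.random() < s / i:
            reservoir[random.randrange(s)] = item
    return reservoir

print(reservoir_sample(range(1000), 10))  # 10 distinct items from the stream
```

Note that the algorithm needs the exact arrival index i at every step; this is exactly what becomes expensive in the distributed setting discussed next.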

Reservoir sampling from distributed streams. Now k sites S_1, ..., S_k each observe a stream of items over time and communicate with a central coordinator. When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i, the total number of items seen so far across all sites. Tracking i approximately? Then the sampling won't be uniform. Key observation: we don't have to know the size of the population in order to sample!

Basic idea: binary Bernoulli sampling. Flip fair coins for each item, repeatedly: picture a table of 0/1 outcomes with one row per level, where an item is active in row j if its first j coins all came up heads, i.e., with probability 2^-j. Conditioned upon a row having ≈ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).
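One way to realize the coin-flip picture above. A sketch under the assumption that row j marks each item active independently with probability 2^-j; `bernoulli_rows` and `sample_from_rows` are names chosen here:

```python
import random

def bernoulli_rows(n_items, num_rows):
    """Row 0 contains every item; an item active in row j stays
    active in row j+1 with probability 1/2, so it is active in
    row j with probability 2^-j overall."""
    rows = [list(range(n_items))]
    for _ in range(num_rows - 1):
        rows.append([x for x in rows[-1] if random.random() < 0.5])
    return rows

def sample_from_rows(rows, s):
    """Pick the deepest row that still has at least s active items;
    conditioned on that row, its active items are uniform, so any
    s of them form a uniform sample."""
    for row in reversed(rows):
        if len(row) >= s:
            return random.sample(row, s)
    return rows[0]  # fewer than s items arrived in total
```

The point of the construction: no row needs to know n. The deepest row with ≥ s active items has between s and O(s) of them in expectation, which is exactly the sample size guarantee quoted above.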

Random sampling: Algorithm [with Cormode, Muthu & Yi, PODS 10, JACM 11]
Initialize i = 0. In epoch i:
Sites send in every item w.pr. 2^-i.
The coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability. (Each item is included in the lower sample w.pr. 2^-(i+1).)
When the lower sample reaches size s, the coordinator: broadcasts to the k sites to advance to epoch i+1; discards the upper sample; randomly splits the lower sample into a new lower and a new upper sample.
Correctness: (1) In epoch i, each item is maintained in the sample w.pr. 2^-i. (2) At least s items are always maintained in the sample (once s items have arrived).
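The epoch protocol above can be simulated in a single process. A sketch, not the paper's implementation: the k sites' coin flips are merged into one loop, and communication is only counted, not performed; `epoch_sampling` is a name chosen here.

```python
import random

def epoch_sampling(stream, s):
    """Coordinator-side sketch of the epoch-based sampling protocol.

    In epoch i every stream item is forwarded w.pr. 2^-i (in the real
    protocol this filtering happens at the sites; `messages` counts
    the forwards).  Forwarded items go to the lower or upper sample
    with equal probability, so a lower-sample item survives w.pr.
    2^-(i+1).
    """
    i, messages = 0, 0
    lower, upper = [], []
    for item in stream:
        if random.random() >= 2.0 ** -i:
            continue              # discarded locally at a site
        messages += 1             # one item forwarded to the coordinator
        (lower if random.random() < 0.5 else upper).append(item)
        while len(lower) >= s:    # advance the epoch: i <- i + 1
            i += 1
            upper = []            # discard the upper sample
            keep = []
            for x in lower:       # re-flip each lower-sample item
                (keep if random.random() < 0.5 else upper).append(x)
            lower = keep
    return lower, upper, i, messages
```

Running `epoch_sampling(range(10**5), 16)` should show `messages` far below the stream length, reflecting the O((k + s) log n) communication bound derived below (with the k broadcast messages per epoch not modeled in this single-process sketch).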

A running example. Maintain s = 3 samples, with sites S_1, S_2, S_3, S_4. Epoch 0 (p = 1): every item (1, 2, 3, 4, 5) arriving at a site is forwarded to the coordinator and placed into the lower or the upper sample with equal probability. Once the lower sample reaches size 3: discard the upper sample, split the lower sample, and advance to Epoch 1.

A running example (cont.). Epoch 1 (p = 1/2): each new item is now forwarded with probability 1/2; e.g., items 6 and 9 are discarded locally at their sites, while items 7, 8 and 10 are forwarded and placed into the lower or upper sample. Again the lower sample reaches size 3: discard the upper sample, split the lower sample, and advance to Epoch 2. Epoch 2 (p = 1/4): more items will be discarded locally. Intuition: each site maintains a sampling probability p ≈ s/n (n: total # items) without knowing n.

Random sampling: Analysis. For the algorithm above: messages sent per epoch are O(k + s) (the broadcast costs k messages, and O(s) items are forwarded in expectation before the lower sample fills); the number of epochs is O(log n). Total communication: O((k + s) log n).

Random sampling: Analysis and experiments. The bound can be (1) improved to Θ(k log_{k/s} n + s log n) and (2) extended to sliding-window cases. Experiments on a real data set, the 1998 World Cup logs: total # items n = 7,000,000; # sites k = 128; size of sample s = 128; # items sent ,000. (Figures: log2(cost) vs. log2(s) for the basic case (n = 7000000, k = 128) and for time-based sliding windows (n = 320000, k = 128), and log2(cost) vs. log2(w) for time-based sliding windows (s = 128, k = 128), each with a theory curve and a practice curve.)

Sampling from a (time-based) sliding window. (Figure: the time axis divided into expired windows, a frozen window, and the current window.) Sliding windows need new ideas. A sample for the sliding window = (1) a subsample of the (unexpired part of the) sample of the frozen window + (2) a subsample of the sample of the current window, by ISWoR. (1) and (2) may be sampled at different rates, but as long as both have sizes ≥ min{s, # live items}, that is fine. The key issue: how to guarantee both have sizes ≥ s as items in the frozen window are expiring? Solution: in the frozen window, find a good sampling rate such that the sample size is ≥ s.

Dealing with the frozen window. (Figure: the items of the frozen window at level 0 of a level-sampling structure; s = 2.) Keep all the levels? That needs O(w) communication. Instead: keep the most recent sampled items in a level only until s of them are also sampled at the next level. Total size: O(s log w). Guarantee: at some level there is a window (highlighted in the slide) containing s sampled items that covers the unexpired portion of the frozen window.

Dealing with the frozen window: The algorithm (illustrated with s = 2). Each site builds its own level-sampling structure for the current window until it freezes; this needs O(s log w) space and O(1) time per item. When the current window freezes: for each level, do a k-way merge to build that level of the global structure at the coordinator. Total communication: O((k + s) log w).
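The per-site level-sampling structure can be sketched as follows. The promotion rule and the truncation threshold follow the slide's description; the exact truncation details (e.g., where to cut when fewer than s items were promoted) are assumptions of this sketch, and `level_sampling` is a name chosen here.

```python
import random

def level_sampling(window, s, num_levels):
    """Level 0 holds every item of the window, in arrival order;
    an item at level j is promoted to level j+1 with probability
    1/2.  Each level is then truncated from the left so that it
    keeps only its most recent items, back to the s-th most recent
    item that was also promoted to the next level.  This bounds
    the total size by O(s log w).
    """
    levels = [list(window)]
    for _ in range(num_levels - 1):
        levels.append([x for x in levels[-1] if random.random() < 0.5])
    for j in range(num_levels - 1):
        promoted = set(levels[j + 1])
        seen, cut = 0, 0
        for idx in range(len(levels[j]) - 1, -1, -1):
            if levels[j][idx] in promoted:
                seen += 1
                if seen == s:   # s-th most recent promoted item found
                    cut = idx
                    break
        levels[j] = levels[j][cut:]   # keep only the recent suffix
    return levels
```

In the distributed algorithm each site would build such a structure for its portion of the current window; when the window freezes, the coordinator merges the k structures level by level, which is the k-way merge costing O((k + s) log w) communication quoted above.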