Big Data & Scripting, Part II: Streaming Algorithms
Counting Distinct Elements
counting distinct elements: problem formalization
- input: stream of elements o from some universe U, e.g. ids from a set of allowed ids
- elements can appear arbitrarily often; not all of them appear
- question: how many different elements appeared?
- application example: counting the number of visitors (IPs) of a website
- storing the elements is not an option
- estimate the number using hashing and probabilities
Flajolet-Martin algorithm: idea
- binary function f : X → {0, 1} with x_i = x_j ⇒ f(x_i) = f(x_j)
- P(f(x_i) = 1) = p, identically and independently distributed for all x_i
- assume we encountered m different x_i (1)
- what is the probability P_m that f(x_i) = 0 for all i?
- P_1 = (1 - p) for one x_i, P_m = (1 - p)^m for m different x_i
- P_m depends on the number of unique x_i, not on the total number
- exploit this to get an estimate of m

(1) that is actually what we try to find out
Flajolet-Martin algorithm: idea
- (1 - p)^m = ((1 - p)^(1/p))^(mp) ≈ e^(-mp)
- for small p: (1 - p)^(1/p) ≈ e^(-1) (Euler's number)
- probability P_m that f(x_i) = 0 for m different x_i: P_m ≈ e^(-mp)
- if m ≪ 1/p: P_m ≈ 1; if m ≫ 1/p: P_m ≈ 0
- p should be small (for the derivation and for large m)
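The approximation above is easy to check numerically; the values of p and m below are illustrative choices, not from the slides:

```python
import math

# For small p, (1 - p)^m is close to e^(-mp).
p = 0.001
for m in [100, 1000, 10000]:
    exact = (1 - p) ** m
    approx = math.exp(-m * p)
    print(f"m={m}: (1-p)^m = {exact:.4f}, e^(-mp) = {approx:.4f}")
```

For m = 1000 = 1/p both values are close to e^(-1) ≈ 0.368, matching the transition point between P_m ≈ 1 and P_m ≈ 0.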
Flajolet-Martin algorithm: bitmask & tail length
- use a hash function h : X → {0, 1}^K, e.g. the binary representation of hash values
- assume the bits of h(x_i) are uniformly and independently distributed
- the tail idea: count the number of trailing zeros, tail(h) = max{j : h mod 2^j = 0}
- bitmask: introduce a new random variable bitmask ∈ {0, 1}^M
- initialize bitmask(i) = 0 for i = 0 ... M-1 (zero indexed)
- for every x_i: bitmask(tail(h(x_i))) ← 1
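The tail function can be sketched directly from the definition; the cap for h = 0 is a convention chosen here (by the definition above, every j would qualify for h = 0):

```python
def tail(h: int, cap: int = 32) -> int:
    """Number of trailing zero bits of h: max{j : h mod 2^j = 0}.

    For h = 0 every j qualifies, so the result is capped
    (a convention for this sketch, not fixed by the slides).
    """
    if h == 0:
        return cap
    j = 0
    while h % 2 == 0:
        h //= 2
        j += 1
    return j
```

For example, tail(12) = 2, since 12 = 0b1100 ends in two zero bits.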
Flajolet-Martin algorithm: bitmask & tail length
- bitmask() keeps track of the tail lengths seen so far
- used to extend approximate counting by random variables
- bitmask(0) is expected to be set to 1 by 1/2 of the x_i
- bitmask(1) is expected to be set to 1 by 1/4 of the x_i, ...
- bitmask(i) is almost certainly 0 if i ≫ log_2 m
- bitmask(i) is almost certainly 1 if i ≪ log_2 m
  (recall that m is the number of unique elements seen)
- for counting, derive the random variable R: R(bitmask) = min{i : bitmask(i) = 0}
- it can be shown that E(R) ≈ log_2(φm), where φ = 0.77351...
- m can be approximated as m ≈ φ^(-1) · 2^R
Flajolet-Martin algorithm: accuracy
the idea so far
- the probability of f(x_i) = 1 depends on m; use this for estimation
- for the tail length: k = longest tail of a hash value observed ⇒ estimate 2^k elements
problems/refinements
- R is not very precise (large standard deviation, σ(R) ≈ 1.12)
- precision could be increased by averaging over multiple hash functions
- but: determining h(x_i) is costly
Flajolet-Martin algorithm: accuracy
- using k different hash functions (and bitmasks) increases accuracy:
  R̄ = 1/k · Σ_{j=1}^{k} R(bitmask_j) has std. dev. ≈ 1.12/√k
- the estimate is exponential in R̄ ⇒ large influence on the mean value
- improve accuracy with only one hash function: use k different bitmasks
  bitmask_0, ..., bitmask_{k-1}
- for each x_i, update only bitmask_(h(x_i) mod k)
- R̄ = 1/k · Σ_{j=0}^{k-1} R(bitmask_j); std. dev. ≈ 1.12/√k still holds
Flajolet-Martin algorithm
input: stream of numbers X (e.g. converted strings), precision parameter k
output: estimate of the number of unique elements in X
algorithm:
  bitmask = bitmask[k]   // array of k bitmasks, initialized with 0
  while(x_i = readnext()) {
    h = h(x_i)
    update bitmask[h mod k] with tail(h)
  }
  mean = mean({R(bitmask[j]) : j = 0, ..., k-1})
  return(k · 2^mean / 0.77351)
source and additional information: Flajolet, P.; Martin, G. N.: Probabilistic counting algorithms for data base applications, 1985
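A runnable sketch of the algorithm. Two details deviate from the pseudocode and are assumptions of this sketch: the hash bits used for bucket choice and for the tail are kept disjoint (using tail(h) directly would correlate with h mod k when k is a power of two), and the final estimate is scaled by k, since each bitmask only sees about m/k distinct elements (the 1985 paper's stochastic-averaging estimator includes this factor). The multiplicative hash is likewise just a stand-in for a well-mixing hash function:

```python
import statistics

PHI = 0.77351  # Flajolet-Martin correction factor

def h(x: int) -> int:
    # Knuth-style multiplicative hash (an assumption; any
    # well-mixing 32-bit hash function would do)
    return (x * 2654435761) % (1 << 32)

def tail(v: int, cap: int = 32) -> int:
    # number of trailing zero bits of v, capped for v = 0
    if v == 0:
        return cap
    j = 0
    while v % 2 == 0:
        v //= 2
        j += 1
    return j

def fm_estimate(stream, k: int = 64) -> float:
    """Estimate the number of distinct elements in `stream`
    with k bitmasks, each stored as an int used as a bit array."""
    bitmasks = [0] * k
    log2k = k.bit_length() - 1          # k is assumed a power of two
    for x in stream:
        hv = h(x)
        j = hv % k                      # bucket choice: low bits
        w = hv >> log2k                 # tail: the remaining bits
        bitmasks[j] |= 1 << tail(w)
    def R(bm: int) -> int:              # index of the lowest unset bit
        i = 0
        while bm & (1 << i):
            i += 1
        return i
    mean_R = statistics.fmean(R(bm) for bm in bitmasks)
    return k * (2 ** mean_R) / PHI      # scale by k: each bitmask only
                                        # saw about m/k distinct elements
```

Since the bitmask updates are idempotent, repeating elements in the stream does not change the estimate; only the set of distinct values matters.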
sampling in streams (2)
- sample: a descriptive subset of the original data, approximating the true distribution
- problem: selection of samples (how?), introduction of bias, sampling over time

(2) c.f. the book: Mining of Massive Datasets, Rajaraman, Ullman, 2011, available free online, link on website.
sampling: example
we use the following example setting to demonstrate problems in sampling:
data and problem
- incoming data points: (user, query, time), e.g. a DB system serving clients
- queries appear repeatedly
- goal: estimate the number of queries repeated 2, 3, ... times
sampling: naive approach (does not work)
- sample 10% of the data points (randomly)
- estimate repetitions from the resulting sample
example: unique and double queries
- unique query A and double query B (copies B_1, B_2)
- P(A in sample) = 0.1, analogously for B_1 and B_2
- P(B_1 in sample and B_2 not in sample) = 0.1 · (1 - 0.1) = 0.09
- in the sample, these appear as unique queries!
- P(B_1 and B_2 in sample) = 0.1 · 0.1 = 0.01
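The arithmetic can be checked directly; counting both orders in which exactly one copy of B survives gives 2 · 0.09 = 0.18, so a double query is far more likely to show up in the sample as unique (0.18) than as double (0.01):

```python
p = 0.10                        # sampling rate from the slide

p_unique_kept = p               # unique query A ends up in the sample
p_b1_only = p * (1 - p)         # B1 sampled, B2 not (one specific order)
p_one_copy = 2 * p * (1 - p)    # exactly one of B1, B2 sampled (either order)
p_both = p * p                  # both copies sampled

print(p_b1_only, p_one_copy, p_both)
```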
sampling using hash functions
solution: frequency-independent sampling
- sample items independently of their frequency
- if an item is in the sample, retain all of its copies
- important assumption: objects are distributed approximately equally over hash values (buckets)
algorithm: sample (j/k) · N elements
- hash function h : X → {0, ..., k-1}
- if h(x_i) < j: add x_i to the sample
- works if N is known in advance and j/k is small enough
- can be adapted to arbitrary N and a fixed sample size
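A sketch of the bucket test with j = 1 out of k = 10 buckets. Hashing a sha1 digest into buckets is a stand-in for the slide's h, and using the user as the key is an assumption that fits the query example: all data points of a sampled user are kept together, so per-user repetition counts in the sample are undistorted:

```python
import hashlib

def in_sample(key: str, j: int = 1, k: int = 10) -> bool:
    """Keep an item iff its key falls into one of j out of k buckets.

    All records sharing a key are kept or dropped together,
    independent of how often the key occurs in the stream.
    """
    hv = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return hv % k < j

# retain all copies of a sampled key (hypothetical example records):
stream = [("alice", "q1"), ("bob", "q1"), ("alice", "q2")]
sample = [(user, query) for (user, query) in stream if in_sample(user)]
```

Over many keys, roughly a fraction j/k of them ends up in the sample.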
sampling: adaptive sample size
problem
- the number of distinct objects is unknown in the beginning
- the sample should be of fixed or limited size
approach
- use a large number of buckets (e.g. keep all x_i)
- sample x_i as before (store each sample in its bucket)
- when the sample grows too large, drop buckets from the sample
sampling: a concrete example
problem: extract a 2GB sample from incoming ids
input
- objects with an id as string s_0 s_1 ... s_k, k ∈ {3, ..., 20}
- s_i ∈ {a, ..., z, A, ..., Z} = Var, |Var| = 52
convert an id to a number
- assign a value to each character (3): v(a) = 0, v(b) = 1, ..., v(Z) = 51
- convert to a number: number(s_0 ... s_k) = Σ_{i=0}^{k} v(s_i) · 52^i
- for the implementation, use Horner's method:
  v_0 = v(s_k), v_1 = v_0 · 52 + v(s_{k-1}), ..., number(s_0 ... s_k) = v_k
sampling
- apply the hash function h(v) = v mod 1000
- sample if h(v) < s, starting with s = 1000
- when the memory usage of the sample exceeds 2GB, decrease s (e.g. by 10), dropping buckets

(3) or use their ASCII/UTF value
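A sketch of the conversion and the bucket test. Processing the string from the end reproduces the weights v(s_i) · 52^i via Horner's method; the optional mod argument reduces at every step, which anticipates the overflow counter-measure discussed on the next slide:

```python
# 52-character alphabet from the slide: lower-case, then upper-case
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
V = {c: i for i, c in enumerate(ALPHABET)}

def number(s: str, mod=None) -> int:
    """Horner's method over the reversed string, so that
    number(s_0 ... s_k) = sum of v(s_i) * 52^i.

    If mod is given, reduce at every step so intermediate
    values stay small (valid by modular arithmetic).
    """
    v = 0
    for c in reversed(s):
        v = v * 52 + V[c]
        if mod is not None:
            v %= mod
    return v

def sampled(id_str: str, s: int, buckets: int = 1000) -> bool:
    # h(v) = v mod `buckets`; keep the id iff its bucket is below s
    return number(id_str, mod=buckets) < s
```

With s = 1000 and 1000 buckets every id passes; lowering s drops buckets and shrinks the sample.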
sampling: problems
- characters are often not equally distributed (e.g. in natural languages);
  we ignore that for now and assume that hashing takes care of it
- hash values get very large (e.g. 52^3 = 140608)
- problem: primitive data types (int, long) are limited
- counter measure: apply h() to v_i in every step;
  this works because of modulo arithmetic: ((v mod m) · 52 + c) mod m = (v · 52 + c) mod m
- what if s = 1 and the sample is still > 2GB?
- rerun and use more buckets (e.g. 10000), or use a second stage of hashing:
  apply a new hash function to the selected samples (those passing the first filter only)
maintaining sets: motivating scenario
- a sample S ⊆ X can be used to learn properties of the overall distribution
- example: queries and their frequencies
- in the following, we derive further statistics about the samples,
  e.g. frequency distributions (how many queries have a certain frequency)
- we need to decide whether x_i ∈ S
- we could exploit the filters resulting from the sampling process
- problem: new (unseen) objects could end up in sample buckets
- aim: decide set membership without further information
maintaining sets: a first approach
- use fine-grained hashing to split the samples into small chunks
- remember the buckets that contain some x_i ∈ S
- discard elements not in these buckets
result: a dirty filter
- if x_i ∈ S, it is retained
- if x_i ∉ S, it might still pass (if it falls into a remembered bucket) or be discarded
in the following: improve this general idea using hash functions ⇒ Bloom filter
Bloom filter: idea and introduction
- input: large set S of objects
- problem: decide whether x_i ∈ S without storing S
formalizing the idea (not the Bloom filter yet)
- hash function h : X → H, H very large
- bit array B with B[i] = 1 if ∃s ∈ S : h(s) = i, and B[i] = 0 otherwise
- for each s ∈ S: B[h(s)] == 1
- for each s ∉ S: B[h(s)] == 1 (false positive) or B[h(s)] == 0 (reject)
- elements of S pass, others might pass, but hopefully not all
- next: try to minimize false positives
Bloom filter: definition and usage
Bloom filter
- set S and bit array B as before
- collection of hash functions h_1, ..., h_k mapping to H
- set B[i] = 1 if for any s ∈ S and any h_j: h_j(s) = i
- an object o passes if B[h_i(o)] = 1 for all i ∈ {1, ..., k}
- if B[h_i(o)] = 0 for any i, o does not pass
- spreading over more hash functions avoids collisions:
  it is sufficient if only one h_i does not produce a collision
- allows derivation of probabilistic properties,
  e.g. how many h_i, and the size of H for a given S
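A minimal sketch of the definition above. Deriving the k hash functions by salting a single sha256 hash is an implementation choice of this sketch, not part of the slides; likewise, one byte per bit is used for simplicity instead of a packed bit array:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit, for simplicity

    def _positions(self, item: str):
        # k hash functions derived from sha256 with different salts
        for j in range(self.k):
            d = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: str):
        for i in self._positions(item):
            self.bits[i] = 1

    def __contains__(self, item: str) -> bool:
        # an item passes only if ALL k positions are set;
        # a single 0 bit proves the item was never added
        return all(self.bits[i] for i in self._positions(item))
```

Added elements always pass (no false negatives); other elements pass only if all their k positions collide with set bits (false positive).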
Bloom filter: properties and extensions
- array size |B| = m (memory usage), e.g. 10^6
- k hash functions (time consumption of the test), e.g. 5
- |S| = n elements stored (yield, in a sense), e.g. 10^5
- approximate probability of a false positive: (1 - e^(-kn/m))^k
- example: (1 - e^(-5·10^5/10^6))^5 ≈ 0.0094 (0.94%)
- memory usage: 10^6 bits ≈ 122 kByte (10 bits/id);
  storing the ids directly (10 bytes each) would take 10^6 bytes ≈ 0.95 MB
conclusions
- more memory improves the probability; fewer elements (n), too
- more hash functions: not necessarily!
  e.g. (values from above) k = 7 ⇒ prob. 0.82%, k = 10 ⇒ prob. 1%
- each m, n pair has an optimal k = (m/n) · ln(2)
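The numbers on this slide can be reproduced directly from the formula; for m = 10^6 and n = 10^5 the optimal k = (m/n) · ln(2) ≈ 6.93 explains why k = 7 beats both k = 5 and k = 10:

```python
import math

def fp_prob(m: int, n: int, k: int) -> float:
    """Approximate Bloom-filter false-positive rate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 10**6, 10**5
for k in (5, 7, 10):
    print(f"k = {k:2d}: false-positive prob. = {fp_prob(m, n, k):.4%}")
print(f"optimal k = {m / n * math.log(2):.2f}")
```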
wrap up: sampling and filtering
- straightforward sampling can go wrong
- correct the sample by fixing the selection, e.g. limit it to certain users/ids
- decide whether an item is in the sample with a hash function
- does that solve our count-duplicates problem?
- filter a large set of objects with a Bloom filter
aside: difference between streaming and online algorithms
online scenario
- unlimited stream, one pass
- answers ready all the time (e.g. sliding window)
streaming scenario
- several passes possible, but usually avoided
- answers usually at the end of the execution