Big Data & Scripting, Part II: Streaming Algorithms
a note on sampling and filtering
- sampling: (randomly) choose a representative subset
- filtering: given some criterion (e.g. membership in a set), retain only elements matching that criterion
- example scenario: a stream of requests (user, request)
  - sampling requests is straightforward (e.g. which pages are accessed most frequently)
  - analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)
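A minimal sketch of the two operations on a (user, request) stream may help; the stream, the sample rate, and the allowed-user set below are made-up illustration values.

```python
# Sketch contrasting sampling and filtering on a stream of (user, request)
# pairs; all data and parameters are illustrative, not from the lecture.
import random

stream = [("u%d" % random.randint(1, 5), "page%d" % random.randint(1, 3))
          for _ in range(1000)]

# sampling: keep each event with probability 0.1 (representative subset)
sample = [event for event in stream if random.random() < 0.1]

# filtering: keep only events matching a criterion (membership in a set)
allowed_users = {"u1", "u2"}
filtered = [event for event in stream if event[0] in allowed_users]

print(len(sample), len(filtered))
```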
sampling and filtering: example
- n = 200,000 events, m = 40,000 different requests, uniform distribution
- [figure: number of occurrences per request id, for all queries and for a 10% sample]
sampling and filtering: example
- same dataset, but plotted as frequency vs. number of queries with this frequency
- [figure: all queries vs. a 10% sample of events, by number of queries with a given frequency]
- completely different distributions, due to sampling
sampling and filtering: example
- same dataset, again frequency vs. number of queries with this frequency
- this time the sample is selected by a fixed subset of ids
- [figure: all queries vs. the corrected 10% sample, by number of queries with a given frequency]
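The difference between the two samples can be reproduced with a short sketch: sampling individual events distorts the repetition-count histogram, while sampling a fixed subset of ids roughly preserves its shape. The data generator below is a stand-in using the slide's n and m, not the original dataset.

```python
# Sketch of the two sampling strategies from the example above.
import random
from collections import Counter

n, m = 200_000, 40_000
stream = [random.randrange(m) for _ in range(n)]   # request ids, uniform

def repetition_histogram(events):
    """How many ids occur exactly x times, for each x."""
    per_id = Counter(events)            # id -> number of occurrences
    return Counter(per_id.values())     # occurrences -> number of ids

# (a) naive 10% sample of events: distorts the repetition counts
event_sample = [e for e in stream if random.random() < 0.1]

# (b) corrected 10% sample: keep all events of a fixed 10% subset of ids
kept_ids = {i for i in range(m) if random.random() < 0.1}
id_sample = [e for e in stream if e in kept_ids]

print(repetition_histogram(stream))
print(repetition_histogram(event_sample))   # shifted towards small counts
print(repetition_histogram(id_sample))      # shape close to the full stream
```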
Histograms and Frequency Skews
stream and histogram
- consider the following input: [figure: data points arriving over time, one row per object/bucket 1-6]
- as time/the stream progresses, data points come in, e.g. users issuing requests
- data points are distinguished by some id or bucket (e.g. from hashing)
- some are seen more often (e.g. object 4), some less often (e.g. object 1): user 4 sends requests with high frequency, user 1 sends only one request
- this is highly valuable information for an analysis
stream and histogram
- [figure: the same stream and the resulting histogram of observation counts per object]
- to analyze these distributions, histograms are helpful
comparing histograms: different distributions
- an example of two different streams of observations [figure: the two histograms]
- both have an equal number of data points (10,000) and of distinct objects (60)
- but the objects have different probabilities of being observed
- sorting objects by frequency makes the difference more obvious [figure: the same histograms, sorted by frequency]
the plan
- information about the distribution of observations is crucial for many applications
- knowing the complete, exact histogram would be helpful, but is often not possible due to the large number of distinct objects
- workaround: characterize the histogram without knowing the complete picture
- characteristic properties are easier to determine, analogous to descriptions of distributions on $\mathbb{R}$
characterizing distributions
- [figure: histogram of observation counts per object]
- $m_i$: frequency of object $i$
- number of distinct objects seen so far: $\sum_i (m_i)^0$
- total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$
- generalization: $M_k = \sum_i (m_i)^k$, the $k$-th moment
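For a stream small enough to keep every distinct object in memory, these moments can be computed directly from the frequency histogram; a minimal sketch with a made-up stream:

```python
# Exact moments of a small stream's frequency histogram (only feasible
# when all distinct objects fit in memory); the stream is made up.
from collections import Counter

stream = list("ababcbbba")
m = Counter(stream)                       # m[i] = frequency of object i

M0 = sum(c ** 0 for c in m.values())      # number of distinct objects
M1 = sum(c ** 1 for c in m.values())      # total number of observations
M2 = sum(c ** 2 for c in m.values())      # second moment, measures skew

print(M0, M1, M2)                         # 3 9 35
```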
$M_2$, the second moment
- what we have so far:
  - $M_0$: the Flajolet-Martin algorithm from the last lecture
  - $M_1$: counting
  - combination: the average $M_1 / M_0$
- next: estimate $M_2 = \sum_i m_i^2$
$M_2$, the second moment
- [figure: the two example histograms, with $M_2 = 1{,}678{,}672$ and $M_2 = 3{,}320{,}852$]
- motivation:
  - $M_2$ describes the skewness of a distribution: smaller $M_2$, less skewed distribution
  - related to the Gini index (surprise index)
  - used to limit approximation errors and for query optimization in database systems
$M_2$ and Var(X)
- variance describes the distribution of values, $M_2$ describes the distribution of their frequencies
- $M_2$ is comparable to the variance of the frequencies: $\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i (m_i - \mu(\{m_i\}))^2$
$M_2$, the second moment: approximation
- storing and counting all distinct objects is impossible
- approximation by the Alon-Matias-Szegedy algorithm [1]:
  - $N$ observations in the stream
  - choose $k$ random positions $p_j \in \{1, \dots, N\}$
  - when reaching position $p_j$: store the object at that position and start counting its occurrences in $m_j$
  - estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$
[1] Alon, N.; Matias, Y.; Szegedy, M.: The space complexity of approximating the frequency moments, 1999
$M_2$, the second moment: example
- stream (positions 1-20): c e c f a e g f f b b c g b a a f d a e, so $N = 20$
- random positions: 3, 7, 14, 5
  - position 3: encounter c, counting from there gives 2
  - position 7: encounter g, 2
  - position 14: b, 1
  - position 5: a, 4
- estimate: $M_2 \approx \frac{20}{4}\left[(2 \cdot 2 - 1) + (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1)\right] = \frac{20}{4} \cdot 14 = 70$
- true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$
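The estimator and the worked example can be reproduced in a few lines; the sketch below runs offline over the full stream, so counting from each sampled position is simply a scan to the end (stream and positions are the ones above).

```python
# Sketch of the Alon-Matias-Szegedy M2 estimator on the example stream.
def ams_estimate(stream, positions):
    N = len(stream)
    counts = []
    for p in positions:                          # 1-indexed positions
        obj = stream[p - 1]
        # count occurrences of obj from position p to the end of the stream
        counts.append(sum(1 for x in stream[p - 1:] if x == obj))
    k = len(positions)
    return (N / k) * sum(2 * c - 1 for c in counts)

stream = list("cecfaegffbbcgbaafdae")
print(ams_estimate(stream, [3, 7, 14, 5]))       # 70.0, true M2 is 64
```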
$M_2$, the second moment: summary
- the algorithm is simple to implement and needs to store only the $k$ counters
- it gets more precise with larger $k$; proof idea: the expected value of each term $N(2 m_j - 1)$ is exactly $M_2$, so the average over $k$ counters approaches $M_2$
- problem: $N$ may not be known in the beginning
approximating $M_2$ with unknown stream length
- the stream may be of unknown length or unlimited
- still, each position must be chosen uniformly at random from $\{1, \dots, N\}$
- solution (reservoir sampling):
  - keep counters for $k$ objects, beginning with the first $k$
  - when the object at position $p > k$ is processed: choose it with probability $k/p$ and drop an existing element (chosen with equal probability)
  - as a result, each position is chosen with equal probability
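A single-pass sketch that combines the AMS counters with this reservoir-sampling rule, so no stream length is needed in advance; the function name, the choice of k and the test stream are illustrative, not part of the original algorithm description.

```python
# Streaming AMS estimate of M2 with unknown stream length.
import random

def ams_streaming(stream, k):
    reservoir = []            # entries: [object, count since it was sampled]
    n = 0                     # number of stream positions processed so far
    for x in stream:
        n += 1
        # update counters of already-sampled objects (including this occurrence)
        for entry in reservoir:
            if entry[0] == x:
                entry[1] += 1
        if len(reservoir) < k:
            reservoir.append([x, 1])
        elif random.random() < k / n:
            # replace a uniformly chosen existing counter with the new object
            reservoir[random.randrange(k)] = [x, 1]
    return (n / k) * sum(2 * c - 1 for _, c in reservoir)

print(ams_streaming(list("cecfaegffbbcgbaafdae"), 4))  # noisy estimate of 64
```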
Clustering Data Streams
clustering data streams: the problem
- many formulations of the clustering problem are possible: a wide range of applications, with strong variation in preconditions and objective functions
- common ground: objects connected by a relation; identify groups of similar objects with respect to this relation
- the problem is intractable (NP-hard)
- some basic questions:
  - what kind of relation (e.g. binary, distance, similarity)?
  - can objects have a mean value (continuous space)?
  - what is a good cluster (objective function)?
  - are overlapping clusters possible?
clustering data streams: STREAM
- in the following: a single example problem and a single algorithm
- k-median on a data stream, in one pass, with guaranteed approximation quality
- algorithm: STREAM (Guha, Mishra, Motwani, O'Callaghan: Clustering Data Streams, 2000)
clustering data streams: the k-median problem
- input:
  - objects $X = \{x_i : i = 1, \dots, N\}$, distance $d : X \times X \to \mathbb{R}$
  - every $x_i$ is seen once, in arbitrary order ($i = 1, \dots, N$)
  - $k$: the number of clusters to find
- objective: identify $k$ elements $m_1, \dots, m_k \in X$ (cluster centers)
  - let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1,\dots,k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center
  - minimize $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
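The objective is easy to state in code: every object contributes its distance to the nearest chosen center. The toy instance below (points on the real line, absolute difference as distance) is a made-up example.

```python
# Minimal sketch of the k-median objective C({m_1,...,m_k}) for given centers.
def kmedian_cost(objects, centers, d):
    # each object contributes its distance to the nearest center
    return sum(min(d(x, m) for m in centers) for x in objects)

objects = [0.0, 1.0, 2.0, 10.0, 11.0]
d = lambda a, b: abs(a - b)
print(kmedian_cost(objects, [1.0, 10.0], d))   # 1 + 0 + 1 + 0 + 1 = 3
```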
clustering data streams: approximating k-median
- for small problem instances, k-median can be approximated to a fixed (constant) factor: $C_{\text{approx}} \le a \cdot C_{\text{opt}}$ (the approximation is at most a factor $a$ worse than the optimal solution, for a fixed $a$)
- this approximation is useful for approximating larger instances
- approximation (idea):
  - k-median can be stated as an integer program $P_I$
  - this program can be relaxed to a linear program $P_L$
  - the solution of $P_L$ can be rounded to a solution of $P_I$
  - linear programs can be solved efficiently
clustering data streams: weighted k-median
- extending k-median with weights: k-median with weighted samples, $w : X \to \mathbb{R}_{>0}$
- the distance of objects to their centers is multiplied by their weight: $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \, d(x_i, m_j)$
- k-median is the special case with unit weights
- weighted k-median can be approximated like unweighted k-median: the algorithm can only be applied to small instances, so we use it to solve small sub-problems
- in the following, we use the procedure wkm(): input: objects, weights, k; output: k weighted centers; runtime: $O(n^2)$ (see the sketch below)
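As a stand-in for wkm(), here is a brute-force weighted k-median that simply tries every k-subset of the objects as centers; it is only usable for very small instances (and is not the $O(n^2)$ procedure referenced above), and its name, interface and test data are assumptions for illustration.

```python
# Brute-force stand-in for the wkm() procedure described above.
from itertools import combinations

def wkm(objects, weights, k, d):
    def cost(centers):
        return sum(w * min(d(x, m) for m in centers)
                   for x, w in zip(objects, weights))
    # try every k-subset of the objects as candidate centers
    return min(combinations(objects, k), key=cost)

objects = [0.0, 1.0, 2.0, 10.0, 11.0]
weights = [1, 1, 5, 1, 1]                 # heavy weight pulls a center to 2.0
print(wkm(objects, weights, 2, lambda a, b: abs(a - b)))   # (2.0, 10.0)
```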
first step: clustering with low memory
- approach: divide and conquer
- Small-Space(X):
  1. divide X into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $k$ clusters
  3. result: $X'$, the set of the $lk$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the weight $|N(c)|$ (the number of points assigned to $c$)
- step 2 can be solved with a constant-factor approximation: the solution is at most $b$ times worse than the optimum
- step 4 can be solved with a constant-factor approximation, not worse than $c$ times the optimum
- result: a constant-factor approximation built from partial solutions and their combination
extending to a solution
- Small-Space(X):
  1. divide X into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $O(k)$ clusters
  3. result: $X'$, a set of $O(lk)$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the weight $|N(c)|$
- this yields a constant-factor approximation, but:
  - it needs to cluster each $X_i$: memory problem 1, the size of the subsets versus $l$
  - it needs to cluster $X'$: memory problem 2, clustering $O(lk)$ elements
- a sketch of the whole procedure follows below
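A self-contained sketch of Small-Space, reusing the brute-force wkm stand-in from above for both the sub-problems and the final clustering of X'; the chunking scheme and the toy data are illustrative choices, not prescribed by the algorithm.

```python
# Sketch of the Small-Space divide-and-conquer step with a brute-force
# weighted k-median for the small sub-problems.
from itertools import combinations
from collections import Counter

def wkm(objects, weights, k, d):
    cost = lambda C: sum(w * min(d(x, m) for m in C)
                         for x, w in zip(objects, weights))
    return min(combinations(objects, k), key=cost)

def small_space(X, k, l, d):
    chunks = [X[i::l] for i in range(l)]                  # 1. l disjoint subsets
    centers, weights = [], []
    for chunk in chunks:                                  # 2. cluster each X_i
        local = wkm(chunk, [1] * len(chunk), k, d)
        assign = Counter(min(local, key=lambda m: d(x, m)) for x in chunk)
        centers += list(local)                            # 3. collect the lk centers
        weights += [assign[m] for m in local]             # |N(c)| as weight
    return wkm(centers, weights, k, d)                    # 4. cluster X'

X = [0.0, 0.5, 1.0, 1.5, 9.0, 9.5, 10.0, 10.5]
print(small_space(X, k=2, l=2, d=lambda a, b: abs(a - b)))
```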