Data stream summaries


Data stream summaries CLEROT Fabrice fabrice.clerot@orange-ftgroup.com Orange Labs

data streams, why bother?

massive data is the talk of the town...

data streams, why bother? data streams have always been there! (state of the art, circa 2000): service platform → production database (1 GB/day) → business-oriented data mart (10 GB) → data warehouse (0.1-1 TB)

data streams, why bother? it's growing? so what? (state of the art, circa 2010): service platform → production database (10 GB/day) → business-oriented data mart (10 GB) → aggregate computation → data warehouse (10-100 TB: more data, for longer time periods)

data streams, why bother? Moore's law (cpu) and Kryder's law (storage) have roughly the same exponent: performance per unit cost doubles every 18 months. so, indeed, why bother? blindingly obvious: most exact data processing operations scale worse than O(n), often very much worse, e.g. in O(n²); even sorting scales in O(n log(n)). petabytes irresistibly crawl their way to defeat teraflops!

data streams, why bother? "gigantic" data warehouses become a computational blocking point: service platform → production database (10 GB/day) → business-oriented data mart (10 GB) → aggregate computation → data warehouse (10-100 TB: more data, for longer time periods)

data streams, why bother? this is why! service platform → production database (10 GB/day) → business-oriented data mart (10 GB), with on-line approximate aggregate computation on the incoming stream vs off-line exact aggregate computation in the data warehouse (10-100 TB: more data, for longer time periods)

... but small data also matters!

(almost) brainless chatters... otherwise known as "(remote) sensors": potentially thousands of them in an ad hoc communication network, millions to come ("digital dust"). very limited power autonomy: communication and processing drastically reduce the lifetime of the sensors. processing data at the stream level is a way to reduce the cost of communication and processing: for instance, compute the mean of a measurement as the data travel through the ad hoc network towards a base station, instead of transmitting the raw data to the base station.
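To make the last point concrete, here is a minimal, illustrative sketch (not from the slides; all names are mine) of in-network aggregation: each sensor forwards a small (count, sum) pair instead of its raw readings, any relay can merge the pairs it receives, and the base station still recovers the exact mean.

import math

def local_summary(readings):
    # summary a leaf sensor sends to its parent: (count, sum)
    return (len(readings), sum(readings))

def merge(*summaries):
    # combine the summaries received by a relay node
    return (sum(c for c, _ in summaries), sum(s for _, s in summaries))

def mean(summary):
    count, total = summary
    return total / count if count else math.nan

# two sensors report through one relay to the base station:
s1 = local_summary([20.1, 20.4, 19.8])
s2 = local_summary([21.0, 20.7])
print(mean(merge(s1, s2)))  # global mean, with 2 numbers per hop instead of 5 raw readings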

outline: data streams vs databases; summaries vs "stream-mining"; some simple summaries; summary of the join of two data streams

data stream vs data base
mining: data base = volatile queries on persistent data; data stream = persistent queries on volatile data
processing: data base = centralized; data stream = distributed, with infinite streams, limited CPU, limited memory and limited bandwidth (cf. Cormode)
management: data base = load new data for existing queries; data stream = keep existing data for new queries, hence a generic summary

"generic summary" vs "stream-mining" stream-mining persistant queries valid for all the stream duration the result of the query is a new data stream (in general) if the application allows the pre-definition of all useful queries, this formalism answer all the applicative requirements what about a new query on past data? data are volatile and cannot be accessed anymore no pre-defined query has been set, so no answer can be retrieved a generic summary aims at allowing the (approximate) execution of such queries (with confidence bounds on the result)

"generic summary" vs "stream-mining" and what about a persistent query on two sliding windows one on the immediate history one on the distant past warning persistent query on past data "past window" "present window" what is the present average invoice of the 10% best clients at the same period last year?

generic summary and density estimation: a data stream is a distribution on T x D, where T is the description space for time and D is the description space of events (typically, (@, value)). a generic summary can be seen as a density estimation problem on T x D under memory, cpu and bandwidth constraints.

generic summary and density estimation "ad hoc" approaches, however, with separate treatments of time and events CluStream (Aggarwal) mixture of gaussians for the density estimation of event space limited to numerical description of events logarithmic "cliches" structure for time HClustream (Yang) or SCLOPE (Ong): claim to extend Clustream to symbolic data by keeping the frequencies of the modalities per gaussian clearly do not scale for large address space (for instance)

generic summary and sampling: downgrade density estimation to sampling; the generic summary keeps a representative sample of the stream events. constraints: limited memory (bound on the summary size); limited cpu (bound on the per-event processing); distributed processing: limited bandwidth (bound on the communication requirements) and limited cpu (bound on the computational cost of obtaining the global summary from the local summaries).

some simple summaries: uniform sampling of a stream; weighted sampling of a stream; uniform sampling from a sliding window on a stream; weighted sampling from a sliding window on a stream.

(over-)simplified stream model: an infinite sequence of events e; e.time is an integer; e.data is a description of the event; perfect observation; perfect time-ordering.

uniform sampling of a stream: at any time, the probability for an event to be in the sample is uniform with respect to the complete history of the stream: with sample size r, at time t the probability for event e (with e.time < t+1) to be in the sample is r/t. time evolution of the sample res(t): what is the condition for a new event to be sampled? what is the condition for an event in the sample to be excluded?

reservoir sampling: at time t+1, a new event e arrives; what is Pin, the probability for e to enter res(t), and Pout, the probability for an event of the reservoir to be excluded? Pin = r/(t+1).

reservoir sampling: Pin = r/(t+1) and Pout = 1/r. for all events already in the reservoir: at time t, P(e, t) = r/t for e.time < t+1; at time t+1 we need P(e, t+1) = r/(t+1), with P(e, t+1) = P(e, t) * [(1 - Pin) + Pin * (1 - Pout(e))].
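Spelled out (a short check added here for clarity, in the slide's own notation), the induction with Pin = r/(t+1) and Pout = 1/r gives

\[ P(e, t+1) = \frac{r}{t}\left[\left(1 - \frac{r}{t+1}\right) + \frac{r}{t+1}\left(1 - \frac{1}{r}\right)\right] = \frac{r}{t}\cdot\frac{t}{t+1} = \frac{r}{t+1}, \]

so the sample indeed stays uniform over the whole history of the stream.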

reservoir sampling (Vitter): for t < r+1, place all events into the reservoir; for t > r, place the new event in the reservoir with probability r/t; if the new event is sampled, exclude one event from the reservoir uniformly at random.

reservoir sampling: the reservoir evolves from res(t) to res(t+1) with Pin = r/(t+1) and Pout = 1/r; at the next step, from res(t+1) to res(t+2) with Pin = r/(t+2).
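The Vitter scheme just described can be sketched in a few lines of Python (an illustrative sketch, not the author's code; function and variable names are mine):

import random

def reservoir_sample(stream, r, seed=None):
    # Vitter-style reservoir sampling: res is a uniform sample of size r
    # over everything seen so far.
    rng = random.Random(seed)
    res = []
    for t, event in enumerate(stream, start=1):
        if t <= r:
            res.append(event)                  # t < r+1: keep every event
        elif rng.random() < r / t:             # Pin = r/t
            res[rng.randrange(r)] = event      # Pout = 1/r: evict a uniform victim
    return res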

what about the constraints? limited memory: the size of the reservoir is set a priori. limited cpu: a naive implementation needs one random draw per event, plus another one in case of success; in fact, only one random draw per event is enough... and much less than that: after an insertion, directly draw the time of the next insertion.
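One known way to realise the "draw the time of the next insertion" idea is a geometric-style skip in the spirit of Li's "Algorithm L"; the sketch below is illustrative only (names are mine) and may differ in detail from the exact variant the author has in mind:

import math
import random

def reservoir_sample_with_skips(stream, r, seed=None):
    # After the reservoir is full, draw how many events to skip before the
    # next insertion instead of testing every single event.
    rng = random.Random(seed)
    it = iter(stream)
    res = []
    for event in it:                             # fill the reservoir first
        res.append(event)
        if len(res) == r:
            break
    if len(res) < r:
        return res
    w = math.exp(math.log(rng.random()) / r)
    while True:
        skip = int(math.log(rng.random()) / math.log(1.0 - w))
        for _ in range(skip):                    # skipped events never enter
            if next(it, None) is None:
                return res
        event = next(it, None)
        if event is None:
            return res
        res[rng.randrange(r)] = event            # Pout = 1/r: evict a uniform victim
        w *= math.exp(math.log(rng.random()) / r)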

what about the constraints? distributable: for the summary of a set of streams φ_i, with local summaries of the same size, res(MUX_i φ_i, t) = U_i res(φ_i, t): the summary of the multiplex is just the union of the summaries of the original streams. either transmit the local summaries to a collector when required, or transmit the local updates to the collector; limited bandwidth to the collector, no communication between streams.

weighted sampling of a stream: a weight e.weight is associated with every event. at any time t, the probability for a past event (e.time < t+1) to be in the sample is proportional to its relative weight with respect to the total weight of the past: Prob(e in res(t)) = r · e.weight / Σ_{f.time < t+1} f.weight. time evolution of the sample res(t): what is the condition for a new event to be sampled? what is the condition for an event in the sample to be excluded?

weighted reservoir sampling: Pin = r · e.weight / W(t+1) and Pout = 1/r, where W(t) = Σ_{f.time < t+1} f.weight and W(t+1) = W(t) + e.weight. interpretation: "1 event with weight 10 is equivalent to 10 identical events with weight 1 arriving at the same time".
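A minimal, illustrative sketch of the weighted scheme above (not the author's code; names are mine), with Pin = r · e.weight / W(t+1) and a uniformly chosen victim; it assumes no single weight dominates, i.e. r · e.weight / W(t+1) stays below 1:

import random

def weighted_reservoir(stream, r, seed=None):
    # stream yields (value, weight) pairs; res is a weight-biased sample of size r.
    rng = random.Random(seed)
    res = []
    total_weight = 0.0
    for value, weight in stream:
        total_weight += weight                            # W(t+1) = W(t) + e.weight
        if len(res) < r:
            res.append((value, weight))                   # fill the reservoir first
        elif rng.random() < r * weight / total_weight:    # Pin = r * e.weight / W(t+1)
            res[rng.randrange(r)] = (value, weight)       # Pout = 1/r
    return res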

what about the constraints? limited memory: the size of the reservoir is set a priori. limited cpu: a naive implementation needs one random draw per event, plus another one in case of success; in fact, only one random draw per event is enough... but here you cannot do less: future weights are unknown, so you cannot draw the time of the next insertion.

what about the constraints? distributable: for the summary of a set of streams φ_i, with local summaries of the same size, the summary of the multiplex is the union of resampled summaries of the original streams, resampled according to the respective weights of the streams. n local reservoirs of size r will produce a global reservoir of size at most n·r: the size of the sample is bounded but not known in advance. either transmit the local summaries to a collector when required, or transmit the local updates to the collector; limited bandwidth to the collector, no communication between streams.

uniform sampling from a sliding window on a stream: at any time t, the probability for an event of [t-τ, t] to be in the sample is uniform on [t-τ, t], and zero for events in [0, t-τ-1]. sample size r (< τ). time evolution of the sample res(t): what is the condition for a new event to be sampled? what is the condition for an event in the sample to be excluded? how to deal with "expiring" events of the sample? (at time t, an event e in res(t-1) with e.time = t-τ-1 "expires")

why reservoir sampling doesn't work with sliding windows: suppose an element in the reservoir expires; we need to replace it with a randomly chosen element from the current window. however, in the data stream model we have no access to past data, so we cannot sample from the "current" window: it's gone! we could store the entire window, but this would require O(τ) memory.

chain-sample (Babcock): initialisation by reservoir sampling on the first window, i.e. include each new element in the sample with probability 1/min(t, τ). as each element is added to the sample, choose the index of the element that will replace it when it expires: when the t-th element expires, the window will be (t+1, ..., t+τ), so choose the index uniformly from this range. once the element with that index arrives, store it and choose the index that will replace it in turn, building a chain of potential replacements. when an element is chosen to be discarded from the sample, discard its chain as well.

example: r = 1 and τ = 10, on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3. the element 3 enters the sample: choose the position of its successor (it will be the 9). the 9 enters the window: choose the position of its own successor (it will be the 7). the 3 expires: its successor, the 9, is included in the sample.
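A minimal, illustrative sketch (Python, r = 1, not the paper's code; names are mine) of the chain-sample just described: the chain is kept as a list whose head is the current sample, and only the tail is waiting for its successor to arrive.

import random
from collections import deque

def chain_sample(stream, tau, seed=None):
    # One-pass chain-sample (after Babcock, Datar, Motwani) for sample size
    # r = 1 over a sliding window of the last tau elements; returns the
    # sampled value after each arrival.
    rng = random.Random(seed)
    chain = deque()          # (index, value) links; head = current sample
    succ_idx = None          # index awaited to extend the tail of the chain
    out = []
    for i, x in enumerate(stream, start=1):
        if succ_idx == i:                         # the awaited successor arrives:
            chain.append((i, x))                  # store it, pick its own successor
            succ_idx = rng.randint(i + 1, i + tau)
        if rng.random() < 1.0 / min(i, tau):      # new element replaces the whole chain
            chain = deque([(i, x)])
            succ_idx = rng.randint(i + 1, i + tau)
        while chain and chain[0][0] <= i - tau:   # the head expires: promote its successor
            chain.popleft()
        out.append(chain[0][1] if chain else None)
    return out

For instance, chain_sample([3, 5, 1, 4, 6, 2, 8, 5, 2, 3, 5, 4, 2, 2, 5, 0, 9, 8, 4, 6, 7, 3], tau=10) replays the toy example above, with its own random choices of successors.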

what about the constraints? limited memory: the size of the reservoir is set a priori and the number of pointers to successors can be bounded; the mean length of a chain is e ≈ 2.718. limited cpu: e ≈ 2.718 random draws per event on average.

what about the constraints? distributable: for the summary of a set of streams φ_i, with local summaries of the same size, res(MUX_i φ_i, t) = U_i res(φ_i, t): the summary of the multiplex is just the union of the summaries of the original streams. either transmit the local summaries to a collector when required, or transmit the local updates to the collector; limited bandwidth to the collector, no communication between streams.

weighted sampling from a sliding window on a stream: a weight e.weight is associated with each event e. at any time t, the probability for an event of [t-τ, t] to be in the sample is proportional to its relative weight with respect to the total weight on [t-τ, t], and zero for events in [0, t-τ-1]: Prob(e in res(t)) = r · e.weight / Σ_{t-τ-1 < f.time < t+1} f.weight. sample size r (< τ). time evolution of the sample res(t): what is the condition for a new event to be sampled? what is the condition for an event in the sample to be excluded? how to deal with "expiring" events of the sample? (at time t, an event e in res(t-1) with e.time = t-τ-1 "expires")

weighted chain sampling open problem... we do not know the future weights, so we cannot choose the position of the successors

another simple approach: oversample and select, for a uniform random sample from a sliding window. 1. as each element arrives, remember it with probability p = c · (r/τ) · log τ; otherwise discard it. 2. discard elements when they expire. 3. when asked to produce a sample, choose r elements at random from the set in memory. expected memory usage O(r log τ), and O(r log τ) memory with high probability. the algorithm can fail if fewer than r elements from a window are remembered; with high probability, however, this will not happen. the adaptation to the weighted case is obvious: weighted random sampling in step 1 (if weights are known in advance) or in step 3 (otherwise: for instance, for application-dependent weights).
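A minimal, illustrative sketch (Python, not the author's code; names are mine) of the oversample-and-select approach; the constant c is left as a parameter, since its value depends on the failure probability one is willing to tolerate:

import math
import random
from collections import deque

def oversampled_window_sample(stream, tau, r, c=2.0, seed=None):
    # Step 1: remember each arrival with probability p = c*(r/tau)*log(tau).
    # Step 2: discard remembered items once they leave the window.
    # Step 3: when asked, return r items drawn at random from what is kept.
    rng = random.Random(seed)
    p = min(1.0, c * r / tau * math.log(tau))
    kept = deque()                                  # (index, value) pairs still in the window
    for i, x in enumerate(stream, start=1):
        if rng.random() < p:
            kept.append((i, x))
        while kept and kept[0][0] <= i - tau:
            kept.popleft()
    if len(kept) < r:                               # low-probability failure case
        raise RuntimeError("fewer than r elements of the window were remembered")
    return [v for _, v in rng.sample(list(kept), r)]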

bibliography
G. Cormode and M. Garofalakis. Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB, 2005.
C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Evolving Data Streams. VLDB, 2003.
Chunyu Yang and Jie Zhou. HClustream: A Novel Approach for Clustering Evolving Heterogeneous Data Streams. ICDMW, 2006.
Kok-Leong Ong, Wenyuan Li, Wee-Keong Ng, and Ee-Peng Lim. SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes. Technical report, 2004.
J. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Softw., 11(1):37-57, 1985.
B. Babcock, M. Datar, and R. Motwani. Sampling from a Moving Window over Streaming Data. ACM.