Data stream summaries
1 data stream summaries. Fabrice CLEROT, Orange Labs
2 data streams, why bother?
3 massive data is the talk of the town...
4 data streams, why bother? data streams have always been there! (state-of-the-art, circa 2000) [diagram: service platform; production database, 1 GB/day; business-oriented data mart, 10 GB; data warehouse, TB]
5 data streams, why bother? it's growing? so what? (state-of-the-art, circa 2010) [diagram: service platform; production database, 10 GB/day; business-oriented data mart, 10 GB; aggregate computation; data warehouse, TB (more data, for longer time periods)]
6 data streams, why bother?
- Moore's law (cpu) and Kryder's law (storage) have roughly the same exponent: performance per unit cost doubles every 18 months
- so, indeed, why bother?
- blindingly obvious: most exact data processing operations scale worse than O(n), often very much worse, as O(n²); even sorting scales as O(n log n)
- petabytes irresistibly crawl their way to defeat teraflops!
7 data streams, why bother? "gigantic" data warehouses become a computational blocking point [diagram: service platform; production database, 10 GB/day; business-oriented data mart, 10 GB; aggregate computation; data warehouse, TB (more data, for longer time periods)]
8 data streams, why bother? this is why! [diagram: service platform; on-line approximate aggregate computation; production database, 10 GB/day; business-oriented data mart, 10 GB; off-line exact aggregate computation; data warehouse, TB (more data, for longer time periods)]
9 ... but small data also matters!
10 (almost) brainless chatters, otherwise known as "(remote) sensors"
- potentially thousands of them in an ad hoc communication network, millions to come ("digital dust")
- very limited power autonomy: communication and processing drastically reduce the lifetime of the sensors
- processing data at the stream level is a way to reduce the cost of communication and processing
- for instance, compute the mean of a measurement as data are sent to a base station through the ad hoc network, instead of transmitting raw data to the base station
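The in-network mean mentioned on this slide can be sketched as a mergeable summary: each node forwards only a count and a sum, and any relay can combine the partial summaries it receives. This is a minimal sketch; the names `RunningMean` and `aggregate` are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Mergeable summary of a measurement stream: a count and a sum."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        # One constant-time step per incoming measurement.
        self.count += 1
        self.total += value

    def merge(self, other: "RunningMean") -> "RunningMean":
        # Combining two partial summaries needs only two additions.
        return RunningMean(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no measurement seen yet")
        return self.total / self.count


def aggregate(readings_per_sensor):
    """Summarise each sensor locally, then merge the summaries at the base station."""
    combined = RunningMean()
    for readings in readings_per_sensor:
        local = RunningMean()
        for value in readings:
            local.update(value)
        combined = combined.merge(local)
    return combined
```

Each sensor then transmits two numbers instead of its raw measurements, whatever the length of its stream.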
11 outline
- data streams vs databases
- summary vs "stream-mining"
- some simple summaries
- summary of the join of two data streams
12-19 data stream vs data base
- mining: data base = volatile queries on persistent data; data stream = persistent queries on volatile data
- processing: data base = centralized; data stream = distributed (infinite streams, limited CPU, limited memory, limited bw, cf Cormode)
- management: data base = load new data for existing queries; data stream = keep existing data for new queries (generic summary)
20 "generic summary" vs "stream-mining"
- stream-mining: persistent queries, valid for the whole duration of the stream
- the result of the query is (in general) a new data stream
- if the application allows all useful queries to be defined in advance, this formalism answers all the application requirements
- what about a new query on past data? data are volatile and cannot be accessed anymore; no pre-defined query has been set, so no answer can be retrieved
- a generic summary aims at allowing the (approximate) execution of such queries, with confidence bounds on the result
21 "generic summary" vs "stream-mining"
- and what about a persistent query on two sliding windows, one on the immediate history ("present window") and one on the distant past ("past window")?
- warning: this is a persistent query on past data
- example: what is the present average invoice of the 10% best clients of the same period last year?
22 generic summary and density estimation
- a data stream is a distribution on T × D, where T is the description space for time and D the description space of events, typically (address, value)
- a generic summary can be seen as a density estimation problem on T × D under memory, cpu and bw constraints
23 generic summary and density estimation
- "ad hoc" approaches, however, treat time and events separately
- CluStream (Aggarwal): mixture of Gaussians for the density estimation of the event space, limited to numerical descriptions of events; logarithmic "snapshot" structure for time
- HClustream (Yang) and SCLOPE (Ong) claim to extend CluStream to symbolic data by keeping the frequencies of the modalities per Gaussian; this clearly does not scale to a large address space (for instance)
24 generic summary and sampling
- downgrade density estimation to sampling
- generic summary: keep a representative sample of the stream events
- constraints:
  - limited memory: bound on the summary size
  - limited cpu: bound on the per-event processing
- distributed processing:
  - limited bw: bound on the communication requirements
  - limited cpu: bound on the computational cost of getting the global summary from the local summaries
25 some simple summaries
- uniform sampling of a stream
- weighted sampling of a stream
- uniform sampling from a sliding window on a stream
- weighted sampling from a sliding window on a stream
26 (over-)simplified stream model
- infinite sequence of events e
- e.time is an integer
- e.data is a description of the event
- perfect observation
- perfect time-ordering
27 uniform sampling of a stream
- at any time, the probability for an event to be in the sample is uniform with respect to the complete history of the stream
- sample size r
- at time t, the probability for event e (with e.time ≤ t) to be in the sample is r/t
- time evolution of the sample: res(t)
- condition for a new event to be sampled?
- condition for an event in the sample to be excluded?
28-30 reservoir sampling
- [figure: the reservoir res(t) between times t and t+1; a new event e enters with probability Pin, an event already in the reservoir leaves with probability Pout]
- Pin = r/(t+1)
- for every event e in the reservoir (e.time < t+1): at time t, P(e, t) = r/t; at time t+1, P(e, t+1) = r/(t+1)
- and P(e, t+1) = P(e, t) * [(1 - Pin) + Pin * (1 - Pout(e))]
- hence Pout = 1/r
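The value of Pout follows from the two conditions on this slide. A worked version of the algebra, using only the slide's own symbols and substituting $P_{in} = r/(t+1)$ in the last step:

```latex
\begin{align*}
P(e,t+1) &= P(e,t)\,\bigl[(1-P_{in}) + P_{in}\,(1-P_{out})\bigr] \\
\frac{r}{t+1} &= \frac{r}{t}\,\bigl[1 - P_{in}\,P_{out}\bigr] \\
P_{in}\,P_{out} &= 1 - \frac{t}{t+1} = \frac{1}{t+1} \\
P_{out} &= \frac{1}{(t+1)\,P_{in}} = \frac{1}{r}
\end{align*}
```

Pout does not depend on e, so the excluded event is chosen uniformly among the r events of the reservoir.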
31 reservoir sampling (Vitter)
- for t < r+1, place all events into the reservoir
- for t > r, place a new event in the reservoir with probability r/t
- if a new event is sampled, exclude one event from the reservoir at random
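The three rules above translate almost line for line into code. This is a minimal sketch of the naive version (one random draw per event, plus one per insertion); the function name `reservoir_sample` and the optional `rng` argument are illustrative choices, not part of the slides.

```python
import random


def reservoir_sample(stream, r, rng=None):
    """Keep a uniform sample of size r over everything seen so far."""
    rng = rng or random.Random()
    reservoir = []
    for t, event in enumerate(stream, start=1):   # t = 1, 2, 3, ...
        if t <= r:
            # for t < r+1, place all events into the reservoir
            reservoir.append(event)
        elif rng.random() < r / t:
            # for t > r, the new event enters with probability r/t
            # and evicts one reservoir slot chosen uniformly (Pout = 1/r)
            reservoir[rng.randrange(r)] = event
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 10)` keeps 10 values, whatever the length of the input.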
32-34 reservoir sampling [figure: animation of the update step. From t to t+1, the new event e enters res(t) with Pin = r/(t+1) and an event leaves with Pout = 1/r, giving res(t+1); from t+1 to t+2, the same step repeats with Pin = r/(t+2)]
35 what about the constraints?
- limited memory: the size of the reservoir is set a priori
- limited cpu: the naive implementation needs one random draw per event, plus another in case of success
- in fact, a single random draw per event is enough
- and much less is possible: after an insertion, draw the time of the next insertion
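Drawing the time of the next insertion can be sketched by inverting the survival probability: after event t, the probability that no insertion happens up to time s is the product of (1 - r/u) for u from t+1 to s. The sketch below walks that product until it drops below a single uniform draw. This is one possible realisation, not necessarily the one the author had in mind; `next_insertion_time` and `reservoir_sample_skip` are hypothetical names.

```python
import random


def next_insertion_time(t, r, rng):
    """Time of the next insertion, given that event t (t >= r) was just processed."""
    v = 1.0 - rng.random()        # uniform on (0, 1]
    s = t + 1
    survive = 1.0 - r / s         # probability that event s is NOT inserted
    while survive > v:            # no insertion up to s: move on to s + 1
        s += 1
        survive *= 1.0 - r / s
    return s


def reservoir_sample_skip(stream, r, rng=None):
    """Reservoir sampling with random draws only at insertion times."""
    rng = rng or random.Random()
    reservoir = []
    next_in = None
    for t, event in enumerate(stream, start=1):
        if t <= r:
            reservoir.append(event)
            if t == r:
                next_in = next_insertion_time(t, r, rng)
        elif t == next_in:
            reservoir[rng.randrange(r)] = event
            next_in = next_insertion_time(t, r, rng)
    return reservoir
```

Skipped events cost no random draw here, only the arithmetic inside `next_insertion_time`; each insertion costs two draws (the next time and the evicted slot).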
36 what about the constraints?
- distributable: summary of a set of streams φ_i, with local summaries of the same size
- res(MUX_i φ_i, t) = ∪_i res(φ_i, t): the summary of the multiplex is just the union of the summaries of the original streams
- either transmit the local summaries to a collector when required, or transmit the local updates to a collector
- limited bw to the collector
- no communication between streams
37 weighted sampling of a stream
- a weight is associated with every event: e.weight
- at any time t, the probability for a past event (e.time < t+1) to be in the sample is its relative weight with respect to the total weight of the past
- Prob(e in res(t)) = e.weight / Σ_{f.time < t+1} f.weight
- time evolution of the sample: res(t)
- condition for a new event to be sampled?
- condition for an event in the sample to be excluded?
38-39 weighted reservoir sampling
- [figure: the reservoir res(t) between times t and t+1, with the new event e]
- Pin = r * e.weight / Σ f.weight
- Pout = 1/r
- W(t) = Σ_{f.time < t+1} f.weight and W(t+1) = W(t) + e.weight
- interpretation: "1 event with weight 10 is equivalent to 10 identical events with weight 1 arriving at the same time"
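A minimal sketch of the weighted update follows. It rests on three assumptions that the slides do not state: the sum in Pin is read as W(t+1), the running total that already includes the new event (so that unit weights give back r/(t+1)); Pin is clamped to 1 when a single heavy event would push it above 1; and the reservoir is filled with the first r events as in the unweighted case. The name `weighted_reservoir_sample` is illustrative.

```python
import random


def weighted_reservoir_sample(stream, r, rng=None):
    """stream yields (data, weight) pairs with weight > 0.

    Returns the reservoir and the total weight W seen so far.
    """
    rng = rng or random.Random()
    reservoir = []
    total_weight = 0.0                      # W(t)
    for data, weight in stream:
        total_weight += weight              # W(t+1) = W(t) + e.weight
        if len(reservoir) < r:
            reservoir.append(data)          # assumption: initial fill phase
            continue
        # assumption: Pin = r * e.weight / W(t+1), clamped to 1
        p_in = min(1.0, r * weight / total_weight)
        if rng.random() < p_in:
            reservoir[rng.randrange(r)] = data   # Pout = 1/r
    return reservoir, total_weight
```

The total weight is returned with the sample because it is needed later, when several local summaries have to be combined.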
40 what about the constraints?
- limited memory: the size of the reservoir is set a priori
- limited cpu: the naive implementation needs one random draw per event, plus another in case of success
- in fact, a single random draw per event is enough
- but one cannot do less: future weights are unknown, so the next insertion time cannot be drawn
41 what about our constraints?
- distributable: summary of a set of streams φ_i, with local summaries of the same size
- the summary of the multiplex is the union of the summaries of the original streams, resampled according to the respective weights of the streams
- n local reservoirs of size r produce a global reservoir of size less than n*r: the size of the sample is bounded but unknown in advance
- either transmit the local summaries to a collector when required, or transmit the local updates to a collector
- limited bw to the collector
- no communication between streams
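The slide does not spell out the resampling rule. One plausible reading, sketched below as an assumption, is to thin every local reservoir down to the sampling rate per unit of weight of the heaviest stream: a stream of total weight W_i keeps each of its sampled events with probability W_i / W_max. The function name `merge_weighted_reservoirs` and the `(reservoir, total_weight)` input format are hypothetical.

```python
import random


def merge_weighted_reservoirs(local, rng=None):
    """local: list of (reservoir, total_weight) pairs, one per stream.

    Assumes all local reservoirs have the same nominal size r.
    """
    rng = rng or random.Random()
    if not local:
        return []
    heaviest = max(weight for _, weight in local)
    merged = []
    for reservoir, weight in local:
        # A lighter stream was sampled at a higher rate per unit of weight,
        # so it is thinned; the heaviest stream is kept whole (keep == 1.0).
        keep = weight / heaviest
        merged.extend(event for event in reservoir if rng.random() < keep)
    return merged
```

Under this reading the merged sample has at most n*r elements and its exact size is random, which matches the slide's remark that the size is bounded but unknown in advance.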
42 uniform sampling from a sliding window on a stream
- at any time t, the probability for an event of [t-τ, t] to be in the sample is uniform on [t-τ, t]
- zero probability for events in [0, t-τ-1]
- sample size r (< τ)
- time evolution of the sample: res(t)
- condition for a new event to be sampled?
- condition for an event in the sample to be excluded?
- how to deal with "expiring" events of the sample? at time t, an event e in res(t-1) with e.time = t-τ-1 "expires"
43 why reservoir sampling doesn't work with sliding windows
- suppose an element in the reservoir expires
- we need to replace it with a randomly chosen element from the current window
- however, in the data stream model we have no access to past data, so we cannot sample from the "current" window: it's gone!
- we could store the entire window, but this would require O(τ) memory
44 chain-sample (Babcock)
- initialisation: reservoir sampling on the first window
- include each new element in the sample with probability 1/min(t, τ)
- as each element is added to the sample, choose the index of the element that will replace it when it expires
- when the t-th element expires, the window will be (t+1 ... t+τ), so choose the index from this range
- once the element with that index arrives, store it and choose the index that will replace it in turn, building a chain of potential replacements
- when an element is chosen to be discarded from the sample, discard its chain as well
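The chain mechanism can be sketched for a sample of size r = 1 as follows. The window is taken to be the last τ elements, matching the "(t+1 ... t+τ)" range on the slide; the class name `ChainSampler` and its attributes are hypothetical. A sample of size r > 1 could be obtained by running r independent copies, which is an assumption here rather than something the slide states.

```python
import random
from collections import deque


class ChainSampler:
    """Chain-sample for one element over a window of the last tau elements."""

    def __init__(self, tau, rng=None):
        self.tau = tau
        self.rng = rng or random.Random()
        self.t = 0
        self.chain = deque()    # (index, data); chain[0] is the current sample
        self.awaited = None     # index of the successor of the last chain element

    def _pick_successor(self):
        # When element t expires the window is (t+1 ... t+tau).
        self.awaited = self.rng.randint(self.t + 1, self.t + self.tau)

    def add(self, data):
        self.t += 1
        if self.rng.random() < 1.0 / min(self.t, self.tau):
            # The new element becomes the sample; the old chain is discarded.
            self.chain.clear()
            self.chain.append((self.t, data))
            self._pick_successor()
        elif self.t == self.awaited:
            # The awaited successor arrived: store it and extend the chain.
            self.chain.append((self.t, data))
            self._pick_successor()
        # Drop the sample once it has left the window; its successor takes over.
        while self.chain and self.chain[0][0] <= self.t - self.tau:
            self.chain.popleft()

    @property
    def sample(self):
        return self.chain[0][1] if self.chain else None
```

A successor index is always chosen inside the window that will be current when its predecessor expires, so the chain already holds a replacement at that moment.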
45 example: r = 1 and τ = 10 [figure: a stream of values with a sliding window] 3 enters the sample: choose the position of its successor (will be 9); 9 enters the window: choose the position of its successor (will be 7); 3 expires: include its successor, 9, in the sample
46 what about the constraints?
- limited memory: the size of the reservoir is set a priori, and the number of pointers to successors can be bounded: mean length of a chain = e
- limited cpu: e random draws per event on average
47 what about the constraints?
- distributable: summary of a set of streams φ_i, with local summaries of the same size
- res(MUX_i φ_i, t) = ∪_i res(φ_i, t): the summary of the multiplex is just the union of the summaries of the original streams
- either transmit the local summaries to a collector when required, or transmit the local updates to a collector
- limited bw to the collector
- no communication between streams
48 weighted sampling from a sliding window on a stream
- a weight is associated with each event e: e.weight
- at any time t, the probability for an event of [t-τ, t] to be in the sample is its relative weight with respect to the total weight on [t-τ, t]
- zero probability for events in [0, t-τ-1]
- Prob(e in res(t)) = e.weight / Σ_{t-τ-1 < f.time < t+1} f.weight
- sample size r (< τ)
- time evolution of the sample: res(t)
- condition for a new event to be sampled?
- condition for an event in the sample to be excluded?
- how to deal with "expiring" events of the sample? at time t, an event e in res(t-1) with e.time = t-τ-1 "expires"
49 weighted chain sampling open problem... we do not know the future weights, so we cannot choose the position of the successors
50 another simple approach: oversample and select
- goal: a uniform random sample from a sliding window
- 1. as each element arrives, remember it with probability p = c (r/τ) log τ; otherwise discard it
- 2. discard elements when they expire
- 3. when asked to produce a sample, choose r elements at random from the set in memory
- expected memory usage is O(r log τ), and the algorithm uses O(r log τ) memory whp
- the algorithm can fail if fewer than r elements from a window are remembered; however, whp this will not happen
- adaptation to the weighted case is obvious: weighted random sampling in step 1 (if weights are known in advance) or in step 3 (otherwise: for instance, application-dependent weights)
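The three steps above can be sketched for the uniform case as follows. The constant `c = 2.0`, the use of the natural logarithm, the clamping of p to 1 and the class name `OversampleWindow` are assumptions made for illustration; the slide only gives the form of p. The window is taken to be the last τ elements.

```python
import math
import random
from collections import deque


class OversampleWindow:
    """Oversample-and-select over a sliding window of the last tau elements."""

    def __init__(self, r, tau, c=2.0, rng=None):
        self.r = r
        self.tau = tau
        self.rng = rng or random.Random()
        # Step 1 probability: p = c * (r / tau) * log(tau), clamped to 1.
        self.p = min(1.0, c * r * math.log(tau) / tau)
        self.t = 0
        self.kept = deque()     # (time, data), oldest first

    def add(self, data):
        self.t += 1
        # 1. remember the element with probability p, otherwise discard it
        if self.rng.random() < self.p:
            self.kept.append((self.t, data))
        # 2. discard remembered elements once they expire
        while self.kept and self.kept[0][0] <= self.t - self.tau:
            self.kept.popleft()

    def sample(self):
        # 3. choose r elements at random from the set in memory
        if len(self.kept) < self.r:
            raise RuntimeError("fewer than r elements remembered for this window")
        return [data for _, data in self.rng.sample(list(self.kept), self.r)]
```

The `RuntimeError` corresponds to the failure case mentioned on the slide; a larger `c` makes it less likely at the price of more memory.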
51 bibliography
- G. Cormode, M. Garofalakis. Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB, 2005.
- C.C. Aggarwal, J. Han, J. Wang, P.S. Yu. A framework for clustering evolving data streams. VLDB.
- Chunyu Yang, Jie Zhou. HClustream: a novel approach for clustering evolving heterogeneous data streams. ICDMW, 2006.
- Kok Leong Ong, Wenyuan Li, Wee Keong Ng, Ee Peng Lim. SCLOPE: an algorithm for clustering data streams of categorical attributes. Technical report.
- J. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37-57.
- B. Babcock, M. Datar, R. Motwani. Sampling from a moving window over streaming data. ACM
