Sublinear Algorithms for Big Data. Part 4: Random Topics

Size: px
Start display at page:

Download "Sublinear Algorithms for Big Data. Part 4: Random Topics"

Transcription

1 Sublinear Algorithms for Big Data Part : Random Topics Qin Zhang 1-1

2 Topic 3: Random sampling in distributed data streams (based on a paper with ormode, Muthukrishnan and Yi, PODS 10, JAM 12) 2-1

3 Distributed streaming Motivated by database/networking applications Adaptive filters [Olston, Jiang, Widom, SIGMOD 03] A generic geometric approach [Scharfman et al. SIGMOD 06] Prediction models [ormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] sensor networks network monitoring 3-1 environment monitoring cloud computing

4 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample -1

5 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample When the i-th item arrives With probability s/i, use it to replace an item in the current sample chosen uniformly at ranfom With probability 1 s/i, throw it away -2

6 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S S 1 time

7 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k Tracking i approximately? Sampling won t be uniform S 3 S S 1 time

8 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S 2 Tracking i approximately? Sampling won t be uniform Key observation: We don t have to know the size of the population in order to sample! 5-3 S 1 time

9 Basic idea: binary Bernoulli sampling 6-1

10 Basic idea: binary Bernoulli sampling

11 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items 6-3

12 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items The could maintain a Bernoulli sample of size between s and O(s) 6-

13 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 sites S 1 S 2 S 3 Sites send in every item w.pr. 2 i In epoch i: S k 17-1

14 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 17-2

15 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample 17-3

16 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i 17-

17 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i (2): Always s items are maintained in 17-5

18 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S 18-1

19 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

20 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

21 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

22 A running example upper lower 1 2 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

23 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

24 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

25 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

26 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

27 A running example upper lower 1 5 Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

28 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S

29 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) 19-2

30 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

31 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

32 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard) 19-5

33 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

34 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

35 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

36 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

37 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally sites S 1 S 2 S 3 S (discard) (discard)

38 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally Intuition: maintain a sample prob. at each site p s/n (n: total # items) without knowing n. sites S 1 S 2 S 3 S (discard) (discard)

39 Random sampling Analysis Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Splitsthelowersampleintoanewlowersampleandanuppersample Analysis: (Msgs sent) Messages sent per epoch O(k +s) # epochs O(logn) = O((k +s)logn) 21-1

40 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. 22-1

41 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. Experiments on the real data set from 1998 world cup logs. log 2 (cost) theory 15 log 2 (cost) cost 10 3 practice log 2 s Basic case cost VS sample size n = ,k = log 2 s Time-based sliding cost VS sample size n = ,k = ( log2 Time-based sliding cost VS window size s = 128,k = 128 ) w

42 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. log 2 (cost) theory 15 practice Experiments on the real data set from 1998 world cup logs log 2 s Basic case cost VS sample size n = ,k = 128 total # items n = 7,000,000 # items sent,000 size of sample s = 128 # sites k =

43 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window t 10-1

44 Sampling from a (time-based) sliding window sliding window t expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR 10-2

45 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. t 10-3

46 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... t 10-

47 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... Solution: In the frozen window, find a good sample rate such that the sample size s. t 10-5

48 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication 11-1

49 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) 11-2

50 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) Guaranteed: There is a blue window with s sampled items that covers the unexpired portion of the frozen window 11-3

51 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item 12-1

52 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item When the current window freezes For each level, do a k-way merge to build the level of the global structure at the. Total communication O((k + s) log w) 12-2

Optimal Tracking of Distributed Heavy Hitters and Quantiles

Optimal Tracking of Distributed Heavy Hitters and Quantiles Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi HKUST yike@cse.ust.hk Qin Zhang MADALGO Aarhus University qinzhang@cs.au.dk Abstract We consider the the problem of tracking heavy hitters

More information

LECTURE 16. Readings: Section 5.1. Lecture outline. Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process

LECTURE 16. Readings: Section 5.1. Lecture outline. Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process LECTURE 16 Readings: Section 5.1 Lecture outline Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process Number of successes Distribution of interarrival times The

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

More information

Sublinear Algorithms for Big Data. Part 4: Random Topics

Sublinear Algorithms for Big Data. Part 4: Random Topics Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 1-1 2-1 Topic 1: Compressive sensing Compressive sensing The model (Candes-Romberg-Tao 04; Donoho 04) Applicaitons Medical imaging reconstruction

More information

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

More information

résumé de flux de données

résumé de flux de données résumé de flux de données CLEROT Fabrice fabrice.clerot@orange-ftgroup.com Orange Labs data streams, why bother? massive data is the talk of the town... data streams, why bother? service platform production

More information

Zabin Visram Room CS115 CS126 Searching. Binary Search

Zabin Visram Room CS115 CS126 Searching. Binary Search Zabin Visram Room CS115 CS126 Searching Binary Search Binary Search Sequential search is not efficient for large lists as it searches half the list, on average Another search algorithm Binary search Very

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of

More information

Point Biserial Correlation Tests

Point Biserial Correlation Tests Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable

More information

Efficient Fault-Tolerant Infrastructure for Cloud Computing

Efficient Fault-Tolerant Infrastructure for Cloud Computing Efficient Fault-Tolerant Infrastructure for Cloud Computing Xueyuan Su Candidate for Ph.D. in Computer Science, Yale University December 2013 Committee Michael J. Fischer (advisor) Dana Angluin James Aspnes

More information

4. MAC protocols and LANs

4. MAC protocols and LANs 4. MAC protocols and LANs 1 Outline MAC protocols and sublayers, LANs: Ethernet, Token ring and Token bus Logic Link Control (LLC) sublayer protocol Bridges: transparent (spanning tree), source routing

More information

Fundamentals of Analyzing and Mining Data Streams

Fundamentals of Analyzing and Mining Data Streams Fundamentals of Analyzing and Mining Data Streams Graham Cormode graham@research.att.com Outline 1. Streaming summaries, sketches and samples Motivating examples, applications and models Random sampling:

More information

Data Mining on Streams

Data Mining on Streams Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs

More information

Multi-Way Search Trees (B Trees)

Multi-Way Search Trees (B Trees) Multi-Way Search Trees (B Trees) Multiway Search Trees An m-way search tree is a tree in which, for some integer m called the order of the tree, each node has at most m children. If n

More information

9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1

9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1 9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1 Seffi Naor Computer Science Dept. Technion Haifa, Israel Introduction

More information

The Selection Problem

The Selection Problem The Selection Problem The selection problem: given an integer k and a list x 1,..., x n of n elements find the k-th smallest element in the list Example: the 3rd smallest element of the following list

More information

Sampling Central Limit Theorem Proportions. Outline. 1 Sampling. 2 Central Limit Theorem. 3 Proportions

Sampling Central Limit Theorem Proportions. Outline. 1 Sampling. 2 Central Limit Theorem. 3 Proportions Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Populations and samples When we use statistics, we are trying to find out information about

More information

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

More information

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It

More information

Pulling a Random Sample from a MAXQDA Dataset

Pulling a Random Sample from a MAXQDA Dataset In this guide you will learn how to pull a random sample from a MAXQDA dataset, using the random cell function in Excel. In this process you will learn how to export and re-import variables from MAXQDA.

More information

1: Meditech Information Systems On-line Help 2: Links to Micromedex 3: Print page view 4: Lock Session: temporarily minimize the session and lock it

1: Meditech Information Systems On-line Help 2: Links to Micromedex 3: Print page view 4: Lock Session: temporarily minimize the session and lock it MEDITECH 6.0 Training Manual Training Manual Outline Shortcuts Main Menu Scheduling Desktop Header /F Footer Bars Basic Scheduling Set Scheduling Medical Necessity / Prior Authorization Pending Appointments

More information

A. OPENING POINT CLOUDS. (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor)

A. OPENING POINT CLOUDS. (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor) MeshLAB tutorial 1 A. OPENING POINT CLOUDS (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor) 2 OPENING POINT CLOUDS IN NOTEPAD ++ Let us understand

More information

Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation. Dr. Thomas Jensen Expedia.com Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

More information

ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE

ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE GETTING STARTED WITH ADOBE ACROBAT CONNECT PRO MOBILE FOR IPHONE AND IPOD TOUCH Overview Attend Acrobat Connect Pro meetings using your iphone

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

USING OUTLOOK WEB ACCESS

USING OUTLOOK WEB ACCESS USING OUTLOOK WEB ACCESS 17 March 2009, Version 1.0 WHAT IS OUTLOOK WEB ACCESS? Outlook Web Access (OWA) is a webmail service of Microsoft Exchange Server. The web interface of Outlook Web Access resembles

More information

RANDOMIZED ALGORITHMS

RANDOMIZED ALGORITHMS RANDOMIZED ALGORITHMS CONCEPTS Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input Random algorithms may execute different series of

More information

Excel - Beginner Documentation. Table of Contents

Excel - Beginner Documentation. Table of Contents The Center for Teaching Excellence Table of Contents Basics... 1 What is a spreadsheet... 1 Why on a computer... 1 Basics of a spreadsheet... 1 Types of data... 1 Formatting... 1 Organize content in rows

More information

MATH 140 Lab 4: Probability and the Standard Normal Distribution

MATH 140 Lab 4: Probability and the Standard Normal Distribution MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes

More information

ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC

ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC The following guide details how to enter Utility Allowance information for projects

More information

Compact Summaries for Large Datasets

Compact Summaries for Large Datasets Compact Summaries for Large Datasets Big Data Graham Cormode University of Warwick G.Cormode@Warwick.ac.uk The case for Big Data in one slide Big data arises in many forms: Medical data: genetic sequences,

More information

Contents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables

Contents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics Yuming Jiang 1 Some figures taken from the web. Contents Basic Concepts Random Variables Discrete Random Variables Continuous Random

More information

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1 B561 Advanced Database Concepts 0 Introduction Qin Zhang 1-1 Self introduction: my research interests 2-1 Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; data structures;

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Approximate Counts and Quantiles over Sliding Windows

Approximate Counts and Quantiles over Sliding Windows Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu ABSTRACT We consider the problem

More information

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics Motivation Visual Data Mining Visualization for Data Mining Huge amounts of information Limited display capacity of output devices Chidroop Madhavarapu CSE 591:Visual Analytics Visual Data Mining (VDM)

More information

Lecture 13: Validation

Lecture 13: Validation Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams

An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams Susan D. Urban 1, Suzanne W. Dietrich 1, 2, and Yi Chen 1 Arizona

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Distributed Data Management Summer Semester 2013 TU Kaiserslautern

Distributed Data Management Summer Semester 2013 TU Kaiserslautern Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.- Ing. Sebas4an Michel smichel@mmci.uni- saarland.de 1 Lecture 8+ (DISTRIBUTED) DATA STREAM PROCESSING (INTRODUCTION) 2 So Far: Databases/NoSQL

More information

SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING

SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING CPS234: Computational Geometry Prof. Pankaj Agarwal Prepared by: Khalid Al Issa (khalid@cs.duke.edu) Fall 2005 BACKGROUND : RANGE SEARCHING, a

More information

Continuous Monitoring of Top-k Queries over Sliding Windows

Continuous Monitoring of Top-k Queries over Sliding Windows Continuous Monitoring of Top-k Queries over Sliding Windows Kyriakos Mouratidis 1 Spiridon Bakiras 2 Dimitris Papadias 2 1 School of Information Systems Singapore Management University 8 Stamford Road,

More information

MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

More information

B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data. 2 Clustering B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

More information

Streaming Algorithms

Streaming Algorithms 3 Streaming Algorithms Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 Some Admin: Deadline of Problem Set 1 is 23:59, May 14 (today)! Students are divided into two groups

More information

Page Replacement Strategies. Jay Kothari Maxim Shevertalov CS 370: Operating Systems (Summer 2008)

Page Replacement Strategies. Jay Kothari Maxim Shevertalov CS 370: Operating Systems (Summer 2008) Page Replacement Strategies Jay Kothari (jayk@drexel.edu) Maxim Shevertalov (max@drexel.edu) CS 370: Operating Systems (Summer 2008) Page Replacement Policies Why do we care about Replacement Policy? Replacement

More information

CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting

CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting CSC148 Lecture 8 Algorithm Analysis Binary Search Sorting Algorithm Analysis Recall definition of Big Oh: We say a function f(n) is O(g(n)) if there exists positive constants c,b such that f(n)

More information

Sampling Based Range Partition Methods for Big Data Analytics

Sampling Based Range Partition Methods for Big Data Analytics Milan Vojnović Microsoft Research milanv@microsoft.com Sampling Based Range Partition Methods for Big Data Analytics Fei Xu Microsoft Corporation feixu@microsoft.com March 202 Technical Report MSR-TR-202-8

More information

Fact Sheet In-Memory Analysis

Fact Sheet In-Memory Analysis Fact Sheet In-Memory Analysis 1 Copyright Yellowfin International 2010 Contents In Memory Overview...3 Benefits...3 Agile development & rapid delivery...3 Data types supported by the In-Memory Database...4

More information

Analytics Edge Project: Predicting the outcome of the 2014 World Cup

Analytics Edge Project: Predicting the outcome of the 2014 World Cup Analytics Edge Project: Predicting the outcome of the 2014 World Cup Virgile Galle and Ludovica Rizzo June 27, 2014 Contents 1 Project Presentation: the Soccer World Cup 2 2 Data Set 2 3 Models to predict

More information

Continuous Matrix Approximation on Distributed Data

Continuous Matrix Approximation on Distributed Data Continuous Matrix Approximation on Distributed Data Mina Ghashami School of Computing University of Utah ghashami@cs.uah.edu Jeff M. Phillips School of Computing University of Utah jeffp@cs.uah.edu Feifei

More information

City of De Pere. Halogen How To Guide

City of De Pere. Halogen How To Guide City of De Pere Halogen How To Guide Page1 (revised 12/14/2015) Halogen Performance Management website address: https://global.hgncloud.com/cityofdepere/welcome.jsp The following steps take place to complete

More information

To Download Library Books Onto Your KINDLE FIRE

To Download Library Books Onto Your KINDLE FIRE Using 3M CLOUD LIBRARY 0BOFG JJ To Download Library Books Onto Your KINDLE FIRE To borrow an e-book from the Ocean County Library system, you need to have a valid Ocean County Library card (less than $25

More information

System Monitoring and Reporting

System Monitoring and Reporting This chapter contains the following sections: Dashboard, page 1 Summary, page 2 Inventory Management, page 3 Resource Pools, page 4 Clusters, page 4 Images, page 4 Host Nodes, page 6 Virtual Machines (VMs),

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

Log Wealth In each period, algorithm A performs 85% as well as algorithm B

Log Wealth In each period, algorithm A performs 85% as well as algorithm B 1 Online Investing Portfolio Selection Consider n stocks Our distribution of wealth is some vector b e.g. (1/3, 1/3, 1/3) At end of one period, we get a vector of price relatives x e.g. (0.98, 1.02, 1.00)

More information

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property CmSc 250 Intro to Algorithms Chapter 6. Transform and Conquer Binary Heaps 1. Definition A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called

More information

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1 B561 Advanced Database Concepts 0 Introduction Qin Zhang 1-1 Self introduction: my research interests Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; I/O-efficient

More information

Algorithms. Theresa Migler-VonDollen CMPS 5P

Algorithms. Theresa Migler-VonDollen CMPS 5P Algorithms Theresa Migler-VonDollen CMPS 5P 1 / 32 Algorithms Write a Python function that accepts a list of numbers and a number, x. If x is in the list, the function returns the position in the list

More information

Frontier Tandem. Administrator User Guide. Version 2.4 January 28, 2013

Frontier Tandem. Administrator User Guide. Version 2.4 January 28, 2013 Frontier Tandem Administrator User Guide Version 2.4 January 28, 2013 About This Document 1 Version 7.3 Jan 28, 2013 Frontier Tandem Administrator Guide CONFIDENTIAL About This Document The Frontier Small

More information

K-Cover of Binary sequence

K-Cover of Binary sequence K-Cover of Binary sequence Prof Amit Kumar Prof Smruti Sarangi Sahil Aggarwal Swapnil Jain Given a binary sequence, represent all the 1 s in it using at most k- covers, minimizing the total length of all

More information

Binary Heaps. CSE 373 Data Structures

Binary Heaps. CSE 373 Data Structures Binary Heaps CSE Data Structures Readings Chapter Section. Binary Heaps BST implementation of a Priority Queue Worst case (degenerate tree) FindMin, DeleteMin and Insert (k) are all O(n) Best case (completely

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology n (i) m m (ii) n m ( (iii) n n n n (iv) m m Massachusetts Institute of Technology 6.0/6.: Probabilistic Systems Analysis (Quiz Solutions Spring 009) Question Multiple Choice Questions: CLEARLY circle the

More information

Monte Carlo testing with Big Data

Monte Carlo testing with Big Data Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:

More information

Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual

Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual Turbo Trader 2 Quickstart Step #1 System Requirements Step #2 Installation Instructions Step #3 Overview Step #1 System Requirements

More information

6.2 Normal distribution. Standard Normal Distribution:

6.2 Normal distribution. Standard Normal Distribution: 6.2 Normal distribution Slide Heights of Adult Men and Women Slide 2 Area= Mean = µ Standard Deviation = σ Donation: X ~ N(µ,σ 2 ) Standard Normal Distribution: Slide 3 Slide 4 a normal probability distribution

More information

Online Maintenance of Very Large Random Samples

Online Maintenance of Very Large Random Samples Online Maintenance of Very Large Random Samples Christopher Jermaine, Abhijit Pol, Subramanian Arumugam Department of Computer and Information Sciences and Engineering University of Florida Gainesville,

More information

Data Streams A Tutorial

Data Streams A Tutorial Data Streams A Tutorial Nicole Schweikardt Goethe-Universität Frankfurt am Main DEIS 10: GI-Dagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:

More information

Distributed Computing over Communication Networks: Maximal Independent Set

Distributed Computing over Communication Networks: Maximal Independent Set Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.

More information

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

More information

Aloha Guest Manager. What s New in Aloha Guest Manager Release 1.6.394 Table of Contents. Release 1.6.394 Enhancements Guide SERVER MANAGEMENT...

Aloha Guest Manager. What s New in Aloha Guest Manager Release 1.6.394 Table of Contents. Release 1.6.394 Enhancements Guide SERVER MANAGEMENT... Aloha Guest Manager Release 1.6.394 Enhancements Guide What s New in Aloha Guest Manager Release 1.6.394 Table of Contents SERVER MANAGEMENT... 3 Multiple server assignments for the day...3 Employee Search

More information

Student Guide - Student Groups and Adobe Connect in Canvas

Student Guide - Student Groups and Adobe Connect in Canvas Student Guide - Student Groups and Adobe Connect in Canvas Creating an Adobe Connect Conference 1. Use Chrome or Firefox as your browser. Make sure you are on the latest version. 2. Connect your headset

More information

Email Basics Working with Filters

Email Basics Working with Filters Email Basics Working with Filters The following are provided to help keep your Zimbra email and calendar usage consistent, professional, and up-to-date. Working with Email Filters Setting Up Filters Creating

More information

Active Queue Management

Active Queue Management Active Queue Management TELCOM2321 CS2520 Wide Area Networks Dr. Walter Cerroni University of Bologna Italy Visiting Assistant Professor at SIS, Telecom Program Slides partly based on Dr. Znati s material

More information

More on Using Phone Books. Linear Search. Linear Search. Search. 17. Searching and Sorting Examples: LinSearch: The Spec 3/24/2015

More on Using Phone Books. Linear Search. Linear Search. Search. 17. Searching and Sorting Examples: LinSearch: The Spec 3/24/2015 17. Searching and Sorting Examples: Search Topics: Binary Search Measuring Execution Time The Divide and Conquer Framework Merge Sort Is this song in that playlist? Is this number in that phone book? Is

More information

Statistical Methods for Network and Computer Security p.1/43

Statistical Methods for Network and Computer Security p.1/43 Statistical Methods for Network and Computer Security David J. Marchette marchettedj@nswc.navy.mil Naval Surface Warfare Center Code B10 Statistical Methods for Network and Computer Security p.1/43 A Few

More information

SMALL INDEX LARGE INDEX (SILT)

SMALL INDEX LARGE INDEX (SILT) Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song

More information

TCP, Active Queue Management and QoS

TCP, Active Queue Management and QoS TCP, Active Queue Management and QoS Don Towsley UMass Amherst towsley@cs.umass.edu Collaborators: W. Gong, C. Hollot, V. Misra Outline motivation TCP friendliness/fairness bottleneck invariant principle

More information

Business Objects Enterprise version 4.1. Report Viewing

Business Objects Enterprise version 4.1. Report Viewing Business Objects Enterprise version 4.1 Note about Java: With earlier versions, the Java run-time was not needed for report viewing; but was needed for report writing. The default behavior in version 4.1

More information

Random Early Detection Gateways for Congestion Avoidance

Random Early Detection Gateways for Congestion Avoidance Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson Lawrence Berkeley Laboratory University of California floyd@eelblgov van@eelblgov To appear in the August 1993 IEEE/ACM

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin University of New South Wales Sydney, Australia Jian Xu University of New South Wales Sydney, Australia

More information

Query Selectivity Estimation for Uncertain Data

Query Selectivity Estimation for Uncertain Data Query Selectivity Estimation for Uncertain Data Sarvjeet Singh *, Chris Mayfield *, Rahul Shah #, Sunil Prabhakar * and Susanne Hambrusch * * Department of Computer Science, Purdue University # Department

More information

Summary Data Structures for Massive Data

Summary Data Structures for Massive Data Summary Data Structures for Massive Data Graham Cormode Department of Computer Science, University of Warwick graham@cormode.org Abstract. Prompted by the need to compute holistic properties of increasingly

More information

Portal User Guide. Customers. Version 1.1. May 2013 http://www.sharedband.com 1 of 5

Portal User Guide. Customers. Version 1.1. May 2013 http://www.sharedband.com 1 of 5 Portal User Guide Customers Version 1.1 May 2013 http://www.sharedband.com 1 of 5 Table of Contents Introduction... 3 Using the Sharedband Portal... 4 Login... 4 Request password reset... 4 View accounts...

More information

SeeVogh Player manual

SeeVogh Player manual SeeVogh Player manual Current Version as of: (03/28/2014) v.2.0 1 The SeeVogh Player is a simple application, which allows you to playback recordings made during a SeeVogh meeting with the recording function

More information

Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom*

Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Sheldon H. Jacobson Department of Computer Science University of Illinois at Urbana-Champaign shj@illinois.edu

More information

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees. B-trees: Example. primary key indexing

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees. B-trees: Example. primary key indexing Problem Given a large collection of records, Indexing with B-trees find similar/interesting things, i.e., allow fast, approximate queries Anastassia Ailamaki http://www.cs.cmu.edu/~natassa 2 Indexing Primary

More information

Any symbols displayed within these pages are for illustrative purposes only, and are not intended to portray any recommendation.

Any symbols displayed within these pages are for illustrative purposes only, and are not intended to portray any recommendation. mobiletws for Android Users' Guide October 2012 mobiletws for Android Version 4.1.360 2012 Interactive Brokers LLC. All Rights Reserved Any symbols displayed within these pages are for illustrative purposes

More information

1. Install a Virtual Machine... 2. 2. Download Ubuntu Ubuntu 14.04.1 LTS... 2. 3. Create a New Virtual Machine... 2

1. Install a Virtual Machine... 2. 2. Download Ubuntu Ubuntu 14.04.1 LTS... 2. 3. Create a New Virtual Machine... 2 Introduction APPLICATION NOTE The purpose of this document is to explain how to create a Virtual Machine on a Windows PC such that a Linux environment can be created in order to build a Linux kernel and

More information

OUTLOOK WEB APP 2013 ESSENTIAL SKILLS

OUTLOOK WEB APP 2013 ESSENTIAL SKILLS OUTLOOK WEB APP 2013 ESSENTIAL SKILLS CONTENTS Login to engage365 Web site. 2 View the account home page. 2 The Outlook 2013 Window. 3 Interface Features. 3 Creating a new email message. 4 Create an Email

More information

1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.

1. The subnet must prevent additional packets from entering the congested region until those already present can be processed. Congestion Control When one part of the subnet (e.g. one or more routers in an area) becomes overloaded, congestion results. Because routers are receiving packets faster than they can forward them, one

More information

Nessus Cloud User Registration

Nessus Cloud User Registration Nessus Cloud User Registration Create Your Tenable Nessus Cloud Account 1. Click on the provided URL to create your account. If the link does not work, please cut and paste the entire URL into your browser.

More information

ΤΕΙ Κρήτης, Παράρτηµα Χανίων

ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΠΣΕ, Τµήµα Τηλεπικοινωνιών & ικτύων Η/Υ Εργαστήριο ιαδίκτυα & Ενδοδίκτυα Η/Υ Modeling Wide Area Networks (WANs) ρ Θεοδώρου Παύλος Χανιά 2003 8. Modeling Wide Area Networks

More information

TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA. R. P. Suresh Indian Institute of Management Kozhikode India

TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA. R. P. Suresh Indian Institute of Management Kozhikode India TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA R. P. Suresh Indian Institute of Management Kozhikode India Statistics is taught as a core course in the Management programmes leading to the degree of

More information

Using Data-Oblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science

Using Data-Oblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science Using Data-Oblivious Algorithms for Private Cloud Storage Access Michael T. Goodrich Dept. of Computer Science Privacy in the Cloud Alice owns a large data set, which she outsources to an honest-but-curious

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

GravityLab Multimedia Inc. Windows Media Authentication Administration Guide

GravityLab Multimedia Inc. Windows Media Authentication Administration Guide GravityLab Multimedia Inc. Windows Media Authentication Administration Guide Token Auth Menu GravityLab Multimedia supports two types of authentication to accommodate customers with content that requires

More information