# Sublinear Algorithms for Big Data. Part 4: Random Topics

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Sublinear Algorithms for Big Data Part : Random Topics Qin Zhang 1-1

2 Topic 3: Random sampling in distributed data streams (based on a paper with ormode, Muthukrishnan and Yi, PODS 10, JAM 12) 2-1

3 Distributed streaming Motivated by database/networking applications Adaptive filters [Olston, Jiang, Widom, SIGMOD 03] A generic geometric approach [Scharfman et al. SIGMOD 06] Prediction models [ormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] sensor networks network monitoring 3-1 environment monitoring cloud computing

4 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample -1

5 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample When the i-th item arrives With probability s/i, use it to replace an item in the current sample chosen uniformly at ranfom With probability 1 s/i, throw it away -2

6 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S S 1 time

7 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k Tracking i approximately? Sampling won t be uniform S 3 S S 1 time

8 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S 2 Tracking i approximately? Sampling won t be uniform Key observation: We don t have to know the size of the population in order to sample! 5-3 S 1 time

9 Basic idea: binary Bernoulli sampling 6-1

10 Basic idea: binary Bernoulli sampling

11 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items 6-3

12 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items The could maintain a Bernoulli sample of size between s and O(s) 6-

13 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 sites S 1 S 2 S 3 Sites send in every item w.pr. 2 i In epoch i: S k 17-1

14 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 17-2

15 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample 17-3

16 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i 17-

17 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i (2): Always s items are maintained in 17-5

18 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S 18-1

19 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

20 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

21 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

22 A running example upper lower 1 2 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

23 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

24 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

25 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

26 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

27 A running example upper lower 1 5 Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

28 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S

29 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) 19-2

30 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

31 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

32 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard) 19-5

33 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

34 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

35 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

36 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

37 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally sites S 1 S 2 S 3 S (discard) (discard)

38 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally Intuition: maintain a sample prob. at each site p s/n (n: total # items) without knowing n. sites S 1 S 2 S 3 S (discard) (discard)

39 Random sampling Analysis Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Splitsthelowersampleintoanewlowersampleandanuppersample Analysis: (Msgs sent) Messages sent per epoch O(k +s) # epochs O(logn) = O((k +s)logn) 21-1

40 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. 22-1

41 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. Experiments on the real data set from 1998 world cup logs. log 2 (cost) theory 15 log 2 (cost) cost 10 3 practice log 2 s Basic case cost VS sample size n = ,k = log 2 s Time-based sliding cost VS sample size n = ,k = ( log2 Time-based sliding cost VS window size s = 128,k = 128 ) w

42 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. log 2 (cost) theory 15 practice Experiments on the real data set from 1998 world cup logs log 2 s Basic case cost VS sample size n = ,k = 128 total # items n = 7,000,000 # items sent,000 size of sample s = 128 # sites k =

43 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window t 10-1

44 Sampling from a (time-based) sliding window sliding window t expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR 10-2

45 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. t 10-3

46 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... t 10-

47 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... Solution: In the frozen window, find a good sample rate such that the sample size s. t 10-5

48 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication 11-1

49 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) 11-2

50 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) Guaranteed: There is a blue window with s sampled items that covers the unexpired portion of the frozen window 11-3

51 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item 12-1

52 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item When the current window freezes For each level, do a k-way merge to build the level of the global structure at the. Total communication O((k + s) log w) 12-2

### Optimal Tracking of Distributed Heavy Hitters and Quantiles

Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi HKUST yike@cse.ust.hk Qin Zhang MADALGO Aarhus University qinzhang@cs.au.dk Abstract We consider the the problem of tracking heavy hitters

### LECTURE 16. Readings: Section 5.1. Lecture outline. Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process

LECTURE 16 Readings: Section 5.1 Lecture outline Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process Number of successes Distribution of interarrival times The

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

### Sublinear Algorithms for Big Data. Part 4: Random Topics

Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 1-1 2-1 Topic 1: Compressive sensing Compressive sensing The model (Candes-Romberg-Tao 04; Donoho 04) Applicaitons Medical imaging reconstruction

### Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

### résumé de flux de données

résumé de flux de données CLEROT Fabrice fabrice.clerot@orange-ftgroup.com Orange Labs data streams, why bother? massive data is the talk of the town... data streams, why bother? service platform production

### Zabin Visram Room CS115 CS126 Searching. Binary Search

Zabin Visram Room CS115 CS126 Searching Binary Search Binary Search Sequential search is not efficient for large lists as it searches half the list, on average Another search algorithm Binary search Very

### B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of

### Point Biserial Correlation Tests

Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable

### Efficient Fault-Tolerant Infrastructure for Cloud Computing

Efficient Fault-Tolerant Infrastructure for Cloud Computing Xueyuan Su Candidate for Ph.D. in Computer Science, Yale University December 2013 Committee Michael J. Fischer (advisor) Dana Angluin James Aspnes

### 4. MAC protocols and LANs

4. MAC protocols and LANs 1 Outline MAC protocols and sublayers, LANs: Ethernet, Token ring and Token bus Logic Link Control (LLC) sublayer protocol Bridges: transparent (spanning tree), source routing

### Fundamentals of Analyzing and Mining Data Streams

Fundamentals of Analyzing and Mining Data Streams Graham Cormode graham@research.att.com Outline 1. Streaming summaries, sketches and samples Motivating examples, applications and models Random sampling:

### Data Mining on Streams

Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs

### Multi-Way Search Trees (B Trees)

Multi-Way Search Trees (B Trees) Multiway Search Trees An m-way search tree is a tree in which, for some integer m called the order of the tree, each node has at most m children. If n

### 9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1

9th Max-Planck Advanced Course on the Foundations of Computer Science (ADFOCS) Primal-Dual Algorithms for Online Optimization: Lecture 1 Seffi Naor Computer Science Dept. Technion Haifa, Israel Introduction

### The Selection Problem

The Selection Problem The selection problem: given an integer k and a list x 1,..., x n of n elements find the k-th smallest element in the list Example: the 3rd smallest element of the following list

### Sampling Central Limit Theorem Proportions. Outline. 1 Sampling. 2 Central Limit Theorem. 3 Proportions

Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Populations and samples When we use statistics, we are trying to find out information about

### Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

### Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It

### Pulling a Random Sample from a MAXQDA Dataset

In this guide you will learn how to pull a random sample from a MAXQDA dataset, using the random cell function in Excel. In this process you will learn how to export and re-import variables from MAXQDA.

### 1: Meditech Information Systems On-line Help 2: Links to Micromedex 3: Print page view 4: Lock Session: temporarily minimize the session and lock it

MEDITECH 6.0 Training Manual Training Manual Outline Shortcuts Main Menu Scheduling Desktop Header /F Footer Bars Basic Scheduling Set Scheduling Medical Necessity / Prior Authorization Pending Appointments

### A. OPENING POINT CLOUDS. (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor)

MeshLAB tutorial 1 A. OPENING POINT CLOUDS (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor) 2 OPENING POINT CLOUDS IN NOTEPAD ++ Let us understand

### Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

### ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE

ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE GETTING STARTED WITH ADOBE ACROBAT CONNECT PRO MOBILE FOR IPHONE AND IPOD TOUCH Overview Attend Acrobat Connect Pro meetings using your iphone

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### USING OUTLOOK WEB ACCESS

USING OUTLOOK WEB ACCESS 17 March 2009, Version 1.0 WHAT IS OUTLOOK WEB ACCESS? Outlook Web Access (OWA) is a webmail service of Microsoft Exchange Server. The web interface of Outlook Web Access resembles

### RANDOMIZED ALGORITHMS

RANDOMIZED ALGORITHMS CONCEPTS Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input Random algorithms may execute different series of

The Center for Teaching Excellence Table of Contents Basics... 1 What is a spreadsheet... 1 Why on a computer... 1 Basics of a spreadsheet... 1 Types of data... 1 Formatting... 1 Organize content in rows

### MATH 140 Lab 4: Probability and the Standard Normal Distribution

MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes

### ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC

ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC The following guide details how to enter Utility Allowance information for projects

### Compact Summaries for Large Datasets

Compact Summaries for Large Datasets Big Data Graham Cormode University of Warwick G.Cormode@Warwick.ac.uk The case for Big Data in one slide Big data arises in many forms: Medical data: genetic sequences,

### Contents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables

TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics Yuming Jiang 1 Some figures taken from the web. Contents Basic Concepts Random Variables Discrete Random Variables Continuous Random

### B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

B561 Advanced Database Concepts 0 Introduction Qin Zhang 1-1 Self introduction: my research interests 2-1 Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; data structures;

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Approximate Counts and Quantiles over Sliding Windows

Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu ABSTRACT We consider the problem

### Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Motivation Visual Data Mining Visualization for Data Mining Huge amounts of information Limited display capacity of output devices Chidroop Madhavarapu CSE 591:Visual Analytics Visual Data Mining (VDM)

### Lecture 13: Validation

Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

### Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

### An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams

An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams Susan D. Urban 1, Suzanne W. Dietrich 1, 2, and Yi Chen 1 Arizona

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Distributed Data Management Summer Semester 2013 TU Kaiserslautern

Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.- Ing. Sebas4an Michel smichel@mmci.uni- saarland.de 1 Lecture 8+ (DISTRIBUTED) DATA STREAM PROCESSING (INTRODUCTION) 2 So Far: Databases/NoSQL

### SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING

SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING CPS234: Computational Geometry Prof. Pankaj Agarwal Prepared by: Khalid Al Issa (khalid@cs.duke.edu) Fall 2005 BACKGROUND : RANGE SEARCHING, a

### Continuous Monitoring of Top-k Queries over Sliding Windows

Continuous Monitoring of Top-k Queries over Sliding Windows Kyriakos Mouratidis 1 Spiridon Bakiras 2 Dimitris Papadias 2 1 School of Information Systems Singapore Management University 8 Stamford Road,

### MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

### B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

### Streaming Algorithms

3 Streaming Algorithms Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 Some Admin: Deadline of Problem Set 1 is 23:59, May 14 (today)! Students are divided into two groups

### Page Replacement Strategies. Jay Kothari Maxim Shevertalov CS 370: Operating Systems (Summer 2008)

Page Replacement Strategies Jay Kothari (jayk@drexel.edu) Maxim Shevertalov (max@drexel.edu) CS 370: Operating Systems (Summer 2008) Page Replacement Policies Why do we care about Replacement Policy? Replacement

### CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting

CSC148 Lecture 8 Algorithm Analysis Binary Search Sorting Algorithm Analysis Recall definition of Big Oh: We say a function f(n) is O(g(n)) if there exists positive constants c,b such that f(n)

### Sampling Based Range Partition Methods for Big Data Analytics

Milan Vojnović Microsoft Research milanv@microsoft.com Sampling Based Range Partition Methods for Big Data Analytics Fei Xu Microsoft Corporation feixu@microsoft.com March 202 Technical Report MSR-TR-202-8

### Fact Sheet In-Memory Analysis

Fact Sheet In-Memory Analysis 1 Copyright Yellowfin International 2010 Contents In Memory Overview...3 Benefits...3 Agile development & rapid delivery...3 Data types supported by the In-Memory Database...4

### Analytics Edge Project: Predicting the outcome of the 2014 World Cup

Analytics Edge Project: Predicting the outcome of the 2014 World Cup Virgile Galle and Ludovica Rizzo June 27, 2014 Contents 1 Project Presentation: the Soccer World Cup 2 2 Data Set 2 3 Models to predict

### Continuous Matrix Approximation on Distributed Data

Continuous Matrix Approximation on Distributed Data Mina Ghashami School of Computing University of Utah ghashami@cs.uah.edu Jeff M. Phillips School of Computing University of Utah jeffp@cs.uah.edu Feifei

### City of De Pere. Halogen How To Guide

City of De Pere Halogen How To Guide Page1 (revised 12/14/2015) Halogen Performance Management website address: https://global.hgncloud.com/cityofdepere/welcome.jsp The following steps take place to complete

Using 3M CLOUD LIBRARY 0BOFG JJ To Download Library Books Onto Your KINDLE FIRE To borrow an e-book from the Ocean County Library system, you need to have a valid Ocean County Library card (less than \$25

### System Monitoring and Reporting

This chapter contains the following sections: Dashboard, page 1 Summary, page 2 Inventory Management, page 3 Resource Pools, page 4 Clusters, page 4 Images, page 4 Host Nodes, page 6 Virtual Machines (VMs),

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### Log Wealth In each period, algorithm A performs 85% as well as algorithm B

1 Online Investing Portfolio Selection Consider n stocks Our distribution of wealth is some vector b e.g. (1/3, 1/3, 1/3) At end of one period, we get a vector of price relatives x e.g. (0.98, 1.02, 1.00)

### A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

CmSc 250 Intro to Algorithms Chapter 6. Transform and Conquer Binary Heaps 1. Definition A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called

### B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

B561 Advanced Database Concepts 0 Introduction Qin Zhang 1-1 Self introduction: my research interests Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; I/O-efficient

### Algorithms. Theresa Migler-VonDollen CMPS 5P

Algorithms Theresa Migler-VonDollen CMPS 5P 1 / 32 Algorithms Write a Python function that accepts a list of numbers and a number, x. If x is in the list, the function returns the position in the list

### K-Cover of Binary sequence

K-Cover of Binary sequence Prof Amit Kumar Prof Smruti Sarangi Sahil Aggarwal Swapnil Jain Given a binary sequence, represent all the 1 s in it using at most k- covers, minimizing the total length of all

### Binary Heaps. CSE 373 Data Structures

Binary Heaps CSE Data Structures Readings Chapter Section. Binary Heaps BST implementation of a Priority Queue Worst case (degenerate tree) FindMin, DeleteMin and Insert (k) are all O(n) Best case (completely

### Massachusetts Institute of Technology

n (i) m m (ii) n m ( (iii) n n n n (iv) m m Massachusetts Institute of Technology 6.0/6.: Probabilistic Systems Analysis (Quiz Solutions Spring 009) Question Multiple Choice Questions: CLEARLY circle the

### Monte Carlo testing with Big Data

Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:

### Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual

Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual Turbo Trader 2 Quickstart Step #1 System Requirements Step #2 Installation Instructions Step #3 Overview Step #1 System Requirements

### 6.2 Normal distribution. Standard Normal Distribution:

6.2 Normal distribution Slide Heights of Adult Men and Women Slide 2 Area= Mean = µ Standard Deviation = σ Donation: X ~ N(µ,σ 2 ) Standard Normal Distribution: Slide 3 Slide 4 a normal probability distribution

### Online Maintenance of Very Large Random Samples

Online Maintenance of Very Large Random Samples Christopher Jermaine, Abhijit Pol, Subramanian Arumugam Department of Computer and Information Sciences and Engineering University of Florida Gainesville,

### Data Streams A Tutorial

Data Streams A Tutorial Nicole Schweikardt Goethe-Universität Frankfurt am Main DEIS 10: GI-Dagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:

### Distributed Computing over Communication Networks: Maximal Independent Set

Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.

### SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

### Aloha Guest Manager. What s New in Aloha Guest Manager Release 1.6.394 Table of Contents. Release 1.6.394 Enhancements Guide SERVER MANAGEMENT...

Aloha Guest Manager Release 1.6.394 Enhancements Guide What s New in Aloha Guest Manager Release 1.6.394 Table of Contents SERVER MANAGEMENT... 3 Multiple server assignments for the day...3 Employee Search

### Email Basics Working with Filters

Email Basics Working with Filters The following are provided to help keep your Zimbra email and calendar usage consistent, professional, and up-to-date. Working with Email Filters Setting Up Filters Creating

### Active Queue Management

Active Queue Management TELCOM2321 CS2520 Wide Area Networks Dr. Walter Cerroni University of Bologna Italy Visiting Assistant Professor at SIS, Telecom Program Slides partly based on Dr. Znati s material

### More on Using Phone Books. Linear Search. Linear Search. Search. 17. Searching and Sorting Examples: LinSearch: The Spec 3/24/2015

17. Searching and Sorting Examples: Search Topics: Binary Search Measuring Execution Time The Divide and Conquer Framework Merge Sort Is this song in that playlist? Is this number in that phone book? Is

### Statistical Methods for Network and Computer Security p.1/43

Statistical Methods for Network and Computer Security David J. Marchette marchettedj@nswc.navy.mil Naval Surface Warfare Center Code B10 Statistical Methods for Network and Computer Security p.1/43 A Few

### SMALL INDEX LARGE INDEX (SILT)

Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song

### TCP, Active Queue Management and QoS

TCP, Active Queue Management and QoS Don Towsley UMass Amherst towsley@cs.umass.edu Collaborators: W. Gong, C. Hollot, V. Misra Outline motivation TCP friendliness/fairness bottleneck invariant principle

### Business Objects Enterprise version 4.1. Report Viewing

Business Objects Enterprise version 4.1 Note about Java: With earlier versions, the Java run-time was not needed for report viewing; but was needed for report writing. The default behavior in version 4.1

### Random Early Detection Gateways for Congestion Avoidance

Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson Lawrence Berkeley Laboratory University of California floyd@eelblgov van@eelblgov To appear in the August 1993 IEEE/ACM

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin University of New South Wales Sydney, Australia Jian Xu University of New South Wales Sydney, Australia

### Query Selectivity Estimation for Uncertain Data

Query Selectivity Estimation for Uncertain Data Sarvjeet Singh *, Chris Mayfield *, Rahul Shah #, Sunil Prabhakar * and Susanne Hambrusch * * Department of Computer Science, Purdue University # Department

### Summary Data Structures for Massive Data

Summary Data Structures for Massive Data Graham Cormode Department of Computer Science, University of Warwick graham@cormode.org Abstract. Prompted by the need to compute holistic properties of increasingly

### Portal User Guide. Customers. Version 1.1. May 2013 http://www.sharedband.com 1 of 5

Portal User Guide Customers Version 1.1 May 2013 http://www.sharedband.com 1 of 5 Table of Contents Introduction... 3 Using the Sharedband Portal... 4 Login... 4 Request password reset... 4 View accounts...

### SeeVogh Player manual

SeeVogh Player manual Current Version as of: (03/28/2014) v.2.0 1 The SeeVogh Player is a simple application, which allows you to playback recordings made during a SeeVogh meeting with the recording function

### Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom*

Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Sheldon H. Jacobson Department of Computer Science University of Illinois at Urbana-Champaign shj@illinois.edu

### Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees. B-trees: Example. primary key indexing

Problem Given a large collection of records, Indexing with B-trees find similar/interesting things, i.e., allow fast, approximate queries Anastassia Ailamaki http://www.cs.cmu.edu/~natassa 2 Indexing Primary

### Any symbols displayed within these pages are for illustrative purposes only, and are not intended to portray any recommendation.

mobiletws for Android Users' Guide October 2012 mobiletws for Android Version 4.1.360 2012 Interactive Brokers LLC. All Rights Reserved Any symbols displayed within these pages are for illustrative purposes

### 1. Install a Virtual Machine... 2. 2. Download Ubuntu Ubuntu 14.04.1 LTS... 2. 3. Create a New Virtual Machine... 2

Introduction APPLICATION NOTE The purpose of this document is to explain how to create a Virtual Machine on a Windows PC such that a Linux environment can be created in order to build a Linux kernel and

### OUTLOOK WEB APP 2013 ESSENTIAL SKILLS

OUTLOOK WEB APP 2013 ESSENTIAL SKILLS CONTENTS Login to engage365 Web site. 2 View the account home page. 2 The Outlook 2013 Window. 3 Interface Features. 3 Creating a new email message. 4 Create an Email

### 1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.

Congestion Control When one part of the subnet (e.g. one or more routers in an area) becomes overloaded, congestion results. Because routers are receiving packets faster than they can forward them, one

### Nessus Cloud User Registration

Nessus Cloud User Registration Create Your Tenable Nessus Cloud Account 1. Click on the provided URL to create your account. If the link does not work, please cut and paste the entire URL into your browser.

### ΤΕΙ Κρήτης, Παράρτηµα Χανίων

ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΠΣΕ, Τµήµα Τηλεπικοινωνιών & ικτύων Η/Υ Εργαστήριο ιαδίκτυα & Ενδοδίκτυα Η/Υ Modeling Wide Area Networks (WANs) ρ Θεοδώρου Παύλος Χανιά 2003 8. Modeling Wide Area Networks

### TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA. R. P. Suresh Indian Institute of Management Kozhikode India

TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA R. P. Suresh Indian Institute of Management Kozhikode India Statistics is taught as a core course in the Management programmes leading to the degree of

### Using Data-Oblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science

Using Data-Oblivious Algorithms for Private Cloud Storage Access Michael T. Goodrich Dept. of Computer Science Privacy in the Cloud Alice owns a large data set, which she outsources to an honest-but-curious