Sublinear Algorithms for Big Data. Part 4: Random Topics


 Malcolm Kelly
 3 years ago
 Views:
Transcription
1 Sublinear Algorithms for Big Data Part : Random Topics Qin Zhang 11
2 Topic 3: Random sampling in distributed data streams (based on a paper with ormode, Muthukrishnan and Yi, PODS 10, JAM 12) 21
3 Distributed streaming Motivated by database/networking applications Adaptive filters [Olston, Jiang, Widom, SIGMOD 03] A generic geometric approach [Scharfman et al. SIGMOD 06] Prediction models [ormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] sensor networks network monitoring 31 environment monitoring cloud computing
4 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample 1
5 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample When the ith item arrives With probability s/i, use it to replace an item in the current sample chosen uniformly at ranfom With probability 1 s/i, throw it away 2
6 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S S 1 time
7 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k Tracking i approximately? Sampling won t be uniform S 3 S S 1 time
8 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S 2 Tracking i approximately? Sampling won t be uniform Key observation: We don t have to know the size of the population in order to sample! 53 S 1 time
9 Basic idea: binary Bernoulli sampling 61
10 Basic idea: binary Bernoulli sampling
11 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items 63
12 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items The could maintain a Bernoulli sample of size between s and O(s) 6
13 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 sites S 1 S 2 S 3 Sites send in every item w.pr. 2 i In epoch i: S k 171
14 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 172
15 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample 173
16 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i 17
17 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i (2): Always s items are maintained in 175
18 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S 181
19 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
20 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
21 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
22 A running example upper lower 1 2 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
23 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
24 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
25 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S
26 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S
27 A running example upper lower 1 5 Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S
28 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S
29 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) 192
30 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)
31 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)
32 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard) 195
33 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)
34 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)
35 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)
36 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)
37 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally sites S 1 S 2 S 3 S (discard) (discard)
38 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally Intuition: maintain a sample prob. at each site p s/n (n: total # items) without knowing n. sites S 1 S 2 S 3 S (discard) (discard)
39 Random sampling Analysis Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Splitsthelowersampleintoanewlowersampleandanuppersample Analysis: (Msgs sent) Messages sent per epoch O(k +s) # epochs O(logn) = O((k +s)logn) 211
40 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. 221
41 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. Experiments on the real data set from 1998 world cup logs. log 2 (cost) theory 15 log 2 (cost) cost 10 3 practice log 2 s Basic case cost VS sample size n = ,k = log 2 s Timebased sliding cost VS sample size n = ,k = ( log2 Timebased sliding cost VS window size s = 128,k = 128 ) w
42 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. log 2 (cost) theory 15 practice Experiments on the real data set from 1998 world cup logs log 2 s Basic case cost VS sample size n = ,k = 128 total # items n = 7,000,000 # items sent,000 size of sample s = 128 # sites k =
43 Sampling from a (timebased) sliding window sliding window expired windows frozen window current window t 101
44 Sampling from a (timebased) sliding window sliding window t expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR 102
45 Sampling from a (timebased) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. t 103
46 Sampling from a (timebased) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... t 10
47 Sampling from a (timebased) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... Solution: In the frozen window, find a good sample rate such that the sample size s. t 105
48 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication 111
49 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) 112
50 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) Guaranteed: There is a blue window with s sampled items that covers the unexpired portion of the frozen window 113
51 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own levelsampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item 121
52 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own levelsampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item When the current window freezes For each level, do a kway merge to build the level of the global structure at the. Total communication O((k + s) log w) 122
Optimal Tracking of Distributed Heavy Hitters and Quantiles
Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi HKUST yike@cse.ust.hk Qin Zhang MADALGO Aarhus University qinzhang@cs.au.dk Abstract We consider the the problem of tracking heavy hitters
More informationLECTURE 16. Readings: Section 5.1. Lecture outline. Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process
LECTURE 16 Readings: Section 5.1 Lecture outline Random processes Definition of the Bernoulli process Basic properties of the Bernoulli process Number of successes Distribution of interarrival times The
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
More informationSublinear Algorithms for Big Data. Part 4: Random Topics
Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 11 21 Topic 1: Compressive sensing Compressive sensing The model (CandesRombergTao 04; Donoho 04) Applicaitons Medical imaging reconstruction
More informationNimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff
Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear
More informationrésumé de flux de données
résumé de flux de données CLEROT Fabrice fabrice.clerot@orangeftgroup.com Orange Labs data streams, why bother? massive data is the talk of the town... data streams, why bother? service platform production
More informationZabin Visram Room CS115 CS126 Searching. Binary Search
Zabin Visram Room CS115 CS126 Searching Binary Search Binary Search Sequential search is not efficient for large lists as it searches half the list, on average Another search algorithm Binary search Very
More informationB669 Sublinear Algorithms for Big Data
B669 Sublinear Algorithms for Big Data Qin Zhang 11 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of
More informationPoint Biserial Correlation Tests
Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the productmoment correlation calculated between a continuous random variable
More informationEfficient FaultTolerant Infrastructure for Cloud Computing
Efficient FaultTolerant Infrastructure for Cloud Computing Xueyuan Su Candidate for Ph.D. in Computer Science, Yale University December 2013 Committee Michael J. Fischer (advisor) Dana Angluin James Aspnes
More information4. MAC protocols and LANs
4. MAC protocols and LANs 1 Outline MAC protocols and sublayers, LANs: Ethernet, Token ring and Token bus Logic Link Control (LLC) sublayer protocol Bridges: transparent (spanning tree), source routing
More informationFundamentals of Analyzing and Mining Data Streams
Fundamentals of Analyzing and Mining Data Streams Graham Cormode graham@research.att.com Outline 1. Streaming summaries, sketches and samples Motivating examples, applications and models Random sampling:
More informationData Mining on Streams
Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs
More informationMultiWay Search Trees (B Trees)
MultiWay Search Trees (B Trees) Multiway Search Trees An mway search tree is a tree in which, for some integer m called the order of the tree, each node has at most m children. If n
More information9th MaxPlanck Advanced Course on the Foundations of Computer Science (ADFOCS) PrimalDual Algorithms for Online Optimization: Lecture 1
9th MaxPlanck Advanced Course on the Foundations of Computer Science (ADFOCS) PrimalDual Algorithms for Online Optimization: Lecture 1 Seffi Naor Computer Science Dept. Technion Haifa, Israel Introduction
More informationThe Selection Problem
The Selection Problem The selection problem: given an integer k and a list x 1,..., x n of n elements find the kth smallest element in the list Example: the 3rd smallest element of the following list
More informationSampling Central Limit Theorem Proportions. Outline. 1 Sampling. 2 Central Limit Theorem. 3 Proportions
Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Outline 1 Sampling 2 Central Limit Theorem 3 Proportions Populations and samples When we use statistics, we are trying to find out information about
More informationMining Data Streams. Chapter 4. 4.1 The Stream Data Model
Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make
More informationCloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloomfilter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It
More informationPulling a Random Sample from a MAXQDA Dataset
In this guide you will learn how to pull a random sample from a MAXQDA dataset, using the random cell function in Excel. In this process you will learn how to export and reimport variables from MAXQDA.
More information1: Meditech Information Systems Online Help 2: Links to Micromedex 3: Print page view 4: Lock Session: temporarily minimize the session and lock it
MEDITECH 6.0 Training Manual Training Manual Outline Shortcuts Main Menu Scheduling Desktop Header /F Footer Bars Basic Scheduling Set Scheduling Medical Necessity / Prior Authorization Pending Appointments
More informationA. OPENING POINT CLOUDS. (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor)
MeshLAB tutorial 1 A. OPENING POINT CLOUDS (Notepad++ Text editor) (Cloud Compare Point cloud and mesh editor) (MeshLab Point cloud and mesh editor) 2 OPENING POINT CLOUDS IN NOTEPAD ++ Let us understand
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE
ADOBE ACROBAT CONNECT PRO MOBILE VISUAL QUICK START GUIDE GETTING STARTED WITH ADOBE ACROBAT CONNECT PRO MOBILE FOR IPHONE AND IPOD TOUCH Overview Attend Acrobat Connect Pro meetings using your iphone
More informationBig Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
More informationUSING OUTLOOK WEB ACCESS
USING OUTLOOK WEB ACCESS 17 March 2009, Version 1.0 WHAT IS OUTLOOK WEB ACCESS? Outlook Web Access (OWA) is a webmail service of Microsoft Exchange Server. The web interface of Outlook Web Access resembles
More informationRANDOMIZED ALGORITHMS
RANDOMIZED ALGORITHMS CONCEPTS Deterministic algorithms will always perform the same series of operations (and produce the same result) on a given input Random algorithms may execute different series of
More informationExcel  Beginner Documentation. Table of Contents
The Center for Teaching Excellence Table of Contents Basics... 1 What is a spreadsheet... 1 Why on a computer... 1 Basics of a spreadsheet... 1 Types of data... 1 Formatting... 1 Organize content in rows
More informationMATH 140 Lab 4: Probability and the Standard Normal Distribution
MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes
More informationILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC
ILLINOIS HOUSING DEVELOPMENT AUTHORITY INFORMATION TECHNOLOGY DEVELOPMENT AUTHORITY_ONLINE_UTILITY_ALLOWANCE_GUIDE.DOC The following guide details how to enter Utility Allowance information for projects
More informationCompact Summaries for Large Datasets
Compact Summaries for Large Datasets Big Data Graham Cormode University of Warwick G.Cormode@Warwick.ac.uk The case for Big Data in one slide Big data arises in many forms: Medical data: genetic sequences,
More informationContents. TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics. Yuming Jiang. Basic Concepts Random Variables
TTM4155: Teletraffic Theory (Teletrafikkteori) Probability Theory Basics Yuming Jiang 1 Some figures taken from the web. Contents Basic Concepts Random Variables Discrete Random Variables Continuous Random
More informationB561 Advanced Database Concepts. 0 Introduction. Qin Zhang 11
B561 Advanced Database Concepts 0 Introduction Qin Zhang 11 Self introduction: my research interests 21 Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; data structures;
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationApproximate Counts and Quantiles over Sliding Windows
Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu ABSTRACT We consider the problem
More informationVisual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics
Motivation Visual Data Mining Visualization for Data Mining Huge amounts of information Limited display capacity of output devices Chidroop Madhavarapu CSE 591:Visual Analytics Visual Data Mining (VDM)
More informationLecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Resampling techniques g Threeway data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture  17 ShannonFanoElias Coding and Introduction to Arithmetic Coding
More informationAn XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams
An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams Susan D. Urban 1, Suzanne W. Dietrich 1, 2, and Yi Chen 1 Arizona
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationDistributed Data Management Summer Semester 2013 TU Kaiserslautern
Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr. Ing. Sebas4an Michel smichel@mmci.uni saarland.de 1 Lecture 8+ (DISTRIBUTED) DATA STREAM PROCESSING (INTRODUCTION) 2 So Far: Databases/NoSQL
More informationSOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING
SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING CPS234: Computational Geometry Prof. Pankaj Agarwal Prepared by: Khalid Al Issa (khalid@cs.duke.edu) Fall 2005 BACKGROUND : RANGE SEARCHING, a
More informationContinuous Monitoring of Topk Queries over Sliding Windows
Continuous Monitoring of Topk Queries over Sliding Windows Kyriakos Mouratidis 1 Spiridon Bakiras 2 Dimitris Papadias 2 1 School of Information Systems Singapore Management University 8 Stamford Road,
More informationMODELING RANDOMNESS IN NETWORK TRAFFIC
MODELING RANDOMNESS IN NETWORK TRAFFIC  LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of
More informationB490 Mining the Big Data. 2 Clustering
B490 Mining the Big Data 2 Clustering Qin Zhang 11 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern
More informationStreaming Algorithms
3 Streaming Algorithms Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 Some Admin: Deadline of Problem Set 1 is 23:59, May 14 (today)! Students are divided into two groups
More informationPage Replacement Strategies. Jay Kothari Maxim Shevertalov CS 370: Operating Systems (Summer 2008)
Page Replacement Strategies Jay Kothari (jayk@drexel.edu) Maxim Shevertalov (max@drexel.edu) CS 370: Operating Systems (Summer 2008) Page Replacement Policies Why do we care about Replacement Policy? Replacement
More informationCSC148 Lecture 8. Algorithm Analysis Binary Search Sorting
CSC148 Lecture 8 Algorithm Analysis Binary Search Sorting Algorithm Analysis Recall definition of Big Oh: We say a function f(n) is O(g(n)) if there exists positive constants c,b such that f(n)
More informationSampling Based Range Partition Methods for Big Data Analytics
Milan Vojnović Microsoft Research milanv@microsoft.com Sampling Based Range Partition Methods for Big Data Analytics Fei Xu Microsoft Corporation feixu@microsoft.com March 202 Technical Report MSRTR2028
More informationFact Sheet InMemory Analysis
Fact Sheet InMemory Analysis 1 Copyright Yellowfin International 2010 Contents In Memory Overview...3 Benefits...3 Agile development & rapid delivery...3 Data types supported by the InMemory Database...4
More informationAnalytics Edge Project: Predicting the outcome of the 2014 World Cup
Analytics Edge Project: Predicting the outcome of the 2014 World Cup Virgile Galle and Ludovica Rizzo June 27, 2014 Contents 1 Project Presentation: the Soccer World Cup 2 2 Data Set 2 3 Models to predict
More informationContinuous Matrix Approximation on Distributed Data
Continuous Matrix Approximation on Distributed Data Mina Ghashami School of Computing University of Utah ghashami@cs.uah.edu Jeff M. Phillips School of Computing University of Utah jeffp@cs.uah.edu Feifei
More informationCity of De Pere. Halogen How To Guide
City of De Pere Halogen How To Guide Page1 (revised 12/14/2015) Halogen Performance Management website address: https://global.hgncloud.com/cityofdepere/welcome.jsp The following steps take place to complete
More informationTo Download Library Books Onto Your KINDLE FIRE
Using 3M CLOUD LIBRARY 0BOFG JJ To Download Library Books Onto Your KINDLE FIRE To borrow an ebook from the Ocean County Library system, you need to have a valid Ocean County Library card (less than $25
More informationSystem Monitoring and Reporting
This chapter contains the following sections: Dashboard, page 1 Summary, page 2 Inventory Management, page 3 Resource Pools, page 4 Clusters, page 4 Images, page 4 Host Nodes, page 6 Virtual Machines (VMs),
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationLog Wealth In each period, algorithm A performs 85% as well as algorithm B
1 Online Investing Portfolio Selection Consider n stocks Our distribution of wealth is some vector b e.g. (1/3, 1/3, 1/3) At end of one period, we get a vector of price relatives x e.g. (0.98, 1.02, 1.00)
More informationA binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heaporder property
CmSc 250 Intro to Algorithms Chapter 6. Transform and Conquer Binary Heaps 1. Definition A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called
More informationB561 Advanced Database Concepts. 0 Introduction. Qin Zhang 11
B561 Advanced Database Concepts 0 Introduction Qin Zhang 11 Self introduction: my research interests Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; I/Oefficient
More informationAlgorithms. Theresa MiglerVonDollen CMPS 5P
Algorithms Theresa MiglerVonDollen CMPS 5P 1 / 32 Algorithms Write a Python function that accepts a list of numbers and a number, x. If x is in the list, the function returns the position in the list
More informationFrontier Tandem. Administrator User Guide. Version 2.4 January 28, 2013
Frontier Tandem Administrator User Guide Version 2.4 January 28, 2013 About This Document 1 Version 7.3 Jan 28, 2013 Frontier Tandem Administrator Guide CONFIDENTIAL About This Document The Frontier Small
More informationKCover of Binary sequence
KCover of Binary sequence Prof Amit Kumar Prof Smruti Sarangi Sahil Aggarwal Swapnil Jain Given a binary sequence, represent all the 1 s in it using at most k covers, minimizing the total length of all
More informationBinary Heaps. CSE 373 Data Structures
Binary Heaps CSE Data Structures Readings Chapter Section. Binary Heaps BST implementation of a Priority Queue Worst case (degenerate tree) FindMin, DeleteMin and Insert (k) are all O(n) Best case (completely
More informationMassachusetts Institute of Technology
n (i) m m (ii) n m ( (iii) n n n n (iv) m m Massachusetts Institute of Technology 6.0/6.: Probabilistic Systems Analysis (Quiz Solutions Spring 009) Question Multiple Choice Questions: CLEARLY circle the
More informationMonte Carlo testing with Big Data
Monte Carlo testing with Big Data Patrick RubinDelanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:
More informationGlobal Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual
Global Futures Exchange & Trading Co., Inc. Turbo Trader 2 Quickstart Manual Turbo Trader 2 Quickstart Step #1 System Requirements Step #2 Installation Instructions Step #3 Overview Step #1 System Requirements
More information6.2 Normal distribution. Standard Normal Distribution:
6.2 Normal distribution Slide Heights of Adult Men and Women Slide 2 Area= Mean = µ Standard Deviation = σ Donation: X ~ N(µ,σ 2 ) Standard Normal Distribution: Slide 3 Slide 4 a normal probability distribution
More informationOnline Maintenance of Very Large Random Samples
Online Maintenance of Very Large Random Samples Christopher Jermaine, Abhijit Pol, Subramanian Arumugam Department of Computer and Information Sciences and Engineering University of Florida Gainesville,
More informationData Streams A Tutorial
Data Streams A Tutorial Nicole Schweikardt GoetheUniversität Frankfurt am Main DEIS 10: GIDagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:
More informationDistributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
More informationSYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis
SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline
More informationAloha Guest Manager. What s New in Aloha Guest Manager Release 1.6.394 Table of Contents. Release 1.6.394 Enhancements Guide SERVER MANAGEMENT...
Aloha Guest Manager Release 1.6.394 Enhancements Guide What s New in Aloha Guest Manager Release 1.6.394 Table of Contents SERVER MANAGEMENT... 3 Multiple server assignments for the day...3 Employee Search
More informationStudent Guide  Student Groups and Adobe Connect in Canvas
Student Guide  Student Groups and Adobe Connect in Canvas Creating an Adobe Connect Conference 1. Use Chrome or Firefox as your browser. Make sure you are on the latest version. 2. Connect your headset
More informationEmail Basics Working with Filters
Email Basics Working with Filters The following are provided to help keep your Zimbra email and calendar usage consistent, professional, and uptodate. Working with Email Filters Setting Up Filters Creating
More informationActive Queue Management
Active Queue Management TELCOM2321 CS2520 Wide Area Networks Dr. Walter Cerroni University of Bologna Italy Visiting Assistant Professor at SIS, Telecom Program Slides partly based on Dr. Znati s material
More informationMore on Using Phone Books. Linear Search. Linear Search. Search. 17. Searching and Sorting Examples: LinSearch: The Spec 3/24/2015
17. Searching and Sorting Examples: Search Topics: Binary Search Measuring Execution Time The Divide and Conquer Framework Merge Sort Is this song in that playlist? Is this number in that phone book? Is
More informationStatistical Methods for Network and Computer Security p.1/43
Statistical Methods for Network and Computer Security David J. Marchette marchettedj@nswc.navy.mil Naval Surface Warfare Center Code B10 Statistical Methods for Network and Computer Security p.1/43 A Few
More informationSMALL INDEX LARGE INDEX (SILT)
Wayne State University ECE 7650: Scalable and Secure Internet Services and Architecture SMALL INDEX LARGE INDEX (SILT) A Memory Efficient High Performance Key Value Store QA REPORT Instructor: Dr. Song
More informationTCP, Active Queue Management and QoS
TCP, Active Queue Management and QoS Don Towsley UMass Amherst towsley@cs.umass.edu Collaborators: W. Gong, C. Hollot, V. Misra Outline motivation TCP friendliness/fairness bottleneck invariant principle
More informationBusiness Objects Enterprise version 4.1. Report Viewing
Business Objects Enterprise version 4.1 Note about Java: With earlier versions, the Java runtime was not needed for report viewing; but was needed for report writing. The default behavior in version 4.1
More informationRandom Early Detection Gateways for Congestion Avoidance
Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson Lawrence Berkeley Laboratory University of California floyd@eelblgov van@eelblgov To appear in the August 1993 IEEE/ACM
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More informationContinuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin University of New South Wales Sydney, Australia Jian Xu University of New South Wales Sydney, Australia
More informationQuery Selectivity Estimation for Uncertain Data
Query Selectivity Estimation for Uncertain Data Sarvjeet Singh *, Chris Mayfield *, Rahul Shah #, Sunil Prabhakar * and Susanne Hambrusch * * Department of Computer Science, Purdue University # Department
More informationSummary Data Structures for Massive Data
Summary Data Structures for Massive Data Graham Cormode Department of Computer Science, University of Warwick graham@cormode.org Abstract. Prompted by the need to compute holistic properties of increasingly
More informationPortal User Guide. Customers. Version 1.1. May 2013 http://www.sharedband.com 1 of 5
Portal User Guide Customers Version 1.1 May 2013 http://www.sharedband.com 1 of 5 Table of Contents Introduction... 3 Using the Sharedband Portal... 4 Login... 4 Request password reset... 4 View accounts...
More informationSeeVogh Player manual
SeeVogh Player manual Current Version as of: (03/28/2014) v.2.0 1 The SeeVogh Player is a simple application, which allows you to playback recordings made during a SeeVogh meeting with the recording function
More informationSeed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom*
Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Sheldon H. Jacobson Department of Computer Science University of Illinois at UrbanaChampaign shj@illinois.edu
More informationProblem. Indexing with Btrees. Indexing. Primary Key Indexing. Btrees. Btrees: Example. primary key indexing
Problem Given a large collection of records, Indexing with Btrees find similar/interesting things, i.e., allow fast, approximate queries Anastassia Ailamaki http://www.cs.cmu.edu/~natassa 2 Indexing Primary
More informationAny symbols displayed within these pages are for illustrative purposes only, and are not intended to portray any recommendation.
mobiletws for Android Users' Guide October 2012 mobiletws for Android Version 4.1.360 2012 Interactive Brokers LLC. All Rights Reserved Any symbols displayed within these pages are for illustrative purposes
More information1. Install a Virtual Machine... 2. 2. Download Ubuntu Ubuntu 14.04.1 LTS... 2. 3. Create a New Virtual Machine... 2
Introduction APPLICATION NOTE The purpose of this document is to explain how to create a Virtual Machine on a Windows PC such that a Linux environment can be created in order to build a Linux kernel and
More informationOUTLOOK WEB APP 2013 ESSENTIAL SKILLS
OUTLOOK WEB APP 2013 ESSENTIAL SKILLS CONTENTS Login to engage365 Web site. 2 View the account home page. 2 The Outlook 2013 Window. 3 Interface Features. 3 Creating a new email message. 4 Create an Email
More information1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.
Congestion Control When one part of the subnet (e.g. one or more routers in an area) becomes overloaded, congestion results. Because routers are receiving packets faster than they can forward them, one
More informationNessus Cloud User Registration
Nessus Cloud User Registration Create Your Tenable Nessus Cloud Account 1. Click on the provided URL to create your account. If the link does not work, please cut and paste the entire URL into your browser.
More informationΤΕΙ Κρήτης, Παράρτηµα Χανίων
ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΠΣΕ, Τµήµα Τηλεπικοινωνιών & ικτύων Η/Υ Εργαστήριο ιαδίκτυα & Ενδοδίκτυα Η/Υ Modeling Wide Area Networks (WANs) ρ Θεοδώρου Παύλος Χανιά 2003 8. Modeling Wide Area Networks
More informationTEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA. R. P. Suresh Indian Institute of Management Kozhikode India
TEACHING STATISTICS IN MANAGEMENT COURSES IN INDIA R. P. Suresh Indian Institute of Management Kozhikode India Statistics is taught as a core course in the Management programmes leading to the degree of
More informationUsing DataOblivious Algorithms for Private Cloud Storage Access. Michael T. Goodrich Dept. of Computer Science
Using DataOblivious Algorithms for Private Cloud Storage Access Michael T. Goodrich Dept. of Computer Science Privacy in the Cloud Alice owns a large data set, which she outsources to an honestbutcurious
More informationChapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationGravityLab Multimedia Inc. Windows Media Authentication Administration Guide
GravityLab Multimedia Inc. Windows Media Authentication Administration Guide Token Auth Menu GravityLab Multimedia supports two types of authentication to accommodate customers with content that requires
More information