B669 Sublinear Algorithms for Big Data



Similar documents
B490 Mining the Big Data. 0 Introduction

B561 Advanced Database Concepts. 0 Introduction. Qin Zhang 1-1

CIS 700: algorithms for Big Data

Data Streams A Tutorial

Big Graph Processing: Some Background

Infrastructures for big data

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Large-Scale Data Processing

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Summary Data Structures for Massive Data

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

CAs and Turing Machines. The Basis for Universal Computation

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

CAS CS 565, Data Mining

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Scalable Machine Learning - or what to do with all that Big Data infrastructure

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Big Data & Scripting Part II Streaming Algorithms

The Internet of Things and Big Data: Intro

MATH 115 Mathematics for Liberal Arts (3 credits) Professor s notes* As of June 27, 2007

Introduction to Parallel Programming and MapReduce

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Lecture Data Warehouse Systems

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Information Management course

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

résumé de flux de données

Challenges for Data Driven Systems

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Sublinear Algorithms for Big Data. Part 4: Random Topics

Part 1 Expressions, Equations, and Inequalities: Simplifying and Solving

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

A Model of Computation for MapReduce

Lecture 10: Regression Trees

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Spark: Cluster Computing with Working Sets

Building your Big Data Architecture on Amazon Web Services

1 Review of Least Squares Solutions to Overdetermined Systems

361 Computer Architecture Lecture 14: Cache Memory

Classification and Prediction

PhD Proposal: Functional monitoring problem for distributed large-scale data streams

Big Ideas in Mathematics

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Big Data Big Deal? Salford Systems

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Streaming Algorithms

Operations and Supply Chain Management Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras

Topics in basic DBMS course

Machine Learning.

Large-Scale Test Mining

Count the Dots Binary Numbers

Continuous Random Variables

ANALYTICS CENTER LEARNING PROGRAM

Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico

Distance Degree Sequences for Network Analysis

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

A NOTE ON OFF-DIAGONAL SMALL ON-LINE RAMSEY NUMBERS FOR PATHS

STAT 360 Probability and Statistics. Fall 2012

COMP9321 Web Application Engineering

Machine Learning over Big Data

MODELING RANDOMNESS IN NETWORK TRAFFIC

Data Streaming Algorithms for Estimating Entropy of Network Traffic

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Computer Science Theory. From the course description:

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Big Data Begets Big Database Theory

1.00 Lecture 1. Course information Course staff (TA, instructor names on syllabus/faq): 2 instructors, 4 TAs, 2 Lab TAs, graders

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Database Systems. Lecture 1: Introduction

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Exponential time algorithms for graph coloring

Every tree contains a large induced subgraph with all degrees odd

Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015

File System Management

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Big Data and Analytics: Challenges and Opportunities

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

Introduction to Data Mining. Lijun Zhang

File Management. Chapter 12

AMIS 7640 Data Mining for Business Intelligence

CS514: Intermediate Course in Computer Systems

A Catalogue of the Steiner Triple Systems of Order 19

Transcription:

B669 Sublinear Algorithms for Big Data Qin Zhang 1-1

Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures... 2-1

Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures... Magazine covers Nature 06 Nature 08 CACM 08 Economist 10 2-2

Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data 3-1

Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-2

Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity } The focus of algorithm design Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-3

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... 4-1

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? 4-2

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. Sublinear in time 4-3

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time 4-4

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Sublinear in I/O 4-5

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space 4-6

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? 4-7

What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Throw some of them away. Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? Store in multiple machines, which collaborate via communication Sublinear in communication 4-8

What do we mean by sublinear? Time/space/communication spent is o(input size) 5-1

Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing (leave to the future) 6-1

Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing (leave to the future) Sublinear space algorithms Data stream algorithms 6-2

Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing (leave to the future) Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP,... ) 6-3

Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing (leave to the future) Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP,... ) Sublinear I/O algorithms (leave to the future) External memory data structures/algorithms 6-4

Sublinear in time Example: Given a social network graph, we want to compute its average degree. (i.e., the average # of friends people have in the network) Can we do it without quering the degrees of all nodes? (i.e., asking everyone) 7-1

Why hard? You can t see everything in sublinear time! Computing exact average degree is impossible without querying at least n 1 nodes (n: # total nodes). So our goal is to get a (1 + ɛ)-approximation w.h.p. (ɛ is a very small constant, e.g., 0.01) 8-1

Why hard? You can t see everything in sublinear time! Computing exact average degree is impossible without querying at least n 1 nodes (n: # total nodes). So our goal is to get a (1 + ɛ)-approximation w.h.p. (ɛ is a very small constant, e.g., 0.01) Can we simply use sampling? 8-2

Why hard? You can t see everything in sublinear time! Computing exact average degree is impossible without querying at least n 1 nodes (n: # total nodes). So our goal is to get a (1 + ɛ)-approximation w.h.p. (ɛ is a very small constant, e.g., 0.01) Can we simply use sampling? No, it doesn t work. Consider the star, with degree sequence (n 1, 1,..., 1). 8-3

Why hard? You can t see everything in sublinear time! Computing exact average degree is impossible without querying at least n 1 nodes (n: # total nodes). So our goal is to get a (1 + ɛ)-approximation w.h.p. (ɛ is a very small constant, e.g., 0.01) Can we simply use sampling? No, it doesn t work. Consider the star, with degree sequence (n 1, 1,..., 1). So can we do anything non-trivial? (think about it, and we will discuss later in the course) 8-4

Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM CPU 9-1

Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM Applications Internet Router. Packets limited space CPU Router The router wants to maintain some statistics on data. E.g., want to detect anomalies for security. 9-2 Stock data, ad auction, flight logs on tapes, etc.

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 10-1

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 52 10-2

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 45 10-3

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 18 10-4

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 23 10-5

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 17 10-6

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 41 10-7

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 33 10-8

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 29 10-9

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 49 10-10

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 12 10-11

Why hard? You do see everything but then forget! Game 1: A sequence of numbers 35 10-12

Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? 10-13

Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? A: 33 10-14

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul 11-1

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Bob become friends 11-2

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Carol and Eva become friends 11-3

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob become friends 11-4

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul become friends 11-5

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Paul become friends 11-6

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob unfriends 11-7

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Dave become friends 11-8

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Bob and Paul become friends 11-9

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul unfriends 11-10

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Carol become friends 11-11

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? 11-12

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob 11-13

Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob Have to allow approx/randomization given a small memory. 11-14

Sublinear in communication The model x 1 = 010011 x 2 = 111011 x 3 = 111111 x k = 100011 They want to jointly compute f (x 1, x 2,..., x k ) (e.g., f is # distinct ele) Goal: minimize total bits of communication 12-1

Sublinear in communication The model x 1 = 010011 x 2 = 111011 x 3 = 111111 x k = 100011 They want to jointly compute f (x 1, x 2,..., x k ) (e.g., f is # distinct ele) Goal: minimize total bits of communication Applicaitons 12-2 etc.

Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. 13-1

Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. 13-2

Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. Can we do better, e.g., o(kn) bits of communication? 13-3

Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. 13-4 Can we do better, e.g., o(kn) bits of communication? What if the graph is node partitioned among the k machines? That is, each node is stored in 1 machine with all adjancent edges.

Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... 14-1

Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... 14-2

Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... L p regression Numerical linear algebra Low-rank approximation... 14-3

Problems Statistical problems 14-4 Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Numerical linear algebra L p regression Low-rank approximation... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... DB queries Strings Geometry problems Clustering Conjuntive queries Edit distance Earth-Mover Distance... Longest increasing sequence

A few examples using sublinear space RAM CPU 15-1

A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? 16-1

A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. 16-2

A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. Correctness: each item is included in the final sample w.p. 1 i (1 1 i+1 )... (1 1 n ) = 1 n (n: total # items) Space: O(1) 16-3

Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. 17-1

Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. 17-2

Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. Correctness: Obvious. Space (expected): 1/w + 1/(w 1) +... + 1/1 = log w. 17-3

Tentative course plan Part 0 : Introductions (3 lectures) New models for Big Data Interesting problems Basic probabilistic tools Part 1 : Sublinear in space (6 lectures) Distinct elements Heavy hitters Part 2 : Sublinear in communication (4 lectures) Sampling (l 0 sampling) Connectivity Min-cut and sparsification Part 3 : Sublinear in time (3 + *2 lectures) Average degree *Minimum spanning tree Part 4 : Student presentations 18-1 Sept. 16 (Mon) and Sept. 18 (Wed), 2 guest lectures by Prof. Funda Ergun

Resources There is no textbook for the class. Background on Randomized Algorithms: Probability and Computing by Mitzenmacher and Upfal (Advanced undergraduate textbook) Randomized Algorithms by Motwani and Raghavan (Graduate textbook) 19-1 Concentration of Measure for the Analysis of Randomised Algorithms by Dubhashi, Panconesi (Advanced reference book)

Surveys Surveys Sketch Techniques for Apporximate Query Processing by Cormode Data Streams: Algorithms and Applications by Muthukrishnan Check course website for more resources 20-1

Grading Assignments 42% : There will be two homework assignments, each with about 3-5 questions. Solutions should be typeset in LaTeX and submitted by email. Project 58% : The project consists of three components: 1. Write a proposal. 2. Make a presentation. 3. Write a report. (Details will be posted later) 21-1

Grading Assignments 42% : There will be two homework assignments, each with about 3-5 questions. Solutions should be typeset in LaTeX and submitted by email. Project 58% : The project consists of three components: 1. Write a proposal. 2. Make a presentation. 3. Write a report. (Details will be posted later) 21-2 Most important thing: Learn something about models / algorithmic techniques / theoretical analysis for Big Data.

Prerequisites A research-oriented course. Will be quite mathematical. One is expected to know: basics on algorithm design and analysis + basic probability. e.g., have taken B403 Introduction to Algorithm Design and Analysis or equivalent courses. I will NOT start with things like big-o notations, the definitions of expectation and variance, and hashing. But, please always ask at any time if you don t understand sth. 22-1

Frequently asked questions Is this a course good for my job hunting in industry? Yes and No. Yes, if you get to know some advanced (and easily implementable) algorithms for handling big data, that will certainly help. (e.g., Google interview questions) But, this is a research-oriented course, and is NOT designed for teaching commercially available techniques. 23-1

Frequently asked questions Is this a course good for my job hunting in industry? Yes and No. Yes, if you get to know some advanced (and easily implementable) algorithms for handling big data, that will certainly help. (e.g., Google interview questions) But, this is a research-oriented course, and is NOT designed for teaching commercially available techniques. I haven t taken B403 Introduction to Algorithm Design and Analysis or equivalent courses. Can I take the course? Or, will this course fit me? Generally speaking, this is an advanced algorithm course. It might be difficult if you do not have enough background. But you can always (and welcome to) have a try. 23-2

Summary for today We have introduced three types of sublinear algorithms: in time, space, and communication We have talked about why algorithmic design in these models/paradigms are difficult. We have given a few examples. We have talked about the course plan and assessment. 24-1

Today s homework (this is not assignment questions) Think about questions (see below) we have talked about today. If you have novel solutions, please let me know (typeset in latex and send to me by email), and good solutions will receive up to 10 extra points in the final grade. 1. To estimate the average degree of a graph, can we do something non-trivial (that is, without asking the degree fo all nodes)? 2. Can k machines test the connectivity of the graph (of n nodes) using o(nk) bits of communication? 25-1

26-1 Thank you!

Attribution Some examples are borrowed from Andrew McGregor s course http://people.cs.umass.edu/~mcgregor/courses/ CS711S12/index.html 27-1