B490 Mining the Big Data. 0 Introduction




B490 Mining the Big Data: 0 Introduction
Qin Zhang

Data Mining
What is Data Mining?
A definition: discovery of useful, possibly unexpected, patterns in data.
I don't think this is practical until the day machines have intelligence. (You may have a different opinion.)
I think that, most of the time, people just mean one of two things:
- Compute some functions defined on the data (efficient algorithms).
- Fit the data into some concrete models (statistical modeling).

In this course, we will talk about...
In this course we will focus on efficient algorithms. In particular, we will discuss:
- Finding similar items
- Mining frequent items
- Link analysis (explore structure in large graphs)
- Clustering (aggregate similar items)

Big Data

Big data is everywhere:
- [logo]: over 2.5 petabytes of sales transactions
- [logo]: an index of over 19 billion web pages
- [logo]: over 40 billion pictures
- ...
Magazine covers: Nature '06, Nature '08, CACM '08, Economist '10

Source and Challenge
Source:
- Retailer databases: Amazon, Walmart
- Logistics, financial & health data: stock prices
- Social networks: Facebook, Twitter
- Pictures taken by mobile devices: iPhone
- Internet traffic: IP addresses
- New forms of scientific data: Large Synoptic Survey Telescope
Challenge:
- Volume
- Velocity
- Variety (documents, stock records, personal profiles, photographs, audio & video, 3D models, location data, ...)
Volume and velocity are the focus of algorithm design.

What does Big Data Really Mean?
We don't define Big Data in terms of TB, PB, EB, ... The data is too big to fit in memory. What can we do?
- Process the items one by one as they arrive, and throw some of them away on the fly.
- Store the data on multiple machines, which collaborate via communication.
The RAM model does not fit. In the RAM model:
- There is a processor (CPU) and a memory of infinite size.
- Probing each cell of the memory has unit cost.
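The first option above (process items one by one and throw some away on the fly) can be illustrated with reservoir sampling. This is a standard technique chosen here as an illustration; the slides do not name it. The sketch below keeps a uniform random sample of k items from a stream of unknown length using only O(k) memory. The function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Illustrative sketch (not from the slides): each item is seen once and
    either stored in the reservoir or discarded immediately.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)      # uniform index in [0, i], both ends inclusive
            if j < k:
                sample[j] = item       # item i is kept with probability k / (i + 1)
    return sample
```

For example, `reservoir_sample(range(10**6), 10)` reads a million items but never holds more than ten of them at once.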

Popular Models for Big Data

Data Streams
The data stream model (Alon, Matias & Szegedy 1996)
[Figure: a CPU with a small RAM reading a data stream.]
Applications:
- Internet router: packets pass through a router with limited space. The router wants to maintain some statistics on the data, e.g., to detect anomalies for security.
- Stock data, ad auctions, flight logs on tapes, etc.
Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ, ...

Difficulty: See and forget!
Game 1: A sequence of numbers, shown one at a time:
52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35
Q: What's the median?
A: 33

Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul, shown one update at a time:
1. Alice and Bob become friends
2. Carol and Eva become friends
3. Eva and Bob become friends
4. Dave and Paul become friends
5. Alice and Paul become friends
6. Eva and Bob unfriend
7. Alice and Dave become friends
8. Bob and Paul become friends
9. Dave and Paul unfriend
10. Dave and Carol become friends
Q: Are Eva and Bob connected by friends?
A: YES. Eva - Carol - Dave - Alice - Bob

Given a small memory, we have to allow approximation and randomization.
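Both games can be replayed offline to check the answers. The sketch below stores every number and every current friendship, which is exactly what a small-memory streaming algorithm cannot afford. The helper names `replay` and `find_path` are made up for this illustration.

```python
from collections import deque
from statistics import median

# Game 1: the numbers in the order they were shown.
numbers = [52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35]
game1_median = median(numbers)  # exact median; requires keeping all the numbers

# Game 2: updates in the order shown ("+" = become friends, "-" = unfriend).
events = [
    ("+", "Alice", "Bob"), ("+", "Carol", "Eva"), ("+", "Eva", "Bob"),
    ("+", "Dave", "Paul"), ("+", "Alice", "Paul"), ("-", "Eva", "Bob"),
    ("+", "Alice", "Dave"), ("+", "Bob", "Paul"), ("-", "Dave", "Paul"),
    ("+", "Dave", "Carol"),
]


def replay(events):
    """Apply all updates and return the set of friendships that remain."""
    edges = set()
    for op, u, v in events:
        edge = frozenset((u, v))
        if op == "+":
            edges.add(edge)
        else:
            edges.discard(edge)
    return edges


def find_path(edges, source, target):
    """Breadth-first search; returns a shortest path as a list, or None."""
    adjacency = {}
    for edge in edges:
        u, v = tuple(edge)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


print(game1_median)                               # 33
print(find_path(replay(events), "Eva", "Bob"))    # ['Eva', 'Carol', 'Dave', 'Alice', 'Bob']
```

The memory used here grows with the length of the stream, which is why small-space algorithms for these problems have to settle for approximate or randomized answers.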

MapReduce
The MapReduce model (Dean & Ghemawat 2004)
Input -> Map -> Shuffle -> Reduce -> Output
- Map: for each input value $x_i$, $x_i \to \{(\mathrm{key}_1, v_1), (\mathrm{key}_2, v_2), \ldots\}$
- Shuffle: aggregate keys
- Reduce: $\{(\mathrm{key}_1, v_1), (\mathrm{key}_1, v_2), \ldots\} \to \{y_1, y_2, \ldots\}$
Goal: minimize (1) total communication, (2) # rounds.
Standard model in industry for massive data computation, e.g., Hadoop.
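One round of Map, Shuffle and Reduce can be imitated on a single machine. The sketch below is a toy, in-memory simulation using word count as the example job. It is not Hadoop's API, and all function names are invented for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Map: each input value x_i emits a list of (key, value) pairs."""
    pairs = []
    for x in records:
        pairs.extend(mapper(x))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce: turn each key's list of values into an output."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_round(records, mapper, reducer):
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Example job (an assumption for illustration): word count.
def word_count_mapper(line):
    return [(word, 1) for word in line.lower().split()]


def word_count_reducer(word, counts):
    return sum(counts)


docs = ["big data is big", "data mining"]
counts = run_round(docs, word_count_mapper, word_count_reducer)
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'mining': 1}
```

In a real cluster the pairs produced by the map phase travel over the network during the shuffle, which is why the model's cost measures are total communication and the number of rounds.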

ActiveDHT
The ActiveDHT model (Bahmani, Chowdhury & Goel 2010)
Operations: Update(key, a_t), Query(key)
Used in Yahoo! S4 & Twitter Storm
[Figure: a ring of hash values 0-15. Each machine is responsible for a range of hash values, e.g., one machine for keys with hash = 4, 5 and another for keys with hash = 6, 7.]
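The key-to-machine routing in the figure can be sketched as follows. This is a toy, single-process illustration and not the API of S4 or Storm. Several details are assumptions made only for this example: the class names, the use of CRC32 as the hash function, two hash values per machine, and the rule that an update adds `a_t` to a per-key running total.

```python
import zlib

RING_SIZE = 16          # hash values 0..15, as in the figure
SLOTS_PER_MACHINE = 2   # e.g., one machine owns hash values 4 and 5


class Machine:
    """Holds the state for the keys it is responsible for."""

    def __init__(self):
        self.state = {}

    def update(self, key, a_t):
        # Assumption: an update accumulates a_t into a running total.
        self.state[key] = self.state.get(key, 0) + a_t

    def query(self, key):
        return self.state.get(key, 0)


class ToyActiveDHT:
    """Routes Update/Query to the machine that owns the key's hash value."""

    def __init__(self):
        self.machines = [Machine() for _ in range(RING_SIZE // SLOTS_PER_MACHINE)]

    def hash_of(self, key):
        return zlib.crc32(key.encode("utf-8")) % RING_SIZE

    def owner(self, key):
        return self.machines[self.hash_of(key) // SLOTS_PER_MACHINE]

    def update(self, key, a_t):
        self.owner(key).update(key, a_t)

    def query(self, key):
        return self.owner(key).query(key)
```

For example, after `dht = ToyActiveDHT()`, `dht.update("page-17", 3)` and `dht.update("page-17", 4)`, the call `dht.query("page-17")` returns 7, and every request for that key is served by the same machine.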

Tentative course plan
Part 0: Introductions
Part 1: Finding Similar Items
- Jaccard Similarity and Min-Hashing
- Locality Sensitive Hashing (LSH) and Distances
- Implementing LSH in ActiveDHT
Part 2: Clustering
- Hierarchical Clustering
- Assignment-based Clustering (k-center, k-means, k-median)
- Spectral Clustering
Part 3: Mining Frequent Items
- Finding Frequent Itemsets
- Finding Frequent Items in Data Stream
Part 4: Link Analysis
- Markov Chain Basics
- Webpage Similarity and PageRank
- Implementing PageRank in MapReduce

Resources
There is no official textbook for the class.
Main reference book: Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman
Background on randomized algorithms: Probability and Computing by Mitzenmacher and Upfal

Instructors
Instructor: Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: By email appointment
Assistant Instructor: Prasanth Velamala
Email: prasvela@umail.iu.edu
Office hours: Thursdays, 2pm-3pm

Grading
Assignments 50%: There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.
Project 50%: The project consists of three components:
1. Write a proposal.
2. Write a report.
3. Make a presentation.
(Details will be posted online)
Use A, B, ... for each item (assignments or projects). Final grade will be a weighted average (according to XX%).
Most important thing: Learn something about models / algorithmic techniques / theoretical analysis for Mining the Big Data.

LaTeX
LaTeX: a highly recommended tool for assignments/reports
1. Read the wiki article: http://en.wikipedia.org/wiki/latex
2. Find a good LaTeX editor.
3. Learn how to use it, e.g., read A Not So Short Introduction to LaTeX 2e (Google it)

Prerequisites
One is expected to know: basics of algorithm design and analysis + probability + programming, e.g., have taken
- (Math) M365 Introduction to Probability and Statistics,
- (Math) M301 Linear Algebra and Applications,
- (CS) C241 Discrete Structures for Computer Science,
- (CS) B403 Introduction to Algorithm Design and Analysis,
or equivalent courses.
I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don't understand something.

Possible project topics
Part 1: Finding Similar Items
- Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query docs, for each query doc find all docs in the dictionary that are similar. Compare LSH with other methods that you can think of (e.g., the trivial one: compare the query with each of the docs in the dictionary), in terms of the running time.
Part 2: Clustering
- Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class, and run them on large data sets. One can also try to compare them with hierarchical clustering.
Part 3: Mining Frequent Items
- Finding Frequent Itemsets: Run the A-priori algorithm on large data sets to find frequent itemsets.
- Finding Frequent Items in Data Stream: Implement streaming algorithms taught in class, and run them on large data sets to find frequent items.
Compare the results with the true frequent items/itemsets.

Basics on probability

Approximation and Randomization
Approximation: Return $\hat{f}(A)$ instead of $f(A)$, where
$|f(A) - \hat{f}(A)| \le \epsilon f(A)$.
Such an $\hat{f}(A)$ is a $(1+\epsilon)$-approximation of $f(A)$.
Randomization: Return $\hat{f}(A)$ instead of $f(A)$, where
$\Pr\left[\,|f(A) - \hat{f}(A)| \le \epsilon f(A)\,\right] \ge 1 - \delta$.
Such an $\hat{f}(A)$ is a $(1+\epsilon, \delta)$-approximation of $f(A)$.
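To make the two definitions concrete, the sketch below checks the relative-error condition and empirically estimates how often a randomized estimator violates it. The estimator (scaling up the mean of a few uniform samples to estimate a sum) is just an example chosen for illustration, and all names are hypothetical.

```python
import random


def is_eps_approx(true_value, estimate, eps):
    """True if |f(A) - f_hat(A)| <= eps * f(A)."""
    return abs(true_value - estimate) <= eps * true_value


def sampled_sum(values, m, rng):
    """Estimate sum(values) from m uniform samples drawn with replacement."""
    n = len(values)
    return n * sum(rng.choice(values) for _ in range(m)) / m


def empirical_failure_rate(values, m, eps, trials, seed=0):
    """Fraction of trials in which the estimate misses the eps-window.

    This is an empirical stand-in for delta in the (1 + eps, delta) definition.
    """
    rng = random.Random(seed)
    true_value = sum(values)
    failures = sum(
        not is_eps_approx(true_value, sampled_sum(values, m, rng), eps)
        for _ in range(trials)
    )
    return failures / trials
```

Increasing the sample size `m` lowers the observed failure rate for a fixed `eps`; this is the usual trade-off between space and accuracy.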

Markov and Chebyshev inequalities
Markov Inequality: Let $X \ge 0$ be a random variable. Then for all $a > 0$,
$\Pr[X \ge a] \le \frac{E[X]}{a}$.
Chebyshev's Inequality: Let $X \ge 0$ be a random variable. Then for all $a > 0$,
$\Pr\left[\,|X - E[X]| \ge a\,\right] \le \frac{\mathrm{Var}[X]}{a^2}$.
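Both inequalities can be checked exactly on a small example. The sketch below uses a fair six-sided die (an example chosen here, not taken from the slides) and exact rational arithmetic, so the comparisons involve no rounding.

```python
from fractions import Fraction

outcomes = range(1, 7)          # a fair six-sided die
p = Fraction(1, 6)              # probability of each face

mean = sum(p * x for x in outcomes)                 # E[X] = 7/2
var = sum(p * (x - mean) ** 2 for x in outcomes)    # Var[X] = 35/12


def tail(a):
    """Pr[X >= a]."""
    return sum(p for x in outcomes if x >= a)


def deviation(a):
    """Pr[|X - E[X]| >= a]."""
    return sum(p for x in outcomes if abs(x - mean) >= a)


def markov_bound(a):
    return mean / a


def chebyshev_bound(a):
    return var / a ** 2


for a in (2, 4, 6):
    print(a, tail(a), "<=", markov_bound(a))
for a in (1, 2, 3):
    print(a, deviation(a), "<=", chebyshev_bound(a))
```

On this example the bounds are loose. A bound can even exceed 1, as the Chebyshev bound does for a = 1, in which case it carries no information.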

Application: Birthday Paradox
Birthday Paradox: In a set of k randomly chosen people, what is the probability that at least one pair of them has the same birthday? Assume each person's birthday is chosen uniformly at random from Jan. 1 to Dec. 31, and let n = 365 be the number of possible birthdays.
Take 1: For any pair of people, the probability that they have the same birthday is $1/n$. For k people, we have $\binom{k}{2}$ pairs of people. The probability that none of the pairs has the same birthday is $(1 - 1/n)^{\binom{k}{2}}$. Thus the answer is $1 - (1 - 1/n)^{\binom{k}{2}}$.
Wrong! (The $\binom{k}{2}$ pair events are not mutually independent, so their probabilities cannot simply be multiplied.)
Take 2: $1 - \left(\frac{n-0}{n}\right)\left(\frac{n-1}{n}\right)\left(\frac{n-2}{n}\right)\cdots\left(\frac{n-(k-1)}{n}\right)$
$\Pr[\text{exists collision}] \approx k^2/(2n)$
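The two takes and the quadratic approximation can be compared numerically. The sketch below uses n = 365 by default; the function names are made up for this illustration.

```python
from math import comb


def exact_collision(k, n=365):
    """Take 2: 1 - prod_{i=0}^{k-1} (n - i) / n."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n
    return 1.0 - p_distinct


def take1_collision(k, n=365):
    """Take 1: treats the C(k, 2) pair events as if they were independent."""
    return 1.0 - (1.0 - 1.0 / n) ** comb(k, 2)


def quadratic_approx(k, n=365):
    """Rough estimate k^2 / (2n); only meaningful when k is small relative to sqrt(n)."""
    return k * k / (2 * n)


for k in (5, 10, 23):
    print(k, exact_collision(k), take1_collision(k), quadratic_approx(k))
```

A tiny case makes the flaw in Take 1 visible. With n = 2 possible birthdays and k = 3 people, a collision is certain by the pigeonhole principle, so `exact_collision(3, 2)` is 1.0, yet `take1_collision(3, 2)` gives 0.875.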

Application: Coupon Collector
Coupon Collector: Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize. Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
Analysis (on board)
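The board analysis is not part of this transcript. The standard result is that the expected number of boxes is $n H_n = n \sum_{i=1}^{n} 1/i$, which is roughly $n \ln n$. The sketch below compares that formula with a simulation; the function names and the choice of seed are assumptions for illustration.

```python
import random


def boxes_until_complete(n, rng):
    """Buy boxes until all n coupon types have been seen; return the count."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))   # coupon type chosen uniformly at random
        boxes += 1
    return boxes


def expected_boxes(n):
    """Standard closed form n * H_n for the expected number of boxes."""
    return n * sum(1.0 / i for i in range(1, n + 1))


def simulate(n, trials, seed=0):
    """Average number of boxes over independent simulated runs."""
    rng = random.Random(seed)
    return sum(boxes_until_complete(n, rng) for _ in range(trials)) / trials


print(expected_boxes(10), simulate(10, 2000))
```

The formula follows from linearity of expectation: once i - 1 types have been collected, the wait for a new type is geometric with success probability (n - i + 1)/n, so it takes n/(n - i + 1) boxes in expectation.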

The Union Bound
Consider t possibly dependent random events $X_1, \ldots, X_t$. The probability that all events occur is at least
$1 - \sum_{i=1}^{t} \left(1 - \Pr[X_i \text{ occurs}]\right)$.
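A small exact example shows the bound at work on dependent events. The sketch below uses one roll of a fair die and three events defined on that same roll; the events and helper names are invented for illustration.

```python
from fractions import Fraction


def prob(event, space):
    """Probability of an event under the uniform distribution on a finite space."""
    return Fraction(sum(1 for w in space if event(w)), len(space))


def prob_all(events, space):
    """Exact probability that every event occurs."""
    return prob(lambda w: all(e(w) for e in events), space)


def union_bound_lower(events, space):
    """Lower bound 1 - sum_i (1 - Pr[X_i occurs]) from the union bound."""
    return 1 - sum(1 - prob(e, space) for e in events)


die = range(1, 7)
events = [
    lambda w: w <= 5,   # fails only on a 6
    lambda w: w >= 2,   # fails only on a 1
    lambda w: w != 6,   # also fails only on a 6, so it depends on the first event
]

print(prob_all(events, die))           # 2/3
print(union_bound_lower(events, die))  # 1/2
```

The bound needs no independence assumption. It is loose here because two of the failure events coincide, so the sum counts the outcome 6 twice.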

Summary for the introduction
- We have discussed Big Data and Data Mining.
- We have introduced three popular models for modern computation.
- We have talked about the course plan and assessment.
- We have covered some basics on probability.

Thank you!