CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.


1 CIS 4930/6930 Spring 2014 Introduction to Data Science / Data Intensive Computing, University of Florida, CISE Department, Prof. Daisy Zhe Wang

2 Map/Reduce: Simplified Data Processing on Large Clusters
- Parallel/Distributed Computing
- Distributed File System
- M/R Programming Model
- Parallel Analytics using M/R
Adapted from slides by Jeff Ullman, Anand Rajaraman and Jure Leskovec (Stanford)

3 Parallel Computing
- MapReduce is designed for parallel computing
- Before MapReduce
  - Enterprise: a few supercomputers; parallelism is achieved by parallel DBs (e.g., Teradata)
  - Science, HPC: MPI or OpenMP over commodity computer clusters or supercomputers; these do not handle failures (i.e., no fault tolerance)
- Today
  - Enterprise: MapReduce, parallel DBs
  - Science, HPC: OpenMP, MPI and MapReduce
- MapReduce applies to both multicore and distributed environments

4 Data Analysis on Single Server
- Many data sets can be very large: tens to hundreds of terabytes
- Cannot always perform data analysis on large datasets on a single server (why?)
  - Run time (i.e., CPU), memory, disk space
[Figure: a single server with several CPUs, memory banks and disks]

5 Motivation: Google's Problem
- 20+ billion web pages * 20 KB = 400+ TB
- 1 computer reads MB/sec from disk
  - ~4 months to read the web
- ~1000 hard drives to store the web
- Takes even more to do something useful with the data!
[Figure: a single machine with CPU and memory]
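The per-machine read rate did not survive in the slide text, but the other figures constrain it. The following is a back-of-envelope check in Python, assuming decimal units (1 KB = 1000 bytes) and 30-day months; neither assumption is stated on the slide.

```python
# Back-of-envelope check of the slide's numbers.
# Assumptions (not from the slide): decimal units and 30-day months.
PAGES = 20e9                      # 20+ billion web pages
PAGE_BYTES = 20e3                 # 20 KB per page
SECONDS_PER_MONTH = 30 * 24 * 3600

total_bytes = PAGES * PAGE_BYTES  # size of the web
total_tb = total_bytes / 1e12

# Rate one machine would need in order to read everything in about 4 months.
implied_mb_per_sec = total_bytes / (4 * SECONDS_PER_MONTH) / 1e6

# Capacity per drive if the data is spread over roughly 1000 drives.
tb_per_drive = total_tb / 1000

print(f"web size: {total_tb:.0f} TB")
print(f"implied read rate: {implied_mb_per_sec:.1f} MB/sec")
print(f"per-drive capacity: {tb_per_drive:.1f} TB")
```

Under these assumptions, the 400 TB and 4-month figures together imply a sustained read rate of a few tens of MB/sec for one machine.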

6 Distributed Computing
- Challenges: distributed/parallel programming is hard
  - How to distribute data? Computation? Scheduling? Fault tolerance? Debugging?
- On the software side: what is the programming model?
- On the hardware side: how to deal with hardware failures?
- MapReduce/Hadoop addresses all of the above
  - Google's computational/data manipulation model
  - Elegant way to work with big data

7 Commodity Clusters
- A standard commodity architecture has recently emerged for such problems:
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
- Challenge: how to organize computations on this architecture?
  - Handle issues such as hardware failure

8 Cluster Architecture
[Figure: racks of nodes, each node with CPU, memory and disk; every rack has a switch, and the rack switches connect to a backbone switch]
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
- Each rack contains nodes

9 Example Cluster Datacenters
- In 2011, it was guesstimated that Google had 1M machines
- Amazon Cloud? Probably millions of nodes today
- AWB: Pivotal Analytics Workbench, 1000 nodes, software and virtualization provided
- Hypergator: UF's cluster datacenter, no API, no virtualization

10 Problems with Cluster Datacenters
- Hardware failures
  - Machine failures: disk failure, CPU failure, data corruption
  - One server may stay up 3 years (1,000 days)
  - If you have 1,000 servers, expect to lose 1/day
  - With 1M machines, 1,000 machines fail every day!
- Network failure
  - Copying data over a network takes time
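The failure arithmetic on this slide can be checked with a short Python sketch. It assumes failures are independent and that every server fails on average once per 1,000 days. This is a simplifying model, not a claim about real hardware, and the function names are illustrative.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected number of server failures per day.

    Assumes independent failures, each server failing on average once
    every `mean_uptime_days` days (a simplifying assumption).
    """
    return num_servers / mean_uptime_days


def prob_any_failure_today(num_servers, mean_uptime_days=1000):
    """Probability that at least one server fails on a given day."""
    p_single = 1 / mean_uptime_days
    return 1 - (1 - p_single) ** num_servers


for n in (1, 1_000, 1_000_000):
    print(n, "servers:",
          expected_failures_per_day(n), "expected failures/day,",
          f"P(at least one) = {prob_any_failure_today(n):.3f}")
```

With 1,000 servers the expected count is one failure per day, so at that scale failure is the normal case and the software has to be built around it.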

11 Stable storage
- First-order problem: if nodes can fail, how can we store data persistently?
- Idea: store files multiple times for reliability
- Answer: Distributed File System
  - Provides global file namespace
  - Google GFS; Hadoop HDFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

12 Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in HDFS
  - Stores metadata
  - Likely to be replicated as well
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
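The chunking and replica-placement ideas above can be illustrated with a small Python sketch. It is a toy model, not the GFS or HDFS placement policy: the round-robin rule, the function names and the rack/server names are all assumptions made for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, upper end of the slide's 16-64 MB range
REPLICAS = 3                    # the slide says usually 2x or 3x


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, racks, replicas=REPLICAS):
    """Assign every chunk to `replicas` servers, each in a different rack.

    racks: dict mapping rack name -> list of chunk-server names.
    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        chosen = []
        for r in range(replicas):
            rack = rack_names[(chunk_id + r) % len(rack_names)]
            servers = racks[rack]
            chosen.append((rack, servers[chunk_id % len(servers)]))
        placement[chunk_id] = chosen   # this mapping is the master's metadata
    return placement


racks = {"rack0": ["s00", "s01"], "rack1": ["s10", "s11"],
         "rack2": ["s20", "s21"], "rack3": ["s30", "s31"]}
chunks = split_into_chunks(200 * 1024 * 1024)       # a 200 MB file
for chunk_id, replicas in place_replicas(len(chunks), racks).items():
    print(chunk_id, chunks[chunk_id], replicas)
```

In this model a client would ask the master for the placement entry of a chunk and then read the bytes directly from one of the listed chunk servers.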

13 Reliable Distributed File System
- Data kept in chunks spread across machines
- Each chunk replicated on different machines
- Seamless recovery from disk or machine failure

14 MapReduce Motivation
- Large-scale data processing
  - Want to use 1000s of CPUs
  - But don't want the hassle of scheduling, synchronization, etc.
- MapReduce provides
  - Automatic parallelization & distribution
  - Fault tolerance
  - I/O scheduling and synchronization
  - Monitoring & status updates

15 MapReduce Basics
- The MapReduce programming model is similar to Lisp (and other functional languages)
- Many problems can be phrased using Map and Reduce functions
- Benefits of implementing an algorithm in the MapReduce API
  - Automatic parallelization
  - Easy to distribute across nodes
  - Nice retry/failure semantics
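The functional-programming roots mentioned above can be seen in Python's built-in `map` and `functools.reduce`. This small sketch only illustrates the style; it does not use the MapReduce API itself.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply a function independently to every element.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the mapped values into a single result.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)
print(total)
```

Because each map call is independent of the others, the calls can run on different machines and can be retried individually after a failure. Because addition is associative, partial sums can be combined in any grouping.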

16 Example Problem: Word Count
- We have a huge text file, doc.txt, or a large corpus of documents, docs/*
- Count the number of times each distinct word appears in the file
- Sample application: analyze web server logs to find popular URLs

17 Word Count (2)
- Case 1: file too large for memory, but all <word, count> pairs fit in memory
- Case 2: file on disk, too many distinct words to fit in memory
  - Count occurrences of words: words(doc.txt) | sort | uniq -c
  - Here words takes a file and outputs the words in it, one word per line
- This pipeline captures the essence of MapReduce and is naturally parallelizable
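The `words | sort | uniq -c` pipeline can be mimicked in a few lines of Python to show why it resembles MapReduce: `words` plays the role of map, `sort` the group-by-key step, and `uniq -c` the reduce. The tokenization rule below is an assumption. Unlike the Unix `sort`, which can spill to disk, `sorted` here holds everything in memory, so this shows the structure only and is not a solution to Case 2.

```python
import re
from itertools import groupby


def words(lines):
    """Yield the words in `lines` one at a time, like the slide's `words`."""
    for line in lines:
        # Assumption: a word is a run of letters or apostrophes, lowercased.
        yield from re.findall(r"[a-z']+", line.lower())


def sort_uniq_count(word_iter):
    """Emulate `sort | uniq -c`: sort, then count each run of equal words."""
    return [(word, sum(1 for _ in run))
            for word, run in groupby(sorted(word_iter))]


sample = ["the quick brown fox", "jumps over the lazy dog", "The end"]
for word, count in sort_uniq_count(words(sample)):
    print(count, word)
```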

18 3 steps of MapReduce
- Sequentially read a lot of data
- Map: extract something you care about
- Group by key: sort and shuffle
- Reduce: aggregate, summarize, filter or transform
- Output the result
For different problems, the steps stay the same; Map and Reduce change to fit the problem

19 MapReduce: more detail
- Input: a set of key/value pairs
- Programmer specifies two methods
  - Map(k, v) -> list(k1, v1)
    - Takes a key-value pair and outputs a set of key-value pairs (e.g., key is the file name, value is a single line in the file)
    - One map call for every (k, v) pair
  - Reduce(k1, list(v1)) -> list(k1, v2)
    - All values v1 with the same key k1 are reduced together to produce a single value v2
    - There is one reduce call per unique key k1
- Group by key is executed by default in the same way: Sort(list(k1, v1)) -> list(k1, list(v1))
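The three signatures above can be turned into a single-process Python sketch. `run_mapreduce` and the two example functions are hypothetical names introduced here; a real framework would run the map and reduce calls on many machines and handle failures, which this sketch does not attempt.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process sketch of the Map -> group by key -> Reduce flow.

    inputs:    iterable of (k, v) pairs
    map_fn:    (k, v) -> iterable of (k1, v1) pairs
    reduce_fn: (k1, list of v1) -> iterable of (k1, v2) pairs
    """
    # Map: one call per input (k, v) pair.
    intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]

    # Group by key: Sort(list(k1, v1)) -> list(k1, list(v1)).
    intermediate.sort(key=itemgetter(0))
    grouped = [(k1, [v1 for _, v1 in group])
               for k1, group in groupby(intermediate, key=itemgetter(0))]

    # Reduce: one call per unique key k1.
    return [out for k1, values in grouped for out in reduce_fn(k1, values)]


# Hypothetical example: key is a file name, value is one line of that file.
def words_per_line(file_name, line):
    yield file_name, len(line.split())


def total_words(file_name, counts):
    yield file_name, sum(counts)


lines = [("a.txt", "map reduce"), ("b.txt", "one"), ("a.txt", "group by key")]
print(run_mapreduce(lines, words_per_line, total_words))
```

Only `words_per_line` and `total_words` are problem-specific; the driver stays the same for any other problem.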

20 MapReduce: The Map Step
[Figure: map calls turn input key-value pairs into intermediate key-value pairs]

21 MapReduce: The Group and Reduce Step
[Figure: intermediate key-value pairs are grouped into key-value groups; reduce calls turn each group into output key-value pairs]

22 Word Count using MapReduce
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
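The pseudocode above translates almost line for line into Python. The sketch below runs everything in one process, with `emit` modeled as `yield` and the shuffle as a dictionary of lists. The whitespace tokenization and lowercasing are assumptions that the slide does not specify.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """key: document name; value: text of the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """key: a word; values: an iterable of counts."""
    result = 0
    for count in counts:
        result += count
    yield word, result


def word_count(docs):
    """Run map, group by key, and reduce over a dict of name -> text."""
    groups = defaultdict(list)
    for doc_name, text in docs.items():
        for word, one in map_fn(doc_name, text):
            groups[word].append(one)   # shuffle: collect values per key
    return dict(pair
                for word in sorted(groups)
                for pair in reduce_fn(word, groups[word]))


docs = {"d1": "the cat saw the dog", "d2": "The dog ran"}
print(word_count(docs))
```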

23 MapReduce: Word Count
