CIS 4930/6930 Spring 2014 Introduction to Data Science / Data Intensive Computing. University of Florida, CISE Department. Prof. Daisy Zhe Wang
Map/Reduce: Simplified Data Processing on Large Clusters. Outline: Parallel/Distributed Computing; Distributed File System; M/R Programming Model; Parallel Analytics using M/R. Adapted slides from Jeff Ullman, Anand Rajaraman, and Jure Leskovec (Stanford).
Parallel Computing. MapReduce is designed for parallel computing. Before MapReduce: Enterprise used a few super-computers, with parallelism achieved by parallel DBs (e.g., Teradata); Science/HPC used MPI and OpenMP over commodity computer clusters or over super-computers for parallel computing, which do not handle failures (i.e., no fault tolerance). Today: Enterprise uses MapReduce and parallel DBs; Science/HPC uses OpenMP, MPI, and MapReduce. MapReduce can be applied over both multicore and distributed environments.
Data Analysis on a Single Server. Many data sets can be very large: tens to hundreds of terabytes. Cannot always perform data analysis on large datasets on a single server (why?): run time (i.e., CPU), memory, disk space. (Diagram: a single server with CPUs, memory, and disks.)
Motivation: Google's Problem. 20+ billion web pages * 20KB = 400+ TB. 1 computer reads 30-35 MB/sec from disk: ~4 months to read the web. ~1000 hard drives to store the web. Takes even more to do something useful with the data!
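A quick back-of-the-envelope check of these numbers (the per-page size and read rate come from the slide; the ~400 GB drive size is my assumption for commodity disks of that era):

    # Back-of-the-envelope check of the slide's numbers.
    pages = 20e9           # 20+ billion web pages
    page_size = 20e3       # ~20 KB per page (slide's figure)
    read_rate = 35e6       # ~35 MB/s sequential read from one disk
    drive_size = 400e9     # assumed ~400 GB commodity drive (not from the slide)

    total_bytes = pages * page_size
    print(total_bytes / 1e12, "TB")                                       # ~400 TB
    print(total_bytes / read_rate / (3600 * 24 * 30), "months to read")   # ~4.4 months
    print(total_bytes / drive_size, "drives to store")                    # ~1000 drives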
Distributed Computing. Challenges: distributed/parallel programming is hard. How to distribute data? Computation? Scheduling? Fault tolerance? Debugging? On the software side: what is the programming model? On the hardware side: how to deal with hardware failures? MapReduce/Hadoop addresses all of the above: Google's computational/data manipulation model, and an elegant way to work with big data.
Commodity Clusters. The now-standard commodity architecture for such problems: a cluster of commodity Linux nodes with a gigabit Ethernet interconnect. Challenge: how to organize computations on this architecture and handle issues such as hardware failure?
Cluster Architecture. Each rack contains 16-64 nodes, with 1 Gbps between any pair of nodes in a rack and a 2-10 Gbps backbone between racks. (Diagram: racks of nodes, each node with CPU, memory, and disk, connected through rack switches and a backbone switch.)
Example Cluster Datacenters. In 2011, it was guesstimated that Google had 1M machines. Amazon Cloud? Probably millions of nodes today. AWB (Pivotal Analytics Work Bench): 1000 nodes, software and virtualization provided. HiPerGator (UF's cluster datacenter): no API, no virtualization.
Problems with Cluster Datacenters. Hardware failures: machine failures (disk failure, CPU failure, data corruption). One server may stay up 3 years (~1,000 days); if you have 1,000 servers, expect to lose 1/day; with 1M machines, 1,000 machines fail every day! Network failure: copying data over a network takes time.
Stable Storage. First-order problem: if nodes can fail, how can we store data persistently? Idea: store files multiple times for reliability. Answer: a Distributed File System, which provides a global file namespace (Google GFS; Hadoop HDFS). Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common.
Distributed File System. Chunk servers: the file is split into contiguous chunks, typically 16-64 MB each; each chunk is replicated (usually 2x or 3x), trying to keep replicas in different racks. Master node (a.k.a. Name Node in HDFS): stores metadata; likely to be replicated as well. Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.
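As a toy illustration of rack-aware replication (not GFS/HDFS code; the rack layout and machine names below are made up), here is a sketch that spreads each chunk's replicas across different racks:

    # Toy cluster layout: rack name -> machines in that rack (made-up names).
    RACKS = {
        "rack1": ["node-1a", "node-1b"],
        "rack2": ["node-2a", "node-2b"],
        "rack3": ["node-3a", "node-3b"],
    }

    def place_replicas(chunk_id, replication=3):
        """Pick one machine from each of `replication` different racks,
        rotating the starting rack so chunks spread across the cluster."""
        racks = list(RACKS)
        start = chunk_id % len(racks)
        chosen = [racks[(start + i) % len(racks)] for i in range(replication)]
        return [RACKS[r][chunk_id % len(RACKS[r])] for r in chosen]

    # A 200 MB file split into 64 MB chunks needs 4 chunks (the last one partial).
    for chunk_id in range(4):
        print("chunk", chunk_id, "->", place_replicas(chunk_id))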
Reliable Distributed File System. Data is kept in chunks spread across machines, and each chunk is replicated on different machines, giving seamless recovery from disk or machine failure.
MapReduce Motivation. Large-scale data processing: we want to use 1000s of CPUs, but don't want the hassle of scheduling, synchronization, etc. MapReduce provides automatic parallelization & distribution, fault tolerance, I/O scheduling and synchronization, and monitoring & status updates.
MapReduce Basics. MapReduce is a programming model similar to Lisp (and other functional languages). Many problems can be phrased using Map and Reduce functions. Benefits of implementing an algorithm with the MapReduce API: automatic parallelization, easy distribution across nodes, and nice retry/failure semantics.
Example Problem: Word Count. We have a huge text file (doc.txt) or a large corpus of documents (docs/*). Count the number of times each distinct word appears in the file. Sample application: analyze web server logs to find popular URLs.
Word Count (2). Case 1: the file is too large for memory, but all <word, count> pairs fit in memory. Case 2: the file is on disk, and there are too many distinct words to fit in memory. Count occurrences of words: words(doc.txt) | sort | uniq -c, where words takes a file and outputs the words in it, one word per line. This naturally captures the essence of MapReduce and is naturally parallelizable.
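A minimal Python sketch of that pipeline, assuming the hypothetical words helper is just a simple tokenizer, and counting on sorted output rather than in a dictionary (which is what makes Case 2 workable):

    import itertools
    import re

    def words(path):
        """The hypothetical `words` step: yield one lowercase word at a time."""
        with open(path) as f:
            for line in f:
                for w in re.findall(r"[a-z']+", line.lower()):
                    yield w

    # Equivalent of `sort | uniq -c`: sort the word stream, then count runs of equal words.
    # (A real out-of-core run would use an external sort instead of sorted().)
    for word, group in itertools.groupby(sorted(words("doc.txt"))):
        print(sum(1 for _ in group), word)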
3 Steps of MapReduce. Sequentially read a lot of data. Map: extract something you care about. Group by key: sort and shuffle. Reduce: aggregate, summarize, filter, or transform. Output the result. For different problems, the steps stay the same; only Map and Reduce change to fit the problem.
MapReduce: More Detail. Input: a set of key/value pairs. The programmer specifies two methods. Map(k, v) -> list(k1, v1): takes a key-value pair and outputs a set of key-value pairs (e.g., key is the file name, value is a single line in the file); there is one Map call for every (k, v) pair. Reduce(k1, list(v1)) -> list(k1, v2): all values v1 with the same key k1 are reduced together to produce a single value v2; there is one Reduce call per unique key k1. Group by key is executed by default in the same way: Sort(list(k1, v1)) -> list(k1, list(v1)).
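A minimal in-memory sketch of this contract (a toy sequential driver to show the data flow, not a real MapReduce runtime; the function names are mine):

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        """Toy sequential MapReduce: map each (k, v), group by key, then reduce.
        map_fn(k, v) yields (k1, v1) pairs; reduce_fn(k1, values) yields (k1, v2) pairs."""
        # Map phase: one map call per input (k, v) pair.
        intermediate = []
        for k, v in inputs:
            intermediate.extend(map_fn(k, v))

        # Group-by-key phase (the "shuffle"): collect all v1 for each k1.
        groups = defaultdict(list)
        for k1, v1 in intermediate:
            groups[k1].append(v1)

        # Reduce phase: one reduce call per unique key k1.
        output = []
        for k1, values in sorted(groups.items()):
            output.extend(reduce_fn(k1, values))
        return output

In a real system the map calls run in parallel over chunks of the input, the shuffle moves intermediate pairs across the network, and the reduce calls run in parallel over keys; the sketch only shows the logical contract.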
MapReduce: The Map Step. (Diagram: map tasks turn input key-value pairs into intermediate key-value pairs.)
MapReduce: The Group and Reduce Step. (Diagram: intermediate key-value pairs are grouped by key into key-value groups, and reduce tasks turn each group into output key-value pairs.)
Word Count using MapReduce

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count in values:
        result += count
    emit(key, result)
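For comparison, this is roughly how the same two functions look as a pair of Hadoop Streaming scripts (a sketch: Streaming runs any programs that read tab-separated key/value lines on stdin and write them to stdout, with the framework doing the sort/shuffle in between; the file names mapper.py and reducer.py are my own):

    #!/usr/bin/env python
    # mapper.py -- emit (word, 1) as tab-separated lines, one per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key, so counts for a word are contiguous.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Locally this can be sanity-checked with cat doc.txt | python mapper.py | sort | python reducer.py, mirroring the Unix pipeline from the earlier Word Count (2) slide.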
MapReduce: Word Count. (Diagram: the word count data flow through map, group by key, and reduce.)