Lecture 10 - Functional programming: Hadoop and MapReduce
Sohan Dharmaraja
For today
- Big Data and text analytics
- Functional programming concepts
- MapReduce
- Apache Hadoop
What is Big Data?
A concept referring to data whose size is beyond the ability of commonly used software to handle within acceptable time limits. [1]
[1] Snijders, C., Matzat, U., & Reips, U. (2012). "Big Data": Big gaps of knowledge in the field of Internet science. International Journal of Internet Science.
Problems
- Volume: increasing amounts of data become difficult to handle
- Velocity: processing speed is key; fast insight gives you an edge
- Variety: big data is usually unstructured and very diverse in type
How big is Big Data?
- 90% of the data available today was created in the last 2 years
- 12 TB (12,000,000 MB) of Tweets are generated every day
- Data sources are growing: healthcare, weather, stocks, etc.
How is Big Data analyzed?
- Parallel computing (CUDA)
- Distributed computing (Hadoop, MongoDB, etc.)
Big data scenarios
Data scientist scenario: Example 1
Your client's stores are crowded at peak hours, so crowded that customers walk away in frustration. At other times, the stores are nearly empty. The client is selling below potential because of cart abandonment and a failure to attract customers throughout the day.
The tools: you have a marketing budget and the authority to send email advertisements, make special offers and run promotion schemes. You also have control over staff scheduling and checkout procedures.
Example 2
You are asked to build a recommendation system for an online merchant, i.e. a system that presents information likely to be of interest to the user.
Amazon?
Google?
Brainstorming session
Focus on recommendations for Amazon shoppers.
- What would you say to the client to get the contract?
- How would you approach the problem?
- What do you need from me?
Text analytics
Data cleansing is very important:
- tokenization (phrases? words? ...)
- spelling normalization: color, colour
- orthographic normalization: in Danish, søster = sister
- morphological normalization: verbs, adverbs, adverbial participles
- Zipf's law (inverse frequency law): "the" makes up about 7% of word occurrences, "of" about 3.5%
Cleansing can also be considered part of data representation.
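The cleansing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SPELLING` table is a hypothetical one-entry dictionary, the sample sentence is made up, and a real pipeline would need far richer tokenization and normalization rules.

```python
import re
from collections import Counter

# Hypothetical spelling table for illustration only; a real system would
# use a much larger dictionary.
SPELLING = {"colour": "color"}

def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def normalize(tokens):
    """Map known spelling variants onto one canonical form."""
    return [SPELLING.get(tok, tok) for tok in tokens]

def relative_frequencies(tokens):
    """Return (word, share of all tokens) pairs, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common()]

# Made-up sample sentence, not taken from the lecture.
text = "The colour of the sky and the color of the sea"
tokens = normalize(tokenize(text))
for word, share in relative_frequencies(tokens)[:3]:
    print(f"{word}: {share:.2f}")
```

Even in this tiny sample the most frequent word dominates the count, which is the pattern Zipf's law describes for large corpora.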
Distributed computing is hard
- Need to be fast: lots of data to churn
- Need to be scalable: varying amounts of data to churn
- Need to be fault-tolerant: intensive processes and hardware failures
Parallel design patterns
- So far: low-level kernel programming (thread mapping, etc.)
- Think at a higher level than individual CUDA kernels
- Specify what to compute, not how to compute it
- Let the programmer worry about the algorithm
Functional approaches are motivated by the points above.
Functional programming?
Languages: Lisp, ML, Haskell, etc.
A different (often useful) perspective:
- Recursion (no loops allowed)
- Function pointers (map, reduce, etc.)
Finds uses in:
- programming language theory (logic, proof systems, etc.)
- design of compilers
- concurrent programming
Roadmap for MapReduce
- Map: applies a process to data
- Reduce: combines results into answers
I.e. divide and conquer for big data, introduced by Google in 2004.
Passing functions as arguments
    #include <stdio.h>

    // square a number
    int f(int x) { return x * x; }

    // apply a function pointer to a number
    int g(int (*f)(int), int x) { return (*f)(x); }

    // pass f as a function pointer to g
    int main(void) {
        int res = g(f, 4);      // res is 16
        printf("%d\n", res);
        return 0;
    }
Map: applying a function to all elements
    int inc(int x) { return x + 1; }

    // apply function pointer f to every element of an array, in place
    void map(int* ary, int n, int (*f)(int)) {
        if (n == 0) return;          // empty array: nothing left to do
        *ary = (*f)(*ary);           // transform the first element
        map(ary + 1, n - 1, f);      // recurse on the rest
    }

    int main(void) {
        int ary[5] = {1, 2, 3, 4, 5};
        map(ary, 5, inc);            // ary is now {2, 3, 4, 5, 6}
        return 0;
    }
Reduce: combining the elements of an array
    int add(int x, int y) { return x + y; }
    int mul(int x, int y) { return x * y; }

    // combine every element of the array with function pointer f;
    // b is the base value returned for an empty array
    int reduce(int* ary, int n, int (*f)(int, int), int b) {
        if (n == 0) return b;
        return (*f)(reduce(ary + 1, n - 1, f, b), *ary);
    }

    int main(void) {
        int ary[5] = {1, 2, 3, 4, 5};
        int sum = reduce(ary, 5, add, 0);   // 15
        int fac = reduce(ary, 5, mul, 1);   // 120
        return 0;
    }
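For comparison, the same three ideas (passing a function, map, reduce) can be sketched with Python's built-in `map` and the standard library's `functools.reduce`. This is an illustrative translation rather than part of the original lecture code; the variable names are chosen here for readability.

```python
from functools import reduce

def inc(x):
    return x + 1

def add(x, y):
    return x + y

def mul(x, y):
    return x * y

ary = [1, 2, 3, 4, 5]

# map: apply inc to every element (returns a new list instead of
# modifying the array in place as the C version does)
incremented = list(map(inc, ary))

# reduce: combine all elements, starting from a base value
total = reduce(add, ary, 0)       # 0 is the base value for addition
factorial = reduce(mul, ary, 1)   # 1 is the base value for multiplication

print(incremented, total, factorial)
```

Note that `functools.reduce` folds from the left, whereas the recursive C version combines elements starting from the end of the array. For `add` and `mul` the result is the same because both operations are commutative and associative.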
Roadmap for MapReduce
- Map: assigns processes to machines
- Reduce: combines the machines' results into answers
These interactions have consequences:
- Sometimes parallelization is obvious, sometimes not
- Recursion/map/reduce often help to decide how
Again: a divide and conquer mentality.
MapReduce - Big Picture
- A programming model for processing large data sets in batch
- Designed to execute on a cluster of commodity hardware
- Tasks are allowed to fail and can be retried
- Brings code to data, rather than data to code
- Limits communication by allowing only certain operations in a flow
All inspired by functional programming...
Central MapReduce Ideas
- Operate on key-value pairs
- Data scientist provides map and reduce:
    (input) <k1, v1> -> map -> <k2, v2>
    <k2, v2> -> combine, sort -> <k2, v2>
    <k2, v2> -> reduce -> <k3, v3> (output)
- Efficient sort provided by the MapReduce library
MapReduce Example - Word Count
Example: two text files
- file1: Hello World Bye World
- file2: Hello Hadoop Goodbye Hadoop
MapReduce Example - Word Count: Map step
First map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Second map emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
MapReduce Example - Word Count: Sort & Combine step
Sorted output of first map: <Bye, 1> <Hello, 1> <World, 2>
Sorted output of second map: <Goodbye, 1> <Hadoop, 2> <Hello, 1>
MapReduce Example - Word Count: Reduce step
The reduce method sums the values for each key.
Output of the job: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
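The three steps of the word count example can be simulated on a single machine. The sketch below is not Hadoop code: the function names `map_words`, `combine` and `reduce_counts` are made up for illustration, and plain Python lists stand in for the distributed intermediate data.

```python
from collections import defaultdict

def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Sort & combine step: sum the counts per word within one map output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_counts(combined_outputs):
    """Reduce step: sum the values for each key across all map outputs."""
    totals = defaultdict(int)
    for pairs in combined_outputs:
        for word, count in pairs:
            totals[word] += count
    return sorted(totals.items())

files = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}

# One map + combine per input file, then a single reduce over all outputs.
combined = [combine(map_words(text)) for text in files.values()]
result = reduce_counts(combined)
print(result)
```

Combining inside each map output before the reduce step means fewer pairs have to be sent between machines, which is why the combine step exists in the flow.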
Caveats
- Not all problems fit well (or at all) within this model
- This model is not suitable for real-time processing of data
- May not scale linearly with the size of your input data
- Easier than starting from scratch, but it can be tricky to fit your problem to the model
What is MapReduce good at solving?
- Identify, transform, aggregate, filter, count, sort...
- Discovery tasks (vs. high repetition of similar tasks, many reads)
- Unstructured data (vs. tabular, indexes)
- Continuously updated data (indexing cost)
- Many, many, many machines (fault tolerance)
Painfully Parallel Problems
Given a function that is both commutative and associative (e.g., + or ×):
- Commutative: x + y = y + x
- Associative: (x + y) + z = x + (y + z)
Partition: [5 6 4 1] [4 1 2 5] [6 2 7 6] [3 4 6 1]
Map:       16        12        21        14
Reduce:    63
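The partition/map/reduce picture above can be reproduced with a short Python sketch. As an assumption for illustration, a thread pool stands in for the separate machines, and the helper names `partition` and `partial_sum` are invented here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

data = [5, 6, 4, 1, 4, 1, 2, 5, 6, 2, 7, 6, 3, 4, 6, 1]

def partition(values, size):
    """Split the list into consecutive chunks of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def partial_sum(chunk):
    """Work done independently on one partition."""
    return reduce(add, chunk, 0)

chunks = partition(data, 4)

# Each chunk is summed independently; the worker threads play the role
# of the machines in a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

# The partial results are combined into the final answer.
total = reduce(add, partials, 0)
print(partials, total)
```

Because addition is commutative and associative, the partial sums can be computed in any order and grouped in any way without changing the total.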
Hadoop as a product
- Cloud computing: sell time to make profitable use of excess capacity.
- Google, Yahoo! and Amazon offer cloud computing services.
- Customers submit jobs to the vendor, who runs them in parallel.
- Hadoop is written in Java.
Counting example - Java-style pseudocode
mapper:
    void map(String name, String document) {
        // name: document name
        // document: document contents
        for each word in document:
            EmitIntermediate(word, "1");
    }
reducer:
    void reduce(String word, Iterator partialcounts) {
        // word: a word
        // partialcounts: a list of aggregated partial counts
        int sum = 0;
        for each pc in partialcounts:
            sum += ParseInt(pc);
        Emit(word, AsString(sum));
    }
The Components
HDFS: Hadoop Distributed File System.
- NameNode - tracks where in the cluster HDFS data is kept
- DataNode - stores HDFS data blocks and replicates them (see later)
- JobTracker - assigns MapReduce tasks to nodes in the cluster
- TaskTracker - node process that runs the MapReduce tasks assigned by the JobTracker and reports on them
The role of disk files
- Hadoop has its own file system, HDFS, built on top of the native OS file system.
- Very large files are possible: some span more than one disk/machine. This raises serious reliability issues.
- HDFS data is replicated: by default each block exists in 3 copies, stored on separate machines.
- Note: keeping input and output files in HDFS minimizes the communication cost of shipping data.
Slogan: moving computation is cheaper than moving data.
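To make replication and the "move computation to the data" slogan concrete, here is a toy Python sketch. It rests on loud assumptions: the round-robin placement in `place_replicas` is a hypothetical policy invented for illustration (the real HDFS placement is rack-aware and more involved), and the node names and the `pick_worker` helper are made up.

```python
REPLICATION = 3  # the default HDFS replication factor

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration; not the actual HDFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def pick_worker(block, placement, busy):
    """Prefer a node that already stores the block, so no data has to move."""
    for node in placement[block]:
        if node not in busy:
            return node
    return None  # every replica holder is busy; the data would have to be shipped

# Made-up cluster of four nodes holding five blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(5, nodes)
worker = pick_worker(0, placement, busy={"node-a"})
print(placement[0], worker)
```

With three copies of every block, a task can still run next to its data when one replica holder is busy or has failed, which serves both fault tolerance and data locality.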
Abstraction with Pig
- MapReduce: a programming model for parallel processing, introduced in Google's 2004 paper.
- Hadoop provides a framework for running MapReduce jobs written in Java.
- Pig: a high-level data flow language for processing data.
- A Pig script describes steps that are executed by running one or more MapReduce jobs.
Pig
- Pig is an SQL-like language and an abstraction layer over the MapReduce framework.
- As a higher-level abstraction of MapReduce, it eases problem solving and programming.
- It shares the advantages and disadvantages of MapReduce.
Word count example using Pig
    A = LOAD '/raw_data/' USING TextLoader();
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
    C = GROUP B BY $0;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO '/myvolume/wordcount';
Hadoop demo
Resources
Java
- http://download.oracle.com/javase/7/docs/api/java/lang/Thread.html
- http://download.oracle.com/javase/tutorial/essential/concurrency/
MapReduce
- http://labs.google.com/papers/mapreduce.html
- http://hadoop.apache.org/