Lecture 10 - Functional programming: Hadoop and MapReduce




Sohan Dharmaraja

For today
- Big Data and text analytics
- Functional programming concepts
- MapReduce
- Apache Hadoop

What is Big Data?
A concept referring to data whose size is beyond the ability of commonly used software to handle within acceptable time limits. [1]
[1] Snijders, C., Matzat, U., & Reips, U. (2012). "Big Data": Big gaps of knowledge in the field of Internet science. International Journal of Internet Science.

Problems
- Volume: increasing amounts of data become difficult to handle
- Velocity: processing speed is key; fast insight gives you an edge
- Variety: big data is usually unstructured and very diverse in type

How big is Big Data?
- 90% of the data available today was created in the last 2 years
- 12 TB (12,000,000 MB) of tweets are generated every day
- Data sources are growing: healthcare, weather, stocks, etc.

How is Big Data analyzed?
- Parallel computing (CUDA)
- Distributed computing (Hadoop, MongoDB, etc.)

Big data scenarios

Example 1
Your client's stores are crowded at peak hours, so crowded that customers walk away in frustration. At other times, the stores are nearly empty. They are selling below potential due to cart abandonment and a failure to attract customers throughout the day.

Data scientist scenario: Example 1
The tools: you have a marketing budget and the authority to send email advertisements and to make special offers and promotion schemes. You also have control over staff scheduling and checkout procedures.

Example 2
You are asked to build a recommendation system for an online merchant, i.e., a system that presents information likely to be of interest to the user.

Amazon?

Google?

Brainstorming session
Focus on recommendations for Amazon shoppers.
- What would you say to the client to get the contract?
- How would you approach the problem?
- What do you need from me?

Text analytics
Data cleansing is very important.
- Tokenization (phrases? words? ...)
- Spelling normalization: color, colour
- Orthographic normalization: in Danish, søster = sister
- Morphological normalization: verbs, adverbs, adverbial participles
- Zipf's law (inverse frequency law): "the" accounts for about 7% of words, "of" for about 3.5%
Data cleansing can also be considered part of data representation.

Distributed computing is hard
- Need to be fast: lots of data to churn
- Need to be scalable: varying amounts of data to churn
- Need to be fault-tolerant: intensive processes, hardware failures

Parallel design patterns
- So far: low-level kernel programming (thread mapping, etc.)
- Think at a higher level than individual CUDA kernels
- Specify what to compute, not how to compute it
- Let the programmer worry about the algorithm
Functional approaches are motivated by the points above.

Functional programming?
Languages: Lisp, ML, Haskell, etc.
A different (often useful) perspective:
- Recursion (no loops allowed)
- Function pointers (map, reduce, etc.)
Finds uses in:
- programming language theory (logic, proof systems, etc.)
- design of compilers
- concurrent programming

Roadmap for MapReduce
- Map: applies a process to data
- Reduce: combines results into answers
- I.e., divide and conquer for big data, introduced by Google in 2004

Passing functions as arguments

    // square a number
    int f(int x) { return x * x; }

    // apply the function pointed to by f to a number
    int g(int (*f)(int), int x) { return (*f)(x); }

    // pass f as a function pointer to g
    int main(void) {
        int res = g(f, 4);   // res is 16
        return 0;
    }

Map: applying a function to all elements

    int inc(int x) { return x + 1; }

    // apply function pointer f to every element of an array, in place
    void map(int* ary, int n, int (*f)(int)) {
        if (n == 0) return;        // base case: nothing left to process
        *ary = (*f)(*ary);         // transform the first element
        map(ary + 1, n - 1, f);    // recurse on the rest of the array
    }

    int main(void) {
        int ary[5] = {1, 2, 3, 4, 5};
        map(ary, 5, inc);          // ary is now {2, 3, 4, 5, 6}
        return 0;
    }

Reduce: combining the elements of an array

    int add(int x, int y) { return x + y; }
    int mul(int x, int y) { return x * y; }

    // combine every element of the array with function pointer f;
    // b is the base value returned for an empty array
    int reduce(int* ary, int n, int (*f)(int, int), int b) {
        if (n == 0) return b;
        return (*f)(reduce(ary + 1, n - 1, f, b), *ary);
    }

    int main(void) {
        int ary[5] = {1, 2, 3, 4, 5};
        int sum = reduce(ary, 5, add, 0);   // 15
        int fac = reduce(ary, 5, mul, 1);   // 120
        return 0;
    }

Roadmap for MapReduce
- Map: assigns processes to machines
- Reduce: combines the machines' results into answers
- These interactions have consequences
- Sometimes parallelization is obvious, sometimes not
- Recursion/map/reduce often help to decide how
- Again: divide and conquer mentality

MapReduce - Big Picture
- A programming model for processing large data sets in batch
- Designed to execute on a cluster of commodity hardware
- Tasks are allowed to fail and can be retried
- Brings code to data, rather than data to code
- Limits communication by allowing only certain operations in a flow
- All inspired by functional programming

Central MapReduce Ideas
- Operate on key-value pairs
- The data scientist provides map and reduce
- Flow:
  (input) <k1, v1> -> map -> <k2, v2>
  <k2, v2> -> combine, sort -> <k2, v2>
  <k2, v2> -> reduce -> <k3, v3> (output)
- An efficient sort is provided by the MapReduce library

MapReduce Example - Word Count
Example: two text files
- file1: Hello World Bye World
- file2: Hello Hadoop Goodbye Hadoop

MapReduce Example - Word Count: Map step
First map emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Second map emits:
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

MapReduce Example - Word Count: Sort & Combine step
Sorted output of first map:
  <Bye, 1> <Hello, 1> <World, 2>
Sorted output of second map:
  <Goodbye, 1> <Hadoop, 2> <Hello, 1>

MapReduce Example - Word Count: Reduce step
The reduce method sums the values for each key. The output of the job is:
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

Caveats
- Not all problems fit well (or at all) within this model
- The model is not suitable for real-time processing of data
- It may not scale linearly with the size of your input data
- Easier than starting from scratch, but it can be tricky to fit your problem to it

What is MapReduce good at solving?
- Identify, transform, aggregate, filter, count, sort...
- Discovery tasks (vs. high repetition of similar tasks, many reads)
- Unstructured data (vs. tabular data, indexes)
- Continuously updated data (indexing cost)
- Many, many, many machines (fault tolerance)

Painfully Parallel Problems

Painfully Parallel Problems
Given a function that is both commutative and associative (e.g., + or ×):
- Commutative: x + y = y + x
- Associative: (x + y) + z = x + (y + z)
Partition: [5 6 4 1] [4 1 2 5] [6 2 7 6] [3 4 6 1]
Map: 16 12 21 14
Reduce: 63

Hadoop as a product
- Cloud computing: sell time to make profitable use of excess capacity
- Google, Yahoo! and Amazon offer cloud computing services
- Customers submit jobs to the vendor, who runs them in parallel
- Hadoop is written in Java

Counting example - Java-style pseudocode

mapper:

    void map(String name, String document) {
        // name: document name
        // document: document contents
        for each word w in document:
            EmitIntermediate(w, "1");
    }

reducer:

    void reduce(String word, Iterator partialCounts) {
        // word: a word
        // partialCounts: a list of aggregated partial counts
        int sum = 0;
        for each pc in partialCounts:
            sum += ParseInt(pc);
        Emit(word, AsString(sum));
    }

The Components
HDFS: Hadoop Distributed File System
- NameNode: tracks where in the cluster HDFS data is kept
- DataNode: stores HDFS data blocks, which are replicated across DataNodes (see later)
MapReduce
- JobTracker: assigns MapReduce tasks to nodes in the cluster
- TaskTracker: runs the MapReduce tasks it receives from the JobTracker on its node and reports their status

The role of disk files
- Hadoop has its own file system, HDFS, built on top of the native OS file system.
- Very large files are possible: some span more than one disk or machine.
- This raises serious reliability issues.
- HDFS data is replicated: by default each block exists in 3 copies, stored on separate machines where possible.
- Note: keeping input and output files in HDFS minimizes the communication cost of shipping data.
- Slogan: moving computation is cheaper than moving data.

Abstraction with Pig
- MapReduce: a programming model for parallel processing, introduced in Google's 2004 paper.
- Hadoop provides a framework for running MapReduce jobs written in Java.
- Pig: a high-level data flow language for processing data.
- A Pig script describes steps that are executed by running one or more MapReduce jobs.

Pig
- Pig is an SQL-like language and an abstraction layer over the MapReduce framework.
- As a higher-level abstraction of MapReduce, it can ease problem solving and programming.
- It shares the advantages and disadvantages of MapReduce.

Word count example using Pig

    A = LOAD '/raw_data/' USING TextLoader();
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
    C = GROUP B BY $0;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO '/myvolume/wordcount';

Hadoop demo

Resources
Java
- http://download.oracle.com/javase/7/docs/api/java/lang/Thread.html
- http://download.oracle.com/javase/tutorial/essential/concurrency/
MapReduce
- http://labs.google.com/papers/mapreduce.html
- http://hadoop.apache.org/