Hadoop and Map-reduce computing


1 Introduction

This activity contains a great deal of background information and detailed instructions, so that you can refer back to it for later activities and homework.

Map-reduce is a strategy for solving problems in two stages of processing, with a sort action between those two stages. This problem-solving strategy has been used for decades in functional programming languages such as LISP and Scheme. More recently, Google adapted the map-reduce programming model (Dean and Ghemawat, 2004) to run efficiently on large clusters, in order to process vast amounts of data, such as Google's collection of the entire web. The Apache Foundation provides an open-source implementation of map-reduce for clusters called Hadoop, which has primarily been implemented by Yahoo!. St. Olaf student researchers have created a user interface called WebMapReduce (WMR) that uses Hadoop to make map-reduce programming convenient enough for CS1 students to use.

2 Map-reduce computing

In map-reduce programming, a programmer provides two functions, called the mapper and the reducer, for carrying out a sequence of two computational stages on potentially vast quantities of data. The mapper function is applied to each line of data and breaks that data up into labeled units of interest, called key-value pairs. The reducer function is then applied to all the key-value pairs that share the same key, allowing some kind of consolidation of those pairs. Between the mapper and reducer stages, a map-reduce system automatically reorganizes the intermediate key-value pairs, so that each call of the reducer function receives a complete set of key-value pairs for a particular key, and so that the reducer function is called for every key in sorted order. We will refer to this reorganization of key-value pairs between the mapper and reducer stages as shuffling.
A map-reduce implementation such as Hadoop takes care of details like splitting the data and shuffling, so a map-reduce programmer need only provide the mapper and reducer algorithms. The following diagram illustrates the three steps of mapping, shuffling, and reducing.
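The three stages can also be simulated in a few lines of Python. This sketch is for illustration only; it is not part of WMR or Hadoop, and the function names are invented for the example:

```python
from itertools import groupby

def mapper(line):
    # map stage: emit a (word, 1) key-value pair for each word in the line
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # reduce stage: consolidate all pairs that share the same key
    yield (key, sum(values))

def map_reduce(lines):
    # apply the mapper to every line of input
    pairs = [kv for line in lines for kv in mapper(line)]
    # shuffle: regroup the pairs so that each reducer call sees the
    # complete set of values for one key, with keys in sorted order
    pairs.sort(key=lambda kv: kv[0])
    result = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        result.extend(reducer(key, [v for _, v in group]))
    return result

print(map_reduce(["the cat in the hat"]))
# [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

A real map-reduce system performs the same three stages, but distributes each of them across the machines of a cluster.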

3 Wordcount, an example

The goal is to start with an input text (of arbitrary size), and print a table of frequencies of each word appearing in that text. For example, if the input is the phrase "the cat in the hat", then the following table should be produced:

    cat 1
    hat 1
    in 1
    the 2

The mapper function will receive one line of input at a time. It will split up that line into words, then produce a key-value pair

    word 1

for each word found in the line. A call to the reducer function will receive all the key-value pairs for some word, e.g., the two pairs ("the" "1") produced for the input above. That call to the reducer function will add the values, then produce one key-value pair representing the count for that word:

    the 2

We can represent this algorithm schematically as follows:

    Mapper:        "the cat in the hat"
                            V             [mapper for this line]
          /----------+----------+-----------+-----------\
          V          V          V           V           V
     ("the" "1") ("cat" "1") ("in" "1") ("the" "1") ("hat" "1")

    Reducer:   ("the" "1") ("the" "1")
                \-----+-----/
                      V                   [reducer for the key "the"]
                  ("the" "2")

Scheme implementation for the mapper function, for our WebMapReduce (WMR) system:

    (define helper
      (lambda (lis) ; a list of strings
        (if (null? lis)
            #t
            ;; at least one element in lis
            (begin
              (wmr-emit (car lis) "1")
              (helper (cdr lis))))))

    (define mapper
      (lambda (key value)
        (helper (wmr-split key))))

Here is an alternative implementation using cond instead of if:

    (define helper
      (lambda (lis) ; a list of strings
        (cond ((null? lis) #t)
              ;; at least one element in lis
              (else (wmr-emit (car lis) "1")
                    (helper (cdr lis))))))

    (define mapper
      (lambda (key value)
        (helper (wmr-split key))))

Notes:

- The WMR system calls a function named mapper once for each line of input. When the input data consists of plain text, not key-value pairs, each line of input text becomes a key, and the value is treated as empty.
- helper is a tail-recursive helper function for producing all the key-value pairs (word 1) for a list of word strings.
- The key-value pairs are not actually represented as Scheme strings. Use the function wmr-emit to produce a key-value pair appropriate for our system. wmr-emit takes exactly two arguments, both of which must be strings.
- The function wmr-split, when used with one string argument, produces a list of strings representing the words in that argument. "Words" are defined as sequences of non-space characters, by default.
- The value #t is returned when there are no more key-value pairs to process. This value #t is not passed on to the shuffler or reducer; only emitted key-value pairs go forward.

Scheme implementation for the reducer function, for WMR:

    (define loop
      (lambda (key val-stream ct) ; a key, a value stream, and the subtotal so far
        (if (stream-pair? val-stream)
            (loop key
                  (stream-cdr val-stream)
                  (+ ct (string->number (stream-car val-stream))))
            ;; assert -- no more input values to add
            (wmr-emit key (number->string ct)))))

    (define reducer
      (lambda (key iter)
        (loop key iter 0)))

Here is an alternative implementation, using cond instead of if:

    (define loop
      (lambda (key val-stream ct) ; a key, a value stream, and the subtotal so far
        (cond ((stream-pair? val-stream)
               (loop key
                     (stream-cdr val-stream)
                     (+ ct (string->number (stream-car val-stream)))))
              ;; assert -- no more input values to add
              (else (wmr-emit key (number->string ct))))))

    (define reducer
      (lambda (key iter)
        (loop key iter 0)))

Notes:

- The WMR system calls a function named reducer once for each key appearing in the mapper output. That key is passed as the first argument to reducer; all the values for that key are passed in a stream (called val-stream in this example).
- Streams in Scheme are like lists, except that not all the elements are necessarily held in memory. For the most part, you can treat them like normal lists, except that rather than the standard functions like car and cdr, you must use their stream counterparts:
  - stream-pair? returns #t if the stream passed as its argument has any more values, and #f otherwise. Its opposite is stream-null?.
  - stream-car returns the first value in the stream.
  - stream-cdr returns a new stream that starts at the next value in the current stream. If there is no next value in the current stream, the result will be equal to the empty stream, stream-null (similar to the empty list '()).
- Note: The reducers in WMR are written using streams in order to handle potentially enormous data. If there were billions or trillions of values for a certain key, an ordinary list could become too large to store in a single computer's memory. Streams, on the other hand, only keep the values that are currently being used, which (hopefully!) will not overflow memory.
- loop is a tail-recursive helper function for adding up the values.
- The predefined functions string->number and number->string are useful for converting between integer and string forms of numbers.

4 Trying it out

1. In a browser, visit the WMR site.
Login using the WMR username and password provided for you, then select "Launch WMR".
2. Enter a jobname (perhaps involving your username, for uniqueness; I'd avoid spaces in the jobname).
3. Choose the Scheme language.
4. Enter the input data, e.g., the "the cat in the hat" line above. You can use the "Direct Input" option and enter that data in the text box provided.
5. Enter the mapper. You can use the "Direct Input" option as before; alternatively, you can use the "Upload" option and navigate to a file that contains the mapper (this is more convenient for repeated runs). Check that the appropriate radio button is selected to indicate the source option you're actually using.
6. Also enter the reducer.
7. Click on the submit button.
8. A status page should appear, indicating that the job started successfully and is running. Refresh the page (titled "Job in Progress") as desired until the job completes.
9. Once the job runs to completion, refreshing the job status page will lead to a Job Complete page. This page will include your output. If you used the "cat" input, your output should match the illustration above.
10. Try running the program again, this time with larger data (ask your instructor where you can find larger data sets, such as whole books).

5 More on wmr-split; regular expressions

wmr-split accepts an optional second argument, which describes a pattern for word delimiters (i.e., the characters that do not count as part of a word). Examples:

    (wmr-split "This is, the end" ",")     --> ("This is" " the end")
    (wmr-split "This is  the end" " ")     --> ("This" "is" "" "the" "end")
    (wmr-split "This is  the end" " +")    --> ("This" "is" "the" "end")
    (wmr-split "Hello,  world!" " ")       --> ("Hello," "" "world!")
    (wmr-split "Hello,  world!" "[ ,]")    --> ("Hello" "" "" "world!")
    (wmr-split "Hello,  world!" "[ !,]+")  --> ("Hello" "world")

The format for describing patterns used by wmr-split is called regular expression format.

6 More about parallel computing

Map-reduce computing is designed for computer clusters. A cluster of computers consists of multiple computers that are networked together and equipped with software so that they can perform cooperative computations on a single problem. The individual computers in a cluster are often called nodes in that cluster.
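The delimiter patterns accepted by wmr-split (section 5 above) are ordinary regular expressions, so the same patterns can be tried out with Python's standard re.split function. This is an illustration only, not part of WMR, and re.split may differ from wmr-split in details such as how trailing empty strings are handled:

```python
import re

# Delimiter patterns are regular expressions, as with wmr-split.
print(re.split(",", "This is, the end"))   # ['This is', ' the end']
print(re.split(" ", "This is  the end"))   # ['This', 'is', '', 'the', 'end']
print(re.split(" +", "This is  the end"))  # ['This', 'is', 'the', 'end']
```

Note how a single-space delimiter produces an empty string between two consecutive spaces, while the pattern " +" treats any run of spaces as one delimiter.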
Map-reduce computations are performed on a cluster in order to carry out large-scale data operations in a reasonable amount of time. We say that map-reduce computations are scalable, meaning that the computations remain feasible even when the size or complexity of those computations increases. The strategy of performing multiple computations at the same time is called parallel computing. Running map-reduce in parallel on the multiple computers of a cluster enables map-reduce to scale to large data sets. Google and other companies run map-reduce on clusters with thousands of nodes, enabling them to compute with large sections of the entire world-wide web. The total number of bytes in such a computation may be on the order of tens of petabytes (1 petabyte is about 1 quadrillion characters)!

Here are two basic categories of parallelism:

1. Data parallelism is when the same algorithm is carried out on different computers with different data at the same time. For example, if map-reduce is applied to a large data set (e.g., gigabytes or terabytes, where a terabyte is about a trillion bytes), that data is first divided up into "slices" that may be processed on separate computers in a cluster.

2. Task parallelism is when different algorithms are carried out at the same time. An assembly line is an example of task parallelism: multiple stations carry out different tasks at the same time. For another example, the mapper and reducer functions represent two different algorithms. If the key-value pairs produced by a mapper running on one computer were forwarded directly to a reducer running on a different computer as soon as those key-value pairs were produced, that would be an example of task parallelism. (In practice, the mapping, shuffling, and reducing stages of map-reduce are each completed in order, for scalability and other reasons.)

WMR uses the Hadoop implementation of map-reduce, which is an open-source software project (meaning that the program code is made publicly accessible) under the Apache Software Foundation. The primary authors of Hadoop are employees of Yahoo!, Inc., and this software is used professionally by Yahoo! and other companies, as well as at educational institutions. (Google's implementation of map-reduce is proprietary, i.e., a company secret.) Programming Hadoop directly, in the Java programming language, requires more programming background than a beginning student can handle. WMR makes it possible for beginning programmers to program map-reduce computations in the language they are learning.
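The idea of data parallelism described above can be sketched with a Python thread pool: the same word-counting function is applied to different slices of the data at the same time, and the partial results are then combined. This is an illustration of the concept only, not how Hadoop itself is implemented, and the names are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def count_words(slice_of_lines):
    # the same algorithm, applied to one slice of the data
    counts = Counter()
    for line in slice_of_lines:
        counts.update(line.split())
    return counts

data = ["the cat in the hat", "the cat sat"]
slices = [[line] for line in data]   # one slice per line, for illustration

# apply count_words to every slice concurrently (data parallelism)
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_words, slices))

# combine the per-slice results into one table of counts
total = sum(partial_counts, Counter())
print(total["the"])   # 3
```

On a cluster, each slice would live on a different node and the combining step would itself be distributed; the shape of the computation, however, is the same.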
Although the programming is simplified, genuine Hadoop map-reduce computations are carried out (except for the testing option), so significant large-scale computing involving parallelism can be performed. Some examples of large data:

- Netflix data, originally distributed for a competition sponsored by Netflix, consists of lines containing a movie ID number, a viewer ID number, a rating, and a date (approximately 2 GB = 2 gigabytes).
- "Snapshots" of Wikipedia are available (approximately 150 MB).
- Facebook data is obviously large, but privacy protections make it difficult to deal with.
- For linguistic analysis via map-reduce, there are many potential sources of data on the Internet, such as online newspapers.
- Scientific data may be taken from experimental readings or generated from a computational model of a phenomenon.
- The Project Gutenberg and Google Books projects provide public-domain complete literary works. For example, we have retrieved Dostoevsky novels and the complete works of Edgar Allan Poe through this source.

7 Multiple map-reduce cycles

WMR enables you to use the output from one map-reduce cycle as the input for a subsequent map-reduce cycle. In that case, the mapper function in the second cycle may use both the key and the value in its computations. Examples:

- Ordering word-count output in frequency order.
- Computing average ratings per movie in a first map-reduce cycle, then ordering by rating in a second cycle.

One could write an identity reducer, which emits key-value pairs without any change (though the output will be sorted, since it has gone through the shuffling step). An identity mapper is similar, in that it emits key-value pairs from lines of input data without alteration.

An identity reducer in Scheme:

    (define loop
      (lambda (key val-stream) ; a key and a value stream
        (if (stream-pair? val-stream)
            (begin
              (wmr-emit key (stream-car val-stream))
              (loop key (stream-cdr val-stream)))
            #t)))

    (define reducer
      (lambda (key val-stream)
        (loop key val-stream)))

An identity mapper in Scheme:

    (define mapper
      (lambda (key value)
        (wmr-emit key value)))

8 Creating an index

Suppose that the input for the mapper stage consists of a location identifier (e.g., a line number) as the key and a line of text as the value. How could one produce an index, where each word in the text is listed together with the line numbers on which it appears? A concordance consists of the words in a text together with the lines that contain them, for context. How could you create a concordance using WMR?
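As a sketch of the first example in section 7 (ordering word-count output by frequency), here is a Python simulation of a second cycle whose mapper swaps each (word, count) pair, so that the shuffling step sorts by count. The names run_cycle, swap_mapper, and identity_reducer are invented for this illustration; note also that in WMR keys are strings, so numeric counts would sort lexicographically unless handled with care:

```python
from itertools import groupby

def run_cycle(pairs, mapper, reducer):
    # one map-reduce cycle: map, then shuffle (group by sorted key), then reduce
    mapped = [kv for pair in pairs for kv in mapper(*pair)]
    mapped.sort(key=lambda kv: kv[0])
    result = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        result.extend(reducer(key, [v for _, v in group]))
    return result

def swap_mapper(word, count):
    # second-cycle mapper: swap key and value so shuffling sorts by count
    yield (count, word)

def identity_reducer(key, values):
    # emit each key-value pair unchanged
    for v in values:
        yield (key, v)

counts = [("cat", 1), ("hat", 1), ("in", 1), ("the", 2)]
print(run_cycle(counts, swap_mapper, identity_reducer))
# [(1, 'cat'), (1, 'hat'), (1, 'in'), (2, 'the')]
```

The first cycle is the wordcount program from section 3; feeding its output through this second cycle yields the words ordered by frequency.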