CS 378 Big Data Programming. Lecture 2 Map- Reduce



Similar documents
CS 378 Big Data Programming

Machine- Learning Summer School

CS 378 Big Data Programming. Lecture 1 Introduc:on

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns

Chapter 7. Using Hadoop Cluster and MapReduce

Map Reduce & Hadoop Recommended Text:

Hadoop and Map-reduce computing

Testing 3Vs (Volume, Variety and Velocity) of Big Data

How To Scale Out Of A Nosql Database

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Map Reduce / Hadoop / HDFS

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Introduction to Hadoop

10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Hands-on Exercises with Big Data

Introduction to Hadoop

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Distributed Computing and Big Data: Hadoop and MapReduce

How To Use Hadoop

Big Data and Apache Hadoop s MapReduce

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Introduction to Hadoop

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop WordCount Explained! IT332 Distributed Systems

Developing a MapReduce Application

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

L1: Introduction to Hadoop

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Getting Started with Hadoop with Amazon s Elastic MapReduce

Big Data With Hadoop

Using BAC Hadoop Cluster

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

ITG Software Engineering

CS 378 Big Data Programming. Lecture 24 RDDs

Hadoop Installation MapReduce Examples Jake Karnes

Performance and Energy Efficiency of. Hadoop deployment models

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Open source Google-style large scale data analysis with Hadoop

The Hadoop Framework

CSE-E5430 Scalable Cloud Computing Lecture 2

To reduce or not to reduce, that is the question

Yuji Shirasaki (JVO NAOJ)

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Cloud Computing. Chapter Hadoop

map/reduce connected components

Hadoop Job Oriented Training Agenda

Introduction To Hive

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Hadoop and Map-Reduce. Swati Gore

Connecting Hadoop with Oracle Database

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Large scale processing using Hadoop. Ján Vaňo

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Understanding Hadoop Performance on Lustre

Can the Elephants Handle the NoSQL Onslaught?

A Scalable Data Transformation Framework using the Hadoop Ecosystem

MapReduce. Tushar B. Kute,

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Open source large scale distributed data management with Google s MapReduce and Bigtable

Hadoop Cluster Applications

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data and Scripting map/reduce in Hadoop

How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)

Hadoop. Sunday, November 25, 12

Getting to know Apache Hadoop

Sriram Krishnan, Ph.D.

Suresh Lakavath csir urdip Pune, India

HiBench Introduction. Carson Wang Software & Services Group

Big Data Analytics* Outline. Issues. Big Data

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Internals of Hadoop Application Framework and Distributed File System

Real Time Big Data Processing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

Improving Data Processing Speed in Big Data Analytics Using. HDFS Method

GraySort and MinuteSort at Yahoo on Hadoop 0.23

NoSQL and Hadoop Technologies On Oracle Cloud

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

MarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015

Testing Big data is one of the biggest

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Introduction to Cloud Computing

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Transcription:

CS 378 Big Data Programming Lecture 2 Map- Reduce

MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments For the most part, map and reduce tasks are stateless Write once, read muljple Jmes Data Warehouse has this intended usage (write once) Unstructured data vs. structured/normalized Data pipelines are common Chain of MR jobs, with intermediate results

MapReduce Table 1-1, Hadoop The DefiniJve Guide Tradi&onal RDBMS MapReduce Data Size Gigabytes Petabytes Access InteracJve and batch Batch Updates Read and write many Jmes Write once, read many Structure StaJc schema Dynamic schema Integrity High Low Scaling Nonlinear Linear

MapReduce Tom White, in Hadoop: The Defini/ve Guide MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing /me. In other words, the input keys and values for MapReduce are not intrinsic proper/es of the data, but they are chosen by the persona analyzing the data.

MapReduce When wrijng a MapReduce program You don t know the size of the data You don t know the extent of the parallelism MapReduce tries to collocate the data with the compute node Parallelize the I/O Make the I/O local (versus across network)

MapReduce As the name implies, for each problem we ll write Map method/funcjon Reduce method/funcjon Terms from funcjonal programming Map Apply a funcjon to each input, output the result Reduce Given a list of inputs, compute some output value

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as the shuffle, as each reduce task is fed by many map tasks. The MapReduce in Hadoop Figure 2.4, Hadoop - The DefiniJve Guide shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort on page 208. Figure 2-4. MapReduce data flow with multiple reduce tasks

Map FuncJon Map input is a stream of key/value pairs Web logs: Server name (key), log entry (value) Sensor reading: sensor ID (key), sensed values (value) Document ID (key), contents (value) Map funcjon processes each input pair in turn For each input pair, the map funcjon can (but isn t required) to emit one or more key/value pairs Key/value pair(s) derived from the input key/value pair Does not need to be the same key or value data type

Reduce FuncJon Reduce input is a stream of key/value- list pairs These are the key value pairs emieed by the map funcjon Reduce funcjon processes each input pair in turn For each input pair, the reduce funcjon can (but isn t required) to emit a key/value pair Key value pair derived from the input key/value- list pair Does not need to be the same key or value data type

WordCount Example For an input text file of arbitrary size, or MulJple text files of arbitrary size, or An arbitrary number of documents Count the number occurrences of all the words that appear in the input. Output: word1, count word2, count

WordCount Example - Map Map input is a stream of key/value pairs File posijon in bytes (key), line of text (value) Map funcjon processes each input pair in turn Extract each word from the line of text Emits a key/value pair for each word: <the- word, 1> For each input pair, the map funcjon emits muljple key/value pairs Key is a text string (the word), value is a number

WordCount Example - Reduce Reduce input is a stream of key/value- list pairs These are the key value pairs emieed by the map funcjon Key is a text string (the word), value is a list of some number of the value 1 Hadoop has grouped data together by key Reduce funcjon processes each input pair in turn Sums the values in the value- list For each input pair, the reduce funcjon emits a key/ value pair Key is a text string (the word), value is total count for that word

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as the shuffle, as each reduce task is fed by many map tasks. The MapReduce in Hadoop Figure 2.4, Hadoop - The DefiniJve Guide shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort on page 208. Figure 2-4. MapReduce data flow with multiple reduce tasks

MapReduce (from cubrid.org)

Java and Maven Review Directory structure expected by maven (supported in IDEs): Project directory (example name: bdp) Source code directory: bdp/src/main/java The Java package structure appears in the java directory Ex: bdp/src/main/java/com/refactorlabs/cs378/assign1 A class defined in the com.refactorlabs.cs378/assign1 package placed here Ex: bdp/src/main/java/com/refactorlabs/cs378/assign1/wordcount.java Easy setup - Create you project directory Place pom.xml in this directory Place WordCount.java as shown above Import the maven pom.xml into your IDE.

Assignment ArJfacts For each assignment, there will be one or more arjfacts to submit: Java code Source files in one directory (for easy inspecjon) Source files in src/main/java/ structure (use tar ) Build info: pom.xml file used for maven An inijal pom.xml file will be provided, and we ll expand this during the semester Program outputs Extracted from HDFS ArJfacts required for each assignment will be listed.

Assignment 1 Build a JAR file Upload to AWS S3 Create a cluster using ElasJc MapReduce (EMR) Run your map- reduce job on EMR cluster Download the output