CS 378 Big Data Programming




CS 378 Big Data Programming
Lecture 2: MapReduce
CS 378 - Fall 2015

MapReduce
- Large data sets are not new
- What characterizes a problem suitable for MR?
  - Most or all of the data is processed
    - But viewed in small increments
    - For the most part, map and reduce tasks are stateless
  - Write once, read multiple times
    - Data warehouse has this intended usage (write once)
  - Unstructured data vs. structured/normalized
  - Data pipelines are common
    - Chain of MR jobs, with intermediate results

MapReduce
Table 1-1, Hadoop: The Definitive Guide

                Traditional RDBMS            MapReduce
  Data size     Gigabytes                    Petabytes
  Access        Interactive and batch        Batch
  Updates       Read and write many times    Write once, read many
  Transactions  ACID                         None
  Structure     Schema-on-write              Schema-on-read
  Integrity     High                         Low
  Scaling       Nonlinear                    Linear

MapReduce
Tom White, in Hadoop: The Definitive Guide:
"MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data."

MapReduce
- When writing a MapReduce program
  - You don't know the size of the data
  - You don't know the extent of the parallelism
- MapReduce tries to collocate the code and the data on a compute node
  - Parallelize the I/O
  - Make the I/O local (versus across the network)

MapReduce
- As the name implies, for each problem we'll write
  - Map method/function
  - Reduce method/function
- Terms from functional programming
  - Map: apply a function to each input, output the result
  - Reduce: given a list of inputs, compute some output value
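
The two functional-programming terms can be illustrated with plain Java streams. This is only a sketch of the concepts, not the Hadoop API; the class and method names (FunctionalMapReduce, wordLengths, total) are hypothetical and chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: plain Java streams, not Hadoop classes.
class FunctionalMapReduce {

    // "Map": apply a function (here, String::length) to each input and output the result.
    static List<Integer> wordLengths(List<String> words) {
        return words.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // "Reduce": given a list of inputs, compute a single output value (here, the sum).
    static int total(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> lengths = wordLengths(Arrays.asList("map", "reduce", "hadoop"));
        System.out.println(lengths);        // prints [3, 6, 6]
        System.out.println(total(lengths)); // prints 15
    }
}
```

In Hadoop the same two roles are played by methods that operate on key/value pairs rather than on bare values.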

MapReduce in Hadoop
[Figure 2-4. MapReduce data flow with multiple reduce tasks (Figure 2-4, Hadoop: The Definitive Guide)]
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in "Shuffle and Sort" on page 208.
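
One way to picture why each reduce task is fed by many map tasks: every emitted key is assigned to a reduce task by a partitioning rule, so all pairs with the same key meet at the same reducer regardless of which map task produced them. The sketch below mirrors the formula used by Hadoop's default hash partitioner, but it hashes plain Java strings rather than Hadoop key types, so the actual partition numbers on a cluster will differ. ShuffleSketch, partitionFor, and route are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: routes keys to reduce tasks the way a hash partitioner would.
class ShuffleSketch {

    // Non-negative hash of the key, modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Distribute keys (as emitted by any number of map tasks) into one bucket per reduce task.
    static List<List<String>> route(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Keys emitted by two different (imaginary) map tasks, concatenated.
        List<String> keys = Arrays.asList("the", "fox", "dog", "the", "fox", "the");
        List<List<String>> buckets = route(keys, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reduce task " + i + " receives " + buckets.get(i));
        }
    }
}
```

Every occurrence of a given key lands in the same bucket, which is what lets a reduce function see the complete value list for that key.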

Map Function
- Map input is a stream of key/value pairs
  - Web logs: server name (key), log entry (value)
  - Sensor reading: sensor ID (key), sensed values (value)
  - Document ID (key), contents (value)
- Map function processes each input pair in turn
- For each input pair, the map function can (but isn't required to) emit one or more key/value pairs
  - Key/value pair(s) derived from the input key/value pair
  - Output does not need to have the same key or value data type as the input

Reduce Function
- Reduce input is a stream of key/value-list pairs
  - These are the key/value pairs emitted by the map function
- Reduce function processes each input pair in turn
- For each input pair, the reduce function can (but isn't required to) emit a key/value pair
  - Key/value pair derived from the input key/value-list pair
  - Output does not need to have the same key or value data type as the input
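
The shapes of the two functions can be sketched in plain Java using the sensor example from the map slide. This is an assumption-laden illustration, not Hadoop code: SensorSketch and its methods are hypothetical, the raw reading is assumed to arrive as text, and taking the maximum is just one possible reduction. It shows a map that may emit nothing for an input pair and output types that differ from the input types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: map and reduce written as ordinary static methods.
class SensorSketch {

    // Map: (sensor ID, raw reading as text) -> zero or one (sensor ID, numeric reading).
    // The output value type (Double) differs from the input value type (String).
    static List<Map.Entry<String, Double>> map(String sensorId, String rawReading) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        try {
            double value = Double.parseDouble(rawReading.trim());
            out.add(new SimpleEntry<String, Double>(sensorId, value));
        } catch (NumberFormatException e) {
            // Malformed reading: emit nothing for this input pair.
        }
        return out;
    }

    // Reduce: (sensor ID, list of readings) -> (sensor ID, maximum reading).
    static Map.Entry<String, Double> reduce(String sensorId, List<Double> readings) {
        double max = Double.NEGATIVE_INFINITY;
        for (double reading : readings) {
            max = Math.max(max, reading);
        }
        return new SimpleEntry<String, Double>(sensorId, max);
    }

    public static void main(String[] args) {
        System.out.println(map("s1", "21.5"));  // prints [s1=21.5]
        System.out.println(map("s1", "n/a"));   // prints []
        System.out.println(reduce("s1", Arrays.asList(21.5, 19.0, 23.25))); // prints s1=23.25
    }
}
```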

WordCount Example
- For an input text file of arbitrary size, or
- Multiple text files of arbitrary size, or
- An arbitrary number of documents
- Count the occurrences of all the words that appear in the input.
- Output:
    word1, count
    word2, count

WordCount Example - Map
- Map input is a stream of key/value pairs
  - File position in bytes (key), line of text (value)
- Map function processes each input pair in turn
  - Extracts each word from the line of text
  - Emits a key/value pair for each word: <the-word, 1>
- For each input pair, the map function emits multiple key/value pairs
  - Key is a text string (the word), value is a number

WordCount Example - Reduce
- Reduce input is a stream of key/value-list pairs
  - These are the key/value pairs emitted by the map function
  - Key is a text string (the word), value is a list of some number of the value 1
  - Hadoop has grouped data together by key
- Reduce function processes each input pair in turn
  - Sums the values in the value-list
- For each input pair, the reduce function emits a key/value pair
  - Key is a text string (the word), value is total count for that word
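
The whole WordCount flow (map, group by key, reduce) can be simulated in a single JVM with no Hadoop dependency. The following is a minimal sketch under stated assumptions: WordCountSketch and its run driver are hypothetical stand-ins for what Hadoop does across a cluster, words are split on whitespace only, and the byte position is approximated by assuming one byte per character plus a newline. It is not the WordCount.java used in the course.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process simulation of the WordCount data flow.
class WordCountSketch {

    // Map: (file position, line of text) -> one <word, 1> pair per word in the line.
    static List<Map.Entry<String, Integer>> map(long position, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // Reduce: (word, list of 1s) -> (word, total count).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return new SimpleEntry<String, Integer>(word, sum);
    }

    // Stand-in for the framework: map each line, group values by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long position = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(position, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                       .add(pair.getValue());
            }
            position += line.length() + 1; // assumption: 1 byte per char plus a newline
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            Map.Entry<String, Integer> result = reduce(group.getKey(), group.getValue());
            counts.put(result.getKey(), result.getValue());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the quick brown fox", "the lazy dog", "the fox")));
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

The map and reduce methods never see the size of the input or the number of lines; only the driver does, which matches the point that a MapReduce program is written without knowing the data size or the extent of parallelism.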


MapReduce
[Figure: MapReduce diagram (from cubrid.org)]

Java and Maven Review
- Directory structure expected by Maven (supported in IDEs):
  - Project directory (example name: bdp)
  - Source code directory: bdp/src/main/java
  - The Java package structure appears in the java directory
    - Ex: bdp/src/main/java/com/refactorlabs/cs378/assign1
  - A class defined in the com.refactorlabs.cs378.assign1 package is placed here
    - Ex: bdp/src/main/java/com/refactorlabs/cs378/assign1/WordCount.java
- Easy setup
  - Create your project directory
  - Place pom.xml in this directory
  - Place WordCount.java as shown above
  - Import the Maven pom.xml into your IDE.

Assignment Artifacts
- For each assignment, there will be one or more artifacts to submit:
  - Java code
    - Source files in one directory (for easy inspection)
    - Source files in src/main/java/ structure (use tar)
  - Build info: pom.xml file used for Maven
    - An initial pom.xml file will be provided, and we'll expand this during the semester
  - Program outputs
    - Extracted from HDFS
- Artifacts required for each assignment will be listed.

Assignment 1
- Build a JAR file
- Upload to AWS S3
- Create a cluster using Elastic MapReduce (EMR)
- Run your map-reduce job on the EMR cluster
- Download the output
- Artifacts to submit
  - zip or tar of all source files in one (flat) directory
  - tar of project (will include pom.xml and source files)
  - Output