CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns



Similar documents
CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming

How MapReduce Works 資碩一 戴睿宸

6. How MapReduce Works. Jari-Pekka Voutilainen

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

CS 378 Big Data Programming. Lecture 1 Introduc:on

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

YARN and how MapReduce works in Hadoop By Alex Holmes

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

A Tutorial Introduc/on to Big Data. Hands On Data Analy/cs over EMR. Robert Grossman University of Chicago Open Data Group

Introduc)on to Map- Reduce. Vincent Leroy

MapReduce. Tushar B. Kute,

Big Data and Scripting map/reduce in Hadoop

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Hadoop Parallel Data Processing

Understanding Hadoop Performance on Lustre


Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Keywords: Big Data, HDFS, Map Reduce, Hadoop

How To Use Hadoop

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

A very short Intro to Hadoop

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

Hadoop Architecture. Part 1

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

How To Write A Map Reduce In Hadoop Hadooper (Ahemos)

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Lecture 3 Hadoop Technical Introduction CSE 490H

To reduce or not to reduce, that is the question

Chapter 7. Using Hadoop Cluster and MapReduce

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Introduc)on to. Eric Nagler 11/15/11

Map Reduce & Hadoop Recommended Text:

Data-intensive computing systems

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Extreme Computing. Hadoop. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

MapReduce and Hadoop Distributed File System V I J A Y R A O

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Fair Scheduler. Table of contents

Apache Hadoop. Alexandru Costan

Hadoop and Map-reduce computing

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Getting to know Apache Hadoop

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Extreme Computing. Hadoop MapReduce in more detail.

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

Capacity Scheduler Guide

Internals of Hadoop Application Framework and Distributed File System

HADOOP MOCK TEST HADOOP MOCK TEST II

MapReduce and Hadoop Distributed File System

Introduction to Cloud Computing

HADOOP PERFORMANCE TUNING

Dell Reference Configuration for Hortonworks Data Platform

Big Data With Hadoop

MAPREDUCE Programming Model

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Streaming. Table of contents

Hadoop Scheduler w i t h Deadline Constraint

COURSE CONTENT Big Data and Hadoop Training

Survey on Scheduling Algorithm in MapReduce Framework

University of Maryland. Tuesday, February 2, 2010

7 Deadly Hadoop Misconfigurations. Kathleen Ting February 2013

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Machine- Learning Summer School

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

MapReduce, Hadoop and Amazon AWS

Tes$ng Hadoop Applica$ons. Tom Wheeler

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Performance and Energy Efficiency of. Hadoop deployment models

Developing a MapReduce Application

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop and Map-Reduce. Swati Gore

Hadoop Memory Usage Model

Apache Hadoop new way for the company to store and analyze big data

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Data Intensive Computing Handout 6 Hadoop

Introduction to MapReduce and Hadoop

ITG Software Engineering

Transcription:

CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns

Review Assignment 2 Ques9ons? If you d like to use guava (Google collec9ons classes) pom.xml available for assignment 2 Includes dependency for guava Creates an uber JAR for upload to AWS

Summariza9on Other summariza9ons of interest Min, max, mean Suppose we are interested in these metrics for paragraph length (Assignment 2 data) If paragraph lengths are normally distributed, then the median will be very near the mean If the distribu9on of paragraph lengths is skewed, then the mean and median will be very different

Summariza9on Min and max are straighzorward For each paragraph, output two values Min length (the length of the current paragraph) Max length (the length of the current paragraph) Key? Combiner will get a list of value pairs Select the min, max from the list, output the values Key? Reducer does the same

Summariza9on Median Get all the values, sort them, then find the middle Since our computa9on is distributed, no mapper sees all the values Should we send them all to one reducer? Not u9lizing map- reduce (computa9on not distributed) Data sizes likely too large to keep in memory

Summariza9on Median Keep the unique paragraph lengths, and The frequency of each length Map output: <paragraph length, 1> Combiner gets a list of these pairs and updates the count for recurring lengths Reducer does the same, then computes the median

Summariza9on Median Hadoop provides the SortedMapWritable class Can associate a frequency count with a paragraph length Keeps the lengths in sorted order See the example in Chapter 2 of Map- Reduce Design Pa1erns How could we compute all in one pass over the data? min, max, median

Counters Hadoop map- reduce infrastructure provides counters Accessed by group name Cannot have a large number of counters For example, can t use counters to solve WordCount A few tens of counters can be used Counters are stored in memory on JobTracker

Counters Figure 2-6, MapReduce Design Pa:erns

How Hadoop MapReduce Works We ve seen some terms like: Job JobTracker TaskTracker Let s look at what they do Details from Chapter 6, Hadoop: The Defini9ve Guide 3 rd Edi9on

How Hadoop MapReduce Works Figure 6-1, Hadoop: The Defini9ve Guide 3 rd Edi9on

Job Submission Job submission Input files exist? Output directory exist? If yes, it fails. Hadoop expects to create this directory Copy resources to HDFS JAR files Configura9on file Computed file splits

Job Tracker Creates tasks (work to be done) Map task for each input split Requested number of reducer tasks Job setup, job cleanup tasks Map tasks are assigned to task trackers that are close to the input split loca9on Data local preferred Rack local next Reduce task can go anywhere. Why? Scheduling algorithm orders the tasks

Task Tracker Configured for several map and reduce tasks Periodically sends a heartbeat to job tracker I m s9ll alive Ready for new task For a new task Copy files to local file system (JAR, configura9on) Launch a new JVM (TaskRunner) Load the mapper/reducer class and call its method Update the task tracker progress

Task Progress Mapper What por9on of the input has been processed Reducer more complicated Sort, shuffle and reduce are considered here Progress is an es9mate of how much of the total work has been done

Shuffle Figure 6-6, Hadoop: The Defini9ve Guide 3 rd Edi9on

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as the shuffle, as each reduce task is fed by many map tasks. The MapReduce in Hadoop Figure 2.4, Hadoop - The Defini9ve Guide shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort on page 208. Figure 2-4. MapReduce data flow with multiple reduce tasks