Developing a MapReduce Application

Similar documents
Chapter 7. Using Hadoop Cluster and MapReduce

Understanding Hadoop Performance on Lustre

CSE-E5430 Scalable Cloud Computing Lecture 2

HADOOP MOCK TEST HADOOP MOCK TEST II

Package hive. January 10, 2011

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

A Performance Analysis of Distributed Indexing using Terrier

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Map Reduce & Hadoop Recommended Text:


CS 378 Big Data Programming. Lecture 2 Map- Reduce

Hadoop WordCount Explained! IT332 Distributed Systems

PassTest. Bessere Qualität, bessere Dienstleistungen!

CS 378 Big Data Programming

Hadoop Architecture. Part 1

Yuji Shirasaki (JVO NAOJ)

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

How To Use Hadoop

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Introduction to Cloud Computing

MAPREDUCE Programming Model

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop Parallel Data Processing

Tutorial for Assignment 2.0

COURSE CONTENT Big Data and Hadoop Training

Tutorial for Assignment 2.0

Introduction to HDFS. Prasanth Kothuri, CERN

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Disco: Beyond MapReduce

Apache Hadoop. Alexandru Costan

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

A very short Intro to Hadoop

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Analysis of MapReduce Algorithms

MapReduce. Tushar B. Kute,

Using Big Data and GIS to Model Aviation Fuel Burn

Distributed Apriori in Hadoop MapReduce Framework

Hadoop and Map-reduce computing

A Novel Optimized Mapreduce across Datacenter s Using Dache: a Data Aware Caching for BigData Applications

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Introduction to DISC and Hadoop

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Cloud Computing. Chapter Hadoop

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

CS455 - Lab 10. Thilina Buddhika. April 6, 2015

Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage

Scaling Out With Apache Spark. DTL Meeting Slides based on

Jeffrey D. Ullman slides. MapReduce for data intensive computing

How To Install Hadoop From Apa Hadoop To (Hadoop)

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Hadoop: The Definitive Guide

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

The Hadoop Framework

Getting to know Apache Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop

Hadoop Installation MapReduce Examples Jake Karnes

Hadoop Tutorial. General Instructions

Hadoop Job Oriented Training Agenda

7 Deadly Hadoop Misconfigurations. Kathleen Ting February 2013

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Apache Tomcat Clustering

YARN and how MapReduce works in Hadoop By Alex Holmes

Single Node Hadoop Cluster Setup

File S1: Supplementary Information of CloudDOE

NetBeans IDE Field Guide

Running a typical ROOT HEP analysis on Hadoop/MapReduce. Stefano Alberto Russo Michele Pinamonti Marina Cobal

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Implement Hadoop jobs to extract business value from large and varied data sets

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

HDFS Architecture Guide

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

ITG Software Engineering

Extreme Computing. Hadoop MapReduce in more detail.

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Generic Log Analyzer Using Hadoop Mapreduce Framework

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MEN'S FASHION UK Items are ranked in order of popularity.

IDS 561 Big data analytics Assignment 1

Big Data and Scripting map/reduce in Hadoop

Introduction to HDFS. Prasanth Kothuri, CERN

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud


Transcription:

TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

MapReduce is a software framework for processing (large) data sets in a distributed fashion over several machines. Core idea < key, value > pairs

MapReduce is a software framework for processing (large) data sets in a distributed fashion over several machines. Core idea < key, value > pairs Almost all data can be mapped into key, value pairs.

MapReduce is a software framework for processing (large) data sets in a distributed fashion over several machines. Core idea < key, value > pairs Almost all data can be mapped into key, value pairs. Keys and values may be of any type.

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Write your map and reduce functions

Write your map and reduce functions Test with a small subset of data

Write your map and reduce functions Test with a small subset of data If it fails use your IDE s debugger to find the problem

Write your map and reduce functions Test with a small subset of data If it fails use your IDE s debugger to find the problem Run on full dataset

Write your map and reduce functions Test with a small subset of data If it fails use your IDE s debugger to find the problem Run on full dataset If it fails Hadoop provides some debugging tools

Write your map and reduce functions Test with a small subset of data If it fails use your IDE s debugger to find the problem Run on full dataset If it fails Hadoop provides some debugging tools e.g. IsolationRunner : runs a task over the same input which it failed.

Write your map and reduce functions Test with a small subset of data If it fails use your IDE s debugger to find the problem Run on full dataset If it fails Hadoop provides some debugging tools e.g. IsolationRunner : runs a task over the same input which it failed. Do profiling to tune the performance

Hadoop Default Ports Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Hadoop Default Ports Hadoop Default Ports Handful of ports over TCP. Some used by Hadoop itself (to schedule jobs, replicate blocks, etc.). Some are directly for users (either via an interposed Java client or via plain old HTTP)

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Task: Counting the word occurances (frequencies) in a text file (or set of files). < word, count > as < key, value > pair Mapper: Emits < word, 1 > for each word (no counting at this part). Shuffle in between: pairs with same keys grouped together and passed to a single machine. Reducer: Sums up the values (1s) with the same key value.

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Tasks

Name Node

Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3

Test mapper and reducer outside hadoop.

Test mapper and reducer outside hadoop. Copy your MapReduce function and files to DFS.

Test mapper and reducer outside hadoop. Copy your MapReduce function and files to DFS. Test mapper and reducer with hadoop using a small portion of the data.

Test mapper and reducer outside hadoop. Copy your MapReduce function and files to DFS. Test mapper and reducer with hadoop using a small portion of the data. Track the jobs, debug, do profiling

Questions/Comments