Hadoop in Action Justin Quan March 15, 2011
Poll
What's to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts
Key Takeaways Hadoop is a widely used open source framework for processing large datasets with multiple machines Hadoop is a tool simple enough to add to your back pocket
What is Hadoop?
What is Hadoop? Hadoop is one way of using an enormous cluster of computers to store an enormous amount of data and then operate on that data in parallel. -Keith Wiley
History of Hadoop Open source implementation of the Google File System (GFS) and MapReduce framework (2004) Suited for processing large data sets (PageRank) Doug Cutting joins Yahoo in 2006 and spearheads the open source implementation
Motivation for Hadoop Processing large datasets requires lots of CPU, I/O, bandwidth, memory Large scale means failures Adding fault tolerance to your app is hard Hadoop to the rescue! Let application developers worry about writing applications
What does Hadoop Provide? Decouples distributed systems fault tolerance from application logic Scalable storage (just add nodes) System that can tolerate machine failure Distributes your data processing code to take advantage of idle CPUs and data locality
Hadoop Limitations Data privacy, small datasets, realtime processing
How does Hadoop work?
How does Hadoop work? How does it store data? How does it process data?
HDFS: Overview Shared namespace for the entire cluster (/user/justin/todo.txt) Write once. Append ok. Unix-like access: ls, df, du, mv, cp, rm, cat, chmod, chown, etc... use -put / -get to move between the local filesystem and HDFS
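For example (a minimal sketch; the local and HDFS paths below are hypothetical), moving data in and out of HDFS looks like:
$HADOOP_HOME/bin/hadoop fs -put /home/justin/todo.txt /user/justin/todo.txt   # local -> HDFS
$HADOOP_HOME/bin/hadoop fs -ls /user/justin/                                  # list the directory
$HADOOP_HOME/bin/hadoop fs -cat /user/justin/todo.txt                         # print file contents
$HADOOP_HOME/bin/hadoop fs -get /user/justin/todo.txt /tmp/todo.txt           # HDFS -> local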
HDFS: Underneath Files are split up into large chunks (64MB+) Replication for durability. Default=3 copies Single NameNode tracks filenames, permissions, block locations, cluster config, transaction log Many DataNodes, each stores blocks on the local filesystem
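To see how a given file was split and replicated (a sketch, reusing the example path from above), fsck reports the blocks and where the replicas live:
$HADOOP_HOME/bin/hadoop fsck /user/justin/todo.txt -files -blocks -locations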
HDFS: How it works
Data Processing Map Reduce Paradigm Borrows constructs similar to those found in functional programming languages Map and Reduce get called on a list of key-value pairs of unknown length
Map Reduce Two-stage processing Mappers produce intermediate results Reducers aggregate and consolidate results
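As a rough analogy only (plain Ruby, not Hadoop code), the same map/reduce constructs exist in functional-style languages, just without the distribution:
# word count written with Ruby's own map and reduce
words = "around and around".split(" ")
pairs = words.map { |w| [w, 1] }                               # emit (word, 1) pairs
counts = pairs.group_by { |w, _| w }                           # group pairs by key
              .map { |w, ps| [w, ps.map(&:last).reduce(:+)] }  # sum each key's values
# counts => [["around", 2], ["and", 1]]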
Map Map(K,V) -> (Ki,Vi) list Each invocation is fed a key value pair Each invocation returns 0 or more key value pairs
Map Examples Wordcount App: K is the file line number, V is the line of text
def map(k, v):
  wordlist = v.split(" ")
  foreach word in wordlist:
    output(word, 1)
  end
end
e.g. map(421, "around and around") emits:
output(around, 1)
output(and, 1)
output(around, 1)
Shuffle Phase Done automatically Each Mapper sorts output by key Hashes key to send data to appropriate reducer
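Conceptually (a sketch in Ruby, not Hadoop's actual HashPartitioner), routing a key to a reducer is just hashing modulo the number of reducers, so every occurrence of a key lands on the same reducer:
# sketch of hash partitioning
def partition(key, num_reducers)
  key.hash.abs % num_reducers
end
partition("around", 4)   # always the same reducer for "around"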
Reduce Reduce(K,Vi list) -> (Kj,Vj) list Each invocation is fed a key and list of values Each invocation returns 0 or more key value pairs
Reduce Examples Wordcount App:
def reduce(k, v):
  count = 0
  foreach value in v:
    count += value
  end
  output(k, count)
end
e.g. reduce(and, [1]) emits output(and, 1)
reduce(around, [1, 1]) emits output(around, 2)
Combiner An optionally run reducer for Mapper output Goal is to reduce size of mapper output No guarantee to run, so don't depend on it!
Combiner Example Wordcount App:
def combine(k, v):
  count = 0
  foreach value in v:
    count += value
  end
  output(k, count)
end
e.g. combine(and, [1]) emits output(and, 1)
combine(around, [1, 1]) emits output(around, 2)
Hadoop Map Reduce Data Processing Workflow 1. Map function is distributed 2. Data read from HDFS 3. Data fed to mapper 4. Output fed to Combiner 5. Output aggregated by key 6. Data distributed to reducers 7. Output written to HDFS
Real Example
Google Ngram Dataset 1-gram dataset file format: <ngram><tab><year><tab><count><tab><pages><tab><books> map: extract ngram + count reduce: sum ngram counts
Hard numbers 10 GB of data, 475 million lines Hadoop running with 1 GB memory 2 machines: 2x2.33GHz and 4x3GHz Word count job: 5 hours 18 minutes
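A toy walkthrough (the input lines below are made up, not real Google Ngram data):
# input:  hadoop<tab>2007<tab>12<tab>10<tab>8
#         hadoop<tab>2008<tab>40<tab>30<tab>20
# map emits:    (hadoop, 12), (hadoop, 40)
# shuffle:      hadoop -> [12, 40]
# reduce emits: (hadoop, 52)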
How do I use Hadoop?
Install Download from http://hadoop.apache.org/ tar -xzvf hadoop-0.21.0.tar.gz Need Java
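Hadoop also needs to know where Java lives; that is usually set in conf/hadoop-env.sh (the JDK path below is hypothetical):
# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun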
Components One NameNode (coordinates HDFS) One JobTracker (coordinates data processing) Many DataNodes (store HDFS blocks) Many TaskTrackers (run mapper and reducer work)
Setup Config Configure NameNode (HDFS) core-site.xml, hdfs-site.xml Configure JobTracker (Map/Reduce) mapred-site.xml on every host slaves file on master hosts
core-site.xml fs.default.name: NameNode endpoint
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://yourhost:9000</value>
  </property>
</configuration>
hdfs-site.xml dfs.name.dir: directory to store NameNode data dfs.data.dir: directory to store DataNode data
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data/</value>
  </property>
</configuration>
mapred-site.xml mapred.job.tracker: JobTracker endpoint mapred.local.dir: directories to store mapper output
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>yourhost:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/justin/mapred</value>
  </property>
</configuration>
slaves Define DataNodes and TaskTrackers One host per line, used by NameNode and JobTracker to start up daemons
yourhost1.domain
yourhost2.domain
yourhost3.domain
Starting Up Hadoop
Format HDFS: $HADOOP_HOME/bin/hadoop namenode -format
Start HDFS (NameNode + DataNodes): $HADOOP_HOME/bin/start-dfs.sh
Start Map/Reduce (JobTracker + TaskTrackers): $HADOOP_HOME/bin/start-mapred.sh
Hadoop Streaming Not a Java person? Streaming provides a command-line interface Mapper: provided <key><tab><value><newline> records on stdin, emits the same format on stdout Reducer: provided <key><tab><value><newline> records on stdin, emits the same format on stdout
Quicky Ruby MR script
Mapper (map.rb):
# emit <ngram>\t<count> for each input line
for line in STDIN.each_line do
  parts = line.split(/\t/)
  print "#{parts[0]}\t#{parts[2]}\n"
end
Reducer (reduce.rb):
# sum the counts seen for each key
sum = {}
for line in STDIN.each_line do
  parts = line.split(/\t/)
  val = sum[parts[0]] || 0
  sum[parts[0]] = val + parts[1].to_i
end
for k in sum.keys do
  print "#{k}\t#{sum[k]}\n"
end
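Before touching the cluster, the pair can be smoke-tested with plain Unix pipes, with sort standing in for the shuffle phase (sample.txt is a hypothetical local file in the 1-gram format):
cat sample.txt | ruby map.rb | sort | ruby reduce.rb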
Hadoop Streaming Execution
$HADOOP_HOME/bin/hdfs dfs -put /home/justin/localdata/* inputdir/
$HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar \
  -input inputdir/* \
  -output outputdir \
  -mapper "ruby map.rb" \
  -reducer "ruby reduce.rb" \
  -file map.rb \
  -file reduce.rb
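When the job finishes, the results land in outputdir as part-* files; to inspect them (same paths as above):
$HADOOP_HOME/bin/hdfs dfs -ls outputdir/
$HADOOP_HOME/bin/hdfs dfs -cat outputdir/part-*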
How do I get started? Borrow access on peer desktops Look to the cloud: Amazon Web Services Build your own cloud
Recap Overview and motivation for using Hadoop How Hadoop works How to use Hadoop How to get started
Final Thoughts Hadoop is a widely used open source framework for processing large datasets with multiple machines Hadoop is a tool simple enough to add to your back pocket
EOF
How to interface with the Mapper Hadoop comes with a variety of ways to get data in: Pig, for writing jobs in languages other than native Java; Hadoop Streaming comes bundled, for command-line style execution