Hadoop 2.6 Configuration and More Examples (Big Data 2015)

Apache Hadoop & YARN
Apache Hadoop (1.X): the de facto open-source Big Data platform, running in production for about 5 years at hundreds of companies such as Yahoo, eBay and Facebook.
Hadoop 2.X: significant improvements in the HDFS distributed storage layer (High Availability, NFS access, Snapshots), plus YARN, a next-generation compute framework for Hadoop, designed from the ground up based on experience gained from Hadoop 1. YARN has been running in production at Yahoo for about a year.

1st Generation Hadoop: Batch Focus
HADOOP 1.0 was built for web-scale batch apps. All other usage patterns (INTERACTIVE, ONLINE) must leverage the same infrastructure, so each workload ends up on its own single-app cluster with its own HDFS: this forces the creation of silos to manage mixed workloads.

Hadoop 1 Architecture
JobTracker: manages cluster resources and job scheduling.
TaskTracker: per-node agent that manages tasks.

Hadoop 1 Limitations
Lacks support for alternate paradigms and services: everything is forced to look like MapReduce, and iterative applications in MapReduce are 10x slower.
Scalability: max cluster size ~5,000 nodes, max concurrent tasks ~40,000.
Availability: a failure kills queued and running jobs.
Hard partition of resources into map and reduce slots: non-optimal resource utilization.

Hadoop as Next-Gen Platform
HADOOP 1.0, a single-use system for batch apps: MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage).
HADOOP 2.0, a multi-purpose platform for batch, interactive, online and streaming apps: MapReduce (data processing) and others on top of YARN (cluster resource management), on top of HDFS2 (redundant, highly-available & reliable storage).

Hadoop 2 - YARN Architecture
ResourceManager (RM): central agent that manages and allocates cluster resources.
NodeManager (NM): per-node agent that manages and enforces node resource allocations.
ApplicationMaster (AM): per-application agent that manages the application lifecycle and task scheduling.
In the architecture diagram, a Client submits a job to the ResourceManager, NodeManagers report node status, and each App Mstr sends resource requests, receives Containers to run its tasks, and collects MapReduce status from them.

YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place, and interact with that data in MULTIPLE WAYS, with predictable performance and quality of service.
Applications run natively in Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, ...), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, ...), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).

5 Key Benefits of YARN
1. New applications & services
2. Improved cluster utilization
3. Scale
4. Experimental agility
5. Shared services

5 Key Benefits of YARN
Yahoo! leverages YARN: 40,000+ nodes running YARN across over 365PB of data, with ~400,000 jobs per day for about 10 million hours of compute time. Yahoo estimated a 60%-150% improvement in node usage per day using YARN, and eliminated a colo (~10K nodes) due to the increased utilization. For more details, see the YARN SOCC 2013 paper: "Apache Hadoop YARN: Yet Another Resource Negotiator", Vinod Kumar Vavilapalli et al.

Environment variables
In your ~/.bash_profile, export all the needed environment variables, as sketched below.
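A minimal sketch; the JDK path and the Hadoop install directory are assumptions to adjust for your system:
$:~ vi ~/.bash_profile
export JAVA_HOME=/usr/lib/jvm/java-7-oracle        # assumption: your JDK install path
export HADOOP_HOME=$HOME/hadoop-2.6.0              # assumption: where the tarball was unpacked
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin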

Hadoop 2 Configuration
Allow remote login: the Hadoop scripts start the daemons over ssh, so ssh to localhost must work without a passphrase.
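Following the Apache single-node setup guide, a passphrase-less key can be created as follows:
$:~ ssh localhost                                  # if this prompts for a passphrase, run the commands below
$:~ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$:~ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$:~ chmod 0600 ~/.ssh/authorized_keys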

Hadoop 2 Configuration
Download the binary release of Apache Hadoop: hadoop-2.6.0.tar.gz.
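For example, from the Apache archive (the URL below is one possible mirror):
$:~ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$:~ tar xzf hadoop-2.6.0.tar.gz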

Hadoop 2 Configuration
At https://hadoop.apache.org/docs/stable/ you can find the official documentation and wiki for Hadoop 2.

Hadoop 2 Configuration
In the etc/hadoop directory of the Hadoop home directory, set the following files: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

Hadoop 2 Configuration
In the hadoop-env.sh file you have to edit, at a minimum, the JAVA_HOME setting.
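A sketch, assuming the same JDK path as in ~/.bash_profile:
# in etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-oracle        # assumption: adjust to your JDK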

Hadoop 2 Configuration
core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml: a minimal pseudo-distributed configuration is sketched below.
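These values follow the standard single-node (pseudo-distributed) setup from the Apache Hadoop 2.6 guide; the host, port and replication factor are assumptions to adapt. Note that mapred-site.xml may need to be created by copying mapred-site.xml.template.
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
With mapreduce.framework.name set to yarn, the YARN daemons must also be started with sbin/start-yarn.sh in addition to start-dfs.sh.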

Hadoop 2 Configuration & Running
Final configuration:
$:~ hadoop-*/bin/hdfs namenode -format
Running hadoop:
$:~ hadoop-*/sbin/start-dfs.sh

Hadoop 2: commands on hdfs
$:~ hadoop-*/bin/hdfs dfs <command> <parameters>
Create a directory in hdfs:
$:~ hadoop-*/bin/hdfs dfs -mkdir input
Copy a local file into hdfs:
$:~ hadoop-*/bin/hdfs dfs -put /tmp/example.txt input
Copy result files from hdfs to the local file system:
$:~ hadoop-*/bin/hdfs dfs -get output/result localoutput
Delete a directory in hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r input

Hadoop 2 Configuration & Running
Final configuration:
$:~ hadoop-*/bin/hdfs namenode -format
Running hadoop:
$:~ hadoop-*/sbin/start-dfs.sh
Make the HDFS directories required to execute MapReduce jobs:
$:~ hadoop-*/bin/hdfs dfs -mkdir /user
$:~ hadoop-*/bin/hdfs dfs -mkdir /user/<username>
Copy the input files into the distributed filesystem:
$:~ hadoop-*/bin/hdfs dfs -put etc/hadoop input

Hadoop 2: browse hdfs
The NameNode web UI is at http://localhost:50070

Hadoop 2 Configuration & Running
Check all running daemons in Hadoop using the jps command:
$:~ jps
20758 NameNode
20829 DataNode
20920 SecondaryNameNode
21001 Jps

Hadoop 2: execute MR application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: Word Count in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /tmp/parole.txt input
$:~ hadoop-*/bin/hadoop jar /tmp/word.jar WordCount input/parole.txt output/result
Here input/parole.txt is the path on hdfs to reach the input file, and output/result is the path on hdfs of the directory to be generated to store the result.
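The WordCount class packaged in word.jar is not listed in these slides; the canonical implementation from the Apache MapReduce tutorial looks like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);                      // emit (word, 1) for each token
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get(); // sum the 1s for this word
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);         // sums are associative, so the reducer doubles as combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}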

Let's start with some examples!

Average Word Length by initial letter
Input: sopra la panca la capra campa, sotto la panca la capra crepa
INPUT (byte offset, word): (0, sopra) (6, la) (9, panca) (15, la) (18, capra) (24, campa,) ...
MAP (initial letter, word length): (s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 6) (s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 5)
SHUFFLE & SORT: (c, [5,6,5,5]) (l, [2,2,2,2]) (p, [5,5]) (s, [5,5])
REDUCE (initial letter, average length): (c, 5.25) (l, 2) (p, 5) (s, 5)
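The source of the example jar is not shown in the slides; a minimal sketch consistent with the data flow above, with hypothetical class names matching the avgword main class used on the next slide (note there is no combiner, since averages cannot simply be combined):

package avgword;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageWordLength {
  public static class LetterMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final Text letter = new Text();
    private final IntWritable length = new IntWritable();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        letter.set(word.substring(0, 1));              // initial letter is the key
        length.set(word.length());                     // word length is the value
        context.write(letter, length);
      }
    }
  }
  public static class AvgReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable avg = new DoubleWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0, count = 0;
      for (IntWritable v : values) { sum += v.get(); count++; }
      avg.set((double) sum / count);                   // e.g. c -> (5+6+5+5)/4 = 5.25
      context.write(key, avg);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "average word length");
    job.setJarByClass(AverageWordLength.class);
    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AvgReducer.class);
    job.setMapOutputKeyClass(Text.class);              // map and reduce value types differ,
    job.setMapOutputValueClass(IntWritable.class);     // so set the map output types explicitly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}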

Hadoop 2: execute AverageWordLength application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: AverageWordLength in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar avgword/averagewordlength input/words.txt output/result_avg_word
(input/output paths on hdfs, as before)

Hadoop 2: clean AverageWordLength application
Remove the output directory on hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r output

Inverted Index
Input (a line number, a tab (\t), then the text):
1   if you prick us do we not bleed,
2   if you tickle us do we not laugh,
3   if you poison us do we not die and,
4   if you wrong us shall we not revenge

Inverted Index
INPUT: (1, if you prick ...) (2, if you tickle ...) (3, if you poison ...) (4, if you wrong ...)
MAP (word, line number): (if, 1) (you, 1) (prick, 1) (if, 2) (you, 2) (tickle, 2) ...
REDUCE (word, list of lines): (if, [1,2,...]) (you, [1,2,...]) (prick, [1]) (tickle, [2]) ...
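Again, the jar's source is not shown; a minimal sketch consistent with the flow above, with hypothetical class names (the mapper parses the line number before the tab, since TextInputFormat only provides byte offsets as keys):

package invertedindex;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text line = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);  // "<line number>\t<text>"
      if (parts.length < 2) return;
      line.set(parts[0]);
      for (String w : parts[1].split("\\s+")) {
        if (w.isEmpty()) continue;
        word.set(w);
        context.write(word, line);                       // emit (word, line number)
      }
    }
  }
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      StringBuilder ids = new StringBuilder("[");
      for (Text v : values) {
        if (ids.length() > 1) ids.append(',');
        ids.append(v);                                   // collect all lines containing the word
      }
      context.write(key, new Text(ids.append(']').toString()));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);             // no combiner: list concatenation is not idempotent
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}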

Hadoop 2: execute InvertedIndex application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: InvertedIndex in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words2.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar invertedindex/invertedindex input/words2.txt output/result_ii
(input/output paths on hdfs, as before)

Hadoop 2: clean InvertedIndex application
Remove the output directory on hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r output

Top K records (citations)
Input (CITING_PATENT,CITED_PATENT):
A,B
C,B
F,B
A,A
C,A
R,F
R,C
B,B
Expected output: (B, 4) (A, 2) (F, 1) (C, 1)
The objective is to find the top-10 most frequently cited patents, in descending order.

Top K records (citations)
Two chained jobs:
MAP1 emits (CITED_PATENT, 1) for each input line: (B, 1) (B, 1) (B, 1) (A, 1) (A, 1) (F, 1) ...
REDUCE1 sums the counts into (CITED_PATENT, N): (B, 4) (A, 2) (F, 1) ...
MAP2 copies the (CITED_PATENT, N) pairs through.
REDUCE2 filters, keeping only the top-K pairs: (B, 4) (A, 2) (F, 1) ...
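The jar's source is not shown; a minimal, hypothetical sketch of the second job's reducer, assuming MAP2 re-emits (patent, count) as (Text, IntWritable) pairs and the job is configured with a single reducer (job.setNumReduceTasks(1)) so that one task sees all candidates:

package topk;

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private static final int K = 10;                          // top-10, as stated above
  // count -> patent, kept sorted; in this simplified sketch ties on the count overwrite each other
  private final TreeMap<Integer, String> top = new TreeMap<>();

  @Override
  public void reduce(Text patent, Iterable<IntWritable> counts, Context ctx) throws IOException, InterruptedException {
    for (IntWritable c : counts) {
      top.put(c.get(), patent.toString());                  // buffer the candidate
      if (top.size() > K) top.remove(top.firstKey());       // evict the smallest count
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // emit the surviving K records in descending order of count
    for (Map.Entry<Integer, String> e : top.descendingMap().entrySet())
      ctx.write(new Text(e.getValue()), new IntWritable(e.getKey()));
  }
}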

Hadoop 2: execute TopKRecords application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: TopKRecords in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/citations.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar topk/topkrecords input/citations.txt output/result_topk
(input/output paths on hdfs, as before)

Hadoop 2: clean TopKRecords application
Remove the output directory on hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r output

Disjoint selection
ratings (UserID::MovieID::Rating::Timestamp):
1::1193::5::978300760
1::661::3::978302109
2::3068::4::978299000
2::1537::4::978299620
2::647::3::978299351
3::3534::3::978297068
users (UserID::Gender::Age::Occupation::Zip-Code):
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117

Disjoint selection
Merging the ratings (UserID::MovieID::Rating::Timestamp) and users (UserID::Gender::Age::Occupation::Zip-Code) files gives a single mixed file:
1::1193::5::978300760
1::661::3::978302109
1::F::1::10::48067
2::M::56::16::70072
2::3068::4::978299000
2::1537::4::978299620
2::647::3::978299351
3::M::25::15::55117
3::3534::3::978297068

Disjoint selection (users_ratings)
The objective is to find users who are not "power taggers", i.e., those who have rated fewer than MIN_RATINGS (25) movies. The problem is complicated by the fact that user and rating records are mixed in a single file.

Disjoint selection
For this example consider MIN_RATINGS = 3.
MAP emits (userid, recordtype): (1, R) (1, R) (1, U) (2, U) (2, R) (2, R) ...
REDUCE counts each user's ratings and emits (userid, numberofratings) only for users with fewer than MIN_RATINGS ratings: (1, 2) (3, 1)
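The jar's source is not shown; a minimal, hypothetical sketch of the mapper and reducer (the driver, omitted, mirrors WordCount's but must also set job.setMapOutputValueClass(Text.class), since map and reduce value types differ):

package rating;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DisjointSelection {
  public static class TypeMapper extends Mapper<Object, Text, Text, Text> {
    private final Text userId = new Text();
    private final Text type = new Text();
    public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      String[] f = value.toString().split("::");
      userId.set(f[0]);
      // rating records have 4 fields (UserID::MovieID::Rating::Timestamp),
      // user records have 5 (UserID::Gender::Age::Occupation::Zip-Code)
      type.set(f.length == 4 ? "R" : "U");
      ctx.write(userId, type);
    }
  }
  public static class FilterReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final int MIN_RATINGS = 3;                       // 25 on the full dataset, 3 in the toy example
    public void reduce(Text userId, Iterable<Text> types, Context ctx) throws IOException, InterruptedException {
      int ratings = 0;
      for (Text t : types) if (t.toString().equals("R")) ratings++; // count only rating records
      if (ratings < MIN_RATINGS)
        ctx.write(userId, new IntWritable(ratings));                // user is not a "power tagger"
    }
  }
}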

Hadoop 2: execute DisjointSelection application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: DisjointSelection in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/users_ratings.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar rating/disjointselection input/users_ratings.txt output/result_rating
(input/output paths on hdfs, as before)

Hadoop 2: clean DisjointSelection application
Remove the output directory on hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r output

Stopping the Hadoop 2 DFS
Stop hadoop:
$:~ hadoop-*/sbin/stop-dfs.sh
That's all! Happy Map Reducing!

Hadoop 2 on AMAZON

Hadoop 2 on AMAZON
AWS Code Credentials:
User Name: BigData
Access Key Id: AKIAJRJMBITGSMBKLTLQ
Secret Access Key: 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
AWS Web Console Credentials:
User Name: BigData
Password: XkHNB0v)IVhf
Direct Sign-in Link: http://signin.aws.amazon.com/console

Hadoop 2 on AMAZON
[Screenshot walkthrough of the AWS web console: Regions, S3 and buckets, EC2 virtual servers, the key-pair file (file.pem), Elastic Map Reduce, and the creation of an EMR cluster.]

Hadoop 2 Running on AWS
Set the permissions of the .pem file:
$:~ chmod 400 BigData_Key.pem
Upload files (data and your personal jars) to the hadoop user of the EMR cluster:
$:~ scp -i <file.pem> <file> hadoop@<dns_emr_cluster>:~
$:~ scp -i BigData_Key.pem ./example_data/words.txt hadoop@ec2-54-164-153-7.compute-1.amazonaws.com:~

Hadoop 2 Running on AWS
Connect to the hadoop user of the EMR cluster:
$:~ ssh hadoop@<dns_emr_cluster> -i <file.pem>
$:~ ssh hadoop@ec2-54-164-153-7.compute-1.amazonaws.com -i BigData_Key.pem
Then you can execute MR jobs as on your local machine. Finally, TERMINATE your cluster.
