Tutorial. Christopher M. Judd



Christopher M. Judd
CTO and Partner
Leader, Columbus Developer User Group (CIDUG)

Marc Peabody @marcpeabody

Introduction

http://hadoop.apache.org/

Scale up

Scale-up Scale-out

Hadoop Approach
  scale-out
  share nothing
  expect failure
  smart software, dumb hardware
  move processing, not data
  build applications, not infrastructure

What is Hadoop good for?
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Don't use Hadoop - your data isn't that big
  < 100 MB: Excel
  100 MB to 10 GB: add memory and use Pandas
  100 GB to 1 TB: buy a big hard drive and use Postgres
  > 5 TB: life sucks, consider Hadoop

Hadoop is an evolving project
  old API: org.apache.hadoop.mapred
  new API: org.apache.hadoop.mapreduce
  MapReduce 1: Classic MapReduce
  MapReduce 2: YARN

Setup

Hadoop Tutorial
  login: user/fun4all
  data: /opt/data

Configure SSH
  add a second Host-only Adapter
  sudo /Library/StartupItems/VirtualB
  sudo /Library/StartupItems/Virtual

SSHing
$ ssh user@192.168.56.101
The authenticity of host '192.168.56.101 (192.168.56.101)' can't be established.
RSA key fingerprint is 40:60:72:ce:48:03:ac:c7:4c:23:9a:4f:1e:a5:ae:b9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.56.101' (RSA) to the list of known hosts.
user@192.168.56.101's password:
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-29-generic x86_64)
 * Documentation: https://help.ubuntu.com/
Last login: Tue Jan 7 08:49:28 2014 from user-virtualbox.local
user@user-virtualbox:~$

Hadoop

http://www.cloudera.com/ http://hortonworks.com/

Add Hadoop User and Group
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
$ su hduser
$ cd ~

Install Hadoop
$ sudo mkdir -p /opt/hadoop
$ sudo tar vxzf /opt/data/hadoop-2.2.0.tar.gz -C /opt/hadoop
$ sudo chown -R hduser:hadoop /opt/hadoop/hadoop-2.2.0
$ vim .bashrc
  # other stuff
  # java variables
  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
  # hadoop variables
  export HADOOP_HOME=/opt/hadoop/hadoop-2.2.0
  export PATH=$PATH:$HADOOP_HOME/bin
  export PATH=$PATH:$HADOOP_HOME/sbin
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=$HADOOP_HOME
  export YARN_HOME=$HADOOP_HOME
$ source .bashrc
$ hadoop version

Run Hadoop Job
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples>
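
The pi example is described above as a quasi-Monte Carlo estimate, and its usage line takes a number of maps and a number of samples. A minimal plain-Java sketch of that idea follows. It assumes a two-dimensional Halton sequence (bases 2 and 3) as the quasi-random source and a circle inscribed in the unit square. The class and method names (PiSketch, halton, countInside, estimatePi) are hypothetical, and this is not the Hadoop source. Each simulated map task counts the points that land inside the circle, and a final step sums the counts.

```java
public class PiSketch {

    // Radical inverse of index in the given base: one coordinate of a Halton point.
    public static double halton(long index, int base) {
        double result = 0.0;
        double fraction = 1.0 / base;
        while (index > 0) {
            result += (index % base) * fraction;
            index /= base;
            fraction /= base;
        }
        return result;
    }

    // One simulated map task: count the points of its slice that fall inside
    // the circle of radius 0.5 centered in the unit square.
    public static long countInside(long offset, long nSamples) {
        long inside = 0;
        for (long i = offset; i < offset + nSamples; i++) {
            double x = halton(i + 1, 2) - 0.5; // i + 1 skips the origin point
            double y = halton(i + 1, 3) - 0.5;
            if (x * x + y * y <= 0.25) {
                inside++;
            }
        }
        return inside;
    }

    // Simulated reduce: sum the per-task counts and scale the area ratio by 4.
    public static double estimatePi(int nMaps, long nSamples) {
        long inside = 0;
        for (int m = 0; m < nMaps; m++) {
            inside += countInside((long) m * nSamples, nSamples);
        }
        return 4.0 * inside / ((long) nMaps * nSamples);
    }

    public static void main(String[] args) {
        // Same arguments as the "pi 4 1000" run above.
        System.out.println(estimatePi(4, 1000));
    }
}
```

Giving each task its own offset keeps the slices disjoint, so adding maps adds new sample points rather than repeating the same ones.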

Lab 1
1. Create Hadoop user and group
2. Install Hadoop
3. Run example Hadoop job such as pi

Local Standalone mode
Pseudo-distributed mode
Fully distributed mode

HDFS

POSIX: Portable Operating System Interface

Reading Data

Writing Data

Configure passwordless login
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
$ exit

Configure HDFS
$ sudo mkdir -p /opt/hdfs/namenode
$ sudo mkdir -p /opt/hdfs/datanode
$ sudo chmod -R 777 /opt/hdfs
$ sudo chown -R hduser:hadoop /opt/hdfs
$ cd /opt/hadoop/hadoop-2.2.0
$ sudo vim etc/hadoop/hdfs-site.xml
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/opt/hdfs/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/opt/hdfs/datanode</value>
    </property>
  </configuration>

Format HDFS
$ hdfs namenode -format

Configure JAVA_HOME
$ vim etc/hadoop/hadoop-env.sh
  # other stuff
  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
  # more stuff

Configure Core
$ sudo vim etc/hadoop/core-site.xml
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

Start HDFS
$ start-dfs.sh
$ jps
6433 DataNode
6844 Jps
6206 NameNode
6714 SecondaryNameNode

Use HDFS commands
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/filesystemshell.html
$ hdfs dfs -ls /
$ hdfs dfs -mkdir /books
$ hdfs dfs -ls /
$ hdfs dfs -ls /books
$ hdfs dfs -copyFromLocal /opt/data/moby_dick.txt /books
$ hdfs dfs -cat /books/moby_dick.txt
Available commands:
  appendToFile cat chgrp chmod chown copyToLocal count cp du get copyFromLocal ls lsr mkdir moveFromLocal moveToLocal mv put rm rmr stat tail test text touchz

http://192.168.56.101:50070/dfshealth.jsp

Lab 2
1. Configure passwordless login
2. Configure HDFS
3. Format HDFS
4. Configure JAVA_HOME
5. Configure Core
6. Start HDFS
7. Experiment with HDFS commands (ls, mkdir, copyFromLocal, cat)

HADOOP Pseudo-Distributed

Configure YARN
$ sudo vim etc/hadoop/yarn-site.xml
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
  </configuration>

Configure Map Reduce
$ sudo mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ sudo vim etc/hadoop/mapred-site.xml
  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

Start YARN
$ start-yarn.sh
$ jps
6433 DataNode
8355 Jps
8318 NodeManager
6206 NameNode
6714 SecondaryNameNode
8090 ResourceManager

Run Hadoop Job
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

http://192.168.56.101:8042/node

Lab 3
1. Configure YARN
2. Configure Map Reduce
3. Start YARN
4. Run pi job

Combine Hadoop & HDFS

Run Hadoop Job
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out
$ hdfs dfs -ls out
$ hdfs dfs -cat out/_SUCCESS
$ hdfs dfs -cat out/part-r-00000
young-armed    1
young;         2
younger        2
youngest       1
youngish       1
your           251
your@login     1
yours          5
yours?         1
yourselbs      1
yourself       14
yourself,      5
yourself,"     1
yourself.'     1
yourself;      4
yourself?      1
yourselves     1
yourselves!    3
yourselves,    1
yourselves,"   1
yourselves;    1
youth          5
youth,         2
youth.         1
youth;         1
youthful       1

Lab 4
1. Run wordcount job
2. Review output
3. Run wordcount job again with same parameters

Writing Map Reduce Jobs

{K1,V1} -> {K2, List<V2>} -> {K3,V3}
[Diagram stage labels: we write, optionally write, we write]

MOBY DICK; OR THE WHALE
By Herman Melville

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off--then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.

There now is your insular city of the Manhattoes, belted round by wharves as Indian isles by coral reefs--commerce surrounds it with her surf. Right and left, the streets take you waterward. Its extreme downtown is the battery, where that noble mole is washed by waves, and cooled by breezes, which a few hours previous were out of sight of land. Look at the crowds of water-gazers there.

K  V
1  Call me Ishmael. Some years ago--never mind how long precisely--having
2  little or no money in my purse, and nothing particular to interest me on
3  shore, I thought I would sail about a little and see the watery part of
4  the world. It is a way I have of driving off the spleen and regulating
5  the circulation.

Mapper
package com.manifestcorp.hadoop.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: K1 = Object (input key), V1 = Text (input line),
// K2 = Text (word), V2 = IntWritable (the count 1).
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final String SPACE = " ";
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    // Called once per input record: key is K1, value is V1.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(SPACE);

        for (String str : words) {
            word.set(str);
            context.write(word, ONE); // emit (K2, V2)
        }
    }
}

Input (K, V)
  1  Call me Ishmael. Some years ago--never mind how long precisely--having
  2  little or no money in my purse, and nothing particular to interest me on
  3  shore, I thought I would sail about a little and see the watery part of
  4  the world. It is a way I have of driving off the spleen and regulating
  5  the circulation.
map (K, V)
  (Call, 1) (me, 1) (Ishmael., 1) (Some, 1) (years, 1) (ago--never, 1) (mind, 1) (how, 1) (of, 1) (long, 1) (little, 1) (of, 1) (or, 1) (of, 1)
sort (K, V)
  (ago--never, 1) (Call, 1) (how, 1) (Ishmael., 1) (me, 1) (little, 1) (long, 1) (mind, 1) (of, 1) (of, 1) (of, 1) (or, 1) (Some, 1) (years, 1)
group (K, List<V>)
  (ago--never, [1]) (Call, [1]) (how, [1]) (Ishmael., [1]) (me, [1]) (little, [1]) (long, [1]) (mind, [1]) (of, [1,1,1]) (or, [1]) (Some, [1]) (years, [1])

Reducer
package com.manifestcorp.hadoop.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: K2 = Text (word), V2 = IntWritable (the 1s from the mapper),
// K3 = Text (word), V3 = IntWritable (total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per distinct key: key is K2, values is the List<V2>.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;

        // Every value is the ONE emitted by the mapper, so counting the values gives the word count.
        for (IntWritable value : values) {
            total++;
        }

        context.write(key, new IntWritable(total)); // emit (K3, V3)
    }
}
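
The reducer counts its values instead of adding them up. That gives the right answer while every value is 1, which holds for this job because the driver sets no combiner. The plain-Java sketch below, with hypothetical names (CountVersusSum, countValues, sumValues) and no Hadoop dependency, shows what would change if the values were ever pre-aggregated: counting undercounts, while summing still gives the total.

```java
import java.util.Arrays;
import java.util.List;

public class CountVersusSum {

    // Mirrors the slide's reducer: one increment per value, whatever the value is.
    public static int countValues(List<Integer> values) {
        int total = 0;
        for (Integer ignored : values) {
            total++;
        }
        return total;
    }

    // Alternative: add the values themselves.
    public static int sumValues(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three occurrences of a word, straight from the mapper.
        List<Integer> raw = Arrays.asList(1, 1, 1);
        // The same three occurrences after a hypothetical pre-aggregation step
        // merged two of them into a single partial count of 2.
        List<Integer> preAggregated = Arrays.asList(2, 1);

        System.out.println("raw: count=" + countValues(raw) + " sum=" + sumValues(raw));
        System.out.println("pre-aggregated: count=" + countValues(preAggregated)
                + " sum=" + sumValues(preAggregated));
    }
}
```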

group (K, List<V>)
  (ago--never, [1]) (Call, [1]) (how, [1]) (Ishmael., [1]) (me, [1]) (little, [1]) (long, [1]) (mind, [1]) (of, [1,1,1]) (or, [1]) (Some, [1]) (years, [1])
reduce (K, V)
  (ago--never, 1) (Call, 1) (how, 1) (Ishmael., 1) (me, 1) (little, 1) (long, 1) (mind, 1) (of, 3) (or, 1) (Some, 1) (years, 1)
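
The map, sort, group, and reduce stages can be tried without a cluster. The following is a plain-Java sketch, not Hadoop code: the class WordCountSketch and its methods are hypothetical, a TreeMap stands in for the framework's sort and group steps, and lines are split on a single space as in the mapper slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // map: one input line (V1) becomes a list of words (K2); each carries an implicit count of 1 (V2).
    public static List<String> map(String line) {
        return Arrays.asList(line.split(" "));
    }

    // sort + group: collect the 1s emitted for each word under a sorted key.
    // TreeMap orders keys with String.compareTo, so uppercase letters sort before lowercase ones.
    public static Map<String, List<Integer>> sortAndGroup(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // reduce: collapse the list of values for one key (List<V2>) into a single total (V3).
    public static int reduce(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Run all stages and return word -> count, sorted by word.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : sortAndGroup(lines).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Call me Ishmael. Some years ago--never mind how long precisely--having",
                "little or no money in my purse, and nothing particular to interest me on",
                "shore, I thought I would sail about a little and see the watery part of",
                "the world. It is a way I have of driving off the spleen and regulating",
                "the circulation.");
        for (Map.Entry<String, Integer> entry : wordCount(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because the split is on spaces only, punctuation stays attached to the words, which matches tokens such as "yourself," in the wordcount output.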

Driver
package com.manifestcorp.hadoop.wc;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    public static void main(String[] args) throws Exception {
        // new Job() is deprecated in Hadoop 2.x; Job.getInstance() is the replacement.
        Job job = new Job();
        job.setJobName("my word count");
        job.setJarByClass(MyWordCount.class);

        // args[0] is the HDFS input path, args[1] the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Block until the job finishes; exit 0 on success, 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.manifestcorp.hadoop</groupId>
  <artifactId>hadoop-mywordcount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <properties>
    <hadoop.version>2.2.0</hadoop.version>
  </properties>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>

Run Hadoop Job
$ hadoop jar target/hadoop-mywordcount-0.0.1-SNAPSHOT.jar com.manifestcorp.hadoop.wc.MyWordCount /books out

Lab 5
1. Unzip /opt/data/hadoop-mywordcount-start.zip
2. Write Mapper class
3. Write Reducer class
4. Write Driver class
5. Build (mvn clean package)
6. Run mywordcount job
7. Review output

Hadoop in the Cloud

http://aws.amazon.com/elasticmapreduce/

Making it more Real

http://aws.amazon.com/architecture/

Resources

Christopher M. Judd
CTO and Partner
email: cjudd@juddsolutions.com
web: www.juddsolutions.com
blog: juddsolutions.blogspot.com
twitter: javajudd