Lab 10: Introduction to Apache Hadoop

Compiled by: ThS. Nguyễn Quang Hùng
E-mail: hungnq2@cse.hcmut.edu.vn

1. Introduction

Hadoop Map/Reduce is an open-source software framework that helps programmers write applications following the Map/Reduce model. To implement a Map/Reduce application, students use the classes and interfaces that Hadoop provides, such as Mapper, Reducer, JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, and OutputFormat.

Students are required to run the WordCount application in two modes, Pseudo-Distributed Operation and Fully-Distributed Operation, in order to understand how the Map/Reduce model works and how HDFS (Hadoop Distributed FileSystem) is organized.

2. Apache Hadoop installation guides and MapReduce tutorial

- Hadoop: http://hadoop.apache.org/docs/r1.1.2/#getting+started
- Installing Apache Hadoop on one machine (Single node setup): http://hadoop.apache.org/docs/r1.1.2/single_node_setup.html
- MapReduce Tutorial: http://hadoop.apache.org/docs/r1.1.2/mapred_tutorial.html

3. Example programs

3.1. Installing and using MapReduce

Students can install Hadoop in Single Node Mode or in Pseudo-Distributed Operation on a single machine. The steps are as follows:

- Download a Hadoop distribution from the following link: http://hadoop.apache.org

- Start the Hadoop MapReduce environment with the following commands:

    $ cd $HADOOP_HOME
    $ bin/hadoop namenode -format
    $ bin/start-all.sh

- Open the following web pages to check whether Hadoop MapReduce is running:

    Namenode:   http://localhost:50070
    JobTracker: http://localhost:50030

- Run the sample application shipped with Hadoop:

    $ bin/hadoop fs -put conf input
    $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
    $ bin/hadoop fs -get output output
    $ cat output/*

- Shut down the Hadoop MapReduce environment:

    $ bin/stop-all.sh

The configuration files that set up cluster mode for Hadoop include three (3) main files in the hadoop-version/conf directory:

conf/core-site.xml:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

conf/hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

conf/mapred-site.xml:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

3.2. Example program: WordCount.java

    /* Filename: WordCount.java */
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      // Mapper: emits (word, 1) for every token of every input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer (also used as combiner): sums the counts of each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0]: input directory on HDFS, args[1]: output directory on HDFS
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

3.3. Compile and run

Compile WordCount.java and package it into a jar:

    $ export HADOOP_HOME=<Hadoop installation directory>
    $ mkdir -p ../wordcount_classes
    $ javac -classpath $HADOOP_HOME/hadoop-core-*.jar -d ../wordcount_classes/ WordCount.java
    $ jar -cvf wordcount.jar -C ../wordcount_classes/ .

Create two small input files on the local file system:

    $ mkdir wordcount
    $ cd wordcount
    $ mkdir input
    $ cd input/
    $ vi file01
    Hello Hadoop Goodbye Hadoop
    $ vi file02
    Hello World Bye World

Copy the input files to HDFS. The bin/hadoop commands are run from $HADOOP_HOME; adjust the local path in the -put command to the directory where file01 and file02 were actually created:

    $ bin/hadoop fs -mkdir wordcount
    $ bin/hadoop dfs -put $HOME/input/file0* wordcount/input/
    $ bin/hadoop dfs -ls wordcount/input
    Found 2 items
    28 2012-05-08 08:15 /user/hung/wordcount/input/file01
    22 2012-05-08 08:15 /user/hung/wordcount/input/file02

Check the contents of the input files on HDFS:

    $ bin/hadoop dfs -cat wordcount/input/file*
    Hello Hadoop Goodbye Hadoop
    Hello World Bye World

Run the job:

    $ bin/hadoop jar ~/wordcount.jar org.myorg.WordCount wordcount/input wordcount/output
    12/05/08 08:17:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    12/05/08 08:17:38 INFO mapred.FileInputFormat: Total input paths to process : 2
    12/05/08 08:17:38 INFO mapred.JobClient: Running job: job_201205080748_0004
    12/05/08 08:17:39 INFO mapred.JobClient:  map 0% reduce 0%
    12/05/08 08:17:57 INFO mapred.JobClient:  map 66% reduce 0%
    12/05/08 08:18:17 INFO mapred.JobClient:  map 100% reduce 100%
    12/05/08 08:18:22 INFO mapred.JobClient: Job complete: job_201205080748_0004
    12/05/08 08:18:22 INFO mapred.JobClient: Counters: 30
    ...
    12/05/08 08:18:22 INFO mapred.JobClient:     Combine output records=6
    12/05/08 08:18:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=471007232
    12/05/08 08:18:22 INFO mapred.JobClient:     Reduce output records=5
    12/05/08 08:18:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1495506944
    12/05/08 08:18:22 INFO mapred.JobClient:     Map output records=8
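The counter values in the log above can be checked by hand. The following is a minimal plain-Java sketch (no Hadoop classes) that replays the WordCount logic on the two sample files. It assumes one map task per input file and exactly one combiner pass per map task; Hadoop does not guarantee how many times a combiner runs, so this is only an illustration of where the numbers come from. The class name LocalWordCount and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Contents of file01 and file02 from the lab; assumed: one map task per file.
    static final String[] FILES = {
        "Hello Hadoop Goodbye Hadoop",
        "Hello World Bye World"
    };

    // Map step: each returned token stands for one emitted (word, 1) record.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    // Runs map, combine (per file) and reduce (over all files).
    // Fills 'result' with the final counts and returns
    // {map output records, combine output records, reduce output records}.
    static int[] run(String[] files, Map<String, Integer> result) {
        int mapOutput = 0;
        int combineOutput = 0;
        for (String file : files) {
            // Combine step: sum the counts locally within one map task.
            Map<String, Integer> combined = new TreeMap<>();
            for (String word : map(file)) {
                mapOutput++;
                combined.merge(word, 1, Integer::sum);
            }
            combineOutput += combined.size();
            // Reduce step: sum the partial counts across all map tasks.
            for (Map.Entry<String, Integer> entry : combined.entrySet()) {
                result.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return new int[] {mapOutput, combineOutput, result.size()};
    }

    public static void main(String[] args) {
        Map<String, Integer> result = new TreeMap<>(); // keys sorted, like reducer input
        int[] counters = run(FILES, result);
        System.out.println("Map output records=" + counters[0]);
        System.out.println("Combine output records=" + counters[1]);
        System.out.println("Reduce output records=" + counters[2]);
        for (Map.Entry<String, Integer> entry : result.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Under these assumptions the sketch reports 8 map output records, 6 combine output records and 5 reduce output records, matching the job log. The five tab-separated result lines (Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2) add up to 41 bytes, which is consistent with the size of part-00000 in the output listing.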

List the job output and print the result file:

    $ bin/hadoop dfs -ls wordcount/output
    Found 3 items
                                      0 2012-05-08 08:18 /user/hung/wordcount/output/_SUCCESS
    drwxr-xr-x   - hung supergroup    0 2012-05-08 08:17 /user/hung/wordcount/output/_logs
                                     41 2012-05-08 08:18 /user/hung/wordcount/output/part-00000
    [hung@ppdslab01 hadoop-0.20.205.0]$ bin/hadoop dfs -cat wordcount/output/part-00000

Note: some commands for working with HDFS

    $ bin/hadoop dfs -put <local source> <HDFS dest>   : upload input for the program
    $ bin/hadoop dfs -get <HDFS source> <local dest>   : download the program's output
    $ bin/hadoop dfs -rmr <dir>                        : delete a directory
    $ bin/hadoop dfs -rm <file>                        : delete a file

4. Exercises

Exercise 1: Run the WordCount program to count how often each word occurs in a set of text documents.

Exercise 2: Write a program that computes Pi using the Map/Reduce model.

Exercise 3: Given a set of vertices (coordinates (x, y) in two-dimensional space), find the shortest path between two given vertices. Hint: implement Dijkstra's algorithm on Hadoop MapReduce.
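As a starting point for Exercise 2, one possible way to split a Pi computation into map and reduce steps is Monte Carlo sampling. The sketch below is plain Java without Hadoop and is not presented as the intended solution: each simulated "map task" samples random points in the unit square and counts those inside the quarter circle, and the "reduce" step sums the partial counts. The class name LocalPiEstimate, the use of the task index as random seed, and the task and sample counts are all assumptions chosen for illustration.

```java
import java.util.Random;

public class LocalPiEstimate {

    // "Map" step: one task samples points in the unit square and returns
    // how many of them fall inside the quarter circle of radius 1.
    static long countInside(long seed, int samples) {
        Random random = new Random(seed); // fixed seed so runs are repeatable
        long inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return inside;
    }

    // "Reduce" step: sum the partial counts of all tasks. The fraction of
    // points inside the quarter circle approximates Pi / 4.
    static double estimate(int tasks, int samplesPerTask) {
        long inside = 0;
        for (int task = 0; task < tasks; task++) {
            inside += countInside(task, samplesPerTask);
        }
        long total = (long) tasks * samplesPerTask;
        return 4.0 * inside / total;
    }

    public static void main(String[] args) {
        System.out.println("Estimated Pi = " + estimate(10, 100_000));
    }
}
```

In a Hadoop version of this idea, each mapper would emit its pair of counts (points inside, points sampled) under a single key, and one reducer would add them up and compute the final ratio.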