Introduction to RHadoop. Master's Degree in Informatics Engineering. Master's Programme in ICT Innovation: Data Science (EIT ICT Labs Master School). Academic Year 2015-2016
Contents: Introduction to MapReduce, HDFS, Hadoop, Data Analytics with RHadoop
MapReduce & DQ. Divide and Conquer (DQ), general idea: divide a problem into smaller sub-problems, solve each sub-problem independently, and combine the solutions.
DQ: pseudo-code
Function DQ(x: problem data)
  if small(x) then
    S = easy(x)
  else
    divide(x) => (x_1, ..., x_k)
    for i = 1 to k do
      S_i = DQ(x_i)
    S = combine(S_1, ..., S_k)
  return S
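A minimal R sketch of this scheme, using the toy problem of summing a numeric vector; here easy(x) is a direct sum and combine is simply addition (function and variable names are illustrative):

# Toy divide-and-conquer in R: recursive sum of a numeric vector
dq_sum <- function(x, threshold = 4) {
  if (length(x) <= threshold) {          # small(x): solve directly
    return(sum(x))                       # easy(x)
  }
  mid <- length(x) %/% 2                 # divide(x) => (x_1, x_2)
  s1 <- dq_sum(x[1:mid], threshold)      # solve each sub-problem independently
  s2 <- dq_sum(x[(mid + 1):length(x)], threshold)
  s1 + s2                                # combine(s_1, s_2)
}
dq_sum(1:100)   # 5050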
DQ: efficiency. The efficiency of this approach depends on several points: an appropriate threshold must be selected for applying easy(x); the decomposition and combining functions must be efficient; and the sub-problems must be of (approximately) the same size.
DQ: remarks. It cannot be applied to every type of problem. Sometimes it is not obvious how to divide a large problem into sub-problems. If the division is uneven, the system becomes unbalanced, which has a significant impact on the overall performance of the algorithm. The sub-problems must be significantly smaller than the original one, so that a massively parallel system can be exploited and the communication overhead is compensated.
MapReduce: general scheme Source: www.academia.edu
MapReduce: more detail Source: Hadoop Book
MapReduce: example Source: MilanoR
Hadoop Distributed File System (HDFS). A distributed file system that evolved from the Google implementation (GFS). Fault-tolerant: files are divided into chunks, and those chunks are distributed and replicated across the cluster; normally, the replication factor is 3. A master node (the NameNode) stores the metadata: which files exist, into how many chunks they are divided, and where those chunks are stored. Large block sizes are preferred (128 MB by default).
Hadoop Distributed File System (HDFS) Source: Hadoop tutorial
Hadoop Distributed File System (HDFS). In HDFS, blocks are meant to be read from beginning to end (this favors the MapReduce approach). Files in HDFS are NOT stored alongside the host system's files: HDFS is normally an abstraction OVER an existing file system (ext3, ext4, etc.), so there are specific commands to manipulate the HDFS file system. To open a file stored in HDFS, the client must contact the NameNode to retrieve the location of each block of the file (on the DataNodes). Parallel reads are possible (and preferred).
Hadoop Distributed File System (HDFS). Data locality: normally, when a job is launched, each task runs on the node that stores the data it must process. The metadata stored in the NameNode is not automatically replicated (it must be backed up manually or through a secondary/standby NameNode).
HDFS from the command line. Each HDFS user has a personal directory. No security directives are enforced by default, so users can write anywhere. Access to HDFS is through the hdfs command (hdfs dfs <command>). Important commands: -copyFromLocal vs. -copyToLocal, -mkdir, -cp, -mv. Documentation is available on the Hadoop website.
Hadoop MRv1 vs YARN (MRv2). Hadoop MRv1: resource management, task scheduling, and monitoring are done by a single process (a bottleneck), the JobTracker; each sub-problem is run by an independent process, a TaskTracker. Hadoop MRv2 (YARN): resource management and task scheduling/monitoring are split into different processes; the ResourceManager (RM) handles overall resource management; an ApplicationMaster (AM) handles per-job task scheduling and monitoring; a NodeManager runs the tasks on each computing node.
Hadoop MRv1 vs YARN (MRv2)
Example: wordcount. Input: a document made up of words. Output: the set of pairs (word, count(word)). Two functions, map and reduce:
map(k1, v1):
  for each word w in v1
    emit(w, 1)
reduce(k2, v2_list):
  result = 0
  for each v in v2_list
    result += v
  emit(k2, result)
Example: wordcount
Example: wordcount
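Before turning to Hadoop, the same map/reduce logic can be sketched in plain, sequential R (illustrative only; no Hadoop involved):

# Sequential word count in base R, mirroring the map/reduce pseudo-code:
# the "map" step splits each line into words, the "reduce" step counts per word.
lines <- c("to be or not to be", "that is the question")
words <- unlist(strsplit(lines, split = " "))   # map: emit one token per word
counts <- table(words)                          # reduce: group and sum per word
print(counts)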
RHadoop Developed by Revolu1on Analy1cs (acquired by Microsol) Three main components rhdfs: R + HDFS rmr2: R + Map Reduce rhbase: R + Hbase Can be downloaded from: hgps://github.com/revolu1onanaly1cs/rhadoop/wiki/downloads Already installed and configured in the VM provided
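If you are not working on the provided VM, a typical setup looks roughly like the sketch below; the Hadoop paths and package versions are only examples and depend on your installation:

# Environment variables required by rhdfs and rmr2 (paths are examples)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
# Install the packages from the tarballs downloaded from the RHadoop wiki
# (file names below are examples of released versions)
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")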
RHadoop: interacting with HDFS
# Load the rhdfs library
library(rhdfs)
# Start rhdfs
hdfs.init()
# Basic "ls"; the path is mandatory
hdfs.ls("/user/hadoop")
# Create a directory
work.dir <- "/user/hadoop/aux/"
hdfs.mkdir(work.dir)
# And delete it
hdfs.delete(work.dir)
# Create it again
hdfs.mkdir(work.dir)
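rhdfs can also move files between the local file system and HDFS; a small sketch follows (the local paths are only examples):

# Copy a local file into the HDFS working directory (local path is an example)
hdfs.put("/home/hadoop/data/sample.txt", work.dir)
hdfs.ls(work.dir)
# Copy it back from HDFS to the local file system
hdfs.get(paste0(work.dir, "sample.txt"), "/tmp/sample_copy.txt")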
RHadoop: wordcount example. Library loading and initialization:
# Loading the RHadoop libraries
library('rhdfs')
library('rmr2')
# Initializing RHadoop
hdfs.init()
RHadoop: wordcount example
wordcount = function(input,
                     # The output can be an HDFS path, but if it is NULL a
                     # temporary file will be generated and wrapped in a big
                     # data object, like the ones generated by to.dfs
                     output = NULL,
                     pattern = " ") {
  # Defining the wordcount Map function
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # Defining the wordcount Reduce function
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
RHadoop: wordcount example
  # Defining the MapReduce parameters by calling the mapreduce function
  mapreduce(input = input,
            output = output,
            # You can specify your own input and output formats, and produce
            # binary formats, with the functions make.input.format and
            # make.output.format
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            # With combiner
            combine = T)
}
RHadoop: wordcount example
# Running the MapReduce job by passing the Hadoop input file location as a parameter
wordcount('/user/hadoop/wordcount/quijote.txt')
# Retrieving the RHadoop MapReduce output data by passing the output directory location as a parameter
from.dfs("/tmp/file1b0817a5bcd0")
El Quijote can be downloaded from: http://www.gutenberg.org/cache/epub/996/pg996.txt
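Since the temporary output path used above changes on every run, a more robust pattern is to keep the object returned by mapreduce(); the sketch below (variable names are illustrative) retrieves the counts directly:

# Keep the big data object returned by the wordcount/mapreduce call
wc.out <- wordcount('/user/hadoop/wordcount/quijote.txt')
# from.dfs() brings the key/value pairs back into R memory
wc.res <- from.dfs(wc.out)
# keys() and values() extract the words and their counts
wc.df <- data.frame(word = keys(wc.res), count = values(wc.res),
                    stringsAsFactors = FALSE)
head(wc.df[order(-wc.df$count), ])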
RHadoop: airline example. We will analyze the commercial flight data of an airline. The input data file is a CSV. We will use a custom input formatter to ease the task of processing the file. Data can be downloaded from: http://stat-computing.org/dataexpo/2009/1987.csv.bz2
RHadoop: airline example
library(rmr2)
library('rhdfs')
hdfs.init()
# Put the data in HDFS
hdfs.data.root = '/user/hadoop/rhadoop/airline'
hdfs.data = file.path(hdfs.data.root, 'data')
hdfs.mkdir(hdfs.data)
hdfs.put("/home/hadoop/downloads/1987.csv", hdfs.data)
hdfs.out = file.path(hdfs.data.root, 'out')
RHadoop: airline example (input format)
#
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(format = 'csv', mode = 'text',
    streaming.format = NULL, sep = ',',
    col.names = c('Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
                  'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier',
                  'FlightNum', 'TailNum', 'ActualElapsedTime', 'CRSElapsedTime',
                  'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest',
                  'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled',
                  'CancellationCode', 'Diverted', 'CarrierDelay',
                  'WeatherDelay', 'NASDelay', 'SecurityDelay',
                  'LateAircraftDelay'),
    stringsAsFactors = F)
RHadoop: airline example (mapper 1/2)
#
# the mapper gets keys and values from the input formatter
# in our case, the key is NULL and the value is a data.frame from read.table()
#
mapper.year.market.enroute_time = function(key, val.df) {
  # Remove header lines, cancellations, and diversions:
  val.df = subset(val.df, Year != 'Year' & Cancelled == 0 & Diverted == 0)
  # We don't care about direction of travel, so construct a new 'market' vector
  # with airports ordered alphabetically (e.g., LAX to JFK becomes 'JFK-LAX')
  market = with(val.df,
                ifelse(Origin < Dest,
                       paste(Origin, Dest, sep = '-'),
                       paste(Dest, Origin, sep = '-')))
RHadoop: airline example (mapper 2/2)
  # the key consists of year and market
  output.key = data.frame(year = as.numeric(val.df$Year),
                          market = market,
                          stringsAsFactors = F)
  # emit a data.frame of gate-to-gate elapsed times (CRS and actual) + time in air
  output.val = val.df[, c('CRSElapsedTime', 'ActualElapsedTime', 'AirTime')]
  colnames(output.val) = c('scheduled', 'actual', 'inflight')
  # and finally, make sure they're numeric while we're at it
  output.val = transform(output.val,
                         scheduled = as.numeric(scheduled),
                         actual = as.numeric(actual),
                         inflight = as.numeric(inflight))
  return(keyval(output.key, output.val))
}
RHadoop: airline example (reducer)
#
# the reducer gets all the values for a given key
# the values (which may be multi-valued, as here) come in the form of a data.frame
#
reducer.year.market.enroute_time = function(key, val.df) {
  output.key = key
  output.val = data.frame(flights = nrow(val.df),
                          scheduled = mean(val.df$scheduled, na.rm = T),
                          actual = mean(val.df$actual, na.rm = T),
                          inflight = mean(val.df$inflight, na.rm = T))
  return(keyval(output.key, output.val))
}
RHadoop: final configuration and execution
mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=2")),
            verbose = T)
}
out = mr.year.market.enroute_time(hdfs.data, hdfs.out)
RHadoop: gathering results
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors = F)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'inflight')
print(head(results.df))
# save(results.df, file = "out/enroute.time.market.RData")
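As a quick sanity check (purely illustrative), the gathered results can be explored with ordinary R, for example ranking the 1987 markets by number of flights:

# Illustrative exploration of the gathered results:
# rank the markets of 1987 by number of flights
res.1987 <- subset(results.df, year == 1987)
head(res.1987[order(-res.1987$flights), ], n = 10)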