Introduction to RHadoop. Master's Degree in Informatics Engineering. Master's Programme in ICT Innovation: Data Science (EIT ICT Labs Master School). Academic Year 2015-2016
Contents: Introduction to MapReduce, HDFS, Hadoop, Data Analytics with RHadoop
MapReduce & DQ. Divide and Conquer (DQ), general idea: divide a problem into smaller sub-problems, solve each sub-problem independently, and combine the solutions.
DQ: pseudo-code
Function DQ(x: problem data)
  if small(x) then
    S = easy(x)
  else
    divide(x) => (x_1, ..., x_k)
    for i = 1 to k do
      S_i = DQ(x_i)
    S = combine(S_1, ..., S_k)
  return S
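A minimal R sketch of this scheme, using the toy problem of summing a numeric vector; here easy(x) is a direct sum and combine is simply addition (function and variable names are illustrative):

# Toy divide-and-conquer in R: recursive sum of a numeric vector
dq_sum <- function(x, threshold = 4) {
  if (length(x) <= threshold) {          # small(x): solve directly
    return(sum(x))                       # easy(x)
  }
  mid <- length(x) %/% 2                 # divide(x) => (x_1, x_2)
  s1 <- dq_sum(x[1:mid], threshold)      # solve each sub-problem independently
  s2 <- dq_sum(x[(mid + 1):length(x)], threshold)
  s1 + s2                                # combine(s_1, s_2)
}
dq_sum(1:100)   # 5050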
DQ: efficiency. The efficiency of this approach depends on several points: an appropriate threshold must be selected for applying easy(x); the decomposition and combining functions must be efficient; and the sub-problems must be of (approximately) the same size.
DQ: remarks. It cannot be applied to every type of problem. Sometimes it is not obvious how to divide a large problem into sub-problems. If the division is uneven, the system becomes unbalanced, which has a significant impact on the overall performance of the algorithm. The sub-problems must be significantly smaller than the original one, so that a massively parallel system can be exploited and the communication overhead is compensated.
MapReduce: general scheme Source: www.academia.edu
MapReduce: more detail Source: Hadoop Book
MapReduce: example Source: MilanoR
Hadoop Distributed File System (HDFS). A distributed file system that evolved from the Google implementation (GFS). Fault-tolerant: files are divided into chunks, and those chunks are distributed and replicated across the cluster; normally, the replication factor is 3. A master node (the NameNode) stores the metadata: which files exist, into how many chunks they are divided, and where those chunks are stored. Large block sizes are preferred (128 MB by default).
Hadoop Distributed File System (HDFS) Source: Hadoop tutorial
Hadoop Distributed File System (HDFS). In HDFS, blocks are meant to be read from beginning to end (this favors the MapReduce approach). Files in HDFS are NOT stored alongside the host system's files: HDFS is normally an abstraction OVER an existing file system (ext3, ext4, etc.), so there are specific commands to manipulate the HDFS file system. To open a file stored in HDFS, the client must contact the NameNode to retrieve the location of each block of the file (on the DataNodes). Parallel reads are possible (and preferred).
Hadoop Distributed File System (HDFS). Data locality: normally, when a job is launched, each task runs on the node that stores the data it must process. The metadata stored in the NameNode is not automatically replicated (it must be backed up manually or through a secondary/standby NameNode).
HDFS from the command line. Each HDFS user has a personal directory. No security directives are enforced by default, so users can write anywhere. Access to HDFS is through the hdfs command (hdfs dfs <command>). Important commands: -copyFromLocal vs. -copyToLocal, -mkdir, -cp, -mv. Documentation is available on the Hadoop website.
Hadoop MRv1 vs YARN (MRv2). Hadoop MRv1: resource management, task scheduling, and monitoring are done by a single process (a bottleneck), the JobTracker; each sub-problem is run by an independent process, a TaskTracker. Hadoop MRv2 (YARN): resource management and task scheduling/monitoring are split into different processes; the ResourceManager (RM) handles overall resource management; an ApplicationMaster (AM) handles per-job task scheduling and monitoring; a NodeManager runs the tasks on each computing node.
Hadoop MRv1 vs YARN (MRv2)
Example: wordcount. Input: a document made up of words. Output: the set of pairs (word, count(word)). Two functions, map and reduce:
map(k1, v1):
  for each word w in v1
    emit(w, 1)
reduce(k2, v2_list):
  result = 0
  for each v in v2_list
    result += v
  emit(k2, result)
Example: wordcount
Example: wordcount
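Before turning to Hadoop, the same map/reduce logic can be sketched in plain, sequential R (illustrative only; no Hadoop involved):

# Sequential word count in base R, mirroring the map/reduce pseudo-code:
# the "map" step splits each line into words, the "reduce" step counts per word.
lines <- c("to be or not to be", "that is the question")
words <- unlist(strsplit(lines, split = " "))   # map: emit one token per word
counts <- table(words)                          # reduce: group and sum per word
print(counts)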
RHadoop Developed by Revolu1on Analy1cs (acquired by Microsol) Three main components rhdfs: R + HDFS rmr2: R + Map Reduce rhbase: R + Hbase Can be downloaded from: hgps://github.com/revolu1onanaly1cs/rhadoop/wiki/downloads Already installed and configured in the VM provided
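If you are not working on the provided VM, a typical setup looks roughly like the sketch below; the Hadoop paths and package versions are only examples and depend on your installation:

# Environment variables required by rhdfs and rmr2 (paths are examples)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
# Install the packages from the tarballs downloaded from the RHadoop wiki
# (file names below are examples of released versions)
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")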
RHadoop: interacting with HDFS
# Load the rhdfs library
library(rhdfs)
# Start rhdfs
hdfs.init()
# Basic "ls"; the path is mandatory
hdfs.ls("/user/hadoop")
# Create a directory
work.dir <- "/user/hadoop/aux/"
hdfs.mkdir(work.dir)
# And delete it
hdfs.delete(work.dir)
# Create it again
hdfs.mkdir(work.dir)
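rhdfs can also move files between the local file system and HDFS; a small sketch follows (the local paths are only examples):

# Copy a local file into the HDFS working directory (local path is an example)
hdfs.put("/home/hadoop/data/sample.txt", work.dir)
hdfs.ls(work.dir)
# Copy it back from HDFS to the local file system
hdfs.get(paste0(work.dir, "sample.txt"), "/tmp/sample_copy.txt")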
RHadoop: wordcount example. Library loading and initialization:
# Loading the RHadoop libraries
library('rhdfs')
library('rmr2')
# Initializing RHadoop
hdfs.init()
RHadoop: wordcount example
wordcount = function(input,
                     # The output can be an HDFS path, but if it is NULL a
                     # temporary file will be generated and wrapped in a big
                     # data object, like the ones generated by to.dfs
                     output = NULL,
                     pattern = " ") {
  # Defining the wordcount Map function
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # Defining the wordcount Reduce function
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
RHadoop: wordcount example
  # Defining the MapReduce parameters by calling the mapreduce function
  mapreduce(input = input,
            output = output,
            # You can specify your own input and output formats, and produce
            # binary formats, with the functions make.input.format and
            # make.output.format
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            # With combiner
            combine = T)
}
RHadoop: wordcount example
# Running the MapReduce job by passing the Hadoop input file location as a parameter
wordcount('/user/hadoop/wordcount/quijote.txt')
# Retrieving the RHadoop MapReduce output data by passing the output directory location as a parameter
from.dfs("/tmp/file1b0817a5bcd0")
El Quijote can be downloaded from: http://www.gutenberg.org/cache/epub/996/pg996.txt
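Since the temporary output path used above changes on every run, a more robust pattern is to keep the object returned by mapreduce(); the sketch below (variable names are illustrative) retrieves the counts directly:

# Keep the big data object returned by the wordcount/mapreduce call
wc.out <- wordcount('/user/hadoop/wordcount/quijote.txt')
# from.dfs() brings the key/value pairs back into R memory
wc.res <- from.dfs(wc.out)
# keys() and values() extract the words and their counts
wc.df <- data.frame(word = keys(wc.res), count = values(wc.res),
                    stringsAsFactors = FALSE)
head(wc.df[order(-wc.df$count), ])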
RHadoop: airline example. We will analyze the commercial flight data of an airline. The input data file is a CSV. We will use a custom input formatter to ease the task of processing the file. Data can be downloaded from: http://stat-computing.org/dataexpo/2009/1987.csv.bz2
RHadoop: airline example
library(rmr2)
library('rhdfs')
hdfs.init()
# Put the data in HDFS
hdfs.data.root = '/user/hadoop/rhadoop/airline'
hdfs.data = file.path(hdfs.data.root, 'data')
hdfs.mkdir(hdfs.data)
hdfs.put("/home/hadoop/downloads/1987.csv", hdfs.data)
hdfs.out = file.path(hdfs.data.root, 'out')
RHadoop: airline example (input format)
#
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(format = 'csv', mode = 'text',
    streaming.format = NULL, sep = ',',
    col.names = c('Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
                  'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier',
                  'FlightNum', 'TailNum', 'ActualElapsedTime', 'CRSElapsedTime',
                  'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest',
                  'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled',
                  'CancellationCode', 'Diverted', 'CarrierDelay',
                  'WeatherDelay', 'NASDelay', 'SecurityDelay',
                  'LateAircraftDelay'),
    stringsAsFactors = F)
RHadoop: airline example (mapper 1/2)
#
# the mapper gets keys and values from the input formatter
# in our case, the key is NULL and the value is a data.frame from read.table()
#
mapper.year.market.enroute_time = function(key, val.df) {
  # Remove header lines, cancellations, and diversions:
  val.df = subset(val.df, Year != 'Year' & Cancelled == 0 & Diverted == 0)
  # We don't care about direction of travel, so construct a new 'market' vector
  # with airports ordered alphabetically (e.g., LAX to JFK becomes 'JFK-LAX')
  market = with(val.df,
                ifelse(Origin < Dest,
                       paste(Origin, Dest, sep = '-'),
                       paste(Dest, Origin, sep = '-')))
RHadoop: airline example (mapper 2/2)
  # the key consists of year and market
  output.key = data.frame(year = as.numeric(val.df$Year),
                          market = market,
                          stringsAsFactors = F)
  # emit a data.frame of gate-to-gate elapsed times (CRS and actual) + time in air
  output.val = val.df[, c('CRSElapsedTime', 'ActualElapsedTime', 'AirTime')]
  colnames(output.val) = c('scheduled', 'actual', 'inflight')
  # and finally, make sure they're numeric while we're at it
  output.val = transform(output.val,
                         scheduled = as.numeric(scheduled),
                         actual = as.numeric(actual),
                         inflight = as.numeric(inflight))
  return(keyval(output.key, output.val))
}
RHadoop: airline example (reducer)
#
# the reducer gets all the values for a given key
# the values (which may be multi-valued, as here) come in the form of a data.frame
#
reducer.year.market.enroute_time = function(key, val.df) {
  output.key = key
  output.val = data.frame(flights = nrow(val.df),
                          scheduled = mean(val.df$scheduled, na.rm = T),
                          actual = mean(val.df$actual, na.rm = T),
                          inflight = mean(val.df$inflight, na.rm = T))
  return(keyval(output.key, output.val))
}
RHadoop: final configuration and execution
mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=2")),
            verbose = T)
}
out = mr.year.market.enroute_time(hdfs.data, hdfs.out)
RHadoop: gathering results
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors = F)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'inflight')
print(head(results.df))
# save(results.df, file = "out/enroute.time.market.RData")
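As a quick sanity check (purely illustrative), the gathered results can be explored with ordinary R, for example ranking the 1987 markets by number of flights:

# Illustrative exploration of the gathered results:
# rank the markets of 1987 by number of flights
res.1987 <- subset(results.df, year == 1987)
head(res.1987[order(-res.1987$flights), ], n = 10)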