MapReduce on Big Data
Map / Reduce
Hadoop Hello world - Word count
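The "hello world" of Hadoop is counting words in a body of text. A minimal sketch of that example in rmr2, modeled on the rmr2 tutorial (the names wordcount, wc.map, and wc.reduce are illustrative, not part of the package):

library(rmr2)

# Word count as a single rmr2 job: the map emits a (word, 1) pair for
# every word on every input line, the reduce sums the counts per word.
wordcount = function(input, output = NULL, pattern = " ") {
  wc.map = function(., lines) {
    # split each text line into words; each word becomes a key with count 1
    keyval(unlist(strsplit(lines, split = pattern)), 1)
  }
  wc.reduce = function(word, counts) {
    # all counts for one word arrive together; sum them
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output,
            input.format = "text",
            map = wc.map, reduce = wc.reduce,
            combine = TRUE)
}

With a plain-text file on HDFS (path hypothetical), from.dfs(wordcount("/user/brokaa/some_text")) would return the per-word counts as key/value pairs.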
Hadoop Ecosystem
+ rmr - functions providing Hadoop MapReduce functionality in R
+ rhdfs - functions providing file management of the HDFS from within R
+ rhbase - functions providing database management for the HBase distributed database from within R
+ NEW! plyrmr - higher-level plyr-like data processing for structured data, powered by rmr
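In practice these packages are wired together by pointing R at the local Hadoop installation before loading them. A minimal setup sketch (HADOOP_CMD=/usr/bin/hadoop matches the session below; the streaming jar path is an assumption and depends on the cluster):

# Point R at Hadoop before loading the RHadoop packages
# (the streaming jar path is an assumption; adjust for your installation)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

library(rhdfs)   # HDFS file management
library(rmr2)    # MapReduce jobs
hdfs.init()      # initialize the HDFS connection once per session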
library(rhdfs)
Loading required package: rJava
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
hdfs.init()
hdfs.ls("pig_out")
  permission  owner       group   size          modtime                              file
1 -rw-r--r-- brokaa linga_admin      0 2013-11-20 22:46     /user/brokaa/pig_out/_SUCCESS
2 drwx--x--x brokaa linga_admin      0 2013-11-20 22:46        /user/brokaa/pig_out/_logs
3 -rw-r--r-- brokaa linga_admin 108507 2013-11-20 22:46 /user/brokaa/pig_out/part-m-00000
hdfs.stat("pig_out/part-m-00000")
      perms isdir     block replication  owner       group   size              modtime                 path
1 rw-r--r-- FALSE 134217728           3 brokaa linga_admin 108507 45859-01-30 19:15:45 pig_out/part-m-00000
pig_out = hdfs.cat("pig_out/part-m-00000")
pig_out[1:4]
[1] ""
[2] "PROJECT GUTENBERG ETEXT OF A MIDSUMMER NIGHT'S DREAM BY SHAKESPEARE"
[3] "PG HAS MULTIPLE EDITIONS OF WILLIAM SHAKESPEARE'S COMPLETE WORKS"
[4] ""
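Besides listing and reading, rhdfs can move whole files between the local filesystem and HDFS. A short sketch with hypothetical file and directory names:

# Round-trip a file through HDFS (all names here are hypothetical)
writeLines(c("hello", "world"), "local.txt")
hdfs.mkdir("demo")
hdfs.put("local.txt", "demo/local.txt")   # local -> HDFS
hdfs.get("demo/local.txt", "copy.txt")    # HDFS -> local
readLines("copy.txt")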
MapReduce without Hadoop

# Generate some numbers
small.ints = 1:10
cat(small.ints)
1 2 3 4 5 6 7 8 9 10
# Map
sapply(small.ints, function(x) x^2)
 [1]   1   4   9  16  25  36  49  64  81 100
# Reduce
sum(sapply(small.ints, function(x) x^2))
[1] 385
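Base R even ships Map() and Reduce() as higher-order functions, so the same computation can be written literally in those terms:

# The same sum of squares, phrased with base R's own Map/Reduce primitives
squares = Map(function(x) x^2, small.ints)  # map step: a list of squares
Reduce(`+`, squares)                        # reduce step: fold the list with +
# [1] 385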
Map only, No Reduce yet

library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
ints = to.dfs(1:10)
squares = mapreduce(
+   input = ints,
+   map = function(k, v) cbind(v, v^2)
+ )
from.dfs(squares)
$key
NULL

$val
       v
 [1,]  1   1
 [2,]  2   4
 [3,]  3   9
 [4,]  4  16
 [5,]  5  25
 [6,]  6  36
 [7,]  7  49
 [8,]  8  64
 [9,]  9  81
[10,] 10 100
packageJobJar: [/tmp/rtmpr9tys5/rmr-local-env62ee3e53886e, /tmp/rtmpr9tys5/rmr-global-env62ee751996c, /tmp/rtmpr9tys5/rmr-streaming-map62ee231197ff, /tmp/hadoop-brokaa/hadoop-unjar2194300276505107223/] [] /tmp/streamjob4506528159341704260.jar tmpdir=null
13/11/21 06:18:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/21 06:18:24 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/21 06:18:24 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-brokaa/mapred/local]
13/11/21 06:18:24 INFO streaming.StreamJob: Running job: job_201311190929_0081
13/11/21 06:18:24 INFO streaming.StreamJob: To kill this job, run:
13/11/21 06:18:24 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=name-0-1.local:8021 -kill job_201311190929_0081
13/11/21 06:18:24 INFO streaming.StreamJob: Tracking URL: http://name-0-1.local:50030/jobdetails.jsp?jobid=job_201311190929_0081
13/11/21 06:18:25 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/21 06:18:34 INFO streaming.StreamJob:  map 50%  reduce 0%
13/11/21 06:18:36 INFO streaming.StreamJob:  map 100%  reduce 0%
13/11/21 06:18:38 INFO streaming.StreamJob:  map 100%  reduce 100%
13/11/21 06:18:38 INFO streaming.StreamJob: Job complete: job_201311190929_0081
13/11/21 06:18:38 INFO streaming.StreamJob: Output: /tmp/rtmpr9tys5/file62ee1f1f4715
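Everything from packageJobJar down is the Hadoop streaming job itself. During development, rmr2 can skip the cluster entirely: its local backend runs the same mapreduce() code in-process, with no job submission and none of this log output:

# Run subsequent mapreduce() calls in-process, with no Hadoop logs;
# switch back with backend = "hadoop" when ready for the cluster
rmr.options(backend = "local")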
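Note that $key came back NULL above because the map function returned a bare matrix rather than key/value pairs. A sketch of the same map-only job emitting explicit pairs, using each input number as its own key:

# Same job, but emitting explicit (number, square) pairs via keyval()
squares.kv = mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)
out = from.dfs(squares.kv)
out$key  # 1 2 ... 10 (ordering across map tasks is not guaranteed)
out$val  # 1 4 ... 100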
MapReduce in Action

input.size = 10000
input.ga = to.dfs(cbind(1:input.size, rnorm(input.size)))
group = function(x) x %% 10
aggregate = function(x) sum(x)
result = mapreduce(
+   input.ga,
+   map = function(k, v) keyval(group(v[,1]), v[,2]),
+   reduce = function(k, vv) keyval(k, aggregate(vv)),
+   combine = TRUE
+ )
from.dfs(result)
$key
 [1] 7 8 9 0 2 3 4 1 5 6
$val
 [1]  43.6705736 -37.8089057   0.7431469  20.7087651 -50.2379686 -15.2318460
 [7] -15.8000149  23.8315629 -40.7084170 -57.9374157
packageJobJar: [/tmp/rtmpr9tys5/rmr-local-env62ee790bc164, /tmp/rtmpr9tys5/rmr-global-env62ee4e9d9a75, /tmp/rtmpr9tys5/rmr-streaming-map62ee10105eb4, /tmp/rtmpr9tys5/rmr-streaming-reduce62ee6a9746ba, /tmp/rtmpr9tys5/rmr-streaming-combine62ee5a41c721, /tmp/hadoop-brokaa/hadoop-unjar522590615447518349/] [] /tmp/streamjob3468024272610200696.jar tmpdir=null
13/11/21 06:31:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/21 06:31:54 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/21 06:31:55 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-brokaa/mapred/local]
13/11/21 06:31:55 INFO streaming.StreamJob: Running job: job_201311190929_0082
13/11/21 06:31:55 INFO streaming.StreamJob: To kill this job, run:
13/11/21 06:31:55 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=name-0-1.local:8021 -kill job_201311190929_0082
13/11/21 06:31:55 INFO streaming.StreamJob: Tracking URL: http://name-0-1.local:50030/jobdetails.jsp?jobid=job_201311190929_0082
13/11/21 06:31:56 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/21 06:32:07 INFO streaming.StreamJob:  map 50%  reduce 0%
13/11/21 06:32:12 INFO streaming.StreamJob:  map 100%  reduce 0%
13/11/21 06:32:24 INFO streaming.StreamJob:  map 100%  reduce 11%
13/11/21 06:32:25 INFO streaming.StreamJob:  map 100%  reduce 33%
13/11/21 06:32:26 INFO streaming.StreamJob:  map 100%  reduce 52%
13/11/21 06:32:27 INFO streaming.StreamJob:  map 100%  reduce 70%
13/11/21 06:32:28 INFO streaming.StreamJob:  map 100%  reduce 86%
13/11/21 06:32:29 INFO streaming.StreamJob:  map 100%  reduce 100%
13/11/21 06:32:31 INFO streaming.StreamJob: Job complete: job_201311190929_0082
13/11/21 06:32:31 INFO streaming.StreamJob: Output: /tmp/rtmpr9tys5/file62ee21f87721
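Since the job only sums v[,2] within the ten groups defined by v[,1] %% 10, the cluster's answer can be sanity-checked against a purely local computation when the random matrix is generated in memory first. A sketch reusing group() and aggregate() from above:

# Keep a local copy of the random data so Hadoop's answer can be verified
m = cbind(1:input.size, rnorm(input.size))
local.sums = tapply(m[,2], group(m[,1]), sum)  # reference answer, per group

checked = mapreduce(
  to.dfs(m),
  map = function(k, v) keyval(group(v[,1]), v[,2]),
  reduce = function(k, vv) keyval(k, aggregate(vv)),
  combine = TRUE
)
out = from.dfs(checked)
all.equal(sort(unname(local.sums)), sort(out$val))  # should be TRUE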