BIG DATA ANALYSIS USING RHADOOP


MapReduce on Big Data

Map / Reduce

Hadoop Hello world - Word count

Hadoop Ecosystem

+ rmr - functions providing Hadoop MapReduce functionality in R
+ rhdfs - functions providing file management of the HDFS from within R
+ rhbase - functions providing database management for the HBase distributed database from within R
+ plyrmr (NEW!) - higher-level plyr-like data processing for structured data, powered by rmr
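The "Hadoop Hello world - Word count" item from the outline can be written with rmr alone. The following is a sketch in the style of the rmr2 tutorials, using the same mapreduce()/keyval() calls that appear on the later slides; the function name wc and the whitespace tokenization are illustrative choices, not taken from these slides:

```r
library(rmr2)

# Word-count sketch: the map emits one (word, 1) pair per token,
# the reduce sums the counts for each word. combine = TRUE reuses
# the reducer as a combiner, as in the grouping example later on.
wc = function(input) {
  mapreduce(
    input = input,
    map = function(k, lines) {
      words = unlist(strsplit(lines, "\\s+"))
      keyval(words, 1)
    },
    reduce = function(word, counts) keyval(word, sum(counts)),
    combine = TRUE
  )
}

# Usage (requires a running Hadoop cluster with rmr2 configured):
# counts = from.dfs(wc(to.dfs(c("big data", "big hadoop"))))
```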

library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
hdfs.init()

hdfs.ls("pig_out")
  permission  owner       group   size          modtime                              file
1 -rw-r--r-- brokaa linga_admin      0 2013-11-20 22:46     /user/brokaa/pig_out/_SUCCESS
2 drwx--x--x brokaa linga_admin      0 2013-11-20 22:46        /user/brokaa/pig_out/_logs
3 -rw-r--r-- brokaa linga_admin 108507 2013-11-20 22:46 /user/brokaa/pig_out/part-m-00000

hdfs.stat("pig_out/part-m-00000")
      perms isDir     block replication  owner       group   size              modtime                 path
1 rw-r--r-- FALSE 134217728           3 brokaa linga_admin 108507 45859-01-30 19:15:45 pig_out/part-m-00000

pig_out = hdfs.cat("pig_out/part-m-00000")
pig_out[1:4]
[1] ""
[2] "PROJECT GUTENBERG ETEXT OF A MIDSUMMER NIGHT'S DREAM BY SHAKESPEARE"
[3] "PG HAS MULTIPLE EDITIONS OF WILLIAM SHAKESPEARE'S COMPLETE WORKS"
[4] ""

MapReduce without Hadoop

# Generate some numbers
small.ints = 1:10
cat(small.ints)
1 2 3 4 5 6 7 8 9 10

# Map
sapply(small.ints, function(x) x^2)
[1]   1   4   9  16  25  36  49  64  81 100

# Reduce
sum(sapply(small.ints, function(x) x^2))
[1] 385
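Base R also has Map() and Reduce() built-ins that mirror the two phases even more literally; the slide's sapply/sum pipeline can be written as:

```r
small.ints = 1:10

# Map phase: square each element; Reduce phase: fold the results with `+`
total = Reduce(`+`, Map(function(x) x^2, small.ints))
total
# [1] 385
```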

Map only, No Reduce yet

library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

ints = to.dfs(1:10)
squares = mapreduce(
+   input = ints,
+   map = function(k, v) cbind(v, v^2)
+ )
from.dfs(squares)
$key
NULL

$val
       v
 [1,]  1   1
 [2,]  2   4
 [3,]  3   9
 [4,]  4  16
 [5,]  5  25
 [6,]  6  36
 [7,]  7  49
 [8,]  8  64
 [9,]  9  81
[10,] 10 100

packageJobJar: [/tmp/rtmpr9tys5/rmr-local-env62ee3e53886e, /tmp/rtmpr9tys5/rmr-global-env62ee751996c, /tmp/rtmpr9tys5/rmr-streaming-map62ee231197ff, /tmp/hadoop-brokaa/hadoop-unjar2194300276505107223/] [] /tmp/streamjob4506528159341704260.jar tmpDir=null
13/11/21 06:18:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/21 06:18:24 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/21 06:18:24 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-brokaa/mapred/local]
13/11/21 06:18:24 INFO streaming.StreamJob: Running job: job_201311190929_0081
13/11/21 06:18:24 INFO streaming.StreamJob: To kill this job, run:
13/11/21 06:18:24 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=name-0-1.local:8021 -kill job_201311190929_0081
13/11/21 06:18:24 INFO streaming.StreamJob: Tracking URL: http://name-0-1.local:50030/jobdetails.jsp?jobid=job_201311190929_0081
13/11/21 06:18:25 INFO streaming.StreamJob: map 0% reduce 0%
13/11/21 06:18:34 INFO streaming.StreamJob: map 50% reduce 0%
13/11/21 06:18:36 INFO streaming.StreamJob: map 100% reduce 0%
13/11/21 06:18:38 INFO streaming.StreamJob: map 100% reduce 100%
13/11/21 06:18:38 INFO streaming.StreamJob: Job complete: job_201311190929_0081
13/11/21 06:18:38 INFO streaming.StreamJob: Output: /tmp/rtmpr9tys5/file62ee1f1f4715

MapReduce in Action

input.size = 10000
input.ga = to.dfs(cbind(1:input.size, rnorm(input.size)))
group = function(x) x %% 10
aggregate = function(x) sum(x)
result = mapreduce(
+   input.ga,
+   map = function(k, v) keyval(group(v[,1]), v[,2]),
+   reduce = function(k, vv) keyval(k, aggregate(vv)),
+   combine = TRUE
+ )
from.dfs(result)
$key
 [1] 7 8 9 0 2 3 4 1 5 6

$val
 [1]  43.6705736 -37.8089057   0.7431469  20.7087651 -50.2379686 -15.2318460
 [7] -15.8000149  23.8315629 -40.7084170 -57.9374157
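Since this job only groups the second column by (first column mod 10) and sums within each group, the same computation can be done in memory with base R's tapply as a sanity check before submitting to the cluster. Note that rnorm() generates fresh random data here, so the sums will differ from the run on the slide:

```r
input.size = 10000
m = cbind(1:input.size, rnorm(input.size))

# Same grouping (index mod 10) and aggregation (sum) as the
# mapreduce() call above, computed locally without Hadoop
local.result = tapply(m[, 2], m[, 1] %% 10, sum)
length(local.result)   # one sum per group 0..9
```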

packageJobJar: [/tmp/rtmpr9tys5/rmr-local-env62ee790bc164, /tmp/rtmpr9tys5/rmr-global-env62ee4e9d9a75, /tmp/rtmpr9tys5/rmr-streaming-map62ee10105eb4, /tmp/rtmpr9tys5/rmr-streaming-reduce62ee6a9746ba, /tmp/rtmpr9tys5/rmr-streaming-combine62ee5a41c721, /tmp/hadoop-brokaa/hadoop-unjar522590615447518349/] [] /tmp/streamjob3468024272610200696.jar tmpDir=null
13/11/21 06:31:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/21 06:31:54 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/21 06:31:55 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-brokaa/mapred/local]
13/11/21 06:31:55 INFO streaming.StreamJob: Running job: job_201311190929_0082
13/11/21 06:31:55 INFO streaming.StreamJob: To kill this job, run:
13/11/21 06:31:55 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=name-0-1.local:8021 -kill job_201311190929_0082
13/11/21 06:31:55 INFO streaming.StreamJob: Tracking URL: http://name-0-1.local:50030/jobdetails.jsp?jobid=job_201311190929_0082
13/11/21 06:31:56 INFO streaming.StreamJob: map 0% reduce 0%
13/11/21 06:32:07 INFO streaming.StreamJob: map 50% reduce 0%
13/11/21 06:32:12 INFO streaming.StreamJob: map 100% reduce 0%
13/11/21 06:32:24 INFO streaming.StreamJob: map 100% reduce 11%
13/11/21 06:32:25 INFO streaming.StreamJob: map 100% reduce 33%
13/11/21 06:32:26 INFO streaming.StreamJob: map 100% reduce 52%
13/11/21 06:32:27 INFO streaming.StreamJob: map 100% reduce 70%
13/11/21 06:32:28 INFO streaming.StreamJob: map 100% reduce 86%
13/11/21 06:32:29 INFO streaming.StreamJob: map 100% reduce 100%
13/11/21 06:32:31 INFO streaming.StreamJob: Job complete: job_201311190929_0082
13/11/21 06:32:31 INFO streaming.StreamJob: Output: /tmp/rtmpr9tys5/file62ee21f87721