Simple Parallel Computing in R Using Hadoop



Similar documents
Distributed Text Mining with tm

Distributed Computing with Hadoop

Distributed Text Mining with tm

Package hive. January 10, 2011

Package hive. July 3, 2015

How To Use Hadoop

CSE-E5430 Scalable Cloud Computing Lecture 2

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Jeffrey D. Ullman slides. MapReduce for data intensive computing

A Performance Analysis of Distributed Indexing using Terrier

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Hadoop Architecture. Part 1

MapReduce. Tushar B. Kute,

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Open source Google-style large scale data analysis with Hadoop

Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Big Data and Apache Hadoop s MapReduce

How To Install Hadoop From Apa Hadoop To (Hadoop)

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Hadoop Installation. Sandeep Prasad

MapReduce and Hadoop Distributed File System

A Study of Data Management Technology for Handling Big Data

Tutorial for Assignment 2.0

Big Data With Hadoop

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Reduction of Data at Namenode in HDFS using harballing Technique

GraySort and MinuteSort at Yahoo on Hadoop 0.23

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

HADOOP MOCK TEST HADOOP MOCK TEST II

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

Apache Hadoop new way for the company to store and analyze big data

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani

Hadoop Basics with InfoSphere BigInsights

Distributed Filesystems

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Apache Hadoop. Alexandru Costan

Introduction to Hadoop

Fault Tolerance in Hadoop for Work Migration

Introduction to MapReduce and Hadoop

Open source large scale distributed data management with Google s MapReduce and Bigtable

Mobile Cloud Computing for Data-Intensive Applications

Distributed File Systems

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

TP1: Getting Started with Hadoop

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

THE HADOOP DISTRIBUTED FILE SYSTEM

Data-Intensive Computing with Map-Reduce and Hadoop

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

MAPREDUCE Programming Model

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

MapReduce, Hadoop and Amazon AWS

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Apache Hama Design Document v0.6

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Distributed Framework for Data Mining As a Service on Private Cloud

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Hadoop

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

HADOOP PERFORMANCE TUNING

HDFS Cluster Installation Automation for TupleWare

L1: Introduction to Hadoop

Application Development. A Paradigm Shift

Big Data Analytics Using R

Data processing goes big

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

HADOOP MOCK TEST HADOOP MOCK TEST I

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Transcription:

Simple Parallel Computing in R Using Hadoop Stefan Theußl WU Vienna University of Economics and Business Augasse 2 6, 1090 Vienna Stefan.Theussl@wu.ac.at 30.06.2009

Agenda Problem & Motivation The MapReduce Paradigm Package hive Distributed Text Mining in R Discussion

Motivation Recap of my talk last year: Computer architecture: distributed memory (DMP) and shared memory platforms (SMP) Types of parallelism: functional or data-driven Parallel computing strategies: threads (on SMP), message-passing (on DMP), high-level abstractions (e.g., packages snow, nws)

Motivation Main motivation: large scale data processing Many tasks, i.e. we produce output data via processing lots of input data Want to make use of many CPUs Typically this is not easy (parallelization, synchronization, I/O, debugging, etc.) Need for an integrated framework preferably usable on large scale distributed systems

The MapReduce Paradigm

The MapReduce Paradigm Programming model inspired by functional language primitives Automatic parallelization and distribution Fault tolerance I/O scheduling Examples: document clustering, web access log analysis, search index construction,... Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 04, 6th Symposium on Operating Systems Design and Implementation, pages 137 150, 2004. Hadoop (http://hadoop.apache.org/core/) developed by the Apache project is an open source implementation of MapReduce.

The MapReduce Paradigm Distributed Data Local Data Local Data Local Data Map Map Map Intermediate Data Partial Result Partial Result Partial Result Reduce Reduce Aggregated Result Figure: Conceptual Flow

The MapReduce Paradigm A MapReduce implementation like Hadoop typically provides a distributed file system (DFS): Master/worker architecture (Namenode/Datanodes) Data locality Map tasks are applied to partitioned data Map tasks scheduled so that input blocks are on same machine Datanodes read input at local disk speed Data replication leads to fault tolerance Application does not care whether nodes are OK or not

The MapReduce Paradigm Figure: HDFS architecture Source: http://hadoop.apache.org/core/docs/r0.20.0/hdfs_design.html

Hadoop Streaming Utility allowing to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer $HADOOP HOME/bin/hadoop jar $HADOOP HOME/hadoop-streaming.jar -input inputdir -output outputdir -mapper./mapper -reducer./reducer Local Data Intermediate Data Processed Data stdin() stdout() stdin() stdout() R: Map R: Reduce

Hadoop InteractiVE (hive)

Hadoop InteractiVE (hive) hive provides: Easy-to-use interface to Hadoop Currently, only Hadoop core (http://hadoop.apache.org/core/) supported High-level functions for handling Hadoop framework (hive start(), hive create(), hive is available(), etc.) DFS accessor functions in R (DFS put(), DFS list(), DFS cat(), etc.) Streaming via Hadoop (hive stream()) Available on R-Forge in project RHadoop

hive: How to Get Started? Prerequisites: Java 1.6.x Passwordless ssh within nodes Best practice: provide everything via shared filesystem (e.g., NFS).

hive: How to Get Started? Installation: Download and untar Hadoop core from http://hadoop.apache.org/core/ Modify JAVA HOME environment variable in /path/to/hadoop-0.20.0/conf/hadoop-env.sh appropriately Further (optional) configuration: mapred-site.xml e.g., configuration of number of parallel tasks (mapred.tasktracker.map.tasks.maximum) core-site.xml e.g., where is the DFS managed? (fs.default.name) hdfs-site.xml e.g., number of chunk replication (dfs.replication) Export HADOOP HOME environment variable Start Hadoop either via command line or with hive start()

Example: Word Count Data preparation: 1 > library (" hive ") 2 Loading required package : rjava 3 Loading required package : XML 4 > hive _ start () 5 > hive _is_ available () 6 [1] TRUE 7 > DFS _ put ("~/ Data / Reuters / minimal ", "/ tmp / Reuters ") 8 > DFS _ list ("/ tmp / Reuters ") 9 [1] "reut -00001. xml " "reut -00002. xml " "reut -00003. xml " 10 [4] "reut -00004. xml " "reut -00005. xml " "reut -00006. xml " 11 [7] "reut -00007. xml " "reut -00008. xml " "reut -00009. xml " 12 > head ( DFS _ read _ lines ("/ tmp / Reuters /reut -00002. xml ")) 13 [1] " <? xml version =\" 1.0\ "?>" 14 [2] "<REUTERS TOPICS =\"NO\" LEWISSPLIT =\" TRAIN \" [...] 15 [3] " <DATE >26 - FEB -1987 15:03:27.51 < / DATE >" 16 [4] " < TOPICS />" 17 [5] " <PLACES >" 18 [6] " <D>usa </D>"

Example: Word Count 1 mapper <- function (){ 2 mapred _ write _ output <- function ( key, value ) 3 cat ( sprintf ("%s\t%s\n", key, value ), sep = "") 5 trim _ white _ space <- function ( line ) 6 gsub ("(^ +) ( +$)", "", line ) 7 split _ into _ words <- function ( line ) 8 unlist ( strsplit (line, " [[: space :]]+ ")) 10 con <- file (" stdin ", open = "r") 11 while ( length ( line <- readlines ( con, n = 1, 12 warn = FALSE )) > 0) { 13 line <- trim _ white _ space ( line ) 14 words <- split _ into _ words ( line ) 15 if( length ( words )) 16 mapred _ write _ output ( words, 1) 17 } 18 close ( con ) 19 }

Example: Word Count 1 reducer <- function (){ 2 [...] 3 env <- new. env ( hash = TRUE ) 4 con <- file (" stdin ", open = "r") 5 while ( length ( line <- readlines ( con, n = 1, 6 warn = FALSE )) > 0) { 7 split <- split _ line ( line ) 8 word <- split $ word 9 count <- split $ count 10 if( nchar ( word ) > 0){ 11 if( exists ( word, envir = env, inherits = FALSE )) { 12 oldcount <- get ( word, envir = env ) 13 assign ( word, oldcount + count, envir = env ) 14 } 15 else assign ( word, count, envir = env ) 16 } 17 } 18 close ( con ) 19 for ( w in ls( env, all = TRUE )) 20 cat (w, "\t", get (w, envir = env ), "\n", sep = "") 21 }

Example: Word Count 1 > hive _ stream ( mapper = mapper, 2 reducer = reducer, 3 input = "/ tmp / Reuters ", 4 output = "/ tmp / Reuters _ out ") 5 > DFS _ list ("/ tmp / Reuters _ out ") 6 [1] "_ logs " "part -00000 " 7 > results <- DFS _ read _ lines ( 8 "/ tmp / Reuters _ out /part -00000 ") 9 > head ( results ) 10 [1] " -\t2" " --\t7" 11 [3] ":\ t1" ".\ t1" 12 [5] " 0064 < / UNKNOWN >\ t1" " 0066 < / UNKNOWN >\ t1" 13 > tmp <- strsplit ( results, "\t") 14 > counts <- as. integer ( unlist ( lapply (tmp, function (x) 15 x [[2]]))) 16 > names ( counts ) <- unlist ( lapply (tmp, function (x) 17 x [[1]])) 18 > head ( sort ( counts, decreasing = TRUE )) 19 the to and of at said 20 58 44 41 30 25 22

hive: Summary Further functions: DFS put object(), DFS cat(), hive create(), hive get parameter(),... Currently, heavy usage of command line tools Java interface in preparation (presentation @ user 2009) Use infrastructure of package HadoopStreaming? Higher-level abstraction (e.g., variants of apply())

Application: Text Mining in R

Why Distributed Text Mining? Highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics Vast amount of textual data available in machine readable format: scientific articles, abstracts, books,... memos, letters,... online forums, mailing lists, blogs,... Data volumes (corpora) become bigger and bigger Steady increase of text mining methods (both in academia as in industry) within the last decade Text mining methods are becoming more complex and hence computer intensive Thus, demand for computing power steadily increases

Why Distributed Text Mining? High Performance Computing (HPC) servers available for a reasonable price Integrated frameworks for parallel/distributed computing available (e.g., Hadoop) Thus, parallel/distributed computing is now easier than ever Standard software for data processing already offer extensions to use this software

Text Mining in R tm Package Tailored for Plain texts, articles and papers Web documents (XML, SGML,... ) Surveys Methods for Clustering Classification Visualization

Text Mining in R I. Feinerer tm: Text Mining Package, 2009 URL http://cran.r-project.org/package=tm R package version 0.3-3 I. Feinerer, K. Hornik, and D. Meyer Text mining infrastructure in R Journal of Statistical Software, 25(5):1 54, March 2008 ISSN 1548-7660 URL http://www.jstatsoft.org/v25/i05

Distributed Text Mining in R Example: Stemming Erasing word suffixes to retrieve their radicals Reduces complexity Stemmers provided in packages Rstem 1 and Snowball 2 Data: Wizard of Oz book series (http://www.gutenberg.org): 20 books, each containing 1529 9452 lines of text Reuters-21578: one of the most widely used test collection for text categorization research (New York Times corpus) 1 Duncan Temple Lang (version 0.3-0 on Omegahat) 2 Kurt Hornik (version 0.0-3 on CRAN)

Distributed Text Mining in R Motivation: Large data sets Corpus typically loaded into memory Operations on all elements of the corpus (so-called transformations) Available transformations: stemdoc(), stripwhitespace(), tmtolower(),...

Distributed Text Mining Strategies in R Strategies: Text mining using tm and Hadoop/hive 1 Text mining using tm and MPI/snow 2 1 Stefan Theußl (version 0.1-1) 2 Luke Tierney (version 0.3-3)

Distributed Text Mining in R Solution (Hadoop): Data set copied to DFS ( DistributedCorpus ) Only meta information about the corpus in memory Computational operations (Map) on all elements in parallel) Work horse tmmap() Processed documents (revisions) can be retrieved on demand

Distributed Text Mining in R - Listing 1 > library ("tm") 2 Loading required package : slam 3 > input <- "~/ Data / Reuters / reuters _ xml " 4 > co <- Corpus ( DirSource ( input ), [...]) 5 > co 6 A corpus with 21578 text documents 7 > print ( object. size (co), units = "Mb") 8 65.5 Mb 10 > source (" corpus.r") 11 > source (" reader.r") 12 > dc <- DistributedCorpus ( DirSource ( input ), [...]) 13 > dc 14 A corpus with 21578 text documents 15 > dc [[1]] 16 Showers continued throughout the week in 17 [...] 18 > print ( object. size (dc), units = "Mb") 19 1.9 Mb

Distributed Text Mining in R - Listing Mapper (called by tmmap): 1 mapper <- function (){ 2 require ("tm") 3 fun <- some _tm_ method 4 [...] 5 con <- file (" stdin ", open = "r") 6 while ( length ( line <- readlines (con, n = 1L, 7 warn = FALSE )) > 0) { 8 input <- split _ line ( line ) 9 result <- fun ( input $ value ) 10 if( length ( result )) 11 mapred _ write _ output ( input $key, result ) 12 } 13 close ( con ) 14 }

Distributed Text Mining in R Infrastructure: Development platform: 8-core Power 6 shared memory system IBM System p 550 4 2-core IBM POWER6 @ 3.5 GHz 128 GB RAM Computers of PC Lab used as worker nodes 8 PCs with an Intel Pentium 4 CPU @ 3.2 GHz and 1 GB of RAM Each PC has > 20 GB reserved for DFS MapReduce framework: Hadoop (implements MapReduce + DFS) R (2.9.0) with tm (0.4) and hive (0.1-1) Code implementing DistributedCorpus Cluster installation coming soon (loose integration with SGE)

Benchmark Reuters-21578: Single processor runtime (lapply()): 133 min. tm/hive on 8-core SMP (hive stream()): 4 min. tm/snow on 8 nodes of cluster@wu (parlapply()): 2.13 min.

Benchmark Runtime Speedup Runtime [s] 500 1500 2500 3500 Speedup 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 # CPU 1 2 3 4 5 6 7 8 # CPU

Distributed Text Mining in R Excursion: How to store text files in the DFS Requirement: access text documents in R via [[ Difficult to achieve: almost random output after calling map in Hadoop Output chunks automatically renamed to part-xxxxx. Solution: add meta information to each chunk (chunk name, position in the chunk) Update DistributedCorpus after Map process

Lessons Learned Problem size has to be sufficiently large Location of texts in DFS (currently: ID = file path) Thus, serialization difficult (how to update text IDs?) Remote file operation on DFS around 2.5 sec. (will be significantly reduced after Java implementation)

Conclusion MapReduce has proven to be a useful abstraction Greatly simplifies distributed computing Developer focus on problem Implementations like Hadoop deal with messy details different approaches to facilitate Hadoop s infrastructure language- and use case dependent

Thank You for Your Attention! Stefan Theußl Department of Statistics and Mathematics email: Stefan.Theussl@wu.ac.at URL: http://statmath.wu.ac.at/~theussl WU Vienna University of Economics and Business Augasse 2 6, A-1090 Wien