OTN Developer Day: Oracle Big Data

OTN Developer Day: Oracle Big Data Hands On Lab Manual Oracle Big Data Connectors: Introduction to Oracle R Connector for Hadoop

ORACLE R CONNECTOR FOR HADOOP 2.0 HANDS-ON LAB Introduction to Oracle R Connector for Hadoop

Contents

Introduction to Oracle R Connector for Hadoop
    Exercise 1 Work with data in HDFS and Oracle Database
    Exercise 2 Execute a simple MapReduce job using ORCH
    Exercise 3 Count words in movie plot summaries
Solution for Introduction to Oracle R Connector for Hadoop
    Exercise 1 Work with data in HDFS and Oracle Database
    Exercise 2 Execute a simple MapReduce job using ORCH
    Exercise 3 Count words in movie plot summaries

Introduction to Oracle R Connector for Hadoop

Oracle R Connector for Hadoop (ORCH), a component of the Big Data Connectors option, provides transparent access to Hadoop and HDFS-resident data. Hadoop is a high-performance distributed computational system, and the Hadoop Distributed File System (HDFS) is a distributed, high-availability file storage mechanism. With ORCH, R users are not forced to learn a new language to work with Hadoop/HDFS; they continue to work in R. In addition, they can leverage open source R packages as part of their mapper and reducer functions when working on HDFS-resident data.

ORCH allows Hadoop jobs to be executed locally at the client for testing purposes. Then, by changing one setting, the exact same code can be executed on the Hadoop cluster without requiring the involvement of administrators or knowledge of Hadoop internals, the Hadoop call-level interface, or IT infrastructure.

ORCH and Oracle R Enterprise (ORE) can interact in a variety of ways. If ORE is installed on the R client with ORCH, ORCH can copy ore.frames (data tables) to HDFS, ORE can preprocess data that is fed to MapReduce jobs, and ORE can post-process the results of MapReduce jobs once data is moved from HDFS to Oracle Database. If ORE is installed on the Big Data Appliance task nodes, mapper and reducer functions can include function calls to ORE. If ORCH is installed on the Oracle Database server, R scripts in embedded R execution can invoke ORCH functionality, so ORCH scripts can be operationalized through SQL-based applications or those leveraging DBMS_SCHEDULER.

To run the commands in this document on the virtual machine (VM), point Firefox to http://localhost:8787 and log in to RStudio using the oracle user's credentials. From the RStudio File menu, select Open File and navigate to /home/oracle/movie/moviework/advancedanalytics. Select the R script file 20130206_ORCH_Hands-on_Lab.R; the lab's script commands will open and be available to run in RStudio.

Exercise 1 Work with data in HDFS and Oracle Database

Loading the ORCH library provides access to some basic functions for manipulating HDFS. After navigating to a specified directory, we'll again access database data in the form of the MOVIE_FACT and MOVIE_GENRE tables, and connect to Oracle Database from ORCH. Although you're connected to the database through ORE, transferring data between Oracle Database and HDFS requires an ORCH connection. Then, you'll copy data from the database to HDFS for later use with a MapReduce job.

Run these commands from the /home/oracle/movie/moviework/advancedanalytics Linux directory.

1. If you are in R, first exit from R using CTRL-D CTRL-D. This will in effect invoke q() and not save the workspace. Change directory and start R:

    cd /home/oracle/movie/moviework/advancedanalytics
    R

2. If you are not already connected by default, load the Oracle R Enterprise (ORE) library and connect to the Oracle database, then list the contents of the database to test the connection. Notice that if a table contains columns with unsupported data types, a warning message is returned. If you are connected, you can just invoke ore.ls().

    library(ORE)
    ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
    ore.ls()

3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory, and list the directory contents in the Hadoop Distributed File System (HDFS). Change directory in HDFS and view the contents there:

    library(ORCH)
    hdfs.pwd()
    hdfs.ls()
    hdfs.cd("/user/oracle/moviework/advancedanalytics/data")
    hdfs.ls()

4. Using ORE, view the column names of the database tables MOVIE_FACT and MOVIE_GENRE, look at the first few rows of each table, and get the table dimensions:

    ore.sync("MOVIEDEMO","MOVIE_FACT")
    MF <- MOVIE_FACT
    names(MF)
    head(MF,3)
    dim(MF)
    names(MOVIE_GENRE)
    head(MOVIE_GENRE,3)
    dim(MOVIE_GENRE)

5. Since we will use the table MOVIE_GENRE later in our Hadoop recommendation jobs, copy a subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This requires using orch.connect to establish the connection to the database from ORCH.

    MG_SUBSET <- MOVIE_GENRE[1:10000,]
    hdfs.rm('movie_genre_subset')
    orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=FALSE)
    mg.dfs <- hdfs.push(MG_SUBSET, dfs.name='movie_genre_subset', split.by="genre_id")
    hdfs.exists('movie_genre_subset')
    hdfs.describe('movie_genre_subset')
    hdfs.size('movie_genre_subset')

Exercise 2 Execute a simple MapReduce job using ORCH

In this exercise, you will execute a Hadoop job that counts the number of movies in each genre. You will first run the script in dry run mode, executing serially on the local machine. Then, you will run it on the cluster in the VM. Finally, you will compare the results using ORE.

1. Use hdfs.attach() to attach the movie_genre_subset HDFS file to the working session:

    mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
    mg.dfs
    hdfs.describe(mg.dfs)

2. Enable dry run mode and then execute the MapReduce job that partitions the data by genre_id and counts the number of movies in each genre. Note that you will receive debug output while in dry run mode.

    orch.dryrun(TRUE)
    res.dryrun <- NULL
    res.dryrun <- hadoop.run(
        mg.dfs,
        mapper = function(key, val) {
            orch.keyval(val$GENRE_ID, 1)
        },
        reducer = function(key, vals) {
            count <- length(vals)
            orch.keyval(key, count)
        },
        config = new("mapred.config",
            map.output = data.frame(key=0, val=0),
            reduce.output = data.frame(GENRE_ID=0, COUNT=0))
    )

3. Retrieve the result of the Hadoop job, which is stored as an HDFS file. Note that since this is dry run mode, not all data may be used, so only a subset of results may be returned.

    hdfs.get(res.dryrun)

4. Switch to cluster execution by setting orch.dryrun to FALSE, rerun the same MapReduce job, and view the result. Note that this will take longer to execute since it starts actual Hadoop jobs on the cluster.

    orch.dryrun(FALSE)
    res.cluster <- NULL
    res.cluster <- hadoop.run(
        mg.dfs,
        mapper = function(key, val) {
            orch.keyval(val$GENRE_ID, 1)
        },
        reducer = function(key, vals) {
            count <- length(vals)
            orch.keyval(key, count)
        },
        config = new("mapred.config",
            map.output = data.frame(key=0, val=0),
            reduce.output = data.frame(GENRE_ID=0, COUNT=0))
    )
    hdfs.get(res.cluster)

5. Perform the same analysis using ORE:

    res.table <- table(MG_SUBSET$GENRE_ID)
    res.table

Exercise 3 Count words in movie plot summaries

In this exercise, you will execute a Hadoop job that counts how many times each word in the MOVIE plot summaries occurs. You will first create the HDFS file containing the data extracted from Oracle Database using ORE. Then, you will run the MapReduce job on the cluster in the VM. Finally, you will view the results using ORE, but since we'll want the results sorted by most frequent words, another MapReduce job will be needed.

1. If starting a fresh R session, execute the first block. Otherwise, continue with the second block to find all the movies with plot summaries and convert them from an ore.factor to ore.character. Remove various unneeded punctuation from the text, create a database table from the result, and create the input corpus for the MapReduce job:

    library(ORCH)
    orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=FALSE)
    hdfs.cd("/user/oracle/moviework/advancedanalytics/data")

    ore.drop(table="corpus_table")
    corpus <- as.character(MOVIE[!is.na(MOVIE$PLOT_SUMMARY),"PLOT_SUMMARY"])
    class(corpus)
    corpus <- gsub("([/\\\":,#.@-])", " ", corpus)
    head(corpus,2)
    corpus <- data.frame(text=corpus)
    ore.create(corpus, table="corpus_table")
    hdfs.rm("plot_summary_corpus")
    input <- hdfs.put(corpus_table, dfs.name="plot_summary_corpus")

2. Try the following example to see how R parses text using strsplit. Notice the extra space between "my" and "text". It produces an empty string in the output. You will account for that in the next step.

    txt <- "This is my  text"
    strsplit(txt," ")
    mylist <- list(A = 5, B = 10, C = 25)
    sum(unlist(mylist))

3. Execute the MapReduce job that performs the word count.

    res <- hadoop.exec(dfs.id = input,
        mapper = function(k,v) {
            x <- strsplit(v[[1]], " ")[[1]]
            x <- x[x!='']
            out <- NULL
            for(i in seq_along(x))
                out <- c(out, orch.keyval(x[i],1))
            out
        },
        reducer = function(k,vv) {
            orch.keyval(k, sum(unlist(vv)))
        },
        config = new("mapred.config",
            job.name = "wordcount",
            map.output = data.frame(key='', val=0),
            reduce.output = data.frame(key='', val=0)
        )
    )

4. View the path of the result HDFS file. Then get the contents of the result. Notice that the results are unordered.

    res
    hdfs.get(res)

5. To sort the results, we can use the following MapReduce job. Notice that we can specify explicit stopwords, i.e., words to be excluded from the set, and that we also eliminate words of 3 letters or fewer. Then view the sorted results, as well as a sample of 10 rows from the HDFS file. Which words are the most popular in the plot summaries?

    stopwords <- c("from","they","that","with","their","when","into","what")
    sorted.res <- hadoop.exec(dfs.id = res,
        mapper = function(k,v) {
            if(!(k %in% stopwords) && nchar(k) > 3) {
                cnt <- sprintf("%05d", as.numeric(v[[1]]))
                orch.keyval(cnt,k)
            }
        },
        reducer = function(k,vv) {
            orch.keyvals(k, vv)
        },
        export = orch.export(stopwords),
        config = new("mapred.config",
            job.name = "sort.words",
            reduce.tasks = 1,
            map.output = data.frame(key='', val=''),
            reduce.output = data.frame(key='', val='')
        )
    )
    hdfs.get(sorted.res)
    hdfs.sample(sorted.res,10)
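Just as Exercise 2 cross-checks the Hadoop result with table(), the filtering and ranking logic of Exercise 3 can be checked locally on a small sample. The following is a minimal sketch in base R only; the three sample summaries are made up for illustration and no ORCH or ORE functions are involved.

```r
# Hypothetical cleaned plot summaries (not taken from the MOVIE table).
corpus <- c("young hero finds home",
            "hero leaves home with them",
            "young hero")

stopwords <- c("from", "they", "that", "with", "their", "when", "into", "what")

# Tokenise every summary and drop empty strings.
words <- unlist(strsplit(corpus, " "))
words <- words[words != ""]

# Apply the same filters as the sorting job: no stopwords, more than 3 letters.
keep <- !(words %in% stopwords) & nchar(words) > 3

# Count the remaining words and order them from most to least frequent.
freq <- sort(table(words[keep]), decreasing = TRUE)
print(freq)
```

On a sample this small the whole computation fits in memory, so it is only a sanity check for the logic, not a substitute for the cluster job.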

Solution for Introduction to Oracle R Connector for Hadoop

Exercise 1 Work with data in HDFS and Oracle Database

Loading the ORCH library provides access to some basic functions for manipulating HDFS. After navigating to a specified directory, we'll again access database data in the form of the MOVIE_FACT and MOVIE_GENRE tables, and connect to Oracle Database from ORCH. Although you're connected to the database through ORE, transferring data between Oracle Database and HDFS requires an ORCH connection. Then, you'll copy data from the database to HDFS for later use with a MapReduce job.

Run these commands from the /home/oracle/movie/moviework/advancedanalytics Linux directory.

1. If you are in R, first exit from R using CTRL-D CTRL-D. This will in effect invoke q() and not save the workspace. Change directory and start R:

    cd /home/oracle/movie/moviework/advancedanalytics
    R

2. If you are not already connected by default, load the Oracle R Enterprise (ORE) library and connect to the Oracle database, then list the contents of the database to test the connection. Notice that if a table contains columns with unsupported data types, a warning message is returned. If you are connected, you can just invoke ore.ls().

    library(ORE)
    ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
    ore.ls()

3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory, and list the directory contents in the Hadoop Distributed File System (HDFS). Change directory in HDFS and view the contents there:

    library(ORCH)
    hdfs.pwd()
    hdfs.ls()
    hdfs.cd("/user/oracle/moviework/advancedanalytics/data")
    hdfs.ls()

4. Using ORE, view the column names of the database tables MOVIE_FACT and MOVIE_GENRE, look at the first few rows of each table, and get the table dimensions:

    ore.sync("MOVIEDEMO","MOVIE_FACT")
    MF <- MOVIE_FACT
    names(MF)
    head(MF,3)
    dim(MF)
    names(MOVIE_GENRE)
    head(MOVIE_GENRE,3)
    dim(MOVIE_GENRE)

5. Since we will use the table MOVIE_GENRE later in our Hadoop recommendation jobs, copy a subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This requires using orch.connect to establish the connection to the database from ORCH.

    MG_SUBSET <- MOVIE_GENRE[1:10000,]
    hdfs.rm('movie_genre_subset')
    orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=FALSE)
    mg.dfs <- hdfs.push(MG_SUBSET, dfs.name='movie_genre_subset', split.by="genre_id")
    hdfs.exists('movie_genre_subset')
    hdfs.describe('movie_genre_subset')
    hdfs.size('movie_genre_subset')

Exercise 2 Execute a simple MapReduce job using ORCH

In this exercise, you will execute a Hadoop job that counts the number of movies in each genre. You will first run the script in dry run mode, executing serially on the local machine. Then, you will run it on the cluster in the VM. Finally, you will compare the results using ORE.

1. Use hdfs.attach() to attach the movie_genre_subset HDFS file to the working session:

    mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
    mg.dfs
    hdfs.describe(mg.dfs)

2. Enable dry run mode and then execute the MapReduce job that partitions the data by genre_id and counts the number of movies in each genre. Note that you will receive debug output while in dry run mode.

    orch.dryrun(TRUE)
    res.dryrun <- NULL
    res.dryrun <- hadoop.run(
        mg.dfs,
        mapper = function(key, val) {
            orch.keyval(val$GENRE_ID, 1)
        },
        reducer = function(key, vals) {
            count <- length(vals)
            orch.keyval(key, count)
        },
        config = new("mapred.config",
            map.output = data.frame(key=0, val=0),
            reduce.output = data.frame(GENRE_ID=0, COUNT=0))
    )
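The map, group-by-key, and reduce phases of this job can be imitated in plain R to see what each function contributes. This is only an illustrative sketch: it uses base R, a made-up six-row stand-in for the MOVIE_GENRE subset, and hypothetical helper names (map_genre, reduce_count). It does not call ORCH and says nothing about how ORCH schedules the work.

```r
# Toy stand-in for the MOVIE_GENRE subset (hypothetical values).
movie_genre <- data.frame(
  MOVIE_ID = c(1, 1, 2, 3, 3, 4),
  GENRE_ID = c(10, 20, 10, 30, 10, 20)
)

# Map phase: emit one (key, value) pair per row, keyed by GENRE_ID.
map_genre <- function(val) {
  data.frame(key = val$GENRE_ID, val = 1)
}

# Reduce phase: all values sharing a key arrive together; count them.
reduce_count <- function(key, vals) {
  data.frame(GENRE_ID = key, COUNT = length(vals))
}

# Shuffle: group the emitted values by key, then reduce each group.
pairs  <- map_genre(movie_genre)
groups <- split(pairs$val, pairs$key)
results <- do.call(
  rbind,
  lapply(names(groups), function(k) reduce_count(as.numeric(k), groups[[k]]))
)
rownames(results) <- NULL
print(results)

# Cross-check against table(), as the exercise does with ORE in step 5.
print(table(movie_genre$GENRE_ID))
```

Because every emitted value is 1, counting the values per key with length() gives the same answer as summing them.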

3. Retrieve the result of the Hadoop job, which is stored as an HDFS file. Note that since this is dry run mode, not all data may be used, so only a subset of results may be returned.

    hdfs.get(res.dryrun)

4. Switch to cluster execution by setting orch.dryrun to FALSE, rerun the same MapReduce job, and view the result. Note that this will take longer to execute since it starts actual Hadoop jobs on the cluster.

    orch.dryrun(FALSE)
    res.cluster <- NULL
    res.cluster <- hadoop.run(
        mg.dfs,
        mapper = function(key, val) {
            orch.keyval(val$GENRE_ID, 1)
        },
        reducer = function(key, vals) {
            count <- length(vals)
            orch.keyval(key, count)
        },
        config = new("mapred.config",
            map.output = data.frame(key=0, val=0),
            reduce.output = data.frame(GENRE_ID=0, COUNT=0))
    )
    hdfs.get(res.cluster)

5. Perform the same analysis using ORE:

    res.table <- table(MG_SUBSET$GENRE_ID)
    res.table

Exercise 3 Count words in movie plot summaries

In this exercise, you will execute a Hadoop job that counts how many times each word in the MOVIE plot summaries occurs. You will first create the HDFS file containing the data extracted from Oracle Database using ORE. Then, you will run the MapReduce job on the cluster in the VM. Finally, you will view the results using ORE, but since we'll want the results sorted by most frequent words, another MapReduce job will be needed.

1. If starting a fresh R session, execute the first block. Otherwise, continue with the second block to find all the movies with plot summaries and convert them from an ore.factor to ore.character. Remove various unneeded punctuation from the text, create a database table from the result, and create the input corpus for the MapReduce job:

    library(ORCH)
    orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=FALSE)
    hdfs.cd("/user/oracle/moviework/advancedanalytics/data")

    corpus <- as.character(MOVIE[!is.na(MOVIE$PLOT_SUMMARY),"PLOT_SUMMARY"])
    class(corpus)
    corpus <- gsub("([/\\\":,#.@-])", " ", corpus)
    head(corpus,2)
    corpus <- data.frame(text=corpus)
    ore.create(corpus, table="corpus_table")
    hdfs.rm("plot_summary_corpus")
    input <- hdfs.put(corpus_table, dfs.name="plot_summary_corpus")

2. Try the following example to see how R parses text using strsplit. Notice the extra space between "my" and "text". It produces an empty string in the output. You will account for that in the next step.

    txt <- "This is my  text"
    strsplit(txt," ")
    mylist <- list(A = 5, B = 10, C = 25)
    sum(unlist(mylist))
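The cleaning and splitting steps above can be tried together on a single string to see where the empty strings come from and how to remove them. The sketch below uses base R only; the sample sentence is made up, and the gsub pattern is the one from step 1.

```r
# Hypothetical plot summary with punctuation and a double space.
summary_text <- "A boy, a dog: lost at sea.  They survive."

# Same substitution as in step 1: each listed punctuation mark becomes a space.
cleaned <- gsub("([/\\\":,#.@-])", " ", summary_text)

# Splitting on a single space yields "" wherever two spaces are adjacent.
tokens <- strsplit(cleaned, " ")[[1]]
print(tokens)

# Drop the empty strings before counting.
words <- tokens[tokens != ""]
print(words)
```

Replacing punctuation with a space, rather than deleting it, keeps neighbouring words apart but creates runs of spaces, which is why the empty-string filter is needed afterwards.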

3. Execute the MapReduce job that performs the word count:

    res <- hadoop.exec(dfs.id = input,
        mapper = function(k,v) {
            x <- strsplit(v[[1]], " ")[[1]]
            x <- x[x!='']
            out <- NULL
            for(i in seq_along(x))
                out <- c(out, orch.keyval(x[i],1))
            out
        },
        reducer = function(k,vv) {
            orch.keyval(k, sum(unlist(vv)))
        },
        config = new("mapred.config",
            job.name = "wordcount",
            map.output = data.frame(key='', val=0),
            reduce.output = data.frame(key='', val=0)
        )
    )

4. View the path of the result HDFS file. Then get the contents of the result. Notice that the results are unordered.

    res
    hdfs.get(res)
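To see what the mapper and reducer of the word-count job each do, the same logic can be run locally on a few lines of text. This is a rough base R sketch under stated assumptions: the three-line corpus is invented, wc_mapper and wc_reducer are hypothetical names, and split() stands in for the grouping that Hadoop performs between the two phases.

```r
# Hypothetical input: one cleaned plot summary per element.
corpus <- c("the lost city", "the city of gold", "lost  gold")

# Mapper: split one line into words and emit a (word, 1) pair for each.
wc_mapper <- function(line) {
  x <- strsplit(line, " ")[[1]]
  x <- x[x != ""]
  data.frame(key = x, val = rep(1, length(x)), stringsAsFactors = FALSE)
}

# Reducer: sum every value that was emitted for one word.
wc_reducer <- function(key, vals) {
  data.frame(key = key, val = sum(unlist(vals)), stringsAsFactors = FALSE)
}

# Run the map phase, group by key (the shuffle), then reduce each group.
mapped  <- do.call(rbind, lapply(corpus, wc_mapper))
grouped <- split(mapped$val, mapped$key)
counts  <- do.call(
  rbind,
  lapply(names(grouped), function(k) wc_reducer(k, grouped[[k]]))
)
rownames(counts) <- NULL
print(counts)
```

The local result happens to come back in alphabetical key order because of how split() groups values; that ordering is a property of this sketch, not of the Hadoop job.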

5. To sort the results, we can use the following MapReduce job. Notice that we can specify explicit stopwords, i.e., words to be excluded from the set, and that we also eliminate words of 3 letters or fewer. Then view the sorted results, as well as a sample of 10 rows from the HDFS file. Which words are the most popular in the plot summaries?

    stopwords <- c("from","they","that","with","their","when","into","what")
    sorted.res <- hadoop.exec(dfs.id = res,
        mapper = function(k,v) {
            if(!(k %in% stopwords) && nchar(k) > 3) {
                cnt <- sprintf("%05d", as.numeric(v[[1]]))
                orch.keyval(cnt,k)
            }
        },
        reducer = function(k,vv) {
            orch.keyvals(k, vv)
        },
        export = orch.export(stopwords),
        config = new("mapred.config",
            job.name = "sort.words",
            reduce.tasks = 1,
            map.output = data.frame(key='', val=''),
            reduce.output = data.frame(key='', val='')
        )
    )
    hdfs.get(sorted.res)
    hdfs.sample(sorted.res,10)
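The sorting job declares its keys as character data and formats each count with sprintf("%05d", ...). The short base R sketch below, using made-up counts, illustrates why the zero padding matters when counts are compared as strings.

```r
# Hypothetical word counts.
counts <- c(film = 7, young = 120, life = 45)

# Unpadded string keys compare character by character, so "120" < "45" < "7".
plain_keys <- as.character(counts)
print(sort(plain_keys))

# Zero-padding to a fixed width makes string order match numeric order.
padded_keys <- sprintf("%05d", as.integer(counts))
names(padded_keys) <- names(counts)
print(sort(padded_keys))
```

A width of 5 only preserves the ordering for counts up to 99999; larger counts would need a wider format. If the keys come back in ascending order, the most frequent words appear at the end of the result.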