OTN Developer Day: Oracle Big Data Hands On Lab Manual Oracle Big Data Connectors: Introduction to Oracle R Connector for Hadoop
ORACLE R CONNECTOR FOR HADOOP 2.0 HANDS-ON LAB Introduction to Oracle R Connector for Hadoop
Contents Introduction to Oracle R Connector for Hadoop... 3 Exercise 1 Work with data in HDFS and Oracle Database... 3 Exercise 2 Execute a simple MapReduce job using ORCH... 4 Exercise 3 Count words in movie plot summaries... 6 Solution for Introduction to Oracle R Connector for Hadoop... 9 Exercise 1 Work with data in HDFS and Oracle Database... 9 Exercise 2 Execute a simple MapReduce job using ORCH... 12 Exercise 3 Count words in movie plot summaries... 15
Introduction to Oracle R Connector for Hadoop Oracle R Connector for Hadoop (ORCH), a component of the Big Data Connectors option, provides transparent access to Hadoop and HDFS-resident data. Hadoop is a high performance distributed computational system, and the Hadoop Distributed File System (HDFS) is a distributed high-availability file storage mechanism. With ORCH, R users are not forced to learn a new language to work with Hadoop/HDFS they continue to work in R. In addition they can leverage open source R packages as part of their mapper and reducer functions when working on HDFS-resident data. ORCH allows for Hadoop jobs to be executed locally at the client for testing purposes, then, by changing one setting, the exact same code can be executed on the Hadoop cluster without requiring the involvement of administrators, or knowledge of Hadoop internals, the Hadoop call level interface or IT infrastructure. ORCH and Oracle R Enterprise (ORE) can interact in a variety of ways. If ORE is installed on the R client with ORCH, ORCH can copy ore.frames (data tables) to HDFS, ORE can preprocess data that is fed to map-reduce jobs, and ORE can post-process results of map-reduce jobs once data is moved from HDFS to Oracle Database. If ORE is installed on the Big Data Appliance task nodes, mapper and reducer functions can include functions calls to ORE. If ORCH is installed on Oracle Database server, R scripts in embedded R execution can invoke ORCH functionality, achieving operationalization of ORCH scripts via SQL-based applications or those leveraging DBMS_SCHEDULER. To run the commands in this document on the virtual machine (VM), point Firefox to http://localhost:8787 and log into RStudio using oracle user s credentials. From the RStudio File menu, select File-Open File and navigate to location /home/oracle/movie/moviework/advancedanalytics. Select the R Script file 20130206_ORCH_Hands-on_Lab.R, and the HOL s script s commands will be opened and available to run in RStudio. Exercise 1 Work with data in HDFS and Oracle Database Loading the ORCH library provides access to some basic functions for manipulating HDFS. After navigating to a specified directory, we ll again access database data in the form of the MOVIE_FACT and MOVIE_GENRE tables, and connect to Oracle Database from ORCH. Although you re connected to the database through ORE, to transfer data between Oracle Database and HDFS requires an ORCH connection. Then, you ll copy data from the database to HDFS for later use with a MapReduce job. Run these commands from the /home/oracle/movie/moviework/advancedanalytics Linux directory. 1. If you are in R, first exit from R using CTRL-D CTRL-D. This will in effect invoke q() and not save the workspace. Change directory and start R: cd /home/oracle/movie/moviework/advancedanalytics R 2. If you are not already connected by default, load the Oracle R Enterprise (ORE) library and connect to the Oracle database, then list the contents of the database to test the connection. Notice that if a table contains columns with unsupported data types, a warning message is returned. If you are connected, you can just invoke ore.ls(). library(ore) ore.connect("moviedemo","orcl","localhost","welcome1",all=true ) ore.ls()
3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory, and list the directory contents in Hadoop Distributed File System (HDFS). Change directory in HDFS and view the contents there: library(orch) hdfs.pwd() hdfs.ls() hdfs.cd ("/user/oracle/moviework/advancedanalytics/data") hdfs.ls() 4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look at the first few rows of each table, and get the table dimensions: ore.sync("moviedemo","movie_fact") MF <- MOVIE_FACT names(mf) head(mf,3) dim(mf) names(movie_genre) head(movie_genre,3) dim(movie_genre) 5. Since we will use the table MOVIE_GENRE later in our Hadoop recommendation jobs, copy a subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This requires using orch.connect to establish the connect to the database from ORCH. MG_SUBSET <- MOVIE_GENRE[1:10000,] hdfs.rm('movie_genre_subset') orch.connect(host="localhost", user="moviedemo", sid="orcl",passwd="welcome1",secure=f) mg.dfs <- hdfs.push(mg_subset, dfs.name='movie_genre_subset', split.by="genre_id") hdfs.exists('movie_genre_subset') hdfs.describe('movie_genre_subset') hdfs.size('movie_genre_subset') Exercise 2 Execute a simple MapReduce job using ORCH In this exercise, you will execute a Hadoop job that counts the number of movies in each genre. You will first run the script in dry run mode, executing on the local machine serially. Then, you will run on the cluster in the VM. Finally, you will compare the results using ORE. 1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session:
mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset ) mg.dfs hdfs.describe(mg.dfs) 2. Specify to run in dry run mode and then execute the MapReduce job that partitions the data based on genre_id, and counts up the number of movies in each genre. Note that you will receive debug output while in dry run mode. orch.dryrun(t) res.dryrun <- NULL res.dryrun <- hadoop.run( mg.dfs, mapper = function(key, val) { orch.keyval(val$genre_id, 1) reducer = function(key, vals) { count <- length(vals) orch.keyval(key, count) }, config = new("mapred.config", map.output = data.frame(key=0, val=0), reduce.output = data.frame(genre_id=0, COUNT=0)) ) 3. Retrieve the result of the Hadoop job, which is stored as an HDFS file. Note that since this is dry run mode, not all data may be used, so only a subset of results may be returned. hdfs.get(res.dryrun) 4. Specify to execute using the cluster by setting orch.dryrun to FALSE, rerun the same MapReduce job, and view the result. Note that this will take longer to execute since it is starting actual Hadoop jobs on the cluster. orch.dryrun(f) res.cluster <- NULL res.cluster <- hadoop.run( mg.dfs, mapper = function(key, val) { orch.keyval(val$genre_id, 1) reducer = function(key, vals) { count <- length(vals) orch.keyval(key, count) }, config = new("mapred.config", map.output = data.frame(key=0, val=0), reduce.output = data.frame(genre_id=0, COUNT=0)) ) hdfs.get(res.cluster) 5. Perform the same analysis using ORE:
res.table <- table(mg_subset$genre_id) res.table Exercise 3 Count words in movie plot summaries In this exercise, you will execute a Hadoop job that counts how many times each of the words in MOVIE plot summaries occurs. You will first create the HDFS file containing the data extracted from Oracle Database using ORE. Then, you will run the MarpReduce job on the cluster in the VM. Finally, you will view the results using ORE, but since we ll want the results sorted by most frequent words, another MapReduce job will be needed. 1. If starting a fresh R session, execute the first block. Otherwise, continue to find all the movies with plot summaries and convert them from an ore.factor to ore.character. Remove various unneeded punctuation from the text, create a database table from these, and create the input corpus for the MapReduce job: library(orch) orch.connect(host="localhost", user="moviedemo", sid="orcl",passwd="welcome1",secure=f) hdfs.cd("/user/oracle/moviework/advancedanalytics/data") ore.drop(table= corpus_table ) corpus <- as.character(movie[!is.na(movie$plot_summary),"plot_summary"]) class(corpus) corpus <- gsub("([/\\\":,#.@-])", " ", corpus) head(corpus,2) corpus <- data.frame(text=corpus) ore.create (corpus,table = "corpus_table") hdfs.rm("plot_summary_corpus") input <- hdfs.put(corpus_table,dfs.name="plot_summary_corpus") 2. Try the following example to see how R parses text using strsplit. Notice the extra space between my and text. This gets converted to a null output. You will account for that in the next step. txt <- "This is my text" strsplit(txt," ") mylist <- list(a = 5, B = 10, C = 25) sum(unlist(mylist)) 3. Execute the MapReduce job that performs the word count.
res <- hadoop.exec(dfs.id = input, mapper = function(k,v) { x <- strsplit(v[[1]], " ")[[1]] x <- x[x!=''] out <- NULL for(i in 1:length(x)) out <- c(out, orch.keyval(x[i],1)) out reducer = function(k,vv) { orch.keyval(k, sum(unlist(vv))) config = new("mapred.config", job.name = "wordcount", map.output = data.frame(key='', val=0), reduce.output = data.frame(key='', val=0) ) ) 4. View the path of the result HDFS file. Then get the contents of the result. Notice that the results are unordered. res hdfs.get(res) 5. To sort the results, we can use the following MapReduce job. Notice that we can specify explicit stopwords, i.e., those to be excluded from the set, but that we also eliminate words of 3 letters or fewer. Then view the sorted results, as well as a sample of 10 rows from the HDFS file. Which words are the most popular in the plot summaries?
stopwords <- c("from","they","that","with","their","when","into","what") sorted.res <- hadoop.exec(dfs.id = res, mapper = function(k,v) { if(!(k %in% stopwords) & nchar(k) > 3) { cnt <- sprintf("%05d", as.numeric(v[[1]])) orch.keyval(cnt,k) } reducer = function(k,vv) { orch.keyvals(k, vv) export= orch.export(stopwords), config = new("mapred.config", job.name = "sort.words", reduce.tasks = 1, map.output = data.frame(key='', val=''), reduce.output = data.frame(key='', val='') ) ) hdfs.get(sorted.res) hdfs.sample(sorted.res,10)
Solution for Introduction to Oracle R Connector for Hadoop Oracle R Connector for Hadoop (ORCH), a component of the Big Data Connectors option, provides transparent access to Hadoop and HDFS-resident data. Hadoop is a high performance distributed computational system, and the Hadoop Distributed File System (HDFS) is a distributed high-availability file storage mechanism. With ORCH, R users are not forced to learn a new language to work with Hadoop/HDFS they continue to work in R. In addition they can leverage open source R packages as part of their mapper and reducer functions when working on HDFS-resident data. ORCH allows for Hadoop jobs to be executed locally at the client for testing purposes, then, by changing one setting, the exact same code can be executed on the Hadoop cluster without requiring the involvement of administrators, or knowledge of Hadoop internals, the Hadoop call level interface or IT infrastructure. ORCH and Oracle R Enterprise (ORE) can interact in a variety of ways. If ORE is installed on the R client with ORCH, ORCH can copy ore.frames (data tables) to HDFS, ORE can preprocess data that is fed to map-reduce jobs, and ORE can post-process results of map-reduce jobs once data is moved from HDFS to Oracle Database. If ORE is installed on the Big Data Appliance task nodes, mapper and reducer functions can include functions calls to ORE. If ORCH is installed on Oracle Database server, R scripts in embedded R execution can invoke ORCH functionality, achieving operationalization of ORCH scripts via SQL-based applications or those leveraging DBMS_SCHEDULER. To run the commands in this document on the virtual machine (VM), point Firefox to http://localhost:8787 and log into RStudio using oracle user s credentials. From the RStudio File menu, select File-Open File and navigate to location /home/oracle/movie/moviework/advancedanalytics. Select the R Script file 20130206_ORCH_Hands-on_Lab.R, and the HOL s script s commands will be opened and available to run in RStudio. Exercise 1 Work with data in HDFS and Oracle Database Loading the ORCH library provides access to some basic functions for manipulating HDFS. After navigating to a specified directory, we ll again access database data in the form of the MOVIE_FACT and MOVIE_GENRE tables, and connect to Oracle Database from ORCH. Although you re connected to the database through ORE, to transfer data between Oracle Database and HDFS requires an ORCH connection. Then, you ll copy data from the database to HDFS for later use with a MapReduce job. Run these commands from the /home/oracle/movie/moviework/advancedanalytics Linux directory. 1. If you are in R, first exit from R using CTRL-D CTRL-D. This will in effect invoke q() and not save the workspace. Change directory and start R: cd /home/oracle/movie/moviework/advancedanalytics R
2. If you are not already connected by default, load the Oracle R Enterprise (ORE) library and connect to the Oracle database, then list the contents of the database to test the connection. Notice that if a table contains columns with unsupported data types, a warning message is returned. If you are connected, you can just invoke ore.ls(). library(ore) ore.connect("moviedemo","orcl","localhost","welcome1",all=true ) ore.ls() 3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory, and list the directory contents in Hadoop Distributed File System (HDFS). Change directory in HDFS and view the contents there:
library(orch) hdfs.pwd() hdfs.ls() hdfs.cd ("/user/oracle/moviework/advancedanalytics/data") hdfs.ls() 4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look at the first few rows of each table, and get the table dimensions: ore.sync("moviedemo","movie_fact") MF <- MOVIE_FACT names(mf) head(mf,3) dim(mf) names(movie_genre) head(movie_genre,3) dim(movie_genre) 5. Since we will use the table MOVIE_GENRE later in our Hadoop recommendation jobs, copy a subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This requires using orch.connect to establish the connect to the database from ORCH.
MG_SUBSET <- MOVIE_GENRE[1:10000,] hdfs.rm('movie_genre_subset') orch.connect(host="localhost", user="moviedemo", sid="orcl",passwd="oracle",secure=f) mg.dfs <- hdfs.push(mg_subset, dfs.name='movie_genre_subset', split.by="genre_id") hdfs.exists('movie_genre_subset') hdfs.describe('movie_genre_subset') hdfs.size('movie_genre_subset') Exercise 2 Execute a simple MapReduce job using ORCH In this exercise, you will execute a Hadoop job that counts the number of movies in each genre. You will first run the script in dry run mode, executing on the local machine serially. Then, you will run on the cluster in the VM. Finally, you will compare the results using ORE. 1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session: mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset") mg.dfs hdfs.describe(mg.dfs)
2. Specify to run in dry run mode and then execute the MapReduce job that partitions the data based on genre_id, and counts up the number of movies in each genre. Note that you will receive debug output while in dry run mode. orch.dryrun(t) res.dryrun <- NULL res.dryrun <- hadoop.run( mg.dfs, mapper = function(key, val) { orch.keyval(val$genre_id, 1) reducer = function(key, vals) { count <- length(vals) orch.keyval(key, count) }, config = new("mapred.config", map.output = data.frame(key=0, val=0), reduce.output = data.frame(genre_id=0, COUNT=0)) )
3. Retrieve the result of the Hadoop job, which is stored as an HDFS file. Note that since this is dry run mode, not all data may be used, so only a subset of results may be returned. hdfs.get(res.dryrun) 4. Specify to execute using the cluster by setting orch.dryrun to FALSE, rerun the same MapReduce job, and view the result. Note that this will take longer to execute since it is starting actual Hadoop jobs on the cluster.
orch.dryrun(f) res.cluster <- NULL res.cluster <- hadoop.run( mg.dfs, mapper = function(key, val) { orch.keyval(val$genre_id, 1) reducer = function(key, vals) { count <- length(vals) orch.keyval(key, count) }, config = new("mapred.config", map.output = data.frame(key=0, val=0), reduce.output = data.frame(genre_id=0, COUNT=0)) ) hdfs.get(res.cluster) 5. Perform the same analysis using ORE: res.table <- table(mg_subset$genre_id) res.table Exercise 3 Count words in movie plot summaries In this exercise, you will execute a Hadoop job that counts how many times each of the words in MOVIE plot summaries occurs. You will first create the HDFS file containing the data extracted from Oracle Database using ORE. Then, you will run the MarpReduce job on the cluster in the VM. Finally, you will view the results using ORE, but since we ll want the results sorted by most frequent words, another MapReduce job will be needed.
1. If starting a fresh R session, execute the first block. Otherwise, continue to find all the movies with plot summaries and convert them from an ore.factor to ore.character. Remove various unneeded punctuation from the text, create a database table from these, and create the input corpus for the MapReduce job: library(orch) orch.connect(host="localhost", user="moviedemo", sid="orcl",passwd="welcome1",secure=f) hdfs.cd ("/user/oracle/moviework/advancedanalytics/data") corpus <- as.character(movie[!is.na(movie$plot_summary),"plot_summary"]) class(corpus) corpus <- gsub("([/\\\":,#.@-])", " ", corpus) head(corpus,2) corpus <- data.frame(text=corpus) ore.create (corpus,table = "corpus_table") hdfs.rm("plot_summary_corpus") input <- hdfs.put(corpus_table,dfs.name="plot_summary_corpus") 2. Try the following example to see how R parses text using strsplit. Notice the extra space between my and text. This gets converted to a null output. You will account for that in the next step. txt <- "This is my text" strsplit(txt," ") mylist <- list(a = 5, B = 10, C = 25) sum(unlist(mylist))
3. Execute the MapReduce job that performs the word count: res <- hadoop.exec(dfs.id = input, mapper = function(k,v) { x <- strsplit(v[[1]], " ")[[1]] x <- x[x!=''] out <- NULL for(i in 1:length(x)) out <- c(out, orch.keyval(x[i],1)) out reducer = function(k,vv) { orch.keyval(k, sum(unlist(vv))) config = new("mapred.config", job.name = "wordcount", map.output = data.frame(key='', val=0), reduce.output = data.frame(key='', val=0) ) ) 4. View the path of the result HDFS file. Then get the contents of the result. Notice that the results are unordered. res hdfs.get(res)
5. To sort the results, we can use the following MapReduce job. Notice that we can specify explicit stopwords, i.e., those to be excluded from the set, but that we also eliminate words of 3 letters or fewer. Then view the sorted results, as well as a sample of 10 rows from the HDFS file. Which words are the most popular in the plot summaries? stopwords <- c("from","they","that","with","their","when","into","what") sorted.res <- hadoop.exec(dfs.id = res, mapper = function(k,v) { if(!(k %in% stopwords) & nchar(k) > 3) { cnt <- sprintf("%05d", as.numeric(v[[1]])) orch.keyval(cnt,k) } reducer = function(k,vv) { orch.keyvals(k, vv) export= orch.export(stopwords), config = new("mapred.config", job.name = "sort.words", reduce.tasks = 1, map.output = data.frame(key='', val=''), reduce.output = data.frame(key='', val='') ) ) hdfs.get(sorted.res) hdfs.sample(sorted.res,10)