Big Data Analytics: Scaling R to Enterprise Data
useR! 2013, Albacete, Spain (#user2013)
Luis Campos, Big Data Solutions Lead, Oracle EMEA (@luigicampos)
Mark Hornick, Director, Oracle Database Advanced Analytics (@MarkHornick)
The girl with all the questions!
"The real innovation here is that we can ask questions and get the answer back before we have forgotten why we asked the question in the first place."
Hilary Mason, Chief Scientist, Bit.ly, and member of NYC Mayor Bloomberg's Technology and Innovation Advisory Council
What are analysts/groups saying?
Nexus of Forces, Platform 3.0, Four Pillars
New Information Challenges
- Data Explosion: "A Decade of Digital Universe Growth: Storage in Exabytes" (Source: IDC's Digital Universe Study, June 2011)
- Combinatory Explosion
- Dimension Explosion
Big Data Solution = Data + Analytics + Tools
Source: McKinsey study "Big data: What's your plan?" (March 2013), http://www.mckinsey.com/insights/business_technology/big_data_whats_your_plan
- DATA: Any Data, Any Source
- ANALYTICS: Out-of-the-box Analytics, New Models
- TOOLS: Self Service Discovery
On Premise, On Cloud, On Mobile
Oracle Complete Business Analytics Solution
- Big Data Appliance
- Big Data Connectors
- NoSQL DB
- Oracle Advanced Analytics: Data Mining, Oracle R Enterprise
- Spatial, Graph
- Real Time Decisions (RTD)
- OBIEE
- Endeca
- Collective Intellect (CI)
On Premise, Oracle Cloud, On Mobile
Apply Advanced Analytics on All Data, Visualise it with any BI Tool
Sources: Hadoop HDFS, Relational
Visualisation: BI Tools
Oracle R Advantages
1. Keep the R tools
2. Keep the data where it sits (Relational or HDFS)
3. Keep the SQL-based BI tools
4. Scale to LARGE data sets
Development: R workspace console
Production: function push-down of data transformation and statistics to the Oracle statistics engine
Consumption: OBIEE, Web Services
Oracle's Advanced Analytics: Strategic Offerings
Deliver enterprise-level advanced analytics in the Database

Oracle in-Database Data Mining algorithms
- Access through the free GUI from SQL Developer, or programmatically from SQL, PL/SQL, R or Java
- Predictive model APIs for Oracle R Enterprise
- Exadata architecture advantages for up to 5x improvement with Smart Scan

Oracle R Distribution
- Free download, pre-installed on Oracle Big Data Appliance, bundled with Oracle Linux
- Enhanced linear algebra performance: Intel's Math Kernel Library, AMD's Core Math Library (Windows and Linux), SUN Solaris and IBM AIX
- Enterprise support for customers of Oracle Advanced Analytics, Big Data Appliance, and Oracle Linux

Copyright 2012, Oracle and/or its affiliates. All rights reserved.
Oracle's Advanced Analytics: Strategic Offerings
Deliver enterprise-level R in the Database or Hadoop

Oracle R Enterprise
- Transparent access to database-resident data from R
- Embedded R script execution through database-managed R engines
- Statistics engine
- Enhanced support for high-speed Exadata scoring

Oracle R Connector for Hadoop [ORCH] (part of Oracle Big Data Connectors)
- R interface to the Oracle Hadoop cluster on BDA and to non-Oracle Hadoop clusters
- Access and manipulate data in HDFS, database, and file system
- Write MapReduce functions using R and execute them through a natural R interface
- Predictive models with execution in-cluster against Hadoop-stored data
Oracle R Components: component layout
- Analyst Laptop: Oracle R Distribution, Oracle R Connector for Hadoop Client, Oracle R Enterprise Client Packages
- Big Data Appliance: Oracle R Distribution, Oracle R Connector for Hadoop, Oracle R Enterprise Client Packages
- Exadata: Oracle R Distribution, Oracle R Enterprise Server Components, Oracle R Enterprise Client Packages
- Oracle Database: optional with ORCH
Knowledge Exploitation Process
Typical stages in a Big Data project, centred on the Data Scientist:
Business Understanding, Data Selection, Data Discovery, Data Preparation, Model Building, Evaluation, Deployment
Loading Data with Oracle R Enterprise

    R> library(ORE)
    R> df <- data.frame(A=1:26, B=letters[1:26])
    R> dim(df)
    [1] 26  2
    R> class(df)
    [1] "data.frame"
    R> ore.create(df, table="DF_TABLE")
    R> ore.ls()
    [1] "DF_TABLE"
    R> class(DF_TABLE)
    [1] "ore.frame"
    attr(,"package")
    [1] "OREbase"
    R> dim(DF_TABLE)
    [1] 26  2
Data Discovery with Oracle R in-db and HDFS

    library(ORE)
    ore.ls()             # list tables in DB
    class(my_table)      # ore.frame
    dim(my_table)        # overloaded R functions
    head(my_table)
    sample(my_table)
    summary(my_table)

    library(ORCH)
    hdfs.ls()
    hdfs.dim("myhdfsdata")
    hdfs.head("myhdfsdata")
    hdfs.sample("myhdfsdata")
    hdfs.toHive("myhdfsdata", tablename="my_hive_data")
    summary(my_hive_data)
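The "overloaded R functions" comment refers to calls such as head() and summary() being dispatched on a proxy class instead of on an in-memory data.frame. The following is a toy base-R sketch of that dispatch idea only. The class name toy_proxy and the SQL strings it builds are hypothetical and say nothing about how ORE is implemented internally.

```r
# Toy illustration: the object holds a table name, never the rows themselves.
# "toy_proxy" and the generated SQL text are assumptions made for this sketch.
toy_proxy <- function(table) {
  structure(list(table = table), class = "toy_proxy")
}

# S3 method for utils::head(): describe the work as SQL instead of doing it in R.
head.toy_proxy <- function(x, n = 6L, ...) {
  sprintf("SELECT * FROM %s FETCH FIRST %d ROWS ONLY", x$table, as.integer(n))
}

# S3 method for base::summary(): again only returns the SQL that would be sent.
summary.toy_proxy <- function(object, ...) {
  sprintf("SELECT COUNT(*) AS n FROM %s", object$table)
}

my_table    <- toy_proxy("MY_TABLE")
head_sql    <- head(my_table, n = 3)   # looks like ordinary R to the caller
summary_sql <- summary(my_table)

print(head_sql)
print(summary_sql)
```

The caller writes the same head() and summary() calls as for a local data.frame; only the class of the object decides where the work would run.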
Data Prep with Oracle R in-db and HDFS

    library(ORE)   # and/or library(ORCH)

    # join
    merge(MY_TABLE1, MY_TABLE2, by.x="x1", by.y="x2")

    # project columns
    df <- MY_TABLE[, c("X","Y","Z")]

    # filter rows
    df <- df[df$Z <= 4.3 | df$Y == "b", 1:3]

    # binning
    IRIS_TAB <- ore.push(iris[1:4])
    IRIS_TAB$PetalBins <- ifelse(IRIS_TAB$Petal.Length < 2.0, "SMALL PETALS",
                          ifelse(IRIS_TAB$Petal.Length < 4.0, "MEDIUM PETALS",
                                 "LARGE PETALS"))
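The binning step can be tried on a local data.frame before pushing anything to the database. The sketch below is plain base R on the built-in iris data. It shows the nested ifelse() next to an equivalent cut() call, and it makes no assumption about whether cut() is available for ore.frame objects.

```r
# Local, in-memory analogue of the binning step (no database connection needed).
iris_local <- iris[1:4]

# Same nested ifelse() logic as in the slide, applied to a plain data.frame.
bins_ifelse <- ifelse(iris_local$Petal.Length < 2.0, "SMALL PETALS",
               ifelse(iris_local$Petal.Length < 4.0, "MEDIUM PETALS",
                      "LARGE PETALS"))

# Equivalent binning with cut(); right = FALSE makes each interval
# [lower, upper), which matches the strict "<" comparisons above.
bins_cut <- cut(iris_local$Petal.Length,
                breaks = c(-Inf, 2.0, 4.0, Inf),
                labels = c("SMALL PETALS", "MEDIUM PETALS", "LARGE PETALS"),
                right  = FALSE)

iris_local$PetalBins <- bins_cut
print(table(iris_local$PetalBins))
```

cut() returns a factor with a fixed level order, which keeps the bins in a sensible order in tables and plots; ifelse() returns plain character strings.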
Densifying data: custom MapReduce jobs
Count occurrences of hashtags in tweets per customer, for selected tags

    # Mapper: emit (screenname, tag) for every important hashtag in each tweet
    maphashtags <- function(k, v) {
      x <- strsplit(v$text, " ")             # one vector of tokens per tweet
      importanttags <- tolower(importanttags)
      for (twt in seq_along(x)) {
        for (tag in x[[twt]]) {
          if (substr(tag, 1, 1) == "#") {    # empty tokens fail this test and are skipped
            tagl <- tolower(tag)
            if (tagl %in% importanttags) {
              orch.keyval(v[twt, "screenname"], tagl)
            }
          }
        }
      }
    }

    # Reducer: k = screenname, vals = the tags emitted for that screenname
    reducehashtags <- function(k, vals) {
      importanttags <- tolower(importanttags)
      vals <- factor(vals$val, levels = importanttags)
      x <- as.data.frame(t(as.matrix(table(vals))))
      orch.keyval(k, x)   # x = one-row data frame of counts, one column per important tag
    }
ORCH: Create your own MapReduce jobs
Count occurrences of hashtags in tweets per customer, for selected tags

    importanttags <- c("#bigdata", "#database", "#oracle", "#sql")
    tag.summary <- hadoop.exec(tweets.id,
        mapper  = maphashtags,
        reducer = reducehashtags,
        export  = orch.export(importanttags = importanttags),
        config  = new("mapred.config",
            job.name      = "TwitterScreenNameHashTags",
            reduce.tasks  = 5,
            map.output    = data.frame(key='a', val='a'),
            reduce.output = data.frame(key='a', bigdata=0, database=0, oracle=0, sql=0)))

    > hdfs.get(tag.summary)
                 key bigdata database oracle sql
    1 twitter.user.1       4        7     37  91
    2 twitter.user.2      15       19      1  32
    3 twitter.user.3     104       57      8   0
    4 twitter.user.4       0       64    549   0
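To see what the map and reduce steps compute without a Hadoop cluster, the same logic can be walked through in memory. This is a base-R sketch under stated assumptions: the three tweets are made-up sample data, map_tags and reduce_tags are hypothetical local stand-ins for the mapper and reducer, and split() plays the role of the shuffle phase.

```r
# Toy, in-memory walk-through of the map and reduce steps (no Hadoop involved).
# The tweets below are invented sample data for illustration only.
importanttags <- tolower(c("#bigdata", "#database", "#oracle", "#sql"))

tweets <- data.frame(
  screenname = c("user.a", "user.a", "user.b"),
  text = c("Loving #Oracle and #SQL", "More #sql tips #random", "#BigData with #oracle"),
  stringsAsFactors = FALSE
)

# Map: emit one (screenname, tag) pair per important hashtag found in a tweet.
map_tags <- function(v) {
  tokens <- strsplit(v$text, " ", fixed = TRUE)
  pairs <- lapply(seq_along(tokens), function(i) {
    tags <- tolower(tokens[[i]])
    tags <- tags[startsWith(tags, "#") & tags %in% importanttags]
    data.frame(key = rep(v$screenname[i], length(tags)), val = tags,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Reduce: for one key, count every important tag (zero counts included).
reduce_tags <- function(k, vals) {
  counts <- table(factor(vals, levels = importanttags))
  data.frame(key = k, t(as.matrix(counts)), check.names = FALSE,
             stringsAsFactors = FALSE)
}

mapped  <- map_tags(tweets)
grouped <- split(mapped$val, mapped$key)   # shuffle: group emitted values by key
result  <- do.call(rbind, Map(reduce_tags, names(grouped), grouped))
rownames(result) <- NULL
print(result)
```

Each screen name ends up as one dense row with a count for every selected tag, which is the "densifying" effect the job is after.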
Modelling with Oracle R in-db and HDFS

    # Clustering with ORE
    X <- ore.push(data.frame(x))
    km.mod1 <- ore.odmKMeans(~., X, num.centers=2, num.bins=5)
    summary(km.mod1)
    rules(km.mod1)
    clusterhists(km.mod1)

    # Regression with ORCH
    mod.lm <- orch.lm(myformula, myData, nreducers = 2)
    summary(mod.lm)
    pred <- predict.orch.lm(mod.lm, newdata = myData)
    res.pred <- hdfs.get(pred)
    head(res.pred)
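The clustering call above uses a matrix x that the slide does not define. As a local stand-in, the sketch below generates two synthetic point clouds and clusters them with stats::kmeans. The data, the seed and the use of kmeans() are assumptions made for illustration, not the in-database algorithm.

```r
# Local stand-in for the clustering step, using stats::kmeans on synthetic data.
# The two point clouds are generated here purely for illustration.
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Two centres, several random starts to avoid a poor local optimum.
km_local <- kmeans(x, centers = 2, nstart = 5)

print(km_local$size)      # number of points assigned to each cluster
print(km_local$centers)   # cluster centroids
```

A matrix built this way can also be wrapped in data.frame(x) and passed to ore.push() as in the slide, once a database connection is available.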
In-database performance advantage
R lm vs. ORE ore.lm: 500k to 1.5m records, 3 predictors
Performance: 2x-3x improvement for build, 4x improvement for scoring
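For readers who want a client-side baseline to compare against, the sketch below times base R lm() for build and predict() for scoring on simulated data with three predictors. The row count, coefficients and seed are arbitrary choices for this sketch. It does not reproduce the data or the numbers quoted above, and it does not call ore.lm.

```r
# Client-side baseline only: time base R lm() build and predict() scoring.
# Simulated data; not the data set behind the figures on the slide.
set.seed(1)
n <- 1e5   # raise toward 5e5 to 1.5e6 to approach the sizes quoted above

dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + 0.25 * dat$x3 + rnorm(n)

build_time <- system.time(fit  <- lm(y ~ x1 + x2 + x3, data = dat))
score_time <- system.time(pred <- predict(fit, newdata = dat))

print(build_time["elapsed"])   # seconds to fit the model
print(score_time["elapsed"])   # seconds to score all rows
```

Timings depend heavily on hardware and on the linear algebra library in use, so any comparison should be run on the same machine and data.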
In-database performance advantage: lm
More tests at http://blogs.oracle.com/r/entry/oracle_r_enterprise_1_32
Deploying with Oracle R Enterprise: Production
- Load R scripts into the ORE script repository
- Invoke R scripts by name from SQL
- Store R objects directly in Oracle Database (no separate files)
- Optional return values:
  - Data frame consumable by any SQL-ready application
  - XML containing structured data, complex R objects, PNG images
  - PNG table with BLOB column containing images for immediate consumption
- Schedule for automatic execution
Oracle Advanced Analytics: Embedded R Execution, SQL interface
rqEval: generate an XML string for graphic output (for example, for Oracle BI Publisher)

Oracle PL/SQL, wrapping the R language function:

    begin
      sys.rqScriptCreate('example6',
        'function() {
           res <- 1:10
           plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
           res
         }');
    end;
    /

Oracle SQL:

    select value from table(rqEval(NULL, 'XML', 'example6'));
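Before registering a script in the database, it can help to run the same function body locally and confirm what it returns. The sketch below does that in plain R. Drawing to a temporary PDF file is an assumption made so the code runs without a display, and it is not part of the embedded-execution interface.

```r
# Local dry run of the function body registered as 'example6' above.
example6 <- function() {
  res <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
  res
}

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)         # open a file-based graphics device
value <- example6()    # the plot is a side effect; the return value is 1:10
dev.off()              # close the device so the file is written

print(value)
print(file.exists(plot_file))
```

The function produces two things, a return value and a plot, which is why the XML output format that can carry both structured data and images is used in the SQL query.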
Summary

Oracle R Enterprise (ORE)
- A comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments
- Wide range of in-database advanced analytics algorithms exposed through R
- Eliminate R client memory limits

Oracle R Connector for Hadoop (ORCH)
- A collection of R packages enabling Big Data analytics from an R environment
- Allows R users to leverage a Hadoop cluster with HDFS and MapReduce from R
- Prepackaged advanced analytics algorithms
- Transparent manipulation of Hive data

Enable R users to conduct Big Data projects from R
- Eliminate the client R engine memory barrier
- Scale to large data sets
- Deploy R-based solutions without translation to other languages or environments
Resources
- http://www.oracle.com/goto/r
- Blog: https://blogs.oracle.com/r/
- Forum: https://forums.oracle.com/forums/forum.jspa?forumid=1397
- Oracle R Distribution: http://www.oracle.com/technetwork/indexes/downloads/r-distribution-1532464.html
- ROracle: http://cran.r-project.org/web/packages/roracle
- Oracle R Enterprise: http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise
- Oracle R Connector for Hadoop: http://www.oracle.com/us/products/database/big-data-connectors/overview