Massive Predictive Modeling using Oracle R Technologies
Mark Hornick, Director, Oracle Advanced Analytics
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
1. Massive Predictive Modeling
2. Use Cases
3. Enabling Technologies
Quick Survey: How many models have you built in your lifetime? >10 >100 >1,000 >10,000 >100,000 >1,000,000
[Figures: Two charts position massive predictive modeling by data size and model count. The first plots data size (millions to billions of rows) against number of models, ranging from one generalized model to 100s of specialized models. The second plots data size against number of models per entity, ranging from 1 targeted model to 1000s of models for broad coverage.]
Massive Predictive Modeling - Goals
- Build one or more models per entity, e.g., customer
- Understand and/or predict entity behavior
- Aggregate results across entities, e.g., to assess future demand
[Figure: one model per customer (model 1, model 2, ..., model n); per-customer predictions are aggregated, Σ over cust = 1..n, into demand over time]
Massive Predictive Modeling - Challenges
- Effectively dealing with Big Data: hardware, software, network, storage
- Algorithms that scale and perform with Big Data
- Building many models in parallel
- Production deployment
- Storing and managing models
- Backup, recovery, and security
Use Cases
Predicting Customer Electricity Usage
Motivation: Energy Theft
- One SA country loses US$4 billion per year to energy theft
- Detect patterns of meter tampering
- Store information about which meters have been tampered with
- Analyze, make decisions, and forecast future behavior
Motivation: Different customers, different demands
- Each customer has different demand and consumption patterns
- Store information about each customer's consumption in different periods of the day
- Create a demand and consumption curve for each customer
- Analyze: in which period will the company have to deliver more energy?
- Price electricity in a given period
- Customers decide when to use energy to reduce cost
- The company redirects energy to where it is most needed at the moment, saving on generation
Sensor Data Analysis
Model each customer's usage to understand behavior and predict both individual usage and overall aggregate demand. Consider 200K customers, each with a utility smart meter:
- 1 reading / meter / hour: 200K x 8760 hours/year = 1.752B readings per year
- 3 years' worth of data = 5.256B readings, i.e., 26,280 readings per customer
- At 10 seconds to build each model: 555.6 hours (23.2 days) serially, or 4.3 hours with DOP 128
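The arithmetic behind these figures is easy to verify. A back-of-envelope sketch in plain R (the variable names are ours, not from the slides):

n.cust        <- 200000                    # customers, one smart meter each
hrs.per.year  <- 24 * 365                  # 8760 readings / meter / year
years         <- 3
readings.year <- n.cust * hrs.per.year     # 1.752e9 readings per year
readings.all  <- readings.year * years     # 5.256e9 readings over 3 years
per.customer  <- hrs.per.year * years      # 26,280 readings per customer
build.sec     <- 10                        # ~10 s to build one model
serial.hours  <- n.cust * build.sec / 3600 # ~555.6 hours (23.2 days) serially
dop128.hours  <- serial.hours / 128        # ~4.3 hours at degree of parallelism 128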
Database-centric architecture - Smart meter scenario (build)
[Figure: customer data partitions c1, c2, ..., ci, ..., cn in Oracle Database each feed a parallel invocation of an R script f(dat, args, ...) that builds one model per customer (Model c1 ... Model cn); models are saved to the R datastore, and the script is managed in the R script repository.]
Database-centric architecture - Smart meter scenario (score)
[Figure: the same architecture applied to scoring; each partition's model is loaded from the R datastore and an R script f(dat, args, ...) scores the new data in parallel, producing scores c1 ... scores cn.]
How many lines of code do you think it should take to implement this?
Build models and store in database, partitioned on CUST_ID (14 lines)

ore.groupApply(CUST_USAGE_DATA,
               CUST_USAGE_DATA$CUST_ID,
               function(dat, ds.name) {
                 cust_id <- dat$CUST_ID[1]
                 # Fit a per-customer linear model of consumption on all other columns
                 mod <- lm(CONSUMPTION ~ . - CUST_ID, dat)
                 # Drop large components not needed for prediction before saving
                 mod$effects <- mod$residuals <- mod$fitted.values <- NULL
                 name <- paste("mod", cust_id, sep="")
                 assign(name, mod)
                 # Save each model to its own per-customer datastore in the database
                 ds.name1 <- paste(ds.name, ".", cust_id, sep="")
                 ore.save(list=name, name=ds.name1, overwrite=TRUE)
                 TRUE
               },
               ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE)
Score customers in database, partitioned on CUST_ID (16 lines)

ore.groupApply(CUST_USAGE_DATA_NEW,
               CUST_USAGE_DATA_NEW$CUST_ID,
               function(dat, ds.name) {
                 cust_id <- dat$CUST_ID[1]
                 # Load this customer's model from its datastore
                 ds.name1 <- paste(ds.name, ".", cust_id, sep="")
                 ore.load(ds.name1)
                 mod <- get(paste("mod", cust_id, sep=""))
                 prd <- predict(mod, newdata=dat)
                 # Align predictions with their original row positions
                 prd[as.integer(names(prd))] <- prd
                 res <- cbind(CUST_ID=cust_id, PRED=prd)
                 data.frame(res)
               },
               ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE,
               FUN.VALUE=data.frame(CUST_ID=numeric(0), PRED=numeric(0)))
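With per-customer scores in hand, the aggregation step from the goals slide is straightforward. A minimal sketch (assuming the scoring call above was captured as res, an ore.frame of CUST_ID and PRED):

scores <- ore.pull(res)                    # bring predictions to the client
total.demand <- sum(scores$PRED)           # Σ over customers: aggregate demand
by.cust <- aggregate(PRED ~ CUST_ID, data=scores, FUN=sum)
head(by.cust[order(-by.cust$PRED), ])      # customers with highest predicted usage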
Execution Examples (with DOP=24), 1 model/customer
- 1,000 models: 26,280,000 rows; total build time 65.2 seconds; total scoring time 25.7 seconds (all data)
- 10,000 models: 262,800,000 rows; total build time 516 seconds; total scoring time 217 seconds (all data)
- 50,000 models: 1,314,000,000 rows; total build time 55.85 minutes; total scoring time 18 minutes (all data)
[Figure: log-scale chart of execution time (sec) vs. # rows (millions): 26.3, 262.8, 1314; build time and score time both grow roughly linearly with data size.]
Simulation
Compute distribution of generated random normal values

simulation <- function(index, n) {
  set.seed(index)                            # seed by trial index for reproducibility
  x <- rnorm(n)                              # n draws from the standard normal
  res <- data.frame(t(matrix(summary(x))))   # 1-row data.frame of summary statistics
  names(res) <- c("min", "q1", "median", "mean", "q3", "max")
  res$id <- index
  res
}
(res <- simulation(1, 1000))
Simulation with sample size 1000 over 10 trials

res <- ore.indexApply(10, simulation, n=1000, FUN.VALUE=res[1,], parallel=TRUE)
stats <- ore.pull(res)
library(reshape2)
melt.stats <- melt(stats, id.vars="id")
boxplot(value ~ variable, data=melt.stats,
        main="Distribution of Stats - sample 1000, 10 trials")
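For comparison, the same ten trials can run in open-source R on the client with lapply; ore.indexApply moves exactly this pattern into R engines spawned at the database server. A sketch (stats.local is our name, not from the slides):

# Serial, client-side equivalent of the ore.indexApply call above
stats.local <- do.call(rbind, lapply(1:10, simulation, n=1000))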
Simulation with sample sizes 10^(1:6) and 100 trials

num.trials <- 100
for (n in 10^(1:6)) {
  t1 <- system.time(
    stats <- ore.pull(ore.indexApply(num.trials, simulation, n=n,
                                     FUN.VALUE=res[1,], parallel=TRUE)))[3]
  cat("n=", n, ", time=", t1, "\n")
  melt.stats <- melt(stats, id.vars="id")
  boxplot(value ~ variable, data=melt.stats,
          main=paste("Distribution of Stats - sample", n, ",", num.trials, "trials"))
  gc()
}
[Figure: Plot results for sample sizes 10^(1:6) and 100 trials - boxplots of the summary statistics at each sample size.]
[Figure: Scalable performance, varying the number of trials from 200 to 5000 across sample sizes 10^x.]
Enabling Technologies
Oracle R Enterprise - Oracle Advanced Analytics Option to Oracle Database
- Eliminate the memory constraint of the client R engine
- Minimize or eliminate data movement latency
- Execute R scripts on the database server machine, achieving scalability and performance by leveraging Oracle Database as an HPC environment
- Enable integration and management of R scripts through SQL
- Operationalize entire R scripts in production applications, eliminating the need to port R code
- Avoid reinventing code to integrate R results into existing applications
[Figure: architecture - a client R engine with the ORE transparency-layer packages talks to Oracle Database (user tables, in-database statistics) on the database server machine; SQL interfaces such as SQL*Plus and SQL Developer access the same functionality.]
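As a flavor of the transparency layer, the sketch below shows standard R syntax being pushed down to the database. The connection details are placeholders, and ONTIME_S is a demo table shipped with ORE:

library(ORE)
ore.connect(user="rquser", password="...", sid="orcl",
            host="dbhost", all=TRUE)   # placeholder credentials
class(ONTIME_S)                        # "ore.frame": a proxy object; rows stay in the database
# aggregate() is overloaded to generate SQL and execute in-database;
# only the small aggregated result returns to the client
delay <- aggregate(ONTIME_S$ARRDELAY, by=list(ONTIME_S$DEST), FUN=mean)
head(delay)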
Oracle's R Technologies
- Oracle R Distribution
- ROracle
- Oracle R Enterprise
- Oracle R Advanced Analytics for Hadoop
Oracle R Distribution and ROracle are available to the R community for free. Come to our booth to learn more.
Resources
- Oracle R Distribution, ROracle, Oracle R Enterprise, Oracle R Advanced Analytics for Hadoop: http://oracle.com/goto/r
- Book: Using R to Unlock the Value of Big Data
- Blog: https://blogs.oracle.com/r/
- Forum: https://forums.oracle.com/forums/forum.jspa?forumid=1397
FastR
- A new implementation of R in Java
- Uses the new Truffle interpreter framework and Graal optimizing compiler in conjunction with the HotSpot JVM for high performance, scalability, and portability
- Dynamically compiles, adaptively optimizes, and deoptimizes at run time
- Joint effort: Oracle Labs (Germany, USA, Austria), JKU Linz (Austria), Purdue University (USA), TU Dortmund (Germany)
- Open-source project (research prototype!), GPLv2: https://bitbucket.org/allr/fastr
- More info at the poster session