Massive Predictive Modeling using Oracle R Technologies
Mark Hornick, Director, Oracle Advanced Analytics

Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Agenda
1. Massive Predictive Modeling
2. Use Cases
3. Enabling Technologies

Quick Survey: How many models have you built in your lifetime?
> 10, > 100, > 1,000, > 10,000, > 100,000, > 1,000,000

[Chart: Massive Predictive Modeling: number of models (1 to millions) plotted against data size (100s to billions of rows), ranging from generalized to specialized models]

[Chart: the same axes annotated with number of models per entity (1 to 1000s), contrasting broad-coverage modeling with targeted modeling]

Massive Predictive Modeling: Goals
- Build one or more models per entity, e.g., customer
- Understand and/or predict entity behavior
- Aggregate results across entities, e.g., to assess future demand
[Diagram: one model per customer; summing over customers (Σ, cust = 1..n) yields demand over time]

Massive Predictive Modeling: Challenges
- Effectively dealing with Big Data: hardware, software, network, storage
- Algorithms that scale and perform with Big Data
- Building many models in parallel
- Production deployment: storing and managing models; backup, recovery, and security

Use Cases

Predicting Customer Electricity Usage

Motivation: Energy Theft
- Detect patterns of meter tampering
- Store information about which meters have been tampered with
- Analyze and support decision making
- Forecast future behavior
A SA country loses US$4 billion per year due to energy theft.

Motivation: Different Customers, Different Demands
- Create a demand and consumption curve for each customer
- Analyze in which period the company will have to deliver more energy
- Price electricity for a given period
- Store information about each customer's consumption in different periods of the day
- Each customer has different demand and consumption patterns
- The customer decides when to use energy to reduce cost
- The company redirects energy to where it is most needed at the moment, saving on generation

Sensor Data Analysis
Model each customer's usage to understand behavior and predict individual usage and overall aggregate demand.
- Consider 200K customers, each with a utility smart meter
- 1 reading per meter per hour: 200K x 8,760 hours/year = 1.752B readings per year
- 3 years' worth of data: 5.256B readings, or 26,280 readings per customer
- At 10 seconds to build each model: 555.6 hours (23.2 days) serially; with a degree of parallelism (DOP) of 128, about 4.3 hours
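A quick back-of-the-envelope check of these figures, as a minimal sketch in plain R (the customer count, reading frequency, per-model build time, and degree of parallelism are the assumptions stated above):

customers      <- 200e3        # smart meters
readings_hr    <- 1            # 1 reading per meter per hour
years          <- 3
build_sec      <- 10           # seconds to build one model
dop            <- 128          # degree of parallelism

readings_year  <- customers * readings_hr * 8760      # 1.752e9 readings per year
readings_total <- readings_year * years               # 5.256e9 readings
per_customer   <- readings_total / customers          # 26,280 readings per customer

serial_hours   <- customers * build_sec / 3600        # ~555.6 hours (~23.2 days)
parallel_hours <- serial_hours / dop                  # ~4.3 hours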

Database-centric architecture: smart meter scenario (model build)
[Diagram: customer data (c1, c2, ..., ci, ..., cn) resides in Oracle Database; an R script f(dat, args, ...) from the R script repository runs once per customer partition to build a model; the resulting models (Model c1 ... Model cn) are stored in an R datastore in the database.]

Database-centric architecture: smart meter scenario (scoring)
[Diagram: the same per-customer partitions are scored by an R script f(dat, args, ...) that loads each customer's model from the R datastore and produces scores c1 ... cn.]

How many lines of code do you think it should take to implement this?

Build models and store in the database, partitioned on CUST_ID (14 lines)

ore.groupApply(CUST_USAGE_DATA,
               CUST_USAGE_DATA$CUST_ID,
               function(dat, ds.name) {
                 cust_id <- dat$CUST_ID[1]
                 # Build a per-customer linear model of consumption
                 mod <- lm(CONSUMPTION ~ . - CUST_ID, dat)
                 # Drop components not needed for prediction to shrink the stored model
                 mod$effects <- mod$residuals <- mod$fitted.values <- NULL
                 name <- paste("mod", cust_id, sep="")
                 assign(name, mod)
                 # Save the model in a per-customer datastore, e.g., "myDatastore.42"
                 ds.name1 <- paste(ds.name, ".", cust_id, sep="")
                 ore.save(list=name, name=ds.name1, overwrite=TRUE)
                 TRUE
               },
               ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE)
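To verify what the build step produced, the per-customer datastores can be listed and a single model loaded back into the client session. A minimal sketch, assuming the "myDatastore.<CUST_ID>" naming pattern from the script above; customer id 100 is used purely as an example:

ore.datastore()              # list datastores, one per customer (e.g., "myDatastore.100")

ore.load("myDatastore.100")  # restores the object saved as "mod100" into this session
summary(mod100)              # inspect that customer's linear model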

Score customers in the database, partitioned on CUST_ID (16 lines)

ore.groupApply(CUST_USAGE_DATA_NEW,
               CUST_USAGE_DATA_NEW$CUST_ID,
               function(dat, ds.name) {
                 cust_id <- dat$CUST_ID[1]
                 # Load this customer's model from its datastore
                 ds.name1 <- paste(ds.name, ".", cust_id, sep="")
                 ore.load(ds.name1)
                 name <- paste("mod", cust_id, sep="")
                 mod <- get(name)
                 # Score the new data with the customer's model
                 prd <- predict(mod, newdata=dat)
                 # Order predictions by their numeric row names
                 prd <- prd[order(as.integer(names(prd)))]
                 data.frame(CUST_ID=cust_id, PRED=prd)
               },
               ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE,
               FUN.VALUE=data.frame(CUST_ID=numeric(0), PRED=numeric(0)))
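Because FUN.VALUE specifies the shape of each group's result, the scoring call returns a table-like object that can be pulled to the client and aggregated, tying back to the goal of assessing overall demand. A minimal sketch, assuming the ore.groupApply call above was assigned to an object named SCORES (a name introduced here for illustration):

scores_local <- ore.pull(SCORES)      # materialize the per-customer scores locally
head(scores_local[order(scores_local$CUST_ID), ])

# Aggregate predictions, e.g., total predicted consumption per customer
per_cust <- aggregate(PRED ~ CUST_ID, data=scores_local, FUN=sum)
sum(per_cust$PRED)                    # overall predicted demand across all customers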

Execution Examples (DOP=24, one model per customer)

# Models    Rows             Total build time    Total scoring time (all data)
1,000       26,280,000       65.2 seconds        25.7 seconds
10,000      262,800,000      516 seconds         217 seconds
50,000      1,314,000,000    55.85 minutes       18 minutes

[Chart: build time and score time in seconds (log scale, 1 to 10,000) vs. number of rows in millions (26.3, 262.8, 1314)]

Simulation

Compute the distribution of generated random normal values

simulation <- function(index, n) {
  set.seed(index)                            # reproducible per-trial seed
  x <- rnorm(n)                              # n random normal values
  res <- data.frame(t(matrix(summary(x))))   # min, Q1, median, mean, Q3, max as one row
  names(res) <- c("min","q1","median","mean","q3","max")
  res$id <- index
  res
}
(res <- simulation(1, 1000))

Simulation with sample size 1000 over 10 trials

res <- ore.indexApply(10, simulation, n=1000,
                      FUN.VALUE=res[1,], parallel=TRUE)   # 10 trials run in the database
stats <- ore.pull(res)                                    # bring the summary rows to the client
library(reshape2)
melt.stats <- melt(stats, id.vars="id")
boxplot(value ~ variable, data=melt.stats,
        main="Distribution of Stats - sample 1000, 10 trials")

Simulation with sample sizes 10^(1:6) and 100 trials

num.trials <- 100
for (n in 10^(1:6)) {
  # Time the end-to-end run: 100 trials of size n executed in the database
  t1 <- system.time(stats <- ore.pull(ore.indexApply(num.trials, simulation, n=n,
                                                     FUN.VALUE=res[1,], parallel=TRUE)))[3]
  cat("n=", n, ", time=", t1, "\n")
  melt.stats <- melt(stats, id.vars="id")
  boxplot(value ~ variable, data=melt.stats,
          main=paste("Distribution of Stats - sample", n, ",", num.trials, "trials"))
  gc()
}

Plot Results: sample sizes 10^(1:6) and 100 trials
[Plots: boxplots of the summary statistics for each sample size]

Scalable Performance: varying the number of trials from 200 to 5000 for sample sizes 10^x
[Chart: execution time vs. number of trials]

Enabling Technologies

Oracle R Enterprise: part of the Oracle Advanced Analytics Option to Oracle Database
- Eliminate the memory constraint of the client R engine
- Minimize or eliminate data movement latency
- Execute R scripts on the database server machine for scalability and performance
- Achieve scalability and performance by leveraging Oracle Database as an HPC environment
- Enable integration and management of R scripts through SQL
- Operationalize entire R scripts in production applications, eliminating the need to port R code
- Avoid reinventing code to integrate R results into existing applications
[Diagram: client R engine with ORE packages and the transparency layer connects to Oracle Database (user tables, in-database statistics) on the database server machine; SQL interfaces such as SQL*Plus and SQL Developer access the same functionality]
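To illustrate the transparency layer, the sketch below connects to a schema and works with a table as an ore.frame proxy without pulling the data to the client. This is a minimal sketch: the connection details are placeholders, and the CUST_USAGE_DATA table and CONSUMPTION column are assumed from the earlier example:

library(ORE)

# Connection details below are placeholders, not from the presentation
ore.connect(user="rquser", sid="orcl", host="dbhost",
            password="password", all=TRUE)   # all=TRUE syncs and attaches schema tables

ore.ls()                              # tables exposed as ore.frame proxies
dim(CUST_USAGE_DATA)                  # metadata only; the data stays in the database
head(CUST_USAGE_DATA, 5)              # a small sample is pulled for display
mean(CUST_USAGE_DATA$CONSUMPTION)     # computed in-database via overloaded functions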

Oracle's R Technologies
- Oracle R Distribution and ROracle: software available to the R community for free
- Oracle R Enterprise
- Oracle R Advanced Analytics for Hadoop
Come to our booth to learn more.

Resources
- Oracle R Distribution, ROracle, Oracle R Enterprise, Oracle R Advanced Analytics for Hadoop: http://oracle.com/goto/r
- Book: Using R to Unlock the Value of Big Data
- Blog: https://blogs.oracle.com/r/
- Forum: https://forums.oracle.com/forums/forum.jspa?forumid=1397

FastR
- A new implementation of R in Java
- Uses the Truffle interpreter framework and the Graal optimizing compiler in conjunction with the HotSpot JVM for high performance, scalability, and portability
- Dynamically compiles, adaptively optimizes, and deoptimizes at run time
- Joint effort: Oracle Labs (Germany, USA, Austria), JKU Linz (Austria), Purdue University (USA), TU Dortmund (Germany)
- Open-source project (research prototype!), GPLv2: https://bitbucket.org/allr/fastr
More info at the poster session.
