R zum Anfassen ("Hands-on R"): Agenda

09:30  Welcome
09:45  Introduction to R zum Anfassen
10:15  Mini course in the R language
       - Language features, help facilities, GUIs for writing scripts
       - Creating appealing graphics quickly and easily
11:00  Break
11:15  Showcase part 1: Data mining with R
       - Comparing the accuracy of two models
11:35  R Enterprise: simple usage, performance
       Showcase part 2: Data mining with R inside the database
12:20  Lunch break
13:00  Big Data & R
       - Hadoop, MapReduce & R
       - R as a prototyping instrument for Big Data
13:30  Closing questions
Why and how Big Data now?

A new way data comes into being, and new demands on storage and handling:
- Data that arises incidentally
- Machine-generated mass data
- Communications data
- Geo data
- Text data
- Low-density data

The promise: new business ideas, better insights, optimized processes.

The key questions:
- Which data are interesting?
- How should they be stored?
- Which analysis methods are appropriate?
- What costs arise?
Data Warehouse and Big Data

[Architecture diagram: classical BI integrates internal data (customers, suppliers, products, employees, inventory, sales, accounting) and external data (log files, web clicks, mails, call center, contracts, market rates, reports, web services, purchase data) into an enterprise information layer (harmonization, master data checks, reference data, revenue facts) on Oracle Database 12c (DWH). Hadoop components (Hadoop Loader, HDFS, NoSQL DB, MapReduce framework) feed the same layer. The user view offers KPIs, sandboxes, event processing, SQL, real-time decisions, interactive dashboards, reporting & publishing, guided search experiences, and predictive analytics & mining.]
In-Database Analytics: the Big Data Platform

[Platform diagram: the Big Data Appliance (optimized for Hadoop, R, and NoSQL processing; runs event processing, open-source Hadoop and R, Oracle NoSQL Database, and applications) connects via the Big Data Connectors and Data Integrator to Exadata (the system of record, optimized for DW/OLTP; advanced analytics, data warehouse, database) and onward to Exalytics (optimized for analytics and in-memory workloads; enterprise performance management, Business Intelligence applications and tools, Endeca Information Discovery; embeds TimesTen).]
R Enterprise: Predictive Analytics

[Architecture diagram: the user's R engine (R Enterprise packages plus other R packages) sends SQL to the database server machine and receives results back in R. On the server, the database manages one or more R engines of its own, again with R Enterprise and other R packages, working directly on the user tables. Supported techniques include linear models, clustering, segmentation, and neural networks. For scale-out, MapReduce nodes and HDFS nodes run on a Hadoop cluster (Big Data Appliance).]
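The client side of this picture can be tried directly from an R session. The following is a minimal sketch only: the connection parameters are hypothetical, ONTIME_S is the sample table used later in this deck, and ore.lm is assumed to be available (Oracle R Enterprise 1.3 or later):

    # A minimal Oracle R Enterprise session (sketch: connection parameters
    # are hypothetical; ONTIME_S is the sample table used later in this deck).
    library(ORE)
    ore.connect(user = "rquser", password = "...", sid = "orcl",
                host = "dbserver", all = TRUE)  # expose schema tables as ore.frames
    ore.ls()                                    # list the tables visible from R
    head(ONTIME_S, 3)                           # ore.frame proxy: data stays in the DB
    # "Linear models" from the diagram: fit a model in-database; ore.lm is
    # assumed available, and the columns are from the ONTIME sample data.
    mod <- ore.lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)
    summary(mod)

The point of the transparency layer is that ONTIME_S behaves like a data.frame while the computation is pushed into the database; no rows are pulled to the client until explicitly requested.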
Types of analysis using Hadoop

- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment
MapReduce

- Provides parallelization and distribution with fault tolerance
- MapReduce programs provide access to data on Hadoop
- "Map" phase
  - A map task typically operates on one HDFS block of data
  - Map tasks process smaller subproblems, store their results in HDFS, and report success to the jobtracker
- "Reduce" phase
  - A reduce task receives sorted subsets of the map task results
  - One or more reducers compute partial answers that form the final answer
  - The final results are stored in HDFS
- Processing can occur on unstructured or structured data
- Abstracts all housekeeping away from the developer

(A tiny simulation of this contract follows below; the ORCH-specific examples come later.)
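To make the map / shuffle-and-sort / reduce phases concrete before the graphical slides, here is a tiny pure-R simulation of the contract. It involves no Hadoop at all, and every function name in it is ours, not part of any library:

    # A minimal pure-R simulation of the MapReduce contract (illustrative only):
    # map turns each input record into (key, value) pairs, shuffle/sort groups
    # the values by key, and reduce folds each group into a final result.
    map_phase <- function(records, mapper) {
      do.call(rbind, lapply(records, mapper))   # stack all (key, value) rows
    }
    shuffle_sort <- function(pairs) {
      split(pairs$value, pairs$key)             # group values by output key
    }
    reduce_phase <- function(groups, reducer) {
      mapply(reducer, names(groups), groups)    # one reducer call per key
    }

    # Word count as the canonical example (matches the slides that follow):
    records <- c("big data big", "data word")
    mapper  <- function(line) {
      words <- strsplit(line, " ")[[1]]
      data.frame(key = words, value = 1)        # emit (word, 1) per occurrence
    }
    reducer <- function(key, values) sum(values)
    reduce_phase(shuffle_sort(map_phase(records, mapper)), reducer)
    #  big data word
    #    2    2    1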
MapReduce example, graphically speaking

[Diagram: each HDFS DataNode feeds (key, values) pairs into a map task; every mapper emits pairs for keys A, B, and C. The shuffle-and-sort step aggregates the intermediate values by output key, so that each reduce task receives all intermediate values for one key (A, B, or C) and produces the final values for that key.]
Text analysis example: count the number of times each word occurs in a corpus of documents

- The documents are divided into blocks in HDFS; one mapper runs per block of data.
- map: outputs each word with count 1 each time the word is encountered, e.g. for the sentence "The Big Data word count example":

    Key      Value
    The      1
    Big      1
    Data     1
    word     1
    count    1
    example  1

- Shuffle and sort: groups the pairs by word.
- reduce: one or more reducers combine the results. One reducer receives only the key-value pairs for the word "Big" and sums up the counts:

    Key   Value
    Big   1
    ...
    Big   1

  It then outputs the final key-value result:

    Key   Value
    Big   2040
MapReduce example, graphically speaking: word count

[The same diagram, annotated for word count: there is no key, only a value, as input to each mapper. The mapper output is a set of key-value pairs where the key is the word and the value is the count (1). Shuffle and sort aggregates the intermediate values by output key, so each reducer receives, per word, the word as key and a set of counts as values, and outputs the word as key with the sum as value.]
Mapper and reducer code in ORCH for word count

    # Read the corpus and strip punctuation
    corpus <- scan("corpus.dat", what="", quiet=TRUE, sep="\n")
    corpus <- gsub("([/\\\":,#.@-])", " ", corpus)

    # Load the R data into HDFS
    input <- hdfs.put(corpus)

    # Specify and invoke the map-reduce job
    res <- hadoop.exec(dfs.id = input,
      mapper = function(k, v) {
        # Split into words and output each word with count 1
        x <- strsplit(v[[1]], " ")[[1]]
        x <- x[x != '']
        out <- NULL
        for (i in 1:length(x))
          out <- c(out, orch.keyval(x[i], 1))
        out
      },
      reducer = function(k, vv) {
        # Sum the counts of each word
        orch.keyval(k, sum(unlist(vv)))
      },
      config = new("mapred.config",
        job.name      = "wordcount",
        map.output    = data.frame(key='', val=0),
        reduce.output = data.frame(key='', val=0))
    )
    res
    hdfs.get(res)
R Connector for Hadoop

[Diagram: an R client (ORD, Oracle R Distribution, plus CRAN packages) submits R scripts as Hadoop jobs (mapper/reducer) to the Hadoop cluster on the Big Data Appliance. ORD and CRAN packages are also installed on the MapReduce and HDFS nodes; Sqoop links the cluster to the database. The ORCH layers cover R HDFS, R MapReduce, and R Sqoop access.]

- Provides transparent access to the Hadoop cluster: MapReduce and HDFS-resident data
- Access and manipulate data in HDFS, the database, and the file system, all from R
- Write MapReduce functions in R and execute them through a natural R interface
- Leverage CRAN R packages to work on HDFS-resident data
- Transition work from the lab to production deployment on a Hadoop cluster without requiring knowledge of Hadoop internals, the Hadoop CLI, or IT infrastructure
Exploring available data: HDFS, database, file system

HDFS:

    hdfs.pwd()
    hdfs.ls()
    hdfs.mkdir("xq")
    hdfs.cd("xq")
    hdfs.ls()
    hdfs.size("ontime_s")
    hdfs.parts("ontime_s")
    hdfs.sample("ontime_s", lines=3)

Database:

    ore.ls()
    names(ontime_s)
    head(ontime_s, 3)

File system:

    getwd()
    dir()            # or list.files()
    dir.create("/home/oracle/orch")
    setwd("/home/oracle/orch")
    dat <- read.csv("ontime_s.dat")
    head(dat)
Loading data into HDFS

Data from a file: use hdfs.upload (the key is the first column, YEAR):

    hdfs.rm('ontime_file')
    ontime.dfs_file <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_file')
    hdfs.exists('ontime_file')

Data from a database table: use hdfs.push (key column: DEST):

    hdfs.rm('ontime_db')
    ontime.dfs_d <- hdfs.push(ontime_s2000, key='dest', dfs.name='ontime_db')
    hdfs.exists('ontime_db')

Data from an R data.frame: use hdfs.put (key column: DEST):

    hdfs.rm('ontime_r')
    ontime <- ore.pull(ontime_s2000)
    ontime.dfs_r <- hdfs.put(ontime, key='dest', dfs.name='ontime_r')
    hdfs.exists('ontime_r')
hadoop.exec() concepts

1. Mapper
   - Receives a set of rows from the HDFS file as (key, value) pairs
   - The key has the same data type as that of the input
   - The value can be of type list or data.frame
   - The mapper outputs (key, value) pairs using orch.keyval()
   - The value can be ANY R object packed using orch.pack()
2. Reducer
   - Receives (packed) input of the type generated by a mapper
   - The reducer outputs (key, value) pairs using orch.keyval()
   - The value can be ANY R object packed using orch.pack()
3. Variables from the R environment can be exported to the mappers and reducers in the Hadoop environment using orch.export() (optional)
4. Job configuration (optional)

A sketch combining these pieces follows below.
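Put together, a job skeleton might look roughly like this. It is a sketch under several assumptions: `input` is an HDFS object id obtained earlier (e.g. from hdfs.put()), the rows carry an arrdelay column as in the ONTIME examples, and orch.unpack() is assumed as the counterpart of orch.pack():

    # Sketch of the hadoop.exec() skeleton described above (assumptions:
    # 'input' is an HDFS id, rows have an 'arrdelay' column, and
    # orch.unpack() is the counterpart of orch.pack()).
    threshold <- 30                           # plain R variable used by the mapper
    res <- hadoop.exec(
      dfs.id = input,
      mapper = function(k, v) {
        if (!is.na(v$arrdelay) && v$arrdelay > threshold)  # uses exported variable
          orch.keyval(k, orch.pack(v))        # value may be ANY packed R object
      },
      reducer = function(k, vv) {
        vals <- lapply(vv, orch.unpack)       # unpack what the mappers emitted
        orch.keyval(k, length(vals))          # count delayed flights per key
      },
      export = orch.export(threshold)         # ship R variables to the cluster
    )
    hdfs.get(res)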
ORCH dry run

- Enables R users to test R code locally, e.g. on a laptop, before submitting the job to the Hadoop cluster
- Supports testing and debugging of scripts: orch.dryrun(TRUE)
- A Hadoop cluster is not required for a dry run
- Mapper and reducer code execute sequentially
- Creates row streams from the HDFS input into the mapper and reducer
- Constrained by the memory available to R; it is recommended to subset or sample the input data to fit in memory
- Upon job success, the resulting data is put into HDFS
- No change to the R code is required for a dry run
Example: test the script in dry-run mode

Compute the average arrival delay for all flights to SFO:

    orch.dryrun(TRUE)
    dfs <- hdfs.attach('ontime_r')
    res <- NULL
    res <- hadoop.run(dfs,
      mapper = function(key, ontime) {
        if (key == 'SFO')
          orch.keyval(key, ontime)            # pass through SFO rows only
      },
      reducer = function(key, vals) {
        sumad <- 0
        count <- 0
        for (x in vals) {
          if (!is.na(x$arrdelay)) {
            sumad <- sumad + x$arrdelay
            count <- count + 1
          }
        }
        orch.keyval(key, sumad / count)       # average arrival delay for SFO
      }
    )
    res
    hdfs.get(res)
Example: test the script on the Hadoop cluster, with one change

The same job runs on the cluster by flipping only the dry-run flag; the mapper, the reducer, and the rest of the script are unchanged:

    orch.dryrun(FALSE)
    # ... identical code to the dry-run example above ...
Executing a Hadoop job in dry-run mode

[Diagram: a Linux client (R distribution plus the R Connector for Hadoop client) next to the Hadoop cluster software on the BDA (R distribution plus the R Connector for Hadoop driver package). In dry-run mode the client retrieves the data from HDFS and executes the script locally in the laptop's R engine.]
Executing a Hadoop job on the Hadoop cluster

[Same diagram: now the client submits the MapReduce job to the Hadoop cluster, and the mappers and reducers execute in R instances on the BDA task nodes.]
DATA WAREHOUSE