Oracle R zum Anfassen (Hands-on Oracle R): The Topics




R zum Anfassen: The Topics

09:30  Welcome
09:45  R zum Anfassen: introduction
10:15  Mini-course in the R language: language features, help facilities, GUIs for writing scripts; creating appealing graphics quickly and easily
11:00  Break
11:15  Showcase part 1: data mining with R; comparing the accuracy of two models
11:35  R Enterprise: simple usage, performance; showcase part 2: data mining with R inside the database
12:20  Lunch break
13:00  Big Data & R: Hadoop, MapReduce & R; R as a prototyping instrument for big data
13:30  Closing questions

Why Big Data, and why now?
- A new kind of data creation, storage, and handling: incidentally generated data, machine-generated mass data, communications data, geo data, text data; "low density" data
- The promise: new business ideas, better insights, optimized processes
- The open questions: Which data is interesting? How should it be stored? Which analysis methods are appropriate? What costs arise?

Data Warehouse and Big Data (architecture diagram)
- Internal data (classic BI sources): customers, suppliers, products, employees, inventory, sales, accounting
- External data: log files, web clicks, mails, call center, contracts, rates, reports, web services, purchase data
- Integration layer (enterprise information): harmonization, master data, validation, reference data, revenues/facts
- Storage: relational Database 12c (DWH) with Hadoop loader; Hadoop with HDFS, a NoSQL DB, and the MapReduce framework
- User view: KPIs, sandbox, event processing, SQL, interactive dashboards, reporting & publishing, guided search & experiences, real-time decisions, predictive analytics & mining

In-Database Analytics: the Big Data Platform
- Big Data Appliance, optimized for Hadoop, R, and NoSQL processing: Hadoop, open-source R, NoSQL Database, applications
- Big Data Connectors and Data Integrator link the appliance to the database tier
- Exadata, the system of record, optimized for DW/OLTP: Advanced Analytics, Data Warehouse, Database
- Exalytics, optimized for analytics and in-memory workloads: Enterprise Performance Management, Business Intelligence Applications, Business Intelligence Tools, Endeca Information Discovery; embeds TimesTen
- Event processing spans the platform

R Enterprise Predictive Analytics (architecture)
- Client: the user's R engine with open-source R packages plus the R Enterprise packages; it sends SQL to the database server machine and receives results back in R
- Database server machine: user tables plus one or more R engines spawned and managed by the database, again with open-source and R Enterprise packages
- Techniques: linear models, clustering, segmentation, neural networks
- Scale-out: MapReduce nodes and HDFS nodes of the Hadoop cluster (Big Data Appliance)
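
To make the client side of this picture concrete, here is a minimal sketch of an Oracle R Enterprise session; the connection details, the ONTIME_S columns used, and the choice of ore.lm() are illustrative assumptions, not taken from the slides:

    library(ORE)
    # Connect the local R session to the database (hypothetical credentials)
    ore.connect(user = "rquser", password = "welcome1",
                sid = "orcl", host = "dbserver", all = TRUE)

    ore.ls()                    # database tables appear as ore.frame proxies
    # Fit a linear model inside the database; the rows never move to the client
    mod <- ore.lm(ARRDELAY ~ DEPDELAY + DISTANCE, data = ONTIME_S)
    summary(mod)
    ore.disconnect()

The point of the architecture is visible here: the model is computed by R engines managed by the database, next to the data, while the client only handles the small result objects.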

Types of analysis using Hadoop
- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment

MapReduce
- Provides parallelization and distribution with fault tolerance; MapReduce programs provide access to the data stored in Hadoop
- "Map" phase: a map task typically operates on one HDFS block of data; the map tasks process the smaller sub-problems, store their results in HDFS, and report success to the JobTracker
- "Reduce" phase: a reduce task receives sorted subsets of the map task results; one or more reducers compute answers that are combined into the final answer, which is stored in HDFS
- Computational processing can occur on unstructured as well as structured data
- Abstracts all the housekeeping away from the developer (see the word-count example on the following slides)

MapReduce example, graphically speaking
- HDFS DataNodes feed (key, values) pairs into the map tasks
- map: each mapper emits intermediate pairs for the output keys, e.g. (key A, values), (key B, values), (key C, values)
- shuffle and sort: aggregates the intermediate values by output key, producing (key A, intermediate values), (key B, intermediate values), (key C, intermediate values)
- reduce: one reduce task per key computes the final values for key A, key B, and key C

Text analysis example: count the number of times each word occurs in a corpus of documents.
- The documents are divided into blocks in HDFS, with one mapper per block of data.
- map: outputs each word and its count, emitting 1 each time a word is encountered:

    Key      Value
    The      1
    Big      1
    Data     1
    word     1
    count    1
    example  1

- Shuffle and sort: one reducer receives only the key-value pairs for the word "Big" and sums up the counts (one or more reducers combine the results):

    Key   Value
    Big   1
    ...
    Big   1

- It then outputs the final key-value result:

    Key   Value
    Big   2040

MapReduce example, graphically speaking: word count
- Input to the mapper: for word count there is no key, only a value (the text) coming from the HDFS DataNodes
- Mapper output: a set of key-value pairs where the key is the word and the value is the count 1
- shuffle and sort: aggregates the intermediate values by output key
- Each reducer receives the values for each word: the key is the word, the value is a set of counts; it outputs the key as the word and the value as the sum

Mapper and reducer code in ORCH for word count:

    # Read the corpus and strip punctuation
    corpus <- scan("corpus.dat", what=" ", quiet=TRUE, sep="\n")
    corpus <- gsub("([/\\\":,#.@-])", " ", corpus)

    # Load the data into HDFS
    input <- hdfs.put(corpus)

    # Specify and invoke the map-reduce job
    res <- hadoop.exec(dfs.id = input,
      mapper = function(k, v) {
        # Split into words and output each word with count 1
        x <- strsplit(v[[1]], " ")[[1]]
        x <- x[x != '']
        out <- NULL
        for (i in 1:length(x))
          out <- c(out, orch.keyval(x[i], 1))
        out
      },
      reducer = function(k, vv) {
        # Sum the count of each word
        orch.keyval(k, sum(unlist(vv)))
      },
      config = new("mapred.config",
        job.name      = "wordcount",
        map.output    = data.frame(key='', val=0),
        reduce.output = data.frame(key='', val=0))
    )
    res
    hdfs.get(res)
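
Because reduce.output is declared as data.frame(key='', val=0), the final hdfs.get(res) call materializes the job's output in the client session as an R data.frame with a character key column (the word) and a numeric val column (its total count), one row per distinct word.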

R Connector for Hadoop (on the Big Data Appliance)
- Architecture: an R script on the client (R distribution plus ORD and CRAN packages) drives R HDFS, R MapReduce, and R sqoop; Hadoop jobs (mapper/reducer) run on the MapReduce nodes of the Hadoop cluster (BDA), data lives on the HDFS nodes, and sqoop connects to the database
- Provides transparent access to the Hadoop cluster: MapReduce and HDFS-resident data
- Access and manipulate data in HDFS, the database, and the file system, all from R (see the sketch below)
- Write MapReduce functions in R and execute them through a natural R interface
- Leverage CRAN R packages to work on HDFS-resident data
- Transition work from the lab to production deployment on a Hadoop cluster without requiring knowledge of Hadoop internals, the Hadoop CLI, or the IT infrastructure
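
A minimal sketch of that three-way access from a single R session, using only calls that appear on the following slides; the file and object names are illustrative:

    library(ORCH)

    # File system -> R: read a local file into a data.frame
    dat <- read.csv("ontime_s.dat")

    # R -> HDFS: store the data.frame as an HDFS object
    dfs <- hdfs.put(dat, key = "DEST", dfs.name = "ontime_r")

    # Database -> R: pull a database table into client memory
    # (ontime_s2000 is assumed to be an ore.frame proxy from an ORE connection)
    local <- ore.pull(ontime_s2000)

    # HDFS -> R: bring HDFS-resident data back into the session
    head(hdfs.get(dfs))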

Exploring available data: HDFS, database, file system

HDFS:

    hdfs.pwd()
    hdfs.ls()
    hdfs.mkdir("xq")
    hdfs.cd("xq")
    hdfs.ls()
    hdfs.size("ontime_s")
    hdfs.parts("ontime_s")
    hdfs.sample("ontime_s", lines=3)

Database:

    ore.ls()
    names(ontime_s)
    head(ontime_s, 3)

File system:

    getwd()
    dir()        # or list.files()
    dir.create("/home/oracle/orch")
    setwd("/home/oracle/orch")
    dat <- read.csv("ontime_s.dat")
    head(dat)

Loading data into HDFS

From a file, with hdfs.upload (key is the first column, YEAR):

    hdfs.rm('ontime_file')
    ontime.dfs_file <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_file')
    hdfs.exists('ontime_file')

From a database table, with hdfs.push (key column DEST):

    hdfs.rm('ontime_db')
    ontime.dfs_d <- hdfs.push(ontime_s2000, key='dest', dfs.name='ontime_db')
    hdfs.exists('ontime_db')

From an R data.frame, with hdfs.put (key column DEST):

    hdfs.rm('ontime_r')
    ontime <- ore.pull(ontime_s2000)
    ontime.dfs_r <- hdfs.put(ontime, key='dest', dfs.name='ontime_r')
    hdfs.exists('ontime_r')

hadoop.exec() concepts

1. Mapper
   - Receives a set of rows from the HDFS file as (key, value) pairs
   - The key has the same data type as that of the input
   - The value can be of type list or data.frame
   - The mapper outputs (key, value) pairs using orch.keyval()
   - The value can be ANY R object, packed using orch.pack()
2. Reducer
   - Receives (packed) input of the type generated by a mapper
   - Outputs (key, value) pairs using orch.keyval()
   - The value can be ANY R object, packed using orch.pack()
3. Variables from the R environment can be exported to the Hadoop environment, for use in mappers and reducers, with orch.export() (optional)
4. Job configuration (optional)

A sketch pulling these pieces together follows below.
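
A minimal sketch combining these concepts; the export parameter name, the orch.unpack() counterpart to orch.pack(), the threshold variable, and the ARRDELAY filter are assumptions for illustration rather than code from the slides:

    library(ORCH)
    input <- hdfs.attach('ontime_r')

    # Variable defined on the client, exported to the mappers/reducers
    threshold <- 30

    res <- hadoop.exec(
      dfs.id  = input,
      mapper  = function(k, v) {
        # 'threshold' is visible here only because it is exported below
        if (!is.na(v$ARRDELAY) && v$ARRDELAY > threshold)
          orch.keyval(k, orch.pack(v))      # pack an arbitrary R object as value
      },
      reducer = function(k, vv) {
        rows <- lapply(vv, orch.unpack)     # assumed inverse of orch.pack()
        orch.keyval(k, length(rows))        # per-key count of delayed flights
      },
      export  = orch.export(threshold)      # assumed parameter name
    )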

ORCH dry run
- Enables R users to test their code locally, e.g. on a laptop, before submitting the job to the Hadoop cluster; supports testing and debugging of scripts: orch.dryrun(TRUE)
- A Hadoop cluster is not required for a dry run
- The mapper and reducer code is executed sequentially, with row streams created from the HDFS input into the mapper and reducer
- Constrained by the memory available to R; it is recommended to subset or sample the input data so that it fits in memory
- Upon job success, the resulting data is put in HDFS
- No change to the R code is required for a dry run

Example: test the script in dry-run mode. Compute the average arrival delay for all flights to SFO.

    orch.dryrun(TRUE)
    dfs <- hdfs.attach('ontime_r')
    res <- NULL
    res <- hadoop.run(dfs,
      mapper = function(key, ontime) {
        if (key == 'SFO') {
          orch.keyval(key, ontime)
        }
      },
      reducer = function(key, vals) {
        sumad <- 0
        count <- 0
        for (x in vals) {
          if (!is.na(x$arrdelay)) {
            sumad <- sumad + x$arrdelay
            count <- count + 1
          }
        }
        res <- sumad / count
        orch.keyval(key, res)
      }
    )
    res
    hdfs.get(res)

Example: test the script on the Hadoop cluster, with one change. Compute the average arrival delay for all flights to SFO. The script is identical to the dry-run version except for its first line:

    orch.dryrun(FALSE)
    dfs <- hdfs.attach('ontime_r')
    # ... mapper, reducer, and result retrieval exactly as on the previous slide
    res
    hdfs.get(res)

Executing a Hadoop job in dry-run mode
- Linux client (R distribution + R Connector for Hadoop client): retrieves the data from HDFS and executes the script locally in the laptop's R engine
- Hadoop cluster software on the BDA (R distribution + R Connector for Hadoop driver package): only serves the HDFS data

Executing a Hadoop job on the Hadoop cluster
- Linux client (R distribution + R Connector for Hadoop client): submits the MapReduce job to the Hadoop cluster
- Hadoop cluster software on the BDA (R distribution + R Connector for Hadoop driver package): executes the mappers and reducers using R instances on the BDA task nodes
