Big Data Analytics Using R



Similar documents
Hadoop Cluster Applications

Apache Hadoop. Alexandru Costan

Hadoop Architecture. Part 1

Hadoop Tutorial Group 7 - Tools For Big Data Indian Institute of Technology Bombay

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

RHadoop Installation Guide for Red Hat Enterprise Linux

Big Data Systems CS 5965/6965 FALL 2015

Chapter 7. Using Hadoop Cluster and MapReduce

A Service for Data-Intensive Computations on Virtual Clusters

Introduction to Hadoop

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Cloud Computing. Chapter Hadoop

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

GraySort and MinuteSort at Yahoo on Hadoop 0.23

A very short Intro to Hadoop

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

A Performance Analysis of Distributed Indexing using Terrier

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Hadoop Scheduler w i t h Deadline Constraint

Cloud Computing Summary and Preparation for Examination

Click Stream Data Analysis Using Hadoop

Hadoop. Sunday, November 25, 12

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams

Log Mining Based on Hadoop s Map and Reduce Technique

Storage Architectures for Big Data in the Cloud

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

RHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014)

How to Hadoop Without the Worry: Protecting Big Data at Scale

Lecture 10 - Functional programming: Hadoop and MapReduce

HADOOP PERFORMANCE TUNING

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Introduction to Hadoop

Simplest Scalable Architecture

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

MapReduce, Hadoop and Amazon AWS

Performance of the JMA NWP models on the PC cluster TSUBAME.

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Extending Hadoop beyond MapReduce

Understanding Hadoop Performance on Lustre

Simple Parallel Computing in R Using Hadoop

Fault Tolerance in Hadoop for Work Migration

Hadoop & its Usage at Facebook

Benchmarking Hadoop & HBase on Violin

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

UPS battery remote monitoring system in cloud computing

Amazon Web Services. Elastic Compute Cloud (EC2) and more...

Introduction to Cloud Computing

R / TERR. Ana Costa e SIlva, PhD Senior Data Scientist TIBCO. Copyright TIBCO Software Inc.

Using Big Data and GIS to Model Aviation Fuel Burn

Hadoop and Map-Reduce. Swati Gore

MapReduce and Hadoop Distributed File System

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

CSE-E5430 Scalable Cloud Computing Lecture 2

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Jeffrey D. Ullman slides. MapReduce for data intensive computing

MapReduce and Hadoop Distributed File System V I J A Y R A O

Assignment # 1 (Cloud Computing Security)

Hadoop: Embracing future hardware

RevoScaleR Speed and Scalability

Integrating VoltDB with Hadoop

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

MapReduce Evaluator: User Guide

Datacenters and Cloud Computing. Jia Rao Assistant Professor in CS

Distributed Text Mining with tm

Reduction of Data at Namenode in HDFS using harballing Technique

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop IST 734 SS CHUNG

Network Infrastructure Services CS848 Project

Open source Google-style large scale data analysis with Hadoop

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

How To Use Hadoop

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Apache HBase. Crazy dances on the elephant back

Map Reduce / Hadoop / HDFS

From GWS to MapReduce: Google s Cloud Technology in the Early Days

HADOOP MOCK TEST HADOOP MOCK TEST I

Bringing Big Data Modelling into the Hands of Domain Experts

Big Data and Apache Hadoop s MapReduce

Introducing EEMBC Cloud and Big Data Server Benchmarks

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Transcription:

October 23, 2014

Table of contents BIG DATA DEFINITION 1 BIG DATA DEFINITION Definition Characteristics Scaling Challange 2 Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues 3 Explicit Parallelism Hadoop Hadoop - HDFS

Definition Characteristics Scaling Challange Gartners term (probably) Simple (personal) Definition Too much data to be easily processed. Used mainly for marketing and FUD

Characteristics BIG DATA DEFINITION Definition Characteristics Scaling Challange

Characteristics BIG DATA DEFINITION Definition Characteristics Scaling Challange Volume

Definition Characteristics Scaling Challange Characteristics Volume Velocity

Definition Characteristics Scaling Challange Characteristics Volume Velocity Verity

Definition Characteristics Scaling Challange Characteristics Volume Velocity Verity Veracity (integrity)

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange Scale Up - more... in one box

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange Scale Up - more... in one box expensive (growing exp) closed box approach technical complexity

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange Scale Up - more... in one box expensive (growing exp) closed box approach technical complexity Scale Out - more boxes

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange Scale Up - more... in one box expensive (growing exp) closed box approach technical complexity Scale Out - more boxes cheap (avg.) best of breed usage is not trivial

ScaleUp vs. ScaleOut Definition Characteristics Scaling Challange Scale Up - more... in one box expensive (growing exp) closed box approach technical complexity Scale Out - more boxes cheap (avg.) best of breed usage is not trivial You want Scale Out? Well, Scale Out costs. And right here is where you start paying... in sweat...

Definition Characteristics Scaling Challange How long it takes to read / write data?

Definition Characteristics Scaling Challange How long it takes to read / write data? Generating a 1e7 var using runif() and writing it - 10-16 sec.

Definition Characteristics Scaling Challange How long it takes to read / write data? Generating a 1e7 var using runif() and writing it - 10-16 sec. Reading the same (96MB) file takes 1.5-6 sec.

Definition Characteristics Scaling Challange How long it takes to read / write data? Generating a 1e7 var using runif() and writing it - 10-16 sec. Reading the same (96MB) file takes 1.5-6 sec. Generating 1e9 var was not possible on the host I used

Definition Characteristics Scaling Challange How long it takes to read / write data? Generating a 1e7 var using runif() and writing it - 10-16 sec. Reading the same (96MB) file takes 1.5-6 sec. Generating 1e9 var was not possible on the host I used Calculating mean() and var() on all data was not possible

Definition Characteristics Scaling Challange How long it takes to read / write data? Generating a 1e7 var using runif() and writing it - 10-16 sec. Reading the same (96MB) file takes 1.5-6 sec. Generating 1e9 var was not possible on the host I used Calculating mean() and var() on all data was not possible

And in parallel? BIG DATA DEFINITION Definition Characteristics Scaling Challange

And in parallel? BIG DATA DEFINITION Definition Characteristics Scaling Challange Create 100 files, each size 1e7 and write them on the disk

Definition Characteristics Scaling Challange And in parallel? Create 100 files, each size 1e7 and write them on the disk For each file: read it, compute mean() and var() for each file

Definition Characteristics Scaling Challange And in parallel? Create 100 files, each size 1e7 and write them on the disk For each file: read it, compute mean() and var() for each file BUT - reading 100 files takes 100 times reading of one file :-(

Definition Characteristics Scaling Challange And in parallel? Create 100 files, each size 1e7 and write them on the disk For each file: read it, compute mean() and var() for each file BUT - reading 100 files takes 100 times reading of one file :-( Compute mean and var of all files and...

Definition Characteristics Scaling Challange And in parallel? Create 100 files, each size 1e7 and write them on the disk For each file: read it, compute mean() and var() for each file BUT - reading 100 files takes 100 times reading of one file :-( Compute mean and var of all files and...compute total VAR

Definition Characteristics Scaling Challange Now you know why you are here at this time!

One is not enough BIG DATA DEFINITION Definition Characteristics Scaling Challange Data Data can not be gathered on one node

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data Data transfer is not feasible

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data Data transfer is not feasible Processors One process, One thread (old and simple)

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data Data transfer is not feasible Processors One process, One thread (old and simple) Multiple processes, One host

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data Data transfer is not feasible Processors One process, One thread (old and simple) Multiple processes, One host Multiple hosts

Definition Characteristics Scaling Challange One is not enough Data Data can not be gathered on one node One computer can not process all data Data transfer is not feasible Processors One process, One thread (old and simple) Multiple processes, One host Multiple hosts Reference High-Performance and Parallel Computing with R by Derik Eddelbuettel

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional - requires sync

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional - requires sync Data

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional - requires sync Data - requires pre/post processing

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional - requires sync Data - requires pre/post processing Combination

Splitting tasks - How to? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Functional - requires sync Data - requires pre/post processing Combination Hadoop made it easy!

Can we parallelize forever? Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Gustafson s Law S(P) = P α(p 1) S - scale-up, P - # of processors, α - proportion that can not be parallelized Amdahl s Law T (P) = T (1)(α + 1 P T - processing time (1 α))

Thumb rules BIG DATA DEFINITION Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Programs are small, data is big move program to data!

Thumb rules BIG DATA DEFINITION Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Programs are small, data is big move program to data! Parallel processing requires conversion of results Do it wise!

Thumb rules BIG DATA DEFINITION Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Programs are small, data is big move program to data! Parallel processing requires conversion of results Do it wise! Queuing systems overhead is proportional to # of jobs & hosts

Thumb rules BIG DATA DEFINITION Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Programs are small, data is big move program to data! Parallel processing requires conversion of results Do it wise! Queuing systems overhead is proportional to # of jobs & hosts Multiple hosts might require authentication process.

Thumb rules BIG DATA DEFINITION Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Programs are small, data is big move program to data! Parallel processing requires conversion of results Do it wise! Queuing systems overhead is proportional to # of jobs & hosts Multiple hosts might require authentication process. Checklist: Pre-processing, Synchronization and Post-processing

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster Hosting services

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster Hosting services Cloud

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster Hosting services Cloud Storage (beware of outgoing traffic costs)

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster Hosting services Cloud Storage (beware of outgoing traffic costs) CPUs

Little supercomputing center Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Local cluster Hosting services Cloud Storage (beware of outgoing traffic costs) CPUs Nice tool: segue - Allows running R jobs on AWS

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but...

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small...

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small... I/O is a performance killer

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small... I/O is a performance killer NAS / SAN is usually better than simple disk

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small... I/O is a performance killer NAS / SAN is usually better than simple disk Mounting remote file systems e.g. sshfs (degrades performance)

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small... I/O is a performance killer NAS / SAN is usually better than simple disk Mounting remote file systems e.g. sshfs (degrades performance) Distributed file systems (w/o parallel processing) usually cheap, maintenance

The data won t feet in... Divide and Conquer Amdahl s and Gustafson s Law Life experience Where to parallelize? Storage issues Loading the data is the first stage, but... Mapping instead of loading The data is large and the memory is small... I/O is a performance killer NAS / SAN is usually better than simple disk Mounting remote file systems e.g. sshfs (degrades performance) Distributed file systems (w/o parallel processing) usually cheap, maintenance Move the program, not the data!

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Snow and Snowfall MPI tools (beyond the scope of this talk) Snow and Snowfall https://www.stat.berkeley.edu/classes/s244/snowfall.pdf biopara

Clusters and Grids BIG DATA DEFINITION Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Queuing systems: Can run almost any program Very simple to use (for command liners) Grids: Requires authentication issues Huge amount of resources Ubiquitous mostly in the academic environment

Example: condor queuing system Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Executable = / usr / local / bin /R arguments = -- vanilla -- file = path / file.r when_to_transfer_output = ON_EXIT_OR_EVICT Log = eddiea - ed2. log Error = del. eddiea - ed2. err.$( Process ) Output = del. eddiea - ed2. out.$( Process ) Requirements = ARCH ==" X86_ 64 " && \ ( target. FreeMemoryMB > 6000 ) notification = never Universe = VANILLA Queue 3000

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Clouds Use resources as service (Ben-Yehuda et al.) Very cheap for peak usage or low frequency Wrong usage might be very expensive Look for the service you need (it would usually be better and cheaper) In case of problem - Google it.

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Hadoop - overview An open source reliable, scalable, distributed computing system Designed to scale up from single servers to thousands of machines Each offering local computation and storage Designed to detect and handle failures at the application layer or hardware

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Hadoop - components Hadoop Common HDFS YARN, MR Many other tools (incl. R package)

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Using hadoop file system (HDFS) library ( rhdfs ) hdfs. init () hdfs.ls ( / )

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Using hadoop file system (HDFS) library ( rjava ). jinit ( parameters =" - Xmx8g ")

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Writing an R object (1e8) to hdfs > f = hdfs. file ("/ a. hdfs ","w", buffersize =10485760 > system. time ( hdfs. write (a,f, hsync = FALSE )) user system elapsed 7. 392 2. 192 15. 072 > hdfs. close (f) [1] TRUE

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Writing an R object (1e8) to tmpfs > a<- runif (1 e8) > system. time ( save (a, file = a1 )) user system elapsed 99. 648 0. 316 99. 893

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Hadoop - map reduce Splits each process into Map and Reduce map - Classifies the data that would be process together reduce - process the data Hadoop takes care of all the behind the scenes

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR MR parameters library ( rmr2 ) Sys. setenv ( HADOOP_STREAMING =".../ hadoop - streaming Sys. setenv ( HADOOP_HOME =".../ hadoop ") Sys. setenv ( HADOOP_CMD =".../ bin / hadoop ")

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR Simple mapper pi.map.fn <- function (n){ X<- runif (n) Y<- runif (n) S= sum ((X^2+ Y ^2) <1) return ( keyval (1,n/S)) }

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR MR parameters pi. reduce.fn <- function (p){ keyval (4* mean (p)) }

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR The MR process PI= function (n, output = NULL ){ mapreduce ( input =n, output = output, input. format =" text ", map =pi.map.fn, verbose = TRUE ) }

Explicit Parallelism Hadoop Hadoop - HDFS Hadoop - MR From command line hadoop jar { PATH }/ hadoop - streaming -2.5.1. jar -file map.r - mapper map.r -file reduce.r - reducer reduce.r - input input_file - output out } map.r and reduce.r in this case have to be Rscript Files should be located in HDFS

The bottom line BIG DATA DEFINITION Moving to cloud increases agility

The bottom line BIG DATA DEFINITION Moving to cloud increases agility Moving to cloud reduces costs

The bottom line BIG DATA DEFINITION Moving to cloud increases agility Moving to cloud reduces costs Moving to cloud improve operations

The bottom line BIG DATA DEFINITION Moving to cloud increases agility Moving to cloud reduces costs Moving to cloud improve operations Moving to cloud offers new business

The bottom line BIG DATA DEFINITION Moving to cloud increases agility Moving to cloud reduces costs Moving to cloud improve operations Moving to cloud offers new business You can t refuse!

Thank you