S3IT: Service and Support for Science IT. Scaling R on cloud infrastructure. Sergio Maffioletti, IS/Cloud, S3IT




S3IT: Service and Support for Science IT. Scaling R on cloud infrastructure. Sergio Maffioletti, IS/Cloud. Zurich, 19.03.2015

Who am I? Sergio Maffioletti: cloud and application specialist; head of the IS/Cloud Services unit; head of S3IT user support.

What is S3IT? Connecting IT and science: the Science IT support unit of Zentrale Informatik, offering dedicated support for computation and data analysis. SPEED: faster time to solution. ACCESS: to competitive infrastructure. ENABLE: remove barriers, open new possibilities.

Disclaimer: What is presented here is *not* an authoritative analysis of how R can be scaled. It is simply the result of a few years of experience supporting and helping research groups improve R performance.

What are we going to talk about today? How to scale your R code on a cloud infrastructure: 1. What is a cloud infrastructure 2. What problem are we trying to address 3. Possible scenarios for scaling R

What is a cloud infrastructure? An infrastructure cloud service offers compute, storage and network resources on top of a virtualisation infrastructure:
- Virtual machines (VMs), virtual storage block devices, virtual private networks
- Self-provisioning and elasticity of resources: end users can allocate and release resources when needed
- Customization and control of the environment: end users can tailor the research infrastructure to their specific needs
- Multi-tenancy

What problem are we trying to address? You normally run your R scripts on your local workstation. At best: 8 cores, 16 GB RAM, a 100-200 GB SSD drive. This configuration is quite sufficient for most small- to medium-scale analysis problems.

How does your R script leverage the infrastructure? R itself is single-threaded, but there are several packages for parallel computation:
- mclapply: a parallelized version of lapply. Single node, multi-core; relatively easy to use within your script; scaling is limited to what a single node offers.
- parallel: built on multicore and snow. Can use the CPUs/cores of a single machine (multicore) or of several machines via MPI (snow); the cluster of resources must be prepared before initialization; load balancing must be applied to distribute tasks; not widely used in the community.

mclapply example:

    library(parallel)  # mclapply is provided by the parallel package
    workerfunc <- function(n) return(n^2)
    values <- 1:100
    numworkers <- 8
    res <- mclapply(values, workerfunc, mc.cores = numworkers)
    print(unlist(res))

Using the parallel package with a snow-style MPI cluster:

    library(parallel)
    workerfunc <- function(n) return(n^2)
    values <- 1:100
    numworkers <- 8
    cl <- makeCluster(numworkers, type = "MPI")  # requires Rmpi/snow
    res <- parLapply(cl, values, workerfunc)
    stopCluster(cl)
    Rmpi::mpi.exit()
    print(unlist(res))

Running: mpirun -n 1 R --slave -f simple_mpi.r
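parLapply hands each worker a fixed block of tasks up front, so one slow block can hold everything back. When task runtimes vary, the load-balanced variant parLapplyLB from the same parallel package assigns tasks one at a time as workers become free. A minimal sketch, using a portable PSOCK cluster so it runs without MPI:

```r
library(parallel)

workerfunc <- function(n) {
  # simulate uneven per-task runtimes
  Sys.sleep(runif(1, 0, 0.01))
  n^2
}

numworkers <- 2
cl <- makeCluster(numworkers, type = "PSOCK")
# parLapplyLB dispatches tasks dynamically, so slow tasks
# do not stall an entire pre-assigned block of work
res <- parLapplyLB(cl, 1:100, workerfunc)
stopCluster(cl)
print(unlist(res))
```

The worker function and cluster size here are illustrative; in practice you would replace workerfunc with your real per-item computation.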

Common limitation: all of these assume the infrastructure is already available. On a cloud infrastructure, provisioning (and releasing) resources is a necessary extra step.

How do you scale your R code on a cloud infrastructure? Scale-up: bigger workstation, RStudio. Scale-out.

Scale-up: Bigger workstation. Drawbacks: cannot scale indefinitely (the limit is the node size); higher specs for the node mean higher costs.

Scale-up: RStudio. A web-based interface for running R scripts; the same single point of access (for a group or an individual user). Drawback: still runs on a single node.

Scale-out: GC3Pie. Scaling outside of your R script: divide a large execution into smaller chunks; run those chunks independently of each other; provision the infrastructure based on the script's requirements.
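The "divide into smaller chunks" step can be done in plain R before handing the chunks to the infrastructure. A minimal sketch, assuming a CSV input; the function name, file-naming scheme and default chunk size are illustrative, not GC3Pie's actual conventions:

```r
# Split a large input CSV into fixed-size chunk files that can be
# processed independently (names and chunk size are illustrative).
split_csv <- function(infile, outdir, chunk_size = 1000) {
  dir.create(outdir, showWarnings = FALSE)
  data <- read.csv(infile, stringsAsFactors = FALSE)
  # assign every row to a chunk: rows 1..chunk_size -> 1, etc.
  chunk_ids <- ceiling(seq_len(nrow(data)) / chunk_size)
  for (i in unique(chunk_ids)) {
    outfile <- file.path(outdir, sprintf("chunk_%04d.csv", i))
    write.csv(data[chunk_ids == i, , drop = FALSE], outfile,
              row.names = FALSE)
  }
  length(unique(chunk_ids))  # number of chunk files written
}
```

For example, a 3500-row input with the default chunk size of 1000 yields four chunk files.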

Let's see an example: gweight. A use case from the Business department: run the GetWeight function over 3.5M forum entries. Each GetWeight call takes about 2 minutes, which is still roughly 75 days on a 64-core node. Each item in the forum can be processed independently of the others.

gweight workflow:
- Take the initial 3.5M-entry input (in .csv format)
- Create smaller .csv files of a given chunk size (default 1000)
- For each chunk file, create a dedicated VM
- On the VM, run the GetWeight function on that specific .csv chunk file
- Terminate the VM when no more .csv files have to be processed
- Aggregate the results at the end into a single large .csv result file
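The final aggregation step can likewise be a few lines of R. This sketch assumes each VM wrote its output to a result_*.csv file with a common header; that naming scheme and the function name are assumptions for illustration:

```r
# Aggregate per-chunk result files back into one CSV
# (the result_*.csv naming scheme is an assumption).
aggregate_results <- function(resultdir, outfile) {
  files <- sort(list.files(resultdir, pattern = "^result_.*\\.csv$",
                           full.names = TRUE))
  # read every chunk result and stack the rows into one data frame
  parts <- lapply(files, read.csv, stringsAsFactors = FALSE)
  merged <- do.call(rbind, parts)
  write.csv(merged, outfile, row.names = FALSE)
  nrow(merged)  # total number of aggregated rows
}
```

Sorting the file names keeps the output order deterministic even though the chunks finish in arbitrary order.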

Let's see how it works: https://youtu.be/8hn-4qxtahe

Conclusions: Scaling R effectively on cloud infrastructure requires knowing how to split your computation. S3IT provides support and tools (e.g. GC3Pie). Get in touch with us: help@s3it.uzh.ch. Visit our website: www.s3it.uzh.ch