Big Data and Parallel Work with R




What We'll Cover
Data Limits in R
Optional data packages
Optional function packages
Going parallel
Deciding what to do

Data Limits in R

Big Data?
What is big data? More and more often, we have GB or TB of data we need to process
We are reaching physical limits for space and time
We need to find solutions

Limits
Space is limited to dozens of GB
Time is limited by human patience
Most installations of R use 32-bit versions of libraries, so they have limits like 2^31 - 1 (about 2 billion) matrix elements
To go beyond this, the end user usually needs to compile both R and the external libraries (such as BLAS) themselves

Solutions
Use more efficient storage schemes to make maximum use of available space
Use incremental algorithms to do calculations on chunks of data
Use parallel machines to divide the data across multiple CPUs or machines

Optional Data Packages

Why?
When dealing with big data sets, we need to work with them efficiently
The problem with regular lists is that every element needs to be checked before being worked with
The simplest move is to use a data.frame
Frames have some restrictions:
Data within a column needs to be all the same type
Rows need to be the same size

Data Frames
read.table returns a data frame
data.frame() allows you to create a data frame from other data structures
Columns need to be named
Rows need to be named; the names can come from one of the columns
read.table can be told to read in the data incrementally
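The incremental reading mentioned above can be sketched by calling read.table repeatedly on an open connection; the function name, chunk size, and demo file below are invented for illustration:

```r
# Incremental reading: read.table on an open connection picks up where
# the previous call stopped, so the whole file never sits in memory at once.
read_in_chunks <- function(path, chunk_size, process) {
  # Read the header once so every chunk gets the same column names
  header <- names(read.table(path, header = TRUE, sep = ",", nrows = 1))
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                      # skip past the header line
  total <- 0
  repeat {
    chunk <- tryCatch(
      read.table(con, header = FALSE, sep = ",",
                 col.names = header, nrows = chunk_size),
      error = function(e) NULL               # read.table errors at end of file
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    process(chunk)                           # work on one chunk at a time
    total <- total + nrow(chunk)
  }
  total
}

# Demo: 25 rows, processed in chunks of 10
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:25, y = (1:25)^2), tmp, row.names = FALSE)
rows_seen <- read_in_chunks(tmp, chunk_size = 10, process = function(ch) NULL)
rows_seen  # 25
```

Each chunk is a normal data frame, so whatever per-chunk work you do (summaries, model updates) uses the usual data.frame tools.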

NetCDF
In scientific computing, data often comes in the NetCDF format
The R packages that handle NetCDF allow files to be opened but not loaded
You can then access the data incrementally without using up all of the available memory
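A sketch of that open-but-don't-load pattern using the ncdf4 package (an assumption; RNetCDF is an alternative interface). A tiny file is created first so the example is self-contained:

```r
# Guarded so the sketch runs even where ncdf4 is not installed
if (requireNamespace("ncdf4", quietly = TRUE)) {
  library(ncdf4)

  # Build a small demo file: one variable along a 12-step time dimension
  path  <- tempfile(fileext = ".nc")
  dim_t <- ncdim_def("time", "days", 1:12)
  var_v <- ncvar_def("temperature", "degC", dim_t)
  nc    <- nc_create(path, var_v)
  ncvar_put(nc, var_v, rnorm(12, mean = 15))
  nc_close(nc)

  # Opening maps the file; no variable data is read into memory yet
  nc <- nc_open(path)
  # Read only an incremental slice (entries 4-6) instead of the whole variable
  slice <- ncvar_get(nc, "temperature", start = 4, count = 3)
  nc_close(nc)
  length(slice)  # 3
}
```

For multi-dimensional variables, start and count take one entry per dimension, with -1 meaning "the full extent", so you can pull one time slab of a large grid at a time.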

Databases
Databases provide a way to store massive amounts of data and pull out selections of it
R packages provide access to all the standard database engines (MySQL, Oracle, Postgres, etc.)
In parallel environments, there are packages for using Hadoop
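A minimal sketch of the pull-out-a-selection idea using DBI with RSQLite (my choice here for a self-contained demo; RMySQL, RPostgres, and the other DBI backends work the same way):

```r
# Guarded so the sketch runs even where RSQLite is not installed
if (requireNamespace("RSQLite", quietly = TRUE)) {
  library(DBI)
  con <- dbConnect(RSQLite::SQLite(), ":memory:")   # throwaway demo database

  # Pretend this table is too big to hold in R all at once
  dbWriteTable(con, "measurements",
               data.frame(id = 1:1000, value = rnorm(1000)))

  # Pull only the selection you need, not the whole table
  small <- dbGetQuery(con,
                      "SELECT id, value FROM measurements WHERE id <= 10")
  dbDisconnect(con)
  nrow(small)  # 10
}
```

The filtering happens inside the database engine, so only the ten matching rows ever cross into R's memory.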

bigmemory
The bigmemory package provides for multi-GB data sets
Shared memory can be used by multiple processes on the same box
File-backed access can be used, which aids multi-machine access
The core object is big.matrix

big.matrix
Implemented in C++
The standard big.matrix uses RAM, and so is limited by it
filebacked.big.matrix uses the hard drive
A big.matrix is handed to functions by reference, not by value, so there may be side effects
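A sketch of a file-backed big.matrix (assumes the bigmemory package is installed; the file names and sizes are made up for the demo):

```r
# Guarded so the sketch runs even where bigmemory is not installed
if (requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)

  dir <- tempdir()
  x <- filebacked.big.matrix(nrow = 1000, ncol = 3, type = "double",
                             backingfile    = "demo.bin",
                             descriptorfile = "demo.desc",
                             backingpath    = dir)
  x[, 1] <- 1:1000          # writes go to the memory-mapped backing file

  # Another R process could attach the same data through the descriptor:
  #   y <- attach.big.matrix(file.path(dir, "demo.desc"))
  # Because big.matrix is passed by reference, y and x share storage.
  sum(x[, 1])  # 500500
}
```

The descriptor file is what makes multi-process (and, on a shared filesystem, multi-machine) access possible: each process maps the same backing file rather than holding its own copy.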

Optional Function Packages

Functions
Dealing with large sets of data requires efficient functions
The simple solution is to use functions like apply
The bigmemory project also offers biglm and biganalytics

lapply
With large data sets, you may need to apply some function to each value in a list
To do this, you can use the function lapply
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE))
# compute the mean of each list element
lapply(x, mean)

biganalytics
The bigmemory project provides functions optimized for use with big.matrix
It adds overloaded versions of the standard descriptive statistics functions (mean, var, etc.)
It also provides an overloaded version of apply
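A short sketch of those overloaded summaries (assumes both bigmemory and biganalytics are installed; the 5x4 demo matrix is arbitrary):

```r
# Guarded so the sketch runs even where the packages are not installed
if (requireNamespace("biganalytics", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  library(biganalytics)

  m <- as.big.matrix(matrix(1:20, nrow = 5))

  # Column-wise summaries computed in C++ without copying the data
  # into a regular R matrix
  means <- colmean(m)   # 3  8 13 18
  maxes <- colmax(m)    # 5 10 15 20

  # The overloaded apply works on big.matrix objects too
  apply(m, 2, sd)
}
```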

biglm
If you are trying to fit a model to a large data set, you can use the biglm package
It introduces a biglm function
You can step through additional chunks of data with the update function
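The fit-then-update loop can be sketched like this (assumes the biglm package; chunking the built-in mtcars data stands in for reading chunks from disk):

```r
# Guarded so the sketch runs even where biglm is not installed
if (requireNamespace("biglm", quietly = TRUE)) {
  library(biglm)

  # Stand-in for chunked reads from a file or database
  chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

  # Fit on the first chunk, then stream the rest through update()
  fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
  for (ch in chunks[-1]) fit <- update(fit, ch)

  coef(fit)  # matches coef(lm(mpg ~ wt + hp, data = mtcars))
}
```

Because biglm keeps only a fixed-size summary of the data (an incremental QR decomposition), memory use stays constant no matter how many chunks you feed it.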

Going Parallel

Parallelization?
For large problems, it may make sense to use multiple CPUs
Traditional methods use packages like multicore, SNOW, Rmpi, etc.
Starting with version 2.14.0, R includes the package parallel, which:
implements multicore
implements SNOW
provides parallel RNGs

Parallel
To use the parallel package: library("parallel")
The multicore part is implemented on most systems using forked processes
Forking is unavailable on Windows, so there these functions run serially
Used by functions like mclapply
The default is to break the list into even chunks
You can break it into smaller chunks to aid load balancing
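A minimal mclapply example (base R; the squaring function and core count are arbitrary):

```r
library(parallel)  # part of base R since 2.14.0

x <- 1:8
# Fork-based parallelism on Unix-alikes; on Windows mclapply only
# supports mc.cores = 1, i.e. serial execution
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(x, function(i) i^2, mc.cores = cores)
unlist(res)  # 1 4 9 16 25 36 49 64
```

Because forked workers share the parent's memory copy-on-write, mclapply needs no explicit data shipping, which is what makes it the lowest-effort entry point to parallel R.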

SNOW
Simple Network Of Workstations
Involves creating a cluster of processes to use
These processes can be on one machine or on many machines networked together
There are multiple steps involved

Clusters
library("parallel")
cl <- makeCluster(size)
parLapply(cl, list, FUN)
stopCluster(cl)

Clusters - 2
If you want to use shared-memory processes, you can use makeForkCluster
This doesn't work on Windows
If you want to use multiple machines, you can use makePSOCKcluster
This uses ssh to reach most machines
On Windows machines, you want to use something like rshcmd = "plink.exe"
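Putting the cluster steps together (base R; the two local workers and the offset variable are just for the demo — add host names to makePSOCKcluster to span machines):

```r
library(parallel)

cl <- makePSOCKcluster(2)   # two local worker processes

# PSOCK workers are fresh R sessions with empty workspaces,
# so any objects they need must be shipped over explicitly
offset <- 100
clusterExport(cl, "offset", envir = environment())

res <- parLapply(cl, 1:4, function(i) i + offset)
stopCluster(cl)             # always shut workers down when done
unlist(res)  # 101 102 103 104
```

This is the main practical difference from mclapply: with socket clusters you pay for explicit setup (clusterExport, clusterEvalQ for packages), but in exchange the same code runs on Windows and across machines.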

foreach
The foreach function makes loops easy to deal with
You can control whether a loop runs serially or in parallel:
foreach(i = 1:3) %do% sqrt(i)
foreach(i = 1:3) %dopar% sqrt(i)
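One detail the snippet above glosses over: %dopar% needs a registered parallel backend, or it falls back to serial execution with a warning. A sketch using doParallel (my choice of backend here; doMC and others exist):

```r
# Guarded so the sketch runs even where doParallel is not installed
if (requireNamespace("doParallel", quietly = TRUE)) {
  library(foreach)
  library(doParallel)

  cl <- makeCluster(2)        # workers for the backend
  registerDoParallel(cl)      # tell foreach where %dopar% should run

  # .combine = c collects the per-iteration results into one vector
  out <- foreach(i = 1:3, .combine = c) %dopar% sqrt(i)

  stopCluster(cl)
  out  # 1.000000 1.414214 1.732051
}
```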

Deciding what to do

Profiling
In order to decide what to optimize, you need measurements
This is called profiling
You can profile time, space, memory usage, function calls, etc.

Time
Usually the first thing to do is measure how long things take
system.time is the easiest way to do this:
system.time(expr, gcFirst = TRUE)
gcFirst = TRUE calls the garbage collector first, so stale allocations don't distort the timing

Memory
You need to see how much space things are taking
In general, you can use memory.profile()
For the size of a specific object, use object.size(obj)
You can force a garbage collection with gc()
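The three memory tools in one base-R snippet (the million-element vector is an arbitrary demo object):

```r
# Per-type allocation counts for the current session
memory.profile()

# Size of one object: a million doubles is ~8 MB (8 bytes each plus header)
x <- numeric(1e6)
print(object.size(x), units = "MB")

# Removing the name alone doesn't reclaim the memory immediately;
# gc() forces the collection and reports current usage
rm(x)
gc()
```

object.size reports only the memory directly attributable to the object, so for structures that share data (such as environments) treat it as an estimate.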

General Profiling
There is a full set of profiling functions:
Rprof(filename = "profile.log", append = FALSE, interval = 0.02, memory.profiling = FALSE)
...
Rprof(NULL)
summaryRprof("profile.log")