Big Data and Scripting. Plotting in R




1, Big Data and Scripting: Plotting in R

2, the art of plotting: first steps
the foundation of plotting in R: plot(x,y)
plot some random values: plot(runif(10))
  values are interpreted as y-values, x-values are filled in as 1:10
plot an n x 2 array of points as a scatterplot: plot(x)
plot has a huge number of parameters with strange names
  pch - change point type (e.g. pch=20 gives small filled circles)
  cex - change point size
  col - change point color, ...
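A minimal sketch of the calls named above. The seed, the colors and the variable names y and xy are assumptions chosen for illustration only.

```r
# Ten uniform random values; the x-values default to 1:10.
set.seed(1)                      # assumed seed, only for reproducibility
y <- runif(10)
plot(y, pch = 20, cex = 1.5, col = "blue")

# An n x 2 matrix is drawn as a scatterplot of column 1 against column 2.
xy <- cbind(x = 1:10, y = y)
plot(xy, pch = 20, col = "red")
```

Changing pch, cex and col one at a time is a quick way to see what each parameter does.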

3, a simple plotting example
supply vectors for point-wise settings
example:
    data(iris)                 # load some flower data
    attach(iris)
    plot(iris, col=Species)    # plot the whole thing
    # plot specific axes
    plot(Sepal.Length, Sepal.Width, col=Species)
points are colored by species
use rainbow() to create colors
create individual colors with rgb() or gray()
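One way to build point-wise colors with rainbow(), rgb() and gray() might look as follows. The names palette3, point_col and custom are hypothetical, and the columns are accessed with iris$ instead of attach() to keep the sketch self-contained.

```r
data(iris)
# One color per species level, generated with rainbow().
palette3 <- rainbow(nlevels(iris$Species))
point_col <- palette3[as.integer(iris$Species)]   # point-wise color vector
plot(iris$Sepal.Length, iris$Sepal.Width, col = point_col, pch = 20)

# Individual colors: rgb() takes red/green/blue in [0, 1], gray() a gray level.
custom <- c(rgb(1, 0, 0), rgb(0, 0.5, 0), gray(0.4))
plot(iris$Petal.Length, iris$Petal.Width,
     col = custom[as.integer(iris$Species)], pch = 20)
```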

4, setting parameters for plotting
plot() accepts a number of parameters
even more can be set using par()
  margins with mar=c(bottom, left, top, right)
  overplotting with new=TRUE
  plot to certain areas with fig=c(left, right, lower, upper)
switch off axes by passing axes=FALSE to plot(), make your own with axis()
some, but not all, of the par() parameters can be passed to plot()
the return value of par() is a list with the old values of the changed parameters
  it can be used to reset parameters to their previous state
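The save-and-restore pattern for par() can be sketched like this. The margin values and the names before and old are assumptions for illustration.

```r
before <- par("mar")                        # remember the margins for comparison
old <- par(mar = c(4, 4, 1, 1), cex = 0.8)  # returns the previous values
plot(1:10, axes = FALSE)                    # axes is an argument of plot()
axis(1, at = c(1, 5, 10))                   # custom x-axis
axis(2)                                     # default y-axis
box()
par(old)                                    # restore the previous settings
```

Because par() only returns the parameters that were changed, par(old) resets exactly those and leaves everything else alone.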

5, a more complicated plotting example
    data(iris)    # get data
    attach(iris)  # attach for easy access
    # plot petal length against petal width
    plot(Petal.Length, Petal.Width, col=Species, pch=20)
    # make a small box on top with sepal values
    old=par(fig=c(0.6,0.9,0.18,0.48), new=TRUE, mar=c(1,1,1,0)+0.1, cex=0.8)
    plot(Sepal.Length, Sepal.Width, col=Species, pch=20, axes=FALSE, main="sepal extensions")
    box()         # make a box around the small plot
    par(old)      # reset parameters
    detach(iris)

6, the resulting plot
[figure: scatterplot of Petal.Width against Petal.Length with a small inset plot titled "sepal extensions" (Sepal.Width against Sepal.Length)]

7, specialized plot functions
many packages provide specialized plot functions for their results
example:
    library(igraph)
    g=graph.star(15)
    plot(g)
[figure: star graph with 15 vertices]
this uses the overriding mechanism for functions, called dispatch
not covered here, see stat.ethz.ch/r-manual/r-devel/library/methods/html/Methods.html for detailed information

8, plotting to files
plotting to files is simple with file devices, example:
    pdf("plot.pdf")   # open plot.pdf in current dir
    plot(1:5)         # plot something
    dev.off()         # close device (and write file)
devices can be opened explicitly, e.g. x11() opens a plotting window
there is usually a currently active device
  if not, a plot window is created
dev.off() closes the active device
  writes files to disk (for file devices)
  if possible, switches to the previously active device
variants: x11(), pdf(), svg(), jpeg(), ...
  besides file, each format has individual parameters (e.g. size for pdf, resolution for jpeg)
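A small sketch of the open, plot, close cycle with an explicit size. The file name is hypothetical, and the file goes to a temporary directory so nothing in the working directory is overwritten.

```r
out <- file.path(tempdir(), "example-plot.pdf")  # hypothetical file name
pdf(out, width = 6, height = 4)   # size in inches
plot(1:5, pch = 20)
dev.off()                         # closes the device and writes the file
file.exists(out)
```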

9, example: visualizing a distribution of networks
[figure omitted]

10, example: visualizing spectral distributions
[figure omitted; axis label: density]

11, some useful plotting functions
barplot() create a bar plot
hist() create a histogram of values and plot it
points() add additional points
lines() create lines connecting the given points
grconvertX(), grconvertY() convert between coordinate systems
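The functions listed above can be combined on one small data set, as in the following sketch. The random data, the seed and the bin boundaries are assumptions.

```r
set.seed(42)                        # assumed seed, only for reproducibility
v <- rnorm(200)
h <- hist(v, breaks = 20)           # computes the histogram and plots it
lines(h$mids, h$counts)             # connect the bar midpoints
points(h$mids, h$counts, pch = 20)  # add points on top of the line

# barplot() draws one bar per entry of a table of counts.
barplot(table(cut(v, breaks = c(-Inf, -1, 0, 1, Inf))))
```

hist() returns its bins invisibly, which is why h$mids and h$counts can be reused by lines() and points().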

12, Parallel Programming on a Multi-CPU System

13, basic questions about the machine model
execute algorithms on multiple CPUs (cores)
CPUs load data from memory into their registers, compute something and write the results back to memory
1. do all cores have access to the same memory?
  yes: (following) PRAM model (parallel random access machine)
  no: (later) distributed computing
2. concurrent access (reading/writing in parallel)?
  parallel reading: exclusive or concurrent
  parallel writing: if concurrent, which value stays?
  four different variants
in the following, we allow concurrent reading and avoid concurrent writing

14, an example algorithm: summation
problem
  given an array of numbers A[1],...,A[n]
  determine the sum over all A[i]
straightforward without parallel execution: O(n)
is a speed-up with more cores possible?

15, parallel summation: idea
partition into smallest possible subproblems
solve these in parallel
combine the results, again in parallel
continue until all values are combined

16, algorithm
    input: array A   # assume length(A) = n = 2^h
    B=A              // B holds results on current level
    while(length(B)>1){   // while intermediate results have to be combined
      T=array(length(B)/2)
      parallel for(i in 1:length(T)){   // execute in parallel
        T[i]=B[2*i]+B[2*i-1]            // solve subproblem
      }
      B=T            // advance to next level
    }
    return(B[1])
assumptions/preconditions:
  length of A is a power of two (if not, pad with zeros)
  the + operation is associative, i.e. (a + b) + c = a + (b + c)
  the approach works for every associative operation
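The level-by-level combination can be tried out sequentially in R. In this sketch sapply() merely stands in for the "parallel for", so nothing actually runs in parallel. The function name tree_reduce and its arguments op and neutral are hypothetical, and a non-empty input is assumed.

```r
# Pairwise (tree) reduction; sapply stands in for the "parallel for".
tree_reduce <- function(A, op = `+`, neutral = 0) {
  n <- length(A)                       # assumed to be at least 1
  padded <- 2^ceiling(log2(n))         # next power of two
  B <- c(A, rep(neutral, padded - n))  # pad with the neutral element
  while (length(B) > 1) {
    half <- length(B) / 2
    # one level of the tree: combine neighbouring pairs
    B <- sapply(seq_len(half), function(i) op(B[2 * i - 1], B[2 * i]))
  }
  B[1]
}

tree_reduce(1:10)                      # sum, padded with zeros
tree_reduce(c(3, 9, 2), max, -Inf)     # maximum, padded with -Inf
```

For operations other than +, the padding value has to be the neutral element of that operation, which is why it is a parameter here.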

17, analysis
memory: need an additional array for the current level
number of operations (length(A) = n = 2^h): $2^{h-1} + 2^{h-2} + \dots + 2^0 = 2^h - 1 \in O(n)$
  no gain in comparison to the sequential approach
execution time on n/2 cores
  let one + operation take O(f(n)) time and length(A) = n
  assume copying B=A and B=T is done in parallel, too
  the inner for-loop is executed in parallel: O(1) steps per level
  the outer while-loop iterates over the levels of a binary tree: $\log_2 n$ levels
  total time consumption: $O(f(n) \log_2 n)$, for +: $O(\log_2 n)$
note the difference between number of operations and execution time
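The two quantities in the analysis can be checked numerically. The following sketch counts operations and levels for an input length that is assumed to be a power of two. The function name count_work is hypothetical.

```r
count_work <- function(n) {        # n is assumed to be a power of two
  ops <- 0
  levels <- 0
  len <- n
  while (len > 1) {
    len <- len / 2                 # one level halves the array
    ops <- ops + len               # len operations on this level
    levels <- levels + 1
  }
  c(operations = ops, levels = levels)
}

count_work(64)                     # 63 operations on 6 levels
```

The number of operations grows like n, while the number of levels, and with it the parallel execution time, grows like log2(n).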

18, execution of n parallel processes on c cores
our analysis assumed that there are n/2 cores available
  that's usually an unrealistic assumption
instead: distribute parallel processes to as many cores as possible
example for simple parallel execution on a limited number of cores
input: array of tasks: jobs, number of cores: cores
    executeparallel=function(jobs, cores){
      i=1;
      while(i<=length(jobs)){
        parallel for(j in i:min(i+cores-1, length(jobs))){
          start(jobs[j]);
        }
        i=i+cores;
      }
    }
parallel for executes all iterations in parallel
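A runnable approximation of this batching scheme, assuming each job is a function without arguments and using mclapply() from the parallel package in place of the "parallel for". The name run_in_batches is hypothetical. Since mc.cores > 1 is not supported on Windows, the sketch falls back to one core there.

```r
library(parallel)

run_in_batches <- function(jobs, cores) {
  # forking is unavailable on Windows, so use a single core there
  use_cores <- if (.Platform$OS.type == "windows") 1 else cores
  results <- vector("list", length(jobs))
  i <- 1
  while (i <= length(jobs)) {
    idx <- i:min(i + cores - 1, length(jobs))  # at most `cores` jobs per batch
    results[idx] <- mclapply(jobs[idx], function(job) job(),
                             mc.cores = use_cores)
    i <- i + cores
  }
  results
}

jobs <- lapply(1:5, function(k) function() k^2)  # five toy jobs
unlist(run_in_batches(jobs, cores = 2))
```

Each batch waits for its slowest job before the next one starts, so cores can sit idle when job lengths differ.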

19, a more flexible parallelization approach (idea)
assume operations depend on intermediate results created by other operations
no simple systematic structure, but the more general case
e.g. 2 depends on input from 3 and 1
8 can be executed when 7 is finished, while 4 additionally has to wait for 5 and 2

19, a more flexible parallelization approach (idea)
several possible execution orders
the optimal order depends on execution times
simple strategy:
1. list of unoccupied cores
2. list of unfinished jobs, with number of unfinished dependencies
3. start unfinished jobs with no unfinished dependencies until all cores are occupied
4. when a job finishes: decrease the number of unfinished dependencies on depending jobs
5. if not finished, repeat from 3
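The strategy can be simulated without any real parallelism. The sketch below makes two simplifying assumptions that are not in the slides: every job takes one time unit, and jobs are started in rounds rather than whenever a core becomes free. The dependency graph deps and the function name schedule are hypothetical.

```r
# deps[[j]] lists the jobs that job j waits for (hypothetical example graph).
deps <- list(`1` = character(0), `3` = character(0), `2` = c("1", "3"),
             `5` = character(0), `4` = c("5", "2"))

schedule <- function(deps, cores) {
  remaining <- lengths(deps)            # unfinished dependencies per job
  done <- character(0)
  rounds <- list()
  while (length(done) < length(deps)) {
    ready <- setdiff(names(remaining)[remaining == 0], done)
    if (length(ready) == 0) stop("cyclic dependencies")
    batch <- head(ready, cores)         # occupy at most `cores` cores
    rounds[[length(rounds) + 1]] <- batch
    done <- c(done, batch)
    for (j in names(deps)) {            # step 4: decrease the counters
      remaining[j] <- remaining[j] - sum(deps[[j]] %in% batch)
    }
  }
  rounds
}

schedule(deps, cores = 2)               # three rounds: (1, 3), (2, 5), (4)
```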

20, intermission: mapply
new apply variant (that's number 5)
mapply(fun,...)
  first argument is the function to apply
  following arguments are vectors or lists to apply fun to
  calls fun for element i in all following lists
  if arguments are named, fun is called with named arguments
    > fun=function(a,b){paste(a,b,sep="-")}
    > mapply(fun,b=1:6,a=3:1);
    [1] "3-1" "2-2" "1-3" "3-4" "2-5" "1-6"
naming of arguments makes order irrelevant
shorter vectors are reused
result: list of return values of fun (simplified to a vector where possible, as in the output above)
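Two further mapply() arguments are often useful. MoreArgs passes values that are not iterated over, and SIMPLIFY = FALSE keeps the result as a list. The sketch below uses them on toy inputs. The names r1 and res are hypothetical.

```r
fun <- function(a, b, sep) paste(a, b, sep = sep)

# MoreArgs holds arguments that are the same for every call.
r1 <- mapply(fun, a = 3:1, b = 1:6, MoreArgs = list(sep = "-"))
r1

# SIMPLIFY = FALSE always returns a list instead of a vector.
res <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
res
```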

21, parallelization in R
library parallel provides functions for parallel computations
in particular:
  mcmapply() - parallel mapply()
  mclapply() - parallel lapply()
execute functions for list elements in parallel
important parameters:
  mc.cores - the max. number of CPU cores to use
  mc.preschedule - decide job-to-core distribution at start or dynamically
    TRUE for many small and/or equal-length jobs
    FALSE if jobs vary strongly in execution time
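A brief usage sketch for both functions and both parameters. The core count of 2 and the sleep times are assumptions. Because these functions rely on forking, mc.cores > 1 is not supported on Windows, and the sketch falls back to one core there.

```r
library(parallel)

# forking is unavailable on Windows, so use a single core there
cores <- if (.Platform$OS.type == "windows") 1 else 2

squares <- mclapply(1:8, function(i) i^2, mc.cores = cores)
sums <- mcmapply(function(a, b) a + b, 1:4, 5:8, mc.cores = cores)

# Jobs of very different length: distribute them dynamically.
uneven <- mclapply(c(0.2, 0.01, 0.01, 0.01),
                   function(s) { Sys.sleep(s); s },
                   mc.cores = cores, mc.preschedule = FALSE)
```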

22, parallelize associative functions as R code
    parallelaccumulate=function(f,a){
      require(parallel);
      b=a;
      while(length(b)>1){
        b=mclapply(1:(length(b)/2),
                   function(i) return(f(b[[2*i]],b[[2*i-1]])));
      }
      return(b[[1]]);
    }
execution:
    plus=function(a,b) {a+b};
    parallelaccumulate(plus,1:64);
simple, but not very generic

23, parallelization of a function: function in R
    parallelize=function(f){
      par=function(a){
        require(parallel);
        b=a;
        while(length(b)>1){
          b=mclapply(1:(length(b)/2),
                     function(i) return(f(b[[2*i]],b[[2*i-1]])));
        }
        return(b[[1]]);
      }
      return(par);
    }
execution:
    plus=function(a,b) {a+b};
    psum=parallelize(plus);
    psum(1:64);
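The code above relies on the input length being a power of two. A possible variant for arbitrary non-empty inputs carries an unpaired last element over to the next level instead of padding. It also passes the left element first, so that associative but non-commutative operations such as string concatenation keep their order. The name parallel_reduce and the cores argument are hypothetical and not part of the slides.

```r
library(parallel)

# Hypothetical variant that accepts any length >= 1 by carrying an
# unpaired last element over to the next level instead of padding.
parallel_reduce <- function(f, a, cores = 1) {
  b <- as.list(a)
  while (length(b) > 1) {
    half <- length(b) %/% 2
    nxt <- mclapply(seq_len(half),
                    function(i) f(b[[2 * i - 1]], b[[2 * i]]),
                    mc.cores = cores)
    if (length(b) %% 2 == 1) nxt <- c(nxt, b[length(b)])  # carry the leftover
    b <- nxt
  }
  b[[1]]
}

parallel_reduce(`+`, 1:10)             # 55
parallel_reduce(paste0, letters[1:5])  # "abcde"
```

With the default cores = 1 the sketch runs sequentially on every platform. Larger values need a system that supports forking.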