Big Data and Scripting: Plotting in R
the art of plotting: first steps
- the fundament of plotting in R: plot(x, y)
- plot some random values: plot(runif(10))
  values are interpreted as y-values, x-values are filled in as 1:10
- plot an n x 2 array of points in a scatterplot: plot(x)
- plot has a humongous number of parameters with strange names:
  pch - change point type (e.g. pch=20 gives filled points)
  cex - change point size
  col - change point color, ...
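A minimal sketch of these basics (the parameter values are only illustrative):

```r
# plot 10 random values; x-values default to 1:10
plot(runif(10))

# the same kind of data as filled, enlarged, red points
plot(runif(10), pch = 20, cex = 2, col = "red")

# an n x 2 matrix is interpreted as (x, y) pairs
m <- cbind(1:10, runif(10))
plot(m, pch = 20)
```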
a simple plotting example
- supply vectors for point-wise settings, example:
  data(iris)                                  # load some flower data
  attach(iris)
  plot(iris, col=Species)                     # plot the whole thing
  # plot specific axes
  plot(Sepal.Length, Sepal.Width, col=Species)
- plots the points of x, colored by species
- use rainbow() to create color palettes
- create individual colors with rgb() or gray()
setting parameters for plotting
- plot() accepts a number of parameters
- even more can be set using par():
  outer margins with mar=c(bottom, left, top, right)
  overplotting with new=TRUE
  plotting to certain areas with fig=c(left, right, lower, upper)
  switch off axes with axes=FALSE and make your own with axis()
- some, but not all, of these parameters can also be passed to plot()
- the return value of par() is a list with the old values of the changed
  parameters; it can be used to reset the parameters to their previous state
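The save-and-restore pattern from the last point can be sketched as follows (the chosen margin and scaling values are arbitrary):

```r
# change parameters; par() returns the previous values of exactly
# the parameters being changed
op <- par(mar = c(2, 2, 1, 1), cex = 0.8)

plot(1:10)   # drawn with the modified settings

par(op)      # restore the previous parameter values
```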
a more complicated plotting example
  data(iris)    # get data
  attach(iris)  # attach for easy access
  # plot petal width x length
  plot(Petal.Length, Petal.Width, col=Species, pch=20)
  # make a small box on top with sepal values
  op=par(fig=c(0.6,0.9,0.18,0.48), new=TRUE, mar=c(1,1,1,0)+0.1, cex=0.8)
  plot(Sepal.Length, Sepal.Width, col=Species, pch=20, axes=FALSE,
       main="sepal extensions")
  box()         # draw a box around the small plot
  par(op)       # reset parameters
  detach(iris)
the resulting plot
  [figure: Petal.Width vs. Petal.Length scatterplot with an inset
   "sepal extensions" plot of Sepal.Width vs. Sepal.Length]
specialized plot functions
- many packages provide specialized plot functions for their results
- example:
  library(igraph)
  g=graph.star(15)
  plot(g)
  [figure: star graph with 15 numbered vertices]
- this uses the overriding mechanism for functions, called dispatch
- not covered here; see
  stat.ethz.ch/r-manual/r-devel/library/methods/html/Methods.html
  for detailed information
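The dispatch mechanism behind such specialized plot functions can be illustrated with a small S3 sketch (the class name `mydata` is made up for this example):

```r
# a generic like plot() picks the method based on the class of its argument
d <- structure(list(x = 1:5, y = c(2, 4, 3, 5, 1)), class = "mydata")

# defining a function named plot.mydata makes plot(d) dispatch to it
plot.mydata <- function(x, ...) {
  plot(x$x, x$y, type = "b", pch = 20, ...)
}

plot(d)   # calls plot.mydata
```

This is exactly how igraph makes `plot(g)` draw a graph instead of a scatterplot.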
plotting to files
- plotting to files is simple with file devices, example:
  pdf("plot.pdf")  # open plot.pdf in the current dir
  plot(1:5)        # plot something
  dev.off()        # close device (and write the file)
- devices can be opened explicitly, e.g. x11() opens a plotting window
- there is usually a currently active device; if not, a plot window is created
- dev.off() closes the active device:
  writes files to disk (for file devices)
  if possible, switches to the previously active device
- variants: x11(), pdf(), svg(), jpeg(), ...
- besides the file name, each format has individual parameters
  (e.g. size for pdf, resolution for jpeg)
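A sketch of the format-specific parameters mentioned above (file names and sizes are only examples):

```r
# pdf() takes its size in inches
pdf("scatter.pdf", width = 6, height = 4)
plot(runif(20), pch = 20)
dev.off()   # write scatter.pdf

# jpeg() takes pixel dimensions and a resolution in dpi
jpeg("scatter.jpg", width = 800, height = 600, res = 120)
plot(runif(20), pch = 20)
dev.off()   # write scatter.jpg
```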
example: visualizing a distribution of networks
  [figure: grid of network plots]
example: visualizing spectral distributions
  [figure: density plot of spectral distributions]
some useful plotting functions
- barplot() creates a bar plot
- hist() creates a histogram of values and plots it
- points() adds additional points to an existing plot
- lines() draws lines connecting the given points
- grconvertX(), grconvertY() convert between coordinate systems
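A short sketch combining some of these functions (the data is random):

```r
x <- rnorm(100)

# histogram scaled to densities instead of counts
hist(x, freq = FALSE)

# overlay the raw values as ticks at the bottom and a density curve
points(x, rep(0, length(x)), pch = "|")
lines(density(x), col = "red")
```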
Parallel Programming on a Multi-CPU System
basic questions about the machine model
- execute algorithms on multiple CPUs (cores)
- CPUs load data from memory into their registers, compute something,
  and write the results back to memory
1. do all cores have access to the same memory?
   yes: (following) PRAM model (parallel random access machine)
   no: (later) distributed computing
2. concurrent access (reading/writing in parallel)?
   parallel reading: exclusive or concurrent
   parallel writing: if concurrent, which value persists?
   four different variants
- in the following, we allow concurrent reading and avoid concurrent writing
an example algorithm: summation
- problem: given an array of numbers A[1], ..., A[n],
  determine the sum over all A[i]
- straightforward without parallel execution: O(n)
- is a speed-up with more cores possible?
parallel summation: idea
- partition into the smallest possible subproblems
- solve these in parallel
- combine the results, again in parallel
- continue until all values are combined
algorithm
  input: array A        # assume length(A) = n = 2^h
  B=A                   # B holds the results on the current level
  while(length(B)>1){   # while intermediate results have to be combined
    T=array(length(B)/2)
    parallel for(i in 1:length(T)){   # execute in parallel
      T[i]=B[2*i]+B[2*i-1]            # solve subproblem
    }
    B=T                 # advance to the next level
  }
  return(B[1])
- assumptions/preconditions:
  the length of A is a power of two (if not, pad with zeros)
  the +-operation is associative, i.e. (a + b) + c = a + (b + c)
- the approach works for every associative operation
analysis
- memory: need an additional array for the current level
- number of operations (length(A) = n = 2^h):
  2^(h-1) + 2^(h-2) + ... + 2^0 = 2^h - 1, i.e. O(n)
  no gain in comparison to the sequential approach
- execution time on n/2 cores:
  let one +-operation take O(f(n)) time and length(A)=n
  assume copying B=A and B=T is done in parallel, too
  the inner for-loop is executed in parallel: time O(f(n)), O(1) for +
  the outer while-loop iterates over the levels of a binary tree:
  log2(n) levels
  total time consumption: O(f(n) log2(n)), for + this is O(log2(n))
- note the difference between the number of operations and the execution time
execution of n parallel processes on c cores
- our analysis assumed that n/2 cores are available
- that's usually an unrealistic assumption
- instead: distribute the parallel processes to as many cores as possible
- example of simple parallel execution on a limited number of cores
  input: array of tasks: jobs, number of cores: cores
  executeparallel=function(jobs, cores){
    i=1
    while(i<=length(jobs)){
      parallel for(j in i:min(i+cores-1, length(jobs))){
        start(jobs[j])
      }
      i=i+cores
    }
  }
- parallel for executes all its iterations in parallel
a more flexible parallelization approach (idea)
- assume operations depend on intermediate results created by other
  operations
- no simple systematic structure, but the more general case
- e.g. (in a dependency graph) job 2 depends on input from jobs 3 and 1;
  job 8 can be executed when job 7 is finished, while job 4 has, in
  addition, to wait for jobs 5 and 2
a more flexible parallelization approach (idea)
- several possible execution orders
- the optimal order depends on the execution times
- simple strategy:
  1. keep a list of unoccupied cores
  2. keep a list of unfinished jobs, each with its number of unfinished
     dependencies
  3. start unfinished jobs with no unfinished dependencies until all
     cores are occupied
  4. when a job finishes: decrease the number of unfinished dependencies
     of all depending jobs
  5. if not all jobs are finished, repeat from 3
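The strategy above can be sketched as a sequential simulation in R; real jobs would run asynchronously, and the example dependency list is made up. `deps[[i]]` lists the jobs that job i depends on:

```r
schedule <- function(deps, cores) {
  n <- length(deps)
  open <- sapply(deps, length)   # unfinished dependencies per job
  done <- rep(FALSE, n)
  rounds <- list()
  while (any(!done)) {
    # ready jobs: not done and no unfinished dependencies (step 3)
    ready <- which(!done & open == 0)
    batch <- head(ready, cores)  # occupy at most 'cores' cores
    done[batch] <- TRUE
    rounds <- c(rounds, list(batch))
    # finished jobs decrease the counts of depending jobs (step 4)
    for (j in seq_len(n))
      open[j] <- open[j] - sum(deps[[j]] %in% batch)
  }
  rounds
}

# job 2 depends on 1 and 3, job 4 on 2 and 5, job 8 on 7
deps <- list(c(), c(1, 3), c(), c(2, 5), c(), c(), c(), c(7))
schedule(deps, cores = 2)   # one batch of job indices per round
```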
intermission: mapply
- a new apply variant (that's number 5): mapply(fun, ...)
- the first argument is the function to apply
- the following arguments are vectors or lists to apply fun to
- calls fun for element i of each of the following lists
- if the arguments are named, fun is called with named arguments
  > fun=function(a,b){paste(a,b,sep="-")}
  > mapply(fun, b=1:6, a=3:1)
  [1] "3-1" "2-2" "1-3" "3-4" "2-5" "1-6"
- naming the arguments makes their order irrelevant
- shorter vectors are recycled
- result: the return values of fun (simplified to a vector if possible)
parallelization in R
- the library parallel provides functions for parallel computations
- in particular:
  mcmapply() - parallel mapply()
  mclapply() - parallel lapply()
  both execute functions for list elements in parallel
- important parameters:
  mc.cores - the max. number of CPU cores to use
  mc.preschedule - decide the job-to-core distribution at the start or
  dynamically:
    TRUE for many small and/or equal-length jobs
    FALSE if jobs vary strongly in execution time
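A minimal sketch of these parameters (the job function is artificial; note that the mc.* parameters rely on forking, so on Windows mc.cores is limited to 1):

```r
library(parallel)

slow_square <- function(x) { Sys.sleep(0.1); x^2 }

# run the function on the list elements with up to 4 cores;
# mc.preschedule=TRUE suits these small, equal-length jobs
res <- mclapply(1:8, slow_square, mc.cores = 4, mc.preschedule = TRUE)
unlist(res)   # 1 4 9 16 25 36 49 64
```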
parallelize associative functions as R code
  parallelaccumulate=function(f, a){
    require(parallel)
    b=a
    while(length(b)>1){
      b=mclapply(1:(length(b)/2),
                 function(i) f(b[[2*i]], b[[2*i-1]]))
    }
    return(b[[1]])
  }
- execution:
  plus=function(a,b){a+b}
  parallelaccumulate(plus, 1:64)
- simple, but not very generic
parallelization of a function: a function in R
  parallelize=function(f){
    par=function(a){
      require(parallel)
      b=a
      while(length(b)>1){
        b=mclapply(1:(length(b)/2),
                   function(i) f(b[[2*i]], b[[2*i-1]]))
      }
      return(b[[1]])
    }
    return(par)
  }
- execution:
  plus=function(a,b){a+b}
  psum=parallelize(plus)
  psum(1:64)