Introduction to Big Data Analysis with R


Yung-Hsiang Huang, National Center for High-performance Computing, Taiwan
2014/12/01

Agenda
  Big Data, Big Challenge
  Introduction to R
  Some R Packages to Deal With Big Data
    Several useful R packages
    Hadoop vs. R
2014/12/01 SEAIP 2014: Intro R & Big Data (Sean)

Big Data
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

Big Data (cont.)
Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.
The challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy violations.
The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."

Volume, Velocity, Variety
Big data is usually characterized along three dimensions: volume, velocity, and variety.
  Volume: machine-generated data is produced in larger quantities than traditional data.
  Velocity: the speed of data processing.
  Variety: the large variety of input data, which in turn generates a large amount of output data.
Some extend this to 4 Vs or 5 Vs:
  Value: how to generate maximum value.
  Veracity: the uncertainty of data.

What is R?
The most powerful and most widely used statistical software
https://www.youtube.com/watch?v=tr2bhsj_eck

What is R?
R is a comprehensive statistical and graphical programming language and is a dialect of the S language:
  S2: RA Becker, JM Chambers, A Wilks
  S3: JM Chambers, TJ Hastie
  S4: JM Chambers
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
Since 1997: an international R core team of 15 people with access to a common CVS archive.

What is R?
The R statistical programming language is a free, open source package based on the S language developed by Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to cutting-edge research.
Since it is a programming language, you have to write code to complete tasks.

R Project
Open source: R is a free software environment for statistical computing and graphics.
Offers tools to manage and analyze data.
Standard and many more statistical methods are implemented.
Support via the R mailing lists by members of the core team: R-announce, R-packages, R-help, R-devel, ...
Support via several manuals and books.

R Project
Huge online libraries with R packages:
  CRAN
  BioConductor (for genomic data)
  Omegahat
  R-Forge
Possibility to write personalized code and to contribute new packages.
The New York Times (Jan, 2009), "Data Analysts Captivated by R's Power"

R vs. SAS vs.
R is open source software; SAS is a commercial product. R is free and available to everyone.
R code is open source and can be modified by everyone.
R is a complete, self-contained programming language.
R has a big and active community.
[Figure: number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS. Sources:]

R Books
  Introduction to Scientific Programming and Simulation Using R
  Statistical Computing with R
  R Programming for Bioinformatics
  R for Business Analytics
  A Handbook of Statistical Analyses Using R
  Introductory Statistics with R, 2nd edition

Installing, Running, and Interacting with R
How to get R: available for Windows, Linux, and Mac OS X.

RStudio, an Integrated Development Environment (IDE) for R

R is a Calculator
R evaluates basic calculations that you type into the console (input window).
    > 3+10
    [1] 13
    > 3/(10+3)
    [1] 0.2307692
    > 2^19
    [1] 524288
    > log(2,base=10)
    [1] 0.30103
    > sin(pi/2)
    [1] 1
    > x = 1:10
    > x
     [1]  1  2  3  4  5  6  7  8  9 10
    > x<5
     [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    > x[3:7]
    [1] 3 4 5 6 7
    > x[4]
    [1] 4

R is a Graphing Device
    > x = rnorm(1000,0,1)
    > hist(x)
    > plot(density(x))

R is a Statistics Package
    > x = c(1.0, 2.3, 3.1, 4.8, 5.6, 6.5)
    > y = c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
    > lm.fit = lm(y ~ x)
    > summary(lm.fit)
summary() prints the call, the residuals, the coefficient table (estimate, standard error, t value, and p-value for the intercept and for x), the residual standard error on 4 degrees of freedom, the multiple and adjusted R-squared, and the F-statistic on 1 and 4 DF with its p-value.

R is a Simulator
    > # Simulate tossing two fair dice 2000 times and observe the sum of the dice.
    > #
    > # N          : the number of times the two dice are tossed
    > # rolls[,1]  : N tosses of the first die
    > # rolls[,2]  : N tosses of the second die
    > # rolls.sum  : row-wise sum of the two columns
    > # obs        : observed frequency of each outcome
    > # exp        : expected frequency of each outcome
    > N = 2000
    > rolls = matrix(ceiling(6*runif(2*N)),ncol=2)
    > rolls.sum = rolls[,1] + rolls[,2]
    > obs = table(rolls.sum)/N
    > exp = c(1:6,5:1)/36
    > print(round(cbind(obs,exp),4))
The printed matrix has one row per possible sum (2 to 12), with the observed frequency next to the expected one.

R is a Programming Language
    > my.fun <- function(x,y) {
    +   res = x+y^2
    +   return(res)
    + }
    > my.fun(3,5)
    [1] 28
    > hist.normal <- function(n,color) {
    +   x = rnorm(n)
    +   h = hist(x,freq=FALSE)
    +   lines(density(x),col=color)
    + }
    > par(mfrow=c(2,1))
    > hist.normal(2000,2)  # 2 for red
    > hist.normal(1000,4)  # 4 for blue

R Basics
  Getting Help in R
  Basic Usage
  Naming convention
  Type of variables
  Missing values
  Importing external file to R
  Exporting R data to external file
  Load packages and data
  Functions
  Many statistical/mathematical methods and functions
  Vector and Matrix
  Some useful functions

Getting Help in R
  library()              lists all packages installed on the system
  help(command)          gets help for one command, e.g. help(heatmap)
  help.search("topic")   searches the help system for documentation associated with the topic, e.g. help.search("normal")
  help.start()           starts the local HTML interface
  q()                    quits the R console

Basic Usage
The general R command syntax uses the assignment operator <- (or =) to assign data to an object:
    x <- c(5,2,3,10,1)
    object <- function(arguments)
Equivalently:
    assign("y", c(10,6,7,8,9))
    c(1,2,3,4,5) -> z
  source("myscript.r")          executes an R script named myscript.r
  objects() or ls()             lists the names of all objects
  rm(data1)                     removes the object named data1 from the current environment
  data1 <- edit(data.frame())   starts an empty GUI spreadsheet editor for manual data entry

Basic Usage (cont.)
  class(object)        displays the object type
  str(object)          displays the internal type and structure of an R object
  attributes(object)   returns an object's attribute list
  dir()                reads the content of the current working directory
  getwd()              returns the current working directory
  setwd("d:/data")     changes the current working directory to a user-specified directory

Naming Convention
All alphanumeric symbols are allowed, plus . and _
Names must start with a letter (A-Z, a-z), or with a dot that is not followed by a digit.
Names are case-sensitive: MyData is different from mydata.

Type of Variables
Types of objects: vector, factor, array, matrix, data.frame, ts, list
Attributes
  Mode: numeric, character, complex, logical
  Length: number of elements in the object
Creation
  Assign a value
  Create a blank object

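As a small illustration of the object types and attributes listed above, the following sketch creates one object of several types and inspects its mode and length. All object names and values here are made up for the example.

```r
# Create one object of several common types (illustrative values only).
v     <- c(1.5, 2, 3)                        # numeric vector
f     <- factor(c("low", "high", "low"))     # factor with two levels
m     <- matrix(1:6, nrow = 2)               # 2 x 3 integer matrix
df    <- data.frame(id = 1:3, name = c("a", "b", "c"))
l     <- list(v = v, m = m)                  # list holding different objects
empty <- numeric(0)                          # a blank numeric vector

mode(v)        # "numeric"
length(v)      # 3
levels(f)      # "high" "low" (levels are sorted alphabetically)
dim(m)         # 2 3
str(df)        # compact description of the data frame
length(l)      # 2
length(empty)  # 0
```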
Missing Values
R is designed to handle statistical data and is therefore well suited to dealing with missing values.
Variables of each data type (numeric, character, logical) can also take the value NA: not available.
  NA is not the same as 0
  NA is not the same as ""
  NA is not the same as FALSE
  NA is not the same as NULL
NA, NaN, and NULL
  NA (Not Available): applies to many modes (character, numeric, etc.)
  NaN (Not a Number): applies only to numeric modes
  NULL: the empty object, with length zero
    > x = c(1,2,3,NA)
    > x+3
    [1]  4  5  6 NA
    > 0/0
    [1] NaN
    > y = NULL
    > length(y)
    [1] 0

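A short sketch of how missing values are usually detected and skipped in practice. It uses only base R functions (is.na, is.nan, is.null, and the na.rm argument of mean); the vector x is an illustrative example.

```r
x <- c(1, 2, 3, NA)

is.na(x)                # FALSE FALSE FALSE TRUE
mean(x)                 # NA: any NA propagates through the computation
mean(x, na.rm = TRUE)   # 2: NA values are dropped first
x == NA                 # all NA: comparing with NA never yields TRUE or FALSE

is.nan(0/0)             # TRUE
is.na(0/0)              # TRUE: NaN also counts as missing
is.null(NULL)           # TRUE
length(NA)              # 1: NA is a value
length(NULL)            # 0: NULL is the empty object
```

The practical rule is to test with is.na(x) rather than x == NA.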
Importing External File to R
  Text files (ASCII)
  Files in other formats (Excel, SAS, SPSS, ...)
  Data on Web pages
  SQL-like databases
  Binary files
Much more information is available in the Data Import/Export manual.
read.table() reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

Syntax Notes
read.table(file, header=FALSE, sep="", dec=".", ...)

  file              the name of the file (within "" or a variable of mode character), possibly with its path (a single \ is not allowed and must be replaced by /, even under Windows), or a remote access to a file of type URL (http://...)
  header            a logical (FALSE or TRUE) indicating if the file contains the names of the variables on its first line
  sep               the field separator used in the file, for instance sep="\t" if it is a tabulation
  quote             the characters used to cite the variables of mode character
  dec               the character used for the decimal point
  row.names         a vector with the names of the lines, which can be either a vector of mode character or the number (or the name) of a variable of the file (by default: 1, 2, 3, ...)
  col.names         a vector with the names of the variables (by default: V1, V2, V3, ...)
  as.is             controls the conversion of character variables as factors (if FALSE) or keeps them as characters (TRUE); as.is can be a logical, numeric, or character vector specifying the variables to be kept as character
  na.strings        the value given to missing data (converted as NA)
  colClasses        a vector of mode character giving the classes to attribute to the columns
  nrows             the maximum number of lines to read (negative values are ignored)
  skip              the number of lines to be skipped before reading the data
  check.names       if TRUE, checks that the variable names are valid for R
  fill              if TRUE and all lines do not have the same number of variables, blanks are added
  strip.white       (conditional to sep) if TRUE, deletes extra spaces before and after the character variables
  blank.lines.skip  if TRUE, ignores blank lines
  comment.char      a character defining comments in the data file; the rest of the line after this character is ignored (to disable this argument, use comment.char="")

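To make a few of these arguments concrete, here is a minimal sketch that writes a small semicolon-separated file to a temporary location and reads it back with read.table(). The file contents and column names are invented for the example.

```r
# Write a small test file (hypothetical data): ';' separates the fields,
# ',' is the decimal mark, '-' marks a missing value, '#' starts a comment.
path <- tempfile(fileext = ".txt")
writeLines(c("# measurements",
             "id;name;value",
             "1;alpha;3,5",
             "2;beta;-",
             "3;gamma;7,25"), path)

dat <- read.table(path,
                  header       = TRUE,   # first data line holds the names
                  sep          = ";",
                  dec          = ",",
                  na.strings   = "-",    # "-" is converted to NA
                  colClasses   = c("integer", "character", "numeric"),
                  comment.char = "#")    # skip the comment line
str(dat)
unlink(path)                             # remove the temporary file
```

Setting colClasses explicitly avoids type guessing, which also tends to speed up reading large files.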
Data Import
  read.delim("clipboard", header=TRUE)   reads tab-delimited data from the clipboard
  scan("my_file")                        reads a vector/array into a vector from a file or the keyboard
  read.csv(file="path", header=TRUE)     reads a comma-separated file
Further import options let you skip lines, read a limited number of lines, use a different decimal separator, and more.
The foreign package can read files from Stata, SAS, and SPSS.

Fixed Format Data
The function read.fwf reads fixed-format data:
    mydata <- read.fwf(file, widths, header=FALSE, sep="\t", buffersize=2000)
buffersize is the maximum number of lines to read at one time.
    > mydata = read.fwf("c:/tmp/fixed.txt",widths=c(1,4,3))
    > mydata
The result has three columns, V1, V2, and V3, holding the fields of width 1, 4, and 3 from each line of the file.

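A self-contained sketch of the same read.fwf() call, using a temporary file instead of c:/tmp/fixed.txt. The two data lines are invented for illustration; each is split into fields of width 1, 4, and 3.

```r
# Two fixed-width records: 1 character, then 4 digits, then 3 characters.
path <- tempfile(fileext = ".txt")
writeLines(c("A1234abc",
             "B5678def"), path)

mydata <- read.fwf(path, widths = c(1, 4, 3))
mydata       # columns V1, V2, V3
unlink(path) # remove the temporary file
```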
Exporting R Data to External File
  write.table(iris, "clipboard", sep="\t", col.names=NA, quote=FALSE)
      Command to copy and paste from R into Excel or other programs. It writes the data of an R data frame object to the clipboard, from where it can be pasted into other applications.
  write.table(dataframe, file="file path", sep="\t", col.names=NA)
      Writes a data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with the columns when row/index names are exported (default).
  write(x, file="file path")
      Writes matrix data to a file.
  sink("My_R_Output")
      Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
  sink()
      Restores normal R output behavior.

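The export functions can be checked with a quick round trip. The sketch below writes the first rows of iris to a temporary tab-delimited file, reads them back, and then captures console output with sink(). The file names come from tempfile() and are not part of the original slides.

```r
# Round trip: data frame -> tab-delimited file -> data frame.
out <- tempfile(fileext = ".tsv")
write.table(head(iris), file = out, sep = "\t", col.names = NA, quote = FALSE)
back <- read.table(out, header = TRUE, sep = "\t", row.names = 1)
all.equal(back$Sepal.Length, head(iris)$Sepal.Length)   # TRUE

# Redirect printed output to a file, then restore the console.
logfile <- tempfile(fileext = ".txt")
sink(logfile)
print(summary(iris$Sepal.Length))
sink()                                   # back to normal output
captured <- readLines(logfile)
captured

unlink(c(out, logfile))                  # clean up the temporary files
```

Because col.names = NA writes an empty header cell above the row names, the file is read back with row.names = 1.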
Load Packages and Data
GUI mode
  Packages menu: install package(s)
  Packages menu: load package
Command mode
    > # install.packages("onion")
    > library(onion)
    > data(bunny)
    > head(bunny,n=3)
    > # Three-dimensional plotting of points.
    > # Produces a nice-looking 3D scatterplot with
    > # greying out of further points, giving a visual
    > # depth cue
    > p3d(bunny,theta=3,phi=104,box=FALSE)
head(bunny,n=3) prints the first three rows of the x, y, and z coordinates.

Functions
Actions can be performed on objects using functions.
A function is itself an object.
Functions have arguments and options; often there are defaults.
Functions provide a result.
The parentheses () are used to specify that a function is being called.
    > my.fun <- function(a,b=10) {
    +   ret = a+b
    +   ret
    + }
    > my.fun(1)
    [1] 11
    > my.fun(1,2)
    [1] 3

Statistical/Mathematical Methods and Functions
K-means Clustering
    > x = rbind(matrix(rnorm(100,mean=0,sd=0.3),ncol=2),
    +           matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
    > head(x,n=3)
    > dim(x)
    [1] 100   2
    > cl = kmeans(x,4)
    > plot(x,col=cl$cluster)
    > points(cl$centers,col=1:4,pch=8,cex=2)

    > cl
    K-means clustering with 4 clusters of sizes 16, 47, 14, 23
The printout continues with the cluster means, the clustering vector (one cluster label per point), and the within-cluster sum of squares by cluster (between_SS / total_SS = 83.8 %).
    Available components:
    [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
    [6] "betweenss"    "size"         "iter"         "ifault"

Statistical/Mathematical Methods and Functions
Statistical distributions
  Density (d), cumulative distribution function (p), quantile function (q), and random variate generation (r):
    Normal: dnorm, pnorm, qnorm, rnorm
    Beta: dbeta, pbeta, qbeta, rbeta
    F: df, pf, qf, rf
Some basic mathematical operators and functions
  log, exp
  mean, median, max, min, sd (note that mode() returns the storage mode of an object, not the statistical mode)
  trigonometry
  set operations
  logical operators: <, <=, >, >=, ==, !=

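The d/p/q/r naming scheme is easiest to see on the normal distribution. The sketch below checks that the quantile function inverts the cumulative distribution function and that simulated draws agree with it; the seed and sample size are arbitrary choices for the example.

```r
p <- pnorm(1.96)        # P(X <= 1.96) for a standard normal, about 0.975
q <- qnorm(p)           # quantile function inverts pnorm: back to 1.96
d <- dnorm(0)           # density at 0, equal to 1/sqrt(2*pi)

set.seed(1)             # arbitrary seed, for reproducibility only
s <- rnorm(10000)       # 10000 random draws
emp <- mean(s <= 1.96)  # empirical share of draws below 1.96, close to p

c(p = p, q = q, d = d, emp = emp)
```

The same four prefixes work for the other families, for example dbeta, pbeta, qbeta, and rbeta.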
    > pnorm(1,mean=0,sd=1) - pnorm(-1,mean=0,sd=1)
    [1] 0.6826895
    > pnorm(2,mean=0,sd=1) - pnorm(1,mean=0,sd=1)
    > qnorm(0.9785,mean=0,sd=1)
    > x = seq(-3.2,3.2,by=0.01)
    > y = dnorm(x)
    > x.sampling = rnorm(200,mean=0,sd=1)
    > hist(x.sampling,prob=TRUE,xlim=c(-3.2,3.2),ylim=c(0,0.5),xlab="",ylab="",main="")
    > lines(x,y)
    > lines(density(x.sampling),col="red")
    > legend("topright",c("dnorm: exact dist'n","rnorm: random number"),col=c(1,2),lty=1)

Vector and Matrix
Vector: all elements are the same data type
    > v1 = c(1,2,3,4,5,6)
    > v2 = c("one","two","three","four","five","six")
    > v3 = c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
    > v1
    [1] 1 2 3 4 5 6
    > v2
    [1] "one"   "two"   "three" "four"  "five"  "six"
    > v3
    [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
Matrix: all elements are the same data type
    > m = matrix(LETTERS[1:12],nrow=3,ncol=4)
    > m
         [,1] [,2] [,3] [,4]
    [1,] "A"  "D"  "G"  "J"
    [2,] "B"  "E"  "H"  "K"
    [3,] "C"  "F"  "I"  "L"

Vector and Matrix
Data Frame: different columns can have different data types
List: an ordered collection of various elements
    > dframe = data.frame(v1,v2,v3)
    > dframe
      v1    v2    v3
    1  1   one  TRUE
    2  2   two FALSE
    3  3 three  TRUE
    4  4  four FALSE
    5  5  five  TRUE
    6  6   six FALSE
    > l = list(n="Sean",mat=m,df=dframe)
    > l
    $n
    [1] "Sean"

    $mat
         [,1] [,2] [,3] [,4]
    [1,] "A"  "D"  "G"  "J"
    [2,] "B"  "E"  "H"  "K"
    [3,] "C"  "F"  "I"  "L"

    $df
      v1    v2    v3
    1  1   one  TRUE
    2  2   two FALSE
    3  3 three  TRUE
    4  4  four FALSE
    5  5  five  TRUE
    6  6   six FALSE

Some Useful Functions
  apply(iris[,1:3], 1, mean)
      Calculates the mean values over columns 1-3 of the sample data frame 'iris'. With the argument setting '1', row-wise iterations are performed, and with '2', column-wise iterations.
  tapply(iris[,4], iris$Species, mean)
      Calculates the mean values for the 4th column based on the grouping information in the 'Species' column of the 'iris' data frame.
  sapply(x, sqrt)
      Calculates the square root of each element in the vector x. Generates the same result as 'sqrt(x)'.
  lapply(X, FUN)
      Returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

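The difference between the apply variants is easiest to see on tiny inputs where the results can be checked by hand. The matrix, vectors, and group labels below are made up for the example.

```r
m <- matrix(1:6, nrow = 2)       # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)     # margin 1 = rows:    9 12
col_sums <- apply(m, 2, sum)     # margin 2 = columns: 3 7 11

roots <- sapply(c(1, 4, 9), sqrt)                 # vector: 1 2 3
means <- lapply(list(a = 1:3, b = 4:6), mean)     # list: a = 2, b = 5

# Group means: values 10 and 30 belong to "x", 20 and 40 to "y".
grouped <- tapply(c(10, 20, 30, 40), c("x", "y", "x", "y"), mean)

row_sums; col_sums; roots; means; grouped
```

sapply simplifies its result to a vector where possible, while lapply always returns a list.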
Some Useful Functions: Examples
    > iris.part = iris[1:10,1:3]
    > iris.part
    > apply(iris.part,1,mean)
    > apply(iris.part,2,mean)
    > tapply(iris[,4],iris$Species,mean)
        setosa versicolor  virginica
         0.246      1.326      2.026
    > sapply(iris[,1:3],mean)
    Sepal.Length  Sepal.Width Petal.Length
        5.843333     3.057333     3.758000
    > apply(iris[,1:3],2,mean)
    Sepal.Length  Sepal.Width Petal.Length
        5.843333     3.057333     3.758000

High-Performance and Parallel Computing with R
  Parallel computing: Explicit parallelism
  Parallel computing: Implicit parallelism
  Parallel computing: Grid computing
  Parallel computing: Hadoop
  Parallel computing: Random numbers
  Parallel computing: Resource managers and batch schedulers
  Parallel computing: Applications
  Parallel computing: GPUs
  Large memory and out-of-memory data

Parallel Computing Basics
Serial and parallel tasks:
  Serial (Task 1, Task 2, Task 3, ..., Task n): the problem is broken into a discrete series of instructions, which are processed one after another.
  Parallel (Task 1, then Tasks 2.1 to 2.m, 3.1 to 3.m, ..., n.1 to n.m side by side, then Task n+1): the problem is broken into discrete parts that can be solved concurrently.

Package: data.table
An extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping, and list columns
    > library(data.table)
    > v1 = c(1,2,5.3,6,2,4)
    > v2 = c("one","two","three","four","five","six")
    > v3 = c(T,T,T,F,T,F)
    > my.datatable = data.table(v1,v2,v3)
    > my.datatable
        v1    v2    v3
    1: 1.0   one  TRUE
    2: 2.0   two  TRUE
    3: 5.3 three  TRUE
    4: 6.0  four FALSE
    5: 2.0  five  TRUE
    6: 4.0   six FALSE
    > my.datatable[2]
       v1  v2   v3
    1:  2 two TRUE
    > my.datatable[,v2]
    [1] "one"   "two"   "three" "four"  "five"  "six"
    > my.datatable[,sum(v1),by=v3]
          v3   V1
    1:  TRUE 10.3
    2: FALSE 10.0
    > setkey(my.datatable,v2)
    > my.datatable["five"]
       v1   v2   v3
    1:  2 five TRUE

Package: plyr
plyr is a set of tools for splitting, applying, and combining data.
Functions are named according to the data structures they use (a: array, l: list, d: data.frame, m: multiple inputs, r: repeat multiple times).
Provides a set of helper functions for common data analysis tasks.
    > library(plyr)
    > data(iris)
    > head(iris,n=3)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    > count(iris,vars="Species")
         Species freq
    1     setosa   50
    2 versicolor   50
    3  virginica   50

Package: plyr (cont.)
summarise works in an analogous way to mutate, except that instead of adding columns to an existing data frame, it creates a new data frame. This is particularly useful in conjunction with ddply, as it makes it easy to perform group-wise summaries.
    > is.data.frame(iris)
    [1] TRUE
    > dim(iris)
    [1] 150   5
    > summary(iris)
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
     Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
    > summarise(iris,mean_petal_length=mean(Petal.Length),
    +           max_sepal_length=max(Sepal.Length))
      mean_petal_length max_sepal_length
    1             3.758              7.9

Package: plyr (cont.)
ddply: for each subset of a data frame, apply a function, then combine the results into a data frame.
daply: for each subset of a data frame, apply a function, then combine the results into an array. daply with a function that operates column-wise is similar to aggregate.
    > ddply(iris,.(Species),summarise,mean_petal_length=mean(Petal.Length),
    +       max_sepal_length=max(Sepal.Length))
         Species mean_petal_length max_sepal_length
    1     setosa             1.462              5.8
    2 versicolor             4.260              7.0
    3  virginica             5.552              7.9
    > daply(iris[,c(1,2,5)],.(Species),colwise(mean))
The daply call returns an array with one row per species and the mean Sepal.Length and Sepal.Width in the columns.

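For comparison, the same split-apply-combine idea can be sketched in base R, without plyr. The example uses aggregate(), which the slide mentions as the closest base-R analogue, and then spells out the three steps by hand with split() and sapply().

```r
# Group-wise column means with aggregate(): one row per species.
agg <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                 data = iris, FUN = mean)
agg

# The same pattern step by step:
pieces      <- split(iris$Petal.Length, iris$Species)  # split by group
petal_means <- sapply(pieces, mean)                    # apply, then combine
petal_means
```

plyr wraps this pattern in a consistent naming scheme; the base functions are enough when only one grouping variable and one summary are involved.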
Package: RJSONIO
JSON: JavaScript Object Notation, a lightweight data-interchange format
    > library(RJSONIO)
    > json = toJSON(list(a=c(1,2,3),name="Markus"))
    > cat(json)
    {
     "a": [ 1, 2, 3 ],
     "name": "Markus"
    }
    > robj = fromJSON(json)
    > robj
    $a
    [1] 1 2 3

    $name
    [1] "Markus"

Packages: bigmemory, biganalytics, foreach, ...
Available on Unix-alikes, including Mac.
Manage massive matrices with shared memory and memory-mapped files.
    > library(bigmemory)
    > library(biganalytics)
    > library(foreach)
    > x = rbind(matrix(rnorm(100,sd=0.3),ncol=2),
    +           matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
    > bigmatrix = as.big.matrix(x)
    > res = bigkmeans(bigmatrix,3)
    > res
    K-means clustering with 3 clusters of sizes 27, 49, 24
The printout continues with the cluster means, the clustering vector, and the within-cluster sum of squares by cluster.
    Available components:
    [1] "cluster"  "centers"  "withinss" "size"

Package: parallel
Ships with R as a base package.
Contains functionality derived from, and pretty much equivalent to, the multicore and snow packages.
    > N = 1000
    > x = list(a=rnorm(N),b=rbeta(N,2,3))
    > lapply(x,mean)
    > library(parallel)
    > cl = makeCluster(2)
    > parLapply(cl,x,mean)
    > stopCluster(cl)
    > mclapply(x,mean)
Each of the three calls returns a list with the means of $a and $b; the values vary from run to run because the input is random.

Package: Rcpp
The Rcpp package provides C++ classes that greatly facilitate interfacing C or C++ code in R packages using the .Call interface provided by R.
A clean, approachable API that lets you write high-performance code.
Can help with loops, recursive functions, and functions with advanced data structures.
    > library(Rcpp)
    > cppFunction('
    +   int add(int x, int y, int z) {
    +     int sum = x+y+z;
    +     return sum;
    +   }
    + ')
    > add
    function (x, y, z)
    .Primitive(".Call")(<pointer: 0x c1770>, x, y, z)
    > add(2014,12,1)
    [1] 2027
    > cppFunction('
    +   int fibonacci(const int x) {
    +     if (x == 0) return(0);
    +     if (x == 1) return(1);
    +     return (fibonacci(x - 1)) + fibonacci(x - 2);
    +   }
    + ')
    > fibonacci(20)
    [1] 6765

Package: ggplot2
ggplot2 is useful for producing complex graphics relatively simply.
An implementation of the book The Grammar of Graphics by Leland Wilkinson.
The basic notion is that there is a grammar to the composition of graphical components in statistical graphics.
By directly controlling that grammar, you can generate a large set of carefully constructed graphics from a relatively small set of operations.
A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics.

    > library(ggplot2)
    > qplot(Sepal.Length,Petal.Length,data=iris,color=Species)  # top-left
    > res = qplot(Sepal.Length,Petal.Length,data=iris,color=Species,size=Petal.Width,alpha=I(0.5))
    > res  # top-right
    > res+geom_line(size=1)  # bottom-left
    > res+geom_boxplot(size=0.2,alpha=I(0.3))  # bottom-right

Shiny: Easy Web Applications
Developed by RStudio.
Turn analyses into interactive web applications that anyone can use.
Let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields.
Easily incorporate any number of outputs like plots, tables, and summaries.
No HTML or JavaScript knowledge is necessary, only R.
Hello World Shiny:
    > library(shiny)
    > runExample("01_hello")

Package: bigvis
Revolution R Enterprise
Tools for exploratory data analysis of large data sets (76 million)

R and Databases
SQL provides a standard language to filter, aggregate, group, and sort data.
SQL-like query languages are showing up in new places: Hadoop Hive, ...
ODBC provides an SQL interface to non-database data: Excel, CSV, text files, ...
R stores relational data in data frames.

DB-Engines Ranking (scores and changes not preserved in this transcript):
  Rank  DBMS                  Database Model
  1.    Oracle                Relational DBMS
  2.    MySQL                 Relational DBMS
  3.    Microsoft SQL Server  Relational DBMS
  4.    PostgreSQL            Relational DBMS
  5.    MongoDB               Document store
  6.    DB2                   Relational DBMS
  7.    Microsoft Access      Relational DBMS
  8.    SQLite                Relational DBMS
  9.    Cassandra             Wide column store
  10.   Sybase ASE            Relational DBMS

Package: sqldf
sqldf is an R package for running SQL statements on R data frames.
SQL statements are written in R using data frame names in place of table names.
A database with appropriate table layouts/schema is automatically created, and the data frames are automatically loaded into the database.
The result is read back into R.
sqldf supports the SQLite backend database (default), the H2 Java database, the PostgreSQL database, and MySQL.

Package: sqldf (cont.)
    > library(sqldf)
    > sqldf('select * from iris limit 4')
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    > sqldf('select count(*) from iris')
      count(*)
    1      150
    > sqldf('select Species, count(*) from iris group by Species')
         Species count(*)
    1     setosa       50
    2 versicolor       50
    3  virginica       50
    > sqldf('select Species, avg("Sepal.Length") as "Sepal Avg.",
    +        variance("Sepal.Width") as "Sepal Width" from iris group by Species')
The last query returns one row per species with the columns Species, Sepal Avg., and Sepal Width.

Other Relational Packages
  RMySQL       provides an interface to MySQL
  RPostgreSQL  provides an interface to PostgreSQL
  ROracle      provides an interface to Oracle
  RJDBC        provides access to databases through a JDBC interface
  RSQLite      provides access to SQLite
Bottleneck when dealing with BIG DATA: all of these packages read the full result into R memory.

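One common way around the memory bottleneck is to process a result in chunks and keep only running totals. The sketch below illustrates the idea in base R, with a temporary CSV file standing in for a large query result; the file, the chunk size of 100 rows, and the column name are assumptions made for the example, not part of any of the packages above.

```r
# Stand-in for a large result set: a one-column CSV with 1000 rows.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

con <- file(path, open = "r")
invisible(readLines(con, n = 1))      # skip the header line

total <- 0
n     <- 0
repeat {
  lines <- readLines(con, n = 100)    # at most 100 rows in memory at once
  if (length(lines) == 0) break       # end of file reached
  chunk <- as.numeric(lines)
  total <- total + sum(chunk)         # keep only the running totals
  n     <- n + length(chunk)
}
close(con)
unlink(path)

total / n                             # mean of all rows: 500.5
```

The database interfaces offer comparable row-limited fetching, so the same loop structure carries over to query results.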
Hadoop
An open source software framework designed to support large-scale data processing.
MapReduce: a computational paradigm
  The application is divided into many small fragments of work.
HDFS: Hadoop Distributed File System
  A distributed file system that stores data on the compute nodes.
The ecosystem: Hive, Pig, Flume, Mahout, ...
Written in Java, opened up to alternatives by its Streaming API.

HDFS & Hadoop Cluster
HDFS is a block-structured file system.
Blocks are stored across a cluster of one or more machines with data storage capacity (datanodes).
Data is accessed in a write-once, read-many model.
HDFS comes with its own utilities for file management.
The HDFS file system stores its metadata reliably (namenode).

RHadoop
An open source project sponsored by Revolution Analytics.
Packages:
  rmr2     hosts all MapReduce-related functions; uses the Hadoop Streaming API
  rhdfs    for interaction with the HDFS file system
  plyrmr   convenient processing of large data sets on a Hadoop cluster
  rhbase   connects with Hadoop's NoSQL database HBase
Installation: https://github.com/revolutionanalytics/rhadoop

Package: rmr2
Dependencies: Rcpp, functional, bitops, caTools, RJSONIO, etc.
Simple Parallel Computing with R
    > x = 1:5
    > x
    [1] 1 2 3 4 5
    > unlist(lapply(x, function(y) y^2))
    [1]  1  4  9 16 25
    > library(parallel)
    > unlist(mclapply(x, function(y) y^2))
    [1]  1  4  9 16 25
A MapReduce Example Job
    > library(rmr2)
    > rmr.options(backend=c("local"))
    NULL
    > small.ints = to.dfs(keyval(1,1:100))
    > out = mapreduce(
    +   input=small.ints,
    +   map=function(k,v) cbind(v,v^2))
    > df = from.dfs(out)
    > head(df$val,n=5)
         v
    [1,] 1  1
    [2,] 2  4
    [3,] 3  9
    [4,] 4 16
    [5,] 5 25

Hadoop Hello World Example: Word Count

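The word count job can be mimicked in plain R to show the three MapReduce phases. This is only a single-machine sketch of the idea in base R, not Hadoop or rmr2 code; the three input "documents" are invented for the example.

```r
docs <- c("big data with R", "R and Hadoop", "big big data")

# Map: split each document into lower-case words (one key per occurrence).
words <- unlist(lapply(docs, function(d) strsplit(tolower(d), "\\s+")[[1]]))

# Shuffle: group a count of 1 per occurrence under its word.
groups <- split(rep(1L, length(words)), words)

# Reduce: sum the counts within each group.
counts <- sapply(groups, sum)
counts
```

On a cluster, the map and reduce steps run on different nodes, and the framework performs the shuffle between them.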
More informationImageNow Administrator Getting Started Guide
ImageNow Administrator Getting Started Guide Version: 6.6.x Written by: Product Documentation, R&D Date: June 2011 ImageNow and CaptureNow are registered trademarks of Perceptive Software, Inc. All other
More informationCloud Computing and Big Data Analytics: What Is New from Databases Perspective?
Cloud Computing and Big Data Analytics: What Is New from Databases Perspective? Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania IBM Research, India {grajeev,higupta8,mkmukesh}@in.ibm.com Abstract. Many
More informationIntroduction to Stata
Introduction to Stata Christopher F Baum Faculty Micro Resource Center Boston College August 2011 Christopher F Baum (Boston College FMRC) Introduction to Stata August 2011 1 / 157 Strengths of Stata What
More informationUser Guide and Reference Manual. Version 3.1 September 2014
User Guide and Reference Manual Version 3.1 September 2014 Scientific Toolworks, Inc. 53 N Main St. George, UT 84770 Copyright 2014 Scientific Toolworks, Inc. All rights reserved. The information in this
More informationINTRODUCTION TO SCILAB
www.scilab.org INTRODUCTION TO SCILAB Consortium scilab Domaine de Voluceau  Rocquencourt  B.P. 10578153 Le Chesnay Cedex France This document has been written by Michaël Baudin from the Scilab Consortium.
More informationDHIS 2 Enduser Manual 2.19
DHIS 2 Enduser Manual 2.19 20062015 DHIS2 Documentation Team Revision 1529 Version 2.19 20150707 21:00:22 Warranty: THIS DOCUMENT IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
More informationMcAfee epolicy Orchestrator
Best Practices Guide McAfee epolicy Orchestrator for use with epolicy Orchestrator versions 4.5.0 and 4.0.0 COPYRIGHT Copyright 2011 McAfee, Inc. All Rights Reserved. No part of this publication may be
More informationOrigin 9.1 User Guide
Origin 9.1 User Guide Copyright 2013 by OriginLab Corporation All rights reserved. No part of the contents of this book may be reproduced or transmitted in any form or by any means without the written
More informationAN INTRODUCTION TO. Data Science. Jeffrey Stanton, Syracuse University
AN INTRODUCTION TO Data Science Jeffrey Stanton, Syracuse University INTRODUCTION TO DATA SCIENCE 2012, Jeffrey Stanton This book is distributed under the Creative Commons Attribution NonCommercialShareAlike
More informationOracle Data Integrator Best Practices for a Data Warehouse. Oracle Best Practices March 2008
Oracle Data Integrator Best Practices for a Data Warehouse Oracle Best Practices March 2008 Oracle Data Integrator Best Practices for a Data Warehouse PREFACE... 7 PURPOSE... 7 AUDIENCE... 7 ADDITIONAL
More informationJournal of Statistical Software
JSS Journal of Statistical Software MMMMMM YYYY, Volume VV, Issue II. http://www.jstatsoft.org/ Tidy Data Hadley Wickham RStudio Abstract A huge amount of effort is spent cleaning data to get it ready
More informationBigBench: Towards an Industry Standard Benchmark for Big Data Analytics
BigBench: Towards an Industry Standard Benchmark for Big Data Analytics Ahmad Ghazal 1,5, Tilmann Rabl 2,6, Minqing Hu 1,5, Francois Raab 4,8, Meikel Poess 3,7, Alain Crolotte 1,5, HansArno Jacobsen 2,9
More informationMAKE YOUR WEB PRESENCE FELT WITH MOVABLE TYPE. WE MAKE IT SIMPLE TO CREATE AND EASY TO MANAGE YOUR CONTENT. Six Apart Movable Type
MAKE YOUR WEB PRESENCE FELT WITH MOVABLE TYPE. WE MAKE IT SIMPLE TO CREATE AND EASY TO MANAGE YOUR CONTENT. Six Apart Movable Type WHY CHOOSE MOVABLE TYPE? Publishers large and small love our professional
More informationGetting Started with Richmond SupportDesk
Getting Started with Richmond SupportDesk Richmond SupportDesk is a Help Desk, Service Management and Asset Management software solution designed for internal support (IT support, facilities management
More informationBorland StarTeam 2009. StarTeam Server Help
Borland StarTeam 2009 StarTeam Server Help Borland Software Corporation 8310 N Capital of Texas Hwy, Bldg 2, Ste 100 Austin, Texas 78731 USA www.borland.com Borland Software Corporation may have patents
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More information