Introduction to Big Data Analysis with R


1 Introduction to Big Data Analysis with R Yung-Hsiang Huang National Center for High-performance Computing, Taiwan 2014/12/01

2 Agenda Big Data, Big Challenge Introduction to R Some R-Packages to Deal With Big Data Several useful R-packages Hadoop vs. R 2014/12/01 SEAIP 2014: Intro R & Big Data (Sean) 2

3 Big Data
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

4 Big Data (cont.)
Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.
The challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy violations.
The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."

5 Volume, Velocity, Variety
Big data is usually characterized along three dimensions: volume, velocity and variety.
Volume: Machine-generated data is produced in larger quantities than non-traditional data.
Velocity: This refers to the speed of data processing.
Variety: This refers to the large variety of input data, which in turn generates a large amount of data as output.
Some make it 4 Vs or 5 Vs:
Value: How to generate maximum value
Veracity: The uncertainty of data

8 What is R?
The most powerful and most widely used statistical software
https://www.youtube.com/watch?v=tr2bhsj_eck

9 What is R?
R is a comprehensive statistical and graphical programming language and is a dialect of the S language:
S2: RA Becker, JM Chambers, A Wilks
S3: JM Chambers, TJ Hastie
S4: JM Chambers
R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland, New Zealand, during the 1990s.
Since 1997: international R-core team of 15 people with access to a common CVS archive.

10 What is R?
The R statistical programming language is a free open source package based on the S language developed by Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to cutting edge research.
Since it is a programming language, writing code is required to complete tasks.

11 R-Project
Open-source: R is a free software environment for statistical computing and graphics
Offers tools to manage and analyze data
Standard and many more statistical methods are implemented
Support via the R mailing lists by members of the core team: R-announce, R-packages, R-help, R-devel
Support via several manuals and books

12 R-Project
Huge online libraries with R packages:
CRAN
BioConductor (for genomic data)
Omegahat
R-Forge
Possibility to write personalized code and to contribute new packages
The New York Times (Jan, 2009), "Data Analysts Captivated by R's Power"

13 R vs. SAS vs.
R is open source software, SAS is a commercial product
R is free and available to everyone
R code is open source and can be modified by everyone
R is a complete, self-contained programming language
R has a big and active community
[Figure: number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS.]

14 R Books
Introduction to Scientific Programming and Simulation Using R
Statistical Computing with R
R Programming for Bioinformatics
R for Business Analytics
A Handbook of Statistical Analyses Using R
Introductory Statistics with R, 2nd edition

15 Installing, Running, and Interacting with R
How to get R
Available for Windows, Linux and Mac OS X

18 RStudio, an Integrated Development Environment (IDE) for R

21 R is a Calculator
R will evaluate basic calculations which you type into the console (input window)
> 3+10
[1] 13
> 3/(10+3)
[1] 0.2307692
> 2^19
[1] 524288
> log(2,base=10)
[1] 0.30103
> sin(pi/2)
[1] 1
> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x<5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x[3:7]
[1] 3 4 5 6 7
> x[-4]
[1]  1  2  3  5  6  7  8  9 10

22 R is a Graphing Device
> x = rnorm(1000,0,1)
> hist(x)
> plot(density(x))

23 R is a Statistics Package
> x = c(1.0, 2.3, 3.1, 4.8, 5.6, 6.5)
> y = c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
> lm.fit = lm(y ~ x)
> summary(lm.fit)
Call: lm(formula = y ~ x)
[Output: residuals; coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) with (Intercept) and x both flagged **; residual standard error on 4 degrees of freedom; multiple and adjusted R-squared; F-statistic on 1 and 4 DF with its p-value. The numeric values are missing from the transcription.]

24 R is a Simulator
> # Simulate tossing two fair dice 2000 times and observe the sum of the dice.
> #
> # N : number of times the two dice are tossed
> # rolls[,1] : N tosses of the first die
> # rolls[,2] : N tosses of the second die
> # rolls.sum : sum of the two dice for each toss
> # obs : observed relative frequency of each outcome
> # exp : expected relative frequency of each outcome
> N = 2000
> rolls = matrix(ceiling(6*runif(2*N)),ncol=2)
> rolls.sum = rolls[,1] + rolls[,2]
> obs = table(rolls.sum)/N
> exp = c(1:6,5:1)/36
> print(round(cbind(obs,exp),4))
[Output: a two-column table of obs and exp for the sums 2 to 12.]

25 R is a Programming Language
> my.fun <- function(x,y) {
+ res = x+y^2
+ return(res)
+ }
> my.fun(3,5)
[1] 28
> hist.normal <- function(n,color) {
+ x = rnorm(n)
+ h = hist(x,freq=FALSE)
+ lines(density(x),col=color)
+ }
> par(mfrow=c(2,1))
> hist.normal(2000,2) # 2 for red
> hist.normal(1000,4) # 4 for blue

26 R Basics
Getting Help in R
Basic Usage
Naming convention
Type of variables
Missing values
Importing external file to R
Exporting R data to external file
Load packages and data
Functions
Many statistical/mathematical methods and functions
Vector and Matrix
Some useful functions

27 Getting Help in R
library() lists all packages installed on the system
help(command) gets help for one command, e.g. help(heatmap)
help.search("topic") searches the help system for packages associated with the topic, e.g. help.search("normal")
help.start() starts the local HTML interface
q() quits the R console

28 Basic Usage
The general R command syntax uses the assignment operator <- (or =) to assign data to an object.
object <- function(arguments), e.g. x <- c(5,2,3,10,1)
Equivalently, assign("y", c(10,6,7,8,9)); c(1,2,3,4,5) -> z
source("myscript.r") executes an R script named myscript.r.
objects() or ls() lists the names of all objects.
rm(data1) removes the object named data1 from the current environment.
data1 <- edit(data.frame()) starts an empty GUI spreadsheet editor for manual data entry.

29 Basic Usage (cont.)
class(object) displays the object type.
str(object) displays the internal type and structure of an R object.
attributes(object) returns an object's attribute list.
dir() reads the content of the current working directory.
getwd() returns the current working directory.
setwd("d:/data") changes the current working directory to the user-specified directory.

30 Naming Convention
All alphanumeric symbols are allowed, plus . and _
Must start with a letter (A-Z, a-z)
Case-sensitive: MyData is different from mydata

31 Type of Variables
Types of objects: vector, factor, array, matrix, data.frame, ts, list
Attributes
Mode: numeric, character, complex, logical
Length: number of elements in object
Creation
Assign a value
Create a blank object
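A minimal sketch of the object types and the mode and length attributes listed on this slide. All variable names are illustrative assumptions and do not come from the slides.

```r
# Create one object of several common types (names are hypothetical).
v     <- c(1.5, 2, 3)                                   # numeric vector
f     <- factor(c("a", "b", "a"))                       # factor with two levels
m     <- matrix(1:6, nrow = 2)                          # 2 x 3 matrix
df    <- data.frame(id = 1:3, v = v)                    # data frame
t1    <- ts(c(5, 7, 9, 11), start = c(2014, 1), frequency = 4)  # quarterly series
l     <- list(v = v, f = f)                             # list of mixed elements
blank <- numeric(0)                                     # a "blank" numeric object

mode(v)        # "numeric"
class(f)       # "factor"
dim(m)         # 2 3
class(t1)      # "ts"
length(df)     # 2, the number of columns
length(blank)  # 0
```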

32 Missing Values
R is designed to handle statistical data and is therefore well suited to dealing with missing values.
Variables of each data type (numeric, character, logical) can also take the value NA: not available.
NA is not the same as 0
NA is not the same as "" (the empty string)
NA is not the same as FALSE
NA is not the same as NULL
NA, NaN, and NULL
NA (Not Available): applies to many modes (character, numeric, etc.)
NaN (Not a Number): applies only to numeric modes
NULL: an object (list) of length zero
> x = c(1,2,3,NA)
> x+3
[1]  4  5  6 NA
> 0/0
[1] NaN
> y = NULL
> length(y)
[1] 0
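As a small illustration of working with the values described above, the following sketch uses only base R functions; the vector x mirrors the one on the slide, and x_complete is a name introduced here for illustration.

```r
x <- c(1, 2, 3, NA)

is.na(x)                    # FALSE FALSE FALSE TRUE
mean(x)                     # NA, because one element is missing
mean(x, na.rm = TRUE)       # 2, missing values are dropped first
x_complete <- x[!is.na(x)]  # keep only the observed values

NA == NA                    # NA: compare with is.na(), not with ==
is.nan(0 / 0)               # TRUE
is.na(0 / 0)                # TRUE, NaN is also treated as missing
is.null(NULL)               # TRUE
```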

33 Importing External File to R
Text files (ASCII)
Files in other formats (Excel, SAS, SPSS, ...)
Data on Web pages
SQL-like databases
Binary files
Much more information is available in the Data Import/Export manual.
read.table() reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

34 Syntax Notes
read.table(file, header=FALSE, sep="", dec=".", ...)
file: the name of the file (in quotes, or a variable of mode character), possibly with its path (a single backslash \ is not allowed and must be replaced by /, even under Windows), or remote access to a file via a URL (http://...)
header: a logical (FALSE or TRUE) indicating if the file contains the names of the variables on its first line
sep: the field separator used in the file, for instance sep="\t" if it is a tabulation
quote: the characters used to quote the variables of mode character
dec: the character used for the decimal point
row.names: a vector with the names of the lines, which can be either a vector of mode character, or the number (or the name) of a variable of the file (by default: 1, 2, 3, ...)
col.names: a vector with the names of the variables (by default: V1, V2, V3, ...)
as.is: controls the conversion of character variables to factors (if FALSE) or keeps them as characters (TRUE); as.is can be a logical, numeric or character vector specifying the variables to be kept as character
na.strings: the value given to missing data (converted to NA)
colClasses: a vector of mode character giving the classes to attribute to the columns
nrows: the maximum number of lines to read (negative values are ignored)
skip: the number of lines to be skipped before reading the data
check.names: if TRUE, checks that the variable names are valid for R
fill: if TRUE and all lines do not have the same number of variables, blanks are added
strip.white: (conditional to sep) if TRUE, deletes extra spaces before and after the character variables
blank.lines.skip: if TRUE, ignores blank lines
comment.char: a character defining comments in the data file; the rest of the line after this character is ignored (to disable this argument, use comment.char="")
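A short sketch of several of these arguments in use. To stay self-contained it reads from a string through the text argument of read.table instead of a file; the data, the "-" missing-value marker, and the variable names are assumptions made for illustration.

```r
# Hypothetical tab-separated data held in a string instead of a file.
raw <- "id\tname\tscore\n1\talice\t3.5\n2\tbob\t-\n3\tcarol\t4.0\n"

dat <- read.table(text = raw,
                  header = TRUE,          # first line holds the variable names
                  sep = "\t",             # fields are separated by tabs
                  na.strings = "-",       # "-" marks a missing value
                  colClasses = c("integer", "character", "numeric"))
str(dat)
```

Setting colClasses explicitly avoids type guessing, which also tends to speed up reading large files.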

35 Data Import
read.delim("clipboard", header=TRUE) reads tab-delimited data from the clipboard
scan("my_file") reads a vector/array into a vector from a file or the keyboard
read.csv(file="path", header=TRUE)
You can skip lines, read a limited number of lines, use a different decimal separator, and set more importing options.
The foreign package can read files from Stata, SAS, and SPSS.

36 Fixed Format Data
The function read.fwf reads fixed-format data
mydata <- read.fwf(file, widths, header=FALSE, sep="\t", buffersize=2000)
buffersize is the maximum number of lines to read at one time.
> mydata = read.fwf("c:/tmp/fixed.txt",widths=c(1,4,3))
> mydata
[Output: a data frame with columns V1, V2, V3 and seven rows labelled A to G in V1; the numeric values of the input file and the output are missing from the transcription.]
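Since the contents of the slide's input file are not legible, here is a self-contained sketch with made-up data that uses the same widths (1, 4, 3). The file contents and the names path and fixed are hypothetical.

```r
# Hypothetical fixed-width file: 1-character label, 4-character value,
# 3-character value, with no separators between fields.
path <- tempfile(fileext = ".txt")
writeLines(c("A1.501.2",
             "B2.251.4",
             "C3.101.6"), path)

fixed <- read.fwf(path, widths = c(1, 4, 3))
fixed          # columns V1 (label), V2 and V3 (numeric)
unlink(path)   # remove the temporary file
```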

37 Exporting R Data to External File
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=FALSE)
Command to copy and paste from R into Excel or other programs. It writes the data of an R data frame object to the clipboard, from where it can be pasted into other applications.
write.table(dataframe, file="file path", sep="\t", col.names=NA)
Writes a data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with the columns when row/index names are exported (default).
write(x, file="file path")
Writes matrix data to a file.
sink("My_R_Output")
Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink()
Restores normal R output behavior.
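A minimal round-trip sketch of the export functions above. It writes to temporary files rather than the clipboard so that it runs on any platform; out_file, log_file, back, and captured are names introduced here for illustration.

```r
# Write the first rows of iris to a tab-delimited file, then read them back.
out_file <- tempfile(fileext = ".txt")
write.table(head(iris, 5), file = out_file, sep = "\t",
            col.names = NA, quote = FALSE)
back <- read.table(out_file, header = TRUE, sep = "\t", row.names = 1)

# Redirect console output to a file, then restore normal output.
log_file <- tempfile(fileext = ".log")
sink(log_file)
print(summary(iris$Sepal.Length))
sink()
captured <- readLines(log_file)

unlink(c(out_file, log_file))
```

With col.names = NA the header line gets an empty first field, so the column titles stay aligned with the data when row names are written.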

38 Load Packages and Data
GUI Mode
Package(s) > install package(s)
Package(s) > load package
Command Mode
> # install.packages("onion")
> library(onion)
> data(bunny)
> head(bunny,n=3)
[Output: the first three rows of the x, y, z coordinates; values are missing from the transcription.]
> # Three-dimensional plotting of points.
> # Produces a nice-looking 3D scatterplot with
> # greying out of further points giving a visual
> # depth cue
> p3d(bunny,theta=3,phi=104,box=FALSE)

39 Functions
Actions can be performed on objects using functions
A function is itself an object
Functions have arguments and options, often there are defaults
Functions provide a result
The parentheses () are used to specify that a function is being called
> my.fun <- function(a,b=10) {
+ ret = a+b
+ ret
+ }
> my.fun(1)
[1] 11
> my.fun(1,2)
[1] 3

40 Statistical/Mathematical Methods and Functions
K-means Clustering
> x = rbind(matrix(rnorm(100,mean=0,sd=0.3),ncol=2),
+ matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
> head(x,n=3)
[Output: the first three rows of the two random columns; values are missing from the transcription.]
> dim(x)
[1] 100   2
> cl = kmeans(x,4)
> plot(x,col=cl$cluster)
> points(cl$centers,col=1:4,pch=8,cex=2)

41
> cl
K-means clustering with 4 clusters of sizes 16, 47, 14, 23
Cluster means: [values missing from the transcription]
Clustering vector: [values missing from the transcription]
Within cluster sum of squares by cluster: [values missing from the transcription]
(between_SS / total_SS = 83.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

42 Statistical/Mathematical Methods and Functions
Statistical distributions
Density (d), cumulative distribution function (p), quantile function (q) and random variate generation (r).
Normal: dnorm, pnorm, qnorm, rnorm
Beta: dbeta, pbeta, qbeta, rbeta
F: df, pf, qf, rf
Some basic mathematical operators
log, exp
mean, median, mode, max, min, sd (note that mode() returns the storage mode of an object, not the statistical mode)
trigonometry
set operations
logical operators: <, <=, >, >=, ==, !=

43
> pnorm(1,mean=0,sd=1) - pnorm(-1,mean=0,sd=1)
[1] 0.6826895
> pnorm(2,mean=0,sd=1) - pnorm(1,mean=0,sd=1)
[1] 0.1359051
> qnorm(0.9785,mean=0,sd=1)
> x = seq(-3.2,3.2,by=0.01)
> y = dnorm(x)
> x.sampling = rnorm(200,mean=0,sd=1)
> hist(x.sampling,prob=TRUE,xlim=c(-3.2,3.2),ylim=c(0,0.5),xlab="",ylab="",main="")
> lines(x,y)
> lines(density(x.sampling),col="red")
> legend("topright",c("dnorm: exact dist'n","rnorm: random number"),col=c(1,2),lty=1)
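The relationship between the d, p, q and r functions can be checked numerically. The sketch below does this for the standard normal distribution; the variable names and the seed are arbitrary choices made for illustration.

```r
# q is the inverse of p: qnorm(pnorm(z)) recovers z.
p <- pnorm(1.5)
z <- qnorm(p)

# p is the integral of d: the area under dnorm between -1 and 1
# equals pnorm(1) - pnorm(-1).
area <- integrate(dnorm, lower = -1, upper = 1)$value
prob <- pnorm(1) - pnorm(-1)

# r draws random variates: the share of draws within one standard
# deviation should be close to prob.
set.seed(1)
draws <- rnorm(1e5)
frac  <- mean(abs(draws) < 1)

c(z = z, area = area, prob = prob, frac = frac)
```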

44 Vector and Matrix
Vector: all elements are the same data type
> v1 = c(1,2,3,4,5,6)
> v2 = c("one","two","three","four","five","six")
> v3 = c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
> v1
[1] 1 2 3 4 5 6
> v2
[1] "one"   "two"   "three" "four"  "five"  "six"
> v3
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
Matrix: all elements are the same data type
> m = matrix(LETTERS[1:12],nrow=3,ncol=4)
> m
     [,1] [,2] [,3] [,4]
[1,] "A"  "D"  "G"  "J"
[2,] "B"  "E"  "H"  "K"
[3,] "C"  "F"  "I"  "L"

45 Vector and Matrix
Data Frame: different columns can have different data types
List: an ordered collection of various elements
> dframe = data.frame(v1,v2,v3)
> dframe
  v1    v2    v3
1  1   one  TRUE
2  2   two FALSE
3  3 three  TRUE
4  4  four FALSE
5  5  five  TRUE
6  6   six FALSE
> l = list(n="Sean",mat=m,df=dframe)
> l
$n
[1] "Sean"

$mat
     [,1] [,2] [,3] [,4]
[1,] "A"  "D"  "G"  "J"
[2,] "B"  "E"  "H"  "K"
[3,] "C"  "F"  "I"  "L"

$df
  v1    v2    v3
1  1   one  TRUE
2  2   two FALSE
3  3 three  TRUE
4  4  four FALSE
5  5  five  TRUE
6  6   six FALSE

46 Some Useful Functions
apply(iris[,1:3], 1, mean)
Calculates the mean of columns 1-3 for each row of the sample data frame 'iris'. With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
tapply(iris[,4], iris$Species, mean)
Calculates the mean values for the 4th column based on the grouping information in the 'Species' column of the 'iris' data frame.
sapply(x, sqrt)
Calculates the square root of each element in the vector x. Generates the same result as 'sqrt(x)'.
lapply(X, FUN)
Returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
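The row-wise versus column-wise distinction is easiest to see on a tiny matrix. The following sketch uses made-up data and illustrative variable names.

```r
m <- matrix(1:6, nrow = 2)      # rows are (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)    # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)    # MARGIN = 2: one result per column

# tapply: apply a function to groups defined by a second vector.
grp <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), mean)

# lapply always returns a list; sapply simplifies to a vector when it can.
sq_list <- lapply(1:3, function(i) i^2)
sq_vec  <- sapply(1:3, function(i) i^2)
```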

47 Some Useful Functions: Examples
> iris.part = iris[1:10,1:3]
> iris.part # first 10 rows of Sepal.Length, Sepal.Width, Petal.Length
> apply(iris.part,1,mean) # one mean per row
> apply(iris.part,2,mean) # one mean per column
> tapply(iris[,4],iris$Species,mean) # mean Petal.Width for setosa, versicolor, virginica
> sapply(iris[,1:3],mean) # column means of Sepal.Length, Sepal.Width, Petal.Length
> apply(iris[,1:3],2,mean) # same result as the sapply call

49 High-Performance and Parallel Computing with R
Parallel computing: Explicit parallelism
Parallel computing: Implicit parallelism
Parallel computing: Grid computing
Parallel computing: Hadoop
Parallel computing: Random numbers
Parallel computing: Resource managers and batch schedulers
Parallel computing: Applications
Parallel computing: GPUs
Large memory and out-of-memory data

50 Parallel Computing Basics
Serial and parallel tasks:
Serial: the problem is broken into a discrete series of instructions (Task 1, Task 2, Task 3, ..., Task n, Task n+1) that are processed one after another.
Parallel: the problem is broken into discrete parts (Task 2.1 ... Task 2.m, Task 3.1 ... Task 3.m, ..., Task n.1 ... Task n.m) that can be solved concurrently.
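The decomposition described above can be sketched with the parallel package that ships with R. The task (a sum of squares), the chunking into four parts, and the choice of two worker processes are assumptions made for illustration.

```r
library(parallel)

# Hypothetical task: the sum of squares of 1..1000, split into 4 parts.
x      <- 1:1000
chunks <- split(x, rep(1:4, each = 250))
part_sum <- function(v) sum(as.numeric(v)^2)

# Serial: the parts are processed one after another.
serial_total <- sum(unlist(lapply(chunks, part_sum)))

# Parallel: the same parts are distributed over 2 worker processes.
cl <- makeCluster(2)
parallel_total <- sum(unlist(parLapply(cl, chunks, part_sum)))
stopCluster(cl)
```

For a task this small the cost of starting workers outweighs any gain; the pattern pays off only when each part carries substantial work.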

51 Package: data.table
An extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns
> library(data.table)
> v1 = c(1,2,5.3,6,-2,4)
> v2 = c("one","two","three","four","five","six")
> v3 = c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
> my.datatable = data.table(v1,v2,v3)
> my.datatable
     v1    v2    v3
1:  1.0   one  TRUE
2:  2.0   two  TRUE
3:  5.3 three  TRUE
4:  6.0  four FALSE
5: -2.0  five  TRUE
6:  4.0   six FALSE
> my.datatable[2]
   v1  v2   v3
1:  2 two TRUE
> my.datatable[,v2]
[1] "one"   "two"   "three" "four"  "five"  "six"
> my.datatable[,sum(v1),by=v3]
      v3   V1
1:  TRUE  6.3
2: FALSE 10.0
> setkey(my.datatable,v2)
> my.datatable["five"]
   v1   v2   v3
1: -2 five TRUE

52 Package: plyr
plyr is a set of tools for splitting, applying and combining data
Functions are named according to the data structures used (a: array, l: list, d: data.frame, m: multiple inputs, r: repeat multiple times)
Provides a set of helper functions for common data analysis
> library(plyr)
> data(iris)
> head(iris,n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> count(iris,vars="Species")
     Species freq
1     setosa   50
2 versicolor   50
3  virginica   50

53 Package: plyr (cont.)
summarise works in an analogous way to mutate, except instead of adding columns to an existing data frame, it creates a new data frame. This is particularly useful in conjunction with ddply as it makes it easy to perform group-wise summaries.
> is.data.frame(iris)
[1] TRUE
> dim(iris)
[1] 150   5
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
> summarise(iris,mean_petal_length=mean(Petal.Length),
+ max_sepal_length=max(Sepal.Length))
  mean_petal_length max_sepal_length
1             3.758              7.9

54 Package: plyr (cont.)
ddply: For each subset of a data frame, apply a function, then combine the results into a data frame.
daply: For each subset of a data frame, apply a function, then combine the results into an array. daply with a function that operates column-wise is similar to aggregate.
> ddply(iris,.(Species),summarise,mean_petal_length=mean(Petal.Length),
+ max_sepal_length=max(Sepal.Length))
[Output: one row per species (setosa, versicolor, virginica) with columns mean_petal_length and max_sepal_length; values are missing from the transcription.]
> daply(iris[,c(1,2,5)],.(Species),colwise(mean))
[Output: one row per species with the means of Sepal.Length and Sepal.Width; values are missing from the transcription.]
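For comparison with the remark that column-wise daply is similar to aggregate, the same group-wise means can be sketched in base R without plyr. The names means and petal are introduced here for illustration.

```r
# Group-wise means of two columns, one row per species.
means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                   data = iris, FUN = mean)
means

# Group-wise mean of a single column.
petal <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
petal
```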

55 Package: RJSONIO
JSON: JavaScript Object Notation, a lightweight data-interchange format
> library(RJSONIO)
> json = toJSON(list(a=c(1,2,3),name="Markus"))
> cat(json)
{ "a": [ 1, 2, 3 ], "name": "Markus" }
> robj = fromJSON(json)
> robj
$a
[1] 1 2 3

$name
[1] "Markus"

56 Package: bigmemory, biganalytics, foreach, ...
Available on Unix-alikes, including Mac.
Manage massive matrices with shared memory and memory-mapped files
> library(bigmemory)
> library(biganalytics)
> library(foreach)
> x = rbind(matrix(rnorm(100,sd=0.3),ncol=2),
+ matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
> bigmatrix = as.big.matrix(x)
> res = bigkmeans(bigmatrix,3)
> res
K-means clustering with 3 clusters of sizes 27, 49, 24
Cluster means: [values missing from the transcription]
Clustering vector: [values missing from the transcription]
Within cluster sum of squares by cluster: [values missing from the transcription]
Available components:
[1] "cluster" "centers" "withinss" "size"

57 Package: parallel
First version was released with R
Contains functionality derived from and pretty much equivalent to the multicore and snow packages
> N = 1000
> x = list(a=rnorm(N),b=rbeta(N,2,3))
> lapply(x,mean) # serial: returns the means $a and $b
> library(parallel)
> cl = makeCluster(2)
> parLapply(cl,x,mean) # same result, computed on 2 worker processes
> stopCluster(cl)
> mclapply(x,mean) # same result, computed in forked processes

58 Package: Rcpp
The Rcpp package provides C++ classes that greatly facilitate interfacing C or C++ code in R packages using the .Call interface provided by R.
A clean, approachable API that lets you write high-performance code.
Can help with loops, recursive functions and functions with advanced data structures
> library(Rcpp)
> cppFunction('
+ int add(int x, int y, int z) {
+ int sum = x+y+z;
+ return sum;
+ }
+ ')
> add
function (x, y, z) .Primitive(".Call")(<pointer: 0x c1770>, x, y, z)
> add(2014,12,1)
[1] 2027
> cppFunction('
+ int fibonacci(const int x) {
+ if (x == 0) return(0);
+ if (x == 1) return(1);
+ return (fibonacci(x - 1)) + fibonacci(x - 2);
+ }
+ ')
> fibonacci(20)
[1] 6765

59 Package: ggplot2
ggplot2 is useful for producing complex graphics relatively simply.
An implementation of the book The Grammar of Graphics by Leland Wilkinson
The basic notion is that there is a grammar to the composition of graphical components in statistical graphics
By directly controlling that grammar, you can generate a large set of carefully constructed graphics from a relatively small set of operations
A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics.

60
> library(ggplot2)
> qplot(Sepal.Length,Petal.Length,data=iris,color=Species) # top-left
> res = qplot(Sepal.Length,Petal.Length,data=iris,color=Species,size=Petal.Width,alpha=I(0.5))
> res # top-right
> res+geom_line(size=1) # bottom-left
> res+geom_boxplot(size=0.2,alpha=I(0.3)) # bottom-right

61 Shiny
Easy Web Application
Developed by RStudio
Turn analyses into interactive web applications that anyone can use
Let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
Easily incorporate any number of outputs like plots, tables, and summaries
No HTML or JavaScript knowledge is necessary, only R
Hello World Shiny
> library(shiny)
> runExample("01_hello")

62 Package: bigvis
Revolution R Enterprise
Tools for exploratory data analysis of large data sets (76 million)

63 R and Databases
SQL provides a standard language to filter, aggregate, group, and sort data
SQL-like query languages are showing up in new places: Hadoop Hive, ...
ODBC provides an SQL interface to non-database data: Excel, CSV, text files, ...
R stores relational data in data frames.
DB-Engines Ranking (score and change columns are missing from the transcription):
Rank  DBMS                  Database Model
1     Oracle                Relational DBMS
2     MySQL                 Relational DBMS
3     Microsoft SQL Server  Relational DBMS
4     PostgreSQL            Relational DBMS
5     MongoDB               Document store
6     DB2                   Relational DBMS
7     Microsoft Access      Relational DBMS
8     SQLite                Relational DBMS
9     Cassandra             Wide column store
10    Sybase ASE            Relational DBMS

64 Package: sqldf
sqldf is an R package for running SQL statements on R data frames
SQL statements are written in R using data frame names in place of table names
A database with appropriate table layouts/schema is automatically created, and the data frames are automatically loaded into the database
The result is read back into R
sqldf supports the SQLite back-end database (default), the H2 Java database, the PostgreSQL database and MySQL

65 Package: sqldf (cont.)
> library(sqldf)
> sqldf('select * from iris limit 4')
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
> sqldf('select count(*) from iris')
  count(*)
1      150
> sqldf('select Species, count(*) from iris group by Species')
     Species count(*)
1     setosa       50
2 versicolor       50
3  virginica       50
> sqldf('select Species, avg("Sepal.Length") as "Sepal Avg.",
+ variance("Sepal.Width") as "Sepal Width" from iris group by Species')
[Output: one row per species with columns "Sepal Avg." and "Sepal Width"; values are missing from the transcription.]

66 Other Relational Packages
RMySQL provides an interface to MySQL
RPostgreSQL provides an interface to PostgreSQL
ROracle provides an interface to Oracle
RJDBC provides access to databases through a JDBC interface
RSQLite provides access to SQLite
Bottleneck when dealing with BIG DATA: all packages read the full result into R memory
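One general way around the memory bottleneck is to process data in chunks and keep only running aggregates. The sketch below shows the idea with base R on a CSV file rather than a database; the file, the chunk size of 100 rows, and the variable names are assumptions made for illustration.

```r
# Hypothetical CSV written to a temporary file for the demonstration.
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), csv, row.names = FALSE)

con    <- file(csv, open = "r")
header <- readLines(con, n = 1)       # consume the header line
total  <- 0
rows   <- 0
repeat {
  lines <- readLines(con, n = 100)    # read at most 100 rows at a time
  if (length(lines) == 0) break       # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = "value")
  total <- total + sum(chunk$value)   # keep only the running aggregate
  rows  <- rows + nrow(chunk)
}
close(con)
unlink(csv)
```

Only one chunk is held in memory at a time, so the approach works for aggregates that can be updated incrementally, such as sums, counts, and means.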

67 Hadoop
An open source software framework designed to support large scale data processing
MapReduce: a computational paradigm
The application is divided into many small fragments of work
HDFS: Hadoop Distributed File System
A distributed file system that stores data on the compute nodes
The ecosystem: Hive, Pig, Flume, Mahout, ...
Written in Java, opened up to alternatives by its Streaming API

68 HDFS & Hadoop Cluster
HDFS is a block-structured file system
Blocks are stored across a cluster of one or more machines with data storage capacity (datanodes)
Data is accessed in a write-once, read-many model
HDFS comes with its own utilities for file management
HDFS stores its metadata reliably (namenode)
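To make the block structure concrete, here is a small arithmetic sketch. The 128 MB block size, the replication factor of 3, and the 1000 MB file are assumed values for illustration; the actual settings depend on the Hadoop version and cluster configuration.

```r
# Assumed values for illustration only.
block_mb    <- 128     # block size in MB
replication <- 3       # number of copies kept of each block
file_mb     <- 1000    # hypothetical file size in MB

n_blocks   <- ceiling(file_mb / block_mb)           # blocks the file is split into
last_block <- file_mb - (n_blocks - 1) * block_mb   # size of the final, partial block
raw_mb     <- file_mb * replication                 # total storage used on datanodes

c(blocks = n_blocks, last_block_mb = last_block, raw_storage_mb = raw_mb)
```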

69 RHadoop
An open source project sponsored by Revolution Analytics
Packages:
rmr2 hosts all MapReduce related functions, uses the Hadoop Streaming API
rhdfs for the interaction with the HDFS file system
plyrmr convenient processing on a Hadoop cluster of large data sets
rhbase connects with Hadoop's NoSQL database HBase
Installation: https://github.com/revolutionanalytics/rhadoop

70 Package: rmr2
Dependencies: Rcpp, functional, bitops, caTools, RJSONIO, etc.
Simple Parallel Computing with R
> x = 1:5
> x
[1] 1 2 3 4 5
> unlist(lapply(x, function(y) y^2))
[1]  1  4  9 16 25
> library(parallel)
> unlist(mclapply(x, function(y) y^2))
[1]  1  4  9 16 25
A MapReduce Example Job
> library(rmr2)
> rmr.options(backend=c("local"))
NULL
> small.ints = to.dfs(keyval(1,1:100))
> out = mapreduce(
+ input=small.ints,
+ map=function(k,v) cbind(v,v^2))
> df = from.dfs(out)
> head(df$val,n=5)
     v
[1,] 1  1
[2,] 2  4
[3,] 3  9
[4,] 4 16
[5,] 5 25

71 Hadoop Hello World Example: Word Count
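The slide's own word count code is not part of the transcription. As a rough illustration of the map, shuffle, and reduce phases, the following sketch emulates them in base R on a few made-up sentences. It does not use Hadoop or rmr2, and docs, map_fn, and the other names are hypothetical.

```r
docs <- c("big data with R", "R and Hadoop", "big R")

# Map: emit one (word, 1) pair for every word in a document.
map_fn <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1, length(words)),
             stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(docs, map_fn))

# Shuffle: group the emitted values by key.
grouped <- split(pairs$value, pairs$key)

# Reduce: sum the values of each key.
counts <- sapply(grouped, sum)
counts
```

On a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle; here all three phases run in a single R process.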


More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

A Short Guide to R with RStudio

A Short Guide to R with RStudio Short Guides to Microeconometrics Fall 2013 Prof. Dr. Kurt Schmidheiny Universität Basel A Short Guide to R with RStudio 1 Introduction 2 2 Installing R and RStudio 2 3 The RStudio Environment 2 4 Additions

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Reading and writing files

Reading and writing files Reading and writing files Importing data in R Data contained in external text files can be imported in R using one of the following functions: scan() read.table() read.csv() read.csv2() read.delim() read.delim2()

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Part 3: Data Import/Export

Part 3: Data Import/Export Part 3: Data Import/Export 140.776 Statistical Computing Ingo Ruczinski Thanks to Thomas Lumley and Robert Gentleman of the R-core group (http://www.r-project.org/) for providing some tex files that appear

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Your Best Next Business Solution Big Data In R 24/3/2010

Your Best Next Business Solution Big Data In R 24/3/2010 Your Best Next Business Solution Big Data In R 24/3/2010 Big Data In R R Works on RAM Causing Scalability issues Maximum length of an object is 2^31-1 Some packages developed to help overcome this problem

More information

What is R? R s Advantages R s Disadvantages Installing and Maintaining R Ways of Running R An Example Program Where to Learn More

What is R? R s Advantages R s Disadvantages Installing and Maintaining R Ways of Running R An Example Program Where to Learn More Bob Muenchen, Author R for SAS and SPSS Users, Co-Author R for Stata Users muenchen.bob@gmail.com, http://r4stats.com What is R? R s Advantages R s Disadvantages Installing and Maintaining R Ways of Running

More information

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley Disclaimer: This material is protected under copyright act AnalytixLabs, 2011. Unauthorized use and/ or duplication of this material or

More information

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS Bogdan Oancea "Nicolae Titulescu" University of Bucharest Raluca Mariana Dragoescu The Bucharest University of Economic Studies, BIG DATA The term big data

More information

A Quick Introduction to R

A Quick Introduction to R A Quick Introduction to R Guy Lebanon January 18, 2011 1 Overview R is a relatively new programming language aimed at computational data analysis, statistical modeling, and data visualization. It is similar

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P. SQL databases An introduction AMP: Apache, mysql, PHP This installations installs the Apache webserver, the PHP scripting language, and the mysql database on your computer: Apache: runs in the background

More information

Introduction Computing at Hopkins Biostatistics

Introduction Computing at Hopkins Biostatistics Introduction Computing at Hopkins Biostatistics Ingo Ruczinski Thanks to Thomas Lumley and Robert Gentleman of the R-core group (http://www.r-project.org/) for providing some tex files that appear in part

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Big Data and Parallel Work with R

Big Data and Parallel Work with R Big Data and Parallel Work with R What We'll Cover Data Limits in R Optional Data packages Optional Function packages Going parallel Deciding what to do Data Limits in R Big Data? What is big data? More

More information

RHadoop Installation Guide for Red Hat Enterprise Linux

RHadoop Installation Guide for Red Hat Enterprise Linux RHadoop Installation Guide for Red Hat Enterprise Linux Version 2.0.2 Update 2 Revolution R, Revolution R Enterprise, and Revolution Analytics are trademarks of Revolution Analytics. All other trademarks

More information

Importing Data into R

Importing Data into R 1 R is an open source programming language focused on statistical computing. R supports many types of files as input and the following tutorial will cover some of the most popular. Importing from text

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Case Study : 3 different hadoop cluster deployments

Case Study : 3 different hadoop cluster deployments Case Study : 3 different hadoop cluster deployments Lee moon soo moon@nflabs.com HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer

More information

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required. What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

The Saves Package. an approximate benchmark of performance issues while loading datasets. Gergely Daróczi daroczig@rapporer.net.

The Saves Package. an approximate benchmark of performance issues while loading datasets. Gergely Daróczi daroczig@rapporer.net. The Saves Package an approximate benchmark of performance issues while loading datasets Gergely Daróczi daroczig@rapporer.net December 27, 2013 1 Introduction The purpose of this package is to be able

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for

More information

R data import and export

R data import and export R data import and export A. Blejec andrej.blejec@nib.si October 27, 2013 Read R can read... plain text tables and files Excel files SPSS, SAS, Stata formated data databases (like MySQL) XML, HTML files

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

DiskPulse DISK CHANGE MONITOR

DiskPulse DISK CHANGE MONITOR DiskPulse DISK CHANGE MONITOR User Manual Version 7.9 Oct 2015 www.diskpulse.com info@flexense.com 1 1 DiskPulse Overview...3 2 DiskPulse Product Versions...5 3 Using Desktop Product Version...6 3.1 Product

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7 DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7 UNDER THE GUIDANCE Dr. N.P. DHAVALE, DGM, INFINET Department SUBMITTED TO INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY

More information

Deal with big data in R using bigmemory package

Deal with big data in R using bigmemory package Deal with big data in R using bigmemory package Xiaojuan Hao Department of Statistics University of Nebraska-Lincoln April 28, 2015 Background What Is Big Data Size (We focus on) Complexity Rate of growth

More information

Bigger data analysis. Hadley Wickham. @hadleywickham Chief Scientist, RStudio. Thursday, July 18, 13

Bigger data analysis. Hadley Wickham. @hadleywickham Chief Scientist, RStudio. Thursday, July 18, 13 http://bit.ly/bigrdata3 Bigger data analysis Hadley Wickham @hadleywickham Chief Scientist, RStudio July 2013 http://bit.ly/bigrdata3 1. What is data analysis? 2. Transforming data 3. Visualising data

More information

Viewing Ecological data using R graphics

Viewing Ecological data using R graphics Biostatistics Illustrations in Viewing Ecological data using R graphics A.B. Dufour & N. Pettorelli April 9, 2009 Presentation of the principal graphics dealing with discrete or continuous variables. Course

More information

Open Source Technologies on Microsoft Azure

Open Source Technologies on Microsoft Azure Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions

More information

Installing R and the psych package

Installing R and the psych package Installing R and the psych package William Revelle Department of Psychology Northwestern University August 17, 2014 Contents 1 Overview of this and related documents 2 2 Install R and relevant packages

More information

Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!

Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm! Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm! Moderator: David L. Snell, ASA, MAAA Presenters: Brian D. Holland, FSA, MAAA

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling

More information

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) Use Data from a Hadoop Cluster with Oracle Database Hands-On Lab Lab Structure Acronyms: OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) All files are

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011 Dataframes Lecture 8 Nicholas Christian BIOST 2094 Spring 2011 Outline 1. Importing and exporting data 2. Tools for preparing and cleaning datasets Sorting Duplicates First entry Merging Reshaping Missing

More information

4 Other useful features on the course web page. 5 Accessing SAS

4 Other useful features on the course web page. 5 Accessing SAS 1 Using SAS outside of ITCs Statistical Methods and Computing, 22S:30/105 Instructor: Cowles Lab 1 Jan 31, 2014 You can access SAS from off campus by using the ITC Virtual Desktop Go to https://virtualdesktopuiowaedu

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Large Datasets and You: A Field Guide

Large Datasets and You: A Field Guide Large Datasets and You: A Field Guide Matthew Blackwell m.blackwell@rochester.edu Maya Sen msen@ur.rochester.edu August 3, 2012 A wind of streaming data, social data and unstructured data is knocking at

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Prof. Nicolai Meinshausen Regression FS 2014. R Exercises

Prof. Nicolai Meinshausen Regression FS 2014. R Exercises Prof. Nicolai Meinshausen Regression FS 2014 R Exercises 1. The goal of this exercise is to get acquainted with different abilities of the R statistical software. It is recommended to use the distributed

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com Big

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

ANALYTICS CENTER LEARNING PROGRAM

ANALYTICS CENTER LEARNING PROGRAM Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

More information

How, What, and Where of Data Warehouses for MySQL

How, What, and Where of Data Warehouses for MySQL How, What, and Where of Data Warehouses for MySQL Robert Hodges CEO, Continuent. Introducing Continuent The leading provider of clustering and replication for open source DBMS Our Product: Continuent Tungsten

More information

2/24/2010 ClassApps.com

2/24/2010 ClassApps.com SelectSurvey.NET Training Manual This document is intended to be a simple visual guide for non technical users to help with basic survey creation, management and deployment. 2/24/2010 ClassApps.com Getting

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what

More information

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined

More information

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze Data Warehouse and Hive Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze Decision support systems Decision Support Systems allowed managers, supervisors, and executives to once again see the

More information

Find the Hidden Signal in Market Data Noise

Find the Hidden Signal in Market Data Noise Find the Hidden Signal in Market Data Noise Revolution Analytics Webinar, 13 March 2013 Andrie de Vries Business Services Director (Europe) @RevoAndrie andrie@revolutionanalytics.com Agenda Find the Hidden

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

R Language Fundamentals

R Language Fundamentals R Language Fundamentals Data Types and Basic Maniuplation Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Where did R come from? Overview Atomic Vectors Subsetting

More information

Parallel Options for R

Parallel Options for R Parallel Options for R Glenn K. Lockwood SDSC User Services glock@sdsc.edu Motivation "I just ran an intensive R script [on the supercomputer]. It's not much faster than my own machine." Motivation "I

More information

R: A Free Software Project in Statistical Computing

R: A Free Software Project in Statistical Computing R: A Free Software Project in Statistical Computing Achim Zeileis Institut für Statistik & Wahrscheinlichkeitstheorie http://www.ci.tuwien.ac.at/~zeileis/ Acknowledgments Thanks: Alex Smola & Machine Learning

More information

Introduction Predictive Analytics Tools: Weka

Introduction Predictive Analytics Tools: Weka Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

BiDAl: Big Data Analyzer for Cluster Traces

BiDAl: Big Data Analyzer for Cluster Traces BiDAl: Big Data Analyzer for Cluster Traces Alkida Balliu, Dennis Olivetti, Ozalp Babaoglu, Moreno Marzolla, Alina Sirbu Department of Computer Science and Engineering University of Bologna, Italy BigSys

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types

More information

MIXED MODEL ANALYSIS USING R

MIXED MODEL ANALYSIS USING R Research Methods Group MIXED MODEL ANALYSIS USING R Using Case Study 4 from the BIOMETRICS & RESEARCH METHODS TEACHING RESOURCE BY Stephen Mbunzi & Sonal Nagda www.ilri.org/rmg www.worldagroforestrycentre.org/rmg

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Introduction to R and UNIX Working with microarray data in a multi-user environment

Introduction to R and UNIX Working with microarray data in a multi-user environment Microarray Data Analysis Workshop MedVetNet Workshop, DTU 2008 Introduction to R and UNIX Working with microarray data in a multi-user environment Carsten Friis Media glna tnra GlnA TnrA C2 glnr C3 C5

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information