TDWI 2013 Munich. Training - Using R for Business Intelligence in Big Data
Training - Using R for Business Intelligence in Big Data
Dr. rer. nat. Markus Schmidberger, [email protected]
June 19th, 2013

Motivation and Goals
- Today there exists a lot of data and a huge pool of analysis methods.
- R is a great tool for your Big Data analyses.
- Provide an overview of Big Data technologies in R.
- Hands-on code and exercises to get started.
Outline
1 R-project.org
2 R and Hadoop
3 R and Databases

Dr. rer. nat. Markus Schmidberger
[slide shows a word cloud: High Performance Computing, Gentleman Lab, R, Map Reduce, Big Data, Data Science, Parallel Computing, Munich/Bavaria, DevOps, Biological Data, Data Quality Management, MPI, affypara, mongodb, AWS, NoSQL, Bioconductor, PhD LMU Technomathematics, comsysto GmbH, Cloud Computing, TU Munich, Hadoop, husband, 2 kids, Snowboarding]
comsysto GmbH - Lean Company. Great Chances!
- software company specialized in lean business, technology development and Big Data
- focuses on open source frameworks
- Meetup organizer for Munich

Big Data
- a big hype topic: everything is big data, everyone wants to work with big data
- Wikipedia: "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"
Big Data - my view on data
- access to many different data sources (Internet)
- storage is cheap: store everything
- today, CIOs get interested in the power of all their data
- a lot of different and complex data has to be stored in one data warehouse: NoSQL

NoSQL - MongoDB
- databases using looser consistency models to store data
- MongoDB: open source document-oriented database system supported by 10gen
- stores structured data as JSON-like documents
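To make "JSON-like documents" concrete: a record from the zips collection used in the exercises later looks roughly like this (the field names match the query results shown in the rmongodb section below; the population and coordinate values here are only illustrative):

{ "_id": "35004", "city": "ACMAR", "state": "AL",
  "pop": 6055, "loc": [ -86.51, 33.58 ] }

Each document is self-describing, so documents in one collection do not all need the same fields; this is the "looser" model compared to a fixed relational schema.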
Big Data - my view on processing
- more and more data has to be processed
- backward-looking analysis is outdated; today we are working on quasi real-time analysis
- but CIOs are interested in forward-looking predictive analysis
- more complex methods, more processing time: Hadoop, supercomputers and statistical tools

Hadoop - a big data technology
- open-source software framework designed to support large-scale data processing
- supports data-intensive distributed applications
- inspired by Google's Map Reduce based computational infrastructure
- Map Reduce: a computational paradigm where the application is divided into many small fragments of work
- HDFS: Hadoop Distributed File System, a distributed file system that stores data on the compute nodes
- the ecosystem: Hive, Pig, Flume, Mahout, ...
- written in Java, opened up to alternatives through its Streaming API
Big Data - my view on resources
- computing resources are available to everyone and cheap
- man-power is expensive and it is difficult to hire Big Data experts:
  - Java programmer: good programming, weak statistical background
  - statistician: good methodology, weak programming and database knowledge
  - Big Data analyst: Fachkraeftemangel (skills shortage)
- welcome to the training "Use R for BI in Big Data" to solve this problem

Outline
1 R-project.org: R vs. SAS vs. Julia, R books, Installation, RStudio IDE, R as calculator, Packages: ggplot2 & shiny, Exercise 1
2 R and Hadoop
3 R and Databases
R-project.org
- open-source: R is a free software environment for statistical computing and graphics
- offers tools to manage and analyze data; standard statistical methods are implemented
- compiles and runs under different systems: UNIX, Windows, Mac OS
- support via the R mailing lists by members of the core team: R-announce, R-packages, R-help, R-devel, ...
- support via several manuals and books
- huge online libraries with R packages: CRAN, BioConductor (genomic data), Omegahat, R-Forge
- all statistical analysis tools and latest developments
- possibility to write personalized code and to contribute new packages
- really famous since January 6, 2009: The New York Times, "Data Analysts Captivated by R's Power"
R vs. SAS vs. Julia
- R is open source, SAS is a commercial product and Julia a very new dynamic programming language
- R is free and available to everyone
- R code is open source and can be modified by everyone
- R is a complete, self-contained programming language
- R has a big and active community

R books
[slide shows four R book covers, (a)-(d)]
Installation
- installation and download: Linux, Mac OS X, and Windows are supported
- starting R: R (at your command line)
- editors: Emacs, Tinn-R, Eclipse, RStudio, ...
- quitting R: > q()
RStudio IDE
[screenshots of the RStudio IDE]
R as calculator

> (5+5) - 1 * 3
[1] 7
> abs(-5)
[1] 5
> x <- 3
> x
[1] 3
> x^2 + 4
[1] 13
> sum(1,2,3,4)
[1] 10

> help("mean")
> ?mean
> x <- 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y <- c(1,2,3)
> y
[1] 1 2 3
> x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x[3:7]
[1] 3 4 5 6 7
> x[-2]
[1]  1  3  4  5  6  7  8  9 10
load packages and data

> # install.packages("onion")
> library(onion)
> data(bunny)
> head(bunny, n=3)
       x    y    z
[1,] ...  ...  ...
[2,] ...  ...  ...
[3,] ...  ...  ...
> p3d(bunny, theta=3, phi=104, box=FALSE)

[3D scatter plot of the Stanford bunny point cloud; the numeric coordinates were lost in the transcription]

functions

> myfun <- function(x,y){
+   res <- x + y^2
+   return(res)
+ }
> myfun(2,3)
[1] 11
> myfun
function(x,y){
  res <- x + y^2
  return(res)
}
many statistical functions, e.g. k-means clustering

> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+            matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
> head(x, n=3)
     [,1] [,2]
[1,]  ...  ...
[2,]  ...  ...
[3,]  ...  ...
> dim(x)
[1] 100   2
> cl <- kmeans(x, 4)
> cl
K-means clustering with 4 clusters of sizes 16, 35, 28, 21

Cluster means:
  [,1] [,2]
   ...  ...

Clustering vector:
   ...

Within cluster sum of squares by cluster:
   ...
(between_SS / total_SS =  85.3 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"

> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:4, pch = 8, cex = 2)

[scatter plot of x[,1] vs. x[,2], points colored by cluster, with the four cluster centers marked; most numeric values above were lost in the transcription]
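One practical note, added here rather than taken from the slide: kmeans() starts from random centers, so the cluster sizes and means above will differ from run to run. Fixing the random number generator seed first makes the demo reproducible:

> set.seed(42)        # any fixed seed makes rnorm() and kmeans() reproducible
> cl <- kmeans(x, 4)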
Package: ggplot2
- Author: Prof. Hadley Wickham (since 2012 RStudio member)
- useful for producing complex graphics relatively simply
- an implementation of the Grammar of Graphics book by Leland Wilkinson
- the basic notion is that there is a grammar to the composition of graphical components in statistical graphics
- by directly controlling that grammar, you can generate a large set of carefully constructed graphics from a relatively small set of operations
- "A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics."

> library(ggplot2)
> head(mtcars, n=3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  0    4    1
> summary(mtcars)
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
> pie <- ggplot(mtcars, aes(x = factor(1),
+                           fill = factor(cyl))) +
+   geom_bar(width = 1,
+            position = "fill", color = "black")
> pie

[stacked bar chart: count vs. factor(1), filled by factor(cyl); cyl = number of cylinders]

> pie + coord_polar(theta = "y")

[pie chart: the same stacked bar transformed to polar coordinates]
Shiny - easy web applications
- developed by RStudio
- turn analyses into interactive web applications that anyone can use
- let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
- easily incorporate any number of outputs like plots, tables, and summaries
- no HTML or JavaScript knowledge is necessary, only R

Shiny - Server
- node.js based application to host Shiny Apps on a webserver
- developed by RStudio
- the RApache package provides similar functionality to host and execute R code
- RApache is more difficult to use, but offers more flexibility
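To make the two-file structure of a Shiny app concrete before the hello-world example on the next slide, here is a minimal sketch in the spirit of the bundled 01_hello example (paraphrased from memory of the Shiny examples, so treat the details as approximate):

## ui.R - user-interface definition
library(shiny)
shinyUI(pageWithSidebar(
  headerPanel("Hello Shiny!"),
  sidebarPanel(
    sliderInput("obs", "Number of observations:",   # the input parameter
                min = 1, max = 1000, value = 500)
  ),
  mainPanel(plotOutput("distPlot"))                 # placeholder for the plot
))

## server.R - server script
library(shiny)
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    hist(rnorm(input$obs))    # re-drawn whenever the slider changes
  })
})

A directory containing these two files is started locally with runApp("directory").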
Hello World Shiny
- a simple application that generates a random distribution with a configurable number of observations and then plots it

> library(shiny)
> runExample("01_hello")

- Shiny applications have two components: a user-interface definition (ui.R) and a server script (server.R)
- for more documentation check the tutorial

Exercise 1
- Connect to the RStudio Server: user01 - user30 (pw: comsysto)
- Use R as calculator: in your home directory you can find a file exercise1.R. Run all the commands. R experts will find a small challenge at the end of the file.
- Check the Shiny App running on the training server: in your home directory you can find a folder ShinyApp. This folder holds all the code for several example Shiny Apps. There are 11 different Shiny Apps. Go to the URL of one or two other Shiny Apps from your user, e.g.: /users/userXX/05_sliders
- Feel free to make changes and check the results in your browser.
Outline
1 R-project.org
2 R and Hadoop: Short Map Reduce intro, Package rmr2, Package rhdfs, RHadoop advanced, RHadoop k-means clustering, Exercise 2
3 R and Databases

RHadoop
- an open source project sponsored by Revolution Analytics
- package overview:
  - rmr2: hosts all Map Reduce related functions, uses the Hadoop Streaming API
  - rhdfs: for interaction with the HDFS file system
  - rhbase: connect with Hadoop's NoSQL database HBase
- installation is the biggest challenge: check the web for installation guidelines
- works with the MapR and Cloudera Hadoop distributions
- so far there is no AWS EMR support
HDFS and Hadoop cluster
- HDFS is a block-structured file system
- blocks are stored across a cluster of one or more machines with data storage capacity: DataNode
- data is accessed in a write once and read many model
- HDFS comes with its own utilities for file management (a sketch of calling them from R follows the MapR slide below)
- the HDFS file system stores its metadata reliably: NameNode

Hadoop Distribution
- the training system runs on MapR, an enterprise-grade platform for Hadoop
- complete Hadoop distribution, comprehensive management suite, industry-standard interfaces
- combines open source packages with enterprise-grade dependability and higher performance
- mount Hadoop with Direct Access NFS
- visit MapR at our conference desk for more information
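The HDFS file-management utilities mentioned above are shell commands; from R they can simply be called via system(). A small sketch (the /user/tdwi path is made up for illustration):

> system("hadoop fs -ls /")                          # list the HDFS root directory
> system("hadoop fs -mkdir /user/tdwi")              # create a directory (hypothetical path)
> system("hadoop fs -put exercise2.R /user/tdwi")    # copy a local file into HDFS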
Parallel Computing basics

Serial and parallel tasks:
[diagram: on the left, a problem broken into a discrete series of instructions (Task 1, Task 2, ..., Task n+1) processed one after another; on the right, a problem broken into discrete parts (Task 2.1 ... Task n.m) that can be solved concurrently]

Simple Parallel Computing with R

> x <- c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> unlist( lapply(x, function(y) y^2) )
[1]  1  4  9 16 25
> library(parallel)
> unlist( mclapply(x, function(y) y^2) )
[1]  1  4  9 16 25
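A note beyond the slide: by default mclapply() forks only two worker processes; the mc.cores argument controls how many cores it uses, and detectCores() reports what the machine offers. (mclapply relies on forking, so on Windows it only works with mc.cores = 1.)

> library(parallel)
> detectCores()                                    # number of cores on this machine
> unlist( mclapply(x, function(y) y^2,
+                  mc.cores = detectCores()) )     # one worker per core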
My first Map Reduce Job

> library(rmr2)
> small.ints <- to.dfs(1:1000)
> out <- mapreduce(
+   input = small.ints,
+   map = function(k, v) cbind(v, v^2))
> df <- as.data.frame(from.dfs(out))

to.dfs
- puts the data into HDFS
- it is not possible to write out big data this way, not in a scalable way; nonetheless it is very useful for a variety of uses like writing test cases, learning and debugging
- can put the data in a file of your own choosing; if you don't specify one, it will create temp files and clean them up when done
- the return value is something we call a big data object: it is a stub, that is, the data is not in memory, only some information about it
mapreduce (same code as on the previous slide)
- replaces lapply
- input is the variable small.ints, which contains the output of to.dfs
- the function to apply, called a map function, is a regular R function with a few constraints:
  - a function of two arguments, a collection of keys and one of values
  - returns key-value pairs using the function keyval, which can have vectors, lists, matrices or data.frames as arguments
  - you can avoid calling keyval explicitly; a return value x will be converted with a call to keyval(NULL, x)
- (a reduce function, which we are not using here)
- we are not using the keys at all, only the values, but we still need both to support the general map reduce case

from.dfs
- the return value of mapreduce is a big data object; you can pass it as input to other jobs or read it into memory with from.dfs
- it will fail for big data!
- from.dfs is complementary to to.dfs and returns a key-value pair collection

Hadoop Hello World - Word count
> wc.map =
+   function(., lines) {
+     keyval(
+       unlist(
+         strsplit(
+           x = lines,
+           split = pattern)),
+       1)}
> wc.reduce =
+   function(word, counts) {
+     keyval(word, sum(counts))}

> sourcefile <- "/etc/passwd"
> tempfile <- "/tmp/wordcount-test"
> file.copy(sourcefile, tempfile)
> pattern <- " +"
> out <- mapreduce(
+   input = tempfile,
+   input.format = "text",
+   map = wc.map,
+   reduce = wc.reduce,
+   combine = TRUE)
> res <- from.dfs(out)
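To actually look at the word counts, the key-value collection returned by from.dfs can be split with the rmr2 helpers keys() and values(); a small addition to the slide:

> words  <- keys(res)      # the distinct words
> counts <- values(res)    # their counts
> head(sort(setNames(counts, words), decreasing = TRUE))   # most frequent words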
Hadoop environment variables
- all RHadoop packages have to access Hadoop
- in Linux they need the correct environment variables; in R you have to set them explicitly

> Sys.setenv(HADOOP_CMD="/opt/mapr/hadoop/hadoop-<version>/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/opt/mapr/hadoop/hadoop-<version>/contrib/streaming/hadoop-streaming-<version>.jar")
> library(rmr2)
> small.ints <- to.dfs(1:1000)
> out <- mapreduce(
+   input = small.ints,
+   map = function(k, v) cbind(v, v^2))
> df <- as.data.frame(from.dfs(out))

(the exact paths were cut off in the transcription; <version> stands for the Hadoop version of your installation)

Package rhdfs
- basic connectivity to the Hadoop Distributed File System
- can browse, read, write, and modify files stored in HDFS
- the package has a dependency on rJava
- it is dependent upon the HADOOP_CMD environment variable

> Sys.setenv(HADOOP_CMD="/opt/mapr/hadoop/hadoop-<version>/bin/hadoop")
> library(rhdfs)
> hdfs.init()
> hdfs.ls()
> data <- 1:1000
> filename <- "my_smart_unique_name"
> file <- hdfs.file(filename, "w")
> hdfs.write(data, file)
> hdfs.close(file)
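Reading the object back follows the same pattern; a hedged sketch, assuming hdfs.write() used its default R serialization so the raw bytes from hdfs.read() have to be unserialized:

> file  <- hdfs.file(filename, "r")   # open the same HDFS file for reading
> raw   <- hdfs.read(file)            # raw bytes written by hdfs.write()
> data2 <- unserialize(raw)           # back to the R object 1:1000
> hdfs.close(file)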
RHadoop advanced
- rmr2 is the easiest, most productive, most elegant way to write map reduce jobs
- with rmr2: one to two orders of magnitude less code than Java
- with rmr2: readable, reusable, extensible map reduce
- rmr2 is a great prototyping, executable-spec and research language
- rmr2 is a way to work on big data sets in a way that is R-like
- "Simple things should be simple, complex things should be possible"

rmr2 is not Hadoop Streaming
- it uses Streaming, but there is no support for every single option that Streaming has
- Streaming is accessible from R with no additional packages, because R can execute an external program and R scripts can read stdin and write stdout
- map reduce programs written in rmr2 are not going to be the most efficient
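What "accessible with no additional packages" means in practice: a mapper can be a plain R script reading stdin and writing stdout, submitted through the streaming jar. A hedged sketch, with hypothetical jar and HDFS paths, assuming the script is marked executable:

#!/usr/bin/env Rscript
## mapper.R - a streaming mapper: reads lines on stdin, emits "word<TAB>1" on stdout
lines <- readLines(file("stdin"))
words <- unlist(strsplit(lines, " +"))
cat(sprintf("%s\t1\n", words), sep = "")

## submit the job from R by calling the streaming jar (paths are hypothetical)
system(paste("hadoop jar /path/to/hadoop-streaming.jar",
             "-input /tmp/wordcount-test -output /tmp/wordcount-out",
             "-mapper mapper.R -file mapper.R"))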
Hive, Impala
- both you can access via RODBC (an ODBC sketch follows the demo note below)
- Hive you can also access with the R package hive
- Want to learn more? Check the Revolution Analytics webinars

RHadoop k-means clustering
- live demo
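The ODBC route mentioned in the Hive/Impala slide uses the standard RODBC calls against a Hive DSN; a hedged sketch (the DSN name and table are made up for illustration):

> library(RODBC)
> hive <- odbcConnect("HiveDSN")    # a DSN configured for the Hive ODBC driver
> sqlQuery(hive, "SELECT state, COUNT(*) FROM zips GROUP BY state")   # HiveQL, hypothetical table
> close(hive)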
Exercise 2
- Connect to the RStudio Server
- Use R to run MapReduce jobs and explore HDFS: in your home directory you can find a file exercise2.R. Run all the commands. R experts will find a small challenge at the end of the file.

Outline
1 R-project.org
2 R and Hadoop
3 R and Databases: Starting relational, R and MongoDB, rmongodb, Exercise 3
R and Databases - Starting relational
- SQL provides a standard language to filter, aggregate, group and sort data
- SQL-like query languages are showing up in new places (Hadoop Hive)
- ODBC provides a SQL interface to non-database data (Excel, CSV, text files)
- R stores relational data in data.frames

> data(iris)
> head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> class(iris)
[1] "data.frame"

Package sqldf
- sqldf is an R package for running SQL statements on R data frames
- SQL statements in R using data frame names in place of table names
- a database with appropriate table layouts/schema is automatically created, and the data frames are automatically loaded into the database
- the result is read back into R
- sqldf supports the SQLite back-end database (by default), the H2 Java database, the PostgreSQL database and MySQL
> library(sqldf)
> sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
> sqldf("select count(*) from iris")
  count(*)
1      150
> sqldf("select Species, count(*) from iris group by Species")
     Species count(*)
1     setosa       50
2 versicolor       50
3  virginica       50
> sqldf("select avg(Sepal_Length) mean, variance(Sepal_Length) var from iris")
      mean       var
1 5.843333 0.6856935

Package RODBC
- provides access to databases (including Microsoft Access and Microsoft SQL Server) through an ODBC interface

> library(RODBC)
> myconn <- odbcConnect("mydsn", uid="rob", pwd="aardvark")
> crimedat <- sqlFetch(myconn, "Crime")
> pundat <- sqlQuery(myconn, "select * from Punishment")
> close(myconn)
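The packages on the next slide all follow essentially the same connect/query/disconnect pattern, standardized by the DBI interface; a minimal sketch using RSQLite (my example, not from the slides):

> library(RSQLite)                                   # DBI plus the bundled SQLite engine
> con <- dbConnect(SQLite(), dbname = ":memory:")    # an in-memory database
> dbWriteTable(con, "iris", iris)                    # load a data frame as a table
> dbGetQuery(con, "SELECT Species, COUNT(*) FROM iris GROUP BY Species")
> dbDisconnect(con)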
Other relational packages
- RMySQL provides an interface to MySQL
- RPostgreSQL provides an interface to PostgreSQL
- ROracle provides an interface for Oracle
- RJDBC provides access to databases through a JDBC interface
- RSQLite provides access to SQLite; the source for the SQLite engine is included
- one big problem: all packages read the full result into R memory; for Big Data this can be a problem

R and MongoDB
- on CRAN there are two packages to connect R with MongoDB
- rmongodb: supported by 10gen, powerful for big data, difficult to use due to BSON objects
- RMongo: very similar to all the relational packages, easy to use, limited functionality, reads full results into R memory, does not work on Mac OS X
Package RMongo
- uses an R to Java bridge and the Java MongoDB driver

> library(RMongo)
> mongo <- mongoDbConnect("test", "localhost", 27017)
> dbShowCollections(mongo)
> dbGetQuery(mongo, "zips", "{'state': 'AL'}")
> dbGetQuery(mongo, "zips", "{'pop': {'$gt': 100000}}")   # threshold value lost in transcription
> dbInsertDocument(mongo, "test_data",
+                  '{"foo": "bar", "size": 5 }')
> dbInsertDocument(mongo, "test_data",
+                  '{"foo": "nl", "size": 10 }')
> dbGetQuery(mongo, "test_data", "{}")

supports the aggregation framework

> output <- dbAggregate(mongo, "zips",
+   c('{ "$group" : { "_id" : "$state",
+        "totalPop" : { "$sum" : "$pop" } } }',
+     '{ "$match" : { "totalPop" : { "$gte" : 10000 } } }')
+   )

in SQL:
SELECT state, SUM(pop) AS pop
FROM zips
GROUP BY state
HAVING pop > (10000)
Package rmongodb
- developed on top of the MongoDB supported C driver
- runs almost entirely in native code, so you can expect high performance
- MongoDB and rmongodb use BSON documents to represent objects in the database and for network messages
- BSON is an efficient binary representation of JSON-like objects
- there are numerous functions in rmongodb for serializing/deserializing data to/from BSON

> library(rmongodb)
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.
> mongo <- mongo.create(host="localhost")
> mongo
[1] 0
attr(,"mongo")
<pointer: 0x10512c720>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "localhost"
attr(,"name")
[1] ""
attr(,"username")
[1] ""
attr(,"password")
[1] ""
attr(,"db")
[1] "admin"
> buf <- mongo.bson.buffer.create()
> mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
> query <- mongo.bson.from.buffer(buf)
> cursor <- mongo.find(mongo, "test.zips", query)
> cursor
[1] 0
attr(,"mongo.cursor")
<pointer: 0x103c7b090>
attr(,"class")
[1] "mongo.cursor"

> res <- NULL
> while (mongo.cursor.next(cursor)){
+   tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
+   res <- rbind(res, tmp)
+ }
> mongo.cursor.destroy(cursor)
[1] FALSE
> mongo.destroy(mongo)
NULL
> head(res, n=4)
    city         loc       pop state _id
tmp "ACMAR"      Numeric,2 ... "AL"  "35004"
tmp "ADAMSVILLE" Numeric,2 ... "AL"  "35005"
tmp "ADGER"      Numeric,2 ... "AL"  "35006"
tmp "KEYSTONE"   Numeric,2 ... "AL"  "35007"
(the pop values were lost in the transcription)
> mongo <- mongo.create()
> buf <- mongo.bson.buffer.create()
> mongo.bson.buffer.start.object(buf, "pop")
[1] TRUE
> mongo.bson.buffer.append(buf, "$gt", 100000L)   # threshold value lost in transcription
[1] TRUE
> mongo.bson.buffer.finish.object(buf)
[1] TRUE
> query <- mongo.bson.from.buffer(buf)
> mongo.count(mongo, "test.zips", query)
[1] 4

> mongo.insert(mongo, "test.test_data",
+              list(foo="bar", size=5L))
[1] TRUE
> mongo.insert(mongo, "test.test_data",
+              list(foo="nl", size=10L))
[1] TRUE
> buf <- mongo.bson.buffer.create()
> query <- mongo.bson.from.buffer(buf)
> mongo.count(mongo, "test.test_data", query)
[1] 2
> cursor <- mongo.find(mongo, "test.test_data", query)
> res <- NULL
> while (mongo.cursor.next(cursor)){
+   tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
+   res <- rbind(res, tmp)
+ }
> mongo.cursor.destroy(cursor)
[1] FALSE
> mongo.destroy(mongo)
NULL
> head(res, n=3)
    _id foo   size
tmp ... "bar" 5
tmp ... "nl"  10

Create BSON objects
It is all about creating BSON query or field objects:

> b <- mongo.bson.from.list(
+   list(name="Fred", age=29, city="Boston"))
> b
    name : 2     Fred
    age : 1      29.000000
    city : 2     Boston
> l <- mongo.bson.to.list(b)
> l
$name
[1] "Fred"

$age
[1] 29

$city
[1] "Boston"
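Because mongo.bson.from.list() converts nested R lists into nested BSON documents, the $gt query that needed four buffer calls two slides back can also be built in one step; a small sketch of my own (the threshold is again a placeholder):

> query <- mongo.bson.from.list(list(pop = list("$gt" = 100000L)))   # {"pop": {"$gt": 100000}}
> mongo.count(mongo, "test.zips", query)

The buffer API remains useful when a document has to be assembled incrementally; see the help pages below.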
> ?mongo.bson
> ?mongo.bson.buffer.append
> ?mongo.bson.buffer.start.array
> ?mongo.bson.buffer.start.object

Exercise 3
- Connect to the RStudio Server
- Use RockMongo to view the content of MongoDB: user/pw: admin/admin
- Use R to connect and query MongoDB: in your home directory you can find a file exercise3.R. Run all the commands. R experts will find a small challenge at the end of the file.
Summary
- R is a powerful statistical tool to analyse many different kinds of data: from small to big
- R can access databases: MongoDB and rmongodb support big data
- R can run Hadoop jobs: rmr2 runs your Map Reduce jobs
- R is open source and there is a lot of community-driven development

The tutorial did not cover...
- parallel computing with R (package parallel)
- big data in memory with R (package bigmemory and its big.matrix class)
- visualizing big data with R (packages ggplot2, lattice, bigvis)
- the commercial R version: Revolution R Enterprise
[referenced webinar: stepping-up-to-big-data-with-r-and-python]

How to go on
- start playing around with R and Big Data
- get part of the community
- interested in more R courses hosted by comsysto GmbH?
  - two-day R beginners training (October 2013)
  - one-day R Data Handling and Graphics (plyr, ggplot, shiny, ...) (November 2013)
  - one-day R Big Data Analyses (November 2013)
- registration: [email protected]
Goodbye
- thanks a lot for your attention
- please feel free to get in contact concerning R, MongoDB and Hadoop
- meet you in one of our Meetup groups
- Twitter: @cloudhpc
- [email protected]