R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015



Similar documents
High Performance Predictive Analytics in R and Hadoop:

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Revolution R Enterprise: Efficient Predictive Analytics for Big Data

Revolution R Enterprise

Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise

In-Database Analytics Deep Dive with Teradata and Revolution R

Using Microsoft R Server to Address Scalability Issues

Big Data and Data Science: Behind the Buzz Words

Find the Hidden Signal in Market Data Noise

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

SEIZE THE DATA SEIZE THE DATA. 2015

R-Academy I Knowledge, that matters

Session 85 IF, Predictive Analytics for Actuaries: Free Tools for Life and Health Care Analytics--R and Python: A New Paradigm!

Sisense. Product Highlights.

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Advanced Big Data Analytics with R and Hadoop

ANALYTICS CENTER LEARNING PROGRAM

Fast Analytics on Big Data with H20

Data Science with R. Introducing Data Mining with Rattle and R.

Delivering Value from Big Data with Revolution R Enterprise and Hadoop

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Big Data Analytics and Optimization

White Paper: Datameer s User-Focused Big Data Solutions

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

How To Understand Data Mining In R And Rattle

Building and Deploying Customer Behavior Models

Operationalise Predictive Analytics

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

JAVASCRIPT CHARTING. Scaling for the Enterprise with Metric Insights Copyright Metric insights, Inc.

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

KnowledgeSEEKER Marketing Edition

Classroom Demonstrations of Big Data

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Sunnie Chung. Cleveland State University

Azure Machine Learning, SQL Data Mining and R

Big Data Visualization and Dashboards

WHAT S NEW IN SAS 9.4

Native Connectivity to Big Data Sources in MSTR 10

Welcome to the second half ofour orientation on Spotfire Administration.

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Geo Analysis, Visualization and Performance with JReport 13

How to Build MicroStrategy Projects on Top of Big Data Sources in the Cloud

Open Source Technologies on Microsoft Azure

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

An In-Depth Look at In-Memory Predictive Analytics for Developers

MicroStrategy Course Catalog

Ad Hoc Analysis of Big Data Visualization

Technical Paper. Performance of SAS In-Memory Statistics for Hadoop. A Benchmark Study. Allison Jennifer Ames Xiangxiang Meng Wayne Thompson

Journée Thématique Big Data 13/03/2015

2015 Workshops for Professors

Embedded Analytics & Big Data Visualization in Any App

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Introducing the Reimagined Power BI Platform. Jen Underwood, Microsoft

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Predictive Analytics

ANALYTICS IN BIG DATA ERA

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

An Open Source NoSQL solution for Internet Access Logs Analysis

Extend your analytic capabilities with SAP Predictive Analysis

Predictive Modeling Techniques in Insurance

Data Mining. Dr. Saed Sayad. University of Toronto

Data processing goes big

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Data Mining in the Swamp

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

TIBCO Spotfire Server Deployment and Administration

Data Science and Business Analytics Certificate Data Science and Business Intelligence Certificate

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Microsoft Services Exceed your business with Microsoft SharePoint Server 2010

BIG DATA TRENDS AND TECHNOLOGIES

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Big Data Visualization with JReport

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

How To Write A Trusted Analytics Platform (Tap)

P4.1 Reference Architectures for Enterprise Big Data Use Cases Romeo Kienzler, Data Scientist, Advisory Architect, IBM Germany, Austria, Switzerland

Architectures for Big Data Analytics A database perspective

SAS REPORTS ON YOUR FINGERTIPS? SAS BI IS THE ANSWER FOR CREATING IMMERSIVE MOBILE REPORTS

Starting Smart with Oracle Advanced Analytics

Client Overview. Engagement Situation. Key Requirements

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

Make Better Decisions Through Predictive Intelligence

Workday Big Data Analytics

Oracle Data Miner (Extension of SQL Developer 4.0)

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

Transcription:

R Tools Evaluation A review by Analytics @ Global BI / Local & Regional Capabilities Telefónica CCDO May 2015

R Features

What is? Most widely used data analysis software Used by 2M+ data scientists, statisticians and analysts Most powerful statistical programming language Flexible, extensible and comprehensive for productivity Create beautiful and unique data visualizations As seen in New York Times, Twitter and Flowing Data Thriving open-source community Leading edge of analytics research Fills the talent gap New graduates prefer R Text from

Importance of R is the highest paid IT skill R most-used data science language after SQL R is used by 70% of data miners R is #15 of all programming languages R growing faster than any other data science language R is the #1 Google Search for Advanced Analytics software R has more than 2 million users worldwide R Usage Growth Rexer Data Miner Survey, 2007-2013 70% of data miners report using R Text from R is the first choice of more data miners than any other software Source: www.rexeranalytics.com

Data import with Data collection (multiple connectors) CSV Text files delimited or fixed, xml, json... (1) (2) (3) Other analytics formats files (Excel, SPSS, SAS, Stata, Systat ) ODBC/JDBC connectors (9) (10) (4) (5) (6) (7) (8) Native relational database connectors (Oracle, Teradata, SQL Server, Mysql ) (11) (12) (13) (14) Hadoop connectors (Revolution RRE, Rhadoop, Rhipe, ORAAH, Rhive, SparkR, H2O) (15) (16) (17) (18) No SQL connectors (MongoDB, Cassandra, Hbase, Neo4j ) Http (SOA, WS, REST) and ftp connectors (22) (23) (24) (25) (26) (19) (20) (21) Social networks connectors (Twitter, Facebook ) (27) (28) Other enterprise tools connectors (SAP/R3, Salesforce, Splunk) () Packages reference, see last slide (29) (30) (31)

Data preparation with Variable creation and transformation Recode variables Factor variables Missing value handling Sort Merge & Join Split Aggregate (means, sums) Reshape

Traditional BI: Reports & Dashboards with Reports in Html, MS Word and Pdf with r markdown and knitr Very easy way to create reports from r markdown files with RStudio knitr http://yihui.name/knitr/ http://rmarkdown.rstudio.com/ http://www.rstudio.com/

Traditional BI: Reports & Dashboards with 1 The three most known and easiest options to publish reports in R http://www.rstudio.com/ 2 knitr http://yihui.name/knitr/ http://slidify.github.io/ https://rpubs.com/ https://rpubs.com/ 3 knitr http://yihui.name/knitr/ https://gist.github.com/ https://www.dropbox.com/es/ R Presentation http://www.rstudio.com/ https://rpubs.com/

Discover Analytics with 1 Interactive reports http://www.rstudio.com/ On-premise Shiny Server http://shiny.rstudio.com/ 2 http://www.rstudio.com/ knitr http://yihui.name/knitr/ Cloud Shinyapps.io https://www.shinyapps.io/ 3 https://www.intuitics.com

Data Visualizations with ggplot2 (http://ggplot2.org/) contains a very complete catalog of visualization widgets (PieChart, BarCharts, Directed/Undirected Graphs, CloudWords, Gauges, Tree Map, Scatter charts ) Rcharts (http://rcharts.io/) use R to create graphs in html5 by leveraging the most advanced javascript libraries for visualizations (d3js, Polycharts,Morris,NVD3,xCharts ) + + Plotly (https://plot.ly/ ) is a platform to create and publish html5 graphs from several programming languages: R, python, mathlab, excel +

Predictive Analytics with : Open Source Tools R Console - CLI http://www.r-project.org/ http://www.rstudio.com/ Rattle: A Graphical User Interface for Data Mining using R http://rattle.togaware.com/

Predictive Analytics with : Packages More than 5,000 packages for statistical, predictive analytics and data visualization Descriptive Statistics Min / Max Mean Median Quantiles Standard Deviation Variance Correlation Covariance Sum of Squares Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data Marginal Summaries of Cross Tabulations Text and figures from Sampling Subsample (observations & variables) Random Sampling Variable Selection Stepwise Regression Linear Logistic GLM Cluster Analysis Predictive & Classification Sum of Squares (cross product matrix for set variables) Multiple Linear Regression Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchy, identity, log, logit, probit Covariance Matrix Correlation Matrix Logistic Regression Classification & Regression Trees Residuals for all models Decision Trees Decision Forests Boosted Decision Trees Deployment K-Means Hierarchical Model Based Prediction (scoring) PMML Export

In Cloud As a Service https://www.elasticr.com http://www.ebi.ac.uk/tools/rcloud/ AWS http://www.louisaslett.com/rstudio_ami http://azure.microsoft.com/en-us/documentation/articles/machine-learning-r-csharp-web-service-examples https://api.blockspring.com/docs/r-quickstart-run On Premise http://www.openanalytics.eu/architect-server https://www.opencpu.org (*) http://www.rforge.net/rserve http://www.rforge.net/fastrweb http://sysbio.mrc-bsu.cam.ac.uk/rwui http://www.math.montana.edu/rweb (*) It could be run in Amazon EC2 too

Data Visualizations with Rbokeh (http://hafen.github.io/rbokeh) use R to create graphs in html5/d3js + ggvis (http://ggvis.rstudio.com/) is a data visualization package for R using Vega, a javascript html5 library ggvis + 14

R & BIG DATA

Limitations of for enterprises Big Data In-memory bound for many use cases Speed of Analysis Enterprise Readiness Single threaded by design Community support AnalyticBreadth & Depth Commercial Viability 5700+ innovative analytic packages Risk of deployment of open source

Hadoop processing modes with Figure from Method 1: Local parallel processing using all cores on one node, using local linuxfile-system data Revolution Analytics parallelr (http://projects.revolutionanalytics.com/documents/parallelr/parallerrpkgs/) Method 2: Local parallel processing using all cores on one node, reading from / to HDFS data Revolution Rhadoop (https://github.com/revolutionanalytics/rhadoop/wiki), RHIPE (https://www.datadr.org/ ), ORAAH (Oracle R Advanced Analytics for Hadoop) or package RHIVE (http://cran.rproject.org/web/packages/rhive/rhive.pdf ) Revolution Analytics parallelr (http://projects.revolutionanalytics.com/documents/parallelr/parallerrpkgs/)

Hadoop processing modes with Method 3: Hadoop (Map-Reduce) parallel processing using all cores on n nodes, using HDFS data in-situ Commercial Tool Open Source Tool

BD Analytic Tools Strenghts Most widely used data analysis and predictive software in the world A lot of packages (5000+) to do almost everything you want, kept by a huge developers community Completely free Integration with a great amount of tools (free and commercial) Multiple connectors to get a lot of type of data Not only for analytics, good to data discover and reporting too Weaknesses More difficult to learn than other software Help files are written for relatively advanced users R holds all its data in your computer s main memory. There are free and commercial tools to parallelize R but not too many alternatives Because the great amount of packages it is often difficult finding and choosing the better ones R core is quite stable, but sometimes some package changes and dependencies are not updated Integration with web apps is not mature Packages & Projects Reference (http://crantastic.org/ or http://cran.r-project.org/web/packages/) Data Access RForcecom github.com/rfsp/r (30) (31) RSAP RJDBC RCurl (29) (10) (26) yhatr RODBC (26) XML(2) (9) sqldf RHive (19) twitter(27) rjson(3) ROracle foreign (11) Rfacebook rmongodb dplyr RSQLServer RCassandra (13) tidyr xlsx github.com/nicolewhite/rneo4j RMySQL (4) (14) Hmisc rpython datadr.org RPostgresSQL rjava github.com/revolutionanalytics/rhadoop/wiki amplab-extras.github.io/sparkr-pkg Reporting & Discover Predictive manipulate rstudio.com rpubs.com shinyapps.io care rattle.togaware.com rstudio.com intuitics.com caret slidify.github.io topepo.github.io/caret rcharts.io pvclust ggvis.rstudio.com plot.ly/r yhatr mclust ggplot2.org yihui.name/knitr opencpu neuralnet github.com/bart6114/scheduler ga tm maps mapdata maps mapdata sp mapproj sp mapproj

Área Company Name 20