Big Data Analytics in R



Similar documents
Advanced Big Data Analytics with R and Hadoop

RevoScaleR Speed and Scalability

Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

High Performance Predictive Analytics in R and Hadoop:

Big Data and Data Science: Behind the Buzz Words

Hadoop & SAS Data Loader for Hadoop

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

Big Data Executive Survey

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

GETTING STARTED WITH R AND DATA ANALYSIS

Bringing Big Data Modelling into the Hands of Domain Experts

Chapter 7. Using Hadoop Cluster and MapReduce

Data Refinery with Big Data Aspects

How To Use Hp Vertica Ondemand

NoSQL for SQL Professionals William McKnight

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Outline. What is Big data and where they come from? How we deal with Big data?

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Building and Deploying Customer Behavior Models

Architectures for Big Data Analytics A database perspective

Building your Big Data Architecture on Amazon Web Services

HIGH PERFORMANCE ANALYTICS FOR TERADATA

Oracle Big Data SQL Technical Update

Big Data Are You Ready? Thomas Kyte

Modern Data Architecture for Predictive Analytics

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

Getting Started Practical Input For Your Roadmap

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Predictive Analytics: Turn Information into Insights

How To Handle Big Data With A Data Scientist

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Big Data? Definition # 1: Big Data Definition Forrester Research

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Find the Hidden Signal in Market Data Noise

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

EVERYTHING THAT MATTERS IN ADVANCED ANALYTICS

ANALYTICS IN BIG DATA ERA

The Power of Predictive Analytics

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

Data Isn't Everything

Why your business decisions still rely more on gut feel than data driven insights.

The Future of Data Management

Greenplum Database. Getting Started with Big Data Analytics. Ofir Manor Pre Sales Technical Architect, EMC Greenplum

Delivering Value from Big Data with Revolution R Enterprise and Hadoop

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Annex: Concept Note. Big Data for Policy, Development and Official Statistics New York, 22 February 2013

Advanced In-Database Analytics

BIG DATA TRENDS AND TECHNOLOGIES

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

Research Note What is Big Data?

The Rise of Industrial Big Data

Using In-Memory Computing to Simplify Big Data Analytics

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Chapter 1. Contrasting traditional and visual analytics approaches

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Beyond Traditional Management Reporting IBM Corporation

Data Mining in the Swamp

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Big Data. Fast Forward. Putting data to productive use

SQL Server 2012 Parallel Data Warehouse. Solution Brief

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

Cray: Enabling Real-Time Discovery in Big Data

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Optimized Hadoop for Enterprise

Universal PMML Plug-in for EMC Greenplum Database

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Data Centric Computing Revisited

The Big Deal about Big Data. Mike Skinner, CPA CISA CITP HORNE LLP

Better planning and forecasting with IBM Predictive Analytics

SQL Server 2005 Features Comparison

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

ANALYTICS CENTER LEARNING PROGRAM

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

Manifest for Big Data Pig, Hive & Jaql

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

A Study on Big-Data Approach to Data Analytics

Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database

The Future of Data Management with Hadoop and the Enterprise Data Hub

BIG DATA THE NEW OPPORTUNITY

Game On: How Information is Changing the Rules of Insurance

Databricks. A Primer

Big Data Comes of Age: Shifting to a Real-time Data Platform

Transcription:

Big Data Analytics in R Big Opportunity, Big Challenge September, 2011

Revolution Confidential Most advanced statistical analysis software available Half the cost of commercial alternatives The professor who invented analytic software for the experts now wants to take it to the masses 2M+ Users 2,500+ Applications Power Statistics Finance Life Sciences Predictive Analytics Data Mining Manufacturing Retail Telecom Enterprise Readiness Productivity Visualization Social Media Government

Revolution Confidential Big Data and Big Opportunities 35 Increase in Retail Net Profits Decrease in product development costs Reduction in working capital Annual savings in US healthcare spending 1 2 Worldwide data created and replicated, Zettabytes* *Source IDC, McKinsey Global Institute, 1 Zettabyte = 1 Trillion Gigabytes 3

Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue. -- Tim O Reilly 4

Challenges for data driven organizations in the Age of Analytics Revolution Confidential Access to advanced analytics tools to derive knowledge from an explosion of data Attracting analysts at all levels capable of performing advanced analytics Efficiently & cost effectively sharing analyses and disseminating knowledge throughout the organization 5

R is the most powerful & flexible statistical programming language in the world 1 Capabilities Sophisticated statistical analyses Predictive analytics Data visualization Applications Real-time trading Finance Risk assessment Forecasting Bio-technology Drug development Social networks.. and more 250 200 150 100 50 6 4 2 0-2 -4-6 MSFT [2009-01-02/2010-03-31] Last 29.29 Volume (millions): 63,760,000 Moving Average Convergence Divergence (12,26,9): MACD: 0.702 Signal: 0.712 Jan 02 2009 Apr 01 2009 Jul 01 2009 Oct 01 2009 Jan 04 2010 Mar 31 2010 30 25 20 15 Revolution Confidential 6

Exploding Popularity Driving Exploding Functionality Scholarly Activity Google Scholar hits ( 05-09 CAGR) Revolution Confidential R 46% SAS -11% SPSS -27% S-Plus 0% I ve been astonished by the rate at which R has been adopted. Four years ago, everyone in my economics department [at the University of Chicago] was using Stata; now, as far as I can tell, R is the standard tool, and students learn it first. Stata 10% Package Growth Number of R packages listed on CRAN 2500 2000 1500 1000 500 Deputy Editor for New Products at Forbes A key benefit of R is that it provides nearinstant availability of new and experimental methods created by its user base without waiting for the development/release cycle of commercial software. SAS recognizes the value of R to our customer base 0 2002 2004 2006 2008 2010 Product Marketing Manager SAS Institute, Inc Source: http://r4stats.com/popularity; Why R is a name to know in 2011, Forbes 7

The rise of Data Science Of statisticians consider themselves Data Scientists 1 1. Survey conducted at Joint Statistical Meeting August 2011 8

but what is a Data Scientist? Statistician + Computer Scientist -or- Computer Scientist + Statistician Who: Obtains Cleans Structures Explores Analyzes Interprets Data 9

Revolution Confidential Enabling Data Scientists as a seamless bridge between IT and Business IT Manager Data Architect CIO Programmer IT Operations & Data Infrastructure Model Development Statistician Data Scientist Quant Data consumption & actionable decisions Business Analyst Executive Manager 10

What is Big Data? Large Structured Data 1Gb? 1Tb? 1Pb? Unstructured Data Log data, text, voice, video Big Models Computationally intensive analyses, simulations, models with many parameters Big, like beauty is in the eyes of the beholder Common theme is inability to analyze with legacy tools 11

Why deal with Big Data? Relax assumptions of Linearity or Normality Identify Rare Events or Low Incidence Populations Gain insight by studying Microstructure of data Generate better predictions and better understanding of effects 12

Many opportunities for improvement Of Data Scientists feel the current tools are inadequate to meet their Big Data needs 1 1. Survey conducted at Joint Statistical Meeting August 2011 13

Two Big problems: capacity and speed Capacity: problems handling the size of data sets or models Data too big to fit into memory Even if it can fit, there are limits on what can be done Even simple data management can be extremely challenging Speed: even without a capacity limit, computation may be too slow to be useful 14

Multi-Threaded Processing Shared Memory Data Data Data Disk Core 0 (Thread 0) Core 1 (Thread 1) Core 2 (Thread 2) Core n (Thread n) Multicore Processor (4, 8, 16+ cores) RevoScaleR A RevoScaleR algorithm is provided a data source as input The algorithm loops over data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0). Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory When all of the data is processed a master results object is created from the intermediate results objects 15

Distributed Computing Data Partition Data Partition Data Partition Data Partition Compute Node (RevoScaleR) Compute Node (RevoScaleR) Compute Node (RevoScaleR) Compute Node (RevoScaleR) Master Node (RevoScaleR) Portions of the data source are made available to each compute node RevoScaleR on the master node assigns a task to each compute node Each compute node independently processes its data, and returns it s intermediate results back to the master node master node aggregates all of the intermediate results from each compute node and produces the final result 16

Revolution Confidential A common analytic platform across big data architectures Hadoop File Based In-database 17

Creating an Analytics Center of Excellence Individual Data Scientists H O S T S S S HDFS Database Appliance (Revolution R) Hadoop Cluster (R/Hadoop) Deployment Servers (DeployR) High Workload Clusters (Revolution/R ScaleR) Business Users 18

Use Case: Financial Analysis EXAMPLES: PREVENTING A FLASH CRASH, LEVERAGING NEW DATA SOURCES 19

R in Finance Case Study Modeling and detection of asset price jumps Source: Boudt, Leuven R/Finance 2011 20

R in Finance Case Study Seamlessly leveraging alterative data sources Source: Rothermich R/Finance 2011 21

Use Case: Internet and Social Media EXAMPLES: ONLINE USER BEHAVIOR 22

Social Media Case Study Improving profitability by studying user behavior The extensive engagement of our players provides over 15 terabytes of game data per day that we use to enhance our games by designing, testing and releasing new features on an ongoing basis. We believe that combining data analytics with creative game design enables us to create a superior player experience.. Zynga S1 Filing The Model 1 : Rank pages by quality Remove pages one by one Calculate predicted effects of removal Technically: Eliminate columns from the Markov transition matrices Multiply the initial state vector by the transition matrices Inspect changes to probability of good absorbing state 1. Decoding Customer Attention & Retention Behavior, Gaia Online, Predictive Analytics World 2011 23

Social Media Case Study Predicting Colleague-Colleague interactions at Facebook How Facebook uses R: Experimentation for large machine learning models Picking models Feature selection Small / medium one-off analyses User engagement studies Analyses to improve internal processes Analysis and visualization for social network research Applications of Colleague-Colleague study: Suggesting peer reviewers during performance review season Setting up optimally constructed teams Optimizing seating charts for productivity Automatically filtering internal feeds of employee content Suggesting new colleague interactions Giving managers insight into employees interactions Source: Criss-crossing the Org Chart Predicting Colleague interactions with R, E. Sun, user 2010 24

Online retailing case study Enabling new insights with Hadoop + R at Orbitz 1 Prototypes: User Segmentation Hotel Booking Airline Performance Insights: Purchase behavior by browser use Seasonal variation of purchase behavior 1. Distributed Data Analysis with Hadoop and R, Orbitz - OSCON 2011 25

Use Case: Healthcare & Life Sciences EXAMPLES: GENOME ANALYSIS, PREDICTING HOSPITALIZATIONS 26

Life Sciences Case Study Genome Wide Association Studies Analysis Challenge: 22 genotype files total 25Gb 3m variables (SNPs) and 3k rows (subjects) Need to run 3m regressions per analysis Solution: A computationally intensive Big Model Leverage external memory framework, data chunking, and task parallelization 27

Life Sciences Case Study Predicting future hospitalizations The Challenge: Each year, > 71M people in the US admitted to hospitals In 2006, > $30B spent on unnecessary hospital admissions Very few patients account for most spending Given patient history, medical records etc., can a predictive model be built to identify who is at risk for hospitalization? The $3 Million Heritage Health Prize: Big Data Enhancing Wellness Across the World --Kit Dotson, February 4, 2011 33% of all Kaggle competitors report using R 50% of previous competition winners used R to create the most accurate predictive models 2 1 http://www.heritagehealthprize.com/c/hhp 2 http://blog.revolutionanalytics.com/2011/04/how-kaggle-competitors-use-r.html 28

Use Case: Network Analysis EXAMPLE: TWITTER 29

Analyzing Networks The Challenge: Post 9/11, network analysis has been viewed as increasingly important Analysts face a massive amount of data and insufficient analytical capability A dizzying of single solution tools are available The R Advantage: R is a language for building solutions Community R experts have developed a diverse set of packages for analyzing networks Solutions range from basic to cutting edge igraph Graph theoretic foundations Community detection Solutions Statnet Social Science focused Statistical analysis of networks 30

Mining online social graphs: Twitter example twitter and igraph ggplot2 31

Use Case: Text & Data Mining EXAMPLE: ANALYZING IQT PRESS RELEASES 32

Data Mining The Challenge: The data explosion has made it increasingly difficult for analysts to know which signals to look for in data Data mining is about extracting small findings from large data i.e. finding the needle in the haystack The R Advantage: Unlike alternate tools, R analysts can construct and modify specific solutions to their unique data mining tasks Community R experts have developed a diverse set of packages for data mining Solutions A dizzying of single solution tools are available Rcurl, XML Online data extraction Rweka, tm Natural Language Processing and text analysis Lattice, ggplot2 graphics Reshape, plyr Data manipulation topicmodels Topic modeling Revo ScaleR Big Data analysis 33

Analyzing InQTel Press Releases Model data in less then 50 lines of R code Extract http://www.iqt.org XML Parse [ ] Base Model topicmodels Visualize ggplot2 34

Use Case: Analytics Re-use and Knowledge Sharing EXAMPLE: DOCUMENTATION WITH SWEAVE 35

Analytical reproducibility and knowledge sharing The Challenge: Importance of data sharing has been widely publicized Next stage is to improve the ease of knowledge sharing The R Difference: As the Lingua Franca of statistics, R creates a universal and binding platform for analysis Analysis scripts are easy to share and to audit R is easily integrated into many reporting and presentation tools Most solutions lack methodological transparency making sharing and duplicating analyses difficult Solutions SWeave Facilitates automated creation of professional publication ready documents Revo DeployR Integrates R into online reporting and presentation tools for real-time, interactive reporting and analysis 36

Streamlining recurring analysis Analysts often have to produce the same reports (daily, weekly, etc.), using new or updated data Suppose we had to edit an analysis based on new data Traditional Work Flow 1. Download data, inspect (Excel) 2. Copy paste relevant new data 3. Apply analysis, maybe automated 4. Produce analytical output (charts, tables, summaries) 5. Import output into document 6. Check over to make sure no new error introduced Using R and SWeave 1. Download data 2. Edit data input variable <<echo=true,keep.source=true>>= df<-read.csv( new_data.csv ) @ 3. Typeset document, send to customer 37

Use Case: Advanced Graphics & Data Visualization EXAMPLE: WIKI LEAKS ANALYSIS 38

Analyzing data with advanced graphics The Challenge: Conveying complex ideas in meaningful and understandable ways Sophisticated and flexible graphics are a critical component to any analytical presentation If a picture is worth 1,000 words, then having the tools to produce compelling data visualizations is priceless The R Difference: R has amongst the most sophisticated and comprehensive graphics capabilities of any analysis tool Graphics can be used as-is, modified, or developed from scratch Solutions 39

WikiLeaks Afghanistan War Diaries: Description 40

WikiLeaks Afghanistan War Diaries: Geospatial Analysis 41

Revolution Confidential Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 42

Revolution Confidential Follow-up Contact Information http://www.revolutionanalytics.com/ Norman H. Nie, CEO norman@revolutionanalytics.com Jeff Erhardt, COO jeff@revolutionanalytics.com Mike Minelli, VP Sales mike@revolutionanalytics.com 43