Batter Up! Advanced Sports Analytics with R and Storm

Similar documents
Why Spark on Hadoop Matters

Comprehensive Analytics on the Hortonworks Data Platform

Self-service BI for big data applications using Apache Drill

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

HDP Hadoop From concept to deployment.

HDP Enabling the Modern Data Architecture

The Future of Data Management with Hadoop and the Enterprise Data Hub

Self-service BI for big data applications using Apache Drill

Information Builders Mission & Value Proposition

Hadoop Ecosystem B Y R A H I M A.

The Future of Data Management

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Data Security in Hadoop

HADOOP VENDOR DISTRIBUTIONS THE WHY, THE WHO AND THE HOW? Guruprasad K.N. Enterprise Architect Wipro BOTWORKS

Upcoming Announcements

Modern Data Architecture for Predictive Analytics

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

ANALYTICS CENTER LEARNING PROGRAM

A Modern Data Architecture with Apache Hadoop

Integrating a Big Data Platform into Government:

Big Data Realities Hadoop in the Enterprise Architecture

How to Hadoop Without the Worry: Protecting Big Data at Scale

Consulting and Systems Integration (1) Networks & Cloud Integration Engineer

Moving From Hadoop to Spark

Cloudera Enterprise Data Hub in Telecom:

Cisco IT Hadoop Journey

#TalendSandbox for Big Data

HADOOP. Revised 10/19/2015

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Delivering Value from Big Data with Revolution R Enterprise and Hadoop

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Dominik Wagenknecht Accenture

How Companies are! Using Spark

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

The Digital Enterprise Demands a Modern Integration Approach. Nada daveiga, Sr. Dir. of Technical Sales Tony LaVasseur, Territory Leader

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Apache Hadoop: Past, Present, and Future

The Internet of Things and Big Data: Intro

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

SQL on NoSQL (and all of the data) With Apache Drill

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Cisco IT Hadoop Journey

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Big Data Management and Security

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Dell In-Memory Appliance for Cloudera Enterprise

Big Data and New Paradigms in Information Management. Vladimir Videnovic Institute for Information Management

Transforming the Telecoms Business using Big Data and Analytics

Building Your Big Data Team

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

SAP and Hortonworks Reference Architecture

Are You Big Data Ready?

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Hadoop in the Enterprise

Big Data and Data Science. The globally recognised training program

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

How To Create A Data Visualization With Apache Spark And Zeppelin

3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Case Study : 3 different hadoop cluster deployments

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Find the Hidden Signal in Market Data Noise

Interactive data analytics drive insights

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

and Hadoop Technology

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Extend your analytic capabilities with SAP Predictive Analysis

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Talend Big Data. Delivering instant value from all your data. Talend

High Performance Predictive Analytics in R and Hadoop:

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Big Data and Data Science: Behind the Buzz Words

Databricks. A Primer

Cisco Data Preparation

A Case Study of Hadoop in Healthcare

Constructing a Data Lake: Hadoop and Oracle Database United!

Databricks. A Primer

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Transcription:

Batter Up! Advanced Sports Analytics with R and Storm Meeting the Real-Time Analytics Opportunity Head-On December 11, 2014 Bill Jacobs VP Product Marketing Revolution Analytics @bill_jacobs Vineet Sharma Dir., Partner Marketing MapR Technologies Allen Day Principal Data Scientist MapR Technologies @allenday

Who Am I? Bill Jacobs, VP Product Marketing Revolution Analytics @bill_jacobs

Polling Question #1: Who Are You? (choose one) Statistician or modeler Data Scientist Hadoop Expert Application builder Data guru Business user Baseball fan

Sports Analytics as Analogy. Sports Teams Are Like Other Corporations. Great Value Achievable With Data Vast Range of Data Sources Timely Analysis Amplifies Value And apologies if you came to learn whom to bet upon in next year s season.

Game Changing Big Data Analytics Applications Marketing: Clickstream & Campaign Analyses Digital Media: Recommendation Engines Social Media: Sentiment Analysis Retail: Purchase Prediction Insurance: Fraud Waste and Abuse Healthcare Delivery: Treatment Outcome Prediction Risk Analysis: Insurance Underwriting Manufacturing: Predictive Maintenance Operations: Supply Chain Optimization Econometrics: Market Prediction Marketing: Mix and Price Optimization Life Sciences: Pharmacogenetics Transportation: Asset Utilization

Polling Question #2: What Language or Tools Is In Use for Analytics (check all that apply) R SAS or SPSS Python Java BI tools including: MSTR, Qlik, Tableau, Business Objects, Cognos Salford Systems or MATLAB H20, RapidMiner, KNIME or similar Other data mining tools Other programming languages None or Don t know

R Open Source - Language, Community, Collaboration - Robert Gentleman & Ross Ihaka, 1993 - Version 1.0 released 2000-2.5 Million Global Users - Over 4,800 add-on Packages - Why R? R in Universities = New Talent WELCOME & INTRODUCTIONS Emerging Modeling/Visualization Lower Cost Alternative Open Source = Flexible & Innovative Access to Free Packages

R is exploding in popularity & functionality R Usage Growth Rexer Data Miner Survey, 2007-2013 70% of data miners report using R R is the first choice of more data miners than any other software Source: www.rexeranalytics.com

Innovate with R Most widely used data analysis software Used by 2M+ data scientists, statisticians and analysts Most powerful statistical programming language Flexible, extensible and comprehensive for productivity Create beautiful and unique data visualizations As seen in New York Times, Twitter and Flowing Data Thriving open-source community Leading edge of analytics research Fills the talent gap New graduates prefer R White Paper R is Hot bit.ly/r-is-hot

Polling Question #3: How are you using R today? (choose one) Not using R Studying R now Initial R project(s) underway R is widely used for exploration & modeling R is deployed into production

Revolution Analytics In A Nutshell Our Vision: R is becoming the defacto standard for enterprise predictive analytics Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges

Revolution Analytics Builds & Delivers: Software Products: Support & Services Stable Distributions Commercial Support Programs Broad Platform Support Training Programs Big Data Analytics in R Professional Services Application Integration Academic Support Programs Deployment Platforms IP Indemnification Agile Development Tooling Future Platform Support

Revolution R Advantages for Analytics Professionals: Broadly-used, scalable R language Large (2M+), collaborative, young R analytics community Largest repository of statistical & analytical algorithms Big data analytics capabilities Scales from workstations to Hadoop Transparent parallelism Cross platform compatibility Multi-platform architectures Broadens career opportunities

Revolution R Advantages for Business Executives Viable Alternative to Legacy Analytics Solutions Predictable Time To Results Simplified Licensing All-Inclusive Environment Lower Staffing Costs Controllable Open Source Risks Support IP Infringement Protections

Revolution R Advantages for IT Organizations Consistency Across Platforms Avoids Sprawl Support for Workstations, Servers, Hadoop, EDWs and Grids Heterogeneous Architecture Capabilities Integrates With Major BI & Application Tools Streamline Model Deployment Run Complex Analytics in the Data Lake Be a Good Citizen in shared systems Commercial Support Reduces Project Risks Quick Start Programs Accelerate Results Platform Continuity Future-Proofs Architectures

Revolution R Enterprise: Predictive Analytics Across Huge Data in Hadoop YARN Exploration Visualization Predictive Modeling HDFS

Polling Question #4: Stage of Hadoop Adoption? (choose one) No Need Studying Setting-Up Hadoop Experimenting with Hadoop Deploying Hadoop Now Hadoop in Production

Introducing: Vineet Sharma Director, Partner Marketing MapR Technologies 2014 MapR Technologies 18

MapR + Revolution Leverages MapR As A Scalable Enterprise R Engine. Plus: Run RRE Analytics In MapR Hadoop Without Change Broad Adoption of Hadoop for Big Data Analytics MapR Enterprise Data Platform Capabilities Rapid Adoption of R Eliminate Need To Design Parallel Software or Think In MapReduce Leverage All Revolution R Enterprise Pre-Parallelized Algorithms Enable Users To Build Custom Apps That Leverage Hadoop s Parallelism Slash Data Movement by Analyzing Data Inside the MapR Data Platform Expand Deployment and Integration Options 2014 MapR Technologies 19

Desktop Users with Analytical Access to Huge Data in Hadoop Predictive Modeling Algorithms MapR FS Data 2014 MapR Technologies 20

MapR: Best Solution for Customer Success Top Ranked Exponential Growth 500+ Customers Premier Investors >2x annual bookings 90% software licenses 80% of accounts expand 3X < 1% lifetime churn > $1B in incremental revenue generated by 1 customer 2014 MapR Technologies 21

Management The Power of the Open Source Community APACHE HADOOP AND OSS ECOSYSTEM Batch Tez* ML, Graph SQL NoSQL & Search Streaming Data Integration & Access Security Workflow & Data Governance Provisioning & coordination Spark Drill Cascading GraphX Shark Accumulo* Hue Savannah* Pig MLLib Impala Solr Storm* HttpFS Juju MapReduce v1 & v2 Mahout Hive HBase Spark Streaming Flume Knox* Falcon* Whirr YARN Sqoop Sentry* Oozie ZooKeeper EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS MapR-FS MapR Data Platform MapR-DB * Certification/support planned 2014 MapR Technologies 22

Management MapR Distribution for Hadoop APACHE HADOOP AND OSS ECOSYSTEM Batch Tez* ML, Graph SQL NoSQL & Search Streaming Data Integration & Access Security Workflow & Data Governance Provisioning & coordination Spark Drill Cascading GraphX Shark Accumulo* Hue Savannah* Pig MLLib Impala Solr Storm* HttpFS Juju MapReduce v1 & v2 Mahout Hive HBase Spark Streaming Flume Knox* Falcon* Whirr YARN Sqoop Sentry* Oozie ZooKeeper High availability Data protection Disaster recovery EXECUTION ENGINES Standard file access Standard database access Pluggable MapR-FS services Broad developer support 2X to 7X higher performance Consistent, low latency Ability to logically divide a cluster to support different use cases, job MapR Data Platform types, user groups, and administrators DATA GOVERNANCE Enterprise security AND OPERATIONS authorization Wire-level authentication MapR-DB Data governance Ability to support predictive analytics, real-time database operations, and support high arrival rate data Enterprise-grade Interoperability Performance Multi-tenancy Security Operational * Certification/support planned 2014 MapR Technologies 23

Management Revolution R Enterprise and MapR Hadoop R on the Desktop APACHE HADOOP & OSS ECOSYSTEM ZooKeeper Oozie Hue Pig Hive Impala Shark Drill Tez Via Browsers Flume HttpFS Cascading Solr Juju Mahout MLLib Knox Sentry Sqoop Whirr HBase YARN Storm Spark Streaming Spark Falcon BI and other Apps MapR Data Platform Edge Node YARN MapReduce Mobile

Introducing: Allen Day Principal Data Scientist MapR Technologies @allenday 2014 MapR Technologies 25

Talk Overview Agile Real-time Stats R + Storm github.com/allenday/r-storm Agile Methods Advanced Statistics DEMO How to do it? Continuous Real-time Delivery Q & A @allenday github.com/allenday/hadoop-summit-r-storm-demo-public 2014 MapR Technologies 26

Architecting R into the Storm Application Development Process 2014 MapR Technologies 2014 MapR Technologies 27

Quick intro Allen Day, Principal Data Scientist [ @allenday ] 7yr Hadoop dev, 12yr R dev/author PhD, Human Genetics, UCLA Medicine 2014 MapR Technologies 28

What s Storm? What s R? What s Storm? Processes a data stream. Akin to UNIX pipe + tee & merge commands Runs on a cluster. Fault-tolerant and designed to scale out Used for: real-time analytics & machine learning What s R? Programming language with advanced statistics libraries Does not scale out. Can scale up Used for: prototyping, data modeling, visualization How to combine these? 2014 MapR Technologies 29

R outside, Storm inside: not practical. Why? User R Model-building and QA is done on data snapshots Storm However, R => Hadoop is realistic. Key difference: referenced data can be static Use MapR snapshots for dev and QA See also: RHIPE (Purdue) and RHadoop (RevolutionAnalytics) 2014 MapR Technologies 30

Storm outside, R inside: a good fit User Storm R Enables separation of concerns Independently manage modeling, ops timelines, and version control Integrate as needed Enables role specialization R built-ins allow faster iteration and more concise stats-type code Do DevOps with specific SW engineering tech, e.g. Java 2014 MapR Technologies 31

Q: Who really likes statistics? A: Baseball fans A: Team Managers = Portfolio Managers 2014 2014 MapR MapR Technologies Technologies 32

Famous Vintage Data Oakland Athletics 2002 Season 20 consecutive wins the current record 2014 MapR Technologies 33

Goal: Detect Moneyball 2002 Winning Streak 2014 2014 MapR MapR Technologies Technologies 34

Methods: Change Point Detection Find natural breakpoints in a time-series set of data points R packages implement this: changepoint: more sensitve, but not streaming bcp: streaming, but less sensitive 2014 MapR Technologies 35

Methods: R+Storm Demo Architecture github.com/allenday/hadoop-summit-r-storm-demo-public Oakland A s Data (accelerated) Storm Bolt R online change point detector Storm Bolt (write to Jetty) GIFs to MapR Filesystem Jetty Webserver Browser (D3.js) Us 2014 MapR Technologies 36

Demo 50-game sliding window/buffer to detect change points Cumulative history with detected break points Raw data (score difference between A s and opponent) 2014 2014 MapR MapR Technologies Technologies 37

Methods Details: How it s done Uses R-Storm binding github.com/allenday/r-storm Storm package on CRAN cran.r-project.org/web/packages/storm Storm (dev team) R (stats team) Storm (dev team, pure Java) Producer Consumer 2014 MapR Technologies 38

Methods Details: Easy integration R: lambda function storm = Storm$new(); storm$lambda = function(s) { t = s$tuple; t$output = vector(length=1); t$output[1] = tada! s$emit(t) } Storm: extend ShellBolt public static class MyRBolt extends ShellBolt implements IRichBolt { } public RBolt() { } super("rscript", my.r"); 2014 MapR Technologies 39

Results Change points are identified, but none for winning streak Not using score difference, anyway Discussion Time to integrate with the modeling team! Send @kunpognr or @allenday a pull request on GitHub Applicable to many other use cases, e.g. Security (fraud detection, intrusion detection) Marketing (intent to purchase / social media streams) Customer Support (help desk voice calls) 2014 MapR Technologies 40

Polling Question #5 How important will Real-Time analytical apps become? (choose one) Uncertain Not important Necessary Critical

Real-Time and Internet of Things: Foundation of a Compelling Trend for 2015 Big Data Analytics Meets The Internet of Things Transactions + Human Behavior + Internet of Things: Sensors and extracting value using Traditional Statistics Visualization Machine Learning plus adaptability Real-Time Agile Modeling & Fast Model Execution Production Capable, Stable and Secure Rapidly-Evolving Data Science

Where Does Real Time Impact The Analytical Lifecycle? Data Engineering Collection and Ingest Blending Modeling Aggregation, Segmentation & Exploration Model Development & Optimization Testing & Validation Operationalization Deployment & Scoring Delivery Monitoring & Evaluation

Typical Analytical Lifecycle Model Ingest Explore Model Deploy Score Score Act Measure

More Complex Event Driven Analytical Cycle Data Analytics & Process Design Historic Ingest Blend Explore Model Deploy Scoring Append Improve Event Ingest Trans- Form Act Measure Enrich

Real-Time Analytics Best Practices Develop a Common Lexicon for Real-Time Discriminate Between Needs of Each Stage in Lifecycle Data Ingest & Manipulation and Enrichment Data Source / Repository Integration Needs Processes that Fill the Lake Process that Act on the Stream Vastly different computationally Big differences in data ingest volume & latency Start with Tractable Goals Anticipate Growing Requirements: Microbatched >> Interactive >> Autonomy Build for today, Architect for tomorrow

Real-Time Realities Plan for Diverse Needs Real-Time Score Retrieval, Scoring, Modeling Wide-Ranging Performance Microbatch Interactive - Autonomous Fragmentation Data Delivery Systems Pre-Exist Will Vary Widely by Vertical Market Competing Proprietary Solutions Growing Demand Numerous high-value targets The next step : Put big data analytics to work

What s Needed Real-Time Performance plus Agility Deployment models Organization Infrastructure Analytics Manageable Costs Hadoop Open Source R Production Platform(s) Proven Performant

Next Steps Resources: R foundation URL: www.r-project.org Download Revolution R: http://mran.revolutionanalytics.com/download/ Learn about Apache Storm: https://storm.apache.org/ R-Storm bindings: github.com/allenday/r-storm Storm package on CRAN: cran.r-project.org/web/packages/storm www.revolutionanalytcs.com www.maprtech.com Whitepaper: Revolution R for Hadoop: http://www.revolutionanalytics.com/whitepaper/deli vering-value-big-data-revolution-r-enterprise-andhadoop or http://bit.ly/1ua43bu

Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR