Modern Data Architecture for Predictive Analytics
- David Smith, VP Marketing and Community, Revolution Analytics
- John Kreisa, VP Strategic Marketing, Hortonworks
Hortonworks Inc. 2013
Your Presenters
- David Smith (@revodavid): VP Marketing and Community at Revolution Analytics; data scientist, blogger, and co-author of An Introduction to R
- John Kreisa (@marked_man): VP Strategic Marketing, Hortonworks; over 20 years in data management as a developer and a marketer; avid camper
Today's Topics
- Introduction
- Drivers for the Modern Data Architecture (MDA)
- Apache Hadoop in the MDA
- R's role in the MDA
- Q&A
Poll #1: What stage are you at in looking at Hadoop?
- Research
- Evaluation
- Trial
- Haven't started research
Existing Data Architecture
[Diagram: three layers plus supporting tools]
- Applications: Business Analytics, Custom Applications, Packaged Applications
- Data system: RDBMS, EDW, MPP repositories
- Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
- Dev & data tools (build & test); operational tools (manage & monitor)
Existing Data Architecture
[Diagram: applications (Business Analytics, Custom Applications, Packaged Applications), data system (RDBMS, EDW, MPP repositories), and Existing Sources (CRM, ERP, Clickstream, Logs), overlaid with data growth figures]
- 2.8 ZB in 2012
- 40 ZB by 2020
- 85% from new data types
- 15x machine data by 2020
Source: IDC
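The IDC figures on this slide imply a compound annual growth rate, which the short Python calculation below works out. It assumes the 2.8 ZB and 40 ZB figures are comparable totals measured eight years apart and that growth is smooth between them; the function name `implied_annual_growth` is illustrative only and does not come from the deck.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


# Assumption: both figures are total data volume in zettabytes (ZB), per the slide.
rate = implied_annual_growth(start=2.8, end=40.0, years=2020 - 2012)
print(f"Implied growth: {rate:.1%} per year")
```

Under those assumptions the implied rate is a little under 40% per year.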
Modern Data Architecture Enabled
[Diagram: the existing architecture with emerging sources added]
- Applications: Business Analytics, Custom Applications, Packaged Applications
- Data system: RDBMS, EDW, MPP repositories
- Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
- Dev & data tools (build & test); operational tools (manage & monitor)
Hortonworks Inc. 2013 - Confidential
Hadoop Powers Modern Data Architecture
[Diagram: a Hadoop cluster made up of compute & storage nodes]
- Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
- Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
Drivers for Hadoop Adoption
- Driving Efficiency (Modern Data Architecture): Hadoop has a central role in next-generation data architectures while integrating with existing data systems
- Driving Opportunity (Business Applications): Use Hadoop to extract insights that enable new customer value and competitive edge
- Big Data Sets, Existing: Traditional, Server log, Clickstream
- Big Data Sets, Emerging: Sentiment/Social, Machine/Sensor, Geo-locations
Opportunity in types of data
1. Sentiment: Understand how your customers feel about your brand and products right now
2. Clickstream: Capture and analyze website visitors' data trails and optimize your website
3. Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: Analyze location-based data to manage operations where they occur
5. Server Logs: Research logs to diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
Efficiency in the Modern Data Architecture
- Drive efficiency via modern data architecture
- Store data once and access it in many ways
- Often referred to as a data lake or data repository
- Infrastructure platform driven
- IT-oriented, TCO based
[Diagram: applications (Business Analytics, Custom Applications, Packaged Applications) over a data system (RDBMS, EDW, MPP repositories) fed by Existing Sources (CRM, ERP, Clickstream, Logs) and Emerging Sources (Sensor, Sentiment, Geo, Unstructured)]
Engineered for Interoperability
[Diagram: the modern data architecture with named products in each layer]
- Applications: BusinessObjects BI
- Data system: RDBMS, EDW, MPP, HANA
- Dev & data tools; operational tools; infrastructure
- Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Requirements for Hadoop Adoption
Requirements for Hadoop's Role in the Modern Data Architecture
- Integrated: Interoperable with existing data center investments
- Skills: Leverage your existing skills in development, operations, and analytics
- Key Services: Platform, operational, and data services essential for the enterprise
Revolution R Enterprise Architecture
[Diagram: the modern data architecture with a legend marking the Revolution R Enterprise components]
- Applications: Business Analytics, Custom Applications, Packaged Applications
- Data system: RDBMS, EDW, MPP repositories
- Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
- Dev & data tools (build & test); operational tools (manage & monitor)
Today's Topics
- Introduction
- Drivers for the Modern Data Architecture (MDA)
- Apache Hadoop's role in the MDA
- R's role in the MDA
- Q&A
Poll #2: Which of the following best describes your use of R and Hadoop?
- We have R + Hadoop in production
- We are testing R + Hadoop
- We have started to investigate but nothing is implemented
- No current plans
What is the Open Source R Project?
Revolution Confidential
- The R Language: Object-oriented language for stats, math, and data science; comprehensive data visualization and statistical modeling capabilities
- The R Community: 2M+ users with the skill to tackle big data statistical and numerical analysis and machine learning projects; new graduates with data skills learn R
- The R Ecosystem: 5000+ freely available algorithms in CRAN; specialized methods for finance, economics, genomics, linguistics, and every data-driven domain
R is open source and drives analytic innovation but has some limitations for Enterprises
- Bigger data sizes: open source R is memory bound; commercial addition: Big Data
- Speed of analysis: open source R is single threaded; commercial addition: scale out, parallel processing, high speed
- Production support: open source R has community support; commercial addition: commercial production support
- Innovation and scale: open source R is innovative, with 5000+ packages and exponential growth; commercial addition: combines with open source R packages where needed
Revolution R Enterprise
Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language
- High Performance Analytics
- Big Data Analytics
- Cross-Platform
- Easier Build & Deploy
- Enterprise-Ready
Modern Data Architecture: Extract and Analyze
- Ad-hoc Data Distillation
- Exploratory Data Analysis / Data Visualization
- Model Development
[Diagram: data flow through the Hadoop platform]
- Source data: DBs, files, JMS queues, REST, HTTP, stream (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory)
- Load: SQOOP, FLUME, NFS, WebHDFS
- Platform: HDFS, YARN, MAPREDUCE, AMBARI
- Data refinement: PIG, HIVE
- Structure: HCATALOG (metadata services)
- Custom analytical: rhadoop
- Interactive: HIVE Server2; query, visualization, reporting, and analytical tools and apps
- Second load stage: SQOOP/Hive, WebHDFS, shown with analytical tools, data sources, CSV, and databases
The Data Scientist's Big Data Toolkit
- R Data Step
- Descriptive Statistics
- Statistical Tests
- Sampling
- Simulation
- Data Visualization
- Machine Learning
- Predictive Models
Parallel External-Memory Algorithms
[Diagram: multiple CPUs within a single SMP server]
Parallel External-Memory Algorithms
[Diagram: multiple Hadoop nodes within a Hadoop cluster]
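The two slides above name parallel external-memory algorithms (PEMAs) without showing one. As a rough illustration only, and not Revolution R Enterprise's implementation, the Python sketch below computes a mean and sample variance by summarizing fixed-size chunks independently (as separate CPUs or Hadoop nodes might) and then merging the partial summaries. The names `Summary`, `summarize_chunk`, `merge`, `read_chunks`, and `chunked_mean_variance` are hypothetical, and the merge step uses the pairwise update formula of Chan et al., which is an assumption about one reasonable technique rather than something stated in the deck.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Summary:
    """Mergeable partial result for one chunk."""
    n: int        # number of values seen
    mean: float   # mean of those values
    m2: float     # sum of squared deviations from the mean


def summarize_chunk(chunk: List[float]) -> Summary:
    """Work done independently per chunk (one CPU or one node in the analogy)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries without revisiting the raw data."""
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def read_chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def chunked_mean_variance(values: Iterable[float], chunk_size: int) -> Tuple[float, float]:
    """Return (mean, sample variance); assumes at least two input values."""
    partials = map(summarize_chunk, read_chunks(values, chunk_size))
    total = reduce(merge, partials)
    return total.mean, total.m2 / (total.n - 1)
```

For example, `chunked_mean_variance(range(1, 11), chunk_size=3)` merges four partial summaries. In a real deployment the `map` step would be dispatched to separate cores or nodes; here it runs sequentially to keep the sketch self-contained.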
Modern Data Architecture with RRE7
In-Hadoop Predictive Analytics
- Production Data Distillation (e.g. Semantic Analysis)
- Production Model Processing / Re-Estimation
- Production Model Scoring
[Diagram: data flow through the Hadoop platform with Revolution R Enterprise]
- Source data: DBs, files, JMS queues, REST, HTTP, stream (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory)
- Load: SQOOP, FLUME, NFS, WebHDFS
- Platform: HDFS, YARN, MAPREDUCE, AMBARI
- Data refinement: PIG, HIVE; distilled data files
- Structure: HCATALOG (metadata services)
- Custom analytical: Revolution R Enterprise
- Interactive: HIVE Server2; query, visualization, reporting, and analytical tools and apps
- Second load stage: SQOOP/Hive, WebHDFS, shown with analytical tools, data sources, CSV, and databases
Hadoop As An R Engine
- Use Revolution R Enterprise PEMAs in Hadoop
- No need to change existing R code
- Simple R programming
- No need to "Think In MapReduce"
- Eliminate data movement to slash latencies
- Use Hadoop nodes as parallel R computation engines
Requirements for Hadoop Adoption
Requirements for Hadoop's Role in the Modern Data Architecture
- Integrated: Interoperable with existing data center investments
- Skills: Leverage your existing skills in development, operations, and analytics
- Key Services: Platform, operational, and data services essential for the enterprise
Poll #3: Which of the following would you most like to accomplish with R + Hadoop?
- Build a model to be put into production in Hadoop
- Build a model to be put into production elsewhere
- Create new data from Hadoop to supplement an existing analytics process
- Something else
Next Steps
- More about Revolution Analytics and Hadoop: http://www.revolutionanalytics.com/products/r-forhadoop.php
- Get started on Hadoop with Hortonworks Sandbox: http://hortonworks.com/sandbox
- Follow us: @hortonworks @RevolutionR