Hadoop's Advantages for Machine Learning and Predictive Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis



Transcription:

Webinar will begin shortly. Hadoop's Advantages for Machine Learning and Predictive Analytics. Presented by Hortonworks & Zementis. September 10, 2014. Copyright 2014 Zementis, Inc. All rights reserved.

Hadoop's Advantages for Machine Learning and Predictive Analytics
Moderator: Mark Rabkin, Director Business Development, Zementis
Presenters: Ofer Mendelevitch, Director of Data Science, Hortonworks; Michael Zeller, CEO, Zementis

The Speakers
Ofer Mendelevitch, Director of Data Science, Hortonworks
Michael Zeller, CEO & Founder, Zementis
Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop, including use cases such as recommender systems, prediction, classification and search. Before joining Hortonworks, Ofer held a number of positions, including Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo, where he led multiple engineering and data science teams.
Michael Zeller is the CEO and Co-Founder of Zementis. His vision is to help companies deepen and accelerate insights from big data through the power of predictive analytics. Michael also serves on the Board of Directors of Software San Diego and as Secretary/Treasurer on the Executive Committee of ACM SIGKDD, the premier international organization for data mining researchers and practitioners from academia, industry, and government.

Hortonworks & Zementis
Hortonworks: We Do Hadoop. Our mission is to power your Modern Data Architecture by delivering Enterprise Apache Hadoop.
Our Commitment:
- Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
- Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
- Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Zementis provides software for operational deployment of predictive analytics.
Reseller Partners:
Products & Capabilities:
- Vendor-neutral architecture for data mining tools and for analytics and data warehouse platforms
- Supports the PMML industry standard and a wide range of predictive modeling techniques
- Rapidly deploys and executes predictive models
- Accelerates business insight

A data architecture under pressure from new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM (REPOSITORIES): RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Server logs; Sentiment, Web Data; Sensor, Machine Data; Clickstream; Geolocation
Data growth (Source: IDC): 2.8 ZB in 2012; 40 ZB by 2020; 85% from New Data Types; 15x Machine Data by 2020
Hortonworks Inc. 2011-2014. All Rights Reserved

Hadoop within an emerging Modern Data Architecture
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM (REPOSITORIES): RDBMS, EDW, MPP; Governance & Integration, Data Access, Data Management, Security, Operations
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

Hadoop unlocks a new approach: Iterative Analytics
Current Reality (SQL):
- Apply schema on write
- Dependent on IT
- Single Query Engine
- Repeatable Linear Process: Determine list of questions -> Design solutions -> Collect structured data -> Ask questions from list -> Detect additional questions
Augment w/ Hadoop:
- Apply schema on read
- Support range of access patterns to data stored in HDFS: polymorphic access
- Multiple Query Engines: Batch, Interactive, Real-time, Streaming
- Iterative Process: Explore, Transform, Analyze
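The schema-on-read idea on this slide can be sketched in a few lines of Python. This is only an illustration and not Hadoop code: the raw event lines, the field names and the `read_with_schema` helper are all hypothetical. The point is that the raw data is stored once, untouched, and each new question brings its own schema at read time.

```python
import json

# Raw events are kept exactly as they arrived (think: files in a folder on HDFS).
# The records and field names below are made up for illustration.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "geo": "US"}',
    '{"user": "u1", "page": "/cart", "ms": 95}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: select and cast only the requested fields."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }

# First question: page latency.
latency_view = list(read_with_schema(RAW_EVENTS, {"page": str, "ms": float}))

# A later question about geography: same raw data, new schema, no reload.
geo_view = list(read_with_schema(RAW_EVENTS, {"user": str, "geo": str}))
```

With schema on write, the second question would first require a schema change and a reload before any data could be inspected.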

A (partial) map of machine learning tasks
Discovery:
- Clustering: Detect natural groupings
- Outlier detection: Detect anomalies
- Association rule mining: Co-occurrence patterns
Prediction:
- Classification: Predict a category
- Regression: Predict a value
- Recommendation: Predict a preference

Typical iterative flow in machine learning modeling
Acquire Data -> Clean Data -> Visualize, Explore -> Hypothesize; Model -> Measure/Evaluate -> Deploy & Monitor

Why Apache Hadoop for Data Science?
- Hadoop's schema-on-read reduces cycle time
- Hadoop is ideal for pre-processing of raw data, structured & unstructured
- Larger datasets enable better models
- Large-scale parallel scoring

Hadoop's schema-on-read accelerates innovation
Timeline: Start, 3 months, 6 months, 9 months
With a schema change project: "I need new data" -> Schema change project -> "Finally, we start collecting" -> "Let me see... is it any good?"
With Hadoop: "Let's just put it in a folder on HDFS" -> "Let me see... is it any good?" -> "My model is awesome"

Hadoop is ideal for large-scale pre-processing
Raw Data -> Sample, Transform, Aggregate, Normalize, Join, OCR, NLP -> Feature Matrix
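As a small-scale illustration of the path from raw data to a feature matrix, the sketch below aggregates hypothetical purchase records per customer and min-max normalizes the resulting columns. The records, the two features and the `min_max` helper are assumptions made for this example; on a cluster the same steps would run over partitions of much larger data.

```python
from collections import defaultdict

# Hypothetical raw records: (customer_id, purchase_amount).
raw = [("c1", 20.0), ("c2", 5.0), ("c1", 40.0), ("c3", 15.0), ("c2", 10.0)]

# Aggregate: total spend and purchase count per customer.
totals = defaultdict(lambda: [0.0, 0])
for customer, amount in raw:
    totals[customer][0] += amount
    totals[customer][1] += 1

def min_max(values):
    """Normalize: rescale a column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

customers = sorted(totals)
spend = min_max([totals[c][0] for c in customers])
count = min_max([float(totals[c][1]) for c in customers])

# Feature matrix: one row per customer, one column per feature.
feature_matrix = [[s, n] for s, n in zip(spend, count)]
```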

Hadoop enables modeling with larger datasets
Larger datasets, better outcomes:
- More examples
- More features
(Banko & Brill, 2001)

Hadoop enables large-scale parallelized scoring
Training set -> Learning -> Model (PMML or native)
Test set -> Scoring -> Output
Embarrassingly parallel: using Hadoop as grid compute infrastructure
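Scoring is embarrassingly parallel because each record is scored independently of every other record. The sketch below shows that property on one machine, with a thread pool standing in for cluster nodes. The logistic model, its weights and the field names are hypothetical, and this is not how Hadoop itself schedules work.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical logistic-regression model; weights are made up for illustration.
WEIGHTS = {"tenure": -0.08, "complaints": 0.9}
INTERCEPT = -1.0

def score(record):
    """Score one record; needs nothing but the model and the record itself."""
    z = INTERCEPT + sum(WEIGHTS[k] * record[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def score_partition(partition):
    """Score one partition of the data, as a single worker would."""
    return [score(r) for r in partition]

def parallel_score(records, n_partitions=4):
    """Split records into contiguous partitions and score them concurrently."""
    size = max(1, math.ceil(len(records) / n_partitions))
    partitions = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(score_partition, partitions)  # preserves order
    return [s for part in results for s in part]

records = [{"tenure": t, "complaints": t % 3} for t in range(10)]
scores = parallel_score(records)
```

Because no partition depends on another, adding workers only changes how the partitions are distributed, not the scores themselves.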

What is PMML?
The Predictive Model Markup Language (PMML) industry standard reduces the complexity of operationalizing models.
- Mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models
- XML-based language used to define statistical and data mining models and to share these between compliant applications
- Supported by most leading data mining tools, commercial and open-source
- Data handling and transformations (pre- and post-processing) are a core component of the PMML standard
- Allows for the clear separation of tasks: model development vs. model deployment
- Eliminates the need for custom code and proprietary model deployment solutions
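To make the "XML-based language" point concrete, here is a minimal sketch that embeds a tiny, hypothetical PMML linear regression document and scores one row against it using only the Python standard library. The field names and coefficients are invented, and the `score_regression` helper handles only plain `NumericPredictor` terms. It is not a compliant PMML engine, which would also need to cover transformations, missing-value handling and the many other model types.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical PMML 4.2 document describing spend = 10 + 0.5*age + 0.001*income.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header description="Hypothetical linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}

def score_regression(pmml_text, row):
    """Evaluate intercept + sum(coefficient * input) from a simple RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        result += float(predictor.get("coefficient")) * row[predictor.get("name")]
    return result

prediction = score_regression(PMML_DOC, {"age": 40.0, "income": 50000.0})
```

The model lives entirely in the XML document, so the tool that produced it and the system that scores with it never need to share code.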

Predictive Analytics Workflow
PMML in action, covering a complete workflow from raw data input to decision output
Raw Inputs -> PMML File -> Prediction
The PMML File contains:
- Model Signature: Data and operational types
- Input Validation: Outliers, Missing Values, Invalid Values
- Data Pre-Processing: Normalize, Discretize, Bin, Map, etc.
- Predictive Model: Derived Model Inputs, Model Outputs
- Data Post-Processing: Scaling, Business Decisions, Thresholds, etc.
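The workflow stages can be read as a chain of small functions, each feeding the next. The sketch below is a hand-written Python analogy with hypothetical fields, replacement values, coefficients and thresholds. In an actual PMML deployment these steps are declared in the PMML file rather than coded by hand.

```python
def validate(raw):
    """Input validation: replace a missing age and clamp outliers to [18, 90]."""
    age = raw.get("age")
    if age is None:
        age = 40.0  # hypothetical missing-value replacement
    age = min(max(float(age), 18.0), 90.0)
    return {"age": age, "balance": float(raw.get("balance", 0.0))}

def preprocess(clean):
    """Pre-processing: normalize age to [0, 1] and bin balance into two buckets."""
    return {
        "age_norm": (clean["age"] - 18.0) / (90.0 - 18.0),
        "balance_bin": 0 if clean["balance"] < 1000 else 1,
    }

def model(features):
    """Predictive model: a made-up linear score over the derived inputs."""
    return 0.2 + 0.5 * features["age_norm"] + 0.3 * features["balance_bin"]

def postprocess(score, threshold=0.6):
    """Post-processing: turn the raw score into a business decision."""
    decision = "review" if score >= threshold else "approve"
    return {"score": round(score, 3), "decision": decision}

def predict(raw):
    """Raw inputs -> validation -> pre-processing -> model -> post-processing."""
    return postprocess(model(preprocess(validate(raw))))

result = predict({"age": 54, "balance": 2500})
```

Keeping validation and transformations next to the model is what lets the same scoring logic move between environments without being re-implemented.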

Path to Business Value
Predictive analytics helps organizations unlock the value of their big data
Big Data -> Predictive Analytics -> Business Insights -> Decisions & Actions -> Business Value
- Big Data: Applications, Databases, Cloud, Log Files, RSS Feeds, Other Sources
- Predictive Analytics: Predictive Models, Machine Learning Techniques, Data Mining Tools
- Business Insights: More relevant, More accurate, More comprehensive, More nuanced
- Decisions & Actions: Faster, Lower risk, Greater positive impact
- Business Value: Accelerated time-to-market, More precise targeting, Real-time responsiveness, Enhanced operational agility, Competitive advantage, Higher revenue growth rates, Greater profitability

Traditional Deployment Cycle
...but model deployment challenges can often erode much of the value that predictive analytics can deliver
Develop (Data Scientist) -> Operationalize (IT Engineer) -> Utilize (Business Professional) -> Business Decisions
Predictive model deployment becomes a rework cycle:
- Extensive manual coding
- Cross-checking
- Fixing coding errors
Consequences:
- Delayed insight
- Less accurate decisions
- Missed opportunities
- Loss of value

Deployment with Zementis & PMML
Enter Zementis, whose solutions accelerate time-to-insight for predictive analytics
Model Deployment Cycle Time (Economic Value vs. Time-to-insight): ~6 months without Zementis; within 2 days* with Zementis
* And sometimes even within a few hours
- Accelerated deployment timeline
- Reduced model deployment cycle time
- Reduced model deployment expense
- Increased model throughput
- Enhanced accuracy
- Minimal rework, if any
Rapid insight = Rapid time-to-value from predictive analytics

Universal PMML Plug-in (UPPI)
Data Mining Tools:
- Commercial Vendors (e.g. IBM SPSS, SAS)
- Open Source Tools (R, KNIME, ...)
Predictive Algorithms:
- Decision Trees
- Neural Networks
- Support Vector Machines
- Linear and Logistic Regression
- Naive Bayes Classifiers
- General and Generalized Linear Models
- Cox Regression
- Rule Set Models
- Clustering
- Scorecards
- Association Rules
- Multiple Models (Segmentation, Chaining, Composition and Ensemble, including Random Forest Models)
Data Mining Tools -> PMML -> Model Deployment, Integration/Execution: Zementis UPPI for Hive/Hadoop
Simple Deployment & Execution:
- Upload PMML file(s) in Hive
- PMML turns into HiveQL functions
- Seamlessly score data on Hadoop

Hive 0.13
Now faster than ever, up to 100x performance improvements and more to come

UPPI for Hive 0.13
[Chart: Performance Scaling by Hadoop Cluster Size. Time for 10 Nodes vs. 20 Nodes.]
[Chart: Speeding Up Performance with Tez & ORC. Time for Hive 0.13, Tez (21%), and Tez & ORC (29%).]
Performance executing a complex PMML model as UDF (User-Defined Function) using Hive 0.13
29% performance improvement when executing the same model and data by enabling Tez & ORC

DEMO
Zementis Universal PMML Plug-in (UPPI) demo on Hortonworks Sandbox
Zementis UPPI for Hive:
1. PMML Sample Models > Hive UDFs
2. Run Customer Churn Example

Broad Applicability
Hortonworks and Zementis products accelerate predictive model insights for multiple industries and business use cases
Fraud & Risk Scoring:
- Financial institutions
- Scoring bureaus
- Fraud detection
- Advanced decision management
Sensor & Device Data Processing:
- Rotating equipment
- Energy
- Biometrics
- IP network security
Marketing & Sales:
- Up-/cross-sell and next-best-offer
- Marketing campaign optimization
- Real-time recommendations

Thank You. Questions?