Hadoop's Advantages for Machine Learning and Predictive Analytics
Presented by Hortonworks & Zementis
September 10, 2014
Copyright 2014 Zementis, Inc. All rights reserved.

Moderator
- Mark Rabkin, Director Business Development, Zementis

Presenters
- Ofer Mendelevitch, Director of Data Science, Hortonworks
- Michael Zeller, CEO, Zementis

The Speakers

Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop, including use cases such as recommender systems, prediction, classification, and search. Prior to joining Hortonworks, Ofer held a number of positions, including Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo, where he led multiple engineering and data science teams.

Michael Zeller is the CEO and Co-Founder of Zementis. His vision is to help companies deepen and accelerate insights from big data through the power of predictive analytics. Michael also serves on the Board of Directors of Software San Diego and as Secretary/Treasurer on the Executive Committee of ACM SIGKDD, the premier international organization for data mining researchers and practitioners from academia, industry, and government.

Hortonworks & Zementis

Hortonworks: We Do Hadoop.
Our mission is to power your Modern Data Architecture by delivering Enterprise Apache Hadoop.

Our Commitment:
- Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
- Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
- Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Zementis provides software for operational deployment of predictive analytics.

Products & Capabilities:
- Vendor-neutral architecture for data mining tools and for analytics and data warehouse platforms
- Supports the PMML industry standard and a wide range of predictive modeling techniques
- Rapidly deploys and executes predictive models
- Accelerates business insight

A data architecture under pressure from new data

APPLICATIONS
- Business Analytics
- Custom Applications
- Packaged Applications

DATA SYSTEM
- Repositories: RDBMS, EDW, MPP

SOURCES
- Existing Sources (CRM, ERP, Clickstream, Logs)
- OLTP, ERP, CRM Systems
- Unstructured documents, emails
- Server logs
- Clickstream
- Sentiment, Web Data
- Sensor, Machine Data
- Geolocation

Data growth (Source: IDC)
- 2.8 ZB in 2012
- 40 ZB by 2020
- 85% from New Data Types
- 15x Machine Data by 2020

Hortonworks Inc. 2011-2014. All Rights Reserved

Hadoop within an emerging Modern Data Architecture

Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale.

APPLICATIONS
- Business Analytics
- Custom Applications
- Packaged Applications

DATA SYSTEM
- Repositories: RDBMS, EDW, MPP
- Governance & Integration, Data Access, Data Management, Security, Operations

DEV & DATA TOOLS
- Build & Test

OPERATIONS TOOLS
- Provision, Manage & Monitor

SOURCES
- OLTP, ERP, CRM Systems
- Documents, Emails
- Web Logs, Click Streams
- Social Networks
- Machine Generated
- Sensor Data
- Geolocation Data

Hadoop unlocks a new approach: Iterative Analytics

Current Reality (SQL)
- Apply schema on write
- Dependent on IT
- Single Query Engine
- Repeatable Linear Process: Determine list of questions, Design solutions, Collect structured data, Ask questions from list, Detect additional questions

Augment w/ Hadoop
- Apply schema on read
- Support range of access patterns to data stored in HDFS: polymorphic access
- Multiple Query Engines: Batch, Interactive, Real-time, Streaming
- Iterative Process: Explore, Transform, Analyze

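The contrast between schema on write and schema on read can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. Raw lines stand in for files kept unchanged in HDFS, and each new question only brings a new read-side schema.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
# (Hypothetical sample records standing in for files kept in HDFS.)
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "u1", "page": "/cart"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, fill defaults."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, (cast, default) in schema.items():
            value = record.get(name, default)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# First question: which pages did each user view?
views_schema = {"user": (str, None), "page": (str, None)}

# A later question needs fields nobody planned for.
# Only the read-side schema changes; the stored data is untouched.
latency_schema = {
    "page": (str, None),
    "ms": (float, 0.0),
    "referrer": (str, "direct"),
}

views = read_with_schema(RAW_LINES, views_schema)
latency = read_with_schema(RAW_LINES, latency_schema)
```

With schema on write, the second question would first require changing the stored table layout and reloading data; here it only requires a new dictionary.
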
A (partial) map of machine learning tasks

Discovery
- Clustering: Detect natural groupings
- Outlier detection: Detect anomalies
- Association rule mining: Co-occurrence patterns

Prediction
- Classification: Predict a category
- Regression: Predict a value
- Recommendation: Predict a preference

Typical iterative flow in machine learning modeling

- Acquire Data
- Clean Data
- Visualize, Explore
- Hypothesize; Model
- Measure/Evaluate
- Deploy & Monitor

Why Apache Hadoop for Data Science?

- Hadoop's schema-on-read reduces cycle time
- Hadoop is ideal for pre-processing of raw data, structured & unstructured
- Larger datasets enable better models
- Large-scale parallel scoring

Hadoop's schema-on-read accelerates innovation

Timeline: Start, 3 months, 6 months, 9 months

Schema on write
- "I need new data"
- Schema change project
- "Finally, we start collecting"
- "Let me see... is it any good?"

Schema on read
- "Let's just put it in a folder on HDFS"
- "Let me see... is it any good?"
- "My model is awesome"

Hadoop is ideal for large-scale pre-processing

Raw Data to Feature Matrix, through steps such as:
- Sample
- Transform
- Aggregate
- Normalize
- Join
- OCR
- NLP

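Two of the steps named on this slide, Aggregate and Normalize, can be sketched in plain Python to show how raw events become a feature matrix. The event data, the two features (purchase count and total spend), and the function names are assumptions made for illustration; on a cluster the same logic would typically run as a distributed job over much larger inputs.

```python
from collections import defaultdict

# Hypothetical raw events: (customer_id, purchase_amount)
RAW_EVENTS = [
    ("c1", 20.0), ("c2", 5.0), ("c1", 40.0),
    ("c3", 10.0), ("c2", 15.0), ("c3", 30.0), ("c3", 50.0),
]

def aggregate(events):
    """Aggregate step: per-customer purchase count and total spend."""
    totals = defaultdict(lambda: [0, 0.0])
    for customer, amount in events:
        totals[customer][0] += 1
        totals[customer][1] += amount
    return {customer: tuple(values) for customer, values in totals.items()}

def min_max_normalize(column):
    """Normalize step: rescale a column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def build_feature_matrix(events):
    """Return (row_ids, matrix) with one normalized row per customer."""
    agg = aggregate(events)
    ids = sorted(agg)
    counts = min_max_normalize([agg[c][0] for c in ids])
    spend = min_max_normalize([agg[c][1] for c in ids])
    return ids, [list(row) for row in zip(counts, spend)]

row_ids, feature_matrix = build_feature_matrix(RAW_EVENTS)
```

Each row of `feature_matrix` corresponds to the customer at the same position in `row_ids`, which is the shape a downstream modeling tool would expect.
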
Hadoop enables modeling with larger datasets

Larger datasets, better outcomes (Banko & Brill, 2001)
- More examples
- More features

Hadoop enables large-scale parallelized scoring

- Training set, Learning, Model (PMML or native)
- Test set, Scoring, Output
- Scoring is embarrassingly parallel
- Using Hadoop as grid compute infrastructure

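Scoring is called embarrassingly parallel because each record is scored without reference to any other record, so the data can be cut into partitions and scored independently. The sketch below illustrates that property with a hypothetical logistic model (the coefficients and field names are invented). A thread pool stands in for cluster tasks; it demonstrates the independence of the partitions, not the speed-up a Hadoop cluster would give.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical trained model: a logistic regression with fixed coefficients.
WEIGHTS = {"age": 0.03, "balance": -0.0004}
INTERCEPT = -1.0

def score(record):
    """Score one record; the result depends only on that record."""
    z = INTERCEPT + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def score_partition(partition):
    """The work a single task would do on its own slice of the data."""
    return [score(record) for record in partition]

def split(records, n_parts):
    """Cut the dataset into contiguous partitions (a stand-in for input splits)."""
    size = math.ceil(len(records) / n_parts)
    return [records[i:i + size] for i in range(0, len(records), size)]

records = [{"age": 20 + i, "balance": 100.0 * i} for i in range(1000)]

# Sequential baseline.
sequential = score_partition(records)

# Partitioned version: each slice is scored independently, then concatenated.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(score_partition, split(records, 8)))
parallel = [s for part in parts for s in part]
```

Because no partition needs data from another, the partitioned result matches the sequential one exactly, and adding workers requires no change to the scoring logic.
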
What is PMML?

The Predictive Model Markup Language (PMML) industry standard reduces the complexity of operationalizing models.

- Mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models
- XML-based language used to define statistical and data mining models and to share these between compliant applications
- Supported by most leading data mining tools, commercial and open-source
- Data handling and transformations (pre- and post-processing) are a core component of the PMML standard
- Allows for the clear separation of tasks: model development vs. model deployment
- Eliminates the need for custom code and proprietary model deployment solutions

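To make the "XML-based language" point concrete, the sketch below embeds a small hand-written PMML regression document and evaluates it with Python's standard library. The model, its field names, and its coefficients are invented for illustration, and the reader handles only plain `NumericPredictor` terms with the default exponent. It is not a compliant PMML engine and has no connection to any vendor's implementation.

```python
import xml.etree.ElementTree as ET

# A hand-written, minimal PMML regression document (illustrative values only).
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

def load_regression(pmml_text):
    """Read the intercept and per-field coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients

def predict(model, record):
    """Evaluate intercept + sum(coefficient * input) for one record."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())

model = load_regression(PMML_DOC)
prediction = predict(model, {"age": 40.0, "income": 50000.0})
```

The point of the exercise is the separation the slide describes: the tool that produced the XML and the code that scores with it share nothing except the document itself.
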
Predictive Analytics Workflow

PMML in action, covering a complete workflow from raw data input to decision output.

Raw Inputs pass through the stages defined in the PMML file and come out as a Prediction:
- Model Signature: Data and operational types
- Input Validation: Outliers, Missing Values, Invalid Values
- Data Pre-Processing: Normalize, Discretize, Bin, Map, etc.
- Predictive Model: Derived Model Inputs, Model Outputs
- Data Post-Processing: Scaling, Business Decisions, Thresholds, etc.

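The stages of this workflow can be mirrored as a chain of small Python functions. Everything in the sketch is hypothetical: the signature, the value ranges, the toy logistic model, and the decision labels are invented to show how validation, pre-processing, the model, and post-processing hand data to one another. In practice these steps would be declared in the PMML file rather than hand-coded.

```python
import math

# Hypothetical model signature:
# field name -> (type, valid range, replacement for missing/invalid values)
SIGNATURE = {
    "age": (float, (18.0, 100.0), 40.0),
    "balance": (float, (0.0, 1_000_000.0), 0.0),
}

def validate(raw):
    """Input validation: replace missing/invalid values, clamp outliers."""
    clean = {}
    for name, (cast, (lo, hi), replacement) in SIGNATURE.items():
        try:
            value = cast(raw.get(name))
        except (TypeError, ValueError):
            value = replacement               # missing or invalid value
        clean[name] = min(max(value, lo), hi)  # outlier -> nearest bound
    return clean

def preprocess(clean):
    """Pre-processing: normalize age to [0, 1], bin balance into a flag."""
    return {
        "age_norm": (clean["age"] - 18.0) / (100.0 - 18.0),
        "high_balance": 1.0 if clean["balance"] >= 10_000.0 else 0.0,
    }

def model(derived):
    """Predictive model: toy logistic regression over the derived inputs."""
    z = 0.5 - 2.0 * derived["age_norm"] - 1.5 * derived["high_balance"]
    return 1.0 / (1.0 + math.exp(-z))

def postprocess(probability, threshold=0.5):
    """Post-processing: scale to a 0-100 score and apply a business threshold."""
    return {
        "score": round(100.0 * probability),
        "decision": "retain_offer" if probability >= threshold else "no_action",
    }

def run_pipeline(raw):
    """Raw inputs in, decision out, passing through every stage in order."""
    return postprocess(model(preprocess(validate(raw))))

# A string-typed age and a missing balance are both repaired by validation.
result = run_pipeline({"age": "25", "balance": None})
# An out-of-range age is clamped before it reaches the model.
clamped = run_pipeline({"age": 500, "balance": "50000"})
```

Keeping each stage as its own function makes it easy to see which step changed a value, which is the same traceability the declarative PMML stages are meant to provide.
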
Path to Business Value

Predictive analytics helps organizations unlock the value of their big data.

Big Data
- Applications
- Databases
- Cloud
- Log Files
- RSS Feeds
- Other Sources

Predictive Analytics
- Predictive Models
- Machine Learning Techniques
- Data Mining Tools

Business Insights
- More relevant
- More accurate
- More comprehensive
- More nuanced

Decisions & Actions
- Faster
- Lower risk
- Greater positive impact

Business Value
- Accelerated time-to-market
- More precise targeting
- Real-time responsiveness
- Enhanced operational agility
- Competitive advantage
- Higher revenue growth rates
- Greater profitability

Traditional Deployment Cycle

...but model deployment challenges can often erode much of the value that predictive analytics can deliver.

- Develop: Data Scientist
- Operationalize: IT Engineer
- Utilize: Business Professional, leading to Business Decisions

Predictive model deployment becomes a rework cycle:
- Extensive manual coding
- Cross-checking
- Fixing coding errors

Consequences:
- Delayed insight
- Less accurate decisions
- Missed opportunities
- Loss of value

Deployment with Zementis & PMML

Enter Zementis, whose solutions accelerate time-to-insight for predictive analytics.

Model Deployment Cycle Time (chart: Economic Value vs. Time-to-insight)
- Without Zementis: ~6 months
- With Zementis: within 2 days, and sometimes even within a few hours

Benefits:
- Accelerated deployment timeline
- Reduced model deployment cycle time
- Reduced model deployment expense
- Increased model throughput
- Enhanced accuracy
- Minimal rework, if any

Rapid insight = Rapid time-to-value from predictive analytics

Universal PMML Plug-in (UPPI)

Data Mining Tools
- Commercial Vendors (e.g. IBM SPSS, SAS)
- Open Source Tools (R, KNIME, ...)

Predictive Algorithms
- Decision Trees
- Neural Networks
- Support Vector Machines
- Linear and Logistic Regression
- Naive Bayes Classifiers
- General and Generalized Linear Models
- Cox Regression
- Rule Set Models
- Clustering
- Scorecards
- Association Rules
- Multiple Models (Segmentation, Chaining, Composition and Ensemble, including Random Forest Models)

PMML Model Deployment
- Integration/Execution: Zementis UPPI for Hive/Hadoop

Simple Deployment & Execution
- Upload PMML file(s) in Hive
- PMML turns into HiveQL functions
- Seamlessly score data on Hadoop

Hive 0.13

Now faster than ever, up to 100x performance improvements and more to come.

UPPI for Hive 0.13 Performance

[Chart: Scaling by Hadoop Cluster Size. Time for 10 Nodes vs. 20 Nodes.]
[Chart: Speeding Up Performance with Tez & ORC. Time for Hive 0.13, Tez (21%), and Tez & ORC (29%).]

- Performance executing a complex PMML model as a UDF (User-Defined Function) using Hive 0.13
- 29% performance improvement when executing the same model and data by enabling Tez & ORC

DEMO

Zementis Universal PMML Plug-in (UPPI) demo on Hortonworks Sandbox

Zementis UPPI for Hive
1. PMML Sample Models > Hive UDFs
2. Run Customer Churn Example

Broad Applicability

Hortonworks and Zementis products accelerate predictive model insights for multiple industries and business use cases.

Fraud & Risk Scoring
- Financial institutions
- Scoring bureaus
- Fraud detection
- Advanced decision management

Sensor & Device Data Processing
- Rotating equipment
- Energy
- Biometrics
- IP network security

Marketing & Sales
- Up-/cross-sell and next-best-offer
- Marketing campaign optimization
- Real-time recommendations

Thank You

Questions?