HDP Hadoop: From concept to deployment. Ankur Gupta, Senior Solutions Engineer. Rackspace: Page 41 27th Jan 2015
Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some software C. Deep in a trial D. In production with a Hadoop cluster E. What's Hadoop? # votes: 66 Closed Page 42
Where are you in your Hadoop Journey? A. Researching our options 40.9% B. Currently evaluating some software 7.6% C. Deep in a trial 9.1% D. In production with a Hadoop cluster 9.1% E. What's Hadoop? 33.3% Closed Page 43
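As a quick check on the poll results, the percentages can be converted back to approximate vote counts. This is a minimal sketch that assumes the 66 votes reported when the poll closed; the function and variable names are illustrative, not part of any polling tool.

```python
# Poll shares as shown on the results slide, in percent.
shares = {
    "A. Researching our options": 40.9,
    "B. Currently evaluating some software": 7.6,
    "C. Deep in a trial": 9.1,
    "D. In production with a Hadoop cluster": 9.1,
    "E. What's Hadoop?": 33.3,
}

TOTAL_VOTES = 66  # assumption: the vote count shown when the poll closed


def votes_from_shares(shares, total):
    """Convert percentage shares to whole-vote counts by rounding."""
    return {option: round(total * pct / 100) for option, pct in shares.items()}


counts = votes_from_shares(shares, TOTAL_VOTES)
for option, n in counts.items():
    print(f"{option}: {n}")
```

The rounded counts add back up to the total, which suggests the percentages and the vote count are consistent with each other.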
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum: 230+ customers (as of Q3 2014).
Founded in 2011. Original 24 architects, developers, and operators of Hadoop from Yahoo! 600+ employees. 800+ ecosystem partners.
Hortonworks Data Platform: Completely open multi-tenant platform for any app and any data. A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success: Open source community leadership. Focus on enterprise needs. Unrivaled world-class support. Page 44
Hadoop: A Modern Storage and Data Processing Platform. Page 45
Traditional systems under pressure
1. Challenges: Constrains data to app. Can't manage new data. Costly to scale.
2. New Data: Clickstream, geolocation, web data, Internet of Things, docs and emails, server logs.
Traditional data: ERP, CRM, SCM.
Data volume: 2.8 zettabytes in 2012; 40 zettabytes in 2020.
Chart labels: Business Value; Industry Leaders; Laggards. Page 46
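To put the two volume figures in perspective, the implied yearly growth can be worked out directly. The sketch below assumes smooth compound growth between the two data points (2.8 zettabytes in 2012 and 40 zettabytes in 2020), which the slide itself does not state; the function name is illustrative.

```python
def compound_annual_growth_rate(start, end, years):
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


# Figures from the slide: 2.8 ZB in 2012, 40 ZB in 2020.
rate = compound_annual_growth_rate(2.8, 40.0, 2020 - 2012)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption, data volume grows by a bit under 40% each year, roughly doubling every two years.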
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture: Enable applications to have access to all your enterprise data through an efficient centralized platform. Supported with a centralized approach to governance, security and operations. Versatile to handle any applications and datasets no matter the size or type.
Analytics: Data Marts, Business Analytics, Visualization & Dashboards, Applications.
Data systems: MPP, EDW, and batch, interactive, real-time, and partner ISV workloads running on YARN (Data Operating System) over HDFS (Hadoop Distributed File System).
Sources: Existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured.
Schema on read. Complements rather than replaces. Page 47
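Schema on read means raw data is stored as-is and a structure is applied only when the data is read. The following is a minimal sketch of the idea in plain Python, not Hadoop or Hive code; the sample log lines, field names, and the read_with_schema helper are all hypothetical.

```python
import csv
import io

# Raw data kept exactly as it arrived; no schema is enforced on write.
RAW_LOG = "2015-01-27,/home,200\n2015-01-27,/cart,500\n"


def read_with_schema(raw_text, schema):
    """Apply a list of (field name, type) pairs to raw CSV text at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a narrower schema
        # simply ignores the trailing columns.
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Two different schemas applied to the same stored bytes.
full_schema = [("date", str), ("path", str), ("status", int)]
paths_only = [("date", str), ("path", str)]

records = list(read_with_schema(RAW_LOG, full_schema))
errors = [r for r in records if r["status"] >= 500]
print(errors)
print(list(read_with_schema(RAW_LOG, paths_only)))
```

Because the schema lives with the query rather than the storage, a new analysis can reinterpret old data without reloading it.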
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.
Releases: HDP 2.2 (October 2014), HDP 2.1 (April 2014), HDP 2.0 (October 2013).
Component versions, newest first as listed on the slide:
Hadoop & YARN: 2.6.0, 2.4.0, 2.2.0
Pig: 0.14.0, 0.12.1, 0.12.0
Hive & HCatalog: 0.14.0, 0.13.0, 0.12.0
HBase: 0.98.4, 0.98.0, 0.96.1
Phoenix: 4.2, 4.0.0
Accumulo: 1.6.1, 1.5.1
Storm: 0.9.3, 0.9.1
Spark: 1.2.0
Solr: 4.10.0, 4.7.2
Tez: 0.60, 0.4.0
Slider: 0.5.1
Falcon: 0.6.0, 0.5.0
Kafka: 0.8.1
Sqoop: 1.4.5, 1.4.4
Flume: 1.5.0, 1.4.0, 1.3.1
Ambari: 1.7.0, 1.5.1, 1.4.4
Oozie: 4.1.0, 4.0.0, 3.3.2
Zookeeper: 3.4.5, 3.4.5
Knox: 0.5.0, 0.4.0
Ranger: 0.4.0
Component categories: Data Management, Data Access, Governance & Integration, Operations, Security.
Hortonworks Data Platform 2.2. * Version numbers are targets and subject to change at time of general availability in accordance with ASF release process. Page 48
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2. YARN is the architectural center of HDP: it enables batch, interactive and real-time workloads and provides comprehensive enterprise capabilities.
Governance (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS.
Batch, interactive & real-time data access: Script (Pig, on Tez); SQL (Hive, on Tez); Java and Scala (Cascading, on Tez); NoSQL (HBase and Accumulo, on Slider); Stream (Storm, on Slider); In-Memory (Spark); Search (Solr); Others (ISV engines).
YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System).
Security: Authentication, authorization, accounting, data protection. Storage: HDFS. Resources: YARN. Access: Hive. Pipeline: Falcon. Cluster: Knox. Cluster: Ranger.
Operations: Provision, manage & monitor (Ambari, Zookeeper). Scheduling (Oozie).
Deployment choice: Linux, Windows, on-premises, cloud. The widest range of deployment options. Delivered completely in the open. Page 49
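WebHDFS, listed among the data access paths above, exposes HDFS over a REST API with URLs of the form http://HOST:PORT/webhdfs/v1/PATH?op=OPERATION. The sketch below only builds such URLs and makes no network call; the host name, path, and user are placeholders, and port 50070 is assumed as the usual NameNode HTTP port in Hadoop 2.x (a given cluster may be configured differently).

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and path, for illustration only.
list_url = webhdfs_url("namenode.example.com", "/data/logs", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com",
    "/data/logs/part-0000",
    "OPEN",
    params={"user.name": "analyst"},  # simple (non-Kerberos) authentication
)
print(list_url)
print(open_url)
```

An HTTP GET on the first URL would return a JSON directory listing; the second would stream the file contents.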
Hadoop adoption follows a predictable journey: cost optimization, new analytic apps, and ultimately a data lake. Page 50
Hadoop Driver: Cost optimization
HDP helps you reduce costs and optimize the value associated with your EDW.
Archive data off EDW: Move rarely used data to Hadoop as an active archive; store more data longer.
Offload costly ETL process: Free your EDW to perform high-value functions like analytics & operations, not ETL.
Enrich the value of your EDW: Use Hadoop to refine new data sources, such as web and machine data, for new analytical context.
Diagram: Analytics (Data Marts, Business Analytics, Visualization & Dashboards). Data systems (MPP, In-Memory, and Enterprise Data Warehouse for hot data; HDP 2.2 for cold data, deeper archive & new sources, with ELT). Sources (existing systems: ERP, CRM, SCM; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 51
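The active-archive idea can be illustrated by splitting tables into hot and cold sets based on when they were last queried. This is a hedged sketch only: the 90-day threshold, the table names, and the split_hot_cold helper are assumptions made for illustration, not an HDP feature or a recommendation from the slides.

```python
from datetime import date, timedelta


def split_hot_cold(last_access, today, max_idle_days=90):
    """Split table names into (hot, cold) by last access date.

    Tables idle for longer than `max_idle_days` are treated as cold,
    that is, candidates for moving off the EDW into the archive.
    """
    cutoff = today - timedelta(days=max_idle_days)
    hot = [name for name, last in last_access.items() if last >= cutoff]
    cold = [name for name, last in last_access.items() if last < cutoff]
    return hot, cold


# Hypothetical access log: table name -> date of the most recent query.
last_access = {
    "orders_2015": date(2015, 1, 20),
    "orders_2011": date(2012, 3, 1),
}
hot, cold = split_hot_cold(last_access, today=date(2015, 1, 27))
print("keep on EDW:", hot)
print("archive to Hadoop:", cold)
```

In practice the threshold would come from the warehouse's own usage statistics rather than a fixed number.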
Financial Drivers
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure.
Cost Efficiencies: Reduce costs associated with expensive archive systems. Utilize existing relationships with hardware vendors. Open source software.
Active Archive: Provide access to archived data. It's not there to collect dust.
Storage costs/compute costs: from $19/GB to $0.23/GB.
Chart: Fully-loaded Cost Per Raw TB of Data (min to max cost), on an axis from $0 to $180,000, comparing Cloud Storage, Hadoop, NAS, Engineered System, MPP, and SAN. Page 52
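The two per-gigabyte figures can be turned into a per-terabyte comparison and a cost ratio. The sketch below takes $19/GB and $0.23/GB from the slide as the two ends of the range and assumes 1 TB = 1000 GB; the slide does not say which systems the two figures refer to, so the constant names are neutral.

```python
HIGH_COST_PER_GB = 19.00  # upper figure quoted on the slide, in USD
LOW_COST_PER_GB = 0.23    # lower figure quoted on the slide, in USD
GB_PER_TB = 1000          # assumption: decimal terabytes


def cost_per_tb(cost_per_gb):
    """Scale a per-gigabyte cost to a per-terabyte cost."""
    return cost_per_gb * GB_PER_TB


ratio = HIGH_COST_PER_GB / LOW_COST_PER_GB
print(f"High end: ${cost_per_tb(HIGH_COST_PER_GB):,.0f} per TB")
print(f"Low end:  ${cost_per_tb(LOW_COST_PER_GB):,.0f} per TB")
print(f"Ratio:    about {ratio:.0f}x")
```

On these numbers the gap between the two ends of the range is roughly two orders of magnitude.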
Hadoop Driver: Today's Data Architectures Inhibit a Single View
1. Data Silos: disparate views of each customer.
2. Volume Limitations: cannot store and process all customer data in the EDW.
3. New Data Sources: unable to capture and use new data to complete the view.
Diagram: Analytics (apps, data marts, visualization). Data systems (Enterprise Data Warehouse). Sources (systems of record: RDBMS, CRM, ODS; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 53
Hadoop Driver: Single View: Consolidating the Silos
HDP provides a centralized architecture for any application and any data.
Single Data Repository: Resolve customer data across repositories. Provide analysts with a single view of data.
Optimized Storage: Eliminate unnecessary silos to reduce costs. Store more data about each customer.
Analytical Flexibility: Dynamic schema on read removes limitations of other single view applications.
Diagram: Analytics (Data Marts, Business Analytics, Visualization & Dashboards, Applications). Data systems (MPP, EDW, and batch, interactive, real-time, and partner ISV workloads on YARN: Data Operating System, over HDFS: Hadoop Distributed File System). Sources (existing systems: ERP, CRM, SCM; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 54
Hadoop Driver: Today's Data Architectures Limit Predictive Capabilities
1. Data Silos: difficult to find predictive correlations.
2. Data Volumes: cannot store enough data to find patterns.
3. New Data Sources: unable to capture and use new data for real-time analysis.
Diagram: Analytics (Data Marts, Business Analytics, Visualization & Dashboards). Data systems (MPP, In-Memory, Enterprise Data Warehouse for hot data). Sources (systems of record: RDBMS, CRM, ERP; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 55
Hadoop Driver: Predictive Analytics: Capture Opportunity with HDP
Future state analysis: Capture and combine large data sets. Understand patterns, model outcomes, and forecast accurately to guide action. Flow: existing and new data (RDBMS, MPP, EDW, other) into HDP 2.2; consolidate data; run iterative analytics; predictive insight.
Real-time insight: Analyze streaming data from new sources on the fly. Deliver timely insights to the right people and systems to take action. Flow: streaming data into HDP 2.2; process in real time; store to HDFS; actionable insight. Page 56
New requirements to shift from reactive to proactive
A shift from Reactive to Proactive: From break-then-fix to preventative maintenance. From static resource planning to resource optimization. From reaction to human activity to behavioral insight.
New Requirements: Analyze extremely large data sets to find patterns. Combine disparate data and process it in multiple ways. Capture unstructured data from a variety of new sources. Perform real-time analysis on streaming data. Page 57
Hadoop Driver 3: Today's Data Architectures Limit Data Discovery
1. Data Silos: miss insights because data is isolated.
2. Data Volumes: throwing away data that has value.
3. New Data Sources: unable to mine new data sources.
Diagram: Analytics (Data Marts, Business Analytics, Visualization & Dashboards). Data systems (Enterprise Data Warehouse). Sources (systems of record: RDBMS, CRM, ERP; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 58
Hadoop Driver 3: Data Discovery: Unlock New Insights with HDP
HDP provides a centralized architecture for any application and any data.
Combine: Combine data from many systems and of many different types. Take advantage of schema on read to define new analyses and seek new answers.
Explore: Explore large volumes of data together in its many forms. Answer a wide range of questions applying multiple processing techniques.
Diagram: Analytics (Data Marts, Business Analytics, Visualization & Dashboards, Applications). Data systems (MPP, EDW, and batch, interactive, real-time, and partner ISV workloads on YARN: Data Operating System, over HDFS: Hadoop Distributed File System). Sources (existing systems: ERP, CRM, SCM; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured). Page 59
Hadoop Driver: Enabling the data lake
Journey to the Data Lake with Hadoop. Goal: centralized architecture, data-driven business. Chart axes: Scale and Scope.
Drivers: 1. Cost Optimization 2. Advanced Analytic Apps
Data Lake Definition:
Centralized Architecture: Multiple applications on a shared data set with consistent levels of service.
Any App, Any Data: Multiple applications accessing all data, affording new insights and opportunities.
Unlocks Systems of Insight: Advanced algorithms and applications used to derive new value and optimize existing value. Page 60
Deployment. Page 61
Deployment Options
1. Online IaaS: Rackspace Managed Big Data
2. Online elastic service: Rackspace Cloud Big Data
3. Laptop sandbox: single-node Hadoop distribution. http://hortonworks.com/products/hortonworks-sandbox/
4. On-premises HDP: complete Hadoop distribution. http://docs.hortonworks.com/hdpdocuments/hdp2/hdp-2.2.0/hdp_man_install_v22/index.html
Page 62
Summary: Any Data, Any Application, Anywhere
Any Data: Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets together with existing legacy datasets. Sources shown: ERP, CRM, SCM, clickstream, web & social, geolocation, Internet of Things, server logs, files and emails.
Any Application: Deep integration with ecosystem partners to extend existing investments and skills. Over 70 Hortonworks Certified YARN Apps. Broadest set of applications through the stable of YARN-Ready applications.
Anywhere: Implement HDP naturally across the complete range of deployment options: commodity, appliance, cloud, hybrid. Page 63
Questions. Page 64