A modern, flexible approach to Hadoop implementation incorporating innovations from HP Vertica & IDOL
Gilles Noisette, HP EMEA Big Data CoE
London 2015
Agenda
- Hadoop in the HP Big Data picture
- HP Platforms for Hadoop
- HP Reference Architectures for Hadoop
- HP Big Data Reference Architecture
- HP Haven & Hadoop
- HP Vertica: Fast analytics on Hadoop
- HP IDOL: Smart Hadoop Data Lake
- HP SecureData for Hadoop
- Trafodion: SQL DBMS on Hadoop
- HP Big Data Services
The HP Haven Big Data Platform
Powering Big Data Analytics to Applications
Turn 100% of your data into action.
Human Data, Machine Data, Business Data -> Haven Big Data Platform -> Insight
- Haven Enterprise: SQL / BI / Reporting, Predictive Analytics, Machine Learning, Log Analytics, Search, Image / Audio / Video. Products: HP Vertica, HP IDOL, KeyView, HP Distributed R Predictive Analytics
- Haven OnHadoop: Secure Data Lake, Exploration, Open Data Format, Governance, native support for MapR, Hortonworks & Cloudera. Products: HP Vertica SQL on Hadoop, HP IDOL for Hadoop
- Haven OnDemand: Open APIs, Rapid POCs & deployment, Elastic / Multi-tenant, Private Cloud-ready, Pay-as-you-go. Products: HP Vertica OnDemand & HP IDOL OnDemand
HP Big Data platform: Hadoop-centric view
- Analytics: HP Vertica
- Data Intelligence: HP IDOL
- Security: HP SecureData
- SQL DBMS: Trafodion (Open Source)
- Hadoop Ecosystem (Open Source)
- HP ProLiant / Converged Infrastructure: DL380, Apollo 4200, Apollo 4530, Moonshot 1500, Network
- Cluster Operation: HP BSM / HP DSM / HP CMU
HP Reference Architectures for Hadoop
HP Reference Architecture(s) for Hadoop
Flexible, pre-approved & optimized configurations
- Scaling from 4 to thousands of HP ProLiant servers
- Sized to the customer's workload and storage needs
- Impressive processor and storage density
- A set of pre-tested hardware components: processor, drives, network, 1 TB/8 TB disk size, etc.
- Breakthrough economics, density, simplicity
Platforms shown: DL380, Apollo 4530, Apollo 4200, Moonshot 1500
HP Apollo 4000 example: x 2 network switches (HP 5900 10GbE, HP 5930 10GbE), 3 x DL360 Gen9 head nodes, 24 x HP ProLiant Apollo 4530 worker nodes
Full-rack figures shown on the slide:
- 2.46 PB raw storage, 630 TB Hadoop usable, 756 Xeon E5 cores
- 3.5 PB raw storage, 900 TB Hadoop usable, 960 Xeon E5 cores
- 4.26 PB raw storage, 1 PB Hadoop usable, 756 Xeon E5 cores
- 1620 Xeon E3 cores, 3240 Linux CPUs
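The slide quotes raw and "Hadoop usable" capacity per rack without stating how one is derived from the other. The following is a minimal sketch of one common way to estimate it, assuming 3-copy HDFS replication (the replication level the deck mentions for the Apollo 4530) and a hypothetical 25% reserve for temporary and non-HDFS space. The function name and the reserve value are assumptions made for illustration; the results land in the same range as the slide's figures but are not expected to reproduce them exactly.

```python
def usable_tb(raw_tb, replication=3, reserve_fraction=0.25):
    """Estimate HDFS usable capacity in TB from raw capacity in TB.

    replication: number of copies HDFS keeps of each block (assumed 3).
    reserve_fraction: share of disk kept free for temporary data and
    the operating system (hypothetical value, not from the slide).
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return raw_tb / replication * (1.0 - reserve_fraction)


# Raw capacities quoted on the slide, in PB.
for raw_pb in (2.46, 3.5, 4.26):
    estimate = usable_tb(raw_pb * 1000)
    print(f"{raw_pb} PB raw -> roughly {estimate:.0f} TB usable")
```

Changing `reserve_fraction` is the quickest way to see how sensitive the usable figure is to the sizing assumptions.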
HP Apollo 4200 - Bringing Big Data storage server density to enterprise
The enterprise bridge to Big Data - Available June 1, 2015
- Storage density: leadership storage density, 28 LFF or 50 SFF HDDs
- Plug and play: enterprise bridge; fits the traditional enterprise/SME rack-server data centers deployed today, no cost of change
- Performance and efficiency: configuration flexibility; balanced capacity, performance and throughput with flexible options for disks, CPUs, I/O and interconnects
Highest storage density in a traditional 2U rack server: 224 TB
HP Apollo 4530 System - Massive density for Hadoop and Big Data Analytics
Purpose-built for Hadoop and Big Data analytics - Available June 1, 2015
- Analytics, Hadoop optimized: 3 servers in a 4U chassis, ideal for Hadoop-based analytics with 3-copy data replication
- At scale, efficient analytics scaling: up to 30 servers with 15 HDDs/SSDs each and 3.6 PB capacity per 42U rack
- Versatile performance, for Big Data variety: customize for Hadoop workload variety and NoSQL analytics with disk, CPU, I/O and interconnect options
Unleash the full value of Big Data with Hadoop
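The density figures for the Apollo 4200 (224 TB in 2U) and the Apollo 4530 (3.6 PB per 42U rack) can be checked with simple arithmetic. The sketch below assumes 8 TB drives, the larger of the two disk sizes listed on the reference architecture slide; the slides do not say which drive size the totals are based on, so that pairing is an assumption.

```python
DISK_TB = 8  # assumption: 8 TB drives, the larger size mentioned in the deck


def capacity_tb(servers, drives_per_server, disk_tb=DISK_TB):
    """Raw capacity in TB for a group of identical servers."""
    return servers * drives_per_server * disk_tb


# Apollo 4200: one 2U server with 28 LFF drives.
apollo_4200_tb = capacity_tb(servers=1, drives_per_server=28)

# Apollo 4530: 30 servers per rack, 15 drives each, 3 servers per 4U chassis.
apollo_4530_rack_tb = capacity_tb(servers=30, drives_per_server=15)
rack_units_used = (30 // 3) * 4

print(apollo_4200_tb)               # 224
print(apollo_4530_rack_tb / 1000)   # 3.6
print(rack_units_used)              # 40
```

Under that assumption both headline numbers fall out directly, and ten 4U chassis occupy 40U of the 42U rack.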
You need more than good servers to get a good cluster
It's also about networking and cluster operation
HP Networking: the network matters for Hadoop clusters
- HP's perfect Top of Rack and Aggregation switch offer
- Hadoop likes the HP Deep Buffer caching feature
- HP IRF simplifies the architecture of server access networks and enables massive scalability
- HP FlexFabric 5930 Switch Series: 32 x 40GbE + 6 x 40G uplink ports; family of high-density, ultra-low-latency aggregation switches
- HP FlexFabric 5900 Switch Series: 48 x 10GbE + 4 x 40GbE ports; family of low-latency Top of Rack (ToR) switches
- 1GbE, 10GbE or 40GbE
HP Insight Cluster Management Utility
- Designed to operate Top500 clusters
- Provision thousands of nodes in minutes
- Monitor clusters of any size (2D instant view, 3D time view)
- Control thousands of servers like one
- Perfectly fits Hadoop cluster operation needs
- Real-time analysis of Hadoop cluster behavior
HP Big Data Reference Architecture
Interesting released Hadoop feature - Architecture trends
YARN Node Labelling (Node-labels / jira YARN-796)
Capability to create groups of similar nodes, so that different types of applications, each with a different workload, run on the most appropriate group of nodes
- Admin tags nodes with labels (e.g. GPU, Storm)
- One node can have more than one label (e.g. GPU, m710)
- Applications can include labels in container requests
[Diagram: an Application Master asks "I want a GPU"; Node Manager [Storm]; Node Manager [Analytic, XL230a] on an HP Apollo 6000 blade; Node Manager [GPU, m710] on a Moonshot cartridge]
Enabling the next generation of Hadoop applications...
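The matching idea behind labels can be illustrated with a toy model: a container request names the labels it needs, and only nodes carrying all of them qualify. This is a sketch of the concept only, not the YARN scheduler or its API; the node names are hypothetical, while the label sets mirror the ones in the slide diagram.

```python
# Hypothetical node inventory; labels follow the slide diagram.
NODE_LABELS = {
    "node-a": {"Storm"},
    "node-b": {"Analytic", "XL230a"},
    "node-c": {"GPU", "m710"},
}


def nodes_for_request(required_labels, node_labels=NODE_LABELS):
    """Return, sorted by name, the nodes that carry every requested label."""
    required = set(required_labels)
    return sorted(
        name for name, labels in node_labels.items() if required <= labels
    )


# "I want a GPU": only the Moonshot cartridge node qualifies.
print(nodes_for_request(["GPU"]))  # ['node-c']
```

A request with no labels matches every node, which corresponds to an application that has no placement preference.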
Interesting released Hadoop features - Architecture trends
HDFS Tiering / Heterogeneous Storage Tiers (HDFS-2832)
For example, HBase can request that its data files (HFiles) be stored on SSD. When HBase then writes to and reads from HDFS, these requests hit SSD and meet the latency requirements HBase has for supporting near-real-time applications.
Phase 2:
- HDFS-5682 - Application APIs for heterogeneous storage
- HDFS-7228 - SSD storage tier
- HDFS-5851 - Memory as a storage tier (beta)
HDFS Archival Storage Design (HDFS-6584)
Introduces a new concept of storage policies. To accommodate future storage technology and different cluster characteristics, cluster administrators will be able to modify the predefined storage policies and/or define custom storage policies.
Data policy names: Very Hot, Hot, Warm, Luke Warm, Cold
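A storage policy essentially decides which storage type each replica of a block is placed on. The sketch below models that idea in plain Python. It uses the policy names documented for released Apache Hadoop (Hot, Warm, Cold, All_SSD, One_SSD) rather than the design-stage names on the slide, and the replica layouts are my reading of that documentation, so treat them as assumptions to verify against your Hadoop version. The function name is hypothetical.

```python
def replica_storage_types(policy, replication=3):
    """Storage type for each replica of a block under a storage policy.

    The layouts are assumptions based on the Apache Hadoop documentation
    of built-in policies; they are not taken from the slide.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    layouts = {
        "All_SSD": lambda n: ["SSD"] * n,
        "One_SSD": lambda n: ["SSD"] + ["DISK"] * (n - 1),
        "Hot": lambda n: ["DISK"] * n,
        "Warm": lambda n: ["DISK"] + ["ARCHIVE"] * (n - 1),
        "Cold": lambda n: ["ARCHIVE"] * n,
    }
    if policy not in layouts:
        raise ValueError(f"unknown storage policy: {policy}")
    return layouts[policy](replication)


# An HBase-style latency-sensitive directory versus rarely read archive data.
print(replica_storage_types("One_SSD"))  # ['SSD', 'DISK', 'DISK']
print(replica_storage_types("Cold"))     # ['ARCHIVE', 'ARCHIVE', 'ARCHIVE']
```

In the HBase example from the slide, placing HFiles under an SSD-backed policy is what lets reads and writes hit the faster tier.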
New approach to address Big Data demands: Modern and Flexible
Current traditional Big Data approach:
- Compute and storage are always collocated
- All servers are identical
- Data is partitioned across servers on direct-attached storage (DAS)
- Two-socket, 2U servers run YARN applications, HDFS, ORC files, Parquet, HBase, Cassandra
New HP Big Data approach:
- Separate compute and storage tiers connected by Ethernet networking
- Standard Hadoop installed asymmetrically, with storage components on the storage servers and YARN applications on the compute servers
- Compute-optimized servers: YARN applications
- Storage-optimized servers: HDFS, ORC files, Parquet, HBase, Cassandra
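The difference between the two layouts comes down to which Hadoop daemons run on which class of server. A minimal sketch follows, assuming the standard daemon names NodeManager (YARN compute) and DataNode (HDFS storage); the layout and server-class names are hypothetical labels chosen for illustration.

```python
def daemons_for(server_class, layout):
    """Hadoop worker daemons to run on a server under a given layout.

    layout: "traditional" (collocated) or "asymmetric" (separate tiers).
    server_class: "compute" or "storage"; ignored for the traditional
    layout, where all servers are identical.
    """
    if layout == "traditional":
        return ["NodeManager", "DataNode"]
    if layout == "asymmetric":
        tiers = {"compute": ["NodeManager"], "storage": ["DataNode"]}
        if server_class not in tiers:
            raise ValueError(f"unknown server class: {server_class}")
        return tiers[server_class]
    raise ValueError(f"unknown layout: {layout}")


print(daemons_for("compute", "asymmetric"))   # ['NodeManager']
print(daemons_for("storage", "asymmetric"))   # ['DataNode']
print(daemons_for("storage", "traditional"))  # ['NodeManager', 'DataNode']
```

Because compute servers hold no HDFS blocks in the asymmetric layout, adding or removing them does not trigger any data movement.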
Benefits of HP Big Data Reference Architecture
HP Moonshot and HP Apollo servers address a variety of enterprise big data needs
Compute: HP Moonshot. Storage: HP Apollo. Interconnect: Ethernet (RoCE)
- Cluster consolidation: multiple big data environments can directly access a shared pool of data
- Flexibility to scale: scale compute and storage independently
- Maximum elasticity: rapidly provision compute without affecting storage
- Breakthrough economics: significantly better density, cost and power through workload-optimized components
HP Apollo and Moonshot - HP Big Data Reference Architecture
Big Data Reference Architecture versus traditional architecture:
- 2X Hadoop MapReduce performance with the same footprint
- 2.5X HBase performance with the same footprint
- 2X higher density
- 20% more memory
- 46% less power (Watts)
Note: Comparison configuration is ProLiant DL380 Gen9 servers
Maximum Elasticity for Big Data workloads
Hadoop Node Labels feature (jira YARN-796)
- HP contributed IP into the Hadoop trunk
- Specifying labels on nodes allows scheduling of YARN containers to specific pools of nodes, so admins are able to target workloads at optimized platforms
- Combined with the HP Big Data Reference Architecture, compute nodes can be dynamically assigned, with no data repartitioning
[Diagram: compute nodes shift between Hadoop Cluster 1 and Hadoop Cluster 2 across the 12am-6am and 6am-12am windows, alongside Vertica, Analytics and Spark, all on top of shared storage]
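The time-of-day reassignment shown in the diagram can be sketched as a schedule that splits a fixed pool of compute nodes between clusters. The two windows (12am-6am and 6am-12am) come from the slide; the cluster names, node names and node counts below are hypothetical, and the code models only the bookkeeping, not the relabelling mechanism itself.

```python
# Hypothetical schedule: (start_hour, end_hour, {cluster: compute_node_count}).
SCHEDULE = [
    (0, 6, {"hadoop-cluster-1": 8, "hadoop-cluster-2": 2}),
    (6, 24, {"hadoop-cluster-1": 3, "hadoop-cluster-2": 7}),
]


def assign_compute_nodes(hour, nodes, schedule=SCHEDULE):
    """Split the compute nodes between clusters for the given hour (0-23)."""
    for start, end, shares in schedule:
        if start <= hour < end:
            break
    else:
        raise ValueError(f"no schedule entry covers hour {hour}")
    if sum(shares.values()) > len(nodes):
        raise ValueError("schedule asks for more nodes than are available")
    assignment, cursor = {}, 0
    for cluster, count in shares.items():
        assignment[cluster] = nodes[cursor:cursor + count]
        cursor += count
    return assignment


nodes = [f"compute-{i:02d}" for i in range(10)]
night = assign_compute_nodes(3, nodes)
day = assign_compute_nodes(14, nodes)
print(len(night["hadoop-cluster-1"]), len(day["hadoop-cluster-1"]))  # 8 3
```

Only the compute assignment changes between windows; the storage tier is untouched, which is why no data repartitioning is needed.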
HP Haven & Hadoop
HP IDOL for Hadoop
To Build a Smarter Data Lake
HP Intelligent Data Operating Layer (IDOL), aka HP Autonomy IDOL
The OS for human information
Single processing layer to handle the continuum of human information
- Connect: access virtually any source of information
- Understand: form an understanding of information, including docs, emails, databases, social media, rich media, etc.
- Act & Automate: over 500 functions to derive actionable insights
A Smarter Data Lake Needs HP IDOL
What a smarter data lake needs:
- Break down information silos across the enterprise
- Understand myriad file formats and types
- Improved, intuitive visibility to contents
- Automatically analyse rich media
HP IDOL features (integration points with Hadoop):
- Connectors & Policies
- KeyView + IDOL to Vertica
- IDOL Server (incl. HDFS Sync)
- Image Server & Video Server
- Knowledge Graph
- Advanced Speech-to-Text
HP Vertica SQL on Hadoop
Fast analytics on Hadoop
HP Vertica Analytics Platform 7
High-performance data analytics platform purpose-built for big data - columnar database engine
- Blazing fast analytics: gain insight into your data in near-real time by running queries 50x-1,000x faster than legacy products
- Massive scalability (PBs): infinitely scale your solution by adding an unlimited number of industry-standard servers
- Open architecture: protect and embrace your investment in hardware and software with built-in support for Hadoop, R, and a range of ETL and BI tools
- Optimized data storage: store 10x-30x more data per server than row databases with patented columnar compression
- Load & analyze growing forms of semi-structured data: quickly and easily load, explore and analyze emerging and rapidly growing forms of semi-structured data
- Easy set-up and administration: get to market quickly with your analytics initiatives at low cost of administration and maintenance
Speed, scalability, and openness at lower TCO
HP Vertica Data Storage Options and Performance
HP Vertica SQL on Hadoop

Option   Query engine           Format           File system
1        Vertica ANSI SQL-99    Vertica ROS      EXT4
2        Vertica ANSI SQL-99    Vertica ROS      HDFS
3        Vertica ANSI SQL-99    Hadoop format*   HDFS
4        Vertica ANSI SQL-99    Flex Tables      HDFS
5        Vertica ANSI SQL-99    Flat files       HDFS

From option 1 to option 5: performance runs from fastest to slowest, usage from analytics to discovery, and data from structured to semi-structured.
*Supported Hadoop file formats: Parquet, ORC
HP SecureData for Hadoop
To secure your data
HP SecureData: Data-Centric Encryption and Tokenization
Components: HP SecureData Key Servers; HP SecureData Central Management Console; HP SecureData Web Services API; HP SecureData Command Line and Automated Parsers; HP SecureData Native APIs (C, Java, C#, .NET)
- HP Stateless Key Management: no key database to store or manage; high performance, unlimited scalability
- Both encryption & tokenization technologies: Format Preserving Encryption (FPE) for de-identification; Secure Stateless Tokenization (SST) for Payment Card Industry; customize the solution to meet your exact requirements
- Broad platform support: on-premise / cloud / Big Data; structured / unstructured; Linux, Hadoop, Windows, AWS, IBM z/OS, HP NonStop, Teradata, etc.
- Quick time-to-value: complete end-to-end protection within a common platform; format preservation dramatically reduces implementation effort
[Figure: sample protected values. Inputs: Credit Card 1234 5678 8765 4321; Tax ID 934-72-2356; a record with First Name: Gunther, Last Name: Robertson, SSN: 934-72-2356, DOB: 20-07-1966, shown de-identified as First Name: Uywjlqo, Last Name: Muwruwwbp, SSN: 253-67-2356, DOB: 18-06-1972. Outputs shown for FPE, AES, SST, Partial SST and Obvious SST include 345-753-5772, 8736 5533 4678 9453, 347-982-8309, 1234 5633 4678 4321, 347-982-2356, 1234 56AZ UYTZ 4321, AZS-UXD-2356, and opaque AES ciphertext strings.]
Options for Securing Data in Hadoop with HP Security Voltage
[Diagram: numbered HP Security Voltage interface points (1, 2, 4, 5, 6, 7) along the data path: source applications & data; landing zone (ETL & batch); HDFS and Hadoop jobs & analytics inside the Hadoop cluster; egress zone (ETL & batch); BI tools & downstream applications.
Legend: unprotected data; de-identified data; application with HP Security Voltage interface point; standard application]
HP Trafodion v1.0.0 (Open Source since June 2014)
Forrester - Mike Gualtieri (October 22nd, 2013): The Future of Hadoop is real time and transactional
Doug Cutting (October 30th, 2013): We're in the middle of a revolution in data processing; it is inevitable that we will see just about every kind of workload be moved to this platform, even OnLine Transaction Processing (OLTP)
Copyright 2013 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Addresses an under-served Hadoop market segment
[Diagram: workload classes along a response-time axis from sub-second to hours]
- Operational (Trafodion focus): real-time insights; SQL DBMS = OLTP + summary BI
- Interactive: parameterized reports, drilldown visualization, exploration
- Non-interactive: data preparation, incremental batch processing, dashboards, scorecards
- Batch: operational batch processing, enterprise reports, data mining
Current market focus: data warehousing and analytics
Adds value to Hadoop: transaction support, data integrity, real-time performance, operational optimizations, workload management
Trafodion
Trafodion is a joint HP Labs and HP-IT research project to develop operational SQL-on-Hadoop database capabilities
- Complete: full-function SQL. Reuse existing SQL skills and improve developer productivity
- Protected: distributed ACID transactions. Guarantees data consistency across multiple rows, tables, SQL statements
- Efficient: optimized for low-latency read and write transactions. Supports real-time, high-concurrency transaction processing applications
- Interoperable: standard ODBC/JDBC access. Works with existing tools and applications
- Open: Hadoop and Linux distribution neutral. Easy to add to your existing infrastructure, with no vendor lock-in
Hadoop + Operational SQL
Open source project sponsorship and investment from HP
Production-ready version 1.0 release available at www.trafodion.org
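To make "data consistency across multiple rows, tables, SQL statements" concrete, here is a minimal sketch of a multi-statement transaction that either commits as a whole or rolls back as a whole. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; it is not Trafodion, which per the slide is reached over ODBC/JDBC, and the table, column and function names are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between two rows atomically; roll back on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()
```

A successful call such as `transfer(conn, 1, 2, 30)` updates both rows together, while an overdraft attempt leaves both rows exactly as they were, because the first UPDATE is rolled back along with everything else in the transaction.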
HP Big Data Services
Advisory and Discovery Services for Big Data
Advisory: our industry and technical experts can support people in technology assessments and strategy development.
- Big Data TW (Transformation Workshop format): used to define Big Data strategy
Discovery Experience:
- Discovery Workshop: used to identify/prioritize use cases; validate functional and technical viability
- Discovery Lab: time-boxed engagement to run a pilot; based on use cases from the workshop; run on the Haven cloud environment
- Insert a Haven lab in the customer ecosystem: platform, platform management and lab function management (on-premise or cloud)
HP Services for Hadoop: bringing value to the customer
Technical Services and Analytics Services:
- Hadoop Roadmap Service
- Enterprise Design Services
- Advisory & Discovery Services
- Information Management Services
- Hadoop Proof of Concept
- Cluster Implementation Services
- Hadoop Solutions & Applications Development
- Data Science Services
Support/Management Services:
- Cluster Support
- Managed Services
- As-a-Service
Summary
Summary
HP offers industry-leading capability for Hadoop: open systems, deep expertise, complete support, ongoing innovation
- Leading partnerships
- Contribution to Apache community
- Collaboration with Hortonworks
- Full portfolio of consulting services
- Projects
- Moonshot
- HP ProLiant Gen9
- HP Apollo 4200-4530
- Industry Standard Solutions
- HP Insight CMU
- HP BSM
- HP DSM
- Global Solution Center
- Haven Big Data Platform
Designed for Big Data
Thank You
Learn more about HP Haven
www.hp.com/go/haven
- Solution brochure
- Technical white paper
- HP Vertica SQL on Hadoop FAQ
- Customer analytics use case
HP Big Data Reference Architecture: External Collateral
White papers:
- HP Big Data Reference Architecture: A Modern Approach
  http://h20195.www2.hp.com/v2/getdocument.aspx?docname=4aa5-6141enw&cc=us&lc=en
- HP Big Data Reference Architecture: Cloudera Enterprise reference architecture implementation
  http://h20195.www2.hp.com/v2/getdocument.aspx?docname=4aa5-6137enw&cc=us&lc=en
- HP Big Data Reference Architecture: Hortonworks Data Platform reference architecture implementation
  http://h20195.www2.hp.com/v2/getdocument.aspx?docname=4aa5-6136enw&cc=us&lc=en
Blog posts:
- HP blog post (from Greg Battas)
  http://h30507.www3.hp.com/t5/hyperscale-computing-blog/the-future-of-big-data-platforms-bringing-order-to-chaos-and/ba-p/178209#.vh91wkpna9i
- Hortonworks blog post
  http://hortonworks.com/blog/want-new-ways-optimize-big-data-workloads/
- Joseph George's blog post (The HP Big Data Reference Architecture: It's Worth Taking a Closer Look)
  http://hp.nu/i20rn
- Silicon Angle blog post
  http://siliconangle.com/blog/2014/12/23/hp-thinks-its-got-a-better-way-to-run-hadoop-hpdiscover/
- Forrester blog post
  http://blogs.forrester.com/richard_fichera/15-01-28-rethinking_analytics_infrastructure
Videos:
- Steve Tramack interview on The Cube at Discover
  https://www.youtube.com/watch?v=x2ymmuhzxas&list=plenh213llmcbdrkaihfw9ue9zkxdygkxs
Monitoring Hadoop with HP Insight Cluster Management Utility
[Screenshot: Timed View of Hadoop worker nodes]
Real-time analysis of Hadoop cluster behavior