White Paper
Intel Distribution for Apache Hadoop*
Big Data

Real-Time Big Data Analytics for the Enterprise
SAP HANA* and the Intel Distribution for Apache Hadoop* Software

Executive Summary

Companies are using real-time big data analytics to reshape the competitive landscape in their industries. They do it by capturing, storing, and analyzing volumes and varieties of data that were previously unmanageable, and then extracting insights fast enough to support real-time business processes. What started with a few leading Internet companies has spread to finance, healthcare, government, manufacturing, retail, scientific research, and many other fields.

Yet implementing real-time big data analytics can be challenging, requiring IT organizations to implement mission-critical solutions based, at least in part, on open-source software that does not always meet enterprise requirements. Not only is integration complex, but IT organizations must establish security, compliance, and high availability from the ground up to ensure the system is up to the challenge of housing sensitive data and supporting revenue-generating business processes.

Intel and SAP have addressed these challenges to provide an enterprise-ready solution for real-time big data analytics. With SAP HANA* running on the latest Intel Xeon processor E7 family and the Intel Distribution for Apache Hadoop* software running on the latest Intel Xeon processor E5 family, businesses can ingest, store, and analyze petabytes of polystructured data, and they can generate insights in fractions of a second to support real-time business processes.

This solution includes a rich set of data management and business intelligence tools for turning data into high-value insights that can be embedded into other applications and business processes. Just as importantly, the solution is designed to meet enterprise requirements of security, compliance, and high availability so businesses can confidently integrate sensitive data into their analytics environment.

This white paper discusses the value of performing real-time analytics using all available enterprise data and describes how Intel and SAP have overcome the inherent challenges to deliver an enterprise-ready solution.

Table of Contents

Executive Summary
Extending Real-Time Analytics to All Enterprise Data
Solving the Challenges of Big Data Integration
Advanced Analytics across All Data Sets
Industry-Leading Performance for Apache Hadoop
Integrated Data Management
An Enterprise-Ready Platform
End-to-End Security
High Availability
Enterprise-Class Manageability
SAP and Intel: A Shared Vision for Big Data Integration
SAP: Single Point of Contact for Service and Support
Conclusion

Extending Real-Time Analytics to All Enterprise Data

Advances in data analytics are changing the way businesses compete, enabling them to make faster and better decisions based on real-time analysis. Until recently, companies had to make tradeoffs between deep analysis of large data sets and fast time to results. Intel and SAP are eliminating the need to compromise with an analytics platform designed to deliver real-time query performance while acting on petabytes of both structured and unstructured data.

SAP HANA provides a real-time analytics platform using an in-memory database. Organizations can combine large data sets from their operational systems and other sources and perform complex queries in real time, typically in milliseconds. They can even use a single SAP HANA instance as a common foundation for all their applications, both transactional and analytical. This approach streamlines infrastructure and eliminates the physical and operational complexities of moving large amounts of data from operational systems to analytic systems. With these capabilities, SAP HANA answers the business challenge of delivering data-driven intelligence to support real-time business processes.

Big data introduces a new set of challenges. Companies generate enormous volumes of polystructured data from Web logs, sensors, call records, social network posts, emails, and many other sources. They need a cost-effective, massively scalable solution for capturing, storing, and analyzing this data. They also need to be able to integrate their big data into their real-time analytics environment to maximize business value.

For example, many companies want to analyze the clickstream trails of online customers in combination with historical purchasing patterns to deliver personalized offers and information. Deep analysis across diverse data sets can improve outcomes in such scenarios, but results must arrive quickly enough to influence the online transaction while it is still in progress.

Intel and SAP have collaborated to meet this challenge by integrating the Intel Distribution for Apache Hadoop (IDH) software with SAP HANA, SAP Data Services, and SAP Business Objects. The result is a real-time analytics platform designed to efficiently ingest, store, integrate, and analyze all enterprise data. The platform offers:

- Real-time analytics with cost-effective storage that can scale to petabytes, and potentially exabytes, of data.
- Transparent data integration and query federation, so advanced analytics can be applied across all data using SAP tools and familiar SQL-based programming models.
- Enterprise-class support for security, compliance, and manageability so businesses can realize the advantages of real-time big data analytics more quickly and with reduced cost and risk.

Solving the Challenges of Big Data Integration

SAP HANA is known for its unmatched query performance at scale. Intel collaborated with SAP engineers to help them optimize their in-memory processing platform to get maximum benefit from the hardware capabilities of the Intel Xeon processor E7 family, including its multicore architecture, large cache, large memory capacity, and high-bandwidth I/O channels. Based on these efforts, SAP HANA speeds query processing times by as much as 10,000 times [1] versus traditional data warehouse solutions.

The latest Intel Xeon processor E7 v2 family delivers even greater performance benefits and can process much larger in-memory data sets. These new processors support three times more memory than previous-generation processors: up to 6 TB on a four-socket server and up to 12 TB on an eight-socket server. They also provide more cores, threads, and system bandwidth to enable up to 2x faster performance [2] for complex, ad hoc queries, compared to previous-generation SAP HANA platforms.

The distributed architecture of Apache Hadoop addresses very different requirements than SAP HANA. Hadoop enables query performance and data capacity to be scaled cost-effectively across tens to hundreds of standard, two-socket servers based on Intel Xeon processors and configured with direct-attached storage drives. This clustered architecture stores and processes data at a cost-per-terabyte that is far lower than traditional data warehousing systems.

Although Hadoop enables fast processing of massive data sets, queries typically take minutes to hours to complete. This creates challenges when integrating Hadoop into a real-time analytics environment. Intel and SAP address these challenges in two ways. First, IDH is highly optimized for performance on Intel architecture (see sidebar). Second, Intel and SAP make it easy to generate queries that make efficient use of both platforms.
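
As a quick sanity check on the figures quoted in this section, the short Python sketch below recomputes the "up to 2x" speedup from the queries-per-hour scores reported in footnote 2, and the per-socket memory implied by the quoted server capacities. The function and variable names are illustrative only, and reading "three times more memory" as three times the previous maximum is an assumption.

```python
# Figures quoted in this paper (footnote 2 and the memory capacities above).
BASELINE_QPH = 110_061   # previous-generation platform, queries per hour
NEW_GEN_QPH = 218_406    # Intel Xeon processor E7 v2 platform, queries per hour


def speedup(new: float, old: float) -> float:
    """Return the throughput ratio new/old."""
    return new / old


def memory_per_socket_tb(total_tb: float, sockets: int) -> float:
    """Return the maximum memory per socket implied by a server total."""
    return total_tb / sockets


ratio = speedup(NEW_GEN_QPH, BASELINE_QPH)
four_socket = memory_per_socket_tb(6, 4)
eight_socket = memory_per_socket_tb(12, 8)

# Assumption: "three times more memory" means 3x the previous maximum,
# which implies this four-socket maximum for the previous generation.
previous_four_socket_tb = 6 / 3

print(f"speedup: {ratio:.2f}x")
print(f"memory per socket: {four_socket} TB (4-socket), {eight_socket} TB (8-socket)")
print(f"implied previous four-socket maximum: {previous_four_socket_tb} TB")
```

The measured ratio is slightly below 2, which is consistent with the "up to 2x" wording, and both server sizes imply the same memory ceiling per socket.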

Advanced Analytics across All Data Sets

SAP HANA and SAP Business Objects provide comprehensive support for advanced analytics, including traditional SQL-based queries, dashboards, predictive analytics, planning, text mining, and more. In combination with IDH, these models can be applied transparently across the data stored in both platforms.

BI users and developers see data stored in IDH as an extension of the data stored in SAP HANA. The queries they generate are automatically federated, as appropriate, across the two platforms. For example, one part of a query might extract customer purchasing data from SAP HANA; another part might search associated Web server logs or call center data records in the Hadoop cluster. The results are then combined and further analyzed in SAP HANA to provide desired insights. As part of this query federation process, some components of the SQL queries generated by BI users and developers are automatically translated into MapReduce* applications that can run natively in Hadoop.

The separate parts of a federated query can be performed simultaneously. They can also be performed asynchronously, so that intermediate results from the Hadoop cluster are available as needed to support real-time processes in SAP HANA. Query performance statistics are provided, so developers can shape queries to address specific latency requirements.

Sidebar: Industry-Leading Performance for Apache Hadoop*

The Intel Distribution for Apache Hadoop* (IDH) software is optimized with the latest Intel Xeon processors, Intel Solid-State Drives, and 10 gigabit Intel Ethernet Adapters to deliver:

- Up to 30x higher performance than unoptimized Hadoop software running on legacy hardware. [3]
- Up to 2.6x faster performance than other open-source Hadoop distributions running on the same hardware platform. [4]

Additional optimizations within IDH help to improve performance for other key functions, such as MapReduce* job launches and Hive* queries. (Hive provides data-warehouse-like functionality for Hadoop environments and is a key component for integrating the Intel Distribution with SAP HANA*.) These and other optimizations help to shorten query completion times. They also allow organizations to perform more queries in the time available, which provides greater agility and better utilization of the infrastructure.
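
The query federation described under "Advanced Analytics across All Data Sets" (one part of a query served from SAP HANA, another from the Hadoop cluster, with the results combined afterward) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the SAP or Intel implementation: a dictionary stands in for the in-memory store, a list of log lines stands in for the Hadoop cluster, and every name in the block is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a dict plays the in-memory store and a list of log
# lines plays the Hadoop cluster. Neither is a real SAP or Hadoop interface.
PURCHASES = {"c1": 120.0, "c2": 35.5, "c3": 80.0}
WEB_LOGS = [
    "c1 /product/42",
    "c2 /home",
    "c1 /cart",
    "c3 /product/42",
    "c1 /product/7",
]


def query_in_memory(min_total: float) -> dict[str, float]:
    """Fast sub-query: customers whose purchase total meets the threshold."""
    return {c: t for c, t in PURCHASES.items() if t >= min_total}


def query_cluster(path_prefix: str) -> dict[str, int]:
    """Batch-style sub-query: count page views per customer for a URL prefix."""
    counts: dict[str, int] = {}
    for line in WEB_LOGS:
        customer, path = line.split(" ", 1)
        if path.startswith(path_prefix):
            counts[customer] = counts.get(customer, 0) + 1
    return counts


def federated_query(min_total: float, path_prefix: str) -> list[tuple[str, float, int]]:
    """Run both sub-queries concurrently, then join the results on customer ID."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        purchases_future = pool.submit(query_in_memory, min_total)
        views_future = pool.submit(query_cluster, path_prefix)
        purchases = purchases_future.result()
        views = views_future.result()
    return sorted(
        (customer, total, views[customer])
        for customer, total in purchases.items()
        if customer in views
    )


result = federated_query(50.0, "/product")
print(result)
```

The two sub-queries are submitted at the same time, mirroring the point that the separate parts of a federated query can run simultaneously; the join happens only once both partial results are available.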

[Figure 1 diagram: Real-Time Analytics with Big Data Integration. Elements shown: weather data, market data, location data, Web logs, call logs, and sensor logs; SAP HANA* (ETL, OLAP analysis, real time); SAP HANA Smart Data Access, optimized for data relocation and for query federation and acceleration (proxy tables, hot replication, caching); SAP Data Services and SAP Business Objects (data mining, reporting); and the Intel Distribution for Apache Hadoop software, comprising big data connectors (ingest, export), Sqoop* (data exchange), Flume* (log collector), Oozie* (workflow), Zookeeper* (coordination), Pig* (scripting), Mahout* (machine learning), R* (stats), HCatalog* (metadata), Hive* (query), HBase* (NoSQL store), YARN* (+ MapReduce*) distributed processing framework, HDFS (Hadoop* Distributed File System), and Intel Manager for Apache Hadoop* software (deployment, configuration, monitoring, alerts, and security). Legend: some Intel optimization; extensive Intel optimization.]

Figure 1. The SAP HANA* Smart Data Access connector has been engineered and optimized by Intel and SAP to simplify and accelerate data sharing and query execution across both platforms. As a result, analysts can achieve fast query results across petabytes of structured and unstructured data.

Much of this functionality is supported through the SAP HANA Smart Data Access connector, which Intel and SAP have optimized for use with IDH (Figure 1). This connector supports data relocation as well as the creation of proxy tables within SAP HANA to simplify and accelerate data access and query execution.

Intel implemented a number of optimizations to improve query performance on Apache Hadoop. One example is hot replication, in which multiple replicas of frequently used data are automatically created to avoid contention. Suppose a company launches a popular new product, and the associated data is under continuous demand. Dozens or even hundreds of replicas can be generated so the data can be accessed and manipulated without bottlenecks.

Another performance-enhancing feature is caching. Frequently used data and intermediate query results are automatically stored in the in-memory database of SAP HANA, so they can be accessed almost instantly when needed.

With these and other optimizations, Intel and SAP help to make the integration between SAP HANA and IDH as seamless and as transparent as possible for BI users and developers.
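
The two optimizations described above, hot replication and caching, can be illustrated with a small Python sketch. It is an analogy only and does not reflect how IDH or SAP HANA implement these features: the thresholds, the cap, and the function names are hypothetical, and an in-process cache from the Python standard library stands in for the in-memory database.

```python
from functools import lru_cache

BASE_REPLICAS = 3         # Hadoop's default replication, as stated in this paper
READS_PER_REPLICA = 100   # hypothetical: one extra replica per 100 reads
MAX_REPLICAS = 50         # hypothetical upper bound on replicas


def target_replicas(reads_in_window: int) -> int:
    """Scale the replica count with demand, between the base and the cap."""
    extra = reads_in_window // READS_PER_REPLICA
    return min(BASE_REPLICAS + extra, MAX_REPLICAS)


@lru_cache(maxsize=128)
def cached_query(sql: str) -> int:
    """Stand-in for a slow cluster query; repeat calls are served from memory."""
    return len(sql)  # placeholder result, not a real query execution


cold = target_replicas(40)           # rarely read data keeps the base count
hot = target_replicas(2500)          # heavily read data gains extra replicas
capped = target_replicas(1_000_000)  # demand beyond the cap is clamped

cached_query("SELECT 1")
cached_query("SELECT 1")             # second call hits the cache
stats = cached_query.cache_info()

print(cold, hot, capped, stats.hits, stats.misses)
```

The replica policy shows the idea of adding copies in proportion to demand, and the cache statistics show that only the first request for a given query reaches the slow path.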

Integrated Data Management

SAP Data Services provides an integrated, enterprise-class platform for data integration, data quality, data profiling, and metadata management. System administrators can use it to load and manage data across both SAP HANA and IDH for SAP. They can also use it to manage data that has been loaded independently into the Hadoop cluster.

An Enterprise-Ready Platform

SAP HANA is engineered specifically to support mission-critical computing environments. Intel implements advanced security and reliability features in the Intel Xeon processor E7 family and related platform components, and works with SAP to ensure they are fully utilized throughout the SAP HANA solution stack.

Apache Hadoop, on the other hand, is an open-source software application that combines features and optimizations generated by many companies and individuals. This development model enables exceptionally fast innovation, which is evidenced by the rapid evolution of the Hadoop software ecosystem. However, because of this rapid evolution, there are gaps in most available Hadoop distributions, particularly with respect to security, availability, and manageability. These gaps have kept many businesses from deploying Hadoop in production environments.

Intel has worked to close those gaps in IDH. IDH includes the full open source solution stack, with all components pre-integrated and optimized to improve performance on Intel architecture. Intel also integrates a combination of open source and proprietary tools to provide a platform that addresses the requirements of enterprise deployments.

End-to-End Security

IDH provides end-to-end security to protect data. Tools and capabilities include:

- Authentication and access control. IDH supports user authentication and role-based access controls. Queries generated in SAP Business Objects are authenticated just once for both SAP HANA and IDH, and IDH provides granular access controls for data and services. Users and queries can only access authorized data sets, which helps to protect sensitive data against both internal threats and external hackers.
- Fast, transparent data encryption. IDH uses Intel Data Protection Technology with Advanced Encryption Standard New Instructions (AES-NI) [5], which accelerates encryption and decryption performance by up to 19 times [6], to enable strong data protection without compromising query performance. Data can be encrypted selectively and transparently, both in motion and at rest, to meet security and compliance requirements. Within IDH, transparent encryption is supported in Hive, Pig*, MapReduce, HBase*, and the Hadoop Distributed File System* (HDFS*).
- Governance. All database operations are logged across both SAP HANA and IDH and can be audited to verify that users only access authorized data sets and services. Reports and automated alerts help IT protect data and document compliance.

Intel is working to extend these and other security capabilities across the Hadoop ecosystem through an open source project called Project Rhino (Figure 2). The goal is to establish a comprehensive security framework for Hadoop that will help businesses address security issues and compliance protocols across a wide range of use cases in financial, healthcare, government, and e-commerce environments. Project Rhino will contribute code to the Apache Software Foundation so these capabilities will be freely available.

[Figure 2 diagram: Project Rhino, establishing comprehensive security for Apache Hadoop*. Elements shown: connectors (Netezza, Oracle, SAP, SQLServer, Teradata, DB2); Sqoop* (data transfer); Flume* (log collector); Oozie* (workflow); Zookeeper* (coordination); Kafka* (event bus); Pig* (scripting); Lucene*, Solr* (search); Mahout* (machine learning); R* (stats); HCatalog* (metadata); Hive (query); HBase*; Gryphon* (low-latency SQL-92); Intel Distribution for Apache Hadoop Analytics Workbench (recommendation engine, behavior model, graph mining); vertical accelerators; YARN* (+ MapReduce*) distributed processing framework; SLURM* scheduler; Hadoop-compatible file systems (HDFS, Lustre*, GlusterFS); high availability and disaster recovery; Rhino security (encryption, authentication, authorization, auditing); and Intel Manager (HBase Explorer, heat map, security controls, job profiler, resource monitor, upgrade, alerts, unified logging, tuning, configuration, deployment). Legend: Intel proprietary components; Intel-optimized open source components; includes Intel security enhancements.]

Figure 2. The Intel Distribution for Apache Hadoop* includes extensive enhancements for enterprise-class security and compliance, and Intel is working on Project Rhino to establish a comprehensive security framework across the Hadoop* ecosystem. The goal is to provide a common authentication and authorization framework with integrated support for regulatory requirements in financial, healthcare, government, and e-commerce environments.

High Availability

Big data analytics are often used to improve outcomes in revenue-producing business processes, so high availability is important. SAP HANA provides integrated support for data replication and system failover to prevent downtime. Hadoop implements 3-way data replication by default, so that any data node in a cluster could fail without impacting service or data availability. However, the cluster NameNode and JobTracker servers, which are required in all Hadoop deployments, are potential single points of failure. IDH provides integrated support for high availability for both these critical servers.
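
To make the 3-way replication claim concrete, the following Python sketch places each block on three distinct nodes and then checks which node failures the data survives. It is a simplified model under stated assumptions: the round-robin placement is a stand-in for the real HDFS placement policy, and the node names, block count, and function names are hypothetical.

```python
import itertools

REPLICATION = 3  # default replication stated in the text


def place_blocks(num_blocks: int, nodes: list[str],
                 replication: int = REPLICATION) -> dict[int, set[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    This is a simplification; real HDFS placement also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: {nodes[(block + i) % len(nodes)] for i in range(replication)}
        for block in range(num_blocks)
    }


def survives(placement: dict[int, set[str]], failed: set[str]) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(replicas - failed for replicas in placement.values())


nodes = [f"node{i}" for i in range(10)]
placement = place_blocks(100, nodes)

# Try every possible failure of one, two, and three nodes.
single_ok = all(survives(placement, {n}) for n in nodes)
double_ok = all(survives(placement, set(p)) for p in itertools.combinations(nodes, 2))
triple_ok = all(survives(placement, set(t)) for t in itertools.combinations(nodes, 3))

print(single_ok, double_ok, triple_ok)
```

With three replicas per block, the loss of any one or two data nodes leaves every block readable, while some combinations of three simultaneous failures do not. The sketch covers data nodes only; it says nothing about the NameNode and JobTracker, which is why those servers need separate high-availability support.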

Intel is also working on the open source Project Ladon, which is designed to support disaster recovery of Apache Hadoop through multisite data replication.

Enterprise-Class Manageability

SAP HANA is typically delivered as an appliance for onsite deployments. All hardware and software is tightly integrated and optimized to simplify deployment and management. Apache Hadoop, on the other hand, is based on open source software that is designed to run on large numbers of off-the-shelf servers. Management can be complex in this more distributed computing environment, and the challenges increase as a cluster grows.

IDH includes Intel Manager for Apache Hadoop software, which combines open source and proprietary tools to provide enterprise-level manageability, including:

- A user-friendly interface for managing access controls and for updating the system. Built-in wizards provide workflows and guidance to speed deployment, simplify upgrades, and improve results.
- Automatic cluster configuration and tuning, using the Intel Active Tuner. Advanced machine-learning algorithms select the best setup based on workload characteristics to deliver optimized query performance quickly and with no need for complex manual tuning.
- Built-in monitoring, with a dashboard that provides a comprehensive view of the cluster and system health.
- Flexible extensibility, with an application programming interface (API) that allows third-party and custom applications to access the functions in Intel Manager for Apache Hadoop.

SAP and Intel: A Shared Vision for Big Data Integration

Intel and SAP continue to jointly engineer, optimize, and enhance the integration of SAP HANA and IDH. The companies are working together to integrate new functionality and to optimize software to derive maximum benefit from advances in hardware. Some objectives of this collaboration include:

- Simplified troubleshooting, so query failures can be identified, diagnosed, and fixed more quickly and efficiently. Future solutions will include built-in analytics for root-cause analysis.
- Enhanced data relocation, so data can be moved more quickly, flexibly, and transparently between the two platforms.
- Stronger security, by further improving integration and by providing more comprehensive, multilayered protections in both hardware and software.

Intel is also deeply involved in hundreds of open source projects to increase Hadoop performance and functionality, and the results of these efforts will continue to increase the capability and value of IDH. Many of these developments are also offered back to the open source community to help drive innovation and interoperability across the broader big data ecosystem.

SAP: Single Point of Contact for Service and Support

SAP HANA and IDH are available from SAP sales teams worldwide. SAP offers full support for the joint solution. SAP also offers comprehensive consulting services, from initial planning and assessment through implementation and ongoing optimization. The speed, scale, and flexibility of the platform go far beyond what has been possible in the past, and IT organizations can accelerate deployment by working with experts who have extensive experience with SAP HANA and Apache Hadoop.

Intel Distribution for Apache Hadoop:
SAP Big Data:

Conclusion

SAP and Intel provide an optimized solution for real-time big data analytics based on SAP HANA and the Intel Distribution for Apache Hadoop. Using this joint solution, data and business analysts can combine the performance of in-memory analytics with the massive scalability of Apache Hadoop. As a result, they can store and analyze petabytes of polystructured data cost effectively at the speeds needed to support real-time business processes.

Intel and SAP have worked closely together to optimize the combined platform to support fast, federated queries that tighten the seams between the two platforms and make it easier for BI users to get the results they want without worrying about the infrastructure. The solution is designed to support enterprise requirements for security, availability, and manageability, so IT organizations can integrate the platform into their datacenter while minimizing cost and risk.

1. Source: Sikka, Vishal, SAP. The Business Value of Speed! Lessons from 10,000X SAP HANA Performance Club. August blog/2012/08/05/the-business-value-of-speed.

2. Source: Intel internal measurements November Configurations: Baseline 1.0x: Intel E7505 Chipset using four Intel Xeon processors E (4P/10C/20T, 2.4GHz) with 256GB DDR memory scoring 110,061 queries per hour. Source: Intel Technical Report #1347. New Generation 2x: Intel C606J Chipset using four Intel Xeon processors E v2 (4P/15C/30T, 2.8GHz) with 512GB DDR (running 2:1 VMSE) memory scoring 218,406 queries per hour. Source: Intel Technical Report #

3. Source: TeraSort Benchmarks conducted by Intel in December Custom settings: mapred.reduce.tasks=100 and mapred.job.reuse.jvm.num.tasks=-1. Cluster configuration: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Cisco Nexus* Gigabit switch. Performance measured using Iometer* with Queue Depth 32. Baseline worker node: SuperMicro SYS-1026T-URF 1U servers with two Intel Xeon processors 3.47 GHz, 48 GB RAM, 700 GB 7200 RPM SATA hard drives, Intel Ethernet Server Adapter I350-T2, Apache Hadoop* 1.0.3, Red Hat Enterprise Linux* 6.3, Oracle Java* 1.7.0_05. Baseline storage: 700 GB 7200 RPM SATA hard drives; upgraded storage: Intel Solid-State Drive 520 Series (the Intel Solid-State Drive 520 Series is currently not validated for data center usage). Baseline network adapter: Intel Ethernet Server Adapter I350-T2; upgraded network adapter: Intel Ethernet Converged Network Adapter X520-DA2. Upgraded software in worker node: Intel Distribution for Apache Hadoop* software. Note: Solid-state drive performance varies by capacity. More information: current/api/org/apache/hadoop/examples/terasort/package-summary.html.

4. Source: TeraSort Benchmarks conducted by Intel. Configuration details: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Dual Intel Xeon processor GHz, 32 cores per node, 7 x 1 TB dedicated data disks per node, 10 GbE network. System Swap turned off, Kernel Buffer Cache cleared before each performance test.

5. No computer system can provide absolute security. Requires an enabled Intel processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.

6. Source: Intel Internal tests using OpenSSL 1.0.1c* encryption software to encrypt and decrypt a 1 GB text file, with and without AES-NI enabled. Server configuration: 4-socket server with 4 x Intel Xeon processor E (32 core system, 1 core used in testing), 32 GB memory, CentOS 6.3* operating system, Apache Hadoop Distributed File System* (HDFS*) with namenode, datanode, and the test program all run on the same server, 240 GB Intel Solid State Drive 320 Series storage. For details, see the Intel Solution Brief, Fast, Low-Overhead Encryption for Apache Hadoop*.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling , or go to:

© 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Core, Xeon, Intel Inside, the Intel Inside logo, the Look Inside. logo, and Look Inside. are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Printed in USA 0214/MR/CMD/PDF Please Recycle US