Real-Time Big Data Analytics: SAP HANA with the Intel Distribution for Apache Hadoop Software


Executive Summary

SAP HANA is already helping businesses extract value out of Big Data by enabling real-time analysis of diverse data sets. Apache Hadoop is being used today to manage petabytes and possibly exabytes of data. SAP HANA uses in-memory computing to deliver analytic results much faster than Apache Hadoop, but Hadoop can handle arbitrary data structures more easily and at much lower hardware storage costs per terabyte than traditional data warehouses. SAP HANA and Hadoop therefore have complementary strengths: put together, they let you analyze big data in real time and make better decisions faster.

To help enterprises worldwide take full advantage of the combination of these technologies, SAP and Intel engineers worked together to integrate and optimize the Intel Distribution for Apache Hadoop with SAP HANA. This document describes the reference architecture of the real-time big data analytics solution jointly engineered by SAP and Intel to deliver the scalability, performance, security, and manageability required by enterprises whose businesses depend on the quality and timeliness of data-driven decisions.

Contents
Starting with a Representative Use Case
Defining System Requirements
Combining SAP HANA with the Intel Distribution for Apache Hadoop
Optimizing for Performance, Security, and Manageability
Conclusion

14 August 2013

Starting with a Representative Use Case

For over ten thousand years, agriculture has depended upon the farmer's ability to analyze weather reports, soil conditions, terrain maps, water resources, and market conditions. Yet it is only in the last twenty years that open access to accurate real-time data has unlocked dramatic improvements in modern agriculture. Tractors, tillers, combines, and other agricultural machines now use embedded sensors to provide real-time data continuously. These smart machines can assess environmental conditions and communicate with each other. Real-time analysis of this data can help today's farmers predict crop yields, increase the efficiency and reduce the cost of their equipment, develop better strategies for mitigating inevitable risks, and optimize yield prices.

Consider an agricultural machine manufacturer that wants to provide agronomics (agricultural data-driven analytics) to help farmers improve their business. The manufacturer has built sensors into the field equipment. By combining sensor data from the equipment with detailed data on field, weather, and market conditions, the company can create a rich and full picture of the farm and its environment. The company can then use advanced analytics to generate recommendations for each farmer.

The benefits to the farmer are considerable. Each recommendation is specific to time and location, at the granularity of the area of the field and the day of the week, and can specify soil treatments (water, fertilizer, and pesticides) to improve crop yields. The farmer is therefore equipped with much more than an agricultural machine: an intelligent system backed by real-time big data analytics. Because the agronomic system fosters better use of the farm equipment and provides detailed and continuous knowledge of the customer, the machine manufacturer benefits by growing a loyal customer base and sustainable communities that can afford newer equipment.
Defining System Requirements

The business requirements of the agricultural machine manufacturer are as follows:

- Capture data from various data sources: equipment sensors, soil conditions, crop history, plant data, weather data, climate history, market data, and other relevant sources.
- Store the large and growing volume of data in a scalable data reservoir.
- Cleanse and prepare data for analysis.
- Analyze the data by executing a variety of statistical methods. The analytics can be further subdivided:
  o Identify the problem (formulate the query)
  o Execute the query
  o Draw conclusions
  o Revise the query
- Present the data to the farmer through intuitive analytic interfaces or BI tools.

The technical requirements of the system would stretch the abilities of traditional data warehouses and conventional RDBMS, but the advent of in-memory database computing and distributed data management technologies offers the critical building blocks:

- Data Stores
  o Distributed Data Store for large volumes of unstructured data (sensor logs, history)
  o In-memory Database for streaming data (weather data, market data)
- Fast Analytics (real-time response to queries on some portion of the data)
- Deep Analytics (advanced analytics across the entire data set)
- High-performance Connector between the analytics engines

[Figure: Fast Analytics on a Fast Data Store and Deep Analytics on a Large Data Store]

Combining SAP HANA with the Intel Distribution for Apache Hadoop

Understanding the differences in capability between SAP HANA and the Intel Distribution for Apache Hadoop is the first step to recognizing the value in their integration. SAP HANA is already helping businesses extract value out of Big Data by enabling real-time analysis of diverse data sets. Likewise, Apache Hadoop is being used to manage petabytes and quite possibly exabytes of data, far more than the 100 TB or less that SAP HANA and traditional relational database management systems (RDBMS) typically handle. Hadoop is significantly slower than SAP HANA, taking minutes or hours to provide analytic results, but it can handle arbitrary data structures more easily and at much lower hardware storage costs per terabyte than traditional data warehouses. SAP HANA and Hadoop have complementary strengths: put together, they let you analyze big data in real time, helping you make better decisions faster.

SAP HANA                              Intel Distribution for Apache Hadoop
------------------------------------  ----------------------------------------
Mainly structured data in memory      Any data or file structure on disk
Very fast access (roughly <1 ms)      Very slow data access (seconds to hours)
Predefined schema                     No schema or schema on read
1 or many servers (100s of cores)     Cluster of servers (1000s of cores)
Scale-up or scale-out architecture    Scale-out architecture
Server-level failover                 Query and server-level failover
Database appliance                    Low-cost server
Excellent OLTP                        No OLTP
Excellent OLAP                        Slow OLAP
High data consistency (ACID)          Eventual data consistency (BASE)
Enterprise-ready management tools     Enterprise-ready management tools
License fee                           Subscription fee

The use of Hadoop does not fundamentally change the type of analytics that can be performed with SAP HANA. It does, however, change the scope and architecture of the analytics solution.

Because the data volumes stored by Hadoop can be so vast, its raw data cannot be readily copied into an existing disk-based or in-memory data warehouse. It is too much data to store permanently at a reasonable cost, and moving data from Hadoop just in time for analysis would take too long. This means that some of the data analysis must be carried out in Hadoop.

Moreover, queries executed in Hadoop usually take longer to run, anywhere from minutes to hours. The solution must compensate for this latency, or else the analytic process of formulate query > analyze data > present results > draw conclusions will slow to the point where it cannot be done in real time.

Finally, some of the analytic operations will run on multiple analytic engines that each draw from a different data store, consolidating the results from the in-memory database, the on-disk data warehouse, and the Hadoop cluster. This usage is the focus of the integration of SAP HANA with the Intel Distribution for Apache Hadoop.

Federating Queries

A federated query is one in which the original query is split into separate parts that can run on any combination of the Intel Distribution for Apache Hadoop, SAP HANA, SAP Sybase IQ, and other data platforms. The parts are then executed separately. A query running on the Intel Distribution, for example, may have to perform a database join with data in SAP HANA. This presents two options: retrieve data from SAP HANA while the Hadoop query is running, or copy the data from SAP HANA into Hadoop before the analysis query on Hadoop starts.
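The two join options can be illustrated with a small sketch. It is not the product's implementation: the dictionary standing in for an in-memory table, the list standing in for a Hadoop-side sensor log, and every name below (`CountingLookup`, `join_with_live_lookup`, `join_with_precopy`) are hypothetical and exist only to show the trade-off between many small round trips and one bulk copy.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins: a small in-memory table and a large on-disk data set.
FIELD_SOIL: Dict[str, str] = {"f1": "loam", "f2": "clay"}  # field_id -> soil type
SENSOR_LOG: List[Tuple[str, float]] = [
    ("f1", 0.31), ("f2", 0.18), ("f1", 0.29), ("f3", 0.40),
]  # (field_id, soil moisture)


class CountingLookup:
    """Wraps the small table and counts simulated remote round trips."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table
        self.calls = 0

    def get(self, key: str):
        self.calls += 1  # one round trip per key
        return self.table.get(key)

    def export(self) -> Dict[str, str]:
        self.calls += 1  # one round trip for the whole table
        return dict(self.table)


def join_with_live_lookup(log, remote: CountingLookup):
    """Option 1: fetch matching rows from the remote store while the scan runs."""
    out = []
    for field_id, moisture in log:
        soil = remote.get(field_id)
        if soil is not None:
            out.append((field_id, soil, moisture))
    return out


def join_with_precopy(log, remote: CountingLookup):
    """Option 2: copy the small table next to the large data before the scan."""
    local = remote.export()
    return [(f, local[f], m) for f, m in log if f in local]
```

Both functions return the same joined rows; what differs is the number of simulated round trips, which grows with the size of the scan in the first option and stays at one in the second.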
There may also be dependencies between the parts of the query. For example, a job running on the Intel Distribution may need to complete first so that the results it produces are available to complete a query in SAP HANA. Because one part can take a long time to run, the federated query must have a way of executing queries asynchronously: the system accepts the query and passes control back to the user so they can continue with other activities. Later, when the query is complete, the system notifies the user so they can examine the result.

SAP HANA integrated with the Intel Distribution for Apache Hadoop implements federated queries in four ways:

1. Client-side Federation (available in BI 4.0 FP3): SAP BusinessObjects BI executes queries separately and consolidates data from the Intel Distribution and the SAP HANA database.
2. Standalone Analytics: SAP BusinessObjects BI analyzes data in Hadoop directly.
3. Query Federation (available in SAP HANA SP06 with smart data access): SAP HANA executes queries on multiple data sources including the Intel Distribution for Apache Hadoop, consolidating and combining the analytic results.
4. Data Federation (available with SAP Data Services 4.1): SAP HANA treats data in the Intel Distribution as an external table.

[Figure: the four federation options, labeled 1 to 4]
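The asynchronous pattern described above (accept the query, return control, notify on completion, and respect the dependency between parts) can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions: `run_hadoop_part`, `run_hana_part`, and `submit_federated_query` are hypothetical names, the data is made up, and the short sleep stands in for a job that would really take minutes or hours.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


def run_hadoop_part() -> Dict[str, float]:
    """Hypothetical slow part: average soil moisture per field."""
    time.sleep(0.05)  # stands in for a long-running batch job
    return {"f1": 0.30, "f2": 0.18}


def run_hana_part(averages: Dict[str, float]) -> List[str]:
    """Hypothetical fast part that depends on the slow part's output."""
    threshold = 0.25
    return sorted(f for f, m in averages.items() if m < threshold)


def submit_federated_query(
    executor: ThreadPoolExecutor, notify: Callable[[List[str]], None]
) -> Future:
    """Accept the query, return a handle at once, and notify when done."""

    def pipeline() -> List[str]:
        partial = run_hadoop_part()      # must finish first (dependency)
        return run_hana_part(partial)    # consumes the first part's result

    future = executor.submit(pipeline)
    future.add_done_callback(lambda f: notify(f.result()))
    return future


notifications: List[List[str]] = []
with ThreadPoolExecutor(max_workers=2) as pool:
    handle = submit_federated_query(pool, notifications.append)
    # Control is back with the caller here; other work could proceed.
    result = handle.result()  # block only when the answer is needed
```

The caller receives a handle immediately and is free to continue; the callback plays the role of the completion notification.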

Optimizing Query Federation for Performance and Security

In SAP HANA integrated with the Intel Distribution for Apache Hadoop, the optimization is twofold: Intel software developers optimized both SAP HANA and the Apache Hadoop platform for performance on Intel Xeon processor-based servers. SAP HANA takes full advantage of advanced features in the Intel Xeon processor E7 family, such as expanded memory addressability, increased core count, and larger cache, as well as security, reliability, availability, and serviceability features. The Intel Distribution for Apache Hadoop software is built from the silicon up to deliver security with no compromise on performance and manageability by making full use of the Intel Xeon processor, SSD storage, and 10GbE networking.

The collaboration between SAP and Intel has been even more fruitful in the joint engineering of the SAP Smart Data Access connectors optimized with the Intel Distribution for Apache Hadoop. The connectors allow the Intel Distribution to transcend the traditional use of Apache Hadoop as an engine for offline batch processing.

[Figure: proxy table, Smart Data Access, Hive, and the Intel Distribution for Apache Hadoop (IDH)]

In SAP HANA SP06, the Smart Data Access connector enables the Intel Distribution for Apache Hadoop to operate as an extended data store for SAP HANA using the proxy table construct. Intel and SAP jointly engineered the connector with several optimizations, such as caching and hot data replication, to improve performance. In addition, the connector implements end-to-end security with encryption of data in motion and at rest. When a user submits a query via SAP HANA studio, the combined solution returns the result transparently with no loss of real-time performance. In SAP HANA SP07, the integration is further enhanced to support data relocation between SAP HANA and the Intel Distribution.
When a query is submitted to SAP HANA (for example, to join a smaller data set in SAP HANA with a larger data set in the Intel Distribution for Apache Hadoop), the connector relocates the smaller data set to the Hadoop cluster, where the job is executed. The Intel Distribution then returns the result to SAP HANA via the proxy table.

SAP HANA integrated with the Intel Distribution for Apache Hadoop provides a real-time big data analytics solution. Unlike other offerings in the industry, the SAP HANA solution with the Intel Distribution is optimized top-to-bottom through the stack with Intel technologies such as data security enhanced by Intel AES-NI and Intel Expressway Gateway, Intel Manager equipped with Intel Active Tuner, and balanced performance across Intel SSD storage and 10GbE networking.
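A back-of-envelope calculation shows why relocating the smaller data set to the Hadoop cluster is the sensible direction for such a join. The sketch below assumes an ideal single 10GbE link at full line rate with no protocol overhead, decimal units, and made-up data sizes; the function names are hypothetical and this is not how the connector decides.

```python
GB = 1e9   # decimal gigabyte, in bytes
TB = 1e12  # decimal terabyte, in bytes


def transfer_seconds(size_bytes: float, link_gbps: float = 10.0) -> float:
    """Ideal transfer time over one link, ignoring all protocol overhead."""
    return size_bytes * 8 / (link_gbps * 1e9)


def side_to_relocate(hana_bytes: float, hadoop_bytes: float) -> str:
    """Move whichever side of the join is cheaper to ship."""
    return "hana_to_hadoop" if hana_bytes <= hadoop_bytes else "hadoop_to_hana"


small_side = 50 * GB    # assumed size of the in-memory data set
large_side = 500 * TB   # assumed size of the Hadoop-resident data set

print(transfer_seconds(small_side))              # 40.0 seconds
print(transfer_seconds(large_side) / 86400)      # about 4.6 days
print(side_to_relocate(small_side, large_side))  # hana_to_hadoop
```

Under these assumptions, shipping the small side takes well under a minute, while shipping the large side would take days, which is consistent with the earlier point that moving raw data out of Hadoop just in time for analysis would take too long.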

The Intel Distribution for Apache Hadoop software (Intel Distribution) includes Apache Hadoop along with other software components enhanced and optimized by Intel for performance, security, and manageability.

- Performance: The Intel Distribution 2.5.1 has been shown to perform 2.4X better than the Hortonworks Data Platform 1.3.9 on the same hardware, completing the Terasort and Wordcount workloads on a 10-node cluster in significantly less time.
- Security: Only the Intel Distribution is built from the silicon up to deliver security without compromising performance and manageability. The Intel Distribution running on Intel Xeon processors enables Hadoop to make full use of the performance and security capabilities available in the processor. For example, the Intel Distribution uses Intel Advanced Encryption Standard New Instructions (Intel AES-NI), available in Intel Xeon processors, to accelerate cryptographic functions, erasing the typical performance penalty of encryption and decryption. Using AES-NI, the Intel Distribution has been shown to deliver up to 20X increases in throughput over the same operations performed without AES-NI.
- Manageability: Only the Intel Distribution has an automated performance tuning mechanism that saves configuration time while selecting a better configuration than a human expert would. The Intel Active Tuner uses a patented mechanism to test various parameters rapidly until it converges on the optimal configuration for a given Hadoop application.

The Intel Distribution is developed and supported by a software team that has real-world experience with Hadoop deployments as well as deep expertise in the full platform stack of Apache Hadoop on Intel hardware, including Java and Linux. Moreover, Intel contributes its knowledge and all hardware-enabled platform enhancements to the open source Apache Hadoop community.
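The general idea of testing parameters repeatedly until the search settles on a configuration can be illustrated with a generic coordinate search. This is explicitly not the Intel Active Tuner's patented mechanism, whose details are not given here; the parameter names (`mappers`, `buffer_mb`), the candidate grid, and the synthetic `measure` function are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def tune(
    measure: Callable[[Config], float],
    grid: Dict[str, List[int]],
    start: Config,
) -> Config:
    """Generic coordinate search: change one parameter at a time and keep
    any change that lowers the measured cost, until nothing improves."""
    best = dict(start)
    best_cost = measure(best)
    improved = True
    while improved:
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                if value == best[name]:
                    continue
                trial = dict(best)
                trial[name] = value
                cost = measure(trial)
                if cost < best_cost:
                    best, best_cost, improved = trial, cost, True
    return best


def measure(cfg: Config) -> float:
    """Synthetic job runtime; a real tuner would run and time the job."""
    return 10 + (cfg["mappers"] - 8) ** 2 + (cfg["buffer_mb"] - 256) ** 2 / 1024


grid = {"mappers": [2, 4, 8, 16], "buffer_mb": [64, 128, 256, 512]}
best_config = tune(measure, grid, {"mappers": 2, "buffer_mb": 64})
```

With this synthetic cost function the search settles on 8 mappers and a 256 MB buffer. A real workload would have a noisy and much more expensive measurement step, which is why an automated tuner is attractive in the first place.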
Conclusion

SAP is uniquely positioned to address the needs of enterprise business analysts with a platform that can deliver real-time analytics operating on a diverse range of data sources. SAP HANA, by virtue of its in-memory database computing engine, already supports a variety of business applications that require instant analysis of data stored in transactional databases. Apache Hadoop complements and enhances this state of the art by enabling access to a potentially unlimited source of unstructured data. By integrating SAP HANA with Apache Hadoop, SAP can provide customers, such as the agricultural machine manufacturer, with a real-time big data analytics solution that scales with the volume of data.

By selecting the Intel Distribution for Apache Hadoop as its preferred platform, SAP relies on Intel to deliver performance and security enhanced by Intel hardware, and to provide an open source data management platform on which SAP can innovate. By signing an agreement to resell the Intel Distribution, SAP provides its customers with a complete set of big data software and a single point of contact for technical support.

For more information, visit http://hadoop.intel.com. To get started with a proof of concept or pilot project with SAP HANA and the Intel Distribution for Apache Hadoop, contact SAP sales or Intel Software Sales.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked reserved or undefined. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web site at www.intel.com.

Copyright 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Printed in USA.