Applying SAP HANA, Apache Hadoop, and IBM GPFS to retail point of sales and web log data

Christopher Y. Chung, IBM
William Gardella, SAP
Dewei Sun, SAP

IBM Systems and Technology Group ISV Enablement
SAP Labs

May 2012
Table of contents

Abstract
Motivation
Demo scenario
Big Data solution
Solution Configuration
  SAP HANA
    What is SAP HANA?
    Who uses SAP HANA today?
    What is the future for SAP HANA?
    SAP HANA Configuration
  Apache Hadoop
    Hadoop and GPFS Cluster Configuration
  SAP Data Services 4.1
  SAP BusinessObjects BI 4.0
  SAP HANA, Hadoop, and GPFS cluster
Results
Summary
Resources
About the authors
Trademarks and special notices
Abstract

In early 2012, IBM and SAP created a realistic reference scenario demonstrating the benefits of using SAP HANA with Apache Hadoop, in which SAP HANA provides real-time data exploration and Hadoop brings cost-effective, batch-style processing of large unstructured data volumes that would be difficult to analyze with traditional technologies. The purpose of the exercise was to enable similar scenarios by providing practical guidance and a blueprint that uses currently available software and hardware technology. For this project, the landscape consisted of an SAP HANA system running on an IBM eX5 enterprise server, the x3690 X5 Workload Optimized Solution for SAP HANA; a Hadoop cluster of multiple IBM System x3620 M3 servers; and virtual machine (VM)-based SAP Data Services 4.1 and SAP BusinessObjects BI 4.0. An additional interesting feature is the IBM General Parallel File System (IBM GPFS), which was used to underpin both SAP HANA and the Hadoop cluster, for optimum support of extremely large and complex analytics against unstructured and real-time data.

Motivation

Big Data is a hot topic in the IT industry, and there have been many inquiries about it from partners, analysts, and customers who are trying to understand the likely impact on the enterprises they serve. The explosive growth of data, driven partly by the ascendancy of the internet, has given rise to new technologies that enable enterprises to take advantage of that data in ways that would not have been feasible in the past. Traditional data stores such as relational databases are increasingly being supplemented by new information sources that typically involve large amounts of unstructured or semi-structured data. These information sources often require substantial processing before they can be used. This paper gives a concrete example of one such scenario and explains how the various parts of the solution fit together.
Demo scenario

Consider the hypothetical case of a large and successful retailer that does business in stores and also on the web. This business needs to deliver real-time customer insights that increase its ability to win customers by ensuring that it offers attractive products at attractive prices. In order for IT to deliver innovation to the business, it has to be able to leverage all available data to gain a competitive advantage. And driving excellence means that the solution has to keep ensuring that the company's products are selling at their full potential, day in and day out.

SAP BusinessObjects BI 4.0 and SAP HANA enable users to examine even billions of retail point of sales (POS) records in real time. For the purposes of this case study, we will take it as a point of departure that our theoretical retailer already has this capacity. Therefore, the starting information space for this scenario might look something like the image below, with retail point of sales dollar amounts listed by category.

Figure 1. Retail Point of Sales data in BusinessObjects Explorer before enrichment with customer interest data
This shows us in great detail what the retailer sold. That's no small thing, especially when you consider exploring over 1.1 billion records in seven thousand product categories in less than a second. But retailers always have to push for the competitive edge, and in this case they don't know as much as they might like. For example, how can they tell whether their current products are selling at their full potential? After all, they know what they have sold, but how would they know if there was strong customer interest in one of their products if they are failing to make the sale for some reason? What they need is a measure of customer interest that is independent of sales.

In this demo scenario, we will enrich the retailer's point of sales (POS) data using other data that it already has. The company's web logs hold a record of what visitors did when they came to the web site. If we could bring enough computational power to bear, we could figure out which products customers look at when they visit. These product views give us a measure of how much customer interest there is in a product. If many people look at a product, that correlates with strong interest; if few people look at it, that correlates with weak interest. In the end, we can bring that customer interest data into our sales reports to compare the customer interest in products with the actual sales performance of those products. In this way, we can uncover products that should be selling better than they are. The following sections talk about the technologies that enable both the before and after scenarios.

Figure 2. The new information space adds the number of users who viewed the products online
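To make the enrichment idea concrete, the comparison of interest against sales can be sketched in a few lines of Python. This is our own illustration, not code from the project: the product names, thresholds, and the simple viewer-to-buyer conversion measure are all hypothetical.

```python
# Illustrative sketch: once per-product view counts have been derived from
# the web logs, comparing them with POS sales can surface products that
# attract interest but fail to sell. All names and figures are made up.

def underperformers(sales, views, min_views=1000):
    """Return products with high view counts but a weak viewer-to-buyer rate.

    sales: dict of product -> units sold
    views: dict of product -> number of visitors who viewed the product
    """
    flagged = []
    for product, view_count in views.items():
        if view_count < min_views:
            continue  # not enough interest data to judge this product
        sold = sales.get(product, 0)
        # conversion: buyers per viewer, a simple interest-vs-sales measure
        conversion = sold / view_count
        if conversion < 0.01:  # fewer than 1 sale per 100 viewers
            flagged.append((product, view_count, sold, conversion))
    # worst converters first
    return sorted(flagged, key=lambda row: row[3])

sales = {"camera-x100": 40, "tripod-t9": 900}
views = {"camera-x100": 8000, "tripod-t9": 1200}
print(underperformers(sales, views))
```

In the real scenario, the view counts would come out of the Hadoop processing described later and the sales figures from the POS data already in SAP HANA; the comparison itself would be done in the reporting layer rather than in application code like this.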
Big Data solution

SAP's solution for big data is based on the SAP real-time data platform, a comprehensive offering that handles a wide range of enterprise data needs. SAP HANA is the central core of this platform, providing real-time, in-memory data processing and excellent performance characteristics, even over large data sets. Another piece of the big data puzzle that many people have heard buzz about is Hadoop. Apache Hadoop is a proven, open source technology (really a family of related technologies) developed for processing extremely large sets of raw data in parallel, using highly flexible MapReduce algorithms, with maximum cost efficiency. Although on the surface it may sound like these technologies are in conflict, they are really designed with somewhat different core value propositions and usage patterns. This means they can be fruitfully combined, but the question for most businesses is about the Hadoop side of the equation more than the SAP HANA side. This paper aims to illustrate exactly how Hadoop and SAP HANA can complement each other to address a business problem that would be difficult to solve using only traditional relational databases.

The SAP HANA database has been designed and implemented with fast response times for both online analytic processing (OLAP) and online transaction processing (OLTP) workloads as a primary goal. SAP HANA and its functionality will be described in greater detail in the following section. While SAP HANA succeeds in providing very fast data processing and analytics for large volumes of high-value data, there are still use cases that customers are interested in that are outside the scope of today's SAP HANA implementation or vision.
For this reason, it's important to enable users to easily leverage and combine data residing in Hadoop or similar distributed file systems with data residing in SAP HANA, for real-time business analytics and in business transactions.

Broadly speaking, the use cases for Hadoop are defined by the need for cost-efficient storage and processing of very large volumes of structured, semi-structured, and unstructured data such as web logs, machine data, text data, call data records (CDRs), audio, and video data. In general, the use cases involve some or all of the following features:

- Batch processing: where fast response times are less critical than reliability and scalability
- Complex information processing: enabling heavily recursive algorithms, machine learning, and queries that cannot be easily expressed in SQL
- Low-value data archiving: data stays available, though access is slower
- Post-hoc analysis: mining raw data that is either schema-less or whose schema changes over time

Hadoop serves as an extended workbench for intensive processing of large amounts of typically unstructured data. Sometimes people refer to this as low-value data, by which they mean that for any given user of that data, the ratio of valuable information to meaningless information is low. This paper describes just such an example. Imagine you wanted to look at terabytes of web logs dating back several years to find out, for every product you have ever sold, how many people looked at the product, how many added it to the shopping cart, and how many actually bought it. Hadoop could handle that, even if the web site has changed a few times and the log entries don't look the same from one year to the next. But the vast majority of the data in the web logs being processed would be actions that do not bear on that question one way or another. The data in the web logs could be used in other ways to answer other questions, so you wouldn't want to just throw it away. Another time, with a different question, a different
small percentage of data would be useful. Only a small percentage is useful for answering any particular question. Another way of looking at the same usage pattern is to describe it in terms of post-hoc analysis, or collecting data first and deciding how you might use it later. This means that you store everything because you're not sure exactly what you're going to need to find out. To take a completely different example, consider a collection of sensor data, such as is collected in an oil field or large power plant. In conducting a forensic investigation, a worker safety team might go looking for particular patterns of sensor readings that preceded a dangerous event, perhaps casting their net wide in hopes of finding a correlation between various connected pieces of equipment. The same team might look for very different events when trying to discover signal patterns that could be used to predict when equipment maintenance was needed.

Figure 3. SAP Real-time Data Platform

The need for large-scale data processing has given rise to shared-nothing architectures as the basis for data-intensive applications such as Hadoop. These architectures are typically implemented with large clusters composed of inexpensive commodity hardware components, an approach which originated in the context of web indexing and search. In such a cluster, each node stores a small portion of the data and can perform computations on it locally. The coordination of all nodes in the cluster for the execution of users' applications on the data is carried out by a framework that runs on top of the cluster. The central idea in this design is function shipping, where a compute task is moved to the node that has the data, as opposed to the traditional data shipping approach, where the data is moved from a storage node to a compute node. SAP and IBM understand that big data technology such as Hadoop is important and is rapidly hitting the radar at many companies.
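The contrast between function shipping and data shipping can be illustrated with a small scheduling sketch in Python. This is purely illustrative and not part of the project's software; the node names, block map, and least-loaded tie-breaking rule are invented for the example.

```python
# A toy illustration of "function shipping": the scheduler assigns each
# block's task to a node that already stores a replica of that block,
# moving computation to the data instead of the data to the computation.

block_locations = {            # block id -> nodes holding a replica
    "blk-1": ["node-a", "node-b"],
    "blk-2": ["node-b", "node-c"],
    "blk-3": ["node-c", "node-a"],
}

node_load = {"node-a": 0, "node-b": 0, "node-c": 0}

def assign(block):
    """Pick the least-loaded node among those holding the block locally."""
    candidates = block_locations[block]
    chosen = min(candidates, key=lambda n: node_load[n])
    node_load[chosen] += 1
    return chosen

plan = {blk: assign(blk) for blk in block_locations}
print(plan)
```

A real framework such as Hadoop also handles replication, stragglers, and fallback to non-local nodes when every local node is busy; the point here is only that tasks chase the data, not the reverse.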
At the same time, the collection of features that are critical at the enterprise level needs to be understood in depth. An important challenge in realizing
performance in data-intensive supercomputing systems is the achievement of high throughput for applications in a commodity environment where bandwidth resources are scarce and failures are common. To achieve this goal, a clustered file system can optimize the performance of a data-intensive application like Hadoop by:

- Exposing locality information to enable function shipping
- Allowing a multiplicity of block sizes in the same file system
- Providing a write affinity mechanism for an application to influence data layout
- Using pipelined replication to optimize the use of network bandwidth
- Supporting distributed recovery to minimize the effect of failure

Solution Configuration

The goal of this activity was to take a highly realistic scenario and build it from end to end, using technologies that are available today and which have good enterprise-readiness. By enterprise-readiness, we mean ease of administration, security, available support, and so forth. This is partly because these features are necessary and expected, and partly because these or similar technologies are probably available in a typical enterprise environment. The following sections consider the main technology components one at a time. The picture below describes the high-level map of where the data lives and how it's accessed. One component not shown below is SAP Data Services 4.1, which is used for transferring data from Hadoop to SAP HANA.

SAP HANA
What is SAP HANA?

SAP HANA is a next-generation database platform that delivers unprecedented value by natively leveraging hardware evolution in processors, memory, storage, and network architectures. Designed from the ground up to run entirely in-memory across clusters of affordable Intel-based servers, SAP HANA combines the best of row-based relational databases with the columnar technology present in leading analytical databases, bringing the best of both worlds together in a single database. With the ability to combine transactional and analytical functions, customers gain the ability to make decisions on business data as business is happening, enabling organizations to achieve real-time business.

Starting with the accelerators for ERP systems, SAP is building and delivering various classes of applications for various scenarios of use. These include SAP Rapid Deployment Solutions (SAP RDS) and in-memory engine accelerated applications like SAP NetWeaver Business Warehouse (SAP NetWeaver BW) and SAP Smart Meter Analytics. SAP HANA's in-memory optimized engines for planning, predictive, and business-oriented computational libraries will enable the next generation of business applications.

Who uses SAP HANA today?

SAP HANA was made generally available as a standalone database platform in July of 2011. Since then, sales of SAP HANA have exceeded all expectations, generating over 160M in sales across more than 250 customers in the first six months. Early customers and ISVs have been able to exploit the real-time capabilities present in SAP HANA to build new classes of solutions not possible before, spanning such use cases as genome sequencing, retail predictive analytics, and vehicle fault analysis. Whereas these customers would previously have been required to rely on traditional RDBMS technology and high-maintenance tuning efforts to achieve even moderate performance, customers have found that SAP HANA serves as a full-featured and high-performing in-memory database platform.
SAP has also been encouraging additional new and innovative solutions with emerging providers through its support of startup forums in the Silicon Valley area. Along with these uses, SAP has delivered solutions to customers, including SAP HANA application accelerators, native SAP HANA applications, and the latest SAP NetWeaver BW on SAP HANA.

What is the future for SAP HANA?

Customers today are already experiencing unprecedented levels of performance with SAP HANA as the state-of-the-art database platform powering such traditional business applications as SAP Business One, SAP Business ByDesign, SAP NetWeaver Business Warehouse, plus many other SAP applications and technology solutions in the portfolio powered by SAP HANA. This will continue with support for the SAP Business Suite, planned by the end of 2012, and additional new and innovative SAP applications beyond that. SAP's goal is to ensure SAP applications and SAP HANA deliver unmatched business process performance with a simplified landscape and user experience. SAP will continuously innovate on functionality and experience, specifically focusing on performance and the application platform as the key differentiators. SAP plans to re-engineer the enterprise application platform on the basis of the SAP HANA architecture to revolutionize application building, from ease of construction to optimized execution. SAP HANA will also serve as the core data foundation within the SAP on-demand application development and deployment platform. With SAP BusinessObjects BI OnDemand already running successfully on SAP HANA, customers can look forward to many cloud-based real-time business solutions to come.

SAP HANA Configuration

For the purposes of this scenario, we used SAP HANA 1.0 SP3 running on a powerful IBM eX5 enterprise server with the Intel Xeon processor E7 family, combining the speed and efficiency of in-memory processing with the ability to analyze massive amounts of enterprise data. The IBM System x3690 X5 is a
powerful two-socket 2U rack-mount server. This is one of two IBM Intel processor-based high-end servers in the Workload Optimized Solution model for SAP HANA, which is optimally designed and certified by SAP.

Figure 4. IBM System x3690 X5

It supports the following specifications:

- Up to two sockets for Intel Xeon E7 processors. Depending on the processor model, processors have six, eight, or ten cores per socket.
- Scalable from 32 to 64 DIMM sockets with the addition of a MAX5 memory expansion unit (up to 1 TB of memory using all 16 GB DIMMs; not currently certified for SAP HANA).
- Advanced networking capabilities with a Broadcom 5709 dual Gb Ethernet controller, and optionally an Emulex 10 Gb dual-port Ethernet adapter.
- Up to 16 hot-swap 2.5-inch SAS HDDs or 24 hot-swap 1.8-inch solid-state drives (SSDs), with up to 9.6 TB of maximum internal storage, and RAID 0, 1, 10, 5, or 50 to maximize throughput and ease installation.
- New eXFlash high-IOPS solid-state storage technology.
- Five PCIe 2.0 slots.
- Integrated Management Module (IMM) for enhanced systems management capabilities.

The IBM System x3690 X5 is delivered with key software components preconfigured and preinstalled, such as the SLES for SAP operating system and IBM General Parallel File System (GPFS), to speed delivery and deployment of the solution. The x3690 X5 features IBM eXFlash internal storage using solid-state drives to maximize the number of I/O operations per second (IOPS). All configurations for SAP HANA based on the x3690 X5 use eXFlash internal storage for high-IOPS log storage or for both data and log storage.

Apache Hadoop

Apache Hadoop is free, open source, Java technology-based software that enables distributed, scalable, reliable computing on clusters of inexpensive servers. It allows for cost-effective distributed storage and processing of very large data sets, ranging up to petabytes. Hadoop uses a simple
programming model that lets one easily write and run MapReduce applications, which distribute the processing of the map and reduce operations. In the "Map" step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the "Reduce" step, the master node collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve. Provided each mapping operation is independent of the others, all maps can be performed in parallel. Similarly, a set of 'reducers' can perform the reduce phase in parallel, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. By combining distributed computing power with distributed data, Hadoop processes quickly in parallel, using each node's local computation and storage.

Note that Hadoop still has a way to go in terms of enterprise readiness, so be aware that extra services or support may be needed for Hadoop projects in the near future. Most users of Hadoop have been moving away from custom MapReduce code toward higher-level Apache projects such as Hive (a SQL-like interface) or Pig (a high-level scripting language for data analysts) that are more suitable for enterprise users. These languages still result in MapReduce jobs, but don't require the same level of programming skill to create. Hadoop represents a complex and rapidly changing technology space; for more information, the interested reader is encouraged to seek out other sources.

Hadoop and GPFS Cluster Configuration

Data-intensive computing relies on a commodity hardware and software stack. Although other distributions could have been used, this scenario used Cloudera Hadoop cdh3u3 running on SUSE Linux 11 SP1.
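The map and reduce steps described above can be sketched in the style of a Hadoop streaming job, where the mapper and reducer are small scripts connected by a sort on the keys. The sketch below is our own simplification, not the project's code: the web log format, product URLs, and action names are invented, and the shuffle is simulated in-process rather than performed by the Hadoop framework.

```python
# Minimal MapReduce sketch: the mapper turns each web-log line into a
# (product, action) key/value pair; the reducer then counts views, cart
# adds, and purchases per product.

def mapper(line):
    """Emit ((product, action), 1) for lines we recognize; skip the rest."""
    fields = line.split()             # e.g. "2011-10-07 GET /product/p42 view"
    if len(fields) == 4 and fields[2].startswith("/product/"):
        product = fields[2].split("/")[2]
        action = fields[3]            # "view", "cart", or "buy"
        yield (product, action), 1
    # everything else in the log is noise for this particular question

def reducer(pairs):
    """Sum counts per (product, action) key. In real Hadoop the framework
    groups pairs by key between the phases; here a dict does the grouping."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

log = [
    "2011-10-07 GET /product/p42 view",
    "2011-10-07 GET /checkout/start buy",   # ignored: not a product URL
    "2011-10-08 GET /product/p42 view",
    "2011-10-08 GET /product/p42 buy",
]
pairs = [kv for line in log for kv in mapper(line)]
print(reducer(sorted(pairs)))
```

In an actual Hadoop streaming deployment, the mapper and reducer would each read from standard input and write tab-separated key/value lines to standard output, and the framework would run many copies of each in parallel across the cluster.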
For persistence, the IBM General Parallel File System with File Placement Optimizer (GPFS FPO) distributed file system was deployed. This is a notable difference compared to many Hadoop installations, which would instead use HDFS for the distributed file system. GPFS FPO is a high-performance, scalable file management solution, designed for shared-nothing clusters, that provides fast, reliable access to a common set of file data from multiple node systems. A GPFS FPO file system is built from a collection of disks from each node's local storage, which contain the data and metadata associated with every file. GPFS FPO delivers online storage management, scalable access, and integrated information lifecycle tools capable of managing petabytes of data and billions of files. One of the key advantages over other types of file systems is that there is no limit placed upon the number of simultaneously open files within a single file system. For this project, the scenario used GPFS with FPO.

The hardware configuration is based on the IBM System x3620 M3, suitable for building a multi-node cluster that will utilize each node's local disks to support its required storage capacity. The IBM System x3620 M3 is a two-socket 2U rack-mount server built on innovative IBM X-Architecture using Intel QuickPath Interconnect (QPI) technology. Featuring power-optimized, high-performance Intel Xeon multicore processors and an energy-efficient design with balanced functionality, the x3620 M3 can help reduce cost, improve service, and allow you to manage risk easily and simply.
Figure 5. IBM System x3620 M3

It supports the following specifications:

- Up to two six-core (up to 3.06 GHz) or quad-core (up to 3.2 GHz) Intel Xeon 5600 series processors with QuickPath Interconnect technology at up to 6.4 GT/s.
- Up to 12 MB L3 cache for Intel Xeon 5600 series processors.
- Up to 192 GB of memory with 16 GB DDR3 registered DIMMs (RDIMMs) and 12 populated DIMM slots (up to 96 GB with 6 DIMMs per processor), at up to 1333 MHz memory speed.
- Up to eight 3.5" hot-swap SAS/SATA HDD drive bays (model dependent). Up to 16.0 TB of internal storage with 2 TB SATA HDDs. Intermixing SAS and SATA is supported.
- RAID support:
  - Software RAID 0, 1 with the integrated 6-port SATA controller.
  - RAID 0, 1, 1E with an optional ServeRAID M1015 controller.
  - RAID 0, 1, 5, 10, 50 with an optional ServeRAID M5014 or M5015 controller.
  - An optional upgrade to RAID 5 is available for the M1015; an optional upgrade to RAID 6, 60 is available for the M5014/M5015.
- Network interfaces: integrated 2-port Gigabit Ethernet (Intel 82575).
- Three PCI Express 2.0 slots:
  - One PCI Express 2.0 x16 (x4 wired), dedicated to ServeRAID BR10il v2
  - One PCI Express 2.0 x16 (x8 wired), full-height, half-length
  - One PCI Express 2.0 x8 (x8 wired), full-height, half-length

SAP Data Services 4.1

SAP Data Services is an enterprise-class solution for data integration, data quality, data profiling, and text data processing that allows you to integrate, transform, improve, and deliver trusted data to critical business processes. It provides one development UI, metadata repository, data connectivity layer, runtime environment, and management console. In our scenario, SAP Data Services provides the bridge between SAP HANA and Hadoop. In an enterprise environment, it would probably be used for all types of extract, transform, and load capabilities, so that our use case would be a minor extension of its duties. In this example, Hadoop is just another data source from the SAP Data Services perspective, and can be handled just like a database.
Note that Hadoop support was added as of the SAP Data Services 4.1 release, providing high-performance reading from and loading into Hadoop. Business analysts can use a familiar UI to define queries that identify and extract relevant data and text from Hive or HDFS. Since SAP Data Services transforms the queries into HiveQL or Pig scripts that are executed as MapReduce jobs, business analysts can extract meaningful information without having to be programming experts. SAP
Data Services then provisions the relevant data to SAP HANA or another data target, enabling deeper contextual analysis of structured and unstructured information.

SAP BusinessObjects BI 4.0

The SAP BusinessObjects BI platform is designed to help users discover and share insights for better decisions. The platform supports a comprehensive BI suite, integrates with existing applications and information sources, and provides everyone across your organization with decision-ready information. This scenario used an SAP-internal pre-release of SAP BusinessObjects BI 4.0 Feature Pack 3; by the time this paper goes to press, this should be available to customers and partners. The main software component we relied on was SAP BusinessObjects Explorer, Accelerated Version, which combines intuitive information search and exploration functionality with the high performance and scalability of in-memory analytics. With immediate insights into vast amounts of data from anywhere in the organization, you can explore business at the speed of thought and improve your ability to make sound, timely decisions.
SAP HANA, Hadoop, and GPFS cluster

Figure 6. Cluster infrastructure
Results

The team at SAP was able to build the desired scenario relatively easily. A concise video summary of the whole scenario is available on YouTube. The setup and configuration of all components was straightforward and uneventful. Note, however, that the Hadoop-related work requires specialized skills and knowledge that may not be readily available in many enterprises today. This includes the setup, configuration, and maintenance of the Hadoop cluster, as well as the custom MapReduce coding necessary to process the web logs. For Hadoop-related tasks, an enterprise may wish to engage services or training support from a partner, or otherwise ensure that qualified staff is available. The entire scenario described in this paper, including software, hardware, and data, was moved to the SAP Co-Innovation Lab (COIL) and made available to interested partners.
Summary

In this example, you've seen how using Hadoop as an extended workbench can make sense under the right circumstances. Hadoop, as a complement to enterprise technologies such as SAP HANA, opens up new ways to use data that would not have been practical in the past. At this stage of maturity, Hadoop is still quite far from being a point-and-click solution, but for sufficiently valuable use cases it is entirely feasible. For the company in this example scenario, the solution has used available sources of data in new ways to achieve deeper customer insights, including a fuller picture of purchasing behavior and brand awareness across channels. With these insights, you could improve sales by generating more accurate and comprehensive real-time information about customer interests and optimizing your product offerings to address discrepancies between customer online behavior and in-store purchasing behavior.
Resources

These websites provide useful references to supplement the information contained in this paper:

- Latest documentation for SAP HANA: https://service.sap.com/hana *
- SAP BusinessObjects Data Services XI 4.0 (14.0.0): describes the options for setting up connections to the SAP HANA database
- Enterprise Information Management with SAP
- IBM Systems Solution for SAP HANA: ibm.com/systems/x/solutions/sap/hana/
- An Introduction to GPFS Version 3.x (IBM white paper): ftp://public.dhe.ibm.com/common/ssi/ecm/en/xbw03010usen/xbw03010usen.pdf
- GPFS Library, IBM Cluster Information Center: s.doc/gpfsbooks.html
- IBM Redbooks
- IBM Publications Center
- IBM Systems on PartnerWorld

(*) You need an authorized user ID to access this information.
About the authors

Christopher Y. Chung is a Sr. IT Specialist in the IBM Systems and Technology Group ISV Enablement organization. He has over 10 years of experience working with IBM System x and other IBM platforms within his more than 30 years in the IT industry. His experience includes producing technical papers, performing benchmarks with major RDBMS platforms, and developing tuning guides for enterprise resource planning (ERP) applications on multiple IBM hardware platforms.

Will Gardella leads the Cloud and Big Data technology innovation team at SAP Labs in Palo Alto, California. He is the technical advisor for the SAP global Big Data strategy and has focused on applying Hadoop technology in the enterprise context.

Dewei Sun has been working in the software industry for over 13 years. He has extensive experience and skills in developing content management systems (CMS), ad systems, data warehousing, and Big Data systems for companies including Fox News Corporation, NexTag, Google, and now SAP. As a technical lead of the SAP Big Data team (Hadoop and related projects), he is responsible for infrastructure design and development management. Dewei holds an MS in Computer Science from USC and a BS in Information Science from Tunghai University in Taiwan.

Thanks to Prasenjit Sarkar and Reshu Jain from IBM Almaden Research for their contributions to this project on GPFS configuration.
Trademarks and special notices

Copyright IBM Corporation 2012.

References in this document to IBM products or services do not imply that IBM intends to make them available in every country.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information".

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

SET and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC.

Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources, and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide home pages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function, or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is
20 presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Photographs shown are of engineering prototypes. Changes may be incorporated in production models. Any references in this information to non-ibm websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. XSW03118-USEN-01