Applying SAP HANA, Apache Hadoop, and IBM GPFS to retail point of sales and web log data

Christopher Y. Chung, IBM
William Gardella, SAP
Dewei Sun, SAP

IBM Systems and Technology Group ISV Enablement
SAP Labs

May 2012
Abstract

In early 2012, IBM and SAP created a realistic reference scenario demonstrating the benefits of using SAP HANA with Apache Hadoop. In this scenario, SAP HANA provides real-time data exploration, and Hadoop brings cost-effective, batch-style processing of large unstructured data volumes that would be difficult to analyze with traditional technologies. The purpose of the exercise is to enable similar scenarios by providing practical guidance and a blueprint that uses currently available software and hardware technology.

For this project, the landscape consisted of an SAP HANA system running on an IBM eX5 enterprise server, the x3690 X5 Workload Optimized Solution for SAP HANA; a Hadoop cluster of multiple IBM System x x3620 M3 servers; and virtual machine (VM)-based installations of SAP Data Services 4.1 and SAP BusinessObjects BI 4.0. An additional interesting feature is the IBM General Parallel File System (IBM GPFS), which was used to run both SAP HANA and the Hadoop cluster, for optimum support of extremely large and complex analytics against unstructured and real-time data.

Motivation

Big Data is a hot topic in the IT industry, and there have been many inquiries about it from partners, analysts, and customers who are trying to understand the likely impact on the enterprises they serve. The explosive growth of data, driven partly by the ascendancy of the internet, has given rise to new technologies that enable enterprises to take advantage of that data in ways that would not have been feasible in the past. Traditional data stores such as relational databases are increasingly being supplemented by new information sources that typically involve large amounts of unstructured or semi-structured data. These information sources often require considerable processing before they can be used. This paper gives a concrete example of one such scenario and explains how the various parts of the solution fit together.
Demo scenario

Consider the hypothetical case of a large and successful retailer that does business in stores and also on the web. This business needs to deliver real-time customer insights that increase its ability to win customers by ensuring that it offers attractive products at attractive prices. For IT to deliver innovation to the business, it has to be able to leverage all available data to gain a competitive advantage. Driving excellence also means that the solution has to keep ensuring that the company's products are selling at their full potential, day in and day out.

SAP BusinessObjects BI 4.0 and SAP HANA enable users to examine even billions of retail point of sales (POS) records in real time. For the purposes of this case study, we take it as a point of departure that our theoretical retailer already has this capacity. Therefore, the starting information space for this scenario might look something like the image below, with retail point of sales dollar amounts listed by category.

Figure 1. Retail Point of Sales data in BusinessObjects Explorer before enrichment with customer interest data
This shows us in great detail what the retailer sold. That's no small thing, especially when you consider exploring over 1.1 billion records in seven thousand product categories in less than a second. But retailers always have to push for the competitive edge, and in this case they don't know as much as they might like. For example, how can they tell whether their current products are selling at their full potential? They know what they have sold, but how would they know if there was strong customer interest in a product that is failing to sell for some reason? What they need is a measure of customer interest that is independent of sales.

In this demo scenario, we enrich the retailer's point of sales (POS) data using other data that it already has. The company's web logs record what visitors did when they came to the web site. If we could bring enough computational power to bear, we could figure out which products customers look at when they visit. These product views give us a measure of how much customer interest there is in a product. If many people look at a product, that correlates with strong interest, and if few people look at it, that correlates with weak interest. In the end, we can bring that customer interest data into our sales reports to compare the customer interest in products with the actual sales performance of those products. In this way, we can uncover products that should be selling better than they are. The following sections describe the technologies that enable both the before and after scenarios.

Figure 2. The new information space adds the number of users who viewed the products online
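The comparison of customer interest with sales performance can be made concrete with a small sketch. The Python below is an illustration only, not the method used in the demo: the product names, the numbers, the `min_views` and `max_rate` thresholds, and the helper names are all hypothetical. It flags products that attract many views but sell few units per view.

```python
from dataclasses import dataclass


@dataclass
class ProductStats:
    product: str
    views: int        # product views derived from web logs (hypothetical numbers)
    units_sold: int   # units taken from POS data (hypothetical numbers)


def sales_per_view(stats):
    """Units sold per product view; 0.0 when the product was never viewed."""
    return stats.units_sold / stats.views if stats.views else 0.0


def underperformers(all_stats, min_views, max_rate):
    """Products with strong interest (>= min_views) but weak sales per view (< max_rate)."""
    flagged = [
        s for s in all_stats
        if s.views >= min_views and sales_per_view(s) < max_rate
    ]
    # Weakest sales-per-view ratio first.
    return sorted(flagged, key=sales_per_view)


catalog = [
    ProductStats("espresso machine", views=12000, units_sold=240),
    ProductStats("garden hose", views=900, units_sold=450),
    ProductStats("tablet case", views=15000, units_sold=3000),
    ProductStats("desk lamp", views=300, units_sold=3),
]

flagged = underperformers(catalog, min_views=1000, max_rate=0.05)
for s in flagged:
    print(s.product, s.views, s.units_sold)
```

With these made-up numbers only the espresso machine is flagged: it is viewed often but sells rarely relative to that interest. Sensible thresholds would depend on the retailer's own data.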
Big Data solution

SAP's solution for big data is based on the SAP real-time data platform, a comprehensive offering that handles a wide range of enterprise data needs. SAP HANA is the core of this platform, providing real-time in-memory data processing and excellent performance characteristics, even over large data sets. Another piece of the big data puzzle that many people have heard about is Hadoop. Apache Hadoop is a proven, open source technology (really a family of related technologies) developed for processing extremely large sets of raw data in parallel, using highly flexible MapReduce algorithms with maximum cost efficiency.

Although on the surface it may sound like these technologies are in conflict, they are designed with somewhat different core value propositions and usage patterns. This means they can be fruitfully combined, but the question for most businesses is about the Hadoop side of the equation more than the SAP HANA side. This paper aims to illustrate exactly how Hadoop and SAP HANA can complement each other to address a business problem that would be difficult to solve using only traditional relational databases.

The SAP HANA database has been designed and implemented with fast response times for both online analytic processing (OLAP) and online transaction processing (OLTP) workloads as a primary goal. SAP HANA and its functionality are described in greater detail in the following section. While SAP HANA succeeds in providing very fast data processing and analytics for large volumes of high-value data, there are still use cases of interest to customers that are outside the scope of today's SAP HANA implementation or vision.
For this reason, it's important to enable users to easily leverage and combine data residing in Hadoop or similar distributed file systems with data residing in SAP HANA, for real-time business analytics and in business transactions.

Broadly speaking, the use cases for Hadoop are defined by the need for cost-efficient storage and processing of very large volumes of structured, semi-structured, and unstructured data such as web logs, machine data, text data, call data records (CDRs), audio, and video data. In general, the use cases involve some or all of the following features:

- Batch processing: fast response times are less critical than reliability and scalability
- Complex information processing: enables heavily recursive algorithms, machine learning, and queries that cannot be easily expressed in SQL
- Low-value data archiving: data stays available, though access is slower
- Post-hoc analysis: mine raw data that is either schema-less or where the schema changes over time

Hadoop serves as an extended workbench for intensive processing of large amounts of typically unstructured data. Sometimes people refer to this as low-value data, by which they mean that for any given user of that data, the ratio of valuable information to meaningless information is low. This paper describes just such an example. Imagine you wanted to look at terabytes of web logs dating back several years to find out, for every product you have ever sold, how many people looked at the product, how many added it to the shopping cart, and how many actually bought it. Hadoop could handle that, even if the web site has changed a few times and the log entries don't look the same from one year to the next. But the vast majority of the data in the web logs being processed would be actions that do not bear on that question one way or another. The data in the web logs could be used in other ways to answer other questions, so you wouldn't want to just throw it away. Another time, with a different question, a different
small percentage of data would be useful. Only a small percentage is useful for answering any particular question.

Another way of looking at the same usage pattern is to describe it in terms of post-hoc analysis: collecting data first and deciding how you might use it later. This means that you store everything because you're not sure exactly what you're going to need to find out. To take a completely different example, consider a collection of sensor data, such as is collected in an oil field or large power plant. In conducting a forensic investigation, a worker safety team might go looking for particular patterns of sensor readings that preceded a dangerous event, perhaps casting their net wide in hopes of finding a correlation between various connected pieces of equipment. The same team might look for very different events when trying to discover signal patterns that could be used to predict when equipment maintenance is needed.

Figure 3. SAP Real-time Data Platform

The need for large-scale data processing has given rise to shared-nothing architectures as the basis for data-intensive applications such as Hadoop. These architectures are typically implemented with large clusters composed of inexpensive commodity hardware components, an approach which originated in the context of web indexing and search. In such a cluster, each node stores a small portion of the data and can perform computations on that data locally. The coordination of all nodes in the cluster for the execution of users' applications is carried out by a framework that runs on top of the cluster. The central idea in this design is function shipping, where a compute task is moved to the node that has the data, as opposed to the traditional data shipping approach, where the data is moved from a storage node to a compute node.

SAP and IBM understand that big data technology such as Hadoop is important and is rapidly appearing on the radar at many companies.
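The difference between function shipping and data shipping described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not how Hadoop or GPFS is implemented: the two-node `cluster` dictionary, the block contents, and the byte accounting (one byte per character) are all hypothetical and exist only to show why moving the task is cheaper than moving the data.

```python
# Hypothetical cluster layout: node name -> {block id: block contents}.
cluster = {
    "node1": {"b1": "view cart view", "b2": "buy view"},
    "node2": {"b3": "view view buy cart"},
}


def count_views(block):
    """The compute task: count 'view' events in one block of log data."""
    return block.split().count("view")


def data_shipping(cluster):
    """Move every block to a central compute node, then compute there."""
    bytes_moved = 0
    total = 0
    for blocks in cluster.values():
        for block in blocks.values():
            bytes_moved += len(block)   # the whole block crosses the network
            total += count_views(block)
    return total, bytes_moved


def function_shipping(cluster):
    """Run the task on the node that stores each block; move only the results."""
    bytes_moved = 0
    total = 0
    for blocks in cluster.values():
        local = sum(count_views(b) for b in blocks.values())  # computed on the node
        bytes_moved += len(str(local))  # only the small partial result is sent
        total += local
    return total, bytes_moved


print("data shipping:    ", data_shipping(cluster))
print("function shipping:", function_shipping(cluster))
```

Both approaches produce the same count, but in this toy model function shipping moves a few bytes of partial results instead of the full contents of every block. The gap widens as blocks grow, which is why locality information matters to the framework.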
At the same time, the collection of features that are critical at the enterprise level needs to be understood in depth. An important challenge in realizing
performance in data-intensive supercomputing systems is the achievement of high throughput for applications in a commodity environment where bandwidth resources are scarce and failures are common. To achieve this goal, a clustered file system can optimize the performance of data-intensive applications like Hadoop by:

- Exposing locality information to enable function shipping
- Allowing a multiplicity of block sizes in the same file system
- Providing a write affinity mechanism for an application to influence data layout
- Using pipelined replication to optimize the use of network bandwidth
- Supporting distributed recovery to minimize the effect of failure

Solution Configuration

The goal of this activity was to take a highly realistic scenario and build it from end to end, using technologies that are available today and that have good enterprise-readiness. By enterprise-readiness, we mean ease of administration, security, available support, and so forth. This is partly because these features are necessary and expected, and partly because these or similar technologies are probably available in a typical enterprise environment. The following sections consider the main technology components one at a time.

The picture below describes the high-level map of where the data lives and how it's accessed. One component not shown below is SAP Data Services 4.1, which is used for transferring data from Hadoop to SAP HANA.

SAP HANA
What is SAP HANA?

SAP HANA is a next-generation database platform that delivers unprecedented value by natively leveraging hardware evolution in processors, memory, storage, and network architectures. Designed from the ground up to run entirely in memory across clusters of affordable Intel-based servers, SAP HANA combines row-based relational storage with the columnar technology present in leading analytical databases in a single database. With the ability to combine transactional and analytical functions, customers can make decisions on business data as business is happening, enabling organizations to achieve real-time business.

Starting with the accelerators for ERP systems, SAP is building and delivering various classes of applications for various usage scenarios. These include SAP Rapid Deployment Solutions (SAP RDS) and in-memory engine accelerated applications like SAP NetWeaver Business Warehouse (SAP NetWeaver BW) and SAP Smart Meter Analytics. SAP HANA's in-memory optimized engines for planning, predictive, and business-oriented computational libraries will enable the next generation of business applications.

Who uses SAP HANA today?

SAP HANA was made generally available as a standalone database platform in July of Since then, sales of SAP HANA have exceeded all expectations, generating over 160M in sales across more than 250 customers in the first six months. Early customers and ISVs have been able to exploit the real-time capabilities present in HANA to build new classes of solutions not possible before, spanning such use cases as genome sequencing, retail predictive analytics, and vehicle fault analysis. Whereas these customers would previously have had to rely on traditional RDBMS technology and high-maintenance tuning efforts to achieve even moderate performance, they have found that SAP HANA serves as a full-featured and high-performing in-memory database platform.
SAP has also been encouraging additional new and innovative solutions with emerging providers through its support of startup forums in the Silicon Valley area. Along with these uses, SAP has delivered solutions to customers, including SAP HANA application accelerators, native SAP HANA applications, and the latest SAP NetWeaver BW on HANA.

What is the future for SAP HANA?

Customers today are already experiencing unprecedented levels of performance with SAP HANA as the database platform powering such traditional business applications as SAP Business One, SAP Business ByDesign, SAP NetWeaver Business Warehouse, and many other SAP applications and technology solutions in the portfolio. This will continue with support for the SAP Business Suite, planned by the end of 2012, and additional new and innovative SAP applications beyond that. SAP's goal is to ensure that SAP applications and SAP HANA deliver unmatched business process performance with a simplified landscape and user experience.

SAP will continuously innovate on functionality and experience, specifically focusing on performance and the application platform as the key differentiators. SAP plans to re-engineer the enterprise application platform on the basis of the SAP HANA architecture to revolutionize application building, from ease of construction to optimized execution.

SAP HANA will also serve as the core data foundation within the SAP on-demand application development and deployment platform. With SAP BusinessObjects BI on Demand already running on HANA, customers can look forward to many cloud-based real-time business solutions to come.

SAP HANA Configuration

For the purposes of this scenario, we used SAP HANA 1.0 SP3 running on a powerful IBM eX5 enterprise server with the Intel Xeon processor E7 family, combining the speed and efficiency of in-memory processing with the ability to analyze massive amounts of enterprise data. IBM System x3690 X5 is a
powerful two-socket 2U rack-mount server. It is one of two high-end IBM Intel processor-based servers offered as a Workload Optimized Solution model for SAP HANA, designed for and certified by SAP.

Figure 4. IBM System x3690 X5

It supports the following specifications:

- Up to two sockets for Intel Xeon E7 processors. Depending on the processor model, processors have six, eight, or ten cores per socket.
- Scalable from 32 to 64 DIMM sockets with the addition of a MAX5 memory expansion unit (up to 1 TB of memory using all 16 GB DIMMs, not currently certified for SAP HANA).
- Advanced networking capabilities with a Broadcom 5709 dual Gb Ethernet controller and an optional Emulex 10 Gb dual-port Ethernet adapter.
- Up to 16 hot-swap 2.5-inch SAS HDDs or 24 hot-swap 1.8-inch solid state drives (SSDs), and up to 9.6 TB of maximum internal storage with RAID 0, 1, 10, 5, or 50 to maximize throughput and ease installation.
- New eXFlash high-IOPS solid-state storage technology.
- Five PCIe 2.0 slots.
- Integrated Management Module (IMM) for enhanced systems management capabilities.

IBM System x3690 X5 is delivered with key software components preconfigured and preinstalled, such as the SLES for SAP operating system and IBM General Parallel File System (GPFS), to speed delivery and deployment of the solution. The x3690 X5 features IBM eXFlash internal storage, which uses solid state drives to maximize the number of I/O operations per second (IOPS). All configurations for SAP HANA based on the x3690 X5 use eXFlash internal storage for high-IOPS log storage or for both data and log storage.

Apache Hadoop

Apache Hadoop is free and open source Java technology-based software that enables distributed, scalable, reliable computing on clusters of inexpensive servers. It allows for cost-effective distributed storage and processing of very large data sets, ranging up to petabytes. Hadoop uses a simple
programming model that makes it easy to write and run MapReduce applications, which distribute the processing of the map and reduce operations across the cluster. In the "Map" step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node. In the "Reduce" step, the master node collects the answers to all the sub-problems and combines them in some way to form the output, that is, the answer to the problem it was originally trying to solve.

Provided each mapping operation is independent of the others, all maps can be performed in parallel. Similarly, a set of "reducers" can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. Because both computing power and data are distributed, Hadoop can process data quickly in parallel by using each node's local computation and storage.

Note that Hadoop still has a way to go in terms of enterprise readiness, so be aware that extra services or support may be needed for Hadoop projects in the near future. Most users of Hadoop have been moving away from custom MapReduce code and instead use higher-level Apache projects such as Hive (a SQL-like interface) or Pig (a high-level scripting language for data analysts) that are more suitable for enterprise users. These languages result in MapReduce jobs, but don't require the same level of programming skill to create. Hadoop represents a complex and rapidly changing technology space. For more information, the interested reader is encouraged to seek out other sources.

Hadoop and GPFS Cluster Configuration

Data-intensive computing relies on a commodity hardware and software stack. Although other distributions could have been used, this scenario used Cloudera Hadoop cdh3u3 running on SUSE Linux 11 SP1.
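The Map and Reduce steps described in the Apache Hadoop section can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop code and not the MapReduce job written for the demo: the log format (`<visitor> <action> <product>`), the sample lines, and all function names are assumptions made for the example. It counts product views per product, the customer interest measure used in this scenario.

```python
from collections import defaultdict

# Hypothetical log format: "<visitor> <action> <product>".
LOG_LINES = [
    "u1 view lamp",
    "u2 view lamp",
    "u1 cart lamp",
    "u3 view sofa",
    "u2 buy lamp",
    "u3 view lamp",
]


def map_line(line):
    """Map: emit (product, 1) for each product view, nothing for other actions."""
    visitor, action, product = line.split()
    if action == "view":
        yield product, 1


def shuffle(pairs):
    """Group values by key so that each key is handled by exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence within one process."""
    mapped = [pair for line in lines for pair in map_line(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


view_counts = run_job(LOG_LINES)
print(view_counts)
```

Each call to `map_line` depends only on its own input line, which is what allows the map phase to run in parallel on many nodes. The `shuffle` step plays the role of presenting all values for one key to the same reducer.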
For persistence, the IBM General Parallel File System with File Placement Optimizer (GPFS FPO) distributed file system was deployed. This is a notable difference compared to many Hadoop installations, which would instead use HDFS for the distributed file system. GPFS FPO is a high-performance, scalable file management solution available specifically for SAP HANA that provides fast, reliable access to a common set of file data from multiple node systems. A GPFS FPO file system is built from a collection of disks from each node's local storage, which contain the data and metadata associated with every file. GPFS FPO delivers online storage management, scalable access, and integrated information lifecycle tools capable of managing petabytes of data and billions of files. One of the key advantages over other types of file systems is that there is no limit placed upon the number of simultaneously opened files within a single file system. For this project, the scenario used GPFS version

The hardware configuration is based on the IBM System x3620 M3, suitable for building a multi-node cluster that uses each node's local disks to support its required storage capacity. IBM System x3620 M3 is a two-socket 2U rack-mount server built on IBM X-Architecture using Intel QuickPath Interconnect (QPI) technology. Featuring power-optimized, high-performance Intel Xeon multicore processors and an energy-efficient design with balanced functionality, the x3620 M3 can help reduce cost, improve service, and manage risk.
Figure 5. IBM System x3620 M3

It supports the following specifications:

- Up to two six-core (up to 3.06 GHz) or quad-core (up to 3.2 GHz) Intel Xeon 5600 series processors with QuickPath Interconnect technology up to 6.4 GT/s.
- Up to 12 MB L3 cache for Intel Xeon 5600 series processors.
- Up to 192 GB with 16 GB DDR3 registered DIMMs (RDIMMs) and 12 populated DIMM slots (up to 96 GB with 6 DIMMs per processor), with up to 1333 MHz memory speed.
- Up to eight 3.5-inch hot-swap SAS/SATA HDD bays (model dependent).
- Up to 16.0 TB internal storage with 2 TB SATA HDDs. Intermix of SAS/SATA is supported.
- RAID support:
  - Software RAID 0, 1 with the integrated 6-port SATA controller.
  - RAID 0, 1, 1E with an optional ServeRAID M1015 controller.
  - RAID 0, 1, 5, 10, 50 with an optional ServeRAID M5014 or M5015 controller.
  - Optional upgrade to RAID 5 is available for the M1015.
  - Optional upgrade to RAID 6, 60 is available for the M5014/M5015.
- Network interfaces: integrated 2-port Gigabit Ethernet (Intel 82575).
- Three PCI Express 2.0 slots:
  - One PCI Express 2.0 x16 (x4 wired), dedicated to ServeRAID BR10il v2
  - One PCI Express 2.0 x16 (x8 wired), full-height, half-length
  - One PCI Express 2.0 x8 (x8 wired), full-height, half-length

SAP Data Services 4.1

SAP Data Services is an enterprise-class solution for data integration, data quality, data profiling, and text data processing that allows you to integrate, transform, improve, and deliver trusted data to critical business processes. It provides one development UI, metadata repository, data connectivity layer, runtime environment, and management console.

In our scenario, SAP Data Services provides the bridge between SAP HANA and Hadoop. In an enterprise environment, it would probably be used for all types of extract, transform, and load tasks, so our use case would be a minor extension of its duties. In this example, Hadoop is just another data source from the SAP Data Services perspective and can be handled just like a database.
Note that Hadoop support was added as of the SAP Data Services 4.1 release, providing high-performance reading from and loading into Hadoop. Business analysts can use a familiar UI to define queries that identify and extract relevant data and text from Hive or HDFS. Because SAP Data Services transforms the queries into HiveQL or Pig scripts that run as MapReduce jobs, business analysts can extract meaningful information without having to be programming experts. SAP
Data Services then provisions relevant data to SAP HANA or any other data store, enabling deeper contextual analysis of structured and unstructured information.

SAP BusinessObjects BI 4.0

The SAP BusinessObjects BI platform is designed to help users discover and share insights for better decisions. The platform supports a comprehensive BI suite, integrates with existing applications and information sources, and provides everyone across the organization with decision-ready information.

This scenario used an SAP-internal pre-release of SAP BusinessObjects BI 4.0 Feature Pack 3. By the time this paper goes to press, this should be available to customers and partners. The main software component we relied on was SAP BusinessObjects Explorer, Accelerated Version, which combines intuitive information search and exploration functionality with the high performance and scalability of in-memory analytics. With immediate insights into vast amounts of data from anywhere in the organization, you can explore business at the speed of thought and improve your ability to make sound, timely decisions.
SAP HANA, Hadoop, and GPFS cluster

Figure 6. Cluster Infrastructure
Results

The team at SAP was able to build the desired scenario relatively easily. A concise video summary of the whole scenario is available on YouTube at the following URL:

The setup and configuration of all components was straightforward and uneventful. Note, however, that the Hadoop-related work requires specialized skills and knowledge that may not be readily available in many enterprises today. This includes the setup, configuration, and maintenance of the Hadoop cluster, as well as the custom MapReduce coding necessary to process the web logs. For Hadoop-related tasks, an enterprise may wish to engage services or training support from a partner, or otherwise ensure that qualified staff is available.

The entire scenario described in this paper, including software, hardware, and data, was moved to the SAP Co-Innovation Lab (COIL) and made available to interested partners.
Summary

In this example, you've seen how using Hadoop as an extended workbench can make sense under the right circumstances. Hadoop, as a complement to enterprise technologies such as SAP HANA, opens up new ways to use data that would not have been practical in the past. At this stage of maturity, Hadoop is still quite far from a point-and-click solution, but for sufficiently valuable use cases it is entirely feasible.

For the company in this example scenario, the solution has used available sources of data in new ways to achieve deeper customer insights, including a fuller picture of purchasing behavior and brand awareness across channels. With these insights, you could improve sales by generating more accurate and comprehensive real-time information about customer interests and by optimizing your product offerings to address discrepancies between customers' online behavior and in-store purchasing behavior.
About the authors

Christopher Y. Chung is a Sr. IT Specialist in the IBM Systems and Technology Group ISV Enablement Organization. He has over 10 years of experience working with IBM System x and other IBM platforms in his over 30 years of experience in the IT industry. His experience includes producing technical papers, performing benchmarks with major RDBMS platforms, and developing tuning guides for enterprise resource planning (ERP) applications on multiple IBM hardware platforms. You can reach Christopher at

Will Gardella leads the Cloud and Big Data technology innovation team at SAP Labs in Palo Alto, California. He is the technical advisor for the SAP global Big Data strategy and has focused on applying Hadoop technology in the enterprise context since You can reach Will at

Dewei Sun has been working in the software industry for over 13 years. He has extensive experience and skills in developing CMS (Content Management Systems), ad systems, data warehousing, and Big Data systems for companies like Fox News Corporation, NexTag, Google, and now SAP. As a technical lead of the SAP Big Data team (Hadoop and related projects), he is responsible for infrastructure design and development management. Dewei holds an MS in Computer Science from USC and a BS in Information Science from Tunghai University in Taiwan.

Thanks to Prasenjit Sarkar and Reshu Jain from IBM Almaden Research for their contributions on GPFS configuration to this project.