In-Memory Computing: Powering Enterprise High-Performance Computing



Cognizant 20-20 Insights | November 2015

To succeed in today's digital era, organizations must embrace the next wave of hyperscale computing into mainstream business by considering in-memory computing technologies that not only bolster their large-scale data processing capabilities but also accelerate the transformation of raw information into applied knowledge.

Executive Summary

Traditional high-performance computing (HPC)/supercomputing, analytics and mainstream real-time/batch computing are quickly converging. Mainstream workloads are crossing over into the high-performance computing arena, demanding faster analytics and batching, resource-intensive computations and algorithms. To succeed in today's accelerating digital world, enterprises must collect and analyze mind-boggling amounts of data, in real time and at ever-faster speeds that most legacy enterprise HPC technologies and systems were not originally designed to accommodate.

In our view, organizations need to embark on what we call Enterprise HPC 2.0. This term refers to the ecosystem that leverages the latest commodity-hardware-based hyperscale grid technologies, such as in-memory computing (IMC), compute and data grid technologies, streaming analytics and graph analytics. These are used in conjunction with infrastructure advancements, such as solid-state drive (SSD) technology, GPGPU acceleration and general-purpose InfiniBand interconnects, that enable IT organizations to fast-track enterprise computing to better serve the ever-growing data needs of the business.

Significant enthusiasm is building around the IMC paradigm for large-scale data analysis. Historically, in-memory grid technologies were primarily data-focused and used by organizations for distributed caching patterns to achieve low-latency reads of critical transactional data.
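The distributed caching pattern mentioned above is easiest to see as cache-aside reads: the application checks the in-memory copy first and falls back to the slower system of record only on a miss. Below is a minimal single-node Python sketch of this read path; all names are illustrative, not any vendor's API.

```python
import time

DISK_LATENCY = 0.01  # simulated cost (seconds) of one read from the system of record


def load_from_store(key, store):
    """Simulate a slow read from the disk-based system of record."""
    time.sleep(DISK_LATENCY)
    return store[key]


class CacheAside:
    """Cache-aside reader: serve from memory, populate on a miss."""

    def __init__(self, store):
        self.store = store   # backing system of record
        self.cache = {}      # in-memory copy of hot data
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1            # low-latency in-memory read
            return self.cache[key]
        self.misses += 1
        value = load_from_store(key, self.store)
        self.cache[key] = value       # cache the value for later reads
        return value


store = {"acct:42": {"balance": 100}}
reader = CacheAside(store)
first = reader.get("acct:42")    # miss: reads from the slow store
second = reader.get("acct:42")   # hit: served from memory
```

In a real in-memory grid the cache dictionary would be partitioned and replicated across cluster nodes, but the read path is the same.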
However, IMC technology is progressively emerging as a key enabling agent for enterprises seeking to accelerate their real-time decision-making ability and agility by supporting Web-scale data processing, a capability necessary for staying relevant and competitive in today's digital era. IMC's impact is typically felt where organizations are creating new and more innovative ways of working. A dramatic reduction in memory hardware costs also favors the growth of IMC technologies.

However, several factors continue to slow enterprise adoption: a fragmented technology and vendor landscape, a lack of commonly agreed-upon industry standards, a scarcity of skills and still-emerging industry best practices. Given that the technology remains in its adolescence, selecting the right IMC technology is critical to any strategic digital business transformation decision. Soaring enterprise workloads, and the use cases that make use of in-memory processing, inform key decisions around IMC platform selection. A blind jump into the IMC technology valley will not yield durable value; it requires clear and effective analysis of workloads and business priorities, with the goal of increasing scalable performance and competitive benefit for the business. This entails skilled experts performing a focused evaluation. Furthermore, the multitude of new and emerging products makes it extremely challenging to select the right product and approach. However daunting this decision may seem, it is of utmost importance for organizations to use IMC technology to help address their ever-mounting high-performance, low-latency processing needs across the enterprise.

This white paper summarizes the features and benefits of using IMC for large-scale data-set aggregations using multiple popular IMC approaches. The paper presents results from an internal study in which we created an evaluation scenario to compare various IMC approaches and technology architectures. The results establish that a simple migration to an IMC technology yields performance 13 times greater for a given batch workload previously implemented using a disk-based architecture. This paper not only highlights the importance of embracing the IMC agenda for enterprise workloads but also offers a formal methodology for choosing the most appropriate IMC platform to fit given business needs.
In-Memory Computing: A Market Check

Effective use of IMC technology, along with a clear adoption strategy, can help enterprises reap multiple benefits. Figure 1 lists some of the key use cases across specific industries. While this is just an indication, the possibilities are abundant and are not limited to this list.

Figure 1: In-Memory Computing (Enterprise HPC 2.0) — key use cases by industry.

Retail: Real-time in-store analytics; fast real-time loyalty offers.
Telecom: Real-time ad placement; real-time sentiment analysis.
Healthcare: Faster medical-imaging processing; genome analysis.
Insurance: Faster claim processing and modeling; faster actuarial science; fraud detection.
Banking & Financial Services: Real-time trading decisions; faster reporting.
Manufacturing: Inventory management; predictive analytics to avoid unplanned downtime.

There have been rapid innovations in the IMC space recently to enable faster computation and processing speeds. These include Hadoop MapReduce, a batch-processing framework that has added support for an in-memory file system called Tachyon. In addition, IBM has added Apache Spark, an IMC system, to its z Systems to bring analytics to mainframes, and SQL Server 2016 Community Technology Preview 2 adds IMC power.

This has led to a plethora of IMC-technology-based products. These products can, however, be classified into segments based on their inherent architecture and technological approach. Moreover, not every IMC system is applicable to every type of enterprise workload. It is therefore imperative to understand the pros and cons of each of these system types in order to effectively select and utilize IMC systems and reap the business benefits.

IMC technology has evolved from its earliest avatar (distributed caching) to today's integrated in-memory platforms that provide storage, compute and transactional services for large-scale data sets. These systems fall under the pure-play IMC category. The alternate IMC segment applies to products such as Apache Spark, which, in our view, does not represent an all-encompassing in-memory technology in the strict sense, since it does not provide a platform for storing large-scale data. It does, however, provide a processing platform for large-scale in-memory computing, is said to deliver performance up to 100 times faster for certain applications [1], and is being endorsed by IBM [2] and Amazon Web Services [3].

Figure 2 illustrates the evolution of IMC technology, some of the popular products in each segment and the typical workloads for which they are best used. Given the rapid pace of innovation, the IMC product landscape requires up-to-date skills and a thorough understanding of a specific IMC system's architectural underpinnings to validate its fit and effective use for a given enterprise workload.
Furthermore, with multiple options available, enterprises can find it difficult to make the best choice of an IMC technology to satisfy their high-performance computing needs. To address these challenges, we at the Cognizant HyPerscale Computing (HPC) Lab have launched a structured methodology to help enterprises realize value from the next wave of hyperscale computing using Enterprise HPC 2.0, which leverages in-memory computing grids.

Figure 2: IMC Technology's Progression — segments, representative products and typical workloads.

Distributed Caches: A cache that partitions its data among all cluster nodes. Products: Memcached, Ehcache, Pivotal GemFire. Best for: distributed key/value caching for low-latency access.

In-Memory Data Grid (IMDG, pure-play IMC): A data fabric across a large cluster of servers for distributed in-memory storage and management of large data sets. Products: Pivotal GemFire XD, Oracle Coherence, GigaSpaces XAP, Hazelcast, Infinispan (JBoss). Best for: real-time big data initiatives, handling HPC payloads along the lines of MapReduce and MPP, with partial SQL support.

In-Memory Database (IMDB, pure-play IMC): An RDBMS that stores data in memory instead of on disk. Products: SAP HANA, Oracle Exalytics/Exadata, MS SQL Server 2014. Best for: an in-memory, high-speed alternative to an existing disk-based RDBMS, with full SQL support and no change to the application.

In-Memory Data Fabric (IMDF, pure-play IMC): A next-generation platform that integrates an IMDG with an IMCG and provides additional features such as CEP and streaming. Products: Apache Ignite (GridGain). Best for: a single integrated platform for real-time big data management and computing, handling new HPC payloads such as streaming and CEP.

In-Memory Compute Grid (IMCG, alternate IMC): A platform for computing and transacting on large-scale data sets in parallel. Products: Apache Spark. Best for: in-memory computation and processing of data stored on disk.
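The distributed caches and data grids described above all rest on the same idea: entries are partitioned across cluster nodes by a hash of the key, so every client deterministically routes a request to the owning node. Below is a minimal single-process Python sketch of that routing; it is our own illustration, not any vendor's API, and the node count and key names are arbitrary.

```python
import hashlib


class DataGrid:
    """Toy in-memory grid: each node holds the partition of keys it owns."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]  # one dict per "node"

    def _owner(self, key):
        # Deterministic hash so every client routes a key to the same node.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)


grid = DataGrid(node_count=3)
for i in range(100):
    grid.put(f"cust:{i}", {"id": i})

# The 100 entries are spread across the three nodes; lookups route correctly.
sizes = [len(node) for node in grid.nodes]
```

A production IMDG adds replication, rebalancing on membership change and network transport, but partition-by-hash is the core that makes horizontal scale-out of both storage and compute possible.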

IMC Value Creation: Methodology

A clear process, as well as a framework, is required to establish the business goals and successfully determine the best-fit IMC technology. This is vital to garner the utmost value from an IMC-led transformation. Figure 3 depicts our four-step process for establishing and identifying the right IMC product for the business.

Figure 3: IMC Technology Selection Process — an IMC assessment methodology spanning four steps, with an establishment phase (Stage I) and a refinement phase (Stage II).

Step 1: Discovery

The business use cases and the workloads to be implemented via IMC technology play a crucial role in product selection, so first the workload is chosen and key goals for implementation are defined. For this white paper, we studied a retail customer-analytics workload previously processed on a modern scalable batch model using Apache Pig, a Hadoop MapReduce-based technology with a disk-based architecture. The nature of the technology used for this implementation limited the solution to an offline, batch-based system. To be better prepared to handle the disruptive nature of consumer behavior, where latency implies loss of business, we preemptively wanted an alternative solution that supports faster, near-real-time performance for the customer's customers. We devised an internal study to transform the batch workload using multiple IMC technologies and successfully applied the appropriate IMC technology to make it faster.

Step 2: Analysis

Next, we defined the key use cases that the workload requires, which become the input for the IMC system evaluation matrix. For quick development of the use case and benchmarking, we wanted the following core features to be readily and easily supported by the product, apart from the in-memory caching features normally available with such products:

Bulk data loading.
SQL support for easy and fast retrieval of data with conditions.
SQL support for joining multiple data sets based on criteria.
Support for creating new tables/data sets dynamically, on the fly, with data from other tables/data sets.
Support for stored procedures/user-defined functions/MapReduce to handle very specific aggregations.
In-memory distributed computation capabilities.

Second, we needed to ascertain the segment of IMC technology that would best suit the workload and identify a potential list of IMC systems from that category that readily support the evaluation criteria for the specific use cases. This list is carefully chosen after deliberation with the enterprise's business and architecture stakeholders. We then performed a deep-dive fit and architectural analysis on the selected list and determined the best-fit match based on the aforementioned evaluation criteria. From the output of this analysis, the final list of IMC systems that closely fit the requirements was determined. Further proof-of-concept, proof-of-technology and benchmarking exercises were performed on the final list of IMC systems to validate, establish and recommend the best-fit IMC system for the given workload.
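The feature checklist above can be exercised end-to-end on a toy data set. The sketch below uses Python's in-memory sqlite3 purely to illustrate the shape of the workload (bulk loading, a SQL join, an aggregation and dynamic data-set creation); sqlite3 is not one of the IMC products under evaluation, and the table and column names are invented.

```python
import sqlite3

# An in-memory database: all tables live in RAM for the connection's lifetime.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (cust_id INT, amount REAL)")
cur.execute("CREATE TABLE customers (cust_id INT, region TEXT)")

# Bulk data loading.
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (1, 20.0), (2, 5.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "NA"), (2, "EU")])

# Dynamic data-set creation from a join plus an aggregation, mirroring the
# "create new data sets with computed fields" requirement.
cur.execute("""
    CREATE TABLE region_totals AS
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
    GROUP BY c.region
""")
totals = dict(cur.execute("SELECT region, total FROM region_totals"))
```

An IMC system that supports this entire flow natively, at cluster scale and in parallel, is what the evaluation criteria are probing for.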

And so, in our case, we selected an initial list of potential IMC products from the IMDG, IMDF and alternate IMC segments, as we needed MapReduce-like capabilities to handle the specific aggregations demanded by the chosen workload. Distributed caching systems lack these features, and an IMDB system like SAP HANA, which primarily supports SQL workloads, was not the right fit in this case. Figure 4 lists the IMC systems selected. For this internal study, we chose products rated as top vendors and leaders in each segment by various leading analysts, drawn from a good mix of commercial and open-source products.

Figure 4: Establishing the Short List.

Pure-play IMC technology (commercial): Pivotal GemFire XD; Oracle Coherence; GigaSpaces XAP.
Pure-play IMC technology (open source): Apache Ignite; Infinispan (JBoss); Hazelcast.
Alternate IMC technology: Apache Spark.

Fitment Analysis

Next, we performed a comprehensive product comparison using a weighted scoring and ranking model across 20 different attributes and dimensions, based on the specific list of features most essential for quick development and benchmarking of the use case, as listed in Figure 5. This methodology helped us quickly shortlist one data grid system each from the commercial and open-source categories for our final evaluation. In-memory data grids offer many other useful features, and IMC vendors have developed unique selling propositions for their products that need to be compared, analyzed and leveraged on a case-by-case basis. The final considerations were based on the score ratings depicted in the two product-comparison scoring charts of Figure 6, which compare the three commercial data grids and the three open-source data grids selected in the previous step, as depicted in Figure 4.
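The weighted scoring model described above reduces to a few lines of arithmetic. In this hypothetical Python sketch, the category weights mirror Figure 5 (Features 60%, System Setup 25%, Dev Setup 15%), while the product names and per-product ratings are made-up placeholders, not our actual scores.

```python
# Category weights from Figure 5 (they sum to 1.0).
WEIGHTS = {"features": 0.60, "system_setup": 0.25, "dev_setup": 0.15}


def weighted_score(ratings):
    """Combine 0-100 category ratings into a single weighted score."""
    return sum(WEIGHTS[cat] * ratings[cat] for cat in WEIGHTS)


# Placeholder candidates: one strong on features, one strong on setup.
candidates = {
    "product_a": {"features": 80, "system_setup": 60, "dev_setup": 70},
    "product_b": {"features": 60, "system_setup": 90, "dev_setup": 90},
}

# Rank highest weighted score first; the features weight dominates.
ranked = sorted(candidates,
                key=lambda p: weighted_score(candidates[p]),
                reverse=True)
```

Because features carry 60% of the weight, a product that is merely adequate to set up but rich in features can outrank one that is easier to operate, which is exactly the trade-off the model is designed to surface.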
Figure 5: Scoring the Requirements.

Features (weightage 60%): Bulk data loading; SQL query support; stored-procedure support; dynamic data-set creation; transaction support; UDF support; SQL joins; subqueries; JDBC driver; caching patterns (side cache, in-line cache); replication; guaranteed delivery; change data capture; cloud integration.
System Setup (weightage 25%): Application server (Tomcat/Jetty) integration; availability of administration consoles; availability of monitoring/management consoles; HA and fault tolerance; deployment and configuration speed.
Dev Setup (weightage 15%): Programming language support (.NET/Java); client SDK/API support; Spring Data support.

Analysis Results

For the final benchmark and evaluation, we chose Apache Spark as the first product, for its reputation as the leading IMC technology to replace the Hadoop MapReduce framework. From the scoring process, we selected Pivotal GemFire XD from the commercial category (the community version of GemFire is now available as Apache Geode); the third product, chosen from the open-source category, was Apache Ignite. Both of these products scored the highest as the potential best-fit technologies to meet our needs (i.e., the other compared products did not support straightforward SQL joins or subqueries).

Figure 6: The Comparative Matrix — side-by-side weighted scores (Features, System Setup, Dev Setup) for the three open-source data grids (Apache Ignite, Hazelcast, JBoss Infinispan) and the three commercial data grids (GigaSpaces XAP, Oracle Coherence, Pivotal GemFire XD).

Performance Benchmarking

An identical computing cluster of three nodes was provisioned using the Cognizant Hyperscale Application Platform, which allows for fast setup and deployment and provides monitoring facilities to gather benchmark results. The system details of each node and the IMC software versions are shown in Figure 7. The three systems were then configured with default cluster settings to determine the as-is performance of the IMC systems compared with traditional Hadoop MapReduce (MR) using Apache Pig on Apache Hadoop YARN 2.4.0. For all three systems, the only setting we changed was to increase the IMC system process's JVM memory parameters so that the total cluster heap size was 250 GB for the in-memory data cache.

Figure 7: Cluster and Software Details.

Node details (per node): disk space 2 TB; RAM 128 GB; CPU cores 32; CPU clock speed 2.6 GHz; operating system CentOS release 6.5.
IMC system versions: Apache Spark 1.3.1; Apache Ignite 1.2.0-incubating; Pivotal GemFire XD 1.4.1.

Benchmark Task

Our study compared a batch workload that performed a good mix of computations to create new data sets, with computed fields based on aggregations performed in previous steps. The original data was persisted in four different structured data sets with relational integrity between them based on certain attributes/fields. The study was done on 50 GB of data with 500 million records using the traditional MR mode, compared with the twin approaches of using alternate IMC (Apache Spark) and using IMDG/NewSQL products.

Benchmark Execution

We executed each task three times for each IMC system and report the average of the trials. Each system executed the benchmark tasks separately to ensure exclusive access to the cluster's resources. During the tests, we found that Apache Ignite, unlike the other systems, did not provide out-of-the-box support for bulk ingestion of data from CSV files and could not stably handle ingestion beyond 1 GB of data with its default cluster settings. This prevented us from testing that system for task executions.

Results

Figure 8 depicts the overall performance numbers of the IMC systems under different task scenarios. It is important to note that although performance tuning was not considered in our study, for optimal performance of each system the configuration parameters must be tweaked based on data size, workload type, hardware capacity, resource utilization, etc. The metrics shown in Figure 8 would therefore change with the tuning and optimization techniques used. However, we expect only the execution times to become faster; the relative performance ratings of these systems should remain equivalent when measured against each other.

Step 3: Recommendation

Third, after creating PoCs and performance-related benchmarks, we can derive, validate and recommend the best-fit IMC system for any given workload. We can also consider where these technologies would deliver the most durable benefit for enterprise workloads by performing such a detailed analysis of their architectural aspects. For the current workload, we established key findings for each IMC system, as shown in Figure 9.

The results provide evidence that using IMC technology accelerates computational performance, which enterprises can harness after due diligence and consideration. IMC technology can considerably improve overall processing times, from data loading to execution. For the given use case and data load, processing times improved 13-fold simply by replacing the MapReduce-based batch system with an IMC technology. We found that Apache Spark was best suited for this particular scenario.

Figure 8: Performance Comparison.

Workload operations mix: aggregations/computations 50%; data-set joins 30%; data-set filters 10%; data-set select/create 10%.
Data-set metrics: input data size (4 data sets) 50 GB; input record count 500 million; output data size (1 denormalized view) 150 GB; output record count 300 million.
Performance metrics: pre-IMC execution time 13 hr 15 min; post-IMC execution time 1 hr 6 sec; total performance improvement by Apache Spark: 13x.
Charts (not reproduced): data loading times (50 GB) and task execution times (50 GB) for Apache Pig, Pivotal GemFire XD and Apache Spark.
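The benchmark bookkeeping described above (three trials per task, averaged, with the improvement computed as the ratio of baseline to in-memory runtime) can be sketched in a few lines. The trial timings below are illustrative numbers chosen to mirror a roughly 13x improvement, not our measured data.

```python
def average_runtime(trials):
    """Average wall-clock time over repeated trials of the same task."""
    return sum(trials) / len(trials)


# Hypothetical wall-clock minutes for three trials of the same batch task.
pre_imc_trials = [795.0, 801.0, 789.0]   # disk-based MapReduce/Pig baseline
post_imc_trials = [60.0, 61.5, 58.5]     # in-memory rerun of the same task

baseline = average_runtime(pre_imc_trials)
in_memory = average_runtime(post_imc_trials)

# Speedup is the ratio of baseline to in-memory execution time.
speedup = baseline / in_memory
```

Averaging over repeated, exclusive runs is what makes the reported ratios defensible: a single trial can be skewed by JVM warm-up, cache state or cluster noise.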

Figure 9: Functional Findings.

Pivotal GemFire XD: Ideal for low-latency transactional and operational workloads. Easy to implement, administer and monitor, with extensive SQL support. However, it is not ideal for analytical and predictive workloads in stand-alone mode and lacks support for running iterative loops over a large number of keys from a specific collection, so processing times deteriorate where that feature is needed.

Apache Spark: Ideal for iterative data analysis and caching intermediate data for real-time querying, and for live-stream analytics and predictive workloads involving machine learning. Easy to implement. However, it is not ideal for transactional processing in stand-alone mode, has rudimentary management and monitoring consoles, and lacks support for in-memory data storage.

Apache Ignite (incubating): Ideal for big data analytics, fraud detection, risk analytics and customer intelligence; a single integrated platform with additional capabilities such as Compute Grid, Service Grid and CEP streaming. However, it is still at a nascent stage and needs to mature out of incubation, lacks an out-of-the-box CSV streamer for bulk data ingestion (so large data-loading times suffer), and is not so easy to implement.

Step 4: Planning

Finally, with the knowledge and validation achieved in the previous steps, we can successfully plan and create an effective IMC roadmap.

Key Recommendations

Our analysis establishes that IMC is the future of computing and a key enabling technology for enterprise HPC workloads that require analytical, predictive and cognitive capabilities. As such, we recommend the following:

Although technology maturity is still uneven, decision-makers should recognize that IMC technologies and architectures are well positioned to be adopted and utilized for mainstream business.
Application development and other IT leaders should look to IMC technology to support a wide range of use cases, including batch, analytics, transaction processing and event processing, rather than limiting the technology to distributed caching applications.

Organizations would benefit by shifting to IMC technology when they need to reengineer established applications to increase performance and scalability for fast transactional data access (e.g., inventory management, financial reference data, real-time transactional data), to offload workloads from legacy systems performing heavyweight offline calculations (e.g., pattern analysis, trade reconciliation, number crunching), or to enable real-time stream processing (e.g., real-time analytics, continuous calculation, fraud detection, clickstream analytics).

When opting for open-source IMC systems, one fail-safe way to proceed is to conduct a PoC and a PoT to validate the system and then adopt the commercial counterpart of the same system to ensure stable system support.

Even though our study was limited to three IMC systems, we recommend that enterprises consider a broader range of products for initial evaluation. The selection should be based on the criteria most critical to the business, such as available expertise, business drivers for IMC adoption, preference for an IMC appliance model, cloud support, post-implementation product support, and the mix of mega-vendors, smaller vendors and newer open-source options for open integration. All of these considerations are critical to the evaluation matrix. This should be accompanied by a deep-dive comparison scoring model, similar to the one we followed, over parameters such as the most significant use cases, the workload patterns of those use cases, short- and long-term goals, and the ability to realize ROI in the next three to five years. A PoC/PoT on the shortlisted products would further reinforce the merits and demerits of any evaluated product.
This would help the enterprise make an informed decision to adopt a new IMC technology that creates impact for its business.

Looking Forward

Although in-memory technology has been around for many years, the latest advancements in scale-out architecture, increased automation and reduced memory costs have broadened the technology's appeal to all enterprises. IMC innovation continues unabated across the whole spectrum of IT market segments, from hardware to application infrastructure to packaged business applications. New in-memory technologies can support new and complex workloads that organizations can confidently apply to achieve competitive advantage. While we do not advise wholesale replacement of traditional approaches with IMC technology, our study suggests that organizations can reap high rewards from the technology if the platform is properly vetted, selected and deployed. So, if you ask us, "What technology can accelerate data processing 10x and deliver real-time business insights with high performance and low latency?", our answer would be Enterprise HPC 2.0 and in-memory computing technology.

Footnotes

[1] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion, "Shark: SQL and Rich Analytics at Scale," June 2013.
[2] http://www.firstpost.com/business/ibms-apache-spark-push-plans-put-spark-bluemix-open-tech-centre-2296260.html
[3] http://searchaws.techtarget.com/news/4500248624/amazon-elastic-mapreduce-moves-forward-with-Apache-Spark

References

"Taxonomy, Definitions and Vendor Landscape for In-Memory Computing Technologies," Gartner report.
"Hype Cycle for In-Memory Computing Technology, 2014," Gartner report.
Noel Yuhanna, "Market Overview: In-Memory Data Platforms," Forrester report, December 26, 2014.

About the Author

Archana Rao is a Senior Technology Architect within the Cognizant HyPerscale Computing Lab, a unit of the Cognizant Technology Labs business unit. She has 11-plus years of cross-industry IT experience developing and providing solutions, focusing on the architecture and design of enterprise high-performance computing (HPC) applications using various compute and data grid technologies such as Hadoop, Windows HPC, in-memory computing, search grids and NoSQL. Archana's focus is on business enablement and transformation through HPC technology and architecture, and she has consulted with many clients implementing strategic technology transformation initiatives. She holds a B.E. in electrical engineering and electronics from the University of Madras, Chennai. Archana can be reached at Archana.Rao2@cognizant.com | Twitter: @ArchanaRA0.

Acknowledgment

Special thanks to Senthil Ramaswamy Sankarasubramanian, Director, Cognizant HyPerscale Computing Lab, a unit of Cognizant Technology Labs, for his invaluable feedback during the course of writing this paper.

About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500, and is ranked among the top-performing and fastest-growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233 | Fax: +1 201 801 0243 | Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com

European Headquarters
1 Kingdom Street, Paddington Central, London W2 6BD
Phone: +44 (0) 20 7297 7600 | Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com

India Operations Headquarters
#5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai 600 096, India
Phone: +91 (0) 44 4209 6000 | Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com

Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission of Cognizant. The information contained herein is subject to change without notice. All trademarks mentioned herein are the property of their respective owners. TL Codex 1546