April 2014

The Elephant on the Mainframe
Using Hadoop to Gain Insights from Mainframe Data
A Joint Point of View from IBM and Veristorm


Contents

The Mainframer's Guide to Hadoop
There's relational data, and there's everything else
Why should mainframe shops be interested in analyzing non-relational data?
  Non-relational data can help IT
  Non-relational data can help the business
What is Hadoop and why is it so efficient at processing everything else?
  A brief technical description of Hadoop
But wait: Hadoop is designed for commodity hardware, right?
  Commodity clusters are not always the most efficient option
  However, it's really about the data
Introducing Veristorm's vstorm Enterprise
  zdoop: the Hadoop framework for System z
  vstorm Connect: populating the Hadoop ecosystem
vstorm Enterprise: secure, easy, agile, efficient analysis of mainframe data
  A secure pipe for data
  Easy-to-use ingestion engine
  Templates for agile deployment
  Mainframe efficiencies
  And more
Real clients, real needs
  Detecting fraud
  Enhancing insurance policy enforcement
  Reducing healthcare costs
  Improving claims response time
Putting vstorm Enterprise to the test
  Table stakes: executing the expected stress tests
  The configuration
  The tests
  But what can vstorm really do? The scenario
How would it feel to have an elephant on your mainframe?
  Hadoop as part of a hybrid transaction and analytics platform
More information
  About Veristorm
  About IBM

The Mainframer's Guide to Hadoop

It's impossible to check your favorite technical news feeds or attend a conference these days without being exposed to several definitions of big data and opinions as to why it's important. Many of these exhortations are tied to Hadoop and its cute little yellow elephant logo. In this paper we will avoid the hype surrounding big data and spare you yet another definition; just bear in mind that we view big data as representing all data that could be used to achieve better business outcomes, as well as the technologies leveraged to manage and analyze it. Our goal is to take a critical look at Hadoop from the mainframe perspective, and examine how new technology available today from Veristorm can help you solve problems that you may be struggling with. There is, indeed, a place for the elephant on the mainframe.

There's relational data, and there's everything else

The mainframe is the undisputed king of transactional data. Depending on whose opinion you read, anywhere from 60% to 80% of all the world's transactional data is said to reside on the mainframe. Since transactions are, by definition, highly repeatable entities, the past fifty years have seen a number of breakthrough improvements in the way the data underneath these transactions is managed. For the past couple of decades, the gold standard has been the relational database. The art of managing transactional data in relational databases has been perfected on the mainframe.

Similarly, the art of relational data analysis has also been perfected on the mainframe. It is a near certainty that if you are heavily invested in relational data stores for your operational data, you are also heavily invested in data warehouses, data marts, and the like to efficiently gain insights from that data. Mainframers know how to collect, manage, and analyze their relational data.

But in reality this relational data only represents a fraction of all the data held on the mainframe. Long-time mainframe shops almost certainly have significant data assets stored in record-oriented file management systems, such as VSAM, that predate the advent of the relational database. A significant amount of modern data is stored in XML format. And a huge, largely untapped source of data comes in the form of unstructured or semi-structured data that has been dubbed data exhaust [i]: the system logs, application logs, event logs and similar data that is routinely generated by the day-to-day activities of operational systems.

Is there any value in analyzing this non-relational data and data exhaust, and if so, what's the best way to get at it? One only needs to look at recent surveys and studies to see that there is currently a very strong desire to unlock insights from all that non-relational data held by the mainframe. For example, Gartner's 2013 big data adoption study [ii] revealed that transactional sources are the dominant data types analyzed in big data initiatives, which one would expect given all the historical focus on analyzing this data. But log data came in a strong second place in this study, showing that there is indeed value to be gained from analyzing this data exhaust.

Why should mainframe shops be interested in analyzing non-relational data?

The Gartner study shows a strong interest in non-relational analysis, but why, and what does this mean to mainframe shops? We'll examine some specific use cases later in this paper; for now, here are some general ideas for using data exhaust to improve results.

Non-relational data can help IT

It's quite possible that you may already have been exposed to IBM features and products that analyze mainframe data exhaust to help the IT shop improve performance, optimize resources, and take pre-emptive action to avoid problems. For example:

- The IBM System z Advanced Workload Analysis Reporter (IBM zAware, a feature of the IBM zEnterprise EC12 and IBM zEnterprise BC12) monitors mainframe data exhaust to help identify unusual system behavior in near real time and improve problem determination, leading to an overall improvement in availability of systems.
- IBM Capacity Management Analytics [iii] is an advanced analytics solution designed specifically to help manage and predict consumption of IBM zEnterprise infrastructure resources.
- IBM SmartCloud Analytics - Log Analysis [iv] analyzes mainframe data exhaust to help identify, isolate and resolve a variety of IT problems.

These are all complete solutions with pre-built templates and reports that target specific use cases and scenarios. If you need to gain insights from your log data that these solutions were not set up to handle (or cannot be extended to handle), building your own Hadoop cluster for your specific needs can be the next best thing.

Non-relational data can help the business

Non-relational data analysis is not just for improving IT. A myriad of business applications create their own data exhaust, and mining these logs and files can help improve business results.

Consider business applications developed as part of a Service Oriented Architecture (SOA). SOA applications are composed of several, sometimes many, discrete services interacting with each other. If all of these services do not interact properly, business results will suffer: orders may be lost, inquiries may go unanswered, customers may lose patience and take their business elsewhere. Isolating the cause of failure or sluggish performance in any given business application can be quite difficult due to the number of moving parts. But these services emit sufficient log and trace data exhaust that a rigorous analysis can help isolate and rectify problems. Consolidating these logs into a Hadoop cluster for analysis can be the ideal means of improving results and resolving issues.

What is Hadoop and why is it so efficient at processing everything else?

The traditional data analysis systems that we have built to analyze mainframe data are very, very good at producing insights when questions are well defined; for example, looking at our sales data for the last quarter, which territories are underperforming based on historical norms? A little Structured Query Language (SQL) and a trip to the warehouse will easily produce the desired results.

What's come to be known as big data analysis begins without any such preconceived questions; it's more focused on sifting through mounds of data in an exploratory fashion, performing iterative analysis to try to identify patterns. In other words, we don't know in advance what questions will produce useful insights. The traditional structured warehouse analytics approach is just not set up for this kind of analysis. Although advances have been made in using parallel technologies to greatly accelerate queries in traditional analytics systems (see, for example, the IBM DB2 Analytics Accelerator for z/OS [v] built on Netezza technology), these systems are still focused on answering structured inquiries. They are not necessarily well suited for performing exploratory analysis against unstructured data. Something different is required. Many people believe that Hadoop is that something different.

A brief technical description of Hadoop

Hadoop is an open-source project from the Apache Software Foundation. At its most basic level, Hadoop is a highly parallelized computing environment coupled with a distributed, clustered file system that is generally viewed to be the most efficient environment for tackling the analysis of unstructured data. It was designed specifically for rapid processing across very large data sets and is ideally suited for discovery analysis (note that Hadoop is by and large batch-oriented and is not well suited for real-time analysis; every technique has its place and Hadoop is not the answer to every problem!).

When dealing with analysis that involves a high level of computation and/or a data set that is very large and/or unpredictable, the most efficient way to tackle that analysis is to divide both the data and the computation across clusters of processing units that all work on bits of the problem in parallel. That's exactly what Hadoop does: it breaks work down into mapper and reducer tasks that manipulate data that has been spread across a parallel cluster.

The Hadoop Distributed File System (HDFS)

HDFS is a distributed file system optimized to support very large files (MBs to TBs). With HDFS, each file is broken down into smaller pieces called blocks that are distributed throughout a cluster of hundreds to thousands of nodes; by distributing data in this fashion, all the nodes can work on a single problem in parallel.
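Before going further into how HDFS manages that data, it may help to see what the mapper and reducer division looks like in practice. The sketch below is a minimal native MapReduce job that counts ERROR records per service across application logs stored in HDFS, the kind of data exhaust analysis described earlier. It is purely illustrative: the log layout, paths, and class names are assumptions made for this paper, not part of any IBM or Veristorm product.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts ERROR log records per service name across log files stored in HDFS. */
public class ErrorsByService {

    /** Map phase: emit (serviceName, 1) for every line whose level field is ERROR. */
    public static class ErrorMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text service = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical log layout: "<timestamp> <serviceName> <level> <message>"
            String[] fields = value.toString().split("\\s+", 4);
            if (fields.length == 4 && "ERROR".equals(fields[2])) {
                service.set(fields[1]);
                context.write(service, ONE);
            }
        }
    }

    /** Reduce phase: sum the counts emitted for each service. */
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "errors-by-service");
        job.setJarByClass(ErrorsByService.class);
        job.setMapperClass(ErrorMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/soa-services
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /reports/errors-by-service
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be packaged as a JAR and submitted to the cluster, with the input and output locations given as HDFS paths.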

Since Hadoop was initially designed with a deployment platform of commodity processing nodes and disks in mind, data replication is built into HDFS by default to enable some degree of fault tolerance. In a Hadoop cluster, every block of data can be replicated across multiple servers; the default is for three copies of the data to be spread across the nodes of the HDFS, and this can be increased or decreased depending on one's confidence in the reliability of the compute nodes and storage medium. This feature of HDFS can also prove to have a downside, as at some point the economics of holding all data in triplicate can outweigh the benefit of using cheap disks. It also increases the overall I/O load on the system, and avoiding I/O bottlenecks becomes a difficult cluster design consideration.

MapReduce jobs

Once the HDFS has been populated with the data to be analyzed, Hadoop offers a programming paradigm called MapReduce to process that data. MapReduce is a two-phase process in which a Map job first performs filtering and sorting of the data, followed by a Reduce job that performs some summary operation on that data. The overall MapReduce framework directs all nodes to perform these operations on the blocks of data they are assigned, in parallel, and results are rolled up into a single result set.

Native Hadoop MapReduce jobs are written in Java and can take some specialized skills to produce. For this reason, a number of additional packages have emerged, such as Hive and Pig, that put a friendlier, SQL-like or procedural flavor on the programming process.

But wait: Hadoop is designed for commodity hardware, right?

Virtually every paper, blog and point of view concerning Hadoop points out that Hadoop was designed to be run in massive clusters of commodity processors backed by inexpensive disk. This is not the definition of a mainframe. So why are we talking about Hadoop in a mainframe paper?

Commodity clusters are not always the most efficient option

Companies like Google, Amazon and eBay have shown that a large-scale enterprise can thrive on commodity hardware. But that doesn't mean that such a deployment model best serves every business model. IBM, for example, found it far more efficient to replace its distributed analytics servers with a highly virtualized System z environment. The project, called Blue Insights [vi], saved the company tens of millions of dollars in infrastructure cost, and also delivered the agility to onboard new projects in a fraction of the time it used to take.

But the cost or efficiency of the underlying hardware is actually the least interesting factor in choosing a Hadoop deployment platform. The most critical factor is a thorough understanding of where the data to be analyzed comes from, and how it will be used.

However, it's really about the data

Whenever data is moved from place to place, it is put at risk. And the financial and reputational ramifications of that risk can be huge. When a security flaw exposed card data for some 70 million of its customers towards the end of 2013, Target reported meaningfully weaker-than-expected sales for the critical holiday season, and the ongoing lack of trust in Target stores led to sharp declines in earnings per share, stock price, and just about any other tangible measure of business success. While Target's breach had nothing to do with Hadoop, the lesson learned is that any security exposure can lead to significantly adverse business results; data security should be the first priority in any Hadoop deployment.

When you're mainly interested in non-mainframe data

Although the mainframe controls a huge amount of critical data, it certainly does not control all of it. And this non-mainframe data is growing at exponential rates. Consider social media data. Twitter alone generates 12 terabytes of tweets each and every day. It is not desirable or practical to move such volumes of non-mainframe data into the mainframe for analysis. However, there are use cases where such data is of value. For example, it would be very useful to comb through social media for information that can augment the traditional analytics that you already do against your relational mainframe data.

A solution is already available for such scenarios. IBM's InfoSphere BigInsights [vii] product running, for example, on a Power or IBM System x server, is certainly well suited to ingest and process this sort of data. Connectors available with DB2 for z/OS enable a DB2 process to initiate a BigInsights analytics job against the remote Hadoop cluster, then ingest the result set back into a relational database or traditional data warehouse to augment the mainframe data (there are similar connectors for IMS, but that is not the focus of this paper). Mainframe data never leaves the mainframe, and is augmented and enhanced by data ingested from other sources.

When most of the data you want to analyze originates on the mainframe

In this paper we'd like to focus on those scenarios where all, or the majority, of the data to be analyzed originates on the mainframe. This is where infrastructure considerations have to move well past the commodity or virtualized phase and focus on data governance issues. When the data that you need to analyze sits on the mainframe, even in files and logs, that data must be accessed, ingested and processed within a consistent governance framework that maintains tight security throughout the entire data lifecycle.

To address this need, Veristorm, a company whose principals have decades of experience handling mainframe data, has delivered vstorm Enterprise, a solution for mainframe-based Hadoop processing. It is currently the only supported solution on the market that allows mainframe data to be analyzed with Hadoop while preserving the mainframe security model.

Introducing Veristorm's vstorm Enterprise

Veristorm has delivered the first commercial distribution of Hadoop for the mainframe. The Hadoop distribution, coupled with state-of-the-art data connector technology, makes z/OS data available for processing using the Hadoop big data paradigm without that data ever having to leave the mainframe. Because the entire solution runs in Linux on System z, it can be deployed to low-cost, dedicated mainframe Linux processors, known as the Integrated Facility for Linux (IFL). Plus, by taking advantage of the mainframe's ability to activate additional capacity on demand, as needed, vstorm Enterprise can be used to build out a highly scalable private cloud for big data analysis using Hadoop.

zdoop: the Hadoop framework for System z

vstorm Enterprise includes zdoop, a fully supported implementation of the open source Apache Hadoop project. zdoop supports an ever-expanding open source big data ecosystem. In addition to basic support for MapReduce processing and HDFS, zdoop delivers Hive for developers who prefer to draw on their SQL background, and Pig for a more procedural approach to building applications. zdoop currently supports the following popular packages:

- Apache Hive
- Apache HCatalog
- Apache ZooKeeper
- Apache Pig
- Apache Sqoop
- Apache Flume

Veristorm is continuously identifying new big data ecosystem components to support with their extensible plug-in model, based on customer requirements. For example, Cassandra and MongoDB are under consideration for future support.
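For developers coming from the SQL side, working with Hive feels much like working with any relational database. As an illustration only (the HiveServer2 host, credentials, and table below are hypothetical and not taken from the zdoop documentation), the same error-count analysis shown earlier as a MapReduce job can be expressed as a single query submitted over Hive's standard JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Queries a Hive table over JDBC, the SQL-flavored alternative to hand-written MapReduce. */
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (part of the Apache Hive client libraries).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, database, and credentials are hypothetical.
        String url = "jdbc:hive2://zdoop-node1:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hivedemo", "");
             Statement stmt = conn.createStatement()) {
            // Hive turns this SQL into parallel work across the cluster behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT service, COUNT(*) AS errors " +
                "FROM app_logs WHERE level = 'ERROR' " +
                "GROUP BY service ORDER BY errors DESC LIMIT 20");
            while (rs.next()) {
                System.out.printf("%-30s %d%n", rs.getString("service"), rs.getLong("errors"));
            }
        }
    }
}
```

Behind the scenes, Hive compiles the query into MapReduce work and runs it across the cluster, so the SQL author never has to write mapper or reducer code.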

vstorm Connect: populating the Hadoop ecosystem

Because Hadoop is designed to handle very large data sets, one critical consideration for any Hadoop project is how the data gets ingested into the HDFS. Current techniques to move mainframe data off-platform for analysis typically involve the use of cumbersome, heavyweight federation engines that require a fair amount of skill to set up. To simplify and streamline data ingestion for mainframe users, vstorm Enterprise also includes vstorm Connect, a set of native data collectors that uses a graphical interface to facilitate the movement of z/OS data to the zdoop HDFS.

Out of the box, vstorm Connect provides the ability to ingest a wide variety of z/OS data sources into the HDFS, including:

- DB2
- VSAM
- QSAM
- SMF and RMF
- System log files and operator log files

It eliminates the need for SQL extracts or ETL consulting engagements by providing simple point-and-click data movement from z/OS to HDFS and Hive. Data conversion requirements, such as EBCDIC code page and BCD issues, are handled in-flight without incurring z/OS MIPS costs.

vstorm Connect is not just easy to use; it's fast and economical, too. Data and metadata from the selected source are streamed to the selected target without intermediate staging, eliminating the need for additional storage in z/OS. vstorm Connect's patented parallel streaming technology supports both HiperSockets (the mainframe-unique mechanism for high-speed connectivity between partitions) as well as a standard 10 Gbps network connection. Working with IBM's performance test team in Poughkeepsie, New York, Veristorm has been able to demonstrate transfer rates that measure very close to the theoretical maximum for the connection.
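To give a sense of what "handled in-flight" involves, the sketch below decodes the two field types just mentioned: EBCDIC text and packed-decimal (BCD, or COBOL COMP-3) numbers. It is only an illustration of the conversions themselves, written for this paper; it says nothing about how vstorm Connect actually implements them, and it assumes a JDK that ships the extended IBM1047 EBCDIC charset.

```java
import java.nio.charset.Charset;

/** Illustrates the two conversions mentioned above: EBCDIC text and packed-decimal (BCD) fields. */
public class MainframeFieldDecoder {

    // IBM-1047 is a common z/OS EBCDIC code page; requires a JDK with the extended charsets.
    private static final Charset EBCDIC = Charset.forName("IBM1047");

    /** Converts an EBCDIC-encoded text field to a Java (Unicode) string. */
    public static String decodeText(byte[] field) {
        return new String(field, EBCDIC).trim();
    }

    /** Decodes a COMP-3 packed-decimal field: two digits per byte, sign in the last nibble. */
    public static long decodePackedDecimal(byte[] field) {
        long value = 0;
        for (int i = 0; i < field.length; i++) {
            int high = (field[i] >> 4) & 0x0F;
            int low = field[i] & 0x0F;
            if (i < field.length - 1) {
                value = value * 10 + high;
                value = value * 10 + low;
            } else {
                value = value * 10 + high;        // last byte: high nibble is the final digit
                if (low == 0x0D) value = -value;  // 0xD marks a negative value; 0xC or 0xF positive
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // "HELLO" in EBCDIC (code page 1047) and the packed decimal -12345 (0x12 0x34 0x5D).
        byte[] text = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        byte[] packed = {0x12, 0x34, 0x5D};
        System.out.println(decodeText(text));            // HELLO
        System.out.println(decodePackedDecimal(packed)); // -12345
    }
}
```

Doing this for every field of every record is exactly the kind of repetitive work that copybook-aware ingestion tooling is meant to hide from the end user.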

vstorm Enterprise: secure, easy, agile, efficient analysis of mainframe data

This brief review of the technical nuts and bolts of the vstorm Enterprise offering just scratches the surface of the true value proposition behind Veristorm's offering; being able to run Hadoop directly on System z with vstorm Enterprise addresses some of the most critical needs of mainframe shops.

A secure pipe for data

The mainframe is renowned for its ability to keep enterprise-sensitive and customer-sensitive data safe and secure. The data contained in virtually all mainframe files and logs has the same degree of sensitivity as operational data, and therefore requires the same governance and security models that are applied to operational data. Transmitting such sensitive data over public networks, to racks of commodity servers, breaks those governance and security models and puts data at risk of compromise. Veristorm recognized the need for a solution that kept data within the confines of the mainframe in order to meet compliance challenges without adding the complexity of coordinating multiple governance and security zones or worrying about firewalls.

vstorm Enterprise is fully integrated with the z/OS security manager, RACF, meaning that users of the product must log in with valid RACF credentials, and will only be able to access z/OS data to which they are already authorized via z/OS security. There is no need to establish unique and potentially inconsistent credentials to use vstorm Enterprise; security is built in, right out of the box.

Having the Hadoop ecosystem remain on System z maintains mainframe security over the data and simplifies compliance with enterprise data governance controls. Not only does the data never leave the box; streaming from source to target occurs over a secure channel using the hardware crypto facilities of System z. Keeping mainframe data on the mainframe is a critical requirement for many clients, and zdoop enables Hadoop analysis of mainframe data without that data ever leaving the security zone of the mainframe.

Easy-to-use ingestion engine

The developers at Veristorm have a long history of managing a wide variety of mainframe file types. They saw the need to simplify the transfer of data from z/OS to Hadoop without the end user having to be concerned with outdated file types, COBOL copybooks, or code page conversions. They were also very sensitive to the need for data to be transferred efficiently without adding load to the z/OS environment (which can drive up software licensing costs). The inclusion of vstorm Connect alongside the Hadoop distribution enables the efficient, quick and easy ingestion of z/OS data into Hadoop. And in addition to the security considerations of keeping the data on-platform, time and cost are limited by avoiding the network lag and expense of moving data off board.

Templates for agile deployment

One of the hardest tasks in managing a Hadoop analysis project is sizing the clusters. Underestimate, and the project must be put on hold until capacity is added. Commodity servers may be cheap, but they still have to be procured, inserted into the data center, configured, loaded with software and made known to the Hadoop management system. Overestimate, and unused capacity sits idle, negating the perceived cost benefit of using commodity infrastructure in the first place. This is not a very agile or friendly process; if you're in an environment where your analytics needs are likely to change frequently, you will struggle to keep capacity in line with processing requirements.

vstorm Enterprise, running on the mainframe, executes within a fully virtualized environment managed by one of the industry's most efficient Linux hypervisors. Better still, it offers out-of-the-box templates so that new virtual servers can be added to the Hadoop cluster with just a few basic commands (or via automation software). This is a true cloud environment where your users can obtain and return capacity on demand to suit their needs. vstorm Enterprise enables Hadoop-as-a-service for the analysis of mainframe data.

Mainframe efficiencies

Security, simplicity, and agility are the true hallmark value propositions of vstorm Enterprise for analyzing mainframe data. But there are also some basic efficiencies that arise simply from running Hadoop in a scale-up (mainframe) rather than scale-out (commodity) environment. Consider that when processors are more efficient, fewer of them are required. This can lead to lower licensing costs, as most software and software maintenance is priced per processor; fewer processors mean a lower software bill. Fewer processors also mean fewer physical machines, which can lead to more efficient energy and space utilization as well as lower management and maintenance costs. And you may be able to configure your cluster with fewer, more powerful nodes, which leads to faster results.

As mentioned earlier, the Hadoop design point anticipates inexpensive infrastructure. To compensate for the unreliability of commodity storage and compute nodes, HDFS by default triplicates all data for availability purposes. At some volume, all of this extra storage and associated maintenance undercuts the economics of using Hadoop on commodity systems. Running Hadoop with the quality of a mainframe underneath it can prove to be more economical in the long run, because there is very low risk in running with an HDFS replication factor of 1, leading to up to 60% lower storage requirements. And it can greatly simplify I/O planning, since bottlenecks would no longer be such an issue.

And finally, considering that offloading data to a commodity server farm costs both mainframe MIPS as well as network expense [viii], the elimination of data movement off board further makes the case for the economics of a mainframe Hadoop deployment.

And more

vstorm Enterprise also includes built-in visualization tools that allow the data in HDFS and Hive to be graphed and displayed in a wide variety of formats. And of course, any analytics or visualization tool can work with the data stored on System z simply by pointing it to the HDFS or Hive URLs.
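On the replication point above: the HDFS replication factor is an ordinary Hadoop setting rather than anything specific to vstorm Enterprise. A minimal sketch of lowering it, both for newly written files and for an existing data set, follows; the path is hypothetical, and the cluster-wide default would normally be set via dfs.replication in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Lowers HDFS replication when the underlying storage is already highly available. */
public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created by this client will carry a single replica;
        // the cluster-wide default is dfs.replication in hdfs-site.xml (normally 3).
        conf.set("dfs.replication", "1");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Adjust an already-ingested data set (path is hypothetical).
            fs.setReplication(new Path("/data/trades"), (short) 1);
        }
    }
}
```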

Real clients, real needs

Hadoop on the mainframe is great for analyzing mainframe logs and non-relational data, and there are many use cases where a lot of insights can be gleaned from these monolithic data sources. But some of the most interesting use cases will blend sensitive mainframe data with non-mainframe data in order to enhance results. Earlier we discussed how DB2 for z/OS connectors to BigInsights could be used to pull insights from an external Hadoop cluster back into DB2 to enhance traditional processing using relational tools. But what about those situations that don't involve DB2 for z/OS, or where Hadoop is a better tool for processing the entire data set? Pushing mainframe data to an external Hadoop cluster raises the security, complexity, and agility issues previously described. A better approach in these cases is to keep the mainframe data on-platform and pull in what is needed to complete the analysis.

Both Veristorm and IBM have engaged with many mainframe clients to understand their big data processing needs and gather requirements; vstorm Enterprise was built in direct response to these client needs. The following scenarios are indicative of the ways in which mainframe clients are looking to improve business results by using Hadoop to analyze mainframe data.

Detecting fraud

Fraud is a huge problem for any organization that makes payments: banks (stolen card information), property and casualty insurers (staged accidents), health plan providers (excessive billing) and government tax agencies (ineligible deductions) are just a few examples where fraudulent payments are authorized every day. For many mainframe clients, payment systems data is maintained in relational databases, and IBM offers solutions that embed predictive analytics directly into these databases to help identify potentially fraudulent payments in real time, before they are made [ix]. However, it is not always the case that all data required to make a determination of fraud exists in relational form. There are still many, many non-relational databases and file systems in use for core payment systems. And even when payment data is held in relational databases, results can be improved by augmenting that data with insights derived from non-mainframe sources. Loading all relevant z/OS-based data into a Hadoop cluster on the mainframe, supplemented with non-mainframe sources as appropriate, enables comprehensive fraud analysis to be added to the payment process without exposing sensitive data to the outside world (and inadvertently opening the door to more fraud!).

Enhancing insurance policy enforcement

Auto insurance companies are constantly looking for creative ways to tailor policies that reward drivers who exhibit safe driving habits or specific behaviors. One insurer issues policies that come with reduced premiums for drivers who keep their mileage under a specified threshold. But how can such policy conditions be enforced?

In the state in which this insurer operates, every time a vehicle is presented for emissions testing the mileage and VIN are stored in a publicly accessible database. The insurer maintains sensitive customer and policy data securely on their mainframe. Their goal is to use VINs to match the public mileage data to their private policy data and analyze the combined data set for policy conformity. This is a great scenario for vstorm Enterprise, which can load the mainframe policy data into a mainframe-based Hadoop cluster while also pulling in the public data using included JDBC connectors. The mainframe data is not exposed to potential threats, and is augmented with the information required for policy enforcement. This scenario can also be expanded to include telematics, using data from devices embedded in cars as well as other sources, such as GPS or navigation systems, to further refine the parameters of a policy.

Reducing healthcare costs

In the U.S. healthcare system, in-hospital care is the most expensive option. While most hospital stays are unavoidable, short-term readmissions are often due to preventable causes. For example, statistics show that one in five elderly patients is back in the hospital within 30 days of leaving; very often, the patient is readmitted because they did not understand instructions, take the proper medications, or seek out the necessary follow-up care [x]. The cost of unnecessary re-admittance is staggering, estimated at $17 billion for Medicare patients alone. For this reason, Medicare imposes stiff penalties on hospitals that readmit a patient within 30 days, to incent these hospitals to focus on the problem.

Hospital records are often kept in mainframe relational databases, so some degree of traditional analysis can be made to identify patients at high risk of re-admittance (looking at the patient's admission history, for example). But other factors can deliver a more accurate prediction of risk: for example, frequent changes of home address, absence of emergency contacts, or lack of living relatives could signal that the support structure is not in place for this patient to receive the help needed to stay out of the hospital. Much of this sort of data can be gleaned from outside sources.

HIPAA regulations are very, very strict concerning patient data; copying that data to multiple platforms for analysis can create a regulatory nightmare. An on-mainframe Hadoop cluster would keep those patient records safely on the mainframe, mixing in unstructured data from outside sources as necessary to enable a complete and more accurate analysis of re-admittance risk.

Improving claims response time

Earlier in this paper we pointed out how application logs can help identify bottlenecks in processes that are actually composed of many interacting services. The processing of insurance claims is an example of such a business application.

Since health insurance is a very competitive business, many health plan providers want to improve customer service by reducing the response time for processing claims. Aside from the claims processing system itself, likely using a relational database to store claim details, there are a number of other business systems that play a part in servicing the claim. For example, if a claimant had searched for information from a self-service web portal, weblogs would identify browsing patterns and the type of information that was sought. If a call center had been contacted, logs would provide the times and numbers of calls made as well as the call center agent's notes. There is a fair amount of digital exhaust surrounding the claims system that could be useful in analyzing how the claims process is performing.

Combining relational and machine data, and analyzing in aggregate, provides the best opportunity for matching actual results to business metrics, and helps improve the end-to-end process by identifying where claimants might be struggling to find information. Logs are as sensitive as client data, so keeping all relevant data on the mainframe for analysis via an on-platform Hadoop cluster is the most secure option.

Putting vstorm Enterprise to the test

The vstorm Enterprise value proposition (secure, easy, agile, efficient analysis of mainframe data) is compelling and clearly meets the needs of the use cases that we have discussed directly with clients. But still, we are often asked if the mainframe can really deliver this value with software that was, by conventional wisdom, designed for commodity hardware arrays. The only way to know for sure is to put the product through its paces and see how well the yellow elephant can dance on the mainframe. Veristorm and the IBM Poughkeepsie Performance Lab partnered to run a series of tests designed to prove that vstorm Enterprise does indeed meet or exceed the performance expectations of mainframe users.

Table stakes: executing the expected stress tests

A benchmark is usually defined as a standard or point of reference against which things may be compared or assessed. Computer system vendors have long sought to develop and promote performance benchmarks that would invite direct comparisons of standardized workloads. While a noble idea in theory, computer system benchmarks are notoriously difficult to use effectively in practice because the workloads rarely match conditions that routinely arise in production environments. And there is so much latitude in these tests that most published results are obtained on highly tuned and unrealistic (and sometimes not commercially available) systems. For example, many mainframers will recall the nuclear arms race of the 1990s and 2000s as vendors madly competed to post top marks on the Transaction Processing Performance Council's benchmarks for evaluating transactional and decision support systems. The TPC-x benchmarks became more useful to vendors as marketing tools than to clients for useful evaluations.

The focus of the vstorm Enterprise testing was to demonstrate something useful (proof of the mainframe's efficient, linear scalability for Hadoop workloads) and not to post the best absolute numbers. After all, what's more important in a production environment: being able to handle the daily challenges of an ever-changing set of requirements, or knowing that your platform can boast of the best test scores?

The configuration

All testing was done in the Poughkeepsie lab with source data on z/OS and the target HDFS on Linux for System z on the same physical machine, a zEnterprise EC12. Hadoop nodes were configured as System z Linux virtual machines on top of the z/VM hypervisor, and the entire Linux environment was built on dedicated IFL processors. Since the key goal of this testing was demonstrating linear scalability, results were measured on configurations ranging from 2 to 40 IFLs. When you consider how powerful the current family of zEnterprise processors is, this represents quite a significant range of scale.

The tests

The Apache Hadoop distribution contains a set of benchmarks and stress tests designed to give a level of confidence that the Hadoop framework is running to expectations. The Veristorm-IBM team put vstorm Enterprise through its paces using these standard tests.

TestDFSIO

Since the performance of a Hadoop installation is heavily dependent on the efficiency of data reads, the TestDFSIO benchmark (a read and write test for the HDFS) gives a good indication of how fast a cluster is in terms of I/O. TestDFSIO was run against data sets of 500 GB, 1 TB and 5 TB, and data rates remained consistently strong despite the variety of the load.

NameNode benchmark (NNBench)

In Hadoop, the NameNode is the core of the HDFS system, keeping track of all files and where the file blocks are stored. As such it is a prime cause of bottlenecks. The NNBench tool is used to perform load testing of the NameNode by generating lots of small HDFS requests. NNBench is a test whose success is measured simply by running to completion, and no issues were exposed in the vstorm Enterprise cluster.

MapReduce benchmark (MRBench)

It's useful to test a Hadoop configuration's ability to handle both small and large jobs. The MRBench tool executes a small job a number of times in a tight loop to see if the system is responsive and efficient. As with NNBench, the purpose of MRBench is simply to validate that a system is running properly, and vstorm Enterprise exited the testing with no issues.

Terasort

Terasort is the Hadoop world's equivalent of the TPC benchmarks, where organizations often go through great pains to achieve bragging rights by posting the lowest number. But what does a number mean to you, and the real problems you need to solve? The benchmark itself is a fairly simple workload (complete the sorting of 1 TB of randomly generated data) designed to exercise the ability to execute a highly parallel analysis of a large data set. As such, it provides no useful information on workload variability or scale. Our goal in running Terasort was to test scale: observing how efficiently applying more processing power to the problem would improve results. We ran this test on configurations from 3 to 40 IFLs and were able to report better-than-linear results; the 40 IFL system contained 13x more processors but improved the sort time by over 18x.

But what can vstorm really do? The scenario

Running the standard benchmarks and stress tests demonstrated that Hadoop does indeed scale quite nicely on the mainframe; all tests met or exceeded expectations. But in order to generate a more meaningful measure, Veristorm wanted to exercise vstorm Enterprise on a workload that mimicked the type of real-world problems that our clients have been sharing with us, and to do it on a modestly configured system that was in line with what we would expect these clients to find acceptable for a Hadoop environment on the mainframe.

The performance test team configured a 2 IFL system, pulled 2 billion stock trading records from the New York Stock Exchange into a DB2 for z/OS database, and wrote a Pig job to analyze this trading data for relevant metrics. A test was then designed to measure how long it took to extract the data from DB2 for z/OS, stream it to Linux on System z for ingestion into the zdoop HDFS, and perform the analysis.

The 2 billion record database was extracted, streamed to the HDFS, and analyzed in an end-to-end process that lasted 2 hours, using the power of 2 IFLs. Thus, the 2-2-2 benchmark was born! As we've been socializing this result with clients, we're finding that 2-2-2 represents a very effective metric for the kinds of mainframe data that they want to analyze with Hadoop. In this test the database consisted of 100 byte records, representing an HDFS load of approximately 200 GB of data: a mid-size load as far as Hadoop goes, but perfectly in line with what our clients want to do. To put this in perspective, consider other examples from a variety of industries:

- According to the US Bureau of Transportation Statistics for 2009 there are million registered passenger vehicles. Two billion records provide more than enough capacity to satisfy the needs of the policy enforcement scenario outlined earlier in this paper.
- A leading credit card provider in the United States indicated to us that they process an average of 200 million transactions a day. These tend to be very small transactions in terms of data used; two billion records can hold a week's worth of data with room to spare.
- At any point in time, there are five or six million shipping containers on cargo ships sailing across the world's oceans. How much cargo analysis could be performed in two billion records?
- There are approximately 51.4 million in-patient surgeries per year in the United States. Even if the data around each record is dense, 200 GB is more than sufficient capacity to analyze an entire year of surgeries.

As was mentioned earlier, one of the hardest tasks in managing Hadoop systems is sizing the clusters. The advantage of using the System z platform for performing Hadoop analysis is its ability to scale linearly and add capacity on demand, two factors that greatly simplify this problem. Did you plan for two billion records, but unexpectedly need to process four billion? Turn on a few more IFLs, on demand. Is the capacity sufficient, but you need to cut the processing time in half? Turn on a few more IFLs, on demand. And turn them off when you no longer need them. System z provides a very dynamic environment for the fluid demands of the real world. The 2-2-2 measure, along with measured linear scalability, means that the mainframe can efficiently meet your needs whether they are 4-4-2, 2-4-1, 2-1-4, or just about any combination.

How would it feel to have an elephant on your mainframe?

Veristorm's vstorm Enterprise is a mainframe solution that enables secure, easy, agile, efficient analysis of mainframe data. You might think that a large yellow elephant would be too much for a mainframe to handle, but the testing we conducted shows that the mainframe is indeed more than up to the task and can handle a broad range of real-world problems quite nicely.

Hadoop as part of a hybrid transaction and analytics platform

In this paper we focused on the Hadoop analysis of non-relational data that originates on the mainframe, but this is just one aspect of the overall System z strategy. As the holder of most of the world's transactional data, and a significant portion of related logs and files, the mainframe is rather uniquely positioned to deliver an end-to-end information governance platform that embraces both traditional data sources as well as new sources such as social, text, sensor and stream data. By integrating the best of traditional mainframe processing with emerging technologies, such as the IBM DB2 Analytics Accelerator Netezza-based appliances and Hadoop, a secure zone for sensitive mainframe data can be established that analyzes that data while simultaneously embracing insights from data that originates outside that zone.

This hybrid approach, extending the System z ecosystem with non-System z technology, enables all relevant data to become part of every operation and analysis within the context of a cohesive governance framework. By avoiding the unnecessary movement of data off-platform, it has become possible to perform truly real-time analytics, because decisions are made based on the most accurate data available, not some stale copy. And by integrating these analytics technologies with transactional systems, System z has enabled insights to be injected directly into operational decision processes, providing analytics the same business-critical support that operational systems enjoy today.

So are you ready to put the elephant on your mainframe?

More information

About Veristorm

Veristorm was founded with a simple vision: enable mainframe users to securely integrate their most critical data. Our vstorm Enterprise software enables secure integration of your z/OS data into any data analysis platform. With ever-expanding partnerships to new platforms, we help create a data integration hub with your most important data securely maintained. By facilitating faster, easier integration of data across platforms, we're helping enterprises gain more meaningful insights from analytics, right-size and lower the costs of data processing, and apply new skills and technologies to the challenges of a data-driven organization.

Veristorm has created many business partnerships with the leading analytics and application management IT firms. Our team has experience in startups, cloud, and the IBM ecosystem. We stand by our mission to deliver the best Big Data management platform possible. We are privately held and located in Santa Clara, CA.

To learn more about Veristorm:
To learn more about vstorm Enterprise:

About IBM

IBM is a global technology and innovation company headquartered in Armonk, NY. It is the largest technology and consulting employer in the world, with more than 400,000 employees serving clients in 170 countries. IBM offers a wide range of technology and consulting services; a broad portfolio of middleware for collaboration, predictive analytics, software development and systems management; and the world's most advanced servers and supercomputers. Utilizing its business consulting, technology and R&D expertise, IBM helps clients become "smarter" as the planet becomes more digitally interconnected.

To learn more about IBM:
To learn more about Linux on System z:


More information

Energy Management in a Cloud Computing Environment

Energy Management in a Cloud Computing Environment Hans-Dieter Wehle, IBM Distinguished IT Specialist Virtualization and Green IT Energy Management in a Cloud Computing Environment Smarter Data Center Agenda Green IT Overview Energy Management Solutions

More information

DOAG 2014. 18.November 2014. Hintergrund. Oracle Mainframe Datanbanken für extreme Anforderungen

DOAG 2014. 18.November 2014. Hintergrund. Oracle Mainframe Datanbanken für extreme Anforderungen DOAG 2014 18.November 2014 Hintergrund zu Oracle Mainframe Datanbanken für extreme Anforderungen Dr. Manfred Gnirss Senior IT Specialist IBM Client Center, IBM Germany Lab gnirss@de.ibm.com Die folgenden

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

IBM Software Delivering trusted information for the modern data warehouse

IBM Software Delivering trusted information for the modern data warehouse Delivering trusted information for the modern data warehouse Make information integration and governance a best practice in the big data era Contents 2 Introduction In ever-changing business environments,

More information

IMS Data Integration with Hadoop

IMS Data Integration with Hadoop Data Integration with Hadoop Karen Durward InfoSphere Product Manager 17/03/2015 * Technical Symposium 2015 z/os Structured Data Integration for Big Data The Big Data Landscape Introduction to Hadoop What,

More information

Delivering Real-World Total Cost of Ownership and Operational Benefits

Delivering Real-World Total Cost of Ownership and Operational Benefits Delivering Real-World Total Cost of Ownership and Operational Benefits Treasure Data - Delivering Real-World Total Cost of Ownership and Operational Benefits 1 Background Big Data is traditionally thought

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Taking control of the virtual image lifecycle process

Taking control of the virtual image lifecycle process IBM Software Thought Leadership White Paper March 2012 Taking control of the virtual image lifecycle process Putting virtual images to work for you 2 Taking control of the virtual image lifecycle process

More information

IBM Analytics. Just the facts: Four critical concepts for planning the logical data warehouse

IBM Analytics. Just the facts: Four critical concepts for planning the logical data warehouse IBM Analytics Just the facts: Four critical concepts for planning the logical data warehouse 1 2 3 4 5 6 Introduction Complexity Speed is businessfriendly Cost reduction is crucial Analytics: The key to

More information

The Future of Data Management with Hadoop and the Enterprise Data Hub

The Future of Data Management with Hadoop and the Enterprise Data Hub The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees

More information

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances IBM Software Business Analytics Cognos Business Intelligence IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances 2 IBM Cognos 10: Enhancing query processing performance for

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

Cloud Computing with xcat on z/vm 6.3

Cloud Computing with xcat on z/vm 6.3 IBM System z Cloud Computing with xcat on z/vm 6.3 Thang Pham z/vm Development Lab thang.pham@us.ibm.com Trademarks The following are trademarks of the International Business Machines Corporation in the

More information

Cloudera Enterprise Data Hub in Telecom:

Cloudera Enterprise Data Hub in Telecom: Cloudera Enterprise Data Hub in Telecom: Three Customer Case Studies Version: 103 Table of Contents Introduction 3 Cloudera Enterprise Data Hub for Telcos 4 Cloudera Enterprise Data Hub in Telecom: Customer

More information

Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise

Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise Introducing Unisys All in One software based weather platform designed to reduce server space, streamline operations, consolidate

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

can you effectively plan for the migration and management of systems and applications on Vblock Platforms?

can you effectively plan for the migration and management of systems and applications on Vblock Platforms? SOLUTION BRIEF CA Capacity Management and Reporting Suite for Vblock Platforms can you effectively plan for the migration and management of systems and applications on Vblock Platforms? agility made possible

More information

Elastic Private Clouds

Elastic Private Clouds White Paper Elastic Private Clouds Agile, Efficient and Under Your Control 1 Introduction Most businesses want to spend less time and money building and managing IT infrastructure to focus resources on

More information

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP selects SAP HANA to improve the speed of business analytics with IBM and SAP Founded in 1806, is a global consumer products company which sells nearly $17 billion annually in personal care, home care,

More information

Stella-Jones takes pole position with IBM Business Analytics

Stella-Jones takes pole position with IBM Business Analytics Stella-Jones takes pole position with IBM Faster, more accurate reports, budgets and forecasts support a rapidly growing business Overview The need Following several key strategic acquisitions, Stella-Jones

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform David Lawler, Oracle Senior Vice President, Product Management and Strategy Paul Kent, SAS Vice President, Big Data What

More information

Fleet Optimization with IBM Maximo for Transportation

Fleet Optimization with IBM Maximo for Transportation Efficiencies, savings and new opportunities for fleet Fleet Optimization with IBM Maximo for Transportation Highlights Integrates IBM Maximo for Transportation with IBM Fleet Optimization solutions Offers

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

Session Title: Cloud Computing 101 What every z Person must know

Session Title: Cloud Computing 101 What every z Person must know 2009 System z Expo October 5 9, 2009 Orlando, FL Session Title: Cloud Computing 101 What every z Person must know Session ID: ZDI08 Frank J. De Gilio - degilio@us.ibm.com 2 3 View of Cloud Computing Application

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform: Creating an Integrated, Optimized, and Secure Enterprise Data Platform: IBM PureData System for Transactions with SafeNet s ProtectDB and DataSecure Table of contents 1. Data, Data, Everywhere... 3 2.

More information

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper Offload Enterprise Data Warehouse (EDW) to Big Data Lake Oracle Exadata, Teradata, Netezza and SQL Server Ample White Paper EDW (Enterprise Data Warehouse) Offloads The EDW (Enterprise Data Warehouse)

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite

IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite Streamline test-data management and deliver reliable application upgrades and enhancements Highlights Apply test-data management

More information

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics

More information

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

An Oracle White Paper August 2011. Oracle VM 3: Application-Driven Virtualization

An Oracle White Paper August 2011. Oracle VM 3: Application-Driven Virtualization An Oracle White Paper August 2011 Oracle VM 3: Application-Driven Virtualization Introduction Virtualization has experienced tremendous growth in the datacenter over the past few years. Recent Gartner

More information

SOLUTION BRIEF BIG DATA MANAGEMENT. How Can You Streamline Big Data Management?

SOLUTION BRIEF BIG DATA MANAGEMENT. How Can You Streamline Big Data Management? SOLUTION BRIEF BIG DATA MANAGEMENT How Can You Streamline Big Data Management? Today, organizations are capitalizing on the promises of big data analytics to innovate and solve problems faster. Big Data

More information

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,

More information

IBM Software Integrating and governing big data

IBM Software Integrating and governing big data IBM Software big data Does big data spell big trouble for integration? Not if you follow these best practices 1 2 3 4 5 Introduction Integration and governance requirements Best practices: Integrating

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

IBM Analytical Decision Management

IBM Analytical Decision Management IBM Analytical Decision Management Deliver better outcomes in real time, every time Highlights Organizations of all types can maximize outcomes with IBM Analytical Decision Management, which enables you

More information

Setting smar ter sales per formance management goals

Setting smar ter sales per formance management goals IBM Software Business Analytics Sales performance management Setting smar ter sales per formance management goals Use dedicated SPM solutions with analytics capabilities to improve sales performance 2

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

The 3 questions to ask yourself about BIG DATA

The 3 questions to ask yourself about BIG DATA The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.

More information

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances Highlights IBM Netezza and SAS together provide appliances and analytic software solutions that help organizations improve

More information

How Cisco IT Built Big Data Platform to Transform Data Management

How Cisco IT Built Big Data Platform to Transform Data Management Cisco IT Case Study August 2013 Big Data Analytics How Cisco IT Built Big Data Platform to Transform Data Management EXECUTIVE SUMMARY CHALLENGE Unlock the business value of large data sets, including

More information

Fiserv. Saving USD8 million in five years and helping banks improve business outcomes using IBM technology. Overview. IBM Software Smarter Computing

Fiserv. Saving USD8 million in five years and helping banks improve business outcomes using IBM technology. Overview. IBM Software Smarter Computing Fiserv Saving USD8 million in five years and helping banks improve business outcomes using IBM technology Overview The need Small and midsize banks and credit unions seek to attract, retain and grow profitable

More information

GigaSpaces Real-Time Analytics for Big Data

GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information