April 2014

The Elephant on the Mainframe
Using Hadoop to Gain Insights from Mainframe Data
A Joint Point of View from IBM and Veristorm


Contents

The Mainframer's Guide to Hadoop
There's relational data, and there's everything else
Why should mainframe shops be interested in analyzing non-relational data?
  Non-relational data can help IT
  Non-relational data can help the business
What is Hadoop and why is it so efficient at processing everything else?
  A brief technical description of Hadoop
But wait: Hadoop is designed for commodity hardware, right?
  Commodity clusters are not always the most efficient option
  However, it's really about the data
Introducing Veristorm's vstorm Enterprise
  zdoop: the Hadoop framework for System z
  vstorm Connect: populating the Hadoop ecosystem
vstorm Enterprise: secure, easy, agile, efficient analysis of mainframe data
  A secure pipe for data
  Easy-to-use ingestion engine
  Templates for agile deployment
  Mainframe efficiencies
  And more
Real clients, real needs
  Detecting fraud
  Enhancing insurance policy enforcement
  Reducing healthcare costs
  Improving claims response time
Putting vstorm Enterprise to the test
  Table stakes: executing the expected stress tests
  The configuration
  The tests
  But what can vstorm really do? The scenario
How would it feel to have an elephant on your mainframe?
  Hadoop as part of a hybrid transaction and analytics platform
More information
  About Veristorm
  About IBM

The Mainframer's Guide to Hadoop

It's impossible to check your favorite technical news feeds or attend a conference these days without being exposed to several definitions of big data and opinions as to why it's important. Many of these exhortations are tied to Hadoop and its cute little yellow elephant logo. In this paper we will avoid the hype surrounding big data and spare you yet another definition; just bear in mind that we view big data as representing all data that could be used to achieve better business outcomes, as well as the technologies leveraged to manage and analyze it. Our goal is to take a critical look at Hadoop from the mainframe perspective, and examine how new technology available today from Veristorm can help you solve problems that you may be struggling with. There is, indeed, a place for the elephant on the mainframe.

There's relational data, and there's everything else

The mainframe is the undisputed king of transactional data. Depending on whose opinion you read, anywhere from 60% to 80% of all the world's transactional data is said to reside on the mainframe. Since transactions are, by definition, highly repeatable entities, the past fifty years have seen a number of breakthrough improvements in the way the data underneath these transactions is managed. For the past couple of decades, the gold standard has been the relational database. The art of managing transactional data in relational databases has been perfected on the mainframe.

Similarly, the art of relational data analysis has also been perfected on the mainframe. It is a near certainty that if you are heavily invested in relational data stores for your operational data, you are also heavily invested in data warehouses, data marts, and the like to efficiently gain insights from that data. Mainframers know how to collect, manage, and analyze their relational data.

But in reality this relational data only represents a fraction of all the data held on the mainframe. Long-time mainframe shops almost certainly have significant data assets stored in record-oriented file management systems, such as VSAM, that predate the advent of the relational database. A significant amount of modern data is stored in XML format. And a huge, largely untapped source of data comes in the form of unstructured or semi-structured data that has been dubbed data exhaust [i]: the system logs, application logs, event logs and similar data that is routinely generated by the day-to-day activities of operational systems.

Is there any value in analyzing this non-relational data and data exhaust, and if so, what's the best way to get at it? One only needs to look at recent surveys and studies to see that there is currently a very strong desire to unlock insights from all that non-relational data held by the mainframe. For example, Gartner's 2013 big data adoption study [ii] revealed that transactional sources are the dominant data types analyzed in big data initiatives, which one would expect given all the historical focus on analyzing this data. But log data came in a strong second place in this study, showing that there is indeed value to be gained from analyzing this data exhaust.

Why should mainframe shops be interested in analyzing non-relational data?

The Gartner study shows a strong interest in non-relational analysis, but why, and what does this mean to mainframe shops? We'll examine some specific use cases later in this paper; for now, here are some general ideas for using data exhaust to improve results.

Non-relational data can help IT

It's quite possible that you may already have been exposed to IBM features and products that analyze mainframe data exhaust to help the IT shop improve performance, optimize resources, and take pre-emptive action to avoid problems. For example:

- The IBM System z Advanced Workload Analysis Reporter (IBM zAware, a feature of the IBM zEnterprise EC12 and IBM zEnterprise BC12) monitors mainframe data exhaust to help identify unusual system behavior in near real time and improve problem determination, leading to an overall improvement in availability of systems.
- IBM Capacity Management Analytics [iii] is an advanced analytics solution designed specifically to help manage and predict consumption of IBM zEnterprise infrastructure resources.
- IBM SmartCloud Analytics - Log Analysis [iv] analyzes mainframe data exhaust to help identify, isolate and resolve a variety of IT problems.

These are all complete solutions with pre-built templates and reports that target specific use cases and scenarios. If you need to gain insights from your log data that these solutions were not set up to handle (or cannot be extended to handle), building your own Hadoop cluster for your specific needs can be the next best thing.

Non-relational data can help the business

Non-relational data analysis is not just for improving IT. A myriad of business applications create their own data exhaust, and mining these logs and files can help improve business results.

Consider business applications developed as part of a Service Oriented Architecture (SOA). SOA applications are composed of several, sometimes many, discrete services interacting with each other. If all of these services do not interact properly, business results will suffer: orders may be lost, inquiries may go unanswered, customers may lose patience and take their business elsewhere. Isolating the cause of failure or sluggish performance in any given business application can be quite difficult due to the number of moving parts. But these services emit sufficient log and trace data exhaust that a rigorous analysis can help isolate and rectify problems. Consolidating these logs into a Hadoop cluster for analysis can be the ideal means of improving results and resolving issues.

What is Hadoop and why is it so efficient at processing everything else?

The traditional data analysis systems that we have built to analyze mainframe data are very, very good at producing insights when questions are well defined; for example, looking at our sales data for the last quarter, which territories are underperforming based on historical norms? A little Structured Query Language (SQL) and a trip to the warehouse will easily produce the desired results.

What's come to be known as big data analysis begins without any such preconceived questions; it's more focused on sifting through mounds of data in an exploratory fashion, performing iterative analysis to try to identify patterns. In other words, we don't know in advance what questions will produce useful insights. The traditional structured warehouse analytics approach is just not set up for this kind of analysis. Although advances have been made in using parallel technologies to greatly accelerate queries in traditional analytics systems (see, for example, the IBM DB2 Analytics Accelerator for z/OS [v] built on Netezza technology), these systems are still focused on answering structured inquiries. They are not necessarily well suited for performing exploratory analysis against unstructured data. Something different is required. Many people believe that Hadoop is that something different.

A brief technical description of Hadoop

Hadoop is an open-source project from the Apache Software Foundation. At its most basic level, Hadoop is a highly parallelized computing environment coupled with a distributed, clustered file system that is generally viewed to be the most efficient environment for tackling the analysis of unstructured data. It was designed specifically for rapid processing across very large data sets and is ideally suited for discovery analysis (note that Hadoop is by and large batch-oriented and is not well suited for real-time analysis; every technique has its place and Hadoop is not the answer to every problem!).

When dealing with analysis that involves a high level of computation and/or a data set that is very large and/or unpredictable, the most efficient way to tackle that analysis is to divide both the data and the computation across clusters of processing units that all work on bits of the problem in parallel. That's exactly what Hadoop does: it breaks work down into mapper and reducer tasks that manipulate data that has been spread across a parallel cluster.

The Hadoop Distributed File System (HDFS)

HDFS is a distributed file system optimized to support very large files (MBs to TBs). With HDFS, each file is broken down into smaller pieces called blocks that are distributed throughout a cluster of hundreds to thousands of nodes; by distributing data in this fashion, all the nodes can work on a single problem in parallel.
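Before going further into how HDFS manages that data, it may help to see what the mapper and reducer division looks like in practice. The sketch below is a minimal native MapReduce job that counts ERROR records per service across application logs stored in HDFS, the kind of data exhaust analysis described earlier. It is purely illustrative: the log layout, paths, and class names are assumptions made for this paper, not part of any IBM or Veristorm product.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts ERROR log records per service name across log files stored in HDFS. */
public class ErrorsByService {

    /** Map phase: emit (serviceName, 1) for every line whose level field is ERROR. */
    public static class ErrorMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text service = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical log layout: "<timestamp> <serviceName> <level> <message>"
            String[] fields = value.toString().split("\\s+", 4);
            if (fields.length == 4 && "ERROR".equals(fields[2])) {
                service.set(fields[1]);
                context.write(service, ONE);
            }
        }
    }

    /** Reduce phase: sum the counts emitted for each service. */
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "errors-by-service");
        job.setJarByClass(ErrorsByService.class);
        job.setMapperClass(ErrorMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/soa-services
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /reports/errors-by-service
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be packaged as a JAR and submitted to the cluster, with the input and output locations given as HDFS paths.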

Since Hadoop was initially designed with a deployment platform of commodity processing nodes and disks in mind, data replication is built into HDFS by default to enable some degree of fault tolerance. In a Hadoop cluster, every block of data can be replicated across multiple servers; the default is for three copies of the data to be spread across the nodes of the HDFS, and this can be increased or decreased depending on one's confidence in the reliability of the compute nodes and storage medium. This feature of HDFS can also prove to have a downside, as at some point the economics of holding all data in triplicate can outweigh the benefit of using cheap disks. It also increases the overall I/O load on the system, and avoiding I/O bottlenecks becomes a difficult cluster design consideration.

MapReduce jobs

Once the HDFS has been populated with the data to be analyzed, Hadoop offers a programming paradigm called MapReduce to process that data. MapReduce is a two-phase process in which a Map job first performs filtering and sorting of the data, followed by a Reduce job that performs some summary operation on that data. The overall MapReduce framework directs all nodes to perform these operations on the blocks of data they are assigned, in parallel, and results are rolled up into a single result set.

Native Hadoop MapReduce jobs are written in Java and can take some specialized skills to produce. For this reason, a number of additional packages have emerged, such as Hive and Pig, that put a friendlier, SQL-like or procedural flavor on the programming process.

But wait: Hadoop is designed for commodity hardware, right?

Virtually every paper, blog and point of view concerning Hadoop points out that Hadoop was designed to be run in massive clusters of commodity processors backed by inexpensive disk. This is not the definition of a mainframe. So why are we talking about Hadoop in a mainframe paper?

Commodity clusters are not always the most efficient option

Companies like Google, Amazon and eBay have shown that a large-scale enterprise can thrive on commodity hardware. But that doesn't mean that such a deployment model best serves every business model. IBM, for example, found it far more efficient to replace its distributed analytics servers with a highly virtualized System z environment. The project, called Blue Insights [vi], saved the company tens of millions of dollars in infrastructure cost, and also delivered the agility to onboard new projects in a fraction of the time it used to take.

But the cost or efficiency of the underlying hardware is actually the least interesting factor in choosing a Hadoop deployment platform. The most critical factor is a thorough understanding of where the data to be analyzed comes from, and how it will be used.

However, it's really about the data

Whenever data is moved from place to place, it is put at risk. And the financial and reputational ramifications of that risk can be huge. When a security flaw exposed card data for some 70 million of its customers towards the end of 2013, Target reported meaningfully weaker-than-expected sales for the critical holiday season, and the ongoing lack of trust in Target stores led to sharp declines in earnings per share, stock price, and just about any other tangible measure of business success. While Target's breach had nothing to do with Hadoop, the lesson learned is that any security exposure can lead to significantly adverse business results; data security should be the first priority in any Hadoop deployment.

When you're mainly interested in non-mainframe data

Although the mainframe controls a huge amount of critical data, it certainly does not control all of it. And this non-mainframe data is growing at exponential rates. Consider social media data. Twitter alone generates 12 terabytes of tweets each and every day. It is not desirable or practical to move such volumes of non-mainframe data into the mainframe for analysis. However, there are use cases where such data is of value. For example, it would be very useful to comb through social media for information that can augment the traditional analytics that you already do against your relational mainframe data.

A solution is already available for such scenarios. IBM's InfoSphere BigInsights [vii] product running, for example, on a Power or IBM System x server, is certainly well suited to ingest and process this sort of data. Connectors available with DB2 for z/OS enable a DB2 process to initiate a BigInsights analytics job against the remote Hadoop cluster, then ingest the result set back into a relational database or traditional data warehouse to augment the mainframe data (there are similar connectors for IMS, but that is not the focus of this paper). Mainframe data never leaves the mainframe, and is augmented and enhanced by data ingested from other sources.

When most of the data you want to analyze originates on the mainframe

In this paper we'd like to focus on those scenarios where all, or the majority, of the data to be analyzed originates on the mainframe. This is where infrastructure considerations have to move well past the commodity or virtualized phase and focus on data governance issues. When the data that you need to analyze sits on the mainframe, even in files and logs, that data must be accessed, ingested and processed within a consistent governance framework that maintains tight security throughout the entire data lifecycle.

To address this need, Veristorm, a company whose principals have decades of experience handling mainframe data, has delivered vstorm Enterprise, a solution for mainframe-based Hadoop processing. It is currently the only supported solution on the market that allows mainframe data to be analyzed with Hadoop while preserving the mainframe security model.

Introducing Veristorm's vstorm Enterprise

Veristorm has delivered the first commercial distribution of Hadoop for the mainframe. The Hadoop distribution, coupled with state-of-the-art data connector technology, makes z/OS data available for processing using the Hadoop big data paradigm without that data ever having to leave the mainframe. Because the entire solution runs in Linux on System z, it can be deployed to low-cost, dedicated mainframe Linux processors, known as the Integrated Facility for Linux (IFL). Plus, by taking advantage of the mainframe's ability to activate additional capacity on demand, as needed, vstorm Enterprise can be used to build out a highly scalable private cloud for big data analysis using Hadoop.

zdoop: the Hadoop framework for System z

vstorm Enterprise includes zdoop, a fully supported implementation of the open source Apache Hadoop project. zdoop supports an ever-expanding open source big data ecosystem. In addition to basic support for MapReduce processing and HDFS, zdoop delivers Hive for developers who prefer to draw on their SQL background, and Pig for a more procedural approach to building applications. zdoop currently supports the following popular packages:

- Apache Hive
- Apache HCatalog
- Apache ZooKeeper
- Apache Pig
- Apache Sqoop
- Apache Flume

Veristorm is continuously identifying new big data ecosystem components to support with their extensible plug-in model, based on customer requirements. For example, Cassandra and MongoDB are under consideration for future support.
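For developers coming from the SQL side, working with Hive feels much like working with any relational database. As an illustration only (the HiveServer2 host, credentials, and table below are hypothetical and not taken from the zdoop documentation), the same error-count analysis shown earlier as a MapReduce job can be expressed as a single query submitted over Hive's standard JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Queries a Hive table over JDBC, the SQL-flavored alternative to hand-written MapReduce. */
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (part of the Apache Hive client libraries).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, database, and credentials are hypothetical.
        String url = "jdbc:hive2://zdoop-node1:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hivedemo", "");
             Statement stmt = conn.createStatement()) {
            // Hive turns this SQL into parallel work across the cluster behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT service, COUNT(*) AS errors " +
                "FROM app_logs WHERE level = 'ERROR' " +
                "GROUP BY service ORDER BY errors DESC LIMIT 20");
            while (rs.next()) {
                System.out.printf("%-30s %d%n", rs.getString("service"), rs.getLong("errors"));
            }
        }
    }
}
```

Behind the scenes, Hive compiles the query into MapReduce work and runs it across the cluster, so the SQL author never has to write mapper or reducer code.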

vstorm Connect: populating the Hadoop ecosystem

Because Hadoop is designed to handle very large data sets, one critical consideration for any Hadoop project is how the data gets ingested into the HDFS. Current techniques to move mainframe data off-platform for analysis typically involve the use of cumbersome, heavyweight federation engines that require a fair amount of skill to set up. To simplify and streamline data ingestion for mainframe users, vstorm Enterprise also includes vstorm Connect, a set of native data collectors that uses a graphical interface to facilitate the movement of z/OS data to the zdoop HDFS.

Out of the box, vstorm Connect provides the ability to ingest a wide variety of z/OS data sources into the HDFS, including:

- DB2
- VSAM
- QSAM
- SMF and RMF
- System log files and operator log files

It eliminates the need for SQL extracts or ETL consulting engagements by providing simple point-and-click data movement from z/OS to HDFS and Hive. Data conversion requirements, such as EBCDIC code page and BCD issues, are handled in-flight without incurring z/OS MIPS costs.

vstorm Connect is not just easy to use; it's fast and economical, too. Data and metadata from the selected source are streamed to the selected target without intermediate staging, eliminating the need for additional storage in z/OS. vstorm Connect's patented parallel streaming technology supports both HiperSockets (the mainframe-unique mechanism for high-speed connectivity between partitions) as well as a standard 10 Gbps network connection. Working with IBM's performance test team in Poughkeepsie, New York, Veristorm has been able to demonstrate transfer rates that measure very close to the theoretical maximum for the connection.
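To give a sense of what "handled in-flight" involves, the sketch below decodes the two field types just mentioned: EBCDIC text and packed-decimal (BCD, or COBOL COMP-3) numbers. It is only an illustration of the conversions themselves, written for this paper; it says nothing about how vstorm Connect actually implements them, and it assumes a JDK that ships the extended IBM1047 EBCDIC charset.

```java
import java.nio.charset.Charset;

/** Illustrates the two conversions mentioned above: EBCDIC text and packed-decimal (BCD) fields. */
public class MainframeFieldDecoder {

    // IBM-1047 is a common z/OS EBCDIC code page; requires a JDK with the extended charsets.
    private static final Charset EBCDIC = Charset.forName("IBM1047");

    /** Converts an EBCDIC-encoded text field to a Java (Unicode) string. */
    public static String decodeText(byte[] field) {
        return new String(field, EBCDIC).trim();
    }

    /** Decodes a COMP-3 packed-decimal field: two digits per byte, sign in the last nibble. */
    public static long decodePackedDecimal(byte[] field) {
        long value = 0;
        for (int i = 0; i < field.length; i++) {
            int high = (field[i] >> 4) & 0x0F;
            int low = field[i] & 0x0F;
            if (i < field.length - 1) {
                value = value * 10 + high;
                value = value * 10 + low;
            } else {
                value = value * 10 + high;        // last byte: high nibble is the final digit
                if (low == 0x0D) value = -value;  // 0xD marks a negative value; 0xC or 0xF positive
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // "HELLO" in EBCDIC (code page 1047) and the packed decimal -12345 (0x12 0x34 0x5D).
        byte[] text = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        byte[] packed = {0x12, 0x34, 0x5D};
        System.out.println(decodeText(text));            // HELLO
        System.out.println(decodePackedDecimal(packed)); // -12345
    }
}
```

Doing this for every field of every record is exactly the kind of repetitive work that copybook-aware ingestion tooling is meant to hide from the end user.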

vstorm Enterprise: secure, easy, agile, efficient analysis of mainframe data

This brief review of the technical nuts and bolts of the vstorm Enterprise offering just scratches the surface of the true value proposition behind Veristorm's offering; being able to run Hadoop directly on System z with vstorm Enterprise addresses some of the most critical needs of mainframe shops.

A secure pipe for data

The mainframe is renowned for its ability to keep enterprise-sensitive and customer-sensitive data safe and secure. The data contained in virtually all mainframe files and logs has the same degree of sensitivity as operational data, and therefore requires the same governance and security models that are applied to operational data. Transmitting such sensitive data over public networks, to racks of commodity servers, breaks those governance and security models and puts data at risk of compromise. Veristorm recognized the need for a solution that kept data within the confines of the mainframe in order to meet compliance challenges without adding the complexity of coordinating multiple governance and security zones or worrying about firewalls.

vstorm Enterprise is fully integrated with the z/OS security manager, RACF, meaning that users of the product must log in with valid RACF credentials, and will only be able to access z/OS data to which they are already authorized via z/OS security. There is no need to establish unique and potentially inconsistent credentials to use vstorm Enterprise; security is built in, right out of the box.

Having the Hadoop ecosystem remain on System z maintains mainframe security over the data and simplifies compliance with enterprise data governance controls. Not only does the data never leave the box; streaming from source to target occurs over a secure channel using the hardware crypto facilities of System z. Keeping mainframe data on the mainframe is a critical requirement for many clients, and zdoop enables Hadoop analysis of mainframe data without that data ever leaving the security zone of the mainframe.

Easy-to-use ingestion engine

The developers at Veristorm have a long history of managing a wide variety of mainframe file types. They saw the need to simplify the transfer of data from z/OS to Hadoop without the end user having to be concerned with outdated file types, COBOL copybooks, or code page conversions. They were also very sensitive to the need for data to be transferred efficiently without adding load to the z/OS environment (which can drive up software licensing costs). The inclusion of vstorm Connect alongside the Hadoop distribution enables the efficient, quick and easy ingestion of z/OS data into Hadoop. And in addition to the security considerations of keeping the data on-platform, time and cost are limited by avoiding the network lag and expense of moving data off board.

Templates for agile deployment

One of the hardest tasks in managing a Hadoop analysis project is sizing the clusters. Underestimate, and the project must be put on hold until capacity is added. Commodity servers may be cheap, but they still have to be procured, inserted into the data center, configured, loaded with software and made known to the Hadoop management system. Overestimate, and unused capacity sits idle, negating the perceived cost benefit of using commodity infrastructure in the first place. This is not a very agile or friendly process; if you're in an environment where your analytics needs are likely to change frequently, you will struggle to keep capacity in line with processing requirements.

vstorm Enterprise, running on the mainframe, executes within a fully virtualized environment managed by one of the industry's most efficient Linux hypervisors. Better still, it offers out-of-the-box templates so that new virtual servers can be added to the Hadoop cluster with just a few basic commands (or via automation software). This is a true cloud environment where your users can obtain and return capacity on demand to suit their needs. vstorm Enterprise enables Hadoop-as-a-service for the analysis of mainframe data.

Mainframe efficiencies

Security, simplicity, and agility are the true hallmark value propositions of vstorm Enterprise for analyzing mainframe data. But there are also some basic efficiencies that arise simply from running Hadoop in a scale-up (mainframe) rather than scale-out (commodity) environment. Consider that when processors are more efficient, fewer of them are required. This can lead to lower licensing costs, as most software and software maintenance is priced per processor; fewer processors mean a lower software bill. Fewer processors also mean fewer physical machines, which can lead to more efficient energy and space utilization as well as lower management and maintenance costs. And you may be able to configure your cluster with fewer, more powerful nodes, which leads to faster results.

As mentioned earlier, the Hadoop design point anticipates inexpensive infrastructure. To compensate for the unreliability of commodity storage and compute nodes, HDFS by default triplicates all data for availability purposes. At some volume, all of this extra storage and associated maintenance undercuts the economics of using Hadoop on commodity systems. Running Hadoop with the quality of a mainframe underneath it can prove to be more economical in the long run, because there is very low risk in running with an HDFS replication factor of 1, leading to up to 60% lower storage requirements. And it can greatly simplify I/O planning, since bottlenecks would no longer be such an issue.

And finally, considering that offloading data to a commodity server farm costs both mainframe MIPS as well as network expense [viii], the elimination of data movement off board further makes the case for the economics of a mainframe Hadoop deployment.

And more

vstorm Enterprise also includes built-in visualization tools that allow the data in HDFS and Hive to be graphed and displayed in a wide variety of formats. And of course, any analytics or visualization tool can work with the data stored on System z simply by pointing it to the HDFS or Hive URLs.
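On the replication point above: the HDFS replication factor is an ordinary Hadoop setting rather than anything specific to vstorm Enterprise. A minimal sketch of lowering it, both for newly written files and for an existing data set, follows; the path is hypothetical, and the cluster-wide default would normally be set via dfs.replication in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Lowers HDFS replication when the underlying storage is already highly available. */
public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created by this client will carry a single replica;
        // the cluster-wide default is dfs.replication in hdfs-site.xml (normally 3).
        conf.set("dfs.replication", "1");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Adjust an already-ingested data set (path is hypothetical).
            fs.setReplication(new Path("/data/trades"), (short) 1);
        }
    }
}
```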

Real clients, real needs

Hadoop on the mainframe is great for analyzing mainframe logs and non-relational data, and there are many use cases where a lot of insights can be gleaned from these monolithic data sources. But some of the most interesting use cases will blend sensitive mainframe data with non-mainframe data in order to enhance results. Earlier we discussed how DB2 for z/OS connectors to BigInsights could be used to pull insights from an external Hadoop cluster back into DB2 to enhance traditional processing using relational tools. But what about those situations that don't involve DB2 for z/OS, or where Hadoop is a better tool for processing the entire data set? Pushing mainframe data to an external Hadoop cluster raises the security, complexity, and agility issues previously described. A better approach in these cases is to keep the mainframe data on-platform and pull in what is needed to complete the analysis.

Both Veristorm and IBM have engaged with many mainframe clients to understand their big data processing needs and gather requirements; vstorm Enterprise was built in direct response to these client needs. The following scenarios are indicative of the ways in which mainframe clients are looking to improve business results by using Hadoop to analyze mainframe data.

Detecting fraud

Fraud is a huge problem for any organization that makes payments: banks (stolen card information), property and casualty insurers (staged accidents), health plan providers (excessive billing) and government tax agencies (ineligible deductions) are just a few examples where fraudulent payments are authorized every day. For many mainframe clients, payment systems data is maintained in relational databases, and IBM offers solutions that embed predictive analytics directly into these databases to help identify potentially fraudulent payments in real time, before they are made [ix]. However, it is not always the case that all data required to make a determination of fraud exists in relational form. There are still many, many non-relational databases and file systems in use for core payment systems. And even when payment data is held in relational databases, results can be improved by augmenting that data with insights derived from non-mainframe sources. Loading all relevant z/OS-based data into a Hadoop cluster on the mainframe, supplemented with non-mainframe sources as appropriate, enables comprehensive fraud analysis to be added to the payment process without exposing sensitive data to the outside world (and inadvertently opening the door to more fraud!).

Enhancing insurance policy enforcement

Auto insurance companies are constantly looking for creative ways to tailor policies that reward drivers who exhibit safe driving habits or specific behaviors. One insurer issues policies that come with reduced premiums for drivers who keep their mileage under a specified threshold. But how can such policy conditions be enforced?

In the state in which this insurer operates, every time a vehicle is presented for emissions testing the mileage and VIN are stored in a publicly accessible database. The insurer maintains sensitive customer and policy data securely on their mainframe. Their goal is to use VINs to match the public mileage data to their private policy data and analyze the combined data set for policy conformity. This is a great scenario for vstorm Enterprise, which can load the mainframe policy data into a mainframe-based Hadoop cluster while also pulling in the public data using included JDBC connectors. The mainframe data is not exposed to potential threats, and is augmented with the information required for policy enforcement. This scenario can also be expanded to include telematics, using data from devices embedded in cars as well as other sources, such as GPS or navigation systems, to further refine the parameters of a policy.

Reducing healthcare costs

In the U.S. healthcare system, in-hospital care is the most expensive option. While most hospital stays are unavoidable, short-term readmissions are often due to preventable causes. For example, statistics show that one in five elderly patients is back in the hospital within 30 days of leaving; very often, the patient is readmitted because they did not understand instructions, take the proper medications, or seek out the necessary follow-up care [x]. The cost of unnecessary re-admittance is staggering, estimated at $17 billion for Medicare patients alone. For this reason, Medicare imposes stiff penalties on hospitals that readmit a patient within 30 days, to incent these hospitals to focus on the problem.

Hospital records are often kept in mainframe relational databases, so some degree of traditional analysis can be made to identify patients at high risk of re-admittance (looking at the patient's admission history, for example). But other factors can deliver a more accurate prediction of risk: for example, frequent changes of home address, absence of emergency contacts, or lack of living relatives could signal that the support structure is not in place for this patient to receive the help needed to stay out of the hospital. Much of this sort of data can be gleaned from outside sources.

HIPAA regulations are very, very strict concerning patient data; copying that data to multiple platforms for analysis can create a regulatory nightmare. An on-mainframe Hadoop cluster would keep those patient records safely on the mainframe, mixing in unstructured data from outside sources as necessary to enable a complete and more accurate analysis of re-admittance risk.

Improving claims response time

Earlier in this paper we pointed out how application logs can help identify bottlenecks in processes that are actually composed of many interacting services. The processing of insurance claims is an example of such a business application.

Since health insurance is a very competitive business, many health plan providers want to improve customer service by reducing the response time for processing claims. Aside from the claims processing system itself, likely using a relational database to store claim details, there are a number of other business systems that play a part in servicing the claim. For example, if a claimant had searched for information from a self-service web portal, weblogs would identify browsing patterns and the type of information that was sought. If a call center had been contacted, logs would provide the times and numbers of calls made as well as the call center agent's notes. There is a fair amount of digital exhaust surrounding the claims system that could be useful in analyzing how the claims process is performing.

Combining relational and machine data, and analyzing in aggregate, provides the best opportunity for matching actual results to business metrics, and helps improve the end-to-end process by identifying where claimants might be struggling to find information. Logs are as sensitive as client data, so keeping all relevant data on the mainframe for analysis via an on-platform Hadoop cluster is the most secure option.

Putting vstorm Enterprise to the test

The vstorm Enterprise value proposition (secure, easy, agile, efficient analysis of mainframe data) is compelling and clearly meets the needs of the use cases that we have discussed directly with clients. But still, we are often asked if the mainframe can really deliver this value with software that was, by conventional wisdom, designed for commodity hardware arrays. The only way to know for sure is to put the product through its paces and see how well the yellow elephant can dance on the mainframe. Veristorm and the IBM Poughkeepsie Performance Lab partnered to run a series of tests designed to prove that vstorm Enterprise does indeed meet or exceed the performance expectations of mainframe users.

Table stakes: executing the expected stress tests

A benchmark is usually defined as a standard or point of reference against which things may be compared or assessed. Computer system vendors have long sought to develop and promote performance benchmarks that would invite direct comparisons of standardized workloads. While a noble idea in theory, computer system benchmarks are notoriously difficult to use effectively in practice because the workloads rarely match conditions that routinely arise in production environments. And there is so much latitude in these tests that most published results are obtained on highly tuned and unrealistic (and sometimes not commercially available) systems. For example, many mainframers will recall the nuclear arms race of the 1990s and 2000s as vendors madly competed to post top marks on the Transaction Processing Performance Council's benchmarks for evaluating transactional and decision support systems. The TPC-x benchmarks became more useful to vendors as marketing tools than to clients for useful evaluations.

The focus of the vstorm Enterprise testing was to demonstrate something useful (proof of the mainframe's efficient, linear scalability for Hadoop workloads) and not to post the best absolute numbers. After all, what's more important in a production environment: being able to handle the daily challenges of an ever-changing set of requirements, or knowing that your platform can boast of the best test scores?

The configuration

All testing was done in the Poughkeepsie lab with source data on z/OS and the target HDFS on Linux for System z on the same physical machine, a zEnterprise EC12. Hadoop nodes were configured as System z Linux virtual machines on top of the z/VM hypervisor, and the entire Linux environment was built on dedicated IFL processors. Since the key goal of this testing was demonstrating linear scalability, results were measured on configurations ranging from 2 to 40 IFLs. When you consider how powerful the current family of zEnterprise processors is, this represents quite a significant range of scale.

The tests

The Apache Hadoop distribution contains a set of benchmarks and stress tests designed to give a level of confidence that the Hadoop framework is running to expectations. The Veristorm-IBM team put vstorm Enterprise through its paces using these standard tests.

TestDFSIO

Since the performance of a Hadoop installation is heavily dependent on the efficiency of data reads, the TestDFSIO benchmark (a read and write test for the HDFS) gives a good indication of how fast a cluster is in terms of I/O. TestDFSIO was run against data sets of 500 GB, 1 TB and 5 TB, and data rates remained consistently strong despite the variety of the load.

NameNode benchmark (NNBench)

In Hadoop, the NameNode is the core of the HDFS system, keeping track of all files and where the file blocks are stored. As such it is a prime cause of bottlenecks. The NNBench tool is used to perform load testing of the NameNode by generating lots of small HDFS requests. NNBench is a test whose success is measured simply by running to completion, and no issues were exposed in the vstorm Enterprise cluster.

MapReduce benchmark (MRBench)

It's useful to test a Hadoop configuration's ability to handle both small and large jobs. The MRBench tool executes a small job a number of times in a tight loop to see if the system is responsive and efficient. As with NNBench, the purpose of MRBench is simply to validate that a system is running properly, and vstorm Enterprise exited the testing with no issues.

Terasort

Terasort is the Hadoop world's equivalent of the TPC benchmarks, where organizations often go through great pains to achieve bragging rights by posting the lowest number. But what does a number mean to you, and the real problems you need to solve? The benchmark itself is a fairly simple workload (complete the sorting of 1 TB of randomly generated data) designed to exercise the ability to execute a highly parallel analysis of a large data set. As such, it provides no useful information on workload variability or scale. Our goal in running Terasort was to test scale: observing how efficiently applying more processing power to the problem would improve results. We ran this test on configurations from 3 to 40 IFLs and were able to report better-than-linear results; the 40 IFL system contained 13x more processors but improved the sort time by over 18x.

But what can vstorm really do? The scenario

Running the standard benchmarks and stress tests demonstrated that Hadoop does indeed scale quite nicely on the mainframe; all tests met or exceeded expectations. But in order to generate a more meaningful measure, Veristorm wanted to exercise vstorm Enterprise on a workload that mimicked the type of real-world problems that our clients have been sharing with us, and to do it on a modestly configured system that was in line with what we would expect these clients to find acceptable for a Hadoop environment on the mainframe.

The performance test team configured a 2 IFL system, pulled 2 billion stock trading records from the New York Stock Exchange into a DB2 for z/OS database, and wrote a Pig job to analyze this trading data for relevant metrics. A test was then designed to measure how long it took to extract the data from DB2 for z/OS, stream it to Linux on System z for ingestion into the zdoop HDFS, and perform the analysis.

The 2 billion record database was extracted, streamed to the HDFS, and analyzed in an end-to-end process that lasted 2 hours, using the power of 2 IFLs. Thus, the 2-2-2 benchmark was born! As we've been socializing this result with clients, we're finding that 2-2-2 represents a very effective metric for the kinds of mainframe data that they want to analyze with Hadoop. In this test the database consisted of 100 byte records, representing an HDFS load of approximately 200 GB of data: a mid-size load as far as Hadoop goes, but perfectly in line with what our clients want to do. To put this in perspective, consider other examples from a variety of industries:

- According to the US Bureau of Transportation Statistics for 2009 there are million registered passenger vehicles. Two billion records provide more than enough capacity to satisfy the needs of the policy enforcement scenario outlined earlier in this paper.
- A leading credit card provider in the United States indicated to us that they process an average of 200 million transactions a day. These tend to be very small transactions in terms of data used; two billion records can hold a week's worth of data with room to spare.
- At any point in time, there are five or six million shipping containers on cargo ships sailing across the world's oceans. How much cargo analysis could be performed in two billion records?
- There are approximately 51.4 million in-patient surgeries per year in the United States. Even if the data around each record is dense, 200 GB is more than sufficient capacity to analyze an entire year of surgeries.

As was mentioned earlier, one of the hardest tasks in managing Hadoop systems is sizing the clusters. The advantage of using the System z platform for performing Hadoop analysis is its ability to scale linearly and add capacity on demand, two factors that greatly simplify this problem. Did you plan for two billion records, but unexpectedly need to process four billion? Turn on a few more IFLs, on demand. Is the capacity sufficient, but you need to cut the processing time in half? Turn on a few more IFLs, on demand. And turn them off when you no longer need them. System z provides a very dynamic environment for the fluid demands of the real world. The 2-2-2 measure, along with measured linear scalability, means that the mainframe can efficiently meet your needs whether they are 4-4-2, 2-4-1, 2-1-4, or just about any combination.

How would it feel to have an elephant on your mainframe?

Veristorm's vstorm Enterprise is a mainframe solution that enables secure, easy, agile, efficient analysis of mainframe data. You might think that a large yellow elephant would be too much for a mainframe to handle, but the testing we conducted shows that the mainframe is indeed more than up to the task and can handle a broad range of real-world problems quite nicely.

Hadoop as part of a hybrid transaction and analytics platform

In this paper we focused on the Hadoop analysis of non-relational data that originates on the mainframe, but this is just one aspect of the overall System z strategy. As the holder of most of the world's transactional data, and a significant portion of related logs and files, the mainframe is rather uniquely positioned to deliver an end-to-end information governance platform that embraces both traditional data sources as well as new sources such as social, text, sensor and stream data. By integrating the best of traditional mainframe processing with emerging technologies, such as the IBM DB2 Analytics Accelerator Netezza-based appliances and Hadoop, a secure zone for sensitive mainframe data can be established that analyzes that data while simultaneously embracing insights from data that originates outside that zone.

This hybrid approach, extending the System z ecosystem with non-System z technology, enables all relevant data to become part of every operation and analysis within the context of a cohesive governance framework. By avoiding the unnecessary movement of data off-platform, it has become possible to perform truly real-time analytics, because decisions are made based on the most accurate data available, not some stale copy. And by integrating these analytics technologies with transactional systems, System z has enabled insights to be injected directly into operational decision processes, providing analytics the same business-critical support that operational systems enjoy today.

So are you ready to put the elephant on your mainframe?

More information

About Veristorm

Veristorm was founded with a simple vision: enable mainframe users to securely integrate their most critical data. Our vstorm Enterprise software enables secure integration of your z/OS data into any data analysis platform. With ever-expanding partnerships to new platforms, we help create a data integration hub with your most important data securely maintained. By facilitating faster, easier integration of data across platforms, we're helping enterprises gain more meaningful insights from analytics, right-size and lower the costs of data processing, and apply new skills and technologies to the challenges of a data-driven organization.

Veristorm has created many business partnerships with the leading analytics and application management IT firms. Our team has experience in startups, cloud, and the IBM ecosystem. We stand by our mission to deliver the best Big Data management platform possible. We are privately held and located in Santa Clara, CA.

To learn more about Veristorm:
To learn more about vstorm Enterprise:

About IBM

IBM is a global technology and innovation company headquartered in Armonk, NY. It is the largest technology and consulting employer in the world, with more than 400,000 employees serving clients in 170 countries. IBM offers a wide range of technology and consulting services; a broad portfolio of middleware for collaboration, predictive analytics, software development and systems management; and the world's most advanced servers and supercomputers. Utilizing its business consulting, technology and R&D expertise, IBM helps clients become "smarter" as the planet becomes more digitally interconnected.

To learn more about IBM:
To learn more about Linux on System z:


More information

Energy Management in a Cloud Computing Environment

Energy Management in a Cloud Computing Environment Hans-Dieter Wehle, IBM Distinguished IT Specialist Virtualization and Green IT Energy Management in a Cloud Computing Environment Smarter Data Center Agenda Green IT Overview Energy Management Solutions

More information

DOAG 2014. 18.November 2014. Hintergrund. Oracle Mainframe Datanbanken für extreme Anforderungen

DOAG 2014. 18.November 2014. Hintergrund. Oracle Mainframe Datanbanken für extreme Anforderungen DOAG 2014 18.November 2014 Hintergrund zu Oracle Mainframe Datanbanken für extreme Anforderungen Dr. Manfred Gnirss Senior IT Specialist IBM Client Center, IBM Germany Lab gnirss@de.ibm.com Die folgenden

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

IBM Software Delivering trusted information for the modern data warehouse

IBM Software Delivering trusted information for the modern data warehouse Delivering trusted information for the modern data warehouse Make information integration and governance a best practice in the big data era Contents 2 Introduction In ever-changing business environments,

More information

IMS Data Integration with Hadoop

IMS Data Integration with Hadoop Data Integration with Hadoop Karen Durward InfoSphere Product Manager 17/03/2015 * Technical Symposium 2015 z/os Structured Data Integration for Big Data The Big Data Landscape Introduction to Hadoop What,

More information

Delivering Real-World Total Cost of Ownership and Operational Benefits

Delivering Real-World Total Cost of Ownership and Operational Benefits Delivering Real-World Total Cost of Ownership and Operational Benefits Treasure Data - Delivering Real-World Total Cost of Ownership and Operational Benefits 1 Background Big Data is traditionally thought

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Taking control of the virtual image lifecycle process

Taking control of the virtual image lifecycle process IBM Software Thought Leadership White Paper March 2012 Taking control of the virtual image lifecycle process Putting virtual images to work for you 2 Taking control of the virtual image lifecycle process

More information

IBM Analytics. Just the facts: Four critical concepts for planning the logical data warehouse

IBM Analytics. Just the facts: Four critical concepts for planning the logical data warehouse IBM Analytics Just the facts: Four critical concepts for planning the logical data warehouse 1 2 3 4 5 6 Introduction Complexity Speed is businessfriendly Cost reduction is crucial Analytics: The key to

More information

The Future of Data Management with Hadoop and the Enterprise Data Hub

The Future of Data Management with Hadoop and the Enterprise Data Hub The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees

More information

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances

IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances IBM Software Business Analytics Cognos Business Intelligence IBM Cognos 10: Enhancing query processing performance for IBM Netezza appliances 2 IBM Cognos 10: Enhancing query processing performance for

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

Cloud Computing with xcat on z/vm 6.3

Cloud Computing with xcat on z/vm 6.3 IBM System z Cloud Computing with xcat on z/vm 6.3 Thang Pham z/vm Development Lab thang.pham@us.ibm.com Trademarks The following are trademarks of the International Business Machines Corporation in the

More information

Cloudera Enterprise Data Hub in Telecom:

Cloudera Enterprise Data Hub in Telecom: Cloudera Enterprise Data Hub in Telecom: Three Customer Case Studies Version: 103 Table of Contents Introduction 3 Cloudera Enterprise Data Hub for Telcos 4 Cloudera Enterprise Data Hub in Telecom: Customer

More information

Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise

Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise Unisys ClearPath Forward Fabric Based Platform to Power the Weather Enterprise Introducing Unisys All in One software based weather platform designed to reduce server space, streamline operations, consolidate

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

can you effectively plan for the migration and management of systems and applications on Vblock Platforms?

can you effectively plan for the migration and management of systems and applications on Vblock Platforms? SOLUTION BRIEF CA Capacity Management and Reporting Suite for Vblock Platforms can you effectively plan for the migration and management of systems and applications on Vblock Platforms? agility made possible

More information

Elastic Private Clouds

Elastic Private Clouds White Paper Elastic Private Clouds Agile, Efficient and Under Your Control 1 Introduction Most businesses want to spend less time and money building and managing IT infrastructure to focus resources on

More information

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP selects SAP HANA to improve the speed of business analytics with IBM and SAP Founded in 1806, is a global consumer products company which sells nearly $17 billion annually in personal care, home care,

More information

Stella-Jones takes pole position with IBM Business Analytics

Stella-Jones takes pole position with IBM Business Analytics Stella-Jones takes pole position with IBM Faster, more accurate reports, budgets and forecasts support a rapidly growing business Overview The need Following several key strategic acquisitions, Stella-Jones

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform David Lawler, Oracle Senior Vice President, Product Management and Strategy Paul Kent, SAS Vice President, Big Data What

More information

Fleet Optimization with IBM Maximo for Transportation

Fleet Optimization with IBM Maximo for Transportation Efficiencies, savings and new opportunities for fleet Fleet Optimization with IBM Maximo for Transportation Highlights Integrates IBM Maximo for Transportation with IBM Fleet Optimization solutions Offers

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

Session Title: Cloud Computing 101 What every z Person must know

Session Title: Cloud Computing 101 What every z Person must know 2009 System z Expo October 5 9, 2009 Orlando, FL Session Title: Cloud Computing 101 What every z Person must know Session ID: ZDI08 Frank J. De Gilio - degilio@us.ibm.com 2 3 View of Cloud Computing Application

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform: Creating an Integrated, Optimized, and Secure Enterprise Data Platform: IBM PureData System for Transactions with SafeNet s ProtectDB and DataSecure Table of contents 1. Data, Data, Everywhere... 3 2.

More information

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper Offload Enterprise Data Warehouse (EDW) to Big Data Lake Oracle Exadata, Teradata, Netezza and SQL Server Ample White Paper EDW (Enterprise Data Warehouse) Offloads The EDW (Enterprise Data Warehouse)

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite

IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite IBM InfoSphere Optim Test Data Management solution for Oracle E-Business Suite Streamline test-data management and deliver reliable application upgrades and enhancements Highlights Apply test-data management

More information

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics

More information

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

An Oracle White Paper August 2011. Oracle VM 3: Application-Driven Virtualization

An Oracle White Paper August 2011. Oracle VM 3: Application-Driven Virtualization An Oracle White Paper August 2011 Oracle VM 3: Application-Driven Virtualization Introduction Virtualization has experienced tremendous growth in the datacenter over the past few years. Recent Gartner

More information

SOLUTION BRIEF BIG DATA MANAGEMENT. How Can You Streamline Big Data Management?

SOLUTION BRIEF BIG DATA MANAGEMENT. How Can You Streamline Big Data Management? SOLUTION BRIEF BIG DATA MANAGEMENT How Can You Streamline Big Data Management? Today, organizations are capitalizing on the promises of big data analytics to innovate and solve problems faster. Big Data

More information

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,

More information

IBM Software Integrating and governing big data

IBM Software Integrating and governing big data IBM Software big data Does big data spell big trouble for integration? Not if you follow these best practices 1 2 3 4 5 Introduction Integration and governance requirements Best practices: Integrating

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

IBM Analytical Decision Management

IBM Analytical Decision Management IBM Analytical Decision Management Deliver better outcomes in real time, every time Highlights Organizations of all types can maximize outcomes with IBM Analytical Decision Management, which enables you

More information

Setting smar ter sales per formance management goals

Setting smar ter sales per formance management goals IBM Software Business Analytics Sales performance management Setting smar ter sales per formance management goals Use dedicated SPM solutions with analytics capabilities to improve sales performance 2

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

The 3 questions to ask yourself about BIG DATA

The 3 questions to ask yourself about BIG DATA The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.

More information

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances Highlights IBM Netezza and SAS together provide appliances and analytic software solutions that help organizations improve

More information

How Cisco IT Built Big Data Platform to Transform Data Management

How Cisco IT Built Big Data Platform to Transform Data Management Cisco IT Case Study August 2013 Big Data Analytics How Cisco IT Built Big Data Platform to Transform Data Management EXECUTIVE SUMMARY CHALLENGE Unlock the business value of large data sets, including

More information

Fiserv. Saving USD8 million in five years and helping banks improve business outcomes using IBM technology. Overview. IBM Software Smarter Computing

Fiserv. Saving USD8 million in five years and helping banks improve business outcomes using IBM technology. Overview. IBM Software Smarter Computing Fiserv Saving USD8 million in five years and helping banks improve business outcomes using IBM technology Overview The need Small and midsize banks and credit unions seek to attract, retain and grow profitable

More information

GigaSpaces Real-Time Analytics for Big Data

GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information