Transcription:

We have a very focused agenda, so let's get right to it. The theme of our session today is automation and how that automation is delivered with ThruPut Manager Automation Edition, or TM AE. We will discuss Capping Capacity in the traditional sense, as it is done today. Then we will talk about Capping Demand, an innovative technique we've developed with TM AE to directly reduce monthly software costs while maintaining performance for your key applications. Finally, we'll talk about Automating Workload Management, a necessary step to enable Capping Demand.

Capacity Management and Workload Management are key techniques that have been in the datacenter for years. Our position is that these functions can deliver huge benefits when automated. While anything can be accomplished with an army of people, machines and money, we believe that automation is the most effective and efficient technique to run a datacenter today. You will see these charts again later in the presentation. The chart on the top left illustrates how Automated Capacity Management can lower software costs. The chart on the bottom right shows how Automated Workload Management can improve performance by managing utilization and balancing workloads.

Why do datacenters cap anyway? Deciding to use capping means making a conscious decision to lower your capacity levels, and thus to increase your utilization levels. It seems counterintuitive from a performance standpoint.

Of course, this is why. Software is the largest operating expense in the datacenter. While sub-capacity pricing allows billing based on peak consumption rather than total machine capacity, running without caps means a spike in the Rolling Four Hour Average can lead to unacceptably high software bills. Caps provide the ability to control this cost, but with some risk to performance.

Managing your z/OS software costs starts with understanding caps, which are also referred to as Capacity Limits. As mentioned, their primary purpose is to control monthly software charges by limiting the Rolling Four Hour Average. Hard Caps limit application CPU consumption to the cap level at all times. This means the Rolling Four Hour Average can never exceed the cap limit. Soft Caps only limit application consumption when the Rolling Four Hour Average exceeds the cap level.
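
To make the Rolling Four Hour Average (R4HA) concrete, here is a minimal sketch, in Python, of how a rolling average could be computed from interval consumption samples. The 5-minute sample interval and the MSU figures are illustrative assumptions, not taken from any IBM facility.

    from collections import deque

    def rolling_four_hour_average(samples_msu_per_hr, interval_min=5):
        """R4HA over a stream of per-interval MSU consumption readings."""
        window = deque(maxlen=(4 * 60) // interval_min)  # samples in 4 hours
        r4ha = []
        for sample in samples_msu_per_hr:
            window.append(sample)            # oldest sample drops off when full
            r4ha.append(sum(window) / len(window))
        return r4ha

    # A short spike averages away: 2 hours at 400 MSUs/hr, then 1 hour at 900.
    demand = [400] * 24 + [900] * 12
    print(max(rolling_four_hour_average(demand)))  # about 567, well below 900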

This is what a hard cap looks like. Neither the application demand nor the Rolling Four Hour Average is ever allowed to exceed the set limit. It's just like having a smaller machine. This ensures very consistent software bills, but can be rather unforgiving to application performance.

This is what soft caps look like. The application demand is freely allowed to exceed the cap limit as long as the Rolling Four Hour Average, shown with the red line, does not exceed the limit. As you can see, this option provides much more flexibility for application performance.

When the Rolling Four Hour Average does exceed the cap limit, applications are immediately affected. When this occurs, the LPAR (or group of LPARs) is capped by PR/SM. CPU resources are only provided to the LPAR at a rate that does not exceed the capacity limit regardless of the actual application demand. The effect is very much as if the applications are suddenly running on a smaller machine. Further, the capping will remain in place until the Rolling Four Hour Average is back below the capacity limit. In this case the effect remained for 3 hours.
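
As a rough model of that behavior (a sketch only; PR/SM's actual enforcement is more involved than this), soft capping amounts to throttling delivered CPU whenever the R4HA is at or above the limit, no matter how high the demand:

    def delivered_msu(demand_msu, r4ha_msu, cap_limit_msu):
        """MSUs/hr actually provided to the LPAR in this interval."""
        if r4ha_msu >= cap_limit_msu:
            # Capped by PR/SM: delivery is held to the cap level until the
            # R4HA falls back below the limit, regardless of demand.
            return min(demand_msu, cap_limit_msu)
        return demand_msu  # below the limit, demand is satisfied in full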

So what do we do when cap limits are exceeded? The typical, traditional manual method is to monitor and react to changes as best we can. We cancel jobs, adjust priorities, and do our best to help our applications continue to perform in a more constrained environment. It requires a lot of expert staff and still can't be performed at machine speeds or volumes.

Availability is usually job one in a mainframe environment. So when limits are hit, we often simply resort to raising the cap. While this has a positive effect on our applications, it has a negative effect on our software bill. There is really no point in putting the cap back down, as the software bill is based on the highest peak.

Recalling our description of hitting the wall, when the Rolling Four Hour Average hits the capacity limit, the entire LPAR is capped, potentially affecting all applications on the LPAR. Rather than allow this to occur, TM AE allows you to cap only specified batch workloads. Further, TM AE takes a soft-hammer approach, gradually reducing or deferring the CPU consumption of the batch that you designate before you hit the wall. Many shops are under the mistaken impression that batch does not impact their peaks because it runs at lower priority and it runs at night. The facts don't bear that argument out. First, keep in mind that all workloads contribute to the Rolling Four Hour Average regardless of priority; a CPU second is a CPU second. Second, consider the following chart.

This chart was generated from actual customer data and only reflects the day shift. While online is certainly the dominant contributor, note that batch makes up a significant portion of the Rolling Four Hour Average. Even a modest reduction in batch consumption during these peaks can yield significant software savings while maintaining performance of your key workloads. Now, we'll go into a little more detail about how TM AE allows you to lower your software costs by capping demand.

In order to cap demand at the right times, TM AE has to understand the environment and be aware of any changes that may occur. Of course, TM AE determines whether there are any Defined Capacity or LPAR Group Limits set by the installation, and it will detect any changes made to these limits. It also tracks the current Rolling Four Hour Average CPU usage for each LPAR and the current CPU demand. It's important to keep current demand in mind when making decisions on whether or not certain workloads should be increased. In order to have a full picture, TM AE also tracks the overall CPU consumption of the CEC and of each LPAR in the CEC. High CPU consumption by one or more LPARs can affect CPU availability for other LPARs. TM AE monitors batch workload performance to avoid overloading, which leads to poor overall performance and delays. This is particularly important when capping demand, so that the workload being capped still performs as well as possible given that it is being constrained.
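
A simplified picture of the state such automation has to maintain might look like the following. The field names are hypothetical, chosen only to mirror the items just described; they are not the product's actual data model.

    from dataclasses import dataclass, field

    @dataclass
    class LparState:
        name: str
        defined_capacity_msu: float   # Defined Capacity limit, if one is set
        r4ha_msu: float               # current rolling 4-hour average usage
        demand_msu: float             # current CPU demand
        consumption_msu: float        # current actual CPU consumption

    @dataclass
    class CecState:
        group_limit_msu: float        # LPAR Group Limit, if one is set
        total_consumption_msu: float  # overall CEC consumption
        lpars: list = field(default_factory=list)   # LparState entries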

TM AE caps the batch workloads chosen by the installation in three phases. First, as the R4HA increases and approaches the limits set by Defined Capacity or an LPAR Group Limit, TM AE automatically starts restricting the CPU consumption of these workloads. It does this in 5 gradual steps. The idea is to slowly shrink usage and avoid hitting the limit with very high CPU consumption, which we usually refer to as hitting the wall. When you become soft capped at high rates of CPU utilization, the sudden drop in CPU availability can significantly affect performance, and that usually causes installations to immediately increase their limits, and with them their software bills. By taking action in stages before the limit is reached, TM AE makes capping manageable and eliminates the negative performance effects. Then, while soft capping is occurring, TM AE continues to constrain the lower priority batch workloads to the maximum extent specified by the installation, reducing the overall CPU demand in each affected LPAR. This leaves more cycles available for high priority workloads such as online, even if the high and low priority work are on different LPARs. Once the peak passes, the LPAR is no longer being capped and overall CPU consumption begins to come down. TM AE starts to smoothly remove the constraints on the affected batch workloads, gradually allowing more access to CPU and reversing the 5 steps just described. The goal is to automatically get as much of the deferred workload running as quickly as possible, as long as the Rolling Four Hour Average is not increased, a difficult task without automation. Gradually, TM AE starts to run the deferred jobs. As long as the R4HA does not increase, the constraints will be relaxed further and more of the deferred batch workload will be allowed to run. TM AE also makes sure that the deferred work is selected at a rate that does not overload the LPAR. TM AE makes it possible to live not only with the caps you may currently have implemented, but to reduce them further while still protecting the performance of your high priority applications. Lower cap values translate directly into lower software bills every month going forward. Here are a couple of examples to show how this all works.
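
The staged ramp-down can be sketched as a simple step function. The five steps come from the description above; the 80% starting threshold and the equal step sizes are invented for illustration.

    def throttle_step(r4ha_msu, limit_msu, steps=5, start_at=0.80):
        """Map R4HA proximity to the cap onto a constraint level 0..steps."""
        if r4ha_msu < start_at * limit_msu:
            return 0                       # far from the limit: no constraint
        if r4ha_msu >= limit_msu:
            return steps                   # at the wall: maximum constraint
        span = (1.0 - start_at) * limit_msu
        fraction = (r4ha_msu - start_at * limit_msu) / span
        return min(steps, int(fraction * steps) + 1)

On the way back down, the same levels would be released in reverse as the R4HA recedes, matching the gradual relaxation just described.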

Here's an example of an LPAR Group with two LPARs, LPAR1 and LPAR2, with a mix of online and batch load. On the left, before using TM AE, we see that on LPAR1, online and batch are each contributing 600 MSUs/hr to the peak R4HA. On LPAR2, online contributes 700 MSUs/hr and batch 300. The total peak R4HA is 2200 MSUs/hr. After implementing TM AE, the installation decides that 25% of their batch should be eligible to be deferred. The results? The online usage remains the same, while the batch is smaller, for a new total peak R4HA of 1975 MSUs/hr, a saving of 225 MSUs/hr. The LPAR Group limit can be reduced by 225 MSUs/hr without affecting online performance.
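
The arithmetic behind that example, with 25% of the batch on each LPAR deferred out of the peak:

    before = {"LPAR1": {"online": 600, "batch": 600},
              "LPAR2": {"online": 700, "batch": 300}}   # MSUs/hr
    defer = 0.25

    after = {lpar: {"online": w["online"],
                    "batch": w["batch"] * (1 - defer)}
             for lpar, w in before.items()}

    peak_before = sum(w["online"] + w["batch"] for w in before.values())  # 2200
    peak_after = sum(w["online"] + w["batch"] for w in after.values())    # 1975
    print(peak_before - peak_after)   # 225 MSUs/hr that the limit can drop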

In this second example, we have a different configuration: an LPAR Group with two LPARs, 3 and 4, one of which (LPAR 3) runs mostly online and a small amount of high priority batch that must run on that LPAR, and another (LPAR 4) that runs entirely batch load. On the left, before TM AE, the peak R4HA for this LPAR Group was 1850 MSUs/hr. This installation wants to reduce their peak monthly R4HA, but they have also reached a point where their online applications are going to need more CPU to handle an expected increase in transaction volume due to the introduction of a new application. These are usually conflicting goals. The installation identifies 300 MSUs of batch workload on LPAR 4 that can be deferred during peak times. They decide to reduce their LPAR Group limit by 200 MSUs/hr and let the online work on LPAR 3 consume an additional 100 MSUs/hr. Note that the batch on LPAR 3 is all high priority, so they left it alone. Their online applications got the capacity boost they needed, and they still reduced their peak monthly Rolling Four Hour Average.
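
The numbers reconcile the two goals:

    peak_before = 1850      # MSUs/hr, peak R4HA for the LPAR Group
    batch_deferred = 300    # deferrable batch identified on LPAR 4
    online_growth = 100     # extra capacity granted to online on LPAR 3

    peak_after = peak_before - batch_deferred + online_growth  # 1650 MSUs/hr
    # The group limit comes down by 200 MSUs/hr even though online grew.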

Here's a graph created from actual data from a large TM AE installation. On it you can see the R4HA and the CPU utilization; the horizontal line is the limit. In this case the installation is using LPAR Group Limits. Notice how the R4HA line gradually flattens until it just barely hits the LPAR Group Limit. This is TM AE in action, reducing the demand and the growth in the R4HA so that the effects of capping are minimized. This customer was able to reduce their limits and realize seven-figure annual savings. Let's look at typical savings.

This chart shows the expected savings for three different installations. For each of these, the contribution of all batch workloads to the overall peak R4HA was determined; then, assuming that 25% of that is lower priority and deferrable, the potential savings were calculated. For this we used a rate of $300/MSU/hr, which is typical for a large installation with the usual IBM software set. Smaller installations may not save quite as many MSUs, but in some cases their incremental rate is $1400 per MSU/hr, so the savings can still be substantial. Some installations' current annual savings exceed the $1.3M shown on the chart.
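
As a rough reconstruction of that calculation, assuming charges scale with the peak R4HA at a flat monthly rate (the batch contribution figure below is invented; only the 25% assumption and the $300 rate come from the talk):

    batch_in_peak = 1200    # MSUs/hr of batch in the peak R4HA (illustrative)
    deferrable = 0.25       # fraction assumed lower priority and deferrable
    rate = 300              # $ per MSU/hr of peak R4HA, per month

    saved_msus = batch_in_peak * deferrable    # 300 MSUs/hr off the peak
    annual_savings = saved_msus * rate * 12    # $1,080,000 per year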

It's clear that Capping Demand is the key to lowering software costs while protecting key applications. To enable capping the demand of our applications, we first need to take control of our applications with automation. Changes in system utilization and workload demand happen at machine speeds, so we need to anticipate and react at machine speeds as well. This means dynamically tracking and managing utilization and automatically balancing workloads across the available resources.

Here we have an overutilized highway. The design doesn't prevent more cars from coming onto the road even as the average speed approaches zero. The result is many cars on the road, but no one getting to their destination within the expected time.

The same principles hold true in a computer system. If additional workloads are continually added to the system even as utilization reaches critical levels, everything waits longer. Some users may have a comfortable feeling since they see their job has started, but like the cars on the highway, the job is not going anywhere very fast. You can see from this queuing graph that at very high levels of utilization the total elapsed time becomes 10 or 20 times longer. The CPU is working just as fast, but the wait time to access the CPU increases exponentially. As utilization approaches 100%, the wait time increases to intolerable levels, just like gridlock on the highway. The key to peak throughput is to constantly manage the workload distribution against the rapidly changing utilization to maintain just the right amount of load. Now we'll explain how TM AE accomplishes this automatically.
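
The 10x to 20x figure follows from elementary queuing theory. In a simple M/M/1 approximation, total elapsed time stretches by a factor of 1/(1 - utilization), as this sketch shows:

    def stretch_factor(utilization):
        """M/M/1 elapsed time relative to pure service time: 1/(1 - rho)."""
        if not 0.0 <= utilization < 1.0:
            raise ValueError("utilization must be in [0, 1)")
        return 1.0 / (1.0 - utilization)

    for rho in (0.60, 0.80, 0.90, 0.95, 0.99):
        print(f"{rho:.0%} busy -> work takes {stretch_factor(rho):.1f}x as long")
    # 90% busy -> 10.0x, 95% -> 20.0x, 99% -> 100.0x: highway gridlock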

To avoid overloading, TM AE only adds batch load when and where it makes sense. LPARs must have available capacity: this means either the LPARs are not using up their CPU entitlement by weight, or there are still unused CPU cycles on the CEC of which the LPAR is able to take advantage. The Service Class in which a job will run must be performing well, to avoid unproductive overloading, because everything works better when you avoid overloading. It is not important when a batch job starts; it's when it ends that matters. This is even truer with production applications, where one or more jobs may be dependent on the completion of another. Jobs end sooner in a TM AE-managed environment. We did a benchmark that shows this in action.
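
The admission decision just described reduces to a predicate like the following. This is a sketch; the actual inputs and thresholds TM AE uses are not spelled out here, so these parameter names are assumptions.

    def should_start_batch(lpar_below_weight_entitlement,
                           cec_has_spare_cycles,
                           lpar_can_use_spare_cycles,
                           service_class_performing_well):
        """Start another batch job only where it can actually make progress."""
        has_capacity = lpar_below_weight_entitlement or (
            cec_has_spare_cycles and lpar_can_use_spare_cycles)
        return has_capacity and service_class_performing_well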

We ran over 1,000 jobs submitted over several hours. These jobs had a mix of CPU and I/O load and were run both in WLM batch initiators and under TM AE. There were no job dependencies. Of course, the two environments were identical in every way: hardware, software, and service class definitions. No other work was running on the CEC. The result? TM AE ran far fewer jobs at once, started many fewer initiators, and completed much more work. Here's the graph of the benchmark results:

You can see that the work ran for hours. WLM started 300 initiators, while TM AE automatically started only 25. As you can see from the graph, after over 8 hours, TM AE had completed 200 more jobs. The only reason that number didn't get higher is that we stopped submitting work! These results would be even more dramatic with workloads that make use of job dependencies. Many installations use manually controlled JES2 initiators. In that environment you will often find that the machine is either overloaded or underutilized, since it's just not possible to be as effective manually as automation that is constantly monitoring the environment and making instant decisions.

Sometimes less is more. By avoiding overloading, TM AE gets more batch jobs done, faster. Systems are much more prone to overloading when capped, which is why automation is so important when managing demand in a soft-capped environment.

It's safe to say this load is imbalanced. As mentioned previously, the other key factor in effective Automated Workload Management is balance.

As we demonstrated with utilization, a system running at 100% is spending more time waiting than doing productive work. In the example given here, even though the system at 60% will provide slightly better service than the systems at 80%, the exponential delays experienced by the system at 100% will more than wipe out those gains. The facts are simple: in order to achieve peak performance, workloads must be balanced across the available resources. As with utilization, workload demands change too rapidly to manage manually. This is another automated function of TM AE.
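
Using the same 1/(1 - rho) stretch factor from the queuing discussion (again an illustrative model, not measured data), the comparison is stark. Both configurations carry the same total load, but the saturated system dominates the aggregate delay:

    def stretch(rho):
        return 1.0 / (1.0 - min(rho, 0.999))   # clamp to avoid division by zero

    imbalanced = [0.60, 0.80, 1.00]   # one system effectively gridlocked
    balanced = [0.80, 0.80, 0.80]     # the same total load, spread evenly

    print(sum(stretch(r) for r in imbalanced))  # ~1007, swamped by the 100% system
    print(sum(stretch(r) for r in balanced))    # 15, every system stays responsive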

In order to balance batch workloads automatically, TM AE makes sure it is aware of the performance and capacity of the entire environment. It tracks the utilization of all LPARs and the required system and resource affinities of all batch workloads. The installation-specified business priorities are defined to TM AE so that it always selects the most urgent job. Workload is balanced automatically because TM AE controls the number of initiators on each system and dynamically spreads the workload across all members of the JESplex.

TM AE does workload balancing right. Balancing is not simply making sure the same number of jobs are running on each system. Balancing means that each system in the JESplex runs the right amount of batch workload for the current conditions on each system while still respecting any specific system affinity and resource requirements of individual batch jobs. TM AE considers the actual activity on each LPAR and reevaluates CEC, LPAR and Service Class performance every 10 seconds so it can respond to environments that can and do change very rapidly. Only automation can do this. TM AE avoids overloading by rebalancing the batch workload as CPU demand and availability change. Capacity can change too, due to Capacity on Demand, LPAR weight changes, and, of course, our old friend, soft capping, which can cause the available capacity to be suddenly reduced. By balancing batch workload intelligently, TM AE delivers increased throughput with proper use of existing resources.
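
Putting the pieces together, one evaluation cycle might be sketched like this. The 10-second interval is from the description above; every method name on the hypothetical jesplex, lpar, and job objects is an assumption made for illustration, not the product's interface.

    def rebalance_once(jesplex):
        """One 10-second evaluation cycle against a hypothetical JESplex."""
        for lpar in jesplex.lpars:
            lpar.refresh_metrics()          # CEC, LPAR, and Service Class data
        for job in jesplex.waiting_jobs_by_business_priority():
            eligible = [lpar for lpar in jesplex.lpars
                        if job.affinities_satisfied_by(lpar)
                        and lpar.has_headroom()]
            if eligible:
                # Run the job where current conditions leave the most room,
                # rather than simply evening out the job counts.
                best = max(eligible, key=lambda lpar: lpar.headroom())
                best.start_initiator(job)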

Automation is no longer optional in datacenters today. Capacity and Workload Managers need to harness this technology to control costs while maximizing performance. Automated Workload Management provides the necessary controls to maximize throughput by managing utilization and balance. Automated Capacity Management enables the datacenter to Cap Demand, the only way to safely reduce cap levels and lower software costs.