The Comeback of Batch Tuning

By Avi Kohn, Time Machine Software

Introduction

A lot of attention is given today by data centers to online systems, client/server, data mining, and, more recently, the Internet. The batch workload is far from the limelight as data centers search for new ways to increase productivity. After all, batch applications are all moving to online systems or to client/server systems, or are going to be replaced by new-fashioned e-something systems. Or are they?

Reality is quite different. While some (usually new) batch applications have indeed been moved, most batch applications are still alive and kicking. A considerable portion of the MVS data center's workload is still made up of batch work, and its importance is actually growing rather than decreasing.

For the majority of MVS data centers, batch work is largely confined to certain time periods (windows) when online applications are either unavailable or, at least, place a lower resource requirement on the system. However, with most data centers moving towards 24x7 operations, online applications must stay up as long as possible; consequently, the batch windows must shrink and lessen their impact on online services.

Three-Tiered Approach to Reducing the Batch Window

There is no silver bullet for shrinking the batch window. Instead, there are several methods and approaches that can be used in parallel, each with a different price tag. Usually, only a combination of approaches can achieve the desired results.

Tier #1: The classic solution is, of course, to buy more hardware (a faster processor, more memory, more sophisticated I/O equipment, etc.). This solution can cause the batch window to shrink, but cannot eliminate it - even if your data center has an unlimited budget for buying more hardware. There is, of course, a huge price tag associated with the purchase of more hardware (especially processors). When your hardware vendor quotes a price, remember to add to it the costs of hardware maintenance, environmentals, floor space rental and support staff. Moving to a stronger processor also means higher software costs. While there is no way for a dynamic and growing company to eliminate new hardware purchases, delaying processor upgrades for as long as possible can save your company a great deal of money.

Tier #2: A substantial amount of batch work (e.g., data backup, database maintenance) will never go away. To best cope with this load, you must exploit hardware and software solutions that enable this batch work to coexist with the online work. These solutions must provide excellent performance with minimal disruption of online processes. Significant progress has been made in this area during the last few years. For data backup, numerous hardware-based and/or software-based options (e.g., mirroring, snapshot copy, concurrent copy) are available. For database maintenance, various vendor utilities are available which can greatly enhance the standard database utilities. Each solution has its advantages and disadvantages, and each carries a price tag (e.g., newer disks and disk controllers, more cache, additional software).
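To make the concurrent copy option a little more concrete, here is a minimal sketch of a DFSMSdss (ADRDSSU) dump step requesting concurrent copy. The data set names, DD names and filtering are hypothetical, and whether concurrent copy is actually honored depends on your DFSMSdss level and disk controller.

//* SKETCH ONLY: BACK UP PRODUCTION DATA SETS WITH CONCURRENT COPY,
//* SO THE ONLINE APPLICATIONS CAN RESUME UPDATING ALMOST IMMEDIATELY
//BACKUP   EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE     DD DSN=BACKUP.PAYROLL.WEEKLY,DISP=(NEW,CATLG),
//            UNIT=TAPE
//SYSIN    DD *
   DUMP DATASET(INCLUDE(PROD.PAYROLL.**)) -
        OUTDDNAME(TAPE)                   -
        CONCURRENT                        -
        TOLERATE(ENQFAILURE)
/*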
Tier #3: Last, but certainly not least, there is batch tuning. Some 25 data center surveys I have performed reveal that, on average, 5% of batch processing is not required in the first place, another 10%-15% is not performed using the correct tools and utilities, and a further 40% can be optimized using simple tuning options. A large or mid-size data center can often cut up to 30% of the batch window by performing a proper batch tuning exercise.

Tuning efforts are not without expense. The personnel involved in performance management and in production management cost money (salary, training, meetings). Consequently, their time should be focused on the most effective and, at the same time, cost-justified methods of cutting the batch window. Existing tuning tools should be exploited whenever possible, and additional tools should be purchased if required and cost-justified. Available tuning articles and books (e.g., Technical Support Magazine articles, IBM Redbooks) should be exploited as well.

The remainder of this article focuses on the third tier, batch tuning, describing a sample tuning exercise. Most elements of this exercise are likely to be applicable to your data center.

Sample Tuning Exercise

In order to reduce the batch window, the elapsed time of the batch workflow must be cut. Focus must therefore be placed on tuning issues which directly impact elapsed time for batch processing. Following are a few key optimization strategies for the batch tuning exercise. I have also added a few real-life performance-related observations and tips that may not be so obvious.

Eliminate Unnecessary Processing

Often, certain tasks (jobs, steps, functions) are executed which are actually not required. For example, a job continues to run each day even though the requirement for it was eliminated some time ago. Eliminating such unnecessary tasks cuts down 100% of the system resource utilization and elapsed time consumed by these tasks (presumably, we can all agree on this assumption). Sample instances: data is created but never referenced afterwards; a job step can be eliminated; a sort is executed when the input data is already sorted; unnecessary on-the-spot backups are performed.

It's worthwhile following the flow of actions performed on data sets. Check for data sets which are created and, at a later stage, re-created or overwritten without any task (another job, TSO user, CICS, etc.) accessing the data in between. You might discover data sets which are no longer required or which were never required in the first place. You might also identify on-the-spot backups which are not required.

It's also worthwhile tracing the sorts performed on data sets. Input data sets processed by multiple jobs are sometimes mistakenly sorted twice (with the same sort keys). In other cases, input data sets are processed multiple times by SORT because the programmer did not take advantage of the SORT OUTFIL option, which allows the creation of multiple output files in a single pass over the input data set, as shown in the sketch below.
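As a hedged illustration of the OUTFIL point, the DFSORT step below sorts one input file and writes two output files in a single pass. The data set names, sort key and record-type columns are hypothetical and would need to match your own record layout.

//SPLIT    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.DAILY.TRANS,DISP=SHR
//TYPE01   DD DSN=PROD.DAILY.TRANS.TYPE01,DISP=(NEW,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//TYPE02   DD DSN=PROD.DAILY.TRANS.TYPE02,DISP=(NEW,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//SYSIN    DD *
* ONE PASS OVER SORTIN: SORT ONCE, THEN WRITE EACH RECORD TYPE
* TO ITS OWN OUTPUT FILE INSTEAD OF RUNNING TWO SEPARATE SORTS
  SORT FIELDS=(1,10,CH,A)
  OUTFIL FNAMES=TYPE01,INCLUDE=(21,2,CH,EQ,C'01')
  OUTFIL FNAMES=TYPE02,INCLUDE=(21,2,CH,EQ,C'02')
/*

The same idea applies to other sort products; the control statement names may differ, but the principle of reading the input once remains.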
Optimize I/O

Batch jobs make use of several system resources (e.g., CPU, storage, I/O). When the elapsed time of a batch job is broken down into components, the bulk of the time is usually consumed performing I/O (60-70% on average, and 90% or more if the job is I/O bound). I/O optimization therefore often generates the most significant payback in any tuning project. Many techniques and options, both hardware and software, are available to reduce the number of I/Os ("the best I/O is no I/O") and to perform the remaining I/Os as efficiently as possible. Sample points: optimal VSAM buffering and VSAM cluster definition; optimal sequential data set block size; Hiperbatch; caching; data set striping.

Optimal VSAM cluster definitions are highly important. Each time a new VSAM cluster is created in the data center, the responsible person tends to reuse the most recent IDCAMS DEFINE statements rather than reviewing the relevant VSAM manual (well, who can blame him/her). This may explain, for example, why so many read-only VSAM clusters are defined with significant (and unnecessary) freespace, which also hurts performance to some extent.

There is a myth floating around that SMS-managed data centers no longer have problems with non-optimal sequential data set block sizes. That's just what it is - a myth. Each day, most SMS-managed data centers perform millions (or at least hundreds of thousands) of unnecessary I/O operations due to this problem. Optimizing data set block sizes and eliminating unnecessary VSAM freespace can also yield disk space savings which are sometimes significant. So, although not originally targeted, disk space savings can be a byproduct of performance tuning.
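As a hedged illustration of both points, the sketch below defines a read-only VSAM KSDS with no freespace and then copies a sequential extract while leaving BLKSIZE off the output DD, so that a system-determined block size is used. All names, key positions, record sizes and space values are hypothetical.

//DEFINE   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
   /* READ-ONLY CLUSTER: NO FREESPACE NEEDED, NO SPLITS EXPECTED */
   DEFINE CLUSTER (NAME(PROD.REFTABLE.KSDS)    -
          INDEXED                              -
          KEYS(12 0)                           -
          RECORDSIZE(200 200)                  -
          FREESPACE(0 0)                       -
          CYLINDERS(50 5))
/*
//* COPY STEP: BLKSIZE IS OMITTED ON SYSUT2, SO THE SYSTEM CHOOSES
//* AN EFFICIENT (NEAR HALF-TRACK) BLOCK SIZE FOR THE OUTPUT
//COPY     EXEC PGM=IEBGENER
//SYSPRINT DD SYSOUT=*
//SYSUT1   DD DSN=PROD.EXTRACT.OLDBLK,DISP=SHR
//SYSUT2   DD DSN=PROD.EXTRACT.NEWBLK,DISP=(NEW,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(20,10),RLSE),
//            RECFM=FB,LRECL=120
//SYSIN    DD DUMMY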
Increase Operational Effectiveness

Batch tasks (jobs, steps or specific functions) which require certain physical or logical resources are frequently delayed or slowed. Optimizing the use of resources and eliminating resource contention for a given job can significantly reduce elapsed time for that job (or, of no less importance, for another job which requires the same resources). Sample points: a job allocates more tape drives than necessary; a wait occurs due to an unavailable physical or logical resource (e.g., a cassette drive, data set or initiator); job-related processes require manual intervention (e.g., job rerun/restart processing, setup of job parameters, opening/closing of CICS files prior to the job run).

The cost of a lack of initiator availability is often underestimated. If, for example, 10 jobs belonging to a critical path each wait an average of 3 minutes for an available initiator, that critical path can immediately be shortened by 30 minutes!

Many users are not aware of how their job scheduler reserves runtime resources (e.g., a tape drive) for a job: the resource is usually reserved at job submission time (not job execution time), and released only when the entire job finishes processing. So, if a job waits 3 minutes for an available initiator, uses the tape drive (in the first job step) for 5 minutes, and continues executing subsequent steps for 50 minutes, the tape drive is reserved for 58 minutes instead of 5!

Improve Job and Application Efficiency

Many site-developed programs and utilities are not as efficient as they could be. This can cause a significant problem when the degree of inefficiency is large. The best opportunity to optimize an application is during application development, when a new application is written or an existing application is significantly modified due to changed business requirements. It is difficult to tune and modify an application once it has been integrated into the production process. Not only are the costs of optimizing the application substantially higher at this point, but new bugs may be introduced into the application and jeopardize the production work. Therefore, most users do not analyze such applications except in extreme cases (e.g., selected jobs take far too long to run, and no alternative methods are available for speeding them up). Sample points: a program opens (and reads or writes) a sequential data set an excessive number of times; a program performs non-optimal sorting; other inefficient program logic; non-optimal program compile options.

User-written applications and utilities sometimes read (and/or write) a sequential file more than once during job execution. Enhanced program design can usually remove this inefficiency. Examples: the program uses the file as a kind of temporary storage; the program selects only one type of record in each pass over the data, so that multiple steps are required in the job to process different types of records; a utility needs to know the number of records in the file before processing them, so the program reads the entire input file twice.

SORT executions (whether triggered by a user program or directly by a job step) are often not as efficient as they could be. Several sort installation (and run-time) parameters control sort performance. Values for some of these parameters may have been determined years ago, when memory was still very expensive. Make sure that small amounts of data (e.g., less than 0.5-1.0 MB) are always sorted in storage. Also, determine whether sort work files still need to be preallocated prior to SORT invocation. If so, customize your SORT product to dynamically allocate the required work files. Among other things, this will result in more optimally sized work files and will eliminate sort failures due to insufficient work space.

Job and program failures directly impact batch workload performance, as they result in wasted runtime resources. Examine your job scheduler's log for the most common job failures experienced during the last few weeks, and try to reduce them. Some failures, such as Sx37 abends (disk space problems) and sort failures due to insufficient work space, can and should be totally eliminated.
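As a hedged sketch of the sort work file point, the DFSORT step below codes no SORTWKnn DD statements and instead asks DFSORT to allocate work space dynamically. The data set names, the SYSDA unit name and the limit of four work data sets are assumptions; other sort products offer equivalent, but differently named, options.

//SORTSTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.CUSTOMER.EXTRACT,DISP=SHR
//SORTOUT  DD DSN=PROD.CUSTOMER.SORTED,DISP=(NEW,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(30,10),RLSE)
//SYSIN    DD *
* NO SORTWKNN DD STATEMENTS ARE CODED; DFSORT DYNAMICALLY ALLOCATES
* UP TO 4 WORK DATA SETS ON SYSDA, AND ONLY IF WORK SPACE IS NEEDED
  SORT FIELDS=(1,15,CH,A)
  OPTION DYNALLOC=(SYSDA,4)
/*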
Conclusion

Data centers have to find a way to continue executing their batch workload while moving towards 24x7 online operations. Unnecessary batch processing should be eliminated, and the remaining batch workload should be optimized to run as efficiently as possible and to affect online work as little as possible. This goal can be achieved; it just requires focus, planning and effective tools.

Avi Kohn is Chief Technology Officer of Time Machine Software, a leading provider of production optimization solutions. Mr. Kohn is the architect of SmartProduction (a production optimization product), and is the original inventor of CONTROL-M (a job scheduler product). He can be contacted at timemachine@il.ibm.com.