Introduction to NonStop Operations Management


Abstract

This manual introduces operations managers to NonStop operations management. It provides guidelines, suggestions, and ideas on the following topics: staffing, operations and support areas, operations documentation, production management, problem management, change management, configuration management, performance management, security management, application management, automating and centralizing operations, and improving operations management processes.

Product Version
N.A.

Supported Releases
This manual supports the G01.00 release and all subsequent G-series releases until otherwise indicated in a new edition.

Part Number:
Published: December 1996
Release ID: G01.00

Document History

Part Number    Product Version    Published
               N.A.               December
               N.A.               December
               N.A.               December 1996

New editions incorporate any updates issued since the previous edition.

Ordering Information
For manual ordering information: domestic U.S. customers, call ; international customers, contact your local sales representative.

Document Disclaimer
Information contained in a manual is subject to change without notice. Please check with your authorized Tandem representative to make sure you have the most recent information.

Export Statement
Export of the information contained in this manual may require authorization from the U.S. Department of Commerce.

Examples
Examples and sample programs are for illustration only and may not be suited for your particular purpose. Tandem does not warrant, guarantee, or make any representations regarding the use or the results of the use of any examples or sample programs in any documentation. You should verify the applicability of any example or sample program before placing the software into productive use.

U.S. Government Customers
FOR U.S. GOVERNMENT CUSTOMERS REGARDING THIS DOCUMENTATION AND THE ASSOCIATED SOFTWARE: These notices shall be marked on any reproduction of this data, in whole or in part.

NOTICE: Notwithstanding any other lease or license that may pertain to, or accompany the delivery of, this computer software, the rights of the Government regarding its use, reproduction and disclosure are as set forth in Section of the FARS Computer Software Restricted Rights clause.

RESTRICTED RIGHTS NOTICE: Use, duplication, or disclosure by the Government is subject to the restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS

RESTRICTED RIGHTS LEGEND: Use, duplication or disclosure by the Government is subject to restrictions as set forth in paragraph (b)(3)(b) of the rights in Technical Data and Computer Software clause in DAR (a).

This computer software is submitted with restricted rights. Use, duplication or disclosure is subject to the restrictions as set forth in NASA FAR SUP 18-52 (April 1985) Commercial Computer Software Restricted Rights (April 1985). If the contract contains the Clause at 18-52 Rights in Data General then the Alternate III clause applies.

U.S. Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract.

Unpublished. All rights reserved under the Copyright Laws of the United States.

New and Changed Information

The Introduction to NonStop Operations Management manual has been revised to:

• Delete references to all operations management products and features, manuals, and NonStop systems that are not supported in the G01.00 release. Products include: CMI, CSM, DSC/COUP, Envoy, InfoWay, Install, NonStop NET/MASTER, PUP, RCP, RDF, RMI, ROF, Surveyor, Syshealth, Tandem CD Read, TMDS, and ViewPoint.
• Describe new or enhanced operations management products and features, as well as new Tandem terminology, manuals, and NonStop systems that are supported in the G01.00 release. New products include Tandem Service Management (TSM), the TSM EMS Event Viewer, the Tandem Information Manager (TIM) product, and enhancements to the Subsystem Control Facility (SCF).

Specifically:

• Section 1, Overview of NonStop Operations Management, contains updated information about Tandem Education and the International Tandem User's Group (ITUG).
• In Section 2, The Operations Staff, detailed descriptions of the recommended training paths for entry-level through senior-level support personnel have been removed and replaced by sources for the latest Tandem Education course information. Also, support personnel responsibilities have been updated to include new support products such as the Tandem Service Management (TSM) product and the TSM EMS Event Viewer.
• Section 3, The Operations and Support Areas, now includes references to the Himalaya S-series servers and information about TSM and the TSM EMS Event Viewer.
• Section 4, Operations Documentation, has been updated to document the ServerNet system network area (ServerNet SAN) and the ServerNet wide area networking (ServerNet WAN) subsystems and the new configuration listing that can be generated from SCF.
• Section 5, Production Management, has been updated to document new production management tools, including TSM, the TSM EMS Event Viewer, and enhancements to SCF.
• Section 6, Problem Management, has been updated to document new problem management tools, including TSM and the TSM EMS Event Viewer.
• Section 7, Change and Configuration Management, has been updated to document new change and configuration management tools, including Distributed Systems Management/Software Configuration Manager (DSM/SCM), TSM, the TSM EMS Event Viewer, and SCF enhancements.
• Section 8, Performance Management, has been updated with information about the TSM EMS Event Viewer.

• In Section 9, Security Management, all references to non-supported security management tools such as NonStop NET/MASTER, PUP, and RMI have been removed.
• In Section 10, Contingency Planning, all references to RDF have been removed.
• Section 11, Application Management, has been updated to document the TSM EMS Event Viewer's role in application management.
• Section 12, Automating and Centralizing Operations, has been updated to document the TSM EMS Event Viewer.
• Section 14, Operations Management Tools, has been updated by deleting all references to products not supported in G01.00 and by adding overview descriptions for the new or enhanced products mentioned throughout the manual, including TSM, the TSM EMS Event Viewer, and SCF.
• Appendix A, Additional Reading, has been updated to include current G01.00 manuals and to remove references to noncurrent manuals.
• Appendix B, Check Lists, has been updated by deleting references to products that are not supported in the G01.00 release and adding references to new or enhanced products that are supported in the release.
• The Glossary has been revised. Definitions for non-supported products have been removed, and definitions for new and enhanced products have been added.

Contents

New and Changed Information
About This Manual
Notation Conventions

1. Overview of NonStop Operations Management
  Overview
  What Is Operations Management?
  Service-Level Agreements
  Specifying Business Goals
  Specifying End-Users' Requirements
  Determining Operations Management Objectives
  The OM Model
  Production Management
  Problem Management
  Change Management
  Configuration Management
  Performance Management
  Security Management
  Managing Operations From an End-User's Perspective
  Availability and the High Cost of Down Time
  Viewing Availability From an End-User's Perspective
  Maximizing Availability
  Tandem NonStop Systems and Software
  Tandem NonStop Systems
  Tandem Software
  Where to Go for More Information
  Tandem Software Publications
  World Wide Web (WWW) Home Page
  Tandem Education
  Tandem Hardware and Software Support
  Tandem Alliance Program
  International Tandem Users Group (ITUG)
  Tandem Professional Services
  Account Quality Planning (AQP) Service
  FAXAdvisor

2. The Operations Staff
  Overview
  Staffing and the Operations Management Model
  Who Provides Each Level of Expertise?
  Staffing Levels Within the Production Function
  Staffing the Operations Area
  Staffing the Support Area
  Staffing Levels Within the Change Function
  Staffing the Planning Area
  Staffing the Control Area
  Sample Operations Organizations
  A Small Operations Group
  A Distributed Operations Group
  A Centralized Operations Group
  A Telecommunications Group
  A Technical Support Group
  Sample Job Descriptions
  The Operations Area
  The Support Area
  The Planning Area
  The Control Area
  The Operations Manager
  Training
  Tandem Education
  Tandem Manuals
  In-House Training
  Other Vendor Training
  Check List

3. The Operations and Support Areas
  Overview
  A Computer Room or an Office
  Selecting a Location
  Both Computer-Room and Office Environments
  Computer Room Environments
  Office Environments
  The Environment
  Physical Security
  Equipment and Supplies

  System Installation
  Computer Room Environments
  Office Environments
  Preventive Maintenance
  Both Computer-Room and Office Environments
  Computer Room Environments
  Office Environments
  Support Areas
  Check List

4. Operations Documentation
  Overview
  What Is Operations Documentation?
  Policies, Standards, and Procedures
  Service-Level Agreements
  Creating Service-Level Agreements
  Agreements, Contracts, and Supporting Documents
  Configuration Diagrams and Listings
  Flow Diagrams
  Tandem Manuals
  Tandem Software Release Documents
  Logs
  Operator Logs
  Error Logs
  CE Logs
  Outage Logs
  Internal Operator Guides
  Online Files
  Error Messages
  Check List

5. Production Management
  Overview
  What Is Production Management?
  Monitoring System Status
  Controlling the System
  Tracking System Usage
  Step 1: Establishing a Strategy
  Step 2: Determining the System Resources to Be Monitored
  Step 3: Collecting Accounting Data

  Providing Daily and Weekly Reports
  Using a Production Schedule
  Creating a Production Schedule
  Analyzing the Completed Schedule
  The 24-Hour Clock Worksheet
  Management Responsibilities
  Routine Operations Tasks
  System Startup, Processor Dumps, Processor Reload, and System Shutdown
  System Startup
  Processor Dump
  Processor Reload
  System Shutdown
  Daily Tasks
  Start-of-Day Tasks
  Start-of-Shift Tasks
  During-the-Shift Tasks
  End-of-Day Tasks
  Weekly Tasks
  Monthly Tasks
  Recovery Procedures
  Production Management Tools
  Check List

6. Problem Management
  Overview
  What Is Problem Management?
  The Goals of Problem Management
  Common Problems in an Operations Environment
  Management Responsibilities
  Establishing Policies and Procedures
  Providing Outage Prevention and Recovery Training
  Predicting and Preventing Problems
  Problem Prevention Strategies
  Recovering From Problems
  Step 1: Detecting and Isolating the Problem
  Step 2: Gathering the Facts and Reporting the Problem
  Step 3: Identifying the Cause and Developing and Implementing a Solution
  Step 4: Escalating the Problem (If Necessary)
  Step 5: Reviewing the Problem

  Case Study
  Business Background and System Configuration
  Business and Operations Activities
  Problem Scenario
  Gathering Facts About the Problem
  Gathering Facts About the Situation
  Determining the Cause and Resolving the Problem
  Problem Management Tools
  Check List

7. Change and Configuration Management
  Overview
  What Are Change Management and Configuration Management?
  The Goals of Change and Configuration Management
  Management Responsibilities
  Staffing
  Anticipating and Planning for Change
  Installing and Implementing Changes
  Performing Hardware Changes
  Performing System Configuration Changes
  Performing Subsystem Changes
  Performing Software Changes
  Controlling the Introduction of Change
  What Is Change Control?
  Implementing Change Control Successfully
  The Change Control Process
  Case Study
  User Profile
  Business Background
  Analysis of Problem
  Implementation of Recommendations
  Conclusion
  Change-Management and Configuration-Management Tools
  Check List

8. Performance Management
  Overview
  What Is Performance Management?
  Service-Level Agreements
  Staffing

  Application Sizing
  Step 1: Establishing the Requirements and Strategy
  Step 2: Forecasting
  Step 3: Reporting Results
  Capacity Planning
  Step 1: Establishing the Requirements and Strategy
  Step 2: Performance Reporting
  Step 3: Forecasting
  Step 4: Developing the Capacity Plan
  Performance Analysis and Tuning
  Step 1: Establishing Performance Requirements
  Step 2: Gathering Performance Information
  Step 3: Analyzing Performance Information
  Step 4: Optimizing System Performance
  Step 5: Reporting Results
  How It Fits Together
  Case Study
  User Profile
  Analysis of Problem and Recommendations
  Performance Management Tools
  Check List

9. Security Management
  Overview
  What Is Security Management?
  Basic Security Rules
  Developing a Security Policy
  Security Guidelines
  Security Is a People Problem
  Management Support
  Staff Support
  User Community Support
  Organizational Issues
  The Tandem Security System
  Authentication Services Provided by the Tandem NonStop Kernel
  Safeguard
  NonStop SQL/MP
  $CMON
  Physical Security

  The Computer Room
  Environmental Controls
  System Cabinets
  Terminals
  Printers
  Tape Units
  Tape Library
  On-Site and Off-Site Media Storage
  Data Encryption
  Managing Access to the System
  User Groups
  Access-Control Lists (ACLs)
  Adding User IDs
  Assigning User Aliases
  Special User IDs
  Guest-User IDs
  Unused User IDs
  Deleting User IDs
  Reusing User IDs
  Managing Passwords
  Requiring Strong Passwords
  Setting Unexpected Initial Passwords
  Enforcing Routine Password Changes
  Protecting Passwords
  Dial-Up Access and Security
  Authorization Lists
  Additional External Passwords
  Callback Routine
  Automatic Terminal Authentication
  Periodic Password and Telephone Number Changes
  What Happens if the Line Is Dropped?
  Securing Network Access
  Managing Network User IDs
  Security Precautions
  Encrypting Data Between Systems
  Communication With Other Operations Groups
  Securing Client/Server Environments
  OSS System Security
  OSS File Security

  Interoperability With Safeguard Security
  Special Security Concerns
  Program Development
  PROGID Programs
  Licensed Programs
  Check List

10. Contingency Planning
  Overview
  What Is a Disaster?
  Preventing Disasters
  Computer Center Location and Facilities
  Security
  Preventive Maintenance and System Monitoring
  System and Network Configuration
  Data Recovery and Integrity
  Data Archiving
  Disaster Recovery Planning
  Step 1: Taking Inventory
  Step 2: Developing the Plan
  Step 3: Testing the Plan and Training the Staff
  Step 4: Revising the Plan
  Backup Sites
  Cold Sites
  Operational-Ready Sites
  Data-Ready Sites
  Online-Ready Sites
  Determining Which Type of Backup Site Best Meets Your Needs
  Check List

11. Application Management
  Overview
  What Is Application Management?
  Establishing Application Requirements
  Requirements
  Check List for an Applications Review
  Establishing a Production-Assurance Control Group
  Batch, Online, and Client/Server Processing
  Batch Processing
  Online Transaction Processing

  Client/Server Processing
  Case Study
  Business Background
  Analysis of Problem
  Implementation of Recommendations
  Check List

12. Automating and Centralizing Operations
  Overview
  Why Automate and Centralize Operations?
  Automating Operations Tasks
  Centralizing System Operations
  Automation and Centralization Tools
  Check List

13. Operations Management and Continuous Improvement
  Overview
  Why Improve Your Operations Environment?
  Implementing an Operations-Management Improvement Program
  Using the Maturity Framework
  Step 1: Assessing Your Environment
  Step 2: Developing a Vision
  Step 3: Developing an Action List
  Step 4: Scheduling and Committing Resources
  Step 5: Executing the Plan
  Step 6: Assessing the Improvement Program
  Case Study
  User Profile
  Problem Scenario
  Implementing an Operations-Management Improvement Program
  Conclusion
  Check List

14. Operations Management Tools
  Overview
  $CMON
  Command Files
  Data Access Language (DAL) Server
  Disk Space Analysis Program (DSAP)
  Distributed Name Service (DNS)

  Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
  Distributed Systems Management/Software Configuration Manager (DSM/SCM)
  Enform
  Event Management Service (EMS)
  Event Management Service Analyzer (EMSA)
  File Utility Program (FUP)
  Flow Map
  Guardian Performance Analyzer (GPA)
  Measure
  NetBatch and NetBatch-Plus
  NonStop Access for Networking
  NonStop ODBC Server
  NonStop SQL/MP
  NonStop SQL/MP SQLCI
  NonStop Transaction Manager/MP (NonStop TM/MP)
  NonStop TM/MP Interfaces (TMFCOM, TMFSERVE)
  NonStop Transaction Services/MP (NonStop TS/MP)
  NonStop TS/MP PATHCOM Interface
  NonStop Virtual Hometerm Subsystem (VHS)
  NSKCOM
  Object Monitoring Facility (OMF)
  Open Notification Service (ONS)
  Pathway Open Environment Toolkit (POET)
  Pathway/TS
  PEEK
  Remote Server Call (RSC)
  Safeguard
  SeeView
  Simple Network Management Protocol (SNMP)
  Subsystem Control Facility (SCF)
  Subsystem Programmatic Interface (SPI)
  Tandem Advanced Command Language (TACL)
  Tandem Capacity Model (TCM) and MeasTCM
  Tandem Failure Data System (TFDS)
  Tandem Network Statistics Extended (NSX)
  Tandem Performance Data Collector (TPDC)
  Tandem Reload Analyzer (Reload Analyzer)
  Tandem Service Management (TSM)

  Transfer
  TSM EMS Event Viewer
  ViewSys

A. Additional Reading
  Overview
  Section 1 Overview of NonStop Operations Management
  Section 2 The Operations Staff
  Section 3 The Operations and Support Areas
  Section 4 Operations Documentation
  Section 5 Production Management
  Section 6 Problem Management
  Section 7 Change and Configuration Management
  Section 8 Performance Management
  Section 9 Security Management
  Section 10 Contingency Planning
  Section 11 Application Management
  Section 12 Automating and Centralizing Operations
  Section 13 Operations Management and Continuous Improvement
  Section 14 Operations-Management Tools

B. Check Lists
  Overview
  The Operations Staff
  The Operations and Support Areas
  Operations Documentation
  Production Management
  Problem Management
  Change and Configuration Management
  Performance Management
  Security Management
  Contingency Planning
  Application Management
  Automating and Centralizing Operations
  Operations Management and Continuous Improvement

Glossary
Index

Figures

  Figure 1-1. The Operations Management Disciplines
  Figure 1-2. Operations Manuals
  Figure 2-1. A Small Operations Group
  Figure 2-2. A Distributed Operations Group
  Figure 2-3. A Centralized Operations Group
  Figure 2-4. A Telecommunications Group
  Figure 2-5. A Technical Support Group
  Figure 4-1. A Network Configuration Diagram
  Figure 4-2. A System Configuration Diagram
  Figure 4-3. An Activity Flow Diagram: Activities Performed in a 24-Hour Period
  Figure 4-4. A Process Flow Diagram: Solving System Access Problems
  Figure 4-5. A Sample Outage Log
  Figure 24-Hour Clock Worksheet
  Figure 6-1. Systematic Problem Solving
  Figure 6-2. Sample Problem Report Form
  Figure 6-3. Case Study: Just For Children, Inc. (JFC) Computer System
  Figure 6-4. Problem-Solving Worksheet
  Figure 7-1. Case Study: Change-Control Process Flow at Allied Bank
  Figure 8-1. Performance Management Functions
  Figure 9-1. Paths of Security Communication
  Figure 9-2. Layers of Tandem Security
  Figure The Disaster Planning Process
  Figure Damage-Assessment Team Responsibilities
  Figure Command Post Responsibilities
  Figure Batch Processing
  Figure Online Transaction Processing
  Figure Simple Client/Server Environment
  Figure Complex Client/Server Environment
  Figure Typical Operations Problems
  Figure Centralized Operations
  Figure Causes of System Outages
  Figure Operations-Management Improvement Framework
  Figure Case Study: Manual Recoveries Versus Automated Recoveries

Tables

  Table 1-1. LAN Availability and Down Time per 40-Hour Workweek (Traditional Measurement)
  Table 1-2. Outage Minutes per Year (24-Hour by 7-Day by Year-Round Clock)
  Table 2-1. Staff Levels of Expertise
  Table 5-1. Summary of Production Management Tools
  Table 6-1. Unplanned Outage Classes
  Table 6-2. Problem Management Tools
  Table 7-1. Summary of Change-Management and Configuration-Management Tools
  Table 8-1. Summary of Performance Management Tools
  Table 9-1. Classes of Special System Users
  Table Backup-Site Alternatives: Advantages and Disadvantages
  Table Online Transaction-Processing Tools
  Table Client/Server Processing Tools
  Table Automation and Centralization Tools
  Table The Maturity Framework
  Table Case Study: NAC's Production Profile
  Table Case Study: Schedule for the Operations-Management Improvement Program
  Table Operations Management Tools
  Table Examples of Typical Command Files


About This Manual

Overview

The Introduction to NonStop Operations Management manual provides an overview of Tandem operations management concepts, tasks, products, and manuals for NonStop systems. This manual is a prerequisite for reading other Tandem operations manuals. This manual will:

• Help operations managers prepare for the installation of a Tandem system
• Provide managers of existing Tandem systems with useful operational and reference information

The Introduction to NonStop Operations Management manual provides guidelines, suggestions, and ideas on managing Tandem NonStop systems effectively and efficiently. Because each organization is unique and has different needs, this manual presents recommendations and not requirements. When you read this manual, you may find that a topic is not relevant to your operation, or you may need to make alterations to the guidelines to meet your needs. However, anyone responsible for managing a Tandem NonStop system can find valuable information in this manual.

Who Should Read This Manual?

Operations managers in companies of varying sizes with systems of varying sizes should read this manual. This includes managers who:

• Have responsibility for the success of a Tandem operations environment
• Make operations decisions
• Establish operations procedures
• Hire and train operations staff

The operations managers might be anyone from the system manager to the MIS director. Some topics will be of interest to all levels of operations management. Other topics might be of interest only to a subset of managers. This manual can also serve as an introduction to operations management for programmers, operators, and technical specialists.

What's in This Manual?

This manual covers the groundwork of operations management. Each section of the manual discusses a key area of operations management and provides suggestions, guidelines, and summary check lists. The sections are ordered from the general to the more specific. Overview and introductory information is near the beginning of the manual; information on how to prepare and organize an operations environment is next, followed by information on how to manage an operations environment.

This manual is organized in 14 sections, two appendixes, and a glossary. The glossary defines technical terms and acronyms.

Section 1, Overview of NonStop Operations Management
This section defines operations management and explains how to apply the operations management model in a Tandem environment.

Section 2, The Operations Staff
This section explains how to use the operations management model to help determine the type of operations organization, staff requirements, job descriptions, and training resources needed. This section also gives examples of operations organizations and job descriptions, and provides training suggestions.

Section 3, The Operations and Support Areas
Once the decision has been made to buy systems and peripheral hardware, the process of site planning starts. This section provides site-planning suggestions for traditional computer-room environments and for office environments.

Section 4, Operations Documentation
Documentation serves two purposes: it helps establish guidelines, standards, and policies, and it helps you educate and inform. This section provides a list of documentation that helps operations personnel perform their jobs. Types of documentation include policies, procedures, service agreements, logs, configuration diagrams, manuals, schedules, and recovery plans.

Section 5, Production Management
There are many routine tasks that are performed each day, week, and month. This section outlines these tasks so that you are aware of what is required and can ensure that the necessary tasks are performed. It also lists the products Tandem offers to help with production management tasks.

Section 6, Problem Management
No matter how well-planned your operations and how fault-tolerant the system, problems may still occur. Well-defined procedures can help you and your staff quickly and correctly resolve problems. This section describes the process of managing problems within the operations environment and outlines a systematic approach to help detect, isolate, and resolve problems quickly. It also lists the products Tandem offers to help with problem management tasks.

Section 7, Change and Configuration Management
By establishing change and configuration management functions, you can prevent confusion and disruption when system and application changes occur. This section defines the process of anticipating, implementing, and controlling change within the operations environment and outlines a change control process to help you plan and manage change. It also lists the products Tandem offers to help with change and configuration management tasks.

Section 8, Performance Management
This section defines performance management and provides guidelines for managing system and network performance to help you ensure that you get the best return from your NonStop systems and that the systems meet your business needs. This section also provides guidelines for carrying out the performance management functions of application sizing, capacity planning, and performance analysis and tuning. In addition, this section lists the products Tandem offers to help with performance management tasks.

Section 9, Security Management
A security policy helps a company ensure that its systems, software, data, and personnel are well protected. This section provides guidelines and considerations to help you participate in security policy planning and enforcement. This section also describes system protection tools and methods, and the types of tasks required to maintain a secure system.

Section 10, Contingency Planning
Disasters can occur anytime and anywhere. In companies where day-to-day business activity relies on a computer system, a sound recovery plan is imperative. This section will help you take preventive measures, and, if necessary, recover from a disaster as quickly as possible with minimal damage and at minimal cost.

Section 11, Application Management
Typically, the operations staff must run and manage applications on production systems. This section provides information on running applications on Tandem NonStop systems.

Section 12, Automating and Centralizing Operations
Automating and consolidating operations can help you better manage operations. This section describes some of the tasks that can be automated and centralized. It also provides guidelines for automating and centralizing tasks, and lists the products Tandem offers to help automate and consolidate operations.

Section 13, Operations Management and Continuous Improvement
The processes and procedures of an operations organization should never remain static. Because change occurs all the time, it is vital that you continually improve the processes of your operations environment to adapt to change. This section outlines a step-by-step approach to help you improve your operations processes.

Section 14, Operations Management Tools
This section lists and describes the Tandem operations tools that help your staff perform operations management tasks.

Appendix A, Additional Reading
This appendix lists documents that provide additional information about the topics and products mentioned in this manual.

Appendix B, Check Lists
The check lists from each section in this manual are reproduced in this appendix so that you can easily use them for note taking or photocopying.

Prerequisite Reading

Before reading this manual, you should be familiar with the material described in the Himalaya S-Series Operations Guide. That guide introduces you to the ServerNet architecture implemented with the G01.00 release.

Related Manuals

The following manuals are related to the material presented in this manual.

• The Availability Guide for Change Management explains how to increase availability of NonStop systems by effectively managing change in an operations environment.
• The Availability Guide for Problem Management explains how to increase availability of NonStop systems by effectively managing problems in an operations environment.
• The Availability Guide for Performance Management explains how to measure system performance, analyze system performance information, and optimize the performance of Tandem NonStop systems.
• The Availability Guide for Application Design provides an overview of the application availability options open to designers and developers.
• The Security Management Guide describes how to use the NonStop Kernel and Safeguard security features to control access to Tandem systems.

Your Comments Invited

After using this manual, please take a moment to send us your comments. You can do this by returning a Reader Comment Card or by sending an Internet mail message.

A Reader Comment Card is located at the back of printed manuals and as a separate file on the Tandem User Documentation disc of the Tandem Information Manager (TIM) product. You can either FAX or mail the card to us. The FAX number and mailing address are provided on the card.

Also provided on the Reader Comment Card is an Internet mail address. When you send an Internet mail message to us, we immediately acknowledge receipt of your message. A detailed response to your message is sent as soon as possible. Be sure to include your name, company name, address, and phone number in your message. If your comments are specific to a particular manual, also include the part number and title of the manual.

Many of the improvements you see in Tandem manuals are a result of suggestions from our customers. Please take this opportunity to help us improve future manuals.


Notation Conventions

General Syntax Notation

The following list summarizes the notation conventions for syntax presentation in this manual.

UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words; enter these items exactly as shown. Items not enclosed in brackets are required. For example:

    MAXATTACH

lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required. For example:

    file-name

Change Bar Notation

Change bars are used to indicate substantive differences between this edition of the manual and the preceding edition. Change bars are vertical rules placed in the right margin of changed portions of text, figures, tables, examples, and so on. Change bars highlight new or revised information. For example:

    The message types specified in the REPORT clause are different in the COBOL85 environment and the Common Run-Time Environment (CRE). The CRE has many new message types and some new message type codes for old message types. In the CRE, the message type SYSTEM includes all messages except LOGICAL-CLOSE and LOGICAL-OPEN.


1 Overview of NonStop Operations Management

Overview

Your business benefits from effective operations management practices. With today's rapidly changing marketplace and business pressures of global competition, educated consumers, and economic conditions, Tandem recognizes that operations organizations are often faced with ever-increasing demands. With thoughtful planning and management of system operations, you will be prepared to run your Tandem NonStop systems efficiently and effectively.

This section:
- Defines operations management in the Tandem environment
- Describes service-level agreements and how they define operations management objectives
- Defines the operations management (OM) model and how it can be used to organize and operate your operations environment
- Describes how to manage an operations environment with an end-user's perspective
- Explains how Tandem systems and software help you meet your objectives
- Lists sources of additional operations management information

What Is Operations Management?

Operations management can mean different things to different people. Tandem defines operations management as the operation and management of systems and networks in support of your business. Planning for operations management includes:
- Establishing and fulfilling service-level agreements. Service-level agreements define your company's business goals, your end users' requirements, and your organization's objectives and standards.
- Defining and understanding the functions involved in operations management. Tandem uses the OM model to categorize the functions of the operations environment into six industry-standard disciplines. Using the OM model can help you effectively structure, staff, plan, and operate your organization to meet your objectives.
- Managing operations with an end-user's perspective. Managing the availability of your systems and applications from the end-user's perspective can help ensure that your organization is meeting its end users' requirements.

- Optimizing the features of Tandem NonStop systems and software. Through the optimal use of Tandem NonStop systems' fault-tolerant, scalable, distributed processing and many other features, you will be able to meet your operations management objectives.

Service-Level Agreements

Every operations organization should consider developing service-level agreements. Service-level agreements specify the level of service that operations should provide and are usually developed through negotiations between the operations organization and the organization's users (or those representing the users). The agreements serve three functions:
- They specify the company's business goals.
- They specify the end users' requirements.
- They determine operations management objectives, requirements, and standards, with the intention of aligning operations goals with the goals of the company and the requirements of end users.

For more information on service-level agreements and for guidelines on creating service-level agreements, refer to Section 4, Operations Documentation.

Specifying Business Goals

Specifying the business goals of your company in your service-level agreements can help you make the correct trade-offs between hardware, software, cost, performance, response time, availability, and personnel training costs. Some common business goals might include:
- Increasing customer satisfaction by providing system availability 24 hours a day, 7 days a week, 365 days a year
- Accommodating the company's geographically distributed operations
- Reducing the cost of providing services by downsizing transaction processing to a distributed computing environment
- Increasing staff efficiency through the use of automated operations

Specifying End-Users' Requirements

In most companies, the end users, not the business, determine when, where, and how services should be provided. Specifying end-user requirements in your service-level agreements can help to ensure that your organization supports the requirements. Depending on the type of business your company provides, end users might have requirements for:
- Availability
- Data integrity
- Response times and throughput rates
- Data security
- Reduced cost of operation

Determining Operations Management Objectives

By determining your operations management objectives, requirements, and standards, and aligning your operations goals with the goals of the company, you can determine:
- The type of staff coverage to provide
- The tasks the staff should perform
- The types of equipment you need
- The type of budget you need
- Your department's priorities

The OM Model

To help fulfill your service-level agreements, Tandem uses an OM model to categorize the functions and tasks of an operations environment into six industry-standard disciplines:
- Production management
- Problem management
- Change management
- Configuration management
- Security management
- Performance management

Using the OM model provides you with the following advantages:
- You have a general picture of the major tasks involved in running Tandem systems. This can help determine what type of operations organization you need.
- The OM model covers every aspect of operations management. This helps create a reliable and predictable environment, because all operations tasks are assigned and performed.
- You can identify the strengths and weaknesses of your organization, thus enabling you to staff and train operations personnel, to implement new operations processes and strengthen existing ones, and to determine what new technology and tools are required, if any.

Figure 1-1 shows the OM disciplines working together to ensure a stable and predictable OM environment.

Figure 1-1. The Operations Management Disciplines
[Figure: Production Management, Problem Management, Change Management, Configuration Management, Performance Management, and Security Management surrounding a Stable and Predictable Environment]

Production Management

Production management includes the day-to-day tasks performed by operations personnel who operate and manage the production environment. For example, some of the tasks included in this discipline are:
- Monitoring the systems, networks, and applications to ensure that service-level agreements are being met
- Starting and shutting down systems, LANs and WANs, and applications
- Monitoring event and alert messages
- Monitoring system performance and capacity
- Scheduling jobs
- Logging problems
- Providing technical support
- Processing online transaction and batch operations
- Maintaining disk and tape media
- Maintaining documentation such as operator logs, configuration diagrams, policies and procedures, service-level agreements, and manuals

Tandem provides a number of tools to manage the production environment, including tools for:
- Monitoring systems, networks, and applications online
- Automating operator procedures
- Managing distributed systems from a central site
- Managing networks, databases, and applications

For guidelines and suggestions on managing the production environment, refer to Section 5, Production Management. For guidelines on managing applications, refer to Section 11, Application Management. For guidelines on automating and centralizing operations tasks, refer to Section 12, Automating and Centralizing Operations.

Problem Management

Problem management includes the tasks required to manage and administer the problem environment. For example, some of the tasks included in this discipline are:
- Ensuring system fault tolerance
- Predicting and preventing problems
- Detecting and analyzing problems
- Documenting and reporting problems
- Researching, diagnosing, and isolating problems
- Escalating problems
- Resolving problems and analyzing the cause
- Recovering from problems as quickly as possible
- Establishing problem prevention techniques

Tandem provides a number of tools for managing the problem environment and recommends a systematic method for detecting, isolating, and recovering from problems.

For guidelines and suggestions on managing the problem environment, refer to Section 6, Problem Management. For comprehensive information about managing the problem environment and for information about problem management tools provided by Tandem, refer to the Availability Guide for Problem Management. For guidelines on disaster planning, refer to Section 10, Contingency Planning.

Change Management

Change management includes the tasks required to manage the maintenance and growth of your NonStop system. Change management involves managing all hardware, software, and procedural changes and includes all of the tasks required to properly manage change within the operations environment. For example, some of the tasks included in this discipline are:
- Anticipating and planning for change
- Controlling the introduction of change
- Installing and implementing changes to system software and hardware, application subsystems, communications subsystems, and application software

Tandem systems are engineered so they can grow and change in response to business needs. They also provide a flexible processing environment that allows you to add terminals, applications, and databases as needed while the system is running.

For guidelines and suggestions on managing the change environment, refer to Section 7, Change and Configuration Management. For comprehensive information about managing change and for information about change management tools provided by Tandem, refer to the Availability Guide for Change Management.

Configuration Management

Change and configuration management are interrelated functions. Configuration management includes the tasks required to manage and administer the configuration of system software and hardware, application subsystems, communications subsystems, and application software. For example, some of the tasks included in this discipline are:
- Maintaining and tracking configuration documentation
- Defining and managing names for configuration components and maintaining the relationships between configuration components
- Controlling, maintaining, and distributing multiple versions of software such as operating system and subsystem object code, Tandem and third-party application object code, and in-house application source and object code

Tandem provides a number of tools to help your staff with change and configuration management tasks.

For guidelines and suggestions on managing the configuration environment, refer to Section 7, Change and Configuration Management.

Performance Management

Performance management includes the tasks required to manage the performance of your computer system. For example, some of the tasks included in this discipline are:
- Analyzing and optimizing the current environment by monitoring the performance of the operating system, subsystems, network, and applications
- Isolating performance problems such as system availability and current hardware utilization
- Forecasting the performance impact of changes within the system environment
- Planning for system and network growth
- Collecting and analyzing data for usage accounting, as well as trending information

Tandem provides a number of tools for measuring and tuning system performance.

For guidelines and suggestions on managing your system's performance, refer to Section 8, Performance Management. For comprehensive information about managing the performance of Tandem NonStop systems and networks, refer to the Availability Guide for Performance Management. For detailed information about products documented in this manual that are not supported in the G01.00 release, refer to the Himalaya S-Series Publications Note.

Security Management

Security management includes the security features necessary to implement a secure, audited operations environment. For example, some of the tasks included in this discipline are:
- Managing the identification of users and their proper access to the system, including user IDs, password requirements, and access-control lists
- Securing network access
- Establishing and monitoring the physical security of the system
- Defining security features contained within the Tandem NonStop Kernel operating system

The Tandem security system is an integrated group of software products that protect data existing on a system. Tandem software allows you to implement a variety of security policies.

For guidelines and suggestions for managing your system's security, refer to Section 9, Security Management. For comprehensive information about system security, refer to the Security Management Guide.

Managing Operations From an End-User's Perspective

Today's globalization of consumers and the demand for increased customer service require that many businesses offer services around the clock. Offering services around the clock requires computer, network, and application services that are available all the time. To ensure that you are supporting the demands of your company's customers, Tandem suggests that you monitor and measure the availability of systems and applications from the end-user's perspective.

Availability and the High Cost of Down Time

Tandem defines availability as the total time an application running on a Tandem system can be accessed by a user of that application. When an application is unavailable, your business becomes vulnerable to various types of losses, such as lost revenue, lost consumer confidence, and lost productivity.

For example, if the application is a revenue-generating service, such as an automated teller machine (ATM) application or a long-distance telephone network application, the business suffers an immediate loss of revenue and continues to lose revenue until the system comes back online.

Consider another example: An airline reservation system that is 99.1 percent available per week translates to 90 minutes of down time per week. At an estimated cost of $36,000 per minute, the company loses revenues of $3.24 million per week. Lost productivity, management dissatisfaction, and overtime costs can be even more costly than lost revenue.

Viewing Availability From an End-User's Perspective

Tandem recommends that availability be measured from the end-user's perspective. For example, it is not enough to record that a certain hardware or software component has gone down. You must also take into consideration the user's ability to access the service, the quality of the service provided, and whether or not the response time is acceptable to the user.

Traditional Measurements

The computer industry has traditionally reported availability in percentages. While this measurement is valid, it is difficult to envision and use. Consider this example: An employee at a company complains that the LAN is always down. The LAN manager responds that the LAN is up 95 percent of the time; however, 95 percent may not be as impressive as it sounds. Table 1-1 shows how some percentages break down in a 40-hour workweek:

Table 1-1. LAN Availability and Down Time per 40-Hour Workweek (Traditional Measurement)

    Percentage of Time LAN Is Up    Equivalent Number of Minutes LAN Is Down
    90 percent                      240 minutes
    95 percent                      120 minutes
    99 percent                      24 minutes

Using an Outage-Minutes-per-Year Measurement

Tandem recommends using a total outage-minutes-per-year measurement to reveal outages. An outage is the time during which the system is not capable of doing useful work because of planned or unplanned interruptions. From the end-user's perspective, an outage is any time an application is not available.

The outage-minutes-per-year measurement is easy to understand and provides more meaningful data than percentage figures such as 95 percent available. Table 1-2 compares percentage numbers with equivalent outage minutes and the resulting user impact.

Table 1-2. Outage Minutes per Year (24-Hour by 7-Day by Year-Round Clock)

    Percent Availability    Outage Minutes/Year*    User Impact*
    90%                     50,000                  35 days
    99%                     5,000                   3.5 days
    99.9%                   500                     8.3 hours
    99.99%                  50                      50 minutes
    99.999%                 5                       5 minutes
    100%                    0                       0 minutes

    *Outage minutes per year and user impact days are approximations.

Measuring User-Outage Minutes in a Client/Server Environment

For client/server types of applications, it is useful to express down time as the number of user-outage minutes. A failure in the client part of the application might affect only one user; but to that user, the application is down. A failure in part of the network could affect several users. A failure in the server, however, could affect thousands of users. It is important that an outage in the server be weighted more heavily than an outage in the client.

In a client/server environment, it therefore makes sense to measure down time as the number of minutes the application is unavailable multiplied by the number of affected users. A one-minute outage in the workstation equals one minute of down time. An outage of one minute in the server, however, equals one minute times the number of users accessing the server.
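
The measurements discussed above reduce to simple arithmetic. The following Python sketch is an illustration only and is not part of any Tandem product: the function names, the Outage record layout, and the user counts are assumptions made for the example. It converts an availability percentage into minutes of down time for a given period (as in Table 1-1) and weights outages by the number of affected users.

```python
from dataclasses import dataclass

MINUTES_PER_WORKWEEK = 40 * 60      # 2,400 minutes, the basis of Table 1-1
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080 minutes
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes, the basis of Table 1-2


def outage_minutes(availability_percent, period_minutes):
    """Minutes of down time implied by an availability percentage over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


@dataclass
class Outage:
    """One outage as seen by end users (hypothetical record layout)."""
    component: str
    minutes: float
    affected_users: int


def user_outage_minutes(outages):
    """Sum of each outage's duration multiplied by the number of affected users."""
    return sum(o.minutes * o.affected_users for o in outages)


# Traditional percentages expressed as minutes of down time per 40-hour workweek.
for pct in (90, 95, 99):
    down = outage_minutes(pct, MINUTES_PER_WORKWEEK)
    print(f"{pct} percent up: {down:.0f} minutes down per workweek")

# Airline example: 99.1 percent availability over a full week.
weekly_down = outage_minutes(99.1, MINUTES_PER_WEEK)
print(f"99.1 percent per week: about {weekly_down:.1f} minutes down")

# Three one-minute outages; the user counts are illustrative assumptions.
sample_outages = [
    Outage("client workstation", 1, 1),
    Outage("network segment", 1, 25),
    Outage("server", 1, 2000),
]
print("User-outage minutes:", user_outage_minutes(sample_outages))
```

With these inputs, the workweek figures are 240, 120, and 24 minutes, matching Table 1-1. The airline figure comes to roughly 90.7 minutes, which the example in this section rounds to 90 minutes (90 minutes at $36,000 per minute is $3.24 million). The three one-minute outages total 2,026 user-outage minutes, almost all of it from the server outage.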

Alternate Ways of Measuring Down Time

Depending on specific business needs, down time may be measured in ways other than user-outage minutes. For example, a site might be obligated to pay a penalty for each transaction that does not get processed while an application is down. Such a site might supplement its measure of down time by keeping records of the number of transactions it normally processes by minute and by day of the week. If an outage occurs, for example, at 10 a.m. on a Tuesday and lasts for 15 minutes, the site can calculate the average number of transactions that would normally be processed during that period. Subsequently, the site pays a corresponding penalty to its customer.

Using this method leads to significantly different outage costs, depending on the time of day and the day of the week. An hour-long outage at 2 a.m. on a Monday might carry a negligible penalty when compared with a 15-minute outage at 5 p.m. on a Friday.

Maximizing Availability

Maximizing the availability of your systems, networks, and applications can be achieved by reducing or eliminating outages, both planned and unplanned.

Reducing or Eliminating Planned Outages

Planned outages occur when there are changes that must be implemented and the computing environment must be stopped to implement the changes. An example of such a change is the installation of a new version of the operating system. You can reduce or eliminate planned outages by:
- Performing changes online. Online change is any change that can be performed while the system is still operational. Being able to make changes to your hardware or software online is one way to reduce or even eliminate system and application down time.
- Reducing the time required for planned outages.

Section 7, Change and Configuration Management, provides guidelines for managing planned outages in an operations environment. For comprehensive information about managing change and for information about change management tools provided by Tandem, refer to the Availability Guide for Change Management.
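
The transaction-based penalty calculation described under Alternate Ways of Measuring Down Time can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a Tandem tool: the function name, the rate table keyed by weekday and hour, the transaction rates, and the penalty amount are all hypothetical.

```python
from datetime import datetime, timedelta


def estimate_penalty(start, duration_minutes, normal_rates, penalty_per_transaction):
    """Estimate the penalty for transactions missed during an outage.

    normal_rates maps (weekday, hour) to the site's recorded average number
    of transactions per minute, with Monday as weekday 0. Hours that have no
    recorded rate are treated as zero.
    """
    missed = 0.0
    for offset in range(duration_minutes):
        minute = start + timedelta(minutes=offset)
        missed += normal_rates.get((minute.weekday(), minute.hour), 0.0)
    return missed * penalty_per_transaction


# Hypothetical figures, for illustration only.
NORMAL_RATES = {
    (0, 2): 5.0,      # Monday, 2 a.m.
    (1, 10): 400.0,   # Tuesday, 10 a.m.
    (4, 17): 900.0,   # Friday, 5 p.m.
}
PENALTY_PER_TRANSACTION = 0.50

# 15 minutes on a Tuesday morning, 60 minutes early on a Monday,
# and 15 minutes late on a Friday afternoon.
tuesday_morning = estimate_penalty(
    datetime(1996, 12, 3, 10, 0), 15, NORMAL_RATES, PENALTY_PER_TRANSACTION)
monday_night = estimate_penalty(
    datetime(1996, 12, 2, 2, 0), 60, NORMAL_RATES, PENALTY_PER_TRANSACTION)
friday_evening = estimate_penalty(
    datetime(1996, 12, 6, 17, 0), 15, NORMAL_RATES, PENALTY_PER_TRANSACTION)

print("Tuesday 10 a.m., 15 minutes:", tuesday_morning)
print("Monday 2 a.m., 60 minutes:", monday_night)
print("Friday 5 p.m., 15 minutes:", friday_evening)
```

With these assumed rates, the hour-long Monday outage costs 150.0 while the 15-minute Friday outage costs 6750.0, which illustrates why the same number of outage minutes can carry very different costs depending on when the outage occurs.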

Reducing or Eliminating Unplanned Outages

Unplanned outages occur when system or application down time is caused by a problem situation such as faulty hardware, operator error, or disaster. An example of such a problem is an application change that makes the application unusable by introducing unexpected problems. You can reduce or eliminate unplanned outages by:
- Predicting and then preventing problems before they occur
- Quickly recovering from problems

Section 6, Problem Management, provides general guidelines for predicting, preventing, and recovering from unplanned outages in an operations environment. Section 10, Contingency Planning, provides guidelines and suggestions for disaster prevention and recovery. For comprehensive information about managing the problem environment and for information about problem management tools provided by Tandem, refer to the Availability Guide for Problem Management.

Tandem NonStop Systems and Software

Through the optimal use of Tandem NonStop systems and software, you can fulfill your operations management goals and service-level agreements.

Tandem NonStop Systems

Tandem NonStop systems help you achieve your objectives because Tandem NonStop systems:
- Are highly reliable. The fault-tolerant, multiprocessor architecture of Tandem NonStop systems helps you keep a system running through a malfunction or single point of failure. For example, if one processor goes down, other processors can immediately take over the downed processor's functions while continuing with their own processing.
- Preserve the integrity of databases during processing (with NonStop Transaction Manager/MP [TM/MP] and mirrored disks).
- Are engineered so that they can grow and change in response to business needs.
- Provide a flexible processing environment that allows you to add terminals, applications, and databases as needed while the system is running.
- Provide geographic independence. You can move easily from a single system to a network of systems. You can add systems in the same room, the same city, or across the world.

Though Tandem NonStop systems are fault-tolerant, a fault-tolerant operation requires more than fault-tolerant computers and software. If your goal is to create a completely fault-tolerant operation (one that does not stop because of a single point of failure), you need to make all aspects of the operation fault-tolerant. This includes:
- Staff: providing backup personnel for all key positions
- Sites: providing a backup site in case the primary site is damaged
- Data communications: providing backup lines, switches, and so on
- Environmental systems: providing backup sources of power, air conditioning, and so on
- Anything else your systems need to continue operating

Tandem Software

Tandem software products help you manage large and small systems, centralized and distributed operations. Tandem provides software products to help you:
- Manage the operating system environment
- Manage the databases and applications
- Measure and tune system performance
- Automate operations procedures
- Manage networks
- Manage distributed systems from a central site

Many of the important tools in system operations come from the Distributed Systems Management (DSM) family of products. The DSM products provide a set of services that can monitor and control every element of distributed systems, including:
- Geographically distributed processors
- Tandem software products (such as NonStop TS/MP and Expand)
- Peripheral devices
- Applications
- Data communications software
- Line facilities
- Terminal devices

By consolidating the management of all these elements, DSM provides an umbrella of control for an entire distributed network and allows you to maintain high levels of support for users.

DSM also provides a framework for creating other system management applications. You can use the DSM set of standard interfaces to build system management applications suited to your requirements. System management applications eliminate many repetitive operations tasks, improve system availability by reducing the chance of human error, and free operations staff to concentrate on critical tasks. Throughout this manual, more information is provided about DSM and other operations tools.

Where to Go for More Information

This manual provides an overview of system operations. After reading this manual, you might want to find out more about specific concepts, products, or procedures. The following sources can help you find more information about the topics covered in this manual:
- Tandem Software Publications
- Tandem WWW Home Page
- Tandem Education
- Tandem Hardware and Software Support
- Tandem Alliance Program
- International Tandem User's Group (ITUG)
- Tandem Professional Services
- Account Quality Planning (AQP) Service

Tandem Software Publications

Tandem provides manuals for all Tandem products. The manuals are organized into four levels of information based on the knowledge required to use and manage the products. The possible levels are:
1. Introductory and overview information
2. Tutorial or task-oriented information
3. Reference information
4. Summary information (reference cards, templates, and reference summaries)

Much of the information related to system operations crosses level boundaries. Figure 1-2 provides a map to some of the system operations manuals. Read these manuals for overview, procedural, and reference information about Tandem products. Appendix A, Additional Reading, lists the manuals that provide more information about topics mentioned in this manual.

Tandem manuals are available on CD-ROM to be viewed with the Tandem Information Manager (TIM) product. You can also order Tandem manuals in book form. For a complete list of Tandem manuals supporting the G01 release, refer to the About This Collection document in the G01.00 TIM collection.

Figure 1-2. Operations Manuals
[Figure: map of system operations manuals. Groupings shown: Operations Management Manuals, Operations Manuals, Operations Tools and Products, Operations Programming Manuals, and Server Description Manuals (Himalaya S-Series Servers). Manuals named in the figure: Introduction to Tandem NonStop Systems; Introduction to NonStop Operations Management; Availability Guide for Change Management; Availability Guide for Performance Management; Availability Guide for Problem Management; Availability Guide for Application Design; Security Management Guide; Guardian User's Guide; Guardian System Operations Guide; Guardian Disk and Tape Utilities Reference Manual; Guardian Programmer's Guide; Open System Services User's Guide; Open System Services Management and Operations Guide; Safeguard User's Guide; Measure User's Guide; Subsystem Control Point (SCP) Management Programming Manual; TACL Programming Manual; NonStop TS/MP and Pathway System Management Guide; NonStop TM/MP Operations and Recovery Guide; Himalaya S-Series Operations Guides; and introductory manuals to other areas and products (Data Management, Networking and Data Communications, Transaction Processing, NonStop TM/MP, NonStop TS/MP, NonStop SQL/MP, and so on).]

World Wide Web (WWW) Home Page

For customers with Internet access and a Web browser, Tandem maintains a home page on the World Wide Web. The universal resource locator (URL) for Tandem's home page is

The Tandem home page contains links to the following kinds of information:
- Descriptions of Tandem hardware and software products, including servers, system software, transaction services, networking services, and operations management software
- Tandem Education course descriptions, schedules, and prices
- Descriptions of service and support programs available through Professional Services, including Application and Database services, Business Solutions services, Multi Vendor Solutions services, Networking and Communication services, and Systems and Operations services
- Articles describing how various industries (such as Finance, Government, Health Care, Manufacturing, and Telecommunications) have implemented solutions to complex business problems with Tandem hardware and software

In addition, if you need to ask technical questions about installed products or if you would like to request information about products and solutions, the Tandem WWW home page provides listings of telephone numbers, postal addresses, and Internet electronic mail addresses for Technical Support, Sales, and Marketing.

Tandem Education

Tandem software education provides training for:
- Operators
- System managers
- Technical support specialists
- Tandem users
- Network managers and designers
- Data communications programmers and analysts
- Database administrators
- MIS programmers and analysts
- System programmers

Software education provides lecture-based courses, interactive distance learning (IDL), and independent study programs (ISPs). ISPs include such course formats as text only, text and video, video only, computer-based training (CBT), and CD.

For more information, refer to the Tandem Education Catalog or the Tandem WWW home page, or contact your Tandem representative.

Tandem Hardware and Software Support

Tandem provides hardware and software support. Tandem offers services in the following areas:
- Hardware and software installations
- Site planning and local area network consulting
- System configuration
- Hardware maintenance
- Hardware and software problem resolution
- Equipment inspection
- Equipment reconfiguration
- Application design and testing
- Data communications planning
- Education planning
- Identifying staffing requirements
- Operations reviews

A description of the services provided by the support organization is available from the Tandem WWW home page or the Tandem Support Guide. Your Tandem representative can provide you with a copy.

Tandem Alliance Program

Tandem has developed partnerships with businesses focused on areas of importance to their customers. Alliance partners add to Tandem services and can:
- Help you with application development.
- Provide additional network-management, system-management, application-management, and application-development tools.
- Provide software application packages for users of Tandem systems.
- Work with Tandem and your company to provide integrated application solutions for large projects.

Alliance partners offer a range of services that include consulting, project management, and installation assistance. The Solutions and Services Directory describes the products and services offered by Alliance partners, as does the partners area of the Tandem WWW home page.

For more information about the Alliance program or to obtain a copy of the Solutions and Services Directory, contact your Tandem representative.

International Tandem Users Group (ITUG)

ITUG is an independent organization of over 2,000 members that:
- Encourages communication and information exchange among Tandem users
- Serves as an exchange for design concepts and software
- Establishes a forum for special interest groups such as banking, manufacturing, and transportation
- Provides feedback to Tandem regarding equipment and programming needs

ITUG holds an international conference once a year, publishes the bimonthly Tandem Connection, and maintains a library of useful programs and tools developed by Tandem users. ITUG has headquarters in the United States and branch offices in other countries. For more information about ITUG, contact:

ITUG Headquarters
401 North Michigan Avenue
Chicago, Illinois , USA
(312)
Telex: SBA
Internet: [email protected]
WWW home page:

If you are located outside the United States, your Tandem representative can provide the address and telephone number of the ITUG branch office nearest you.

Tandem Professional Services

Tandem Professional Services provides customized solutions for a customer's business needs by assisting in all phases of a project, including:
- System planning
- Application design and development
- System implementation and testing
- Operations and management
- Production-system maintenance

For more information about Tandem Professional Services, refer to the Tandem WWW home page or contact your Tandem representative.

Account Quality Planning (AQP) Service

The Tandem AQP Service helps you improve your current operations management processes by:
- Performing a profile assessment and analysis of your operations environment
- Identifying problem areas and targeting improvements for areas that will produce the most benefits for your organization
- Analyzing the root cause of problems
- Developing and implementing an action plan to improve the problem areas in your operations environment

For more information on the AQP Service, contact your Tandem representative.

FAXAdvisor

The Tandem FAXAdvisor is a free, automated fax information system that enables you to receive professional services documents, support documents, and product documents by means of a touch-tone telephone and a fax machine. You can have documents delivered to your fax machine within minutes by dialing the FAXAdvisor telephone number:

U.S. callers: TNDMFAX ( )
International callers:

An interactive voice response system leads you through a series of selections. You can request an up-to-date listing of all the documents currently available in a number of categories, such as Recently Announced Products and Services and Professional Services and Support Products. Then you can order copies of specific documents of interest: for example, a service description of the NonStop Availability Review Service available through Tandem Professional Services, a product description of Tandem's Nomadic Disk Technology, or a course description of the Automating Tandem Operations class. For assistance in using FAXAdvisor, call Tandem Fax Support at

2 The Operations Staff

Overview

Before receiving your Tandem NonStop system, you should determine what type of operations organization you will need, what type of training you should arrange for current staff, and what type of staff you need to hire (if any). This section provides guidelines to help you make these decisions. If you currently have Tandem NonStop systems, you might use these guidelines to reorganize your current operations staff.

Staffing and the Operations Management Model

The operations management (OM) model described in Section 1, Overview of NonStop Operations Management, can help you ensure that all functions related to operations management are identified and addressed. To make sure that all the functions in the OM model are performed by the appropriate staff and that the staff's roles and responsibilities are clearly defined, Tandem defines four activity areas that are indispensable to the success of the operations environment. Depending on the size of your organization, you may need to assign one person, several persons, or whole departments to an activity. The operations management activities are divided into two functional areas: production and change.

Planning: Staff assigned to the planning area develop plans for all aspects of operations management, including performance and capacity management, security management, disaster recovery, and production management.

Control: Staff assigned to the control area control and execute the introduction of changes into production.

Operations: Staff assigned to the operations area run and monitor the business systems, applications, and networks.

Support: Staff assigned to the support area support the running of business systems, applications, and networks. They resolve technical problems, automate production tasks, and handle administrative tasks.

There are usually five levels of expertise for Tandem operations: entry, intermediate, senior, management, and executive.
The five levels of expertise cover all tasks from the most basic to the most complex. Experience has shown that operations organizations function best when they have well-defined levels of expertise within each activity area, which:
- Helps the staff maintain a consistent level of service
- Improves productivity by balancing skill levels with the cost of delivering the skills
- Improves morale by providing well-defined career paths and responsibilities
- Improves efficiency by reducing the recurrence of problems

Table 2-1 provides a general description of each level of expertise. The entry-level, intermediate-level, and senior-level skills and tasks are described in more detail in the following subsections.

Who Provides Each Level of Expertise?

Which staff members provide each level of expertise depends on the size of your organization. If you have a small group, one person may provide several levels of expertise, and you may not need full-time, senior-level support. However, as your company grows and you add applications and systems, you may need someone to perform senior-level tasks full-time. If you have a large organization, you may need one or more people to provide each level of expertise.

Successful management of NonStop systems or a Tandem Expand network requires at least one person who has a detailed understanding of the system, the devices connected to it, and the characteristics of the network (if the company has one).

Note. No matter how you organize the staff, be sure there is always a backup person for each level. A backup person can help prevent disruptions in service caused by personnel turnover or absence.

Successful operations management also requires a manager to ensure that the operation runs effectively and efficiently. Depending on the size of your operation, you may have several managers. For example, your company may need one manager for each functional area, for each activity area, or within an activity area to oversee network operations or the maintenance of specific applications.

Only after analyzing the needs of your operation will you be able to determine the skills needed, the number of people to employ, and how to structure the organization.

Table 2-1. Staff Levels of Expertise

Entry-level tasks: Most basic tasks in each functional area. Most operations employees start by learning how to perform these tasks.

Intermediate-level tasks: More complex than the entry-level tasks. Staff who perform intermediate-level tasks need more in-depth knowledge and experience, and less supervision, than entry-level personnel.

Senior-level tasks: Most complex tasks. These tasks require in-depth knowledge and experience. Types of personnel who perform senior-level tasks include analysts who specialize in specific technical areas (such as communications or database administration) and programmers who automate operator functions.

Line-management tasks: The manager's main concerns are to ensure efficient, cost-effective use of resources and to see that other staff members get the technical support they need. The operations manager:
- Sets policies in such areas as problem escalation, disaster recovery, staffing, and workload distribution
- Evaluates and assists in the selection of hardware and software and the best configuration of the two to ensure the success of the operation
- Ensures that all personnel have adequate training
- Determines who performs specific tasks
- Defines standard procedures
- Establishes goals for each level of support
- Monitors the staff so schedules and assignments can be adjusted as needed
- Meets regularly with suppliers to ensure that the group's needs continue to be met

Executive-level tasks: High-level management tasks. Every company has an executive responsible for the success of the operations organization. For example, the executive often has one of these titles: MIS director, vice-president of operations, vice-president of MIS, or chief information officer. The executive:
- Establishes the business goals for the operations organization
- Ensures that the operations organization is meeting its business goals
- Develops the organization's budget
- Approves hardware purchases
- Defines or approves the organizational structure
- Evaluates the operations manager or managers
- Authorizes and approves service-level agreements

Staffing Levels Within the Production Function

The production function is divided into two activity areas: operations and support. The following paragraphs describe the staffing levels for these areas.

Staffing the Operations Area

The operations activity comprises a range of tasks and skills from entry level to senior level. The tasks range from basic problem solving and system monitoring to the more complex tasks of coordinating efforts with vendors and running applications. Depending on your needs, all levels of tasks may be performed by one person, or several people may be responsible for a single level.

Entry-Level Skills and Tasks

Staff performing entry-level tasks should be able to follow procedures and use online help, and should have a basic understanding of the command interpreter and Tandem systems. Entry-level personnel require well-defined and well-documented procedures. Their tasks should be as automated and as simplified as possible. Entry-level tasks include:
- Basic problem solving
- Answering phone calls from users
- Logging problems
- Escalating problems to the next level of support when necessary
- System monitoring
- Running batch jobs
- Controlling and maintaining consoles, tape units, printers, and workstations

Depending on the size of your operations, you may need to divide entry-level tasks between two or more people. For example, some companies have computer room operators and help-desk operators to handle the entry-level tasks. The computer room operators are responsible for the 24-hour operation of the system and have access to the system hardware. The help-desk operators answer phone calls from users, solve simple problems, and determine the impact of user problems and the priority of user requests.

Intermediate-Level Skills and Tasks

Staff performing intermediate-level tasks should have a good knowledge of Tandem utilities and systems and should know how to find information in manuals. Intermediate-level tasks include:
- Training entry-level staff
- Diagnosing, solving, or escalating problems
- Solving problems escalated from entry-level staff
- Starting up and shutting down systems, applications, and peripherals
- Performing network operations
- Maintaining disk, tape, and optical storage media

Depending on the size of your operations, you may need to divide intermediate-level tasks between two or more people. For example, some companies have intermediate-level operators who specialize in different areas of system operations, including network operations and teleprocessing. Other companies have several intermediate-level operators who are supervised by the most experienced intermediate-level operator (often called a lead operator).

Senior-Level Skills and Tasks

Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products. Senior-level tasks include:
- Monitoring applications
- Maintaining and scheduling batch jobs
- Solving problems escalated by intermediate-level staff
- Interdepartmental consulting
- Communicating with vendor support personnel

Senior-level staff are sometimes called senior operators, operations specialists, or operations analysts. In operations groups with multiple shifts, the person who performs senior-level tasks often supervises the entry-level and intermediate-level staff and might be called the shift supervisor.

Staffing the Support Area

The support activity within production comprises a range of tasks and skills from entry level to senior level. The tasks range from basic problem solving and system generation to the more complex tasks of automating procedures and planning for system installations. Depending on your needs, all levels of tasks might be performed by one person, or several people may be responsible for a single level.

Entry-Level Skills and Tasks

Staff performing entry-level support tasks should know how to use the Tandem utilities and operations tools. The staff should also have a good understanding of system operations and management concepts. Senior-level operations personnel are good candidates for an entry-level support position.
Entry-level tasks include:
- Basic problem solving
- Logging problems
- Providing the first level of support to operations staff
- Escalating problems to the next level of support when necessary
- Configuring and generating simple systems
- Writing simple macros and command files to automate operations tasks
- Performing basic security tasks such as adding user IDs, maintaining access-control lists, and monitoring audit files
- Maintaining inventory


A person performing entry-level support tasks might be called an operations analyst or operations specialist.

Intermediate-Level Skills and Tasks

Staff performing intermediate-level tasks should have a good understanding of the Event Management Service (EMS), NonStop architecture, and Tandem products such as Expand, NonStop TS/MP, NonStop SQL/MP, and NonStop TM/MP. Intermediate-level tasks include:
- Training entry-level staff
- Diagnosing, solving, or escalating problems
- Solving problems escalated from entry-level staff
- Using more complex tools (such as EMS) to automate operations tasks
- Monitoring system performance
- Managing databases
- Performing security tasks such as monitoring audit logs, detecting and stopping breaches to security, automating security tasks, and implementing security controls

A person performing intermediate-level support tasks might be called a support analyst.

Senior-Level Skills and Tasks

Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products. Senior-level tasks include:
- Planning for system installations
- Implementing Distributed Systems Management (DSM) facilities
- Advising and assisting other departments
- Communicating with vendor support personnel
- Providing technical support for the operating system
- Analyzing system performance data
- Tuning and balancing system performance
- Performing security tasks such as maintaining and monitoring security batch jobs and databases, escalating security problems to management and the audit function (if there is one), and authorizing requests for access to IDs and changes to passwords

A person performing senior-level support tasks might be called a technical support specialist.

Staffing Levels Within the Change Function

The change function is divided into two activity areas: planning and control. The following paragraphs describe the staffing levels for these areas.

Staffing the Planning Area

The planning activity comprises a range of tasks and skills from entry level to senior level. The tasks range from site planning and performance analysis to network design and application review. Depending on your needs, all levels of tasks may be performed by one person, or several people may be responsible for a single level.

Entry-Level Skills and Tasks

Staff performing entry-level planning tasks should know basic planning concepts and should have a good understanding of the computer industry and Tandem NonStop system capabilities. Entry-level tasks include:
- Basic problem solving
- Logging problems
- Providing the first level of support to staff from other functions
- Developing computer room floor plans
- Site planning
- Developing procedures for operations staff
- Developing security procedures for the support staff
- Gathering and publishing performance information
- Gathering and publishing security information
- Communicating with vendor support personnel

A person performing entry-level planning tasks might be called a systems analyst.

Intermediate-Level Skills and Tasks

Staff performing intermediate-level tasks should have a good understanding of the NonStop architecture, Tandem products and utilities, network design, and capacity management. Intermediate-level tasks include:
- Training entry-level staff
- Diagnosing, solving, or escalating problems
- Solving problems escalated from entry-level staff
- Communicating with vendor support personnel
- Designing networks
- Designing secure systems
- Analyzing performance trends
- Analyzing business and user needs

A person performing intermediate-level planning tasks might be called a systems planner.


Senior-Level Skills and Tasks

Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products. Senior-level tasks include:
- Planning for system installations
- Planning for DSM facilities
- Advising and assisting other departments
- Communicating with vendor support personnel
- Participating in application design reviews and serving as liaison to application development groups
- Planning for the implementation of new hardware and software
- Deciding when to install a new release of the operating system
- Planning system capacity needs
- Defining problem management and security requirements
- Analyzing new technology
- Evaluating vendor products and potential
- Coordinating efforts with other business groups
- Writing proposals for new hardware and software

A person performing senior-level planning tasks might be called a senior systems planner.

Staffing the Control Area

The control activity comprises a range of tasks and skills from entry level to senior level. The tasks range from adding users and monitoring changes to installing new software and securing the system. Depending on your needs, all levels of tasks may be performed by one person, or several people may be responsible for a single level.

Entry-Level Skills and Tasks

Staff performing entry-level control tasks should be able to follow procedures and use online help and should have a basic understanding of the Tandem Advanced Command Language (TACL) command interpreter, Tandem utilities, and Tandem systems. Entry-level personnel require well-defined and well-documented procedures. Their tasks should be as automated and as simplified as possible. Entry-level tasks include:
- Basic problem solving
- Logging problems
- Adding users to the system
- Distributing reports

- Monitoring changes

A person performing entry-level control tasks might be called a systems analyst.

Intermediate-Level Skills and Tasks

Staff performing intermediate-level tasks should have a good knowledge of Tandem utilities and systems, and should know how to find information in manuals. Intermediate-level tasks include:
- Training entry-level staff
- Diagnosing and solving (or escalating) problems
- Solving problems escalated from entry-level staff
- Communicating with vendor support personnel
- Maintaining system configuration documentation
- Installing new software and hardware

A person performing intermediate-level control tasks might be called a configuration planner, production controller, or change administrator.

Senior-Level Skills and Tasks

Staff performing senior-level tasks require an in-depth knowledge of Tandem system operations, system management, and products. Senior-level tasks include:
- Advising and assisting other departments
- Communicating with vendor support personnel
- Managing configuration and change control
- Securing the system
- Automating control processes

A person performing senior-level control tasks might be called a senior configuration planner, senior production controller, or senior change administrator.

Sample Operations Organizations

The best type of operations organization is one that fits the size and complexity of its company. The organizational structure depends on the size of the system, the number of systems connected to a network, the number of applications that run on a system or network, the number of systems bought from different vendors, and the size of the company. In some companies, operations functions are contained in one department. In other companies, the operations functions are spread among several departments. Because of the number of variables involved in developing an organization, this section does not attempt to explain how to set up your organization.
Instead, it shows examples of organization charts that illustrate several ways of providing entry-level through line-management support. Examples include a small operations group, a distributed group, a centralized group, a telecommunications group, and a technical support group.

A Small Operations Group

The example in Figure 2-1 shows an operations group that consists of entry-level through senior-level staff. Operations activities are performed by two computer room operators and one lead operator. The support, planning, and control activities are performed by a part-time operations specialist. The operations manager performs both planning and control activities and provides line-management support. There might be different numbers of computer room operators for each shift, depending on the size and complexity of the system.

Figure 2-1. A Small Operations Group
(Organization chart: Operations Manager, LM; Operations Specialist, SL; Lead Operator, IL; two Computer-Room Operators, EL.)
LM = Line-management tasks, SL = Senior-level tasks, IL = Intermediate-level tasks, EL = Entry-level tasks

A Distributed Operations Group

The example in Figure 2-2 shows an operations group that supports a network with three nodes at different locations. The operations group consists of entry-level through line-management staff. The entry-level and intermediate-level staff perform operations activities and are distributed to the three sites. Sites A, B, and C each have one lead operator and a varying number of computer room operators. (There might be different numbers of computer room operators for each shift, depending on the size and complexity of the system.) A technical support specialist provides senior-level support to all sites. A senior systems planner performs the planning and control activities. The operations manager provides line-management support to all sites.

Figure 2-2. A Distributed Operations Group
(Organization chart: Remote Operations Manager, LM; Technical Support Specialist, SL; Senior Systems Planner, SL; at each of Sites A, B, and C, one Lead Operator, IL, and Computer-Room Operators, EL.)
LM = Line-management tasks, SL = Senior-level tasks, IL = Intermediate-level tasks, EL = Entry-level tasks

A Centralized Operations Group

The example in Figure 2-3 shows an operations group that supports a network with three nodes at different locations. The operations group consists of entry-level through line-management staff, all located at Site A. There might be different numbers of personnel for each shift, depending on the size and complexity of the systems and network. Site A serves as the control node for the other nodes in the network. The staff located at Site A manages the systems at Sites B and C. Sites B and C are unattended.

Figure 2-3. A Centralized Operations Group
(Organization chart: Operations Manager, LM; at Site A, Operations Specialist, SL, Lead Operator, IL, and three Computer-Room Operators, EL; Sites B and C, unattended.)
LM = Line-management tasks, SL = Senior-level tasks, IL = Intermediate-level tasks, EL = Entry-level tasks

A Telecommunications Group

The example in Figure 2-4 shows a telecommunications group that is typical of organizations that support multiple vendors. The group consists of entry-level, intermediate-level, and line-management staff. The help-desk operators answer phone calls from all users. The teleprocessing operators perform intermediate-level tasks and maintain communications lines and modems. The manager and supervisors perform line-management tasks. Equipment vendors supply senior-level support on a contract basis.

Figure 2-4. A Telecommunications Group
(Organization chart, with level labels as given in the figure: Network Data Communications and Telecommunications Manager, LM; Equipment Vendors; Teleprocessing Hardware Supervisor, IL; Teleprocessing Planning and Procurement, IL; Help-Desk Supervisor, EL; one Teleprocessing Operator, IL; three Teleprocessing Operators, EL; four Help-Desk Operators, EL.)
LM = Line-management tasks, SL = Senior-level tasks, IL = Intermediate-level tasks, EL = Entry-level tasks

A Technical Support Group

The example in Figure 2-5 shows a centralized technical support group for Tandem systems within a large data processing environment. The technical support function can either be centralized into a single group or can become part of the various organizations in an existing data processing organization. Experience has shown that the centralized group approach is best. A centralized group becomes the center of Tandem expertise within a company and provides support more efficiently than a decentralized technical support group. The group shown in Figure 2-5 provides senior-level and line-management support.

Figure 2-5. A Technical Support Group
(Organization chart: Tandem Technical Support Manager, LM; Technical Support Specialist, SL; three Support Analysts, SL.)
LM = Line-management tasks, SL = Senior-level tasks

Sample Job Descriptions

Once you determine how your operations staff should be organized, you can develop job descriptions for each staff member. By developing formal job descriptions, you can ensure that all levels of required support are provided. Following are sample job descriptions for each operations activity area.

Note. The descriptions do not represent requirements or recommendations. The needs of your group should determine how many people you hire and what each person does.

The Operations Area

The following are sample job descriptions for staffing the operations area:
- Computer room operator
- Help-desk operator
- Lead operator

Computer Room Operator

Following is a sample job description of a computer room operator who performs entry-level tasks in the operations area. Former users, or operators with experience on other computers, are good candidates for this position.

Job Title

Computer-Room System Operator (Entry-Level Position)

Summary of Responsibilities

Computer room operators are responsible for the day-to-day operation of the Tandem computer systems. The operators should have a basic understanding of the Tandem system, the applications, and the peripheral devices. Computer room operators should also have a knowledge of the department's problem-management and problem-tracking procedures.

Detailed Duties and Responsibilities

A detailed summary of duties and responsibilities includes:
- Answer and log phone queries.
- Monitor the system using check lists, command files, and prewritten TACL routines:
  - Monitor and maintain tape inventory and tape units, keep tape drives clean and operational, and order supplies
  - Perform NonStop TM/MP online dumps and audit dumps
  - Perform backups, and restore and manage tapes
  - Monitor and maintain printers, monitor paper and ribbon, order supplies, change printer locations, and print jobs and forms from the spooler

  - Monitor the physical environment in the computer room
  - Monitor terminals, processors, communications equipment, applications, and console messages as instructed by other support levels.
- Install equipment as needed
- Follow problem determination and resolution procedures:
  - Run peripheral self-test routines
  - Identify both hardware and software problems
  - Log problems in the problem log or problem reporting system
  - Follow check lists to solve problems
  - Perform hardware diagnostics
  - Escalate problems that cannot be solved at this level
- Perform procedures as instructed by other support levels:
  - Run regularly scheduled jobs by following check lists
  - Respond to problems that may require restarting jobs
- Maintain computer room security by monitoring physical access to the computer room.

Standards/Objectives

Computer room operators complete all scheduled jobs on time and resolve problems within a specified time or else escalate the problem to the appropriate intermediate-level operations personnel.

External Contacts

Computer room operators interact with users, help-desk operators, intermediate-level operators, Tandem customer engineers (CEs), and Tandem analysts.

Tools/Equipment

Computer manuals, operator instructions and documentation, telephones, video terminals, workstations, and error logs should be available to help computer room operators complete their tasks.

Help-Desk Operator

Following is a sample job description of a help-desk operator who performs entry-level tasks. A former user who has a good telephone manner and remains calm under pressure is an ideal candidate for this position.

Job Title

Help-Desk Operator (Entry-Level Position)

Summary of Responsibilities

Help-desk operators answer phone calls and try to resolve user problems. Help-desk operators log problems and escalate problems they cannot resolve to the operators or operations specialists. Help-desk operators must be able to communicate well over the phone, be sensitive to callers' problems, and be patient. They must have a knowledge of applications and business needs so that they can make judgments regarding the impact of problems and user priorities. They must also understand the department's problem-management and problem-tracking procedures.

Detailed Duties and Responsibilities

A detailed summary of duties and responsibilities includes:
- Handle phone queries from users who are having problems:
  - Answer phone queries
  - Log phone queries
- Follow problem determination and resolution procedures:
  - Identify both hardware and software problems
  - Log problems in a problem log
  - Escalate problems they cannot solve while the user is on the phone and provide users with an estimated time of resolution
  - Follow operations check lists
  - Check with the user to make sure that the problem was resolved satisfactorily

Standards/Objectives

Help-desk operators should resolve percent of user problems while the user is on the phone. Performance is evaluated on the basis of the number of calls received (call volume), the number of problems solved without being escalated, the number of calls escalated to the proper person the first time, user satisfaction, and the amount of time users are on the phone waiting for an answer.
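
The evaluation criteria listed under Standards/Objectives can be tallied from a call log. The following Python sketch is illustrative only: the Call record, its field names, the 1-to-5 satisfaction scale, and the summarize function are assumptions made for this example and do not correspond to any Tandem product or problem-tracking system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Call:
    """One help-desk call (hypothetical record layout)."""
    wait_seconds: float                      # time the user waited for an answer
    escalated: bool                          # True if passed to another support level
    routed_correctly: Optional[bool] = None  # meaningful only for escalated calls
    satisfaction: Optional[int] = None       # assumed 1-5 survey score, if collected


def summarize(calls: List[Call]) -> dict:
    """Compute the evaluation measures named in the job description."""
    volume = len(calls)
    resolved = sum(1 for c in calls if not c.escalated)
    first_time = sum(1 for c in calls if c.escalated and c.routed_correctly)
    scores = [c.satisfaction for c in calls if c.satisfaction is not None]
    return {
        "call_volume": volume,
        "resolved_without_escalation": resolved,
        "resolved_rate": resolved / volume if volume else 0.0,
        "escalated_correctly_first_time": first_time,
        "average_wait_seconds": (
            sum(c.wait_seconds for c in calls) / volume if volume else 0.0
        ),
        "average_satisfaction": sum(scores) / len(scores) if scores else None,
    }


if __name__ == "__main__":
    log = [
        Call(wait_seconds=30, escalated=False, satisfaction=5),
        Call(wait_seconds=90, escalated=True, routed_correctly=True, satisfaction=3),
        Call(wait_seconds=60, escalated=True, routed_correctly=False),
        Call(wait_seconds=20, escalated=False),
    ]
    for name, value in summarize(log).items():
        print(f"{name}: {value}")
```

A summary like this could be produced per operator or per shift; which measures matter most, and what targets to set for them, remain decisions for the operations manager.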

External Contacts
Help-desk operators interact with users, computer room operators, intermediate-level operators, and Tandem customer engineers (CEs).

Tools/Equipment
Telephones (preferably with headsets, conference call features, and recording machines), a problem-tracking and problem-escalation system, video terminals or workstations, documentation, and appropriate check lists should be available to assist help-desk operators in completing their tasks.

Lead Operator

Following is a sample job description of a lead operator who performs intermediate-level tasks in the operations area. Former entry-level operators are good candidates for this position.

Job Title
Lead Operator (Intermediate-Level Position)

Summary of Responsibilities
Lead operators assist the entry-level operators in their jobs. In addition, lead operators start and shut down the system, applications, terminals, and other devices; recover from system failures; manage different subsystems, including the spooler; perform procedures as instructed by senior-level technical and applications support; and escalate problems to senior-level operators when necessary. Lead operators should:
Have good oral and written communication skills, good documentation skills, previous entry-level experience, and a good knowledge of Tandem utilities and the overall Tandem system, as well as an understanding of Tandem applications
Understand the company's business needs and the department's problem-management, change-management, and scheduling procedures
Be willing to carry a pager

Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
Ensure that entry-level system monitoring tasks are performed
Perform complicated system monitoring tasks using Distributed Systems Management facilities and system utilities. For example:
Monitor applications controlled by the PATHMON process
Monitor computer subsystems
Analyze problems and take corrective action using Tandem utilities:
Write and maintain operations check lists
Take down and bring up devices
Differentiate between hardware, software, and firmware problems
Resolve terminal-related problems
Manage disks
Verify system integrity by switching devices to their primary and backup paths
Manage file space
Spare bad sectors on disks
Handle processor failures
Dump processors
Reload processors
Perform hardware diagnostics with the Tandem Service Management (TSM) package
Recover from total and partial system failures
Start up and shut down the system
Start and stop network components, terminals, and transaction-processing applications
Notify users of pending hardware and software down time
Escalate problems that cannot be solved at this level
Assemble relevant information for problem reviews
Manage the job scheduling system
Maintain computer room security
Perform procedures as instructed by senior-level support:
Generate system reports for other groups
Control object-code versions
Configure cache
Manage the spooler
Manage NonStop TM/MP
Perform product-specific installation procedures
Use the TSM EMS Event Viewer
Help entry-level support personnel as needed

Standards/Objectives
Lead operators should solve percent of the problems they receive within a specified amount of time or else escalate problems to senior-level operators.

Performance is judged by how quickly operators detect and solve system problems and by the number of problems escalated.

External Contacts
Lead operators interact with users, entry-level operators, senior-level operators, Tandem customer engineers (CEs), and Tandem analysts.

Tools/Equipment
Manuals, operator instructions and documentation, video terminals and workstations, and pagers should be available to help lead operators complete their tasks.

The Support Area

Following is a sample job description for staffing the support area.

Technical Support Specialist

Following is a sample job description of a technical support specialist who performs senior-level tasks in the support area. Operators with education beyond high school and two to three years of Tandem system operations experience are good candidates for this position.

Job Title
Technical Support Specialist (Senior-Level Position)

Summary of Responsibilities
Technical support specialists provide technical assistance to network, hardware, and software users. Technical support specialists might focus on the operating system, on applications, or on data communications. They:
Assist in capacity planning, configuration and change management, and problem management
Handle performance measurement for system tuning
Configure and manage the system software for maximum fault-tolerance and optimal performance
Produce automated routines for entry-level and intermediate-level operations personnel according to a defined set of standards
Provide clear and accurate instructions for tasks that require operator intervention

Technical support specialists should have an in-depth knowledge of the operation of the system, software, hardware, communication lines, software tools, and system management products. They should also be able to communicate well with others.

Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
Develop operational routines using DSM facilities, such as TACL, the Distributed Name Service (DNS), and the Subsystem Programmatic Interface (SPI); and special customized programs that:
Process applications regularly or on an as-needed basis (for example, ad hoc reports)
Monitor the system at regular times (for example, status checks or performance measurements)
Load and shut down applications
Start up and shut down the system
Start processes on peripheral devices
Use TSM to maintain the system and to diagnose hardware problems. Use the TSM EMS Event Viewer to view and analyze events.
Set up tape archiving procedures and document information about tape cycles, tape locations, the backup strategy, and NonStop TM/MP considerations
Manage capacity planning
Monitor system, network, and application performance
Resolve nonstandard and technical problems
As necessary, perform senior-level problem determination, take memory dumps, and take traces
Serve as the Tandem expert for all departments, and accumulate and disseminate Tandem information
Generate system reports for other groups
Provide programming assistance by creating specialized Transaction Application Language (TAL) routines and subroutines and by creating specialized utilities and subsystems

Standards/Objectives
Technical support specialists are evaluated on the number of problems requiring their attention, the turnaround time on problems, how well users' level-of-service expectations are met, and how well the system is operating.

External Contacts
Technical support specialists interact with programmers in the application development group; entry-level, intermediate-level, and senior-level operations and support personnel; management; Tandem analysts; and Tandem customer engineers (CEs).

Tools/Equipment
A full set of Tandem manuals, video terminals or workstations, system management utilities, and pagers should be available to help technical support specialists complete their tasks.

The Planning Area

Following is a sample job description for staffing the planning area.

Senior Systems Planner

Following is a sample job description of a senior systems planner who performs senior-level tasks in the planning area. Operators with education beyond high school and two to three years of Tandem system operations experience are good candidates for this position.

Job Title
Senior Systems Planner (Senior-Level Position)

Summary of Responsibilities
Senior systems planners are responsible for developing plans for all aspects of operations management, including performance and capacity management, security management, disaster recovery, network design, application review, and production management.

Senior systems planners should have an in-depth knowledge of the operation of the system, software, hardware, communication lines, software tools, and system management products. They must have a knowledge of the business needs of the organization so that they can make judgments regarding the impact of future plans. They must also understand the department's change-management and configuration-management procedures. In addition, they should be able to communicate well with others.

Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
Plan for system capacity needs
Assist in developing and maintaining programming and operational standards and procedures, and in evaluating hardware and software
Create hardware recovery procedures for use by operators, including instructions on handling a disk, communications, processor, system, or power failure
Assume responsibility for system security
Define problem management requirements
Plan for disaster recovery

Plan for implementation of upgrades and new releases
Advise development on application design
Assess, with the network or teleprocessing specialist, the impact of changing the communications environment

Standards/Objectives
Senior systems planners are evaluated on how well users' level-of-service expectations are met and how well the system is operating.

External Contacts
Senior systems planners interact with programmers in the application development group, senior-level personnel, management, Tandem analysts, and Tandem customer engineers (CEs).

Tools/Equipment
A full set of Tandem manuals, video terminals or workstations, system management utilities, and pagers should be available to help senior systems planners complete their tasks.

The Control Area

Following is a sample job description for staffing the control area.

Senior Configuration Planner

Following is a sample job description of a senior configuration planner who performs senior-level tasks in the control area. Operators with education beyond high school and two to three years of Tandem system operations experience are good candidates for this position.

Job Title
Senior Configuration Planner (Senior-Level Position)

Summary of Responsibilities
Senior configuration planners are responsible for enforcing standards, managing the change control process, and managing the system schedule. A senior configuration planner should understand the change-management, problem-management, and scheduling procedures.

Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
Maintain the system scheduler
Manage the configuration-control and change-control process

Create and maintain appropriate hardware, software, and application configurations
Install new and changed operating system images and software releases
Develop and maintain programming and operational standards and procedures for the operations group
Assist in providing quality assurance testing for new applications and new system software
Participate in the evaluation of operations management software
Advise development on application design

Standards/Objectives
The performance of senior configuration planners is judged by the number and frequency of schedule errors, the number of software and application change installations that fail, and the number of problems that occur because standards are not documented or because they are documented poorly.

External Contacts
Senior configuration planners interact with programmers in the application development group; entry-level, intermediate-level, and senior-level operations and support personnel; management; Tandem analysts; and Tandem customer engineers (CEs).

Tools/Equipment
A full set of Tandem manuals, video terminals or workstations, system management utilities, and pagers should be available to help senior configuration planners complete their tasks.

The Operations Manager

Following is a sample job description of an operations manager who performs line-management tasks.

Job Title
Operations Manager (Line-Management Position)

Summary of Responsibilities
Operations managers set policies in such areas as security, problem escalation, and disaster recovery. Operations managers also handle personnel, budgeting, and other department administration tasks.

Detailed Duties and Responsibilities
A detailed summary of duties and responsibilities includes:
Set up the operations organization
Manage the daily operations and staff. Remove obstacles that inhibit the staff's performance
Manage computer room space
Manage new software releases and new product installations
Prepare for future changes to the system and oversee capacity planning
Order hardware and software
Handle budgets, hardware maintenance contracts, and service-level agreements
Recruit and train staff, and plan the staff's career development
Set policies regarding problem reporting and tracking, disaster recovery, change control, configuration management, and security
Manage staff schedules
Survey users to determine the quality of service provided by the staff

Standards/Objectives
The performance of operations managers is judged by how well service-level agreements are fulfilled, how well the system is operating, how well the staff are performing, and how well the budget is managed.

External Contacts
Operations managers interact with all levels of staff in all activity areas, with staff and managers in other departments, Tandem analysts, and Tandem customer engineers (CEs).

Tools/Equipment
Manuals, video terminals or workstations, administrative tools, and pagers should be available to help operations managers perform their tasks.

Training

Once you determine how to allocate the support tasks among your department, you can evaluate your staff's training needs. Training is available from many sources: Tandem Software Education, Tandem manuals, experienced people within your company, and other vendors' manuals and classes.

Tandem Education

If you are receiving a Tandem system for the first time, your entire staff will need some training. If your staff already has Tandem experience, they may need training only for new hardware or software, or for career development. Tandem Education offers training solutions for all levels of support and for all levels of learning needs.

Tandem Education offers lecture-based courses, customized courses, independent study programs, computer-based training, and videotapes. Lecture-based courses are offered at Tandem training centers throughout the world. Customized courses are offered on demand. These types of courses range from an introduction to Tandem concepts and facilities to specialized courses on Tandem software products and operations.

Independent study programs, computer-based training programs, and videotapes provide on-site, tailored, and flexible training alternatives. These training programs allow you and your staff to study at your own pace and at your own location. These programs are particularly helpful to entry-level support personnel and to people new to Tandem systems. Tandem also offers courses for intermediate-level through senior-level support personnel.

Tandem's education planning services are available to assist you in planning and scheduling the training you need to increase the productivity of your staff and your Tandem resources. To obtain a list of new and updated courses for support personnel, ask your Tandem representative for the latest copy of the Tandem Education Course Catalog. The catalog contains a complete list of courses, training programs, and training centers.
The catalog also contains diagrams showing training paths for a variety of Tandem users, including the diagrams shown in this section and diagrams for network managers, programmers, database administrators, systems and operations management, and technical specialists. This course information is also available from the Tandem Education home page (accessible from the Tandem WWW home page). For enrollment information in the U.S., call Tandem Education Enrollment. For enrollment outside of the U.S., contact your local Tandem representative.

Tandem Manuals

Manuals provide introductory, procedural, and reference information for Tandem products. The manuals corresponding to your system configuration should be on hand at all times and easily accessible to the operations staff. You might need more than one set of manuals, depending on the number of employees and where they are located.

To help you remain current with the latest Tandem manuals, Tandem manuals are available on CD-ROM disc to be viewed with the Tandem Information Manager (TIM) product. You can also order Tandem manuals in book form. For a complete list of Tandem manuals, refer to the About This Collection document in the G01.00 TIM collection.

In-House Training

A very useful type of training is on-the-job training. On-the-job training is most effective when it is well planned and is most valuable for entry-level personnel. Advanced employees need more structured and in-depth training on specific products and tasks (for example, NonStop TM/MP or capacity planning). Tandem courses and educational materials are a useful addition to in-house training efforts and together provide a complete curriculum.

Other Vendor Training

Whenever you buy hardware or software, make sure that the vendor offers training and documentation. Training and manuals are effective means for employees to learn how to use products. The faster they learn, the faster they can use the products, and the faster your investment pays off.

Check List

The following check list covers the main points of this section.
1. Structure your organization so that it most effectively and efficiently provides the entry-level through senior-level operations, planning, control, and support activities your company needs.
2. Define each person's job duties. Make sure that there is a well-defined path for problem escalation and for career growth.
3. Define how each person's performance is evaluated.
4. Determine the staff's training needs.
5. Provide the necessary training.
6. Provide training for career development.


3 The Operations and Support Areas

Overview

Before receiving your Tandem system, you need to prepare the operations environment. The operations environment includes both the operations and support areas. The operations area is where you locate the computer systems and peripherals (such as printers). The support area is where the operations staff are located. In some companies, the operations and support areas are the same; in other companies, the areas are separate.

This section provides guidelines for setting up an operations environment. A check list at the end of this section summarizes the main points.

A Computer Room or an Office

The operations environment depends on the type of systems installed. NonStop Himalaya S-series servers can operate in offices or computer rooms, depending on the system configuration.

Selecting a Location

Following are considerations for selecting a location. Considerations for both computer-room and office environments are listed first, then computer-room considerations, and finally office-environment considerations.

Both Computer-Room and Office Environments

When selecting and preparing the operations areas, there are a number of items to consider, including the following:
The computer room or office should be accessible from the delivery receiving area. You should make sure that all required deliveries (the initial system delivery and regular supply deliveries) can be transported to the appropriate location. Make sure that all hallways and doors are wide enough. If an elevator will be used for deliveries, make sure that the elevator is strong enough to transport the systems.
Consider the work flow when developing a floor plan. Especially in environments with large numbers of batch jobs, the layout of the computers and peripherals can either ease or impede the flow of work. For example, by placing tape drives and printers in different areas, you can prevent operators who are loading tapes from getting in the way of operators who are maintaining the printers.
Determine whether data communications lines are needed. Arrange for the installation of all lines and at least one telephone. To ensure that at least one data communications line will be up at all times, ask your supplier to select different routes for each line.

Reserve storage space for data processing supplies, manuals, equipment, and archived material.
Depending on your needs, you might want to plan a tape library. A tape library is a separate area or room that contains backup tapes, site update tapes, software release tapes, NonStop TM/MP online dumps and audit dumps, and any tapes required to run applications. Tape libraries help you store, organize, and protect information. Tape libraries usually have:
A controlled environment to preserve the tapes
Locks to prevent unauthorized personnel from accessing stored data and programs
Fire-detection and fire-extinguishing equipment
Storage racks
In addition, you might want to consider installing fire-proof vaults.
Depending on your needs, you might want to plan off-site storage for critical information. Off-site storage protects your information from accidents that might occur at your operations facility. Off-site storage facilities are available from contractors in most cities.
The computer room or office should have enough lighting to allow operators to perform their tasks. Avoid brilliant lighting since it reduces the visibility of terminal displays and equipment indicator lights. Diffused lighting is best.

Computer Room Environments

The physical site and location of a computer center are of prime importance to your operations. You can minimize the danger of disaster and of breaches in security by choosing a good site.

Site Location

Follow these guidelines when deciding where to locate the computer facility:

Note. These guidelines are intended to help you provide as much physical protection as possible. Select the guidelines that fit your needs and the level of risk your operations can tolerate.

Avoid sites near flood areas or earthquake fault lines.
Determine whether the computer center should be located at a remote, computer-only site or with other business operations. The farther your computer center is from streets, freeways, parking lots, and vehicles, the safer it is. A major drawback to a remote, computer-only site is the expense. Remote computer centers usually cost more than computer centers located with the rest of the business operations.
If you are setting up computer operations in a multistory building, locate the computer center in the middle of the building. Centers located at the street level are accessed easily and provide opportunities for entry or damage. Basement locations are at greater risk of damage caused by faulty plumbing and flooding.
Depending on your company's needs, you might want to consider creating a high-security facility to prevent disasters. Such a facility has:
A perimeter security system to limit site access to only those people who are critical to operations or support
Redundant hardware, including redundant environmental systems and communications equipment

Computer Room Preparation

Consider the following when planning the computer room:
Air conditioning and electrical service must be available. Your Tandem representative will inform you of the requirements. Prior to the installation of any computer equipment, request an environmental audit (with a written report). An environmental audit helps you ensure that the air conditioning and electrical service will be of high quality. Tandem and other vendors can perform an environmental audit.
The room should be protected from disasters. You can protect the computer room by:
Installing sensors for detecting smoke, high temperature, and high humidity.
Fireproofing the room and installing fire-extinguishing systems.
Providing a dedicated heating and cooling system. Computer room fires are rare, but smoke damage from someone else's fire is more common.
Enforcing a strict no-smoking policy. Some insurance companies choose not to cover a company that allows smoking in its computer room.
Ensuring that water pipes are not located over the computer room (to prevent flooding).
Installing a Halon gas system.
You might want to add a second power source or uninterruptible power supply (UPS). In some installations, geographic location or system size might require that you also add a second or alternate air-conditioning system. When adding air conditioning, avoid installing water pipes directly over computer equipment and in the floors directly above the computer room.
The computer room floor should be strong enough to support the weight of all necessary equipment. If you are unsure of the floor strength, contact a qualified structural engineer.
The ideal flooring for computer rooms is a raised floor. Raised floors allow you to route equipment cables freely and to protect cables from damage. If you do not choose to use a raised floor, make sure that cables do not get in the way of the staff and that they are installed in accordance with all safety standards and regulations.

Plan the computer room layout to increase the efficiency of work flow and personnel traffic.
To allow for growth, consider selecting a computer room that provides enough space for the initial installation as well as for future expansion.
If sensitive information is displayed or stored in the computer room, cover computer room windows. Uncovered windows might allow unauthorized personnel to obtain information.

Office Environments

Follow these guidelines when selecting a location for your systems in an office area:
Ensure that the intended equipment location allows enough room for both the equipment and any needed furniture.
Plan the cabling layout so that the cables do not impede traffic, are protected from damage, and are installed in accordance with all safety standards and regulations.
Ensure that environmental conditions (temperature, humidity, and altitude) meet system specifications. If there will be many people and systems in the same room, you might need more air conditioning than is normally used in offices.
Ensure that electrical power meets system specifications. Determine whether you need to add power surge suppressors to protect your systems from abrupt changes in power. (Many systems come with built-in surge suppression.)
Install dust, smoke, and static electricity controls.
Determine whether you need noise controls. If many systems and peripherals (such as printers) will be in the same room, you might need to add noise control devices (for example, baffles or white noise).

The Environment

Larger computer systems are sensitive to environmental changes, whereas office systems are sensitive only to extremes in temperature. To prevent system malfunctions in both environments:
Make sure that you and your staff completely understand system requirements for operating temperature, humidity, storage temperature, altitude, and air quality.
Install controls to monitor the environment and to detect environmental changes before the changes affect the systems.

Develop procedures for protecting the systems when an environmental system (such as air conditioning) malfunctions. The procedures should include the following information:
Whom to contact when a malfunction occurs
How long the computer systems can run and when the staff should shut down the systems
How to start available backup systems

Physical Security

Your company security policy will determine the physical security precautions you take. The strictest policy requires that all areas of the data processing center be protected, including the operations and support staff, the data processing center, equipment, material and supplies, software applications, and data. You might not need to provide such strict security, particularly if you are managing office-environment systems. Whatever type of system you are managing, you should install the security controls required for the level of risk your organization is willing to tolerate.

Refer to Section 9, Security Management, for more detailed information on system security. Section 9 provides guidelines for security policies, describes items that should be protected, and provides suggestions for protecting your systems.

Equipment and Supplies

Having the proper equipment and supplies available helps your staff work efficiently and effectively. Follow these guidelines concerning equipment and supplies:
Provide equipment (such as printers) for logging operator messages. Locate printers as far away as possible from disk drives to prevent printer dust from damaging the disks.
Determine which supplies should be kept on hand. You will probably need printer paper, ribbon, and toner; spare disks; spare tapes; and cleaning supplies.
Determine whether a system console is needed. You might want to acquire additional consoles and console applications that facilitate system monitoring. For example, some companies have several terminals that are dedicated to running different system management applications. The TSM console would display system status information; another terminal might display Network Statistics Extended (NSX) information to help operators monitor a network. Still another terminal might allow operators to view operator messages and system status information. Your Tandem representative can help you determine what type of additional consoles and applications would be useful.

78 The Operations and Support Areas System Installation Determine whether a voice alert system is needed. Voice alert systems send a message over a loud-speaker when major problems occur or when certain people are needed. Your Tandem representative can tell you what types of alert systems are available. Provide at least one telephone near the system cabinets and terminals to be used for operations. Providing a telephone with a cord long enough to allow operators to communicate while performing operations tasks is very helpful. The ability to use the telephone while performing tasks is particularly important when problems occur. A telephone with a long cord saves operators from the frustration of running back and forth between the system and the telephone when it rings or when the operator is executing instructions. A telephone that has a headset can also increase efficiency. Headsets allow operators to work with both hands while solving a problem (instead of using one hand to hold a telephone handset). If your systems or software require automatic calling units or modems, install extra telephone lines if necessary. Consider providing a terminal dedicated to online documentation. Tandem manuals are provided on CD-ROM disc and can be accessed through the Tandem Information Manager (TIM) product. Manuals are also available in printed form. If you choose to order printed manuals, determine which manuals should be kept near the systems. Locating manuals in appropriate locations increases your staff s efficiency. For example, place all manuals that explain how to use the hardware near the hardware, all manuals that explain how to perform operation tasks in a location easily accessible to operators, and operator messages manuals near message logging devices. If the staff does not have a handheld vacuum cleaner, consider purchasing one. System Installation After you have prepared the operations area, the system can be installed. 
The following paragraphs list the system installation steps for computer-room and office environments.

Computer Room Environments

Tandem customer engineers (CEs) install most new systems free of charge. Tandem customer engineers:

1. Specify the operating environment requirements for the Tandem equipment to be installed
2. Make sure that the equipment inventory is complete and was not damaged in shipping
3. Install the equipment, start up the system, and test the system

Your responsibilities during the installation process include:

- Providing a suitable computer room, environment, and facilities in accordance with current published guidelines, including:
  - Providing electric power to system cabinets and peripherals (CEs provide the specifications)
  - Furnishing and testing AC power requirements as needed
  - Connecting and testing all external communications equipment
- Furnishing all labor required for unpacking and for placing each item of hardware in the desired location

Office Environments

Office systems are specially designed to be customer-installable. Each office system comes with detailed installation instructions. The following list summarizes the installation tasks:

1. Select locations for the equipment.
2. Prepare the locations to ensure that all environmental requirements are met.
3. Take inventory to ensure that all equipment is available and in good condition.
4. Install the equipment, start up the system, and make sure that it is functioning properly.

For an extra fee, Tandem will install office systems.

Preventive Maintenance

Regular maintenance helps prevent system malfunctions. The following paragraphs list maintenance tasks your staff should perform regularly.

Note. To help you in your preventive maintenance efforts, Tandem offers a hardware preventive maintenance service. As part of the program, a Tandem customer engineer (CE) cleans, inspects, tests, and adjusts hardware, and checks error logs at regularly scheduled times to ensure that all system components are functioning properly.

Both Computer-Room and Office Environments

Your staff should perform the following maintenance tasks in both computer-room and office environments:

- Perform maintenance and diagnostic testing of hardware and software to detect possible failures before they occur.
  Section 5, Production Management, lists the tasks the operations staff should perform to make sure that the software and hardware are running properly. The TSM package provides the tools required to query and test resources, generate system problem notification, and configure the system for local or remote access. TSM diagnoses system problems as they occur and often detects failures before they affect the system's performance.
- Make sure that air vents are not blocked.
- Make sure that all fire-detection and fire-extinguishing equipment works properly.
- Keep computer areas clean. Accumulated debris can cause accidents and fires.
- As appropriate, clean tape drives and printers regularly.

Computer Room Environments

In computer room environments, incorporate the following maintenance tasks into the staff's regular routine:

- Replace air-conditioning filters at regular intervals to prevent hardware from overheating or failing.
- Monitor the computer-room temperature and humidity constantly. Know how to turn off power should the temperature or humidity rise beyond the point considered safe for your systems. Computer boards subjected to extremes in temperature might cause problems for months afterward.
- Keep the area under a raised computer floor clean and free of debris and dust.
- Prohibit eating, drinking, and smoking in the computer room.

Office Environments

Regular preventive maintenance is important in office environments, especially to prevent damage from dust, cigarette smoke, and spilled food or drink. Incorporate the following maintenance tasks into the staff's regular routine:

- Inspect air filters once a month.
- Replace air filters twice a year or as needed.

Support Areas

Support areas are usually located as close to the systems as possible. For example, operators for office systems might have a support area in or near the offices where the systems are located, and computer room operators might have a support area next to the computer room. In computer room environments, a support area separate from the computer room provides the operations staff with a more comfortable working area (less noise and a comfortable temperature).
Usually, support areas have the following items:

- Desks and standard desk equipment
- A comfortable temperature
- A telephone with a long cord (and perhaps a headset)
- A complete set of manuals, including Tandem manuals, other vendor manuals, and internal documentation
- Necessary operations forms, such as problem report forms, system administration forms, shift logs, and so on
- A copy of the group's disaster recovery plan and problem escalation procedures
- Terminals for performing tasks. To ensure that operations personnel will have access to the system in the event of a single-component failure or a looping process, you might want to:

  - Provide at least two terminals reserved for system monitoring and problem resolution
  - Provide at least one terminal in which the command interpreters run at a high priority (for example, 199)
  - Connect the terminals to different controllers to reduce the risk of losing system access if a controller fails

If your group has a help-desk function, you might also have to provide help-desk equipment, such as:

- Problem report forms
- Tandem manuals and other vendor manuals
- Terminals for performing tasks and solving problems
- Procedures for executing peripheral device self-tests (for testing devices such as terminals, workstations, bar code readers, printers, modems, multiplexers, and so on)
- A list of contacts for problem resolution
- Telephones with the appropriate functions to increase the efficiency of the help desk. Some of the telephone system features that might be useful include:
  - Answering machines
  - Autodial systems
  - Headsets
  - Logging systems to record the length of each call, the amount of time a telephone rings before it is answered, the number of incoming and outgoing calls, the number of callers who hang up when they receive a recorded message, and so on

Check List

The following check list summarizes the main points of this section:

1. Determine the type of environment your systems require.
2. Select the location for your system. The location should:
   - Be the safest and most secure available to you
   - Provide all system and environmental requirements
   - Have enough space for all equipment and for storage areas
   - Have all required data communications lines and telephones
3. Take precautions to protect hardware, software, and data from intruders and untrained personnel.
4. Obtain all needed equipment and supplies, such as consoles, terminals, printer supplies, and manuals.

5. Once the site has been prepared and all necessary requirements met, install the systems. Most new computer room systems are installed by Tandem customer engineers (CEs).
6. Plan for preventive maintenance:
   - Develop procedures and schedules
   - Arrange for CE support if needed
   - Keep all equipment and work areas clean
7. Make sure that the support areas are located near the systems and provide all necessary equipment (for example, manuals, telephones, forms, and so on).

4 Operations Documentation

Overview

Operations documentation can help your operations organization perform efficiently and effectively. This section lists and describes the types of documentation often used in an operations environment. A documentation check list is provided at the end of this section to help you select the appropriate operations documentation for your environment.

What Is Operations Documentation?

Operations documentation is online or hard-copy information the staff needs to run the systems, including manuals, contracts, standards, procedures, service-level agreements, online help files, system configuration diagrams, schedules, and recovery plans. By taking the time to determine your organization's documentation needs and ensuring that documentation is available when needed, you can help your staff to function efficiently. Not only is it important to have the documentation, it also is important to:

- Update the documentation regularly.
- Store the documentation in locations easily accessible for use and updating. (Online files accessible by a text editor can make it easier for the staff to keep documentation current.)
- Inform the staff when updates are made or when the location of documentation has changed.

Having out-of-date, incomplete, or inaccessible documentation causes frustration and mistakes. By providing a complete set of current documentation, you can:

- Reduce the risk of operator errors
- Train new employees
- Ensure that problems are solved quickly
- Provide for an efficient and effective operations organization

Policies, Standards, and Procedures

All operations policies, standards, and procedures should be fully documented, including:

- Service-level agreements that establish the goals and objectives of the operations organization, against which the organization's performance is measured.
- The company's security policy and the procedures the operations staff should follow to maintain security.
- Your organization's problem-escalation policy and the procedures to follow to report, track, and solve problems.

- Job duties and performance standards for each staff member, each operations group, and the complete operations organization.
- The company's disaster recovery plan and the procedures the operations staff should follow to implement the plan.
- Naming conventions for systems, volumes, subvolumes, files, devices, event filters, and programs. Standardized names make it easier for you to find files, monitor programs, and solve problems. For guidelines on assigning names within Tandem systems, refer to Configuring Controllers for NonStop Systems.
- Configuration-management and change-control policies and procedures.
- Operations requirements for internal applications software.
- Policies and procedures for shutting down a system and for shutting down applications, including instructions on when to inform users and how to schedule down time. You might also want to develop and require a shutdown request form that provides the following information: the reason for the shutdown, the persons to be notified, who must approve the shutdown, backup requirements, and any other information relevant to your operation.
- Policies and procedures for scheduling and processing job and work requests.
- Procedures for transferring duties or tasks from one shift to another.

Service-Level Agreements

Service-level agreements specify the level of service that operations should provide. The agreements specify the company's business goals and determine the operation's management objectives, requirements, and standards, with the intention of aligning operations goals with the goals of the company. Most service-level agreements specify the following:

- The availability and reliability standards for systems, peripherals, networks, and applications.
  For instance, the agreement might specify that the system must be running 100 percent of the time, that individual processors must not be down for more than one hour per 40-hour work week, that all applications must be available 24 hours per day except for scheduled down time, and that 90 percent of the communications lines must be connected to the network throughout the day.
- Acceptable performance goals, such as acceptable system response time, deadlines for batch jobs, and transaction volumes. For example, the agreement might specify that 95 percent of all transactions for an application must be completed within one second and that all batch jobs must be completed within 6 hours of submission.
- Priorities for resource allocation.
- Requirements for problem resolution, such as how quickly problems must be resolved before they are referred to a higher authority, how many problems should be resolved by one level of support before the problems are escalated to a higher level of support, and so on.

- The acceptable level of service when the system is stressed, such as during peak workloads, partial equipment failures, and medium-term increases in workload.

Creating Service-Level Agreements

When creating service-level agreements, following a few simple steps can help ensure that the operations services provided match the needs and expectations of your users. For each service, the following steps should be performed with the user (or someone representing the user):

1. Define the business need.
2. Establish quality requirements.
3. Establish measurements for the service.
4. Follow up and adjust the process.

Step 1: Define the Business Need

The first step in creating a service-level agreement involves determining what business need or goal you are trying to meet with the service. For instance, by implementing automation to detect and recover from problems before problems affect end users, you can reduce help-desk phone calls. This service helps meet the business goal of improving staff efficiency.

Step 2: Establish Quality Requirements

For each service you provide, there will be a number of requirements that the user will consider critical, such as availability, reliability, performance, service levels, data integrity, data security, and ease of operations. Decide with the user which requirements apply to the specific service. Be sure to define the requirements in sufficient detail for someone else to understand.

Step 3: Establish Measurements for the Service

Once you and the user have agreed upon the requirements for a service you are providing, you further refine the agreement by including measurements for each of the requirements. It is best to specify measurable requirements only. If you specify a requirement you can't measure, it will be difficult to determine whether you are fulfilling the requirements.
For example, an agreement that specifies that the users must be satisfied with the system response time is difficult to fulfill, since it is difficult to measure user satisfaction. A better agreement would specify a response-time goal measured in units of time. Some other examples include:

- An availability requirement: response time to a device-down call will be under 10 minutes.
- A reliability requirement: application release will not include any modules that have not been through post-development QA.
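A measurable requirement of this kind can be checked mechanically against collected response times. The following is a minimal Python sketch, not a Tandem tool: the `SlaTarget` structure, the function names, and the sample measurements are all hypothetical, and the target mirrors the "95 percent of all transactions within one second" example given earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class SlaTarget:
    """A measurable service-level requirement (hypothetical structure)."""
    name: str
    threshold_seconds: float  # a transaction passes if it completes within this time
    required_fraction: float  # fraction of transactions that must pass


def fraction_within(response_times, threshold_seconds):
    """Return the fraction of response times at or under the threshold."""
    if not response_times:
        raise ValueError("no measurements to evaluate")
    passed = sum(1 for t in response_times if t <= threshold_seconds)
    return passed / len(response_times)


def meets_target(response_times, target):
    """Return True if the measured sample satisfies the target."""
    measured = fraction_within(response_times, target.threshold_seconds)
    return measured >= target.required_fraction


# Illustrative target: 95 percent of transactions within one second.
one_second = SlaTarget("transaction response", 1.0, 0.95)

# Made-up sample: 19 fast transactions and one slow one, in seconds.
sample = [0.2] * 19 + [1.5]
print(meets_target(sample, one_second))
```

A requirement such as "users must be satisfied" has no equivalent of `threshold_seconds` or `required_fraction`, which is why it cannot be evaluated this way.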

As an alternative, you can specify a check list. This method is especially effective for measuring services. For example:

- Monthly preventive-maintenance procedures for G-series systems will include the following...
- Proposals for system configuration changes will follow the outline provided and include review signoff from the following groups...
- Application modules will be tested under the following failure-recovery simulation conditions...

Step 4: Follow Up and Adjust the Process

Periodically, review your service-level agreements to make sure that your organization's objectives and the business goals of your company are still aligned. There might be several reasons for changing your service-level agreements. For example:

- New technology or tools in your organization might change the requirements or measurements for a service.
- Users of a particular service have changed their requirements for that service.
- New business goals for your company might require a change in the measurements for a specific service.

Agreements, Contracts, and Supporting Documents

All sales agreements, service agreements, and maintenance contracts should be on file and available in case questions or problems arise. You should also have supporting documentation (such as purchase orders) to help you track orders and contact vendors.

Configuration Diagrams and Listings

Configuration diagrams and listings help the operations staff monitor systems, recognize problems, and prepare for configuration changes. Useful diagrams include:

- Network diagrams. Network configuration diagrams show the network nodes and the lines that connect the nodes. On the network diagram, you can also identify the line types, such as Expand, SNAX, X25AM, and so on.
- System diagrams. System diagrams show the names and connections of processors, devices, and paths.

Figure 4-1 shows a simplified network configuration diagram. Figure 4-2 shows a system configuration diagram for a four-processor system, including the router connections, ServerNet expansion boards (SEBs), and ServerNet adapters used by the ServerNet system area network (ServerNet SAN).

For G-series systems, the ServerNet SAN provides the communication path used for interprocessor messages and for communications between processors and I/O controllers. Because each ServerNet link is point-to-point, different links can be in use simultaneously with different messages. The ServerNet architecture maintains an important similarity to the architecture of previous Tandem systems: each processor has two independent paths to other processors or ServerNet adapters. No single failure can disrupt communications among the remaining units. Also, the two paths (X and Y) shown in Figure 4-2 can be used simultaneously to improve performance. Refer to the ServerNet Communications and Configuration Manual for a detailed description of the ServerNet SAN subsystem.

The ServerNet wide area networking (WAN) subsystem is a collection of software and hardware components that provides G-series systems with networking and data communications capabilities. The ServerNet WAN subsystem is used to configure both WAN and local area network (LAN) connectivity for supported communication (COMM) subsystem objects.

Figure 4-1. A Network Configuration Diagram
[Diagram: network nodes \LA, \DALLAS, \SAC, \AUSTIN, and \BOISE, connected by lines labeled $LH1A, $LH1B, $LH2A, $LH2B, $LH3A, $LH3B, $LH4A, $LH4B, $LH5A, and $LH5B.]

Figure 4-2. A System Configuration Diagram
[Diagram: four processors, each with an X path and a Y path; routers; ServerNet expansion boards (SEBs); ServerNet adapters; and disks. Legend: SEB = ServerNet Expansion Board.]

Some of the important configuration listings include:

- The Subsystem Control Facility (SCF) INFO command. Use the SCF INFO command to display system configuration information for a specified device object (for example, DISK, TAPE, or ADAPTER), including the current attribute values for that object. Refer to the SCF Reference Manual for the Storage Subsystem for a detailed description of the SCF INFO command.
- The SCF STATUS command. Use the SCF STATUS command to display the current status of a specified object. To display the status of all disks, use the SCF STATUS DISK $* command. To display the status of all tape devices, use the SCF STATUS TAPE $* command. Refer to the SCF Reference Manual for the Storage Subsystem for a detailed description of the SCF STATUS command.
- A process pair directory (PPD) listing. A PPD listing taken just after system and application startup provides a quick reference of process names and their primary and backup processors.
- Listings from SCF commands (INFO, LISTDEV, STATUS) used to display the status of terminals, printers, and communications lines. Refer to the SCF Reference Manual for Himalaya S-Series Servers and the ServerNet Communications Configuration and Management Manual for detailed information about these commands for device types other than disk or tape.
- Spooler configuration listings. The spooler configuration file shows the spooler configuration and the names of the spooler components. The spooler interface (SPOOLCOM) commands allow you to print the current configuration and the spooler routing structure.
- System configuration listings called CONFLIST and CONFTEXT. These listings are generated during the system generation (SYSGEN or SYSGENR) phase of the DSM/SCM program. (DSM/SCM is used to generate a new system image and to install new software releases.) CONFLIST is the output file produced by SYSGENR.
  As SYSGENR processes the CONFTEXT configuration file, it writes any action taken to the CONFLIST file, including error and warning messages that are helpful to intermediate-level and senior-level support personnel. CONFTEXT is the configuration file used as input to SYSGENR. It contains a series of entries defining the Tandem NonStop Kernel operating system image for all processors in the system, and is helpful to intermediate-level and senior-level support personnel and CEs. For G-series systems, the CONFTEXT file consists of one or two paragraphs: DEFINED (optional) and ALLPROCESSORS. Because SCF configures peripheral devices and I/O processes, all other CONFTEXT files required for D-series systems are removed.
- DSM/SCM configuration reports. Standard reports extracted from the DSM/SCM database and formatted by Structured Query Language (SQL) are available to help DSM/SCM users analyze software and resources. Standard reports include reports that list the current set of products for a selected system, reports that provide detailed pictures of the contents of a given configuration by product, and reports that can be used to determine the requisite interim product modifications (IPMs) to add to a configuration. In addition to the packaged reports, you can use the SQL report writer to design and produce custom reports whose contents are tailored to your specifications.
- Application configuration listings. These listings help the staff monitor applications and ensure that all required processes are running.
- Database configuration listings. These listings show the major databases and the applications that access them. The staff can use these listings to monitor disk I/O load balancing and identify the applications that might be affected by a potential disk failure.
- NonStop SQL/MP catalog listings. These listings show where the NonStop SQL/MP catalogs are located on the network.

Note. Although configuration diagrams and listings can help operators monitor the system, using online monitoring tools that display up-to-the-moment status information can improve system reliability and availability. By monitoring your system online, you are notified immediately of error conditions, state changes, and thresholds that have been exceeded, allowing you to react to problems or changes more quickly. For more information about monitoring systems online and for online monitoring tools provided by Tandem, refer to Section 5, Production Management.

Flow Diagrams

A flow diagram is a visual representation of a sequence of steps. Flow diagrams can help the staff understand the steps required to perform a task. Useful flow diagrams include:

- Activity flow diagrams. Activity flow diagrams show the sequence of activities performed in an operations environment during a 24-hour period. By diagramming the activities performed during the day, you can:
  - Get a global view of what goes on in your operations environment.
  - Safeguard against oversight, ensuring that all activities that must be performed are performed.
  - Identify the dependencies required for each activity.
    For example, before you run the Payroll application, you must load the Employee database.
  - Identify tasks that can be automated. For example, if application X must always run before application Y, automate the task.
- Process flow diagrams. Process flow diagrams show the actual processes required to perform an activity or task and can include files, commands, and hardware used.

Figure 4-3 shows an activity flow diagram and Figure 4-4 shows a process flow diagram.
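Once the dependencies between activities have been written down, a valid running order can be derived automatically. The following is a minimal Python sketch that uses the standard-library `graphlib` module (Python 3.9 or later). The activity names come from the examples above; the dictionary layout and the `run_order` function are hypothetical and are not part of any Tandem product.

```python
from graphlib import CycleError, TopologicalSorter

# Each activity maps to the set of activities that must complete first.
dependencies = {
    "Load Employee database": set(),
    "Run Payroll application": {"Load Employee database"},
    "Run application X": set(),
    "Run application Y": {"Run application X"},
}


def run_order(deps):
    """Return the activities in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the activities that form the circular dependency.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


order = run_order(dependencies)
for step, activity in enumerate(order, start=1):
    print(step, activity)
```

A circular dependency (for example, two activities that each wait for the other) is reported as an error instead of producing a schedule, which is also a useful check on the diagram itself.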

Figure 4-3. An Activity Flow Diagram: Activities Performed in a 24-Hour Period
[Diagram: a cyclic flow from START to BACK TO START containing the following activities: Load Bonds Rates; Get Controller's Go Ahead; System Resource Clear Up; Release BATCH2; Release BATCH1; Start DEALER Application; Check Inspection Terminals; Open Network; Start Application Y; Respond to Close Event; Stop Application Y; Stop DEALER Application; Close Network. Note in figure: Bold items are dependent on TRADE INSTRUCTIONS posted by Department Y.]

Figure 4-4. A Process Flow Diagram: Solving System Access Problems
[Diagram: a decision flow that starts with the command (>STATUS*,TERM $TERM) and contains the following steps: Step 1: Is TACL OK? Step 2: Is Terminal Hardware OK? Step 3: Stop Other Processes. Step 4: Stop TACL Process. Step 5: Start New TACL. Step 6: Check, Abort, & Restart Line. After each corrective step the diagram asks "Is TACL OK Now?"; if OK, resume, and if not OK, continue to the next step and finally escalate.]

Tandem Manuals

Manuals provide you with information about your system hardware and software. They provide introductory, procedural, and reference material. You should have a complete set of Tandem manuals for the Tandem products you use, in addition to manuals from other vendors. You should also have internal manuals. Appendix A, Additional Reading, lists the manuals that describe the tasks and products mentioned in this manual. For a complete list of Tandem manuals, refer to the first document listed in the About This Collection document in the G01.00 Tandem Information Manager (TIM) collection.

Tandem provides manuals on CD-ROM disc to be viewed online (using PCs or workstations) with the TIM product. The manuals should be accessible to the people who use them. You might need to provide multiple PCs or workstations if the manuals are used by people in different locations. For example, if an operator frequently needs to use a manual in the computer room, and a technical specialist needs to use the same manual in his or her office, you might need to provide a PC in each location so that both people can work efficiently. Or, you might print parts of or whole manuals and distribute these copies to the people who need them.

Tandem Software Release Documents

For the Tandem software your company orders, you receive a site update tape (SUT). Each site update tape contains software products and software release documentation for the products on the tape. These documents describe:

- The product name and number
- The release date
- Required hardware and software
- New product features, including notes regarding the installation of the new features, if appropriate
- Corrected problems and known problems remaining

The software release documentation is available on CD-ROM disc to be viewed online (using a PC or workstation) with the Tandem Information Manager (TIM) product. You can also print the documentation from the site update tape.

Your staff should read and understand the software release documentation before installing and running the software. Keep the documentation in an accessible location so that the staff can refer to it when necessary.

Logs

Operations organizations usually have at least three types of logs: operator logs, error logs, and CE logs. Also, Tandem recommends the use of outage logs.

Operator Logs

Operator logs provide a history of problems encountered during each work shift, a record of unresolved problems, and a record of tasks scheduled and performed or not performed. The logs help the staff track tasks and determine why problems occur. Operator logs are maintained by the operators. At each shift change, operators from both shifts usually sign the operator logs, and the incoming shift usually takes responsibility for correcting unresolved problems and completing tasks. Operator logs contain information on:

- Problems that have occurred, including what the problem is, the time the problem occurred, error messages or codes, the results of the problem, and a description of the recovery action taken.
- Problems that have not been closed and need operator action.
- Tasks that were scheduled for one shift and completed.
- Tasks that were scheduled for one shift and not completed.
- Instructions or information on special conditions for oncoming shifts.

An operator log is usually either a notebook located next to the system console or an online file. Whichever type of log you use, be sure that the log is always accessible.

Error Logs

An error log is a system-generated hard-copy or online log that contains a record of all error, warning, and informational messages sent by the system, applications, and subsystems. Error logs help you track and resolve problems. Examples of error logs are the TSM problem incident report and the NetBatch scheduler log file. You can use the Event Management Service (EMS) to direct the flow of event messages to log files or devices. You can use the EMS Analyzer to examine and analyze log files.

CE Logs

When your system is installed, you might receive a site log book that is maintained by Tandem customer engineers (CEs).
The log book contains sections that allow for the following:

- System configuration diagrams
- A preventive-maintenance schedule and contract information
- Backplane and cable diagrams
- Service report forms and service reports
- Floor plans showing the location of system cabinets
- Equipment inventories
- Information about whom to contact when questions arise
- Anything else the CE finds useful

The log book helps you and the CE plan for hardware changes and hardware maintenance. It is also a good source for reviewing hardware information.

The log book should remain by the system console or system cabinets.

Outage Logs

Tandem recommends maintaining outage logs to help you assess system availability. Outage logs provide a history of any failure or upgrade that causes a system outage. This historical data can be used for trend analysis, which in turn can be used to determine where improvements are needed. Improvements could include operator training, change management, operational tools and utilities, or additional system software or hardware. Outage logs should be maintained by lead operators.

An outage log should be tailored to meet the specific needs of your particular environment. However you customize your outage log, it should contain the following types of crucial information:

- The date and time that the outage occurred
- The duration of the outage
- The suspected cause of the outage and the objects involved (for example, disk failure, database needs rollforward, and so on)
- The actions taken to recover from the outage, including:
  - The name of the persons initiating recovery
  - The date and time when the persons initiating recovery were notified of the outage
  - The type of procedure followed to perform the recovery
- The applications, number of users, and business services affected by the outage
- Any follow-up actions that were taken when the recovery was completed

Figure 4-5 shows a sample outage log form.

Figure 4-5. A Sample Outage Log
[Form: header fields OUTAGE LOG, REVISED: mm/dd/yy, BY: name, NODE: \, and DATE: / /, followed by a table with columns for the time and operator initials when an event is opened, the event, the action taken, and the operator initials and time when the event is closed.]
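An outage log kept as structured records makes the trend analysis described above straightforward. The following is a minimal Python sketch under stated assumptions: the `OutageEntry` fields are a hypothetical subset of the information listed above, the sample entries are invented, and the availability calculation assumes that logged outages do not overlap and fall entirely within the reporting period.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OutageEntry:
    """One row of an outage log (hypothetical subset of the suggested fields)."""
    node: str
    opened: datetime
    closed: datetime
    cause: str
    action_taken: str
    operator: str

    @property
    def duration(self) -> timedelta:
        return self.closed - self.opened


def downtime_by_cause(entries):
    """Total outage time per suspected cause, for trend analysis."""
    totals = defaultdict(timedelta)
    for entry in entries:
        totals[entry.cause] += entry.duration
    return dict(totals)


def availability(entries, period_start, period_end):
    """Fraction of the period with no logged outage (non-overlapping outages assumed)."""
    down = sum((entry.duration for entry in entries), timedelta())
    return 1 - down / (period_end - period_start)


# Invented sample data for a one-week reporting period.
log = [
    OutageEntry("\\LA", datetime(1996, 12, 2, 3, 0), datetime(1996, 12, 2, 4, 0),
                "disk failure", "replaced disk, revived mirror", "JS"),
    OutageEntry("\\LA", datetime(1996, 12, 5, 22, 0), datetime(1996, 12, 5, 22, 30),
                "software upgrade", "installed new release", "MK"),
]
week_start = datetime(1996, 12, 1)
week_end = week_start + timedelta(days=7)

print(downtime_by_cause(log))
print(f"{availability(log, week_start, week_end):.4%}")
```

Grouping by cause is only one possible view; the same records could be grouped by node, shift, or affected application, depending on where improvements are being sought.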

Internal Operator Guides

Even though Tandem provides the manuals you need to run your system, you might also need internal operator guides. Internal operator guides (also called runbooks) describe the procedures required for a particular site or a particular organization. These procedures can be copied from Tandem manuals and can be tailored to the needs of your organization. Internal operator guides provide a simplified view of system operations that is useful for training entry-level personnel and for helping all levels of support complete their tasks quickly. An internal guide might include:

- Information about the system, such as the type of system you have (and the system number), how it is used, and who uses it
- Charts that show which programs run in each processor, the priorities of each program for each processor, and so on
- A list of installed applications and support contacts for each
- A list of online files the operations staff frequently use
- Diagrams of the system and network, and copies of configuration listings
- Diagrams of activities and processes
- Check lists of daily, weekly, monthly, quarterly, and yearly tasks and check lists for shutting down and starting up the systems
- Schedules the staff should follow
- Procedures and a list of contacts for problem resolution
- Procedures for system and application startup
- Procedures for routine tasks
- Procedures for executing peripheral device self-tests (for testing devices such as terminals, workstations, bar code readers, printers, modems, multiplexers, and so on)
- Procedures for recovering from system, network, application, or workstation problems
- A list of people responsible for updating the guide
- Any other procedures appropriate to your organization

Online Files

Online files are needed to perform many operations tasks. To ensure that the staff can quickly find the files required, consider documenting the location of the files needed to run the system. One recommended method of documenting online files is to:
- Provide a hard-copy list by subvolume of the locations of log files, system configuration files, system startup and shutdown files, application files, operations macro files, system measurement files, and so on.
- Provide an edit file in each subvolume (for example, AAINDEX) that contains a list of the files in the subvolume and the contents of the files.
- Provide a listing or report that shows the relationship between the files and their use. For example, list each application by name, and list the files needed to run the application beside the name. Another table might list the files by name and location, with a brief description of their use.

Error Messages

Tandem provides two types of error message documentation: manuals and online descriptions. All Tandem system-generated messages are documented in Tandem manuals. In addition, operator message documentation is provided online through the TSM EMS Event Viewer. The TSM EMS Event Viewer provides event detail screens that describe the cause and effect of each operator message, and any recovery action that is required.

You can modify the descriptions to make them more useful and more specific to your organization. You can also add descriptions for messages generated by internally developed applications.

In addition to documenting messages generated by internally developed applications online, consider documenting the messages in hard-copy form. Messages manuals for company-specific messages are useful for reference and problem solving.

Check List

The following check list summarizes the main points of this section:
1. Determine what type of documentation you need in order to run your organization efficiently.
   You might need the following:
   - Operations policies, standards, and procedures
   - Agreements, contracts, and supporting documents
   - Operator logs
   - Error logs
   - CE logs
   - Outage logs
   - System-configuration and network-configuration diagrams

   - System-configuration and network-configuration listings
   - Flow diagrams
   - Tandem and other vendor manuals
   - Tandem software release documents
   - Internal operator guides
   - Documentation on the location and use of online files
   - Cause, effect, and recovery information for application error messages
   - Anything else applicable to your operation
2. Place the documentation where it is accessible to those who need it.
3. Set up disk files for documentation you want to have online. In the case of log files, be sure to determine when logging should be switched to a new log file.
4. Establish procedures for updating documentation and for informing the staff of updates.
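The Online Files discussion in this section suggests two views of the same information: files listed per application, and files listed by name and location. As a minimal sketch, both views can be derived from a single mapping. The application names and file names below are hypothetical; only the `$volume.subvolume.file` naming pattern is taken from Guardian conventions.

```python
from collections import defaultdict

# Hypothetical applications and the files each one needs
app_files = {
    "ORDER-ENTRY": ["$DATA1.ORDERS.ORDMAST", "$DATA1.ORDERS.CUSTMAST"],
    "BILLING": ["$DATA1.ORDERS.CUSTMAST", "$DATA2.BILLING.INVOICE"],
}


def files_to_apps(app_files):
    """Invert the mapping: for each file, list the applications that use it."""
    index = defaultdict(list)
    for app, files in app_files.items():
        for name in files:
            index[name].append(app)
    return dict(index)


def by_subvolume(app_files):
    """Group file names under their $volume.subvolume prefix."""
    groups = defaultdict(set)
    for files in app_files.values():
        for name in files:
            subvol, _, filename = name.rpartition(".")
            groups[subvol].add(filename)
    return {subvol: sorted(names) for subvol, names in groups.items()}


for subvol, names in sorted(by_subvolume(app_files).items()):
    print(subvol, names)
for name, apps in sorted(files_to_apps(app_files).items()):
    print(name, "used by", ", ".join(apps))
```

Maintaining one source mapping and generating the listings from it keeps the hard-copy list and the per-subvolume index from drifting apart.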

5 Production Management

Overview

This section describes production management and provides guidelines to help you manage the day-to-day support and operations tasks in the production environment. This section:
- Provides suggestions for monitoring system status
- Provides suggestions on how to control the system efficiently and effectively
- Provides guidelines for tracking system usage
- Describes how generating daily and weekly statistical reports can provide valuable information on the health of the system
- Describes how to use a production schedule to ensure that day-to-day production tasks are performed effectively
- Describes management responsibilities in the production environment
- Provides check lists of routine tasks including:
  - System startup, system shutdown, processor dump, and processor reload tasks
  - Daily, weekly, and monthly tasks
  - Recovery tasks

A check list summarizing the main points of production management is provided at the end of this section.

What Is Production Management?

Production management involves managing the day-to-day tasks of the production operations environment while also maximizing the availability and reliability of the systems, peripherals, networks, and applications as outlined by your service-level agreements.

To manage an effective, efficient, and reliable production environment, it is important to have up-to-the-moment status information about the systems and to be able to respond quickly to any changes or problems that occur. This is best accomplished by taking an online approach when performing the following activities:
- Monitoring system status
- Controlling the system
- Tracking system usage
- Providing daily and weekly reports to management on the health of the system

Using a daily production schedule can help you manage each day's tasks, transfer operations duties from one shift to the next, and ensure that you meet your service-level agreements.

Monitoring System Status

To ensure that the system is operating properly and to recognize when corrective action is required, it is important to monitor the status of all the resources of the system and network. Monitor on a continuous basis. Resources include processors, cabinets, disks, paths, volumes, controllers, communication lines, Expand lines, transaction-processing servers, terminal control processes (TCPs), terminals, spooler devices, and programs.

Monitoring should include:
- Monitoring event and alert messages
- Monitoring resources as they change states
- Monitoring performance of processors, disks, and communication lines

By monitoring system status, you can:
- See if resources are currently up or down
- Be quickly notified of error conditions, state changes, and threshold conditions that have been exceeded or are reaching their limits
- See a chronological list of events that can aid in problem diagnosis and resolution
- Determine how much of a particular resource is being used, for example, processor cycles, disk or file space, or communication line bandwidth
- Find bottlenecks, which can affect the users of the system
- Make better use of existing resources
- Ensure that applications such as NonStop SQL/MP, NonStop Transaction Manager/MP, and NonStop Transaction Services/MP are available
- Prevent problems from occurring

Controlling the System

Based on the information gathered from looking at events, monitoring objects, and watching performance, you must be able to control these objects. For example, you must be able to issue commands to fix problems, avoid problems, perform routine tasks, or increase system stability.

To be able to control the system effectively, it is important to:
- Provide operator tools to report, resolve, and fix problems; view documentation and manuals online; and write reports.
  Providing your staff with these tools can:
  - Help improve operator productivity and make better use of computer resources
  - Make training easier
  - Reduce the number of operator mistakes
- Automate operations. Automating operations is important because it can:
  - Help manage unattended remote nodes
  - Perform routine tasks
  - Automate monitoring tasks

  - Reduce operator errors
  - React to problems more quickly and perhaps more accurately than an operator could

For more information on automating operations, refer to Section 12, Automating and Centralizing Operations.

Tracking System Usage

Tracking system usage (also known as usage accounting) is the process of tracking the use of system resources for accounting purposes (for example, tracking the types and quantity of resources used by particular applications or users). Tracking system usage helps you:
- Ensure efficient use of the system. Tracking system usage can assist in changing procedures to improve performance.
- Control costs. Monitoring resource utilization enables essential user needs to be satisfied at a reasonable cost.
- Recognize when a user or group of users may be abusing their access privileges and burdening the system at the expense of other users.
- Plan for system growth if user activity is known in sufficient detail.
- Charge back to other departments.
- Check how well you are meeting your service-level agreements.

The staff responsible for tracking system usage should have basic accounting skills and should know how to:
- Determine the kinds of accounting information to be recorded
- Collect, track, and analyze accounting statistics
- Determine accounting algorithms to be used in calculating charges for resource usage
- Identify and evaluate options for computer-system growth or performance improvements

Tracking system usage consists of three steps:
1. Establishing an accounting strategy
2. Determining the system resources to be monitored
3. Collecting accounting data and reporting results

Step 1: Establishing a Strategy

The accounting strategy is based on the service-level agreements. To help ensure that the service agreements are being adhered to, the staff responsible for tracking system usage should develop:
- The process used to keep track of the usage of system and network resources by users
- The chargeback process for the use of those resources

The accounting strategy used can vary depending on the type of environment and service provided. For example, an internal accounting system may be used to assess the overall usage of resources and determine what proportion of the cost of each shared resource should be allotted to each department. In other environments, where usage is broken down by account, by project, or even by individual user for the purpose of billing, the information gathered by the accounting system must be more detailed and more accurate than that required for a general system.

Step 2: Determining the System Resources to Be Monitored

After the accounting strategy is established, the staff develops a list of system resources to be monitored to support the accounting strategy. Examples of resources that might be subject to accounting include:
- Communications facilities (LANs, WANs, leased lines, dial-up lines)
- Computer hardware (workstations, servers)
- Software and systems (applications, server software, a data center, end-user sites)
- Services (commercial communications, services available to network users)

Step 3: Collecting Accounting Data

For any given type of resource, and based on the requirements of the accounting system, accounting data is collected.
For example, the following communications-related accounting data might be gathered and maintained for each user:
- User identification (provided by the originator of a transaction)
- Receiver (identifies the network component to which a connection is made or attempted)
- Number of packets (count of data transmitted)
- Security level (identifies the transmission and processing priorities)
- Timestamps (transaction start and stop times)
- Status codes (indicate the nature of any errors or malfunctions that are detected)
- Resources used (indicates which resources are invoked by this transaction)
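As a worked illustration of proportional chargeback, the following Python sketch splits the cost of one shared resource among departments according to the packet counts recorded for each. The department names, packet counts, and cost figure are hypothetical, and allocating by packet count is only one possible accounting algorithm.

```python
from collections import defaultdict

# Hypothetical accounting records: (department, packets transmitted)
records = [
    ("finance", 12000),
    ("sales", 6000),
    ("finance", 2000),
    ("support", 4000),
]


def allocate_cost(records, shared_cost):
    """Split shared_cost among departments in proportion to packets sent."""
    usage = defaultdict(int)
    for dept, packets in records:
        usage[dept] += packets
    total = sum(usage.values())
    return {dept: shared_cost * count / total for dept, count in usage.items()}


charges = allocate_cost(records, shared_cost=1200.0)
for dept, amount in sorted(charges.items()):
    print(f"{dept:10s} {amount:8.2f}")
```

With these figures, finance accounts for 14,000 of 24,000 packets and is therefore allotted 700.00 of the 1,200.00 cost; sales and support are allotted 300.00 and 200.00.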

Reporting the results on a weekly basis helps track system and network resource usage and can provide the capacity planners with data useful for forecasting future demands.

Providing Daily and Weekly Reports

The operations staff should generate daily and weekly reports containing statistical information about how the Tandem systems are operating. Since managers typically do not spend all their time in the computer room, these reports are important because:
- They can help ensure that service-level agreements are being met.
- They can provide system and network usage information for capacity planners. Managers can use this information to make forecasting, purchasing, and staffing decisions.
- Many managers come from other computing environments and need a simplified way of understanding the Tandem systems.

Consider providing reports that summarize statistics on:
- Problem-reporting and problem-tracking information
- The number of unplanned and planned outages
- System and application performance
- System and network resource usage (weekly accounting reports)

Using a Production Schedule

A daily production schedule can improve your ability to allocate sufficient resources to meet each day's priority requirements and ensure the highest possible level of service to your users. By using a 24-hour clock worksheet such as the one in Figure 5-1, you can simplify and reduce the amount of time required to manage system resources. Using a production schedule:
- Helps you identify dependencies. For example, using a 24-hour clock worksheet helps ensure that start, stop, and setup for tasks finish successfully.
- Encourages you to prioritize tasks.
- Helps you set expectations between groups such as operations, technical support, and development.
- Allows you to identify resources and provide information for growth and capacity planning.
- Helps you determine if most operational tasks are planned or unplanned.
- Provides information to help you determine staffing requirements.
- Provides cross-shift operations continuity.

Creating a Production Schedule

Perform the following tasks to create a production schedule:
1. Use a 24-hour clock worksheet (like Figure 5-1) to list all tasks that are performed daily.
2. Identify what task (either business or operations and management) will be performed and who (either an automated process or a person) will perform it.
3. Identify sets of interdependent business flows (for example, just-in-time [JIT] data processing) and critical paths.
4. Collect existing documents such as Tandem Advanced Command Language (TACL) routines, OBEY files, and runbooks.
5. Identify time and resource constraints.
6. Extend the schedule to handle weekly, monthly, quarterly, and yearly activities.
7. Use automation tools such as NetBatch to ensure that jobs are submitted at the appropriate times.

Analyzing the Completed Schedule

When analyzing your completed 24-hour clock worksheet, look for the following:
- Are enough operators available at times when manual procedures must be performed? For example, some companies need more coverage on weekends when they bring down the production application and test new applications.
- Could availability be increased if the schedule were changed? For example, backups might need to be rescheduled to avoid peak business periods.
- Is the batch-processing window encroaching on the online transaction-processing (OLTP) window?
- Can manual jobs be automated?

The 24-Hour Clock Worksheet

Figure 5-1 is an example of a 24-hour clock worksheet. Your 24-hour clock worksheet would contain topics specific to your organization, and might be arranged with time on the x-axis to account for multiple occurrences of the same task. When completing the worksheet, be sure to:
- Show starting times and duration of tasks
- Note special cases on weekends, holidays, end-of-quarter, and so forth
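One of the analysis questions above, whether the batch window encroaches on the OLTP window, can be checked mechanically once the worksheet is captured as data. The following Python sketch assumes a hypothetical schedule in which every task starts and ends on the same calendar day; the task names and times are invented for illustration.

```python
from datetime import time


def minutes(t):
    """Convert a time of day to minutes since midnight."""
    return t.hour * 60 + t.minute


def overlaps(a, b):
    """True if two (start, end) windows on the same day share any time."""
    return minutes(a[0]) < minutes(b[1]) and minutes(b[0]) < minutes(a[1])


# Hypothetical worksheet entries: task -> (start, end)
schedule = {
    "OLTP peak": (time(8, 0), time(18, 0)),
    "Batch jobs": (time(17, 30), time(21, 0)),
    "Backups": (time(22, 0), time(23, 30)),
}


def conflicts_with(name, schedule):
    """List the other tasks whose windows overlap the named task."""
    target = schedule[name]
    return [other for other, window in schedule.items()
            if other != name and overlaps(target, window)]


print(conflicts_with("OLTP peak", schedule))
```

In this sample the batch jobs start 30 minutes before the OLTP peak ends, so they are reported as a conflict, while the backups are not. A task that runs past midnight would need to be split into two entries under this simplified model.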

Figure 5-1. 24-Hour Clock Worksheet
(Worksheet rows: Time; Staff Shift Changes; Backups and TMF Dumps; Online Startup and Shutdown; Batch Jobs and Files Transferred to IBM; Automated Jobs; Note Times of Critical Management Reports, Peak OLTP Periods)

Management Responsibilities

In addition to monitoring, controlling, and tracking the system, managing a production environment is most effective when policies and procedures are developed and enforced, tasks are defined and assigned, and the staff is trained. It is important to:
- Adequately staff your organization, and:
  - Establish work schedules appropriate to your operations.
  - Determine what types of training your staff requires.
- Prioritize tasks. Using production schedules, described in this section, can help you prioritize tasks.
- Determine who should perform which tasks. Ensure that the tasks are performed and thus ensure that your systems are well managed. Section 2, The Operations Staff, provides guidelines and suggestions for staffing production management.
- Determine which tasks should be automated.
- Determine which tasks should be documented in internal operator guides.
- Define standard procedures for security, problem escalation, and disaster recovery.

- Recognize the types of documentation that should be available to your staff, including daily run sheets, configuration listings and diagrams, manuals, and so on. For more information about documentation requirements, refer to Section 4, Operations Documentation.
- Evaluate and select hardware, software, and tools.

Routine Operations Tasks

The following check lists of routine tasks can help you determine what types of tasks your staff needs to perform. The tasks include:
- System startup, system shutdown, processor dump, and processor reload tasks
- Daily tasks
- Weekly tasks
- Monthly tasks
- Recovery tasks

Note. Most of the tasks described in this section can be automated and centralized. For information on automating and centralizing tasks, refer to Section 12, Automating and Centralizing Operations.

This section does not explain how to perform the tasks. For more information on how to perform the tasks, refer to the manuals listed in Appendix A, Additional Reading.

System Startup, Processor Dumps, Processor Reload, and System Shutdown

Starting up the system, dumping a processor, reloading a processor, and shutting down the system are routine but infrequent tasks. Your Tandem system can run 24 hours a day, 7 days a week, and should even continue running after power is restored following a power failure. The only time you might need to shut down and start up a system is when you:
- Change the system configuration
- Add some types of hardware
- Install a new operating system release and, in some cases, when you install a modification to the operating system software

System Startup

System startup or system load is performed to bring up a system that has been completely shut down. Startup includes bringing up the operating system and applications.

Processor Dump

A processor dump is performed to copy the contents of a processor's memory onto disk or tape. Although a processor dump is useful for most processor or system failures, in some cases you should contact your Tandem representative first.

Processor Reload

A processor reload is performed to bring up a processor in a running system.

System Shutdown

System shutdown is performed to bring down a running system in an orderly manner. System shutdown involves properly closing all open files and stopping all processing on the system. If the shutdown is not performed properly, files and applications might be corrupted.

You might want to develop procedures for handling shutdown requests and informing users of pending shutdowns. Formal procedures help your staff ensure that it has all the necessary information and that the requests are approved.

To ensure that all shutdowns are authorized and properly coordinated, you might also want to develop a shutdown request form. Most shutdown request forms include the following information:
- The name of the person making the request
- The reason for the shutdown
- Which users must be notified
- Which backup tapes are required, if any
- When the shutdown is needed
- How long the system will be down
- The signature of the person authorized to approve shutdown requests

Once the shutdown is approved, the operations staff usually:
1. Schedules the shutdown.
2. Notifies users of the shutdown. Notification should take place at least a day in advance of the shutdown.
3. Performs the shutdown. Intermediate-level or senior-level operations support personnel usually perform the shutdown. Do not allow people from other groups to perform the shutdown on your systems.

Be sure to document the procedures for performing a system shutdown and then automate the task. Automating the task ensures that all processes are stopped in the correct order, thereby ensuring an orderly shutdown.
For more information on automating operations tasks, refer to Section 12, Automating and Centralizing Operations.
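The shutdown request form and the one-day notification guideline described above can be expressed as a small validation routine. The following Python sketch is a hypothetical illustration: the class, field names, and sample values are assumptions, not part of any Tandem product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ShutdownRequest:
    """Fields mirror the request-form items; names are illustrative."""
    requester: str
    reason: str
    users_to_notify: list
    backup_tapes: list          # may be empty if no tapes are required
    shutdown_at: datetime
    expected_downtime: timedelta
    approved_by: str = ""       # stays empty until an authorized person signs


def outstanding_issues(request, now, notice=timedelta(days=1)):
    """Return the reasons the request is not yet ready to schedule."""
    issues = []
    if not request.requester:
        issues.append("requester missing")
    if not request.reason:
        issues.append("reason missing")
    if not request.approved_by:
        issues.append("not approved")
    if request.shutdown_at - now < notice:
        issues.append("less than one day of notice for users")
    return issues


now = datetime(1996, 12, 2, 9, 0)
request = ShutdownRequest(
    requester="J. Smith",
    reason="Install new operating system release",
    users_to_notify=["order entry", "billing"],
    backup_tapes=[],
    shutdown_at=datetime(1996, 12, 7, 22, 0),
    expected_downtime=timedelta(hours=4),
)
print(outstanding_issues(request, now))
```

Returning a list of issues rather than a single pass or fail result lets the operations staff see everything that must be resolved before the shutdown is scheduled.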

Daily Tasks

The following types of daily tasks help you ensure that the system is running properly and that potential problems are detected early:
- Start-of-day tasks. These tasks are performed by the first shift of the day.
- Start-of-shift tasks. Every operations work shift performs these tasks at the start of the shift.
- During-the-shift tasks. Every operations work shift performs these tasks periodically during the shift.
- End-of-day tasks. These tasks are performed by the last shift of the day.

Consider documenting these tasks in an internal operator guide, automating these tasks, and providing site-specific check lists for your staff to use. Documenting and automating these tasks helps your staff work more efficiently. Check lists help you ensure that all tasks have been performed.

Note. Not all tasks listed in this section are applicable to all systems. Depending on your systems, you might want to add or delete tasks.

Most of the tasks listed in this section can be automated so that your staff has to enter only a few commands. For information on automating tasks, refer to Section 12, Automating and Centralizing Operations.

Start-of-Day Tasks

There is one task that is usually performed at the start of each day:
- Make sure that Measure is running properly. Measure collects performance statistics for your systems at peak intervals throughout the day. Measure can run continuously after it is installed on the system.

Start-of-Shift Tasks

The following tasks are usually performed at the start of each shift. If your system runs continuously, perform these tasks during a period of reduced activity. If you bring up applications daily, perform these tasks before bringing up the applications.
- Check operator or shift logs for information about outstanding problems, events that occurred during the previous shift, and tasks that should be completed during the current shift.
- Check hardware components.
  Make sure that all hardware is working properly:
  - Make sure that all devices and paths are up, all system components are functioning properly, and the primary paths for all devices are up. Use SCF.
  - Use SCF to make sure that all processors are up. If you have specific processors assigned to run only certain applications, verify that only those applications are running in them.

  - Make sure that all printers are operational and online.
  - Make sure that the temperature and humidity are normal.
- Check the spooler. Use the spooler interface (SPOOLCOM) to make sure that the spooler components are working properly.
- Check for operator messages:
  - Check the operator message logs, TSM EMS Event Viewer screens, and your own system management applications for operator messages and event messages.
  - Read all messages that appear in order to detect potential problems early.
  - Save error information to help identify problems should they occur.
  - If a log file is nearly full, switch logging to a new file.
  - Use Tandem Service Management (TSM) to view new alarms, especially critical alarms.
  - Use TSM to make sure that all processes that should be running are running.
- Check the network status:
  - Use Network Statistics Extended (NSX) to monitor all systems in a network, gather network-wide statistics, and generate reports. NSX can collect statistics on all nodes, processors, and Expand line handlers in a network without operator intervention.
  - Use SCF to verify that all network nodes and the processors at the nodes are up, to obtain information on routing and throughput delays, and to determine the line status between your system and other systems.
- Check line handlers. Use SCF on all configured lines to make sure that all devices are up, to check for block check character or buffer allocation errors, and to ensure that data is accumulated for the current day only.
- Check the status of the online transaction-processing environment. Use the PATHCOM interface to:
  - Check the status of the PATHMON process-management process, the LOG status, and the status of primary and backup processors.
  - Check the status of objects controlled by PATHMON.
  - Determine if any servers have failed.
  - Determine if any terminal control processes (TCPs) have failed.
- Check the status of Transfer. Use PATHCOM to check the status of terminals and servers.
- Check the status of the NonStop Transaction Manager/MP (TM/MP). Use TMFCOM to check the status of audit trails, audit-dump processes, the backout process, the NonStop TM/MP catalog, and NonStop TM/MP activity.

- Check files:
  - Note any bad tracks on mirrored disks and spare the bad tracks (use SCF).
  - Make sure that sufficient free space is available for dynamic extent allocation during the day, and monitor space fragmentation. Use the LISTFREE function of the Disk Space Analysis Program (DSAP).
  - Look for excessive file fragmentation or exceptional conditions such as extent overlaps, unspared defective sectors, or lost free space pages. Use DSAP.
  - Check application file sizes. Make sure that online application database files are not reaching their configured capacity. Use the File Utility Program (FUP).
- Run PEEK on all processors to collect processor resource usage statistics.

During-the-Shift Tasks

The following tasks are usually performed during every shift:
- Check for operator messages at regular intervals as suggested for start-of-shift tasks. If a large number of messages is being generated, fix the problem or problems causing the messages, or prepare to change the log files more frequently than usual.
- Monitor processors:
  - Run PEEK on all processors. Examine the output to detect abnormal paging, table allocation, and pool utilization.
  - Run ViewSys to monitor how busy the processors are.
  - Use the TSM EMS Event Viewer to detect excessive processor activity.
- Monitor the spooler as suggested for start-of-shift tasks.
- Monitor system security. Check physical security and use the Safeguard command interpreter (SAFECOM) and TACL to:
  - Check for changes to owner IDs, file security settings, licensing, and so on
  - Monitor audit logs for unusual events
  - Check the users' default security settings
  - Delete old user IDs
- Monitor the network as suggested for start-of-shift tasks.
- Monitor devices and files as suggested for start-of-shift tasks.
- Monitor the online transaction-processing environment as suggested for start-of-shift tasks.
- Monitor applications. Make sure that all processes that should be running are running.
- Check room temperature and humidity.
- Vacuum printers.

- Check printer supplies. Make sure that a supply of printer paper and ribbons is always available.
- Perform preventive maintenance on all tape drives.

End-of-Day Tasks

The following tasks are usually performed during the last shift of the day:
- Stop Measure data collection and generate system performance reports.
- Perform NonStop TM/MP audit dumps.
- Check equipment:
  - Clean the cabinets as needed. Cleaning a cabinet is an infrequent task that you perform as required by conditions at your site.
  - Clean tape drive heads and sensors. This task might need to be performed more frequently depending on tape usage.
  - Inspect printers. Turn the ribbons on drum printers daily to increase their life.
- Run partial backups. In a Guardian environment, use BACKUP to back up files that have been changed since the previous day or since the last full backup. In an OSS environment, use the pax utility, as described in the Open System Services Management and Operations Guide. Run backups during a period of low system activity.

Weekly Tasks

Perform the following tasks at least once a week:
- Perform full backups. In a Guardian environment, use BACKUP to take full backups of all files (including the complete NonStop TM/MP subvolume). Make two copies of the backups (using BACKCOPY). One copy should be archived off site. For OSS file considerations, see the Open System Services Management and Operations Guide. Run backups during a period of low system activity.
- Perform system security and system administration tasks:
  - Check the system for unauthorized users.
  - Delete old user IDs.
  - Change super-id passwords.

  - Verify that only those users authorized to have remote passwords have them and that the passwords allow access to the proper systems.
  - Maintain inventory records of hardware.
  - Order supplies.
- Summarize daily statistics:
  - The number of problems reported, resolved, or still unresolved
  - The amount of time and the levels of support required to solve problems
  - The number of users added and deleted from the system
  - The number of terminals installed or fixed
  - Any other useful information
- Prepare security audit reports.
- Check disks:
  - Check disks for defects and spares (use SCF). Report significant information.
  - Check disks for free space (use SCF).
  - Run DSAP and analyze the output for indications of disk fragmentation. Look for subvolumes and files that should not be on the volume. Save all DSAP output.
  - Compress the disk. Use the Disk Compression Program (DCOM). This task might need to be performed more frequently or less frequently, depending on your needs.
- Test all backup paths. Make sure that all controller paths are working properly so that the backup path will come up if a primary path goes down. Depending on the level of assurance you want, you might want to execute this task daily.
- Review system performance and summarize findings:
  - Run Enform reports to accumulate and analyze the daily Measure data.
  - Use PATHCOM to collect statistics about objects controlled by PATHMON. By collecting object statistics, you can determine how to improve throughput, optimize resource use, and correct any problems that arise. Collect statistics about the actual use of TCPs, terminals, and servers.
  - Generate reports from the NSX data gathered during the week.
  - Review daily PEEK output to analyze resource usage in all processors.
  - Use DSAP reports to analyze disk utilization.
- Perform NonStop TM/MP online dumps and produce a hard copy of the dump information.

Monthly Tasks

Perform the following tasks at least once a month:

- Back up and restore complete disks. Make sure that the on-site and off-site archives are current.
- Reload application files (might be required more or less frequently, depending on the applications).
- Perform system preventive maintenance. (Your Tandem support representative might perform this task, depending on your support contract.)
- Review weekly performance reports. Determine if additional system capacity will be needed.
- Review monitoring strategies. Make sure that problems are found before they become serious, that reports are generated when required, and that applications are running properly.
- Review problem-reporting and problem-tracking procedures and make sure that problems are reported and resolved in a timely manner.
- Review system security. Make sure that security procedures are preventing system penetration and that users and the operations staff are abiding by your organization's security policy.
- Review staff training requirements.

Recovery Procedures

System problems are never routine. However, recovering from system problems can be routine if your staff knows how to recover from as many problem situations as possible. Some of the problems your staff should know how to resolve are:
- Terminal-related problems
- Looping processes
- Application failures
- Security problems
- Disk failures
- Communications failures
- Processor failures
- Power failures
- System failures
- Air-conditioning failures
- Site disaster recovery

If your staff performs the routine daily tasks, they will detect many of these problems before the problems become serious. Developing recovery procedures for problems that might occur helps your staff prepare for potential difficulties.

When establishing recovery procedures, consider these guidelines:
- Review the guidelines provided in Section 6, Problem Management.
Problemreporting and problem-tracking procedures help you and your staff learn from past problems, avoid repeat problems, and recognize future problems quickly. 5-15

- Review the guidelines provided in Section 10, Contingency Planning. Having a disaster recovery plan in place can help you and your staff recover from a disaster as quickly as possible, with minimal damage to your system and data.
- Document recovery procedures and make them available to operations and support staff. Information on how to determine the cause of a problem should also be documented and available.
- Assign the responsibility for carrying out the recovery procedures to the appropriate people.
- Make sure that full copies of all manuals are available in the operations area, including all relevant application user manuals.
- Provide documentation describing all major system components, their configurations, and how they deliver services. Include an SCF LISTDEV of the system in its normal state and examples of system and network diagrams.
- Ensure that a terminal and telephone are situated in the computer room next to the system so that an operator can discuss the problem on the phone while accessing information in the computer room.
- Make sure that backup tapes are available and readable to ensure that files can be restored.
- In an environment without NonStop TM/MP, document all files that are updated online. Document the location, size, type, contents, and importance of each file, and the recovery options.
- Test the recovery procedures to ensure that they work and are as simple to use as possible.
- Train your staff in the recovery procedures.
- Ensure that super-group (255,n) capabilities are available under special conditions. Although a super-group logon is not needed under normal conditions, it might be required to solve certain problems.
- Require surprise drills to ensure that your staff can perform the recovery procedures.

For serious failures, such as hardware, processor, system, and power failures, you should notify your Tandem representative in addition to following your internal escalation procedures.

Production Management Tools

Tandem provides a number of tools to help your staff with production management tasks. Table 5-1 summarizes the production management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, Operations Management Tools. For a list of automation tools, refer to Section 12, Automating and Centralizing Operations.

Table 5-1. Summary of Production Management Tools

Capabilities covered by the table:
- Monitor Resources
- Collect Events
- Display Events
- Generate Events
- Schedule and Control Jobs
- Command Interface
- Generate Reports
- Emulator

Tools covered by the table:
- Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
- Event Management Service (EMS)
- Event Management Service Analyzer (EMSA)
- NetBatch and NetBatch-Plus
- Network Statistics Extended (NSX)
- NonStop Virtual Hometerm Subsystem (VHS)
- Object Monitoring Facility (OMF)
- SeeView
- Subsystem Control Facility (SCF)
- Tandem Service Management (TSM)
- TSM EMS Event Viewer
- ViewSys

Check List

The following check list summarizes the main points of this section:
1. Implement operator tools to monitor the systems, networks, applications, processors, disks, and communications lines.
2. Determine which tools to use. Tandem offers these tools:
   - DSM/NOW
   - EMS
   - EMSA
   - NetBatch and NetBatch-Plus
   - NSX
   - OMF
   - SeeView
   - SCF
   - TSM
   - TSM EMS Event Viewer
   - VHS
   - ViewSys
3. Develop an internal operations guide that provides guidelines and procedures for performing tasks within your organization.
4. Establish work schedules based on the tasks listed in this section and other tasks required by your organization.
5. Automate the tasks as much as possible.
6. Train your staff to perform the tasks quickly and well.
7. Review the procedures and the staff's performance to determine whether the procedures can be improved.

6 Problem Management

Overview

No matter how well-managed your system is, errors and problems can occur. Because a problem can mean the loss of availability, your staff needs to know how to report and resolve the problem. If your staff cannot resolve the problem, it must know how to escalate the problem so that recovery occurs.

This section describes problem management and provides suggestions, guidelines, and tools for administering problems in an operations environment. This section ends with a check list that summarizes the main points of problem management.

Note. Participating in application design reviews can help your staff eliminate potential problem areas and ensure that errors and recovery procedures are documented and understandable. Building quality into an application reduces the chance of problems once the application is installed. Section 11, Application Management, provides suggestions for participating in design reviews.

The Availability Guide for Problem Management defines problem management in detail, providing information on how to predict, prevent, and recover from problems; which problem management tools to use; and how the tools fit together.

What Is Problem Management?

Problem management involves managing and administering the problem environment, including capabilities to monitor, detect, analyze, escalate, work around, and resolve problems in an online environment.

The Goals of Problem Management

The goal of problem management is to reduce or eliminate problems. This can be done by:
- Predicting and then preventing problems before they occur
- Quickly recovering from problems that do occur by using a systematic approach to resolving problems

Predicting, preventing, and recovering from problems are described later in this section.

Common Problems in an Operations Environment

Many problems result in unplanned outages. An unplanned outage is the time in which the application or system becomes unavailable to the end user because of a problem situation such as faulty hardware, operator error, disaster, and so forth. Tandem defines four unplanned outage classes, which categorize the causes of unplanned outages. Table 6-1 defines the four outage classes.

Table 6-1. Unplanned Outage Classes

Outage Class: Description
- Physical: Physical faults or failure in the hardware. Examples include system disk failure and network router failure, non-fault-tolerant hardware configurations (such as unmirrored disk drives), and non-fault-tolerant application configurations.
- Design: Design errors such as bugs in design and design failure in hardware or software. Examples include an application change that makes the application unusable by introducing unexpected problems.
- Operations: Errors by operations personnel caused by accident, inexperience, or malice. Examples include deleting data, incorrectly installing software, procedural problems (or lack of procedures), lack of operator training, and basic operations and maintenance tasks not being done or not being done correctly.
- Environmental: Failures in power, cooling, network connections, natural disasters (earthquake, flood), terrorism, and accidents. Examples include air-conditioning system failure, power failures (such as batteries dead, no backup generator), or computer in basement destroyed by flood.

Management Responsibilities

Managing the problem environment is most effective when problem-reporting and problem-escalation policies and procedures are developed and enforced, and the staff is trained in outage prevention and recovery.

Establishing Policies and Procedures

Past experience has shown that organizations lacking problem-reporting and problem-escalation procedures have a higher rate of errors, a less efficient organization, longer recovery times, and a greater percentage of dissatisfied users. Established problem-reporting and problem-escalation policies and procedures help you:
- Ensure that all identified problems are reported, recorded, assigned a priority, and resolved
- Track how quickly problems are resolved in order to determine if procedures need to be improved and if service-level agreements are being met
- Identify recurring problems in order to eliminate the problems or to help the staff resolve the problems more quickly
- Ensure that applications are designed to help your staff resolve problems when they occur

Providing Outage Prevention and Recovery Training

Providing outage prevention and recovery training can help the operations staff become more aware of the concept and cost of outages, and promotes outage prevention habits. Establishing an outage prevention and recovery training plan involves:
- Assisting the Tandem Education Group (TEG) in determining your specific training needs by sharing ideas about your education needs, completing education surveys, and providing constructive evaluation following all training activities.
- Encouraging management and planning staff to consider the education implications of anticipated changes to your environment. Some of the changes affecting skill levels include hardware changes, operating system upgrades, application changes, implementation of new tools, and industry changes.
- Establishing a function for a training manager whose responsibilities would be to develop education programs for new hires and continuing education for current staff.
- Establishing a learning environment on site for use of Independent Study Programs (ISPs), Audio-Digital Technology (ADT), and Computer-Based Training (CBT) classes.

Predicting and Preventing Problems

Because solving problems can mean the loss of availability, the best kind of problem solving is problem prevention. There are substantial delays whenever a system experiences a problem. It takes time to:
- Recognize the problem
- Log the problem and get someone to work on it (administrative delays)
- Collect the necessary tools
- Analyze the problem
- Verify the cause
- Fix the problem
- Test and evaluate the fix
- Put the system back into operation (recover from the problem)

By implementing problem prevention strategies, you can:
- Predict potential problems
- Prevent potential problems from becoming unplanned outages
- Prepare for problems that may occur

Problem Prevention Strategies

You can prevent many problems by implementing the following strategies:
- Monitor the hardware and software. To ensure that the system is operating properly and to recognize when a potential problem might occur, it is important to monitor continuously the status of all the resources of the system and network. Resources commonly monitored include processors, disks, paths, devices, processes, spooler components, audit trails, audit dumps, NonStop TM/MP transactions, tape mount requests, communication lines, and programs. Monitoring includes:
  - Monitoring resources as they change states (up or down). (Use the Object Monitoring Facility [OMF] or TSM.)
  - Monitoring end-user response time and throughput. (Use ViewSys or NSX.)
  - Monitoring critical resource utilization (threshold limits, disk files and volumes percent full, memory queues, message queues, disk queues, processor utilization, and control block usage). (Use ViewSys or NSX.)
- Monitor system and application software message logs by using DSM facilities, such as EMS and the TSM EMS Event Viewer. DSM also helps developers create applications that generate events and create log files.
- Automate operations and recovery procedures. Examples of tasks that are typically automated for problem prevention include:
  - Object state monitoring.
  - Performance monitoring.
  - Critical resource monitoring.
  - Recovery tasks for routine (recurring) problems.
  - Routine (recurring) tasks. If you have to perform a task more than three times, automate the task.
  - Problem determination steps. For example, an event is generated when a line goes down. Problem analysis tasks, such as gathering information to help you determine the cause of the failure, can be automated.
  For more information on automating operations and automation tools, refer to Section 12, Automating and Centralizing Operations.
- Make sure that your system is fault tolerant.
  Tandem systems provide continuous availability and fault-tolerance features; however, it is up to you to make sure that these unique features are fully used and maintained. The Availability Guide for Problem Management provides information on auditing your system for fault tolerance. Guidelines are included to help you determine the fault tolerance of your software and hardware configurations.
- Design your system and application to take advantage of quick startup and shutdown techniques. The Availability Guide for Change Management provides operational strategies for reducing startup and shutdown time. The Availability Guide for Application Design provides guidelines for designing applications for high availability.
- Ensure the availability of super-group (255,n) capabilities. While a super-group logon is not needed under normal conditions, it may be required to solve certain problems. Having access to a super-group password is sometimes the fastest way and even the only way to solve a problem. Section 9, Security Management, describes a procedure to take advantage of super-group logon capabilities while maintaining appropriate security and providing access to the system when it is needed.
- Prepare for environmental problems and disasters. Planning ahead can help you prevent some environmental problems and disasters. Having a disaster recovery plan can minimize the effect of those disasters you cannot prevent. Section 3, The Operations and Support Areas, provides guidelines and considerations for selecting and preparing the computer center location and facilities. Section 10, Contingency Planning, describes disaster prevention and recovery in detail. The Availability Guide for Problem Management also provides detailed guidelines for preparing your operations environment for problems and disasters.
- Maintain accurate, up-to-date, and well-tested problem recovery procedures. Section 5, Production Management, provides guidelines for establishing recovery procedures.
- Have a reserve Tandem system (or use a development system). Problem prevention can be greatly enhanced by using a reserve system for:
  - Testing new Tandem software releases
  - Testing new application software releases
  - Testing operational procedures
  - Training new system operators
  - Running the application when the production system must be down to install a new release or new configuration of the operating system
- Maintain a well-trained operations and support staff. An inadequately trained operations staff is one of the biggest vulnerabilities an operations group can face. A well-trained staff is better able to respond to problems.
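The critical resource monitoring described in the strategies above (for example, disk volumes percent full) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the resource names, the sample readings, and the 85 percent limit are hypothetical, and the readings are hard-coded rather than collected with ViewSys or NSX.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    resource: str        # hypothetical resource name, such as a disk volume
    percent_used: float  # utilization as a percentage (0 to 100)


def over_threshold(readings, limit_percent):
    """Return the names of resources at or above the utilization limit."""
    return [r.resource for r in readings if r.percent_used >= limit_percent]


# Hypothetical sample readings; a real site would gather these from its monitoring tools.
readings = [
    Reading("$DATA1", 91.5),
    Reading("$DATA2", 40.0),
    Reading("CPU 0", 72.0),
]

# The 85 percent limit is an assumed site policy, not a documented default.
for name in over_threshold(readings, 85.0):
    print(f"Threshold exceeded: {name}")
```

A check of this kind is a natural candidate for automation, because it is a routine task that would otherwise be repeated many times.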

Recovering From Problems

Despite the best planning and prevention, problems can still occur. To get your system or application back online quickly after an unplanned outage, it is important to organize and analyze problem information. By implementing systematic problem-solving techniques, you will be able to pinpoint the cause of a problem and resolve the problem in a timely and efficient manner.

Systematic problem solving consists of five steps:
1. Detecting and isolating the problem
2. Gathering the facts and reporting the problem
3. Identifying the cause and developing a solution
4. Escalating the problem, if necessary
5. Reviewing the problem (focus on prevention)

Figure 6-1 illustrates the systematic problem-solving process.

[Figure 6-1. Systematic Problem Solving. Flow: Detect and Isolate the Problem; Get the Facts and Log the Problem; decision on Skill? Knowledge? Authority? Time? (No: Escalate. Yes: Find and Eliminate Cause); Problem Review, Focus on Prevention.]
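The decision point in Figure 6-1 (Skill? Knowledge? Authority? Time?) can be read as a simple rule: the person handling the problem proceeds only if all four are available, and escalates otherwise. The following sketch expresses that reading; the function name and parameters are illustrative assumptions, not part of any Tandem tool.

```python
def should_escalate(has_skill, has_knowledge, has_authority, has_time):
    """Escalate unless the current support level has everything it needs."""
    return not (has_skill and has_knowledge and has_authority and has_time)


# An operator who knows the fix but lacks the authority to apply it escalates.
print(should_escalate(True, True, False, True))
```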

Step 1: Detecting and Isolating the Problem

To detect problems quickly, operators must be aware that a problem exists. Some of the same techniques used to predict and prevent problems are also used to determine if a problem exists. These are:
- Monitoring hardware and software.
- Monitoring system and application software message logs.
- Using Tandem Service Management (TSM) tools, including the TSM EMS Event Viewer. TSM uses expert systems technology to detect, analyze, diagnose, and archive hardware problems as they occur, often detecting failures before they affect system performance.
- Automating monitoring tasks and recovery procedures.
- Receiving information from a user or from users indicating that a problem exists.

To ensure that problems are detected as quickly as possible, establish procedures for monitoring the system and logs, and for receiving information from users. For guidelines to help you develop monitoring procedures, refer to The Availability Guide for Problem Management.

Step 2: Gathering the Facts and Reporting the Problem

After a problem is detected, it is usually reported. Consider establishing procedures for reporting problems. Established procedures help you track:
- Each problem that occurs
- How the problem was resolved
- Who resolved the problem and when
- Recurring problems
- How long it took to resolve the problem
- Whether a problem can be prevented or recovery procedures for that problem can be automated

If all problems are logged, your staff can generate weekly or monthly summaries that allow you to evaluate system and staff performance and focus on problem areas.
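As a rough illustration of the weekly or monthly summaries mentioned above, the following sketch derives a few of the tracked items (problems reported, resolved, and unresolved; average time to resolve; recurring problems) from a problem log. The log layout, field names, and sample entries are hypothetical; a real log would follow your own problem report form.

```python
from collections import Counter
from datetime import datetime

# Hypothetical problem log: one dictionary per logged problem.
problem_log = [
    {"summary": "terminal down", "opened": datetime(1996, 12, 3, 8, 0),
     "closed": datetime(1996, 12, 3, 10, 0)},
    {"summary": "terminal down", "opened": datetime(1996, 12, 4, 8, 0),
     "closed": datetime(1996, 12, 4, 12, 0)},
    {"summary": "line down", "opened": datetime(1996, 12, 5, 9, 0),
     "closed": None},  # still unresolved
]


def summarize(log):
    """Summarize counts, mean resolution time in hours, and recurring problems."""
    resolved = [p for p in log if p["closed"] is not None]
    hours = [(p["closed"] - p["opened"]).total_seconds() / 3600 for p in resolved]
    counts = Counter(p["summary"] for p in log)
    return {
        "reported": len(log),
        "resolved": len(resolved),
        "unresolved": len(log) - len(resolved),
        "mean_hours_to_resolve": sum(hours) / len(hours) if hours else None,
        "recurring": sorted(s for s, n in counts.items() if n > 1),
    }


print(summarize(problem_log))
```

Grouping on a free-text summary is a simplification; a site would more likely group on a problem category or device name recorded on the form.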

Following are suggestions for problem reporting requirements:
- Develop a standard online or hard-copy problem report form to log problems. Require that all problems be documented with this form. The form should record the facts about the problem and the facts about the situation surrounding the problem. Facts about the problem include what happened, where, when, and the magnitude of the problem. Facts about the situation include:
  - Who reported the problem
  - How critical the situation is
  - What events led up to the problem
  - What recent changes might have caused or contributed to the problem
  - What information the event messages, error logs, and memory dumps provide
  - What the current hardware and software configuration is (for example, release levels and product numbers)
  Keeping detailed records can be tedious; however, it saves your staff time and effort. Detailed records help you and your staff determine the cause of a problem, ensure that all identified problems are resolved, determine if procedures need to be improved, and identify recurring problems in order to eliminate the problems or to resolve the problems more quickly. Figure 6-2 illustrates a sample problem report form.
- Create and maintain an outage log. Outage logs provide a useful tool for tracking outages. They provide an accurate assessment of system availability and can be used for trend analysis, to set and maintain service-level objectives, and to determine improvement areas. For example, improvements could include operator training, change management, operational tools and utilities, or additional system software or hardware. Section 4, Operations Documentation, provides guidelines for creating and maintaining outage logs.
- Designate the people responsible for logging problems. For example, you could require that all problems found by operators be logged in an operator log, and all problems encountered by users be logged by help-desk operators.
- Train the staff and users on problem reporting procedures. Everyone needs to understand fully how the problem reporting system functions in order for you to create a smooth flow of work.
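To show how an outage log can support an assessment of system availability, the following sketch computes availability over a period and totals downtime by the outage classes of Table 6-1. The log entries and the 30-day period are hypothetical, and the simple formula (period minus downtime, divided by period) is an assumption about how a site might define availability.

```python
# The four unplanned outage classes defined in Table 6-1.
OUTAGE_CLASSES = ("Physical", "Design", "Operations", "Environmental")

# Hypothetical outage log entries: (outage class, minutes of downtime).
outage_log = [("Operations", 30), ("Physical", 90)]


def availability_percent(log, period_minutes):
    """Percentage of the period during which the system was available."""
    downtime = sum(minutes for _, minutes in log)
    return 100.0 * (period_minutes - downtime) / period_minutes


def downtime_by_class(log):
    """Total downtime in minutes for each outage class."""
    totals = {name: 0 for name in OUTAGE_CLASSES}
    for outage_class, minutes in log:
        totals[outage_class] += minutes
    return totals


period = 30 * 24 * 60  # a 30-day reporting period, in minutes
print(round(availability_percent(outage_log, period), 2))
print(downtime_by_class(outage_log))
```

Comparing the per-class totals from one period to the next is one way to carry out the trend analysis described above.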

Figure 6-2. Sample Problem Report Form

Problem-Reporting Log
Revised MM/DD/YY By:
Page of
Problem Tracking Number:
Node: \
Date: \ \

- Describe the problem. What specifically is wrong?
- Where was the problem first noticed?
- Which applications, components, devices, users have been affected?
- Device name and location:
- When did the problem occur?
- What is the frequency of the problem?
- What is the impact of the problem? (Circle one.) Low / Medium / Critical
- Is the problem getting worse? Y/N
- Who reported the problem?
- Has this problem occurred before? Y/N
- How many users are affected? Where can they be reached?
- What was the user doing when the problem occurred?
- What events preceded the problem?
- What error messages were displayed when the problem occurred?
- What information do event messages, error logs and memory dumps display?
- What is the configuration (including release and product numbers) of the hardware and software products affected?
- What is the probable cause?
- Was the problem escalated to a higher level of support? Y/N If so, who is currently working on it?
- How was the problem resolved?
- Who should resolve the problem?
- Who resolved the problem and when?
- Should the users be notified? Y/N
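If the problem report form is kept online rather than on paper, its fields map naturally onto a record structure. The following is a minimal sketch that covers a subset of the fields in Figure 6-2; the class name, field names, and sample values are hypothetical, and the sample node name in particular is invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Impact(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    CRITICAL = "Critical"


@dataclass
class ProblemReport:
    tracking_number: str
    node: str
    date: str
    description: str
    impact: Impact
    reported_by: str
    occurred_before: bool = False
    escalated: bool = False
    probable_cause: Optional[str] = None
    resolution: Optional[str] = None

    def is_open(self):
        """A report stays open until a resolution has been recorded."""
        return self.resolution is None


# Hypothetical entry for illustration only.
report = ProblemReport(
    tracking_number="0001",
    node=r"\EXAMPLE",
    date="MM/DD/YY",
    description="Terminal unresponsive with blank screen",
    impact=Impact.MEDIUM,
    reported_by="Telephone Order department manager",
)
print(report.is_open())
report.resolution = "Restarted the TACL process"
print(report.is_open())
```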

Step 3: Identifying the Cause and Developing and Implementing a Solution

Using the information obtained when reporting the problem, you are in a position to speculate about what caused the problem and to develop a solution. The following paragraphs provide guidelines for determining the cause of a problem and developing a solution.

Identifying the Cause

To identify the cause of a problem:
- List the possible causes. Using your own knowledge and experience, facts about the situation, and facts about the problem, generate a list of possible causes.
- Identify the most likely cause. To evaluate the possible causes, compare each possible cause with the problem symptoms. The most likely cause is the one that best explains all the problem symptoms.

Using a problem-solving worksheet such as the one illustrated in Figure 6-4 can help you perform these comparisons by applying a systematic approach. The Availability Guide for Problem Management explains how to apply the systematic approach to solving a problem.

Developing a Solution

Given the cause of the problem, determine the best solution to resolve the problem and then follow through and implement the solution. The best solution is one that considers:
- Expense. Is this the least expensive solution?
- Speed. Is this the quickest way to solve the problem?
- Safety. Will this solution adversely affect other components of the system?
- Reliability. Will this solution eliminate the problem? Will this solution cause other problems? For example, if you start a new log on another disk because you are running out of log space, another problem can develop if you forget to erase old log disks.

Step 4: Escalating the Problem (If Necessary)

Some problems are simple and can be resolved by the person who reports the problem. Other problems must be forwarded or escalated to more knowledgeable people for resolution. People at each step in the problem-solving process must decide whether they should proceed or get help.

Most problems can be resolved within your company; others may require Tandem support. The following paragraphs provide guidelines for establishing in-company problem escalation procedures and describe how Tandem can help you resolve problems.

In-Company Problem Escalation Procedures

Problem escalation procedures help you ensure that problems are escalated to the correct people in a timely manner. When establishing problem escalation procedures, consider the following:
- Easy-to-fix problems should be solved by the lowest level of support. This allows higher levels of support to spend time on more complex problems and on other tasks.
- Each level of support should have a specific amount of time in which to solve problems. This prevents lower levels of support from spending too much time on problems that need to be escalated.
- Develop a list of people who can help the operations staff resolve problems. You should list contacts for each application running on the system, and for system software and hardware. You should also include the names of your Tandem representatives.
- Update the problem report log each time a problem is escalated to another level of support.
- Meet with your Tandem representative to establish guidelines for requesting Tandem's help in problem resolution. Determine who will be the Tandem contact within your organization.
- If a problem is escalated to your Tandem support representative, you should provide a copy of a memory dump and system log files.

Tandem Support

Tandem support personnel are available to help your staff resolve system software and hardware problems. Contact Tandem whenever your staff cannot resolve a hardware or system software problem.

Whom to Contact

Contact the Tandem NonStop Support Center (TNSC) if one is available in your country. The TNSC can answer questions, diagnose hardware problems, and dispatch local help as needed. For more information on contacting the TNSC, refer to the Tandem Support Guide or contact your local Tandem representative. If a TNSC is not available, contact your Tandem representative.
Your representative can answer questions, explain problem escalation procedures, help you resolve problems, and dispatch help, if necessary.

Note. The Operator Messages Manual and the Processor Halt Codes Manual contain lists of items and information that need to be collected before contacting Tandem.

What Tandem Does

Tandem helps you diagnose and resolve problems. If the problem is critical or hardware must be replaced, a Tandem analyst or customer engineer (CE) is dispatched to your site to solve the problem. If the analyst or CE cannot solve the problem, he or she escalates the problem to the district level, then to the regional level.

Step 5: Reviewing the Problem

When a problem is resolved, the solution can be recorded, and the problem report can be closed. Reviewing problems and solutions can help the staff prevent the same problems from recurring. Consider holding regular review meetings with the staff to:
- Review resolved and unresolved problems
- Ensure that progress is made to close open issues and unresolved problems
- Learn how problems were resolved and determine if problems could have been resolved more quickly
- Improve problem-reporting and problem-escalation procedures as necessary

Using information gathered from the problem review, you can detect trends and generate reports that provide statistics such as the number and types of problems encountered, reported, escalated, and resolved. You can also use this information to determine the amount of time problems remained open, as well as the levels of support required to solve the problems. These reports help you measure the performance of your staff, determine whether service-level agreements are being fulfilled, and determine what training or changes are needed to improve the problem-reporting, problem-tracking, problem-escalation, and recovery procedures.

Case Study

The following case study illustrates a problem scenario in which a computer operator systematically uses the problem-solving worksheet (illustrated in Figure 6-4) to determine the cause of a problem and develop a solution.

Business Background and System Configuration

Just For Children, Inc. (JFC) is a retail store that sells a variety of children's clothes, toys, and furniture. As illustrated in Figure 6-3, JFC's four-processor G-series system is connected to:
- Twenty-two terminals operating 16 hours a day (from 8:00 a.m. to midnight). These terminals are operated by telephone order clerks working at the central warehouse located approximately 30 miles from headquarters.
- Twenty other terminals accessed by sales clerks in nine retail stores throughout the northwestern and western United States 11 hours a day (10:00 a.m. to 9:00 p.m.).

[Figure 6-3. Case Study: Just For Children, Inc. (JFC) Computer System. Diagram labels: Headquarters (Himalaya S-Series Server, Communications Controller); Dial-Up Lines; Leased Communication Lines; JFC Retail Stores (Portland, Reno, Sacramento; Cluster Controllers); Warehouse Telephone Order Department (Communications Controller, Cluster Controllers; terminals $WHS2.#TRM1, $WHS2.#TRM7, $WHS4.#TRM15, $WHS4.#TRM).]

Business and Operations Activities

Telephone order clerks and other employees at the warehouse log on at a standard Tandem Advanced Command Language (TACL) prompt and access various applications such as electronic mail, order entry, and inventory. They use these applications to communicate with coworkers at other sites, take telephone orders from customers, and verify item availability.

Sales clerks in the retail stores access the applications to verify item availability and arrange for delivery of requested items.

Batch programs access the orders and inventory database during the night to generate shipping and billing documents.

To prepare for an expansion in the warehouse, the operations group performed a system move during the previous weekend, moving several terminals from the east wall to the west wall of the warehouse.

Problem Scenario

It is late fall, and the holiday season, the busiest (and most profitable) season of the year, is fast approaching. JFC's business is on the upswing.

On Tuesday morning at 8:00 a.m., the operations group gets a call from the manager of the Telephone Order department at the warehouse, indicating that a terminal is down. The operations group is busy cleaning up after a system move, and the problem is not urgent because there are four spare terminals hooked up at the warehouse. So the operator on duty agrees to look into it within two days.

On Wednesday morning at 8:00 a.m., the operations group gets another call from the manager of the Telephone Order department, indicating that a second terminal is down. She is quite concerned now. The busy season is upon her, and she has been preparing for it by hiring temporary workers. Two more telephone order clerks are scheduled to start work on Friday. That will leave her with no spare terminals and no margin for error. She wants the problem fixed today.

Gathering Facts About the Problem

Using the problem-solving worksheet illustrated in Figure 6-4, the operator on duty determines the following facts about the problem:
- What? Two terminals are down (unresponsive with blank screens). They are identified as $WHS2.#TRM7 and $WHS4.#TRM20.
- Where? The downed terminals are located in the warehouse. Terminal $WHS2.#TRM7 is located along the east wall of the warehouse and terminal $WHS4.#TRM20 is located along the west wall of the warehouse.
- When? The first incident occurred on Tuesday at 8:00 a.m. The second incident occurred on Wednesday at 8:00 a.m. Both incidents occurred at the beginning of the first shift of the day.
- Magnitude? This has become a severe problem. By Friday, all working terminals will be in use. If another terminal goes down, there will be no spare terminals available.

Gathering Facts About the Situation

By asking questions about the situation, the operator discovers the following additional information about the problem:
- There are two shifts. The day shift hours are from 8:00 a.m. to 4:00 p.m. The evening shift hours are from 4:00 p.m. to midnight.
- The temporary workers are assigned to the evening shift.
- Both terminals are used by the temporary workers during the evening shift.

134 Problem Management Determining the Cause and Resolving the Problem The temporary workers often neglect to power down the terminals after their shift, leaving the terminals on overnight. (This has been against company policy, since the start of the energy conservation program.) In both situations, the terminals were plugged in and the cable connections were solid. Determining the Cause and Resolving the Problem After evaluating the problem symptoms and studying the system configuration, the operator develops a list of possible causes for the unresponsive terminals. Using the problem-solving worksheet to compare the symptoms against the possible causes, the operator suspects a stopped or suspended TACL process to be the most likely cause. The following table shows how the operator arrives at this decision: Possible Cause Terminal hardware problem Terminal configuration problem Faulty communication lines Controller malfunction TACL problem A problem as a result of the system move Likely Cause? Not likely. Because both terminals were unresponsive at the beginning of the day shift but were in working order the night before, it is unlikely that there is a hardware problem. A self-test performed on the terminals verifies that both are in working order. Not likely. Because both terminals were unresponsive at the beginning of the day shift but were in working order the night before, it is unlikely that the terminal configuration was changed. No. Other terminals using the same communications lines are in working order. No. Other terminals using the same controller are in working order. Yes. A stopped or suspended TACL process could cause all the problem symptoms. No. Only one terminal ($WHS4.#TRM20) was moved during the system move. Terminal $WHS2.#TRM7 was not moved. 
The operator suspects that the new temporary workers, not familiar with TACL, entered EXIT at the TACL prompt at the end of their evening shift, thereby stopping the local TACL process and locking up their terminals. Employees beginning their shift the next morning were unable to access the terminals. After verifying with the manager that this was indeed what happened, the operator restarts the TACL processes for each terminal. To prevent this problem from occurring again, the operator suggests to the manager that TACL training be provided for new employees.

Figure 6-4. Problem-Solving Worksheet

PROBLEM-SOLVING WORKSHEET

Problem Facts                     Possible Causes
                                  Terminal  Terminal  Comm.                    System
                                  Hardware  Config.   Lines  Controller  TACL  Move
What? 2 terminals down
  $WHS2.#TRM7                     Yes       Yes       Yes    Yes         Yes   Yes
  $WHS4.#TRM20                    Yes       Yes       Yes    Yes         Yes   No
Where? In warehouse
  $WHS2.#TRM7 on east wall        Yes       Yes       No     No          Yes   No
  $WHS4.#TRM20 on west wall       Yes       Yes       No     No          Yes   Yes
When?
  One on Tuesday at 8:00 a.m.     ?         ?         No     No          Yes   No
  One on Wednesday at 8:00 a.m.   ?         ?         No     No          Yes   No
  Both at beginning of shift
Magnitude? Severe. By Friday, there will be no spare terminals.

Situation Facts
  Jill Jones, Manager, Telephone Order Dept., X455.
  New employees during evening shift.
  Problem terminals used by new employees.
  Terminals plugged in and cables secure.
  Terminals left on overnight.

Most Likely Cause
  Stopped or suspended TACL process. New employees entered EXIT at TACL prompt.

Plan to Prevent/Control Damage
  Provide TACL training for new employees.

Plan to Verify/Fix
  Check with manager to verify that employees entered EXIT at TACL prompt.
  Restart TACL process.

Escalation Decision
  None
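The worksheet in Figure 6-4 can be read as an elimination table: a possible cause remains a candidate only if it could explain every recorded problem fact. The following Python sketch illustrates that reading. It is not part of any Tandem tool; the names CAUSES, WORKSHEET, and likely_causes are hypothetical, and it assumes that the "?" entries in the worksheet mean "not yet determined."

```python
# Possible causes, in the column order of the worksheet in Figure 6-4.
CAUSES = ["terminal hardware", "terminal config", "comm lines",
          "controller", "TACL", "system move"]

# One row per problem fact, one entry per cause:
#   True  = the cause could explain the fact ("Yes")
#   False = the cause cannot explain the fact ("No")
#   None  = not yet determined ("?", an assumed interpretation)
WORKSHEET = {
    "what: $WHS2.#TRM7 down":        [True, True, True, True, True, True],
    "what: $WHS4.#TRM20 down":       [True, True, True, True, True, False],
    "where: $WHS2.#TRM7 east wall":  [True, True, False, False, True, False],
    "where: $WHS4.#TRM20 west wall": [True, True, False, False, True, True],
    "when: Tuesday 8:00 a.m.":       [None, None, False, False, True, False],
    "when: Wednesday 8:00 a.m.":     [None, None, False, False, True, False],
}


def likely_causes(worksheet, causes):
    """Return the causes that are marked "Yes" for every problem fact."""
    result = []
    for column, cause in enumerate(causes):
        if all(row[column] is True for row in worksheet.values()):
            result.append(cause)
    return result


print(likely_causes(WORKSHEET, CAUSES))  # prints ['TACL']
```

Treating an undetermined entry as "not confirmed" is a conservative choice: a cause is reported only when every fact supports it, which matches the operator's conclusion in this example.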

Problem Management Tools

Tandem provides a number of tools to help your staff with problem management tasks. Table 6-2 summarizes the problem management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, Operations Management Tools.

Note. For a list of automation tools, refer to Section 12, Automating and Centralizing Operations. For a list of performance tools, refer to Section 8, Performance Management.

Table 6-2. Problem Management Tools

Tools (table rows):
- Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
- Enform
- Event Management Service (EMS)
- Event Management Service Analyzer (EMSA)
- NetBatch and NetBatch-Plus
- Network Statistics Extended (NSX)
- Object Monitoring Facility (OMF)
- Open Notification Service (ONS)
- Subsystem Control Facility (SCF)
- Tandem Failure Data System (TFDS)
- Tandem Service Management (TSM)
- TSM EMS Event Viewer
- ViewSys

Capabilities (table columns): Event Viewing, Event Collection, Resource Monitoring, Problem Detection, Problem Analysis, Problem Recovery, Automated Operations

[The X marks that assign capabilities to individual tools did not survive transcription; refer to Section 14, Operations Management Tools, for the capabilities of each tool.]

Check List

The following check list summarizes the main points of problem management:

1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies. Your staff should:
   - Monitor the hardware and software
   - Monitor system and application message logs
   - Automate operations and recovery procedures as much as possible
   - Ensure that the system's fault-tolerant features are fully used and maintained
   - Design your system to take advantage of quick startup and shutdown techniques
   - Ensure the availability of super-group (255, n) capabilities to solve certain problems
   - Be prepared and trained for environmental problems and disasters
   - Maintain up-to-date and well-tested recovery procedures
3. Establish problem detection procedures. Your staff should:
   - Monitor the hardware and software
   - Monitor system and application software message logs
   - Automate system-monitoring tasks and use monitoring check lists
   - Monitor TSM incident reports
   - Act on information received from users reporting problems
4. Establish procedures for reporting problems:
   - Develop a standard problem report form.
   - Create and maintain a system outage log.
   - Designate people responsible for logging problems.
   - Consider establishing a help desk.
   - Train staff and users in problem reporting procedures.
5. Establish problem-solving techniques for identifying the cause of a problem and developing a solution. Using a problem-solving worksheet can help operators systematically list the facts about a problem, list possible causes, identify the cause, and develop a solution.
6. Establish problem escalation procedures. Your staff should:
   - Know who should work on easy-to-fix problems and who should work on complex problems, and determine the percentage of problems that should be resolved by each level of support.
   - Know how long to work on a problem before escalating the problem to the next level of support.
   - Know whom to contact for help with system-related and application-related problems.
   - Update the problem report form whenever a problem is escalated.
   - Know which person on each shift is the Tandem contact. The Tandem contact should understand when and how to contact Tandem.
   - Know how to take processor memory dumps and obtain copies of system log files.

7. Establish procedures for reviewing problems:
   - Periodically meet with your staff to review solved and unsolved problems and to determine if improvements in the procedures can be made to prevent the same problems from occurring in the future.
   - Generate reports to provide statistics on the number of problems encountered, solved, and not solved, and on the time and levels of staff required for problem resolution.
8. Determine which tools to use. Tandem offers these tools:
   - Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
   - Enform
   - Event Management Service (EMS)
   - Event Management Service Analyzer (EMSA)
   - NetBatch and NetBatch-Plus
   - Network Statistics Extended (NSX)
   - Object Monitoring Facility (OMF)
   - Open Notification Service (ONS)
   - Subsystem Control Facility (SCF)
   - Tandem Failure Data System (TFDS)
   - Tandem Service Management (TSM)
   - TSM EMS Event Viewer
   - ViewSys
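The review statistics named in item 7 of the check list (problems encountered, solved, and not solved, and the time and level of staff required) can be derived from a problem log. The following Python sketch shows one possible shape for such a report. The record fields (solved, support_level, hours_to_resolve) and the sample values are hypothetical and do not reflect the format of any Tandem tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical problem-log records; field names and values are illustrative only.
PROBLEM_LOG = [
    {"id": 1, "solved": True,  "support_level": 1, "hours_to_resolve": 0.5},
    {"id": 2, "solved": True,  "support_level": 2, "hours_to_resolve": 4.0},
    {"id": 3, "solved": False, "support_level": 2, "hours_to_resolve": None},
    {"id": 4, "solved": True,  "support_level": 1, "hours_to_resolve": 1.5},
]


def summarize(log):
    """Count problems encountered, solved, and unsolved, and average the
    resolution time of the solved problems for each support level."""
    solved = [p for p in log if p["solved"]]
    hours_by_level = defaultdict(list)
    for problem in solved:
        hours_by_level[problem["support_level"]].append(problem["hours_to_resolve"])
    return {
        "encountered": len(log),
        "solved": len(solved),
        "unsolved": len(log) - len(solved),
        "mean_hours_by_level": {
            level: mean(hours) for level, hours in sorted(hours_by_level.items())
        },
    }


report = summarize(PROBLEM_LOG)
print(report)
```

A summary of this kind gives the periodic review meeting a consistent starting point, for example to check whether the share of problems resolved at each support level matches the targets set in the escalation procedures.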


7 Change and Configuration Management

Overview

Systems and software often change. For example, you might add hardware to a system, update applications, or install a new release of the operating system. These changes can increase the effectiveness of your operations, or they can create confusion and problems, depending on how your organization handles the changes. If you establish formal change-management and configuration-management functions within your organization, you can ensure that change will proceed smoothly.

This section provides an overview of:

- Change-management and configuration-management functions and goals
- Management responsibilities and staffing requirements
- How to anticipate and prepare for change
- Guidelines for installing and implementing hardware and software changes
- Implementing a change control process

This section ends with a check list that summarizes the main points of change and configuration management.

Note. The Availability Guide for Change Management defines change management in detail, providing information on how to manage change in the Tandem environment, indicating which Tandem tools can be used to make changes, advising how to plan for and perform changes online, and suggesting how to minimize down time when making changes offline.

What Are Change Management and Configuration Management?

Change management and configuration management are interrelated functions.

Change management is the process of managing the maintenance and growth of your NonStop system. Change management involves managing all hardware, software, and procedural changes and includes all of the tasks required to properly manage change within the operations environment.

Configuration management is the process of managing and administering the configuration of system software and hardware, application subsystems, communications subsystems, and application software.
Configuration management functions include inventory control, version control, software distribution, and name management.

Change and configuration management encompasses the following major areas, which are described briefly in this section:

- Anticipating and planning for change
- Installing and implementing changes to system software and hardware, application subsystems, communications subsystems, and application software
- Controlling the introduction of change

For detailed descriptions of these topics, refer to the Availability Guide for Change Management.

The Goals of Change and Configuration Management

The main goal of change and configuration management is to minimize the impact of change on system and application availability, while successfully migrating your system or application from one stable configuration to another. You can meet the goals of change management by:

- Performing changes online. Being able to make changes to your hardware or software online is one way to reduce or even eliminate system and application down time caused by change. Online change is change that can be performed while the system or subsystem is still running and available to users. Because of their fault-tolerant architecture, Tandem systems reduce the down time necessary to reconfigure hardware. Down time for software upgrades can be eliminated by using tools provided by Tandem that allow you to make software changes online. For detailed information about performing changes online, refer to the Availability Guide for Change Management.
- Reducing the time required for planned outages. Planned outages occur when changes must be implemented and the computing environment must be stopped to implement them. An example of such a change is the installation of a new version of the operating system. Reducing the time required for planned outages can be accomplished by:
  - Minimizing the frequency of planned outages.
  - Reducing system and application startup and shutdown time.
  - Using a formal change control process to manage change. Change control helps to ensure that the planned outages will proceed smoothly.

Techniques for reducing system and application startup and shutdown time and minimizing the frequency of planned outages are described in the Availability Guide for Change Management. The change control process is described in this section.

Management Responsibilities

The change-management and configuration-management functions are most effective when policies and procedures are developed and enforced, and the staff is trained. It is important to:

- Establish change-management and configuration-management policies
- Determine who must approve changes
- Train appropriate staff in the policies and procedures

Staffing

Your staffing needs depend on:

- The size of your company and the number of systems and applications your company runs. The larger the company and the greater the number of systems and applications, the greater the number of people who must be involved in change and configuration management. If you have a small operation, you might need to assign only one person to change and configuration management. If you have a large operation, you might need to assign a whole group to this function.
- Whether you want to centralize or decentralize the tasks. If you have distributed systems or client/server implementations, you must decide whether or not to centralize the change-management and configuration-management functions. Centralizing the functions allows one group of people to control change on all systems. Decentralizing the functions allows each site to handle its own changes.

No matter what the size of your company, you should assign (full-time or part-time) technically knowledgeable people to change management.
For change management tasks, the staff is most effective when it:

- Knows how to evaluate software quality assurance test results
- Is able to negotiate with people
- Is able to solve problems
- Understands how changes might affect all parts of the system and operating procedures

For configuration management tasks, the staff is most effective when it knows how to:

- Design a fault-tolerant configuration
- Design a configuration that balances system components so that the system performs well
- Use Tandem system-configuration and system-generation tools

For both the change-management and configuration-management tasks, the staff should also have good communication, planning, and organizational skills, and should understand the needs of application developers, operators, management, and users.

Anticipating and Planning for Change

By taking the time to anticipate and plan for change, you can avoid taking your system down unnecessarily. Planning for change is especially important in environments that require 24-hour-a-day, 7-day-a-week operations. You can anticipate and plan for change by:

- Evaluating system performance and growth. By evaluating system performance, and tracking and anticipating growth, you can establish plans to accommodate that growth.
- Providing adequate computer room resources. To allow for growth and avoid unnecessary down time, you might want to provide enough physical space for future expansions and ensure that you have enough power and cooling capacity for additional equipment.
- Configuring your system with change in mind. If you plan ahead for capacity growth, you can preconfigure additional resources into the system according to your plans. By configuring growth into the system, you might not have to bring down the system when you add hardware or load software upgrades.

For more information on capacity planning, refer to Section 8, Performance Management.

Installing and Implementing Changes

Installing and implementing changes to your system can involve the following:

- Performing hardware changes.
- Performing system configuration changes. This includes installing a new operating system release; installing Tandem software products; installing an interim product modification (IPM); and adding, removing, and reconfiguring hardware.
- Performing subsystem changes. This includes installing and reconfiguring Tandem application and communications subsystems.
- Performing software changes. This includes installing vendor software and installing software developed in-house.

Performing Hardware Changes

When hardware changes are required, be sure to contact your Tandem representative and customer engineer (CE) well in advance of changes.
Follow these guidelines when planning the installation of hardware:

- Provide enough space for additional hardware. Determine how the equipment will be used and where it should be placed. If you use a raised floor, make sure that the floor can support the additional load. If floor cuts are required, make sure that you schedule the cuts.
- Make sure that there are enough I/O and controller slots available. You might need to add system cabinets.

- Make sure that there is sufficient power. If there is not, schedule time for adding the power. You might need to add power sockets.
- Make sure that there is sufficient air conditioning. If there is not, schedule time for improving the air-conditioning system.
- Determine whether you need special cables, modems, or cabinets for communications equipment.
- Determine whether the change requires down time. If so, schedule the down time with operations and notify users.
- If your staff needs help from a Tandem CE, notify the CE at least 30 days in advance of the change.
- Determine whether the hardware change requires a new system configuration.
- Make sure that the necessary operations staff is available to install the change.

Performing System Configuration Changes

A system configuration change occurs when you install a new operating system release, install an IPM, add system software (including new Tandem products), or add, remove, or reconfigure hardware. Follow these guidelines when changing the system configuration:

- When changing the hardware configuration, make sure that the system will be as fault-tolerant as possible. This can help prevent unplanned down time from occurring when a change is made. The Availability Guide for Problem Management describes how to audit your system for fault tolerance and how to configure your hardware and software for fault tolerance.
- Before making system configuration changes, draw out the changes in a system diagram. System diagrams help you check your work and prevent mistakes. An example of a system diagram is shown in Section 4, Operations Documentation. After you finish the system diagram, you might want the change management staff to approve the changes.
- To prevent a system outage, determine if the system configuration change can be installed online.
- To install updated software to support a new configuration, use the Distributed Systems Management/Software Configuration Manager (DSM/SCM) product. DSM/SCM provides for the centralized planning, management, and installation of software on distributed (target) Tandem systems. DSM/SCM runs on a Tandem central (host) system and performs the tasks of receiving, archiving, configuring, and packaging software for target sites. It also runs on each target system, where its primary function is to apply the software received from the central site.
- To change a hardware configuration, you can use DSM/SCM. DSM/SCM is used to create a new operating system image (OSI), which allows you to add and change configurations for all device types; however, you must stop all processing on the system to load the new OSI.

  SCF allows you to add and change configurations for many (not all) device types while the system is running. SCF also reduces the need to preconfigure software changes.

Performing Subsystem Changes

The Tandem environment consists of application and communications subsystems. The application subsystems enable you to develop and run high-performance, high-volume, and highly available OLTP applications. The application subsystems that make up the Tandem application environment include NonStop Transaction Services/MP (TS/MP), NonStop SQL/MP, and NonStop Transaction Manager/MP (TM/MP). The communications subsystems provide users with access to a set of communications services.

Follow these guidelines when changing or creating an application or communications subsystem:

- Before starting the change process, create a list of all parameters that will be changed. Keep this list in case questions arise in the future.
- You might have to shut down the applications that are using the products before reconfiguring the products. Refer to the Availability Guide for Change Management for a description of application changes that can be performed online.
- Use each product's command interpreter to reconfigure the application. For example, use TMFCOM to reconfigure NonStop TM/MP.
- To reconfigure data communications lines and devices, you can use DSM/SCM or the Subsystem Control Facility (SCF), depending on the device or line type.

Performing Software Changes

A software change can include installing software purchased from another vendor or installing software developed in-house. To protect system security, develop procedures to prevent the installation of any software except the legitimate software. The staff should answer the following questions before installing new software:

- Has the software come through ordinary channels?
- Is the software documented in the way that is standard for the organization?
- If the software is an update, does it update the particular version of the software you already have?
- Do the installation instructions require you to expose your system security?

Exercise similar care when dealing with software created by your system programmers or application programmers. Review their software carefully, especially if it performs actions on behalf of other users or uses special privileges. Consider including the following questions in staff check lists:

- Has the software been subjected to standard quality-assurance and offline testing?
- Has the software been reviewed and approved by management?
- Does the software require privileges for an obscure or unnecessary function?

Once it is determined that the changes will not affect system security, the staff can prepare to install the software. The following guidelines can be incorporated into a preinstallation check list:

- Determine whether the change requires a new system configuration or reconfiguration of applications.
- Determine whether the change requires down time. If so, schedule the down time with operations and notify users.
- If an application is distributed across a network, new versions of the application must be compatible with the old versions at the other nodes or at least be able to coexist with the old versions. To meet this requirement, use version identifiers that are embedded in the requestors, in the servers, and in the messages between the requestor/server pairs. The requestors and servers can check the version identifiers to ensure compatibility.
- Back up all files and programs that will be changed. You might need the backups for recovery in case the changes do not work.
- Make sure that the necessary operations staff is available to install the change.
- Develop a strategy for determining where to locate data files. For example, if an alternate-key file is accessed frequently by an application, do not locate it on the same volume as the primary file. If the files are located on the same volume, system performance might be degraded.
- Provide the operations staff with the information (documentation, list of file names, and so on) and tapes it needs to install the changes. You might want to develop a formal sign-off procedure for handing over all materials to operations.
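The version-identifier guideline in the preinstallation check list above can be illustrated with a small requestor/server exchange. The following Python sketch is an assumption-laden illustration only: the Message and Server classes and the supported_versions set are hypothetical and are not the interface of any Tandem product. It shows a server that accepts messages from both old and new requestors during a staged rollout and rejects versions it does not know.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    version: int   # version identifier embedded in every message
    payload: str


class Server:
    """Hypothetical server that checks the version identifier of each message."""

    def __init__(self, version, supported_versions):
        self.version = version
        # Message versions this server can process (an assumed design choice).
        self.supported_versions = frozenset(supported_versions)

    def handle(self, msg):
        """Process the message only if its version is supported."""
        if msg.version not in self.supported_versions:
            return (f"rejected: message version {msg.version} "
                    f"not supported by server version {self.version}")
        return f"ok: {msg.payload}"


# A new server (version 2) that can coexist with old requestors (version 1).
new_server = Server(version=2, supported_versions={1, 2})
print(new_server.handle(Message(version=1, payload="balance inquiry")))
print(new_server.handle(Message(version=3, payload="balance inquiry")))
```

Rejecting an unknown version explicitly, rather than attempting to process it, lets operations detect a node that was upgraded out of sequence.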

Controlling the Introduction of Change

Change occurs all the time. If you do not control who makes the changes and when, you might put your system at risk. If there are no controls, frequent changes and changes by unauthorized personnel might threaten the stability of the system. For example, unauthorized personnel might damage the system by loading programs that don't conform to your security policies, by removing programs without permission, or by changing existing programs without informing others.

Methods of preventing these problems include limiting the frequency of changes and making a person or a group responsible for change control. Limiting the frequency of changes stabilizes the system. For example, by grouping noncritical changes into scheduled releases, you can minimize down time.

What Is Change Control?

Change control is the process for proposing, planning, implementing, and testing change and should be required for all changes. Using a change control process helps minimize the duration of planned outages (system or application down time that is planned or scheduled). Change control also ensures the successful migration of a system or application environment from one stable configuration to another by:

- Ensuring that the scope and ramifications of the change are fully understood
- Providing plans for restoring the original environment so that system operation can be resumed if the change doesn't work
- Ensuring that problems and errors are anticipated and reacted to appropriately
- Maintaining the security of your system and applications

Implementing Change Control Successfully

The prerequisites for implementing successful change control include:

- Establishing a single point of control. Making a person or a group responsible for change control prevents unauthorized personnel from accessing and changing the system.
  The change control staff is usually responsible for controlling changes to:
  - Hardware
  - Operating system software
  - Applications
  - Procedures
  - Macros, command files, and routines used to perform system operations tasks
  - Tandem products
  - Configuration files
  - Databases
- Obtaining company commitment. All organizations necessary to support the change must be committed to the plan and must clearly understand the change process.

- Reviewing the process. Continual process improvement should be an integral part of the change control process. The change control staff should:
  - Take baseline measurements (using Measure, for example) of your system and application environments before you make changes
  - Measure the impact of the change against these baseline measurements after the change has been implemented
  - Keep track of how long it takes to implement changes
  - Know the common changes in your environment

The Change Control Process

Changes are usually handled in four phases: definition and documentation, change planning, implementation, and verification.

Phase 1: Definition and Documentation
In this phase, the change control staff formally defines the proposed change, makes sure that all requirements have been met, and documents the expected results of the change.

Phase 2: Change Planning
In this phase, the change control staff assesses the impact of the change, creates a plan to implement the change, and develops a recovery plan in case the change does not work.

Phase 3: Implementation
In this phase, the change control staff implements the change according to the change plan created during the change-planning phase (phase 2).

Phase 4: Verification
In this phase, the change control staff makes sure that the system is running correctly and reviews the change control process to make any necessary improvements.

The Availability Guide for Change Management provides detailed guidelines for implementing a change control process. Tools offered by Tandem that can be used for change and configuration management are listed later in this section. Your representative can direct you to the Tandem Alliance Program companies that offer additional change-management and configuration-management tools.
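The four phases described above form a strict sequence, and the planning phase must produce a recovery plan before implementation begins. The following Python sketch shows one way a change-tracking record could enforce that ordering. The ChangeRequest class and its method names are hypothetical, and the rule that planning cannot be completed without a recovery plan is an assumption drawn from the phase 2 description, not a feature of any Tandem tool.

```python
PHASES = (
    "definition and documentation",
    "change planning",
    "implementation",
    "verification",
)


class ChangeRequest:
    """Tracks one change through the four phases, in order."""

    def __init__(self, description):
        self.description = description
        self.completed = []        # phases finished so far, in order
        self.recovery_plan = None  # produced during change planning

    def complete(self, phase, recovery_plan=None):
        """Mark the next phase as complete, rejecting out-of-order phases."""
        if len(self.completed) == len(PHASES):
            raise ValueError("all phases are already complete")
        expected = PHASES[len(self.completed)]
        if phase != expected:
            raise ValueError(f"expected phase '{expected}', got '{phase}'")
        if phase == "change planning":
            # Assumed rule: planning is not finished without a recovery plan.
            if not recovery_plan:
                raise ValueError("change planning must produce a recovery plan")
            self.recovery_plan = recovery_plan
        self.completed.append(phase)


change = ChangeRequest("install new release of an application")
change.complete("definition and documentation")
change.complete("change planning",
                recovery_plan="restore previous version from backup")
change.complete("implementation")
change.complete("verification")
print(change.completed)
```

Attempting to call complete("implementation") on a new request raises ValueError, which mirrors the requirement that no change is implemented before it has been defined and planned.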

Case Study

Effective planning and managing of change minimize risk and help ensure that customer service levels are met. The following case study shows how a New England bank implemented change control procedures to manage its growing environment.

User Profile

Allied Bank is a financial institution with 370 offices located in the New England area. Current assets exceed $18 billion. Allied Bank operates the Money Machine Network of automated teller machines (ATMs). The Money Machine Network consists of approximately 700 ATMs and interfaces to over 80 financial institutions. A Tandem system consisting of 8 G-series system processors and 20 logical disk volumes is used to drive the ATM network and interfaces. The network currently processes over 4 million transactions per month.

Business Background

The Money Machine Network has experienced a high rate of growth over the past two years because of acquisition and aggressive marketing. In the past 24 months, the network has grown approximately 350 percent in transaction volume and number of ATMs. Projected growth for the next year is over 40 percent.

Because of its growing environment, Allied Bank has required frequent changes: installation of new hardware, reconfiguration of system software, and application changes. These frequent changes have resulted in several service outages in the past two years.

Analysis of Problem

At Allied Bank, overall responsibility for change management was not defined. Changes were implemented haphazardly, without always following formal procedures. Other observations included:

- Change management procedures did not exist. Changes to the environment did not go through a formal change process.
- Change schedules were not formalized, comprehensive, or communicated. Each group had a schedule of changes that it was responsible for, but no one group was coordinating all change activities.
  Consequently, an overall schedule of changes was not kept across functional areas.
- Changes were made without test, installation, or backout plans.
- Configuration changes, both hardware and software, were made without risk assessment, management approval, or communication to all functional areas.
- Major changes, such as system software upgrades and new hardware installations, were managed informally but more effectively than configuration changes. Allied Bank did assign a project leader to develop a plan and manage the migration to a supported release of the operating system.

Implementation of Recommendations

Allied Bank had developed a few change procedures and had assigned one person to coordinate operating system migrations. Although this approach was somewhat effective, it did not provide for ongoing change management. A formal method needed to be developed to manage all change, not just major changes. The following recommendations were implemented at Allied Bank:

- Establish a change management group
- Define formal change procedures
- Assign knowledgeable individuals with the responsibility for developing the plans for the change request
- Hold periodic change control meetings

Establish a Change Management Group

It is important to have a single focal point for all changes to the environment. This minimizes the possibility of a change affecting seemingly unrelated parts of the system and helps ensure effective communication among all groups. Therefore, a change management group was established at Allied Bank. Its responsibilities are to:

- Define the change procedures and ensure that all changes follow the new procedures
- Manage all changes
- Document all changes

Define Formal Change Procedures

The change management group is responsible for defining change control procedures. At Allied Bank, change control procedures addressed the following areas:

- Risk assessment
- Test procedures
- Installation planning
- Backout planning
- Notification
- Management approval
- Scheduling
- Installation verification

The change management group further defined its change control procedures by illustrating the flow of change requests. Figure 7-1 shows the required steps and departments a change request must go through at Allied Bank. Allied Bank uses this change control process for application changes only. Another diagram was created for system changes.

Figure 7-1. Case Study: Change-Control Process Flow at Allied Bank
[Figure labels: Development Environment; Design (new change); Operations Test Environment; Business Operations Group; Customer; Production Environment; Programming; Technical Support Group; Implementation; System Acceptance Test; Librarian; Operations Acceptance Test; Production Operations; Customer Acceptance & Training]

Assign Knowledgeable Individuals With Responsibility for Change Requests

Individuals with the most knowledge regarding the change request are chartered with the responsibility of developing the needed plans. For example, installing a new version of application software requires the applications development group to develop the test, installation, backout, and verification plans. These plans are then presented at a change control meeting for approval and scheduling.

Hold Periodic Change-Control Meetings

Periodic change-control meetings are now held with members from applications development, technical support, quality assurance, computer operations, network operations, and business operations. These meetings are chaired by the change management group. The meetings focus on:

- The previous week's changes. The changes are reviewed and their status is noted.
- New change requests. New changes are presented and reviewed. If all approval criteria are met, the change is scheduled for installation.

The minutes of the meeting are published, and a weekly schedule is distributed to each functional area and to management.

Conclusion

Definition of change management responsibility, formal change procedures, periodic change-control meetings, and the development of a change schedule help position Allied Bank to manage more effectively the changes anticipated because of high growth.
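The approval step in the change-control meetings described in this case study can be sketched as a simple gate: a change request is scheduled only when the required plans are present. The following Python sketch is illustrative only. The case study does not list the bank's actual approval criteria, so using the four plans named above (test, installation, backout, and verification) as the criteria is an assumption, and the names REQUIRED_PLANS and review are hypothetical.

```python
# Plans named in the case study; treating them as the complete set of
# approval criteria is an assumption made for this sketch.
REQUIRED_PLANS = ("test", "installation", "backout", "verification")


def review(change_request):
    """Return ("scheduled", []) if every required plan is present,
    otherwise ("deferred", [names of the missing plans])."""
    plans = change_request["plans"]
    missing = [name for name in REQUIRED_PLANS if not plans.get(name)]
    if missing:
        return ("deferred", missing)
    return ("scheduled", [])


complete_request = {
    "title": "new version of application software",
    "plans": {
        "test": "run acceptance tests in the test environment",
        "installation": "install during the weekly maintenance window",
        "backout": "restore the previous version from backup",
        "verification": "confirm transactions complete after installation",
    },
}
incomplete_request = {
    "title": "reconfigure communications line",
    "plans": {"test": "loopback test", "installation": "apply new settings"},
}

print(review(complete_request))
print(review(incomplete_request))
```

Returning the list of missing plans, rather than only a yes-or-no decision, gives the requesting group a concrete list of what to prepare before the next meeting.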

Change-Management and Configuration-Management Tools

Tandem provides a number of tools to help your staff with change-management and configuration-management tasks. Table 7-1 summarizes the change-management and configuration-management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, Operations Management Tools. For a list of tools to help you predict and assess the impact of change on your system, refer to Section 8, Performance Management.

Table 7-1. Summary of Change-Management and Configuration-Management Tools

Tools (table columns):
- Distributed Name Service (DNS)
- Distributed Systems Management/Software Configuration Manager (DSM/SCM)
- NonStop SQL/MP SQLCI
- NonStop TM/MP Interfaces (TMFCOM and TMSERVE)
- NonStop TS/MP PATHCOM Interface
- Subsystem Control Facility (SCF)
- Tandem Service Management (TSM)

Capabilities (table rows, with the marks as they appear in each row):
- Install or Change Hardware and Software: X X X
- Change Applications: X X X
- Can Perform Changes Online: X X X X X
- Name Management: X

Check List

The following check list summarizes the main points of change and configuration management:

1. Obtain management commitment to developing policies and procedures, training staff in the policies and procedures, and enforcing policies and procedures.
2. Determine your staffing needs.
3. Anticipate and plan for change by:
   - Evaluating system performance and growth to accommodate change.
   - Providing adequate computer room resources to allow for growth and avoid unnecessary down time.
4. To reduce the outage required for the change, determine if the change can be performed while the system is still running.
5. If the system must be shut down, minimize the outage by reducing system and application startup and shutdown times and writing efficient command files.
6. Develop change control procedures for installing and implementing change. Changes are usually handled in four phases. The change control staff:
   a. Defines and documents the change
   b. Plans for the change
   c. Installs the change
   d. Makes sure that the system is running correctly and improves the change control process
7. Determine which tools to use. Tandem offers these tools:
   - Distributed Name Service (DNS)
   - Distributed Systems Management/Software Configuration Manager (DSM/SCM)
   - NonStop SQL/MP SQLCI
   - NonStop TM/MP interfaces (TMFCOM and TMSERVE)
   - NonStop TS/MP PATHCOM interface
   - Subsystem Control Facility (SCF)
   - Tandem Service Management (TSM)

8 Performance Management

Overview

Performance management helps you ensure that you get the best return from your systems and that the systems meet your business needs. This section provides guidelines for performance management tasks. In particular, this section covers:

- How service-level agreements influence performance management
- The applications-sizing, capacity-planning, and performance-analysis-and-tuning functions
- Tandem tools that help you in performance management

This section ends with a check list that summarizes the main points of performance management.

Note. The Availability Guide for Performance Management defines performance management in detail. It explains how to measure system performance, analyze system performance information, and optimize the performance of Tandem NonStop systems. It also defines key performance terms, defines a framework for capacity planning and application sizing, and describes the performance management tools provided by Tandem.

What Is Performance Management?

Performance management is the process of managing the performance of your system and network environment to ensure that you get the best return from your systems and that the systems meet your business needs defined by your service-level agreements. Performance management includes the following functions:

- Application sizing is the process of forecasting the effects of new applications on your system through the use of models to determine how well new applications will handle their intended workloads. Application sizing helps you:
  - Plan for growth in system workloads caused by new applications
  - Determine how much capacity a new application will require

  Note. Operations functions in areas such as security, disaster recovery, source and change control, and organization and staffing might be affected when a new application is moved from a development system to a production system. Section 11, Application Management, discusses the operations functions that should be reviewed when an application moves from a development environment to a production environment.

- Capacity planning is the process of forecasting future capacity needs based on performance trends and the growth in users, applications, and your company's business. Capacity planning helps you:
  - Plan for growth in system workloads based on business growth. For example, if capacity planners know that the company's business is going to grow such that your systems will have to handle 10 additional transactions per second, they can determine how much additional capacity is required to handle the future workload.
  - Prepare budgets for the acquisition of additional equipment.
  - Avoid crises related to overloaded systems.
- Performance analysis and tuning is the process of measuring system performance and acting on the results of the measurements in order to improve system performance and availability. Performance analysis and tuning help you determine whether:
  - The system is performing optimally.
  - Service-level agreements are being met.
  - Performance guidelines are being followed.
  - Resources are allocated according to established priorities.
  - System usage is occurring in a prudent, nonwasteful manner.
  - Available resources can meet the performance goals desired.

Service-Level Agreements

Application sizing, capacity planning, and performance analysis and tuning are performed to help an organization adhere to service-level agreements. The agreements:

- Determine the availability and reliability standards for systems, peripherals, networks, and applications
- Establish processing requirements by specifying acceptable performance goals, such as acceptable system response times, deadlines for batch jobs, and transaction volumes
- Specify the acceptable level of service when the system is stressed, such as during peak workloads and during partial equipment failures
- Determine priorities for resource allocation

For a more detailed description of service-level agreements and for guidelines on creating service-level agreements, refer to Section 4, Operations Documentation.

Note. It is best to specify measurable requirements only. If you specify a requirement you can't measure, it will be difficult for you to determine if you are fulfilling the requirement.

Written agreements should be created and approved by users' representatives, capacity planners, and operations managers. Performance management helps you measure system and application performance, optimize current performance, and determine when additional capacity should be obtained in order to continue fulfilling the agreements.
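The advice to specify only measurable requirements can be made concrete with a small sketch. It assumes a hypothetical response-time goal stated as "95 percent of transactions complete within a given number of seconds"; the percentile form, the function names, and the sample values are illustrative assumptions, not part of any Tandem product or of the agreements described here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_response_time_goal(samples_sec, goal_sec, pct=95):
    """Compare measured response times with a hypothetical service-level goal."""
    observed = percentile(samples_sec, pct)
    return {"observed": observed, "goal": goal_sec, "met": observed <= goal_sec}


if __name__ == "__main__":
    # Made-up response times in seconds for one measurement interval.
    samples = [0.4, 0.5, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 0.5, 2.8]
    print(check_response_time_goal(samples, goal_sec=2.0))
```

A goal phrased this way can be checked directly against collected measurements, which is the property the note asks for.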

Staffing

Staffing needs depend on the size of your company and the number of systems and applications your company runs. The larger the company and the greater the number of systems and applications, the greater the number of people who are involved in performance management. If you have a small operations group, you might need to assign only one person to the function. If you have a large operations group, you might need to assign a group (or groups) to the function.

Staffing needs also depend on how the performance management tasks are divided. For example, one group can perform application sizing, capacity planning, and performance analysis and tuning. An alternative is to divide the three tasks among three different groups.

Note. The trend in the industry is to use fewer people to manage a greater number of objects. This is tied to automation. Many of the performance-monitoring and data-collection tasks can be automated, relieving the staff of these routine functions. For more information, refer to Section 12, Automating and Centralizing Operations.

To carry out its tasks effectively, the performance management staff should be technically knowledgeable; have good communication skills; and have access to necessary tools, training, and documentation. In addition, the staff should have the following skills:

- The application-sizing staff should understand your company's plans for growth and know how to:
  - Analyze application workload components
  - Create performance models for applications
  - Identify and evaluate options for computer system growth
  - Evaluate vendor offerings and tools
- The capacity-planning staff should have good business analysis skills, understand your company's plans for growth and business requirements, and know how to:
  - Analyze information provided by the application-sizing staff
  - Collect and analyze performance statistics
  - Translate business plans into predictions of workload growth
  - Identify and evaluate options for computer system growth
  - Develop performance models

  Tandem provides training in capacity planning. Currently, Software Education offers a course called Capacity Modeling with TCM.

- The performance-analysis-and-tuning staff should understand how the system and applications are structured and know how to:
  - Collect and analyze statistics
  - Identify peak periods of resource usage
  - Identify and correct current performance problems

  Tandem provides training in performance analysis and tuning. Currently, Software Education offers a course called Performance Analysis and Tuning.

Note. For more information about the operations staff and the delegation of duties, refer to Section 2, The Operations Staff.

Application Sizing

Application sizing is the process of forecasting the effects of new applications on your system through the use of models to determine how well new applications will handle their intended workloads. Application sizing consists of three steps:

1. Establishing the requirements and strategy
2. Developing a model of existing usage and using it to forecast future requirements
3. Reporting results to the capacity-planning staff

Step 1: Establishing the Requirements and Strategy

The application-sizing requirements and strategy are based on the service-level agreements. Service-level agreements specify the level of service that your operations group should provide and are usually developed through negotiations between your operations organization and the organization's users. The application-sizing staff uses the service agreements to develop strategic performance requirements and to analyze the effects of a new application on the system.

Step 2: Forecasting

Forecasting consists of the following tasks:

- Identifying business entities, transactions, and transaction volumes, and translating this information into computer transactions and volumes.
- Developing transaction and database profiles. For example, for each transaction, how many input and output functions are performed? What is the screen size in bytes?
- Determining the resource demand for each transaction. For example, what is the cost per transaction in processing time?
- Given the information gathered above, determining the type of hardware configuration and the amount of disk space needed.
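The forecasting tasks above can be sketched as a back-of-the-envelope calculation: multiply each transaction type's volume by its processing cost, sum the results, and divide by the busy level you are willing to run each processor at. The transaction profiles, the 70 percent busy ceiling, and the function name are assumptions made for illustration; a real sizing exercise would rely on measured costs and a proper model.

```python
import math


def processors_needed(profiles, max_busy=0.7):
    """Estimate processor demand from transaction profiles.

    profiles: list of (transactions_per_second, cpu_seconds_per_transaction).
    max_busy: assumed ceiling on average busy level per processor (0 to 1).
    Returns (total demand in processor-seconds per second, processor count).
    """
    demand = sum(rate * cost for rate, cost in profiles)
    return demand, math.ceil(demand / max_busy)


if __name__ == "__main__":
    # Hypothetical profiles: 20 inquiries/s at 0.05 CPU s each,
    # and 5 updates/s at 0.2 CPU s each.
    demand, count = processors_needed([(20, 0.05), (5, 0.2)])
    print(f"demand={demand:.2f} processor-seconds/s, processors={count}")
```

The same pattern (volume times per-transaction cost) applies to disk space, using records per transaction and bytes per record instead of processing time.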

Step 3: Reporting Results

Sizing results are usually reported to the capacity-planning staff and management. Useful reports describe the sizing staff's assumptions, describe modeling results, list alternatives, and provide recommendations.

Capacity Planning

Capacity planning consists of four steps:

1. Establishing the requirements and strategy
2. Instituting performance reporting
3. Developing a model of existing usage and using it to forecast future requirements
4. Developing, reviewing, and modifying the capacity plan

Step 1: Establishing the Requirements and Strategy

The capacity-planning requirements and strategy are based on the service-level agreements. Planners determine how much capacity will be required in the future based on an estimated future workload and the service agreements.

Step 2: Performance Reporting

Performance reporting includes the following tasks:

- Gathering information to determine how current capacity is being used and to determine when additional capacity will be needed. Typically, planners determine what should be measured and how often, and the operations staff executes command files to generate the measurements. Tandem provides a number of tools to help planners gather and analyze data. These tools are described later in the subsection Performance Management Tools.
- Establishing a reporting schedule. Management usually specifies how often planners should deliver reports and to whom. You might want reports twice a month, once a month, or quarterly. If planners find performance problems that can be corrected without additional capacity, they should report their findings to the operations staff.
- Delivering reports, including capacity-usage reports, and tuning recommendations. Supply summary information on system usage in graphic and tabular form to senior management and to the user community.
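Determining when additional capacity will be needed, as described in Steps 1 and 2, can be approximated with a simple projection. The sketch below assumes compound growth at a fixed rate per reporting period and a single capacity figure expressed in transactions per second; both are simplifying assumptions for illustration, and the function name is hypothetical. It is not a substitute for the modeling tools described later in this section.

```python
def periods_until_exhausted(current_tps, growth_rate, capacity_tps,
                            max_periods=120):
    """Return the first period in which projected load exceeds capacity.

    current_tps: current measured workload (transactions per second).
    growth_rate: assumed growth per period, for example 0.10 for 10 percent.
    capacity_tps: workload the current configuration can sustain while
                  still meeting the service-level agreements.
    Returns None if capacity is not exceeded within max_periods.
    """
    load = current_tps
    for period in range(1, max_periods + 1):
        load *= 1 + growth_rate
        if load > capacity_tps:
            return period
    return None


if __name__ == "__main__":
    # Hypothetical figures: 50 tps today, 10 percent growth per quarter,
    # and a configuration that sustains 60 tps within agreed service levels.
    print(periods_until_exhausted(50, 0.10, 60))
```

A result of None means the assumed growth never exhausts capacity within the horizon examined, which is itself worth stating in a capacity-usage report.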

Step 3: Forecasting

Forecasting consists of the following tasks:

- Developing a model of the current system that reflects current performance characteristics.
- Using the information provided by the application-sizing staff and by business analysts to estimate projected workload volumes and workload profiles for the next period (for example, the next two years).
- Analyzing the expected performance characteristics of the projected workload volumes and profiles on the current system.

Step 4: Developing the Capacity Plan

Once the forecasting steps are completed, the capacity-planning staff should be able to develop a capacity plan and report on the alternatives. Developing a capacity plan consists of the following tasks:

- Determining if or when additional capacity is needed to meet service-level agreements.
- If capacity is needed, analyzing technology and topology alternatives for generating additional capacity. For each alternative, use a model to:
  - Understand the effects of equipment capacity growth
  - Analyze the ability of the architecture to accommodate unpredictable events (such as a rapid increase in workload)
- Recommending which actions to take to ensure that enough capacity will be available for future needs.

Performance Analysis and Tuning

Performance analysis and tuning consists of five steps:

1. Establishing performance requirements
2. Gathering performance information
3. Analyzing performance information
4. Acting on collected data to optimize system performance (tuning)
5. Reporting results and providing capacity planners with data

Step 1: Establishing Performance Requirements

Performance requirements are based on the service-level agreements and on information provided by capacity planners. The requirements specify:

- The resources to be managed and tracked.
- The availability and reliability standards for systems, peripherals, networks, and applications.

- Acceptable performance goals, such as acceptable system response times, deadlines for batch jobs, and transaction volumes.
- Acceptable level of service when the system is stressed, such as during peak periods and during partial equipment failures.
- The priorities for resource allocation.

Once the requirements are specified, application developers can instrument applications so that the necessary measurements are available, and the operations staff can proceed with measuring performance.

Step 2: Gathering Performance Information

Before a system's performance can be analyzed and optimized, the performance-analysis-and-tuning staff must gather relevant performance information. This involves:

- Collecting information about system and application configurations.
- Deciding what to measure.
- Deciding when to take measurements. Typically, the staff takes measurements throughout each day in order to understand how changes in workload affect the performance of the systems.
- Deciding how to take measurements. Tandem provides a number of tools to help measure performance. The tools can help the staff:
  - Collect statistics on processors, disk drives, and files
  - Measure network traffic
  - Analyze disk space usage
  - Collect information on many other resources

  For more information about Tandem tools used for performance analysis and tuning, refer to Performance Management Tools, later in this section.
- Identifying peak periods.

Step 3: Analyzing Performance Information

Performance analysis involves examining performance information for potential performance problems and includes the following tasks:

- Evaluating system and application configurations
- Creating a workload profile
- Analyzing resource use
- Identifying workload imbalances
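Two of the tasks above, identifying peak periods and identifying workload imbalances, reduce to simple arithmetic once measurements are available. The sketch below assumes the measurements have already been exported as per-processor busy percentages keyed by hour of day; that data layout and the function names are hypothetical and are not the output format of any Tandem measurement tool.

```python
def peak_period(busy_by_hour):
    """Find the hour with the highest average processor busy level.

    busy_by_hour: dict mapping hour -> list of per-processor busy percentages.
    Returns (hour, average busy percentage for that hour).
    """
    averages = {hour: sum(values) / len(values)
                for hour, values in busy_by_hour.items()}
    peak_hour = max(averages, key=averages.get)
    return peak_hour, averages[peak_hour]


def imbalance(busy):
    """Spread between the busiest and the least busy processor."""
    return max(busy) - min(busy)


if __name__ == "__main__":
    # Made-up samples for a two-processor system at three times of day.
    samples = {9: [40, 50], 14: [80, 90], 20: [20, 10]}
    hour, average = peak_period(samples)
    print(hour, average, imbalance(samples[hour]))
```

A large spread during the peak hour is a prompt to look at which processes and files are driving the busiest processor, not a reason in itself to move work.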

Step 4: Optimizing System Performance

Optimizing system performance involves:

- Analyzing the results of the performance measurements
- Identifying load imbalances and bottlenecks created by applications
- Determining what should be done to improve system performance
- Improving system performance by tuning the system, balancing the system, performing online performance troubleshooting, or reconfiguring the system

The following guidelines can help the performance-analysis-and-tuning staff:

- In a Tandem environment, performance issues that exist at a network level also exist at a system level. In a network, a heavily used node can easily influence the performance of nodes to which it is connected. A node can serve either as a resource aiding the flow of work or as a bottleneck impeding the flow.
- Performance problems are often a result of an unbalanced workload that causes bottlenecks in the flow of work through the system. No resource should be stressed beyond its optimal performance levels.
- Configuration, application-development, and system-management problems are often mistakenly viewed as performance problems; diminishing performance is often a symptom of these other problems.
- The key to achieving optimal performance is a workload-sharing strategy that promotes concurrence (allows several activities to occur at the same time) and averts contention (prevents two activities from competing for the same resource). This type of workload-sharing strategy focuses on ensuring that the main resource users do not contend for the same resource.
- Load balancing is most helpful when pursued with the goal of restraint rather than equality. To balance workloads:
  1. Develop a sense of the workload and how the load is allocated among resources. Each process uses a different mix of system resources. When you move a process to balance one resource (for example, processor usage), you must be aware of the effect of that process on other system resources.
  2. Confine initial balancing actions to curing contention. First resolve memory contention, and then resolve device contention.
  3. After contention is eliminated, try to smooth peak use among system components.

  Frequently, about 10 percent of processes and disk files account for 80 to 90 percent of processor and disk usage. You can tune and balance a system by redistributing as few as 10 percent of the processes and disk files. Moving processes among processors to equalize processor busy levels might not improve system performance if the highest-usage 10 percent of the processes and disk files are not properly distributed.

- Each change made to the system affects a number of different system resources. Introducing change in a gradual, planned manner allows you to observe the systemwide effect of a change before trying more changes.
- If multiple measurements are performed concurrently, the measurements should be coordinated through a single group at each site. If several people want to measure the same resources at the same time, it is most efficient to perform one all-inclusive measurement and have the people share the data.
- If the performance problem continues, the problem might be caused by a lack of capacity or by an application that is poorly designed. Applications should be tunable. An application is tunable if you can move application processes from one processor to another and if the application can expand linearly. Linear expansion allows you to create additional processes to handle increased activity.

Step 5: Reporting Results

Weekly performance-analysis-and-tuning reports help track problems and resource usage. It is helpful to have reports that:

- Summarize the data collected during the week
- List service goals and how well the goals were met
- Summarize actions taken (if any) to improve performance
- Summarize how the resources were used (for planning and accounting purposes)
- Summarize outstanding issues
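The Step 4 guideline that a small fraction of processes and disk files often accounts for most usage can be checked against collected data, and the result fits naturally into the resource-usage summary of a weekly report. The sketch below is a minimal illustration: it assumes usage has already been totaled per process or file name, and the names, figures, and 80 percent threshold are hypothetical.

```python
def top_consumers(usage, share=0.8):
    """Return the heaviest users that together cover `share` of total usage.

    usage: dict mapping a process or file name to its measured usage
           (for example, processor-busy seconds for the week).
    share: fraction of total usage to account for (0 to 1).
    The result is ordered from heaviest to lightest.
    """
    total = sum(usage.values())
    selected, running = [], 0.0
    ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    for name, amount in ranked:
        if running >= share * total:
            break
        selected.append(name)
        running += amount
    return selected


if __name__ == "__main__":
    # Hypothetical weekly totals per process.
    weekly = {"p1": 60, "p2": 25, "p3": 5, "p4": 4, "p5": 3, "p6": 3}
    print(top_consumers(weekly))
```

If the returned list is short relative to the number of processes measured, those entries are the candidates for redistribution; equalizing busy levels across the rest is unlikely to help much.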

How It Fits Together

Figure 8-1 shows the relationship between application sizing, capacity planning, and performance analysis and tuning.

Figure 8-1. Performance Management Functions

[Figure: three related sequences of steps.
Performance Analysis and Tuning Steps: Step 1, Establish Performance Requirements; Step 2, Gather Performance Information; Step 3, Analyze Performance Information; Step 4, Optimize Performance; Step 5, Report Results.
Capacity Planning Steps: Step 1, Establish Requirements and Strategy; Step 2, Institute Periodic Performance Reporting; Step 3, Forecast Future Performance Needs; Step 4, Develop or Review and Modify the Capacity Plan.
Application Sizing Steps: Step 1, Establish Requirements and Strategy; Step 2, Forecast Future Performance Needs; Step 3, Report Results.]

Case Study

The following case study describes how a large hospital improved system and application performance by developing and implementing software configuration and application practices.

User Profile

SJ County Medical is a major hospital in Southern California. It is a growing hospital with strict availability requirements that discourage extended periods of down time. They have a G-series system that currently uses 10 processors and a number of disks.

Analysis of Problem and Recommendations

Because of its growing environment, SJ County Medical is often faced with system performance problems that require action by the performance-analysis-and-tuning (PAT) staff. Over time, the PAT staff has developed a few techniques for accommodating hardware growth, alleviating system performance problems, and minimizing down time. These include:

- Software configuration practices for maintaining system availability when adding new hardware (such as adding disks, controllers, and processors)
- Techniques for improving application availability

Software Considerations When Adding New Hardware

In anticipation of future hardware growth, the PAT staff implemented the following software configuration practices:

- System-generation practices. Because G-series installation dictates system generation for 16 processors (regardless of the number initially installed), the PAT staff was not required to edit its system's SYSGEN CONFTEXT file to add nonexistent processors and disks before running the DSM/SCM program. To install new hardware, the PAT staff needs only to install a new processor board; it does not have to bring down the system.
- NonStop Transaction Manager/MP (TM/MP) practices. Adding extra disks can require that the NonStop TM/MP product be stopped. To avoid initializing the NonStop TM/MP product with the new configuration, the PAT staff configured the audit trail for USAGE,* which includes all volumes. The subsequent START TMF command automatically includes all available volumes. When the USAGE attribute of the audit trail does not include all volumes but lists the volumes separately by name, the NonStop TM/MP product must be initialized with a new configuration.

Techniques for Improving Application Availability

The PAT staff developed the following techniques for improving application availability:

- Moving files. To improve application performance, the PAT staff moves files to a less utilized disk to reduce queueing to a specific disk. To move disks and their associated files to a new processor while an application is running, the PAT staff uses the NonStop SQL/MP database. With the NonStop SQL/MP database, it is possible to add and drop partitions of a file while an application is running.
- Reorganizing and reloading files. Because records are added and deleted in key-sequenced files, the PAT staff also reloads or reorganizes files to improve performance, to reclaim disk space, to avoid chronic block splits or block collapses, and to provide for adequate slack space. The PAT staff uses the FUP RELOAD command to reorganize Enscribe and NonStop SQL/MP key-sequenced files online.

Performance Management Tools

Tandem provides a number of tools to help your staff with performance management tasks. Table 8-1 summarizes the performance management tools and their capabilities. For detailed descriptions of these tools, refer to Section 14, Operations Management Tools.

Table 8-1. Summary of Performance Management Tools

[Table: each tool is marked against the capability columns Performance Monitoring, Performance Data Collection, Performance and Usage Tuning Recommendations, Planning, and Reports/Charts Generation. Tools listed: Disk Space Analysis Program (DSAP); Enform; File Utility Program (FUP); Flow Map; Guardian Performance Analyzer (GPA); Measure; NSKCOM; PEEK; SCF/SCP; System generation program (SYSGEN); Tandem Capacity Model (TCM) and MeasTCM; Tandem Network Statistics Extended (NSX); Tandem Performance Data Collector (TPDC); Tandem Reload Analyzer; TSM EMS Event Viewer; ViewSys.]

Check List

The following check list summarizes the main points of this section:

1. Establish service-level agreements.
2. Assign staff to the capacity-planning, application-sizing, and performance-analysis-and-tuning functions. Provide training as needed.
3. Establish procedures for application sizing. Typically, the application-sizing staff:
   a. Establishes sizing requirements and strategy
   b. Forecasts and develops models of future demands
   c. Reports results
4. Establish procedures for capacity planning. Typically, the capacity-planning staff:
   a. Establishes capacity-planning requirements and strategy
   b. Institutes performance reporting
   c. Forecasts and develops models of future demands
   d. Develops the capacity plan
5. Establish procedures for performance analysis and tuning. Typically, the performance-analysis-and-tuning staff:
   a. Establishes performance requirements
   b. Measures performance
   c. Acts on collected data to optimize system performance
   d. Reports results and provides capacity planners with data

6. Select tools to help with performance management. Some of the tools Tandem offers are:
   - DSAP
   - Enform
   - Flow Map
   - Guardian Performance Analyzer (GPA)
   - Measure
   - NSKCOM
   - NSX
   - PEEK
   - Subsystem Control Facility (SCF)/Subsystem Control Point (SCP)
   - Tandem Capacity Model (TCM) and MeasTCM
   - Tandem Performance Data Collector (TPDC)
   - Tandem Reload Analyzer
   - TSM EMS Event Viewer
   - ViewSys

9 Security Management

Overview

Data is a vital and irreplaceable part of every business. However, data protection is a difficult task. Not only do you have to protect the data, but you also have to protect everything that allows people to access the data, including the computer equipment, storage media, the operating system, and application software.

This section describes security management and provides suggestions, guidelines, and tools for administering the security process. The Security Management Guide provides detailed information for managing a secure operations environment. This section concludes with a check list that summarizes the steps involved in security management.

Note. Disaster recovery planning and change control are two other important areas of security planning. Refer to Section 10, Contingency Planning, and Section 7, Change and Configuration Management, for information on these topics.

What Is Security Management?

The primary goal of security management is to protect information. It involves managing three components of computer security:

- The system. The system component is concerned with managing the operating system's ability to control access to the system by defining the security features contained within the Tandem NonStop Kernel operating system and associated tools and products.
- The environment. The environment component is concerned with managing all aspects of the physical security of the computer, its peripherals, and its environment, and with providing the power necessary to run it.
- The human. The human component is concerned with managing the people who access the system through the use of system IDs, network IDs, passwords, and dialup security precautions.

In addition to managing these components, effective security depends on:

- Following basic security rules. Following a few basic security rules and guidelines can help create a successful security program.
- A sound security policy. Developing a security policy, selecting people to enforce the security policy, and following recommended guidelines for ensuring that data is secure help you achieve the greatest success in protecting information.
- People. Management, the staff, and the users must be committed to supporting the security practices of your organization.

Basic Security Rules

Before determining how to secure your hardware and software, you should understand the following basic security rules. Use these rules when establishing your security program or when reviewing a program that is already in place.

Rule 1
The highest levels of management should support and be committed to a security program. Management should define the authority and responsibility for development of a security program and should implement the program. The organization's approach to security should be understood and agreed to by members of the organization.

Rule 2
A formally approved security policy statement and plan should be developed. A security policy:
- Establishes the security needs and goals of your company
- Indicates who should and should not have access to data
- Describes the protection procedures employees and departments should follow

Rule 3
A security program starts with risk assessment. You need to determine what to protect. You can protect assets, confidentiality, command and control, availability of a service or function, and so on. You can also determine the probability and importance of the risk. If you cannot quantify the risk numerically, set qualitative values such as high, medium, and low.

Rule 4
Staff and users must be able to achieve security goals. The goals should be realistic, and the rules and tasks should be simple and straightforward. If the rules and tasks are complicated and cumbersome, people will not comply.

Rule 5
Security is implemented through a combination of physical barriers, administrative practices, hardware, and software. For example, the computer should be physically secure; users should have access only to what they need; and you should have tools and procedures for enforcing security.

Rule 6
Separate job duties and responsibilities to a point where collusion is necessary for fraud to occur. Written job definitions and formally outlined responsibilities are key elements of this goal. For data processing, you should define who is responsible for password assignments, file security, accounting totals, physical access, software changes, and so on.

Rule 7
Establish security transaction logs so that you can determine who is accountable for transactions and activities. The number and types of security logs should be in proportion to the level of risk or exposure that exists.

Rule 8
The staff responsible for security monitoring and auditing should periodically review adherence to security rules. Develop or acquire audit tools and reports to support this activity.

Rule 9
The security program must have integrity. Actively test the validity of the security program, the physical barriers, the administrative practices, and the hardware-protection and software-protection mechanisms.

The following subsections provide you with information that will help you implement these rules.

Developing a Security Policy

A security policy is usually a high-level statement of the security goals and procedures of an organization. Because security depends on the cooperation of all users, all users must be made aware of the security policy and what they must do to comply with it. For all departments in your organization, the policy should address the control and disposition of sensitive information in all forms: online data, printed reports, data communications, magnetic media, and off-site storage. To review a sample security policy, refer to the Security Management Guide.

A security policy should accomplish three goals:
1. Set the basic scope and general tone of an organization's security program.
2. Define who has overall responsibility for security and who is responsible for maintaining the security policy.
3. Define the security procedures for all departments that handle sensitive information.

Some examples of security procedures that you should develop include:
- Installation procedures for system and application software
- Procedures for adding and removing users from the system
- Procedures governing the actions of privileged users, including control of their passwords
- Procedures governing how passwords are assigned and when they should be changed
- Procedures for developers to follow regarding the security of applications and utilities

Consider the following guidelines when developing your security policy.

Security Guidelines

Your security policy might range from permissive to restrictive. Initially, it is most helpful to use a somewhat restrictive approach, because it is difficult to tighten security practices once users become accustomed to a permissive approach. Security concepts that can guide your security policy are:
- Least privilege
- Baseline security

Least Privilege

Least privilege dictates that users access the system only when they need to. You might initially provide insufficient access for some people to get their jobs done, but you can correct this matter by granting access as needed. This approach is preferable to allowing unwarranted access, which might become impossible to correct and which might cause serious damage to your company.

Baseline Security

Baseline security is the minimal level of security your organization is committed to providing. You can base the level of protection on what is done in organizations similar to yours. Some experts recommend this approach because of its use as a defense in legal proceedings resulting from a break-in. (Proof might be necessary that prudent protection was provided; policies comparable to other similar operations could be essential to such proof.)

Security Is a People Problem

Effective security depends on the commitment of management, the staff, and users. Without this commitment, people tend to select convenience at the expense of security, and so make computer operations vulnerable. Intruders use this situation to their advantage. For example, even so simple a convenience as not logging off when away from a terminal can provide the opportunity an intruder needs to break into a system.

Management Support

Your management must be convinced of the importance of security and should openly support security policy. Management should also assume responsibility for enforcing whatever security policy is adopted. Usually, a corporate security officer ensures that a security policy is developed and implemented and that users are trained.

Staff Support

Separation of security duties helps you avoid collusion and helps you ensure that your system is well secured. Security administration duties are usually divided between an auditor, a security administrator (or administration team), and the operations staff. Depending on your organization's structure, the security administrator might also be a member of the operations staff.

The auditor is responsible for auditing the system.

The security administrator is responsible for:
- Managing access to the system (user IDs)
- Managing passwords
- Developing and implementing security procedures and policies

The Security Management Guide provides detailed checklists for security administrators.

The operations staff is responsible for:
- Monitoring physical security
- Controlling dial-up access
- Restricting access to system software, utilities, sensitive information, and critical system resources
- Securing network access

All staff who perform security administration duties should thoroughly understand the security policy and know how to detect intruders. Tandem provides training in computer security. Currently, Software Education offers the following courses: Security Concepts and Planning, Securing Guardian Systems, and Security for Auditors.

User Community Support

Without user support, a security policy is difficult to enforce. Getting user support might well be the most challenging and rewarding task you face, especially where adherence to security rules interferes with productive work and people are rewarded mainly for their productivity. Your security policy should seek ways of making security as convenient as possible without jeopardizing your organization's security. After the policy is in place, educate users about security issues and how they can help maintain a secure system.

Organizational Issues

Good security requires that people communicate and cooperate across organizational lines. Figure 9-1 shows the paths of communication needed to sustain a strong security effort.

Figure 9-1. Paths of Security Communication
[Figure not reproduced. Labels in the figure: Management, System Administration Team, Security Administrator, Application Programmer Management, Application Programmers, Users, and Tandem System, connected by paths marked Policy, Reports, Safeguard Control, Guidelines, Verification, Code, Code Review, Training, and Observation.]

The Tandem Security System

The Tandem security system is an integrated group of software products that protect data existing on a system. Tandem software allows you to implement a variety of security policies through:
- Authentication, to verify the identity of system users
- Authorization, to control what users can do with system resources
- Auditing, to record specified events on the system
- Administration, to define access rights to system resources; translate policy into enforceable access rules; add, change, and delete users and aliases; and maintain file-sharing groups on the system

The Tandem NonStop Kernel operating system and its utilities offer basic system protection. The security software product Safeguard extends security features to include auditing, extended access control, authentication features, and segregation of administrative tasks. If you are using the OSS environment, you must use the Safeguard software to define users for your system. If you use NonStop SQL/MP, a relational database management system, you can extend security features even further.

In addition, Tandem supports a message interface to a command-interpreter monitoring process called $CMON. $CMON allows you to control and track logon attempts and important security changes. Figure 9-2 shows the layers of Tandem security.

Figure 9-2. Layers of Tandem Security
[Figure not reproduced. Labels in the figure: Applications; Tandem Software; Hardware; Machine Instruction Definitions; Tandem NonStop Kernel; Safeguard, NonStop SQL/MP, Utilities; User Defined ($CMON).]

Authentication Services Provided by the Tandem NonStop Kernel

The Tandem NonStop Kernel operating system has built-in security that uses passwords for authentication and security strings to control access to files. Utilities, including the File Utility Program (FUP) and the Disk Space Analysis Program (DSAP), help you control and monitor system security.

The Tandem NonStop Kernel operating system provides both local and network security. Network security default settings are more restrictive than local system security defaults. For more comprehensive security, the Safeguard product should be used with the operating system.

Safeguard

The Safeguard product extends the access-control features of the operating system, thereby allowing security administrators to easily tailor the level of protection to suit their needs. The Safeguard product augments the operating system security features by:
- Providing more control over the logon process
- Allowing detailed specifications of user access privileges
- Extending access control to operating system resources such as disk volumes, printers, communication lines, and processes
- Auditing attempts to:
  - Log on
  - Change user records
  - Access system objects
  - Change protection records for system objects

The Tandem NonStop Kernel operating system with the Safeguard product allows you to:
- Specify the users who are allowed to create and manage user profiles
- Control access to resources, and audit attempts to access resources
- Control all file creation on any disk volume or subvolume on the network, and audit all attempts to create files
- Control creation of NonStop SQL/MP tables, views, indexes, and catalogs on any disk volume and subvolume on the network
- Control creation or deletion of any named process, and audit all process creation and deletion attempts
- Audit changes made to the security database
- Comply with many government computer security standards including those of the:
  - United States (NCSC and Treasury), C2 level

  - German Information Security Agency (GISA), F2/Q3 security-function and F7/Q3 system-availability levels
  - Harmonized European Information Technology Security Evaluation Criteria (ITSEC), E3 level

Note. The suggestions in this section are based on the assumption that you use the Safeguard product to help protect your systems. If you do not use the Safeguard product, you should seriously consider doing so.

NonStop SQL/MP

NonStop SQL/MP, Tandem's relational database management system, also uses the Tandem NonStop Kernel file-security features. This security provides user authorization for NonStop SQL/MP tables, views, indexes, and programs. In addition, two NonStop SQL/MP features also contribute to database protection. NonStop SQL/MP allows you to:
- Access data only through NonStop SQL/MP commands, ensuring complete protection of the data and its definition
- Provide field-level security of logical views by allowing access to only those columns of data that are authorized

$CMON

$CMON is a user-written program that monitors some command-interpreter activities. You can use $CMON to audit and restrict attempts to:
- Log on and log off
- Run a program
- Alter the priority of a process
- Add users to the system or delete users from the system
- Change a user's logon password and remote passwords

The International Tandem User's Group (ITUG) can supply you with a sample copy of $CMON.

Physical Security

Weakness in the physical security of your computer installation can provide an easy avenue of intrusion. The following paragraphs discuss some of the more common vulnerabilities resulting from weak physical security.

The Computer Room

Access to the equipment in the computer room can provide ample opportunity for both system intrusion and accidental or malicious system damage. Limiting access to the computer room can help you prevent security problems. For example, you can limit access to the computer room by locating frequently used devices such as printers away from the room, putting locks on the doors, and not posting signs that indicate the location of the computer facilities.

Environmental Controls

Access to the power supply and the air conditioning can provide ample opportunity for accidental or malicious damage. Consider controlling access to the power supply and the air conditioning by locking the control panels.

System Cabinets

Protect the system cabinets from accidental damage and deliberate malicious acts. All cabinets are shipped with locks that use the same key. You might want to consider installing your own cabinet locks to better protect the system. Anyone with access to the computer cabinets and the appropriate key could bring down the system.

Terminals

Unattended, logged-on terminals invite intruders to access the system. You might want to require that the command interpreter's automatic logoff option is always on.

Printers

Unless care is taken, intruders can obtain the information they need to break into a system by examining the output of system printers. For example, user account numbers, telephone access numbers and codes, and even privileged passwords might be printed on publicly accessible printers. Printed copies of electronic mail messages can also provide names that enable intruders to deceive others into presuming the legitimacy of their requests for information.

Tape Units

As with all computer peripherals, protect tape units physically and procedurally from accidental and malicious damage. Unprotected, they offer an avenue of intrusion. For example, with the proper timing, an intruder might remove a backup tape from the tape drive, take it to another system, read it, then return it without detection. Operators should be vigilant when performing system backups.

When backing up files from disk to tape, you might want to consider using the NOMYID option of the BACKUP utility. This option prevents files that originally belonged to one user ID from being restored onto another user ID.

Tape Library

Monitor the on-site tape storage area closely to ensure that an intruder does not get access to a previous backup tape. Keep audit trails for all tape library transactions.

On-Site and Off-Site Media Storage

Protect the on-site and off-site media storage areas from intruders. Methods of protecting storage areas include keeping transaction logs for all tape library transactions, carefully screening all who request materials, allowing access to approved persons only, and creating explicit hand-over procedures between the storage-area staff (especially staff on contract) and your staff.

Data Encryption

If you cannot provide physical security for data, consider encrypting the data so that intruders cannot easily access the data. For example, tapes sent through the mail, disks that are transported, and communications lines that can be tapped all provide points of access to data. Consider encrypting all data transported in these ways.

Managing Access to the System

Users must have an ID to access a system. User IDs can be very powerful tools and are the items most commonly under attack when an intruder is trying to penetrate a system. Therefore, it is important that your security policy provide guidelines for the operations staff regarding:
- User groups
- Access-control lists
- Adding user IDs
- Assigning user aliases
- Special group IDs
- Guest-user IDs
- Unused IDs
- Deleting user IDs
- Reusing user IDs

User Groups

Belonging to a user group gives the group member the right to access objects (such as files and processes) that are secured for group access. Deciding how classes of users need to share files is a major requirement for developing a strategy for group assignment. Two common ways of assigning groups are to:
- Assign groups by function: create distinct groups for system programmers, application programmers, quality-assurance testers, administrative assistants, technical writers, and data-entry clerks.
- Assign groups by project: create a group for each project and assign user names within that group for all designers, testers, and other project people. Managing this approach can be difficult when people work on more than one project, switch from one project to another, or don't belong to one project (for example, department administrators).

The Safeguard Administrator's Manual contains information on defining and managing user groups.

Access-Control Lists (ACLs)

Depending on your organization's security policy, you might have to restrict access to system software so that only selected users or user groups can execute the software. To restrict access, use Safeguard access-control lists (ACLs). Safeguard ACLs allow you to specify exactly which users have access to what files. The Safeguard product maintains ACLs for all objects under its protection. If you do not use Safeguard ACLs, group membership is the only way you can limit file access to a subset of users.

Note. Safeguard access-control lists cannot be used to protect OSS files. Access to OSS files is controlled by OSS file permission bits, as described in the Open System Services Management and Operations Guide.

Adding User IDs

When users are added to the system, user ID attributes must be defined. You should provide guidelines for defining the attributes for ID ownership, logon and password expiration, and audit access and logging.

Assigning User Aliases

If you are using the Safeguard product, you can define user aliases. A user alias is an alternate name that can be assigned to a user for purposes of logging on to the system. Each alias may be assigned a unique set of attributes. The use of aliases can provide individual accountability and separation of duties when several users share the same user ID or when a single user performs separate job functions. For example, in the OSS environment, it may be advantageous to assign different aliases for the same user ID, then assign each alias to a different file-sharing group. This way, different users sharing the same user ID would receive different group file permissions based on file-sharing group membership.

Special User IDs

There are three classes of special user IDs: the super ID (255,255), the super-group user (255,n), and the group manager (n,255). Special IDs give users additional privileges. Table 9-1 shows the three user classes and the associated user names and user IDs. Your security policy should explain who can use the special user IDs and under what conditions. You should restrict access to the special IDs to as few people as possible.

Note. Special-ID functions can be assigned to other user IDs through the Safeguard product. You might want to control who is allowed to perform these functions whether or not users have a special ID.

Table 9-1. Classes of Special System Users

Users              Typical User Name     User ID
Super ID           SUPER.SUPER           255,255
Super-group user   SUPER.user-name       255,n
Group manager      group-name.manager    n,255

The Super ID

Users with the super ID (255,255) can access all data and devices, and they can log on as any user without knowing the user's password. You can use the Safeguard product to restrict some of the super-ID capabilities. Controlling access to the super ID is crucial to protecting a Tandem system because the super ID bypasses protective restrictions that the operating system applies to other users.

The super ID (255,255) should not be used for day-to-day operations. The super ID should be used only to:
- Resolve emergencies
- License files
- Revoke licenses
- Install new software

While a super-ID logon is not needed under normal conditions, it might be required to solve certain problems. Having access to a super-ID password is sometimes the fastest way, and even the only way, to solve a problem. One way to ensure the availability of super-ID capabilities while also restricting their use is to record the super-ID password on a piece of paper, seal it in an envelope, and entrust the envelope to a party or organization who is informed and who is always present. Use of the envelope is governed by the following procedures and guidelines:
- The trusted party (who is always available) is given a list of people and circumstances under which the envelope can be surrendered.
- A log is kept of the envelope's use.
- The envelope must be torn open to get to the password. The true guardians of the password must be able to audit the envelope to ensure that it has not been improperly tampered with.
- The person who needs access to the envelope must log a business reason for the access.

These guidelines provide an audit trail, separation of duties, and access to the system when it is needed. Your security policy should document these guidelines.

The purpose, use, and dangers of the super ID (255,255) are fully described in the Security Management Guide.

Note. In the Open System Services (OSS) environment, the super ID has the user ID and has the set of special permissions called appropriate privileges. The Guardian user ID (255,255) is the same user ID as the OSS user ID

The Super-Group User

Super-group users (255,n) are operators who perform system and network operations tasks such as controlling the status of peripherals and other system components. Super-group users can execute potentially destructive actions, such as:
- Starting and stopping devices
- Reloading processor modules
- Setting the current date and time of day for the system
- Altering bus availability states (hardware paths)
- Configuring the Safeguard product

The Group Manager

The group manager (n,255) helps users control access to their groups. Group managers (n,255) can (unless restricted by Safeguard settings):
- Log on as any other group member without knowing that member's password (which means the group manager [n,255] has access to the member's files unless the Safeguard product is used to restrict access)
- Add members to the group
- Delete members from the group
- Manage the Safeguard records for group members

Handling Changes in a User's Role

When a person who has access to a special user ID changes roles, especially when leaving the organization or group, change the password or delete the user ID. Also consider these points:
- The privileged user might have had access to other people's passwords (if those passwords were stored unencrypted or encrypted by a reversible method). You might choose to require the invalidation of all passwords to which the person had access. In high-security groups, you might also want to require that all members of the group change their passwords when another member leaves the group.
- The privileged user might be aware of holes in the security policy or the security practices that would allow the user to gain access to the system after changing roles. Consider reviewing system security immediately after the person changes roles to ensure that your procedures are intact and working properly.
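The three special classes in Table 9-1 follow directly from the group and member numbers of a user ID, so a periodic review of who holds special IDs can be partly mechanized. The following is a minimal Python sketch, not a Tandem or Safeguard utility. The names UserId, classify, and special_ids are hypothetical, and the check that group and member numbers fall in the range 0 through 255 is an assumption made for illustration.

```python
from typing import Iterable, List, NamedTuple

SPECIAL = 255  # the value that marks a special group or member number in Table 9-1


class UserId(NamedTuple):
    """A user ID written as group,member (for example 255,255)."""
    group: int
    member: int


def classify(uid: UserId) -> str:
    """Return the Table 9-1 class of a user ID, or 'ordinary user'."""
    # Assumption for this sketch: both numbers lie in 0..255.
    if not (0 <= uid.group <= 255 and 0 <= uid.member <= 255):
        raise ValueError(f"user ID out of range: {uid.group},{uid.member}")
    if uid.group == SPECIAL and uid.member == SPECIAL:
        return "super ID"
    if uid.group == SPECIAL:
        return "super-group user"
    if uid.member == SPECIAL:
        return "group manager"
    return "ordinary user"


def special_ids(uids: Iterable[UserId]) -> List[UserId]:
    """Keep only the IDs that fall into one of the three special classes."""
    return [uid for uid in uids if classify(uid) != "ordinary user"]


if __name__ == "__main__":
    # Hypothetical sample of defined user IDs.
    sample = [UserId(255, 255), UserId(255, 12), UserId(12, 255), UserId(12, 7)]
    for uid in special_ids(sample):
        print(f"{uid.group},{uid.member}: {classify(uid)}")
```

A list produced this way could be compared against the people your security policy names as authorized holders of special IDs. It does not account for special-ID functions that have been assigned to other user IDs through the Safeguard product.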

Guest-User IDs

You can provide a guest-user ID on your system. A guest-user ID makes your system temporarily available to people who must have physical access to your system, but who do not need long-term access. Before providing a guest-user ID, consider these points:
- Keep the user ID as unprivileged as possible. For example, the guest-user ID should not have access to any sensitive files or system resources. You can limit guest-user ID access by using Safeguard access-control lists or by keeping the guest-user ID in a distinct group so that the guest user cannot access files in other groups. The guest-user ID should not be a super-group user (255,n) or a group manager (n,255).
- Because outside intruders often look for guest-user IDs as an easy way to access a system, be sure that the guest-user ID does not have an obvious user name and password (for example, a group and user name of GUEST.GUEST with a password of GUEST).

Unused User IDs

To manage unused user IDs:
- Institute a procedure for keeping the system current. For example, have the Safeguard product enforce user expiration dates on all user IDs. Then, from time to time, obtain a list of current authorized users from other department managers. Use this list to extend the expiration dates for current users, and allow unreported user IDs to expire.
- Automatically assign a three-month or six-month expiration date to each new user ID, and issue a periodic report notifying users when they need to request an extension of their expiration date.

In both schemes, a user who is not specifically verified as current is automatically denied access to the system once the expiration date is passed.

Deleting User IDs

When a user leaves the organization, the user's ID should be removed from the system. If a user has any aliases, the aliases must be deleted before the user ID can be deleted. Provide procedures for:
- Freezing the person's user ID. You might want to freeze the ID while the actions listed below are completed. Once the actions are completed, the ID should be unfrozen and then deleted.
- Checking the system for files owned by the deleted user and disposing of the user's files by giving them to another user, or deleting them by transferring them to backup media. If you can't decide what to do with files you want to keep, consider giving them temporarily to some unused user ID until you know who the new owner should be.
- Changing the passwords for other IDs the person could access.

- Evaluating the risk to any unencrypted password database and, if necessary, changing all passwords in an unencrypted password database the user had access to.
- Changing the guest-user ID if your system has guest-user IDs. If the person is merely moving to a different group and the members of the group are still allowed to use your guest-user ID, this change might be unnecessary.
- Removing references to the user ID from Safeguard access-control lists. Once this step is taken, the user ID should be unfrozen and then deleted.
- Removing the user's remote passwords and informing the managers of remote systems that the user's ID has been removed.

Reusing User IDs

Once you remove a user ID from the system, don't reuse it immediately, especially if user IDs that have never been used are available. A new user might inherit a previous user's privileges if the following items remain in the system:
- The old user ID set up for network access, complete with matching remote passwords
- Files owned by the previous user
- References to the old user ID in Safeguard access-control lists
- References to the old user ID in automated procedures

Managing Passwords

A password prevents an intruder from using the system and allows the system to verify that someone claiming to be a user is really that user. When establishing your security policy, consider:
- Requiring strong passwords (for example, passwords that are five or more characters long)
- Setting unexpected initial passwords
- Enforcing routine password changes
- Educating users on how to protect passwords

Requiring Strong Passwords

A password's length and the choice of characters in it significantly influence the time necessary to discover a password through an exhaustive automated search. The longer the password and the more varied the choice of characters, the more difficult it is to discover. You can use the Safeguard product to specify a minimum password length. The best password is one that cannot be found in any dictionary. Such a password would have a mix of uppercase and lowercase letters and include numbers, but still be relatively easy to memorize.
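The effect of length and character variety on an exhaustive search can be estimated with simple arithmetic: a password of a given length drawn from an alphabet of N characters has N raised to that length possible values. The following Python sketch illustrates the calculation only. The function names are hypothetical, and the guess rate is an invented figure chosen for the example, not a measurement of any real attack or system.

```python
import string


def search_space(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` characters."""
    if alphabet_size < 1 or length < 0:
        raise ValueError("alphabet_size must be >= 1 and length >= 0")
    return alphabet_size ** length


def worst_case_days(alphabet_size: int, length: int, guesses_per_second: float) -> float:
    """Days needed to try every candidate at a constant guess rate."""
    seconds = search_space(alphabet_size, length) / guesses_per_second
    return seconds / 86_400  # seconds per day


LOWER = len(string.ascii_lowercase)                 # letters of one case only
MIXED = len(string.ascii_letters + string.digits)   # both cases plus digits

if __name__ == "__main__":
    rate = 1_000_000  # hypothetical guesses per second
    for label, size, length in [
        ("lowercase, 5 characters", LOWER, 5),
        ("lowercase, 8 characters", LOWER, 8),
        ("mixed case + digits, 8 characters", MIXED, 8),
    ]:
        days = worst_case_days(size, length, rate)
        print(f"{label}: {search_space(size, length):,} candidates, "
              f"about {days:,.2f} days at {rate:,} guesses/s")
```

Each added character multiplies the search space by the alphabet size, which is why a minimum length and a varied character set reinforce each other. The estimate says nothing about dictionary attacks, which is why the text also recommends passwords that cannot be found in any dictionary.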

Setting Unexpected Initial Passwords

Don't derive initial passwords from the user name or user ID, since an inside intruder might log on to a user ID that has been created but not yet assigned.

Enforcing Routine Password Changes

You can use the Safeguard product to force a password to expire after a specified time. This Safeguard feature motivates people to change their passwords before the expiration date. Once a password is changed, a new expiration date is automatically set, and the new password remains valid until that date.

Be careful not to require changes too frequently. If users must change their passwords too often, they:
- Might set up a mechanism to change the password through a predictable series (pswrd1, pswrd2,...) or even to change the password to itself. (Proper Safeguard settings can be used to discourage this behavior.)
- Might change a password correctly but write it down in an obvious place to remember it.

To protect passwords for special user IDs, you might want to require more frequent password changes for special IDs than for general user IDs.

Protecting Passwords

You should provide guidelines for protecting passwords. All users should:
- Never write passwords down
- Use blind password entry (password entry that does not show the password on the screen as you enter it)
- Never store passwords in a system file
- Make sure that no one is watching while they enter the password during logon

Dial-Up Access and Security

Give dial-up access only to users who really need it and who will take extra care in protecting your organization's resources. Your policy and procedures regarding dial-up lines should include special criteria for screening requests for dial-up access. To protect your dial-up facility, consider:
- Using authorization lists
- Using additional external passwords
- Using callback systems
- Using automatic terminal authentication
- Periodically changing passwords and telephone numbers
- Providing precautions for when a dial-up line is dropped

185 Security Management Authorization Lists Authorization Lists Use authorization-list software to limit dial-up access to a designated subset of the user community. The Safeguard product provides this ability. Additional External Passwords Some systems demand an additional system-wide password during the dial-up logon sequence. The system password is roughly the dial-up equivalent of allowing physical access to the main work site. Inform legitimate users of the current system password through some means of limited distribution. Change the password periodically to lessen the chance of intrusion. Callback Routine A callback routine allows the system to authenticate a caller s telephone location before permitting the caller to access the system. Because the list of telephone numbers for any particular user is limited and prearranged, the chances for intrusion are limited. Automatic Terminal Authentication Some terminals can be programmed to hold an answerback string of characters. An answerback string is a set of characters that the terminal sends in answer to a computer request. By setting a terminal s answerback string to a value unknown to the user, you can create an additional authentication method. Periodic Password and Telephone Number Changes Periodically change system passwords and phone numbers, but avoid changing them too often or retaining them too long. You should also try to acquire telephone numbers that are not sequential. What Happens if the Line Is Dropped? A phone line might disconnect (drop) before a session completes. Design your TACL macro or application so that when a line drops before a session completes, the session terminates automatically, closing any lingering processes. Terminating the session prevents someone from dialing in and inheriting the session from a previous user. Securing Network Access Network user IDs allow users to transfer or access information across the network. 
Network user IDs also allow applications to transfer or access information across the network on behalf of users.

Managing Network User IDs

Handling network user IDs requires careful planning and cooperation among distributed organizations. The Tandem NonStop Kernel operating system requires that network user IDs have the same user name and user ID on all affected systems. This condition requires advance network-wide planning. As part of your planning effort, you should consider:

Reserving a range of group numbers (for example, 200 to 254) for network user IDs, and assigning network user IDs from these groups.
Deciding on the network-wide names for the groups on an as-needed basis, perhaps even reserving a particular initial letter (such as N) for network groups.

Security Precautions

Guard the ID for a network application such as the Transfer product. If an intruder accesses the network-application ID, the intruder gains access to virtually any network-secured file on the network, rather than just the network-secured files on the systems for which the user has matching remote passwords.

Encrypting Data Between Systems

With the standard network software, data moves between systems without encryption. However, you might want to consider installing encryption devices for link-level or bulk encryption of sensitive data.

Communication With Other Operations Groups

In a distributed system management environment, an intruder can obtain sensitive information by pretending to be a member of an operations group at another site (for example, a newly hired or temporary operator). If your organization spans a large physical area, authenticate all sensitive communications: telephone calls, interoffice mail, standard mail, electronic mail, and any other communications. The security policy should indicate the steps required for authenticating urgent and nonurgent requests.

Securing Client/Server Environments

Client/server environments have become increasingly popular because they provide the flexibility to integrate heterogeneous hardware and software. The client portion of the application or program usually resides on a PC or workstation and makes requests, over a local area network (LAN) or wide area network (WAN), to the server portion of the application, which usually resides on a host.
The variety of platforms, software, and networks involved in a client/server environment offers many opportunities for a security breach. To secure a client/server environment, consider the following guidelines:

Every user should be assigned a personal ID and password.
Because client/server applications use a LAN, you might want to consider installing encryption devices for link-level or bulk encryption of sensitive data.
You might want to authenticate the user at the client workstation by installing a smart card device in the workstation. A smart card is a small computer in the shape of a credit card used to identify and authenticate its bearer.

In the client/server design, the client:

Authenticates the user by using smart cards or personal identification numbers (PINs).
Decides what servers the user is entitled to use.
Passes the personal ID when it calls the server.
Resides on a diskless workstation. Diskless workstations can prevent information from being copied to a floppy disk and removed or from being left where someone might break into the workstation to access the hard disk. No sensitive data should be stored on the client workstation or on an unprotected workgroup server.

The server:

Receives the personal ID.
Decides whether it is open to all users or restricted to certain personal IDs or whether it needs stronger identification or verification.

OSS System Security

Security features relevant in the OSS environment primarily deal with directory and file access. OSS users enter the OSS environment by entering the osh command from the Guardian environment; many Guardian and Safeguard security features apply to the OSS environment as well.

OSS File Security

Safeguard access-control lists cannot be used to protect OSS files. Access to OSS files is controlled by OSS file permission codes. Each file and directory in the OSS environment has an associated permission code that indicates the security applied to the file or directory. Only the file owner or the super ID (255,255) can alter a file or directory's permission codes with OSS shell commands. The permission code for a file or directory grants or denies read, write, and execution permissions for each of three separate classes of users: the file owner, the file group, and all others. Unlike Guardian files, OSS files have no purge permission.

Interoperability With Safeguard Security

All system users, user aliases, and file-sharing groups are added and managed through the Safeguard product.
In addition, Safeguard volume-protection and process-protection records can control who is authorized to create disk files on specific disk volumes and use specific process names.

User Authentication Record

Some attributes in a Safeguard user authentication record, such as the user's primary group, initial working directory, initial program, and initial program type, apply exclusively to the OSS environment.
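
As an illustration of the OSS permission codes described under OSS File Security, the following Python sketch decodes a numeric permission code into read, write, and execute flags for the three user classes (file owner, file group, and all others). It assumes the conventional POSIX octal encoding of permission bits and uses the constants in Python's standard stat module. The function name describe_permissions is hypothetical and is not part of any OSS or Safeguard interface.

```python
import stat

# Permission bits for each user class, in read/write/execute order.
# Assumption: the permission code uses the usual POSIX octal layout.
_CLASS_BITS = {
    "owner": (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
    "group": (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
    "others": (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
}


def describe_permissions(mode):
    """Return a dict mapping each user class to an 'rwx'-style string.

    A '-' in a position means that permission is denied for that class.
    """
    result = {}
    for user_class, bits in _CLASS_BITS.items():
        flags = ""
        for letter, bit in zip("rwx", bits):
            flags += letter if mode & bit else "-"
        result[user_class] = flags
    return result
```

For example, describe_permissions(0o750) reports rwx for the owner, r-x for the group, and --- for all others. Note that there is no fourth flag for purge, which matches the statement above that OSS files have no purge permission.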

File-Sharing Groups

File-sharing groups are particularly important in the OSS environment. Each user has a group list that contains the names of all groups to which that user belongs. When the user attempts to access a file, the file's group permissions are granted to that user if the user's group list includes the name of the file's group.

Volume Protection

Each time an OSS file is created, the Safeguard software checks to determine if a Safeguard volume-protection record exists for the physical volume on which the file is to reside. If such a volume-protection record exists, the user creating the file must have create authority on the access-control list for that volume. Otherwise, the file-creation attempt is denied. In an OSS environment, all volumes that are used for a fileset must be given the same protection, because in the OSS environment, you cannot predict on which volume a file will be placed.

Special Security Concerns

When forming your security policy, you should be aware of the special security concerns of program development, PROGID programs, and licensed programs. Program development environments usually have a more permissive security policy to enable developers to develop and test programs. Your security policy should identify the procedures to follow when moving the program from a development environment to a production environment. PROGID programs and licensed programs provide you with two powerful tools that should be understood and carefully controlled.

Program Development

When a program is being developed, the system environment in which it is being created often has a permissive security policy to make it easy for developers to access files and test the program.
Possible Hazards

A development environment's permissive security policy can create the following security hazards:

If the development environment and the production environment are on the same system, both operators and developers might log on as super-group users (255,n), giving developers access to databases containing sensitive business information. Ideally, the development environment and the production environment should be separate systems.
When a program moves into production, file security settings and logons are often overlooked, which can allow unauthorized users to access the program and its database files.

Implications for Your Security Policy

Your security policy should establish guidelines for:

File security during the development process. If the development environment and production environment are on the same system, create separate production disk volumes or subvolumes. If the Safeguard product is installed, secure the volumes and subvolumes so that developers do not have create or write authority to production files.
Moving programs from development to production. When the program moves from the development environment to a production environment, your staff should:
    Coordinate the move with the change management staff
    Verify that all programs are tested before they are released to production systems
    Review file security settings and logons so that users have access only to the processes and files that they need
    Use only authorized, documented versions of the programs
    Make sure that all program files and production data files are adequately protected according to your company's security policy

Note. For more requirements on application program development, refer to Section 11, Application Management. For guidelines on the change control process, refer to Section 7, Change and Configuration Management.

PROGID Programs

PROGID programs allow one user to temporarily use a controlled subset of another user's privileges. When a user executes a PROGID program, the program operates using the privileges of the program owner and accesses only those resources that the program owner has access to.

PROGID programs are used to:

Control access to system operations. Certain operations that are easily performed by the super ID (255,255) might have to be performed by users who are not the super ID; for example, a system operator who backs up files to which the operator does not have access. If the system operator is not the super ID, a PROGID program provides a convenient and secure solution.
Control access to a database.
A PROGID program becomes an ordinary program when ownership of the program file is changed or when the program is restored from magnetic tape. In both cases, the owners can reenable the program as a PROGID program. To determine whether a program is secured with PROGID, use FUP or the DSAP utility.

Possible Hazards

Inappropriate design of PROGID programs can result in serious security holes:

Without sufficient checking of the input data range and form, an incompletely debugged PROGID program can unintentionally provide unauthorized access to restricted data.
The privileges of a PROGID program propagate to any processes created by the program.
System programs (such as licensed programs and system utilities) should not be enabled as PROGID programs unless required. A system program enabled as a PROGID program can provide excessive and easily subverted capabilities.
The ability to audit accesses made by a process is effectively lost if the process has the PROGID attribute. The Safeguard product logs the user ID of the PROGID program owner, not the ID of the process user; thus, accountability is lost.
PROGID programs do not allow a user other than the program owner to run the program in debug mode from the command interpreter (TACL). However, if a PROGID program is running and enters Debug or Inspect (debugging tools), the person running the program assumes the privileges of the program owner. By patching the program data, it might be possible to defeat whatever security is built into the program.

Implications for Your Security Policy

Your security policy should provide guidelines for using and monitoring the use of PROGID programs. The operations staff should receive training on the uses and risks of PROGID programs and how to recognize these programs.

Licensed Programs

Licensing a program has the effect of giving a program the privileges of the operating system. When a licensed program runs, privileged operations in it can bypass any ordinary security interface (such as authentication of user IDs, Safeguard protection, and memory-management protection). Only the super ID (255,255) can license a program or revoke a license.
Licensing a program that performs no privileged operations has no effect on security because the program gains no privileges that it did not already have. To determine whether a program is licensed, use FUP or the DSAP utility.

Possible Hazards

Licensing a program that uses privileged operations can seriously compromise both system integrity and security. Such a program can gather and modify information anywhere in the system, disrupt the system, disrupt the network, and do anything that the super ID (255,255) can do (including license another program). If an intruder's program is licensed, the intruder can:

Modify protected memory areas containing a program's instructions and data, without leaving evidence of the change
Gain the privileges of other users (including the super ID), and then browse and change files
Directly manipulate physical hardware resources

Tandem system programs maintain data integrity and allow safe access to user resources. Even so, do not allow all users to execute licensed programs. Use standard Tandem NonStop Kernel security or Safeguard access-control lists to limit the use of system programs that allow access to files belonging to a wide range of users.

Implications for Your Security Policy

Your security policy should establish guidelines for:

Approving a request for a program license. Writing code for a licensed program requires an intimate knowledge of the operating-system code and should be undertaken only by programmers having legitimate access to operating-system source code.
Reviewing, compiling, binding, and testing source code before issuing a license. Even after extensive testing and revision, licensed programs might contain residual bugs that could seriously interfere with operating-system functions.
Monitoring licensed programs. A licensed program has the potential to bypass known, documented, and tested interfaces.
Integrating licensed programs into new releases. A licensed user-written program might be release-dependent; therefore, the licensed program might be affected by changes in the internal operating-system structures from one release to another. Such a licensed program can fail or do great harm under one release even though it might have worked perfectly under a previous release.

Note. Tandem does not accept responsibility for the effects of user-written programs functioning at the level of the operating system and does not support such programs.

Check List

The following check list summarizes the main points of security planning:

1. Develop a security policy for your organization.
2. Educate the user community and the operations staff about security and their responsibilities for protecting the system.
3. Designate a security administrator and a security administration team to manage security. Set up check lists for the administrator and team members.
4. Maintain physical security:
    Limit access to the computer room (if applicable).
    Protect the computer cabinets and tape units from accidental damage and deliberate malicious acts.
    Protect the tape library from intruders accessing previous backup tapes.
    If your printers print sensitive information, make sure that each piece of output is delivered to its proper recipient.
    Protect on-site and off-site media storage from intruders. Keep transaction logs for all transactions. Create clear hand-over procedures between the storage-area staff and other staff.
    Determine if you need to encrypt data.
5. Establish guidelines for managing user IDs, including guidelines for:
    Assigning groups.
    Using Safeguard access-control lists (ACLs).
    Preventing shared user IDs.
    Preventing multiple user IDs for one person.
    Using the special IDs (the super ID [255,255], super-group user [255,n], and group manager [n,255]) and the procedures for monitoring and assigning these IDs.
6. Establish guidelines for managing passwords:
    Require strong passwords.
    Establish unexpected initial passwords.
    Enforce routine password changes.
    Teach users how to protect their passwords.
7. Establish guidelines for dial-up access. To protect your dial-up facility, use authorization lists, additional external passwords, callback systems, and automatic terminal authentication. In addition, periodically change passwords and telephone numbers.
8. Secure network access:
    Reserve a range of group numbers for network user IDs, and assign network user IDs from these groups.
    Decide on the network-wide names for the groups on an as-needed basis.
    Designate a particular organization to own each group name and group ID, and make that organization responsible for controlling the allocation of user IDs within its group.
    Determine what applications and users can use network IDs.
    Consider using encryption devices.
    Establish procedures for verifying communications with operations staff at other locations.
9. Secure client/server environments:
    Assign a personal ID and password to every client/server application user.
    Consider using encryption devices.
    Authenticate the user at the client workstation by installing a smart-card device in the workstation.
    Place the client portion of the application on a diskless workstation to prevent copying sensitive information to a floppy disk or giving access to a hard disk.
    Design the client/server application so that the client portion authenticates the user, determines what servers the user is entitled to use, and passes the personal ID when it calls the server. The server portion of the application should receive the personal ID and decide whether it is open to all users, is restricted to certain personal IDs, or needs stronger identification/verification.
10. Establish guidelines for moving programs from a development environment to a production environment. To secure new programs, verify that the programs are tested, use only authorized and documented versions, and ensure that security settings and logons comply with the requirements of the security policy.
11. Establish procedures for controlling PROGID programs.
12. Establish procedures for controlling licensed programs. Describe the steps operations staff should take to:
    Approve a request for a program license
    Review, compile, bind, and test source code before issuing a license
    Monitor and detect licensed programs
    Integrate licensed programs into new releases

10 Contingency Planning

Overview

Contingency planning can help you prevent, prepare for, and recover from a disaster. Disasters can occur any time and anywhere. In companies where day-to-day business activity is tied to a computer system, a sound recovery plan is imperative. Planning ahead can help you prevent some disasters and respond to those you cannot prevent. This section helps you plan so that you can take preventive measures and, if a disaster is unavoidable, recover as quickly as possible with minimal damage to your system and data. This section ends with a check list that sums up the steps involved in contingency planning.

What Is a Disaster?

A disaster can be any sudden calamitous event that brings widespread or localized destruction, loss, chaos, or injury. Disasters are commonly associated with environmental occurrences such as fire, flood, earthquake, and so on. However, companies are also at risk from nonenvironmental disasters (for example, a chemical leak that makes a facility unusable), crimes, civil unrest, utility failures (such as a power failure), and telecommunications failures. Disasters can lead to great losses by disrupting internal business procedures; by causing a loss of business volume, corporate assets, and goodwill; or by damaging the company's reputation. In a data processing environment, losing staff, materials, supplies, data, equipment, power, and so on, partially or totally, could damage a computer system so much that further losses would severely hurt a company. If your company depends upon its computer for day-to-day business activity, it is important for you to identify the risks to system operations, to take preventive actions, and to develop a recovery plan.

Preventing Disasters

The first step toward preparing for a disaster is making a dedicated effort to prevent one from occurring.
Tandem systems give you a head start on disaster prevention by providing continuous availability and fault tolerance, by preserving data integrity, and by allowing geographic independence and flexible system configurations. However, it is up to you to make sure that the unique features of Tandem systems are fully used and maintained and that other areas of your operation are reviewed with the goal of preventing disasters. The following paragraphs provide tips on reviewing:

The computer center location and facilities
Security
Preventive maintenance and system-monitoring procedures
Network and system configurations
Data recovery and integrity
Data archiving procedures

Computer Center Location and Facilities

Review Section 3, The Operations and Support Areas, to ensure that your computer center and systems are protected. If you follow the guidelines in Section 3, you can avoid many disasters such as flooding, fires, and illegal access or at least minimize the damage that such adversities would cause.

Security

Security can help you protect all areas of the data processing center, including the operations and support staff, the data processing center, equipment, material and supplies, software applications, and data. Review Section 9, Security Management, for guidelines on developing a security policy and security procedures. Controlling access to all areas of the operation, including the software, can also help you protect your operations. For example, you can limit access to the computer room by locating frequently used devices, such as printers, away from the room and by not posting signs that indicate the location of the computer facilities.

Preventive Maintenance and System Monitoring

Business-threatening disasters can occur when regular preventive maintenance and system monitoring are not performed. Review Section 3, The Operations and Support Areas, to ensure that maintenance guidelines are followed, and Section 5, Production Management, to ensure that system-monitoring tasks are performed.

System and Network Configuration

The Tandem system and network architecture helps you avoid disasters related to hardware failures. All systems have a primary and backup processor, dual-ported controllers, dual-ported disks, and the option of having mirrored disks. Usually, disks that contain critical data are mirrored (for example, the operating system data is usually located on a mirrored disk volume called $SYSTEM).
If one disk of the volume fails, the information is not lost: the mirror volume is still operational, and programs continue to write data to it without interruption. When the failed disk is restored, all data is copied back onto it from the mirror volume while transaction processing continues, and the mirrored operation resumes in full.

The Expand data communications network extends fault-tolerant operations to networks of geographically distributed computer systems. You can use Expand to connect Tandem NonStop systems at different locations to form a single network in which communications paths are constantly available, even in the event of a single line or component failure. If one path between nodes fails, Expand automatically reroutes and (when appropriate) retransmits messages using the next-best available path.
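
The rerouting behavior described above can be pictured with a short Python sketch. This is not Expand's actual algorithm or interface: the Path class, the numeric cost metric, and the transmit callback are all hypothetical, introduced only to show the general idea of falling back to the next-best available path and retransmitting when a path fails.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    cost: int        # hypothetical ranking metric: lower is better
    available: bool  # set to False once a line or component on the path fails


def next_best_path(paths):
    """Return the lowest-cost available path, or None if every path is down."""
    candidates = [p for p in paths if p.available]
    return min(candidates, key=lambda p: p.cost, default=None)


def send(message, paths, transmit):
    """Send a message over the best available path.

    `transmit(message, path)` is a caller-supplied function that returns
    True on success and False if the path failed. On failure the path is
    marked unavailable and the message is retransmitted on the next-best
    path. Returns the name of the path that carried the message.
    """
    while True:
        path = next_best_path(paths)
        if path is None:
            raise RuntimeError("no communications path is available")
        if transmit(message, path):
            return path.name
        path.available = False  # reroute around the failed path
```

In this sketch a single path failure is invisible to the caller of send as long as at least one other path remains available, which is the property the text attributes to an Expand network.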

NonStop Access for Networking provides alternate paths to guard against local area network (LAN) failure in client/server topologies.

Most Tandem systems are delivered with a fault-tolerant configuration. It is up to you to maintain a fault-tolerant configuration whenever you change or add hardware. When changing the configuration, follow the guidelines described in Section 7, Change and Configuration Management.

Data Recovery and Integrity

By using the NonStop Transaction Manager/MP (TM/MP) product, you can prevent loss of data and more quickly bring your applications up after a disaster.

NonStop Transaction Manager/MP (TM/MP)

NonStop TM/MP maintains database consistency during processing. The database can reside on a single Tandem system or can be distributed over multiple nodes of an Expand network. In either case, NonStop TM/MP ensures that the database remains consistent in the event of a program failure, a single component failure, or the total loss of communications between nodes. Using NonStop TM/MP, you can also perform online dumps and audit dumps to a remote system, thus enabling you to maintain a copy of the database at a remote backup site.

Data Archiving

In the event of a computer disaster, archived data might be key to the survival of your business. To protect essential data, you first need to determine which data is fundamental to your operation and how much of that data your company can afford to lose before its survival is threatened. Once you have identified the essential data, you can design a plan for backing up the data on a regular basis. Tandem provides NonStop TM/MP online dumps and the BACKUP and BACKCOPY utilities (for archiving to tape) for backing up data. Backing up data is not enough; storing data in a safe place is equally important.
Consider these guidelines when storing data:

Store data and archive media (disks, tapes, microfiche, and so on) in a controlled environment, at a cool temperature, away from the computer room.
For archiving, use rooms or facilities that have controls and sensors for detecting and warning of extreme temperature, humidity, smoke, or other contamination.
Determine whether data should be stored at a location separate from the computer facility and whether you need fireproof data vaults. If you do not have an off-site facility for data storage, you can arrange for off-site storage through a vendor.
Perform random checks of the archives and archived data to make sure that the procedures for requesting data, the procedures for retrieving data, and the data itself are in order and functioning properly.
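
The last guideline, random checks of archived data, could be supported by a small script along the following lines. This is a minimal sketch under stated assumptions: the catalog is a hypothetical dictionary that maps each archive name to a SHA-256 checksum recorded when the archive was written, and retrieve is a hypothetical caller-supplied function that exercises your retrieval procedure and returns the archive contents as bytes (or None if the archive cannot be found). None of these names correspond to a Tandem utility.

```python
import hashlib
import random


def checksum(data):
    """Return the SHA-256 checksum of the archive contents as hex text."""
    return hashlib.sha256(data).hexdigest()


def pick_samples(catalog, count, rng=None):
    """Choose up to `count` archive names at random from the catalog."""
    rng = rng or random.Random()
    names = sorted(catalog)
    return rng.sample(names, min(count, len(names)))


def verify(catalog, retrieve, names):
    """Retrieve each named archive and compare it with its recorded checksum.

    Returns the names of archives that could not be retrieved or whose
    contents no longer match the checksum recorded in the catalog.
    """
    failures = []
    for name in names:
        data = retrieve(name)
        if data is None or checksum(data) != catalog[name]:
            failures.append(name)
    return failures
```

A spot check would call pick_samples to select a few archives and then pass the selection to verify. A nonempty result indicates either a retrieval-procedure problem or damaged data, both of which the guideline asks you to detect before a disaster does.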

Disaster Recovery Planning

Disaster planning is a major undertaking and a team effort. To be effective, the planning effort should:

Be supported by your company's executives, with funds allocated to enable departments to fulfill the plan requirements
Involve people who have a knowledge of the company's business, who are technically knowledgeable, and who are experienced in system operations

Once a planning team has been assembled, establishing the disaster recovery plan involves the following steps:

1. Taking inventory of what is at risk and how various disasters could affect your installations
2. Developing a recovery plan
3. Testing the recovery plan and training the staff
4. Revising the plan

By taking inventory, developing a recovery plan, and testing and implementing the plan, you will be prepared to recover more quickly from disasters and to minimize the effect of the disaster on your company's computer operations. Figure 10-1 illustrates the disaster planning process.

Figure 10-1. The Disaster Planning Process
(Diagram: Gain Support of Executive Staff; Form Planning Team; 1. Take Inventory; 2. Develop the Plan; 3. Test the Plan and Train the Staff; 4. Revise and Update the Plan as Needed.)

Step 1: Taking Inventory

As a first step toward preparing a recovery plan, the planning team usually determines what is at risk and prioritizes the risks. Taking inventory involves answering these questions:

1. What is at risk? Consider everything that affects computer operations, including staff, data, equipment, applications, the customer base, and the building.
2. What are affordable data and service losses for the company? How long can the company function without running each critical application? What is the cost of down time for each critical application? What are the intangible costs such as loss of company image?
3. What types of disasters are most likely to affect computer operations? What is the level of risk each type of disaster presents? What levels of risk are acceptable?
4. When is a situation a disaster; that is, when should the disaster plan be activated? For example, if there is a fire near a site, should the disaster plan be activated when the fire is next door, in the building, or in the computer room?
5. Who has the authority to declare a disaster?
6. Is insurance available? Should your company purchase insurance for loss of equipment or business?
7. What are the recovery alternatives, the costs associated with each alternative, and the best alternative for your needs? Recovery alternatives usually include the use of a backup site. For a description of backup site options, refer to Backup Sites, later in this section.

By answering these questions, the planning team will be prepared for the second step in disaster recovery planning: developing the plan of action.

Step 2: Developing the Plan

A disaster recovery plan that is flexible and comprehensive allows the staff to respond to a variety of situations. A plan that specifies what steps should be taken, and by whom, helps the staff recover promptly from a disaster. Plans are most effective when they include the following information:

The plan requirements, including a list of the items that need protection and their locations. Evacuation plans, procedures for accounting for all personnel, and procedures for contacting rescue authorities can help you protect employees. The first priority in a recovery plan should be protecting employees and preventing injury and loss of life. Your staff will feel more confident and secure if it knows its safety is your top priority.
The name and phone numbers of the person who has ultimate decision-making authority both for determining when the plan must be activated and for implementing the plan.
Damage assessment procedures. Damage assessment provides the information that you or others must have in order to determine if people are injured, if the facility is safe for entry, if the equipment is salvageable, and if data has been lost or corrupted. It is a good idea to form and train a damage-assessment team so there will always be someone who can quickly determine the extent of damage and provide the information you need to recover from a disaster. Figure 10-2 illustrates the responsibilities of the damage-assessment team.

Figure 10-2. Damage-Assessment Team Responsibilities
(Diagram labels: Hardware; Staff; Software and Data; Damage Assessment; Facilities; Site.)

Command-post information and procedures. Communication is vital to successful recovery. A command post serves as the focal point of disaster recovery. The command post is responsible for coordinating all activities and receiving and disseminating information internally and externally. The plan should indicate where the command post will be located, who will operate the command post (usually several people), what information should be directed to the command post, and what information the command post should provide to employees, customers, and vendors. Figure 10-3 illustrates the command post's responsibilities.

Figure 10-3. Command Post Responsibilities
(Diagram labels: Vendors; Internal Personnel; Command Post; Damage Assessment & Recovery Teams; Customers.)

A list of all materials and services that must be available during a disaster, along with information on how to access the materials and services. Following are items that should be available:

Note. Contracts and service agreements with third parties might be required for some of these materials and services.

    Additional copies of the disaster recovery plan. Copies should be kept in safe places at various locations, in case some locations become inaccessible during a disaster.
    Priority Tandem hardware shipments and Tandem analyst support. Discuss your needs with Tandem ahead of time so that support is available when needed.
    Work space and living space for the duration of the disaster (for example, a backup site).
    Overnight delivery service for critical data, material, reports, and supplies.
    Transportation.
    Vendor technical support.
    All necessary operations and maintenance manuals.
    All necessary software, including a copy of the operating system, utility programs, tools, and software products.
    All necessary archived data.
    The location of backup power, communications equipment, first aid equipment, and important data and records.
    A list of all necessary files.
    A copy of configuration diagrams.

A prioritized list of people to contact in case of emergency and their phone numbers, including:
    The name and phone number of your Tandem contact
    The names and phone numbers of the staff responsible for recovery (for example, operations staff for hardware and software problems, security department for security and safety problems, and so on)
    The name and phone number of your insurance agent
A list of critical applications and the procedures for managing the applications during a disaster, with the projected amount of time required for restoring total processing.
Procedures describing when and how to back up, store, maintain, and restore data when a disaster occurs.

Backup site procedures. If your company has a backup site, the planning team should document the procedures for moving to the alternate site. For more information about backup sites, see Backup Sites, later in this section.

Procedures for reestablishing operations in the primary site or at a new permanent site. If a disaster forces computer operations to a temporary site, the operations staff will need procedures that explain when and how operations should return to the original primary site or to a new primary site. Well-defined procedures will help the staff move to the primary site without disrupting business activity and without losing data.

Any other procedures and information that will help your computer operations staff recover quickly from a disaster.

Step 3 Testing the Plan and Training the Staff

A plan is not complete until the staff is trained and the plan is tested.

Training ensures that the staff knows how to carry out all required procedures. Training is an ongoing process: new staff members need training when they start their jobs, and all staff members need training whenever the plan or its procedures change and periodically to refresh their knowledge.

Testing provides the recovery planning team with an opportunity to carry out and check its plans before a disaster strikes. Thorough testing ensures that the operations staff will be as successful as possible when recovering from a disaster. To be successful, testing should be:
- Realistic. Testing should be spontaneous (not preannounced) and rigorous.
- Regular. To maintain a valid recovery plan, the plan should be tested regularly.
- Reviewed. The planning team should review the results of each test to determine if any procedures or information should be changed, added, or deleted.
Step 4 Revising the Plan

Regular plan reviews help the planning team ensure that the plan reflects only what is absolutely essential and that the plan is revised whenever your company adds new hardware, changes databases, or changes or adds applications. Of course, when the plan changes, the planning team should test the changes to ensure that the plan furthers efficient and successful recovery efforts.

Backup Sites

An important part of developing a recovery plan is determining whether or not your company needs a backup site. A backup site is a second site that is available for use when a disaster stops operations at your primary site. Depending on the type of backup site, you can restart operations at the backup location within 10 minutes to 30 days. Your company can maintain the backup site, or pay another company to maintain the site.

Backup sites can be:
- Owned by a company for its own use.
- Owned by several companies. Sites owned by several companies are called mutual backup sites.
- Leased from third parties that own the sites and provide contract disaster services. A leased site is called a third-party backup site.

The backup sites should be equipped with all necessary hardware and software. In addition, all sites should have:
- Necessary data communications equipment. (If possible, arrange for backup lines in case the main lines fail.)
- Backups of databases, startup files, configuration files, and operations tools used to simplify operations tasks.
- Trained and knowledgeable people who can convert the site to a primary-processing site.

If you decide to have a backup site, don't forget to develop procedures for moving primary processing to the backup site.

There are four major types of backup sites: cold sites, operational-ready sites, data-ready sites, and online-ready sites.

Cold Sites

A cold site (sometimes called a cold shell) is an empty shell or building with power, air conditioning, data communications lines, and water at the site. When a disaster occurs, you move all necessary equipment, software, data, and personnel to the site. Plan on 20 or more days to make the cold site operational. Cold sites are practical when disasters of major proportions occur. For disasters that last less than 30 days, a cold site is not viable.
Developing a plan for acquiring and installing equipment will help you use a cold site effectively. You also need contracts or agreements with vendors so that equipment is supplied when you need it.

Operational-Ready Sites

An operational-ready site (also known as a hot site) is a fully operational site. The site has all the necessary hardware and software as well as support for daily operations such as telephone and other systems crucial for the survival of your organization. You can use the operational-ready site to perform low-priority processing during nondisaster periods.

Archived data is sent to the operational-ready site but is not loaded onto the system until a disaster occurs. During a disaster, you convert an operational-ready site to primary-processing status by:
- Backing up and removing the low-priority processing
- Loading the archived data
- Starting the necessary applications

Plan on one or more days to convert an operational-ready site into a primary-processing center.

Data-Ready Sites

A data-ready site is similar to an operational-ready site except that data-ready sites take advantage of electronic vaulting. Data-ready sites are updated on a staged basis. Archived data reside on the data-ready site systems and do not need to be loaded during a disaster. However, the data-ready site systems are only as current as the last data loaded onto the systems. You must determine how often backups are sent and loaded onto the data-ready site systems.

To convert a data-ready site to primary-processing status, you simply switch all primary processing to the data-ready site. The data-ready site already has the latest archived data and the necessary applications. Plan on a few hours to one day to convert a data-ready site to primary-processing status.

You might need to establish procedures for updating data on a regular basis in addition to establishing procedures for moving primary processing to the data-ready site when a disaster occurs.

Online-Ready Sites

Online-ready sites (also referred to as processing-ready sites) are secondary computer sites that are ready to take over processing from a primary site within an hour, without loss of data. Online-ready sites use concurrent processing to protect applications and maintain a current database.
Determining Which Type of Backup Site Best Meets Your Needs

To decide which backup site best meets your needs, you must first determine your window of recovery (the length of time your business can survive without your critical applications) and then evaluate the costs of each recovery alternative that can support that window. Table 10-1 lists the different types of backup sites and the advantages and disadvantages associated with each alternative.

Table 10-1. Backup-Site Alternatives: Advantages and Disadvantages

Cold Site
  Advantages:
  - Inexpensive way to acquire or lease a second computer site.
  - No equipment or operating costs until a disaster occurs.
  Disadvantages:
  - Can require 20 days or more to become operational. Everything from furniture to computers must be ordered, delivered, and installed.
  - Acquiring hardware during an emergency might be difficult.
  - It is difficult to test a recovery plan within defined time periods.
  - When leasing a cold site, there is a risk of contention for use of the facility or hardware in the event that a widespread disaster affects other companies under contract for the same site.
  - There are restrictions on how long a tenant can use the cold site.

Operational-Ready Site
  Advantages:
  - System and site are in a ready state.
  - Security is in the control of one company.
  - Simplifies recovery-plan testing.
  Disadvantages:
  - If the system is large and is not part of a network, ensuring database consistency with the primary site requires time and adversely affects the primary system's availability and response time.

Data-Ready Site
  Advantages:
  - System and site are in a ready state.
  - Security is in the control of one company.
  - Simplifies recovery-plan testing.
  - Less expensive than an online-ready site.
  Disadvantages:
  - Data is only as current as the latest backup.
  - There is a risk of losing data from transactions on the primary system that have not been restored to the data-ready systems.

Online-Ready Site
  Advantages:
  - Recovery occurs within an hour, with no loss of data or transactions.
  - Normal database backup can be shared among nodes, thus minimizing the burden on any one system.
  - Simplifies recovery-plan testing.
  Disadvantages:
  - Practical only for companies with multiple computer sites.
  - More expensive than most of the other alternatives.

Table 10-1. Backup-Site Alternatives: Advantages and Disadvantages (continued)

Mutual Backup Site
  Advantages:
  - May be least expensive way to establish a backup site.
  - Requires less capital investment.
  - Realistic recovery plan can be tested.
  - During nondisaster periods, site may be shared by participants for development work.
  Disadvantages:
  - You do not have total control of hardware and software, security issues, system configuration, and contractual matters.
  - Since the system configuration is negotiated among participants, the configuration might not be ideal for any participant.
  - It might be difficult to find partners.
  - Each partner must keep more than 50 percent of the capacity in reserve to ensure that enough backup capacity remains for the other partners.

Third-Party Backup Site
  Advantages:
  - Provides faster turnaround than a cold site.
  - Third parties have experience in disaster planning and can help you develop an effective plan and test your plan.
  - May be less expensive than some of the other alternatives.
  Disadvantages:
  - Site choices are limited.
  - It is difficult to find fully compatible systems.
  - It might be difficult to run your operations at a distance.
  - The vendor might be a source of management and response problems.
  - If the site is already occupied, your company might not be allowed to use it.
  - Time limits are imposed on usage.
  - You must transport supplies to the site.
  - Software and applications might have to be downloaded to the third-party site.
  - You must perform frequent tests to ensure compatibility of hardware and applications.
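The window-of-recovery comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a sizing tool. The recovery times are the approximate figures given in this section (20 days for a cold site, one day as the shortest conversion time for an operational-ready site, one day as the longest for a data-ready site, and one hour for an online-ready site). The BackupSite type, the candidates function, and the relative_cost ranking are hypothetical names and values introduced only for this example. Mutual and third-party sites are omitted because this section gives no recovery times for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupSite:
    name: str
    recovery_hours: float  # approximate time to resume processing (from this section)
    relative_cost: int     # hypothetical ranking only: 1 = cheapest, 4 = most expensive


# Recovery times paraphrase the text; the cost ranks are assumptions for illustration.
SITES = [
    BackupSite("cold site", 20 * 24, 1),          # "20 or more days"
    BackupSite("operational-ready site", 24, 2),  # "one or more days" (lower bound)
    BackupSite("data-ready site", 24, 3),         # "a few hours to one day" (upper bound)
    BackupSite("online-ready site", 1, 4),        # "within an hour"
]


def candidates(window_hours):
    """Return the site types that can recover within the window, cheapest first."""
    fits = [site for site in SITES if site.recovery_hours <= window_hours]
    return sorted(fits, key=lambda site: site.relative_cost)


if __name__ == "__main__":
    # A business that can survive one day without its critical applications.
    for site in candidates(24):
        print(site.name)
```

Replace the cost ranks with your own estimates for each alternative before drawing any conclusion from the ordering.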

Check List

The following check list covers the main points of disaster prevention and recovery planning.

1. Take preventive steps to limit risks of disaster:
   a. Select the best site possible for your organization:
      - Should the site be located at a remote, computer-only site, or with other business operations?
      - Is the site away from known danger zones?
   b. Select or design the best facility possible. Follow the guidelines in Section 3, The Operations and Support Areas.
   c. Follow the security guidelines described in Section 9, Security Management.
   d. Establish preventive maintenance procedures for the hardware, software, air conditioning, and computer rooms as described in Section 3, The Operations and Support Areas.
   e. Establish system-monitoring tasks as described in Section 5, Production Management.
   f. Configure the system and network (if applicable) for fault tolerance.
   g. Establish procedures for backing up and archiving data in a safe and secure location.
   h. Establish procedures for checking the integrity of the archiving system and the archived data.
   i. If you have distributed systems, make sure that applications are designed to take advantage of TMF.
2. Plan for disaster recovery.
   a. Take inventory:
      - What is at risk?
      - What are affordable data and service losses for your company?
      - How long can your company function without running each critical application, and what is the cost of down time for each critical application?
      - What types of disasters are most likely to affect your operations?
      - When should the disaster plan be activated?
      - Who has the authority to declare a disaster?
      - Is insurance available? Should your company purchase insurance for loss of equipment or business?

      - What are the recovery alternatives, the costs associated with each alternative, and the best alternative for your needs?
   b. Develop a plan that documents:
      - Plan requirements
      - Procedures for evacuating personnel, accounting for personnel, and contacting rescue authorities
      - Damage-assessment procedures
      - The name and phone number of the person with ultimate decision-making authority
      - Command-post information, including location, personnel responsible for the command post, and procedures for processing information
      - A list of materials and services that should be available during a disaster, and the location of necessary support contracts
      - The location of backup power, communications lines, first-aid equipment, and important data and records
      - A prioritized list of names and phone numbers of emergency contacts
      - Escape routes and emergency survival procedures
      - A list of the critical applications and procedures for managing the applications during a disaster
      - Procedures for backing up, storing, maintaining, and restoring data
      - Backup site procedures, if applicable
      - Procedures for reestablishing operations at the primary site or at a new permanent site
      - Anything else necessary for recovering from a disaster
3. Train the staff in disaster recovery.
4. Test the plan:
   - Define the test objectives.
   - Design the test.
   - Execute the test.
   - Revise the plan as needed and test the plan revisions.
5. Update the plan as needed and test all updates.


11 Application Management

Overview

Applications are key to the operation of many businesses. An unavailable application can result in:
- Revenue loss. Many companies sell their ability to deliver services any time of the day or night. If the application responsible for providing the service is unavailable, the customer might call another supplier.
- Lost productivity. Information-based companies rely on computer applications. The time lost because of an unavailable application results in lost productivity.
- Penalties. Companies that provide transaction services often guarantee 7x24 availability and, if they do not meet those guarantees, pay penalties to their customers.
- Support and operations costs. Recovering from an application problem or failure results in support and operations costs.

Therefore, application management is probably one of your major responsibilities. This section provides guidelines for managing operations-oriented applications. The section ends with a check list that summarizes the main points of application management.

What Is Application Management?

The Tandem application environment consists of application subsystems that enable you to develop and run high-performance, high-volume, and highly available online transaction-processing (OLTP) applications. The application subsystems that make up the Tandem application environment include NonStop Transaction Services/MP (NonStop TS/MP), NonStop SQL/MP, and NonStop Transaction Manager/MP (NonStop TM/MP).

Application management includes:
- Working with application developers to ensure that the information and support you need to run the applications are available. This step is the most important part of application management.
- Managing change. Change and configuration management help you ensure that changes proceed smoothly and that the system is configured for optimal performance.
  See Section 7, Change and Configuration Management, for more information about change control and configuration management.
- Running online, batch, and client/server applications on production systems. After installing the applications, the staff has to start and then monitor them. The system operations requirements for this step depend on the type of application (online, batch, or client/server) and the Tandem products you are using.

The following subsections provide guidelines for:
- Establishing operations-oriented application requirements
- Managing batch applications
- Managing online transaction-processing applications
- Managing client/server applications
- Using Tandem tools for application management

Use these guidelines to plan for application management; to establish schedules, priorities, and job assignments; to determine staffing and training needs; and to prepare operations documentation for the required procedures.

Establishing Application Requirements

Defining application requirements is often an overlooked and neglected area in system management, especially considering the extent to which applications affect system operations. Staffing, training, schedules, tasks, procedures, and so on are usually planned around the applications that run on your systems. Because you and your staff are judged in large part by how well you support the applications, you and your staff should consider implementing the following suggestions for the applications purchased or developed by your company:
- Define operations requirements.
- Participate in application reviews.
- Establish a production-assurance control group.

The following paragraphs provide a list of possible requirements, a check list for applications reviews, and guidelines for establishing a production-assurance control group.

Requirements

Establishing operations requirements for applications:
- Ensures that the operations staff can learn to manage new applications quickly
- Ensures the smooth transfer of an application from development to production
- Assists in developing future applications that are more manageable

Key requirements include:

Operator training. The applications development group or the vendor should provide training for new applications and upgrades. If the development group or vendor cannot provide training, you should determine how to provide the training for your staff.
Naming conventions. Established naming conventions help your staff find files, monitor programs, solve problems, and manage distributed systems. Establish naming requirements for systems, volumes, subvolumes, files, event filters, programs, and composite names used with the Distributed Name Service (DNS) product.

Events and operator messages. The application should use the Event Management Service (EMS) to format events and messages in a standard fashion. Make sure that the application provides the information you need. For example:
- Make sure that an event or message is generated whenever a problem occurs. For example, an event or message should be generated when there is a modem problem, an application problem, a network problem, a tape problem, and so on.
- Messages should contain enough information so that operators can readily identify the application and component within the application that is causing the problem. For example, if a server process in a Pathway application is causing a problem, operator messages should identify both the application and the server process that is generating the messages.
- You might want to create separate console environments for system and application messages. This can help correlate the causes of problems when they occur. For example, a communication line going down might generate only one critical message, whereas the application might generate tens of them. This arrangement makes it easier to understand the cause-and-effect relationship of problems.
- If you use tokenized events, you might want to require standard tokens and standard placement of the tokens.
- For operator messages, you might want to require standard message formats. Determine whether you need message numbers and standard text such as WARNING, ERROR, or INFORMATIONAL.
- Determine the type of information your staff needs to solve problems. For example, should the message list terminal names, user IDs, or system names?
- You might want to create EMS event filters to select only events that are critical or that require action. The filter could search messages for words such as ERROR, ABENDING, ABORT, and EXCEPTION, and highlight those messages only.
- Make sure that all events and messages are documented.
  Documentation should explain the cause of the event or message, the effect, and the required recovery action.

TSM EMS Event Viewer and NonStop Virtual Hometerm Subsystem (VHS) usage for system monitoring and event collection. If your organization uses VHS, make sure the application allows you to use VHS for system monitoring and event collection. The TSM EMS Event Viewer can display all messages formatted by EMS. VHS can receive home terminal messages.

Problem escalation procedures. Make sure that your internal application-development group or the application vendor will provide support when problems occur. Establish procedures for problem escalation.

Security. Make sure that the application is designed so that the operations staff can enforce your security policy, audit the application, and perform security administration tasks.

Performance measurement. Performance-measurement tools or counters should be built into the applications. Determine what types of counters you need and what types of procedures are required.

Application-specific operations guides. Require documentation for all applications. Your staff needs information on monitoring, installing, starting, and stopping applications, and on resolving simple problems. Your staff also needs to know how the application operates, who the end users are, what hardware is required, and what other applications can or cannot run concurrently. For batch jobs, your staff needs to know any output destinations and commitment times.

Application control. The operations staff should have total control over the application on production systems. To maintain security, you might want to create a policy that prevents applications staff from accessing a production system and data except under the supervision of the operations staff or during an emergency if the operations staff is unavailable.

Application development. Ideally, the applications staff should have their own development system for developing application programs. If this is not possible, create separate production disk volumes or subvolumes with create or write restrictions to the production files.

Check List for an Applications Review

The cost of failure recovery for an application escalates by an order of magnitude at each successive stage of implementation:
- Specification and design
- Prototype
- Manufacture
- Installation
- Operation

Participating in an application review (at the design stage, if possible) is a good means of ensuring that your requirements are met and that the application developers and vendors understand the needs and concerns of the operations staff. Usually, senior-level support personnel attend the reviews.
The following list provides examples of questions to ask during an application review:
- What operator intervention is required to run the program? What parameters must be input by the operator? Does the operator understand and have access to all the necessary information?
- Are the screens clear and easy to use? Are the screen design and the use of the keyboard standard with the application?
- If the application needs to run when there are no operators present (for example, during a holiday), is it designed so that operator intervention isn't required?
- Will data be available when the job needs to run? What is the impact if some of the data is not available when the job needs to be run?
- Is the job as easy to run as possible?

- Will status messages be displayed so that operators know that the application is running properly?
- Have the naming conventions been followed?
- What operational procedures are required to support the application?
- How fault-tolerant is the application? Can it handle minor problems? Does the application have recovery procedures for each phase of processing?
- What are the backup procedures? How often should NonStop TM/MP dumps be performed? What provisions must be made for off-site disaster recovery?
- Are additional tools or utilities (programs, command files, TACL routines) needed before the application can go into production? Will the operations staff be involved in tool development?
- What impact does the application have on production schedules and report distribution?
- What documentation does the operator need? Is there online help? What are the procedures for updating the documentation? Who is responsible for updating documentation?
- What training does the operator need to monitor the application and run jobs?
- What does the operator have to do to restart the job? Is this information documented?
- What error messages will operators have to understand? Will all problems generate events or messages? Is this information documented?
- What are the problem escalation procedures? Whom do operators contact when they cannot resolve a problem?
- How often does the application need to be run, and what is the impact of the application on the level of service? Is the application a batch program that will need to be run many times, or is the application an online program that needs to be started only once? How long does the application take to run? Can other applications run while this application is running?
- Does the application provide performance statistics? What type of statistics?
- What are the security provisions? Does the application require users to log on and supply passwords?
  Who can access files, databases, terminals, systems, and so on? Can the operations staff control access? Can the operations staff audit access attempts?

  Is the interface to application security easy to use?
- What are the hardware requirements? What are the disk space requirements and are they reasonable? Has your organization's capacity-planning and application-sizing staff analyzed the application for future hardware requirements? Should a Tandem analyst be contacted for help with sizing and reviewing the application? Have arrangements been made to acquire additional hardware, if necessary?
- What is the implementation strategy for the application? Will the application be implemented in a phased manner? Are there backout options?
- How will the application be tested before it is placed into production and after it is placed into production?
- How often will software changes occur?

Establishing a Production-Assurance Control Group

When an application requires input data, users, rather than operators, should be responsible for entering the data. Many operations organizations have a production-assurance control group to act as a liaison between the operations staff and the users. A production-assurance control group is responsible for:
- Setting up jobs that require parameters such as dates and options.
- Scheduling jobs to run at the appropriate times.
- Conveying any special instructions to the operators regarding the runs.
- Checking the outputs and distributing them when jobs are completed. If there are any discrepancies, the group corrects the problem and reruns the jobs without involving the users.
- Tracking down reports and reprinting them, if necessary.
- Working with the change control group to ensure that new versions of the application programs, libraries, and obey files are properly installed.

Production-assurance personnel are usually senior operators or librarians from the applications area. They generally possess good communications skills because they need to translate a user's request into the job in a way that will get it done.
They are generally assigned to one or more user departments so that they become very familiar with the requirements and problems of their assigned groups, which helps them provide high service levels to the users. Production-assurance control groups are common in batch environments.
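As an illustration of the event-filter requirement described under Requirements earlier in this subsection, the keyword search can be sketched in Python. This is not EMS filter syntax and it does not use any EMS interface. It is only a minimal sketch of the selection logic, applied to plain message text. The keyword list comes from the text; the function names and the sample messages are hypothetical.

```python
# Keywords the text suggests searching for in operator messages.
KEYWORDS = ("ERROR", "ABENDING", "ABORT", "EXCEPTION")


def needs_attention(message):
    """Return True if the message text contains any keyword, ignoring case."""
    text = message.upper()
    return any(word in text for word in KEYWORDS)


def highlight(messages):
    """Keep only the messages an operator should look at, in their original order."""
    return [message for message in messages if needs_attention(message)]


if __name__ == "__main__":
    # Hypothetical sample messages, for illustration only.
    sample = [
        "INFORMATIONAL: communication line is up",
        "ERROR: tape drive not ready",
        "Server process abending",
    ]
    for message in highlight(sample):
        print(message)
```

Because the match is a simple substring test, ABORT also selects messages that contain ABORTED or ABORTING. A production filter would normally match on standard message numbers or tokens rather than free text, which is one reason to require standard message formats.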

Batch, Online, and Client/Server Processing

You might have to manage batch, online, and client/server processing applications. Batch, online, and client/server processing applications require different system management techniques. The following paragraphs describe the operations requirements for each type of application.

Batch Processing

Batch processing programs are characterized by the following:
- Batch programs are generally used to collect and manipulate data to generate reports.
- Batch programs do not require user interaction.
- Batch programs run on demand or on a schedule, updating master files only once a day or only at particular times during the day.
- Batch programs run at a lower priority than OLTP programs. However, batch programs are big consumers of disk I/O and can impact the performance of online transaction-processing programs.
- Batch programs run as fast as possible, using whatever resources are available.
- Batch programs access records and files in sequence.
- Batch programs perform one or more tasks on many records of the same type.

Figure 11-1 shows the sequential nature of batch processing.

[Figure 11-1: Batch Processing. Labels: Transaction Input, Application, Master Files.]

Operations

To perform batch processing on Tandem systems, the operations staff usually follows these steps:
1. They identify the batch job.
2. They submit the job input file to a scheduler program.

Once the job is submitted, the scheduler does the rest of the work. The scheduler performs the following steps:
1. It schedules the job according to the scheduling options.
2. When the time comes to run the job, the scheduler starts the executor program. The executor program executes the commands in the job input file.
3. Job output is sent to the spooler. The spooler collects the output and prints the required reports.
4. When the job finishes, the scheduler writes a user log file to the user's subvolume.

To ensure that batch processing proceeds smoothly, consider establishing policies and guidelines for running batch jobs, such as indicating how batch jobs should be scheduled, when the schedule can be overridden, and who has responsibility for running batch jobs.

Tools

Tandem offers the NetBatch and NetBatch-Plus products to help you manage batch applications. NetBatch is an automated job-management system that relieves operators of the tasks of scheduling and dispatching jobs manually. NetBatch-Plus is a screen-driven interface to NetBatch. For more detailed information about NetBatch and NetBatch-Plus, refer to Section 14, Operations Management Tools.

Online Transaction Processing

Online transaction-processing programs are characterized by the following:
- Programs run continuously and as transactions occur.
- Programs access and update files and records in a random fashion.
- Programs require user interaction. OLTP programs allow users located throughout a company to retrieve information immediately and update it in a database.
- Programs require a highly available system.
  If the program fails for any reason and you are not using NonStop TM/MP, a large quantity of data could be lost, resulting in a large loss of revenue for your company.
- Programs often have a batch component that summarizes or acts upon the day's online activity.

Some typical online transaction-processing applications include:
- Retail Point-of-Sale (POS)

- ATM
- Order entry
- Credit authorization
- Prescription orders
- Stock exchange
- Manufacturing automation
- Travel reservations
- Telephone company switches

Figure 11-2 illustrates a typical OLTP application.

[Figure 11-2: Online Transaction Processing. Labels: Tandem NonStop System, TP Monitor, Server Program, Database.]

Operations

When managing OLTP applications, your most important concerns should be to:
- Maintain a stable environment.
- Monitor the application and system to detect potential problems. Detecting problems early helps you maintain the high availability required by online transaction processing. Operations staff should monitor the system whenever the application is running. For example, if your application runs 24 hours a day, you need to establish shifts to monitor the system 24 hours a day.

When the application is up and running (in production mode), you should not:
- Bring down the system.
- Change the environment.
- Perform backups unless the application runs 24 hours a day. If the application runs 24 hours a day, perform backups during the slowest part of the day.
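For an application that runs 24 hours a day, one way to find the slowest part of the day is to total the transaction counts for every possible backup window and pick the smallest total. The following Python sketch shows the idea. It assumes you already have 24 hourly transaction counts from your own performance measurements; the function name quietest_window and the sample figures are hypothetical.

```python
def quietest_window(hourly_counts, window_hours):
    """Return the start hour (0-23) of the contiguous window with the fewest transactions.

    hourly_counts holds 24 totals, one per hour starting at midnight.
    The window may wrap around midnight. On a tie, the earliest start hour wins.
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    if not 1 <= window_hours <= 24:
        raise ValueError("window_hours must be between 1 and 24")

    best_start = 0
    best_total = None
    for start in range(24):
        # Sum the window, wrapping past hour 23 back to hour 0.
        total = sum(hourly_counts[(start + i) % 24] for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start


if __name__ == "__main__":
    # Hypothetical counts: busy all day, quiet between 02:00 and 04:00.
    counts = [50] * 24
    counts[2] = 5
    counts[3] = 4
    print("Start a 2-hour backup at hour", quietest_window(counts, 2))
```

Averaging the counts over several weeks before choosing a window reduces the chance of scheduling backups around a one-time lull.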

Tools

Table 11-1 lists Tandem software products that can help you develop and manage online transaction-processing applications. For detailed descriptions of these products, refer to Section 14, Operations Management Tools.

Table 11-1. Online Transaction-Processing Tools

Product: NonStop Transaction Services/MP (NonStop TS/MP)
Function: Provides the programs and operating environment required for developing and running OLTP applications.

Product: Pathway/TS
Function: Provides tools for developing and interpreting screen programs to support OLTP applications in the Guardian environment.

Product: NonStop Transaction Manager/MP (NonStop TM/MP)
Function: Helps you maintain the consistency, durability, and integrity of distributed databases that are being updated by concurrent transactions.

Product: Transfer
Function: Enables organizations to move and manage information efficiently within a single Tandem system or within a network of distributed systems. Serves as the foundation of PS MAIL (electronic mail system) and other Tandem products.

Client/Server Processing

Client/server processing programs are characterized by the following:

- Programs are divided between a client program, which resides on a PC or workstation, and server programs, which reside on a host system (often, a larger and more powerful system). The client/server architecture is linked together by local area networks (LANs) and wide area networks (WANs).
- Client programs provide the presentation services (GUIs) and part of the program logic. The rest of the program logic and the database management functions are provided by the server programs.
- Users typically request database information from the server program through an easy-to-use GUI provided by the client program.
- Servers are generally powerful computer systems that perform a specific type of service, such as file servers, database servers, and print servers.
- Clients are typically PC-class computers, workstations, or other client devices such as ATM devices. Ten to fifty clients per (database) server is common.
- Network components include transceivers, bridges, routers, hubs, and gateways.

Client/server environments are very diverse. They range from a small departmental LAN with a single server to networks connecting tens of thousands of users to hundreds of servers. Figure 11-3 shows a simple client/server environment; Figure 11-4 shows a complex environment with a ServerNet wide area networking (SWAN) concentrator configured to a local system's ServerNet system area network (SAN).

Figure 11-3. Simple Client/Server Environment (diagram labels: Clients A, B, and C (PC, Macintosh, or UNIX workstation), each running a transaction input application; RSC message requests and data replies; Server (Tandem) with Transaction Delivery Process (TDP), $PROC, applications, Pathway server, Enscribe, and NonStop SQL)

For G-series systems, the ServerNet SAN provides the communication path used for interprocessor messages and for communications between processors and I/O controllers. Refer to the ServerNet Communications and Configuration Manual for a detailed description of the ServerNet SAN.

The SWAN subsystem is a collection of software and hardware components that provides G-series systems with networking and data-communications capabilities. The SWAN subsystem is used to configure both WAN and LAN connectivity for these supported communication (COMM) subsystem objects:

- ATP6100 line-handler processes
- CP6100 line-handler processes
- EnvoyACP/XF line-handler processes
- Expand network control process, line-handler processes, and line-handler devices (Expand LINE and PATH objects)
- SNAX/APN service manager processes and line-handler processes
- SNAX/XF service manager processes and line-handler processes
- X25AM line-handler processes

For further information about the SWAN subsystem, refer to the ServerNet Communications Configuration and Management Manual.

Figure 11-4. Complex Client/Server Environment (diagram labels: Processors 0 through 3, each attached to the X fabric and the Y fabric; MFIOB (0,1), (1,0), (2,3), and (3,2) with differential SCSI; E1SA (0,1), (1,0), (2,3), and (3,2) with Ethernet; SWAN concentrator)

Legend: E1SA = Ethernet 1 ServerNet Adapter; MFIOB = Multifunction I/O Board

Operations

When managing client/server processing applications, your most important concerns should be to:

- Follow the guidelines for managing OLTP applications.
- Monitor LAN connections. Check the EMS log (using the TSM EMS Event Viewer) for connection errors, session errors, I/O errors, and application program errors.
- Maintain transaction logs for both the client and its server.

The most common client/server failure is a transient error in the client, network, or server that requires the user to reboot or restart an application. Reestablishing a session can take a long time if the client lacks past transaction information. After the session is reestablished, the client is forced to perform queries to determine the status of transactions that were in process when the session was disrupted.

If both the client and its server maintain a transaction log of information necessary for reestablishing a client session, client down time can be significantly reduced. After the client reboots and logs in, the server can reestablish the session, provide the status of client transactions, and continue processing transactions as necessary.

Tools

Table 11-2 lists Tandem software products that can help you develop and manage client/server processing applications. For detailed descriptions of these products, refer to Section 14, Operations Management Tools.

Table 11-2. Client/Server Processing Tools (page 1 of 2)

Product: Client Server Gateway (CSG)
Function: A workstation process that accepts service requests from workstation applications through the CSG application programming interface (API). Together with the SSG (see below), it provides a general mechanism for delivering commands from a workstation to command interpreters on a Tandem host and returning responses to the workstation. The DSM/NOW product uses the SeeView Server Gateway and the Client Server Gateway (SSG/CSG) as its underlying mechanism for exchanging commands and responses with host processes.

Product: Data Access Language (DAL) Server
Function: Provides Macintosh users on attached LANs with reliable, transparent, fault-tolerant, distributed access to data residing in Tandem NonStop SQL/MP databases.

Product: NonStop Access for Networking
Function: Extends fault-tolerant computing through the LAN to the desktop.

Product: NonStop ODBC Server
Function: Provides client applications with access to Tandem's NonStop SQL/MP relational database management system through the Microsoft Open Database Connectivity (ODBC) interface.

Product: NonStop Transaction Services/MP (NonStop TS/MP)
Function: Provides the programs and operating environment required for developing and running client/server processing applications.

Product: Pathway Open Environment Toolkit (POET)
Function: A client/server product that works with the Remote Server Call (RSC) product to enable programmers to create and run client/server applications where the client is a Microsoft Windows program and the server is a Pathway server running on a Tandem NonStop system.

Table 11-2. Client/Server Processing Tools (page 2 of 2)

Product: Remote Server Call (RSC)
Function: Facilitates client/server computing by allowing workstation applications running in Microsoft Windows, Windows NT, MS-DOS, OS/2, UNIX, Winsock, and Apple Macintosh operating environments to access Pathway server classes and Guardian processes.

Product: SeeView Server Gateway (SSG)
Function: A Tandem host process that provides a command and control server gateway to NonStop Kernel command interpreters on Tandem host systems, allowing users to access command interpreters from workstation client applications. The DSM/NOW product uses SSG/CSG as its underlying mechanism for exchanging commands and responses with host processes.

Product: Subsystem Control Facility (SCF)
Function: Allows operators to monitor and change the characteristics of data communications lines without having to take the NonStop system down.

Case Study

Developing application requirements can help reduce application down time and improve operator productivity. The following case study shows how a Chicago-based savings and loan institution established application requirements to cope with a consolidation of systems and applications from a distributed environment to a centralized environment.

Business Background

North American Savings and Loan (NASL) is a financial institution with over 200 offices located in North America, including New York, Boston, Toronto, and Montreal. NASL has been making several changes to its Tandem computer operations. Probably the most significant is the consolidation of several distributed locations consisting of Tandem NonStop VLX systems into multiple G-series systems in three locations in Chicago. This evolution also changed the way in which NASL's applications were managed: from multiple systems running one or two major applications to a few central data centers that run multiple applications.

The systems were consolidated into central data centers. Subsequently, the applications were physically consolidated. That is, they shared the same system. They did not, however, share processors or disks. They also did not share operations environments such as job executors, change control procedures, or data files. A physical consolidation, but not a logical one, resulted.

In the old distributed environment, the systems and applications were under the control of the department that used them. That department was responsible for all aspects of that application, including operations. Each department had its own operators and support staff, as well as its own standards. Each standard met the requirements of the bank auditors and of the department; but each was, and still is, different.

In the new centralized environment, operations support is provided by the data center personnel. Application users now have less flexibility and control but no longer have the responsibility of staffing operations. The operators at each central site are now required to know the application and all the peculiarities of its environment, such as the executor, scheduler, and change control procedures. Predictably, this raises a new set of problems.

Analysis of Problem

Although consolidating the applications into three data centers created the potential for a more efficient and organized environment, the data centers were unprepared for the new applications they had to manage. This resulted in the following observations:

Operators were expected to enter data.

For example, one of the applications that must be run regularly requires data input. The operator must confirm certain questions by entering Y for yes or N for no. The application checks for a Y response and assumes no if anything else is entered. In one situation, an operator working during the early morning shift entered a space before entering Y and the response was misinterpreted by the application as no. Since the error was not visually obvious, the operator had to contact the support person at 5:00 a.m. to resolve the problem. Meanwhile the application was unavailable for more than an hour. Later, during the same run, while talking on the telephone, the operator entered a dollar amount in excess of $27 million, which was grossly incorrect.

Applications did not follow any programming standards.

For example, in another incident, the same application prompted the operator on duty for a date in DDMMYY format. The operator entered 23/01/1990 in what appeared to be an error. The operator explained that although the instructions indicate a DDMMYY format, he was told by the application user that he must enter the date in DD/MM/YYYY format. As predicted, the application accepted the input.

During the same run, a file system error 13 was reported for the file DATA.ACCLIB.RPTWK2. The error was caused by a missing $ required at the beginning of the file name. The operator was told that the error was normal and expected and that this error would always be displayed.

Another error occurred when a sort was called up with no input. The sort ended with the error "Error no records on file." Logically, a sort would not be called to sort an empty file, but the application was designed to sort any unexpected errors. Because there were no unexpected errors reported during the run (the errors that were reported were all expected), the sort failed (producing an expected error).

Operators were required to make business decisions.

For example, during another application run, an operator was requested to enter three dates: the current date and the next two business days. To do this, the operator produced a wallet calendar and, allowing for the weekend and after deciding that Dr. Martin Luther King's birthday was a worldwide holiday, entered the data.
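The Y/N incident described above comes down to comparing raw operator input against a single character and treating everything else as no. The following is a minimal sketch, in Python rather than any Tandem language, of a prompt that tolerates surrounding spaces and letter case and asks again when it does not recognize the answer. The function names parse_yes_no and confirm are hypothetical and do not come from the NASL application.

```python
def parse_yes_no(raw):
    """Return True for yes, False for no, None for anything unrecognized."""
    answer = raw.strip().upper()  # ignore stray spaces and letter case
    if answer in ("Y", "YES"):
        return True
    if answer in ("N", "NO"):
        return False
    return None  # unrecognized: the caller should ask again, not assume "no"


def confirm(prompt, read=input, write=print):
    """Keep asking until the operator gives a recognizable answer."""
    while True:
        result = parse_yes_no(read(prompt + " (Y/N): "))
        if result is not None:
            return result
        write("Unrecognized response; please enter Y or N.")
```

With this approach, an answer of " Y" is read as yes, and an answer such as "X" produces a new prompt instead of a silent no.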

Implementation of Recommendations

While each of these problems can be easily fixed, the larger issue is a lack of standards. NASL had no standards for managing the new applications at each of the data centers. The standards that had been developed previously by each application department in the distributed environment were no longer effective in a centralized environment. Application standards and requirements needed to be developed. The following recommendations were implemented at NASL:

- Establish a production-assurance control group.
- Establish application requirements.
- Provide training.

Establish a Production-Assurance Control Group

Users of the application, not the operators, should be responsible for entering application input data. By establishing a production-assurance (PA) control group that acts as a liaison between the operators and the user departments, NASL ensures that an application user's request is translated into a job so that the work gets done. NASL assigned a PA person to each department that used the applications. This enabled the PA person to become familiar with the assigned group's requirements and problems. At NASL, the new production-assurance control group will:

- Provide the primary production support for applications.
- Set up all jobs that require input data.
- Use NetBatch-Plus to schedule the jobs to run at the appropriate times.
- Convey any special instructions to the operators regarding the runs.
- Check the outputs and distribute them when jobs are completed. If there are any discrepancies, the PA group will correct the problem and rerun the jobs.
- Implement new versions of the application programs.

Establish Application Requirements

NASL needs to establish application requirements for all applications being managed in each of the three data centers. All applications should be written to conform to a set of programming standards as well as operational standards. Operations must help develop and document the operational standards that will enable NASL's applications to run smoothly. The application requirements would provide the operations staff with instructions for monitoring, starting, and stopping applications; the PA group with the required application input; and the application developers with guidelines for application development, naming conventions, and message management. For a check list of other possible requirements, refer to Establishing Application Requirements, earlier in this section.

At NASL, the application requirements specifically address the following design requirements:

All application input data must be verified.

The application developers are now required to design applications as follows:

1. Accept all inputs.
2. Echo them back.
3. Ask the operator to verify data. If the operator confirms the data, then proceed. If not, allow the operator to reenter the data.

All applications must run normally without errors.

There is great danger in allowing production jobs to run normally with error messages. Operators will get used to the errors and might miss a real one. At NASL, messages are now sent to the operator for only two reasons:

- When operator intervention is required
- When there is some piece of information the operator must be aware of

There are two ways the application developers could resolve the sort error described earlier:

- The application program, recognizing that there are no records to be sorted since there are no errors, can bypass the sort.
- A TACL macro could absorb the error messages and discard them so that the operator does not see any error.

All applications must use the Event Management Service (EMS).

To help reduce operator intervention, NASL now requires that all new applications use EMS to format events and messages in a standard fashion and to generate messages to other systems and other application programs. Some of the other event-management standards NASL requires include:

- Applications must be developed using EMS FastStart for creating all event messaging for the application.
- All events must be tokenized. EMS FastStart can be used to create EMS events in tokenized format.
- Each event must have a unique event number. The event number must be one of the tokens in the message.
- Each event must be categorized as informational, warning, or critical.
- All application EMS events will be reviewed during the design walk-through for the application and again prior to the acceptance of the application by operations.

All dates must be retrieved from a database.

Time is lost and risk is involved when the operator must enter dates for an application. Looking up dates on a calendar and adjusting for weekends, holidays, and partial holidays in one city or another takes time and can potentially create errors.

At NASL, the key to the database is the system date and application ID. For example, if the system date is February 28, 1994, and it is before 5:00 p.m., the transaction date is February 28, the next business date is March 1, and the second business date is March 2. The calendar accounts for all weekends, holidays, and partial holidays (where any one location is open for business even though others are not). The PA group is responsible for maintaining the database file.

Provide Training

Because new schedulers and automation utilities were being introduced into the data centers, NASL decided to provide operator training in the following areas:

- Automating Tandem operations
- TAL syntax

In addition, because NASL has a unique product set, a customized class discussing the Tandem NonStop Kernel, NonStop TS/MP, NonStop TM/MP, NetBatch and NetBatch-Plus, and TACL was developed. While Tandem could have provided such a customized class, NASL chose to use its own training staff. Training is necessary to use these products; and with all the recent changes at NASL, training can be a motivation for the staff.

Check List

The following check list summarizes the main points of this section:

1. Establish operations requirements for all applications.
2. Participate in application reviews to ensure that your requirements are met.
3. Establish a production-assurance control group to ensure that applications are run with the correct data input or options.
4. Establish procedures for managing batch, online, and client/server processing applications.
5. Establish procedures for using Tandem products to manage the applications.
6. Train your staff so they can use the Tandem tools and can install, run, and monitor the applications.
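The business-date rule in the case study (system date February 28, 1994, before 5:00 p.m., giving February 28, March 1, and March 2) can be sketched in Python as follows. This is an illustration only, not NASL's database design: it computes the three dates instead of looking them up by system date and application ID, it ignores partial holidays, and the behavior at or after the cutoff (moving the transaction date to the next business day) is an assumption that the text does not state. The names business_dates and next_business_day are hypothetical.

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(17, 0)  # 5:00 p.m., taken from the example in the text


def next_business_day(day, holidays):
    """Return the first weekday after `day` that is not in `holidays`."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


def business_dates(now, holidays=frozenset()):
    """Return (transaction date, next business date, second business date)."""
    transaction = now.date()
    # Assumption: at or after the cutoff, or on a weekend or holiday,
    # work is dated the next business day.
    if (now.time() >= CUTOFF or transaction.weekday() >= 5
            or transaction in holidays):
        transaction = next_business_day(transaction, holidays)
    first = next_business_day(transaction, holidays)
    second = next_business_day(first, holidays)
    return transaction, first, second
```

A holiday set maintained in one place, as the PA group maintains the database file at NASL, keeps the operator from consulting a calendar at run time.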

12 Automating and Centralizing Operations

Overview

Automating and centralizing operations can help you improve the efficiency and effectiveness of your system operations support staff and help improve system and application availability. This section lists the steps required for automating and centralizing Tandem system operations and describes the products that help you automate and centralize.

Why Automate and Centralize Operations?

Automating operations is the process of using command files, macros, and applications to perform operations tasks. Automating operations:

- Minimizes unplanned outages. An unplanned outage is system or application down time caused by a problem situation such as faulty hardware, operator error, disaster, and so forth.
- Helps you increase staff productivity.
- Helps you manage unattended systems.
- Reduces human error.
- Relieves your staff of repetitive tasks.
- Ensures that tasks are performed in accordance with the company's operations policies.
- Reduces the number of operators required at a site and reduces the level of expertise required of operators.

Figure 12-1 shows how automated procedures can prevent or minimize unplanned down time. It contrasts traditional methods of solving an operations problem with automated methods.

Figure 12-1. Typical Operations Problems

Example: Disks Run Out of Space

Traditional view:
1. Operations runs periodic reports.
2. Database manager examines report.
3. Database manager schedules job to resolve problem.
Estimated people cost in time: high

Automated view:
1. Monitoring program generates threshold message (event message).
2. Automated procedure schedules disk compression (DCOM) with NetBatch.
Estimated people cost in time: low

Example: CPU Halt

Traditional view:
1. Operator follows runbook procedure.
2. Operator takes processor dump.
3. Operator reloads processor.
4. Operator starts processes in reloaded processor.
5. CE completes TPR.
Estimated time to detect and recover: 30 minutes or more

Automated view:
1. Automated procedure takes processor dump and reloads processor.
2. Automated procedure designates primary processors to balance load.
3. Problem automatically reported electronically to Tandem.
Estimated time to detect and recover: several minutes
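The automated view of the disk-space example in Figure 12-1 can be sketched as a threshold check that produces an event, followed by a procedure that hands a job to a scheduler. The sketch below is in Python for illustration only and does not use any Tandem interface: the thresholds, the volume name, and the schedule_job callable (a stand-in for submitting a DCOM job through NetBatch) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    volume: str
    percent_full: float
    severity: str  # "warning" or "critical"


def check_volume(volume, used, capacity, warn_at=80.0, critical_at=90.0):
    """Return an Event when usage crosses a threshold, otherwise None."""
    percent = 100.0 * used / capacity
    if percent >= critical_at:
        severity = "critical"
    elif percent >= warn_at:
        severity = "warning"
    else:
        return None  # below both thresholds: nothing to report
    return Event(volume, round(percent, 1), severity)


def respond(event, schedule_job):
    """Schedule a compression job for any threshold event.

    `schedule_job` is a hypothetical callable standing in for the scheduler.
    """
    if event is not None:
        schedule_job("DCOM", event.volume)
```

Keeping the check and the response separate mirrors the figure: the monitoring program only generates the message, and a separate automated procedure decides what job to schedule.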

Centralizing operations is the process of managing distributed systems, distributed applications, or a whole network from a single site. Centralizing operations:

- Allows fewer expert operators to manage a greater number of systems
- Allows you to leave some systems unattended or supported by only minimal staff

Typically, a central site serves as a service organization to all other sites. The central site's service-level agreements usually define its role in relation to other sites in the organization. As a service organization, the central site provides:

- Staff experience and expertise
- Data security
- Data storage
- Daily backup and recovery of data
- Hardware and software recommendations

Figure 12-2 shows the typical hierarchy of a central site managing a network of distributed systems.

Figure 12-2. Centralized Operations (diagram labels: Mainframe, Central System, Distributed Systems, Departmental Systems, Workstations and PCs, End Users)

Automating Operations Tasks

Automating operations tasks involves the following steps:

1. Commit resources to system automation. Tandem provides the automation tools, but the operations staff need to use the tools to automate system tasks. You might also have to train your staff so that they can use the automation tools.

2. Determine which tasks should be automated. Select the tasks that will increase staff productivity. Automate tasks to:
   - Help you manage unattended systems
   - Reduce operator errors
   - Relieve your staff of repetitive, complex, or time-consuming duties
   - Enforce your company's operations policies

   Examples of tasks that are typically automated include:
   - Starting up and shutting down the system and applications.
   - Checking the system status and ensuring that the system is configured in its normal and optimal mode.
   - Bringing up network line handlers.
   - Steps for determining problems. For example, an event is generated when a line goes down. Problem analysis tasks, such as gathering information to help you determine the cause of failure, can be automated.
   - Recovering from routine (recurring) problems.
   - Generating reports.
   - Configuring the spooler.

3. Determine which tools to use to automate the tasks. Tandem offers the following products and tools to help you automate tasks:
   - Command files
   - Distributed Name Service (DNS)
   - Event Management Service (EMS)
   - NetBatch and NetBatch-Plus
   - NonStop Virtual Hometerm Subsystem (VHS)
   - Object Monitoring Facility (OMF)
   - Subsystem Programmatic Interface (SPI)
   - Tandem Advanced Command Language (TACL)
   - Tandem Failure Data System (TFDS)
   - TSM EMS Event Viewer

   These tools are described later in this section.

4. Have your intermediate-level or senior-level support personnel develop and test automated procedures. Make sure that every online file contains comments that explain what the file does and what each command in the file does.

   Note. When developing automation procedures, be sure to follow the standards or policies that have been implemented by your operations organization or established by your service-level agreements.

5. Document the procedures developed in Step 4. Also document the location and names of all required files. File names should indicate what the file contains or does. For example, a file that loads the system might be called SYSLOAD.

6. If staff action is required to start the automated procedures, teach your staff how to start the procedures.

Centralizing System Operations

Centralizing operations tasks involves the following steps:

1. Select the site for centralized system operations.

2. Determine which tasks will be centralized and which tasks will be decentralized (if any). Few activities associated with operating Tandem computers require physical access to the computer. Most operations tasks can be performed from an operator console. The console can be located near the computers or miles away.

3. Determine your personnel needs for centralized and decentralized tasks. For example, if backing up data will not be centralized, you should plan to have a staff member available at each computer site to back up data.

4. Determine which tools to use to centralize tasks. Tandem offers the following tools to help you centralize tasks:
   - Distributed Name Service (DNS)
   - Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
   - Distributed Systems Management/Software Configuration Manager (DSM/SCM)
   - Event Management Service (EMS)
   - File Utility Program (FUP)
   - Measure
   - NetBatch and NetBatch-Plus
   - Network Statistics Extended (NSX)
   - Open Notification Service (ONS)
   - Simple Network Management Protocol (SNMP)
   - Subsystem Programmatic Interface (SPI)
   - Tandem Advanced Command Language (TACL)
   - NonStop Transaction Manager/MP (TM/MP)
   - TSM EMS Event Viewer

5. Develop and test procedures for centralized and decentralized tasks.

6. Develop and test problem recovery procedures. For example, if the communications lines between the central node and a remote node go down, the staff should know what steps to take to perform tasks on the remote node.

7. Document the procedures developed in Steps 5 and 6.

8. Train your staff in the procedures.

Automation and Centralization Tools

Tandem provides a number of tools to help your staff automate and centralize tasks. Table 12-1 summarizes the automation and centralization tools. For detailed descriptions of these tools, refer to Section 14, Operations Management Tools.

Table 12-1. Automation and Centralization Tools

Tool                                                                        Automate Tasks   Centralize Tasks
Command files                                                               X
Distributed Name Service (DNS)                                              X                X
Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)     X                X
Distributed Systems Management/Software Configuration Manager (DSM/SCM)                      X
Event Management Service (EMS)                                              X                X
File Utility Program (FUP)                                                                   X
Measure                                                                                      X
NetBatch and NetBatch-Plus                                                  X                X
Network Statistics Extended (NSX)                                                            X
NonStop Transaction Manager/MP (TM/MP)                                                       X
NonStop Virtual Hometerm Subsystem (VHS)                                    X
Object Monitoring Facility (OMF)                                            X
Open Notification Service (ONS)                                                              X
Simple Network Management Protocol (SNMP)                                                    X
Subsystem Programmatic Interface (SPI)                                      X                X
Tandem Advanced Command Language (TACL)                                     X                X
Tandem Failure Data System (TFDS)                                           X
TSM EMS Event Viewer                                                        X                X

Check List

The following check list summarizes the main points of this section:

1. Commit resources to system automation and centralization. Determine staffing needs.

2. Determine which tasks should be automated and centralized.

3. Determine which tools to use. Tandem offers these tools:
   - Command files
   - Distributed Name Service (DNS)
   - Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
   - Distributed Systems Management/Software Configuration Manager (DSM/SCM)
   - Event Management Service (EMS)
   - File Utility Program (FUP)
   - Measure
   - NetBatch and NetBatch-Plus
   - Network Statistics Extended (NSX)
   - NonStop Transaction Manager/MP (TM/MP)
   - NonStop Virtual Hometerm Subsystem (VHS)
   - Object Monitoring Facility (OMF)
   - Open Notification Service (ONS)
   - Remote Operations Facility (ROF)
   - Simple Network Management Protocol (SNMP)
   - Subsystem Programmatic Interface (SPI)
   - Tandem Advanced Command Language (TACL)
   - Tandem Failure Data System (TFDS)
   - TSM EMS Event Viewer

4. Develop and test procedures.

5. Develop problem recovery procedures as appropriate.

6. Document procedures and the location of required files. Make sure that files have meaningful names.

7. Teach your staff how to use the procedures.


13 Operations Management and Continuous Improvement

Overview

An operations environment, even one that is performing well, should never remain static. In business, change is vital. Changes in market conditions, technology, business goals, and competition can affect how you manage your operations environment. Successful operations organizations continuously improve the capabilities and efficiency of their operations management processes and tools to adapt to these changes.

This section provides guidelines for implementing an operations-management improvement program to help you improve the capabilities and efficiency of your operations processes. The section ends with a check list summarizing the main points of this section.

Why Improve Your Operations Environment?

In almost every industry, businesses have to deal with change. To accommodate these changes, you might be required to modify or improve the processes and tools in your operations organization. The following list illustrates some common changes that might affect your operations management processes:

- Operations or support groups are no longer meeting their service-level agreements.
- Competition or customer requirements change the service-level agreements. For example, you might be required to provide continuous application availability.
- New applications and functions result in increased processing demand.
- Existing technology becomes obsolete and must be replaced with new technology.
- Existing systems must be integrated with new systems, workstations, or networks.
- Existing software or hardware must be replaced to reduce costs.
- Product fixes or upgrades become available.

Figure 13-1 shows that over 40 percent of system outages are caused by process and procedure problems. By improving operations processes, you can minimize system outages caused by inadequate or nonexistent operations processes and procedures.

Figure 13-1. Causes of System Outages (chart labels: Processes and Procedures, 40%; Install, Upgrades, Moves; Configuration)

Implementing an Operations-Management Improvement Program

Improving your operations environment is more than just selecting tools and products. You must consider the relationship among all the tools; the processes used; the required tasks; and the skill, training, and motivation of the people involved. Implementing an operations-management improvement program provides a systematic approach for controlling, measuring, and improving the processes and tools in an operations environment.

To implement an operations-management improvement program, you should follow six steps:

1. Assess the current status of your operations management processes (if they exist).
2. Develop a vision of the processes you want to establish.
3. Develop an action list.
4. Schedule the required actions and commit the necessary resources.
5. Execute the plan.
6. Assess the improvement program and start over at Step 1. (This is a continuous process.)

The improvement program will be most successful if:

- The improvement goals are aligned with the service-level agreements of your organization.
- It is planned, staffed, and approved by senior management. Assign a sufficient priority to the project so that adequate resources will be assigned and significant actions will take place.
- The entire operations staff is involved.
- Improvements are made in small, tested steps.
- The operations staff is trained to operate the new operations processes.

Using the Maturity Framework

When implementing an operations-management improvement program, you need to have a clear picture of your improvement goals as well as a way to gauge the progress made during the improvement program. Using a maturity framework can help provide perspective and guide the direction of your improvement program.

The maturity framework breaks down an operations environment into five maturity levels, with each level representing issues that many operations organizations face at various stages in process improvement. The maturity framework can help you determine:

- The maturity level of your current operations processes
- The maturity level you want to achieve
- Areas where improvements will be most fruitful

Figure 13-2 illustrates how your organization must mature and become more effective as the demands of your customers increase and operations problems intensify.

[Figure 13-2. Operations-Management Improvement Framework. Axes: Customer Demands (low to high) and Problem Demands (low to high). Level 1: Little Control; Ad Hoc. Level 2: Basic Management Control. Level 3: Processes Defined and Managed; Automation Possible. Level 4: Processes Measured and Analyzed; Ability to Meet Goals. Level 5: Continuous Improvement With Minimal Risk; Ideal Environment.]

Table 13-1 summarizes each of the five levels in the maturity framework.

Table 13-1. The Maturity Framework

Maturity Level: Characteristics

Level 1: The operations environment is driven from crisis to crisis by unplanned priorities and unmanaged change. Operators perform tasks in an ad hoc fashion. Tools are not well integrated with the process, and operators use the tools informally to solve problems. No procedures are defined or documented. The same problems keep repeating. Change control is nonexistent; schedules are arbitrary. Senior management has little understanding of the problems and issues.

Level 2: The staff has some experience with the management and control of the tools and technology. Operators have developed rules of thumb to solve simple problems. Some routine tasks are documented in runbooks. Because operators can perform some routine tasks consistently, they are freed to solve more complex problems.

Level 3: Operations processes are defined, and the staff has learned how to manage them. The staff can now examine its processes and tools in depth and decide how to improve them. Processes are defined and formally documented. When the operations staff is faced with a crisis, it continues to use the procedures that have been defined. Because the staff now understands how problems occur and how to recover from them, it can safely introduce automated operations software to perform problem management.

Level 4: The staff measures the efficiency of its operations management processes to test how well it is meeting its service-level objectives. Comprehensive process measurements and analysis techniques are used. This is the level at which the most significant quality improvements begin. The staff analyzes the way it handles and solves problems. If there are deficiencies in the current procedures, the staff improves them. For example, the staff might examine the efficiency of automation and determine how to improve it to better meet its service-level objective for availability.

Table 13-1. The Maturity Framework (continued)

Level 5: The staff continuously measures, analyzes, and improves its operations management processes to optimize productivity and minimize the risk of down time. The operations staff can plan for and incorporate new procedures and technologies with little risk, because it has established methods for managing and improving processes. Managers understand where help is needed and how best to provide the operations staff with the support it requires. The staff understands its work performance, learns from its experiences, and uses this knowledge to make improvements.

Step 1: Assessing Your Environment

Assess the current status of your operations environment, processes, and tools. This step is critical. If you introduce changes without having a clear view of the strengths and weaknesses of your current environment, you could create unanticipated problems. For example, installing an automated operations product too early in the improvement program could degrade problem recovery instead of improving it.

Assessing your environment will:

- Show you how your organization actually works.
- Identify major problems and areas where improvement actions will be most productive.
- Establish a clear picture of your improvement goals as well as guide the direction of your improvement program.
- Establish a way to gauge the progress of your improvement program.

Using the maturity framework, described earlier in this section, can help you assess the current status of your operations management processes.

Step 2: Developing a Vision

Develop a vision of the operations management processes you want to have in place by establishing goals and objectives that fulfill the service-level agreements of your organization. For example, some of your goals might be to:

- Improve the quality of end-user services. You may have to lower the number of application outages and reduce recovery time to improve the availability of the applications.
- Reduce cost of operations and improve operator productivity. You might have to replace current technology with more state-of-the-art tools.

- Automate recovery tasks currently performed by the operators for routine (recurring) problems.
- Automate performance monitoring.
- Document all major system components and their configurations, and define the actions to be taken when problems occur.

Step 3: Developing an Action List

After envisioning the improvements you want to make, analyze the relationships among the tasks, decide which tasks are most important, determine the sequence in which to implement them, and create an action list.

The order in which the actions on the list are carried out is important. For example, if you introduce automation before you have implemented tools and procedures for managing your system and application messages, you may not realize the full benefit of the automated operator.

Step 4: Scheduling and Committing Resources

To ensure that the actions you defined are implemented and that the improvement program progresses, you must:

- Adequately staff the project. For example, you could assign one or two senior-level support personnel to dedicate 50 or 75 percent of their time to the project. You might assign a senior-level operator to assess new tools and processes and help install and configure those tools.
- Create a project schedule. The project schedule lists each task to be performed and the time allotted to accomplish each task.

Step 5: Executing the Plan

After creating a project schedule, start executing the tasks according to that schedule.

Step 6: Assessing the Improvement Program

The improvement program must be measured and evaluated on a continuous basis. Because both the technology and the business challenges your company faces will always change, you must meet these challenges with further improvements. Periodically, you want to:

- Determine your new maturity level and decide where to go from there. It is important to stabilize your operations environment and to go from one maturity level to the next without trying to skip a level. The improvement program is more likely to succeed if you proceed in small, sequential steps.
- Propose further improvements.
- Evaluate the improvement program itself by asking yourself the following questions:

Are you achieving your goals?
At regular intervals during your improvement program, reexamine your original goals. Are you still working towards achieving those goals, or have you deviated from them? Before continuing with your improvements, you might have to adjust your improvement program.

Are the goals still aligned with the service-level agreements of your organization?
When you examine your goals, also examine whether the goals continue to support the service-level agreements of your organization. If, for example, the requirements change for how quickly problems must be resolved before they are escalated to a higher level of support, you might have to change your action list.

- Start over at Step 1. Improving your operations environment is a continuous process.

Case Study

The following case study shows how an operations organization can implement an operations-management improvement program to help improve operations processes.

User Profile

North American Company (NAC) was growing rapidly, expanding its operations in Europe and Asia. Its worldwide users needed to have business services available almost continuously. Table 13-2 shows the characteristics of the company's production environment. The environment included over 10,000 objects (such as processors, disks, files, processes, communication lines, subdevices, and terminals).

Table 13-2. Case Study: NAC's Production Profile

Hardware:
- Twelve S-series systems
- 135 PCs
- 750 terminals

Software:
- OLTP applications based on NonStop TS/MP and NonStop TM/MP
- Batch processing controlled by NetBatch
- Enscribe and NonStop SQL databases
- Application and system messages directed to HOMETERM hard-copy consoles

System Activity:
- 240,000 OLTP transactions per day
- 500 batch jobs per day
- 65 percent processor utilization
- Transaction rate growing at 5 percent per month
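The growth figure in the production profile compounds quickly. As a rough illustration only, and assuming the 5 percent monthly rate stays constant (the case study does not say that it does), the daily transaction volume after n months is 240,000 multiplied by 1.05 to the power n, which doubles in a little over 14 months. The function names below are illustrative and are not part of any Tandem product.

```python
import math


def projected_daily_transactions(current, monthly_growth, months):
    """Project the daily volume after `months` of constant compound growth."""
    return current * (1.0 + monthly_growth) ** months


def months_to_double(monthly_growth):
    """Months needed for the volume to double at a constant monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_growth)


if __name__ == "__main__":
    base = 240_000   # OLTP transactions per day, from the production profile
    rate = 0.05      # 5 percent per month, assumed constant for this sketch
    for months in (6, 12, 24):
        print(months, round(projected_daily_transactions(base, rate, months)))
    print("months to double:", round(months_to_double(rate), 1))
```

A projection like this is one way to judge how long the 65 percent processor utilization in the profile leaves room for growth, although utilization need not scale linearly with transaction volume.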

Problem Scenario

The complexity of NAC's systems was growing rapidly. Managers in the MIS department had to ensure that each of the 10,000 objects was installed and configured correctly and ran efficiently. The business applications and the system generated more than 15 events (status, warning, and problem messages) per minute. However, most problems were reported by end users over the phone. Even the most experienced operators had difficulty detecting, recognizing, and recovering from problems in this complex environment. In addition, because business services were now available almost continuously, the operations group no longer had periods of down time in which to perform maintenance and installation tasks.

Implementing an Operations-Management Improvement Program

As the quality of end-user services decreased, the MIS managers recognized that it would take a serious effort to cope with these new challenges. The MIS managers decided to initiate an operations-management improvement program, assigning a team of two senior support analysts to the project. The following paragraphs describe the improvement team's step-by-step implementation of the improvement program.

Step 1: Assessing the Environment

The improvement team decided to assess their operations management processes by measuring outages, observing the working environment, and analyzing the effectiveness of their existing tools and processes. Based on their assessment, they concluded that their operations management processes were at maturity level 1. The following paragraphs summarize the improvement team's assessments.

Application outages were too frequent. The improvement team required help-desk operators to log each outage, the time of occurrence, end-user name, business services affected, and the time to repair (outage duration). After analyzing the logs, the improvement team determined that during peak hours of the day, the help desk received from 20 to 25 phone calls per hour. Each outage took between 5 and 20 minutes to resolve.

In most cases, operators did not detect problems. Generally, end users phoned in to report problems. Sometimes operators learned of a critical situation only when scores of messages started printing on hard-copy consoles. There were so many messages that the operators could not sift through them and take effective action.

All application and system messages were directed to hard-copy consoles configured as the HOMETERM device. All problem recovery was performed manually.

The hard-copy console arrangement provided inadequate support for problem detection and analysis. Because operators had trouble correlating the information on many pages of listings, they couldn't see what was going on in the system and couldn't control it.

TACL macros were used to monitor available disk space and processor processes. However, because the macros had limited functions, they had to be recoded each time there were system configuration changes. In addition, operators had to execute and analyze the macros manually. Often when a serious problem occurred, operators were unavailable to execute the macros.

Other aspects of the physical environment included an inadequate telephone system and insufficient work space, the latter making it hard to look at operator manuals and product documentation.

Step 2: Developing a Vision

The next step for the improvement team was to develop a vision of the processes they wanted to have in place. They began by outlining three general goals:

- Improve the quality of end-user services by improving application availability.
- Reduce the complexity of operator tasks.
- Lower the cost of ownership by improving operator productivity.

To accomplish these goals, the improvement team developed the following list of objectives:

- Handle all system and application messages through a standard console-management service.
- Define application-design standards to be followed by all development groups and projects. Use a management tool to convert text messages in applications that were coded before the standards were established.
- Document all major system components, their configurations, and how they deliver services. Define the procedures to be taken when problems occur.
- Automate recovery tasks for routine (recurring) problems.
- Automate object-state monitoring, resource monitoring, and performance monitoring.
- Make automatic reporting and trend analysis available on demand.
- Define change management procedures.
- Develop a model of the management process.
- Document the operations organization's structure and responsibilities.
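The assessment noted that the monitoring macros had to be recoded after every configuration change, and the objectives call for automated resource monitoring. One common way to avoid recoding is to keep the monitored objects and their limits in data instead of in the monitoring logic. The following is a minimal sketch of that idea in Python, not TACL and not any Tandem product; the volume names, the metric name, and the event text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    object_name: str   # monitored object, for example a disk volume (hypothetical names)
    metric: str        # what is measured, for example "percent_full"
    limit: float       # a reading above this value produces a critical event


def check_thresholds(readings, thresholds):
    """Compare readings against thresholds and return event strings.

    `readings` maps (object_name, metric) to the latest measured value.
    Adding or removing a monitored object means editing the threshold
    list only; this function does not change.
    """
    events = []
    for t in thresholds:
        value = readings.get((t.object_name, t.metric))
        if value is None:
            events.append(f"WARNING {t.object_name}: no reading for {t.metric}")
        elif value > t.limit:
            events.append(
                f"CRITICAL {t.object_name}: {t.metric}={value} exceeds {t.limit}"
            )
    return events


if __name__ == "__main__":
    limits = [
        Threshold("$DATA1", "percent_full", 90.0),
        Threshold("$DATA2", "percent_full", 90.0),
    ]
    latest = {("$DATA1", "percent_full"): 95.0, ("$DATA2", "percent_full"): 40.0}
    for event in check_thresholds(latest, limits):
        print(event)
```

Run on a schedule rather than by hand, a check like this also addresses the other weakness in the assessment: it does not depend on an operator being free to execute it when a problem occurs.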
Step 3: Developing an Action List

After establishing the improvements they wanted to make, the improvement team decided which tasks were most important and determined the sequence in which to implement them. The result was the following action list:

1. Manage system messages.
2. Manage application messages.

3. Improve system visibility by monitoring critical objects.
4. Introduce automated problem recovery software.
5. Improve the efficiency of automation and other management processes by implementing process statistics.

Step 4: Scheduling and Committing Resources

Once the actions were defined, the improvement team could create a schedule and recruit resources. A project of this size and complexity had to be adequately staffed and financed. The improvement team recruited two senior operators to work on the project, assess new tools and processes, and help install and configure those tools. Table 13-3 shows the schedule created for the improvement program.

Table 13-3. Case Study: Schedule for the Operations-Management Improvement Program

Activity                                              Time Allotted
Step 1: Assess environment                            2 weeks
Step 2: Develop vision                                2 weeks
Step 3: Develop action list                           1 week
Step 4: Create schedule and commit resources          2 days
Step 5: Execute plan
  Action 1: Manage system messages                    4 weeks
  Action 2: Manage application messages               4 weeks
  Action 3: Monitor critical objects                  6 weeks
  Action 4: Implement automation                      8 weeks
  Action 5: Implement process statistics              8 weeks
Step 6: Assess results                                1 week

Step 5: Executing the Plan

The following paragraphs describe how the improvement team executed their improvement program.

Action 1: Manage system messages.

At NAC, the system generated too many messages (many of them only informative events). Operators couldn't concentrate on reading these messages and selecting the important ones. To manage their system messages, the improvement team:

- Installed an operations console to replace the hard-copy console. The operations group took training classes and spent a few days getting used to the new console.
- Filtered the messages, highlighting messages that required operator attention or intervention.

- Selected the important messages for each subsystem, defined their severity, and documented the recovery steps.
- Produced a document that specified the critical events and described how operators should respond to them.
- Used the document to build a set of filters managed by the Event Management Service (EMS). The EMS filters selected only the events that were relevant to the users' environment. The filters also specified whether the events were critical.
- Used the document to create an online runbook that defined the operational procedures for each critical event.

For more information about how to manage system messages, refer to the Availability Guide for Problem Management.

Action 2: Manage application messages.

The next problem the improvement team faced was managing the application messages. To accomplish this, the improvement team:

- Established design standards specifying that applications should use EMS to generate events.
- Used NonStop Virtual Hometerm Subsystem (VHS) to convert application messages to EMS format for applications developed before EMS.
- Created a second console environment dedicated to applications. This arrangement helped operators understand the cause-and-effect relationship of problems. For example, a communication line going down might generate only one critical system message, whereas the application might generate ten critical messages.
- Required that development programmers make minor changes in the application programs to reduce the number of informational messages. This reduced the information overload that operators faced. Operators could focus on critical events instead of having to react to every event.

For more information about how to manage application messages, refer to the Availability Guide for Problem Management.

Action 3: Monitor critical objects.

At NAC, more than 10,000 objects interacted to provide end-user services. Processors, disks, printers, communication lines, processes, files, and terminals had to be fully and continuously operational. Operators could not possibly verify the health of this system manually. To meet their object-monitoring needs, the improvement team selected the Object Monitoring Facility (OMF) software to:

- Continuously monitor objects at intervals defined by the user and as short as one minute.
- Generate events compatible with EMS that can be filtered and displayed on the operator console and used by automated-operations software to recover from problems.

- Provide a high-level view of the system that operators can easily interpret. OMF can represent many thousands of objects and their states on one screen. With a quick look at this screen, operators get an immediate impression of the health of the system they have to manage.

The improvement team implemented OMF in stages, beginning with processors, followed by disks, processes, spooler objects, and finally TMF. This helped operators gain experience with one or two objects at a time.

For more information about object monitoring, refer to the Availability Guide for Problem Management.

Action 4: Implement automation.

Completing the preceding actions allowed operators to display significant events and detect critical conditions before they occurred. Now the improvement team was ready to implement an automated operator product. To accomplish this, the improvement team:

- Used the default rule set to perform problem recovery for the Pathway, Expand, and SNAX subsystems.
- Wrote customized recovery rules for their specific installation.
- Used OMF to develop and optimize new rules for objects monitored by OMF.
- Coded the automated operator so that an event is generated each time a recovery rule is executed. This helped operators know when a problem occurred and the outcome of the recovery.

For more information about implementing automation, refer to the Availability Guide for Problem Management.

Action 5: Implement process statistics.

After implementing such significant changes, the improvement team wanted to measure the results. Specifically, they wanted to review and optimize the automated recovery rules. To accomplish this, the improvement team used EMS Analyzer (EMSA) to track the efficiency of automation. They made the following observations:

- Manual recoveries increased in December after the operations console was installed. Because of the improved visibility of messages, operators could detect and fix problems that were previously unnoticed.
- After the automated operator was installed, automated recoveries began to replace manual recoveries. During the first few months after the automated operator was installed, it recovered from 50 to 80 incidents per week without operator intervention. After OMF was used to develop and optimize new rules, automated recoveries grew to 300 per week.

Figure 13-3 compares the number of problem events recovered manually with the number recovered by the automated operator during the improvement program.
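The bookkeeping behind such a comparison can be sketched briefly. The example below is an illustration in Python under stated assumptions, not the automated operator product or EMSA: the rule table, event names, and action names are invented. It mirrors two ideas from the case study: a record is produced every time a recovery rule runs, and those records are then counted per week as manual or automated.

```python
from collections import Counter

# Hypothetical rule table: problem event name -> recovery action name.
RECOVERY_RULES = {
    "LINE_DOWN": "restart_line",
    "SERVER_STOPPED": "restart_server",
}


def handle_event(week, event_name, log):
    """Record how a problem event was recovered.

    If a rule exists, the recovery counts as automated; otherwise the
    event is left for an operator and counts as manual. Either way an
    entry is appended to `log`, so every recovery leaves a trace.
    """
    action = RECOVERY_RULES.get(event_name)
    mode = "automated" if action is not None else "manual"
    log.append((week, mode, event_name, action))


def recoveries_per_week(log):
    """Count log entries by (week, mode)."""
    counts = Counter()
    for week, mode, _event, _action in log:
        counts[(week, mode)] += 1
    return counts


if __name__ == "__main__":
    log = []
    handle_event(1, "LINE_DOWN", log)
    handle_event(1, "DISK_FULL", log)        # no rule yet: manual recovery
    handle_event(2, "SERVER_STOPPED", log)
    for key, count in sorted(recoveries_per_week(log).items()):
        print(key, count)
```

Events that repeatedly fall through to manual recovery are natural candidates for new rules, which is the kind of review the improvement team used the statistics for.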

[Figure 13-3. Case Study: Manual Recoveries Versus Automated Recoveries. Horizontal axis: months, Nov through Jul. Series: Manual, Automated.]

Step 6: Assessing the Improvements

After completing their improvement program, the improvement team assessed their operations management processes and concluded that they were now at maturity level 3. The following paragraphs summarize the improvement team's evaluations.

After many processes had been automated, the number of calls to the help desk fell from about 25 per hour to 10 per hour.

Because many low-level, repetitive tasks were eliminated, the operators became more productive. As a result, MIS managers could reduce the operations staff. They transferred one operator to the support group and another to the quality-assurance group. These transfers lowered the cost of operations personnel and were positive career opportunities for the two operators.

Because the improvement team had instituted some process-measurement procedures, the operations group was now moving toward maturity level 4.

Conclusion

The MIS managers recognized that both technology and the business challenges they faced would continue to change rapidly. They intended to meet these challenges by proposing further improvements. With the tools and processes now in place, the MIS department will be able to measure its operations environment before and after the proposed changes.
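A before-and-after measurement of this kind can start from the outage log the help desk kept during the assessment. The sketch below is a minimal Python illustration; the record fields follow the items the case study says were logged (time of occurrence, end-user name, business service affected, time to repair), while the class and function names are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Outage:
    hour: int                 # hour of day in which the call was logged (0-23)
    user: str                 # end-user name
    service: str              # business service affected
    minutes_to_repair: float  # outage duration


def summarize(outages):
    """Return (busiest-hour call count, mean minutes to repair)."""
    if not outages:
        return 0, 0.0
    calls_by_hour = {}
    for outage in outages:
        calls_by_hour[outage.hour] = calls_by_hour.get(outage.hour, 0) + 1
    busiest = max(calls_by_hour.values())
    return busiest, mean(o.minutes_to_repair for o in outages)


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    sample = [
        Outage(9, "user1", "orders", 5),
        Outage(9, "user2", "orders", 20),
        Outage(10, "user3", "billing", 11),
    ]
    print(summarize(sample))
    # Help-desk calls per hour before and after, as reported in Step 6.
    print(percent_reduction(25, 10))
```

With the case study's figures, the drop from about 25 calls per hour to 10 is a 60 percent reduction.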

Check List

The following check list summarizes the steps involved in implementing an operations-management improvement program:

1. Assess the current status of your operations management processes. Use the maturity framework to help you determine the maturity level of your operations environment.
2. Develop a vision of the operations management processes you want to have in place by establishing goals and objectives.
3. Develop an action list of tasks and the sequence in which to implement them.
4. Commit the resources to carry out the tasks, and create a project schedule to accomplish each task.
5. Execute the tasks according to your project schedule.
6. Assess the results of the improvement program. Determine your new maturity level, and decide where to go from there.
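Steps 1 and 6 of the check list both turn on the maturity level, and this section advises moving one level at a time. As a small illustration, the five levels can be modeled as an ordered enumeration whose next target is always the adjacent level. The member names below are shorthand invented for this sketch, loosely based on the level descriptions in this section.

```python
from enum import IntEnum


class Maturity(IntEnum):
    AD_HOC = 1                  # little control
    BASIC_CONTROL = 2           # basic management control
    DEFINED = 3                 # processes defined and managed
    MEASURED = 4                # processes measured and analyzed
    CONTINUOUS_IMPROVEMENT = 5  # continuous improvement with minimal risk


def next_target(current):
    """Return the next level to aim for, never skipping a level."""
    if current == Maturity.CONTINUOUS_IMPROVEMENT:
        return current
    return Maturity(current + 1)


if __name__ == "__main__":
    level = Maturity.AD_HOC
    while level != Maturity.CONTINUOUS_IMPROVEMENT:
        target = next_target(level)
        print(f"{level.name} -> {target.name}")
        level = target
```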

14 Operations Management Tools

Overview

Tandem provides a wide variety of tools that help your staff perform operations tasks. These tools allow you to perform the following operations functions:

- Production management
- Problem management
- Change and configuration management
- Performance management
- Security management
- Contingency planning
- Application management
- Automating and centralizing operations

This section provides information to help you determine which of the operations management tools would best help you perform your operations management tasks. Because of the number of Tandem tools available for operations management, and because some of the tools are cross-functional (that is, they can be used for multiple operations management tasks), these tools are listed alphabetically in this section. Table 14-1 lists the tools described in this section and the operations tasks for which they are appropriate.

Table 14-1. Operations Management Tools

Columns: Production Management, Problem Management, Change and Configuration Management, Performance Management, Security Management, Contingency Planning, Application Management, Automating and Centralizing

Tool
$CMON X
Command Files X
CONFEXT File X
DAL Server X
DNS X X
DSAP X X
DSM/NOW X X X X
DSM/SCM X X
Enform X X
EMS X X X
EMSA X X X

Table 14-1. Operations Management Tools (continued)

Columns: Production Management, Problem Management, Change and Configuration Management, Performance Management, Security Management, Contingency Planning, Application Management, Automating and Centralizing

Tool
FUP
Flow Map
GPA
Measure X X X
NetBatch/NetBatch-Plus X X X X
NSX X X X
NonStop Access for Networking X X X X
NonStop ODBC Server X
NonStop SQL/MP X
NonStop SQL/MP SQLCI X
NonStop TM/MP X X X
NonStop TM/MP TMFCOM/TMFSERVE X
NonStop TS/MP X X
NonStop VHS X X X X X X
X

Table 14-1. Operations Management Tools (continued)

Columns: Production Management, Problem Management, Change and Configuration Management, Performance Management, Security Management, Contingency Planning, Application Management, Automating and Centralizing

Tool
NSKCOM X X
OMF X X X X
ONS X X
PATHCOM X
POET X
PEEK X
RSC X
Safeguard X
SeeView X
SCF/SCP X X X X
SNMP X
SPI X X
TACL X X
Tandem Reload Analyzer X
Tandem Service Management (TSM) X X X
TCM and MeasTCM X
TFDS X X
TPDC X
Transfer X X X X X
TSM EMS Event Viewer X X X
ViewSys X X X

$CMON

$CMON is a user-written program that monitors some command-interpreter activities. You can use $CMON to secure your system by auditing and restricting attempts to:

- Log on and log off
- Run a program
- Alter the priority of a process
- Add users to the system or delete users from the system
- Change a user's logon password and remote passwords

The International Tandem Users Group (ITUG) can supply you with a sample copy of $CMON.

Command Files

A command file is a file that contains a series of commands. When the file is executed, the commands within the file are automatically executed. Command files are supported by TACL and many subsystems and interactive interfaces (such as FUP and PATHCOM).

Command files are most useful for automating tasks that:

- Require many commands and few decisions. For example, if operators must start 100 terminals every time they start an application, you can create a command file that contains all the commands for starting the terminals and the application. When it's time to start the application, the operators can execute the command file instead of entering all the commands themselves.
- Are repetitive. For example, if operators must frequently check NonStop TM/MP status, you can create a command file that contains all the appropriate NonStop TM/MP status commands. With this command file, the operators check NonStop TM/MP status with one command instead of several.
- Can create serious problems if not executed properly. For example, system shutdown and system startup can be complex tasks requiring many commands. If one of the commands is entered incorrectly, a serious error can result. By creating command files for system shutdown and startup, operators no longer have to worry about incorrectly entering the commands; they simply execute the command file. Before creating this type of command file, carefully plan all the steps involved in the task. After creating the files, test them on a development system.

Table 14-2 lists typical command files and their functions. Use this table to help you determine which tasks to automate in command files. Many Tandem manuals explain how to create command files and provide examples of the files.

Table 14-2. Examples of Typical Command Files

File        Function
TACLSTRT    Brings up command interpreters on all terminals
TACLSTOP    Stops all command interpreters
SYSLOAD     Loads the system
MAILCOLD    Starts the mail system
MAILSTOP    Stops the mail system
MAILWARM    Warm starts the mail system
NETDOWN     Brings down network line handlers
NETUP       Brings up network line handlers
NEWOPLOG    Switches operator message logging to a new log file
PRIMARY     Switches primary paths for devices and sets the cache configuration for disk volumes
SHUTDOWN    Stops all processing on the system
SPLCOLD     Cold starts the spooler and purges all jobs in the current collector
SPLCONF     Used by SPLCOLD to configure the spooler
SPLSTOP     Drains the spooler
SPLWARM     Warm starts the spooler
STRTCMON    Starts the command-interpreter process ($CMON)
STRTMON     Starts the INSPECT monitor process ($IMON)
WARMSTRT    Used to start various subsystems after a system load

Data Access Language (DAL) Server

DAL Server provides access for Macintosh clients to the Tandem NonStop systems and NonStop SQL/MP for low-volume OLTP. DAL Server:

- Supports applications and tools running in a Macintosh environment
- Simplifies access to multiple hosts and databases
- Allows you to query NonStop system databases; capture information for spreadsheets, databases, word processors, and other programs; and upload new information to the Tandem system

Disk Space Analysis Program (DSAP)

The DSAP utility helps you to analyze disk space usage. DSAP tells you how many free-space pages, allocated pages, deallocatable extent pages, and unused pages are contained on the disk.

You can specify any one of a number of reports as the output of the DSAP utility. Each report analyzes the disk in a different way, for example:

- The Subvol Summary report analyzes the space usage for each subvolume on a disk.
- The User Summary report analyzes the space usage for each user who owns files on the disk.
- The User Detail report lists the file name and space usage for each file on the disk.
- The Summary of Space report provides a high-level summary of how space is used on the disk.

Distributed Name Service (DNS)

DNS is a subsystem that manages a distributed database of names. DNS provides an automated method of keeping track of the names and interrelationships of network components (such as systems, devices, and applications). It allows the configuration management staff to store in a database the names (and aliases) of system and network objects, facts about their relationships, and instructions for replicating name definitions on remote nodes. The staff updates the database as needed to reflect configuration changes. The database is online and can be accessed by operators and other applications.

DNS allows you to:

- Assign one or more names (or aliases) to a component. If two operators know a terminal by two different names, you can assign both names to the terminal so that each operator can easily access the information.
- Write management applications that access the DNS database.
- Control a database centrally or locally. You can distribute all or parts of the DNS database to any node in a network while maintaining centralized control of its contents. When you update the database centrally, the updates are distributed to the remote databases automatically. If you want to control DNS databases locally, you can give each node control of its own names and have each node forward its name definitions to a central control system.
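The alias idea in the first item can be pictured with a small data structure. The following Python sketch is a conceptual illustration only; it does not use or reproduce the DNS subsystem's interfaces, and the class, method, and component names are hypothetical.

```python
class NameDirectory:
    """Toy name directory in which many aliases can refer to one component."""

    def __init__(self):
        self._components = {}  # canonical name -> facts about the component
        self._aliases = {}     # alias or canonical name -> canonical name

    def add_component(self, name, facts):
        """Register a component under its canonical name."""
        self._components[name] = facts
        self._aliases[name] = name

    def add_alias(self, alias, name):
        """Let `alias` refer to an already registered component."""
        if name not in self._components:
            raise KeyError(f"unknown component: {name}")
        self._aliases[alias] = name

    def lookup(self, any_name):
        """Resolve any known name to (canonical name, facts)."""
        canonical = self._aliases[any_name]
        return canonical, self._components[canonical]


if __name__ == "__main__":
    directory = NameDirectory()
    directory.add_component("TERM-A", {"type": "terminal"})
    directory.add_alias("front-desk", "TERM-A")  # name used by one operator
    directory.add_alias("t17", "TERM-A")         # name used by another
    print(directory.lookup("front-desk"))
    print(directory.lookup("t17"))
```

Both operators reach the same record whichever name they use, which is the behavior the alias feature is meant to provide.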

Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)

DSM/NonStop Operations for Windows (DSM/NOW) is a Microsoft Windows client/server operations console environment for NonStop systems. DSM/NOW increases the effectiveness of operations and system management by:

- Allowing you to run multiple management applications such as SCF from a single workstation.
- Normalizing and simplifying commands by mapping varieties of subsystem commands to single command buttons. For example, all subsystem commands for starting or bringing up objects are mapped to a single Start button. When you select an object and click the Start button, the DSM/NOW Integrated Command and Control (ICC) application sends the appropriate command to the selected object (for example, SCF START LINE to an Expand line). You can customize the button-object-command mapping and add command buttons for existing or new object classes.
- Simplifying event-message browsing and management with the Multi Event Viewer, which lets you review Tandem host Event Management Service (EMS) event logs, configure filters for online and historical displays, and take appropriate actions through the integration of event messages with the ICC application.
- Defining console profiles for operators and system managers. You can define and manage sets of applications and console screen arrangements to meet the needs of individual console users. When a user logs on, the environment for that user is automatically launched, and a list of configured applications is displayed.
- Utilizing the Microsoft Windows intuitive interface. All applications and utilities run in the Windows environment so you can switch from one to another with the click of a mouse button.
You can use the Windows copy and paste functions to copy an event message from the TSM EMS Event Viewer, for example, and paste it into the ICC application; then use the ICC Find function to identify the object that is the subject of the event.
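The button-object-command mapping described for DSM/NOW can be thought of as a lookup table keyed by button and object class. The following Python sketch is only a way to picture that idea; it does not reflect how the ICC application is implemented. Only the Start/LINE pairing comes from the text above. The Stop entry, the function name, the command-string layout, and the line name are assumptions made for illustration.

```python
# (button, object class) -> command template.
# The Start/LINE entry mirrors the "SCF START LINE" example in the text;
# the Stop/LINE entry is a hypothetical customization.
COMMAND_MAP = {
    ("Start", "LINE"): "SCF START LINE {name}",
    ("Stop", "LINE"): "SCF STOP LINE {name}",
}


def command_for(button, object_class, name, mapping=COMMAND_MAP):
    """Return the command text that a button click would send for an object."""
    try:
        template = mapping[(button, object_class)]
    except KeyError:
        raise LookupError(
            f"no command mapped for {button!r} on {object_class!r}"
        ) from None
    return template.format(name=name)


# "$LINE1" is a made-up object name.
start_command = command_for("Start", "LINE", "$LINE1")
```

Adding a button for a new object class then amounts to adding one more entry to the table.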

Distributed Systems Management/Software Configuration Manager (DSM/SCM)

DSM/SCM is a tool for the centralized planning, management, and installation of software on distributed (target) Tandem NonStop systems. DSM/SCM running on a Tandem central (host) system receives, archives, configures, and packages software for target sites. DSM/SCM running on each target system loads the software received from the central site.

Major features and capabilities of DSM/SCM include:

- Centralized control. You can manage software on multiple remote target systems from a central site.
- Graphical user interface. You access the functions of DSM/SCM through a PC running a Microsoft Windows graphical user interface.
- Multiple logical targets on a single physical target system. The ability to define multiple logical targets allows a site to run multiple software configurations on a single system; for example, a production configuration and a test configuration. The system can be switched from one configuration to another by performing a system load from alternate system disks.
- Management of customer and Alliance partner software and updates, in addition to Tandem software releases and updates (IPMs).
- Flexibility in planning new software configurations. You can easily select specific products from software inputs to create a new software configuration based on a previous configuration, in which changed products are replaced by the new versions.
- A batch scheduling mechanism that enables you to schedule DSM/SCM events such as receiving software, generating reports, and building configurations to run as batch processes at a specified time.
- Extensive reporting. DSM/SCM produces a variety of preformatted and user-specified SQLCI reports that provide detailed information about the software configurations in the DSM/SCM environment.
- Predefined operator instructions. DSM/SCM provides default operator instructions with every software configuration placed on target systems. You can edit these instructions with site-specific information.
- Softdoc and release document browsing. The graphical user interface enables you to scan through softdocs. For example, you can find the dependencies for a given IPM. You can also scan release documents to view new product highlights and installation information.
- Minimal down time. Wherever possible, new software configurations can be activated with minimal interruption of running applications.
- System profiles, which provide defaults for options and information required by DSM/SCM tasks. These defaults greatly reduce the amount of information you must supply when performing tasks through DSM/SCM.

Enform

Enform is a query language service that generates reports. You can use Enform to generate reports from measurement data, including data collected by Measure.

Event Management Service (EMS)

EMS collects and consolidates event information generated by software subsystems and routes this information through the network. An event is any normal or abnormal change in the status of a device, line, or system on the network. For example, events occur when a job completes, an application stops or starts, or a transaction aborts. Examples of subsystems that generate events are Expand, NonStop TS/MP, and NonStop TM/MP.

You can route events to:

- Other systems in the network.
- DSM applications such as the TSM EMS Event Viewer.
- Management applications that respond to the event as required (and thus eliminate the need for operator intervention in response to events).

EMS allows you to:

- Collect and view events from one or more locations
- Print event messages automatically
- Log events to disk file automatically

Event Management Service Analyzer (EMSA)

EMSA is a system-analysis tool that summarizes system activities, giving operators a clear picture of conditions. EMSA analyzes the contents of EMS event logs by selecting certain event messages according to user-defined search parameters. You can base the search criteria on start time, stop time, event type, event number, subsystem ID, process ID, system, logical device, and text string. This information can be displayed on terminals, routed to a printer, or used to generate reports.

File Utility Program (FUP)

FUP is a file-management tool designed to help you manage disk files, nondisk devices (printers, terminals, and tape drives), and processes on a NonStop system.
FUP software supports Enscribe disk files (key-sequenced, entry-sequenced, relative, and unstructured including text files) and provides information about OSS files and Tandem NonStop SQL/MP files (tables, indexes, partitions, views, and object programs).
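As a rough illustration of the kind of selection that EMSA performs on event logs (described earlier in this section), the following Python sketch filters a list of event records by a few of the search criteria named there: start time, stop time, subsystem ID, event number, and text string. It is not EMSA and does not read EMS log files. The record layout, the function name, and the sample events (including their event numbers) are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified event record."""
    time: datetime
    subsystem_id: str
    event_number: int
    text: str


def select_events(events, start=None, stop=None, subsystem_id=None,
                  event_number=None, text=None):
    """Return the events that match every criterion that is not None."""
    selected = []
    for event in events:
        if start is not None and event.time < start:
            continue
        if stop is not None and event.time > stop:
            continue
        if subsystem_id is not None and event.subsystem_id != subsystem_id:
            continue
        if event_number is not None and event.event_number != event_number:
            continue
        if text is not None and text.lower() not in event.text.lower():
            continue
        selected.append(event)
    return selected


# Made-up sample log.
log = [
    Event(datetime(1996, 12, 1, 8, 0), "EXPAND", 101, "Line down"),
    Event(datetime(1996, 12, 1, 9, 30), "TMF", 202, "Transaction aborted"),
    Event(datetime(1996, 12, 1, 10, 15), "EXPAND", 102, "Line up"),
]
matches = select_events(log, start=datetime(1996, 12, 1, 9, 0),
                        subsystem_id="EXPAND")
```

Combining criteria narrows the selection: here only the EXPAND event logged after 09:00 is kept.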

Flow Map

Flow Map is an application-process flow-diagram generator. Flow Map analyzes data collected by TPDC and Measure and creates a Microsoft Excel-based graphical representation of the applications running on the system and their performance on the system. You can create Flow Map diagrams to:

- Depict the application's processes, files, and the connections between them.
- Monitor the actual flow of message traffic within an application. This can help you understand the performance of applications or uncover application bottlenecks.
- Show specific detail about the application, or display a portion of the application.
- View Measure statistics about entities.

Guardian Performance Analyzer (GPA)

GPA is a system performance-analysis tool that analyzes system performance data collected by Measure. GPA provides tuning recommendations for improving the overall performance of the system. You can use GPA to analyze a running system automatically to determine if the system is balanced and to observe the predicted effects of a processor failure. GPA can provide the following information:

- Steps you can take to rebalance the system
- Processes that are running without backup
- Processors that do not have adequate capacity to handle the increased load caused by a failed processor

Using GPA is an iterative process. You run GPA, implement the recommendations that you choose, then run GPA again to determine the performance improvement and new recommendations. This process is repeated as frequently as needed for your particular system and requirements. For example, when predicting the effects of a processor failure, you run GPA several times, following its recommendations until you are convinced load balancing cannot be improved, then determine if enough processor capacity exists to handle failure conditions. If there is not enough capacity, you can reduce the load by moving noncritical jobs to off-peak periods (such as nights or weekends).
If this is not possible, consider purchasing additional processors.
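The capacity question at the end of the GPA discussion can be made concrete with a little arithmetic. The Python sketch below is not GPA's analysis; it assumes a deliberately simple model in which each processor's load is a busy percentage and, when a processor fails, known portions of its load move to specific peers. The function name, the 100 percent capacity limit, and all the sample figures are assumptions for illustration.

```python
def failover_check(busy_by_cpu, takeover_by_cpu, capacity=100.0):
    """Report, for each possible processor failure, which peers would be overloaded.

    busy_by_cpu:     processor number -> current busy percent
    takeover_by_cpu: failed processor -> {peer: busy percent that moves to the peer}
    Returns:         failed processor -> sorted list of peers pushed past capacity
    """
    overloaded = {}
    for failed, moves in takeover_by_cpu.items():
        over = [peer for peer, extra in moves.items()
                if busy_by_cpu[peer] + extra > capacity]
        overloaded[failed] = sorted(over)
    return overloaded


# Made-up figures for a three-processor system.
busy = {0: 55.0, 1: 70.0, 2: 40.0}
takeover = {
    0: {1: 35.0, 2: 20.0},
    1: {0: 30.0, 2: 40.0},
    2: {0: 25.0, 1: 15.0},
}
result = failover_check(busy, takeover)
```

In this sample, only the failure of processor 0 pushes a peer (processor 1, at 105 percent) past capacity, which is the situation in which you would move noncritical jobs to off-peak periods or add processors.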

Measure

Measure is a performance-measurement tool that lets technical specialists or operators collect and examine statistics for a system. It gives specialists or operators immediate, online access to performance statistics for key system and network components, including complex business applications. Specialists or operators can optimize online transaction-processing applications by using the statistics gathered by Measure.

Measure collects data about a set of resources and workloads called entities. These entities include:

- Processors
- Disk drives
- Files
- Processes
- NonStop TM/MP transactions
- SQL/MP entities
- Communication lines
- Terminals
- Remote systems

For each entity type that you select for measurement, Measure gathers information by a set of predetermined counters. For example, if you select a processor entity, the counters collect such information as the processor busy time, the interrupt busy time, the number of pages swapped, and so forth. You can supplement the predetermined counters with your own counters. User-defined counters can be used to specify new performance indicators for your applications. For example, you can set counters to measure user transactions, the length of a request queue for a particular process, or the amount of time spent by a program in performing certain functions.

In addition, Measure allows you to:

- Monitor the system continuously. Measure can handle up to 64 concurrent measurements started by multiple users. This continuous monitoring capability also makes it practical to collect performance statistics on an ongoing basis for capacity planning purposes.
- Monitor the network continuously. Each network node can collect information to be used by Measure. For example, you can track both the data traffic going into a node and the data traffic going out of a node.
- Produce reports with Enform.
- Create management applications for performance measurement.
You can implement customized performance-measurement tools based on Measure collection capabilities, or you can use the Measure programmable interface.
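To picture what user-defined counters add, the following Python sketch keeps two kinds of application-level indicators of the sort mentioned above: a count of user transactions and the time a program spends in a particular function. It is only an analogy; it does not use or imitate the Measure programmable interface, and the class names, counter names, and sample requests are hypothetical.

```python
import time
from collections import defaultdict


class UserCounters:
    """Hypothetical application-side counters; not the Measure interface."""

    def __init__(self):
        self.counts = defaultdict(int)          # how many times something happened
        self.busy_seconds = defaultdict(float)  # accumulated elapsed time

    def bump(self, name, amount=1):
        """Increment an event counter."""
        self.counts[name] += amount

    def timed(self, name):
        """Return a context manager that adds elapsed time to a named counter."""
        return _Timer(self, name)


class _Timer:
    def __init__(self, counters, name):
        self._counters = counters
        self._name = name

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self._start
        self._counters.busy_seconds[self._name] += elapsed
        return False  # do not suppress exceptions


counters = UserCounters()
for request in ("debit", "credit"):
    counters.bump("user-transactions")
    with counters.timed("validate-request"):
        request.upper()  # stand-in for real work
```

After the loop, the transaction counter holds 2 and the timing counter holds the total time spent in the stand-in validation step.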

NetBatch and NetBatch-Plus

NetBatch schedules and controls batch jobs as follows:

- The NetBatch scheduler automatically executes and monitors jobs, based on the specified parameters. Operators can specify the times jobs should run, submit the jobs, and then let the scheduler start the jobs at the right time and send the output to the correct location.
- NetBatch jobs can be run anywhere in an Expand network. You can centralize batch operations by having reports sent to one location. You can also control schedulers on different nodes of the network.

In addition, you can use NetBatch with the Event Management Service (EMS) to automatically respond to events. For example, if a report must be generated every time a specific event occurs, application developers can design the application so that a NetBatch job is started each time the event occurs, thus automating the report-generation process. Another method of automating the task is to use EMS to route specific events to the command interpreter and have the command interpreter automatically issue NetBatch commands.

NetBatch-Plus is a screen-driven interface for NetBatch. With NetBatch-Plus, you can use menus and function keys to perform NetBatch functions.

NonStop Access for Networking

NonStop Access for Networking is a collection of network components that extends fault-tolerant computing through the LAN to the desktop. By switching paths from a primary LAN to a secondary LAN, NonStop Access for Networking guards against LAN failure in client/server topologies.

NonStop ODBC Server

NonStop ODBC Server allows users of desktop computers to access NonStop SQL/MP. NonStop ODBC allows client applications that use either the Microsoft Open Database Connectivity (ODBC) interface or the Microsoft/Sybase SQL Server interface to access databases controlled by NonStop SQL/MP.

NonStop SQL/MP

NonStop SQL/MP is a relational-database-management system that supports high-speed access and updates to distributed data. NonStop SQL/MP uses the Structured Query Language (SQL) to create relational databases and to describe and manipulate data. NonStop SQL/MP uses the Tandem NonStop Kernel file-security features. This security provides user authorization for NonStop SQL/MP tables, views, indexes, and programs. In addition, two NonStop SQL/MP features also contribute to database protection

NonStop SQL/MP SQLCI

SQLCI is the primary interface through which database administrators create and change structures to manage data. SQLCI provides SQL data description language (DDL) statements to define the database, SQL data manipulation language (DML) statements to query and modify database tables, installation commands to install NonStop SQL/MP, a set of database-management utilities, and a report-writer facility. SQL statements operate on database objects. A database object is an entity that is created, manipulated, and dropped by SQL statements. Examples of SQL objects include tables, indexes, views, constraints, and collations.

NonStop Transaction Manager/MP (NonStop TM/MP)

The NonStop Transaction Manager/MP (NonStop TM/MP) product is the Tandem system transaction manager. NonStop TM/MP is an integral part of Tandem's massively parallel approach to OLTP. NonStop TM/MP maintains database consistency during processing. The database can be distributed over multiple nodes of an Expand network, or it can be centralized to reside on a single Tandem system. In either case, NonStop TM/MP ensures that the database remains consistent in the event of a program failure, a single component failure, or the total loss of communications between nodes. Using NonStop TM/MP, you can also perform online dumps and audit dumps to a remote system, thus enabling you to maintain a copy of the database at a remote backup site.

NonStop TM/MP is the transaction manager for NonStop TUXEDO applications, PTP applications, and Pathway applications. Because applications in all the environments use the same transaction manager, transactions can span multiple environments and multiple databases (the NonStop SQL/MP relational database and the Enscribe database). Each Tandem NonStop system contains one, and only one, NonStop TM/MP subsystem, which can accommodate concurrent mixed workloads of OLTP, query, and batch applications.
For all databases on a system, NonStop TM/MP maintains a single transaction log, which greatly simplifies system management operations.

NonStop TM/MP ensures that the following requirements for high-volume transaction-processing applications are met:

- Transaction protection and concurrency control
- Database consistency through transaction protection and recovery procedures

NonStop TM/MP can monitor thousands of complex transactions sent by hundreds of users to a common database. The database can consist of tables and views created by the NonStop SQL/MP relational-database-management system, files created by the Enscribe record manager, or a combination of tables, views, and files. The database can be distributed among many disks on a system and across many nodes in an Expand network. NonStop TM/MP manages all the complex operations for concurrent transactions and database consistency, making these operations transparent to users and application programmers.

NonStop TM/MP Interfaces (TMFCOM, TMFSERVE)

NonStop TM/MP provides the following interfaces to enable you to configure, reconfigure, control, and monitor the TMF subsystem:

- TMFCOM is an interactive command interface that allows commands to be entered and responses to be received through a terminal keyboard and monitor.
- TMFSERVE provides programmatic access to the Subsystem Programmatic Interface (SPI), making it possible to construct system management programs that monitor and control the NonStop TM/MP environment.

NonStop Transaction Services/MP (NonStop TS/MP)

NonStop Transaction Services/MP (NonStop TS/MP) is a set of components that includes PATHMON, LINKMON, and PATHCOM, which, together with NonStop Transaction Manager/MP and the Tandem NonStop Kernel, form the foundation for Tandem's open transaction-processing services. These core services provide the underlying infrastructure for your transaction-processing applications and a choice of transaction-processing (TP) monitor environments.

NonStop TS/MP provides enhanced transaction-processing functions to client/server, terminal-based, and mixed-computing environments by providing process-management and link-management functions for OLTP applications on Tandem NonStop systems.

Process management includes the starting, stopping, and monitoring of processes. Process management is provided by the PATHMON process.

Link management includes the coordination of links between client and server processes so that these processes can communicate with one another. Link management is provided by the LINKMON process.
For performing management and operations tasks, NonStop TS/MP provides the PATHCOM interactive interface and a management programming interface based on the Subsystem Programmatic Interface (SPI) part of Tandem's Distributed Systems Management (DSM) facility.

NonStop TS/MP supports products and processes that run in either the Guardian operating environment or the Open System Services (OSS) operating environment.

The NonStop TUXEDO transaction-processing environment, which runs in the OSS operating environment, provides the application programming interface (API) and transaction-monitor functions of the Novell, Inc. TUXEDO transaction-processing system, in addition to the benefits of the Tandem fundamentals.

The Parallel Transaction Processing (PTP) environment, which runs in the Guardian operating environment, supports the CICS command-level API, and enables CICS applications to run on Tandem NonStop systems and to communicate with other CICS applications.

The Pathway transaction-processing environment, which also runs in the Guardian operating environment, provides a client/server model for both terminal-based and workstation-based applications, with the benefits of the Tandem fundamentals. Transactions can be initiated by workstations, terminals, intelligent devices, and general devices. The Pathway environment always includes the NonStop TS/MP core service product. It might also include other software, such as the Pathway/TS product and the Remote Server Call (RSC) product.

NonStop TS/MP PATHCOM Interface

The PATHCOM interface is an interactive interface to the NonStop TS/MP core services. Through PATHCOM, you send commands and instructions to the PATHMON process-management process. Process management includes all tasks related to the starting, monitoring, and stopping of server processes and other processes specific to a transaction-processing environment.

In a PTP environment, use PATHCOM to:

- Start, stop, restart, and display information (configuration, status, and statistics) about a PTP region
- Specify attributes for server classes and server processes, such as the Guardian security for a server class, the priority at which the server class will run, the number of server processes per server class, and the processors in which a server process will run
- Alter and delete server class definitions
- Start and stop server classes
- Control the NonStop TS/MP monitor process (PATHMON)

In a Pathway environment, use PATHCOM to:

- Configure objects controlled by the PATHMON process
- Reconfigure an existing PATHMON environment to add new terminals, screen programs, terminal control processes (TCPs), and servers as required
- Monitor a PATHMON environment to obtain statistics on the behavior of the objects controlled by PATHMON

In a NonStop TUXEDO environment, the NonStop TS/MP infrastructure is virtually transparent to the NonStop TUXEDO administrator. For example, when the administrator starts a NonStop TUXEDO application, the underlying NonStop TS/MP processes are automatically started and maintained. The administrator will notice a few added parameters in the configuration file, however, and might want to use PATHCOM to check the status of the underlying service.

NonStop Virtual Hometerm Subsystem (VHS)

VHS acts as a virtual home terminal for applications by emulating a 6530 terminal. VHS receives messages normally sent to the home terminal, such as displays and application prompts, and uses these messages to generate event messages to EMS to inform operations staff of problems. Processes that enter Debug or Inspect are handled automatically by VHS: they are stopped and then restarted to an operational state by the application monitor process. VHS emulates a terminal, but does not have the disadvantages of a single, dedicated physical terminal because you can:

- Easily access critical application messages
- Centralize message handling
- Free up physical terminals

NSKCOM

NSKCOM, a NonStop Kernel utility program, is the NonStop Kernel command interface to the Kernel-Managed Swap Facility (KMSF). NSKCOM is the primary tool for monitoring, configuring, and managing kernel-managed swap files. Kernel-managed swapping is a method for managing virtual memory using swap files controlled by the NonStop Kernel: each processor has at least one kernel-managed swap file that provides the swap space needed by all the processes running on that processor. Proper configuration and management of kernel-managed swap space is critical to the operation of your system. Using NSKCOM, you can review your kernel-managed swap configuration, check your kernel-managed swap files to ensure that they are running properly and that their usage is at an acceptable level, and alter your swap files.

Use NSKCOM to:

- Display dynamic statistics on swap files, including information on the available and reserved memory pages, the threshold at which the kernel-managed swap facility generates an event message for the file, and the peak pages ever reserved for the file.
- Display static configuration information and file attributes for swap files.
- Alter your kernel-managed swap configuration as necessary to ensure proper system performance. Use NSKCOM to resize swap files, add and delete swap files, and change the threshold at which EMS messages are generated.

Object Monitoring Facility (OMF)

OMF is a console application that monitors objects on the system automatically. OMF notifies you when thresholds are reached (files filling up, the spooler filling up, processes going away, processors too busy, disks too full, and so on). This proactive monitoring of the system can improve overall availability by preventing potential problems.
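Threshold-based notification of the kind described for OMF (and for the swap-file thresholds managed with NSKCOM) reduces to comparing current readings against configured limits. The Python sketch below shows that comparison only; it is not how OMF or KMSF is implemented, and the function name, object names, readings, and limits are all made up for illustration.

```python
def check_thresholds(readings, thresholds):
    """Return (object, value, limit) for each reading at or above its limit.

    Objects without a configured limit are ignored.
    """
    alerts = []
    for name, value in readings.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            alerts.append((name, value, limit))
    return sorted(alerts)


# Hypothetical readings and limits, all in percent.
readings = {
    "disk $DATA1 percent full": 93,
    "cpu 2 percent busy": 61,
    "spooler percent full": 88,
}
thresholds = {
    "disk $DATA1 percent full": 90,
    "cpu 2 percent busy": 85,
    "spooler percent full": 80,
}
alerts = check_thresholds(readings, thresholds)
```

Here the disk and the spooler have crossed their limits, so two alerts are produced; in practice each alert would become a notification to the operator before the resource is exhausted.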

When combined with Network Statistics Extended (NSX), OMF can provide a networkwide view of both the performance and status of objects.

Open Notification Service (ONS)

ONS is a set of processes, files, and management information bases (MIBs) that work together to enable Tandem subsystems to be monitored by network management applications that comply with the Simple Network Management Protocol (SNMP). ONS:

- Gathers EMS events generated by the following Tandem subsystems from the system event log:
  - Object Monitoring Facility (all events)
  - Transaction Monitoring Facility (all events)
- Translates event messages into the SNMP trap format
- Forwards status and problem notification information to the Tandem SNMP Agent, through which it becomes available to network management applications that comply with SNMP

Pathway Open Environment Toolkit (POET)

POET is a set of programs and utilities that helps programmers create and run client/transaction server applications. In the POET environment, the client is a program running on a workstation, and the server is a server process or program in the Pathway transaction-processing environment running on a Tandem NonStop system. POET allows for the transport of data or information (using C code) between Microsoft Windows clients and the Tandem NonStop host. POET includes a simplified programming interface, name mapping, and conversion mapping designed to simplify development of Microsoft Windows clients suitable for business-critical OLTP. POET uses the services of the RSC product.

A client that uses POET handles the end-user interface and performs some business application logic operations on the workstation. Pathway servers on the Tandem NonStop system perform business application logic operations and database processing for the client. Interprocess communication (IPC) messages are used to transfer data between the client and the server.

Pathway/TS

The Pathway/TS product provides tools for developing and interpreting screen programs (requesters) to support OLTP applications in the Guardian operating environment. Pathway/TS screen programs communicate with terminals and intelligent devices. Pathway/TS requesters can access Pathway servers (programs that manage the database) written in any language Tandem supports. To use Pathway/TS, you must also have NonStop TS/MP.

Pathway/TS includes the terminal control process (TCP) and the SCREEN COBOL compiler and run-time environment:

- A terminal control process (TCP) interprets and executes SCREEN COBOL programs and, with the help of the PATHMON process, coordinates communication between those programs, their terminals, and server processes.
- SCREEN COBOL is a high-level language developed by Tandem for creating and running screen programs, which are the programs that display and control data entry and data inquiry screens on fixed-function terminals.
- The SCREEN COBOL Utility Program (SCUP) is a utility for managing libraries of compiled screen programs.

PEEK

PEEK is a utility that reports statistical information maintained by the operating system within the processor module. PEEK is very useful for performance analysis and tuning. Use PEEK to monitor processor activity for:

- System storage pools
- Paging activity
- Send instructions
- Interrupt conditions

Remote Server Call (RSC)

RSC extends Pathway software to the desktop. It opens the Pathway transaction-processing system to workstation users. RSC enables Pathway clients to reside on workstations, allowing some of the processing to be done before any interaction with the host is required, so RSC dramatically improves application performance.

Safeguard

Safeguard extends the access-control features of the operating system, thereby allowing security administrators to easily tailor the level of protection to suit their needs.

SeeView

SeeView is a product that creates a multiwindowed menu-driven environment on Tandem 6500-series terminals and workstation emulators. In a SeeView environment, you can:

- Display several windows on the terminal at the same time, each one displaying the output from a different program. For example, you might have three windows on the screen, one assigned to TACL, one to the spooler, and one to an editor.
- Run many processes simultaneously on a single terminal.
- Use menus and function keys to perform actions.

- Write scripts to customize the interface. For example, you can control window placement on the screen, assign each window to a process, decide what text is sent to a process, and determine how the output from a process is displayed.

Simple Network Management Protocol (SNMP)

The SNMP subsystem lets operations personnel who operate multiplatform networks (for example, networks containing Integrity S2 systems, Ungermann-Bass products, and NonStop systems) configure Tandem NonStop operating system-based subsystems to be monitored and controlled from network management workstations that comply with SNMP. SNMP defines a standard way of managing devices from multiple vendors on a TCP/IP-based network.

Subsystem Control Facility (SCF)

For G-series systems, you use the Subsystem Control Facility (SCF) to configure, control, and display information about configured objects within SCF subsystems. Each SCF subsystem responds to and processes SCF commands that affect that subsystem. The processes created in these SCF subsystems are persistent; if the system comes down, these processes are automatically restarted as soon as the system is loaded. Supergroup users (255,n) can use the SCF command-line user interface to make configuration changes from any connected terminal.

When you install a G-series release on a Himalaya S-series server, the $SYSTEM disk and a few other initial system-load processes are preconfigured, and SYSGENR uses the CONFTEXT file to establish some system attributes for all processors. Then you finish the system configuration by using SCF.

You can use SCF to perform these tasks:

- Add, alter, or delete objects, including disk and tape devices, I/O processes, and generic processes
- Obtain configured or current information about supported objects
- Alter some system variables that are configured on D-series systems by SYSGEN
- Measure network traffic

Because configuration changes are made online using SCF, they take effect as soon as the affected objects are started (with the SCF START command). For subsystems that are new for G-series systems, these changes are permanent; they persist through processor and system loads (unless you load the system using a different configuration file). For older SCF subsystems that are controlled on G-series systems by the ServerNet WAN subsystem, configuration changes are not permanent; you must reimplement them if the system or a processor goes down.

Subsystem Programmatic Interface (SPI)

SPI provides a standard set of interfaces for building management applications. Application developers can use SPI to create management applications that will help your staff manage systems and networks, eliminate repetitive tasks, and reduce the chance of human error.

269 Operations Management Tools Tandem Advanced Command Language (TACL) Tandem Advanced Command Language (TACL) TACL is the command interpreter for the NonStop Kernel operating system. TACL helps you automate operations tasks by allowing you to write macros to perform commands. A macro is a stored sequence of TACL commands to which you assign a name; entering the macro name invokes the command sequence. Macros can accept arguments. For example, you can create a macro called STATUS that checks the status of the system. Operators have to enter only one command (STATUS) and argument (the name of the system) to check whether: All processors are up All disks are up and paths have not been switched NonStop TM/MP is running and the correct volumes are enabled for NonStop TM/MP processing All network lines and nodes are up Spooler devices are up and spooler collectors are full Log files are full Application servers, processes, and terminals are running properly All TACL macros should be formally tested on a development system before they are used on production systems. Define function keys to perform macros or routines. For example, you can define function key F2 on your keyboard to execute a macro (such as STATUS) or to start an application program. Once you define the function key, you simply press the key to execute the macro or program. Use TACL as a programming language to create routines that perform complex operations. Tandem Capacity Model (TCM) and MeasTCM TCM is a PC-based capacity-planning tool designed for capacity planners who need to estimate future system performance. TCM uses MeasTCM data (data generated by Measure and processed by MeasTCM in preparation for TCM) to predict performance for changing workloads and configurations. 
Using TCM, capacity planners can perform the following for current and planned systems and workloads:
  Estimate the amount of hardware required to meet current and projected business needs
  Forecast the maximum capacity of current systems
  Predict the performance of user applications by answering what-if questions regarding estimated throughputs, response times, and number and usage of processors and disks for various user-defined scenarios

  Generate charts to aid in determining performance management alternatives and recommendations
  Archive and organize historical transaction-oriented data

Tandem Failure Data System (TFDS)

TFDS is an automated diagnostic-and-recovery tool that monitors Tandem processors and automatically initiates a processor dump in the event of a failure, analyzes the failure data, and initiates recovery based on the type of defect discovered. TFDS can react faster to processor failure conditions than a human operator can.

Tandem Network Statistics Extended (NSX)

NSX is a network management tool that provides operators with a global perspective of the entire network. With NSX, operators can collect and monitor up-to-the-moment performance statistics on all nodes, processors, and Expand line handlers in the network. In addition, NSX collects and reports on the busiest processes in each processor. When limits are exceeded, NSX issues event messages.

NSX automatically displays network statistics in either a block-mode interface or graphical user interface for Windows on a PC. When numbers exceed limits defined by the system, they are highlighted to alert the operators.

The data-reporting capabilities of NSX provide a performance management framework for other network-performance tools. Once you have located problems with NSX, you can use Measure or other performance tools for more detailed analysis.

The NSX database can be located at a central node, distributed over several regional nodes (so that there is one NSX database for all nodes in a region), or distributed to each node in the network (so that there is one NSX database for each node).

When combined with OMF, NSX can provide a network-wide view of both the performance and operational status of system objects (for example, files filling up, the spooler filling up, processes going away, processors too busy, disks too full, and so forth).
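The limit checking described for NSX (statistics that exceed defined limits are flagged for the operator) can be illustrated with a short sketch. This is not NSX code and does not use any NSX interface; the metric names, the limit values, and the function find_exceeded are hypothetical and exist only to show the comparison.

```python
from typing import Dict, List, Tuple

# Hypothetical limits for illustration only; real limits are defined
# within the monitoring product, not in a script like this.
LIMITS: Dict[str, float] = {"cpu_busy_pct": 80.0, "line_util_pct": 70.0}


def find_exceeded(
    sample: Dict[str, float], limits: Dict[str, float] = LIMITS
) -> List[Tuple[str, float, float]]:
    """Return (metric, value, limit) for each metric whose value exceeds its limit.

    Metrics without a configured limit are ignored.
    """
    exceeded = []
    for metric in sorted(sample):
        limit = limits.get(metric)
        if limit is not None and sample[metric] > limit:
            exceeded.append((metric, sample[metric], limit))
    return exceeded


if __name__ == "__main__":
    # One hypothetical sample of statistics for a single node.
    node_sample = {"cpu_busy_pct": 91.5, "line_util_pct": 40.0}
    for metric, value, limit in find_exceeded(node_sample):
        print(f"ALERT {metric}: {value} exceeds limit {limit}")
```

A value that only equals its limit is not flagged in this sketch; whether the boundary counts as an exception is a policy choice.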
Tandem NSX for Windows allows you to use the NSX tool from Windows on a PC.

Tandem Performance Data Collector (TPDC)

TPDC is a performance-data collection tool. It collects, integrates, and normalizes all performance data from Measure and other sources and creates a single, consolidated file of performance data that can be used by other products. For example, the performance data in this file can be used by Enform to generate reports, or the performance data can be analyzed by other Tandem performance analysis tools such as GPA and Flow Map.

Tandem Reload Analyzer (Reload Analyzer)

Reload Analyzer is a database-management tool that helps you identify fragmented key-sequenced Enscribe files or key-sequenced NonStop SQL/MP objects. Reload Analyzer makes recommendations based on its calculations and provides general, block, and data chain information to help you decide whether to reorganize a key-sequenced Enscribe file or NonStop SQL/MP object.

Tandem Service Management (TSM)

Tandem Service Management (TSM) is a client/server application that provides troubleshooting, maintenance, and service tools for Himalaya S-series servers. TSM consists of software components that run on your Himalaya S-series server and a PC-compatible workstation. The TSM software on the workstation features an easy-to-use graphical user interface (GUI) that contains extensive online help. You can use TSM to perform these tasks:
  Start up a system
  Identify system components and configurations
  Display physical and logical views of the system configuration
  Display the current status of the system components
  Perform maintenance actions on specified customer-replaceable units (CRUs)
  Send problem and configuration information to a service provider
  Load new system configurations
  Monitor system resources
  Implement remote monitoring and maintenance functions

Transfer

Transfer enables organizations to move and manage information efficiently within a single Tandem system or a network of distributed systems. Transfer serves as the foundation of PS MAIL (electronic mail system) and other Tandem products. Within the Transfer environment, you can:
  Monitor the Transfer scheduler queues and the capacity of Transfer database files; control logging parameters and log file locations; look at scheduler queues; trace items; and scan folders.
  Use PATHCOM, the PATHMON command interface, to monitor activity in the Transfer Pathway environment, to stop terminals and processes if problems arise, to perform online load balancing, and to make configuration changes.
  Use SCF to add and delete terminals, to monitor activity on communications lines, and to diagnose problems on communications lines.
  Use NETMON to monitor traffic throughout the network and to identify problems on network lines.
  Use Measure to measure processor and disk activity throughout the system.

  Use TMFCOM, the command interface to NonStop TM/MP, to take online dumps of the Transfer database files and to recover the database files after a system failure.
  Establish a regular schedule for online dumps of Transfer database files.
  Monitor disk space for Transfer and applications based on Transfer, such as PS MAIL. These applications have a tendency to use large amounts of disk space very quickly.

TSM EMS Event Viewer

The TSM EMS Event Viewer assists you in performing many of the tasks associated with viewing and monitoring EMS event logs. The TSM Event Viewer lets you set up criteria for searching for and viewing the log files in a variety of ways, thereby allowing rapid assessment of service problems.

The TSM Event Viewer provides dialogs for you to set the search criteria used to retrieve events based on start and end time, subsystem, source, and multiple or specific events. You can save a retrieved event stream, as well as load a previously saved event stream. You can also save and open search criteria and preference files.

After retrieving the events from the Tandem system to the workstation, you can further select which events you want to view. This includes changing the display parameters of events (color, font style, size, and type); displaying events by timeframe, subsystem, subject, and priority; and specifying which events should be included in or excluded from the display. The TSM EMS Event Viewer is launched from within the TSM application.

ViewSys

ViewSys is a system resource monitor. It gives system managers and system operators the ability to view system resources. Viewing the resource allocations across processors on a running system helps you balance the application load more evenly. ViewSys can help you decide when to move user processes to less busy processors and when to relocate disk files or partitions to less busy disk volumes.
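The load-balancing decision that ViewSys supports can be sketched as a simple comparison of processor busy figures. The sketch below does not read data from ViewSys; the processor numbers, the busy percentages, the 20-point gap, and the function suggest_move are assumptions chosen for illustration.

```python
from typing import Dict, Optional, Tuple


def suggest_move(
    cpu_busy: Dict[int, float], min_gap: float = 20.0
) -> Optional[Tuple[int, int]]:
    """Suggest moving work from the busiest processor to the least busy one.

    cpu_busy maps a processor number to its busy percentage (hypothetical
    input). Returns (busiest, least_busy) when the difference is at least
    min_gap percentage points; otherwise returns None.
    """
    if len(cpu_busy) < 2:
        return None
    busiest = max(cpu_busy, key=lambda cpu: cpu_busy[cpu])
    least_busy = min(cpu_busy, key=lambda cpu: cpu_busy[cpu])
    if cpu_busy[busiest] - cpu_busy[least_busy] >= min_gap:
        return busiest, least_busy
    return None


if __name__ == "__main__":
    suggestion = suggest_move({0: 85.0, 1: 30.0, 2: 55.0})
    if suggestion is not None:
        source, target = suggestion
        print(f"Consider moving user processes from processor {source} to processor {target}")
```

A real decision would also weigh which processes can be moved and what backup processor each one uses; the sketch covers only the comparison step.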


A Additional Reading

Overview

This appendix provides suggestions for additional reading for each section of this book. This information will help you learn how to perform specific tasks, use Tandem products, and gain a better understanding of Tandem systems in general.

Section 1 Overview of NonStop Operations Management

The following books provide introductory information on Tandem systems and product groups:
  Introduction to Distributed Systems Management (DSM)
  Introduction to NonStop Operations Management
  Introduction to NonStop SQL/MP
  Introduction to NonStop Transaction Processing
  Introduction to NonStop Transaction Manager/MP (TM/MP)
  Introduction to Tandem NonStop Systems

Section 2 The Operations Staff

Tandem does not provide additional documentation on job descriptions and organizational planning. However, for information on particular job functions, refer to the following manuals:
  The Guardian System Operations Guide describes the tasks that should be performed by members of the operations staff.
  The Security Management Guide describes the responsibilities of the security administration team and of security administrators.
  Other operations and management manuals describe the tasks that should be performed by the operations staff to maintain applications and Tandem products.
  System-specific operations guides describe how to operate hardware.

For additional information on training programs, refer to the Tandem Education Course Catalog and the Tandem Education Schedule and Price Guide. This information is also available through the Tandem World Wide Web home page ( Your Tandem representative can also help you plan for staffing and training needs.

Section 3 The Operations and Support Areas

For site planning, configuration, and installation of the NonStop Himalaya S-series servers, refer to:
  Guardian System Operations Guide
  Guardian System Operations Reference Guide
  Himalaya S-Series Installation Guide
  Himalaya S-Series Operations Guide
  Himalaya S-Series Planning and Configuration Guide
  Himalaya S-Series Server Description Manual
  Himalaya S-Series Workstation Installation Guide

For information about the Tandem Service Management (TSM) package, refer to the TSM Configuration and Management Guide. Your Tandem representative can also help you with site planning and system installation.

Section 4 Operations Documentation

For a complete list of Tandem software documentation, refer to the About This Collection document in the G01.00 TIM collection.
For information about DNS, refer to the Distributed Name Service (DNS) Management Operations Manual.
For information about configuring disk and tape devices, refer to the SCF Reference Manual for the Storage Subsystem.
For information about the spooler, refer to the Spooler Utilities Reference Manual.
For information about the ServerNet Wide Area Network (SWAN) subsystem, refer to the ServerNet Communications Configuration and Management Manual.

Section 5 Production Management

For information on system startup, memory dumps, processor reload, and system shutdown, refer to the appropriate system-specific manual:
  Himalaya S-Series Operations Guide
  Himalaya S-Series Support Guide
  Processor Halt Codes Manual
  Tandem Failure Data System (TFDS) Manual

For information on the other tasks and products mentioned in Section 5, refer to the system-specific manuals and the following:
  DSM/NonStop Operations for Windows (DSM/NOW) Manual
  EMS Manual
  Event Management Service (EMS) Analyzer User's Guide and Reference Manual
  Enform Reference Manual
  Expand Network Management Guide
  File Utility Program (FUP) Reference Manual
  Guardian Disk and Tape Utilities Reference Manual
  Guardian Operations Reference Summary
  Guardian System Operations Guide
  Measure User's Guide
  NetBatch Manual
  NetBatch-Plus Reference Manual
  NonStop TM/MP Planning and Administration Guide
  NonStop TM/MP Operations and Recovery Guide
  Operator Messages Manual
  NonStop TS/MP and Pathway System Management Guide
  Safeguard User's Guide
  Safeguard Administrator's Manual
  SCF Reference Manual for G-Series Releases
  SCF Reference Manual for the Kernel Subsystem
  SCF Reference Manual for the Storage Subsystem
  SeeView Manual
  SNAX/HLS Configuration and Control Manual
  Spooler Utilities Reference Manual
  TACL Reference Manual
  Tandem Network Statistics Extended Manual
  Tandem Object Monitoring Facility (OMF) Manual
  TSM Configuration and Management Guide
  Transfer Installation and Management Guide
  ViewSys User's Guide

Section 6 Problem Management

For comprehensive information on problem management, refer to the Availability Guide for Problem Management.
For information about Distributed Systems Management, refer to the Introduction to Distributed Systems Management (DSM).
For information about Distributed Systems Management/NonStop Operations for Windows, refer to the DSM/NonStop Operations for Windows (DSM/NOW) Manual.
For information on the Event Management Service (EMS), refer to:
  Introduction to Distributed Systems Management (DSM)
  EMS Manual
  Event Management Service (EMS) Analyzer User's Guide and Reference Manual
For information about TSM, refer to the TSM Configuration and Management Guide.
For information about TFDS, refer to the Tandem Failure Data System (TFDS) Manual.
For a complete description of operator messages, refer to the Operator Messages Manual.
For information on ViewSys, refer to the ViewSys User's Guide.
For information on monitoring the system and applications, see the references listed for Section 5.

Section 7 Change and Configuration Management

For comprehensive information on change and configuration management, refer to the Availability Guide for Change Management. For more information about the tools used for system configuration, refer to:
  Distributed Name Service (DNS) Management Operations Manual
  DSM/SCM User's Guide
  Himalaya S-Series Installation Guide
  Himalaya S-Series Operations Guide
  Himalaya S-Series Planning and Configuration Guide
  NonStop SQL/MP Reference Manual
  NonStop TM/MP Planning and Administration Guide
  NonStop TM/MP Operations and Recovery Guide
  NonStop TS/MP and Pathway Management Reference Manual
  NonStop TS/MP and Pathway System Management Guide
  SCF Reference Manual for G-Series Releases
  SCF Reference Manual for the Kernel Subsystem
  System Generation Manual for G-Series Releases
  TSM Configuration and Management Guide
  Transfer Installation and Management Guide

Section 8 Performance Management

For comprehensive information on performance management, refer to the Availability Guide for Performance Management. For information on measurement and planning tools, refer to the following manuals:
  Enform Reference Manual
  Flow Map Manual
  Guardian Disk and Tape Utilities Reference Manual (for information on DSAP)
  Guardian 90 Performance Analyzer (GPA) User's Guide and Reference Manual
  Kernel-Managed Swap Facility (KMSF) Manual (for information on NSKCOM)
  Measure Reference Manual
  Measure User's Guide
  NonStop TS/MP and Pathway System Management Guide
  PEEK Reference Manual
  SCF Reference Manual for G-Series Releases
  Subsystem Control Point (SCP) Management Programming Manual
  Tandem Capacity Model (TCM) Manual
  Tandem Network Statistics Extended (NSX) Manual
  Tandem Performance Data Collector (TPDC) Manual
  Tandem Reload Analyzer Manual
  ViewSys User's Guide

Section 9 Security Management

For information on security in general, refer to the Security Management Guide.
For information on Safeguard, refer to:
  Safeguard Reference Manual
  Safeguard User's Guide
  Safeguard Administrator's Manual
For information on NonStop SQL/MP, refer to the NonStop SQL Installation and Management Manual.
For information on the Tandem NonStop Kernel operating system security features, refer to the Guardian User's Guide.
For information on the security features relevant in the Open System Services environment, refer to the Open System Services User's Guide.
For information on using $CMON, refer to the Guardian Programmer's Guide.

Section 10 Contingency Planning

For information on NonStop TM/MP, refer to:
  Introduction to NonStop Transaction Manager/MP (TM/MP)
  NonStop TM/MP Planning and Administration Guide
  NonStop TM/MP Operations and Recovery Guide
  NonStop TM/MP Reference Manual
  NonStop TM/MP Reference Summary
  NonStop TM/MP Management Programming Manual
  NonStop TM/MP Application Programmer's Guide

Section 11 Application Management

For information on Tandem application software, refer to:
  Data Access Language (DAL) Server Manual
  NetBatch Manual
  NetBatch-Plus Manual
  Introduction to NonStop TM/MP
  Introduction to NonStop Transaction Processing
  Introduction to Transfer Delivery System
  NonStop Access for Networking System Note
  NonStop TM/MP Planning and Administration Guide
  NonStop TM/MP Operations and Recovery Guide
  NonStop TM/MP Reference Manual
  NonStop TM/MP Reference Summary
  NonStop TM/MP Management Programming Manual
  NonStop TM/MP Application Programmer's Guide
  NonStop TS/MP Pathsend and Server Programming Manual
  NonStop TS/MP and Pathway System Management Guide
  Pathway/TS TCP and Terminal Programming Guide
  Pathway Open Environment Toolkit (POET) Installation and Management Guide
  Remote Server Call (RSC) Installation and Management Guide
  SCF Reference Manual for G-Series Releases
  Transfer Installation and Management Guide
  Transfer Programming Manual

Section 12 Automating and Centralizing Operations

For more information about the tools described in Section 12, refer to the appropriate manuals:
  Distributed Name Service (DNS) Management and Operations Manual
  Distributed System Management Programming Manual
  DSM/NonStop Operations for Windows (DSM/NOW) Manual
  DSM/SCM User's Guide
  DSM Template Services Manual
  EMS Manual
  File Utility Program (FUP) Reference Manual
  Guardian User's Guide (command files and TACL macros)
  Guardian System Operations Guide
  Introduction to Distributed Systems Management (DSM)
  Measure User's Guide
  NetBatch Manual
  NetBatch-Plus Manual
  NonStop TM/MP Planning and Administration Guide
  NonStop TM/MP Operations and Recovery Guide
  NonStop TM/MP Management Programming Manual
  NonStop TM/MP Application Programmer's Guide
  SNMP Configuration and Management Manual

  SPI Programming Manual
  TACL Programming Guide
  Tandem Failure Data System (TFDS) Manual
  Tandem Network Statistics Extended (NSX) Manual
  Tandem Object Monitoring Facility (OMF) Manual

Section 13 Operations Management and Continuous Improvement

This section describes an approach to improving operations-management processes. No tools or products are mentioned in this section.

Section 14 Operations-Management Tools

This section lists and describes all the tools referred to in this book. For sources of additional reading, refer to the appropriate subsection in this appendix.

B Check Lists

Overview

The check lists from each section in this book are reproduced here so that you can easily use the check lists for note taking or photocopying.

The Operations Staff

1. Structure your organization so that it most effectively and efficiently provides the entry-level through senior-level operations, planning, control, and support activities your company needs.
2. Define each person's job duties. Make sure that there is a well-defined path for problem escalation and for career growth.
3. Define how each person's performance is evaluated.
4. Determine the staff's training needs.
5. Provide the necessary training.
6. Provide training for career development.

The Operations and Support Areas

1. Determine the type of environment your systems require.
2. Select the location for your system. The location should:
   Be the safest and most secure available to you
   Provide all system and environmental requirements
   Have enough space for all equipment and for storage areas
   Have all required data communications lines and telephones
3. Take precautions to protect hardware, software, and data from intruders and untrained personnel.
4. Obtain all needed equipment and supplies, such as consoles, terminals, printer supplies, and manuals.
5. Once the site has been prepared and all necessary requirements met, install the systems. Most new computer room systems are installed by Tandem customer engineers (CEs).
6. Plan for preventive maintenance:
   Develop procedures and schedules.
   Arrange for CE support if needed.
   Keep all equipment and work areas clean.
7. Make sure that the support areas are located near the systems and provide all necessary equipment (for example, manuals, telephones, forms, and so on).

Operations Documentation

1. Determine what type of documentation you need in order to run your organization efficiently. You might need the following:
   Operations policies, standards, and procedures
   Agreements, contracts, and supporting documents
   Operator logs
   Error logs
   CE logs
   Outage logs
   System-configuration and network-configuration diagrams
   System-configuration and network-configuration listings
   Flow diagrams
   Online help files (for example, TSM and TSM EMS Event Viewer)
   Tandem and other vendor manuals
   Tandem software release documents
   Internal operator guides
   Documentation on the location and use of online files
   Cause, effect, and recovery information for application error messages. If you have the TSM EMS Event Viewer, your staff can document messages online by editing the event detail database.
   Anything else applicable to your operation.
2. Place the documentation where it is accessible to those who need it.
3. Set up disk files for documentation you want to have online. In the case of log files, be sure to determine when logging should be switched to a new log file.
4. Establish procedures for updating documentation and for informing the staff of updates.
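Item 3 of the check list above asks you to determine when logging should be switched to a new log file. One way to make that rule explicit is to write it down as a small function. The sketch below is an illustration only: the size and age limits and the function should_switch_log are assumptions, not values or interfaces taken from any Tandem product.

```python
from datetime import datetime, timedelta


def should_switch_log(
    size_bytes: int,
    opened_at: datetime,
    now: datetime,
    max_bytes: int = 10_000_000,
    max_age: timedelta = timedelta(days=1),
) -> bool:
    """Return True when the current log file should be closed and a new one started.

    The rule used here (switch when the file reaches max_bytes or has been
    open for max_age) is a hypothetical site policy.
    """
    too_large = size_bytes >= max_bytes
    too_old = (now - opened_at) >= max_age
    return too_large or too_old


if __name__ == "__main__":
    opened = datetime(1996, 12, 1, 8, 0)
    print(should_switch_log(2_500_000, opened, datetime(1996, 12, 1, 17, 0)))
    print(should_switch_log(2_500_000, opened, datetime(1996, 12, 2, 9, 0)))
```

Writing the rule in one place makes it easy to review when the documentation procedures in item 4 are updated.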

Production Management

1. Implement operator tools to monitor the systems, networks, applications, processors, disks, and communications lines.
2. Determine which tools to use. Tandem offers these tools:
   DSM/NOW
   EMS
   EMSA
   NetBatch and NetBatch-Plus
   NSX
   OMF
   SeeView
   Tandem Service Management (TSM)
   TSM EMS Event Viewer
   VHS
   ViewSys
3. Develop an internal operations guide that provides guidelines and procedures for performing tasks within your organization.
4. Establish work schedules based on the tasks listed in this section and other tasks required by your organization.
5. Automate the tasks as much as possible.
6. Train your staff to perform the tasks quickly and well.
7. Review the procedures and the staff's performance to determine whether the procedures can be improved.

Problem Management

1. Maintain a well-trained operations and support staff.
2. Establish problem prevention strategies. Your staff should:
   Monitor the hardware and software
   Monitor system and application message logs
   Automate operations and recovery procedures as much as possible
   Ensure that the system's fault-tolerant features are fully used and maintained
   Design your system to take advantage of quick startup and shutdown techniques
   Ensure the availability of super-group (255,n) capabilities to solve certain problems
   Be prepared and trained for environmental problems and disasters
   Maintain up-to-date and well-tested recovery procedures
3. Establish problem detection procedures. Your staff should:
   Monitor the hardware and software
   Monitor system and application software message logs
   Automate system-monitoring tasks and use monitoring check lists
   Monitor TSM incident reports
   Act on information received from users reporting problems
4. Establish procedures for reporting problems:
   Develop a standard problem report form.
   Create and maintain a system outage log.
   Designate people responsible for logging problems.
   Consider establishing a help desk.
   Train staff and users in problem reporting procedures.
5. Establish problem-solving techniques for identifying the cause of a problem and developing a solution. Using a problem-solving worksheet can help operators systematically list the facts about a problem, list possible causes, identify the cause, and develop a solution.
6. Establish problem escalation procedures. Your staff should:
   Know who should work on easy-to-fix problems and who should work on complex problems, and determine the percentage of problems that should be resolved by each level of support.
   Know how long to work on a problem before escalating the problem to the next level of support.
   Know whom to contact for help with system-related and application-related problems.

   Update the problem report form whenever a problem is escalated.
   Know which person on each shift is the Tandem contact. The Tandem contact should understand when and how to contact Tandem.
   Know how to take processor memory dumps and obtain copies of system log files.
7. Establish procedures for reviewing problems:
   Periodically meet with your staff to review solved and unsolved problems and to determine if improvements in the procedures can be made to prevent the same problems from occurring in the future.
   Generate reports to provide statistics on the number of problems encountered, solved, and not solved, and on the time and levels of staff required for problem resolution.
8. Determine which tools to use. Tandem offers these tools:
   Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
   Enform
   Event Management Service (EMS)
   Event Management Service Analyzer (EMSA)
   NetBatch and NetBatch-Plus
   Network Statistics Extended (NSX)
   Object Monitoring Facility (OMF)
   Open Notification Service (ONS)
   Subsystem Control Facility (SCF)
   Tandem Failure Data System (TFDS)
   Tandem Service Management (TSM)
   TSM EMS Event Viewer
   ViewSys
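The statistics called for in items 6 and 7 of the Problem Management check list (problems encountered, solved, and not solved; time spent; share resolved at each support level) can be computed from a simple problem log. The sketch below assumes a hypothetical record layout; the ProblemReport fields and the summarize function are illustrative and do not correspond to any Tandem report format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProblemReport:
    """One entry from a hypothetical problem log."""
    problem_id: str
    resolved_level: Optional[int]  # support level that solved it; None if unsolved
    hours_spent: float


def summarize(reports: List[ProblemReport]) -> Dict[str, object]:
    """Summarize counts, total hours, and percentage solved at each support level."""
    solved = [r for r in reports if r.resolved_level is not None]
    by_level = Counter(r.resolved_level for r in solved)
    pct_by_level = {
        level: 100.0 * count / len(solved)
        for level, count in sorted(by_level.items())
    }
    return {
        "encountered": len(reports),
        "solved": len(solved),
        "unsolved": len(reports) - len(solved),
        "hours_total": sum(r.hours_spent for r in reports),
        "pct_solved_by_level": pct_by_level,
    }


if __name__ == "__main__":
    log = [
        ProblemReport("P1", 1, 0.5),
        ProblemReport("P2", 1, 1.0),
        ProblemReport("P3", 2, 4.0),
        ProblemReport("P4", None, 2.5),
    ]
    print(summarize(log))
```

Comparing the per-level percentages against the targets set in item 6 shows whether problems are being escalated earlier or later than planned.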

Change and Configuration Management

1. Obtain management commitment to developing policies and procedures, training staff in the policies and procedures, and enforcing policies and procedures.
2. Determine your staffing needs.
3. Anticipate and plan for change by:
   Evaluating system performance and growth to accommodate change.
   Providing adequate computer room resources to allow for growth and avoid unnecessary down time.
   Configuring your system with change in mind. If you plan ahead for capacity growth, you can preconfigure additional resources into the system according to your plans.
4. To reduce the outage required for the change, determine if the change can be performed while the system is still running.
5. If the system must be shut down, minimize the outage by reducing system and application startup and shutdown times and writing efficient command files.
6. Develop change control procedures for installing and implementing change. Changes are usually handled in four phases. The change control staff:
   a. Defines and documents the change
   b. Plans for the change
   c. Installs the change
   d. Makes sure that the system is running correctly and improves the change control process
7. Determine which tools to use. Tandem offers these tools:
   Distributed Name Service (DNS)
   Distributed Systems Management/Software Configuration Manager (DSM/SCM)
   NonStop SQL/MP SQLCI
   NonStop TM/MP interfaces (TMFCOM and TMSERVE)
   NonStop TS/MP PATHCOM interface
   Subsystem Control Facility (SCF)
   Tandem Service Management (TSM)

Performance Management

1. Establish service-level agreements.
2. Assign staff to the capacity-planning, application-sizing, and performance-analysis functions. Provide training as needed.
3. Establish procedures for application sizing. Typically, the application-sizing staff:
   a. Establishes sizing requirements and strategy
   b. Forecasts and develops models of future demands
   c. Reports results
4. Establish procedures for capacity planning. Typically, the capacity-planning staff:
   a. Establishes capacity-planning requirements and strategy
   b. Institutes performance reporting
   c. Forecasts and develops models of future demands
   d. Develops the capacity plan
5. Establish procedures for performance analysis and tuning. Typically, the performance-analysis-and-tuning staff:
   a. Establishes performance requirements
   b. Measures performance
   c. Acts on collected data to optimize system performance
   d. Reports results and provides capacity planners with data
6. Select tools to help with performance management. Some of the tools Tandem offers are:
   DSAP
   Enform
   Flow Map
   Guardian Performance Analyzer (GPA)
   Measure
   NSX
   NSKCOM
   PEEK
   Subsystem Control Facility (SCF)/Subsystem Control Point (SCP)
   Tandem Capacity Model (TCM) and MeasTCM
   Tandem Performance Data Collector (TPDC)
   Tandem Reload Analyzer
   TSM EMS Event Viewer
   ViewSys

Security Management

1. Develop a security policy for your organization.
2. Educate the user community and the operations staff about security and their responsibilities for protecting the system.
3. Designate a security administrator and a security administration team to manage security. Set up check lists for the administrator and team members.
4. Maintain physical security:
   Limit access to the computer room (if applicable).
   Protect the computer cabinets and tape units from accidental damage and deliberate malicious acts.
   Protect the tape library from intruders accessing previous backup tapes.
   If your printers print sensitive information, make sure that each piece of output is delivered to its proper recipient.
   Protect on-site and off-site media storage from intruders. Keep transaction logs for all transactions. Create clear hand-over procedures between the storage-area staff and other staff.
   Determine if you need to encrypt data.
5. Establish guidelines for managing user IDs, including guidelines for:
   Assigning groups
   Using Safeguard access-control lists (ACLs)
   Preventing shared user IDs
   Preventing multiple user IDs for one person
   Using the special IDs (the super ID [255,255], super-group user [255,n], and group manager [n,255]) and the procedures for monitoring and assigning these IDs
6. Establish guidelines for managing passwords:
   Require strong passwords.
   Establish unexpected initial passwords.
   Enforce routine password changes.
   Teach users how to protect their passwords.
7. Establish guidelines for dial-up access. To protect your dial-up facility, use authorization lists, additional external passwords, callback systems, and automatic terminal authentication. In addition, periodically change passwords and telephone numbers.
8. Secure network access:

291 Check Lists Security Management Reserve a range of group numbers for network user IDs, and assign network user IDs from these groups. Decide on the network-wide names for the groups on an as-needed basis. Designate a particular organization to own each group name and group ID, and make that organization responsible for controlling the allocation of user IDs within its group. Determine what applications and users can use network IDs. Consider using encryption devices. Establish procedures for verifying communications with operations staff at other locations. 9. Secure client/server environments: Assign a personal ID and password to every client/server application user. Consider using encryption devices. Authenticate the user at the client workstation by installing a smart-card device in the workstation. Place the client portion of the application on a diskless workstation to prevent copying sensitive information to a floppy disk or access to a hard disk. Design the client/server application so that the client portion authenticates the user, determines what servers the user is entitled to use, and passes the personal ID when it calls the server. The server portion of the application should receive the personal ID and decide whether it is open to all users, is restricted to certain personal IDs, or needs stronger identification/verification. 10. Establish guidelines for moving programs from a development environment to a production environment. To secure new programs, verify that the programs are tested, use only authorized and documented versions, and ensure that security settings and logons comply with the requirements of the security policy. 11. Establish procedures for controlling PROGID programs. 12. Establish procedures for controlling licensed programs. 
Describe the steps operations staff should take to: Approve a request for a program license Review, compile, bind, and test source code before issuing a license Monitor and detect licensed programs Integrate licensed programs into new releases B-10
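The client/server design described in item 9 above can be sketched as follows. This is a minimal illustration under stated assumptions, not a Tandem interface: the names Access, ServerPolicy, and authorize are hypothetical, and Python is used only for illustration. The client is assumed to have already authenticated the user and to pass the personal ID with each call; the server then decides whether it is open to all users, restricted to certain personal IDs, or in need of stronger identification/verification.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    OPEN = "open"              # any authenticated user may call the server
    RESTRICTED = "restricted"  # only listed personal IDs may call the server
    STRONG = "strong"          # stronger identification/verification is needed


@dataclass
class ServerPolicy:
    """Hypothetical per-server access rule (not part of any Tandem product)."""
    access: Access
    allowed_ids: set = field(default_factory=set)


def authorize(policy: ServerPolicy, personal_id: str,
              strongly_verified: bool = False) -> bool:
    """Decide on the server side whether a request carrying personal_id may proceed."""
    if policy.access is Access.OPEN:
        return True
    if policy.access is Access.RESTRICTED:
        return personal_id in policy.allowed_ids
    # Access.STRONG: the personal ID alone is not enough.
    return strongly_verified
```

Keeping the decision in the server portion means that a compromised client cannot grant itself access merely by presenting a personal ID.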

Contingency Planning

1. Take preventive steps to limit risks of disaster:
   a. Select the best site possible for your organization:
      Should the site be located at a remote, computer-only site, or with other business operations?
      Is the site away from known danger zones?
   b. Select or design the best facility possible. Follow the guidelines in Section 3, The Operations and Support Areas.
   c. Follow the security guidelines described in Section 9, Security Management.
   d. Establish preventive maintenance procedures for the hardware, software, air conditioning, and computer rooms as described in Section 3, The Operations and Support Areas.
   e. Establish system-monitoring tasks as described in Section 5, Production Management.
   f. Configure the system and network (if applicable) for fault tolerance.
   g. Establish procedures for backing up and archiving data in a safe and secure location.
   h. Establish procedures for checking the integrity of the archiving system and the archived data.
   i. If you have distributed systems, make sure that applications are designed to take advantage of TMF.
2. Plan for disaster recovery.
   a. Take inventory:
      What is at risk?
      What are affordable data and service losses for your company?
      How long can your company function without running each critical application, and what is the cost of down time for each critical application?
      What types of disasters are most likely to affect your operations?
      When should the disaster plan be activated?
      Who has the authority to declare a disaster?
      Is insurance available? Should your company purchase insurance for loss of equipment or business?
      What are the recovery alternatives, the costs associated with each alternative, and the best alternative for your needs?
   b. Develop a plan that documents:
      Plan requirements
      Procedures for evacuating personnel, accounting for personnel, and contacting rescue authorities
      Damage-assessment procedures
      The name and phone number of the person with ultimate decision-making authority
      Command-post information, including location, personnel responsible for the command post, and procedures for processing information
      A list of materials and services that should be available during a disaster, and the location of necessary support contracts
      The location of backup power, communications lines, first aid equipment, and important data and records
      A prioritized list of names and phone numbers of emergency contacts
      Escape routes and emergency survival procedures
      A list of the critical applications and procedures for managing the applications during a disaster
      Procedures for backing up, storing, maintaining, and restoring data
      Backup site procedures, if applicable
      Procedures for reestablishing operations at the primary site or at a new permanent site
      Anything else necessary for recovering from a disaster
3. Train the staff in disaster recovery.
4. Test the plan:
   Define the test objectives
   Design the test
   Execute the test
   Revise the plan as needed and test the plan revisions
5. Update the plan as needed and test all updates.

Application Management

1. Establish operations requirements for all applications.
2. Participate in application reviews to ensure that your requirements are met.
3. Establish a production-assurance control group to ensure that applications are run with the correct data input or options.
4. Establish procedures for managing batch, online, and client/server processing applications.
5. Establish procedures for using Tandem products to manage the applications.
6. Train your staff so they can use the Tandem tools and can install, run, and monitor the applications.

Automating and Centralizing Operations

1. Commit resources to system automation and centralization. Determine staffing needs.
2. Determine which tasks should be automated and centralized.
3. Determine which tools to use. Tandem offers these tools:
   Command files
   Distributed Name Service (DNS)
   Distributed Systems Management/NonStop Operations for Windows (DSM/NOW)
   Distributed Systems Management/Software Configuration Manager (DSM/SCM)
   Event Management Service (EMS)
   File Utility Program (FUP)
   Measure
   NetBatch and NetBatch-Plus
   Network Statistics Extended (NSX)
   NonStop Transaction Manager/MP (TM/MP)
   NonStop Virtual HomeTerm Subsystem (VHS)
   Object Monitoring Facility (OMF)
   Open Notification Service (ONS)
   Simple Network Management Protocol (SNMP)
   Subsystem Programmatic Interface (SPI)
   Tandem Advanced Command Language (TACL)
   Tandem Failure Data System (TFDS)
   TSM EMS Event Viewer
4. Develop and test procedures.
5. Develop problem recovery procedures as appropriate.
6. Document procedures and the location of required files. Make sure that files have meaningful names.
7. Teach your staff how to use the procedures.

Operations Management and Continuous Improvement

1. Assess the current status of your operations-management processes. Use the maturity framework to help you determine the maturity level of your operations environment.
2. Develop a vision of the operations-management processes you want to have in place by establishing goals and objectives.
3. Develop an action list of tasks and the sequence in which to implement them.
4. Commit the resources to carry out the tasks, and create a project schedule to accomplish each task.
5. Execute the tasks according to your project schedule.
6. Assess the results of the improvement program. Determine your new maturity level, and decide where to go from there.


Glossary

access-control list (ACL). A Safeguard facility that allows you to restrict access to system objects.
ACL. See access-control list (ACL).
alias. When using TACL, an alias is a name that stands for a command. When using DNS, an alias is a name that stands for a network component. Aliases simplify operations tasks. When using Safeguard, an alias is an alternate name that can be assigned to a user for purposes of logging on to the system.
Alliance program. A program that Tandem has developed with third parties to augment Tandem offerings. Alliance partners offer consulting services, products, and application development services.
answerback string. A set of characters that the terminal sends in answer to a computer request. Answerback strings are useful for controlling dial-up access.
API. See application programming interface (API).
application programming interface (API). The means by which a program within an application gains access to a set of services. For example, an API might consist of a set of procedure calls that provide a workstation application with a standard interface for communicating with a Tandem NonStop system.
audit dump. A copy of a NonStop TM/MP audit-trail file written to a tape or disk volume.
audit-dump process. A NonStop TM/MP process for automatically writing to tape or disk a copy of an audit-trail file that has become full.
audit trail. A series of audit-trail files.
availability. Tandem is using this term to describe end-user availability. End-user availability is the amount of time an application running on a Tandem system can be used effectively by a user of that application. See also planned outage, unplanned outage, and outage minutes.
BACKCOPY. A utility program that allows you to duplicate tapes made with the BACKUP utility. With BACKCOPY, you can create up to two duplicate tapes for archiving, distribution, or disaster recovery.
backout process. A NonStop TM/MP process that restores a database to its original state when a transaction aborts.
BACKUP. A utility program that copies disk files onto magnetic tape.
baseline security. The minimum level of security your organization is committed to providing.

callback routine. A routine that allows the system to authenticate a caller's telephone location before permitting the caller to access the system.
catalog. A NonStop TM/MP database that contains information about audit dumps, online dumps, and tape volumes.
change control. A systematic approach to controlling the introduction of change in the production environment.
change management. Activities that manage the maintenance and growth of the production system, including hardware, software, network, and procedural changes. One of the operations disciplines in the operations management model. See operations management model.
client/server architecture. A computer architecture that divides work between a client and a server. The client provides application and user interface resources; the server stores, retrieves, and protects data. Client/server architecture enables users to access shared data and resources. Clients and servers run on a local area network. See also client/server computing.
client/server computing. A model for distributing applications. Communication takes the form of request and reply pairs, which are initiated by the client and serviced by the server. Client/server computing is often used to connect different types of workstations or personal computers to the host computer system, using supported communications protocols. In the Tandem environment, the Remote Server Call product allows a client process (for example, a workstation application) to access a server (for example, a Pathway server). See also client/server architecture.
Client Server Gateway (CSG). The workstation component of the CSG/SSG message transport mechanism. Workstation processes can use the CSG to send commands to host server processes independently of the underlying communications protocol. DSM/NOW uses CSG/SSG to exchange commands and responses between workstation clients and host server processes. See also SeeView Server Gateway (SSG).
cold site. An empty building with power, air conditioning, data communications lines, and water at the site. When a disaster occurs, you convert the cold site into a primary site by moving all necessary equipment, software, data, and personnel to the site. Cold sites are practical when disasters of major proportions occur.
command file. An edit file that contains a series of commands in the order you want to execute them. To execute the commands in the file, you either use the OBEY command and give the name of the file, or you name the file as the input file when you run TACL. Using command files is a method of automating operations tasks.
configuration. (1) The arrangement of cabinets, system components, and peripheral devices into a working unit. (2) The definition or alteration of characteristics of an object.

configuration management. The process of configuring the production system hardware and software to adapt to changes. One of the operations disciplines in the operations management model. See operations management model.
CONFLIST. A listing that contains all SYSGEN or SYSGENR commands and responses, including error and warning messages, that occur during processing of the configuration files and building of the new operating system. As SYSGEN or SYSGENR processes each line of the CONFTEXT input file, it writes actions taken to the CONFLIST output file. It then creates maps describing how it builds the new operating-system image. These maps are useful for debugging purposes.
CONFTEXT file. The main configuration file, copied by SYSGEN or SYSGENR from the input configuration file (CONFnn), that contains all of the hardware and software descriptions applicable to a system. It describes the configuration that produced the new operating-system image (OSIMAGE). For G-series systems, the CONFTEXT file consists of one or two paragraphs: DEFINES (optional) and ALLPROCESSORS because SCF configures all peripheral devices and I/O processes. Tandem recommends that the user make changes to a copy of the CONFTEXT file only, and not the CONFAUX file.
CSG. See Client Server Gateway (CSG).
DAL. See Data Access Language (DAL) server.
Data Access Language (DAL) server. A host-based data server process running on a Tandem system that uses both Structured Query Language (SQL) and Data Access Language (DAL) features. It provides Macintosh users on attached LANs with reliable, transparent, fault-tolerant, distributed access to data residing in Tandem NonStop SQL/MP databases.
data-ready site. A backup site similar to an operational-ready site except that data-ready sites take advantage of electronic vaulting. Data-ready sites are updated on a staged basis. Archived data resides on the data-ready site systems and does not need to be loaded during a disaster. The data-ready site systems are only as current as the last data loaded onto the systems.
DCOM. See Disk Compression Program (DCOM).
define process library. A set of TACL routines that allow you to run background server processes so that management applications can send commands to a number of subsystems without the overhead of creating a new server process for each command.
design outage class. An outage class that includes bugs in design and design failures in hardware and software. An example of the class of outage could be an application change that introduces unexpected problems. See also outage class.
Disk Compression Program (DCOM). A utility program that compresses disk space.
Disk Space Analysis Program (DSAP). A utility program that analyzes how space on a given disk volume is used.

Distributed Systems Management/NonStop Operations for Windows (DSM/NOW). A Microsoft Windows client/server application that simplifies NonStop server management through a graphical user interface. DSM/NOW consists of an application launcher, event viewer, and integrated command and control functions.
Distributed Systems Management/Software Configuration Manager (DSM/SCM). A product for installing and managing software configurations on distributed target systems. At the central site, DSM/SCM receives, archives, configures, and packages software for the target sites. On the target sites, DSM/SCM loads new software received from the central site.
Distributed Name Service (DNS). The processes responsible for name management services in DSM (for example, assigning aliases, grouping related names, and sharing names among network nodes).
Distributed Systems Management (DSM) products. A set of software tools that facilitate management of NonStop systems and Expand networks. These tools include the Distributed Name Service (DNS), the Event Management Service (EMS), the Subsystem Control Facility (SCF) for a variety of subsystems, and the Subsystem Programmatic Interface (SPI).
DNS. See Distributed Name Service (DNS).
down time. Time during which the system is not capable of doing useful work because of a planned or unplanned outage. From the end user's perspective, down time is any time the application is not available. The cost of down time can be dramatic in lost revenue, lost consumer confidence, and lost productivity. See also planned outage and unplanned outage.
DSAP. See Disk Space Analysis Program (DSAP).
DSM. See Distributed Systems Management (DSM).
DSM/NOW. See Distributed Systems Management/NonStop Operations for Windows (DSM/NOW).
DSM/SCM. See Distributed Systems Management/Software Configuration Manager (DSM/SCM).
electronic vaulting. The process of electronically archiving data and sending the data to a remote, online site.
EMS. See Event Management Service (EMS).
EMSA. See Event Management Service Analyzer (EMSA).
Enform. A report generator.
entity. A resource measured by Measure.

environmental outage class. An outage class that includes failures in power, cooling, network connections, natural disasters (earthquake, flood), terrorism, and accidents. See also outage class.
event. A change in some condition in the system or network, whether minor or serious. Events might be operational errors, notifications of limits exceeded, requests for action, and so on.
Event Management Service (EMS). The processes, procedures, and utilities used to report and log events; to forward, print, and distribute event messages to applications; and to filter, retrieve, and obtain information from event messages.
Event Management Service Analyzer (EMSA). A conversational interface that is used to select and analyze events from EMS log files, such as subsystem ID, event number, text, and start and stop time.
event message. The message generated by a subsystem when a subsystem detects an event that might affect its operation. These messages are generally formatted with tokens.
Expand. Tandem's NonStop network that extends the concept of fault-tolerant operation to networks of geographically distributed NonStop systems. If the network is properly designed, communication paths are constantly available, even in the event of a single line or component failure. For G-series systems, the Network Control Process (NCP) and Expand line handler processes are defined and started with SCF from the WAN subsystem.
fault tolerance. Fault-tolerant systems are able to tolerate physical and design-related outages for all hardware components, including processors. Some vendors refer to fault tolerance as no single point of failure.
File Utility Program (FUP). A utility program that allows you to perform a variety of operations on disk files.
Flow Map. A Tandem product that translates the data collected by the Tandem Performance Data Collector (TPDC) product into flow charts that can be used to analyze, document, and manage applications.
FUP. See File Utility Program (FUP).
GPA. See Guardian Performance Analyzer (GPA).
Guardian Performance Analyzer (GPA). A Tandem product that gathers performance data from Measure and then analyzes the collected data. See also Measure.
group manager (n, 255). A user ID that allows a user to control a group of user IDs.
high-security facility. A facility that has a perimeter security system, and totally redundant hardware, environmental systems, and communications equipment.
HLSCOM. The command interface to SNAX/HLS.

hot site. See operational-ready site.
ICC. See Integrated Command and Control (ICC).
Integrated Command and Control (ICC). A Microsoft Windows workstation component of DSM/NOW. ICC provides a point-and-click object browser for Tandem host subsystem objects and maps related commands from various subsystem command interfaces to clickable buttons. See also Distributed Systems Management/NonStop Operations for Windows (DSM/NOW).
International Tandem Users Group (ITUG). An independent organization of Tandem users that encourages communication and information exchange, establishes a forum for special interest groups, and provides feedback to Tandem regarding users' needs.
ITUG. See International Tandem Users Group (ITUG).
Kernel-Managed Swap Facility (KMSF). A facility for managing virtual memory. Through KMSF, the NonStop Kernel opens one or more swap files for each processor and manages the files for all the processes needing them. KMSF receives requests for swap space from the NonStop Kernel, and returns swap-space reservations to the Kernel. Processes swap to the kernel-managed swap files as needed. As a process's need for swap space grows, KMSF increases the amount of swap space reserved for the process. When the process no longer needs the space, it is returned to KMSF. See also NSKCOM.
KMSF. See Kernel-Managed Swap Facility (KMSF).
Launcher. A Microsoft Windows workstation component of DSM/NOW. The Launcher provides operations personnel with console environments specifically tailored to help them accomplish their operational duties. The Launcher can start useful applications automatically when an operator logs on or the operator can start applications on demand by selecting them from organized lists of utilities. See also Distributed Systems Management/NonStop Operations for Windows (DSM/NOW).
least privilege. A security concept that allows users access to only the system resources they need.
licensed program. A program that has the privileges of the operating system. When a licensed program runs, privileged operations in it can bypass ordinary security interfaces.
macro. A sequence of TACL commands and built-in functions that can contain dummy arguments, thus providing a means for simple argument substitution. When the macro name is given to TACL, TACL substitutes the command sequence for the macro name and replaces any dummy arguments with parameter values supplied to TACL. Macros are used to automate operations tasks.
management application. A program or set of programs that issues commands to subsystems, retrieves event messages, or does both things, to assist in managing a system or a network of systems.
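The argument substitution described in the macro entry above can be illustrated outside TACL. The sketch below is Python, not TACL, and does not reproduce TACL's actual expansion rules. It assumes dummy arguments written as %1%, %2%, and so on; the function name expand_macro and the sample command text are hypothetical.

```python
import re


def expand_macro(body: str, args: list) -> str:
    """Replace each %n% placeholder in body with the nth supplied argument (1-based)."""
    def substitute(match):
        index = int(match.group(1))
        # Assumption: a placeholder with no matching argument expands to nothing.
        return args[index - 1] if 1 <= index <= len(args) else ""
    return re.sub(r"%(\d+)%", substitute, body)


# Hypothetical macro body with one dummy argument.
command = expand_macro("FUP INFO %1%, DETAIL", ["$DATA.SALES.ORDERS"])
```

Here command holds the text "FUP INFO $DATA.SALES.ORDERS, DETAIL", which shows how a single stored command sequence can be reused with different parameter values.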

MeasTCM. The interface between Measure on the Tandem host and the capacity planning tool TCM on the PC or the Macintosh. MeasTCM runs under TACL on the Tandem host, summarizes the performance data collected by Measure, and formats this data for use by TCM.
Measure. A performance-measurement tool that lets users collect and examine statistics for a system or network.
MultiLan. A hardware and software product that allows users to connect their local area networks (LANs) to Tandem NonStop systems.
mutual backup site. A backup site owned by several companies.
NetBatch. An automated job-management system that schedules and dispatches batch jobs.
NetBatch-Plus. A Pathway application that provides a screen-driven interface to the NetBatch job-management system. You can use the application, which has its own database, to control NetBatch systems running on different nodes.
NETMON. A network monitoring utility program that provides statistics regarding network traffic.
node. A Tandem computer system that is part of an Expand network.
NonStop Access for Networking. A collection of network components that extends fault-tolerant computing through the LAN to the desktop.
NonStop SQL/MP conversational interface. See SQLCI.
NonStop SQL/MP. A relational-database-management system that promotes efficient online access to large distributed databases.
NonStop TM/MP. See NonStop Transaction Manager/MP (NonStop TM/MP).
NonStop Transaction Manager/MP (NonStop TM/MP). A database-protection subsystem incorporated into the operating system. NonStop TM/MP maintains the consistency and integrity of a distributed database that is updated by concurrent transactions.
NonStop Transaction Services/MP (NonStop TS/MP). A Tandem product that provides process-management and link-management functions for OLTP applications on NonStop Himalaya servers.
NonStop TS/MP. See NonStop Transaction Services/MP (NonStop TS/MP).
NonStop Virtual Hometerm Subsystem (VHS). A subsystem that acts as a virtual home terminal for applications by emulating a 6530 terminal. NonStop VHS receives messages normally sent to the home terminal, such as displays and application prompts, and uses these messages to generate event messages for EMS, which can in turn be used to inform operations staff of problems.

NSKCOM. The command interface to the Kernel-Managed Swap Facility (KMSF). NSKCOM is the primary tool for monitoring, configuring, and managing kernel-managed swap files. See also Kernel-Managed Swap Facility (KMSF).
NSX. See Tandem Network Statistics Extended (NSX).
offline. Used to describe tasks that can be performed only when the system is down. Contrast with online.
OM model. See operations management model.
OMF. See Tandem Object Monitoring Facility (OMF).
online. Used to describe tasks that can be performed while the system is up. Contrast with offline.
online dump. A copy of a NonStop TM/MP audited disk file written to a tape or disk volume.
online-ready site. A fully operational backup site (also known as a hot site) that has all necessary hardware and software. Archived data is sent to the operational-ready site but is not loaded onto the system until a disaster occurs.
ONS. See Open Notification Service (ONS).
Open Notification Service (ONS). A data encapsulation and forwarding server that gathers EMS events from the system event log, translates them into Simple Network Management Protocol (SNMP) trap format, and forwards them to the SNMP Agent, thereby facilitating delivery of Tandem subsystem-specific data to problem management components that comply with SNMP and that are external to the Tandem system.
Open System Services (OSS). An open system environment available for interactive or programmatic use with the Tandem NonStop Kernel. Processes that run in the OSS environment use the OSS application program interface; interactive users of the OSS environment use the OSS shell for their command interpreter.
operational-ready site. A fully operational backup site (also known as a hot site) that has all necessary hardware and software. Archived data is sent to the operational-ready site but is not loaded onto the system until a disaster occurs.
operations area. The area where you locate the computer systems and peripherals, for example a computer room or an office.
operations management. The operation and management of systems and networks in support of your business. Planning for operations management includes establishing and fulfilling service-level agreements, defining and understanding the OM model, and optimizing the features of Tandem systems and software. See also operations management model.

operations management activities. Activities, as defined by the Tandem operations management model, that support a production system, plan for all aspects of the production system, control the introduction of change into the production system, and operate the production system.
operations management model (OM model). A model for managing Tandem systems that categorizes operations management functions into the following disciplines: production management, problem management, change management, configuration management, security management, and performance management.
operator message. The text displayed for a system operator that describes an event.
operations outage class. An outage class that includes errors caused by operations personnel due to accidents, inexperience, or malice. See also outage class.
OSS. See Open System Services (OSS).
outage. Time during which the system is not capable of doing useful work because of a planned or unplanned interruption. From the end user's perspective, an outage is any time the application is not available. See also down time, outage class, outage log, outage minutes, planned outage, and unplanned outage.
outage class. A concept developed by Tandem to categorize the cause of unplanned and planned outages. There are five outage classes: physical, design, operations, environmental, and reconfiguration. See also physical outage class, design outage class, operations outage class, environmental outage class, and reconfiguration outage class.
outage log. A record of system outages. An outage log can provide an accurate assessment of availability. Tandem recommends that outages be measured in minutes rather than percentages. See also outage minutes.
outage minutes. A metric recommended by Tandem for measuring outages. Translates percentages into minutes of down time per year. See also down time and availability.
PATHCOM. The interactive administrative interface to the NonStop Transaction Services/MP core service. See also NonStop Transaction Services/MP.
PATHMON process. The central control process for the NonStop TS/MP transaction-processing core service and the optional Pathway/TS software, which together form the Pathway environment. The PATHMON process controls all processes and devices in the Pathway environment and provides the means to configure, manage, monitor, and change the configuration of the Pathway environment.
Pathway environment. The programs and operating environment required for developing and running online transaction-processing (OLTP) applications. This group of tools is packaged as two separate products: NonStop TS/MP and the optional Pathway/TS software. See also NonStop Transaction Services/MP (NonStop TS/MP) and Pathway/TS.
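The outage minutes entry above describes translating availability percentages into minutes of down time per year. A minimal sketch of that arithmetic follows. The function name outage_minutes_per_year is hypothetical, and the use of a 365-day year is an assumption that the glossary does not state.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def outage_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into minutes of down time per year."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0
```

Under these assumptions, 99 percent availability corresponds to 5,256 minutes of down time per year, and 99.9 percent to about 525.6 minutes (roughly 8.8 hours). Stated in minutes, the difference between two availability figures that look nearly identical as percentages becomes easier to see.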

Pathway Open Environment Toolkit (POET). A set of programs and utilities that assist in the creation and running of client/server applications for Tandem systems.
Pathway/TS. A Tandem product that provides tools for developing and interpreting screen programs to support OLTP applications in the Guardian operating environment. See also NonStop Transaction Services/MP (NonStop TS/MP) and Pathway environment.
PEEK. A utility program that reports statistical information concerning processor activity for system storage pools, paging activity, send instructions, and interrupt conditions.
performance management. Activities that manage the performance of the production system and network environment to ensure that the systems meet the business needs defined by operations service-level agreements. One of the operations disciplines in the operations management model. See operations management model.
physical outage class. An outage class that includes physical faults or failure in the hardware. Any type of hardware-component failure belongs in this category. See also outage class.
planned outage. Time during which the system is not capable of doing useful work because of a planned interruption. A planned outage can be time when the system is brought down to allow for servicing, upgrades, backup, or general maintenance. Contrast with unplanned outage. See also outage, outage minutes, outage log, and availability.
POET. See Pathway Open Environment Toolkit (POET).
problem management. Activities that provide support for resolving problems in a production environment. One of the operations disciplines in the operations management model. See operations management model.
production management. The set of regularly scheduled activities that keeps the applications on a system or network of systems running smoothly. These activities include administering storage media such as disks and tapes, managing space in processors and disks, and starting or stopping system components. One of the operations disciplines in the operations management model. See operations management model.
PROGID program. A program that allows one user to temporarily use a controlled subset of another user's privileges.
PS MAIL. An electronic mail system that lets you send mail messages to any other registered user on a Tandem network.
PUP. See Peripheral Utility Program (PUP).
reconfiguration. See reconfiguration outage class.
reconfiguration outage class. An outage class that includes all planned outages. Examples include down time required for planned maintenance such as software upgrades, and configuration changes such as adding a new disk or restructuring a database. See also outage class, planned outage, and down time.

remote mirroring. A pair of mirrored disk drives that are used together as a single logical drive in which the primary drive and the backup (mirror) drive are located in geographically distinct (remote) locations. Each byte of data written to the primary drive is also written to the mirror drive. If the primary drive fails, the mirror drive can continue operations. By providing geographic separation of mirrored volumes, remote mirroring protects the database from local environment hazards.
Remote Server Call (RSC). A Tandem product that facilitates client/server computing, allowing personal computer (PC) or workstation applications running in Microsoft Windows, Windows NT, Winsock, MS-DOS, OS/2, UNIX, and Apple Macintosh operating environments to access Pathway server classes and operating-system processes. Transactions are transmitted from the PC or workstation application (the client) to a Pathway application running on a Tandem NonStop system (the server) using a supported communications protocol, such as network basic input-output system (NETBIOS), Transmission Control Protocol/Internet Protocol (TCP/IP), or an asynchronous connection.
SAFECOM. The Safeguard command interpreter.
Safeguard. A security tool that provides users of Tandem systems and distributed networks with a set of services for protecting the components of the system or network from unauthorized use. Safeguard services include authentication, authorization, and auditing.
SCF. See Subsystem Control Facility (SCF).
SCP. See Subsystem Control Point (SCP).
screen program. A program that controls screen displays on input and output devices and processes, accepts data from these devices and processes, and transmits the data and requests to other programs that operate upon a database.
SEB. See ServerNet Expansion Board (SEB).
security management. Activities that provide support for establishing and maintaining system security. One of the operations disciplines in the operations management model. See operations management model.
SeeView Server Gateway (SSG). The host component of the CSG/SSG message transport mechanism. The SSG, working with CSG on the workstation, routes messages from workstation clients to appropriate host server processes and returns responses. Together, CSG and SSG provide a message transport mechanism between the workstation and the host that is independent of the underlying communication protocol.
server. A group of Pathway server processes that receives and reacts to transaction requests from screen programs.
ServerNet Expansion Board (SEB). A connector board that plugs into the backplane to allow one or more ServerNet cables to exit the rear of the enclosure.

ServerNet System Area Network (SAN). A wormhole-routed, full-duplex, packet-switched, point-to-point network designed with special attention to reducing latency and ensuring reliability. The ServerNet SAN provides the communication path used for interprocessor messages and for communication between processors and I/O devices.
ServerNet Wide Area Network (SWAN) concentrator. A Tandem data communications peripheral that provides connectivity to a Himalaya S-series server. The SWAN concentrator supports both synchronous and asynchronous data over RS-232, RS-449, X.21, and V.35 electrical and physical interfaces.
service-level agreements. Agreements between the operations group and the group's users that specify the group's objectives, requirements, and standards.
Simple Network Management Protocol (SNMP). An asynchronous request/response protocol (implemented in the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite) used for network management. In the SNMP management framework, each managed node is viewed as having several variables. Reading these variables monitors the managed node; changing their values controls it.
site update tape (SUT). A tape that contains software and documentation for a particular release.
SNAX/HLS. A tool that provides a general-purpose, high-level interface by which Tandem application programs can communicate with intelligent SNA devices and software products.
SNMP. See Simple Network Management Protocol (SNMP).
SPI. See Subsystem Programmatic Interface (SPI).
SPOOLCOM. A utility program that helps operations personnel monitor and maintain the spooler and spooler components.
spooler. A set of programs that acts as an interface between users (and user applications) and the print devices of a system.
SQL. Abbreviation for structured query language. See NonStop SQL/MP.
SQLCI (NonStop SQL/MP conversational interface). A line-oriented terminal interface that enables a user to enter NonStop SQL/MP statements and commands, format and run reports, and operate database utilities.
SSG. See SeeView Server Gateway (SSG).
Subsystem Control Facility (SCF). A utility program used to control a variety of subsystems. For G-series systems, you can use SCF online to configure, control, and display information about configured objects within SCF subsystems. SCF has been enhanced to perform many of the functions performed on D-series systems by DSC/COUP and PUP.

Subsystem Control Point (SCP). The management process for all Tandem data communications subsystems. There can be several instances of this process. Applications using the Subsystem Programmatic Interface (SPI) send all commands for data communications subsystems to an instance of this process, which in turn sends the commands on to the manager processes of the target subsystems. SCP also processes a few commands itself. It provides security features, version compatibility, support for tracing, and support for applications implemented as NonStop process pairs. See also management process; manager process.
Subsystem Programmatic Interface (SPI). A set of procedures for building and decoding commands, responses, and event messages.
super-group user (255, n). A user ID that allows users to execute some potentially destructive commands. The super-group user is provided for operators who perform system operations tasks, such as controlling the status of peripherals and other system components.
super ID (255,255). A user ID that allows users to do anything on the system. Users with the super ID can access all data and devices.
support area. The area where the operations staff is located.
SUT. See site update tape (SUT).
SYSGEN. A utility program used by DSM/SCM to generate a Tandem NonStop Series (TNS) operating-system image for a given hardware and software configuration on Tandem CISC systems.
SYSGENR. A utility program used by DSM/SCM to generate a Tandem NonStop Series/RISC (TNS/R) operating-system image for a given hardware and software configuration on Tandem RISC systems.
TACL. See Tandem Advanced Command Language (TACL).
Tandem Advanced Command Language (TACL). A powerful, extended command interpreter for the Tandem NonStop Kernel operating system that enables you to perform work on NonStop systems.
Tandem Capacity Model (TCM). A Tandem product that provides computer-assisted capacity planning. It uses the Microsoft Excel spreadsheet with Measure data to explore different growth scenarios and system configurations.
Tandem Failure Data System (TFDS). A component of the Tandem NonStop Kernel. This tool isolates software problems and provides automatic processor failure data collection, diagnosis, and recovery services.
Tandem Information Manager. The Tandem Information Manager (TIM) product integrates multiple collections of Tandem product and support information including customer manuals, education course information, and other technical documents to provide a single, searchable library. The TIM viewer provides the interface to collections of documents that are available on local CD-ROM discs as well as online, Internet-accessible servers. This common interface allows you to merge local and online searches and display local and online windows.
Tandem Network Statistics Extended (NSX). A network management tool that provides operators with a global perspective on the entire network. With NSX, operators can collect and monitor up-to-the-moment performance statistics on all nodes, processors, and Expand line handlers in the network.
Tandem NonStop Kernel. The operating system for NonStop systems, which consists of the core and system services. The operating system does not include any application program interface.
Tandem Object Monitoring Facility (OMF). A Tandem product that enables operators to supervise objects such as processors, disks, files, and processes within the Tandem environment.
Tandem Performance Data Collector (TPDC). A Tandem host-based performance-data-collection product.
Tandem Reload Analyzer. A Tandem product used to identify fragmented Enscribe files and NonStop SQL/MP tables and to determine the files and tables that will benefit from an online reorganization.
Tandem Service Management. A client/server application that provides troubleshooting, maintenance, and service tools for the Himalaya S-series server. Tandem Service Management (TSM) consists of software components that run on the Himalaya S-series server and on a PC-compatible workstation. TSM combines many of the system maintenance functions provided on D-series releases by the Syshealth toolkit, the Tandem Maintenance and Diagnostic Subsystem (TMDS), and the Remote Maintenance Interface (RMI) product.
TCM. See Tandem Capacity Model (TCM).
TCP. See Terminal control process (TCP).
TERM. A task that uses a screen program to control input and output devices (such as terminals or workstations) or input and output processes (such as front-end processes). Each task runs as a thread in a terminal control process (TCP), which can handle many such tasks concurrently.
Terminal control process (TCP). A multithreaded process supplied with Pathway/TS that interprets and executes screen program instructions for each input-output (I/O) device or process the TCP is configured to handle. The TCP coordinates communication between screen programs and their I/O devices or processes and, with the help of the PATHMON process, establishes links between screen programs and Pathway server processes.
TFDS. See Tandem Failure Data System (TFDS).
third-party backup site. A backup site leased from a third party.

TIM. See Tandem Information Manager.
TMFCOM. The NonStop TM/MP command interpreter.
TPDC. See Tandem Performance Data Collector (TPDC).
Transfer. An information delivery system that enables organizations to move and manage information efficiently within a single Tandem system or a network of distributed systems.
TSM. See Tandem Service Management.
TSM EMS Event Viewer. Used to perform a variety of tasks associated with viewing and monitoring EMS event logs. The TSM EMS Event Viewer lets you select from a variety of parameters to set the criteria to search for and view the EMS event log file.
unplanned outage. Time during which the system is not capable of doing useful work because of an unplanned interruption. Unplanned interruptions can include failures caused by faulty hardware, operator error, or disaster. Contrast with planned outage. See also outage, outage minutes, outage log, and availability.
ViewSys. An interactive utility that monitors system resources while the system is running.
$CMON. A message interface to a command-interpreter monitoring process that allows you to track logon attempts and important security changes.


Index
A
Access control list (ACL) 9-12
Account Quality Planning (AQP) Service 1-18
ACL 9-12
Activity areas, staffing 2-1
Additional reading A-1
Agreements
  documentation of 4-4
  service-level 1-2, 4-2, 8-2
Alias, user 9-12
Alliance program 1-16
Application management
  case study
  check list
  description of 11-1
  example
  operations tools 11-8, 11-10,
Application sizing 8-1
Applications
  batch 11-7/11-8
  client/server 11-10/11-13
  configuration listings 4-9
  controlling 11-2, 11-6
  managing 11-1/11-7
  monitoring 5-12
  online transaction processing 11-8
  operations requirements for 11-2
  performing configuration changes 7-6
  purpose 8-1
  reloading files 5-15
  reviewing new applications 11-4
  sizing 8-3/8-5
AQP See Account Quality Planning (AQP) Service
Auditors 9-5
Automatic logoff option 9-10
Availability
  cost of downtime 1-8
  end-users perspective 1-1, 1-8
  maximizing 1-10
B
BACKCOPY 5-13
BACKUP 5-13, 10-3
Backup sites 10-10/10-11
Backups 5-13, 5-15
Baseline security 9-4
Batch processing 11-7/11-8
C
Callback routine 9-18
Capacity planning
  case study 8-10
  description 8-5/8-6
  forecasting 8-6
  methodology 8-5/8-6
  operations tools 8-12
  performance reporting 8-5
  purpose of 8-1
  staff requirements 8-3
CE logs 4-13
Centralized operations organization 2-12
Change 7-1
Change control 7-8/7-9
Change function, staffing requirements 2-7/2-9
Change management
  case study 7-10
  check list 7-14
  discipline description 1-6
  example 7-10
  operations tools 7-13

Change management (continued)
  responsibilities of management 7-3
  staffing requirements 7-3
Change, planning for 7-4
Check lists, summary of B-1
Client/server processing
  description 11-10/11-13
  security requirements 9-19
Cold sites
Collusion 9-2, 9-5
Command files 12-1, 14-4
Command post for disaster recovery 10-7
Communications, performing configuration changes 7-6
Computer cabinets, security of 9-10
Computer-room
  description of 3-1
  monitoring 3-4
  planning for equipment and supplies 3-5
  planning for system installation 3-3, 3-6
  preventive maintenance 3-7, 3-8
  security of 3-5, 9-9
  selecting a location for 3-1/3-4
Computer-room operator 2-15
Configuration 4-5/4-9
Configuration management
  discipline description 1-6
  operations tools 7-13
  responsibilities of management 7-3
  staffing requirements 7-3
CONFLIST 4-8
CONFTEXT 4-8
Contingency planning
  backup sites
  check list
  command posts 10-7
  definition of 10-1
  planning for 10-1/10-3
  preventing 10-1/10-3
  See also Backup sites
Contracts, documentation of 4-4
Control area, staffing requirements 2-8/2-9
Corporate security officer 9-4
D
DAL 14-5
Damage assessment 10-6
Data
  archiving 10-3
  encryption 9-11
Data Access Language (DAL) server 14-5
Data-ready sites 10-10,
Devices, monitoring 5-12
Dial-up access 9-17/9-18
Disasters See Contingency planning
Disciplines, operations management 1-3
  See also Change management, Configuration management, Performance management, Problem management
Disk Space Analysis Program (DSAP) See DSAP
Disks 5-15
Distributed Name Service (DNS) See DNS
Distributed operations organization 2-11
Distributed systems 1-12, 9-19
Distributed Systems Management (DSM) See DSM
Distributed Systems Management/NonStop Operations for Windows See DSM/NOW
Distributed Systems Management/Software Configuration Manager See DSM/SCM

DNS
  description 14-6
  naming conventions 11-2
Documentation
  additional operations information A-1
  CE logs 4-13
  configuration diagrams and listings 4-5
  contracts 4-4
  error logs 4-13
  error message 4-17
  flow diagrams 4-9
  manuals 4-12
  online files 4-17
  operator guides 4-16
  operator logs 4-13
  outage logs 4-14
  policies and procedures 4-1
  service-level agreements 4-2
  software release documents 4-12
Downtime, high cost of 1-8
DSAP 9-8, 9-22, 9-23, 14-5
DSM 1-12, 6-4
DSM/NOW 14-7
DSM/SCM 7-5
  configuration reports 4-8
  description 14-8
E
Education courses provided by Tandem 1-15, 2-26
EMS
  description 14-9
  used for problem prevention 6-4
  used with applications 11-2
EMSA 14-9
Encryption, data 9-11
End-users perspective of availability 1-8
Enform 14-9
Environment, monitoring 3-4
Equipment and supplies 3-5, 10-2
Error
  logs 4-13
  messages 4-17
Errors See Problems
Event detail database 4-17
Event Management Service Analyzer (EMSA) 14-9
Event Management Service (EMS) See EMS
Events, generation of 11-2
Expand 10-2
F
Fault-tolerant operations 1-11
FAXAdvisor 1-18
File Utility Program (FUP) 14-9
Files, reloading 5-15
Flow diagrams 4-9
Flow Map
Forecasting 8-6
FUP 14-9
G
GPA
Group manager ID 9-14
Groups See User groups
Guardian Performance Analyzer (GPA)
Guest-user IDs 9-15
H
Hardware
  checking components 5-10
  client/server security 9-19
  implementing configuration changes 7-5

Hardware (continued)
  implementing physical changes 7-4
  installing 3-6/3-7
  securing 9-9/9-11
  support for 1-16
Help desks
  equipment and supplies 3-9
  uses of 6-8
Help-desk operators 2-17, 6-8
High-security facilities 3-2
I
Improvement program 13-2/13-7
Internal operator guides 4-16
International Tandem User's Group (ITUG) See ITUG
ITUG 1-17, 9-9
J
Job descriptions
  computer-room operator 2-15
  help-desk operator 2-17
  lead operator 2-18
  operations manager 2-24
  senior configuration planner 2-23
  senior systems planner 2-22
  technical support specialist 2-20
Job scheduler 11-8
K
Kernel See NonStop Kernel
Keys 9-10
KMSF
L
Lead operator 2-18
Least privilege as a security guideline 9-4
Legal considerations for security 9-4
Levels of staffing 2-1/2-3
Licensed programs 9-23
Line handlers, checking status of 5-11
Load balancing 8-8
Log files 6-4
Logs
  CE 4-13
  error 4-13
  for monitoring security 9-2
  operator 4-13
  outage 4-14
M
Maintenance, preventive 3-7, 5-15
Management responsibilities
  change management 7-3
  configuration management 7-3
  problem management 6-2
  production management 5-7
  security management 9-4
Manager of operations 2-24
Manuals
  location of 3-5, 4-12
  operations 1-13, 4-12
  provided by Tandem 1-13, 2-26
Maturity framework 13-3
MeasTCM
Measure
  analyzing data 5-14
  description
  routine tasks 5-10, 5-13
Measurements, outage-minutes-per-year 1-9
Measuring outages 1-8
Messages, error 4-17
Mutual backup sites

N
Naming conventions 11-2
NetBatch
  description
  operations tasks 11-8
  used to manage batch processing 11-8
NetBatch-Plus
  description
  operations tasks 11-8
  used to manage batch processing 11-8
Network Statistics Extended (NSX) See NSX
Networks
  checking status of 12-3
  diagrams of 4-5
  encrypting data 9-19
  monitoring 5-11, 12-1
  security default settings 9-8
NonStop Access for Networking 10-3,
NonStop Kernel
  network ID requirements 9-18
  security features 9-6, 9-8
NonStop ODBC Server
NonStop SQL/MP 9-9,
NonStop SQL/MP SQLCI
NonStop systems 1-11
NonStop TM/MP
  audit dumps, when to perform 5-11, 5-13
  backing up TMF subvolume 5-14
  description 10-3,
  online dumps 10-3
  used to manage online transaction processing
NonStop Transaction Manager/MP (TM/MP) See NonStop TM/MP
NonStop Transaction Services/MP (TS/MP) See NonStop TS/MP
NonStop TS/MP
  collecting statistics 5-14
  description
  PATHCOM interface
NSKCOM
NSX
  description
  used for monitoring 5-11
  used for problem prevention 6-4
O
Object Monitoring Facility (OMF) See OMF
Office environment
  monitoring 3-4
  planning for equipment and supplies 3-5
  planning for system installation 3-4, 3-7
  preventive maintenance 3-7, 3-8
  security of 3-5
  selecting a location for 3-1/3-4
  selecting and preparing for 3-1
Off-site storage 3-2, 9-10
OM
  disciplines 1-3
  model
    description of 1-1, 1-3
    used for staffing 2-1
OMF (Object Monitoring Facility)
  description
  used for problem prevention 6-4
Online files, documenting 4-17
Online transaction processing
  description 11-8
  monitoring environment 5-11
  tools for managing
Online-ready sites 10-10,
ONS

On-site storage, security of 9-10
Open Notification Service (ONS) See ONS
Open System Services (OSS) See OSS
Operational-ready sites
Operations
  agreements 1-2
  documentation 4-1
  improvement program 13-2/13-7
  improving processes 13-1
  manuals 1-13
  staffing requirements 2-4/2-6
Operations areas
  equipment and supplies for 3-5
  monitoring the environment 3-4
  preventive maintenance 3-7
  security of 3-5
  selecting a location for 3-1
Operations management
  activities 2-1
  and fault-tolerance 1-11
  description of 1-1
  disciplines 1-3
  improvement program 13-1
  objectives 1-1
  planning for 1-1
  See also Operations organizations, Operations tasks, Operations tools
Operations manager 2-24
Operations organizations
  centralized operations group 2-12
  distributed operations group 2-11
  small operations group 2-10
  technical support group 2-14
  telecommunications group 2-13
Operations tasks
  automating 12-4
  centralizing 12-5
  daily tasks 5-10/5-13
  for batch processing 11-8
  for client/server processing
  for online transaction processing 11-9
  for running NetBatch 11-8
  monthly tasks 5-15
  processor dump 5-9
  processor reload 5-9
  recovery procedures 5-15
  reviewing 5-15
  system load 5-8
  system shutdown 5-9
  system startup 5-8
Operations tools
  application management 11-8, 11-10,
  automating and centralizing tasks 12-6
  batch processing 11-8
  capacity tuning 8-12
  change management 7-13
  client/server processing
  configuration management 7-13
  data archiving 10-3
  data communications 10-2
  data recovery 10-3
  online transaction processing
  performance management 8-12
  problem management 6-17
  production management 5-17
  routine tasks 5-17
  security management 9-7/9-9
  summary of 14-1
Operator
  guides 4-16
  logs 4-13
  messages
    documenting 4-17
    monitoring 5-11, 5-12

Operators
  computer-room operator 2-15
  help-desk operator 2-17
  lead operator 2-18
Optimizing system performance 8-8
OSS
  file security 9-20
  interoperability with Safeguard 9-20
  running backups 5-13
  security considerations 9-7
  user aliases 9-12
Outage logs 4-14
Outages
  classes 6-1
  in a client/server environment 1-9
  measuring 1-8
  planned 1-9/1-11
  prevention and recovery training 6-3
  reducing 1-10
  unplanned 1-9/1-11, 6-1
Outage-minutes-per-year measurements 1-9
P
Passwords 9-16/9-18
PATHCOM 5-11,
Pathway Open Environment Toolkit (POET)
Pathway/TS
PEEK
Performance analysis and tuning
  case study 8-10
  methodology 8-6/8-9
  purpose of 8-1
Performance management
  case study 8-10
  check list 8-12
  discipline description 1-7
  example 8-10
  operations tools 8-12
  staffing requirements 8-3
Physical security 3-5, 9-9/9-11
  See also Security management 9-9/9-11
Planned outages 1-9/1-11
Planning area, staffing requirements 2-7/2-8
POET
Policies, documentation of 4-1
Preventive maintenance 3-7, 5-15
Printers 5-11, 9-10
Privileged users 9-13, 9-14
Problem management
  check list 6-18
  discipline description 1-5
  responsibilities of management 6-2
  summary of tools 6-17
Problems
  case study 6-12
  escalation 6-10
  logging 6-7
  prediction of 6-3
  prevention of 6-3
  recovering from 6-6/6-12
  reporting and tracking 6-6/6-12
  reviewing 6-12
  statistics on 6-12
  systematic problem solving 6-6/6-12
Procedures, documentation of 4-1
Processes
  improvement program 13-1
  monitoring 5-10
Processors
  measuring resource usage 5-10
  reloading 5-9
Production assurance control group 11-6
Production function, staffing requirements 2-4/2-6
Production management
  check list 5-18
  description 1-4, 5-7

Production schedule 5-5/5-6
PROGID programs 9-22
Programs
  development 9-21
  licensed 9-23
  PROGID 9-22
Q
Quality services provided by Tandem See Account Quality Planning (AQP) Service
R
Recovery procedures 5-15
Reload Analyzer (Tandem Reload Analyzer)
Remote Server Call (RSC)
Reporting and tracking, reviewing procedures 5-15
Reports
  of network data 5-10
  of security audits 5-14
  of statistical information 5-5
  of system performance 5-5, 5-13, 5-14
RSC
S
Safeguard
  description of 9-8,
  enforcing user expiration dates 9-15
  interoperability with OSS 9-20
  restricting system software access 9-12
SCF 5-11
  configuration listings 4-8
  INFO command 4-8
  product description
  STATUS command 4-8
  used to configure system 4-8
Schedule of production tasks 5-5
Security management
  administrators 9-5
  assigning user groups 9-11
  check list 9-25
  controlling dial-up access 9-17/9-18
  controlling user IDs 9-6, 9-11/9-16
  description 9-1
  discipline description 1-7
  eliminating collusion 9-2
  legal considerations 9-4
  management responsibilities 9-4
  managing development programs 9-21
  managing licensed programs 9-23
  managing PROGID programs 9-22
  managing system access 9-11
  monitoring 5-12, 9-2
  OSS system security 9-20
  policy guidelines 9-3
  protecting client/server access 9-19
  protecting network access 9-18
  protecting NonStop SQL/MP databases 9-9
  reviewing procedures 5-15
  risk assessment 9-2
  rules of 9-2
  See also Physical security
  staff responsibilities 9-5
  tools 9-7
  transaction logs 9-2
  use of encryption 9-11, 9-19
  using ACLs 9-12
SeeView
Senior configuration planner 2-23
Senior systems planner 2-22
ServerNet SAN 4-5,
ServerNet WAN 4-5,
Services offered by Tandem 1-16

Service-level agreements
  creation of 4-2
  description of 1-2, 4-2
  use of 8-2
Shutdown request forms 5-9
Simple Network Management Protocol (SNMP)
Site planning 3-1, 3-2
Site update tape (SUT) 4-12
SNMP
Software
  education 1-15, 8-3, 8-4
  installing changes to 7-6
  publications 1-13
  release documents 4-12
  support for 1-16
Sources of additional reading A-1
Special user IDs 9-12
SPI
SPOOLCOM 5-11
Spooler, monitoring 5-11, 5-12
Staffing
  check list 2-27
  levels 2-1/2-3
  operations management activities 2-1
  requirements for change and configuration management 7-3
  requirements for change function 2-7/2-9
  requirements for control area 2-8/2-9
  requirements for operations area 2-4/2-5
  requirements for performance management 8-3
  requirements for planning area 2-7/2-8
  requirements for production function 2-4/2-6
  requirements for support area 2-5/2-6
  sample job descriptions 2-15/2-25
    See also Job descriptions
  sample organizations 2-9/2-14
    See also Operations organizations
  security responsibilities 9-4
Statistics, collection of 5-3, 5-10, 5-12
Subsystem Control Facility (SCF) See SCF
Subsystem Programmatic Interface (SPI)
Subsystems, performing configuration changes 7-6
Super ID 9-13
Super-group user 9-14
Supplies and equipment in case of disaster 10-2
Support areas 3-8/3-9
Support services offered by Tandem 1-16
SUT (Site update tape) 4-12
System configuration listings 4-5
System load 5-8
System software 9-8, 9-12
Systems
  changing configurations 7-4
  developing models of 8-5
  development 9-21
  diagrams of 4-5
  installing 3-6
  maintaining 3-7
  managing access to 9-11
  optimizing performance 8-8
  performing configuration changes 7-5
  reloading 5-9
  security of cabinets 9-10
  shutting down 5-9
  starting up 5-8
  tracking usage 5-3
T
TACL

Tandem
  Alliance program 1-16
  education 2-26
  manuals 1-13, 2-26
  software 1-12
  support services 1-16
  systems 1-11
  World Wide Web home page 1-15
Tandem Advanced Command Language (TACL)
Tandem Capacity Model (TCM)
Tandem Failure Data System (TFDS)
Tandem Network Statistics Extended (NSX) See NSX
Tandem NonStop Support Center (TNSC) 6-11
Tandem Performance Data Collector (TPDC)
Tandem Reload Analyzer (Reload Analyzer)
Tandem Service Management See TSM
Tape drives, maintenance 5-13
Tape library
  location of 3-1
  security of 9-10
Tape units, security of 9-10
TCM
Technical support organization 2-14
Technical support specialist 2-20
Telecommunications operations organization 2-13
Telephones
  controlling dial-up access 9-17
  lines for 3-1, 3-5
  requirements for help desk 3-9
Terminals
  automatic identification 9-18
  security of 9-10
TFDS (Tandem Failure Data System)
Third-party backup sites
TM View
TMFCOM
TMFSERVE
TNSC 6-11
TPDC
Training
  for disaster recovery 10-9
  for operations staff 2-26
  provided by Tandem 2-26
  provided by vendors 2-27
  provided in-house 2-27
  using manuals for 2-26
Transaction logs 9-2
Transfer
  description
  monitoring 5-11
  used to manage online transaction processing
TSM 3-7, 14-3
  as change- and configuration-management tool 7-13
  as problem management tool 6-7, 6-17
  as production management tool 5-11, 5-17
  operations staff duties 2-19, 2-21
  problem incident report 4-13
  product description
TSM EMS Event Viewer 14-3
  application requirements 11-3
  as automation and centralization tool 12-6
  as performance management tool 8-12
  as problem management tool 6-4, 6-17
  as production management tool 5-11, 5-17
  managing client/server applications
  operations staff duties 2-19, 2-21
  operator message access 4-17

TSM EMS Event Viewer (continued)
  product description
U
Unplanned outages 1-9/1-11, 6-1
User aliases 9-12
User classes See User IDs
User groups 9-11, 9-12
User IDs
  adding 9-12
  deleting 9-15
  expiration dates 9-15
  freezing 9-15
  group manager 9-14
  guest-user ID 9-15
  network application IDs 9-18
  network IDs 9-18
  purpose of 9-11
  reusing 9-16
  special classes of 9-12
  super ID 9-13
  super-group user 9-14
V
ViewSys
Voice alert systems 3-5
W
World Wide Web, Tandem home page 1-15
Special Character
$CMON 9-9, 14-4
