Foreword xv
Preface xvii
Acknowledgments xviii

CHAPTER 1 Introduction 1
  1.1 What Is Mission Critical? 1
  1.2 Purpose of the Book 2
  1.3 Network Continuity Versus Disaster Recovery 2
  1.4 The Case for Mission-Critical Planning 4
  1.5 Trends Affecting Continuity Planning 5
  1.6 Mission Goals and Objectives 6
  1.7 Organization of the Book 7
  References 8

CHAPTER 2 Principles of Continuity 9
  2.1 Fault Mechanics 9
    2.1.1 Disruptions 10
    2.1.2 Containment 11
    2.1.3 Errors 11
    2.1.4 Failover 12
    2.1.5 Recovery 13
    2.1.6 Contingency 13
    2.1.7 Resumption 14
  2.2 Principles of Redundancy 14
    2.2.1 Single Points of Failure 14
    2.2.2 Types of Redundancy 16
  2.3 Principles of Tolerance 20
    2.3.1 Fault Tolerance 21
    2.3.2 Fault Resilience 22
    2.3.3 High Availability 22
  2.4 Principles of Design 22
    2.4.1 Partitioning 23
    2.4.2 Balance 23
    2.4.3 Network Scale 25
    2.4.4 Complexity 27
  2.5 Summary and Conclusions 28
  References 29

CHAPTER 3 Continuity Metrics 31
  3.1 Recovery Metrics 32
    3.1.1 Recovery Time Objective 32
    3.1.2 Recovery Point Objective 35
    3.1.3 RTO Versus RPO 36
  3.2 Reliability Metrics 36
    3.2.1 Mean Time to Failure 37
    3.2.2 Failure Rate 37
    3.2.3 Mean Time to Recovery 38
    3.2.4 Mean Time Between Failure 38
    3.2.5 Reliability 40
  3.3 Availability Metrics 42
  3.4 Exposure Metrics 47
  3.5 Risk/Loss Metrics 47
  3.6 Cost Metrics 48
  3.7 Capacity Metrics 52
    3.7.1 Utilization 52
    3.7.2 Bandwidth 53
    3.7.3 Overhead 54
  3.8 Performance Metrics 55
    3.8.1 Latency 55
    3.8.2 Response Time 57
    3.8.3 Loss 57
    3.8.4 Error 57
    3.8.5 Throughput 58
  3.9 Summary and Conclusions 59
  References 60

CHAPTER 4 Network Topology and Protocol Considerations for Continuity 63
  4.1 Network Topology 63
    4.1.1 Fundamental Topologies 64
    4.1.2 Mesh Topologies 65
    4.1.3 Ring Topologies 67
    4.1.4 Tiered Topologies 68
    4.1.5 Edge Topologies 69
    4.1.6 Peer-to-Peer Topologies 69
  4.2 Network Protocol Considerations 70
  4.3 Summary and Conclusions 71
  References 72

CHAPTER 5 Networking Technologies for Continuity 73
  5.1 Local Area Networks 73
    5.1.1 Ethernet 74
    5.1.2 Switching Versus Segmenting 76
    5.1.3 Backbone Switching 77
    5.1.4 Link Redundancy 78
    5.1.5 Multilayer LAN Switching 78
    5.1.6 Virtual LANs 78
    5.1.7 Transceivers 79
    5.1.8 Media Translators 80
    5.1.9 Network Adapter Techniques 80
    5.1.10 Dynamic Hierarchical Configuration Protocol 82
  5.2 Wide Area Networks 83
    5.2.1 WAN Technologies 83
    5.2.2 Routing Methods 90
    5.2.3 Multilayer WAN Switching 96
    5.2.4 Design Strategies 101
    5.2.5 VPNs 102
  5.3 Metropolitan Area Networks 103
    5.3.1 Metro Ethernet 104
    5.3.2 RPR 106
  5.4 Summary and Conclusions 108
  References 109

CHAPTER 6 Processing, Load Control, and Internetworking for Continuity 113
  6.1 Clusters 113
    6.1.1 Cluster Types 114
    6.1.2 Cluster Resources 117
    6.1.3 Cluster Applications 118
    6.1.4 Cluster Design Criteria 118
    6.1.5 Cluster Failover 119
    6.1.6 Cluster Management 120
    6.1.7 Cluster Data 121
    6.1.8 Wide Area Clusters 122
  6.2 Load Balancing 123
    6.2.1 Redirection Methods 125
    6.2.2 DNS Redirection 128
    6.2.3 SSL Considerations 128
    6.2.4 Cookie Redirection 129
    6.2.5 Load Balancer Technologies 129
    6.2.6 Load Balancer Caveats 131
  6.3 Internetworking 132
    6.3.1 Web Site Performance Management 132
    6.3.2 Web Site Design 135
    6.3.3 Web Services 137
    6.3.4 Web Site Recovery Management 137
    6.3.5 Internet Access 139
  6.4 Caching 143
    6.4.1 Types of Caching Solutions 144
    6.4.2 Caching Benefits and Drawbacks 145
    6.4.3 Content Delivery Networks 146
  6.5 Summary and Conclusions 149
  References 150

CHAPTER 7 Network Access Continuity 153
  7.1 Voice Network Access 153
    7.1.1 PBXs 154
    7.1.2 IP Telephony 156
    7.1.3 Intelligent Voice Response and Voice Mail Systems 158
    7.1.4 Carrier Services 159
  7.2 Data Network Access 162
    7.2.1 Logical Link Access Techniques 163
    7.2.2 Physical Access Techniques 165
  7.3 Wireless Access 167
    7.3.1 Cellular/PCS 167
    7.3.2 Wireless LAN 169
    7.3.3 Microwave 169
    7.3.4 Free Space Optics 171
    7.3.5 Broadcast 172
    7.3.6 Satellite 172
  7.4 Summary and Conclusions 173
  References 175

CHAPTER 8 Mission-Critical Platforms 177
  8.1 Critical Platform Characteristics 177
  8.2 Platform Tolerance 181
    8.2.1 Fault Tolerance 182
    8.2.2 Fault Resilience 184
    8.2.3 High Availability 184
    8.2.4 Platform Comparisons 185
  8.3 Server Platforms 185
    8.3.1 Hardware Architectures 186
    8.3.2 Software Architecture 191
  8.4 Network Platforms 198
    8.4.1 Hardware Architectures 198
    8.4.2 Operating Systems 201
  8.5 Platform Management 202
    8.5.1 Element Management System 202
    8.5.2 Platform Maintenance 203
  8.6 Power Management 205
  8.7 Summary and Conclusions 206
  References 207

CHAPTER 9 Software Application Continuity 209
  9.1 Classifying Applications 209
  9.2 Application Development 210
  9.3 Application Architecture 214
  9.4 Application Deployment 214
  9.5 Application Performance Management 216
    9.5.1 Application Availability and Response 217
    9.5.2 APM Software 218
    9.5.3 Data Collection and Reporting 220
  9.6 Application Recovery 221
  9.7 Application/Platform Interaction 222
    9.7.1 Operating System Interaction 223
  9.8 Application Performance Checklist 224
  9.9 Summary and Conclusions 228
  References 228

CHAPTER 10 Storage Continuity 231
  10.1 Mission-Critical Storage Requirements 231
  10.2 Data Replication 234
    10.2.1 Software and Hardware Replication 235
  10.3 Replication Strategies 237
    10.3.1 Shared Disk 237
    10.3.2 File Replication 237
    10.3.3 Mirroring 238
    10.3.4 Journaling 242
  10.4 Backup Strategies 242
    10.4.1 Full Backup 244
    10.4.2 Incremental Backup 245
    10.4.3 Differential Backup 247
  10.5 Storage Systems 247
    10.5.1 Disk Systems 247
    10.5.2 RAID 248
    10.5.3 Tape Systems 251
  10.6 Storage Sites and Services 254
    10.6.1 Storage Vault Services 254
    10.6.2 Storage Services 255
  10.7 Networked Storage 256
    10.7.1 Storage Area Networks 257
    10.7.2 Network Attached Storage 268
    10.7.3 Enterprise SANs 270
    10.7.4 IP Storage 271
  10.8 Storage Operations and Management 273
    10.8.1 Hierarchical Storage Management 274
    10.8.2 SAN Management 275
    10.8.3 Data Restoration/Recovery 279
  10.9 Summary and Conclusions 283
  References 285

CHAPTER 11 Continuity Facilities 289
  11.1 Enterprise Layout 289
    11.1.1 Network Layout 290
    11.1.2 Facility Location 290
    11.1.3 Facility Layout 291
  11.2 Cable Plant 291
    11.2.1 Cabling Practices 292
    11.2.2 Copper Cable Plant 296
    11.2.3 Fiber-Optic Cable Plant 299
  11.3 Power Plant 301
    11.3.1 Power Irregularities 302
    11.3.2 Power Supply 303
    11.3.3 Power Quality 304
    11.3.4 Backup Power 308
    11.3.5 Power Distribution Architecture 312
    11.3.6 Power Management 314
  11.4 Environmental Strategies 315
    11.4.1 Air/Cooling 315
    11.4.2 Fire Protection Planning 318
  11.5 Summary and Conclusions 319
  References 320

CHAPTER 12 Network Management for Continuity 323
  12.1 Migrating Network Management to the Enterprise 323
  12.2 Topology Discovery 325
  12.3 Network Monitoring 326
  12.4 Problem Resolution 328
  12.5 Restoration Management 331
  12.6 Carrier/Supplier Management 333
  12.7 Traffic Management 334
    12.7.1 Classifying Traffic 334
    12.7.2 Traffic Control 335
    12.7.3 Congestion Management 336
    12.7.4 Capacity Planning and Optimization 338
  12.8 Service-Level Management 341
  12.9 QoS 342
    12.9.1 Stages of QoS 342
    12.9.2 QoS Deployment 345
    12.9.3 QoS Strategies 346
  12.10 Policy-Based Network Management 348
  12.11 Service-Level Agreements 351
    12.11.1 Carrier/Service Provider Agreements 352
  12.12 Change Management 353
  12.13 Summary and Conclusions 354
  References 356

CHAPTER 13 Using Recovery Sites 359
  13.1 Types of Sites 359
    13.1.1 Hot Sites 360
    13.1.2 Cold Sites 362
    13.1.3 Warm Sites 363
    13.1.4 Mobile Sites 363
  13.2 Site Services 363
    13.2.1 Hosting Services 364
    13.2.2 Collocation Services 365
    13.2.3 Recovery Services 367
  13.3 Implementing and Managing Recovery Sites 368
    13.3.1 Networking Recovery Sites 369
    13.3.2 Recovery Operations 370
  13.4 Summary and Conclusions 371
  References 372

CHAPTER 14 Continuity Testing 373
  14.1 Requirements and Testing 374
  14.2 Test Planning 375
  14.3 Test Environment 377
  14.4 Test Phases 378
    14.4.1 Unit Testing 380
    14.4.2 Integration Testing 380
    14.4.3 System Testing 381
    14.4.4 Acceptance Testing 387
    14.4.5 Troubleshooting Testing 388
  14.5 Summary and Conclusions 390
  References 391

CHAPTER 15 Summary and Conclusions 393

About the Author 397
Index 399