Enhancing IBM Netfinity Server Reliability




IBM Chipkill Memory
Advanced ECC Memory for the IBM Netfinity 7000 M10

IBM Chipkill Memory (Chipkill), part of the IBM Netfinity X-architecture initiative, will be available during the first quarter of 1999 as an option for the IBM Netfinity 7000 M10 server. With Chipkill, an IBM Netfinity 7000 M10 will be protected from the failure of any single memory chip and from any number of multi-bit errors in any portion of a single memory chip. This advanced technology, derived from IBM large systems, will greatly reduce Netfinity 7000 M10 downtime, resulting in a more robust Intel processor-based client-server computing platform.

Background

In the early 1990s, most Intel processor-based servers employed parity memory technology. Parity memory can detect errors but cannot correct them. When a memory parity error occurs in a server, the system crashes and all data stored in memory is lost. For example, the server may be processing 32-64MB of data that is in memory but not yet stored on disk. In this instance, a memory parity error could cause a significant amount of information to be lost. (All servers put recent data into memory before the data is stored onto disk.)

In the past, memory parity errors occurred with sufficient frequency to warrant a move by the entire industry to adopt ECC (Error Checking and Correcting) memory. ECC memory can detect and correct a single-bit error and can usually detect, but not correct, double-bit errors. Because multiple-bit errors were rare, ECC provided a significant increase in reliability. Today ECC memory technology is standard on almost every server.

To obtain acceptable performance, today's servers require large amounts of memory. This is partly because Intel processor-based servers have increased CPU performance by a factor of about 20 since the early 1990s, while disk drives have improved performance by a factor of only about 5 in the same period. Large amounts of memory are therefore used to isolate fast CPUs from slow disk drives.

Because servers require large amounts of memory, customers have also demanded lower memory prices. Low memory costs are obtained by using memory chips that contain as many bits as possible. By using very dense memory chips, the $/MB ratio has been lowered drastically. To achieve this density, however, a single memory chip usually provides 4 or 8 data bits on each access.

An unrecoverable data loss condition occurs when a memory chip on a memory DIMM fails, because the number of bits supplied by each memory chip is greater than ECC can protect. Consider an example in which each memory chip supplies 8 bits of data. When a memory chip fails, ECC cannot recover the 8 bits of data that are lost. As a result, the server crashes. Data loss from memory chip failures can occur with both 4-bit and 8-bit wide memory technologies because standard ECC corrects only single-bit failures.
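The limit of standard ECC described above can be illustrated with a short sketch. The code below is not IBM's implementation; it assumes a textbook extended Hamming code over 64 data bits (which yields a 72-bit word), and the function names `encode`, `decode`, and `flip` are hypothetical. It shows a single flipped bit being repaired, while eight flipped bits from one x8 chip are not recovered.

```python
def encode(data_bits):
    """Build a SEC-DED codeword: Hamming check bits plus one overall parity bit.

    Index 0 holds the overall parity; indices 1..n use Hamming numbering,
    where power-of-two positions are check bits and the rest carry data.
    """
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1
    n = len(data_bits) + r
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i                       # check bit p covers positions with bit p set
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2          # overall parity enables double-error detection
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    odd_flips = sum(code) % 2 == 1
    code = list(code)
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips and syndrome <= n:
        code[syndrome] ^= 1              # treat as a single-bit error and repair it
        status = "corrected"
    else:
        status = "uncorrectable"
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


data = [1, 0, 1, 1, 0, 0, 1, 0] * 8      # 64 arbitrary data bits
word = encode(data)                      # 72 bits including check bits

single_bit = decode(flip(word, [10]))            # one bad bit
chip_fail = decode(flip(word, range(9, 17)))     # assumed x8 chip: 8 adjacent bits bad

print(len(word), single_bit[0], chip_fail[0])
```

In this sketch the single-bit error comes back with the original data, whereas the eight-bit failure is at best flagged and the data is not restored, which corresponds to the server crash described in the text.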

On current high-performance Intel processor-based file servers, these memory chip failures cause a system crash that can result in the permanent loss of several gigabytes of business data! This is a catastrophic failure with no possible data recovery in most file server applications.

Memory failures create significant downtime even for database server applications that protect against data loss by logging each transaction to disk. The reason is that the database recovery process may require many hours to reconcile the database contents with the transaction log updates that have not yet been written to the database. Thus, it can take several hours for a database server to recover from a memory failure before it can resume business transactions. This database recovery period may result in a loss of business productivity.

Solving Memory Failures

For those systems requiring the highest availability, such as the IBM S/390 class of enterprise servers, the problem of multi-bit or DRAM failures is handled by architectural means in main memory. The memory subsystem is designed so that a single chip, no matter what its data width, will not affect more than one bit in any given ECC word. For example, if 4-bit wide DRAMs were in use, each of the 4 bits would feed a different ECC word; that is, a different address of the memory space. Thus, even if an entire memory chip fails, no single ECC word will experience more than one bit of bad data, a situation that can be fixed by the ECC logic, thereby maintaining the fault tolerance of the memory subsystem.

Thus, IBM mainframe computers are protected against memory chip failures because of an intelligently designed memory subsystem, which organizes each memory module so that the number of bits each memory chip supplies to any one ECC word does not exceed the number of bits the ECC logic can correct. Current Intel processor-based servers do not have this capability because market forces demand cheap memory, forcing designers to use very dense industry-standard memory chips.
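The mainframe-style organization can be sketched as a mapping from chip pins to ECC words. This is an illustrative model only: the 72-bit word size, the chip counts, and the names `packed_layout`, `scattered_layout`, and `bad_bits_per_word` are assumptions for the example, not details taken from an actual S/390 design.

```python
from collections import Counter

WORD_BITS = 72  # assumed ECC word: 64 data bits + 8 check bits


def packed_layout(chip_width):
    """Conventional layout: every pin of a chip lands in the same ECC word."""
    chips = WORD_BITS // chip_width
    return {(chip, pin): (0, chip * chip_width + pin)
            for chip in range(chips) for pin in range(chip_width)}


def scattered_layout(chip_width):
    """Mainframe-style layout: pin p of every chip feeds ECC word p."""
    return {(chip, pin): (pin, chip)
            for chip in range(WORD_BITS) for pin in range(chip_width)}


def bad_bits_per_word(layout, failed_chip):
    """Count how many bits of each ECC word are lost when one chip fails."""
    return Counter(word for (chip, _pin), (word, _bit) in layout.items()
                   if chip == failed_chip)


packed = bad_bits_per_word(packed_layout(4), failed_chip=5)
scattered = bad_bits_per_word(scattered_layout(4), failed_chip=5)

print("packed:   ", dict(packed))     # all 4 lost bits fall in one word
print("scattered:", dict(scattered))  # 1 lost bit in each of 4 words
```

With x4 chips, the packed layout loses four bits of one word, which is beyond single-bit correction, while the scattered layout loses one bit in each of four words. The cost in this model is that the scattered layout needs 72 chips rather than 18 to fill its ECC words, so memory must be installed in larger increments.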

IBM engineers have solved this problem for Netfinity servers by placing a Redundant Array of Inexpensive DRAM (RAID) processor chip directly on the memory DIMM. The RAID chip calculates an ECC checksum for the contents of the entire set of chips for each memory access and stores the result in extra memory space on the protected DIMM. Thus, when a memory chip on the DIMM fails, the RAID result can be used to "back up" the lost data, allowing the Netfinity server to continue functioning. This RAID technology is similar to the RAID technology used to protect the contents of an array of disk drives. We call this memory technology Chipkill DRAM.

Performance

IBM engineers have designed the ECC Application-Specific Integrated Circuit (ASIC) engine for Chipkill so that it performs the error-correcting calculation without slowing access to the high-performance 50 nsec EDO memory. Thus, IBM Chipkill Memory DIMMs can be installed without modification in a Netfinity Intel Xeon processor-based server without any measurable performance degradation!

Reliability Comparison

The chart shown below illustrates the results of IBM analysis comparing server outages due to memory failures for parity, ECC and Chipkill-equipped servers. In summary, the following outage rates were identified:

The 32MB parity memory-equipped server experienced 7 outages per 100 servers over 3 years.
The 1GB ECC memory-equipped server experienced 9 outages per 100 servers over 3 years.
The 4GB Chipkill-equipped server experienced 6 outages per 10,000 servers over 3 years!

[Chart: 3-Year Server Outages Due to Memory Failures, 1GB IBM Chipkill RAM vs. 1GB ECC vs. 32MB Parity. Vertical axis: cumulative fails per 10,000 systems (0 to 1000). 32MB Parity: ~7 fails per 100 systems. 1GB SEC ECC: ~9 fails per 100 systems. 1GB IBM RAID RAM: ~6 fails per 10,000 systems.]

Server outages due to memory failures for servers with 1GB ECC memory are actually higher than those for servers with 32MB parity memory! The reason is that the number of 64Mbit DRAM modules required to supply 1GB of memory has increased to the point that there is a greater potential for a memory chip failure than there is for a single-bit parity error when using 32MB of parity memory. IBM Chipkill Memory technology reduces IBM Netfinity server outages due to memory failures to an incredibly low 6 outages in 10,000 systems for a 3-year lifetime!

Timothy Dell, A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. November 19, 1997, IBM. For more information, see our Web site at www.chips.ibm.com/products/memory/chipkill/chipkill.html.
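The comparison above is easier to follow when the three outage rates are put on the same per-10,000 scale and the chip counts are worked out. The sketch below uses only the figures quoted in the text; the chip count is a simplification that counts 64Mbit data chips only and ignores parity or check-bit chips, and the helper name `data_chips` is hypothetical.

```python
from fractions import Fraction

# Outage rates over 3 years as quoted in the text.
outages = {
    "32MB parity": Fraction(7, 100),
    "1GB ECC": Fraction(9, 100),
    "Chipkill": Fraction(6, 10_000),
}

# Normalize every rate to outages per 10,000 servers.
per_10k = {name: int(rate * 10_000) for name, rate in outages.items()}


def data_chips(capacity_mb, chip_mbit=64):
    """Number of data chips needed (assumption: check-bit chips are ignored)."""
    return capacity_mb * 8 // chip_mbit


ecc_vs_chipkill = outages["1GB ECC"] / outages["Chipkill"]

print(per_10k)
print("64Mbit chips for 32MB:", data_chips(32))
print("64Mbit chips for 1GB: ", data_chips(1024))
print("ECC outages relative to Chipkill:", ecc_vs_chipkill)
```

On these numbers, 1GB of memory needs 32 times as many 64Mbit data chips as 32MB, which is consistent with the explanation that more chips mean more opportunities for a whole-chip failure.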

IBM Chipkill Memory Goes to Mars

[Illustration: the Chipkill Memory DIMM with the NASA Pathfinder Mars probe. Caption: This IBM memory technology flew to Mars as part of the Pathfinder probe.]

Notice the white RAID chip in the center. This chip performs the ECC calculation for each data access and stores the result in additional DRAM modules on the DIMM. Chipkill technology was used by NASA in the Mars Pathfinder probe. NASA engineers understood that this problem was real and they did not want to invest the billions of dollars to send the Pathfinder probe all the way to Mars only to have the entire mission ruined by a memory chip failure!

Summary

The most straightforward approach to providing Chipkill-correct ECC for Intel processor-based servers is to adopt IBM mainframe techniques and design the memory subsystem so that only one I/O bit per chip is used in each of several ECC words. This approach does present design hurdles, such as minimum memory granularity and support for wider chips (such as the very common x8 DRAM), but for a system designed from scratch these hurdles can be overcome.

Because Intel processor-based server products are designed using commodity components, the market has expectations of price/performance and frequency of new product introductions that preclude special proprietary memory controller designs. For systems designed with industry-standard components, another approach is needed to offer Chipkill capability. The only known alternative method is to deploy a standard memory subsystem framework and populate it with DIMMs that incorporate a retrofittable and plug-compatible on-DIMM Chipkill-correct ECC. This class of product, known as Chipkill Memory from IBM, provides an instantaneous upgrade from an existing ECC memory subsystem to a Chipkill-correct ECC subsystem.

The same market forces that drove yesterday's 32MB parity memory Intel processor-based servers to include single error correction ECC memory as a checklist item will drive tomorrow's multi-gigabyte Intel processor-based servers to include Chipkill Memory as a checklist item. In retrospect, this conclusion is not surprising given the direction set by IBM large-system technology. If one concedes that the mainframe model was a precursor to the distributed, network-centric model of client-server computing, then it seems inevitable that the reliability and fault-tolerance features of the mainframe will soon be demanded in the Intel processor-based server arena.

© International Business Machines Corporation 1999

IBM Personal Computer Company
Department LO6A
3039 Cornwallis Road
Research Triangle Park NC 27709

Printed in the United States of America 2-99
All rights reserved

For terms and conditions or copies of IBM's limited warranty, call 1 800 772-2227 in the U.S. Limited warranty includes International Warranty Service in those countries where this product is sold by IBM or IBM Business Partners (registration required).

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. IBM reserves the right to change specifications or other product information without notice.

IBM Netfinity servers and PC servers are assembled in the U.S., Great Britain, Japan, Australia and Brazil and are comprised of U.S. and non-U.S. parts.

IBM, Netfinity, X-architecture and Chipkill ECC Memory are trademarks of International Business Machines Corporation in the United States and/or other countries. Intel, LANDesk, Pentium and Xeon are trademarks or registered trademarks of Intel Corporation.