A Multiple Bit Upset Tolerant SRAM Memory

Gustavo Neuberger, Fernanda de Lima, Luigi Carro, Ricardo Reis
Universidade Federal do Rio Grande do Sul, PPGC - Instituto de Informática - DELET
Av. Bento Goncalves 9500, Bloco IV, Porto Alegre, RS, Brazil
<neuberg, fglima, carro, reis>@inf.ufrgs.br

ABSTRACT

This paper presents a high-level technique to protect SRAM memories against multiple upsets based on error-correcting codes. The proposed technique combines a Reed-Solomon code and a Hamming code to assure reliability in the presence of multiple bit flips with reduced area and performance penalties. Multiple upsets were randomly injected in various combinations of memory cells to evaluate the robustness of the method. The experiment was emulated on a Virtex FPGA platform. Results show that 100% of the injected double faults and a large number of multiple faults were corrected by the method.

1: Introduction

Constant improvement in process technology has remarkably reduced transistor geometry and power supply levels in integrated circuits. In high-density circuits operating at low voltage, memory cells store information with less capacitance, which means that less charge or current is required to store the data. Unfortunately, a direct consequence is an increase in device vulnerability to radiation, as charged particles that were once negligible are now much more likely to produce upsets [1].

When a single charged particle strikes the silicon, it loses its energy, resulting in a dense ionized track in the local region. The ionization causes a transient current pulse. This effect is called Single Event Upset (SEU). Although SEU is the major concern in space applications, multiple bit upsets (MBU) are also becoming a matter to be addressed because of nanometric technologies. When a single high-energy ion passes through the silicon, it can energize two or more adjacent memory cells [2]. MBUs can be induced by direct ionization or nuclear recoil.
The energy of the particle is more likely to provoke double bit upsets, while multiple bit upsets are caused by an increase in the particle incident angle. In [3, 4, 5], experiments on memories under proton and heavy ion fluxes have shown the probability of multiple upsets provoked by a single ion.

Several techniques are used nowadays to mitigate upsets in memory components. They are based either on a specific process technology, such as Silicon on Insulator (SOI), or on design, such as replacing each memory cell by a hardened memory cell and using error detection and correction codes (EDAC) [6]. Each technique has advantages and drawbacks, and there is always a compromise between cost, area, performance, power and fault tolerance efficiency. EDAC is an attractive technique because it can be implemented at a high-level design step, without changes in the mask process (no NRE cost). There are examples of SEU mitigation techniques using EDAC performed by software [7] and by hardware [8].

Hamming code is widely used to protect memories against SEU because of its ability to correct single upsets with a reduced area and performance overhead [9]. However, multiple upsets caused by a single charged particle can provoke errors in a system protected by the single-error-correcting Hamming code, since it assumes that at most one upset is corrected per coded word. On the other hand, Reed-Solomon [6] is a block-based error correcting code, able to cope with multiple upsets. It has a wide range of applications in digital communications and storage, including storage devices, wireless or mobile communications, high-speed modems and others. Reed-Solomon encoding and decoding is commonly carried out in software; for this reason, the RS algorithms normally found in the literature do not take into account area and performance effects for hardware implementation.
We have developed an efficient Reed-Solomon encoding and decoding description with single block correction capability, designed for hardware implementation, to protect embedded system memories against multiple upsets. Although the designed RS code copes with multiple upsets, it does not correct double faults located in two adjacent blocks, which would require an RS code with double block correction capability. The cost of a double block correction code is too high compared to the single block correction code, which makes this alternative inappropriate for hardware implementation.

This paper presents a technique that achieves 100% correction of double faults without using the costly RS code with double block correction capability. This new technique to protect memories against single and multiple upsets combines Hamming and Reed-Solomon codes with a reduced area overhead compared to more complex codes able to correct multiple upsets. The robustness of the

technique was evaluated by fault injection emulated on a Virtex board. Results show that 100% of the single and double faults were corrected by the technique, as well as a large number of multiple faults. To the authors' knowledge, no references have been published about a high-level technique able to correct all double faults and a large number of multiple faults in embedded memories, as presented in this paper.

2: Previous Work

Hamming code is an error-detecting and error-correcting binary code that satisfies the relation $d + p + 1 \le 2^p$, where d is the number of data bits and p is the number of parity bits. Following this relation, the Hamming code can correct all single-bit errors on d-bit words and detect double-bit errors when an overall parity check bit is used (SEC-DED) [6]. The Hamming code implementation is composed of a combinational block responsible for coding the data (encoder block), extra bits in the word that store the parity (extra latches or flip-flops) and another combinational block responsible for decoding the data (decoder block). The encoder block calculates the parity bits and can be implemented by a set of 2-input XOR gates. The decoder block is more complex than the encoder block, because it must not only detect the fault but also correct it. It is basically composed of the same logic used to compute the parity bits plus a decoder that indicates the address of the bit that contains the upset. The decoder block can also be composed of a set of 2-input XOR gates and some AND and INVERTER gates.

In [9], studies show the area efficiency of using Hamming code to protect memories. However, it does not cope with multiple upsets. Consequently, more complex correcting codes must be investigated. Reed-Solomon [6] is an error-correcting coding system that was devised to address the issue of correcting multiple errors.
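The Hamming relation stated earlier in this section, and the encoder and decoder structure described with it, can be illustrated with a short sketch. This is not the authors' VHDL description: the function names (`parity_bits`, `hamming_encode`, `hamming_correct`) and the bit ordering (parity bits at power-of-two positions) are assumptions made for illustration, and Python is used only as executable pseudocode. The overall parity bit needed for double-error detection is left out.

```python
def parity_bits(d):
    """Smallest p that satisfies d + p + 1 <= 2**p."""
    p = 0
    while d + p + 1 > (1 << p):
        p += 1
    return p


def syndrome(code):
    """XOR of the 1-based positions of all bits that are set."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def hamming_encode(data):
    """Place data at non-power-of-two positions, then fill the parity bits."""
    n = len(data) + parity_bits(len(data))
    code = [0] * n
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos - 1] = next(bits)
    s = syndrome(code)               # parity bits are chosen to cancel this
    i = 0
    while (1 << i) <= n:
        code[(1 << i) - 1] = (s >> i) & 1
        i += 1
    return code


def hamming_correct(code):
    """Correct at most one flipped bit; the syndrome is its position."""
    s = syndrome(code)
    fixed = list(code)
    if 0 < s <= len(code):
        fixed[s - 1] ^= 1
    return fixed
```

Under these assumptions, `parity_bits(16)` gives 5 and `parity_bits(145)` gives 8, which agrees with the Hamming flip-flop counts reported in the tables of this paper.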
A Reed-Solomon code is specified as RS(n, k) with s-bit symbols, where n is the total number of symbols per code word and k is the number of data symbols. The number of parity symbols is equal to $n - k$, where n is at most 2 raised to the power of s, minus one ($n = 2^s - 1$). A Reed-Solomon decoder can correct up to t symbols, where $2t = n - k$. So, a Reed-Solomon code with t = 1 can correct all single-bit errors and up to s errors in the same symbol (block), but it cannot correct a double bit error if these bits are located in two adjacent symbols. To implement a Reed-Solomon code with t = 1, two parity symbols are needed, commonly named R and S. For example, an RS code with 5-bit symbols has a maximum of 31 symbols (155 bits). In order to have single symbol correction, it must have 2 parity symbols and 29 data symbols.

The encoding process has to divide the data bits into s-bit symbols, multiply each symbol by the appropriate constants, and perform XOR operations between the multiplication results in order to find the R and S symbols. The RS code algorithm needs a high number of tables for conversion purposes. For hardware implementation, however, tables have a large area overhead, and performing the operation in only one cycle requires one table for each conversion. In the developed implementation, the conversion tables were replaced by logical and arithmetical operations. Consequently, the encoder block is basically implemented by XOR gates.

The decoding process divides the received bits into R, S and the data symbols, multiplies the symbols by the RS constants, and performs XOR operations on these results in order to find S0 and S1. If S0 and S1 are 0, no errors are found; otherwise, S1/S0 is the location of the error and S0 is the error pattern. In a decoder that only detects errors, with no correction, the steps are basically the same as in the encoder. So, the full decoder block is more complex because of the logic needed to locate and correct the error.
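The single-symbol-correcting scheme described above (two parity symbols, syndromes S0 and S1, error location S1/S0 and error pattern S0) can be sketched as follows. This is only an illustration under stated assumptions: the field polynomial, the placement of the two parity symbols at positions 0 and 1, and all function names are hypothetical, and the sketch uses log/antilog tables for brevity, whereas the implementation in this paper replaces tables with logic.

```python
M = 7                    # symbol width in bits (assumed)
POLY = 0b10001001        # x^7 + x^3 + 1, assumed field polynomial
N = (1 << M) - 1         # number of nonzero field elements (127)

# Build antilog (EXP) and log (LOG) tables for alpha = 2.
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
_x = 1
for _i in range(N):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & (1 << M):
        _x ^= POLY
for _i in range(N, 2 * N):
    EXP[_i] = EXP[_i - N]
assert len(set(EXP[:N])) == N    # sanity check: alpha generates the field


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    # b must be nonzero
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % N]


def syndromes(word):
    """S0 = XOR of all symbols, S1 = XOR of symbol_i * alpha^i."""
    s0 = s1 = 0
    for i, sym in enumerate(word):
        s0 ^= sym
        s1 ^= gf_mul(sym, EXP[i])
    return s0, s1


def rs_encode(data):
    """Prepend two parity symbols so that both syndromes become zero."""
    a, b = syndromes([0, 0] + list(data))
    p1 = gf_div(a ^ b, 1 ^ EXP[1])
    p0 = a ^ p1
    return [p0, p1] + list(data)


def rs_decode(word):
    """Correct any error pattern confined to one symbol."""
    s0, s1 = syndromes(word)
    if s0 == 0 and s1 == 0:
        return list(word)                    # no error found
    if s0 == 0 or s1 == 0:
        raise ValueError("more than one symbol in error")
    pos = LOG[gf_div(s1, s0)]                # S1/S0 locates the symbol
    if pos >= len(word):
        raise ValueError("error location outside the code word")
    fixed = list(word)
    fixed[pos] ^= s0                         # S0 is the error pattern
    return fixed
```

With 16 data symbols of 7 bits each, any number of flipped bits inside a single symbol is restored by `rs_decode`, while flips spread over two symbols are outside what this t = 1 sketch can correct.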
Reed-Solomon codes with symbols of 5, 6, 7 and 8 bits, all with correction capability of one symbol, were designed in VHDL. Encoding and decoding are performed in a single cycle [10]. This description was optimized for area and performance by using multipliers that do not rely on tables. The Reed-Solomon and Hamming cores were synthesized in a VirtexE FPGA (V600EHQ240) from Xilinx [11] and simulated using ModelSim from Mentor. Results of area, extra flip-flops and delay of encoding and decoding were compared to a Hamming code with the same word width. The design area occupied in the FPGA is measured in the number of 4-input Look-up Tables (4-LUT) used to implement the logic. Table 1 shows the results of the RS(155, 145) compared to a 145-bit Hamming code.

                 Reed-Solomon          Hamming
                 Encoder  Decoder      Encoder  Decoder
# 4-LUTs         226      474          134      363
# flip-flops     10                    8
Delay (ns)       15.4     45.5         14.1     24.4

Tab. 1 Area and performance comparison between Hamming code (145-bit) and Reed-Solomon (145-bit)

Based on table 1, we can see that the Hamming code has less area overhead than the Reed-Solomon, but the advantage of the Reed-Solomon is that it can correct 1 to 5 bit errors if they are in the same symbol, versus only 1 bit corrected among all 145 bits for the Hamming code. In most cases the errors are consecutive, and the Reed-Solomon can correct them if they are in the same symbol, while Hamming code cannot. Another important result is that in both codes the decoding process has more overhead in area and delay than the encoding process.

3: New MBU Fault-tolerant Memory Architecture

An n-bit data memory can be protected against faults by using correcting code techniques based on encoder and

decoder blocks and extra bits to store the data parities, as represented in figure 1. The encoder and the decoder can use any error detection and correction code. A limitation of this approach is that the data is only coded in write operations and decoded in read operations. So, accumulation of upsets is likely to occur, and it depends on how frequently the application reads and writes. In order to avoid accumulation of upsets, extra logic is needed to constantly detect and correct upsets in all coded data. This logic represents area cost, and it may interfere in the normal operation of the memory if a dual-port memory is not used. For example, when Hamming code is used, the memory can only tolerate single bit upsets. The problem is that multiple upsets are likely to occur in current and future technologies. In addition, upsets can accumulate in the memory if the correction process is not fast enough compared to the upset flux.

Fig. 1 General schematic of a fault tolerant memory

In order to be able to correct multiple bit upsets, RS code must be used. The data word is divided into symbols, and each data word is a different RS coded word. For example, in a 256-row memory, the data word uses the entire row, and each data word is divided into m symbols according to the symbol size and to the memory data size. Multiple upsets may occur in any portion of the matrix, but they are more likely to occur as double bit flips that are in the same symbol (upset type a), in vertically adjacent symbols (upset type b), or in horizontally adjacent symbols (upset type c) (figure 2). The RS code can easily correct upsets of type a, because this is the essential property of the code: multiple error correction within the same symbol. The second type of double upset (type b) will also be corrected because each row is a different RS code, so this is equivalent to two single errors in distinct rows.
But the third type of upset (type c) will not be corrected, because it is equivalent to errors in two different symbols of the same coded word, and the implemented RS is not capable of correcting this type of error.

Fig. 2 Examples of double bit flips in a memory where each row is protected by RS

So, a new code is needed to correct all possible double errors. The first option is the use of a Reed-Solomon code with the capability to correct two different symbols. But this RS code has more than twice the area and delay overhead of the single symbol correction RS [6], which makes this solution inappropriate for memory architectures. The second alternative is the use of bits protected by Hamming code between the RS symbols. The number of bits protected by Hamming will be the same as the number of symbols protected by Reed-Solomon, so this option does not significantly increase the area overhead. Figure 3 presents the insertion of Hamming code in a row already coded by RS code.

Fig. 3 Schematic of a memory row protected by Reed-Solomon and Hamming code

This new approach was analyzed for all possible single, double and higher multiplicity errors, as shown in figure 4 and table 2. All single upsets are corrected by the code. Double bit upsets occurring horizontally or vertically are corrected by the code because each row is a different coded word and there is Hamming code protection at the interface between RS symbols. Some multiple bit upsets, such as numbers 7, 8 and 9, are also corrected by the proposed method. The only type of multiple upset that is not corrected by the method is number 10. However, special placement of the RS coded word symbols may solve this problem.

Fig. 4 Representation of single, double and multiple upset types in a memory

3.1: The Memory Case Study

Once the codes to be used are defined, the next step is to determine the specifications of the memory and the sizes of the Reed-Solomon and Hamming codes.
The first memory case study is a 128-bit data memory, where a 128-bit code is needed. Based on the previously compared tradeoffs of RS code [10], a 7-bit symbol RS code presents an efficient area and performance overhead compared to the number of protected bits. Consequently, a 7-bit symbol RS code was chosen to protect 112 bits of the data memory and a 16-bit Hamming code was chosen to protect the bits between each RS symbol, as presented in figure 3. The extra bits to be stored are 14 bits due to Reed-Solomon and 5 bits due to Hamming code, totaling 19 parity bits for each row. Note that in this first memory case study, the size of the stored data is the same as that of the physical matrix row. In the next section, we discuss the use of this method for any size of data memory in a 128-bit matrix row. Table 3 shows the area and delay of the 112-bit Reed-Solomon code and the 16-bit Hamming code.
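The row organization of this case study, and the reason why interleaving Hamming bits between RS symbols handles horizontal double upsets, can be sketched as below. The exact bit order inside a row is an assumption (each 7-bit RS symbol followed by one Hamming-protected bit, 16 groups per row), since figure 3 is the only definition in the text; the names `classify` and `is_corrected` are hypothetical, and upsets in the stored parity bits are not modeled.

```python
from collections import defaultdict

SYMBOL_BITS = 7                 # RS symbol width used in the case study
GROUP = SYMBOL_BITS + 1         # assumed layout: 7 RS bits + 1 Hamming bit
GROUPS_PER_ROW = 16             # 16 * 8 = 128 data bits per row

RS_PARITY_BITS = 2 * SYMBOL_BITS        # two parity symbols, 14 bits
HAMMING_PARITY_BITS = 5                 # for the 16 Hamming-protected bits


def classify(col):
    """Map a data column (0..127) to ('RS', symbol) or ('H', bit index)."""
    group, offset = divmod(col, GROUP)
    if offset < SYMBOL_BITS:
        return ("RS", group)
    return ("H", group)


def is_corrected(flips):
    """flips: iterable of (row, col) upsets.

    A row is recovered when at most one RS symbol and at most one
    Hamming-protected bit of that row are hit.
    """
    rs_hit = defaultdict(set)
    ham_hit = defaultdict(set)
    for row, col in flips:
        kind, index = classify(col)
        if kind == "RS":
            rs_hit[row].add(index)
        else:
            ham_hit[row].add(index)
    rs_ok = all(len(s) <= 1 for s in rs_hit.values())
    ham_ok = all(len(s) <= 1 for s in ham_hit.values())
    return rs_ok and ham_ok
```

Under this assumed layout, two horizontally adjacent bits never belong to two different RS symbols, so every double upset is covered, while a run of three bits that spans an RS symbol, a Hamming bit and the next RS symbol is not.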

Upset     Location                                          Fault type  Effect (Reed-Solomon combined with Hamming code)
Single    Reed-Solomon symbol                               1           Corrected, single error in only one RS symbol
Single    Hamming bit                                       2           Corrected, single error in the Hamming data bits
Double    RS symbol                                         3           Corrected, multiple error in only one RS symbol
Double    Vertical RS adjacent symbols                      4           Corrected, single error in different RS codes
Double    Vertical Hamming adjacent bits                    5           Corrected, single error in different Hamming codes
Double    Horizontal RS and Hamming adjacent bits           6           Corrected, single error in RS symbol and single error in Hamming data bits
Multiple  Vertical RS symbols                               7           Corrected, single error in different RS codes
Multiple  Vertical Hamming bits                             8           Corrected, single error in different Hamming codes
Multiple  Horizontal RS symbol and Hamming bit              9           Corrected, single error in RS symbol and single error in Hamming data bits
Multiple  Horizontal RS symbol, Hamming bit and RS symbol   10          Detected but not corrected, single error in Hamming data bits but error in two different RS symbols in the same row

Tab. 2 Upset Analysis in Space Environment

                 16-bit Hamming        112-bit RS
                 Encoder  Decoder      Encoder  Decoder
# 4-LUTs         22       99           215      538
# extra ffs      5 x # of rows         14 x # of rows
Delay (ns)       9.3      21.7         14.5     47.6

Tab. 3 Area and delay of Reed-Solomon and Hamming codes used to protect a memory

Figure 5 shows the final architecture of the double error tolerant memory. There are two encoder and decoder blocks, one for Hamming code and the other for RS code. The parity bits are also stored in the memory in a reserved area. The placement of the RS parity symbols and Hamming parity bits must also be taken into account to avoid double upsets in the same Hamming coded parity word or in two parity symbols of the same RS coded word. However, because the method is robust to a large number of upsets in a memory, a very high flux of charged particles is necessary to provoke an accumulation of upsets able to overcome the technique. In order to refresh the data, correcting the faults, a dual-port memory can be used.
During read operations, if any error is detected during decoding, the correct value is written in the memory using the second port, without interruptions in the normal memory operation. The data memory is then error-free. This approach is shown in figure 6.

                 Non-protected Memory   Fault Tolerant Memory
# 4-LUTs         95                     770
# BlockRAMs      16                     18
Speed (MHz)      71                     30

Tab. 4 Area and performance comparison between a non-protected 128-bit memory and a full fault tolerant 128-bit memory based on RS and Hamming code

Fig. 5 Final schematic of the fault tolerant memory

The memory case study was described in VHDL and synthesized in a Virtex-E FPGA using BlockRAMs and CLBs. The simulation was performed using ModelSim from Mentor [12]. Results are presented in table 4. The results show that the area overhead of the fault tolerant memory is basically the area used by the encoder and decoder blocks. Only two more BlockRAMs are needed, one to store the RS redundancy symbols and the other to store the Hamming extra bits. The performance penalty in the fault tolerant memory synthesized in the FPGA is around 50%. Note that this new mitigation technique still performs encoding only in write operations and decoding only in read operations.

Fig. 6 Fault tolerant memory implemented using a two-port memory

4: GENERALIZING THE MEMORY SIZE

The MBU mitigation technique proposed in this paper was presented for a case study memory of 128-bit data words. But not all processors can work with a 128-bit memory. One possible solution is to design a new code for each data width. This is not a good solution, because it requires designing a completely new library of encoding functions for the new data width. And for common words of 8 bits, Reed-Solomon and Hamming need too many extra bits and the area occupied will be significantly

larger compared to the original area. So, it is better to adapt the memory without changing the code used. This section shows the steps necessary to adapt the previously designed memory to an interface with an 8-bit processor, but the process can be easily modified for other data widths.

At the output, it is easy to change from 128 bits to 8 bits. It only needs a multiplexer to choose the correct 8 bits. The selection signal of the multiplexer is formed by the least significant bits of the address signal.

The input data of the memory needs more modifications than the output, because now only 8 bits of the data must be encoded, and the other 120 bits remain the same. Looking more closely at the encoding process of the Reed-Solomon and Hamming codes, one will see that they are basically exclusive-or (XOR) operations. The encoding step of the Reed-Solomon code consists of dividing the input data bits between the symbols, multiplying each symbol by an appropriate constant, and computing the parity bits as XOR operations of these products. The XOR operation has two important properties: A xor 0 = A, and A xor 1 = not A. Knowing these properties, we can see that it is easy to subtract the part of a symbol that is contributing to the parity bits. It is only a XOR of the parity bits with the product of the appropriate constant and the old symbol. To get the new parity bits, we use XOR gates again, but with the product of the new symbol and the same constant.

So, we need to design new Reed-Solomon and Hamming encoders, whose inputs are the old parity bits, the old data, the new data and the address of the data; the output is the new parity data. For example, the steps of the Reed-Solomon encoding process are now: use the address to get the appropriate constant, multiply the old symbol and the new symbol by the constant, and XOR the old parity with the two products.
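The parity update described above relies on the parity symbol being a XOR of constant-times-symbol products, so the contribution of one symbol can be removed and replaced without touching the others. A minimal sketch follows; the field polynomial, the choice of constants as successive powers of 2 in the field, and the function names are assumptions for illustration, and the write is assumed to replace one whole 7-bit symbol (the mapping of an 8-bit word onto symbols and Hamming bits is not modeled).

```python
M = 7                    # symbol width in bits (assumed)
POLY = 0b10001001        # x^7 + x^3 + 1, assumed field polynomial


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M), no tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def full_parity(symbols, consts):
    """Standard encoder: one multiplication per data symbol."""
    p = 0
    for sym, c in zip(symbols, consts):
        p ^= gf_mul(c, sym)
    return p


def updated_parity(old_parity, const, old_symbol, new_symbol):
    """Dedicated encoder: only two multiplications per write."""
    remove_old = gf_mul(const, old_symbol)   # cancels the old contribution
    add_new = gf_mul(const, new_symbol)      # adds the new contribution
    return old_parity ^ remove_old ^ add_new
```

For any row, changing one symbol and calling `updated_parity` with the constant selected by that symbol's address gives the same result as recomputing `full_parity` over the whole row, which is why the narrow-write encoder needs the old data but far fewer multipliers.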
Besides adapting the encoder to the new memory, this change significantly reduces the encoder area, because it now needs only two multiplications, compared to 16 multipliers in the full encoding version shown before. The comparison between the new and the old encoders is presented in table 5.

                 Standard RS Enc   Dedicated RS Enc
# 4-LUTs         215               96
Delay (ns)       14.5              18.5

Tab. 5 Area and performance comparison between the standard RS and the RS dedicated for 8-bit memories

Now we have a new encoder that can work with a data input of 8 bits. But the problem is that the new encoder needs the data previously stored at the same address. Usually, in a write operation the new data is stored at the specified address, and it is not necessary to know the old data. But for this memory, we need it. So, the solution is to read the memory on the rising edge of the clock, and to write only on the falling edge. The user's data and address inputs are stored in two auxiliary registers while the old data is read from the memory. The write enable signal is also stored in a latch to be used on the falling edge of the clock. Note that there is no clock phase latency, just a delay in the write cycle that is transparent to the user. One drawback of this solution is the need for extra SEU mitigation techniques for the two registers (data and address) and the write signal latch. The final schematic of the 8-bit memory is shown in Fig. 8.

Fig. 8 MBU tolerant 8-bit memory schematic

The comparison between the original 128-bit memory and this adapted 8-bit memory is presented in table 6. Analysing these results, the changes increase the area overhead by approximately 60%. But the speed has a bigger overhead, caused by the use of both rising and falling edges of the clock for the write process.
                 Fault Tolerant 128-bit Memory   Fault Tolerant 8-bit Memory
# 4-LUTs         770                             1040
# BlockRAMs      18                              18
Speed (MHz)      30                              17

Tab. 6 Area and performance comparison between the fault tolerant 128-bit memory and the fault tolerant 8-bit memory

5: Fault Injection Emulation Experiment

Fault injection is an attractive technique for design evaluation due to its high flexibility in terms of spatial and temporal information. The process involves inserting faults into particular targets of a system at a determined time and monitoring the results to define the system's behavior in response to a fault. It also has a reduced turnaround time and evaluation cost compared to traditional radiation ground testing. Moreover, it has been shown that injecting faults in a programmable logic platform, after synthesizing the full system in a Field Programmable Gate Array (FPGA), can speed up the process by many orders of magnitude [13].

The case study memory evaluated in the experiment is an MBU tolerant 128-bit memory protected by the proposed method based on RS and Hamming code. The fault injection test platform is composed of one AFX V600HQ240-100 daughter card, a JTAG cable used as an interface to a host PC, and a control panel. The system operates in conjunction with a host PC in order to collect the data in the Chipscope analyzer tool [14].

In order to evaluate the robustness of the protected memory, static and dynamic tests were performed by fault injection. During the static test, the protected memory

under test (DUT) and a gold memory (fault free) were initialized with random values in the first 16 row positions (2048 total bits per memory). Then, in the next cycles, all types of faults discussed in table 2 were injected in the DUT memory. Just one fault was inserted per analysis. The two memories were read and their values were compared. If they differed, the fault was not tolerated. As predicted, only fault type 10 was not tolerated.

After that, the faults were injected dynamically as follows. In the initialization, 16 pairs of 128-bit random numbers were stored in two memories under test, and their sums were stored in a gold memory, used for later comparison. Random types of faults were injected at a random instant of time in a random position. The two memories under test were continuously read and the sum of their outputs was compared to the sum stored in the gold memory. After all positions were read, the memories were reinitialized and another random type of fault was injected. The injected faults are the same fault types discussed in table 2. Each fault type has a different probability of occurring according to the application environment. Single upsets were set to be more likely in the fault injection experiment (2:1), while double and multiple upsets are less likely to occur but are also tested by the fault injection. These fault probability parameters can be customized by the user in the VHDL code to match the target application.

A total of 1,002 faults were injected, classified in the 10 different types of faults. As predicted, all single and double faults were tolerated. In the case of multiple faults, the method does not tolerate faults upsetting two adjacent RS symbols and the Hamming bit between these two symbols (fault type number 10). For the application used, 21% of the injected multiple faults of type number 10 were tolerated by the application.
Although this type of fault is not expected to be corrected by the method, it can be tolerated depending on the application.

6: Conclusion

This paper presented a high-level technique to protect SRAM memories against single and multiple upsets based on error-correcting codes. The proposed technique combines Reed-Solomon code and Hamming code to assure reliability in the presence of multiple bit flips with reduced area overhead (only 19 parity bits per 128-bit row) and a performance penalty of around 50%. Multiple upsets were randomly injected in all memory cells to evaluate the robustness of the method. The experiment was emulated on a Virtex FPGA platform. Results show the efficiency of the proposed method in the presence of all single and double upsets and many types of multiple upsets. All double faults and a large combination of multiple faults were corrected by the method. The only type of multiple fault that was detected but not corrected by the method is the one where multiple upsets (three or more) affect two different RS code symbols (for example, fault type 10).

Future work involves implementing the method in an ASIC in order to obtain a more precise area and performance overhead evaluation compared to a non-protected memory component, and making a design that embeds this memory and several others in order to compare area and performance penalties.

ACKNOWLEDGMENTS

The authors would like to thank CNPq Brazilian Agency for their support to this work.

REFERENCES

[1] JOHNSTON, A., Scaling and Technology Issues for Soft Error Rates, 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.
[2] REED, R., et al., Heavy Ion and Proton Induced Single Event Multiple Upsets, IEEE Nuclear and Space Radiation Effects Conference (NSREC), July 1997.
[3] WROBEL, F., PALAU, J., CALVET, M., BERSILLON, O., DUARTE, H., Simulation of Nucleon-Induced Nuclear Reactions in a Simplified SRAM Structure: Scaling Effects on SEU and MBU Cross Sections, IEEE Transactions on Nuclear Science, December 2001.
[4] JOHANSSON, K., OHLSSON, M., OLSSON, N., BLOMGREN, J., RENBERG, P., Neutron Induced Single-Word Multiple-bit Upset in SRAM, IEEE Transactions on Nuclear Science, December 1999.
[5] BUCHNER, S., CAMPBELL, A., MEEHAN, T., CLARK, K., McMORROW, D., DYER, C., SANDERSON, C., COMBER, C., KUBOYAMA, S., Investigation of Single-Ion Multiple-Bit Upsets in Memories on Board a Space Experiment, IEEE Transactions on Nuclear Science, June 2000.
[6] HOUGHTON, A. D., The Engineer's Error Coding Handbook, Chapman & Hall: London, 1997.
[7] SHIRVANI, P., SAXENA, N., McCLUSKEY, E., Software Implemented EDAC Protection Against SEUs, IEEE Transactions on Reliability, Special Section on Fault-Tolerant VLSI Systems, Vol. 49, No. 3, pp. 273-284, Sep. 2000.
[8] REDINBO, G., NAPOLITANO, L., ANDALEON, D., Multibit Correcting Data Interface for Fault-Tolerant Systems, IEEE Transactions on Nuclear Science, April 1993.
[9] HENTSCHKE, R., MARQUES, F., LIMA, F., CARRO, L., SUSIN, A., REIS, R., Analyzing Area and Performance Penalty of Protecting Different Digital Modules with Hamming Code and Triple Modular Redundancy, In: Symposium on Integrated Circuits and Systems Design (SBCCI), 2002.
[10] NEUBERGER, G., LIMA, F., REIS, R., Designing a Reed-Solomon Core Optimized for Area and Performance, XVII South Symposium on Microelectronics, June 2002.
[11] XILINX, INC., Virtex 2.5 V Field Programmable Gate Arrays, Xilinx Datasheet DS003, v2.4, Oct. 2000.
[12] MENTOR, ModelSim SE User's Manual, Sep. 2001.
[13] LIMA, F., CARRO, L., VELAZCO, R., REIS, R., Injecting Multiple Upsets in a SEU tolerant 8051 Microcontroller, In: Latin-American Test Workshop, 2002.
[14] XILINX, INC., Chipscope Software and ILA Cores User Manual, Xilinx User Manual, 0401884 (v2.0), Dec. 2000.