Advance in Electronic and Electric Engineering.
ISSN 2231-1297, Volume 3, Number 9 (2013), pp. 1121-1134
Research India Publications
http://www.ripublication.com/aeee.htm

Improved NAND Flash Memories Storage Reliability Using Nonlinear Multi Error Correction Codes

E. Ramakrishna Naik* and L.S. Devaraj**

*M.Tech (VLSI), Dept. of ECE, Intellectual Institute of Technology, Anantapur.
**Assistant Professor, Dept. of ECE, Intellectual Institute of Technology, Anantapur.
E-mail: *ramakrishnanaik4067@gmail.com, **ramakrishna4067@gmail.com

Abstract

Multi-level cell (MLC) NAND flash memories are popular storage media because of their power efficiency and large storage density. Conventional reliable MLC NAND flash memories based on BCH codes or Reed-Solomon (RS) codes have a large number of undetectable and miscorrected errors. Moreover, standard decoders for BCH and RS codes cannot be easily modified to correct errors beyond their error correcting capability $t = \lfloor (d-1)/2 \rfloor$, where $d$ is the Hamming distance of the code. In this paper, we propose two general constructions of nonlinear multi-error correcting codes based on concatenations or generalized from Vasil'ev codes. The proposed constructions can generate nonlinear bit-error correcting or digit-error correcting codes with very few or even no errors undetected or miscorrected for all codewords. Moreover, codes generated by the generalized Vasil'ev construction can correct some errors with multiplicities larger than $t$ without any extra overhead in area, latency, and power consumption compared to schemes where only errors with multiplicity up to $t$ are corrected. The design of reliable MLC NAND flash architectures can be based on the proposed nonlinear multi-error correcting codes. The reliability, area overhead, and penalty in latency and power consumption of architectures based on the proposed codes are compared to architectures based on BCH codes and RS codes.
The results show that using the proposed nonlinear error correcting codes for the protection of MLC NAND flash memories can reduce the number of errors undetected or miscorrected for all codewords to almost 0 at the cost of less than 20% increase in
power and area compared to architectures based on BCH codes and RS codes.

Index Terms: Multi-error correcting codes, nonlinear codes, reliable memory.

1. Introduction

The semiconductor industry has witnessed an explosive growth of the NAND flash memory market in the past several decades. Due to their high data transfer rate, low power consumption, large storage density, and long mechanical durability, NAND flash memories are widely used as storage media for devices such as portable media players, digital cameras, cell phones, and low-end netbooks.

The increase of storage density and reduction of cost per bit of flash memories were traditionally achieved by aggressive scaling of the memory cell transistor until the multi-level cell (MLC) technology was developed and implemented in 1997. MLC technology is based on the ability to precisely control the amount of charge stored into the floating gate of the memory cell for the purpose of setting the threshold voltage to a number of different levels corresponding to different logic values, which enables the storage of multiple bits per cell.

However, the increased number of programming threshold voltage levels has a negative impact on the reliability of the device due to the reduced operational margin. The raw bit error rate of the MLC NAND flash memory is around 10 and is at least two orders of magnitude worse than that of the single-level cell (SLC) NAND flash memory. Moreover, the same reliability concerns as for SLC NAND flash memories, e.g., program/read disturb, data retention, programming/erasing endurance, and soft errors, may become more significant for MLC NAND flash memories. Hence a powerful error correcting code (ECC) that is able to correct at least 4-bit errors is required for MLC NAND flash memories to achieve an acceptable application bit error rate, which is no larger than 10.

Several works have investigated the use of linear block codes to improve the reliability of MLC NAND flash memories.
One work presented a high-throughput and low-power ECC NAND flash memory chip incorporating a 250 MHz BCH error correcting architecture. Another demonstrated that the use of strong BCH codes (e.g., 12, 15, 67, 102) can effectively increase the number of bits per cell, thus further increasing the storage capacity of MLC NAND flash memories. An adaptive-rate ECC architecture based on BCH codes has also been proposed. That design had four operation modes with different error correcting capabilities. An ECC architecture based on
Reed-Solomon (RS) codes of length 828 with 820 information digits constructed over $GF(2^{10})$ has also been proposed, which can correct all bit errors of multiplicity less than or equal to four. The architecture achieves higher throughput and requires less area overhead for the encoder and decoder, but needs 32 more redundant bits than architectures based on BCH codes with the same error correcting capability. An architecture based on an asymmetric limited-magnitude error correcting code has been proposed as well, which can correct all asymmetric errors of bounded multiplicity.

2. MLC NAND Flash Memories

A multi-level cell is able to store multiple bits by precisely controlling the threshold voltage level of the cell. In practice, the threshold voltage of the whole memory array satisfies a Gaussian distribution due to random manufacturing variations.

The data of a NAND flash memory is organized in blocks. Each block consists of a number of pages. Each page stores data bytes and spare bytes. Cells in the spare area are physically the same as cells in the rest of the page and are typically used for overhead functions such as ECC and wear-leveling. The proportion of spare bytes in the total number of bytes per page is usually 3%, e.g., 64 spare bytes for 2048 data bytes. More spare bytes may be required as the page size increases, e.g., 218 spare bytes for 4096 data bytes.

Due to the existence of spare bytes, the number of redundant bits of the error correcting codes used for NAND flash memories is not as critical as for other types of memories such as SRAM and DRAM, where the area overhead is mostly determined by the number of redundant bits. This allows for a flexible design of more powerful error correcting codes for NAND flash memories.

Similar to SLC flash memories, the primary failure mechanisms for MLC NAND flash memories include threshold voltage distribution, program/read disturb, data retention, programming/erasing endurance, and single event upset.
However, while for SLC flash memories many errors are asymmetric, e.g., errors introduced by program disturb and data retention, for MLC NAND flash memories errors have no preferred symmetry. Moreover, experimental results show that errors in MLC flash memories are more likely to occur uniformly within a page without any observable burstiness or local data dependency. Therefore, throughout this paper we assume a random symmetric error model. Let $c$ be the error-free output of the memory and $e$ be the error vector. The distorted output is then $\tilde{c} = c \oplus e$.

3. Constructions of Nonlinear Multi-Error Correcting Codes

The error detecting properties of nonlinear codes are highly related to nonlinear functions. The nonlinearity of a function $f$ can be defined
by a measure $P_f$ expressed through $P(E)$, where $P(E)$ denotes the probability of occurrence of event $E$. The smaller the value of $P_f$ is, the higher the corresponding nonlinearity of the nonlinear function $f$ is. When $P_f$ attains its smallest possible value, $f$ is a perfect nonlinear function.

3.1 Multi-Error Correcting Codes Based on Concatenations

The first construction of nonlinear multi-error correcting codes is based on the idea of concatenating linear and nonlinear redundant digits.

Table I: Output of the Decoder for Linear Codes That Can Correct Up to $t$ Errors.

Theorem 1: Let $f$ be a nonlinear function with nonlinearity $P_f$. Let $V$ be a linear code with Hamming distance $d$ and encoding function $\varphi$. In the code formed by appending the nonlinear redundant digits $x_2 = f(x_1)$ to the information digits $x_1$ and then encoding $(x_1, x_2)$ with $V$ to obtain the linear redundant digits $x_3$, any nonzero error will be detected with a probability of at least a bound determined by $P_f$.

Proof: Let $e = (e_1, e_2, e_3)$ be the error vector, partitioned in the same way as the codeword $(x_1, x_2, x_3)$. An error is masked only if all error masking equations are satisfied. If $e_1 = 0$ and $e_2$, $e_3$ are not both 0, at least one of the error masking equations will not be satisfied. The error will always be detected.

Algorithm 1: Error Correcting Algorithm for the nonlinear multi-error correcting codes in Theorem 1

Input: $\tilde{c} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3)$
Output: $e = (e_1, e_2, e_3)$, ERR

begin
  Decode $V$, compute $S$;
  if $E_V = 0$ and $S = 0$ then
    No errors are detected, ERR = 0;
  else if $E_V = 0$ and $S \neq 0$ then
    Uncorrectable multi-errors are detected, ERR = 1;
  else if $E_V = -1$ then
    Uncorrectable multi-errors are detected, ERR = 1;
  else ($E_V > 0$)
    if $e_1 = 0$ then
      Errors in the redundant digits are detected, ERR = 0;
    else
      Compute $\hat{x}_1 = \tilde{x}_1 \oplus e_1$, $\hat{x}_2 = \tilde{x}_2 \oplus e_2$;
      Compute $\hat{s} = f(\hat{x}_1) \oplus \hat{x}_2$;
      if $\hat{s} = 0$ then
        $e = (e_1, e_2, e_3)$, ERR = 0;
      else
        Uncorrectable multi-errors are detected, ERR = 1;
end

4. Hardware Design of Encoder and Decoder for Nonlinear Multi-Error Correcting Codes

In this section, we present the encoder and decoder architectures for the proposed nonlinear multi-error correcting codes. We estimate the area, latency, and power consumption of the proposed architectures and compare them to architectures based on BCH codes and RS codes (see Section V).

4.1 Encoder Architecture

The encoders for BCH codes and RS codes are conventionally implemented based on a linear feedback shift register (LFSR) architecture. Both serial and parallel structures for LFSRs are well studied in the community. In general, a serial LFSR needs $k$ clock cycles while a parallel LFSR needs only $\lceil k/q \rceil$ clock cycles to finish the computation of the redundant bits at the cost of higher hardware complexity, where $k$ is the number of information bits and $q$ is the parallelism level of the LFSR.

Compared to the encoder for BCH codes and RS codes, the encoder for the proposed nonlinear multi-error correcting codes requires one more finite field multiplier and two registers for the computation of the nonlinear redundant bits. The detailed architecture of the encoder for the nonlinear (8281,8201,11) 5-bit error correcting code generated by Theorem 3 is shown in Fig. 2. The design is based on the parallel LFSR proposed in [26]. The parallelism level of the design is 10. During each
clock cycle, 10 information bits are input to the encoder. The most significant bit of the message is input via a separate port. The first information bit for the BCH code is derived by XORing this bit with the first input bit at the first clock cycle (as shown in the figure). The bottom half of the architecture is a parallel LFSR used to generate the redundant bits for the BCH code. Its multiplier matrix is a $10 \times 70$ binary matrix. During each clock cycle, the 10 most significant bits in the shift register are XORed with the new input and then multiplied by this matrix. The output of the multiplier is XORed with the shifted data from the shift register to generate the input to the register.

The top half of the architecture is for the computation of the nonlinear redundant bits. During even-numbered clock cycles, the 10-bit input is buffered. During odd-numbered clock cycles, the buffered data is multiplied by the new input in $GF(2^{10})$ and then added to the output registers. A 10-bit mask is XORed with the data in the output register to generate the nonlinear redundant bits. For the (8281,8201,11) 5-error-correcting code, 820 clock cycles are required to complete the encoding of the message.

Fig. 2: Architecture of the encoder for the (8281,8201,11) nonlinear 5-error-correcting code.

The encoder for the (8280,8200,11) nonlinear 5-bit error correcting code based on Theorem 1 is similar to the one shown in Fig. 2. The same structure (top half) is used to compute the 10 nonlinear redundant bits. The main differences between the two encoders are as follows. First, the encoder for the (8280,8200,11) code does not require a separate port for the most significant bit. All information bits are input via the regular input port in 820 clock cycles, assuming a parallelism level of 10. Second, the encoding of the (8280,8200,11) code needs one more clock cycle to complete compared to the (8281,8201,11) code. At the 821st clock cycle, the input to the parallel LFSR (Fig. 2) is switched to the already-generated nonlinear check bits using a 10-bit 2:1 multiplexer.

The encoders for the digit-error correcting codes are similar. They, however, require that all operations are performed in $GF(2^{10})$.
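The buffer, multiply, accumulate, and mask steps of the top half of the encoder can be sketched in software. This is a minimal illustration and not the authors' hardware: the reduction polynomial $x^{10} + x^3 + 1$, the function names `gf_mul` and `nonlinear_check`, and the mask value used in the example are assumptions chosen for the sketch, since the text does not specify them.

```python
GF_BITS = 10
# Assumed reduction polynomial x^10 + x^3 + 1; the paper does not state
# which polynomial defines its GF(2^10).
REDUCTION_POLY = 0x409


def gf_mul(a, b):
    """Multiply two GF(2^10) elements with shift-and-add and reduction."""
    result = 0
    for _ in range(GF_BITS):
        if b & 1:
            result ^= a            # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << GF_BITS):     # degree reached 10, reduce
            a ^= REDUCTION_POLY
    return result


def nonlinear_check(words, mask=0):
    """Accumulate products of consecutive 10-bit words, then apply the mask.

    Mirrors the described datapath: the word arriving on an even cycle is
    buffered, the word arriving on the next (odd) cycle is multiplied with
    it, and the product is added (XORed) into the output register.
    """
    if len(words) % 2:
        raise ValueError("expected an even number of 10-bit words")
    acc = 0
    for i in range(0, len(words), 2):
        buffered, incoming = words[i], words[i + 1]
        acc ^= gf_mul(buffered, incoming)
    return acc ^ mask


if __name__ == "__main__":
    # 820 words of 10 bits each correspond to 8200 information bits.
    message = [(37 * i + 5) % 1024 for i in range(820)]
    check = nonlinear_check(message, mask=0x155)  # hypothetical mask value
    print(f"nonlinear redundant bits: {check:010b}")
```

Because the check is a sum of products, the change it undergoes for a given error pattern depends on the stored data itself, which is what distinguishes it from the linear BCH redundancy computed by the LFSR.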
4.2 Decoder Architecture

The decoding of the proposed nonlinear multi-error correcting codes requires the decoding of a BCH code or an RS code. The standard decoder for BCH codes mainly contains three parts: the syndrome computation block, the error locator polynomial generation block, and the Chien search block. Compared to the decoder for BCH codes, the decoder for RS codes requires one more block to compute the error magnitude. We next briefly discuss the implementation of the above four blocks and then present the decoder architecture for the proposed nonlinear multi-error correcting codes.

1) Syndrome Computation: Without loss of generality, assume that the BCH code is a narrow-sense BCH code. Let us denote the received codeword by $\tilde{c}(x)$. For a $t$-error-correcting BCH code, the syndromes are defined as $S_i = \tilde{c}(\alpha^i)$, $1 \le i \le 2t$, where $\alpha$ is the primitive element of $GF(2^m)$. For binary BCH codes, $S_{2i} = S_i^2$. Hence only the odd-numbered $S_i$ need to be computed from $\tilde{c}(x)$. The other syndromes can be computed using a much simpler square circuit in $GF(2^m)$. To improve the throughput of the decoder, a parallel design can be applied to process multiple bits per clock cycle. Fig. 3 shows the syndrome computation circuit with a parallelism level of $q$ for one $S_i$. For the whole syndrome computation block, $t$ such structures are needed.

2) Error Locator Polynomial Generation: After the syndromes are computed, the error locator polynomial $\Lambda(x)$ will be generated using the Berlekamp-Massey (BM) algorithm. The hardware implementations of the BM algorithm have been well studied in the community. In our design, a previously proposed fully serial structure is used to minimize the area overhead. The design mainly requires three Galois field multipliers and two FIFOs. The number of clock cycles needed to generate the error locator polynomial of degree $t$ depends on $t$. For our design, $t = 5$ and 20 clock cycles are needed for the generation of $\Lambda(x)$.

3) Chien Search: Let us denote the primitive element in $GF(2^m)$ by $\alpha$. The Chien search algorithm exhaustively tests whether $\alpha^i$ is a root of the error locator polynomial $\Lambda(x)$. If it is, the corresponding error location is identified.
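The exhaustive root test performed by the Chien search can be illustrated with a small software sketch. It is only an illustration under stated assumptions: it works in the toy field $GF(2^4)$ with the primitive polynomial $x^4 + x + 1$ rather than in the field used by the design, all function names are hypothetical, and it reports the exponents $i$ for which $\alpha^i$ is a root without mapping them to memory bit positions, because that mapping depends on the code convention.

```python
M = 4
POLY = 0b10011   # x^4 + x + 1, a primitive polynomial for the toy field GF(2^4)
ALPHA = 0b10     # the element x, a primitive element for this polynomial


def gf_mul(a, b):
    """Multiply two GF(2^M) elements with shift-and-add and reduction."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result


def gf_pow(a, e):
    """Raise a to the non-negative integer power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def poly_eval(coeffs, x):
    """Evaluate sum(coeffs[j] * x^j) with Horner's rule (lowest degree first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def locator_from_roots(root_exponents):
    """Build the product of (x + alpha^e) over the given exponents (demo helper)."""
    poly = [1]
    for e in root_exponents:
        r = gf_pow(ALPHA, e)
        new = [0] * (len(poly) + 1)
        for j, c in enumerate(poly):
            new[j] ^= gf_mul(c, r)   # c * r * x^j
            new[j + 1] ^= c          # c * x^(j+1)
        poly = new
    return poly


def chien_search(locator, n):
    """Return every i in [0, n) for which alpha^i is a root of the locator."""
    roots = []
    point = 1                        # alpha^0
    for i in range(n):
        if poly_eval(locator, point) == 0:
            roots.append(i)
        point = gf_mul(point, ALPHA)
    return roots


if __name__ == "__main__":
    locator = locator_from_roots([3, 7])
    print(chien_search(locator, 15))
```

The sketch evaluates the polynomial from scratch at every point, which is the straightforward form of the search. Hardware implementations avoid that repeated work by updating each term incrementally.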
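Writing the error locator polynomial as $\Lambda(x) = \sum_{j=0}^{t} \lambda_j x^j$, with $\alpha$ the primitive element of the field, the evaluation can be rewritten as $\Lambda(\alpha^{i+1}) = \sum_{j=0}^{t} \lambda_j \alpha^{j(i+1)} = \sum_{j=0}^{t} (\lambda_j \alpha^{ji}) \alpha^{j}$. The computation complexity is reduced based on the fact that each term for position $i+1$ is the corresponding term for position $i$ multiplied by the constant $\alpha^{j}$. The algorithm can also be parallelized to test multiple positions per clock cycle. A typical implementation of the algorithm with a parallelism level of $q$ contains multiplexers and registers, multipliers for multiplication by a constant, and Galois field adders. A strength-reduced parallel Chien search architecture has also been proposed. Its authors showed that by a simple transformation of the error locator polynomial, most of the Galois field multiplications can be replaced by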
shift operations, resulting in much lower hardware complexity (see Fig. 4). For details of the architecture, please refer to the original work.

4) Error Magnitude Computation for RS Codes: Besides the error locator polynomial $\Lambda(x)$, the Berlekamp-Massey algorithm can also generate the error magnitude polynomial $\Omega(x)$, which is defined in terms of $\Lambda(x)$ and the syndrome polynomial $S(x)$. According to Forney's algorithm, the error magnitude at each error position can be computed from $\Omega(x)$, the derivative $\Lambda'(x)$ of $\Lambda(x)$, and an integer parameter. It is easy to verify that $x\Lambda'(x)$ is simply the sum of the terms with odd degrees in $\Lambda(x)$ and can be directly derived during the computation of $\Lambda(\alpha^i)$.

5) Decoder Architecture for Nonlinear Multi-Error Correcting Codes: The decoder for the nonlinear multi-error correcting codes presented in Theorem 1 is similar to the decoders for BCH codes and RS codes. In fact, most of the decoding can be completed by the standard BCH or RS decoder. The main differences are as follows. First, the nonlinear multi-error correcting codes need to compute the nonlinear syndrome $S$ (see Algorithm 1) when receiving possibly distorted codewords and recompute it after correcting the errors located by the decoder of the linear code. Second, after the decoding of the linear code is completed and the nonlinear syndrome is recomputed, one more clock cycle is required for the decoder of the nonlinear code to verify the error correcting results so that possible miscorrection of errors can be prevented.

Fig. 3: Syndrome computation block with a parallelism level of q for BCH codes.

The decoder for the nonlinear multi-error correcting codes based on Theorem 3 is slightly more complicated than the decoder for codes based on Theorem 1. As an example, the detailed architecture of the decoder for the (8281,8201,11) nonlinear 5-bit error correcting code is shown in Fig. 5. The whole decoding procedure requires 1675 clock cycles assuming a parallelism level of 10. During the first 827 cycles, the nonlinear syndrome and the syndrome of the BCH code are computed. If no errors are detected by the BCH code, the decoding procedure will be completed at the 828th clock cycle.
Depending on the value of the nonlinear syndrome, either the first two information bits will be flipped or ERR will be pulled down by the ERR generating circuit, which indicates that there are no errors in the information bits of the code. The error locator polynomial generation and the Chien search will be invoked only when errors are detected by the BCH code, which can effectively reduce the average decoding latency.

Fig. 4: Strength-reduced Chien search architecture with a parallelism level of q.

If errors are detected by the BCH code, the Berlekamp-Massey algorithm will take another 20 clock cycles to generate the error locator polynomial. After this, the Chien search block will exhaustively test all possible error locations. If a tested element is a root of the error locator polynomial, then the corresponding position is an error location. Since a (8270,8200,11) shortened BCH code is used, only the values corresponding to positions of the shortened code need to be computed. The original strength-reduced Chien search architecture is slightly modified for the decoding of shortened BCH codes. The constant inputs to the bottom Galois field multipliers in Fig. 4 are changed accordingly, and the associated internal value is initialized and then serially updated during the Chien search stage. Starting from the 848th clock cycle, the 10-bit FIFO output (the possibly distorted codeword) and the decoded 10-bit error vector will be buffered in two 10-bit registers. At each odd-numbered clock cycle, the recomputed nonlinear syndrome is updated from these buffered values.
Fig. 5: Decoder architecture for the proposed (8281,8201,11) nonlinear 5-error-correcting code.

At the 1675th clock cycle, the computed values are used to recheck whether the most significant two bits are successfully corrected. A 2-bit error mask will be generated to make adjustments to these two bits according to the check results.

The decoder for the 5-digit error correcting code based on Theorem 3 presented in Section V and the decoder for the (8281,8201,11) nonlinear 5-bit error correcting code differ as follows.

1. All operations of the decoder for the 5-digit error correcting code are performed in $GF(2^{10})$.
2. The 5-digit error correcting code does not require a parallel architecture. A serial design can achieve a similar decoding latency in terms of the number of clock cycles to the decoder for the (8281,8201,11) 5-bit error correcting code with a parallelism level of 10.
3. One more block for the computation of the error magnitude is integrated into the architecture shown in Fig. 5. The block is connected to the Chien search block and generates the final decoded memory contents.

The error magnitude polynomial is generated by the Berlekamp-Massey block. To reduce the hardware overhead, the Galois field multipliers for the calculation of the nonlinear syndrome are reused to generate the error magnitude polynomial. One Galois field inverter is required to compute the error magnitude according to Forney's algorithm [see (22)]. In general, inverters in a Galois field have a much longer critical path than multipliers. Thus a four-stage pipeline is added to reduce the latency of the inverter. For a nonzero element $a$ of $GF(2^{10})$, the inverse can be represented as $a^{-1} = a^{2^{10}-2}$. Given the fact that a four-stage pipeline is implemented, the above function can be realized using square operations and five multiplications in $GF(2^{10})$. Again we reuse
multipliers in other blocks for the purpose of reducing the hardware overhead. Since the square operation is simple in a Galois field of characteristic 2, the inverter adds minimal area overhead and has a latency similar to the Galois field multiplier in our design.

4.3 Area, Latency, and Power Consumption

The area, latency, and power consumption for architectures based on the six alternatives presented in Section V are shown in Table III. The designs are modelled in Verilog and synthesized in RTL Design Compiler using the 45-nm NANGATE library [34]. In practice, the logic circuits used in NAND flash memory could be different from those used in standard digital designs. The estimation presented here is only for the purpose of investigating the increase in area, power, and latency of architectures based on the proposed nonlinear multi-error correcting codes compared to architectures based on the widely used BCH codes and RS codes.

During synthesis, we fixed the clock rate for the encoder and decoder and compared the area and power consumption for architectures based on different codes. The encoders work at 1 GHz. The decoders work at a lower frequency of 400 MHz due to the long critical path in the Berlekamp-Massey block [12]. The six alternatives require similar latency in terms of the number of clock cycles for encoding and decoding. Due to the computation of the error magnitude and the pipeline for the inverter in the Galois field, digit-error correcting codes (RS, etc.) need eight more clock cycles to complete the decoding compared to bit-error correcting codes (BCH, etc.).

The encoders for digit-error correcting codes require 40% to 50% more area overhead and power than the encoders for bit-error correcting codes (see Figs. 6 and 7) due to the fact that all operations are in $GF(2^{10})$. The decoders for digit-error correcting codes, however, require 20% to 30% less overhead in area and power because of a much simpler serial architecture.
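The cycle counts and clock rates quoted in this section can be turned into approximate wall-clock latencies. The short sketch below does only that arithmetic for the (8281,8201,11) code, using the figures stated in the text (820 encoding cycles at 1 GHz; 828 decoding cycles when no error is detected and 1675 cycles for the full procedure at 400 MHz). The function and constant names are illustrative, and the resulting times are derived here rather than reported by the authors.

```python
ENCODER_CLOCK_HZ = 1_000_000_000   # encoders run at 1 GHz
DECODER_CLOCK_HZ = 400_000_000     # decoders run at 400 MHz


def latency_us(cycles, clock_hz):
    """Convert a cycle count at a given clock rate to microseconds."""
    return cycles * 1e6 / clock_hz


# Cycle counts quoted in the text for the (8281,8201,11) code.
encode_us = latency_us(820, ENCODER_CLOCK_HZ)
decode_clean_us = latency_us(828, DECODER_CLOCK_HZ)    # no errors detected
decode_full_us = latency_us(1675, DECODER_CLOCK_HZ)    # full decoding procedure

if __name__ == "__main__":
    print(f"encode:             {encode_us:.2f} us")
    print(f"decode (no errors): {decode_clean_us:.2f} us")
    print(f"decode (full):      {decode_full_us:.2f} us")
```

Under these figures, encoding takes about 0.82 microseconds, decoding an error-free codeword about 2.07 microseconds, and the full decoding procedure about 4.19 microseconds, which shows how much the early exit after syndrome computation shortens the common case.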
Compared to BCH codes and RS codes, the proposed nonlinear multi-error correcting codes need about 10% to 20% more area and power in total for the encoder and decoder and have similar latency in terms of the number of clock cycles required to complete the encoding and decoding. The (8281,8201,11) nonlinear 5-bit error correcting code based on Theorem 3 (columns 6 and 7 in Table III), for example, requires 17.5% more area and consumes 10.0% more power in total for the encoder and decoder compared to the (8262,8192,11) BCH code. We note that the encoder and decoder are only a very small portion of the MLC NAND flash memory chip, where the major portion is the memory cell array. Therefore the increase in area overhead for the encoder and decoder is not significant for reliable memory design.

5. Implementation and Results

The proposed scheme for improving NAND flash memory storage reliability using nonlinear multi-error correction codes was implemented as follows. The code is completely synthesized using Xilinx XST and
implemented on device family Spartan 3E, device XC3S500E, package FG320 with speed grade -4.

6. Conclusion

In this paper, constructions of two nonlinear multi-error correcting codes are proposed. Their error correcting algorithms are presented. The proposed codes have far fewer undetectable and miscorrected errors than the conventional BCH codes and RS codes. The designs of reliable MLC NAND flash memories based on the proposed nonlinear multi-error correcting codes are presented. We compare the area, latency, and power consumption of reliable MLC NAND flash architectures using the proposed nonlinear multi-error correcting codes to architectures based on BCH codes and RS codes. The encoders and decoders for all the alternatives are modeled in Verilog and synthesized in RTL Design Compiler. The results show that architectures based on nonlinear multi-error correcting codes can have close to zero undetectable and miscorrected errors while consuming less than 20% more area and power than architectures based on BCH codes and RS codes.

References

[1] G. Atwood, A. Fazio, D. Mills, and B. Reaves, "Intel Strata memory technology overview," Intel Technol. J., vol. 1, 1997 [Online]. Available: http://www.intel.com/technology/itj/archive/1997.htm
[2] J. Cooke, "The inconvenient truths about NAND flash memory," presented at Micron MEMCON Presentation, Santa Clara, CA, 2007.
[3] R. Dan and R. Singer, "Implementing MLC NAND flash for cost-effective, high capacity memory," M-Syst. White paper, 2003 [Online]. Available: http://support.gateway.com/s/manuals/desktops/5502664/implementing_mlc_nand_flashwhite%20paper.pdf
[4] R. Bez, E. Camerlenghi, A. Modelli, and A. Visconti, "Introduction to flash memory," Proc. IEEE, vol. 91, no. 4, pp. 489-502, Apr. 2003.
[5] G. Cellere, S. Gerardin, M. Bagatin, A. Paccagnella, A. Visconti, M. Bonanomi, S. Beltrami, R. Harboe-Sorensen, A. Virtanen, and P. Roche, "Can atmospheric neutrons induce soft errors in NAND floating gate memories?," IEEE Electron Device Lett., vol. 30, no. 2, pp. 178-180, Feb. 2009.