CHAPTER 5 MINIMIZED MEMORY VITERBI DECODER ARCHITECTURE USING ZIG-ZAG ALGORITHM


5.1 INTRODUCTION

Viterbi decoding is a representative decoding method for convolutional codes. It is widely used in communication systems and signal processing to achieve low-error-rate data transmission. When erroneous data are received, the closest codeword is selected using maximum likelihood decoding (MLD). In implementation, there are two common methods used to determine and store the survivor path: the Register Exchange (RE) method (Jens Spars et al 1991) and the Trace Back (TB) method (Shu Lin and Costello 1983). The register exchange algorithm needs as many multiplexers and dual-port memory locations as the number of states multiplied by the survivor path length, and these are activated every cycle to update the data in memory. This results in considerable power consumption and a large circuit area. Thus, the trace back method is preferred when the constraint length is large. In the trace back method, the survivor states from the add-compare-select (ACS) units are simply stored in the survivor path memory in sequence and are used for trace back after the survivor path is determined. Although the Viterbi algorithm has been in general use for decades, improvements in decoding efficiency and memory requirements are still in demand. Most of the previous research (Gerhard Fettweis and Fettweis 1992) targeted modification of the ACS process to accelerate the

decoding process, but resulted in the need for extra hardware. Therefore, we turn our focus to the other components of the Viterbi decoder and find that there is still room for improvement in the SMU, such as the high latency of trace back management (TBM) and the large memory requirement of the trace back method. This motivates us to search for a better SMU design that is expected to improve both the decoding efficiency and the hardware requirements of the TBM. For survivor memory management, conventionally, survivor state metrics have to be stored at every stage until the entire data sequence has been processed, and only then does the trace back process start. However, with conventional encoder structures a larger number of stages is required to store the state metric values and decode the output sequence. In the proposed method only one additional stage is required, instead of a number of stages, to trace back and decode the entire data sequence. Hence, trace back can be realized faster by hopping over a block of redundant stages. With the same concept, a new memory management method for the Viterbi decoder is naturally developed so as to achieve better performance in two respects: the decoding efficiency is improved, approaching the performance of the TB method, and approximately 50% of the TB memory used to store trace back information can be saved. Several algorithms for trace back memory management in Viterbi decoders, namely the one-pointer algorithm, the k-pointer odd algorithm, the k-pointer even algorithm and the hybrid algorithm, were presented and implemented in Feygin and Gulak (1993). Among these, the one-pointer algorithm is the best, as its memory requirement is approximately half. A pipelined Viterbi decoder with R = 1/2 and k = 7 using look-ahead trace back is implemented in Baek et al (2001). More than 40% of the memory is reduced by this method. A pretrace back architecture for the survivor memory unit of Viterbi decoders,

targeting wireless communication applications, is the focus of Yao Gang et al (2005). This method reduces the survivor memory operations by 50%, and the memory size as well as the latency is reduced by 25% compared to the conventional trace back. A Modified Register Exchange (MRE) method for the IS-95 reverse link is introduced in Chanho Lee (2004). The Viterbi decoder is modeled for R = 1/3, k = 9 and a trace forward depth of 45 in 0.35 µm CMOS technology. The memory size is reduced by 32% and the latency by 50%. The same method can be used for 3G (W-CDMA) systems when a serial architecture with 4 ACS units is implemented, as in Chaiwat Keawsai et al (2004), where the data rate of this architecture exceeds 2 Mbps. The structural similarity between the Viterbi algorithm and the Fast Fourier Transform (FFT) is discussed in Lihong Jia et al (1998). Based on the memory management and data routing techniques developed for long-size FFTs, a pipelined architecture is implemented to realize the Viterbi algorithm for moderate-speed applications. The implementation results in a chip area of about 11 mm²; the ACS area is about 7 mm², which occupies about 65% of the total area. In David Yeh et al (1996) the architecture and implementation of a constraint length 14, reconfigurable (RACER) Viterbi decoder that achieves a decoding rate of 41 Kbps is specified. The system uses 36 Xilinx 4010 FPGAs in a multi-ring, general-cascade Viterbi decoder architecture. The results show that a decoding rate of 1 Mbps can be achieved with this technology. A trace back technique that utilizes a novel forward tracing algorithm for an HDTV Viterbi decoder is described in Hu et al (1999). System-level simulation verified the new trace back technique. An algebraic formulation of survivor memory management is introduced in Gerhard Fettweis and Fettweis (1992), and it provides a framework for the derivation of new algorithmic and architectural solutions. VLSI case studies show that about 50% savings are possible in hardware complexity as well as power consumption. A generalized method using precompiled trace-back is

presented, and its resolution by a graphical method is given, in Ming Bo Lin (2000), which provides an alternative method based on permutation networks for memory management. Here, instead of using registers for storage, permutation networks are used, and the resulting circuit has a smaller routing area than the register exchange method. It also has a faster decoding speed than the trace back technique, regardless of the constraint length.

This chapter presents a novel approach, which minimizes the memory usage compared to the trace back method. The trace back method uses two blocks of RAM arrays: while one RAM is used for storage, the other performs the trace back, and vice versa. We have designed a Viterbi decoder for efficient memory management by adopting the Zig-Zag algorithm. Here, the algorithm uses a single block of RAM instead of two, programmed in such a way that it performs the storage as well as the trace back. This minimizes the memory usage and hence reduces the area occupied by the Viterbi system.

5.2 MEMORY MANAGEMENT TECHNIQUES

In the decoder, the SMU is the block that recovers the received data based on all the information from the PMU. It also consumes a large amount of power. For a trace back SMU with RAMs, up to 63% of the overall power is consumed, as it requires a large memory to store the local and global winner information as well as complex logic to generate the decoded data (Munteanu 2000). Two major types of SMU implementation exist: Register Exchange (Jens Spars et al 1991) and Trace Back (Shu Lin and Costello 1983).

5.2.1 Register Exchange Approach

Figure 5.1 illustrates the principle of a four-state register exchange architecture (Kubota et al 1993). In this architecture, a register is assigned to

each state and contains the decoded data for the survivor path from the initial time slot to the current time slot. As illustrated in Figure 5.1, the ideal path is indicated with bold arrows. According to the local winner of each state, the register content is shifted into another state register and appended with the corresponding decoded data. For instance, at time slot T1 the survivor branch for state 1 is from state 0 at T0; therefore, the initial content of the state 0 register, which is a 0, is shifted into the state 1 register at T1 and the corresponding decoded data for the survivor branch, which is a 1, is appended to it. Registers on the ideal path, as shown in Figure 5.1, spread their contents to the other state registers as time progresses due to the nature of the ACS process. Thus, at the end of time slot T4, the state registers all contain the bit(s) from the same source register, which is the state 1 register at time T1. As shown in Figure 5.1, the two most significant bits of each register at time slot T4 are 01. Therefore, this is the decoded output for time slots T0 and T1.

Figure 5.1 A four-state register exchange implementation of the SMU design (the bold arrows indicate the ideal path of the encoder states)
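To make the shift-and-append behaviour described above concrete, the following minimal Python sketch models the register exchange update for a small trellis. The state count, the survivor-branch table and the per-branch decoded bits are illustrative assumptions, not values taken from Figure 5.1.

# Minimal sketch of the register exchange SMU update (illustrative only).
# Each state keeps a register holding the decoded bits of its survivor path.

NUM_STATES = 4

def register_exchange_step(registers, survivor_from, decoded_bit):
    # registers     : list of bit-strings, one per state
    # survivor_from : survivor_from[s] = predecessor state of the survivor branch into s
    # decoded_bit   : decoded_bit[s]   = data bit associated with that survivor branch
    # Every state copies its predecessor's register and appends the branch's decoded
    # bit, so all NUM_STATES registers are rewritten every cycle (hence the power cost).
    return [registers[survivor_from[s]] + str(decoded_bit[s]) for s in range(NUM_STATES)]

# Hypothetical survivor decisions for two time slots (not taken from the figure):
registers = [""] * NUM_STATES
registers = register_exchange_step(registers, survivor_from=[0, 0, 1, 1], decoded_bit=[0, 1, 0, 1])
registers = register_exchange_step(registers, survivor_from=[0, 2, 0, 2], decoded_bit=[0, 1, 1, 0])
print(registers)  # once every register starts with the same oldest bits, those bits are the decoded output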

The register exchange approach is claimed to provide high throughput (Kubota et al 1993), as it eliminates the need to trace back, since the state register already contains the decoded output sequence. However, it is clearly not power efficient, as moving data from one register to another wastes a large amount of power. In addition, D-type flip-flops rather than transparent latches need to be used to implement the shift registers, although the amount of data that needs to be held to determine the output is identical to that required for the trace back approach. All of this leads to relatively high power consumption.

5.2.2 Trace Back Approach

The trace back approach is generally a lower-power alternative to the register exchange method. In trace back, one local winner bit is assigned to each state to indicate whether the survivor branch comes from the upper or the lower position. Using this local winner, it is possible to track down the survivor path starting from a final state, and starting from a global winner state, as previously discussed, enhances this search. Figure 5.2 shows a trace back SMU architecture adopted from the architecture described in Riocreux et al (2001), which used global winner information. Here, local winners are stored in the local winner memory. Trace back is started at the global winner from the PMU, which is used as an address to read out the local winner of the global winner state. Then, in the trace back logic, the previous global winner in the trace back is produced by shifting the current global winner one place to the right and inserting the read-out local winner into the most significant bit position; this arithmetic relationship between parent and child states derives from the butterfly connection shown in Figure 5.3. This new global winner can then be stored into the global winner memory to update the global winner existing at that time slot. The process repeats with the updated global winner reading out its

local winner, which is used to form the global winner for the previous time slot. This process continues until the global winner formed agrees with that stored or it reaches the oldest time slot (Riocreux et al 2001). In the output logic, shown in Figure 5.2, the decoded output can be obtained from the least significant bit of the global winners stored in the global winner memory.

Figure 5.2 A possible trace back SMU implementation using memory

As described in the last section, local and global winners are stored in memory. So, for each trace back, local winners are repeatedly read out from the local winner memory and new global winners are written back to the global winner memory. This results in complex read/write control mechanisms. Furthermore, unless flip-flop storage is used, multi-port SRAM blocks are required, as seen in previous implementations (Joeressen and Meyr 1995). Moreover, it is preferable to run trace backs in parallel, as an incorrect trace back may damage a good path and a new trace back is needed to correct this as soon as possible. It has been suggested in Black and Meng (1997) that the read-write-based trace back also has a serious speed overhead due to the need to access multiple memory pointers. Therefore, reducing the

complexity of the trace back logic and memory, increasing the trace back throughput, and reducing the SMU power consumption are all current research issues in Viterbi decoder designs (Joeressen and Meyr 1995).

Figure 5.3 The butterfly state transition diagram representing the state transitions of a convolutional encoder of constraint length k

Many approaches have been proposed to address these issues, e.g. increasing the number of pointers for parallel trace backs, decreasing the memory access time of the read operation, or increasing the access rate of the read operation in a time-multiplexed method (Chang et al 2000). However, none of them change the fundamental read-write architecture of the trace back implementations, so they have only limited success in solving these problems.

5.3 EXISTING TRACE BACK METHOD

The memory-trace back method has commonly been used in low-throughput, low-power applications. The SMU consists of an SMU control unit and two RAM blocks, as shown in Figure 5.4. The memory-trace back method stores the intermediate decision bits at static locations in memory. Since RAM blocks typically operate by reading or writing multiple bits per cycle, a vector of decisions output by the parallel ACSs can be written into

memory simultaneously. At the end of the first storage cycle, RAM 1 will be filled. The second cycle starts with storage in RAM 2 and, simultaneously, trace back in RAM 1. Thus, in the trace back method, the storage as well as the trace back takes place, but in different RAM blocks.

Figure 5.4 Survivor memory unit in the trace back method

The trace back operation only needs to recall the decision bits that correspond to nodes along a particular trace back path. When one memory is full, data is written to the next memory, and so on. When the two memory blocks are full, the trace-back operation starts. When the first received symbol in a memory block is handled, the state is stored so that the starting point for the previous memory block's trace-back operation is known. This contrasts with the register-exchange method, which constantly moves an array of decision bits through a pipeline of flip-flops. The use of standard SRAM modules offers little power or area advantage over register exchange because of the overhead of peripheral circuitry and standard word addressing.
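A minimal Python sketch of the trace back operation described in Sections 5.2.2 and 5.3 is given below. It assumes the survivor decisions for one block have already been written into a RAM bank as one local-winner bit per state per time slot, and it applies the pointer update of Figure 5.3 (shift the current state one place to the right and insert the local winner at the most significant bit). The block length, constraint length, decision values and the LSB decoding convention are illustrative assumptions.

# Minimal sketch of a conventional trace back over one stored block (illustrative only).

K = 4                     # assumed constraint length
STATE_BITS = K - 1        # states are (K-1)-bit values
NUM_STATES = 1 << STATE_BITS

def trace_back(decisions, start_state):
    # decisions[t][s] is the local winner bit of state s at time t.
    # Returns the decoded bits for the block, oldest first.
    state = start_state                        # usually the global winner from the PMU
    decoded = []
    for t in range(len(decisions) - 1, -1, -1):
        local_winner = decisions[t][state]
        # Decoded bit: with this state numbering the newest encoder input
        # sits in the least significant bit of the current state.
        decoded.append(state & 1)
        # Butterfly relation: previous state = shift right, local winner into MSB.
        state = (local_winner << (STATE_BITS - 1)) | (state >> 1)
    return decoded[::-1]

# Hypothetical decision bits for a 6-symbol block (one bit per state per time slot):
decisions = [[0, 1, 0, 1, 1, 0, 0, 1] for _ in range(6)]
print(trace_back(decisions, start_state=0))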

5.4 PROPOSED ZIG-ZAG ALGORITHM

The memory management in Viterbi decoders is done in the Survivor Memory Unit (SMU). Usually, a trace back of the trellis structure is carried out in order to restore the data sequences, and a block of RAM performs the storage of the state metrics. Memory management generally deals with reducing the size of the RAM, thereby reducing the silicon area. The survivor sequences from the ACS unit are stored as sequences in a RAM unit. The RAM is an array of recursive pointers. A single RAM is used to store the survivor sequence bits in one direction and to trace the sequence back in the opposite direction. Once the first packet has been processed, the memory is full of state metric values, and the trace back then takes place in the reverse direction. Figure 5.5 shows the direction of storage as well as trace back using the Zig-Zag algorithm. In the first pass, the direction of storage is from A to B, the forward direction. After the state metrics for the first packet symbols have been stored, the trace back proceeds from B to A. In the second pass, the direction of storage is from B to A, the reverse direction, and the direction of trace back is from A to B, the forward direction. Here both the storage and the trace back use the same memory. All operations are performed in this way, with the direction alternating in a zig-zag manner. This technique thereby reduces the amount of memory utilization.

Figure 5.5 Schematic representation of the direction of storage as well as trace back using the Zig-Zag algorithm
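To illustrate the alternating-direction operation described above, here is a minimal Python sketch of a single survivor RAM whose write direction and trace back direction swap on every packet. The packet length, state count and decision contents are illustrative assumptions; the trace back step reuses the pointer update sketched in Section 5.3, and for clarity the sketch performs the trace back after the store rather than overlapping them column by column as the hardware does.

# Minimal sketch of the Zig-Zag survivor memory management (illustrative only).
# One RAM (a list of columns) is written in one direction while the previous
# packet is traced back in the opposite direction; the directions swap each packet.

NUM_STATES = 8
PACKET_LEN = 20

class ZigZagSMU:
    def __init__(self):
        self.ram = [None] * PACKET_LEN   # one column of decision bits per time slot
        self.forward = True              # current write direction (A -> B when True)

    def store_packet(self, decision_columns):
        # Write one packet's decision columns into the RAM in the current direction.
        order = range(PACKET_LEN) if self.forward else range(PACKET_LEN - 1, -1, -1)
        for column, t in zip(decision_columns, order):
            self.ram[t] = column

    def trace_back_packet(self, start_state):
        # Trace the stored packet back in the direction opposite to the last write.
        order = range(PACKET_LEN - 1, -1, -1) if self.forward else range(PACKET_LEN)
        state, decoded = start_state, []
        for t in order:
            local_winner = self.ram[t][state]
            decoded.append(state & 1)                   # decoded bit (assumed LSB convention)
            state = (local_winner << 2) | (state >> 1)  # butterfly step for 8 states
        self.forward = not self.forward                 # next packet is stored the other way
        return decoded[::-1]

smu = ZigZagSMU()
smu.store_packet([[0] * NUM_STATES for _ in range(PACKET_LEN)])
print(smu.trace_back_packet(start_state=0))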

The RAM is organized as a rectangular page where each column contains all the state metrics computed for a given bit time t, and moving forward one column along a row corresponds to bit time t+1. The state forms the row address and the bit time forms the column address. The structure of the survivor RAM unit with state metrics is shown in Table 5.1.

Table 5.1 Structure of the RAM with state metric values of the received symbols

Time: t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 = tL-1, t20 = tL
S7: 6 8 10 12 14 6 8 12 12 8 12 12 8 10 12 12 8 12
S6: 0 10 12 10 12 0 10 10 14 10 10 14 10 12 10 14 10 10
S5: 8 6 12 14 12 8 6 14 10 6 14 10 6 12 14 10 6 14
S4: 10 0 14 12 10 10 0 12 12 0 12 12 0 14 12 12 0 12
S3: 0 12 10 8 10 0 12 10 10 6 10 10 6 10 8 10 6 10 10
S2: 6 14 12 10 8 6 14 12 8 0 12 8 0 12 10 8 0 12 8
S1: 0 10 10 12 6 0 10 10 12 0 8 12 0 8 12 6 0 8 12 0
S0: 0 6 8 12 14 0 6 8 12 14 6 10 14 6 10 14 0 6 10 14 6
Information sequence: 11000110010010001001 (20-bit information)
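As a small illustration of this organization, the sketch below models the survivor RAM as a two-dimensional page addressed by state (row) and bit time (column), and picks the minimum-metric state in the final column as the starting point for the trace back described next. The dimensions and metric values are illustrative assumptions, not the entries of Table 5.1.

# Minimal sketch of the rectangular survivor RAM page (illustrative only).
# Row address = state, column address = bit time.

NUM_STATES = 8
NUM_STAGES = 21          # t0 .. t20 = tL for a 20-symbol packet

# A page of path metrics; in hardware this is a single RAM, here a 2-D list.
page = [[99] * NUM_STAGES for _ in range(NUM_STATES)]   # 99 = placeholder "large" metric

def write_metric(state, t, metric):
    page[state][t] = metric          # row = state, column = bit time

def best_final_state():
    # State with the smallest accumulated metric in the last column (trace back start).
    last = NUM_STAGES - 1
    return min(range(NUM_STATES), key=lambda s: page[s][last])

write_metric(state=3, t=NUM_STAGES - 1, metric=6)   # hypothetical winning metric
print(best_final_state())                           # -> 3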

At time t0 the first column of the RAM is initialized with zeros; then, at each successive time, a set of ACS operations is completed and the resulting state metrics are stored in the appropriate column of the RAM. At each time, the column address is incremented by one while the row addresses stay the same. At the end of a state update cycle, the entire survivor path memory contains all the state metric values. The trellis path converges, and the trace back then starts in the opposite direction, from tL to t1. The decoding process begins by building the accumulated error metric for the 20 received channel symbols and selecting the state with the smallest accumulated error metric at each time instant. Once this information is built up, the Viterbi decoder is ready to regenerate the sequence of bits that was input to the convolutional encoder when the message was encoded for transmission. The first packet trace back proceeds from stage tL to stage tL-1, selects the state having the smallest accumulated error metric, and decodes the binary information. Once the decoding from stage tL to stage tL-1 has been completed, the metric values in the last column tL are no longer required for decoding the second bit. This freed tL stage memory is used to store the metrics calculated for the first symbol of the second cycle. The trace back similarly proceeds backwards stage by stage through the state history table and decodes the information. Simultaneously, the second cycle's storage operations are done in the reverse direction, from tL to t1. The process continues until all the message sequences are processed. The storage of the sequences as well as the trace back is performed in a zig-zag manner, hence the name Zig-Zag algorithm. This algorithm provides a greater area advantage compared to the two-RAM trace back method.

5.4.1 Proposed Architecture of the VD

The architecture of the VD consists of a Branch Metric Unit, a Path Metric Unit and a Survivor Memory Unit. The proposed VD needs an external

RAM block, whose size depends on the application. The architecture has been designed for scalability to meet the requirements of applications with high performance needs. The basic block diagram of a Viterbi decoder is shown in Figure 3.16. The VD has been implemented with eight states and eleven stages. The number of states corresponds to the size and the performance of the decoder: a larger number of states leads to a bigger size and better error correction.

5.4.2 Survivor Memory Unit (SMU)

The SMU consists of two parts: the SMU control block and the RAM block. Here the RAM blocks are typical memories with address lines and data lines. Writing and reading are enabled in the RAM blocks for storing the decision values. All the functionality of the SMU is inside the SMU control block. The SMU behaves such that the decision values are written to memory. The parameter trace back length defines for how many symbols the decision values are written to the memory block. The trace-back operation is performed by moving backwards through the trellis diagram. The state that has the smallest path metric value is stored to memory (the PMU gives that state). On the basis of that information, the address of the decision value in the RAM is calculated. This value is read, the previous state is calculated with the aid of the decision value, and the bit that caused that state transition is stored. In this way the whole memory block is utilized.

5.4.3 Memory Management in Viterbi Decoders

The Register Exchange (RE) method obtains the decoded data using multiplexers and dual-port memory. This method is not used at present because of the large power consumption and large area required in VLSI implementation.

The trace back (TB) method is the preferred method in the design of large-constraint-length, high-performance VDs because of its lower power dissipation. Here, we have used the Zig-Zag algorithm, which allows less memory usage than the conventional trace back (TB) method. Choosing optimized values of the parameters compensates for the trade-off in speed due to the minimized area. The memory-trace back method has commonly been used in low-throughput, low-power applications. It stores the intermediate decision bits at static locations in memory. Since RAM blocks typically operate by reading or writing multiple bits per cycle, a vector of decisions output by the parallel ACSs is written into memory simultaneously. The trace back operation only needs to recall the decision bits that correspond to nodes along a particular trace back path. When one memory is full, data is written to the next memory, and so on. When the two memory blocks are full, the trace-back operation starts. When the first received symbol in the memory block is handled, the state is stored so that the starting point for the previous memory block's trace-back operation is known. This contrasts with the register-exchange method, which constantly moves an array of decision bits through a pipeline of flip-flops. The use of standard SRAM modules offers little power or area advantage over register exchange because of the overhead of peripheral circuitry and standard word addressing. The memory for the trace back method, however, permits the design of a very compact RAM that provides significant area advantages. In a 0.18 µm CMOS technology, the area of a typical SRAM cell is about 2.4 µm², in contrast with the 50 µm² area required for a flip-flop used in the register-exchange method (Hu et al 1999). The Zig-Zag method utilizes all the advantages of the memory trace back. The basic SMU structure of the Zig-Zag algorithm consists of a single SMU control unit and a single RAM, as shown in Figure 5.6. The storage as well as trace back using a single RAM with the Zig-Zag algorithm is shown in

Figure 5.7. The encoder output is 6 bits for computational purposes, and the decision bits are 20 bits for the Zig-Zag algorithm and 40 bits for the trace back algorithm.

Figure 5.6 Survivor memory unit in the Zig-Zag algorithm

Figure 5.7 The trellis diagram for the Viterbi decoder with k = 4, R = 1/6 and N = 8

5.5 RESULTS AND DISCUSSION

The proposed VD has been designed and implemented on a Xilinx Spartan II FPGA with constraint length 4 and code rate 1/6. In order to reduce the memory, the Zig-Zag algorithm has been adopted in the survivor memory unit. We have analyzed various parameters for the effective memory utilization of the path metric unit, the trace back memory unit, the total memory used for computation, and the latency. The values achieved with our proposed algorithm have been compared with the existing conventional trace back algorithm and are presented in Table 5.2. We have considered a trace back depth of 20, path metric memory bits of 16*6, encoder output bits of 6 for computational purposes, and decision bits of 20 for the Zig-Zag algorithm and 40 for the trace back algorithm.

Table 5.2 Comparison of one-pointer trace back and Zig-Zag algorithm memory utilization

Item | Conventional trace back method (Feygin and Gulak 1993) | Proposed Zig-Zag algorithm
Input buffer | L*n = 20*6 = 120 | L*n = 20*6 = 120
Path metric memory | S*p = 8*16*6 = 768 | S*p = 8*16*6 = 768
Trace back memory | L*S*d = 20*8*40 = 6,400 | L*S*d = 20*8*20 = 3,200
Total memory | 7,288 (100%) | 4,088 (56.09%)
Latency | (L+L)*K = 40*4 = 160 (100%) | L*K = 20*4 = 80 (50%)

L : Trace back depth = 20
p : Path metric memory bits = 16*6
n : Encoder o/p bits = 6
d : Decision bits = 20 (for Zig-Zag algorithm), 40 (for trace back method)
S : No. of states = 8
K : Constraint length = 4
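The totals in Table 5.2 follow directly from the listed parameters; the short Python check below reproduces them, using only the formulas and values given in the table (a sketch, not part of the implementation).

# Reproducing the Table 5.2 memory and latency figures from the stated parameters.
L, n, S, K = 20, 6, 8, 4          # trace back depth, encoder output bits, states, constraint length
p = 16 * 6                        # path metric memory bits per state

def totals(d):
    # Total memory (bits) for a given number of decision bits d.
    input_buffer = L * n          # 120
    path_metric  = S * p          # 768
    trace_back   = L * S * d      # 6,400 for d = 40, 3,200 for d = 20
    return input_buffer + path_metric + trace_back

tb_total = totals(d=40)           # conventional trace back
zz_total = totals(d=20)           # proposed Zig-Zag
print(tb_total, zz_total)                      # 7288 4088
print(round(100 * zz_total / tb_total, 2))     # 56.09 (% of TB memory)
print((L + L) * K, L * K)                      # latency: 160 vs 80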

Table 5.3 Comparison of total memory utilization with different constraint lengths at a constant code rate of 1/6

Constraint length (K) | Conventional trace back method, total memory | Proposed Zig-Zag method, total memory
3 | 2082 | 1182
4 | 7288 | 4088
5 | 2322 | 13222
6 | 70068 | 41268
7 | 206162 | 127762
8 | 606448 | 401648
9 | 1823502 | 1305102
10 | 5706028 | 4426028

Table 5.3 shows the memory utilization with different constraint lengths for the one-pointer and zig-zag approaches. Figure 5.8 shows the memory usage of the different methods of the Viterbi decoder. It is seen from the figure that the memory usage of the buffer and the PMM is the same for both the TB-based and the Zig-Zag-based architectures. It is also seen that the memory utilization of the conventional method is higher than that of the proposed method. Using this method, the latency can also be reduced by about 50%. The trade-off between area and latency is balanced here by choosing optimized parameters. Figure 5.9 shows the memory utilization with different constraint lengths for the one-pointer and zig-zag approaches. Here, the memory of the zig-zag approach is lower than that of the trace back approach. The reduction in the memory of the VD increases the performance of the system and thereby reduces its latency.

Figure 5.8 Memory usage of the one-pointer trace back and the Zig-Zag algorithm

Figure 5.9 Reduced total memory utilization with different constraint lengths at a constant code rate of 1/6

The synthesis reports of the proposed method and the existing trace back method are described in Table 5.4. The proposed method utilizes only 1227 slices, whereas the TB method utilizes 2492 slices. The power dissipation of the proposed method is only 450 mW, while the conventional TB method has a power dissipation of 542.08 mW, which is more than the proposed method.

Table 5.4 Comparison of resource utilization results for the one-pointer trace back algorithm and the Zig-Zag algorithm

Item | Trace back method (Feygin and Gulak 1993) | Zig-Zag algorithm
No. of RAM blocks | 2 | 1
Operating frequency | 83.243 MHz | 452.694 MHz
Total delay | 12.013 ns | 6.785 ns
No. of slices | 2492 | 1227
No. of 4-input LUTs | 1875 | 1406
Gate count | 36008 | 18514
Power dissipation | 542.08 mW | 450.00 mW

5.6 SUMMARY

In this chapter, we have reported a novel Zig-Zag algorithm for survivor memory management in Viterbi decoders. We have utilized one RAM block for both the survivor memory storage and the trace back. We have achieved a latency of 50% compared to 100% for the trace back algorithm, and only 56.09% of the total memory of the trace back method is utilized in the proposed algorithm. The operating frequency for the trace back algorithm is 83.243 MHz, but it is 452.694 MHz for the proposed approach. The total delay is only 6.785 ns, compared with 12.013 ns for the trace back approach.