CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE




5.1 INTRODUCTION

Finite State Machines (FSMs) are important components of digital systems. Therefore, techniques for area-efficient and fast implementation of FSMs are of great interest. The implementation of an FSM is strongly determined by the way codes are assigned to its states. The state assignment problem can be stated as that of assigning codes to the states of an FSM while optimizing a given criterion. The state assignment problem has received considerable attention from researchers (Ellen et al 1992), because it is an important step in the process of sequential circuit synthesis. Some of the reported state assignment tools are analysed in the literature: Villa and Sangiovanni-Vincentelli (1990) for area minimization of PLA implementations, and Lin and Newton (1989) for multilevel implementations. Classical approaches for reducing the complexity of the next-state and output functions involve reducing input or output dependencies, state splitting, etc. (Zvi Kohavi 1996, Frederick Hennie 1968, Demers et al 1989). The state assignment problem (Lucca Benini et al 1998) can also be viewed as that of the decomposition of an FSM. Recently, decomposition of FSMs has attracted the attention of researchers for area minimization and power reduction. Some of these approaches towards area minimization are Decomposition and Factorization (Srinivas Devadas and A. R. Newton 1989), Modify and Restore (Rama et al 1994, Chakraborthy 1994) and Decomposition as Constrained

Covering (Pranav Ashar 1991). The decomposition and factorization approach tries to extract repetitive parts of a finite state machine, implement them only once, and pass control to them whenever a particular transition occurs. Since the repetitive part of the finite state machine is implemented only once, some area reduction can be achieved. However, there may not be exactly repetitive parts in an FSM. This difficulty is resolved by the Modify and Restore approach, which involves modifying the next-state and output functions to extract repetitive parts and restoring the original functions using a restoring PLA and Ex-OR gates. In this approach, the restoring PLA might itself require a large area. The constrained covering approach has put forward strategies for handling various kinds of topologies such as cascade-cascade, cascade-parallel, and parallel-parallel. These approaches are highly dependent on the structure of the machine.

There are essentially two approaches to machine decomposition. If the original state graph is partitioned into several pieces, with each piece implemented by a separate machine with a wait state, then exactly one machine is active at any instant while the others remain in the reset state (Jose et al 1998). In this case, decomposition is viewed as the joining of two disjoint partitions on the set of states. An alternative approach to decomposing an FSM is based on factoring the original state graph and is the focus of this research. In this research, decomposing an FSM into two interacting machines is proposed, for area-effective as well as high-performance implementation. Contrary to (Jose et al 1998), this decomposition is a factoring of the original machine (with N states) into two much smaller machines (with about √N states each), and this factoring can be viewed as the meet of two orthogonal partitions of the set of states of the original

machines. It is to be noted that the two smaller machines are both active simultaneously.

5.2 LOOKUP ENGINE ARCHITECTURE

The basic architecture of the lookup engine, which performs the address lookup, is shown in Figure 5.1. Reconfigurable hardware is essentially a circuit whose behavior can be modified on the fly. The hardware implementation is in the form of a programmable FSM, onto which the state transition table can be loaded by the processor.

Figure 5.1 Lookup engine architecture

The processor computes the FSM for a given routing database of address prefixes and then compiles it into a format appropriate for programming the reconfigurable hardware. If a routing update changes the routing database, the state machine is recomputed. In case of changes, either the entire FSM may have to be reprogrammed or changes to some part of the FSM graph may have to be made. All the approaches (except CAM-based solutions) require

several memory accesses and, thus, memory bandwidth is one of the major performance bottlenecks. The FSM-based architecture can be efficiently implemented using flip-flops (FFs), and all the memory accesses reduce to accesses of high-speed registers. The implementation can, thus, scale with VLSI technology. This research presents a way to generate an efficient FSM for the routing database and evaluates the lookup speed of such an approach.

5.3 IP ADDRESS LOOKUP SCHEMES

The performance of an IP address lookup algorithm is characterized by two parameters. One is the lookup time, i.e., the time required to determine the output interface corresponding to a destination IP address. Since routing table entries may change due to route updates, the time required by an IP lookup algorithm to respond to changes in the routing database is the other parameter; it is termed the update time. IP address lookup engines can be broadly classified into two categories: one based on content addressable memories (CAMs) and the other on a processor-memory combination. This work effectively creates a third category, based on programmable FSMs.

5.3.1 CAM-based solutions

In this model, the address lookup can be performed using a ternary CAM (TCAM) (NetLogic Microsystems 2001). In a TCAM, a mask of bits can be specified per word. The routing table entries are stored in order of decreasing prefix length. The longest prefix match, thus, corresponds to the first entry among all the entries that match the destination IP address. A TCAM is an attractive solution for high-speed IP address lookup; however, TCAMs with large sizes are

typically very expensive. Historically, CAM technology has also not kept pace with dynamic random access memory (DRAM) technology in terms of storage density. TCAMs are also very poor in terms of update time, though some progress (P. Gupta et al 2000) has recently been made in this direction.

5.3.2 Processor-memory based solutions

In this model, the routing table entries reside in memory and the lookup algorithm runs on a processor. The objective of an IP lookup algorithm is to organize the routing database in an intelligent manner such that the actual lookup operation requires as few memory accesses as possible. For backbone routers with a large routing database, architectures that use off-chip DRAMs are usually employed. One measure of the lookup speed of an algorithm is the number of DRAM accesses required. New memory technologies such as synchronous DRAM (SDRAM), RAMBUS, and double data rate DRAM (DDR-DRAM) employ some form of parallel memory banks, and interleaving can be performed to hide memory access latency. As pointed out in Eatherton (1999), each memory technology introduces some tradeoffs, and IP lookup algorithms need to be carefully tuned across memory architectures to extract the best performance.

One of the simplest ways to store the routing database of address prefixes in memory is in the form of a 1-bit trie. A trie is a tree-like data structure where the prefix bits are used to create tree branches. Several modifications to the basic 1-bit trie have been proposed in the literature. Path compression techniques (Morrison 1968) can be used to remove those nodes from the tree that have only one child. The missing nodes are denoted by a skip value that indicates how many nodes have been skipped on the one-way path. Instead of 1-bit tries, multibit tries (Srinivasan et al 1998) can also be used. Unlike a 1-bit trie, where each node branches to its children depending upon the value of a single binary bit, in multibit tries the branching occurs depending upon the value of several bits taken together. The search also proceeds by inspecting several bits simultaneously. The number of bits examined is called the stride length. The strides can be of fixed length or of variable lengths at different levels of the tree. The address prefixes need to be converted into prefixes with lengths equal to the stride. The length of the strides offers a tradeoff between memory and search speed. The optimal strides can be computed using the prefix length distribution (V. Srinivasan et al 1998). In LC tries (S. Nilsson et al 1999), each complete subtree of height k is converted into a subtree of height 1 with 2^k children. Thus, a 1-bit trie gets converted into a multibit trie. In Gupta et al (1998), a multibit trie with fixed stride length is implemented using memory banks. This is, however, achieved at the expense of a large memory size. Though the above algorithms have provided very novel techniques to arrange the prefixes in an intelligent manner, it is believed that the scalability of processor-memory solutions is limited by the fact that the lookup operation requires DRAM accesses. Despite considerable progress, DRAM technology has not kept pace with processor technology.

5.4 FSM FOR LOOKUP ENGINE

To illustrate the basic approach, generating an FSM from the 1-bit trie structure is first analyzed. The procedure for generating the 1-bit trie begins at the root node for each prefix. The bits in the prefix are examined one by one. If the bit is zero, then the left node is formed (if not already present); otherwise, if the bit is one, then the right node is formed. To generate an FSM, each node in the resulting 1-bit trie can be

associated with a state in an FSM. The 1-bit trie and the corresponding FSM for the prefix database are illustrated in Figure 5.2. This FSM is called a 1-bit FSM. The state transition table for this FSM is given in Table 5.1. The state corresponding to an address prefix stores the corresponding output interface. To perform a lookup, the destination IP address bits are applied in serial order, and the state machine makes a transition from one state to another depending upon the bit. If a state representing a valid interface is encountered, the state number is stored. The IP address bits are applied until a node whose next state is FINAL is encountered. The search then terminates, and the output interface number corresponding to the last stored state is retrieved. In the given example, if the destination IP address is 110, the states that would be traversed are S1, S3, and S10 and the output would be 6.

Table 5.1 State transition table

Current state   Input bit   Next state   Output
S1              0           S2           -
S1              1           S3           -
S2              0           S4           -
S2              1           Final        -
S3              0           Final        -
S3              1           S5           3
S4              0           S6           2
S4              1           S7           1
S5              0           S8           2
S5              1           Final        -
S6              0           Final        -
S6              1           Final        -
S7              0           Final        -
S7              1           Final        -
S8              0           Final        -
S8              1           Final        -

In the worst case, 32 states might have to be traversed for an IP lookup, but note that these are not memory accesses and, hence, could be quite fast. For practical routing databases, the number of states in the state machine would be large. The number of states in the FSMs generated for the Mae-East, FUNET, and RIPE routing databases can be calculated; the results are summarized in Table 5.2.

Figure 5.2 1-bit trie corresponding to database
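The serial lookup walk described above (apply the address bits one by one, remember the last state with a valid interface, and stop when the next state would be FINAL) can be sketched in Python. The small prefix database used here is hypothetical; the actual database behind Figure 5.2 and Table 5.1 is not reproduced in the text.

```python
# Sketch of the 1-bit FSM lookup: build a 1-bit trie, then apply the
# destination address bits serially, remembering the last valid interface.

def build_trie(prefixes):
    """Build a 1-bit trie; each node holds 0/1 children and an optional
    output interface number."""
    root = {"out": None, 0: None, 1: None}
    for prefix, port in prefixes:
        node = root
        for ch in prefix:
            bit = int(ch)
            if node[bit] is None:
                node[bit] = {"out": None, 0: None, 1: None}
            node = node[bit]
        node["out"] = port
    return root

def lookup(root, addr_bits):
    """Apply address bits in serial order; an absent child plays the
    role of the FINAL state and terminates the search."""
    node, best = root, None
    for ch in addr_bits:
        nxt = node[int(ch)]
        if nxt is None:            # next state is FINAL
            break
        node = nxt
        if node["out"] is not None:
            best = node["out"]     # remember last stored state's interface
    return best

prefixes = [("0", 1), ("01", 2), ("110", 6)]   # hypothetical entries
trie = build_trie(prefixes)
print(lookup(trie, "1101"))   # longest match is 110 -> interface 6
print(lookup(trie, "0001"))   # longest match is 0   -> interface 1
```

Note that, exactly as in the text, the search does not backtrack: it simply remembers the most recent matching state while the bits stream in.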

Table 5.2 Number of states in FSM

Database   Number of entries   Number of states
FUNET      41625               125669
RIPE       55532               172826

The large number of states may result in an inefficient hardware implementation and higher delays. Therefore, a structured approach is followed in this work, where the FSM graph is partitioned into smaller machines, each containing some maximum number of states, say 1024. The partitioning of the FSM graph is done with a view to minimizing the area of the chip and making the performance of the chip predictable. Each machine is made reconfigurable by introducing memory cells. When one machine completes the processing of a packet, the packet is handed over to an appropriate machine by the central block. Various methods for decomposing the state machine into smaller state machines by exploiting the structure of FSM graphs have been investigated.

5.5 PATH-COMPRESSED TRIES

While binary tries allow the representation of arbitrary-length prefixes, they have the characteristic that long sequences of one-child nodes may exist. Since these bits need to be inspected even though no actual branching decision is made, search time can be longer than necessary in some cases. Also, one-child nodes consume additional memory. In an attempt to improve time and space performance, this research uses a technique called path-compression.
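The cost that path-compression attacks can be seen in miniature: the FSM has one state per 1-bit trie node, so a lone long prefix contributes a one-child state for every bit. The sketch below counts trie nodes for hypothetical prefix sets (the FUNET and RIPE databases behind Table 5.2 are not reproduced here).

```python
# Miniature illustration of state counts: the 1-bit FSM has one state
# per trie node, so long one-way chains inflate the count.

def count_states(prefixes):
    """Number of FSM states = number of 1-bit trie nodes (incl. root),
    where each node is identified by the bit string leading to it."""
    nodes = {""}                     # root node
    for p in prefixes:
        for i in range(1, len(p) + 1):
            nodes.add(p[:i])         # every proper prefix is a trie node
    return len(nodes)

print(count_states(["0", "01", "110"]))      # 6 nodes for 3 short prefixes
print(count_states(["11000000101010001"]))   # 18: one state per bit + root
```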

Path-compression consists of collapsing one-way branch nodes. When one-way branch nodes are removed from a trie, additional information must be kept in the remaining nodes so that the search operation can be performed correctly. There are many ways to exploit the path-compression technique; one, corresponding to the binary trie, is shown in Figure 5.3. Note that the two nodes 01 and 10 are removed; however, a list of prefixes must be maintained in some of the nodes. Because one-way branch nodes are now removed, the search can jump directly to the bit where a significant decision is to be made, bypassing the inspection of some bits. As a result, a bit-number field must be kept in each node to indicate which bit is the next bit to be inspected.

Figure 5.3 Path-compressed FSM

In Figure 5.3 these bit numbers are shown next to the nodes. Moreover, the bit strings of the prefixes must be explicitly stored. A search in this kind of path-compressed trie proceeds as follows. For instance, consider finding the BMP of an address beginning with the bit pattern 010110 in the path-compressed trie shown in Figure 5.3.
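The search just outlined can be sketched in Python. The node layout (a 1-based bit number to inspect, an optionally stored prefix, and two children) and the bits chosen for prefix b are assumptions for illustration; Figure 5.3 itself is not reproduced here.

```python
# Sketch of search in a path-compressed trie: at each node, compare any
# stored prefix against the address (remembering the best match), then
# skip directly to the node's bit number.

class Node:
    def __init__(self, bit, prefix=None, port=None):
        self.bit = bit            # 1-based index of the bit to inspect
        self.prefix = prefix      # prefix string stored at this node, if any
        self.port = port
        self.child = {0: None, 1: None}

def search(root, addr):
    """Walk the trie, remembering the best matching prefix (BMP)."""
    node, bmp = root, None
    while node is not None:
        if node.prefix is not None and addr.startswith(node.prefix):
            bmp = node.port       # last remembered BMP
        if node.bit > len(addr):  # ran out of address bits
            break
        node = node.child[int(addr[node.bit - 1])]
    return bmp

# Hypothetical trie matching the narrative: prefix a = 0 at the left
# child of the root (bit number 3), with a non-matching prefix b below.
root = Node(bit=1)
a = Node(bit=3, prefix="0", port="a")
b = Node(bit=6, prefix="01001", port="b")
root.child[0] = a
a.child[0] = b

print(search(root, "010110"))   # prefix b does not match -> BMP is "a"
```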

Searching of the FSM starts at the root node, and since its bit number is 1, the first bit of the address is inspected. The first bit is 0 in this example, so the search goes to the left. Since the node is marked as a prefix, the prefix a is compared with the corresponding part of the address. Since the node's bit number is 3, the operation skips the second bit of the address and inspects the third one. This bit is 0, so the search goes to the left. Again, checking continues to find whether the prefix b matches the corresponding part of the address (01011). Since they do not match, the search stops and the last remembered BMP (prefix a) is the correct BMP.

Path-compression was first proposed in a scheme called PATRICIA, but this scheme does not support longest prefix matching. Sklower (1991) proposed a scheme with modifications for longest prefix matching (Broder et al 2001). In fact, this variant was originally designed not only to support prefixes but also more general non-contiguous masks. Since this feature was really never used, current implementations differ somewhat from Sklower's original scheme. For example, the BSD version of the path-compressed trie (referred to as the BSD trie) is essentially the same as the one just described. The basic difference is that in the BSD scheme, the trie is first traversed without checking the prefixes at internal nodes.

5.6 TOPOGRAPHICAL BREAKDOWN OF THE FSM

In this work, a simple method to implement the FSM for IP address lookup and packet classification is designed. It is assumed that the database is static, i.e., it is not being updated. Initially, the process of generating finite state machines for IP address lookup is performed. First, a 1-bit trie structure is generated using the prefix table. In the 1-bit trie, each node stores the prefix, the output interface number, and pointers to its parent and its children if present. For each prefix, start at the root node

at the top of the 1-bit trie. Next, while looking at the bits in the prefix from the left, if the bit is 0, create the left node if absent; if the bit is 1, create the right node if absent. Now change the current node to the left node if the bit was 0, or to the right node if the bit was 1. This process continues till all the bits in the prefix are exhausted. The output interface number as stated in the prefix table is assigned to the current node. Consider a database having entries of the form (prefix, port); such a 1-bit trie structure is shown in Figure 5.4.

Figure 5.4 Topographical breakdown of FSM

Each node shown in Figure 5.4 can be associated with a state in the corresponding FSM. The output is the output interface number. The FSM corresponding to the given database consists of a large number of states. Handling such a large FSM is not practical; therefore, it must be partitioned into smaller FSMs. To obtain optimal performance and minimum area, these large FSMs have to be broken down into smaller FSMs. Some approaches for breaking the 1-bit trie FSM into machines have been discussed in Kobayashi (2000).
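As a toy picture of breaking an FSM graph into machines of bounded capacity, the sketch below groups states greedily in breadth-first order. This grouping is an assumption for illustration only: it respects a capacity limit, but makes no attempt to minimize the edges crossing machine boundaries, which is the criterion the topographical approach actually optimizes.

```python
# Illustrative (not the thesis's) partitioning: fill machines of at most
# `capacity` states in BFS order from the root of the FSM graph.

from collections import deque

def partition(edges, root, capacity):
    """Group states into machines of at most `capacity` states each."""
    machines, current = [], []
    seen, queue = {root}, deque([root])
    while queue:
        state = queue.popleft()
        current.append(state)
        if len(current) == capacity:     # machine is full; start a new one
            machines.append(current)
            current = []
        for nxt in edges.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if current:
        machines.append(current)
    return machines

# Hypothetical FSM graph shaped like a small trie.
edges = {"S1": ["S2", "S3"], "S2": ["S4"], "S3": ["S5"],
         "S4": ["S6", "S7"], "S5": ["S8"]}
print(partition(edges, "S1", 3))
# BFS order S1..S8 grouped in threes
```

A real partitioner would weigh the inter-machine edges as well, since every cut edge becomes a hand-over through the central block.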

One possible solution for generating machines is to break the FSM topographically according to the implementation capacity of the machines. The criterion for topographical breaking is to minimize the total number of edges going into or coming out of the machines. It is based on orthogonal partitioning of the original large FSM into smaller machines, which can be executed in parallel and their results combined to yield the final output.

5.7 PERFORMANCE EVALUATION

As discussed previously, the large FSM is broken down into smaller machines using orthogonal decomposition. The inputs to each of the machines are the present state of the machine (PS), the next state of the machine (NS) and the external inputs. Due to orthogonal decomposition, the original FSM consisting of N states is broken down into two machines of size O(√N). The goal here is to minimize the number of multi-edges; thus, the final graph is one having no parallel edges between any pair of nodes. The machine operates in two modes, viz. (1) route lookup mode and (2) update mode. Simulation results indicate that it takes 5 nsec for a signal to traverse the critical path and generate the signals for the next state. Since each partition contains 3 states, it would take 15 nsec for a complete lookup. However, this does not include some of the delays in the feedback path and some buffer chain delays. Hence, a delay of T_rl^F = 20 nsec per lookup would be a conservative estimate. In the update mode, the bit stream generated by the processor is loaded serially into the scan chain, and these values are then loaded stepwise into the memory cells in each of the rows of the programmable logic array. The length of the scan

chain is 200. The clock signal has a time period of 100 psec; thus it takes 20 nsec to load the scan chain. The time required to load the memory cells is about 5 nsec. There are 80 such rows to be loaded to complete the reconfiguration. Hence, the total time required to update is T_u^F = (20+5) nsec × 80 = 2 μsec. A major bottleneck is the generation of the actual bit stream after getting the data from the database, generating the FSM and doing the optimal coding. This time is estimated to be around 2 min.

5.7.1 Timing summary after reduction

The path-compressed design was run through the Xilinx tools to find the time taken for a complete search of any binary input, and the results were analyzed. The clock period and maximum frequency obtained are given below.

Minimum period : 8.519 nsec (Maximum Frequency: 117.385 MHz)
Minimum input arrival time before clock : 8.630 nsec
Maximum output time required after clock : 5.880 nsec

5.7.2 Timing summary before reduction

Speed Grade: -6
Minimum period : 15.887 nsec (Maximum Frequency: 62.945 MHz)
Minimum input arrival time before clock : 16.658 nsec
Maximum output time required after clock : 6.788 nsec
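The update-time estimate above can be checked arithmetically; all figures are the text's own (200-bit scan chain, 100 psec clock, 5 nsec memory-cell load, 80 rows, 5 nsec per transition, 3 states per partition).

```python
# Reproducing the timing estimates quoted in the text.

scan_bits = 200
clock_ps = 100
scan_load_ns = scan_bits * clock_ps / 1000   # 20 nsec to fill the scan chain
row_time_ns = scan_load_ns + 5               # + 5 nsec to load memory cells
rows = 80
update_us = row_time_ns * rows / 1000        # total reconfiguration time
print(update_us)                             # 2.0 microseconds (T_u^F)

lookup_ns = 5 * 3                            # 5 nsec per transition, 3 states
print(lookup_ns)                             # 15 nsec, padded to a 20 nsec T_rl^F
```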

5.8 CONCLUSION

So far, an FSM generated from a 1-bit trie has been considered. Multibit tries and their variants can also be considered within this framework. As indicated earlier, the VLSI architecture can be optimized for area by using dynamic reconfiguration, which can improve the packing properties of the architecture.