High Speed Stateful Packet Inspection in Embedded Data-Driven Firewall

Ruhui Zhang*, Makoto Iwata*, Yuta Shirane*, Teruhisa Asahiyama*, Wenjun Su**, Youquan Zheng**
*) Dept. of Information System Engineering, Kochi University of Technology, JAPAN
**) Dept. of Electronic Engineering, Tsinghua University, CHINA
8645j@gs.kochi-tech.ac.jp, iwata.makoto@kochi-tech.ac.jp

Abstract

A firewall is normally employed as a line of defense that keeps the network environment of private hosts and networks safe. Among the techniques used to realize a firewall, stateful packet inspection (SPI) is becoming popular. In this paper, we discuss a high-speed SPI. Our SPI for an embedded personal firewall processor is implemented on a self-timed super-pipelined data-driven multiprocessor chip. Furthermore, several hardware-based schemes for the SPI are proposed to extend the architecture and instruction set of the current chip in order to achieve higher performance. A prototype of the SPI has been implemented as data-driven programs and evaluated on an FPGA. Evaluation results show that the processing speed of the proposed high-speed SPI can exceed 3 Gb/s.

Keywords: stateful packet inspection, embedded firewall, data-driven, quick search, hardware timer, interlock mechanism

1 Introduction

A firewall can be built with several techniques, such as static packet filtering, stateful packet inspection (SPI), application layer gateways, and proxy servers. SPI is an advanced firewall architecture invented in the early 1990s. Essentially, SPI is a dynamic packet filter working in the network layer. It examines each IP datagram to capture the necessary information and then tracks the state of every network connection so as to allow or deny subsequent traffic traversing the firewall [1]. It has gradually replaced static packet filtering as the industry-standard firewall solution. To keep up with increasing network throughput, a high-speed SPI is required.

In this paper, a high-speed implementation of SPI in an embedded firewall, realized on a self-timed super-pipelined data-driven network processor (DDNP) chip [2], is introduced. An embedded firewall [2] is a hardware-based firewall processor installed inside a local host. It is the single gate that all IP datagrams to and from this host must traverse. Furthermore, the embedded firewall works independently of the host's operating system, which means it can prevent an intruder from using the host as a launching pad for further intrusion even if the host itself is compromised. Moreover, its transparency to users and hosts makes it robust and hard to disable.
On the other hand, the DDNP chip, which integrates several data-driven processors, overcomes the clock-skew and excessive power-consumption problems of conventional sequential processors [3]. It also provides pipelined parallel processing without any process scheduling or complex interrupt handling. It is programmable, and a powerful programming and debugging environment is provided. These advantages contribute to a high-speed SPI. In addition, several dedicated hardware mechanisms are proposed in this paper to make the chip more suitable for the embedded SPI functions.

The rest of the paper is organized as follows. The parallel implementation of SPI is described in section 2. Hardware-based schemes for high-speed SPI are proposed in section 3. Performance evaluation results are given in section 4. Section 5 concludes the paper.

2 Parallel Implementation of SPI

[Figure 1: Basic structure of SPI for TCP segments (connection table with per-connection states Establishing, Running, Ending, Deleting; timer; rule table; permit/deny decision).]

SPI is basically a connection tracking function; that is why it is called stateful [4]. In our SPI, all packets traversing the embedded firewall are classified according to their protocol type, such as the transmission control protocol (TCP), the user datagram protocol (UDP) and the internet control message protocol (ICMP), and are then processed by the relevant modules. For instance, the SPI for TCP connections shown in Figure 1 establishes a record for each network connection to memorize its connection state, state transition time and other useful information. When a TCP segment arrives, the SPI extracts header fields such as the socket pair, flag bits and sequence number, and uses them to determine the next state of the corresponding connection. If the segment also passes the predefined rule set, it is permitted to go on; otherwise, it is rejected. By dynamically updating the connection table, the SPI can track all connections through the firewall. Since the SPI tracks the state of each network connection, only a datagram that is a legitimate reply to a previous request can pass through. For example, the SPI can block a TCP ACK segment that is not preceded by a TCP SYN segment with a correct sequence number. Therefore, SPI provides a higher level of network security. The connection state management of TCP segments is based on the full-duplex nature of TCP. UDP datagrams, however, carry no connection state information such as flag bits or sequence numbers, so we adopt virtual states to track those connections. With respect to ICMP, the rule set decides whether a message is allowed or denied.
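For illustration, the following C sketch models the per-connection record and a reduced state transition driven by TCP flag bits. The state names follow Figure 1 (Establishing, Running, Ending, Deleting); the field names, the simplified transition rules and the omission of sequence number checks are assumptions made only for this sketch, since the actual implementation keeps this state as tagged tokens in the data-driven programs.

    /* Minimal sketch of a per-connection record and a reduced TCP state
     * transition (states as in Figure 1).  Names and rules are illustrative. */
    #include <stdint.h>
    #include <time.h>

    enum conn_state { ESTABLISHING, RUNNING, ENDING, DELETING };

    struct conn_record {
        uint32_t remote_ip;    /* key: remote IP address                   */
        uint16_t remote_port;  /* key: remote TCP port                     */
        uint16_t local_port;   /* key: local TCP port                      */
        enum conn_state state; /* current connection state                 */
        time_t last_update;    /* state transition time, used for timeouts */
    };

    #define TCP_FIN 0x01
    #define TCP_SYN 0x02
    #define TCP_RST 0x04
    #define TCP_ACK 0x10

    /* Advance the state for one inspected segment.  Returns 1 if the segment
     * is consistent with the tracked state, 0 if it should be rejected. */
    int track_tcp(struct conn_record *c, uint8_t flags)
    {
        switch (c->state) {
        case ESTABLISHING:
            if ((flags & TCP_ACK) && !(flags & TCP_SYN))
                c->state = RUNNING;        /* final ACK of the handshake    */
            else if (!(flags & TCP_SYN))
                return 0;                  /* neither SYN nor handshake ACK */
            break;
        case RUNNING:
            if (flags & (TCP_FIN | TCP_RST))
                c->state = ENDING;         /* one side starts closing       */
            break;
        case ENDING:
            if (flags & TCP_ACK)
                c->state = DELETING;       /* record can be reclaimed       */
            break;
        case DELETING:
            return 0;                      /* connection already closed     */
        }
        c->last_update = time(NULL);       /* remember the transition time  */
        return 1;
    }

A real SPI would additionally verify sequence numbers and consult the rule table before permitting the segment.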
Efficient management of connections is crucial to realizing a high-speed SPI. Thanks to the traffic patterns of an embedded firewall and the multiprocessing mechanism of the DDNP chip, the processing speed can be greatly increased. First, a network connection is normally identified exactly by the source/destination IP addresses and the source/destination TCP/UDP port numbers. If we only care about the two endpoints and not the transfer direction, an equivalent representation is the local/remote IP addresses and local/remote ports, where "local" means the end controlled by the embedded firewall and "remote" means the other end of the connection. Because an embedded firewall is installed inside a single host, the local IP address field is the same for every connection, so we select the remaining three fields as the keys that identify a network connection, which reduces the lookup workload. Second, the heavy task of connection management is efficiently distributed to several processors of the DDNP chip for parallel and pipelined processing, so all connections can be processed concurrently. This greatly increases the throughput of the SPI.

3 High Speed SPI Based on Data-Driven Processing

To achieve high SPI throughput, multiple state-transition processes must be implemented efficiently. A state-transition process is a history-sensitive process in which the next state is determined only by the current state and the input data. Using the tag identifier of the dynamic data-driven computation principle, active data belonging to each process can be identified by its tag, so the data-driven processor can operate multiple state-transition processes in a highly parallel manner. In this case, the current state of each connection can be represented as a tagged token without using memory. However, the SPI must associate each input IP datagram with one of the existing connections at the firewall, and this association process should be implemented efficiently on the data-driven processors. It must handle a search request for every input IP datagram and an update request whenever a connection is established or completed. This leads to a quick search scheme and an efficient interlock scheme for the connection table. Additionally, timeout calculation is an important issue when tracking network connections. It is needed not only for deleting finished connection records but also to counter attacks such as denial-of-service (DoS) attacks, which try to occupy the whole connection table with embryonic connection records [4]. Thus a timer is needed to check and decide which records ought to be deleted. All of these functions can be realized in software on the DDNP chip; however, taking advantage of its architecture and instruction extensibility, we propose several hardware-based schemes to achieve higher performance.
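As a plain software illustration of this timeout requirement, the sketch below sweeps the connection table and reclaims finished and embryonic (half-open) records whose timers have expired, so that a DoS flood cannot exhaust the table. The timeout values, state names and polling style are assumptions for illustration only; section 3.3 replaces the timing source with a hardware timer.

    /* Software sketch of the timeout check.  Timeouts and field names are
     * assumed; on the DDNP chip this check is triggered by the hardware
     * timer of section 3.3 rather than by polling. */
    #include <time.h>

    #define MAX_CONN            256
    #define ESTABLISH_TIMEOUT    30   /* seconds for embryonic connections   */
    #define IDLE_TIMEOUT        300   /* seconds for established connections */

    enum state { FREE, ESTABLISHING, RUNNING, ENDING };

    struct conn {
        enum state st;
        time_t     last_update;       /* time of the last state transition   */
    } conn_table[MAX_CONN];

    void expire_connections(time_t now)
    {
        for (int i = 0; i < MAX_CONN; i++) {
            time_t age = now - conn_table[i].last_update;
            if (conn_table[i].st == ESTABLISHING && age > ESTABLISH_TIMEOUT)
                conn_table[i].st = FREE;      /* drop embryonic connection   */
            else if (conn_table[i].st != FREE && age > IDLE_TIMEOUT)
                conn_table[i].st = FREE;      /* drop stale/finished record  */
        }
    }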
3.1 Hybrid Quick Search with Hash and CAM

Hashing is a well-known quick search method. In the ideal case it works with O(1) efficiency, but collisions are inevitable, and in the worst case the hash table may degrade to O(n), like a linear search. On the other hand, content addressable memory (CAM) is also good at quick search. It is a kind of associative memory that is accessed by data content rather than by a specific address, and it can complete a lookup in a single cycle, giving constant O(1) time complexity. However, the current bottleneck of CAM is its large power consumption, caused by the comparison circuits activated in parallel, which is an obstacle to enlarging its size. To exploit the advantages and avoid the shortcomings of these two methods, we propose a scheme that combines a hash table with a small CAM to complete a quick search. The basic scheme is shown in Figure 2. When a new connection record is inserted into the hash table, its hash address is first calculated by a hash function; we select the remote IP address and the remote/local ports as the keys in our SPI. If there is no collision, the record is stored at that address. Otherwise, linear probing is used to find an empty slot. Here is the key point of our scheme: when the address of an empty slot is found, we not only store the record there but also save the keys and the address of that slot in the small CAM. In the worst case, when the CAM is full, the conventional chaining method is used to resolve the remaining collisions.

[Figure 2: Quick search using both hash and CAM (hash table, small CAM and chaining table along the searching path).]

When a collision occurs during a search, the keys are input to the CAM to associate them with one of the connections. If there is a matching entry in the CAM, the address of the connection record is read out directly as the associated content. If there is no match, the chaining table is searched. Since the CAM absorbs the majority of the collisions, the tedious chaining operations are greatly reduced.
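The following C sketch is a software model of this hybrid quick search: a hash table with linear probing, a small CAM that remembers displaced records, and a chaining table as the last resort. The table size, the hash function and the linear scan that stands in for the CAM's parallel match are assumptions for illustration only.

    /* Software model of the hybrid quick search of section 3.1. */
    #include <stdint.h>

    #define TABLE_SIZE 256          /* assumed hash table size            */
    #define CAM_SIZE    32          /* 32-entry CAM, as used in section 4 */

    struct key  { uint32_t remote_ip; uint16_t remote_port, local_port; };
    struct slot { int used; struct key k; /* ...rest of connection record */ };
    struct cament { int used; struct key k; uint32_t addr; };

    static struct slot   table[TABLE_SIZE];
    static struct cament cam[CAM_SIZE];

    static uint32_t hash_key(const struct key *k)
    {
        uint32_t h = k->remote_ip ^ ((uint32_t)k->remote_port << 16) ^ k->local_port;
        return (h ^ (h >> 16)) % TABLE_SIZE;   /* hash function is assumed */
    }

    static int key_eq(const struct key *a, const struct key *b)
    {
        return a->remote_ip == b->remote_ip &&
               a->remote_port == b->remote_port &&
               a->local_port == b->local_port;
    }

    /* Search: direct hit first, then the CAM, then chaining as last resort. */
    int lookup(const struct key *k)
    {
        uint32_t h = hash_key(k);
        if (table[h].used && key_eq(&table[h].k, k))
            return (int)h;                     /* no collision: direct hit   */
        for (int i = 0; i < CAM_SIZE; i++)     /* one parallel CAM cycle in HW */
            if (cam[i].used && key_eq(&cam[i].k, k))
                return (int)cam[i].addr;
        return -1;                             /* fall back to chaining table */
    }

    /* Insert: linear probing; a displaced record is also registered in the CAM. */
    int insert(const struct key *k)
    {
        uint32_t h = hash_key(k);
        for (uint32_t i = 0; i < TABLE_SIZE; i++) {
            uint32_t a = (h + i) % TABLE_SIZE; /* linear probing from home slot */
            if (table[a].used)
                continue;
            if (i > 0) {                       /* collision: remember keys+addr */
                int j = 0;
                while (j < CAM_SIZE && cam[j].used)
                    j++;
                if (j == CAM_SIZE)
                    return -1;                 /* CAM full: use chaining (not shown) */
                cam[j].used = 1; cam[j].k = *k; cam[j].addr = a;
            }
            table[a].used = 1;
            table[a].k = *k;
            return (int)a;
        }
        return -1;                             /* table full */
    }

In hardware, the CAM comparison takes a single cycle, so only the rare case of a full CAM ever reaches the chaining table.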
3.2 Interlock Mechanism

The principle of our interlock mechanism is, in a word, to lock as small a critical section as possible and thereby keep the opportunity for parallel processing as large as possible. For example, when operating on the chaining table used in the quick search, not the whole table but only a single entry is locked (we call these coarse-locking and fine-locking, respectively, in this paper), so multiple datagrams can access the chaining table as long as no collision happens at the same address. This preserves the parallelism between different connections. However, the smaller the locked subregions are, the more flags and additional control operations are needed. Since this might become a bottleneck of the SPI, we introduce a new compound instruction. Figure 3 shows the procedure of the instruction in detail. Note that a single bit is sufficient as the flag of one critical section. The instruction first reads a word containing the flag bit from memory. If the flag bit is 1, nothing is done except outputting a DDNP packet on port 1. If it is 0, an OR operation sets the flag bit to 1, the result is stored back to memory, and a DDNP packet is output on port 0. To lock a critical section, this instruction is executed on its flag bit: a packet output from port 1 indicates that the critical section is already occupied and the packet must wait, while otherwise the critical section is free and is locked by the current request. This simple instruction can be realized in the DDNP chip with little hardware cost. Most importantly, complex control operations and resource requirements are replaced by a single interlock instruction, so that active SPI processes can be executed in parallel.

[Figure 3: Interlock instruction.]
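In software terms, the instruction behaves like the test-and-set model sketched below. In the DDNP hardware the read-modify-write is a single atomic instruction and the two outcomes are two output ports; the C function only mimics that behaviour, and the names and word layout are assumptions.

    /* Software model of the compound interlock instruction (section 3.2).
     * The hardware performs this read-modify-write atomically; this C model
     * does not, and is for illustration only. */
    #include <stdint.h>

    #define FLAG_BIT 0x1            /* one flag bit per critical section */

    enum out_port { PORT0_LOCKED = 0, PORT1_OCCUPIED = 1 };

    /* Try to lock the critical section guarded by *flag_word.  Returns
     * PORT1_OCCUPIED if the section was already locked (the DDNP packet must
     * wait and retry), or PORT0_LOCKED if it is now locked by this request. */
    enum out_port interlock(volatile uint32_t *flag_word)
    {
        uint32_t w = *flag_word;             /* read the word with the flag    */
        if (w & FLAG_BIT)
            return PORT1_OCCUPIED;           /* flag already 1: just output    */
        *flag_word = w | FLAG_BIT;           /* flag 0: OR it to 1, store back */
        return PORT0_LOCKED;                 /* section entered and locked     */
    }

    /* Releasing the section simply clears the flag bit again. */
    void unlock(volatile uint32_t *flag_word)
    {
        *flag_word &= ~(uint32_t)FLAG_BIT;
    }

With fine-locking, one such flag word is associated with each chaining-table entry, so only accesses that collide at the same address are serialized.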
3.3 Hardware Timer for Data-Driven Processor

It is easy to realize a timer function in software on the DDNP; for instance, the add instruction can be used to implement a counter. However, data flow in the DDNP is controlled by the handshaking protocol between adjacent pipeline stages, so the execution of the same instruction may take different amounts of time under different conditions. Such a software timer is therefore inaccurate and uncontrollable. Moreover, a DDNP packet must loop endlessly inside the DDNP to keep the timer running, which wastes processing resources. This motivates a hardware-based timer that keeps time accurately and independently.

[Figure 4: Basic structure of the timer unit embedded in the DDNP chip: crystal oscillator, counter, reloading register, DDNP packet register and FIFO attached to the data-driven processor (M: flow merging module, MM: matching memory, FP: functional processing unit, CPS: cache program storage, MEM: data cache and line memory, D: flow diverting module).]

Timers used in conventional computers normally consist of three components: a crystal oscillator, a counter and a reloading register. The crystal oscillator generates pulses periodically, and each pulse decrements the counter by one. When the counter reaches zero, an interrupt occurs; the counter is then reloaded from the reloading register and another cycle begins. Our timer employs this method, with some modifications so that it works correctly with the data-driven processor. Figure 4 illustrates the structure of the proposed hardware timer. The timer contains one additional register, which holds the destination number field of the DDNP packet that is output whenever the counter raises an interrupt. Glue circuits such as a FIFO connect the synchronous timer to the asynchronous DDNP [5]. In the functional processing unit (FP), an arbitrator function is added so that the interrupt packet is given the highest priority and enters the DDNP pipeline first. After entering the pipeline, the interrupt packet goes to the cache program storage (CPS), finds the corresponding exception-handling module by its destination number and begins to execute. Timing by the crystal oscillator makes the timer accurate. Furthermore, since the timer works independently of the DDNP, it does not occupy processing resources of the DDNP chip while timing, and the timer function can simply be turned on and off by enabling or disabling the crystal oscillator. Moreover, a timer shared by multiple data-driven processors, providing real-time parallel processing among them, could easily be realized by extending the proposed timer to a virtual timer function in the DDNP chip.
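The behaviour of the timer unit can be summarized by the following sketch: a down-counter clocked by the crystal oscillator, a reloading register, and a register holding the destination number of the DDNP packet emitted at every expiry. The structure and field names are assumptions; the FIFO and the arbitrator that inject the packet into the pipeline are only hinted at in the comments.

    /* Behavioural sketch of the timer unit of Figure 4 (names assumed). */
    #include <stdint.h>

    struct ddnp_packet { uint16_t dest_no; uint32_t data; };

    struct timer_unit {
        uint32_t counter;     /* decremented on every oscillator pulse    */
        uint32_t reload;      /* reloading register                       */
        uint16_t dest_no;     /* destination number of the timeout packet */
    };

    /* Called once per oscillator pulse.  Returns 1 and fills *out when the
     * counter reaches zero; the packet would then be pushed into the FIFO
     * and arbitrated into the DDNP pipeline with the highest priority. */
    int timer_tick(struct timer_unit *t, struct ddnp_packet *out)
    {
        if (--t->counter != 0)
            return 0;                 /* still counting down                   */
        t->counter = t->reload;       /* reload and start the next period      */
        out->dest_no = t->dest_no;    /* routes to the exception module in CPS */
        out->data = 0;
        return 1;
    }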
4 Performance Evaluation

The proposed schemes are evaluated on a DDNP evaluation board that consists of two ExtTBLs. The ExtTBL, one processor of the DDNP, is a self-timed super-pipelined data-driven processor implemented on an FPGA. The data-driven programs are executed on one of the ExtTBLs, which has a processing speed of 1M DDNP packets per second.

For the quick search, a conventional hash table and our proposed scheme are implemented and evaluated. We input 128 IP datagrams and use a CAM with a capacity of 32 entries. The average search time required by the two methods is measured at different collision ratios. The results plotted in Figure 5 indicate that the proposed scheme can reduce the search time by 30-90%, where the reduction ratio is determined by the capacity of the CAM and the collision ratio.

[Figure 5: Average search time (ms) of hash&CAM and hash alone versus collision ratio (%).]

Chaining table processing during connection creation is selected to evaluate the interlock mechanism and the new instruction. In this evaluation, 128 connections are created continuously. Figure 6 shows the throughput of the two interlock mechanisms, coarse-locking and fine-locking. The results show that the throughput is improved by a factor of 4 to 12 with our proposed scheme. While a new connection is being created, coarse-locking blocks all other operations such as searching and deleting, so fine-locking is suitable when connections are established frequently. In the case of infrequent connection creation, however, the fine-grained control may introduce a small overhead. Therefore, both solutions should be implemented in our SPI so that it can adapt to the actual traffic patterns in practice.
Finally, a prototype of the SPI using the quick search method and the interlock mechanism is emulated and evaluated. The throughput of our SPI reaches 23.5k IP datagrams per second on a single data-driven processor. For IP datagrams of average length, about 2,000 bits [6], this corresponds to roughly 47 Mb/s. In an actual application, both the number of processors and their processing speed will be larger. For example, a DDNP chip integrating 10 processors can be used, each of which achieves a processing speed of 14M DDNP packets per second. Owing to the linear scalability of the DDNP chip, it can then achieve a processing speed of more than 3 Gb/s (47 Mb/s * 14 * 10 ≈ 6.6 Gb/s).

[Figure 6: Throughput (kpps) of fine-locking and coarse-locking versus collision ratio (%).]

5 Conclusion

A high-speed SPI in our embedded data-driven firewall has been described in this paper. By exploiting the traffic patterns of the embedded firewall and the parallel processing of the DDNP chip, the speed of the SPI is greatly increased. Furthermore, to obtain higher performance, several hardware-based schemes, including the quick search, the interlock mechanism and the timer, are proposed as extensions of the architecture and instruction set of the current DDNP chip. Performance evaluation on the DDNP evaluation board shows that the processing speed of our SPI can exceed 3 Gb/s. Our preliminary research is based on simplified traffic patterns of the embedded firewall; general traffic patterns will be considered in future research toward practical applications. Moreover, hardware circuits for the proposed schemes will be implemented in the DDNP chip and finally applied to high-speed SPI in the embedded data-driven firewall.

References

[1] R.K.C. Chang and K.P. Fung, "Transport layer proxy for stateful UDP packet filtering," Seventh International Symposium on Computers and Communications (ISCC 2002), pp. 595-600, July 2002.
[2] M. Iwata, D. Morikawa, R.H. Zhang, W.J. Su, Y.Q. Zheng, and L.J. Kong, "Design Concept of an Embedded Data-Driven Firewall Processor," International Conference on Next Era Information Networking (NEINE '04), Sep. 2004 (to be presented).
[3] H. Terada, S. Miyata, and M. Iwata, "DDMP's: Self-timed super-pipelined data-driven multimedia processors," Proceedings of the IEEE, 87(2), pp. 282-296, Feb. 1999.
[4] I. Kang and H. Kim, "Determining embryonic connection timeout in stateful inspection," IEEE International Conference on Communications (ICC 2003), pp. 458-462, May 2003.
[5] M.R. Greenstreet, "Implementing a STARI Chip," IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 38-43, Oct. 1995.
[6] C. Partridge, et al., "A 50-Gb/s IP router," IEEE/ACM Trans. on Networking, 6(3), pp. 237-248, 1998.