Fast Robust Hashing

Manuel Urueña, David Larrabeiti and Pablo Serrano
Universidad Carlos III de Madrid
E-28911 Leganés (Madrid), Spain
Email: {muruenya,dlarra,pablo}@it.uc3m.es

Abstract: As stateful flow-aware services are becoming commonplace, distributed router architectures have to quickly assign packets being forwarded to service-specialized processors in order to balance flow processing and state among them. Moreover, packets belonging to the same flow must always be assigned to the same CPU, even if some of the service processors become unavailable. This paper presents two novel Fast Robust Hashing algorithms for persistent Flow-to-CPU mapping that require fewer hash operations per packet than previous Robust Hashing algorithms, and are thus able to fulfill all the above requirements to implement flow-aware services at wire speed.

Index Terms: Robust Hashing, Flow Processing

I. INTRODUCTION

Providing flow-aware services for hundreds of thousands of flows per second arriving through high-speed interfaces poses significant technical challenges. A common solution is to employ distributed systems that allow parallel packet processing. These multi-processor router architectures [8] allow flow state to be distributed among specialized CPUs, or even among external devices. For example, deploying a cluster of NAT appliances attached to the access router of an organization is quite common nowadays. In any case, those stateful services require some kind of mapping mechanism that not only balances flows among CPUs or external devices, but also ensures that all packets belonging to the same flow are always processed by the same CPU. Otherwise, for example in the case of a stateful firewall, any mismatched packets will be dropped, as only the initial CPU is aware of such a connection. A mapping scheme based on hash functions seems ideal for distributed packet processing, as it does not require a central flow table, which could easily become the bottleneck of the system.
Instead, it is only necessary to perform one hash operation on the packet flow identifier in order to find the CPU that must process all the packets belonging to that flow. However, this scheme fails when the number of available processors varies, for example when one of the NAT boxes in the cluster goes down. In that case, the return range of the hash function should be reduced by one in order to skip the disabled processor. But changing the hash function means that the Flow-to-CPU mapping obtained by the old hash function will differ from the one being performed by the new hash function, leading to the disruption of all the re-mapped flows. In fact, whenever any of the n CPUs goes down, most of the flows ((n-1)/n) [7] will be re-mapped (and therefore discarded), due to the load-balancing property of hashing.

The Robust Hashing mechanism was designed to address the above issue. Its objective is that, whenever a processor goes down, the only re-mapped flows are the ones that would be assigned to the disabled CPU (1/n). The solution is quite simple: instead of performing a single hash operation that returns the chosen CPU, each CPU has an associated hash function that gives its affinity to that flow. Then, the CPU with the greatest affinity value is chosen. When a CPU goes down its hash function is not computed, thus it cannot be chosen. Therefore, only the flows assigned to it are re-mapped because, by definition, the rest of the flows have a greater affinity with one of the remaining CPUs.

The term Robust Hashing was first introduced by Ross in [6] for the Web-caching domain. Later, the Highest Random Weight algorithm [7] by Thaler and Ravishankar enhanced the Robust Hash mechanism to support heterogeneous caches by assigning them static weights. Finally, Kencl and Le Boudec [3] have defined a method for dynamically adjusting the weights associated with each CPU depending on its workload. A drawback of all these techniques is that each mapping requires as many hash operations as there are available CPUs.
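The classical Robust Hash / Highest Random Weight mapping described above can be illustrated with a few lines of Python. This is a sketch, not the code of [6] or [7]; in particular, using MD5 from hashlib as the per-CPU affinity hash is an arbitrary choice made only for the example.

```python
import hashlib

def hrw_map(flow_id: int, cpus: list) -> int:
    """Classical Robust Hashing: compute one affinity hash per
    available CPU and keep the CPU with the greatest value."""
    def affinity(cpu):
        digest = hashlib.md5(f"{flow_id}:{cpu}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(cpus, key=affinity)

cpus = [1, 2, 3, 4]
before = {f: hrw_map(f, cpus) for f in range(10_000)}
cpus.remove(3)          # CPU 3 goes down: its hash is simply not computed
after = {f: hrw_map(f, cpus) for f in range(10_000)}

# Only the ~1/n of flows that were on CPU 3 move; every other flow keeps
# its CPU, unlike a plain h(f) mod n scheme, which would re-map most flows.
moved = [f for f in before if before[f] != after[f]]
assert moved and all(before[f] == 3 for f in moved)
```

Note that a lookup computes one affinity hash per available CPU, which is precisely the per-packet cost that the algorithms of the following sections remove.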
Because hash functions are quite complex operations and all these computations should be performed for each arriving packet, this load-balancing process could become the bottleneck of the whole system, thus limiting the scalability of the service-processor cluster. This paper presents two novel algorithms, designed to reduce the number of hash computations per packet while keeping the same useful properties of the classical Robust Hash techniques: the Small Robust Hash and the Big Robust Hash algorithms.

II. SMALL ROBUST HASHING

The objective of the Small Robust Hash algorithm is to assign a packet to a CPU with just one hash operation. Therefore it employs a single hash function that never changes, even when a CPU is disabled, in order to maintain previous mappings. The algorithm works as follows: when a packet arrives, its flow identifier f is employed as the argument of the hash function. However, this hash operation h(f) does not return the CPU itself but a position in the Mapping Vector, which contains the selected CPU. In order to achieve probabilistic load balancing [7], the Mapping Vector must contain the same number of entries for each available CPU. Therefore, if a CPU goes down, all its

entries in the Mapping Vector are renamed with the remaining CPUs in a round-robin fashion.

TABLE I
MEMORY REQUIREMENTS (IN BYTES)

Number of CPUs (n):          4      8       16        32
Small Robust Hash (100%):   12    840   720720   ≈1.4·10^14
Small Robust Hash (25%):    12    168    21840   ≈1.8·10^9
Small Robust Hash (10%):    12     56     1680   ≈4.3·10^5
Big Robust Hash:            10     36      136       528

Fig. 1. Evolution of the Mapping Vector for the Small Robust Hash algorithm (n = 4, CPUs 3 and 1 are disabled)

For example, for the Mapping Vector in Fig. 1 and employing h(f) = f mod 12 as a simple hash function, all packets of flow 7 will be assigned to CPU 2, even if CPUs 1 and 3 go down. On the other hand, flows with identifier 4 start being processed by CPU 3, but later ones will be assigned to CPU 1 and then to CPU 4 when CPUs 3 and 1 go down, respectively.

A. Memory requirements

The Small Robust Hash algorithm requires a Mapping Vector of length m such that, for any number of disabled CPUs, there is an equal share of entries for each of the available CPUs left. This requirement is quite strong, as it means that, if n is the total number of CPUs, m should be a multiple of n, (n-1), ..., that is, of every possible number of available CPUs. Otherwise, if m is not a multiple of the number of available CPUs, it is not possible to rewrite all the disabled CPU's entries with a fair round-robin, leading to some degree of load imbalance among the remaining CPUs. Therefore, m = ∏ p_i^{e_i}, where p_i^{e_i} ≤ n is the greatest power of every prime number p_i ∈ [2, n], that is, m = lcm(1, ..., n). As can be seen in the first row of Table I, m becomes a huge value for all but a relatively small number of CPUs. However, this is only necessary in order to support an arbitrary number of CPUs going down, which is quite an improbable situation. Therefore it is possible to save some memory if only a subset of the values between 1 and n is chosen, thus providing partial fault-tolerance for a reasonable number of disabled CPUs. The second and third rows of Table I show the memory requirements of the Small Robust Hash algorithm in order to support a CPU failure rate of 25% and 10% without load imbalance among the surviving ones.

B. Performance

The Small Robust Hash algorithm achieves optimal performance, as it always finds each packet-to-CPU mapping with just one hash operation plus a memory access, no matter how many CPUs are down. Thus its complexity is O(1). Also, when a CPU is disabled, all its m/(n-k) entries in the Mapping Vector should be found and replaced by available ones, with n-k being the number of CPUs left. Thus, the complexity to update the Mapping Vector is O(m). However, the memory requirements of the Small Robust Hash algorithm make it applicable only to systems allowing partial fault-tolerance or having a small number of CPUs, hence its name. For distributed architectures with a greater number of processors, an algorithm with smaller memory requirements is proposed in the next section.

III. BIG ROBUST HASHING

The Big Robust Hash algorithm also applies hash functions to the packet's flow identifier in order to obtain a Mapping Vector index. However, the Big Robust Hashing algorithm does not employ a single hash function but several ones, each one associated with a different Mapping Vector. All the Mapping Vectors together are called the Mapping Matrix. When all CPUs are active, this algorithm resembles the Small Robust Hashing one: there is one Mapping Vector (with a single entry per CPU) and a hash function that returns a position inside this vector. The differences only arise if some CPUs are disabled: the Big Robust Hash algorithm does not employ a single, long vector but many. When a CPU goes down, a new Mapping Vector containing the remaining CPU entries, and its associated hash function, are added. Then, the disabled CPU entry at the initial vector is replaced by a pointer (called hop) to the second vector, as shown in Fig. 2. The mapping process then works as follows: given f, the flow identifier of the packet, the initial hash function h(f) returns a position at the initial Mapping Vector. If that entry is an available CPU, the mapping is done.
Else, when a hop is found, the following hash function should be applied to select a position in the new vector. In general, this process is repeated until an available CPU is found. Although this mechanism employs several hash functions, no disruption occurs as they are always applied in the same order. Flows previously assigned to an available CPU will be mapped again by the same hash function because their entry at the top Mapping Vector remains unchanged. On the other hand, new flows that would be assigned to the disabled CPU by the first hash function will find the hop entry, and then the second hash function h2(f) will assign them to one of the available CPUs at the new vector. Thus the remaining CPUs will absorb an equal share of the flows that would otherwise be assigned to the disabled CPU. If a second CPU goes down, another hash function h3(f) and its Mapping Vector are added, and the entries of the disabled CPU in all other vectors are renamed with hop entries pointing directly to the third Mapping Vector. Notice that the new vector does not replace the previous one, but is added at the bottom of the Mapping Matrix.

Fig. 2. Evolution of the Mapping Matrix for the Big Robust Hash algorithm (n = 5, CPUs 3 and 1 are disabled)

The intermediate vectors and their hash functions are needed to avoid the disruption of the flows that were mapped employing them, right before the last CPU was disabled. Let us illustrate the behavior of the Big Robust Hash algorithm with the aid of the Mapping Matrices of Fig. 2 and a simple family of hash functions based on the modulo operator: h(f) = f mod 5, h2(f) = f mod 4, and so on. For example, flow 3 is always processed by CPU 4 because whenever the first hash function h(3) is computed, it finds CPU 4 available every time. On the other hand, once CPUs 3 and 1 are disabled, packets with flow identifier 12 are mapped employing the three hash functions: h(12) finds a hop to the second Mapping Vector, and applying h2(12) over the second vector has the same result. Finally h3(12) assigns these packets to CPU 2. However, packets of flow 10 will be classified with just two hash operations, because the hop found at the first Mapping Vector redirects the search to the bottom one, which always succeeds (h3(10) → CPU 4).

A. Memory requirements

As the Mapping Matrix is initialized with a Mapping Vector of length n, and a new one is added whenever a CPU goes down, the maximum number of Mapping Vectors is n. Then, as the full Mapping Matrix is triangular, it only requires memory for storing n(n+1)/2 positions. Table I shows, for different values of n, the maximum size of the Mapping Matrix required by the Big Robust Hash algorithm. Those values are small enough to fit in the data cache of commercial Network Processor [1], [2] microengines, thus memory access latency does not seem to be a major problem.

B. Performance

If k ∈ [0, n-1] is the number of disabled CPUs, classical Robust Hash algorithms require n-k hash operations in order to perform each Flow-to-CPU mapping, that is, one hash operation per available CPU.
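The Mapping Matrix mechanism of this section can be sketched in a few lines of Python. This is an illustrative model, not the paper's implementation: the class name is made up, and the paper's simple modulo family h_i(f) = f mod |vector_i| stands in for real hash functions.

```python
from math import lcm

class BigRobustHash:
    """Sketch of the Mapping Matrix: one vector, and one hash function
    h_i(f) = f mod len(vector_i), per number of failed CPUs."""
    def __init__(self, n: int):
        self.available = list(range(1, n + 1))
        self.vectors = [list(self.available)]      # initial Mapping Vector

    def fail(self, cpu: int) -> None:
        """Append a new bottom vector with the surviving CPUs and rewrite
        every entry of the failed CPU as a hop to that new vector."""
        self.available.remove(cpu)
        new_level = len(self.vectors)
        for vec in self.vectors:
            for i, entry in enumerate(vec):
                if entry == cpu:
                    vec[i] = ("hop", new_level)
        self.vectors.append(list(self.available))

    def map(self, flow_id: int) -> int:
        """Follow hops downwards until an available CPU entry is found."""
        vec_idx = 0
        while True:
            vec = self.vectors[vec_idx]
            entry = vec[flow_id % len(vec)]        # h_i(f) = f mod len(vec)
            if isinstance(entry, int):             # an available CPU: done
                return entry
            vec_idx = entry[1]                     # follow the hop

# The example of Fig. 2 (n = 5, CPUs 3 and 1 disabled):
m = BigRobustHash(5)
m.fail(3)
m.fail(1)
print(m.map(3))    # one hash:     h(3) = 3 -> CPU 4
print(m.map(12))   # three hashes, down to the bottom vector -> CPU 2
print(m.map(10))   # two hashes:   the hop points straight down -> CPU 4

# Memory requirements, cf. Table I: the Small variant needs a vector of
# m = lcm(1..n) entries for full fault-tolerance, the Big one n(n+1)/2.
for n in (4, 8, 16, 32):
    print(n, lcm(*range(1, n + 1)), n * (n + 1) // 2)
```

Persistence holds because the entries of surviving CPUs are never rewritten: a flow that mapped to CPU 4 before a failure still hits the same entry afterwards.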
That means that, when all CPUs are up and running, it is necessary to compute n hash operations per packet and choose the biggest value. On the other hand, both Fast Robust Hash algorithms presented in this paper require just one hash operation and a fast SRAM memory access per packet in the most common case, when all CPUs are available. The complexity to update the Mapping Matrix when a CPU goes down is O(k), because only one entry per Mapping Vector should be rewritten.

The performance study of the Big Robust Hash algorithm is less trivial than the previous ones as, unlike the classical and Small variants, each packet is not always mapped after a fixed number of hash operations; each one may require a different number of operations. Obviously, though, the maximum number of operations per packet is k+1, as this is the number of Mapping Vectors when k CPUs are down. In order to study the mapping process, each vector of the Mapping Matrix is called a level, and the levels are numbered in reverse order, from the bottom Mapping Vector to the top one, as shown in Fig. 2. This way, the level number ℓ ∈ [0, k] also indicates the number of disabled CPUs/hop entries at that level. Other useful functions that characterize a level are:

P_CPUs(ℓ): probability to find any available CPU at level ℓ:

  P_{CPUs}(\ell) = \frac{n-k}{n-k+\ell}   (1)

P_hops(ℓ): probability to find any hop entry at level ℓ:

  P_{hops}(\ell) = 1 - P_{CPUs}(\ell) = \frac{\ell}{n-k+\ell}   (2)

All hop entries in a level are equiprobable (by the load-balancing property of the hash functions), thus P_hop(ℓ) is the probability to find one particular hop redirection at level ℓ:

  P_{hop}(\ell) = \frac{P_{hops}(\ell)}{\ell} = \frac{1}{n-k+\ell}   (3)

Let H(ℓ) be the random variable that defines the number of hash operations required to find an available CPU, starting with the hash function/Mapping Vector at level ℓ. Then, we are interested in \bar{H}(k), that is, the average number of hash operations until an available CPU is found, starting at the top of the Mapping Matrix (ℓ = k).
The bottom level of the Mapping Matrix (ℓ = 0) does not have any hop entries, thus only one hash operation is needed:

  \bar{H}(0) = 1   (4)

At upper levels (ℓ > 0), the number of hash operations is: 1 if a CPU entry is found (P_CPUs(ℓ)), or 1 plus the average number of operations from a lower level (1 + \bar{H}(j)) when the hop to that level is found. In every level ℓ there are ℓ equiprobable hop entries, each one pointing to a different lower level (see level ℓ = 2 at the bottom of Fig. 2), therefore:

  \bar{H}(1) = P_{CPUs}(1) + P_{hops}(1) (1 + \bar{H}(0))

  \bar{H}(2) = P_{CPUs}(2) + \frac{P_{hops}(2)}{2} (1 + \bar{H}(1)) + \frac{P_{hops}(2)}{2} (1 + \bar{H}(0))

  \bar{H}(\ell) = P_{CPUs}(\ell) + P_{hop}(\ell) \sum_{i=0}^{\ell-1} (1 + \bar{H}(i)) = 1 + P_{hop}(\ell) \sum_{i=0}^{\ell-1} \bar{H}(i)   (5)

Particularizing it for the ℓ-1 level:

  \bar{H}(\ell-1) = 1 + P_{hop}(\ell-1) (\bar{H}(0) + \dots + \bar{H}(\ell-2))

  \bar{H}(0) + \dots + \bar{H}(\ell-2) = \frac{\bar{H}(\ell-1) - 1}{P_{hop}(\ell-1)}   (6)

Expanding the elements of the summation in equation (5) and replacing most of them with equation (6):

  \bar{H}(\ell) = 1 + P_{hop}(\ell) (\bar{H}(0) + \dots + \bar{H}(\ell-2) + \bar{H}(\ell-1))
              = 1 + P_{hop}(\ell) \left( \frac{\bar{H}(\ell-1) - 1}{P_{hop}(\ell-1)} + \bar{H}(\ell-1) \right)
              = 1 + \frac{P_{hop}(\ell)}{P_{hop}(\ell-1)} (\bar{H}(\ell-1) - 1) + P_{hop}(\ell) \bar{H}(\ell-1)   (7)

Taking into account that:

  \frac{P_{hop}(\ell)}{P_{hop}(\ell-1)} = \frac{n-k+\ell-1}{n-k+\ell} = 1 - P_{hop}(\ell)   (8)

From equations (7) and (8) we get:

  \bar{H}(\ell) = 1 + (1 - P_{hop}(\ell)) (\bar{H}(\ell-1) - 1) + P_{hop}(\ell) \bar{H}(\ell-1) = \bar{H}(\ell-1) + P_{hop}(\ell)

Expanding the recursive elements of the formula:

  \bar{H}(\ell) = \bar{H}(\ell-1) + P_{hop}(\ell) = (\bar{H}(\ell-2) + P_{hop}(\ell-1)) + P_{hop}(\ell) = \bar{H}(0) + P_{hop}(1) + \dots + P_{hop}(\ell-1) + P_{hop}(\ell)

By (4) this series can be summarized as:

  \bar{H}(k) = 1 + \sum_{\ell=1}^{k} P_{hop}(\ell) = 1 + \sum_{\ell=1}^{k} \frac{1}{n-k+\ell}   (9)

Therefore, as we will see in Section IV, the average number of hash operations for the Big Robust Hash algorithm (\bar{H}(k)) is logarithmic in the number of disabled CPUs.

In order to better characterize the distribution of the number of hash operations per packet, we are also interested in the probability function P_hash(h, ℓ): the probability of performing h hash operations to find an available CPU, starting at level ℓ. The only way to perform a single hash operation is to find any of the available CPUs in the current level, thus:

  P_{hash}(1, \ell) = P_{CPUs}(\ell)

In order to employ 2 hashes, the first hash query must return one of the hops at that level, and the next query at a lower level i ∈ [0, ℓ-1] must succeed. Thus, the total probability is the sum over all the possible paths:

  P_{hash}(2, \ell) = P_{hop}(\ell) \sum_{i=0}^{\ell-1} P_{hash}(1, i)

The reasoning for 3 hashes is the same, although in this case the hop to the bottom vector is forbidden, as it has no further hops and it is not possible to perform the 2 hash operations left.
Thus, with i ∈ [1, ℓ-1]:

  P_{hash}(3, \ell) = P_{hop}(\ell) \sum_{i=1}^{\ell-1} P_{hash}(2, i)

Therefore, as all hops are equiprobable, the general expression is:

  P_{hash}(h, \ell) = P_{CPUs}(\ell) if h = 1;  P_{hash}(h, \ell) = P_{hop}(\ell) \sum_{i=h-2}^{\ell-1} P_{hash}(h-1, i) if h ≥ 2   (10)

To validate these analytical results, Section IV compares the estimation of the average number of hash operations and the probability mass function of Big Robust Hash operations with trace-driven simulation measurements.

IV. PERFORMANCE EVALUATION

All simulations are based on a frame trace from a bidirectional Gigabit Ethernet link at the University of Purdue [5]. The trace spans 90 seconds and contains 18.7 million IP packets from 187 thousand distinct flows, with 933 Mbps and 210 thousand packets per second on average. The flow identifier employed is the 32-bit integer built as the XOR of the source and destination IP addresses, protocol and TCP/UDP port numbers, with the protocol number and the greater port in the upper 16 bits and the other one in the lower 16 bits. Then a Fibonacci hash function [4], with the appropriate modulo, is employed to query the Mapping Vector.

Fig. 3. Comparison of different Robust Hash algorithms (n = 32). Hash operations per packet vs. number of CPUs down (k), for: Robust Hash (theo); Big Robust Hash: maximum (theo), average (theo) and average (sim); Small Robust Hash (theo).

Fig. 3 compares the performance of all the Robust Hash algorithms analyzed in this paper. The classical Robust Hash must execute n-k hash operations per packet. Of course, the best performer is the Small Robust Hash algorithm, which requires just one hash operation per packet irrespective of how many CPUs are down. However, this result is purely theoretical, as the memory requirements for 32 CPUs with full fault-tolerance are far beyond any practical implementation. The Big Robust Hash is middle ground between them, as it trades off performance for memory (528 bytes for a 32-CPU cluster). Fig. 3 shows the theoretical curves of the average number of hash operations (Eq. (9)) and the k+1 upper limit of the Big Robust Hash algorithm. Regarding the simulation measurements of the Big Robust Hash algorithm, the minimum, first quartile, mean, third quartile and maximum number of hash operations employed per mapping are shown.
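The theoretical average of Eq. (9) can be cross-checked with a quick Monte-Carlo run of the level walk. This is an illustrative sketch with made-up function names: uniform random hashing stands in for the paper's trace-driven setup.

```python
import random

def avg_hashes_theory(n: int, k: int) -> float:
    """Eq. (9): average hash operations with k of n CPUs down."""
    return 1 + sum(1 / (n - k + l) for l in range(1, k + 1))

def avg_hashes_sim(n: int, k: int, trials: int = 200_000) -> float:
    """Walk the triangular Mapping Matrix with uniform random hashes.
    Level l holds n-k CPU entries plus l hops, hop j leading to level j."""
    total = 0
    for _ in range(trials):
        level, hashes = k, 0
        while True:
            hashes += 1
            slot = random.randrange(n - k + level)
            if slot < n - k:            # hit an available CPU entry
                break
            level = slot - (n - k)      # follow hop j down to level j
        total += hashes
    return total / trials

n, k = 32, 30
print(round(avg_hashes_theory(n, k), 2))   # 3.56 with only 2 CPUs left
print(round(avg_hashes_sim(n, k), 2))      # close to the theoretical value
```

The harmonic sum grows logarithmically in k, which is why the average curve in Fig. 3 stays far below the k+1 maximum.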

TABLE II
CUMULATIVE PROBABILITY OF THE NUMBER OF HASH OPERATIONS (h) FOR THE BIG ROBUST HASH ALGORITHM

               h = 1     2       3       4       5
n = 16, k = 4   0.75   0.9738  0.9987  0.9999  1.0
n = 16, k = 8   0.5    0.8627  0.9771  0.9975  0.9998
n = 16, k = 12  0.25   0.6212  0.8694  0.9683  0.9944
n = 32, k = 8   0.75   0.9697  0.9978  0.9999  0.9999
n = 32, k = 16  0.5    0.8545  0.972   0.996   0.9996
n = 32, k = 24  0.25   0.6086  0.853   0.9586  0.9909
n = 64, k = 16  0.75   0.9677  0.9973  0.9998  0.9999
n = 64, k = 32  0.5    0.8505  0.9694  0.9953  0.9994
n = 64, k = 48  0.25   0.6025  0.8449  0.9533  0.9887

Fig. 4. Probability mass functions of the Big Robust Hashing algorithm (n = 32), for k = 4, 16, 24 and 30.

As expected, the average number of operations increases much more slowly than the maximum values (i.e. with 31 Mapping Vectors, one of the 2 CPUs left is found after only 3.63 hash operations on average). This behavior can be better seen in Fig. 4, which shows, for representative values of k, the histograms of the number of hash operations employed by the Big Robust Hash simulation. Table II shows the cumulative probability of performing h or fewer hash operations per packet for different values of n and k. These values have been obtained by evaluating the theoretical Eq. (10). For example, the fifth row (n = 32, k = 16) indicates that when only half of the 32 CPUs are available, 99% of the packets will be mapped with 4 or fewer hash operations, and most of them will require just 2 hashes (85%). Another scalability property illustrated by Table II is that the number of hash operations does not depend on the total number of CPUs but on the available/total CPU ratio. For example, the cumulative probabilities when 50% of the CPUs are down (i.e. 2nd, 5th and 8th rows) are almost identical, in spite of the fact that the cluster size doubles in every step. Therefore, although increasing n implies a slight performance decrease, clearly the main factor is the (n-k)/n ratio.
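The ratio property can be reproduced by evaluating Eq. (10) with a small dynamic program. This is a sketch with hypothetical names, not the paper's evaluation code.

```python
from functools import lru_cache

def cumulative(n: int, k: int, h_max: int) -> list:
    """P(hash operations <= h) from Eq. (10), for h = 1..h_max."""
    p_cpus = lambda l: (n - k) / (n - k + l)
    p_hop = lambda l: 1 / (n - k + l)

    @lru_cache(maxsize=None)
    def p_hash(h, l):
        if h == 1:
            return p_cpus(l)
        # Eq. (10): sum over the reachable lower levels i = h-2 .. l-1
        return p_hop(l) * sum(p_hash(h - 1, i) for i in range(h - 2, l))

    out, acc = [], 0.0
    for h in range(1, h_max + 1):
        acc += p_hash(h, k)
        out.append(round(acc, 4))
    return out

# Doubling the cluster with half the CPUs down barely changes anything:
print(cumulative(16, 8, 2))    # [0.5, 0.8627]
print(cumulative(32, 16, 2))   # [0.5, 0.8545]
print(cumulative(64, 32, 2))   # [0.5, 0.8505]
```

The h = 2 column barely moves as n doubles with k/n fixed at 50%, matching the 2nd, 5th and 8th rows of Table II.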
These results confirm that Big Robust Hash mappings do employ a very small number of hash operations when compared with classical Robust Hash algorithms. Moreover, the probability of performing more hash operations than the average value quickly tends to zero, as shown in Fig. 4.

V. CONCLUSION

Nowadays, flow processing at edge routers is becoming commonplace (e.g. stateful firewalls, NAT, Session Border Controllers, etc.). Therefore, multi-processor routers need a fast, distributed, and scalable mechanism to load-balance flows among all their available service-specialized CPUs or to a cluster of external devices, while ensuring that all packets of a given flow are assigned to the same CPU. Robust Hash techniques are especially well suited to address this issue, although classical Robust Hash algorithms require several hash operations per packet in order to find which CPU is processing that flow.

This paper presents two Fast Robust Hash algorithms that require only one hash operation to perform each mapping when all CPUs are available, whereas classical Robust Hash algorithms require one hash operation for each available CPU. These new algorithms achieve better performance by employing Mapping Vectors to maintain a persistent mapping, even when several CPUs are disabled. The Small Robust Hash algorithm always finds the mapping with a single hash operation. However, when full fault-tolerance is required, it has severe memory requirements, thus it is only applicable to architectures with a small number of CPUs. On the other hand, the Big Robust Hash algorithm does not employ a single, large Mapping Vector, but a small Mapping Matrix. In the worst case it requires one hash operation and a fast SRAM memory access for each disabled CPU, plus the initial ones. However, the average number of hash operations is far below this limit, as demonstrated with analytical and trace-driven simulation results.
Therefore, the Fast Robust Hashing algorithms could be good design choices for cluster load-balancers or next-generation multi-processor edge router architectures.

ACKNOWLEDGMENTS

This work is being funded by the Spanish MEC under project IMPROVISA TSI2005-07384-C03. The authors also wish to thank Raquel Panadero, Iván Vidal, Ricardo Romera and Carlos Jesús Bernardos for their valuable comments.

REFERENCES

[1] J. Allen et al. IBM PowerNP Network Processor: Hardware, Software and Applications. IBM Journal of Research and Development, Vol. 47, pp. 177-194, March/May 2003.
[2] Intel IXP2XXX Product Line of Network Processors: <http://www.intel.com/design/network/products/npfamily/ixp2xxx.htm>
[3] L. Kencl and J. Y. Le Boudec. Adaptive Load Sharing for Network Processors. Proceedings of IEEE INFOCOM 2002, Vol. 2, pp. 545-554, June 2002.
[4] D. E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, Second Edition. Addison-Wesley, 1997.
[5] NLANR trace from a Gigabit Ethernet link at the University of Purdue: <ftp://pma.nlanr.net/traces/daily/20060218/pur-4097090.erf.gz>. 18 February 2006.
[6] K. W. Ross. Hash routing for collections of shared web caches. IEEE Network, Vol. 11, No. 6, pp. 37-44, November/December 1997.
[7] D. G. Thaler and C. V. Ravishankar. Using name-based mappings to increase hit rates. IEEE/ACM Transactions on Networking, Vol. 6, No. 1, pp. 1-14, February 1998.
[8] L. Yang, R. Dantu, T. Anderson and R. Gopal. Forwarding and Control Element Separation (ForCES) Framework. RFC 3746, April 2004.