Per-flow Re-sequencing in Load-Balanced Switches by Using Dynamic Mailbox Sharing




Hong Cheng, Yaohui Jin, Yu Gao, YingDi Yu, Weisheng Hu
State Key Laboratory on Fiber-Optic Local Area Networks and Advanced Optical Communication System, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
Email: jinyh@sjtu.edu.cn

Nirwan Ansari
Advanced Networking Laboratory, NJIT, Newark, NJ 07012, USA
Email: Nirwan.Ansari@njit.edu

Abstract: Load-balanced switches have received much attention because they are more scalable than other switch architectures. However, a load-balanced switch suffers from packet mis-sequencing. In this paper, we propose a Dynamic Mailbox Sharing (DMS) scheme that eliminates the mis-sequencing problem of load-balanced switches at the cost of only a very small increase in delay. The key idea is to keep only packets of the same flow in order in the load-balanced switch. The DMS scheme is based on two statistical facts observed in operational networks: the number of simultaneously active flows in a router buffer is far smaller than the number of in-progress flows, and most intra-flow packet intervals are longer than the packet delay in a high-speed router. In DMS, the arrival sequence of packets of the same flow at the input ports is recorded in a mailbox maintained at the output port, and packets of the same flow are then delivered in the order of their arrivals. To accommodate a large number of flows, the mailbox would otherwise become a bottleneck; we therefore propose a dynamic sharing scheme that alleviates this bottleneck and greatly enhances the scalability of the mailbox. Through simulations using real Internet traffic traces, we show that, with a simple flow-splitter mechanism restraining mis-sequencing, the average packet delay of DMS is considerably lower than that of other schemes, including Uniform Frame Spreading, Padded Frame, and the CR switch, and is close to the ideal case without re-sequencing even when the load is very high.
The results also demonstrate that the required mailbox size is in the hundreds of bins.

Keywords: load-balanced switch; mailbox; dynamic sharing

I. INTRODUCTION

The load-balanced switch architecture [1], [2] has received much attention because it is considered more scalable than other switch architectures. The basic load-balanced switch consists of a two-stage crossbar. The first stage performs load balancing: it distributes packets from the input ports uniformly to the mediate buffers. The second stage performs switching, sending packets to their destined output ports. The connection patterns of both stages are deterministic and periodic, so no central scheduler is needed, which makes the architecture more scalable than others. The problem with the load-balanced switch architecture is that there are multiple paths between each input/output pair, so packets belonging to the same input/output pair may be delivered out of order [2]. In general, routers shall not deliver packets belonging to the same application flow out of order.

Many approaches have been proposed to solve the mis-sequencing problem in load-balanced switches. They can be categorized into two classes. In the first class, a re-sequencing buffer at the output port is used. For example, in the Earliest Deadline First (EDF) scheme, a flow splitter and a load-balancing buffer are added at the input of the load-balancing stage to limit the re-sequencing buffer size [2]. However, this scheme requires complicated hardware and incurs non-scalable computation overhead [10]. In the second class, mis-sequencing is prevented throughout both stages, so no re-sequencing buffer is needed at the output port. Schemes of the second class, such as the mailbox switch [11] and Uniform Frame Spreading (UFS) [4], [6], are more scalable than those of the first class.
Although the UFS scheme works well under heavy load, it suffers from starvation under light load, because accumulating a full frame incurs a large waiting time. To improve the performance of UFS under light load, the Padded Frame (PF) [3] and CR switch [6] schemes have been proposed.

We observe that the schemes mentioned above (UFS, PF, and CR) not only keep packets of the same application flow in order; they also deliver all packets belonging to the same input/output pair in order. (In the following, we use "flow" to denote an application flow for simplicity.) Yet routers are not required to deliver packets destined to the same output port in order. This is an unnecessary self-imposed constraint that results in redundant re-sequencing operations, and the time wasted on them needlessly increases packet delay. Our idea is to keep only packets of the same flow in order in the load-balanced switch.

The idea is based on two statistical facts about operational networks. The first concerns the number of flows. Early measurement results showed that the number of in-progress flows can be extremely large, rendering per-flow queuing infeasible. However, it has recently been observed that the number of active flows in a switch is typically measured in the hundreds, even though there may be tens of thousands of flows in progress [8]. In [5], the authors counted the number of concurrent flows in Internet traces, setting the statistical time scale to 10 ms to observe the actual number of active flows; they likewise found that the number of active flows is in the hundreds. Based on this statistical fact, Hu et al. [5] proposed a Dynamic Queue Sharing scheme to implement scalable per-flow queuing in high-speed routers. The second concerns the intra-flow packet interval. In [7], from an analysis of Internet traffic traces, the authors observed that most intra-flow packet intervals exceed tens of microseconds, while a packet is typically buffered in a high-speed router for only a few microseconds, implying that the probability of a packet encountering mis-sequencing is small. It is thus expected that the overhead of keeping packets of a flow in order in a load-balanced switch is low.

978-1-4244-2075-9/08/$25.00 (c) 2008 IEEE

Based on the above two statistical facts, we propose a Dynamic Mailbox Sharing (DMS) scheme to implement per-flow re-sequencing in a load-balanced switch. Dynamic mailboxes are placed at the output ports of the load-balanced switch. The arrival sequence of packets of the same flow is recorded in the mailbox, and packets are then delivered in accordance with the recorded sequence. To make the implementation of the mailbox scalable, bins in the mailbox are dynamically shared among active flows: when a flow becomes inactive, the bin it occupies is released and can be reused by another active flow. With this dynamic sharing mechanism, the number of bins in the mailbox can be reduced from millions to hundreds.

By investigating the distributions of the intra-flow packet interval and of the packet delay in the load-balanced switch, we show that most packets have an intra-flow packet interval long enough that the probability of encountering mis-sequencing is small, which implies that a much more efficient scheme can be obtained by avoiding unnecessary sequence-maintenance operations.
By simulations using real Internet traces, we demonstrate that, with a simple flow-splitter mechanism restraining mis-sequencing in the input buffer of the first stage, the average packet delay of DMS is lower than that of other schemes, including Uniform Frame Spreading [4], Padded Frame [3], and the CR switch [6], and is close to that of the ideal case without re-sequencing. We also demonstrate that the required mailbox size is in the hundreds of bins.

The rest of the paper is organized as follows. In Section II, we present the architecture and operation of DMS in a load-balanced switch. In Section III, we analyze the probability of a packet encountering mis-sequencing. In Section IV, we investigate the performance of Dynamic Mailbox Sharing and compare it with other schemes. Finally, concluding remarks are given in Section V.

II. THE SWITCH ARCHITECTURE

A. Switch Architecture

Fig. 1 shows the architecture of an N x N load-balanced switch with Dynamic Mailbox Sharing. As in the basic load-balanced switch, the switch consists of a two-stage crossbar. The input buffer of the first-stage crossbar is a FIFO. In the mediate buffer at the input of the second-stage crossbar, VOQs are maintained for the output ports. Time in the switch is slotted, and cells are of the same size.

Figure 1. The architecture of a load-balanced switch with DMS

The key components of DMS are the dynamic mailboxes maintained at the output ports. A mailbox is a re-sequencing buffer that keeps packets of the same flow in order. Each mailbox consists of a number of bins, and a bin is allocated to each active flow in the switch; since the number of bins in use is not fixed, it is referred to as a dynamic mailbox. The cells of a bin store packets belonging to that flow, and cells can also be registered in advance by packets that have not yet arrived.
Since no mechanism is employed in the input buffer to prevent mis-sequencing, packets belonging to the same flow may arrive at the output port out of order. To keep them in order, the sequence of their arrivals at the input ports of the first-stage crossbar is recorded by the mailbox through a registration mechanism; packets arriving at the output port of the second-stage crossbar are then delivered according to this arrival sequence. We next describe how the mailbox keeps packets of the same flow in order.

Registration: When a packet enters the switch, a registration message for the packet is passed to the dynamic mailbox of its destined output port. The registration message includes the ID of the packet and of the flow it belongs to. The mailbox then searches for the bin corresponding to the flow in the message; if no bin is found, a free bin is allocated to the flow. The first empty cell of the bin is then allocated to the packet and, to record the registration, the cell is tagged with the ID of the packet.

Sending packets: At the output port, packets are sent out in accordance with the sequence registered in the mailbox when they first arrived at the switch. If a packet arrives at the output port earlier than its registered sequence allows, it is buffered in the mailbox until all packets registered before it have been delivered. After a packet has been delivered, its registered cell is released, and the registered cells behind it are moved forward by one position.

B. Dynamic Mailbox Sharing

If a bin were kept for every in-progress flow, the number of bins in the mailbox would be exorbitantly large. Based on the fact mentioned in the introduction that the number of simultaneously active flows in a high-speed router is in the hundreds, we propose a scalable Dynamic Mailbox Sharing mechanism to implement the mailbox, similar to the previous work on Dynamic Queue Sharing (DQS) [5].
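The registration and delivery rules above, together with the bin release that makes dynamic sharing possible, can be sketched as a minimal Python model. The class and method names below are ours, not from the paper, and the sketch abstracts away the hardware details (per-cell tags, cell shifting) behind simple queues:

```python
from collections import deque

class DynamicMailbox:
    """Sketch of one output port's mailbox: one bin per active flow."""

    def __init__(self):
        self.bins = {}      # flow id -> deque of registered packet IDs
        self.waiting = {}   # packet id -> packet held until its turn

    def register(self, flow, packet_id):
        # Registration: on arrival at a switch input, record the packet's
        # position in its flow's bin; allocate a free bin for a new flow.
        self.bins.setdefault(flow, deque()).append(packet_id)

    def receive(self, flow, packet_id, packet):
        # At the output port: a packet departs only after every packet of
        # its flow registered before it has departed; otherwise it waits.
        self.waiting[packet_id] = packet
        delivered = []
        q = self.bins[flow]
        while q and q[0] in self.waiting:
            delivered.append(self.waiting.pop(q.popleft()))
        if not q:
            del self.bins[flow]  # empty bin: released for reuse (DMS)
        return delivered         # packets, in registered (arrival) order
```

Even if the second packet of a flow reaches the output first, `receive` holds it and releases both packets in their registered order; once the bin drains, its entry disappears, which is the dynamic sharing that keeps the bin count small.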

Only a limited number of bins are kept in the mailbox, and they are shared by the active flows. The mapping between bins and active flows is kept in an Active Flow Mapping (AFM) table (see Fig. 2). When a registration message for a packet of flow fi arrives, the flow identification, defined by the 5-tuple (source IP address, destination IP address, protocol, source port, destination port), is looked up in the AFM table. If no matching entry is found, a free bin is allocated to this new active flow, and a mapping entry is added to the AFM table. When a bin in the mailbox becomes empty, i.e., none of its cells is registered, it is released and its mapping entry is deleted from the AFM table. With this dynamic sharing mechanism, the number of bins in the mailbox can be kept in the hundreds rather than in the millions. By using hashing to further expedite search operations, lookups in the AFM table can be performed in only a few slots [5].

Figure 2. Using hashing to divide the AFM table into sub-tables

Before closing this section, it is worth mentioning the effect of packet segmentation in the load-balanced switch. Variable-length packets are divided into fixed-size cells after entering the switch, and these cells are reassembled into complete packets before leaving it. Keeping the cells of a packet in order generally brings no direct benefit to packet delay, because a packet cannot leave the switch until all of its segments have arrived, whether or not they arrive in order. The one inconvenience of out-of-order cells is that a random-access queue is needed to reassemble them. In the following sections, we therefore focus on the problem of packet mis-sequencing.

III. THE PROBABILITY OF MIS-SEQUENCING

In this section, we investigate the distribution of the intra-flow packet interval and how it affects the probability of packet mis-sequencing. As mentioned in the introduction, the distribution of intra-flow packet intervals can greatly affect mis-sequencing. To examine this distribution, we analyze Internet packet traces derived from an OC-48 link collected by NLANR [14], as well as traces collected from CERNET (China Education & Research Network) by DragonLab [15]. We name the traces by their source and their average throughput. Fig. 3 shows the distribution of intra-flow packet intervals for the various traces. Most intra-flow packet intervals are larger than tens of microseconds. By summarizing the statistics, we obtain approximate proportions for the different intra-flow packet interval ranges, shown in the first two columns of TABLE I.

Figure 3. The distribution of intra-flow packet intervals (NLANR-144M, CERNET-33M, NLANR-66M, NLANR-17M)

Another factor that affects mis-sequencing is the packet delay in the router. To get a general idea of its distribution, we review previous measurements of packet delay through routers in operational networks [12][13]. In [13], the authors reported measurements of single-hop packet delay through operational routers in the Sprint IP backbone network. In those measurements, the link utilization was moderate (less than 70%). The results show that on an OC-12 link, the average queuing delay is only a few microseconds; more than 30% of the packets go through the switch without any queuing delay, and the 99th-percentile delays are below 50 µs. Comparing these figures with the intra-flow packet interval statistics, we see that most packets have an intra-flow packet interval larger than the average packet delay.
Under the assumption of uniform and admissible input traffic, the delay in the load-balanced switch is the same as that in an equivalent single-stage switch with a fixed service rate for each VOQ [9][1]. Assuming that arrivals have exponential inter-arrival times with mean rate λ and that each packet is a fixed-size one-cell packet with service rate µ, we can model the packet delay of each input/output pair by the waiting time in an M/D/1 queue, and further approximate it by the waiting time in an M/M/1 queue with arrival rate λ/N and service rate µ/N. The packet delay then has an exponential distribution with parameter (µ/N)(1 - λ/µ). Take, for example, a reasonable load-balanced switch implementation with N = 16, an output port rate equal to the OC-192 link speed, and a cell size of 100 bytes. The distribution of the packet delay under different loads is shown in Fig. 4.

Figure 4. Cumulative probability of the packet delay (load = 0.7, 0.8, 0.9, 0.95)

Next, we show that the above properties of the intra-flow packet interval and of the packet delay in the load-balanced switch lead to a small probability of packets being transmitted out of order. Suppose packet P1 is blocked in the mailbox because of mis-sequencing; then there must be an older packet P2 of the same flow still buffered in the switch. Let I be the interval between their arrivals. For mis-sequencing to occur, the delay of P2 (denoted d2) must exceed the delay of P1 (denoted d1) plus I. Considering only the queuing-delay portion, we obtain an approximation for the probability of P1 encountering mis-sequencing:

    P(d2 - d1 > I) = P(d2q - d1q > I) < P(d2q > I)    (1)

where d1q and d2q denote the queuing delays of P1 and P2. Combining all of the above results for the load-balanced switch implementation example, we obtain the relationship between the probability of packet mis-sequencing and the intra-flow packet interval under different loads, shown in TABLE I.

TABLE I. INTERVAL PROPORTIONS AND THE MIS-SEQUENCING PROBABILITY IN A LOAD-BALANCED SWITCH IMPLEMENTATION EXAMPLE

  Intra-flow Packet   Proportion   Mis-sequencing Probability Approximation
  Interval Range                   Load=0.7   Load=0.8   Load=0.9   Load=0.95
  [100 µs, ∞)         0.744        0          0          7.6E-4     2.8E-2
  [50 µs, 100 µs)     0.131        2.0E-5     8E-4       2.8E-2     1.7E-1
  [20 µs, 50 µs)      0.076        1.4E-2     5.7E-2     2.4E-1     4.9E-1
  [10 µs, 20 µs)      0.024        1.2E-1     2.4E-1     4.9E-1     7.0E-1
  [5 µs, 10 µs)       0.013        3.4E-1     4.9E-1     7.0E-1     8.4E-1

From TABLE I, we can see that most packets have an intra-flow packet interval in a range that yields a small probability of mis-sequencing, especially when the load is not high. This implies that, for a large portion of packets, it is unnecessary to enforce their sequence at the price of increased packet delay. With the DMS mechanism, only packets that need to be reordered are queued in the mailbox. We verify this conclusion by counting the percentage of packets queued in the mailbox in the simulations of the next section.

IV. SIMULATION

In this section, we study the performance of the load-balanced switch with DMS by simulations using real Internet traffic traces from NLANR and CERNET.
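Under the exponential delay approximation, the bound P(d2q > I) in (1) is straightforward to evaluate for the implementation example (N = 16, OC-192 taken here as 10 Gb/s, 100-byte cells). The sketch below is our own and only reproduces the trend and order of magnitude of TABLE I, since the paper's exact rounding conventions are not specified:

```python
import math

N = 16                  # switch size in the example
mu = 10e9 / (100 * 8)   # service rate in cells/s: ~OC-192 link, 100-byte cells

def missequencing_bound(load, interval_s):
    # P(d2q > I) for an exponentially distributed delay with
    # parameter (mu/N)(1 - load), per the M/M/1 approximation.
    return math.exp(-(mu / N) * (1.0 - load) * interval_s)

# The bound falls off rapidly with the intra-flow interval:
for us in (5, 10, 20, 50, 100):
    print(us, "us:", round(missequencing_bound(0.9, us * 1e-6), 3))
```

At load 0.9, for instance, the bound is already below one in a thousand for intervals of 100 µs, consistent with the near-zero entries in the first row of TABLE I.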
Since a trace with higher average throughput has a smaller average intra-flow packet interval, we choose the traces with nearly the highest throughput among those from the same collecting source, thus representing the worst case. We inject the traffic of the various traces as input into the proposed load-balanced switch, with each input port starting to read the trace file from a different location. Flows in the traces are mapped to output ports such that the average packet arrival rate to each output is the same and does not exceed the output port capacity.

In the simulations, the load-balanced switch is time-slotted and processes fixed-size cells. Since packets from the traces are of variable length, they are segmented into 64-byte cells before entering the switch. To adjust the arrival rate, the length of a slot is set according to the throughput of the trace. For example, if the throughput of a trace is 500 Mb/s, then to achieve an arrival rate of 0.8, the slot (the time unit to transmit one 64-byte cell) length is ((64 x 8)/500M) x 0.8 = 0.8192 µs. Each experiment lasts at least 10^6 slots.

A. Average Delay

We first study the average delay of the proposed load-balanced switch with DMS, and then compare it with that of other schemes, including UFS, PF, and the CR switch. (Because the mailbox switch [11] is similar to the CR switch [6] working in contention mode, we do not include it in the comparison.) To show clearly the extra delay caused by re-sequencing, we also plot the average delay of the basic load-balanced switch (BLBS) without any re-sequencing. The results of applying the CERNET-33M traces to the various 16 x 16 switches are shown in Fig. 5. (Owing to page limitations, we omit the figure for the simulations with the NLANR-133M traces.) The results for CR, UFS, and PF using the CERNET-33M trace (Fig. 5) conform to those reported in [3] and [6] for bursty input traffic.
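The slot-length scaling used in this setup amounts to a one-line computation (the function name below is ours):

```python
def slot_length_us(trace_throughput_mbps, target_load, cell_bytes=64):
    # Time (in µs) to transmit one cell at the trace's own throughput,
    # scaled down by the target load so that the trace occupies exactly
    # that fraction of the emulated link.
    return (cell_bytes * 8 / trace_throughput_mbps) * target_load

print(slot_length_us(500, 0.8))  # ~0.8192 µs for the 500 Mb/s example
```

Shrinking the slot below the trace's natural cell time effectively speeds up the emulated link, so the same trace presents a lower (here, 0.8) load to the switch.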
We can observe from both sets of results that when the load is not high ([0, 0.8] in Fig. 5), DMS outperforms the other schemes (PF, CR, and UFS) and is very close to BLBS (the ideal case). By avoiding unnecessary sequence-maintenance operations, DMS keeps packets of the same flow in order with only a small sacrifice in packet delay. However, when the load is extremely high, the average delay of DMS is not as good as that of the other schemes.

To improve the performance of DMS under high load, we can use a mechanism that restrains mis-sequencing in the input buffer of the first stage. Chang et al. [2] proposed maintaining a flow splitter and a load-balancing buffer in the input buffer. The load-balancing buffer at the input of the first stage consists of N VOQs destined for the N output ports of the first stage. Packets belonging to the same input/output pair are split in round-robin fashion over the mediate buffers, so that the lengths of the VOQs in the mediate buffers are approximately equal and the re-sequencing delay is reduced.

We compare the average delay of DMS with a flow splitter against the other schemes (UFS, PF, and CR) under high load by simulations in a 16 x 16 switch with the CERNET-33M trace. We observe a result similar to that of the moderate-load case: the average delay of DMS is very close to that of BLBS (the ideal case), and it outperforms the other schemes, including UFS, PF, and CR. We also observe that the difference between the average delay of DMS with a flow splitter and that of the ideal case does not grow as the load increases.

Figure 5. The average delay using the CERNET-33M trace (BLBS, DMS, UFS, PF, CR)

B. Mailbox

In this section, we investigate the details of the operation of the mailbox. We examine the percentage of packets that were queued in the mailbox in the simulations of the 16 x 16 switch using the CERNET-33M trace, as shown in TABLE II, together with the number of bins in use and the number of registered cells. From TABLE II, we observe that the percentage of packets queued in the mailbox is small: even when the load is as high as 0.9, it is less than 0.1. So it is indeed unnecessary to perform sequence-keeping operations for every packet. We also observe that, on average, there are only a few registered cells per bin, which represents the average number of packets of an active flow in the switch. The maximum number of bins in use is in the low hundreds; hence, a mailbox size (number of bins) of only several hundred is sufficient to hold all the active flows.

V. CONCLUSION

In this paper, we have proposed a Dynamic Mailbox Sharing mechanism for load-balanced switches to solve the mis-sequencing problem. In the DMS mechanism, the arrival sequence of packets of the same flow is recorded in a mailbox maintained at the output port, and packets of the same flow are delivered according to this recorded sequence. To achieve scalability, bins in the mailbox are dynamically shared among active flows; with this dynamic sharing mechanism, the number of bins in the mailbox can be kept in the low hundreds.
By simulations, we have also demonstrated that, with a simple flow-splitter mechanism in the input buffer, the average delay of DMS is considerably smaller than that of other schemes, including Uniform Frame Spreading, Padded Frame, and the CR switch. The effect of packet segmentation requires further research and will be part of our future work.

TABLE II. STATISTICS OF THE MAILBOX

  Arrival   Maximum Number   Average Number   Maximum Number of      Average Number of      Percentage of Packets
  Rate      of Bins in Use   of Bins in Use   Registered Cells/Bin   Registered Cells/Bin   Queued in Mailbox
  0.7        68               7.34            12                     1.22                   0.037
  0.75       79               9.89            11                     1.31                   0.044
  0.8       110              14.0             12                     1.44                   0.052
  0.83      130              18.2             21                     1.60                   0.056
  0.86      147              25.3             15                     1.74                   0.060
  0.9       185              44.0             15                     2.10                   0.065
  0.92      208              66.3             15                     2.56                   0.068

ACKNOWLEDGMENT

The authors would like to thank Professor Wende Zhong (NTU, Singapore) for his discussions.

REFERENCES

[1] C.S. Chang, D.S. Lee, and Y.S. Jou, "Load balanced Birkhoff-von Neumann switches, part I: one-stage buffering," Computer Communications, vol. 25, pp. 611-622, 2002.
[2] C.S. Chang, D.S. Lee, and C.M. Lien, "Load balanced Birkhoff-von Neumann switches, part II: multi-stage buffering," Computer Communications, vol. 25, pp. 623-634, 2002.
[3] J.J. Jaramillo, F. Milan, and R. Srikant, "Padded frames: a novel algorithm for stable scheduling in load-balanced switches," Proceedings of CISS, Princeton, NJ, March 2006.
[4] I. Keslassy, "The load-balanced router," Ph.D. dissertation, Stanford University, Stanford, CA, USA, 2004.
[5] C. Hu, Y. Tang, X. Chen, and B. Liu, "Per-flow queueing by dynamic queue sharing," Proceedings of IEEE INFOCOM, Anchorage, Alaska, 2007.
[6] C.L. Yu, C.S. Chang, and D.S. Lee, "CR switch: a load-balanced switch with contention and reservation," Proceedings of IEEE INFOCOM, Anchorage, Alaska, 2007.
[7] L. Shi, Y. Zhang, J. Yu, B. Xu, B. Liu, and J. Li, "On the extreme parallelism inside next-generation network processors," Proceedings of IEEE INFOCOM, Anchorage, Alaska, 2007.
[8] A. Kortebi, L. Muscariello, S. Oueslati, and J. Roberts, "Evaluating the number of active flows in a scheduler realizing fair statistical bandwidth sharing," Proceedings of ACM SIGMETRICS, 2005, pp. 217-228.
[9] I. Keslassy, S.T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, and N. McKeown, "Scaling internet routers using optics," Proceedings of ACM SIGCOMM, Karlsruhe, Germany, August 2003.
[10] C.S. Chang, D.S. Lee, and C.Y. Yue, "Providing guaranteed rate services in the load balanced Birkhoff-von Neumann switches," Proceedings of IEEE INFOCOM, 2003.
[11] C.-S. Chang, D.-S. Lee, and Y.-J. Shih, "Mailbox switch: a scalable two-stage switch architecture for conflict resolution of ordered packets," Proceedings of IEEE INFOCOM, vol. 3, pp. 1995-2006, Hong Kong, 2004.
[12] B.Y. Choi, S. Moon, Z.L. Zhang, K. Papagiannaki, and C. Diot, "Analysis of point-to-point packet delay in an operational network," Proceedings of IEEE INFOCOM, Hong Kong, March 2004.
[13] K. Papagiannaki, S. Moon, C. Fraleigh, P. Thiran, and C. Diot, "Measurement and analysis of single-hop delay on an IP backbone network," Proceedings of IEEE INFOCOM, San Francisco, April 2002.
[14] NLANR, "Passive Measurement and Analysis (PMA)." [Online]. Available: http://pma.nlanr.net
[15] DRAGON-Lab, "CERNET trace download." [Online]. Available: http://dragonlab.org/traffic/