Threshold-based Exhaustive Round-robin for the CICQ Switch with Virtual Crosspoint Queues

Kenji Yoshigoe
Department of Computer Science
University of Arkansas at Little Rock
Little Rock, AR 72204
kxyoshigoe@ualr.edu

Abstract-A multi-cabinet implementation of a combined input and crosspoint queued (CICQ) switch introduces a large round-trip time (RTT) latency between the line cards and the switch fabric, making the crosspoint (CP) buffer requirement impractical. Virtual crosspoint queues (VCQs), proposed in the literature, are shared among a set of virtual output queues (VOQs) and CP buffers for the same input port, reducing the minimum memory required inside the switch fabric. In this paper, threshold-based exhaustive round-robin (T-ERR) is employed to improve the throughput of the CICQ switch with VCQs. T-ERR at the VCQ and CP arbiters serves packets residing in a longer queue more aggressively than packets residing in a shorter queue. T-ERR is simple yet drastically increases throughput for the CICQ switch with a small VCQ size. Simulation experiments with unbalanced traffic show that its throughput improves from 8% to 94% for a CP size of 4 cells and from 73% to 83% for a CP size of 2 cells for RTT = 64 cell times. Furthermore, its throughput is independent of switch size and RTT. Thus, the proposed scheme makes the scalable implementation of a distributed CICQ switch practical.

Index Terms-CICQ switches, flow control, scalability, and virtual crosspoint queues

I. INTRODUCTION

A combined input and crosspoint queued (CICQ) switch [1][2][3] is receiving much attention for its scalability []. Fig. 1 shows the NxN CICQ switch. It has limited buffering at each crosspoint (CP) in addition to buffering at each input port. Since the distributed schedulers at each input port and inside the switch fabric run in parallel and independently, the synchronization among input ports required for matrix scheduling in a virtual output queued (VOQ) input queued (IQ) switch is no longer needed, significantly reducing the scheduling cycle [2].
Fig. 1. CICQ switch. VOQ-S = VOQ scheduler; CP-S = CP scheduler.

Until recently, a packet switch could be built in a single cabinet. To accommodate the growth of Internet traffic, however, current packet switches consist of a large number of line cards, resulting in larger physical space and power requirements. Consequently, a multi-cabinet implementation of the packet switch is a current trend [5][6][7]. This means that the distance between the line cards and the switch fabric can be tens of meters [8]. Thus, the round-trip time (RTT) delay between the line cards and the switch fabric has a significant impact on the design of multi-cabinet packet switches. In [9], it was shown that RTT delay significantly increases contention on output ports for VOQ IQ switches based on parallel and iterative scheduling algorithms. The problem with a multi-cabinet implementation of a VOQ IQ switch is that as many as RTT switch matrix scheduler units (with RTT measured in fixed-size packets, or cells) are required to maintain the work conservation of the switch. Note that an output queued (OQ) switch is practically impossible to implement for a large port count because it requires an N-times speedup on the internal links between the line cards and the switch fabric, in addition to an N-times speedup in the switch fabric and the output buffers.

Ever since the RTT delay internal to the CICQ switch was first addressed in [4], several studies have addressed the issue of CP buffer size [10-16]. A reduction of the CP buffer size for CICQ switches with multiple-level priority traffic is investigated in [10], reducing the CP buffer size from $N^2 \cdot P \cdot RTT$ to $N^2 \cdot RTT + P$, where P is the number of priority levels. A two-lane buffered crossbar design was proposed to handle more than two levels of priority traffic using only two queues per CP [11]. It was observed that the CICQ switch with a CP buffer size that can hold 6% of back-to-back cells in transit between the line card and the CP buffer has acceptable performance [12]. In a load-balanced CICQ switch [13], an extra switch stage was inserted between the input ports and the buffered crossbar to relax the CP buffer size such that flows with high data rates can be handled when the CP buffer size is smaller than the RTT; however, there is an additional cost for implementing a load balancer. Both the shared CP memory [14] and rate-based flow control for the CICQ switch [15] also reduce the minimum CP buffer size. CICQ switches that are cost-effective and scale independently of the growth of the RTT value are of interest to us.

The CICQ switch with virtual crosspoint queues (VCQs) studied in [16] has a VCQs unit associated with each individual input port inside the switch fabric (see Fig. 2). VOQ schedulers (VOQ-S), VCQ schedulers (VCQ-S), and CP schedulers (CP-S) run in parallel and independently.
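As a rough illustration of why the RTT drives fabric memory, the Python sketch below compares total buffering in cells under two simplifying assumptions that are made here and are not taken from the cited papers: without VCQs, every CP buffer is sized to hold RTT cells; with VCQs, the fabric holds one shared VCQs unit per input port plus a small buffer per CP. The function names and the example values (a 32x32 switch, an RTT of 64 cell times, a VCQs unit of twice the RTT, and a CP size of 4 cells) are chosen for illustration only.

```python
def fabric_cells_without_vcq(n, rtt):
    """Total CP memory in cells, assuming each of the n*n CPs holds RTT cells."""
    return n * n * rtt


def fabric_cells_with_vcq(n, vcq_unit_size, cp_size):
    """Total fabric memory in cells, assuming one shared VCQs unit per input
    port (n units) plus a small buffer of cp_size cells at each of the n*n CPs."""
    return n * vcq_unit_size + n * n * cp_size


if __name__ == "__main__":
    n, rtt = 32, 64  # illustrative switch size and RTT (cell times)
    print("without VCQs:", fabric_cells_without_vcq(n, rtt))
    print("with VCQs:   ", fabric_cells_with_vcq(n, vcq_unit_size=2 * rtt, cp_size=4))
```

Under these assumptions the first total grows with N^2 x RTT, while the second grows with RTT only through the N shared units, which is the kind of saving an intermediate per-input buffer is meant to provide.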
A VCQs unit functions as an intermediate buffer between a set of VOQs and the crosspoint (CP) buffers associated with the same input port, and its memory only needs to run at the line rate while achieving high throughput with a minimal buffering requirement inside the switch fabric. Only a simple round-robin (RR) scheduler has been investigated for this new packet switch architecture, and its throughput is degraded by buffer hogging at the VCQs and CP buffers. In this paper, threshold-based exhaustive round-robin (T-ERR) is employed at the VCQ and CP arbiters to improve the switch throughput by reducing buffer hogging at the VCQs.

Fig. 2. CICQ switch with VCQ (VOQ-S, VCQ-S, and CP-S schedulers, with CP credit feedback and VCQ credit feedback).

The remainder of the paper is organized as follows. In Section 2, a threshold-based exhaustive round-robin for the CICQ switch with VCQs is proposed to reduce buffer hogging and achieve high throughput. The proposed solution is evaluated in Section 3. Section 4 is the conclusion.

II. THRESHOLD-BASED EXHAUSTIVE ROUND-ROBIN

If a shared memory unit in a packet switch is completely filled by packets destined to a single output port, the packets destined to other output ports have to wait at the input port. This is known as buffer hogging. Fig. 3 shows how buffer hogging can occur at the VCQs of the CICQ switch. When both the VCQs and the CP are completely filled with packets destined to output port 1, the VCQs are not available to the remaining packets, including the one in VOQ 2, resulting in buffer hogging.

Fig. 3. Buffer hogging at VCQ: the VCQs unit is entirely filled with packets to output port 1, and the packet to output port 2 (at VOQ 2) is blocked.

Fig. 4 shows simulation results for the throughput of the CICQ switch with and without VCQs for Bernoulli arrivals of cells with unbalanced traffic. An unbalanced probability, ω, of 0 means that the traffic load from each input port to each output port is completely balanced, while an ω of 1 means that the traffic is completely unbalanced (all traffic from an input port is forwarded to a single output port) (see Section 3 for the complete description of ω). The RTT between the input port and the switch fabric was set to 64 cell times, and the VCQ size was set to 128 cells. The deployment of the VCQs drastically increases the throughput of the CICQ switch over the entire range of ω measured; however, reducing buffer hogging at the VCQs unit could achieve still higher throughput.

Fig. 4. Throughput of the CICQ switch with and without VCQs versus unbalanced probability, ω, for several CP sizes.

The goal is to improve the throughput of the CICQ switch with VCQs by reducing the probability of buffer hogging at the VCQs without increasing the VCQs memory size.

Threshold-based exhaustive round-robin (T-ERR) is proposed in this paper to achieve high throughput for the CICQ switch with VCQs. Fig. 5 shows the pseudocode for T-ERR at VCQs unit i. The operation of T-ERR is similar to RR arbitration except that the RR pointer remains at the currently selected queue as long as it has more cells queued than a predefined value, the threshold. A timer is used to prevent it from serving a single queue indefinitely. A similar approach was used in [17] to avoid starvation of cells. By serving busier queues more aggressively than less busy queues, T-ERR not only adapts to dynamic changes in traffic intensity but also balances the number of cells among competing queues in both the VCQs and the CP buffers. Consequently, buffer hogging occurs less frequently.

At each VCQs unit i,
  Do forever
    Reset and start a timer
    Select with RR the next non-empty VCQ i,j with available CP i,j credit
    While (VCQ i,j length is greater than threshold and timer is not expired)
      Keep RR pointer position

Fig. 5. Pseudocode of T-ERR for a VCQs unit
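The arbitration in Fig. 5 can be sketched in Python as follows. This is a minimal illustration, not the simulator used in the paper: the class name TErrArbiter, the one-call-per-cell-time select() interface, and the choice to measure the timer as the number of consecutive cell times spent on the held queue are all assumptions made here.

```python
class TErrArbiter:
    """Illustrative threshold-based exhaustive round-robin over n queues."""

    def __init__(self, n, threshold, timeout):
        self.n = n
        self.threshold = threshold  # hold the pointer while length > threshold
        self.timeout = timeout      # max consecutive cell times on one queue (assumed)
        self.pointer = 0            # where the next RR search starts
        self.held = None            # queue the pointer currently rests on
        self.served = 0             # consecutive cell times spent on self.held

    def select(self, lengths, credits):
        """Return the index of the queue to serve this cell time, or None.

        lengths[j] is the current length of queue j; credits[j] is the number
        of downstream credits available to queue j.
        """
        def eligible(j):
            return lengths[j] > 0 and credits[j] > 0

        j = self.held
        # Keep the RR pointer position while the held queue is still above
        # the threshold and the timer has not expired.
        if (j is not None and eligible(j)
                and lengths[j] > self.threshold
                and self.served < self.timeout):
            self.served += 1
            return j

        # Otherwise restart the timer and select the next eligible queue with RR.
        for k in range(self.n):
            cand = (self.pointer + k) % self.n
            if eligible(cand):
                self.pointer = (cand + 1) % self.n
                self.held = cand
                self.served = 1
                return cand

        self.held = None
        return None


if __name__ == "__main__":
    arbiter = TErrArbiter(n=3, threshold=2, timeout=4)
    lengths, credits = [6, 1, 1], [1, 1, 1]
    order = []
    for _ in range(9):
        j = arbiter.select(lengths, credits)
        order.append(j)
        if j is not None:
            lengths[j] -= 1  # one cell leaves the served queue
    print(order)
```

In this example the long queue 0 is served in consecutive cell times until its length falls to the threshold, after which the pointer advances and the short queues are served in plain RR order.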
III. EVALUATION OF THE T-ERR FOR THE CICQ SWITCH WITH VCQ

A. TRAFFIC MODELS AND EXPERIMENTS

Four simulation experiments are designed to evaluate the performance of the CICQ switch with VCQs and T-ERR. The CSIM simulator [18] is used to implement the CICQ switch with and without VCQs. For all experiments, a 32x32 switch with an RTT of 64 cell times, no internal packet header overhead, and no internal speedup is assumed unless otherwise stated. For T-ERR, unless otherwise stated, the VCQ threshold, CP threshold, and time-out value are set to cells, 2 cells, and cell time, respectively. The switch was assumed to be unstable when any VOQ length exceeded cells, in which case the simulation was terminated. Otherwise, simulations were completed for,, cell times.

Experiment #1 (balanced): Bernoulli arrivals of cells with uniformly selected outputs were modeled. The offered load ranged from 50% to 98%. The CP buffer size was set to 16 for the CICQ switch with VCQs. CP sizes of both 16 and 64 were used to measure the performance of the CICQ switch without VCQs. The VCQ memory unit size was set to twice the RTT. This experiment evaluates the switching delay of a switch with a large RTT for smooth traffic.

Experiment #2 (unbalanced): Bernoulli arrivals of cells with unbalanced traffic, as used in [3], were modeled. For input i, output j, unbalanced probability ω, and offered input load ρ, the traffic load from input i to output j, $\rho_{i,j}$, was given by

$\rho_{i,j} = \rho \left( \omega + \frac{1 - \omega}{N} \right)$ if i = j, and otherwise by

$\rho_{i,j} = \rho \, \frac{1 - \omega}{N}$.

The offered traffic is uniform when ω = 0 and is completely directional from input i to output j = i when ω = 1. The CP buffer size is varied to hold up to RTT/4 cells. This experiment evaluates the throughput of a switch with a large RTT under unbalanced traffic. This traffic model has been used to evaluate the impact of CP size on the performance of the CICQ switch in [3][16].
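The per-pair loads defined for experiment #2 can be computed with a short Python sketch. It assumes the standard form of this unbalanced model, in which a fraction ω of each input's load goes to the output with the same index and the remainder is spread uniformly over all N outputs; the function name unbalanced_load is hypothetical and used only for illustration.

```python
def unbalanced_load(n, rho, omega):
    """Return the n x n matrix of loads rho_ij for the unbalanced traffic model.

    rho_ij = rho * (omega + (1 - omega) / n)  if i == j
    rho_ij = rho * (1 - omega) / n            otherwise
    """
    off_diagonal = rho * (1.0 - omega) / n
    diagonal = rho * (omega + (1.0 - omega) / n)
    return [[diagonal if i == j else off_diagonal for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    loads = unbalanced_load(n=4, rho=0.8, omega=0.5)
    for row in loads:
        print(["%.3f" % x for x in row])
    # Every row (input) and every column (output) carries a total load of rho,
    # so no output is oversubscribed for rho <= 1.
    print(sum(loads[0]), sum(row[0] for row in loads))
```

Because each row and each column sums to ρ, the model stays admissible over the whole range of ω, which is what allows throughput to be compared from the uniform case to the fully directional case.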
Experiment #3 (unbalanced): Same as experiment #2 except that the RTT ranges from 64 to 256 for the CICQ switch with VCQs. The VCQs and CP sizes are set to 2xRTT and 16 cells, respectively. This experiment evaluates the throughput of the switch with various RTTs.

Experiment #4 (unbalanced): Same as experiment #2 except that the switch size ranges from 16 to 128 for the CICQ switch with VCQs. The VCQs and CP sizes are set to 2xRTT and 16 cells, respectively. This experiment evaluates the throughput of switches of various sizes.

B. EXPERIMENT RESULTS

Fig. 6 shows the results of experiment #1, in which values of mean delay minus RTT are plotted. The mean delay of the CICQ switch without VCQs is almost identical for CP = 16 and CP = 64. This is expected because the traffic from each of the 32 input ports is uniformly distributed over the 32 output ports, and a CP size of only 2 cells (RTT of 64 divided by a port count of 32) is sufficient to achieve 100% throughput. The CICQ switches with VCQs with RR and T-ERR were both stable for all offered loads and had mean delay comparable to that of the CICQ switch without VCQs. A difference of about one cell time (due to the extra buffering stage at the VCQ) was observed at low offered load for both RR and T-ERR scheduling. T-ERR scheduling resulted in lower mean delay than RR scheduling at an offered load of 98%.

Fig. 7 shows the results of experiment #2 for the CICQ switch with VCQs. All models achieved close to 100% throughput when traffic is completely balanced. They also achieved nearly 100% throughput when ω = 1 regardless of CP size. This was expected because

$\sum_{j=1}^{N} VCQ\_size_{i,j} \geq RTT$.

T-ERR scheduling always achieved higher throughput than RR for a given CP size. For instance, its throughput improves from 8% to 94% for a CP size of 4 cells and from 73% to 83% for a CP size of 2 cells for RTT = 64 cell times. Only T-ERR scheduling with CP = 16 achieved nearly 100% throughput for all ω.

Fig. 8 shows the results of experiment #3 for the CICQ switch with VCQs with RR and for the CICQ switch with VCQs with T-ERR. Increases in RTT slightly reduce the throughput of the CICQ switch with VCQs with RR. On the other hand, the CICQ switch with VCQs with T-ERR always achieves close to 100% throughput for all ω regardless of RTT. Fig. 9 shows the results of experiment #4.
Throughput is not affected by increases in switch size for both the switch with RR and the switch with T-ERR scheduling. The throughput of the CICQ switch with VCQs with RR, however, is noticeably lower than that with T-ERR. T-ERR, again, achieved close to 100% throughput for all ω regardless of switch size. This demonstrates that the CICQ switch with VCQs with T-ERR is suited for scaling up independent of the switch size.

Fig. 6. Results for experiment #1: mean response time minus RTT (cells) versus load (%) without VCQ (CP=16), without VCQ (CP=64), with VCQ (RR), and with VCQ (T-ERR).
Fig. 7. Results for experiment #2: RR and T-ERR for several CP sizes.
Fig. 8. Throughput of CICQ with various RTT: RR and T-ERR for RTT = 64, 128, and 256.
Fig. 9. Results for experiment #4: RR and T-ERR for N = 32, 64, and 128.

IV. CONCLUSION

In this paper, a CICQ switch with VCQs and threshold-based exhaustive round-robin (T-ERR) is investigated to reduce the buffer size requirement inside the switch fabric. In T-ERR, a pointer remains at the currently selected queue as long as it has more fixed-size packets (or cells) queued than a predefined threshold value. A timer is used to prevent it from serving a large queue indefinitely. Simulation experiments showed that the throughput of the CICQ switch with VCQs using T-ERR arbitration improves for small CP buffer sizes. In particular, its throughput improves from 8% to 95% for a CP size of 4 cells and from 7% to 78% for a CP size of 2 cells for RTT = 64 cell times. Furthermore, its high throughput is independent of RTT and switch port size. Thus, employing the proposed scheduler in the CICQ switch with VCQs is promising for further reducing the memory size requirement in the switch fabric, and T-ERR can be a scheduler of choice for implementing a highly scalable CICQ switch with VCQs architecture.

REFERENCES

[1] M. Nabeshima, "Performance Evaluation of a Combined Input- and Crosspoint-Queued Switch," IEICE Transactions on Communications E83-B, No. 3, pp. 737-74, March 2000.
[2] K. Yoshigoe and K. Christensen, "A Parallel-Polled Virtual Output Queued Switch with a Buffered Crossbar," Proceedings of IEEE HPSR, pp. 27-275, May 2001.
[3] R. Rojas-Cessa, E. Oki, Z. Jing, and H. Chao, "CIXB-1: Combined Input-One-Cell Crosspoint Buffered Switch," Proceedings of IEEE HPSR, pp. 324-329, May 2001.
[4] F. Abel, C. Minkenberg, R. Luijten, M. Gusat, and I.
Iliadis, "A Four-Terabit Single-Stage Packet Switch with Large Round-Trip Time Support," IBM Research Report RZ 343 (#9369), July 2002.
[5] Alcatel, "Alcatel 767 Routing Switch Platform." URL: http://www.alcatel.com/products/productsummary.jhtml?repositoryID=/x/opgproduct/Alcatel_767_RSP.jhtml.
[6] Avici Systems, "The Avici TSR: The World First Scalable Router." URL: http://www.avici.com/documentation/datasheets/avici_tsr.pdf.
[7] Juniper Networks, "The Essential Core: Juniper Networks T64 Internet Routing Node with Matrix Technology." URL: http://www.juniper.net/solutions/literature/solutionbriefs/356.pdf.
[8] C. Minkenberg, R. Luijten, F. Abel, W. Denzel, and M. Gusat, "Current Issues in Packet Switch Design," Proceedings of ACM SIGCOMM, pp. 9-24, January 2003.
[9] F. Tobajas, R. Esper-Chain, V. Armas, J. Lopez, and R. Sarmiento, "Round-Trip Delay Effect on Iterative Request-Grant-Accept Scheduling Algorithms for Virtual Output Queued Switches," Proceedings of IEEE GLOBECOM, Vol. 2, pp. 889-893, November 2002.
[10] R. Luijten, C. Minkenberg, and M. Gusat, "Reducing Memory Size in Buffered Crossbars with Large Internal Flow Control Latency," Proceedings of IEEE GLOBECOM 7, pp. 3683-3687, January 2003.
[11] N. Chrysos and M. Katevenis, "Multiple Priorities in a Two-Lane Buffered Crossbar," Proceedings of IEEE GLOBECOM, Vol. 2, pp. 8-86, December 2004.
[12] F. Gramsamer, M. Gusat, and R. Luijten, "Flow Control Scheduling," Microprocessors and Microsystems, Vol. 27, pp. 233-24, 2003.
[13] R. Rojas-Cessa, Z. Dong, and Z. Guo, "Load-Balanced Combined Input-Crosspoint Buffered Packet Switch and Long Round-Trip Times," IEEE Communications Letters, vol. 4, issue 7, pp. 66-663, July 2005.
[14] R. Rojas-Cessa and Z. Dong, "Combined Input-Crosspoint Buffered Packet Switch with Shared Crosspoint Buffers," Proceedings of the 39th Conference on Information Sciences and Systems, Johns Hopkins University, Baltimore, MD, March 6-8, 2005.
[15] K. Yoshigoe, "Rate-based Flow-control for the CICQ Switch," Proceedings of the IEEE International Conference on Local Computer Networks, pp. 44-5, November 2005.
[16] K. Yoshigoe, "The CICQ Switch with Virtual Crosspoint Queues for Large RTT," Proceedings of IEEE ICC, June 2006.
[17] K. Christensen, K. Yoshigoe, A. Roginsky, and N. Gunther, "Performance of Packet-to-Cell Segmentation Schemes in Input Buffered Packet Switches," Proceedings of the IEEE ICC, pp. 97-2, June 2004.
[18] H. Schwetman, "CSIM18 - The Simulation Engine," Proceedings of the 1996 Winter Simulation Conference, pp. 57-52, December 1996. URL: http://www.mesquite.com.