Improvement of a TCP Incast Avoidance Method for Data Center Networks

Improvemen of a Incas Avoidance Mehod for Daa Cener Neworks Kazuoshi Kajia, Shigeyuki Osada, Yukinobu Fukushima and Tokumi Yokohira The Graduae School of Naural Science and Technology, Okayama Universiy 3 1 1, Tsushima Naka, Kia Ku, Okayama, 7 853, Japan The Japan Research Insiue, Limied 2 18 1, Higashi Goanda, Shinagawa Ku, Tokyo, 141 22, Japan E mail: {kajia@ne.cne., osada@ne.cne., fukusima@, yokohira@}okayama u.ac.jp s I. INTRODUCTION In daa cener neworks, disribued file sysems where is used as a communicaion proocol beween a clien and a server are very popular. In such disribued sysems, a block of daa is pariioned ino several unis called SRUs (Server Reques Unis) and hey are sored ino several servers. Thus, when an applicaion on a clien ries o read a block, he clien sends requess o he servers which have he corresponding SRUs and hen he servers almos simulaneously ry o send he SRUs o he clien. Then, because many packes bursly arrive a he clien link, many packes are los a he por buffer of he clien link and consequenly some servers have o wai he reransmissions of heir los packes unil imeous occur. In a sandard configuraion, because he minimum imeou value is oo large (he defaul value is 2 msec) compared o he round rip ime (RTT) of daa cener neworks (ypically less han a few hundred micro seconds), i leads o serious hroughpu degradaion. Such well known congesion collapse is called [1] (we call i Incas briefly). In order o avoid Incas, he papers [2] and [3] ry o avoid imeous using some sraegies such as reducing he hreshold value o rigger he fas reransmission mechanism and disabling he slow sar. However, hese sraegies are no so effecive because alhough Incas is ofen caused by losses of all packes of he send window size, he papers mainly focus on losses of some packes covered by he send window. The paper [4] proposes a mehod called, which ries o avoid Incas by adjusing he send window size of servers. However, is no effecive when he number of servers is large as shown laer. The paper [5] suggess reducing he minimum imeou value o a microsecond order value o miigae he impac of imeous. However, his mehod needs o modify he server OS s kernel imer ino he higher resoluion which is very hard ask o do and can cause hroughpu degradaion when he number of servers is large [3]. In addiion, he paper [6] proposes a proocol for daa cener neworks called DC, which can avoid Incas o some exen. However, because is main purpose is no o avoid Incas bu o keep laency ime low, Incas can occur when he number of servers is large [6]. Thus, in our previous work [7], in order o avoid Incas, we have proposed a mehod which limis he maximum number of simulaneously exising connecions o a pre deermined consan value. In he mehod, if he number of currenly exising connecions is equal o he maximum number, a new connecion is allowed o be esablished only when one of currenly exising connecions finishes. However, we canno use he mehod when he SRU size is small. In addiion, we have no invesigaed how o opimize he maximum value. Therefore, if he maximum value is no appropriae, Incas may occur, and he hroughpu is small even if Incas does no occur. In his paper, we improve our previous mehod so ha i is applicable regardless of he SRU size and he maximum value is opimized. In our previous mehod, each connecion is serially esablished some inerval ime afer is preceding connecion is esablished. For small SRU size, because he inerval ime is no defined, we canno use our previous mehod. In our improved mehod, we virually consider some connecions as one connecion so ha he inerval ime is defined for any SRU size and esablish he connecions simulaneously. The number of simulaneously esablished connecion is opimized so ha he hroughpu is maximized wihou Incas. Hereafer, we call our previous mehod he serializaion mehod, and call our improved mehod he opimized serializaion mehod. The res of he paper is organizes as follows. Secion II describes a arge daa cener nework model, he cause of Incas and he serializaion mehod. Secion III describes he opimized mehod in deail. Secion IV shows some numerical resuls o show he effeciveness of he opimized serializaion mehod. Secion V gives conclusion. 978-1-4799-698-7/13/$31. 213 IEEE 459 ICTC 213

II. Clien Swich Por Buffer Fig. 1. Nework Model Server 1 Server 2 Server 3 Server N PREVIOUS INCAST AVOIDANCE METHOD When we focus on he Incas problem, we can simplify daa cener neworks o a nework model shown in Fig. 1. There are one clien and several servers. The clien and servers are conneced o one Eherne swich. A por buffer is equipped in he clien link, and some packes can be emporarily buffered when he link is busy. We can consider a nework model where here are some swiches and some servers and he clien are conneced o differen swiches. However, as long as he clien link is a boleneck, such model can be simplified o he model shown in Fig. 1. In disribued file sysems such as HDFS (Hadoop Disribued File Sysem) and pnfs (parallel Nework File Sysem), a block of daa is pariioned ino several unis called SRU and hey are sored in several servers. The defaul SRU size is 65536 KByes (K=2 1 ) in HDFS and is 32 KByes in pnfs. When an applicaion on a clien ries o read a block, he clien sends requess o all he servers which have he corresponding SRUs. Then every server sends he corresponding SRU o he clien. The sendings from he servers are easy o occur almos simulaneously. Thus, because many packes from he servers almos simulaneously arrive a he por buffer of he clien link, he por buffer is easy o overflow and some packes may be los. When he communicaion beween he clien and each server is being done using, almos all packes of he send window size of a server may be los due o he huge burs arrival of packes o he por buffer. Thus, he fas reransmission and fas recovery mechanisms of are no riggered because hree duplicaed ACKs are no reurned o he server, and consequenly such packe losses are recovered by he imeou mechanism only. Alhough he minimum imeou value in a sandard configuraion may be reasonable in normal nework environmens (is defaul value is 2 msec), i is oo large in daa cener nework environmens, where he bandwidh is he order of Gbps (G=1 9 ) and he round rip ime is less han a few hundred micro seconds. Thus, once a imeou occurs, reransmissions of he los packes occur afer a long waiing ime, and consequenly he delay unil all he SRUs belonging o he block are compleely received by he clien becomes large. On he oher hand, a new block read operaion from he applicaion does no occur unil he curren block read operaion compleely finishes, ha is, unil SRUs from all connecions are received by he clien (such applicaion is called [4]). Therefore once a imeou occurs, a long idle period appears in he clien link, and consequenly he average hroughpu over he period beween he ime of he sendings of he reques from he clien o he servers and he ime of he finish of he receips of all he SRUs is small. As described in he previous secion, he cause of Incas is ha ransmissions from many servers o a clien simulaneously occur. Thus, in our previous work, we have proposed he serializaion mehod, which limis he maximum number of simulaneously exising connecions o a pre deermined consan value. In he mehod, if he number of currenly exising connecions is equal o he maximum number, a new connecion is allowed o be esablished only when one of currenly exising connecions finishes. By he serializaion, because packes from only a few connecions arrive a he por buffer, Incas can be avoided. On he oher hand, we allow any connecion which does no belong o such applicaion o be esablished and here may be some raffic from UDP, ICMP, rouing proocol and so on. However he oal amoun of such raffic (we call i background raffic) can be considered o be small compared o he raffic from barrier synchronized applicaions because background raffic is used for sysem conrol and managemen (in oher words, we should avoid such sysem design ha generaes large amoun of conrol and managemen raffic). In he serializaion mehod, we call he maximum number of simulaneously exising connecions he overlapping parameer. When he parameer is one, we call he serializaion mehod he complee serializaion mehod, and when he parameer is wo or larger value, we call he serializaion mehod nearly complee serializaion mehod. Nex, we explain he wo serializaion mehods. In he complee serializaion mehod, we compleely serialize esablishmens of all he connecions belonging o each applicaion. The clien firs esablishes one connecion and receives he SRU. Then, he clien repeas he same behavior for he second and he succeeding connecions serially. In each connecion, he clien adverises [bis] as is receive window size o he corresponding server, where is he produc of he clien link bandwidh [Gbps] and BaseRTT which is defined as he round rip ime under he condiion ha here is no oher raffic and receive window size is one MSS. Noe ha we can easily obain BaseRTT by a prior experimen for he arge sysem. Because he clien adverises as is receive window size, he send window size (sendwin) of each server is as follows. sendwin = min (cwnd,. ) (1) Where cwnd is he congesion window size of he server. Therefore in each connecion, sendwin increases exponenially in he iniial sage because cwnd increases exponenially due o he slow sar mechanism of, and hen i reaches as shown in Fig. 2(a). The lengh ( 1 ) of he iniial sage, ha is, he period beween he ime origin and he ime sendwin reaches, depends on, BaseRTT and MSS, and i is derived as follows. 46

The sum of sendwins of all exising connecions con-1 con-2 con-3 The sum of sendwins of all exising connecions T 1 T 2 T 1 T 2 T 1 T 2 (a). Change of he sum of sendwins of all esablished connecions Throughpu V Gbps T 1 1 2 1 2 con-1 con-2 con-3 T 1 T 2 T 1 T 2 T 1 T 2 (b). Change of hroughpu Fig. 2. Complee serializaion (a). Change of he sum of sendwins of all esablished connecions Throughpu V Gbps edge Clien We assume ha he clien sars connecion esablishmen (he clien sends a SYN segmen) as shown in Fig. 3 and he iniial cwnd value is wo, and we assume ha he ime origin is he ime when he server reurns he SYN ACK segmen. Then, 1 is given as follows. = BaseRTT (2) _Pk = (MSS + 4) 8 (3) = log _Pk + 1 (4) = ( 1) BaseRTT SYN SYN-ACK ACK Server +( _Pk 2 ) Time origin BaseRTT BaseRTT BaseRTT Fig. 3. Timing char in clien server communicaion (MSS + 4) 8 The lengh ( 2 ) of he period beween he ime when sendwin reaches and he ime when he connecion finishes depends on, BaseRTT, MSS and SRU size and is calculaed as follows. 2 MSS (MSS + 4) 8 MSS = edge If he SRU size is oo small, a connecion finishes before is sendwin reaches and 1 and 2 are no defined. (5) (6) T 1 Fig. 4. Nearly complee serializaion Specifically if and only if he following inequaliy is no saisfied, 1 and 2 are no defined. MSS 2 + 2 _Pk (7) (b). Change of hroughpu For example, when = 1 Gbps, BaseRTT = 1 µsec, MSS = 146 Byes, of 8 KByes does no saisfy he inequaliy and of 9 KByes saisfies i. For larger bandwidh case, when =1 Gbps, of 115 KByes does no saisfy he inequaliy and of 116 KByes saisfies i. In he complee serializaion mehod, because each connecion does no overlap wih he oher connecions, he sum of sendwins of all esablished connecions is obained as shown in Fig. 2(a). Alhough we canno fully use he bandwidh of he clien link for each period 1, we can fully use i for each period 2. Thus, as shown in Fig. 2(b), alhough we canno obain hroughpu of Gbps for each period 1, we can obain hroughpu of Gbps for each period 2. Because 1 does no depend on he SRU size and 2 becomes larger for larger SRU size from Eqs. (5) and (6), he average hroughpu reaches close o Gbps for larger SRU size. However, when he SRU size is small, because 2 becomes small, he complee serializaion mehod canno fully use he bandwidh. Thus, in he nearly complee serializaion mehod, in order o avoid underuilized period as much as possible, we overlap wo connecions parially as shown in Fig. 4. In he mehod, we ry o sar he nex connecion a a ime ( 1 in he figure) in advance so ha sendwin of he connecion reaches a he ime (ime 2 in he figure) when he curren esablished connecion finishes. By overlapping wo connecions parially, he sum of sendwins of all esablished connecions is obained as shown in Fig. 4(a) and we canno obain hroughpu of Gbps only firs period 1. Alhough he SRU size is small, he average hroughpu reaches close o Gbps. However, i is 461

con-a con-b The sum of sendwins of all exising connecions b c con-c hard o derive ime 1 correcly because he wo connecions share he bandwidh of he clien link. We deermine ime 1 approximaely as follows. We call he curren connecion and he nex esablished connecion con A and con B, respecively. If we assume ha only con B exiss in he nework, sendwin of con B can reach afer he period of lengh 1 in Eq. (5). However, con A and con B coexis in he nework. Thus, we assume he wo connecions equally share he bandwidh of he clien link and we consider ha 2 1 is needed for sendwin of con B o reach. Also from he equal sharing assumpion, we consider ha con A can use half he bandwidh ( /2) of he clien link averagely. Thus, con A can send he daa of 1 ( 2 2) [bis] during he period beween 1 and 2. Therefore we can obain ime 1 as he ime when he number of bis which have been already sen in con A reaches (Recall ha is SRU size). From he above discussion, we coun up he number of bis which have been already sen in he curren connecion, and when he number reaches, we sar he nex connecion (we call he condiion ha he number of bis in he curren connecion reaches ). However, because is an approximae value, if we sar he nex connecion based on he condiion I only, he siuaion ha here exis more han wo connecions may occur. For example, in Fig. 5, assume ha because he condiion I is saisfied for connecion con A, connecion con B sars a ime b. Then if he condiion I is saisfied for con B a ime c, a new connecion con C sars. Bu a he ime c, con A may have no finished ye and remain alive. Such siuaion is easier o occur when is smaller. If we allow such muliple overlapping wihou any resricion, Incas may have occurred. Thus, we inroduce he overlapping parameer and we do no allow esablishmen of a new connecion when he number of already exising connecions is, and when one of hem finishes, we esablish a new connecion. III. Fig. 5. Overlap of more han wo connecions IMPROVEMENT OF THE PREVIOUS METHOD In he nearly complee serializaion mehod, we sar he nex connecion when he number of bis which have been already sen in he curren connecion reaches. Thus, when he inequaliy (7) is no saisfied because he SRU size is oo small, 1 is no defined, and consequenly we canno use he mehod. The smalles value of he SRU size for he applicabiliy of he mehod becomes larger for larger bandwidh of he clien link. For example, when BaseRTT = 1 µsec and MSS = 146 Byes, he smalles values for = 1 Gbps and = 1 Gbps are abou 9 KByes and abou con-1, con-2, con-3 con-4, con-5, con-6 con-7, con-8, con-9 Fig. 6. Opimized serializaion 116 KByes, respecively as shown in he preceding secion II. Even if he SRU size saisfies he inequaliy, when i is small, he mehod canno fully uilize he clien bandwidh for a small value of and he mehod may cause Incas for a large value of. Thus, alhough should be opimized, he opimizaion of has no been invesigaed in our previous work. In order o resolve hese problems described above, when he SRU size is smaller han a hreshold value, we adop a sraegy which esablishes a se of several connecions simulaneously while serializing such ses like he nearly complee serializaion mehod as shown in Fig. 6. In he figure, he clien esablishes he se of hree connecions (con 1, con 2 and con 3) simulaneously. Then, he clien sars he second se of he nex hree connecions (con 4, con 5 and con 6) some inerval ime afer he firs se is sared, and hen he clien repeas he same behavior. In he sraegy, he number ( ) of connecions in each se is seleced so ha each se can fully uilize he clien link s bandwidh and a mos wo ses are overlapped as described laer. The clien adverises (i is rounded up in a uni of he packe size). Thus, he number of bis which queue in he clien por buffer is a mos abou. Thus, using such buffer size, we can fully uilize he clien link s bandwidh wihou causing Incas. We call such connecion serializaion mehod. In he nex secion, we describe he hreshold value o use he opimized mehod and how o derive he value of. In he nearly complee serializaion mehod, he period which is needed for sendwin of he nex connecion o reach is assumed o be 2 1. Therefore if 2, we can fully uilize he clien link s bandwidh by overlapping a mos wo connecions. From Eqs. (5) and (6), he inequaliy 2 is expressed as follows. MSS 2 _Pk log _Pk (8) +2 _Pk 2 Therefore, when he SRU size ( ) saisfies he inequaliy, we simply use he nearly complee serializaion mehod. On he oher hand, when he SRU size does no saisfy inequaliy, we 462

consider each se of connecions as one connecion in he nearly complee serializaion mehod and define 1 and 2 of each se in he same way. Then, if 2, we can fully uilize he clien link s bandwidh by overlapping a mos wo ses. The inequaliy 2 is expressed as follows. MSS 2 _Pk log _Pk +2 _Pk 2 Thus, we selec he minimum value ( min ) of which saisfies he inequaliy (9) o decrease he number of bis which queue in he clien por buffer as much as possible. Afer min is derived, alhough he clien adverises (i is rounded up in a uni of he packe size), when is smaller han four packes size and if one packe loss occurs, he fas reransmission and fas recovery mechanisms are no riggered because hree duplicaed ACKs are no reurned o each server for small send window size. As a resul, Incas may occur. To avoid such siuaion, he clien adverises a leas four packes size, and when is smaller han four packes size, we adjus min as follows again. (9) = _Pk (1) 4 As described above, alhough a leas min connecions are esablished in he almos enire period of an applicaion, he number of remaining connecions may smaller han min in he final period of an applicaion. In he final period, if we coninue o use he adverised window of, we canno fully use he clien s link bandwidh. To avoid such siuaion, when he number of remaining connecions is smaller han min, he clien increases he adverised window sizes of remaining connecions so ha he sum of sendwins becomes o fully use he bandwidh. IV. PERFORMANCE EVALUATION We incorporaed he opimized serializaion mehod ino he NS2 simulaor [8] and performed exensive simulaion runs for every combinaion of he parameers shown in Table I. In he simulaion, we assume ha here are no bi errors in packes and here is UDP raffic wih a consan bi rae from each of servers (for example, when he rae is Mbps, each server generaes UDP raffic wih he consan bi rae of Mbps) as a background raffic. Fig. 7 shows example resuls when he clien s link bandwidh is 1 Gbps, he SRU size is 32 KByes, he por buffer size is 4 packes and background raffic of 1 Mbps (.1%). In Fig. 7(a), goodpu is he value of hroughpu (applicaion level hroughpu) when he sum (4 Byes) of he and IP headers are no included in hroughpu calculaion. We also se such simulaion parameers ha MSS is 146 KByes, MTU is 15 KByes and he overhead of layer 2 or lower is zero. An acive server means a server which has an SRU of a requesed block. Tha is, when he number of acive servers is 2 and he SRU size is 32 KByes, we randomly TABLE I. Simulaion parameers Parameer =1Gbps =1Gbps BaseRTT (µsec) 1 SRU (KByes) 32, 64, 128, 256, 512, 124 Por buffer size (packes) 4, 8 2, 4 Cluser size ( ) 64 256 The number of acive servers 1~64 1~256 RTOmin 2msec, 2µsec (fine granulariy imer) UDP background raffic rae (%),.1, 1, 1 selec 2 acive servers from servers and we assume ha he clien requess he SRU of 32 KByes o each acive server (he size of he reques block is 32 KByes 2). In Fig. 7(a), we can see he highes goodpu in he opimized serializaion mehod, and in Fig. 7(b), he mehod aains a small maximum queue lengh (abou _Pk) because he mehod overlaps only wo ses of connecions. In he complee serializaion mehod, as we suggesed before, we canno fully use he bandwidh when he SRU size is small. In he nearly complee serializaion mehod, alhough goodpu becomes larger when we se he large parameer ( =4, =6 and =8), maximum queue lengh becomes large. Thus, Incas may occur when he por buffer size is small. In and, we can see he drasic performance degradaion when he numbers of acive servers exceed abou 16 by Incas. Alhough decreases he send window size o avoid Incas, is minimum value is 2 MSS. Thus, when he number of acive servers is larger han abou 2 (he por buffer size of 4 packes / he minimum send window size of 2 MSS), Incas can occur wih high probabiliy. By conras, wih a fine granulariy imer (2µsec) largely improves goodpu. However, when he number of acive servers is large, we can see goodpu degradaion because imeous repea in a shor span due o a heavy congesion. Fig. 8 shows larger bandwidh (1 Gbps), when he SRU size is 512 KByes, he por buffer size is 2 packes and background raffic of 1 Mbps (.1%). In he figures, he opimized serializaion mehod obains he highes goodpu and aains a small maximum queue lengh. In he nearly complee serializaion mehod, as we suggesed before, alhough goodpu becomes large when we se he appropriae parameer ( =4), Incas occurs when we se larger parameer ( = 6 and = 8). In he oher mehods, we can see almos he similar performance as Fig. 7. V. CONCLUSION We have proposed an improved mehod of he serializaion mehod o avoid Incas. In evaluaion resuls, he proposed mehod showed high hroughpus and aained small queue lenghs of he por buffer while avoiding Incas. In paricular, when he bandwidh becomes larger, we show he effeciveness of he mehod compared o he previous mehods. One of our fuure work is o invesigae he effeciveness of he mehod in real sysems. 463

Goodpu (Mbps) 1 Opimized Complee (RTOmin=2us) 5 Maximum queue lengh (packes) 12 8 6 4 2 4 Opimized Complee (RTOmin=2us) 3 2 1 1 2 3 4 5 6 7 1 2 3 4 5 6 7 The number of acive servers The number of acive servers (a). Goodpus (b). Maximum queue lenghs Fig. 7. Goodpus and maximum queue lenghs when =1Gbps Goodpu (Mbps) 1 Opimized Complee (RTOmin=2us) 25 Maximum queue lengh (packes) 12 8 6 4 2 4 8 12 16 2 24 2 The number of acive servers Complee (RTOmin=2us) 15 1 5 28 Opimized 4 8 12 16 2 24 28 The number of acive servers (a). Goodpus (b). Maximum queue lenghs Fig. 8. Goodpus and maximum queue lenghs when =1Gbps REFERENCES [5] V. Vasudevan, A. Phanishayee, H. Shah, E. Kreva, D. G. Andersen, G. R. Ganger, G. A. Gibson and B. Mueller, Safe and Effecive Fine grained Reransmissions for Daacener Communicaion, SIGCOMM 29, pp. 33 314. [6] M. Alizadeh, A. Greenberg, D. A. Malz, J. Padhye, P. Pael, B. Parabhakar, S.Sengupa and M. Sridharan, Daa Cener (DC), SIGCOMM 21, pp. 63 74. [7] S. Osada, K, Kajia, Y. Fukushima and T. Yokohira, Incas Avoidance Based on Connecion Serializaion in Daa Cener Neworks, APCC 213, pp. 12 125. [8] The Nework Simulaor ns 2. hp://www.isi.edu/nsnam/ns/. [1] D. Nagle, D. Serenyi and A. Mahews, The Panasas AciveScale Sorage Cluser: Delivering Scalable High Bandwidh Sorage, IEEE/ACM Supercompuing 24, pp. 53 62. [2] A. Phanishayee, E. Kreva, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson and S. Seshan, Measuremen and Analysis of Throughpu Collapse in Cluser based Sorage Sysems, The 6h USENIX Conference on File and Sorage Technologies (FAST 28), pp. 1 13. [3] Y. Chen, R. Griffih, J. Liu, R. H. Kaz and A. D. Joseph, Undersanding Incas Throughpu Collapse in Daacener Neworks, ACM WREN 29, pp. 73 82. [4] H. Wu, Z. Feng, C. Guo and Y. Zhang, : Incas Congesion Conrol for in Daa Cener Neworks, ACM CoNEXT 21. 464