TOFEC: Achieving Optimal Throughput-Delay Trade-off of Cloud Storage Using Erasure Codes

Guanfeng Liang and Ulaş C. Kozat
DOCOMO Innovations, Inc., Palo Alto, CA 94304
Email: {gliang,kozat}@docomoinnovations.com

Abstract — Our paper presents solutions that use erasure coding, parallel connections to the storage cloud and limited chunking (i.e., dividing the object into a few smaller segments) together to significantly improve the delay performance of uploading and downloading data in and out of cloud storage. TOFEC is a strategy that helps the front-end proxy adapt to the level of workload by treating scalable cloud storage (e.g. Amazon S3) as a shared resource requiring admission control. Under light workloads, TOFEC creates more smaller chunks and uses more parallel connections per file, minimizing service delay. Under heavy workloads, TOFEC automatically reduces the level of chunking (fewer chunks with increased size) and uses fewer parallel connections to reduce overhead, resulting in higher throughput and preventing queueing delay. Our trace-driven simulation results show that TOFEC's adaptation mechanism converges to an appropriate code that provides the optimal delay-throughput trade-off without reducing system capacity. Compared to a non-adaptive strategy optimized for throughput, TOFEC delivers 2.5× lower latency under light workloads; compared to a non-adaptive strategy optimized for latency, TOFEC can scale to support over 3× as many requests.

Index Terms — FEC, Cloud storage, Queueing, Delay

I. INTRODUCTION

Cloud storage has been gaining popularity rapidly as an economic, flexible and reliable data storage service that many cloud-based applications nowadays are implemented on. Typical cloud storage systems are implemented as key-value stores in which data objects are stored and retrieved via their unique keys. To provide a high degree of availability, scalability, and data durability, each object is replicated several times within the internal distributed file system and sometimes also further protected by erasure codes to use the storage capacity more efficiently while attaining very high durability guarantees [1]. Cloud storage providers usually implement a variety of optimization mechanisms such as load balancing and caching/prefetching internally to improve performance. Despite all such efforts, evaluations of large-scale systems still indicate that there is a high degree of randomness in delay performance [2]. Thus, services that require more robust and predictable Quality of Service (QoS) must deploy their own external solutions, such as sending multiple/redundant requests (in parallel or sequentially), chunking large objects into smaller ones and reading/writing each chunk through parallel connections, replicating the same object under multiple distinct keys, etc.

Fig. 1. Delay for downloading 3MB files using fixed MDS codes: mean total delay (msec) vs. arrival rate (req/sec) for the (1,1), (2,1), (2,2), (4,2), (3,3) and (6,3) codes.

In this paper, we present TOFEC, an external strategy that can provide much better throughput-delay performance for file access on cloud storage utilizing erasure coding, and that requires neither modification of nor knowledge about the internal implementation of the storage cloud. Although we base our analysis and evaluation on the Amazon S3 service and present TOFEC as an external solution, TOFEC can be applied to many other cloud storage systems both externally and internally with small modifications. The latter can be accomplished, for example, by making a thin layer on top of the original API.

A. State of the Art
Among the vast amount of research on improving cloud storage systems' delay performance that has emerged in the past few years, two groups in particular are closely related to the work presented in this paper:

Erasure Coding with Redundant Requests: As proposed by the authors of [3], [4], [5], files are divided into a predetermined number k of chunks, each of which is 1/k the size of the original file, and encoded into n > k coded chunks using an (n, k) Maximum Distance Separable (MDS) code, or more generally a Forward Error Correction (FEC) code. Downloading/uploading of the original file is accomplished by downloading/uploading the n coded chunks over parallel connections simultaneously, and the request is deemed served when the download/upload of any k coded chunks completes. Such mechanisms significantly improve the delay performance under light workload. However, as shown in our previous work [3] and later reconfirmed by [5], system capacity is reduced due to the overhead of using smaller chunks and redundant requests. This phenomenon is illustrated in Fig. 1, where we plot the delay-throughput trade-off for different MDS codes from
our simulations using delay traces collected on Amazon S3. Codes with different k are grouped in different colors. Using a code with a high level of chunking and redundancy, in this case a (6,3) code, although it delivers a 2× gain in delay at light workload, reduces system capacity to only 30% of that of the original basic strategy without chunking and redundancy, i.e., the (1,1) code! This problem is partially addressed in [3], where we present strategies that adjust n according to the workload level so that the near-optimal throughput-delay trade-off is achieved for the predetermined k. For example, if k = 3 is used, the strategies in [3] will achieve the lower envelope of the red curves in Fig. 1. Yet they still suffer from an almost 60% loss in system capacity.

Dynamic Job Sizing: It has been observed in [2], [6] that in key-value storage systems such as Amazon S3 and Microsoft's Azure Storage, throughput is dramatically higher when they receive a small number of storage access requests for large jobs (or objects) than when they receive a large number of requests for small jobs (or objects), because each storage request incurs overheads such as networking delay, protocol processing, lock acquisitions, transaction log commits, etc. The authors of [6] developed Stout, in which requests are dynamically batched to improve the delay-throughput trade-off of key-value storage systems. Based on the observed congestion, Stout increases or reduces the batching size. Thus, at high congestion a larger batch size is used to improve throughput, while at low congestion a smaller batch size is adopted to reduce delay.

B. Main Contribution

We introduce an adaptive strategy for accessing cloud storage systems via erasure coding, called TOFEC (Throughput Optimal FEC Cloud), that implements dynamic adjustment of chunking and redundancy levels to provide the optimal delay-throughput trade-off. In other words, TOFEC achieves the lower envelope of the curves in all colors in Fig. 1. The primary novelty of TOFEC is its backlog-based adaptive algorithm for dynamically adjusting the chunk size as well as the number of redundant requests issued to fulfill storage access requests. This algorithm of variable chunk sizing can be viewed as a novel integration of prior observations from the two bodies of work discussed above. Based on the observed backlog level as an indicator of the workload, TOFEC increases or reduces the chunk size, as well as the number of redundant requests. In our trace-driven simulation evaluation, we demonstrate that: (1) TOFEC successfully adapts to the full range of workloads, delivering 3× lower average delay than the basic static strategy without chunking under light workloads, and under heavy workloads over 3× the throughput of a static strategy with high chunking and redundancy levels optimized for service delay; and (2) TOFEC provides good QoS guarantees as it delivers low delay variations.

TOFEC works without any explicit information from the back-end cloud storage implementation: its adaptation strategy is implemented solely at the front-end application server (the storage client) and is based exclusively on the measured latency from unmodified cloud storage systems. This allows TOFEC to be deployed more easily, as individual cloud applications can adopt TOFEC without being tied to any particular cloud storage system, as long as a small number of APIs are provided by the storage system.

II. SYSTEM MODELS

A. Basic Architecture and Functionality

The basic system architecture of TOFEC captures how web services today utilize public or private storage clouds. The architecture consists of proxy servers in the front-end and a key-value store, referred to as storage cloud, in the back-end. Users interact with the proxy through a high-level API and/or user interfaces.
The proxy translates every high-level user request (to read or write a file) into a set of n tasks. Each task is essentially a basic storage access operation such as put, get, delete, etc. that will be accomplished using the low-level APIs provided by the storage cloud. The proxy maintains a certain number of parallel connections to the storage cloud and each task is executed over one of these connections. After a certain number of tasks are completed successfully, the user request is considered accomplished and the proxy responds to the user with an acknowledgment. The solutions we present are deployed on the proxy server side, transparent to the storage cloud.

For a read request, we assume the file is pre-coded into n_max ≥ n coded chunks with an (n_max, k) MDS code and stored on the cloud. Completion of downloading any k coded chunks provides sufficient data to reconstruct the requested file. For a write request, the file to be uploaded is divided and encoded into n coded chunks using an (n, k) MDS code, and hence completion of uploading any k coded chunks means sufficient data have been stored onto the cloud. Thus, upon completion of a request, the n − k un-started and/or unfinished tasks are preemptively canceled and removed from the system¹.

Fig. 2. System Model

Accordingly, we model the proxy by the queueing system shown in Fig. 2. There are two FIFO (first-in-first-out) queues: (i) the request queue that buffers all incoming user requests, and (ii) the task queue, which is a multi-server queue and holds all tasks waiting to be executed. L threads², representing the set of parallel connections to the storage cloud, are attached to the task queue. The adaptation module of TOFEC monitors the state of the queues and the threads, and decides what coding parameter (n, k) is to be used for each request.

¹ For a write request, the remaining tasks can also be scheduled as background jobs depending on the subsequent read profile of the file.
² We avoid the term "server" that is commonly used in the queueing theory literature to prevent confusion.
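This queueing model is easy to prototype. The following toy discrete-event sketch in Python (our own illustration with arbitrary numbers, not the simulator used in the paper) mimics the proxy of Fig. 2: a FIFO request queue served by L threads, a batch of n tasks per admitted request, completion at the k-th finished task, and preemptive cancellation of the rest. Task-to-thread assignment is simplified to "earliest idle thread first".

import random

def simulate(arrivals, n, k, L, task_delay):
    """Return each request's total (queueing + service) delay; assumes n <= L."""
    free_at = [0.0] * L                     # next idle time of each thread
    delays = []
    for t_arrival in arrivals:              # FIFO: requests served in order
        free_at.sort()
        start = max(t_arrival, free_at[0])  # HoL admitted once a thread idles
        batch = []
        for i in range(n):                  # n tasks on the earliest threads
            t0 = max(start, free_at[i])
            batch.append((t0 + task_delay(), free_at[i]))
        done = sorted(f for f, _ in batch)[k - 1]     # X_(k): request served
        for i, (f, prev) in enumerate(batch):         # release the threads,
            free_at[i] = max(prev, min(f, done))      # canceling at X_(k)
        delays.append(done - t_arrival)
    return delays

random.seed(1)
d = simulate(arrivals=[0.05 * i for i in range(1000)], n=6, k=3, L=16,
             task_delay=lambda: 0.05 + random.expovariate(1 / 0.03))
print("mean total delay:", sum(d) / len(d))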
Without loss of generality, we assume that the head-of-line (HoL) request leaves the request queue only when there is at least one idle thread and the task queue is empty. A batch of n tasks is then created for that request and injected into the task queue. As soon as any k tasks complete successfully, the request is considered completed. Such a queueing system is work conserving, since no thread is left idle as long as there is any request or task pending.

B. Basics of Erasure Codes

An (n, k) MDS code (e.g., Reed-Solomon codes) encodes k data chunks, each of B bits, into a codeword consisting of n B-bit long coded chunks. The coded chunks can sustain up to n − k erasures such that the k original data chunks can be efficiently reconstructed from any subset of k coded chunks. n and k are called the length and dimension of the MDS code. We also define r = n/k as the redundancy ratio of an (n, k) MDS code. This erasure resistant property of MDS codes has been utilized in prior works [3], [4], [5], as well as in this paper, to improve the delay of cloud storage systems: essentially, a coded chunk experiencing long delay is treated as an erasure.

Fig. 3. Example of supporting multiple chunk sizes with the Shared Key approach: the 3MB file is divided and encoded into a coded file of 6MB consisting of 12 strips, each of 0.5MB. Downloading the file using a (2,1) MDS code is accomplished by creating two read tasks: one for strips 1-6, and the other for strips 7-12.

In this paper, we make use of another interesting property of MDS codes to implement the variable chunk sizing of TOFEC in a storage efficient manner: an MDS code of high length and dimension for a small chunk size can be used as an MDS code of smaller length and dimension for a larger chunk size. To be more specific, consider any (N, K) MDS code for chunks of b bits. To avoid confusion, we will refer to these b-bit chunks as strips. A different MDS code of length n = N/m, dimension k = K/m and chunk size B = bm for some m > 1 can be constructed by simply batching every m data/coded strips into one data/coded chunk. The resulting code is an (n, k) MDS code for B-bit chunks because any k coded chunks cover mk = K coded strips, which is sufficient to reconstruct the original file of Bk = bm · K/m = bK bits. This property is illustrated with an example in Fig. 3. In this example, a 3MB file is divided into 6 strips of 0.5MB and encoded into 12 coded strips of total size 6MB, using a (12,6) MDS code. This code can then be used as a (2,1) code for 3MB chunks, a (4,2) code for 1.5MB chunks and a (6,3) code for 1MB chunks simultaneously, by batching 6, 3 and 2 strips into a chunk.

C. Definitions of Different Delays

The delay experienced by a user request consists of two components: queueing delay (D_q) and service delay (D_s). Both are defined with respect to the request queue: (i) the queueing delay is the amount of time a request spends waiting in the request queue, and (ii) the service delay is the period of time between when the request leaves the request queue (i.e., is admitted into the task queue and starts being served by at least one thread) and when it finally leaves the system (i.e., the first time when any k of the corresponding tasks complete). In addition, we also consider the task delay (D_t), which is the time it takes for a thread to serve a task assuming it is not terminated or canceled preemptively. To clarify these definitions, consider a request served with an (n, k) MDS code, with T_A its arrival time and T_1 ≤ T_2 ≤ · · · ≤ T_n the starting times of the corresponding n tasks³. Then the queueing delay is D_q = T_1 − T_A.
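To make the strip-batching property concrete, the following minimal Python sketch (our own; the (12,6) layout and strip size follow Fig. 3, the helper name is ours) computes the byte ranges in the shared coded file that realize the (2,1), (4,2) and (6,3) codes.

STRIP = 512 * 1024          # b: strip size in bytes (0.5MB)
N, K = 12, 6                # high-dimension code applied to the original file

def chunk_ranges(k):
    """Byte ranges in the 6MB coded file realizing an (n, k) = (rk, k) code.

    Batching m = K // k consecutive strips forms one chunk, so chunk j
    (0-indexed) covers strips [j*m, (j+1)*m) of the coded file.
    """
    assert K % k == 0, "k must divide K"
    m = K // k              # strips per chunk
    n = (N // K) * k        # keep the same redundancy ratio r = N/K
    return [(j * m * STRIP, (j + 1) * m * STRIP - 1) for j in range(n)]

# A (2,1) code reads strips 1-6 and 7-12 as two 3MB chunks ...
print(chunk_ranges(1))      # [(0, 3145727), (3145728, 6291455)]
# ... while a (6,3) code reads six 1MB chunks from the same coded file.
print(chunk_ranges(3))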
Suppose D_{t,1}, ..., D_{t,n} are the corresponding task delays; then the completion times of these tasks will be X = {T_1 + D_{t,1}, ..., T_n + D_{t,n}} if none is canceled. So the request will leave the system at time X_(k), which denotes the k-th smallest value in X, i.e., the time when k tasks complete. Then the service delay of this request is D_s = X_(k) − T_1.

III. VARIABLE CHUNK SIZING

In this section, we discuss implementation issues as well as pros and cons of two potential approaches, namely Unique Key and Shared Key, for supporting erasure-code-based access to files on the storage cloud with a variety of chunk sizes. Suppose the maximum desired redundancy ratio is r; then these approaches implement variable chunk sizing as follows:

Unique Key: For every choice of chunk size (or equivalently k), a separate batch of rk coded chunks is created and each coded chunk is stored as an individual object with its own unique key on the storage cloud. The access to different chunks is implemented through the basic get, put storage cloud APIs.

Shared Key: A coded file is first obtained by stacking together the coded strips obtained by applying a high-dimension (N = rK, K) MDS code to the original file, as described in Section II-B and illustrated in Fig. 3. For read, the coded file is stored on the cloud as one object. Access to chunks of variable size is realized by downloading segments of the coded file corresponding to batches of a corresponding number of strips, using the same key with more advanced partial read storage cloud APIs. Similarly, for write, the file is uploaded in parts using partial write APIs and later merged into one object in the cloud.

A. Implementation and Comparison of the two Approaches

1) Storage cost: When the user request is to write a file, the storage costs of Unique Key and Shared Key are not so different. However, to support variable chunk sizing for read requests, Shared Key is significantly more cost-efficient than Unique Key. With Shared Key, a single coded file stored on the cloud can be reused to support essentially an arbitrary number of different chunk sizes, as long as the strip size is small enough.

³ We assume T_i = ∞ if the i-th task is never started.
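The delay definitions translate directly into code. Below is a minimal sketch (ours, with made-up numbers) that computes D_q and D_s for one request from its task start times and task delays, using X_(k), the k-th smallest completion time.

def request_delays(T_A, T, Dt, k):
    """Return (queueing delay D_q, service delay D_s) for one request."""
    X = sorted(t + d for t, d in zip(T, Dt))  # task completion times
    D_q = T[0] - T_A          # wait until the first task starts
    D_s = X[k - 1] - T[0]     # X_(k): k-th smallest completion time
    return D_q, D_s

# A (4, 2) request: arrival at t=0, four tasks start at t=5 and would take
# 100, 40, 250 and 60 ms; the request completes when any 2 tasks are done.
print(request_delays(0.0, [5.0] * 4, [100.0, 40.0, 250.0, 60.0], k=2))
# -> (5.0, 60.0); the two slowest tasks would be canceled preemptively.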
On the other hand, it seems impossible to achieve similar reuse with the Unique Key approach, where different chunks of the same file are treated as individual objects. So with Unique Key, every additional chunk size to be supported requires an extra storage cost of r × (file size). Such linear growth of storage cost easily makes it prohibitively expensive even to support a small number of chunk sizes.

2) Diversity in delays: The success of TOFEC and other proposals that use redundant requests (either with erasure coding or replication) for delay improvement relies on diversity in cloud storage access delays. In particular, TOFEC, as well as [3], [4], [5], requires access delays for different chunks of the same file to be weakly correlated. With Unique Key, since different chunks are treated as individual objects, there is no inherent connection among them from the storage cloud system's perspective. So, depending on the internal implementation of the object placement policy of the storage cloud system, chunks of a file can be stored on the cloud in different storage units (disks or servers) on the same rack, or in different racks in the same data center, or even in different data centers at distant geographical locations. Hence it is quite likely that delays for accessing different chunks of the same file show very weak correlation. On the other hand, with Shared Key, since coded chunks are combined into one coded file and stored as one object in the cloud, it is very likely that the whole coded file, hence all coded chunks/strips, is stored in the same storage unit, unless the storage cloud system internally divides the coded file into pieces and distributes them to different units. Although many distributed storage systems do divide files into parts and store them separately, this is normally done only for larger files. For example, the popular Hadoop distributed file system by default does not divide files smaller than 64MB. When different chunks are stored on the same storage unit, we can expect higher correlation in their access delays. It then remains to be verified that the correlation between different chunks with the Shared Key approach is still weak enough for our coding solution to be beneficial.

3) Universal support: Unique Key is the approach adopted in our previous work [3] to support erasure-code based file access with one predetermined chunk size. A benefit of Unique Key is that it only requires the basic get and put APIs that all storage cloud systems must provide. So it is readily supported by all storage cloud systems and can be implemented on top of any one of them. On the other hand, Shared Key requires more advanced APIs that allow the proxy to download or upload only the targeted segment of an object. Such advanced APIs are not currently supported by all storage cloud systems. For example, to the best of our knowledge, Microsoft's Azure Storage currently provides only methods for partial read⁴ but none for partial write.

⁴ E.g. DownloadRangeToStream(target, offset, length) downloads a segment of length bytes starting from the offset-th byte of the target object (or "blob" in Azure's jargon).

Fig. 4. CCDF of individual threads with 1MB chunks and n = 6 (CCDF of task delay in msec; one curve for Unique Key thread 1 and one for each of Shared Key threads 1-6).
On the contrary, Amazon S3 provides partial access for both read and write: the proxy can download a specific inclusive byte range within an object stored on S3 by calling getObject(request, destination)⁵; and for uploading, an uploadPart method to upload segments of an object and a completeMultipartUpload method to merge the uploaded segments are provided. We expect more service providers to introduce both partial read and write APIs in the near future.

⁵ The byte range is set by calling request.setRange(start, end).

B. Measurements on Amazon S3

To understand the trade-off between Unique Key and Shared Key, we ran measurements over Amazon EC2 and S3. An EC2 instance served as the proxy in our system model. We instantiated an extra large EC2 instance with high I/O capability in the same availability region as the S3 bucket that stores our objects. We conducted experiments on different weekdays from May to July 2013 with various chunk sizes between 0.5MB and 3MB and up to n = 12 coded chunks per file. For each value of n, we allowed L = n simultaneously active threads, with the i-th thread being responsible for downloading the i-th coded chunk of each file. Each experiment lasted longer than 24 hours. We alternated between different settings to capture similar time-of-day characteristics across all settings. The experiments were conducted within all 8 availability regions of Amazon S3. Except for the "US Standard" availability region, all other 7 regions demonstrate similar performance statistics that are consistent over different times and days. We conjecture that the different and inconsistent behavior of US Standard might be due to the fact that it targets a slightly different usage pattern and may employ a different implementation for that reason⁶. We will exclude US Standard from subsequent discussions. Due to lack of space, we only show a limited subset of findings for availability region "North California" that are representative for regions other than US Standard:

(1) In both Unique Key and Shared Key, the task delay distributions observed by different threads are almost identical. The two approaches are indistinguishable even beyond the 99.9th percentile. Fig. 4 shows the complementary cumulative distribution function (CCDF) of task delays observed by individual threads for 1MB chunks and n = 6. Both approaches demonstrate large delay spread in all regions.

⁶ See http://docs.aws.amazon.com/general/latest/gr/rande.html#s3 region
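As a concrete illustration of the partial-access APIs just described, the following hedged sketch uses boto3 (the AWS SDK for Python, rather than the Java SDK calls named in the text). Bucket and key names are placeholders; error handling and S3's minimum part size rules for multipart uploads are ignored for brevity.

import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, first_byte, last_byte):
    """Partial read: fetch one coded chunk as an inclusive byte range."""
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={first_byte}-{last_byte}")
    return resp["Body"].read()

def write_chunks(bucket, key, chunks):
    """Partial write: upload coded chunks as parts, then merge into one object."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=i,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"PartNumber": i, "ETag": resp["ETag"]})
    s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})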
Fig. 5. CCDF of service delay for reading 3MB files with 1MB chunks (service delay in msec; curves for n = 3, 4, 5, 6): (a) Unique Key; (b) Shared Key.

(2) Task delays for different threads in Unique Key show close to zero correlation, while they demonstrate slightly higher correlation in Shared Key, as expected. Across all settings, the cross correlation coefficient between different threads stays below 0.05 in Unique Key and ranges from 0.1 to 0.17 in Shared Key. Both approaches achieve significant service delay improvements. Fig. 5 plots the CCDF of service delays for downloading 3MB files with 1MB chunks (k = 3) with n = 3, ..., 6, assuming all n tasks in a batch start at the same time. In this setting, both approaches reduce the 99th percentile delay by roughly 50%, 65% and 80% by downloading 1, 2 and 3 extra coded chunks. Although Shared Key demonstrates up to 3 times higher cross correlation coefficients, there is no meaningful statistical distinction in service delay between the two approaches until beyond the 99th percentile. All availability regions experience different degrees of degradation at high percentiles with Shared Key due to the higher correlation. Significant degradation emerges from around the 99.9th percentile and beyond in all regions except for Sao Paulo, in which degradation appears around the 99th percentile.

(3) Task delays are always lower bounded by some constant Δ₀ that grows roughly linearly as the chunk size increases. This constant part of the delay cannot be reduced by using more threads: see the flat segment at the beginning of the CCDF curves in Fig. 4 and Fig. 5. Since this constant portion of task delays is unavoidable, it leads to a negative effect of using larger n: there is a minimum cost of n·Δ₀ in system resources (time × thread) that grows linearly in n. This cost leads to a reduced capacity region when more redundant tasks are used, as illustrated in the example of Fig. 1. We observe that the two approaches deliver almost identical total delays (queueing + service) at all arrival rates, in spite of the degraded service delay of Shared Key at very high percentiles. So we only plot the results with Shared Key in Fig. 1.

(4) Both the mean and standard deviation of task delays grow roughly linearly as the chunk size increases. Fig. 6 plots the measured mean and standard deviation of task delays in both approaches at different chunk sizes. Also plotted in the figures are least squares fitted lines for the measurement results. Notice that the extrapolations at chunk size = 0 are all greater than zero. We believe this observation reflects the costs of non-I/O-related operations in the storage cloud that do not scale proportionally to the object size: for example, the cost of locating the requested object. We also believe such costs contribute partially to the minimum task delay constant.

Fig. 6. Delay statistics vs. chunk size (MB), for Unique Key and Shared Key with their least squares fittings: (a) mean (msec); (b) standard deviation (msec).

C. Model of Task Delays

Based on the aforementioned observations, we decide to use the Shared Key approach in TOFEC, since its outstanding storage efficiency outweighs the minimal degradation in delay. For the analysis presented in the next section, we model the task delays as independently distributed random variables whose mean and standard deviation grow linearly as the chunk size B increases.
More specifically, we assume the task delay D_t for chunk size B follows a distribution of the form

D_t(B) \sim \Delta(B) + \exp(\mu(B)),   (1)

where Δ(B) = Δ + δB captures the lower bound of task delays as in observation (3), and exp(μ(B)) represents an exponential random variable that models the tail of the CCDF. The mean and standard deviation of the exponential tail both equal 1/μ(B) = Ψ + ψB. With this model, the constants Δ and Ψ together capture the non-zero extrapolations of the mean and standard deviation of task delays at chunk size 0, and similarly, the constants δ and ψ together capture the rates at which the mean and standard deviation grow as the chunk size increases, as in observation (4).

IV. DESIGN OF TOFEC

For the analysis in this section, we group requests into classes according to the tuple (type, size). Here type can be read or write, and could potentially be any other type of operation supported by the cloud storage. Each type of operation has its own set of delay parameters {Δ, δ, Ψ, ψ}. Subscripts i will be used to indicate variables associated with each class. We first introduce approximations for the expected queueing and service delays, assuming the FEC code used to serve requests of each class is predetermined and fixed (Section IV-A). Then we formulate an optimization problem whose objective is to minimize the expected total delay over all such static strategies with fixed FEC codes. We show that solutions to the non-convex optimization problem exhibit a nice property (Section IV-B): the optimal values of n_i, k_i and r_i can all be expressed as functions solely determined by Q, the expected length of the request queue: n_i = N_i(Q), k_i = K_i(Q) and r_i = R_i(Q). N_i, K_i and R_i are all strictly decreasing functions of Q. This finding is then used as the guideline in the design of our backlog-driven adaptive strategy (Section IV-C).
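For intuition, here is a small Python sketch (our own, with made-up parameter values) that samples task delays from the model of Eq. (1) and checks that the sample mean tracks (Δ + Ψ) + (δ + ψ)B.

import random

DELTA, delta = 30.0, 25.0   # msec and msec/MB: lower-bound parameters (assumed)
PSI, psi = 15.0, 12.0       # msec and msec/MB: tail parameters (assumed)

def sample_task_delay(B):
    """Draw one task delay (msec) for a chunk of B megabytes per Eq. (1)."""
    lower_bound = DELTA + delta * B          # Delta(B): deterministic floor
    tail_mean = PSI + psi * B                # 1/mu(B): mean = std of the tail
    return lower_bound + random.expovariate(1.0 / tail_mean)

# Sample mean should be near (DELTA + PSI) + (delta + psi) * B = 82 at B = 1:
samples = [sample_task_delay(1.0) for _ in range(100000)]
print(sum(samples) / len(samples))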
A. Approximated Analysis of Static Strategies

Denote by J_i the file size of class i. Consider a request of class i served with an (n_i, k_i) MDS code, i.e., B_i = J_i/k_i. First suppose all n_i tasks start at the same time, i.e., T_1 = · · · = T_{n_i}. In this case, given our model for task delays, it is trivial to show that the expected service delay equals

D_{s,i} = \Delta_i(J_i/k_i) + \frac{1}{\mu_i(J_i/k_i)} \sum_{j=0}^{k_i-1} \frac{1}{n_i-j} \approx \Delta_i(J_i/k_i) + \frac{1}{\mu_i(J_i/k_i)} \ln\frac{n_i}{n_i-k_i} = \Delta_i + \delta_i \frac{J_i}{k_i} + \left(\Psi_i + \psi_i \frac{J_i}{k_i}\right) \ln\frac{r_i}{r_i-1}.   (2)

Also define the system usage (or simply cost) of a request as the sum of the amounts of time each of its tasks is served by a thread⁷. When all tasks start at the same time, the expected system usage is (see Section IV of [3] for a detailed derivation)

U_i = n_i \Delta_i(J_i/k_i) + \frac{k_i}{\mu_i(J_i/k_i)} = \Delta_i k_i r_i + \delta_i J_i r_i + \Psi_i k_i + \psi_i J_i.   (3)

Suppose class i contributes a fraction p_i of the total arrivals; then the average cost per request is U = Σ_i p_i U_i. With L simultaneously active threads, requests depart the system at rate L/U (requests/unit time). In light of this observation, we approximate the request queue with an M/M/1 queue with service rate L/U. In other words, given the composition of requests {p_i} and the choice of code(s) {n_i, k_i}, the system capacity (the maximum supportable throughput) is approximated by L/U. So the queueing delay in the original system at total arrival rate λ is approximated by

D_q = \frac{\lambda}{\frac{L}{U}\left(\frac{L}{U}-\lambda\right)} = \frac{\lambda U^2}{L(L-\lambda U)},   (4)

and the expected length of the request queue is approximately

Q = \lambda D_q = \frac{(\lambda U)^2}{L(L-\lambda U)} = \frac{\tilde{\lambda}^2}{L(L-\tilde{\lambda})}.   (5)

Here λ̃ = λU = λ Σ_i p_i U_i = λ Σ_i p_i (Δ_i k_i r_i + δ_i J_i r_i + Ψ_i k_i + ψ_i J_i). Noticing that, given {p_i}, L/U is maximized when n_i = 1, k_i = 1 for all i, we call this maximum value the (approximated) full capacity for that {p_i}. We acknowledge that the above approximation is quite coarse, especially because tasks of the same batch do not in general start at the same time. However, remember that the main objective of this paper is to develop a practical solution that achieves the optimal delay-throughput trade-off. According to the simulation results, this approximation is sufficiently good for the purpose of this paper.

⁷ The time a task j is served is D_{t,j} if it completes successfully, X_(k) − T_j if it starts but is terminated preemptively, and 0 if it is canceled while waiting in the task queue.

B. Optimal Static Strategy

Given the total arrival rate λ and composition of requests {p_i}, we want to find the best choice of FEC code for each class such that the total delay is minimized. Relaxing the requirement that n_i and k_i be integers, this is formulated as the following minimization problem⁸:

min_{k_i, r_i}  D_q + Σ_i p_i D_{s,i}   (∗)
s.t.  k_i > 0,  r_i ≥ 1,  λ̃ < L.

Notice that this is a non-convex optimization problem because the feasible region is not a convex set, due to the k_i r_i terms in λ̃. In general, non-convex optimization problems are difficult to solve. Fortunately, we are able to prove the following theorem, according to which this non-convex optimization problem can be solved efficiently by numerical methods.

Theorem 1: For any given λ and {p_i}, the non-convex optimization problem (∗) has a unique optimal solution, which satisfies the following for all i:

\frac{k_i(\Psi_i k_i + \psi_i J_i)}{\Delta_i k_i + \delta_i J_i} = \frac{J_i r_i (r_i-1)\left(\delta_i + \psi_i \ln\frac{r_i}{r_i-1}\right)}{\Delta_i r_i + \Psi_i},   (6)

\left(\frac{L}{L-\tilde{\lambda}}\right)^2 = 1 + \frac{2L(\Psi_i k_i + \psi_i J_i)}{k_i r_i (r_i-1)(\Delta_i k_i + \delta_i J_i)}.   (7)

Proof: See Appendix.

Observe that Eq. 6 contains only the delay parameters and file size of class i, so it should always be satisfied no matter what the arrival rate λ and request composition {p_i} are. Solving Eq. 6 alone gives a set of pairs (k_i, r_i) that are the optimal choices of code for class i for some λ and {p_i}. Then, solving Eq. 7 within this set, we obtain the optimal k_i and r_i as functions of λ̃, for all combinations of λ and {p_i} such that λ̃ = λ Σ_i p_i U_i.
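Theorem 1 suggests a simple numerical recipe, sketched below for a single class with assumed parameters and relaxed (real-valued) k and r (our illustration, not the paper's solver; Eq. (7) as coded here follows our reconstruction of the theorem): bisect Eq. (6) to get r as a function of k, using the fact that its right hand side is strictly increasing in r, then bisect Eq. (7) over k for a given workload level λ̃.

import math

DELTA, delta, PSI, psi = 30.0, 25.0, 15.0, 12.0   # assumed delay parameters
J, L = 3.0, 16                                    # file size (MB), threads

def eq6_gap(k, r):
    lhs = k * (PSI * k + psi * J) / (DELTA * k + delta * J)
    rhs = J * r * (r - 1) * (delta + psi * math.log(r / (r - 1))) \
          / (DELTA * r + PSI)
    return lhs - rhs

def r_of_k(k):
    lo, hi = 1.0 + 1e-12, 1e6
    for _ in range(100):                 # gap > 0 means r is still too small
        mid = (lo + hi) / 2
        if eq6_gap(k, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def eq7_gap(k, lam_tilde):
    r = r_of_k(k)
    lhs = (L / (L - lam_tilde)) ** 2
    rhs = 1 + 2 * L * (PSI * k + psi * J) \
          / (k * r * (r - 1) * (DELTA * k + delta * J))
    return lhs - rhs

def optimal_code(lam_tilde):
    lo, hi = 0.1, 100.0
    for _ in range(100):                 # gap < 0 means k is still too small
        mid = (lo + hi) / 2
        if eq7_gap(mid, lam_tilde) < 0:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2
    return k, r_of_k(k)

print(optimal_code(4.0))   # relaxed optimum; round (k, r) to integers in use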
Observing from Eq. 5 that λ̃ = L(\sqrt{Q^2+4Q} − Q)/2, and with some simple calculus, we conclude:

Corollary 1: The optimal values of n_i, k_i and r_i can all be expressed as strictly decreasing functions of Q:

n_i = N_i(Q), k_i = K_i(Q) and r_i = R_i(Q).   (8)

C. Adaptive Strategy

The finding of Corollary 1 conforms to our intuition:

At light workload (small λ), there should be little backlog in the request queue (small Q) and the service delay dominates the total delay. In this case, the system is not operating in the capacity-limited regime, so it is beneficial to increase the level of chunking and redundancy to reduce delay.

At heavy workload (large λ), there will be a large backlog in the request queue (large Q) and the queueing delay dominates the total delay. In this case, the system operates in the capacity-limited regime, so it is better to reduce the level of chunking and redundancy to support higher throughput.

⁸ Notice that all classes share the same queueing delay. Also, we require k_i > 0 instead of k_i ≥ 1 for a technicality that simplifies the proof of the uniqueness of the optimal solution. We require r_i ≥ 1 since n_i ≥ k_i. λ̃ < L is imposed for queue stability.
More importantly, it suggests that it is sufficient to choose the FEC code solely based on the length of the request queue. The basic idea of TOFEC is to choose n_i = N_i(q) and k_i = K_i(q) for a request of class i, where q is the queue length upon the arrival of the request. When this is done for all requests arriving into the system, it can be expected that the average code lengths (and dimensions) and the expected queue length Q satisfy Eq. 8, hence optimal delay is achieved. In TOFEC, this is implemented with a threshold-based algorithm, which can be performed very efficiently. For each class i, we first compute the expected queue length if n = 1, ..., n_max were the optimal code length, by

Q^N_{i,n} = N_i^{-1}(n).   (9)

Here n_max is the maximum number of tasks allowed for a class i request. Since N_i is a strictly decreasing function, its inverse N_i^{-1} is a well-defined strictly decreasing function. As a result, we have Q^N_{i,1} > Q^N_{i,2} > · · · > Q^N_{i,n_max} > 0. Remember that our goal is to use code length n if the queue length q is around Q^N_{i,n}, so we want a set of thresholds {H^N_{i,n}} such that

H^N_{i,1} > Q^N_{i,1} > H^N_{i,2} > Q^N_{i,2} > · · · > H^N_{i,n_max} > Q^N_{i,n_max} > H^N_{i,n_max+1} = 0,

and TOFEC will use the n such that q ∈ [H^N_{i,n+1}, H^N_{i,n}). In our current implementation of TOFEC, we use H^N_{i,n} = (Q^N_{i,n-1} + Q^N_{i,n})/2 and H^N_{i,1} = ∞. A set of thresholds {H^K_{i,k}} for the adaptation of k is found in a similar fashion. The adaptation algorithm of TOFEC is summarized in pseudocode below:

TOFEC (Throughput Optimal FEC Cloud)
Initialization: q̄ = 0
On arrival of a request:
1: q ← queue length upon arrival of the request
2: i ← class that the request belongs to
3: q̄ ← αq̄ + (1 − α)q
4: Find k ≤ k_max such that q̄ ∈ [H^K_{i,k+1}, H^K_{i,k})
5: Find n ≤ n_max such that q̄ ∈ [H^N_{i,n+1}, H^N_{i,n})
6: n ← min(r_max·k, n)
7: Serve the request with an (n, k) code when it becomes HoL.

Note that in Step 6 we reduce n to r_max·k if the redundancy ratio of the code chosen in the previous steps is higher than r_max, the maximum allowed redundancy ratio for class i. Also, instead of comparing q directly with the thresholds, we compare an exponential moving average q̄ = αq̄ + (1 − α)q, with a memory factor 0 ≤ α < 1, against the thresholds to determine n and k. The moving average is used to mitigate transient variations in queue length so that n and k do not change too frequently. Obviously, we only need to set α = 0 in order to use the instantaneous queue length q for the adaptation, since in this case q̄ = q.

V. EVALUATION

We now demonstrate the benefits of TOFEC's adaptation mechanism. We evaluate TOFEC's adaptation strategy and show that TOFEC outperforms static strategies under both constant and changing workloads, as well as a simple greedy heuristic that will be introduced later.

A. Simulation Setup

We conducted trace-driven simulations for performance evaluation for both single-class and multi-class scenarios with both read and write requests of different file sizes. Due to lack of space, we only show results for the scenario with one class (read, 3MB). But we must emphasize that this scenario is representative enough that the findings discussed in this section are valid for other settings (different file sizes, write requests, and multiple classes). We assume the system supports up to L = 16 simultaneously active threads. We set the maximum code dimension and redundancy ratio to k_max = 6 and r_max = 2, because we observe negligible gain in service delay beyond this chunking and redundancy level in our measurements. We use traces collected in May and June in availability region North California. In order to compute the thresholds for TOFEC, we need estimations of the delay parameters {Δ, δ, Ψ, ψ}. For this, we first filter out the worst 10% of task delays in the traces; then we compute the delay parameters from the least squares linear approximation of the mean and standard deviation of the remaining task delays. We use memory factor α = 0.99 in TOFEC.
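The parameter estimation just described is straightforward; below is a sketch of it (ours). Per chunk size, it drops the worst 10% of task delays, then least-squares fits the mean and the standard deviation of the rest against chunk size B: under the model of Eq. (1), mean = (Δ + Ψ) + (δ + ψ)B and std = Ψ + ψB, which recovers all four parameters.

import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def estimate_params(traces):
    """traces: dict of chunk size B (MB) -> list of measured task delays."""
    sizes, means, stds = [], [], []
    for B, delays in sorted(traces.items()):
        kept = sorted(delays)[: int(0.9 * len(delays))]  # filter worst 10%
        sizes.append(B)
        means.append(statistics.mean(kept))
        stds.append(statistics.stdev(kept))
    a_mean, b_mean = fit_line(sizes, means)   # = Delta + Psi, delta + psi
    PSI, psi = fit_line(sizes, stds)          # exponential tail: Psi + psi*B
    return a_mean - PSI, b_mean - psi, PSI, psi   # Delta, delta, Psi, psi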
In addition to the static strategies, we developed a simple heuristic strategy, GREEDY, for the purpose of comparison. Unlike the adaptive strategy in TOFEC, GREEDY does not require prior knowledge of the distribution of task delays, yet it achieves competitive mean delay performance. In GREEDY, the code used to serve a request of class i is determined by the number of idle threads upon its arrival: suppose there are l idle threads; then k = 1 if l = 0 and k = min(k_max, l) otherwise; similarly, n = 1 if l = 0 and n = min(r_max·k, l) otherwise. The idea of GREEDY is to first maximize the level of chunking with the idle threads available, then increase the redundancy ratio as long as idle threads remain.
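GREEDY's selection rule is a one-liner per parameter; the sketch below (ours) uses the k_max = 6 and r_max = 2 of the simulation setup:

K_MAX, R_MAX = 6, 2

def greedy_code(idle_threads):
    """Return the (n, k) code GREEDY uses given l idle threads."""
    l = idle_threads
    if l == 0:
        return 1, 1                   # no idle capacity: fall back to (1,1)
    k = min(K_MAX, l)                 # maximize chunking first
    n = min(R_MAX * k, l)             # then spend leftovers on redundancy
    return n, k

print([greedy_code(l) for l in range(0, 14)])
# l=0 -> (1,1); l=3 -> (3,3); l=9 -> (9,6); l>=12 -> (12,6)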
B. Throughput-Delay Trade-Off

Fig. 7 shows the mean, median, 90th percentile and 99th percentile delays of TOFEC and GREEDY with Poisson arrivals at different arrival rates λ. We also ran simulations with static strategies for all possible combinations of (n, k) at every arrival rate. We brute-force find the best mean, median, 90th and 99th percentile delays achieved by static strategies and use them as the baseline ("Best Static"). Also plotted in Fig. 7(a) and Fig. 7(b) are the mean and median delay performance of the basic static strategy with no chunking and no replication, i.e., the (1,1) code; the simple replication static strategy with a (2,1) code; and the backlog-based adaptive strategy from [3] with fixed code dimension k = 6 and adaptive n ≤ 12.

Fig. 7. Delay performance in the read-only scenario (delay in msec vs. arrival rate in req/sec, comparing TOFEC, GREEDY, Best Static, Static (1,1), Static (2,1) and Fixed k=6): (a) mean delay; (b) median delay; (c) 90th percentile delay; (d) 99th percentile delay.

As we can see, both TOFEC and GREEDY successfully support the full capacity region, the one supported by basic static, while achieving almost optimal mean and median delays throughout that region. At light workload, TOFEC delivers about a 2.5× improvement in mean delay compared with the basic static strategy, and about 2× compared with simple replication (from 205ms and 151ms to 84ms). It also reduces the median delay by about 2× from that of basic and simple replication (from 156ms and 138ms to 74ms). Meanwhile, GREEDY achieves about a 2× improvement in both mean (89ms) and median (79ms) delays over basic. With heavier workload, both TOFEC and GREEDY successfully adapt their codes to keep track of the best static strategies in terms of mean and median delays. It is clear from the figures that both TOFEC and GREEDY achieve our primary goal of retaining the full system capacity, as supported by the basic static strategy. On the contrary, although simple replication has slightly better mean and median delays than basic under light workload, it fails to support arrival rates beyond 70% of the capacity of basic. Meanwhile, the adaptive strategy from [3] with fixed code dimension k = 6 can only support less than 30% of the original capacity region, although it achieves the best delay at very light workload.

While the two adaptive strategies have similar mean and median performance, TOFEC significantly outperforms GREEDY at high percentiles. As Fig. 7(c) and Fig. 7(d) demonstrate, TOFEC matches the best static strategies at the 90th and 99th percentile delays throughout the whole capacity region. On the other hand, GREEDY fails to keep track of the best static performance at lower arrival rates. At light workload, TOFEC is over 2× and 2.5× better than GREEDY at the 90th and 99th percentiles, respectively. Less interesting is the case of heavy workload, when the system is capacity-limited: both strategies converge to the basic static strategy using mostly the (1,1) code, which is optimal in this regime.

C. Behavior of the Adaptation Mechanisms

When we look into the fraction of requests served by each choice of code, TOFEC and GREEDY turn out to behave quite differently. In Fig. 8(a) we plot the compositions of requests served with different code dimensions k. At each arrival rate, the two bars represent TOFEC and GREEDY. For each bar, the colors represent the fractions of requests served with code dimensions 1 through 6, from bottom to top.

Fig. 8. Comparison of TOFEC and GREEDY: (a) composition of k (fraction served with each code dimension 1-6 vs. arrival rate; left bar: TOFEC, right bar: GREEDY); (b) standard deviation of delay (msec) vs. arrival rate (req/sec), with Best Static for reference.

TOFEC's choice of k demonstrates a high concentration around the optimal value: at all arrival rates, over 80% of requests are served by 2 neighboring values of k. Moreover, as the arrival rate varies from low to high, TOFEC's choice of k transitions quite smoothly as (5,6) → (3,4) → (2,3) → (1,2), eventually converging to a single value as the workload approaches system capacity. On the contrary, GREEDY tends to round-robin across all possible choices of k, and the majority of requests are served with either k = 1 or k = 6. So GREEDY effectively alternates between the two extremes of no chunking and very high chunking, instead of staying around the optimal.
Such all-or-nothing behavior results in a 2× to 3× worse standard deviation, as shown in Fig. 8(b). So TOFEC provides much better QoS guarantees.

We further examine how well the two adaptive strategies adjust to changes in workload. In Fig. 9 we plot the total delay experienced by requests arriving at different times within a 600-second period. The arrival rate is 10 requests/second for the first and last 200 seconds, and 70 requests/second for the middle 200 seconds. Both adaptive strategies turn out to be quite agile to changes in the arrival rate and quickly converge to a good composition of codes that delivers optimal delays. On the contrary, the static strategy using a (3,2) code builds up a huge backlog during the middle 200-second period and takes over 100 seconds to clear it.

Fig. 9. Adaptation to changing workload: moving-averaged delay (msec) vs. time (sec) for TOFEC, GREEDY and Static (3,2), with λ = 10, 70, 10 req/sec over the successive 200-second intervals.

VI. RELATED WORK

FEC in connection with multiple paths and/or multiple servers is a well investigated topic in the literature [7], [8], [9], [10]. However, very little attention has been devoted to queueing delays. FEC in the context of network coding or coded scheduling has also been a popular topic from the perspectives of throughput (or network utility) maximization and throughput vs. service delay trade-offs [11], [12], [13],
[14]. Although some of these incorporate queueing delay analysis, the treatment is largely for broadcast wireless channels with quite different system characteristics and constraints. FEC has also been extensively studied in the context of distributed storage from the points of view of high durability and availability while attaining high storage efficiency [15], [16], [17]. The authors of [4] conducted a theoretical study of cloud storage systems using FEC in a fashion similar to our work [3]. Given that exact mathematical analysis of the general case is very difficult, the authors of [4] considered a very simple case with a fixed code of k = 2 tasks. Shah et al. [5] generalize the results of [4] to k > 2. Both works rely on the assumption of exponential task delays, which hardly captures reality. Therefore, some of their theoretical results cannot be applied in practice. For example, under the assumption of exponential task delays, Shah et al. have proved that using larger n will not reduce system capacity and will always improve delay, contradicting the simulation results using real-world measurements in [3] and in this paper.

VII. CONCLUSION

TOFEC's adaptation mechanism is the first technique for automatically adjusting the level of both chunking and redundancy for scalable key-value storage access using erasure codes and parallel connections. TOFEC monitors the local backlog and dynamically adjusts both the length and the dimension of the erasure code to be used. To evaluate TOFEC's adaptation mechanism, we ran simulations using real-world traces obtained on Amazon S3. We found that TOFEC delivers the optimal delay-throughput trade-off and dramatically outperforms non-adaptive strategies and simple adaptive heuristics.

REFERENCES

[1] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure Coding in Windows Azure Storage," in USENIX ATC, 2012.
[2] S. L. Garfinkel, "An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS," Harvard University, Tech. Rep., 2007.
[3] G. Liang and U. C. Kozat, "FAST CLOUD: Pushing the Envelope on Delay Performance of Cloud Storage with Coding," IEEE/ACM Trans. Networking, preprint, 13 Nov. 2013, doi: 10.1109/TNET.2013.2289382.
[4] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes Can Reduce Queueing Delay in Data Centers," in IEEE ISIT, 2012.
[5] N. B. Shah, K. Lee, and K. Ramchandran, "The MDS Queue: Analysing Latency Performance of Codes and Redundant Requests," arXiv:1211.5405, Apr. 2013.
[6] J. C. McCullough, J. Dunagan, A. Wolman, and A. C. Snoeren, "Stout: an Adaptive Interface to Scalable Cloud Storage," in USENIX ATC, 2010.
[7] V. Sharma, S. Kalyanaraman, K. Kar, K. K. Ramakrishnan, and V. Subramanian, "MPLOT: A Transport Protocol Exploiting Multipath Diversity Using Erasure Codes," in IEEE INFOCOM, 2008.
[8] E. Gabrielyan, "Fault-Tolerant Real-Time Streaming with FEC thanks to Capillary Multi-Path Routing," Computing Research Repository, 2006.
[9] J. W. Byers, M. Luby, and M. Mitzenmacher, "Accessing Multiple Mirror Sites in Parallel: Using Tornado Codes to Speed Up Downloads," in IEEE INFOCOM, 1999.
[10] R. Saad, A. Serhrouchni, Y. Begliche, and K. Chen, "Evaluating Forward Error Correction Performance in BitTorrent Protocol," in IEEE LCN, 2010.
[11] A. Eryilmaz, A. Ozdaglar, M. Medard, and E. Ahmed, "On the Delay and Throughput Gains of Coding in Unreliable Networks," IEEE Trans. Inf. Theory, 2008.
[12] W.-L. Yeow, A. T. Hoang, and C.-K. Tham, "Minimizing Delay for Multicast-Streaming in Wireless Networks with Network Coding," in IEEE INFOCOM, 2009.
[13] T. K. Dikaliotis, A. G. Dimakis, T. Ho, and M. Effros, "On the Delay of Network Coding over Line Networks," Computing Research Repository, 2009.
[14] U. C. Kozat, "On the Throughput Capacity of Opportunistic Multicasting with Erasure Codes," in IEEE INFOCOM, 2008.
[15] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network Coding for Distributed Storage Systems," IEEE Trans. Inf. Theory, 2010.
[16] R. Rodrigues and B. Liskov, "High Availability in DHTs: Erasure Coding vs. Replication," in 4th International Workshop, IPTPS, 2005.
[17] J. Li, S. Yang, X. Wang, and B. Li, "Tree-Structured Data Regeneration in Distributed Storage Systems with Regenerating Codes," in IEEE INFOCOM, 2010.

APPENDIX

Proof: The objective of (∗) is a lower-bounded, continuously differentiable function within the feasible region. Its value goes to ∞ as (k_i, r_i) approaches the boundary of the feasible region. As a result, there exists at least one global optimal solution. At a global optimum, the derivatives of the objective with respect to k_i and r_i both equal 0. Equating the partial derivatives to 0 can be rewritten as Eq. 6 and Eq. 7. It is trivial to show that the left hand side of Eq. 6 is a strictly increasing function of k_i and the right hand side is a strictly increasing function of r_i as long as r_i ≥ 1. This implies that, at the optimal solutions, r_i is a strictly increasing function of k_i. The right hand side of Eq. 7 becomes some function π_i(k_i) of k_i by substituting r_i with the solution from Eq. 6. It can be shown that π_i is a strictly decreasing function. Notice that Eq. 7 must be satisfied for all i while its left hand side remains unchanged. Then

π_i(k_i) = π_j(k_j), for all i, j = 1, ..., m, i ≠ j.   (10)

Recall that π_i and π_j are strictly decreasing functions of k_i and k_j, respectively. This means that there is a one-to-one mapping between any k_i and k_j at the optimal solutions, and k_j is a strictly increasing function of k_i. Notice that for any given λ and {p_i}, the left hand side of Eq. 7 is a strictly increasing function of k_i if we replace all k_j's and r_j's with the solutions of Eq. 6 and Eq. 10. The right hand side of Eq. 7 is π_i(k_i), which is a strictly decreasing function of k_i. As a result, the two sides can be equal for at most one value of k_i, i.e., Eq. 6 and Eq. 7 have at most one solution. Since we have already proved the existence of a solution to these equations via the existence of a global optimum, they have a unique solution.