Using Elasticity to Improve Inline Data Deduplication Storage Systems

Yufeng Wang, Temple University, Philadelphia, PA, USA, Y.F.Wang@temple.edu
Chiu C. Tan, Temple University, Philadelphia, PA, USA, cctan@temple.edu
Ningfang Mi, Northeastern University, Boston, Massachusetts, USA, ningfang@ece.neu.edu

Abstract—Elasticity is the ability to scale computing resources such as memory on demand, and is one of the main advantages of utilizing cloud computing services. With the increasing popularity of cloud based storage, it is natural that more deduplication based storage systems will be migrated to the cloud. Existing deduplication systems, however, do not adequately take advantage of elasticity. In this paper, we illustrate how to use elasticity to improve deduplication based systems, and propose EAD (elasticity aware deduplication), an indexing algorithm that uses the ability to dynamically increase memory resources to improve overall deduplication performance. Our experimental results indicate that EAD is able to detect more than 98% of all duplicate data while consuming less than 5% of the expected memory space. Meanwhile, it achieves four times the deduplication efficiency of the state-of-the-art sampling technique while costing less than half the amount of memory.

I. INTRODUCTION

Data deduplication is a technique used to reduce storage and transmission overhead by identifying and eliminating redundant data segments. Data deduplication plays an important role in existing storage systems [1], and its importance will continue to grow as the amount of data increases (global data growth is estimated to reach 35 zettabytes by the year 2020). The flexibility and cost advantages of cloud computing providers such as Azure [2] and Amazon [3] make deploying storage services in the cloud an attractive option.

A key property of cloud computing is elasticity, the ability to quickly adjust the amount of computing resources. Elasticity can improve deduplication systems by allowing deduplication storage systems to dynamically adjust the amount of memory resources as needed to detect a sufficient amount of duplicate data. This is especially useful for inline deduplication systems [4], [5], where the index used for deduplication is often kept within memory to avoid the performance bottleneck of disk I/O operations.

Intuitively, there is a basic tradeoff between the amount of duplicate data detected and the amount of memory space required. Smaller memory resources lead to small indexes, which in turn lead to worse deduplication performance due to missed deduplication opportunities. Allocating too much memory, on the other hand, leads to wasted memory, since RAM allocated to the index cannot be used for other purposes. Elasticity provides the ability to scale memory resources as needed to improve deduplication performance without incurring wasted resources.

In this paper, we propose an elasticity-aware deduplication (EAD) algorithm that takes advantage of the elasticity of cloud computing. The key feature of our solution is that our deduplication algorithm is compatible with current deduplication techniques such as sampling to take advantage of locality [6], [7], and content-based chunking [8]-[10]. This means our solution can take advantage of state-of-the-art algorithms to improve performance. Furthermore, we also present a detailed analysis of our algorithm, as well as an evaluation using extensive experiments on a real world dataset.

The rest of the paper is organized as follows: Section 2 contains the related work. Section 3 explores limitations of existing approaches. EAD is presented in Section 4. Section 5 evaluates our solution, and Section 6 concludes.
II. RELATED WORK

Cloud based backup systems typically use inline data deduplication, where redundant data chunks are identified at run time to avoid transmitting the redundant data from the source to the cloud. This is opposed to offline deduplication, where the source transmits all the data to the cloud, which then runs the deduplication process to conserve storage space. For the remainder of the paper, deduplication will refer to inline deduplication.

Numerous research efforts have sought to improve the performance of finding duplicate data. Work by [11] focused on techniques to speed up the deduplication process. Researchers have also proposed different chunking algorithms to improve the accuracy of detecting duplicates [12]-[16]. Other research considers the problem of deduplication of multiple datatypes [17], [18]. This line of research is complementary to our work, and can be easily incorporated into our solution.

The ever increasing amount of data, coupled with the performance gap between in-memory searching and disk lookups, means that disk I/O has increasingly become the performance bottleneck. Recent deduplication research has focused on addressing the problem of limited memory. Work by [19] proposed integrated solutions which can avoid disk I/Os on close to 99% of the index lookups. However, [19] still keeps the index data on disk, instead of in memory.
Fig. 1: Intuitive test of the amount of duplicates detected on two equal-sized (4.7 GB) VMs using equal-sized indexes (duplicate data detected, in MB, vs. number of index entries, x 10^5).

Estimation algorithms like [20] can be used to improve performance by reducing the total number of chunks, but the fundamental problem remains as the amount of data increases. Other existing research in this area has proposed different sampling algorithms to index more data using less memory: [6] introduces a solution that keeps only part of the chunk information in the index; [7] proposes a more advanced method based on the work in [6] that deletes chunk fingerprints (FPs) from the index when it approaches fullness. The fundamental limitation of sampling based approaches is that it is impossible to maintain deduplication performance by reducing the sampling rate while the input data size increases but RAM resources remain fixed. Our solution builds upon earlier work on sampling by taking advantage of the elasticity property of cloud computing to carefully combine sampling with increasing memory resources.

III. THE CASE FOR USING ELASTICITY

This section explores some alternatives for improving deduplication performance, and their limitations.

A. Why not pick the best memory size?

One alternative is to try to estimate the appropriate amount of memory prior to deploying the deduplication system. A straightforward approach is to perform simple profiling on a sample of data and compute the expected memory requirements based on the results. To illustrate why it is difficult to choose the right amount of RAM in practice, we conducted a simple experiment that represents a storage system used to archive virtual machine (VM) images (this is a common workload used in deduplication evaluations [7], [21]). We want to maximize the deduplication ratio to conserve bandwidth and storage costs. For simplicity, we assume that all VMs run the same OS and are the same size.

A simple way to estimate memory requirements is to first estimate the index size for a single VM, and then use that to estimate the total RAM necessary for all users. Thus, given n users, and since each user stores the same size VM, if we estimate that m amount of RAM is needed to index one user, our backup system will need n x m amounts of RAM. We can derive m via experiments. Fig. 1 shows the results for two VMs. VM2 contains more text files, while VM1 contains more video files. The number of index entry slots indicates how much information about already stored data the system can provide for duplicate detection. We set a fixed number of index entries for duplicate detection and gradually increase it. We see that when the number of index entry slots increases to 270 thousand, both VMs exhibit the same amount of duplicate data. As we increase the index size further, VM1 shows limited improvement, while VM2 shows much better performance. Using VM1 to estimate m would have led to much lower bandwidth savings, especially if a significant number of VMs resemble VM2. Buying too much memory, on the other hand, is wasteful if most of the data resembles VM1.

B. Using Locality and Downsampling

Storage systems that make use of data deduplication generally operate at the chunk level, and in order to quickly determine potential duplicate chunks, an index of existing chunks needs to be maintained in memory. For example, 100 TB of data will need about 800 GB of RAM for the index under standard deduplication parameters [22]. This makes keeping the entire index in memory challenging. The principle of locality is used to design sampling algorithms that utilize a smaller index size while providing good performance [6].
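The 800 GB figure can be checked against the parameters used later in this paper (8 KB average chunk size, 64-byte index entries); the following back-of-the-envelope restatement is ours, not taken from [22]:

\[
\frac{100\ \mathrm{TB}}{8\ \mathrm{KB\ per\ chunk}} \approx 1.25 \times 10^{10}\ \text{chunks}, \qquad
1.25 \times 10^{10} \times 64\ \mathrm{B} = 8 \times 10^{11}\ \mathrm{B} \approx 800\ \mathrm{GB}.
\]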
The locality principle suggests that if chunk X was observed to be surrounded by chunks Y, Z, W in the past, then the next time chunk X appears, there is a high probability that chunks Y, Z, W will also appear. In sampling-based deduplication, the data is first divided into larger segments, each of which contains thousands of chunks. Deduplication is executed on these segments by identifying the existence of their sampled chunks' fingerprints in the index. If a chunk's fingerprint is found in the index, the corresponding segment which contains that chunk is located, and the fingerprint information of all the other chunks in that segment is pre-fetched from disk into the chunk cache in memory.

The downsampling algorithm [7] works as an optimized sampling approach that takes advantage of the locality principle. The difference is that the sampling rate is initialized to 1, which means it picks all the chunks in a segment as its sampled chunks. As the amount of incoming data increases, this value gradually decreases by dropping half of the index entries. The indexing capacity thus doubles, by accepting only a portion of the chunks' fingerprints as samples to represent each segment. In other words, instead of indexing chunks X, Y, Z, and W in RAM, the downsampling algorithm will only index chunk X (or another one among the four) in RAM after two rounds of adjustment, and keep the rest on disk.

The above sampling-based approaches have two main drawbacks. The first (obvious) drawback is that not all data exhibits locality [17], and thus sampling algorithms do not work well with such datasets. The second drawback is that even for data that exhibits locality, it is difficult to select the correct sampling rate or decide how to adjust it, due to the large variance in possible deduplication ratios [23], [24].
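To make the sampling, prefetch-on-hit, and halving steps concrete, here is a minimal Java sketch (Java is the language of our prototype in Section V). The class and method names, and the use of hashCode as a stand-in for fingerprint bits, are ours for illustration only; this is not the implementation of [6] or [7]:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of a sampled, locality-aware fingerprint index. */
class SampledIndex {
    // Maps a sampled fingerprint to the id of the segment it came from.
    private final Map<String, Long> index = new HashMap<>();
    private int samplingDivisor = 1; // rate = 1/divisor; starts at 1 (sample everything)

    /** Index a fingerprint only if it survives the current sampling rate. */
    void maybeInsert(String fp, long segmentId) {
        if ((fp.hashCode() % samplingDivisor + samplingDivisor) % samplingDivisor == 0) {
            index.put(fp, segmentId);
        }
    }

    /** On a hit, locality says the rest of the segment is likely to recur,
     *  so pre-fetch all of its fingerprints from disk into the chunk cache. */
    List<String> lookupAndPrefetch(String fp, SegmentStore disk) {
        Long segmentId = index.get(fp);
        return (segmentId == null) ? List.of() : disk.fingerprintsOf(segmentId);
    }

    /** Downsample: halve the sampling rate and drop entries that no longer qualify. */
    void downsample() {
        samplingDivisor *= 2;
        index.entrySet().removeIf(e ->
            (e.getKey().hashCode() % samplingDivisor + samplingDivisor) % samplingDivisor != 0);
    }
}

interface SegmentStore {
    List<String> fingerprintsOf(long segmentId); // on-disk per-segment fingerprint lists
}
```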
IV. EAD: ELASTICITY-AWARE DEDUPLICATION

Storage deduplication services in the cloud often run in virtual machines (VMs). Unlike a conventional OS which runs directly on physical hardware, the OS in a VM runs on top of a hypervisor or virtual machine monitor, which in turn communicates with the underlying physical hardware. The hypervisor is responsible for dynamically increasing the RAM resources of the VM. This can be done in two generic ways. The first is to use a ballooning algorithm to reclaim memory from other VMs running on the same physical machine (PM) [25]. This is a relatively lightweight process that relies on the OS's memory management algorithm, but can only reclaim relatively small amounts of memory. Deduplication systems that require increasingly larger amounts of memory need to run a VM migration algorithm [26], [27]. In VM migration, the hypervisor migrates the RAM contents from one PM to another with sufficient memory resources [26]. Regardless of the migration algorithm used, some downtime will inevitably occur when switching over to the new VM [27].

A naive approach to incorporating elasticity is to increase the memory size once the index is close to being full. This naive approach does not perform well, since frequent migrations induce a high overhead. Furthermore, the naive approach always retains the entire old index during each migration, even those index entries that do not fingerprint many chunks. Such poorly performing index entries take up valuable index space without providing much benefit.

Our approach combines the benefits of downsampling [7] and VM migration to allow users to maintain a satisfactory level of performance by adjusting the sampling rate and memory size accordingly. Our system design consists of two components: an EAD client that is responsible for file chunking, fingerprint computation and sampling, and an EAD server which controls the index management and other memory management operations. The EAD client runs on the client side, for instance at the gateway server of a large company. The EAD server can be executed by the cloud provider. The entire system design is shown in Fig. 2. Only unique data is supposed to be stored in Physical Storage. The File Manager is responsible for data retrieval and maintenance; how it works is beyond the scope of this paper.

Fig. 2: EAD infrastructure.

A. EAD Algorithm

Different types of users have different deduplication requirements. Some users will be willing to tolerate worse deduplication performance in exchange for lower costs, while others will not. To accommodate different requirements, EAD is designed to allow a user to specify a migration trigger, Γ ∈ (0, 1), which specifies the level of deduplication performance the user is willing to accept.

Deduplication performance is usually measured by the reduction ratio [18], [28], which is the size of the original dataset divided by the size of the dataset after deduplication. To help the user select the migration trigger, we define the Deduplication Ratio (DR) as

\[
\mathrm{DR} = 1 - \frac{\text{Size after deduplication}}{\text{Size of original data}}.
\]

Intuitively, we would like to first apply downsampling until the deduplication performance becomes unsatisfactory, and then migrate the index to larger memory in order to obtain better performance. EAD will migrate to larger RAM only when migration will result in deduplication performance better than Γ. This has an important but subtle implication: EAD will not always migrate when deduplication performance falls under Γ, but only when migration will improve performance.
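As a purely illustrative numeric instance of this rule (the numbers are ours, not from the evaluation): with Γ = 0.95, a measured DR of 0.60, and an estimated achievable ratio EDR (defined in Section IV-B) of 0.80,

\[
\mathrm{DR} = 0.60 < \Gamma \cdot \mathrm{EDR} = 0.95 \times 0.80 = 0.76 \;\Rightarrow\; \text{migrate},
\]
\[
\text{whereas if } \mathrm{EDR} = 0.62:\quad 0.60 \ge 0.95 \times 0.62 \approx 0.59 \;\Rightarrow\; \text{do not migrate}.
\]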
This distinction matters because, given a dataset that inherently exhibits poor deduplication characteristics [29], adding more RAM will incur the migration overhead without improving deduplication performance. This means that EAD cannot simply compare the measured DR against Γ, because the measured DR may not necessarily reflect the amount of duplication that actually exists. To illustrate, let us assume that the deduplication system measures its DR and it is less than Γ. There are two possibilities. The first is that the system has performed overly aggressive downsampling, and can benefit from increasing RAM. The second possibility is that the dataset itself has poor deduplication performance, e.g. data in multimedia or encrypted files. In this case, increasing RAM does not result in better performance.

How our EAD algorithm determines when to migrate to more RAM resources can be found in Alg. 1. It executes in two phases, as generic inline deduplication systems do. We use S_n and x to denote the incoming segment and the chunks inside it, and FP_x denotes the fingerprint of chunk x. In Phase I, for each segment S_n, the EAD client sends to the EAD server the fingerprints (FP_x, ∀x ∈ S_n) of the sampled chunks, labeled FP_x^est for estimation and FP_x^dedup for duplicate detection. The server searches the index table T and the chunk cache for duplicate identification, and updates the estimation base B. Based on the results generated, in which each chunk x is marked as dup or unq, indicating whether it is a duplicate or a unique chunk, the EAD client transmits only the unique data chunks, along with the metadata of the duplicate ones, to the EAD server in Phase II, saving bandwidth and storage space.
Meanwhile, the current sampling rate R_0 is subject to change to a new rate R based on deduplication performance. Details on the features of the EAD algorithm are presented next.

B. Estimating Possible Deduplication Performance

One of the key features of EAD is that the algorithm is able to determine whether migration will be beneficial. In order to distinguish whether poor deduplication performance is due to overly aggressive downsampling or is inherent in the dataset, we first need to be able to estimate the potential DR of the dataset. Obtaining the actual DR is impractical, since it requires performing the entire deduplication process. Prior work [30] provided an algorithm to estimate the deduplication performance of static, fixed-size datasets. That algorithm requires the actual data to be available in order to perform random sampling and comparisons. In our problem, however, the dataset must be viewed as a stream of data: there is no prior knowledge of the size or characteristics of the data to be stored, and we cannot perform back and forth scanning of the complete dataset for estimation.

In our EAD algorithm, we let the EAD server maintain an estimation base B. The EAD client randomly selects κ fingerprints from each segment and sends them to the EAD server to be stored in B. After n_s segments have arrived, there will be κ·n_s samples, a number which grows with the amount of incoming data. Each entry slot in B includes a fingerprint as well as two counters, x_c1 and x_c2, where counter x_c1 records the number of occurrences of fingerprint FP_x in B, and x_c2 records the number of occurrences of fingerprint FP_x among the fingerprints of all the chunks uploaded.

We integrate our estimation process into the regular deduplication operations so as to avoid the separate sampling and scanning phases of [30]. While the client sends the samples for duplicate searching to the storage server, the samples for estimation are transmitted at the same time to update B. During the fingerprint comparison of incoming chunks against those in the chunk cache, we update B again, incrementing the counter x_c2 by one every time its corresponding fingerprint appears. Thus, there is no extra overhead for our estimation purpose. Using B, we can compute the estimated deduplication ratio, EDR, as

\[
\mathrm{EDR} = 1 - \frac{1}{\kappa\, n_s} \sum_{x \in B} \frac{x_{c1}}{x_{c2}}.
\]

The computation of EDR happens when the index size approaches the memory limit. Only in the case that DR is smaller than Γ·EDR is there a potential performance improvement from migration, and EAD will migrate the index to larger RAM. Otherwise, EAD will apply downsampling to the index in exchange for larger indexing capacity.
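A compact Java sketch of B and the EDR computation as defined above; the class, field, and method names are ours, and we fold the first occurrence of a sample directly into c1 (Alg. 1 initializes the counters slightly differently):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the estimation base B. c1 counts a fingerprint's occurrences
 *  among the kappa*n_s estimation samples; c2 counts its occurrences among
 *  all uploaded chunk fingerprints, following the text above. */
class EstimationBase {
    private static class Entry { long c1; long c2; }
    private final Map<String, Entry> base = new HashMap<>();
    private long samplesSeen = 0; // kappa * n_s so far

    /** Called for each of the kappa fingerprints sampled per segment. */
    void addEstimationSample(String fp) {
        base.computeIfAbsent(fp, k -> new Entry()).c1++;
        samplesSeen++;
    }

    /** Called once per uploaded chunk fingerprint (duplicate or unique). */
    void observeUploadedChunk(String fp) {
        Entry e = base.get(fp);
        if (e != null) e.c2++;
    }

    /** EDR = 1 - (1/(kappa*n_s)) * sum over B of c1/c2. */
    double estimatedDeduplicationRatio() {
        double uniqueFraction = 0;
        for (Entry e : base.values()) {
            if (e.c2 > 0) uniqueFraction += (double) e.c1 / e.c2;
        }
        return 1.0 - uniqueFraction / Math.max(1L, samplesSeen);
    }

    /** Migrate only when more RAM can plausibly help: DR < Gamma * EDR. */
    boolean shouldMigrate(double measuredDR, double gamma) {
        return measuredDR < gamma * estimatedDeduplicationRatio();
    }
}
```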
C. EAD Refinements

The performance of the EAD algorithm can be further improved by observing additional information obtained during the run time and then adjusting the parameters of the algorithm.

Algorithm 1 Elastic deduplication strategy
1: The incoming segment S_n:
   Deduplication Phase I: Identify duplicate chunks
2: ∀x ∈ S_n: EAD client sends FP_x to EAD server
3: for all FP_x^dedup do
4:   if FP_x^dedup ∈ T then
5:     Locate its corresponding segment S^dup; ∀x_j ∈ S^dup: fetch the information of x_j (FP_{x_j}) and set x_j → chunk cache
6:   else
7:     Add FP_x^dedup to T
8: for all FP_x^est do
9:   if FP_x^est ∈ B then
10:    x_c1 = x_c1 + 1
11:  else
12:    Add FP_x^est to B; set x_c1 = x_c2 = 0
13: for all x ∈ S_n do
14:   ∀x_k ∈ chunk cache: compare FP_x with FP_{x_k}
15:   if FP_x = FP_{x_k} then
16:     Mark x as dup
17:   else
18:     Mark x as unq
19:   ∀x_l ∈ B: compare FP_x with FP_{x_l}
20:   if FP_x = FP_{x_l} then
21:     x_c2 = x_c2 + 1
   Deduplication Phase II: Data transmission
22: for all x ∈ S_n do
23:   Transmit each x marked unq, along with only the metadata of each x marked dup
24: EAD finishes processing S_n
25: if index is approaching the RAM limit then
26:   if DR < Γ · EDR then
27:     if R_0 = 1 then
28:       EAD sets Γ = DR/EDR
29:     else
30:       EAD triggers migration, setting rate R = 2R_0
31:   else
32:     EAD sets R = R_0/2 (downsample)

Adjusting Γ. The parameter Γ is specified by the user, and indicates the user's desired level of deduplication performance. However, the user may sometimes be unaware of the underlying potential deduplication performance of the data, and set an excessively high Γ value, resulting in unnecessary migrations over time. We adjust the user's Γ value to DR/EDR after each migration, and also in the case where DR has not reached the accepted performance even though the sampling rate is one, so that Γ represents the current system's maximum deduplication ability. In this way, EAD is able to elastically adapt to variations in the incoming data.

Amount of RAM and Sampling Rate post migration. A simple way to compute the amount of RAM to allocate after migration is to use a fixed factor β, e.g. doubling the RAM each time (β = 2), then resetting the sampling rate back to 1 and starting all over again. We can improve on this process by observing the next-to-last sampling rate used prior to migration. This rate is the last known sampling rate that produced acceptable deduplication performance; this is valid because, had it not produced acceptable performance, EAD would already have triggered migration. Once we have this new sampling rate, we can compute the amount of RAM by introducing a new counter d (initialized to zero) that records the number of downsampling occurrences. We then compute the new RAM, RAM_new, as

\[
\mathrm{RAM}_{\mathrm{new}} =
\begin{cases}
\beta \cdot \mathrm{RAM}_{\mathrm{org}}, & d = 1 \\
\beta \left[ 1 - \sum_{i=1}^{d-1} \beta^{-i} \right] \cdot \mathrm{RAM}_{\mathrm{org}}, & d \ge 2
\end{cases}
\]

As the number of downsample operations increases, EAD requires less RAM for the index table after migration. Compared with always requiring β times the original RAM, this optimized approach achieves higher memory utilization efficiency.
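Under our reading of the formula above, the sizing rule reduces to a few lines of Java; the helper name and the numbers in main are illustrative only:

```java
/** Post-migration RAM sizing per the formula above: a geometric discount
 *  based on how many downsample operations (d) have occurred. */
final class RamPlanner {
    /** @param ramOrigMb RAM currently allocated to the index (MB)
     *  @param beta growth factor, e.g. 2 to double RAM per migration
     *  @param d number of downsample operations observed so far */
    static double ramAfterMigration(double ramOrigMb, double beta, int d) {
        if (d <= 1) return beta * ramOrigMb;          // no discount yet
        double discount = 0;
        for (int i = 1; i <= d - 1; i++) discount += Math.pow(beta, -i);
        return beta * (1 - discount) * ramOrigMb;     // shrinks as d grows
    }

    public static void main(String[] args) {
        // With beta = 2 and 6.4 MB of index: d=1 -> 12.8 MB, d=2 -> 6.4 MB, d=3 -> 3.2 MB
        for (int d = 1; d <= 3; d++)
            System.out.printf("d=%d: %.1f MB%n", d, ramAfterMigration(6.4, 2, d));
    }
}
```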
Managing the Size of B. One concern with our estimation scheme is that the size of B may become too large. If we need a large amount of RAM to store B, we will be wasting RAM resources that could be used for the index. In practice, the size of B is relatively modest. Each entry in B consists of a fingerprint and two counters. Using SHA-1 to compute the fingerprint results in a 20 byte fingerprint, and an additional four bytes are used for each counter. Thus, each B entry is 28 bytes, meaning that the total size of B would be at most approximately 33.38 MB to support 1 TB of data (with κ = 20 samples per 16 MB segment, 1 TB yields about 1.19 x 10^6 entries of 28 bytes each). In our experiments, it requires only 4.32 MB for estimating a 163.2 GB dataset.

V. IMPLEMENTATION

A. Experimental setup

For our experiments, we collected a dataset consisting of VMs that all run the Ubuntu OS, but each VM has different types of software and utilities installed and contains different types of application data, the majority of which comes from the Wikimedia Archives [31] and OpenfMRI [32]. The total size of our dataset is approximately 163.2 GB. While the dataset size is relatively modest compared to some prior work [5], [10], [18], we believe that it still adequately reflects real-world usage of backup systems, such as backing up employee laptops. To ensure a fair comparison, we have scaled down our index size to correspond to our dataset size, in order to better represent a large scale environment.

We have implemented our EAD algorithm in Java. For all experiments, we use variable block deduplication parameters with minimum and maximum chunk sizes of 4 KB and 16 KB respectively, and a corresponding average chunk size of 8 KB. We set the segment size to 16 MB. These are common parameters used in previous research [33], [19]. The experiments are carried out on a 4-core Intel i3-2120T at 2.60 GHz with 8 GB RAM in total, running on Linux.

Strategy           # of index entries   Size of index (MB)
Full index         10 x 10^6            640
With down-sample   5 x 10^5             32
EAD                1 x 10^5             6.4

TABLE I: RAM deployment for the index under different deduplication strategies.

We set the down-sampling trigger to 0.85, which means that when the storage approaches 85% of its current limit, the index will be down-sampled (half of its entries will be removed, e.g. deleting index FPs with FP mod 2 = 0).

We evaluate our solution, denoted as Elastic in the figures, against two alternative approaches. The first alternative, denoted as FullIndex, represents an ideal situation where there is unlimited RAM available. This serves as an upper bound on the total amount of space savings. The other alternative, denoted as DownSample, is based on [7], a recent approach that dynamically adjusts the sampling rate to deal with insufficient RAM.
B. Deduplication ratio

We now compare our algorithm with a generic deduplication mechanism without sampling and a state-of-the-art high performance deduplication strategy with a down-sampling mechanism [7]. Before deploying the deduplication process, we allocate a specific amount of RAM for the index in each strategy. Table I shows the amount of RAM allocated for the different deduplication strategies. We set the size of each entry slot in the index to 64 bytes, which consists of three parts: the FP, the chunk metadata (storage address, chunk length, etc.) and a counter, which are 20 bytes (SHA-1 hash signature [34]), 40 bytes and 4 bytes, respectively. These sizes may vary under different hash functions or addressing policies, but will not differ by much. We assume that the capacity is 75 GB. Nearly 10 million index entries are needed to index all the unique data if we do not use any sampling strategy, while under the down-sample strategy with the minimum sampling rate of 0.05, we need 500K index entries for 75 GB of unique data. EAD always picks a much more conservative index size, specifically only 100K entry slots in this case.

We use the Normalized Deduplication Ratio as the metric for deduplication ratio comparison. It is defined as the ratio of the measured Deduplication Ratio to the Deduplication Ratio of FullIndex deduplication. Note that FullIndex detects all the duplicate data chunks and thus claims the highest deduplication ratio. This metric is meaningful because it indicates how close the measured deduplication ratio is to the ideal deduplication ratio achievable in the system.

Fig. 3(a) shows the Normalized Deduplication Ratio of the above deduplication strategies. Downsampling and the computation of EDR happen when the usage of the index approaches 85% of its capacity. The down-sample strategy has a ratio higher than 99.5%, showing the benefits of taking advantage of locality. EAD does not claim an equally high ratio, but the gap is less than 2%. Consider also that the performance requirement for EAD is defined by Γ, which is 0.95 in this case; the performance of Elastic is always higher than 98%, better than what is required.
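Spelled out, with both quantities measured over the same backup stream (our notation):

\[
\text{Normalized DR} = \frac{\mathrm{DR}_{\text{measured}}}{\mathrm{DR}_{\text{FullIndex}}} \in (0, 1].
\]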
However, comparing the deduplication ratio alone is not a fair way to evaluate performance, since these three strategies spend different amounts of RAM on the index from the start. Fig. 3(b) shows how the sampling rate and the number of index slots used vary in the above cases. Obviously, operating without sampling incurs too high a memory cost. We note that both DownSample and Elastic have comparatively very low memory cost (a small number of index entry slots). We can also observe that when about 5% of the data has been processed, the sampling rate in Elastic increases, reflecting its elasticity feature. The above results show that EAD is able to use less RAM to achieve a satisfying deduplication ratio, only slightly lower than the other two. Next we derive a more meaningful metric, Deduplication Efficiency, as a single utility measure that encompasses both deduplication ratio and RAM cost, to make a fairer comparison among these three strategies.

Fig. 3: (a) Deduplication ratio comparison and (b) index usage (sampling rate and number of index slots) comparison for FullIndex, DownSample and Elastic. The sampling rate is 1 for all of them at the start of backup. The index migration in EAD is triggered when the normalized deduplication ratio drops below 95% (Γ = 0.95); after that, the sampling rate doubles (β = 2).

C. Deduplication efficiency

As discussed in Section V-B, neither the deduplication ratio nor the memory cost alone can fully represent system performance. Therefore we define

\[
\text{Deduplication Efficiency} = \frac{\text{Duplicate Data Detected}}{\text{Index Entry Slots}}
\]

as a more comprehensive performance evaluation criterion.

Fig. 4: Deduplication efficiency performance (MB per index slot).

Using this criterion, we make a fairer comparison between EAD and the other two solutions, as shown in Fig. 4. Elastic outperforms both DownSample and FullIndex on efficiency. Notice that Elastic always yields a higher efficiency, almost 4 times that of DownSample and 30 times that of FullIndex. This is because its elastic feature enables it to use as little memory space as possible while detecting as much duplicate data as required, avoiding the memory waste that the other two incur.

D. Elasticity Optimization

In this section, we explore variations of EAD from different aspects.

Monitoring accuracy. EAD can work properly only when it is able to accurately monitor the real time deduplication efficiency. As the criterion for judging deduplication performance, the estimated deduplication ratio should be as accurate as possible. Otherwise, elasticity might bring unexpected effects on performance if an inappropriate index migration decision is made. Fig. 5 shows the accuracy of the monitored deduplication ratios during the backup process. 500 independent tests were conducted on the dataset. We consider the ratio of the estimated deduplication ratio in EAD to that in FullIndex as the error deviation, which indicates the real time accuracy of monitoring. From the figure we can see that initially the error deviation is at most 10%, but as more data comes in, the deviation reduces to 2%, which offers a reliable criterion for evaluating system performance. According to [30], the reduction ratio of a dataset of up to 7 TB can be estimated with an error of less than 1%. Note, however, that we dynamically estimate the ratio, which represents a very different situation (as elaborated in Section IV-B). It can be seen from Fig. 5 that the deviation is higher than expected when only part of the dataset has been estimated. For this reason, we reserve more RAM for estimation, even though it costs only approximately 4.32 MB of RAM for 162,078 samples.
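For reference, the end-of-segment decision that this monitored estimate feeds (Alg. 1, lines 25-32) can be sketched as follows; the class and method names are ours, and the doubling/halving of the rate follows our reading of the algorithm:

```java
/** Sketch of the end-of-segment decision from Alg. 1 (names are ours). */
class ElasticityController {
    private double gamma;            // user-specified migration trigger
    private double samplingRate = 1.0;
    private int downsampleCount = 0; // the counter d from Section IV-C

    ElasticityController(double gamma) { this.gamma = gamma; }

    /** Called when the index approaches its RAM limit. */
    void onIndexNearlyFull(double measuredDR, double estimatedEDR) {
        if (measuredDR < gamma * estimatedEDR) {
            if (samplingRate == 1.0) {
                gamma = measuredDR / estimatedEDR; // data dedups poorly: lower the bar
            } else {
                samplingRate *= 2;                 // revert to next-to-last rate
                migrateIndexToLargerRam();         // placeholder for VM migration
            }
        } else {
            samplingRate /= 2;                     // downsample instead of migrating
            downsampleCount++;
        }
    }

    private void migrateIndexToLargerRam() { /* ballooning or migration [25]-[27] */ }
}
```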
Impact of the initial index size. There is no standardized criterion for selecting the amount of memory for the index table in deduplication systems; in Section V-B we simply gave examples with a memory size one fifth of that of existing solutions. The elasticity feature of EAD is meant to tolerate low memory, allowing us to give the index a very small RAM space at the beginning, since the index table can be migrated if there is not enough space. We therefore explore and verify EAD's performance under different initial memory sizes for the index.
Fig. 5: Estimation accuracy verification (error deviation vs. percentage of data processed). 20 samples per segment are randomly picked from the incoming data.

Fig. 6: EAD performance (deduplication efficiency, log scale) under different initial index sizes (Elastic100k, Elastic150k, Elastic200k).

Table II shows the normalized deduplication ratio as data comes in when the initial index entry slots are 100K, 150K and 200K, respectively. It is no surprise that a smaller index table detects less duplicate data, but the gap is at most approximately 4%. Another more interesting observation is that, with our algorithm, the initially smaller index table is sometimes able to claim an even higher deduplication ratio. This is because it has a higher probability of being migrated, so there are more index adjustments, which bring performance improvements. Fig. 6 shows the Deduplication Efficiency of EAD with different initial memory sizes. The most conservative RAM initialization claims the highest deduplication efficiency, which again shows that EAD provides a good balance between memory and storage savings.

Segments processed        2000    3000    4000    5000    6000    7000    8000    9000    9600
1 x 10^5 slots (6.4 MB)   99.73%  99.08%  96.23%  94.76%  99.66%  99.11%  97.13%  94.41%  93.93%
1.5 x 10^5 slots (9.6 MB) 99.79%  98.97%  99.14%  98.74%  98.25%  97.63%  97.31%  96.70%  96.61%
2 x 10^5 slots (12.8 MB)  99.79%  99.62%  99.72%  99.17%  98.60%  98.88%  98.69%  97.89%  97.72%

TABLE II: Normalized deduplication ratios when different amounts of memory are deployed for the index. The ratio is measured after every 200 incoming data segments have been processed; Γ = 0.9.

Impact of Γ. We then go a step further and verify the effectiveness of EAD under different policies. Fig. 7 shows the performance when we apply different values of Γ, with 100K initial index entry slots. As analyzed above, Γ represents the system's tolerance to missed duplicate data. The higher the trigger, the more sensitive the system is and the more easily migration is triggered, and vice versa. A higher Γ guarantees a higher deduplication ratio, as shown in Fig. 7(a), although not by much in this case. However, we notice that the Γ = 0.95 case also yields the highest overall efficiency, which implies that EAD is able to achieve a double win on both deduplication ratio and efficiency.

Fig. 7: Performance under different migration triggering values: (a) deduplication ratio and (b) deduplication efficiency (log scale) when the measured deduplication ratio falls below 80%, 85%, 90% and 95% of the estimated one, with β = 2.

Memory usage comparison. Our goal is to introduce an elasticity-aware deduplication solution comparable to existing approaches. Based on the above results, we can estimate the overall memory overhead of EAD, and we compare it here to the state of the art. Aside from the memory space needed for the index, EAD requires extra space for estimation. As shown in Table III, the extra memory overhead of EAD comes mainly from the estimation part, compared with the other two. Even so, the total RAM cost of EAD is less than 50% and 5% of that of DownSample and FullIndex, respectively. Also note that there is a 0.1 MB index increment at the end of deduplication because of EAD's conservative migration mechanism.

Strategy     Initial Index (MB)      Final Index (MB)          Est. (MB)  Total (MB)
EAD          6.40 (1 x 10^5 slots)   6.50 (106,581 slots)      4.32       10.82
Down-sample  32 (5 x 10^5 slots)     25.91 (404,818 slots)     0          25.91
Full Dedup   640 (10 x 10^5 slots)   220.23 (3,441,107 slots)  0          220.23

TABLE III: The RAM cost, broken into index and estimation (Est.) parts, under different deduplication strategies; Γ = 0.95 and β = 2 in this case.

VI. CONCLUSION AND FUTURE WORK
As a significant technique for eliminating duplicate data, deduplication greatly reduces storage usage and bandwidth consumption in enterprise backup systems, and sampling further mitigates both the chunk-lookup disk bottleneck and the limited memory problem. However, setting the sampling rate based only on the memory size cannot guarantee the performance of the whole system. We therefore proposed the elasticity-aware deduplication solution, in which deduplication performance and memory size are both considered. We showed in detail EAD's efficient adjustment of the sampling rate through case analysis, which demonstrates that EAD claims much better performance than existing algorithms, offering a complete guideline for its large scale deployment.

Directions for future research mainly concern the large scale implementation of our proposed solution. We aim to build such a deduplication infrastructure, verifying its elasticity property and explicitly demonstrating the space savings in both storage and memory. Another long-term goal is to explore its application in distributed environments at an even larger scale.

REFERENCES

[1] D. Geer, "Reducing the storage burden via data deduplication," Computer, 2008.
[2] B. Calder, J. Wang et al., "Windows Azure Storage: a highly available cloud storage service with strong consistency," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011.
[3] Amazon S3, "Cloud Computing Storage for Files, Images, Videos," Accessed in 03/2013, http://aws.amazon.com/s3/.
[4] T. T. Thwel and N. L. Thein, "An efficient indexing mechanism for data deduplication," in Current Trends in Information Technology (CTIT), 2009.
[5] K. Srinivasan, T. Bisson et al., "iDedup: Latency-aware, inline data deduplication for primary storage," in Proceedings of the 10th USENIX Conference on File and Storage Technologies, 2012.
[6] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse indexing: large scale, inline deduplication using sampling and locality," in Proceedings of the 7th Conference on File and Storage Technologies, 2009.
[7] F. Guo and P. Efstathopoulos, "Building a high performance deduplication system," in Proceedings of the 2011 USENIX Annual Technical Conference, 2011.
[8] A. Adya, W. J. Bolosky et al., "FARSITE: Federated, available, and reliable storage for an incompletely trusted environment," ACM SIGOPS Operating Systems Review, 2002.
[9] G. Forman, K. Eshghi, and S. Chiocchetti, "Finding similar files in large document repositories," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005.
[10] U. Manber et al., "Finding similar files in a large file system," in Proceedings of the USENIX Winter 1994 Technical Conference, 1994.
[11] A. Sabaa, P. Kumar et al., "Inline Wire Speed Deduplication System," US Patent App. 12/797,032, 2010.
[12] L. L. You and C. Karamanolis, "Evaluation of efficient archival storage techniques," in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004.
[13] E. Kruus, C. Ungureanu, and C. Dubnicki, "Bimodal content defined chunking for backup streams," in Proceedings of the 8th USENIX Conference on File and Storage Technologies, 2010.
[14] J. Min, D. Yoon, and Y. Won, "Efficient deduplication techniques for modern backup operation," IEEE Transactions on Computers, 2011.
[15] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in ACM SIGOPS Operating Systems Review, 2001.
[16] K. Eshghi and H. K. Tang, "A framework for analyzing and improving content-based chunking algorithms," Hewlett-Packard Labs Technical Report TR, 2005.
[17] W. Xia, H. Jiang et al., "SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput," in Proceedings of the USENIX Annual Technical Conference, 2011.
[18] D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009.
[19] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, 2008.
[20] G. Lu, Y. Jin, and D. H. Du, "Frequency based chunking for data de-duplication," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010.
[21] C. Kim, Park et al., "Rethinking deduplication in cloud: From data profiling to blueprint," in Networked Computing and Advanced Information Management (NCM), 2011.
[22] G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu, "Characteristics of backup workloads in production systems," in Proceedings of the Tenth USENIX Conference on File and Storage Technologies (FAST '12), 2012.
[23] P. Kulkarni, F. Douglis, J. LaVoie, and J. M. Tracey, "Redundancy elimination within large collections of files," in Proceedings of the USENIX Annual Technical Conference, 2004.
[24] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," ACM Transactions on Storage (TOS), 2012.
[25] C. A. Waldspurger, "Memory resource management in VMware ESX server," ACM SIGOPS Operating Systems Review, 2002.
[26] F. Travostino, P. Daspit et al., "Seamless live migration of virtual machines over the MAN/WAN," Future Generation Computer Systems, 2006.
[27] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation, Volume 2, 2005.
[28] M. Dutch, "Understanding data deduplication ratios," in SNIA Data Management Forum, 2008.
[29] M. Hibler, L. Stoller et al., "Fast, Scalable Disk Imaging with Frisbee," in USENIX Annual Technical Conference, General Track, 2003.
[30] D. Harnik, O. Margalit, D. Naor, D. Sotnikov, and G. Vernik, "Estimation of deduplication ratios in large data sets," in Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, 2012.
[31] Wikimedia Downloads Historical Archives, Accessed in 04/2013, http://dumps.wikimedia.org/archive/.
[32] OpenfMRI Datasets, Accessed in 05/2013, https://openfmri.org/data-sets.
[33] B. Debnath, S. Sengupta, and J. Li, "ChunkStash: speeding up inline storage deduplication using flash memory," in Proceedings of the 2010 USENIX Annual Technical Conference, 2010.
[34] J. H. Burrows, "Secure hash standard," DTIC Document, Tech. Rep., 1995.