Using SSD-Assisted Scalable Elasticity to Improve Inline Data Deduplication Storage Systems

Yufeng Wang, Zhengyu Yang, Ningfang Mi, Chiu C Tan

Yufeng Wang and Zhengyu Yang share equal credit for this paper. Yufeng Wang and Chiu C Tan are with the Department of Computer and Information Sciences at Temple University. Zhengyu Yang and Ningfang Mi are with the Department of Electrical and Computer Engineering at Northeastern University.

Abstract

Elasticity is the ability to scale computing resources such as memory on demand, and is one of the main advantages of utilizing cloud computing services. With the increasing popularity of cloud-based storage, it is natural that more deduplication-based storage systems will be migrated to the cloud. Existing deduplication systems, however, do not adequately take advantage of elasticity. In this paper, we first present an SSD (Solid State Drive)-based multi-tier storage architecture to improve the caching capacity of current deduplication systems. With such an enhanced system, we attempt to optimize deduplication approaches to achieve higher memory utilization efficiency. Then, we illustrate how to use elasticity to improve deduplication-based systems, and propose EAD (elasticity-aware deduplication), an indexing algorithm that uses the ability to dynamically increase memory and SSD resources to improve overall deduplication performance. Our experimental results indicate that EAD is able to detect more than 98% of all duplicate data while consuming less than 5% of the expected memory space. Meanwhile, it achieves four times the deduplication efficiency of the state-of-the-art sampling technique while costing less than half the amount of memory. We further propose an online scaling up algorithm that takes advantage of the elasticity of cloud computing to dynamically trigger the scaling up operation. Our algorithm also offers a complete guideline for its large-scale deployment. The experimental results show that our design saves at least 74% of the overall I/O access cost compared to the traditional design.

Index Terms: Deduplication, Flash-based SSD, Scaling Up, Migration, Cloud Computing, Cloud Storage Systems, Fusion Disk

1 INTRODUCTION

Data deduplication is a technique used to reduce storage and transmission overhead by identifying and eliminating redundant data segments. It splits files into multiple data chunks, each uniquely identified by a fingerprint (FP), which is usually a hash signature of the data chunk. Redundant data chunks in a file are replaced by pointers. Data deduplication has been an essential and critical component in cloud backup, synchronization and archiving storage systems. It not only reduces storage space requirements, but also improves the throughput of backup and archiving systems by eliminating the network transmission of redundant data, and reduces energy consumption by requiring fewer disks. Therefore, data deduplication plays an important role in existing storage systems [1], and its importance will continue to grow as the amount of data increases (the growth of data is estimated to reach 35 zettabytes in the year 2020) in cloud backup, synchronization and archiving storage systems.

There are two types of deduplication systems: inline and offline. Clients in the former first transmit metadata to the server to detect duplicates, and only the new data is then sent to the server. Clients in the latter transfer all data to the server, and the server then conducts the deduplication process. In this paper we use the inline deduplication system [2], [3]. However, in the inline deduplication system, with the ever-increasing amount of new data, searching for a chunk's existence in the huge dataset stored on a slow disk (usually a magnetic disk, MD) is a time-consuming process, which is the main bottleneck of deduplication implementations. To improve overall speed, memory (RAM) is used in practice to cache hot chunks' metadata (e.g. the index table).
Unfortunately, deduplication systems currently need to scale to tens of terabytes to petabytes of data volume, and the ever-increasing amount of new data will eventually flush the useful data cached in memory. The resulting penalty outweighs the benefit of the performance gap between in-memory searching and disk lookups. Motivated by this disk I/O speed bottleneck, we address it in three different directions.

Our first direction is to improve the performance of the cache system. Since we cannot easily increase RAM's size or MD's access speed at a low cost, a feasible way is to introduce an extra storage tier between RAM and MD. This storage tier should be faster than MD and cheaper than RAM. We find that NAND-flash based solid-state disks (SSDs) meet these two conditions and are a good candidate for the middle tier. We thus adopt a three-tier caching system, consisting of RAM, SSD and MD, to increase the speed of accessing the huge dataset on the MD and to enlarge the cache size for storing hot data.

Our second direction is to make the best use of the cache through downsampling algorithms. Since cache systems cannot be infinitely large, we need to filter non-critical data out of the cache and decide what to load from the MD to SSD and RAM. Existing research [4] [5] [6] in this area proposed different sampling algorithms to index more data using less memory. The key feature of our solution is that our deduplication algorithm is compatible with current deduplication techniques such as sampling to take advantage of locality [5], [6], and content-based chunking [7] [9].

Our last direction is to dynamically scale up the storage system during runtime. In real cases, due to a customer's high desired deduplication degree or special incoming data types that naturally have a low deduplication ratio (like videos and encrypted files), the insufficient RAM (including SSD) caching system will eventually be flushed by the incoming data stream. This issue cannot be thoroughly solved by the multi-tier caching system and downsampling algorithms. Fortunately, the key property of cloud computing, elasticity, can be used to improve deduplication systems by allowing them to dynamically adjust the amount of memory resources as needed to detect a sufficient amount of duplicate data. In industrial environments, the flexibility and cost advantages of cloud computing providers such as Azure [10], Amazon [11], etc. make deploying new resources in the cloud during runtime a possible option. Based on these facts, we propose an elasticity-aware deduplication (EAD) algorithm to dynamically assign new resources. EAD solves the two main problems in this topic: when to trigger the scaling up operation, and how much new resource is enough for the future.

In summary, in this paper we design a multi-tier storage architecture to enhance the caching capacity of the deduplication system.
We optimize the approach to achieve higher RAM utilization efficiency and make the best use of the caching system by using downsampling-based algorithms. We finally propose an elasticity-aware deduplication (EAD) algorithm that takes advantage of the elasticity of cloud computing to dynamically trigger the scaling up operation. Furthermore, we present a detailed analysis of our algorithm, as well as an evaluation using extensive experiments on a real dataset.

The rest of the paper is organized as follows: Section 2 explores the background of the deduplication system bottleneck and approaches. Section 3 describes the details of the EAD design, including the scaling up strategy and the storage hierarchy. Section 4 evaluates our solution, and Section 5 contains the related work. Section 6 concludes the paper.

2 BACKGROUND

This section discusses the background of deduplication, including its bottleneck, and further explores some alternatives to improve deduplication performance.

2.1 Data Deduplication Basics

As shown in Figure 1, in a typical cloud-based deduplication system, a DedupServer takes charge of processing deduplication requests. The DedupServer itself does not store any chunk contents; it only keeps the entire metadata library on its disk, which tracks the fingerprint of each chunk and its address in the storage pool. In addition, the DedupServer's memory caches the hot index table entries to accelerate overall access speed. The basic deduplication process first divides the incoming data stream into (fixed- or variable-size) chunks and segments (groups of chunks based on their locality). Next, duplicate chunks are identified by their hash fingerprints (FPs), calculated on their contents. The server then needs to look up each chunk hash in an index it maintains for all chunks seen so far for that storage location (dataset) instance. If there is a match, the incoming chunk contains redundant data and can be deduplicated; if not, the (new) chunk needs to be added to the system, and its hash and metadata need to be inserted into the index.
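The chunk, fingerprint and index steps described above can be illustrated with a minimal Python sketch. It assumes fixed-size 8 KB chunks and SHA-1 fingerprints, and it models both the index and the storage pool as a single in-memory dictionary; the function names `fingerprint`, `deduplicate` and `restore` are hypothetical and not taken from the paper.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size of 8 KB


def fingerprint(chunk: bytes) -> str:
    """Return the SHA-1 hex digest used here as the chunk fingerprint (FP)."""
    return hashlib.sha1(chunk).hexdigest()


def deduplicate(data: bytes, index: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and store only chunks not seen before.

    `index` maps fingerprint -> chunk content and stands in for both the
    index table and the storage pool (a simplification of this sketch).
    Returns the file recipe: the ordered list of fingerprints (pointers).
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = fingerprint(chunk)
        if fp not in index:
            index[fp] = chunk   # new chunk: insert its hash and content
        recipe.append(fp)       # a duplicate is represented by a pointer only
    return recipe


def restore(recipe: list, index: dict) -> bytes:
    """Rebuild the original data by following the pointers in the recipe."""
    return b"".join(index[fp] for fp in recipe)
```

For instance, a stream made of the chunks A, B, A stores two chunks and yields a three-entry recipe whose first and last pointers are identical.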
The detailed algorithm is shown in Algorithm 1, where a memory-disk deduplication architecture is applied.

Fig. 1: A typical cloud-based deduplication system. (Diagram labels: Client 1, Client 2, Client 3; Dedup Server with cache memory and a disk holding the entire metadata library; storage pool holding the entire chunk content library.)

Algorithm 1 Basic deduplication strategy with memory
Input: the incoming segment S_n
Deduplication Phase I: Identify duplicate chunks
for all x ∈ S_n do
    Client: send FP_x to Server
    Server: search FP_x in IndexTable cached in Memory
    if found then
        Mark x as dup (x_dup)
    else
        Server: search FP_x in IndexTable on Disk
        if found then
            Mark x as dup (x_dup)
            Load FP_x from Disk to cached IndexTable in Memory
        else
            Mark x as unq (x_unq)
Deduplication Phase II: Data transmission
for all x ∈ S_n do
    Transmit x_unq along with only metadata of x_dup
System finishes processing S_n

2.2 Understanding SSDs

As discussed before, the slow access speed of disk is the bottleneck of a deduplication system. To address this issue, we consider using SSDs as the middle storage tier between RAM and MD. Table 1 shows a comparison between RAM, SSD and MD.

TABLE 1: Comparison between RAM, SSD and MD (updated in Sept 2014, US)

Storage        Speed            Storage capacity    Price per gigabyte   Power outage impact
RAM (Memory)   6 GB/s           1-8 GB              USD/GB               Data lost
SSD (Disk)     up to 600 MB/s   GB, up to 2 TB      0.45 USD/GB          Data stored
MD (Disk)      up to 140 MB/s   up to 10 TB         0.05 USD/GB          Data stored

Specifically, flash-based SSDs have the following characteristics, which help to improve the efficiency of the deduplication design:

1) High access speed: Unlike MDs, flash-based SSDs are made of silicon memory chips and have no moving parts. Thus both the read and write response times of SSDs are significantly better than those of MDs. As shown in Table 1, SSDs are almost 4.29 times faster than MDs. In consumer products, the maximum transfer rate typically ranges from about 100 MB/s to 600 MB/s, depending on disk type, while in the enterprise market vendors offer devices with multi-GB/s throughput.

2) Data retained after a power outage: Unlike RAM, an SSD preserves data when a power outage happens, which is a bonus for reliability, because a deduplication system cannot recover data losslessly without the entire recipe.

3) Large size: Since 2014, SSDs with sizes up to 2 TB have become available, while 128 to 512 GB drives are more common. Although an SSD cannot easily reach the same size as an MD, it has much larger capacity than RAM.

4) Affordable expense: The price (e.g. cost per gigabyte) of SSDs changes rapidly and has kept dropping in recent years. Today the cost per gigabyte of SSDs is about 1/68 that of RAM.

These benefits enable SSDs to be widely used in almost every part of modern computing systems, from low-end PCs to high-end servers in supercomputing, making SSD-based storage systems increasingly attractive to both academia and industry.
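To make the memory-then-disk lookup of Algorithm 1 concrete, the following Python sketch models Phase I with a bounded in-memory cache in front of a full on-disk index. Several details are assumptions of this sketch rather than statements from the paper: the class name `TwoTierIndex`, the LRU eviction policy, modelling the disk index as a set, and recording a unique fingerprint on disk only.

```python
from collections import OrderedDict


class TwoTierIndex:
    """Fingerprint index: a bounded memory cache in front of a disk index."""

    def __init__(self, cache_capacity: int):
        self.cache_capacity = cache_capacity
        self.cache = OrderedDict()  # cached IndexTable in memory, in LRU order
        self.disk = set()           # entire IndexTable on disk (modelled as a set)
        self.disk_lookups = 0       # counts the slow lookups

    def _cache_put(self, fp):
        self.cache[fp] = True
        self.cache.move_to_end(fp)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least recently used (assumed policy)

    def is_duplicate(self, fp) -> bool:
        if fp in self.cache:                # found in memory
            self.cache.move_to_end(fp)
            return True
        self.disk_lookups += 1              # fall back to the disk index
        if fp in self.disk:
            self._cache_put(fp)             # load FP from disk into the cached index
            return True
        self.disk.add(fp)                   # unique chunk: record it on disk only
        return False


def process_segment(index: TwoTierIndex, fingerprints):
    """Phase I: label each chunk fingerprint as duplicate or unique."""
    dup, unq = [], []
    for fp in fingerprints:
        (dup if index.is_duplicate(fp) else unq).append(fp)
    return dup, unq
```

The `disk_lookups` counter makes the bottleneck visible: every fingerprint that misses the memory cache costs one disk lookup, whether or not it turns out to be a duplicate.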
2.3 Understanding Downsampling

In this section, we first investigate why we cannot find a best memory size and fix it in advance, and then give an overview of downsampling algorithms, which can be used to reduce the size of the dataset cached in memory.

Why Not Choose The Best Memory Size?

One intuitive alternative is to try to estimate the appropriate amount of memory needed prior to deploying the deduplication system. A straightforward approach is to perform simple profiling on a sample of data and compute the expected memory requirements based on the results. To illustrate why it is difficult to choose the right amount of RAM in practice, we conducted a simple experiment that represents a storage system used to archive virtual machine (VM) images (this is a common workload used in deduplication evaluations [6], [12]). We want to maximize the deduplication ratio to conserve bandwidth and storage costs. For simplicity, we assume that all VMs are running the same OS and are the same size. A simple way to estimate memory requirements is to first estimate the index size for a single VM, and then use that to estimate the total RAM necessary for all users. Thus, given n users, each storing a VM of the same size, we estimate the amount of RAM m needed to index one user, so that our backup system will need n · m of RAM. We can derive m via experiments.

Figure 2 shows the results for two VMs. As far as we know, VM2 contains more text files while VM1 has more video files. The number of index entry slots indicates how much information about already stored data the system can provide for duplicate detection. We set a fixed number of index entries for duplicate detection and gradually increase it. We see that when the number of index entry slots increases to 270 thousand, both VMs exhibit the same amount of duplicate data. As we increase the index size, VM1 shows limited improvement, while VM2 shows much better performance. Using VM1 to estimate m would have led to much lower bandwidth savings, especially if a significant number of VMs resemble VM2.
Buying too much memory is wasteful if most of the data resemble VM1.

Fig. 2: Intuitive test on the amount of duplicate data detected on two equal-sized (4.7 GB) VMs by using equal-size indexes. (Plot title: duplicate detected by using different index sizes; x-axis: number of entries in index, ×10^5; y-axis: duplicate data detected, MB; series: VM1, VM2.)

Using Locality And Downsampling

We provide an overview of the basic sampling approaches of [5], [6], followed by a discussion of the advantages of sampling and its limitations. Storage systems that make use of data deduplication generally operate at chunk level, and in order to quickly determine potential duplicate chunks, an index of existing chunks needs to be maintained in memory. For example, 100 TB of data will need about 800 GB of RAM for the index under standard deduplication parameters [13], which makes keeping the entire index in memory challenging. Typical deduplication parameters, which have been experimentally shown to give good performance [13], are given in Table 2. We see that in order to support 12.5 billion (100 TB / 8 KB) chunks, we need 800 GB of RAM for the index. As the data size increases to 300 TB, we need to support all 37.5 billion chunks, and 2400 GB of RAM is needed for indexing alone (C · E / c_k). These estimated figures are naturally conservative, since the actual amount of replicated chunks is unknown at run time.

The principle of locality is used to design sampling algorithms that utilize a smaller index while providing good performance [5]. The locality principle suggests that if chunk X was observed to be surrounded by chunks Y, Z, W in the past, then the next time chunk X appears, there is a high probability that chunks Y, Z, W will also appear. In sampling-based deduplication, the data is first divided into larger segments, each of which contains thousands of chunks. Deduplication is executed on these segments by checking whether the fingerprints of their sampled chunks exist in the index. If a chunk's fingerprint is found in the index, the corresponding segment that contains that chunk is located, and the fingerprints of all the other chunks in this segment are pre-fetched from disk into the chunk cache in memory.

The downsampling algorithm [6] works as an optimized sampling approach that takes advantage of the locality principle. The difference is that the sampling rate is initialized to 1, i.e., it picks all the chunks in a segment as its sampled chunks. As the amount of incoming data increases, this value gradually decreases by dropping half of the index entries. Thus the indexing capacity doubles by accepting only a part of the chunks' fingerprints as samples to represent each segment.
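The halving behaviour of downsampling can be sketched in a few lines of Python. The selection rule used here (keep a fingerprint only if its integer value is divisible by 2^level) is an assumption chosen for illustration; the text only states that the sampling rate starts at 1 and that half of the index entries are dropped at each step. The class name `DownsampledIndex` is likewise hypothetical.

```python
class DownsampledIndex:
    """Sampled fingerprint index that halves its sampling rate when full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.level = 0      # sampling rate R = 1 / 2**level, so R starts at 1
        self.entries = {}   # fingerprint (as int) -> segment id

    @property
    def sampling_rate(self) -> float:
        return 1.0 / (1 << self.level)

    def _is_sample(self, fp: int) -> bool:
        # Assumed rule: a fingerprint is a sample if it is divisible by 2**level.
        return fp % (1 << self.level) == 0

    def downsample(self) -> None:
        """Halve the sampling rate and drop entries that no longer qualify."""
        self.level += 1
        self.entries = {fp: seg for fp, seg in self.entries.items()
                        if self._is_sample(fp)}

    def insert(self, fp: int, segment_id: int) -> None:
        if not self._is_sample(fp):
            return                      # not a sampled chunk at the current rate
        self.entries[fp] = segment_id
        while len(self.entries) > self.capacity:
            self.downsample()           # make room by dropping about half
```

With a capacity of four entries, inserting fingerprints 0 to 7 triggers one downsampling step, after which only the even fingerprints remain and the sampling rate is 1/2.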
In other words, instead of indexing chunks X, Y, Z, and W in RAM, the downsampling algorithm will index only chunk X (or another one of the four) in RAM after two adjustments, and the rest on disk.

The above sampling-based approaches have two main drawbacks. The first (obvious) drawback is that not all data exhibit locality [14], and thus sampling algorithms do not work well with such datasets. The second drawback is that even for data that exhibits locality, it is difficult to select the correct sampling rate or to decide how to adjust it, due to the large variance in possible deduplication ratio [15] [16].

TABLE 2: An example of a cloud-based backup system configuration

Terminology                    Value
Chunk size c_k                 c_k = 8 KB
Segment size S                 S = 16 MB
Physical storage capacity C    C = 300 TB
Number of chunks N             N = C/S
Index entry size E             E = 64 B

2.4 Understanding Elasticity Awareness And Scaling Up

Our last direction is to dynamically scale up the storage system during runtime. After several downsampling operations, the assigned RAM may still not be large enough to process the incoming data stream. We then need to trigger the online scaling up operation, which takes advantage of cloud computing's elasticity. The scaling up operation can improve deduplication systems by allowing them to dynamically adjust the amount of memory resources as needed to detect a sufficient amount of duplicate data. This is especially useful because the index used for deduplication is often kept in memory to avoid the performance bottleneck of disk I/O operations. There are two problems in elasticity awareness and the scaling up operation:

1) When to trigger the scaling up operation? The main guideline is to trigger the scaling up operation only when the result can benefit performance. We need to distinguish whether poor deduplication performance is due to overly aggressive downsampling (caused by the user's high expected deduplication degree) or is inherent in the dataset type (dedup-unfriendly datasets that have a low duplication ratio).
To solve that, we need to answer these three questions: (i) if the RAM (for the cached index) is close to the preset limit, how does the system detect and evict non-critical entries? (ii) when does the system need to trigger the downsampling process? (iii) after how many rounds of downsampling does the system finally request to scale up?

2) How to assign new resources? Both the pricing costs and the resource utilization ratios should be considered when assigning new resources. A straightforward strategy is to simply double (or multiply by a preset factor n) the current size when scaling up. However, even disregarding the high expense, this exponential expansion is wasteful, since the later incoming stream may not need so much space. An optimized approach is required to obtain higher RAM utilization efficiency. It should expand the RAM size based on the occurrences of downsampling, which may predict the dataset's future.

3 EAD: ELASTICITY-AWARE DEDUPLICATION

Storage deduplication services in the cloud often run in virtual machines (VMs). Unlike a conventional OS, which runs directly on physical hardware, the OS in a VM runs on top of a hypervisor or virtual machine monitor, which in turn communicates with the underlying physical hardware. The hypervisor is responsible for dynamically increasing the RAM resources of the VM. This can be done in two generic ways. The first is to use a ballooning algorithm to reclaim memory from other VMs running on the same physical machine (PM) [17]. This is a relatively lightweight process that relies on the OS's memory management algorithm, but it can only increase memory by relatively small amounts. The second, needed by deduplication systems that require increasingly larger amounts of memory, is to run a VM migration algorithm [18], [19]. In VM migration, the hypervisor migrates the RAM contents from one PM to another with sufficient memory resources [18]. Regardless of the migration algorithm used, some downtime inevitably occurs when switching over to a new VM [19].

A naive approach towards incorporating elasticity is to increase the memory size once the index is close to being full. This naive approach does not perform well, since frequent scaling up or migration induces a high overhead. Furthermore, the naive approach always retains the entire old index during each scaling up or migration, even though some of those index entries do not fingerprint many chunks. Such poorly performing index entries take up valuable index space without providing much benefit.

Our approach combines the benefits of downsampling [6] and VM scaling up to allow users to maintain a satisfactory level of performance by adjusting the sampling rate and memory size accordingly. Our system design consists of two components: an EAD client that is responsible for file chunking, fingerprint computation and sampling, and an EAD server that controls index management and other memory management operations. The EAD client runs on the client side, for instance at the gateway server of a large company.
The EAD server can be executed by the cloud provider. The entire system design is shown in Figure 3. Only unique data is supposed to be stored in Physical Storage. The File Manager is responsible for data retrieval and maintenance; how it works is outside the scope of this paper.

Fig. 3: EAD infrastructure.

3.1 EAD Algorithm

Different types of users have different deduplication requirements. Some users will be willing to tolerate worse deduplication performance in exchange for lower costs, while others will not. To accommodate different requirements, EAD is designed to allow a user to specify a scaling up (or migration) trigger, Γ (Γ ∈ (0, 1)), which specifies the level of deduplication performance the user is willing to accept.

Deduplication performance is usually measured by the reduction ratio [20], [21], which is the size of the original dataset divided by the size of the dataset after deduplication. To help the user select the migration trigger, we define the Deduplication Ratio (DR),

\[
\text{Deduplication Ratio} = 1 - \frac{\text{Size after deduplication}}{\text{Size of original data}} \qquad (1)
\]

Intuitively, we would like to first apply downsampling algorithms until the deduplication performance becomes unsatisfactory, and then migrate the index to larger memory in order to obtain better performance. EAD will migrate to larger RAM only when migration will result in deduplication performance better than Γ. This has an important but subtle implication: EAD will not always migrate when deduplication performance falls under Γ, but only when migration will improve performance. This is important because, given a dataset that inherently exhibits poor deduplication characteristics [22], adding more RAM will incur the migration overhead without improving deduplication performance.

This means that EAD cannot simply compare the measured DR against Γ, because the measured DR may not necessarily reflect the amount of duplication that exists. To illustrate, let us assume that the deduplication system measures its DR and finds it to be less than Γ. There are two possibilities. The first is that the system has performed overly aggressive downsampling, and can benefit from increasing RAM.
The second possibility is that the dataset itself has poor deduplication performance, e.g. data in multimedia or encrypted files. In this case, increasing RAM does not result in better performance.

How our EAD algorithm determines when to migrate to more RAM resources can be found in Alg. 2. It executes in two phases, as generic inline deduplication systems do. We use S_n and x to denote the incoming segment and the chunks inside it, and FP_x represents the fingerprint of chunk x. In Phase I, the EAD Client sends the fingerprint information FP_x of all chunks (∀x ∈ S_n) in each segment S_n to the EAD Server, labeling the fingerprints of sampled chunks as FP_x^est (used for estimation) and FP_x^dedup (used for duplicate detection). The latter searches the index table T and the chunk cache for duplicate identification, and updates the estimation base B. Each chunk x is then marked as dup or unq, indicating that it is a duplicate or a unique chunk. In Phase II, the EAD Client transmits only unique data chunks, along with the metadata of duplicate ones, to the EAD Server, saving bandwidth and storage space. Meanwhile, the current sampling rate R_0 is subject to change to R based on deduplication performance. Details on the features of the EAD algorithm are presented next.

3.2 Estimating Possible Deduplication Performance

One of the key features of EAD is that the algorithm is able to determine whether migration will be beneficial. In order to distinguish whether poor deduplication performance is due to overly aggressive downsampling or is inherent in the dataset, we first need to be able to estimate the potential DR of the dataset. Obtaining the actual DR is impractical, since it requires performing the entire deduplication process.

Prior work [23] provided an algorithm to estimate the deduplication performance for static, fixed-size datasets. Their algorithm requires the actual data to be available in order to perform random sampling and comparisons. However, in our problem, the dataset can be viewed as a stream of data. There is no prior knowledge of the size or characteristics of the data to be stored. We also cannot scan back and forth over the complete dataset for estimation.

In our EAD algorithm, we let the EAD Server maintain an estimation base B. The EAD Client randomly selects κ fingerprints from each segment and sends them to the EAD Server to be stored in B. Suppose n_s segments have come in; there will then be κ n_s samples, a number that grows with the amount of incoming data. Each entry slot in B includes a fingerprint as well as two counters, x_c1 and x_c2, where x_c1 records the number of occurrences of fingerprint FP_x in B, and x_c2 records the number of occurrences of FP_x among all the chunks uploaded. We integrate our estimation process into the regular deduplication operations so as to avoid the separate sampling and scanning phases of [23].
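The bookkeeping of the estimation base B can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the names `EstimationBase`, `add_sample`, `observe_chunk` and `process_segment` are hypothetical, fingerprints are arbitrary hashable values, and the first sampling of a fingerprint is counted as one occurrence in x_c1 (a choice of this sketch).

```python
import random


class EstimationBase:
    """Estimation base B: fingerprint -> [c1, c2].

    c1 counts how often the fingerprint was drawn as a sample into B,
    c2 counts how often it appeared among all uploaded chunks.
    """

    def __init__(self):
        self.entries = {}
        self.num_samples = 0  # equals kappa * n_s when every segment yields kappa samples

    def add_sample(self, fp) -> None:
        counters = self.entries.setdefault(fp, [0, 0])
        counters[0] += 1      # assumption: the first sampling counts as one occurrence
        self.num_samples += 1

    def observe_chunk(self, fp) -> None:
        if fp in self.entries:
            self.entries[fp][1] += 1


def process_segment(base: EstimationBase, segment_fps, kappa: int, rng: random.Random):
    """Draw kappa random samples from the segment, then count every chunk against B."""
    samples = rng.sample(segment_fps, min(kappa, len(segment_fps)))
    for fp in samples:
        base.add_sample(fp)
    for fp in segment_fps:    # piggybacks on the regular fingerprint comparison
        base.observe_chunk(fp)
    return samples
```

Because the counters are updated while fingerprints are being compared anyway, the sketch needs no separate scanning pass over the data.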
When the client sends the samples for duplicate searching to the storage server, the samples for estimation are transmitted at the same time to update B. During the comparison of the fingerprints of incoming chunks against those in the chunk cache, we update B again, incrementing the counter x_c2 by one every time its corresponding fingerprint appears. Thus, there is no extra overhead for our estimation purpose. Using B, we can compute the estimated deduplication ratio, EDR, as

\[
EDR = 1 - \frac{1}{\kappa n_s} \sum_{x \in B} \frac{x_{c1}}{x_{c2}} \qquad (2)
\]

The computation of EDR happens when the index size is approaching the memory limit. Only when DR is smaller than Γ · EDR is there a potential performance improvement from migration, and EAD will then migrate the index to larger RAM. Otherwise, EAD will apply downsampling on the index in exchange for larger indexing capacity.

Algorithm 2 Elastic deduplication strategy
Input: the incoming segment S_n
Deduplication Phase I: Identify duplicate chunks
∀x ∈ S_n: EAD Client sends FP_x to EAD Server
for all FP_x^dedup do
    if FP_x^dedup ∈ T then
        Locate its correspondent segment S_dup
        ∀x_j ∈ S_dup: fetch information of x_j (FP_xj), set x_j ∈ chunk cache
    else
        Add FP_x^dedup to T
for all FP_x^est do
    if FP_x^est ∈ B then
        x_c1 = x_c1 + 1
    else
        Add FP_x^est to B, set x_c1 = x_c2 = 0
for all x ∈ S_n do
    ∀x_k ∈ chunk cache: compare FP_x with FP_xk
    if FP_x = FP_xk then
        Mark x as dup (x_dup)
    else
        Mark x as unq (x_unq)
    ∀x_l ∈ B: compare FP_x with FP_xl
    if FP_x = FP_xl then
        x_c2 = x_c2 + 1
Deduplication Phase II: Data transmission
for all x ∈ S_n do
    Transmit x_unq along with only metadata of x_dup
EAD finishes processing S_n
if Index is approaching the RAM limit then
    if DR < Γ · EDR then
        if R_0 = 1 then
            EAD sets Γ = DR / EDR
        else
            EAD triggers migration and raises the sampling rate (R > R_0)
    else
        EAD downsamples, lowering the sampling rate (R < R_0)

3.3 EAD Scaling Up Algorithm

The performance of the EAD algorithm can be further improved by observing additional information obtained at run time and then adjusting the parameters of the algorithm.

(1) Adjusting Γ. The parameter Γ is specified by the user, and indicates the user's desired level of deduplication performance. However, the user may sometimes be unaware of the underlying potential deduplication performance of the data, and set an excessively high Γ value, resulting in unnecessary migration over time. We adjust the user's Γ value to DR / EDR after each migration, and also in the case that DR has not reached acceptable performance even when the sampling rate is one, so that Γ represents the current system's maximum deduplication ability. In this way, EAD is able to adapt elastically to variations in the incoming data.
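The estimate and the decision taken when the index approaches the RAM limit can be sketched in Python. The sketch assumes that EDR has the form 1 − (1 / (κ n_s)) Σ x_c1 / x_c2 over the entries of B, and that the decision follows the final branch of Algorithm 2. The function names and the action labels are hypothetical, and entries with x_c2 = 0 are skipped to avoid division by zero (a choice of this sketch).

```python
def estimated_dedup_ratio(counters, kappa: int, num_segments: int) -> float:
    """EDR = 1 - (1 / (kappa * n_s)) * sum of c1 / c2 over entries of B.

    `counters` is an iterable of (c1, c2) pairs; entries with c2 == 0 are
    skipped (assumption of this sketch).
    """
    total = sum(c1 / c2 for c1, c2 in counters if c2 > 0)
    return 1.0 - total / (kappa * num_segments)


def next_action(dr: float, edr: float, gamma: float, rate: float):
    """Decide what to do when the index approaches the RAM limit.

    Returns (action, gamma), where action is one of the hypothetical labels
    "adjust_gamma", "migrate" or "downsample".
    """
    if dr < gamma * edr:
        if rate == 1:
            # Even full sampling cannot reach the target: lower the target itself.
            return "adjust_gamma", dr / edr
        return "migrate", gamma      # more RAM is expected to help
    return "downsample", gamma       # performance is acceptable, trade rate for capacity
```

As a usage example, with DR = 0.3, EDR = 0.5 and Γ = 0.8 the threshold Γ · EDR is 0.4, so a system running at sampling rate 0.5 would migrate, while one already sampling at rate 1 would instead reset Γ to DR / EDR = 0.6.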

(2) Amount of RAM Post Migration. A simple way to set the amount of RAM allocated after migration is to use a fixed factor, e.g. doubling the RAM each time (γ = 2). We then reset the sampling rate back to 1 and start all over again. Another way is based on the observation that the sampling rate before the latest downsampling operation was able to support satisfactory performance, so EAD adjusts the current sampling rate to the one used before the latest downsampling operation. A given amount of incoming data thus requires a different amount of index space depending on the adjusted sampling rate, implying that there exist subtle relations among R_0, γ, and the size of RAM.

When deduplication performance is unsatisfactory, EAD migrates to more RAM and applies a higher sampling rate for future deduplication. A simple approach is to increase RAM at the same rate of change (γ) as the sampling rate. We can improve on this by observing the next-to-last sampling rate used prior to migration. This rate is the last known sampling rate that produced acceptable deduplication performance. This is valid because, if it had not produced acceptable performance, EAD would already have triggered migration. We therefore propose a more conservative policy for increasing index RAM based on the above analysis. We introduce a parameter d (initialized as zero) to record occurrences of downsampling; every time downsampling happens, d increases by one. We set the New Index Size (RAM_new) after migration according to the rule:

\[
RAM_{new} =
\begin{cases}
RAM_{org}, & d = 1 \\
\left[ 1 - \sum_{i=1}^{d-1} \frac{1}{\gamma^{i}} \right] RAM_{org}, & d \ge 2
\end{cases}
\qquad (3)
\]

where RAM_org represents the original index size before migration. As the number of downsampling operations increases, EAD requires less RAM for the index table after migration. Compared with always requiring γ times the original RAM, this optimized approach achieves higher memory utilization efficiency.

(3) Managing Size of B. One concern with our estimation scheme is that the size of B may become too large.
If we need a large amount of RAM to store B, we will be wasting RAM resources that could be used for the index. In practice, the size of B is relatively modest. Each entry in B consists of a fingerprint and two counters. Using SHA-1 to compute the fingerprint results in a 20-byte fingerprint. An additional four bytes are used for each counter. Thus, each B entry is 28 bytes, indicating that the total size of B would be at most approximately MB to support 1 TB of data. In our experiment, it only requires 4.32 MB for estimating GB dataset.

3.4 Scaling Up Strategy

Once the scaling up (migration) has finished, we are left with the original index (copied over) and space for the new index. The system also applies a new sampling rate, higher than the original one, in order to keep a satisfactory deduplication performance. At this stage, EAD compensates for the poor deduplication performance caused by the previously too sparse sampling rate in two steps:

1) Search through the original index table and re-detect duplicate chunks in already stored segments. Note that the read/write operations may bring unexpected cost, so EAD processes only a limited number of segments that are likely to yield duplicate chunks. The detailed mechanism is explained later.

2) It is possible that not all index entries in the old index are useful, meaning that some entries contain fingerprints of chunks that are unlikely to be encountered again. Keeping these entries in the new merged index wastes index slots. Therefore, after migration and duplicate re-detection, these entries are removed from the index table.

Identifying segments that contain undetected duplicate chunks is a nontrivial task. In a sampling approach, data segments are represented only by their sampled chunks, whose fingerprints are stored in the index. Thus EAD has to identify those segments solely by searching through the index table.
We propose to nject addtonal nformaton nto ndex to assst fnshng ths task: a counter (count F P, ntalzed as zero for new added entres) s used for each ndex entry to record ts httng tmes, whch we call httng rate. Every tme when an entry has been found a match, ths counter ncrements by 1. Therefore the larger the counter s, the more duplcate chunks ths entry can detect. Among those segments hooked by FPs wth low httng rate, there exsts evcted duplcate chunks. Ths concluson s derved based on the followng analyss of FPs n the ndex: Entres wth hgh httng rate. These FPs n the ndex ndcates that segments have found matches and lots of chunks near the sampled chunks are dentcal, whch s the natural results of chunk localty. Theoretcally, the hgher httng rate they have and the more entres whch have such hgh httng rate, the more space wll be saved. Entres wth low httng rate. The explanaton for them: some segment themselves share few chunks wth stored ones, resultng n lower ndex matchng rate; other of them share lots of chunks wth stored segments, however they are not hooked by rght FPs because of sparse samplng rate, thus no or not enough matches are found from ther sampled chunks fngerprnts n ndex. Majorty of duplcate chunks evcted could be elmnated from segments who are ndexed by FPs wth low httng rate. Compared wth reprocessng all the segments on the storage, EAD s able to select only part of them for detectng majorty duplcate evctons based on ther httng rate: It uses entres wth low httng rate to track ther correspondent segments and detect evcted duplcate chunks. The threshold for labelng httng rate as hgh or low s not arbtrary. Suppose

that we have $n$ chunks coming in for a backup process, and the measured Deduplication Ratio is $f_{mr}$ ($f_{mr} < \Gamma \cdot EDR$). Meanwhile, the counters take the values $\{0, 1, \ldots, c, \ldots, m\}$, and the corresponding numbers of entries are $\{n_0, n_1, \ldots, n_c, \ldots, n_m\}$ (i.e., there are $n_0$ entries whose counter value is zero, etc.). Therefore, the total number of evicted chunks ($n_{evt}$) that are expected to be found duplicate on the cloud is calculated as:

$$n_{evt} = n \, (\Gamma \cdot EDR - f_{mr}) \qquad (4)$$

Assume that the original sampling rate is $R_0$; the minimum number of index entries to be selected is then $R_0 \cdot n_{evt}$. Based on the above calculation, EAD starts by picking index entries whose counter value is zero ($n_0$ of them). If $n_0 < R_0 \cdot n_{evt}$, EAD also picks entries with a hitting rate of one, and so on, until it satisfies:

$$\sum_{i=0}^{c} n_i \ge R_0 \cdot n_{evt} \quad (0 \le c \le m) \qquad (5)$$

By doing this, the segments with the most potential for improving deduplication performance are identified. We then locate the segments indexed by these FPs, choose a new set of samples, and detect duplicate chunks in them. This process is detailed in Algorithm 3. Choosing a new set of samples produces extra FPs, which are put into the additional RAM as part of the new index table. In addition, based on the earlier analysis, the entries that lead to poor performance are removed from the original index table, which yields additional RAM savings and makes our solution more memory efficient. Furthermore, to avoid adding them back to the index table, a Bloom Filter (BL) [24] is used on the cloud server to record hash information of the removed FPs. As a result, the old index is not kept in its entirety, and valuable index space is released for future use. When scaling up finishes, EAD merges the old and new indexes and calculates the updated Deduplication Ratio. If the Deduplication Ratio is still lower than $\Gamma \cdot EDR$ after duplication re-detection, EAD resets the value of $\Gamma$ so that $\Gamma \cdot EDR$ equals the current Deduplication Ratio. In this way, the requirement on deduplication performance does not exceed the system's ability.
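The entry selection described by Equations 4 and 5 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the deduplication ratios are expressed as fractions between 0 and 1 and that the per-entry hit counters are available as a plain list.

```python
from collections import Counter


def expected_evictions(n_chunks, gamma, edr, measured_ratio):
    """Eq. (4): n_evt = n * (gamma * EDR - f_mr), clamped at zero."""
    return max(0.0, n_chunks * (gamma * edr - measured_ratio))


def pick_counter_threshold(hit_counters, sampling_rate, n_evt):
    """Eq. (5): smallest counter value c such that the number of entries
    with counter <= c reaches R0 * n_evt.

    Returns the largest counter value if the target is never reached,
    and None for an empty index.
    """
    needed = sampling_rate * n_evt
    histogram = Counter(hit_counters)  # counter value -> number of entries
    selected = 0
    for c in sorted(histogram):
        selected += histogram[c]
        if selected >= needed:
            return c
    return max(histogram) if histogram else None


# Hypothetical numbers, chosen only to exercise the functions.
counters = [0] * 3 + [1] * 4 + [5] * 10
n_evt = expected_evictions(1000, gamma=0.95, edr=0.5, measured_ratio=0.4)
threshold = pick_counter_threshold(counters, sampling_rate=0.08, n_evt=n_evt)
```

Entries whose counter is at most `threshold` would then be the ones used to locate segments for duplication re-detection.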
Algorithm 3 Elastic scaling up strategy
The index entry x (with fingerprint FP_x) has been chosen
Locate its corresponding segment Seg_x
if Seg_x has not been processed then
    Select new sample chunks from Seg_x based on the current sampling rate
    for each newly sampled chunk (with fingerprint FP_y) do
        if FP_y finds a matching record in the new index (FP_y = FP_j, FP_j ∈ FP_newIndex) then
            Locate Seg_j and pull its FPs into the chunk cache for duplication re-detection
        else
            Add FP_y to the new index as a new entry slot
    for α = 1 : total number of chunks in Seg_x do
        Compare FP_α with those in Seg_j
        if FP_α finds a match then
            Chunk α is duplicate
Remove entry x from the old index; the Bloom Filter records the information of FP_x

3.5 Storage Hierarchy

The basic design is a classical, simple two-level hierarchy: RAM for IndexTable (along with other metadata tables like EstimateBase and ContainerRecord) and ChunkCache, and MD for SegmentChunkHash. In detail, at the RAM level, we divide the RAM into several partitions (note that the RAM mentioned here means only the user-accessible part, ignoring the part occupied by the operating system and other background applications). The first partition caches IndexTable and some counters, which keep occupying more and more space in RAM during runtime. Therefore, the more available RAM is assigned to it, the higher the hit ratio the system can achieve. On the other hand, the second RAM partition is an isolated temporary loading area, which stores ChunkCache (metadata of all chunks from a certain segment); the data there is no longer useful after the comparison process has finished. In other words, ChunkCache does not keep occupying more and more space in RAM. At the MD level, the entire database of SegmentChunkHash is stored. When there is a miss in the RAM, the existing chunks' metadata are loaded from MD to RAM, and new segments' and chunks' data are written to MD.
However, there are two main limitations of this design: (i) a small RAM size results in frequent eviction of the cached IndexTable and lowers the hit ratio; and (ii) MD's low access speed severely decreases the overall speed. Thus, to improve the deduplication speed, we need to increase the cache hits in RAM's IndexTable partition and speed up MD access. Before we design a new hierarchy of RAM, SSD and MD, we need to consider these two questions: (i) when does it make economic sense to make a piece of data resident in RAM? and (ii) when does it make sense to have it resident on disk? The answer is that RAM keeps the most frequently used part of IndexTable so that hot data can be accessed from the higher-speed storage. In other words, RAM (a higher-speed storage level) is a typical cache of MD (a lower-speed storage level). A feasible solution (as shown in Figure 4) is to use the SSD as the cache of the MD, which unifies a high-speed SSD and a large-capacity hard drive. We name it the Fusion Disk (FD) design. Basically, this design focuses on improving the disk set's overall access speed, so we do not need to change any behavior between RAM and the disk set in our deduplication system. RAM stores
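As a rough illustration of the two-level hierarchy, the following Python sketch models the RAM IndexTable partition as a bounded cache in front of the full metadata store on MD. It is a minimal sketch under stated assumptions: the class and attribute names are hypothetical, a dictionary stands in for SegmentChunkHash on MD, and LRU eviction in RAM is assumed here purely for illustration.

```python
from collections import OrderedDict


class TwoLevelIndex:
    """RAM partition (bounded, LRU order) backed by an on-disk metadata store."""

    def __init__(self, ram_slots, disk_index):
        self.ram_slots = ram_slots
        self.ram = OrderedDict()   # fingerprint -> metadata, oldest first
        self.disk = disk_index     # stand-in for SegmentChunkHash on MD
        self.hits = 0
        self.misses = 0

    def lookup(self, fingerprint):
        """Return metadata for a fingerprint, loading it from disk on a miss."""
        if fingerprint in self.ram:
            self.ram.move_to_end(fingerprint)  # mark as most recently used
            self.hits += 1
            return self.ram[fingerprint]
        self.misses += 1
        metadata = self.disk.get(fingerprint)
        if metadata is not None:
            self._cache(fingerprint, metadata)
        return metadata

    def _cache(self, fingerprint, metadata):
        self.ram[fingerprint] = metadata
        self.ram.move_to_end(fingerprint)
        if len(self.ram) > self.ram_slots:
            self.ram.popitem(last=False)       # evict least recently used


# Hypothetical workload: three stored fingerprints, room for two in RAM.
index = TwoLevelIndex(ram_slots=2, disk_index={"a": 1, "b": 2, "c": 3})
for fp in ["a", "b", "a", "c", "b"]:
    index.lookup(fp)
```

With only two RAM slots, the access to "c" evicts "b", so the final lookup of "b" goes back to disk; this is the kind of miss that a small RAM partition produces.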

HotData, while SSD stores both HotData (as a copy of RAM) and WarmData.

Fig. 4: Structure of fusion disk design

The last question is: what caching algorithm should be applied on the SSD? The following are key concerns for caching algorithms: (i) when and what to put in the cache or slow storage when facing new incoming data; (ii) when and what to admit from the slow storage to the cache; and (iii) when and what to evict from the cache to slow storage. In the baseline EAD design, we use the LRU algorithm, which discards the least recently used items first. In general, no single algorithm fits all traces. Thus, we need to conduct a trace-driven simulation test to analyze different algorithms.

4 EVALUATION

In our evaluations, we collected a dataset consisting of VMs that all run the Ubuntu OS, but each VM has different types of software and utilities installed and contains different types of application data, the majority of which comes from Wikimedia Archives [25] and OpenfMRI [26]. The total size of our dataset is around GB. We evaluate our solution, denoted as Elastic in the figures, against two alternative approaches. The first alternative, denoted as FullIndex, represents an ideal situation where there is unlimited RAM available. This serves as an upper bound on the total amount of space savings. The other alternative, denoted as DownSample, is a recent approach [6] that dynamically adjusts the sampling rate to deal with insufficient RAM.

4.1 Deduplication Ratio

We compare our algorithm with a generic deduplication mechanism without sampling and a state-of-the-art high-performance deduplication strategy with a down-sampling mechanism [6]. Before deploying the deduplication process, we allocate a specific amount of RAM for the index in the different strategies. Table 3 shows the amount of RAM allocated for different deduplication strategies.
We set the size of each entry slot in the index to 64 bytes, which consists of three parts: FP, chunk metadata (storage address, chunk length, etc.) and counter, which are 20 bytes (SHA-1 hash signature [27]), 40 bytes and 4 bytes, respectively. These sizes may vary under different hash functions or addressing policies, but they will not differ much.

TABLE 3: RAM deployment for index under different deduplication strategies. We set the down-sampling trigger to 0.85, which means that when the storage approaches 85% of its current limit, the index is down-sampled (half of its entries are removed, e.g., by deleting index FPs with FP mod 2 = 0).

Strategy            # of index entries    Size of index (MB)
Full index
With down-sample
EAD

We assume that the capacity is 75 GB. Nearly 10 million index entries are needed to index all the unique data if we do not use any sampling strategy. Under the down-sample strategy with a minimum sampling rate of 0.05, we need 500K index entries for 75 GB of unique data. EAD always picks a much more conservative index size, specifically only 100K entry slots in this case. We use the Normalized Deduplication Ratio as the metric for comparing deduplication ratios. It is defined as the ratio of the measured Deduplication Ratio to the Deduplication Ratio of FullIndex deduplication. Note that FullIndex detects all the duplicate data chunks and achieves the highest deduplication ratio. Thus, the metric is meaningful because it indicates how close the measured deduplication ratio is to the ideal deduplication ratio achievable in the system. Figure 5a shows the Normalized Deduplication Ratio of the above deduplication strategies. Down-sampling and the computation of EDR happen when the usage of the index approaches 85% of its capacity. The down-sample strategy has a ratio higher than 99.5%, showing the benefit of taking advantage of locality. EAD does not achieve an equally high ratio, but the gap is less than 2%.
Also, considering that the performance requirement for EAD is defined by Γ, which is 0.95 in this case, the performance of Elastic is always higher than 98%, better than what is required. Notice that the ratio fluctuates, which indicates inconsistency in data content among different VMs. When about 5% of the data has been processed, there is a performance jump that does not appear in DownSample. The explanation is that, because of its trivially small initial index, Elastic cannot detect enough duplicate chunks, leading to poor performance, which triggers the migration and boosts its performance. Such elastic behavior is a unique feature that cannot be observed in the other two approaches. However, comparing the deduplication ratio alone is not a fair way to evaluate their performance, since the three strategies spend different amounts of RAM on the index from the start. Figure 5b shows how the sampling rate and the number of index slots used vary in the above cases. Clearly, deduplication without sampling incurs too much memory cost. We notice that both DownSample and Elastic have comparatively very low memory cost (a small number of index entry slots). We can also observe that when about 5% of the data has been processed, the sampling rate in Elastic increases, reflecting its feature

of elasticity. The above results show that EAD is able to use less RAM space to achieve a satisfactory deduplication ratio, which is only slightly lower than the other two. Next we derive a more meaningful metric, Deduplication Efficiency, a single utility measure that encompasses both deduplication ratio and RAM cost, to make a fairer comparison among these three strategies.

Fig. 5: (a) Deduplication ratio comparison (Normalized Deduplication Ratio (%) versus amount of data processed (%)). (b) Index usage comparison (sampling rate and # of index slots versus amount of data processed (%)). Curves: FullIndex, DownSample, Elastic. The sampling rate is 1 for all of them at the start of backup. The index migration in EAD is triggered when the normalized deduplication ratio drops below 95% (Γ=0.95); after that, the sampling rate doubles ( =2).

4.2 Deduplication Efficiency

As discussed in Section 4.1, neither deduplication ratio nor memory cost alone can fully represent the system performance. Therefore we define

$$\text{Dedup Efficiency} = \frac{\text{Duplicate Data Detected}}{\text{Index Entry Slots}} \qquad (6)$$

as a more advanced performance evaluation criterion. Using this criterion, we make a fairer comparison among EAD and the other two solutions, as shown in Figure 6. It shows that Elastic outperforms both DownSample and FullIndex on efficiency. Notice that Elastic always yields a higher efficiency, almost 4 times that of DownSample and 30 times that of FullIndex. This is because its elastic feature enables it to use as little memory space as possible to detect as much duplicate data as required, avoiding the memory waste of the other two.

Fig. 6: Deduplication efficiency performance (Deduplication Efficiency (MB/Slot) versus amount of data processed (%); curves: FullIndex, DownSample, Elastic).

4.3 SSD-based Fusion Disk Evaluation

In this section, we first illustrate what is stored in RAM, SSD and MD, and then introduce a set of performance metrics used for evaluating a caching system.
Following that, we investigate several caching algorithms and analyze their impact on our new EAD infrastructure through extensive simulations. Sensitivity analysis on different RAM and SSD sizes is also conducted.

Storage Partitions

(1) What is stored in RAM? The memory of each server stores two parts of cached data: IndexTable (along with other metadata tables) and ChunkCache. Note that in practice, we only adjust the size of the IndexTable partition (referred to as the RAM size), and ignore the ChunkCache part since it is a tiny temporary loading area used simply for comparing cached and newly arriving fingerprints. Moreover, we do not count write operations of real chunk content (to the storage pool), but focus only on index table access (on the server's disk). To simplify the problem, we fix the size of one chunk's metadata at 4 KB, which is enough for encoding and assembling other necessary metadata in a real deduplication system. In other words, the basic unit of I/O access is 4 KB in this paper.

(2) What is stored in SSD and MD? The server's disk contains an entire metadata library, which maps chunks' fingerprints (including other metadata) to their physical addresses in the storage pool. In our FD design, the SSD lies on top of the MD (which holds the entire metadata library) and plays the role of a cache under the write-back caching policy. For example, pages from the RAM are evicted to the SSD, and when the SSD is full, a victim page is then evicted to the MD.

Performance Metrics

Recall that in our new EAD infrastructure, we adopt SSDs as a middle storage tier between RAM and disks. Different caching algorithms such as LRU (Least Recently Used), CLOCK [28], ARC [29], and CAR [30] can be used to manage the data set cached in the new middle tier. To explore the performance of EAD under

these different caching algorithms, we introduce two important performance metrics: I/O hit ratio and I/O operation cost. We consider a combination of these two metrics as the criterion for investigating the impact of these caching algorithms.

(1) I/O Hit Ratio: I/O hit ratio is defined as the fraction of I/O requests that are served by Flash. Although the SSD's access unit is also 4 KB, an I/O request might still span more than one page. Therefore, we regard an I/O request as a hit only when all of its associated pages are cached in SSD. A higher I/O hit ratio means that more I/Os can be served from Flash directly, which accelerates the overall I/O performance. Thus, one of our primary goals is to increase the I/O hit ratio to improve SSD utilization.

(2) I/O Operation Cost: I/O operation cost can be represented as I/O response time or I/O throughput (e.g., IOPS). In this paper, we use I/O response time to evaluate the cost of using MD and FD, as shown in Equation 7, where $C_{IOAccess}$ and $C_{FlashUpdate}$ represent the I/O access cost and the Flash contents updating cost, respectively. The $N$ terms indicate the numbers of SSD reads ($N_{SSDr}$), SSD writes ($N_{SSDw}$), MD reads ($N_{MDr}$), and MD writes ($N_{MDw}$), while the $T$ terms (e.g., $T_{SSDr}$ and $T_{MDr}$) denote the corresponding average I/O latency of each operation.

$$\begin{aligned} C_{MD} &= C_{IOAccess} = N_{MDr} T_{MDr} + N_{MDw} T_{MDw} \\ C_{FD} &= C_{IOAccess} + C_{FlashUpdate} = N_{SSDr} T_{SSDr} + N_{SSDw} T_{SSDw} + N_{MDr} T_{MDr} + N_{MDw} T_{MDw} \end{aligned} \qquad (7)$$

The main difference between the I/O cost calculations of the MD and FD designs is that the I/O cost of FD consists of two parts: I/O access cost and Flash contents updating cost. Table 4 (a) further presents the related I/O operation costs for MD and FD in four different scenarios, i.e., read hit, read miss, write hit, and write miss. Except for read misses, our FD design always redirects I/Os from the slow MD to the fast SSD, which significantly reduces the total I/O response time. Table 4 (b) shows the Flash updating cost, which applies only to the FD design.
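The cost model of Equation 7, together with the per-case operations listed in Table 4 (a) and (b), can be sketched as follows. This is a hedged illustration rather than the authors' simulator: the names `Latency`, `md_cost`, `fd_cost` and `fd_operation_counts` are hypothetical, and the latency values in the usage example are arbitrary placeholders, not the measured values of Table 4 (c).

```python
from dataclasses import dataclass


@dataclass
class Latency:
    """Average per-operation latencies (any consistent time unit)."""
    ssd_read: float
    ssd_write: float
    md_read: float
    md_write: float


def md_cost(n_md_read, n_md_write, t):
    """C_MD = N_MDr * T_MDr + N_MDw * T_MDw."""
    return n_md_read * t.md_read + n_md_write * t.md_write


def fd_cost(n_ssd_read, n_ssd_write, n_md_read, n_md_write, t):
    """C_FD: weighted sum over all four operation types (Eq. 7)."""
    return (n_ssd_read * t.ssd_read + n_ssd_write * t.ssd_write
            + n_md_read * t.md_read + n_md_write * t.md_write)


def fd_operation_counts(read_hits, read_misses, write_hits, write_misses,
                        dirty_evictions):
    """Translate Table 4 (a) and (b) into device operation counts for FD."""
    n_ssd_read = read_hits + dirty_evictions                 # hit, or read victim
    n_ssd_write = read_misses + write_hits + write_misses    # fill or update SSD
    n_md_read = read_misses                                  # fetch from MD
    n_md_write = dirty_evictions                             # flush victim to MD
    return n_ssd_read, n_ssd_write, n_md_read, n_md_write


# Placeholder latencies and counts, for illustration only.
t = Latency(ssd_read=1.0, ssd_write=2.0, md_read=10.0, md_write=20.0)
counts = fd_operation_counts(read_hits=5, read_misses=2, write_hits=3,
                             write_misses=1, dirty_evictions=1)
cost_fd = fd_cost(*counts, t)
cost_md = md_cost(n_md_read=7, n_md_write=4, t=t)  # same 7 reads, 4 writes on MD only
```

Dividing `cost_fd` by `cost_md` gives a normalized cost of the kind reported relative to the MD-only baseline.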
For example, when newly accessed pages are admitted but the SSD is full, extra time is needed to flush (or evict) the dirty page(s) to MD. We consider such data movements between SSD and MD as Flash contents updating cost and include it in the overall I/O cost. Table 4 (c) further shows the actual average I/O response times (in microseconds) of various types of I/O operations on both SSD and MD devices. These results were measured on an Intel DC S3500 Series SSD with a capacity of 80 GB and a Western Digital WD20EURS-63S48Y0 MD with 2 TB and 5400 RPM. Note that the tested basic I/O size is specified as 4 KB according to the spatial granularity.

TABLE 4: Costs Calculation Inside Fusion Disk (SSD and MD)

(a) Operations for MD and FD I/O access costs
Case    Read hit    Read miss              Write hit    Write miss
MD      MD read     MD read                MD write     MD write
FD      SSD read    MD read + SSD write    SSD write    SSD write

(b) Operations for inner FD Flash update cost
Case                Cost
Evict dirty page    SSD read + MD write

(c) Measured average I/O response times (µs) of SSD and MD
Latency      T_SSDr    T_SSDw    T_MDr    T_MDw
4K Random

TABLE 5: I/O hit ratios (%) of different caching algorithms under the 10k-index-entries RAM size case.
Hit ratio    500MB    1GB    2GB    3GB    4GB
LRU
CLOCK
ARC
CAR
CART

Evaluation On Different SSD Sizes

In this section, we evaluate the effectiveness of the FD design by conducting trace-driven simulations. The actual RAM size is fixed to 10k index entries, with a deduplication chunk size of 8 KB and a segment size of 16 MB, while the SSD size varies from 500 MB to 4 GB. Different caching algorithms (e.g., LRU, CLOCK) are used to manage pages in SSDs. Table 5 shows I/O hit ratios under different SSD sizes when RAM is set to store at most 10k index entries. We first observe that all these algorithms achieve similar I/O hit ratios. Sophisticated algorithms like ARC, CAR and CART have slightly higher hit ratios than naive algorithms like LRU and CLOCK, because of their enhanced methods for avoiding being flushed by I/O spikes. In general, the larger the SSD is, the higher the hit ratio it obtains.
However, once the SSD has sufficient capacity to hold the active working sets of all traces, the improvement in I/O hit ratio becomes negligible.

TABLE 6: Total I/O response time costs normalized to the no-SSD structure design (%) under the 10k-index-entries RAM size case.
I/O Cost    500MB    1GB    2GB    3GB    4GB
LRU
CLOCK
ARC
CAR
CART

Similarly, Table 6 shows the normalized I/O operation costs under different SSD sizes (RAM's upper bound is 10k index entries), where the cost under the MD design is used as the baseline. We can see that our FD design is able to save almost 75% of I/O operation costs compared to the MD design. We interpret this benefit by observing that the FD design directs a large number of I/Os to the SSDs, which store hot data, and thus reduces the I/Os to the MD, and consequently