Avoiding the Disk Bottleneck in the Data Domain Deduplication File System
Benjamin Zhu, Data Domain, Inc.
Kai Li, Data Domain, Inc. and Princeton University
Hugo Patterson, Data Domain, Inc.

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment.

This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

1 Introduction

The massive storage requirements for data protection have presented a serious problem for data centers. Typically, data centers perform a weekly full backup of all the data on their primary storage systems to secondary storage devices, where they keep these backups for weeks to months. In addition, they may perform daily incremental backups that copy only the data which has changed since the last backup. The frequency, type and retention of backups vary for different kinds of data, but it is common for the secondary storage to hold 10 to 20 times more data than the primary storage. For disaster recovery, additional offsite copies may double the secondary storage capacity needed. If the data is transferred offsite over a wide area network, the network bandwidth requirement can be enormous.

Given the data protection use case, there are two main requirements for a secondary storage system storing backup data. The first is low cost, so that storing backups and moving copies offsite does not end up costing significantly more than storing the primary data. The second is high performance, so that backups can complete in a timely fashion. In many cases, backups must complete overnight so the load of performing backups does not interfere with normal daytime usage.

The traditional solution has been to use tape libraries as secondary storage devices and to transfer physical tapes for disaster recovery. Tape cartridges cost a small fraction of disk storage systems and they have good sequential transfer rates in the neighborhood of 100 MB/sec. But managing cartridges is a manual process that is expensive and error prone. It is quite common for restores to fail because a tape cartridge cannot be located or has been damaged during handling. Further, random access performance, needed for data restores, is extremely poor. Disk-based storage systems and network replication would be much preferred if they were affordable.

During the past few years, disk-based deduplication storage systems have been introduced for data protection [QD02, MCM01, KDLT04, Dat05, JDT05]. Such systems compress data by removing duplicate data across files and often across all the data in a storage system. Some implementations achieve a 20:1 compression ratio (total data size divided by physical space used) for 3 months of backup data using a daily-incremental and weekly-full backup policy. By substantially reducing the footprint of versioned data, deduplication can make the costs of storage on disk and tape comparable and make replicating data over a WAN to a remote site for disaster recovery practical.

The specific deduplication approach varies among system vendors. Certainly different approaches vary in how effectively they reduce data. But the goal of this paper is not to investigate how to get the greatest data reduction, but rather how to do deduplication at high speed in order to meet the performance requirement for secondary storage used for data protection.

The most widely used deduplication method for secondary storage, which we call Identical Segment Deduplication, breaks a data file or data stream into contiguous segments and eliminates duplicate copies of identical segments. Several emerging commercial systems have used this approach.

The focus of this paper is to show how to implement a high-throughput Identical Segment Deduplication storage system at low system cost. The key performance challenge is finding duplicate segments. Given a segment size of 8 KB and a performance target of 100 MB/sec, a deduplication system must process approximately 12,000 segments per second.

An in-memory index of all segment fingerprints could easily achieve this performance, but the size of the index would limit system size and increase system cost. Consider a segment size of 8 KB and a segment fingerprint size of 20 bytes. Supporting 8 TB worth of unique segments would require 20 GB just to store the fingerprints.

An alternative approach is to maintain an on-disk index of segment fingerprints and use a cache to accelerate segment index accesses. Unfortunately, a traditional cache would not be effective for this workload. Since fingerprint values are random, there is no spatial locality in segment index accesses. Moreover, because the backup workload streams large data sets through the system, there is very little temporal locality. Most segments are referenced just once every week during the full backup of one particular system. Reference-based caching algorithms such as LRU do not work well for such workloads. The Venti system, for example, implemented such a cache [QD02]. Its combination of index and block caches only improves its write throughput by about 16% (from 5.6 MB/sec to 6.5 MB/sec) even with 8 parallel disk index lookups. The primary reason is its low cache hit ratios.

With low cache hit ratios, most index lookups require disk operations. If each index lookup requires a disk access which may take 10 msec, and 8 disks are used for index lookups in parallel, the write throughput will be about 6.4 MB/sec, roughly corresponding to Venti's throughput of less than 6.5 MB/sec with 8 drives. While Venti's performance may be adequate for archival usage by a small workgroup, it is a far cry from the goal of deduplicating at 100 MB/sec to compete with high-end tape libraries. Achieving 100 MB/sec would require 125 disks doing index lookups in parallel! This would increase the system cost of deduplication storage to an unattainable level.

Our key idea is to use a combination of three methods to reduce the need for on-disk index lookups during the deduplication process. We present in detail each of the three techniques used in the production Data Domain deduplication file system. The first is to use a Bloom filter, which we call a Summary Vector, as the summary data structure to test if a segment is new to the system. It avoids wasted lookups for segments that do not exist in the index. The second is to store data segments and their fingerprints in the same order that they occur in a data file or stream. Such a Stream-Informed Segment Layout (SISL) creates spatial locality for segment and fingerprint accesses. The third, called Locality Preserved Caching, takes advantage of the segment layout to fetch and cache groups of segment fingerprints that are likely to be accessed together. A single disk access can result in many cache hits and thus avoid many on-disk index lookups.

Our evaluation shows that these techniques are effective in removing the disk bottleneck in an Identical Segment Deduplication storage system. For a system running on a server with two dual-core CPUs and one shelf of 15 drives, these techniques can eliminate about 99% of index lookups for variable-length segments with an average size of about 8 KB. We show that the system indeed delivers high throughput: achieving over 100 MB/sec for single-stream write and read performance, and over 210 MB/sec for multi-stream write performance. This is an order-of-magnitude improvement over the parallel indexing techniques presented in the Venti system.
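As a quick check of the index-size arithmetic in this section (using the 8 KB average segment size and 20-byte fingerprints given above):

\[
\frac{100\ \text{MB/sec}}{8\ \text{KB/segment}} \approx 12{,}500\ \text{segments/sec}, \qquad
\frac{8\ \text{TB}}{8\ \text{KB}} \times 20\ \text{bytes} = 10^{9} \times 20\ \text{bytes} = 20\ \text{GB},
\]

which matches the paper's figures of approximately 12,000 segments per second and 20 GB of fingerprints.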
The rest of the paper is organized as follows. Section 2 presents the challenges and observations in designing a deduplication storage system for data protection. Section 3 describes the software architecture of the production Data Domain deduplication file system. Section 4 presents our methods for avoiding the disk bottleneck. Section 5 shows our experimental results. Section 6 gives an overview of related work, and Section 7 draws conclusions.

2 Challenges and Observations

2.1 Variable vs. Fixed Length Segments

An Identical Segment Deduplication system could choose to use either fixed length segments or variable length segments created in a content dependent manner. Fixed length segments are the same as the fixed-size blocks of many non-deduplication file systems. For the purposes of this discussion, extents that are multiples of some underlying fixed size unit such as a disk sector are the same as fixed-size blocks.

Variable-length segments can be any number of bytes in length within some range. They are the result of partitioning a file or data stream in a content dependent manner [Man93, BDH94].

The main advantage of a fixed segment size is simplicity. A conventional file system can create fixed-size blocks in the usual way, and a deduplication process can then be applied to deduplicate those fixed-size blocks or segments. The approach is effective at deduplicating whole files that are identical, because every block of identical files will of course be identical.

In backup applications, single files are backup images that are made up of large numbers of component files. These files are rarely entirely identical even when they are successive backups of the same file system. A single addition, deletion, or change of any component file can easily shift the remaining image content. Even if no other file has changed, the shift would cause each fixed sized segment to be different than it was last time, containing some bytes from one neighbor and giving up some bytes to its other neighbor. The approach of partitioning the data into variable length segments based on content allows a segment to grow or shrink as needed, so the remaining segments can be identical to previously stored segments.

Even for storing individual files, variable length segments have an advantage. Many files are very similar to, but not identical to, other versions of the same file. Variable length segments can accommodate these differences and maximize the number of identical segments.

Because variable length segments are essential for deduplication of the shifted content of backup images, we have chosen them over fixed-length segments.

2.2 Segment Size

Whether fixed or variable sized, the choice of average segment size is difficult because of its impact on compression and performance. The smaller the segments, the more duplicate segments there will be. Put another way, if there is a small modification to a file, the smaller the segment, the smaller the new data that must be stored, and the more of the file's bytes will be in duplicate segments. Within limits, smaller segments will result in a better compression ratio.

On the other hand, with smaller segments, there are more segments to process, which reduces performance. At a minimum, more segments mean more times through the deduplication loop, but it is also likely to mean more on-disk index lookups.

With smaller segments, there are also more segments to manage. Since each segment requires the same metadata size, smaller segments will require a larger storage footprint for their metadata, and the segment fingerprints for fewer total user bytes can be cached in a given amount of memory. The segment index is larger. There are more updates to the index. To the extent that any data structures scale with the number of segments, they will limit the overall capacity of the system. Since commodity servers typically have a hard limit on the amount of physical memory in a system, the decision on segment size can greatly affect the cost of the system.

A well-designed deduplication storage system should have the smallest segment size possible given the throughput and capacity requirements for the product. After several iterative design processes, we have chosen to use 8 KB as the average segment size for the variable sized segments in our deduplication storage system.
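For illustration, here is a minimal sketch of content-defined segmentation in the spirit of the anchoring technique cited above [Man93]: a segment boundary is declared wherever the low bits of a rolling hash are all zero, so boundaries realign after insertions or deletions. The window size, mask, min/max bounds, and toy hash are hypothetical stand-ins; a production system would use incrementally computed Rabin fingerprints.

```python
WINDOW = 48          # bytes of context hashed at each position
MASK = 0x1FFF        # 13 zero bits -> ~8 KB average segment
MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024

def rolling_hash(window: bytes) -> int:
    # Toy polynomial hash; a real system would use Rabin fingerprints
    # updated incrementally in O(1) per byte.
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFF
    return h

def segments(data: bytes):
    """Yield variable-length segments whose boundaries depend on content."""
    start = i = 0
    while i < len(data):
        i += 1
        if i - start < MIN_SEG:
            continue
        if i - start >= MAX_SEG or \
           rolling_hash(data[i - WINDOW:i]) & MASK == 0:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]   # trailing partial segment
```

With a 13-bit mask, a boundary fires on average once every 2^13 = 8192 bytes, matching the 8 KB average segment size chosen above.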
2.3 Performance-Capacity Balance

A secondary storage system used for data protection must support a reasonable balance between capacity and performance. Since backups must complete within a fixed backup window time, a system with a given performance can only back up so much data within the backup window. Further, given a fixed retention period for the data being backed up, the storage system needs only so much capacity to retain the backups that can complete within the backup window. Conversely, given a particular storage capacity, backup policy, and deduplication efficiency, it is possible to compute the throughput that the system must sustain to justify the capacity. This balance between performance and capacity motivates the need to achieve good system performance with only a small number of disk drives.

Assume a backup policy of weekly fulls and daily incrementals with a retention period of 15 weeks, and a system that achieves a 20x compression ratio storing backups for such a policy. As a rough rule of thumb, it requires approximately as much capacity as the primary data to store all the backup images. That is, for 1 TB of primary data, a deduplication secondary storage system would consume approximately 1 TB of physical capacity to store the 15 weeks of backups.

Weekly full backups are commonly done over the weekend with a backup window of 16 hours. The balance of the weekend is reserved for restarting failed backups or making additional copies. Using the rule of thumb above, 1 TB of capacity can protect approximately 1 TB of primary data. All of that data must be backed up within the 16-hour backup window, which implies about 18 MB/sec of throughput per terabyte of capacity.

Following this logic, a system with a shelf of 15 SATA drives, each with a capacity of 500 GB and a total usable capacity after RAID, spares, and other overhead of 6 TB, could protect 6 TB of primary storage and must therefore be able to sustain over 100 MB/sec of deduplication throughput.
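The two throughput figures above follow from simple arithmetic (treating 1 TB as 10^12 bytes):

\[
\frac{10^{12}\ \text{bytes}}{16 \times 3600\ \text{s}} \approx 17.4\ \text{MB/sec} \approx 18\ \text{MB/sec per TB},
\qquad
6\ \text{TB} \times 18\ \text{MB/sec per TB} \approx 108\ \text{MB/sec}.
\]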

2.4 Fingerprint vs. Byte Comparisons

An Identical Segment Deduplication storage system needs a method to determine that two segments are identical. This could be done with a byte by byte comparison of the newly written segment with the previously stored segment. However, such a comparison is only possible by first reading the previously stored segment from disk. This would be much more onerous than looking up a segment in an index and would make it extremely difficult, if not impossible, to maintain the needed throughput. To avoid this overhead, we rely on comparisons of segment fingerprints to determine the identity of a segment. The fingerprint is a collision-resistant hash value computed over the content of each segment. SHA-1 is such a collision-resistant function [NIST95]. With a 160-bit output value, the probability of a fingerprint collision by a pair of different segments is extremely small, many orders of magnitude smaller than hardware error rates [QD02]. When data corruption occurs, it will almost certainly be the result of undetected errors in RAM, IO busses, network transfers, disk storage devices, other hardware components, or software errors, and not from a fingerprint collision.
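As a minimal illustration of fingerprint-based identity (the helper name here is ours, not an API from the paper): equal SHA-1 digests are treated as equal segments, with no byte-by-byte comparison.

```python
import hashlib

def fingerprint(segment: bytes) -> bytes:
    # 160-bit SHA-1 digest = the 20-byte fingerprint size assumed in Section 1.
    return hashlib.sha1(segment).digest()

assert fingerprint(b"same bytes") == fingerprint(b"same bytes")
assert len(fingerprint(b"x")) == 20
```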

3 Deduplication Storage System Architecture

To provide the context for presenting our methods for avoiding the disk bottleneck, this section describes the architecture of the production Data Domain File System, DDFS, for which Identical Segment Deduplication is an integral feature. Note that the methods presented in the next section are general and can apply to other Identical Segment Deduplication storage systems.

At the highest level, DDFS breaks a file into variable-length segments in a content dependent manner [Man93, BDH94] and computes a fingerprint for each segment. DDFS uses the fingerprints both to identify duplicate segments and as part of a segment descriptor used to reference a segment. It represents files as sequences of segment fingerprints. During writes, DDFS identifies duplicate segments and does its best to store only one copy of any particular segment. Before storing a new segment, DDFS uses a variation of the Ziv-Lempel algorithm to compress the segment [ZL77].

Figure 1: Data Domain File System architecture.

Figure 1 is a block diagram of DDFS, which is made up of a stack of software components. At the top of the stack, DDFS supports multiple access protocols which are layered on a common File Services interface. Supported protocols include NFS, CIFS, and a virtual tape library interface (VTL).

When a data stream enters the system, it goes through one of the standard interfaces to the generic File Services layer, which manages the name space and file metadata. The File Services layer forwards write requests to Content Store, which manages the data content within a file. Content Store breaks a data stream into segments, uses Segment Store to perform deduplication, and keeps track of the references for a file. Segment Store does the actual work of deduplication. It packs deduplicated (unique) segments into relatively large units, compresses such units using a variation of the Ziv-Lempel algorithm to further compress the data, and then writes the compressed results into containers supported by Container Manager.

To read a data stream from the system, a client drives the read operation through one of the standard interfaces and the File Services layer. Content Store uses the references to deduplicated segments to deliver the desired data stream to the client. Segment Store prefetches, decompresses, reads, and caches data segments from Container Manager.

The following describes Content Store, Segment Store, and Container Manager in detail and discusses our design decisions.

3.1 Content Store

Content Store implements byte-range writes and reads for deduplicated data objects, where an object is a linear sequence of client data bytes and has intrinsic and client-settable attributes or metadata. An object may be a conventional file, a backup image of an entire volume, or a tape cartridge.

To write a range of bytes into an object, Content Store performs several operations.

Anchoring partitions the byte range into variable-length segments in a content dependent manner [Man93, BDH94].

Segment fingerprinting computes the SHA-1 hash and generates the segment descriptor based on it. Each segment descriptor contains per segment information of at least fingerprint and size.

Segment mapping builds the tree of segments that records the mapping between object byte ranges and segment descriptors. The goal is to represent a data object using references to deduplicated segments.

To read a range of bytes in an object, Content Store traverses the tree of segments created by the segment mapping operation above to obtain the segment descriptors for the relevant segments. It fetches the segments from Segment Store and returns the requested byte range to the client.

3.2 Segment Store

Segment Store is essentially a database of segments keyed by their segment descriptors. To support writes, it accepts segments with their segment descriptors and stores them. To support reads, it fetches segments designated by their segment descriptors.

To write a data segment, Segment Store performs several operations.

Segment filtering determines if a segment is a duplicate. This is the key operation to deduplicate segments and may trigger disk I/Os, thus its overhead can significantly impact throughput performance.

Container packing adds segments to be stored to a container, which is the unit of storage in the system. The packing operation also compresses segment data using a variation of the Ziv-Lempel algorithm. A container, when fully packed, is appended to the Container Manager.

Segment indexing updates the segment index that maps segment descriptors to the container holding the segment, after the container has been appended to the Container Manager.

To read a data segment, Segment Store performs the following operations.

Figure 2: Containers are self-describing, immutable units of storage several megabytes in size. All segments are stored in containers.

Segment lookup finds the container storing the requested segment. This operation may trigger disk I/Os to look in the on-disk index, thus it is throughput sensitive.

Container retrieval reads the relevant portion of the indicated container by invoking the Container Manager.

Container unpacking decompresses the retrieved portion of the container and returns the requested data segment.

3.3 Container Manager

The Container Manager provides a storage container log abstraction, not a block abstraction, to Segment Store. Containers, shown in Figure 2, are self-describing in that a metadata section includes the segment descriptors for the stored segments. They are immutable in that new containers can be appended and old containers deleted, but containers cannot be modified once written. When Segment Store appends a container, the Container Manager returns a container ID which is unique over the life of the system.

The Container Manager is responsible for allocating, deallocating, reading, writing, and reliably storing containers. It supports reads of the metadata section or a portion of the data section, but it only supports appends of whole containers. If a container is not full but needs to be written to disk, it is padded out to its full size.

Container Manager is built on top of standard block storage. Advanced techniques such as software RAID-6, continuous data scrubbing, container verification, and end to end data checks are applied to ensure a high level of data integrity and reliability.

The container abstraction offers several benefits. The fixed container size makes container allocation and deallocation easy. The large granularity of a container write achieves high disk throughput utilization. A properly sized container allows efficient full-stripe RAID writes, which enables an efficient software RAID implementation at the storage layer.
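To make the container format of Figure 2 concrete, here is a minimal sketch under stated assumptions: the type and field names are invented for illustration, and zlib (an LZ77-family compressor) stands in for the paper's "variation of the Ziv-Lempel algorithm".

```python
import hashlib
import zlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SegmentDescriptor:
    fingerprint: bytes   # 20-byte SHA-1 of the segment content
    size: int            # uncompressed segment size in bytes

@dataclass
class Container:
    container_id: int                     # unique over the life of the system
    metadata: list[SegmentDescriptor] = field(default_factory=list)
    data: bytes = b""                     # compressed concatenation of segments

def pack_container(container_id: int, segments: list[bytes]) -> Container:
    # Self-describing: the metadata section lists a descriptor per segment,
    # in stream order; the data section holds the compressed segment bytes.
    descriptors = [SegmentDescriptor(hashlib.sha1(s).digest(), len(s))
                   for s in segments]
    return Container(container_id, descriptors,
                     zlib.compress(b"".join(segments)))
```

Note the asymmetry the paper relies on: the metadata section is small and can be read on its own, while the container as a whole is immutable and append-only.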

4 Acceleration Methods

This section presents three methods to accelerate the deduplication process in our deduplication storage system: the summary vector, stream-informed layout, and locality preserved caching. The combination of these methods allows our system to avoid about 99% of the disk I/Os required by a system relying on index lookups alone. The following describes each of the three techniques in detail.

4.1 Summary Vector

The purpose of the Summary Vector is to reduce the number of times that the system goes to disk to look for a duplicate segment only to find that none exists. One can think of the Summary Vector as an in-memory, conservative summary of the segment index. If the Summary Vector indicates that a segment is not in the index, then there is no point in looking further for the segment; the segment is new and should be stored. On the other hand, being only an approximation of the index, if the Summary Vector indicates a segment is in the index, there is a high probability that the segment is actually in the segment index, but there is no guarantee.

The Summary Vector implements the following operations:

Init()
Insert(fingerprint)
Lookup(fingerprint)

We use a Bloom filter to implement the Summary Vector in our current design [Blo70]. A Bloom filter uses a vector of m bits to summarize the existence information about n fingerprints in the segment index. In Init(), all bits are set to 0. Insert(a) uses k independent hashing functions, h1, ..., hk, each mapping a fingerprint to [0, m-1], and sets the bits at positions h1(a), ..., hk(a) to 1. For any fingerprint x, Lookup(x) will check all bits at positions h1(x), ..., hk(x) to see if they are all set to 1. If any of the bits is 0, then we know x is definitely not in the segment index. Otherwise, with high probability, x will be in the segment index, assuming reasonable choices of m, n, and k. Figure 3 illustrates the operations of the Summary Vector.

Figure 3: Summary Vector operations. The Summary Vector can identify most new segments without looking up the segment index. Initially all bits in the array are 0. On insertion, shown in (a), the bits specified by several hashes, h1, h2, h3, of the fingerprint of the segment are set to 1. On lookup, shown in (b), the bits specified by the same hashes are checked. If any are 0, as shown in this case, the segment cannot be in the system.

As indicated in [FCAB98], the probability of a false positive for an element not in the set, or the false positive rate, can be calculated in a straightforward fashion, given our assumption that hash functions are perfectly random. After all n elements are hashed and inserted into the Bloom filter, the probability that a specific bit is still 0 is

\[
\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}.
\]

The probability of a false positive is then

\[
\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.
\]

Using this formula, one can derive particular parameters to achieve a given false positive rate. For example, to achieve a 2% false positive rate, the smallest size of the Summary Vector is 8n bits (m/n = 8) and the number of hash functions can be 4 (k = 4). To have a fairly small probability of false positives, such as a fraction of a percent, we choose m such that m/n is about 8 for the target goal of n, and k around 4 or 5. For example, supporting one billion base segments requires about 1 GB of memory for the Summary Vector.

At system shutdown, the system writes the Summary Vector to disk. At startup, it reads in the saved copy. To handle power failures or other kinds of unclean shutdowns, the system periodically checkpoints the Summary Vector to disk. To recover, the system loads the most recent checkpoint of the Summary Vector and then processes the containers appended to the container log since the checkpoint, adding the contained segments to the Summary Vector.
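A minimal sketch of the Summary Vector as a basic Bloom filter follows. Deriving the k bit positions by slicing the 20-byte SHA-1 fingerprint is an implementation choice assumed here, not something the paper specifies.

```python
class SummaryVector:
    def __init__(self, m_bits: int, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, fp: bytes):
        # Carve k indices out of the 160-bit fingerprint; a 20-byte
        # fingerprint supports up to k = 5 four-byte slices.
        for i in range(self.k):
            yield int.from_bytes(fp[4 * i: 4 * i + 4], "big") % self.m

    def insert(self, fp: bytes) -> None:
        for p in self._positions(fp):
            self.bits[p // 8] |= 1 << (p % 8)

    def lookup(self, fp: bytes) -> bool:
        # False => definitely new; True => probably already in the index.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fp))
```

With m/n of about 8 and k = 4, this corresponds to the roughly 2% false positive configuration derived above; one billion base segments would need about 8 x 10^9 bits, i.e., the 1 GB figure cited in the text.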

Although several variations of Bloom filters have been proposed during the past few years [BM05], we have chosen the basic Bloom filter for simplicity and efficient implementation.

4.2 Stream-Informed Segment Layout

We use Stream-Informed Segment Layout (SISL) to create spatial locality for both segment data and segment descriptors and to enable Locality Preserved Caching as described in the next section. A stream here is just the sequence of bytes that make up a backup image stored in a Content Store object. Our main observation is that in backup applications, segments tend to reappear in the same or very similar sequences with other segments. Consider a 1 MB file with a hundred or more segments. Every time that file is backed up, the same sequence of a hundred segments will appear. If the file is modified slightly, there will be some new segments, but the rest will appear in the same order. When new data contains a duplicate segment x, there is a high probability that other segments in its locale are duplicates of the neighbors of x. We call this property segment duplicate locality. SISL is designed to preserve this locality.

Content Store and Segment Store support a stream abstraction that segregates the segments created for different objects, preserves the logical ordering of segments within a Content Store object, and dedicates containers to hold segments for a single stream in their logical order. The metadata sections of these containers store the segment descriptors in their logical order. Multiple streams can be written to Segment Store in parallel, but the stream abstraction prevents segments for different streams from being jumbled together in a container. The design decision to make the deduplication storage system stream aware is a significant distinction from other systems such as Venti.

When an object is opened for writing, Content Store opens a corresponding stream with Segment Store, which in turn assigns a container to the stream. Content Store writes ordered batches of segments for the object to the stream. Segment Store packs the new segments into the data section of the dedicated container, performs a variation of Ziv-Lempel compression on the data section, and writes the segment descriptors into the metadata section of the container. When a container fills up, Segment Store appends it with the Container Manager and starts a new container for the stream. Because multiple streams can write to Segment Store in parallel, there may be multiple open containers, one for each active stream.

The end result is Stream-Informed Segment Layout, or SISL, because for a stream, new segment data are stored together in data sections, and their segment descriptors are stored together in metadata sections. SISL offers many benefits.

When multiple segments of the same stream are written to a container together, many fewer disk I/Os are needed to reconstruct the stream, which helps the system achieve high read throughput.

Descriptors and compressed data of adjacent new segments in the same stream are packed linearly in the metadata and data sections, respectively, of the same container. This packing captures duplicate locality for future streams resembling this stream, and enables Locality Preserved Caching to work effectively.

The metadata section is stored separately from the data section, and is generally much smaller than the data section. For example, a container size of 4 MB, an average segment size of 8 KB, and a Ziv-Lempel compression ratio of 2 yield about 1K segments in a container, and require a metadata section size of just about 64 KB, at a segment descriptor size of 64 bytes. The small granularity of container metadata section reads allows Locality Preserved Caching to operate in a highly efficient manner: 1K segments can be cached using a single small disk I/O. This contrasts with the old way of one on-disk index lookup per segment.

These advantages make SISL an effective mechanism for deduplicating multiple streams of fine-grained segments. Packing containers in a stream aware fashion distinguishes our system from Venti and many other systems.
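The container sizing example above works out as follows: a 4 MB data section holding segments compressed 2:1 from an 8 KB average corresponds to

\[
\frac{4\ \text{MB} \times 2}{8\ \text{KB}} = 1024\ \text{segments per container}, \qquad
1024 \times 64\ \text{B} = 64\ \text{KB of metadata}.
\]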
4.3 Locality Preserved Caching

We use Locality Preserved Caching (LPC) to accelerate the process of identifying duplicate segments. A traditional cache does not work well for caching fingerprints, hashes, or descriptors for duplicate detection because fingerprints are essentially random. Since it is difficult to predict the index location for the next segment without going through the actual index access again, the miss ratio of a traditional cache will be extremely high. We apply LPC to take advantage of segment duplicate locality, so that if a segment is a duplicate, the base segment is highly likely to be cached already. LPC is achieved by combining the container abstraction with a segment cache, as discussed next.

For segments that cannot be resolved by the Summary Vector and LPC, we resort to looking up the segment in the segment index. We have two goals for this retrieval:

Making this retrieval a relatively rare occurrence.

Whenever a retrieval is made, it benefits the segment filtering of future segments in the locale.

LPC implements a segment cache to cache likely base segment descriptors for future duplicate segments. The segment cache maps a segment fingerprint to its corresponding container ID. Our main idea is to maintain the segment cache by groups of fingerprints. On a miss, LPC will fetch the entire metadata section of a container, insert all fingerprints in the metadata section into the cache, and remove all fingerprints of an old metadata section from the cache together. This method preserves the locality of the fingerprints of a container in the cache. The operations for the segment cache are:

Init(): Initialize the segment cache.

Insert(container): Iterate through all segment descriptors in the container metadata section, and insert each descriptor and container ID into the segment cache.

Remove(container): Iterate through all segment descriptors in the container metadata section, and remove each descriptor and container ID from the segment cache.

Lookup(fingerprint): Find the corresponding container ID for the fingerprint specified.

Descriptors of all segments in a container are added to or removed from the segment cache at once. Segment caching is typically triggered by a duplicate segment that misses in the segment cache and requires a lookup in the segment index. As a side effect of finding the corresponding container ID in the segment index, we prefetch all segment descriptors in this container into the segment cache. We call this Locality Preserved Caching. The intuition is that base segments in this container are likely to be checked against for future duplicate segments, based on segment duplicate locality. Our results on real world data have validated this intuition overwhelmingly.

We have implemented the segment cache using a hash table. When the segment cache is full, containers that are ineffective in accelerating segment filtering are the leading candidates for replacement from the segment cache. A reasonable cache replacement policy is Least-Recently-Used (LRU) on cached containers.

4.4 Accelerated Segment Filtering

We have combined all three techniques above in the segment filtering phase of our implementation. For an incoming segment on a write, the algorithm does the following (see the sketch after this list):

Check to see if it is in the segment cache. If it is in the cache, the incoming segment is a duplicate.

If it is not in the segment cache, check the Summary Vector. If it is not in the Summary Vector, the segment is new. Write the new segment into the current container.

If it is in the Summary Vector, look up the segment index for its container ID. If it is in the index, the incoming segment is a duplicate; insert the metadata section of that container into the segment cache. If the segment cache is full, remove the metadata section of the least recently used container first. If it is not in the segment index, the segment is new. Write the new segment into the current container.

We aim to keep segment index lookups to a minimum in segment filtering.
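A minimal sketch of this filtering loop, building on the SummaryVector sketched earlier. The on-disk index and Container Manager are hypothetical stand-ins with invented method names (index.lookup, container_mgr.metadata_section); writing new segments and updating the Summary Vector are left to the caller.

```python
from collections import OrderedDict

class SegmentFilter:
    def __init__(self, summary, index, container_mgr, cache_containers=1024):
        self.summary = summary            # SummaryVector from the sketch above
        self.index = index                # on-disk map: fingerprint -> container ID
        self.cm = container_mgr           # fetches container metadata sections
        self.cache = {}                   # segment cache: fingerprint -> container ID
        self.lru = OrderedDict()          # container ID -> fingerprints it contributed
        self.max_containers = cache_containers

    def _insert_container(self, cid: int) -> None:
        # Insert/remove whole metadata sections at once, as in Section 4.3.
        fps = [d.fingerprint for d in self.cm.metadata_section(cid)]
        if len(self.lru) >= self.max_containers:
            _, evicted = self.lru.popitem(last=False)   # LRU container out
            for fp in evicted:
                self.cache.pop(fp, None)
        self.lru[cid] = fps
        for fp in fps:
            self.cache[fp] = cid

    def is_duplicate(self, fp: bytes) -> bool:
        if fp in self.cache:               # LPC hit: duplicate, no disk I/O
            self.lru.move_to_end(self.cache[fp])
            return True
        if not self.summary.lookup(fp):    # definitely new: skip the index
            return False
        cid = self.index.lookup(fp)        # the rare on-disk index lookup
        if cid is None:
            return False                   # Bloom filter false positive: new
        self._insert_container(cid)        # prefetch whole metadata section
        return True
```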

5 Experimental Results

We would like to answer the following questions:

How well does a deduplication storage system work with real world data sets?

How effective are the three techniques in terms of reducing disk I/O operations?

What throughput can a deduplication storage system using these techniques achieve?

For the first question, we will report our results with real world data from two customer data centers. For the next two questions, we conducted experiments with several internal data sets.

Our experiments use a Data Domain DD580 deduplication storage system as an NFS v3 server [PJSS*94]. This deduplication system features two-socket dual-core CPUs running at 3 GHz, a total of 8 GB system memory, 2 gigabit NIC cards, and a 15-drive disk subsystem running software RAID6 with one spare drive. We use 1 to 4 backup client computers running the NFS v3 client for sending data.
5.1 Results with Real World Data

The system described in this paper has been used at over 1,000 data centers. The following paragraphs report deduplication results from two data centers, generated from the auto-support mechanism of the system.

Figure 4: Logical/Physical Capacities at Data Center A.

Figure 5: Compression Ratios at Data Center A.

Table 1: Statistics on Daily Global and Daily Local Compression Ratios at Data Center A

                           Min     Max     Average   Standard deviation
Daily global compression   10.05   74.31   40.63     13.73
Daily local compression     1.58    1.97    1.78      0.09

Data center A backs up structured database data over a course of 31 days during the initial deployment of the deduplication system. The backup policy is to do daily full backups, where each full backup produces over 600 GB of data at steady state. There are two exceptions:

During the initial seeding phase (until the 6th day in this example), different data or different data types are rolled into the backup set, as backup administrators figure out how they want to use the deduplication system. A low rate of duplicate segment identification and elimination is typically associated with the seeding phase.

There are certain days (the 18th day in this example) when no backup is generated.

Figure 4 shows the logical capacity (the amount of data from the user or backup application perspective) and physical capacity (the amount of data stored in the disk media) of the system over time at data center A. At the end of the 31st day, the data center has backed up about 16.9 TB of data, corresponding to a physical capacity of less than 440 GB, reaching a total compression ratio of 38.54 to 1.

Figure 5 shows the daily global compression ratio (the daily rate of data reduction due to duplicate segment elimination), the daily local compression ratio (the daily rate of data reduction due to Ziv-Lempel style compression on new segments), the cumulative global compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination), and the cumulative total compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination and Ziv-Lempel style compression on new segments) over time. At the end of the 31st day, the cumulative global compression ratio reaches 22.53 to 1, and the cumulative total compression ratio reaches 38.54 to 1. The daily global compression ratios change quite a bit over time, whereas the daily local compression ratios are quite stable. Table 1 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding the seeding days (through day 6) and the no-backup day (day 18).

Data center B backs up a mixture of structured database data and unstructured file system data over a course of 48 days during the initial deployment of the deduplication system, using both full and incremental backups. Similar to data center A, seeding lasts until the 6th day, and there are a few days without backups (the 8th, 12th-14th, and 35th days). Outside these days, the maximum daily logical backup size is about 2.1 TB, and the smallest size is about 50 GB.

Figure 6 shows the logical capacity and physical capacity of the system over time at data center B. At the end of the 48th day, the logical capacity reaches about 41.4 TB, corresponding to a physical capacity of about 3.0 TB. The total compression ratio is 13.71 to 1.

Figure 7 shows the daily global compression ratio, daily local compression ratio, cumulative global compression ratio, and cumulative total compression ratio over time. At the end of the 48th day, the cumulative global compression reaches 6.85, while the cumulative total compression reaches 13.71.
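A consistency check worth noting (our observation, not the paper's): the cumulative total compression is the product of the cumulative global and local compression ratios,

\[
38.54 \approx 22.53 \times 1.71 \ \ (\text{data center A}), \qquad
13.71 \approx 6.85 \times 2.0 \ \ (\text{data center B}),
\]

which agrees with the roughly 2:1 local compression reported throughout.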

Figure 6: Logical/Physical Capacities at Data Center B.

Figure 7: Compression Ratios at Data Center B.

Table 2: Statistics on Daily Global and Daily Local Compression Ratios at Data Center B

                           Min    Max     Average   Standard deviation
Daily global compression   5.09   45.16   13.92     9.08
Daily local compression    1.40    4.13    2.33     0.57

Table 2 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding the seeding days and the days without backup.

The two sets of results show that the deduplication storage system works well with real world data sets. As expected, both cumulative global and cumulative total compression ratios increase as the system holds more backup data. During seeding, duplicate segment elimination tends to be ineffective, because most segments are new. After seeding, despite large variation in the actual numbers, duplicate segment elimination becomes extremely effective. Independent of seeding, Ziv-Lempel style compression is relatively stable, giving a data reduction of about 2 over time. The real world observations on the applicability of duplicate segment elimination during seeding and after seeding are particularly relevant in evaluating our techniques to reduce disk accesses below.

5.2 I/O Savings with Summary Vector and Locality Preserved Caching

To determine the effectiveness of the Summary Vector and Locality Preserved Caching, we examine the savings in disk reads to find duplicate segments using the Summary Vector and Locality Preserved Caching. We use two internal data sets for our experiment. One is daily full backups of a company-wide Exchange information store over a 135-day period. The other is weekly full and daily incremental backups of an Engineering department over a 100-day period. Table 3 summarizes the key attributes of these two data sets.

Table 3: Capacities and Compression Ratios of the Exchange and Engineering Datasets

                                                     Exchange   Engineering
Logical capacity (TB)                                2.76       2.54
Physical capacity after deduplicating segments (TB)  0.49       0.50
Global compression                                   5.69       5.04
Physical capacity after local compression (TB)       0.22       0.261
Local compression                                    2.17       1.93
Total compression                                    12.36      9.75

These internal data sets are generated from production usage (albeit internal). We also observe that the various compression ratios produced by the internal data sets are relatively similar to those of the real world examples examined in section 5.1. We believe these internal data sets are reasonable proxies for real world deployments.

Each of the backup data sets is sent to the deduplicating storage system in a single backup stream. With respect to the deduplication storage system, we measure the number of disk reads for segment index lookups and locality prefetches needed to find duplicates during the writes, for four cases: (1) with neither Summary Vector nor Locality Preserved Caching; (2) with Summary Vector only; (3) with Locality Preserved Caching only; and (4) with both Summary Vector and Locality Preserved Caching. The results are shown in Table 4.

Clearly, Summary Vector and Locality Preserved Caching combined have produced an astounding reduction in disk reads. The Summary Vector alone reduces index lookup disk I/Os by about 16.5% and 18.6% for Exchange and Engineering respectively. Locality Preserved Caching alone reduces index lookup disk I/Os by about 82.4% and 81% for Exchange and Engineering respectively. Together they are able to reduce index lookup disk I/Os by 98.94% and 99.6% respectively.

In general, the Summary Vector is very effective for new data, and Locality Preserved Caching is highly effective for little or moderately changed data. For backup data, the first full backup (the seeding equivalent) does not have as many duplicate segments as subsequent full backups. As a result, the Summary Vector is effective at avoiding disk I/Os for index lookups during the first full backup, whereas Locality Preserved Caching is highly beneficial for subsequent full backups. This result also suggests that these two data sets exhibit good duplicate locality.

                                             Exchange                   Engineering
                                             # disk I/Os    % total     # disk I/Os    % total
no Summary Vector and
no Locality Preserved Caching                328,613,503    100.00%     318,236,712    100.00%
Summary Vector only                          274,364,788     83.49%     259,135,171     81.43%
Locality Preserved Caching only               57,725,844     17.57%      60,358,875     18.97%
Summary Vector and
Locality Preserved Caching                     3,477,129      1.06%       1,257,316      0.40%

Table 4: Index and locality reads. This table shows the number of disk reads to perform index lookups or fetches from container metadata for four combinations: with and without the Summary Vector, and with and without Locality Preserved Caching. Without either the Summary Vector or Locality Preserved Caching, there is an index read for every segment. The Summary Vector avoids these reads for most new segments.
Locality Preserved Caching avoids index lookups for duplicate segments at the cost of an extra read to fetch a group of segment fingerprints from the container metadata for every cache miss for which the segment is found in the index.

5.3 Throughput

To determine the throughput of the deduplication storage system, we used a synthetic data set driven by client computers. The synthetic data set was developed to model backup data from multiple backup cycles and multiple backup streams, where each backup stream can be generated on the same or a different client computer. The data set is made up of synthetic data generated on the fly from one or more backup streams. Each backup stream is made up of an ordered series of synthetic data versions, where each successive version (a "generation") is a somewhat modified copy of the preceding generation in the series. The generation-to-generation modifications include: reordering of data, deletion of existing data, and addition of new data. Single-client backup over time is simulated when the synthetic data generations from a backup stream are written to the deduplication storage system in generation order, where significant amounts of data are unchanged day-to-day or week-to-week, but where small changes continually accumulate. Multi-client backup over time is simulated when synthetic data generations from multiple streams are written to the deduplication system in parallel, each stream in generation order. (A minimal sketch of such a generator appears at the end of this subsection.)

There are two main advantages of using a synthetic data set. The first is that various compression ratios can be built into the synthetic data model, and usages approximating various real world deployments can be tested easily in house. The second is that one can use relatively inexpensive client computers to generate an arbitrarily large amount of synthetic data in memory, without disk I/Os, and write it in one stream to the deduplication system at more than 100 MB/s. Multiple cheap client computers can combine in multiple streams to saturate the intake of the deduplication system in a switched network environment. We found it both much more costly and more technically challenging to accomplish the same feat using traditional backup software, high-end client computers attached to primary storage arrays as backup clients, and high-end servers as media/backup servers.

In our experiments, we choose an average generation (daily equivalent) global compression ratio of 30, and an average generation (daily equivalent) local compression ratio of 2 to 1 for each backup stream. These numbers seem plausible given the real world examples in section 5.1. We measure throughput for one backup stream using one client computer and for 4 backup streams using two client computers, for writes and reads, over 10 generations of backup data sets. The results are shown in Figures 8 and 9.
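A minimal sketch of such a generation-based generator follows. The block granularity and mutation rates are hypothetical knobs, not the parameters used in the paper's experiments; tuning them sets the built-in global compression ratio.

```python
import random

BLOCK = 64 * 1024  # granularity at which the stream is mutated/shuffled

def next_generation(prev: list[bytes], rng: random.Random,
                    delete_frac=0.01, add_frac=0.02, swap_frac=0.01) -> list[bytes]:
    """Produce a somewhat modified copy of the previous generation."""
    gen = [b for b in prev if rng.random() >= delete_frac]       # deletions
    for _ in range(int(len(prev) * add_frac)):                   # additions
        gen.insert(rng.randrange(len(gen) + 1), rng.randbytes(BLOCK))
    for _ in range(int(len(gen) * swap_frac)):                   # reordering
        i, j = rng.randrange(len(gen)), rng.randrange(len(gen))
        gen[i], gen[j] = gen[j], gen[i]
    return gen

# e.g., generation 0 of one stream: ~1 GB of fresh random blocks
rng = random.Random(0)
gen = [rng.randbytes(BLOCK) for _ in range(16_000)]
for _ in range(9):            # generations 1..9
    gen = next_generation(gen, rng)
```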

Figure 8: Write Throughput of a Single Backup Client and 4 Backup Clients.

Figure 9: Read Throughput of a Single Backup Client and 4 Backup Clients.

The deduplication system delivers high write throughput for both cases. In the single stream case, the system achieves a write throughput of 110 MB/sec for generation 0 and over 113 MB/sec for generations 1 through 9. In the 4 stream case, the system achieves a write throughput of 139 MB/sec for generation 0 and a sustained 217 MB/sec for generations 1 through 9. Write throughput for generation 0 is lower because all segments are new and require Ziv-Lempel style compression by the CPUs of the deduplication system.

The system delivers high read throughput for the single stream case. Throughout all generations, the system achieves over 100 MB/sec read throughput. For the 4 stream case, the read throughput is 211 MB/sec for generation 0, 192 MB/sec for generation 1, 165 MB/sec for generation 2, and stays at around 140 MB/sec for future generations. The main reason for the decrease of read throughput in later generations is that future generations have more duplicate segments than the first few. However, the read throughput stays at about 140 MB/sec for later generations because of Stream-Informed Segment Layout and Locality Preserved Caching.

Note that write throughput has historically been valued more than read throughput for the backup use case, since a backup has to complete within a specified backup window time period and is a much more frequent event than a restore. Read throughput is still very important, especially in the case of whole system restores.

5.4 Discussion

The techniques presented in this paper are general methods to improve the performance of deduplication storage systems. Although our system divides a data stream into content-based segments, these methods can also apply to a system using fixed aligned segments such as Venti.

As a side note, we have compared the compression ratios of a system segmenting data streams by contents (about 8 Kbytes on average) with another system using fixed and aligned 8 Kbyte segments on the Engineering and Exchange backup data sets. We found that the fixed segment and alignment approach gets basically no global compression (global compression ratio: 1.01) for the Engineering data, whereas the system with content-based segmentation gets a lot of global compression (6.39:1). The main reason for the difference is that the backup software creates the backup data set without realigning data at file boundaries. For the Exchange backup data set, where the backup software aligns data at individual mailboxes, the global compression difference is less (6.61:1 vs. 10.28:1), though there is still a significant gap.

Fragmentation will become more severe for long term retention, and can reduce the effectiveness of Locality Preserved Caching. We have investigated mechanisms to reduce fragmentation and sustain high write and read throughput, but these mechanisms are beyond the scope of this paper.

6 Related Work

Much work on deduplication has focused on basic methods and compression ratios, not on high throughput. Early deduplication storage systems use file-level hashing to detect duplicate files and reclaim their storage space [ABCC*02, TKSK*03, KDLT04]. Since such systems also use file hashes to address files, some call such systems content addressed storage, or CAS. Since their deduplication is at the file level, such systems can achieve only limited global compression.

Venti removes duplicate fixed-size blocks by comparing their secure hashes [QD02]. It uses a large on-disk index with a straightforward index cache to look up fingerprints. Since fingerprints have no locality, its index cache is not effective. When using 8 disks to look up fingerprints in parallel, its throughput is still limited to less than 7 MB/sec. Venti used a container abstraction to lay out data on disks, but it was stream agnostic and did not apply Stream-Informed Segment Layout.

To tolerate shifted contents, modern deduplication systems remove redundancies at variable-size blocks divided based on their contents. Manber described a method to determine anchor points in a large file when certain bits of rolling fingerprints are zeros [Man93] and showed that Rabin fingerprints [Rab81, Bro93] can be computed efficiently. Brin et al. [BDH94] described several ways to divide a file into content-based segments and use such segments to detect duplicates in digital documents. Removing duplication at the content-based segment level has been applied to network protocols and applications [SW00, SCPC*02, RLB03, MCK04] and has reduced network traffic for distributed file systems [MCM01, JDT05]. Kulkarni et al. evaluated the compression efficiency of an identity-based approach (fingerprint comparison of variable-length segments) versus a delta-compression approach [KDLT04]. These studies have not addressed deduplication throughput issues.

The idea of using a Bloom filter [Blo70] to implement the Summary Vector is inspired by the summary data structure for the proxy cache in [FCAB98]. Their work also provided an analysis of the false positive rate. In addition, Broder and Mitzenmacher wrote an excellent survey on network applications of Bloom filters [BM05]. The TAPER system used a Bloom filter to detect duplicates instead of detecting whether a segment is new [JDT05]. It did not investigate throughput issues.

7 Conclusions

This paper presents a set of techniques to substantially reduce disk I/Os in high-throughput deduplication storage systems. Our experiments show that the combination of these techniques can achieve over 210 MB/sec with 4 parallel write streams and over 140 MB/sec with 4 read streams on a storage server with two dual-core processors and one shelf of 15 drives.

We have shown that the Summary Vector alone can reduce disk index lookups by about 17%, and Locality Preserved Caching alone can reduce disk index lookups by over 80%, but the combined caching techniques can reduce disk index lookups by about 99%. Stream-Informed Segment Layout is an effective abstraction to preserve spatial locality and enable Locality Preserved Caching. These techniques are general methods to improve the performance of deduplication storage systems.

Our techniques for minimizing disk I/Os to achieve good deduplication performance match well against the industry trend of building many-core processors. With quad-core CPUs already available, and eight-core CPUs just around the corner, it will be a relatively short time before a large-scale deduplication storage system shows up with 400 ~ 800 MB/sec throughput with a modest amount of physical memory.

8 References

[ABCC*02] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of USENIX Operating Systems Design and Implementation (OSDI), December 2002.

[BM05] Andrei Z. Broder and Michael Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2005.

[BDH94] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. 1994; also in Proceedings of ACM SIGMOD, 1995.

[Blo70] Burton H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7), pages 422-426, 1970.

[Bro93] A. Z. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communication, Security, and Computer Science, Springer-Verlag, 1993.

[JDT05] N. Jain, M. Dahlin, and R. Tewari. TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2005.
[Dat05] Data Domain. Data Domain Appliance Series: High-Speed Inline Deduplication Storage, 2005. http://www.datadomain.com/products/appliances.html

[FCAB98] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In Proceedings of ACM SIGCOMM '98, Vancouver, Canada, 1998.

[KDLT04] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy Elimination Within Large Collections of Files. In Proceedings of the USENIX Annual Technical Conference, pages 59-72, 2004.

[Man93] Udi Manber. Finding Similar Files in A Large File System. Technical Report TR 93-33, Department of Computer Science, University of Arizona, October 1993; also in Proceedings of the USENIX Winter 1994 Technical Conference, pages 17-21, 1994.

[MCK04] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In Proceedings of Networked Systems Design and Implementation, 2004.

[MCM01] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A Low-bandwidth Network File System. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, Canada, October 2001.

[NIST95] National Institute of Standards and Technology, FIPS 180-1. Secure Hash Standard. US Department of Commerce, April 1995.

[PJSS*94] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS Version 3 Design and Implementation. In Proceedings of the USENIX Summer 1994 Technical Conference, 1994.

[QD02] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), January 2002.

[Rab81] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

[RLB03] S. C. Rhea, K. Liang, and E. Brewer. Value-based web caching. In WWW, pages 619-628, 2003.

[SCPC*02] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proceedings of USENIX Operating Systems Design and Implementation, 2002.

[SW00] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, pages 87-95, August 2000.

[TKSK*03] N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, A. Perrig, and T. Bressoud. Opportunistic use of content addressable storage for distributed file systems. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 127-140, San Antonio, TX, June 2003.

[YPL05] L. L. You, K. T. Pollack, and D. D. E. Long. Deep Store: An archival storage system architecture. In Proceedings of the IEEE International Conference on Data Engineering (ICDE '05), April 2005.

[ZL77] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, vol. IT-23, pages 337-343, May 1977.