Effective Data Deduplication Implementation


White Paper

Enterprises with IT infrastructure are looking at reducing their carbon footprint and infrastructure management costs by slimming down their data centers. In contrast, data volumes are growing exponentially in the enterprise world, leading to a surging demand for capacity and new hardware. Data Deduplication is one of the techniques that can help enterprises use less storage space to store the same amount of data. Simply put, deduplication means the removal of duplicates from data. Deduplication can provide 2x to 200x space savings and can also be used for low-bandwidth data transfer. However, it is a resource-intensive operation and can offset these advantages if not implemented correctly. Different vendors implement it in different ways; the challenge is to achieve the maximum dedupe ratio (also called the data compression ratio) with as little effect on throughput (IOPS) as possible. Irrespective of vendor implementation, Data Deduplication can be divided into four steps: segmenting the data, creating fingerprints for these data segments, matching these fingerprints for duplicates, and storing unique segments on disk. Each of these steps can be implemented using different techniques, each with its own advantages and disadvantages. This white paper discusses the various techniques of implementing these steps and their effect on the deduplication ratio and throughput.

About the Author
Ravindra Mahabaleshwar Ravi, Solution Architect in the HiTech Industry Solution Unit (ISU) of TCS, is based in Mumbai, India. Ravi's main area of focus is Storage, and he is responsible for testing solutions of Storage firmware and applications. Ravi has eight years of experience in product engineering, which includes development, maintenance, and testing. He has been involved in Storage firmware and application test automation for the past five years.

Table of Contents
1. Data Deduplication Defined
2. Four Steps to Data Deduplication
3. Implementations and Implications
4. Conclusion
5. References

1. Data Deduplication Defined
Data Deduplication is a data compaction technology that removes duplicates in data. SNIA[1] defines it as the process of examining a data-set or I/O stream at the sub-file level and storing and/or sending only unique data. It differs from compression techniques by working on the data at the sub-file level, whereas compression encodes the data within a file to reduce its storage requirement. However, compression can be used to augment data deduplication to provide a higher dedupe ratio (size of data to be written / size of data on disk : 1).

Small to large enterprises have been adopting this technology as it gives a significant Return on Investment by:
- Reducing the storage capacity required to store the data.
- Reducing the network bandwidth required to transfer the data.

Storage device cost has been falling with advances in disk technology. At the same time, IO bandwidth is on the upswing with Fibre Channel (FC) based networks, Gigabit Ethernet, and iSCSI. IO throughput of these storage devices has increased and the cost per GB of storing data has reduced. Data Deduplication reduces the cost per GB of data even further, but it can be taxing on the storage device's resources due to its compute-intensive operations.

The efficiency of any Data Deduplication application is measured by the dedupe ratio (size of actual data / size of data after deduplication : 1) and the throughput (megabytes of data deduplicated per second). The following parameters affect dedupe ratio and throughput:
- The nature of the data to be deduplicated.
- Where deduplication is applied: on the source appliance or on the target appliance.
- Whether Data Deduplication is an inline or a post-processing application.
- The implementation of Data Deduplication.

Figure 1: Source and Target Appliance Deduplication (clients, backup server, and backup appliance connected over the LAN, with deduplication applied either at the source or at the target)
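These two efficiency measures are simple ratios; as a minimal illustration (the byte counts below are made up for the example):

```python
def dedupe_ratio(actual_bytes: int, stored_bytes: int) -> float:
    """The N in an N:1 dedupe ratio = size of actual data / size after deduplication."""
    return actual_bytes / stored_bytes

def throughput_mb_per_s(deduped_bytes: int, seconds: float) -> float:
    """Megabytes of data deduplicated per second."""
    return deduped_bytes / (1024 * 1024) / seconds

# 40 GB of incoming data stored as 4 GB on disk gives a 10:1 dedupe ratio.
ratio = dedupe_ratio(40 * 1024**3, 4 * 1024**3)  # 10.0
```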

The following sections detail the generic steps in Data Deduplication and how their implementation has a bearing on the dedupe ratio and deduplication throughput.

2. Four Steps to Data Deduplication
There are more than a dozen major vendors of deduplication applications. Irrespective of vendor implementation, Data Deduplication can be categorized into four major steps:
1. Identifying the unit of comparison
2. Creating a smaller unique identifier for each of these units to be compared
3. Matching for duplicates
4. Saving unique data blocks

The implementation of each of these steps varies from vendor to vendor. However, the primary objectives of any implementation are to:
- Achieve the maximum deduplication ratio (size of actual data / size of data after deduplication : 1)
- Maximize Data Deduplication throughput (megabytes of data deduplicated per second)
- Minimize system resource utilization

3. Implementations and Implications
3.1 Identifying the unit of comparison
Deduplication first needs to identify the unit of data to work upon, in order to find whether that unit already exists on disk. Deduplication can either be content-aware (it knows about file boundaries and works on whole files) or work at the block level, the latter being the more popular and widely used methodology.

Content-aware deduplication works at the file level and hence compares whole files for duplicates. Since files can be large, this can adversely affect both the dedupe ratio and the throughput. In content-aware deduplication, duplication across files is not detected: two files might differ by just 10% and still exist on disk as separate entities.

Block-level comparison works on either fixed-size or variable-size blocks of data and is agnostic of file boundaries. This process of breaking up a byte stream is popularly known as chunking, and the resulting blocks of data are called chunks. In fixed chunking, the byte stream is broken into 4 KB, 8 KB, 16 KB, or 32 KB chunks. It is easier to implement and provides higher deduplication throughput. However, the deduplication ratio suffers if consecutive byte streams of data are marginally different.
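Fixed chunking is straightforward to implement, which is where its throughput advantage comes from; a minimal sketch (the function name and sizes are illustrative):

```python
def fixed_chunks(data: bytes, chunk_size: int = 4096) -> list:
    """Break a byte stream into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# 10000 bytes split at 4 KB boundaries -> chunks of 4096, 4096, and 1808 bytes.
chunks = fixed_chunks(bytes(10000))
```

Because every boundary falls at a fixed offset, a one-byte insertion shifts every subsequent chunk, which is exactly the drawback Figure 2 illustrates below.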

Figure 2: Fixed Chunking Drawback

Figure 2 demonstrates how fixed chunking is unable to detect duplicates merely because a new block is added in the consecutive data stream. The first stream of data is broken into four chunks (C1 to C4) of 4 KB each. Since C1 and C4 match each other, only C1, C2, and C3 are written to disk. Now suppose the next stream is very similar to the first, except that a new block of 1 KB has been added. When fixed chunking is applied to this second stream, it again creates four chunks, but only the first chunk (C5) matches; the rest of the chunks do not match because the bytes have been offset by the new block.

Variable chunking relies on patterns in the data to chunk the data stream, while ensuring that the chunk size stays within a range. Figure 3 demonstrates how variable chunking breaks the data stream based on a data pattern. If a chunk would be too small or too large, different implementations handle it differently; for example, the chunking algorithm can look for a secondary pattern at which to break the stream once the chunk reaches its maximum size. The advantage of this kind of chunking is that even if the data stream to be deduplicated contains a new block, the rest of the data still matches; only the chunk containing the new block differs. Hence this method gives a better dedupe ratio, but at the cost of the variable chunking algorithm pulling down the dedupe throughput.

Figure 3: Variable Chunking
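A content-defined (variable) chunker can be sketched with a rolling hash over a small sliding window. This is a toy polynomial hash rather than the Rabin fingerprinting production deduplicators typically use, and all sizes and constants are illustrative:

```python
import hashlib

def variable_chunks(data: bytes, min_size: int = 2048, max_size: int = 8192,
                    window: int = 48, mask: int = 0x0FFF) -> list:
    """Content-defined chunking: cut wherever a rolling hash of the last
    `window` bytes matches a bit pattern, keeping chunk sizes in [min, max]."""
    BASE, MOD = 257, 1 << 32
    pow_w = pow(BASE, window, MOD)            # used to drop the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD              # slide the window forward by one byte
        if i - start >= window:
            h = (h - data[i - window] * pow_w) % MOD
        length = i - start + 1
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])  # pattern found, or forced cut at max size
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks

# Deterministic pseudo-random sample data built from SHA-256 output (64 000 bytes).
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2000))
chunks = variable_chunks(data)
```

Because each cut point depends only on the last few bytes of content, an insertion disturbs at most the chunks around it, after which the boundaries realign; this locality is what gives variable chunking its better dedupe ratio.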

The trade-offs can be summarized as follows:

Unit of comparison    Dedupe ratio    Dedupe throughput
File                  Low             Low
Fixed chunk           Moderate        High
Variable chunk        High            Moderate

3.2 Creating a smaller unique identifier for the units to be compared
When the unit of comparison is a file, matching is done using binary-level comparison and/or hash addressing. Data chunks, however, need to be indexed in order to check whether a chunk already exists on disk. Since chunk sizes are in kilobytes, comparing each chunk byte-by-byte with every chunk on disk is a near-impossible task. Hence, the Data Deduplication application creates a unique identifier for each chunk that is exponentially smaller than the chunk itself.

Figure 4: Creating a Unique Identifier and Indexing

This can be achieved by hashing. Hashing creates a significantly smaller representation of a large piece of data. Some of the popular hashing algorithms used are the Secure Hash Algorithms (SHA1, SHA256, SHA512), the Message Digest algorithms (MD4, MD5), and so on.
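Steps 2 to 4 can be sketched together: fingerprint each chunk with a 20-byte SHA-1 digest (matching the 20-byte identifier used in the worked example in Section 3.3) and keep an in-memory index from fingerprint to stored location. The dictionary index and list-backed store here are illustrative stand-ins for a real on-disk index and chunk store:

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    """A 20-byte SHA-1 digest serves as the chunk's unique identifier."""
    return hashlib.sha1(chunk).digest()

def deduplicate(chunks):
    """Keep only unique chunks; the index maps fingerprint -> slot in the store."""
    index = {}   # fingerprint -> position of the unique chunk in the store
    store = []   # stands in for unique chunks written to disk
    refs = []    # per-input-chunk reference into the store
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in index:          # match for duplicates (step 3)
            index[fp] = len(store)
            store.append(chunk)      # save the unique block (step 4)
        refs.append(index[fp])
    return store, refs

# Three chunks in, two unique chunks stored; refs == [0, 1, 0].
store, refs = deduplicate([b"A" * 4096, b"B" * 4096, b"A" * 4096])
```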

For k chunks hashed to an L-bit digest, the probability that at least one hash collision occurs is approximately:

    p ≈ 1 - e^(-k^2 / (2 * 2^L))

A hashing algorithm should be efficient enough to create unique identifiers for disparate data chunks at high throughput. Each of these hash algorithms can only reduce the possibility of a hash collision to near irrelevance (see the collision probability formula above). Each algorithm performs differently depending on the message size to be hashed, the processor type used, and its inherent implementation.

Figure 5: Hashing throughput against digest bit size

Figure 5 demonstrates the performance of a few hashing algorithms against digest size. The data was gathered on an Intel Core Duo 2.8 GHz processor with 1 GB RAM running the Fedora 12 x86_64 operating system. The following factors eventually impact the dedupe ratio and throughput:
- The digest (unique identifier) bit size generated should be as small as possible. As shown in Figure 5, different algorithms produce different digest sizes with varying probabilities of hash collision. MD5 generates a 128-bit digest for a message size of 8192 bytes, while SHA512 generates a 512-bit digest, but with a vastly lower probability of hash collision than MD5.
- The hash algorithm should take the least amount of time to create the unique identifier.
- The possibility of a hash collision should be negligible.

NOTE: A few vendors also implement a byte-by-byte match of every chunk that has been found to have a duplicate, to work around hash collisions.

3.3 Match for duplicates
Matching for duplicates is the crux of the Data Deduplication application and has the maximum impact on both dedupe ratio and throughput. As the amount of data processed by the application increases, the size of the index table also increases, and searching the index table becomes slower over time.

Let us consider an example in which the data is broken into fixed-size chunks of 4 KB and a unique identifier of 20 bytes is calculated using one of the hash algorithms. For 40 GB of data with a 10:1 dedupe ratio, the number of chunks will be 10 million, the number of chunks written to disk would be 1 million, and

hence the index table will hold 1 million entries (at least 20 MB in size). If we extrapolate this to the size of data in large enterprises, which will be in terabytes, a single search for a duplicate in the index table could itself take seconds. Different vendors implement smart searches in which the scope of the search for duplicates is limited to a manageable size. Although this might impact the dedupe ratio, it can significantly increase the throughput.

Ensure the following points while implementing the matching algorithm, for a better dedupe ratio and throughput:
- A full index search for matches works only for small amounts of data and is not scalable.
- The search algorithm should minimize the scope of the search while maximizing the probability of finding duplicates.
- The implementation of the index table should minimize its size so that system memory usage is reduced.

3.4 Saving unique data blocks
Saving the unmatched data to disk also has a bearing on the dedupe ratio and throughput. Most vendors use compression to further reduce the data size before storing it on disk. A few of the compression algorithms that can be used are gzip, LZO, LZW, Zlib, PXZ, and so on. There are open source and commercially available compression libraries which can be used. Note the following before choosing a compression library:
- Some compression libraries give higher compression when working on large data, while others compress small chunks of data well.
- Some compromise on compression ratio but provide high-speed compression.

Figures 6 and 7 compare some of these open source compression algorithms with respect to compression ratio, compression speed, and decompression time for different kinds of data sets. The data was gathered on an Intel Core Duo 2.8 GHz processor with 1 GB RAM running the Fedora 12 x86_64 operating system. Figure 6 illustrates that PXZ gives the best compression ratio compared with the other compression libraries. However, the rate of compression and decompression (MB/s) is too low for PXZ to be viable. Other than the four open source compression libraries mentioned in Figures 6 and 7, there are other commercial and open source libraries too.

Figure 6: Comparison of compression ratios provided by different compression algorithms
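The ratio-versus-speed trade-off is visible even within a single library. Python's zlib, for instance, exposes compression levels; a minimal sketch (the sample chunk is made-up, repetitive data):

```python
import zlib

chunk = b"deduplicated block payload " * 152  # ~4 KB of repetitive sample data

fast = zlib.compress(chunk, level=1)   # favors compression speed
best = zlib.compress(chunk, level=9)   # favors compression ratio
assert zlib.decompress(best) == chunk  # compression must stay lossless
# Only keep the compressed form when it actually saves space:
stored = best if len(best) < len(chunk) else chunk
```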

Once the unmatched data is compressed using an open source or commercial compression algorithm, it needs to be written to disk. The data write and read speed depends on the underlying file system, the RAID configuration, and the kind of hard disk used (SATA, ATA, FATA, and so on).

Figure 7: Comparison of compression speed and decompression time provided by different compression algorithms

4. Conclusion
Data Deduplication promises to change the way data is stored on disk at all tiers of storage (primary, backup, archival). However, it can only be beneficial if the dedupe ratio (or the disk space saving) justifies the extra cost of a Data Deduplication application and the load it puts on system resources. Alongside Data Deduplication, there are advances in network and disk technology that provide:
- Very high data transfer rates.
- Lower disk cost.
- Higher disk IO rates.

A Data Deduplication implementation that provides a higher dedupe ratio at the cost of lower throughput would undermine the benefits of the advanced network and disk technologies mentioned above. Hence the Data Deduplication implementation should effectively provide a high dedupe ratio together with high throughput.

References
1. SNIA Data Deduplication and Space Reduction Special Interest Group (SIG)
2. "Deduplication Methods for Achieving Data Efficiency", Devin Hamilton
3. "Deduplication Methods for Achieving Data Efficiency", Matthew Brisse
4. "Looking Beyond the Hype: Evaluating Data Deduplication Solutions", Larry Freeman

About the HiTech Industry Solution Unit
TCS' HiTech Industry Solutions Unit provides optimal, customized, and comprehensive solutions across varied high tech industry segments: computer platform and services companies, software firms, electronics and semiconductor companies, and professional services firms (legal, HR, tax & accounting, and consulting & advisory/analyst firms). Building on its vast experience in engineering, business process transformation, innovation, and IT solutions, TCS offers a comprehensive portfolio of services that maximize growth, manage risk, and reduce costs. The TCS HiTech Industry Solution Unit partners with high tech enterprises to provide end-to-end solutions which help realize operational excellence, innovation, and greater profitability. For more information, visit us at http://www.tcs.com/industries/high_tech

Contact
For more information about TCS consulting services, contact HiTech.Marketing@tcs.com

Subscribe to TCS White Papers
TCS.com RSS: http://www.tcs.com/rss_feeds/pages/feed.aspx?f=w
Feedburner: http://feeds2.feedburner.com/tcswhitepapers

About Tata Consultancy Services (TCS)
Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model (TM), recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India. For more information, visit us at www.tcs.com

IT Services | Business Solutions | Outsourcing

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties.

Copyright 2011 Tata Consultancy Services Limited