SPECIAL REPORT: Data Deduplication Deep Dive
Put your backups on a diet
Copyright InfoWorld Media Group. All rights reserved.





Data deduplication explained
How to reduce backup overhead and lower storage costs by eliminating redundant data
By Keith Schultz, InfoWorld

EVEN THE MOST EFFICIENT BUSINESSES waste oodles of storage. It's a simple fact of business life. Day in and day out, we create documents and we share them. By the time everyone is done emailing and copying, there will be dozens if not hundreds of duplicates of the same presentation, PDF, or spreadsheet strewn throughout the company network.

Combine all of those copies with nightly backups, and duplicate bytes soon add up to duplicate gigabytes and terabytes. And it isn't just a storage issue. More data means more spinning disks, more heat, and larger backup sizes. Larger backup sizes mean longer restore times (and more unhappy users). If you can reduce the amount of data stored, those savings translate directly into lower power consumption, more manageable backup systems, and ultimately a more efficient operation.

Data deduplication takes advantage of the enormous amount of redundancy in business data to dramatically reduce storage requirements. Backups of client systems, file servers, databases, and virtual infrastructure are all ripe for data deduplication. Eliminating duplicate data can cut the storage space required by ratios of 10-to-1 to 50-to-1 and beyond, depending on the deduplication technology used and the level of data redundancy. With a little help from data deduplication, companies can reduce costs, lighten backup requirements, and accelerate data restoration in an emergency.

Deduplication takes several different forms, each with its own approach and optimal role in backup and disaster recovery scenarios. Data deduplication technology is also being applied to live storage systems -- particularly in areas such as virtual server farms, virtual desktops, and email stores, where there are high levels of redundant data -- and will go mainstream for primary storage as a standard feature in Windows Server 2012. In the meantime, let's take a look at why data deduplication has become so attractive to so many organizations.

TOO MUCH DATA, TOO LITTLE TIME

Like program flyers at a county fair, duplicate data is strewn all over enterprise storage systems. Files are saved to file shares in the data center, with other copies located on an FTP or Web server facing the Internet, and yet another copy (or four) located in users' personal folders. Sometimes backup copies are made prior to exporting to another system or updating to new software. Are users good about deleting these extra copies? Not really.

The email blast is a classic example of duplicate data. It goes like this: Someone in human resources wants to send the new Internet acceptable use policy PDF to 100 users on the network. So he or she creates an email, addresses it to a mailing list, attaches the PDF, and presses send. The mail server now has 100 copies of the same attachment in its storage system. Only one copy of the attachment is really necessary, yet with no deduplication system in place, all those copies sit in the mail store taking up space.

Server virtualization is another area rife with duplicate data. The whole idea of virtualization is to do more with less and maximize hardware utilization by spinning up multiple virtual machines on one physical server. This equates to less hardware expense (fewer physical machines), lower utility costs, and (hopefully) easier management. But although the server hardware footprint becomes smaller, the storage requirements typically become greater. Each virtual server, whether a VMware, Microsoft, Citrix, or other type of virtual machine, is contained in a file or disk image. For instance, VMware uses a single VMDK (virtual machine disk) file as the virtual hard disk for the virtual machine. As you would expect, VMDK files tend to be rather large: at least 2GB, and almost always much larger.

One of the great features of virtual machines is that admins can back up the virtual machine (either offline or online), as well as create snapshots for point-in-time recovery. Now what happens with all of these backup copies? That's right: a lot of duplicate files stored on a file server. Admins also keep golden images of working virtual servers to spawn new virtual machines. Each time a golden image is versioned, you can bet that the previous image is still there. Virtualization is a fantastic way to get the most out of CPU and memory, but without deduplication, virtual hard disks can actually increase network storage requirements.

VDI (virtual desktop infrastructure), a close cousin of server virtualization, promises to contribute further to virtual machine sprawl. Just as server virtualization consolidates multiple server instances, VDI runs multiple virtual desktops on a single host. Although VDI solutions have their own deduplication tricks, such as spinning up multiple virtual desktops from one golden image, most virtual desktops deployed today are persistent; that is, each user's virtual disk is unique and must be stored from session to session. This makes VDI a perfect candidate for deduplication: a large number of very similar virtual disks belonging to different users. Virtualization may be light on client-side hardware, but it's heavy on storage.

DEFINING DATA DEDUPLICATION AND ITS BENEFITS

Simply put, deduplication is the process of detecting and removing duplicate data from a storage medium or file system. Duplicate data may be detected at the file, bit, or block level, depending on the type and aggressiveness of the deduplication process. File deduplication is the fastest method, but also the least effective, because deduplication can occur only when two files are identical bit for bit; if two files differ by even a single bit, both will be stored. Bit-level and block-level deduplication take a more granular approach. Both methods peer inside the file to analyze its contents, looking for duplicate sequences of bits or blocks of data. Deduplication at the bit or block level is much more effective than file-level dedupe, but also more processing-intensive. Greater data reduction comes at the cost of performance.

The first time a deduplication system sees a file or a chunk of a file, that data element is identified. Thereafter, each subsequent identical item is removed from the system and replaced with a small placeholder. The placeholder points back to the first instance of the data chunk so that the deduped data can be reassembled when needed. The more duplicate data you have, the fewer original chunks need to be stored. For example, a file system that holds 100 copies of the same document from HR in each employee's personal folder can be reduced to a single copy of the original file plus 99 tiny placeholders that point back to it. It's easy to see how that can vastly reduce storage requirements, not to mention why it makes much more sense to back up the deduped file system instead of the original.

Another benefit of data deduplication is the ability to keep more backup sets on nearline storage. With the amount of backup disk space reduced, more point-in-time backups can be kept ready on disk, making file restoration faster and easier. This also allows a longer backup history to be maintained.
Instead of having only three versions of a file to restore, users can have many more, possibly going back weeks or even months, enabling a granular approach to file backups and accommodating loads of backup history.

Disaster recovery is another process that greatly benefits from data deduplication. For years, data compression (which operates within individual files) was the only way to reduce the overall size of the off-site data set. Add deduplication and the backup set can be reduced even more. Why transfer the same data set each night when only a small portion of it changed that day? Deduplication in disaster recovery makes perfect sense: Not only is the transfer time reduced, but the WAN is used more efficiently, with less overall traffic.

HOW DATA DEDUPLICATION WORKS

Data deduplication works by analyzing a chunk of data, be it a block, a series of bits, or an entire file. The chunk is run through an algorithm to create a unique key, called a hash, and the hash in turn is stored in an index. As each new chunk of data is hashed, the result is compared to the existing hashes in the index. If the hash is already in the index, the chunk of data is a duplicate and doesn't need to be stored again; instead of storing the chunk, the deduplication engine inserts a small placeholder that points back to the hash index. If the hash is not in the index, the chunk is stored in the deduplication dictionary, the hash is added to the index, and so on. It is easy to see that the number of hashes stored in an index can reach the millions or tens of millions. The greater the amount of data analyzed, the larger the hash index will be and the greater the potential storage savings.

Deduplication doesn't start working immediately. The system needs to spend some time learning what data it will see in order to build the initial indexes. This means there is a cold penalty the first time a piece of data passes through the deduplication analysis engine. On subsequent passes, as increasing volumes of duplicate data pass through the engine, more and more duplicates are found and removed.

To see how this works, let's return to our example of the PDF email attachment, which went out to 100 employees. Each PDF is 25MB, which means the mail server's storage has just increased by roughly 2.5GB (plus the emails themselves). Now suppose the mail server does a nightly backup to a NAS appliance. The backup, as we saw above, has just grown by a minimum of 2.5GB from a single email blast. If the network admin wants to keep a week's worth of backups for disaster recovery, the total weekly backup is now 17.5GB larger due to the bulk email.

Here's how data deduplication can help. By identifying and removing the duplicate attachments and storing only a single copy, deduplication keeps the nightly backup at a paltry 25MB, the size of one copy of the original attachment. For the weekly backups, instead of storing 17.5GB, the backup system stores only 175MB, a 99 percent reduction in storage requirements. Think of how many more backup sets admins could keep online or nearline with that kind of storage reduction. It would be easy to keep months of history instead of only a few days' worth.
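The hash-and-index flow described above fits in a few lines of code. The following minimal sketch is written in Python purely for illustration; the names DedupStore and store and the fixed 4KB chunk size are this article's inventions, not any vendor's API.

import hashlib

class DedupStore:
    """Toy block-level deduplication engine: a hash index plus a chunk store."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # hash -> chunk bytes (the "deduplication dictionary")

    def store(self, data):
        """Split data into fixed-size chunks and keep each unique chunk once.
        Returns a recipe of hashes -- the placeholders that point back to the
        first instance of each chunk, so the data can be reassembled later."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:   # first time we see this chunk: store it
                self.chunks[key] = chunk
            recipe.append(key)           # a duplicate costs one placeholder, not one chunk
        return recipe

Feed this store 100 copies of the same 25MB attachment and the chunk dictionary grows only once; the other 99 copies are reduced to lists of placeholders, which is the 17.5GB-to-175MB arithmetic above in miniature.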
FILE-, BIT-, AND BLOCK-LEVEL DEDUPLICATION

There are three levels of deduplication: file, bit, and block. In file-level deduplication, each file is analyzed as a whole and a hash value is created. The deduplication software looks at the whole file regardless of name, size, or type. The problem is that if a file differs by just a single bit, its hash will differ from those of other versions. This makes file-level deduplication very inefficient and impractical. For example, one user creates a document in Word 2003. Another user opens it and saves a copy (without any changes) to another location, but in Word 2007 format instead. Same file, different format: different hash, and no deduplication.

Bit-level and block-level deduplication are vastly more efficient. In both cases, the analysis engine breaks files into chunks or segments and creates hashes based on those smaller pieces. The smaller pieces allow for a higher probability of a match; the more matches, the greater the reduction of data on the system. Unlike file-level deduplication, bit- and block-level deduplication transfer only the changed portion of a file, not the entire file.

The goal of a bit- or block-level system is to create a very large index of hashes in order to identify many more duplicate items. But, as with many things, there is a point of diminishing returns. If the placeholder is nearly the same size as the removed data, the deduplication gain isn't worthwhile. And if the number of hashes in the indexing system becomes too large, overall performance may suffer due to database lookup latency.
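A short experiment makes the difference between the levels concrete. In this sketch (Python again, with an assumed 4KB block size), two buffers differ by a single byte: whole-file hashing sees two entirely different files, while block hashing isolates the change to one block.

import hashlib

def block_hashes(data, block_size=4096):
    """Hash each fixed-size block; return the set of unique block hashes."""
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

file_a = b"x" * 8192                          # two identical 4KB blocks
file_b = b"x" * 4096 + b"y" + b"x" * 4095     # same size, one byte changed

# File level: a single-bit difference yields a different hash -- no dedupe.
print(hashlib.sha256(file_a).hexdigest() == hashlib.sha256(file_b).hexdigest())  # False

# Block level: the two files' four blocks reduce to just two unique chunks.
print(len(block_hashes(file_a) | block_hashes(file_b)))                          # 2

Real engines often go further, using variable-size, content-defined chunking so that an insertion near the start of a file doesn't shift every subsequent block boundary and defeat the matching.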

SOURCE, TARGET, AND INLINE DEDUPLICATION

Just as there are three levels of deduplication, there are also three ways of deploying it. The deduplication engine can run on the production storage system (source), on the backup system (target), or in a dedicated appliance (or appliances) sitting between the source and target (inline). Naturally, each option has its advantages and drawbacks.

SOURCE DEDUPLICATION

Source deduplication analyzes files on the original server, creating hashes for each file, bit, or block. The deduplication engine runs on the source server, usually as part of a backup application. During a backup operation, the engine creates hashes for the data and compares them to the hashes of data already stored at the backup destination. Whenever the hashes match, only the placeholder reference is copied over. Otherwise, the data is copied to the backup server and the unique hash is added to the deduplication index.

Source deduplication is good for keeping network bandwidth under control. Only the deduped data is transferred over the LAN/WAN, instead of each and every bit. The backup destination's backup set is smaller because it receives optimized data only. Plus, a software package is the only change to the network; no new appliances or other changes are required to deduplicate the data.

Source deduplication is a good choice only if performance isn't high on the list of requirements. The overall backup process tends to be slower when the backup software does the deduplication, due to the extra overhead needed to create, analyze, and transfer the hashes and their associated chunks of data. Small data sets don't pay much of a performance penalty, but large sets can slow a system to a crawl. The other downside to source deduplication is that the benefits are isolated. Multiple copies of bits, bytes, and files are removed from each source, but the data cannot be deduplicated across multiple sources. If copies of files are strewn across many hosts on your network, you'll still be backing up a lot of duplicate data.
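The bandwidth saving comes from asking before sending. Here is a hedged sketch of the source-side exchange; known_hashes stands in for whatever hash lookup a real backup agent performs against its destination, and is an assumption of this example rather than any product's protocol.

import hashlib

def source_dedup_backup(data, known_hashes, chunk_size=4096):
    """Hash chunks on the source server and put only unseen chunks on the
    wire; chunks the destination already holds travel as references only."""
    to_send, recipe = {}, []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in known_hashes:   # unique chunk: must cross the LAN/WAN
            to_send[key] = chunk
            known_hashes.add(key)
        recipe.append(key)            # placeholder for the backup catalog
    return to_send, recipe

On the first nightly run every chunk is new and crosses the network; on a second run over unchanged data, to_send comes back empty. That is the cold penalty and its payoff in miniature, and it is also where the CPU overhead on the source server comes from.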
TARGET DEDUPLICATION

Target deduplication does not impact source server performance in any way. The deduplication is performed on the destination server once the backup is finished. Any off-the-shelf backup software package can be used with target deduplication, since no hashes are created on the source server. Target deduplication is backup-system agnostic: It really doesn't care how the data gets to the backup server. Detection of duplicate data is typically very fast, making target-based deduplication ideal where performance is a primary concern.

The two major drawbacks to target deduplication are storage space and network bandwidth. Because the deduplication doesn't take place until after the backup set arrives, the backup server must have a large landing zone, that is, a large amount of disk space available to hold the backup set. When deduplication is complete and the backup set has been moved to its offline location, the temporary storage can be purged and reset for the next backup. One of deduplication's selling points is the reduction of spindles, but in target deduplication, a few more are needed to hold the pre-deduped backup set.

Target deduplication also requires the entire backup data set to be sent over the wire to the backup server. Again, because deduplication doesn't happen until the backup job is complete, every bit has to be sent over the network, in some cases via a slower WAN link. Although you've reduced the storage requirements for your backup server, transferring the data for disaster recovery will consume as much bandwidth and precious time as before.
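Post-process deduplication of a landing zone can be pictured as a directory walk. This sketch is purely illustrative: real target appliances dedupe at the block level and use their own placeholder format, so the whole-file hashing and the symlink standing in for a placeholder here are both simplifications.

import hashlib
import os

def dedupe_landing_zone(landing_dir, index):
    """Walk a landing zone after the backup completes, replacing duplicate
    files with placeholders. index maps content hash -> path of first copy."""
    for root, _, files in os.walk(landing_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):          # already a placeholder
                continue
            with open(path, "rb") as f:
                key = hashlib.sha256(f.read()).hexdigest()
            if key in index:
                os.remove(path)               # duplicate: drop the payload...
                os.symlink(index[key], path)  # ...and leave a pointer behind
            else:
                index[key] = path             # first copy: keep it as-is

Note what the sketch makes visible: every byte has already landed on disk before any space is reclaimed, which is exactly why the landing zone needs those extra spindles.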

INLINE DEDUPLICATION

Inline deduplication splits the difference between the two previous methods by using a specialized appliance to intercept the data in flight between the source and target. There are a couple of ways to do this. One method is to deploy a single dedupe appliance in front of the target storage system. The second method uses a pair of appliances, one on either side of a link such as a WAN circuit, to reduce the amount of data traveling over the wire. Fewer bits on the wire equal faster transfers and faster file restoration when disaster strikes.

An inline appliance creates hashes on the fly as data passes through and forwards only the unique chunks. This approach doesn't require any special software on the source server (like target deduplication, it is backup-application agnostic), nor does it need additional storage space on the destination server. Stored data is already deduped, reducing storage costs and power consumption. The only real downside to inline deduplication is the high cost of the specialized appliance and the possibility that, if the appliance is underpowered or the wrong unit is chosen, the device itself can become a bottleneck. For the most part, that won't happen, but keep scalability in mind. Inline solutions from many vendors are available as physical as well as virtual appliances for VMware ESX, Citrix XenServer, and Microsoft Hyper-V.

STRAINING BACKUP SYSTEMS

So, how to back up all this data? Many businesses still rely on old tape backup systems, but it's getting harder and harder to back up multiple terabytes of data to tape in a reasonable amount of time. Although modern tape libraries and parallel backup systems increase both capacity and performance, they are not winning the race with data growth. Time, in terms of shrinking backup windows, becomes the limiting factor.

VTLs (virtual tape libraries) offer a speedier alternative to tape, using hard disks in configurations that mimic standard tape drives. Many VTL solutions serve as a front end to traditional tape libraries, allowing for high-speed backup to disk, followed by slower backup to tape. Of course, VTLs add some complexity to the backup system, and their additional spindles entail additional cost and power consumption. To avoid the limiting factors of tape systems, many admins are using disk-based storage systems as their primary backup location. This solves tape drive performance and capacity issues, but it doesn't reduce the number of spindles in play. All things considered, wouldn't it be better if there were simply less data to back up?

Data glut compounds the difficulty of disaster recovery, making each stage of nearline and offline storage more expensive. Keeping a copy of the backup in nearline storage makes restoring missing or corrupt files easy. But depending on the backup set size and the number of backup sets admins want to keep handy, the nearline storage requirements can be quite substantial. The next tier, offline storage, consists of tapes or other media copies locked in a vault or sent to some other secure location. Again, if the data set is large and growing, this offline media set must expand to fit.

Many disaster recovery plans include sending the backup set to another geographical location via a WAN or to a cloud-based storage provider. Unless your company has deep pockets and can afford a very fast, low-latency WAN link, it pays to keep the size of the backup set to a minimum. That goes double for restoring data. If the set is really large, trying to restore from an offsite backup will add downtime and end-user frustration.

WAN acceleration appliances help streamline backup data transfers by compressing the data in flight, playing tricks with TCP (which helps reduce latency and improve response times), and performing their own data deduplication. Many of these solutions scale well into multiple terabytes and even petabytes of capacity. The nice thing about WAN acceleration appliances is that the performance gains apply in both directions: What's good for the backup is also good for the restore.

There are also new cloud gateway appliances that connect directly to cloud storage providers and dedupe data between the data center and the online storage. Simply use your existing backup package, set the cloud gateway appliance as the target, and walk away. The appliance handles the deduplication and transmission of the backup set to the cloud while maintaining a local copy of the most recently saved data for faster restoration.

Cloud deduplication will undoubtedly catch up with today's global deduplication solutions, which share deduplication metadata across multiple devices. In the meantime, watch out for the one-to-one solution. For instance, data center 1 (DC1) backs up to a cloud dedupe appliance that is in turn connected to a cloud storage provider. Everything is fine until that fateful day when DC1 is down and IT tries to restore the deduped cloud data to another site (DC2).
Because the deduplication hashes and stored byte segments live on DC1's appliance, and that information isn't shared between sites, DC2 cannot put Humpty Dumpty back together again. In most cases, you'll want a cloud storage gateway that allows your deduped backups to be reconstituted in the cloud itself or at another location.

DEDUPING BEYOND THE BACKUP TIER

If deduplication works so well with backups, why not apply it to the primary storage tier? In fact, real-time deduplication on live storage systems is becoming a reality. Real-time deduplication software watches the local file system for duplicate files or chunks of files. Then the usual process ensues: When redundant data is located, hashes and placeholders point to indexed chunks of data, reducing online storage requirements. No additional hardware is needed, and to the end user, the deduped files look and behave just like the originals. This also works inside virtual systems, potentially reducing their overall file size.

Live storage deduplication has been applied to virtual server and virtual desktop environments, and it holds a great deal of promise for SSD storage arrays. Although dedupe is taking longer to penetrate traditional storage systems, it will be a feature built into Windows Server 2012. Some specific criteria must be met in order to deduplicate the live storage system; for example, deduplication cannot take place on a boot volume, and it works only with a specific file system. And although it targets the live storage area, this deduplication feature will run post-process, meaning deduplication takes place after the files are written.

This type of deduplication has some possible drawbacks. For one, the data reduction is locked to the specific storage volume. Data deduped on one volume will be reconstituted to its full size when copied or moved to another storage volume or server; the deduplication hashes are not shared, as in our one-to-one cloud gateway example. Another drawback is performance. Although there should be no discernible write penalty (since deduplication takes place after the file is stored), there may be one during a read. When a user copies, moves, or reads a file from a deduped volume, the deduped segments have to be read and reassembled so that the original file is available. A performance hit is all but assured due to the reconstitution of the file chunks.
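That read penalty is visible in the restore path of the toy DedupStore sketched earlier (again an illustration, not any product's mechanism): every read must walk the recipe of placeholders, perform one index lookup per chunk, and stitch the pieces back together.

def read_file(recipe, store):
    """Reassemble a deduped file: one index lookup per placeholder, then a
    join -- the reconstitution work behind the read penalty."""
    return b"".join(store.chunks[key] for key in recipe)

store = DedupStore()
recipe = store.store(b"hello world" * 1000)
assert read_file(recipe, store) == b"hello world" * 1000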

It may take a while for data deduplication to hit its stride in the primary tier. But deduplication is already extremely effective at the backup tier, where, after all, the vast majority of duplicate data resides. Ballooning data stores will continue to strain existing storage and backup systems. But with data deduplication removing redundant data from the network, we can at least delay the day when we all have more petabytes than we can possibly handle.

Keith Schultz is president of NetData Consulting Services and a contributing editor to InfoWorld.