An In-Depth Look at Deduplication Technologies

White Paper
Juan Orlandini, Datalink
Mike Spindler, Datalink
August 2008

Abstract: Deduplication is all the rage today, with a myriad of vendors offering technologies that provide deduplication capabilities. However, it is extremely confusing and time consuming for organizations to determine if the benefits of deduplication are compelling enough to consider implementing the technology. There are a number of factors to weigh. This white paper identifies what those factors are, provides an independent assessment of the types of deduplication, and identifies additional characteristics to look at when considering the deduplication solutions in the market.

Table of Contents

Dedupe overview
    Deduplication is on everyone's mind
    What is deduplication?
    How the process works
    Savings are substantial
Types of deduplication
    File-based
    Block-based
Options for deduplication
    Application
    Source
    Target
    Network
Inline versus post process
    Inline
    Post process
    Hybrid solutions
Additional factors to weigh
    CPU-bound versus disk-bound
    File aware (format aware)
    Replication
    Realistic compression ratios
Future considerations
    Integration with backup software ISVs
    Enterprise fit
Deciding what is right for you

Dedupe overview

Deduplication is on everyone's mind

Among today's data storage technologies, deduplication is generating a significant buzz and is definitely at the top of the "what's hot" list. The major storage vendors have all introduced deduplication products within the last one to two years. And while still a relatively new technology, it has garnered interest from enterprise organizations spanning all industries.

This hype comes with good reason. Deduplication provides compelling benefits and offers potential rewards rarely seen in today's IT environments, particularly as it relates to disk-based backup technologies. Disk-based backup continues to improve the speed, reliability, and availability of backups. Deduplication technology lowers the cost of overall storage and makes disk-based backup more economically feasible. Disk-based backup technologies, coupled with deduplication, are transforming how organizations back up and recover their data, and the role that tape plays in modern backup and archive architectures.

As with most emerging technologies, there are a plethora of alternatives and no clear answers about if, when, and where a solution should be implemented. Before an organization integrates any type of deduplication technology, it must first assess the different types of solutions and how each fits into its backup, recovery, and archive goals. Furthermore, the organization needs to ascertain the most effective path to integrate these technologies into its current operations. Unfortunately, the information necessary to make an educated decision is difficult to find. Each vendor markets its technology as the best approach. Even worse, some vendors have competing products in their own lineups, with each being touted as the best. This white paper demystifies the options and provides a clear, unbiased view of the deduplication market.

What is deduplication?

Simply put, deduplication identifies redundant information and stores it in a highly efficient format while maintaining the integrity of the original content. The data is stored only once, no matter how many copies are made. Vendors have chosen to call this technology many different things: dedupe, data reduction, single instance storage, global data single instance storage, capacity optimized storage, and even molecular sequence reduction. Although there are differences between each of these terms and associated technologies, at the core the concept is the same: find duplicated sets of data and store them only once.

The primary benefit of deduplication is that it greatly reduces storage capacity requirements. This drives several other advantages as well, including lower power consumption, lower cooling requirements, longer disk-based retention of data (and thus faster recoveries), and, with some vendors, simplified disaster recovery (DR) via optimized replication.

How the process works

Deduplication technology looks at data at either a block (sub-file) or file level. When a chunk of data comes across a dedupe engine, the data is broken into smaller blocks or segments. Each of these smaller blocks is given a unique identifier, created by one of several hashing algorithms or even a bit-by-bit comparison of the block. Common algorithms used for this are MD5 and SHA-1. Some vendors also have content-aware logic, which considers the source of the data (e.g., a NetBackup backup data stream) to determine block sizes and the boundaries of the resulting blocks.

As the dedupe engine processes data, it compares the data to the blocks already identified and stored in its database. If a block already exists in the database, the new redundant data is discarded and a reference to the existing data is inserted into the repository. If the block contains new, unique data, then the block is inserted into the data store (filesystem), and a reference to that block is added to the dedupe database.

During the backup operation, the backup application sees the file or the backup stream as it normally would. The size of the data is greatly reduced, though, since the blocks of data are replaced by reference pointers. Depending on where the data is being deduplicated, the backup application may or may not be aware that deduplication is occurring. With source-based deduplication, the application is intimately aware of the process. A target-based deduplication process, meanwhile, is generally transparent to the backup application. With this approach, the application still reads and transfers the same amount of data from clients to the deduplication device; however, the device appears as a vanilla VTL or disk share, and the reduction in capacity is essentially hidden from the backup application. During recovery operations, the same process runs in reverse: the backup application reads the data from the device without any concern for, or knowledge of, the deduplication device's operations.
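To make the mechanics concrete, here is a minimal, illustrative sketch of the hash-index-and-reference cycle described above. The fixed block size and in-memory dictionary are simplifying assumptions; production engines use variable block boundaries and persistent, disk-backed indexes.

```python
import hashlib
import random

BLOCK_SIZE = 4096   # fixed-size blocks for simplicity; real engines vary boundaries

class DedupeStore:
    """Toy dedupe engine: index blocks by hash, keep each unique block once."""
    def __init__(self):
        self.blocks = {}        # digest -> block data (the data store)
        self.logical = 0        # bytes the backup application wrote
        self.physical = 0       # bytes actually kept on disk

    def ingest(self, stream: bytes) -> list:
        """Return the stream as a list of references into the block store."""
        refs = []
        for i in range(0, len(stream), BLOCK_SIZE):
            block = stream[i:i + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()   # SHA-1, as noted above
            if digest not in self.blocks:              # new, unique block
                self.blocks[digest] = block
                self.physical += len(block)
            self.logical += len(block)                 # a duplicate costs only a pointer
            refs.append(digest)
        return refs

    def restore(self, refs: list) -> bytes:
        """Recovery path: rebuild the original stream from the references."""
        return b"".join(self.blocks[r] for r in refs)

random.seed(42)
store = DedupeStore()
monday = bytes(random.randrange(256) for _ in range(50 * BLOCK_SIZE))
refs = store.ingest(monday)
store.ingest(monday)                      # the next full backup: all duplicates
assert store.restore(refs) == monday      # the application gets its data back intact
print(f"dedupe ratio {store.logical / store.physical:.1f}:1")   # 2.0:1 after two fulls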

Savings are substantial

Deduplication technology can provide significant savings. If an enterprise performs a full backup weekly (as well as incremental daily backups), the potential for redundant data in the backup is enormous. For instance, if only five percent of the data being backed up by an organization changes on a daily basis, then most of the data in the full backup is redundant from backup to backup. Compounding this, it's being saved over and over, week after week. In particular, database or mail backups, which generally are performed as full backups every day, typically change by only a small percentage.

Deduplication can also achieve significant savings for incremental backups. A good way to think about this is through a common daily workflow. Most people do not generate net new data each day. They tend to open a document or spreadsheet that they were working on the day before, change it, and then save it. The net changes are relatively small compared to the whole file. However, when the backup application identifies these files for an incremental backup, the entire file is moved. When the dedupe engine sees this file, it will recognize that most of the data has been seen before and will only store the net new data. It's not uncommon to see very high (more than 10x) deduplication rates on incremental backups.

Additionally, deduplication eliminates the data redundancy between servers, since the dedupe process examines blocks as opposed to a file or backup stream. For instance, the same system files likely exist on many Windows servers, desktops, and laptops. Without any type of deduplication process, those files would be backed up and stored multiple times, once for each backup client that contains a copy. With deduplication, only a single copy would be stored for all locations.

Types of deduplication

Deduplication technologies largely fall into either a file-based or block-based category. File-based deduplication compares the content of entire files. Block-level approaches take a more granular, sub-file level approach. Some vendors implement their sub-file level algorithms in what they term stream approaches.

File-based

Regardless of the specific vendor implementation, file-based deduplication is essentially designed to identify files that are exactly the same and store them only once. This approach can be very beneficial in environments where users or applications re-create or save the same file in multiple locations. For example, a common scenario is for a user to send out a spreadsheet or a presentation to multiple co-workers via email, with each co-worker saving the file to his or her local network share, workstation, or laptop. Sometimes, the users save the files with a different name without changing the content of the file. Backup applications typically have no way to know that the content of those files is identical and, therefore, each file is backed up individually.
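A toy sketch of the file-level approach (the whole-file hashing and in-memory index here are illustrative assumptions): identical content is detected no matter what the file is named, but even a one-byte edit makes the whole file "new."

```python
import hashlib

store = {}   # digest -> file content; each unique file kept once

def save(name: str, data: bytes) -> str:
    """File-based single instancing: hash the whole file; if the digest is
    already known, keep only metadata plus a reference to the first copy."""
    digest = hashlib.sha1(data).hexdigest()
    store.setdefault(digest, data)
    return digest                          # reference recorded for this filename

deck = b"Q3 sales deck " * 1000
save("alice/deck.ppt", deck)
save("bob/deck_copy.ppt", deck)            # renamed copy: same digest, stored once
save("carol/deck.ppt", deck + b"!")        # one-byte edit: stored again in full
print(len(store), "files physically stored for 3 logical copies")   # prints 2
```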

The costs for the backup infrastructure increase from the duplication of effort associated with these files. To alleviate this problem, many organizations have implemented archive solutions backed by deduplication devices, which can identify files that are the same and store only the initial copy. The system stores only the relevant metadata (filename, age, size, etc.) of the redundant files and creates a reference to the first copy. From the perspective of the user or application accessing the file, each file is its own independent copy. The storage subsystem handles the details transparently.

These technologies often achieve deduplication even when the applications are accessing different storage platforms. In the above example, the initial copy of the data resided on the first user's hard disk. It was then attached to an email message and sent to a mail server, which forwarded a copy to each user, who in turn stored copies on their systems. This process created multiple copies on different file systems as well as a copy on the email system. A file-based archive solution coupled with the right backup and archive software identifies that all of the copies are identical on both the file systems and the email system, and stores only a single copy.

The leading vendors in the file-based deduplication space are EMC with its Centera products, Hitachi Data Systems with its Content Archiving Platform (HCAP) solutions, and CommVault with its dedupe solution for backups and archive. Interestingly, until the advent of block or sub-file level deduplication, the file-based vendors did not hype this capability much. However, given the recent popularity of the technology and the buzz around deduplication, they have jumped on the bandwagon and are now much more actively promoting it.

Block-based

Block-based solutions take the concept of storing unique data to the next level. Rather than focusing on individual files, this method identifies common patterns of data regardless of where the data exists. An example of this would be if a user attached a PowerPoint presentation to an email sent to multiple users, with the users each making a few minor changes to the file and saving the presentation to their laptops. At this point, file-based solutions would consider each of these files unique, even though most of the data is identical in each file and only a few bytes of information have changed. Block-based approaches, on the other hand, identify the common data regardless of the slight changes. At a very high level, this process selects a chunk of data (block, segment, variable size segment, etc.) and computes a uniquely identifying value (aka a hash). The system then compares this value against a database to identify whether it has seen the data before. If it has not, the data is written to the storage, and the computed value for the data, along with a pointer (reference) to the data, is inserted into the dedupe database. The next time that unique value is seen, the system knows that it has a copy of the data and only needs to store another reference, not the data itself.
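The catch with naive fixed-size blocking, and the boundary-shift problem the next section describes, can be seen in a few lines. The second function below sketches the kind of remedy commonly published as content-defined chunking with a rolling hash. Both are illustrative assumptions, not any particular vendor's algorithm:

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 4096):
    """Naive approach: cut every `size` bytes, counted from the start."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes, window: int = 48, mask: int = 0x0FFF):
    """Content-defined chunking: cut wherever a cheap hash of recent bytes
    hits a fixed pattern, so boundaries travel with the content itself."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF       # toy rolling-style hash
        if i - start >= window and (h & mask) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

def digests(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}

random.seed(0)
doc = bytes(random.randrange(256) for _ in range(64_000))
edited = doc[:100] + b"XYZ" + doc[100:]       # 3 bytes inserted near the front

fixed = digests(fixed_chunks(doc)) & digests(fixed_chunks(edited))
cdc = digests(content_chunks(doc)) & digests(content_chunks(edited))
print(f"fixed-size blocks shared after the edit: {len(fixed)}")   # 0: all shifted
print(f"content-defined chunks shared: {len(cdc)} of {len(content_chunks(edited))}")
```

Only the chunk containing the insertion changes under content-defined chunking; the boundaries downstream of the edit land on the same content and resynchronize.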

Generally, this approach provides the ability to identify the same data regardless of its location in the file. However, when users change a file, they essentially shift all of the bits that represent the file. This means that when the system identifies the chunks of data, the chunks will fall under different boundaries because they have shifted by a few bits. As a result, the system will see each chunk as being different, because it did not create the chunks in the same locations for each file. Fortunately, most vendors that implement sub-file approaches have technologies that address this point specifically. By and large, their approach to solving this problem is what differentiates them from a strict block-based approach, as well as from each other. Additionally, some vendors in this space differentiate themselves further in how they approach the computation and verification of the uniqueness of their chunks.

Options for deduplication

Within the various block- and file-based deduplication technologies, there are several methods of implementation. A significant difference between solutions has to do with where the process occurs.

Application

One of the first places where deduplication occurs is within applications. A dedupe application is typically complementary to another application that has a tendency to store large amounts of redundant data. The most common type of this application is email archival. Email server applications like Microsoft Exchange or Lotus Notes manage email distribution to users, with the majority of the data being stored on servers or at an email user's client application. The size of the mail store is largely driven by the attachments to messages. Email archival applications help manage the storage of both the server and the associated client applications by finding identical attachments on the server and moving single instances of the attachments to a common repository. The files are replaced at the server by a link to the file in the repository. Several software applications provide this archival service. Two of the largest and most common are EMC EmailXtender and Symantec Enterprise Vault. Additional examples of this deduplication approach are certain file-level archive solutions and file virtualization products.

Source

Source-based (also referred to as client-based) deduplication processes the data at its origination. This method still utilizes storage, or storage with an appliance; however, the CPU cycles for the deduplication process are spent at the client. The greatest benefit of this approach is that only net new data is sent from the client to the backup devices. But because the computational load is carried by the client, it imposes a very high CPU load during backups, and in many cases the backup performance is slower than with traditional approaches.
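A minimal sketch of that client/repository exchange (the two-round-trip protocol and in-memory repository are illustrative assumptions): the client hashes locally and ships only the blocks the repository has never seen.

```python
import hashlib

BLOCK = 4096

class Repository:
    """Server side: stores unique blocks and answers 'which of these are new?'"""
    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        # The client sends digests only -- a few bytes per 4 KB block.
        return [d for d in digests if d not in self.blocks]

    def put(self, new_blocks):
        self.blocks.update(new_blocks)

def client_backup(data: bytes, repo: Repository) -> int:
    """Client side: hash every block locally, then ship only what the
    repository lacks. Returns the payload bytes actually sent."""
    blocks = {hashlib.sha1(data[i:i + BLOCK]).hexdigest(): data[i:i + BLOCK]
              for i in range(0, len(data), BLOCK)}
    need = repo.missing(list(blocks))           # round trip 1: digests only
    repo.put({d: blocks[d] for d in need})      # round trip 2: new blocks only
    return sum(len(blocks[d]) for d in need)

repo = Repository()
laptop = b"quarterly report, v1 " * 10_000      # ~210 KB of client data
print(client_backup(laptop, repo), "payload bytes on the first backup")
print(client_backup(laptop + b" small edit", repo), "on the next day's backup")
```

The second run ships only the final, changed block, which is exactly why the approach suits thin links to desktops, laptops, and remote offices, at the cost of client CPU.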

With source-based deduplication, the traditional backup software is replaced by the dedupe client code. The backup process at the client looks at the data to be backed up at a block level. Through various hashing and bit-comparison techniques, the process determines the changed or new blocks of data that need to be sent to the repository. In some cases, the dedupe storage appliance provides recent block lists from its database to the client to help offload some of the cycles required for the process. This speeds up the deduplication at the client because it does not need to talk to the repository to compare each block it processes.

Another approach to source deduplication takes a snapshot of the data to be protected. The original snapshot of the data is transferred across the network to a common storage device. Subsequent point-in-time snapshots are then taken and compared to previous snapshots, with only the changes sent to the common storage location. This type of deduplication ensures that only small, incremental amounts of data are sent to the storage device. These types of source deduplication are well suited for backing up user desktops, remote office locations, or mobile users. Leaders in source deduplication include EMC Avamar, NetApp Open Systems SnapVault (OSSV), and Symantec PureDisk.

Lastly, an emerging variation of the source deduplication process is integrated into some of the latest releases of major enterprise backup software from EMC and Symantec. This design allows the media server to take data from its normal backup clients and perform the deduplication before sending it to the dedupe storage appliance. The deduplication function is performed by the media server (the NetBackup approach) or storage node (the EMC approach) before the data is sent to the storage appliance. This type of implementation allows an organization to integrate source-based deduplication with enterprise backup, using the deduplicated storage for the enterprise backup infrastructure. With this method, performance is based on the abilities of the media server. Companies looking at this technology should consider upgrading or adding media servers to their infrastructure to accommodate the additional processor cycles needed.

Target

In direct contrast to source deduplication, the process for target deduplication occurs within an appliance at the storage level. The appliance either has integrated storage or functions as a gateway to an existing disk array. This method applies CPU and I/O resources at the destination of the deduplicated data and is currently designed primarily to address deduplication for backup and recovery processes and long-term archive of reference data. Applications of the technology include virtual tape (VTL), disk to disk to tape (D2D2T), primary storage, and content addressable storage (CAS).

D2D2T

In a D2D2T environment, the dedupe process runs on a network-attached storage (NAS) or storage area network (SAN) appliance that either has integrated storage or external storage attached. The appliance breaks the backup stream down and performs the deduplication process. The backup software writes to an NFS or CIFS share on the network, or to a LUN on the SAN. With this method, the deduplication occurs as the data is received (inline). Alternatively, some vendors give the user a choice of doing the deduplication after the backup has completed, which is commonly referred to as post process. There has been an explosion of products in this area from a host of storage vendors, including Data Domain, EMC, NetApp, and Quantum.

VTL

The VTL approach to deduplication is similar to the D2D method. The difference is that the target is a virtual tape as opposed to a CIFS/NFS file. The deduplication occurs within an existing VTL appliance or an additional appliance that deduplicates to the VTL. When a VTL appliance creates a virtual tape, the data is written to a LUN on the storage array integrated with the VTL. The general concern of many VTL technology users is that the deduplication process creates too much overhead on the VTL, thus slowing down the backups. Many of these solutions offer both post process and inline deduplication to help counter this concern. Still others offer the best of both worlds, with the ability to switch from inline to post process if the overhead of the process is affecting the VTL backup performance. Vendors with solutions in this area include Data Domain, EMC, FalconStor, IBM, NetApp, Quantum, SEPATON, and Sun.

Additionally, some vendors offer smart deduplication, where the technology is backup software-aware. This means that the dedupe engine can gain intelligence on how and where to create the blocks from the backup stream and thereby further optimize the deduplication process. The technology provides the ability to ignore components of the backup stream that don't need to be processed (backup file marks, stream information, etc.). Consequently, the deduplication ratio increases if metadata from the backup applications is removed from the blocks being deduplicated. However, with this intelligence also come additional management factors. If software vendors change their backup software or modify the format they use when writing to tape, then the deduplication vendors have to update their code to remain compatible.

Primary Storage

Where the aforementioned methods of deduplication focus on the backup and recovery space, primary storage deduplication optimizes the data on the storage being accessed by users. This creates the ability to provide larger storage capacities with less physical disk and to reduce the storage cost per gigabyte. In current implementations, the dedupe process runs as a scheduled background task. Note that primary storage should not be confused with tier 1 storage.

Although deduplication can be performed on any type of disk (SATA, Fibre Channel, etc.), applications needing this space-management ability generally require relatively slower, but higher-capacity, solutions. Vendors who offer these solutions are NetApp and Compellent.

Content Addressable Storage

Content addressable storage (CAS) is another form of storage that offers deduplication abilities. With these solution types, an appliance is integrated with the storage or front-ends the disk array, performing file-level deduplication on the data it manages. Generally, a product-specific application programming interface (API) interacts with this type of storage. Advantages include redundancy of files in the repository across different nodes within the subsystem, or between subsystems for remote replication solutions. Hitachi Data Systems and EMC offer CAS solutions.

Network

Deduplication technology has also moved to the network itself. Enterprises continue to struggle with the bandwidth or performance of the network between headquarters and remote locations. Traditional methods address this issue through file caching or quality of service (QoS) applications. A newer approach uses appliance-based solutions, usually deployed at the headquarters and remote offices, or between two data centers. Through block deduplication, the technology manages duplication repositories at each of the appliances. The repository consists of blocks from the data stream that represent file sharing, application communication, web-based traffic, or email. Instead of transferring all the data across the WAN, the appliance replaces the blocks with references to the blocks in the repository. This can dramatically lower the bandwidth requirements and improve performance. Leaders in this sector include Cisco and Riverbed.

Inline versus post process

Another key difference with deduplication technologies has to do with whether the process occurs inline (real-time in the data stream) or post process (as a background process after the backup has finished and data has been written to the storage system). These approaches were alluded to previously in the VTL and D2D2T sections of this white paper.

Inline

An inline deduplication process allows the data to be deduplicated in real time as the backup data is received at the front end of the VTL or D2D device. The stream is passed to the dedupe engine, which breaks the data into blocks, calculates a hash value for each block, determines whether the block is new or existing, and replaces the block in the stream with a pointer to the block in the repository. If the block is new to the repository, it is compressed and written to disk. The original data stream now consists of pointers to the blocks in the repository and is written to disk.
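A sketch of that inline pipeline, under the same simplifying assumptions as the earlier engine (in-memory index, fixed incoming blocks); the one addition is that new unique blocks are compressed before they are written:

```python
import hashlib
import zlib

class InlineEngine:
    """Inline pipeline: each arriving block is hashed, stored (compressed) only
    if new, and replaced in the outgoing stream by a pointer."""
    def __init__(self):
        self.disk = {}                          # digest -> compressed unique block

    def receive(self, block: bytes) -> str:
        digest = hashlib.sha1(block).hexdigest()
        if digest not in self.disk:             # new unique block
            self.disk[digest] = zlib.compress(block)
        return digest                           # the stream keeps only the pointer

engine = InlineEngine()
stream = [b"os-image " * 500, b"user-data " * 500, b"os-image " * 500]
pointers = [engine.receive(b) for b in stream]
stored = sum(len(v) for v in engine.disk.values())
print(f"{sum(map(len, stream))} bytes received, {stored} bytes written to disk")
```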

The deduplication process is highly CPU intensive and, depending on the implementation, I/O intensive. Because of the computing resources needed, the current maximum performance for an inline engine is about 400 MB per second. Over the last few years, the processors used in these engines have gone from single-core to dual-core, and most recently to quad-core designs. Target-mode deduplication vendors that rely on CPU resources as the primary driver of their deduplication engine currently utilize dual quad-core processors to maximize performance in their fastest models. Other vendors rely on fast disk when I/O resources are needed for deduplication. As disk speeds (rotational speed, access time, and latency) have improved on the high-capacity drives typically used, performance has also improved.

Post process

With the post process method, deduplication is performed after the backup and post-backup processing have completed. High-end VTL or D2D solutions can typically ingest backup data at up to 1200 MB per second. Since the backups occur before deduplication, there is less at-risk time for an organization during which a backup has not yet completed. However, these solutions also require additional disk space to hold the backup before it is deduplicated. When implementing these solutions, it is necessary to size the landing space to accommodate not just the space required for one backup set, but potentially the next one as well. This is because if the ingest or backup speed is very high, there is a good chance that the deduplication process won't complete before the next backup starts.
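A back-of-the-envelope sizing sketch using the rates above (the 20 TB nightly backup set is an assumed workload, not a figure from the text):

```python
# Sizing the post-process "landing space": room for the set being deduped
# plus the next one arriving, since fast ingest can overlap the two.
ingest_mb_s = 1200           # high-end ingest rate cited above
dedupe_mb_s = 400            # engine rate cited above
backup_set_tb = 20           # assumed nightly backup size

backup_hours = backup_set_tb * 1024 * 1024 / ingest_mb_s / 3600
dedupe_hours = backup_set_tb * 1024 * 1024 / dedupe_mb_s / 3600
print(f"ingest takes {backup_hours:.1f} h, dedupe takes {dedupe_hours:.1f} h")
# Dedupe runs ~3x longer than ingest here, so tonight's landing area may still
# be draining when tomorrow's backup starts: size for at least two sets.
print(f"suggested landing space: {2 * backup_set_tb} TB")
```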

The post process deduplication method is otherwise similar to inline. Data is read from the disk where it is temporarily stored, and the dedupe process occurs. After deduplication is complete, the original pre-dedupe space is reclaimed.

Hybrid solutions

Some target deduplication solutions can do both inline and post process dedupe. Inline deduplication is performed by default. If the backup stream exceeds the performance abilities of the dedupe engine, the inline approach is suspended, and the backup stream is written to disk. After the backup is finished, a post process dedupe is performed on the backup data stream that was written during the backup.

Additional factors to weigh

In addition to the pros and cons identified with the various methods of deduplication, organizations should examine several other factors when considering a deduplication solution.

CPU-bound versus disk-bound

A major difference between inline deduplication vendors is their approach to verifying the reliability of data. One major camp has focused on the reliability of the cryptographically secure hash functions used to calculate the unique identifiers. The assumption is that the chance of two different pieces of data computing to the same identifier (in computer science terms, a "collision") is so small that, for all intents and purposes, it is virtually impossible. The other camp maintains that any chance of collision, no matter how small, is not acceptable, and therefore performs a two-step process. Technologies in this arena first compute a relatively easy-to-calculate hash-based identifier and then check for collisions by comparing each block against what's on disk. The rationale is that by verifying every block against what's on disk, there is 100 percent certainty that no data will be accidentally deleted.

Each of these camps offers some very compelling arguments for why its approach is the fastest and most reliable. The net result is that both methods solve the problem for real-world situations. In reality, the chance of either method being the cause of data loss is extremely low, much lower in fact than the likelihood of other subsystems failing and causing data loss. For example, the hash-based approaches use algorithms with a collision rate in the neighborhood of 1 in 2^80 (roughly 1 in 10^24), while hard drives have an undetected, unrecoverable error rate in the neighborhood of 1 in 10^14. The chance of a collision is still much lower than the unrecoverable error rate, even with RAID protection. For the disk-based approaches, the same math comes into play: there is a much higher likelihood that other elements of the subsystem will fail, rendering all the effort spent verifying disk blocks moot.
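For the skeptical reader, the birthday-bound arithmetic behind the hash camp's claim, with illustrative figures (the block count and error rate below are assumptions for the sake of the comparison):

```latex
% Birthday bound: probability that any two of n unique blocks collide under
% a b-bit hash, versus a drive's unrecoverable bit-error rate (UBER).
\[
  P_{\text{collision}} \approx \frac{n^{2}}{2^{\,b+1}}
\]
\[
  \text{e.g. } n = 10^{12}\ \text{blocks},\ b = 160\ \text{(SHA-1)}:\qquad
  P \approx \frac{(10^{12})^{2}}{2^{161}} \approx 3.4\times10^{-25}
  \ \ll\ 10^{-14}\ \text{(typical UBER)}
\]
```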

These methods have varying side effects from a performance perspective. Vendors that rely on purely hash-based algorithms are, by and large, CPU bound. Their performance is primarily dictated by the pure number-crunching ability of the CPUs they utilize. To get better performance, these subsystems merely have to wait for faster CPUs. Disk-bound systems are primarily bound by the number of disk drives that can perform the read-verification requests in parallel. Their primary method of scaling is adding more, and faster, disk drives. They have CPU limitations as well, but the primary gating factor is the raw number of IOPS that the storage can deliver. Interestingly, today both types of systems achieve roughly the same maximum throughput, about 400 MB per second. However, vendors of both methods are rapidly evolving their technologies and promising even higher throughput numbers. Hash-bound vendors claim they will soon provide clustering capabilities that will let them scale their solutions in a linear fashion, and disk-bound vendors predict they will soon double their throughput rate.

File aware (format aware)

An interesting twist to the deduplication space is that the technologies need to understand the format that the backup applications use to write to tape or disk. By being tape-format aware, they strip away the information that the backup application puts into the data stream and perform deduplication only on the real data. This is appealing because the tape formats used by the backup vendors insert markers into the data streams. If the deduplication technology is not aware of these markers, the markers appear to be part of the data that the backup client generated. This is a problem because each marker shifts the data bits and causes the data chunking process to align in different places. As far as the dedupe engine is concerned, each of these blocks is then unique. When evaluating or implementing deduplication solutions, it's important to be aware of this file/format support. It's crucial to either test the solutions against your backup product or have the vendor provide you detailed information about their support for specific formats.

Replication

Deduplication is not only changing the way that backup is done at primary sites, but also the way backups are done for remote sites, and even for disaster recovery. For the first time, it is now possible to replicate backup data with very high efficiency. Prior to deduplication, a 1TB backup to disk (or tape) would consume at least 1TB of bandwidth to replicate. If the backup window was relatively small (less than eight hours) and the bandwidth relatively low (less than a few T1s), the backups would never be replicated in time. However, with deduplication, the 1TB backup could now consume five percent (or even less) of the space. The bandwidth requirements to move this data are reduced by the same amount.
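Working the 1TB example through (a single T1 at 1.544 Mbps is the assumed link):

```python
# Hours to replicate a 1 TB backup over a single T1 line, raw versus deduped.
t1_mb_per_s = 1.544 / 8            # 1.544 Mbit/s T1 ~= 0.193 MB/s
backup_mb = 1024 * 1024            # 1 TB expressed in MB

def hours(mb):
    return mb / t1_mb_per_s / 3600

print(f"raw:     {hours(backup_mb):7,.0f} h")           # ~1,509 hours
print(f"deduped: {hours(backup_mb * 0.05):7,.0f} h")    # 5% unique: ~75 hours
# Even deduped, one T1 cannot move 1 TB nightly -- but the requirement drops
# twenty-fold, and global dedupe across sites shrinks the deltas further.
```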

Surprisingly, not all deduplication vendors have implemented replication. However, because of the game-changing ability this technology offers, the vendors that currently lack replication solutions have announced that it is imminent. Of course, the reality in some cases is that it may take several years for some vendors to deliver this capability.

For those vendors that have implemented replication with deduplication capabilities, there are a few key differentiators. First is the ability to have many-to-one relationships. Most vendors are able to set up a single primary site to which many remote sites can be replicated. The maximum number of remote sites that can be replicated varies among these technologies: for some, the maximum count is as little as 10, while others go as high as 60. The second differentiator is whether the systems can do multi-hop replication. In these scenarios, Site A is replicated to Site B, and Site B is replicated to Site C (and possibly back to Site A). Very few vendors have this capability.

A third, more subtle but important, differentiator is whether the replication is done to a global namespace. For some vendors, the replication relationships are created by partitioning the receiving unit into discrete, non-interacting storage pools. The deduplication engine is only able to handle common data between the two sites in a single pairing. For example, if Site A and Site B replicate to Site C, Site C has to divide its storage into three pools: one for Site A, another for Site B, and a third for the backups occurring at Site C. The deduplication engine treats each of those storage pools as a separate entity and deduplicates only within that storage pool. This can be potentially troublesome from a resources standpoint. In the above example, we could assume that Site C is a central corporate facility, and sites A and B are remote offices. Chances are good that there is significant common data between sites A and B and the corporate office. But because the technology treats each storage pool as unique data, the commonality would not be identified. This increases the requirements both for bandwidth and for storage at each of the appliances. Vendors that have a global namespace do not have to partition their storage, and thus the deduplication is done globally. In the above example, data that has already been seen by Site C's backups would not have to be replicated when Site A and Site B do their backups; only references (aka identifiers) would need to be sent. With this capability, bandwidth requirements can be reduced even more than the raw deduplication rate at each remote office.

Realistic compression ratios

There is no hard and fast compression number that exists for each deduplication technology. The various deduplication vendors tout redundant storage reduction ratios ranging from conservative (5:1) to lofty (500:1). Just as with tape drive compression, the numbers can vary significantly. Similar to compression ratios, deduplication ratios are sometimes computed differently. Primarily, the level of optimization that an organization will achieve is driven by the type of data being backed up as well as the backup processes.
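Since "computed differently" causes real confusion, one concrete convention (an illustrative sketch, not an industry standard): the reduction ratio is logical bytes written divided by physical bytes stored, and the same result can also be quoted as a percentage saved.

```python
# Two ways the same result gets quoted: a reduction ratio versus a
# percentage saved. 10:1 and "90% less storage" are the same number.
logical_tb = 100    # data the backup application wrote
physical_tb = 10    # what actually landed on disk after dedupe

ratio = logical_tb / physical_tb
savings = 1 - physical_tb / logical_tb
print(f"{ratio:.0f}:1 reduction = {savings:.0%} space saved")
```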

Additionally, the size of the chunks of data processed, and the intelligence with which the deduplication engine decides where a block starts and stops, can affect the level of deduplication achieved. For instance, if an organization runs full backups every day, the ratio of redundant data will be high. Conversely, if incremental-forever backups are conducted, the ratio of reduction will be significantly lower. Also, many enterprise backup environments already use different approaches to shorten the backup window. With digital images, for example, weekly full backups may be skipped because this data is static by nature. For Exchange environments, an organization may already use archival software to deduplicate attachments. Factors such as these will lessen the overall deduplication rates.

The limits of deduplication in a backup environment are also affected by the SLAs around data. If an organization keeps data for only 30 days, its compression ratios will be significantly lower than if it keeps backups for six months. This is because there will be less redundant data being backed up and stored. Lastly, the overall deduplication rate will increase over time: if the repository of deduplicated data spans months as opposed to weeks, the longer retention of this data will result in a greater deduplication rate.

Future considerations

Because deduplication technology is still somewhat in its infancy, it will continue to evolve and provide additional capabilities as time goes by. Other areas that organizations will need to keep an eye on include how deduplication will integrate with independent software vendors (ISVs) and how it will fit into large environments.

Integration with backup software ISVs

When most of the deduplication and virtual tape products were introduced, the best practice was to perform duplication and tape offloading via the backup software. The reasoning was that unless the backup software performed this operation, the backup application would be unaware of the second copy. The downside is that if a backup is 100 GB, making the copy means the backup server has to read and write that 100 GB a second and third time (read 100 GB and write another 100 GB). Even worse, the CPU and I/O cycles required by this process are delivered by the backup servers, which typically are already saturated.

Symantec has recently created an API that enables utilization of advanced storage functions like deduplication. The API allows storage vendors to integrate with the backup software, enabling the storage solution to perform the replication process and, at the same time, update the backup software with knowledge of the second copy of the backup. The process happens more quickly because the data is deduped and only the delta changes (new unique blocks) need to be transmitted between the storage appliances. Also, the CPU cycles once needed at the backup server can be used for other backup processes, like restores. Currently, Symantec NetBackup is the only product to offer this ability. However, the other major backup ISVs will likely follow.
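The control flow, in a deliberately hypothetical sketch (the class and method names below are invented for illustration; the actual Symantec API differs): the backup catalog learns about the second copy without the data ever passing through the media server.

```python
class Appliance:
    """Hypothetical dedupe appliance that can replicate an image to a peer itself."""
    def __init__(self, name):
        self.name, self.images = name, {}

    def replicate(self, image_id, target):
        # In a real system only new unique blocks cross the wire; the point
        # here is simply that the media server is not in the data path.
        target.images[image_id] = self.images[image_id]
        return f"{image_id}@{target.name}"

def duplicate_backup(catalog, image_id, src, dst):
    copy_ref = src.replicate(image_id, dst)    # appliance-to-appliance copy
    catalog[image_id].append(copy_ref)         # backup software learns of copy #2

site_a, site_b = Appliance("siteA"), Appliance("siteB")
site_a.images["img-0042"] = ["blk-1", "blk-2", "blk-3"]
catalog = {"img-0042": ["img-0042@siteA"]}
duplicate_backup(catalog, "img-0042", site_a, site_b)
print(catalog["img-0042"])    # both copies now visible to the backup software
```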

Enterprise fit

By and large, the majority of the vendors that have implemented deduplication have done so in relatively small appliances (less than 100TB of raw, pre-dedupe space). Furthermore, they are only able to achieve relatively moderate deduplication throughput rates (less than 400 MB per second). Large enterprises with high data volumes (hundreds of terabytes to multiple petabytes) and small backup windows need to be particularly aware of these limitations. VTL and other disk-to-disk solutions that don't deduplicate can achieve much higher capacities and throughputs (more than 1PB and more than 1.2GB per second). Even though the capacity problem is mitigated by the dedupe reductions, the relatively low raw data capacity of the current solutions could pose a problem for larger enterprises. Most vendors today approach the larger enterprises by utilizing multiple appliances. However, these appliances are managed independently, and the backup applications see them as unique storage pools.

Potentially more worrisome for the enterprise is that none of these solutions offers true controller redundancy. If a single, large dedupe appliance controller goes down, the entire capacity of that array is unavailable until the controller is fixed or replaced. All of the current dedupe vendors are working on addressing this issue. Some are approaching it through the use of traditional clustering technologies (active/passive or active/active pairs), and others are pursuing a distributed computing model (multiple active controllers presenting themselves as a unified solution). In Datalink's view, the best short-term bet is the traditional approach, but for long-term growth and scalability, the distributed model is probably going to be the most tenable solution.

Deciding what is right for you

Deduplication can be a confusing topic. With so many vendors offering different types of solutions, some not even referred to as deduplication, it can be difficult to discern which solution types add value and whether the overall benefits outweigh the risks associated with implementing a new technology. Several considerations come into play when organizations assess whether deduplication provides a good fit for their environment. These range from looking at the types of applications and data that are best suited for deduplication, to weighing the many varying characteristics of the deduplication technologies available in today's market (e.g., post process or inline, VTL or D2D2T).

Even after conducting in-depth analysis, it can still be confusing to know which technology to implement. Datalink helps organizations sort through this process. As a leading information storage architect, Datalink helps organizations store, manage, and protect their information. We work with companies to maximize the value that IT delivers to their business. Datalink has worked with a number of enterprises to help define and implement solutions that utilize deduplication capabilities. Our independence allows us to recommend hardware and software technologies that provide the optimal fit for organizations' environments and enable them to effectively and efficiently meet their business initiatives.

Datalink has extensive field experience with a wide range of technologies. This, combined with the knowledge we glean from in-depth testing conducted in our interoperability labs, provides us with invaluable insight that we can pass on to our clients as we design and implement storage solutions.

For more information, contact Datalink at (800) or visit


ExaGrid Product Description. Cost-Effective Disk-Based Backup with Data Deduplication ExaGrid Product Description Cost-Effective Disk-Based Backup with Data Deduplication 1 Contents Introduction... 3 Considerations When Examining Disk-Based Backup Approaches... 3 ExaGrid A Disk-Based Backup

More information

HP StoreOnce D2D. Understanding the challenges associated with NetApp s deduplication. Business white paper

HP StoreOnce D2D. Understanding the challenges associated with NetApp s deduplication. Business white paper HP StoreOnce D2D Understanding the challenges associated with NetApp s deduplication Business white paper Table of contents Challenge #1: Primary deduplication: Understanding the tradeoffs...4 Not all

More information

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything BlueArc unified network storage systems 7th TF-Storage Meeting Scale Bigger, Store Smarter, Accelerate Everything BlueArc s Heritage Private Company, founded in 1998 Headquarters in San Jose, CA Highest

More information

Deduplication Demystified: How to determine the right approach for your business

Deduplication Demystified: How to determine the right approach for your business Deduplication Demystified: How to determine the right approach for your business Presented by Charles Keiper Senior Product Manager, Data Protection Quest Software Session Objective: To answer burning

More information

Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem

Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem Identifying the Hidden Risk of Data Deduplication: How the HYDRAstor TM Solution Proactively Solves the Problem Advanced Storage Products Group Table of Contents 1 - Introduction 2 Data Deduplication 3

More information

WHY DO I NEED FALCONSTOR OPTIMIZED BACKUP & DEDUPLICATION?

WHY DO I NEED FALCONSTOR OPTIMIZED BACKUP & DEDUPLICATION? WHAT IS FALCONSTOR? FalconStor Optimized Backup and Deduplication is the industry s market-leading virtual tape and LAN-based deduplication solution, unmatched in performance and scalability. With virtual

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM ESSENTIALS HIGH-SPEED, SCALABLE DEDUPLICATION Up to 58.7 TB/hr performance Reduces protection storage requirements by 10 to 30x CPU-centric scalability DATA INVULNERABILITY ARCHITECTURE Inline write/read

More information

Future-Proofed Backup For A Virtualized World!

Future-Proofed Backup For A Virtualized World! ! Future-Proofed Backup For A Virtualized World! Prepared by: Colm Keegan, Senior Analyst! Prepared: January 2014 Future-Proofed Backup For A Virtualized World Like death and taxes, growing backup windows

More information

Data Domain Overview. Jason Schaaf Senior Account Executive. Troy Schuler Systems Engineer. Copyright 2009 EMC Corporation. All rights reserved.

Data Domain Overview. Jason Schaaf Senior Account Executive. Troy Schuler Systems Engineer. Copyright 2009 EMC Corporation. All rights reserved. Data Domain Overview Jason Schaaf Senior Account Executive Troy Schuler Systems Engineer 1 Data Domain: Leadership and Innovation Deduplication storage systems > 10,000 systems installed > 3,700 customers

More information

EMC DATA DOMAIN EXTENDED RETENTION SOFTWARE: MEETING NEEDS FOR LONG-TERM RETENTION OF BACKUP DATA ON EMC DATA DOMAIN SYSTEMS

EMC DATA DOMAIN EXTENDED RETENTION SOFTWARE: MEETING NEEDS FOR LONG-TERM RETENTION OF BACKUP DATA ON EMC DATA DOMAIN SYSTEMS SOLUTION PROFILE EMC DATA DOMAIN EXTENDED RETENTION SOFTWARE: MEETING NEEDS FOR LONG-TERM RETENTION OF BACKUP DATA ON EMC DATA DOMAIN SYSTEMS MAY 2012 Backups are essential for short-term data recovery

More information

LDA, the new family of Lortu Data Appliances

LDA, the new family of Lortu Data Appliances LDA, the new family of Lortu Data Appliances Based on Lortu Byte-Level Deduplication Technology February, 2011 Copyright Lortu Software, S.L. 2011 1 Index Executive Summary 3 Lortu deduplication technology

More information

Backup and Recovery 1

Backup and Recovery 1 Backup and Recovery What is a Backup? Backup is an additional copy of data that can be used for restore and recovery purposes. The Backup copy is used when the primary copy is lost or corrupted. This Backup

More information

Redefining Microsoft SQL Server Data Management. PAS Specification

Redefining Microsoft SQL Server Data Management. PAS Specification Redefining Microsoft SQL Server Data Management APRIL Actifio 11, 2013 PAS Specification Table of Contents Introduction.... 3 Background.... 3 Virtualizing Microsoft SQL Server Data Management.... 4 Virtualizing

More information

EMC DATA DOMAIN OPERATING SYSTEM

EMC DATA DOMAIN OPERATING SYSTEM EMC DATA DOMAIN OPERATING SYSTEM Powering EMC Protection Storage ESSENTIALS High-Speed, Scalable Deduplication Up to 58.7 TB/hr performance Reduces requirements for backup storage by 10 to 30x and archive

More information

Introduction to Data Protection: Backup to Tape, Disk and Beyond. Michael Fishman, EMC Corporation

Introduction to Data Protection: Backup to Tape, Disk and Beyond. Michael Fishman, EMC Corporation : Backup to Tape, Disk and Beyond Michael Fishman, EMC Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individual members may use

More information

EMC DATA PROTECTION. Backup ed Archivio su cui fare affidamento

EMC DATA PROTECTION. Backup ed Archivio su cui fare affidamento EMC DATA PROTECTION Backup ed Archivio su cui fare affidamento 1 Challenges with Traditional Tape Tightening backup windows Lengthy restores Reliability, security and management issues Inability to meet

More information

Redefining Backup for VMware Environment. Copyright 2009 EMC Corporation. All rights reserved.

Redefining Backup for VMware Environment. Copyright 2009 EMC Corporation. All rights reserved. Redefining Backup for VMware Environment 1 Agenda VMware infrastructure backup and recovery challenges Introduction to EMC Avamar Avamar solutions for VMware infrastructure Key takeaways Copyright 2009

More information

ExaGrid - A Backup and Data Deduplication appliance

ExaGrid - A Backup and Data Deduplication appliance Detailed Product Description ExaGrid Backup Storage Appliances with Deduplication 2014 ExaGrid Systems, Inc. All rights reserved. Table of Contents Executive Summary...2 ExaGrid Basic Concept...2 ExaGrid

More information

EMC PERSPECTIVE. An EMC Perspective on Data De-Duplication for Backup

EMC PERSPECTIVE. An EMC Perspective on Data De-Duplication for Backup EMC PERSPECTIVE An EMC Perspective on Data De-Duplication for Backup Abstract This paper explores the factors that are driving the need for de-duplication and the benefits of data de-duplication as a feature

More information

EMC AVAMAR. a reason for Cloud. Deduplication backup software Replication for Disaster Recovery

EMC AVAMAR. a reason for Cloud. Deduplication backup software Replication for Disaster Recovery EMC AVAMAR a reason for Cloud Deduplication backup software Replication for Disaster Recovery Bogdan Stefanescu (Bogs) EMC Data Protection Solutions bogdan.stefanescu@emc.com 1 BUSINESS DRIVERS Increase

More information

Seriously: Tape Only Backup Systems are Dead, Dead, Dead!

Seriously: Tape Only Backup Systems are Dead, Dead, Dead! Seriously: Tape Only Backup Systems are Dead, Dead, Dead! Agenda Overview Tape backup rule #1 So what s the problem? Intelligent disk targets Disk-based backup software Overview We re still talking disk

More information

Protect Data... in the Cloud

Protect Data... in the Cloud QUASICOM Private Cloud Backups with ExaGrid Deduplication Disk Arrays Martin Lui Senior Solution Consultant Quasicom Systems Limited Protect Data...... in the Cloud 1 Mobile Computing Users work with their

More information

WHITE PAPER BRENT WELCH NOVEMBER

WHITE PAPER BRENT WELCH NOVEMBER BACKUP WHITE PAPER BRENT WELCH NOVEMBER 2006 WHITE PAPER: BACKUP TABLE OF CONTENTS Backup Overview 3 Background on Backup Applications 3 Backup Illustration 4 Media Agents & Keeping Tape Drives Busy 5

More information

EMC Disk Library with EMC Data Domain Deployment Scenario

EMC Disk Library with EMC Data Domain Deployment Scenario EMC Disk Library with EMC Data Domain Deployment Scenario Best Practices Planning Abstract This white paper is an overview of the EMC Disk Library with EMC Data Domain deduplication storage system deployment

More information

EMC NetWorker Rounds Out Deduplication Support with EMC Data Domain Boost. Analyst: Michael Fisch

EMC NetWorker Rounds Out Deduplication Support with EMC Data Domain Boost. Analyst: Michael Fisch EMC NetWorker Rounds Out Deduplication Support with EMC Data Domain Boost THE CLIPPER GROUP NavigatorTM Published Since 1993 Report #TCG2010046 October 4, 2010 EMC NetWorker Rounds Out Deduplication Support

More information

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009

ESG REPORT. Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches. By Lauren Whitehouse. April, 2009 ESG REPORT : Evaluating Software- vs. Hardware-Based Approaches By Lauren Whitehouse April, 2009 Table of Contents ESG REPORT Table of Contents... i Introduction... 1 External Forces Contribute to IT Challenges...

More information

Overcoming Backup & Recovery Challenges in Enterprise VMware Environments

Overcoming Backup & Recovery Challenges in Enterprise VMware Environments Overcoming Backup & Recovery Challenges in Enterprise VMware Environments Daniel Budiansky Enterprise Applications Technologist Data Domain Dan Lewis Manager, Network Services USC Marshall School of Business

More information

Symantec NetBackup 5220

Symantec NetBackup 5220 A single-vendor enterprise backup appliance that installs in minutes Data Sheet: Data Protection Overview is a single-vendor enterprise backup appliance that installs in minutes, with expandable storage

More information

Understanding EMC Avamar with EMC Data Protection Advisor

Understanding EMC Avamar with EMC Data Protection Advisor Understanding EMC Avamar with EMC Data Protection Advisor Applied Technology Abstract EMC Data Protection Advisor provides a comprehensive set of features that reduce the complexity of managing data protection

More information

Symantec NetBackup deduplication general deployment guidelines

Symantec NetBackup deduplication general deployment guidelines TECHNICAL BRIEF: SYMANTEC NETBACKUP DEDUPLICATION GENERAL......... DEPLOYMENT............. GUIDELINES.................. Symantec NetBackup deduplication general deployment guidelines Who should read this

More information

HP StoreOnce: reinventing data deduplication

HP StoreOnce: reinventing data deduplication HP : reinventing data deduplication Reduce the impact of explosive data growth with HP StorageWorks D2D Backup Systems Technical white paper Table of contents Executive summary... 2 Introduction to data

More information

Protect Microsoft Exchange databases, achieve long-term data retention

Protect Microsoft Exchange databases, achieve long-term data retention Technical white paper Protect Microsoft Exchange databases, achieve long-term data retention HP StoreOnce Backup systems, HP StoreOnce Catalyst, and Symantec NetBackup OpenStorage Table of contents Introduction...

More information

Understanding the HP Data Deduplication Strategy

Understanding the HP Data Deduplication Strategy Understanding the HP Data Deduplication Strategy Why one size doesn t fit everyone Table of contents Executive Summary... 2 Introduction... 4 A word of caution... 5 Customer Benefits of Data Deduplication...

More information

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard UNDERSTANDING DATA DEDUPLICATION Tom Sas Hewlett-Packard SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individual members may use this material

More information

Disk Backup Appliances: The Next Generation

Disk Backup Appliances: The Next Generation Disk Backup Appliances: The Next Generation Prepared by TechRepublic exclusively for CONTENTS: Introduction... 3 Tape: 60 Years of History... 3 Why Tape Fails Us... 5 Disk: Tape s Replacement... 6 Disaster

More information

CIGRE 2014: Udaljena zaštita podataka

CIGRE 2014: Udaljena zaštita podataka CIGRE 2014: Udaljena zaštita podataka Žarko Stupar Product Manager zstupar@mds.rs "" 1 Agenda Udaljena zaštita podataka - pristup Replikacija podataka između data centara Napredna backup rešenja Replikacija

More information

CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR

CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR PERFORMANCE BRIEF CISCO WIDE AREA APPLICATION SERVICES (WAAS) OPTIMIZATIONS FOR EMC AVAMAR INTRODUCTION Enterprise organizations face numerous challenges when delivering applications and protecting critical

More information

Data Reduction Methodologies: Comparing ExaGrid s Byte-Level-Delta Data Reduction to Data De-duplication. February 2007

Data Reduction Methodologies: Comparing ExaGrid s Byte-Level-Delta Data Reduction to Data De-duplication. February 2007 Data Reduction Methodologies: Comparing ExaGrid s Byte-Level-Delta Data Reduction to Data De-duplication February 2007 Though data reduction technologies have been around for years, there is a renewed

More information

Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication

Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication PRODUCT BRIEF Quantum DXi6500 Family of Network-Attached Disk Backup Appliances with Deduplication NOTICE This Product Brief contains proprietary information protected by copyright. Information in this

More information

Technology Fueling the Next Phase of Storage Optimization

Technology Fueling the Next Phase of Storage Optimization White Paper HP StoreOnce Deduplication Software Technology Fueling the Next Phase of Storage Optimization By Lauren Whitehouse June, 2010 This ESG White Paper was commissioned by Hewlett-Packard and is

More information

SPECIAL REPORT. Data Deduplication. Deep Dive. Put your backups on a diet. Copyright InfoWorld Media Group. All rights reserved.

SPECIAL REPORT. Data Deduplication. Deep Dive. Put your backups on a diet. Copyright InfoWorld Media Group. All rights reserved. SPECIAL REPORT Data Deduplication Deep Dive Put your backups on a diet Copyright InfoWorld Media Group. All rights reserved. Sponsored by 2 Data deduplication explained How to reduce backup overhead and

More information

Maximize Your Virtual Environment Investment with EMC Avamar. Rob Emsley Senior Director, Product Marketing

Maximize Your Virtual Environment Investment with EMC Avamar. Rob Emsley Senior Director, Product Marketing 1 Maximize Your Virtual Environment Investment with EMC Avamar Rob Emsley Senior Director, Product Marketing 2 Private Cloud is the Vision Virtualized Data Center Internal Cloud Trusted Flexible Control

More information

Backup and Recovery: The Benefits of Multiple Deduplication Policies

Backup and Recovery: The Benefits of Multiple Deduplication Policies Backup and Recovery: The Benefits of Multiple Deduplication Policies NOTICE This White Paper may contain proprietary information protected by copyright. Information in this White Paper is subject to change

More information

Introduction to Data Protection: Backup to Tape, Disk and Beyond. Michael Fishman, EMC Corporation

Introduction to Data Protection: Backup to Tape, Disk and Beyond. Michael Fishman, EMC Corporation Introduction to Data Protection: Backup to Tape, Disk and Beyond Michael Fishman, EMC Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies

More information

Quantum StorNext. Product Brief: Distributed LAN Client

Quantum StorNext. Product Brief: Distributed LAN Client Quantum StorNext Product Brief: Distributed LAN Client NOTICE This product brief may contain proprietary information protected by copyright. Information in this product brief is subject to change without

More information

INCREASING EFFICIENCY WITH EASY AND COMPREHENSIVE STORAGE MANAGEMENT

INCREASING EFFICIENCY WITH EASY AND COMPREHENSIVE STORAGE MANAGEMENT INCREASING EFFICIENCY WITH EASY AND COMPREHENSIVE STORAGE MANAGEMENT UNPRECEDENTED OBSERVABILITY, COST-SAVING PERFORMANCE ACCELERATION, AND SUPERIOR DATA PROTECTION KEY FEATURES Unprecedented observability

More information

Protecting enterprise servers with StoreOnce and CommVault Simpana

Protecting enterprise servers with StoreOnce and CommVault Simpana Technical white paper Protecting enterprise servers with StoreOnce and CommVault Simpana HP StoreOnce Backup systems Table of contents Introduction 2 Technology overview 2 HP StoreOnce Backup systems key

More information

We look beyond IT. Cloud Offerings

We look beyond IT. Cloud Offerings Cloud Offerings cstor Cloud Offerings As today s fast-moving businesses deal with increasing demands for IT services and decreasing IT budgets, the onset of cloud-ready solutions has provided a forward-thinking

More information

Data Deduplication and Corporate PC Backup

Data Deduplication and Corporate PC Backup A Druva White Paper Data Deduplication and Corporate PC Backup This Whitepaper explains source based deduplication technology and how it is used by Druva s insync product to save storage bandwidth and

More information

EMC BACKUP AND RECOVERY SOLUTIONS

EMC BACKUP AND RECOVERY SOLUTIONS EMC BACKUP AND RECOVERY SOLUTIONS Backup to the future BRS PARTNER UPDATE Sofia, March 14 th, 2011 horia.constantinescu@emc.com dumitru.taraianu@emc.com 1 Agenda EMC backup and recovery solutions Backup

More information

Barracuda Backup Deduplication. White Paper

Barracuda Backup Deduplication. White Paper Barracuda Backup Deduplication White Paper Abstract Data protection technologies play a critical role in organizations of all sizes, but they present a number of challenges in optimizing their operation.

More information

Redefining Microsoft Exchange Data Management

Redefining Microsoft Exchange Data Management Redefining Microsoft Exchange Data Management FEBBRUARY, 2013 Actifio PAS Specification Table of Contents Introduction.... 3 Background.... 3 Virtualizing Microsoft Exchange Data Management.... 3 Virtualizing

More information

Actifio Big Data Director. Virtual Data Pipeline for Unstructured Data

Actifio Big Data Director. Virtual Data Pipeline for Unstructured Data Actifio Big Data Director Virtual Data Pipeline for Unstructured Data Contact Actifio Support As an Actifio customer, you can get support for all Actifio products through the Support Portal at http://support.actifio.com/.

More information

Hardware Configuration Guide

Hardware Configuration Guide Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...

More information