WHITE PAPER

Improving Storage Efficiencies with Data Deduplication and Compression

Sponsored by: Oracle
Steven Scully
Benjamin Woo
May 2010

IDC OPINION

IT organizations worldwide are dealing with the tremendous growth of data and the complexity of managing the storage for that data. In this data-intensive environment, IT managers need to optimize the capacity and performance of their disk storage systems while working to reduce complexity and lower costs. Storage efficiency has been a goal of primary storage systems managers for some time. Various disk system techniques such as thin provisioning, snapshots, and storage resource management have all been developed to help IT managers improve overall storage utilization and performance. A growing area of storage innovation is data deduplication and compression technologies for primary storage systems.

The system and storage solutions that now make up Oracle's storage offerings have long been at the forefront of various storage efficiency technologies. Some of the industry's first storage virtualization and space-efficient snapshots were developed by these organizations decades ago. Continuing in that tradition, the latest release of the Oracle Sun Storage 7000 systems includes primary storage data deduplication and data compression capabilities.

SITUATION OVERVIEW

IT organizations worldwide are dealing with the tremendous growth of data. IDC forecasts that after slowing some in 2009, total disk storage capacity shipped will grow at 48-50% through 2013. With the growth of capacity comes the complexity of managing the storage for that data. The data growth is coming from a wealth of data-intensive applications (e.g., business analytics), expanding use of high-performance computing (e.g., financial services and life sciences), collaboration and Web 2.0 applications, and content-rich data (e.g., digital images or video). In this data-intensive environment, IT managers need to optimize the capacity and performance of storage systems while working to reduce complexity and lower costs.

In addition to the continued growth in capacity, the accelerated use of virtual servers and desktops is rapidly altering the storage landscape. IT organizations worldwide are turning to virtualized environments to improve datacenter flexibility and scalability. This in turn drives implementation of networked storage solutions, which can create new pressures on storage performance as I/Os that were previously more distributed are aggregated into a smaller number of host interconnects. There are also implications for organizations' data protection processes and architectures to ensure that every virtual server is protected and that the storage has the same flexibility and resiliency as the virtualized server environment.
Storage efficiency has been a goal of primary storage systems managers and an area of storage innovation for some time. Disk storage system techniques such as thin provisioning, space-efficient snapshots, automated tiering, and virtual storage management have all been developed to help IT managers improve storage system utilization and efficiency. Additionally, IT managers use complementary technologies and applications such as archiving (to relocate static data from primary storage to another tier) or storage resource management (to better understand the allocation of storage they already have) to improve their primary storage utilization. More recently, data deduplication and compression technologies for primary storage have been gaining attention in the quest for improved storage efficiency.

Data Deduplication

Data deduplication has become an important storage technology in the past few years. Storage solutions, either based on deduplication or with deduplication as a feature, are now available across the entire spectrum of storage offerings from many vendors, large and small.

Data deduplication gained much of its market attention around backup data. Because a backup process typically copies the same files again and again, it made sense not to copy a file again if it had already been copied. Backup data remains a key opportunity for deduplication technologies, and almost every backup and recovery solution, from backup software to virtual tape libraries to disk-based backup systems, currently includes some form of deduplication.

Data deduplication works by looking for repeated patterns in chunks of data and eliminating the duplicates. An algorithm generates a hash for each chunk of data, and if the hash matches one that has already been stored, the newer chunk is replaced by a pointer to the existing chunk already stored on the system. Three types of chunks are typically used for deduplication: file level, block level, and byte level, each with its own characteristics.

File level. In file-level deduplication, also known as single instancing, the chunks are entire files. The hash is generated on the entire file, and duplicates are stored only once. File-level deduplication typically has the lowest overhead, but any change to a file requires a recalculation of the hash, which will most likely result in another copy of the entire file being stored.

Block level. Block-level deduplication (fixed and variable) requires more processing overhead but allows for better deduplication of files that are similar but slightly different: all blocks are shared except for the ones that differ. This approach is very useful with virtual machine images, for example, which mostly consist of a large copy of the guest operating system plus some blocks that are unique to each virtual machine.

Byte level. Theoretically, byte-level deduplication can be the most powerful, but it typically consumes the most processing resources because it has to compute the beginning and end points of the chunks as well as the resulting hash. It also excels in environments with lots of repeated but block-misaligned data. This approach is often used within an application (such as email) that better understands the data it is managing.
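To make the hash-and-pointer mechanism described above concrete, the following minimal Python sketch implements fixed-size, block-level deduplication: each unique block is stored once, and every logical write is recorded as a list of hash pointers. The DedupStore class, block size, and sample data are invented for illustration and are not drawn from any particular product.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # fixed block size; real systems may also use variable-size chunks

class DedupStore:
    """Toy fixed-block deduplication store: each unique block is kept once,
    and every logical write is recorded as a list of pointers (hashes)."""

    def __init__(self):
        self.blocks = {}   # hash -> the single stored copy of that block
        self.files = {}    # name -> list of block hashes (pointers)

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            chunk = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(digest, chunk)  # store the chunk only if not seen before
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        return b"".join(self.blocks[h] for h in self.files[name])

    def dedup_ratio(self):
        logical = sum(len(self.blocks[h]) for ptrs in self.files.values() for h in ptrs)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical if physical else 1.0

# Two "files" that share most of their blocks, e.g., two similar VM images:
store = DedupStore()
base = os.urandom(BLOCK_SIZE * 100)
store.write("vm1.img", base + b"unique to vm1")
store.write("vm2.img", base + b"unique to vm2")
print(round(store.dedup_ratio(), 2))  # roughly 2.0: the shared blocks are stored once
```

A file-level scheme would hash whole files instead of 4KB chunks, and a byte-level scheme would first compute variable chunk boundaries; the pointer-replacement step is the same in each case.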
Another aspect of data deduplication is when the deduplication is accomplished, with inline and postprocessing being the most common options. With inline deduplication, duplicate chunks are identified and removed before they are written to the back-end disk drives of the system. This process requires more computing power and can impact storage performance, but it doesn't require additional space and doesn't perform unnecessary writes of data that already exists. Alternately, data can be written to disk first, with the deduplication accomplished by a postprocess typically executed as part of a scheduled operation. Postprocessing solutions require less computing power, reduce the potential impact on storage performance, and can be scheduled at times that are convenient to the operation of the datacenter. However, postprocessing does require additional storage capacity to hold all the data before the duplicates can be removed and will execute additional reads and writes of the data.

Data Deduplication for Primary Storage

IDC sees increasing customer interest in a more recent innovation: deduplication of data on primary storage systems. Vendors are responding by developing primary storage offerings in which deduplication can be done by application software running on the host, by an appliance placed between the host and the storage array, or by the storage array itself.

A recent IDC survey into the various uses of data deduplication shows that over 50% of respondents are using data deduplication, or implementing it, for a portion of the data in their primary storage systems. Almost 10% of the users are deduplicating 90% or more of their total data, while the majority of users are deduplicating 20-40% of their data. The top five types of data that respondents are deduplicating are (in order): Exchange, Windows file systems, SQL databases, Web server/site content, and Oracle databases.

Deduplication ratios for primary storage tend to range from 2:1 to 5:1, possibly a little more for some types of data. This is less than end users typically experience with backup deduplication and is due to the nature of primary data as well as the capabilities of some of the deduplication technologies (for example, some cannot deduplicate the open files that will be found on primary storage).

There can be additional benefits from primary storage deduplication beyond the capacity savings. For example, some primary storage deduplication capabilities are tied into replication and data protection capabilities, allowing the system to back up and restore the data in its deduplicated state, improving performance and reducing network bandwidth.

Primary storage deduplication is not for every data set or for every environment. If used for data sets with large amounts of static data, it can produce significant storage savings. If used for the wrong type of data, it can create unwanted latency in the storage process. The key for users is to understand how specific data sets will respond to data deduplication and to use it only where the benefits exceed the costs.
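As a back-of-the-envelope illustration of weighing those benefits, the short sketch below projects capacity savings from the ranges cited above (a 2:1 to 5:1 ratio applied to the 20-40% of primary data that typically gets deduplicated). The function and the example figures are hypothetical, not survey results.

```python
def projected_savings(logical_tb, dedup_ratio, dedupable_fraction):
    """Rough capacity projection: only part of a primary data set typically
    deduplicates well, and primary-storage ratios usually fall between 2:1 and 5:1."""
    dedupable = logical_tb * dedupable_fraction
    physical = dedupable / dedup_ratio + logical_tb * (1 - dedupable_fraction)
    return logical_tb - physical  # TB of raw capacity avoided

# Example: 100TB of primary data, 40% of it suitable for dedup at a 3:1 ratio.
print(projected_savings(100, 3.0, 0.40))  # about 26.7TB saved
```

If the projected savings are small relative to the added latency and CPU cost for that data set, deduplication may not be worth enabling there.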
Data Compression for Primary Storage

Data deduplication is not the only way to improve the efficiency of primary storage. Data compression is another technology that leads to improved efficiency on primary storage systems, and it can be very useful for data sets with variable amounts of empty space (such as databases). Compression can also produce more significant storage savings than deduplication when used in large repositories of unstructured data with few duplicates but many file types that can be compressed. Some vendors offer a combination of both technologies, compressing the data as well as deduplicating it. Other vendors of primary storage compression view compression and deduplication as competing technologies and do not recommend (or support) their use together. As with deduplication, data compression can improve the performance of data protection solutions by replicating, backing up, and restoring the data in its compressed form.

However, primary storage compression also is not for every data set or environment. Some applications already compress their data, so further compression may not produce additional savings. Data compression also adds latency in the storage process because processing power is used to compress the data on write and decompress the data on read. The storage savings must be weighed against the added latency.

Data Deduplication and Compression for Oracle's Unified Storage Systems

The Oracle Sun Storage 7000 Series was launched in the fall of 2008 as a family of unified, open disk storage systems. The 7000 Series currently consists of the 7110, 7210, 7310, and 7410, which scale from 2TB to 576TB of capacity. The 7000 Series supports both file and block data (including CIFS, NFS, FTP/FTPS/SFTP, and HTTP/WebDAV protocols) using Ethernet, Fibre Channel, and InfiniBand network interfaces. The Sun Storage 7000 Series has several unique features that continue to differentiate it in the market. These features include:

DTrace Analytics provides a new way of observing and understanding how the unified storage system and enterprise network clients are operating and behaving, using real-time graphical analysis.

Hybrid Storage Pools provide a high-performance architecture that integrates flash-based SSDs as a caching tier with capacity-optimized, enterprise-class HDDs for all storage. Data migration between these tiers occurs automatically, depending on access patterns.

The ZFS File System is a combined file system and logical volume manager that features high capacities (up to 16EB), high performance, and continuous integrity checking with automatic repair.
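As a rough illustration of the continuous integrity checking and automatic repair mentioned above, the toy sketch below checksums each block on write, verifies it on read, and heals a corrupted copy from a mirror. It is a generic model of the idea, not ZFS's actual on-disk format or code, and all of the names are invented for this example.

```python
import hashlib

def write_block(data):
    """Store a block together with a checksum of its contents (SHA-256 here)."""
    return {"checksum": hashlib.sha256(data).hexdigest(), "data": data}

def read_block(primary, mirror):
    """Verify the block on read; if the primary copy fails its checksum,
    return the mirror copy and repair the primary from it."""
    if hashlib.sha256(primary["data"]).hexdigest() == primary["checksum"]:
        return primary["data"]
    if hashlib.sha256(mirror["data"]).hexdigest() == mirror["checksum"]:
        primary["data"] = mirror["data"]          # self-heal the corrupted copy
        primary["checksum"] = mirror["checksum"]
        return mirror["data"]
    raise IOError("both copies failed checksum verification")

# Simulate silent corruption of the primary copy:
good = write_block(b"important record")
bad = dict(good, data=b"importent record")        # bit rot on disk
print(read_block(bad, good))                      # returns the intact data and repairs 'bad'
```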
The Sun Storage 7000 systems also have many of the storage efficiency technologies traditionally used to improve storage utilization, including thin provisioning, space-efficient snapshots and clones, and simplified storage management. Data deduplication and data compression capabilities have been added more recently to further increase storage efficiency. The 7000 Series also takes advantage of the highly multithreaded OpenSolaris operating system as well as multiple multicore processors, with significant CPU cycles available to provide the additional processing power needed for data deduplication and compression.

Data deduplication. For the 7000 Series, shares or projects can optionally deduplicate data before it is written to the storage pool. Although deduplication is configured per share or project, the system deduplicates across the whole storage pool. The data deduplication is implemented on a block-level basis and is performed inline (referred to by Oracle as synchronous deduplication). Deduplication has no effect on the calculated size of a share, but it does affect the amount of space used in the pool. For example, if two shares contain the same 1GB file, each will appear to be 1GB in size, but the total for the pool will be just 1GB, and the deduplication ratio (available from the system dashboard) will be reported as 2. Oracle claims there are no capacity limits on the deduplicated data, unlike other solutions that have to keep their deduplication tables in memory and therefore have limits on the number of references they can store. If the 7000 Series tables exceed memory, they spill over to the SSD cache and, if needed, to disk, which slightly slows performance at each step but doesn't limit capacity.

Data compression. For the 7000 Series, shares can optionally compress data before writing it to the storage pool, allowing for greater storage utilization at the expense of increased CPU utilization. Four levels of compression are offered, allowing users to choose from the fastest compression (effective only for simple inputs but requiring minimal CPU resources) to the best compression (highest compression ratio but consuming a significant amount of CPU resources). If compression doesn't yield a minimum amount of space savings, it is not used, in order to avoid having to decompress the data when reading it back. If used with deduplication, the data is first compressed and then deduplicated.

Because there are no additional software license fees for the 7000 Series, new and existing customers can easily make use of these features without additional purchases. Existing customers can implement deduplication for new data going forward, but the system will not go back and deduplicate existing data.

The scalability of the Sun Storage 7000 family is well suited to the potential demands of data deduplication and compression. Because these tasks can consume additional CPU resources, users can increase the computational power by adding more CPUs and cache to their 7000 Series product. Likewise, users who expect a significant amount of deduplication and/or compression can start with a smaller amount of total capacity and expand easily by adding more drive expansion units as needed. Finally, customers can increase the performance of reads and writes by adding SSDs to their 7000 Series product.
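The write-path ordering described above for shares with both features enabled (compress first, skip compression if it saves too little space, then deduplicate inline) can be modeled roughly as follows. This hypothetical Python sketch uses zlib as a stand-in compressor; the MIN_SAVINGS threshold, function name, and data structures are invented for illustration and do not reflect the appliance's internals.

```python
import hashlib
import zlib

MIN_SAVINGS = 0.125  # illustrative threshold only; the appliance's actual cutoff is not stated here

def store_block(block, pool_blocks, level=6):
    """Illustrative write path for a share with both compression and deduplication
    enabled: compress first, keep the compressed form only if it saves enough space,
    then deduplicate the resulting payload against blocks already in the pool."""
    compressed = zlib.compress(block, level)
    payload = compressed if len(compressed) <= len(block) * (1 - MIN_SAVINGS) else block
    digest = hashlib.sha256(payload).hexdigest()
    if digest not in pool_blocks:      # inline ("synchronous") dedup: duplicates never hit disk
        pool_blocks[digest] = payload
    return digest                      # pointer recorded in the share's metadata
```

In this model, a block that is already compressed (or otherwise incompressible) is stored uncompressed, so reads of that block never pay a decompression cost, and two shares writing identical blocks consume pool space only once.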
CHALLENGES/OPPORTUNITIES

Since completing its acquisition of Sun Microsystems, Oracle can now focus on the task of educating its sales force, customer base, and prospects about the products and directions for the Sun Storage offerings. Oracle has the opportunity to make a strong statement around storage that builds on the traction it was gaining in the market prior to the acquisition. The new features of the Sun Storage 7000 Series are a natural progression of Oracle's overall strategy around open software initiatives and unified storage solutions. However, IDC believes that Oracle has to address some remaining challenges to maximize the success of the 7000 Series, including:

Oracle needs to focus on educating its sales force and channel partners on how and where to sell the Sun Storage 7000 systems as well as the other storage solutions in its portfolio.

The benefits and trade-offs of using data deduplication and compression on the 7000 Series products need to be clearly understood and communicated to customers. Unrealistic expectations for deduplication ratios, and the performance issues that arise when customers enable these capabilities improperly, should be avoided.

Integration with key application vendors remains a key to the success of the 7000 Series. Oracle needs to continue to expand the alliances it has been building to ensure that major application vendors can more easily take advantage of the advanced features in the 7000 family. For data deduplication and compression, this extends to verifying that major applications work properly when these features are used and providing customers with some sense of the expected results.

Oracle needs to extend its unified storage approach to meet the needs of enterprise environments or further position its storage portfolio to meet those needs. With the announced end of Oracle's partnership with Hitachi Data Systems for enterprise disk solutions, this may be an area in which customers will be seeking direction.

CONCLUSION

Data will continue to grow for IT managers, and they will continue to look for every opportunity to improve their storage utilization and efficiency. Adding data deduplication and compression to primary storage is a more recent area of innovation for storage vendors, one that helps customers increase the effective capacity of their overall storage environments. IDC views these capabilities as technologies with broad application throughout storage hardware and software solutions, not as specific solutions or markets themselves.

Oracle's unified storage systems are well positioned to address current market needs by providing simple-to-use and scalable storage solutions. Industry-standard hardware and open source software have the potential to greatly improve the overall economics of storage for customers.
The Hybrid Storage Pool architecture is a unique way to integrate flash-based SSDs into the storage system, providing higher performance and lower power consumption compared with traditional approaches. Integrated storage with comprehensive analytical tools and numerous bundled software features will help reduce the complexity of managing storage. The addition of data deduplication and compression to the 7000 Series further enhances the solution for customers.

When evaluating the use of data deduplication and/or compression with primary storage systems, IT managers need to understand all the various aspects at play. The trade-off between disk system performance and increased effective capacity that both of these technologies involve must be understood before they are implemented. Implemented correctly, these technologies can improve overall storage efficiency for key data types in many environments with minimal management effort on the part of the IT staff.

Copyright Notice

External Publication of IDC Information and Data: Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right to deny approval of external usage for any reason.

Copyright 2010 IDC. Reproduction without written permission is completely forbidden.