WHITE PAPER Improving Storage Efficiencies with Data Deduplication and Compression



Sponsored by: Oracle
Analysts: Steven Scully, Benjamin Woo
May 2010

IDC OPINION

IT organizations worldwide are dealing with the tremendous growth of data and the complexity of managing the storage for that data. In this data-intensive environment, IT managers need to optimize the capacity and performance of their disk storage systems while working to reduce complexity and lower costs. Storage efficiency has been a goal of primary storage systems managers for some time. Various disk system techniques such as thin provisioning, snapshots, and storage resource management have all been developed to help IT managers improve overall storage utilization and performance. A growing area of storage innovation is data deduplication and compression technologies for primary storage systems.

The system and storage solutions that now make up Oracle's storage offerings have long been at the forefront of storage efficiency technologies. Some of the industry's first storage virtualization and space-efficient snapshots were developed by these organizations decades ago. Continuing in that tradition, the latest release of the Oracle Sun Storage 7000 systems includes primary storage data deduplication and data compression capabilities.

SITUATION OVERVIEW

IT organizations worldwide are dealing with the tremendous growth of data. IDC forecasts that after slowing somewhat in 2009, total disk storage capacity shipped will grow at 48%-50% through 2013. With the growth of capacity comes the complexity of managing the storage for that data. The data growth is coming from a wealth of data-intensive applications (e.g., business analytics), expanding use of high-performance computing (e.g., financial services and life sciences), collaboration and Web 2.0 applications, and content-rich data (e.g., digital images or video). In this data-intensive environment, IT managers need to optimize the capacity and performance of storage systems while working to reduce complexity and lower costs.

In addition to the continued growth in capacity, the accelerated use of virtual servers and desktops is rapidly altering the storage landscape. IT organizations worldwide are turning to virtualized environments to improve datacenter flexibility and scalability. This in turn drives the implementation of networked storage solutions, which can create new pressures on storage performance as I/Os that were previously more distributed are aggregated into a smaller number of host interconnects. There are also implications for organizations' data protection processes and architectures: every virtual server must be protected, and the storage must offer the same flexibility and resiliency as the virtualized server environment.

Storage efficiency has been a goal of primary storage systems managers and an area of storage innovation for some time. Disk storage system techniques such as thin provisioning, space-efficient snapshots, automated tiering, and virtual storage management have all been developed to help IT managers improve storage system utilization and efficiency. Additionally, IT managers use complementary technologies and applications such as archiving (to relocate static data from primary storage to another tier) or storage resource management (to better understand the allocation of storage they already have) to improve their primary storage utilization. More recently, data deduplication and compression technologies for primary storage have been gaining attention in the quest for improved storage efficiency.

Data Deduplication

Data deduplication has become an important storage technology in the past few years. Storage solutions, either based on deduplication or with deduplication as a feature, are now available across the entire spectrum of storage offerings from many vendors, large and small.

Data deduplication gained much of its market attention around backup data. Because a backup process typically copies the same files again and again, it made sense not to copy a file again if it had already been copied. Backup data remains a key opportunity for deduplication technologies, and almost every backup and recovery solution, from backup software to virtual tape libraries to disk-based backup systems, currently includes some form of deduplication.

Data deduplication works by looking for repeated patterns in chunks of data and eliminating the duplicates. An algorithm generates a hash for each chunk of data; if the hash matches one that has already been stored, the newer chunk is replaced by a pointer to the existing chunk already stored on the system. Three types of chunks are typically used for deduplication (file level, block level, and byte level), each with its own characteristics (a minimal sketch of the block-level case follows the list below):

File level. In file-level deduplication, also known as single instancing, the chunks are entire files. The hash is generated on the entire file, and duplicates are stored only once. File-level deduplication typically has the lowest overhead, but any change to a file requires a recalculation of the hash, which will most likely result in another copy of the entire file being stored.

Block level. Block-level deduplication (fixed or variable) requires more processing overhead but allows for better deduplication of files that are similar but slightly different. All blocks are shared except for the ones that differ. This approach is very useful with virtual machine images, for example, which mostly consist of a large copy of the guest operating system with some blocks that are unique to each virtual machine.

Byte level. Theoretically, byte-level deduplication can be the most powerful, but it typically consumes the most processing resources because it has to compute the beginning and end points of the chunks as well as the resulting hash. It also excels in environments with lots of repeated but block-misaligned data. This approach is often used within an application (such as email) that better understands the data it is managing.
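To make the chunk-and-hash mechanism concrete, the following Python sketch illustrates fixed-size, block-level deduplication of the kind described above. The block size, the choice of SHA-256, and the in-memory index are illustrative assumptions, not details of any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed chunk size, not tied to any product


def dedup_store(data, store, ref_map):
    """Split data into fixed-size blocks and store each unique block once.

    store   maps block hash -> block contents (the single stored copy)
    ref_map records, per logical block, the hash (pointer) it references
    """
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # new pattern: keep the block itself
            store[digest] = block
        ref_map.append(digest)       # always record a pointer for this block


if __name__ == "__main__":
    store, refs = {}, []
    # Two "files" identical except for their last block, mimicking
    # similar-but-slightly-different data such as virtual machine images.
    file_a = b"A" * (BLOCK_SIZE * 3) + b"unique to A"
    file_b = b"A" * (BLOCK_SIZE * 3) + b"unique to B"
    dedup_store(file_a, store, refs)
    dedup_store(file_b, store, refs)

    logical = len(file_a) + len(file_b)
    physical = sum(len(b) for b in store.values())
    print(f"logical bytes: {logical}, unique blocks stored: {len(store)}")
    print(f"approximate deduplication ratio: {logical / physical:.2f}:1")
```

A production implementation would also need to persist the hash index, deal with hash collisions, and reclaim blocks whose reference counts drop to zero; the sketch omits all of that for clarity.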

Another aspect of data deduplication is when the deduplication is performed, with inline and postprocessing being the most common options. With inline deduplication, duplicate chunks are identified and removed before they are written to the back-end disk drives of the system. This process requires more computing power and can impact storage performance, but it does not require additional space and does not perform unnecessary writes of data that already exists. Alternatively, data can be written to disk first, with deduplication accomplished by a postprocess that is typically executed as part of a scheduled operation. Postprocessing solutions require less computing power, reduce the potential impact on storage performance, and can be scheduled at times that are convenient to the operation of the datacenter. However, postprocessing requires additional storage capacity to hold all the data before the duplicates can be removed, and it executes additional reads and writes of the data.

Data Deduplication for Primary Storage

IDC sees increasing customer interest in the more recent innovation of deduplicating data on primary storage systems. Vendors are responding by developing primary storage offerings in which deduplication can be done by application software running on the host, by an appliance placed between the host and the storage array, or by the storage array itself.

A recent IDC survey into the various uses of data deduplication shows that over 50% of the respondents are using data deduplication, or are implementing it, for a portion of the data in their primary storage systems. Almost 10% of the users are deduplicating 90% or more of their total data, while the majority of users are deduplicating 20%-40% of their data. The top five types of data that respondents are deduplicating are (in order) Exchange, Windows file systems, SQL databases, Web server/site content, and Oracle databases.

Deduplication ratios for primary storage tend to range from 2:1 to 5:1, possibly a little more for some types of data. This is less than end users typically experience with backup deduplication and is due to the nature of primary data as well as the capabilities of some of the deduplication technologies (for example, some cannot deduplicate the open files that are common on primary storage).

There can be additional benefits from primary storage deduplication beyond the capacity savings. For example, some primary storage deduplication capabilities are tied into replication and data protection capabilities, allowing the system to back up and restore the data in its deduplicated state, improving performance and reducing network bandwidth.

Primary storage deduplication is not for every data set or for every environment. If used for data sets with large amounts of static data, it can produce significant storage savings. If used for the wrong type of data, it can create unwanted latency in the storage process. The key for users is to understand how specific data sets will respond to data deduplication and to use it only where the benefits exceed the costs.
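To put the ratios cited above in perspective, the capacity saved grows with diminishing returns as the deduplication ratio climbs: a 2:1 ratio already halves the physical capacity consumed, while moving from 4:1 to 5:1 reclaims only a few additional percentage points. The short calculation below is a generic illustration of that arithmetic, not data from the IDC survey.

```python
def savings_pct(dedup_ratio: float) -> float:
    """Fraction of physical capacity saved for a given logical:physical ratio."""
    return (1.0 - 1.0 / dedup_ratio) * 100.0

# Typical primary storage ratios noted above, plus a backup-style ratio for contrast.
for ratio in (2, 3, 4, 5, 10):
    print(f"{ratio}:1 deduplication -> {savings_pct(ratio):.0f}% capacity saved")
# 2:1 -> 50%, 3:1 -> 67%, 4:1 -> 75%, 5:1 -> 80%, 10:1 -> 90%
```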

Data Compression for Primary Storage

Data deduplication is not the only way to improve the efficiency of primary storage. Data compression is another technology that leads to improved efficiency on primary storage systems, and it can be very useful for data sets with variable amounts of empty space (such as databases). Compression can also produce more significant storage savings than deduplication when used on large repositories of unstructured data with few duplicates but many file types that compress well. Some vendors offer a combination of both technologies, compressing as well as deduplicating the data. Other vendors of primary storage compression view compression and deduplication as competing technologies and do not recommend (or support) using them together. As with deduplication, data compression can improve the performance of data protection solutions by replicating, backing up, and restoring the data in its compressed form.

However, primary storage compression is also not for every data set or environment. Some applications already compress their data, so further compression may not produce additional savings. Data compression also adds latency to the storage process because processing power is used to compress the data on write and decompress it on read. The storage savings must be weighed against the added latency.

Data Deduplication and Compression for Oracle's Unified Storage Systems

The Oracle Sun Storage 7000 Series was launched in the fall of 2008 as a family of unified, open disk storage systems. The 7000 Series currently consists of the 7110, 7210, 7310, and 7410, which scale from 2TB to 576TB of capacity. The 7000 Series supports both file and block data (including CIFS, NFS, FTP/FTPS/SFTP, and HTTP/WebDAV protocols) over Ethernet, Fibre Channel, and InfiniBand network interfaces.

The Sun Storage 7000 Series has several unique features that continue to differentiate it in the market:

DTrace Analytics provides a new way of observing and understanding how the unified storage system and enterprise network clients are operating and behaving, using real-time graphical analysis.

Hybrid Storage Pools provide a high-performance architecture that integrates flash-based SSDs as a caching tier with capacity-optimized, enterprise-class HDDs for all storage. Data migration between these tiers occurs automatically, depending on access patterns (a simplified sketch of this tiering idea follows the list).

The ZFS File System is a combined file system and logical volume manager that features high capacities (up to 16EB), strong performance, and continuous integrity checking with automatic repair.
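To illustrate the general idea of automatic, access-pattern-driven tiering, the following Python sketch promotes recently read blocks into a small, fast cache tier and serves later reads from it. It is a generic caching illustration under assumed parameters (cache size, least-recently-used eviction), not a description of the Hybrid Storage Pool implementation.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small 'flash' cache in front of a large 'disk' tier.

    Blocks are promoted to the cache when read; the least recently used block
    is evicted when the cache is full. All parameters are illustrative.
    """

    def __init__(self, cache_blocks: int = 4):
        self.cache_blocks = cache_blocks
        self.flash = OrderedDict()   # block id -> data (hot tier)
        self.disk = {}               # block id -> data (capacity tier)

    def write(self, block_id: str, data: bytes) -> None:
        self.disk[block_id] = data   # new data lands on the capacity tier

    def read(self, block_id: str) -> bytes:
        if block_id in self.flash:               # cache hit: refresh recency
            self.flash.move_to_end(block_id)
            return self.flash[block_id]
        data = self.disk[block_id]               # cache miss: read from disk
        self.flash[block_id] = data              # promote on access
        if len(self.flash) > self.cache_blocks:  # evict least recently used
            self.flash.popitem(last=False)
        return data


store = TieredStore()
for i in range(16):
    store.write(f"blk{i}", bytes([i]) * 512)
for _ in range(3):                 # a hot working set of four blocks
    for i in range(4):
        store.read(f"blk{i}")
print("blocks resident in flash tier:", list(store.flash))
```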

The Sun Storage 7000 systems also have many of the storage efficiency technologies traditionally used to improve storage utilization, including thin provisioning, space-efficient snapshots and clones, and simplified storage management. Data deduplication and data compression capabilities have been added more recently to further increase storage efficiency. The 7000 Series also takes advantage of the highly multithreaded OpenSolaris operating system as well as multiple multicore processors, which provide the significant CPU cycles needed for data deduplication and compression.

Data deduplication. For the 7000 Series, shares or projects can optionally deduplicate data before it is written to the storage pool. Although deduplication is configured per share or project, the system looks for duplicates across the whole storage pool. The deduplication is implemented on a block-level basis and is performed inline (referred to by Oracle as synchronous deduplication). Deduplication has no effect on the calculated size of a share, but it does affect the amount of space used in the pool. For example, if two shares contain the same 1GB file, each will appear to be 1GB in size, but the total for the pool will be just 1GB and the deduplication ratio (available from the system dashboard) will be reported as 2. Oracle claims there are no capacity limits on the deduplicated data, unlike other solutions that must keep their deduplication tables in memory and therefore limit the number of references they can store. If the 7000 Series tables exceed available memory, they spill over to the SSD cache and, if needed, to disk, which slightly slows performance at each step but does not limit capacity.

Data compression. For the 7000 Series, shares can optionally compress data before writing it to the storage pool, allowing for greater storage utilization at the expense of increased CPU utilization. Four levels of compression are offered, allowing users to choose from the fastest compression (effective only for simple inputs but requiring minimal CPU resources) to the best compression (the highest compression ratios but a significant amount of CPU resources). If compression does not yield a minimum amount of space savings, the data is stored uncompressed in order to avoid having to decompress it when reading it back. If compression is used with deduplication, the data is first compressed and then deduplicated.

Because there are no additional software license fees for the 7000 Series, new and existing customers can easily make use of these features without additional purchases. Existing customers can enable deduplication for new data going forward, but the system will not go back and deduplicate existing data.

The scalability of the Sun Storage 7000 family is well suited to the potential demands of data deduplication and compression. Because these tasks can consume additional CPU resources, users can increase computational power by adding more CPUs and cache to their 7000 Series product. Likewise, users who expect a significant amount of deduplication and/or compression can start with a smaller amount of total capacity and expand easily by adding more drive expansion units as needed. Finally, customers can increase read and write performance by adding SSDs to their 7000 Series product.
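As a back-of-the-envelope check of the accounting described above, the snippet below models two shares that each reference the same 1GB of blocks: each share reports its full logical size, the pool consumes the space only once, and the resulting ratio is 2. The block size and data layout are illustrative assumptions, not measurements from a 7000 Series system.

```python
GiB = 1024 ** 3
BLOCK = 128 * 1024          # assumed block size for the illustration

# Each share holds the same logical 1GB file, modeled as a set of block ids.
shared_blocks = {f"file1-blk{i}" for i in range(GiB // BLOCK)}
shares = {"share_a": shared_blocks, "share_b": shared_blocks}

logical_bytes = sum(len(blocks) * BLOCK for blocks in shares.values())
unique_blocks = set().union(*shares.values())    # the pool stores each block once
pool_bytes = len(unique_blocks) * BLOCK

print(f"each share reports: {logical_bytes / len(shares) / GiB:.0f} GiB")
print(f"pool space consumed: {pool_bytes / GiB:.0f} GiB")
print(f"deduplication ratio: {logical_bytes / pool_bytes:.0f}")   # reported as 2
```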

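The trade-off between compression level and CPU cost, and the "store compressed only if it saves enough" behavior described above, can be illustrated with a generic compressor. The sketch below uses Python's zlib purely as a stand-in; the three levels, the 12.5% savings threshold, and the sample data are assumptions for illustration, not the 7000 Series' actual algorithms or thresholds.

```python
import os
import zlib

MIN_SAVINGS = 0.125   # assumed threshold: keep compressed form only if it saves >= 12.5%


def store(data: bytes, level: int):
    """Compress at the given level; fall back to raw storage if savings are too small."""
    packed = zlib.compress(data, level)
    if len(packed) <= len(data) * (1 - MIN_SAVINGS):
        return "compressed", len(packed)
    return "raw", len(data)        # avoids paying decompression cost on later reads


text_like = b"customer,order,amount\n" * 4096    # highly compressible data
random_like = os.urandom(len(text_like))         # models already-compressed data

for name, data in (("text-like", text_like), ("random-like", random_like)):
    for level in (1, 6, 9):                      # fastest .. best, by CPU cost
        kind, size = store(data, level)
        print(f"{name:12s} level {level}: stored {kind}, {size} bytes of {len(data)}")
```

As noted above, when both features are enabled on the 7000 Series the data is compressed first, so deduplication operates on the stored (compressed) form.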
CHALLENGES/OPPORTUNITIES

Having completed its acquisition of Sun Microsystems, Oracle can now focus on the task of educating its sales force, customer base, and prospects about the products and direction of the Sun Storage offerings. Oracle has the opportunity to make a strong statement around storage that builds on the traction the products were gaining in the market prior to the acquisition. The new features of the Sun Storage 7000 Series are a natural progression of Oracle's overall strategy around open software initiatives and unified storage solutions. However, IDC believes that Oracle has to address some remaining challenges to maximize the success of the 7000 Series, including:

Oracle needs to focus on educating its sales force and channel partners on how and where to sell the Sun Storage 7000 systems as well as the other storage solutions in its portfolio.

The benefits and trade-offs of using data deduplication and compression on the 7000 Series products need to be clearly understood and communicated to customers, to avoid unrealistic deduplication expectations and the unexpected performance issues that arise when customers enable these capabilities improperly.

Integration with key application vendors remains a key to the success of the 7000 Series. Oracle needs to continue to expand the alliances it has been building to ensure that major application vendors can more easily take advantage of the advanced features in the 7000 family. For data deduplication and compression, this extends to verifying that major applications work properly when these features are used and to giving customers some sense of the expected results.

Oracle needs to extend its unified storage approach to meet the needs of enterprise environments, or further position its storage portfolio to meet those needs. With the announced end of Oracle's partnership with Hitachi Data Systems for enterprise disk solutions, this may be an area in which customers will be seeking direction.

CONCLUSION

Data will continue to grow, and IT managers will continue to look for every opportunity to improve their storage utilization and efficiency. Adding data deduplication and compression to primary storage is a more recent area of innovation for storage vendors, one that helps customers increase the effective capacity of their overall storage environments. IDC views these capabilities as technologies with broad application throughout storage hardware and software solutions, not as specific solutions or markets in themselves.

Oracle's unified storage systems are well positioned to address current market needs by providing simple-to-use and scalable storage solutions. Industry-standard hardware and open source software have the potential to greatly improve the overall economics of storage for customers. The Hybrid Storage Pool architecture is a unique way to integrate flash-based SSDs into the storage system, providing higher performance and lower power consumption compared with traditional approaches. Integrated storage with comprehensive analytical tools and numerous bundled software features will help reduce the complexity of managing storage. The addition of data deduplication and compression to the 7000 Series further enhances the solution for customers.

When evaluating the use of data deduplication and/or compression with primary storage systems, IT managers need to understand all the aspects at play. Both technologies trade some disk system performance for an increase in effective storage capacity, and that trade-off must be understood before they are implemented. Implemented correctly, these technologies can improve overall storage efficiency for key data types in many environments with minimal management effort on the part of the IT staff.

Copyright Notice

External Publication of IDC Information and Data: Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right to deny approval of external usage for any reason.

Copyright 2010 IDC. Reproduction without written permission is completely forbidden.