Reference Guide
WindSpring Data Management Technology (DMT)
Solving Today's Storage Optimization Challenges
September 2011

Table of Contents

The Enterprise and Mobile Storage Landscapes
Increased Storage Capacity with Optimized, Data-Specific Access
Data Traffic Drives Next-Generation Data Management Systems
Compression
Error Detection and Data Deduplication
Forward Error Correction and Erasure Codes
The Intelligent Compression Management Solution
    Metadata
    Block
    CODEC
    Error Detection
    Error Correction
    Security Fingerprinting
    Compression Optimization
    Dedupe Enhancement
    Erasure Codes
    Cloud Data
DMT Optimizes Compression
    CODECs
    Dedupe
    Erasure Codes
Additional Information

The Enterprise and Mobile Storage Landscapes

The volume of digital information being created is skyrocketing as rich multimedia becomes ubiquitous, regulatory requirements force long-term retention of data and the move to cloud computing brings more content to the edges of the network. IDC predicts that by 2020, nearly 40 million petabytes of data will be created annually, while available storage is predicted to grow to just over 20 million petabytes. IDC also predicts that over the same period enterprise storage will grow into a $50 billion industry. That digital storage gap is forcing enterprises and mobile operators to become more effective at managing the complexities of storing and retrieving data. Even with the rapidly decreasing cost per megabyte of storage, online storage is one of the biggest expense elements in IT budgets today.

This Reference Guide describes the challenges that are driving the need for storage optimization for enterprise and mobile applications, and how WindSpring Data Management Technology (DMT) overcomes these challenges.

Increased Storage Capacity with Optimized, Data-Specific Access

WindSpring DMT is an advanced, intelligent software compression management system that optimizes both data compression and compressed data management for enterprise and mobile applications. DMT is built on a flexible architecture that simplifies compression management and includes multiple, selectable, lossless CODECs for storage, backup and dedupe, delivering increased storage capacity and optimized data-specific access. The DMT integrated data management suite also includes enhanced error detection, recovery and protection.

WindSpring DMT is application sensitive, employing real-time, storage-optimized and network-optimized CODECs based on policies or parametric automation. DMT's metadata provides block-level error detection and error correction using multiple algorithms. Built-in monitoring systems ensure optimal integration into mobile and enterprise deployments.

Data Traffic Drives Next-Generation Data Management Systems

The explosion of data communications traffic is challenging network providers in both response times and capacity. While managing this data is essential to modern data systems, not all data is the same. Data storage requirements vary widely depending on how the data is used. The application using the compression should be the key driver when implementing an effective compression management system:

- Real-time data residing on primary storage devices, including user files in Word, Excel, mail and IM, requires real-time access. This makes the speed at which a file is compressed and decompressed the most critical issue.
- Online systems require optimized communications that balance compression and speed, while addressing the limited resources on mobile devices.
- Backup and archival systems (stored company records and data required for regulatory compliance, for example) require maximum compression to reduce both CAPEX and OPEX, making file size a higher priority than speed.
- For embedded systems, the method used to access the compressed data is the driver.

These requirements make compression and data deduplication technologies essential to efficiently managing today's advanced storage systems.
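To make the application-driven policy above concrete, the following minimal sketch shows how a compression manager might translate a workload class into optimization priorities. The type names, classes and block sizes are illustrative assumptions, not WindSpring's API:

    #include <stddef.h>

    /* Workload classes corresponding to the application types above. */
    typedef enum {
        WORKLOAD_REALTIME,   /* primary storage: documents, mail, IM      */
        WORKLOAD_ONLINE,     /* network/mobile: balance ratio and speed   */
        WORKLOAD_ARCHIVE,    /* backup/compliance: maximize compression   */
        WORKLOAD_EMBEDDED    /* driven by how compressed data is accessed */
    } workload_t;

    /* Hypothetical policy record derived from the workload class. */
    typedef struct {
        int    favor_speed;  /* prioritize encode/decode latency   */
        int    favor_ratio;  /* prioritize compressed size         */
        size_t block_size;   /* smaller blocks favor random access */
    } policy_t;

    static policy_t select_policy(workload_t w)
    {
        policy_t p = { 0, 0, 64 * 1024 };
        switch (w) {
        case WORKLOAD_REALTIME: p.favor_speed = 1; p.block_size = 4 * 1024;    break;
        case WORKLOAD_ONLINE:   p.favor_speed = 1; p.favor_ratio = 1;          break;
        case WORKLOAD_ARCHIVE:  p.favor_ratio = 1; p.block_size = 1024 * 1024; break;
        case WORKLOAD_EMBEDDED: p.favor_speed = 1; p.block_size = 4 * 1024;    break;
        }
        return p;
    }

The sections that follow describe how DMT applies this kind of policy through configuration files and API parameters.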

Compression

Compression reduces the overall size of stored or transmitted data, typically using industry standard CODECs such as LZO and GZIP. These compression algorithms ignore the type of data and the nature of the application using it. The method used to implement a CODEC, the type of CODEC used and the type of data being compressed (binary code or text, for example) all affect compression and decompression rates. Compression system optimization is achieved by employing deduplication and a balance of compression and speed, combined with the optimal CODEC for the method used to access a particular application.

Error Detection and Data Deduplication

Error or change detection is used for data deduplication and to determine whether changes have occurred in the original stored data. Error detection is based on algorithms that create a unique code or fingerprint that identifies the contents of a particular block of data. These algorithms provide checksum, CRC (cyclic redundancy check) or hash (SHA) values that are based on the contents of the data. For error checking, fingerprints are used to determine if a change (presumably an error) has occurred in the dataset, whether it is in primary, backup or archive storage. Data deduplication uses the same fingerprints to identify identical blocks of data, replacing new data with a pointer to the location of the original data.

The effectiveness of this deduplication scheme depends on the uniqueness of each fingerprint. If two different data blocks have the same fingerprint, a collision occurs. The probability of a collision depends on the algorithm used: SHA384 provides the lowest probability, but requires the longest time to calculate and uses the largest amount of memory. CRC, on the other hand, has a high probability of collision but is fast and uses little memory.

Deduplication requires that hash values be stored per data block so that new block hash values can be compared against existing block hash values. Smaller block sizes require more hash values per file, increasing memory usage but resulting in better deduplication. Variable block deduplication achieves the best deduplication compression, but does so at the expense of memory and speed. Deduplication can occur as data is written to the primary storage system or as a post-process task running on the storage subsystem. The choice depends on the application: a real-time database is focused on speed, while an archival storage system focuses on maximum deduplication. Ultimately, optimizing deduplication systems demands a balance between speed and memory usage, driven by the type of applications and data usage involved.

Forward Error Correction and Erasure Codes

Compression and deduplication critically impact the reliability of data storage systems, and the introduction of errors in a compressed backup file may result in substantial and unrecoverable loss of backup data. The loss of a primary deduplicated block could cause all dependent files to be lost or corrupted. Error recovery is therefore essential in compressed data systems, whether they are based on CODECs, deduplication or both.

Erasure codes increase the reliability of data storage systems that use compression and data deduplication. While erasure codes make it possible for erased data to be recovered by storing additional metadata with the original data, they also require increased storage. The benefits of this combination of compression and deduplication are realized only if the total storage, including the erasure codes, remains less than that required for the original data. Unlike general error-correcting codes, erasure codes assume the location of the lost data is known. When parameters are adjusted, erasure codes can provide varying levels of reliability and redundancy.

Erasure codes are generated using a number of different algorithms that affect the speed and effectiveness of recovery. The type of data dictates the priorities: the value placed on recovering a stored Web page is typically lower than that placed on recovering a Sarbanes-Oxley document set. The Reed-Solomon, Cauchy, Tornado, Raptor and Typhoon erasure code algorithms differ in the way their encoding and decoding matrices are generated.
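As a deliberately simplified illustration of the erasure-code idea, the sketch below uses single-parity XOR across a group of equal-sized blocks: one parity block lets any one block at a known position be rebuilt. This is a teaching example under stated assumptions, not WindSpring's implementation; production systems use algorithms such as those named above to tolerate multiple erasures.

    #include <stddef.h>
    #include <string.h>

    /* Build one XOR parity block over n equal-sized data blocks. */
    static void make_parity(unsigned char **blocks, size_t n,
                            size_t block_len, unsigned char *parity)
    {
        memset(parity, 0, block_len);
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < block_len; j++)
                parity[j] ^= blocks[i][j];
    }

    /* Rebuild the block at index 'lost'. The erasure location is known,
     * which is what distinguishes erasure codes from general ECC. */
    static void recover_block(unsigned char **blocks, size_t n,
                              size_t block_len, const unsigned char *parity,
                              size_t lost, unsigned char *out)
    {
        memcpy(out, parity, block_len);
        for (size_t i = 0; i < n; i++) {
            if (i == lost)
                continue;
            for (size_t j = 0; j < block_len; j++)
                out[j] ^= blocks[i][j];
        }
    }

Storing the parity alongside n data blocks adds 1/n overhead; raising or lowering the ratio of parity to data is how the varying levels of reliability and redundancy mentioned above are tuned.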

The Intelligent Compression Management Solution

DMT was designed specifically for storage management systems and architected to address the challenges that dominate the management of data in compressed data systems. DMT's standard C libraries enable storage management software to compress data from multiple sources using multiple CODECs, driven automatically or by policy, to multiple destinations. By providing direct data access and configurable block sizes, DMT gives storage software complete control over compressed data, whether it is located on primary or secondary storage. DMT also makes it possible for compression to be configured at the file or block level and, as part of the direct data access, includes metadata that enables the use of multiple industry standard CODECs.

DMT includes WindSpring's own QC0 CODEC, which enables byte-level access to compressed data without rehydration, as well as direct edit and search of compressed data. DMT also includes metadata that allows the selection of multiple block or file-level hashing algorithms such as SHA256 or CRC. Data deduplication can be handled easily using multiple levels of hash code matching. The reliability of compressed data is maintained with erasure codes that employ industry-standard libraries and a choice of erasure code algorithms. DMT is cross-platform compatible, with standard C/POSIX library interfaces for systems based on Windows, Linux and most embedded operating systems.

Metadata

DMT manages another critical aspect of compression management, the application's interaction with the compressed data, using metadata that is included in every compressed file, regardless of CODEC. By managing this metadata, DMT enables applications to directly access the data at the block, sub-block or byte level, as determined by the selected CODEC, without decompressing the file. The metadata is completely configurable to address all critical file data.

Block

The file compression block size can be set from 4 KB to 1 MB. Smaller block sizes may result in faster access speeds, but may not optimize compression. Larger block sizes increase compression but, depending on the access pattern, may result in lower performance. Because access patterns are critical in determining the correct block size, the performance analysis tools within DMT execute real-time analysis of file access patterns, making it possible to optimize the selected parameters.
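The payoff of per-block compression is that any block can be read or rewritten without rehydrating the whole file. The sketch below illustrates the mechanism with zlib standing in for DMT's CODECs; the structure names and fixed block size are illustrative assumptions, and DMT's actual metadata format is not shown.

    #include <stddef.h>
    #include <zlib.h>

    #define BLOCK_SIZE (64 * 1024)   /* illustrative; DMT allows 4 KB to 1 MB */

    /* Per-block index entry: where the compressed block lives and how big it
     * is. A real system would also record a per-block checksum/fingerprint. */
    typedef struct {
        size_t offset;      /* offset of the block in the compressed buffer  */
        size_t comp_len;    /* compressed length of this block               */
        size_t orig_len;    /* original length (the last block may be short) */
    } block_entry_t;

    /* Compress 'len' bytes of 'src' block by block into 'dst' (caller sizes
     * it to at least compressBound(BLOCK_SIZE) per block) and fill the index.
     * Returns the number of blocks written, or 0 on error. */
    static size_t compress_blocks(const unsigned char *src, size_t len,
                                  unsigned char *dst, block_entry_t *index)
    {
        size_t nblocks = 0, out = 0;
        for (size_t in = 0; in < len; in += BLOCK_SIZE, nblocks++) {
            size_t chunk = (len - in < BLOCK_SIZE) ? len - in : BLOCK_SIZE;
            uLongf clen = compressBound(chunk);
            if (compress(dst + out, &clen, src + in, chunk) != Z_OK)
                return 0;                       /* simplified error handling */
            index[nblocks].offset   = out;
            index[nblocks].comp_len = clen;
            index[nblocks].orig_len = chunk;
            out += clen;
        }
        return nblocks;
    }

    /* Fetch one block directly, without touching the rest of the file. */
    static int read_block(const unsigned char *dst, const block_entry_t *e,
                          unsigned char *out)
    {
        uLongf olen = e->orig_len;
        return uncompress(out, &olen, dst + e->offset, e->comp_len) == Z_OK;
    }

Here the index array plays the role DMT assigns to its per-file metadata: to serve a read at a given byte offset, the reader divides by the block size to locate the entry, decompresses only that block and returns the requested bytes.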

CODEC

Because the optimal CODEC for one region of a file with mixed data types may be completely wrong for another region of the same file, DMT also makes it possible for the application to select the CODEC type on a block-by-block basis. For example, a database file may contain textual data for indexes and embedded pictures and videos as objects. On a file basis, the CODEC can be selected by a policy contained in a configuration file; on a block level, CODEC selection can be automated by setting API parameters.

Security Fingerprinting

The block-level metadata provides a fingerprint of the data in the file. Combinations of the CRC, CRC+metadata and source or compressed hash values allow security systems to calculate a unique identity for each file.

Error Detection

Compression can affect the reliability of compressed data in backup systems, with the effective error rate multiplied by the compression ratio, at a minimum. Because error detection needs to be relevant to the data type, DMT enables the error detection method to be selected for both the blocks and the overall file data. The codes can be recorded for both uncompressed data and compressed data. For data deduplication systems, hash calculations are determined by the final deduplication architecture and can be included at the file or block level. Deduplication systems can use CRC, CRC+metadata or hash values to identify duplicate blocks. CRC and CRC+metadata values can be included by default, and DMT can include either source (uncompressed) or compressed hash values.

Error Correction

For high-reliability data systems, erasure codes ensure that files with errors can be recovered. Erasure codes are generated using different algorithms, each with different characteristics. With DMT, the file-level erasure code algorithm is selected using the file metadata.

Compression Optimization

DMT's metadata allows direct access to compressed data at the block and sub-block level. Working at the block level, DMT accelerates search and retrieval, while its multiple CODECs manage storage by dynamically selecting the best compression system for the data type in use. Data compression can be optimized for access speed, compression rate or a balance of the two. CODEC selection can be based on policy, with compression selected by file type, or it can be automatic, with block-level API control of the data CODEC and decision metrics.

DMT provides simple interfaces, including file-by-file and directory compression, extensive APIs for application-level control and standard POSIX file I/O of compressed and rehydrated data.
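A minimal sketch of such a block-by-block decision is shown below. The classifier, thresholds and CODEC names are assumptions made for illustration, not DMT's actual decision metrics; the point is only that the choice can be driven per block by the block's own contents, with a file-level policy able to override it.

    #include <stddef.h>

    typedef enum { CODEC_FAST, CODEC_BALANCED, CODEC_MAX_RATIO, CODEC_STORE } codec_t;

    /* Crude content probe: blocks dominated by printable text usually compress
     * well with dictionary CODECs, while random-looking binary data does not. */
    static codec_t classify_block(const unsigned char *block, size_t len)
    {
        size_t printable = 0;

        if (len == 0)
            return CODEC_STORE;
        for (size_t i = 0; i < len; i++)
            if (block[i] == '\n' || block[i] == '\t' ||
                (block[i] >= 0x20 && block[i] < 0x7f))
                printable++;

        if (printable * 100 / len > 90)
            return CODEC_MAX_RATIO;      /* text-like: favor compression ratio */
        if (printable * 100 / len > 50)
            return CODEC_BALANCED;       /* mixed content                      */
        return CODEC_FAST;               /* binary: favor speed                */
    }

    /* A file-level policy (for example, from a configuration file) can still
     * force a CODEC for the whole file. */
    static codec_t select_codec(const unsigned char *block, size_t len,
                                int force_max_ratio)
    {
        return force_max_ratio ? CODEC_MAX_RATIO : classify_block(block, len);
    }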

Dedupe Enhancement

WindSpring DMT enhances deduplication systems by storing configurable metadata with the compressed data, optimized for speed, reliability or a balance of the two. For every block of data that it encodes, DMT computes metadata related to either the original data or the encoded data, providing both block information and error detection. DMT has been integrated into both Opendedup and the Solaris ZFS system for compression and deduplication.

Error detection algorithms are used extensively in deduplication systems to search for identical files, blocks or regions of data. These algorithms are based on checksums such as CRC16 (16-bit cyclic redundancy check). While DMT defaults to 16-bit CRC algorithms to check the encoded data, CRC32 and Adler32 are available options that provide stronger checking. Stronger fingerprints use message digest algorithms such as the MD and SHA families.

In compression-only systems, both the quality of error detection and the speed of the algorithm are important, with CRC16 and Adler32 being faster than CRC32 while still delivering effective levels of error detection. In data deduplication, the probability of a collision is the most important consideration: CRC16 and Adler32 have very high probabilities of collision, while CRC32 has a lower probability but is slower. In general, hash codes are required for final verification, but simpler algorithms can be used to eliminate candidates that will not match. As an intermediate step, DMT uses a combination of its CRC codes and other metadata to reduce the probability of a collision for CRC-based deduplication. DMT also allows the selection of multiple block or file-level hashing algorithms, from SHA1 to SHA384. With these multiple levels of hash code matching, data deduplication is handled with ease; a simplified sketch of this tiered matching follows this section.

Erasure Codes

Files that are compressed with DMT include erasure coding at the file level, using the Jerasure library. Erasure codes can be selected from the options available with the Jerasure library, including Reed-Solomon, Cauchy, Liberation and Blaum-Roth. Other algorithms, such as Tornado, Raptor and Typhoon, can also be integrated, with the appropriate licensing from the relevant patent holders. DMT maintains the reliability of compressed data with erasure codes that offer industry standard libraries and a choice of erasure code algorithms.

DMT's implementation of erasure codes is extensible. At the file level, DMT maintains maximum access speed while providing erasure code reliability; errors that are detected in the base compression data can be corrected using the embedded erasure codes. At the cloud level, erasure codes can be included with compressed chunk data packets, increasing the reliability of the overall system, and the operating system provides overall erasure code protection for a distributed file system.
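The tiered matching strategy described under Dedupe Enhancement (a cheap checksum to eliminate non-matching candidates, a stronger comparison only for survivors) can be sketched as follows. This is an illustration, not DMT's implementation: it uses zlib's crc32 for the first level and a byte-for-byte comparison in place of a SHA digest for the final verification, and the index structure and limits are assumptions.

    #include <stddef.h>
    #include <string.h>
    #include <zlib.h>

    #define MAX_BLOCKS 1024               /* toy index size for illustration */

    typedef struct {
        unsigned long        crc;         /* cheap level-1 fingerprint          */
        const unsigned char *data;        /* stand-in for a strong hash; a real */
        size_t               len;         /* system stores SHA-1..SHA-384 here  */
    } dedupe_entry_t;

    static dedupe_entry_t index_tab[MAX_BLOCKS];
    static size_t         index_count;

    /* Returns the index of an identical existing block, or -1 if the block is
     * new (in which case it is added to the index and must be stored). */
    static long dedupe_block(const unsigned char *block, size_t len)
    {
        unsigned long crc = crc32(0L, block, (uInt)len);

        for (size_t i = 0; i < index_count; i++) {
            /* Level 1: a CRC or length mismatch proves the blocks differ. */
            if (index_tab[i].crc != crc || index_tab[i].len != len)
                continue;
            /* Level 2: confirm the match before replacing data with a pointer. */
            if (memcmp(index_tab[i].data, block, len) == 0)
                return (long)i;
        }
        if (index_count < MAX_BLOCKS) {
            index_tab[index_count].crc  = crc;
            index_tab[index_count].data = block;   /* caller keeps block alive */
            index_tab[index_count].len  = len;
            index_count++;
        }
        return -1;
    }

Because the CRC comparison rejects almost every non-duplicate cheaply, the expensive verification runs only on genuine candidates, which is exactly the balance between speed and collision probability discussed above.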

Cloud Data

DMT is written using standard C/POSIX-style APIs and can be integrated at the file, system or application level. That integration point drives the implementation of DMT applications in the cloud.

DMT Optimizes Compression

CODECs

DMT was tested in a standard test environment, using an i7/8 GB Nexenta appliance with an internal SATA drive and the modern, data-specific Silesia Corpus. This corpus is a mixture of six textual files (texts, XML, HTML and log data) and six binary files (executables, binary databases, images), totaling 250 MB with file sizes ranging from 2 MB to 50 MB.

Testing DMT's CODECs on the Silesia Corpus highlights the trade-off between encode speed, decode speed and effective compression, with the results expressed as the estimated effective size of a standard 1 TB drive after compression. QC2 is clearly the fastest CODEC but yields an effective size of just over 2 TB, while QC1 yields an effective size of more than 3.5 TB at the cost of speed, making it poorly suited for real-time access.

When compared with standard CODECs in a straight decode operation, DMT excels again, driven by its block architecture: DMT is 20% faster than LZO, 50% faster than GZIP and 80% faster than LZMA. These figures do not take into account random access performance, where DMT's direct access provides further improvements in speed. The actual results vary depending on data type.

Dedupe

WindSpring DMT's dedupe capabilities were also tested in a standard network test environment, using the Silesia Corpus, on a Nexenta i7/8 GB appliance with an internal SATA drive configured with Solaris OS, ZFS and a napp-it console. Source compression has strong downstream multipliers, so the time it takes to transfer and deduplicate DMT data is much less than for native data or ZFS compression. The results are a combination of the effect of compression at the source, deduplication on a smaller (compressed) dataset and ZFS compression performance. Deduplication is very effective on DMT compressed data:

- Time to copy/deduplicate DMT data is about 2x the time it takes to copy one dataset.
- Time to copy/deduplicate native data is nearly 3x the time it takes to copy one dataset.
- Time to copy/deduplicate ZFS compressed data is nearly 2.5x the time it takes to copy one dataset.

Erasure Codes

Measuring the effect of two different checksum algorithms on CODEC speed shows that two factors influence the overall impact: as the block size is reduced, the effect of the checksum algorithm increases; and as the speed of the CODEC increases, the effect of the checksum algorithm also increases. Real-time systems demand fast CODECs, requiring both small block sizes and high speed. DMT allows the application to optimize the checksum algorithm at the file or block level, enabling speed and compression to be balanced for the desired system performance.

Additional Information

WindSpring DMT is a proven solution that can be easily integrated into enterprise and mobile storage deployments, providing optimized compression and a highly evolved compression management system. By making it possible to select the optimal lossless CODEC for each data type and application, DMT delivers increased storage capacity and optimized data-specific access, while DMT's integrated data management suite provides enhanced error detection, recovery and protection.

To learn more about WindSpring DMT, please visit www.windspring.com. If you would like to discuss how DMT can make a difference in your business, please contact WindSpring at info@windspring.com.

www.windspring.com Tel +1 408 452 7400