Reference Guide
WindSpring Data Management Technology (DMT)
Solving Today's Storage Optimization Challenges
September 2011
Table of Contents

- The Enterprise and Mobile Storage Landscapes
- Increased Storage Capacity with Optimized, Data-Specific Access
- Data Traffic Drives Next-Generation Data Management Systems
  - Compression
  - Error Detection and Data Deduplication
  - Forward Error Correction and Erasure Codes
- The Intelligent Compression Management Solution
  - Metadata
  - Block
  - CODEC
  - Error Detection
  - Error Correction
  - Security Fingerprinting
  - Compression Optimization
  - Dedupe Enhancement
  - Erasure Codes
  - Cloud Data
- DMT Optimizes Compression
  - CODECs
  - Dedupe
  - Erasure Codes
- Additional Information
The Enterprise and Mobile Storage Landscapes

The volume of digital information being created is skyrocketing as rich multimedia becomes ubiquitous, regulatory requirements force long-term retention of data and the move to cloud computing brings more content to the edges of the network. IDC predicts that by 2020 nearly 40 million petabytes of data will be created annually, while available storage is predicted to grow to just over 20 million petabytes; over the same period, IDC expects enterprise storage to grow into a $50 billion industry. That digital storage gap is forcing enterprises and mobile operators to become more effective at managing the complexities of storing and retrieving data. Even with the rapidly decreasing cost per megabyte of storage, online storage is one of the biggest expense elements in IT budgets today.

This Reference Guide describes the challenges that are driving the need for storage optimization in enterprise and mobile applications, and how WindSpring Data Management Technology (DMT) overcomes these challenges.
Increased Storage Capacity with Optimized, Data-Specific Access

WindSpring DMT is an advanced, intelligent software compression management system that optimizes both data compression and compressed data management for enterprise and mobile applications. DMT is built on a flexible architecture that simplifies compression management and includes multiple, selectable, lossless CODECs for storage, backup and dedupe, delivering increased storage capacity and optimized data-specific access. The DMT integrated data management suite also includes enhanced error detection, recovery and protection.

WindSpring DMT is application sensitive, employing real-time, storage-optimized and network-optimized CODECs based on policies or parametric automation. DMT's metadata provides block-level error detection and error correction using multiple algorithms. Built-in monitoring systems ensure optimal integration into mobile and enterprise deployments.

Data Traffic Drives Next-Generation Data Management Systems

The explosion of data communications traffic is challenging network providers in both response times and capacity. While managing this data is essential to modern data systems, not all data is the same. Data storage requirements vary widely, depending on how the data is used. The application using the compression should be the key driver when implementing an effective compression management system:

- Real-time data residing on primary storage devices, including user files in Word, Excel, Mail and IM, requires real-time access. This makes the speed at which a file is compressed and decompressed the most critical issue.
- Online systems require optimized communications that balance compression and speed, while addressing the limited resources on mobile devices.
- Backup and archival systems (stored company records and data required for regulatory compliance, for example) require maximum compression to reduce both CAPEX and OPEX, making file size a higher priority than speed.
- Embedded systems are driven by the method used to access the compressed data.

These requirements make compression and data deduplication technologies essential to efficiently managing today's advanced storage systems.
Compression

Compression reduces the overall size of stored or transmitted data, typically using industry-standard CODECs such as LZO and GZIP. These compression algorithms ignore the type of data and the nature of the application being compressed. The method used to implement a CODEC, plus the type of CODEC used and the type of data being compressed (binary code or text, for example), affects compression and decompression rates. Compression system optimization is achieved by employing deduplication and a balance of compression and speed, combined with the optimal CODEC for the method used to access a particular application.

Error Detection and Data Deduplication

Error or change detection is used for data deduplication and to determine whether changes have occurred in the original stored data. Error detection is based on algorithms that create a unique code, or fingerprint, that identifies the contents of a particular block of data. These algorithms provide checksum, CRC (cyclic redundancy check) or hash (SHA) values that are based on the contents of the data. For error checking, fingerprints are used to determine whether a change (presumably an error) has occurred in the dataset, whether it is in primary, backup or archive storage. Data deduplication uses the same fingerprints to identify identical blocks of data, replacing new data with a pointer to the location of the original data.

The effectiveness of this deduplication scheme depends on the uniqueness of each fingerprint. If two different data blocks have the same fingerprint, a collision occurs. The probability of a collision depends on the algorithm used: SHA384 provides the lowest probability, but requires the longest time to calculate and uses the largest amount of memory. CRC, on the other hand, has a high probability of collision but is fast and uses little memory.
Deduplication requires that hash values be stored per data block so that new block hash values can be compared against existing block hash values. Smaller block sizes require more hash values per file, increasing memory usage but yielding better deduplication. Variable-block deduplication achieves the best deduplication compression, but does so at the expense of memory and speed. Deduplication can occur as data is written to the primary storage system or as a post-process task running on the storage subsystem. The choice depends on the application: a real-time database is focused on speed, while an archival storage system focuses on maximum deduplication. Ultimately, optimizing deduplication systems demands a balance between speed and memory usage, driven by the type of applications and data usage involved.

Forward Error Correction and Erasure Codes

Compression and deduplication critically impact the reliability of data storage systems: the introduction of errors in a compressed backup file may result in substantial and unrecoverable loss of backup data, and the loss of a primary deduplicated block could cause all dependent files to be lost or corrupted. Error recovery is essential in compressed data systems, whether they are based on CODECs, deduplication or both.

Erasure codes increase the reliability of data storage systems that use compression and data deduplication. While erasure codes make it possible for erased data to be recovered by storing additional metadata with the original data, they also require increased storage. The benefits of this combination of compression and deduplication are realized only if the storage required for the erasure codes is less than that saved on the original data. Unlike error-correcting codes, erasure codes assume the location of the error is known. When parameters are adjusted, erasure codes can provide varying levels of reliability and redundancy.
Erasure codes are generated using a number of different algorithms that affect the speed and effectiveness of recovery. The type of data dictates the priorities: the value placed on recovery requirements for a stored Web page is typically set at a lower threshold than for a Sarbanes-Oxley document set. The Reed-Solomon, Cauchy, Tornado, Raptor and Typhoon erasure code algorithms differ in the way their encoding and decoding matrices are generated.
The Intelligent Compression Management Solution

DMT was designed specifically for storage management systems and architected to address the challenges that dominate the management of data in compressed data systems. DMT's standard C libraries enable storage management software to compress data from multiple sources, using multiple CODECs, driven automatically or by policy, to multiple destinations. By providing direct data access and configurable block sizes, DMT gives storage software complete control over compressed data, whether it is located on primary or secondary storage.

DMT also makes it possible for compression to be configured at the file or block level and, as part of the direct data access, includes metadata that enables the use of multiple industry-standard CODECs. DMT includes WindSpring's own QC0 CODEC, which enables byte-level access to compressed data without rehydration, as well as direct edit and search of compressed data. DMT also includes metadata that allows the selection of multiple block- or file-level hashing algorithms such as SHA256 or CRC. Data deduplication can also be easily handled using multiple levels of hash code matching. The reliability of compressed data is maintained with erasure codes that employ industry-standard libraries and a choice of erasure code algorithms. DMT is cross-platform compatible, with standard C/POSIX library interfaces for systems based on Windows, Linux and most embedded operating systems.

Metadata

DMT manages another critical aspect of compression management (the application's interaction with the compressed data) using metadata that is included in every compressed file, regardless of CODEC. By managing this metadata, DMT enables applications to directly access the data at the block, sub-block or byte level, as determined by the selected CODEC, without decompressing the file. The metadata is completely configurable to address all critical file data.

Block

The file compression block size can be set from 4 KB to 1 MB.
Smaller block sizes may result in faster access speeds, but may not optimize compression. Larger block sizes increase compression but, depending on the access pattern, may result in lower performance. Because access patterns are critical in determining the correct block size, the performance analysis tools within DMT execute real-time analysis of file access patterns, making it possible to optimize the selected parameters.
CODEC

Because the optimal CODEC for one region of a file with mixed data types may be completely wrong for another region of the same file, DMT also makes it possible for the application to select the CODEC type on a block-by-block basis. For example, a database file may contain textual data for indexes and embedded pictures and videos as objects. On a file basis, the CODEC can be selected by a policy contained in a configuration file; on a block level, CODEC selection can be automated by setting API parameters.

Security Fingerprinting

The block-level metadata provides a fingerprint of the data in the file. Combinations of the CRC, CRC+metadata and source or compressed hash values allow security systems to calculate a unique identity for each file.

Error Detection

Compression can affect the reliability of compressed data in backup systems, with the error rate multiplied by the compression ratio, at a minimum. Because error detection needs to be relevant to the data type, DMT enables the error detection code methods to be selected for both the blocks and the overall file data. The codes can be recorded for both uncompressed data and compressed data. For data deduplication systems, hash calculations are determined by the final deduplication architecture and can be included at the file or block level. Deduplication systems can use CRC, CRC+metadata or hash values to identify duplicate data. CRC and CRC+ can be included by default, and DMT can include either source (uncompressed) or compressed hash values.

Error Correction

For high-reliability data systems, erasure codes ensure that files with errors can be recovered. Erasure codes are generated based on different algorithms, each having different characteristics. With DMT, the file-level erasure code algorithm is selected using the file metadata.

Compression Optimization

DMT's metadata allows direct access to compressed data at the block and sub-block level.
Working at the block level, DMT accelerates search and retrieval, while its multiple CODECs manage storage by dynamically selecting the best compression system for the data type in use. Data compression can be optimized for access speed, compression rate or a balance of the two. CODEC selection can be based on policy, with compression selected by file type, or it can be automatic, with block-level API control of the data CODEC and decision metrics. DMT provides simple interfaces, including file-by-file and directory compression, extensive APIs for application-level control and standard POSIX file I/O of compressed and rehydrated data.
Dedupe Enhancement

WindSpring DMT enhances deduplication systems by storing the configurable metadata with the compressed data, optimized for speed, reliability or a balance of the two. DMT computes metadata related to either the original data or the encoded data for every block of data that it encodes, providing both block information and error detection. DMT has been integrated into both Opendedup and the Solaris ZFS system for compression and deduplication.

Erasure Codes

Files that are compressed with DMT include erasure coding at the file level, using the Jerasure library. Erasure codes can be selected from the options available with the Jerasure library, including Reed-Solomon, Cauchy, Liberation and Blaum-Roth. Other algorithms, such as Tornado, Raptor and Typhoon, can also be integrated, with the appropriate licensing from the relevant patent holders. DMT maintains the reliability of compressed data with erasure codes that offer industry-standard libraries and a choice of erasure code algorithms.

Error detection algorithms are used extensively in deduplication systems to search for identical files, blocks or regions of data. These algorithms are based on checksum algorithms such as CRC16 (cyclic redundancy check). While DMT defaults to 16-bit CRC algorithms to check the encoded data, CRC32 and Adler32 are available options that provide stronger detection. Hash fingerprints are based on message digest algorithms such as the MD and SHA families.

DMT's implementation of erasure codes is extensible. At the file level, DMT maintains maximum access speed while providing erasure code reliability; errors detected in the base compression data can be corrected using the embedded erasure codes. At the cloud level, erasure codes can be included with compressed chunk data packets, increasing the reliability of the overall system, while the operating system provides overall erasure code protection for a distributed file system.
In compression-only systems, both error detection and the speed of the algorithm are important, with CRC16 and Adler32 being faster than CRC32 while also delivering effective levels of error detection. In data deduplication, the probability of a collision is the most important consideration: CRC16 and Adler32 have very high probabilities of collision, while CRC32 has lower probabilities but is slower. In general, hash codes are required for final verification, but simpler algorithms can be used to eliminate candidates that will not match. As an intermediate step, DMT uses a combination of its CRC codes and other metadata to reduce the probability of a collision for CRC-based deduplication. DMT also allows the selection of multiple block- or file-level hashing algorithms, from SHA1 to SHA384. Using these multiple levels of hash code matching, data deduplication is handled with ease.
Cloud Data

DMT is written using standard C/POSIX-style APIs and can be integrated at the file, system or application level. That integration point drives the implementation of DMT applications in the cloud.

DMT Optimizes Compression

CODECs

DMT was tested in a standard test environment, using an i7/8 GB Nexenta appliance with an internal SATA drive and the modern, data-specific Silesia Corpus. This corpus is a mixture of six text files (texts, XML, HTML and log data) and six binary files (executables, binary databases, images), totaling 250 MB, with file sizes ranging from 2 MB to 50 MB.

The chart at right illustrates the results of testing DMT's CODECs on the Silesia Corpus, highlighting the trade-off between encode speed, decode speed and effective compression. The chart's right axis shows the estimated size of a standard 1 TB drive after compression. While QC2 is clearly the fastest CODEC, its effective size is just over 2 TB; QC1 yields an effective size of more than 3.5 TB, but is poorly suited for real-time access. When compared with standard CODECs in a straight decode operation, DMT excels again, driven by the block architecture: DMT is 20% faster than LZO, 50% faster than GZIP and 80%
faster than LZMA. These figures do not take into account random access performance, where DMT's direct access provides further improvements in speed. The actual results vary depending on data type.

Dedupe

WindSpring DMT's dedupe capabilities were also tested in a standard network test environment, using the Silesia Corpus, on a Nexenta i7/8 GB appliance with an internal SATA drive configured using the Solaris OS, ZFS and a napp-it console. As illustrated in the charts below, source compression has strong downstream multipliers, so the time it takes to transfer and deduplicate DMT data is much less than for native or ZFS compression. The results are a combination of the effect of compression at the source, deduplication on a smaller (compressed) dataset and ZFS compression performance. Deduplication is very effective on DMT-compressed data:

- Time to copy/deduplicate DMT data is about 2x the time it takes to copy one dataset.
- Time to copy/deduplicate native data is nearly 3x the time it takes to copy one dataset.
- Time to copy/deduplicate ZFS compressed data is nearly 2.5x the time it takes to copy one dataset.

Erasure Codes

The chart below shows the effect of two different checksum algorithms on CODEC speed. Two factors influence the overall impact:

- As the block size is reduced, the effect of the checksum algorithm increases.
- As the speed of the CODEC accelerates, the effect of the checksum algorithm increases.

Real-time systems demand fast CODECs, requiring both small block size and high speed. DMT allows the application to optimize the checksum algorithm on a file or block level, enabling speed and compression to be balanced for the desired system performance.

Additional Information

WindSpring DMT is a proven solution that can be easily integrated into enterprise and mobile storage deployments, providing optimized compression and a highly evolved compression management system. By making it possible to select the optimal lossless CODEC for each data type and application, DMT delivers increased storage capacity and optimized data-specific access, while DMT's integrated data management suite provides enhanced error detection, recovery and protection.

To learn more about WindSpring DMT, please visit www.windspring.com. If you would like to discuss how DMT can make a difference in your business, please contact WindSpring at info@windspring.com.
www.windspring.com
Tel +1 408 452 7400