SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks




SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Raúl Gracia-Tinedo (Universitat Rovira i Virgili, Spain); Danny Harnik, Dalit Naor, Dmitry Sotnikov (IBM Research-Haifa, Israel); Sivan Toledo, Aviad Zuck (Tel-Aviv University, Israel)

Pre-Introduction: "Stones in the backpack!" (slide illustration contrasting random data, "just thin air", and zero data)

Introduction

Benchmarking is essential to evaluate storage systems:
- File systems, databases, micro-benchmarks
- FileBench, LinkBench, Bonnie++, YCSB, ...

Many storage benchmarks try to recreate real workloads: operations per unit of time, R/W behavior, ...

But what about the data generated during a benchmark?
- Real dataset: representative, but proprietary and potentially large
- Simple synthetic data (zeros, random data): not representative, but easy to create and reproducible

The Problem

Does the benchmarking data actually matter?

ZFS example: a file system with built-in compression.
- ZFS is significantly content-sensitive when compression is enabled.
- Throughput also varies depending on the compressor.

Conclusion: yes, data matters whenever data reduction is involved!

Current Solutions

Some benchmarks try to emulate the compressibility of data (LinkBench, fio, VDBench) by mixing compressible and incompressible regions (random data and zeros) in the right proportion, e.g. to reach "50% compressible" data.

Problems (LinkBench data vs. real data, compared with zlib on text data from the Calgary Corpus):
- Compression ratios are accurate, but insensitive to the choice of compressor
- Compression times are unrealistic
- Heterogeneity is not captured
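The proportion-mixing approach can be sketched in a few lines; the function name and sizes below are illustrative, not taken from any benchmark's actual code:

```python
import os
import zlib

def mixed_chunk(size: int, compressible_fraction: float) -> bytes:
    """Naive benchmark data: a run of zeros followed by random bytes.

    Mixing trivially compressible and incompressible regions hits a
    target compression ratio, but the data's internal structure looks
    nothing like real text or binaries.
    """
    n_zeros = int(size * compressible_fraction)
    return b"\x00" * n_zeros + os.urandom(size - n_zeros)

chunk = mixed_chunk(64 * 1024, 0.5)  # "50% compressible"
ratio = len(chunk) / len(zlib.compress(chunk))  # roughly 2x with zlib
```

Any LZ-style compressor collapses the zero run almost entirely and stores the random half nearly verbatim, so the overall ratio tracks the mix, yet compression time and within-dataset heterogeneity stay unrealistic.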

Our Mission

Complex situation:
- Most storage benchmarks generate unrealistic contents.
- Representative data is normally not shared due to privacy issues.
- This is bad for the performance evaluation of storage systems with built-in data reduction techniques.

We need a common approach to generate realistic and reproducible benchmarking data. In this work, we focus on compression benchmarking.

Summary of our Work

Synthetic Data GENerator (SDGen): an open and extensible framework to generate realistic data for storage benchmarks.
- Goal: mimic real datasets via a compact, reusable and anonymized dataset representation.
- Mimicking compression: identify the properties of data that are key to the performance of popular lossless compressors (e.g. zlib, lz4).
- Usability and integration: SDGen is available for download and has been integrated into popular benchmarks (LinkBench, Impressions).

SDGen: Concept & Overview

Mimicking method: capture the characteristics of data that affect data reduction techniques, then generate similar synthetic data.

SDGen works in two main phases:
- Scan phase: scan a dataset, build a characterization, share it.
- Generation phase: load a characterization, generate data.

SDGen can do full scans or use sampling. SDGen requires knowing what to scan for and how to generate data.
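A minimal sketch of the two phases, assuming a JSON-serializable characterization that holds only a byte-frequency histogram (SDGen's real characterizations carry more, e.g. repetition-length statistics):

```python
import json
from collections import Counter

def scan(dataset: bytes) -> dict:
    """Scan phase: build a compact, anonymized characterization.
    No raw content is kept, only aggregate statistics."""
    return {"byte_freq": dict(Counter(dataset))}

# The characterization, not the dataset, is what gets shared:
blob = json.dumps(scan(b"hello world"))

# Generation phase: load the characterization and generate from it.
loaded = json.loads(blob)  # note: JSON turns int byte keys into strings
```

Because the characterization contains only histograms, it can be published alongside a benchmark without exposing the proprietary data it was scanned from.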

Mimicking data for compression

We empirically found two properties that affect the behavior of compression engines:
- Repetition length distribution: key for compression time & ratio; typically follows a power law.
- Byte frequency: impacts entropy coding; changes significantly depending on the data.
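The byte-frequency property maps directly to Shannon entropy, which bounds what entropy coding can achieve; a small self-contained check:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 8.0 means the byte frequencies
    give an entropy coder nothing to exploit; lower values mean skewed,
    compressible frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(byte_entropy(b"aaaabbbb"))        # 1.0: two equally likely symbols
print(byte_entropy(bytes(range(256))))  # 8.0: uniform over all byte values
```

The repetition-length distribution is the complementary property: it is what an LZ-style scanner sees as match lengths, and it drives how long the compressor spends searching and how much it saves per match.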

Generating synthetic data

Goals:
- Generate data with similar properties (repetition lengths, byte frequencies)
- Fast generation throughput

At a high level, we generate a data chunk as follows:
1) Random decision: repetition or not?
2) Repetition: insert a repeated sequence; no repetition: insert newly randomized data.
3) Pick a random repeated-sequence length from the repetition-length histogram.
4) Insert the repeated/random data into the synthetic chunk.

(Slide diagram: a data generator fed by the repetition-length histogram and the byte-frequency histogram, initialized with a source of repetitions.)
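The four steps above can be sketched as follows; this is a simplified illustration of the scheme, not SDGen's actual implementation, and the histogram representations (plain dicts) are assumptions:

```python
import random

def generate_chunk(size, rep_len_hist, byte_freq, p_repeat=0.5, seed=42):
    """Emit a synthetic chunk driven by two histograms:
    rep_len_hist: {repetition length -> weight}
    byte_freq:    {byte value -> weight}"""
    rng = random.Random(seed)
    symbols, s_weights = zip(*byte_freq.items())
    lengths, l_weights = zip(*rep_len_hist.items())
    # Initialize the source of repetitions with fresh data.
    out = bytearray(rng.choices(symbols, s_weights, k=64))
    while len(out) < size:
        if rng.random() < p_repeat:
            # Repetition: pick a length from the histogram and copy an
            # already-emitted sequence.
            n = rng.choices(lengths, l_weights)[0]
            start = rng.randrange(len(out) - n) if len(out) > n else 0
            out += out[start:start + n]
        else:
            # No repetition: insert newly randomized data that follows
            # the mimicked byte-frequency distribution.
            out += bytes(rng.choices(symbols, s_weights, k=8))
    return bytes(out[:size])

chunk = generate_chunk(4096, {4: 5, 8: 3, 32: 1}, {0x41: 3, 0x42: 1})
```

Tuning `p_repeat` and the two histograms to the scanned characterization is what steers the generated chunk toward the real dataset's compression ratio and time.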

Evaluation Objective

Metrics:
- Compression ratio
- Compression time

Datasets:
- Calgary/Canterbury corpora
- Silesia corpus
- PDFs (FAST conferences)
- Media (IBM engineers)
- Sensor network data
- Enwiki9
- Private mix (VMs, .xml, .html, ...)

Additional mimicked properties:
- Repetition length
- Entropy (byte frequencies)

Compressors:
- Target: lossless compressors based on byte-level repetition finding and/or entropy encoding (zlib, lz4).
- We also tested other families of compressors (bzip2, lzma).

Evaluation: Mimicked Properties

Experiment: compare repetition length distributions and byte entropy in real data and in SDGen-generated data. SDGen generates data that closely mimics these metrics.

Evaluation: Compression Ratio & Time

Experiment: capture per-chunk compression ratios and times for both synthetic and real datasets.

Per-chunk compression ratio:
- Compression ratios are closely mimicked.
- Heterogeneity within a dataset is also well captured.

Per-chunk compression time:
- Compression times are harder to mimic (especially for lz4).
- Still, for most data types the compressors behave similarly.
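The per-chunk measurement can be reproduced with a short harness (zlib shown here; the evaluation also covers lz4, which is not in the Python standard library):

```python
import time
import zlib

def chunk_stats(data: bytes, chunk_size: int = 64 * 1024):
    """Return (compression ratio, compression time) per chunk: the two
    metrics compared between real and synthetic datasets."""
    stats = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        t0 = time.perf_counter()
        compressed = zlib.compress(chunk)
        elapsed = time.perf_counter() - t0
        stats.append((len(chunk) / len(compressed), elapsed))
    return stats

stats = chunk_stats(b"benchmarking data " * 20000)  # ~360 KB of text-like data
```

Plotting the per-chunk ratios of a real dataset against its mimicked counterpart is how heterogeneity (the spread of ratios within one dataset) is checked, not just the mean.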

Evaluation: Performance of ZFS

Experiment: write 1GB files to ZFS, augmenting the previous datasets.
- ZFS exhibits similar behavior for both real data and our synthetic data.
- ZFS digests LinkBench data faster (+12% to +44%).
- DNA sequencing files in the Calgary corpus are especially hard to compress.

Evaluation: Integration with LinkBench

Experiment: LinkBench write workload using distinct data types (ZFS + SSD storage).
- SDGen serves as the data generation layer for LinkBench.
- Write latency is similar for the synthetic dataset and the text dataset.

Conclusions & Future Directions

- Data is an important aspect of storage benchmarking when data reduction is involved (compression, deduplication).
- We presented SDGen: a framework for generating realistic and sharable benchmarking data. Idea: scan data, build a characterization, share it, generate data.
- We designed a method to mimic data compression ratios and times for popular lossless compressors.
- We plan to extend SDGen to mimic data deduplication.

Q&A

Thanks for your attention!

SDGen code: https://github.com/iostackproject/sdgen

Funding projects:
- Towards the next generation of open Personal Clouds: http://cloudspaces.eu
- Software-Defined Storage for Big Data: http://iostack.eu

Backup: Generation Performance

- Characterizations of chunks can be used in parallel for generation.
- Generating incompressible data is more expensive.
- We plan optimizations that wisely reuse random data to boost throughput.