Data Deduplication in a Hybrid Architecture for Improving Write Performance


Data Deduplication in a Hybrid Architecture for Improving Write Performance
Data-Intensive Scalable Computing Laboratory, Department of Computer Science, Texas Tech University, Lubbock, Texas
June 10th, 2013

Outline
1. Background and Motivation
2. Data Deduplication in a Hybrid Architecture (DDiHA)
3. Evaluation
4. Conclusion and Future Work

Big Data Problem
Many scientific simulations have become highly data intensive
Simulations desire finer resolution, both spatial and temporal - e.g. a climate model: 250KM -> 20KM spatially; 6 hours -> 30 minutes temporally
The output data volume reaches tens of terabytes in a single simulation; the entire system deals with petabytes of data
The pressure on the I/O system capability increases substantially

PI | Project | On-Line Data | Off-Line Data
Lamb, Don | FLASH: Buoyancy-Driven Turbulent Nuclear Burning | 75TB | 300TB
Fischer, Paul | Reactor Core Hydrodynamics | 2TB | 5TB
Dean, David | Computational Nuclear Structure | 4TB | 40TB
Baker, David | Computational Protein Structure | 1TB | 2TB
Worley, Patrick H. | Performance Evaluation and Analysis | 1TB | 1TB
Wolverton, Christopher | Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles | 5TB | 100TB
Washington, Warren | Climate Science | 10TB | 345TB
Tsigelny, Igor | Parkinson's Disease | 2.5TB | 50TB
Tang, William | Plasma Microturbulence | 2TB | 10TB
Sugar, Robert | Lattice QCD | 1TB | 44TB
Siegel, Andrew | Thermal Striping in Sodium Cooled Reactors | 4TB | 8TB
Roux, Benoit | Gating Mechanisms of Membrane Proteins | 10TB | 10TB

Figure 3: Data volume of current simulations
Figure 4: Climate Model Evolution: FAR (1990), SAR (1996), TAR (2001), AR4 (2007)

Motivation - Gap between Application Demand and I/O System Capability
The Gyrokinetic Toroidal Code (GTC)
- Outputs particle data consisting of two 2D arrays, for electrons and ions respectively
- The two arrays are distributed among all cores; particles can move across cores randomly as the simulation evolves
A production run at the scale of 16,384 cores
- Each core outputs roughly two million particles, 260GB in total
- Desires O(100MB/s) for efficient output
The average I/O throughput of Jaguar (now Titan) is around 4.7MB/s per node
There is a large and growing gap between the application's requirements and the system's capability

Motivation - Reducing Data Movement
For large-scale simulations, reducing data movement over the network is essential for performance improvement
E.g. active storage has demonstrated its potential for analysis applications (read-intensive)
- Moving the analysis kernel close to the data
- Transferring only small results over the I/O network
What about scientific simulations?
- They write a large volume of data
- The data needs to be stored for future processing and analysis
- They place heavy demands on storage capacity and bandwidth

Motivation - Data Deduplication
Widely used in backup storage systems to reduce storage requirements
Eliminates duplicated copies of data using a signature (fingerprint) representation
Identical segment deduplication is the most widely used approach
- Breaks the data or stream into contiguous segments
- Uses the Secure Hash Algorithm (SHA) to compute a fingerprint for each segment
Impressive performance in terms of both compression ratio and throughput, e.g. up to a 20:1 compression ratio
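As a back-of-envelope illustration only (not a measured result): at the quoted 20:1 ratio, the 260GB GTC output mentioned earlier would shrink to roughly 13GB before it ever reaches the storage network; the ratios actually measured for the datasets used in this work are reported in the evaluation section.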

Data Deduplication in a Hybrid Architecture (DDiHA)
We explore utilizing data deduplication to reduce the data volume on the data path for write-intensive applications:
- Using dedicated nodes (deduplication nodes) for deduplication operations
- Building a deduplication framework before data is transferred over the network
- Evaluating the performance with both theoretical analysis and prototyping

Decoupled Execution Paradigm
DDiHA is designed based on a decoupled system architecture
It explores how to use the compute-side nodes to improve the write performance
Figure: Decoupled HPC System Architecture (compute nodes, compute-side data nodes, and storage-side data nodes)

DDiHA Overview
Dedicated deduplication nodes serve the compute nodes and are connected to them through a high-speed network
The deduplication nodes are deployed with a global address space library to support a shared data deduplication approach
Fingerprints are organized in a globally shared table (global index) for a systematic view and improved deduplication ratios
Writes go through a streaming deduplication API (SD API, an enhanced MPI-IO API)
Deduplicated data is written to storage through a PFS API
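The slides do not spell out how the globally shared index is partitioned across the deduplication nodes; a common scheme, shown here purely as an illustrative assumption, is to route each fingerprint to an owner node by its leading bytes so that every node owns a disjoint slice of the fingerprint space:

    /* Hypothetical routing of a SHA-1 fingerprint to the deduplication node
     * that owns it in the global index. The modulo-on-leading-bytes scheme is
     * an assumption for illustration; the slides only state that the index is
     * a globally shared table built on a global address space. */
    #include <stdint.h>

    int owner_dedup_node(const unsigned char fp[20], int num_dedup_nodes)
    {
        /* SHA-1 fingerprints are effectively uniform, so any fixed byte range
         * spreads the load evenly across the deduplication nodes. */
        uint32_t prefix = ((uint32_t)fp[0] << 24) | ((uint32_t)fp[1] << 16) |
                          ((uint32_t)fp[2] << 8)  |  (uint32_t)fp[3];
        return (int)(prefix % (uint32_t)num_dedup_nodes);
    }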

DDiHA Architecture
Figure 5: Architecture of DDiHA. Compute nodes run the applications, whose MPI-IO layer exposes both an FS API and the SD API. Deduplication nodes hold data chunk buffers, perform data deduplication against a global index based on a global address space, stage data on SSDs, and write to the network storage through the FS API.
Legend: FS - file system, SSD - solid state drive, SD - streaming deduplication

DDiHA API
Two APIs: SD_Open and SD_Write (no read path)
The MPI-IO open/write path was revised
Users add the prefix "sd:" to the file name when opening a file
Data is directed to the deduplication nodes until the file is closed
Sample code:

    MPI_Comm comm;
    MPI_Info info;
    MPI_File fh;
    /* the "sd:" prefix routes the file through the streaming deduplication layer */
    char *filename = "sd:pvfs2:/path/data";
    MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    ...
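A minimal end-to-end usage sketch, assuming the revised MPI-IO path keeps the standard MPI_File_write and MPI_File_close calls once the file has been opened with the "sd:" prefix (the write and close calls are an assumption based on the slide's description, not code from the original):

    #include <mpi.h>

    /* Open through the streaming deduplication layer, write one buffer, close.
     * Data is held on the deduplication nodes until the file is closed. */
    void write_through_sd(MPI_Comm comm, char *buf, int count)
    {
        MPI_File fh;
        char filename[] = "sd:pvfs2:/path/data";
        MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, count, MPI_CHAR, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }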

DDiHA Runtime
Uses the identical segment deduplication algorithm
- Divides the data stream into fixed-size segments, e.g. 8KB
- Computes a fingerprint for each segment with the SHA-1 algorithm
- Each fingerprint is compared against the stored fingerprints, or inserted as a new one
Deduplication nodes maintain a global fingerprint table
The fingerprint table can be implemented as a binary search tree distributed across the deduplication nodes
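A minimal single-node sketch of the segment-and-fingerprint step described above, assuming an in-memory linear table in place of DDiHA's distributed global index (the function name and table layout are illustrative; OpenSSL's SHA1 is used for the fingerprints, link with -lcrypto):

    #include <openssl/sha.h>
    #include <stddef.h>
    #include <string.h>

    #define SEG_SIZE 8192                          /* fixed segment size: 8KB */

    typedef struct {
        unsigned char fp[SHA_DIGEST_LENGTH];       /* 20-byte SHA-1 fingerprint */
    } fingerprint_t;

    /* Deduplicate one buffer: returns how many segments were unique (only
     * those would be forwarded to storage). `table`/`table_len` stand in for
     * the global fingerprint index; the table must be pre-allocated, and the
     * linear scan is for clarity only (the slides suggest a distributed
     * binary search tree instead). */
    size_t dedup_buffer(const unsigned char *data, size_t len,
                        fingerprint_t *table, size_t *table_len)
    {
        size_t unique = 0;
        for (size_t off = 0; off < len; off += SEG_SIZE) {
            size_t seg = (len - off < SEG_SIZE) ? (len - off) : SEG_SIZE;
            unsigned char fp[SHA_DIGEST_LENGTH];
            SHA1(data + off, seg, fp);             /* fingerprint the segment */

            int duplicate = 0;                     /* look up the fingerprint */
            for (size_t i = 0; i < *table_len; i++)
                if (memcmp(table[i].fp, fp, SHA_DIGEST_LENGTH) == 0) {
                    duplicate = 1;
                    break;
                }

            if (!duplicate) {                      /* new segment: remember it */
                memcpy(table[(*table_len)++].fp, fp, SHA_DIGEST_LENGTH);
                unique++;
            }
        }
        return unique;
    }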

Theoretical Modeling
Aims to answer the following questions:
- What is the performance gain or loss?
- What is the desired configuration of deduplication nodes?
Methodology:
- Assume the deduplication nodes are identical to the compute nodes
- Assume an application's task can be divided into two parts: computation and I/O
- Compare the performance against a traditional system given the same resources
Figure 6: Execution model of scientific applications (each compute node alternates between simulation/computation that generates the data and I/O operations to the file system)

Theoretical Modeling
Execution model with DDiHA, two cases:
- Case 1: the deduplication cycle is longer than the computing cycle
- Case 2: the deduplication cycle is shorter than the computing cycle
  - The expected case: computation, deduplication, and I/O overlap
  - Imposes a requirement on the deduplication efficiency
Figure 7: Execution model with DDiHA (compute nodes perform simulation/computation and data transfer; deduplication nodes perform data processing and the I/O operations to the file system)

Notations:
W    Workload: data generated by the simulation, measured in MB
W'   Output data volume after deduplication
C    Simulation/computation capability of each compute node for a given operation, measured in MB/s
C'   Deduplication efficiency of each deduplication node for a given operation, measured in MB/s
b    Average I/O bandwidth of each compute node in the current architecture
b_h  Bandwidth of the high-speed network between compute nodes and deduplication nodes
r    Ratio of data (deduplication) nodes to the total number of nodes
n    Number of nodes
k    Number of compute-I/O phases of an application
λ    b_h / b
φ    W' / W

Theoretical Modeling
Traditional performance:
    P = 1 / T                                                       (1)
    T = ( W / (n C) + W / (n b) ) * k                               (2)
DDiHA performance:
    P' = 1 / T'                                                     (3)
    T' = ( W / (n (1-r) C) + W / (n (1-r) b_h) ) * k                (4)
Overlapping requirement (the deduplication-side work of one phase must fit within the compute-side time of a phase):
    W / (n r C') + W' / (n r b)  <=  W / (n (1-r) C) + W / (n (1-r) b_h)        (5)
Performance gain:
    χ = P' / P = ( W / (n C) + W / (n b) ) * k / ( ( W / (n (1-r) C) + W / (n (1-r) b_h) ) * k )
               = λ (C + b) (1 - r) / (λ b + C)                      (6)

Theoretical Analysis
Baseline:
- Platform: Jaguar XT5 - I/O bandwidth per node: 4.67MB/s
- Application: GTC - compute capacity on Jaguar XT5: 1.08MB/s
Two analyses: an r-λ analysis of the performance gain and a condition analysis of the required deduplication efficiency.
Figure 8: Theoretical analysis of the performance gain. Assuming deduplication is efficient enough for overlapping, DDiHA can achieve better performance by configuring an appropriate fraction of deduplication nodes (around 10%).
Figure 9: Theoretical analysis of the required deduplication efficiency. If deduplication can reduce the data size by more than 10%, the required deduplication efficiency is no more than 20MB/s, which is easy to achieve.
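A small worked check of equation (6) with the baseline numbers above; the r = 0.1 and λ = 18 values are read from the figure annotations and should be treated as assumptions about one plotted configuration, so the printed value is an order-of-magnitude check rather than a reproduction of the curves:

    #include <stdio.h>

    int main(void)
    {
        double C = 1.08;       /* compute capability per node, MB/s (GTC on Jaguar XT5) */
        double b = 4.67;       /* average I/O bandwidth per node, MB/s */
        double lambda = 18.0;  /* b_h / b, read from the figure annotations (assumption) */
        double r = 0.10;       /* fraction of nodes used for deduplication (assumption) */

        /* Equation (6): chi = lambda * (C + b) * (1 - r) / (lambda * b + C) */
        double chi = lambda * (C + b) * (1.0 - r) / (lambda * b + C);
        printf("predicted gain chi = %.2f\n", chi);   /* about 1.09 with these inputs */
        return 0;
    }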

Platform and Setup
Platform:
- Name: DISCFarm cluster
- Number of nodes: 16
- CPU: dual quad-core 2.6GHz
- Memory: 4GB per node
Setup:
Cores | Traditional | DDiHA
Computing cores | 42 | 40
Deduplication cores | 0 | 2
Total cores | 42 | 42
Methodology: simulating the GTC application behavior
Data size: 260MB - 8GB

Deduplication Bandwidth
Around 65MB/s processing capacity/bandwidth without any optimization
Meets the basic requirement derived from the theoretical modeling (20MB/s)
Figure 10: Efficiency of data deduplication (deduplication bandwidth in MB/s for data sizes from 260MB to 8,320MB)

Compression Ratio
Analyzed and compared the compression ratio
Observed different compression ratios, from 0 (dataset4) to 16 (dataset3), for different data sets
Figure 11: Compression ratio (raw:deduplicated) of data deduplication for different data sets

I/O Improvement
For the traditional paradigm, the measurement is the time of writing the data to storage
For DDiHA, the measurement is the time of writing the data to the deduplication nodes (the I/O time from the application's view)
Figure 12: I/O improvement (speedup) from the application's view for data sizes of 10,400MB, 20,800MB, and 41,600MB

Overall Performance Comparison
Measures the total execution time of an application (a simulation of GTC in this work)
Both computation and I/O were included, and the data was transferred to the storage
DDiHA reduced the total execution time by more than 3x compared with the traditional paradigm
Figure 13: Overall performance comparison (execution time in seconds) between DDiHA (DEDUP) and the traditional (CONVENTIONAL) paradigm for data sizes of 10,400MB, 20,800MB, and 41,600MB

Conclusion and Future Work
Big data computing brings new opportunities but also poses big challenges
Trading part of the computing resources for data reduction can be helpful and critical for system performance
This work is an initial investigation of data deduplication on the write path for write-intensive applications
- Beneficial because of its efficiency and compression ratio
Theoretical analysis and prototyping were conducted to evaluate the potential of DDiHA
We plan to further explore deduplication algorithms and how to leverage features of scientific datasets to achieve high compression ratios

Any Questions? Thank You! Welcome to visit: http://discl.cs.ttu.edu/