The Mojette Erasure Code: Application to fault tolerant Distributed File System (DFS)




The Mojette Erasure Code: Application to fault tolerant Distributed File System (DFS). Seminar on architectures of error-correcting codes (joint GDR ISIS and SoCSiP day), 4 November 2014, room B007, Télécom Bretagne. Benoît Parrein, Université de Nantes, IRCCyN Lab, UMR 6597. Joint work with FIZIANS SAS.

Outline:
- MDS erasure codes
- FEC4Cloud project
- Mojette erasure code
- Performance
- Application to DFS

Erasure codes (MDS property): k messages are encoded into n code-words; any k received code-words are decoded back into the k messages.
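In symbols (a standard statement, added here for completeness): an (n, k) code is MDS exactly when its minimum distance attains the Singleton bound,

$$d_{\min} = n - k + 1,$$

which is equivalent to saying that any k of the n code-words determine the k messages.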

FEC4Cloud project (ANR 2012, Emergence call). Partners: IRCCyN (lead), ISAE, SATT Ouest-Valorisation. Budget: 256 k€. Duration: 24 months (product oriented). Goal: promoting erasure codes within Cloud storage infrastructures.

The forward Mojette transform, based on the Radon transform [Guédon, 1995]: compute 1D projections from a 2D geometrical buffer.
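As a concrete illustration, a minimal C sketch of one projection (our own code, not the RozoFS implementation; the bin-index convention b = p·l − q·k and the XOR accumulation are assumptions, and sign conventions vary across papers):

```c
#include <stdint.h>
#include <stdlib.h>

/* Forward Mojette transform for one direction (p, q), gcd(p, q) = 1,
 * q >= 1. Pixel (k, l) of a P x Q buffer (k = column, l = line) is
 * accumulated by XOR into the bin of index b = p*l - q*k, shifted so
 * that bin indices start at 0. */

static int mojette_nbins(int P, int Q, int p, int q)
{
    return q * (P - 1) + abs(p) * (Q - 1) + 1;
}

static int mojette_offset(int P, int Q, int p, int q)
{
    /* shift so the smallest value of p*l - q*k maps to bin 0 */
    return q * (P - 1) + ((p < 0) ? -p * (Q - 1) : 0);
}

/* buf: row-major P x Q pixels; bins: mojette_nbins() bytes, zeroed */
static void mojette_forward(const uint8_t *buf, int P, int Q,
                            int p, int q, uint8_t *bins)
{
    int off = mojette_offset(P, Q, p, q);
    for (int l = 0; l < Q; l++)
        for (int k = 0; k < P; k++)
            bins[p * l - q * k + off] ^= buf[l * P + k];
}
```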

Conditions of reconstruction. Katz criterion (Myron Katz, 1978): a rectangular geometrical buffer of P×Q pixels is reconstructible from the projection set S = {(p_i, q_i), i = 1..N} if and only if

$$\sum_{i=1}^{N} |p_i| \ge P \quad \text{or} \quad \sum_{i=1}^{N} q_i \ge Q.$$

Mathematical morphology extends this to non-rectangular shapes [Normand, 1997].
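A direct check of this criterion, in the same illustrative C as above (assuming the usual normalisation gcd(p_i, q_i) = 1, q_i >= 0):

```c
#include <stdlib.h>

/* Katz criterion for a P x Q buffer and N directions (p[i], q[i]):
 * reconstruction is possible iff sum |p_i| >= P or sum q_i >= Q. */
static int katz_ok(int P, int Q, const int *p, const int *q, int N)
{
    int sum_p = 0, sum_q = 0;
    for (int i = 0; i < N; i++) {
        sum_p += abs(p[i]);
        sum_q += q[i];
    }
    return (sum_p >= P) || (sum_q >= Q);
}
```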

Ancillary data for inversion: the number of pixels (ixels) contributing to each bin, and the sum of their coordinates (not represented here).

The reverse Mojette transform:
1) check the Katz criterion (or use mathematical morphology if necessary);
2) while the 2D geometrical buffer is not completely reconstructed:
   a) find a one-to-one correspondence in the projection set;
   b) back-project it at the right location;
   c) update the projections (bins and ancillary data).
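A hedged sketch of this loop, continuing the conventions of the forward sketch (the per-bin counter is the ancillary data of the previous slide; the function names and argument layout are ours, not RozoFS's):

```c
#include <stdint.h>

/* bins[i] holds the bin values of projection (p[i], q[i]); cnt[i][b]
 * holds the number of still-unknown pixels mapping to bin b. Same
 * conventions as the forward sketch: b = p*l - q*k, XOR arithmetic. */

static int bindex(int k, int l, int p, int q, int P, int Q)
{
    int off = q * (P - 1) + ((p < 0) ? -p * (Q - 1) : 0);
    return p * l - q * k + off;
}

/* buf/known: P x Q, row-major; returns 0 on success, -1 if stuck
 * (the Katz criterion does not hold for the available projections). */
int mojette_inverse(uint8_t *buf, uint8_t *known, int P, int Q,
                    const int *p, const int *q, int nproj,
                    uint8_t **bins, int **cnt)
{
    int left = 0;
    for (int i = 0; i < P * Q; i++)
        left += !known[i];

    while (left > 0) {
        int progress = 0;
        for (int i = 0; i < nproj; i++) {
            for (int l = 0; l < Q; l++)
                for (int k = 0; k < P; k++) {
                    if (known[l * P + k]) continue;
                    int b = bindex(k, l, p[i], q[i], P, Q);
                    if (cnt[i][b] != 1) continue;     /* one-to-one? */
                    uint8_t v = bins[i][b];           /* back-project */
                    buf[l * P + k] = v;
                    known[l * P + k] = 1;
                    left--;
                    progress = 1;
                    /* update every projection (bins and counters) */
                    for (int j = 0; j < nproj; j++) {
                        int bj = bindex(k, l, p[j], q[j], P, Q);
                        bins[j][bj] ^= v;
                        cnt[j][bj]--;
                    }
                }
        }
        if (!progress)
            return -1;
    }
    return 0;
}
```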

Properties of the Mojette erasure code:
- (1+ε) MDS
- systematic and non-systematic coding
- asynchronous reconstruction
- no algebraic constraints (such as Galois fields)
- no prime-size constraint (as in MDS array codes or the FRT)
- linear coding/decoding complexity, O(IN)
- soft coding and decoding
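One way to see the (1+ε) term, under the bin-count convention of the forward sketch above (our illustration, not the talk's exact accounting): a projection (p_i, q_i = 1) of a P×Q buffer carries P + |p_i|(Q − 1) bins where an ideal MDS share would carry P symbols, hence

$$\varepsilon_i = \frac{|p_i|\,(Q-1)}{P}; \qquad \text{e.g. } P = 1024,\ Q = 4,\ p_i \in \{-1,0,1\} \ \Rightarrow\ \varepsilon_i \le \frac{3}{1024} \approx 0.3\,\%.$$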

The reverse Mojette transform:
1) check the Katz criterion (or use mathematical morphology if necessary);
2) while the 2D geometrical buffer is not completely reconstructed:
   a) find a one-to-one correspondence in the projection set (costly);
   b) back-project it at the right location;
   c) update the projections (bins and ancillary data) (costly).

Optimizations (1/2): deterministic reconstruction path [Normand, 2006], applicable when the geometrical buffer is shaped as a stripe. Example on a 4-line geometrical buffer with 4 projections; this also brings a drastic reduction in the number of writes [Fizians engineers, 2013].

Optimizations (2/2): classical forward Mojette transform (sequence diagram between the processor, the support pixel[0..255] and projection 1 with bins bin[0..127]). Each contributing pixel triggers a read-modify-write of the bin:
1) write(bin[0], 0)
2) data1 = read(pixel[0])
3) data2 = read(bin[0])
4) data3 = data2 xor data1
5) write(bin[0], data3)
6) data1 = read(pixel[128])
7) data2 = read(bin[0])
8) data3 = data2 xor data1
9) write(bin[0], data3)

Optimized forward Mojette transform [Féron et al., 2014] (same entities as the previous diagram). All contributing pixels are read first, XORed in a register, and the bin is written exactly once:
1) data1 = read(pixel[0])
2) data2 = read(pixel[128])
3) data3 = data2 xor data1
4) write(bin[0], data3)
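The two schedules, condensed to a single two-pixel bin in C (illustrative code following the slides' XOR convention, not the actual RozoFS source):

```c
#include <stdint.h>

/* classical: one read-modify-write of the bin per contributing pixel */
void bin_classical(uint8_t *bin, const uint8_t *pixel)
{
    bin[0] = 0;                     /* 1) write(bin[0], 0)          */
    bin[0] = bin[0] ^ pixel[0];     /* 2-5) read, xor, write back   */
    bin[0] = bin[0] ^ pixel[128];   /* 6-9) read, xor, write back   */
}

/* optimized [Féron et al., 2014]: accumulate in a register,
 * write the bin exactly once */
void bin_optimized(uint8_t *bin, const uint8_t *pixel)
{
    uint8_t acc = pixel[0] ^ pixel[128];
    bin[0] = acc;
}
```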

Related works (software):
- Reed-Solomon by Cauchy matrices [Byers, 1995]
- Reed-Solomon by Vandermonde matrices [Rizzo, 1998], now RFC 5510
- "Cauchy good" matrices [Plank, 2008] in Jerasure 1.2
- Intel ISA-L (includes SSE instructions)
- ...

Performance (coding): 4+2 coding (k = 4, n = 6) on a Xeon E5-2403 @ 1.8 GHz, in CPU cycles per block (bar chart):

blocksize (bytes) | ISA-L RS 4+2 coding + memcpy | Mojette RozoFS 4+2 coding
8192              | 4723                         | 2621
4096              | 2560                         | 1152

That means 5.625 GB/s (resp. 6.40 GB/s with 4 KB blocks) for Mojette coding and 3.122 GB/s (resp. 2.88 GB/s with 4 KB) for RS coding: ×1.8 (resp. ×2.22) faster (for 3× more coded blocks).
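For reference, the GB/s figures follow from the cycle counts as throughput = blocksize × clock frequency / cycles; for 8 KB blocks:

$$\frac{8192\,\mathrm{B} \times 1.8\times 10^{9}\,\mathrm{Hz}}{2621\ \mathrm{cycles}} \approx 5.625\ \mathrm{GB/s}\ \text{(Mojette)}, \qquad \frac{8192\,\mathrm{B} \times 1.8\times 10^{9}\,\mathrm{Hz}}{4723\ \mathrm{cycles}} \approx 3.122\ \mathrm{GB/s}\ \text{(RS)}.$$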

Performance (decoding): 4+2 decoding (k = 4, n = 6) on a Xeon E5-2403 @ 1.8 GHz, in CPU cycles per block (bar chart):

blocksize (bytes) | ISA-L RS 4+2 decoding + memcpy | Mojette RozoFS 4+2 decoding | memcpy
8192              | 4710                           | 1536                        | 832
4096              | 2496                           | 768                         | 542

That means 9.6 GB/s (at both block sizes) for Mojette decoding and 3.130 GB/s (resp. 2.953 GB/s with 4 KB) for RS decoding: ×3 (resp. ×3.25) faster (for 2× more coded blocks).


High availability means... 99.999999...% reachability. Copies and copies and copies... (up to 7 replicas); hard disks and hard disks and hard disks...; high energy consumption; privacy problems. Erasure codes drastically reduce the size of the infrastructure for the same availability rate (about 2×) and make privacy policies easier to enforce.
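One way to recover the 2× figure (our arithmetic, taking triple replication as the baseline and the 4+2, i.e. k = 4, n = 6, code of the benchmarks):

$$\frac{\text{replication overhead}}{\text{erasure-coding overhead}} = \frac{3}{n/k} = \frac{3}{6/4} = 2.$$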

Distributed file systems: HDFS (Hadoop), Facebook's f4... not really I/O-centric; CephFS, GlusterFS, ..., Scality (based on Chord)... They mix replication (hot data) and erasure coding (cold data): none of them uses erasure codes for all data.


RozoFS:
- I/O-centric distributed file system
- POSIX
- scale-out storage
- commodity hardware
- fault tolerant (up to 4 failures)
- based on erasure coding (Mojette coding)
- dedicated to cold and hot data
- open-source project

Architecture (diagram): a client node runs rozofsmount and talks to the metadata server (exportd) over the control path (metadata); data travel over the data path to a pool of storage nodes, each running storaged; the metadata server monitors the storage nodes.

Read/write path (in non-systematic coding), in layout 0, i.e. (2, 3) coding: two projections out of three are necessary for reconstruction.
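For illustration, a small program checking that any 2 of 3 projections satisfy the Katz criterion for a Q = 2 line buffer; the angle set {(−1,1), (0,1), (1,1)} is an assumption for the example, not necessarily RozoFS's exact layout-0 parameters:

```c
#include <stdio.h>
#include <stdlib.h>

static int katz_ok(int P, int Q, const int *p, const int *q, int N)
{   /* same check as the earlier sketch */
    int sp = 0, sq = 0;
    for (int i = 0; i < N; i++) { sp += abs(p[i]); sq += q[i]; }
    return sp >= P || sq >= Q;
}

int main(void)
{
    int p[3] = { -1, 0, 1 }, q[3] = { 1, 1, 1 };

    /* every pair gives sum q_i = 2 >= Q = 2, so any 2 of 3 decode */
    for (int a = 0; a < 3; a++)
        for (int b = a + 1; b < 3; b++) {
            int pp[2] = { p[a], p[b] }, qq[2] = { q[a], q[b] };
            printf("projections {%d,%d}: %s\n", a, b,
                   katz_ok(1024, 2, pp, qq, 2) ? "decodable" : "lost");
        }
    return 0;
}
```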

Testbed

Performance, sequential access (layout 0, 4 KB blocks): 6 GB/s in read, 3 GB/s in write.

Performance, random access (layout 0, 4 KB blocks): 100k IOPS in read, 80k IOPS in write.

RozoFS deployment (diagram): exportd plus, at the client/application tier, several nodes each running rozofsmount co-located with storage, all interconnected through a standard GigE infrastructure (data storage and metadata) and attached to the external network.

Credits: Sylvain David, Pierre Evenou, Alex Van Kempen, Jean-Pierre Guédon, Quentin Lebourgeois, Jean-Pierre Monchanin, Didier Féron, Louis Legouriellec, Dimitri Pertin, Nicolas Normand, Bastien Confais, Christophe de la Guérrande, Olivier Blin (IRCCyN, UMR 6597). https://github.com/rozofs

Many thanks!

Backup slides

The storage in the world: 40 exabytes (10^18 bytes) stored by 2020; 15 EB (37%) in the cloud(s); of these, 7.5 EB (50%) video, images, ...

Server type: Fujitsu RX300-S8 (R3008S0035FR)
CPU: 2 × Intel Xeon E5-2650 v2 @ 2.60 GHz (8 cores, 16 threads each)
Memory: 64 GB
RAID card: SAS 6 Gbit/s RAID controller, 1 GB cache (D3116C)
Virtual drive 0: 11 × Seagate Constellation.2, SAS 6 Gb/s, 1 TB, 2.5", 7200 RPM (ST91000640SS), RAID 5
Virtual drive 1: 1 × Seagate Pulsar.2, SAS 6 Gb/s, 100 GB, 2.5", MLC (ST100FM0002), RAID 0
Virtual drive 2: 4 × WD Xe, SAS 6 Gb/s, 900 GB, 2.5", 10000 RPM (WD9001BKHG), RAID 0
Ethernet controllers: Intel 82599EB 10-Gigabit SFI/SFP+ (2 × 10 Gb); Intel I350 Gigabit (2 × 1 Gb); Intel I350 Gigabit (4 × 1 Gb)


Conclusions:
- RozoFS is an I/O-centric distributed file system based on an erasure code, used for all data, always
- Performance: 100k IOPS, 6 GB/s throughput...
- RozoFS scales with the infrastructure
- Applications: online video editing, virtualisation (QEMU), databases..., contributing to the convergence of cold and hot data
- Next: privacy (to check), Grid'5000 experiments (to come), deduplication (to attach)