Violin: A Framework for Extensible Block-level Storage




Violin: A Framework for Extensible Block-level Storage
Michail Flouris, Dept. of Computer Science, University of Toronto, Canada, flouris@cs.toronto.edu
Angelos Bilas, ICS-FORTH & University of Crete, Greece, bilas@ics.forth.gr

Large-scale Storage
- Large-scale storage systems in a data center
- Driven by:
  - Need for massive amounts of scalable storage
  - Consolidation potential for lower costs
- Challenges in scalable storage:
  - Scalability and performance with commodity components
  - Reduction of management cost
  - Wide-area storage sharing

Reducing Costs With
- Emerging scalable, low-cost hardware
  - Commodity clusters / grids (x86 with Linux / BSD)
  - Commodity interconnects and standard protocols (PCI-X/Express/AS, SATA, SCSI, iSCSI, GigE)
- Storage virtualization software that
  - Offers diverse storage views for different applications
  - Automates storage management functions
  - Supports monitoring
  - Exhibits scalability and low overhead
- We want to improve virtualization at the block level

Block-level Virtualization
- Virtualization has two meanings
  - Notion 1: Indirection
    - Mapping between physical and logical resources
    - Facilitates resource management
  - Notion 2: Sharing
    - Hides the system behind abstractions for sharing
- Our goal: provide block-level virtualization mechanisms to improve indirection and resource management
- Why block-level? Transparency, performance, flexibility

Issue with Existing Virtualization
- Current software has configuration flexibility
  - Use of a small set of predefined modules (RAID levels, volume management)
  - Modules can be combined in arbitrary ways
- But missing: functional flexibility
  - The ability to extend the system with new functionality
  - Extensions implemented by modules loaded on demand
  - Without compromising configuration flexibility (new extension modules are combined with old ones)
  - Adds management, performance, and reliability-related features (e.g., encryption, versioning, migration, compression)

Why Functional Flexibility?
- Reduces implementation complexity: combine simple modules to build complex features
- Customizes the system to the user's or application's needs: adaptivity to existing or new applications
- Enables incremental system evolution
  - Add new functionality before or after deployment
  - Optimize or upgrade components
  - Prototype and evaluate new ideas
- Does not compromise configuration flexibility
- Creating extensions should be easy: developed by vendors, users, or storage administrators

Our Goals
- Designing a system with automated storage management but without extensions does not work
- We intend to add desirable management, performance, and reliability-related features in an incremental, evolutionary fashion
- Violin provides the mechanisms to achieve this
- Management automation
  - Will evolve over time
  - Will include an initial configuration phase, with continuous monitoring and dynamic reconfiguration afterwards
  - The system will be able to adapt to new applications

Related Work
- Extensible filesystems: Ficus, FiST
  - Not at the block level; a complementary approach
- Extensible network protocols: Click Router, X-kernel, Horus
  - Similar layer stacking for network protocol extensions
  - But: basic differences between network and storage
- Block-level virtualization software
  - OSS volume managers: Linux MD, LVM, EVMS, GEOM
  - Numerous commercial solutions
  - Provide only configuration flexibility

Outline
- Motivation
- Violin Design
- Implementation
- Evaluation
- Conclusions

Violin
- An extensible block-level hierarchy over physical devices
[Figure: Violin architecture. User space: filesystem, block I/O, and raw I/O applications. Kernel space: filesystem, block I/O, buffer cache, the Violin kernel module (extension modules, persistent metadata / LXT, high-level I/O API), and the disk device drivers, over the disks.]

Violin Design Goals
- Easy to develop new extensions
- Easy to combine them in an I/O hierarchy
- Low overhead
Violin achieves this by providing:
1. Convenient semantics and mappings
2. Simple control of the I/O request path
3. Persistent metadata support

Virtualization Hierarchies
- The device graph maps I/O sources to I/O sinks
- I/O requests pass through the devices (layers) in the graph
- Nodes are virtual devices; edges are mappings
- Hierarchies: connected device sub-graphs, or independent I/O stacks
- The graph is a DAG (directed acyclic graph); see the sketch below
[Figure: two example hierarchies (A and B) of internal devices A-M, connecting source devices to sink devices, annotated with input and output capacities.]
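To make the device-graph idea concrete, here is a minimal user-space sketch in C. It is not the Violin API; the struct and names are illustrative. Nodes are virtual devices, outgoing edges point to lower (output) devices, and a node with no outputs plays the role of a sink (a physical disk).

```c
#include <stdio.h>

#define MAX_OUT 4

struct vdev {
    const char  *name;
    struct vdev *out[MAX_OUT];  /* edges: mappings to lower (output) devices */
    int          nout;          /* 0 means this node is a sink (e.g., a disk) */
};

/* Walk one hierarchy (a connected sub-graph) from a source down to its sinks. */
static void walk(const struct vdev *d, int depth)
{
    printf("%*s%s%s\n", depth * 2, "", d->name, d->nout ? "" : " (sink)");
    for (int i = 0; i < d->nout; i++)
        walk(d->out[i], depth + 1);
}

int main(void)
{
    struct vdev disk1 = { .name = "disk1" };
    struct vdev disk2 = { .name = "disk2" };
    struct vdev raid0 = { .name = "raid0", .out = { &disk1, &disk2 }, .nout = 2 };
    struct vdev part  = { .name = "part0", .out = { &raid0 }, .nout = 1 };

    walk(&part, 0);   /* part0 -> raid0 -> disk1, disk2 */
    return 0;
}
```

In Violin the graph is persistent and I/O requests flow along these edges; the walk here only prints the topology.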

Virtual Devices
- A virtual device is driven by an extension module
  - A device (layer) is a runtime instance of a module
- It sees an input address space, an output address space, and one or more output devices
- It maps blocks arbitrarily between devices
- It transforms data between input and output, and vice versa
- Some modules need a logical translation table (LXT), a type of logical device metadata; a sketch follows
[Figure: an LXT mapping blocks of the input address space (0-8) to blocks of the output address space (0-7).]
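As an illustration of what an LXT holds, the following user-space C sketch (again, not the Violin API) keeps one entry per logical block of the input address space and records which block of the output address space it currently maps to. A layer such as versioning or migration would update these entries as it relocates blocks.

```c
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS   16
#define UNMAPPED  UINT64_MAX

static uint64_t lxt[NBLOCKS];   /* logical (input) block -> physical (output) block */

static void lxt_init(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        lxt[i] = UNMAPPED;
}

/* Called by a layer when it (re)locates a logical block. */
static void lxt_remap(uint64_t logical, uint64_t physical)
{
    lxt[logical] = physical;
}

/* Called on the I/O path to translate an incoming block address. */
static uint64_t lxt_lookup(uint64_t logical)
{
    return lxt[logical];
}

int main(void)
{
    lxt_init();
    lxt_remap(0, 3);
    lxt_remap(1, 0);
    lxt_remap(2, 6);
    printf("logical 2 maps to physical %llu\n",
           (unsigned long long)lxt_lookup(2));
    return 0;
}
```

In Violin such a table would itself be kept in a persistent object, as described on the persistent-metadata slide.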

Control of I/O Requests
- The Violin API gives layers simple control over requests passing through them
- A layer can initiate, forward, complete, or terminate I/O requests using simple tags or calls
  - Initiate I/O: useful for multiplying I/O flows; layers issue asynchronous I/O requests with callback handlers executed on completion
  - Forward I/O: send the I/O request to a lower layer
  - Complete I/O: useful for caching layers
  - Terminate I/O: error handling
A sketch of this tagging scheme follows.
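The following user-space sketch shows the flavour of this control model with hypothetical names (the real Violin calls and tags are not reproduced here): each layer examines a request and returns a tag, and the framework pushes the request down the stack until some layer completes or terminates it, or it reaches the sink.

```c
#include <stdio.h>

enum action { FORWARD, COMPLETE, TERMINATE };

struct request {
    unsigned long block;
    int           write;
    int           error;
};

typedef enum action (*layer_fn)(struct request *);

/* A caching-like layer: completes reads it can serve, forwards everything else. */
static enum action cache_layer(struct request *r)
{
    if (!r->write && r->block % 2 == 0)   /* pretend even blocks are cached */
        return COMPLETE;
    return FORWARD;
}

/* An error-checking layer: terminates out-of-range requests. */
static enum action bounds_layer(struct request *r)
{
    if (r->block >= 1000) {
        r->error = -1;
        return TERMINATE;
    }
    return FORWARD;
}

/* Framework side: push the request down the stack, honouring each layer's tag. */
static void issue(struct request *r, layer_fn *stack, int nlayers)
{
    for (int i = 0; i < nlayers; i++) {
        switch (stack[i](r)) {
        case COMPLETE:  printf("block %lu: completed at layer %d\n", r->block, i); return;
        case TERMINATE: printf("block %lu: terminated at layer %d\n", r->block, i); return;
        case FORWARD:   break;
        }
    }
    printf("block %lu: forwarded to the sink (disk)\n", r->block);
}

int main(void)
{
    layer_fn stack[] = { bounds_layer, cache_layer };
    struct request r1 = { .block = 7 };     /* reaches the sink               */
    struct request r2 = { .block = 4 };     /* completed by the cache layer   */
    struct request r3 = { .block = 2000 };  /* terminated by the bounds layer */

    issue(&r1, stack, 2);
    issue(&r2, stack, 2);
    issue(&r3, stack, 2);
    return 0;
}
```

Initiating new asynchronous requests with completion callbacks (used, for example, by a RAID layer to multiply I/O flows) is omitted from this sketch for brevity.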

Example: Tagged Request Control
- Left: the request is forwarded through the hierarchy from source to sink (layer order: F, E, D, B) and then back up
- Right: the request is forwarded until layer C, where it is tagged complete and returns upwards without reaching the sink
[Figure: the two request paths through the example hierarchy, showing the forward path, the tag point, and the return path.]

Persistent Metadata
- Storage layers need persistent state for superblocks, partition tables, block maps, etc.
- Violin offers persistent objects for layer metadata
  - Persistent objects are memory-mapped storage blocks, accessed as generic memory objects
  - Automatically synchronized to stable storage periodically
  - Automatically loaded / unloaded during startup / shutdown
  - Layers need only allocate objects once
- Violin's internal metadata are also persistent objects: the device graph and hierarchy info are stored in superblocks
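One plausible way to picture a persistent object is the following user-space sketch: a metadata struct is memory-mapped from a backing file, updated like ordinary memory, and flushed with msync. The file name, struct layout, and explicit msync call are assumptions for illustration; in Violin the framework allocates the backing blocks and performs the periodic synchronization itself.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct layer_md {              /* example layer metadata */
    unsigned long nblocks;
    unsigned long epoch;
};

int main(void)
{
    int fd = open("layer_md.obj", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, sizeof(struct layer_md)) < 0)
        return 1;

    /* The layer sees the object as plain memory... */
    struct layer_md *md = mmap(NULL, sizeof(*md), PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (md == MAP_FAILED)
        return 1;

    md->nblocks = 1024;        /* ...and updates it in place... */
    md->epoch  += 1;

    /* ...while the framework periodically pushes it to stable storage. */
    if (msync(md, sizeof(*md), MS_SYNC) < 0)
        return 1;

    munmap(md, sizeof(*md));
    close(fd);
    return 0;
}
```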

Metadata Consistency
Three levels of metadata consistency (weakest to strongest):
1. Lazy updates
   - Synchronization periodically overwrites the older metadata
   - Similar to non-journaling filesystems
2. Shadow updates
   - Two copies of all metadata are kept (normal and shadow)
   - Synchronization overwrites the normal metadata first, then the shadow copy
   - Guarantees module metadata consistency
3. Atomic versioned-metadata consistency
   - Module metadata and application data are versioned
   - On failure the system rolls back to a consistent snapshot
Violin currently supports levels 1 and 2. A sketch of the shadow-update scheme follows.
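Below is a user-space sketch of one way level 2 could be realized, following the ordering described on the slide: two on-disk copies of the metadata, with the normal copy overwritten and made durable before the shadow copy is touched, so a crash in between still leaves the shadow copy holding the previous consistent version. Offsets, sizes, and the file name are assumptions, not Violin's actual layout.

```c
#define _XOPEN_SOURCE 500      /* for pwrite */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MD_SIZE    4096
#define NORMAL_OFF 0           /* first (normal) copy  */
#define SHADOW_OFF MD_SIZE     /* second (shadow) copy */

static int write_copy(int fd, off_t off, const char *buf)
{
    if (pwrite(fd, buf, MD_SIZE, off) != MD_SIZE)
        return -1;
    return fsync(fd);          /* make this copy durable before touching the next */
}

/* Shadow update: overwrite the normal copy first, then the shadow copy. */
static int sync_metadata(int fd, const char *buf)
{
    if (write_copy(fd, NORMAL_OFF, buf) < 0)
        return -1;
    return write_copy(fd, SHADOW_OFF, buf);
}

int main(void)
{
    char buf[MD_SIZE] = { 0 };
    strcpy(buf, "example layer metadata, epoch 42");

    int fd = open("metadata.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    int rc = sync_metadata(fd, buf);
    close(fd);
    return rc ? 1 : 0;
}
```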

Block Size and Memory Overhead
- Small block sizes can increase the memory footprint of module metadata when the metadata is proportional to the total number of blocks
- Many OSes have small block sizes (Linux 2.4.x: 4 KB block device block size)
- Modules need their own, independent block size to manage metadata in larger chunks
- Violin supports a larger internal block size
  - The size is set by the modules
  - It is independent of the OS block size
  - It effectively reduces memory overhead
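As a rough, illustrative calculation (the 8 bytes of metadata per block is an assumed figure, not one from the talk): a module keeping per-block metadata for a 1 TB volume at the 4 KB OS block size must track 2^40 / 2^12 = 2^28, roughly 268 million, blocks, which is about 2 GB of memory at 8 bytes per entry. With a 64 KB internal block size the same volume has only 2^24, about 16.8 million, blocks, shrinking the table to roughly 128 MB, a 16x reduction.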

Outline
- Motivation
- Violin Design
- Implementation
- Evaluation
- Conclusions

Implementation
- Violin core
  - A Linux 2.4 loadable block device driver
  - Registers extension modules
  - Provides the API and services to extension modules
- Violin extension modules
  - Loadable Linux kernel modules that bind to the core
  - Not device drivers themselves (much simpler)
A sketch of this registration pattern follows.
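A minimal sketch of this split, with hypothetical names (the actual Violin registration interface is not shown in the slides): an extension hands the core a table of operations, and the core, being the only real block device driver, keeps the registry and drives the modules through it.

```c
#include <stdio.h>

struct request;                       /* opaque for this sketch */

struct violin_ops {                   /* what an extension module implements */
    const char *name;
    int (*init)(void);
    int (*read)(struct request *);
    int (*write)(struct request *);
};

#define MAX_MODULES 16
static const struct violin_ops *registry[MAX_MODULES];
static int nmodules;

/* Core side: called by an extension module when it is loaded. */
static int violin_register_module(const struct violin_ops *ops)
{
    if (nmodules == MAX_MODULES)
        return -1;
    registry[nmodules++] = ops;
    printf("registered extension '%s'\n", ops->name);
    return ops->init ? ops->init() : 0;
}

/* Module side: an example extension. */
static int raid0_init(void)               { return 0; }
static int raid0_read(struct request *r)  { (void)r; return 0; }
static int raid0_write(struct request *r) { (void)r; return 0; }

static const struct violin_ops raid0_ops = {
    .name  = "raid0",
    .init  = raid0_init,
    .read  = raid0_read,
    .write = raid0_write,
};

int main(void)
{
    return violin_register_module(&raid0_ops) ? 1 : 0;
}
```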

Example Modules
- RAID: levels 0, 1, and 5, with recovery
- Aggregation: plain or striped volumes
- Volume remapping: add, remove, move volumes
- Partitioning: create, delete, resize partitions
- Versioning: online snapshots
- Online migration
- Data fingerprinting (MD5)
- Encryption: currently the DES, 3DES, and Blowfish algorithms

Evaluation
- Ease of module development
- Configuration flexibility
- Performance

Evaluation: Ease of Development
- Loose comparison of the number of lines of code
- Lines of code are reduced 2-6 times for similar functionality
- Little to reasonable effort is needed for module development

Evaluation: Configuration Flexibility
- A hierarchy with complex functionality is easily created from the implemented modules
- Violin allows arbitrary combinations of extension modules; a sketch of such a composition follows
[Figure: an example hierarchy combining versioning, partitioning, Blowfish encryption, and RAID-0 layers over two disks.]
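To illustrate the kind of composition shown in the figure, here is a toy C sketch with hypothetical constructors (not Violin's configuration interface): each layer simply wraps a lower device, so modules can be stacked in any order.

```c
#include <stdio.h>
#include <stdlib.h>

struct vdev {
    const char  *name;
    struct vdev *lower;        /* NULL for a physical disk */
};

static struct vdev *layer(const char *name, struct vdev *lower)
{
    struct vdev *d = malloc(sizeof(*d));
    d->name  = name;
    d->lower = lower;
    return d;
}

int main(void)
{
    /* disk <- blowfish <- partition <- versioning, one stack from the example
     * hierarchy (cleanup omitted for brevity). */
    struct vdev *top = layer("versioning",
                         layer("partition",
                           layer("blowfish",
                             layer("disk", NULL))));

    for (struct vdev *d = top; d != NULL; d = d->lower)
        printf("%s%s", d->name, d->lower ? " -> " : "\n");
    return 0;
}
```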

Evaluation: Performance
- Platform: dual Athlon MP 2200+ PCs, 512 MB RAM, GigE NIC, Western Digital 80 GB IDE disks, RedHat Linux 9.0 (kernel 2.4.20-smp)
- Benchmarks
  - IOmeter (2004.07.30) for raw block I/O
  - Postmark for filesystem measurements
- Experiment cases
  1. Pass-through: system disk vs. Violin pass-through layer
  2. Volume manager: LVM vs. Violin aggregate + partition layers
  3. RAID-0: Linux MD vs. Violin RAID
  4. RAID-1: Linux MD vs. Violin RAID

Violin Pass-through vs. System Disk
[Figure: IOmeter throughput (MB/s) versus block size (512 B to 64 KB) for the Violin raw-disk layer and the raw system disk on one disk, under 100% read, 100% write, and 70% read / 30% write workloads, for both sequential and random I/O.]

Violin vs. LVM (2 Striped Disks)
[Figure: IOmeter throughput (MB/s) versus block size (512 B to 64 KB) for the Violin aggregation + partition layers and LVM over two striped disks (32 KB stripe), under 100% read, 100% write, and 70% read / 30% write workloads, for both sequential and random I/O.]

Violin vs. MD (RAID-1 Mirroring, 2 Disks)
[Figure: IOmeter throughput (MB/s) versus block size (512 B to 64 KB) for RAID-1 over two disks, Violin vs. Linux MD, under 100% read, 100% write, and 70% read / 30% write workloads, for both sequential and random I/O.]

Postmark Results over the Ext2 Filesystem
[Figure: Postmark transactions per second and total run time (seconds) for LVM vs. Violin aggregation + partition, the system disk vs. the Violin pass-through disk, MD RAID-1 vs. Violin RAID-1, and MD RAID-0 vs. Violin RAID-0.]

Multiple Layer Performance
[Figure: IOmeter throughput (MB/s) versus block size (512 B to 64 KB) for 100% sequential read and 100% sequential write over four stacked configurations: partition; partition + RAID-0; partition + RAID-0 + versioning; partition + RAID-0 + versioning + Blowfish.]

Violin Limitations and Future Work
- Violin does not fully support distributed hierarchies: no consistency for dynamically updated metadata
- Future work
  - Supporting distributed hierarchies in a cluster
  - Automated hierarchy configuration and management, according to specified requirements and policies
[Figure: a distributed hierarchy, with an application server running a filesystem and Violin over iSCSI, connected to storage nodes A through N, each running Violin.]

Conclusions
- Our goal is to improve virtualization in the storage cluster
- We propose Violin, an extensible I/O layer stack
- Violin's contributions are mechanisms for
  - Convenient virtualization semantics and mappings
  - Simple control of I/O requests from layers
  - Persistent metadata
- These mechanisms
  - Make it easy to write extensions
  - Make it easy to combine them
  - Exhibit low overhead (< 10% in our implementation)
- We believe that Violin's mechanisms are a step towards automated storage management

Thank You. Questions?
Violin: A Framework for Extensible Block-level Storage
Michail Flouris and Angelos Bilas
flouris@cs.toronto.edu, bilas@ics.forth.gr
http://www.ics.forth.gr/carv/scalable