Practices on Lustre File-level RAID

Practices on Lustre File-level RAID
Qi Chen <chenqi.jn@gmail.com>
Jiangnan Institute of Computing Technology

Agenda
- Background
  - motivations
  - practices on client-driven file-level RAID
- Server-driven file-level RAID
  - fundamentals and key ideas
  - overview of the architecture
  - advantages and limitations
- Current state of development
- Future work

Motivations
- Two challenges for distributed file systems:
  - as the system scales, node failures become more frequent
  - as disk capacity increases, hardware RAID becomes more expensive
- Solutions:
  - high availability: heartbeat/ZooKeeper failover
  - high reliability:
    - hardware RAID
    - replication: CephFS/GlusterFS/HDFS
    - software RAID/erasure code: GlusterFS

Lustre storage system HA
- Much effort on high availability:
  - request replay mechanism to handle recovery
  - built-in failover mechanism to reconnect to standby servers
- But not so much on high reliability:
  - disk array with hardware RAID to avoid data loss after a disk fails
  - external disk array + mirror channel
- Unable to tackle:
  - disk array failure: if the controller fails, the disk array cannot be rebuilt
  - internal disk arrays: no RAID controller cards

Lustre high reliability design
- HSM: back up data to other storage systems
- DAOS snapshot and clone: do snapshot and clone of containers
- Client-driven file-level RAID:
  - SNS (Server Network Striping)
  - file-level replication

Practices on SNS in lustre-2.4 (RAID1)
- Key ideas:
  - add raid1 ops in struct lov_layout_operations (sketched below)
  - implement RAID1 object operations in every layer of CLIO
- Problems:
  - page management across multiple layers: VFS page / cl_page / cache
  - lock management: cl_lock state transfer, LDLM lock handling
  - layout transfer
  - some reverse operations from bottom to top:
    - cl_io_for_each_reverse(scan, io)
    - CL_PAGE_INVOKE()
    - osc_io_submit() -> cl_page_prep() -> cl_page_invoke() -> vvp_page_prep_read()
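
A minimal sketch of what the first key idea amounts to, assuming the method slots of struct lov_layout_operations as found in lustre-2.4's lov_object.c; LLT_RAID1 and the lov_*_raid1 handlers are hypothetical names, and each of them would have to mirror page, lock and IO operations across all replica sub-objects:

    static const struct lov_layout_operations lov_dispatch[] = {
            /* ... existing LLT_EMPTY / LLT_RAID0 / LLT_RELEASED entries ... */
            [LLT_RAID1] = {
                    .llo_init      = lov_init_raid1,      /* one lovsub object per mirror */
                    .llo_delete    = lov_delete_raid1,
                    .llo_fini      = lov_fini_raid1,
                    .llo_install   = lov_install_raid1,
                    .llo_print     = lov_print_raid1,
                    .llo_page_init = lov_page_init_raid1, /* fan a cl_page out to every mirror */
                    .llo_lock_init = lov_lock_init_raid1, /* one sub-lock per mirror */
                    .llo_io_init   = lov_io_init_raid1,   /* replicate writes, pick a mirror for reads */
                    .llo_getattr   = lov_getattr_raid1,   /* merge attributes from the mirrors */
            },
    };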

Lessons learned from our early practice
- The logic is too complex to keep the code stable:
  - must handle cl_{io, lock, page, req}
  - must handle LDLM locks
  - must respect VFS semantics
- Page management is difficult and the cache policy is limited:
  - pages are allocated in the VFS but operated on in CLIO
  - must fit into the OSC cache policy
- RAID op errors are difficult to detect:
  - async IO is merged into OSC requests

Agenda
- Background
  - motivations
  - practices on client-driven file-level RAID
- Server-driven file-level RAID
  - fundamentals and key ideas
  - overview of the architecture
  - advantages and limitations
- Current state of development
- Future work

Fundamentals and key ideas
- Fundamentals:
  - simple data-processing logic
  - implemented as a separate and optional module
  - develop a rapid basic prototype of file-level RAID1 first
  - more software RAID algorithms to support, more features to explore
- Key ideas:
  - handle RAID object requests on the OST servers
  - leverage the current Lustre HA architecture

Overview of our server-driven file-level RAID architecture
[Architecture diagram: clients talk to a set of OST servers arranged in failover pairs; each OST server stacks an ldiskfs backend underneath a RAID OBD.]

Server-driven file-level RAID stack
- OST SERVICE: handles normal requests from clients
  - LDLM lock management
  - Replica FRAME: read/write/punch/setattr/sync/getattr, page management, lock value block operations (an illustrative sketch of this operation set follows below)
  - backends: osd-disk plus one or more osd-raid devices
- RAID SERVICE: handles RAID requests from OSTs
  - LDLM lock management
  - RAID FRAME: read/write/punch/setattr/sync/getattr, page management, lock value block operations, RAID algorithm
  - backends: osd-disk plus one or more osd-raid devices
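
Read as code, the two frames expose essentially the same small operation set, with the RAID FRAME additionally owning the RAID algorithm behind it. The vector below is purely illustrative; none of these types or names exist in Lustre:

    struct frame_object;   /* a replica or RAID object inside a frame (hypothetical) */

    /* Common operation set of the Replica FRAME and the RAID FRAME. */
    struct frame_ops {
            ssize_t (*fo_read)   (struct frame_object *obj, void *buf, size_t len, loff_t pos);
            ssize_t (*fo_write)  (struct frame_object *obj, const void *buf, size_t len, loff_t pos);
            int     (*fo_punch)  (struct frame_object *obj, loff_t start, loff_t end);
            int     (*fo_setattr)(struct frame_object *obj, const struct iattr *attr);
            int     (*fo_sync)   (struct frame_object *obj);
            int     (*fo_getattr)(struct frame_object *obj, struct iattr *attr);
    };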

Server-driven file-level RAID vs. client-driven file-level RAID
- Advantages:
  - LDLM lock operations
  - page management in the Replica/RAID FRAME
  - distributed transactions
  - other features to explore
- Limitations

Avoid complicated LDLM lock operations
- Client-driven file-level RAID:
  - different lock types: extent lock, glimpse lock, lockless
  - complicated lock operations: cl_lock ops, lockless ops, LDLM lock callbacks, lock merging
- Server-driven file-level RAID:
  - LDLM locks are handled by the OST service
  - on a lock request, the OST service fills the ost_lvb through the Replica FRAME's ldlm_valblock_ops, which call attr_get on osd_{ldiskfs, raid}
  - the Replica FRAME only needs to support LVB operations (see the sketch below)
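
To make the last point concrete: the frame never manipulates LDLM locks itself, it only has to keep the lock value block fresh. The handler below is a hedged sketch; replica_frame_attr_get() is an assumed helper and the callback signature is simplified compared with the real ldlm_valblock_ops, but ost_lvb and lu_attr are the actual Lustre structures involved:

    /* Refresh the ost_lvb of a resource from the attributes of the backing
     * replica object (which may live on osd-ldiskfs or behind osd-raid).
     * Hooked up through the Replica FRAME's ldlm_valblock_ops. */
    static int replica_lvbo_update(struct ldlm_resource *res)
    {
            struct ost_lvb *lvb = res->lr_lvb_data;
            struct lu_attr  attr;
            int             rc;

            rc = replica_frame_attr_get(res, &attr);   /* hypothetical helper */
            if (rc != 0)
                    return rc;

            lvb->lvb_size   = attr.la_size;
            lvb->lvb_blocks = attr.la_blocks;
            lvb->lvb_mtime  = attr.la_mtime;
            lvb->lvb_atime  = attr.la_atime;
            lvb->lvb_ctime  = attr.la_ctime;
            return 0;
    }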

Simpler and more intelligent page management
- Page management on the client:
  - must keep VFS semantics: page status, VFS locks on the file, page cache in the VFS
  - must handle cl_page at every layer: {ccc, lov}_page and ops, {lovsub, osc}_page and ops
  - must handle locks together with pages
  - pages can be cached at the OSC
- Client-driven file-level RAID:
  - difficult to handle page status on shared pages
  - difficult to map the page status to shadow pages
  - difficult to support async RAID operations
[Diagram: the page path from VFS page through cl_page to the OSC radix tree of pages, for client-driven file-level RAID1 without and with shadow page allocation.]

Simpler and more intelligent page management
- Server-driven file-level RAID:
  - pages are used by the OST/RAID service only, no need to keep VFS semantics
  - can easily build mappings among different pages (see the page descriptor sketch below)
  - no need to consider LDLM locks
  - able to implement async RAID operations
  - able to add customized cache policies
[Diagram: page management inside the Replica FRAME (OST SERVICE, on top of osd-ldiskfs and osd-raid) and inside the RAID FRAME (RAID SERVICE, on top of osd-raid devices).]
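
As a rough illustration of what server-side pages buy us, a frame-private page descriptor could look like the sketch below. Every name here is hypothetical; the point is only that such a page never enters the VFS page cache, so the frame is free to map it onto one backend buffer per mirror and flush those asynchronously under its own cache policy:

    #define RAID_MAX_MIRRORS 4      /* illustrative limit, not a Lustre constant */

    struct raid_frame_page {
            struct page         *rfp_page;        /* server-side data page */
            loff_t               rfp_offset;      /* byte offset within the RAID object */
            unsigned int         rfp_dirty:1;     /* still has to be flushed to the mirrors */
            unsigned int         rfp_nr_mirrors;  /* how many of rfp_mirror[] are in use */
            struct niobuf_local  rfp_mirror[RAID_MAX_MIRRORS]; /* per-mirror backend buffers */
    };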

Potential to support distributed transactions
- Problem statement: when an operation involves more than one sub-operation on RAID objects (e.g. RAID6), we must ensure that either all sub-operations succeed or all of them fail.
- Client-driven file-level RAID:
  - no persistent storage on the client
  - difficult to support distributed transactions
- Server-driven file-level RAID:
  - the Replica/RAID FRAME sits on persistent storage (osd-ldiskfs, osd-raid)
  - can leverage the transaction mechanism in the object storage device API (sketched below)
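
A hedged sketch of the last bullet: because the frame sits on the OSD stack, the sub-operations of one client write (say, a RAID6 data update plus its parity update) can be bracketed in a single backend transaction. The dt_trans_* calls follow the declare/start/stop pattern of the OSD/dt API with simplified arguments; struct raid_io and the raid_*() helpers are hypothetical:

    static int raid_frame_write(const struct lu_env *env, struct dt_device *osd,
                                struct raid_io *rio)
    {
            struct thandle *th;
            int             rc, rc2;

            th = dt_trans_create(env, osd);
            if (IS_ERR(th))
                    return PTR_ERR(th);

            /* Declare every sub-update before the transaction starts. */
            rc = raid_declare_update_data(env, rio, th);
            if (rc == 0)
                    rc = raid_declare_update_parity(env, rio, th);
            if (rc == 0)
                    rc = dt_trans_start(env, osd, th);

            /* Either all sub-operations land in this transaction or none do. */
            if (rc == 0)
                    rc = raid_update_data(env, rio, th);
            if (rc == 0)
                    rc = raid_update_parity(env, rio, th);

            rc2 = dt_trans_stop(env, osd, th);
            return rc != 0 ? rc : rc2;
    }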

Other features
- Less code added/modified on the Lustre client:
  - keeps the client code stable
  - old clients can also connect to RAID servers
- The RAID algorithm runs on the OST server:
  - does not consume extra resources (memory, network, CPU) on the client
  - takes advantage of the large memory, high-throughput network and powerful CPUs on the OST to improve performance

Limitations
- Network congestion: when a client and an OST run on the same node, RAID object ops race with normal client ops
- Too many connections between OSS servers: each OSS must connect to all the others
- Difficult to improve read performance using RAID objects: a local disk read is faster than a remote network read
- May get poor performance with other RAID algorithms (e.g. RAID5):
  - double data transfer over the network
  - the RAID object will be updated many times
  - performance will not be as good as it is in RAID1

Agenda
- Background
  - motivations
  - practices on client-driven file-level RAID
- Server-driven file-level RAID
  - fundamentals and key ideas
  - overview of the architecture
  - advantages and limitations
- Current state of development
- Future work

Current state of development
- Development is based on lustre-2.5.2
- Currently focused on file-level RAID1 only
- Work completed so far:
  - file layout support for RAID1: lfs {set,get}stripe on RAID1 files
  - RAID1 object create and destroy
  - RAID obd build-up, page management in the RAID obd
  - RAID1 object read/write
  - RAID1 object punch/setattr
  - RAID1 LVB ops
  - LDLM lock operations: {glimpse, extent} locks on RAID1 objects, ost_blocking_ast on RAID1 objects, file layout lock callback
  - record stripe status (bad, lost) in the RAID1 file layout (struct lov_mds_md); see the sketch below
[Diagram: current stack — OSC on the client; OFD obd, RAID obd and osd-ldiskfs on the OST.]
- The system can now run smoothly with just the RAID obd!
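
The last completed item above, recording per-stripe status in the layout, can be pictured with the small enum below. This is a hypothetical sketch: the stock struct lov_mds_md / lov_ost_data carry no such field, so exactly where the flag is stored is specific to this RAID1 layout extension:

    /* Per-stripe (per-replica) states tracked in the RAID1 file layout. */
    enum raid1_stripe_status {
            RAID1_STRIPE_OK   = 0,  /* replica is in sync and usable */
            RAID1_STRIPE_BAD  = 1,  /* replica returned I/O errors, needs repair */
            RAID1_STRIPE_LOST = 2,  /* OST holding the replica is unreachable, needs rebuild */
    };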

Future work
- Bad RAID1 file repair and recovery
- Combine with the Lustre failover mechanism
- Support quota on RAID1 files
- Reconstruct the RAID FRAME based on the OSD API:
  - use the DT API to implement the osd-raid device
  - extend the functions of the OSP device: add struct dt_body_operations::do_attr_get, add struct dt_object_operations methods
  - use OUT to handle RAID object ops
- Explore other RAID algorithms
[Diagram: target stack — OSC on the client; OFD obd, osd-ldiskfs, OUT and osd-raid devices on the OSTs.]

Q & A