Demand Attach / Fast-Restart Fileserver




Tom Keiser
Sine Nomine Associates

Introduction
- Project was commissioned by an SNA client
- Main requirement was to reduce fileserver restart time by > 60%
- Development & testing performed over 7 months in the 2005-2006 timeframe
- Resulted in a > 24,000 line patch to OpenAFS HEAD
- Patch was committed to OpenAFS CVS in 03/2006
- Pulled up to the development branch in time for the 1.5.1 release

Motivations for Redesign
- The on-disk format has a number of limitations
  - Metadata is stored in per-volume files
  - Metadata has poor spatial locality
  - No journalling
- The Volume Package locking model doesn't scale
  - One global lock
  - No notion of state for concurrently accessed objects
  - The lock is held across high-latency operations
- Breaking CallBacks takes a lot of time
- Keeping the server offline throughout the salvage process is unnecessary (with certain caveats)

Proposed Changes
- Introduce the notion of a volume finite-state automaton
- Only attach volumes as required (demand attachment)
- Only hold the volume lock when absolutely necessary
- Perform all I/O outside of the global lock
- Automatically salvage volumes as required
- Parallelize the fileserver shutdown process
- Stop breaking CallBacks during shutdown
- Save/restore of host/callback state
- Volume garbage collector to offline infrequently used volumes
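
To make the volume state automaton and demand attachment ideas concrete, here is a minimal sketch in C. Every identifier below (the state names, struct layout, vol_get, load_volume_metadata) is a hypothetical illustration, not the actual OpenAFS implementation:

    /* Minimal sketch of a volume state automaton with demand attachment.
     * All identifiers here are illustrative, not OpenAFS symbols. */
    typedef enum {
        VOL_STATE_PREATTACHED,  /* header seen at startup, metadata not loaded */
        VOL_STATE_ATTACHING,    /* metadata load in progress */
        VOL_STATE_ATTACHED,     /* fully online, serving requests */
        VOL_STATE_SALVAGING,    /* handed off to the salvageserver */
        VOL_STATE_OFFLINE       /* shut down or garbage-collected */
    } vol_state_t;

    struct volume {
        vol_state_t state;
        /* ... metadata, reference counts, waiters, ... */
    };

    /* Stub standing in for the expensive metadata I/O. */
    static void load_volume_metadata(struct volume *vp) { (void)vp; }

    /* Attach a volume only when it is first referenced; in the real design
     * the global lock is dropped around the I/O performed here. */
    static struct volume *vol_get(struct volume *vp)
    {
        if (vp->state == VOL_STATE_PREATTACHED) {
            vp->state = VOL_STATE_ATTACHING;
            load_volume_metadata(vp);
            vp->state = VOL_STATE_ATTACHED;
        }
        return (vp->state == VOL_STATE_ATTACHED) ? vp : NULL;
    }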

Demand Attach Fileserver Implementation

Demand Attach Architecture
- Demand Attach consists of five components:
  - demand attachment
  - salvageserver
  - host/callback state save and restore
  - changes to bos/bosserver
  - ancillary debug and introspection tools
- Components have substantial interdependencies; it would be difficult to make them separable

Fileserver Startup Sequence
Old Fileserver:
- vice partition list is built
- all volumes are fully attached
Demand Attach Fileserver:
- host/callback state is restored
- host/callback state consistency is verified
- vice partition list is built
- list of volumes on each partition is scanned
- all volumes are placed into the pre-attached state
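
A rough sketch of that demand-attach startup order, with stubbed-out helpers; all function and type names are placeholders, not the actual OpenAFS routines:

    #include <stddef.h>

    struct partition { const char *name; };
    struct volhdr    { int id; };

    /* Stubs standing in for the real work. */
    static void   restore_host_callback_state(void) {}
    static void   verify_host_callback_state(void)  {}
    static size_t build_vice_partition_list(struct partition *p, size_t max)
    { (void)p; (void)max; return 0; }
    static size_t scan_volume_headers(const struct partition *p,
                                      struct volhdr *v, size_t max)
    { (void)p; (void)v; (void)max; return 0; }
    static void   preattach_volume(const struct volhdr *v) { (void)v; }

    /* Demand-attach startup: nothing is fully attached here; every volume
     * is merely placed into the pre-attached state and attached on first use. */
    static void dafs_startup(void)
    {
        struct partition parts[256];
        struct volhdr    hdrs[4096];
        size_t nparts, nvols, i, j;

        restore_host_callback_state();
        verify_host_callback_state();
        nparts = build_vice_partition_list(parts, 256);

        for (i = 0; i < nparts; i++) {
            nvols = scan_volume_headers(&parts[i], hdrs, 4096);
            for (j = 0; j < nvols; j++)
                preattach_volume(&hdrs[j]);
        }
    }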

Fileserver Shutdown Sequence
Old Fileserver:
- break callbacks
- shut down all volumes
Demand Attach Fileserver:
- quiesce all host and callback state
- shut down all online volumes
- verify consistency of host/callback state
- save host/callback state

Feature Comparison

    Feature                                1.2.x   1.4.x   DAFS
    Parallel startup                       no      yes     yes
    Parallel shutdown                      no      no      yes
    Tunable data structure sizes           no      no      yes
    Lock-less I/O                          no      no      yes
    Automatic salvaging                    no      no      yes
    Lock-less data structure traversals    no      no      yes
    Volume state automata                  no      no      yes

Salvageserver
- Receives salvage requests from the fileserver or bos salvage
- Maintains a salvage priority queue
- Forks off new salvager worker children as needed
- Handles log file combining
- Maintains N concurrent worker children whose task orders are load balanced across all vice partitions
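
An illustrative sketch of that scheduling loop; the queue layout, names, and pacing are assumptions for exposition, not the real salvageserver internals:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct salvage_req { int volume_id; int priority; };

    /* Stubs standing in for the priority queue and the salvage work itself. */
    static struct salvage_req *dequeue_highest_priority(void) { return (struct salvage_req *)0; }
    static void run_salvager_child(const struct salvage_req *r) { (void)r; }

    /* Keep up to max_workers salvager children running, highest priority first. */
    static void schedule_salvages(int max_workers)
    {
        int active = 0;

        for (;;) {
            /* Reap any worker children that have finished. */
            while (active > 0 && waitpid(-1, NULL, WNOHANG) > 0)
                active--;

            if (active < max_workers) {
                struct salvage_req *req = dequeue_highest_priority();
                if (req != NULL) {
                    pid_t pid = fork();
                    if (pid == 0) {              /* child: perform the salvage */
                        run_salvager_child(req);
                        _exit(0);
                    } else if (pid > 0) {
                        active++;
                    }
                }
            }
            sleep(1);   /* crude pacing; the real daemon blocks on events */
        }
    }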

Volume Garbage Collector
- Underlying assumption is that many volumes are accessed so infrequently that it is inefficient to keep them attached
- The inefficiency has to do with optimizing fileserver shutdown time
- The garbage collector is modeled after the generational GC paradigm:
  - Frequently accessed volumes must be dormant for a longer period to be eligible for offlining
  - Infrequently accessed volumes are scanned more frequently to determine their offlining eligibility
- A background worker thread occasionally offlines a batch of eligible volumes
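
A toy sketch of the generational eligibility test described above; the field names and idle-time thresholds are made-up values for illustration only:

    #include <time.h>

    struct gc_volume {
        time_t   last_access;   /* when the volume was last referenced */
        unsigned accesses;      /* rough measure of how "hot" the volume is */
        int      online;
    };

    /* Frequently accessed ("hot") volumes must sit idle longer before they
     * become eligible for offlining; both cutoffs here are assumed values. */
    static int eligible_for_offline(const struct gc_volume *v, time_t now)
    {
        time_t idle      = now - v->last_access;
        time_t threshold = (v->accesses > 100) ? 2 * 3600 : 10 * 60;

        return v->online && idle > threshold;
    }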

Demand Attach Fileserver Performance

Startup Performance
- Startup time is now heavily limited by the underlying filesystem's directory lookup performance
- Even for 60,000 volumes across 3 partitions, startup times were still under 5 seconds running namei on Solaris UFS
- Host/CallBack state restore can take several seconds if the tables were nearly full
  - This is a CPU-bound problem, so faster machines mostly obviate it
  - The actual restore is fast; verification is slow
  - Verification can optionally be turned off with the -fs-state-verify argument
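
For reference, a hedged example of disabling the verification pass on the fileserver command line; the option value syntax is from memory, so check the fileserver man page for your release:

    /usr/afs/bin/fileserver -fs-state-verify none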

Shutdown Performance: Benchmark Environment
- Sun E450 (4x 480MHz, 8MB cache, 4GB RAM)
- 4Gb aggregate multipathed FC fabric attachment
- Solaris 10 zone
- 56 vice partitions spread across LUNs on various IBM FAStT and Sun arrays
- 44 physical spindles behind the vice LUNs; mixture of 10k and 15k FC-AL disks
- 128 volumes per partition (to keep test iteration time low)

Shutdown Performance (results chart)

Shutdown Performance
- It turns out the test system was sufficiently slow that the shutdown algorithm became a CPU-bound problem much beyond 32 threads
- Kernel statistics collected point to very high context switch overhead
- As a result, we don't know the true lower bound on shutdown time for the underlying disk subsystem
- Aggregate microstate CPU time was 85% kernelspace
- Taking N = 32 as the last good value, we get a parallel speedup of 49.49 / 6.28 ≈ 7.88 for 32 threads (single-threaded time divided by 32-thread time)
- This yields a parallel efficiency of 7.88 / 32 ≈ 0.25 (speedup divided by thread count)

Demand Attach Fileserver Administration

New Bos Create Syntax
The bos create syntax has been changed to deal with the online salvage server:

    bos create <hostname> dafs dafs \
        /usr/afs/bin/fileserver \
        /usr/afs/bin/volserver \
        /usr/afs/bin/salvageserver \
        /usr/afs/bin/salvager

Bos Salvage
- bos salvage must change because the implementation is intimately tied to the bnode name
- The bnode name had to change to resolve ambiguity between MR-AFS and the Demand Attach Fileserver
- Volumes on a demand attach fileserver cannot be manually salvaged with an older bos client
- The new bos client keeps identical salvage arguments, except that -forcedafs must be passed to perform a manual salvage on a demand attach fileserver
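
For example, a manual salvage of a single volume might look like this; the server, partition, and volume names are hypothetical:

    bos salvage -server fs1.example.com -partition /vicepa \
        -volume user.jdoe -forcedafs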

Salvageserver
- Automated salvaging is not without its pitfalls
- Infinite salvage, attach, salvage, ... loops are possible
- To halt this progression, a hard limit on salvages is imposed for each volume
- This counter is reset when the fileserver restarts, or when the volume is nuked
- The -Parallel argument controls the number of concurrent salvage workers
- Occasional full-server salvages (requiring an outage) are still a good idea
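
An illustrative sketch of the per-volume salvage cap mentioned above; the limit value and identifiers are assumptions, not the OpenAFS ones:

    /* Hypothetical per-volume salvage cap used to break salvage/attach loops. */
    #define MAX_SALVAGES_PER_VOLUME 3   /* assumed value, for illustration */

    struct vol_salvage_info {
        unsigned salvage_count;   /* reset on fileserver restart or volume nuke */
    };

    /* Return nonzero if another automatic salvage may be scheduled. */
    static int may_schedule_salvage(struct vol_salvage_info *v)
    {
        if (v->salvage_count >= MAX_SALVAGES_PER_VOLUME)
            return 0;             /* give up; operator intervention required */
        v->salvage_count++;
        return 1;
    }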

Introspection and Debugging Utilities
- salvsync-debug provides low-level introspection and control of the salvage scheduling process
- fssync-debug provides low-level introspection and control of the fileserver volume package
- state_analyzer provides an interactive command line utility to explore and query the fileserver state database
- All three of these utilities have context-sensitive help commands
- fssync-debug and salvsync-debug commands should be understood before use; misuse could cause Bad Things (tm) to occur
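
For example, the built-in help can be listed before attempting anything invasive; the invocations below assume the usual AFS command-suite style of help subcommand:

    fssync-debug help
    salvsync-debug help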

Demand Attach Fileserver Concluding Thoughts

When Not to Use Demand Attach
- Environments where host tables overflow
  - e.g. lots of mobile clients, NATs, etc.
  - multi_rx needs to be replaced before we can overcome this limitation
- Environments where fileserver crashes happen frequently
  - Demand salvages take longer to run than full-partition salvages
  - Future work should solve this problem

Potential Future Development Directions
- Fine-grained locking of the volume package
- Replace the current daemon_com mechanism with unix domain sockets or Rx RPC endpoints
- Continually store the list of online volumes in the fileserver state database
  - Seed the salvageserver with the list of previously attached volumes following a crash
- Provide fine-grained volume state from vos commands
- Dynamic vice partition attachment/detachment without fileserver restart

Demand Attach Fileserver Clustering
- Automated failover is very fast due to demand attachment
  - Fewer volumes to salvage following a crash
  - Salvage time is not a component of failover time
- There is potential to build on the Demand Attach infrastructure to provide robust fileserver clustering
- Treat volume/host/client/callback state as a memory-mapped on-disk database during runtime, as well as during startup/shutdown
- If/when Partition UUID support happens, dynamic rebalancing of partitions across an N-node fileserver cluster becomes feasible

Conclusions
- Fileserver restart time has been drastically reduced despite poor scaling properties of the on-disk format
- Further performance gains are going to be increasingly difficult
- A new on-disk format could be very fruitful:
  - Improve spatial locality
  - Optimize for DNLC performance characteristics on namei
  - Balance disk I/O subsystem performance characteristics with cache hierarchy performance characteristics using something like self-similar B-trees

Conclusions (continued)
- Diminishing returns associated with context switch overhead are a serious limit to how far synchronous I/O algorithms can scale
- Using asynchronous I/O (on certain platforms) could yield significant further performance gains
- The Demand Attach architecture opens up a new realm of possibilities in the space of High Availability and Fileserver Clustering

Demand Attach Fileserver
Questions?