
Sistemas Operativos: File System Reliability and Performance
Pedro F. Souto (pfs@fe.up.pt)
May 25, 2012


Topics Reliability Performance Virtual File System (VFS) Further Reading

File System Reliability
Users expect data on disk to persist until they explicitly change it. Several kinds of events cause file systems to fail that expectation:
- Disk failures: disks are fragile electromechanical devices with a relatively short lifetime (about 5 years). Google has reported failure rates of 2% per year.
- Human errors: many users type faster than they think. Windows uses the recycle bin; in Unix/Linux one can redefine rm, e.g.: alias rm mv -i /tmp/${logname}
- System failures: caused by power failures or crashes.
Backups can address the first two problems. Disk failures can also be addressed by redundant media such as RAID.

System Failures and FS Reliability
Facts:
1. File systems cache data and metadata in main memory, and use write-back rather than write-through.
2. Some metadata updates require changing more than one disk sector.
Problem: system failures (that do not damage the media) may
- lead to loss of data that has not made it to disk;
- lead to inconsistency of the file system data structures on disk, when some sectors have been updated but others have not.
Example: file creation requires two metadata updates:
1. Allocate an inode and initialize it.
2. Allocate a directory entry and make it point to the inode.
If the system goes down after the directory entry has been written to disk but before the inode has, the file system becomes inconsistent. What if the writes to disk are done in the reverse order?
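The sketch below contrasts the two orders, assuming hypothetical helpers that each persist exactly one sector synchronously (stubs here); the comments state what a crash between the two writes leaves on disk.

    #include <stdio.h>

    /* Hypothetical helpers: each is assumed to persist exactly one sector
     * synchronously (stubs, just for illustration). */
    static void write_inode(void)  { puts("inode sector written"); }
    static void write_dirent(void) { puts("directory-entry sector written"); }

    /* Order 1: directory entry first.
     * A crash between the two writes leaves a directory entry pointing to an
     * uninitialized inode -- a dangling reference, i.e. an inconsistent FS. */
    static void create_dirent_first(void) {
        write_dirent();
        write_inode();
    }

    /* Order 2 (the reverse): inode first.
     * A crash between the two writes leaves an initialized inode that no
     * directory entry references -- space is leaked, but nothing reachable
     * is inconsistent, which is much easier to repair. */
    static void create_inode_first(void) {
        write_inode();
        write_dirent();
    }

    int main(void) {
        create_dirent_first();
        create_inode_first();
        return 0;
    }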

File System Recovery
Upon restart, if the FS was not cleanly shut down, the OS runs a utility (fsck/scandisk) that:
- checks the integrity of the FS;
- tries to fix the inconsistencies it finds.
For example, in the case of the Unix FS, fsck checks at least:
- the bitmap of free blocks;
- the inodes and their reference counts, by scanning the FS metadata (including directory entries).
It is also possible for a block to be both in use and on the free list.
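As an illustration of the kind of cross-check fsck performs, here is a toy sketch with invented in-memory stand-ins for the on-disk structures (a free-block bitmap and per-inode block lists); a real fsck reads these from the FS metadata on disk.

    #include <stdio.h>
    #include <stdbool.h>

    #define NBLOCKS 16
    #define NINODES 4
    #define NPTRS   4

    /* Invented in-memory stand-ins for on-disk metadata. */
    static bool free_bitmap[NBLOCKS];          /* true = block marked free   */
    static int  inode_blocks[NINODES][NPTRS];  /* block numbers, -1 = unused */

    /* Cross-check: a block referenced by some inode must not be marked free,
     * and no block should be referenced by more than one inode. */
    static void check_blocks(void) {
        int refs[NBLOCKS] = {0};

        for (int i = 0; i < NINODES; i++)
            for (int p = 0; p < NPTRS; p++) {
                int b = inode_blocks[i][p];
                if (b >= 0 && b < NBLOCKS)
                    refs[b]++;
            }

        for (int b = 0; b < NBLOCKS; b++) {
            if (refs[b] > 0 && free_bitmap[b])
                printf("block %d: in use but also on the free list\n", b);
            if (refs[b] > 1)
                printf("block %d: referenced by %d inodes\n", b, refs[b]);
            if (refs[b] == 0 && !free_bitmap[b])
                printf("block %d: neither in use nor free (leaked)\n", b);
        }
    }

    int main(void) {
        for (int b = 0; b < NBLOCKS; b++) free_bitmap[b] = true;
        for (int i = 0; i < NINODES; i++)
            for (int p = 0; p < NPTRS; p++) inode_blocks[i][p] = -1;

        inode_blocks[0][0] = 3;   /* inode 0 uses block 3, but block 3 was left
                                     marked free: the inconsistency above */
        check_blocks();
        return 0;
    }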

Reducing File System Inconsistencies
FS inconsistencies cannot be avoided (in the Unix FS), even if the FS uses synchronous writes for metadata updates.
Challenges:
- Asynchrony: a system failure may happen at any time.
- Recovery: metadata must be updated in the right order, to allow recovery and to avoid a full disk scan.
- Performance: synchronous writes hurt performance.
Goals are to reduce:
- the metadata update overhead during normal operation;
- the recovery time at startup after a system failure.
Solutions: enforce an order on metadata updates, taking advantage of metadata semantics. This is FS dependent, and usually very hard.

Reducing File System Inconsistencies with Logs
Idea: use logs, like transactions in databases. Indeed, we want disk metadata updates to be:
- Atomic, i.e. either all of them are performed, or none are.
- Consistent, i.e. they must preserve system invariants.
- Isolated, i.e. as if metadata updates were executed by a single thread.
- Durable, i.e. they should persist until modified by other metadata updates.
These are known as the ACID properties of transactions.
Advantage: a systematic approach using a very mature technology.
Variations: practically all modern FS use logs, but they differ in:
- What is logged? Is data also logged?
- How is (meta)data logged? Values vs. operations.
- Does the log contain all FS data and metadata? Log(-structured) vs. journaled FS.
- Type of log: redo (write-ahead) vs. undo log.
- Guarantees: fully transactional or only ordering; some do not ensure isolation.

Metadata-only Write-Ahead (Redo) Log
Data structures:
- Log: an append-only file (on disk); its tail may be in main memory.
- FS metadata: on disk, and cached in main memory.
Operation: metadata updates are grouped into transactions, sequences of updates that must have the ACID properties. For each transaction:
- Update the cached metadata.
- Add entries with the updates at the log tail in main memory; they must contain enough information to be able to redo them.
- At the end, add an end-of-transaction entry to the log. (An alternative is to use a single log entry per transaction.)
Disk log: log entries must be written to disk before the cached metadata, either at the end of each transaction, or when convenient.
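To make the write-ahead discipline concrete, here is a minimal user-space sketch; the log_entry structure, the txn_* helpers and the in-memory "disk" are all invented for illustration. The point is only the ordering: the log records, terminated by an end-of-transaction entry, reach the disk before the metadata blocks they describe.

    #include <stdio.h>
    #include <string.h>

    /* Invented, simplified structures: a redo-log entry records which metadata
     * block to update and its new contents (a value log, not an operation log). */
    struct log_entry {
        int  block;     /* metadata block number; -1 marks "end of transaction" */
        char data[16];  /* new contents of that block */
    };

    #define LOG_MAX 64
    static struct log_entry log_tail[LOG_MAX];   /* log tail, in main memory */
    static int log_len;

    static char cached_meta[8][16];   /* metadata cache in main memory */
    static char disk_meta[8][16];     /* metadata on "disk"            */

    static void txn_update(int block, const char *data) {
        /* 1. update the cached metadata */
        strncpy(cached_meta[block], data, sizeof cached_meta[block]);
        /* 2. append a redo record at the log tail (still in memory) */
        log_tail[log_len].block = block;
        strncpy(log_tail[log_len].data, data, sizeof log_tail[log_len].data);
        log_len++;
    }

    static void txn_commit(void) {
        /* 3. append the end-of-transaction entry ... */
        log_tail[log_len++].block = -1;
        /* 4. ... and force the log to disk BEFORE any of the cached metadata
         *    blocks it describes; only afterwards may the cache be written
         *    back (done eagerly here just to keep the sketch short). */
        printf("flushing %d log entries to disk\n", log_len);
        for (int i = 0; i < log_len; i++)
            if (log_tail[i].block >= 0)
                memcpy(disk_meta[log_tail[i].block], log_tail[i].data, 16);
        log_len = 0;
    }

    int main(void) {
        txn_update(2, "new inode");    /* e.g. file creation: inode ...     */
        txn_update(5, "new dirent");   /* ... and directory entry, together */
        txn_commit();
        return 0;
    }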

Metadata-only Redo Log: Recovery
Idea: reconstruct the cached metadata by scanning the log and applying its entries.
Problem: if the log is large, this may take too long.
Solution: checkpoint the metadata on disk, i.e. write a consistent snapshot of the metadata, and keep track of the first log entry whose update is not in that checkpoint. This also prevents the log from growing too large: log entries for transactions that made it to disk can be freed.
Recovery becomes a two-step process:
1. Read the most recent metadata checkpoint from disk.
2. Apply all the entries in the log for transactions that terminated since that checkpoint.
Why does this work?
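A matching sketch of the replay step, again with invented structures: starting from the checkpointed metadata, only transactions whose end-of-transaction entry is present in the on-disk log are re-applied, and an incomplete tail is discarded. That discard is what makes the metadata updates atomic, and why recovering from the checkpoint plus the log works.

    #include <string.h>

    struct log_entry { int block; char data[16]; };   /* -1 marks end of txn */

    static char disk_meta[8][16];   /* metadata as left by the last checkpoint */

    /* Replay the on-disk log over the checkpointed metadata. The records of a
     * transaction are applied only once its end-of-transaction entry is seen;
     * an incomplete tail (the transaction in progress at the crash) is simply
     * ignored. */
    static void recover(const struct log_entry *log, int n) {
        int txn_start = 0;
        for (int i = 0; i < n; i++) {
            if (log[i].block != -1)
                continue;                        /* not yet committed          */
            for (int j = txn_start; j < i; j++)  /* redo the whole transaction */
                memcpy(disk_meta[log[j].block], log[j].data, 16);
            txn_start = i + 1;
        }
    }

    int main(void) {
        /* One committed transaction followed by a torn one (no end entry). */
        struct log_entry log[] = {
            { 2, "new inode" }, { 5, "new dirent" }, { -1, "" },
            { 3, "half done" },
        };
        recover(log, 4);
        return 0;
    }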

Metadata-only Redo Log: Assessment
Advantages:
- Recovery: there is no need to scan and check the entire FS metadata. We need only scan the log since the last checkpoint and replay it, reading the metadata that was changed since then.
- Normal operation: log entries are appended at the end of the log, and writing them to disk may be deferred, which minimizes seeks.
Disadvantages:
- The log requires extra space.
- Metadata updates are written to disk more than once.
- Log cleanup adds overhead.
- Optimizing log performance is not trivial.
What about the data? Programmers can invoke fsync()/fdatasync(), as in the example below.
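A minimal example of forcing data (not just metadata) out of the cache with the POSIX calls named above; the file name is arbitrary.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char msg[] = "record that must survive a crash\n";
        if (write(fd, msg, sizeof msg - 1) < 0) { perror("write"); return 1; }

        /* fsync() forces the file's data and metadata to stable storage;
         * fdatasync() forces the data (and only the metadata needed to read
         * it back), which can avoid an extra inode write. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
    }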

Topics Reliability Performance Virtual File System (VFS) Further Reading

Performance
Problem: disks are too slow.
Solution:
- Avoid disk access: cache metadata and data in memory. Still, data often has to be read from disk, and to ensure persistence it has to be written to disk.
- Avoid seeks when disk access is unavoidable: try to place close together on disk the data that belongs to the same file, and the data and metadata of the same file.
Problem: fine-tuning these techniques is very hard. File systems and disks are complex, and file sizes and access patterns vary widely.

Cache
What to cache? Everything that can be frequently reused:
- Data blocks, i.e. disk blocks with data (a.k.a. the buffer cache).
- Inodes of open files.
- Directory names (but not the on-disk blocks/inodes of directories).
- Indirect blocks, i.e. disk blocks with pointers to data blocks.
How to manage the buffer cache? One can use pure LRU, with a hash table for lookup and a list ordered from front (LRU) to rear (MRU)... almost: ensuring consistency across system failures may prevent it.
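A minimal user-space sketch of such a buffer cache, combining a hash table for lookup with an LRU list (front = least recently used, rear = most recently used) for replacement; the sizes and the read_block_from_disk() stub are invented for illustration, and dirty-buffer write-back is left out.

    #include <stdio.h>

    #define NBUF   8     /* number of buffers in the cache     */
    #define NHASH  8     /* hash buckets                       */
    #define BSIZE  16    /* "block" size, invented for the toy */

    struct buf {
        int  blockno;                 /* -1 = buffer unused            */
        char data[BSIZE];
        struct buf *hnext;            /* hash-chain link               */
        struct buf *prev, *next;      /* LRU list: front=LRU, rear=MRU */
    };

    static struct buf bufs[NBUF];
    static struct buf *hash[NHASH];
    static struct buf *lru_front, *lru_rear;

    static void read_block_from_disk(int blockno, char *dst) {   /* stub */
        snprintf(dst, BSIZE, "block %d", blockno);
    }

    static void lru_unlink(struct buf *b) {       /* remove from LRU list */
        if (b->prev) b->prev->next = b->next; else lru_front = b->next;
        if (b->next) b->next->prev = b->prev; else lru_rear  = b->prev;
    }

    static void lru_append(struct buf *b) {       /* append at MRU end */
        b->prev = lru_rear; b->next = NULL;
        if (lru_rear) lru_rear->next = b; else lru_front = b;
        lru_rear = b;
    }

    static struct buf *getblk(int blockno) {
        struct buf **h = &hash[blockno % NHASH];
        for (struct buf *c = *h; c; c = c->hnext)
            if (c->blockno == blockno) {          /* hit: move to MRU end */
                lru_unlink(c); lru_append(c);
                return c;
            }
        /* miss: evict the least recently used buffer and reuse it */
        struct buf *b = lru_front;
        lru_unlink(b);
        if (b->blockno >= 0) {                    /* drop its old hash entry */
            struct buf **p = &hash[b->blockno % NHASH];
            while (*p != b) p = &(*p)->hnext;
            *p = b->hnext;
        }
        b->blockno = blockno;
        read_block_from_disk(blockno, b->data);
        b->hnext = *h; *h = b;
        lru_append(b);
        return b;
    }

    int main(void) {
        for (int i = 0; i < NBUF; i++) { bufs[i].blockno = -1; lru_append(&bufs[i]); }
        printf("%s\n", getblk(42)->data);   /* miss: read from "disk"   */
        printf("%s\n", getblk(42)->data);   /* hit: served from cache   */
        return 0;
    }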

Cache Management
How large should the cache be? Difficult to say... In systems with VM, use integrated buffer management, i.e. any frame can be used either for VM pages or for the buffer cache, as needed. For example, in Linux:

    $ top
    [...]
    Mem:  4048160k total, 1672080k used, 2376080k free,  45560k buffers
    Swap: 4000180k total, 1582636k used, 2417544k free, 348752k cached
    [...]

    pedro@ceuta:~/tmp/snapshots$ free
                 total       used       free    buffers     cached
    Mem:       4048160    1672804    2375356      45592     348860
    -/+ buffers/cache:    1278352    2769808
    Swap:      4000180    1582628    2417552

buffers is the buffer cache; cached appears to be the in-memory cache of swap. These can be freed if the system needs more pages, hence the second line in free's output. This is useful because, in this system, the swap area is smaller than the physical memory, and hibernation...

Buffer Cache and Reads/Writes
Reads:
- Prefetch, i.e. read blocks ahead. Works with sequential access. Usually, disk controllers also cache entire tracks in the disk's own cache.
- Why not free-behind/replace-behind, i.e. discard a buffer from the cache when the next one is requested?
Writes:
- Synchronous writes: write the block to disk immediately. No data loss.
- Deferred writes: write the block later. May lead to fewer disk writes, if a block is modified several times between writes to disk; temporary files may not even go to disk. Allows further performance gains through disk scheduling.
Applications may flush the cache by invoking fsync().

Performance: Avoiding Seeks (1/2)
Access to even a small file requires reading at least two blocks: the file's inode and a file data block. Performance may be improved by locating a file's metadata close to its data:
(a) inodes located near the start of the disk;
(b) the disk divided into cylinder groups, each with its own inodes.
What about multi-platter disks? Anyway, nowadays disk controllers hide the disk geometry.

Performance: Avoiding Seeks (2/2)
Keep the data blocks of the same file sequentially on disk, using extents:
- An extent is a set of consecutive blocks on disk. Extent sizes range from 128 KiB (Kibi = 2^10) to several MiB.
- Space in each extent is allocated sequentially; for each extent, the FS keeps only the number of its first block and its length.
- When a new file is created, an extent rather than a single block is allocated for it. As the file grows, it uses the remaining space in the extent; if the extent runs out of space, another extent is allocated.
src: Getting to know the Solaris filesystem, Part 1
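A sketch of the bookkeeping extents imply: an invented descriptor holding only the first block number and the length, and an append path that first uses the remaining space of the current extent and only then allocates a new one.

    #include <stdio.h>

    /* Invented descriptor: an extent is identified only by its first block
     * number and its length in blocks. */
    struct extent {
        long start;   /* first block of the extent              */
        int  len;     /* total blocks in the extent             */
        int  used;    /* blocks of it already used by the file  */
    };

    #define MAX_EXTENTS 8
    struct file_map {
        struct extent ext[MAX_EXTENTS];
        int next;                        /* number of extents in use */
    };

    static long next_free = 1000;        /* toy allocator: consecutive runs */

    static struct extent alloc_extent(int len) {
        struct extent e = { next_free, len, 0 };
        next_free += len;
        return e;
    }

    /* Disk block for the next appended file block: taken from the current
     * extent if there is room, otherwise from a freshly allocated extent. */
    static long append_block(struct file_map *f) {
        if (f->next == 0 || f->ext[f->next - 1].used == f->ext[f->next - 1].len)
            f->ext[f->next++] = alloc_extent(32);   /* e.g. 32-block extents */
        struct extent *e = &f->ext[f->next - 1];
        return e->start + e->used++;
    }

    int main(void) {
        struct file_map f = { .next = 0 };
        for (int i = 0; i < 3; i++)
            printf("file block %d -> disk block %ld\n", i, append_block(&f));
        return 0;
    }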

Other Issues (Not Covered)
Disk caches: nowadays disks have caches of tens of MiB, and some disks write sectors to the platters when they deem best, not when the OS tells them to.
SSDs: won't be on servers for a while; no seeks; the access-time gap is much shorter than for disks.
Networked file systems: add the network, the server-side and client-side caches, consistency issues... The design space is considerably larger.

Topics Reliability Performance Virtual File System (VFS) Further Reading

Virtual File System (VFS) Layer (1/2)
Problem: how to use different FS types on the same OS?
- ext2/ext3/ext4 and NTFS, for disk FS
- (V)FAT, on USB pens
- ISO9660, on CDs/DVDs
- NFS, via the network
- /proc, for access to kernel structures
Solution: add another layer on top of the disk stack.
src: Anatomy of the Linux virtual file system switch
The VFS layer is implemented with main-memory data structures only. It was originally designed by Sun for NFS.

Virtual File System (VFS) Layer (2/2)
Each file system must provide a uniform interface, i.e. a set of filesystem operations (e.g. mount()) and file/directory operations, just like character device drivers in the Linux kernel must implement the set of functions defined in struct file_operations.
The VFS layer provides the file-system-independent functionality:
- validates system call parameters;
- copies data to and from user space;
- manages the directory name cache;
- maps system calls to the VFS operations implemented by the underlying FS.
In Linux, the buffer cache is in the block layer, between the different FS and the device drivers.
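Under the hood, this uniform interface boils down to a per-file-system table of function pointers through which the VFS dispatches. A user-space C sketch of the idea (the vfs_ops type and the two toy file systems are invented, not the real Linux kernel structures):

    #include <stdio.h>

    /* Invented analogue of the per-FS operations table: each concrete file
     * system fills one in, and the VFS layer only calls through the pointers. */
    struct vfs_ops {
        const char *name;
        int  (*open)(const char *path);
        long (*read)(int fd, void *buf, long n);
    };

    /* Two toy "file systems" implementing the same interface. */
    static int  ext2_open(const char *p)           { printf("ext2 open %s\n", p); return 3; }
    static long ext2_read(int fd, void *b, long n) { (void)fd; (void)b; return n; }
    static int  fat_open(const char *p)            { printf("vfat open %s\n", p); return 4; }
    static long fat_read(int fd, void *b, long n)  { (void)fd; (void)b; return n / 2; }

    static const struct vfs_ops ext2_ops = { "ext2", ext2_open, ext2_read };
    static const struct vfs_ops fat_ops  = { "vfat", fat_open,  fat_read  };

    /* "System call" entry point: validate arguments, pick the FS the path is
     * mounted on (hard-wired here), and dispatch through its table. */
    static int vfs_open(const char *path) {
        if (!path) return -1;                          /* parameter validation */
        const struct vfs_ops *ops =
            (path[1] == 'm') ? &fat_ops : &ext2_ops;   /* e.g. /mnt -> vfat   */
        return ops->open(path);
    }

    int main(void) {
        vfs_open("/home/pedro/notes.txt");   /* dispatched to ext2 */
        vfs_open("/mnt/usb/readme.txt");     /* dispatched to vfat */
        return 0;
    }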

Topics Reliability Performance Virtual File System (VFS) Further Reading

Further Reading
Sistemas Operativos:
- Subsection 9.2.3: Estruturas de Suporte à Utilização dos Ficheiros
- Section 9.3: Linux, starting at Subsection 9.3.2.3 (inclusive)
Modern Operating Systems, 2nd Ed.:
- Sections 6.1 and 6.2: Files and Directories
- Section 6.3: File System Implementation, Subsections 6.3.6, 6.3.7 and 6.3.8