Sistemas Operativos: File System Reliability and Performance
Pedro F. Souto (pfs@fe.up.pt)
May 25, 2012
Summary
  Reliability
  Performance
  Virtual File System (VFS)
  Further Reading
File System Reliability
  Users expect data on disk to persist until they explicitly change it. Different events cause file systems to fail those expectations:
  Disk failures: disks are fragile electromechanical devices with a relatively short lifetime (about 5 years). Google has reported failure rates of 2% per year.
  Human errors: many users type faster than they think. Windows uses the recycle bin; in Unix/Linux one can redefine rm: alias rm mv -i /tmp/${logname}
  System failures: caused by power failures or crashes.
  Backups can address the first two problems. Disk failures can also be addressed by redundant media such as RAID.
System Failures and FS Reliability
  Facts:
  1. File systems cache data and metadata in main memory, and use write-back rather than write-through.
  2. Some metadata updates require changing more than one disk sector.
  Problem: system failures (that do not damage the media) may lead to loss of data that has not made it to disk, or to inconsistency of the file system's on-disk data structures, when some sectors are updated but others are not.
  Example, file creation:
  1. Allocate an inode, and initialize it.
  2. Allocate a directory entry and make it point to the inode.
  If the system goes down after writing the directory entry to disk but before the inode is written, the file system becomes inconsistent. What if the writes to disk are done in the reverse order?
File System Recovery
  Upon restart, if the FS was not cleanly shut down, the OS executes a utility (fsck/scandisk) that checks the integrity of the FS and tries to fix the inconsistencies found.
  For example, in the case of the Unix FS, fsck checks at least:
  The bitmap of free blocks;
  The inodes and their reference counts, by scanning the FS metadata (including directory entries).
  It is also possible for a block to be both in use and on the free list.
Reducing File System Inconsistencies
  FS inconsistencies cannot be avoided (in the Unix FS), even if the FS uses synchronous writes for metadata updates.
  Challenges:
  Asynchrony: a system failure may happen at any time.
  Recovery: metadata must be updated in the right order, to allow recovery and to avoid a full disk scan.
  Performance: synchronous writes hurt performance.
  The goals are to reduce the metadata update overhead during normal operation, and the recovery time at startup after a system failure.
  Solutions enforce an order on metadata updates, taking advantage of metadata semantics. They are FS dependent, and usually very hard to get right.
Reducing File System Inconsistencies with Logs
  Idea: use logs, like transactions in databases. Indeed, we want disk metadata updates to be:
  Atomic, i.e. either all of them are performed, or none are;
  Consistent, i.e. they must preserve system invariants;
  Isolated, i.e. as if metadata updates were executed by a single thread;
  Durable, i.e. they should persist until modified by other metadata updates.
  These are known as the ACID properties of transactions.
  Advantage: a systematic approach using a very mature technology.
  Variations (practically all modern FS use logs):
  What is logged? Is data also logged?
  How is (meta)data logged? Values vs. operations.
  Does the log contain all FS data and metadata? Log-structured vs. journaled FS.
  Type of log: redo (write-ahead) vs. undo log.
  Guarantees: fully transactional, or only ordering; some do not ensure isolation.
Metadata-only Write-Ahead (Redo) Log
  Data structures:
  Log: an append-only file (on disk); its tail may be in main memory.
  FS metadata: on disk, cached in main memory.
  Operation: metadata updates are grouped in transactions, sequences of updates that must have the ACID properties:
  Update the cached metadata.
  Add entries with the updates at the log tail in main memory; they must contain enough information to be able to redo them.
  At the end, add an end-of-transaction entry to the log. An alternative is to use a single log entry per transaction.
  Disk log: log entries must be written to disk before the cached metadata, either at the end of each transaction, or when convenient.
Metadata-only Redo Log: Recovery
  Idea: reconstruct the cached metadata by scanning the log and applying its entries.
  Problem: if the log is large, this may take too long.
  Solution: checkpoint the metadata on disk, i.e. take a consistent snapshot of the metadata, and keep track of the first log entry whose update is not in that checkpoint.
  This also prevents the log from growing too large: log entries for transactions that made it to disk can be freed.
  Recovery becomes a two-step process:
  1. Read the most recent metadata checkpoint from disk.
  2. Apply all the entries in the log for transactions that terminated since that checkpoint.
  Why does this work?
Metadata-only Redo Log: Assessment
  Advantages:
  Recovery: there is no need to scan and check the entire FS metadata. We need only scan the log since the last checkpoint and replay it; only the metadata that changed since then must be read.
  Normal operation: log entries are appended at the end of the log, and writing them to disk may be deferred; this minimizes seeks.
  Disadvantages:
  The log requires extra space.
  Metadata updates are written to disk more than once.
  Log cleanup adds overhead.
  Optimizing log performance is not trivial.
  What about the data? Programmers can invoke fsync()/fdatasync().
Performance
  Problem: disks are too slow.
  Solution: avoid disk access by caching metadata and data in memory. Still, data often has to be read from disk, and to ensure data persistence it has to be written to disk.
  Avoid seeks when disk access is unavoidable: try to place close together on disk the data that belongs to the same file, and the data and metadata of the same file.
  Problem: fine-tuning these techniques is very hard. File systems and disks are complex, and file sizes and access patterns vary widely.
Cache
  What to cache? Everything that can be frequently reused:
  Data blocks, i.e. disk blocks with data (a.k.a. the buffer cache);
  Inodes of open files;
  Directory names (but not the on-disk blocks/inodes of directories);
  Indirect blocks, i.e. disk blocks with pointers to data blocks.
  How to manage the buffer cache? One can use pure LRU: a hash table for lookup, plus a list ordered from front (LRU) to rear (MRU)... almost. Ensuring consistency upon system failures may prevent it.
Cache Management
  How large should the cache be? Difficult to say...
  In systems with VM, use integrated buffer management, i.e. any frame can be used either for VM pages or for the buffer cache, as needed. For example, in Linux:

  $ top
  [...]
  Mem:  4048160k total, 1672080k used, 2376080k free,   45560k buffers
  Swap: 4000180k total, 1582636k used, 2417544k free,  348752k cached
  [...]
  pedro@ceuta:~/tmp/snapshots$ free
               total       used       free    buffers     cached
  Mem:       4048160    1672804    2375356      45592     348860
  -/+ buffers/cache:    1278352    2769808
  Swap:      4000180    1582628    2417552

  buffers is the buffer cache; cached appears to be the in-memory cache of swap. These can be freed if the system needs more pages, hence the 2nd line in free's output. This is useful because, in this system, the swap area is smaller than the physical memory, and hibernation...
Buffer Cache and Reads/Writes
  Reads:
  Prefetch, i.e. read blocks ahead; works with sequential access. Usually, disk controllers cache entire tracks in the disk's cache.
  Why not free-behind/replace-behind, i.e. discard a buffer from the cache when the next one is requested?
  Writes:
  Synchronous writes: write the block to disk immediately; no data loss.
  Deferred writes: write the block later. This may lead to fewer disk writes, if a block is modified several times between writes to disk; temporary files may not even go to disk. It also allows further performance gains via disk scheduling.
  Applications may flush the cache by invoking fsync().
Performance: Avoiding Seeks (1/2)
  Access to even a small file requires reading at least two blocks: the file's inode and the file's data block.
  Performance may be improved by locating a file's metadata close to its data:
  (a) inodes located near the start of the disk;
  (b) disk divided into cylinder groups, each with its own inodes.
  What about multi-platter disks? Anyway, nowadays disk controllers hide the disk geometry.
Performance: Avoiding Seeks (2/2)
  Keep data blocks of the same file sequential on disk: use extents. An extent is a set of consecutive blocks on disk; extent sizes range from 128 KiB (1 KiB = 2^10 bytes) to several MiB.
  Space in each extent is allocated sequentially; for each extent, the FS keeps only the first block number and the length.
  When a new file is created, allocate an extent rather than a single block for it. As the file grows, use the remaining space in the extent; if the extent runs out of space, allocate another extent.
  src: Getting to know the Solaris filesystem, Part 1
Other Issues (Not Covered)
  Disk caches: nowadays disks have caches of tens of MiB, and some disks write sectors to the media when they deem best, not when the OS tells them to.
  SSDs won't be only on servers for a while: no seeks, and the access-time gap is much shorter than for disks.
  Networked file systems add the network, server-side and client-side caches, consistency issues... The design space is considerably larger.
Virtual File System (VFS) Layer (1/2)
  Problem: how to use different FS types on the same OS?
  ext2/ext3/ext4 and NTFS, for disk FS;
  (V)FAT, on USB pens;
  ISO9660, on CDs/DVDs;
  NFS, via the network;
  /proc, for access to kernel structures.
  Solution: add another layer on top of the disk stack.
  The VFS layer is implemented with main-memory data structures only. It was originally designed by Sun for NFS.
  src: Anatomy of the Linux virtual file system switch
Virtual File System (VFS) Layer (2/2)
  Each file system must provide a uniform interface, i.e. a set of filesystem operations (e.g. mount()) and file/directory operations, just like character device drivers in the Linux kernel must implement the set of functions defined in struct file_operations.
  The VFS layer provides file-system-independent functionality:
  Validates system call parameters;
  Copies data to and from user space;
  Manages the directory name caches;
  Maps system calls to the VFS operations that are implemented by the underlying FS.
  In Linux, the buffer cache is in the block layer, between the different FS and the device drivers.
Further Reading
  Sistemas Operativos
    Subsection 9.2.3: Estruturas de Suporte à Utilização dos Ficheiros
    Section 9.3: Linux, starting at Subsection 9.3.2.3 (inclusive)
  Modern Operating Systems, 2nd Ed.
    Sections 6.1 and 6.2: Files and Directories
    Section 6.3: File System Implementation, Subsections 6.3.6, 6.3.7 and 6.3.8