COS 318: Operating Systems

Similar documents

File Systems Management and Examples

Lecture 18: Reliable Storage

File System Design and Implementation

Outline. Failure Types

Chapter 16: Recovery System

Chapter 14: Recovery System

A Deduplication File System & Course Review

! Volatile storage: ! Nonvolatile storage:

Recovery and the ACID properties CMPUT 391: Implementing Durability Recovery Manager Atomicity Durability

UVA. Failure and Recovery. Failure and inconsistency. - transaction failures - system failures - media failures. Principle of recovery

Chapter 15: Recovery System

How To Recover From Failure In A Relational Database System

COS 318: Operating Systems. File Layout and Directories. Topics. File System Components. Steps to Open A File

Review: The ACID properties

Information Systems. Computer Science Department ETH Zurich Spring 2012

Recovery Protocols For Flash File Systems

DualFS: A New Journaling File System for Linux

CS3210: Crash consistency. Taesoo Kim

Chapter 12 File Management

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

File-System Implementation

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Ryusuke KONISHI NTT Cyberspace Laboratories NTT Corporation

File System Reliability (part 2)

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Unit 12 Database Recovery

CSE 120 Principles of Operating Systems

Database Concurrency Control and Recovery. Simple database model

File System Management

Review. Lecture 21: Reliable, High Performance Storage. Overview. Basic Disk & File System properties CSC 468 / CSC /23/2006

Chapter 10. Backup and Recovery

The Linux Virtual Filesystem

Recovery Principles in MySQL Cluster 5.1

Journaling the Linux ext2fs Filesystem

Blurred Persistence in Transactional Persistent Memory

Recovery: An Intro to ARIES Based on SKS 17. Instructor: Randal Burns Lecture for April 1, 2002 Computer Science Johns Hopkins University

Microkernels & Database OSs. Recovery Management in QuickSilver. DB folks: Stonebraker81. Very different philosophies

Storage and File Structure

Windows NT File System. Outline. Hardware Basics. Ausgewählte Betriebssysteme Institut Betriebssysteme Fakultät Informatik

Recovery algorithms are techniques to ensure transaction atomicity and durability despite failures. Two main approaches in recovery process

Redo Recovery after System Crashes

Windows OS File Systems

Recovery System C H A P T E R16. Practice Exercises

Flash-Friendly File System (F2FS)

The Design and Implementation of a Log-Structured File System

The World According to the OS. Operating System Support for Database Management. Today s talk. What we see. Banking DB Application

Storage in Database Systems. CMPSCI 445 Fall 2010

Chapter 12 File Management

Chapter 12 File Management. Roadmap

Crashes and Recovery. Write-ahead logging

Outline. Windows NT File System. Hardware Basics. Win2K File System Formats. NTFS Cluster Sizes NTFS

CS 153 Design of Operating Systems Spring 2015

& Data Processing 2. Exercise 2: File Systems. Dipl.-Ing. Bogdan Marin. Universität Duisburg-Essen

Recovery: Write-Ahead Logging

6. Storage and File Structures

Transactions and Recovery. Database Systems Lecture 15 Natasha Alechina

A SCALABLE DEDUPLICATION AND GARBAGE COLLECTION ENGINE FOR INCREMENTAL BACKUP

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

KVM & Memory Management Updates

Topics in Computer System Performance and Reliability: Storage Systems!

File Management. Chapter 12

COS 318: Operating Systems. Virtual Memory and Address Translation

DataBlitz Main Memory DataBase System

Oracle Cluster File System on Linux Version 2. Kurt Hackel Señor Software Developer Oracle Corporation

Datenbanksysteme II: Implementation of Database Systems Recovery Undo / Redo

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 5 - DBMS Architecture

Bigdata High Availability (HA) Architecture

Crash Recovery. Chapter 18. Database Management Systems, 3ed, R. Ramakrishnan and J. Gehrke

1 File Management. 1.1 Naming. COMP 242 Class Notes Section 6: File Management

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

File Systems for Flash Memories. Marcela Zuluaga Sebastian Isaza Dante Rodriguez

CHAPTER 17: File Management

Practical Online Filesystem Checking and Repair

Storage and File Systems. Chester Rebeiro IIT Madras

The Classical Architecture. Storage 1 / 36

Proceedings of FAST 03: 2nd USENIX Conference on File and Storage Technologies

1. Introduction to the UNIX File System: logical vision

Synchronization and recovery in a client-server storage system

File System Forensics FAT and NTFS. Copyright Priscilla Oppenheimer 1

Transaction Log Internals and Troubleshooting. Andrey Zavadskiy

Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability

File System Design for and NSF File Server Appliance

The Logical Disk: A New Approach to Improving File Systems

ProTrack: A Simple Provenance-tracking Filesystem

Storing Data: Disks and Files. Disks and Files. Why Not Store Everything in Main Memory? Chapter 7

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Flexible Storage Allocation

SQL Server Transaction Log from A to Z

Journal-guided Resynchronization for Software RAID

EMC DATA DOMAIN DATA INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY

Where is the memory going? Memory usage in the 2.6 kernel

Introduction to Database Management Systems

How To Write A Transaction System

Transcription:

COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/

Topics File buffer cache Disk failure and file recovery tools Consistent updates Transactions and logging 2

File Buffer Cache for Performance Cache files in main memory Check the buffer cache first Hit will read from or write to the buffer cache Miss will read from the disk to the buffer cache Usual questions What to cache? How to size? What to prefetch? How and what to replace? Which write policies? User Kernel User buffer Buffer cache Disk 3

What to Cache? Things to consider i-nodes and indirect blocks of directories Directory files I-nodes and indirect blocks of files Files What is a good strategy? Cache i-nodes and indirect blocks if they are in use? Cache only the i-nodes and indirect blocks of the current directory? Cache an entire file vs. referenced blocks of files 4

How to Size? An important issue is how to partition memory between the buffer cache and VM cache Early systems use fixed-size buffer cache It does not adapt to workloads Later systems use variable size cache But, large files are common, how do we make adjustment? Solution Basically, we solve the problem using the working set idea, remember? Buffer cache (90MB) VM (110MB) Buffer cache (120MB) VM (80MB) 5

Challenges: Multiple User Processes Kernel All processes share the same buffer cache Global LRU may not be fair Solution Working set idea again Questions Can each process use a different replacement strategy? Can we move the buffer cache to the user level? What about duplications? Do we need to pin user buffers? User process User process... Buffer cache User process 6

What to Prefetch? Optimal The blocks are fetched in just enough time to use them But, life is hard The good news is that files have locality Temporal locality Spatial locality Common strategies Prefetch next k blocks together (typically > 64KB) Some discard unreferenced blocks Cluster blocks (to the same cylinder group and neighborhood) make prefetching efficient, directory and i-nodes if possible 7

How and What to Replace? Page replacement theory Use past to predict future LRU is good Buffer cache with LRU replacement mechanism If b is in buffer cache, move it to front and return b Otherwise, replace the tail block, get b from disk, insert b to the front Use double linked list with a hash table Questions Why a hash table? What if file >> the cache? LRU (front) Hash table 8

Which Write Policies? Write through Whenever modify cached block, write block to disk Cache is always consistent Simple, but cause more I/Os Write back When modifying a block, mark it as dirty & write to disk later Fast writes, absorbs writes, and enables batching So, what s the problem? User Kernel User buffer Buffer cache Disk 9

Write Back Complications Fundamental tension On crash, all modified data in cache is lost. The longer you postpone write backs, the faster you are and the worst the damage is When to write back When a block is evicted When a file is closed On an explicit flush When a time interval elapses (30 seconds in Unix) Issues These write back options have no guarantees A solution is consistent updates (later) 10

File Recovery Tools Physical backup (dump) and recovery Dump disk blocks by blocks to a backup system Backup only changed blocks since the last backup as an incremental Recovery tool is made accordingly Logical backup (dump) and recovery / man u cos318 Traverse the logical structure from the root Selectively dump what you want to backup Verify logical structures as you backup Recovery tool selectively move files back Consistency check (e.g. fsck) Start from the root i-node Traverse the whole tree and mark reachable files Verify the logical structure Figure out what blocks are free 11

Recovery from Disk Block Failures Boot block Create a utility to replace the boot block Use a flash memory to duplicate the boot block and kernel Super block If there is a duplicate, remake the file system Otherwise, what would you do? Free block data structure Search all reachable files from the root Unreachable blocks are free i-node blocks How to recover? Indirect or data blocks How to recover? bitmap i-node Indirect Indirect Data Data Data 12

Persistency and Crashes File system promise: Persistency File system will hold a file until its owner explicitly deletes it Backups can recover your file even beyond the deletion point Why is this hard? Memory A crash will destroy memory content Cache more better performance Cache more lose more on a crash A file operation often requires modifying multiple blocks, but the system can only atomically modify one at a time Systems can crash anytime? 13

What Is A Crash? Crash is like a context switch Think about a file system as a thread before the context switch and another after the context switch Two threads read or write same shared state? Crash is like time travel Current volatile state lost; suddenly go back to old state Example: move a file Place it in a directory Delete it from old Crash happens and both directories have problems Before Crash After Time Crash 14

Approaches Throw everything away and start over Done for most things (e.g., make again) Not what you want to happen to your email Reconstruction Figure out where you are and make the file system consistent and go from there Try to fix things after a crash ( fsck ) Make consistent updates Either new data or old data, but not garbage data Make multiple updates appear atomic Build arbitrary sized atomic units from smaller atomic ones Similar to how we built critical sections from locks, and locks from atomic instructions 15

Write Metadata First Modify /u/cos318/foo Crash Crash Traverse to /u/cos318/ Consistent Allocate data block Consistent i-node / dir file i-node u dir file Crash Write pointer into i-node Inconsistent Write new data to foo i-node cos318 dir file i-node foo Crash Consistent Old data New data Writing metadata first can cause inconsistency 16

Write Data First Modify /u/cos318/foo Crash Crash Traverse to /u/cos318/ Consistent Allocate data block Consistent i-node / dir file i-node u dir file Crash Write new data to foo Consistent Write pointer into i-node i-node cos318 dir file i-node foo Crash Consistent Old data New data 17

Consistent Updates: Bottom-Up Order The general approach is to use a bottom up order File data blocks, file i-node, directory file, directory i-node, What about file buffer cache Write back all data blocks Update file i-node and write it to disk Update directory file and write it to disk Update directory i-node and write it to disk (if necessary) Continue until no directory update exists Does this solve the write back problem? Updates are consistent but leave garbage blocks around May need to run fsck to clean up once a while Ideal approach: consistent update without leaving garbage 18

Transaction Properties Group multiple operations together so that they have ACID property: Atomicity It either happens or doesn t (no partial operations) Consistency A transaction is a correct transformation of the state Isolation (serializability) Transactions appear to happen one after the other Durability (persistency) Once it happens, stays happened Question Do critical sections have ACID property? 19

Transactions Bundle many operations into a transaction One of the first transaction systems is Sabre American Airline reservation system, made by IBM Primitives BeginTransaction Mark the beginning of the transaction Commit (End transaction) When transaction is done Rollback (Abort transaction) Undo all the actions since Begin transaction. Rules Transactions can run concurrently Rollback can execute anytime Sophisticated transaction systems allow nested transactions 20

Implementation BeginTransaction Commit Rollback Start using a write-ahead log on disk Log all updates Write commit at the end of the log Then write-behind to disk by writing updates to disk Clear the log Clear the log Crash recovery If there is no commit in the log, do nothing If there is commit, replay the log and clear the log Assumptions Writing to disk is correct (recall the error detection and correction) Disk is in a good state before we start 21

An Example: Atomic Money Transfer Move $100 from account S to C (1 thread): BeginTransaction S = S - $100; C = C + $100; Commit Steps: 1: Write new value of S to log 2: Write new value of C to log 3: Write commit 4: Write S to disk 5: Write C to disk 6: Clear the log Possible crashes After 1 After 2 After 3 before 4 and 5 Questions Can we swap 3 with 4? Can we swap 4 and 5? C = 110 S = 700 C = 10 110 S = 800 700 S=700 C=110 Commit 22

Revisit The Implementation BeginTransaction Commit Rollback Start using a write-ahead log on disk Log all updates Write commit at the end of the log Then write-behind to disk by writing updates to disk Clear the log Clear the log Crash recovery If there is no commit in the log, do nothing If there is commit, replay the log and clear the log Questions What is commit? What if there is a crash during the recovery? 23

Two Threads Run Transactions Apply to the mid-term AtomicTransfer program 1: BeginTransaction 2: if ( a1->id < a2->id ) { Acquire( a1->lock ); Acquire( a2->lock ); } else { Acquire( a2->lock ); Acquire( a1->lock ); } 3: if ((a1->balance - $100 ) < 0) { Release( a2->lock ); Release( a1->lock ); goto 7; } 4: a1->balance -= $100; 5: a2->balance += $100; 6: Release( a2->lock ); Release( a1->lock ); 7: Commit What happens if Thread A performs 1-6; context switch Thread B performs 1-7; crash! 24

Two-Phase Locking for Transactions First phase Acquire all locks Second phase Commit operation release all locks (no individual release operations) Rollback operation always undo the changes first and then release all locks 25

Use Transactions in File Systems Make a file operation a transaction Create a file Move a file Write a chunk of data Would this eliminate any need to run fsck after a crash? Make arbitrary number of file operations a transaction Just keep logging but make sure that things are idempotent: making a very long transaction Recovery by replaying the log and correct the file system This is called logging file system or journaling file system Almost all new file systems are journaling (Windows NTFS, Veritas file system, file systems on Linux) 26

Issue with Logging: Performance For every disk write, we now have two disk writes (on different parts of the disk)? It is not so bad because once written to the log, it is safe to do real writes later Performance tricks Changes made in memory and then logged to disk Log writes are sequential (synchronous writes can be fast if on a separate disk) Merge multiple writes to the log with one write Use NVRAM (Non-Volatile RAM) to keep the log 27

Log Management How big is the log? Same size as the file system? Observation Log what s needed for crash recovery Management method Checkpoint operation: flush the buffer cache to disk After a checkpoint, we can truncate log and start again Log needs to be big enough to hold changes in memory Some logging file systems log only metadata (file descriptors and directories) and not file data to keep log size down Would this be a problem? 28

What to Log? Physical blocks (directory blocks and inode blocks) Easy to implement but takes more space Which block image? Before operation: Easy to go backward during recovery After operation: Easy to go forward during recovery. Both: Can go either way. Logical operations Example: Add name foo to directory #41 More compact But more work at recovery time 29

Log-structured File System (LFS) Structure the entire file system as a log with segments A segment has i-nodes, indirect blocks, and data blocks All writes are sequential (no seeks) There will be holes when deleting files Questions What about read performance? How would you clean (garbage collection)? Used Unused Log structured 30

Summary File buffer cache True LRU is possible Simple write back is vulnerable to crashes Disk block failures and file system recovery tools Individual recovery tools Top down traversal tools Consistent updates Transactions and ACID properties Logging or Journaling file systems 31