Project Group High-performance Flexible File System 2010/2011, Lecture 1: File Systems, André Brinkmann
Task
Use disk drives to store huge amounts of data, with files as logical resources.
A file can contain (structured) data (e.g. records) or a sequence of ASCII bytes. We assume that we work at the byte level.
Important: distinguish between the logical blocks of a file and the physical blocks on the storage medium.
File systems may support: dynamically sized files; mutable files; a variable number of files on a medium; oversized files spanning multiple media.
Storage media for files
Files should be stored on non-volatile media that offer low latency and low cost and allow both read and write access.
Today, magnetic hard disk drives are (still) the most suitable media.
For small amounts of data: floppies, USB flash. To archive huge amounts of data: tape. To archive for read-only access: CD-ROM, DVD. In niches (energy consumption, robustness, random-access read performance): SSD.
In the following, we investigate hard disk drives as the most important media.
On-disk format on an HDD
Figure: blocks (sectors), tracks and cylinders of the disk, holding the disk label, the allocation map, the directory and the files.
Example FAT
FAT: File Allocation Table.
A FAT file system consists of six parts: boot sector; reserved sectors; FAT 1 (table of links between the clusters, see later slide); FAT 2 (copy of the FAT); root directory (table of directory entries); data region.
The boot sector contains executable x86 machine code for starting the operating system and additional information about the FAT file system.
Layout: boot sector | reserved | FAT 1 | FAT 2 | root directory | data region
Slide based on Wikipedia.de
Disk label
Contents: name of the medium; date of commissioning; capacity; physical structure; bad blocks; link to the allocation map (or the map itself); link to the root directory (or the root directory itself).
The disk label is stored at a well-defined position (first block) and is created when the file system is first used.
Allocation map (free and used blocks)
Based on vectors or tables. Stored densely or scattered.
Example: vector (bitmap) of free and used blocks, kept separately for each area (e.g. cylinder) to reduce disk head movements:
11000101 10100000 11000000 00000111 11001111 00011000
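A minimal sketch of how such a per-area bitmap could be searched. It assumes that 1 marks a used block and 0 a free block, and that each 8-bit group on the slide is one area; the slide states neither. The names find_free and mark_used are hypothetical.

```python
# Assumption: one string per area (e.g. cylinder), '1' = used, '0' = free.
areas = ["11000101", "10100000", "11000000",
         "00000111", "11001111", "00011000"]

def find_free(areas, preferred):
    """Return (area, position) of a free block, trying areas close to
    the preferred one first to keep head movements short."""
    order = sorted(range(len(areas)), key=lambda a: abs(a - preferred))
    for a in order:
        pos = areas[a].find("0")      # first free block in this area
        if pos != -1:
            return a, pos
    return None                       # medium is full

def mark_used(areas, area, pos):
    """Flip the bit of an allocated block to 'used'."""
    bits = areas[area]
    areas[area] = bits[:pos] + "1" + bits[pos + 1:]
```

For example, find_free(areas, 4) looks into area 4 first and only falls back to the neighbouring areas when area 4 has no free block.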
Allocation map in separate table
Address (block number)    Length
3                         16
22                        9
32                        10
44                        9
57                        8
Figure: the blocks of the medium, numbered 1 to 64.
Root directory (file catalogue, file directory)
The root directory contains a list of all stored files and their descriptions.
Flat directory structure: in the simplest case, the directory is a simple (one-dimensional) table with file descriptions of constant or variable length.
For huge disks and many files, the flat structure becomes unmanageable (for human users as well as for the applications accessing it).
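File directory
Structured directories (tree abstraction).
Figure: a tree of file-catalogue entries. The root holds the entries A, B and E (and possibly more blocks). B is a file; A contains R and S; E contains A, D and T; S and D each contain X and Y.
Resulting files: File B, File A.R, File E.A, File E.T, File A.S.X, File A.S.Y, File E.D.X, File E.D.Y.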
File description
The file description contains all metadata: file name; type of organization; date of creation; owner; access rights; time of last access; time of last modification; position of the file (or of parts of the file); size; ...
Access rights
Access rights are set by the owner (who is most commonly also the creator of the file).
If the access rights Read (L) and Write (S) are defined, a possible mapping of access rights could be (columns File 1 to File 4):
User (group) A: L,S S
User (group) B: L L,S L L,S
User (group) C: L L
User (group) D: L
More possible flags: Execute (for executable files); Modification of access rights (reserved for the owner); Write split into "update" and "append"; Delete; Visible.
File organization
The file organization describes the inner structure of a file and defines how its blocks are accessed.
Multiple access types:
Sequential: blocks are accessed one after another.
Direct: selective access to arbitrary blocks.
Index-sequential: both sequential and direct.
Multiple organizational forms can be offered at the same time and mapped to a single internal organization.
Sequential File Organization
The blocks have an internal sequence that determines the access order.
Mandatory form of organization for files on tape; can also be used on disk drives.
Uses a pointer that is moved explicitly or implicitly. An access (e.g. read) refers to the current position of the pointer.
Most commonly there are explicit commands to move the pointer:
next: moves the pointer to the next block
previous: moves the pointer to the previous block (mostly non-existent)
reset: moves the pointer to the beginning of the file
Figure: blocks S1 to S9 from the beginning of the file to EOF (end of file); S4 is updated in place (old, new), and new blocks are appended at the end.
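A small sketch of the pointer semantics described above, assuming an in-memory list stands in for the blocks of the file. The class SequentialFile and its method names are hypothetical and only mirror the commands on the slide.

```python
class SequentialFile:
    """Sequential access: every operation refers to the current pointer."""

    def __init__(self, blocks=()):
        self.blocks = list(blocks)
        self.pos = 0                  # pointer to the current block

    def reset(self):
        self.pos = 0                  # back to the beginning of the file

    def next(self):
        if self.pos < len(self.blocks):
            self.pos += 1             # explicit move to the next block

    def read(self):
        if self.pos >= len(self.blocks):
            return None               # pointer is at EOF
        block = self.blocks[self.pos]
        self.pos += 1                 # implicit move after the access
        return block

    def update(self, block):
        if self.pos >= len(self.blocks):
            raise IndexError("pointer is at EOF")
        self.blocks[self.pos] = block # update in place

    def append(self, block):
        self.blocks.append(block)     # new blocks go to the end of the file
```

There is deliberately no previous method, in line with the remark that this command is mostly non-existent.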
Sequential files on disk drives
Disk drives allow multiple ways to store sequential files:
Contiguous: the file spans contiguous blocks on the disk.
Scattered: the file uses arbitrary blocks on the disk.
The order and position of the blocks can be recorded by:
Chaining: direct (integrated) block chaining, or external chaining in a table (e.g. the FAT in MS-DOS / Windows).
Index blocks.
Sequential files on disk drives
Figure: blocks S1 to S9 linked by chaining, and blocks S1 to S9 referenced from an index block.
Example
MS-DOS uses external chaining. The chaining is stored in the File Allocation Table (FAT), with one entry for each block.
For performance reasons, the FAT should be held in memory.
Figure: a directory entry (name xyz, first block 235) and the File Allocation Table, whose entries (indices 0, 129, 235, 298 and 567 are labelled) chain the blocks of the file up to an EOF marker.
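A sketch of how external chaining could be followed, assuming the FAT is held in memory as a mapping from block number to successor. The first block 235 is taken from the directory entry on the slide; the order of the remaining blocks is an assumption, since it cannot be recovered from the figure. The EOF value and the name file_blocks are hypothetical.

```python
EOF = -1   # hypothetical end-of-chain marker

# Assumed chain: 235 -> 567 -> 129 -> 298 -> EOF (order is illustrative).
fat = {235: 567, 567: 129, 129: 298, 298: EOF}

def file_blocks(fat, first_block):
    """Follow the chain from the directory entry's first block to EOF."""
    chain = []
    block = first_block
    while block != EOF:
        if block in chain:            # guard against a corrupted, cyclic FAT
            raise ValueError("cycle in FAT chain")
        chain.append(block)
        block = fat[block]            # one table lookup per block
    return chain
```

Reaching the k-th block of a file needs k table lookups, which is one reason to keep the FAT in memory.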
Example FAT allocation: see http://www.cc5x.de/mmc/fat.html
Direct File Organization
Direct access to a block $S_i$ of a file via its key $k_i$.
The address (block or track number) of the block is calculated from the key by a hash function $a_i = f(k_i)$, e.g. $a_i = k_i \bmod n$.
The calculated address (block number) need not be the physical block number; an additional mapping step is possible.
Blocks or tracks may serve as containers for multiple records that are mapped to the same hashed address. Collisions must be resolved only when a container is full.
Direct File Organization
Figure: records are placed in containers at address $a_i = f(k_i)$.
Collision resolution, e.g. linear with $a_{i+1} = (a_i + d) \bmod n$.
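The address calculation and the linear collision resolution can be sketched as follows. The sketch assumes non-negative integer keys, the hash function k mod n, containers with a fixed capacity, and no deletions; the class DirectFile and its parameters are hypothetical.

```python
class DirectFile:
    """Containers addressed by a = k mod n, linear probing with step d."""

    def __init__(self, n, capacity=2, d=1):
        self.n, self.capacity, self.d = n, capacity, d
        self.containers = [[] for _ in range(n)]

    def _probe(self, key):
        a = key % self.n                      # a_i = k_i mod n
        for _ in range(self.n):
            yield a
            a = (a + self.d) % self.n         # next address on a collision
        # Note: all containers are visited only if d and n are coprime.

    def insert(self, key, record):
        for a in self._probe(key):
            if len(self.containers[a]) < self.capacity:
                self.containers[a].append((key, record))
                return a
        raise RuntimeError("overflow: reorganization needed")

    def lookup(self, key):
        for a in self._probe(key):
            for k, record in self.containers[a]:
                if k == key:
                    return record
            if len(self.containers[a]) < self.capacity:
                return None                   # a non-full container ends the search
        return None
```

With n = 5 and capacity 2, the keys 3 and 8 share container 3; a third key 13 collides and is placed in container 4.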
Direct File Organization
The hash table will fill up and an overflow might occur. A complex reorganization (e.g. by moving data) then becomes necessary.
To avoid this, extendible hashing can be used:
It allows incremental extension of the hash table without data movement.
It requires an additional step of indirection: the hashed address points into a vector of pointers.
The hash function used is $a_i = k_i \bmod 2^g$, so keys are distinguished by their last $g$ bits.
If an overflow happens, the contents of the container are redistributed with the "refined" hash function over the old container and a newly created container.
To keep the addressing correct, $g$ is incremented by 1 (the length of the pointer vector is doubled) and the pointers have to be updated accordingly.
Example (the key to insert is 43)
Before extension: $b = 2^{g_{max}} = 4$, $g_{max} = 2$
Vector of pointers (g per entry): 2 2 2 2
Data blocks: [24 16 92] [13 49] [22 18] [19 15 31 27]
After extension: $b = 2^{g_{max}} = 8$
Vector of pointers (g per entry): 2 2 2 3 2 2 2 3
Data blocks: [24 16 92] [13 49] [22 18] [19 27 43] [15 31]
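The example can be replayed with a short sketch of extendible hashing. It assumes non-negative integer keys and a container capacity of 4 records, which is inferred from the figure rather than stated on the slide; the class and attribute names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth            # number of key bits g used by this container
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=4, depth=2):
        self.capacity = capacity
        self.depth = depth            # g_max: bits used by the pointer vector
        self.dir = [Bucket(depth) for _ in range(2 ** depth)]

    def insert(self, key):
        while True:
            bucket = self.dir[key % (2 ** self.depth)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)       # overflow: refine and retry

    def _split(self, bucket):
        if bucket.depth == self.depth:
            # Double the pointer vector; both halves point to the same containers.
            self.dir = self.dir + self.dir
            self.depth += 1
        bucket.depth += 1
        new = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:            # redistribute with the refined hash function
            (new if k & high_bit else bucket).keys.append(k)
        for i, b in enumerate(self.dir):
            if b is bucket and i & high_bit:
                self.dir[i] = new     # update the affected pointers only
```

Inserting the keys of the example in the order 24, 16, 92, 13, 49, 22, 18, 19, 15, 31, 27 and then 43 splits only the full container; all other containers stay where they are. Keys that agree in all of their bits would make the split loop forever, so a real implementation needs an overflow strategy for that case.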
Index-sequential file organization
Some files are accessed both sequentially and directly (at different points in time). This leads to a mixture of sequential and direct (indexed) organization: the index-sequential file organization.
Although the blocks of the file are stored sequentially on the medium, additional data structures allow direct access.
In its simplest form, a single step of indexing is required, where the index stores the largest key of each block.
Figure: index entries S4, S7, S12, S15, S18 above the records S1 to S18.
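A sketch of the single-level index, assuming the records S1 to S18 have the integer keys 1 to 18 and are grouped into blocks as the index entries of the figure suggest. The function names are hypothetical; bisect_left comes from the Python standard library.

```python
from bisect import bisect_left

# Assumed grouping: the index stores the largest key of each block.
blocks = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12],
          [13, 14, 15], [16, 17, 18]]
index = [block[-1] for block in blocks]       # [4, 7, 12, 15, 18]

def find_block(index, key):
    """Direct access: first block whose largest key is >= key."""
    i = bisect_left(index, key)
    return i if i < len(index) else None

def scan(blocks):
    """Sequential access: simply walk the blocks in stored order."""
    for block in blocks:
        yield from block
```

A lookup of key 9 selects block 2 (keys 8 to 12) with one binary search on the index, while scan visits all records without touching the index.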
Index-sequential file organization
With dynamic access patterns (insertion and deletion of blocks), blocks may become empty or an overflow might occur.
Overflow blocks are created and additional indexes are stored.
Figure: index entries S4, S7, S12, S15, S18 above the records S1 to S18; the overflow blocks hold S4.1, S4.2 and S12.1, S12.2, S12.3, with additional index entries S4.2 and S12.3.
B*-Trees
The additional indexes for overflow blocks may drastically increase access times for some records. Better: use dynamic data structures.
The B*-Tree is a variant of the B-Tree. It holds the records in the leaves; internal nodes contain keys that accelerate accesses.
Regarding the fill ratio and the maintenance of its form, the B*-Tree corresponds to the B-Tree.
Used in Reiser4.
Figure: root key 41; internal keys 19, 31 and 71; leaf records 13 14 17 19 23 24 29 31 37 41 43 71 73 79.
Properties of B*-Trees
The nodes correspond to the blocks on the disk. Each node (block) is at least half full.
Let
$c_i$ be the number of keys in an internal node $i$,
$m$ the minimal fill of internal nodes (minimum number of keys),
$c_i^*$ the number of records in a leaf node $i$,
$m^*$ the minimal fill of leaves (minimum number of records).
Then for all internal nodes $i$ (except the root): $m \le c_i \le 2m$
and for all leaves $i$: $m^* \le c_i^* \le 2m^*$
For the previous example: $m = 1$, $m^* = 2$
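The fill conditions can be checked mechanically. The sketch below assumes the node contents of the example tree as they can be read off the figure (internal nodes 19, 31 and 71; leaves split at the internal keys); this grouping is an inference, and the function name fill_ok is hypothetical.

```python
def fill_ok(count, minimum, is_root=False):
    """Check minimum <= count <= 2 * minimum; the root may be below the minimum."""
    lower_ok = is_root or count >= minimum
    return lower_ok and count <= 2 * minimum

# Assumed node contents of the example tree.
internal_nodes = [[19, 31], [71]]
leaves = [[13, 14, 17], [19, 23, 24, 29], [31, 37], [41, 43], [71, 73, 79]]
m, m_star = 1, 2

internal_valid = all(fill_ok(len(node), m) for node in internal_nodes)
leaves_valid = all(fill_ok(len(leaf), m_star) for leaf in leaves)
```

Both checks hold for the example with m = 1 and m* = 2, while a leaf with a single record or with five records would violate the bounds.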
Insertion in a B*-Tree
Standard case: there is space left in the node.
Overflow:
A neighbor has enough space: compensate with the neighbor.
The neighbors are full: split the node (create a new block).
Figure: B*-Tree after insertion of the record with key 16 (node split on leaf level, neighbor compensation on the level above): root key 31; internal keys 16, 19 and 41, 71; leaf records 13 14 16 17 19 23 24 29 31 37 41 43 71 73 79.
Deletion in a B*-Tree
Standard case: the node remains at least half full.
Reconfiguration case (the fill level of the node falls below half):
A neighbor is more than half full: compensate with the neighbor.
The neighbors are half full: merge with a neighbor (a block becomes free).
Figure: B*-Tree after deletion of the record with key 71 (node merge on leaf level): root key 31; internal keys 16, 19 and 41; leaf records 13 14 16 17 19 23 24 29 31 37 41 43 73 79.
Depth of B*-Trees?
Example: social insurance in China with approx. $10^9$ records.
40 bytes per record (key and pointer) and a block size of 4096 bytes result in a fan-out (spreading factor) of $t = 4096/40 \approx 100$ (number of keys per node).
Number of addressable entries per level: $10^2$, $10^4$, $10^6$, $10^8$, $10^{10}$.
A B*-Tree with depth 5 suffices to store all records!
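The estimate can be reproduced with a few lines, assuming every node is completely full so that each level multiplies the number of reachable entries by the fan-out. The function name min_depth is hypothetical.

```python
def min_depth(records, fanout):
    """Smallest depth d with fanout ** d >= records."""
    depth, capacity = 1, fanout
    while capacity < records:
        capacity *= fanout            # one more level multiplies the capacity
        depth += 1
    return depth

block_size, entry_size = 4096, 40     # bytes, as on the slide
fanout = block_size // entry_size     # 102, rounded to about 100 on the slide
depth = min_depth(10 ** 9, fanout)
```

Four levels reach only about 10^8 entries, so the fifth level is needed for 10^9 records; the result is the same for a fan-out of 100 and of 102.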
File operations
Typical file operations: Create; Open; Read; Write; Reset; Lock; Close; Get attributes; Set attributes (access rights); Delete.
File control block
Operations on files require management information: pointer to the current position; current block address; pointers to buffers (in main memory); fill ratio of the buffers; information about locks.
This information is stored in the file control block (FCB).
The FCB is a data structure that is created when the file is opened and deleted when the file is closed.
A process control block (PCB) holds pointers to the FCBs of the files that were opened by the process.
Parallel file access
A file may be accessed by multiple processes in parallel.
As the FCB contains both information specific to the file and information specific to the current user, some parts of the FCB are shared.
Figure: PCB 1 and PCB 2 each point to their own FCBs; the FCBs of the shared file refer to a common shared part.
Buffering
Some blocks are accessed frequently (e.g. index blocks). To speed up access times, disk blocks are buffered in main memory (disk cache).
Some operating systems use all otherwise unused main memory as disk cache (e.g. Linux). Modern disk controllers also have internal, transparent caches.
Prior to each access to a disk block, the buffer is checked to see whether the block is already cached.
If the cache is full, the same eviction (swapping) strategies as known from virtual memory (LRU, FIFO, ...) are used.
If a modified disk block is held in the cache but has not yet been persisted to disk, a system crash (or power blackout) results in data loss.
Blocks that are important for the consistency of the file system (directory blocks, index blocks) should therefore be written directly to disk.
Sequential accesses can be exploited for buffering: read-ahead and free-behind.
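A minimal sketch of such a disk cache with LRU eviction, assuming a dictionary stands in for the disk. Modified blocks are written back on eviction, while blocks marked as critical (e.g. directory or index blocks) are written through immediately. The class BufferCache and the critical flag are hypothetical; OrderedDict comes from the Python standard library.

```python
from collections import OrderedDict

class BufferCache:
    def __init__(self, disk, capacity):
        self.disk = disk                  # block number -> data (stand-in for the disk)
        self.capacity = capacity
        self.cache = OrderedDict()        # block number -> (data, dirty), LRU first

    def read(self, block):
        if block in self.cache:           # check the buffer before going to disk
            self.cache.move_to_end(block)
            return self.cache[block][0]
        data = self.disk[block]
        self._put(block, data, dirty=False)
        return data

    def write(self, block, data, critical=False):
        if critical:
            self.disk[block] = data       # write through for consistency
        self._put(block, data, dirty=not critical)

    def _put(self, block, data, dirty):
        self.cache[block] = (data, dirty)
        self.cache.move_to_end(block)     # most recently used goes last
        while len(self.cache) > self.capacity:
            victim, (vdata, vdirty) = self.cache.popitem(last=False)
            if vdirty:
                self.disk[victim] = vdata # write back on eviction
```

Until a dirty block is evicted, its new contents exist only in main memory, which is exactly the window in which a crash loses data.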