Chapter 12 File Management

Files are the basic elements of most applications, since the input to an application, as well as its output, is usually a file. Files also typically outlive the execution of the applications that use them. Users want to access files, save them, and maintain the integrity of their data. Hence, virtually all operating systems provide some sort of file management facility. Although a file management system consists of system utility programs that run as special applications, they all need certain special services from the OS, and sometimes such a system is simply part of the OS, such as Windows Explorer, or the one embedded in UNIX.
Overview of file structure

A field is the basic element of data. It holds a single value, together with attributes such as its type and length. It can be of fixed or variable length. A record is a collection of fields that can be treated as a unit by some program. For example, an employee record may contain such fields as name, SSN, DOB, date of hire, etc. A file is a collection of records. It is treated as a single entity by users and applications, and can be referenced by a unique name. Needless to say, a file can be created, deleted, and modified. Access control is usually done at the file level, to give different users different access to different files. Finally, a database is a collection of related data files. (Do we really need to talk about it?)
Typical operations

A file management system usually provides the following operations: retrieve all the records of a file, e.g., when an application has to produce a summary of the information contained in a file; retrieve one record, as frequently required in database transactions; retrieve the next (previous) record, which may be required when filling a form; and insert a new record into a file, delete an existing one, update a record, or retrieve a number of records. We saw all of these in the context of database transactions.
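The operations above can be sketched over a simple in-memory list of records. This is an illustrative sketch only, not how a real file system stores data; the class and method names are hypothetical.

```python
# Illustrative sketch of the typical record-level file operations,
# using an in-memory list of records (dicts). Names are hypothetical.

class RecordFile:
    def __init__(self):
        self.records = []      # the "file": a list of records
        self.cursor = -1       # position used by retrieve_next

    def retrieve_all(self):                     # e.g., for a summary
        return list(self.records)

    def retrieve_one(self, key):                # single-record lookup
        return next((r for r in self.records if r["key"] == key), None)

    def retrieve_next(self):                    # e.g., when filling a form
        self.cursor += 1
        if self.cursor < len(self.records):
            return self.records[self.cursor]
        return None

    def insert(self, record):                   # add a new record
        self.records.append(record)

    def delete(self, key):                      # remove an existing record
        self.records = [r for r in self.records if r["key"] != key]

    def update(self, key, **fields):            # change fields of a record
        for r in self.records:
            if r["key"] == key:
                r.update(fields)
```

A real file system implements the same interface on top of blocks on secondary storage, which is what the rest of the chapter is about.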
Objectives of a file system

1. To meet the data management needs and requirements of the user, e.g., the ability to perform the operations above.
2. To guarantee that the data in the file are valid.
3. To optimize performance, e.g., in terms of system throughput and response time to the user.
4. To provide I/O support for a variety of devices.
5. To provide a uniform set of I/O interface routines.
6. In a multi-user system, to provide I/O support for multiple users.
More specifically,...

An interactive, general-purpose file management system should allow each user to:
1. create, delete, read, and change files;
2. perhaps have access to other users' files, with certain restrictions;
3. control access rights to her own files;
4. restructure her files into a form appropriate to the problem;
5. move data between files;
6. back up and restore her files in case of damage; and
7. get access to her files by using symbolic (logical) names.
File system architecture

The figure below shows a typical architecture for a file management system. At the lowest level, such a system provides drivers for storage devices such as disk (direct access) and tape (sequential access). The next level, the basic file system, provides the primary interface with the outside environment; it is concerned with placing blocks of data on the secondary storage device and with buffering those blocks in main memory.
The other stuff

The basic I/O supervisor is responsible for all file I/O initiation and termination. It is also concerned with selecting an appropriate device on which to perform a given file I/O operation, based on the nature of the file to be processed, and with various scheduling and optimization activities. Buffering and secondary memory allocation are also dealt with at this level. This is part of the operating system. The logical I/O piece enables users and applications to access records, instead of the data blocks handled by the basic file system. Finally, the level closest to the user is often called the access method, which leads to different file structures and ways of accessing and processing the data.
File organization and access methods

In choosing a file organization, we have to consider the following factors: access speed, ease of update, economy of storage, simple maintenance, and, last but not least, reliability. The importance of these factors varies with the application. For example, when a file is only processed in batch mode, with all the records accessed every time, rapid access to a single record is not important. The factors may also conflict with each other. For example, to achieve economy of storage, there should be minimum redundancy in the data; on the other hand, redundancy is the chief means of speeding up access to the data and of raising reliability.
Structure of a file

A file can be organized as a pile, a sequential file, an indexed sequential file, an indexed file, or a direct file. With a pile organization, data records are just collected in the order they arrive, sort of like a set. Such records may have different numbers of fields, or even similar fields in different orders. Hence, the (maximum) length of each field of a record in a pile should be specified in some way. Since there is no structure, the only access method for the pile structure is an exhaustive search.
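A minimal sketch of the pile organization: records with differing fields arrive in arbitrary order, and the only way to find anything is to scan every record. The sample data are made up.

```python
# A pile: records arrive in any order and may have different fields,
# or the same fields in different orders. Sample data are illustrative.

pile = [
    {"name": "Alice", "ssn": "123"},
    {"dob": "1990-01-01", "name": "Bob"},              # different fields
    {"name": "Carol", "ssn": "456", "hired": "2001"},  # extra field
]

def exhaustive_search(pile, field, value):
    """There is no structure to exploit, so scan every record."""
    return [r for r in pile if r.get(field) == value]
```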
Sequential file

This is the most common form of file structure, where a fixed format is used for records. All records consist of the same fixed-length fields, in the same order. One of the fields is referred to as the key field, which uniquely identifies each record. Records are typically stored in the order of their key values. It is the only structure suitable for the tape medium, and is the optimal organization for applications that have to access all the records. This structure does pose some issues when we have to deal with dynamic data. Typically, a log file is used and later merged with the sequential file. A linked structure, which collects all the blocks with the same key, is also a viable alternative.
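The log-file approach can be sketched as a simple merge: new records accumulate in a log, which is periodically sorted and merged into the key-ordered main file. The record layout here is an assumption for illustration.

```python
# Sketch: a sequential file kept sorted on its key field, with new
# records accumulated in a log file and merged in periodically.

def merge_log(main, log):
    """Merge the sorted main file with a log of new records."""
    log = sorted(log, key=lambda r: r["key"])   # order the log first
    merged, i, j = [], 0, 0
    while i < len(main) and j < len(log):
        if main[i]["key"] <= log[j]["key"]:
            merged.append(main[i]); i += 1
        else:
            merged.append(log[j]); j += 1
    return merged + main[i:] + log[j:]          # append the leftovers
```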
Indexed sequential file

The indexed sequential structure is a popular alternative. It maintains the key characteristic of the sequential structure, with two additional features: 1) an index is added to support random access, by providing a lookup into the vicinity of a desired record; 2) an overflow file, with a pointer mechanism, so that records in the overflow part can be accessed via pointers from their predecessor records in the main file. When only one level of index is used, an index record consists of a key and a pointer to a location in the main file. To find a specific record, the index is searched for the highest key value that is equal to, or precedes, the desired key value; the search then continues in the main file, via the associated pointer.
An example

Consider a sequential file with one million records. Looking for a specific key value with sequential search will take, on average, half a million comparisons. (You might think of sorting the records first, but external sorting takes a lot more time, although of the same order.) Now assume that we add an index with 1,000 entries. If the keys in the index are more or less evenly distributed over the main file, any search for a specific record will, on average, make 500 accesses to the index file, followed by another 500 accesses to the associated segment of the main file to eventually find the record. This reduces the comparisons from 500,000 to 1,000, at the cost of an additional index file.
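The arithmetic above is worth checking: each of the 1,000 index entries covers a 1,000-record segment of the main file, and a sequential scan of either averages half its length.

```python
# Back-of-the-envelope check of the example: 1,000,000 records,
# a 1,000-entry index, sequential search in both files.

n_records = 1_000_000
n_index = 1_000
segment = n_records // n_index          # records covered per index entry: 1000

seq_only = n_records / 2                # average comparisons with no index
with_index = n_index / 2 + segment / 2  # scan half the index, half a segment

# seq_only is 500,000.0 and with_index is 1,000.0, a 500x reduction
```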
Just a bit of maintenance

Each record in such a file contains an invisible pointer to the associated overflow file. When a record is added to an indexed sequential file, it is placed in the overflow file, and the record in the main file whose key value immediately precedes that of the new record is updated to contain a pointer to it. If that predecessor is itself in the overflow file, its pointer is changed accordingly. As with the sequential file, the overflow file is occasionally merged with the main file. Such a file is pretty flexible, and greatly increases the efficiency of file management. Multiple levels of indices can be used, leading to further improvement.
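The pointer maintenance can be sketched as follows. Records carry a "next" link to their successor in key order, which may live in the overflow file; inserting a record re-links its predecessor, wherever that predecessor lives. The data structures here are illustrative, not an actual on-disk format.

```python
# Sketch of overflow-file insertion for an indexed sequential file.
# Each record's "next" pointer leads to its successor in key order,
# possibly in the overflow file. Structures are illustrative.

main = [{"key": 10, "next": None}, {"key": 30, "next": None}]
overflow = []

def insert(key):
    """Add the record to the overflow file and re-link its predecessor."""
    new = {"key": key, "next": None}
    overflow.append(new)
    # the predecessor is the record (main or overflow) with the
    # largest key smaller than the new key
    pred = max((r for r in main + overflow if r["key"] < key),
               key=lambda r: r["key"])
    new["next"], pred["next"] = pred["next"], new
```

For example, after `insert(20)` the record with key 10 points to the new record in the overflow file; a later `insert(25)` re-links the overflow record 20 instead, exactly the "pointer is changed accordingly" case above.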
Another issue and its solution

The indexed sequential file keeps one limitation of the sequential file: effective access depends on the one key field. A search on an attribute other than the key field is not easily done. To achieve such flexibility, we usually use a structure containing multiple indexes, one for each type of searchable field. Records organized in such a structure are only accessed through the corresponding indexes. Thus, it no longer matters where a record is placed, and variable-length records can be used. These files are referred to as indexed files, and are used in applications where timeliness is critical and data are rarely processed exhaustively. Did we go through this stuff in DB?
Direct access

Finally, the direct, or hashed, file exploits the capability of disks to directly access any block of a known address. Again, a key field is used, but no sequential ordering within the file is needed. A typical way to search for a record is to use a hash table, which allows constant access time, if we are not too greedy. (Still remember anything about it, such as load factors?) This type of organization is often used when rapid access is required and records are always accessed one at a time. Examples include price lists, name lists, schedules, etc.
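A minimal sketch of the hashed organization: the key hashes straight to a bucket, so access time stays near-constant as long as the load factor (records per bucket) is kept modest. Bucket count and data are illustrative.

```python
# Sketch of a direct (hashed) file: the key hashes to a bucket, so
# access is near-constant time while the load factor stays low.

N_BUCKETS = 8                        # illustrative; real files use many more
buckets = [[] for _ in range(N_BUCKETS)]

def put(key, record):
    """Store the record in the bucket its key hashes to."""
    buckets[hash(key) % N_BUCKETS].append((key, record))

def get(key):
    """Search only one bucket; its chain is short if we are not greedy."""
    for k, r in buckets[hash(key) % N_BUCKETS]:
        if k == key:
            return r
    return None
```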
File directories

Associated with any file management system is a file directory structure, which contains information about files: their size, location, ownership, access rights, etc. The directory itself is a file, owned by the OS and accessed by various file management routines. A very simple way to organize a directory is a flat list of entries, one for each file. Although this was actually used in some earlier systems, it is certainly inadequate in any realistic sense. For example, a user might want to organize her files by type or by project, which calls for a more sophisticated organization; a flat list also forces users to invent different names for what is really the same file in different types.
A no brainer

Hence, the hierarchical, or tree, structure, being a much more flexible and natural directory structure, is almost universally adopted. (Is any explanation necessary of what such a structure looks like?) As we have discussed many times, the key property of such a structure is that the path between any two nodes is unique. Each directory, at a lower level, can itself be organized as a subtree or as a sequential file; for a bigger one, we can follow the approach of using a direct file.
What else?

With a tree as the directory structure, naming becomes pretty simple and logical: we simply follow the path from the root all the way down to the file itself. On the other hand, it is not very convenient to always have to use the whole path. This leads to the concept of a working directory, or relative path: all references to files are then relative to the current, or working, directory. I recently found out that some earlier systems did not support changing the working directory (:-().
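Resolving a relative name against a working directory is pure string manipulation, which Python's `posixpath` module can demonstrate; the directory names here are made up.

```python
# Sketch: resolving a file name against a working directory.
# Absolute names (starting with "/") stand alone; anything else is
# interpreted relative to the working directory.

import posixpath

cwd = "/home/alice/project"          # hypothetical working directory

def resolve(name):
    if name.startswith("/"):
        return posixpath.normpath(name)
    return posixpath.normpath(posixpath.join(cwd, name))
```

Note that `normpath` also folds `..` components, so `../notes` climbs one level up the tree, just as the unique-path property of the hierarchy guarantees.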
File sharing

In a multiuser system, there is almost always a requirement to allow files to be shared among many users. To ensure the integrity of such sharing, we have to address the issues of access rights and simultaneous access. The file system should provide a number of options so that we can control the way a file is shared. Typically, users or groups of users are granted certain access rights to a file. Still remember the chmod stuff?
File sharing examples

Such hierarchical rights can be:
- none, when a user will not even be aware of the existence of the file;
- knowledge, when a user knows about the file, as well as its owner, and thus might ask the owner for additional rights;
- execution, when a user can load and execute the application;
- reading, when a user can get access to its content;
- appending, when a user can add stuff in;
- updating, when a user can even change things;
- changing protection, when a user can change the access rights granted to other users; and
- deletion.
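These rights are naturally represented as a bitmask, chmod-style: each right is one bit, and checking or granting a right is a bitwise operation. The right names follow the list above; the bit assignments are illustrative, not any real system's values.

```python
# Sketch: hierarchical access rights as a bitmask, chmod-style.
# Bit values are illustrative, not a real system's encoding.

KNOWLEDGE, EXECUTE, READ, APPEND, UPDATE, PROTECT, DELETE = (
    1 << i for i in range(7))

OWNER_RIGHTS = (KNOWLEDGE | EXECUTE | READ | APPEND
                | UPDATE | PROTECT | DELETE)    # the owner gets everything

def grant(rights, right):
    """Add a right to a user's mask."""
    return rights | right

def allowed(rights, right):
    """Check whether the mask contains the right."""
    return rights & right != 0
```

A "none" user is simply a mask of 0, which fails every `allowed` check, matching the first case in the list above.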
Ownership, etc.

One user is initially designated as the owner of a given file; the owner is granted all the above rights and can also assign rights to other users, classified as follows: a specific user, namely an individual user; user groups, namely groups of users; and all, namely everybody, as with public files. When the right to append to or update a file is granted to more than one user, the OS must make sure that only one user at a time gets access to the file, to maintain data consistency. Thus, such issues as mutual exclusion and locks, dead or alive, have to be addressed.
Record blocking

Although records are the logical units of accessing a file, blocks are the units an I/O device deals with. To carry out an I/O operation, records must first be grouped into blocks. The question is thus how to block records. On most systems, blocks are of fixed length. This simplifies I/O operations, buffer allocation in main memory, and block organization on secondary storage. But it may lead to internal fragmentation... We went through this in the last chapter, didn't we? (Homework 11.7)
Everything is a tradeoff.

When considering the block size relative to the average record size, we notice that the larger the block, the more records can be passed in with just one I/O operation, which is ideal for files of a sequential nature. On the other hand, if records are being accessed randomly, and no particular locality of reference exists, then a larger block leads to unnecessary transfer of unused data. Similar concerns were expressed in terms of internal (external) fragmentation in the memory management part. Also, larger blocks require larger buffers, which are not as easy to manage.
Actual methods

With fixed blocking, fixed-length records are used, and an integral number of records are stored in each block. There may be unused space at the end of each block, leading to internal fragmentation. (For example, with 512-byte blocks and 120-byte records: 512 = 4 × 120 + 32.) With variable-length spanned blocking, variable-length records are used and packed together to get rid of the unused space. Thus, some records will span two blocks, aided by a pointer mechanism. Finally, there is variable-length unspanned blocking. Like the fixed blocking technique, it can lead to wasted space, external fragmentation this time: a block too small to hold the next record.
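The fixed-blocking arithmetic from the example can be spelled out: with 512-byte blocks and 120-byte records, four records fit per block and 32 bytes are lost to internal fragmentation.

```python
# Fixed blocking arithmetic from the example: 512-byte blocks and
# 120-byte records give 4 records per block with 32 bytes wasted.

block_size, record_size = 512, 120

records_per_block = block_size // record_size                  # 4
internal_frag = block_size - records_per_block * record_size   # 32 bytes

# i.e., 512 = 4 * 120 + 32; about 6% of every block is wasted
waste_fraction = internal_frag / block_size
```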
The UNIX way

The UNIX kernel views all files as streams of bytes. It categorizes files into four types: besides the usual ordinary files, there are directory files, which contain lists of files and are read-only for users; special files, used to access various I/O devices, e.g., stdin and stdout; and named pipes, the stuff that we ran into in Project 2. Files of all these types are managed by UNIX in terms of inodes. An inode (index node) is a control structure that contains the key information needed by the OS to manage a particular file. Several file names may be associated with one inode, but an active inode is associated with exactly one file, and each file is controlled by exactly one inode.
File allocation

File allocation is done on a block basis, and is dynamic: blocks are assigned as needed. An indexed allocation method is used to keep track of each file, with the index stored in the file's inode. Each inode contains 39 bytes of address information, organized as 13 3-byte addresses. The first 10 addresses refer to the first 10 blocks of the file. If more blocks are allocated, the 11th address points to a single indirect block, which contains pointers to the succeeding blocks. If that is not enough, the 12th and 13th addresses point to a double and a triple indirect block, respectively. With such a mechanism, a file can be bigger than 16GB.
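The 16GB figure can be reproduced with a quick calculation. The slide does not give the block size or the number of pointers per indirect block, so the values below (1 KB data blocks, 256 pointers per indirect block) are assumptions chosen to match the classic presentation; the triple indirect level alone then contributes 16 GB.

```python
# Rough maximum file size under the 13-address inode scheme.
# ASSUMPTIONS (not given on the slide): 1 KB data blocks and
# 256 pointers per indirect block.

block = 1024                   # bytes per data block
ptrs = 256                     # pointers held by one indirect block

direct = 10 * block            # addresses 1-10: direct blocks
single = ptrs * block          # address 11: single indirect
double = ptrs ** 2 * block     # address 12: double indirect
triple = ptrs ** 3 * block     # address 13: triple indirect

max_size = direct + single + double + triple
# triple alone is exactly 16 GiB, so max_size is a bit over 16 GB
```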
I did not make that up.

The figure below shows the structure we just went through.