Distributed Operating Systems


ANDREW S. TANENBAUM and ROBBERT VAN RENESSE
Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Distributed operating systems have many aspects in common with centralized ones, but they also differ in certain ways. This paper is intended as an introduction to distributed operating systems, and especially to current university research about them. After a discussion of what constitutes a distributed operating system and how it is distinguished from a computer network, various key design issues are discussed. Then several examples of current research projects are examined in some detail, namely, the Cambridge Distributed Computing System, Amoeba, V, and Eden.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems - network operating system; D.4.3 [Operating Systems]: File Systems Management - distributed file systems; D.4.5 [Operating Systems]: Reliability - fault tolerance; D.4.6 [Operating Systems]: Security and Protection - access controls; D.4.7 [Operating Systems]: Organization and Design - distributed systems

General Terms: Algorithms, Design, Experimentation, Reliability, Security

Additional Key Words and Phrases: File server

INTRODUCTION

Everyone agrees that distributed systems are going to be very important in the future. Unfortunately, not everyone agrees on what they mean by the term distributed system. In this paper we present a viewpoint widely held within academia about what is and is not a distributed system, we discuss numerous interesting design issues concerning them, and finally we conclude with a fairly close look at some experimental distributed systems that are the subject of ongoing research at universities.

To begin with, we use the term distributed system to mean a distributed operating system, as opposed to a database system or some distributed applications system, such as a banking system. An operating system is a program that controls the resources of a computer and provides its users with an interface or virtual machine that is more convenient to use than the bare machine. Examples of well-known centralized (i.e., not distributed) operating systems are CP/M, MS-DOS, and UNIX. A distributed operating system is one that looks to its users like an ordinary centralized operating system but runs on multiple, independent central processing units (CPUs). The key concept here is transparency. In other words, the use of multiple processors should be invisible (transparent) to the user. Another way of expressing the same idea is to say that the user views the system as a virtual uniprocessor, not as a collection of distinct machines. This is easier said than done. Many multimachine systems that do not fulfill this requirement have been built.

CONTENTS

INTRODUCTION
  Goals and Problems
  System Models
1. NETWORK OPERATING SYSTEMS
  1.1 File System
  1.2 Protection
  1.3 Execution Location
  1.4 An Example: The Sun Network File System
2. DESIGN ISSUES
  2.1 Communication Primitives
  2.2 Naming and Protection
  2.3 Resource Management
  2.4 Fault Tolerance
  2.5 Services
3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS
  3.1 The Cambridge Distributed Computing System
  3.2 Amoeba
  3.3 The V Kernel
  3.4 The Eden Project
  3.5 Comparison of the Cambridge, Amoeba, V, and Eden Systems
4. SUMMARY
ACKNOWLEDGMENTS
REFERENCES

For example, the ARPANET contains a substantial number of computers, but by this definition it is not a distributed system. Neither is a local network consisting of personal computers with minicomputers and explicit commands to log in here or copy a file from there. In both cases we have a computer network but not a distributed operating system. Thus it is the software, not the hardware, that determines whether a system is distributed or not. As a rule of thumb, if you can tell which computer you are using, you are not using a distributed system. The users of a true distributed system should not know (or care) on which machine (or machines) their programs are running, where their files are stored, and so on. It should be clear by now that very few distributed systems are currently used in a production environment. However, several promising research projects are in progress.

To make the contrast with distributed operating systems stronger, let us briefly look at another kind of system, which we call a network operating system. A typical configuration for a network operating system would be a collection of personal computers along with a common printer server and file server for archival storage, all tied together by a local network. Generally speaking, such a system will have most of the following characteristics that distinguish it from a distributed system:

- Each computer has its own private operating system, instead of running part of a global, systemwide operating system.
- Each user normally works on his or her own machine; using a different machine invariably requires some kind of remote login, instead of having the operating system dynamically allocate processes to CPUs.
- Users are typically aware of where each of their files is kept and must move files between machines with explicit file transfer commands, instead of having file placement managed by the operating system.
- The system has little or no fault tolerance; if 1 percent of the personal computers crash, 1 percent of the users are out of business, instead of everyone simply being able to continue normal work, albeit with 1 percent worse performance.

Goals and Problems

The driving force behind the current interest in distributed systems is the enormous rate of change in microprocessor technology. Microprocessors have become very powerful and cheap, compared with mainframes and minicomputers, so it has become attractive to think about designing large systems composed of many small processors. These distributed systems clearly have a price/performance advantage over more traditional systems. Another advantage often cited is the relative simplicity of the software (each processor has a dedicated function), although this advantage is more often listed by people who have never tried to write a

distributed operating system than by those who have. Incremental growth is another plus; if you need 10 percent more computing power, you just add 10 percent more processors. System architecture is crucial to this type of system growth, however, since it is hard to give each user of a personal computer another 10 percent of a personal computer. Reliability and availability can also be a big advantage; a few parts of the system can be down without disturbing people using the other parts.

On the minus side, unless one is very careful, it is easy for the communication protocol overhead to become a major source of inefficiency. More than one system has been built that required the full computing power of its machines just to run the protocols, leaving nothing over to do the actual work. The occasional lack of simplicity cited above is a real problem, although in all fairness this problem comes from inflated goals: With a centralized system no one expects the computer to function almost normally when half the memory is sick. With a distributed system, a high degree of fault tolerance is often at least an implicit goal. A more fundamental problem in distributed systems is the lack of global state information. It is generally a bad idea to even try to collect complete information about any aspect of the system in one table. Lack of up-to-date information makes many things much harder. It is hard to schedule the processors optimally if you are not sure how many are up at the moment. Many people, however, think that these obstacles can be overcome in time, so there is great interest in doing research on the subject.

System Models

Various models have been suggested for building a distributed system. Most of them fall into one of three broad categories, which we call the minicomputer model, the workstation model, and the processor pool model. In the minicomputer model, the system consists of a few (perhaps even a dozen) minicomputers (e.g., VAXs), each with multiple users. Each user is logged onto one specific machine, with remote access to the other machines. This model is a simple outgrowth of the central time-sharing machine.

In the workstation model, each user has a personal workstation, usually equipped with a powerful processor, memory, a bitmapped display, and sometimes a disk. Nearly all the work is done on the workstations. Such a system begins to look distributed when it supports a single, global file system, so that data can be accessed without regard to their location.

The processor pool model is the next evolutionary step after the workstation model. In a time-sharing system, whether with one or more processors, the ratio of CPUs to logged-in users is normally much less than 1; with the workstation model it is approximately 1; with the processor pool model it is much greater than 1. As CPUs get cheaper and cheaper, this model will become more and more widespread. The idea here is that whenever a user needs computing power, one or more CPUs are temporarily allocated to that user; when the job is completed, the CPUs go back into the pool to await the next request. As an example, when ten procedures (each in a separate file) must be recompiled, ten processors could be allocated to run in parallel for a few seconds and then be returned to the pool of available processors.
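The allocate-and-return cycle of the processor pool model can be pictured with a small sketch. Everything in it is invented for illustration (the pool size, the function names, and the idea of representing the pool as a simple busy/idle table); it is not drawn from any of the systems described in this paper.

#include <stdio.h>

#define POOL_SIZE 16            /* total CPUs in the pool (assumed) */

static int busy[POOL_SIZE];     /* 0 = idle, 1 = allocated */

/* Allocate up to 'wanted' idle processors; return how many were found
   and record their numbers in cpus[]. */
int allocate(int wanted, int cpus[])
{
    int found = 0;
    for (int i = 0; i < POOL_SIZE && found < wanted; i++)
        if (!busy[i]) { busy[i] = 1; cpus[found++] = i; }
    return found;
}

/* Return processors to the pool when the job is done. */
void release(int count, int cpus[])
{
    for (int i = 0; i < count; i++)
        busy[cpus[i]] = 0;
}

int main(void)
{
    int cpus[10];
    int got = allocate(10, cpus);   /* e.g., ten parallel recompilations */
    printf("allocated %d processors\n", got);
    /* ... run the jobs on the allocated processors ... */
    release(got, cpus);
    return 0;
}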
At least one experimental system described below (Amoeba) attempts to combine two of these models, providing each user with a workstation in addition to the processor pool for general use. No doubt other variations will be tried in the future.

1. NETWORK OPERATING SYSTEMS

Before starting our discussion of distributed operating systems, it is worth first taking a brief look at some of the ideas involved in network operating systems, since they can be regarded as primitive forerunners. Although attempts to connect computers together have been around for decades, networking really came into the limelight with the ARPANET in the early

1970s. The original design did not provide for much in the way of a network operating system. Instead, the emphasis was on using the network as a glorified telephone line to allow remote login and file transfer. Later, several attempts were made to create network operating systems, but they never were widely used [Millstein]. In more recent years, several research organizations have connected collections of minicomputers running the UNIX operating system [Ritchie and Thompson 1974] into a network operating system, usually via a local network [Birman and Rowe 1982; Brownbridge et al. 1982; Chesson 1975; Hwang et al. 1982; Luderer et al. 1981; Wambecq]. Wupit [1983] gives a good survey of these systems, which we shall draw upon for the remainder of this section.

As we said earlier, the key issue that distinguishes a network operating system from a distributed one is how aware the users are of the fact that multiple machines are being used. This visibility occurs in three primary areas: the file system, protection, and program execution. Of course, it is possible to have systems that are highly transparent in one area and not at all in the other, which leads to a hybrid form.

1.1 File System

When connecting two or more distinct systems together, the first issue that must be faced is how to merge the file systems. Three approaches have been tried. The first approach is not to merge them at all. Going this route means that a program on machine A cannot access files on machine B by making system calls. Instead, the user must run a special file transfer program that copies the needed remote files to the local machine, where they can then be accessed normally. Sometimes remote printing and mail is also handled this way. One of the best-known examples of networks that primarily support file transfer and mail via special programs, and not system call access to remote files, is the UNIX uucp program and its network, USENET.

The next step upward in the direction of a distributed file system is to have adjoining file systems. In this approach, programs on one machine can open files on another machine by providing a path name telling where the file is located. For example, one could say

open("/machine1/pathname", READ);
open("machine1!pathname", READ);
open("/../machine1/pathname", READ);

The latter naming scheme is used in the Newcastle Connection [Brownbridge et al. 1982] and Netix [Wambecq] and is derived from the creation of a virtual superdirectory above the root directories of all the connected machines. Thus /.. means start at the local root directory and go upward one level (to the superdirectory), and then down to the root directory of machine1. In Figure 1, the root directories of three machines, A, B, and C, are shown, with a superdirectory above them. To access file x from machine C, one could say

open("/../C/x", READ-ONLY)

In the Newcastle system, the naming tree is actually more general, since machine1 may really be any directory, so one can attach a machine as a leaf anywhere in the hierarchy, not just at the top.

The third approach is the way it is done in distributed operating systems, namely, to have a single global file system visible from all machines. When this method is used, there is one bin directory for binary programs, one password file, and so on. When a program wants to read the password file, it does something like

open("/etc/passwd", READ-ONLY)

without reference to where the file is.
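The mechanical difference between these naming schemes is small. In the adjoining-file-system style, the machine name is carried inside the path itself, as the following sketch of a parser for the superdirectory form illustrates; the function name and buffer sizes are invented for illustration, and in a truly transparent system no such parsing would be visible to user programs at all.

#include <stdio.h>
#include <string.h>

/* Split a superdirectory name such as "/../C/x" into a machine name and a
   machine-local path.  Returns 1 if the name refers to a remote machine,
   0 if it is an ordinary local name.  Purely illustrative; no real system's
   parser is reproduced here. */
int split_remote_path(const char *path, char *machine, char *local)
{
    if (strncmp(path, "/../", 4) != 0)
        return 0;                        /* not a superdirectory name */
    path += 4;                           /* skip the "/../" prefix */
    const char *slash = strchr(path, '/');
    if (slash == NULL)
        return 0;
    size_t n = (size_t)(slash - path);
    memcpy(machine, path, n);
    machine[n] = '\0';                   /* e.g., "C"  */
    strcpy(local, slash);                /* e.g., "/x" */
    return 1;
}

int main(void)
{
    char machine[64], local[256];
    if (split_remote_path("/../C/x", machine, local))
        printf("open %s on machine %s\n", local, machine);
    return 0;
}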
It is up to the operating system to locate the file and arrange for transport of data as they are needed. LOCUS is an example of a system using this approach [Popek et al. 1981; Walker et al. 1983; Weinstein et al.]. The convenience of having a single global name space is obvious. In addition, this approach means that the operating system is free to move files around among machines to keep all the disks equally full and busy, and that the system can maintain

replicated copies of files if it so chooses. When the user or program must specify the machine name, the system cannot decide on its own to move a file to a new machine, because that would change the (user visible) name used to access the file. Thus in a network operating system, control over file placement must be done manually by the users, whereas in a distributed operating system it can be done automatically by the system itself.

Figure 1. A (virtual) superdirectory above the root directory provides access to remote files.

1.2 Protection

Closely related to the transparency of the file system is the issue of protection. UNIX and many other operating systems assign a unique internal identifier to each user. Each file in the file system has a little table associated with it (called an i-node in UNIX) telling who the owner is, where the disk blocks are located, etc. If two previously independent machines are now connected, it may turn out that some internal User IDentifier (UID), for example, number 12, has been assigned to a different user on each machine. Consequently, when user 12 tries to access a remote file, the remote file system cannot tell whether the access is permitted, since two different users have the same UID.

One solution to this problem is to require all remote users wanting to access files on machine X to first log onto X using a user name that is local to X. When used this way, the network is just being used as a fancy switch to allow users at any terminal to log onto any computer, just as a telephone company switching center allows any subscriber to call any other subscriber. This solution is usually inconvenient for people and impractical for programs, so something better is needed.

The next step up is to allow any user to access files on any machine without having to log in, but to have the remote user appear to have the UID corresponding to GUEST or DEMO or some other publicly known login name. Generally such names have little authority and can only access files that have been designated as readable or writable by all users.

A better approach is to have the operating system provide a mapping between UIDs, so that when a user with UID 12 on his or her home machine accesses a remote machine on which his or her UID is 15, the remote machine treats all accesses as though they were done by user 15. This approach implies that sufficient tables are provided to map each user from his or her home (machine, UID) pair to the appropriate UID for any other machine (and that messages cannot be tampered with).

In a true distributed system there should be a unique UID for every user, and that UID should be valid on all machines without any mapping. In this way no protection problems arise on remote accesses to files; as far as protection goes, a remote access can be treated like a local access with the same UID. The protection issue makes the difference between a network operating system and a distributed one clear: In one case there are various machines, each with its own user-to-UID mapping, and in the other there is a single, systemwide mapping that is valid everywhere.

1.3 Execution Location

Program execution is the third area in which machine boundaries are visible in network operating systems. When a user or a running program wants to create a new process, where is the process created? At least four schemes have been used thus far. The first of these is that the user simply

says CREATE PROCESS in one way or another, and specifies nothing about where. Depending on the implementation, this can be the best or the worst way to do it. In the most distributed case, the system chooses a CPU by looking at the load, the location of the files to be used, etc. In the least distributed case, the system always runs the process on one specific machine (usually the machine on which the user is logged in).

The second approach to process location is to allow users to run jobs on any machine by first logging in there. In this model, processes on different machines cannot communicate or exchange data, but a simple manual load balancing is possible.

The third approach is a special command that the user types at a terminal to cause a program to be executed on a specific machine. A typical command might be

remote vax4 who

to run the who program on machine vax4. In this arrangement, the environment of the new process is the remote machine. In other words, if that process tries to read or write files from its current working directory, it will discover that its working directory is on the remote machine, and that files that were in the parent process's directory are no longer present. Similarly, files written in the working directory will appear on the remote machine, not the local one.

The fourth approach is to provide the CREATE PROCESS system call with a parameter specifying where to run the new process, possibly with a new system call for specifying the default site. As with the previous method, the environment will generally be the remote machine. In many cases, signals and other forms of interprocess communication between processes do not work properly among processes on different machines.

A final point about the difference between network and distributed operating systems is how they are implemented. A common way to realize a network operating system is to put a layer of software on top of the native operating systems of the individual machines (e.g., Mamrak et al. [1982]). For example, one could write a special library package that would intercept all the system calls and decide whether each one was local or remote [Brownbridge et al. 1982]. Although most system calls can be handled this way without modifying the kernel, invariably there are a few things, such as interprocess signals, interrupt characters (e.g., BREAK) from the keyboard, etc., that are hard to get right. In a true distributed operating system one would normally write the kernel from scratch.

1.4 An Example: The Sun Network File System

To provide a contrast with the true distributed systems described later in this paper, in this section we look briefly at a network operating system that runs on the Sun Microsystems workstations. These workstations are intended for use as personal computers. Each one has its own CPU, local memory, and a large bitmapped display. Workstations can be configured with or without a local disk, as desired. All the workstations run a version of 4.2BSD UNIX specially modified for networking. This arrangement is a classic example of a network operating system: Each computer runs a traditional operating system, UNIX, and each has its own user(s), but with extra features added to make networking more convenient. During its evolution the Sun system has gone through three distinct versions, which we now describe.

In the first version each of the workstations was completely independent from all the others, except that a program rcp was provided to copy files from one workstation to another. By typing a command such as

rcp M1:/usr/jim/file.c M2:/usr/ast/f.c

it was possible to transfer whole files from one machine to another.

In the second version, Network Disk (ND), a network disk server was provided to support diskless workstations. Disk space on the disk server's machine was divided into disjoint partitions, with each partition acting as the virtual disk for some (diskless) workstation. Whenever a diskless workstation needed to read a file, the request was processed
In the first version each of the workstations was completely independent from all the others, except that a program rep was provided to copy files from one workstation to another. By typing a command such as rep Ml:/usr/jim/file.c M2:/usr/ast/f.c it was possible to transfer whole files from one machine to another. In the second version, Network Disk (ND), a network disk server was provided to support diskless workstations. Disk space on the disk server s machine was divided into disjoint partitions, with each partition acting as the virtual disk for some (diskless) workstation. Whenever a diskless workstation needed to read a file, the request was processed

locally until it got down to the level of the device driver, at which point the needed block was retrieved by sending a message to the remote disk server. In effect, the network was merely being used to simulate a disk controller. With this network disk system, sharing of disk partitions was not possible.

The third version, the Network File System (NFS), allows remote directories to be mounted in the local file tree on any workstation. By mounting, say, a remote directory doc on the empty local directory /usr/doc, all subsequent references to /usr/doc are automatically routed to the remote system. Sharing is allowed in NFS, so several users can read files on a remote machine at the same time. To prevent users from reading other people's private files, a directory can only be mounted remotely if it is explicitly exported by the workstation it is located on. A directory is exported by entering a line for it in a file /etc/exports. To improve the performance of remote access, both the client machine and the server machine do block caching. Remote services can be located using a Yellow Pages server that maps service names onto their network locations.

The NFS is implemented by splitting the operating system up into three layers. The top layer handles directories, and maps each path name onto a generalized i-node called a vnode, consisting of a (machine, i-node) pair, making each vnode globally unique. Vnode numbers are presented to the middle layer, the virtual file system (VFS). This layer checks to see if a requested vnode is local or not. If it is local, it calls the local disk driver or, in the case of an ND partition, sends a message to the remote disk server. If it is remote, the VFS calls the bottom layer with a request to process it remotely. The bottom layer accepts requests for accesses to remote vnodes and sends them over the network to the bottom layer on the serving machine. From there they propagate upward through the VFS layer to the top layer, where they are reinjected into the VFS layer. The VFS layer sees a request for a local vnode and processes it normally, without realizing that the top layer is actually working on behalf of a remote kernel. The reply retraces the same path in the other direction.

The protocol between workstations has been carefully designed to be robust in the face of network and server crashes. Each request completely identifies the file (by its vnode), the position in the file, and the byte count. Between requests, the server does not maintain any state information about which files are open or where the current file position is. Thus, if a server crashes and is rebooted, no state information will be lost.

The ND and NFS facilities are quite different and can both be used on the same workstation without conflict. ND works at a low level and just handles remote block I/O without regard to the structure of the information on the disk. NFS works at a much higher level and effectively takes requests appearing at the top of the operating system on the client machine and gets them over to the top of the operating system on the server machine, where they are processed in the same way as local requests.

2. DESIGN ISSUES

Now we turn from traditional computer systems with some networking facilities added on to systems designed with the intention of being distributed.
In this section we look at five issues that distributed systems designers are faced with: communication primitives, naming and protection, resource management, fault tolerance, and the services to provide. Although no list could possibly be exhaustive at this early stage of development, these topics should provide a reasonable impression of the areas in which current research is proceeding.

2.1 Communication Primitives

The computers forming a distributed system normally do not share primary memory, and so communication via shared memory techniques such as semaphores and monitors is generally not applicable.

Instead, message passing in one form or another is used. One widely discussed framework for message-passing systems is the ISO OSI reference model, which has seven layers, each performing a well-defined function [Zimmermann 1980]. The seven layers are the physical layer, data link layer, network layer, transport layer, session layer, presentation layer, and application layer. By using this model it is possible to connect computers with widely different operating systems, character codes, and ways of viewing the world.

Unfortunately, the overhead created by all these layers is substantial. In a distributed system consisting primarily of huge mainframes from different manufacturers, connected by slow leased lines (say, 56 kilobits per second), the overhead might be tolerable. Plenty of computing capacity would be available for running complex protocols, and the narrow bandwidth means that close coupling between the systems would be impossible anyway. On the other hand, in a distributed system consisting of identical microcomputers connected by a 10-megabit-per-second or faster local network, the price of the ISO model is generally too high. Nearly all the experimental distributed systems discussed in the literature thus far have opted for a different, much simpler model, so we do not mention the ISO model further in this paper.

Message Passing

The model that is favored by researchers in this area is the client-server model, in which a client process wanting some service (e.g., reading some data from a file) sends a message to the server and then waits for a reply message, as shown in Figure 2. In the most naked form the system just provides two primitives: SEND and RECEIVE. The SEND primitive specifies the destination and provides a message; the RECEIVE primitive tells from whom a message is desired (including anyone) and provides a buffer where the incoming message is to be stored. No initial setup is required, and no connection is established, hence no teardown is required.

Figure 2. Client-server model of communication: the client sends a request message, and the server sends a reply message.

Precisely what semantics these primitives ought to have has been a subject of much controversy among researchers. Two of the fundamental decisions that must be made are unreliable versus reliable and nonblocking versus blocking primitives. At one extreme, SEND can put a message out onto the network and wish it good luck. No guarantee of delivery is provided, and no automatic retransmission is attempted by the system if the message is lost. At the other extreme, SEND can handle lost messages, retransmissions, and acknowledgments internally, so that when SEND terminates, the program is sure that the message has been received and acknowledged.

Blocking versus Nonblocking Primitives. The other choice is between nonblocking and blocking primitives. With nonblocking primitives, SEND returns control to the user program as soon as the message has been queued for subsequent transmission (or a copy made). If no copy is made, any changes the program makes to the data before or (heaven forbid) while they are being sent are made at the program's peril. When the message has been transmitted (or copied to a safe place for subsequent transmission), the program is interrupted to inform it that the buffer may be reused. The corresponding RECEIVE primitive signals a willingness to receive a message and provides a buffer for it to be put into.
When a message has arrived, the program is informed by interrupt, or it can poll for status continuously or go to sleep until the interrupt arrives. The advantage of these nonblocking primitives is that they provide the maximum flexibility: Programs can

compute and perform message I/O in parallel in any way they want. Nonblocking primitives also have a disadvantage: They make programming tricky and difficult. Irreproducible, timing-dependent programs are painful to write and awful to debug. Consequently, many people advocate sacrificing some flexibility and efficiency by using blocking primitives. A blocking SEND does not return control to the user until the message has been sent (unreliable blocking primitive) or until the message has been sent and an acknowledgment received (reliable blocking primitive). Either way, the program may immediately modify the buffer without danger. A blocking RECEIVE does not return control until a message has been placed in the buffer. Reliable and unreliable RECEIVEs differ in that the former automatically acknowledges receipt of a message, whereas the latter does not. It is not reasonable to combine a reliable SEND with an unreliable RECEIVE, or vice versa; so the system designers must make a choice and provide one set or the other. Blocking and nonblocking primitives do not conflict, so there is no harm done if the sender uses one and the receiver the other.

Buffered versus Unbuffered Primitives. Another design decision that must be made is whether or not to buffer messages. The simplest strategy is not to buffer. When a sender has a message for a receiver that has not (yet) executed a RECEIVE primitive, the sender is blocked until a RECEIVE has been done, at which time the message is copied from sender to receiver. This strategy is sometimes referred to as a rendezvous. A slight variation on this theme is to copy the message to an internal buffer on the sender's machine, thus providing for a nonblocking version of the same scheme. As long as the sender does not do any more SENDs before the RECEIVE occurs, no problem arises.

A more general solution is to have a buffering mechanism, usually in the operating system kernel, which allows senders to have multiple SENDs outstanding, even without any interest on the part of the receiver. Although buffered message passing can be implemented in many ways, a typical approach is to provide users with a system call CREATEBUF, which creates a kernel buffer, sometimes called a mailbox, of a user-specified size. To communicate, a sender can now send messages to the receiver's mailbox, where they will be buffered until requested by the receiver. Buffering is not only more complex (creating, destroying, and generally managing the mailboxes), but also raises issues of protection, the need for special high-priority interrupt messages, what to do with mailboxes owned by processes that have been killed or died of natural causes, and more.

A more structured form of communication is achieved by distinguishing requests from replies. With this approach, one typically has three primitives: SEND-GET, GET-REQUEST, and SEND-REPLY. SEND-GET is used by clients to send requests and get replies. It combines a SEND to a server with a RECEIVE to get the server's reply. GET-REQUEST is done by servers to acquire messages containing work for them to do. When a server has carried the work out, it sends a reply with SEND-REPLY.
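The three request-reply primitives just described can be modeled in a few lines. In the sketch below, two threads stand in for the client and server machines, and a one-slot in-memory mailbox stands in for the network; the signatures, the names, and the single-outstanding-request limitation are all assumptions made for illustration, not a description of any particular system.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static char request_buf[128], reply_buf[128];
static int have_request = 0, have_reply = 0;

void send_get(const char *request, char *reply)   /* client: send and block for the reply */
{
    pthread_mutex_lock(&lock);
    strcpy(request_buf, request);
    have_request = 1;
    pthread_cond_broadcast(&cond);
    while (!have_reply)
        pthread_cond_wait(&cond, &lock);
    strcpy(reply, reply_buf);
    have_reply = 0;
    pthread_mutex_unlock(&lock);
}

void get_request(char *request)                    /* server: block until work arrives */
{
    pthread_mutex_lock(&lock);
    while (!have_request)
        pthread_cond_wait(&cond, &lock);
    strcpy(request, request_buf);
    have_request = 0;
    pthread_mutex_unlock(&lock);
}

void send_reply(const char *reply)                 /* server: answer the last request */
{
    pthread_mutex_lock(&lock);
    strcpy(reply_buf, reply);
    have_reply = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

void *server(void *arg)
{
    char req[128];
    get_request(req);                              /* e.g., "READ block 7" */
    send_reply("contents of block 7");
    return arg;
}

int main(void)
{
    pthread_t tid;
    char reply[128];
    pthread_create(&tid, NULL, server, NULL);
    send_get("READ block 7", reply);
    printf("client received: %s\n", reply);
    pthread_join(tid, NULL);
    return 0;
}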
By thus restricting the message traffic and using reliable, blocking primitives, one can create some order in the chaos.

Remote Procedure Call (RPC)

The next step forward in message-passing systems is the realization that the model of client sends request and blocks until server sends reply looks very similar to a traditional procedure call from the client to the server. This model has become known in the literature as remote procedure call and has been widely discussed [Birrell and Nelson 1984; Nelson 1981; Spector 1982]. The idea is to make the semantics of intermachine communication as similar as possible to normal procedure calls, because the latter are familiar and well understood, and have proved their worth over the years as a tool for dealing with abstraction. It can be viewed as a refinement of the reliable, blocking SEND-GET, GET-REQUEST,

SEND-REPLY primitives, with a more user-friendly syntax.

The remote procedure call can be organized as follows. The client (calling program) makes a normal procedure call, say, p(x, y), on its machine, with the intention of invoking the remote procedure p on some other machine. A dummy or stub procedure p must be included in the caller's address space, or at least be dynamically linked to it upon call. This procedure, which may be automatically generated by the compiler, collects the parameters and packs them into a message in a standard format. It then sends the message to the remote machine (using SEND-GET) and blocks, waiting for an answer (see Figure 3). At the remote machine, another stub procedure should be waiting for a message using GET-REQUEST. When a message comes in, the parameters are unpacked by an input-handling procedure, which then makes the local call p(x, y). The remote procedure p is thus called locally, and so its normal assumptions about where to find parameters, the state of the stack, etc., are identical to the case of a purely local call. The only procedures that know that the call is remote are the stubs, which build and send the message on the client side and disassemble and make the call on the server side. The result of the procedure call follows an analogous path in the reverse direction.

Figure 3. Remote procedure call.

Remote Procedure Call Design Issues. Although at first glance the remote procedure call model seems clean and simple, under the surface there are several problems. One problem concerns parameter (and result) passing. In most programming languages, parameters can be passed by value or by reference. Passing value parameters over the network is easy; the stub just copies them into the message and off they go. Passing reference parameters (pointers) over the network is not so easy. One needs a unique, systemwide pointer for each object so that it can be remotely accessed. For large objects, such as files, some kind of capability mechanism [Dennis and Van Horn 1966; Levy 1984; Pashtan] could be set up, using capabilities as pointers. For small objects, such as integers and Booleans, the amount of overhead and mechanism needed to create a capability and send it in a protected way is so large that this solution is highly undesirable.

Still another problem that must be dealt with is how to represent parameters and results in messages. This representation is greatly complicated when different types of machines are involved in a communication. A floating-point number produced on one machine is unlikely to have the same value on a different machine, and even a negative integer will create problems between 1's complement and 2's complement machines. Converting to and from a standard format on every message sent and received is an obvious possibility, but it is expensive and wasteful, especially when the sender and receiver do, in fact, use the same internal format. If the sender uses its internal format (along with an indication of which format it is) and lets the receiver do the conversion, every machine must be prepared to convert from every other format. When a new machine type is introduced, much existing software must be upgraded. Any way it is done, with remote procedure call (RPC) or with plain messages, it is an unpleasant business.
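A minimal sketch of the stub idea for a call p(x, y) follows. The message layout (two 32-bit integers in network byte order as the standard format), the function names, and the direct hand-off from client stub to server stub in place of a real network transport are all assumptions made for illustration.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl/ntohl serve as one possible standard format */

int p(int x, int y)                      /* the "remote" procedure itself */
{
    return x + y;
}

/* Server-side stub: unpack the parameters and make the local call. */
int server_stub(const unsigned char msg[8])
{
    uint32_t x, y;
    memcpy(&x, msg,     4);
    memcpy(&y, msg + 4, 4);
    return p((int)ntohl(x), (int)ntohl(y));
}

/* Client-side stub: pack the parameters in the standard format and "send". */
int p_stub(int x, int y)
{
    unsigned char msg[8];
    uint32_t nx = htonl((uint32_t)x), ny = htonl((uint32_t)y);
    memcpy(msg,     &nx, 4);
    memcpy(msg + 4, &ny, 4);
    return server_stub(msg);             /* stands in for SEND-GET over the network */
}

int main(void)
{
    printf("p(3, 4) via the stubs = %d\n", p_stub(3, 4));
    return 0;
}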
Some of the unpleasantness can be hidden from the user if the remote procedure call mechanism is embedded in a programming language with strong typing, so that the receiver at least knows how many parameters to expect and what types they have. In this respect, a weakly typed language such as C, in which procedures with a variable number of parameters are common, is more complicated to deal with.

Still another problem with RPC is the issue of client-server binding. Consider, for example, a system with multiple file servers. If a client creates a file on one of the file servers, it is usually desirable that subsequent

writes to that file go to the file server where the file was created. With mailboxes, arranging for this is straightforward. The client simply addresses the WRITE messages to the same mailbox that the CREATE message was sent to. Since each file server has its own mailbox, there is no ambiguity. When RPC is used, the situation is more complicated, since all the client does is put a procedure call such as

write(FileDescriptor, BufferAddress, ByteCount);

in his program. RPC intentionally hides all the details of locating servers from the client, but sometimes, as in this example, the details are important.

In some applications, broadcasting and multicasting (sending to a set of destinations, rather than just one) is useful. For example, when trying to locate a certain person, process, or service, sometimes the only approach is to broadcast an inquiry message and wait for the replies to come back. RPC does not lend itself well to sending messages to sets of processes and getting answers back from some or all of them. The semantics are completely different. Despite all these disadvantages, RPC remains an interesting form of communication, and much current research is being addressed toward improving it and solving the various problems discussed above.

Error Handling

Error handling in distributed systems is radically different from that of centralized systems. In a centralized system, a system crash means that the client, server, and communication channel are all completely destroyed, and no attempt is made to revive them. In a distributed system, matters are more complex. If a client has initiated a remote procedure call with a server that has crashed, the client may just be left hanging forever unless a time-out is built in. However, such a time-out introduces race conditions in the form of clients that time out too quickly, thinking that the server is down, when in fact it is merely very slow.

Client crashes can also cause trouble for servers. Consider, for example, the case of processes A and B communicating via the UNIX pipe model A | B, with A the server and B the client. B asks A for data and gets a reply, but unless that reply is acknowledged somehow, A does not know when it can safely discard data that it may not be able to reproduce. If B crashes, how long should A hold onto the data? (Hint: If the answer is less than infinity, problems will be introduced whenever B is slow in sending an acknowledgment.)

Closely related to this is the problem of what happens if a client cannot tell whether or not a server has crashed. Simply waiting until the server is rebooted and trying again sometimes works and sometimes does not. A case in which it works: The client asks to read block 7 of some file. A case in which it does not work: The client says transfer a million dollars from one bank account to another. In the former case, it does not matter whether or not the server carried out the request before crashing; carrying it out a second time does no harm. In the latter case, one would definitely prefer the call to be carried out exactly once, no more and no less. Calls that may be repeated without harm (like the first example) are said to be idempotent. Unfortunately, it is not always possible to arrange for all calls to have this property.
Any call that causes action to occur in the outside world, such as transferring money, printing lines, or opening a valve in an automated chocolate factory just long enough to fill exactly one vat, is likely to cause trouble if performed twice. Spector [1982] and Nelson [1981] have looked at the problem of trying to make sure that remote procedure calls are executed exactly once, and they have developed taxonomies for classifying the semantics of different systems. These vary from systems that offer no guarantee at all (zero or more executions), to those that guarantee at most one execution (zero or one), to those that guarantee at least one execution (one or more). Getting it right (exactly one) is probably impossible, because even if the remote execution can be reduced to one instruction

(e.g., setting a bit in a device register that opens the chocolate valve), one can never be sure after a crash whether the system went down a microsecond before or a microsecond after the one critical instruction. Sometimes one can make a guess based on observing external events (e.g., looking to see whether the factory floor is covered with a sticky, brown material), but in general there is no way of knowing. Note that the problem of creating stable storage [Lampson] is fundamentally different, since remote procedure calls to the stable storage server in that model never cause events external to the computers.

Implementation Issues

Constructing a system in principle is always easier than constructing it in practice. Building a 16-node distributed system that has a total computing power about equal to a single-node system is surprisingly easy. This observation leads to tension between the goals of making it work fast in the normal case and making the semantics reasonable when something goes wrong. Some experimental systems have put the emphasis on one goal and some on the other, but more research is needed before we have systems that are both fast and graceful in the face of crashes.

Some things have been learned from past work, however. Foremost among these is that making message passing efficient is very important. To this end, systems should be designed to minimize copying of data [Cheriton 1984a]. For example, a remote procedure call system that first copies each message from the user to the stub, from the stub to the kernel, and finally from the kernel to the network interface board requires three copies on the sending side, and probably three more on the receiving side, for a total of six. If the call is to a remote file server to write a 1K block of data to disk, at a copy time of 1 microsecond per byte, 6 milliseconds are needed just for copying, which puts an upper limit of 167 calls per second, or a throughput of 167 kilobytes per second. When other sources of overhead are considered (e.g., the reply message, the time waiting for access to the network, transmission time), achieving even 80 kilobytes per second will be difficult, if not impossible, no matter how high the network bandwidth or disk speed. Thus it is desirable to avoid copying, but this is not always simple to achieve, since without copies, (part of) a needed message may be swapped or paged out when it is needed.

Another point worth making is that there is always a substantial fixed overhead with preparing, sending, and receiving a message, even a short message, such as a request to read from a remote file server. The kernel must be invoked, the state of the current process must be saved, the destination must be located, various tables must be updated, permission to access the network must be obtained (e.g., wait for the network to become free or wait for the token), and quite a bit of bookkeeping must be done. This fixed overhead argues for making messages as long as possible, to reduce the number of messages. Unfortunately, many current local networks limit physical packets to 1K or 2K; 4K or 8K would be much better. Of course, if the packets become too long, a highly interactive user may occasionally be queued behind ten maximum-length packets, degrading response time; so the optimum size depends on the work load.
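The arithmetic behind these copying figures can be spelled out; the numbers below simply restate the text's assumptions (a 1K block treated as 1000 bytes, six copies, and a copy cost of 1 microsecond per byte) rather than measurements of any real system.

#include <stdio.h>

int main(void)
{
    double bytes_per_block = 1000.0;  /* the text's 1K block, rounded to 1000 bytes */
    double copies = 6.0;              /* three copies on each side of the transfer */
    double usec_per_byte = 1.0;       /* assumed copy cost */

    double usec_per_call = bytes_per_block * copies * usec_per_byte;  /* 6000 us, i.e., 6 ms */
    double calls_per_sec = 1.0e6 / usec_per_call;                     /* about 167 */
    double kbytes_per_sec = calls_per_sec * bytes_per_block / 1000.0; /* about 167 */

    printf("%.0f us per call, %.0f calls/s, %.0f kilobytes/s\n",
           usec_per_call, calls_per_sec, kbytes_per_sec);
    return 0;
}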
Virtual Circuits versus Datagrams

There is much controversy over whether remote procedure call ought to be built on top of a flow-controlled, error-controlled, virtual circuit mechanism or directly on top of the unreliable, connectionless (datagram) service. Saltzer et al. [1984] have pointed out that since high reliability can only be achieved by end-to-end acknowledgments at the highest level of protocol, the lower levels need not be 100 percent reliable. The overhead incurred in providing a clean virtual circuit upon which to build remote procedure calls (or any other message-passing system) is therefore wasted. This line of thinking argues for building the message system directly on the raw datagram interface. The other side of the coin is that it would be nice for a distributed system to be able

to encompass heterogeneous computers in different countries with different post, telephone, and telegraph (PTT) networks and possibly different national alphabets, and that this environment requires complex multilayered protocol structures. It is our observation that both arguments are valid, but, depending on whether one is trying to forge a collection of small computers into a virtual uniprocessor or merely to access remote data transparently, one or the other will dominate.

Even if one opts for building RPC on top of the raw datagram service provided by a local network, there are still a number of protocols open to the implementer. The simplest one is to have every request and reply separately acknowledged. The message sequence for a remote procedure call is then REQUEST, ACK, REPLY, ACK, as shown in Figure 4a. The ACKs are managed by the kernel without user knowledge. The number of messages can be reduced from four to three by allowing the REPLY to serve as the ACK for the REQUEST, as shown in Figure 4b. However, a problem arises when the REPLY can be delayed for a long time. For example, when a login process makes an RPC to a terminal server requesting characters, it may be hours or days before someone steps up to a terminal and begins typing. In this event, an additional message has to be introduced to allow the sending kernel to inquire whether the message has arrived or not. A further step in the same direction is to eliminate the other ACK as well, and let the arrival of the next REQUEST imply an acknowledgment of the previous REPLY (see Figure 4c). Again, some mechanism is needed to deal with the case that no new REQUEST is forthcoming quickly.

Figure 4. Remote procedure call (a) with individual acknowledgments per message, (b) with the reply as the request acknowledgment, and (c) with no explicit acknowledgments.

One of the great difficulties in implementing efficient communication is that it is more of a black art than a science. Even straightforward implementations can have unexpected consequences, as the following example from Sventek et al. [1983] shows. Consider a ring containing a circulating token. To transmit, a machine captures and removes the token, puts a message on the network, and then replaces the token, thus allowing the next machine downstream the opportunity to capture it. In theory, such a network is fair in that each user has equal access to the network and no one user can monopolize it to the detriment of others. In practice, suppose that two users each want to read a long file from a file server. User A sends a request message to the server, and then replaces the token on the network for B to acquire. After A's message arrives at the server, it takes a short time for the server to handle the incoming message interrupt and reenable the receiving hardware. Until the receiver is reenabled, the server is deaf. Within a microsecond or two of the time A puts the token back on the network, B sees and grabs it, and begins transmitting a request to the (unbeknown to B) deaf file server. Even if the server reenables halfway through B's message, the message will be rejected owing to a missing header, a bad frame format, and a checksum error. According to the ring protocol, after sending one message, B must now replace the token, which A captures for a successful transmission. Once again B transmits during the server's deaf period, and so on.
Conclusion: B gets no service at all until A is finished. If A happens to be scanning through the Manhattan telephone book, B may be in for a long wait. This specific problem can be solved by inserting random delays in places to break the synchrony, but our point is that totally unexpected problems like this make it necessary to build and observe real systems to gain insight into the problems. Abstract formulations and simulations are not enough.

2.2 Naming and Protection

All operating systems support objects such as files, directories, segments, mailboxes, processes, services, servers, nodes, and I/O devices. When a process wants to access one of these objects, it must present some kind of name to the operating system to specify which object it wants to access. In some instances these names are ASCII strings designed for human use; in others they are binary numbers used only internally. In all cases they have to be managed and protected from misuse.

Naming as Mapping

Naming can best be seen as a problem of mapping between two domains. For example, the directory system in UNIX provides a mapping between ASCII path names and i-node numbers. When an OPEN system call is made, the kernel converts the name of the file to be opened into its i-node number. Internal to the kernel, files are nearly always referred to by i-node number, not ASCII string. Just about all operating systems have something similar. In a distributed system a separate name server is sometimes used to map user-chosen names (ASCII strings) onto objects in an analogous way. Another example of naming is the mapping of virtual addresses onto physical addresses in a virtual memory system. The paging hardware takes a virtual address as input and yields a physical address as output for use by the real memory.

In some cases naming implies only a single level of mapping, but in other cases it can imply multiple levels. For example, to use some service, a process might first have to map the service name onto the name of a server process that is prepared to offer the service. As a second step, the server would then be mapped onto the number of the CPU on which that process is running. The mapping need not always be unique; for example, there may be multiple processes prepared to offer the same service.

Name Servers

In centralized systems, the problem of naming can be effectively handled in a straightforward way. The system maintains a table or database providing the necessary name-to-object mappings. The most straightforward generalization of this approach to distributed systems is the single name server model. In this model, a server accepts names in one domain and maps them onto names in another domain. For example, to locate services in some distributed systems, one sends the service name in ASCII to the name server, and it replies with the node number where that service can be found, or with the process name of the server process, or perhaps with the name of a mailbox to which requests for service can be sent. The name server's database is built up by registering services, processes, etc., that want to be publicly known. File directories can be regarded as a special case of name service.

Although this model is often acceptable in a small distributed system located at a single site, in a large system it is undesirable to have a single centralized component (the name server) whose demise can bring the whole system to a grinding halt. In addition, if it becomes overloaded, performance will degrade.
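The single name server model amounts to little more than a lookup table that is filled in by registration and consulted at lookup time, as the sketch below shows. The table size, the function names, and the node numbers are invented for illustration.

#include <stdio.h>
#include <string.h>

#define MAX_NAMES 64

static struct { char name[32]; int node; } table[MAX_NAMES];
static int entries = 0;

void ns_register(const char *name, int node)     /* a service makes itself publicly known */
{
    if (entries < MAX_NAMES) {
        strcpy(table[entries].name, name);
        table[entries].node = node;
        entries++;
    }
}

int ns_lookup(const char *name)                  /* map an ASCII name onto a node number */
{
    for (int i = 0; i < entries; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].node;
    return -1;                                   /* unknown name */
}

int main(void)
{
    ns_register("file-server", 12);
    ns_register("print-server", 7);
    printf("file-server is on node %d\n", ns_lookup("file-server"));
    return 0;
}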
Furthermore, in a geographically distributed system that may have nodes in different cities or even countries, having a single name server will be inefficient owing to the long delays in accessing it. The next approach is to partition the system into domains, each with its own name server. If the system is composed of multiple local networks connected by gateways and bridges, it seems natural to have one name server per local network. One way to organize such a system is to have a

global naming tree, with files and other objects having names of the form /country/city/network/pathname. When such a name is presented to any name server, it can immediately route the request to some name server in the designated country, which then sends it to a name server in the designated city, and so on until it reaches the name server in the network where the object is located, where the mapping can be done. Telephone numbers use such a hierarchy, composed of country code, area code, exchange code (the first three digits of the telephone number in North America), and subscriber line number.

Having multiple name servers does not necessarily require having a single, global naming hierarchy. Another way to organize the name servers is to have each one effectively maintain a table of, for example, (ASCII string, pointer) pairs, where the pointer is really a kind of capability for any object or domain in the system. When a name, say a/b/c, is looked up by the local name server, it may well yield a pointer to another domain (name server), to which the rest of the name, b/c, is sent for further processing (see Figure 5). This facility can be used to provide links (in the UNIX sense) to files or objects whose precise whereabouts is managed by a remote name server. Thus if a file foobar is located in another local network, n, with name server n.s, one can make an entry in the local name server's table for the pair (x, n.s) and then access x/foobar as though it were a local object. Any appropriately authorized user or process knowing the name x/foobar could make its own synonym s and then perform accesses using s/x/foobar. Each name server parsing a name that involves multiple name servers just strips off the first component and passes the rest of the name to the name server found by looking up the first component locally.

Figure 5. Distributing the lookup of a/b/c over three name servers: name server 1 looks up a/b/c, name server 2 looks up b/c, and name server 3 looks up c.

A more extreme way of distributing the name server is to have each machine manage its own names. To look up a name, one broadcasts it on the network. At each machine, the incoming request is passed to the local name server, which replies only if it finds a match. Although broadcasting is easiest over a local network such as a ring net or CSMA net (e.g., Ethernet), it is also possible over store-and-forward packet switching networks such as the ARPANET [Dalal].

Although the normal use of a name server is to map an ASCII string onto a binary number used internally to the system, such as a process identifier or machine number, once in a while the inverse mapping is also useful. For example, if a machine crashes, upon rebooting it could present its (hardwired) node number to the name server to ask what it was doing before the crash, that is, ask for the ASCII string corresponding to the service that it is supposed to be offering so that it can figure out what program to reboot.

2.3 Resource Management

Resource management in a distributed system differs from that in a centralized system in a fundamental way. Centralized

2.3 Resource Management

Resource management in a distributed system differs from that in a centralized system in a fundamental way. Centralized systems always have tables that give complete and up-to-date status information about all the resources being managed; distributed systems do not. For example, the process manager in a traditional centralized operating system normally uses a process table with one entry per potential process. When a new process has to be started, it is simple enough to scan the whole table to see whether a slot is free. A distributed operating system, on the other hand, has a much harder job of finding out whether a processor is free, especially if the system designers have rejected the idea of having any central tables at all, for reasons of reliability. Furthermore, even if there is a central table, recent events on outlying processors may have made some table entries obsolete without the table manager knowing it.

The problem of managing resources without having accurate global state information is very difficult. Relatively little work has been done in this area. In the following sections we look at some work that has been done, including distributed process management and scheduling.

Processor Allocation

One of the key resources to be managed in a distributed system is the set of available processors. One approach that has been proposed for keeping tabs on a collection of processors is to organize them in a logical hierarchy independent of the physical structure of the network, as in MICROS [Wittie and van Tilborg]. This approach organizes the machines like people in corporate, military, academic, and other real-world hierarchies. Some of the machines are workers and others are managers. For each group of k workers, one manager machine (the "department head") is assigned the task of keeping track of who is busy and who is idle. If the system is large, there will be an unwieldy number of department heads, so some machines will function as "deans," each riding herd on k department heads. If there are many deans, they too can be organized hierarchically, with a "big cheese" keeping tabs on k deans. This hierarchy can be extended ad infinitum, with the number of levels needed growing logarithmically with the number of workers. Since each processor need only maintain communication with one superior and k subordinates, the information stream is manageable.

An obvious question is, what happens when a department head, or worse yet, a big cheese, stops functioning (crashes)? One answer is to promote one of the direct subordinates of the faulty manager to fill in for the boss. The choice of which one can be made by the subordinates themselves, by the deceased's peers, or, in a more autocratic system, by the sick manager's boss. To avoid having a single (vulnerable) manager at the top of the tree, one can truncate the tree at the top and have a committee as the ultimate authority. When a member of the ruling committee malfunctions, the remaining members promote someone one level down as a replacement. Although this scheme is not completely distributed, it is feasible and works well in practice. In particular, the system is self-repairing and can survive occasional crashes of both workers and managers without any long-term effects.

In MICROS, the processors are monoprogrammed, so if a job requiring S processes suddenly appears, the system must allocate S processors for it. Jobs can be created at any level of the hierarchy. The strategy used is for each manager to keep track of approximately how many workers below it are available (possibly several levels below it).
If it thinks that a sufficient number are available, it reserves some number R of them, where R ≥ S, because the estimate of available workers may not be exact and some machines may be down. If the manager receiving the request thinks that it has too few processors available, it passes the request upward in the tree to its boss. If the boss cannot handle it either, the request continues propagating upward until it reaches a level that has enough available workers at its disposal. At that point, the manager splits the request into parts and parcels them out among the managers below it, which then do the same thing until the wave of scheduling requests hits bottom. At the bottom level, the processors are marked as busy, and the actual number of processors allocated is reported back up the tree.
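The upward propagation of a request can be sketched as follows (a minimal illustration of the idea only, not code from MICROS; the class Node, the helper reserve, and the overcommit factor used to pick R are all assumptions made for the example):

import math

OVERCOMMIT = 1.2        # reserve R = ceil(OVERCOMMIT * S) workers, since estimates may be stale

class Node:
    def __init__(self, parent=None, children=None, idle=0):
        self.parent = parent              # the manager one level up (None at the top)
        self.children = children or []    # subordinate managers or workers
        self.idle = idle                  # for a worker: 1 if free, 0 if busy

    def estimate(self):
        if not self.children:             # a worker machine
            return self.idle
        return sum(c.estimate() for c in self.children)   # only approximate in a real system

    def allocate(self, s):
        """Reserve processors for a job needing s processes; returns the workers found."""
        r = math.ceil(OVERCOMMIT * s)
        if self.estimate() < r:
            # Too few available below us: pass the request upward in the tree.
            return self.parent.allocate(s) if self.parent else []
        granted = []
        for child in self.children:       # split the request among subordinates
            if len(granted) >= r:
                break
            granted.extend(child.reserve(r - len(granted)))
        return granted

    def reserve(self, n):
        """Mark up to n idle workers below this node as busy and return them."""
        if not self.children:
            if n > 0 and self.idle:
                self.idle = 0
                return [self]
            return []
        got = []
        for child in self.children:
            if len(got) >= n:
                break
            got.extend(child.reserve(n - len(got)))
        return got

The choice of the overcommit factor corresponds to the trade-off discussed next: too small and requests bounce back up the tree, too large and processors sit idle until they are released.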

To make this strategy work well, R must be large enough that the probability is high that enough workers will be found to handle the whole job. Otherwise, the request will have to move up one level in the tree and start all over, wasting considerable time and computing power. On the other hand, if R is too large, too many processors will be allocated, wasting computing capacity until word gets back to the top and they can be released. The whole situation is greatly complicated by the fact that requests for processors can be generated randomly anywhere in the system, so at any instant multiple requests are likely to be in various stages of the allocation algorithm, potentially giving rise to out-of-date estimates of available workers, race conditions, deadlocks, and more. Van Tilborg and Wittie [1981] give a mathematical analysis of the problem and cover in detail various other aspects not described here.

Scheduling

The hierarchical model provides a general model for resource control but does not provide any specific guidance on how to do scheduling. If each process uses an entire processor (i.e., no multiprogramming), and each process is independent of all the others, any process can be assigned to any processor at random. However, if it is common that several processes are working together and must communicate frequently with each other, as in UNIX pipelines or in cascaded (nested) remote procedure calls, then it is desirable to make sure that the whole group runs at once. In this section we address that issue.

Let us assume that each processor can handle up to N processes. If there are plenty of machines and N is reasonably large, the problem is not finding a free machine (i.e., a free slot in some process table), but something more subtle. The basic difficulty can be illustrated by an example in which processes A and B run on one machine and processes C and D run on another. Each machine is time shared in, say, 100-millisecond time slices, with A and C running in the even slices and B and D running in the odd ones, as shown in Figure 6a.

Figure 6. (a) Two jobs running out of phase with each other. (b) Scheduling matrix for eight machines, each with six time slots. The Xs indicate allocated slots.

Suppose that A sends many messages or makes many remote procedure calls to D. During time slice 0, A starts up and immediately calls D, which unfortunately is not running because it is now C's turn. After 100 milliseconds, process switching takes place, and D gets A's message, carries out the work, and quickly replies. Because B is now running, it will be another 100 milliseconds before A gets the reply and can proceed. The net result is one message exchange every 200 milliseconds. What is needed is a way to ensure that processes that communicate frequently run simultaneously.

Although it is difficult to determine the interprocess communication patterns dynamically, in many cases a group of related processes will be started off together. For example, it is usually a good bet that the filters in a UNIX pipeline will communicate with each other more than they will with other, previously started processes.
Let us assume that processes are created in groups, and that intragroup communication is much more prevalent than intergroup communication. Let us further assume that a sufficiently large number of machines are available to handle the largest group, and that each machine is

multiprogrammed with N process slots (N-way multiprogramming). Ousterhout has proposed several algorithms based on the concept of coscheduling, which takes interprocess communication patterns into account while scheduling to ensure that all members of a group run at the same time.

The first algorithm uses a conceptual matrix in which each column is the process table for one machine, as shown in Figure 6b. Thus, column 4 consists of all the processes that run on machine 4. Row 3 is the collection of all processes that are in slot 3 of some machine, starting with the process in slot 3 of machine 0, then the process in slot 3 of machine 1, and so on. The gist of his idea is to have each processor use a round-robin scheduling algorithm, with all processors first running the process in slot 0 for a fixed period, then all processors running the process in slot 1 for a fixed period, and so on. A broadcast message could be used to tell each processor when to do process switching, to keep the time slices synchronized. By putting all the members of a process group in the same slot number, but on different machines, one has the advantage of N-fold parallelism, with a guarantee that all the processes will be run at the same time, to maximize communication throughput. Thus in Figure 6b, four processes that must communicate should be put into slot 3, on machines 1, 2, 3, and 4, for optimum performance. This scheduling technique can be combined with the hierarchical model of process management used in MICROS by having each department head maintain the matrix for its workers, assigning processes to slots in the matrix, and broadcasting time signals.

Ousterhout also described several variations on this basic method to improve performance. One of these breaks the matrix into rows and concatenates the rows to form one long row. With k machines, any k consecutive slots belong to different machines. To allocate a new process group to slots, one lays a window k slots wide over the long row such that the leftmost slot is empty but the slot just outside the left edge of the window is full. If sufficient empty slots are present in the window, the processes are assigned to the empty slots; otherwise, the window is slid to the right and the algorithm repeated. Scheduling is done by starting the window at the left edge and moving rightward by about one window's worth per time slice, taking care not to split groups over windows. Ousterhout's paper discusses these and other methods in more detail and gives some performance results.
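The matrix algorithm is easy to sketch. The fragment below (an illustration under simplifying assumptions, not Ousterhout's actual code; the class and method names are invented) places each communicating group in a single slot number across different machines, so that one synchronized round-robin over slot numbers runs all members of a group simultaneously:

class CoscheduleMatrix:
    def __init__(self, machines, slots):
        # matrix[slot][machine] holds a process name, or None if the entry is free
        self.machines = machines
        self.slots = slots
        self.matrix = [[None] * machines for _ in range(slots)]

    def place_group(self, group):
        """Put every member of a group into the same slot number, each on a
        different machine, so they all run during the same time slice."""
        for row in self.matrix:
            free = [m for m in range(self.machines) if row[m] is None]
            if len(free) >= len(group):
                for machine, proc in zip(free, group):
                    row[machine] = proc
                return True
        return False                      # no single slot can hold the whole group

    def processes_for_slice(self, tick):
        """During time slice t, machine m runs matrix[t mod slots][m]."""
        return self.matrix[tick % self.slots]

schedule = CoscheduleMatrix(machines=8, slots=6)
schedule.place_group(["A", "B", "C", "D"])   # the four communicating processes of Figure 6b
print(schedule.processes_for_slice(0))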
Load Balancing

The goal of Ousterhout's work is to place processes that work together on different processors, so that they can all run in parallel. Other researchers have tried to do precisely the opposite, namely, to find subsets of all the processes in the system that are working together, so that closely related groups of processes can be placed on the same machine to reduce interprocess communication costs [Chow and Abraham 1982; Chu et al. 1980; Gylys and Edwards 1976; Lo 1984; Stone 1977, 1978; Stone and Bokhari 1978]. Yet other researchers have been concerned primarily with load balancing, to prevent a situation in which some processors are overloaded while others are empty [Barak and Shiloh 1985; Efe 1982; Krueger and Finkel 1983; Stankovic and Sidhu]. Of course, the goals of maximizing throughput, minimizing response time, and keeping the load uniform are to some extent in conflict, so many of the researchers try to evaluate different compromises and trade-offs.

Each of these different approaches to scheduling makes different assumptions about what is known and what is most important. The people trying to cluster processes to minimize communication costs, for example, assume that any process can run on any machine, that the computing needs of each process are known in advance, and that the interprocess communication traffic between each pair of processes is also known in advance. The people doing load balancing typically make the realistic assumption that nothing about the future behavior of a process is known. The minimizers are generally theorists, whereas the load balancers tend to be people making real systems who care less about optimality than about devising algorithms that can actually be used. Let us now briefly look at each of these approaches.

Graph-Theoretic Models. If the system consists of a fixed number of processes, each with known CPU and memory requirements, and a known matrix giving the average amount of traffic between each pair of processes, scheduling can be attacked as a graph-theoretic problem. The system can be represented as a graph, with each process a node and each pair of communicating processes connected by an arc labeled with the data rate between them. The problem of allocating all the processes to k processors then reduces to the problem of partitioning the graph into k disjoint subgraphs, such that each subgraph meets certain constraints (e.g., total CPU and memory requirements below some limit). Arcs that are entirely within one subgraph represent internal communication within a single processor (= fast), whereas arcs that cut across subgraph boundaries represent communication between two processors (= slow). The idea is to find a partitioning of the graph that meets the constraints and minimizes the network traffic, or some variation of this idea.

Figure 7. Two ways of statically allocating processes (nodes in the graph) to machines. Arcs show which pairs of processes communicate.

Figure 7a depicts a graph of interacting processes with one possible partitioning of the processes between two machines. Figure 7b shows a better partitioning, with less intermachine traffic, assuming that all the arcs are equally weighted. Many papers have been written on this subject, for example, Chow and Abraham [1982], Chow and Kohler [1979], Stone [1977, 1978], Stone and Bokhari [1978], and Lo [1984]. The results are somewhat academic, since in real systems virtually none of the assumptions (fixed number of processes with static requirements, known traffic matrix, error-free processors and communication) are ever met.
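For very small systems the partitioning problem can even be solved by brute force, which makes the formulation concrete. The fragment below (process names, traffic figures, and the constraint are invented for illustration; real systems need heuristics rather than exhaustive search) tries every assignment of a small process graph to two machines and keeps the one with the least intermachine traffic:

from itertools import product

traffic = {                                  # arc weights: data rate between process pairs
    ("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 4,
    ("C", "D"): 1, ("D", "E"): 5, ("B", "E"): 2,
}
processes = sorted({p for pair in traffic for p in pair})
MAX_PER_MACHINE = 3                          # stand-in for the CPU/memory constraints

best_cut, best_assignment = None, None
for bits in product((0, 1), repeat=len(processes)):
    assignment = dict(zip(processes, bits))
    counts = [sum(1 for m in assignment.values() if m == k) for k in (0, 1)]
    if max(counts) > MAX_PER_MACHINE:
        continue                             # violates the per-machine constraint
    cut = sum(w for (p, q), w in traffic.items() if assignment[p] != assignment[q])
    if best_cut is None or cut < best_cut:
        best_cut, best_assignment = cut, assignment

print(best_assignment, "intermachine traffic =", best_cut)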
Heuristic Load Balancing. When the goal of the scheduling algorithm is dynamic, heuristic load balancing, rather than finding related clusters, a different approach is taken. Here the idea is for each processor to estimate its own load continually, for processors to exchange load information, and for process creation and migration to utilize this information.

Various methods of load estimation are possible. One way is just to measure the number of runnable processes on each CPU periodically and take the average of the last n measurements as the load. Another way [Bryant and Finkel 1981] is to estimate the residual running times of all the processes and define the load on a processor as the number of CPU seconds that all its processes will need to finish. The residual time can be estimated most simply by assuming that it is equal to the CPU time already consumed. Bryant and Finkel also discuss other estimation techniques in which both the number of processes and the length of remaining time are important. When round-robin scheduling is used, it is better to be competing against one process that needs 100 seconds than against 100 processes that each need 1 second.

Once each processor has computed its load, a way is needed for each processor to find out how everyone else is doing. One way is for each processor just to broadcast its load periodically. After receiving a broadcast from a lightly loaded machine, a processor should shed some of its load by giving it to the lightly loaded processor. This algorithm has several problems. First, it requires a broadcast facility, which may not be available. Second, it consumes considerable bandwidth for all the "here is my load" messages. Third, there is a great danger that many processors will try to shed load to the same (previously) lightly loaded processor at once.

A different strategy [Barak and Shiloh 1985; Smith] is for each processor periodically to pick another processor (possibly a neighbor, possibly at random) and exchange load information with it. After the exchange, the more heavily loaded processor can send processes to the other one until they are equally loaded. In this model, if 100 processes are suddenly created in an otherwise empty system, after one exchange we will have two machines with 50 processes each, and after two exchanges most probably four machines with 25 processes each. Processes diffuse around the network like a cloud of gas.
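The pairwise-exchange strategy can be sketched as follows (a simplified illustration only; the Processor class and the use of a bare process count as the load metric are assumptions made for the example):

import random

class Processor:
    def __init__(self, name, processes=0):
        self.name = name
        self.processes = processes           # load metric: number of runnable processes

    def exchange_with(self, other):
        """Compare loads with one partner and even them out."""
        total = self.processes + other.processes
        self.processes, other.processes = total - total // 2, total // 2
        # (a real system would create or transfer actual processes here)

machines = [Processor("m%d" % i) for i in range(8)]
machines[0].processes = 100                  # 100 processes created on one machine

for _ in range(3):                           # a few rounds of random pairwise exchanges
    for proc in machines:
        partner = random.choice([m for m in machines if m is not proc])
        proc.exchange_with(partner)
    print([m.processes for m in machines])

After a few rounds the counts even out, illustrating the diffusion effect described above.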

Actually migrating running processes is trivial in theory but close to impossible in practice. The hard part is not moving the code, data, and registers, but moving the environment, such as the current position within all the open files, the current values of any running timers, pointers or file descriptors for communicating with tape drives and other I/O devices, and so on. All of these problems relate to moving variables and data structures related to the process that are scattered about inside the operating system. What is feasible in practice is to use the load information to create new processes on lightly loaded machines, instead of trying to move running processes.

If one has adopted the idea of creating new processes only on lightly loaded machines, another approach, called bidding, is possible [Farber and Larson 1972; Stankovic and Sidhu]. When a process wants some work done, it broadcasts a request for bids, telling what it needs (e.g., a CPU, 512K of memory, floating point, and a tape drive). Other processors can then bid for the work, telling what their workload is, how much memory they have available, and so on. The process making the request then chooses the most suitable machine and creates the process there. If multiple request-for-bid messages are outstanding at the same time, a processor accepting a bid may discover that the workload on the bidding machine is not what it expected, because that processor has bid for and won other work in the meantime.

Distributed Deadlock Detection

Some theoretical work has been done in the area of detection of deadlocks in distributed systems. How applicable this work may be in practice remains to be seen. Two kinds of potential deadlocks are resource deadlocks and communication deadlocks. Resource deadlocks are traditional deadlocks, in which all of some set of processes are blocked waiting for resources held by other blocked processes. For example, if A holds X and B holds Y, and A wants Y and B wants X, a deadlock will result. In principle, this problem is the same in centralized and distributed systems, but it is harder to detect in the latter because there are no centralized tables giving the status of all resources. The problem has mostly been studied in the context of database systems [Gligor and Shattuck 1980; Isloor and Marsland 1978; Menasce and Muntz 1979; Obermarck].

The other kind of deadlock that can occur in a distributed system is a communication deadlock. Suppose A is waiting for a message from B, B is waiting for C, and C is waiting for A. Then we have a deadlock. Chandy et al. [1983] present an algorithm for detecting (but not preventing) communication deadlocks. Very crudely summarized, they assume that each process that is blocked waiting for a message knows which process or processes might send the message. When a process logically blocks, they assume that it does not really block but instead sends a query message to each of the processes that might send it a real (data) message. If one of these processes is blocked, it sends query messages to the processes it is waiting for. If certain messages eventually come back to the original process, it can conclude that a deadlock exists. In effect, the algorithm is looking for a knot in a directed graph.
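The query mechanism can be illustrated with a small sketch. The fragment below follows only the crude summary given above (a query that travels along waits-for edges back to its originator signals a cycle); it is a simplification, not the full published algorithm, and the data structures are invented:

def find_deadlock(waits_for, initiator):
    """waits_for maps each blocked process to the processes that might send it
    the message it is waiting for. Returns True if a query started by the
    initiator can travel along waits-for edges back to the initiator."""
    visited = set()
    frontier = list(waits_for.get(initiator, ()))
    while frontier:
        process = frontier.pop()
        if process == initiator:
            return True                      # the query came back: deadlock
        if process in visited:
            continue
        visited.add(process)
        # a blocked process forwards the query to everyone it is waiting for
        frontier.extend(waits_for.get(process, ()))
    return False

blocked = {"A": ["B"], "B": ["C"], "C": ["A"]}   # A waits for B, B for C, C for A
print(find_deadlock(blocked, "A"))               # True: the cycle is detected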
2.4 Fault Tolerance

Proponents of distributed systems often claim that such systems can be more reliable than centralized systems. Actually,
