5 Program Correctness

5.1. Introduction

For any application, the designer of a distributed system is responsible for certifying the correctness of the system before users start using it. This guarantee is expected to hold only as long as every hardware and software component works according to its specification. This chapter explains which correctness criteria are considered important for distributed systems.

In message passing models, the state of a distributed system consists of the local states of all the processes and the states of the channels connecting these processes. For the locally shared variable models, channel states are irrelevant. The state of a distributed system is also called its configuration. From any state, the execution of each eligible action takes the system to a new state. A computation consists of a sequence of atomic actions that transform a given initial state into a final state. With partial ordering of events and nondeterministic scheduling of actions, such sequences are not always unique: depending on system characteristics and implementation policies, the sequence of actions can vary from one run to another. Yet, from the perspective of a system designer, it is important to certify that the system operates "correctly" for every possible run.

[Fig. 5.1. History of a distributed system. Circles represent states and arcs represent actions causing state transitions.]

Fig. 5.1 represents the history of a computation that begins from the initial state A and ends in the final state L. Each arc corresponds to an atomic action that causes a state transition. Note that in each of the states B and G there are two possible actions: this corresponds to nondeterministic choices made by the scheduler(s). The history Σ can be represented as the set of the following three state sequences: {ABCDEFL, ABGHIFL, ABGJKIFL}. Each state sequence is also called a behavior of the system. If a computation does not terminate, then some of the behaviors can be infinite.

Regardless of what properties are considered to judge correctness, it is important to note that one or more successful "test runs" of the system can never guarantee that the system will behave correctly under all possible circumstances. Such test runs can at best certify correctness for some specific behaviors, but they can never capture all possible behaviors. To paraphrase Dijkstra, "test runs can at best reveal the presence of bugs, but not their absence."

It is tempting to prove correctness by enumerating all possible interleavings of atomic actions, and testing or reasoning about each of these behaviors. However, because of the explosive growth in the number of such behaviors, this approach soon turns out to be impractical, at least for nontrivial distributed systems. For example, with n processes each executing a sequence of m atomic actions, the total number of possible interleavings is

$$\frac{(nm)!}{(m!)^{n}}$$

Therefore, exhaustively testing even a small system can easily exceed the computing capacity of today's fastest computers.

5.2. Correctness Criteria

What properties do we look for when we certify the correctness of a distributed system? The desirable attributes can be broadly classified into two categories: liveness and safety. Most of the useful properties of a system can be classified as either a liveness property or a safety property.
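To get a feel for the interleaving count quoted in Section 5.1, the formula (nm)!/(m!)^n can be evaluated directly. The following is only a minimal sketch: the function names `interleavings` and `count_by_enumeration` are illustrative and not part of the text, and the brute-force counter is included solely as a cross-check for very small n and m.

```python
from math import factorial


def interleavings(n: int, m: int) -> int:
    """Number of interleavings of n processes, each executing m atomic actions.

    Evaluates (n*m)! / (m!)^n; the division is always exact.
    """
    return factorial(n * m) // factorial(m) ** n


def count_by_enumeration(remaining: tuple) -> int:
    """Brute-force cross-check: count schedules by choosing, at each step,
    which process with actions left executes next.

    remaining[i] is the number of actions process i still has to execute.
    Only practical for tiny inputs.
    """
    if all(r == 0 for r in remaining):
        return 1  # exactly one (empty) way to finish
    total = 0
    for i, r in enumerate(remaining):
        if r > 0:
            nxt = remaining[:i] + (r - 1,) + remaining[i + 1:]
            total += count_by_enumeration(nxt)
    return total


if __name__ == "__main__":
    for n, m in [(2, 2), (3, 2), (4, 4), (10, 10)]:
        print(f"n={n}, m={m}: {interleavings(n, m)} interleavings")
```

Even for n = 10 and m = 10 the count is a number with more than ninety digits, which illustrates why exhaustive enumeration is not a realistic verification strategy.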

5.2.1 Safety Properties

A safety property intuitively implies that "bad things never happen." Different systems have different notions of what can be termed a bad thing. Consider the history shown in Fig. 5.1. Let a safety property be specified by the statement: "the value of a certain integer variable temperature should never exceed 100." If this safety property is to hold for a system, then it must hold in every state of the system. Thus, if we find that in state G temperature = 107, then we immediately conclude that the safety property is violated; we need not wait to see what happens to temperature thereafter. To demonstrate that a safety property is violated, it is sufficient to demonstrate that it does not hold during a finite prefix of a behavior.

Safety properties can often be defined using an invariant relationship. What follows are some examples of safety properties in well-known synchronization problems.

Mutual Exclusion

Consider a number of processes trying to periodically enter a critical section. Once a process successfully enters the critical section, it is expected to do some work, exit the critical section, and then try for a reentry later. The program for a typical process is outlined below:

    do true →
        entry protocol;
        critical section;
        exit protocol
    od

Here, the safety requirement is that at most one process can be in the critical section at any time. Accordingly, the safety invariant can be written as $N_{cs} \le 1$, where $N_{cs}$ is the number of processes in the critical section at any time. A bad thing corresponds to a situation in which two or more processes are in the critical section at the same time.
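The observation that a safety violation shows up in a finite prefix of a behavior can be sketched in code. The snippet below is only an illustration under stated assumptions: a behavior is modelled as a list of states, a state of the mutual exclusion problem is modelled as a tuple of booleans (one flag per process, true when that process is in its critical section), and the names `first_violation` and `mutual_exclusion` are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

State = TypeVar("State")


def first_violation(
    behavior: Sequence[State],
    invariant: Callable[[State], bool],
) -> Optional[int]:
    """Return the index of the first state that violates the invariant.

    The prefix behavior[: k + 1] is then a finite witness that the safety
    property does not hold. Returns None if every state satisfies it.
    """
    for k, state in enumerate(behavior):
        if not invariant(state):
            return k
    return None


def mutual_exclusion(in_cs: tuple) -> bool:
    """Safety invariant N_cs <= 1: at most one process in its critical section."""
    return sum(in_cs) <= 1


# Hypothetical behavior of two processes; the third state is the "bad thing".
behavior = [(False, False), (True, False), (True, True), (False, True)]
k = first_violation(behavior, mutual_exclusion)

if __name__ == "__main__":
    print("violating prefix:", behavior[: k + 1] if k is not None else None)
```

Note that a `None` result only certifies the one behavior that was examined; it says nothing about the other behaviors in the history.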

Bounded Capacity Channel

A transmitter process P and a receiver process Q communicate through a channel of bounded capacity B. The usual conditions of this communication are: (i) the transmitter should not send messages when the channel is full, and (ii) the receiver should not try to receive messages when the channel is empty. The following invariant represents a safety property that must be satisfied in every state of the system:

$$0 \le n_p - n_c \le B$$

where $n_p$ = number of items produced by the transmitter process, $n_c$ = number of items consumed by the receiver process, and $B$ = channel capacity. Let $B = 20$. A bad thing happens when $n_p = 45$, $n_c = 25$ (so the channel is already full), and the transmitter produces one more item.

Readers and Writers Problem

A number of reader and writer processes access a file. To get a consistent view of the file, it is important to enforce the criteria that (i) writers get exclusive access to the file, and (ii) readers access the file only when no writer is writing. The required safety property can be expressed by the invariant

$$[(n_w \le 1) \wedge (n_r = 0)] \vee [(n_w = 0) \wedge (n_r \ge 0)]$$

where $n_w$ = number of writer processes updating the file, and $n_r$ = number of reader processes reading the file. Here, a bad thing happens if a writer is granted write access while a reader is reading the file.

Absence of Deadlock

In every distributed system, deadlock is a bad thing. A system is deadlocked when, after a finite sequence of actions, it reaches a state in which all the guards are false, but the system has not reached the final state. Consider a program S whose computation starts from a precondition P and is expected to satisfy the postcondition Q on termination. Let GG be the disjunction of the guards in S. Then the desired safety property can be expressed by the invariant $(Q \vee GG)$.

Partial Correctness

An important type of safety property is partial correctness. Partial correctness of a program asserts that "if the program terminates," then the resulting state is the final state satisfying the desired postcondition. The bad thing here is the possibility of the program terminating with a wrong answer, or entering into a deadlock. Using the example from the previous paragraph, program S is partially correct when $\neg GG \Rightarrow Q$, so the same safety invariant $(Q \vee GG)$ applies to partial correctness also. Partial correctness does not, however, say anything about whether the given program will terminate; that is a different and often deeper issue.

The absence of safety can be established by proving the existence of a bad state that is reachable from the initial state and violates the safety criterion. To prove safety, it is thus necessary to assert that the safety criterion holds in every state that is reachable from the initial state.

5.2.2 Liveness Properties

The essence of a liveness property is that "good things eventually happen." Eventuality is a tricky issue: it simply implies that the event happens after a finite number of actions, but no upper bound on the number of actions is implied in the statement (see footnote 1). Consider the statement: "Every criminal is eventually brought to justice." Suppose that the crime was committed on January 1, 1990, but the criminal is still at large. Can we say that the statement is false? No, since the criminal may be arrested tomorrow. It is thus impossible to prove the falsehood of a liveness property by examining a finite prefix of the behavior. Of course, if the accused person is taken to court today and proven guilty, then the liveness property is trivially proved. But this may be a matter of luck: no one knows how long we have to wait.
Footnote 1: It is often sufficient to guarantee that the event happens with probability 1.

Here are some examples of well-known liveness properties.

Progress

Consider the classical mutual exclusion problem, where a number of processes try to enter a critical section. A desirable feature here is for every process to make progress towards the goal, and eventually enter the critical section. Thus, progress towards the critical section is a liveness property. Progress is violated if there exists at least one infinite behavior in which a process that is trying to enter remains outside its critical section forever. Absence of guaranteed progress is commonly known as livelock or starvation.

Fairness

Fairness is a liveness property, as it determines whether an action is eventually executed by the scheduler. As is customary with progress properties, fairness does not guarantee when, or after how many steps, the action is scheduled.

Reachability

The problem of reachability addresses the following question: given a system with an initial state S0, does there exist a finite behavior that changes the system state to S? If so, then S is said to be reachable from S0. Reachability is a liveness property.

Network protocol designers who believe in testing a protocol rather than proving its correctness often run simulation programs to explore the possible states that the protocol could lead the system into, and check if there is anything "bad" about those states. The goal is to find out if a "bad state" is reachable from an initial state through some sequence of legal actions. Needless to say, they succeed in reaching only a fraction of the set of possible states into which the system can move in real life. Many protocol certifications are based on this type of testing. The testing of reachability through simulation is never foolproof, and takes a heavy toll on system resources, often leading to the so-called state-explosion problem.

Termination

Program termination is a liveness property. It guarantees that starting from the initial state, every feasible behavior leads the system to a state in which all the guards are false and the desired postcondition is satisfied. Note that partial correctness simply ensures that if all the guards are false, then the goal state is reached. It does not tell us anything about whether the terminal state is reachable. The total correctness of a program is the combination of partial correctness and termination.
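The reachability question discussed in this section can be sketched as a graph search over states. The example below is a minimal illustration, not a protocol verifier: the transition relation `EDGES` is an assumption reconstructed from the three behaviors listed for Fig. 5.1 (ABCDEFL, ABGHIFL, ABGJKIFL), and the function names are hypothetical.

```python
from collections import deque

# Assumed transition relation, derived from the three behaviors of Fig. 5.1.
EDGES = {
    "A": ["B"], "B": ["C", "G"], "C": ["D"], "D": ["E"], "E": ["F"],
    "F": ["L"], "G": ["H", "J"], "H": ["I"], "I": ["F"], "J": ["K"],
    "K": ["I"], "L": [],
}


def reachable(start: str, edges: dict) -> set:
    """Breadth-first search: the set of states reachable from `start`."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        for nxt in edges[state]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def behaviors(state: str, final: str, edges: dict) -> list:
    """Enumerate every state sequence from `state` to `final`.

    Assumes the graph is acyclic, as it is for this history; with cycles
    the set of behaviors would be infinite.
    """
    if state == final:
        return [state]
    return [state + rest
            for nxt in edges[state]
            for rest in behaviors(nxt, final, edges)]


if __name__ == "__main__":
    print(sorted(reachable("G", EDGES)))
    print(behaviors("A", "L", EDGES))
```

The `seen` set is what grows without control in realistic systems: every reachable state must be remembered, which is one way the state-explosion problem manifests itself.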

An Example

Consider a system of four processes P0 through P3 as shown in Fig. 5.2. Each process has a variable color represented by an integer from the set {0,1,2,3}. We will represent the color of a process i by the symbol c[i]. The objective is to devise an algorithm so that, regardless of the initial colors of the different processes, eventually no two neighboring processes have the same color.

[Figure 5.2. A system of four processes. Every process wants a color that is different from the colors of its neighboring processes.]

Let N.i denote the set of neighbors of process i. We propose the following program for every process Pi (i ∈ {0,1,2,3}) in the system.

    program colorme {for process i}
    do ∃ j : j ∈ N.i :: (c[i] = c[j]) → c[i] := (c[i] + 2) mod 4 od

Is the program partially correct? By checking the guards alone, it is easy to conclude that if the program terminates, i.e. if all the guards are false, then the following condition holds:

$$\forall\, i, j : j \in N.i :: c[i] \ne c[j] \qquad (1)$$

By definition, this is the desired postcondition. So the system is partially correct.

However, it is easy to see that the program may not terminate. Consider the initial state A represented by the values c[0] = 0, c[1] = 0, c[2] = 2, c[3] = 2. Fig. 5.3 shows at least one possible sequence of actions by which the system returns to the starting state A without ever satisfying the desired postcondition (1). This cyclic behavior demonstrates that it is possible for the program to run forever. Therefore, the program is partially correct, but not totally correct.

Note that it is possible for this program to reach termination if an alternate sequence of actions is chosen by the schedulers. For example, if in state A process P1 makes a move, then the state c[0] = 0, c[1] = 2, c[2] = 2, c[3] = 2 is reached and condition (1) is satisfied. However, termination is not "guaranteed" as long as there exists a single infinite behavior in which the conditions of the goal state are never satisfied.

    State   Action      c[0]  c[1]  c[2]  c[3]
    A       -            0     0     2     2
    B       P0 moves     2     0     2     2
    C       P2 moves     2     0     0     2
    D       P0 moves     0     0     0     2
    E       P1 moves     0     2     0     2
    F       P0 moves     2     2     0     2
    G       P3 moves     2     2     0     0
    H       P0 moves     0     2     0     0
    I       P2 moves     0     2     2     0
    J       P0 moves     2     2     2     0
    K       P1 moves     2     0     2     0
    L       P0 moves     0     0     2     0
    A       P3 moves     0     0     2     2

Fig. 5.3. An infinite behavior for the system in Fig. 5.2.
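The cyclic behavior of Fig. 5.3 can be replayed mechanically. The sketch below makes one assumption that should be stated loudly: since the drawing of Fig. 5.2 is not available here, the neighbor relation is taken to be a star with P0 at the center, which is consistent with every move in Fig. 5.3 and with the claim that the state (0, 2, 2, 2) satisfies condition (1). The names `NEIGHBORS`, `enabled`, `move`, `terminated`, and `replay` are illustrative.

```python
# Assumed topology: a star centered at P0 (inferred from the trace in Fig. 5.3,
# not taken from the original figure).
NEIGHBORS = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}


def enabled(c: tuple, i: int) -> bool:
    """Guard of process i: some neighbor currently has the same color."""
    return any(c[i] == c[j] for j in NEIGHBORS[i])


def move(c: tuple, i: int) -> tuple:
    """Execute the action of process i: c[i] := (c[i] + 2) mod 4."""
    assert enabled(c, i), f"guard of P{i} is false in state {c}"
    nxt = list(c)
    nxt[i] = (c[i] + 2) % 4
    return tuple(nxt)


def terminated(c: tuple) -> bool:
    """All guards false, which is exactly postcondition (1)."""
    return not any(enabled(c, i) for i in NEIGHBORS)


def replay(state: tuple, schedule: list) -> list:
    """Apply the scheduled moves in order and return the visited states."""
    trace = [state]
    for i in schedule:
        state = move(state, i)
        trace.append(state)
    return trace


A = (0, 0, 2, 2)
# Order of moves listed in Fig. 5.3: P0, P2, P0, P1, P0, P3, P0, P2, P0, P1, P0, P3.
SCHEDULE = [0, 2, 0, 1, 0, 3, 0, 2, 0, 1, 0, 3]
trace = replay(A, SCHEDULE)

if __name__ == "__main__":
    for state in trace:
        print(state, "terminated" if terminated(state) else "")
    print("back at A:", trace[-1] == A)
    print("P1 moving first terminates:", terminated(move(A, 1)))
```

Replaying the schedule returns the system to A without any intermediate state satisfying (1), so repeating the same schedule yields an infinite behavior, while the single move by P1 from A reaches a terminal state.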

5.3 Concluding Remarks

This chapter explains what is meant by correctness. It does not describe any method of proving correctness.

Although most useful properties of a distributed system can be classified as either a liveness or a safety property, it is possible to come across properties that do not belong to either of these two classes. Consider the statement, "there is a 90% probability that an earthquake of magnitude greater than 8.8 will hit California before the year 2000." This is neither a liveness nor a safety property.

An implicit assumption made in this chapter is that all well-behaved programs will terminate. This may not always be the case, particularly for open systems. An open system (also called a reactive system) is one that responds to changes in the environment; such systems are particularly useful in real-time applications. A system that assumes the environment to be fixed is called a closed system.

Correctness often depends on assumptions made about the underlying model. Such assumptions include program semantics, the choice of the scheduler, or the grain of atomicity. A given property may hold if we assume strong fairness, but may not hold if we assume weak fairness. Another property may be true only if we choose a "coarse-grain" atomicity, but may cease to hold with "fine-grain" atomicity. However, in general, if a property holds in a weaker model, then it also holds for the stronger models.

In the next chapter, we will discuss various methods of proving the correctness of programs.

5.4 Bibliography

Lamport [L77] was the first to point out the importance of safety and liveness properties in proving the correctness of concurrent programs. Alpern and Schneider [AS85] demonstrated how most of the useful properties related to program correctness can be classified either as a liveness or as a safety property. The book by Francez [F86] contains an extensive discussion of the issue of fairness. Partial correctness proofs are dealt with extensively in [OG76]. The book by David Gries [G81] contains an excellent description of the various methods for proving the correctness of sequential programs.

Exercises

1. Consider the following system of processes.

[Figure: a system of four processes P0, P1, P2, P3.]

Each process Pi has an integer variable c[i] whose values can range from 0 to 3. Now consider the following program:

    do ∃ j : j ∈ N.i :: (c[i] = c[j]) → c[i] := (c[i] + 1) mod 4 od

Enumerate all the behaviors of the above program. Is there an infinite behavior? [Warning: This exercise can be very time consuming.]

2. Classify each statement as a liveness or a safety property:
(a) No object in the universe can travel at a speed larger than the speed of light.
(b) This problem is not difficult - I think it can be solved.
(c) The message will reach my friend within an hour.
(d) The price of every stock will increase.
(e) The Sun rises in the east and sets in the west.
(f) Every person will eventually die.

3. Consider a system of n processes 0, 1, 2, ..., n-1. Each process works in phases. The phase of a process is represented by an integer variable p. Initially, for every process, p = 0. In each phase, a process does some work. A process is allowed to begin phase k+1 when every process has completed its work in phase k.

(i) Write the program for a typical process i. You must convince yourself that the program works, but you need not give any formal correctness proof.
(ii) List the safety properties that you need to prove in order to establish the correctness of your program. For each safety property, specify an invariant.
(iii) State all the liveness properties relevant to this problem.

4. Five processes P0, P1, P2, P3, P4 are trying to acquire unique names from a set S containing five or more names. Each process Pi starts by choosing an arbitrary initial name x[i] from the set S. Assume that each process can read the names of every other process, and let Ri denote the set of residual names not taken by any of the processes. The program for process Pi is as follows:

    do (∃ j ≠ i : x[i] = x[j]) ∧ (∃ b : b ∈ Ri) → x[i] := b od

Argue using program behaviors that the above program will terminate with central schedulers, but may not terminate with distributed schedulers.

5. Consider the following Petri net. Use behaviors to decide if the marking M(a) = 1, M(b) = 0, M(c) = 1, M(d) = 0 is reachable from the given initial marking.

[Figure: Petri net with places a, b, c, d.]

6. Consider the transition model of a distributed system, and assume that the system can be in one of 16 possible states. From each state, there are at most three possible state transitions. Determine the maximum amount of space that may be required to decide whether a state A is reachable from another state B.