C^3: A System for Automating Application-level Checkpointing of MPI Programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University, Ithaca, NY

Abstract. Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In ([2],[3]) we have presented a distributed checkpoint coordination protocol which handles MPI's point-to-point and collective constructs, while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. This thin layer is used by C^3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.

1 Introduction

The problem of implementing software systems that can tolerate hardware failures has been studied extensively by the distributed systems community [6]. In contrast, the parallel computing community has largely ignored this problem because until recently, most parallel computing was done on relatively reliable big-iron machines whose mean-time-between-failures (MTBF) was much longer than the execution time of most programs. However, trends in high-performance computing, such as the popularity of custom-assembled clusters, the increasing complexity of parallel machines, and the dawn of Grid computing, are increasing the probability of hardware failures, making it imperative that parallel programs tolerate such failures.

One solution that has been employed successfully for parallel programs is application-level checkpointing. In this approach, the programmer is responsible for saving computational state periodically, and for restoring this state after failure.

(This work was supported by NSF grants under the ACI and EIA programs.)

In many programs, it is possible to recover the full computational state from relatively small amounts of data saved at key places in the program. For example, in an ab initio protein-folding application, it is sufficient to periodically save the positions and velocities of the bases of the protein; this is a few megabytes of information, in contrast to the hundreds of gigabytes of information that would be saved by a system-level checkpoint. This kind of manual application-level checkpointing is feasible if the parallel program is written in a bulk-synchronous manner, but it is not clear how it can be applied to a general MIMD program without global barriers. Without global synchronization, it is not obvious when the state of each process should be saved so as to obtain a global snapshot of the parallel computation. Protocols such as the Chandy-Lamport protocol [4] have been designed by the distributed systems community to address this problem, but these protocols were designed for system-level checkpointing and cannot be applied to application-level checkpointing, as we explain in Section 4.

In two previous papers ([2],[3]), we have presented non-blocking, coordinated, application-level checkpointing protocols for the point-to-point and collective constructs of MPI. We have implemented these protocols as part of C^3 (the Cornell Checkpoint (pre-)Compiler), a system that uses program transformation technology to automatically insert application-level checkpointing features into an application's source code. Using our system, it is possible to automatically convert an MPI application into an equivalent fault-tolerant version.

The rest of this paper is organized as follows. In Section 2, we present background for and define the problem. In Section 3, we define some terminology and describe our basic approach. In Section 4, we discuss some of the difficulties of adding fault-tolerance to MPI programs. In Sections 5 and 6, we present non-blocking checkpointing protocols for point-to-point and collective communication, respectively. In Section 7, we discuss how our system saves the sequential state of each process. In Section 8, we present performance results of our system. In Section 9, we discuss related work, and in Section 10 we describe future work. In Section 11, we offer some conclusions.

2 Background

To address the problem of fault tolerance, it is necessary to define the fault model. We focus our attention on stopping faults, in which a faulty process hangs and stops responding to the rest of the system, neither sending nor receiving messages. This model captures many failures that occur in practice and is a useful mechanism in addressing more general problems. We make the standard assumption that there is a reliable transport layer for delivering application messages, and we build our solutions on top of that abstraction. One such reliable implementation of the MPI communication library is Los Alamos MPI [7].

We can now state the problem we address in this paper. We are given a long-running MPI program that must run on a machine that has (i) a reliable message delivery system, (ii) unreliable processors which can fail silently at any time, and (iii) a mechanism such as a distributed failure detector [8] for detecting failed processes. How do we ensure that the program makes progress in spite of these faults?

There are two basic approaches to providing fault-tolerance for distributed applications. Message-logging techniques require restarting only the computation performed by the failed process. Surviving processes are not rolled back, but must help the restarted process by replaying messages that were sent to it before it failed. Our experience is that the overhead of saving or regenerating messages tends to be so overwhelming that the technique is not practical for scientific applications. Therefore, we focus on checkpointing techniques, which periodically save a description of the state of a computation to stable storage; if any process fails, all processes are rolled back to a previously saved checkpoint (not necessarily the last), and the computation is restarted from there.

Checkpointing techniques can be classified along two independent dimensions.

(1) The first dimension is the abstraction level at which the state of a process is saved. In system-level checkpointing (e.g., [9], [11]), the raw process state, including the contents of the program counter, registers and memory, is saved on stable storage. Unfortunately, complete system-level checkpointing of parallel machines with thousands of processors can be impractical because each global checkpoint can require saving terabytes of data to stable storage. For this reason, system-level checkpointing is not done on large machines such as the IBM Blue Gene or the ASCI machines. A popular alternative is application-level checkpointing, in which the application is written such that it can correctly restart from various positions in the code by storing certain information to a restart file. The benefit of this technique is that the programmer needs to save only the minimum amount of data necessary to recover the program state. In this paper, we explore the use of compiler technology to automate application-level checkpointing.

(2) The second dimension along which checkpointing techniques can be classified is the technique used to coordinate parallel processes when checkpoints need to be taken. In [2], we argue that the best approach for our problem is to use non-blocking coordinated checkpointing. This means that all of the processes participate in taking each checkpoint, but they do not stop the computation while they do so. A survey of the other approaches to checkpointing can be found in [6].

3 Our Approach

3.1 Terminology

We assume that a distinguished process called the initiator triggers the creation of global checkpoints periodically. We assume that it does not initiate the creation of a new global checkpoint before the previous global checkpoint has been created and committed to stable storage. The execution of an application process can therefore be divided into a succession of epochs, where an epoch is the period between two successive local checkpoints (by convention, the start of the program is assumed to begin the first epoch). Epochs are labeled successively by integers starting at zero, as shown in Figure 1. Application messages can be classified depending upon whether or not they are sent and received in the same epoch.

Fig. 1. Epochs and message classification (execution trace of three processes P, Q and R, showing late, intra-epoch, and early messages relative to the start of the program, Global Checkpoint 1, and Global Checkpoint 2).

Fig. 2. System architecture (at compile time, the precompiler transforms the application source into application source with checkpointing code, which is then compiled by the native compiler; at run time, the co-ordination layer sits between the compiled application and the MPI library on each node).

Definition 1. Given an application message from process A to process B, let e_A be the epoch number of A at the point in the application program execution when the send command is executed, and let e_B be the epoch number of B at the point when the message is delivered to the application.

Late message: If e_A < e_B, the message is said to be a late message.
Intra-epoch message: If e_A = e_B, the message is said to be an intra-epoch message.
Early message: If e_A > e_B, the message is said to be an early message.

Figure 1 shows examples of the three kinds of messages, using the execution trace of three processes named P, Q and R. MPI has several kinds of send and receive commands, so it is important to understand what the message arrows mean in the context of MPI programs. Consider the late message in Figure 1. The source of the arrow represents the point in the execution of the sending process at which control returns from the MPI routine that was invoked to send this message. Note that if this routine is a non-blocking send, the message may not make it to the communication network until much later in execution; nevertheless, what is important for us is that if the application tries to recover from Global Checkpoint 2, it will not reissue the MPI send. Similarly, the destination of the arrow represents the delivery of the message to the application program. In particular, if an MPI_Irecv is used by the receiving process to get the message, the destination of the arrow represents the point at which an MPI_Wait for the message would have returned, not the point where control returns from the MPI_Irecv routine.

In the literature, late messages are sometimes called in-flight messages, and early messages are sometimes called inconsistent messages. This terminology was developed in the context of system-level checkpointing protocols but, in our opinion, it is misleading in the context of application-level checkpointing.
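The classification in Definition 1 amounts to a simple comparison of epoch numbers. The following fragment is a minimal sketch of that test; the type and function names are ours, not part of the C^3 implementation.

/* Illustrative sketch only: classify a received message using the sender's
 * epoch (carried on the message) and the receiver's own epoch, per
 * Definition 1.  All names are hypothetical. */
typedef enum { MSG_LATE, MSG_INTRA_EPOCH, MSG_EARLY } msg_class_t;

static msg_class_t classify_message(int sender_epoch, int receiver_epoch)
{
    if (sender_epoch < receiver_epoch)  return MSG_LATE;        /* e_A < e_B */
    if (sender_epoch == receiver_epoch) return MSG_INTRA_EPOCH; /* e_A = e_B */
    return MSG_EARLY;                                            /* e_A > e_B */
}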

3.2 System Architecture

Figure 2 is an overview of our approach. The C^3 system reads almost unmodified single-threaded C/MPI source files and instruments them to perform application-level state-saving; the only additional requirement is that the programmer insert calls to a function called PotentialCheckpoint at points in the application where the programmer wants checkpointing to occur. The output of this precompiler is compiled with the native compiler on the hardware platform, and is linked with a library that constitutes a co-ordination layer for implementing the non-blocking coordination. This layer sits between the application and the MPI library, and intercepts all calls from the instrumented application program to the MPI library.

Note that MPI can bypass the co-ordination layer to read and write message buffers in the application space directly. Such manipulations, however, are not invisible to the protocol layer. MPI may not begin to access a message buffer until after it has been given specific permission to do so by the application (e.g., via a call to MPI_Irecv). Similarly, once the application has granted such permission to MPI, it should not access that buffer until MPI has informed it that doing so is safe (e.g., with the return of a call to MPI_Wait). The calls to, and returns from, those functions are intercepted by the protocol layer.

This design permits us to implement the coordination protocol without modifying the underlying MPI library, which promotes modularity and eliminates the need for access to MPI library code, which is proprietary on some systems. Further, it allows us to easily migrate from one MPI implementation to another.

4 Difficulties in Application-level Checkpointing of MPI programs

In this section, we briefly describe the difficulties with implementing application-level, coordinated, non-blocking checkpointing for MPI programs.

Delayed state-saving. A fundamental difference between system-level checkpointing and application-level checkpointing is that a system-level checkpoint may be taken at any time during a program's execution, while an application-level checkpoint can only be taken when a program executes a PotentialCheckpoint call. System-level checkpointing protocols, such as the Chandy-Lamport distributed snapshot protocol, exploit this flexibility in checkpoint scheduling to avoid the creation of early messages. This strategy does not work for application-level checkpointing because, after being notified to take a checkpoint, a process might need to communicate with other processes before arriving at a point where it may take a checkpoint.

Handling late and early messages. Suppose that an application is restored to Global Checkpoint 2 in Figure 1. On restart, some processes will expect to receive late messages that were sent prior to failure. Therefore, we need mechanisms for (i) identifying late messages and saving them along with the global checkpoint, and (ii) replaying these messages to the receiving process during recovery. Late messages must be handled by non-blocking system-level checkpointing protocols as well. Similarly, on recovery, some processes will expect to send early messages that were received prior to failure. To handle this, we need mechanisms for (i) identifying early messages, and (ii) ensuring that they are not resent during recovery.
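To make the PotentialCheckpoint requirement concrete, the fragment below sketches where such calls might be placed in a hypothetical time-stepping C/MPI code. The application, helper function, and variable names are illustrative assumptions; only the existence of the PotentialCheckpoint call is taken from the text above.

#include <mpi.h>

void PotentialCheckpoint(void);   /* inserted by the programmer, handled by C^3;
                                     signature assumed for this sketch */
void update_timestep(double *pos, double *vel, int n, MPI_Comm comm); /* hypothetical */

/* Hypothetical time-stepping C/MPI application.  A checkpoint can only be
   taken at the PotentialCheckpoint call, i.e., once per outer iteration,
   where the live state is small (here, the arrays pos and vel). */
void simulate(double *pos, double *vel, int n, int nsteps, MPI_Comm comm)
{
    for (int step = 0; step < nsteps; step++) {
        update_timestep(pos, vel, n, comm);  /* exchange data, compute, update state */
        PotentialCheckpoint();               /* precompiler-instrumented point at which
                                                the live state is saved if a checkpoint
                                                has been requested by the initiator */
    }
}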

Early messages also pose a separate and more subtle problem: if a non-deterministic event occurs between a checkpoint and an early message send, then on restart the event may occur differently and, hence, the message may be different. In general, we must ensure that if a global checkpoint depends on a non-deterministic event, that event will re-occur in exactly the same way after restart. Therefore, mechanisms are needed to (i) record the non-deterministic events that a global checkpoint depends on, so that (ii) these events can be replayed during recovery.

Non-FIFO message delivery at application level. In an MPI application, a process P can use tag matching to receive messages from Q in a different order than they were sent. Therefore, a protocol that works at the application level, as would be the case for application-level checkpointing, cannot assume FIFO communication.

Collective communication. The MPI standard includes collective communication functions such as MPI_Bcast and MPI_Alltoall, which involve the exchange of data among a number of processors. The difficulty presented by such functions occurs when some processes make a collective communication call before taking their checkpoints, and others after. We need to ensure that on restart, the processes that re-execute the calls do not deadlock or receive incorrect information. Furthermore, MPI_Barrier guarantees specific synchronization semantics, which must be preserved on restart.

Problems checkpointing MPI library state. The entire state of the MPI library is not exposed to the application program. Things like the contents of message buffers and request objects are not directly accessible. Our system must be able to reconstruct this hidden state on recovery.

5 Protocol for point-to-point operations

We now sketch the coordination protocol for global checkpointing for point-to-point communication. A complete description of the protocol can be found in [2].

5.1 High-level description of protocol

Initiation. As with other non-blocking coordinated checkpointing protocols, we assume the existence of an initiator that is responsible for deciding when the checkpointing process should begin. In our system, the processor with rank 0 in MPI_COMM_WORLD serves as the initiator, and starts the protocol when a certain amount of time has elapsed since the last checkpoint was taken.

Phase #1. The initiator sends a control message called pleaseCheckpoint to all application processes. After receiving this message, each process can send and receive messages normally.

Phase #2. When an application process reaches its next PotentialCheckpoint location, it takes a local checkpoint using the techniques described in Section 7. It also saves the identities of any early messages on stable storage. It then starts recording (i) every late message it receives, and (ii) the result of every non-deterministic decision it makes. Once a process has received all of its late messages (we assume that every message sent by the application is eventually received), it sends a control message called readyToStopRecording back to the initiator, but continues recording.
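The following sketch summarizes, in code form, what Phase 2 conceptually does when a process reaches a PotentialCheckpoint location. The helper routines and variables are hypothetical names for the mechanisms just described, not the actual C^3 interface.

/* Minimal, illustrative sketch of Phase 2 at a PotentialCheckpoint
 * location; all helper functions and variable names are hypothetical. */
extern int checkpointRequested;      /* set when pleaseCheckpoint arrives   */
extern int epoch, nextMessageID, amRecording;
void save_local_state(void);         /* application state, Section 7        */
void save_early_message_ids(void);   /* early messages to be suppressed     */

void PotentialCheckpoint(void)
{
    if (!checkpointRequested) return;   /* nothing to do outside Phase 2    */

    save_local_state();
    save_early_message_ids();
    epoch++;                            /* a new epoch begins here          */
    nextMessageID = 0;
    amRecording = 1;                    /* start recording late messages and
                                           non-deterministic decisions      */
    checkpointRequested = 0;
    /* Once all late messages from the previous epoch have been received,
       the layer sends readyToStopRecording to the initiator, but keeps
       recording until Phase 4.                                             */
}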

Phase #3. When the initiator gets a readyToStopRecording message from all processes, it sends a control message called stopRecording to all other processes.

Phase #4. An application process stops recording when (i) it receives a stopRecording message from the initiator, or (ii) it receives a message from a process that has stopped its own recording. The second condition is required because we make no assumptions about message delivery order. In particular, it is possible for a recording process to receive a message from a non-recording process before receiving the stopRecording message. In this case, the saved state might depend upon an unrecorded non-deterministic event; the second condition prevents this situation from occurring. Once the process has saved its record on disk, it sends a stoppedRecording message back to the initiator. When the initiator receives a stoppedRecording message from all processes, it commits the checkpoint that was just created as the one to be used for recovery, saves this decision on stable storage, and terminates the protocol.

5.2 Piggybacked information on messages

To implement this protocol, the protocol layer must piggyback a small amount of information on each application message. The receiver of a message uses this piggybacked information to answer the following questions.

1. Is the message a late, intra-epoch, or early message?
2. Has the sending process stopped recording?
3. Which messages should not be resent during recovery?

The piggybacked values on a message are derived from the following values maintained on each process by the protocol layer.

epoch: This integer keeps track of the process epoch. It is initialized to 0 at the start of execution, and incremented whenever the process takes a local checkpoint.

amRecording: This boolean is true when the process is recording, and false otherwise.

nextMessageID: This integer is initialized to 0 at the beginning of each epoch, and is incremented whenever the process sends a message. Piggybacking this value on each application message in an epoch ensures that each message sent by a given process in a particular epoch has a unique ID.

A simple implementation of the protocol can piggyback all three values on each message that is sent by the application. When a message is received, the protocol layer at the receiver examines the piggybacked epoch number and compares it with the epoch number of the receiver to determine whether the message is late, intra-epoch, or early. By looking at the piggybacked boolean, it determines whether the sender is still recording. Finally, if the message is an early message, the receiver adds the pair <sender, messageID> to its suppressList. Each process saves its suppressList to stable storage when it takes its local checkpoint. During recovery, each process passes the relevant portions of its list of message IDs to other processes so that resending of these messages can be suppressed.
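A minimal sketch of this bookkeeping is shown below. The struct layout and function names are illustrative assumptions rather than the actual C^3 wire format, which, as the next paragraph explains, can be compressed considerably.

/* Illustrative sketch of the piggybacked control data and its use.
 * All names and the struct layout are hypothetical. */
typedef struct {
    int epoch;          /* sender's epoch when the send was issued */
    int amRecording;    /* was the sender still recording?         */
    int messageID;      /* unique per sender within an epoch       */
} piggyback_t;

extern int epoch, amRecording, nextMessageID;

/* Sender side: fill in the control data for an outgoing message. */
piggyback_t make_piggyback(void)
{
    piggyback_t pb = { epoch, amRecording, nextMessageID++ };
    return pb;
}

/* Receiver side: classify the message and update the protocol state. */
void on_message_delivered(int sender, piggyback_t pb)
{
    if (pb.epoch < epoch) {
        /* late message: record its contents so it can be replayed to the
           application during recovery */
    } else if (pb.epoch > epoch) {
        /* early message: add <sender, pb.messageID> to the suppressList,
           which is saved with the local checkpoint so the message is not
           resent during recovery */
    }
    if (!pb.amRecording && amRecording) {
        amRecording = 0;   /* Phase 4, condition (ii): the sender has already
                              stopped recording, so stop recording too */
    }
}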

By exploiting properties of the protocol, the size of the piggybacked information can be reduced to two booleans and an integer. By exploiting the semantics of MPI message tags, it is possible to eliminate the integer altogether and piggyback only two boolean values, one representing the epoch and the other amRecording.

5.3 Completing the reception of late messages

Finally, we need a mechanism for allowing an application process in one epoch to determine when it has received all the late messages sent in the previous epoch. The solution we have implemented is straightforward. In every epoch, each process P remembers how many messages it sent to every other process Q (call this value sendCount(P, Q)). Each process Q also remembers how many messages it received from every other process P (call this value receiveCount(Q, P)). When a process P takes its local checkpoint, it sends a mySendCount message to the other processes, which contains the number of messages it sent to them in the previous epoch. When process Q receives this control message, it can compare the value with receiveCount(Q, P) to determine how many more messages to wait for.

Since the value of sendCount(P, Q) is itself sent in a control message, how does Q know how many of these control messages it should wait for? A simple solution is for each process to send its sendCount to every other process in the system. This solution works, but requires quadratic communication. More efficient solutions can be obtained by requiring processes that communicate with one another to explicitly open and close communication channels.

5.4 Guarantees provided by the protocol

It can be shown that this protocol provides the following guarantees, which are useful for reasoning about correctness.

Claim.
1. No process stops recording until all processes have taken their local checkpoints.
2. A process that has stopped recording cannot receive a late message. In Figure 3, this means that a message of the form b1 -> g3 cannot occur.
3. A message sent by a process after it has stopped recording can only be received by a process that has itself stopped recording. In Figure 3, this means that messages of the form b3 -> g2 or b3 -> g1 cannot occur.

Figure 3 shows the possible communication patterns, given these guarantees.

6 Protocol for collective operations

In this section, we build on the mechanisms of the point-to-point protocol in order to implement a protocol for collective communication. A complete description of our protocols can be found in [3].

Fig. 3. Possible patterns of communication (processes P and Q with execution points g1, g2, g3 and b1, b2, b3 relative to the recovery line and the stop-recording line).

There are two basic approaches to handling MPI's collective communication functions. The most obvious is to implement these functions on top of our point-to-point protocol. However, because this approach does not use the low-level network layer directly, it is likely to be less efficient than the collective functions provided by the native MPI library. Instead, what we have chosen to do is to use the basic concepts and mechanisms of our point-to-point protocol to provide fault-tolerant versions of the collective communication functions that are implemented entirely in terms of the native MPI collective communication functions.

We will use MPI_Allreduce to illustrate how collective communication is handled. In Figure 4, collective communication call A shows an MPI_Allreduce call in which processes P and Q execute the call after taking local checkpoints, and process R executes the call before taking its checkpoint. During recovery, processes P and Q will re-execute this collective communication call, but process R will not. Unless something is done, the program will not recover correctly. Our solution is to use the record to save the result of the MPI_Allreduce call at processes P and Q. During recovery, when these processes re-execute the collective communication call, the result is read from the record and returned to the application program. Process R does not re-execute the collective communication call.

To make this intuitive idea precise, we need to specify when the result of a collective communication call like MPI_Allreduce should be recorded. A simple solution is to require a process to record the result of every collective communication call it makes during the time it is recording. Collective communication call B in Figure 4 illustrates a subtle problem with this solution: process R executes the MPI_Allreduce after it has stopped recording, so it would be incorrect for processes P and Q to record the results of their call. This problem is similar to the problem encountered in the point-to-point message case, and the solution is similar (and simpler). Each process piggybacks its amRecording bit on the application data, and the function invoked by MPI_Allreduce computes the conjunction of these bits. If any process involved in the collective communication call has stopped recording, all the other processes will learn this fact, and they will also stop recording. As a result, no process will record the result of the call.
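The wrapper below sketches this handling for MPI_Allreduce under two simplifications: the amRecording bit is exchanged with a separate MPI_Allreduce rather than piggybacked on the application data, and the record/replay helpers are hypothetical names. It is meant only to illustrate the logic described above, not the actual C^3 implementation.

#include <mpi.h>

extern int amRecording;
extern int recovering;                       /* true while restarting          */
void record_result(const void *buf, int count, MPI_Datatype type);  /* hypothetical */
int  replay_result(void *buf, int count, MPI_Datatype type);        /* 1 if a result
                                                                        was recorded */

int CCC_Allreduce(void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype type, MPI_Op op, MPI_Comm comm)
{
    if (recovering && replay_result(recvbuf, count, type))
        return MPI_SUCCESS;                  /* reuse the recorded result; peers
                                                that ran past this call before the
                                                checkpoint do not participate      */

    int allRecording = 0;                    /* conjunction of amRecording bits   */
    MPI_Allreduce(&amRecording, &allRecording, 1, MPI_INT, MPI_LAND, comm);
    if (!allRecording)
        amRecording = 0;                     /* some peer stopped: stop recording */

    int rc = MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm);
    if (amRecording)
        record_result(recvbuf, count, type); /* needed for replay after a restart */
    return rc;
}

In the actual protocol the control bit travels with the application data in a single collective operation; Section 8.2 measures the cost of both the separate and the combined variants.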

Fig. 4. Collective communication (processes P, Q and R executing collective communication calls A and B relative to the recovery line and the stop-recording line).

Most of the other collective communication calls can be handled in this way. Ironically, the only one that requires special treatment is MPI_Barrier, and the reason is that the MPI standard requires that no processor finish a call to MPI_Barrier until every processor has started a call to MPI_Barrier. Suppose that the collective communication call A in Figure 4 is an MPI_Barrier. Upon recovery, processors P and Q will have already finished their calls to MPI_Barrier, while R has not yet started its call. This is a clear violation of the required behavior.

The solution is to ensure that all processes involved in a barrier execute it in the same epoch. In other words, barriers cannot be allowed to cross recovery lines. A simple implementation is the following. All processes involved in the barrier execute an all-reduce communication just before the barrier to determine whether they are all in the same epoch. If not, processes that have not yet taken their local checkpoints do so, ensuring that the barrier is executed by all processes in the same epoch. This solution requires the precompiler to insert the all-reduce communication and the potential checkpointing locations before each barrier. As shown in [3], the overhead of this addition is very small in practice.

7 State Saving

The protocols described in the previous sections assume that there is a mechanism for taking and restoring a local checkpoint on each processor, which we describe in this section.

Application State-Saving. The state of the application running on each node consists of its position in the static text of the program, its position in the dynamic execution of the program, its local and global variables, and its heap-allocated structures. Our precompiler modifies the application source so that this state is correctly saved at the PotentialCheckpoint positions in the original code, and the program can be restarted from them. Our approach is similar to that used in the PORCH system [12]. While it currently saves only somewhat less data than system-level checkpointing, it offers two significant advantages over that approach. First, it is a starting point for optimizing the amount of state that is saved at a checkpoint; in Section 10, we describe ongoing work towards this goal.

Second, it is much simpler and more portable than system-level checkpointing, which very often requires modifying the operating system and the native MPI library.

MPI Library State-Saving. As was already mentioned, our protocol layer intercepts all calls that the application makes to the MPI library. Using this mechanism, our system is able to record the direct state changes that the application makes (e.g., calls to MPI_Attach_buffer). In addition, some MPI functions take or return handles to opaque objects. The protocol layer introduces a level of indirection so that the application only sees handles to objects in the protocol layer (hereafter referred to as pseudo-handles), which contain the actual handles to the MPI opaque objects. On recovery, the protocol layer reinitializes the pseudo-handles in such a way that they are functionally identical to their counterparts in the original process.

8 Performance

In this section, we present an overview of the full experimental results that can be found in [2] and [3]. We performed our experimental evaluation on the CMI cluster at the Cornell Velocity supercomputer. This cluster is composed of 64 2-way Pentium III 1 GHz nodes, featuring 2 GB of RAM and connected by a Giganet switch. The nodes have 40 MB/sec bandwidth to local disk. The point-to-point experiments were conducted on 16 nodes, and the collective experiments were conducted on 32 nodes. On each node, we used only one of the processors.

8.1 Point-to-point

We evaluated the performance of the point-to-point protocol on three codes: a dense Conjugate Gradient code, a Laplace solver, and Neurosys, a neuron simulator. All the checkpoints in our experiments are written to the local disk, with a checkpoint interval of 30 seconds. (We chose such a small interval in order to amplify the overheads for the purposes of measurement; in practice, users would choose checkpoint intervals on the order of hours or days, depending upon the underlying system.)

The performance of our protocol was measured by recording the runtimes of each of four versions of the above codes:

1. the unmodified program;
2. version #1 plus code to piggyback data on messages;
3. version #2 plus the protocol's records and saving the MPI library state;
4. version #3 plus saving the application state.

Experimental results are shown in Figure 5. We observe in the results that the overhead of using our system is small, except in a few instances. In dense CG, the overhead of saving the application state rises dramatically for the largest problem size; this is a result of the large amount of state that must be written to disk. The other extreme is Neurosys, which has a very high communication-to-computation ratio at the small problem size. In this case, the overhead of using the protocol becomes evident; for the larger problems it is less so.

Fig. 5. Point-to-point overheads. (For each problem size of the dense Conjugate Gradient code, the Laplace solver, and Neurosys, the bars show the running times in seconds of the four versions: unmodified program; using the protocol layer with no checkpoints; checkpointing without application state; and full checkpoints. The number above each set of bars is the size of the application state for that problem size.)

Fig. 6. MPI_Allgather, 1-byte protocol block, 32 processes. (The top graph shows absolute running times and the bottom graph absolute overheads, in seconds, of the standard, separate, and combined versions as a function of message size in bytes.)

In our experiments, we initiated a new checkpoint 30 seconds after the last checkpoint was committed. For real applications on real machines, the developer will want to select a checkpoint frequency that carefully balances the overhead against the need to make progress. Since our protocol only incurs overhead during the interval in which a checkpoint is being taken, the developer can arbitrarily reduce the protocol overhead by reducing the frequency at which checkpoints are taken.

8.2 Collective

MPI supports a very large number of collective communication calls. Here, we compared the performance of the native version of MPI_Allgather with the performance of a version modified to utilize our protocol. Those modifications include sending the necessary protocol data (color and logging bits) and performing the protocol logic. There are two natural ways to send the protocol data: either via a separate collective operation that precedes the data operation, or by piggybacking the control data onto the message data and sending both with one operation. We have measured the overhead of both methods. The time for the separate-operation case includes the time to send both messages. For the combined case, it includes the time to copy the control and message data to a contiguous region, to send the combined message, and to separate the message and protocol data on receipt.

The top graph in Figure 6 shows the absolute time taken by the native and protocol (both the separate and combined message) versions of MPI_Allgather for data messages ranging in size from 4 bytes to 4 MB. The bottom graph shows the overhead, in seconds, that the two versions of the protocol add to the communication.
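For illustration, the following sketch shows one way the combined variant could pack a one-byte control field next to each process's data block so that a single MPI_Allgather carries both. The buffer handling is simplified and the function name is ours, not the actual C^3 code.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the "combined" piggybacking variant measured in
 * Fig. 6: one protocol byte is appended to each process's data block. */
int CCC_Allgather_combined(void *sendbuf, int sendbytes,
                           void *recvbuf, unsigned char my_control,
                           unsigned char *controls, MPI_Comm comm)
{
    int nprocs, i;
    MPI_Comm_size(comm, &nprocs);

    int blk = sendbytes + 1;                       /* data + 1 control byte */
    unsigned char *sbuf = malloc(blk);
    unsigned char *rbuf = malloc((size_t)blk * nprocs);

    memcpy(sbuf, sendbuf, sendbytes);              /* pack data and control */
    sbuf[sendbytes] = my_control;

    int rc = MPI_Allgather(sbuf, blk, MPI_BYTE, rbuf, blk, MPI_BYTE, comm);

    for (i = 0; i < nprocs; i++) {                 /* unpack on receipt     */
        memcpy((unsigned char *)recvbuf + (size_t)i * sendbytes,
               rbuf + (size_t)i * blk, sendbytes);
        controls[i] = rbuf[(size_t)i * blk + sendbytes];
    }
    free(sbuf);
    free(rbuf);
    return rc;
}

The separate-operation variant would instead issue a second, small MPI_Allgather carrying only the control bytes, avoiding the copies at the cost of an extra collective call.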

Examining the graphs, we see that for small messages, the relative overhead (as a percentage) might be high, but the absolute overhead is small. For large message sizes, the absolute overhead might be large, but relative to the cost of the native version it is very small. This is the expected behavior. The net effect is that the observed overhead for real applications will be negligible.

9 Existing Work

While much theoretical work has been done in the field of distributed fault-tolerance, few systems have been implemented for actual distributed application environments. One such system is CoCheck [14], which provides fault-tolerance for MPI applications. CoCheck provides only the functionality for the coordination of distributed checkpoints, relying on the Condor [9] system to take system-level checkpoints of each process. In contrast to our approach, CoCheck is integrated with its own MPI implementation, and assumes that collective communications are implemented as point-to-point messages. We believe that our ability to inter-operate with any MPI implementation is a significant advantage.

Another distributed fault-tolerance implementation is the Manetho [5] system, which uses causal message logging to provide for system recovery. Because a Manetho process logs both the data of the messages that it sends and the non-deterministic events that these messages depend on, the size of those logs may grow very large if used with a program that generates a high volume of large messages, as is the case for many scientific programs. While Manetho can bound the size of these logs by occasionally checkpointing process state to disk, programs that perform a large amount of communication would require very frequent checkpointing to avoid running out of log space. Furthermore, since the system requires a process to take a checkpoint whenever these logs get too large, it is not clear how to use this approach in the context of application-level checkpointing. Note that although our protocol, like the Chandy-Lamport protocol, also records message data, recording happens only during checkpointing. Another difference is that Manetho was not designed to work with any standard message-passing API, and thus does not need to deal with the complex constructs, such as non-blocking and collective communication, found in MPI.

The Egida [13] system is another fault-tolerant system for MPI. Like CoCheck, it provides system-level checkpointing, and it has been implemented directly in the MPI layer. Like Manetho, it is primarily based upon message logging, and uses checkpointing to flush the logs when they grow too large.

10 Future Work

10.1 State saving

A goal of our project is to provide a highly efficient checkpointing mechanism for MPI applications. One way to minimize checkpoint overhead is to reduce the amount of data that must be saved when taking a checkpoint. Previous work in the compiler literature has looked at analysis techniques for avoiding the checkpointing of dead and read-only variables [1].

This work focused on statically allocated data structures in FORTRAN programs. We would like to extend this work to handle the dynamically created memory objects in C/MPI applications. We are also studying incremental checkpointing approaches for reducing the amount of saved state.

Another technique we are developing is the detection of distributed redundant data. If multiple nodes each have a copy of the same data structure, only one of the nodes needs to include it in its checkpoint. On restart, the other nodes will obtain their copy from the one that saved it.

Another powerful optimization is to trade off state-saving for recomputation. In many applications, the state of the entire computation at a global checkpoint can be recovered from a small subset of the saved state in that checkpoint. The simplest example of this optimization is provided by a computation in which we need to save two variables x and y. If y is some simple function of x, it is sufficient to save x and recompute the value of y during recovery, thereby trading off the cost of saving variable y against the cost of recomputing it during recovery. Real codes provide many opportunities for applying this optimization. For example, in protein-folding using ab initio methods, it is sufficient to save the positions and velocities of the bases in the protein at the end of a time-step, because the state of the entire computation can be recovered from that data.

10.2 Extending the protocols

In our current work, we are investigating the scalability of the protocol on large high-performance platforms with thousands of processors. We are also extending the protocol to other types of parallel systems. One API of particular interest is OpenMP [10], an API for shared-memory programming. Many high-performance platforms consist of clusters in which each node is a shared-memory symmetric multiprocessor. Application programmers are using a combination of MPI and OpenMP to program such clusters, so we need to extend our protocol for this hybrid model.

On a different note, we plan to investigate the overheads of piggybacking control data on top of application messages. Such piggybacking techniques are very common in distributed protocols, but the overheads associated with piggybacking data can be very complex, as our performance numbers demonstrate. Therefore, we believe that a detailed, cross-platform study of such overheads would be of great use for parallel and distributed protocol designers and implementors.

11 Conclusions

In this paper, we have shown that application-level non-blocking coordinated checkpointing can be used to add fault-tolerance to C/MPI programs. We have argued that existing checkpointing protocols are not adequate for this purpose, and we have developed protocols for both point-to-point [2] and collective [3] operations to meet the need. These protocols can be used to provide fault tolerance for MPI programs without making any demands on, or requiring knowledge of, the underlying MPI implementation. Used in conjunction with the method for automatically saving uniprocessor state described in [2], we have built a system that can be used to add fault-tolerance to C/MPI programs.

We have shown how the state of the underlying MPI library can be reconstructed by the implementation of our protocol. Experimental measurements show that the overhead introduced by the protocol implementation layer and the program transformations is small.

Acknowledgments: This work was inspired by a sabbatical visit by Keshav Pingali to the IBM Blue Gene project. We would like to thank the IBM Corporation for its support, and Marc Snir, Pratap Pattnaik, Manish Gupta, K. Ekanadham, and Jose Moreira for many valuable discussions on fault-tolerance.

References

1. M. Beck, J. S. Plank, and G. Kingsley. Compiler-assisted checkpointing. Technical Report UT-CS, Dept. of Computer Science, University of Tennessee.
2. G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated application-level checkpointing of MPI programs. In Principles and Practices of Parallel Programming, San Diego, CA, June 2003.
3. G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Collective operations in an application-level fault tolerant MPI system. In International Conference on Supercomputing (ICS) 2003, San Francisco, CA, June 2003.
4. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computing Systems, 3(1):63-75, 1985.
5. E. N. Elnozahy and W. Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers, 41(5), May 1992.
6. M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
7. R. Graham, S.-E. Choi, D. Daniel, N. Desai, R. Minnich, C. Rasmussen, D. Risinger, and M. Sukalski. A network-failure-tolerant message-passing system for tera-scale clusters. In Proceedings of the International Conference on Supercomputing 2002.
8. I. Gupta, T. Chandra, and G. Goldszmidt. On scalable and efficient distributed failure detectors. In Proc. 20th Annual ACM Symp. on Principles of Distributed Computing.
9. M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison.
10. OpenMP. Overview of the OpenMP standard. Online at www.openmp.org.
11. J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under UNIX. Technical Report UT-CS, Dept. of Computer Science, University of Tennessee.
12. B. Ramkumar and V. Strumpen. Portable checkpointing for heterogeneous architectures. In Symposium on Fault-Tolerant Computing, pages 58-67.
13. S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15-18, 1999.
14. G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, 1996.


More information

- Behind The Cloud -

- Behind The Cloud - - Behind The Cloud - Infrastructure and Technologies used for Cloud Computing Alexander Huemer, 0025380 Johann Taferl, 0320039 Florian Landolt, 0420673 Seminar aus Informatik, University of Salzburg Overview

More information

1 Organization of Operating Systems

1 Organization of Operating Systems COMP 730 (242) Class Notes Section 10: Organization of Operating Systems 1 Organization of Operating Systems We have studied in detail the organization of Xinu. Naturally, this organization is far from

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

Applications of Passive Message Logging and TCP Stream Reconstruction to Provide Application-Level Fault Tolerance. Sunny Gleason COM S 717

Applications of Passive Message Logging and TCP Stream Reconstruction to Provide Application-Level Fault Tolerance. Sunny Gleason COM S 717 Applications of Passive Message Logging and TCP Stream Reconstruction to Provide Application-Level Fault Tolerance Sunny Gleason COM S 717 December 17, 2001 0.1 Introduction The proliferation of large-scale

More information

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,

More information

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Rouven Kreb 1 and Manuel Loesch 2 1 SAP AG, Walldorf, Germany 2 FZI Research Center for Information

More information

Chapter 3: Operating-System Structures. System Components Operating System Services System Calls System Programs System Structure Virtual Machines

Chapter 3: Operating-System Structures. System Components Operating System Services System Calls System Programs System Structure Virtual Machines Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines Operating System Concepts 3.1 Common System Components

More information

Network Attached Storage. Jinfeng Yang Oct/19/2015

Network Attached Storage. Jinfeng Yang Oct/19/2015 Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM Albert M. K. Cheng, Shaohong Fang Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Virtual Machine Monitors. Dr. Marc E. Fiuczynski Research Scholar Princeton University

Virtual Machine Monitors. Dr. Marc E. Fiuczynski Research Scholar Princeton University Virtual Machine Monitors Dr. Marc E. Fiuczynski Research Scholar Princeton University Introduction Have been around since 1960 s on mainframes used for multitasking Good example VM/370 Have resurfaced

More information

Module 5. Broadcast Communication Networks. Version 2 CSE IIT, Kharagpur

Module 5. Broadcast Communication Networks. Version 2 CSE IIT, Kharagpur Module 5 Broadcast Communication Networks Lesson 1 Network Topology Specific Instructional Objectives At the end of this lesson, the students will be able to: Specify what is meant by network topology

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Highly Available Mobile Services Infrastructure Using Oracle Berkeley DB

Highly Available Mobile Services Infrastructure Using Oracle Berkeley DB Highly Available Mobile Services Infrastructure Using Oracle Berkeley DB Executive Summary Oracle Berkeley DB is used in a wide variety of carrier-grade mobile infrastructure systems. Berkeley DB provides

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

Scheduling and Resource Management in Computational Mini-Grids

Scheduling and Resource Management in Computational Mini-Grids Scheduling and Resource Management in Computational Mini-Grids July 1, 2002 Project Description The concept of grid computing is becoming a more and more important one in the high performance computing

More information

Building an Inexpensive Parallel Computer

Building an Inexpensive Parallel Computer Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

More information

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Quiz for Chapter 6 Storage and Other I/O Topics 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [6 points] Give a concise answer to each

More information

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 )

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 ) Cheap Paxos Leslie Lamport and Mike Massa Appeared in The International Conference on Dependable Systems and Networks (DSN 2004 ) Cheap Paxos Leslie Lamport and Mike Massa Microsoft Abstract Asynchronous

More information

A Survey Study on Monitoring Service for Grid

A Survey Study on Monitoring Service for Grid A Survey Study on Monitoring Service for Grid Erkang You erkyou@indiana.edu ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide

More information

BrightStor ARCserve Backup for Windows

BrightStor ARCserve Backup for Windows BrightStor ARCserve Backup for Windows Tape RAID Option Guide r11.5 D01183-1E This documentation and related computer software program (hereinafter referred to as the "Documentation") is for the end user's

More information

TranScend. Next Level Payment Processing. Product Overview

TranScend. Next Level Payment Processing. Product Overview TranScend Next Level Payment Processing Product Overview Product Functions & Features TranScend is the newest, most powerful, and most flexible electronics payment system from INTRIX Technology, Inc. It

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

More information

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING Hussain Al-Asaad and Alireza Sarvi Department of Electrical & Computer Engineering University of California Davis, CA, U.S.A.

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

A Survey of Cloud Computing Guanfeng Octides

A Survey of Cloud Computing Guanfeng Octides A Survey of Cloud Computing Guanfeng Nov 7, 2010 Abstract The principal service provided by cloud computing is that underlying infrastructure, which often consists of compute resources like storage, processors,

More information

BPM and SOA require robust and scalable information systems

BPM and SOA require robust and scalable information systems BPM and SOA require robust and scalable information systems Smart work in the smart enterprise Authors: Claus Torp Jensen, STSM and Chief Architect for SOA-BPM-EA Technical Strategy Rob High, Jr., IBM

More information

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,

More information

Integrating TAU With Eclipse: A Performance Analysis System in an Integrated Development Environment

Integrating TAU With Eclipse: A Performance Analysis System in an Integrated Development Environment Integrating TAU With Eclipse: A Performance Analysis System in an Integrated Development Environment Wyatt Spear, Allen Malony, Alan Morris, Sameer Shende {wspear, malony, amorris, sameer}@cs.uoregon.edu

More information

It is the thinnest layer in the OSI model. At the time the model was formulated, it was not clear that a session layer was needed.

It is the thinnest layer in the OSI model. At the time the model was formulated, it was not clear that a session layer was needed. Session Layer The session layer resides above the transport layer, and provides value added services to the underlying transport layer services. The session layer (along with the presentation layer) add

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.

More information

TCP Adaptation for MPI on Long-and-Fat Networks

TCP Adaptation for MPI on Long-and-Fat Networks TCP Adaptation for MPI on Long-and-Fat Networks Motohiko Matsuda, Tomohiro Kudoh Yuetsu Kodama, Ryousei Takano Grid Technology Research Center Yutaka Ishikawa The University of Tokyo Outline Background

More information

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed

More information

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available: Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.

More information

MPICH FOR SCI-CONNECTED CLUSTERS

MPICH FOR SCI-CONNECTED CLUSTERS Autumn Meeting 99 of AK Scientific Computing MPICH FOR SCI-CONNECTED CLUSTERS Joachim Worringen AGENDA Introduction, Related Work & Motivation Implementation Performance Work in Progress Summary MESSAGE-PASSING

More information

An approach to grid scheduling by using Condor-G Matchmaking mechanism

An approach to grid scheduling by using Condor-G Matchmaking mechanism An approach to grid scheduling by using Condor-G Matchmaking mechanism E. Imamagic, B. Radic, D. Dobrenic University Computing Centre, University of Zagreb, Croatia {emir.imamagic, branimir.radic, dobrisa.dobrenic}@srce.hr

More information

Name: 1. CS372H: Spring 2009 Final Exam

Name: 1. CS372H: Spring 2009 Final Exam Name: 1 Instructions CS372H: Spring 2009 Final Exam This exam is closed book and notes with one exception: you may bring and refer to a 1-sided 8.5x11- inch piece of paper printed with a 10-point or larger

More information

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3 A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com

More information

Oracle9i Release 2 Database Architecture on Windows. An Oracle Technical White Paper April 2003

Oracle9i Release 2 Database Architecture on Windows. An Oracle Technical White Paper April 2003 Oracle9i Release 2 Database Architecture on Windows An Oracle Technical White Paper April 2003 Oracle9i Release 2 Database Architecture on Windows Executive Overview... 3 Introduction... 3 Oracle9i Release

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 11 Virtualization 2011-2012 Up until now Introduction. Definition of Cloud Computing Grid Computing Content Distribution Networks Map Reduce Cycle-Sharing 1 Process Virtual Machines

More information

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration 1 Harish H G, 2 Dr. R Girisha 1 PG Student, 2 Professor, Department of CSE, PESCE Mandya (An Autonomous Institution under

More information

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 26 Real - Time POSIX. (Contd.) Ok Good morning, so let us get

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

Components for Operating System Design

Components for Operating System Design Components for Operating System Design Alan Messer and Tim Wilkinson SARC, City University, London, UK. Abstract Components are becoming used increasingly in the construction of complex application software.

More information