An Optimistic Parallel Simulation Protocol for Cloud Computing Environments

Asad Waqar Malik (1), Alfred J. Park (2), Richard M. Fujimoto (3)

(1) National University of Science and Technology, Pakistan
(2) IBM T.J. Watson Research Center, Yorktown Heights, USA
(3) School of Computational Science and Engineering, Georgia Institute of Technology, USA

Abstract

Cloud computing offers the ability to provide parallel and distributed simulation services remotely to users through the Internet. Services hosted within the cloud can potentially incur processing delays due to load sharing among other active services, which can cause optimistic simulation protocols to perform poorly. This article discusses problems such as increased rollbacks and memory usage that can lead to degradations in performance for optimistic parallel simulations. The Time Warp Straggler Message Identification Protocol (TW-SMIP) is described, offering one approach to addressing this problem. Experimental evidence is provided to show that this mechanism can significantly reduce the frequency of rollbacks and memory consumption relative to a traditional Time Warp system.

1. Introduction

Cloud computing is a paradigm where software is provided as a service across virtualized computing resources available to clients at remote locations. Cloud computing hides resource availability issues, making this infrastructure appealing to users with varying computational requirements, from storage applications to compute intensive tasks. Large scale parallel simulations often require compute time on high performance computing machines and clusters. Access to such resources may be problematic, as such facilities can have large acquisition costs and on-going management expenses. Cloud computing offers the potential to make parallel simulation much more accessible to a larger portion of the modeling and simulation community by eliminating or reducing such costs and risks. In a companion paper (Fujimoto, Malik et al.
2010) we describe the potential benefits and challenges in executing parallel and distributed simulations in cloud computing environments. Here, we focus on one class of parallel simulations: those using optimistic synchronization mechanisms.

Parallel discrete event simulation (PDES) refers to the execution of a discrete event simulation program across multiple processors. Typically this is done to scale simulations to larger configurations, to increase the detail and fidelity of the model, and/or to reduce execution time (Fujimoto 2000). PDES has been applied to a variety of applications such as modeling large-scale telecommunication networks (Fujimoto, Perumalla et al. 2003), manufacturing (Lendermann, Low et al. 2005), and transportation systems (Perumalla 2006), to mention a few. A PDES program consists of a collection of logical processes (LPs) that communicate by exchanging time stamped messages or events.

A fundamental problem in PDES concerns the synchronization of the parallel simulation program. Each LP must process incoming messages (events) in time stamp order. This is necessary to ensure that events in the simulated future do not affect events in the past. However, if an LP has received an event with, say, time stamp 10, how can it be sure no event will later arrive from another LP with a time stamp smaller than 10? This issue is referred to as the synchronization problem. Time Warp (Jefferson 1985) is a well-known approach to addressing the synchronization problem that uses rollbacks. Each LP is allowed to process whatever events it has received.

SCS M&S Magazine 2010 / n4 (Oct) Malik/Park/Fujimoto

If it
receives a straggler message, i.e., a new event with a time stamp smaller than other events it has already processed, it must undo or roll back the computations for these events and re-execute them in the proper (time stamp) sequence. If this computation sent one or more messages to other LPs, the rollback must unsend these messages; a mechanism called anti-messages is used to cancel these events. Rollback-based mechanisms, more generally referred to as optimistic synchronization, are described in greater detail in (Fujimoto 2000).

Execution of traditional optimistic PDES systems in the presence of external interference from other user computations can lead to an excessive number of rollbacks, as illustrated by the work described in (Carothers and Fujimoto 2000). This is because computations from other users will slow the progress in simulation time of some LPs relative to others that are running on more lightly loaded processors, resulting in more straggler messages and longer rollbacks than would otherwise occur. Additionally, cloud computing environments may exhibit longer communication delays than tightly-coupled high performance computing platforms, further increasing the likelihood of straggler messages. To address these issues, the TW-SMIP protocol dynamically adjusts the execution of each LP based on local parameters and straggler messages. The protocol avoids barrier synchronizations, and instead dynamically limits the forward execution of LPs to reduce the amount of erroneous computation and the generation of incorrect messages.

Time Warp consists of two distinct components: a local control and a global control mechanism. Local control (i.e., state management, rollback recovery and anti-messages) is implemented within each processor, independent of the other processors. The global control mechanism is used to commit operations such as I/O that cannot be rolled back and to reclaim memory resources through computing a Global Virtual Time (GVT) value.
GVT is the minimum simulation time among all unprocessed or partially processed messages and anti-messages in the system.

2. Optimistic Execution in the Cloud

Challenges that arise in executing Time Warp programs under a cloud computing architecture include:

1. Effective utilization of resources
2. Load distribution
3. Efficient execution despite network traffic and communication delays
4. Fault tolerance
5. Process synchronization

The techniques to address the above mentioned challenges must be provided automatically and transparently to application programs within the cloud. These challenges are complicated by the optimistic execution paradigm used in Time Warp. Traditional approaches to PDES, and most work using Time Warp to date, assume a fixed set of dedicated computing resources and typically do not address fault tolerance concerns. These assumptions are too restrictive for cloud environments. We touch upon each of these issues below.

In a cloud computing environment resources are shared among multiple users. The number of users and the nature of the workloads they present can vary over time. New resources may become available during the execution of a long-running Time Warp program as existing jobs complete, or existing resources may become more heavily utilized as new jobs are initiated on behalf of this or other clients. Ideally, the Time Warp program should adapt as these changes occur to make the most effective use of the resources that are made available for its execution. This may entail distributing the execution over additional processors, or reducing the number of processors during the execution of the Time Warp program. This dynamic environment contrasts with the largely static environment typically used for Time Warp, where a set of processors is dedicated to the execution and the Time Warp program is restricted to using only those processors until execution completes.
Unlike traditional parallel computing applications where a poorly balanced system results in idle processors while other processors are overburdened with computation, a poorly
balanced Time Warp program may not result in idle processors. This is because processors without sufficient workload may be optimistically performing computations that are later rolled back. Processor workload must consider rolled back computation in assessing the amount of workload placed on the processor (Carothers and Fujimoto 2000).

Network traffic and communication delays are significant in current implementations of cloud computing infrastructures. Delayed messages may increase the number of straggler messages, i.e., messages that arrive late and result in a rollback. These effects may be alleviated by considering communication delays and the likelihood of increased rollbacks in determining the most appropriate mapping of Time Warp LPs to processors. Further, large communication delays may impact the algorithm used to compute GVT, so they should be considered in implementing the global control mechanism.

The Time Warp program must be able to tolerate failures in the underlying computing infrastructure. It should be able to run to completion despite processor and storage failures or network outages. Redundant execution of LPs must be managed in a way that ensures correct results are obtained without an excessive amount of wasted computation, especially in the context of Time Warp's optimistic style of execution, which may result in replicated versions of LPs executing entirely different computations.

Finally, as mentioned earlier, synchronization is a fundamental problem that must be addressed in order to achieve efficient execution of Time Warp programs in cloud computing environments. Execution of optimistic simulations across cloud computing architectures introduces new problems in addition to the straggler message and rollback issues that arise in traditional Time Warp frameworks. Under a cloud computing architecture, machines may be servicing other jobs and requests concurrently with the optimistic simulation.
This leads to nonuniform and asymmetric processing conditions that can degrade the performance of Time Warp programs. Here, we focus on this synchronization problem and on performance issues when deploying Time Warp programs in cloud computing environments.

While there is little work to date concerning synchronization of Time Warp programs in cloud computing environments, synchronization on conventional parallel and distributed computing platforms is a mature area of research. Several synchronization techniques have been developed to employ optimism control, limiting the forward execution of LPs to improve performance; a number of these techniques are discussed in (Fujimoto 2000). The mechanism described in (Madisetti and Hardaker 1992) is perhaps most closely related to the approach described here: special synchronization messages are used to minimize the cascading rollback effect. However, these mechanisms do not address concerns particular to cloud environments, especially the need to execute over nondedicated computing resources.

3. A Cloud Architecture for Time Warp

A cloud computing infrastructure offers numerous benefits such as reconfigurable dynamic resources, while unifying and simplifying access to resources without burdening end-users with the costs and complexities associated with acquiring and managing the underlying hardware and software layers. This computing paradigm is particularly appealing for parallel and distributed simulations, as virtualized resources can be configured to meet the demands of the simulation, which may vary widely from processor-bound executions to those that are memory-bound. Traditional monolithic PDES simulators designed for static, tightly-coupled cluster systems would not be well suited for a cloud computing environment. As resources are virtualized in a cloud environment, direct and full control of the underlying physical resources is not feasible.
Unpredictability in processing and additional delays can adversely affect the performance of a Time Warp system, which is sensitive to an execution environment that is not fully dedicated to the simulation.
Figure 1. TW-SMIP architecture stack (modules: user interface, initialize event module, simulation processing module, storage module (state saving), event distribution module, network module)

A Time Warp system that is integrated into the cloud computing platform as a service, and that is aware of the environment, can compensate for certain disadvantages of the infrastructure such as load sharing of physical resources between unrelated processes. Efficient execution of optimistic PDES application codes on cloud computing will require a new software infrastructure and algorithms that are aware of the underlying cloud infrastructure. The TW-SMIP architecture is shown in Figure 1. The communication module is responsible for handling communications among LPs, and is implemented over the underlying network module that provides interprocessor communications. The current implementation uses MPI as the network module. The event distribution and simulation processing modules are the main components responsible for implementing event management. These include the logic for processing events, handling rollbacks, and sending and receiving anti-messages. The storage module implements state saving functions. The event initialization module provides mechanisms to begin the execution by providing input events to LPs based on simulation parameters. The user interface defines an application program interface to LPs.

In order to reduce communication overheads, especially in cloud environments where such overheads can be significant, it is often advantageous to map multiple LPs to an individual processor (or virtual machine). An algorithm is required to map LPs to processors. This algorithm must balance communication overheads against achieving effective load distribution, while maintaining sufficient concurrency relative to the number of processors that may be allocated to the execution.
Though this is an important issue, it is beyond the scope of the work presented here, so it is not addressed further. The mapping of LPs to processors used in the experiments described later was derived manually.

4. The TW-SMIP Protocol

TW-SMIP is an optimistic synchronization protocol intended to address concerns about interference and communication delays. Periodic status messages, termed heartbeat (HB) messages, are distributed to LPs residing on a processor to provide information concerning LPs residing on other processors that may send messages. These HB messages are superimposed over a standard Time Warp mechanism. HB messages include information concerning sent messages for straggler detection. They are given
higher priority than other messages, and are not subject to message bundling, in order to minimize their latency. The TW-SMIP protocol is based on straggler message identification to avoid frequent rollbacks due to the asymmetric and uneven processing loads that can be expected to arise. TW-SMIP performs boundary-based synchronization of LPs running on distributed nodes in the cloud architecture. Here we assume the use of TCP/IP, ensuring the reliable delivery of messages, and that multiple LPs can be mapped to a single processor. The requirement of reliable message delivery is necessary in Time Warp to ensure repeatability and to guarantee that the parallel execution produces exactly the same results as a sequential execution of the simulation. This is not a severe requirement for clouds implemented on localized computing clusters, or for geographically distributed implementations where the volume of communications does not necessitate the use of best-effort communications.

TW-SMIP is designed to reduce communication overhead and limit rolled back computation. Generated HB messages are only sent to processors with which communication has occurred since the last computed GVT value. This approach can significantly reduce the number of HB messages generated during the simulation if not all processors directly communicate with each other. Because null HB messages are not used, the TW-SMIP protocol is useful for large distributed simulations where the system is prone to network congestion, no specialized broadcast capabilities are available, and/or the simulations span multiple LANs. The principle used in this approach is to send HB messages only where communication has occurred. After a fixed wall clock time each processor enters the HB phase and generates HB messages for the processors with which it communicates. The LPs continue processing future events in addition to receiving HB messages.
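The HB-phase bookkeeping described above can be sketched as follows. This is an illustrative sketch, not the TW-SMIP implementation: the class and method names, and the use of a wall-clock timer and a per-GVT partner set, are assumptions made for exposition.

```python
import time

class Processor:
    """Sketch of the HB phase: after a fixed wall-clock interval, a
    processor sends HB messages only to the processors it has
    communicated with since the last computed GVT value."""

    def __init__(self, pid, hb_period_s=0.05):
        self.pid = pid
        self.hb_period_s = hb_period_s      # HB period, in wall-clock seconds
        self.last_hb = time.monotonic()
        self.partners_since_gvt = set()     # processors messaged since last GVT

    def record_send(self, dest_pid, timestamp, mid):
        # Remember that we communicated with dest_pid; the (timestamp, mid)
        # pair is what a later HB message would carry for straggler detection.
        self.partners_since_gvt.add(dest_pid)

    def on_gvt_computed(self):
        # A newly computed GVT value resets the set of HB targets.
        self.partners_since_gvt.clear()

    def maybe_send_heartbeats(self, send):
        # Called from the main event loop; enters the HB phase once the
        # wall-clock period has elapsed, and only messages recent partners.
        now = time.monotonic()
        if now - self.last_hb >= self.hb_period_s:
            for dest in self.partners_since_gvt:
                send(dest, {"type": "HB", "src": self.pid})
            self.last_hb = now
```

Because the partner set is cleared at each GVT computation, a processor that stops communicating with another naturally stops sending it heartbeats, which is what avoids the need for null HB messages.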
An LP stops processing events when it discovers a straggler message through an HB message; it must roll back to the timestamp of the straggler message. At the same time, it stops other LPs from processing false computation by generating anti-messages. The HB message based scheme is used to perform boundary-based synchronization for those LPs that have straggler messages; it does not limit other LPs from processing future events.

TW-SMIP executes in a manner similar to a traditional Time Warp program, with the addition of the HB messages. Specifically, after a fixed interval of time, each processor enters a straggler message identification phase and sends HB messages to all other processors with which it communicates. HB messages constrain the execution of LPs that may be too far in the future. LPs can generate HB messages independent of other LPs in a simulation.

HB messages consist of two array fields: timestamp (TS) and message identification number (MID). Arrays are used to hold information concerning multiple messages. For example, a source LP fills these two fields with the timestamps and message identification numbers of the messages it has generated for another LP. During the simulation, each LP saves the TS and MID values of each message it has sent to or received from other LPs. Upon receiving a HB message, each LP compares the received information with the locally saved information. Thus, each LP has two lists: one maintained by the LP that logs messages as they arrive, and a second list that is created upon receipt of a HB message. If the lists are not identical then one or more straggler messages exist in the system. This immediately interrupts event processing and the LP rolls back the simulation to the point where a straggler message is expected.
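The list comparison above amounts to checking the (TS, MID) arrays carried by a heartbeat against the MIDs the LP has actually received from the sender. A minimal sketch, with function names and return values chosen for illustration rather than taken from the TW-SMIP implementation:

```python
def find_earliest_pending(hb_ts, hb_mid, received_mids):
    """Compare the TS and MID arrays of a heartbeat message against the
    set of MIDs this LP has received from the sender. Any MID present in
    the HB arrays but missing locally is an in-flight message; return the
    earliest such timestamp, or None if the lists agree."""
    missing = [(ts, mid) for ts, mid in zip(hb_ts, hb_mid)
               if mid not in received_mids]
    if not missing:
        return None
    return min(ts for ts, _ in missing)

def react_to_heartbeat(lp_local_time, hb_ts, hb_mid, received_mids):
    """Decide the LP's reaction to a heartbeat per the protocol above."""
    earliest = find_earliest_pending(hb_ts, hb_mid, received_mids)
    if earliest is None:
        return "ignore"                   # delayed HB: all messages arrived
    if earliest < lp_local_time:
        return ("rollback", earliest)     # true straggler: roll back
    return ("pause_at", earliest)         # future message: run up to it, pause
```

The "pause_at" branch corresponds to the case discussed next, where the identified message carries a timestamp the LP has not yet reached.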
If the timestamp of the straggler message is greater than the local time of the destination node, then the LP keeps processing events until the time of the straggler message is reached, and then pauses. There is a possibility that an HB message may be delayed. Under these circumstances, the receiving LP simply ignores the HB message if all the messages identified in the HB message have been received. Additionally, the corresponding entries in the receive list are removed.

Our proposed protocol utilizes this heartbeat mechanism to control optimistic execution across the cloud computing infrastructure. Specifically, straggler messages are used to
reduce rollbacks that would otherwise result in wasted computation. This reduces memory consumption and paces LPs more uniformly across the simulation in the presence of asymmetric processing loads. Further details of the protocol are described in (Malik, Park et al. 2009).

5. Performance Study

The following empirical study examines the behavior of the TW-SMIP protocol under different asymmetric and symmetric processor loads. A comparison of our proposed protocol with traditional Time Warp mechanisms is provided to quantify the improvement of a straggler message identification mechanism and validate its utility. The following experiments were performed on dual core 3.2GHz Intel Xeon processors with 6GB of memory per node. GNU/RedHat Linux running a 64-bit 2.6.9 kernel was installed on each machine. Nodes were interconnected via Fast Ethernet. Twelve of these nodes were used in the following tests.

In order to analyze the performance of the TW-SMIP implementation, the benchmark model described in (Madisetti, Hardaker et al. 1993) was used. This benchmark was designed to capture the computational characteristics of simulations such as those used to model load sharing in electrical power grids. In this simulation program, each LP acts as a source and generates two types of messages: self and propagating. Self-messages are those sent by a source LP to itself with a defined timestamp increment. Propagating messages are sent to another LP in the network. Messages are generated with a timestamp of T = LocalTime + L, where L is the lookahead. Applications such as electrical power grid simulations exhibit such behavior, where load sharing requests are distributed among power stations; if a request cannot be processed locally, it is propagated to a neighboring node in the graph. As discussed in (Madisetti, Hardaker et al. 1993), other network applications such as air traffic simulations exhibit similar behaviors.
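The benchmark's event generation rule can be sketched in a few lines. This is an illustrative sketch of the model described above, not the benchmark code itself; the function name and the propagation-probability parameter are assumptions (the experiments below use a probability of 0.5).

```python
import random

def generate_message(lp_id, local_time, neighbors, lookahead,
                     p_propagate=0.5, rng=random):
    """Each LP acts as a source: with probability p_propagate it sends a
    propagating message to a random neighbor, otherwise a self-message.
    Either way the new timestamp is T = LocalTime + L (L = lookahead)."""
    timestamp = local_time + lookahead
    if rng.random() < p_propagate:
        dest = rng.choice(neighbors)   # propagating message
    else:
        dest = lp_id                   # self-message
    return dest, timestamp
```

Under this rule a run of self-messages lets an LP race ahead in simulation time, while a propagating message from a slower neighbor later arrives in its past, which is exactly the straggler behavior the benchmark is meant to stress.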
This application provides a challenging test case for the TW-SMIP protocol because the self-messages can result in some LPs advancing far ahead of others, only to be rolled back by subsequent straggler messages. For the experiments described here, upon processing an event an LP sends a message to another LP with probability 0.5; otherwise it sends a message to itself. If the message is sent to another LP, it randomly selects a neighboring LP based on a predefined network topology. The synthetic topology used in these experiments is a two-dimensional grid where each node has N, S, E, W, NE, NW, SE, and SW neighbors. Here, 1000 LPs are mapped to a single processor.

To analyze TW-SMIP under asymmetric conditions, a series of experiments were performed with varying HB period and workload. The HB period is defined as the time between successive HB messages generated by each LP. Asymmetric test conditions are achieved by varying the workload across the pool of machines used to gather data. Processors may be lightly loaded or heavily loaded. The background jobs are generated using a tool called Stress, a workload generator for POSIX systems (Stress Library) that allows for a configurable amount of CPU, memory, I/O, and disk stress on the system. Scenarios termed lightly loaded denote a load of two CPU-bound processes, one I/O-bound process, and one memory allocator process. Heavily loaded scenarios contain four CPU-bound processes, two I/O-bound processes, and one memory allocator process. The background workload is generated on each node.

Figure 2 shows representative results. It indicates the number of rolled back events as well as the total number of events that were processed for the lightly loaded scenario under different HB periods. The data point at infinite HB period indicates the number of rolled back events when no HB messages are used; this corresponds to the performance of a Time Warp system without TW-SMIP.
The number of committed events for the different runs remained constant at approximately 2.5 million events. The event rate indicates the number of committed events per unit time, and varies between 0.5 and 1.0 million events per second.
Figure 2. TW-SMIP execution scenario

Figure 3. Efficiency comparison of test cases

The TW-SMIP protocol significantly reduces the number of rolled back events compared to a conventional Time Warp system, as shown in Figure 3. The number of rolled back events increases if the HB period is set to either too high or too low a value. When the HB period is too large, the protocol is not effective in limiting the optimistic execution of LPs, resulting in an increased number of rolled back events. As expected, the traditional Time Warp system with no HB messages (or a HB period of infinity) yields a large number of rolled back events. When the HB messages are too frequent, however, the processing of the HB messages themselves becomes a bottleneck that amplifies any imbalances in the parallel simulation execution, again resulting in a large number of rolled back events. The fact that the number of rolled back events remains low over a relatively broad range, from 4 to 100 milliseconds in this test case, suggests that it may not be necessary to fine tune the HB period in order to achieve the benefits of the TW-SMIP protocol.

The efficiency of the simulation runs for three different scenarios with different background workloads is shown in Figure 3. Efficiency is defined as the number of committed events divided by the total number of events that are processed, and gives an indication of the fraction of time the system is processing events that are eventually committed. These data verify that TW-SMIP offers the greatest benefit at moderate HB periods, ranging from a few to 100 milliseconds. Not surprisingly, the nonuniformly distributed loads yield more rollbacks and reduced efficiency.
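The efficiency metric just defined is a simple ratio; a one-function sketch (the function name and zero-total convention are ours, not from the paper):

```python
def efficiency(committed_events, rolled_back_events):
    """Efficiency as defined above: committed events divided by the total
    number of events processed (committed plus rolled back). Returns 0.0
    for an empty run to avoid division by zero."""
    total = committed_events + rolled_back_events
    return committed_events / total if total else 0.0
```

An efficiency near 1.0 means almost all optimistic work was eventually committed; the gap from 1.0 is the fraction of processing wasted on rolled back computation.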
Figure 4. TW-SMIP efficiency vs. traditional Time Warp synchronization

A comparison between TW-SMIP and a traditional Time Warp synchronization mechanism is shown in Figure 4. A mix of uniform and non-uniform lightly and heavily loaded conditions was used for this study. Specifically, for each test case the HB period that yielded the best performance in prior tests was used: HB periods of 0.003, 0.07 and 0.004 were used for the lightly loaded uniform scenario, lightly loaded non-uniform scenario, and heavily loaded non-uniform scenario, respectively. In the lightly loaded uniform test case, the TW-SMIP approach provides a significantly improved efficiency over the traditional Time Warp approach. Under the non-uniform test cases, the observed data also show significant performance improvements in both the lightly loaded and heavily loaded scenarios. For example, TW-SMIP exhibits an efficiency of 91% in the lightly loaded uniform test case, compared to the traditional Time Warp implementation that yields 76% efficiency.

An analysis of TW-SMIP was also performed for open queuing simulations on 12 dual core nodes. These test scenarios were performed under different background workloads, and the results were qualitatively similar to those of the prior experiments. They demonstrated that the event rate decreases as HB messages become less frequent, due to an increased number of rollbacks. Heavily loaded systems with less frequent rollbacks made better use of the resources. Less frequent HB messages failed to reduce the number of rollbacks, as expected, and produced additional overhead for heavily loaded systems. This series of experiments demonstrates that the TW-SMIP protocol produces better utilization of resources than the traditional Time Warp implementation under a variety of external workloads.
A traditional Time Warp system generates frequent rollbacks in a resource sharing environment such as that occurring in a cloud infrastructure; TW-SMIP overcomes this problem by using heartbeat messages as a mechanism to detect straggler messages.

6. Concluding Remarks

Cloud computing offers the promise of providing an execution platform without exposing complicated details of PDES execution to users. However, it is well known that optimistic PDES programs under traditional Time Warp frameworks can perform poorly where resources are shared amongst many users, leading to slower execution times due to asymmetric and uneven processing. Thus, in an environment such as a cloud, running optimistic simulations without any optimism control can lead to lower efficiency in execution. The TW-SMIP protocol is a first step in addressing the asymmetric background loads that are inherent in cloud computing environments. This protocol defines dynamic synchronization points for individual LPs based on straggler messages. Handling these straggler messages can improve the efficiency of the system, which in turn leads to improved utilization of resources by lessening the amount of rolled back computation.
Much additional research is required before the potential of cloud computing for optimistic parallel simulations can be fully realized. Perhaps foremost, experience in executing optimistic parallel simulations in contemporary cloud environments is lacking. Development frameworks and tools are needed to enable implementation of parallel simulation application codes on cloud computing architectures.

Acknowledgement

Funding for this research was provided in part by NSF Grant ATM-0326431.

References

Carothers, C. D. and R. M. Fujimoto (2000). "Efficient Execution of Time Warp Programs on Heterogeneous, NOW Platforms." IEEE Trans. Parallel Distrib. Syst. 11(3): 299-317.

Fujimoto, R. (2000). Parallel and Distributed Simulation Systems, Wiley Interscience.

Fujimoto, R. M., A. W. Malik, et al. (2010). "Parallel and Distributed Simulation in the Cloud." Simulation Magazine, Society for Modeling and Simulation, Intl., 1(3).

Fujimoto, R. M., K. S. Perumalla, et al. (2003). Large-Scale Network Simulation -- How Big? How Fast? Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

Jefferson, D. (1985). "Virtual Time." ACM Transactions on Programming Languages and Systems 7(3): 404-425.

Lendermann, P., M. Y. H. Low, et al. (2005). An Integrated and Adaptive Decision-Support Framework for High-Tech Manufacturing and Service Networks. Proceedings of the 2005 Winter Simulation Conference.

Madisetti, V. and D. A. Hardaker (1992). "Synchronization Mechanisms for Distributed Event-Driven Computation." ACM Transactions on Modeling and Computer Simulation 2: 12-51.

Madisetti, V. K., D. A. Hardaker, et al. (1993). "The MIMDIX Operating System for Parallel Simulation and Supercomputing." Journal of Parallel and Distributed Computing 18(4): 473-483.

Malik, A., A. Park, et al. (2009). Optimistic Synchronization of Parallel Simulations in Cloud Computing Environments. IEEE International Conference on Cloud Computing.

Perumalla, K. S. (2006).
A Systems Approach to Scalable Transportation Network Modeling. Winter Simulation Conference, Monterey, CA, IEEE.

Stress Library. http://weather.ou.edu/~apw/projects/stress.

Asad Waqar Malik is a PhD candidate at National University of Science and Technology (NUST), Pakistan. He received his MS in Software Engineering and Bachelor of Computer Science degrees from NUST and Hamdard University, respectively. He has been working in the distributed simulation field since 2005. He also worked as an international scholar at Georgia Institute of Technology. He has five international conference publications. His research interests include real-time decision support systems, distributed simulation, and C4I systems.

Alfred Park is a postdoctoral research scientist at IBM T.J. Watson Research Center at Yorktown Heights, New York. He received his BS, MS and PhD in Computer Science from the Georgia Institute of Technology in 2002, 2004 and 2009, respectively. His interests are in large scale stream processing systems, high performance computing, metacomputing, and parallel and distributed simulation.

Richard Fujimoto is a Regents Professor and Chair of the School of Computational Science and Engineering at the Georgia Institute of Technology. He received his M.S. and Ph.D. degrees from the University of California (Berkeley) in 1980 and 1983, respectively. He has published over 200 articles on parallel and distributed simulation. Among his past activities he led the definition of the time management services for the DoD High Level Architecture (HLA).