On Programming Models for Service-Level High Availability
C. Engelmann 1,2, S. L. Scott 1, C. Leangsuksun 3, X. He 4

1 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
2 Department of Computer Science, The University of Reading, Reading, RG6 6AH, UK
3 Computer Science Department, Louisiana Tech University, Ruston, LA 71272, USA
4 Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper provides an overview of existing programming models for service-level high availability and investigates their differences, similarities, advantages, and disadvantages. Its goal is to help improve reuse of code and to allow adaptation to quality of service requirements. It further aims at encouraging a discussion about these programming models and their provided quality of service, such as availability, performance, serviceability, usability, and applicability. Within this context, the presented research focuses on providing high availability for services running on head and service nodes of high-performance computing systems. The proposed conceptual service model and the discussed service-level high availability programming models are applicable to many parallel and distributed computing scenarios, as a networked system's availability can invariably be improved by increasing the availability of its most critical services.

This research was partially sponsored by the Mathematical, Information, and Computational Sciences Division; Office of Advanced Scientific Computing Research; U.S. Department of Energy. The work was performed in part at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725. It was also performed in part at Louisiana Tech University under U.S. Department of Energy Grant No. DE-FG02-05ER. The research performed at Tennessee Tech University was partially supported by the U.S. National Science Foundation under Grant No. CNS and by the Research Office of Tennessee Tech University under a Faculty Research Grant. The work at Oak Ridge National Laboratory and Tennessee Tech University was also partially sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory.

1. Introduction

Today's scientific research community relies on high-performance computing (HPC) to understand the complex nature of open research questions and to drive the race for scientific discovery in various disciplines, such as nuclear astrophysics, climate dynamics, fusion energy, nanotechnology, and human genomics. Scientific HPC applications, like the Terascale Supernova Initiative [29] or the Community Climate System Model (CCSM) [3], allow researchers to model and simulate complex experimental scenarios without the need or the capability to perform real-world experiments. HPC systems in use today can be divided into two distinct use cases, capacity and capability computing. In capacity computing, scientific HPC applications run on the order of a few seconds to several hours, typically on small-to-mid scale systems. The main objective for operating capacity HPC systems is to satisfy a large user base by guaranteeing a certain computational job submission-to-start response time and overall computational job throughput.
In contrast, capability computing focuses on a small user base running scientific HPC applications on the order of a few hours to several months on the largest scale systems available. Operating requirements for capability HPC systems include guaranteeing full access to the entire machine, one user at a time, for a significant amount of time. In both use cases, demand for continuous availability of HPC systems has risen dramatically, as capacity system users experience HPC as an on-demand service with certain quality of service guarantees, and capability users require exclusive access to large HPC systems over a significant amount of time with certain reliability guarantees.
HPC systems must be able to run in the event of failures in such a manner that their capacity or capability is not severely degraded. However, both variants partially require different approaches for providing quality of service guarantees, as use case scenario, system scale, and financial cost determine the feasibility of individual solutions. For example, a capacity computing system may be equipped with additional spare compute nodes to tolerate a certain predictable percentage of compute node failures [30]. The spare compute nodes are able to quickly replace failed compute nodes using a checkpoint/restart system, like the Berkeley Lab Checkpoint/Restart (BLCR) [1] solution, for the recovery of currently running scientific HPC applications. In contrast, the notion of capability computing implies the use of all available compute nodes. Using spare compute nodes in a capability HPC system just degrades the capability upfront, even in the failure-free case.

Nevertheless, capacity and capability computing systems also have similarities in their overall system architectures, such that some solutions apply to both use cases. For example, a single head node typically manages an HPC system, while optional service nodes may offload certain head node services to provide better performance and scalability. Furthermore, since the head node is the gateway for users to submit scientific HPC applications as computational jobs, multiple gateway nodes may exist as virtual head nodes to provide dedicated user group access points and to apply certain user group access policies. A generic architectural abstraction for HPC systems can be defined, such that a group of closely coupled compute nodes is connected to a set of service nodes. The head node of an HPC system is a special service node. The system of users, service nodes, and compute nodes is interdependent, i.e., users depend on service nodes and compute nodes, and compute nodes depend on service nodes. Providing high availability for service nodes invariably increases the overall system availability, especially when removing single points of failure and control from the system.

Within this context, our research focuses on providing high availability for service nodes using the service model and service-level replication as a base concept. The overarching goal of our long-term research effort is to provide high-level reliability, availability, and serviceability (RAS) for HPC by combining advanced high availability (HA) with HPC technology, thus allowing scientific HPC applications to better take advantage of the provided capacity or capability. The goal of this paper is to provide an overview of programming models for service-level high availability in order to investigate their differences, similarities, advantages, and disadvantages. Furthermore, this paper aims at encouraging a discussion about these programming models and their provided quality of service, such as availability, performance, serviceability, usability, and applicability. While our focus is on capability HPC systems in particular, like the 100 Tflop/s Cray XT4 system [35] at Oak Ridge National Laboratory [23], the presented service model and service-level high availability programming models are applicable to many parallel and distributed computing scenarios, as a networked system's availability can invariably be improved by increasing the availability of its most critical services.
This paper is structured as follows. First, we discuss previous work on providing high availability for services in HPC systems, followed by a brief overview of previously identified research and development challenges in service-level high availability for HPC systems. We continue with a definition of the conceptual service model and a description of various service-level high availability programming models and their properties. This paper concludes with a discussion of the presented work and future research and development directions in this field.

2. Previous Work

While there is a substantial amount of previous work on fault tolerance and high availability for single services as well as for services in networked systems, only a few solutions for providing high availability for services in HPC systems existed when we started our research a few years ago. Even today, high availability solutions for HPC system services are rare. The PBS Pro job and resource management system for the Cray XT3 [24] supports high availability using hot standby redundancy, involving Cray's proprietary interconnect for replication and transparent fail-over. Service state is replicated to the standby node, which takes over based on the current state without losing control of the system. This solution has a mean time to recover (MTTR) of practically 0. However, it is only available for the Cray XT3, and its availability is limited by the deployment of two redundant nodes. The Simple Linux Utility for Resource Management (SLURM) [26, 36] job and resource manager, as well as the metadata servers of the Parallel Virtual File System (PVFS) 2 [25] and of the Lustre [19] cluster file system, employ an active/standby solution using a shared storage device, which is a common technique for providing service-level high availability using a heartbeat mechanism [14].
An extension of this technique uses a crosswise hot standby redundancy strategy in an asymmetric active/active fashion. In this case, both services are active and additionally act as standby services for each other. In both cases, the MTTR depends on the heartbeat interval. As with most shared storage solutions, correctness and quality of service are not guaranteed due to the lack of strong commit protocols.

High Availability Open Source Cluster Application Resources (HA-OSCAR) [12, 13, 18] is a high availability framework for the OpenPBS [22] and TORQUE [31] job and resource management systems developed by our team at Louisiana Tech University. It employs a warm standby redundancy strategy. State is replicated to a standby head node upon modification. The standby monitors the health of the active node using a heartbeat mechanism and initiates the fail-over based on the current state. However, OpenPBS/TORQUE does temporarily lose control of the system. All previously running jobs are automatically restarted. HA-OSCAR integrates with the compute node checkpoint/restart layer for LAM/MPI [15], BLCR [1], improving its MTTR to 3-5 seconds for detection and fail-over plus the time to catch up based on the last checkpoint. As part of the HA-OSCAR research, an asymmetric active/active prototype [17] has been developed that offers high availability in a high-throughput computing scenario. Two different job and resource management services, OpenPBS and the Sun Grid Engine (SGE) [28], run independently on different head nodes at the same time, while an additional head node is configured as a standby. Fail-over is performed using a heartbeat mechanism and is guaranteed for only one service at a time using a priority-based fail-over policy. OpenPBS and SGE do lose control of the system during fail-over, requiring a restart of currently running jobs controlled by the failed node. Only one failure is completely masked at a time due to the configuration. However, the system is still operable during a fail-over due to the second active head node. A second failure results in a degraded mode with one head node serving the entire system.

JOSHUA [33, 34], developed by our team at Oak Ridge National Laboratory, offers symmetric active/active high availability for HPC job and resource management services with a PBS-compliant service interface. It represents a virtually synchronous environment using external replication [9] based on the PBS service interface, providing high availability without any interruption of service and without any loss of state. JOSHUA relies on the Transis [5, 32] group communication system for reliable, totally ordered message delivery. The communication overhead introduced in a 3-way active/active head node system, a job submission latency of 300 milliseconds and a throughput of 3 job submissions per second, is in an acceptable range for the gain in service uptime. Similar to the JOSHUA solution, our team at Tennessee Technological University recently developed an active/active solution for the PVFS metadata server (MDS) using the Transis group communication system with an improved total message ordering protocol for lower latency and throughput overheads, and utilizing load balancing for MDS read requests.
Further information about the overall background of high availability for HPC systems, including a more detailed problem description, conceptual models, a review of existing solutions and other related work, and a description of respective replication methods, can be found in our earlier publications [10, 8, 6, 7]. Based on an earlier evaluation of our accomplishments [11], we identified several existing limitations, which need to be dealt with for a production-type deployment of our developed technologies. The most pressing issues are stability, performance, and interaction with compute node fault tolerance mechanisms. The most pressing issue for extending the developed technologies to other critical system services is portability and ease of use.

Other related work includes the Object Group Pattern [20], which offers programming model support for replicated objects using virtual synchrony provided by a group communication system. In this design pattern, objects are constructed as state machines and replicated using totally ordered and reliably multicast state transitions. The Object Group Pattern also provides the necessary hooks for copying object state, which is needed for joining group members. Orbix+Isis and Electra are follow-on research projects [16] that focus on extending high availability support to CORBA using object request brokers (ORBs) on top of virtual synchrony toolkits. Lastly, there is a plethora of past research on group communication algorithms [2, 4] focusing on providing quality of service guarantees for networked communication in distributed systems with failures.

3. Service Model

An earlier evaluation of our accomplishments and their limitations [11] revealed one major problem, which is hindering necessary improvements of the developed prototypes and hampering extension of the developed technologies to other services. All service-level high availability solutions need a customized active/hot-standby, asymmetric active/active, or symmetric active/active environment, resulting in insufficient reuse of code.
This results not only in rather extensive development efforts for each new high availability solution, but also makes their correctness validation an unnecessarily repetitive task. Furthermore, replication techniques, such as active/active and active/standby, cannot be seamlessly interchanged for a specific service in order to find the most appropriate solution based on quality of service requirements. In order to alleviate and eventually eliminate this problem, we propose a more generic approach toward service-level high availability. In a first step, we define a conceptual service model. Various service-level high availability programming models and their properties are described based on this model, allowing us to define programming interfaces and operating requirements for specific replication techniques. In a second step, a replication framework, as proposed earlier in [7], may be able to provide interchangeable service-level high availability programming interfaces, requiring only minor development efforts for adding high availability support to services and allowing seamless adaptation to quality of service requirements.

The proposed more generic approach toward service-level high availability defines any targeted service as a communicating process, which interacts with other local or remote services and/or with users via an input/output interface, such as network connection(s), command line interface(s), and/or other forms of interprocess and user communication. Interaction is performed using input messages, such as network messages, command line executions, etc., which may trigger output messages to the interacting service or user in a request/response fashion, or to other services or users in a trigger/forward fashion. While a stateless service does not maintain internal state and reacts to input with a predefined output independently of any previous input, stateful services do maintain internal state and change it according to input messages. Stateful services perform state transitions and produce output based on a deterministic state machine. However, non-deterministic behavior may be introduced by non-deterministic input, such as by a system timer sending input messages signaling a specific timeout or time. If the service internally relies on such a non-deterministic component invisible to the outside world, it is considered non-deterministic. Non-deterministic service behavior has serious consequences for service-level replication techniques, as replay capability cannot be guaranteed. Stateless services, on the other hand, do not require service-level state replication. However, they do require appropriate mechanisms for fault-tolerant input/output message routing to/from replacement services. As most services in HPC systems are stateful and deterministic, like, for example, the parallel file system metadata server, we will explore their conceptual service model and service-level high availability programming models. Non-determinism, such as displayed by a batch job scheduling system, may be avoided by configuring the service to behave deterministically, for example by using a deterministic batch job scheduling policy. We define the state of service p at step t as S_p^t, and the initial state of service p as S_p^0.
A service state transition from S_p^{t-1} to S_p^t is triggered by a request message r_p^t sent to and processed by service p at step t-1, such that request message r_p^1 triggers the state transition from S_p^0 to S_p^1. Request messages are processed in order at the respective service state, such that service p receives and processes the request messages r_p^1, r_p^2, r_p^3, .... There is a linear history of state transitions S_p^0, S_p^1, S_p^2, ... in direct correspondence to a linear history of request messages r_p^1, r_p^2, r_p^3, .... Service state remains unmodified when the x-th query message q_p^{t,x} is received and processed by service p at step t. Multiple different query messages may be processed at the same service state out of order, such that service p may process the query messages q_p^{t,3}, q_p^{t,1}, q_p^{t,2}, ... previously received as q_p^{t,1}, q_p^{t,2}, q_p^{t,3}, .... Each processed request message r_p^t may trigger any number y of output messages ..., o_p^{t,y-1}, o_p^{t,y} related to the specific state transition from S_p^{t-1} to S_p^t, while each processed query message q_p^{t,x} may trigger any number y of output messages ..., o_p^{t,x,y-1}, o_p^{t,x,y} related to the specific query message. A deterministic service always has replay capability, i.e., different instances of a service have the same state if they have the same linear history of request messages. Furthermore, not only the current state, but also past service states may be reproduced using the respective linear history of request messages. A service may provide an interface to atomically obtain a snapshot of the current state using a query message q_p^{t,x} and its output message o_p^{t,x,1}, or to atomically overwrite its current state using a request message r_p^t with an optional output message o_p^{t,1} as confirmation. Both interface functions are to be used for service state replication only. Overwriting the current state of a service constitutes a break in the linear history of state transitions of the original service instance by assuming a different service instance, and therefore is not considered a state transition by itself.
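To make the model concrete, the following minimal Python sketch (all names hypothetical, not taken from any of the systems discussed above) mirrors this interface: request messages drive deterministic state transitions, query messages produce output without modifying state, and atomic snapshot/overwrite functions exist solely for state replication.

```python
import copy

class Service:
    """Toy deterministic, stateful service p from the conceptual service model."""

    def __init__(self):
        self.step = 0               # t, the number of processed request messages
        self.state = []             # S_p^0, the initial service state

    def request(self, r):
        """Process request message r_p^t: deterministic transition S_p^{t-1} -> S_p^t,
        returning the related output messages o_p^{t,1}, ..., o_p^{t,y}."""
        self.step += 1
        self.state.append(r)
        return [("done", self.step, r)]

    def query(self, q):
        """Process query message q_p^{t,x}: produces output, never modifies state."""
        return [("state", self.step, list(self.state))]

    # Both functions below are to be used for service state replication only.
    def snapshot(self):
        """Atomically obtain a copy of the current state S_p^t."""
        return self.step, copy.deepcopy(self.state)

    def overwrite(self, step, state):
        """Atomically overwrite the current state; not a state transition itself."""
        self.step, self.state = step, copy.deepcopy(state)


p = Service()
print(p.request("job-1"))           # r_p^1 triggers S_p^0 -> S_p^1
print(p.query("list"))              # q_p^{1,1} leaves S_p^1 unchanged
```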
The failure mode of a service is fail-stop, i.e., the service, its node, or its communication links fail by simply stopping. Failure detection mechanisms may be deployed to assure fail-stop behavior in certain cases, such as for incomplete or garbled messages.

4. Active/Standby Replication

In the active/standby replication model for service-level high availability, at least one additional standby service B monitors the primary service A for a fail-stop event and assumes the role of the failed active service when one is detected. The standby service B should preferably reside on a different node, while fencing the node of service A after a detected failure to enforce the fail-stop model (STONITH concept) [27]. Service-level active/standby replication is based on assuming the same initial states for the primary service A and the standby service B, i.e., S_A^0 = S_B^0, and on replicating the service state from the primary service A to the standby service B by guaranteeing a linear history of state transitions. This can be performed in two distinct ways, as active/warm-standby and active/hot-standby replication.

In the warm-standby model, service state is replicated regularly from the primary service A to the standby service B in a consistent fashion, i.e., the standby service B assumes the state of the primary service A once it has been transferred and validated in its entirety. Service state replication may be performed by the primary service A using an internal trigger mechanism, such as a timer, to atomically overwrite the state of the standby service B. It may also be performed in the reverse form by the standby service B, using a similar internal trigger mechanism to atomically obtain a snapshot of the state of the primary service A. The latter case already provides a failure detection mechanism using a timeout for the response from the primary service A. A failure of the primary service A triggers the fail-over procedure to the standby service B, which becomes the new primary service based on the last replicated state. Since the warm-standby model assures a linear history of state transitions only up to the last replicated service state, all dependent services and users need to be notified that a service state rollback may have been performed. However, the new primary service is unable to verify by itself whether a rollback has occurred.
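Reusing the hypothetical Service class from the sketch above, the warm-standby variant reduces to the standby B periodically pulling an atomic snapshot from the primary A; a missing response within the timeout doubles as failure detection, and a fail-over may roll the service back to the last replicated state.

```python
def warm_standby_round(primary, standby):
    """One warm-standby replication round: B pulls and assumes A's entire state.
    In a real deployment this runs on a timer, and a missing response from A
    within the timeout triggers the fail-over to B."""
    step, state = primary.snapshot()    # atomic snapshot of S_A^t
    standby.overwrite(step, state)      # B now matches A up to step t


A, B = Service(), Service()             # same initial state S_A^0 = S_B^0
A.request("job-1")
warm_standby_round(A, B)                # B holds S_A^1
A.request("job-2")                      # suppose A fail-stops after this transition
# Fail-over: B becomes the new primary at step 1; "job-2" is rolled back, so all
# dependent services and users must be notified of the possible rollback.
print(B.step, B.state)                  # -> 1 ['job-1']
```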
In the hot-standby model, service state is replicated on change from the primary service A to the standby service B in a consistent fashion, i.e., using a commit protocol, in order to provide a fail-over capability without state loss. Service state replication is performed by the primary service A when processing request messages. A previously received request message r_A^t is forwarded by the primary service A to the standby service B as request message r_B^t. The standby service B replies with an output message o_B^{t,1} as an acknowledgment and performs the state transition from S_B^{t-1} to S_B^t without generating any output. The primary service A performs the state transition from S_A^{t-1} to S_A^t and produces output accordingly after receiving the acknowledgment o_B^{t,1} from the standby service B or after receiving a notification of a failure of the standby service B. A failure of the primary service A triggers the fail-over procedure to the standby service B, which becomes the new primary service based on the current state. In contrast to the warm-standby model, the hot-standby model does guarantee a linear history of state transitions up to the current state. However, the primary service A may have failed before sending all output messages ..., o_A^{t,y-1}, o_A^{t,y} for an already processed request message r_A^t. Furthermore, the primary service A may have failed before sending all output messages ..., o_A^{t,x-1,y-1}, o_A^{t,x-1,y}, o_A^{t,x,y-1}, o_A^{t,x,y} for previously received query messages ..., q_A^{t,x-1}, q_A^{t,x}. All dependent services and users need to be notified that a fail-over has occurred. The new primary service B resends all output messages ..., o_B^{t,y-1}, o_B^{t,y} related to the previously processed request message r_B^t, while all dependent services and users ignore duplicated output messages. Unanswered query messages ..., q_A^{t,x-1}, q_A^{t,x} are reissued to the new primary service B as query messages ..., q_B^{t,x-1}, q_B^{t,x} by dependent services and users.
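The hot-standby path, again sketched on top of the hypothetical Service class, commits every request to the standby B before the primary A produces output; B mutes its own output but keeps it so it can be resent after a fail-over, where dependents ignore duplicates.

```python
class HotStandbyPair:
    """Sketch of active/hot-standby replication over two equal-initial-state services."""

    def __init__(self, A, B):
        self.A, self.B = A, B
        self.muted = []                 # output B produced but did not send

    def request(self, r):
        """Commit r_A^t to B first (B's return is its acknowledgment o_B^{t,1}, and
        its output stays muted), then let A perform the transition and produce output."""
        self.muted = self.B.request(r)
        return self.A.request(r)

    def fail_over(self):
        """A fail-stopped: B is the new primary at the current state. The output of
        the last processed request is resent; dependents ignore duplicates and
        reissue unanswered queries to B."""
        return self.B, self.muted


pair = HotStandbyPair(Service(), Service())
pair.request("job-1")                   # adds roughly twice the A-B link latency
new_primary, resent_output = pair.fail_over()
print(new_primary.step, resent_output)  # -> 1 [('done', 1, 'job-1')]
```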
The active/standby model always requires notifying dependent services and users about the fail-over. However, in a transparent active/hot-standby model, as exemplified by the earlier mentioned PBS Pro job and resource management system for the Cray XT3 [24], the network interconnect hardware itself performs in-order replication of all input messages and at-most-once delivery of output messages from the primary node, similar to the symmetric active/active model. Active/standby replication also always implies a certain interruption of service until a failure has been detected and a fail-over has been performed. The MTTR of the active/warm-standby model depends on the time to detect a failure, the time needed for fail-over, and the time needed to catch up based on the last replicated service state. The MTTR of the active/hot-standby model only depends on the time to detect a failure and the time needed for fail-over. The only performance impact of the active/warm-standby model is the need for an atomic service state snapshot, which briefly interrupts the primary service. The active/hot-standby model adds communication latency to every request message, as forwarding to the secondary node and waiting for the respective acknowledgment is needed. This communication latency overhead can be expressed as C_L = 2 l_{A,B}, where l_{A,B} is the communication latency between the primary service A and the standby service B. An additional communication latency and throughput overhead may be introduced if the primary service A and the standby service B are not connected by a dedicated communication link equivalent to the communication link(s) of the primary service A to all dependent services and users. Using multiple standby services B, C, ... may provide higher availability, as more redundancy is provided. However, consistent service state replication to the standby services B, C, ... requires fault-tolerant multicast capability, i.e., service group membership management. Furthermore, a priority-based fail-over policy, e.g., A to B to C to ..., is needed as well.

5. Asymmetric Active/Active Replication

In the asymmetric active/active replication model for service-level high availability, two or more active services A, B, ... provide essentially the same capability in tandem without coordination, while optional standby services α, β, ... in an n+1 or n+m configuration may replace failing active services. There is no synchronization or service state replication between the active services A, B, .... Service state replication is only performed from the active services A, B, ... to the optional standby services α, β, ... in an active/standby fashion, as previously explained. The only additional requirement is a priority-based fail-over policy, e.g., A >> B >> ..., if there are more active services n than standby services m, and a load balancing policy for using the active services A, B, ... in tandem. Load balancing of request and query messages needs to be performed at the granularity of user/service groups, as there is no coordination between the active services A, B, .... Individual active services may be assigned to specific user/service groups in a static load balancing scenario. A more dynamic solution is based on sessions, where each session is a time segment of interaction between user/service groups and specific active services. Sessions may be assigned to active services at random, using specific networking hardware, or using a separate service. However, introducing a separate service for session scheduling adds an additional dependency to the system, which requires an additional redundancy strategy. Similar to the active/standby model, the asymmetric active/active replication model requires notification of dependent services and users about a fail-over and about an unsuccessful priority-based fail-over. It also always implies a certain interruption of a specific active service until a failure has been detected and a fail-over has been performed. The MTTR for a specific active service depends on the active/standby replication strategy. However, other active services are available during a fail-over, which interact with their specific user/service groups and sessions and respond to new user/service groups and sessions. The performance of single active services in the asymmetric active/active replication model is equivalent to the active/standby case. However, the overall service provided by the active service group A, B, ... allows for a higher throughput performance and, respectively, a lower response latency due to the availability of more resources and load balancing.
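As a rough sketch of the session-based load balancing described above (all names hypothetical), sessions of user/service groups can be mapped statelessly onto the uncoordinated active services A, B, ...; each active service is assumed to be protected by its own active/standby strategy as in the previous section.

```python
import hashlib

def assign_session(session_id, active_services):
    """Pin a user/service-group session to one active service for its lifetime.
    Hashing keeps the assignment stable without a separate session scheduling
    service, which would itself add a dependency needing its own redundancy."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return active_services[int(digest, 16) % len(active_services)]


active = ["A", "B"]                                   # uncoordinated active services
print(assign_session("user-group-42/session-7", active))
print(assign_session("user-group-13/session-1", active))
```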
6. Symmetric Active/Active Replication

In the symmetric active/active replication model for service-level high availability, two or more active services A, B, ... offer the same capabilities and maintain a common global service state. Service-level symmetric active/active replication is based on assuming the same initial states for all active services A, B, ..., i.e., S_A^0 = S_B^0 = ..., and on replicating the service state by guaranteeing a linear history of state transitions using virtual synchrony [21]. Service state replication among the active services A, B, ... is performed by totally ordering all request messages r_A^t, r_B^t, ... and reliably delivering them to all active services A, B, .... A process group communication system is used to perform total message ordering, reliable message delivery, and service group membership management.
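A toy stand-in for this arrangement, reusing the hypothetical Service class: a single in-process "group channel" imposes one total order on all request messages and delivers them reliably to every member, so the members' states never diverge. A real deployment would use a group communication system such as Transis for this role.

```python
class GroupChannel:
    """Toy group communication stand-in: totally ordered, reliable delivery of
    request messages to every member of the service group (all in one process)."""

    def __init__(self, members):
        self.members = list(members)        # service group membership

    def multicast(self, r):
        """Deliver request r to all active services in one global total order."""
        outputs = [m.request(r) for m in self.members]
        return outputs[0]                   # duplicated outputs are simply ignored


group = GroupChannel([Service(), Service(), Service()])
group.multicast("job-1")
group.multicast("job-2")
# All members applied the same linear history of request messages, so their
# states are identical and any single surviving member can continue the service.
assert len({repr(m.state) for m in group.members}) == 1
```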
Furthermore, consistent output messages o_A^{t,1}, ... related to the specific state transitions S_A^{t-1} to S_A^t, S_B^{t-1} to S_B^t, ... produced by all active services A, B, ... are unified either by simply ignoring duplicated messages or by using the group communication system for a distributed mutual exclusion. The latter is required if duplicated messages cannot simply be ignored by dependent services and/or users. If needed, the distributed mutual exclusion is performed by adding a local mutual exclusion variable and its lock and unlock functions to the active services A, B, .... However, the lock has to be acquired and released by routing the respective lock and unlock request messages as multicasts through the group communication system for total ordering. Access to the critical section protected by this distributed mutual exclusion is given to the active service sending the first locking request, which in turn produces the output accordingly. Dependent services and users are required to acknowledge the receipt of output messages to all active services A, B, ... for internal bookkeeping, using a reliable multicast. In case of a failure, the membership management notifies everyone about the orphaned lock and releases it. The lock is reacquired until all output is produced according to the state transition. This implies a high-overhead lock-step mechanism for at-most-once output delivery. It should be avoided whenever possible.

The number of active services is variable at runtime and can be changed either by forcing an active service to leave the service group or by joining a new service with the service group. Forcibly removed services or failed services are simply deleted from the service group membership without any additional reconfiguration. Multiple simultaneous removals or failures are handled as sequential ones. New services join the service group by atomically overwriting their service state with a service state snapshot from one active service group member, e.g., S_A^t, before receiving the following request messages, e.g., r^{t+1}, r^{t+2}, .... Query messages q_A^{t,x}, q_B^{t,x}, ... may be sent directly to the respective active services through the group communication system to assure total order with conflicting request messages. Related output messages ..., o_A^{t,x,y-1}, o_A^{t,x,y}, o_B^{t,x,y-1}, o_B^{t,x,y} are sent directly to the dependent service or user. In case of a failure, unanswered query messages, e.g., ..., q_A^{t,x-1}, q_A^{t,x}, are reissued by dependent services and users to a different active service, e.g., B, in the service group as query messages, e.g., ..., q_B^{t,x-1}, q_B^{t,x}. The symmetric active/active model also always requires notifying dependent services and users about failures in order to reconfigure their access to the service group through the group communication system and to reissue outstanding queries. There is no interruption of service and no loss of state as long as one active service is alive, since active services run in virtual synchrony without the need for extensive reconfiguration. Furthermore, the MTTR is practically 0. However, the group communication system introduces a communication latency and throughput overhead, which increases with the number of active head nodes. This overhead directly depends on the group communication protocol used.
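The group membership changes described above can be sketched on top of the toy GroupChannel from the previous listing: a joining service first atomically assumes a state snapshot from one existing member and only then receives the following totally ordered request messages, while a failed member is simply dropped from the membership.

```python
def join_group(channel, new_member):
    """Join protocol sketch: overwrite the newcomer's state with a snapshot S_A^t
    from an existing member before it receives any request r^{t+1}, r^{t+2}, ...."""
    donor = channel.members[0]
    new_member.overwrite(*donor.snapshot())
    channel.members.append(new_member)


late = Service()
join_group(group, late)                       # group from the previous sketch
group.multicast("job-3")
assert late.state == group.members[0].state   # the newcomer is in sync
```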
7. Discussion

All presented service-level high availability programming models show certain interface and behavior similarities. They all require notifying dependent services and users about failures. Transparent masking of this requirement may be provided by an underlying adaptable framework, which keeps track of active services, their current high availability programming model, fail-over and rollback scenarios, message duplication, and unanswered query messages. The requirements for service interfaces in these service-level high availability programming models are also similar. A service must have an interface to atomically obtain a snapshot of its current state and to atomically overwrite its current state. Furthermore, for the active/standby service-level high availability programming model, a service must either provide the described special standby mode for acknowledging request messages and muting output, or an underlying adaptable framework needs to emulate this capability. The internal algorithms of the active/hot-standby and the active/active service-level high availability programming models are equivalent for reliably delivering request messages in total order to all active services. In fact, the active/hot-standby service-level high availability programming model uses a centralized group communication commit protocol for a maximum of two members with fail-over capability. Based on the presented conceptual service model and service-level high availability programming models, active/warm-standby provides the lowest runtime overhead in a failure-free environment and the highest recovery impact in case of a failure. Conversely, active/active provides the lowest recovery impact and the highest runtime overhead. Future research and development efforts need to focus on a unified service interface and service-level high availability programming interface that allows reuse of code and seamless adaptation to quality of service requirements while supporting different high availability programming models.

References

[1] Berkeley Lab Checkpoint/Restart (BLCR) project at Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
[2] G. V. Chockler, I. Keidar, and R. Vitenberg. Group communication specifications: A comprehensive study. ACM Computing Surveys, 33(4):1-43, 2001.
[3] Community Climate System Model (CCSM) at the National Center for Atmospheric Research, Boulder, CO, USA.
[4] X. Defago, A. Schiper, and P. Urban. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys, 36(4), 2004.
[5] D. Dolev and D. Malki. The Transis approach to high availability cluster communication. Communications of the ACM, 39(4):64-70, 1996.
[6] C. Engelmann and S. L. Scott. Concepts for high availability in scientific high-end computing. In Proceedings of the High Availability and Performance Workshop (HAPCW) 2005, Santa Fe, NM, USA, Oct. 11, 2005.
[7] C. Engelmann and S. L. Scott. High availability for ultra-scale high-end scientific computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, Cambridge, MA, USA, June 19, 2005.
[8] C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller, A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems. ACM SIGOPS Operating Systems Review (OSR), 40(2):63-72, 2006.
[9] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Active/active replication for highly available HPC system services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006, Vienna, Austria, Apr. 2006.
[10] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active high availability for high-performance computing system services. Journal of Computers (JCP), 1(7), to appear.
[11] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Towards high availability for high-performance computing system services: Accomplishments and limitations. In Proceedings of the High Availability and Performance Workshop (HAPCW) 2006, Santa Fe, NM, USA, Oct. 17, 2006.
[12] HA-OSCAR at Louisiana Tech University, Ruston, LA, USA.
[13] I. Haddad, C. Leangsuksun, and S. L. Scott. HA-OSCAR: Towards highly available Linux clusters. Linux World Magazine, Mar.
[14] Heartbeat project.
[15] LAM/MPI Project at Indiana University, Bloomington, IN, USA.
[16] S. Landis and S. Maffeis. Building reliable distributed systems with CORBA. Theory and Practice of Object Systems, 3(1):31-43, 1997.
[17] C. Leangsuksun, V. K. Munganuru, T. Liu, S. L. Scott, and C. Engelmann. Asymmetric active-active high availability for high-end computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, Cambridge, MA, USA, June 19, 2005.
[18] K. Limaye, C. Leangsuksun, Z. Greenwood, S. L. Scott, C. Engelmann, R. Libby, and K. Chanchio. Job-site level fault tolerance for cluster and grid environments. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster) 2005, Boston, MA, USA, Sept. 2005.
[19] Lustre Architecture Whitepaper at Cluster File Systems, Inc., Boulder, CO, USA.
[20] S. Maffeis. The object group design pattern. In Proceedings of the 2nd USENIX Conference on Object-Oriented Technologies (COOTS) 1996, June 17-21, 1996.
[21] L. Moser, Y. Amir, P. Melliar-Smith, and D. Agarwal. Extended virtual synchrony. In Proceedings of the 14th IEEE International Conference on Distributed Computing Systems (ICDCS) 1994, pages 56-65, June 21-24, 1994.
[22] OpenPBS at Altair Engineering, Troy, MI, USA.
[23] ORNL Jaguar: Cray XT4 Computing Platform at the National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
[24] PBS Pro for the Cray XT3 Computing Platform at Altair Engineering, Troy, MI, USA.
[25] PVFS2 Development Team. PVFS2 High-Availability Clustering. Available as part of the PVFS2 source distribution.
[26] SLURM at Lawrence Livermore National Laboratory, Livermore, CA, USA.
[27] STONITH (node fencing) concept.
[28] Sun Grid Engine (SGE) at Sun Microsystems, Inc., Santa Clara, CA, USA.
[29] Terascale Supernova Initiative at Oak Ridge National Laboratory, Oak Ridge, TN, USA.
[30] A. Tikotekar, C. Leangsuksun, and S. L. Scott. On the survivability of standard MPI applications. In Proceedings of the 7th LCI International Conference on Linux Clusters: The HPC Revolution 2006, Norman, OK, USA, May 1-4, 2006.
[31] TORQUE Resource Manager at Cluster Resources, Inc., Spanish Fork, UT, USA.
[32] Transis Project at The Hebrew University of Jerusalem, Israel.
[33] K. Uhlemann. High availability for high-end scientific computing. Master's thesis, Department of Computer Science, University of Reading, UK, Mar.
[34] K. Uhlemann, C. Engelmann, and S. L. Scott. JOSHUA: Symmetric active/active replication for highly available HPC job and resource management. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster) 2006, Barcelona, Spain, Sept. 2006.
[35] XT4 Computing Platform at Cray Inc., Seattle, WA, USA.
[36] A. Yoo, M. Jette, and M. Grondona. SLURM: Simple Linux utility for resource management. In Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003, volume 2862, pages 44-60, Seattle, WA, USA, June 24, 2003.
