Design of a High-Availability Multimedia Scheduling Service using Primary-Backup Replication



Goudong (Shawn) Liu, Vivek Sawant, Mark R. Lindsey
{liug,vivek,lindsey}@cs.unc.edu
December 12, 2001

Abstract

The design of a system for highly-available client/server applications is presented, together with an example application for scheduling video services. (This work was carried out for Comp 243: Distributed Systems, at the University of North Carolina at Chapel Hill, Fall 2001.)

1 Introduction

A conventional, non-replicated client/server application is susceptible to numerous types of failures, including:

1. Server crash failure
2. Client crash failure
3. Client-to-Server communication failure

We wished to provide a system which would address the problem of server crashes. A server, as used here, is a computing system which provides some service to a number of clients (i.e., users). A server crash occurs when a server ceases to operate, such that all application state which was present in volatile storage (e.g., high-speed and virtual memory) at the time of the crash is no longer available in any form, even after server recovery (i.e., when the server has started again and is again available to provide service to clients).

The standard approach to avoiding the loss of application state is to store some of the application state in non-volatile storage (e.g., on disk). Thus, even when the server crashes, data which was present in non-volatile storage at the time of the crash remains available. However, during the failure period (i.e., the period of time between the server crash and the server recovery), no service is available to clients.

We wish to continue to provide service during the period after a given server has crashed and before it is recovered. We employ spares to accomplish this task, and assume that each server fails independently. A spare server is not needed to provide the service when no failures have occurred: it is provided strictly to support operation of the service during a failure period. Providing a single spare can, ideally, allow the service to operate uninterrupted as long as no more than one server is failing at any time.

If the probability of failure of the primary server P during a given time interval t is P(P), and the probability of failure of the spare server S during an equivalent interval is P(S), then the probability that both fail simultaneously is P(P) × P(S). For example, if t = 1 hour, P(P) = 0.0003, and P(S) = 0.0004, then the mean time to failure (MTTF) of the primary alone is 1/P(P) = 3,333 hours, while the MTTF of the primary and the spare together is 1/(P(P) × P(S)) = 8,333,333 hours, a drastic improvement. Further, supposing a mean time to repair (MTTR) of 24 hours, the availability of running the primary server only is 3,333/(3,333 + 24) = 99.285%, while the expected availability of running the primary together with the spare is as high as 8,333,333/(8,333,333 + 24) = 99.999%. (This analysis assumes an idealized case in which the spare could provide the service seamlessly starting at the instant that the primary fails.)

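For reference, the figures above can be restated in standard notation (the availability symbol A and the subscripts P and P,S are introduced here for convenience and do not appear in the original):

\[
\mathrm{MTTF}_{P} = \frac{1}{P(P)} = \frac{1}{0.0003} \approx 3{,}333\ \text{h}, \qquad
\mathrm{MTTF}_{P,S} = \frac{1}{P(P)\,P(S)} = \frac{1}{0.0003 \cdot 0.0004} \approx 8{,}333{,}333\ \text{h}
\]
\[
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}}: \qquad
A_{P} = \frac{3{,}333}{3{,}333+24} \approx 99.285\%, \qquad
A_{P,S} = \frac{8{,}333{,}333}{8{,}333{,}333+24} \approx 99.999\%
\]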

In general, if a service can operate on a single server, then to survive the failure of f machines the system should include f + 1 servers, f of which may be considered spares. But in order to approach the availability gains described above, a fundamental issue must be resolved: how can a service which has been programmed to operate on a single server be made to operate in a cluster of servers as if it were operating on a single server? Each server in the cluster must be capable of providing the service; as such, the cluster is said to be composed of a set of replica servers. However, naive duplication of the service is not sufficient: simply running multiple instances of the service will not produce the same results as a single server providing the service.

The primary-backup approach to replication addresses this problem by designating that each server participating in the cluster has a role at any instant in time, either as primary or as backup. At any instant, the primary server provides the service to the clients, while the backups stand by as spares. When a backup fails, the service is not affected; but when the primary fails, exactly one of the backups must assume the role of primary. Clients communicate only with the current primary. The implementation of such replication is nontrivial, and requires that the following issues be addressed:

1. How does the client determine which replica is the primary?
2. How does a backup know that the primary has failed?
3. When the primary has failed, how does a particular backup know whether it should take over as primary? (Recall that exactly one of the backups should assume the role of primary when the primary fails.)
4. How is application state replicated from the primary server to the backups?

In this paper, we describe a primary-backup replication system implementation which addresses these questions, and which can be used to develop highly-available applications. We also present an example application which demonstrates this functionality.

2 Replication Service Overview

At any point during operation of the service, each replica in the cluster must have:

- Current application state, such that the replica could assume the primary role.
- Knowledge of its current role, so that the application can function properly according to that role. For example, a backup should refuse service requests which would modify the application state, while a primary should service such requests.

We provide a replication system which can ensure that each application has this information.

2.1 Application/Replication System Interaction

The replication system interface (FTServer) provides a set of procedure calls for use by the application:

Start Fault-Tolerant Server instructs the replication system to join the cluster; the calling application is a replica when this call returns.
    FTServer(AppEventListener) (constructor)

Am I primary allows a replica to determine whether it is the primary.
    boolean isPrimary()

Broadcast application state is provided only to the primary, and instructs the replication system to distribute an updated version of the application state. Note that this procedure does not allow the primary to determine whether the application state was properly received by any of the backup replicas; as we shall see, this is a property of the non-blocking protocol used. This method does nothing on a backup replica, ensuring that the replicated system state remains consistent.
    void bcastStateUpdate(Object)

Get connected replicas provides a replica with a list of all other active members of the cluster.
    InetAddress[] getServerList()

To use the replication system, an application must provide an Application Event Listener (AppEventListener) to the Start Fault-Tolerant Server procedure. The replication system uses a type of callback to inform the application of some events in the cluster by calling these procedures in the AppEventListener:

Add server informs the primary replica that a new server has joined the cluster.
    void addServer(InetAddress)

Remove server informs the primary replica that a server which was in the cluster has left the cluster (e.g., by failing).
    void removeServer(InetAddress)

Update state informs a backup replica that new application state is available. The application, while running in the backup role, contracts to store the current application state.
    void updateState(Object)

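The interface just described can be summarized in a short Java sketch. This is a reconstruction from the descriptions above rather than the project's actual source; the method names follow the text, while the class bodies, parameter names, and comments are assumptions.

    import java.net.InetAddress;

    /* Callbacks through which the replication system notifies the application. */
    interface AppEventListener {
        void addServer(InetAddress server);     // a new replica has joined (delivered to the primary)
        void removeServer(InetAddress server);  // a replica has left or failed (delivered to the primary)
        void updateState(Object appState);      // new application state is available (delivered to a backup)
    }

    /* Facade presented to the application by the replication system. */
    class FTServer {

        public FTServer(AppEventListener listener) {
            // "Start Fault-Tolerant Server": join the cluster; when the constructor
            // returns, the calling application is a replica.
        }

        public boolean isPrimary() {
            // "Am I primary": true iff this replica currently holds the primary role.
            return false;  // placeholder
        }

        public void bcastStateUpdate(Object appState) {
            // Primary only: distribute a new version of the application state to all
            // backups without waiting for acknowledgement. Does nothing on a backup.
        }

        public InetAddress[] getServerList() {
            // All other active members of the cluster.
            return new InetAddress[0];  // placeholder
        }
    }

An application constructs an FTServer with its listener at startup, consults isPrimary() before servicing requests that would modify state, and calls bcastStateUpdate() after each change it makes while primary.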

2.2 Application-State Distribution Mechanism

We considered two mechanisms for distribution of the application state:

1. Distribute only updates as they occur
2. Distribute the entire application state as necessary

Distributing only the updates would allow the system to support applications in which the replicated application state is arbitrarily large. However, the integration of a new replica, or the re-integration of a recovered replica, would require that the updates be replayed in order to the joining replica (assuming that no version of the application state is copied to non-volatile storage). Alternatively, if a copy of the entire application state can be distributed each time, then no such roll-forward protocol is required. We selected this option for its simplicity, and built a communication infrastructure to support it.

2.3 Replication System Options

Several replication methodologies have been described in the literature, and were considered:

State-machine approach. In the state-machine approach the client presents service requests to every member of the cluster, and collects their responses. If a sufficient number of replicas send an equivalent response, then this response is taken as the true response to the service request. This technique requires that each client be programmed to communicate with the cluster; i.e., each client must be aware of the replication system. We wished to provide a system which was decoupled from the client/server application itself, so this technique was rejected.

Single primary, single backup. This approach uses a cluster of exactly two replicas, in which one must be dedicated as the primary and the other as the backup at any time. The backup changes roles to become the primary only when the primary has failed; as such, the protocol used to determine when to change roles is straightforward. This technique does allow the replication subsystem to operate independently of the client/server application. However, we wished to provide a system which would use all available resources and provide greater availability than a two-replica cluster can provide.

Single primary, multiple backups. This approach is a generalization of the single-primary/single-backup approach, as it uses an arbitrary number of backups. It can provide greater availability, because a failure of the service requires that every replica fail simultaneously.

This approach does introduce additional complexities, chief among which is the distributed consensus required to decide which of the replicas will take over as primary when the primary has failed. Variations of this technique exist; among them are blocking systems and non-blocking systems. The blocking time in such a system is the worst-case delay between the receipt of a client request by the primary and the response to the client in a failure-free execution. Non-blocking systems have zero blocking time, and provide the fastest-possible response to the user. However, a non-blocking system does not allow the primary to confirm that any of the backup replicas has properly received a state change; therefore, there is a non-zero probability that an acknowledgement of an operation will be transmitted by the primary to the client, and that the primary will crash before any of the backups have successfully received the updated state. This is a lost-update failure.

We chose to implement a non-blocking primary-backup system supporting an arbitrary number of replicas, as it satisfies our goals for application interaction and provides for quick responses to the client.

3 Replication System Structure

3.1 Replication Manager Thread, PBServer

Each replica runs an instance of the PBServer thread, which manages the replica's interaction with the other members of the cluster. This thread interacts with the application as specified in 2.1.

3.2 Communications Substrate, objecttransfer

The replication system was implemented in Java 2, using the Sun J2SDK 1.3. To support the non-blocking primary-backup protocol, the communication system needed to provide two fundamental services:

1. Send a message to a recipient, but do not wait to ensure that its transmission was completed.
2. Deliver received messages to the replication system as they become available, and do not force the replication system to block until messages are available.

The system required the use of several types of messages, including Heartbeat messages and application-state transfers. Java provides a straightforward, blocking mechanism for transfer of objects across TCP channels.

3.2.1 Message Encapsulation

Each type of message to be transmitted in this system was encapsulated as a Java class, and its contents were chosen carefully so as to ensure that the class could be marked serializable, as required for transfer by Java's default serialization protocol.

3.2.2 Non-Blocking Object Transmission

For each server to which a replica wishes to transfer messages, the replica constructs an ObjectSender. When the replica needs to send a message, it invokes ObjectSender.send(Object), which returns immediately. send() starts a short-lived thread for sending the message, and discards any error results. A message can be transmitted to many recipients with a call to ObjectSenderGroup.broadcast(Object), where an ObjectSenderGroup is a collection of ObjectSenders.

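A minimal sketch of this non-blocking send path is shown below. Only the names ObjectSender.send(Object) and ObjectSenderGroup.broadcast(Object) come from the text; everything else (one TCP connection per message, the port parameter, the error handling) is an assumption made for illustration.

    import java.io.ObjectOutputStream;
    import java.net.InetAddress;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    /* Sends serializable objects to one peer replica without blocking the caller. */
    class ObjectSender {
        private final InetAddress peer;
        private final int port;

        ObjectSender(InetAddress peer, int port) {
            this.peer = peer;
            this.port = port;
        }

        /* Returns immediately; a short-lived thread performs the transfer and
           discards any error result (e.g., when the peer has crashed). */
        void send(final Object message) {
            new Thread(new Runnable() {
                public void run() {
                    try (Socket s = new Socket(peer, port);
                         ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream())) {
                        out.writeObject(message);
                    } catch (Exception ignored) {
                        // deliberately dropped: the Heartbeat protocol will eventually
                        // mark an unreachable replica as faulty
                    }
                }
            }).start();
        }
    }

    /* A collection of ObjectSenders; broadcast() sends one message to every member. */
    class ObjectSenderGroup {
        private final List<ObjectSender> senders = new ArrayList<ObjectSender>();

        void add(ObjectSender sender) { senders.add(sender); }

        void broadcast(Object message) {
            for (ObjectSender sender : senders) {
                sender.send(message);
            }
        }
    }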

3.2.3 Non-Blocking Object Reception

The primary-backup technique requires that failures of a replica be detected by the absence of I-am-alive Heartbeat messages; this implies that the replication system must continue to make progress through its processing even when no messages have been received. Java (as of the J2SDK 1.3 used here) does not provide a non-blocking I/O interface, such as select(). Thus, we developed a multi-threaded mechanism for receiving objects: any ObjectReceiver which wishes to receive a certain type of message (i.e., a certain class of objects) registers, via register(), as an Observer with the ReceivedObjectMediator. An ObjectListener runs as a thread and receives all messages from a single replica. It then forwards received objects to the ReceivedObjectMediator, which forwards the objects on to the registered receiver.

4 Replication Operation

The operation of the replication system is based on the non-blocking protocol described by Budhiraja, Marzullo, et al. ("Optimal Primary...").

4.1 Replicated State

Each replica maintains two objects which must be equivalent on every replica for correct operation:

Application State with Version Number is explicitly copied from the primary to the backups each time it is updated. It can be any serializable object. After startup of the cluster (i.e., after the first server has started), the application running on the primary can update the application state at any time; when it does so, the replication system assigns it a version number which is 1 greater than the previous version number, and transmits the updated state with its new version number to all of the other replicas using the Application State Update protocol. Until the first version of the state is distributed, every replica considers the version to be 0 (zero) and the application state to be undefined. This is an acceptable configuration.

Version Vector records, for each replica, its status (primary, backup, or faulty) and the version of the application state which that replica is known to have. The Version Vector is maintained by all of the protocols described below:

1. Every message between replicas includes the sender's ssVersion and isPrimary status, as described in 4.2.1. Each message, thus, can be used to update the sender's entry in the version vector. Specifically, the Heartbeat and Join Response messages are used to determine which replica is primary, and to inform other replicas that new state has been received.
2. The absence of a Heartbeat from a replica ρ can indicate that ρ is faulty.

4.2 Protocols

Four related protocols are provided to support operation of the cluster. In each protocol, the messages transferred between replicas are encapsulated as serializable Java objects and encoded using Java's default serialization protocol. (The message type itself is thus encoded by the Java serialization protocol, since each type of message is a distinct class.)

4.2.1 Protocol Unit Header

Each message (protocol unit) includes at least two fields:

int ssVersion: the current version of the application state held by the sender when the message was constructed.
boolean isPrimary: true iff the sender was primary when the message was constructed.

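Since a Heartbeat carries nothing beyond these two fields (see 4.2.2), the protocol-unit header can be pictured as a small serializable class. This is an illustrative reconstruction; the field names follow the text, while the class names and structure are assumptions.

    import java.io.Serializable;

    /* Fields carried by every protocol unit exchanged between replicas. */
    class ProtocolUnitHeader implements Serializable {
        int ssVersion;      // version of the application state held by the sender
        boolean isPrimary;  // true iff the sender was primary when the message was constructed

        ProtocolUnitHeader(int ssVersion, boolean isPrimary) {
            this.ssVersion = ssVersion;
            this.isPrimary = isPrimary;
        }
    }

    /* A Heartbeat ("I-am-alive") message consists only of the header contents. */
    class Heartbeat extends ProtocolUnitHeader {
        Heartbeat(int ssVersion, boolean isPrimary) {
            super(ssVersion, isPrimary);
        }
    }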

4.2.2 Heartbeat

The Heartbeat or I-am-alive message is sent by a replica in order to:

1. Inform other replicas that it is still functioning
2. Inform other replicas of its version of the application state
3. Inform other replicas when a role change to primary has occurred

Each Heartbeat message consists only of the header contents, described above.

Liveness Monitoring. Upon joining the cluster, a replica ρ has a version vector which includes all of the active replicas at the time of joining. Every HeartbeatSendRate milliseconds, ρ broadcasts a Heartbeat message (i.e., it transmits a Heartbeat to every non-faulty replica, except itself). In our experiments, we set HeartbeatSendRate = 1000. During operation, ρ checks every HeartbeatCheckRate milliseconds (500 ms in our experiments) to determine whether any new Heartbeats have been received. If, after HeartbeatTimeout milliseconds, a Heartbeat has not been received from another replica φ, then ρ updates its version vector to indicate that φ is faulty.

State-Version Changes. If a replica ρ receives an Application State Update immediately after ρ has broadcast a Heartbeat, then under the mechanism described above, every other replica's version vector will have an out-of-date version number recorded for ρ, even though ρ does have the latest version of the state. Because only a replica with the current state can take over as primary, every other replica may incorrectly conclude that ρ is not a candidate to take over as primary. Thus, a failure of the primary before the next Heartbeat broadcast from ρ may cause contention to become the primary. Fundamentally, the problem is that of an inconsistent version vector. To remedy this, ρ broadcasts an extra Heartbeat immediately after it receives an Application State Update. This Heartbeat includes ρ's updated version number, so that each replica has a consistent version vector, as required for takeover by distributed consensus (see 4.2.5). Incidentally, while we observed this problem in development, it does not appear to be mentioned in the original protocol specification cited above.

4.2.3 Join

Upon starting, a replica ρ transmits a Join Request message to each member on its list Replicas. This list contains an entry for each replica in the cluster, indicating the replica's network address (an IP address, in our case) and its rank. Normally, the list Replicas will be distributed before cluster startup. The Join Request message contains only the fields of the header; it is sent in an attempt to discover the current primary. Each active replica φ responds with a Join Response message, which contains the fields of the header plus a field int result, which takes one of the following values:

JOIN_OK indicates that φ is the primary server, and that φ has recorded ρ as a functioning backup.
JOIN_LOCAL_ERR is not used.
JOIN_FAILED indicates that φ is not the primary server, but that φ has recorded ρ as a functioning backup.

When a primary replica φ receives a Join Request from ρ, it responds with result = JOIN_OK as described above. It also broadcasts the current version of the Application State to all active replicas, using the Application State Update protocol. Even before the joining replica ρ receives the Application State, it is a functioning replica, but it transmits all messages with ssVersion = 0; if the current Application State version is not zero, then ρ is ineligible to take over as primary. This is simply a specific case of the takeover by distributed consensus, described in 4.2.5.

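Putting the parameters above together, the liveness-monitoring loop of a replica can be sketched as follows, reusing the Heartbeat and ObjectSenderGroup sketches from earlier sections. The timer structure, the listener interfaces, and the HEARTBEAT_TIMEOUT value are assumptions; the two rates are the experimental values quoted in 4.2.2.

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;

    /* Illustrative liveness monitor: broadcast Heartbeats periodically, and mark
       replicas faulty when no message has arrived from them for too long. */
    class LivenessMonitor {
        static final long HEARTBEAT_SEND_RATE = 1000;  // ms between Heartbeat broadcasts (from the text)
        static final long HEARTBEAT_CHECK_RATE = 500;  // ms between liveness checks (from the text)
        static final long HEARTBEAT_TIMEOUT = 3000;    // ms of silence before marking faulty (assumed value)

        /* Supplies the current header fields for outgoing Heartbeats. */
        interface HeaderSource { Heartbeat currentHeartbeat(); }

        /* Receives faulty-replica notifications, e.g. to update the version vector of 4.1. */
        interface FaultListener { void markFaulty(InetAddress replica); }

        private final Map<InetAddress, Long> lastHeard = new ConcurrentHashMap<InetAddress, Long>();
        private final ObjectSenderGroup peers;   // non-blocking senders from 3.2.2
        private final FaultListener faults;
        private final Timer timer = new Timer(true);

        LivenessMonitor(ObjectSenderGroup peers, FaultListener faults) {
            this.peers = peers;
            this.faults = faults;
        }

        void start(final HeaderSource self) {
            timer.scheduleAtFixedRate(new TimerTask() {
                public void run() {  // every HeartbeatSendRate ms: I-am-alive to all non-faulty peers
                    peers.broadcast(self.currentHeartbeat());
                }
            }, 0, HEARTBEAT_SEND_RATE);

            timer.scheduleAtFixedRate(new TimerTask() {
                public void run() {  // every HeartbeatCheckRate ms: mark silent replicas as faulty
                    long now = System.currentTimeMillis();
                    for (Map.Entry<InetAddress, Long> entry : lastHeard.entrySet()) {
                        if (now - entry.getValue() > HEARTBEAT_TIMEOUT) {
                            faults.markFaulty(entry.getKey());
                        }
                    }
                }
            }, 0, HEARTBEAT_CHECK_RATE);
        }

        /* Called by the reception path (3.2.3) whenever any message arrives from a peer. */
        void recordMessage(InetAddress from) {
            lastHeard.put(from, System.currentTimeMillis());
        }
    }
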
4.2.4 Application State Transmission

Only the primary replica may transmit Application State messages. Each Application State message consists of the header plus a single field, appState, which is the entire contents of the application state encoded as a serializable object. (To be precise, Java transfers the application state as an object graph of serialized objects, so that references within the object can be followed to other objects. This allows conventional object-oriented techniques to be used in the software design of the application.)

When the application running on the primary replica calls FTServer.bcastStateUpdate(), the primary increments the recorded Application State Version number by one and broadcasts the new version of the Application State to all active backups. It does not wait to determine whether any of the backups receive the updated Application State; the conventional Heartbeat protocol described above is used by the primary to maintain its version vector.

When a backup replica receives an Application State α, it stores α locally and updates its own Application State Version (as used in outgoing messages) to α's ssVersion. It then calls AppEventListener.updateState(α) to inform the application running on the backup that new application state is available. This mechanism allows a backup replica to perform non-modifying operations on the Application State; for example, in the multimedia scheduling application described in 5, clients may view the Application State on any backup, but may only modify it on the primary.

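Both sides of this exchange can be sketched compactly, reusing the ProtocolUnitHeader, ObjectSenderGroup, and AppEventListener sketches above. The ApplicationStateUpdate and StateReplicator classes and their wiring are assumptions made for illustration, not the project's actual code.

    /* Carries a complete copy of the application state, tagged with its version. */
    class ApplicationStateUpdate extends ProtocolUnitHeader {
        final Object appState;   // must itself be serializable

        ApplicationStateUpdate(int ssVersion, boolean isPrimary, Object appState) {
            super(ssVersion, isPrimary);
            this.appState = appState;
        }
    }

    class StateReplicator {
        private final ObjectSenderGroup backups;    // non-blocking senders (3.2.2)
        private final AppEventListener application; // application callbacks (2.1)
        private int ssVersion = 0;                  // version 0: application state still undefined
        private Object appState;

        StateReplicator(ObjectSenderGroup backups, AppEventListener application) {
            this.backups = backups;
            this.application = application;
        }

        /* Primary side of bcastStateUpdate(): bump the version and fire-and-forget. */
        void broadcastFromPrimary(Object newState) {
            ssVersion = ssVersion + 1;
            appState = newState;
            backups.broadcast(new ApplicationStateUpdate(ssVersion, true, newState));
            // No acknowledgement is awaited; a backup that misses this update simply
            // remains ineligible to take over until a newer update reaches it.
        }

        /* Backup side: store the state, adopt its version, and notify the application.
           (The extra Heartbeat broadcast described in 4.2.2 is omitted from this sketch.) */
        void receiveOnBackup(ApplicationStateUpdate update) {
            appState = update.appState;
            ssVersion = update.ssVersion;
            application.updateState(update.appState);
        }
    }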

4.2.5 Takeover

When the primary replica fails, exactly one of the backups must take over as primary. Our implementation makes use of the Version Vector maintained by the message exchanges described above.

Distributed Consensus. In a single-backup cluster, the backup replica can always take over as primary immediately. In our cluster, however, any of the backup replicas could potentially take over as primary. Thus, we adapted a distributed consensus protocol to determine which one of the backups should take over. When a backup detects that the primary has failed, it consults an algorithm, boolean canTakeoverAsPrimary(), to determine whether it must assume the role of primary. The algorithm is as follows:

1. If version[self] > version[i] for all other i in Non-Faulty, then return canTakeoverAsPrimary := true.
2. Else, if version[self] == version[j] and rank[self] > rank[j] for all other j in Non-Faulty, then return canTakeoverAsPrimary := true.
3. Else return canTakeoverAsPrimary := false.

This algorithm ensures that a replica with the latest application state available among the non-faulty servers takes over as primary; if there are multiple replicas with the latest application state, the tie is broken by the rank, which is guaranteed through configuration to be unique to each server.

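The decision amounts to asking whether this replica is the maximum, ordered first by application-state version and then by rank, among the non-faulty replicas. A brief Java sketch follows; the ReplicaInfo structure and surrounding names are assumptions.

    import java.util.Collection;

    /* One entry of the version vector, as seen by the local replica. */
    class ReplicaInfo {
        final int rank;     // unique rank assigned through configuration
        int version;        // latest application-state version known for this replica
        boolean faulty;     // set when this replica's Heartbeats stop arriving

        ReplicaInfo(int rank) { this.rank = rank; }
    }

    class TakeoverDecision {
        /* True iff this replica should assume the primary role: it holds the newest
           application state among the non-faulty replicas, ties broken by rank. */
        static boolean canTakeoverAsPrimary(ReplicaInfo self, Collection<ReplicaInfo> replicas) {
            for (ReplicaInfo other : replicas) {
                if (other == self || other.faulty) continue;
                if (other.version > self.version) return false;
                if (other.version == self.version && other.rank > self.rank) return false;
            }
            return true;
        }
    }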

Fail-over time. The fail-over time is the period during which no server is the primary, such that the service is unavailable. For the implementation described, this time is a function of the message delay δ and of the heartbeat parameters (HeartbeatSendRate, HeartbeatCheckRate, and HeartbeatTimeout) described in 4.2.2.

4.3 Failure Handling

The replication system presented is designed to provide higher availability than could be achieved with a single server. The behavior of the system under various fault conditions is described below.

4.3.1 Crashed Server

We wished to provide proper, uninterrupted operation from the time that the cluster starts, as long as any one replica is functioning, provided that each server functions long enough to join the cluster and receive the Application State from the current primary. This requires that we handle server crash failures and re-integration.

As described above in 4.2.5, if the primary server fails during otherwise-normal operation, then exactly one of the non-faulty backup replicas will take over as primary. Thus, service continues to be available. When a server joins the cluster (either after recovering from a failure, or when starting for the first time), it employs the Join protocol to become a replica. Once it has received the Application State, it is a fully-functional backup, and can subsequently take over as primary. Experiments have demonstrated that our implementation performs this operation reliably.

4.3.2 Missed Message

The response of the system to a missed message depends on the type of message that was missed.

Missed Heartbeat. If replica ρ misses a Heartbeat from replica φ, then ρ will mark φ as faulty, and ρ will cease to transmit any messages to φ until φ sends ρ another Join Request.

Missed Application State. When a replica ρ misses an updated Application State (i.e., a message carrying a version of the Application State which is newer than any previously-distributed version), then it will no longer be eligible to take over as primary as long as another replica with a newer version is non-faulty.

4.3.3 Missed Application State + Primary Crash

The presented replication system does not attempt to recover from the multiple-failure scenario described below:

1. A client κ makes a request to the primary φ.
2. φ modifies the application state.
3. φ broadcasts the updated application state to all backup replicas, but every backup misses the update.
4. φ responds to κ with an indication that the update has been made.
5. φ crashes.

In this case, the client believes that the change has been made, but it was actually lost. One of the backup replicas will take over as primary, but the new primary will not have the change made by κ. This scenario describes a disadvantage of every non-blocking primary-backup protocol.

4.3.4 Network Partition

In a network partition, the replica cluster is divided into multiple groups of servers; each server within a partition can communicate only with the others in its partition. In this case, a primary will take over within each partition, and clients within that partition can communicate only with that primary. This will cause inconsistent application state to be maintained within each partition. When the network partition is repaired, the servers within each of the previous partitions will continue to communicate only with each other. A new server ρ, however, will contact all of the servers, and will join the primary φ whose Join Response message it receives first. It will establish contact with all of the non-faulty replicas in each of the partitioned clusters, and ultimately may elect to take over as primary within one of the clusters. The long-term results of running in such an arrangement are undefined.

Clearly, such a degenerate configuration is undesirable. It can be repaired, in only some cases, by stopping every server in every cluster except for the one server which has a desirable version of the Application State (if any such version exists), then starting all of the other servers. Supplemental mechanisms are required to provide proper operation in the presence of network partitions.

4.3.5 Link Failure

A link failure occurs when a pair of servers φ and ρ cannot communicate with each other, but both can communicate with some common set of other servers. While not supported in the version of the replication system presented here, we have done work to develop a protocol providing safe operation in the presence of link failures. It would consist of a can-you-see-the-primary protocol, used as follows:

1. If the primary φ loses contact with a backup, then the primary marks the backup as faulty and proceeds as usual.
2. If a backup ρ loses contact with the primary, then the backup polls each of the other backup replicas to determine whether any of them can see a primary. If any one of them can see a primary, then ρ halts. Otherwise, the takeover protocol (see 4.2.5) is invoked.

The implementation of this protocol is left as future work.

5 Distributed Video Scheduling Service, MESS

To demonstrate this replication system in a useful application, we developed the Multimedia Entertainment Super Server, MESS.

5.1 Overview of Service

MESS provides a highly-available scheduling service for streaming video. A client connects to the MESS primary server to request to watch a particular television channel, and the MESS server attempts to satisfy the request using one of the available video servers. If the request can be satisfied, then the assigned video server tunes to the appropriate television channel and transmits the video back to the requesting client. Each MESS server is both a member of the application cluster and a video server. The backup MESS servers cannot be used for scheduling, but they do provide a read-only view of the schedule to users.

A user communicates with the MESS servers through a web interface (i.e., HTML and CGI over HTTP) to view the schedule and to request viewing of a specific channel. The MESS server assigned to transmit video to a particular client uses an Open Mash program developed by Ketan Mayer-Patel (University of North Carolina, kmp@cs.unc.edu) to transmit the video.

5.2 System Structure

The Application Layer stands at the top of the MESS system. It has the following goals:

1. Provide a scheduling facility for a client to schedule viewing of an entertainment event (watching a TV channel) via any of the available servers.
2. Provide multimedia streaming via the video servers.

5.2.1 Scheduler

As Figure 2 shows, the Scheduler maintains a global schedule, implemented as an in-memory database storing all of the schedule information of the whole MESS system. The Global Schedule consists of a set of Server Schedules. Each Server Schedule has a set of Schedule Entries. Each Schedule Entry has a client identifier (the Internet address of the client), the requested television channel, the time to start playing, and the time to stop playing.

5.2.2 Replication Mediator

The Replication Mediator provides for interaction between the video-scheduling application layer and the replication layer. It initiates the interaction with the cluster by constructing an FTServer, and it receives and stores each new version of the Global Schedule, whether that version comes from the application (when the replica is the primary) or from the replication system (when the replica is a backup). Each time the Global Schedule changes, the Replication Mediator sends the schedule to the Video Player.

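The Global Schedule described in 5.2.1 maps naturally onto a few small serializable classes, since the whole schedule is the replicated Application State. The sketch below is illustrative; the class and field names are assumptions rather than MESS's actual source.

    import java.io.Serializable;
    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    /* One client's request: who is watching, which channel, and when. */
    class ScheduleEntry implements Serializable {
        InetAddress client;   // client identifier
        int channel;          // requested television channel
        Date start;           // time to start playing
        Date stop;            // time to stop playing
    }

    /* The entries assigned to one MESS video server. */
    class ServerSchedule implements Serializable {
        InetAddress server;
        List<ScheduleEntry> entries = new ArrayList<ScheduleEntry>();
    }

    /* The replicated Application State: one schedule per server in the cluster.
       On the primary, a modified copy is distributed via FTServer.bcastStateUpdate(). */
    class GlobalSchedule implements Serializable {
        List<ServerSchedule> serverSchedules = new ArrayList<ServerSchedule>();
    }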

Figure 1: Layers of the MESS application, together with the replication system (layers, top to bottom: MESS Application, Replication System, Communication System).

Figure 2: Global Schedule structure. A single GlobalSchedule is the replicated Application State for MESS; it contains Server Schedules, each of which contains Schedule Entries.

Figure 3: Communication paths. The clients communicate only with the primary, via HTTP. Each MESS server can stream video to one client, but each client may receive multiple streams. (The figure shows the clients, the backup MESS servers, the primary MESS server, and the CATV source.)

5.2.3 Video Player

The Video Player module interprets the Global Schedule to drive the streaming-video subsystem. When a MESS server receives an updated Global Schedule indicating that it should stream video to a client, it starts playing video. When a server crashes, its entries are not removed from the schedule. When a server joins the cluster, it receives a version of the Global Schedule; this schedule may indicate that the newly-joining server needs to start streaming video, in which case it starts streaming immediately. This provides a certain sort of recovery for the service after a server crash.

5.2.4 User Interface

The User Interface uses a servlet running in the Apache Tomcat 4 (Catalina) servlet engine. It displays the global schedule, as retrieved from the Replication Mediator. The current status of each server for which videos have been scheduled is also shown to the user. On the primary server, the client can make a request to view a channel. When the client makes a new request, the primary will immediately show whether there is a conflict with the schedule that the client has requested. If there is no conflict, it will add the job to the schedule and show the change to the user. The schedule is then immediately propagated to all other members of the cluster.