Schedule First, Manage Later: Network-Aware Load Balancing


Amir Nahir, Department of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel. Ariel Orda, Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa, Israel. Danny Raz, Department of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel.

Abstract: Load balancing in large distributed server systems is a complex optimization problem of critical importance in cloud systems and data centers. Existing schedulers often incur a high communication overhead when collecting the data required to make the scheduling decision, hence delaying the job request on its way to the executing server. We propose a novel scheme that incurs no communication overhead between the users and the servers upon job arrival, thus removing any scheduling overhead from the job's critical path. Our approach is based on creating several replicas of each job and sending each replica to a different server. Upon the arrival of a replica at the head of the queue at its server, the latter signals the servers holding the other replicas of that job, so as to remove them from their queues. We show, through analysis and simulations, that this scheme improves the expected queuing overhead over traditional schemes by a factor of 9 (or more) under all load conditions. In addition, we show that our scheme remains efficient even when the inter-server signal propagation delay is significant (relative to the job's execution time). We provide heuristic solutions to the performance degradation that occurs in such cases and show, by simulations, that they efficiently mitigate the detrimental effect of propagation delays. Finally, we demonstrate the efficiency of our proposed scheme in a real-world environment by implementing a load balancing system based on it, deploying the system on the Amazon Elastic Compute Cloud (EC2), and measuring its performance.

I. INTRODUCTION

The problem of effectively managing distributed computational resources has been extensively studied. Nowadays, it is receiving renewed attention due to the growing interest in large-scale data centers, in particular in the context of Cloud Computing [8]. To cope with high loads, service providers construct data centers made of tens of thousands of servers [9]. Moreover, such data centers, often providing the same service, may be located in different places around the world. This is the case, for example, with the Google search services: at each given point in time, there are several front-end servers and several back-end search engines active in various locations over the globe. An incoming user search request is directed to one of the front-end servers (using DNS) and then redirected to one of the back-end search engines. The actual choice of the specific server to process a given request has a critical impact on the overall performance of the service. This is a very difficult optimization problem, where in many cases there are multiple objectives and many parameters (such as network latency and server load) that should be considered. Regardless of the exact optimization criterion, any adaptive system that tries to address this problem needs to collect a significant amount of data in order to locate a suitable server for execution. To avoid the stale data problem [13], the data must be collected as close as possible to the arrival of the job. Thus, common load balancing approaches, such as the Supermarket Model [12], invoke the data collection process upon a job's arrival.
This, in turn, introduces a delay in the execution of the job, as the data collection process hinders the selection of the executing server and the arrival of the job to that server. The size of data centers, coupled with their distribution across the globe, calls for fully distributed load balancing techniques. Accordingly, it is important to understand how well an oblivious distributed load sharing system can perform. (An oblivious system, also termed a static system, is one that is independent of the current state and does not use any dynamic input.) The above gives rise to the challenge we address in this study: constructing a load balancing mechanism that, on the one hand, minimizes the time a job spends until being assigned to a server, and, on the other hand, scales well and is suitable for contemporary data centers. Our approach is based on duplicating each job request into several replicas, which are sent to randomly chosen servers. We assume that servers process jobs according to a first-come-first-served policy. Servers assigned replicas of the same job coordinate the removal of all copies but one. Specifically, when a replica arrives at the head of the queue and begins processing, a signal is sent to the other corresponding servers in order to trigger the removal of all other replicas. Essentially, our scheme adopts a "schedule first, manage later" approach. Indeed, upon the job's arrival, no data collection is performed and the job is sent to several servers without incurring any decision-making delay. Only after the job begins processing do we spend (very little) time executing the required signaling. Overall, this approach generates less overhead than the Supermarket Model [12]. Moreover, the delay induced does not directly affect the completion time of the job.

We test our technique under a traditional setting, in which jobs arrive according to a Poisson process and their lengths (i.e., processing times) are exponentially distributed. Such load profiles have been widely considered in the literature, due to their amenability to formal analysis as well as often being suitable approximations of reality. We show, by analysis and simulations, that in the case of two replicas per job, our scheme offers an improvement of a factor of 9 (and more) in queuing overhead across all loads, where the improvement ratio is especially high under high loads, namely for loads over 90%. With more replicas, the improvement ratio is even better; for example, in the case of 3 copies, our scheme offers an improvement of a factor of 9.9 (and more) across all loads. We also tested our scheme in a real-world environment. Specifically, we implemented our scheme on the Amazon EC2 service [1] and tested its performance. We show that, also in the presence of (detrimental) real-world phenomena, such as packet loss and signal propagation delay, and also when accounting for the effects of a real system, such as thread context switching, our scheme offers a significant improvement of a factor of 3 (and more) when producing just two replicas. Yet, the time required for the removal signals to travel from the sending server (the one executing the job) to the servers holding the other replicas may hinder the performance gain offered by our scheme. Indeed, if replicas arrive at the heads of the servers' queues at approximately the same time, then the servers may spend time processing jobs that are doomed to be removed. Thus, in the presence of very large delays (with respect to the job's processing time), multiple copies of the same job might complete processing at different servers. Nonetheless, our results show that, as long as the signal propagation time is comparable to the average job execution time, our scheme provides a significant performance gain over random assignment. Furthermore, we establish several heuristics that improve our scheme in the presence of signal propagation delay. One class of heuristics modifies the job selection policy of the servers: instead of processing jobs based on the first-come-first-served policy, servers may either choose a job from the queue randomly, or through a simple prioritization policy. We also explore a second class of heuristics, in which servers send removal signals before the job makes it to the head of the queue. We evaluate these heuristics through simulations and show that they can be used to restore the high performance gain offered by our scheme under moderate signal propagation delay. To summarize, the main contributions of this paper are as follows. 1) Introduction and analysis of a duplication-based load balancing scheme for large distributed server systems such as the Cloud. 2) A detailed performance study of the proposed scheme in the presence of signal propagation delay. 3) Heuristics that assist in improving performance in the face of moderate signal propagation delay. 4) An implementation of the scheme within Amazon's EC2 framework in conditions equivalent to those of a real-life commercial environment. The paper is organized as follows. After discussing related work in the next section, in Section III we formalize the model. In Section IV we describe the main motivation for our work.
In Section V we formally analyze the proposed solution for the case of no signal propagation delay and validate the results through simulations. In Section VI we evaluate, through simulations, the impact of signal propagation delay on our scheme. We then establish and evaluate heuristics for improving the performance in the case of moderate delays in Section VII. In Section VIII we describe our real-world implementation. Finally, conclusions appear in Section IX. Due to space limits, some of the details are omitted and can be found (online) in [14].

II. RELATED WORK

The problem of load balancing has been studied extensively in a variety of contexts over the last twenty years. Several variations of the problem exist, sometimes under different names (e.g., load sharing, scheduling and task assignment). Solutions to the problem, often termed scheduling algorithms or assignment policies, can be classified according to several fundamental properties. Some solutions assume a centralized system [7], [11], i.e., all jobs arrive at a single entity (called a scheduler or dispatcher), which is responsible for determining which server will process each job. Other solutions assume a distributed system [2], [4], i.e., jobs may arrive at any server in the system, which needs to carry out the scheduling policy and determine which server will process the job. In the sequel, the term scheduler refers to the entity administrating the policy, whether it is a central one or not. Some solutions require a priori knowledge (i.e., already when the job arrives) of the time required to process the job [6]. Other solutions attempt to guess this information as the job is being processed [7]. Yet most solutions do not assume any such knowledge, but rely either on other components of the system's state, or simply on the number of pending jobs at the different servers. As explained in the introduction, dynamic (or adaptive) policies require a mechanism to collect the state information from the distributed system, and thus are associated with significant overhead and possible delays. To address these issues, a highly interesting approach, called the Supermarket Model, was proposed in [12]. The main idea is to consider just d random servers (where d is a small constant) out of the available N, and assign the job to the one with the shortest queue. It was shown in [12] that even for d = 2 this scheme achieves a much better delay than random assignment, and this improvement gets better as we increase d. The scheme described in [12] was centralized and did not consider the time required to collect the data as part of the delay that each job incurs. Furthermore, its evaluation did not consider the direct overhead associated with each load query (on the executing servers).

A recent study [2] proposed a fully distributed implementation of the Supermarket Model and showed that, under appropriate tuning, this version performs very well even when the overhead is considered. However, the scheme in [2] is still dynamic, does not account for the data collection delay, and uses partial knowledge regarding the current system state. As such, it either incurs delays (if information is acquired on demand), or else uses stale state information. As mentioned, oblivious schemes, such as those considered in this study, are exempt from these shortcomings. In [11], the authors address the very same problem we do, that is, how to efficiently manage a large set of servers while removing the scheduling work from the job's critical path. However, their approach requires that the processing servers (back-end servers) be familiar with the scheduling servers (front-end servers) and is therefore not applicable to architectures where end-users directly access servers (such as in DNS-based scheduling [3]).

III. MODEL & BASIC SCHEME

We consider a system of N servers. Each server provides exponentially distributed service at a rate of µ. Jobs arrive at the system according to a Poisson process at an aggregate rate of λN. We consider batch jobs, so by a job we mean a job request, and the job may be performed by any of the servers in the system. Furthermore, we assume jobs are idempotent, that is, completing the job multiple times will not have any side effects. For example, search queries are such jobs. We assume a distributed system; therefore there is no single entity that decides on the exact assignment of jobs to servers. Instead, each job is replicated into d copies, which are sent to different, randomly chosen, servers. Servers process jobs based on a first-come-first-served policy. Whenever a job arrives at a server, it queues the job. Whenever a server completes processing a job, it dequeues the next job from the queue, and sends removal signals to the other d − 1 servers holding that job's replicas in their queues. Upon reception of a removal signal, a server removes the relevant job from its queue. In case the job is already being processed, the server will drop it only if the index of the server that sent the removal signal is higher than its own. This way, we guarantee that at least one replica of the job will be processed. We note that the signaling mechanism described above is activated in parallel with the beginning of the job's processing by the server, and therefore does not delay the scheduling mechanism. In actual systems, such removal signals incur some end-to-end delay, which we denote by T_d; that is, a removal signal reaches its destination within (at most) T_d time units after its transmission. To simplify the analysis, at this stage we assume that T_d = 0, which corresponds to the quite typical case where the signaling transmission delays are small compared to job processing times. In Section VI we evaluate the impact of the delay T_d on the performance of our scheme.
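To make the scheme concrete before the analysis, the following is a minimal event-driven simulation sketch of it (our own illustration, not the paper's in-house simulator; all parameter values are arbitrary). With T_d = 0, a replica whose sibling has already entered service elsewhere can simply be skipped when it reaches the head of a queue, which is equivalent to removing it from the queue immediately:

```python
import heapq
import random
from collections import deque

def simulate(N=100, lam=0.9, mu=1.0, d=2, n_jobs=200_000, seed=1):
    """Replication scheme with zero signaling delay (T_d = 0).
    Returns the mean queuing overhead (time spent waiting in queue)."""
    random.seed(seed)
    events = [(random.expovariate(N * lam), 'arr', 0)]  # (time, kind, data)
    queues = [deque() for _ in range(N)]   # FIFO queue per server
    busy = [False] * N
    started = set()                        # jobs whose other replicas are void
    waits = []

    def try_start(s, now):
        # Skip replicas already claimed elsewhere, then serve one job.
        while queues[s] and not busy[s]:
            job, arrival = queues[s].popleft()
            if job in started:
                continue                   # replica "removed" by a signal
            started.add(job)               # signals the sibling replicas
            busy[s] = True
            waits.append(now - arrival)
            heapq.heappush(events, (now + random.expovariate(mu), 'done', s))

    while events:
        t, kind, x = heapq.heappop(events)
        if kind == 'arr':
            if x + 1 < n_jobs:             # schedule the next arrival
                heapq.heappush(events,
                               (t + random.expovariate(N * lam), 'arr', x + 1))
            for s in random.sample(range(N), d):  # d distinct random servers
                queues[s].append((x, t))
                try_start(s, t)
        else:                              # service completion at server x
            busy[x] = False
            try_start(x, t)
    return sum(waits) / len(waits)
```

As a sanity check, simulate(d=1) reduces to random assignment, i.e., N independent M/M/1 queues, so its output should approach λ/(µ(µ − λ)) (9 time units for λ = 0.9, µ = 1); with d = 2 the measured overhead drops sharply, in line with the analysis of Section V.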
IV. MOTIVATION

A. Removing the Scheduling Decision from the Job's Critical Path

One common element of load balancing schemes is that the job request spends some time in the scheduler (be it a centralized one, a distributed one, or just a component in one of the servers). The extent of this time may be very short, as in the case of the classic (centralized) Round Robin scheduler, but it tends to grow with the complexity of the scheduling policy and the constraints imposed on it, such as the need to operate in a large-scale environment, where more schedulers are present. A common approach to overcoming this delay is to collect the information through a process that is asynchronous to the process of determining the scheduling of the job. In such a system, the scheduler would have one process that collects state information from the servers, and another process that receives jobs, determines the processing server for each, and sends it to that server. The information-collecting process would update a shared object with new information on the system state, and the scheduling process would utilize this information to make the scheduling decision. This approach suffers from a major shortcoming, namely: if the information is collected too often, the executing servers are frequently interrupted to answer queries regarding their state, which in turn reduces their effective processing rate [2]. On the other hand, if the information is not collected often enough, the scheduling process makes its decisions based on stale data [13]. In [11], the authors propose a different approach, namely: defer the computation required for scheduling decisions and perform it separately from the actual job scheduling. In their Join-Idle-Queue scheme, idle servers report back to schedulers after going idle. We follow this approach of "schedule first, manage later", but in a different manner, which does not assume a two-tier scheduler-server system architecture; this way, we make it possible to apply our approach also in systems where end-users directly access servers, as is the case, for example, with DNS-based scheduling.

B. Improving Scheduling Decision Accuracy

In systems where the arrival time and execution time of jobs cannot be pre-determined, one common technique to estimate the load on a server is through the number of jobs held in its queue [5]. The Supermarket Model [12] takes this further and utilizes this information to drive its scheduling policy. While the Supermarket Model provides a significant performance improvement over random assignment, relying on the number of jobs in a server's queue in order to estimate its load can often lead to a wrong scheduling decision. In the sequel, we evaluate the rate of erroneous scheduling decisions in queue-length-based scheduling policies.

We begin by evaluating the probability that a queue i, holding n jobs, completes processing them before another queue, j, completes processing its m jobs.

Lemma 1: Let {x_i}, i = 1..n, denote a set of n jobs with exponentially distributed lengths with rate µ, processed in queue X. We assume the jobs are processed in order, i.e., x_1 completes processing before x_2, x_2 completes processing before x_3, etc. Let {y_i}, i = 1..m, denote a set of m jobs with exponentially distributed lengths with rate µ, processed in queue Y; again, the jobs are processed in order. Let Pr[X,Y] denote the probability that queue X completes processing its n jobs before queue Y completes processing its m jobs. Then

Pr[X,Y] = Σ_{k=n}^{m+n−1} C(k−1, n−1) · 0.5^k.

Proof: Let s denote the ordered vector of job completions in the above two-queue system. For example, consider a case where n = 2 and m = 2; then s = [y_1, x_1, x_2, y_2] describes one possible instantiation of events in which queue X completes processing before queue Y. Let s[i] denote the i-th job that completes processing in the system; in the example above, s[2] = x_1. We evaluate Pr[X,Y] by enumerating all possibilities for s and the probability of each. First, we note that in every case where X completes before Y, it holds that s[m+n] = y_m. Let k denote the index in s of the last job of X to complete, i.e., s[k] = x_n. It holds that k ≥ n, because X holds n jobs and processes them in order. It also holds that k ≤ m+n−1, because the total number of jobs in the system is m+n and the last job to complete is y_m. We note that s[k+1], s[k+2], ..., s[n+m] holds the ordered set of jobs from Y that complete processing after X has completed the processing of its n jobs. Fix some value of k. The number of possible s vectors in such a case is C(k−1, n−1) (recall that s[k] = x_n). For each of the first k places in the vector, both queues race to complete their next respective job; since these are two exponential processes with the same rate µ, the probability of either of them winning is 0.5. We note that the memoryless property of exponential processes plays a key role here: the probability of either queue winning is 0.5 regardless of which of them won in the previous steps. Therefore, for every possible value of k, where n ≤ k ≤ m+n−1, we have C(k−1, n−1) possible vectors, and the probability of each such vector is 0.5^k, and the lemma follows.

Next, we use Pr[X,Y] to show that the higher the load in the system, the higher the probability that the scheduling mechanism makes a scheduling error. Consider a pair of queues <X_1, X_2>. Let X_i denote the number of jobs in queue i, and let X_2 = X_1 + Δ, that is, queue X_2 has Δ jobs more than queue X_1. Figure 1 depicts Pr[X_2, X_1], the probability that the longer queue completes first, as a function of X_1 (the number of jobs in X_1, the shorter queue). As the figure shows, Pr[X_2, X_1] rises with X_1. Furthermore, for higher values of Δ, the probability to err goes down.

Fig. 1. Expected error rate, as a function of the number of jobs in the short queue (one curve per value of the queue-length difference Δ).

And so, consider two systems operating under different loads. The system running under the higher load is expected to have a higher average queue length. Hence, when the scheduling mechanisms in these systems have to choose a server under similar conditions (i.e., a case where the difference between the two pairs of queues is the same), the scheduler operating in the system running under the higher load is more likely to err.
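The closed form in Lemma 1 is easy to evaluate and to check against a direct Monte Carlo estimate. A small sketch (our own, with illustrative parameters):

```python
import math
import random

def p_x_before_y(n, m):
    """Pr[X,Y] of Lemma 1: the probability that a queue holding n jobs
    empties before a queue holding m jobs, both served at the same
    exponential rate."""
    return sum(math.comb(k - 1, n - 1) * 0.5 ** k for k in range(n, n + m))

def monte_carlo(n, m, mu=1.0, trials=100_000):
    """Direct estimate: compare sums of n and m exponential job lengths."""
    wins = 0
    for _ in range(trials):
        x = sum(random.expovariate(mu) for _ in range(n))
        y = sum(random.expovariate(mu) for _ in range(m))
        wins += x < y
    return wins / trials
```

For example, p_x_before_y(3, 5) = 0.7734..., so a queue that is two jobs longer still empties first more than 22% of the time; monte_carlo(3, 5) agrees to within sampling noise.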
Finally, we combine the results of [12] with the value Pr[X,Y] above to establish the probability that the Supermarket Model makes erroneous scheduling decisions. For simplicity, we evaluate this for the case of d = 2, i.e., two servers are sampled to make the scheduling decision. In [12], the author establishes that, when running the Supermarket Model, the probability that a queue holds exactly i jobs, denoted by p_i, is λ^(2^i − 1) − λ^(2^(i+1) − 1). Therefore, when sampling two servers to make the scheduling decision, the probability that the Supermarket Model scheduler errs is:

Σ_{i=0}^{∞} Σ_{j=0}^{∞} p_i · p_j · Pr[Max(i,j), Min(i,j)].   (1)

Figure 2 depicts the probability of making wrong decisions based on Equation 1, together with results gathered in simulation. Each simulation run includes a large number of jobs, and each point in the graph reflects the average of at least 3 runs. Upon the arrival of a job, two servers are sampled for their queue lengths and the expected completion times of all the jobs in their queues (within our simulation environment, we can determine the completion time, since we know the processing time of each job). Given these two data points, we evaluate the match between them. In case the Supermarket scheduling policy would choose the server that is expected to go idle first, we denote this decision as correct. We assume that when both queues are non-empty and have the same length, the Supermarket scheduling policy would be right in half of the cases.
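Equation 1 is easy to evaluate numerically; the sketch below (ours, reusing p_x_before_y from the previous sketch) reproduces the model curve of Figure 2:

```python
def supermarket_err(lam, imax=40):
    """Equation (1) for d = 2, using the law from [12] that a queue holds
    at least i jobs with probability lam**(2**i - 1). The truncation of
    the infinite sums at imax is an implementation choice of ours."""
    p = [lam ** (2 ** i - 1) - lam ** (2 ** (i + 1) - 1) for i in range(imax)]
    # error event: the longer (Max) queue empties before the shorter (Min) one
    return sum(p[i] * p[j] * p_x_before_y(max(i, j), min(i, j))
               for i in range(imax) for j in range(imax))
```

Note that the i = j terms contribute p_i^2 · 0.5 for equal non-empty queues and nothing when both queues are empty (the sum in p_x_before_y is then empty), which matches the tie accounting just described.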

Fig. 2. Correctness of scheduling decisions as a function of the load (legend: correct decision, simulation; wrong decision, simulation; wrong decision, model).

Fig. 3. Correctness of scheduling decisions as a function of the load, broken down by decision type (correct decision; tie break, empty queues; tie break, non-empty queues; wrong decision).

Fig. 4. The Markov chain representing the system state (arrival rate Nλ at every state; service rates µ, µ_d(2), µ_d(3), ..., µ_d(i), ...).

We attribute the difference between the model and the simulations to the assumption, implicit in Equation 1, that the server states are independent of each other, while, in the presence of an effective load balancing scheme such as the Supermarket Model, this is not the case. Figure 3 further breaks down the simulation results into four types of decisions. A decision is considered correct if one queue is strictly shorter than the other, and the server with that shorter queue completes processing all its jobs strictly before the other server. There are two types of tie decisions: one is when both queues are empty (termed "Tie Break, Empty Queues" in the figure); in such a case, any decision the scheduler makes would be right. The other type is when both queues are non-empty (termed "Tie Break, Non-empty Queues" in the figure); in such a case, assuming the scheduler randomly chooses a server, it may err with probability 0.5. Finally, a decision is determined to be wrong if one queue is strictly shorter than the other, but the server with that shorter queue is expected to complete processing all its jobs strictly after the other server. The above results indicate that, when basing the scheduling policy on queue lengths, the ratio of erroneous decisions rises with the load to a considerable level; for example, for loads above 90%, some 30% of the decisions are wrong. That is, exactly when the system particularly needs the load balancing mechanism to perform at its best, a queue-length-based mechanism performs at its worst. Our scheme, which relies on actual completion times, addresses this problem.

V. ANALYSIS

We proceed to analyze the performance improvement gained by the addition of the replication scheme, as described in the previous sections, to a classical system of M/M/1 queues, i.e., with Poisson arrivals and exponentially distributed service times (with rate µ). We employ a state-transition-rate diagram [10] to analyze the problem from a system-wide perspective. We model the system in a similar way to an M/M/N queue; that is, the state's index in the Markov chain (Figure 4) represents the number of jobs in the system (and, accordingly, p_i denotes the probability that there are i jobs in the system). Note that we refer to effective jobs, i.e., a single copy of each job. Therefore, jobs enter at a rate of Nλ, where N is the number of servers. We turn to evaluate the processing rate. This rate is different from that of the M/M/N system, because several jobs may end up in the same queue, leaving other servers free. For example, consider a case where d = 2 and there are two jobs in the system. These jobs occupy two servers, and therefore the service rate is 2µ (generally, as long as the number of jobs in the system, m, is d or less, each job ends up in a server of its own, and so the service rate is mµ). When a third job arrives, its two copies may end up in the same, already busy, servers; in such a case, the service rate remains 2µ. On the other hand, it is sufficient that one of the replicas finds an idle server to activate that server and make the service rate 3µ.
In any case, the service rate is upper-bounded both by mµ and by Nµ, since a job cannot occupy more than one server, and there are N servers in the system. An arrival of a job to the system may add, at most, one server to the number of occupied servers (since, in case multiple copies arrive at free servers, all copies but one get dropped right away). Let P(m,n) denote the probability that, when there are m jobs in the system, they occupy exactly n servers. P(m,n) is expressed through the following recursive expression:

P(m,n) = P(m−1, n) · [n!(N−d)!] / [(n−d)!N!] + P(m−1, n−1) · (1 − [(n−1)!(N−d)!] / [(n−1−d)!N!])   (2)

(the first coefficient is C(n,d)/C(N,d), the probability that all d replicas of the arriving job fall on already-occupied servers), and is explained as follows: the probability that m jobs in the system occupy exactly n servers depends on the state of

the system when there were m−1 jobs. Either there were n servers occupied by these m−1 jobs, and, upon the arrival of the m-th job, all of its d replicas were sent to already occupied servers, or n−1 servers were occupied and one (or more) of the m-th job's replicas arrived at an unoccupied server. We note that more than one replica of the job may arrive at an idle server; however, only one server will process that job (while the others drop it once the removal signal is received); thus, this case is covered by the latter part of the equation. The stopping conditions for this recursive expression are as follows: P(m,m) = 1 for m ≤ d, and P(m,n) = 0 for m ≠ n, m ≤ d. These manifest the fact that, when there are m jobs in the system and m ≤ d, exactly m servers are active. In addition, when there is any (non-zero) number of jobs in the system, it is impossible for all servers to be idle; this yields the additional stopping condition P(m,0) = 0 for m > 0. Therefore, the expected service rate at a state k, assuming k > d, is:

µ_d(k) = Σ_{i=d}^{min(N,k)} i · P(k,i) · µ,

whereas in the case of k ≤ d: µ_d(k) = k · µ.

Figure 5 presents the expected service rate as a function of the number of jobs in the system. The figure indicates that even a very small value of d can provide a significant improvement: while in the case of random assignment (d = 1) many more jobs are required to bring the system close to full utilization (i.e., all N servers active), far fewer jobs suffice to reach the same utilization level when d = 2. This, in turn, implies that the average length of the queue at the servers is smaller when d = 2.

Fig. 5. Expected service rate, as a function of the number of jobs in the system (from d = 1, i.e., random assignment, up to d = N, i.e., M/M/N).

Next, we turn to show that the stability condition of the Markov chain presented in Figure 4 is (as in a regular M/M/N system) λ < µ. We note that when d = N this holds trivially, as the system behavior converges to that of the M/M/N system. We thus focus on the case d < N. For that purpose, we first establish the following property of the probabilities P(m, N−1).

Lemma 2: There exists some M, such that for every m ≥ M, P(m, N−1) > 0.

Proof: By the structure of Equation 2, to prove the lemma it is sufficient to exhibit a single value M for which P(M, N−1) > 0. Consider the case of M = N−1, i.e., the number of jobs currently in the system is one less than the number of servers. It is easy to see that, for every d < N, the random choices of servers may be such that each of the M = N−1 jobs is processed by a different server. This implies that P(N−1, N−1) > 0, and the lemma follows.

Theorem 1: The Markov chain presented in Figure 4 has a steady state for λ < µ.

Proof: To prove the theorem, it is sufficient to show that lim_{m→∞} µ_d(m) = Nµ. For that purpose, it is sufficient to show that lim_{m→∞} P(m,N) = 1. First, we simplify the expression for P(m,N) by setting n = N in Equation 2, noting that 1 − C(N−1,d)/C(N,d) = d/N:

P(m,N) = P(m−1,N) + (d/N) · P(m−1,N−1).

Therefore, following Lemma 2, there exists some M such that, for every m > M, P(m+1,N) > P(m,N). We conclude that, for m > M, P(m,N) is a strictly ascending sequence, upper-bounded by 1, and the theorem follows.

When the steady-state conditions are met, we get the following set of equations: p_0 · Nλ = p_1 · µ, and, for every i, p_i · Nλ = p_{i+1} · µ_d(i+1). In addition, since the p_i's are probabilities, we require that Σ_{i=0}^{∞} p_i = 1.
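Equation 2 and the chain's steady state are straightforward to evaluate without an optimization package. The following sketch is our own (the paper used Matlab's FMINCON; N, µ, and the truncation level here are illustrative): it computes µ_d(k) from Equation 2 and then the queuing overhead via Little's theorem.

```python
from math import comb

N, d, mu = 100, 2, 1.0   # illustrative system parameters

def occupancy_table(max_jobs):
    """P[m][n]: probability that m jobs occupy exactly n servers (Eq. 2),
    built bottom-up. For m <= d every job sits in a server of its own."""
    P = [[0.0] * (N + 1) for _ in range(max_jobs + 1)]
    for m in range(min(d, max_jobs) + 1):
        P[m][m] = 1.0
    c_all = comb(N, d)
    for m in range(d + 1, max_jobs + 1):
        for n in range(d, min(m, N) + 1):
            hit = comb(n, d) / c_all        # all d replicas hit busy servers
            miss = 1.0 - comb(n - 1, d) / c_all  # comb() is 0 when n-1 < d
            P[m][n] = P[m - 1][n] * hit + P[m - 1][n - 1] * miss
    return P

def queuing_overhead(lam, max_states=1500):
    """Solve p_i * N * lam = p_{i+1} * mu_d(i+1) on a truncated chain,
    then apply Little's theorem and subtract the mean service time."""
    P = occupancy_table(max_states)

    def mu_d(k):
        if k <= d:
            return k * mu
        return mu * sum(i * P[k][i] for i in range(d, min(N, k) + 1))

    p = [1.0]                               # unnormalized steady-state probs
    for i in range(max_states):
        p.append(p[-1] * N * lam / mu_d(i + 1))
    total = sum(p)
    mean_jobs = sum(i * pi for i, pi in enumerate(p)) / total
    return mean_jobs / (N * lam) - 1.0 / mu  # mean time spent queued
```

Since λ < µ, the tail of the p_i decays geometrically, so truncating the chain, like the finite system used in the paper's numerical computations, introduces negligible error for loads that are not extremely close to 1.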

Fig. 6. Average queuing overhead, no delays, low loads (model vs. simulation for several values of d, with M/M/1 as reference).

Fig. 7. Average queuing overhead, no delays, high loads (model vs. simulation for several values of d, with M/M/1 as reference).

By solving the above set of equations, we can compute the average queue length through:

Q = Σ_{i=0}^{∞} i · p_i.   (3)

We adopt the queuing overhead measurement metric used in [2]: we measure the quality of the proposed solution by evaluating the average time that a job spends in the queue (disregarding the processing time, as load balancing schemes cannot improve it). To compute the queuing overhead, we compute the average time in the system using Little's theorem and the average queue length (described by Equation 3), and subtract the average job processing time. Unfortunately, Equation 3 has no simple closed-form solution, hence we resort to numerical computations. Figures 6 and 7 depict the expected queuing overhead as predicted by our model (and evaluated numerically), as well as by data gathered through simulation. As a reference, we also show the expected queuing overhead of an M/M/1 system (which is equivalent to random assignment, or to activating our scheme with d = 1). We used an in-house event-driven simulator to simulate the behavior of the system. Simulations were carried out for various loads, with µ = 1; each simulation sequence contained a large number of jobs, and each value was computed as the average of several such runs. Our numerical computations were conducted for a finite system that can hold a bounded number of jobs, and were solved using Matlab's FMINCON optimization function. The figures show that the simulation results match our theoretical model to a high level of accuracy; furthermore, as the value of d grows, so does the accuracy of the model. For example, in the case of d = 2, the model and simulation results are only a few percent apart for all loads under 90%. These results indicate that the proposed scheme offers high value across all loads when compared to the basic random assignment policy. As the number of copies in use, d, grows, the improvement grows as well. For the case of d = 2, the improvement (measured as the ratio between the queuing overhead of random assignment and that of the proposed scheme) is over a factor of 9 for all loads. For the case of d = 3, the minimal improvement is by a factor of 9.9, and it is larger still for d = 4.

VI. THE EFFECT OF DELAY

Thus far we have disregarded an important aspect, namely the signal propagation delay in the system, that is, the time it takes for the removal message to travel from one server to another. Clearly, this delay has a detrimental impact on the performance of the proposed scheme. Consider, for example, a case of two jobs arriving at two free servers, the first job arriving at time t and the second job arriving at time t + ε. In the absence of replicas, each job would start processing immediately upon its arrival; however, with the replication scheme employed, consider a case where the replicas of the two jobs end up at the same two servers. Both replicas of the first job would begin processing at time t. Assuming the signal propagation time is greater than ε, the two copies of the second job would be queued at each of the servers behind the first job, until one of the copies is removed upon the arrival of the removal signal.
Thus, the completion of the second job is delayed by T_d − ε, where T_d is the signal propagation time. If the delay T_d is large, it is even possible for both replicas of the same job to complete execution at different servers before any removal signal arrives. Figures 8 and 9 present results of simulations that incorporated delays. The simulation setup was the same as the one used in the previous section; additional figures appear in Appendix XI and Appendix X. Our results show that, as can be expected, the average queuing overhead rises with the signal propagation delay and with the number of copies. As long as the signal propagation delay remains small compared to the (unit value) average job processing time and the number of copies is small (i.e., d ≤ 3), the replication scheme still offers a substantial improvement in queuing overhead when compared to random assignment. We note that higher values of signal propagation delay imply that the system's performance bottleneck is not in its computation resources, but rather in the networking resources. Having internal management signals delayed by a time similar to, or larger than, that required to process an average job implies that the end user would experience excessive latency due to networking delays.

Fig. 8. Expected service rate in presence of signal propagation delays, d = 2, high load values (one curve per delay value).

Fig. 9. Expected service rate in presence of signal propagation delays for several values of d, at a fixed delay, high load values.

VII. ADDRESSING DELAYS

We proceed to propose heuristic schemes for addressing the degradation in performance incurred by signal propagation delay. The primary phenomenon we address is that, due to the good balance between the servers and the fact that servers process jobs in order of their arrival, replicas of the same job often reach the heads of the queues at their servers at roughly the same time, causing redundant processing and delaying other jobs. In the following subsections we introduce three heuristics to improve system performance in the presence of delay. Our first two heuristics suggest a different job selection policy. The first one is completely independent of our replication scheme, and simply suggests that servers select jobs from the queue in random order (instead of based on their arrival time). For the second heuristic we require that the list of servers holding the same job's replicas, reported in each replica, be organized in random (yet identical in all replicas) order; we recall that such a list is required for the employment of our scheme. This random order is then interpreted as a prioritized list for the execution of jobs. Finally, our third heuristic handles signal propagation delays by requiring servers to send the removal signal before the job makes it to the head of the queue. We chose the following six setups for testing our heuristics: when the scheme is set to send two copies (i.e., d = 2), the signal propagation time is 0.7 or higher, and the load is either moderate or 90%; when the scheme is set to send three copies, the signal propagation time is moderate and the load is either 90% or 97%. These setups were chosen since they exhibited significant performance degradation (in terms of average queuing overhead) with respect to the equivalent setups with no delays. We aim to show that, with proper heuristics, the operational boundaries within which the replication scheme provides a significant improvement can be extended considerably, hence mitigating the detrimental effect of signal propagation delays.

A. Location in Server Vector

In this heuristic, all replicas of each job hold the list of servers to which they are assigned in a randomly ordered list (yet it is the same list in all replicas). When a server completes the processing of some job and is required to select the next job from its queue for processing, the server selects, among the first k jobs (where k is a parameter of the heuristic), the job that holds the present server's identity in the highest location. In case of a tie (i.e., two jobs holding the identity of the server in the same location), the server prefers the job that arrived first. For example, in the case of three copies per job (i.e., d = 3), each job replica has an (identical) list of size three holding the identities of the servers to which replicas of this job have been sent. A server would therefore prefer a job that holds its identity in the first place in the list over a job that holds its identity in any other location. Specified as above, the heuristic is prone to starvation.
Indeed, since servers now choose jobs from their queue according to priorities, a job may remain at the head of the queue for an excessively long time. Since such behavior may be prohibitive in some settings, we augment our heuristic with a starvation-prevention mechanism. Specifically, when starvation-prevention is activated, after a job reaches the head of the queue, we allow the server not to choose it on three occasions; afterwards, it is chosen for service deterministically. We note that this scheme is similar in nature to a priority-based scheme, where jobs are assigned to different priority classes, and servers apply the FCFS policy within each priority class. This scheme also has the desirable property that each job has one replica that is deemed high priority by the server to which it is assigned.
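A sketch of the selection step of this heuristic, with the starvation-prevention counter (our rendering; the dict-based job representation and helper names are ours, while the three-skip limit follows the description above):

```python
import random

MAX_SKIPS = 3   # starvation prevention: the head job may be passed over
                # at most three times before it must be served

def make_job(job_id, servers):
    """Create a job; the shuffled server list is shared by all d replicas."""
    order = servers[:]
    random.shuffle(order)
    return {'id': job_id, 'servers': order, 'skips': 0}

def select_next(queue, my_id, k):
    """Location-in-server-vector policy: among the first k queued jobs,
    serve the one that ranks this server highest in its server list;
    earlier arrival breaks ties."""
    if not queue:
        return None
    if queue[0]['skips'] >= MAX_SKIPS:        # forced choice: no starvation
        return queue.pop(0)
    window = queue[:k]
    best_pos, best = min(
        enumerate(window),
        key=lambda t: (t[1]['servers'].index(my_id), t[0]))
    if best_pos != 0:
        queue[0]['skips'] += 1                # head job passed over again
    del queue[best_pos]
    return best
```

Because every queued replica was explicitly sent to this server, my_id always appears in its server list, so the priority lookup is well defined.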

Figure 10 shows the average queuing overhead for two copies under a load of 90% with a delay of 0.7 and the starvation-prevention mechanism activated. The simulation setup was the same as in the previous sections. For reference, we also present the average queuing overhead in this setup with and without delays. Additional figures can be found in Appendix XII.

Fig. 10. Average queuing overhead with location-in-vector selection, d = 2, T_d = 0.7, load = 90%, with starvation-prevention.

Fig. 11. Average queuing overhead with random selection, d = 2, T_d = 0.7, load = 90%, with starvation-prevention.

Our results show that this heuristic is efficient under very high loads. When the value of k is kept small (e.g., k = 2, 3), the queuing overhead drops sharply. As the value of k grows, the effect of changing it levels off, as even a moderate value of k typically enables the server to consider the entire queue. With a high enough value of k, this scheme is able to overcome the detrimental effects of signal propagation delay and restore the high gain of our duplication scheme. When the load is low, the queuing overhead was not very high to begin with, but the prioritized selection policy reduces the overhead further, making the duplication scheme at least 3 times better than random assignment. Finally, we note that the activation of the starvation-prevention mechanism has very little effect on (average) performance; that is, this scheme can still grant high performance while avoiding job starvation.

B. Random Selection

In this heuristic, we simply replace the first-come-first-served (FCFS) job selection policy with a simple random selection. That is, when a server needs to select the next job for processing, it chooses, uniformly at random, one of the first k jobs in its queue, where k is a parameter of the heuristic. If there are fewer than k jobs in the queue, the server chooses, uniformly at random, one of the present jobs. We note that, for k = 1, this policy converges to the FCFS rule. As in the previous heuristic, starvation may be experienced, hence we augment this heuristic with the same starvation-prevention mechanism. Figure 11 shows the average queuing overhead for two copies under a load of 90% with a delay of 0.7 and with the starvation-prevention mechanism activated. The simulation setup was the same as in the previous sections. For reference, we also present the average queuing overhead in this setup with and without delays. Additional figures can be found in Appendix XII. Our results show that this scheme exhibits results similar to the previous one; that is, it is highly efficient under high loads, and it gets better for higher values of k. As in the case of the location-in-vector heuristic, here too the activation of the starvation-prevention mechanism has very little effect on performance.

C. Pre-Sending

In this heuristic, the servers keep operating according to the first-come-first-served policy, but the rule for sending the removal signal is changed. Specifically, instead of sending the removal signal when the job arrives at the head of the queue, it is sent either when the job arrives at the k-th location in the queue (where k is a parameter of the heuristic), or upon arrival, when the job arrives at a server that has fewer than k jobs. This change requires modifying the rules for removing a job upon the arrival of its corresponding removal signal. Consider, for example, a case where d = 2, k = 3, and one replica of the job arrives at an idle server while the other replica arrives at the third location of a different server. Both servers send a removal signal upon the arrival of the job. This might cause both replicas to be removed, a situation that we must prevent.
Ideally, we would like the replica that arrived at the idle server to be processed, and the other replica (the one that arrived at the third location in a server's queue) to be removed. We accomplish this in the following way. First, we augment the removal signal with the queue position from which the signal was sent; e.g., if the signal is sent once the job arrives at the second place in the queue (that is, there is one job being processed and another job at the head of the queue), the signal will contain the value 2. Then, the job removal rule is changed as follows. Upon the arrival of a removal signal from a server i at a server j, let k_i denote the location of the replica in the queue of server i when it sent the signal (as reported in the removal signal), and let l_j denote the current location of the corresponding replica in the queue of the server receiving the signal. If l_j ≤ k, the server has already sent its own removal signal for this replica; let k_j denote the location of this replica in the queue of server j at the time that it sent its signal.

The removal rule is as follows: if l_j > k, remove the job; if k_j > k_i, remove the job; if k_j = k_i and i < j, remove the job; otherwise, ignore the signal. We turn to establish a critical property of the above removal rule, namely, that it guarantees that at least one replica of each job completes processing. We note that, since the removal rule defines the behavior of a single sender and a single receiver, we prove it for the case of d = 2 and note that the proof easily extends to larger values of d.

Theorem 2: The above removal rule guarantees that at least one replica of each job completes processing.

Proof: Assume, by way of contradiction, that there is a job for which neither of its replicas is processed. Let i and j be the identifiers of the servers to which the two replicas are assigned, i < j. It follows that both servers drop their replicas in response to removal signals, which implies that both servers send such signals. Denote by k_i and k_j the locations of the replicas at servers i and j, respectively, at the times that they send their removal signals. Clearly, k_i ≤ k and k_j ≤ k. We distinguish between three cases, namely: (i) k_i < k_j, (ii) k_i > k_j, and (iii) k_i = k_j. First, consider case (i). When the removal signal from j arrives at i, either the replica at i has already been processed, contradicting the assumption, or it is located in i's queue at a position l_i ≤ k_i. This, together with k_i < k_j (per case (i)), makes the removal rule ignore the removal signal; hence the replica is eventually processed at i, contradicting the assumption. Case (ii) is handled similarly, by considering the application of the removal rule at server j. Finally, consider case (iii). Again, when the removal signal from j arrives at i, either the replica at i has already been processed, contradicting the assumption, or it is located in i's queue at a position l_i ≤ k_i. This, together with k_i = k_j (per case (iii)) and i < j, makes the removal rule ignore the removal signal; hence the replica is eventually processed at i, contradicting the assumption. We conclude that, in all cases, at least one replica is processed at some server, thus establishing the theorem.
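Stated as code, the rule is a small predicate; in this sketch (ours; argument names mirror the notation above), server j applies it to a signal received from server i:

```python
def should_remove(l_j, k, k_j, k_i, i, j):
    """Pre-sending removal rule at receiving server j, for a signal from
    server i. l_j: current position of the local replica in j's queue;
    k: the pre-send threshold; k_j: position from which j sent its own
    signal (meaningful only when l_j <= k); k_i: position reported in the
    incoming signal. Returns True iff j must drop its replica."""
    if l_j > k:               # j has not signalled yet: always yield
        return True
    if k_j > k_i:             # j signalled from deeper in its queue
        return True
    if k_j == k_i and i < j:  # tie: the lower-indexed server keeps the job
        return True
    return False              # j signalled from a strictly better position
```

Tracing the three cases of the proof through this predicate shows that, for any pair of signalling servers, exactly one side evaluates to False and keeps its replica.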
We note that when the parameter k is set to 1, the scheme's behavior converges to the basic scheme (the small differences in the simulation results are due to the randomness of the simulation environment). Furthermore, for large k, the scheme converges to the behavior of the Supermarket Model in the selection of the executing server for each job. However, our scheme is expected to outperform the Supermarket Model, since the Supermarket Model delays every job by T_d to gather the scheduling information. (Here we assume that any implementation of the Supermarket Model would parallelize the collection of the scheduling information, and would thus delay jobs by just T_d, regardless of the value of d.) Figure 12 shows the average queuing overhead for two copies under a load of 90% with a delay of 0.7 and the pre-sending policy. The simulation setup was the same as the one used in the previous sections. For reference, we also present the average queuing overhead in this setup with and without delays. Additional figures can be found in Appendix XII.

Fig. 12. Average queuing overhead with pre-sending, d = 2, T_d = 0.7, load = 90%.

Our results show that this scheme exhibits poor performance for low values of k, to the extent of performing worse than random assignment. For higher values of k, the scheme provides excellent performance under high loads. As k grows further, performance degrades again, as the scheme begins to behave like the Supermarket Model and errs in correlating queue lengths with expected completion times.

VIII. REAL-WORLD IMPLEMENTATION AND EVALUATION

To demonstrate the effectiveness of our technique, we implemented such a server-based system and tested its performance on the Amazon EC2 system [1]. In the implementation, each server comprises the following three main components. (1) Execution Component: this is the component that performs user requests. When a new job arrives, this component dequeues and processes it. Prior to the actual processing, a removal signal is sent to all the servers holding copies of that job. We note that the removal signal is sent in a non-blocking manner, thus preventing any delay to the execution of the job. Finally, a response message is sent to the client indicating the completion of the job. The processing of a job is implemented by a simple busy-wait loop, where the duration of the wait is exponentially distributed. The Execution Component may be interrupted by the Job-Removal Component (described below), indicating that a signal has been received, reporting that the current job is also being processed by another server (of lower ID, see below). Upon an interrupt, the Execution Component stops the processing of the current job, drops it, and proceeds with the processing of the next job request. The Execution Component processes jobs based on their order of arrival. (2) Queuing Component: this component waits for job requests. Upon such a request, this component places it in the queue. In case the Execution Component is idle, the Queuing Component notifies it so that the job goes into processing. (3) Job-Removal Component: this component waits for signals arriving from other servers in the system. The arrival

of such a signal indicates that another server has started processing a job request. In case the indicated job is currently being processed by the Execution Component, and the removal signal has arrived from a server with a lower IP address (we use lexicographical comparison of the addresses to determine the order), the Job-Removal Component interrupts the Execution Component, so that the latter can discard this job; the latter condition is required to guarantee that at least one replica of the job will be processed. In case the job is in the queue (i.e., still not processed), the Job-Removal Component removes it. The entire server is implemented in Java, where each component is implemented as a Java thread. In addition, we have also implemented a simple client that generates job requests and sends them to the servers, gathers responses, and produces logs that were processed to create the system statistics. The client determines each job's expected processing time; we generated jobs with a fixed average execution time. To simplify the implementation, all interactions between the client and the servers, as well as among the servers, were implemented over UDP. In order to test the performance of our proposed scheme, we launched copies of the above-described server on different Amazon EC2 instances of type Small. An Amazon AWS Small instance includes a single core with 1.7 GB of RAM on a 32-bit platform. We chose to use instances running Linux, all launched within the same geographical region. In stand-alone testing, we measured 3 ms on average for a UDP transmission to transit between servers in the same geographical region. Every run described in the results below included a large number of job requests. While running our real-world setup, several phenomena characteristic of actual systems were encountered, such as UDP packet loss (which occurred for a small fraction of a percent of all packets). Another such phenomenon had to do with signal propagation time, as follows. On very rare occasions, the original message from the client arrives at one server and starts processing there, and the signal that is supposed to trigger the removal of its replicas arrives at one of the other servers before the job has been queued there. To address this case, we added a shared object between the Job-Removal Component and the Queuing Component. This shared object is used to track pending job-removal messages, so that if this particular event occurs, the Queuing Component can determine, upon the job's arrival, that there is no need to queue it and it can be dropped.
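The following sketch (ours; the actual system is written in Java, and the in-service interrupt path with its IP-address tie-break is omitted) illustrates how a small shared object resolves the signal-outruns-job race described above:

```python
import threading

class JobQueue:
    """Shared state between the Queuing and Job-Removal components: a
    removal signal may outrun the job request itself, so removals for
    not-yet-queued jobs are remembered and applied on arrival."""
    def __init__(self):
        self.lock = threading.Lock()
        self.queue = []                  # queued (job_id, payload) pairs
        self.pending_removals = set()    # removal signals seen before the job

    def enqueue(self, job_id, payload):
        with self.lock:
            if job_id in self.pending_removals:
                self.pending_removals.discard(job_id)
                return False             # job already claimed elsewhere: drop
            self.queue.append((job_id, payload))
            return True

    def on_removal_signal(self, job_id):
        with self.lock:
            for pos, (jid, _) in enumerate(self.queue):
                if jid == job_id:
                    del self.queue[pos]  # normal case: drop the queued replica
                    return
            self.pending_removals.add(job_id)  # signal arrived early: remember
```

Holding a single lock across both paths makes the arrival-versus-signal race harmless regardless of interleaving.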
Specifically, we showed that duplicating jobs and sending replicas to different servers can significantly reduce the queuing time, even with a small number of replicas and in particular in high loads. Moreover, we showed that, while our scheme is prone to degradation in presence of signal propagation delay, simple heuristics may be used to overcome this problem. Our findings have been obtained through analysis, simulation and a real-world implementation over a commercial cloud. REFERENCES [] Amazon web services. [Online]. Available: [] D. Breitgand, R. Cohen, A. Nahir, and D. Raz, On Cost-Aware Monitoring for Self-Adaptive Load-Sharing, IEEE Journal on Selected Areas in Communication, vol., no., pp. 7 3,. [3] V. Cardellini, M. Colajanni, and P. Yu, DNS dispatching algorithms with state estimators for scalable web-server clusters, World Wide Web, vol., pp. 3, 999.

[4] S. Fischer, "Distributed load balancing algorithm for adaptive channel allocation for cognitive radios," in Proc. of the 2nd Conf. on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom), 2007.
[5] V. Gupta, M. Harchol-Balter, K. Sigman, and W. Whitt, "Analysis of join-the-shortest-queue routing for web server farms," Perform. Eval., vol. 64, no. 9-12, Oct. 2007.
[6] M. Harchol-Balter, M. E. Crovella, and C. Murta, "On choosing a task assignment policy for a distributed server system," Cambridge, MA, USA, Tech. Rep.
[7] M. Harchol-Balter, "Task assignment with unknown duration," J. ACM, vol. 49, 2002.
[8] B. Hayes, "Cloud computing," Commun. ACM, vol. 51, no. 7, 2008.
[9] M. Isard, "Autopilot: automatic data center management," SIGOPS Oper. Syst. Rev., 2007.
[10] L. Kleinrock, Queueing Systems, Vol. I: Theory. Wiley Interscience, 1975.
[11] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. G. Greenberg, "Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services," Perform. Eval., vol. 68, no. 11, 2011.
[12] M. Mitzenmacher, "The power of two choices in randomized load balancing," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 10, October 2001.
[13] M. Mitzenmacher, "How useful is old information?" IEEE Trans. Parallel Distrib. Syst., vol. 11, no. 1, 2000.
[14] A. Nahir, A. Orda, and D. Raz, "Schedule First, Manage Later: Network-Aware Load Balancing," Department of Electrical Engineering, Technion, Haifa, Israel, Tech. Rep. [Online].

X. APPENDIX B

[Additional simulation results: expected service rate in the presence of signal propagation delays. One pair of figures (low load values, high load values) per delay value T_d, ranging from small fractions of the mean job processing time up to several times that value, each plotted for several values of d.]

XI. APPENDIX A

[Figures: expected service rate in the presence of signal propagation delays, with the number of replicas d fixed per plot, shown separately for low and for high delay values and for low and for high load values. Plots omitted.]
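The grids in this appendix fix the replication factor d in each plot and sweep the propagation delay and the offered load. A sweep of the same shape can be generated from the `simulate` sketch above; the grid values below are placeholders, since the original axis values did not survive extraction.

```python
# Hypothetical grid mirroring the structure of these plots (values assumed).
for d in (2, 3, 4):
    for t_delay in (0.1, 0.5, 1.0, 2.0):
        for load in (0.5, 0.9):          # "low" and "high" load
            eff, _ = simulate(d=d, t_delay=t_delay, load=load, n_jobs=50_000)
            print(f"d={d} T_d={t_delay} load={load}: useful fraction {eff:.3f}")
```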

XII. APPENDIX C

A. Random Selection

[Figures: average queueing overhead with random selection, for several numbers of replicas d, propagation delays T_d, and load levels (up to 97%), each shown both with and without starvation-prevention. Plots omitted.]
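These plots report average queueing overhead for the random-selection variant of the scheme, with and without its starvation-prevention mechanism; neither heuristic is defined in this excerpt, so the snippet below only shows how the baseline overhead at one operating point would be read off the `simulate` sketch above (the operating point itself is a placeholder).

```python
# Baseline measurement only; the random-selection and starvation-prevention
# variants shown in the figures are not reconstructed here.
_, overhead = simulate(d=3, t_delay=0.7, load=0.9, n_jobs=50_000)
print(f"average queueing overhead ≈ {overhead:.3f} mean service times")
```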
