Distributed Oblivious Load Balancing Using Prioritized Job Replication


Amir Nahir, Department of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel
Ariel Orda, Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa, Israel
Danny Raz, Department of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel

Abstract

Effective load balancing in large distributed server systems is a highly complex optimization problem, which often involves multiple objectives and many parameters. Irrespective of the precise optimization criteria, any attempt to address such an optimization problem would incur significant overhead due to the need to collect the required (state-dependent) information from the various servers. In this study we propose to tackle the problem through an oblivious approach, i.e., a distributed load-sharing scheme that does not use any state information. Our scheme is based on creating, in addition to the regular job requests that are assigned to a randomly chosen server, low priority job request replicas that are sent to a different server. We show that, when servers can coordinate the removal of redundant copies upon completion of a job, the performance of the system exhibits a significant improvement of up to a factor of ., even under high load conditions, if job lengths are exponentially distributed, and a more dramatic improvement of over a factor of for typical loads, when job lengths adhere to a heavy-tailed distribution. For cases where such coordination is not feasible, we propose simple, coordination-free schemes, which still yield a significant improvement of up to a factor of . for exponentially distributed job lengths and of over a factor of . for the heavy-tailed case. We then design a hybrid scheme, which, while requiring just limited coordination, retains the main benefits of the full-coordination scheme.
In addition, we demonstrate the benefit of our prioritized-replicas approach also within the realm of centralized schemes.

I. INTRODUCTION

The problem of effectively managing distributed computational resources has been extensively studied. Nowadays, it is receiving renewed attention due to the growing interest in large-scale data centers, in particular in the context of Cloud Computing []. To cope with high loads, service providers construct data centers made of tens of thousands of servers []. Moreover, such data centers, often providing a single service, may be located in different places around the world. This is the case, for example, with the Google search services: at each given point in time, there are several front-end servers active in various locations over the globe, and an incoming user search request is directed to one of them. The actual choice of the specific server to process a given request has a critical impact on the overall performance of the service. This is a very difficult optimization problem, where in many cases there are multiple objectives and many parameters (such as network latency and server load) that should be considered. Regardless of the exact optimization criterion, any adaptive system that tries to address this problem generates a significant amount of overhead just by collecting the needed parameters from the various network locations. The size of data centers, coupled with their distribution across the globe, calls for fully distributed load balancing techniques. Accordingly, it is important to understand how well an oblivious distributed load-sharing system can perform. In this study we introduce a novel approach for addressing oblivious load sharing. This approach is based on employing a two-priority service discipline at the servers, and duplicating each job request into two copies, with different priorities, which are sent randomly to different sites.
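The dispatch rule just described, duplicating each job into a high and a low priority copy and sending the two copies to two different, uniformly chosen servers, can be sketched as follows. This is a minimal illustration; the queue representation and function name are our assumptions, not the paper's notation.

```python
import random

def dispatch(job_id, high_queues, low_queues):
    """Duplicate a job into a high and a low priority copy and place them
    at the tails of the respective queues of two distinct servers, chosen
    uniformly at random (illustrative sketch)."""
    s_high, s_low = random.sample(range(len(high_queues)), 2)
    high_queues[s_high].append(job_id)
    low_queues[s_low].append(job_id)
    return s_high, s_low
```

`random.sample` draws without replacement, so the two copies never land on the same server, matching the requirement that they are sent to different sites.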
Now, if the high priority copy arrives at a loaded server, the low priority copy may end up at a lightly loaded server and be processed there, decreasing the average time the job spends in the system. Since, at each server, high priority jobs are always served before any low priority job, the performance of such a system is always at least as good as that of the basic random assignment technique and, depending on the load, it has the potential of offering considerably improved performance. Informally, one can think of the low-priority job scheme as an auxiliary mechanism that uses leftover capacity to improve the overall response time of the system. We test our technique under two load profiles. The first is a traditional setting in which jobs arrive according to a Poisson process, and job lengths (required processing times) are exponentially distributed. Such load profiles have been widely considered in the literature, due to their amenability to formal analysis as well as often being suitable approximations of reality. In such a setting, it seems reasonable that the scheme will not help much at low load levels, since the high priority copy is most likely to be handled fast; the same could be expected to apply at very high loads, where the auxiliary capacity is very small, as all servers are busy serving high priority jobs. However, for intermediate load levels, the probability that a low priority copy will end up at a free server and be executed before the main copy is not negligible, and as a result we may get a considerable improvement in the overall time a job spends in the system.

(An oblivious system, also termed a static system, is a system that ignores the current state and does not use any dynamic input.)

Indeed, we show, both by analysis and by simulations, that our scheme yields a significant improvement for such loads. Moreover, and quite surprisingly, we get an even better improvement for higher load levels. This rather counter-intuitive phenomenon is due to the fact that, at high loads, even a very small amount of auxiliary capacity has a significant impact on the service time. Clearly, this is an important virtue of the proposed scheme, as it can rescue the system when it hits the particularly difficult working zone. It has been indicated that job lengths in computer systems often exhibit a heavy-tailed distribution [], [], [], which the above (traditional) profile fails to capture. Accordingly, we also test our scheme under a load profile in which jobs arrive according to a Poisson process but their lengths abide by a heavy-tailed distribution with very high variance. In such a setting, some jobs may be unfortunate enough to be scheduled for execution on a server behind a very long job. It turns out that the benefit of our proposed scheme is much higher under this setting. Specifically, we show, through simulations, that our scheme yields a dramatic improvement at low loads. At higher loads, low priority replicas end up getting stuck in low priority queues behind long jobs. Once we tune our scheme to drop such jobs, we show that the performance gain for high loads remains significant. We note that, while our load balancing scheme is oblivious, it is not overhead-free. Indeed, we do need a signaling mechanism for removing the redundant copies of completed jobs, and, in addition, a (nonstandard) buffering mechanism is required. Accordingly, we also study variants of the above scheme in which these overheads are removed.
We show that, while these variants lose the (counter-intuitive) gain at very high loads, they still offer significant improvement under moderate load values, and at the same time admit a simple implementation and incur less overhead. Based on these findings, we develop a hybrid scheme that combines the two methods and retains the main benefits of the original method while incurring very moderate overhead. Finally, we indicate that our two-priority scheme can boost the performance of other load balancing schemes, and demonstrate this on the following two instances. In [], a scheme termed the Supermarket Model was introduced. In that scheme, d servers are sampled once a new job arrives, and the new job is sent to the least loaded of them. When augmenting the Supermarket Model with our scheme, the high priority copy is sent to the least loaded server and the low priority copy is sent to a randomly chosen server. We also applied our priority-based job duplication scheme to the well-known (centralized) Round-Robin scheme. Here, the high priority jobs are scheduled in a Round-Robin fashion, and the low priority copy is sent to a randomly chosen server. To summarize, the main contributions of this paper are as follows.
- Introduction and analysis of a priority-based job duplication scheme for distributed oblivious load sharing in networked server systems, such as the Cloud.
- A detailed performance study of the proposed scheme, under two representative load profiles, with the following findings: significant improvement for job lengths that are exponentially distributed; dramatic improvement for job lengths that follow a (typical) heavy-tailed distribution.
- Design and study of simpler, low-overhead schemes, as well as a hybrid scheme that combines the benefits of all.
- Study of the application of the priority-based job duplication scheme to other load balancing schemes, namely the Supermarket Model and centralized oblivious load sharing using Round-Robin.
The paper is organized as follows. After discussing related work in the next section, in Section III we formalize the model. In Section IV we analyze the proposed solution for exponentially distributed job lengths and validate the results through simulations. In Section V we show the improvement gained when applying our technique to systems where job lengths adhere to a heavy-tailed distribution. In Section VI we study low-overhead mechanisms and formulate a hybrid mechanism. In Section VII we demonstrate how the usage of low priority jobs can also improve other classes of load balancing schemes. Finally, conclusions appear in Section VIII. Due to space limits, some details are omitted and can be found (online) in [].

II. RELATED WORK

The problem of load balancing has been studied extensively in a variety of contexts over the last twenty years. Many variations of the problem exist, sometimes under different names (e.g., load sharing, scheduling and task assignment). Solutions to the problem, often termed scheduling algorithms or assignment policies, can be classified according to several fundamental properties. Some solutions assume a centralized system [], [], i.e., all jobs arrive at a single entity (called a scheduler or dispatcher), which is responsible for determining which server will process each job. Other solutions assume a distributed system [], [], i.e., jobs may arrive at any server in the system, which needs to carry out the scheduling policy and determine which server will process the job. In the sequel, the term scheduler refers to the entity administering the policy, whether it is a central one or not. Some solutions require a priori knowledge (i.e., already when the job first arrives) of the time required to process the job []. Other solutions attempt to estimate this information as the job is being processed [].
Yet most solutions do not assume any such knowledge, but rely either on other parts of the system's state, or simply on the number of pending jobs at the different servers. In certain systems, jobs may be preempted, that is, halted while being processed and served again later on, either at

the same or at a different server [], []. When preemption is allowed, we distinguish between two different behaviors once the preempted job resumes its service, namely preempt-resume and preempt-restart. In the former, the state of the job is saved upon preemption and processing continues from the point at which it was stopped. In the latter, the state of the job cannot be saved; therefore, processing starts from scratch (and the processing effort invested prior to preemption is lost). As explained in the introduction, dynamic (or adaptive) policies require a mechanism to collect state information from the distributed system, and thus are associated with significant overhead and possible delays. To address these issues, a very interesting approach, called the Supermarket Model, was proposed in []. The main idea is to consider only d random servers (where d is a small constant) out of the available N, and to assign the job to the one with the shortest queue. It was shown in [] that even for d = this scheme achieves a much better waiting time than random assignment, and the improvement gets better as d increases. The scheme described in [], however, was centralized, and its evaluation did not consider the direct overhead associated with each load query. A recent study [] proposed a fully distributed implementation of the Supermarket Model and showed that, under appropriate tuning, this version performs very well even when the overhead is taken into account. However, that scheme is still dynamic and uses partial knowledge of the current system state. As such, it either incurs delays (if information is acquired on demand) or else relies on stale state information. As mentioned, oblivious schemes, such as those considered in this study, are exempt from these shortcomings. We note that our priority-based job duplication technique can be applied also to the Supermarket Model; the details are in [].
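For reference, the Supermarket Model's sampling rule can be sketched in a few lines; the function name, the `queue_lengths` representation, and the default d = 2 are our illustrative assumptions.

```python
import random

def supermarket_assign(queue_lengths, d=2):
    """Sample d distinct servers uniformly at random and return the index
    of the one with the shortest queue (the 'power of d choices' rule)."""
    candidates = random.sample(range(len(queue_lengths)), d)
    return min(candidates, key=lambda s: queue_lengths[s])
```

Note that only the d sampled servers are queried, which is what keeps the state-collection overhead small compared to polling all N servers.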
The idea of leveraging idle cycles on one server to offload work from another server has also been explored [] (and many references therein). However, the problem defined there focuses on two servers in an asymmetric setting, i.e., one server takes the role of the donor, which can assist another server (the beneficiary), but not the other way around; moreover, extensive state tracking is required, since the donor needs access to the current state of the beneficiary.

III. MODEL

We consider a system composed of N identical servers, 1, 2, ..., N. Each server has two infinite queues, one for high priority jobs and the other for low priority jobs. Jobs arrive at the system according to a Poisson process at an aggregate rate of λN. The length of each job shall be characterized, through two possible distributions, in later sections. We consider batch jobs, so by a job we mean a job request, and the job may be performed at any of the servers in the system. We assume a distributed system; therefore, there is no single entity that decides on the exact assignment of jobs to servers. Instead, for each job, two servers are chosen uniformly at random out of the N servers: one is assigned the high priority copy of the job, and the other is assigned the low priority copy. The two copies are placed, immediately, at the end of the two respective queues at the chosen servers. The servers process jobs according to the following preemptive priority discipline. As long as there are jobs in the high priority queue, the server serves them on a first-come-first-served (FCFS) basis. Once the high priority queue becomes empty, the server turns to process jobs from the low priority queue (again, on an FCFS basis). In case a high priority job arrives at the server while it is processing a low priority job, the processing is preempted, the low priority job is returned to the head of the low priority queue, and the server turns to process the high priority job.
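The arrival-time behavior just described can be sketched as a small event handler. The dict-based server state and the function name are our illustrative assumptions, not the paper's notation.

```python
def arrive_high(server, job):
    """Handle a high priority arrival under the preemptive discipline:
    a low priority job in service is pushed back to the head of its queue
    (under preempt-restart its partial work is discarded), and the server
    serves the high priority queue first."""
    if server["in_service"] is not None and server["in_service"][1] == "low":
        # preempted low priority job returns to the head of its queue
        server["low"].insert(0, server["in_service"][0])
        server["in_service"] = None
    server["high"].append(job)
    if server["in_service"] is None:
        server["in_service"] = (server["high"].pop(0), "high")
```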
Once the server's high priority queue becomes empty again, the processing of the low priority job starts again, from scratch (preempt-restart). When a server begins the processing of a high priority job, or completes the processing of a low priority job, an immediate notification is provided to the server holding the job's replica, and that copy is removed from the queue. This last requirement shall be relaxed and then dropped in Section VI.

IV. EXPONENTIAL JOB LENGTHS

In this section we assume that job lengths are exponentially distributed and, accordingly, we analyze the performance improvement gained by the addition of the low priority jobs, as described in the previous section, to a classical system of M/M/1 queues, i.e., with Poisson arrivals and exponentially distributed service times (with rate µ). First, we relax the model somewhat by assuming that the high and low priority jobs arrive under independent Poisson processes, each with rate λN, and that the required time to complete the job is sampled independently for each of the replicas. We employ a state-transition-rate diagram [] to analyze the problem from the perspective of a single server. We denote by p_{i,j} the probability that a server has i jobs in its high priority queue and j jobs in its low priority queue. While the low priority replicas of a given server's high priority jobs may be distributed among various other servers, to simplify the analysis we make an averaging assumption that each server is assisted by a single additional server. We need to quantify the rate at which jobs held by the server under analysis are processed by other servers. This applies both to the low priority replicas of high priority jobs queued at the considered server, and to the high priority replicas of low priority jobs queued at the considered server. Accordingly, we denote by µ_R the rate at which local low priority jobs are served remotely ("R" stands for "Remote") as high priority jobs.
We note that, as long as a remote server has high priority jobs in its queue, it provides service at rate µ; since we assume all servers are symmetric, we have:

µ_R = µ · Σ_{i=1}^∞ Σ_{j=0}^∞ p_{i,j}.

In a similar fashion, we denote by µ̄_R the rate at which local high priority jobs are served remotely as low

priority jobs. Unlike the case of µ_R, here we need to consider not only the time spent processing low priority jobs, but their actual completion. This is because low priority jobs may be preempted upon the arrival of high priority jobs, in which case the cycles spent processing the low priority job before it was preempted are lost. Therefore,

µ̄_R = µ · (µ / (µ + λ)) · Σ_{j=1}^∞ p_{0,j},

where µ / (µ + λ) is a reduction factor, representing the probability that, once a low priority job starts processing, it will complete before the arrival of a new high priority job. The factor's value is explained as follows. Once a server starts processing a low priority job, two exponential processes compete: one for the arrival of a new high priority job (at rate λ, which would cause the low priority job to be preempted and returned to the queue) and the other for the completion of the low priority job (at rate µ). Therefore, the probability that the first of these two events to occur will be the completion of the low priority job is as stated above. Figure depicts a snippet of the two-dimensional state-transition-rate diagram that corresponds to the above model of the problem.

[Figure: Two-dimensional state-transition-rate diagram]
[Figure: Improvement gained for µ = ; model (buffer limit ) vs. simulation (buffer limit and unlimited buffers)]
The transition rates indicate that the system may leave state <i,j> on account of several events, as follows: the arrival of a new high priority job (which occurs at rate λ and moves the system to state <i+1,j>), the arrival of a new low priority job (which occurs at rate λ and moves the system to state <i,j+1>), the completion of a high priority job (which may occur either locally, at rate µ, or, when there is more than a single high priority job in the queue, remotely, at rate µ̄_R, moving the system to state <i-1,j>), and the completion of a low priority job (which may occur locally, at rate µ, when i = 0, or remotely, at rate µ_R, moving the system to state <i,j-1>). Assuming this system has a steady state, we get the following set of equations: for every i > 1, j ≥ 1,

p_{i,j} · (2λ + µ + µ_R + µ̄_R) = p_{i-1,j} · λ + p_{i+1,j} · (µ + µ̄_R) + p_{i,j-1} · λ + p_{i,j+1} · µ_R.

There are two additional sets of equations, one for the case i = 0 (i.e., the high priority queue is empty), which considers a local processing rate of µ for low priority jobs, and the other for i = 1 (i.e., there is a single job in the high priority queue), in which the removal of the high priority job cannot occur on account of its completion as a low priority job at a remote server. In addition, since the p_{i,j}'s are probabilities, we require that

Σ_{i=0}^∞ Σ_{j=0}^∞ p_{i,j} = 1.

Unfortunately, the above expressions have no simplified closed-form solution. Moreover, numerical solutions (with finite buffer sizes) do not scale well, since the corresponding number of variables is quadratic in the buffer size. Figure depicts the improvement gained by introducing low priority jobs to a standard system of N M/M/1 servers (i.e., N servers, each being an M/M/1 system). The graph shows the improvement in queue size, i.e., the ratio between the expected queue size of the matching N-server M/M/1 system and the expected queue size of the system that supports low priority jobs.
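Although the balance equations admit no closed form, a truncated version of the chain can be solved numerically. The sketch below is our illustration, not the paper's solver: it treats µ_R and µ̄_R as given constants (in the full model they depend on the p_{i,j}, so one would wrap this in a fixed-point loop, omitted here), truncates both buffers at B, and applies uniformization with power iteration.

```python
def solve_truncated_chain(lam, mu, mu_r, mu_rbar, B=10, iters=5000):
    """Approximate the steady state of the two-dimensional chain, with
    both queues truncated at B, for given remote rates mu_r and mu_rbar."""
    states = [(i, j) for i in range(B + 1) for j in range(B + 1)]
    idx = {s: k for k, s in enumerate(states)}

    def rates(i, j):
        out = []
        if i < B:
            out.append(((i + 1, j), lam))            # high priority arrival
        if j < B:
            out.append(((i, j + 1), lam))            # low priority arrival
        if i > 0:
            # high job completes locally (mu); for i > 1 also remotely (mu_rbar)
            out.append(((i - 1, j), mu + (mu_rbar if i > 1 else 0.0)))
        if j > 0:
            # low job completes remotely (mu_r); locally (mu) only when i == 0
            out.append(((i, j - 1), mu_r + (mu if i == 0 else 0.0)))
        return out

    big = 2 * lam + 2 * mu + mu_r + mu_rbar          # uniformization constant
    p = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        q = [0.0] * len(states)
        for s, ps in zip(states, p):
            total = 0.0
            for t, r in rates(*s):
                q[idx[t]] += ps * r / big
                total += r
            q[idx[s]] += ps * (1.0 - total / big)    # self-loop mass
        p = q
    return {s: p[idx[s]] for s in states}
```

As the text notes, the number of variables grows quadratically in B, which is exactly why this direct approach does not scale to large buffers.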
These results include the performance gain in the two-dimensional model with finite buffers of size , compared to a system of N M/M/1 servers with finite buffers of size . In addition, we used an in-house event-driven simulator to simulate the behavior of the system. Simulations were carried out for various loads in a system of servers, with µ = . Each simulation sequence contained , jobs, and each value was computed as the average of such runs, where the values of the standard deviation were all well below %. Figure shows the results both for a simulated system with limited buffers, matching the results from the model, and for the improvement gained in a system with unlimited buffers. While the expressions may seem similar to those of a standard setting of an M/M/1 queue with priority classes, they are, in fact, quadratic, since both µ_R and µ̄_R depend on several of the p_{i,j} probability variables.

[Figure: M/M/1 performance vs. low priority job completions (low priority job completion ratio and normalized M/M/1 queue length)]
[Figure: Example of the Bounded-Pareto cumulative distribution function for various α values, maximal job length = ]

We note that our model is a reasonable approximation of reality and, while there are effects it does not capture, in particular at high loads, it does describe the overall behavior. As expected, our results indicate that, at very low loads, the scheme of employing duplicate low priority jobs does not improve system performance by much. Indeed, at low loads, system performance is very good as it is, and an arriving job has a high probability of reaching an idle server. It is at higher loads that our scheme should come to the rescue. Yet, at the high-load end, we might not hope for much, as low priority jobs apparently have little chance of getting through. However, quite counter-intuitively, we observe that, at very high loads, our proposed scheme exhibits an improvement of up to a factor of .. This good news can be explained as follows. Consider Figure , where the red dotted line shows the expected queue length of an M/M/1 server (namely, λ/(µ-λ)), with µ = , normalized to , and the blue solid line shows the ratio of jobs (out of all user requests) that have been completed as low priority jobs, as obtained through simulations. One can see that, when the load rises above %, the expected queue length of the M/M/1 server rises exponentially, while the low priority job completion ratio decreases linearly. Thus, while the contribution of the low priority jobs decreases, it is still sufficient to make a significant impact on the system's performance.
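The quantitative intuition here, that near saturation even a tiny reduction of the effective arrival rate yields a large improvement factor, follows directly from the M/M/1 queue-length formula. The numbers below are our own illustration, not the paper's data points.

```python
def mm1_queue_len(lam, mu=1.0):
    """Expected number of jobs in a stable M/M/1 queue: lam / (mu - lam)."""
    assert lam < mu, "formula only valid for a stable queue"
    return lam / (mu - lam)

# A small fraction of jobs completing as low priority replicas acts like a
# slight reduction of the effective arrival rate; near saturation this
# translates into a large improvement factor (illustrative numbers):
improvement = mm1_queue_len(0.98) / mm1_queue_len(0.98 * 0.985)
```

Here a 1.5% effective-rate reduction at 98% load shrinks the expected queue length by well over a third, illustrating the steep growth of λ/(µ-λ) as λ approaches µ.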
For example, when the system operates at % load, our simulation results indicate that just .% of the jobs first complete as low priority jobs; however, this is enough to bring the system to operate as if the effective arrival rate were .%, which yields an improvement of a factor of . (confirmed in simulation to within % accuracy).

V. POWER-LAW JOB LENGTHS

Several studies have indicated that job lengths in computer systems often exhibit a heavy-tailed distribution. In such systems, the vast majority of jobs are very small, while a few jobs are extremely long. Examples of such systems include the sizes of files transferred through the Internet [] and the duration of process lifetimes in UNIX systems [], []. A process that adheres to a heavy-tailed distribution can be described by the following complementary cumulative distribution function:

P[X > x] ~ x^(-α), where 0 < α < 2.

That is, the number of jobs that are longer than x is an inverse power of x. In our work, we study the system behavior for a variety of α values in this range. Following other works in this domain [], [], and for practical reasons (namely, in practice, a job never requires infinite processing time), we opt to model job lengths using the Bounded-Pareto distribution. This distribution is defined by three parameters: L, the minimal job length; H, the maximal job length; and α, the exponent of the power law. Given these, the probability density function of a Bounded-Pareto process is:

f(x) = (α · L^α / (1 - (L/H)^α)) · x^(-α-1), for L ≤ x ≤ H.

We base our findings on simulations. In all settings, for all α values, we keep the maximal possible value fixed and the mean value at , adjusting the minimum as needed. We present results for the following maximal job length values: , , and . Figure depicts the cumulative distribution function for the various α values we work with, for a maximal job length of .
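Samples from the Bounded-Pareto distribution above can be drawn by inverting its CDF; the following is a minimal sketch (the function name and optional `u` argument are ours).

```python
import random

def bounded_pareto(L, H, alpha, u=None):
    """Return one Bounded-Pareto(L, H, alpha) sample via the inverse CDF:
    for u uniform on [0, 1), x = L * (1 - u * (1 - (L/H)**alpha))**(-1/alpha),
    which always lies in [L, H]."""
    if u is None:
        u = random.random()
    return L * (1.0 - u * (1.0 - (L / H) ** alpha)) ** (-1.0 / alpha)
```

At u = 0 the formula returns L and at u = 1 it returns H, so the truncation to a bounded support is exact by construction.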
When applying our scheme to such arrival patterns, the following phenomenon was encountered: long jobs reach the head of the low priority queue and get preempted over and over again, thus blocking other low priority jobs from completing. We address this by adding a drop threshold to the low priority queue, as follows. For every low priority job, a counter is maintained, keeping track of the number of times it has been preempted; once the preemption counter reaches the designated threshold, the job is dropped from the queue (the high priority copy is untouched, and is guaranteed to complete). When job lengths are exponentially distributed, this phenomenon is not encountered, since the variance in job lengths is small; therefore, there is no need for the drop threshold counter.
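The drop-threshold bookkeeping can be sketched as follows; the counter dictionary and function name are our illustrative assumptions.

```python
def preempt_low(low_queue, job, preempt_counts, threshold):
    """Called when the low priority job in service is preempted: bump its
    preemption counter and either drop it (counter reached the threshold;
    the high priority copy elsewhere still guarantees completion) or
    return it to the head of the low priority queue."""
    preempt_counts[job] = preempt_counts.get(job, 0) + 1
    if preempt_counts[job] >= threshold:
        return False          # dropped from the low priority queue
    low_queue.insert(0, job)  # requeued at the head, to restart later
    return True
```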

[Figures: Improvement gained for various drop thresholds, for several α values and maximal job lengths]

[Figures: Improvement gained for various drop thresholds, for several α values and maximal job lengths]

[Figures: Improvement gained for various drop thresholds, for several α values and maximal job lengths]
[Figures: Improvement gained for various α values, for several maximal job lengths and drop thresholds]

Figures to clarify the impact of the drop threshold value on the performance improvement. These figures show that, for low α values (i.e., α < ), the precise value of the drop threshold has little impact on the system's performance. This is due to the extremely wide range of job lengths. For example, when α is set to . and the maximal job length is set to , the minimal job length is , so the ratio between the largest and smallest possible jobs is . In such cases, the drop mechanism rarely mistakes a short job for a long one. In addition, with such a large range, dropping the job after times, as opposed to just after two times, has very little impact, as long as the subsequent jobs get a chance to complete as low priority jobs when their replica is queued behind an extremely long job. For high values of α, where the difference between the shortest and longest possible job decreases, the importance of setting the drop threshold to the right value becomes evident. If the drop threshold is set too low, short jobs may be dropped from the low priority queue, forcing their high priority copy to await processing, possibly for a very long time. For example, Figure shows that, when the drop threshold is set to , the performance improvement may be fairly low and, in some cases (namely, when the load is between % and %), it may actually be lower than that gained by the basic scheme. If the drop threshold is set too high, long jobs are kept in the low priority queue for a long time, delaying short jobs from completing. Figures , and show the performance improvement gained by applying our scheme to systems where inputs vary according to a Bounded-Pareto distribution. Each figure shows the results for six different α values; the figures differ in the maximal job lengths. The performance gain is computed by dividing the original average time in the system (when jobs are scheduled according to a random assignment policy) by the average time in the system when our scheme is applied. As can be seen, our method dramatically improves the system's performance. For typical loads, the performance improvement is over a factor of ; even in the worst cases, the improvement remains considerable, ranging from a minimum of . to for different values of the maximal job length.

VI. PRACTICAL IMPLEMENTATION

The scheme described in the previous sections incurs two non-negligible tolls in terms of management overhead: the first is the networking overhead related to the signaling messages exchanged between servers for removing redundant copies of completed jobs, while the second is the overhead implied by the priority-based buffering system. Accordingly, in this section we propose and investigate variants of our scheme that reduce, or even avoid altogether, these tolls.
For concreteness, and due to space limits, we focus on exponentially distributed job lengths; however, we demonstrate the viability of these schemes for Power Law distributions by evaluating the performance of the most extreme variant (namely, the optimistic method) under such a distribution.

A. Buffering, No Signaling

The first scheme that we consider avoids sending signaling messages. Instead, a timeout value is associated with each low priority copy, such that, if a copy fails to complete within some period of time after its arrival at the server's queue, it is dropped from the queue. Note that high priority jobs are not timed out; therefore, each job is always served. Figure depicts simulation results obtained with this scheme for exponentially distributed job lengths. Specifically,

it describes the performance improvement, with respect to a standard system of M/M/1 queues, for a range of timeout values. Overall, the scheme offers a noticeable improvement, whose magnitude depends on the timeout value. We observe a monotonic improvement up to a certain timeout value, and a decrease in performance from there on. Indeed, within the range of small-to-moderate values, a larger timeout gives a low priority copy a higher chance to make an impact. However, when the timeout values are too large, low priority copies stay in the system too long, often beyond the completion time of their high priority siblings, thus blocking the way for other low priority copies.

[Fig.: Improvement gained by employing timeouts, for various timeout values]
[Fig.: State-transition diagram for the optimistic scheme]

B. No Buffering, No Signaling - The Optimistic Method

While the above scheme avoids the signaling overhead altogether, it still incurs a toll in terms of managing a second (low priority) queue at each server. Our next scheme avoids this toll too, by taking the following optimistic approach. When a low priority job arrives at a server, it is either immediately accepted for service or, if there is another job (of either priority) being served, it is dropped. In addition, if a low priority copy is preempted by a high priority copy, it is dropped and thus cannot resume service later. Thus, in this scheme too, we avoid signaling messages, and, in addition, we do not maintain queues for low priority copies. We turn to analyze this scheme for exponentially distributed job lengths. Consider the arrival of a job of length x to the system.
This job's low priority copy may either find an idle server and be processed there without interruption, in which case the job's completion time is exactly x, or it may end up being discarded, in which case the high priority copy will be processed by an (average) M/M/1 queue, with an average completion time of λ/(µ(µ−λ)) + x. Hence, for a given job length x, the mean time in the system is

p(x)·x + (1 − p(x))·( λ/(µ(µ−λ)) + x ),

where p(x) denotes the probability that the low priority copy of a job of size x finds an idle server and gets processed there without interruption. To evaluate p(x), we construct a corresponding state-transition diagram, depicted in Figure . Generally, the system may either be idle (state 0), processing a high priority job (states 1, 2, 3, ...), or processing a low priority job (a separate state). Therefore, to complete processing, an arriving low priority job must find the system in the idle state, and the system must then see no high priority arrival for a duration matching the job's length; formally:

p(x) = p₀·e^(−λx),

where p₀ is the steady-state probability of finding the system idle. Assuming the system has a steady state yields

p₀ = (µ+λ)(µ−λ) / ( µ(µ+2λ) ).

To conclude, the mean time in the system is

∫₀^∞ µ·e^(−µx) · [ p(x)·x + (1 − p(x))·( λ/(µ(µ−λ)) + x ) ] dx.

[Fig.: Improvement gained by the optimistic scheme; exponentially distributed job lengths (model vs. simulation)]

Figure depicts the improvement gained by using the optimistic scheme. The solid blue line is obtained through the above expression, and the red dashed line corresponds to simulation results. While this scheme provides very little improvement at both low and high loads, it gives a considerable additional boost in the mid-load range, where it peaks.
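As a numerical sanity check of the analysis above, the sketch below evaluates the mean time in the system and the resulting improvement factor over a plain M/M/1 queue. It assumes the expressions as reconstructed here — p(x) = p₀·e^(−λx), p₀ = (µ+λ)(µ−λ)/(µ(µ+2λ)), and M/M/1 mean waiting time W = λ/(µ(µ−λ)) — and the function names are ours:

```python
def optimistic_mean_time(lam, mu):
    """Mean time in system under the optimistic scheme, assuming the
    reconstructed model: T = integral_0^inf mu*exp(-mu*x) *
    [ p(x)*x + (1 - p(x))*(W + x) ] dx, with p(x) = p0*exp(-lam*x)."""
    p0 = (mu + lam) * (mu - lam) / (mu * (mu + 2 * lam))
    w = lam / (mu * (mu - lam))  # M/M/1 mean waiting time
    # The integrand simplifies to x + (1 - p(x))*w, so the integral has
    # the closed form 1/mu + w*(1 - p0*mu/(mu + lam)).
    return 1.0 / mu + w * (1.0 - p0 * mu / (mu + lam))

def improvement(lam, mu=1.0):
    """Ratio of the plain M/M/1 mean time, 1/(mu - lam), to the scheme's."""
    return (1.0 / (mu - lam)) / optimistic_mean_time(lam, mu)

for rho in (0.1, 0.5, 0.9):
    print(f"load {rho:.0%}: improvement factor {improvement(rho):.3f}")
```

Under these assumptions the computed improvement factor is largest at intermediate loads and tapers off at both low and high loads, matching the qualitative shape of the curve described above.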

[Fig.: Improvement gained by the optimistic scheme; Bounded-Pareto distributed job lengths, for various α values]
[Fig.: Hybrid scheme, compared with the full scheme (signaling and unlimited buffers) and the optimistic scheme]

We turn to consider Power Law job lengths. Specifically, Figure depicts the improvement gained by using the optimistic scheme with Bounded-Pareto distributed job lengths, with the maximal job length held fixed. The optimistic scheme offers a dramatic improvement for low α values and low loads. As the load rises, the probability of completing the processing of a low priority job drops, and the improvement rate drops accordingly (but still remains considerable). For high values of α, the minimal job length is relatively high, and thus the probability of completing a low priority job is extremely low. Therefore, the optimistic scheme offers very little improvement for these α values.

C. The Hybrid Scheme

The above discussion indicates an inherent tradeoff between performance gain and management toll, as follows. The original, full-fledged scheme offers a remarkable improvement for exponentially distributed job lengths; moreover, it has the salient property of boosting performance at particularly high loads. On the other hand, it requires maintaining an additional queue at each server and employing inter-server signaling upon the completion of each job. At the other extreme, the optimistic scheme avoids both tolls altogether, yet the performance improvement it offers is limited. Moreover, at high loads, all schemes that avoid signaling exhibit no performance boost; rather, their improvement diminishes. In an attempt to reconcile the conflicting goals of performance and management toll, we propose the following hybrid scheme.
At load levels that are not at the high end, the scheme adopts the rules of the optimistic scheme, namely, no signaling and no low priority queues. However, when a server senses that the system is highly loaded, it sends signaling messages upon job completion, as done by the full-fledged scheme. (Servers can estimate the load conditions using a local scheme such as the one proposed and analyzed in [] for cloud computing systems similar to those discussed here.) This way, low priority queues are avoided altogether, while signaling messages are employed only under high load conditions. Such a scheme is somewhat reminiscent of network control schemes that employ control messages under difficult scenarios, e.g., in order to avoid store-and-forward deadlocks. Figure depicts simulation results obtained with this scheme (for exponentially distributed job lengths, with signaling kicking in beyond a given load threshold), and compares it with the original optimistic scheme (i.e., no signaling) and the full-fledged scheme (i.e., signaling under all conditions and employment of a second queue). As expected, the behaviors of the hybrid and optimistic schemes coincide up to the threshold level. However, for load conditions beyond the threshold, we observe a significant performance boost that, at its peak, roughly doubles the performance improvement with respect to the optimistic scheme. Moreover, while the performance improvement still lags behind that of the full-fledged scheme, the two exhibit a similar pattern. Most notably, with the hybrid scheme we observe the revival of the favorable boosting effect at the high-load end.

VII. AUGMENTING EXISTING SCHEMES

The advantages of oblivious distributed load balancing notwithstanding, various reasons may require that a system implement a scheme that is either non-oblivious or centralized.
Accordingly, in this section we show how the low priority job replication concept can be used to augment such alternative schemes, focusing on the (non-oblivious) Supermarket scheme and the (centralized) Round-Robin scheme.

A. The Supermarket Model

As explained in the introduction, the large amount of distributed data in modern datacenter settings makes any non-oblivious system complex and very hard to implement. The Supermarket Model, studied in [], is a highly interesting approach that addresses this problem. The main idea is to consider only d random servers (where d is a small constant) out of the available N, and to assign the job to the one among them with the shortest queue. It is shown in []

that, even for small d, this scheme achieves a much better waiting time than random assignment, and this improvement increases with d (the number of considered servers). While the approach in [] is centralized and does not consider the direct overhead associated with each load query, it was recently shown in [] that the Supermarket Model admits a fully distributed implementation which, under appropriate tuning, works very well even when the overhead is considered.

[Fig.: Improvement gained by using the supermarket model, with and without low priority jobs, compared to random assignment]
[Fig.: Improvement gained by using the supermarket model with low priority jobs, compared to the supermarket model without low priority jobs]
[Fig.: Improvement gained by Round-Robin with low priority replicas]

In order to test our general prioritized job replication paradigm in non-oblivious scenarios as well, we applied it to the distributed Supermarket Model in the following way: when a job arrives at the dispatcher, the dispatcher randomly selects d servers and queries them for their queue lengths (in this technique, queue lengths are used to evaluate the load at a server). Recall that, in a fully distributed system, the dispatcher may be one of the servers (in which case it queries d servers in addition to itself). The dispatcher then assigns a high priority copy of the job to the server with the shortest queue and a low priority copy to a random server. The servers work according to the basic priority scheme described in Section III. Figure describes the improvement over random assignment achieved by the vanilla distributed Supermarket Model and by our scheme for exponentially distributed job lengths.
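The dispatch rule described above — query d random servers, send the high priority copy to the one with the shortest queue, and send a low priority replica to a uniformly random server — can be sketched as follows. The Server class and its queue interfaces are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import random

def dispatch(job, servers, d, rng=random):
    """Supermarket dispatch augmented with a low priority replica.
    `servers` is a list of objects exposing queue_length(),
    enqueue_high(job) and enqueue_low(job) -- hypothetical interfaces."""
    sampled = rng.sample(servers, d)                       # query d random servers
    target = min(sampled, key=lambda s: s.queue_length())  # shortest queue wins
    target.enqueue_high(job)                               # high priority copy
    rng.choice(servers).enqueue_low(job)                   # low priority replica

# Minimal stand-in server, used only to demonstrate the dispatch rule.
class Server:
    def __init__(self):
        self.high, self.low = [], []
    def queue_length(self):
        return len(self.high)
    def enqueue_high(self, job):
        self.high.append(job)
    def enqueue_low(self, job):
        self.low.append(job)

servers = [Server() for _ in range(10)]
for job in range(100):
    dispatch(job, servers, d=2, rng=random.Random(job))
print(sum(len(s.high) for s in servers), sum(len(s.low) for s in servers))
```

Each arriving job thus produces exactly one high priority copy (placed by the d-sample rule) and one low priority replica (placed uniformly at random); the servers themselves would then apply the preemptive priority discipline of Section III.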
In order to clearly examine the impact of our scheme, we depict in Figure the results of the prioritized job replication paradigm compared to the vanilla Supermarket Model for exponentially distributed job lengths. As one can see, for a small d, the improvement can be significant; this is due to the fact that, in some cases, all d sampled servers are busy, and the random selection for the low priority job finds a less occupied one. We note that this additional gain comes on top of an already very effective scheme that yields a large improvement with respect to random assignment; the details can be found in [].

B. A Centralized Approach

In some systems, it is possible to perform load balancing via a centralized entity, rather than distributedly. In that case, all jobs arrive at a central dispatcher, which assigns them to the different servers. One of the prominent schemes for centralized load balancing is Round-Robin, i.e., the dispatcher assigns jobs to the servers cyclically. The Round-Robin scheme is attractive due to its simplicity, as well as the gain it attains when compared to random assignment. Accordingly, for systems that admit centralized control, we propose a scheme based on a Round-Robin dispatcher, augmented with low priority job replicas. Specifically, the dispatcher assigns the high priority jobs in a Round-Robin fashion and, in addition, sends a low priority replica to another, randomly chosen, server. Servers operate based on a preemptive priority discipline, as described in Section III. Figure depicts the improvement gained by augmenting the centralized Round-Robin scheduling policy with low priority replicas, for exponentially distributed job lengths.
This graph shows the improvement gained by three different schemes with respect to a standard M/M/1 system, namely: the distributed scheme described in the previous sections, a standard Round-Robin scheme, and the augmented Round-Robin scheme described above. All results were obtained by simulation under conditions similar to those described in

Section IV (namely, the same number of jobs, the same µ, and the same number of servers), and each value was computed as the average of multiple runs. One can see that, at low to medium loads, the standard Round-Robin scheme performs similarly to our distributed scheme, and even outperforms it (as can be expected from a centralized implementation). However, at high loads, our replica-based distributed system is better, and when the replica technique is combined with the centralized Round-Robin system, we get a particularly significant improvement.

VIII. CONCLUSIONS

We proposed a scheme for oblivious distributed load sharing in the context of large data centers. Specifically, we showed that, when job lengths are exponentially distributed, adding low priority job replicas can significantly reduce the system service time, even when the overall load is high. Moreover, we showed that the improvement in system performance offered by our scheme is particularly significant when job lengths exhibit a heavy-tailed distribution. In addition, we showed that considerable improvement can be gained even when employing simple schemes that incur little or no management overhead. Merging the full-fledged approach with the simpler schemes, we proposed a hybrid scheme that provides an attractive combination of performance gain and low overhead. Finally, we showed that the general idea of having low priority replicas can also augment existing job assignment policies, such as the (centralized) Round-Robin and the (non-oblivious) Supermarket Model, there too resulting in significant improvements. Many technical issues, such as the exact optimal balance between overhead and efficiency, or better modeling that can result in closed formulae, remain to be explored. In particular, it would be interesting to check whether dimensionality reduction techniques [] can be applied to this problem.
The idea of using low priority copies is not limited to the cloud computing setting; a similar idea can be used in many networking scenarios in order to improve response time. Consider, for example, multimedia packets being transmitted in a cellular or WiFi/WiMax setting, where the recipient may get packets from several base stations []. In such a case, one might send more than one copy of a packet to more than one base station, using priorities, and increase the probability that the packet is received before its timeout. One can also use such a scheme for general wireline routing, where the goal is to reduce overall latency. Indeed, schemes of duplicate packet routing have been investigated (e.g., [], []), yet not within a prioritized service discipline.

REFERENCES

[] Task assignment with unknown duration, J. ACM.
[] D. Amzallag, R. Bar-Yehuda, D. Raz, and G. Scalosub, "Cell selection in 4G cellular networks," in INFOCOM.
[] A. Barak, S. Guday, and R. G. Wheeler, The MOSIX Distributed Operating System: Load Balancing for UNIX. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
[] D. Breitgand, R. Cohen, A. Nahir, and D. Raz, "Cost aware adaptive load sharing," in IWSOS.
[] ——, "On cost-aware monitoring for self-adaptive load-sharing," IEEE Journal on Selected Areas in Communications.
[] M. E. Crovella and A. Bestavros, "Self-similarity in World Wide Web traffic: evidence and possible causes," IEEE/ACM Trans. Netw.
[] D. L. Eager, E. D. Lazowska, and J. Zahorjan, "The limited performance benefits of migrating active processes for load sharing," SIGMETRICS Perform. Eval. Rev.
[] S. Fischer, "Distributed load balancing algorithm for adaptive channel allocation for cognitive radios," in Proc. of the 2nd Conf. on Cognitive Radio Oriented Wireless Networks and Communications (CrownCom).
[] B. Fu and Z. Tari, "A dynamic load distribution strategy for systems under high task variation and heavy traffic," in SAC: Proceedings of the ACM Symposium on Applied Computing. New York, NY, USA: ACM.
[] M. Harchol-Balter, M. E. Crovella, and C. Murta, "On choosing a task assignment policy for a distributed server system," Tech. Rep., Cambridge, MA, USA.
[] M. Harchol-Balter and A. B. Downey, "Exploiting process lifetime distributions for dynamic load balancing," ACM Trans. Comput. Syst.
[] M. Harchol-Balter, T. Osogami, A. Scheller-Wolf, and A. Wierman, "Multi-server queueing systems with multiple priority classes," Queueing Syst. Theory Appl.
[] B. Hayes, "Cloud computing," Commun. ACM.
[] M. Isard, "Autopilot: automatic data center management," SIGOPS Oper. Syst. Rev.
[] L. Kleinrock, Queueing Systems, Vol. I: Theory. Wiley Interscience.
[] W. Leland and T. J. Ott, "Load-balancing heuristics and process behavior," SIGMETRICS Perform. Eval. Rev.
[] M. Mitzenmacher, "The power of two choices in randomized load balancing," IEEE Transactions on Parallel and Distributed Systems, October.
[] A. Nahir, A. Orda, and D. Raz, "Distributed oblivious load balancing using prioritized job replication," Department of Electrical Engineering, Technion, Haifa, Israel, Tech. Rep. [Online]. Available: http://
[] A. Orda and R. Rom, "Routing with packet duplication and elimination in computer networks," IEEE Transactions on Communications, Jul.
[] T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf, "Analysis of cycle stealing with switching cost," SIGMETRICS Perform. Eval. Rev.
[] W. T. Zaumen, S. Vutukury, and J. G.-L. Aceves, "Load-balanced anycast routing in computer networks," IEEE Symposium on Computers and Communications.

The details regarding the Supermarket Model appear in [].


More information

Seamless Congestion Control over Wired and Wireless IEEE 802.11 Networks

Seamless Congestion Control over Wired and Wireless IEEE 802.11 Networks Seamless Congestion Control over Wired and Wireless IEEE 802.11 Networks Vasilios A. Siris and Despina Triantafyllidou Institute of Computer Science (ICS) Foundation for Research and Technology - Hellas

More information

Drop Call Probability in Established Cellular Networks: from data Analysis to Modelling

Drop Call Probability in Established Cellular Networks: from data Analysis to Modelling Drop Call Probability in Established Cellular Networks: from data Analysis to Modelling G. Boggia, P. Camarda, A. D Alconzo, A. De Biasi and M. Siviero DEE - Politecnico di Bari, Via E. Orabona, 4-7125

More information

Load balancing Static Load Balancing

Load balancing Static Load Balancing Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Single item inventory control under periodic review and a minimum order quantity

Single item inventory control under periodic review and a minimum order quantity Single item inventory control under periodic review and a minimum order quantity G. P. Kiesmüller, A.G. de Kok, S. Dabia Faculty of Technology Management, Technische Universiteit Eindhoven, P.O. Box 513,

More information

Oscillations of the Sending Window in Compound TCP

Oscillations of the Sending Window in Compound TCP Oscillations of the Sending Window in Compound TCP Alberto Blanc 1, Denis Collange 1, and Konstantin Avrachenkov 2 1 Orange Labs, 905 rue Albert Einstein, 06921 Sophia Antipolis, France 2 I.N.R.I.A. 2004

More information

On the Traffic Capacity of Cellular Data Networks. 1 Introduction. T. Bonald 1,2, A. Proutière 1,2

On the Traffic Capacity of Cellular Data Networks. 1 Introduction. T. Bonald 1,2, A. Proutière 1,2 On the Traffic Capacity of Cellular Data Networks T. Bonald 1,2, A. Proutière 1,2 1 France Telecom Division R&D, 38-40 rue du Général Leclerc, 92794 Issy-les-Moulineaux, France {thomas.bonald, alexandre.proutiere}@francetelecom.com

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

PERFORMANCE ANALYSIS OF PaaS CLOUD COMPUTING SYSTEM

PERFORMANCE ANALYSIS OF PaaS CLOUD COMPUTING SYSTEM PERFORMANCE ANALYSIS OF PaaS CLOUD COMPUTING SYSTEM Akmal Basha 1 Krishna Sagar 2 1 PG Student,Department of Computer Science and Engineering, Madanapalle Institute of Technology & Science, India. 2 Associate

More information

6.6 Scheduling and Policing Mechanisms

6.6 Scheduling and Policing Mechanisms 02-068 C06 pp4 6/14/02 3:11 PM Page 572 572 CHAPTER 6 Multimedia Networking 6.6 Scheduling and Policing Mechanisms In the previous section, we identified the important underlying principles in providing

More information

A Passive Method for Estimating End-to-End TCP Packet Loss

A Passive Method for Estimating End-to-End TCP Packet Loss A Passive Method for Estimating End-to-End TCP Packet Loss Peter Benko and Andras Veres Traffic Analysis and Network Performance Laboratory, Ericsson Research, Budapest, Hungary {Peter.Benko, Andras.Veres}@eth.ericsson.se

More information

Fair load-balance on parallel systems for QoS 1

Fair load-balance on parallel systems for QoS 1 Fair load-balance on parallel systems for QoS 1 Luis Fernando Orleans, Pedro Furtado CISUC, Department of Informatic Engineering, University of Coimbra Portugal {lorleans, pnf}@dei.uc.pt Abstract: Many

More information

Defending Against Traffic Analysis Attacks with Link Padding for Bursty Traffics

Defending Against Traffic Analysis Attacks with Link Padding for Bursty Traffics Proceedings of the 4 IEEE United States Military Academy, West Point, NY - June Defending Against Traffic Analysis Attacks with Link Padding for Bursty Traffics Wei Yan, Student Member, IEEE, and Edwin

More information

CHAPTER 3 CALL CENTER QUEUING MODEL WITH LOGNORMAL SERVICE TIME DISTRIBUTION

CHAPTER 3 CALL CENTER QUEUING MODEL WITH LOGNORMAL SERVICE TIME DISTRIBUTION 31 CHAPTER 3 CALL CENTER QUEUING MODEL WITH LOGNORMAL SERVICE TIME DISTRIBUTION 3.1 INTRODUCTION In this chapter, construction of queuing model with non-exponential service time distribution, performance

More information

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration 1 Harish H G, 2 Dr. R Girisha 1 PG Student, 2 Professor, Department of CSE, PESCE Mandya (An Autonomous Institution under

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

CPU Scheduling. Core Definitions

CPU Scheduling. Core Definitions CPU Scheduling General rule keep the CPU busy; an idle CPU is a wasted CPU Major source of CPU idleness: I/O (or waiting for it) Many programs have a characteristic CPU I/O burst cycle alternating phases

More information

Routing in packet-switching networks

Routing in packet-switching networks Routing in packet-switching networks Circuit switching vs. Packet switching Most of WANs based on circuit or packet switching Circuit switching designed for voice Resources dedicated to a particular call

More information

Deployment of express checkout lines at supermarkets

Deployment of express checkout lines at supermarkets Deployment of express checkout lines at supermarkets Maarten Schimmel Research paper Business Analytics April, 213 Supervisor: René Bekker Faculty of Sciences VU University Amsterdam De Boelelaan 181 181

More information

Multiobjective Cloud Capacity Planning for Time- Varying Customer Demand

Multiobjective Cloud Capacity Planning for Time- Varying Customer Demand Multiobjective Cloud Capacity Planning for Time- Varying Customer Demand Brian Bouterse Department of Computer Science North Carolina State University Raleigh, NC, USA bmbouter@ncsu.edu Harry Perros Department

More information

Lecture Outline Overview of real-time scheduling algorithms Outline relative strengths, weaknesses

Lecture Outline Overview of real-time scheduling algorithms Outline relative strengths, weaknesses Overview of Real-Time Scheduling Embedded Real-Time Software Lecture 3 Lecture Outline Overview of real-time scheduling algorithms Clock-driven Weighted round-robin Priority-driven Dynamic vs. static Deadline

More information

NOVEL PRIORITISED EGPRS MEDIUM ACCESS REGIME FOR REDUCED FILE TRANSFER DELAY DURING CONGESTED PERIODS

NOVEL PRIORITISED EGPRS MEDIUM ACCESS REGIME FOR REDUCED FILE TRANSFER DELAY DURING CONGESTED PERIODS NOVEL PRIORITISED EGPRS MEDIUM ACCESS REGIME FOR REDUCED FILE TRANSFER DELAY DURING CONGESTED PERIODS D. Todinca, P. Perry and J. Murphy Dublin City University, Ireland ABSTRACT The goal of this paper

More information

Operating Systems, 6 th ed. Test Bank Chapter 7

Operating Systems, 6 th ed. Test Bank Chapter 7 True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one

More information

A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters

A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters Abhijit A. Rajguru, S.S. Apte Abstract - A distributed system can be viewed as a collection

More information

Efficient Parallel Processing on Public Cloud Servers Using Load Balancing

Efficient Parallel Processing on Public Cloud Servers Using Load Balancing Efficient Parallel Processing on Public Cloud Servers Using Load Balancing Valluripalli Srinath 1, Sudheer Shetty 2 1 M.Tech IV Sem CSE, Sahyadri College of Engineering & Management, Mangalore. 2 Asso.

More information

4. Fixed-Priority Scheduling

4. Fixed-Priority Scheduling Simple workload model 4. Fixed-Priority Scheduling Credits to A. Burns and A. Wellings The application is assumed to consist of a fixed set of tasks All tasks are periodic with known periods This defines

More information

CPU Scheduling 101. The CPU scheduler makes a sequence of moves that determines the interleaving of threads.

CPU Scheduling 101. The CPU scheduler makes a sequence of moves that determines the interleaving of threads. CPU Scheduling CPU Scheduling 101 The CPU scheduler makes a sequence of moves that determines the interleaving of threads. Programs use synchronization to prevent bad moves. but otherwise scheduling choices

More information

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP)

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP) TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) *Slides adapted from a talk given by Nitin Vaidya. Wireless Computing and Network Systems Page

More information

A Novel Approach for Load Balancing In Heterogeneous Cellular Network

A Novel Approach for Load Balancing In Heterogeneous Cellular Network A Novel Approach for Load Balancing In Heterogeneous Cellular Network Bittu Ann Mathew1, Sumy Joseph2 PG Scholar, Dept of Computer Science, Amal Jyothi College of Engineering, Kanjirappally, Kerala, India1

More information

Minimize Wait Time and Improve the Waiting Experience

Minimize Wait Time and Improve the Waiting Experience Improving the Customer Experience Minimize Wait Time and Improve the Waiting Experience www.lavi.com (888) 285-8605 Overview Waiting lines easily become the source of tension between customers and businesses

More information

Transport Layer Protocols

Transport Layer Protocols Transport Layer Protocols Version. Transport layer performs two main tasks for the application layer by using the network layer. It provides end to end communication between two applications, and implements

More information

EVALUATION OF LOAD BALANCING ALGORITHMS AND INTERNET TRAFFIC MODELING FOR PERFORMANCE ANALYSIS. Arthur L. Blais

EVALUATION OF LOAD BALANCING ALGORITHMS AND INTERNET TRAFFIC MODELING FOR PERFORMANCE ANALYSIS. Arthur L. Blais EVALUATION OF LOAD BALANCING ALGORITHMS AND INTERNET TRAFFIC MODELING FOR PERFORMANCE ANALYSIS by Arthur L. Blais B.A., California State University, Fullerton, 1982 A thesis submitted to the Graduate Faculty

More information

A Sequential Game Perspective and Optimization of the Smart Grid with Distributed Data Centers

A Sequential Game Perspective and Optimization of the Smart Grid with Distributed Data Centers A Sequential Game Perspective and Optimization of the Smart Grid with Distributed Data Centers Yanzhi Wang, Xue Lin, and Massoud Pedram Department of Electrical Engineering University of Southern California

More information

Module 6. Embedded System Software. Version 2 EE IIT, Kharagpur 1

Module 6. Embedded System Software. Version 2 EE IIT, Kharagpur 1 Module 6 Embedded System Software Version 2 EE IIT, Kharagpur 1 Lesson 30 Real-Time Task Scheduling Part 2 Version 2 EE IIT, Kharagpur 2 Specific Instructional Objectives At the end of this lesson, the

More information

Priority Based Load Balancing in a Self-Interested P2P Network

Priority Based Load Balancing in a Self-Interested P2P Network Priority Based Load Balancing in a Self-Interested P2P Network Xuan ZHOU and Wolfgang NEJDL L3S Research Center Expo Plaza 1, 30539 Hanover, Germany {zhou,nejdl}@l3s.de Abstract. A fundamental issue in

More information

PERFORMANCE AND EFFICIENCY EVALUATION OF CHANNEL ALLOCATION SCHEMES FOR HSCSD IN GSM

PERFORMANCE AND EFFICIENCY EVALUATION OF CHANNEL ALLOCATION SCHEMES FOR HSCSD IN GSM Generol Conference (Port B) PERFORMANCE AND EFFICIENCY EVALUATION OF CHANNEL ALLOCATION SCHEMES FOR HSCSD IN GSM Dayong Zhou and Moshe Zukerman Department of Electrical and Electronic Engineering The University

More information

QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES

QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES SWATHI NANDURI * ZAHOOR-UL-HUQ * Master of Technology, Associate Professor, G. Pulla Reddy Engineering College, G. Pulla Reddy Engineering

More information

Proxy-Assisted Periodic Broadcast for Video Streaming with Multiple Servers

Proxy-Assisted Periodic Broadcast for Video Streaming with Multiple Servers 1 Proxy-Assisted Periodic Broadcast for Video Streaming with Multiple Servers Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of

More information

Providing Quality of Service over a Shared Wireless Link

Providing Quality of Service over a Shared Wireless Link QOS AND RESOURCE ALLOCATION IN THE 3RD GENERATION WIRELESS NETWORKS Providing Quality of Service over a Shared Wireless Link Matthew Andrews, Krishnan Kumaran, Kavita Ramanan, Alexander Stolyar, and Phil

More information

Energy Constrained Resource Scheduling for Cloud Environment

Energy Constrained Resource Scheduling for Cloud Environment Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering

More information

Packet Queueing Delay in Wireless Networks with Multiple Base Stations and Cellular Frequency Reuse

Packet Queueing Delay in Wireless Networks with Multiple Base Stations and Cellular Frequency Reuse Packet Queueing Delay in Wireless Networks with Multiple Base Stations and Cellular Frequency Reuse Abstract - Cellular frequency reuse is known to be an efficient method to allow many wireless telephone

More information

Simple Queuing Theory Tools You Can Use in Healthcare

Simple Queuing Theory Tools You Can Use in Healthcare Simple Queuing Theory Tools You Can Use in Healthcare Jeff Johnson Management Engineering Project Director North Colorado Medical Center Abstract Much has been written about queuing theory and its powerful

More information

Modelling the performance of computer mirroring with difference queues

Modelling the performance of computer mirroring with difference queues Modelling the performance of computer mirroring with difference queues Przemyslaw Pochec Faculty of Computer Science University of New Brunswick, Fredericton, Canada E3A 5A3 email pochec@unb.ca ABSTRACT

More information

A Slow-sTart Exponential and Linear Algorithm for Energy Saving in Wireless Networks

A Slow-sTart Exponential and Linear Algorithm for Energy Saving in Wireless Networks 1 A Slow-sTart Exponential and Linear Algorithm for Energy Saving in Wireless Networks Yang Song, Bogdan Ciubotaru, Member, IEEE, and Gabriel-Miro Muntean, Member, IEEE Abstract Limited battery capacity

More information

The Analysis of Dynamical Queueing Systems (Background)

The Analysis of Dynamical Queueing Systems (Background) The Analysis of Dynamical Queueing Systems (Background) Technological innovations are creating new types of communication systems. During the 20 th century, we saw the evolution of electronic communication

More information

The International Journal Of Science & Technoledge (ISSN 2321 919X) www.theijst.com

The International Journal Of Science & Technoledge (ISSN 2321 919X) www.theijst.com THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE Efficient Parallel Processing on Public Cloud Servers using Load Balancing Manjunath K. C. M.Tech IV Sem, Department of CSE, SEA College of Engineering

More information

Operating Systems Lecture #6: Process Management

Operating Systems Lecture #6: Process Management Lecture #6: Process Written by based on the lecture series of Dr. Dayou Li and the book Understanding 4th ed. by I.M.Flynn and A.McIver McHoes (2006) Department of Computer Science and Technology,., 2013

More information

A Game-Theoretic Model and Algorithm for Load Balancing in Distributed Systems

A Game-Theoretic Model and Algorithm for Load Balancing in Distributed Systems In Proc of the 6th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002, Workshop on Advances in Parallel and Distributed Computational Models (APDCM 02, April 5, 2002, Fort Lauderdale,

More information

Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

4-5 Limiting the Holding Time in Mobile Cellular Phone Systems During Disasters

4-5 Limiting the Holding Time in Mobile Cellular Phone Systems During Disasters 4-5 Limiting the Holding Time in Mobile Cellular Phone Systems During Disasters Call demand suddenly and greatly increases during major disasters, because people want to check on their families and friends

More information

SIMULATION OF LOAD BALANCING ALGORITHMS: A Comparative Study

SIMULATION OF LOAD BALANCING ALGORITHMS: A Comparative Study SIMULATION OF LOAD BALANCING ALGORITHMS: A Comparative Study Milan E. Soklic Abstract This article introduces a new load balancing algorithm, called diffusive load balancing, and compares its performance

More information

Dynamic Load Balance Algorithm (DLBA) for IEEE 802.11 Wireless LAN

Dynamic Load Balance Algorithm (DLBA) for IEEE 802.11 Wireless LAN Tamkang Journal of Science and Engineering, vol. 2, No. 1 pp. 45-52 (1999) 45 Dynamic Load Balance Algorithm () for IEEE 802.11 Wireless LAN Shiann-Tsong Sheu and Chih-Chiang Wu Department of Electrical

More information

Change Management in Enterprise IT Systems: Process Modeling and Capacity-optimal Scheduling

Change Management in Enterprise IT Systems: Process Modeling and Capacity-optimal Scheduling Change Management in Enterprise IT Systems: Process Modeling and Capacity-optimal Scheduling Praveen K. Muthusamy, Koushik Kar, Sambit Sahu, Prashant Pradhan and Saswati Sarkar Rensselaer Polytechnic Institute

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs

Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs Mahdi Ghamkhari and Hamed Mohsenian-Rad Department of Electrical Engineering University of California at Riverside,

More information