The Power of Slightly More than One Sample in Randomized Load Balancing



Similar documents
Recurrence. 1 Definitions and main statements

Luby s Alg. for Maximal Independent Sets using Pairwise Independence

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

A Lyapunov Optimization Approach to Repeated Stochastic Games

1 Example 1: Axis-aligned rectangles

On the Approximation Error of Mean-Field Models

BERNSTEIN POLYNOMIALS

On the Interaction between Load Balancing and Speed Scaling

Value Driven Load Balancing

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

Analysis of Energy-Conserving Access Protocols for Wireless Identification Networks

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Support Vector Machines

What is Candidate Sampling

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems

Product-Form Stationary Distributions for Deficiency Zero Chemical Reaction Networks

On the Interaction between Load Balancing and Speed Scaling

Calculation of Sampling Weights

An Alternative Way to Measure Private Equity Performance

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

NON-CONSTANT SUM RED-AND-BLACK GAMES WITH BET-DEPENDENT WIN PROBABILITY FUNCTION LAURA PONTIGGIA, University of the Sciences in Philadelphia

J. Parallel Distrib. Comput.

The OC Curve of Attribute Acceptance Plans

8 Algorithm for Binary Searching in Trees

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

DEFINING %COMPLETE IN MICROSOFT PROJECT

Relay Secrecy in Wireless Networks with Eavesdropper

Lecture 3: Force of Interest, Real Interest Rate, Annuity

Period and Deadline Selection for Schedulability in Real-Time Systems

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Distributed Optimal Contention Window Control for Elastic Traffic in Wireless LANs


Pricing Overage and Underage Penalties for Inventory with Continuous Replenishment and Compound Renewal Demand via Martingale Methods

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

Extending Probabilistic Dynamic Epistemic Logic

How To Solve An Onlne Control Polcy On A Vrtualzed Data Center

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

MAC Layer Service Time Distribution of a Fixed Priority Real Time Scheduler over

Cross-Selling in a Call Center with a Heterogeneous Customer Population

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Optimal resource capacity management for stochastic networks

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts

Dynamic Pricing for Smart Grid with Reinforcement Learning

This paper concerns the evaluation and analysis of order

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

Traffic State Estimation in the Traffic Management Center of Berlin

Online Advertisement, Optimization and Stochastic Networks

Schedulability Bound of Weighted Round Robin Schedulers for Hard Real-Time Systems

INSTITUT FÜR INFORMATIK

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

Enabling P2P One-view Multi-party Video Conferencing

Sngle Snk Buy at Bulk Problem and the Access Network

OPTIMAL INVESTMENT POLICIES FOR THE HORSE RACE MODEL. Thomas S. Ferguson and C. Zachary Gilstein UCLA and Bell Communications May 1985, revised 2004

Stability, observer design and control of networks using Lyapunov methods

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

CALL ADMISSION CONTROL IN WIRELESS MULTIMEDIA NETWORKS

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE

An Interest-Oriented Network Evolution Mechanism for Online Communities

To Fill or not to Fill: The Gas Station Problem

PAS: A Packet Accounting System to Limit the Effects of DoS & DDoS. Debish Fesehaye & Klara Naherstedt University of Illinois-Urbana Champaign

denote the location of a node, and suppose node X . This transmission causes a successful reception by node X for any other node

A Secure Password-Authenticated Key Agreement Using Smart Cards

When Network Effect Meets Congestion Effect: Leveraging Social Services for Wireless Services

Stochastic Bandits with Side Observations on Networks

Using Series to Analyze Financial Situations: Present Value

Cross-Selling in a Call Center with a Heterogeneous Customer Population

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

The Application of Fractional Brownian Motion in Option Pricing

Sketching Sampled Data Streams

Embedding lattices in the Kleene degrees

Research Article Enhanced Two-Step Method via Relaxed Order of α-satisfactory Degrees for Fuzzy Multiobjective Optimization

Efficient Striping Techniques for Variable Bit Rate Continuous Media File Servers æ

An MILP model for planning of batch plants operating in a campaign-mode

The Greedy Method. Introduction. 0/1 Knapsack Problem

Availability-Based Path Selection and Network Vulnerability Assessment

Calculating the high frequency transmission line parameters of power cables

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy

Dynamic Fleet Management for Cybercars

Project Networks With Mixed-Time Constraints

In some supply chains, materials are ordered periodically according to local information. This paper investigates

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

Cautiousness and Measuring An Investor s Tendency to Buy Options

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

A Probabilistic Theory of Coherence

RESEARCH DISCUSSION PAPER

1 OPTIMIZATION ISSUES IN WEB

Feasibility of Using Discriminate Pricing Schemes for Energy Trading in Smart Grid

On File Delay Minimization for Content Uploading to Media Cloud via Collaborative Wireless Network

Performance Analysis of Energy Consumption of Smartphone Running Mobile Hotspot Application

Price Competition in an Oligopoly Market with Multiple IaaS Cloud Providers

The literature on many-server approximations provides significant simplifications toward the optimal capacity

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

How To Calculate An Approxmaton Factor Of 1 1/E

Transcription:

The Power of Slghtly More than One Sample n Randomzed oad Balancng e Yng, R. Srkant and Xaohan Kang Abstract In many computng and networkng applcatons, arrvng tasks have to be routed to one of many servers, wth the goal of mnmzng queueng delays. When the number of processors s very large, a popular routng algorthm works as follows: select two servers at random and route an arrvng task to the least loaded of the two. It s well-known that ths algorthm dramatcally reduces queueng delays compared to an algorthm whch routes to a sngle randomly selected server. In recent cloud computng applcatons, t has been observed that even samplng two queues per arrvng task can be expensve and can even ncrease delays due to messagng overhead. So there s an nterest n reducng the number of sampled queues per arrvng task. In ths paper, we show that the number of sampled queues can be dramatcally reduced by usng the fact that tasks arrve n batches called jobs. In partcular, we sample a subset of the queues such that the sze of the subset s slghtly larger than the batch sze thus, on average, we only sample slghtly more than one queue per task. Once a random subset of the queues s sampled, we propose a new load balancng method called batch-fllng to attempt to equalze the load among the sampled servers. We show that our algorthm mantans the same asymptotc performance as the so-called power-of-two-choces algorthm whle usng only half the number of samples. I. INTRODUCTION In many computng and networkng applcatons, ncludng routng, hashng, and load balancng see [4], a router also called scheduler has to route arrvng tasks to one of many servers wth the goal of mnmzng queueng delays. Such applcatons have been ncreasngly relevant recently, due to the explosve growth of cloud computng where a large number of servers n a data center are used to process a large volume of tasks. Ideally, one would lke the router to consder the queue lengths at all the servers and select the shortest of the queues snce ths s delay optmal, at least n certan traffc regmes see [6] and references cted wthn. However, samplng all the queues can be expensve when the number of servers s very large. Motvated by such consderatons, load balancng n the large-server lmt was studed n [9], [], [9]. The key result n those papers s that queueng delays can be dramatcally reduced by samplng two servers for each task, nstead of just one, and routng the task to the shorter of the two queues. We wll call ths basc algorthm the power-of-two-choces algorthm as n pror work. These results have been extended n varous drectons. In [3], [4], the results have been extended e Yng and Xaohan Kang are wth the School of Electrcal, Computer and Energy Engneerng, Arzona State Unversty, Tempe, AZ 85287 e-mal: le.yng.2@asu.edu; xaohan.kang@asu.edu R. Srkant s wth the Department of Electrcal and Computer Engneerng at Unversty of Illnos at Urbana-Champagn, Urbana, I 68 e-mal: rsrkant@llnos.edu to the case of heavy-taled dstrbutons, n [7], [8], the effect of resource poolng has been consdered, and the case of heterogeneous servers operatng under the processor-sharng dscplne has been treated n [2]. In ths paper, we are motvated by cloud computng applcatons n whch each arrval s a job consstng of many tasks, each of whch can be executed n parallel n possbly dfferent servers. In queueng theory parlance, ths model dffers from the models mentoned earler due to the fact that task arrvals occur n batches,.e., each job correspondng to a batch arrval of tasks. We note the termnology we use here: a job s a collecton of tasks, and each task can be routed ndependently of each other. Such a model arses n the well-known Map/Reduce framework, for example, where each Map job conssts of many Map tasks here, we do not consder the Reduce phase of the job. More generally, any parallel processng computer system wll have job arrvals whch consst of many tasks whch can be executed n parallel. The queston of nterest s whether the fact that there are batch arrvals can be exploted to sgnfcantly reduce the sample complexty. Here, by sample complexty, we mean the number of queues sampled per arrvng task to make routng decson. Our motvaton for ths problem arses from a study of batch arrvals to computng clusters presented n [3], where the authors observe a phenomenon called messagng overhead,.e., the overhead of provdng task backlog feedback can slow down servers and ncrease the delays experenced by tasks/jobs. Further, [3] proposes an algorthm whch acheves better performance than the power-of-two-choces algorthm when both of them use the same number of samples per arrvng task. In ths paper, we observe that ths basc algorthm for batch arrvals suggested n [3] does not work well n all traffc condtons. Moreover, we present a new algorthm whch explots batch arrvals n a manner n whch t provdes much better sample complexty than the power-of-two-choces algorthm for the same delay performance. Further, when both algorthms are allowed the same sample complexty, our algorthm acheves better delay performance. Our man contrbutons are as follows: We present an algorthm whch samples md queues where m s the batch sze.e., number of tasks of a job. Thus, d s the number of sampled queues per task. The tasks are routed to the queues usng a novel algorthm called water fllng. 2 We frst study our algorthm and other prevously proposed algorthms usng a mean-feld analyss. We show that, for any d >, we acheve better performance than the tradtonal power-of-two-choces algorthm n the large-

2 systems regme. Thus, the mean-feld analyss shows that, n the large-systems regme, we can reduce the number of samples per arrvng task dramatcally: from d = 2 to any d >. 3 We then justfy the mean-feld analyss. In partcular, we frst show that the stochastc system dynamcs converge to determnstc dfferental equatons n the large-systems lmt for any fnte t. Our proof here s motvated by the proof of a celebrated result on densty-dependent Markov processes called Kurtz s theorem [7], but our model s somewhat nonstandard and requres addtonal steps whch are not needed n the orgnal Kurtz s theorem. Further, usng a novel yapunov functon, we show that the system of dfferental equatons converges to an equlbrum descrbed by the mean-feld analyss. Then by showng the nterchange of the lmts, we prove the statonary dstrbuton of the queue sze dstrbuton converges to the soluton of the dfferental equatons. 4 Fnally, we perform extensve smulatons to justfy that our analytcal conclusons are ndeed vald n large, but fnte, systems. In partcular, smulatons show that our algorthm wth just one sample per task on average, acheves the same job delay performance as the power-of-two-choces algorthm and dramatcally reduces the delay compared to the algorthm proposed n [3]. II. PROBEM STATEMENT AND MAIN RESUTS We consder a computng cluster wth n dentcal servers and a central scheduler as shown n Fgure. Each server can process one task at a tme. Tasks arrve at the scheduler n batches also called jobs. Each batch conssts of m tasks and the job arrval process s a Posson process wth rate n mλ. We want the batch sze to be not too small, so we assume that m = Θlog n and m s ncreasng functon of n. For smplcty, we consder a determnstc batch sze here, but the results n the paper can be extended to random batch szes as well n a straghtforward manner, as wll be dscussed n the extended verson of the paper. Furthermore, the results of ths paper hold when the system has multple dstrbuted schedulers and the job arrvals on these schedulers are ndependent Posson processes wth aggregated rate n mλ. Ths s because the sum of ndependent Posson processes s Posson. The scheduler dspatches the tasks to the servers when a job arrves. The servce tmes of the tasks are exponentally dstrbuted wth mean, and are ndependent across tasks. When a task arrves at a server, t s processed mmedately f the server s dle or wats n a FIFO frst-n, frst-out queue f the server s busy. We frst descrbe the tradtonal power-of-d-choces algorthm whch s a smple generalzaton of the power-oftwo-choces mentoned n the prevous secton and another prevously-proposed dea called the batch samplng algorthm. Then, we present our dea whch we call batch-fllng, whch combnes batch samplng wth our new load balancng technque called water-fllng. The-Power-of-d-Choces [], [9]: When a batch of m tasks arrve, the scheduler probes d servers unformly at random for each task. The task s routed to the least loaded server. m tasks n each job scheduler server server 2 server n Fg. : A computng cluster wth n servers and a central scheduler Batch-Samplng [3]: When a batch of m tasks arrve, the scheduler probes dm servers unformly at random to acqure ther queue lengths. The m tasks are added to the the least loaded m servers, one for each server. In ths paper, we propose a new load-balancng algorthm, named batch-fllng: we sample queues as n the batch samplng algorthm but the way that tasks are routed to servers uses a dfferent procedure whch we call water-fllng. Batch-Fllng: When a batch of m tasks arrve, the scheduler probes dm servers unformly at random to acqure ther queue lengths. The m tasks are added to the dm servers usng water fllng, specfcally, the tasks are dspatched one by one to the least loaded server, where the queue length of a server s updated after t receves a task. Remark: In batch-fllng, the frst task n a batch s routed to the least loaded server among the sampled servers,.e., the one wth the smallest number of tasks n ts queue. The key dfference compared to batch-samplng s that the server s queue sze s updated after ths whch means that ths server may no longer be the least-loaded n the sampled servers, and then the next task n the batch s agan routed to the least loaded server, and so on. As we wll see later, ths small change to the routng algorthm has dramatc consequences to the sample complexty of the algorthm. In all algorthms, at each step, tes are broken at random f there s more than one least-loaded server. In ths paper, d s called probe rato, whch s assumed to be a constant ndependent of n. As n [], [9], we wll study the dfferent algorthms n the large-systems lmt,.e., as n, snce a data center today may consst of tens of thousands of servers. The man theoretcal results whch wll be establshed n the paper are summarzed n Table I, and we dscuss them below. The expected per-task delay of batch-fllng wth any d > s smaller than both batch-samplng wth d = 2 and thepower-of-two-choces when λ. In other words, batchfllng outperforms the other two algorthms by samplng slghtly more than one server per task, hence the ttle of the paper. The sze of the longest-queue n the system under the-

3 Expected per-task delay Maxmum queue sze n the system λ Batch-Fllng Batch-Samplng Pod log λ log+λd + O λ log λ λ logλd + O λ log λ λ logλd + O λ log λ log d d λ log+λd logλd f λd f λd = TABE I: Ths table summarzes the expected per-task delays and the maxmum queue szes of the three schedulng algorthms. The order notaton O λ s defned when / λ,.e., λ. Pod stands for the-power-of-d-choces. In batch-fllng and batch-samplng, d > ; and n the-power-of-d-choces, d s an nteger and d 2. λ power-of-d-choces s unbounded for any d 2 because the statonary queue length dstrbuton has unbounded support. The szes of the longest-queue under both batch-fllng and batch-samplng are fnte because the statonary dstrbutons have bounded support. The longest queue under batch-fllng wth d > s smaller than that of batch-samplng wth d = 2 when λ. When d s close to, the sze of longest queue under batch-fllng s much smaller than that under batch-samplng 7 versus 26 when d =. and λ =.99. The small and bounded sze of the queues under batch fllng has mportant consequences. A job s sad to be completed when all the tasks n the job are completed. Snce the tal of the queue sze s cut off, ths has the effect of sgnfcantly reducng job completon delays, as we wll see later n the smulatons secton. The above theoretcal results suggest that the sample complexty.e., the number of samples per arrvng task can be sgnfcantly reduced under batch-fllng. On the other hand, the computatonal complexty s slghtly ncreased compared to batch-samplng snce we requre to have to compare the szes of the smallest queues and the next smallest queues each tme a task s routed. However, ths ncrease n computatonal complexty s a cost to be pad at the router whereas ncreased sample complexty slows down the servers snce they have to send queue length feedback whch takes tme away from ther prmary role of processng tasks. Ths s the reason why sample complexty s a more sgnfcant ssue than the computatonal complexty n data centers although we do not want the computatonal complexty to be very hgh ether. The batch-samplng algorthm performs Odm log m computatons per batch whch correspondng to a sortng operaton, whle batch-fllng algorthm performs an addtonal 2m operatons snce t has to keep track of the queue lengths of the smallest queues and the next smallest queue. III. MEAN-FIED ANAYSIS In ths secton, we wll use mean-feld analyss to study the statonary dstrbutons of the queue lengths under batch-fllng and batch-samplng. The results wll be further valdated usng a proof nspred by the proof of Kurtz s theorem n Secton IV. et Q n k t denote the queue length of the kth server at tme t n a system wth n queues. It can be easly verfed that Q n t s an rreducble and nonexplosve Markov chan, and usng the standard Foster-yapunov theorem see, for example, [5] t can be verfed that the Markov chan s postve recurrent and hence, has a unque statonary dstrbuton. Theorem. The Markov chan Q n t s postve recurrent under batch-fllng. Furthermore, there exsts a constant c >, ndependent of n, such that for any n, where n the steady state. E n ˆQ k [ n n k= ˆQ n k ] < c denotes the queue length of server k The proof of ths theorem s presented n the appendx. et π n denote the statonary dstrbuton of queue k,.e., the probablty that the queue sze s at server k. Here, the ndex k s gnored because the statonary dstrbutons are dentcally across servers. Accordng to the theorem above, we have πn < c, whch further mples that π n as and j= πn j as. We remark that one challenge n provng that the stochastc system dynamcs converge to determnstc dfferental equatons les n that the system s an nfnte-dmensonal system. We wll utlze the facts mentoned above to overcome ths challenge n the proofs. The mean-feld analyss proceeds as follows. Assume the n queues are n the steady state, and further assume that the queue lengths are dentcally and ndependently dstrbuted..d. wth dstrbuton π. Ths..d. assumpton n the meanfeld analyss wll be valdated later n Secton IV n the largesystems lmt. Now consder the queue evoluton of one server n the system. Each queue forms an ndependent Markov chan as shown n Fgure 2, denoted by Q n t, and the transton rates wll be determned by the partcular strategy used to route tasks to servers. We wll derve the transton rates for each of the strateges descrbed earler, namely batch fllng, batch samplng, and the power-of-d-choces, n the rest of ths secton. q n,+k q n,+2 q n,+ + +2 K Fg. 2: The Markov chan representng the nth system n the mean feld analyss

4 A. The statonary dstrbuton under batch-fllng We frst consder the batch-fllng algorthm. The downcrossng transton rate from state to s for all,.e., q n, =, because the processng tme of a task s exponentally dstrbuted wth mean. The up-crossng transton rate from state to state j for j > s q n,j = n m λ dm n φ = dλ φ In the expresson above, Pφ Pj φ, Pφ Pj φ,. n mλ s the batch arrval rate; dm/n s the probablty a server s probed when dm servers are sampled; φ s a dm -vector that denotes the queue lengths of the other dm sampled servers, so and Pφ = dm k= π φk ; Pj φ, s the probablty that a server s queue length becomes j when the server s sampled and s n state, and the the states of the other dm sampled servers are φ. Wthout loss of generalty, assume φ k φ l f k l,.e., φ s ordered. Recall that batch-fllng dspatches tasks usng water fllng among the sampled dm queues. Therefore, gven and φ, ether j = f no task s assgned to the server, or j takes two possble values. Consder a smple example n Fgure 3 where three tasks wll be dspatched to four servers wth queue lengths,, 4, and 4. Then the servers whose queue sze s 4 wll not receve any task, and the servers whose queue sze s wll receve one or two tasks. Fg. 3: An example of water fllng Assume tes are broken unformly at random. The values of Pj φ, are summarzed below. If φ k I φk m, 2 dm k= whch means that the tasks wll be assgned to servers whose orgnal queue szes are smaller than, then { f j =, Pj φ, = f j. If condton 2 does not hold, then the server wth queue sze wll receve some tasks, and { α φ, f j = Pj φ, = Q φ,, α φ, f j = Q φ,, where Q φ, = mn { j : j + } j φ k I φk j m, dm k= whch s the maxmum sze a queue can be flled up to durng the water fllng, and α φ, s gven by m Q φ, dm k= Q φ, φ k I φk Q φ, +, dm k= I φk Q φ, whch s the probablty that a server receves one more task after ts queue sze becomes Q φ, durng water-fllng. Whle the transton rate q n,j n s a complex expresson for fnte n, the followng lemma shows that q n,j converges to some smple q,j as n. The proof of ths lemma s presented n the appendx. emma 2. Under batch-fllng, the transton rates gven dstrbuton π, denoted by q n,j π, converges; and specfcally, lm n qn,j π = q,jπ, where for j, f j =, λd α π f j = q,j π = Q π >, λdα π f j = Q π >, otherwse, and α π = Q π = mn { j : } j j lπ l d l= d Q π 2 Q π jπ j, ]. Qπ π j Accordng to the lemma above, the queue length dynamcs of a sngle server, n the lmt as the number of servers becomes nfnty, can be represented by the Markov chan n Fgure 4, where the up-crossng transtons are nto only two states Q π and Q π due to water fllng. Based on emma 2, we can calculate the statonary dstrbuton of the queue length of a sngle server n the large-system lmt by fndng ˆπ that satsfes the global balance equaton [5]. 3

5 We next check the global balance equatons. For =, Fg. 4: The queue-length Markov chan of a sngle-server, n the large-system lmt, under batch-flng ˆπ q, QBF + q, QBF ˆπ q, = λλd λλd =. For 2, Theorem 3. The statonary dstrbuton of the queue length of a sngle server n the large-system lmt under batch-fllng s λ =, λλd + λd ˆπ =, λ + λd = Q 4 BF, otherwse. where Q BF = log λ log+λd. The expected queue length s log λ log + λd + O λ. Proof. We frst show = Qˆπ, where Q π s defned n 3. Note that QBF. If λ and d are such that QBF =, then equvalently log λ log + λd, whch mples that λ + λd, or d λ. Then ˆπ = λ, λ,,... and Qˆπ = =. If λ and d are such that QBF >, accordng to 3, to show Qˆπ = we only need to show 2 l= lˆπ l < QBF d lˆπ l. 5 et HS and RHS denote the left-hand-sde and the rghthand-sde of 5. Then and HS = RHS = 2 = = Then 5 s equvalent to l= ˆπ j = λ + λd, λd ˆπ j = λ + λd. λd log λ < log + λd, whch holds accordng to the defnton of. ˆπ q, + q, QBF + q, QBF ˆπ + q +, = λλd + λd + λd λλd + λd =. For =, ˆπ QBF q QBF, 2 + q QBF, 2 = ˆπ q, QBF ˆπ QBF q QBF, = λλd + λd 2 + λdαˆπ λ + λd 2 λd αˆπ λ + λd = λ + λd λdαˆπ +. From the defnton of αˆπ we can verfy that So we have αˆπ = =. For =, λd λ + λd λd. ˆπ QBF q QBF, 2 + q QBF, 2 = ˆπ q, QBF ˆπ QBF q QBF, ˆπ QBF q QBF, = ˆπ q, QBF = λ + λd λ + λd λdαˆπ = λ + λd + λdαˆπ =. So the global balance equatons holds. Fnally the expected queue length n statonary dstrbuton

6 s ˆπ + 2ˆπ 2 + + ˆπ QBF = ˆπ j = = = λ + λd = λ + λd λd log λ = log + d + O λ. B. The statonary dstrbuton under batch-samplng Recall n batch-samplng, the m tasks are routed to the leastloaded m queues among the sampled dm queues. Consder a server wth queue sze and assume t s probed. Then the server wll receve a task wth probablty dm + m k= I φk =j E mn, m = E mn, dm E mn, + dm k= I φk = dm k= I φk =j dm dm + dm k= I φk = dm d π j π +. + Followng a smlar analyss as batch-fllng, we can establsh the followng lemma. The detals are omtted. emma 4. Under batch-samplng, the transton rates gven dstrbuton π, denoted by q n,j π converges; and specfcally, f j =, lm n qn,j π = q λd f + = j,jπ = Q π, λα π f + = j = Q π, otherwse, where and Q π = mn α π = { : } π j d l= d Q π 2 π j π Qπ, ]. The Markov chan n the large-system lmt s shown n Fgure 5. Gven π, the Markov chan s a brth-death process up to state Q π. The statonary dstrbuton can agan be calculated usng the global balance equatons. The results are presented n Theorem 5, and the detals are omtted. Fg. 5: The Markov chan n the large-system lmt under batch-samplng Theorem 5. The statonary dstrbuton of the queue length of a sngle server n the large-system lmt under batch-samplng s where λ =, λλ d ˆπ = Q BS, λ λ d λd = Q BS, otherwse. Q BS = The expected queue length s log d d λ logλd log λ + O λ. logλd C. The statonary dstrbuton under the-power-of-d-choces For a system wth non-batch sngle arrvals, the statonary queue-length dstrbuton of a sngle server n the large-system lmt under the-power-of-d-choces has been establshed n [], [9]. The power-of-d choces routng under our batcharrval model also satsfes the same lmtng queue-length dstrbuton, whch we provde below for comparson purposes. Theorem 6. The statonary dstrbuton of the queue length of a server n the nfnte system under the-power-of-d-choces s ˆπ = λ d d The expected queue length s λ d + d. log λ + O λ. logλd IV. DIFFERENTIA EQUATIONS AND KURTZ S THEOREM The results n the prevous secton were obtaned usng the mean-feld analyss whch assumes that the queues are..d. across servers. We wll justfy the mean-feld analyss n ths secton. Agan, we wll focus on batch-fllng. The same results can be establshed for batch-samplng and the-power-of-d-choces by followng smlar steps. We frst consder the followng.

7 non-lnear system descrbed by dfferental equatons: dx = + λdx + x + X x 2, λd α x x j + λdα x x + x +, = X x λdα x x j x + x +, x + x + where and X x = mn α x = { j : } j j lx l d l= d X x 2 X x jx j. Xx x j = X x otherwse, 6 These dfferental equatons are derved from the Markov chan n Fgure 4. Vew x as the fracton of queues wth length. Consder x for X x 2. Accordng to Fgure 4, x decreases wth rate x + λd because the queue sze of a server wth sze becomes wth rate and becomes X x or X x wth total rate λd; and x ncreases wth rate x + because a queue wth sze + becomes a queue wth sze wth rate. Note ths s a non-lnear system because α x and X x depend on the state x. We further defne s t = x j t j= for, whch s related to the fracton of the servers wth queue sze, and ŝ = ˆπ j j= for ˆπ defned n 4. Note that s t = for any t. The dfferental equatons of the non-lnear system can be wrtten n terms of st as follows: where ds = λd + λds + s +, λ λd s j s + s +, = s + s + otherwse, = max : s j d. The followng theorem establshes the equlbrum pont and the stablty of ths non-lnear system. The proof s presented n the appendx. 7 Theorem 7. Assume the ntal condton s satsfes = s s 2 and s <. Startng from s, the system converges to the equlbrum pont ŝ as t, where s the -norm. Next defne Π n sze n the nth system, and π n t to be number of servers wth queue t = n Πn t to be the fracton of servers wth queue sze n the nth system. Here we delberately reuse notaton π because n the steady state, the fracton of servers wth queue sze s equal to the probablty that the queue sze of a server s. However, note that here π n t s a random vector nstead of a dstrbuton. Defne the vector Γ n t N such that ts th component Γ n t = j= Πn j t s the number of servers whose queue lengths are at least, γ n t = Γn t n, and ˆγ such that ˆγ = j= ˆπ j for ˆπ defned n 4. The followng theorem states that γ n t, whch s stochastc, concdes wth st for any bounded tme nterval [, t] when n. Here we defne Ū to be the space of all sequences γ such that = γ γ 8 wth the -norm. The proof s presented n the appendx. Theorem 8. Suppose that γ n s n probablty, where s s a determnstc ntal condton such that s and s <. Then the followng holds lm sup n u t γ n u su = n probablty. Ths result s motvated by Kurtz s theorem [7]. However, we remark that Π n t s not a classcal densty dependent Markov chan because q n,j cannot be wrtten n the form of nβ l for some β l ndependent of n, and γ n s an nfntedmensonal vector. Therefore, the proof of Kurtz s theorem does not drectly apply. Our proof s a non-trval extenson of Kurtz s theorem. We also remark that s = x < s related to the average queue sze at a server, so the condton smply requres the average queue length per server s bounded ntally. Theorem 7 and Theorem 8 establsh the followng result: whch further mples that γ n t n st t ˆγ, 9 π n t n xt t ˆπ. A drect consequence of s that f ˆπ n converges to some π or a subsequence of ˆπ n converges to some π, then π = ˆπ. The convergence of statonary dstrbutons wll be dscussed n the next secton. V. CONVERGENCE OF THE STATIONARY DISTRIBUTIONS We frst present a theorem on the nterchange of lmts. The theorem s smlar to Theorem 5. n []. However, [] assumes the state space of each system s fnte but n our system, the state space of each queue s the set of nonnegatve

8 ntegers. Whle the proofs are smlar, we present t here for the completeness of the paper. Theorem 9. Consder a sequence of random processes X n ndexed by a scalng parameter n, where X n s a vector that denotes value of the process at tme t, and a dynamc system Ẋt = FX. Assume X n and ˆX satsfy the followng assumptons: A Suppose that for any n, X n t w ˆX n, where ˆX n s the statonary dstrbuton of the random process and x denotes the weak convergence. A2 Suppose for each fnte t, when X n t w Xt, 2 lm n Xn = X where both X n and X are determnstc ntal condtons, and X X, where X s a set of ntal condtons. A3 Startng from each ntal condton X X, assume that lm Xt = ˆX. 3 t A4 Any subsequence of ˆXn has a subsubsequence that weakly converges. The lmt of any convergent subsequence, denoted by X, satsfes P X X = and ts support s separable. Then ˆX n w ˆX. Ths result establshes an nterchange of lmts because from A and A2, we have lm lm t n Xn t = lm Xt = ˆX. t The theorem says that wth addtonal assumptons, we further have lm lm n t Xn t = ˆX. The proof s presented n the appendx. By utlzng the result above, we show the convergence of the statonary dstrbuton n the followng theorem. Theorem. Proof. Defne ˆγ n w γ. X = {γ : = γ γ 2..., γ < }, whch s separable because t s a subspace of l = {γ : γ < }, whch s a separable metrc space. A holds due to Theorem. Note lm n γ n = s for determnstc ntal condtons γ n and s mples that γ n s n probablty. Therefore, accordng to Theorem 8, gven determnstc ntal condtons γ n and s such that lm n γ n = s, we have lm sup n u t γ n u su = n probablty, whch mples weak convergence. A3 s establshed n Theorem 7. To valdate A4, we consder the space Ũ whch s the set of γ satsfyng 8 and wth the followng norm used n [9] γ γ = sup γ γ. Under ths norm, space Ũ s compact. Defne ˆγn = j= ˆπn j where ˆπ n denotes the statonary dstrbuton of π n t. By Prokhorov s theorem [2], snce Ũ s compact, there exsts a subsubsequence for any subsequence of ˆγ n that weakly converges to a random vector γ under, whch s denoted by ˆγ nk. By the Skorohod representaton theorem, there exsts a sequence of random vectors wth the same dstrbutons that converge almost surely. By slght abuse of the notaton, we assume ˆγ n k converges to γ almost surely. Snce γ, by the domnated convergence theorem, we have Defne lm k E[ ˆγn k γ ] =. 4 f k γ = k γ. = It s easy to verfy that f k s a contnuous and bounded functon under the -norm. Accordng to the defnton of weak convergence, we have E [ γ ] = lm k E [f k γ] = lm k lm nk E [ f k ˆγ n k ] c, 5 where the last nequalty s due to Theorem, whch mples that P γ X =. The unform convergence of the seres [ ] n E ˆγ k γ =k 6 s establshed n Appendx F. By Tonell s theorem, [ ] lm E ˆγ nk γ =, 7 k whch mples ˆγ n k converges weakly to γ n -norm. Therefore A4 holds. Based on the theorem above, we further have the followng results accordng to usng the same analyss for gettng 4 and 7. Corollary. lm n E[ˆγn ] = ˆγ,

9 and lm E n [ ˆγ n ] = ˆγ, 8 [ ] lm E ˆγ n ˆγ =. 9 n In the next corollary, we show that any k queues are ndependently and dentcally dstrbuted wth dstrbuton ˆπ n the large-system lmt, where k s a constant ndependent of n. Then the system s sad to be ˆπ-chaotc [6]. We prove the result by showng that the unque statonary dstrbuton of k queues that satsfes the detaled balance equatons n the large-system lmt has a product form. Corollary 2. Consder a set of k servers, and wthout loss of generalty, assume the severs are, 2,, k. et π n Q, Q 2,, Q k denote the staton dstrbuton of the queue lengths of these k servers. In the large-system lmt, we have lm n πn Q, Q 2,, Q k = k ˆπ Q, =.e., the k queues are ndependently and dentcally dstrbuted wth dstrbuton ˆπ. VI. SIMUATIONS In ths secton, we use smulatons to evaluate the performance of the three load balancng algorthms n large, but fnte-server, systems. A. Determnstc Batch Sze We frst consdered systems wth n =, servers, batch sze m =. We evaluated the per-task and per-job delays of the three algorthms wth dfferent probe ratos d. Fgures 6 and 7 show the per-task delays and per-job delays, respectvely, when λ =.7. Fgures 8 and 9 show the per-task delays and per-job delays, respectvely, when λ =.9. From these fgures, we have the followng observatons. In terms of per-task delays, batch-fllng matches the powerof-two-choces wth d =.3 when λ =.7 and wth d =.2 when λ =.9. Batch-samplng, on the other hand, requres d =.6 when λ =.7 and d =.7 when λ =.9 to acheve the same per-task delay as the power-of-twochoces. Furthermore, even wth d =, the per-task delay of batch-fllng s only slghtly larger than that of the powerof-two-choces; but batch-samplng has much larger per-task delay when d = versus 3 when λ =.9. Note that the per-job delay of batch-samplng wth d = has been omtted n the fgure for readablty of the fgure. Batch-fllng performs even better n terms of per-job delays. As we can see from Fgures 7 and 9, batch-fllng matches the power-of-two-choces even wth d =! We beleve ths s because the maxmum queue sze of batch-fllng s smaller than that of the power-of-two-choces when d = even though the average queue sze s larger. Batch-samplng requres larger probe ratos to match the per-job delays of the power-of-two-choces. Ths s because the maxmum queue sze of batch-samplng s larger than that of batch-fllng as shown n Table I. B. Random Batch Sze In ths set of smulatons, we evaluated the performance of algorthms under random batch szes. We assume the batch sze M s random varable such that wth probablty.5, M s geometrcally dstrbuted wth mean 75; and wth probablty.5, M s geometrcally dstrbuted wth mean 25. The other settngs are the same as those used wth fxed batch szes. The results for λ =.7 are shown n Fgures and Fgure ; and the results for λ =.9 are shown n Fgures 2 and 3. We note that the conclusons of our prevous smulatons do not change wth these modfcatons. VII. CONCUSIONS AND EXTENSION In ths paper, we proposed a new load-balancng algorthm, named batch-fllng, whch uses water-fllng to attempt to equalze the load among the sampled servers. The algorthm provdes a much lower sample complexty than the powerof-two-choces algorthm for the same delay performance. Specfcally, t only needs to sample slghtly more than one queue per task to match the per-job delay of the power-oftwo-choces algorthm. We remark that the theoretcal results of ths paper can be extended to random batch szes. et M n t denote the batch sze at tme t n the nth system. Assume M n t are..d. across tme t. The man results of ths paper hold gven the sequence of random varables M n E[M n ] converge n dstrbuton, are unformly ntegrable, and M n t = Θlog n. In partcular, Theorem can be establshed by usng the same dea that the yapunov drft of water-fllng s domnated by random M n routng. emma 2 also holds because converge n E[M n ] probablty. The dfferental equatons reman the same under random batch sze, so Theorem 7 s stll vald. Fnally, t s easy to verfy that D /dm converges n mean as m, where m = E[M n ]. ACKNOWEDGMENTS Research was funded n part by ARO MURI W9NF-2- -385 and NSF Grants ECCS-255425 and ECCS-2265. REFERENCES [] V. Anantharam and M. Benchekroun. A technque for computng sojourn tmes n large networks of nteractng queues. Probablty n the engneerng and nformatonal scences, 74:44 464, 993. [2] P. Bllngsley. Weak convergence of measures: Applcatons n probablty. Phladelpha: SIAM, 97. [3] M. Bramson, Y. u, and B. Prabhakar. Asymptotc ndependence of queues under randomzed load balancng. Queueng Systems, 73:247 292, 22. [4] M. Bramson, Y. u, and B. Prabhakar. Decay of tals at equlbrum for FIFO jon the shortest queue networks. The Annals of Appled Probablty, 235:84 878, 23. [5] M. Draef and. Massoule. Epdemcs and rumours n complex networks. Cambrdge Unversty Press, 2. [6] A. Erylmaz and R. Srkant. Asymptotcally tght steady-state queue length bounds mpled by drft condtons. Queueng Systems, 723-4:3 359, 22.

Average task delay 4 3.5 3 2.5 2.5.5 Po2 BS BF.2.4.6.8 2 Probe rato Fg. 6: The average task delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.7 and determnstc batch szes. Average job delay 8 5 2 9 6 3 Po2 BS BF.2.4.6.8 2 Probe rato Fg. 7: The average job delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.7 and determnstc batch szes. Average task delay 6 5 4 3 2 Po2 BS BF Average job delay 2 6 2 8 4 Po2 BS BF.2.4.6.8 2 Probe rato Fg. 8: The average task delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.9 and determnstc batch szes..2.4.6.8 2 Probe rato Fg. 9: The average job delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.9 and determnstc batch szes. Average task delay 4 3.5 3 2.5 2.5.5 Po2 BS BF.2.4.6.8 2 Probe rato Fg. : The average task delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.7 wth random batch szes. Average batch delay 5 2 9 6 3 Po2 BS BF.2.4.6.8 2 Probe rato Fg. : The average job delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.7 wth random batch szes.

Average task delay 6 5 4 3 2 Po2 BS BF Average batch delay 2 5 5 Pod BS BF.2.4.6.8 2 Probe rato Fg. 2: The average task delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.9 wth random batch szes..2.4.6.8 2 Probe rato Fg. 3: The average job delays for power-of-two-choces Po2, batch-samplng BS and batch-fllng BF wth λ =.9 wth random batch szes. [7] S. N. Ether and T. G. Kurtz. Markov processes: characterzaton and convergence, volume 282. John Wley & Sons, 29. [8] H. K. Khall. Nonlnear systems. Prentce hall Upper Saddle Rver, 2. [9] M. Mtzenmacher. oad balancng and densty dependent jump Markov processes. In Foundatons of Computer Scence, 996. Proceedngs., 37th Annual Symposum on, pages 23 222. IEEE, 996. [] M. Mtzenmacher. The Power of Two Choces n Randomzed oad Balancng. PhD thess, Unversty of Calforna at Berkeley, 996. [] M. Mtzenmacher. The power of two choces n randomzed load balancng. Parallel and Dstrbuted Systems, IEEE Transactons on, 2:94 4, 2. [2] A. Mukhopadhyay and R. R. Mazumdar. Analyss of load balancng n large heterogeneous processor sharng systems. arxv preprnt arxv:3.586, 23. [3] K. Ousterhout, P. Wendell, M. Zahara, and I. Stoca. Sparrow: dstrbuted, low latency schedulng. In Proceedngs of the Twenty-Fourth ACM Symposum on Operatng Systems Prncples, pages 69 84. ACM, 23. [4] A. Rcha, M. Mtzenmacher, and R. Staraman. The power of two random choces: A survey of technques and results. Combnatoral Optmzaton, 9:255 34, 2. [5] R. Srkant and. Yng. Communcaton Networks: An Optmzaton, Control and Stochastc Networks Perspectve. Cambrdge Unversty Press, 24. [6] A.-S. Szntman. Topcs n propagaton of chaos. In Ecole d Eté de Probabltés de Sant-Flour XIX989, pages 65 25. Sprnger, 99. [7] J. N. Tstskls and K. Xu. On the power of even a lttle resource poolng. Stochastc Systems, 2: 66, 22. [8] J. N. Tstskls and K. Xu. Queueng system topologes wth lmted flexblty. In Proceedngs of the ACM SIGMETRICS/nternatonal conference on Measurement and modelng of computer systems, pages 67 78. ACM, 23. [9] N. D. Vvedenskaya, R.. Dobrushn, and F. I. Karpelevch. Queueng system wth selecton of the shortest of two queues: An asymptotc approach. Problemy Peredach Informats, 32:2 34, 996. APPENDIX A PROOF OF THEOREM We gnore the superscrpt n of Q n k t as we wll focus on the nth system. Defne the yapunov functon to be V Qt = n Q 2 kt. k= et x, y N n denote the state of the Markov chans, and q x,y denote the transton rate from state x to state y. Accordng to the Foster-yapunov theorem for contnuous-tme Markov chan see, for example, Theorem 9..8 n [5], we consder y x q x,y V y V x. 2 Defne n vector e k such that e k [k] = and e k [l] = for any l k. Then q x,x ek V x e k + V x 2x k +, whch corresponds to a departure at server k. Next defne Ψ x to be the set of possble states of the Markov chan when a batch arrval occurs when the system s n state x, then λn q x,y V y V x a 2 m x k + m m n y Ψ x = 2λ k k x k + λn, The nequalty a can be establshed by comparng batchfllng wth the load-balancng polcy that places the m tasks to a set of randomly selected m servers, one for each server. Note that water-fllng s the optmal soluton to the followng problem: subject to: dm mn a k= a k + Q k 2 n k= a k = m a k N k. Therefore y Ψ x q x,y V y s mnmzed under water-fllng, condtoned on the same set of dm sampled queues, and nequalty a holds. Therefore, we have x k + n + λn. y x q x,y V y V x 2 2λ k Therefore, the Markov chan s postve recurrent accordng to the Foster-yapunov theorem. Now assume the system s n

2 the steady state, then we have = E q x,y V y V x y x [ ] 2 2λ E x k + n + λn, whch mples that [ ] E x k + λ n 2 2λ λ. k Therefore, the theorem holds by choosng c = / λ. APPENDIX B PROOF OF EMMA 2 Wthout loss of generalty, assume server has queue sze and has been probed. Now gven any j, defne X j = k dm k= I φk =j, whch s the number of probed servers wth queue sze j wthout ncludng server, and s the summaton of dm..d. Bernoull random varables wth mean π j. We further defne µ j = E[X j ] = dm π j. Consder any such that Q π. The probablty that server receves a task n water fllng s upper bounded by m E jx + j + X j Qπ m E Q + π jx j + Q π X j m = E dm Q π Q + π j Xj dm dm + Q 2 π X j dm whch converges to d Q π Q + π jπ j Qπ π j 22 as m because X j /dm converges to π j n dstrbuton and the term nsde the expectaton s bounded and contnuous n terms of X j /dm. Accordng to the defnton of Q π 3, we know that so 2 and, Q π d Q π jπ j, q,j = Q π and j {, }. 23 Now we assume < Q π. In ths case, the queue sze of server becomes Q Q > after water fllng wth probablty Q 2 m Q E mn, Q jx + j + Q X. j Smlar to the analyss above, t can be shown that converges to m Q Q 2 Q jx j + Q X j d Q 2 Q jπ j Q π. j For Q Q π +, accordng to the defnton of Q π, we have For Q = Q π, For Q Q π, Q π d Q π jπ j. d Q π 2 Q π jπ j = α Qπ π. π j d Q 2 Q jπ j Q π j d Q π 3 Q π 2 jπ j Qπ 2 π j Qπ 2 Q π jπ j Q π 3 Q π 2 jπ j Qπ 2 π j =. 24 Therefore, for any < Q π and j, we have λdα π, f j = Q π q,j = λd α π, f j = Q π, otherwse. Hence, the lemma holds. APPENDIX C PROOF OF THEOREM 7 25 Motvated by the proof n [], we consder the followng yapunov functon V t = s t ŝ. = Defne ɛ = s ŝ, so the yapunov functon can be wrtten as V t = ɛ t. =

3 We wll analyze the upper rght-hand dervatve dv t n three dfferent cases. = lm t t + V t V t t t In the frst case, consder s such that Xs =. In ths case, the dfferental equatons can be wrtten n terms of ɛ n the followng form: + λdɛ + ɛ +, dɛ = λd ɛ j ɛ + ɛ +, ɛ + ɛ + = otherwse. Now for, d ɛ = + λdɛ + ɛ + f ɛ >, = + λdɛ ɛ +, f ɛ <, = ɛ +, f ɛ =. whch mples that d ɛ + λd ɛ + ɛ +. Smlarly, we can obtan that d ɛ 26 { ɛ + λd j= ɛ j + ɛ + f =, ɛ + ɛ + f >. Combnng the results above and the fact that s t as for any t, we conclude n ths case, dv t d ɛ = ɛ. = In the second case, consder s such that Xs >. Then, smlar to the analyss of the frst case, we have dɛ + λd ɛ + ɛ +. 27 We next consder two subcases. In the frst subcase, s QBF ŝ QBF. Note that ŝ = for any >, so we have d ɛ = = dɛ = = = λ λd = λd ɛ QBF + λd = ds s j s QBF ɛ j ɛ QBF Combnng 27 and 28, we obtan dv t ɛ ɛ QBF. ɛ j. 28 In the second subcase, s QBF < ŝ QBF. In ths case = + = = + d ɛ dɛ = = + ds = λ λd s j s QBF + = λ λd ŝ j + λd λd ɛ j ɛ QBF + ɛ j ɛ QBF +, 29 where the last nequalty holds due to the defnton of and the fact that ɛ QBF +t = s QBF +t for any t. Next, gven s QBF < ŝ QBF, we have d ɛ QBF = ds = λd + + λds QBF s QBF + = λd + + λdŝ QBF + + λdɛ QBF ɛ QBF + + λd ɛ QBF + ɛ QBF +, 3 where the last nequalty holds because ɛ QBF <, and λd + + λdŝ QBF = λd + + λd λ + λd = λ + λd λ λ =. Combnng nequaltes 27, 29 and 3, we obtan dv t ɛ. In the thrd case, consder s such that Xs <. In ths case, we frst have and = + d ɛ = = + ds = ɛ +, 3 dɛ + λd ɛ + ɛ + <. 32 We next further consder the followng subcases.

4 Assume s Xs < ŝ Xs, so d ɛ Xs = λ + λd s j + s Xs s Xs+ Note that ŝ ŝ + = λd ŝ for any <, so d ɛ Xs = λ + λd ŝ j λd X s ɛ j + ɛ Xs ɛ Xs+ λ + λd ŝ j + λd X s Next for < <, we have whch mples that d ɛ ɛ j ɛ Xs + ɛ Xs+. 33 dɛ = s + s + = ŝ + ŝ + ɛ + ɛ + =λd λdŝ ɛ + ɛ +, λd ŝ ɛ + ɛ + < <. For =, we have dɛ QBF = s QBF + s QBF + = ŝ QBF + ŝ QBF + ɛ QBF + ɛ QBF + =λ λd whch mples that d ɛ QBF λ λd Summng nequaltes 3-35, we dv t ɛ. Assume s Xs ŝ Xs, then d ɛ Xs =λ λd 34 ŝ j ɛ QBF + ɛ QBF +, ŝ j ɛ QBF + ɛ QBF +. s j s Xs + s Xs+ 35 where nequalty a holds due to the defnton of, and the last nequalty holds because λd + λdŝ Xs + ŝ Xs+ = when <. The summaton of 34 and 35 yelds that = + d ɛ λ λd ŝ j ɛ Xs+ + ɛ QBF + λ λd s j λd ɛ j ɛ Xs+ + ɛ QBF + λd ɛ j ɛ Xs+ + ɛ QBF +, 37 where the last nequalty holds due to the defnton of. The summaton of 3, 32, 36 and 37 yelds dv t ɛ. In a summary, we have shown that { dv t, f st ŝ =, otherse. 38 Next defne = mn{ : ɛ < }. If such an exsts, snce ŝ = for any >,. Furthermore, f <, then. It s easy to verfy that when exsts, d ɛ = dɛ = + λd ɛ ɛ. Snce the followng bound has been used throughout n the proof d ɛ + λd ɛ + ɛ, when exts, we can further obtan whch mples dv t dv t ɛ ɛ, 39 =, f st = ŝ. <, f s t < ŝ for some <, f s t > ŝ, f s t ŝ and s t = ŝ. The result above shows that st ŝ s non-ncreasng. For any x such that x <, we defne S x = {y : y n x n for all n}. Then we can see that S x s compact snce we can approxmate =λ λd s j + λd s Xs s Xs + s Xs+ the tal wth ɛ/2 and the frst fntely-many elements are n an equvalent Eucldean space and hence the the fntedmensonal part s totally bounded wth the remanng ɛ/2 a λd + λds Xs + s Xs+ + λd ɛ Xs + ɛ Xs+, 36 as well. 4

5 Snce st ŝ s non-ncreasng, gven a fxed r > and ntal condton s Bŝ, r, where we have Bŝ, r = {s X : s s r}, st r + ŝ t. N n m λt,.e., a Posson random varable wth mean n m λt. Defne event B n,α to be { B n,α = B n t + α n m λt } and γ n + α s. Snce s t s 2 t..., there exsts Nr such that for any Nr and any t, s t d, ṡ + t. Now consder any ntal state s X. et r = s ŝ and { s f Nr, = s f > Nr. Then s X. et Ω = Bŝ, r S s. Snce both Bŝ, r and S s are closed and S s s compact, we have that Ω s compact. Also note that for any ntal state s Ω we have st Ω as well, so Ω s postve nvarant and compact. Furthermore, gven st such that s t = ŝ and s t ŝ 2, t can be easly shown that s t + δt > ŝ for a suffcently small δt unless st = ŝ. The result can be proved by followng the dea of asalle s nvarance prncple [8]. APPENDIX D PROOF OF THEOREM 8 Recall the defnton of Π n t N where the th component Π n t s the number of servers whose queue lengths are equal to. Snce Π n t can be unquely determned by Γ n t and vce versa, and Π n t s a Markov chan, Γ n t s a Markov chan and we have Γ n t = Γ n + N N t R n Γn u du, 4 where N x are ndependent standard Posson processes and R n Γ s the transton rate of the Markov chan from state Γ to state Γ +. For example, gven =,,,,, whch corresponds to the event that there s a departure from a server wth queue sze, R n Γn = Γ n Γ n 2 because there are Γ n Γ n 2 servers wth queue sze. Dvdng by n on both sdes of equaton 4, we get γ n t = γ n + N n N t R n nγn u du Now defne B n t to be the total number of batch arrvals wthn tme nterval [, t] n the nth system. Then B n t =. Applyng the Chernoff bound, we obtan P B n t + α n m λt e n m λthα, where hα = + α log + α α. Also lm P γ n n + α s = because γ n converges to s n probablty accordng to the assumpton of the theorem. Thus, we have lm P B n,α =. n Note that n γn u s the total number of tasks n the system at tme u. When B n,α occurs, max u t γ n u + α λt + s Defne C α = + αλt + s. When the nequalty above holds, we have γ n u = π n j u C α u t,, 42 j= whch further mples that for k = γ n u 2 d C α 2 d., we have u t, k. 43 Next we defne the followng four sets: T n + : the set of such that, whch s the set of related to arrvals, + n : the set of such that and = for k +. : the set of such that, whch s the set of related to departures. n : the set of and = for m. We further defne N a = N a a, whch s a centered Posson process. Then we have T n γ n t = γ n + T + n T n \ + n n + n n + n n n N n t t t n N R n nγn u du R n nγn u du. R n nγn u du + +

6 Defne st to be the soluton of the dfferental equatons 7 wth ntal condton s, and Fs such that the nonlnear dfferental equatons n 7 are gven by ds = Fs. Followng the dea behnd the proof of Kurtz s theorem see [5] for an easy exposton, we have sup γ n u su 44 u t γ n s 45 + sup u t + sup u t + sup u t + sup u t + n n + n n + n n u dτ u n N R n nγn τ dτ u n N R n nγn τ n u R n nγn τ dτ u Fγ n τ dτ u 46 47 Fγ n τ dτ 48 Fsτ dτ. 49 Accordng to emmas 3-5, we obtan that there exsts n such that for any n n, γ P sup n u su u t u Fγ n τ Fsτ dτ 4δ γ P n s > δ + 3 P B n,α + 4m k max + λt d 2 δ e {e n m λth δ 2d+ m + λtc α δm, m+ k λt, e nth δ } 2m k t whch converges to zero as n snce m = Θlog n. et { γ B n = sup n u su u t u } Fγ n τ Fsτdτ 4δ. Then PB n as n. When B n occurs, for any u [, t], γ n u u su 4δ + Fγ n τ Fsτdτ u 4δ + M γ n τ sτ dτ, where the last nequalty holds because Fs s pschtz as shown n emma 6. By Gronwall s nequalty we have γ n u su 4δe Mu for any u [, t]. Thus P sup u t as n. emma 3. P 46 > δ λt γ n u su 4δe Mt PB n d 2 δ e 2d+ m + λtc α δm + 2 P B n,α. Proof. Note that T + n \ + n occurs when a task s dspatched to a queue wth sze at least k. Under condton 43, when a batch arrval occurs, P T + n \ + n {nγ nγ + } P dm Z k < m = P Z k > d m e d 2 2d+ m, where Z k s the number of servers probed wth queue sze at least k and the last nequalty s obtaned from the Hoeffdng s nequalty for samplng wthout replacement. Therefore, we have P sup P a P u t sup u t m n N P m n N T n + \ + n u m n N t t T + n \ + n T + n \ + n dτ u n N R n nγn τ T + n \ + n R n R n R n δ nγn τ dτ δ nγn τ dτ δ nγn τ dτ δ B n,α + PB n,α m n P n N d 2 m λte 2d+ m δ + PB n,α λt 2d+ m + PB n,α, d 2 δ e where nequalty a holds because Nt s nondecreasng wth t and the last nequalty s obtaned from the Markov nequalty.

7 Smlarly, we can also obtan dτ We dvde the analyss nto the followng cases: P sup u For > m, = for any + n n, whch mples u t n N R n nγn τ Tn \ δ F n γ = and n F n γ F P γ = F γ = γ m+ C α t n N m. R n >m >m nγn τ dτ δ For m > k, Tn \ n P Bn, α t F n n N R n γ = γ + γ +, nγn τ dτ δ whch mples that Tn \ n + PB n,α F n γ F γ =. P n N nλt C α For k, δ + PB n,α m λtc α δm + PB F n γ = λn n,α. n m E[D γ] nγ + nγ + [ ] D = λd E dm γ γ + γ +, emma 4. where D s a random varable denotng the change n the P 47 > δ 4m k max {e n m λth δ m+ k λt, e nth δ } 2m k t. number of servers wth queue sze at least after water fllng. Therefore, [ ] Proof. Note that + n n m k + m 2m k. For + F n D n, γ F γ = λd E dm γ. P sup u u t n N R n nγn τ dτ > δ Recall Z to be the number of probed servers wth queue 2m k sze at least, so D s a functon of Z j j. m P sup u t n N n m λu δ Specfcally, > + 2m k 2e n m λth δ 2m k D λt, = mn dm Z, m dm Z j. where the last nequalty follows from Proposton 5.2 n [5]. Smlarly, for n, P sup u u t n N R n nγn τ dτ > δ 2m k P N nu δ > n 2m k sup u t 2e nth δ 2m k t. Combnng the results above and usng the unon bound, we obtan P 47 > δ 4m k max {e n m λth δ m+ k λt, e nth } 2m k t. emma 5. There exsts n such that for any n n, P 48 > δ P B n,α. Proof. To study 48 under condton 42, we defne F n γ = R n n nγ, + n n and consder F n γ Fγ = F n γ F γ. 5 δ 5 Therefore, D dm = mn dm Z dm, d Z j + dm. Applyng the Hoeffdng s nequalty for samplng wthout replacement, we have that P Z γ dm log m 2 m log m 2e d = 2 m, 2/d whch mples that P Z γ dm m log m k 2k m. 2/d Gven Z γ dm m log m for all k, we can obtan [ ] + E D dm γ mn γ, d γ j k log m d m. By summarzng the cases above, we obtan that under condton 42 F n γ Fγ C α m + k log m d m.

8 Therefore, gven δ, there exsts m δ such that for any m m δ, u u sup F n γ n τ dτ Fγ n τ dτ u t Cα t m + k log m d δ. m So for suffcent large n, u P sup F n γ n τ dτ u t P B n,α. emma 6. Fs s pschtz. u Fγ n τ dτ > δ Proof. Consder s, s N. Wthout loss of generalty. Defne h s = F s s + s +. Then Fs Fs = F s F s = s s + s + s + + h s h s = 2 s s + h s h s. = Recall that F s = s + s + for > and F s = λd + λds + s + for <, so Fs Fs 2 s s + λd X s = s s + = h s h s. We next consder two cases. If h Xs s h Xs s, then = h s h s =λd λds Xs λ + λd =λd λd + λd + λ λd = + s s j j= s j s j j= X s j= λd s s. s j s j s j j= If h Xs s > h Xs s, then = h s h s = λd + λds Xs + λd λds Xs + λ λd s j j= + λ λd s j λ + λd s j j= + λ λd s j j= λd s Xs s Xs + λd s j s j 2λd s s, j= where the frst nequalty holds because j= λ λd s j j= accordng to the defnton of. Combnng the results above, we obtan that Fs Fs 2 + 3λd s s. Therefore, the lemma holds. APPENDIX E PROOF OF THEOREM 9 et ˆXn k denote the weak convergence subsequence n assumpton A4. By A and the Skorohod representaton theorem, there exsts { X nk } and X such that X nk = ˆXn k d, X = d X, and X nk converges to X almost surely. Now let X nk = X nk,.e., the n k th system starts at a random ntal condton specfed by ts statonary dstrbuton, whch mples that X n k t = d Xn k Denote by Xt the random state of the dynamcal system startng from the random ntal condton X. Accordng to A2, for any determnstc ntal condton n X, X n k t w Xt. By the defnton of weak convergence, for a bounded contnuous functon f, [ lm E fx nk t X nk = X ] n k n [ = E fxt X nk = X ]. Snce f s bounded, further by the bounded convergence theorem and the fact that P X X = P X X =, we have [ ] lm E fx nk t = E [fxt], n t.

9 whch mples that X n k t converges weakly to Xt for any t. Snce X n k t = d ˆXn k t, we further have Xt = d X t. Now accordng to A3, the dynamcal system converges to ˆX startng from any ntal condton n X, whch mples Xt converges to ˆX almost surely and also mples that Xt converges weakly to ˆX. Therefore, a pont mass at ˆX, whch mples that ˆXn k converges weakly to ˆX. Snce ths holds for any convergent subsequence, the theorem holds. APPENDIX F UNIFORM CONVERGENCE OF THE SERIES 6 Frst snce = E [ γ ] c accordng to 5 and E [ γ ] for all, the sequence s b = b = E [ γ ] s bounded above and ncreasng, so s = lm b s b exts. Therefore, gven any ɛ, there exsts b ɛ such that for any b b ɛ, E [ γ ] = s s b ɛ. 52 =b+ Next we establsh an upper bound on [ ] E =b+ ˆγ n k usng the yapunov-drft analyss at the steady state. Snce [ ] E c = ˆγ n k [ ] for any n k and E ˆγ n k s decreasng, [ ] E ˆγ n k c for any and any n k. Accordng to Markov s nequalty, we have P ˆγ n k d c 2d 2d d. 53 Now consder the n k th system and defne a yapunov functon to be n k V Qt = Q j t b + + 2, j= where b > and the superscrpt n k of Q s gnored to smplfy the notaton. et x, y N n k denote the state of the Markov chans, and q x,y denote the transton rate from state x to state y. Accordng to the Foster-yapunov theorem for contnuous-tme Markov chan see, for example, Theorem 9..8 n [5], we consder q x,y V y V x. 54 y x Recall that e j s a n k vector such that e j [j] = and e j [l] = for any l j. Then q x,x ej V x e j + V x, f x j b =, f x j = b 2x j b, f x j > b 2x j b +, whch corresponds to a departure at server j. Next defne Ψ x to be the set of possble states of the Markov chan when a batch arrval occurs and the system s n state x. Consder y Ψ x q x,y V y V x. We note that batch-fllng s one of the optmal solutons to the followng problem: dm mn a k= a k + x k b + + 2 n subject to: k= a k = m a k N k, where {x k } k=,,dm are the szes of the probed dm queues. In other words, gven x and the set of dm probed servers, the batch-fllng mnmze V y. Ths can be proved by showng that any task assgnment can be modfed to the batch-fllng soluton, by teratvely movng new tasks from large queues to small queues, wthout ncreasng the value of the objectve functon. Gven any b > 2, we consder the followng two cases. Frst consder x such that x Ω b := x : j d + I xj b 2 n k 2d. In other words, at least d + /2d fracton of servers wth queue sze at most b 2. Defne Ψ x to be the set of possble states of the Markov chan under batch-samplng when a batch arrval occurs and the system s n state x, and q x,y to be the correspondng transton rate. Snce batch-fllng mnmzes V y for any gven set of probed server, we have q x,y V y V x q x,y V y V x. y Ψ x y Ψ x Now under batch-samplng, a server may receve one and at most one task f t s probed. Consder server j such that x j b. Server j s probed wth probablty dm/n k, and wll receve one task f t s among the m least loaded queues n the md probed queues. Condtoned on server j s probed, defne G b 2 to be the number of probed servers wth queue sze at most b 2 among the other dm servers. Accordng to Hoeffdng s nequalty for samplng wthout replacement, we get PG b 2 < m e d 2 2d m. Therefore, we conclude that y Ψ x q x,y V y V x j j λn k m dm e d 2 2d m 2x j b + + + n k λde d 2 2d m 2x j b + + 3.

2 Note that y j b + + = for any queue j such that x j b 2 snce each server s gven at most one task under batch-samplng. Consder x such that x Ω b,.e., d + I xj b 2 < n k 2d. j In ths case, we compare batch-fllng wth the randomzed load-balancng algorthm that places m tasks n a set of randomly selected m servers, one for each server. Accordng to the analyss n the proof of Theorem, we have q x,y V y V x y Ψ x λn k m = λn k + j 3λn k + j m + 2 m x j b + + n k j 2λ x j b + + 2λ x j b + Combnng the results above, we have that q x,y V y V x y x j 2 λ max{, de d 2 2d m } x j b + + 3n k λde d 2 2d m Px Ω b + 2λn k Px Ω b. Recall that the Markov chan s postve recurrent accordng to Theorem. Assume the system s n the steady state, then we have = E q x,y V y V x, y x whch mples that E x j b + n k j d 2 3λde 2 2d m Px Ω b + 2λ Px Ω b. λ max{, de d 2 2d m } Now gven any < ɛ <, defne n ɛ such that for any n n ɛ 3λde d 2 2d m ɛ 2 λ 2 and de d 2 2d m. Such n ɛ exsts because m = Θlog n and m s ncreasng functon of n. Furthermore, defne b ɛ such that for any b b ɛ and n the steady state of any n k th system, 2λ Px Ω b 2 λ ɛ 2. Such b ɛ exsts, ndependent of n k, due to nequalty 53. Furthermore, we note that E x j b + = n k j =b+ [ E ˆγ n k Therefore, gven any < ɛ <, there exst b ɛ and n ɛ such that for any b b ɛ and n k n ɛ, the followng nequalty holds: [ ] E ɛ. =b+ ˆγ n k Combnng the result above and result 52, we conclude that gven any < ɛ <, for any n k n ɛ/2 and b max{b ɛ/2, b ɛ/2 }, [ ] =b+ ˆγ E n k γ =b+ E [ ˆγ n k whch concludes the proof. ]. ] + =b+ E [ γ ] ɛ, APPENDIX G PROOF OF COROARY 2 To smplfy the notaton, we assume k = 2, the analyss for k > 2 s almost dentcal and hence omtted here. Now for the nth system, we defne S n = { : > 2},.e., the set of all servers except servers and 2. We consder the followng Markov chan Q n t, Qn 2 t, n t, where n t = = S I n Q n t= n 2 Π n t I n Q t= I Q n n 2 2 t=.e., the fracton of servers wth queue sze n S n. Recall that Q n t s the queue length of the frst server n the nth system, Q n 2 t s the queue length of the second server n n n the nth system, and ˆQ and ˆQ 2 are the queue lengths n the steady state. Denote by π n n n x, y, = P ˆQ, ˆQ 2, ˆn = x, y,,.e., the statonary dstrbuton of the Markov chan. For the nth system, the global balance equaton for a gven state x, y, s = π n x, y, x,ỹ, x,y, x,ỹ, x,y, r n x,y, x, ỹ, π n x, ỹ, r n x,ỹ, x, y,, where r n x,y, x, ỹ, s the transton rate from state x, y, to x, ỹ, n the nth system, whch further mples that = x,ỹ, x,y, x,ỹ, x,y, π n x, y, r n x,y, x, ỹ, π n x, ỹ, r n x,ỹ, x, y,. 55,