III. OPTIMAL SOLUTION

In this section we assume that there is a central processing unit that has complete knowledge of the whole system. Given the communication latencies $c_{ij}$ and the organizations' own loads $n_i$, our goal is to find an algorithm that sets the relay fractions $\rho_{ij}$ so that the total processing time of all the requests, $\sum C_i$, is minimized. Below we show that the problem can be stated as a quadratic programming problem with a positive-definite matrix $Q$. This means that the optimization problem is convex and, in particular, polynomially solvable. We consider this result powerful, as it indicates that any local optimization technique can be applied to the problem. In particular, in the next section we present a distributed algorithm that, in practice, is very efficient.

Theorem 1. The problem of minimizing $\sum C_i$ can be expressed as the quadratic programming problem $\sum C_i = \rho^T Q \rho + b^T \rho$, with a positive-definite matrix $Q$. In particular, this means that the minimization problem is convex.

Proof: We express the total processing time $\sum C_i$ in a matrix form as $\sum C_i = \rho^T Q \rho + b^T \rho$, where:

$\rho$ is a vector of relay fractions with $m^2$ elements. $\rho_{(i,j)}$, the element at the $(im + j)$-th position, denotes the fraction of local requests of the $i$-th server that are relayed to the $j$-th server, $\rho_{ij}$; thus
$$\rho = [\rho_{(1,1)}, \rho_{(1,2)}, \dots, \rho_{(1,m)}, \rho_{(2,1)}, \dots, \rho_{(m,m)}]^T;$$

$Q$ is an $m^2$-by-$m^2$ matrix in which $q_{(i,j),(k,l)}$ denotes the element in the $(im + j)$-th row and in the $(km + l)$-th column:
$$q_{(i,j),(k,l)} = \begin{cases} n_i n_k / s_j & \text{if } j = l \text{ and } i < k; \\ n_i n_k / 2s_j & \text{if } j = l \text{ and } i = k; \\ 0 & \text{otherwise;} \end{cases} \tag{2}$$
Figure 1 presents the structure of matrix $Q$.

$b$ is a vector with $m^2$ elements, with $b_{(i,j)}$ denoting the element at the $(im + j)$-th position: $b_{(i,j)} = c_{ij} n_i$.

[Figure 1: Matrix $Q$, with the non-zero values ($n_i n_k / s_j$ and $n_i n_k / 2s_j$) marked.]

The following derivation shows how the matrix $Q$ is constructed:
$$\begin{aligned}
\rho^T Q \rho &= \sum_{i,j} \rho_{(i,j)} \sum_{k \ge i} q_{(i,j),(k,j)}\, \rho_{(k,j)} && (3) \\
&= \sum_{i,j} \rho_{(i,j)} \Big( \sum_{k > i} \frac{n_i n_k}{s_j} \rho_{(k,j)} + \frac{n_i^2}{2 s_j} \rho_{(i,j)} \Big) && (4) \\
&= \sum_i \sum_j \sum_k \frac{n_i n_k\, \rho_{(i,j)}\, \rho_{(k,j)}}{2 s_j} = \sum_i \sum_j \frac{r_{ij}\, l_j}{2 s_j}. && (5)
\end{aligned}$$
(3) follows from the construction of the matrix $Q$ (only elements with $k \ge i$ are non-zero). (4) substitutes $q_{(i,j),(k,l)}$ with the values defined in (2). (5) uses commutativity of multiplication and substitutes $l_j = \sum_k n_k \rho_{(k,j)}$ and $r_{ij} = n_i \rho_{(i,j)}$.

The constraints that $\rho_{ij}$ are fractions ($\forall_{i,j}\ \rho_{ij} \ge 0$ and $\forall_i\ \sum_{j=1}^{m} \rho_{ij} = 1$) can also be expressed in the matrix form. First, $\rho \ge 0_{m^2}$, where $0_{m^2}$ is a vector of length $m^2$ consisting of zeros. Second, $A\rho = 1$, where $1$ is a vector of length $m$ consisting of ones, and $A$ is an $m$-by-$m^2$ matrix defined by the following equation:
$$a_{ij} = \begin{cases} 1 & \text{if } im \le j < (i + 1)m \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

Minimization of $\sum C_i(\rho) = \rho^T Q \rho + b^T \rho$ with constraints $\rho \ge 0_{m^2}$ and $A\rho = 1$ is an instance of the quadratic programming problem. As an upper triangular matrix, $Q$ has $m^2$ eigenvalues equal to the values on the diagonal: $n_i^2 / 2s_j$ ($1 \le i, j \le m$). All eigenvalues are positive, so $Q$ is positive-definite.

Algorithm 1: CALCBESTTRANSFER(i, j)
  input: $(i, j)$, the identifiers of the two servers
  Data: $\forall_k\ r_{ki}$, initialized to the number of requests owned by $k$ and relayed to $i$ ($\forall_k\ r_{kj}$ is defined analogously)
  Result: the new values of $r_{ki}$ and $r_{kj}$
  foreach $k$ do
    $r_{ki} \leftarrow r_{ki} + r_{kj}$; $r_{kj} \leftarrow 0$;
  end
  $l_i \leftarrow \sum_k r_{ki}$; $l_j \leftarrow 0$;
  servers $\leftarrow$ sort $[k]$ so that $c_{kj} - c_{ki} < c_{k'j} - c_{k'i} \Rightarrow k$ is before $k'$;
  foreach $k \in$ servers do
    $\Delta r_{ikj} \leftarrow \min\Big( \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j},\ r_{ki} \Big)$;
    if $\Delta r_{ikj} > 0$ then
      $r_{ki} \leftarrow r_{ki} - \Delta r_{ikj}$; $r_{kj} \leftarrow r_{kj} + \Delta r_{ikj}$;
      $l_i \leftarrow l_i - \Delta r_{ikj}$; $l_j \leftarrow l_j + \Delta r_{ikj}$;
    end
  end
  return for each $k$: $r_{ki}$ and $r_{kj}$

Corollary 1. The problem of minimizing $\sum C_i$ is polynomially solvable.

According to [22], the best running time reported for solving a quadratic programming problem with linear constraints is $O(n^3 L)$ [21], where $L$ represents the total length of the input coefficients and $n$ the number of variables (here $n = m^2$), so the complexity of the best solution is $O(L m^6)$.
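The construction used in the proof of Theorem 1 can be checked numerically on a small instance. The sketch below builds $Q$ and $b$ as defined in (2) and compares $\rho^T Q \rho + b^T \rho$ with the total processing time computed directly from the loads, $\sum_{i,j} r_{ij} (l_j / 2s_j + c_{ij})$. The function names, the 0-based position `i * m + j`, and the sample values are assumptions made for illustration only.

```python
def build_qp(n, s, c):
    """Build Q (m^2 x m^2) and b (length m^2) following definition (2)."""
    m = len(n)
    size = m * m
    Q = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for i in range(m):
        for j in range(m):
            row = i * m + j  # 0-based counterpart of the (im + j)-th position
            b[row] = c[i][j] * n[i]
            Q[row][row] = n[i] * n[i] / (2.0 * s[j])      # case j = l, i = k
            for k in range(i + 1, m):
                Q[row][k * m + j] = n[i] * n[k] / s[j]    # case j = l, i < k
    return Q, b


def qp_objective(Q, b, rho):
    """Evaluate rho^T Q rho + b^T rho."""
    size = len(rho)
    quad = sum(rho[p] * Q[p][q] * rho[q] for p in range(size) for q in range(size))
    lin = sum(bp * rp for bp, rp in zip(b, rho))
    return quad + lin


def direct_total_time(n, s, c, rho):
    """Evaluate sum_{i,j} r_ij * (l_j / (2 s_j) + c_ij) from the loads."""
    m = len(n)
    r = [[n[i] * rho[i * m + j] for j in range(m)] for i in range(m)]
    loads = [sum(r[k][j] for k in range(m)) for j in range(m)]
    return sum(r[i][j] * (loads[j] / (2.0 * s[j]) + c[i][j])
               for i in range(m) for j in range(m))


# Hypothetical two-server instance (values chosen only for the example).
n = [10.0, 20.0]              # own loads n_i
s = [1.0, 2.0]                # server speeds s_i
c = [[0.0, 3.0], [3.0, 0.0]]  # latencies c_ij
rho = [0.5, 0.5, 0.25, 0.75]  # each server's fractions sum to 1

Q, b = build_qp(n, s, c)
print(qp_objective(Q, b, rho), direct_total_time(n, s, c, rho))
```

Both expressions agree because the strictly upper-triangular entries of $Q$ carry each cross term $n_i n_k \rho_{(i,j)} \rho_{(k,j)}$ once with weight $1/s_j$, which equals the two symmetric occurrences with weight $1/2s_j$ in (5).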
However, Theorem 1 encourages the application of local optimization techniques, such as the algorithm presented in the next section.

IV. DISTRIBUTED ALGORITHM

The centralized algorithm requires information about the whole network: the size of the input data is $O(m^2)$
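Algorithm 1 (CALCBESTTRANSFER) translates almost line by line into Python. The sketch below is a minimal, hedged rendering: the data layout (`r[k][x]` is the number of requests owned by `k` and processed on `x`) and the function name are assumptions, and the function returns the two updated columns instead of modifying `r` in place. The transfer size is the stationary point of $(l_i - x)^2 / 2s_i + (l_j + x)^2 / 2s_j + x (c_{kj} - c_{ki})$, capped by the number of requests of $k$ still on $i$.

```python
def calc_best_transfer(i, j, r, s, c):
    """Rebalance the requests held by servers i and j (sketch of Algorithm 1).

    r[k][x]: requests owned by k and processed on x
    s[x]:    speed of server x
    c[k][x]: latency between k and x
    Returns (r_i, r_j): for each owner k, the new r[k][i] and r[k][j].
    """
    m = len(s)
    # First loop: move every request from j back to i.
    r_i = [r[k][i] + r[k][j] for k in range(m)]
    r_j = [0.0] * m
    l_i = float(sum(r_i))
    l_j = 0.0
    # Owners for which relaying to j is cheapest (relative to i) go first.
    servers = sorted(range(m), key=lambda k: c[k][j] - c[k][i])
    # Second loop: transfer the optimal amount owner by owner.
    for k in servers:
        best = (s[j] * l_i - s[i] * l_j
                - s[i] * s[j] * (c[k][j] - c[k][i])) / (s[i] + s[j])
        delta = min(best, r_i[k])
        if delta > 0:
            r_i[k] -= delta
            r_j[k] += delta
            l_i -= delta
            l_j += delta
    return r_i, r_j


# Hypothetical example: server 0 holds 10 of its own requests, latency 2.
r = [[10.0, 0.0], [0.0, 0.0]]
s = [1.0, 1.0]
c = [[0.0, 2.0], [2.0, 0.0]]
print(calc_best_transfer(0, 1, r, s, c))
```

In this example the latency makes it unprofitable to equalize the loads completely, so fewer than half of the requests are moved to server 1.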

The following theorem proves the correctness of Algorithm 1.

Theorem 2. After execution of Algorithm 1 for the pair of servers $i$ and $j$, it is not possible to improve $\sum C_i$ only by exchanging requests between $i$ and $j$.

Sketch of Proof: First we show that after the second loop no requests should be transferred from $i$ to $j$. For each organization $k$, the requests owned by $k$ were transferred from $i$ to $j$ in some iteration of the second loop; also, each of the next iterations of the second loop could only increase the load of $j$ (and decrease the load of $i$); thus transferring more requests of $k$ from $i$ to $j$ would be inefficient. Second, we show that after the second loop no requests should be transferred back from $j$ to $i$ either. Let us take the last iteration of the second loop in which the requests of some organization $k$ were transferred from $i$ to $j$. After this transfer we know that
$$\Delta r_{ikj} = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j} \ge 0$$
(otherwise the transfer would not be optimal). However, since the servers are considered in the order of increasing $c_{kj} - c_{ki}$, this implies that
$$\Delta r_{ik'j} = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{k'j} - c_{k'i})}{s_i + s_j} \ge 0$$
for each server $k'$ considered before $k$. As $\Delta r_{ik'j} \ge 0$, we get $\Delta r_{jk'i} \le 0$.

B. Error estimation

The following analysis bounds the distance of the current solution of the distributed algorithm to the optimum as a function of the disparity of the servers' loads. When running the algorithm, this result can be used to assess whether it is still profitable to continue: if the load disparity is low, the current solution is close to the optimum.

We introduce the following notation for the analysis. $\rho$ is the snapshot (the current solution) derived by the distributed algorithm. $\rho^*$ is the optimal solution that minimizes $\sum C_i$ (if there are multiple optimal solutions with the same $\sum C_i$, $\rho^*$ is the closest solution to $\rho$ in the Manhattan metric).
$(P, \Delta\rho)$ is a weighted, directed error graph: $\Delta\rho[i][j]$ indicates the number of requests that should be transferred from server $i$ to $j$ in order to reach $\rho^*$ from $\rho$ (the $\Delta\rho[i][j]$ requests belong either to $i$ or to $j$, and not to another server $k$). We define dir as the direction of transport: $\mathrm{dir}(i, j) = 1$ if $i$ transfers to $j$ its own requests; $\mathrm{dir}(i, j) = -1$ if $i$ returns to $j$ the requests that initially belonged to $j$. Let $\mathrm{succ}(i)$ denote the set of successors in the error graph: $\mathrm{succ}(i) = \{ j : \Delta\rho[i][j] > 0 \}$; $\mathrm{prec}(i)$ denotes the set of predecessors: $\mathrm{prec}(i) = \{ j : \Delta\rho[j][i] > 0 \}$.

In the error graph, a negative cycle is a sequence of servers $i_1, i_2, \dots, i_n$ such that (i) $i_1 = i_n$; (ii) $\forall_{j \in \{1, \dots, n-1\}}\ \Delta\rho[i_j][i_{j+1}] > 0$; and (iii) $\sum_{j=1}^{n-1} \mathrm{dir}(i_j, i_{j+1})\, c_{i_j i_{j+1}} < 0$. A negative cycle is a sequence of servers that essentially redirect their requests to one another. A solution without negative cycles has a smaller processing time: after dismantling a negative cycle, loads on servers remain the same, but the communication time is reduced. In the Appendix, we show how to detect and remove negative cycles; in order to simplify the presentation of the subsequent analysis, we assume that there are no negative cycles.

Proposition 1. If (i) the error graph $\Delta\rho$ has no negative cycle; and (ii) $\forall_j\ \max_k \big( (\tfrac{1}{s_j} + \tfrac{1}{s_k})\, \Delta r_{jk} \big) = \Delta_R$ ($\Delta r_{ij}$ is the number of requests which in the current state $\rho$ would be relayed to the $j$-th server by the $i$-th server as the result of Algorithm 1), then $\lVert \rho - \rho^* \rVert_1 \le (4m + 1)\, \Delta_R \sum_i s_i$, where $\lVert \cdot \rVert_1$ denotes the Manhattan metric.

Proof: The proof is presented in the full version of the article [33].

Proposition 1 gives the estimation of the error for partial solutions that do not have negative cycles. Therefore the algorithm that cancels negative cycles (see Appendix) should be run whenever an estimate of the distance to the optimal solution is needed. Our experiments show, however, that negative cycles are rare in practice and that pure Algorithm 2 can remove them efficiently (Section VI).

V. SELFISH ORGANIZATIONS

In this section we consider the case when the organizations act selfishly: the $i$-th of them tries to minimize the total processing time of its own requests, $C_i$. We are interested in a steady state in which no peer has an interest in redirecting any of its requests to different servers, that is, the Nash equilibrium.

A. Homogeneous network

In this section we present the characteristics of the Nash equilibrium in the case when all the servers have equal processing power ($\forall_i\ s_i = s$), and when all the connections between servers have the same communication delay ($\forall_{i \ne j}\ c_{ij} = c$). We consider a homogeneous model, as the modeling of a heterogeneous interconnection graph is complex. The simulation experiments (Section VI-C) show that in the case of selfish servers the average relative degradation of the system goal on heterogeneous networks is similar to, or lower than, on homogeneous networks.

Lemma 2. For every two servers $i$ and $j$ the difference between their average loads is bounded: $|l_i - l_j| \le cs$.

Proof: (by contradiction) Assume $|l_i - l_j| > cs$. Without loss of generality, $l_i > l_j$. Recall that $r_{ij}$ is the number of redirected requests, $r_{ij} = n_i \rho_{ij}$. For each server $k$ ($k \ne i$), it is not profitable to put more of its requests on the more loaded server, so $r_{kj} \ge r_{ki}$. Now we want to find the relation between $l_i$, $l_j$, $r_{ij}$ and $r_{ii}$. In a Nash equilibrium, it is not profitable for $i$ to redirect any additional $x$ of its own requests from itself to $j$, which can be formally expressed by the inequality:
$$0 \le \frac{(l_i - x)(r_{ii} - x)}{2s} + \frac{(l_j + x)(r_{ij} + x)}{2s} + c\,(r_{ij} + x) - \frac{l_i r_{ii}}{2s} - \frac{l_j r_{ij}}{2s} - c\, r_{ij},$$

[Figure 2: The convergence of the distributed algorithm for peak distribution of initial loads. Axes: iteration (x), total processing time $\sum C_i$ (y); series: #servers = 500, 1000, 2000, 3000, 5000.]

processing a single request on a single server takes 1s). We also analyzed the case of peak distribution with requests owned by a single server. We evaluated the result based on the distance to the optimal solution, which, because of the $O(m^6)$ complexity of standard solvers (see Section III), was approximated by our distributed algorithm.

B. Convergence time of the distributed algorithm

In the first series of experiments, we evaluated the efficiency of the distributed algorithm, measured as the number of iterations the algorithm must perform in order to decrease the difference between the total processing times in the current and the optimal request distributions to less than 2% of the average load. In a single iteration of the distributed algorithm, each server executes Algorithm 2; if there were any pairs of servers to be optimized, we ran the optimization in random order. Table I summarizes the results.

The results indicate that the number of iterations mostly depends on the size of the network and on the distribution of the initial load. The type of the network (planet-lab vs. homogeneous) does not influence the convergence time. Larger networks and peak distribution result in higher convergence times. In all considered networks, the algorithm converged in at most 9 iterations.

Next, we decreased the required precision error from 2% to 0.1% and ran the same experiments. The results are given in Table II. In this case, similarly, the required number of iterations was the highest for peak distribution of the initial load. In each case the algorithm converged in at most 11 iterations. Even for 300 servers the average number of iterations is below 8. Also, the standard deviations are low, which indicates that the fast convergence of the algorithm is stable.
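The convergence measure used in Tables I and II can be stated as a small helper: given the total processing time after each iteration and the reference optimum, return the first iteration at which the relative error falls to the required threshold. The sketch below is illustrative only; the function name and the sample trace are hypothetical and do not reproduce any experimental data.

```python
def iterations_to_converge(costs, optimum, tolerance=0.02):
    """Return the first iteration t with (costs[t] - optimum) / optimum <= tolerance.

    costs[t] is the total processing time after t iterations;
    returns None if the threshold is never reached.
    """
    for t, cost in enumerate(costs):
        if (cost - optimum) / optimum <= tolerance:
            return t
    return None


# Hypothetical trace of total processing times (not measured data).
trace = [200.0, 150.0, 103.0, 101.9, 100.05]
print(iterations_to_converge(trace, 100.0, tolerance=0.02))
print(iterations_to_converge(trace, 100.0, tolerance=0.001))
```

Tightening the tolerance from 2% to 0.1% can only increase the returned iteration count, which matches the relation between the two tables.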
Also, we assessed whether a variation of the distributed algorithm that does not eliminate negative cycles (Appendix A) has a slower convergence time. Although required to prove the convergence (Section IV-B), eliminating the negative cycles is complex to implement and dominates the execution time. We compared two versions of the distributed algorithm: without negative cycle removal, and with the removal every two iterations of the algorithm. The numbers of iterations for the two versions of the algorithm were exactly the same in all 6000 experiments. These results show that the cycles which happen in practice can be efficiently removed by pure Algorithm 1. Also, negative cycles are rare in practice.

[Table I: The number of iterations of the distributed algorithm required to obtain at most 2% relative error in the total processing time $\sum C_i$. Columns: # iterations (average, max, st. dev.); rows: uniform, exp and peak initial load distributions for each of $m \le 50$, $m = 100$, $m = 200$ and $m = 300$.]

[Table II: The number of iterations of the distributed algorithm required to obtain at most 0.1% relative error in the total processing time $\sum C_i$. Columns: # iterations (average, max, st. dev.); rows: uniform, exp and peak initial load distributions for each of $m \le 50$, $m = 100$, $m = 200$ and $m = 300$.]

Finally, we analyzed the convergence of the distributed algorithm without negative cycle elimination on larger networks (Figure 2). The previous experiments showed that the algorithm converges most slowly for peak distribution of the initial load; therefore we chose this case for the analysis. The experiments used a heterogeneous network. The results indicate that even for larger networks the total processing time decreases exponentially.

C. Cost of selfishness

In the second series of experiments we measured the cost of selfishness as the ratio between the total processing times in the cases of selfish and cooperative servers (Table III). In each experiment, the Nash equilibrium was approximated by the following heuristic.
Each server was playing its best response to the current distribution of


