Optimize the Dynamic Provisioning and Request Dispatching in Distributed Memory Cache Services

Boyang Yu and Jianping Pan
University of Victoria, British Columbia, Canada

Abstract. The dynamic provisioning of distributed cache services helps to improve system efficiency. We model the system as groups of servers caching different, non-overlapping key segments of content objects, and investigate the benefits of cache hits and request batching. A stochastic network optimization problem is formulated, which aims at achieving system stability, low energy cost and a certain cache hit rate simultaneously through the dynamic control of server activeness and request dispatching. The problem is transformed into a minimization problem at each time slot and an online algorithm to solve it is proposed. We also show that dynamic programming helps to lower the computational complexity. Finally, the proposed algorithm is evaluated through extensive simulations.

I. INTRODUCTION

Today the design of large-scale services attracts more attention than ever, motivated by the unprecedented increase of content stored on the network and the continuous growth of user requests to retrieve it. Scalability and efficiency are two of the most important issues in system design. In this paper, we aim at improving the efficiency of distributed memory cache services through the optimized dynamic control of server provisioning and request dispatching. Distributed cache services are broadly used in different large-scale networked systems to relieve heavy workloads. Such a service provides temporary key-value storage in memory and unifies the memory of different servers into a whole, functioning as a cache over the entire key-value space. Memcached [1] is the state-of-the-art and most popular implementation of such a service.
It achieves horizontal scalability mainly through consistent hashing [2], where the key space is partitioned into segments and each server is responsible for one segment, or sub-space, of the whole space. Besides, it also supports the mirroring of one sub-space, i.e., assigning multiple servers to redundantly cover the same sub-space and distributing the incoming traffic among them through a load balancer. In contrast to the well-achieved scalability stated above, in this paper we are more interested in improving the efficiency of such a system, which is often overlooked in existing practices of system design. Due to the dynamics of the workload, it is not necessary to keep all the servers of the system active all the time. However, inappropriately turning servers inactive will impact the system stability or service quality, shown as a large number of requests queued waiting for service or a large average response time of the system. Therefore, the reasonable control of server activeness, which leads to a tradeoff between the system stability and other goals such as the energy cost and cache hit rate, is the main objective of this work, termed dynamic provisioning. The stochastic network optimization framework is adopted to achieve such an objective; it ensures the system stability by preventing the queue backlogs from growing without bound, while taking the other goals into account. Based on the framework, we solve a minimization problem and adjust the server activeness accordingly at each discretized time slot, which is feasible in a data center because of its high-performance internal network. Request dispatching is another issue to be solved. We will discuss where to dispatch a request if there are multiple choices, and the favourable number of content objects to be batched in a request, taking the queue backlogs at that time slot as input.
These decisions also contribute to the system stability because they affect the arrival of requests at servers. Since the applied framework only requires the current status of the queue backlogs in making decisions, the scheme proposed in this paper does not rely on the prediction of future workload, which makes it easy to apply and less vulnerable to the unpredictability of the workload. Compared with existing works that optimize distributed systems similarly using stochastic network optimization (for example [3], [4], [5]), ours is characterized by its specific modeling and scheme design towards the distributed cache service. The differences between the cache service analyzed here and other common distributed services include: in the cache service, a request can only be dispatched to certain specific servers, because of the consistent hashing design and the preference for a high cache hit rate; the change of server provisioning can further influence the responsible key space of the servers still active, as well as the resultant cache hit rate, which is crucial to the effective throughput of the system; and the diminishing overhead in batching requests was observed in distributed cache systems but not paid enough attention to in the existing design of dynamic server provisioning. The rest of the paper is organized as follows. Section II presents the related work. Section III states the modeling framework. Section IV presents the proposed algorithm. Section V makes the evaluation through simulations. Finally, we present the conclusions in Section VI.

II. RELATED WORK

The distributed memory cache system [6] is broadly used in different networked services today, and the consistent hashing technique [2] is applied to support its scalability. How Facebook leverages and improves Memcached, a typical implementation of the cache system, to support its social network is stated in [7]. Raindel et al. [8] considered the distributed memory storage system and discussed how to serve multi-get requests to achieve the maximum throughput. Byers et al. [9] discussed a method for achieving load balance among servers under consistent hashing by mapping a key to multiple servers. These works mostly try to improve the scalability or throughput performance, but the possibility of adjusting the server activeness to achieve a good balance between performance and efficiency is overlooked. The stochastic network optimization framework [10] is developed based on the system capacity analysis [11] and Lyapunov optimization. It helps to make optimized decisions with only current observations, not relying on knowledge of the arrival rate distribution. Urgaonkar et al. [3] proposed that the throughput and energy of a data center can be optimized through admission control and routing control. Zhou et al. [5] solved the tradeoff between performance and cost in a VM resource pool through the framework, with specific considerations on the nonlinear energy consumption and the power budget. Maguluri et al. [4] focused on task scheduling in the cloud scenario, and also used Lyapunov optimization as the main method to solve the problem. Comparatively, we consider more detailed perspectives of the distributed cache system, e.g., that a request can only be dispatched to some of the active servers in the system.

III. MODELING FRAMEWORK

A. Distributed Memory Cache Systems

In a common distributed memory cache system, such as Memcached [1], the cache servers work together to fulfill the content object requests initiated by clients. Consistent hashing [2] is applied in dispatching requests to different servers, with the result that each single server covers a sub-space of the whole key space and is only responsible for serving the requests falling into the range it covers.
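Classic consistent hashing, as adopted by Memcached through [2], can be sketched as follows. This is a minimal illustration, not the paper's simplified model: the server names, ring size and hash choice are arbitrary assumptions.

```python
import hashlib

def ring_position(name, ring_size=2**32):
    # Hash a server id or object key to a point on the ring.
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % ring_size

def lookup(key, servers):
    # A key is served by the first server clockwise from its position.
    points = sorted((ring_position(s), s) for s in servers)
    kpos = ring_position(key)
    for pos, server in points:
        if pos >= kpos:
            return server
    return points[0][1]  # wrap around the ring

servers = ["s1", "s2", "s3"]
owner_before = lookup("user:42", servers)
owner_after = lookup("user:42", servers + ["s4"])
# Adding a server relocates only the keys falling between the new
# server and its predecessor on the ring; most keys keep their owner.
```

This locality under membership change is exactly what makes turning servers on and off practical: only the key sub-space of the affected server moves.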
Because the workload in requesting objects with specific keys is dynamic, the provisioning of the servers, i.e., keeping or turning them active or inactive, should be adjusted dynamically to trade off the resultant service quality and the energy cost. The whole key space is denoted by S. Each key is a nonnegative integer between 0 and the space size. We simplify the consistent hashing scheme as equally dividing S into N segments, denoted by S_1, ..., S_N. Formally, ∪_{k=1}^{N} S_k = S and S_i ∩ S_j = ∅ for any i ≠ j. The segments then form a directional circle based on their indexes, where S_k is the predecessor of S_{k+1} and S_N is the predecessor of S_1, which is similar to Chord [12], a typical DHT system. There is a pool of available servers and we virtually partition the servers into N groups. Each active server in group i owns the key segment S_i, where 1 ≤ i ≤ N. Meanwhile, it might also serve the segments preceding S_i in the circle when necessary. The number of available servers in each group is assumed to be M. In total, at most N × M servers are considered in our scheme. Time is discretized into slots, and in each time slot t the activeness status of a server can be changed. Accordingly, the activeness of server j in group i is denoted by a_ij(t), with the value 1 or 0 representing the active or inactive status, where 1 ≤ j ≤ M. We further model the activeness of a server group i by

a_i(t) = 1(Σ_j a_ij(t)), (1)

where 1(x) is an indicator function returning 1 if x > 0 and 0 otherwise. We then obtain the binary provisioning vector Y(t), such that

Y(t) = {a_1(t), ..., a_N(t)}, (2)

which represents the activeness of the server groups in slot t. Fig. 1 shows an example with N = 4 and M = 3.

[Fig. 1. System architecture: Client A, relays 1-4, and server groups 1-4 with up to three servers each.]
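Equations (1)-(2) amount to a row-wise "any" over the server activeness matrix. A minimal sketch mirroring the Fig. 1 example follows; the exact per-server activeness pattern inside groups 1 and 4 is an assumption, only the group-level result matters:

```python
def provisioning_vector(a):
    # a[i][j] = 1 if server j of group i is active; eqs. (1)-(2):
    # a group is active when at least one of its servers is.
    return [1 if any(row) else 0 for row in a]

# Activeness matrix in the spirit of Fig. 1: group 2 fully inactive,
# groups 1 and 4 only partially active.
a = [
    [1, 1, 0],  # group 1
    [0, 0, 0],  # group 2
    [1, 1, 1],  # group 3
    [1, 0, 0],  # group 4
]
print(provisioning_vector(a))  # -> [1, 0, 1, 1]
```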
As illustrated, the whole of group 2 and some servers in groups 1 and 4 are set inactive. Here Y(t) = {1, 0, 1, 1}. Determined by the consistent hashing applied in dispatching requests to servers, the responsible key space of each active server group i includes its own key segment S_i and the key segments of one or multiple consecutively inactive groups preceding it in the circle, denoted by

Ŝ_i = ∪_{k=p+1}^{i} S_k, (3)

where p is the first active server group preceding i in the circle. In the example illustrated by Fig. 1, the servers in group 3 are also responsible for the segment owned by group 2, due to the inactive status of the latter. Meanwhile, the servers in the same group cover the same sub-space of keys, so there can be multiple candidates for serving requests to that sub-space. Besides, we introduce another layer between clients and servers in the system, shown as relays in Fig. 1. Each of these logical relays is responsible for one of the N key segments, and the requests falling into the same key segment, even if from different clients, are aggregated at the relay, leading to higher performance through the diminishing overhead effect. The relays also make the dynamic adjustment of server provisioning transparent to the clients. The request arrival at each relay k is also the arrival at key segment S_k.

B. Covered Key Space vs. Cache Hit Rate

For a lower content retrieval latency, either the cache hit rate or the cache access latency should be improved. We achieve the latter by bounding the average queue length at the servers. Meanwhile, increasing the number of active server groups helps improve the former, because some

servers might serve a smaller key space.

[Fig. 2. Influence of key space: cache hit rate vs. the ratio of covered space to the whole space, for different cache sizes.]
[Fig. 3. Diminishing overhead: normalized delay vs. the number of objects to retrieve per request, for 1, 2 and 4 servers.]

When the unequal distribution of workloads across different key segments is ignored, the expected cache hit rate C_i of server group i is a function of the ratio of its responsible space size over the whole space size, i.e., C_i = C(|Ŝ_i| / |S|). An experiment is conducted to determine that relationship, based on a trace of HTTP requests to Wikipedia [13]. We simulate 100 LRU-based cache servers, each with the same cache capacity. Depending on the provisioning status, an active server might be responsible for n/100 of the whole space, where 1 ≤ n ≤ 100. n is adjusted from 1 to 100 and the results are shown in Fig. 2. We notice that either a smaller key space or a larger cache size can increase the cache hit rate. With the measurements, we use the function fitting tool in Matlab to model the relationship between the covered space ratio x and the cache hit rate C(x). When each server can store 64,000 objects at most, we obtain C(x) as

C(x) = 0.257x^3 − 0.586x^2 − 0.495x + ⋯ (4)

C. Diminishing Overhead by Request Batching

The diminishing overhead of batching objects in requests was initially noticed as the multi-get hole [14] in distributed cache services. Specifically, when the average number of objects fetched by each request increases, the request processing efficiency improves, which enlarges the system throughput. To take this into account in the dynamic server provisioning, experiments are conducted with real servers. We make clients send requests repeatedly, each of which obtains x objects by using the multi-get command. With x varied, we measure the average number of requests served per second, denoted by T(x).
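The LRU hit-rate experiment behind Fig. 2 can be sketched in miniature. This is a hedged stand-in: it uses a synthetic Zipf-like workload rather than the Wikipedia trace, and arbitrary key-space and cache sizes; only the qualitative trend (larger cache, or smaller covered space per server, gives a higher hit rate) is the point.

```python
import random
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    # Fraction of requests served from an LRU cache of given capacity.
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)

random.seed(1)
# Synthetic skewed (Zipf-like) workload over 10,000 objects.
reqs = random.choices(range(10_000),
                      weights=[1 / (r + 1) for r in range(10_000)],
                      k=50_000)
small = lru_hit_rate(reqs, 500)
large = lru_hit_rate(reqs, 4_000)
# small < large: a larger cache relative to the covered key space
# yields a higher hit rate, matching the trend in Fig. 2.
```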
Then the average delay of any request containing x objects, normalized to that of a request containing one object, is calculated as M(x) = (1/T(x)) / (1/T(1)) = T(1)/T(x). Our experimental results on the normalized mean delay with varying x and the number of servers are shown in Fig. 3. Based on the results, we can make a linear approximation to M(x), namely

M(x) = σ + τx, (5)

which can be intuitively understood as follows: for each request, no matter how many objects are in it, there is an initial overhead σ in delay, and the other part of the delay is determined by the number of objects in it, at a rate of τ. In the experiment with only one server, σ ≈ 0.6 and τ ≈ 0.4.

D. Problem Formulation

In the model, the incoming requests are queued at each relay based on the key of the content object to fetch. There are N queues at the relays, corresponding to the N segments of the whole key space. H_k(t) represents the queue backlog at relay k, corresponding to segment S_k, where k = 1, 2, ..., N. When the time goes from t to t + 1, the queue backlog of relay k is updated by

H_k(t + 1) = max(H_k(t) − d_k(t), 0) + A_k(t), (6)

where

d_k(t) ≤ d_max. (7)

A_k(t) denotes the number of objects to retrieve that arrive at time t whose keys are in segment S_k. d_k(t) denotes the number of requested objects planned to be dispatched from queue k to the cache servers at time t, constrained by the maximum rate d_max. Because the actual number of departing objects is limited by the number of objects currently in the queue, we use d̂_k(t) to denote the actual number of departures, i.e.,

d̂_k(t) = min(d_k(t), H_k(t)). (8)

For the cache servers, the queue backlog at server j in group i is denoted by Q_ij(t). It decreases by at most l_max in each time slot if the server is in the active status, i.e., a_ij(t) = 1.
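The relay queue dynamics (6)-(8) can be sketched as a one-slot update; the d_max value below is an illustrative cap, not a claim about the paper's settings.

```python
def relay_step(H_k, A_k, d_k, d_max=1_800):
    # One slot of the relay queue dynamics, eqs. (6)-(8).
    d_k = min(d_k, d_max)               # rate constraint (7)
    served = min(d_k, H_k)              # actual departures, eq. (8)
    H_next = max(H_k - d_k, 0) + A_k    # backlog update, eq. (6)
    return H_next, served

H, served = relay_step(H_k=100, A_k=40, d_k=150)
print(H, served)  # -> 40 100 : the queue drains, then 40 new arrivals
```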
So the queue backlog is updated by

Q_ij(t + 1) = max(Q_ij(t) − a_ij(t)l_max, 0) + A_ij(t), (9)

where A_ij(t) is the number of objects newly arrived at the server in time slot t. The dispatching of requests from the relays to the cache servers is decided by the active status of the server groups. An active server group i is responsible for its own segment i as well as the segments of its inactive predecessors. So the exact arrival A_ij(t) at a server is determined by both the provisioning vector Y(t) and the corresponding departure amounts d̂_k(t) at the related relays. Considering that the departure d̂_k(t) at relay k can be distributed among multiple servers in the same group i, we have

d̂_k(t) = Σ_j d_kj(t), (10)

where d_kj(t) is the specific amount out of d̂_k(t) dispatched to server j. Then the actual arrival amount at server j in group i can be represented by

A_ij(t) = Σ_{k=p+1}^{i} M(d_kj(t)), (11)

where p is the first index before i satisfying a_p(t) = 1. The function M(x) is from the modeling of the diminishing overhead in (5). Serving a request that retrieves a larger number of content objects lowers the latency shared by each object in the request, so M(x) can be intuitively viewed as a discount on the counted number of arrived objects at the servers when x is larger than 1. H and Q are used to represent the sets of relay queues and server queues. Our first objective is to stabilize the queues H and Q, which ensures the time-averaged queue length is

bounded, so that a certain cache access latency is ensured. We also try to maximize a utility function at each time slot to achieve certain other goals. The utility function is defined as

U(t) = −Σ_{(i,j)∈Q} a_ij(t) + γ (Σ_{i=1}^{N} C_i(t)a_i(t)) / (Σ_{i=1}^{N} a_i(t)), (12)

where C_i(t) is the cache hit rate of server group i at time t, obtained from (4). In (12), the first term represents the energy consumption of keeping some servers active at time t, and the second term represents the benefit of cache hits, modeled as the average hit rate over all the active server groups. Besides, γ is the weight of the cache hit rate performance, trading off the energy cost against the cache hit rate.

IV. ONLINE ALGORITHM

A. Algorithm Design

In order to achieve the goals specified above in an optimal way, we design an online algorithm that controls two sets of variables, i.e., the cache server activeness a_ij, and the dispatched amount d_kj from relay k to cache server j in the corresponding server group. Following the stochastic network optimization framework [10], the Lyapunov drift is used to quantify the queue stability. First we define the Lyapunov function on the queue backlogs of H and Q as

L(Q(t), H(t)) = Σ_Q Q_ij(t)^2 + Σ_H H_k(t)^2, (13)

abbreviated as L(t) in the following. Then the conditional Lyapunov drift at time t is obtained as

Δ(Q(t), H(t)) = E{L(t + 1) − L(t) | Q(t), H(t)}, (14)

which represents the change of the Lyapunov function conditioned on the known queue backlogs of the previous time slot. Minimizing the drift can ensure the queue stability, and hence the system stability. With the other goals on the energy cost and cache hit rate considered, we minimize the drift-plus-penalty function, defined as

Δ(Q(t), H(t)) − V E{U(t) | Q(t), H(t)}, (15)

where V is a parameter trading off the importance of queue stability against utility. Here we use the negative of the utility as the penalty in the drift-plus-penalty method [10].
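The quantities (13)-(15) can be sketched numerically; this is a one-slot sample of the drift (the expectation in (14)-(15) is dropped), with toy backlogs and an assumed utility value.

```python
def lyapunov(Q, H):
    # Quadratic Lyapunov function on all backlogs, eq. (13).
    return sum(q * q for row in Q for q in row) + sum(h * h for h in H)

def drift_plus_penalty(Q_now, H_now, Q_next, H_next, utility, V):
    # One-slot sample of the drift-plus-penalty expression, eq. (15);
    # the online algorithm picks the control minimizing this quantity.
    drift = lyapunov(Q_next, H_next) - lyapunov(Q_now, H_now)
    return drift - V * utility

Q0, H0 = [[3, 0], [2, 1]], [4, 1]   # backlogs at slot t
Q1, H1 = [[1, 0], [2, 0]], [2, 2]   # backlogs at slot t + 1
print(drift_plus_penalty(Q0, H0, Q1, H1, utility=-1.5, V=10.0))  # -> -3.0
```

A negative value means the queues shrank enough to outweigh the V-weighted penalty, which is the direction the per-slot minimization pushes the system.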
With the queue backlog dynamics defined in (6) and (9), and based on stochastic network optimization, (15) is relaxed to a minimization problem at each time slot t:

min 2 Σ_Q Q_ij(t)(A_ij(t) − a_ij(t)l_max) + 2 Σ_H H_k(t)(A_k(t) − d_k(t)) − V U(t)
s.t. (7)(8)(10)(11). (16)

In our scheme, this problem is solved in every time slot to give instructions on how to control the server provisioning and request dispatching of the system. In the problem, Q_ij(t), H_k(t) and A_k(t) are known parameters in each time slot, so below we determine the values of the other variables.

1) Server Group Activeness: The binary provisioning vector Y(t) determines the destination server group when requests are dispatched from the relays, and it affects the request arrival at each server group. So we iterate over all the 0-1 combinations of the vector Y(t) to search for the best solution of (16); in each iteration, the vector Y(t) is given, so we obtain the best conditional solution based on that specific Y(t); after the iterations, we choose the minimum among all the conditional solutions. Next we give the scheme to obtain the conditional best solution under a given Y(t). When Y(t) is given, the cache utility part of the objective function in (16) can be ignored, because its value is already determined by Y(t). Besides, 2 Σ_H H_k(t)A_k(t) can also be ignored, because it is not relevant to deciding the values of the variables.

2) Dispatching Destination: We now determine the value of d_kj(t), which is related to the constraints (10) and (11) of problem (16). Assuming the dispatched amounts d_k(t) of all the relays and the activeness a_ij(t) of all the servers are given, along with the d̂_k(t) determined by d_k(t), the objective of (16) is simplified to minimizing

2 Σ_Q Q_ij(t)A_ij(t) = 2 Σ_Q Q_ij(t) Σ_{k=p+1}^{i} M(d_kj(t)). (17)

The problem can be decoupled on i, so each active server group is considered separately.
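The exhaustive search over Y(t) described in 1) can be sketched as follows. The cost function here is a hypothetical stand-in for the best conditional value of (16) given Y(t), and the rule that at least one group must stay active is an assumption made for the sketch.

```python
from itertools import product

def best_provisioning(N, conditional_cost):
    # Exhaustive search over the 2**N binary provisioning vectors Y(t);
    # conditional_cost(Y) stands in for the best conditional solution
    # of (16) under a given Y(t).
    best = None
    for Y in product((0, 1), repeat=N):
        if not any(Y):
            continue  # assume at least one group must stay active
        cost = conditional_cost(Y)
        if best is None or cost < best[0]:
            best = (cost, Y)
    return best

# Toy cost: each active group costs 1, but leaving group 0 inactive is
# heavily penalized (e.g., a long backlog behind it).
cost, Y = best_provisioning(4, lambda Y: sum(Y) + (0 if Y[0] else 10))
print(cost, Y)  # -> 1 (1, 0, 0, 0)
```

This brute force costs O(2^N) evaluations, which is exactly what the dynamic programming of Section IV-A.5 is introduced to avoid.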
For any related relay k whose destination server group is i based on the given Y(t), although the departure amount d̂_k(t) is assumed given, the exact portion d_kj(t) distributed to each active server j in group i is still to be determined. On the left-hand side of (17), the Q_ij(t)s are in fact constant weights of a sum, and the A_ij(t)s can be adjusted to make the sum smaller. First, it is more beneficial to dispatch the requests to a shorter queue, as that ensures a smaller weight in the sum. Second, there is no benefit in partitioning d̂_k(t) into parts and dispatching each part to a different server in the group, proved by the fact that M(x_1) + M(x_2) = T(1)/T(x_1) + T(1)/T(x_2) ≥ T(1)/T(x_1 + x_2) = M(x_1 + x_2) holds for any positive x_1 and x_2. So the value of d_kj(t), given that the requests from relay k are dispatched to group i, should be set to

d_kj(t) = d̂_k(t), if j = argmin_{j'} Q_ij'(t) with a_ij'(t) = 1; and 0 otherwise. (18)

This can be intuitively understood as Completely Joining the Shortest Queue: for any relay k, its departing requests should all be dispatched to the active server with the shortest queue in its destination group determined by Y(t).

3) Dispatching Amount: Next, the solution of d_k(t) is to be determined. We still assume the activeness status a_ij(t) of the servers in a group is known. Problem (16) is then simplified to minimizing

−2 Σ_H H_k(t)d_k(t) + 2 Σ_Q Q_ij(t)A_ij(t), (19)

which can be decoupled on H, so in fact we are to minimize

−H_k(t)d_k(t) + Q_i*j*(t)[σ 1(d̂_k(t)) + τ d̂_k(t)], (20)

where i* is the destination group of relay k determined by Y(t) and j* is the active server with the shortest queue in that group. The indicator function 1(d̂_k(t)) reflects that the dispatching amount can be 0. Obviously, if d_k(t) = 0 then (20) is 0, so we set d_k(t) to a positive value only if doing so makes (20) negative. First we assume 1(d̂_k(t)) is 1. Because of the min function in (8), there are two cases when comparing d_max and H_k(t): 1) in the case of d_max ≥ H_k(t), we need to minimize −H_k(t)d_k(t) + Q_i*j*(t)τH_k(t), so we set d_k(t) to the maximum value d_max if d_max ≥ τQ_i*j*(t), and to 0 otherwise; 2) in the case of d_max < H_k(t), we need to minimize −H_k(t)d_k(t) + Q_i*j*(t)τd_k(t), so we set d_k(t) to the maximum value d_max if H_k(t) ≥ τQ_i*j*(t), and to 0 otherwise. Then we have the generalized solution

d_k(t) = d_max, if max(d_max, H_k(t)) ≥ τQ_i*j*(t); and 0 otherwise. (21)

After d_k(t) is determined, since we have assumed 1(d̂_k(t)) = 1, we need to check whether (20) with the obtained d_k(t) is less than 0. If it is, the obtained solution of d_k(t) is adopted; otherwise, d_k(t) is set to 0, which makes (20) equal to 0.

4) Cache Server Activeness: The value of a_ij(t) is still to be determined. Since the activeness status a_i(t) is known from the given Y(t), each active server group can be considered separately, along with the relays relying on it. Note that if a server in group i is kept active, there are two possible reasons: to dispatch requests to it from the corresponding relays, or to make it serve requests already in its queue. After decoupling (16) on i, in any group considered, when the server with the shortest queue among the active servers is given, denoted by s, the variable part of the objective function simplifies to

−2 Σ_j Q_ij(t)a_ij(t)l_max + V Σ_j a_ij(t). (22)

Because each server has a queue, below we represent a server by its queue.
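The dispatching rules (18) and (21), together with the sign check on (20), combine into one per-relay decision. A hedged sketch follows; the σ, τ and d_max values are illustrative, and the penalty below is a one-relay sample of (20) with d̂_k(t) written as `served`.

```python
def dispatch(H_k, queues, active, sigma=0.6, tau=0.4, d_max=1_800):
    # Completely-Join-the-Shortest-Queue, eq. (18): among the active
    # servers of the destination group, pick the shortest backlog.
    cand = [(q, j) for j, (q, a) in enumerate(zip(queues, active)) if a]
    q_star, j_star = min(cand)
    # Dispatch-amount threshold rule, eq. (21).
    d_k = d_max if max(d_max, H_k) >= tau * q_star else 0
    served = min(d_k, H_k)  # actual departures, eq. (8)
    # Keep the decision only if it makes (20) negative.
    penalty = -H_k * d_k + q_star * (sigma * (served > 0) + tau * served)
    if penalty >= 0:
        d_k = 0
    return d_k, j_star

d, j = dispatch(H_k=500, queues=[40, 10, 70], active=[1, 1, 1])
print(d, j)  # -> 1800 1 : dispatch the full batch to server 1
```

Sending the whole amount to a single server (rather than splitting it) is justified by the superadditivity M(x1) + M(x2) ≥ M(x1 + x2) shown above.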
Now we make decisions about whether to keep a queue longer than s active in the group. Intuitively, keeping a longer queue active has a larger potential to make (22) negative, because it contributes a larger weight Q_ij(t) to the sum. In the solution, by iteration, each of the queues is assumed to be the shortest in turn; in each iteration, we search among all the queues longer than s, following the decreasing order of queue length, and choose those that keep (22) negative to be active. For a given server group, the computational complexity of this method is O(M^2), because of the two-level iteration.

5) Improvement by Dynamic Programming: We have devised the method to obtain the solution conditioned on a specific Y(t). Y(t) has 2^N cases, so dynamic programming is introduced to reuse the intermediate results of shared subproblems across the iterations and to lower the complexity from O(2^N) to O(N^3). We devise the recursive formula of the DP as

D_(l1,l2] = min{ min_{l1<m<l2} {D_(l1,m] ⊕ D_(m,l2]}, I_(l1,l2] }, (23)

where I_(l1,l2] means the best solution for the server groups in the range (l1, l2], assuming that groups l1 and l2 are active and all the groups in between are inactive, and D_(l1,l2] means the best solution for the groups in the range (l1, l2], assuming at least groups l1 and l2 are active. The operation D_(l1,m] ⊕ D_(m,l2] merges two non-overlapping sub-solutions. Based on the formula, we begin from the solutions for the shortest ranges (l1, l2], and then gradually obtain the solution for the whole range of server groups. More details of this method are given in our technical report [15].

[Fig. 4. Effect of V: average queue length of servers, relays and active servers vs. parameter V.]
[Fig. 5. Queue length of servers under V = 5k, 10k, 20k, 40k.]
[Fig. 6. Number of active servers under V = 5k, 10k, 20k, 40k.]
[Fig. 7. Cache hit rate under varying γ.]
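The interval recursion (23) can be sketched on a linearized range of groups. This is a simplified stand-in: the interval cost I is a hypothetical function, the merge ⊕ is shown as plain addition of sub-costs, and the circular structure of the real problem is ignored.

```python
from functools import lru_cache

def dp_best(N, I):
    # Interval DP in the spirit of eq. (23): D(l1, l2] is the best cost
    # for the groups in (l1, l2] with l1 and l2 active, reusing shared
    # sub-range solutions via memoization.
    @lru_cache(maxsize=None)
    def D(l1, l2):
        best = I(l1, l2)  # only l1, l2 active, everything between off
        for m in range(l1 + 1, l2):
            best = min(best, D(l1, m) + D(m, l2))  # merge sub-solutions
        return best
    return D(0, N - 1)

# Toy interval cost: squared gap between consecutive active groups
# (a hypothetical stand-in for I); best is to keep every group active.
print(dp_best(6, lambda l1, l2: (l2 - l1) ** 2))  # -> 5
```

Each of the O(N^2) intervals is filled by scanning O(N) split points m, giving the O(N^3) overall complexity stated above.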
V. EVALUATIONS

We simulate a memory cache system with 20 server groups and at most 10 servers in each group. The maximum service rate of the servers l_max and that of the relays d_max are set to 700 and 1,800 objects/slot, respectively. The mean arrival rates at each of the 20 key segments are set to satisfy the ratio λ_1 : λ_2 : ... : λ_20, which is obtained by hashing the URL in each record of the Wikipedia request trace [13] into 20 segments and counting the number of requests in each segment. The average of the arrival rates, λ̄ = Σ_{i=1}^{20} λ_i / 20, is set to 700 by default, which makes the largest arrival rate λ_16 near the system capacity. In each time slot, the actual amount of arrivals for segment i is randomly generated by sampling a uniformly distributed random variable on [0, 2λ_i]. Each experiment runs 5,000 time slots by default. First, we investigate how the parameter V influences the queue length at the servers and relays. We expect that with the increase of V, the average queue length increases, because stochastic network optimization shows that the scheme ensures an O(V)-approximation of the queue length. Fig. 4 shows the time-averaged queue length of servers, relays and active servers with increasing V. The average queue length at the relays stays stable, because under the shown range of V, the throughput capacity at the servers is still higher than the actual arrival. Besides, the increase of V results in fewer servers being active to save energy, and makes the average queue length increase.

6 Queue length / Active servers Queue length Number of active servers Time Fig. 8. Handling bursty traffic Average queue length Fig. 9. Proposed JSQ+MaxWeight MaxWeight JSQ Random Performance comparison Second, we varied the overall average arrival rate λ and compared the performance. Fig. 5 shows the evaluation result on the time-averaged queue length and Fig. 6 shows the resultant number of active servers. The smaller value of V leads to the shorter queue length, however it results in a larger number of servers being active. Besides, when the workload becomes higher, the active number of servers and the queue length will increase. We also noticed that when λ is comparatively larger, the effect of V in lowering the number of active servers is weakened, since the arrival rate is close to the system capacity. Fig. 7 gives the cache hit rate performance with varying λ and γ. When the workload is comparatively low, we see the effect of parameter γ in helping the cache hit rate to increase is higher. In the extreme case, the cache hit rate is improved from about 61% to 76%. But when the arrival rate is near the same capacity, the same extent of increasing γ leads to a much less noticeable effect on the cache hit rate. The reason is that when the arrival rate is larger, the queue length at servers in certain groups would be longer, and then those groups have to be kept active, which leads to the less freedom in improving the cache hit rate. Third, an extra experiment was conducted to verify the performance under the dynamic or bursty changing workload. The experiment lasts 1, time slots. We set λ = 2, but in the time period of [3, 7], the arrival rates were increased to twice of the normal rate, so that the workload was largely increased in that period. We monitored the average server queue length and the number of active servers in each time slot. Fig. 
Fig. 8 shows that although the bursty workload increases the average queue length during the period [3, 7], the queue length drops back within a short time once the workload returns to normal. Besides, the increased number of active servers during the bursty period shows that the scheme can adapt to the dynamic workload. Finally, the performance of our scheme was compared with other schemes. For all the schemes compared, we adjust the parameters so that the resulting numbers of active servers are similar, and use the average queue length as the comparison metric. In the compared schemes, the ratio of the number of active servers across the groups is set proportional to the workload predicted by the Exponentially Weighted Moving Average (EWMA) [16]. The exact servers to be activated in each group are then selected either randomly or by giving a higher priority to the servers with a larger queue backlog (MaxWeight), and the dispatching from a relay to an active server is either random or by Join-the-Shortest-Queue (JSQ). The results after the first 5,000 slots are shown in Fig. 9: the proposed scheme outperforms the others under all settings of λ. The performance of the JSQ+MaxWeight scheme is quite close to that of the proposed scheme when the mean arrival rate is low, but the gap widens as λ increases.

VI. CONCLUSIONS

We investigated the dynamic provisioning and request dispatching in distributed memory cache services and proposed an online scheme that achieves system stability, low energy cost and a high cache hit rate simultaneously. The stochastic network optimization framework was applied in devising the scheme, and dynamic programming was used to lower the algorithm complexity. The proposed scheme only requires information that can be obtained in the current time slot, and its performance is evaluated through extensive simulations.

ACKNOWLEDGMENT

This work is supported in part by NSERC, CFI and BCKDF.

REFERENCES

[1] Memcached, online.
[2] D. Karger, A. Sherman, A. Berkheimer, B. Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi, "Web caching with consistent hashing," Computer Networks, vol. 31, no. 11, 1999.
[3] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, "Dynamic resource allocation and power management in virtualized data centers," in Proc. of NOMS, 2010.
[4] S. T. Maguluri, R. Srikant, and L. Ying, "Stochastic models of load balancing and scheduling in cloud computing clusters," in Proc. of IEEE INFOCOM, 2012.
[5] Z. Zhou, F. Liu, H. Jin, B. Li, B. Li, and H. Jiang, "On arbitrating the power-performance tradeoff in SaaS clouds," in Proc. of IEEE INFOCOM, 2013.
[6] J. Protic, M. Tomasevic, and V. Milutinovic, Distributed Shared Memory: Concepts and Systems. John Wiley & Sons.
[7] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab et al., "Scaling Memcache at Facebook," in Proc. of USENIX NSDI, 2013.
[8] S. Raindel and Y. Birk, "Replicate and bundle (RnB): a mechanism for relieving bottlenecks in data centers," in Proc. of IEEE IPDPS, 2013.
[9] J. Byers, J. Considine, and M. Mitzenmacher, "Simple load balancing for distributed hash tables," in Peer-to-Peer Systems II. Springer, 2003.
[10] M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems, Synthesis Lectures on Communication Networks, vol. 3, no. 1, 2010.
[11] L. Tassiulas and A. Ephremides, "Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks," IEEE Transactions on Automatic Control, vol. 37, no. 12, 1992.
[12] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in ACM SIGCOMM Computer Communication Review, vol. 31, no. 4, 2001.
[13] G. Urdaneta, G. Pierre, and M. van Steen, "Wikipedia workload analysis for decentralized hosting," Elsevier Computer Networks, vol. 53, no. 11, 2009.
[14] Online.
[15] B. Yu and J. Pan, "Optimize the dynamic provisioning and request dispatching in distributed memory cache services," ca/ boyangyu/tr-dcache.pdf, Tech. Rep., 2014.
[16] J. S. Hunter, "The exponentially weighted moving average," Journal of Quality Technology, vol. 18, no. 4, 1986.
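As a supplementary illustration of the baseline policies used in the comparison (the EWMA workload predictor, MaxWeight-style server activation, and JSQ dispatching), the following is a minimal sketch; it is not from the paper, and all function and variable names are our own:

```python
# Sketch (not the paper's implementation) of the three baseline building blocks
# compared in the evaluation: EWMA prediction, MaxWeight activation, and JSQ.

def ewma_predict(prev_estimate, observed, alpha=0.2):
    """EWMA: new estimate = alpha * observation + (1 - alpha) * old estimate."""
    return alpha * observed + (1 - alpha) * prev_estimate

def maxweight_activate(queue_backlogs, num_active):
    """Activate the servers with the largest queue backlogs (MaxWeight-style)."""
    ranked = sorted(range(len(queue_backlogs)),
                    key=lambda i: queue_backlogs[i], reverse=True)
    return set(ranked[:num_active])

def jsq_dispatch(active_servers, queue_backlogs):
    """Dispatch an arriving request to the active server with the shortest queue."""
    return min(active_servers, key=lambda i: queue_backlogs[i])

# Example: predict the next-slot load, activate 2 of 4 servers, dispatch a request.
estimate = ewma_predict(prev_estimate=10.0, observed=16.0)   # 11.2
backlogs = [5, 9, 2, 7]
active = maxweight_activate(backlogs, num_active=2)          # servers {1, 3}
target = jsq_dispatch(active, backlogs)                      # server 3 (backlog 7)
```

In the compared schemes these pieces are combined per group: EWMA predicts each group's workload to size its share of active servers, MaxWeight (or random choice) picks which servers are active, and JSQ (or random choice) dispatches requests among them.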


More information

Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

More information

LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM

LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM IN PEER TO PEER NETWORKS R. Vijayalakshmi and S. Muthu Kumarasamy Dept. of Computer Science & Engineering, S.A. Engineering College Anna University, Chennai,

More information

Dynamic Virtual Machine Allocation in Cloud Server Facility Systems with Renewable Energy Sources

Dynamic Virtual Machine Allocation in Cloud Server Facility Systems with Renewable Energy Sources Dynamic Virtual Machine Allocation in Cloud Server Facility Systems with Renewable Energy Sources Dimitris Hatzopoulos University of Thessaly, Greece Iordanis Koutsopoulos Athens University of Economics

More information

Power-Aware Autonomous Distributed Storage Systems for Internet Hosting Service Platforms

Power-Aware Autonomous Distributed Storage Systems for Internet Hosting Service Platforms Power-Aware Autonomous Distributed Storage Systems for Internet Hosting Service Platforms Jumpei Okoshi, Koji Hasebe, and Kazuhiko Kato Department of Computer Science, University of Tsukuba, Japan {oks@osss.,hasebe@,kato@}cs.tsukuba.ac.jp

More information

An Adaptive Load Balancing to Provide Quality of Service

An Adaptive Load Balancing to Provide Quality of Service An Adaptive Load Balancing to Provide Quality of Service 1 Zahra Vali, 2 Massoud Reza Hashemi, 3 Neda Moghim *1, Isfahan University of Technology, Isfahan, Iran 2, Isfahan University of Technology, Isfahan,

More information

A MULTI-PERIOD INVESTMENT SELECTION MODEL FOR STRATEGIC RAILWAY CAPACITY PLANNING

A MULTI-PERIOD INVESTMENT SELECTION MODEL FOR STRATEGIC RAILWAY CAPACITY PLANNING A MULTI-PERIOD INVESTMENT SELECTION MODEL FOR STRATEGIC RAILWAY Yung-Cheng (Rex) Lai, Assistant Professor, Department of Civil Engineering, National Taiwan University, Rm 313, Civil Engineering Building,

More information