Profiling services for resource optimization and capacity planning in distributed systems



Cluster Comput (2008) 11: 313-329
DOI 10.1007/s10586-008-0063-x

Profiling services for resource optimization and capacity planning in distributed systems

Guofei Jiang, Haifeng Chen, Kenji Yoshihira

Received: 22 May 2008 / Accepted: 8 September 2008 / Published online: 24 September 2008
© Springer Science+Business Media, LLC 2008

Abstract: The capacity needs of online services are mainly determined by the volume of user loads. For large-scale distributed systems running such services, it is quite difficult to match the capacities of the various system components. In this paper, a novel and systematic approach is proposed to profile services for resource optimization and capacity planning. We collect resource-consumption-related measurements from various components across distributed systems and search for constant relationships between these measurements. If such relationships always hold under various workloads over time, we consider them invariants of the underlying system. After extracting many invariants from the system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By comparing the current resource configurations against the estimated capacity needs, we can discover the weakest points that may deteriorate system performance. Operators can consult such analytical results to optimize resource assignments and remove potential performance bottlenecks. In this paper, we propose several algorithms to support capacity analysis and guide operators' capacity planning tasks. Our algorithms are evaluated on real systems, and experimental results are included to demonstrate the effectiveness of our approach.

G. Jiang, H. Chen, K. Yoshihira: NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA. E-mail: gfj@nec-labs.com, haifeng@nec-labs.com, kenji@nec-labs.com

Keywords: System management; Capacity planning; Resource optimization; Distributed systems; System invariants; Algorithms

1 Introduction

In the last decade, with the great success of Internet technology, many large-scale distributed systems have been built to run various online services. These systems have unprecedented capacity to process large volumes of transaction requests simultaneously. For example, Google has thousands of servers to handle millions of user queries every day, and Amazon sold 3.6 million items, or 41 items per second, on the single day of December 12, 2005 [2]. While clients may only see a single website, the distributed systems running such services are very complex and can include thousands of components such as operating systems, application software, servers, and networking and storage devices. Meanwhile, clients always expect high Quality of Service (QoS), such as short latency and high availability, from online transaction services. Studies have shown that clients are easily dissatisfied by unreliable services or even a few seconds' delay in response time. However, under the many dynamics and uncertainties of user loads and behaviors, some components inside the system may suddenly become a performance bottleneck and deteriorate system QoS. Therefore, it is very desirable to make the right capacity planning decisions for each component so as to keep the whole system in a good state.

Operators usually need to consider two important factors for capacity planning and resource optimization. On one hand, sufficient hardware resources have to be deployed to meet customers' QoS expectations. On the other hand, an oversized system could significantly waste hardware resources, increase IT costs and reduce profits. For

distributed systems, it is especially important to balance resources across distributed components so as to achieve maximal system-level capacity. Otherwise, mismatched component capacities could lead to performance bottlenecks at some segments of the system while wasting resources at other segments. Therefore, a critical challenge is how to match the capacities of a large number of components in a distributed system. In practice, operators may go through many trial-and-error procedures in their capacity planning work and assign resources based on intuition, practical experience or rules of thumb [1]. It is therefore difficult to systematically and precisely analyze the capacity needs of individual components in a distributed system of scale and complexity.

For standalone software, we usually use fixed numbers to specify hardware requirements such as CPU frequency and memory size. For example, Microsoft recommends the following minimum system requirements to run Microsoft Office 2003: a Pentium III processor with a clock speed of at least 233 MHz, a minimum of 128 MB RAM and at least 400 MB of free hard-disk space [24]. However, it is difficult to give such specifications for online services because their system requirements are mainly determined by an external factor: the volume of user loads. For various user loads, we need a model rather than a fixed number to analyze the capacity needs of each component. Such models could enable us to answer "what if" questions in capacity planning. For example, what should the size of the database server be if the volume of web requests triples tomorrow? Though queuing and other models are widely applied in performance modeling [21], these models are often used to analyze a limited number of components under various assumptions.
For example, a closed queuing network was used to model multi-tier Internet applications [31], but only CPU resources were considered in the performance model. In practice, we have to consider many other resources as well, such as memory, disk I/O and network. Therefore, queuing models do not seem to scale well for profiling large distributed systems.

In this paper, we propose a novel approach to profile services for capacity planning and resource optimization. During operation, distributed systems generate large amounts of monitoring data to track their operational status. We collect resource-consumption-related monitoring data from various components across distributed systems. CPU usage, network traffic volume, and the number of SQL queries are typical examples of such monitoring data. While large volumes of user requests flow through the various components, many resource-consumption-related measurements respond to the intensity of user loads accordingly. Here we introduce a new concept named flow intensity to measure the intensity with which internal measurements react to the volume of user loads. We further search for constant relationships between these flow intensity measurements collected at various points across the system. If such relationships always hold under various workloads over time, we consider them invariants of the underlying distributed system. In this paper, we propose an algorithm to automatically extract such invariants from monitoring data. After extracting many invariants from a distributed system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By comparing the current resource assignments against the estimated capacity needs, we can also discover and rank the weakest points that may deteriorate system performance. Operators can then consult such analytical results to optimize resource assignments and remove potential performance bottlenecks.
Several graph-based algorithms are proposed in this paper to support such capacity analysis. Our algorithms are tested on real distributed systems, including a large production system, and experimental results are included to demonstrate the effectiveness of our approach.

2 System invariants and capacity planning

As discussed earlier, we need to collect monitoring data from operational systems to profile online services. In this paper, capacity planning is discussed in the context of the system testing or operational management stage, not the system design stage. For example, we do not analyze how to optimize architecture design to improve system capacity. Instead, we assume that services are already deployed and functional on distributed systems so that we are able to collect monitoring data from operational systems. In fact, our approach is proposed to guide operators' capacity planning tasks during system evolution: for example, how to upgrade a system's capacity during sales events, and how to locate performance bottlenecks.

For online services, many internal measurements respond to the intensity of user loads accordingly. For example, network traffic volume and CPU usage usually go up and down in accordance with the volume of user requests. This is especially true for many resource-consumption-related measurements because they are mainly driven by the intensity of user loads. In this paper, a metric named flow intensity is used to measure the intensity with which such internal measurements react to the volume of user requests. For example, the number of SQL queries and average CPU usage (per sampling unit) are such flow intensity measurements. We observe that there exist strong correlations between these flow intensity measurements. Over time, many resource-consumption-related measurements have similar evolving curves because they mainly respond to the same external factor: the volume of user requests. As an example, Fig. 1

shows the intensities of HTTP requests and SQL queries collected from a typical three-tier web system; their curves are very similar.

[Fig. 1: Examples of flow intensities. Fig. 2: An example of an invariant network.]

As an engineered system, a distributed system imposes many constraints on the relationships among these internal measurements. Such constraints can result from many factors such as hardware capacity, application software logic, system architecture and functionality. For example, in a web system, if a specific HTTP request x always leads to two related SQL queries y, we should always have I(y) = 2I(x) because this logic is written in its application software. Note that here we use I(x) and I(y) to represent the flow intensities measured at points x and y respectively. No matter how the flow intensities I(x) and I(y) change in accordance with varying user loads, such a relationship (the equation I(y) = 2I(x)) is always constant. In this paper, we model and search for the relationships among measurements collected at various points across distributed systems. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Note that it is the relationship I(y) = 2I(x), not the measurements themselves, that is considered an invariant. Our previous work [13] verified that such invariant relationships widely exist in real distributed systems, where they are governed by the physical properties or logic constraints of system components. For a typical three-tier web system including a web server, an application server and a database server, we collected 111 measurements and extracted 975 such invariants among them. In this paper, we include an algorithm to automatically extract such invariants from the measurements collected at various points across distributed systems.
These invariants characterize the constant relationships between various flow intensity measurements and together form a network of invariants. A simple example of such a network is illustrated in Fig. 2. In this network, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. After extracting invariants from a distributed system, we can use such an invariant network to profile services for capacity planning and resource optimization.

For online services, we can use trend analysis [5] to predict the future volume of user requests. The challenge is then how to estimate and upgrade the capacity of the various components inside the distributed system so as to serve the predicted volume of user requests. For example, based on analysis of the web server's access log, the volume of HTTP requests is predicted to grow 150% within a month. Now we need to analyze whether the current capacities of internal components (such as the memory of the application server and the disk I/O utilization of the database server) are sufficient to support such growth. Since the validity of invariants is not affected by changes in user loads, we choose the volume of user requests as the starting node and sequentially follow the edges (i.e., the invariant relationships shown in Fig. 2) to derive the capacity needs of the various internal components. In the above example, if the predicted number of HTTP requests is I(x1), we can use the invariant relationship I(y) = 2I(x) to conclude that the number of SQL queries must be 2I(x1). We can further consult this information when upgrading the related database server. Note that here the capacity needs of components are quantitatively represented by these resource-consumption-related measurements. For example, given a workload, a server may be required to have two 1 GHz CPUs, 4 GB of memory, 100 MB/s of network bandwidth, etc.
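The edge-following estimation described above can be sketched as a small graph walk. This is an illustrative sketch, not the paper's exact algorithm: the node names, the 0.5 ratio and the estimate() helper are hypothetical, while the 2.0 ratio mirrors the paper's example I(y) = 2I(x).

```python
# Follow invariant edges outward from the user-load node to estimate the
# flow intensity (and hence the capacity need) of each downstream component.
invariants = {
    ("http_requests", "sql_queries"): lambda i: 2.0 * i,   # I(y) = 2*I(x)
    ("sql_queries", "db_disk_io"):    lambda i: 0.5 * i,   # assumed ratio
}

def estimate(start_node, start_intensity):
    """Walk the invariant edges outward from the starting node."""
    needs = {start_node: start_intensity}
    frontier = [start_node]
    while frontier:
        node = frontier.pop()
        for (src, dst), f in invariants.items():
            if src == node and dst not in needs:
                needs[dst] = f(needs[node])
                frontier.append(dst)
    return needs

# Predicted HTTP intensity of 300 requests per sampling unit:
needs = estimate("http_requests", 300.0)
```

The resulting `needs` maps each reachable node to its estimated flow intensity (here 600 SQL queries and 300 disk-I/O units per sampling unit), which can then be compared against the current resource configurations.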
By comparing the current resource configurations against the estimated capacity needs, we can also discover the weakest points that may become performance bottlenecks. Therefore, given any volume of new user loads, operators can use such a network of invariants to estimate the capacity needs of various components, balance resource assignments and remove potential performance bottlenecks. In the following two sections, we first introduce the models of invariants and our invariant search algorithm before we

propose and discuss capacity planning algorithms. A similar invariant search algorithm was proposed in our previous work [14] for fault detection and isolation. However, in this paper we make the following new contributions:

- We extend our work to extract invariants among multiple workload classes, and modify the invariant search algorithm accordingly for capacity planning and resource optimization;
- We propose a new algorithm that uses invariant networks to predict the capacity needs of components under any new workloads;
- We propose a method to optimize and balance resource assignments in distributed systems based on the estimated capacity needs;
- Finally, we evaluate our approach on real systems, including a commercial distributed system.

3 Correlation of flow intensities

For convenience, in the following sections, variables such as x and y are used to represent flow intensity measurements, and we use equations such as y = f(x) to represent invariants. With flow intensities measured at various points across systems, we need to consider how to model the relationships between these measurements, i.e., given measurements x and y, how do we learn a function f so that y = f(x)? As mentioned earlier, many such resource-consumption-related measurements change in accordance with the volume of user requests. As time series, these measurements have similar evolving curves along time t and have linear relationships. In this paper, we use AutoRegressive models with eXogenous inputs (ARX) [20] to learn linear relationships between measurements. At time t, we denote the flow intensities measured at the input and output of a component by x(t) and y(t) respectively.
The ARX model describes the following relationship between two flow intensities:

y(t) + a_1·y(t−1) + ... + a_n·y(t−n) = b_0·x(t−k) + ... + b_{m−1}·x(t−k−m+1) + b,   (1)

where [n, m, k] is the order of the model, which determines how many previous steps affect the current output, and a_i and b_j are the coefficient parameters that reflect how strongly a previous step affects the current output. Since there exist time delays in correlating measurements across distributed systems, and various system components may have unsynchronized clocks, we include this temporal dependency in the ARX model. In fact, even with synchronized clocks, different components may log their data timestamps with various time delays.

Let us denote:

θ = [a_1, ..., a_n, b_0, ..., b_{m−1}, b]^T,   (2)

φ(t) = [−y(t−1), ..., −y(t−n), x(t−k), ..., x(t−k−m+1), 1]^T.   (3)

Then (1) can be rewritten as:

y(t) = φ(t)^T·θ.   (4)

Assuming that we have observed the two measurements over a time interval 1 ≤ t ≤ N, let us denote this observation by:

O_N = {x(1), y(1), ..., x(N), y(N)}.   (5)

For a given θ, we can use the observed inputs x(t) to calculate the simulated outputs ŷ(t|θ) according to (1). Thus we can compare the simulated outputs with the real observed outputs and define the estimation error by:

E_N(θ, O_N) = (1/N)·Σ_{t=1}^{N} (y(t) − ŷ(t|θ))² = (1/N)·Σ_{t=1}^{N} (y(t) − φ(t)^T·θ)².   (6)

The Least Squares Method (LSM) finds the θ̂ that minimizes the estimation error E_N(θ, O_N):

θ̂_N = [Σ_{t=1}^{N} φ(t)·φ(t)^T]^{−1} · Σ_{t=1}^{N} φ(t)·y(t).   (7)

There are several criteria to evaluate how well the learned model fits the real observation. In this paper, we use the following equation to calculate a normalized fitness score for model validation:

F(θ) = 1 − [Σ_{t=1}^{N} |y(t) − ŷ(t|θ)|²] / [Σ_{t=1}^{N} |y(t) − ȳ|²],   (8)

where ȳ is the mean of the real output y(t). Equation (8) is essentially a metric of how well the learned model approximates the real data.
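Equations (1)-(8) amount to an ordinary least-squares fit over lagged regressor vectors. A minimal sketch in Python, assuming NumPy is available; the function name fit_arx and the example order [n, m, k] are illustrative, not from the paper:

```python
import numpy as np

def fit_arx(x, y, n=2, m=2, k=0):
    """Fit y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + b
    by least squares (Eq. (7)) and return (theta, fitness score)."""
    N = len(y)
    start = max(n, k + m - 1)            # first t with a full regressor vector
    rows, targets = [], []
    for t in range(start, N):
        # Regressor phi(t) of Eq. (3): lagged -y terms, lagged x terms, bias.
        phi = ([-y[t - i] for i in range(1, n + 1)] +
               [x[t - k - j] for j in range(m)] + [1.0])
        rows.append(phi)
        targets.append(y[t])
    Phi = np.array(rows)
    Y = np.array(targets)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # Eq. (7)
    y_hat = Phi @ theta
    # Normalized fitness score of Eq. (8); 1 means a perfect fit.
    F = 1.0 - np.sum((Y - y_hat) ** 2) / np.sum((Y - Y.mean()) ** 2)
    return theta, F
```

For two series that truly satisfy a linear relationship, e.g. y(t) = 2x(t), the fitness score comes out near its upper bound of 1.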
A higher fitness score indicates that the model fits the observed data better; its upper bound is 1. Given the observation of two flow intensities, we can always use (7) to learn a model, even if this model does not reflect their real relationship at all. Based on statistical theory, only a model with a high fitness score is really meaningful in characterizing a linear data relationship. We can set a range for the order [n, m, k] rather than a fixed value, learn a list of model candidates, and then select the model with the highest fitness score. Other criteria such

as Minimum Description Length (MDL) [28] can also be used to select models. Note that we use the ARX model to learn the long-run relationship between two measurements, i.e., a model y = f(x) only captures the main characteristics of their relationship. The precise relationship between two measurements should be represented as y = f(x) + ε, where ε is the modeling error. Note that ε is usually small for a model with a high fitness score.

The ARX model shown in (1) can easily be extended to learn a relationship with multiple inputs and multiple outputs. For example, the volume of HTTP requests x can usually be split into multiple classes such as browsing, shopping and ordering. Different types of user requests may result in different amounts of resource consumption. Let us use x_i (1 ≤ i ≤ N) to represent the volumes of the N request types respectively, so that x = Σ_{i=1}^{N} x_i. Now, if the relationship y = f(x) is sensitive to changes in the distribution of request types, we can derive a new relationship y = f(x_1, ..., x_N). In this case, (1) can be replaced with the following equation with multiple inputs, where {b^i_0, ..., b^i_{m_i−1}} denotes the coefficient parameters for the i-th request type:

y(t) + a_1·y(t−1) + ... + a_n·y(t−n)
  = b^1_0·x_1(t−k_1) + ... + b^1_{m_1−1}·x_1(t−k_1−m_1+1) + ...
  + b^N_0·x_N(t−k_N) + ... + b^N_{m_N−1}·x_N(t−k_N−m_N+1) + b
  = Σ_{i=1}^{N} Σ_{j=0}^{m_i−1} b^i_j·x_i(t−k_i−j) + b.   (9)

Now we can define a new θ and φ(t) with the following equations:

θ = [a_1, ..., a_n, b^1_0, ..., b^1_{m_1−1}, ..., b^N_0, ..., b^N_{m_N−1}, b]^T,   (10)

φ(t) = [−y(t−1), ..., −y(t−n), x_1(t−k_1), ..., x_1(t−k_1−m_1+1), ..., x_N(t−k_N), ..., x_N(t−k_N−m_N+1), 1]^T.   (11)

It is straightforward to see that the same equations (7) and (8) can be used to estimate the parameter θ and calculate its fitness score respectively. In practice, we can select k_i = k and m_i = m (1 ≤ i ≤ N, with k and m fixed values) to reduce the parameter search space.
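A minimal sketch of the simplest special case of (9), with no lags, where each coefficient is the per-class resource cost of one workload type. The class split, the cost vector and the synthetic data are made up for illustration; only NumPy's least-squares routine is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(10.0, 100.0, size=(200, 3))   # browsing, shopping, ordering intensities
true_b = np.array([0.2, 1.0, 3.5])            # assumed per-class CPU cost units
y = X @ true_b                                 # observed aggregate CPU usage

# The same least-squares estimator as Eq. (7), applied to lag-free regressors:
# the recovered b_hat[i] is the resource consumption unit of request type i.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With noise-free data the estimator recovers the per-class costs exactly; with real monitoring data it returns the best-fit costs in the least-squares sense.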
In particular, if we do not consider any time delay in (9), we have y = Σ_{i=1}^{N} b^i_0·x_i. In this case, b^i_0 represents the resource consumption unit for each request of type i. The same equations also work here to estimate the best-fit parameters b^i_0 (1 ≤ i ≤ N). With this extension to model invariants among multiple classes of workloads, our capacity planning approach works even if the distribution of workload classes is not stationary [29]. Without loss of generality, in the following sections we will use (1) to introduce the concept behind our algorithms. Later in this paper we will extend our algorithms to support capacity planning under multiple classes of workloads.

4 Extracting invariants

In the above section, we analyzed how to automatically learn a model given two measurements. In practice, we may collect many resource-consumption-related measurements from a complex system, but obviously not every pair of them will have such a linear relationship. Due to system uncertainties and user behavior changes, some learned models may not be robust over time. The challenging question is how to extract invariants from a large number of measurements. In practice, we may manually build some relationships based on prior system knowledge. However, such knowledge is usually very limited and system dependent. In this section, we introduce an automated algorithm to extract and validate invariants from monitoring data. Note that for capacity planning purposes, we only need to search for invariants among resource-consumption-related measurements. Assume that we have m such measurements, denoted by I_i, 1 ≤ i ≤ m. Since we have little knowledge about their relationships in a specific system, we try every combination of two measurements: we first construct a model and then continue to validate whether this model fits new observations, i.e., we use brute-force search to construct all hypotheses of invariants first and then sequentially test the validity of these hypotheses in operation.
Note that a 24×365 operational system always provides sufficient monitoring data to validate these hypotheses over time. The fitness score F_k(θ) given by (8) is used to evaluate how well a learned model matches the data observed during the k-th time window. We denote the length of this window by l, i.e., each window includes l sampling points of the measurements. As discussed earlier, given two measurements, we can always use (7) to learn a model. However, models with low fitness scores do not characterize the real data relationships well, so we choose a threshold F̃ to filter out those models during sequential testing. Denote by M_k the set of valid models at time t = k·l (i.e., after k time windows). During the sequential testing, as soon as F_k(θ) ≤ F̃, we stop testing that model and remove it from M_k. After receiving monitoring data for k such windows, i.e., k·l sampling points in total, we can calculate a confidence score with the following equation:

p_k(θ) = (1/k)·Σ_{i=1}^{k} F_i(θ) = [p_{k−1}(θ)·(k−1) + F_k(θ)] / k.   (12)

Algorithm 4.1 (invariant extraction)
Input: I_i(t), 1 ≤ i ≤ m
Output: M_k and p_k(θ) for each time window k

Part I: Model construction
  At time t = l (i.e., k = 1), set M_1 to an empty set.
  For each I_i and I_j, 1 ≤ i, j ≤ m, i ≠ j:
    learn a model θ_ij using (7); compute F_1(θ_ij) with (8);
    if F_1(θ_ij) > F̃, then set M_1 = M_1 ∪ {θ_ij} and p_1(θ_ij) = F_1(θ_ij).

Part II: Sequential validation
  For each time t = k·l (k > 1), set M_k = M_{k−1};
  for each θ_ij ∈ M_k:
    compute F_k(θ_ij) with (8) using I_i(t) and I_j(t), (k−1)·l + 1 ≤ t ≤ k·l;
    if F_k(θ_ij) ≤ F̃, then remove θ_ij from M_k;
    otherwise, update p_k(θ_ij) with (12).
  Output M_k and p_k(θ); set k = k + 1.

[Fig. 3: The invariant extraction algorithm.]

In fact, p_k(θ) is the average fitness score over k time windows. Since the set M_k only includes valid models and F_i(θ) > F̃ (1 ≤ i ≤ k), we always have F̃ < p_k(θ) ≤ 1. The invariant extraction algorithm is shown in Fig. 3 and includes two parts: Part I for model construction and Part II for sequential validation.

The invariants extracted with Algorithm 4.1 should essentially be considered likely invariants. As mentioned earlier, a model can be regarded as an invariant of the underlying system only if it holds all the time. However, even if the validity of a model has been sequentially tested for a long time, we still cannot guarantee that it will always hold. Therefore, it is more accurate to consider these valid models as likely invariants. Based on historical monitoring data, each confidence score p_k(θ) in fact measures the robustness of an invariant. Note that, given two measurements, we do not logically know which one should be chosen as the input or output (i.e., x or y in (1)) in complex systems. Therefore, we always construct two models with the input and output reversed. If the two learned models have very different fitness scores, we must have constructed an AutoRegressive (AR) model rather than an ARX model.
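A compact sketch of Algorithm 4.1's two phases, with the model fitting abstracted into a caller-supplied score function (a hypothetical helper that returns the fitness score F_k of a model of y from x on one window). For brevity, this sketch omits the bi-directional AR-model filtering and uses an assumed threshold of 0.8 for F̃.

```python
def extract_invariants(series, windows, score, f_thresh=0.8):
    """series: dict name -> list of samples; windows: list of (start, end)
    index pairs, one per time window; score(xw, yw): fitness of a model
    of y from x on one window. f_thresh plays the role of F~."""
    names = list(series)
    # Part I: model construction on the first window.
    s0, e0 = windows[0]
    confidence = {}  # (input, output) -> p_k, the running confidence score
    for xi in names:
        for yi in names:
            if xi != yi:
                f1 = score(series[xi][s0:e0], series[yi][s0:e0])
                if f1 > f_thresh:
                    confidence[(xi, yi)] = f1
    # Part II: sequential validation on later windows.
    for k, (s, e) in enumerate(windows[1:], start=2):
        for xi, yi in list(confidence):
            fk = score(series[xi][s:e], series[yi][s:e])
            if fk <= f_thresh:
                del confidence[(xi, yi)]           # hypothesis broken: drop it
            else:                                   # Eq. (12) incremental update
                confidence[(xi, yi)] = (confidence[(xi, yi)] * (k - 1) + fk) / k
    return confidence
```

The surviving keys are the likely invariants and each value is the confidence score p_k(θ) of equation (12).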
Since we are only interested in strong correlations between two measurements, we filter out those AR models by requiring the fitness scores of both models to exceed the threshold. Therefore, an invariant relationship between two measurements is bi-directional in this paper. Our previous work verified that invariants widely exist in distributed systems, and Fig. 4 shows an invariant network extracted from a typical three-tier web system [13]. In this figure, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements.

[Fig. 4: An invariant network extracted from a three-tier web system.]

Since we do not need any iterations to calculate θ in (7), the computational complexity of Algorithm 4.1 is usually acceptable even under the O(m²) brute-force search. For example, it takes a common laptop around 20 minutes to extract the invariants from 100 flow intensity measurements (i.e., m = 100), where each measurement (as a time series) includes 1000 data points. Note that Part II runs much faster than Part I in Algorithm 4.1: it only takes 2 seconds to validate the nearly 1000 invariants shown in Fig. 4. For large systems, we have proposed efficient and scalable algorithms that extract invariants by trading off search accuracy [15]. Due to limited space, we will not discuss the scalability issue here. Note that the invariant extraction time is negligible compared to the time needed to manually build models for capacity analysis. As discussed in Sect. 3, if some flow intensity metrics include multiple classes of measurements, we can use multiple variables instead of one in model construction, and Algorithm 4.1 still works without any change. For example, we can set I_i = {I^1_i, ..., I^N_i} in Algorithm 4.1; in this case, I_i is a vector rather than a scalar.

5 Estimation of capacity needs

In the above section, Algorithm 4.1 automatically extracts all possible invariants among the measurements I_i, 1 ≤ i ≤ m.
Together, these measurements and invariants form a relational network that can be used as a model to systematically profile services. Under a low volume of user

requests, we extract a network of invariants from the system when its quality of service meets clients' expectations, i.e., we only profile a system when it is in a good state. Assume that we have collected ten resource-consumption-related measurements (i.e., m = 10) from a system and extracted an invariant network as shown in Fig. 5. For simplicity, here we use this network as an example to illustrate our capacity planning algorithms.

Fig. 5 Capacity planning using invariant networks

In this graph, each node with number i represents the flow intensity measurement I_i. As discussed earlier, I_i could also be a vector including multiple sub-class measurements. Since we use a threshold F̄ to filter out models with low fitness scores, not every pair of measurements has an invariant relationship. Therefore, in Fig. 5, we observe two disconnected subnetworks and even some isolated nodes such as node 1. An isolated node implies that this measurement does not have any linear relationship with the other measurements. All edges are bi-directional because we always construct two models (with reversed input and output) between two measurements. Now let's consider a triangle relationship among three measurements such as {I_10, I_3, I_4}. Assume that we have I_3 = f(I_10) and I_4 = g(I_3), where f and g are both linear functions as shown in (1). Based on the triangle relationship, theoretically we can conclude that I_4 = g(I_3) = g(f(I_10)). By the linearity of f and g, the composition g(f(·)) is linear too, which implies that there should exist an invariant relationship between the measurements I_10 and I_4. However, since we use a threshold to filter out models with low fitness scores, such a linear relationship may not be robust enough to be considered an invariant by Algorithm 4.1. This explains why there is no edge between I_10 and I_4. As discussed in Sect.
2, invariants characterize constant long-run relationships between measurements, and their validity is not affected by the dynamics of user loads. While each invariant models some local relationship between its associated measurements, the network of invariants can capture many invariant constraints underlying the whole distributed system. Rather than using one or several analytical models to profile services, here we combine a large number of invariant models into a network to analyze capacity needs and optimize resource assignments. At time t (e.g., in a month or during a sales event), assume that the maximal volume of user requests is predicted to be x̄. Without loss of generality, in Fig. 5, we assume that the measurement I_10 represents the volume of user requests, i.e., I_10 = x̄. Now the challenging question is how to upgrade the capacities of the other nodes so as to serve this volume of user requests. Starting from the node I_10 = x̄, we sequentially follow edges to estimate the capacity needs of the other nodes in the invariant network. According to Fig. 5, we can reach the nodes {I_3, I_5, I_7} in one hop. Given I_10 = x̄, the question now is how to follow invariants to estimate these measurements. As discussed in Sect. 3, we use the model shown in (1) to search for invariant relationships between measurements, so all invariants can be considered instances of this model template. By the linearity of the models, the capacity needs of system components increase monotonically as the volume of user loads increases. Therefore, we use the maximal value of user loads to estimate the capacity needs of internal components. Here we use x̄ to denote the maximal value of I_10. In (1), if we set the input x(t) = x̄ at all time steps, we expect the output y(t) to converge to a constant value y(t) = ȳ, where ȳ can be derived from the following equations:

  ȳ + a_1·ȳ + ... + a_n·ȳ = b_0·x̄ + ... + b_{m−1}·x̄ + b,

  ȳ = (Σ_{i=0}^{m−1} b_i·x̄ + b) / (1 + Σ_{j=1}^{n} a_j).   (13)

For convenience, we use f(θ_ij) to represent the propagation function from I_i to I_j, i.e., f(θ_ij) = (Σ_{k=0}^{m−1} b_k·Ī_i + b) / (1 + Σ_{k=1}^{n} a_k), where all coefficient parameters are taken from the vector θ_ij, as shown in (2). Therefore, given a value of the input measurement, we can use (13) to estimate the value of the output measurement. For example, given I_10 = x̄, we can use invariants to derive the values of I_3, I_5 and I_7 respectively. Since these measurements are the inputs of other invariants, in the same way we can further propagate their values to other nodes in the network, such as the nodes I_4 and I_6. As shown in Fig. 5, some nodes such as I_4 and I_7 can be reached from the starting node I_10 via multiple paths. Between the same two nodes, multiple paths may include different numbers of edges, and each invariant (edge) may also differ in how well it models the relationship between its two nodes. Therefore, the capacity needs of a node will be estimated via different paths with different accuracy. For each node, the question is how to locate the best path for propagating the volume of user loads from the starting node. First, we choose the shortest path (i.e., with the minimal number of hops) to propagate this value. As discussed in Sect. 3, each invariant includes some modeling error ε when it characterizes the relationship between two measurements. These modeling errors can accumulate along a path, and a longer path usually results in a larger estimation error. In Sect. 4, we

introduce a confidence score p_k(θ) to measure the robustness of invariants. By the definition of the confidence score, an invariant with a higher fitness score leads to better accuracy in capacity estimation. For simplicity, here we use p_ij to represent the latest p_k(θ) between the measurements I_i and I_j. If there is no invariant between I_i and I_j, we set p_ij = 0. Given a specific path s, we can derive an accumulated score q_s = Π p_ij (the product taken over the edges of s) to evaluate the accuracy of the whole path. Therefore, among multiple paths with the same number of edges, we choose the path with the highest score q_s to estimate capacity needs. In Fig. 5, we also observe that some nodes are not reachable from the starting node. However, these measurements may still have linear relationships with a set of other nodes, because they may respond to user loads in a similar but nonlinear or stochastic way. Now the question is how to estimate the capacity needs of these unreachable nodes. Note that it is extremely difficult to model and estimate the capacity needs of all components at fine granularity. As discussed earlier, many analytical models such as queuing models can hardly scale to model thousands of resource metrics in a distributed system, especially if these metrics have complicated dependencies. In this paper, we extract invariant networks to automatically profile as many relationships as possible and then manually build complicated models only for the remaining part. In performance modeling, many models have been developed to characterize individual components. For example, Menascé et al. utilized several known laws to quantify performance models of various components, including the utilization law, the service demand law and the forced flow law [21]. Following these laws and classic theory, we can manually build nonlinear or stochastic models to link those unreachable nodes.
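For the reachable nodes, by contrast, the per-edge steady-state propagation (13) and the accumulated path score q_s can be sketched directly; the sketch below assumes the coefficient convention of (1), i.e. y(t) + Σ a_k·y(t−k) = Σ b_k·x(t−k) + b:

```python
def steady_state(a, b, bias, x_bar):
    """Stable output ȳ of the ARX model (1) under a constant input x̄,
    following (13): ȳ = (Σ b_k · x̄ + b) / (1 + Σ a_k)."""
    return (sum(b) * x_bar + bias) / (1.0 + sum(a))

def path_score(p, path):
    """Accumulated confidence q_s: the product of the edge scores p_ij
    along a path given as a list of node indices."""
    q = 1.0
    for i, j in zip(path, path[1:]):
        q *= p[(i, j)]
    return q
```

For example, `steady_state([0.5], [2.0, 1.0], 0.0, 10.0)` yields (2.0 + 1.0)·10 / 1.5 = 20, and `path_score({(10, 3): 0.9, (3, 4): 0.8}, [10, 3, 4])` yields 0.72.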
In some cases, we can also use bound analysis to derive rough relationships between measurements. Here we give some examples of how to build models to connect those unreachable nodes. Following the utilization law [21], the disk I/O utilization U can be expressed as U = λ·S, where λ is the intensity of disk I/Os (e.g. the number of disk I/Os per second) and S is the average service time. S includes the disk controller time as well as the time taken to seek a data block on the disk. Under various workloads (random vs. sequential, low vs. heavy), S can be quite different. Therefore, while λ (as a flow intensity node) can be reachable from the starting node, there is no linear relationship between U and λ. However, much of the literature on disk performance [23] concludes that a system's performance becomes sluggish if S ≥ 0.02 s. Therefore, we can use bound analysis to estimate the capacity needs of disk I/Os with U > 0.02·λ. We can thus propagate the value of λ to the unreachable node U by manually modeling their relationship with domain knowledge. Recently, researchers at Google.com [9] investigated the power provisioning problem for a warehouse-sized system with 15 thousand servers. At the server level, they discovered the following nonlinear relationship between power consumption P and CPU utilization U:

  P = P_idle + (P_peak − P_idle)·(2U − U^1.4),   (14)

where P_idle and P_peak are the power consumptions of an idle server and a server at peak load, respectively. Since the various servers behind an online service run different service logic and functions (e.g. a database function or a web server function), their CPU utilizations can be very different even under the same volume of incoming workloads. Given a new volume of workload x̄, we can follow the invariant network to estimate the CPU utilization U_i of each server.
According to (14), we can then estimate the power consumption P_i of each server and sum up the P_i to estimate the total power supply needed under the new volume of workloads. According to their work, other power consumption from networking and cooling systems etc. is proportional to the power consumption of all servers. Therefore, we can also use invariant networks to support the capacity planning of power supplies in a data center. Since invariant networks have automatically modeled the relationships among many resource-consumption-related metrics, it becomes easier to manually model the remaining part of a distributed system. Therefore, our approach and other performance modeling methods are essentially complementary to each other. By introducing other models, as shown in the above examples, we can continue to propagate the volume of user loads to those isolated nodes. For example, in Fig. 5, if we can manually bridge any two nodes from the two disconnected subnetworks, we will be able to propagate the volume of user loads several hops further. Even in this case, our invariant network is still very useful because it can guide us on where to manually bridge two disconnected subnetworks. In fact, it is usually much easier to build models among measurements from the same component, because their dependencies are much more straightforward in that local context. As shown in the above example, it is much easier to build a model between U and λ of the same disk than a model directly between U (of the backend database server) and the volume of HTTP requests x (of the front web server). This is because we have domain knowledge about disks to support the modeling. Essentially, it is the invariant network that enables us to propagate the value of x̄ into the various internal components of large distributed systems.
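The two manual bridges above might be coded as follows; `disk_util_bound` applies the utilization law with the 0.02 s service-time rule of thumb, and `server_power` is (14). The function and parameter names are ours, and the direction of the disk bound is one reading of the bound analysis in the text:

```python
def disk_util_bound(lam, s_max=0.02):
    """Utilization law U = λ·S with the 0.02 s rule of thumb [23]:
    a rough upper bound on disk utilization for a given I/O
    intensity λ (I/Os per second)."""
    return lam * s_max

def server_power(u, p_idle, p_peak):
    """Server power model (14): P = P_idle + (P_peak − P_idle)(2U − U^1.4),
    with CPU utilization u in [0, 1]."""
    return p_idle + (p_peak - p_idle) * (2.0 * u - u ** 1.4)

def total_power(utils, p_idle, p_peak):
    """Sum per-server power; networking/cooling overheads are reported
    to be proportional to this total [9]."""
    return sum(server_power(u, p_idle, p_peak) for u in utils)
```

Note that (14) interpolates correctly at the endpoints: U = 0 gives P_idle and U = 1 gives P_peak, since 2·1 − 1^1.4 = 1.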
Therefore, rather than building models for measurements across many segments of distributed systems, we can manually build local models within the same segment to link disconnected subnetworks. Conversely, if we develop a model of each resource metric as a direct function of the exogenous workload, we will not be able to observe the

relationships between disconnected subnetworks and will have to manually model each disconnected metric once. In addition, it is very difficult to model such a relationship across multiple system segments because the dependencies could be very complicated and difficult to understand. In this paper, we simply treat such complicated models as another class of invariants, constructed from domain knowledge, and do not distinguish them in our analysis and algorithms. We now summarize the above discussion and propose the algorithm shown in Fig. 6 for estimating capacity needs.

Algorithm 5.1
Input: M, P and x̄
Output: Ī_i (1 ≤ i ≤ N) and R

at step k = 0, set V_0 = S_0 = {I_1}, q_1 = 1, Ī_1 = x̄ and all other Ī_i = 0.
do
  k = k + 1; set S_k = φ;
  for each I_i ∈ S_{k−1} and I_j ∈ U \ V_{k−1}, if p_ij ≠ 0, then S_k = S_k ∪ {I_j};
  for each I_j ∈ S_k:
    I_l = arg max_{I_i ∈ S_{k−1}} (q_i · p_ij);
    compute Ī_j = f(θ_lj); q_j = p_lj · q_l;
  V_k = S_k ∪ V_{k−1};
while S_k ≠ φ
R = V_k; output R and all Ī_i.

Fig. 6 Capacity needs estimation algorithm

For convenience, we define the following variables in our algorithms:

I_i: the individual measurements, 1 ≤ i ≤ N.
U: the set of all measurements, i.e., U = {I_i}.
M: the set of all invariants, i.e., M = {θ_ij}, where θ_ij is the invariant model between the measurements I_i and I_j.
p_ij: the confidence score of the model θ_ij. Note that we set p_ij = 0 if there is no invariant (edge) between the measurements I_i and I_j.
P: the set of all confidence scores, i.e., P = {p_ij}.
x̄: the predicted maximal volume of user loads.
I_1: the starting node in the invariant network, i.e., Ī_1 = x̄.
S_k: the set of nodes that are first reachable at the k-th hop from I_1 and not at earlier hops.
V_k: the set of all nodes that have been visited up to the k-th hop.
R: the set of all nodes that are reachable from I_1.
φ: the empty set.
f(θ_ij): the propagation function from I_i to I_j. For linear invariants, f(θ_ij) = (Σ_{k=0}^{m−1} b_k·Ī_i + b) / (1 + Σ_{k=1}^{n} a_k). For nonlinear or stochastic models, it may take a variety of forms.
q_s: the maximal accumulated confidence score over paths from the starting node I_1 to I_s.

As discussed in Sect. 4, Algorithm 4.1 automatically extracts robust invariants after long sequential testing phases. In this section, Algorithm 5.1 follows the extracted invariant network, specified by M and P, to estimate capacity needs. Since we always choose the shortest path to propagate from the starting node to other nodes, at each step Algorithm 5.1 only searches the unvisited nodes for further propagation; all nodes visited before this step already have their shortest paths to the starting node. Meanwhile, Algorithm 5.1 only uses the newly visited nodes at each step to search for the next hop, because only these newly visited nodes may link to unvisited nodes. For nodes with multiple same-length paths to the starting node, we choose the best path, i.e., the one with the highest accumulated confidence score, for estimating the capacity needs. Essentially, Algorithm 5.1 is an efficient graph algorithm based on dynamic programming [7]. We incrementally estimate the capacity needs of the newly visited nodes and compute their accumulated confidence scores at each step, until no more nodes are reachable from the starting node. Our algorithm can easily be extended to support models with multiple inputs. Before estimating the capacity of a new node, we just need to check whether all of its input nodes have already been visited. If not, this node is considered unvisited until all of its input nodes are ready.

6 Resource optimization

In the above section, Algorithm 5.1 sequentially estimates the resource usage of components under a given volume of user loads. Assume that we have collected information about the current resource configurations from when the system was deployed or upgraded.
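Before comparing these estimates against configured capacities, here is a hypothetical sketch of Algorithm 5.1; for simplicity each propagation function f(θ_lj) is represented as a plain callable (a linear gain in the usage below) rather than the full ARX map (13):

```python
def estimate_capacities(nodes, start, x_bar, p, f):
    """Sketch of Algorithm 5.1.  p[(i, j)] is the confidence score
    (absent or 0 means no invariant); f[(i, j)] is the propagation
    function from I_i to I_j.  Returns the estimates Ī and the
    reachable set R."""
    est = {start: x_bar}
    q = {start: 1.0}          # best accumulated score q_s so far
    visited = {start}         # V_k
    frontier = {start}        # S_k: nodes first reached at the last hop
    while frontier:
        nxt = set()
        for i in frontier:
            for j in nodes - visited:
                if p.get((i, j), 0.0) != 0.0:
                    nxt.add(j)
        for j in nxt:
            # among the last hop's nodes, pick the predecessor
            # maximizing q_i · p_ij (best same-length path)
            l = max(frontier, key=lambda i: q[i] * p.get((i, j), 0.0))
            est[j] = f[(l, j)](est[l])
            q[j] = q[l] * p[(l, j)]
        visited |= nxt
        frontier = nxt
    return est, visited
```

For example, with edges 1→2 (gain 2, score 0.9), 1→3 (gain 3, score 0.8), 2→4 (gain 2, score 0.9) and 3→4 (gain 1, score 0.95), node 4 is estimated via node 2, since 0.9·0.9 = 0.81 beats 0.8·0.95 = 0.76.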
In practice, such configuration information is usually maintained and updated in a Configuration Management DataBase (CMDB). For each measurement I_i, we denote the capacity of its related resource configuration by C_i. For example, if I_i is the number of megabytes of real memory in use, C_i is the total memory size of the server. Similarly, if I_i is the number of concurrent SQL connections, the maximum C_i for a MySQL database is 4096. In performance analysis, there are also benchmarks that can provide specifications for C_i. Note that this configuration information includes hardware specifications such as memory size as well as software configurations such as the maximal number of database connections. In practice, software configurations such as the maximum number of file descriptors

and the maximum number of concurrent threads can also affect system performance. In this paper, we consider these as system resources as well.

Fig. 7 Capacity analysis and resource optimization

Fig. 8 System response with overshoot

Given a volume of user loads x̄, we use Algorithm 5.1 to estimate the values of Ī_i. By comparing Ī_i against C_i, we obtain the information needed to locate potential performance bottlenecks and balance resource assignments. Let us denote O_i = (C_i − Ī_i)/C_i, where O_i represents the percentage of resource shortage or available margin. Given an estimated volume of user loads, all components with a negative O_i are short of capacity under the new workload, and we should assign more resources to these components to remove performance bottlenecks. Conversely, components with a large positive O_i must have oversized capacities for this volume of user loads, and we may remove some resources from them to reduce IT cost. Note that we always need to keep the right capacity margin for each component. Therefore, as shown in Fig. 7, we can use such analytical results as a guideline to adjust each component's capacity and build a resource-balanced distributed system. The values of O_i can also be sorted and ranked to prioritize resource assignment and optimization. Note that we propagate the maximal volume of user loads x̄ through the invariant network to estimate capacity needs. All Ī_i resulting from Algorithm 5.1 represent the capacity needs of internal components that can serve this maximal volume of user loads. Given a step input x(t) = x̄, we derive its stable output y(t) = ȳ using (13). However, we have not yet considered the transient response of y(t) before it converges to the stable value ȳ. As shown in Fig. 8, y(t) may theoretically respond with overshoot, and its transient value may be larger than the stable value ȳ.
The overshoot arises because a system component does not respond quickly enough to a sudden change in user loads. For example, after a sudden increase of user loads in a three-tier web system, the application server may take some time to initialize more EJB instances and create more database connections so as to handle the increased workload. During this overshoot period, we may observe longer latencies for user requests. However, computing systems usually respond to the dynamics of user loads quickly. Therefore, even if an overshoot exists, it lasts only a very short time. In fact, in our experiments we did not observe any overshoot responses at all, though theoretically they may exist. If we want a system to have enough capacity to handle such overshoots, we can calculate the overshoot value and propagate it as the maximal value (rather than the stable ȳ) in capacity planning. In practice, the capacities of many systems are only expected to support high QoS for 95% of their operational time. Since it may take a huge amount of extra resources to guarantee QoS in rare events, some service providers are willing to compromise QoS for short periods of time so as to reduce IT costs. However, for applications with stringent QoS requirements, such as real-time video/voice communication, we may have to consider overshoot situations. For low-order ARX models with n, m ≤ 2, the literature on classic control theory [18] explains how to calculate the overshoot analytically. Basically, we can apply the Z-transform [4] to (1) to derive the transient response of y(t) analytically, and then calculate the overshoot value by locating the maximal point of y(t). For high-order ARX models, given an input x(t) = x̄, we have to use (1) iteratively to simulate the transient response of y(t) and then estimate the overshoot value.
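For the high-order case, the iterative simulation might be sketched as follows (coefficients use the convention of (1), y(t) + Σ a_k·y(t−k) = Σ b_k·x(t−k) + b, and the input is assumed to be held at x̄ from t = 0):

```python
def step_response_peak(a, b, bias, x_bar, steps=300):
    """Iterate the ARX model (1) under a step input x(t) = x̄ and return
    (peak, final) of y(t); peak > final indicates an overshoot."""
    y_hist = [0.0] * len(a)       # y(t−1), ..., y(t−n)
    peak = final = 0.0
    for _ in range(steps):
        # y(t) = −Σ a_k·y(t−k) + Σ b_k·x̄ + b
        # (x is constant, so the input terms collapse to sum(b)·x̄)
        yt = -sum(ak * yk for ak, yk in zip(a, y_hist)) + sum(b) * x_bar + bias
        peak = max(peak, yt)
        final = yt
        y_hist = [yt] + y_hist[:-1]
    return peak, final
```

For instance, a = [−1.2, 0.5], b = [0.3] is an underdamped (complex-pole) model whose step response peaks above its steady value of x̄ before settling, while a = [−0.5], b = [0.5] settles monotonically with no overshoot.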
At each step of Algorithm 5.1, rather than using the function f(θ_lj) to estimate the stable Ī_j, we can use the simulation results from (1) to estimate the transient I_j and further propagate its maximal value to estimate the capacity needs of other nodes. All other parts of Algorithm 5.1 remain the same. Therefore, with ARX models, we are also able to analyze the transient behavior of various components. In our earlier discussion, the volume of user loads was chosen as the starting point from which to propagate the capacity estimation. Therefore, following the invariant network, we can only estimate the capacity needs of those resource metrics that are reachable from this starting point. However, if we just want to check whether various components have matched resource assignments, any node in the invariant network can be chosen as a starting point to estimate the capacity needs of the others and to determine whether they have consistent sizes. For example, we may follow a single invariant to check whether components A and B have matched resource assignments. In this case, we do not need a fully connected invariant network, and even a disconnected subnetwork enables us to evaluate whether all of its nodes

have balanced resource assignments. If we do not extract an invariant network but instead directly correlate resource metrics with exogenous workloads, we cannot support such resource optimization procedures.

7 Experiments

Our capacity planning approach is evaluated with a three-tier web system as well as a commercial mobile internet system. In both cases, we extract invariants from resource-consumption-related measurements and then follow the invariant network to estimate the capacity needs under heavy workloads. We then measure the real resource usage of components under such heavy workloads and compare it with the estimated values to verify our prediction accuracy.

Fig. 10 Low and high volume of user loads in experiments

7.1 Three-tier Web systems

In this section, our capacity planning experiments were performed on a typical three-tier web system consisting of an Apache web server, a JBoss application server [12] and a MySQL database server. Figure 9 illustrates the architecture of our experimental system and its components.

Fig. 9 The experimental system

The application running on this system is Pet store [27], which was written by Sun Microsystems. As with other web services, users can visit the Pet store website to buy a variety of pets. We developed a client emulator to generate a large class of user scenarios and workload patterns. User actions such as browsing items, searching for items, account login, adding an item to a shopping cart, payment and checkout are all included in our workloads. A certain randomness of user behavior is also built into the emulated workloads. For example, a user action is randomly selected from all possible user scenarios that could follow the previous user action. The time interval between two user actions is also randomly selected from a reasonable time range.
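The emulator logic just described might be sketched as follows; the action graph and think-time range here are simplified, hypothetical stand-ins for the real Pet store scenarios:

```python
import random

# A simplified, hypothetical action graph: each key lists the user
# actions that may follow it (the real Pet store scenarios are richer).
NEXT_ACTIONS = {
    "browse": ["browse", "search", "login"],
    "search": ["browse", "add_to_cart"],
    "login": ["browse", "add_to_cart"],
    "add_to_cart": ["browse", "checkout"],
    "checkout": [],                      # session ends
}

def random_session(rng, max_len=50):
    """One emulated user session: a random walk over NEXT_ACTIONS with
    a random think time (0.5-3 s here) between consecutive actions."""
    action, trace = "browse", []
    while len(trace) < max_len:
        trace.append((action, rng.uniform(0.5, 3.0)))
        following = NEXT_ACTIONS[action]
        if not following:
            break
        action = rng.choice(following)
    return trace
```

Driving many such sessions concurrently, with randomized arrival times, yields workloads whose intensity and mix vary from run to run, as described above.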
Note that workloads are dynamically generated with much randomness and variance, so we never get a similar workload twice in our experiments. In our experiments, we first run a low volume of user loads to collect measurements from the various components and then use these measurements to extract invariants. As discussed in Sect. 5, given a predicted high volume of user loads, we can use Algorithm 5.1 to estimate the capacity needs of the various components. Meanwhile, we also run this predicted volume of user loads to collect measurements, and these real measurements are then compared with the estimated values to verify the capacity estimation accuracy of our algorithm. Figure 10 shows examples of both the low volume and the high volume of user loads used in our experiments, which have very different intensity and dynamics. We have repeated such experiments many times with various workloads to verify our results. Note that we do not have to repeat the same workload in our evaluations.

Fig. 11 Categories of measurements

Measurements were collected from the three servers used in our testbed system. Figure 11 shows the categories of our resource-consumption-related measurements. In total we have eight categories, and each category includes a different number of measurements. In our previous work [13], we collected as many as 111 measurements and extracted 975 invariants from our testbed system. Figure 4 illustrates its invariant network, which is too meshed for us to observe its detailed connectivity. For capacity planning, we therefore chose only the resource-consumption-related metrics and extracted a small invariant network so as to analyze its connectivity in detail. Meanwhile, in a typical three-tier web system, the application server runs the bulk of the business logic and usually demands much more capacity than the web server and the database server. Therefore, we collected many of our measurements from the application server for our capacity planning experiments. Note that the web server has two network interfaces, eth0 and eth1, which communicate with the clients and the application server respectively. The monitoring data was used to calculate various flow intensities with a sampling unit of 6 seconds.

Fig. 12 Measurements and invariants

We have in total 20 resource-consumption-related measurements from the three servers, and all of these measurements were used in the following experiments, i.e., n = 20. In our experiments, we collected one hour of data to construct models and then continued to test these models every half hour, i.e., the window size is half an hour and includes 300 data points. Studies have shown that users are often dissatisfied if the latency of their web requests is longer than 10 seconds. Therefore the order of the ARX model (i.e., [n, m] shown in (1)) should have a very narrow range. In our experiments, since the sampling time was 6 seconds, we set 0 ≤ n, m ≤ 2. By combining every two of the 20 measurements, Algorithm 4.1 built 190 models (i.e., n(n − 1)/2 = 190) as invariant candidates. For each model, a fitness score was calculated according to (8). In our experiments, we selected the fitness score threshold F̄ = 0.3. A model is considered an invariant only if its fitness score is higher than 0.3 in every testing phase. After several phases of sequential testing with various workloads, we eventually extracted 68 invariants, distributed as shown in Fig. 12. In the following equations, we list several examples of such extracted invariants.
If we use I ejb, I web and I sql to represent the flow intensities of number of EJB created, number of HTTP requests and number of SQL queries measured from the test bed system, we extract the following invariants with I web as the input: I ejb (t) = 0.44I ejb (t 1) + 0.18I ejb (t 2) + 1.01I web (t) + 1.40I web (t 1), (15) I sql (t) = 0.16I sql (t 1) + 0.37I sql (t 2) + 3.01I web (t) 0.84I web (t 1). (16) In Fig. 12, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. Therefore, this invariant network totally includes 20 nodes and 68 edges. All measurements with initial tag A_, W_ or D_ are collected from the application server, the web server or the database server respectively. From this figure, we notice that the 20 measurements can be split into many clusters. The biggest cluster in this figure includes 13 measurements. These measurements respond to the volume of user loads directly so that they have many linear relationships among each other. Meantime, we also observe several smaller clusters, which characterize some local relationships between measurements. As mentioned earlier, some measurements may respond to the volume of user loads in same but nonlinear or stochastic ways. For example, Disk Write Merge has an invariant linear relationship with Disk Write Sectors though they both do not have linear relationships with the volume of user loads. The measurement W_CPU Utilization is very noisy and it does not have robust relationships with any other measurements. Therefore it is an isolated node in the invariant network. Later we discovered that the CPU usage of the web server in our testbed was extremely low (only close to 1%) so that the value of W_CPU Utilization was too small and easily affected by other system processes, i.e., we barely observe any intensities from W_CPU Utilization because the

web server only serves static HTML files but runs on a powerful machine.

Fig. 13 Estimated and real values of measurements

Fig. 14 The number of created EJBs

We now choose the volume of HTTP requests, W_Apache, as the starting node in our capacity analysis. In our experiments, the predicted volume of HTTP requests (high user loads) is shown in Fig. 10 and its maximal value is 840 requests/second, i.e., x̄ = 840. Following the invariant network shown in Fig. 12, we use Algorithm 5.1 to estimate the other measurements. For example, within one hop from W_Apache, we can estimate the values of the following 10 measurements: W_eth1 Packets, W_eth0 Packets, W_CPU Soft IRQ Time, A_JBoss EJB Created, A_eth0 Packets, A_JVM Processing Time, D_MySQL, D_CPU Soft IRQ Time, D_CPU Utilization and D_eth0 Packets. With another hop, we can also estimate the values of A_CPU Utilization and A_CPU Soft IRQ Time. The values resulting from Algorithm 5.1 are then compared with the real maximal values collected from our experiments, and their differences are shown in Fig. 13. Here we use e and r to denote the estimated values and the real monitoring values respectively. From this figure, we observe that our approach achieves very good accuracy; on average it results in only a 5% error in estimating capacity needs. We have repeated our experiments many times with different user loads and observed similar estimation accuracy in every experiment. In Fig. 13, each measurement has its own specific unit. For example, the value of A_JVM Processing Time refers to the CPU time (in nanoseconds) used by the JVM within each 6-second interval, and the value of A_CPU Utilization refers to the percentage of user CPU utilization. As illustrated in (13), given Ī_web = 840, we can use the coefficient parameters from (15) to derive that Ī_ejb = (1.01 + 1.40)/(1 + 0.44 − 0.18)·Ī_web ≈ 1607.
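As a quick numerical check of this derivation (the regression-form coefficients are read off (15); in particular, the steady-state denominator 1 + 0.44 − 0.18 corresponds to autoregressive terms −0.44 and +0.18):

```python
def arx_steady_gain(c, b):
    """Steady-state gain of an ARX model written in regression form
    y(t) = c_1·y(t−1) + ... + c_n·y(t−n) + b_0·x(t) + ...:
    under a constant input, ȳ/x̄ = Σ b_k / (1 − Σ c_k)."""
    return sum(b) / (1.0 - sum(c))

# Coefficients read off (15): autoregressive −0.44, +0.18;
# input terms 1.01, 1.40.
i_web = 840.0
i_ejb = arx_steady_gain([-0.44, 0.18], [1.01, 1.40]) * i_web
print(round(i_ejb))  # → 1607
```

The gain (1.01 + 1.40)/1.26 ≈ 1.913 applied to 840 requests/second reproduces the estimate of roughly 1607 EJBs created.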
In the same way, we can use the coefficient parameters from (16) to derive that Ī_sql = (3.01 − 0.84)/(1 − 0.16 − 0.37)·Ī_web ≈ 3878.

Fig. 15 The number of SQL queries

As examples, Figs. 14 and 15 illustrate the real number of EJBs created in the application server and the real number of SQL queries observed at the database server, respectively. Both measurements result from the user loads shown in Fig. 10, so both have curves similar to that of the user loads. In these two figures, the noisy curves represent the real values while the dashed lines represent the estimated maximal values. The estimated values are very close to the real maximal values monitored from the system, though some rare peaks may be larger. As discussed earlier, invariants characterize long-run relationships between measurements and do not capture rare system uncertainties and noise. However, since those peaks are rare and we always add the right margin to the estimated capacity needs, our estimated values should be accurate enough for capacity planning [21]. In Fig. 12, we also observe some nodes that are unreachable from the starting node W_Apache. As discussed in Sect. 5, with domain knowledge, we have to manually

build some models between these isolated measurements and those listed in Fig. 13 so as to propagate the volume of user loads. In future work, we will build a library of nonlinear models of common components, such as disks, to complement our invariant network in capacity planning. Based on their scenarios, operators can then pull nonlinear models from the library to link disconnected invariant networks.

7.2 Mobile Internet systems

In this subsection, our approach was also evaluated on a commercial mobile internet system with multi-class workloads. The system provides mobile users with services such as web access, e-mail, news delivery and location information, giving subscribers with mobile terminals direct access to the Internet. It consists of dozens of high-end servers, including portal web servers, mail servers, picture mail servers and account authentication servers. Web portal access and mail access are the two major classes of user traffic. Due to business confidentiality, we cannot illustrate its specific architecture here and only report field testing results. Most of these servers run Unix operating systems, and many measurements are collected using the Unix sar command. Network measurements are collected by SNMP monitoring agents. CPU, memory and network usage are the three types of resource metrics used in our evaluations. We collected these measurements from every server once per 10 seconds for a week. During this period, we observed that the ratio between mail access traffic and web access traffic varied roughly from 0.11 to 0.29. Following the same evaluation approach as in the previous section, we extracted invariants from a two-hour period of low workloads and then used the invariant network to estimate the capacity needs of components under 5 periods of heavy workloads during the week.
The real resource usages observed from the log files were then compared with these estimated values to verify estimation accuracy. On average, our approach results in a 6.8% error in estimating capacity needs, and the largest estimation error is 9.5%. According to operator feedback, such accuracy is quite satisfactory for capacity planning tasks.

8 Discussions

Large-scale distributed systems consist of thousands of components. Each component could potentially become the performance bottleneck and deteriorate system QoS. As discussed earlier, it is extremely difficult to model and analyze the capacity needs of each individual component at fine granularity in a large system. Many classic approaches cannot scale to profile the behaviors of a large number of components precisely. For example, it is not clear how to model the relationships among hundreds of metrics with queuing models, and it is also very time-consuming to build such models manually. Meanwhile, if we only model the system behavior at coarse granularity, some system metrics will not be considered in the model, so we may not be able to predict real performance bottlenecks and optimize resource assignments. In this paper, we extract invariant networks to profile the relationships among resource-consumption-related metrics across distributed systems. Our motivation is to develop algorithms that automatically profile as many relationships as possible among system metrics. Though such relationships may be simple, the invariant networks essentially enable us to cover a large number of system metrics and scale up our capacity analysis. For the remaining complicated models, we have no choice but to build them manually with system knowledge. Therefore, our approach is complementary to many existing modeling techniques and could greatly reduce system modeling complexity.
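The coverage argument above, that an invariant network lets capacity estimates propagate from the workload metric to every connected metric, amounts to a graph reachability check; a minimal sketch with hypothetical metric names:

```python
from collections import deque

def reachable_metrics(invariants, start):
    """Return all metrics whose capacity needs can be estimated by
    propagating the workload volume from `start` through the
    invariant network (plain breadth-first search)."""
    adj = {}
    for a, b in invariants:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical network: the disk metrics form a disconnected subnetwork.
net = [("W_Apache", "I_ejb"), ("I_ejb", "I_sql"), ("D_read", "D_write")]
print(sorted(reachable_metrics(net, "W_Apache")))  # → ['I_ejb', 'I_sql', 'W_Apache']
```

Metrics left outside the returned set are exactly the isolated nodes for which manual models must bridge the subnetworks.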
To the best of our knowledge, this paper proposes the first solution to analyze system capacity and optimize resource assignments across large systems at such a fine granularity. In fact, most existing work models only hardware resources such as CPU or memory in capacity analysis and does not consider system metrics like the number of concurrent database connections, which could also affect system performance. Our approach has several limitations as well. As discussed earlier, there exist some disconnected subnetworks that are not reachable from the starting point, i.e., the volume of workloads. In order to estimate the capacity needs of these isolated nodes, we have to manually build models with domain knowledge to bridge disconnected subnetworks. However, since specific system knowledge (e.g., about disks) is needed here, we do not expect that any other methods will be able to automatically profile those complicated relationships. Compared to other modeling techniques, our invariant networks automatically extract the maximal coverage of system metrics whose capacity needs can be estimated from the volume of workloads. This enables us to map the volume of exogenous workloads into the intensities of internal system metrics and decide where to build models in their local context, which greatly reduced modeling complexity when we applied our approach to several commercial systems. Meanwhile, even with a disconnected subnetwork, we can choose one node as a starting point to verify whether all system metrics inside this subnetwork have balanced resource configurations. This has also proved useful in operational environments. Another limitation is that our approach cannot automatically model the relationship between response time and system configurations. For example, operators often raise the following questions in capacity planning: how much can response time be improved if a specific part of resource

is upgraded? How many users can the system support if its response time is allowed to increase by 10%? For a distributed system of such scale and complexity, so far we are not aware of any solutions that can address such problems. We profile systems and extract invariant networks in a good state, when clients are satisfied with system QoS. It seems difficult to reason about other scenarios beyond this good state: we do not know which parts of the system models are still valid and which parts are not under those new scenarios. Recently many researchers have employed a closed queuing network to model multi-tier Internet applications and then used mean value analysis to calculate system response time [31]. However, performance bottlenecks may shift under the new scenarios, so the original model of response time might become invalid. In fact, the original performance model may not even include the metric of the new bottleneck. For example, if a performance model profiles the relationship between CPU resources and system response time, it may not be useful if memory suddenly becomes the bottleneck.

9 Related work

Queuing models have been widely used to analyze system performance. Menasce et al. [21] wrote a book on how to use queuing networks for performance modeling of various components such as a database server or a disk. However, as discussed earlier, queuing models are often used to characterize individual components with many assumptions, such as stationary workloads. It is not clear how to profile large-scale complex systems with queuing models. Recently Urgaonkar et al. [31, 32] employed a closed queuing network to model multi-tier Internet applications and used Mean Value Analysis (MVA) to calculate the response time of multi-tier distributed systems. They only considered CPU resources in their queuing models, but not other resources like memory, disk I/O and network.
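As a concrete illustration of the MVA technique cited above (a textbook single-class sketch, not a reproduction of Urgaonkar et al.'s multi-tier model), exact MVA computes response time and throughput from per-station service demands by iterating over the customer population:

```python
def mva(demands, n_users, think_time=0.0):
    """Exact single-class Mean Value Analysis for a closed queuing
    network. demands[k] is the service demand (visit count * mean
    service time) at station k. Returns (response_time, throughput)
    at population n_users."""
    q = [0.0] * len(demands)                  # mean queue length per station
    resp = thru = 0.0
    for n in range(1, n_users + 1):
        r_k = [d * (1.0 + qk) for d, qk in zip(demands, q)]
        resp = sum(r_k)                       # system response time
        thru = n / (resp + think_time)        # system throughput
        q = [thru * rk for rk in r_k]         # Little's law per station
    return resp, thru

# Single queue with unit demand: response time grows linearly with users.
print(mva([1.0], 3))   # → (3.0, 1.0)
```

Such a model answers "what if" response-time questions only for the resources its stations represent, which is exactly the limitation discussed above when bottlenecks shift to unmodeled resources.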
Besides multi-tier Internet applications, distributed systems have many other architectures, and the closed queuing network may not model them well. Stewart and Shen [30] profiled applications to predict the throughput and response time of multi-component online services. Stewart et al. [29] used a transaction mix vector to characterize workloads and exploited nonstationarity for performance prediction in distributed systems. Rather than using queuing models, in this paper we automatically extract a network of invariants from operational systems and use this network as a model for performance analysis. Therefore, our approach provides a systematic way to profile complex services for capacity planning, which is complementary to those classic modeling techniques. Many companies have developed their own practical approaches to capacity planning. For example, Microsoft [22] published scalability recommendations for their portal server deployment. IBM Global Services [11] developed practical capacity planning processes for web application deployment. Oracle [25] provides capacity planning tools for database sizing. Most of these approaches are developed from practical experience with some specific components, and they may not scale and generalize well for capacity planning tasks in large-scale distributed systems. There exists much work on characterizing web traffic for web server capacity planning. Kant and Won [16] analyzed the patterns of web traffic and proposed a methodology to determine bandwidth requirements for the hardware components of a web server. Barford and Crovella [3] developed a tool to generate representative web workloads for web server management and capacity planning. Our approach does not characterize workloads but instead extracts invariants to characterize the relationships between the volume of workloads and the capacity needs of individual components.
Machine learning methods have also been applied to identify performance bottlenecks in distributed systems. Cohen et al. [6] used a tree-augmented naive (TAN) Bayesian network to learn the probabilistic relationship between SLA violations and resource usages. They further used this learned Bayesian network to identify performance bottlenecks during SLA violations. In their Elba project, Parekh et al. [26] compared the TAN Bayesian network with other classifiers, such as decision trees, in performance bottleneck detection. Both works employed supervised learning mechanisms and required a large number of SLA violation samples for classifier training. For capacity planning tasks, this is not practical because we want to avoid such SLA violations in real business. Our approach is able to predict and remove performance bottlenecks ahead of real SLA violations. In the autonomic computing community [17], there is much work on how to dynamically allocate resources for service provisioning. Hellerstein et al. [10] wrote a book on how to apply feedback control theory to resource management in computing systems. Kephart et al. [8, 33] defined utility functions based on service-level attributes and proposed a utility-based approach for efficient allocation of server resources. Kusic and Kandasamy [19] used multiple queuing models to characterize the performance of server clusters and then applied control theory for optimal resource allocation. Given a large cluster of servers shared by multiple applications, autonomic service provisioning addresses how to dynamically allocate and share resources among these applications so as to achieve some optimal goals, such as maximizing the profits of services. Though this topic is related, our work focuses on offline capacity planning and resource optimization rather than online service provisioning. In addition, dynamic service provisioning also needs the right capacity planning to support system evolution.

10 Conclusions

For large-scale distributed systems, it is critical to make the right capacity planning decisions during system evolution. Under the many dynamics and uncertainties of user loads, a system without enough capacity could significantly deteriorate performance and lead to user dissatisfaction. Conversely, an oversized system could significantly waste resources and increase IT cost. One challenge is how to match the capacities of various components inside complex systems so as to remove potential performance bottlenecks and achieve maximal system-level capacity. Mismatched capacities of system components could result in performance bottlenecks at one segment of a system while wasting resources at other segments. In this paper, we proposed a novel and systematic approach to profiling services for resource optimization and capacity planning. We collect resource-consumption-related measurements from distributed systems and developed an approach to automatically search for invariants among measurements. After extracting a network of invariants, given any volume of user loads, we can sequentially estimate the capacity needs of individual components. By comparing the current resource assignments against the estimated capacity needs, we can discover the weakest points that may deteriorate system performance. Operators can consult such analytical results to optimize resource assignments and remove potential performance bottlenecks.

References

1. Almeida, V., Menasce, D.: Capacity planning: An essential tool for managing web services. IEEE IT Prof. 4(4), 33–38 (2002)
2. Amazon: http://phx.corporate-ir.net/phoenix.zhtml?c=97664&p=irol-newsarticle&id=798960&highlight=
3. Barford, P., Crovella, M.: Generating representative web workloads for network and server performance evaluation. In: SIGMETRICS 98/PERFORMANCE 98: Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 151–160, 1998
4. Bracewell, R.: The Fourier Transform and Its Applications, 3rd edn. McGraw-Hill Science/Engineering/Math, New York (1999)
5. Brockwell, P., Davis, R.: Introduction to Time Series and Forecasting, 2nd edn. Springer, Berlin (2003)
6. Cohen, I., Goldszmidt, M., Kelly, T., Symons, J., Chase, J.: Correlating instrumentation data to system states: a building block for automated diagnosis and control. In: OSDI 04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, p. 16, 2004
7. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms, 1st edn. MIT Press/McGraw-Hill, Cumberland, New York (1990)
8. Das, R., Kephart, J., Whalley, I., Vytas, P.: Towards commercialization of utility-based resource allocation. In: The 3rd International Conference on Autonomic Computing (ICAC2006), pp. 287–290, Dublin, Ireland, June 2006
9. Fan, X., Weber, W.-D., Barroso, L.A.: Power provisioning for a warehouse-sized computer. In: ISCA 07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pp. 13–23, San Diego, California, USA, 2007
10. Hellerstein, J., Diao, Y., Parekh, S., Tilbury, D.M.: Feedback Control of Computing Systems. Wiley-IEEE Press, New York (2004)
11. IBM: http://www-935.ibm.com/services/us/its/pdf/g563-0339-00.pdf
12. JBoss: http://www.jboss.org
13. Jiang, G., Chen, H., Yoshihira, K.: Discovering likely invariants of distributed transaction systems for autonomic system management. In: The 3rd International Conference on Autonomic Computing (ICAC2006), pp. 199–208, Dublin, Ireland, June 2006
14. Jiang, G., Chen, H., Yoshihira, K.: Modeling and tracking of transaction flow dynamics for fault detection in complex systems. IEEE Trans. Dependable Secure Comput. 3(4), 312–326 (2006)
15. Jiang, G., Chen, H., Yoshihira, K.: Efficient and scalable algorithms for inferring likely invariants in distributed systems. IEEE Trans. Knowl. Data Eng. 19(11) (2007)
16. Kant, K., Won, Y.: Server capacity planning for web traffic workload. IEEE Trans. Knowl. Data Eng. 11(5), 731–747 (1999)
17. Kephart, J., Chess, D.: The vision of autonomic computing. Computer 36(1), 41–52 (2003)
18. Kuo, B.: Automatic Control Systems, 6th edn. Prentice-Hall, Englewood Cliffs (1991)
19. Kusic, D., Kandasamy, N.: Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems. In: The 3rd International Conference on Autonomic Computing (ICAC2006), pp. 74–83, Dublin, Ireland, June 2006
20. Ljung, L.: System Identification: Theory for the User, 2nd edn. Prentice Hall PTR, New York (1998)
21. Menasce, D., Dowdy, L., Almeida, V.: Performance by Design: Computer Capacity Planning by Example, 1st edn. Prentice Hall PTR, New York (2004)
22. Microsoft: http://office.microsoft.com/en-us/assistance/HA011647631033.aspx
23. Microsoft: http://technet.microsoft.com/en-us/library/aa997558.aspx
24. Microsoft Office 2003 system requirements: http://support.microsoft.com/kb/822129
25. Oracle: http://www.dba-oracle.com/monitoring_tablepack.htm
26. Parekh, J., Jung, G., Swint, G., Pu, C., Sahai, A.: Comparison of performance analysis approaches for bottleneck detection in multi-tier enterprise applications. In: IEEE International Workshop on Quality of Service, pp. 302–306, New Haven, CT, USA, 2006
27. Petstore: http://java.sun.com/developer/releases/petstore/
28. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
29. Stewart, C., Kelly, T., Zhang, A.: Exploiting nonstationarity for performance prediction. SIGOPS Oper. Syst. Rev. 41(3), 31–44 (2007)
30. Stewart, C., Shen, K.: Performance modeling and system management for multi-component online services. In: NSDI 05: Proceedings of the 2nd Conference on Symposium on Networked Systems Design and Implementation, pp. 71–84, Boston, Massachusetts, USA, 2005
31. Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A.: An analytical model for multi-tier internet services and its applications. SIGMETRICS Perform. Eval. Rev. 33(1), 291–302 (2005)
32. Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A.: Analytic modeling of multitier internet applications. ACM Trans. Web 1(1), 2 (2007)
33. Walsh, W., Tesauro, G., Kephart, J., Das, R.: Utility functions in autonomic systems. In: The First International Conference on Autonomic Computing (ICAC2004), pp. 70–77, New York, May 2004

Guofei Jiang is currently a Department Head with the Robust and Secure Systems Group at NEC Laboratories America in Princeton, New Jersey. He leads a dozen researchers working on many topics in the field of distributed systems and networks. He has published over 80 technical papers and also holds several patents. Dr. Jiang is an associate editor of IEEE Security and Privacy magazine and has also served on the program committees of many prestigious conferences. He holds B.S. and Ph.D. degrees in electrical and computer engineering.

Haifeng Chen received the B.Eng. and M.Eng. degrees, both in automation, from Southeast University, China, in 1994 and 1997 respectively, and the Ph.D. degree in computer engineering from Rutgers University, New Jersey, in 2004. He has worked as a researcher in the Chinese national research institute of power automation. He is currently a research staff member at NEC Laboratories America, Princeton, NJ. His research interests include data mining, autonomic computing, pattern recognition and robust statistics.

Kenji Yoshihira received the B.E. in electrical engineering from the University of Tokyo in 1996 and designed processor chips for enterprise computers at Hitachi Ltd. for five years. He served as CTO at Investoria Inc. in Japan, developing an Internet service system for financial information distribution, through 2002, and received the M.S. in computer science from New York University in 2004. He is currently a research staff member with the Robust and Secure Systems Group at NEC Laboratories America, Inc. in NJ. His current research focus is on distributed systems and autonomic computing.