Profiling services for resource optimization and capacity planning in distributed systems
Cluster Comput (2008) 11

Profiling services for resource optimization and capacity planning in distributed systems

Guofei Jiang · Haifeng Chen · Kenji Yoshihira

Received: 22 May 2008 / Accepted: 8 September 2008 / Published online: 24 September 2008
© Springer Science+Business Media, LLC 2008

Abstract  The capacity needs of online services are mainly determined by the volume of user loads. For large-scale distributed systems running such services, it is quite difficult to match the capacities of the various system components. In this paper, a novel and systematic approach is proposed to profile services for resource optimization and capacity planning. We collect resource consumption related measurements from various components across distributed systems and search for constant relationships between these measurements. If such relationships always hold under various workloads over time, we consider them invariants of the underlying system. After extracting many invariants from the system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By comparing the current resource configurations against the estimated capacity needs, we can discover the weakest points that may deteriorate system performance. Operators can consult such analytical results to optimize resource assignments and remove potential performance bottlenecks. In this paper, we propose several algorithms to support capacity analysis and guide operators' capacity planning tasks. Our algorithms are evaluated with real systems, and experimental results are included to demonstrate the effectiveness of our approach.

G. Jiang (corresponding author) · H. Chen · K. Yoshihira: NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA. G. Jiang: gfj@nec-labs.com; H. Chen: haifeng@nec-labs.com; K.
Yoshihira: kenji@nec-labs.com

Keywords  System management · Capacity planning · Resource optimization · Distributed systems · System invariants · Algorithms

1 Introduction

In the last decade, with the great success of Internet technology, many large-scale distributed systems have been built to run various online services. These systems have unprecedented capacity to process large volumes of transaction requests simultaneously. For example, Google has thousands of servers to handle millions of user queries every day, and Amazon sold 3.6 million items, or 41 items per second, on the single day of December 12, 2005 [2]. While clients may only see a single website, the distributed systems running such services are very complex and can include thousands of components such as operating systems, application software, servers, and networking and storage devices. Meanwhile, clients always expect high Quality of Service (QoS), such as low latency and high availability, from online transaction services. Studies have shown that clients are easily dissatisfied by unreliable services or even a few seconds of delay in response time. However, under the many dynamics and uncertainties of user loads and behaviors, some components inside the system may suddenly become performance bottlenecks and deteriorate system QoS. Therefore, it is very desirable to make the right capacity planning decisions for each component so as to keep the whole system in a good state. Operators usually need to consider two competing factors in capacity planning and resource optimization. On one side, sufficient hardware resources have to be deployed to meet customers' QoS expectations. On the other side, an oversized system significantly wastes hardware resources, increases IT costs and reduces profits. For
distributed systems, it is especially important to balance resources across distributed components so as to achieve maximal system-level capacity. Otherwise, mismatched component capacities can lead to performance bottlenecks at some segments of the system while wasting resources at others. Therefore, a critical challenge is how to match the capacities of a large number of components in a distributed system. In practice, operators may go through many trial-and-error procedures in their capacity planning work and assign resources based on their intuition, practical experience or rules of thumb [1]. It is therefore difficult to systematically and precisely analyze the capacity needs of individual components in a distributed system of realistic scale and complexity. For standalone software, we usually use fixed numbers to specify hardware requirements such as CPU frequency and memory size. For example, Microsoft recommends the following minimum system requirements to run Microsoft Office 2003: a Pentium III processor with a clock speed of at least 233 MHz, a minimum of 128 MB RAM and at least 400 MB of free hard-disk space [24]. However, it is difficult to give such specifications for online services because their system requirements are mainly determined by an external factor: the volume of user loads. For varying user loads, we need a model rather than a fixed number to analyze the capacity needs of each component. Such models enable us to answer the "what if" questions in capacity planning. For example, what should be the size of the database server if the volume of web requests triples tomorrow? Though queuing and other models are widely applied in performance modeling [21], these models are often used to analyze a limited number of components under various assumptions.
For example, a closed queuing network was used to model multi-tier Internet applications [31], but only CPU resources were considered in the performance model. In practice, we have to consider many other resources as well, such as memory, disk I/O and network. Therefore, queuing models do not seem to scale well for profiling large distributed systems. In this paper, we propose a novel approach to profile services for capacity planning and resource optimization. During operation, distributed systems generate large amounts of monitoring data to track their operational status. We collect resource consumption related monitoring data from various components across distributed systems; CPU usage, network traffic volume, and the number of SQL queries are typical examples of such monitoring data. While large volumes of user requests flow through the various components, many resource consumption related measurements respond to the intensity of user loads accordingly. Here we introduce a new concept, flow intensity, to measure the intensity with which internal measurements react to the volume of user loads. We then search for constant relationships between these flow intensity measurements collected at various points across the system. If such relationships always hold under various workloads over time, we consider them invariants of the underlying distributed system. In this paper, we propose an algorithm to automatically extract such invariants from monitoring data. After extracting many invariants from a distributed system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By comparing the current resource assignments against the estimated capacity needs, we can also discover and rank the weakest points that may deteriorate system performance. Operators can then consult such analytical results to optimize resource assignments and remove potential performance bottlenecks.
Several graph-based algorithms are proposed in this paper to support such capacity analysis. Our algorithms are tested with real distributed systems, including a large production system, and experimental results are included to demonstrate the effectiveness of our approach.

2 System invariants and capacity planning

As discussed earlier, we need to collect monitoring data from operational systems to profile online services. In this paper, capacity planning is discussed in the context of the system testing or operational management stage, not the system design stage. For example, we do not analyze how to optimize an architecture design to improve system capacity. Instead, we assume that services are already deployed and functional on distributed systems, so that we are able to collect monitoring data from operational systems. In fact, our approach is proposed to guide operators' capacity planning tasks during system evolution: for example, how should the system's capacity be upgraded for a sales event, and where are the potential performance bottlenecks? For online services, many internal measurements respond to the intensity of user loads. For example, network traffic volume and CPU usage usually go up and down in accordance with the volume of user requests. This is especially true for many resource consumption related measurements because they are mainly driven by the intensity of user loads. In this paper, a metric named flow intensity is used to measure the intensity with which such internal measurements react to the volume of user requests. For example, the number of SQL queries and the average CPU usage (per sampling unit) are such flow intensity measurements. We observe that there exist strong correlations between these flow intensity measurements. Over time, many resource consumption related measurements have similar evolving curves because they mainly respond to the same external factor: the volume of user requests. As an example, Fig. 1
shows the intensities of HTTP requests and SQL queries collected from a typical three-tier web system; the two curves are very similar. [Fig. 1: Examples of flow intensities. Fig. 2: An example of an invariant network.] As an engineered system, a distributed system imposes many constraints on the relationships among these internal measurements. Such constraints can result from many factors such as hardware capacity, application software logic, system architecture and functionality. For example, in a web system, if a specific HTTP request x always leads to two related SQL queries y, we should always have I(y) = 2I(x) because this logic is written into its application software. Note that here we use I(x) and I(y) to represent the flow intensities measured at points x and y respectively. No matter how the flow intensities I(x) and I(y) change in accordance with varying user loads, such a relationship (the equation I(y) = 2I(x)) remains constant. In this paper, we model and search for the relationships among measurements collected at various points across distributed systems. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Note that the relationship I(y) = 2I(x), not the measurements themselves, is considered an invariant. Our previous work [13] verified that such invariant relationships widely exist in real distributed systems and are governed by the physical properties or logic constraints of system components. For a typical three-tier web system including a web server, an application server and a database server, we collected 111 measurements and extracted 975 such invariants among them. In this paper, we include an algorithm to automatically extract such invariants from the measurements collected at various points across distributed systems.
These invariants characterize the constant relationships between various flow intensity measurements and together formulate a network of invariants. A simple example of such a network is illustrated in Fig. 2. In this network, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. After extracting invariants from a distributed system, we can use such an invariant network to profile services for capacity planning and resource optimization. For online services, we can use trend analysis [5] to predict the future volume of user requests. The challenge here is how to estimate and upgrade the capacity of the various components inside the distributed system so as to serve the predicted volume of user requests. For example, suppose that, based on analysis of the web server's access log, the volume of HTTP requests is predicted to grow by 150% in a month. Now we need to analyze whether the current capacities of internal components (such as the memory of the application server and the disk I/O utilization of the database server) are sufficient to support such growth. Since the validity of invariants is not affected by changes in user loads, we choose the volume of user requests as the starting node and sequentially follow the edges (i.e., the invariant relationships shown in Fig. 2) to derive the capacity needs of the various internal components. In the above example, if the predicted number of HTTP requests is I(x_1), we can use the invariant relationship I(y) = 2I(x) to conclude that the number of SQL queries must be 2I(x_1); we can then consult this information when upgrading the related database server. Note that here the capacity needs of components are quantitatively represented by these resource consumption related measurements. For example, given a workload, a server may be required to have two 1 GHz CPUs, 4 GB of memory, 100 MB/s of network bandwidth, etc.
By comparing the current resource configurations against the estimated capacity needs, we can also discover the weakest points that may become performance bottlenecks. Therefore, given any volume of new user loads, operators can use such a network of invariants to estimate the capacity needs of various components, balance resource assignments and remove potential performance bottlenecks. In the following two sections, we first introduce the models of invariants and our invariant search algorithm before we
propose and discuss capacity planning algorithms. A similar invariant search algorithm was proposed in our previous work [14] for fault detection and isolation. However, in this paper we make the following new contributions:
- We extend our work to extract invariants among multiple workload classes and modify the invariant search algorithm accordingly for capacity planning and resource optimization;
- We propose a new algorithm that uses invariant networks to predict the capacity needs of components under any new workload;
- We propose a method to optimize and balance resource assignments in distributed systems based on the estimated capacity needs;
- Finally, we evaluate our approach with real systems, including a commercial distributed system.

3 Correlation of flow intensities

For convenience, in the following sections, variables such as x and y are used to represent flow intensity measurements, and we use equations such as y = f(x) to represent invariants. With flow intensities measured at various points across a system, we need to consider how to model the relationships between these measurements; i.e., given measurements x and y, how do we learn a function f such that y = f(x)? As mentioned earlier, many such resource consumption related measurements change in accordance with the volume of user requests. As time series, these measurements have similar evolving curves along time t and have linear relationships. In this paper, we use AutoRegressive models with eXogenous inputs (ARX) [20] to learn linear relationships between measurements. At time t, we denote the flow intensities measured at the input and output of a component by x(t) and y(t) respectively.
The ARX model describes the following relationship between two flow intensities:

y(t) + a_1 y(t-1) + ... + a_n y(t-n) = b_0 x(t-k) + ... + b_{m-1} x(t-k-m+1) + b,   (1)

where [n, m, k] is the order of the model, which determines how many previous steps affect the current output, and a_i and b_j are coefficient parameters that reflect how strongly a previous step affects the current output. Since there exist time delays in correlating measurements across distributed systems, and various system components may have unsynchronized clocks, we include this temporal dependency in the ARX model. In fact, even with synchronized clocks, different components may log their data timestamps with various delays. Let us denote:

θ = [a_1, ..., a_n, b_0, ..., b_{m-1}, b]^T,   (2)
φ(t) = [-y(t-1), ..., -y(t-n), x(t-k), ..., x(t-k-m+1), 1]^T.   (3)

Then (1) can be rewritten as:

y(t) = φ(t)^T θ.   (4)

Assuming that we have observed the two measurements over a time interval 1 ≤ t ≤ N, let us denote this observation by:

O_N = {x(1), y(1), ..., x(N), y(N)}.   (5)

For a given θ, we can use the observed inputs x(t) to calculate the simulated outputs ŷ(t|θ) according to (1). Thus we can compare the simulated outputs with the real observed outputs and define the estimation error by:

E_N(θ, O_N) = (1/N) Σ_{t=1}^{N} (y(t) - ŷ(t|θ))^2 = (1/N) Σ_{t=1}^{N} (y(t) - φ(t)^T θ)^2.   (6)

The Least Squares Method (LSM) finds the estimate θ̂_N that minimizes the estimation error E_N(θ, O_N):

θ̂_N = [Σ_{t=1}^{N} φ(t) φ(t)^T]^{-1} Σ_{t=1}^{N} φ(t) y(t).   (7)

There are several criteria to evaluate how well a learned model fits the real observation. In this paper, we use the following equation to calculate a normalized fitness score for model validation:

F(θ) = 1 - sqrt(Σ_{t=1}^{N} |y(t) - ŷ(t|θ)|^2) / sqrt(Σ_{t=1}^{N} |y(t) - ȳ|^2),   (8)

where ȳ is the mean of the real output y(t). Equation (8) is essentially a metric for how well the learned model approximates the real data.
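As an illustration, the least-squares estimate of (7) and the fitness score of (8) can be computed directly with numpy. This is a minimal sketch, not the authors' implementation: the function name `fit_arx` and the fixed default order [n, m, k] are our own choices, and in practice a range of orders would be tried and the best-fitting model kept.

```python
import numpy as np

def fit_arx(x, y, n=2, m=2, k=0):
    """Fit the ARX model of Eq. (1) by least squares (Eq. (7)) and
    return (theta, fitness), where theta = [a_1..a_n, b_0..b_{m-1}, b]
    and fitness is the normalized score of Eq. (8)."""
    start = max(n, k + m - 1)              # first t with a full history
    rows, targets = [], []
    for t in range(start, len(y)):
        # phi(t) = [-y(t-1),...,-y(t-n), x(t-k),...,x(t-k-m+1), 1]
        phi = [-y[t - i] for i in range(1, n + 1)]
        phi += [x[t - k - j] for j in range(m)]
        phi.append(1.0)
        rows.append(phi)
        targets.append(y[t])
    Phi, Y = np.array(rows), np.array(targets)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # Eq. (7)
    y_hat = Phi @ theta
    fitness = 1.0 - np.sqrt(((Y - y_hat) ** 2).sum()) / np.sqrt(((Y - Y.mean()) ** 2).sum())
    return theta, fitness
```

Since the direction of an invariant is not known a priori, both `fit_arx(x, y)` and `fit_arx(y, x)` would be tried, as discussed in Sect. 4.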
A higher fitness score indicates that the model fits the observed data better; its upper bound is 1. Given observations of two flow intensities, we can always use (7) to learn a model, even if that model does not reflect their real relationship at all. Based on statistical theory, only a model with a high fitness score is really meaningful in characterizing a linear data relationship. We can set a range for the order [n, m, k] rather than a fixed value to learn a list of candidate models and then select the model with the highest fitness score. Other criteria such
as Minimum Description Length (MDL) [28] can also be used to select models. Note that we use the ARX model to learn the long-run relationship between two measurements; i.e., a model y = f(x) only captures the main characteristics of their relationship. The precise relationship between two measurements should be represented as y = f(x) + ε, where ε is the modeling error. Note that ε is usually small for a model with a high fitness score. The ARX model shown in (1) can easily be extended to learn a relationship with multiple inputs and outputs. For example, the volume of HTTP requests x can usually be split into multiple classes such as browsing, shopping and ordering, and the different types of user requests may consume different amounts of resources. Let us use x_i (1 ≤ i ≤ N) to represent the volumes of the N request types respectively, so that x = Σ_{i=1}^{N} x_i. Now if the relationship y = f(x) is sensitive to changes in the distribution of request types, we can derive a new relationship y = f(x_1, ..., x_N). In this case, (1) is replaced with the following equation with multiple inputs, where {b_0^i, ..., b_{m_i-1}^i} denotes the coefficient parameters for the i-th request type:

y(t) + a_1 y(t-1) + ... + a_n y(t-n)
  = b_0^1 x_1(t-k_1) + ... + b_{m_1-1}^1 x_1(t-k_1-m_1+1) + ...
  + b_0^N x_N(t-k_N) + ... + b_{m_N-1}^N x_N(t-k_N-m_N+1) + b
  = Σ_{i=1}^{N} Σ_{j=0}^{m_i-1} b_j^i x_i(t-k_i-j) + b.   (9)

Now we can define new θ and φ(t) as follows:

θ = [a_1, ..., a_n, b_0^1, ..., b_{m_1-1}^1, ..., b_0^N, ..., b_{m_N-1}^N, b]^T,   (10)
φ(t) = [-y(t-1), ..., -y(t-n), x_1(t-k_1), ..., x_1(t-k_1-m_1+1), ..., x_N(t-k_N), ..., x_N(t-k_N-m_N+1), 1]^T.   (11)

It is straightforward to see that the same (7) and (8) can be used to estimate the parameter θ and calculate its fitness score respectively. In practice, we can select k_i = k and m_i = m (1 ≤ i ≤ N, with k and m fixed values) to reduce the parameter search space.
In particular, if we do not consider any time delay in (9), we have y = Σ_{i=1}^{N} b_0^i x_i. In this case, b_0^i represents the resource consumption unit for each request of type i. The same equations also work here to estimate the best-fit parameters b_0^i (1 ≤ i ≤ N). With this extension to model invariants among multiple classes of workloads, our capacity planning approach works even if the distribution of workload classes is not stationary [29]. Without loss of generality, in the following sections we use (1) to introduce the concepts behind our algorithms; later in this paper we extend the algorithms to support capacity planning under multiple classes of workloads.

4 Extracting invariants

In the above section, we analyzed how to automatically learn a model given two measurements. In practice, we may collect many resource consumption related measurements from a complex system, but obviously not every pair of them has such a linear relationship. Due to system uncertainties and changes in user behavior, some learned models may not be robust over time. The challenging question is how to extract invariants from a large number of measurements. In practice, we may manually build some relationships based on prior system knowledge, but such knowledge is usually very limited and system dependent. In this section, we introduce an automated algorithm to extract and validate invariants from monitoring data. Note that for capacity planning purposes, we only need to search for invariants among resource consumption related measurements. Assume that we have m such measurements, denoted by I_i, 1 ≤ i ≤ m. Since we have little knowledge about their relationships in a specific system, we first try every combination of two measurements to construct a model and then continue to validate whether this model fits new observations; i.e., we use brute-force search to construct all hypotheses of invariants first and then sequentially test the validity of these hypotheses in operation.
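The delay-free special case above reduces to ordinary least squares: given per-class request counts and a resource measurement, the consumption units b_0^i fall out directly. The sketch below uses synthetic data; the class names and per-request costs are invented for illustration, not figures from the paper.

```python
import numpy as np

# With no time delay, Eq. (9) reduces to y = sum_i b_0^i x_i, so each
# b_0^i is the resource cost of one request of class i.
rng = np.random.default_rng(1)
# Hypothetical per-interval request counts for three workload classes.
browse, shop, order = rng.poisson([50, 20, 5], size=(300, 3)).T
cpu = 0.2 * browse + 0.5 * shop + 1.5 * order      # synthetic ground truth
X = np.column_stack([browse, shop, order])
units, *_ = np.linalg.lstsq(X, cpu, rcond=None)    # recovers [0.2, 0.5, 1.5]
```

With real monitoring data the fit is of course not exact, and the fitness score of (8) decides whether the recovered units are trustworthy.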
Note that an operational system always produces sufficient monitoring data to validate these hypotheses over time. The fitness score F_k(θ) given by (8) is used to evaluate how well a learned model matches the data observed during the k-th time window. We denote the length of this window by l; i.e., each window includes l sampling points of each measurement. As discussed earlier, given two measurements, we can always use (7) to learn a model. However, models with low fitness scores do not characterize the real data relationships well, so we choose a threshold F̃ to filter out such models during sequential testing. Denote the set of valid models at time t = k·l (i.e., after k time windows) by M_k. During the sequential testing, if F_k(θ) ≤ F̃, we stop testing this model and remove it from M_k. After receiving monitoring data for k such windows, i.e., k·l sampling points in total, we can calculate a confidence score with the following equation:

p_k(θ) = (Σ_{i=1}^{k} F_i(θ)) / k = (p_{k-1}(θ) · (k-1) + F_k(θ)) / k.   (12)
Algorithm 4.1
Input: I_i(t), 1 ≤ i ≤ m
Output: M_k and p_k(θ) for each time window k

Part I: Model Construction
  at time t = l (i.e., k = 1), set M_1 to an empty set;
  for each I_i and I_j, 1 ≤ i, j ≤ m, i ≠ j:
    learn a model θ_ij using (7);
    compute F_1(θ_ij) with (8);
    if F_1(θ_ij) > F̃, then set M_1 = M_1 ∪ {θ_ij}, p_1(θ_ij) = F_1(θ_ij).

Part II: Sequential Validation
  for each time t = k·l (k > 1), set M_k = M_{k-1};
  for each θ_ij ∈ M_k:
    compute F_k(θ_ij) with (8) using I_i(t) and I_j(t), (k-1)·l + 1 ≤ t ≤ k·l;
    if F_k(θ_ij) ≤ F̃, then remove θ_ij from M_k;
    otherwise update p_k(θ_ij) with (12);
  output M_k and p_k(θ); k = k + 1.

[Fig. 3: Invariant extraction algorithm]

In fact, p_k(θ) is the average fitness score over k time windows. Since the set M_k only includes valid models and F_i(θ) > F̃ (1 ≤ i ≤ k), we always have F̃ < p_k(θ) ≤ 1. The invariant extraction algorithm is shown in Fig. 3 and includes two parts: Part I for model construction and Part II for sequential validation. The invariants extracted with Algorithm 4.1 should essentially be considered likely invariants. As mentioned earlier, a model can be regarded as an invariant of the underlying system only if it holds all the time; however, even if the validity of a model has been sequentially tested for a long time, we still cannot guarantee that it will always hold. Therefore, it is more accurate to consider these valid models as likely invariants. Based on historical monitoring data, each confidence score p_k(θ) in fact measures the robustness of an invariant. Note that given two measurements, in a complex system we do not know logically which one should be chosen as the input or output (i.e., x or y in (1)). Therefore, we always construct two models with reversed input and output. If the two learned models have very different fitness scores, we must have constructed an AutoRegressive (AR) model rather than an ARX model.
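Algorithm 4.1 can be sketched as follows. This is a simplified illustration under two assumptions not made by the paper: each hypothesis is a static order-[0, 1, 0] model y = b_0 x + b learned once on the first window (the paper searches a range of ARX orders), and only one direction per pair is fitted (the paper fits both directions and requires both to pass the threshold).

```python
import numpy as np
from itertools import combinations

def fit_line(x, y):
    """Least-squares fit (Eq. (7)) of the simplest ARX instance y = b0*x + b."""
    Phi = np.column_stack([x, np.ones_like(x)])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def fitness(theta, x, y):
    """Fitness score of Eq. (8) for a previously learned model on new data."""
    y_hat = np.column_stack([x, np.ones_like(x)]) @ theta
    denom = np.sqrt(((y - y.mean()) ** 2).sum())
    return 1.0 - np.sqrt(((y - y_hat) ** 2).sum()) / max(denom, 1e-12)

def extract_invariants(series, l, F_threshold):
    """Part I: brute-force model construction on the first window.
    Part II: sequential validation on later windows, pruning any model
    whose fitness drops to F_threshold or below and updating the
    confidence score p_k of Eq. (12) for the survivors."""
    models, p = {}, {}
    for i, j in combinations(range(len(series)), 2):       # Part I
        theta = fit_line(series[i][:l], series[j][:l])
        f1 = fitness(theta, series[i][:l], series[j][:l])
        if f1 > F_threshold:
            models[(i, j)], p[(i, j)] = theta, f1
    for k in range(2, len(series[0]) // l + 1):            # Part II
        win = slice((k - 1) * l, k * l)
        for pair in list(models):
            i, j = pair
            fk = fitness(models[pair], series[i][win], series[j][win])
            if fk <= F_threshold:
                del models[pair], p[pair]                  # hypothesis broken
            else:
                p[pair] = (p[pair] * (k - 1) + fk) / k     # Eq. (12)
    return p
```

On synthetic data where one series is an exact linear function of another, the pair survives all windows with a confidence score near 1, while unrelated pairs are filtered out in Part I.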
Since we are only interested in strong correlations between two measurements, we filter out such AR models by requiring the fitness scores of both models to exceed the threshold. Therefore, an invariant relationship between two measurements is bi-directional in this paper. Our previous work verified that invariants widely exist in distributed systems, and Fig. 4 shows an invariant network extracted from a typical three-tier web system [13]. [Fig. 4: An invariant network extracted from a three-tier web system] In this figure, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. Since no iterations are needed to calculate θ in (7), the computational complexity of Algorithm 4.1 is usually acceptable even under O(m²) brute-force search. For example, it takes a common laptop around 20 minutes to extract the invariants from 100 flow intensity measurements (i.e., m = 100), where each measurement (as a time series) includes 1000 data points. Note that Part II runs much faster than Part I in Algorithm 4.1: it takes only 2 seconds to validate nearly 1000 invariants, as shown in Fig. 4. For large systems, we have proposed efficient and scalable algorithms that extract invariants by compromising search accuracy [15]; due to limited space, we do not discuss the scalability issue here. Note that the invariant extraction time is negligible compared to the time taken to manually build models in capacity analysis. As discussed in Sect. 3, if some flow intensity metrics include multiple classes of measurements, we can use multiple variables instead of one in model construction, and Algorithm 4.1 still works without any change. For example, we can set I_i = {I_i^1, ..., I_i^N} in Algorithm 4.1; in this case, I_i is a vector rather than a scalar.

5 Estimation of capacity needs

In the above section, Algorithm 4.1 automatically extracts all possible invariants among the measurements I_i, 1 ≤ i ≤ m.
Together, these measurements and invariants formulate a relation network that can be used as a model to systematically profile services. Under a low volume of user
requests, we extract a network of invariants from a system when its quality of service meets clients' expectations; i.e., we only profile a system when it is in a good state. [Fig. 5: Capacity planning using invariant networks] Assume that we have collected ten resource consumption related measurements (i.e., m = 10) from a system and extracted the invariant network shown in Fig. 5. For simplicity, we use this network as a running example to illustrate our capacity planning algorithms. In this graph, each node with number i represents the flow intensity measurement I_i. As discussed earlier, I_i could also be a vector including multiple sub-class measurements. Since we use a threshold F̃ to filter out models with low fitness scores, not every pair of measurements has an invariant relationship. Therefore, in Fig. 5 we observe two disconnected subnetworks and even some isolated nodes such as node 1. An isolated node implies that its measurement does not have any linear relationship with the other measurements. All edges are bi-directional because we always construct two models (with reversed input and output) between two measurements. Now let us consider a triangle relationship among three measurements such as {I_10, I_3, I_4}. Assume that we have I_3 = f(I_10) and I_4 = g(I_3), where f and g are both linear functions as in (1). Based on the triangle relationship, theoretically we can conclude that I_4 = g(I_3) = g(f(I_10)). By the linearity of f and g, the composition g(f(·)) is linear too, which implies that there should exist an invariant relationship between the measurements I_10 and I_4. However, since we use a threshold to filter out models with low fitness scores, such a linear relationship may not be robust enough to be considered an invariant by Algorithm 4.1. This explains why there is no edge between I_10 and I_4. As discussed in Sect.
2, invariants characterize constant long-run relationships between measurements, and their validity is not affected by the dynamics of user loads. While each invariant models some local relationship between its associated measurements, the network of invariants can capture many invariant constraints underlying the whole distributed system. Rather than using one or a few analytical models to profile services, here we combine a large number of invariant models into a network to analyze capacity needs and optimize resource assignments. At a future time t (e.g., in a month or during a sales event), assume that the maximal volume of user requests is predicted to be x̄. Without loss of generality, in Fig. 5 we assume that the measurement I_10 represents the volume of user requests, i.e., I_10 = x̄. Now the challenging question is how to upgrade the capacities of the other nodes so as to serve this volume of user requests. Starting from the node I_10 = x̄, we sequentially follow edges to estimate the capacity needs of the other nodes in the invariant network. According to Fig. 5, we can reach the nodes {I_3, I_5, I_7} within one hop. Given I_10 = x̄, the question is how to follow invariants to estimate these measurements. As discussed in Sect. 3, we use the model shown in (1) to search for invariant relationships between measurements, so all invariants can be considered instances of this model template. By the linearity of the models, the capacity needs of system components increase monotonically as the volume of user loads increases. Therefore, we use the maximal value of user loads to estimate the capacity needs of internal components. Here we use x̄ to denote the maximal value of I_10. In (1), if we set the input x(t) = x̄ at all time steps, we expect the output y(t) to converge to a constant value y(t) = ȳ, where ȳ can be derived from the following equations:

ȳ + a_1 ȳ + ... + a_n ȳ = b_0 x̄ + ... + b_{m-1} x̄ + b,
ȳ = (Σ_{i=0}^{m-1} b_i x̄ + b) / (1 + Σ_{j=1}^{n} a_j).   (13)
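The steady-state value in (13) gives a simple propagation rule: an invariant's parameters map the input's maximal value to the output's. A minimal sketch, with the parameter layout following Eq. (2) and the function name being our own:

```python
def propagate(theta, n, m, x_bar):
    """Steady-state output of an ARX invariant (Eq. (13)): holding the
    input at x_bar, y converges to (sum(b_i) * x_bar + b) / (1 + sum(a_j)).
    theta is laid out as [a_1..a_n, b_0..b_{m-1}, b], as in Eq. (2)."""
    a, b_coef, b = theta[:n], theta[n:n + m], theta[n + m]
    return (sum(b_coef) * x_bar + b) / (1.0 + sum(a))

# Example: the I(y) = 2 I(x) invariant from Sect. 2 corresponds to
# theta = [2, 0] with n = 0, m = 1 (no autoregressive terms, b = 0).
assert propagate([2.0, 0.0], n=0, m=1, x_bar=150.0) == 300.0
```

Repeated application of `propagate` along the edges of the invariant network carries the predicted load x̄ from the starting node to every reachable node.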
For convenience, we use f(θ_ij) to represent the propagation function from I_i to I_j, i.e., f(θ_ij) = (Σ_{k=0}^{m-1} b_k I_i + b) / (1 + Σ_{k=1}^{n} a_k), where all coefficient parameters come from the vector θ_ij, as shown in (2). Therefore, given the value of the input measurement, we can use (13) to estimate the value of the output measurement. For example, given I_10 = x̄, we can use invariants to derive the values of I_3, I_5 and I_7 respectively. Since these measurements are the inputs of other invariants, we can further propagate their values in the same way to other nodes in the network, such as I_4 and I_6. As shown in Fig. 5, some nodes such as I_4 and I_7 can be reached from the starting node I_10 via multiple paths. Between the same two nodes, different paths may include different numbers of edges, and each invariant (edge) may have different quality in modeling the relationship between its two nodes. Therefore, the capacity needs of a node can be estimated via different paths with different accuracy. For each node, the question is how to locate the best path for propagating the volume of user loads from the starting node. First, we choose the shortest path (i.e., with the minimal number of hops) to propagate this value. As discussed in Sect. 3, each invariant includes some modeling error ε when it characterizes the relationship between two measurements. These modeling errors accumulate along a path, and a longer path usually results in a larger estimation error. In Sect. 4, we
introduce a confidence score p_k(θ) to measure the robustness of invariants. According to the definition of the confidence score, an invariant with a higher fitness score will lead to better accuracy in capacity estimation. For simplicity, here we use p_ij to represent the latest p_k(θ) between the measurements I_i and I_j. If there is no invariant between I_i and I_j, we set p_ij = 0. Given a specific path s, we can derive an accumulated score q_s = Π_{(i,j)∈s} p_ij to evaluate the accuracy of the whole path. Therefore, among multiple paths with the same number of edges, we choose the path with the highest score q_s to estimate capacity needs. In Fig. 5, we also observe that some nodes are not reachable from the starting node. However, these measurements may still have linear relationships with a set of other nodes, because they may respond to user loads in a similar but nonlinear or stochastic way. Now the question is how to estimate the capacity needs of these unreachable nodes. Note that it is extremely difficult to model and estimate the capacity needs of all components at fine granularity. As discussed earlier, many analytical models such as queuing models can hardly scale to model thousands of resource metrics in a distributed system, especially if these metrics have complicated dependencies. In this paper, we extract invariant networks to automatically profile as many relationships as possible and then have to manually build complicated models for the remaining part. In performance modeling, many models have been developed to characterize individual components. For example, Menasce et al. [21] utilized several known laws to quantify performance models of various components, including the utilization law, the service demand law and the forced flow law. Following these laws and classic theory, we can manually build complicated nonlinear or stochastic models to link those unreachable nodes.
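To make the steady-state computation in (13) concrete, here is a minimal sketch; the coefficient values in the usage example are made up for illustration, since real invariant parameters depend on the monitored system:

```python
def arx_steady_state(a, b, bias, x):
    """Steady-state output of an ARX invariant, per (13):
    y = (sum(b) * x + bias) / (1 + sum(a)),
    where a = [a_1, ..., a_n], b = [b_0, ..., b_{m-1}] and x is the
    constant input level."""
    return (sum(b) * x + bias) / (1.0 + sum(a))

# Hypothetical invariant y(t) - 0.5*y(t-1) = 1.2*x(t) + 0.3*x(t-1),
# i.e. a = [-0.5], b = [1.2, 0.3]: the steady-state gain is 1.5/0.5 = 3.
y = arx_steady_state([-0.5], [1.2, 0.3], 0.0, 100.0)  # -> 300.0
```

In an invariant network, this function plays the role of f(θ_ij): given the estimated value of an input node, it returns the estimated steady-state value of the output node.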
In some cases, we can also use bound analysis to derive rough relationships between measurements. Here we give some examples of how to build models to connect those unreachable nodes. Following the utilization law [21], a disk I/O utilization U can be expressed as U = λS, where λ is the intensity of disk I/Os (e.g., the number of disk I/Os per second) and S is the average service time. Further, S includes the disk controller time as well as the time taken to seek a data block on the disk. Under various workloads (random vs. sequential workloads, low vs. heavy workloads), S could be quite different. Therefore, while λ (as a flow intensity node) can be reachable from the starting node, there is no linear relationship between U and λ. However, much of the literature on disk performance [23] concludes that a system's performance becomes sluggish if S ≥ 0.02 s. Therefore, we can use bound analysis to estimate the capacity needs of disk I/Os with U > 0.02λ. We can further propagate the value of λ to the unreachable node U by manually modeling their relationship with domain knowledge. Recently, researchers at Google [9] investigated the power provisioning problem for a warehouse-sized system with 15 thousand servers. At the server level, they discovered the following nonlinear relationship between power consumption P and CPU utilization U:

    P = P_idle + (P_peak − P_idle)(2U − U^1.4),    (14)

where P_idle and P_peak are the power consumptions when the server is idle and at peak load respectively. Since the various servers behind an online service run different service logics and functions (e.g., database functions or web server functions), their CPU utilizations could be very different even under the same volume of incoming workloads. Given a new volume of workload x, we can follow the invariant network to estimate the CPU utilization U_i of each server.
According to (14), we can then estimate the power consumption P_i of each server and further sum up the P_i to estimate the whole power supply needs under the new volume of workloads. According to their work, the other power consumption from networking and cooling systems etc. is proportional to the total power consumption of the servers. Therefore, we can also use invariant networks to support the capacity planning of power supplies in a data center. Since invariant networks have automatically modeled the relationships among many resource consumption related metrics, it becomes easier to manually model the remaining parts of distributed systems. Therefore, our approach and other performance modeling methods are essentially complementary to each other. By introducing other models as shown in the above examples, we can continue to propagate the volume of user loads to those isolated nodes. For example, in Fig. 5, if we can manually bridge any two nodes from the two disconnected subnetworks, we will be able to propagate the volume of user loads several hops further. Even in this case, our invariant network is still very useful because it can guide us on where to manually bridge two disconnected subnetworks. In fact, it is usually much easier to build models among measurements from the same component because their dependencies are much more straightforward in this local context. As shown in the above example, it is much easier to build a model between U and λ of the same disk than a model between U (of the backend database server) and the volume of HTTP requests x (of the front web server) directly, because we have domain knowledge on disks to support the modeling. Essentially, it is the invariant network that enables us to propagate the value of x to the various internal components of a large distributed system.
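The power-provisioning example above can be sketched as follows. The idle/peak wattages and per-server utilizations here are hypothetical; in practice the utilizations U_i would come from the invariant-network estimates:

```python
def server_power(u, p_idle, p_peak):
    """Per-server power from CPU utilization u in [0, 1], following the
    empirical model (14): P = P_idle + (P_peak - P_idle) * (2u - u**1.4)."""
    return p_idle + (p_peak - p_idle) * (2.0 * u - u ** 1.4)

# Hypothetical fleet: utilizations estimated for three servers under the
# predicted workload, with assumed 150 W idle / 250 W peak per server.
utils = [0.2, 0.5, 0.8]
total = sum(server_power(u, p_idle=150.0, p_peak=250.0) for u in utils)
```

Summing the per-server estimates gives the total server power demand; networking and cooling overheads can then be added proportionally, as reported in [9].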
Therefore, rather than building models for measurements across many segments of distributed systems, we can manually build local models within the same segment to link disconnected subnetworks. Conversely, if we develop a model of each resource metric as a direct function of exogenous workload, we will not be able to observe the
relationship between disconnected subnetworks and have to manually model each disconnected metric once. In addition, it is very difficult to model such a relationship across multiple system segments because their dependencies could be very complicated and difficult to understand. In this paper, we simply consider such complicated models as another class of invariants constructed from domain knowledge and do not distinguish them in our analysis and algorithms. Now we summarize the above discussion and propose the algorithm shown in Fig. 6 for estimating capacity needs. For convenience, we define the following variables in our algorithms:

I_i: the individual measurements, 1 ≤ i ≤ N.
U: the set of all measurements, i.e., U = {I_i}.
M: the set of all invariants, i.e., M = {θ_ij}, where θ_ij is the invariant model between the measurements I_i and I_j.
p_ij: the confidence score of the model θ_ij. Note that we set p_ij = 0 if there is no invariant (edge) between the measurements I_i and I_j.
P: the set of all confidence scores, i.e., P = {p_ij}.
x: the predicted maximal volume of user loads.
I_1: the starting node in the invariant network, i.e., I_1 = x.
S_k: the set of nodes that are first reachable at the k-th hop from I_1 but not at earlier hops.
V_k: the set of all nodes that have been visited up to the k-th hop.
R: the set of all nodes that are reachable from I_1.
∅: the empty set.
f(θ_ij): the propagation function from I_i to I_j. For linear invariants, f(θ_ij) = (Σ_{k=0}^{m−1} b_k I_i + b) / (1 + Σ_{k=1}^{n} a_k). For nonlinear or stochastic models, it may take a variety of forms.
q_s: the maximal accumulated confidence score over the paths from the starting node I_1 to I_s.

Algorithm 5.1
    Input: M, P and x.
    Output: I_i (1 ≤ i ≤ N) and R.
    At step k = 0, set V_0 = S_0 = {I_1}, q_1 = 1, I_1 = x and all other I_i = 0.
    do
        k = k + 1; set S_k = ∅;
        for each I_i ∈ S_{k−1} and I_j ∈ U − V_{k−1}: if p_ij ≠ 0, then S_k = S_k ∪ {I_j};
        for each I_j ∈ S_k:
            I_l = arg max_{I_i ∈ S_{k−1}} (q_i · p_ij);
            compute I_j = f(θ_lj); q_j = p_lj · q_l;
        V_k = S_k ∪ V_{k−1};
    while S_k ≠ ∅
    R = V_k; output R and all I_i.

Fig. 6 Capacity needs estimation algorithm

As discussed in Sect. 4, Algorithm 4.1 automatically extracts robust invariants after long sequential testing phases. In this section, Algorithm 5.1 follows the extracted invariant network specified by M and P to estimate capacity needs. Since we always choose the shortest path to propagate from the starting node to the other nodes, at each step Algorithm 5.1 only searches the unvisited nodes for further propagation, and all nodes visited before this step must already have their shortest paths to the starting node. Meanwhile, Algorithm 5.1 only uses the newly visited nodes at each step to search for the next hop, because only these newly visited nodes may link to unvisited nodes. For nodes with multiple same-length paths to the starting node, we choose the best path, i.e., the one with the highest accumulated confidence score, for estimating the capacity needs. Essentially, Algorithm 5.1 is an efficient graph algorithm based on dynamic programming [7]. We incrementally estimate the capacity needs of the newly visited nodes and compute their accumulated confidence scores at each step, until no more nodes are reachable from the starting node. Our algorithm can easily be extended to support models with multiple inputs: before estimating the capacity of a new node, we just need to check whether all of its input nodes have already been visited. If not, this node is considered unvisited until all of its input nodes are ready.

6 Resource optimization

In the above section, Algorithm 5.1 sequentially estimates the resource usages of components under a given volume of user loads. Assume that we have collected information about the current resource configurations when the system was deployed or upgraded.
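As a rough Python sketch of the propagation in Algorithm 5.1 (the toy network, scores and `propagate` callback are our own illustration; `propagate` stands in for f(θ_ij), e.g., the steady-state formula (13)):

```python
def estimate_capacities(scores, thetas, start, x, propagate):
    """Breadth-first propagation of the predicted load x through an
    invariant network. Shortest paths (fewest hops) win; among equally
    short paths, the one with the highest accumulated confidence q is used.
    scores: dict (i, j) -> confidence p_ij (absent/0 means no invariant)
    thetas: dict (i, j) -> model parameters theta_ij
    Returns the estimated values I and the set of reachable nodes."""
    I = {start: x}          # estimated measurement values
    q = {start: 1.0}        # best accumulated confidence per node
    visited = {start}       # V_k
    frontier = {start}      # S_{k-1}
    while frontier:
        next_frontier = {}  # j -> (best accumulated score, best parent)
        for i in frontier:
            for (a, j), p in scores.items():
                if a != i or p == 0 or j in visited:
                    continue
                cand = q[i] * p
                if j not in next_frontier or cand > next_frontier[j][0]:
                    next_frontier[j] = (cand, i)
        for j, (qj, i) in next_frontier.items():
            I[j] = propagate(thetas[(i, j)], I[i])  # apply f(theta_ij)
            q[j] = qj
        visited |= set(next_frontier)
        frontier = set(next_frontier)
    return I, visited

# Toy network: 0 -> 1 -> 2 plus a direct edge 0 -> 2; the one-hop path to
# node 2 wins even though its confidence (0.5) is lower than 0.9 * 0.8.
scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.5}
thetas = {(0, 1): 2.0, (1, 2): 3.0, (0, 2): 5.0}
I, reachable = estimate_capacities(scores, thetas, start=0, x=10.0,
                                   propagate=lambda th, v: th * v)
```

Because the loop advances one hop at a time, every node is assigned a value on its first (shortest) reachable hop, matching the shortest-path preference of Algorithm 5.1.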
In practice, such configuration information is usually maintained and updated in a Configuration Management DataBase (CMDB). For each measurement I_i, we denote the capacity of its related resource configuration by C_i. For example, if I_i is the megabytes of real memory used, C_i will be the total memory size of the server. Similarly, if I_i is the number of concurrent SQL connections, C_i will be the maximum number of connections configured for the MySQL database. In performance analysis, there are also some benchmarks that could provide specifications for C_i. Note that this configuration information includes hardware specifications such as memory size as well as software configurations such as the maximal number of database connections. In practice, software configurations such as the maximum number of file descriptors
and the maximum number of concurrent threads could also affect system performance. In this paper, we also consider them as system resources.

Fig. 7 Capacity analysis and resource optimization
Fig. 8 System response with overshoot

Given a volume of user loads x, we use Algorithm 5.1 to estimate the values of I_i. By comparing I_i against C_i, we can further locate potential performance bottlenecks and balance resource assignments. Let us denote O_i = (C_i − I_i) / C_i, where O_i represents the percentage of resource shortage or available margin. Given an estimated volume of user loads, all components with a negative O_i are short of capacity under the new workload, and we should assign more resources to these components to remove performance bottlenecks. Conversely, components with a large positive O_i must have oversized capacities for such a new volume of user loads, and we may remove some resources from these components to reduce IT cost. Note that we always need to keep the right capacity margin for each component. Therefore, as shown in Fig. 7, we can use such analytical results as a guideline to adjust each component's capacity and build a resource-balanced distributed system. The values of O_i can also be sorted and ranked to prioritize resource assignment and optimization. Note that we propagate the maximal volume of user loads x through the invariant network to estimate capacity needs. All I_i resulting from Algorithm 5.1 represent the capacity needs of the internal components to serve this maximal volume of user loads. Given a step input x(t) = x, we derive its stable output y(t) = y using (13). However, we did not consider the transient response of y(t) before it converges to the stable value y. As shown in Fig. 8, theoretically y(t) may respond with overshoot and its transient value may be larger than the stable value y.
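The margin computation O_i = (C_i − I_i) / C_i and its ranking can be sketched as follows (the metric names and capacity figures are hypothetical):

```python
def resource_margins(capacity, estimated):
    """Compute O_i = (C_i - I_i) / C_i for each metric and rank the results:
    negative margins (shortages) first, oversized components last.
    capacity, estimated: dicts keyed by metric name."""
    margins = {k: (capacity[k] - estimated[k]) / capacity[k] for k in capacity}
    return sorted(margins.items(), key=lambda kv: kv[1])

# Hypothetical example: the estimated need for DB connections exceeds the
# configured limit, while application-server memory is oversized.
ranked = resource_margins(
    {"db_connections": 100, "app_memory_mb": 2048},
    {"db_connections": 130, "app_memory_mb": 512},
)
# db_connections: O = -0.30 (shortage); app_memory_mb: O = 0.75 (oversized)
```

The sorted list directly gives the priority order for resource assignment: fix the most negative margins first, and consider reclaiming resources from the largest positive margins.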
The overshoot occurs because a system component does not respond quickly enough to a sudden change in user loads. For example, after a sudden increase of user loads in a three-tier web system, the application server may take some time to initialize more EJB instances and create more database connections so as to handle the increased workload. During this overshoot period, we may observe longer latencies for user requests. However, computing systems usually respond to the dynamics of user loads quickly. Therefore, even if an overshoot exists, it lasts only a very short time. In fact, in our experiments we did not observe any overshoot responses at all, though theoretically they may exist. Now, if we want a system to have enough capacity to handle such overshoots, we can calculate the magnitude of the overshoot and propagate it as the maximal value (rather than the stable value y) in capacity planning. In practice, the capacities of many systems are only expected to support high QoS during 95% of their operational time. Since it may take a huge amount of extra resources to guarantee QoS in some rare events, some service providers are willing to compromise QoS for short periods of time so as to reduce IT costs. However, for applications with stringent QoS requirements such as real-time video/voice communication, we may have to consider the overshoot situations. For low order ARX models with n, m ≤ 2, the literature on classic control theory [18] describes how to analytically calculate the overshoot. Basically, we can apply the Z-transform [4] to (1) to analytically derive the transient response of y(t) and then calculate the overshoot value by locating the maximal point of y(t). For high order ARX models, given an input x(t) = x, we have to iterate (1) to simulate the transient response of y(t) and then estimate the overshoot value.
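The iterative simulation of the step response can be sketched as follows; the coefficients below are made-up values for a lightly damped second-order ARX model, chosen only so that an overshoot is visible (for simplicity the input history is taken as x throughout, which is exact when m = 1):

```python
def simulate_step_response(a, b, bias, x, steps=200):
    """Iterate the ARX model (1), y(t) + sum_k a_k*y(t-k) = sum_k b_k*x + bias,
    with a constant input x to obtain the transient response y(t).
    The overshoot is max(y) compared with the settled value y[-1]."""
    n = len(a)
    ex = sum(b) * x  # constant exogenous contribution
    y = [0.0] * steps
    for t in range(steps):
        ar = sum(a[k] * y[t - 1 - k] for k in range(n) if t - 1 - k >= 0)
        y[t] = ex + bias - ar
    return y

# Hypothetical model y(t) = 0.3*x + 1.2*y(t-1) - 0.5*y(t-2): complex poles
# inside the unit circle, so y(t) overshoots its steady-state value of 1.0.
y = simulate_step_response(a=[-1.2, 0.5], b=[0.3], bias=0.0, x=1.0)
peak, settled = max(y), y[-1]
```

Here `peak - settled` is the overshoot magnitude; as described above, `peak` rather than `settled` would then be propagated through the network when provisioning for transients.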
At each step of Algorithm 5.1, rather than using the function f(θ_lj) to estimate the stable value of I_j, we can use the simulation results from (1) to estimate the transient I_j and further propagate its maximal value to estimate the capacity needs of other nodes. All other parts of Algorithm 5.1 remain the same. Therefore, with ARX models, we are also able to analyze the transient behavior of various components. In our earlier discussion, the volume of user loads is chosen as the starting point of the capacity estimation. Therefore, following the invariant network, we can only estimate the capacity needs of those resource metrics that are reachable from this starting point. However, if we just want to check whether various components have matched resource assignments, any node in the invariant network can be chosen as a starting point to estimate the capacity needs of the others and to determine whether they have consistent sizes. For example, we may follow a single invariant to check whether components A and B have matched resource assignments. In this case, we do not need a fully connected invariant network, and even a disconnected subnetwork enables us to evaluate whether all of its nodes
have balanced resource assignments. If we do not extract an invariant network but directly correlate resource metrics with exogenous workloads, we cannot support such resource optimization procedures.

7 Experiments

Our capacity planning approach is evaluated with a three-tier web system as well as a commercial mobile internet system. In both cases, we extract invariants from resource consumption related measurements and then follow the invariant network to estimate the capacity needs under heavy workloads. Later we measure the real resource usages of the components under such heavy workloads and compare them with the estimated values to verify our prediction accuracy.

Fig. 9 The experimental system
Fig. 10 Low and high volume of user loads in experiments

7.1 Three-tier Web systems

In this section, our capacity planning experiments were performed on a typical three-tier web system, which includes an Apache web server, a JBoss application server [12] and a MySQL database server. Figure 9 illustrates the architecture of our experimental system and its components. The application software running on this system is Pet store [27], which was written by Sun Microsystems. Just like other web services, users can visit the Pet store website to buy a variety of pets. We developed a client emulator to generate a large class of user scenarios and workload patterns. Various user actions such as browsing items, searching items, account login, adding an item to a shopping cart, payment and checkout are all included in our workloads. A certain randomness of user behavior is also considered in the emulated workloads. For example, a user action is randomly selected from all possible user scenarios that could follow the previous user action. The time interval between two user actions is also randomly selected from a reasonable time range.
Note that the workloads are dynamically generated with much randomness and variance, so we never get a similar workload twice in our experiments. In our experiments, we first run a low volume of user loads to collect measurements from the various components and further use these measurements to extract invariants. As discussed in Sect. 5, given a predicted high volume of user loads, we can use Algorithm 5.1 to estimate the capacity needs of the various components. Meanwhile, we also run this predicted volume of user loads to collect measurements, and these real measurements are then compared with the estimated values to verify the capacity estimation accuracy of our algorithm. Figure 10 shows examples of both the low volume and the high volume of user loads used in our experiments, which have very different intensities and dynamics. We have repeated such experiments many times with various workloads to verify our results. Note that we do not have to repeat the same workload in our evaluations.

Fig. 11 Categories of measurements

Measurements were collected from the three servers of our testbed system. Figure 11 shows the categories of our resource consumption related measurements. In total we have eight categories, and each category includes a different number of measurements. In our previous work [13], we collected as many as 111 measurements and extracted 975 invariants from our testbed system. Figure 4 illustrates its invariant network, which is too meshed for us to observe its detailed connectivity. For capacity planning, we only chose the resource consumption related metrics and extracted a small invariant network so as to analyze its connectivity in detail. Meanwhile, in a typical three-tier web system, the application server runs the bulk of the business logic and usually demands much more capacity than the web server and the database server.

Fig. 12 Measurements and invariants

Therefore, we collected many of our measurements from the application server for our capacity planning experiments. Note that the web server has two network interfaces, eth0 and eth1, which communicate with the clients and the application server respectively. The monitoring data was used to calculate various flow intensities with a sampling unit equal to 6 seconds. We have in total 20 resource consumption related measurements from the three servers, and all of these measurements were used in the following experiments, i.e., n = 20. In our experiments, we collected one hour of data to construct models and then continued to test these models every half an hour, i.e., the window size is half an hour and includes 300 data points. Studies have shown that users are often dissatisfied if the latency of their web requests is longer than 10 seconds. Therefore, the order of the ARX model (i.e., [n, m] shown in (1)) should have a very narrow range. In our experiments, since the sampling time was selected to be 6 seconds, we set 0 ≤ n, m ≤ 2. By combining every two measurements from the total of 20 measurements, Algorithm 4.1 built 190 models in total (i.e., n(n−1)/2 = 190) as invariant candidates. For each model, a fitness score was calculated according to (8). In our experiments, we selected the threshold of the fitness score F = 0.3. A model is considered an invariant only if its fitness score is always higher than 0.3 in each testing phase. After several phases of sequential testing with various workloads, we eventually extracted 68 invariants, distributed as shown in Fig. 12. In the following equations, we list several examples of such extracted invariants.
If we use I_ejb, I_web and I_sql to represent the flow intensities of the number of EJBs created, the number of HTTP requests and the number of SQL queries measured from the testbed system, we extract the following invariants with I_web as the input:

    I_ejb(t) = 0.44 I_ejb(t−1) ... I_ejb(t−2) ... I_web(t) ... I_web(t−1),    (15)

    I_sql(t) = 0.16 I_sql(t−1) ... I_sql(t−2) ... I_web(t) ... 0.84 I_web(t−1).    (16)

In Fig. 12, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. This invariant network thus includes 20 nodes and 68 edges in total. All measurements whose names begin with A_, W_ or D_ are collected from the application server, the web server or the database server respectively. From this figure, we notice that the 20 measurements split into many clusters. The biggest cluster in this figure includes 13 measurements. These measurements respond to the volume of user loads directly, so they have many linear relationships with each other. Meanwhile, we also observe several smaller clusters, which characterize some local relationships between measurements. As mentioned earlier, some measurements may respond to the volume of user loads in the same but nonlinear or stochastic way. For example, Disk Write Merge has an invariant linear relationship with Disk Write Sectors, though neither has a linear relationship with the volume of user loads. The measurement W_CPU Utilization is very noisy and does not have robust relationships with any other measurements. Therefore, it is an isolated node in the invariant network. Later we discovered that the CPU usage of the web server in our testbed was extremely low (close to 1%), so the value of W_CPU Utilization was too small and easily affected by other system processes; i.e., we barely observe any intensities from W_CPU Utilization because the
web server only serves static HTML files but runs on a powerful machine.

Fig. 13 Estimated and real values of measurements
Fig. 14 The number of created EJBs

Now we choose the volume of HTTP requests, W_Apache, as the starting node in our capacity analysis. In our experiments, the predicted volume of HTTP requests (the high user loads) is shown in Fig. 10 and its maximal value is 840 requests/second, i.e., x = 840. Following the invariant network shown in Fig. 12, we use Algorithm 5.1 to estimate the other measurements. For example, with one hop from W_Apache, we can estimate the values of the following 10 measurements: W_eth1 Packets, W_eth0 Packets, W_CPU Soft IRQ Time, A_JBoss EJB Created, A_eth0 Packets, A_JVM Processing Time, D_MySQL, D_CPU Soft IRQ Time, D_CPU Utilization and D_eth0 Packets. With another hop, we can also estimate the values of A_CPU Utilization and A_CPU Soft IRQ Time. The values resulting from Algorithm 5.1 are then compared with the real maximal values collected from our experiments, and their differences are shown in Fig. 13. Here we use e and r to denote the estimated values and the real monitoring values respectively. From this figure, we observe that our approach achieves very good accuracy: on average, it results in only a 5% error in estimating capacity needs. We have repeated our experiments many times with different user loads and observed similar estimation accuracy in every experiment. In Fig. 13, the value of each measurement has its own specific unit. For example, the value of A_JVM Processing Time refers to the CPU time (in nanoseconds) used by the JVM within each 6-second interval, and the value of A_CPU Utilization refers to the percentage of user CPU utilization. As illustrated in (13), given I_web = 840, we can use the coefficient parameters from (15) to derive the corresponding steady-state value of I_ejb. In the same way, we can use the coefficient parameters from (16) to derive the value of I_sql. As examples, Figs. 14 and
15 illustrate the real numbers of EJBs created on the application server and the real numbers of SQL queries collected from the database server respectively. Both measurements result from the user loads shown in Fig. 10, so they both have curves similar to that of the user loads. In these two figures, the noisy curves represent the real values while the dashed lines represent the estimated maximal values. The estimated values are very close to the real maximal values monitored from the system, though some rare peaks may have larger values. As discussed earlier, invariants are used to characterize long-run relationships between measurements, and they do not capture some rare system uncertainties and noises. However, since those peaks are rare and we always add the right margin to the estimated capacity needs, our estimated values should be accurate enough for capacity planning [21].

Fig. 15 The number of SQL queries

In Fig. 12, we also observe some nodes that are unreachable from the starting node W_Apache. As discussed in Sect. 5, with domain knowledge, we have to manually
build some models between these isolated measurements and those listed in Fig. 13 so as to propagate the volume of user loads. In our future work, we will build a library of nonlinear models of common components such as disks to complement our invariant network in capacity planning. Based on various scenarios, operators can pull nonlinear models from the library to link disconnected invariant networks.

7.2 Mobile Internet systems

In this subsection, our approach is also evaluated with a commercial mobile internet system with multi-class workloads. The mobile internet system provides mobile users with services such as web access, email, news delivery and location information. It provides direct access to the Internet for subscribers using mobile terminals. This system consists of dozens of high-end servers, including portal web servers, mail servers, picture mail servers and account authentication servers. Web portal access and mail access are the two major classes of user traffic. Due to business confidentiality, we are not able to illustrate its specific architecture here and only report field testing results. Most of these servers run Unix operating systems, and many measurements are collected with the Unix sar command. Network measurements are collected by SNMP monitoring agents. CPU, memory and network usages are the three types of resource metrics used in our evaluations. We collected these measurements from every server once per 10 seconds for a week. During this period, we observed that the ratio between mail access traffic and web access traffic could vary widely, starting from roughly 0.11. Following the same evaluation approach used in the above section, we extracted invariants from a two-hour period of low workloads and then used the invariant network to estimate the capacity needs of the components under 5 periods of heavy workloads during the week.
The real resource usages observed from log files were then compared with these estimated values to verify the estimation accuracy. Our approach results in a 6.8% average error in estimating capacity needs, and the biggest estimation error is 9.5%. Based on operators' feedback, they are quite satisfied with such accuracy in capacity planning tasks.

8 Discussions

Large-scale distributed systems consist of thousands of components. Each component could potentially become a performance bottleneck and deteriorate system QoS. As discussed earlier, it is extremely difficult to model and analyze the capacity needs of each individual component at fine granularity in a large system. Many classic approaches cannot scale to profile the behaviors of a large number of components precisely. For example, it is not clear how to model the relationships among hundreds of metrics with queuing models, and it is also very time-consuming to build such models manually. Meanwhile, if we only model the system behavior at coarse granularity, some system metrics will not be considered in the model, so we may not be able to predict real performance bottlenecks and optimize resource assignments. In this paper, we extract invariant networks to profile the relationships among resource consumption related metrics across distributed systems. Our motivation is to develop algorithms that automatically profile as many relationships among system metrics as possible. Though such relationships may be simple, the invariant networks essentially enable us to cover a large number of system metrics and scale up our capacity analysis. For the remaining complicated models, we have no choice but to build them manually with system knowledge. Therefore, our approach is complementary to many existing modeling techniques and can greatly reduce system modeling complexity.
To the best of our knowledge, this paper proposes the first solution for analyzing system capacity and optimizing resource assignments across large systems at such a fine granularity. In fact, most existing work only models hardware resources such as CPU or memory in capacity analysis and does not consider system metrics like the number of concurrent database connections, which could also affect system performance. Our approach has several limitations as well. As discussed earlier, there exist some disconnected subnetworks that are not reachable from the starting point, i.e., the volume of workloads. In order to estimate the capacity needs of these isolated nodes, we have to manually build models with domain knowledge to bridge the disconnected subnetworks. However, since specific system knowledge (e.g., of disks) is needed here, we do not expect that any other method would be able to automatically profile those complicated relationships. Compared to other modeling techniques, our invariant networks automatically extract the maximal coverage of system metrics whose capacity needs can be estimated from the volume of workloads. Therefore, we are able to map the volume of exogenous workloads to the intensities of internal system metrics and decide where to build models in their local context. This greatly reduced modeling complexity when we applied our approach to several commercial systems. Meanwhile, even with a disconnected subnetwork, we can choose one node as a starting point to verify whether all system metrics inside this subnetwork have balanced resource configurations. This has also proved to be useful in operational environments. Another limitation is that our approach cannot automatically model the relationship between response time and system configurations. For example, operators often raise the following questions in capacity planning: how much can response time be improved if a specific part of the resources
is upgraded? How many users can the system support if its response time is allowed to increase by 10%? For a distributed system of such scale and complexity, so far we are not aware of any solution that can address such problems. We profile systems and extract invariant networks in a good state, when clients are satisfied with the system QoS. It appears difficult to reason about other scenarios beyond this good state: we do not know which parts of the system models remain valid and which do not under those new scenarios. Recently, many researchers have employed a closed queuing network to model multi-tier Internet applications and then used mean value analysis to calculate system response time [31]. However, performance bottlenecks may shift under new scenarios, so the original model of response time might become invalid. In fact, the original performance model may not even include the metric of the new bottleneck. For example, if a performance model profiles the relationship between CPU resources and system response time, it may not be useful if memory suddenly becomes the bottleneck.

9 Related work

Queuing models have been widely used to analyze system performance. Menasce et al. [21] wrote a book on how to use queuing networks for performance modeling of various components such as a database server or a disk. However, as discussed earlier, queuing models are often used to characterize individual components under many assumptions, such as stationary workloads. It is not clear how to profile large-scale complex systems with queuing models. Recently, Urgaonkar et al. [31, 32] employed a closed queuing network to model multi-tier Internet applications and used Mean Value Analysis (MVA) to calculate the response time of multi-tier distributed systems. They only considered CPU resources in their queuing models, but not other resources like memory, disk I/O and network.
Besides multi-tier Internet applications, distributed systems have many other architectures, and the closed queuing network may not model them well. Stewart and Shen [30] profiled applications to predict the throughput and response time of multi-component online services. Stewart et al. [29] used a transaction mix vector to characterize workloads and exploited nonstationarity for performance prediction in distributed systems. Rather than using queuing models, in this paper we automatically extract a network of invariants from operational systems and use this network as a model for performance analysis. Therefore, our approach provides a systematic way to profile complex services for capacity planning, which is complementary to those classic modeling techniques.

Many companies have developed their own practical approaches to capacity planning. For example, Microsoft [22] published scalability recommendations for their portal server deployment. IBM Global Services [11] developed practical capacity planning processes for web application deployment. Oracle [25] provides capacity planning tools for database sizing. Most of these approaches are developed from practical experience with specific components, and they may not scale or generalize well for capacity planning tasks in large-scale distributed systems. There also exists much work on characterizing web traffic for web server capacity planning. Kant and Won [16] analyzed the patterns of web traffic and proposed a methodology to determine bandwidth requirements for the hardware components of a web server. Barford and Crovella [3] developed a tool to generate representative web workloads for web server management and capacity planning. Our approach does not characterize workloads but extracts invariants to characterize the relationships between the volume of workloads and the capacity needs of individual components.
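To make the contrast with workload characterization concrete, the invariant search can be sketched in a few lines. This is a minimal illustration, not the actual algorithm of [15]: the simple linear pair model y ≈ a·x + b, the normalized fitness score, the 0.9 threshold, and the synthetic metric names are all assumptions made for the example.

```python
import math
import random

def fit_linear(xs, ys):
    """Least-squares fit ys ≈ a*xs + b over one measurement window."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def fitness(xs, ys, a, b):
    """Normalized fitness score; 1.0 means a perfect fit."""
    my = sum(ys) / len(ys)
    err = math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)))
    spread = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 1.0 - err / spread

def search_invariants(windows, threshold=0.9):
    """A pair model is kept as an invariant only if it keeps fitting well
    on every later measurement window, i.e. it holds as workloads vary."""
    metrics = list(windows[0])
    found = []
    for i, mx in enumerate(metrics):
        for my in metrics[i + 1:]:
            a, b = fit_linear(windows[0][mx], windows[0][my])
            if all(fitness(w[mx], w[my], a, b) >= threshold for w in windows[1:]):
                found.append((mx, my, a, b))
    return found

# Synthetic measurements: cpu tracks requests linearly; "noise" tracks nothing.
random.seed(0)
windows = []
for _ in range(5):
    req = [random.uniform(100, 1000) for _ in range(60)]
    windows.append({
        "requests": req,
        "cpu": [0.05 * r + 2.0 + random.gauss(0, 0.5) for r in req],
        "noise": [random.uniform(0, 100) for _ in range(60)],
    })

for mx, my, a, b in search_invariants(windows):
    print(f"invariant: {my} ~ {a:.3f} * {mx} + {b:.2f}")
```

Only the requests/cpu pair survives validation across windows; the spurious pairs involving the unrelated metric are rejected because their fitness collapses on later windows.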
Machine learning methods have also been applied to identify performance bottlenecks in distributed systems. Cohen et al. [6] used a tree-augmented naive (TAN) Bayesian network to learn the probabilistic relationship between SLA violations and resource usage. They further used this learned Bayesian network to identify performance bottlenecks during SLA violations. In their Elba project, Parekh et al. [26] compared the TAN Bayesian network with other classifiers, such as decision trees, for performance bottleneck detection. Both works employed supervised learning mechanisms and required a large number of SLA violation samples to train their classifiers. For capacity planning tasks, this is not practical because we want to avoid such SLA violations in real business. Our approach is able to predict and remove performance bottlenecks ahead of real SLA violations.

In the autonomic computing community [17], there is much work on how to dynamically allocate resources for service provisioning. Hellerstein et al. [10] wrote a book on how to apply feedback control theory to resource management in computing systems. Kephart et al. [8, 33] defined utility functions based on service-level attributes and proposed a utility-based approach for the efficient allocation of server resources. Kusic and Kandasamy [19] used multiple queuing models to characterize the performance of server clusters and then applied control theory for optimal resource allocation. Assuming a large cluster of servers hosting multiple applications, autonomic service provisioning addresses how to dynamically allocate and share resources among these applications so as to achieve some optimal goal, such as maximizing the profits of services. Though this topic is related, our work focuses on offline capacity planning and resource optimization rather than online service provisioning. In addition, dynamic service provisioning also needs proper capacity planning to support system evolution.
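The offline capacity analysis itself, following invariant relationships sequentially from the volume of workloads, can be sketched as a traversal of the invariant network. All metric names, coefficients, and capacity figures below are hypothetical, and the linear edge models are an assumed simplification for illustration.

```python
from collections import deque

# Hypothetical invariant network: each edge (u -> v, a, b) carries a fitted
# model v ~ a*u + b that held under all observed workloads.
invariants = {
    "workload":   [("web_cpu", 0.004, 0.0), ("db_queries", 2.5, 0.0)],
    "db_queries": [("db_cpu", 0.0015, 0.0), ("db_connections", 0.01, 2.0)],
    "web_cpu":    [("app_cpu", 1.2, 0.0)],
}

def estimate_capacity_needs(invariants, start, volume):
    """Propagate a projected workload volume through the invariant
    network (breadth-first) to estimate every reachable metric."""
    needs = {start: volume}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, a, b in invariants.get(u, ()):
            if v not in needs:            # keep the first estimate reached
                needs[v] = a * needs[u] + b
                queue.append(v)
    return needs

def weakest_points(needs, configured):
    """Rank metrics by estimated utilization of their configured capacity;
    the top entries are candidate performance bottlenecks."""
    ratios = {m: needs[m] / configured[m] for m in needs if m in configured}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

needs = estimate_capacity_needs(invariants, "workload", volume=10000)
config = {"web_cpu": 48.0, "app_cpu": 64.0, "db_cpu": 16.0, "db_connections": 150}
for metric, util in weakest_points(needs, config):
    print(f"{metric}: estimated {needs[metric]:.1f} / capacity {config[metric]} "
          f"({util:.0%} utilization)")
```

In this toy configuration the database CPU is the weakest point (its estimated need exceeds its configured capacity), so an operator would upgrade it before the over-provisioned application tier.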
10 Conclusions

For large-scale distributed systems, it is critical to make the right capacity planning decisions during system evolution. Given the many dynamics and uncertainties of user loads, a system without enough capacity could significantly deteriorate performance and lead to user dissatisfaction. Conversely, an oversized system could significantly waste resources and increase IT cost. One challenge is how to match the capacities of various components inside complex systems so as to remove potential performance bottlenecks and achieve maximal system-level capacity. Mismatched capacities of system components could result in performance bottlenecks at one segment of a system while wasting resources at other segments. In this paper, we proposed a novel and systematic approach to profiling services for resource optimization and capacity planning. We collected resource consumption related measurements from distributed systems and developed an approach to automatically search for invariants among these measurements. After extracting a network of invariants, given any volume of user loads, we can sequentially estimate the capacity needs of individual components. By comparing the current resource assignments against the estimated capacity needs, we can discover the weakest points that may deteriorate system performance. Operators can consult such analytical results to optimize resource assignments and remove potential performance bottlenecks.

References

1. Almeida, V., Menasce, D.: Capacity planning: an essential tool for managing web services. IEEE IT Prof. 4(4) (2002)
2. Amazon: irol-newsarticle&id=798960&highlight=
3. Barford, P., Crovella, M.: Generating representative web workloads for network and server performance evaluation. In: SIGMETRICS '98/PERFORMANCE '98: Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, 1998
4. Bracewell, R.: The Fourier Transform and Its Applications, 3rd edn. McGraw-Hill Science/Engineering/Math, New York (1999)
5. Brockwell, P., Davis, R.: Introduction to Time Series and Forecasting, 2nd edn. Springer, Berlin (2003)
6. Cohen, I., Goldszmidt, M., Kelly, T., Symons, J., Chase, J.: Correlating instrumentation data to system states: a building block for automated diagnosis and control. In: OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, p. 16, 2004
7. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms, 1st edn. MIT Press/McGraw-Hill, Cumberland, New York (1990)
8. Das, R., Kephart, J., Whalley, I., Vytas, P.: Towards commercialization of utility-based resource allocation. In: The 3rd International Conference on Autonomic Computing (ICAC2006), Dublin, Ireland, June 2006
9. Fan, X., Weber, W.-D., Barroso, L.A.: Power provisioning for a warehouse-sized computer. In: ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, San Diego, California, USA, 2007
10. Hellerstein, J., Diao, Y., Parekh, S., Tilbury, D.M.: Feedback Control of Computing Systems. Wiley-IEEE Press, New York (2004)
11. IBM. pdf
12. JBoss
13. Jiang, G., Chen, H., Yoshihira, K.: Discovering likely invariants of distributed transaction systems for autonomic system management. In: The 3rd International Conference on Autonomic Computing (ICAC2006), Dublin, Ireland, June 2006
14. Jiang, G., Chen, H., Yoshihira, K.: Modeling and tracking of transaction flow dynamics for fault detection in complex systems. IEEE Trans. Dependable Secure Comput. 3(4) (2006)
15. Jiang, G., Chen, H., Yoshihira, K.: Efficient and scalable algorithms for inferring likely invariants in distributed systems. IEEE Trans. Knowl. Data Eng. 19(11) (2007)
16. Kant, K., Won, Y.: Server capacity planning for web traffic workload. IEEE Trans. Knowl. Data Eng. 11(5) (1999)
17. Kephart, J., Chess, D.: The vision of autonomic computing. Computer 36(1) (2003)
18. Kuo, B.: Automatic Control Systems, 6th edn. Prentice-Hall, Englewood (1991)
19. Kusic, D., Kandasamy, N.: Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems. In: The 3rd International Conference on Autonomic Computing (ICAC2006), Dublin, Ireland, June 2006
20. Ljung, L.: System Identification: Theory for the User, 2nd edn. Prentice Hall PTR, New York (1998)
21. Menasce, D., Dowdy, L., Almeida, V.: Performance by Design: Computer Capacity Planning by Example, 1st edn. Prentice Hall PTR, New York (2004)
22. Microsoft. HA aspx
23. Microsoft. aspx
24. Microsoft Office 2003 system requirements. microsoft.com/kb/
25. Oracle
26. Parekh, J., Jung, G., Swint, G., Pu, C., Sahai, A.: Comparison of performance analysis approaches for bottleneck detection in multi-tier enterprise applications. In: IEEE International Workshop on Quality of Service, New Haven, CT, USA, 2006
27. Petstore:
28. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
29. Stewart, C., Kelly, T., Zhang, A.: Exploiting nonstationarity for performance prediction. SIGOPS Oper. Syst. Rev. 41(3) (2007)
30. Stewart, C., Shen, K.: Performance modeling and system management for multi-component online services. In: NSDI '05: Proceedings of the 2nd Conference on Symposium on Networked Systems Design and Implementation, Boston, Massachusetts, USA, 2005
31. Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A.: An analytical model for multi-tier internet services and its applications. SIGMETRICS Perform. Eval. Rev. 33(1) (2005)
32. Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A.: Analytic modeling of multitier internet applications. ACM Trans. Web 1(1), 2 (2007)
33. Walsh, W., Tesauro, G., Kephart, J., Das, R.: Utility functions in autonomic systems. In: The First International Conference on Autonomic Computing (ICAC2004), New York, May 2004
Guofei Jiang is currently a Department Head with the Robust and Secure Systems Group in NEC Laboratories America at Princeton, New Jersey. He leads a dozen researchers working on many topics in the field of distributed systems and networks. He has published over 80 technical papers and also holds several patents. Dr. Jiang is an associate editor of IEEE Security and Privacy magazine and has also served on the program committees of many prestigious conferences. He holds the B.S. and Ph.D. degrees in electrical and computer engineering.

Haifeng Chen received the BEng and MEng degrees, both in automation, from Southeast University, China, in 1994 and 1997 respectively, and the Ph.D. degree in computer engineering from Rutgers University, New Jersey. He has worked as a researcher in the Chinese national research institute of power automation. He is currently a research staff member at NEC Laboratories America, Princeton, NJ. His research interests include data mining, autonomic computing, pattern recognition and robust statistics.

Kenji Yoshihira received the B.E. in E.E. at the University of Tokyo in 1996 and designed processor chips for enterprise computers at Hitachi Ltd. for five years. He served as CTO at Investoria Inc. in Japan, where he developed an Internet service system for financial information distribution through 2002, and received the M.S. in CS at New York University. He is currently a research staff member with the Robust and Secure Systems Group in NEC Laboratories America, Inc., NJ. His current research focus is on distributed systems and autonomic computing.
Performance Extrapolation using Load Testing Results Subhasri Duttagupta PERC, TCS Innovation Labs Tata Consultancy Services Mumbai, India. subhasri.duttagupta@tcs.com Manoj Nambiar PERC, TCS Innovation
More informationCisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
More informationIntroducing Performance Engineering by means of Tools and Practical Exercises
Introducing Performance Engineering by means of Tools and Practical Exercises Alexander Ufimtsev, Trevor Parsons, Lucian M. Patcas, John Murphy and Liam Murphy Performance Engineering Laboratory, School
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationPerformance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
More informationAn Oracle White Paper March 2013. Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite
An Oracle White Paper March 2013 Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite Executive Overview... 1 Introduction... 1 Oracle Load Testing Setup... 2
More informationScalability. Microsoft Dynamics GP 10.0. Benchmark Performance: Advantages of Microsoft SQL Server 2008 with Compression.
Scalability Microsoft Dynamics GP 10.0 Benchmark Performance: Advantages of Microsoft SQL Server 2008 with Compression White Paper May 2009 Contents Introduction... 3 Summary Results... 3 Benchmark Test
More informationImproved metrics collection and correlation for the CERN cloud storage test framework
Improved metrics collection and correlation for the CERN cloud storage test framework September 2013 Author: Carolina Lindqvist Supervisors: Maitane Zotes Seppo Heikkila CERN openlab Summer Student Report
More informationPSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation
PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation
More informationCase Study - I. Industry: Social Networking Website Technology : J2EE AJAX, Spring, MySQL, Weblogic, Windows Server 2008.
Case Study - I Industry: Social Networking Website Technology : J2EE AJAX, Spring, MySQL, Weblogic, Windows Server 2008 Challenges The scalability of the database servers to execute batch processes under
More informationHP reference configuration for entry-level SAS Grid Manager solutions
HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2
More informationThe Melvyl Recommender Project: Final Report
Performance Testing Testing Goals Current instances of XTF used in production serve thousands to hundreds of thousands of documents. One of our investigation goals was to experiment with a much larger
More informationDIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
More informationNewsletter 4/2013 Oktober 2013. www.soug.ch
SWISS ORACLE US ER GRO UP www.soug.ch Newsletter 4/2013 Oktober 2013 Oracle 12c Consolidation Planer Data Redaction & Transparent Sensitive Data Protection Oracle Forms Migration Oracle 12c IDENTITY table
More informationDell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820
Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820 This white paper discusses the SQL server workload consolidation capabilities of Dell PowerEdge R820 using Virtualization.
More informationPerformance And Scalability In Oracle9i And SQL Server 2000
Performance And Scalability In Oracle9i And SQL Server 2000 Presented By : Phathisile Sibanda Supervisor : John Ebden 1 Presentation Overview Project Objectives Motivation -Why performance & Scalability
More informationApplication of Predictive Analytics for Better Alignment of Business and IT
Application of Predictive Analytics for Better Alignment of Business and IT Boris Zibitsker, PhD bzibitsker@beznext.com July 25, 2014 Big Data Summit - Riga, Latvia About the Presenter Boris Zibitsker
More informationISE 820 All Flash Array. Performance Review. Performance Review. March 2015
ISE 820 All Flash Array Performance Review March 2015 Performance Review March 2015 Table of Contents Executive Summary... 3 SPC-1 Benchmark Re sults... 3 Virtual Desktop Benchmark Testing... 3 Synthetic
More informationChoosing the Right Cloud Provider for Your Business
Choosing the Right Cloud Provider for Your Business Abstract As cloud computing becomes an increasingly important part of any IT organization s delivery model, assessing and selecting the right cloud provider
More information