Deliverable D6.7. Performance testing of cloud applications, Final Release


REuse and Migration of legacy applications to Interoperable Cloud Services (REMICS)
Small or Medium-scale Focused Research Project (STREP)
Project No

Deliverable D6.7 Performance testing of cloud applications, Final Release
Work Package 6
Leading partner: UT
Author(s): Satish Srirama, Huber Flores, Martti Vasar
Dissemination level: Public
Delivery Date:
Version: V1.0
Copyright REMICS Consortium

Executive Summary

This document, D6.7 Performance testing of cloud applications, Final release, is a public deliverable of the project REuse and Migration of legacy applications to Interoperable Cloud Services (REMICS), a Small or medium-scale focused research project (STREP) within the European 7th Framework Programme for ICT Call 5 (FP7-ICT ) Challenge 1: Pervasive and Trusted Network and Service Infrastructures. The study provides the foundations that enable us to adapt migrated applications to dynamic cloud patterns. By using the high-performance, high-quality services provided by the cloud, it is possible to scale the migrated applications on demand based on their usage load. Functional capabilities (e.g. bandwidth, memory, storage) can be added at runtime using automated mechanisms that periodically estimate the provisioning limits of each component utilized by the application. These mechanisms are controlled by performance routines (e.g. events, alarms, policies) defined before the system is put in place, which are triggered when the application faces conditions (e.g. long queues for resource utilization) that affect aspects such as service throughput and cost of operation, among others. Dynamic cloud reconfiguration mainly focuses on optimizing the distribution of application components. Consequently, cloud applications are dependent on technological choices and network topology. In this context, distributed technologies such as Erlang, Mnesia, CouchDB etc. are preferable due to their high level of reliability for managing concurrency, their flexibility for modifying behavior on the fly, and their fault tolerance. However, by following the REMICS methodology, mature legacy applications, which were not developed to scale dynamically (based on replication or hot replacement), are migrated to fit a modernized cloud pattern.
Thus, the adaptation of a migrated system should be driven by a study of its performance properties, i.e. its capacity for handling parallelism/concurrency. Such properties can be obtained by performing stress tests on applications under multiple circumstances (e.g. single or multiple nodes), cloud parameters (e.g. instance type, region) and configurations (e.g. memory cache), and by analyzing the runtime metrics of the infrastructure in which the application is deployed (e.g. CPU utilization, memory usage). Furthermore, the performance analysis of cloud-based applications enables the discovery of potential operational issues, such as bottlenecks caused by load balancers, low-level IO hardware and deprecated software components, which may need to be replaced or updated. Moreover, it provides a way to answer some of the most common questions in the deployment of cloud-based applications: how well are the servers handling heavy load? How congested are the communication channels between the servers? What can be the causes of system failure? In this study we explored performance tools and techniques in order to grant a runtime model [1] a dynamic scaling logic that follows a control and supervision schema. These techniques can be integrated within the core functionality of CloudML [2], which is used for automating the deployment process in the REMICS project. In this context, the deliverable presents an overview of some of the most popular tools for benchmarking (e.g. Tsung, JMeter), load balancing (e.g. HaProxy, Nginx, Amazon Auto Scaling etc.) and resource monitoring (e.g. CollectD, Cacti, etc.), and highlights their benefits and drawbacks by showing their results in multiple experiments. The document also explains the technical and theoretical aspects that may be utilized in the performance-based characterization of the components (aka artefacts in CloudML) of a legacy application deployed with CloudML.
This allows a component to be dynamically replaced or provisioned when the service demand of the system increases or the utilization of a specific resource is unable to handle a specific workload. Finally, the study places emphasis on the most common components of a legacy application (e.g. OLTP/OLAP databases, Web server). The analysis and the experiments are initially conducted on a MediaWiki based case study. However, the results are later applied on the DOME [3] case study of the project by using CloudML to introduce performance monitoring and automatic scaling in the automatic deployment phase. How the DOME

use case adaptation was performed is described in more detail in the deliverable D4.5 REMICS Migrate Principles and Methods.

Versioning and contribution history

Version 0.1: Structured the deliverable (Satish Srirama)
Version 0.2: Initial content for sections 4, 5, 6, 7 and 8 (Martti Vasar)
Version 0.3: Updated sections 3, 4, 5, 6, 7, 8, 9 and 10 (Huber Flores)
Version 0.4: Edited and finalized the sections (Satish Srirama)
Version 0.5: Prepared the deliverable for internal review (Satish Srirama)
Version 0.6: Changes applied according to review 1, provided by Christian Hein (Huber Flores)
Version 0.7: Changes applied according to review 2, provided by Brice Morin (Huber Flores)
Version 1.0: Finalized the deliverable for submission (Satish Srirama)

Table of contents

EXECUTIVE SUMMARY
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
INTRODUCTION
  BACKGROUND
  OBJECTIVES OF THE DELIVERABLE
MEASURING PERFORMANCE OF CLOUD INSTANCES
  CACTI
  COLLECTD
  GANGLIA
  UBUNTU PACKAGE SYSSTAT
  COMPARISON OF THE COLLECTION TOOLS
VERIFYING THE SCALABILITY OF APPLICATIONS
  VERTICAL SCALING
  HORIZONTAL SCALING
  DEFINING THE SCALABILITY PROPERTIES OF APPLICATIONS
  STUDY OF LOAD-BALANCERS
    Pen
    Nginx
    HaProxy
    Comparison of load-balancers
  AUTO-SCALING
    Using arrival rate for auto scaling
    Adding/terminating servers
    Interactions of the framework
    Amazon Auto Scale
    Optimal heuristics
  BENCHMARKING TOOLS
    JMeter
    Tsung
MEDIAWIKI, THE CONSIDERED CASE STUDY APPLICATION
  MEDIAWIKI CONFIGURATION LAYOUT
ANALYSIS OF THE CASE
  CONFIGURATION OF THE FRAMEWORK FOR VERIFYING THE QOS OF THE WEB APPLICATION

  7.2 COMPARISON OF AMAZON INSTANCES
  EXPERIMENTS
    MEASURING SERVICE TIME
    MEASURING MAXIMUM THROUGHPUT
  RESULTS OF THE PRELIMINARY EXPERIMENTS
  CONFIGURATION AND RESULTS OF THE EXPERIMENT
  CHARACTERISTICS OF DIFFERENT WEB SERVICES
IDENTIFYING SIMILAR PERFORMANCE VARIABLES FOR ADAPTING THE APPROACH TO ANY SOA APPLICATION
  PREPARING THE INSTANCE
  INSTALLING SOFTWARE ON THE INSTANCE
  INSTALLING MEDIAWIKI
  UPLOADING WIKIPEDIA ARTICLE DUMPS INTO MEDIAWIKI DATABASE (OPTIONAL)
  FRAMEWORK INSTALLATION
    Monitoring tool
    Framework
  BUNDLING IMAGE TOGETHER
  CREATING AN AUTO-SCALE GROUP FOR AMAZON TO USE AUTO SCALE FOR DYNAMICALLY ALLOCATING THE SERVERS
MODERNIZATION OF OLTP/OLAP SYSTEMS
  OLTP/OLAP TO CLOUD
  LOAD DISTRIBUTION IN A MULTI-NODE SYSTEM WITH CENTRALIZED DATABASE ACCESS
    Transactional analysis for MediaWiki in the cloud
    Storing state of the user
  SUPPORTING STATEFUL TRANSACTIONS IN MEDIAWIKI
    Introducing cookies in the application design
    Share a common space in the cluster for storing session information
    Using a centralized memcached or database
    Load balancing with sticky sessions
  ADDING AND REMOVING SERVERS IN STATEFUL APPLICATION
APPLICATION TO REMICS REQUIREMENTS
SUMMARY
REFERENCES

1 List of Figures

Figure 1 Spikes in a response time while using Cacti
Figure 2 Performance spikes seen from an experiment while using Cacti
Figure 3 System load collected with CollectD and graph generated with RRDtool
Figure 4 Ganglia front-end showing the status of the cluster
Figure 5 The monitoring tool using sysstat to draw the CPU usage graph with JavaScript and HTML
Figure 6 Scaling scenario for vertical configuration
Figure 7 Scaling scenario for horizontal configuration
Figure 8 Pen and nginx load-balancer comparison in Amazon EC2 cloud
Figure 9 HaProxy statistics web tool
Figure 10 Algorithm for allocating servers in the cloud by the framework
Figure 11 Showing how the framework works internally for requesting instances, configuring the instances, requesting performance metrics and terminating the instances
Figure 12 Response times with different configurations calculated with Equation
Figure 13 JMeter console test plan
Figure 14 Tsung statistics web tool
Figure 15 Cumulative distribution function of response time with different caching policies using Amazon EC2 instance c1.medium
Figure 16 MediaWiki web application running in the cloud, red links indicate missing content behind the link
Figure 17 MediaWiki configuration layout in the cloud and how the load generator requests are going through the system
Figure 18 How the requests are going through the MediaWiki service, where Tsung is used to generate the load
Figure 19 Ramp up experiment for measuring the maximum throughput for three different instance types
Figure 20 Cumulative distribution of response times for different instances
Figure 21 Clarknet trace with 2 week traffic, red indicates the traffic used to generate the load
Figure 22 24-hour experiment of server allocation for different policies
Figure 23 CPU utilization for different servers compared with arrival rate
Figure 24 Throughput per Apache server compared to arrival rate
Figure 25 MediaWiki successfully installed in the Amazon EC2 cloud
Figure 26 Showing MediaWiki application with part of the Wikipedia dump uploaded into the database, red links indicate missing content
Figure 27 The distribution components of MediaWiki application in the cloud

2 List of Tables

Table 1 Software used for deploying infrastructure and their roles
Table 2 Results of the experiments with different instance types
Table 3 Results of the experiments with different instance types
Table 4 Results of comparing different MediaWiki versions using different instance types
Table 5 General comparison of the tools
Table 6 Abbreviations

3 Introduction

This document is deliverable D6.7 of the REMICS project, which extends the work presented in deliverable D6.6 [4] with post-migration considerations and performance-based techniques that may be applied to characterize OLTP/OLAP systems in the cloud. The document introduces a complete overview of possible tools that can be utilized for monitoring and analyzing the performance of legacy applications migrated to the cloud. By using these tools to explore multiple performance metrics under different scenarios (e.g. topologies), we aim to identify the optimal configuration properties that allow any application to benefit from the distributed nature and provisioning features of the cloud. The tools considered in the analysis were selected based on their open source nature and widespread usage, which translates into many advantages: an extensive technical support knowledge base (such as online discussion forums), constantly improving functionality, and a high level of portability that simplifies replicating specific scenarios, among others. Moreover, we have considered these tools based on our previous experiments and experiences, where we identified specific features that enrich the overall performance analysis. For example, in our experiments CollectD proved scalable enough to handle a large number of hosts in a cluster without problems related to IO latencies. Since monitoring the performance of cloud-based applications involves varying and measuring different configuration aspects such as user load, number of servers, load distribution policy etc., we have relied on different kinds of tools for benchmarking, load balancing and system performance measurement. For benchmarking, we have analyzed JMeter and Tsung; for load balancing, HaProxy, Pen, Nginx and Amazon Auto Scale; and for system performance measurement, Cacti, CollectD, Ganglia and the Ubuntu package sysstat.
Moreover, we have conducted multiple sets of experiments (described in detail in section 9) utilizing these tools for analyzing a MediaWiki application in a configuration similar to Wikipedia. Some of the prominent results from this study include: 1) how to measure the performance of an application periodically without introducing high computational loads to the system, 2) how to scale an application based on performance metrics or operational costs, 3) what sort of data is necessary for defining the execution properties of any cloud-based application at runtime, and 4) which features should be considered when choosing a tool for measuring performance. Furthermore, from the experiments we could also see that the more sophisticated the tool is for measuring, collecting and reporting data, the greater the impact it has on the overall performance of the system. For example, among the tools for measuring system performance, the Ubuntu package sysstat is preferable to Cacti for measuring CPU information; however, Cacti provides better functionality for generating graphs. Similarly, among the benchmarking tools, JMeter eases the process of creating and executing a test plan by providing a graphical interface, but its performance for generating load is poor when compared with Tsung. The details are provided in the document at the respective locations. The analysis and experiments were initially applied to MediaWiki, as the study in its first stages aimed at exploring mechanisms that may be utilized to scale a migrated system to a cloud pattern. However, the latest results of this study were used in conjunction with CloudML on the DOME case study presented by the REMICS project. Moreover, load balancing and replication-based performance principles are integrated within the deployment phase of the REMICS methodology for scaling legacy systems.
Finally, the deliverable also presents the principles of adapting OLTP/OLAP systems to match SOA/cloud patterns. Database adaptation happens via performance-based characterization of the transactions. However, since the performance of a database may also be affected by the multi-node configuration of the application built on top of it, we studied how the load distribution can be effectively managed without losing the scaling properties of the cloud. The rest of the document is organized as follows: section 4 examines and compares the tools that can be used for measuring the performance of cloud-based applications; section 5 introduces the mechanisms that are utilized for scaling and performing stress tests on the applications; section 6

examines the described tools by exploring a case study based on MediaWiki; section 7 presents the results of the experiments; section 8 discusses how the knowledge acquired in this deliverable can be extrapolated in order to create an approach for measuring the performance of any cloud-based application; section 9 presents the modernization of OLTP/OLAP systems to the cloud; and section 10 shows how the study is valuable and useful to REMICS. Finally, section 11 summarizes the results of the study and introduces future research directions.

3.1 Background

Cloud computing [5] provides an appealing environment for deploying pilot projects, stress testing Web applications and services, and automatically provisioning servers to decrease the cost of operation. It allows resources to be acquired on demand and in variable amounts. Compared to on-premise servers, the cost of upkeep is lower: deployment and stress testing occupy the servers for only a small portion of time, there is no need to invest in the infrastructure, no maintenance cost for the servers, and no additional workforce needed to deal with them. The cloud customer pays only for the resources actually used. Cloud computing can thus be an appealing environment for running web applications. However, setting up cloud environments for stress testing and installing the needed infrastructure still requires a significant amount of manual effort. To aid performance engineers in this task, we have developed a framework that integrates several common benchmarking and monitoring tools. The framework helps performance engineers to stress-test applications under various configurations and loads. Furthermore, the framework supports dynamic server allocation based on incoming load, using response-time-aware heuristics. We validated the framework by deploying and stress-testing the MediaWiki application.
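As an illustration of the idea behind load-based allocation, the sketch below estimates the number of servers from the measured arrival rate and mean service time. The formula and the utilization threshold are illustrative assumptions in the spirit of classic queueing heuristics, not the framework's exact algorithm.

```python
import math

def servers_needed(arrival_rate, service_time, target_utilization=0.7):
    """Estimate how many identical servers keep each below a target utilization.

    arrival_rate: incoming requests per second across the whole service
    service_time: mean time (seconds) one server spends on one request
    target_utilization: illustrative threshold, not a value from the deliverable
    """
    offered_load = arrival_rate * service_time   # mean number of busy servers
    return max(1, math.ceil(offered_load / target_utilization))

# Example: 100 req/s, 300 ms mean service time, keep servers under 70% busy.
print(servers_needed(100, 0.3))  # -> 43
```

A real controller would additionally smooth the measured arrival rate and respect cool-down periods before adding or terminating instances, as discussed later in this document.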
An experimental evaluation was conducted to compare the response-time-aware heuristics against Amazon Auto Scale. Keeping an eye on the instances running in the cloud is of great importance: it helps to discover possible bottlenecks, how well the servers are performing under high load, and how resources are distributed between the servers. Cloud providers offer many different instance types with widely varying characteristics. Monitoring the servers will identify which servers in a complex service can be replaced with faster or slower virtual machines to optimize the service configuration. This is important as we can then reduce the cost of running the service while at the same time improving performance and serving more clients. This deliverable aims to give an overview of different monitoring tools and describes ways of measuring the performance of a service in the cloud. Several experiments have been conducted using the MediaWiki application in a configuration similar to Wikipedia.

3.2 Objectives of the deliverable

The objectives of the deliverable can be summarized as follows:
- To analyze the advantages and disadvantages of different tools that can be used for measuring the performance of cloud-based applications.
- To examine the mechanisms for scaling and performing stress tests on the applications.
- To investigate how to measure the performance of legacy applications running on the cloud without introducing high computational expenses to the system performance.
- To define the relevant system information that has to be measured, in order to establish the execution properties of an application on the fly.
- To define how to scale an application based on performance metrics or operational costs.
- To help to identify an approach that can be used for simplifying the deployment process of cloud-based applications.
- To investigate the post-migration issues that need to be overcome for scaling OLTP/OLAP systems in the cloud.
The issues were identified in deliverable D.6.6 (interim release) and are explored in detail in this deliverable (final release).

The study is relevant for REMICS, and the knowledge generated from this deliverable is applicable to several requirements of the project, addressed in different deliverables. The details are mentioned in section 10.

4 Measuring performance of cloud instances

There are several data sources provided by Linux based operating systems (such as the virtual folder /proc) and several tools provided by third parties. Some of the well-known data collection and monitoring tools are Ganglia, Cacti and CollectD. The first two are capable of gathering the data and also generating graphs from the collected information, but CollectD does not provide graphing functionality out of the box; there are third party and community provided solutions for drawing figures. They all use RRD (Round-Robin Database) files to store the data, as this provides a fixed file size and the capability of storing collected values over large time spans. The data is stored in a flat file containing a circular array whose size is fixed upon creation. It is important to note that there are compatibility issues between different architectures (32 bit vs 64 bit), and it is not possible to draw graphs from RRD files if the architecture does not match, for example when performance metrics collected on a 32 bit machine are to be drawn on a 64 bit machine. RRDtool was created by Tobias Oetiker and is used to draw the graphs of RRD files. It is flexible and allows specifying different time spans and resolutions to show the data. The tool takes care of the graph layout and the assignment of the numerical values on the X and Y axes, and the Y axis is scaled automatically by the largest and smallest values to fit the graph. To resolve the conflict between 32 bit and 64 bit architectures, RRDtool includes the rrddump utility to convert RRD files to XML, and the rrdrestore utility to convert the XML files back to RRD format on the correct architecture.
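The dump-and-restore round trip just described might be scripted as in the sketch below. The file names are hypothetical, and the commands are only assembled (not executed) here so the sequence can be inspected or passed to subprocess.run as needed; `rrdtool dump`/`rrdtool restore` are the command forms of the rrddump/rrdrestore utilities.

```python
# Sketch of the 32-bit -> 64-bit RRD migration described above.
# File names are hypothetical; commands are assembled but not run.

def rrd_migration_commands(rrd_file, xml_file, restored_rrd):
    return [
        ["rrdtool", "dump", rrd_file, xml_file],   # on the 32-bit source machine
        ["gzip", xml_file],                        # compress before transfer
        ["gunzip", xml_file + ".gz"],              # on the 64-bit target machine
        ["rrdtool", "restore", xml_file, restored_rrd],
    ]

for cmd in rrd_migration_commands("cpu.rrd", "cpu.xml", "cpu-64bit.rrd"):
    print(" ".join(cmd))
```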
The conversion consumes more disk space, as the XML file is at least 10 times bigger, and therefore it is advised to compress the XML files before transferring them to other machines for further processing. The monitoring tools mentioned above are all capable of collecting information at good resolution and covering a large range of different parameters. Many benchmarking tools are also capable of measuring performance metrics of the system under heavy load; however, the range of parameters they cover is smaller and the information they provide is in many cases not essential, so this functionality is of limited use.

4.1 Cacti

Figure 1 Spikes in a response time while using Cacti
Figure 2 Performance spikes seen from an experiment while using Cacti

Cacti is a monitoring tool written mostly in the PHP language, implementing the functionality and logic for adding, changing and removing servers and graphs in the monitoring tool. It has an administration interface, where the user can set up new graphs or download additional graphs from the Cacti homepage and import them into the system. The public interface provides information about metrics gathered from the servers. It uses command line scripts to fetch the monitored values. When taking advantage of the cron job scheduler, the minimum possible interval for data collection is 1 minute, but by default it is set to 5 minutes. The information gathering process is rather CPU intensive, and collecting data from a large number of servers with a wide variety of parameters can affect the performance of the servers; it is therefore advised to use separate servers for monitoring purposes. Figure 1 and Figure 2 show how a c1.medium instance running in the Amazon EC2 cloud experiences a decrease in performance while using Cacti as a monitoring tool: for every gathering cycle, the system is overloaded for a short time and slows down, resulting in slower response times. In general, the system load of the server should not be greater than 1 in order to provide a suitable response time. However, from Figure 2, we can observe that every time Cacti is triggered, the system load exceeds that limit. System load is a value derived from, among other variables, the length of the CPU process queue. Several experiments were conducted for testing the tools under different conditions. The experiments consist of measuring the service time and the system's maximum throughput. Measurements were gathered in two ways, using static and dynamic loads.
A static load is generated by a single job that issues requests at a constant arrival rate; in contrast, a dynamic load is generated by performing a ramp-up, meaning that the arrival rate of the load is increased periodically. Our experiments have shown large spikes while collecting data from the server, affecting overall performance. The MediaWiki case study, which is discussed in section 7, has shown at least a 150 ms increase in response time when the Cacti gathering process starts working. The information gathering takes too long: with only 20 servers it can take at least half a minute. There is an additional tool called Spine that allows multithreaded polling of the data from different servers to speed up the process and therefore decrease the performance loss. Since Cacti mostly uses SSH (Secure Shell) to access other servers for data collection purposes, authentication takes time, and for a large number of servers the collection process is slow.

4.2 CollectD

CollectD uses a different approach compared to Cacti: it uses daemons to push information out to a master server, allowing data to be collected at a much smaller interval. It is written in C and uses a small amount of resources for gathering purposes, generating almost no overhead. As it has a small footprint in the system, the default interval for collecting the data is set to 10 seconds. It has built-in plug-ins to ease the data collection process, and it is possible to write your own plug-ins or download new plug-ins from the internet to improve the detail of the collected data.

Figure 3 System load collected with CollectD and graph generated with RRDtool

While setting up CollectD for various servers, it is possible to define a multicast address. This way, it does not matter what the IP address of the receiving server is, as it will listen on the multicast address for incoming packets. Using multicast is especially useful when deploying the monitoring tool in the cloud, because every time a new instance is started, a new IP address is assigned. This way, we can easily configure the master-client setup without needing to know the IP addresses of the other servers, and we can bundle the instance image with the correct multicast address already set, without having to change the configuration to fix IP addresses. Our experiments have shown that there is no difference in the servers' overall performance whether CollectD is enabled or not. The only downside of CollectD is that it does not have a good interface for displaying the collected data. It has some sample scripts for this purpose, but compared to Ganglia or Cacti it requires more involvement from the user. Fortunately, there are some front-end interfaces developed by the community to support displaying the collected information. A second problem with CollectD, shared with the other two monitoring tools, is the use of RRD files. Interpreting and processing the data is not straightforward, as the files are in binary format. In order to convert them to a more human readable format, or to make the information easy to parse by other scripts and applications, it is necessary to use utilities from the RRDtool set. Collecting values on the fly (i.e. CPU use, memory, network bandwidth for one or several servers) is not straightforward, and processing the data takes time and consumes CPU resources.
One option to solve this would be to write a plug-in that pushes the most recent values through a TCP/UDP port in order to simplify the process of fetching the values by other programs. With the default configuration, one CollectD RRD file stored on the hard drive takes up to 0.1 megabytes, and the total size of the RRD files for one server can reach 30 megabytes.

4.3 Ganglia

Ganglia, similar to CollectD, uses a client-server architecture, where a gmond client daemon runs on each server. The primary tasks of gmond are to monitor changes in the host state and announce relevant changes. Gmetad runs on representative cluster nodes and periodically polls a collection of client data sources, parses the collected data, and saves all volatile metrics to an RRD database. The third part of the system consists of a PHP Web front-end that is responsible for presenting the collected values. The front-end of Ganglia is very dynamic: there are ways to display different

data, filter it, and change the time spans of the graphs. If caching is not enabled, each page request requires drawing the graphs again and parsing the Ganglia XML tree to get information about the structure of the cluster. In order to achieve good response times, a powerful machine is required to redraw the graphics and serve page requests quickly without overloading the machine.

Figure 4 Ganglia front-end showing the status of the cluster

4.4 Ubuntu package sysstat

The sysstat package offers a large selection of command line tools that access Linux system information to fetch different monitoring values. It mainly uses data from /proc, converting the data into human readable tab-separated values. It is also possible to store monitored values using the sar tool, but the default time interval is large (5 minutes) and some additional data processing is needed to get the binary sar files from the previous day into proper shape. There is a tool called ksar for viewing files generated by sar. To get this information, logging of the values by sar must be enabled in the configuration file; it is also possible to define for how long the logs are kept on the hard drive. We have built a monitoring tool that uses the sysstat command line tools to collect important performance metrics from the server and push the collected information out through a predefined TCP port. This way, gathering the data is relatively fast and no extra processing has to be done to use it. The second option for gathering data was to connect to the server using SSH, as Cacti does, but the time it takes to connect varies from 1 to 10 seconds, and while the server is under high load it can sometimes take up to 4 minutes. This behaviour is not acceptable, as important monitored values are missed and the data is not complete. The monitoring tool runs a set of commands from sar and iostat to gather all the necessary information.
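A minimal sketch of this collector idea is shown below: one function parses a data line of `sar -u` output (in the column layout shown later in this section), and another exposes the latest values over a TCP port. The port number, field names and output format are illustrative assumptions, not the tool's actual implementation.

```python
import socket

# Sketch only: parse one data line of `sar -u 1 1` output and expose
# the most recent values over a TCP port for other programs to fetch.

SAR_FIELDS = ["user", "nice", "system", "iowait", "steal", "idle"]

def parse_sar_cpu_line(line):
    """Turn e.g. '04:03:12 PM all 1.00 0.00 0.50 0.20 0.00 98.30' into a dict."""
    parts = line.split()
    values = parts[-len(SAR_FIELDS):]        # the last six columns are percentages
    return dict(zip(SAR_FIELDS, (float(v) for v in values)))

def serve_latest(metrics, port=8999):
    """Answer each TCP connection with the most recent metrics as one line."""
    payload = " ".join(f"{k}={v}" for k, v in sorted(metrics.items()))
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.sendall(payload.encode() + b"\n")

sample = "04:03:12 PM all 1.00 0.00 0.50 0.20 0.00 98.30"
print(parse_sar_cpu_line(sample)["idle"])  # -> 98.3
```

In a deployed collector, the parser would be fed from a periodically executed `sar`/`iostat` process and `serve_latest` would run in a loop, so the latest values are always available without touching RRD files.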
The overhead of the process is minimal and does not affect the overall performance of the service. The maximum throughput of Apache stayed the same and, at the same time, the tool was capable of collecting the data during periods of high load.

Figure 5 The monitoring tool using sysstat to draw the CPU usage graph with JavaScript and HTML

Upon the successful collection of data, values are parsed and utilized according to the performance policies of the cloud-based application.

$ sar -u 1 1
Linux generic (test-laptop) 07/13/2012 _i686_ (2 CPU)
04:03:11 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
04:03:12 PM  all
Average:     all

4.5 Comparison of the collection tools

While tools such as Cacti or Ganglia can be used for measuring the performance metrics of an application, their inclusion in its environment may increase variables such as the CPU system load and thus decrease the service performance of the entire system. Different tools introduce different overheads, which can be directly related to the functionality provided by the tool. Therefore, selecting the right tool is fundamental when measuring the performance of cloud-based applications. In this analysis, the tools are compared taking into consideration different characteristics, such as the simplicity of collecting, manipulating and presenting statistical data, the effort required for installing and maintaining the tool, and the system requirements that must be fulfilled to run the collecting process. Moreover, it is discussed which tools and architectures are preferable for measuring the performance metrics. Both Ganglia and CollectD use a client-server approach, where the client daemon on the instance sends collected performance metrics to the server instance. The Ganglia monitoring tool includes a PHP front-end for visualizing the collected data, making it easier to configure and draw the graphs. CollectD, on the other hand, does not include an out-of-the-box solution for displaying the collected data; it has some community-created templates and examples of how to use RRDtool to generate such graphs, but it is up to the user to combine these into a front-end application.
Because of the architecture they use, the data gathering process is relatively fast and does not affect the server's overall performance.

Cacti is mostly built in PHP; some additional tools are written in C or as shell scripts. Cacti is similar to Ganglia and includes a front-end framework to display the collected information. The main disadvantage is that the extensive use of PHP code makes data processing and collecting slow. Cacti mostly uses SSH for data gathering, but the authentication process is slow, and accessing a large number of different servers in this way is time consuming and can affect the overall performance of the service. Ganglia and CollectD are most suitable for collecting performance metrics from a large server pool, and Cacti should be considered only when the data needs to be collected from a small number of servers. Setting up graphs and parameters to collect is easiest with Cacti, as it includes custom templates, users can write their own gathering tools, and the community has generated a variety of collection methods that can easily be re-used.

All three applications use the RRD flat file type to store the collected values. The advantage of this file type is that the files are fixed in size, no matter how long the tools have been running, and graphs can still be drawn for long time spans with relatively little time and effort. The disadvantage of RRD files is that retrieving the last values collected by the application requires the use of RRDtool, and it can be a CPU-intensive task to find the proper location. Ganglia and CollectD buffer the values to reduce write operations, allowing data to be collected from hundreds of servers, but this means that the data needs to be flushed to the file to get the latest information.

Our study was searching for a solution where the last collected values from all the running servers in the cloud are available to the framework. We can use the collected values to make decisions to add servers to compensate for load, or to remove idling servers. This was achieved using the sysstat package for collecting the necessary information. The architecture is similar to Ganglia and CollectD, where for every server an additional service takes care of collecting the data and forwarding the information to the server. The overhead of the collection process is low, but slightly (though not significantly) higher than that of Ganglia running on top of a Java Virtual Machine.

5 Verifying the scalability of applications

Applications running on the cloud can be configured to scale out or up, according to different policies or conditions which are met while the system is running. Depending on the approach used for scaling the application, the system can experience different re-configuration issues such as long cool-down times and inactivity periods, among others. Depending on the purpose, an application can be configured to scale in a vertical or horizontal fashion.
5.1 Vertical scaling

Vertical scaling is complex, because existing application servers need to be exchanged for more powerful machines (scaling up) or slower machines (scaling down). The transition between two servers must be transparent; otherwise we might lose jobs in the system and fail to meet the SLA. This can frustrate customers, as the service is temporarily down, and in the longer run cause the loss of clients. It is important to keep in mind that databases often cache common queries to make the service faster. Replacing machines means the cache is deleted and a new cache has to be built. This can make a new, more powerful machine actually slower for a limited time when compared to the previous machine.

Figure 6 Scaling scenario for vertical configuration

5.2 Horizontal scaling

Horizontal scaling is simpler, as new servers are requested to run alongside existing servers (scaling up) or existing servers are terminated (scaling down). Horizontal scaling works best with a front-end load-balancer, which selects the most suitable back-end servers from the server pool to distribute incoming jobs. Newly requested servers can have any specifications and characteristics, but it is better if their performance is similar to that of the current servers. This makes it easier to balance the load so that the back-end servers are equally loaded. Without knowing the details of the system, and when using different server types for the back-end, it is easy to overload the slower machines while the faster machines sit idle.

Figure 7 Scaling scenario for horizontal configuration

5.3 Defining the scalability properties of applications

Scalability of applications in the cloud allows us to reduce the cost of running the service, as it is possible to automatically adapt to incoming traffic. With a higher load we need more servers to successfully accommodate all the requests, and with a lower load we can remove the extra servers. In this way it is possible to economize on running the service, as resources are not wasted and all of the servers are moderately utilized all the time. When verifying the scalability of an application, one has to make sure that it is possible to distribute the service into smaller parts. With a centralized database and cache system, it is quite easy to scale applications in the cloud. Mostly, the workload for the database and the cache system is much lower compared to the servers connecting with these systems (e.g. Apache servers). The database is meant for storing and retrieving the data, whereas the application layer takes care of displaying and processing the data.
Processing the data can be a CPU-intensive task: there have to be many servers replicating the application layer to meet the increasing load on the system. Many web applications use regular expressions to format text into the correct layout and complex user logic to drive the application layer. These tasks need processing power, and services with large customer bases need additional servers to maintain a reasonable response time without overloading the system (making it unresponsive).

5.4 Study of load-balancers

Different load balancers have been tested, each with different characteristics and properties. There is no single load balancer that can meet everyone's expectations. They all differ in ease of configuration and set-up, maximum throughput, and the algorithms used to distribute incoming requests to back-end servers. Another important aspect is the difficulty of fetching statistics from the load balancer, and the sort of information it can provide. Some load balancers will give only basic information, such as the number of connected clients and generated requests, but others will also provide additional information, such as how many clients have been connected to each back-end server and the current network bandwidth of the system.

Figure 8 Pen and nginx load-balancer comparison in the Amazon EC2 cloud (service throughput [rps] over time [hours] for pen and nginx, against the arrival rate)

5.4.1 Pen

Pen is a simple load balancer written in C that can be configured and executed directly from the command line. It is the simplest to get running, but lacks several configuration options. It has two different algorithms for distributing the load: the first is the round-robin algorithm and the second is an extended version of it, the least round-robin algorithm, where requests are routed to the server under the least load. Pen has poor performance and cannot fully utilize the computer's resources. Experiments in Amazon EC2 have shown that, using a c1.medium instance, it can only use up to 30-40% CPU and serve only as many as 500 requests per second. Increasing the load further makes the service unstable and, as a result, some of the requests are rejected. The statistics given by the Pen interface are lacking: it is not possible to retrieve the number of requests the system has received, and statistics have to be retrieved from the command line. This means that there must be a way to access the machine to run command-line scripts, or a cronjob must be configured to automatically generate the statistics. Figure 8 shows the difference between the pen and nginx load-balancers. It clearly shows that Pen is unstable under large load and rejects a large number of the jobs entering the system. Both configurations used a simple round-robin algorithm to distribute the load.
Pen, by default, tries to forward requests from clients with the same IP to the same back-end server, to ensure that session state is preserved for each visitor. This option was turned off during the comparison. There were 25 back-end servers running the MediaWiki application, and all the servers used c1.medium as the instance type to have good CPU power and overall performance. The following commands demonstrate how to execute the pen daemon and how to terminate it:

$ sudo pen -r -t 10 -S 2 -p /var/run/pen.pid 80 server-1:80 server-2:80
$ sudo kill -9 `cat /var/run/pen.pid`

5.4.2 Nginx

Nginx 8 is a load balancer written by Igor Sysoev in C, and can be configured in such a way that restarting the service on the fly is possible without losing any jobs or incoming requests in the system. This makes it an ideal candidate for scalable applications. One strong argument for nginx is the ease of retrieving the number of arrivals entering the system, as nginx provides a module that shows various statistics about the current status of the service. Unfortunately, it does not provide any information on how many jobs are in the queue or how many jobs are on each back-end server. The tool's performance is relatively good: we were able to serve at least 700 requests per second using a c1.medium instance in the Amazon EC2 cloud, with only 25% of the CPU used. A further increase was not possible because of network limitations, but studies have shown that nginx can work much better in bigger systems which have more network capacity.

The nginx daemon can be started from the command line using the simple command nginx. There are additional options for specifying the configuration file and whether the daemon should be terminated or restarted. The following commands demonstrate how the nginx service is started, reloaded and terminated:

$ sudo nginx
[emerg]: bind() to :8080 failed (98: Address already in use)
[emerg]: bind() to :8080 failed (98: Address already in use)
[emerg]: bind() to :8080 failed (98: Address already in use)
[emerg]: bind() to :8080 failed (98: Address already in use)
[emerg]: bind() to :8080 failed (98: Address already in use)
[emerg]: still could not bind()
$ sudo nginx -s reload
$ sudo nginx -s stop
$ sudo nginx -s stop
[error]: open() "/usr/local/nginx/logs/nginx.pid" failed (2: No such file or directory)

The first command is unsuccessful because nginx or some other service is already running on port 8080, and after 5 unsuccessful attempts to bind the port, the action is terminated.
The second command shows how to reload the service (if the server pool or configuration has changed) and the third demonstrates how to stop the service. The command is successful if there is no output. Executing the termination command a second time results in an error message stating that no such process was found in the system. The parameter -s indicates that a signal is sent to the nginx daemon.

user www-data;
worker_processes 1;

http {
    upstream backend {
        server xxx.xxx:80;
        server xxx.xxx:80;
        server xxx.xxx:80;
    }

    server {
        listen 80 default;
        server_name localhost;
        location / {
            proxy_pass http://backend;
        }
    }
}

By default, the nginx configuration file is located in the /usr/local folder and consists of different blocks. The upstream block defines the list of back-end servers to which requests are forwarded when a client connects with nginx on port 80. If the configuration file is changed, the nginx service needs to be reloaded in order for the changes to take effect. The framework automatically rewrites the load-balancer configuration file to add and/or remove Apache servers from the back-end server list.
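A minimal sketch of how such automatic reconfiguration could look is shown below. The configuration path, the upstream name `backend` and the rendering helper are assumptions for illustration; `nginx -s reload` is the reload command shown earlier.

```python
import subprocess

NGINX_CONF = "/usr/local/nginx/conf/nginx.conf"   # assumed default path

TEMPLATE = """user www-data;
worker_processes 1;
http {{
    upstream backend {{
{servers}
    }}
    server {{
        listen 80 default;
        server_name localhost;
        location / {{
            proxy_pass http://backend;
        }}
    }}
}}
"""

def render_conf(backends):
    """Render the configuration for the current back-end server pool."""
    servers = "\n".join("        server %s:80;" % ip for ip in backends)
    return TEMPLATE.format(servers=servers)

def reconfigure(backends):
    """Write the new pool and reload nginx without dropping requests."""
    with open(NGINX_CONF, "w") as f:
        f.write(render_conf(backends))
    subprocess.check_call(["nginx", "-s", "reload"])
```

Each provisioning decision would call `reconfigure` with the IP list of the currently running Apache servers.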

Active connections: 291
server accepts handled requests
Reading: 6 Writing: 179 Waiting: 106

The above shows output from the nginx HttpStubStatusModule page, where the first line indicates the active connections in the system, the line below "server accepts handled requests" lists how many requests from clients have been accepted, handled and requested during the running time of the nginx daemon, and the last line indicates how the active connections are divided within the system. HttpStubStatusModule is a module added to the nginx core that is used for generating statistics. The framework connects with the nginx HttpStubStatusModule to fetch the total number of requests (with the default configuration the page is located at a fixed URL, whose last part can be changed in the configuration file). The number presented by nginx is a running total; therefore it is necessary to store previously collected values and timestamps. This is needed for calculating the arrival rate for a time span. The framework measures arrivals in requests per second. This is achieved using Equation 1, where h indicates the total number of arrivals and t indicates the timestamp when the value was collected. Index 2 refers to the latest collected value, t is the time in seconds since the UNIX epoch, and the result is in requests per second.

λ = (h₂ − h₁) / (t₂ − t₁)

Equation 1 Calculating the arrival rate with two measuring points

5.4.3 HaProxy

HaProxy 9 is a lightweight mechanism for load balancing that distributes TCP and HTTP-based requests among a set of servers. It collects a larger amount of statistics that can be visualized via a web browser, as shown in Figure 9, and provides a better overview of the service during runtime. The statistics show, among other things, how many connections have been made to the back-end servers, how many requests are queued and how many of them are being processed. Moreover, it also provides information about the network bandwidth utilization of each commodity (back-end) server.
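Returning to nginx for a moment, Equation 1 can be applied to the stub_status counters as in the sketch below; the sample counters are invented, since the listing above elides them, and the parsing assumes the standard layout of the status page.

```python
def total_requests(status_page):
    """Third counter on the line after 'server accepts handled requests'."""
    lines = status_page.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("server accepts handled requests"):
            return int(lines[i + 1].split()[2])
    raise ValueError("unexpected stub_status layout")

def arrival_rate(h1, t1, h2, t2):
    """Equation 1: requests per second between two measuring points."""
    return (h2 - h1) / float(t2 - t1)

# Invented sample in the stub_status layout (the counters are illustrative).
STATUS_SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""
```

Two consecutive calls to `total_requests`, together with their timestamps, give the two measuring points that Equation 1 needs.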
Figure 9 HaProxy statistics web tool

The basic configuration is shown next. After binding the port to the IP address or domain name of the load balancer, the server is configured to forward connections based on the designated distribution algorithm. Load distribution can be based on different algorithms such as roundrobin, static-rr, leastconn, source, uri, url_param, hdr and rdp-cookie. The load is allocated among the servers listed at the end of the configuration file.

defaults
    mode http
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout
    srvtimeout

listen LB_Identifier xxx.xxx:port
    mode http
    cookie LB_log insert
    balance roundrobin
    option httpclose
    option forwardfor
    stats enable
    stats auth myuser:mypass
    server Server xxx.xxx:8080 cookie ServerLog_01 check
    server Server xxx.xxx:8080 cookie ServerLog_02 check
    server ServerN xxx.xxx:8080 cookie ServerLog_N check

HaProxy is started by executing the following command:

$ haproxy -f /etc/haproxy/haproxy.conf

This starts the haproxy daemon in the process list. Hot reconfiguration is possible in haproxy using the command below.

$ haproxy -f /etc/haproxy/haproxy.conf -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)

However, when a hot reconfiguration happens, all the associated statistics and log files are by default cleaned automatically.

5.4.4 Comparison of load-balancers

Currently there are many different load-balancers available, with different characteristics, properties and performance capabilities. Our study focused only on a small subset of the available load-balancers, as their evaluation allows us to identify the common features that are required and that have to be considered when scaling applications. Moreover, the load balancers that were considered are among the most widely used and rich in features. For example, nginx hosts nearly 12.18% of all the sites across the multiple domains found on the Internet 10. From the study, it can clearly be said that all the evaluated load balancers worked as expected. Only Pen had some performance issues under high load. The performance of Nginx and HaProxy was similar; however, the latter worked better and provided more statistical information.
Our experiments included in this deliverable were performed using nginx as the load balancer. The decision to select nginx was made because HaProxy is already well studied and preliminary tests showed that nginx has great performance. Moreover, writing new services for nginx is comparatively easy.

5.5 Auto-scaling

Auto-scaling is the mechanism that takes care of dynamically allocating servers in the cloud environment to meet incoming requests, depending on various performance metrics. Depending on the server, it is possible to use the arrival rate, CPU, memory, IO or network usage to determine the scaling decision. Most often, the CPU-based approach is used, because most of the back-end servers that are scaled run PHP-based web applications. PHP is an interpreted language and uses a large amount of CPU to process requests. On the other hand, MySQL can use a memory- or IO-based approach, and memcached can use a memory-based approach, to determine how many additional servers are needed to fit the model.

There are different mechanisms for executing an auto-scaling algorithm. Amazon provides Amazon Auto Scale to define a predefined policy group, where instances are automatically added or removed depending on the thresholds and alarms set. It is possible to create different policy groups depending on the role the servers have and on how the jobs entering the system are distributed between the different roles. We also look at optimal heuristics to determine the number of servers depending on the incoming arrival rate and the service time of the system.

5.5.1 Using arrival rate for auto scaling

One of the simplest ways to scale is to use the arrival rate as an indicator for the policy of whether additional servers are needed or some servers should be terminated. To use the arrival rate as an indicator for the policy decision, it is necessary to know one back-end server's maximum throughput and service time. Website traffic fluctuates largely within a small time window, but does have a trend over larger periods of time. The trend can be either increasing or decreasing, depending on whether the arrival rate for the current hour has increased compared to the last hour or not. Public clouds usually charge customers by full hours of instance usage. This should be taken into consideration when provisioning servers. The arrival rate can be calculated in different ways. If we provision servers for each hour, it is wise to use the previous hour's arrival rate as the indicator.
Usually, the arrival rate is an average, but variations exist where the maximum or a weighted arrival rate is used. With a weighted average, the more recent arrivals have more importance, carrying more weight in the average arrival calculation. This helps with spotting an increasing or decreasing trend: if the arrival rate was increasing at the end of the hour, it will most likely increase at the beginning of the next hour. Our experiments have shown that using the weighted average arrival rate compared to the default average does not really improve service quality or performance on a small scale (using 20 servers and having a maximum of 600 requests per second). There were some differences, but they were not significant enough to consider.

We also studied double exponential smoothing to predict the arrival rate for the next hour. This means using previously collected or calculated arrival rates, and the differences between the calculated prediction and the actual traffic, to smooth the next hour's curve depending on the error of the prediction and on the trend. These algorithms need data from at least the previous day in order to improve the prediction. Using our current configuration, predicting arrival rates with double exponential smoothing did not really improve the service performance or the cost of running the servers compared to the two previously mentioned methods. It can be concluded that the fluctuation within an hour is too large to correctly calculate the "optimal" number of servers that would be capable of serving all the incoming jobs while at the same time keeping the running cost as low as possible.
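The two predictors discussed above can be sketched as follows; the weighting scheme and the smoothing constants alpha and beta are illustrative choices, not the exact values used in the experiments.

```python
def weighted_average(rates):
    """Weighted arrival rate: more recent samples (end of list) weigh more."""
    weights = range(1, len(rates) + 1)         # 1, 2, ..., most recent last
    return sum(r * w for r, w in zip(rates, weights)) / float(sum(weights))

def holt_forecast(series, alpha=0.5, beta=0.3):
    """Double exponential smoothing: one-step-ahead arrival-rate forecast.

    `level` tracks the smoothed rate, `trend` tracks its change; alpha
    and beta are illustrative smoothing constants.
    """
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return level + trend
```

On a steadily increasing series both predictors extrapolate the trend, which is exactly the behaviour that did not pay off within a single, noisy hour.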
All the algorithms and methods still need some spare servers in order to cope with fluctuating traffic.

5.5.2 Adding/terminating servers

While the allocation of resources in the cloud happens automatically by using the auto-scaling mechanism when it is configured with proper alarms or events, we employ an approach similar to the one described in [6, 7] in order to reduce allocation and operation costs. The algorithm is as follows 11:

1. Five minutes before the full hour, fetch the arrival rate from nginx and calculate the number of servers needed to cope with the traffic. This time is chosen as it represents the safest point that allows analyzing the entire system and deciding on a configuration for the next hour.

11 This does not apply for the Auto Scale policy

Moreover, in the case of de-allocating resources, this time also ensures that there is enough time to cool down the system.

2. Using the previous value, check whether the calculated number of servers is lower than the number of servers currently running. If it is, remove a sufficient number of servers from the cloud (terminate the instances) and reconfigure the load balancer.

3. At the full hour, check whether the number of servers calculated in step 1 is higher than the number currently running. If it is, request new servers from the cloud and store the retrieved instance ID codes for tracking purposes.

4. Track the requested servers until all the pending servers have changed state to running, and return to step 1. When a server changes state from pending to running, it gets private and public IPs from the pool. Use the retrieved IP to connect to the instance, change the database and cache IP addresses, and start the necessary services. Once this is done, add the new server to the load balancer's server list and restart the service.

Figure 10 Algorithm for allocating servers in the cloud by the framework

The steps outlined above are built into the framework and are carried out automatically. The framework uses the arrival rate and a user-defined service time to calculate the number of servers required. If the servers are overloaded, a larger service time can be used to allow running more servers. The decision to remove servers before the full hour (see Figure 10) was made to simplify the termination of the servers, because in this case we do not need to check which server has an uptime closest to the full hour. This ensures that, while removing servers, we do not pay for an extra hour due to the termination process being called later or being delayed, which would mean the user has to pay for resources they will not be able to use. An important step when adding/removing servers is to ensure that the process is transparent and no incoming connection is lost.
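The hourly decision in the steps above can be sketched as a single function; `cloud` and `balancer` are hypothetical interfaces standing in for the EC2 API and the nginx reconfiguration code, and the per-server throughput value is illustrative.

```python
import math

def servers_needed(arrival_rate, per_server_throughput):
    """Number of back-end servers required for the measured traffic."""
    return max(1, int(math.ceil(arrival_rate / float(per_server_throughput))))

def provisioning_step(minute, rate, running, cloud, balancer,
                      per_server_throughput=25.0):
    """One pass of the hourly algorithm above (interfaces are hypothetical)."""
    target = servers_needed(rate, per_server_throughput)
    if minute == 55 and target < len(running):        # five minutes early: scale down
        for server in running[target:]:
            balancer.remove(server)                   # reconfigure the balancer first
            cloud.terminate(server)
        del running[target:]
    elif minute == 0 and target > len(running):       # at the full hour: scale up
        for _ in range(target - len(running)):
            running.append(cloud.request_instance())  # track instance IDs
    return target
```

Removing a server from the balancer before terminating it mirrors the transparency requirement stated above.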
While an instance is being terminated, a job might still be forwarded to it and disappear along with the termination. It is also possible that existing jobs in the balancer are dropped when the load balancer is reloaded. Finally, if new servers are added to the load-balancing pool too quickly and are not yet properly configured (Apache is not started, or the IP addresses of the database and memcached are not changed in the MediaWiki configuration), the requests will fail. Using nginx as the load balancer ensures that all the jobs entering the system are processed even if nginx is restarted or terminated. The worker processes first get a termination signal and wait until all the responses are sent back; at the same time the nginx master creates new worker processes with the new configuration, and all the new requests are forwarded there. This ensures that while reloading the load balancer nothing is lost, and the new configuration is immediately applied to new incoming requests.

Figure 11 How the framework core works for requesting instances, configuring the instances, requesting performance metrics and terminating the instances

Our framework also takes care of adding and removing servers, making sure that an instance about to be removed is first removed from the nginx configuration server pool and the service is restarted, and that instances are added only after the configuration step has fully completed, meaning that the back-end Apache server is ready to process requests. Figure 11 shows how the framework dynamically provisions new servers for the system. The figure shows the simple lifecycle of one instance. If the framework's provisioning policy decides that a new server is needed, it connects to the cloud interface to request a new instance. The cloud interface checks whether the user may request new instances and whether there are free resources to start the instance. If all the conditions are met, a new virtual machine is started in the cloud and the instance ID is sent back to the framework. The framework can use the ID to keep track of the health of the instance and of whether it has changed state from pending to running. Changing state from pending to running can take around 3 minutes. When the state changes to running, the framework sends a couple of commands and configuration files to the instance, depending on which roles the instance has. For example, an Apache instance needs to have the Apache2 web service started, and the MediaWiki configuration file has to be updated with the proper MySQL and memcached IP addresses. Once the configuration phase is done, the virtual machine becomes fully functional and is part of the service. The framework takes care of monitoring and measuring the performance metrics of the instance. This information is constantly logged, and after the experiments these results can be combined and additional analyses done.

In the end, if the instance is not needed anymore (e.g. the framework's provisioning policy detects that there is a need to remove servers and run fewer servers in the cloud), the framework starts the termination process of the unnecessary instances. If it is an Apache server, it is first removed from the load-balancer server list to ensure that no more requests are forwarded to that server. Then the framework sends a terminate command with the correct instance ID to the cloud interface, which takes care of terminating the instance and removing it from the cloud. The cloud interface returns a true or false value to the framework, signalling whether it was possible to remove the instance (e.g. whether the instance ID was correct and the server was running in the cloud).
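The lifecycle just described, from the request through the pending state to a configured member of the pool, can be sketched as follows; `cloud` and `balancer` are again hypothetical interfaces, and the method names on them are illustrative.

```python
import time

def launch_and_configure(cloud, balancer, role="apache",
                         poll_interval=15, timeout=600):
    """Bring one instance from request to serving traffic.

    Polls the (hypothetical) cloud interface until the instance leaves
    the pending state, pushes the role-specific configuration, and only
    then registers the instance with the load balancer.
    """
    instance_id = cloud.request_instance()
    waited = 0
    while cloud.state(instance_id) == "pending":      # usually around 3 minutes
        if waited >= timeout:
            raise RuntimeError("instance %s never left pending" % instance_id)
        time.sleep(poll_interval)
        waited += poll_interval
    ip = cloud.ip_address(instance_id)
    cloud.configure(instance_id, role=role)           # push configs, start services
    balancer.add(ip)                                  # only after configuration
    return instance_id, ip
```

Registering the server with the balancer only after configuration mirrors the ordering the framework enforces to avoid failed requests.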

5.5.3 Interactions of the framework

As already mentioned, the proposed framework is capable of configuring servers on the fly, depending on their roles. Here we describe how the interactions between the different servers and the framework are configured.

MySQL. The framework has a MySQL configuration file that the user can change. When the framework starts the configuration phase, it first copies the modified or new configuration file to the MySQL server and starts/restarts the service. If correct MySQL credentials are entered, the framework can fetch MySQL statistics from the database and log them. It is also possible to add or remove commands for each role, depending on whether reconfiguration or starting additional services is required. The following SQL statement is executed in MySQL to retrieve statistics about the database performance:

mysql> SHOW STATUS;
Variable_name            Value
Aborted_clients          2
Aborted_connects         1
Bytes_received           115
Bytes_sent               159
Connections              225
Qcache_hits
Queries
Slow_queries             0
Table_locks_waited       3
Threads_cached           5
Threads_connected        1
Threads_created          6
Threads_running          1
Uptime

Memcached. The Memcached service can be configured and started from the command line. The framework automatically starts the service and logs the cache hit/miss statistics.
The following command is executed by the framework to gather statistics from memcached; get_hits and get_misses are the important values for calculating the hit/miss statistics:

$ echo "stats" | /bin/netcat -q 1 localhost 11211
STAT pid 1099
STAT uptime
STAT time
STAT version
STAT pointer_size 32
STAT rusage_user
STAT rusage_system
STAT curr_connections 5
STAT total_connections 25
STAT connection_structures 7
STAT cmd_get 47
STAT cmd_set 95
STAT cmd_flush 0
STAT get_hits 47
STAT get_misses 332
STAT delete_misses 0
STAT delete_hits 2
STAT incr_misses 3
STAT incr_hits 46
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT bytes_read
STAT bytes_written
STAT limit_maxbytes
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 4
STAT conn_yields 0
STAT bytes

STAT curr_items 93
STAT total_items 96
STAT evictions 0
END

Nginx. The nginx configuration file contains the back-end Apache IP addresses. The framework is fully aware of the running servers and regularly updates the back-end pool list to match the provisioning decisions. The framework also connects with the nginx HttpStubStatusModule to fetch the arrival rates, and restarts/starts the service whenever needed (e.g. when new Apache servers become available).

Apache. The MediaWiki application is configured with the correct MySQL and memcached IP addresses. The MediaWiki logic and configuration are covered in detail in the next section. The framework can also change the PHP and Apache configuration files (e.g. maximum users, memory limit). The framework constantly acquires statistics from Apache mod_status, which provides information such as how many connections have been made, how many active connections there are, and the current bandwidth.

Load Generator. The framework copies the configuration file of the load generator to the correct servers. Starting the benchmark tool is not yet fully automatic: the user has to provide some input and check that the experiment has started correctly. The framework constantly monitors the performance metrics of each server and logs CPU, network, I/O and memory usage.

5.5.4 Amazon Auto Scale

Amazon Auto Scale 12 allows customers to dynamically provision servers in the Amazon EC2 cloud. It is possible to define different thresholds for when to turn instances on and off. Usually, a CPU-based threshold is used, where CPU usage is measured and the average is calculated for each back-end server in the policy group. If, for a certain time period, the average CPU usage exceeds the upper threshold or falls below the lower one, servers are added or removed respectively. Different experiments have shown that Amazon does not terminate the instances in the most logical way.
Amazon charges for instances by running time, where a half hour of usage is charged as a full hour, meaning that the instances closest to the full hour should be terminated (and only if required by the provisioning decision). Using Amazon Auto Scale, some of the servers were terminated when at least 30 minutes to the next full hour were still available for use. Because of the termination of these instances, the user has to pay for resources they were actually not able to use.

The experiment conducted with Amazon Auto Scale used 70% average CPU for scaling up and 60% for scaling down to trigger the alarm. The breach time was set to 15 minutes, which is how long one of the thresholds can be violated before the scaling operation is finally conducted. We also set a 2-minute cool-down time, to gain some buffer before an additional server is added or an existing server is terminated, in case the threshold is still violated under heavy or low load.

5.5.5 Optimal heuristics

Our optimal policy uses the service time and arrival rate to determine a sufficient number of servers, while at the same time maintaining a reasonable response time that meets the SLA. We use queuing theory [8] to calculate the average response time of the current configuration. We describe the system as an M/M/c/c queuing model, where M represents the inter-arrival time between users and c describes both the available resources (servers) that provide the service and the maximum number of users that can enter the system (when the (c+1)-th request arrives, service is denied for it). Moreover, we used this model because it assumes there is no waiting queue and c can be defined arbitrarily (in our case, the c servers are homogeneous).

s = r / (1 − λ·r / (c·n))    Equation 2: response time of the current configuration

Figure 12 Response times with different configurations calculated with Equation 2: (a) as a function of the number of servers at fixed arrival rates; (b) as a function of the arrival rate at fixed server counts

Using Equation 2 [9], we can calculate the theoretical response time of the current configuration, where r is the service time in ms (70 ms), n is the number of cores (2), λ is the arrival rate and c is the number of servers. Our interest is to keep the average response time below 250 ms (a 3.5x slower response time than in the c1.medium experiment). To find a solution, c in Equation 2 is increased until s ≤ 250 ms. Using too small an epoch for allocating servers may cause the system to oscillate, as the arrival rate is not deterministic and is rather noisy. Larger time spans make the system too stable: new servers are allocated too late, so the system cannot react to increasing traffic fast enough and loses requests, while under quickly decreasing traffic idle servers are kept running. One hour is a suitable interval for allocating servers, as Amazon EC2 charges customers at that rate, and a closer look at different traces has shown that within one hour the fluctuation in traffic is relatively small.

5.6 Benchmarking tools

To test different hypotheses, mathematical formulas and how the system works under load, we need benchmarking tools. The general idea of a web-based benchmarking tool is to generate HTTP GET and/or POST requests to stress test the system, and to collect data for each request in order to summarize the service throughput and stability, answering questions such as how many requests went through and how the arrival rate affects the response time. To run such a benchmark, a powerful computer has to be used. It is also necessary to monitor the benchmark computer itself, to have a better picture and to exclude strange anomalies.
An increase in response time might sometimes be caused by the fact that the machine running the benchmark tool is overloaded and cannot generate requests or retrieve responses from the system under stress in time. Even a moderately loaded benchmark machine may skew the results towards slower response times, making their validity doubtful. Some benchmarking tools support monitoring the system under stress through an SSH connection to the server, collecting the values of interest.

This additional functionality is not needed, as our framework already takes care of collecting the necessary performance metrics from all the servers in the cloud. There is a wide variety of benchmarking tools: open-source, free and commercial ones. Of the free tools, JMeter and Tsung are the most popular.

JMeter

JMeter is an Apache project that can be used to load test, analyse and measure the performance of a variety of services, mainly focusing on web applications. The tool is written in Java and has both a graphical and a command line interface for conducting experiments. JMeter can also be used to conduct unit tests. Different test cases can be generated with the help of the graphical interface to simulate user behaviour on the web page; it can store cookies and perform login and logout operations. Data collection, visualization and stress testing with JMeter need a powerful computer. JMeter uses threads to simulate multiple users concurrently, but with a large number of users these threads generate a large amount of overhead, and the number of threads is limited by the operating system. To generate a large number of requests, additional JMeter clients are needed to cope with these restrictions and with the overhead.

Figure 13 JMeter console test plan

Tsung

Tsung (formerly Tsunami) is a benchmarking tool written in Erlang that supports a variety of protocols (e.g. HTTP, XMPP, LDAP) and XML-based test plans for stress testing systems before they are deployed into production. Tsung is a distributed, cluster-based load generator that spreads the load among multiple servers (secondaries), which are orchestrated by a specific node (master). The configuration plan is defined in an XML file, as shown in the following snippet, and is loaded by the master node to start the stressing process. A test plan consists of three parts: cluster configuration, load configuration and request definitions. The cluster configuration part contains all the information about the nodes (e.g. IP address, name, maximum number of users); the load configuration part is divided into phases and contains the duration of each phase and the arrival or inter-arrival rate used for generating the user load. Finally, the request definitions part describes the requests themselves. Requests can be recorded using a web browser and the tsung-recorder utility.

<tsung>
  <!-- Cluster configuration setup -->
  <clients>
    <client host="tsung-node1" weight="1" maxusers="500">
      <ip value=" "></ip>
    </client>
  </clients>
  <!-- Server side setup -->
  <servers>
    <server host=" " port="80" type="tcp"></server>
  </servers>
  <!-- Load configuration -->
  <load>
    <arrivalphase phase="1" duration="2" unit="second">
      <users interarrival=" " unit="second"></users>
    </arrivalphase>
  </load>
  <!-- Request definition -->
  <sessions>
    <session name='request' probability='100' type='ts_http'>
      <request> </request>
    </session>
  </sessions>
</tsung>

Tsung is executed on the master node with the tsung command:

$ tsung -f configuration_file.xml start

Prior to execution, secure communication via SSH has to be ensured among all the nodes in the cluster, and it has to be established without asking for a password. Consequently, each node has to execute the following set of commands:

$ ssh-keygen -t dsa
$ chmod 600 /root/.ssh/id_dsa
$ cat id_dsa.pub >> /root/.ssh/authorized_keys
# connecting over ssh without a password
$ ssh-agent sh -c 'ssh-add < /dev/null && bash'

A request can be recorded by setting up a proxy listener with tsung-recorder:

$ tsung-recorder -u -I P 3128 start

After the simulation, a log file is created that contains all the information concerning the transactions of the simulated users. Tsung includes a tool for generating reports (Figure 14) based on these logs, which can be invoked with the tsung_stats command:

$ /usr/local/lib/tsung/bin/tsung_stats.pl

Alternatively, the tsung-plotter utility can be installed to create more sophisticated graphs.

Figure 14 Tsung statistics web tool

6 MediaWiki

The case study application MediaWiki was selected as a base application to verify the scalability of the system using various load balancers and scaling algorithms. MediaWiki was configured similarly to the Wikipedia web application, but without reverse-proxy cache servers (e.g. Squid or Varnish), to reduce the complexity and time of the configuration.

Figure 15 Cumulative distribution function of response time with different caching policies (no caching, XCache, and memcached + XCache) using the Amazon EC2 instance c1.medium

MediaWiki uses a MySQL database to store articles, and uses memcached to cache already parsed and rendered pages in order to improve service throughput and speed. MySQL and memcached are centralized, each having a separate server. It is possible to configure MySQL in a master-slave configuration, where write requests go to the master and are propagated to the slaves, and all read requests go to the slaves. Memcached is easier to replicate, as it needs little configuration and can be quickly started from the command line; the rest is taken care of by the MediaWiki application.

Caching is important, as MediaWiki is a complex application written in the PHP scripting language and needs a large amount of CPU power to render pages. There are several built-in caching policies: 1) file caching, where pages are stored on the hard drive, 2) database caching, where content is stored in the database, and 3) memcached, where information is stored in physical memory. Wikipedia uses memcached to cache already rendered pages in memory to improve service speed. File caching needs a centralized NFS to store the files, as otherwise all files are duplicated on each Apache server.

There are additional caching methods to further improve the speed of the service. As the PHP code has to be interpreted every time a request is made, it is wise to use opcode caching to keep the interpreted code in memory, making page requests faster to serve. We use PHP XCache to improve the service speed; it also helps to reduce reads from the hard drive, as the code stays in memory. XCache is especially beneficial for larger and more complex code bases such as MediaWiki.
Figure 15 shows the cumulative distribution function of response times and how much caching reduces the time to serve the requested content. The average response time without caching was 469 ms; with XCache it was reduced to 335 ms, and with memcached together with XCache it came down to 71 ms.

Figure 16 MediaWiki web application running in the cloud; red links indicate missing content behind the link

6.1 MediaWiki configuration layout

Figure 17 MediaWiki configuration layout in the cloud and how the load generator requests go through the system

Figure 17 shows the layout of the MediaWiki service in the cloud. We used Tsung as a load generator to issue HTTP GET requests to nginx. Nginx distributes the requests to the back-end Apache servers using least round-robin. Using the Fair module with nginx, we can define the maximum number of concurrent connections each back-end server may have. MediaWiki, running on the Apache server that receives the request, will first connect to the database to check whether the page exists. If the page exists, the unformatted content is fetched from the database. The next step is to check whether the formatted content already exists in the cache; pages already stored in the cache are rendered more quickly. If the cache entry does not exist, the web application starts parsing the unformatted content to convert the MediaWiki tags into proper HTML format, using regular expressions to complete this task. Using regular expressions means that a large amount of CPU power

is needed in order to format the content; the MediaWiki application is therefore CPU bound on the Apache server, requiring fast machines. A precise overview of how a request is processed by the system is given in Figure 18.

Figure 18 How the requests go through the MediaWiki service, where Tsung is used to generate the load

7 Analysis of the case

7.1 Configuration of the framework for verifying the QoS of the web application

Our framework has a centralized Java application running alongside the load balancer. The application's primary functionality is to track running instances in the cloud, monitor them, change the deployment and turn instances on and off. We have a pre-built image containing all the necessary software and tools to work with the Java API. Most importantly, an additional Java application was built that works as a web service and pushes the collected performance metrics out of a predefined TCP port. These metrics are collected by the centralized Java application to analyse a server's performance in the cloud. We implemented logic to automatically provision servers based on different parameters; for example, the service arrival rate or the average CPU usage can be used to determine the necessary number of servers. The software bundle supporting the framework allows us to rapidly run new experiments and find the most suitable and optimal configuration for each given web application. Stress testing the system gives an idea of possible bottlenecks and which services should be optimized to improve the overall performance. Our framework has been tested with different configurations and deployment plans.
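The metrics web service described above is a Java application; the following hypothetical Python sketch only illustrates the idea of publishing the latest performance metrics on a predefined TCP port, from which the central application polls each server. The port number, the JSON reply format and the metric names are our assumptions, not details from the deliverable.

```python
# Illustrative sketch only: the deliverable's metrics service is written in
# Java. This hypothetical equivalent publishes the latest metrics as JSON on
# a TCP port (port number and field names are assumptions for illustration).
import json
import socket

METRICS_PORT = 5599  # assumed port, not specified in the deliverable

def current_metrics():
    # In the real framework these values come from tools such as sysstat
    # or collectd; here they are hard-coded for the sketch.
    return {"cpu": 42.0, "mem": 63.5, "net_in": 120.4}

def serve_metrics(host="127.0.0.1", port=METRICS_PORT):
    """Answer every connection with one JSON document of metrics."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        with conn:
            conn.sendall(json.dumps(current_metrics()).encode())

def poll(host, port=METRICS_PORT):
    """What the central application would do: connect, read, parse."""
    with socket.create_connection((host, port)) as s:
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return json.loads(data.decode())
```

The central application simply calls `poll()` against every known server IP on each monitoring cycle and logs the returned values.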
A more dynamic approach installs all the necessary software on the instance on the fly, but its drawback is that the provisioning step is much slower, as the operating system has to download the necessary software from the Internet (which depends on the speed of the network), install it, and finally apply the configuration changes needed to run the servers. With the second approach, everything necessary is already installed on the instance. The installed services are stopped on boot-up by default and are started automatically by the framework, depending on the role of the server. For example, we do not need an Apache service running on the database server, as the only service needed there is MySQL. Configuration of the files is done similarly to the

previous approach, by the framework. Both approaches have been tested, and the latter one is used, as it allows much quicker server configuration and there is no need to rely on the operating system repository to fetch the necessary software. Sometimes these mirrors might fail (the site is down, or the network connection is routed wrongly), causing automatic server configuration to fail and requiring manual intervention.

Table 1 Software used for deploying the infrastructure and their roles

Role              Software
HTTP server       Apache, PHP, XCache
Load balancer     Nginx, Fair module, SUN Java
Database          MySQL
Cache             Memcached
Benchmark tool    Tsung

7.2 Comparison of Amazon instances

The public cloud provider Amazon EC2 provides a great variety of instance types, all with different CPU, memory, I/O and network bandwidth characteristics. Amazon's base measure of instance capacity is the EC2 compute unit, approximately equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Our study compared three different instance types: m1.small, c1.medium and m1.large. This allows us to understand how the service behaves on different machines and which configuration is the most suitable for the service. We conducted several experiments to pick the most suitable instance type or combination of them. One might think that the database and memcached instances will definitely need more memory, making an m1.large instance more suitable, but running a web application in PHP requires more CPU power, making c1.medium better than the other instances.

7.3 Experiments

Stress testing experiments have long been a part of QoS (Quality of Service) verification. They play an important role in determining flaws in a system before it goes public and can be considered a final product. They provide performance measures and throughput limits for each service, giving a better basis for capacity planning.
Poor QoS can lead to frustrated customers, which tends to lead to lost business opportunities [10]. QoS verification includes measuring the time it takes to process a single response and measuring the overall throughput of the server. An increase in response times means that jobs stay in the system longer, eventually leading to backlogging and making the system unresponsive. Longer response times drive away potential customers, who do not want to waste their time waiting for new pages to render. Without stress testing, the following two outcomes are possible [11]:

1. The service fails at the worst possible time, generating lots of frustrated customers or losing important transactions.
2. While bottlenecks appear in the system under heavy load, the system administrator might not be aware of where or why they happen, which makes finding the cause of the problem harder.

7.4 Measuring service time

We were interested in the differences between the various instance types provided by Amazon. Our first set of experiments measured the service time of the MediaWiki application. To measure the service time, only one request per second was generated by the load generator, so that the service was minimally loaded and there was only one job in the system. We assumed that serving one request takes no more than a second. The service time is taken as the average response time gathered during the experiment.

7.5 Measuring maximum throughput

The service time gives a rough idea of the potential of the system, but does not show the complete picture. From the service time we can theoretically calculate the maximum throughput of the system, but it often happens that the actual throughput is much smaller, as under heavy load the system is not capable of serving requests the way it can under much lower load. The experiment time was set to 1 hour, and every 2 minutes the arrival rate was incremented by 1 request per second. This allowed us to gather enough data points and statistics for each arrival rate.

Figure 19 Ramp-up experiment measuring the maximum throughput for three different instance types (CPU utilization and arrival rate over time for m1.small, c1.medium and m1.large)

Figure 19 shows how the three instance types behaved under increasing load. The jumps in CPU usage are related to CPU steal, where virtualization blocks available CPU cycles from the operating system, resulting in larger CPU usage and degrading the service. The graph shows that c1.medium is best for running the MediaWiki application and is capable of serving around 28 requests per second at maximum.
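This measured ceiling is consistent with the service time measurement. As a quick sanity check (our illustration, not the deliverable's code): with a service time of r = 70 ms and n = 2 cores, one c1.medium server can theoretically complete n / r requests per second.

```python
# Back-of-the-envelope check linking the measured service time to the
# throughput ceiling: n cores, each finishing one 70 ms request at a time,
# give n / r requests per second at full utilization.

def theoretical_max_rps(service_time_ms, cores):
    return cores / (service_time_ms / 1000.0)

print(round(theoretical_max_rps(70, 2), 1))  # -> 28.6
```

The theoretical 28.6 rps is very close to the ~28 rps observed in the ramp-up experiment, suggesting the c1.medium instance is fully CPU bound on this workload.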

7.6 Results of the preliminary experiments

Table 2 summarizes the tests conducted in the Amazon EC2 cloud. To give a good comparison between instances, different characteristics and results are shown in the table. The price per request was calculated as a theoretical value based on how many requests the selected instance should be capable of serving in one hour, taking into account the maximum throughput and the cost of one server for one hour. The results show that using c1.medium instances to run the MediaWiki application is the cheapest approach. Looking at how much a single request costs, it is clearly not reasonable to run several m1.small servers instead of a single c1.medium.

Figure 20 shows the cumulative distribution of response times for the different instance types in the Amazon EC2 cloud. Instances c1.medium and m1.large behave similarly, while m1.small stands out, as two thirds of its requests are twice as slow as the rest. It seems that smaller pages with less complexity use less CPU and can be served without virtualization (CPU steal) taking away available CPU cycles; while processing complex pages, more of the CPU is affected by CPU steal, and the requests therefore take much longer to process.

Table 2 Results of the experiments with different instance types

Measurement                  m1.small    c1.medium    m1.large
Minimum response time        62 ms       57 ms        57 ms
Average response time        132 ms      71 ms        71 ms
Maximum response time        459 ms      278 ms       367 ms
CPU model                    E5430       E5410        E5506
CPU clock                    2.66 GHz    2.33 GHz     2.13 GHz
Compute units                1           5            4
Maximum CPU steal            56.04%      6.23%        25.14%
Cost of server               0.08$       0.17$        0.32$
Maximum throughput           6 rps       28 rps       18 rps
Price per request (x10^-6)   3.70$       1.64$        4.94$

Figure 20 Cumulative distribution of response times for different instance types
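The price-per-request column of Table 2 can be reproduced directly from the hourly cost and the maximum throughput. The following check is our illustration; the small deviation for c1.medium from the table's 1.64$ comes from rounding of the measured throughput.

```python
# Recomputes Table 2's "price per request": hourly instance cost divided by
# the number of requests the instance can serve in one hour at its maximum
# measured throughput (cost and rps values taken from the table).

def price_per_request(hourly_cost_usd, max_rps):
    return hourly_cost_usd / (max_rps * 3600)

for name, cost, rps in [("m1.small", 0.08, 6),
                        ("c1.medium", 0.17, 28),
                        ("m1.large", 0.32, 18)]:
    print("%-10s %.2f x 10^-6 $" % (name, price_per_request(cost, rps) * 1e6))
```

This confirms the conclusion drawn from the table: a single c1.medium serves requests at less than half the per-request cost of m1.small or m1.large.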

7.7 Configuration and results of the experiment

Figure 21 Clarknet trace with 2 weeks of traffic; red indicates the part of the trace used to generate the load

Clarknet traces (see Figure 21) were used to generate the load for the experiments. They contain arrival rate information for 2 weeks; data from the 10th day was used to generate the traffic. At the beginning and end of the day the arrival rate is low, indicating few visitors at night, while peak traffic is reached during midday. Figure 22 shows the rate at which the requests were injected into the system. The inter-arrival rate was changed every minute, resulting in strongly fluctuating traffic.

Tsung was used as a load generator to issue HTTP GET requests against the nginx load balancer. A complex Tsung XML configuration file with a 1-minute resolution per arrival phase was set up to follow the Clarknet trace curve. A randomized set of URL addresses, excluding redirects, was fetched from the MediaWiki database and used to generate the requests. Tsung was deployed as a distributed cluster of three nodes (one master and two secondaries) running in the cloud. This ensured that the load generator's own load per server remained small and did not affect the measurement of the average response time of requests.

A one-day experiment was conducted to observe the benefits and load curve changes in the system while dynamically allocating servers in the cloud. For the first experiment we used Amazon Auto Scale [12] and for the second experiment we used our simple optimal heuristic. It is also possible to use a different allocation policy based not on average CPU usage or arrival rate but on memory or network usage. Amazon Auto Scale was configured with a breach time of 15 minutes.
This means that if a CPU usage threshold is violated for at least 15 minutes, a server allocation or termination is performed. We set the Amazon Auto Scale thresholds to 60% and 70% of the average CPU for down-scaling and up-scaling respectively. The second experiment used the average service time of 70 ms to allocate the correct number of servers. The two runs were then compared.

Table 3 Results of the experiments with the different provisioning policies

Measurement             Auto Scale    Optimal    Always on
Average response time   ms            ms         ms
Average CPU usage       44.62%        37.22%     23.84%

Instance hours
Cost of servers         49.92$        55.52$     76.80$
Requests lost
Successful requests     98.44%        99.85%     99.99%

Table 3 summarizes the metrics gathered from the 24-hour experiments. These experiments show that it is possible to reduce the cost of running the service by 28% to 35%, but with the cost reduction more jobs are rejected, as the servers run at higher CPU utilization. The average CPU utilization looks normal even with Amazon Auto Scale, but the traffic fluctuates strongly, indicating that in certain time periods the servers are overloaded and drop excessive connections. Over 22 million requests were generated; the 343,763 jobs lost with Auto Scale represent only about 1.5% of the jobs, which can be acceptable depending on the SLA. For the optimal policy and for running all 20 servers for 24 hours, the job loss is relatively small (under 0.15%) and should be acceptable under common SLA agreements.

Figure 22 24-hour experiment of server allocation for the different policies (number of servers and arrival rate over time for the optimal policy, Auto Scale and always on)

The average arrival rate in this experiment was 255 requests per second. We were also interested in the yield of each policy. Because the cost of one request is relatively small, we compare the values per 1 million requests. Using Auto Scale with the 70% up-scaling and 60% down-scaling thresholds, serving a million requests costs the service provider 2.305$. The cost for the optimal policy is 2.527$, and for running all of the servers it is much higher: around 3.491$. These results are experimental and might not match real-world numbers, because it is hard to predict user behaviour in such a case. It is hard to measure the number of clients dropping off the web page when the browser's request is rejected, as this would change the arrival curve: users might refresh the page immediately, generating a much larger load for that period, or may leave for other pages, resulting in a much lower load.
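The quoted savings follow directly from the 24-hour server costs in Table 3. The following check is our illustration of the arithmetic, not part of the experiment:

```python
# Verifies the cost figures quoted in the text: the savings of each policy
# relative to keeping all 20 servers running (costs from Table 3), plus a
# helper for the cost of serving one million requests.

def savings(policy_cost, always_on_cost):
    return 1.0 - policy_cost / always_on_cost

def cost_per_million(total_cost, total_requests):
    return total_cost / total_requests * 1e6

print("%.0f%%" % (100 * savings(49.92, 76.80)))  # Auto Scale -> 35%
print("%.0f%%" % (100 * savings(55.52, 76.80)))  # Optimal -> 28%
```

Auto Scale saves 35% and the optimal policy 28% compared with the always-on configuration, matching the 28% to 35% range stated above.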
Note: Only Apache servers are counted; the instance charge is 0.16$ per hour (us-east-1 region).

7.8 Characteristics of different web services

Figure 23 CPU utilization for the different servers (Apache, nginx, MySQL, memcached) compared with the arrival rate

The framework is capable of measuring and monitoring performance metrics for all the servers running in the cloud. Knowing exactly how many arrivals were generated, we can draw a utilization curve for each service, showing how much of its resources were used at each arrival rate and where the potential upper limit of each service lies, i.e. the point beyond which no further jobs are served unless additional servers are added (horizontal scaling) or the existing machine is replaced with a faster one (vertical scaling). Figure 23 shows the CPU utilization for arrival rates from 1 to 550 requests per second, using 20 back-end Apache servers; all instances were running in the same availability zone (us-east-1c) and used the same instance type (c1.medium). The variation between the Apache servers' CPU utilization is caused by the least round-robin algorithm, where requests are passed to the server under the least load: at small loads, the first couple of servers in the pool can serve all the requests faster than they arrive, so they become idle again and are eligible to receive the next request. The MySQL server is moderately loaded, and additional or faster servers are needed to cope with larger arrival rates. The nginx load balancer and memcached are minimally loaded, showing that they are not CPU bound but rather network bound.

Figure 24 Throughput per Apache server compared with the arrival rate

Figure 24 shows how the requests are distributed between the Apache servers. Even at 100 requests per second to the nginx load balancer, some of the 20 back-end Apache servers sit completely idle and do not receive any requests.
With this configuration, service saturation occurs at around 600 requests per second. Some servers are more capable and able to serve more than 30 requests per second, but the average is around 28 requests per second, the same as measured in the ramp-up experiment for the c1.medium instance type.

8 Identifying similar performance variables for adapting the approach to any SOA application

Any SOA application that can be configured with a centralized database and/or cache can be used with this framework to conduct experiments and verify its scalability. Even if the database is missing, it is easy to adapt the framework to stress test Apache web applications only. These can be applications that do not need a database and instead use the file system to store and retrieve pages. The only thing to do is to change the web application's template configuration file. Most web applications have some sort of configuration file in which the user defines parameters such as the database connection or the location of the files in the file system. We have successfully stress tested the WordPress blog application, as it also supports a MySQL database and memcached. Replicating Apache servers is a straightforward process: the user only needs to configure authentication with the database, and the framework takes care of starting the Apache process on the server.

The framework is not yet adapted to stress test other types of web services than HTTP web applications. Future work should include research into stress testing SOAP, XML, database and cache web services. These types of services need a different approach, and the requests generated for the various experiments must be adapted to measure the performance of the service. The details will be addressed by D6.7 of REMICS, scheduled to be delivered at the end of M36 of the project.

Below is a description of the steps necessary to prepare an instance in the cloud with the required infrastructure. All the commands have been tested with the Ubuntu operating system. Other Linux distributions might work as well.
8.1 Preparing the instance

These are step-by-step instructions showing how to adapt the framework for any enterprise web application, allowing the application to be provisioned automatically depending on the number of incoming requests and to be stress tested. The framework makes it possible to analyse the performance of the web application and provides an overview of possible bottlenecks. This is useful when developing pilot projects, to improve the speed of the service and to understand how the service performs under high load and in a production environment.

The description mainly focuses on installing all the necessary software and services on an instance that is then bundled and uploaded to the cloud. This helps when requesting replicated instances with the same configuration. We are using a 32-bit Ubuntu release as the base image. The Amazon EC2 cloud and third-party providers maintain lists of available images with operating systems already installed; the latest releases of Ubuntu can be found on the Ubuntu homepage. The user needs an Amazon account that is capable of starting new instances in the cloud, otherwise these instructions cannot be followed.

We selected one of the images from the list. It is important to make sure that the region of the image is the region where you plan to start the instance. Because we are using the c1.medium instance type (various tests and experiments have shown it to be the most useful instance type), we need the 32-bit architecture. Find out where the root of the image is stored: we are interested in an instance that has the root stored on the instance itself (otherwise we need to use S3 storage to mount the root). A suitable image to start with is ami-ffc01996 (image ID).

There are several ways to start the instance. One way is to use the command line, which requires the package ec2-api-tools to execute the command ec2-run-instances. Another option is to use the Amazon AWS console to start and terminate the instance.
The Ubuntu community provides a direct link to do so. To run the instance from the command line, first install the package ec2-api-tools, making sure that the Linux repository is properly configured:

$ sudo apt-get install ec2-api-tools

$ ec2-run-instances ami-ffc01996 --instance-type m1.small --region us-east-1 --key ${EC2_KEYPAIR} -K ${EC2_PRIVATE_KEY} -C ${EC2_CERT}
RESERVATION  r-e5f  default
INSTANCE  i-1bc7447d  ami-ffc01996  pending  key_pair  0  m1.small  T13:46:  us-east-1b  aki-407d9529  monitoring-disabled  instance-store

When the request is successful, the command outputs a reservation notification with the instance ID and its current state. For installing the software and services we recommend using the m1.small instance type to reduce the cost of setting up the system, although because of the limited CPU the installation may take more time than on a c1.medium instance. It is also necessary to validate the installation and check that the framework is properly deployed and works as expected; sometimes several iterations are needed to bundle the final image with all the necessary items on it. If everything works as expected, it is possible to move to a larger instance.

${EC2_KEYPAIR} is the name of the key pair generated in Amazon, not the key file. It tells the instance which public-private key pair to use to allow access to the instance through Secure Shell. If you are not able to start the instance correctly, there are many video tutorials provided by community members that walk through the process. If the command is successful, it returns the ID of the instance, which can be used to track whether the instance is running. Using the following command, you can get an overview of your instances in the cloud:

$ ec2-describe-instances -K ${EC2_PRIVATE_KEY} -C ${EC2_CERT}
RESERVATION  r-e5f  default
INSTANCE  i-1bc7447d  ami-ffc01996  ec compute-1.amazonaws.com  domu c compute-1.internal  running  martti.v  0  m1.small  T13:46:  us-east-1b  aki-407d9529  monitoring-disabled  instance-store

This shows the state of the requested instance: pending, running, terminated, etc. At first the instance is pending, while the image is copied to a physical host so that XEN can start the virtual machine.
It takes some time, but should be done in 2 to 3 minutes. Sometimes problems occur: if the virtual image is not in the running state within 10 minutes, you should terminate it and request a new one. When the state changes to running, the instance receives private and public IP addresses. Using your private key, it is possible to connect to the running instance with the following command (notice that the public address is given both as a DNS name and as an IPv4 address: ec compute-1.amazonaws.com and ):

$ ssh -i /home/user/your_private_key.pem ubuntu@
The authenticity of host ' ( )' can't be established.
RSA key fingerprint is 11:3a:f0:e7:d2:1e:2e:5c:e2:##:##:##:##:##:##:##.
Are you sure you want to continue connecting (yes/no)?

The first time it asks whether to add the host to the known hosts list; answer yes. This prompt can be bypassed from the command line with the parameter -o StrictHostKeyChecking=no. Ubuntu images usually have the user ubuntu for accessing the instance. Once you are in the instance, you can type

$ sudo -i

to gain root privileges and install the necessary software.

If you do not want to pass keys and certificates to ec2-api-tools each time a command is executed, you can create a file /home/user/amazon/amazon that contains the necessary information:

EC2_KEY_DIR=$(dirname $(readlink -f ${BASH_SOURCE}))
export EC2_PRIVATE_KEY=${EC2_KEY_DIR}/your_private_key.pem
export EC2_CERT=${EC2_KEY_DIR}/your_cert_key.pem
export EC2_KEYPAIR=your_keypair_name

When you start a shell, run source /home/user/amazon/amazon to store the variables defined above in the environment. After that you can call ec2-api-tools commands without defining the keys.

8.2 Installing software on the instance

If you have successfully started the instance and have access to it, it is time to install the necessary software for the framework. We want to install the necessary services to run the MediaWiki

application; this includes MySQL to store data in the database, memcached to keep cached pages in memory, Apache with PHP to serve HTTP GET requests to MediaWiki, and nginx for load balancing the requests. To install the necessary services, use the APT tools under Ubuntu. Make sure you have root privileges (see the section above; sudo -i grants the necessary permissions). First, update the list of entries in the repository:

$ add-apt-repository "deb lucid partner"
$ apt-get update
$ apt-get install mysql-server apache2 php5 php5-mysql php5-xcache memcached sysstat openjdk-6-jdk collectd nginx chkconfig ec2-api-tools

This downloads roughly 300 MB of packages and takes some time. It also asks for a username and password for MySQL; for testing purposes you can use root for both. We install nginx from the repository for now, but in order to have the Fair module, it later has to be built from source. The chkconfig package is used to disable services at start-up, as we do not want all the services to be activated on every instance. Use the following commands to stop the services from starting at boot:

$ chkconfig mysql off
$ chkconfig apache2 off
$ chkconfig memcached off
$ chkconfig nginx off

To check whether this worked, you can restart the instance with the command reboot. After the reboot, try the following commands to see if the services are running:

$ service mysql status
$ service apache2 status
$ service memcached status
$ service nginx status

For the next step, you may want to start the services to allow installing MediaWiki (or another web application) into the database and to see that everything works. You should not start nginx, as otherwise a port conflict happens (both Apache and nginx listen on port 80 by default).
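Since the image bundles every service but each running node only needs a subset, the framework has to decide per node what to start. The role names below are a hypothetical illustration of that idea, not the framework's actual configuration; the service names match the packages installed above:

```shell
#!/bin/sh
# Map a (hypothetical) node role to the services that should be started
# on it; everything else stays disabled via chkconfig as shown above.
services_for_role() {
  case "$1" in
    database)     echo "mysql" ;;
    cache)        echo "memcached" ;;
    app)          echo "apache2 memcached" ;;
    loadbalancer) echo "nginx" ;;
    *)            echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Demonstration:
services_for_role app   # prints: apache2 memcached
```

On a freshly booted node one would then run something like `for s in $(services_for_role app); do service "$s" start; done`.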
$ service mysql start
$ service apache2 start
$ service memcached start

The framework uses the command line to start memcached with the correct amount of memory, as the default is 64 MB, which is not enough for conducting large-scale experiments. Make sure that in my.cnf the row bind-address= is commented out, as otherwise other instances without proper port forwarding cannot access the MySQL server.

8.3 Installing MediaWiki

This section can be adapted to fit any other enterprise web application into the framework. For clarity, we show the steps necessary to get the framework running with MediaWiki. Once all the services are installed and started again, it is time to install MediaWiki, which can be downloaded from the project site. For Apache, the default folder for serving content to the outside world is /var/www. Go there, download MediaWiki and unpack it with the following commands:

$ cd /var/www
$ wget 
$ tar -xvf mediawiki tar.gz
$ mv mediawiki / mediawiki
$ chmod -R 777 /var/www/mediawiki/

Make sure the download link for MediaWiki is correct, as the application is constantly updated and the version numbers change. We use simplified permissions in the file system here.
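As noted above, the framework starts memcached from the command line with more cache than the 64 MB default. A small sizing helper, under an assumed half-of-free-memory rule (the framework's real policy is not specified here), might look like:

```shell
#!/bin/sh
# Pick a memcached cache size: half of the free memory, but never below
# the 64 MB default. The half-of-free-memory rule is an assumption for
# illustration only.
mem_for_memcached() {
  # $1 = free memory in MB
  half=$(( $1 / 2 ))
  [ "$half" -lt 64 ] && half=64
  echo "$half"
}

# Demonstration:
mem_for_memcached 1600   # prints: 800
```

memcached would then be started with, e.g., `memcached -d -u nobody -p 11211 -m "$(mem_for_memcached 1600)"`, where -m sets the cache size in megabytes.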

For production use, however, it would be advised to chown the files to www-data and to set chmod 700 permissions on the MediaWiki configuration file so that it is accessible only by the Apache PHP user. Installation instructions can be found in the MediaWiki documentation. Access your instance in the browser; MediaWiki will warn you that LocalSettings.php is missing and give a link that redirects you to the installation part. Use the following parameters to install MediaWiki:

Database host: localhost
Database name: wikidb
Database table prefix:
Database user: root
Database password: root
Use the same account as for installation: false
New database user: wikiuser
New database password: pass
Create the account if it does not already exist: true
Storage engine: innodb
Database character set: binary
Name of wiki: MediaWiki experiments
Enable outbound e-mail: false
Settings for object caching: Use Memcached (requires additional setup and configuration)
Memcached servers: localhost:11211

For the fields not listed above, use your own values to fill the installation form. If the installation is successful, it downloads LocalSettings.php to your local computer through the web browser. This file needs to be copied to the instance folder /var/www/mediawiki/:

$ scp -i /home/user/your_private_key.pem /home/user/localpath/LocalSettings.php ubuntu@ :/var/www/mediawiki/

Make sure the file is actually copied into the folder; otherwise it is not possible to start MediaWiki. If everything works, opening the wiki address should redirect you to the main page. The wiki consists of only one page at this point, so selecting Random page should still direct you to the same page.

Figure 26: MediaWiki successfully installed in the Amazon EC2 cloud
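The stricter permission scheme suggested above can be sketched as a small helper; chown to www-data needs root, so the sketch ignores a failure there:

```shell
#!/bin/sh
# Sketch of the stricter scheme suggested above: hand the MediaWiki
# configuration file to the Apache user and make it accessible to that
# user only (instead of the blanket chmod 777 used during setup).
lock_down_config() {
  # $1 = path to LocalSettings.php
  chown www-data:www-data "$1" 2>/dev/null || true   # needs root privileges
  chmod 700 "$1"
}

# Demonstration on a scratch file:
f=$(mktemp)
lock_down_config "$f"
ls -l "$f" | cut -c1-10   # prints: -rwx------
rm -f "$f"
```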

To connect the configuration file with the framework, some additional changes are needed. For the MediaWiki installation, the user has to add require 'IPSettings.php'; at the end of the LocalSettings.php configuration to interconnect it with the framework. The framework copies the following PHP file to every Apache instance, which allows dynamically changing the IP addresses of the MySQL and memcached services:

<?php
$wgDBserver = ' ';
$wgMemCachedServers = array( 0 => ' :11211' );
?>

The framework replaces $wgDBserver with the MySQL IP address and puts the full memcached server list into the $wgMemCachedServers array. This works as-is for the MediaWiki installation, but other types of web applications need additional changes: their configuration files have to assign the $wgDBserver and $wgMemCachedServers variables in the correct places, because web applications do not use a single common structure for defining configuration parameters. As an example, the WordPress configuration file can be changed to pick up the IP addresses managed by the framework as follows:

<?php
require_once 'IPSettings.php';
define('DB_HOST', $wgDBserver); // MySQL server address defined in the WordPress configuration
?>

8.4 Uploading Wikipedia article dumps into the MediaWiki database (optional)

A fresh MediaWiki installation comes with one single page, and testing such a system does not really give a good overview. In order to simulate a real-world application, we also need some real-world data. Wikipedia makes backup copies of its databases in XML format on a regular basis for mirror wiki sites to upload; these are freely available from its homepage. For uploading purposes, the XML needs to be converted into MySQL queries. There is a tool developed by a third party, mwdumper, that does this.
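What the framework's rewrite step conceptually does can be sketched as a generator for IPSettings.php. The MediaWiki variable names are the real globals used above; the generator itself and its example addresses are illustrative, not the framework's code:

```shell
#!/bin/sh
# Generate an IPSettings.php with the current MySQL address and the list
# of memcached servers, in the shape the framework writes for each
# Apache instance.
write_ip_settings() {
  db="$1"; shift     # $1 = MySQL IP, remaining args = memcached host:port entries
  printf '<?php\n$wgDBserver = '\''%s'\'';\n' "$db"
  printf '$wgMemCachedServers = array('
  i=0
  for s in "$@"; do
    [ "$i" -gt 0 ] && printf ','
    printf ' %d => '\''%s'\''' "$i" "$s"
    i=$((i + 1))
  done
  printf ' );\n'
}

# Demonstration (addresses are placeholders):
write_ip_settings 10.0.0.5 10.0.0.6:11211 10.0.0.7:11211
```

The output would be redirected into /var/www/mediawiki/IPSettings.php on each Apache node.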
The dumps are stored for one year, and the links used in this example may therefore have become broken:

$ cd /mnt
$ chmod -R 777 /mnt
$ wget 
$ wget 
$ java -jar mwdumper.jar --format=sql:1.5 enwiki pages-articles1.xmlp p bz2 | mysql -u root wikidb --password=root
1,000 pages (71.003/sec), 1,000 revs (71.003/sec)
2,000 pages (73.158/sec), 2,000 revs (73.158/sec)
3,000 pages (69.756/sec), 3,000 revs (69.756/sec)
4,000 pages (69.396/sec), 4,000 revs (69.396/sec)
5,000 pages (68.945/sec), 5,000 revs (68.945/sec)
6,000 pages (67.56/sec), 6,000 revs (67.56/sec)
6,343 pages (66.047/sec), 6,343 revs (66.047/sec)

The command is a pipe: mwdumper.jar converts the XML into SQL statements, which are forwarded to MySQL to insert the content. To test whether the data was uploaded successfully, open the web browser again, visit the MediaWiki page you just installed and select Random page to see if a new article opens. If a page opens slowly (around 10 seconds), memcached is probably not configured correctly or the service is not started. If you encounter any problems, make sure that all the necessary services are running.
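To check from a script how far an import got, the final counter can be pulled out of mwdumper's saved progress output; a small sketch matching the sample lines above:

```shell
#!/bin/sh
# Read mwdumper progress lines ("6,343 pages (66.047/sec), ...") from
# stdin and print the final page count without the thousands separator.
imported_pages() {
  awk 'END { gsub(",", "", $1); print $1 }'
}

# Demonstration with the tail of the sample output above:
printf '6,000 pages (67.56/sec), 6,000 revs (67.56/sec)\n6,343 pages (66.047/sec), 6,343 revs (66.047/sec)\n' \
  | imported_pages   # prints: 6343
```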

If you want, you can upload all the dumps into the database, but make sure that there is enough room to store the data, or use S3 storage for the database content. To see the amount of free space, use the df command:

$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 9.9G 1.3G 8.1G 14% /
none 828M 116K 828M 1% /dev
none 833M 0 833M 0% /dev/shm
none 833M 56K 833M 1% /var/run
none 833M 0 833M 0% /var/lock
/dev/xvda2 147G 228M 140G 1% /mnt

Figure 27: MediaWiki application with part of the Wikipedia dump uploaded into the database; red links indicate missing content

8.5 Framework installation

The main part of the installation is now accomplished: the site is accessible from the web, uses the MySQL database and memcached to fetch data for visitors, and has data uploaded into the database. Now we need to upload the framework to the instance to allow dynamically allocating and monitoring servers while running the service in the cloud. As the framework is still under development and additional functionality has to be covered, it has not yet been made publicly available. This section gives brief information about the steps necessary to get the framework running.

Monitoring tool

One important service to be started is the monitoring tool. It pushes the information collected with sysstat out through a predefined port, where it is accessible by the framework core. The service has to be started at instance boot.
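The monitoring tool itself is not distributed with this deliverable, but the kind of value it publishes can be illustrated with sysstat, which was installed above. The parsing assumes sar's standard CPU report layout, where %idle is the last column; the port number and the use of netcat in the commented loop are assumptions for illustration:

```shell
#!/bin/sh
# Extract the average idle-CPU percentage from `sar -u` output; the
# framework core could turn such values into scaling decisions.
cpu_idle() {
  awk '/^Average/ { print $NF }'
}

# Demonstration with a canned sar-style summary line:
printf 'Average:        all     1.00    0.00    0.50    0.00    0.00   98.50\n' \
  | cpu_idle   # prints: 98.50

# An agent loop in the spirit of the monitoring tool (illustrative only):
# while true; do sar -u 1 1 | cpu_idle | nc -l -p 6666 -q 1; done
```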


More information

MuleSoft Blueprint: Load Balancing Mule for Scalability and Availability

MuleSoft Blueprint: Load Balancing Mule for Scalability and Availability MuleSoft Blueprint: Load Balancing Mule for Scalability and Availability Introduction Integration applications almost always have requirements dictating high availability and scalability. In this Blueprint

More information

ZEN LOAD BALANCER EE v3.02 DATASHEET The Load Balancing made easy

ZEN LOAD BALANCER EE v3.02 DATASHEET The Load Balancing made easy ZEN LOAD BALANCER EE v3.02 DATASHEET The Load Balancing made easy OVERVIEW The global communication and the continuous growth of services provided through the Internet or local infrastructure require to

More information

Load Balancing using Pramati Web Load Balancer

Load Balancing using Pramati Web Load Balancer Load Balancing using Pramati Web Load Balancer Satyajit Chetri, Product Engineering Pramati Web Load Balancer is a software based web traffic management interceptor. Pramati Web Load Balancer offers much

More information

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida Amazon Web Services Primer William Strickland COP 6938 Fall 2012 University of Central Florida AWS Overview Amazon Web Services (AWS) is a collection of varying remote computing provided by Amazon.com.

More information

HCIbench: Virtual SAN Automated Performance Testing Tool User Guide

HCIbench: Virtual SAN Automated Performance Testing Tool User Guide HCIbench: Virtual SAN Automated Performance Testing Tool User Guide Table of Contents Introduction - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

More information

Deploying Adobe Experience Manager DAM: Architecture blueprints and best practices

Deploying Adobe Experience Manager DAM: Architecture blueprints and best practices Paper Deploying Adobe Experience Manager DAM: Architecture blueprints and best practices Table of contents 1: Adobe DAM architecture blueprints 5: DAM best practices Adobe Experience Manager digital asset

More information

Pertino HA Cluster Deployment: Enabling a Multi- Tier Web Application Using Amazon EC2 and Google CE. A Pertino Deployment Guide

Pertino HA Cluster Deployment: Enabling a Multi- Tier Web Application Using Amazon EC2 and Google CE. A Pertino Deployment Guide Pertino HA Cluster Deployment: Enabling a Multi- Tier Web Application Using Amazon EC2 and Google CE A Pertino Deployment Guide 1 Table of Contents Abstract... 2 Introduction... 3 Before you get Started...

More information

SCF/FEF Evaluation of Nagios and Zabbix Monitoring Systems. Ed Simmonds and Jason Harrington 7/20/2009

SCF/FEF Evaluation of Nagios and Zabbix Monitoring Systems. Ed Simmonds and Jason Harrington 7/20/2009 SCF/FEF Evaluation of Nagios and Zabbix Monitoring Systems Ed Simmonds and Jason Harrington 7/20/2009 Introduction For FEF, a monitoring system must be capable of monitoring thousands of servers and tens

More information

Alfresco Enterprise on AWS: Reference Architecture

Alfresco Enterprise on AWS: Reference Architecture Alfresco Enterprise on AWS: Reference Architecture October 2013 (Please consult http://aws.amazon.com/whitepapers/ for the latest version of this paper) Page 1 of 13 Abstract Amazon Web Services (AWS)

More information

Configuring Microsoft IIS 5.0 With Pramati Server

Configuring Microsoft IIS 5.0 With Pramati Server Configuring Microsoft IIS 5.0 With Pramati Server 46 Microsoft Internet Information Services 5.0 is a built-in web server that comes with Windows 2000 operating system. An earlier version, IIS 4.0, is

More information

Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid clouds.

Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid clouds. ENTERPRISE MONITORING & LIFECYCLE MANAGEMENT Unify IT Operations Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid

More information

Web Application s Performance Testing

Web Application s Performance Testing Web Application s Performance Testing B. Election Reddy (07305054) Guided by N. L. Sarda April 13, 2008 1 Contents 1 Introduction 4 2 Objectives 4 3 Performance Indicators 5 4 Types of Performance Testing

More information

Building a Highly Available and Scalable Web Farm

Building a Highly Available and Scalable Web Farm Page 1 of 10 MSDN Home > MSDN Library > Deployment Rate this page: 10 users 4.9 out of 5 Building a Highly Available and Scalable Web Farm Duwamish Online Paul Johns and Aaron Ching Microsoft Developer

More information

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching 1 Project Proposal Data Storage / Retrieval with Access Control, Security and Pre- Presented By: Shashank Newadkar Aditya Dev Sarvesh Sharma Advisor: Prof. Ming-Hwa Wang COEN 241 - Cloud Computing Page

More information

Migration Scenario: Migrating Batch Processes to the AWS Cloud

Migration Scenario: Migrating Batch Processes to the AWS Cloud Migration Scenario: Migrating Batch Processes to the AWS Cloud Produce Ingest Process Store Manage Distribute Asset Creation Data Ingestor Metadata Ingestor (Manual) Transcoder Encoder Asset Store Catalog

More information

Last Updated: July 2011. STATISTICA Enterprise Server Security

Last Updated: July 2011. STATISTICA Enterprise Server Security Last Updated: July 2011 STATISTICA Enterprise Server Security STATISTICA Enterprise Server Security Page 2 of 10 Table of Contents Executive Summary... 3 Introduction to STATISTICA Enterprise Server...

More information

Vistara Lifecycle Management

Vistara Lifecycle Management Vistara Lifecycle Management Solution Brief Unify IT Operations Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid

More information

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful. Architectures Cluster Computing Job Parallelism Request Parallelism 2 2010 VMware Inc. All rights reserved Replication Stateless vs. Stateful! Fault tolerance High availability despite failures If one

More information

1 (11) Paperiton DMS Document Management System System Requirements Release: 2012/04 2012-04-16

1 (11) Paperiton DMS Document Management System System Requirements Release: 2012/04 2012-04-16 1 (11) Paperiton DMS Document Management System System Requirements Release: 2012/04 2012-04-16 2 (11) 1. This document describes the technical system requirements for Paperiton DMS Document Management

More information

Web Development. Owen Sacco. ICS2205/ICS2230 Web Intelligence

Web Development. Owen Sacco. ICS2205/ICS2230 Web Intelligence Web Development Owen Sacco ICS2205/ICS2230 Web Intelligence 2. Web Servers Introduction Web content lives on Web servers Web servers speak the platform independent HyperText Transfer Protocol (HTTP) (so

More information

SAS 9.4 Intelligence Platform

SAS 9.4 Intelligence Platform SAS 9.4 Intelligence Platform Application Server Administration Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2013. SAS 9.4 Intelligence Platform:

More information

Comparative Study of Load Testing Tools

Comparative Study of Load Testing Tools Comparative Study of Load Testing Tools Sandeep Bhatti, Raj Kumari Student (ME), Department of Information Technology, University Institute of Engineering & Technology, Punjab University, Chandigarh (U.T.),

More information

technical brief Optimizing Performance in HP Web Jetadmin Web Jetadmin Overview Performance HP Web Jetadmin CPU Utilization utilization.

technical brief Optimizing Performance in HP Web Jetadmin Web Jetadmin Overview Performance HP Web Jetadmin CPU Utilization utilization. technical brief in HP Overview HP is a Web-based software application designed to install, configure, manage and troubleshoot network-connected devices. It includes a Web service, which allows multiple

More information

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip Load testing with WAPT: Quick Start Guide This document describes step by step how to create a simple typical test for a web application, execute it and interpret the results. A brief insight is provided

More information

Configuring Windows Server Clusters

Configuring Windows Server Clusters Configuring Windows Server Clusters In Enterprise network, group of servers are often used to provide a common set of services. For example, Different physical computers can be used to answer request directed

More information

Multi-Datacenter Replication

Multi-Datacenter Replication www.basho.com Multi-Datacenter Replication A Technical Overview & Use Cases Table of Contents Table of Contents... 1 Introduction... 1 How It Works... 1 Default Mode...1 Advanced Mode...2 Architectural

More information

19.10.11. Amazon Elastic Beanstalk

19.10.11. Amazon Elastic Beanstalk 19.10.11 Amazon Elastic Beanstalk A Short History of AWS Amazon started as an ECommerce startup Original architecture was restructured to be more scalable and easier to maintain Competitive pressure for

More information

Oracle Communications WebRTC Session Controller: Basic Admin. Student Guide

Oracle Communications WebRTC Session Controller: Basic Admin. Student Guide Oracle Communications WebRTC Session Controller: Basic Admin Student Guide Edition 1.0 April 2015 Copyright 2015, Oracle and/or its affiliates. All rights reserved. Disclaimer This document contains proprietary

More information

High Performance Cluster Support for NLB on Window

High Performance Cluster Support for NLB on Window High Performance Cluster Support for NLB on Window [1]Arvind Rathi, [2] Kirti, [3] Neelam [1]M.Tech Student, Department of CSE, GITM, Gurgaon Haryana (India) arvindrathi88@gmail.com [2]Asst. Professor,

More information

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at distributing load b. QUESTION: What is the context? i. How

More information

Deployment Guide Oracle Siebel CRM

Deployment Guide Oracle Siebel CRM Deployment Guide Oracle Siebel CRM DG_ OrSCRM_032013.1 TABLE OF CONTENTS 1 Introduction...4 2 Deployment Topology...4 2.1 Deployment Prerequisites...6 2.2 Siebel CRM Server Roles...7 3 Accessing the AX

More information

Web Application Hosting in the AWS Cloud Best Practices

Web Application Hosting in the AWS Cloud Best Practices Web Application Hosting in the AWS Cloud Best Practices September 2012 Matt Tavis, Philip Fitzsimons Page 1 of 14 Abstract Highly available and scalable web hosting can be a complex and expensive proposition.

More information

Enterprise Manager Performance Tips

Enterprise Manager Performance Tips Enterprise Manager Performance Tips + The tips below are related to common situations customers experience when their Enterprise Manager(s) are not performing consistent with performance goals. If you

More information