PERFORMANCE EVALUATION OF E-COMMERCE SERVERS USING THE TPC-W BENCHMARK

D.F. García, J. García, C. López, I. Canga, D. González
University of Oviedo
Department of Informatics
33204 Gijón, Spain

ABSTRACT

The performance of servers is currently a major factor in the success of electronic commerce services. These services and applications integrate two well-known technologies into the server: web servers operate as an interface with the clients, and database servers provide the information required in the transactions. There are well-established benchmarking technologies for both kinds of servers, but there is as yet little research into the combination of the two to provide particular services, such as electronic commerce services. The main contribution of this paper is a discussion of the issues involved in the characterization, implementation and utilization of a benchmark for performance evaluation of e-commerce servers.

KEYWORDS

Performance benchmark, E-Commerce server performance, Web server performance.

1. INTRODUCTION

The most important aspects of the performance evaluation of e-commerce servers using benchmarking techniques are the workload characterization, the selection of appropriate metrics and the sensitivity analysis of the metrics to changes in workload or system parameters.

The first aspect considered is the characterization of the workload with which to carry out the evaluation. Currently there are several models for e-commerce services. Some models are similar, while others differ significantly. As the workload is one of the most important elements of the models, a single synthetic workload cannot represent all the current models of e-commerce services properly. To reach an appropriate level of representativeness of the synthetic workload, only one model of e-commerce is characterized: one that fits the majority of businesses.
The second aspect is the selection of the metric that best expresses the performance level a server is capable of providing. The most common performance metric is the maximum sustained throughput provided by the server under several restrictions on the response time of the transactions.

Finally, the way in which the selected metric reflects the influence of possible changes in the workload parameters and in the computer system parameters must be analyzed. Obtaining information about metric behavior when the user profiles change serves to establish the applicability and validity of a benchmark for wider or narrower e-commerce scenarios. Gaining insight into metric behavior when the system parameters change is essential to assess the applicability and utility of a benchmark in order to support decisions on dimensioning and scaling [Menasce2000] of e-commerce servers.

This research shows how a benchmark achieves generalized usefulness and applicability for the evaluation of e-commerce server performance. The intention is not to develop a new e-commerce benchmark, but to use an existing one. Some benchmarks have been developed for research purposes, such as WebTP [Jutla1999a], a web-based order management system, or WebEC [Jutla1999b][Bodorik2000], a generic e-broker site. Other benchmarks have been developed by the IT industry following two different approaches. In the first approach, specific e-commerce applications were customized to be used for benchmarking purposes, such as on-line financial management services [NSTL1999], e-commerce solutions for Internet bankers [FISERV2000], or an on-line bookstore application [Pendleton2000]. In the second approach, general-purpose e-commerce benchmarks were developed, such as the e-commerce suite included as a part of WebBench [WebBench2001] or TPC-W [TPC-W2000] [García2003], which also represents an on-line bookstore application.

Because the TPC-W commercial benchmark is an excellent representation of the most common type of e-commerce applications currently being developed by the IT industry, the work on performance evaluation of e-commerce servers presented in this paper is based on this benchmark. The paper is structured as follows. In section two, a brief analysis of the characteristics, components and the e-commerce model supported by TPC-W is presented. In section three, the approach taken for the design of a specific implementation of the TPC-W benchmark is outlined. The experimental results obtained are presented in section four and, finally, the conclusions are summarized.

2. ANALYSIS OF THE TPC-W BENCHMARK

In this section the main features of an e-commerce benchmark are briefly described, including the architecture of the tested system, the workload or e-commerce model supported, and the reporting metrics.

2.1 Architecture of the tested system

All e-commerce benchmarks, including TPC-W, have a client-server architecture. The server computer includes all the components that constitute the e-commerce server. The client computers work as emulators that generate the same workload that real customers would generate; they are called RBEs (Remote Browser Emulators). The PGE (Payment Gateway Emulator) emulates the entity which authenticates the users and authorizes the payments. The server and clients communicate through a dedicated network. Figure 1 represents the architecture of the tested system.
Figure 1. Architecture for benchmarking an e-commerce server with TPC-W
(Diagram not reproduced. Labels: Emulated Browsers, Remote Browser Emulators RBE-1 to RBE-N, Payment Gateway Emulator PGE, Network, E-Commerce Server with HTTP Server, Application Server (CGI, ISAPI), Application, Web-Object Storage and Data-Base, Application transactions.)

2.2 Workload model

A general benchmark of widespread applicability should not represent the activity of a particular e-commerce segment, but that of any company which markets and sells products or services through the Internet. TPC-W follows this approach. Next, the main e-commerce models currently used are analyzed to evaluate how they are represented by the TPC-W benchmark.

International Conference WWW/Internet 2003

E-commerce models are broadly categorized into three classes: cybermediary, manufacturer and auction models [Jutla1999c].

The cybermediary model represents a company that operates as an intermediary between suppliers of products or services and final customers. The TPC-W benchmark represents this model well, but in a simplified manner, as TPC-W keeps all the products offered by the cybermediary in its internal databases, without consulting the supplier databases on-line. Furthermore, TPC-W does not consider the information exchanged with delivery companies or with other cybermediaries.

The manufacturer model represents a company that markets and distributes its own products directly to the final customer through the Internet. In this model, the company only requires access to external databases for payment management. TPC-W also fits the manufacturer model very well, covering all the main aspects addressed in the model.

The auction model represents a company that manages a stock auction market, where both sellers (providing a list of goods to the company) and buyers (submitting bids for the goods) are final customers of the company. The TPC-W benchmark does not represent this model well, because the specific set of interactions and internal processes it involves is not considered in TPC-W.

Across the different e-commerce models, the most typical workload supported by an e-commerce server consists of shopping sessions. Each session is composed of a sequence of interactions of different types, such as searching and browsing products, adding products to the shopping cart, and buying products. The sequences of interactions can be represented by a state transition graph, called the Customer Behavior Model Graph (CBMG) in [Menasce1999], or simply a Web Interaction Diagram in TPC-W. Figure 2 represents the Interaction Diagram for the TPC-W benchmark.
Figure 2. Modeling the customer behavior in TPC-W benchmark using an interaction diagram
(Diagram not reproduced. Starting from Start User Session, it links the web interactions Home, New Product, Best Seller, Product Detail, Search Request, Search Result, Shopping Cart, Customer Registration, Buy Request, Buy Confirm, Order Inquiry, Order Display, Admin Request and Admin Confirm. Transitions are made either via buttons, such as <Add to cart>, <Checkout> or <Confirm Buy>, or via HREF links.)
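A state transition graph of this kind can be exercised with a simple random walk. The following is a minimal sketch, not the TPC-W definition: it uses a reduced subset of the interactions named in Figure 2, the edges and transition probabilities are invented for illustration, and the absorbing "Exit" state is an assumption added so that a session terminates.

```python
import random

# Reduced interaction graph. State names follow Figure 2, but the edges and
# probabilities below are illustrative assumptions, NOT the TPC-W values.
# "Exit" is a hypothetical absorbing state added to end a session.
TRANSITIONS = {
    "Home": [("Search Request", 0.4), ("New Product", 0.3), ("Best Seller", 0.3)],
    "New Product": [("Product Detail", 0.6), ("Home", 0.4)],
    "Best Seller": [("Product Detail", 0.6), ("Home", 0.4)],
    "Search Request": [("Search Result", 1.0)],
    "Search Result": [("Product Detail", 0.7), ("Search Request", 0.3)],
    "Product Detail": [("Shopping Cart", 0.3), ("Home", 0.5), ("Exit", 0.2)],
    "Shopping Cart": [("Buy Request", 0.5), ("Home", 0.3), ("Exit", 0.2)],
    "Buy Request": [("Buy Confirm", 0.8), ("Shopping Cart", 0.2)],
    "Buy Confirm": [("Home", 0.5), ("Exit", 0.5)],
}


def next_state(state, rng):
    """Pick the next interaction according to the outgoing probabilities."""
    targets, weights = zip(*TRANSITIONS[state])
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate_session(rng, start="Home", max_steps=1000):
    """Random walk over the graph until "Exit" or max_steps states."""
    path = [start]
    while path[-1] != "Exit" and len(path) < max_steps:
        path.append(next_state(path[-1], rng))
    return path


def interaction_mix(paths):
    """Fraction of each interaction type over all sessions (Exit excluded)."""
    counts = {}
    for path in paths:
        for state in path:
            if state != "Exit":
                counts[state] = counts.get(state, 0) + 1
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}
```

Changing only the probabilities in `TRANSITIONS` changes the resulting interaction mix while the graph itself stays the same, which is how different customer profiles can share one diagram.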

Most benchmarks dealing with user modeling in e-commerce environments differentiate between two typical user profiles. The first models customers that principally use the e-commerce service to find information about available products and usually leave the service without ordering. The second models customers with a higher probability of ordering a product before leaving the service. These customer profiles are called Browsing and Ordering, respectively, in the TPC-W benchmark. In [Menasce2000] they are called Occasional and Heavy Buyers. Both profiles share the same Interaction Diagram. The TPC-W benchmark defines a third profile, called Shopping, which merges the two basic profiles.

The difference between the profiles arises from the use of different transition probabilities in the interaction diagram. Therefore, the load generated with each profile will contain a different mix of the basic interactions, as illustrated in Table 1.

Table 1. Expected percentage of each interaction for each customer profile

Web interaction          Browsing profile   Shopping profile   Ordering profile
Browse group             95%                80%                50%
  Home                   29.00%             16.00%              9.12%
  New products           11.00%              5.00%              0.46%
  Best sellers           11.00%              5.00%              0.46%
  Product detail         21.00%             17.00%             12.35%
  Search request         12.00%             20.00%             14.54%
  Search results         11.00%             17.00%             13.08%
Order group              5%                 20%                50%
  Shopping cart           2.00%             11.60%             13.53%
  Customer registration   0.82%              3.00%             12.86%
  Buy request             0.75%              2.60%             12.73%
  Buy confirm             0.69%              1.20%             10.18%
  Order inquiry           0.30%              0.75%              0.25%
  Order display           0.25%              0.66%              0.22%
  Admin request           0.10%              0.10%              0.12%
  Admin confirm           0.09%              0.09%              0.11%

Each browser waits for a period, called think time, between two successive web interactions. In TPC-W, the think time must follow an exponential distribution with a mean between 7 and 8 seconds.

The population of the database scales with the expected throughput of the server.
This is a common characteristic of web and database benchmarks. In TPC-W the initial number of rows in each table of the database depends on two parameters:

- The number of emulated browsers (increasing one by one)
- The number of items to sell (5 discrete values: 10^3, 10^4, 10^5, 10^6, 10^7)

In summary, the workload has only two independent parameters: the type of client and the number of items to be sold. The number of Emulated Browsers (EBs) is determined by the Web Interactions Per Second (WIPS) supported by the server, and cannot be selected freely.

2.3 Performance metrics for an e-commerce server

The most common metric for measuring the performance of a server is the throughput under response time constraints. The performance metric in TPC-W is the number of Web Interactions Per Second (WIPS) measured in the average shopping scenario. The WIPS is computed as the total number of web interactions requested and completed successfully within a measurement interval, divided by the length of that measurement interval in seconds. To provide additional insight into the performance of an e-commerce server working under scenarios of browsing or ordering customer profiles, two additional throughput metrics are defined: WIPSb (for the browsing profile) and WIPSo (for the ordering profile).

3. ISSUES OF AN IMPLEMENTATION OF THE TPC-W BENCHMARK

The implementation of the TPC-W benchmark involves managing a wide spectrum of software and communication technologies to develop its main components, which are presented in the following subsections.
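One workload rule that any implementation has to reproduce is the think time of section 2.2. The sketch below shows one way to draw exponentially distributed think times; it assumes a mean of 7 seconds (the lower end of the allowed range) and uses hypothetical function names. Any further constraints that the TPC-W specification may place on the distribution are not modeled here.

```python
import random

MEAN_THINK_TIME_S = 7.0  # assumed; the text allows a mean between 7 and 8 s


def sample_think_time(rng, mean=MEAN_THINK_TIME_S):
    """Draw one think time in seconds from an exponential distribution."""
    # expovariate takes the rate (1 / mean), not the mean itself.
    return rng.expovariate(1.0 / mean)


def mean_think_time(n, seed=0):
    """Average of n sampled think times, useful as a sanity check."""
    rng = random.Random(seed)
    return sum(sample_think_time(rng) for _ in range(n)) / n
```

With a large number of samples, `mean_think_time` should approach the configured mean, which is a quick way to check that an emulator is configured within the allowed range.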

3.1 The electronic bookshop application

The e-commerce application, an e-bookshop, was developed using PHP technology, implementing each web interaction as a page of PHP code. This technology allows portability between multiple platforms with an expected performance better than that of other highly portable technologies, such as Java.

The e-bookshop is organized in a small directory tree. Its main directory, /ebookshop/, must be installed in the root publication directory of the HTTP server. Six directories of /ebookshop/ contain all the files of the application. Non-secure pages, served with HTTP, are placed in the directory /pag. Secure pages, served with HTTPS over SSL, are placed in the directory /pags. The directory /inc contains styles (.css) and inclusion scripts (.inc) for the pages of the application. The GIF images of navigation items, such as buttons and logos, are placed in the directory /img_nav, while the JPEG images of the books are contained in the directory /images. The directory /pge contains the program developed to connect the bookshop application with the payment gateway emulator, PGE. This additional program is necessary because the PHP version used does not support connections based on SSL.

3.2 Data generation utilities

The benchmark software includes two utilities developed to facilitate the generation of data: the database population utility and an image generator.

The database population utility receives the number of items to be sold (10^3, 10^4, etc.), as well as the number of emulated browsers, and produces a PL/SQL program which is used directly to drop, create and populate all the tables required by the benchmark.

For each item stored in the database, the image generator creates two JPEG images: a thumbnail and a detailed image. All thumbnail images have a fixed size of 5 KB, while the detailed images can have any of five pre-defined sizes.
3.3 The remote browser emulator

Two general approaches are currently used to generate a sequence, or traffic, of web interactions to load an e-commerce server: user emulation [Barford1998] and aggregate traffic generation [Kant2001]. This implementation of the benchmark follows the user emulation approach because it allows fine control over the behavioral aspects of the user, as required by the TPC-W benchmark.

In this implementation, the RBE is designed as a multithreaded program in which a single thread emulates each browser. All the threads are created just before the emulation begins and they are not destroyed until the emulation has finished. The emulation of browsing sessions consumes very few computational resources in a client machine in relation to the load injected into the server machine under test. Therefore, a single client machine can efficiently emulate enough browsers to saturate relatively powerful servers.

3.4 The payment gateway emulator

This module is implemented as a multithreaded server in which a dynamic pool of threads generates authorizations for payments. Because a minimum delay of 2 seconds is required to generate the authorizations, and because of the low percentage of payments in relation to the total number of interactions, the design and implementation of the PGE does not present special performance challenges. Therefore, it was implemented in Java using the API for Secure Socket Extension.

3.5 Data analysis programs

Two programs have been developed to generate the data for the two graphs required by the reporting procedures of the TPC-W benchmark. One program calculates the 90th percentile and the histogram of the response time for each web interaction following the TPC-W rules. The other calculates WIPS against elapsed time, counting the interactions within a sliding time window. The user can select two characteristics of the window: the size (always less than 30 seconds) and the displacement step.
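The two calculations described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' programs: the function names are hypothetical, the percentile uses the nearest-rank method (the exact TPC-W computation rules are not reproduced here), and the input is assumed to be a list of completion timestamps in seconds measured from the start of the run.

```python
def percentile_90(response_times):
    """Nearest-rank 90th percentile of a list of response times (seconds)."""
    if not response_times:
        raise ValueError("at least one response time is required")
    ordered = sorted(response_times)
    # Integer form of ceil(0.9 * n), avoiding floating-point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def wips_series(completion_times, window_s, step_s, duration_s):
    """WIPS against elapsed time using a sliding window.

    window_s is the window size and step_s the displacement step.
    Returns a list of (window_end_time, wips) pairs.
    """
    if not 0 < window_s < 30:
        raise ValueError("window size must be positive and below 30 s")
    if step_s <= 0:
        raise ValueError("displacement step must be positive")
    times = sorted(completion_times)
    series = []
    k = 0
    while window_s + k * step_s <= duration_s:
        end = window_s + k * step_s
        start = end - window_s
        # Count interactions completed inside the half-open window.
        count = sum(1 for t in times if start <= t < end)
        series.append((end, count / window_s))
        k += 1
    return series
```

A larger window smooths the WIPS curve at the cost of time resolution, while the step only controls how densely the curve is sampled.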

4. EXPERIMENTAL RESULTS USING THE TPC-W IMPLEMENTATION

This section presents a set of experimental results that help to explain the most relevant aspects of the TPC-W benchmark and its output metric. The experimental work was mainly carried out using a four-processor Intel Pentium-III server running the Linux operating system. Some experiments were carried out with a four-processor Alpha server running the Dec-Unix operating system. In both servers, the database used was ORACLE and the HTTP server was Apache, both connected by PHP scripts.

4.1 Interpretation of the WIPS metric

The primary interest here is to determine whether WIPS is a throughput metric that always corresponds to a point on the same part of the throughput curve of the server (linear, knee or saturation) or whether, on the contrary, the WIPS metric could fall on any part of the throughput curve depending on the characteristics of the server.

Figure 3 shows two typical load experiments in which WIPS appears as a metric of the sustainable throughput of an e-commerce server within the linear part of the complete throughput curve. In general, WIPS can be interpreted or used as a throughput metric for e-commerce servers operating in the linear part of their throughput curve, always before the saturation knee.

Figure 3. WIPS metric on the throughput curve of two servers
(Plots not reproduced. Two panels, CPU: Pentium and CPU: Alpha, plot WIPS against the number of Emulated Browsers (EBs) together with the reference curves EB/7 and EB/14. Labeled points: 4.23 WIPS at 31 EB for the Pentium server and 3.35 WIPS at 25 EB for the Alpha server.)

4.2 Granularity of the WIPS metric

In the TPC-W benchmark, the minimal variation of the throughput is obtained by adding or removing a single EB. Theoretically, with infinitely fast interactions (web interaction response time, WIRT = 0) and a think time of 7 seconds, each additional EB increments WIPS by 1/7 (0.1428), which is the granularity of the metric (curve EB/7 of figure 3).
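The 1/7 figure follows from treating each EB as a closed loop that alternates between thinking and waiting for a response, so that N browsers complete N / (think time + response time) interactions per second. The sketch below applies this standard closed-system relation; it assumes a mean think time of exactly 7 seconds and steady-state operation, and the function names are hypothetical.

```python
THINK_TIME_S = 7.0  # assumed mean think time (lower end of the 7-8 s range)


def max_wips(num_ebs, think_time_s=THINK_TIME_S, response_time_s=0.0):
    """Throughput of num_ebs browsers that each cycle think -> request."""
    return num_ebs / (think_time_s + response_time_s)


def implied_response_time(num_ebs, wips, think_time_s=THINK_TIME_S):
    """Mean response time consistent with a measured WIPS (N / X - Z)."""
    return num_ebs / wips - think_time_s
```

Under these assumptions, one EB contributes at most `max_wips(1)`, that is 1/7 WIPS. Taking the point labels of Figure 3 at face value, 31 EBs at 4.23 WIPS would correspond to a mean response time of roughly 0.33 seconds, and 25 EBs at 3.35 WIPS to roughly 0.46 seconds; these are back-of-the-envelope estimates, not measurements reported in the paper.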
Also, to prevent over-scaling the server, the rules of TPC-W do not allow the throughput to fall under 50% of the maximum possible increments, that is, WIPS must increase by at least 1/14 (0.0714) for each additional EB (curve EB/14 of figure 3). The slope of the throughput curve obtained from the benchmarking experiments shown in figure 3 reveals an experimentally measured granularity of nearly 1/7, matching the theoretically expected behavior.

4.3 Influence of load factors on the WIPS metric

The influence of load factors on WIPS is analyzed through experimental measurements. Although the reported WIPS depends on three load factors, only the client profile and the number of items in the database can be freely established in each load experiment. The number of emulated browsers cannot be selected in the experiments: it must be incremented just until the first response time restriction is violated.

Figure 4 shows the relationship between the WIPS metric and the three load factors obtained from the experiments. Each point is the average of the three replications carried out for each experiment. Figure 4 shows how WIPS clearly decreases as the number of items in the database increases. However, the influence of the type of client on WIPS does not show a clearly defined tendency.

Figure 4. Influence of load factors on the WIPS metric
(Plot not reproduced. WIPS is plotted against the number of Emulated Browsers (EBs) together with the reference curve EB/7, for the browsing (br), shopping (sh) and ordering (or) profiles. Point labels for the 1,000 and 10,000 item databases: 29 EB(sh), 27 EB(or), 26 EB(sh), 26 EB(br), 25 EB(or), 25 EB(br). Point labels for the 100,000 item database: 8 EB(br), 7 EB(sh), 10 EB(or).)

4.4 Influence of system factors on the WIPS metric

The system factors can be classified in two broad groups: software factors and hardware factors. The software factors are usually unordered qualitative factors, mainly associated with the version or release of each software component of the system or with their configuration parameters. The hardware factors are mainly ordered quantitative factors, whose levels are expressed numerically in increasing order: the number of CPUs and the amount of RAM. To evaluate the influence of system factors on WIPS, the load factors are fixed at their mean levels, that is, a client of shopping type and 10^5 items in the database.

The evaluation of the influence of software system factors involves many configuration parameters of the system software. However, the default installation values for these parameters generally provide nearly optimum performance, except for the database queries. To optimize the queries, the database is indexed following this rule: when a field of a table is referenced in one or more SQL clauses, a single index is created using the field. We have checked that the addition of compound indexes does not improve the results obtained with single indexing. Figure 5 allows the comparison of the WIPS obtained without indexing (circles on dotted line) with the WIPS obtained with indexing (circles on continuous line). Single indexing of tables allows an increment of WIPS of up to 300%.
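The single-indexing rule can be sketched as a small generator of CREATE INDEX statements, one per distinct field referenced in the queries. This is a minimal illustration: the function name, the index naming scheme and the table and column names in the usage note are hypothetical, and the statements the authors actually ran are not given in the paper.

```python
def single_column_indexes(referenced_fields):
    """Return one CREATE INDEX statement per distinct (table, field) pair.

    referenced_fields is an iterable of (table, field) tuples collected
    from the SQL clauses of the application queries. Duplicates are
    ignored, so each field gets exactly one single-column index.
    """
    seen = set()
    statements = []
    for table, field in referenced_fields:
        key = (table.lower(), field.lower())
        if key in seen:
            continue
        seen.add(key)
        statements.append(
            f"CREATE INDEX idx_{key[0]}_{key[1]} ON {table} ({field});"
        )
    return statements
```

For example, calling it with the illustrative pairs `("item", "i_subject")` and `("item", "i_title")` yields two statements, each indexing a single column, with no compound index.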
To evaluate the influence of hardware system factors on WIPS, the load factors and software system factors remain fixed, while the most important hardware system factors are varied. There are many factors in a multiprocessor server that could affect the WIPS metric. However, the key factor to evaluate is the influence of the number of processors on WIPS, as this is the main factor used to scale an e-commerce server. In addition, the influence of the amount of memory installed in the server must be evaluated for each configuration, because the processors will only provide their full computational power if they do not suffer memory starvation problems.

Figure 5 shows the results of the experiments carried out varying these factors. With the minimum amount of memory that allows the database to operate, 128 MB (squares on continuous line), the number of processors used has no influence on WIPS. In other words, under the most severe memory starvation conditions, the e-commerce server is not scalable. With more memory installed, 256 MB (triangles on continuous line), the addition of a second processor increases the WIPS very slightly, but the addition of further processors is useless. By further increasing the memory installed, 512 MB (diamonds on continuous line), the addition of a second processor increases the WIPS noticeably, but the addition of the third processor only allows a small increment of the WIPS and a fourth processor is useless. Finally, in the absence of memory starvation problems, 1536 MB (circles on continuous line), the addition of processors always generates appreciable increments in WIPS.

Figure 5. Influence of system factors on the WIPS metric
(Plot not reproduced. WIPS is plotted against the number of CPUs, from 1 to 4, for the following experimental factors: 1536 MB NO-PGE Indexed; 1536 MB PGE Indexed; 512 MB PGE Indexed; 256 MB PGE Indexed; 1536 MB PGE NO-Indexed; 128 MB PGE Indexed.)

4.5 Influence of authentication on the WIPS metric

A set of experiments was carried out to show the influence of the authentication service on the WIPS metric. This service is provided by the Payment Gateway Emulator (PGE). The TPC-W benchmark requires the response time of the PGE, between the reception of a message and its response, to be no less than 2 seconds. To evaluate whether the PGE limits performance, Figure 5 shows the WIPS measured with PGE (circles on continuous line) and without PGE (crosses on thick continuous line). When the authentication services provided by the PGE are not considered, the WIPS increases, as the two upper curves of figure 5 show. Considering that the e-commerce server will be fully optimized before its operation, the authentication services will produce a reduction of 10% in the WIPS.

5. CONCLUSION

The evaluation work presented in this paper shows that, in general, the TPC-W e-commerce synthetic workload and its associated benchmarking rules are a very useful tool to generate a standard metric of the transactional capacity of servers working in e-commerce environments. The specific results of the evaluation work are summarized in the following paragraphs.

The WIPS metric typically represents the sustainable throughput of an e-commerce server working between the middle and the end of the linear part of the whole throughput curve.

The granularity of the WIPS metric is the inverse of the think time used by the emulated browsers between their successive interactions, showing a typical value of 1/7 and high repeatability.

The TPC-W synthetic workload represents the manufacturer e-commerce model very well and the cybermediary e-commerce model only in a simplified manner. For other classic e-commerce models, such as the auction model, the TPC-W workload lacks representativeness.

Users can select different values for the two factors of the TPC-W workload: the number of items in the database (10^3, 10^4, 10^5, 10^6, 10^7), and the navigation profile of the users (browsing, shopping, ordering). The number of items has a very strong influence on the WIPS metric, while the influence of the profile on WIPS is practically irrelevant.

On symmetric multiprocessing platforms (SMPs), the TPC-W synthetic workload shows moderate scalability, measured as the WIPS speedup. With the maximum memory restrictions (128 Mbytes) scalability is null, that is, the workload cannot exploit additional processors added to the SMP platform. When the number of processors increases from 1 to 4 without memory restrictions (>512 Mbytes), the workload shows a scalability of 2 without database indexing and 2.6 with database indexing.

Finally, the authentication services provided by the PGE reduce the maximum performance of the e-commerce server by 10% when the server software is fully optimized.

ACKNOWLEDGEMENT

The Spanish Research, Development and Innovation Program supported this work under the project TIC2001-1374-C03-03.

REFERENCES

Barford, P. and Crovella, M., 1998. Generating Representative Web Workloads for Network and Server Performance Evaluation. In Performance Evaluation Review, Vol.26, No.1, pp.151-160.
Bodorik, P. et al, 2000. A Step Towards a Benchmark Repository for E-Commerce. Proceedings of 1st International Conference on Electronic Commerce and Web Technologies. Greenwich, UK.
FISERV, 2000. The PremierEcom Scaling Tests Scalable E-Commerce Solutions for Internet Bankers. White Paper of Fiserv Inc. Brookfield, WI, USA. http://www.fiserv.com
García, D.F. and García, J., 2003. TPC-W E-Commerce Benchmark Evaluation. In IEEE Computer, Vol.36, No.2, pp.42-48.
Jutla, D. et al, 1999a. WebTP: A Benchmark for Web-based Order Management Systems. Proceedings of 32nd Hawaii International Conference on System Sciences. Maui, Hawaii, USA.
Jutla, D., Bodorik, P. and Wang, Y., 1999b. Developing Internet E-Commerce Benchmarks. In Information Systems, Vol.4, No.6, pp.475-493.
Jutla, D. et al, 1999c. Making Business Sense of Electronic Commerce. In IEEE Computer, Vol.32, No.3, pp.67-75, March.
Kant, K., Tewary, V. and Iyer, R., 2001. GEIST: A Generator for E-Commerce & Internet Server Traffic. Proceedings of the IEEE Int. Symp. on Performance Analysis of Systems and Software. Tucson, Arizona, USA.
Menascé, D.A. et al, 1999. A Methodology for Workload Characterization for E-commerce Sites. Proceedings of 1st ACM Conference in Electronic Commerce. Denver, Colorado, USA.
Menascé, D.A. and Almeida, V.F., 2000. Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning. Prentice-Hall, New Jersey, USA.
NSTL, 1999. Scalability and Performance Testing of a DNA Application. Technical Report of National Software Testing Labs Inc. Philadelphia, Pennsylvania, USA. http://www.nstl.com
Pendleton, M. and Desai, G., 2000. @Bench Test Report: Performance and Scalability of Windows 2000. Technical Report of Doculabs. Chicago, USA. http://www.doculabs.com
TPC, 2002. TPC Benchmark W (Web Commerce) Specification. Technical Specification of Transaction Processing Performance Council. San Francisco, California, USA. http://www.tpc.org
WebBench, 2002. WebBench Benchmark. Technical Specification of VeriTest. Los Angeles, California, USA. http://www.etestinglabs.com