Scaling for E-Business. Chapter 11 Characterizing E-Business Workloads


Overview

- Introduction
- Workload Characterization of Web Traffic
- Characterizing Customer Behavior
- From HTTP Logs to CBMGs
  - GetSessions Algorithm
  - GetCBMGs Algorithm
  - How Many Clusters to Choose?
- From HTTP Logs to CVMs
- Resource-Level Workload Characterization
- E-Business Benchmarks: TPC-W
- Concluding Remarks

Introduction

Two models for customer behavior characterization are discussed in Chapter 2:
- Customer Behavior Model Graph (CBMG): captures the navigational pattern of a customer during a visit to the site
- Customer Visit Model (CVM): captures only the number of times a customer executes each of the e-business functions per session (less detailed)

Introduction (Cont.)

In this chapter, we show:
- how CBMGs and CVMs can be obtained from HTTP logs
- methods, based on clustering analysis, to derive small groups of CBMGs or CVMs that accurately represent the workload
- how the parameters for resource models (e.g., queuing network models) can be derived from customer behavior models

Workload Characterization of Web Traffic

Workload characterization studies aim to detect invariants, i.e., regular and predictable patterns, of Web traffic from measurements taken at clients, proxy servers, servers, and the Web as a whole. The focus of this chapter is on workload characterization for e-commerce, but it is important to first review some of the analysis results obtained from information-retrieval Web servers.

Workload Characterization of Web Traffic (Cont.)

Rank documents by popularity: the most popular document has rank one, the second most popular rank two, and so on. File popularity was shown to follow a Zipf distribution, which means that the number of accesses, P, to a document is inversely proportional to the document's rank r: P = k / r, where k is a constant.

Example 1: Workload Characterization of Web Traffic

Question: The HTTP log of a Web site shows 1,800 requests for files during a five-minute period. These requests are directed to 12 unique files. Assuming Zipf's Law, what is the estimated number of accesses to each of the 12 files?

Solution: Number the files from 1 to 12 according to their rank (file 1: most popular; file 12: least popular) and let the number of accesses to file r be k / r.

Example 1 (Cont.)

The total number of accesses can then be written as

    k (1/1 + 1/2 + ... + 1/12) = 1,800
    k x 3.1032 = 1,800

Therefore, k = 580.05. So the estimated number of accesses to the most popular file is k/1 ≈ 580, and the estimated number of accesses to the least popular file is k/12 = 580.05/12 ≈ 48.
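Example 1's arithmetic can be reproduced in a few lines of Python. This is just a sketch of the Zipf computation above, with the request and file counts taken from the example:

```python
# Zipf's Law: the number of accesses to the file of rank r is k / r.
# Numbers from Example 1: 1,800 requests to 12 unique files.
N_REQUESTS = 1800
N_FILES = 12

# k * (1/1 + 1/2 + ... + 1/12) = 1,800  =>  k = 1,800 / H_12
harmonic = sum(1.0 / r for r in range(1, N_FILES + 1))
k = N_REQUESTS / harmonic

accesses = [k / r for r in range(1, N_FILES + 1)]
print(f"k = {k:.2f}")                        # ~580.05
print(f"most popular:  {accesses[0]:.0f}")   # ~580
print(f"least popular: {accesses[-1]:.0f}")  # ~48
```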

Example 1 (Cont.)

[Figure: how the number of references varies from the most to the least popular file]

Workload Characterization of Web Traffic (Cont.)

A heavy-tailed distribution for a random variable X is one in which the tail of the distribution, i.e., the probability that X > x, decreases as x^(-α) for large values of x and for 0 < α < 2. Several empirical studies have found that many distributions related to Web traffic (e.g., the distribution of file sizes retrieved from a Web server, or the reading time per page) are heavy-tailed.

Workload Characterization of Web Traffic (Cont.)

For these distributions, the probability that a large value occurs is small but non-negligible. A good example of a heavy-tailed distribution is the Pareto distribution. Its cumulative distribution function (CDF) is given by

    F(x) = P[X ≤ x] = 1 - (k/x)^α,   x ≥ k > 0

and the tail of the distribution is given by

    P[X > x] = (k/x)^α

Workload Characterization of Web Traffic (Cont.)

On the Web, while most files retrieved from a Web server are small, there is a non-negligible probability of large files (e.g., images and video clips) being retrieved. The next example illustrates the properties of heavy-tailed distributions.

Example 2: Workload Characterization of Web Traffic

Suppose the HTTP log for a website was analyzed to estimate the distribution of the sizes of the files retrieved from the site, and suppose that the file size X follows a Pareto distribution. If we plot the logarithm of the tail of the distribution, i.e., log P[X > x], versus the logarithm of the file size, we obtain the straight line

    log P[X > x] = -α log x + α log k

Example 2 (Cont.)

[Figure: log-log plot of the tail of a Pareto distribution for α = 0.5 and k = 1]

Example 2 (Cont.)

So, the logarithm of the tail of the distribution decreases linearly with the logarithm of the file size, with a slope of -α. This is a simple test for verifying that a distribution has a heavy tail: if we get a straight line for large values of x, then we are dealing with a heavy-tailed distribution.
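This straight-line test can be tried numerically. The sketch below (assuming NumPy is available) draws samples from a Pareto distribution with α = 0.5 and k = 1 via inverse-transform sampling, then fits a line to the empirical log-log tail; the fitted slope comes out close to -α:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, k = 0.5, 1.0

# Inverse-transform sampling: if U ~ Uniform(0,1), then
# X = k * U**(-1/alpha) follows a Pareto(alpha, k) distribution.
u = rng.uniform(size=100_000)
x = np.sort(k * u ** (-1.0 / alpha))

# Empirical tail P[X > x] at each sorted sample point.
tail = 1.0 - np.arange(1, x.size + 1) / x.size

# Fit a straight line to the log-log tail (trimming points where
# the empirical tail is too close to zero to take a logarithm).
keep = tail > 1e-4
slope, _ = np.polyfit(np.log(x[keep]), np.log(tail[keep]), 1)
print(f"fitted slope: {slope:.2f}")  # close to -alpha = -0.5
```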

Workload Characterization of Web Traffic (Cont.)

Some Web traffic features that were found to be heavy-tailed include:
- the size of files requested from Web servers and from the entire Web
- the number of pages requested per site
- the reading time per page

Many empirical studies have also demonstrated that small images account for the majority of the traffic and that document size is inversely related to request frequency.

Workload Characterization of Web Traffic (Cont.)

HTTP traffic was shown to be self-similar, i.e., it exhibits similar patterns of burstiness across several time scales, ranging from microseconds to minutes. A summary of WWW search characterization showed that 99% of the queries did not use any Boolean or other advanced operators.

Workload Characterization of Web Traffic (Cont.)

A study of proxy server workloads in a cable modem environment found that 40% of the total size of the unique HTTP files retrieved is due to the presence of a few very large file types (e.g., audio, video, compressed, and executable files). Due to the higher bandwidth of cable modems, users become more willing to download larger files, which poses additional stress on server resources.

Workload Characterization of Web Traffic (Cont.)

The basic workload component for information-retrieval websites is the individual HTTP request. The remainder of this chapter deals with workload characterization methods for e-commerce sites, where the basic component is the session.

Characterizing Customer Behavior

A customer's navigational pattern includes two aspects:
- transitional: determines how a customer moves from one state (i.e., an e-business function) to the next; this is represented by the matrix of transition probabilities
- temporal: has to do with the time it takes for a customer to move from one state to the next; this time is measured from the server's perspective and is called server-perceived think time, or just think time

Characterizing Customer Behavior (Cont.)

Server-perceived think time (or just think time) is defined as the average time elapsed from when the server completes a request for a customer until it receives the next request from the same customer during the same session. The server-side think time is t3 - t2 = 2 x nt + Zb, where nt represents the network time and Zb is the browser-side think time. A think time can be associated with each transition of the CBMG.

Characterizing Customer Behavior (Cont.)

So, a CBMG can be defined by a pair (P, Z) where:
- P = [p_ij] is an n x n matrix of transition probabilities between the n states of the CBMG
- Z = [z_ij] is an n x n matrix of the average think times between the states of the CBMG

Recall that state 1 is the Entry state and state n is the Exit state.

From HTTP Logs to CBMGs

Each customer session can be represented by a CBMG. We show here how:
- the CBMGs that characterize customer sessions can be obtained from HTTP logs
- CBMGs that originate from similar sessions can be grouped, with each group represented by a single CBMG

The goal is to characterize the workload by a relatively small and representative number of CBMGs, as opposed to having to deal with thousands of them.

From HTTP Logs to CBMGs (Cont.)

- Step 1 merges and filters the HTTP logs from the various HTTP servers of the e-commerce site
- Step 2 takes as input the request log and generates a session log S
- Step 3 takes as input the session log S and performs a clustering analysis that results in a set of CBMGs that can be used as a compact representation of the sessions in S

From HTTP Logs to CBMGs (Cont.)

Step 1: Merge and filter the HTTP logs from the various HTTP servers of the e-commerce site, discarding irrelevant entries such as image requests and errors. The logs can be merged into a single log using the timestamp; clock synchronization services such as those available in Linux and NT can be used to facilitate merging of distributed logs. This step generates a request log L.

From HTTP Logs to CBMGs (Cont.)

Each entry of the request log L is a four-tuple (u, r, t, x):
- UserID (u): identification of the customer submitting the request. Cookies, dynamic URLs, or even authentication mechanisms can be used to uniquely identify requests as coming from the same browser during a session
- RequestTime (t): time at which the request arrived at the site

From HTTP Logs to CBMGs (Cont.)

- ExecTime (x): execution time of the request. Even though this value is not normally recorded in the HTTP log, servers can be configured and/or modified to record it
- RequestType (r): the type of request. Examples include a GET on the home page, a browse request, a request to execute a search, a selection of one of the results of a search, etc. It is assumed that requests to execute CGI scripts or other server applications can be easily mapped into request types, i.e., states of the CBMG
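As a concrete illustration of the four-tuple, the sketch below parses one log line into (u, r, t, x). The whitespace-separated layout and field names are assumptions for illustration, not a format prescribed by the text:

```python
from dataclasses import dataclass

@dataclass
class Request:
    """One request-log entry (u, r, t, x) as defined above."""
    user_id: str         # u: customer issuing the request
    request_type: str    # r: e-business function, i.e., a CBMG state
    request_time: float  # t: arrival time at the site (seconds)
    exec_time: float     # x: execution time (requires server support)

def parse_line(line: str) -> Request:
    # Hypothetical layout: "UserID RequestType RequestTime ExecTime"
    u, r, t, x = line.split()
    return Request(u, r, float(t), float(x))

req = parse_line("cust42 search 1002.5 0.012")
print(req.request_type, req.exec_time)  # search 0.012
```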

From HTTP Logs to CBMGs: GetSessions Algorithm

Before describing the GetSessions step, we need to describe the session log S. The k-th entry in this log is a two-tuple (Ck, Wk), where Ck = [c_ij] is an n x n matrix of transition counts between states i and j of the CBMG for one session, and Wk = [w_ij] is an n x n matrix of accumulated think times between states i and j of the CBMG for that session.

For example, if a given session had 3 transitions between states s and t, with think times of 20 sec, 45 sec, and 38 sec, respectively, then c_st = 3 and w_st = 20 + 45 + 38 = 103 sec.

Building a CBMG: Matrix P for the CBMG of the Bookstore Example

A CBMG can be more formally characterized by a set of states, a set of transitions between states, and an n x n matrix, P = [p_ij], of transition probabilities between the n states of the CBMG. Note that the elements of the first column and the last row of any CBMG are all zero since, by definition, there are no transitions back to the Entry state from any state, nor any transitions out of the Exit state.

From HTTP Logs to CBMGs: GetSessions Algorithm (Cont.)

1. Sort the request log L by UserID and then by RequestTime to generate a sorted log Ls composed of subsequences, one per UserID, of the form (u, r1, t1, x1), ..., (u, rQ, tQ, xQ).
2. Each subsequence may represent one or more sessions. For example, a customer may generate a sequence of requests and return to the site one hour later for another session. Thus, subsequences need to be broken into sessions using a time threshold T (e.g., 30 minutes).

From HTTP Logs to CBMGs: GetSessions Algorithm (Cont.)

2. (cont.) If the time between two consecutive requests R1 and R2 in a subsequence exceeds T, R1 is considered to be the last request of a session and R2 the first request of the following session.
3. Subsequences are now broken down into sessions, and requests within sessions are in chronological order. Let Q be the number of requests in a given session for UserID u and let (u, r1, t1, x1), ..., (u, rQ, tQ, xQ) be the requests of this session as they appear in the sorted log Ls.

From HTTP Logs to CBMGs: GetSessions Algorithm (Cont.)

3. (cont.) Repeat the following procedure (shown in the original slides) for each session: build the transition-count matrix C and the accumulated think-time matrix W from the session's request sequence.
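Putting steps 1-3 together, a simplified GetSessions might look like the sketch below. It is a rough reconstruction, not the book's exact procedure: exec times are ignored, think time is approximated by the inter-request gap, and `state_of` is an assumed mapping from request types to CBMG state indices (0 = Entry, n-1 = Exit):

```python
from collections import defaultdict

def get_sessions(log, n_states, state_of, threshold=30 * 60):
    """GetSessions sketch. `log` is a list of (user, req_type, time)
    tuples; returns one (C, W) pair per session, where C counts state
    transitions and W accumulates think times (inter-request gaps)."""
    # Step 1: sort by UserID, then by RequestTime.
    by_user = defaultdict(list)
    for u, r, t in sorted(log, key=lambda e: (e[0], e[2])):
        by_user[u].append((r, t))

    sessions = []
    for reqs in by_user.values():
        # Step 2: split a user's subsequence wherever the gap between
        # consecutive requests exceeds the threshold T.
        runs, run = [], [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur[1] - prev[1] > threshold:
                runs.append(run)
                run = []
            run.append(cur)
        runs.append(run)

        # Step 3: build (C, W) for each session: Entry -> ... -> Exit.
        for run in runs:
            C = [[0] * n_states for _ in range(n_states)]
            W = [[0.0] * n_states for _ in range(n_states)]
            state, time = 0, run[0][1]          # state 0 is Entry
            for r, t in run:
                s = state_of[r]
                C[state][s] += 1
                W[state][s] += t - time         # think time on this arc
                state, time = s, t
            C[state][n_states - 1] += 1         # final hop to Exit
            sessions.append((C, W))
    return sessions

states = {"home": 1, "search": 2}
log = [("a", "home", 0), ("a", "search", 10), ("a", "home", 5000)]
print(len(get_sessions(log, 4, states)))  # 2 sessions (gap > 30 min)
```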

From HTTP Logs to CBMGs: GetSessions Algorithm (Cont.)

Some precautions in using HTTP logs:
- Recording request times with millisecond accuracy may not be sufficient as processors and networks become faster. For this reason, a higher-precision timestamp was recorded in Apache's HTTP log in one capacity planning study.
- You may want to clean the log of crawler activity. Having the browser identification recorded in the log is useful in this case.
- Most proxy and origin servers record, by default, only a small portion of each HTTP request and/or response. However, most support an extended log format and can be configured to provide more information.

From HTTP Logs to CBMGs: GetCBMGs Algorithm

Once the session log S is generated, we perform a clustering analysis on it to generate a synthetic workload composed of a relatively small number of CBMGs. The centroid of each cluster determines the characteristics of its CBMG. Any clustering algorithm can be used; an example is the k-means clustering algorithm.

From HTTP Logs to CBMGs: GetCBMGs Algorithm (Cont.)

The k-means clustering algorithm:
- begins by selecting k points in the space of points, which act as initial estimates of the centroids of the k clusters
- allocates each remaining point to the cluster with the nearest centroid
- iterates the allocation procedure over the input points until no point switches cluster assignment or a maximum number of iterations is reached

From HTTP Logs to CBMGs: GetCBMGs Algorithm (Cont.)

Clustering algorithms require the definition of a distance metric for computing the distance between a point and a centroid. Assume that the session log is composed of M points Xm = (Cm, Wm), m = 1, 2, ..., M, where Cm and Wm are the transition-count and accumulated think-time matrices defined previously. Our definition of distance is based on the transition-count matrix only, since this is the factor that most clearly defines the interaction between a customer and an e-commerce site.

From HTTP Logs to CBMGs: GetCBMGs Algorithm (Cont.)

We define the distance d(Xa, Xb) between two points Xa and Xb in the session log as the Euclidean distance

    d(Xa, Xb) = sqrt( Σ_{i=1}^{n} Σ_{j=1}^{n} (Ca[i,j] - Cb[i,j])² )

At any point during the execution of the k-means clustering algorithm we have k centroids, and the algorithm needs to keep track of the number of points, s(k), represented by each centroid k.

From HTTP Logs to CBMGs: GetCBMGs Algorithm (Cont.)

We now show how the coordinates of a new centroid, i.e., the new values of the matrices C and W, are obtained when a new point is added to a cluster. Suppose that point Xm = (Cm, Wm) is to be added to centroid k, represented by point (C, W). The new centroid is represented by the point (C', W'), where the elements of the matrices C' and W' are computed as

    C'[i,j] = ( s(k) C[i,j] + Cm[i,j] ) / ( s(k) + 1 )
    W'[i,j] = ( s(k) W[i,j] + Wm[i,j] ) / ( s(k) + 1 )
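The distance metric and the incremental centroid update translate directly from the formulas; the sketch below uses plain lists of lists for the n x n matrices:

```python
import math

def distance(Ca, Cb):
    """Euclidean distance between two transition-count matrices."""
    n = len(Ca)
    return math.sqrt(sum((Ca[i][j] - Cb[i][j]) ** 2
                         for i in range(n) for j in range(n)))

def add_to_centroid(C, W, Cm, Wm, s_k):
    """Fold point (Cm, Wm) into centroid (C, W) representing s_k points.

    Each coordinate becomes the running mean:
        C'[i,j] = (s(k) * C[i,j] + Cm[i,j]) / (s(k) + 1)
    and likewise for W. Returns the new centroid and its new size."""
    n = len(C)
    C2 = [[(s_k * C[i][j] + Cm[i][j]) / (s_k + 1) for j in range(n)]
          for i in range(n)]
    W2 = [[(s_k * W[i][j] + Wm[i][j]) / (s_k + 1) for j in range(n)]
          for i in range(n)]
    return C2, W2, s_k + 1

print(distance([[0, 3], [0, 0]], [[0, 0], [0, 4]]))  # 5.0
```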

From HTTP Logs to CBMGs: GetCBMGs Algorithm (Cont.)

Normalizing: once all the clusters have been obtained, we can derive the matrices P and Z, which characterize the CBMG associated with each cluster, as

    p_ij = C[i,j] / Σ_{k=1}^{n} C[i,k]
    z_ij = W[i,j] / C[i,j]

The arrival rate, λk, of sessions represented by the CBMG of cluster k is given by λk = s(k)/T, where T is the time interval during which the request log L was obtained. Once we have the matrices P and Z for each cluster, we can obtain the metrics discussed previously for each type of session.
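The normalization step can be sketched as follows; rows with no outgoing transitions (the Exit state) and arcs never taken are simply left at zero:

```python
def to_cbmg(C, W, s_k, T):
    """Derive (P, Z) and the session arrival rate from a cluster centroid.

        p[i][j] = C[i][j] / sum_m C[i][m]   (transition probability)
        z[i][j] = W[i][j] / C[i][j]         (average think time)
        lambda_k = s(k) / T                 (sessions/sec for the cluster)
    """
    n = len(C)
    P = [[0.0] * n for _ in range(n)]
    Z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_total = sum(C[i])
        for j in range(n):
            if row_total:
                P[i][j] = C[i][j] / row_total
            if C[i][j]:
                Z[i][j] = W[i][j] / C[i][j]
    return P, Z, s_k / T

# Toy 2-state centroid: 4 transitions on arc (0, 1), 8 sec accumulated.
P, Z, lam = to_cbmg([[0, 4], [0, 0]], [[0.0, 8.0], [0.0, 0.0]],
                    s_k=100, T=50)
print(P[0][1], Z[0][1], lam)  # 1.0 2.0 2.0
```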

Example 4: From HTTP Logs to CBMGs

Consider that an HTTP log was analyzed for an e-commerce site with the static CBMG shown in the original slides. The GetSessions algorithm generated 20,000 sessions out of the 340,000 lines in the request log.

Example 4 (Cont.): From HTTP Logs to CBMGs

After running the k-means clustering algorithm on the session log with k = 6, we obtained six clusters, summarized in a table (in the original slides) whose rows give, for each cluster:
- the percentage of sessions that fall into the cluster
- the buy-to-visit ratio (BV), which represents the percentage of customers who buy from the Web store
- the average number of shopping operations requested by a customer per visit to the electronic store
- the Add-to-Shopping-Cart visit ratio (Va), the average number of times per session that a customer adds an item to the shopping cart
- the number of browse and search operations associated with customers of the cluster

Example 4 (Cont.): From HTTP Logs to CBMGs

Two very different behavior patterns emerge from this characterization of the e-commerce workload:
- Cluster 1, which represents the majority of the sessions (44.28%), has a very short average session length (5.6) and the highest percentage of customers who buy from the store
- Cluster 6, which represents a small portion of the customers, exhibits the longest session length and the smallest buying ratio

Example 4 (Cont.): From HTTP Logs to CBMGs

Plotting the percentage of customers who buy as a function of the average session length, we observe that, for this example, the longer the session, the less likely it is for a customer to buy an item from the Web store. Moreover, the buy-to-visit ratio decreases in a quadratic fashion with the session length.

Example 4 (Cont.): From HTTP Logs to CBMGs

An alternative approach is to first partition the workload and then apply clustering techniques. For example, we may partition the HTTP log into sessions that resulted in sales and those that did not, and then apply clustering techniques to analyze separately the behavior of buyers and non-buyers. This approach has the advantage of giving special treatment to buyers, who typically constitute a small percentage of the sessions.

How Many Clusters to Choose?

This question can be answered by examining the variation of two metrics:
- the average distance between the points of a cluster and its centroid (the intracluster distance)
- the average distance between centroids (the intercluster distance)

This variation can be characterized by the coefficient of variation (CV), i.e., the ratio between the standard deviation and the average. The goal is to minimize the intracluster CV while maximizing the intercluster CV.

How Many Clusters to Choose? (Cont.)

If the number of clusters were made equal to the number of points, this goal would be achieved trivially. However, we want a compact representation of the workload, so we need to select a relatively small number of clusters such that the intracluster variance is small and the intercluster variance is large. The ratio between the intracluster and intercluster CVs, denoted βCV, is a useful guide in determining the quality of a clustering process.

How Many Clusters to Choose? (Cont.)

A plot of the intercluster and intracluster coefficients of variation, as well as βCV, versus the number of clusters k shows that CVintra does not vary much with the number of clusters, while CVinter increases with k. βCV drops significantly from k = 3 to k = 6 and then exhibits a much slower rate of decrease.
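One way to compute these quantities is sketched below, assuming NumPy and sessions flattened into vectors; it is an illustration of the CV-ratio criterion, not the book's exact procedure:

```python
import numpy as np

def cv(values):
    """Coefficient of variation: standard deviation over mean."""
    v = np.asarray(values, dtype=float)
    return v.std() / v.mean()

def beta_cv(points, labels, centroids):
    """Ratio of intracluster CV to intercluster CV (lower is better).

    `points` and `centroids` are 2-D arrays of flattened transition-count
    matrices; `labels[m]` gives the cluster of point m. Needs at least
    three unevenly spaced centroids for a non-degenerate intercluster CV."""
    intra = [np.linalg.norm(p - centroids[l]) for p, l in zip(points, labels)]
    k = len(centroids)
    inter = [np.linalg.norm(centroids[a] - centroids[b])
             for a in range(k) for b in range(a + 1, k)]
    return cv(intra) / cv(inter)

# Toy 1-D example with three clusters:
points = np.array([[0.0], [3.0], [9.0], [11.0], [24.0], [26.0]])
labels = [0, 0, 1, 1, 2, 2]
centroids = np.array([[1.0], [10.0], [25.0]])
print(round(beta_cv(points, labels, centroids), 3))  # ~0.83
```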

From HTTP Logs to CVMs

Sessions represented by a CVM instead of a CBMG can be obtained from an HTTP log through the GetCVMSessions algorithm, shown next.

From HTTP Logs to CVMs: GetCVMSessions Algorithm

1. Execute steps 1 and 2 of the GetSessions algorithm.
2. At this point, subsequences are broken down into sessions and requests within sessions are in chronological order. Let Q be the number of requests in a given session for UserID u and let (u, r1, t1, x1), ..., (u, rQ, tQ, xQ) be the requests of this session as they appear in the sorted log Ls. Repeat the following procedure for each session:

       V1 ← 1; Vn ← 1;
       Vi ← 0 for all i = 2, ..., n-1;
       For k ← 1 to Q do V_rk ← V_rk + 1;
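The per-session loop translates almost line for line into code; n and the 1-based state numbering follow the text:

```python
def get_cvm(session_states, n):
    """Build the CVM visit-count vector for one session.

    `session_states` lists the CBMG state (indices 2..n-1) of each
    request in chronological order; state 1 (Entry) and state n (Exit)
    are visited exactly once per session by definition."""
    V = [0] * (n + 1)   # 1-based: V[1]..V[n] used, V[0] ignored
    V[1] = V[n] = 1
    for r in session_states:
        V[r] += 1       # V_{r_k} <- V_{r_k} + 1
    return V[1:]

# A session visiting states 2, 3, 2, 4 in a 5-state CBMG:
print(get_cvm([2, 3, 2, 4], n=5))  # [1, 2, 1, 1, 1]
```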

From HTTP Logs to CVMs (Cont.)

Again, as in the case of sessions characterized by CBMGs, we need to group sessions into smaller, representative groups, and clustering techniques can also be applied here. The distance metric is the distance between two visit-ratio vectors. Consider sessions a and b characterized by the visit-ratio vectors Va = (V2_a, ..., V(n-1)_a) and Vb = (V2_b, ..., V(n-1)_b); note that we leave out the visit ratios for states 1 and n. The distance between sessions a and b is

    d(Va, Vb) = sqrt( Σ_{i=2}^{n-1} (Vi_a - Vi_b)² )
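This distance can be sketched as follows, over vectors that already exclude the Entry and Exit states (whose visit counts are always 1):

```python
import math

def cvm_distance(Va, Vb):
    """Euclidean distance between two visit-ratio vectors, restricted
    to states 2..n-1 (Entry and Exit carry no information)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(Va, Vb)))

print(cvm_distance([2, 1, 0], [2, 4, 1]))  # sqrt(9 + 1) ~ 3.162
```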

From HTTP Logs to CVMs (Cont.)

A table in the original slides shows the results of applying the k-means clustering algorithm for k = 2, 3, and 4. Cluster 4 in the k = 4 case captures the customers who do not buy anything from the site, while cluster 1 represents those who always buy.

Characterizing the Workload at the Resource Level

To perform capacity planning and sizing studies of an e-commerce site, we need to map each CBMG resulting from the workload characterization process described above to IT resources. Each e-business function (e.g., Search) is mapped to a client/server interaction diagram (CSID). With each server in the CSID, we associate service demands at the various components (e.g., CPU and disks) of the server; with each arc of the CSID, we associate service demands for the networks involved in the exchange of messages represented by the arc.

Example 5: Characterizing the Workload at the Resource Level

The characterization of customer behavior for an e-commerce site generated two CBMGs:
- one characterizes heavy buyers, i.e., customers who buy from the site with higher probability
- the other characterizes occasional buyers, who tend to search more than heavy buyers and buy less

Let us focus on the e-business function Search, which represents a state of the CBMG. Assume that the database server (DS) has 1 CPU and 2 disks, with service demands of 0.006 sec, 0.002 sec, and 0.0018 sec, respectively, for one execution of the Search transaction.

Example 5 (Cont.)

A table in the original slides displays the session arrival rate for each of the two CBMGs and the average number of visits to the Search state for each one.

Example 5 (Cont.)

Question: What is the service demand per session for Search functions at each component of the DS for each CBMG? What is the utilization of each resource of the database server due to the Search function?

Solution: For occasional buyers, each session executes 6.76 searches on average, and each search uses 0.006 sec of CPU at the database server.

Example 5 (Cont.)

So, the CPU service demand due to Search functions executed during sessions from occasional buyers is

    D_CPU,OccasionalBuyers(Search) = 6.76 x 0.006 = 0.0406 sec

In general, the service demand at a resource i (e.g., CPU or disk) due to sessions of type r (e.g., heavy buyers, occasional buyers) for all executions of the e-business function f (e.g., Search, Browse) is

    D_i,r(f) = V_f,r x D_i(f)

where V_f,r is the average number of executions of function f per session of type r and D_i(f) is the service demand of a single execution of function f at resource i.

Example 5 (Cont.)

Let us now compute the utilizations. The utilization of a resource due to the execution of function f for sessions of type r is the product of the service demand of a single execution of f at that resource and the rate at which f is executed:

    U_i,r(f) = D_i(f) x λ_r(f)

where λ_r(f) = λ_r^s x V_f,r is the rate of execution of function f due to sessions of type r, and λ_r^s is the arrival rate of sessions of type r.

Example 5 (Cont.)

    λ_OccasionalBuyers(Search) = 0.8 x 6.76 = 5.408 searches/sec
    U_CPU,OccasionalBuyers(Search) = 0.006 x 5.408 = 0.0324 = 3.24%

Note that the utilization multiplies the demand of a single execution by the execution rate; multiplying the per-session demand of 0.0406 sec by the search rate would count V_f,r twice. A table in the original slides shows the service demands and utilizations for all resources and for the two types of CBMGs due to the execution of Search, as well as the total utilization of each resource due to the Search function.
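Example 5's occasional-buyer numbers can be checked in a few lines; this sketch applies the demand and utilization formulas to the CPU of the database server:

```python
# Example 5: Search at the database server CPU, occasional buyers.
lambda_s = 0.8    # arrival rate of occasional-buyer sessions (sessions/sec)
V_search = 6.76   # average Search executions per session
D_cpu = 0.006     # CPU demand of ONE Search execution (sec)

# Per-session service demand: D_{i,r}(f) = V_{f,r} * D_i(f)
D_session = V_search * D_cpu        # ~0.0406 sec of CPU per session

# Execution rate of Search: lambda_r(f) = lambda_r^s * V_{f,r}
rate = lambda_s * V_search          # 5.408 searches/sec

# Utilization: U_{i,r}(f) = D_i(f) * lambda_r(f); using D_session here
# instead of D_cpu would count V_search twice.
U_cpu = D_cpu * rate                # ~0.0324, about 3.24%
print(f"D_session = {D_session:.4f} s, U_CPU = {U_cpu:.2%}")
```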

E-Business Benchmarks: TPC-W

Accurate workload characterizations can be used to build benchmark suites for evaluating and comparing competing systems. Several workload generators exist for Web servers: Mindcraft's WebStone, SPEC's SPECweb96 and SPECweb99, and SURGE. The Transaction Processing Performance Council (TPC) has released TPC-W, the first benchmark aimed at evaluating sites that support e-business activities.

TPC-W's Business Model

A B2C e-tailer that sells products and services over the Internet. The site provides e-business functions that allow customers to:
- browse through selected products (e.g., best sellers or new products)
- search for information on existing products
- see product details
- place an order
- check the status of a previous order

TPC-W's Business Model (Cont.)

Interactions related to placing an order are encrypted through SSL, with RSA, RC4, and MD5 as the cipher suite. Customers need to register with the site before they are allowed to buy. The site maintains a catalog of items that can be searched by a customer; each item has a description and a 5-KB thumbnail image associated with it.

TPC-W's Business Model (Cont.)

The site maintains a database with information about customers, items in the catalog, orders, and credit card transactions. All database updates must have the ACID (Atomicity, Consistency, Isolation, and Durability) properties. The size of the catalog is the major scalability parameter for TPC-W: the number of items in the catalog may be 1,000, 10,000, 100,000, 1,000,000, or 10,000,000.

TPC-W's Customer Behavior Model

TPC-W specifies that the activity with the site being benchmarked is driven by emulated browsers (EBs). These EBs generate Web interactions, each of which represents a complete cycle that starts when the EB selects a navigation option from the previously displayed page and ends when the requested page has been completely received by the EB. EBs engage in user sessions, i.e., sequences of Web interactions that start with an interaction to the home page.

Customer Behavior Model Graph (CBMG) for TPC-W
- Notes: interactions that inquire about the status of previous orders and site-administration interactions are left out; transitions to the Exit state are not shown explicitly
- Two types of browse interactions are grouped into the Browse state:
  - requests for best sellers
  - requests for new product information
- The Entry state can only lead to the Home state, which can be reached from any other state
- Customers have to go through the Login state before they can reach the Buy Request state, in which a customer provides billing information (e.g., credit card and billing address) and a shipping address
- From the Buy Request state a customer can move to the Buy Confirm state, which completes the buying process
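A CBMG of this kind can be exercised directly: an emulated browser is just a random walk over the graph's transition probabilities. The sketch below simulates sessions over a CBMG with the states named above; the transition probabilities are invented for illustration (TPC-W specifies its own matrices for each session mix).

```python
import random

# Hypothetical CBMG transition probabilities (illustrative only).
TRANSITIONS = {
    "Entry":         {"Home": 1.0},
    "Home":          {"Browse": 0.45, "Search": 0.40, "Exit": 0.15},
    "Browse":        {"ProductDetail": 0.50, "Home": 0.30, "Exit": 0.20},
    "Search":        {"ProductDetail": 0.55, "Home": 0.25, "Exit": 0.20},
    "ProductDetail": {"ShoppingCart": 0.20, "Browse": 0.45, "Exit": 0.35},
    "ShoppingCart":  {"Login": 0.60, "Home": 0.20, "Exit": 0.20},
    "Login":         {"BuyRequest": 0.85, "Exit": 0.15},
    "BuyRequest":    {"BuyConfirm": 0.70, "Exit": 0.30},
    "BuyConfirm":    {"Exit": 1.0},
}

def run_session(rng):
    """Walk the CBMG from Entry to Exit; return the sequence of visited states."""
    state, path = "Entry", []
    while state != "Exit":
        path.append(state)
        r, acc = rng.random(), 0.0
        for nxt, p in TRANSITIONS[state].items():
            acc += p
            if r < acc:
                state = nxt
                break
    return path

rng = random.Random(42)
sessions = [run_session(rng) for _ in range(10_000)]
buys = sum("BuyConfirm" in s for s in sessions)
print(f"buy-to-visit ratio: {buys / len(sessions):.2%}")
```

Each simulated session starts at Entry (which can only lead to Home, as in the CBMG above) and ends when the walk reaches Exit; the fraction of sessions containing a Buy Confirm visit is an estimate of the buy-to-visit ratio for these made-up probabilities.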

TPC-W's Customer Behavior Model (Cont.)
- TPC-W classifies Web interactions into two broad categories:
  - Browse interactions involve browsing and searching but no product-ordering activity. States of the CBMG that fall in this category are Home, Browse, Select, Product Detail, and Search
  - Order interactions involve product-ordering activities only and include the following states of the CBMG: Shopping Cart, Login, Buy Request, and Buy Confirm

TPC-W's Customer Behavior Model (Cont.)
- TPC-W specifies three types of sessions according to the percentage of Browse and Order Web interactions found in each session:
  - Browsing mix: 95% Browse and 5% Order Web interactions; the buy-to-visit ratio in these sessions is 0.69%
  - Shopping mix: 80% Browse and 20% Order Web interactions; the buy-to-visit ratio in these sessions is 1.2%
  - Ordering mix: 50% Browse and 50% Order Web interactions; the buy-to-visit ratio in these sessions is 10.18%
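The Browse/Order split of a session is easy to compute from the state classification above. The sketch below classifies a made-up session's interactions and reports its Browse share; the state names follow the CBMG, but the session itself is invented to land on the browsing mix's 95% target.

```python
# TPC-W's two interaction categories, per the CBMG classification above.
BROWSE = {"Home", "Browse", "Select", "ProductDetail", "Search"}
ORDER = {"ShoppingCart", "Login", "BuyRequest", "BuyConfirm"}

def browse_share(session):
    """Fraction of a session's Web interactions that are Browse interactions."""
    counted = [s for s in session if s in BROWSE | ORDER]
    return sum(s in BROWSE for s in counted) / len(counted)

# A made-up 20-interaction session: 19 Browse + 1 Order = 95% Browse.
browsing_session = [
    "Home", "Search", "ProductDetail", "Browse", "Home",
    "Search", "ProductDetail", "Browse", "Home", "ProductDetail",
    "Browse", "Search", "ProductDetail", "Browse", "Search",
    "ProductDetail", "Browse", "Home", "Search", "ShoppingCart",
]
print(f"Browse share: {browse_share(browsing_session):.0%}")  # 95%
```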

TPC-W Performance Metrics
- TPC-W defines two types of performance metrics. The throughput metrics are:
  - WIPS (Web Interactions Per Second), the main throughput metric, measures the average number of Web interactions completed per second during an interval in which all sessions are of the shopping type
  - WIPSb measures the average number of Web interactions completed per second during an interval in which all sessions are of the browsing type
  - WIPSo measures the average number of Web interactions completed per second during an interval in which all sessions are of the ordering type
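All three throughput metrics are the same computation applied to different session mixes: interactions completed in a measurement interval, divided by the interval length. A minimal sketch (the timestamps are invented; the TPC-W spec additionally defines steady-state and measurement-interval rules):

```python
def wips(completion_times, interval_start, interval_end):
    """Average number of Web interactions completed per second in an interval."""
    n = sum(interval_start <= t < interval_end for t in completion_times)
    return n / (interval_end - interval_start)

# Hypothetical completion timestamps (seconds) for 10 Web interactions.
times = [0.4, 1.1, 1.9, 2.5, 3.2, 3.8, 4.6, 5.5, 6.1, 7.0]
print(f"{wips(times, 0.0, 10.0):.1f} WIPS")  # 10 interactions / 10 s = 1.0 WIPS
```

Run over a shopping-mix interval this yields WIPS; over browsing- or ordering-mix intervals it yields WIPSb or WIPSo.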

TPC-W Performance Metrics (Cont.)
- Cost/throughput metric: $/WIPS indicates the ratio between the total cost of the system under test and the WIPS measured during a shopping interval
- Total cost includes purchase and maintenance costs for all hardware and software components of the system under test
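The cost/throughput metric is a simple ratio; the figures below are invented for illustration.

```python
def dollars_per_wips(total_cost_usd, measured_wips):
    """$/WIPS: total system cost divided by shopping-mix throughput."""
    return total_cost_usd / measured_wips

# Hypothetical: a $500,000 system sustaining 1,000 WIPS.
print(f"${dollars_per_wips(500_000, 1_000):.2f}/WIPS")  # $500.00/WIPS
```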

Concluding Remarks
- Workload characterization is the process of describing a workload by means of quantitative parameters, in a way that captures the most important features of the workload
- A workload characterization can be static, describing the consumption of hardware and software resources, or dynamic, consisting of parameters related to the behavior of user requests
- We described dynamic workload models, in the form of CBMGs and CVMs, as well as the processes used to obtain these characterizations from HTTP logs

Concluding Remarks (Cont.)
- These dynamic characterizations need to be mapped to resources at the IT level to generate a static workload description
- This is achieved by mapping customer behavior models to client/server interaction diagrams (CSIDs) and obtaining the service demands at each of the servers and networks of the CSID
- Accurate workload characterizations can be used to build benchmark suites that evaluate and compare competing systems
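The mapping from customer behavior to service demands can be sketched as a weighted sum: if state j is visited V_j times per session and each visit places S_ij seconds of service on device i, the per-session demand on device i is D_i = sum over j of V_j * S_ij. All visit ratios and service times below are hypothetical.

```python
# Hypothetical average visits per session to each CBMG state.
visits = {"Home": 1.0, "Search": 1.8, "ProductDetail": 1.5, "BuyConfirm": 0.1}

# Hypothetical seconds of service per visit at each device of the CSID.
service = {
    "Home":          {"web_cpu": 0.010, "db_cpu": 0.002},
    "Search":        {"web_cpu": 0.008, "db_cpu": 0.030},
    "ProductDetail": {"web_cpu": 0.006, "db_cpu": 0.015},
    "BuyConfirm":    {"web_cpu": 0.012, "db_cpu": 0.050},
}

# D_i = sum_j V_j * S_ij: aggregate per-session demand at each device.
demands = {}
for state, v in visits.items():
    for device, s in service[state].items():
        demands[device] = demands.get(device, 0.0) + v * s

for device, d in sorted(demands.items()):
    print(f"{device}: {d * 1000:.1f} ms per session")
```

These per-session demands are exactly the static, IT-level parameters that queuing models and capacity plans consume.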

Concluding Remarks (Cont.)
- The Transaction Processing Performance Council (TPC) has recently released TPC-W, a benchmark for e-commerce sites engaged in B2C activities
- The benchmark is designed to mimic the operations of an e-business site; it measures Web Interactions Per Second (WIPS) and cost/performance ($/WIPS)
- The transactions of the benchmark are designed to reproduce five types of operations: Browse, Shopping Cart, Buy (using SSL), Register, and Search

The End