Metamodeling by using Multiple Regression Integrated K-Means Clustering Algorithm


Emre Irfanoglu, Ilker Akgun, Murat M. Gunal
Institute of Naval Science and Engineering, Turkish Naval Academy, Tuzla, Istanbul, TURKEY

Keywords: simulation optimization, K-means clustering, metamodel, multiple regression

Abstract

A metamodel in simulation modeling, also known as a response surface, emulator, or auxiliary model, relates a simulation model's outputs to its inputs without the need for further experimentation. A metamodel is essentially a regression model and is often described as a model of a simulation model. It may be used for validation and verification, sensitivity or what-if analysis, and optimization of a simulation model. In this study, we propose a new metamodeling approach for simulation optimization that integrates the K-means clustering algorithm with multiple regression. Our aim is to evaluate the feasibility of creating multiple metamodels by clustering the input-output variables of a simulation model according to their similarities. In this approach, we first run the simulation model of a system; second, we cluster the results with the K-means algorithm and create a metamodel for each cluster; and third, we seek the minimum (or maximum) of each metamodel. We tested the approach on a fictitious call center. We observed that it increases the accuracy of the metamodel and decreases the sum of squared errors. These observations give some insight into the usefulness of clustering in metamodeling for simulation optimization.

1. INTRODUCTION

Coupling the speed of optimization techniques with the flexibility of simulation has produced a research area called Simulation Optimization (SimOpt), which has also affected practice [1-3]. In the history of Operational Research, SimOpt methods started to appear in the 1990s, with the basic idea of merging the advantages of simulation modeling with those of optimization.
Simulation methods are known for their flexibility in tackling the complexity of systems. Although simulation models require an extensive amount of data, they help decision makers make better decisions. Optimization methods, on the other hand, are not as flexible in modeling complexity, but once built they generate more accurate and faster results than simulation does. Many SimOpt methods have been developed, mainly in four categories: gradient-based and random search algorithms, evolutionary algorithms and metaheuristics, mathematical programming based approaches, and statistical search techniques.

In this study, we suggest a four-phase approach to improve the metamodeling process for SimOpt. Our approach includes simulation experimentation, clustering, metamodeling, and optimization. In the first phase, conventional simulation experimentation techniques are used. Note that we assume we have a simulation model of a typical call center system, and we aim to optimize some objective function. In the second phase, we apply a clustering algorithm (K-means) to the simulation inputs. In the third phase, a metamodel is developed for each cluster. Finally, we apply optimization techniques to each metamodel. Unlike classical metamodeling in SimOpt, we integrate clustering before the multiple regression metamodel and generate one metamodel for each cluster instead of one metamodel for all data.

First, we review some SimOpt methods from the literature in Section 2. In Sections 3 and 4, we give brief information about metamodels and clustering. In Section 5, we present our proposed approach and compare it with the classical approach. To show an application of the proposed approach, we experimented with a call center simulation model and showed that clustered metamodels outperform the classical approach.
2. REVIEW OF SIMULATION OPTIMIZATION TECHNIQUES

In SimOpt, a simulation model is used to estimate the performance of a system. Based on that estimate, an optimization algorithm is then run to find new input values that maximize or minimize the estimated system performance. As in conventional optimization models, the input values, or decision variables, are constrained. The iterative nature of this approach generally makes the simulation model a bottleneck, and therefore the run-time performance of the model is significant. We review some of the well-known simulation optimization techniques as follows.

Gradient-based and random search algorithms (e.g. stochastic approximation): Gradient-based search methods are optimization techniques that use the gradient of the objective function to find an optimal solution [4]. In each iteration of the algorithm, the values of the decision variables are adjusted so that the simulation produces a lower objective function value. Gradient-based methods work well in high-dimensional spaces provided that these spaces do not have local minima. The drawback is that global minima are likely to remain unfound.

Evolutionary Algorithms and Metaheuristics (e.g. Genetic Algorithms, Tabu Search and Simulated Annealing): Heuristic-based methods strike a balance between exploration and exploitation. This balance permits the identification of local minima, but encourages the discovery of a globally optimal solution [5]. Heuristic techniques generate good candidate solutions when the search space is large and nonlinear.

Mathematical Programming-Based Approaches (e.g. the Sample Path Method): Sample path optimization (also known as the stochastic counterpart or sample average approximation; see [6]) first takes many simulations and then tries to optimize the resulting estimates by using conventional mathematical programming solution algorithms.

Statistical Search Techniques (e.g. Sequential Response Surface Methodology): Response surface methodology (RSM) is a statistical method for fitting a series of regression models to the output of a simulation model []. The goal of RSM is to construct a functional relationship between the decision variables and the output, to demonstrate how changes in the values of the decision variables affect the output. Relationships constructed from RSM are often called metamodels [7].
RSM usually consists of a screening phase that eliminates unimportant variables in the simulation [8]. After the screening phase, linear models are used to build a surface and find the region of optimality. Then, second or higher order models are fitted to find the optimal values of the decision variables. The eventual objective of RSM is to determine the optimum operating conditions for the system, or to determine a region of the factor space in which operating requirements are satisfied [9].

In the formal application of RSM for optimization, and for design of experiments in general, one of the most important steps is factor screening: the initial identification of the "important" parameters, those factors that have the greatest influence on the response. However, in our discussion of optimization of discrete-event simulation models, we assume that this has already been determined. In most discrete-event system applications this is usually the case, since there are underlying analytic models which can give a rough idea of the influence of various parameters. For example, in manufacturing systems and telecommunications networks, the analyst knows from queuing network models which routing probabilities and service times have an effect on the performance measures of interest. RSM procedures usually presuppose a more "black box" approach to the problem, in which it is unclear a priori which factors are of importance at all [10].

Additionally, Fu [10] classifies the application of RSM into two main categories: metamodels and sequential procedures. Metamodels are special cases of the RSM representation, and therefore the remainder of this paper uses the term metamodel rather than RSM.

3. METAMODEL

A metamodel is a polynomial model that relates the input-output behavior of a simulation model. A metamodel is often a least squares regression model of the form given in Eq. (1):

$$E[y] = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i=1}^{k} \sum_{j=1}^{k} \beta_{ij} x_i x_j + \dots \qquad (1)$$

where $\beta_i$, $\beta_{ii}$, and $\beta_{ij}$ represent regression coefficients, $x_i$ ($i = 1, \dots, k$) are design variables, and $y$ is the response.

The simple form of a metamodel can reveal the general characteristics of behavior in complex simulation models. The objective of a metamodel is to effectively relate the output data of a simulation model to the model's input, to aid in the purpose for which the simulation model was developed [11]. Since our aim in this study is to form a metamodel by using clustering algorithms, we review the related literature in the following section. Note that we aim at classifying the input variables according to their similarities; after clustering the data, there will be n grouped (clustered) data sets and hence n metamodels. We discuss the details of this approach after introducing the clustering algorithms.

4. CLUSTERING

Clustering is a way to examine similarities and dissimilarities of observations or objects. Data often fall naturally into groups, or clusters, of observations, where the characteristics of objects in the same cluster are similar and the characteristics of objects in different clusters are dissimilar. Both the similarity and the dissimilarity should be examinable in a clear and meaningful way. Measures of similarity depend on the application.

Clustering is widespread, and a wealth of clustering algorithms has been developed to solve different problems in specific fields. However, there is no clustering algorithm that can be universally used to solve all problems [12]. Clustering has been applied in a wide range of areas, from engineering (machine learning, artificial intelligence, pattern recognition, mechanical engineering, electrical engineering), computer sciences (web mining, spatial database analysis, textual document collection, image segmentation), and life and medical sciences (genetics, biology, microbiology, paleontology, psychiatry, clinic, pathology), to earth sciences (geography, geology, remote sensing), social sciences (sociology, psychology, archeology, education), and economics (marketing, business) [13-14].

There are two common clustering techniques, distinguished by the properties of the clusters generated [13-15]: hierarchical clustering and partitioned clustering. Hierarchical clustering groups the data over a variety of scales by creating a cluster tree. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level are joined as clusters at the next level. This allows the analyst to decide the level or scale of clustering that is most appropriate for the application. In partitioned clustering, the data objects are divided into a specified number of clusters. The K-means clustering algorithm is one of the well-known methods in this category. K-means partitions data into k mutually exclusive clusters, and returns the index of the cluster to which each observation has been assigned. We used this technique in our methodology to cluster the simulation inputs.
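The partitioning just described can be illustrated with a minimal sketch of the standard K-means iteration (Lloyd's algorithm) using squared Euclidean distance. This is not the Matlab routine used in the study; the function names, the random initialization from the data points, and the `seed` parameter are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Partition `points` into k clusters; returns (labels, centroids).

    labels[i] is the index of the cluster to which points[i] is assigned.
    Assumption: initial centroids are k distinct data points chosen at random.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With squared Euclidean distance the centroid is the coordinate-wise mean; other distance measures would need a different update step.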
The K-means clustering algorithm treats each observation in the data as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible. There are several different distance measures, depending on the kind of data being clustered. Each cluster in the partition is defined by its member objects and by its centroid, or center point. The centroid of each cluster is the point for which the sum of distances from all objects in that cluster is minimized. K-means computes the cluster centroids differently for each distance measure, to minimize the sum with respect to the specified measure.

K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. An example of clustered data points is shown in Figure 1.

Figure-1. An example of clustered data points (taken from a Matlab example).

5. PROPOSED APPROACH

In this study, we propose a new metamodeling approach for simulation optimization that integrates the K-means clustering algorithm with multiple regression. Our approach works in four phases: Experimentation, Clustering, Metamodeling, and Optimization. There are ten steps in total, as presented in Figure 2. We assume that we have a simulation model built for the system for which we want to find optimum values of some decision variables. Note that in this case the decision variables are simulation model inputs.

In the experimentation phase, the modeler designs the experiments according to the size of the search space. For example, if there are n input variables and we decide to run a low and a high value of each variable, we end up with 2^n factorial experiments. In some cases that many experiments may not be enough, and more experimentation might be required.
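The low/high design mentioned above can be enumerated directly. The following is a minimal sketch of a two-level full factorial design; the function name and the example bounds in the usage note are hypothetical and are not the bounds of the call center study.

```python
from itertools import product


def two_level_design(bounds):
    """Enumerate a two-level full factorial design.

    bounds: one (low, high) pair per input variable.
    Returns 2**n design points, each a list with one value per variable.
    """
    return [list(point) for point in product(*bounds)]
```

For three variables with hypothetical bounds `[(0, 3), (1, 4), (6, 9)]`, the function returns the 8 corner points of the search space; each point would then be run in the simulation model.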

In the second phase, we cluster the simulation inputs. This is an iterative process: we check a performance criterion in each iteration, and if the criterion for the clustering is below the acceptable level, we increase the number of clusters. For example, for the K-means clustering method the performance criterion is the silhouette value.

In the third phase, we develop a metamodel for each cluster. As in the previous phase, we check quality measures of the metamodels, for example the R-square values. The purpose of a metamodel is to estimate output values without further simulation experimentation. Therefore, after this phase we can estimate simulation outputs without running the model. However, since we have multiple metamodels, we need to determine rules for when to use each metamodel. These rules might be based on the limits of the simulation inputs.

The final phase is the optimization phase. Based on the objective function, we seek the minimum or maximum of each metamodel. This requires differentiating the regression model and setting the derivative to zero to find the roots of the equation. Then, we choose the minimum or maximum among the clusters' optimum values.

6. APPLICATION

6.1. Problem Definition

To test our approach, we used an example of a call center simulation model created with the ARENA program [16]. We chose this model to benchmark our methodology. Therefore, the model structure and its parameter values are taken from the original problem definition as written in [16].

The call center provides technical support, sales information, and order processing to a company. Calls arrive at the call center with interarrival times exponentially distributed with a mean value of 0.87 minute. The call center has 6 trunk lines, which means that at most 6 calls can be in progress concurrently. If all lines are busy, the next arriving call is rejected.
An incoming call can be diverted to one of three options: technical support, sales information, or order status inquiry. Their percentages are 76, 16, and 8 respectively. The estimated time for this activity is UNIF(0.1, 0.6); all times are in minutes.

Figure-2. Flowchart of the proposed methodology

For technical support calls, a recorded welcome message is played first, which takes UNIF(0.1, 0.) minutes. During this message, the caller is expected to choose one of the three product types. The percentages of product types 1, 2 and 3 are 25, 34 and 41 respectively. If a qualified technical support person is available for the selected product type, the call is automatically routed to that person. Otherwise, the customer is placed in an electronic queue until a support person is available. All technical support call durations are triangularly distributed with 3, 6, 18 minutes. After being served, the caller exits the system.

The second type of call is sales. These calls are routed to the sales staff. A sales call duration is triangularly distributed with the parameters 4, 1, 4 minutes. As with technical support, the caller leaves the system after completion of the call.

The third type of call, order status, is handled by computers. However, some customers may ask to talk to a real operator. This happens in 1 of this type of calls. Order status calls are also triangularly distributed, with , 3, 4 minutes. Note that when these calls are placed in a queue for a real operator, they have lower priority than sales calls. An operator handles these calls with triangularly distributed times (3, , 10 minutes). These callers then exit the system.

In our base experimentation, there are 11 technical support employees to answer the technical support calls. Two are qualified only for product Type 1, three only for product Type 2, three only for product Type 3, two only for product Types 1 and 3, and one is qualified to handle calls for all three product types. There are four employees to answer the sales calls and those order-status calls whose callers want to speak to a real person.

Our main output variable is the total cost, which includes three types of cost: (1) staffing and resource costs, (2) costs due to poor customer service, and (3) costs of rejected calls. A sales staff member's cost is $0/hour and a tech-support staff member's cost is $18-$0/hour, depending on their level of training and flexibility. The second type of cost is the cost incurred by making customers wait on hold. In a call center, at some point callers start to get angry and the system starts incurring a cost.
Although it is difficult to measure this cost, we assumed that for tech calls this point is 3 minutes; for sales calls, it is 1 minute; and for order status it is minutes. Beyond this tolerance point for each call type, the system incurs a cost of 36.8 cents/minute for tech calls, 81.8 cents/minute for sales calls, and 34.6 cents/minute for order status calls. For rejected calls, it is assumed that no more than of incoming calls get a busy signal; any model configuration not meeting this requirement is regarded as unacceptable. Related to rejected calls, changing the number of trunk lines incurs a cost of $98/week for each trunk line.

In the optimization part, we used this call center simulation model to find the minimum total cost while holding the percentage of rejected calls to and less. The decision variables and their lower and upper bounds are shown in Table 1. There are two constraints in the problem definition: first, the number of trunk lines must be between 6 and 0; second, the call center can accommodate 1 operators at most.

6.2. Steps of the Methodology

Step-1 Specify the decision variables: We chose the six decision variables shown in Table 1, which affect our performance criterion (the total cost).

Table 1. Decision variables and their lower and upper bounds

Decision Variable    Lower Bound    Upper Bound
New Sales (1)        0              1
New Tech 1 (2)       0              1
New Tech 2 (3)       0              1
New Tech 3 (4)       0              1
New Tech All (5)     0              1
Trunk Line (6)       6              0

Step-2 Simulation Experimentation: For this stage, instead of designing our own experiments, we chose the experiments already specified by Arena's OptQuest. To ease the process, we first ran OptQuest for 00 experiments to find the optimum. As a result, OptQuest found the values in Table 2, with an objective function value of $1,017. The run length for the model is 1000 hours and we made 10 replications in each experiment.

Table 2. Minimum total cost and values of decision variables via OptQuest

Step-3 Evaluate the Simulation Output: 16 experiments among the 00 experimental results were removed since they were in the infeasible region.

Step-4 Determine the Number of Clusters: In this step, we cluster the inputs of the simulation model by examining the silhouette values. The silhouette plot displays a measure of how close each data point in one cluster is to points in the neighboring clusters. The silhouette value ranges from -1 to +1: +1 indicates points that are very distant from the neighboring clusters, 0 indicates points that are not distinctly in one cluster or another, and -1 indicates points that are assigned to the wrong cluster. The value is defined as

$$S(i) = \frac{\min_k b(i,k) - a(i)}{\max\left(a(i),\ \min_k b(i,k)\right)}$$

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to the points in another cluster k.
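The silhouette definition above translates almost line by line into code. The sketch below assumes Euclidean distance and at least two clusters; returning 0 for a point that is alone in its cluster is a common convention and an assumption here, not something stated in the paper. The function name is hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Return S(i) for every point, given its cluster label.

    a(i): mean distance from point i to the other points in its own cluster.
    b(i,k): mean distance from point i to the points of another cluster k.
    Assumes at least two distinct labels.
    """
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # assumption: a singleton cluster scores 0
            continue
        a = sum(own) / len(own)
        # Smallest mean distance to any other cluster: min over k of b(i,k).
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == k)
            / labels.count(k)
            for k in clusters if k != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values
```

Averaging the returned values gives the mean silhouette that can be compared across different numbers of clusters.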

Step-5 Cluster Simulation Inputs: We clustered the simulation inputs using the Euclidean distance between the inputs. We tried up to 8 clusters in order to compare the silhouette plots.

Step-6 Cluster Validation: To validate the clusters, we analyzed the silhouette plots and means. The best plot belongs to the 5-cluster solution (mean 0.), as shown in Figure 3. Therefore, we end up with five metamodels.

Figure 3. Silhouette plot for the experiments (axes: Cluster, Silhouette Value).

Step-7 Create Metamodel of Every Cluster: We created the metamodels by using Minitab [17], according to the number of clusters found in Step 6. Equations (2) to (6) show the metamodel of each cluster.

[Equations (2) to (6): fitted metamodels f for Clusters 1 to 5. The coefficients are not legible in this transcription.]

Step-8 Evaluate the Results: In this step, we evaluate the metamodels of Step-7 by conducting statistical tests (ANOVA, R-square, residual sum of squares). The R-Square values of the metamodels are 79,83, 63,8, 6,7, 67,70 and 96.6 respectively. Additionally, the square roots of the mean square errors (MSE) are given in Table 3. For comparison, a single classic metamodel (no clustering) has an R-Square value of 81.1; its MSE is listed in Table 3.

Table 3. Statistical results of proposed approach and classic metamodel

Method                         R-square    MSE    F & p Value
Proposed Approach, Cluster-1   79,83
Proposed Approach, Cluster-2   63,8
Proposed Approach, Cluster-3
Proposed Approach, Cluster-4
Proposed Approach, Cluster-5
Classic Metamodel

Step-9 Find the Optimum of Each Metamodel: To optimize the objective functions of the five metamodels, we used the Optimization Tool of Matlab [19]. Table 4 shows the minimum total costs and the values of the decision variables.
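Steps 7 and 9 can be sketched end to end under stated assumptions. The code below fits a second-order metamodel of the form of Eq. (1) to each cluster by least squares (normal equations solved by Gaussian elimination), then minimizes each fitted model inside a box of input limits with projected gradient descent and picks the cluster with the lowest minimum. It stands in for Minitab and the Matlab Optimization Tool, which the study actually used; all function names, the use of each cross term once (i < j), the fixed step size, and the per-cluster bounds are illustrative assumptions.

```python
from itertools import combinations


def quadratic_features(x):
    """Terms of a second-order metamodel: 1, x_i, x_i^2, x_i*x_j (i < j)."""
    x = [float(v) for v in x]
    return [1.0] + x + [v * v for v in x] + [a * b for a, b in combinations(x, 2)]


def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the cluster needs more design points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_metamodel(inputs, outputs):
    """Least-squares fit; returns the fitted metamodel as a callable."""
    rows = [quadratic_features(x) for x in inputs]
    p = len(rows[0])
    ftf = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    fty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(p)]
    beta = solve_linear(ftf, fty)
    return lambda x: sum(c * v for c, v in zip(beta, quadratic_features(x)))


def minimize_in_box(f, bounds, step=0.05, iterations=2000, h=1e-6):
    """Projected gradient descent with a central-difference gradient.

    Assumption: the fixed step suits models with modest curvature only;
    it must be tuned (or replaced by a proper solver) for real coefficients.
    """
    x = [(lo + hi) / 2.0 for lo, hi in bounds]
    for _ in range(iterations):
        grad = []
        for i in range(len(x)):
            up, down = x[:], x[:]
            up[i] += h
            down[i] -= h
            grad.append((f(up) - f(down)) / (2.0 * h))
        # Take a descent step, then clip back into the box.
        x = [min(max(xi - step * gi, lo), hi)
             for xi, gi, (lo, hi) in zip(x, grad, bounds)]
    return x, f(x)


def optimize_clusters(cluster_data, cluster_bounds):
    """Fit and minimize one metamodel per cluster; return the best cluster."""
    results = {}
    for name, (inputs, outputs) in cluster_data.items():
        model = fit_metamodel(inputs, outputs)
        results[name] = minimize_in_box(model, cluster_bounds[name])
    best = min(results, key=lambda name: results[name][1])
    return best, results
```

The returned optimum is only an estimate from the metamodel, so it would still need to be checked by running those input values in the simulation model.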

Table 4. Objective functions and decision variables values

Method       Obj.Func Value    Decision Variables [1;2;3;4;5;6]    Tested Obj. Func.
OptQuest     $1017             [3;0;0;0;3;9]                       -
Cluster-1    $1394             [3.76;0;6.17;4;8;0]                 $870
Cluster-2    $484              [3.4;0.9;1.4;1.9;6.83;0]            $6343
Cluster-3    $3994             [3.;0;3.9;6;0;0]                    $646
Cluster-4    $1888             [7.;0;4;.6;0;41]                    $171
Cluster-5    $034              [4;0;0;0;;9]                        $1986

Step-10 Test the Optimum by Using Simulation Model: We tested the optimum of each cluster obtained in Step-9 by using the Arena simulation model. Note that the minimum total cost belongs to the metamodel of Cluster-, as shown in Table 4. After running those decision variable values in our call center simulation model, the result is $1646 ("Tested Objective Function" column), which is close to the minimum total cost found by OptQuest.

7. CONCLUSION

Simulation optimization techniques have developed significantly in the last two decades. In this study, we aim to contribute to the literature by proposing a new approach in which the K-means clustering algorithm is integrated into metamodeling. We tested the proposed approach by using a call center simulation model. In this example, we used 00 scenarios created by the Arena OptQuest optimization tool, and then clustered the inputs into five groups. The clusters helped to create plausible metamodels with satisfactory and near-optimal R-Square and MSE values. This gives an indication of the advantage of the proposed approach. When the solution space is large and searching is costly, the proposed approach can be used as an alternative to heuristic search algorithms. However, to generalize the usefulness of this approach, we aim to study more cases in the future.

8. ACKNOWLEDGMENTS

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any affiliated organization or government.

9. REFERENCES

[1] Tekin, E. and Sabuncuoglu, I., 2004. Simulation Optimization: A Comprehensive Review on Theory and Applications. IIE Transactions, 36.
[2] Law, M. and Kelton, W. D., 2001. Simulation Modeling and Analysis, McGraw-Hill, Second Edition, United States.
[3] Fu, M., 2002. Optimization for Simulation: Theory vs. Practice. INFORMS Journal on Computing, 14(3):192-215.
[4] Waziruddin, S., Brogan, D. C. and Reynolds, P. F., 2004. Coercion through Optimization: A Classification of Optimization Techniques. Proceedings of the 2004 Fall Simulation Interoperability Workshop, Orlando, FL, September 2004.
[5] Carson, Y. and Maria, A. Simulation Optimization: Methods and Applications. Proceedings of the 1997 Winter Simulation Conference.
[6] Rubinstein, R. Y. and Shapiro, A. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. New York: John Wiley & Sons.
[7] Fu, M., 2001. Simulation Optimization. Proceedings of the 2001 Winter Simulation Conference.
[8] Myers, R. H. and Montgomery, D. C., 2002. Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Wiley-Interscience.
[9] Montgomery, D. C., 1991. Design and Analysis of Experiments, John Wiley & Sons, New York, NY.
[10] Fu, M. C., 1994. Optimization via simulation: A review. Annals of Operations Research, 53.
[11] Sargent, R. G. Research Issues in Metamodeling. Proceedings of the 1991 Winter Simulation Conference.
[12] Xu, R., 2005. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[13] Everitt, B., Landau, S. and Leese, M., 2001. Cluster Analysis. London: Arnold.
[14] Hartigan, J., 1975. Clustering Algorithms. New York: Wiley.
[15] Jain, A., Murty, M. and Flynn, P. Data clustering: A review. ACM Comput. Surv., vol. 31, no. 3.
[16] Kelton, W. D., Sadowski, R. P. and Sturrock, D. T. Simulation with Arena, McGraw-Hill, Fourth Edition, United States.
[17] Minitab, [accessed Jan. 2013]
[18] Arena Simulation Software, [accessed Jan. 2013]
[19] Matlab, [accessed Jan. 2013]

Biography

Emre İrfanoglu is pursuing his MSc in Naval Operations Research at the Institute of Naval Science and Engineering. He holds a BSc degree in Industrial Engineering, which he received in 00 from the Turkish Naval Academy.

Ilker Akgun is an assistant professor at the Turkish Naval Academy. He completed his PhD at Istanbul Technical University and his MSc studies at Middle East Technical University in 01 and 00 respectively.

Murat Gunal is an assistant professor at the Turkish Naval Academy. He completed his PhD and MSc studies at Lancaster University, UK, in 2008 and 2000 respectively. His PhD thesis is titled "Simulation Modelling for Performance Measurement in Hospitals". He has carried out research and worked in the simulation field for many years.