Analysis of Performance Metrics from a Database Management System Using Kohonen's Self-Organizing Maps

WSEAS Transactions on Systems Issue 3, Volume 2, July 2003, ISSN 1109-2777 629

Claudia L. Fernandez, Jose Torres-Jimenez
Computer Science Department, ITESM Campus Cuernavaca
Av. Paseo de la Reforma # 182-A, Temixco, Mor. 62490, MEXICO
Claudia_l_fernandez@yahoo.com, jtj@itesm.mx

Miguel A. Reyes Martinez, Cesar A. Coutino-Gomez
Computer Science Department, Tecnologico de Monterrey Campus Ciudad de México
Calle del Puente Km. 222, México D.F., Mexico
migreyes@itesm.mx, ccoutino@itesm.mx

Abstract: Data clustering is one of the most interesting data mining problems. It is the process of discovering groups of data items based on similarities, without specifying any additional information. Each cluster contains data items that are similar in some respect and unlike the data items in other clusters. The clustering problem becomes more complex when the data items to be classified belong to a large, high-dimensional data set. Kohonen's self-organizing map (SOM) is a neural network that uses an unsupervised learning algorithm and, through a process called self-organization, configures its output units into a topological representation of the input data. SOM provides a solution to the data clustering problem by finding relationships between inputs and outputs and organizing data based on similarities. SOM allows the visualization of high-dimensional data with a topology-preserving map that reduces multi-dimensional data to a lower-dimensional map or grid of neurons. In this paper, the SOM algorithm is used in conjunction with Ward's hierarchical clustering algorithm to improve the visualization of data clusters. With this process, SQL statements with similar performance metrics are grouped into one cluster, and their performance metrics are more alike than the metrics of the SQL statements in different clusters. The analysis of SQL performance metrics is a current problem in the RDBMS industry that can be addressed by applying the SOM algorithm.

Keywords: Data mining, Self-organizing maps, DBMS, SQL, Performance analysis.

1 Introduction

Data mining is the process of inspecting a large data set with the goal of discovering previously unknown knowledge. In data mining, large amounts of data are analyzed, and data clustering techniques allow large data sets to be classified, synthesized and visualized. Data clustering is the process of discovering groups of data items based on similarities without specifying any additional information [4]. Each cluster contains data items that are similar in some respect and unlike the data items in other clusters. The right number of groups into which the data items can be classified is not known in advance. When the data set to be classified is very large and the data items are high-dimensional, i.e. have many components, the clustering problem becomes more complex. The Self-Organizing Map (SOM) is a neural network that, through unsupervised learning, organizes data by similarity into different clusters.

SOM has been successfully used in data mining to classify and visualize large, high-dimensional data sets [8].

Today's business environments rely on Database Management Systems (DBMS) for the management of information. The need for database systems with fast response times is ever increasing. Users of database applications expect to get their reports in a few seconds, and long-running database processes are not acceptable. Companies cannot afford to lose customers because of slow database systems. Every second that a process runs in a DBMS translates into money, since it consumes hardware and software resources. Having well-tuned database systems is essential for today's enterprises [7], but maintaining a high-performance DBMS is not an easy task. Database administrators are primarily responsible for managing and tuning database performance. One of the biggest challenges they face is that, in order to maintain well-tuned database systems, they first need to analyze and understand the large amounts of complex performance metrics that Relational Database Management Systems (RDBMS) provide. The analysis of those performance metrics is crucial for identifying performance inefficiencies.

Oracle is one of the most popular commercial RDBMS. In an RDBMS, data is accessed and modified through Structured Query Language (SQL) statements [5]. Oracle, like other RDBMS, generates SQL performance metrics that need to be analyzed in order to understand the database's performance state. The data set of performance metrics typically contains thousands of data items, where each data item contains more than a dozen different metrics or variables. This paper shows how SOM can be used to analyze the performance metrics of a DBMS, using the commercial RDBMS Oracle as a case study.

Applying the SOM algorithm to performance metrics allows the discovery of patterns. This can help a database administrator to better understand how SQL statements use different database resources and to identify SQL performance inefficiencies. Tuning can only be planned once such inefficiencies have been identified, so knowing whether they exist is the necessary first step. This paper does not present techniques for tuning a database system for enhanced performance. It focuses on the analysis of performance metrics using SOM in order to identify performance inefficiencies.

The paper is organized as follows: in section 2 we discuss the analysis of performance metrics in Oracle; in section 3 we present self-organization for performance metrics; section 4 provides the results obtained from experiments; and section 5 presents the conclusions.

2 Analyzing Performance Metrics in Oracle

Ongoing performance monitoring and analysis of Oracle allows database administrators to maintain well-tuned systems. Oracle provides several performance metrics that can be analyzed to understand how the system is performing [5]. If performance problems are identified, Oracle's performance can be improved by tuning different aspects of the database system. Oracle is a highly tunable RDBMS that permits adjustments in order to change performance. The first step in the performance enhancement process is to understand and analyze the different performance metrics that Oracle provides in order to determine the need for tuning.

Sixty percent or more of performance problems are attributed to SQL statements [7]. SQL is used to retrieve and modify data in an RDBMS. Since poorly performing SQL statements cause such a large share of the performance problems in a database, it is crucial to monitor and analyze SQL performance.
When a SQL statement is executed, Oracle stores the statement's code and several performance-related metrics in one of the buffers in its shared memory, the Oracle SQL Area (OSA) [5]. Only unique SQL statements are stored, and some of the performance metrics are cumulative: they accrue with each execution of the SQL statement. The OSA is a set of tables and views in the database system catalog that store the SQL statements and their performance statistics until the database server is shut down or the shared memory is reset [6]. When the OSA fills up, some elements are released to free space for new ones. The performance metrics that Oracle stores in the OSA for a SQL statement indicate resource usage, such as I/O metrics, number of executions, number of rows processed and others. For every SQL statement, Oracle provides more than a dozen different performance metrics, and the number of metrics increases with every Oracle release.

In most production databases, the OSA can contain several thousand SQL statements. This number can be higher, especially when running OLTP (On-Line Transaction Processing) systems on Oracle. OLTP systems typically execute thousands of SQL statements in a day, and each SQL statement may be executed hundreds of times. The data set of SQL performance metrics from the OSA is therefore high-dimensional, since every data item (SQL statement) has many associated metrics, and it also contains a large number of data items.

It is essential to use techniques that can help analyze the large, high-dimensional data set of SQL performance metrics. Currently, the database industry uses only simple statistical methods and simple graphical representations. The statistical methods used are the minimum, maximum, average and median. The graphical representations used are 2-D bar or XY charts. These methods do not allow the visualization of a large data set, and they do not properly allow the discovery of relationships between the metrics that could reveal performance patterns and assist in identifying performance inefficiencies.

The Self-Organizing Maps developed by Teuvo Kohonen in 1982 [8] are able to reveal relationships between data through self-organization. SOM allows the visualization of high-dimensional data with a topology-preserving map that reduces multi-dimensional data to a lower-dimensional map. SOM can be used to analyze the SQL performance metrics from Oracle.
SOM is justified for the analysis of performance metrics because it can self-organize data items into clusters based on similarities, discover structures in the data, and reduce the dimensionality of a data set.

3 Self-Organization of Performance Metrics

This section presents the steps and parameters used during the experiments for the analysis of performance metrics with SOM. The process of using SOM for the analysis of SQL performance metrics is as follows:

1. Selection of the input data
2. Pre-processing
3. Execution of the SOM with the parameters described in section 3.3
4. Visual analysis of the clusters
5. Definition of the clusters using Ward's clustering method

The maps presented in this paper were generated with the software Viscovery SOMine [2]. These maps seek to identify the SQL statements that use the most database resources.

3.1 Input Data

A data set with SQL performance metrics was generated by simulating the execution of an OLTP application. Oracle 8.1.7 on Windows NT was used for the experiments. The data set contains 10,000 data items, where each data item corresponds to a SQL statement and its performance metrics.

3.2 Pre-processing

From all the metrics that the OSA provides for a SQL statement, we selected those relevant for performance analysis. Twenty-one metrics or variables were selected. Three of those metrics are ratios that we calculated from other metrics and that database administrators commonly use for performance analysis. These metrics provide information on resource usage and are the following [6]: BUFFER_GETS, BUFFER_GETS/EXECUTIONS, DISK_READS/EXECUTIONS, DISK_READS/BUFFER_GETS, DISK_READS, EXECUTIONS, SHARABLE_MEM, PERSISTENT_MEM, RUNTIME_MEM, SORTS, VERSION_COUNT, LOADED_VERSIONS, OPEN_VERSIONS, USERS_OPENING, USERS_EXECUTING, LOADS, INVALIDATIONS, PARSE_CALLS, ROWS_PROCESSED, COMMAND_TYPE and OPTIMIZER_MODE.
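The three derived ratios in the list above can be computed from the raw counters as in the following minimal sketch. The dictionary layout, the function names `safe_ratio` and `add_derived_ratios`, and the choice to return 0.0 when a denominator is zero are illustrative assumptions; the paper does not state how such cases were handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 for a zero denominator (assumed policy)."""
    return numerator / denominator if denominator else 0.0


def add_derived_ratios(row):
    """Return a copy of one SQL statement's metrics extended with the three ratios."""
    out = dict(row)
    out["BUFFER_GETS/EXECUTIONS"] = safe_ratio(row["BUFFER_GETS"], row["EXECUTIONS"])
    out["DISK_READS/EXECUTIONS"] = safe_ratio(row["DISK_READS"], row["EXECUTIONS"])
    out["DISK_READS/BUFFER_GETS"] = safe_ratio(row["DISK_READS"], row["BUFFER_GETS"])
    return out


# Hypothetical counters for a single SQL statement.
example = add_derived_ratios({"BUFFER_GETS": 1200, "DISK_READS": 300, "EXECUTIONS": 40})
print(example)
```

Guarding the division matters in practice because a statement that has been parsed but not yet executed may report zero executions.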
Appendix 1 contains the explanation of each metric. All these metrics are numeric except OPTIMIZER_MODE, which indicates the Oracle optimizer mode used to execute the SQL statement. This metric can have any of the following values: CHOOSE, ALL_ROWS, RULE and MULTIPLE_CHILDREN_PRESENT. The values in this variable were pre-processed and replaced with the values CHOOSE=1, ALL_ROWS=2, RULE=3 and MULTIPLE_CHILDREN_PRESENT=0. All attributes were scaled by their variance in order to maintain consistent magnitudes between data values.

3.3 SOM parameters

The Euclidean distance was used to define the distance between the nodes in the neural network. The neighborhood function used is the Gaussian. Using these parameters, the SOM was trained with the Viscovery SOMine software, which uses the batch-SOM algorithm.

3.4 Visualization of clusters

After the SOM is complete, the map with the self-organization is presented graphically using different color shades. Blue is used for small values, green for middle values and red for large values. Fig. 1 shows the map after the self-organization.

Fig. 1 SOM

3.5 Definition of clusters using Ward's method

Ward is a hierarchical clustering algorithm [3] that, in order to form clusters, requires knowing the variables to consider and the number of clusters to create. From the visual inspection of Fig. 1 it is possible to see that the color shades highlight 11 different clusters. With the visual analysis of the color shades alone, it is not possible to determine the exact borders of each cluster. This justifies the use of Ward's algorithm in Viscovery SOMine to facilitate the identification of the clusters from Fig. 1. Fig. 2 shows the 11 clusters after applying Ward's algorithm on all the components.

Fig. 2 Self-organization of SQL performance metrics in 11 clusters

4 Results

For each cluster, the relationship between the components can be identified by visually inspecting the feature maps of the components (Fig. 3-4). From the maps we identify that the components BUFFER_GETS, BUFFER_GETS/EXECUTIONS, EXECUTIONS, PARSE_CALLS and ROWS_PROCESSED are correlated. All of these metrics have their highest values in Cluster 7.

Fig. 3 Maps of the components

Fig. 4 Maps of the components

Cluster 7 contains those SQL statements that generated high CPU cycles because of a high number of executions and a high number of accesses to the server's memory. The SQL statements in this cluster are candidates for performance tuning with the goal of decreasing memory usage.

Another interesting cluster is Cluster 2. This cluster contains the SQL statements with the highest values of DISK_READS/EXECUTIONS, which indicates high I/O usage. It also contains the SQL statements with the highest value of PERSISTENT_MEM and middle-range values for the metrics RUNTIME_MEM, SORTS, LOADED_VERSIONS and OPEN_VERSIONS. The SQL statements in this cluster are also candidates for performance tuning, with the goal of decreasing I/O usage.

Cluster 1 contains the SQL statements with the highest ratio of DISK_READS/BUFFER_GETS, indicating that the I/O usage is higher than the CPU processing. These SQL statements also have the highest values of SHARABLE_MEM (Oracle's shared memory) and RUNTIME_MEM. An interesting aspect is that these SQL statements ran under OPTIMIZER_MODE=3 (RULE), which is one of the approaches that Oracle can take for SQL execution [5]. These SQL statements are also candidates for tuning, with the goal of decreasing the use of Oracle's shared memory.

None of the other clusters shows significant use of database resources or indicates performance inefficiencies.

At this point the SOM has been analyzed and different types of SQL statement performance have been identified. It is therefore possible to analyze data items that were not included in the training of the SOM and to determine the cluster to which they belong by performing a distance analysis. This recall technique makes it possible to analyze new SQL performance metrics based on previously identified groups without re-computing the SOM.

5 Conclusions

The analysis of SQL performance metrics is a current problem in the RDBMS industry. This paper demonstrated how Kohonen's Self-Organizing Maps can be used to analyze the performance metrics of an RDBMS. With the use of the SOM and Ward's algorithm, clusters of SQL statements are identified. The SQL statements in one cluster have performance metrics that are more similar to each other than to the metrics of the SQL statements in other clusters. We have shown how the visual analysis of the feature maps of the components permits the discovery of patterns and relationships between the performance metrics. It also allows the identification of SQL statements with a high usage of database resources, which helps database administrators determine the need for performance tuning of different aspects of Oracle.

References

[1] A. Ultsch. Data Mining and Knowledge Discovery with Emergent Self-Organizing Feature Maps for Multivariate Time Series
[2] Eudaptics software gmbh. Viscovery SOMine. Austria (1999) URL http://www.eudaptics.com
[3] Brian Everitt. Cluster Analysis. Halsted Press, New York (1981)
[4] Guido Deboeck, Teuvo Kohonen. Visual Explorations in Finance with Self-Organizing Maps. Springer, Berlin (1998)
[5] Oracle9i Database Concepts. URL http://downloadwest.oracle.com/docs/cd/b10501_01/server.920/a96524/c01_02intro.htm#10525
[6] Oracle Corporation. Oracle8i Reference. Dynamic Performance Views (2000) URL http://downloadwest.oracle.com/docs/cd/a87860_01/doc/server.817/a76961/ch3156.htm#15879
[7] Richard Niemiec. Oracle Performance Tuning. Oracle Press (1999)
[8] Teuvo Kohonen. Self-organizing Maps. Springer, Berlin (1997)

Appendix 1 SQL Performance Metrics

BUFFER_GETS = Total number of memory blocks read.
BUFFER_GETS/EXECUTIONS = Number of memory blocks read for each execution of the SQL statement. The higher this ratio, the more CPU and memory the SQL statement consumes during execution.
DISK_READS/EXECUTIONS = Number of blocks read from the hard disk for each execution of the SQL statement. A high ratio could indicate that the SQL statement is I/O resource intensive.
DISK_READS/BUFFER_GETS = Ratio between the blocks read from the hard disk and the memory blocks accessed. A high number could indicate that I/O usage is higher than CPU processing.
DISK_READS = Number of blocks read from hard disk.
EXECUTIONS = Total number of executions.
SHARABLE_MEM = Sum of the amount (bytes) of sharable memory used.
PERSISTENT_MEM = Sum of the amount (bytes) of persistent memory used.
RUNTIME_MEM = Fixed amount of memory required to execute the process.
SORTS = Sum of the number of sorts performed. Sorts perform temporary processing in some of Oracle's memory areas.
VERSION_COUNT = Number of child processes present in the cache.
LOADED_VERSIONS = Number of child processes that are present in the cache and have their context heap.
OPEN_VERSIONS = Number of child processes open under the parent process.
USERS_OPENING = Number of users that have any of the child cursors open.
USERS_EXECUTING = Total number of users executing the statement over all the child cursors.
LOADS = Number of times the object was loaded or reloaded.
INVALIDATIONS = Total number of invalidations over all the child processes.
PARSE_CALLS = Sum of all parse calls to all the child processes under the parent.
ROWS_PROCESSED = Total number of rows processed.
COMMAND_TYPE = Oracle's command type definition.
OPTIMIZER_MODE = Mode under which the SQL statement is executed.
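
The variance scaling of section 3.2, the training setup of section 3.3 (Euclidean distance, Gaussian neighborhood, batch updates), the Ward step of section 3.5 and the recall step of section 4 can be illustrated with a small pure-Python sketch. It is not the Viscovery SOMine implementation. The function names, the deterministic initialisation, the neighborhood-width schedule, the unit-variance reading of the scaling and the toy two-metric data are all assumptions made for illustration.

```python
import math


def scale_to_unit_variance(rows):
    """Divide each column by its standard deviation; return (scaled_rows, stds)."""
    n = len(rows)
    stds = []
    for col in zip(*rows):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        stds.append(std if std > 0 else 1.0)  # constant columns are left unchanged
    return [[v / s for v, s in zip(row, stds)] for row in rows], stds


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_node(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], x))


def train_batch_som(data, n_rows, n_cols, epochs=20, sigma_end=0.5):
    """Batch SOM: each epoch, every node becomes a neighborhood-weighted data mean."""
    grid = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    # Assumed initialisation: evenly spaced data items, so runs are repeatable.
    weights = [list(data[k * len(data) // len(grid)]) for k in range(len(grid))]
    sigma_start = max(n_rows, n_cols) / 2
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac  # shrinking width
        winners = [best_node(weights, x) for x in data]
        new_weights = []
        for k, pos in enumerate(grid):
            total, acc = 0.0, [0.0] * len(data[0])
            for x, w in zip(data, winners):
                h = math.exp(-sq_dist(pos, grid[w]) / (2 * sigma ** 2))  # Gaussian
                total += h
                acc = [a + h * v for a, v in zip(acc, x)]
            new_weights.append([a / total for a in acc] if total > 0 else weights[k])
        weights = new_weights
    return weights


def ward_clusters(vectors, n_clusters):
    """Merge vectors with Ward's criterion until n_clusters groups remain."""
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]  # (members, centroid)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a][0]), len(clusters[b][0])
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * sq_dist(clusters[a][1], clusters[b][1])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        centroid = [(len(ma) * x + len(mb) * y) / (len(ma) + len(mb))
                    for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return [members for members, _ in clusters]


def recall(x, stds, weights, node_cluster):
    """Assign an unseen item to the cluster of its closest node, without retraining."""
    return node_cluster[best_node(weights, [v / s for v, s in zip(x, stds)])]


# Toy data: two hypothetical metrics, two obvious groups of SQL statements.
raw = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
scaled, stds = scale_to_unit_variance(raw)
weights = train_batch_som(scaled, 1, 2)
groups = ward_clusters(weights, 2)
node_cluster = {node: cid for cid, members in enumerate(groups) for node in members}
print(recall([0.5, 0.5], stds, weights, node_cluster),
      recall([10.5, 10.5], stds, weights, node_cluster))
```

Here Ward's criterion is applied to the trained node vectors rather than to the raw data items, which keeps the clustering step cheap even for thousands of SQL statements. A production tool may additionally restrict merges to neighboring map nodes or weight nodes by their hit counts; this sketch does neither.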