Extending ORACLE SQL To Mine Clusters in Databases

REPUBLIC OF IRAQ
MINISTRY OF HIGHER EDUCATION AND SCIENTIFIC RESEARCH
UNIVERSITY OF BAGHDAD
COLLEGE OF SCIENCE
COMPUTER SCIENCE DEPARTMENT

Extending ORACLE SQL To Mine Clusters in Databases

A Dissertation Submitted to Baghdad University \ College of Science \ Computer Science Department as a partial fulfillment of the requirements for the degree of Master in Computer Science

By Ahmed A. Hamdan AL-Abodi (B.Sc.)

Supervised by Ass.Prof. Dr. Hussein Keitan AL-Kafaji

December Dhulqa'da

Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Baghdad / College of Science
Computer Science Department

Extending the Structured Query Language of the Oracle System to Mine Clusters in Databases

A project submitted to the University of Baghdad / College of Science / Computer Science Department as part of the requirements for the degree of Master in Computer Science

By the student Ahmed Abdul-Hassan Hamdan AL-Abodi

Supervised by Dr. Hussein Keitan AL-Kafaji

Dhulqa'da 1425 AH, December 2004 AD

Supervisor Certification

We certify that this Dissertation was prepared under our supervision at the Department of Computer Science of the University of Baghdad as a partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Signature:
Name: Ass.Prof. Dr. Hussein Keitan AL-Kafaji (Supervisor)
Date: / / 2005

Certification of the Head of the Department

In view of the available recommendation, I forward this Dissertation for debate by the examining committee.

Signature:
Name: Makia K. Hamad (Assistant Professor)
Date: / / 2005
Head of the Computer Science Department

Examining Committee Certification

We certify that we have read this Dissertation and, as an examining committee, examined the student in its content and what is related to it, and that in our opinion it meets the standard of a Dissertation for the degree of Master of Science in Computer Science.

Signature:
Name: Ass.Prof. Dr. Hussein Keitan AL-Kafaji (Supervisor)
Date: / / 2005

Signature:
Name: Makia K. Hamad (Assistant Professor) (Member)
Date: / / 2005

Signature:
Name: Dr. Lamia H. Khalid (Assistant Professor) (Chairman)
Date: / / 2005

To the Immaculate Spirits of my Grandfather
To My Mother, Father, Brother, and Sisters.

ACKNOWLEDGEMENTS

Special thanks and gratitude are due to my advisor, Dr. Hussein Keitan, for his tireless patience, his useful comments, and his excellent advice and support. I am grateful to the staff of the Department of Computer Science of the University of Baghdad. Last, my thanks go to all the people who supported this project directly or indirectly, and they anticipate the concluding musk.

AHMED

Table of Contents

Chapter One: Introduction
    1.1 Introduction
    Project Contributions
    Project Organization

Chapter Two: KDD, DM and Clustering Techniques
    2.1 Introduction
    Knowledge Discovery in Database
    Data Mining
    Data Mining Tasks
    Association Rules
    Classification
    Clustering
    Important Issues
    Classification of Clustering Algorithm
    Partitional Techniques
    Hierarchical Techniques
    Density-Based Partitioning
    Grid-Based Methods
    Sequential patterns
    Similarity sequence discovery
    Integration Approaches

Chapter Three: Some Features of Oracle SQL
    3.1 Introduction
    Oracle PL/SQL
    PL/SQL Block Structure
    Types of Oracle SQL Instructions
    Dynamic SQL
    Steps to Process SQL Statements
    Using the DBMS_SQL Package
    Native Dynamic SQL
    Dbms_SQL Versus Native Dynamic_SQL
    3.2.6 Advantages of Native Dynamic SQL
    Advantages of the DBMS_SQL Package

Chapter Four: Extending SQL to Mine Clustering Database
    4.1 Introduction
    Implementation of MSQL
    The Translator
    Lexical Analyzer of DM Queries
    Syntax Analyzer of DM Queries
    Semantic Analysis
    Parameter Generator
    Parsing of DDL and DML queries
    Execution of DDL, DML and MSQL queries
    The Reporter

Chapter Five: Discussion, Conclusion, and Future Works
    5.1 Discussion and Conclusion
    Future Works

References

List of Tables

Table (2.1) Some procedures and functions of DBMS_SQL
Table (3.2) Clauses of General forms
Table (4.1) Some of reserved Words and Characters of ESQL
Table (4.2) Identification of data types in Oracle

List of Figures

Figure (2.1) KDD Process
Figure (2.2) Data sets on which centroid and medoid approaches fail
Figure (2.3) K_Means Techniques
Figure (3.1) The flow of execution in DBMS_SQL
Figure (3.2) Simple Example of using DBMS_SQL
Figure (3.3) Usages Execute immediate Statement
Figure (3.4) Simple Example of using Native DBMS_SQL
Figure (3.5) Example by using DBMS_SQL package
Figure (3.6) Example by using Native DBMS_SQL package
Figure (4.1) MSQL Architecture
Figure (4.2) BNF of Mining statements of MSQL
Figure (4.3) PL/SQL code of reserved words table
Figure (4.4) A flowchart of ESQL Lexical Analyzer
Figure (4.5) PL/SQL Code to recognize reserved words, identifiers, and numbers
Figure (4.6) PL/SQL Code of recognize Some of delimiters
Figure (4.7) A Part of Syntax Analysis Processing
Figure (4.8) A Part of PL/SQL Code of the Syntax Analyzer
Figure (4.9) Parameter List of the Generated
Figure (4.10) Simple example of desc_tab
Figure (4.11) Oracle define_column stored procedure

Table of Abbreviations

Abbreviation    Meaning
DBMS            Database Management System
DBMSs           Database Management Systems
DDL             Data Definition Language
DM              Data Mining
DML             Data Manipulation Language
DOTI            Discoverer Of Terminated Itemsets
DSS             Decision Support System
MSQL            Mining Structured Query Language
KDD             Knowledge Discovery in Database
OLAP            On-Line Analytical Processing
OLTP            On-Line Transaction Processing
PL/SQL          Procedural Language/Structured Query Language
SQL             Structured Query Language

Abstract

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The mined datasets are often in relational format, yet most mining systems do not use a relational DBMS; they have only a loose connection with the database. Thus, they miss the opportunity to leverage the database technology developed over the last couple of decades. Furthermore, data mining systems emerged to serve mining goals only; that is, until now there is no integral system that can answer both mining queries and traditional queries, i.e., SQL.

In this project, an integral system is built by extending the SQL language with a set of new statements and functions, which can be used to discover clusters in database tables and to answer traditional SQL queries. The proposed system is implemented by using the Oracle DBMS, to avoid the problem of a loose connection between data mining and the DBMS.

This research was accomplished through several theoretical and systematic steps:
1) Extending the BNF of SQL to cluster databases.
2) Designing a translator for the extended BNF.
3) Implementing the k-means clustering algorithm.
4) Designing and implementing several reporting interfaces to display the mined clusters.

MSQL can be executed under Windows 98, Windows Me, Windows XP, or Windows 2000 by using Oracle 8i or later versions such as Oracle 9i.

CHAPTER ONE
Introduction

1.1 Introduction

Database technology has been successfully used in traditional business data processing. Organizations increasingly use relational database systems to store vast amounts of data, and they rely on a DBMS to manage it. Traditional databases support On-Line Transaction Processing (OLTP), which includes insertion, updating, and deletion, and they also support information query requirements.

The amount of data kept in various repositories (databases) is growing at a phenomenal rate and, as a somewhat surprising consequence, the amount of meaningful information decreases rapidly. The tremendous amount of data is beyond human capabilities to process reasonably. It is not possible for us to look at the database, see any useful patterns in the data, and consequently derive some potentially useful knowledge from our observation. Therefore, because of the compelling need to extract useful information from this data, and because decision makers need information at the correct level of detail to support their decision making, data warehousing, On-Line Analytical Processing (OLAP), Decision Support Systems (DSS), and data mining arose. Data warehousing provides access to data for complex analysis, knowledge discovery, and decision making. OLAP describes the analysis of complex data from data warehouses, and a DSS is an application that enables users to make strategic decisions. All of these terms are explained in more detail in chapter two.

It is obvious that there is a continuous challenge in selecting the appropriate information resources to maintain and extend people's personal

knowledge; all major organizations hold gigabytes of data with much hidden information that cannot easily be traced by using SQL or other shallow query facilities. Data mining algorithms can find interesting regularities, unexpected patterns, and new rules in large databases. Data mining tools can also answer business questions that traditionally were time-consuming to resolve. Data mining and the DBMS (Database Management System) suffer from poor integration; that is, the miner regards the database as a container from which it fills its own data structures, and it works independently of the control of the DBMS.

One of the major technologies in data mining involves the discovery of clusters. Clustering is a division of data into groups of similar objects [SADC93, CHY96]. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others [JD88].

Clustering is the subject of active research in several fields. This work focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms

have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems.

The main problem facing data mining is the poor integration of data mining applications with database management systems (DBMS). Data mining techniques regard databases as containers from which data is fetched and then processed by DBMS-independent mining programs. Gradually, attention has shifted from defining new mining operations and developing algorithms for them (most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each algorithm) to the issues of integrating the mining process with the DBMS. The challenge of extending database management systems for data mining applications has been the subject of most recent research.

Today, data mining algorithms (especially clustering algorithms) are not well integrated with DBMSs. DBMSs are designed for processing large data sets and offering generic operations. Data mining systems depend on algorithms whose steps require programming language statements; therefore they cannot be implemented with the traditional tool of DBMSs, namely SQL. However, there are three techniques for connecting a mining system with a DBMS: 1) loosely coupled, 2) tightly coupled, and, as a newer trend, 3) DBMS-embedded miners [Huss02, Yass03].

Loosely coupled systems are completely separated from the DBMS world. The miner is written in a specialized DBMS, and the data are converted from the DBMS domain to a flat file to be mined by the miner. A tightly coupled system is usually written in a language that contains embedded statements that can access the data managed by a specific DBMS. This technique reduces data traffic and keeps some DBMS control over the data. The third technique, i.e., embedded

mining, is an attempt to completely integrate the mining process with the DBMS by utilizing its capabilities. Other researchers [Huss02] suggested a mining system completely embedded in a DBMS. The design of that system depends on the DOTI algorithm, which is proposed in the same thesis.

There have been language proposals to extend SQL to support mining operations. For example, the query language DMQL extends SQL with a collection of operators for mining characteristic rules, association rules, etc. Another example is the mine rule operators [Rosa96] for a generalized version of the association rule discovery problem. Besides, the M-SQL language [Imie96] extends SQL with a special unified operator to generate and query a whole set of propositional rules. Query flocks, which mine association rules by using a generate-and-test model, have also been proposed. ATLaS [Atla ] may be considered promising research relative to the (incomplete) list of previous work above, but it is also implemented by using C++. MSQL extends ORACLE SQL to mine association rules from transactional databases [Yass03].

1.2 Project Contributions

In this research, an extension to DBMS functionality is proposed toward on-demand cluster discovery. This extension is achieved by extending the well-known SQL language with a set of new statements and functions. The BNF, i.e., the grammar, of ordinary SQL has been extended to include cluster mining queries, and a robust parser has been developed to parse and recognize users' queries. This mining tool is named Mining SQL (MSQL). Each submitted query is parsed by MSQL to determine whether it is an SQL query or a mining query, and it is executed in either case. The integral system in this thesis is implemented by using Oracle8i and

Developer 6i. This system removes the problem of poor integration between data mining and the DBMS. It reinforces the trend of completely embedding mining tasks within database management systems, and it is an attempt to make mining tasks traditional operations of DBMSs.

1.3 Project Organization

The remainder of this research is organized as follows. Chapter two presents Knowledge Discovery in Database (KDD) and its relationship with data mining; the major tasks of data mining, the data mining process, and the alternative approaches to integrating data mining with database systems are discussed thoroughly. Chapter three presents and explains some Oracle features used to implement MSQL. Chapter four introduces the implementation of the proposed system, which is implemented by using Oracle8i packages and Developer 6i. Chapter five summarizes conclusions and recommendations.

CHAPTER TWO
KDD, DM and Clustering Techniques

2.1 Introduction

Large databases are routinely being collected in science, business, and medicine. A variety of techniques from statistics, signal processing, pattern recognition, machine learning, and neural networks have been proposed to understand the data by discovering useful categories. However, to date, research in data mining has not paid attention to the cognitive factors that make learned categories comprehensible.

Knowledge Discovery in Database

KDD is the process of looking in a database to find hidden knowledge patterns (or regularities) without a predetermined idea or hypothesis about what the pattern may be. KDD has been defined as follows [Usam96a]: KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data.

Sometimes the two terms KDD and data mining are used interchangeably. However, from a research-oriented perspective in computer science, knowledge discovery in database, or KDD, aims to set up the infrastructure for data mining, largely at the organizational level. KDD refers to the overall process of knowledge discovery, while the term data mining refers to the actual algorithms used in the discovery process [Usam96a, Marg00]. The KDD process is interactive and

iterative, involving numerous steps with many decisions being made by the user. [Usam96a, Marg00] describe the KDD process as follows:

1) Data selection
The goal of this phase is the extraction of data that is relevant to the discovery process from a large data store.

2) Data cleaning and preprocessing
This phase is concerned with the data cleaning and preparation tasks that are necessary to ensure correct results, such as removing noise or outliers if appropriate, collecting the information necessary to model or account for noise, and deciding on strategies for handling missing data fields. Eliminating missing values in the data, ensuring that coded values have a uniform meaning, and ensuring that no spurious data values exist are typical actions that occur during this phase [Usam96a, Kenn98].

3) Data transformation
This phase aims to eliminate unwanted or highly correlated fields so that the results are valid. It uses dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data [Usam96a, Kenn98].

4) Data mining
This phase concerns deciding whether the goal of the KDD process is classification, association discovery, clustering, etc., and choosing the data mining algorithm(s) (selecting the method(s) to be used for searching for patterns in the data) to discover meaningful patterns and rules and to produce predictive models [Usam96a, Kenn98]. This is the core element of the KDD process. Clustering in DBMSs is the focus of this project; therefore the next section explains data mining in detail.

5) Interpretation and consolidation
In interpreting mined patterns, it is possible to return to any of steps (1-4) for further iteration and evaluation. Consolidating discovered knowledge means incorporating this knowledge into the performance system, or simply documenting and reporting it to interested parties. This also includes checking for and resolving potential conflicts with previously extracted knowledge [Usam02].

The KDD process can involve significant iteration and may contain loops between any two steps. Figure (2.1) shows the steps of the KDD process.

[Figure (2.1) KDD Process: Data → Selection → Target data → Cleaning → Preprocessed data → Transformation → Transformed data → Data mining → Patterns → Interpretation → Knowledge. Many loops are possible.]

2.3 Data Mining

Data mining is a step in knowledge discovery in database (KDD) that searches for a series of hidden patterns in data. Data mining refers to the non-trivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data stored in databases, data warehouses, or other repositories, and the transformation of that information into useful knowledge for the user [Elis01].

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Data mining centers on the automated discovery of new facts and relationships in data; data mining tools uncover hidden information. Non-intuitive relationships between variables and customer behavior are the gems that data mining hopes to find. Data mining has many applications in business; customer segmentation, market basket analysis, risk management, fraud detection, delinquency tracking, and demand prediction are some examples of major data mining applications.

Data Mining Tasks

The primary high-level goals of data mining in practice fall into two categories: description and prediction [Usam96a, Piet98]. Descriptive tasks, such as association discovery and clustering, characterize the general properties of the data in the database. Predictive tasks, such as classification, use some variables or fields in the database to predict unknown or future values of other variables of interest. The following subsections give a closer look at the data mining tasks typically used in a variety of well-known applications when a large amount of data is considered.

Association Rules

Associations are affinities between items. Association discovery algorithms find combinations where the presence of one item suggests the presence of another. When these algorithms are applied to the shopping transactions at a supermarket, they uncover affinities among products that are likely to be purchased together. Association rules represent such affinities.
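As a rough illustration of the affinities described above, the following Python sketch counts how often pairs of items appear together in a handful of made-up supermarket baskets. The basket contents and the function name `pair_counts` are assumptions introduced only for this example; it is not an association-rule miner, just the co-occurrence counting on which such miners build.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count, for every unordered pair of items, the baskets that contain both."""
    counts = Counter()
    for basket in baskets:
        # set() ignores repeated items; sorted() gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical shopping transactions (not taken from the thesis)
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "milk", "eggs"],
    ["butter", "eggs"],
]
counts = pair_counts(baskets)
most_common_pair, frequency = counts.most_common(1)[0]
```

In this toy data the pair that co-occurs most often is a candidate affinity; a real miner would additionally filter pairs by thresholds chosen by the analyst.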

Classification

Classification is learning a function that maps (classifies) a data item into one of several predefined classes [Usam96a]. Classification determines whether an object belongs to a given class, chosen among a set of predefined classes, based on the values of some object attributes, i.e., based on a given classification function [Elis01]. It is often referred to as supervised learning because the classes are determined prior to examining the data. Regression is a related task in which a data item is mapped to a real-valued prediction variable. A particularly efficient method for producing a classifier from data is to generate a decision tree [Usam96a]. There are several classification methods, such as neural networks, genetic algorithms, and decision trees [Piet98]. Examples of classification applications include image and pattern recognition and medical diagnosis.

Clustering

Clustering in data mining [SADC93, CHY96] is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [JD88, CHY96]. These discovered clusters can be used to explain the characteristics of the underlying data distribution, and thus serve as the foundation for other data mining and analysis techniques. The applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the World Wide Web, grouping of genes and proteins that have similar functionality, grouping of spatial locations prone to earthquakes from seismological data [BR98, XEKS98], etc.

Existing clustering algorithms, such as K-means, PAM, and ROCK, are designed to find clusters that fit some static models. For example, K-means assumes that clusters are hyper-ellipsoidal (or globular) and are of similar sizes. Agglomerative hierarchical clustering algorithms, such as CURE and ROCK, use a static model to determine the most similar clusters to merge in the hierarchical clustering. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes.

Important Issues

The properties of clustering algorithms that are of concern in data mining include:
• Scalability to large datasets.
• Ability to work with high dimensional data.
• Ability to find clusters of irregular shape.
• Handling outliers.
• Time complexity.
• Data order dependency.
• Labeling or assignment (hard or strict vs. soft or fuzzy).
• Reliance on a priori knowledge and user defined parameters.
• Interpretability of results.

While these issues are kept in mind, realistically only a few of them are mentioned with each algorithm discussed. The above list is in no way exhaustive. For example, it does not cover such properties as the ability to

work in a pre-defined memory buffer, the ability to restart, and the ability to provide an intermediate solution.

Classification of Clustering Algorithms

Categorization of clustering algorithms is neither straightforward nor canonical. In reality, the groups below overlap. For the reader's convenience, the classification followed in this chapter is given first; the corresponding terms are explained afterwards.

Clustering Algorithms
• Hierarchical Methods
    - Agglomerative Algorithms
    - Divisive Algorithms
• Partitioning Methods
    - Relocation Algorithms
    - Probabilistic Clustering
    - K-medoids Methods
    - K-means Methods
• Density-Based Algorithms
    - Density-Based Connectivity Clustering
    - Density Functions Clustering
• Grid-Based Methods
• Methods Based on Co-Occurrence of Categorical Data
• Constraint-Based Clustering
• Clustering Algorithms Used in Machine Learning
    - Gradient Descent and Artificial Neural Networks
    - Evolutionary Methods
• Scalable Clustering Algorithms

• Algorithms For High Dimensional Data
    - Subspace Clustering
    - Projection Techniques
    - Co-Clustering Techniques

These techniques are explained briefly, except for k-means, which is explained in detail because it is the one selected to implement the mining SQL.

Partitional Techniques

Partitional clustering attempts to break a data set into K clusters such that the partition optimizes a given criterion. Centroid-based approaches, as typified by K-means and ISODATA, try to assign points to clusters such that the mean square distance of points to the centroid of the assigned cluster is minimized. Centroid-based techniques are suitable only for data in metric spaces (e.g., Euclidean space), in which it is possible to compute a centroid of a given set of points. Medoid-based methods, as typified by PAM (Partitioning Around Medoids) and CLARANS, work with similarity data, i.e., data in an arbitrary similarity space. These techniques try to find representative points (medoids) so as to minimize the sum of the distances of points from their closest medoid.

A major drawback of both of these schemes is that they fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster. This can happen in many natural clusters; for example, if there is a large variation in cluster sizes (as in Figure (2.2) (a)) or when cluster shapes are convex (as in Figure (2.2) (b)).
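To make the difference between the two kinds of cluster representative concrete, here is a minimal Python sketch. The helper names `centroid` and `medoid` and the sample points are assumptions for illustration; the sketch only computes the representative of a single group of points and is not PAM, CLARANS, or ISODATA.

```python
import math


def centroid(points):
    """Component-wise mean of the points; it need not be one of the points."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


def medoid(points, dist=math.dist):
    """The member point whose summed distance to all the points is smallest."""
    return min(points, key=lambda cand: sum(dist(cand, p) for p in points))


# Hypothetical two-dimensional data with one far-away point
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
c = centroid(points)   # an averaged location, pulled toward the far point
m = medoid(points)     # always an actual data point
```

The medoid only needs a distance (or dissimilarity) function, which is why medoid-based methods also apply to data for which a mean cannot be computed; passing a different `dist` callable to `medoid` reflects that.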

[Figure (2.2): Data sets on which centroid and medoid approaches fail. (a) Clusters of widely different sizes. (b) Clusters with convex shapes.]

• The k-means Technique

The k-means algorithm [Hartigan 75; Hartigan & Wong 79] is by far the most popular clustering tool used in scientific and industrial applications. The name comes from representing each of the k clusters $C_j$ by the mean (or weighted average) $c_j$ of its points, the so-called centroid. While this obviously does not work well with categorical attributes, it makes good geometric and statistical sense for numerical attributes. The sum of discrepancies between a point and its centroid, expressed through an appropriate distance, is used as the objective function. For example, the $L_2$-norm based objective function, the sum of the squares of errors between the points and the corresponding centroids, is equal to the total intracluster variance

$$E(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$$

Note that only means are estimated. A simple modification would normalize individual errors by cluster radii (cluster standard deviation), which makes a lot of sense when clusters have different dispersions. An

objective function based on the $L_2$-norm has many unique algebraic properties. For example, it coincides with the pair-wise errors

$$E'(C) = \frac{1}{2} \sum_{j=1}^{k} \sum_{x_i, y_i \in C_j} \lVert x_i - y_i \rVert^2$$

and with the difference between the total data variance and the inter-cluster variance. Therefore, cluster separation is achieved simultaneously with cluster tightness.

Two versions of k-means iterative optimization are known. The first version consists of two-step major iterations that (1) reassign all the points to their nearest centroids, and (2) recompute the centroids of the newly assembled groups. Iterations continue until a stopping criterion is achieved (for example, no reassignments happen). This version is known as Forgy's algorithm [Forgy65] and has many advantages:
• It easily works with any Lp norm.
• It allows straightforward parallelization.
• It is insensitive with respect to data ordering.

The second (classic in iterative optimization) version of k-means iterative optimization reassigns points based on a more detailed analysis of the effects on the objective function caused by moving a point from its current cluster to a potentially new one. If a move has a positive effect, the point is relocated and the two centroids are recomputed. It is not clear that this version is computationally feasible, because the outlined analysis requires an inner loop over all member points of the clusters affected by the centroid shifts. However, in the $L_2$ case it is known [Duda & Hart 73] that all computations can be algebraically reduced to simply computing a single distance!

Therefore, in this case both versions have the same computational complexity.

The wide popularity of the k-means algorithm is well deserved. It is simple, straightforward, and based on the firm foundation of analysis of variances. The k-means algorithm also suffers from all the usual suspects:
• The result strongly depends on the initial guess of centroids (or assignments).
• The computed local optimum can be a far cry from the global one.
• It is not obvious what a good k to use is.
• The process is sensitive with respect to outliers.
• The algorithm lacks scalability.
• Only numerical attributes are covered.
• Resulting clusters can be unbalanced (in Forgy's version, even empty).

[Figure (2.3) K_MEANS Technique]
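The two-step (Forgy-style) iteration and the squared-error objective discussed in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the PL/SQL implementation developed in this thesis: the function name `kmeans_forgy`, the handling of empty clusters (the old centroid is kept), and the sample data are all choices made for the example.

```python
import math


def kmeans_forgy(points, centroids, max_iter=100):
    """Two-step k-means with Euclidean distance.

    points    -- list of equal-length numeric tuples
    centroids -- initial guesses, one per cluster (the result depends on them)
    Returns (centroids, labels, sse).
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Step 1: reassign every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # stopping criterion: no reassignments happened
            break
        labels = new_labels
        # Step 2: recompute the centroid (mean) of each newly assembled group.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    # Objective E(C): sum of squared distances of points to their centroids.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels, sse = kmeans_forgy(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Because the outcome depends on the initial centroids, a caller would typically run such a routine from several starting guesses and keep the run with the smallest `sse`.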

Hierarchical Techniques

Hierarchical clustering algorithms produce a nested sequence of clusters, with a single all-inclusive cluster at the top and single-point clusters at the bottom. Agglomerative hierarchical algorithms start with each data point as a separate cluster. Each step of the algorithm merges the two clusters that are the most similar. After each merge, the total number of clusters decreases by one. These steps can be repeated until the desired number of clusters is obtained or the distance between the two closest clusters is above a certain threshold distance.

There are many different variations of agglomerative hierarchical algorithms. These algorithms primarily differ in how they update the similarity between existing clusters and the merged clusters. In some methods, each cluster is represented by a centroid or medoid of the points contained in the cluster, and the similarity between two clusters is measured by the similarity between the centroids/medoids of the clusters. Like partitional techniques such as K-means and K-medoids, these methods also fail on clusters of arbitrary shapes and different sizes. In the single link method, each cluster is represented by all the data points in the cluster. The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. Unlike the centroid/medoid based methods, this method can find clusters of arbitrary shape and different sizes. However, this method is highly susceptible to noise, outliers, and artifacts.

Density-Based Partitioning

An open set in the Euclidean space can be divided into a set of its connected components. The implementation of this idea for partitioning a finite set of points requires the concepts of density, connectivity, and boundary. They are closely related to a point's nearest neighbors. A

cluster, defined as a connected dense component, grows in any direction that density leads. Therefore, density-based algorithms are capable of discovering clusters of arbitrary shapes. This also provides natural protection against outliers. Figure (2.2) (partitional technique) illustrates some cluster shapes that present a problem for partitioning relocation clustering (e.g., k-means) but are handled properly by density-based algorithms. They also have good scalability. These outstanding properties are tempered by certain inconveniences. From a very general data description point of view, a single dense cluster consisting of two adjacent areas with significantly different densities (both higher than a threshold) is not very informative. Another drawback is a lack of interpretability. An excellent introduction to density-based methods is contained in the textbook [Han & Kamber 01].

Since density-based algorithms require a metric space, their natural setting is spatial data clustering. To make computations feasible, some index of the data is constructed (such as an R*-tree). This is a topic of active research. Classic indices were effective only with reasonably low-dimensional data. The algorithm DENCLUE, which is in fact a blend of density-based clustering and grid-based preprocessing, is less affected by data dimensionality.

There are two major approaches to density-based methods. The first approach pins density to a training data point and is reviewed in the sub-section Density-Based Connectivity. Representative algorithms include DBSCAN, GDBSCAN, OPTICS, and DBCLASD. The second approach pins density to a point in the attribute space and is explained in the sub-section Density Functions. It includes the algorithm DENCLUE.

Grid-Based Methods

In the previous section the crucial concepts of density, connectivity, and boundary were used, which required elaborate definitions. Another

way of dealing with them is to inherit the topology from the underlying attribute space. To limit the search combinations, multi-rectangular segments are considered. Recall that a segment (also cube, cell, region) is a direct Cartesian product of individual attribute sub-ranges (contiguous in the case of numerical attributes). Since some binning is usually adopted for numerical attributes, methods that partition space are frequently called grid-based methods. The elementary segment corresponding to single-bin or single-value sub-ranges is called a unit.

Overall, we shift our attention from data partitioning to space partitioning. Data partitioning is induced by the points' membership in the segments resulting from space partitioning, while space partitioning is based on grid characteristics accumulated from the input data. One advantage of this indirect handling (data → grid-data → space-partitioning → data-partitioning) is that the accumulation of grid-data makes grid-based clustering techniques independent of data ordering. In contrast, relocation methods and all incremental algorithms are very sensitive to data ordering. While density-based partitioning methods work best with numerical attributes, grid-based methods work with attributes of different types.

Sequential patterns

The problem of discovering sequential patterns is to find inter-transaction patterns such that the presence of a set of items is followed by another item in the timestamp-ordered transaction set [Zhen01]. By analyzing this information, mining systems can determine temporal relationships among data items. An example of a sequential pattern could be that 37% of customers who buy Windows software and Office software also buy anti-virus software within 30 days.
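A percentage such as the one in the example above can be computed directly from timestamped transactions. The following Python sketch is only an illustration under stated assumptions: the function name `followed_within`, the record layout (customer, date, items) and the sample purchases are hypothetical, and the antecedent items are assumed to be bought together in one transaction.

```python
from datetime import date, timedelta


def followed_within(transactions, antecedent, consequent, window_days=30):
    """Share of customers who bought all `antecedent` items in one transaction
    and bought `consequent` at most `window_days` days afterwards."""
    antecedent = set(antecedent)
    window = timedelta(days=window_days)

    # Group every customer's transactions together.
    history_of = {}
    for customer, day, items in transactions:
        history_of.setdefault(customer, []).append((day, set(items)))

    with_antecedent = 0
    followed = 0
    for history in history_of.values():
        # Days on which this customer bought the whole antecedent set.
        starts = [day for day, items in history if antecedent <= items]
        if not starts:
            continue
        with_antecedent += 1
        if any(
            consequent in items and start < day <= start + window
            for start in starts
            for day, items in history
        ):
            followed += 1
    return followed / with_antecedent if with_antecedent else 0.0


# Hypothetical purchase records.
transactions = [
    ("c1", date(2004, 1, 5), {"Windows", "Office"}),
    ("c1", date(2004, 1, 20), {"AntiVirus"}),          # 15 days later
    ("c2", date(2004, 1, 10), {"Windows", "Office"}),
    ("c2", date(2004, 3, 15), {"AntiVirus"}),          # 65 days later
    ("c3", date(2004, 2, 1), {"Windows"}),             # antecedent incomplete
    ("c3", date(2004, 2, 10), {"AntiVirus"}),
    ("c4", date(2004, 2, 1), {"Windows", "Office"}),   # never buys anti-virus
    ("c5", date(2004, 2, 3), {"Windows", "Office", "Mouse"}),
    ("c5", date(2004, 2, 28), {"AntiVirus"}),          # 25 days later
]

share = followed_within(transactions, {"Windows", "Office"}, "AntiVirus")
```

Four customers bought both Windows and Office, and two of them bought anti-virus software within the window, so `share` is 0.5 for this sample.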

Similarity sequence discovery

Similar time sequence discovery finds all occurrences or similar occurrences, or finds sequences similar to a given sequence, in a database of time-series data. A time series is a set of values of one variable over a period of time [Pete98]. As an example, time-series analysis techniques can be used to predict the income of a given company during the next year, starting from the incomes of previous years and the current customer payment situation.

2.4 Integration Approaches

Most current data mining applications have only a loose connection with databases. There are three different ways in which data mining systems relate to a relational DBMS: they may not use a database at all, they may be loosely-coupled, or they may be tightly-coupled [Suni98, Rake96b].

Data mining systems that do not use a relational DBMS provide their own memory and storage management. The disadvantage of this database-less approach is the lost opportunity to leverage the relational database technology developed over the last couple of decades.

Most current data mining applications that do use a database have a loose connection with it. Some data mining systems use a DBMS only to store and retrieve the data. The majority treat the database simply as a container from which data is extracted to populate main-memory data structures before the main execution begins. Most database applications use loosely-coupled SQL to fetch data records as needed by the mining algorithm. The front end of such an application is implemented in a host programming language with SQL statements embedded in it. The application uses a SQL SELECT statement to retrieve the set of records of interest from the database. A loop in the application

program copies records in the result set one by one from the database address space to the application address space, where computation is performed on them. The loosely-coupled approach requires converting the data from the DBMS format to a format of the host language. It also limits the amount of data that can be handled, forcing applications to filter information and use only a part of it to discover patterns. This approach has two performance problems: i) copying records from the database address space to the application address space, and ii) process context switching for each record retrieved, which is costly in a database system built on top of an operating system.

The tightly-coupled approach exhibits more integration and cooperation between DBMSs and data mining applications. The designers selectively push the parts of the application program that perform computation on retrieved records into the database system, thus avoiding the performance degradation of mining systems.

Researchers anticipate accomplishing the mining task under the control of the DBMS. Extending the DBMS's control to cover the mining process removes the drawbacks mentioned above. This can be achieved by finding a new generation of mining algorithms that can be implemented using the capabilities of DBMSs, by developing new utilities, and by selecting a suitable DBMS on which to implement the mining system.

The third approach to integrating DM and DBMSs places the mining process entirely under the control of the DBMS by completely embedding the DM tasks within it. This research extends SQL, the traditional tool of DBMSs, to process queries that retrieve large itemsets and clusters. It follows the third approach and strengthens it.
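The difference between fetching records into the application and pushing the computation into the database can be sketched with a small example. The code below is an illustration only: it uses Python with an in-memory SQLite database as a stand-in for Oracle, and the table `points`, its columns and the two function names are assumptions. SQLite runs inside the application process, so the sketch shows only the shape of the two styles, not the address-space copying or context-switching costs discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (cluster_id INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1, 1, 2), (1, 3, 4), (2, 10, 10), (2, 12, 14), (2, 14, 12)],
)


def centroids_loosely_coupled(conn):
    """Fetch every record and compute the cluster centroids in the application."""
    sums = {}
    for cluster_id, x, y in conn.execute("SELECT cluster_id, x, y FROM points"):
        sx, sy, n = sums.get(cluster_id, (0.0, 0.0, 0))
        sums[cluster_id] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


def centroids_pushed_down(conn):
    """Let the database compute the centroids; only one row per cluster returns."""
    query = "SELECT cluster_id, AVG(x), AVG(y) FROM points GROUP BY cluster_id"
    return {c: (avg_x, avg_y) for c, avg_x, avg_y in conn.execute(query)}


loose = centroids_loosely_coupled(conn)
pushed = centroids_pushed_down(conn)
```

Both functions return the same centroids, but the first moves five records into the application while the second moves only two result rows, which is the motivation for embedding mining operations in the query language.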

CHAPTER THREE
Some Features of Oracle's SQL

3.1 Introduction

Oracle is an extremely powerful and flexible relational database system. Along with this power and flexibility, however, comes complexity. In order to design useful applications based on Oracle, it is necessary to understand how Oracle manipulates the data stored within the system. PL/SQL is an important tool designed for data manipulation, both internally within Oracle and externally in our own applications. PL/SQL is available in a variety of environments, each of which has different advantages [Orac01].

3.2 ORACLE PL/SQL

PL/SQL is a sophisticated programming language used to access an Oracle database from various environments. PL/SQL is integrated with the database server, so that PL/SQL code can be processed quickly and efficiently. It is also available in some client-side Oracle tools.

Oracle is a relational database. The language used to access a relational database is Structured Query Language (SQL). SQL is a flexible, efficient language with features designed to manipulate and examine relational data. For example, suppose we have a table called students; then the following SQL statement will delete all students who are majoring in nutrition from the database:

    DELETE FROM students
      WHERE major = 'nutrition';

SQL is a fourth-generation language. This means that the language describes what should be done, but not how to do it. In the DELETE statement just shown, for example, we do not know how the database will actually determine which students are majoring in nutrition. Presumably,

the server will loop through all the students in some order to determine the proper entries to delete. But the details of this are hidden from us.

Third-generation languages, such as C and COBOL, are more procedural in nature. A program in a third-generation language (3GL) implements a step-by-step algorithm to solve the problem. For example, we could accomplish the DELETE operation with something like this:

    Loop over each student record
      If this record has major = 'nutrition' then
        Delete this record;
      End if;
    End loop;

Each language has advantages and disadvantages. Fourth-generation languages such as SQL are generally fairly simple (compared to third-generation languages) and have fewer commands. They also isolate the user from the underlying data structures and algorithms. In some cases, however, the procedural constructs available in 3GLs are useful for expressing a desired program. This is where PL/SQL comes in: it combines the power and flexibility of SQL (4GL) with the procedural constructs of a 3GL [Orac01].

PL/SQL stands for Procedural Language/SQL. As its name implies, PL/SQL extends SQL by adding constructs found in other procedural languages, such as:

- Variables and types (both predefined and user-defined).
- Control structures such as IF-THEN-ELSE statements and loops.
- Procedures and functions.
- Object types and methods (PL/SQL version 8 and higher).

For example, suppose we want to change the major of a student. If the student does not exist, then we want to create a new record. We do this with the following PL/SQL code [Scott97]:

    DECLARE
      /* Declare variables which will be used in SQL statements */
      v_newmajor  VARCHAR2(10) := 'History';
      v_firstname VARCHAR2(10) := 'Scott';
      v_lastname  VARCHAR2(10) := 'Urman';
    BEGIN
      /* Update the students table */
      UPDATE students
        SET major = v_newmajor
        WHERE first_name = v_firstname
        AND last_name = v_lastname;
      /* Check to see if the record was found. If not, then we need to insert this record */
      IF SQL%NOTFOUND THEN
        INSERT INTO students (first_name, last_name, major)
          VALUES (v_firstname, v_lastname, v_newmajor);
      END IF;
    END;

This example contains two different SQL statements (UPDATE and INSERT) as well as several variable declarations and the conditional IF statement.

PL/SQL Block Structure

The basic unit in PL/SQL is a block. All PL/SQL programs are made up of blocks, which can be nested within each other. Typically, each block performs a logical unit of work in the program, thus separating different tasks from each other. A block has the following structure [PSQL01]:

    DECLARE
      /* Declarative section: PL/SQL variables, types, cursors, and local subprograms */
    BEGIN
      /* Executable section: procedural and SQL statements. This is the main section of the block and the only one that is required */
    EXCEPTION
      /* Exception handling section: error-handling statements */
    END;

Only the executable section is required; the declarative and exception handling sections are optional. The executable section must also contain at least one executable statement. The different sections of the block separate different functions of a PL/SQL program.

Types of Oracle SQL Instructions

Oracle SQL deals with three types of instructions:

1. DDL: DATA DEFINITION LANGUAGE, such as (CREATE, DROP, ALTER).
2. DML: DATA MANIPULATION LANGUAGE, such as (SELECT, INSERT, UPDATE, DELETE).
3. DCL: DATA CONTROL LANGUAGE, which includes control statements such as (GRANT).

Dynamic SQL

Dynamic SQL enables us to write programs that reference SQL statements whose full text is not known until runtime. Before discussing dynamic SQL in detail, a clear definition of static SQL may provide a good starting point for understanding dynamic SQL [PSQL01].

Static SQL statements do not change from execution to execution. The full text of a static SQL statement is known at compilation, which provides the following benefits:

- Successful compilation verifies that the SQL statements reference valid database objects.
- Successful compilation verifies that the necessary privileges are in place to access the database objects.
- Performance of static SQL is generally better than that of dynamic SQL.

Because of these advantages, we should use dynamic SQL only if we cannot use static SQL to accomplish our goals, or if using static SQL is cumbersome compared to dynamic SQL. However, static SQL has limitations that can be overcome with dynamic SQL. We may not always know the full text of the SQL statements that must be executed in a PL/SQL procedure. Our program may accept user input that defines the SQL statements to execute, or it may need to complete some processing work to determine the correct course of action. In such cases, we should use dynamic SQL.

For example, a reporting application in a data warehouse environment might not know the exact table name until runtime. These tables might be named according to the starting month and year of the quarter, for example INV_01_1997, INV_04_1997, INV_07_1997, INV_10_1997, INV_01_1998, and so on. We can use dynamic SQL in our reporting application to specify the table name at runtime. We might also want to run a complex query with a user-selectable sort order. Instead of coding the query twice, with different ORDER BY clauses, we can construct the query dynamically to include a specified ORDER BY clause.
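The reporting example above can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database standing in for Oracle, not PL/SQL dynamic SQL itself; the columns `item` and `amount`, the sample rows and the helper names `quarter_table` and `build_report_query` are assumptions. Because table and column names cannot be supplied as bind values, the sketch checks them against fixed rules before placing them in the statement text.

```python
import sqlite3

# Assumed set of columns a user may sort by.
ALLOWED_SORT_COLUMNS = {"item", "amount"}


def quarter_table(month, year):
    """Name of the inventory table for the quarter starting in month/year."""
    if month not in (1, 4, 7, 10):
        raise ValueError("a quarter starts in month 1, 4, 7 or 10")
    return f"INV_{month:02d}_{year:04d}"


def build_report_query(month, year, sort_column):
    """Assemble the statement text at run time from validated parts."""
    if sort_column not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"cannot sort by {sort_column!r}")
    table = quarter_table(month, year)
    return f"SELECT item, amount FROM {table} ORDER BY {sort_column}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE INV_01_1997 (item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO INV_01_1997 VALUES (?, ?)",
    [("disk", 30.0), ("cable", 50.0), ("monitor", 5.0)],
)

# One query builder serves both sort orders.
query_by_amount = build_report_query(1, 1997, "amount")
rows_by_amount = conn.execute(query_by_amount).fetchall()
rows_by_item = conn.execute(build_report_query(1, 1997, "item")).fetchall()
```

The query is written once and the table name and ORDER BY column are filled in at run time, which is the situation in which static SQL becomes cumbersome.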

Dynamic SQL programs can handle changes in data definitions without the need to recompile. This makes dynamic SQL much more flexible than static SQL. Dynamic SQL lets us write reusable code because the SQL can be easily adapted to different environments. Dynamic SQL also lets us execute data definition language (DDL) statements and other SQL statements that are not supported in purely static SQL programs, as well as data manipulation language (DML) statements whose text is not known until runtime.

Prior to Oracle8i, PL/SQL developers could include dynamic SQL in applications by using the Oracle-supplied DBMS_SQL package. However, performing simple operations using DBMS_SQL involves a fair amount of coding. In addition, because DBMS_SQL is based on a procedural API, it incurs high procedure-call and data-copy overhead [PSQL01].

Steps to Process SQL Statements

All SQL statements have to go through various stages, although some stages may be skipped [Scott97, Orac01]:

a) Parsing
Every SQL statement must be parsed. Parsing the statement includes checking its syntax and validating it, ensuring that all references to objects are correct, and ensuring that the relevant privileges on those objects exist.

b) Binding
After parsing, the Oracle server knows the meaning of the statement but still may not have enough information to execute it. The Oracle server may need values for any bind variables in the statement. The process of obtaining these values is called binding variables.
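The parsing and binding stages described above can be observed from a client program. The sketch below uses Python with an in-memory SQLite database as a stand-in for the Oracle server, so it shows the idea rather than Oracle's internals; the sample rows and the table name `no_such_table` are hypothetical. The `?` placeholder plays the role of a bind variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT, last_name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Scott", "Urman", "History"),
        ("Amal", "Hasan", "Nutrition"),   # hypothetical sample rows
        ("Omar", "Ali", "Nutrition"),
    ],
)

# Parsing: a reference to an object that does not exist is rejected
# before any row is touched.
try:
    conn.execute("SELECT * FROM no_such_table WHERE major = ?", ("History",))
    parse_error = None
except sqlite3.OperationalError as exc:
    parse_error = str(exc)

# Binding: the statement text stays the same; only the value supplied
# for the placeholder changes from one execution to the next.
count_by_major = "SELECT COUNT(*) FROM students WHERE major = ?"
counts = {
    major: conn.execute(count_by_major, (major,)).fetchone()[0]
    for major in ("History", "Nutrition", "Physics")
}
```

Keeping the value out of the statement text means one statement serves every major, and the supplied value is never interpreted as SQL.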