Extending ORACLE SQL To Mine Clusters in Databases


REPUBLIC OF IRAQ
MINISTRY OF HIGHER EDUCATION AND SCIENTIFIC RESEARCH
UNIVERSITY OF BAGHDAD
COLLEGE OF SCIENCE
COMPUTER SCIENCE DEPARTMENT

Extending ORACLE SQL To Mine Clusters in Databases

A Dissertation Submitted to Baghdad University \ College of Science \ Computer Science Department as a partial fulfillment of the requirements for the degree of Master in Computer Science

By Ahmed A. Hamdan AL-Abodi (B.Sc.)
Supervised by Ass.Prof. Dr. Hussein Keitan AL-Kafaji

December 2004 / Dhulqa'da 1425

(Arabic title page, translated)

Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Baghdad / College of Science
Computer Science Department

Extending the Structured Query Language of the Oracle System to Mine Clusters in Databases

A project submitted to the University of Baghdad / College of Science / Computer Science Department as part of the requirements for the degree of Master in Computer Science

By the student Ahmed A. Hamdan AL-Abodi
Supervised by Dr. Hussein Keitan AL-Kafaji

Dhulqa'da 1425 AH / December 2004 AD

Supervisor Certification

We certify that this Dissertation was prepared under our supervision at the Department of Computer Science of the University of Baghdad as a partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Signature:
Name: Ass.Prof. Dr. Hussein Keitan AL-Kafaji (Supervisor)
Date: / / 2005

Certification of the Head of the Department

In view of the available recommendation, I forward this Dissertation for debate by the examining committee.

Signature:
Name: Makia K. Hamad (Assistant Professor)
Date: / / 2005
Head of the Computer Science Department

Examining Committee Certification

We certify that we have read this Dissertation and, as an examining committee, examined the student in its content and what is related to it, and that in our opinion it meets the standard of a Dissertation for the degree of Master of Science in Computer Science.

Signature:
Name: Ass.Prof. Dr. Hussein Keitan AL-Kafaji (Supervisor)
Date: / / 2005

Signature:
Name: Makia K. Hamad (Assistant Professor) (Member)
Date: / / 2005

Signature:
Name: Dr. Lamia H. Khalid (Assistant Professor) (Chairman)
Date: / / 2005

To the Immaculate Spirits of my Grandfather
To My Mother, Father, Brother, and Sisters.

ACKNOWLEDGEMENTS

Special thanks and gratitude are due to my advisor, Dr. Hussein Keitan, for his tireless patience, his useful comments, and his excellent advice and support. I am grateful to the staff of the Department of Computer Science of the University of Baghdad. Last, my thanks go to all the people who supported this project directly or indirectly, and they anticipate the concluding musk.

AHMED


Table of Contents

Chapter One: Introduction
    1.1 Introduction
    Project Contributions
    Project Organization

Chapter Two: KDD, DM and Clustering Techniques
    2.1 Introduction
    Knowledge Discovery in Database
    Data Mining
    Data Mining Tasks
    Association Rules
    Classification
    Clustering
    Important Issues
    Classification of Clustering Algorithm
    Partitional Techniques
    Hierarchical Techniques
    Density-Based Partitioning
    Grid-Based Methods
    Sequential patterns
    Similarity sequence discovery
    Integration Approaches

Chapter Three: Some Features of Oracle SQL
    3.1 Introduction
    Oracle PL/SQL
    PL/SQL Block Structure
    Types of Oracle SQL Instructions
    Dynamic SQL
    Steps to Process SQL Statements Using the DBMS_SQL Package
    Native Dynamic SQL
    DBMS_SQL Versus Native Dynamic SQL
    3.2.6 Advantages of Native Dynamic SQL
    Advantages of the DBMS_SQL Package

Chapter Four: Extending SQL to Mine Clustering Database
    4.1 Introduction
    Implementation of MSQL
    The Translator
    Lexical Analyzer of DM Queries
    Syntax Analyzer of DM Queries
    Semantic Analysis
    Parameter Generator
    Parsing of DDL and DML queries
    Execution of DDL, DML and MSQL queries
    The Reporter

Chapter Five: Discussion, Conclusion, and Future Works
    5.1 Discussion and Conclusion
    Future Works

References

List of Tables

Table (2.1) Some procedures and functions of DBMS_SQL
Table (3.2) Clauses of General forms
Table (4.1) Some of reserved Words and Characters of ESQL
Table (4.2) Identification of data types in Oracle

List of Figures

Figure (2.1) KDD Process
Figure (2.2) Data sets on which centroid and medoid approaches fail
Figure (2.3) K_Means Techniques
Figure (3.1) The flow of execution in DBMS_SQL
Figure (3.2) Simple Example of using DBMS_SQL
Figure (3.3) Usages Execute immediate Statement
Figure (3.4) Simple Example of using Native DBMS_SQL
Figure (3.5) Example by using DBMS_SQL package
Figure (3.6) Example by using Native DBMS_SQL package
Figure (4.1) MSQL Architecture
Figure (4.2) BNF of Mining statements of MSQL
Figure (4.3) PL/SQL code of reserved words table
Figure (4.4) A flowchart of ESQL Lexical Analyzer
Figure (4.5) PL/SQL Code to recognize reserved words, identifiers, and numbers
Figure (4.6) PL/SQL Code of recognize Some of delimiters
Figure (4.7) A Part of Syntax Analysis Processing
Figure (4.8) A Part of PL/SQL Code of the Syntax Analyzer
Figure (4.9) Parameter List of the Generated
Figure (4.10) Simple example of desc_tab
Figure (4.11) Oracle define_column stored procedure

Table of Abbreviations

Abbreviation    Meaning
DBMS            Database Management System
DBMSs           Database Management Systems
DDL             Data Definition Language
DM              Data Mining
DML             Data Manipulation Language
DOTI            Discoverer Of Terminated Itemsets
DSS             Decision Support System
MSQL            Mining Structured Query Language
KDD             Knowledge Discovery in Database
OLAP            On-Line Analytical Processing
OLTP            On-Line Transaction Processing
PL/SQL          Procedural Language / Structured Query Language
SQL             Structured Query Language

Abstract

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The mined datasets are often in relational format, yet most mining systems do not use a relational DBMS; they have only a loose connection with the database. Thus, they miss the opportunity to leverage the database technology developed over the last couple of decades. Furthermore, data mining systems emerged to serve mining goals only; until now there has been no integrated system that can answer both mining queries and traditional (SQL) queries.

In this project, an integrated system is built by extending the SQL language with a set of new statements and functions, which can be used to discover clusters in a database table and to answer traditional SQL queries. The proposed system is implemented by using the Oracle DBMS, to avoid the problem of the loose connection between data mining and the DBMS. The research was carried out in several theoretical and practical steps:

1) Extending the BNF of SQL to cluster databases.
2) Designing a translator for the extended BNF.
3) Implementing the k-means clustering algorithm.
4) Designing and implementing several reporting interfaces to display the mined clusters.

MSQL can be executed under Windows 98, Windows Me, Windows XP, or Windows 2000 by using Oracle 8i or later versions such as Oracle 9i.

CHAPTER ONE
Introduction

1.1 Introduction

Database technology has been used successfully in traditional business data processing. Organizations increasingly use relational database systems to store vast amounts of data and rely on a DBMS to manage it. Traditional databases support On-Line Transaction Processing (OLTP), which includes insertion, updating, and deletion, and they also support information query requirements.

The amount of data kept in various repositories (databases) is growing at a phenomenal rate and, as a somewhat surprising consequence, the amount of meaningful information decreases rapidly. The tremendous amount of data is beyond human capabilities to process reasonably. It is not possible to look at a database, see useful patterns in the data, and derive potentially useful knowledge from that observation. Therefore, because of the compelling need to extract useful information from this data, and because decision makers need information at the correct level of detail to support their decisions, data warehousing, On-Line Analytical Processing (OLAP), Decision Support Systems (DSS), and data mining arose. Data warehousing provides access to data for complex analysis, knowledge discovery, and decision making. OLAP describes the analysis of complex data from data warehouses, and a DSS is an application that enables users to make strategic decisions. All of these terms are explained in more detail in chapter two.

There is a continuous challenge in selecting the appropriate information resources to maintain and extend people's personal knowledge; all major organizations hold gigabytes of data with much hidden information that cannot easily be traced by using SQL or other shallow query facilities. Data mining algorithms can find interesting regularities, unexpected patterns, and new rules in large databases. Data mining tools can also answer business questions that traditionally were time consuming to resolve. Data mining and the DBMS (Database Management System) suffer from poor integration: the miner regards the database as a container from which it fills its own data structures, and it works independently of the control of the DBMS.

One of the major technologies in data mining involves the discovery of clusters. Clustering is a division of data into groups of similar objects [SADC93, CHY96]. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others [JD88].

Clustering is the subject of active research in several fields. This work focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems.

The main problem facing data mining is the poor integration of data mining applications and database management systems (DBMSs). Data mining techniques regard databases as containers from which data is fetched, to be processed by DBMS-independent mining programs. Most early mining systems were developed largely on file systems, and specialized data structures and buffer management strategies were devised for each algorithm. Gradually, attention shifted from defining new mining operations and developing algorithms for them to the issue of integrating the mining process with the DBMS. The challenge of extending database management systems for data mining applications has been the subject of most recent research.

Today, data mining algorithms (especially clustering algorithms) are not well integrated with DBMSs. DBMSs are designed for processing large data sets and offering generic operations. Data mining systems depend on algorithms whose steps require programming language statements; therefore they cannot be implemented with the traditional tool of DBMSs, that is, SQL. However, there are three techniques to connect a mining system with a DBMS: 1) loosely coupled, 2) tightly coupled, and 3) a newer trend called DBMS-embedded miners [Huss02, Yass03].

Loosely coupled systems are completely separated from the DBMS world. The miner is written in a specialized DBMS, and the data are converted from the DBMS domain to a flat file to be mined by the miner. A tightly coupled system is usually written in a language that contains embedded statements which can access the data managed in a specified DBMS. This technique reduces the data traffic and keeps some DBMS control over the data. The third technique, embedded mining, is an attempt to completely integrate the mining process with the DBMS by utilizing its capabilities.

Other researchers [Huss02] suggested a mining system completely embedded in a DBMS. The design of that system depends on the DOTI algorithm, which is proposed in the same thesis. There have also been language proposals to extend SQL to support mining operations. For example, the query language DMQL extends SQL with a collection of operators for mining characteristic rules, association rules, etc. Another example is the mine rule operator [Rosa96] for a generalized version of the association rule discovery problem. In addition, the M-SQL language [Imie96] extends SQL with a special unified operator to generate and query a whole set of propositional rules. Query flocks for association rule mining using a generate-and-test model have also been proposed. ATLaS [Atla ] may be considered promising research among the (incomplete) list of previous work above, but it too is implemented by using C++. MSQL extends ORACLE SQL to mine association rules from transactional databases [Yass03].

1.2 Project Contributions

In this research, an extension to DBMS functionality is proposed toward on-demand cluster discovery. This extension is achieved by extending the SQL language with a set of new statements and functions. The BNF, i.e., the grammar, of ordinary SQL has been extended to contain cluster mining queries, and a robust parser has been developed to parse and recognize the users' queries. This mining tool is named Mining SQL (MSQL). MSQL parses each submitted query to determine whether it is an ordinary SQL query or a mining query, and executes it in either case. The integrated system in this thesis is implemented by using Oracle8i and Developer 6i. This system removes the problem of poor integration between data mining and the DBMS. It reinforces the trend of completely embedding mining tasks within database management systems, and it is an attempt to make mining tasks traditional operations of DBMSs.

1.3 Project Organization

The remainder of this research is organized as follows. Chapter two presents Knowledge Discovery in Database (KDD) and its relationship with data mining; the major tasks of data mining, the data mining process, and the alternative approaches to integrating data mining with database systems are discussed thoroughly. Chapter three presents and explains some Oracle features used to implement MSQL. Chapter four introduces the implementation of the proposed system, which is implemented by using Oracle8i packages and Developer 6i. Chapter five summarizes conclusions and recommendations.

CHAPTER TWO
KDD, DM and Clustering Techniques

2.1 Introduction

Large databases are routinely being collected in science, business, and medicine. A variety of techniques from statistics, signal processing, pattern recognition, machine learning, and neural networks have been proposed to understand the data by discovering useful categories. However, to date, research in data mining has not paid attention to the cognitive factors that make learned categories comprehensible.

2.2 Knowledge Discovery in Database

KDD is the process of looking in a database to find hidden knowledge patterns (or regularities) without a predetermined idea or hypothesis about what the pattern may be. KDD has been defined as follows [Usam96a]: KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data.

Sometimes the two terms KDD and data mining are used interchangeably. However, from a research-oriented perspective in computer science, knowledge discovery in database (KDD) aims to set up the infrastructure for data mining, largely at the organizational level. KDD refers to the overall process of knowledge discovery, while the term data mining refers to the actual algorithms used in the discovery process [Usam96a, Marg00]. The KDD process is interactive and iterative, involving numerous steps with many decisions being made by the user. [Usam96a, Marg00] describe the KDD process as follows:

1) Data selection
The goal of this phase is the extraction of data that is relevant to the discovery process from a large data store.

2) Data cleaning and preprocessing
This phase is concerned with the data cleaning and preparation tasks that are necessary to ensure correct results, such as removing noise or outliers if appropriate, collecting the information necessary to model or account for noise, and deciding on strategies for handling missing data fields. Eliminating missing values in the data, ensuring that coded values have a uniform meaning, and ensuring that no spurious data values exist are typical actions that occur during this phase [Usam96a, Kenn98].

3) Data transformation
This phase aims to eliminate unwanted or highly correlated fields so that the results are valid. It uses dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data [Usam96a, Kenn98].

4) Data mining
This phase concerns deciding whether the goal of the KDD process is classification, association discovery, clustering, etc., and choosing the data mining algorithm(s), that is, selecting the method(s) to be used for searching for patterns in the data, in order to discover meaningful patterns and rules and produce predictive models [Usam96a, Kenn98]. This is the core element of the KDD process. Clustering data managed by a DBMS is the focus of this project; therefore the next section explains data mining in detail.

5) Interpretation and Consolidation
In interpreting mined patterns, it is possible to return to any of steps (1-4) for further iteration and evaluation. Consolidating discovered knowledge means incorporating this knowledge into the performance system, or simply documenting and reporting it to interested parties. This also includes checking for and resolving potential conflicts with previously extracted knowledge [Usam02].

The KDD process can involve significant iteration and may contain loops between any two steps. Figure (2.1) shows the steps of the KDD process.

[Figure (2.1) KDD Process: Data -> Selection -> Target data -> Cleaning -> Preprocessed data -> Transformation -> Transformed data -> Data mining -> Patterns -> Interpretation -> Knowledge. Many loops are possible.]

2.3 Data Mining

Data mining is a step in knowledge discovery in database (KDD) that searches for a series of hidden patterns in data. Data mining refers to the process of non-trivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data stored in databases, data warehouses, or other repositories, and its transformation into useful knowledge for the user [Elis01].

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Data mining centers on the automated discovery of new facts and relationships in data; data mining tools uncover hidden information. Non-intuitive relationships between variables and customer behavior are the gems that data mining hopes to find. Data mining has many applications in the business area: customer segmentation, market basket analysis, risk management, fraud detection, delinquency tracking, and demand prediction are some examples of major data mining applications.

Data Mining Tasks

The primary high-level goals of data mining in practice can be classified into two categories: description and prediction [Usam96a, Piet98]. Description tasks, such as association discovery and clustering, characterize the general properties of the data in the database. Prediction tasks, such as classification, use some variables or fields in the database to predict unknown or future values of other variables of interest. The following subsections take a closer look at the data mining tasks typically used in a variety of well-known applications when a large amount of data is considered.

Association Rules

Associations are affinities between items. Association discovery algorithms find combinations where the presence of one item suggests the presence of another. When these algorithms are applied to the shopping transactions at a supermarket, they uncover affinities among products that are likely to be purchased together. Association rules represent such affinities.
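The idea that the presence of one item suggests the presence of another can be made concrete with a small sketch. The excerpt does not define how an affinity is measured; the two measures below, support and confidence, are the usual ones in the association rule literature and are introduced here as an assumption. The basket data and the function names are hypothetical.

```python
# Hypothetical supermarket baskets; each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent` (the rule antecedent -> consequent)."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"bread -> butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, bread and butter occur together in three of the five baskets, and three of the four baskets that contain bread also contain butter, so the rule bread -> butter would be reported as a fairly strong affinity.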

Classification

Classification is learning a function that maps (classifies) a data item into one of several predefined classes [Usam96a]. Classification determines whether an object belongs to a given class, chosen among a set of predefined classes, based on the values of some object attributes, i.e., based on a given classification function [Elis01]. It is often referred to as supervised learning, as the classes are determined prior to examining the data. Regression is a related task that maps a data item to a real-valued prediction variable. A particularly efficient method for producing a classifier from data is to generate a decision tree [Usam96a]. There are several classification methods, such as neural networks, genetic algorithms, and decision trees [Piet98]. Examples of classification applications include image and pattern recognition and medical diagnosis.

Clustering

Clustering in data mining [SADC93, CHY96] is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [JD88, CHY96]. The discovered clusters can be used to explain the characteristics of the underlying data distribution, and thus serve as the foundation for other data mining and analysis techniques. The applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the World Wide Web, grouping of genes and proteins that have similar functionality, grouping of spatial locations prone to earthquakes from seismological data [BR98, XEKS98], etc.

Existing clustering algorithms, such as K-means, PAM, and ROCK, are designed to find clusters that fit some static models. For example, K-means assumes that clusters are hyper-ellipsoidal (or globular) and are of similar sizes. Agglomerative hierarchical clustering algorithms, such as CURE and ROCK, use a static model to determine the most similar clusters to merge in the hierarchical clustering. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes.

Important Issues

The properties of clustering algorithms that are of concern in data mining include:

- Scalability to large datasets.
- Ability to work with high dimensional data.
- Ability to find clusters of irregular shape.
- Handling outliers.
- Time complexity.
- Data order dependency.
- Labeling or assignment (hard or strict vs. soft or fuzzy).
- Reliance on a priori knowledge and user defined parameters.
- Interpretability of results.

While these issues are kept in mind, realistically only a few of them are mentioned with each algorithm discussed. The above list is in no way exhaustive. For example, other relevant properties include the ability to work in a pre-defined memory buffer, the ability to restart, and the ability to provide an intermediate solution.

Classification of Clustering Algorithms

Categorization of clustering algorithms is neither straightforward nor canonical. In reality, the groups below overlap. For the reader's convenience, a classification is provided here; the corresponding terms are explained below.

Clustering Algorithms
- Hierarchical Methods
    Agglomerative Algorithms
    Divisive Algorithms
- Partitioning Methods
    Relocation Algorithms
    Probabilistic Clustering
    K-medoids Methods
    K-means Methods
- Density-Based Algorithms
    Density-Based Connectivity Clustering
    Density Functions Clustering
- Grid-Based Methods
- Methods Based on Co-Occurrence of Categorical Data
- Constraint-Based Clustering
- Clustering Algorithms Used in Machine Learning
    Gradient Descent and Artificial Neural Networks
    Evolutionary Methods
- Scalable Clustering Algorithms
- Algorithms For High Dimensional Data
    Subspace Clustering
    Projection Techniques
    Co-Clustering Techniques

These techniques are explained briefly, except for k-means, which is explained in detail because it is the technique selected to implement the mining SQL.

Partitional Techniques

Partitional clustering attempts to break a data set into K clusters such that the partition optimizes a given criterion. Centroid-based approaches, as typified by K-means and ISODATA, try to assign points to clusters such that the mean square distance of points to the centroid of the assigned cluster is minimized. Centroid-based techniques are suitable only for data in metric spaces (e.g., Euclidean space), in which it is possible to compute a centroid of a given set of points. Medoid-based methods, as typified by PAM (Partitioning Around Medoids) and CLARANS, work with similarity data, i.e., data in an arbitrary similarity space. These techniques try to find representative points (medoids) so as to minimize the sum of the distances of points from their closest medoid.

A major drawback of both of these schemes is that they fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster. This can happen in many natural clusters; for example, if there is a large variation in cluster sizes (as in Figure (2.2)(a)) or when cluster shapes are convex (as in Figure (2.2)(b)).
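The difference between the two kinds of representative can be seen on a handful of points. The sketch below is illustrative only: the data are made up, Euclidean distance is assumed for both representatives, and the function names `centroid` and `medoid` are hypothetical rather than taken from the thesis.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical 2-D data: four nearby points and one far-away point.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (10.0, 10.0)]


def centroid(points):
    """Coordinate-wise mean; generally not one of the input points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def medoid(points):
    """The input point with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


print("centroid:", centroid(points))
print("medoid:  ", medoid(points))
```

The centroid is pulled toward the far-away point and is not itself a data point, whereas the medoid is always one of the given points and only needs pairwise distances, which is why medoid-based methods also apply when no mean can be computed.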

[Figure (2.2): Data sets on which centroid and medoid approaches fail. (a) Clusters of widely different sizes. (b) Clusters with convex shapes.]

The k-means Technique

The k-means algorithm [Hartigan 75; Hartigan & Wong 79] is by far the most popular clustering tool used in scientific and industrial applications. The name comes from representing each of the k clusters C_j by the mean (or weighted average) c_j of its points, the so-called centroid. While this obviously does not work well with categorical attributes, it makes good geometric and statistical sense for numerical attributes. The sum of discrepancies between a point and its centroid, expressed through an appropriate distance, is used as the objective function. For example, the L2-norm based objective function, the sum of the squares of errors between the points and the corresponding centroids, is equal to the total intracluster variance

\[ E(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2 \]

Note that only means are estimated. A simple modification would normalize individual errors by cluster radii (cluster standard deviation), which makes a lot of sense when clusters have different dispersions. An objective function based on the L2-norm has many unique algebraic properties. For example, it coincides (up to a factor equal to the number of points in each cluster) with the pair-wise errors

\[ E'(C) = \frac{1}{2} \sum_{j=1}^{k} \sum_{x_i, y_i \in C_j} \| x_i - y_i \|^2 \]

and with the difference between the total data variance and the inter-cluster variance. Therefore, cluster separation is achieved simultaneously with cluster tightness.

Two versions of k-means iterative optimization are known. The first version consists of two-step major iterations that (1) reassign all the points to their nearest centroids, and (2) recompute the centroids of the newly assembled groups. Iterations continue until a stopping criterion is achieved (for example, no reassignments happen). This version is known as Forgy's algorithm [Forgy65] and has many advantages:

- It easily works with any Lp norm.
- It allows straightforward parallelization.
- It is insensitive with respect to data ordering.

The second (classic in iterative optimization) version of k-means reassigns points based on a more detailed analysis of the effect on the objective function caused by moving a point from its current cluster to a potentially new one. If a move has a positive effect, the point is relocated and the two centroids are recomputed. It is not clear that this version is computationally feasible, because the outlined analysis requires an inner loop over all member points of the involved clusters affected by the centroid shifts. However, in the L2 case it is known [Duda & Hart 73] that all computations can be algebraically reduced to simply computing a single distance!
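For squared Euclidean distance, this reduction can be checked numerically. The sketch below uses the standard closed form for the change in the sum of squared errors when a point x moves from a source cluster with n_s points and centroid c_s to a target cluster with n_t points and centroid c_t, namely n_t/(n_t+1) * ||x - c_t||^2 - n_s/(n_s-1) * ||x - c_s||^2. The formula is stated here from the general k-means literature, not quoted from the thesis, and the data and function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    """Sum of squared errors of a cluster around its own centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def move_delta(x, source, target):
    """Change in total SSE if x (a member of source, len(source) > 1)
    is moved to target, using only the two current centroids."""
    n_s, n_t = len(source), len(target)
    gain = n_t / (n_t + 1) * sq_dist(x, centroid(target))
    loss = n_s / (n_s - 1) * sq_dist(x, centroid(source))
    return gain - loss


# Hypothetical clusters: x sits in `source` but is close to `target`.
source = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
target = [(5.0, 0.0), (6.0, 0.0)]
x = (4.0, 0.0)

before = sse(source) + sse(target)
after = sse([p for p in source if p != x]) + sse(target + [x])
delta = move_delta(x, source, target)
print("brute-force change:", after - before)
print("closed-form change:", delta)
```

The closed form needs only the distance from x to each candidate centroid and the cluster sizes, so no inner loop over the member points of the affected clusters is required; a negative value means the move lowers the objective.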

Therefore, in this case both versions have the same computational complexity.

The wide popularity of the k-means algorithm is well deserved. It is simple, straightforward, and based on the firm foundation of analysis of variances. The k-means algorithm also suffers from all the usual suspects:

- The result strongly depends on the initial guess of centroids (or assignments).
- The computed local optimum can be far from the global one.
- It is not obvious what a good k is.
- The process is sensitive with respect to outliers.
- The algorithm lacks scalability.
- Only numerical attributes are covered.
- Resulting clusters can be unbalanced (in Forgy's version, even empty).

[Figure (2.3) K_MEANS Technique]
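The two-step iteration of Forgy's version described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PL/SQL implementation of the thesis: squared Euclidean distance is assumed, the initial centroids are supplied by the caller, an empty cluster simply keeps its previous centroid, and the function name `kmeans_forgy` and the sample data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans_forgy(points, initial_centroids, max_iter=100):
    """Forgy-style k-means: alternate reassignment and centroid update
    until no point changes cluster (or max_iter is reached)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Step 1: reassign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping criterion: no reassignments happened
        assignment = new_assignment
        # Step 2: recompute the centroid of each non-empty cluster.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return assignment, centroids


# Hypothetical 2-D data with two initial guesses for the centroids.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
assignment, centroids = kmeans_forgy(points, [(1.0, 1.0), (5.0, 7.0)])
print(assignment)
print(centroids)
```

Because the result depends on the initial guess, a practical run would typically repeat the procedure from several starting centroids and keep the partition with the smallest E(C).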

Hierarchical Techniques

Hierarchical clustering algorithms produce a nested sequence of clusters, with a single all-inclusive cluster at the top and single-point clusters at the bottom. Agglomerative hierarchical algorithms start with each data point as a separate cluster. Each step of the algorithm merges the two most similar clusters, so the total number of clusters decreases by one after each merge. These steps are repeated until the desired number of clusters is obtained or the distance between the two closest clusters exceeds a given threshold.

There are many variations of agglomerative hierarchical algorithms. They differ primarily in how they update the similarity between existing clusters and the merged clusters. In some methods, each cluster is represented by a centroid or medoid of the points it contains, and the similarity between two clusters is measured by the similarity between their centroids/medoids. Like partitional techniques such as K-means and K-medoids, these methods also fail on clusters of arbitrary shapes and different sizes.

In the single link method, each cluster is represented by all the data points in the cluster. The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. Unlike the centroid/medoid-based methods, this method can find clusters of arbitrary shape and different sizes. However, it is highly susceptible to noise, outliers, and artifacts.

Density-Based Partitioning

An open set in the Euclidean space can be divided into a set of its connected components. Implementing this idea for partitioning a finite set of points requires the concepts of density, connectivity, and boundary, which are closely related to a point's nearest neighbors. A cluster, defined as a connected dense component, grows in any direction that density leads. Therefore, density-based algorithms are capable of discovering clusters of arbitrary shapes. This also provides a natural protection against outliers. Figure (2.2), i.e., partitional technique, illustrates some cluster shapes that present a problem for partitioning relocation clustering (e.g., k-means) but are handled properly by density-based algorithms. These algorithms also have good scalability.

These outstanding properties are tempered by certain inconveniences. From a very general data description point of view, a single dense cluster consisting of two adjacent areas with significantly different densities (both higher than a threshold) is not very informative. Another drawback is a lack of interpretability. An excellent introduction to density-based methods is contained in the textbook [Han & Kamber 01].

Since density-based algorithms require a metric space, the natural setting for them is spatial data clustering. To make computations feasible, some index of the data is constructed (such as an R*-tree). This is a topic of active research. Classic indices were effective only with reasonably low-dimensional data. The algorithm DENCLUE, which is in fact a blend of density-based clustering and grid-based preprocessing, is less affected by data dimensionality.

There are two major approaches for density-based methods. The first approach pins density to a training data point and is reviewed in the sub-section Density-Based Connectivity. Representative algorithms include DBSCAN, GDBSCAN, OPTICS, and DBCLASD. The second approach pins density to a point in the attribute space and is explained in the sub-section Density Functions. It includes the algorithm DENCLUE.

Grid-Based Methods

In the previous section, the crucial concepts of density, connectivity, and boundary were used, which required elaborate definitions. Another way of dealing with them is to inherit the topology from the underlying attribute space. To limit the search combinations, multirectangular segments are considered. Recall that a segment (also cube, cell, region) is a direct Cartesian product of individual attribute sub-ranges (contiguous in the case of numerical attributes). Since some binning is usually adopted for numerical attributes, methods that partition space are frequently called grid-based methods. The elementary segment corresponding to single-bin or single-value sub-ranges is called a unit.

Overall, we shift our attention from data to space partitioning. Data partitioning is induced by points' membership in segments resulting from space partitioning, while space partitioning is based on grid characteristics accumulated from input data. One advantage of this indirect handling (data → grid-data → space partitioning → data partitioning) is that the accumulation of grid-data makes grid-based clustering techniques independent of data ordering. In contrast, relocation methods and all incremental algorithms are very sensitive to data ordering. While density-based partitioning methods work best with numerical attributes, grid-based methods work with attributes of different types.

Sequential patterns

The problem of discovering sequential patterns is to find inter-transaction patterns such that the presence of a set of items is followed by another item in the timestamp-ordered transaction set [Zhen01]. By analyzing this information, mining systems can determine temporal relationships among data items. An example of a sequential pattern could be that 37% of customers who buy the Windows Software and Office Software also buy Anti Virus Software within 30 days.
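A percentage such as the one in this example could be computed as follows. This is a minimal Python sketch under stated assumptions: the function name `followed_within`, the (customer, date, item) tuple layout and the sample transactions are all hypothetical, and the 30-day window is counted from the day the customer has bought every item in the first set.

```python
from datetime import date, timedelta

def followed_within(transactions, antecedent, consequent, window_days=30):
    """Fraction of customers who bought all antecedent items and then bought
    the consequent item within window_days of completing that set."""
    by_customer = {}
    for customer, day, item in transactions:
        by_customer.setdefault(customer, []).append((day, item))

    qualifying = matched = 0
    for purchases in by_customer.values():
        # Earliest purchase date of each antecedent item for this customer.
        first_bought = {}
        for day, item in sorted(purchases):
            if item in antecedent:
                first_bought.setdefault(item, day)
        if set(first_bought) != set(antecedent):
            continue  # customer never bought the whole antecedent set
        qualifying += 1
        start = max(first_bought.values())
        deadline = start + timedelta(days=window_days)
        if any(item == consequent and start <= day <= deadline
               for day, item in purchases):
            matched += 1
    return matched / qualifying if qualifying else 0.0

# Illustrative transactions: (customer, purchase date, item).
transactions = [
    ("c1", date(2004, 1, 1), "Windows"),
    ("c1", date(2004, 1, 5), "Office"),
    ("c1", date(2004, 1, 20), "AntiVirus"),   # within 30 days: counts
    ("c2", date(2004, 1, 1), "Windows"),
    ("c2", date(2004, 1, 2), "Office"),
    ("c2", date(2004, 3, 15), "AntiVirus"),   # too late: does not count
    ("c3", date(2004, 2, 1), "Windows"),
    ("c3", date(2004, 2, 3), "AntiVirus"),    # never bought Office
    ("c4", date(2004, 2, 1), "Windows"),
    ("c4", date(2004, 2, 1), "Office"),       # never bought AntiVirus
]
confidence = followed_within(transactions, {"Windows", "Office"}, "AntiVirus")
print(f"{confidence:.0%} of qualifying customers bought AntiVirus in time")
```

Three customers bought both antecedent items, and one of them bought the consequent item inside the window, so the reported fraction is one third.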

Similarity sequence discovery

Similar time sequence discovery finds all occurrences or similar occurrences of a sequence, or finds sequences similar to a given sequence, in a database of time-series data. A time series is a set of values of one variable over a period of time [Pete98]. For example, time-series analysis techniques can be used to predict the income of a given company during the next year, starting from the incomes of previous years and the current customer payment situation.

2.4 Integration Approaches

Most current data mining applications have a loose connection with databases. However, there are three different ways in which data mining systems use a relational DBMS: they may not use a database at all, be loosely coupled, or be tightly coupled [Suni98, Rake96b].

Data mining systems that do not use a relational DBMS provide their own memory and storage management. The disadvantage of this database-less approach is the lost opportunity to leverage the relational database technology developed over the last couple of decades.

Most current data mining applications that use a database have a loose connection with it. Some data mining systems use a DBMS only to store and retrieve the data. The majority treat the database simply as a container from which data is extracted to populate main-memory data structures before the main execution begins. Most database applications use loosely coupled SQL to fetch data records as needed by the mining algorithm. The front end of such an application is implemented in a host programming language with SQL statements embedded in it. The application uses a SQL SELECT statement to retrieve the set of records of interest from the database. A loop in the application program copies records in the result set one by one from the database address space to the application address space, where computation is performed on them.

The loosely coupled approach requires converting the data from the DBMS format to a format of the host language. It also limits the amount of data that can be handled, forcing applications to filter information and use only a part of it to discover patterns. This approach has two performance problems: i) copying of records from the database address space to the application address space, and ii) process context switching for each record retrieved, which is costly in a database system built on top of an operating system.

The tightly coupled approach exhibits more integration and cooperation between DBMSs and data mining applications. The designers selectively push the parts of the application program that perform computation on retrieved records into the database system, thus avoiding the performance degradation of mining systems.

Researchers anticipate accomplishing the mining task under the control of the DBMS. Extending the DBMS to cover the mining process avoids the drawbacks mentioned above. This can be achieved by finding a new generation of mining algorithms that can be implemented using the capabilities of DBMSs, by developing new utilities, and by carefully selecting among the available DBMSs to implement the mining systems.

The third approach to integrating DM and DBMSs places the mining process entirely under the control of the DBMS by completely embedding the DM tasks within it. This research extends SQL, the traditional tool of DBMSs, to process queries that retrieve large itemsets and clusters. It follows the third approach and strengthens it.
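The difference between fetching records into the application and pushing the computation into the database can be sketched as follows. This is an illustration only, written in Python with the standard `sqlite3` module standing in for a relational DBMS; the table `points`, its columns and both function names are assumptions. SQLite runs inside the application process, so the sketch shows the shape of the two designs rather than their relative cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (cluster_id INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (1, 0.0, 1.0), (2, 4.0, 0.0), (2, 4.0, 1.0)],
)

def centroids_loosely_coupled(conn):
    """Fetch every record and compute the cluster centroids in the application."""
    sums = {}
    for cluster_id, x, y in conn.execute("SELECT cluster_id, x, y FROM points"):
        sx, sy, n = sums.get(cluster_id, (0.0, 0.0, 0))
        sums[cluster_id] = (sx + x, sy + y, n + 1)  # one record at a time
    return {cid: (sx / n, sy / n) for cid, (sx, sy, n) in sums.items()}

def centroids_pushed_down(conn):
    """Let the database aggregate; only one row per cluster is returned."""
    rows = conn.execute(
        "SELECT cluster_id, AVG(x), AVG(y) FROM points GROUP BY cluster_id"
    )
    return {cid: (avg_x, avg_y) for cid, avg_x, avg_y in rows}

print(centroids_loosely_coupled(conn))
print(centroids_pushed_down(conn))
```

Both functions produce the same centroids. The first moves every record across the database boundary, while the second moves only one row per cluster, which is the kind of saving the tighter integration aims for.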

CHAPTER THREE
Some Features of Oracle's SQL

3.1 Introduction

Oracle is an extremely powerful and flexible relational database system. Along with this power and flexibility, however, comes complexity. In order to design useful applications based on Oracle, it is necessary to understand how Oracle manipulates the data stored within the system. PL/SQL is an important tool designed for data manipulation, both internally within Oracle and externally in our own applications. PL/SQL is available in a variety of environments, each of which has different advantages [Orac01].

3.2 ORACLE PL/SQL

PL/SQL is a sophisticated programming language used to access an Oracle database from various environments. PL/SQL is integrated with the database server, so PL/SQL code can be processed quickly and efficiently. It is also available in some client-side Oracle tools.

Oracle is a relational database. The language used to access a relational database is Structured Query Language (SQL). SQL is a flexible, efficient language with features designed to manipulate and examine relational data. For example, suppose we have a table called students; then the following SQL statement deletes all students who are majoring in nutrition from the database:

    DELETE FROM students
    WHERE major = 'nutrition';

SQL is a fourth-generation language. This means that the language describes what should be done, but not how to do it. In the DELETE statement just shown, for example, we do not know how the database will actually determine which students are majoring in nutrition. Presumably, the server will loop through all the students in some order to determine the proper entries to delete, but the details are hidden from us.

Third-generation languages, such as C and COBOL, are more procedural in nature. A program in a third-generation language (3GL) implements a step-by-step algorithm to solve the problem. For example, we could accomplish the DELETE operation with something like this:

    Loop over each student record
        If this record has major = 'nutrition' then
            Delete this record;
        End if;
    End loop;

Each language has advantages and disadvantages. Fourth-generation languages such as SQL are generally fairly simple (compared to third-generation languages) and have fewer commands. They also insulate the user from the underlying data structures and algorithms. In some cases, however, the procedural constructs available in 3GLs are useful to express a desired program. This is where PL/SQL comes in: it combines the power and flexibility of SQL (4GL) with the procedural constructs of a 3GL [Orac01].

PL/SQL stands for Procedural Language/SQL. As its name implies, PL/SQL extends SQL by adding constructs found in other procedural languages, such as:
- Variables and types (both predefined and user-defined).
- Control structures such as IF-THEN-ELSE statements and loops.
- Procedures and functions.
- Object types and methods (PL/SQL version 8 and higher).

For example, suppose we want to change the major for a student. If the student doesn't exist, then we want to create a new record. We do this with the following PL/SQL code [Scott97]:

    DECLARE
        /* Declare variables which will be used in SQL statements */
        v_newmajor  VARCHAR2(10) := 'History';
        v_firstname VARCHAR2(10) := 'Scott';
        v_lastname  VARCHAR2(10) := 'Urman';
    BEGIN
        /* Update the students table */
        UPDATE students
        SET major = v_newmajor
        WHERE first_name = v_firstname
        AND last_name = v_lastname;
        /* Check to see if the record was found. If not, then we need to insert this record */
        IF SQL%NOTFOUND THEN
            INSERT INTO students (first_name, last_name, major)
            VALUES (v_firstname, v_lastname, v_newmajor);
        END IF;
    END;

This example contains two different SQL statements (UPDATE and INSERT) as well as several variable declarations and the conditional IF statement.

PL/SQL Block Structure

The basic unit in PL/SQL is a block. All PL/SQL programs are made up of blocks, which can be nested within each other. Typically, each block performs a logical unit of work in the program, thus separating different tasks from each other. A block has the following structure [PSQL01]:

    DECLARE
        /* Declarative section: PL/SQL variables, types, cursors, and local subprograms */
    BEGIN
        /* Executable section: procedural and SQL statements. This is the main section of the block and the only one that is required */
    EXCEPTION
        /* Exception handling section: error-handling statements */
    END;

Only the executable section is required; the declarative and exception handling sections are optional. The executable section must also contain at least one executable statement. The different sections of the block separate different functions of a PL/SQL program.

Types of Oracle SQL Instructions

Oracle SQL deals with three types of instructions:
1. DDL: DATA DEFINITION LANGUAGE, such as (CREATE, DROP, ALTER).
2. DML: DATA MANIPULATION LANGUAGE, such as (SELECT, INSERT, UPDATE, DELETE).
3. DCL: DATA CONTROL LANGUAGE, which includes control statements such as (GRANT).

Dynamic SQL

Dynamic SQL enables us to write programs that reference SQL statements whose full text is not known until runtime. Before discussing dynamic SQL in detail, a clear definition of static SQL may provide a good starting point for understanding dynamic SQL [PSQL01].

Static SQL statements do not change from execution to execution. The full text of static SQL statements is known at compilation, which provides the following benefits:
- Successful compilation verifies that the SQL statements reference valid database objects.
- Successful compilation verifies that the necessary privileges are in place to access the database objects.
- Performance of static SQL is generally better than that of dynamic SQL.

Because of these advantages, we should use dynamic SQL only if we cannot use static SQL to accomplish our goals, or if using static SQL is cumbersome compared to dynamic SQL. However, static SQL has limitations that can be overcome with dynamic SQL. We may not always know the full text of the SQL statements that must be executed in a PL/SQL procedure. Our program may accept user input that defines the SQL statements to execute, or it may need to complete some processing work to determine the correct course of action. In such cases, we should use dynamic SQL.

For example, a reporting application in a data warehouse environment might not know the exact table name until runtime. These tables might be named according to the starting month and year of the quarter, for example INV_01_1997, INV_04_1997, INV_07_1997, INV_10_1997, INV_01_1998, and so on. We can use dynamic SQL in our reporting application to specify the table name at runtime.

We might also want to run a complex query with a user-selectable sort order. Instead of coding the query twice, with different ORDER BY clauses, we can construct the query dynamically to include a specified ORDER BY clause.
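Both situations, a table name and a sort order that are known only at runtime, can be sketched in a few lines. The sketch below is written in Python with the standard `sqlite3` module standing in for Oracle, so it is an illustration of the idea and not PL/SQL dynamic SQL itself. The columns `item` and `amount`, the function names and the sample rows are assumptions. Because the statement text is assembled from input, the sketch accepts only quarter-start months and a fixed set of sort columns.

```python
import sqlite3

ALLOWED_SORT_COLUMNS = {"item", "amount"}  # hypothetical report columns

def quarter_table(month, year):
    """Build a table name such as INV_04_1997 from a quarter's first month."""
    if month not in (1, 4, 7, 10):
        raise ValueError("month must be the first month of a quarter")
    return f"INV_{month:02d}_{int(year)}"

def build_report_query(month, year, sort_by):
    """Assemble the statement text at runtime; only whitelisted names are used."""
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"cannot sort by {sort_by!r}")
    return (f"SELECT item, amount FROM {quarter_table(month, year)} "
            f"ORDER BY {sort_by}")

# Illustrative data for one quarter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE INV_04_1997 (item TEXT, amount REAL)")
conn.executemany("INSERT INTO INV_04_1997 VALUES (?, ?)",
                 [("bolt", 7.5), ("anvil", 120.0)])

query = build_report_query(4, 1997, "amount")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

The same query text with `sort_by="item"` returns the rows in alphabetical order, so one function covers both reports instead of two hand-written statements.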

Dynamic SQL programs can handle changes in data definitions without the need to recompile. This makes dynamic SQL much more flexible than static SQL. Dynamic SQL lets us write reusable code because the SQL can be easily adapted for different environments. Dynamic SQL also lets us execute data definition language (DDL) statements, data manipulation language (DML) statements, and other SQL statements that are not supported in purely static SQL programs.

Prior to Oracle8i, PL/SQL developers could include dynamic SQL in applications by using the Oracle-supplied DBMS_SQL package. However, performing simple operations using DBMS_SQL involves a fair amount of coding. In addition, because DBMS_SQL is based on a procedural API, it incurs high procedure-call and data-copy overhead [PSQL01].

Steps to Process SQL Statements

All SQL statements have to go through various stages, although some stages may be skipped [Scott97, Orac01]:

a) Parsing
Every SQL statement must be parsed. Parsing includes checking the statement's syntax and validating the statement, ensuring that all references to objects are correct and that the relevant privileges on those objects exist.

b) Binding
After parsing, the Oracle server knows the meaning of the statement but still may not have enough information to execute it. The Oracle server may need values for any bind variables in the statement. The process of obtaining these values is called binding variables.
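The separation between the statement text and the values bound to it can be illustrated with a small sketch. It uses Python's standard `sqlite3` module in place of Oracle, so the parsing and binding it shows belong to SQLite and serve only as an analogy; the sample rows and the helper name `students_in` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (first_name TEXT, last_name TEXT, major TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Scott", "Urman", "History"),
     ("Ann", "Lee", "nutrition"),    # illustrative rows
     ("Bob", "Ray", "nutrition")],
)

# The statement text is fixed; :major is a bind variable with no value yet.
STATEMENT = ("SELECT first_name FROM students "
             "WHERE major = :major ORDER BY first_name")

def students_in(major):
    """Execute the fixed statement, supplying the bind value at call time."""
    return [row[0] for row in conn.execute(STATEMENT, {"major": major})]

print(students_in("nutrition"))
print(students_in("History"))
```

The statement text never changes between the two calls; only the value supplied for `:major` does. A bound value is treated as data, so a string containing quotes or SQL keywords is compared literally rather than interpreted as part of the statement.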


The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms Data Mining Techniques forcrm Data Mining The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets. Extremely large datasets Discovery of the non-obvious Useful knowledge

More information

DATA MINING AND WAREHOUSING CONCEPTS

DATA MINING AND WAREHOUSING CONCEPTS CHAPTER 1 DATA MINING AND WAREHOUSING CONCEPTS 1.1 INTRODUCTION The past couple of decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation

More information
