Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images
Małgorzata Charytanowicz, Jerzy Niewczas, Piotr Kulczycki, Piotr A. Kowalski, Szymon Łukasik, and Sławomir Żak

M. Charytanowicz and J. Niewczas
Institute of Mathematics and Computer Science, The John Paul II Catholic University of Lublin, Konstantynów 1 H, PL Lublin, Poland
{mchmat,jniewczas}@kul.lublin.pl

P. Kulczycki, P.A. Kowalski, S. Łukasik, and S. Żak
Systems Research Institute, Polish Academy of Sciences, Newelska 6, PL Warsaw, Poland
Department of Automatic Control and Information Technology, Cracow University of Technology, Warszawska 24, PL Cracow, Poland
{kulczycki,pakowal,slukasik,slzak}@ibspan.waw.pl

Abstract. Methods based on kernel density estimation have been successfully applied to various data mining tasks. Their natural interpretation, together with suitable properties, makes them an attractive tool for clustering problems, among others. In this paper, the Complete Gradient Clustering Algorithm has been used to investigate a real data set of wheat grains. Three wheat varieties, Kama, Rosa and Canadian, characterized by measurements of main grain geometric features obtained by an X-ray technique, have been analyzed. The proposed algorithm is expected to be an effective tool for recognizing wheat varieties. A comparison between the clustering results obtained with this method and with the classical k-means algorithm shows positive practical features of the Complete Gradient Clustering Algorithm.

1 Introduction

Clustering is a major technique for data mining, used mostly as an unsupervised learning method. The main aim of cluster analysis is to partition a given population into groups (clusters) with common characteristics, so that similar objects are grouped together while dissimilar objects belong to different clusters [4, 11]. As a result, a new set of categories of interest, characterizing the population, is discovered. Clustering methods are generally divided into groups such as hierarchical,
partitioning, density-based, grid-based, and soft-computing methods. These numerous concepts of clustering stem from the different techniques used to determine the similarity and dissimilarity between objects. The classical partitioning k-means algorithm concentrates on measuring and comparing the distances between objects. It is computationally attractive and easy to interpret and implement in comparison with other methods. On the other hand, the number of clusters must be assumed by the user in advance, and therefore the obtained groups may not reflect the nature of the data, which is usually unknown before processing. The rigidity of arbitrary assumptions concerning the number or shape of clusters can be overcome by density-based methods, which let the data reveal their inherent structure.

In the paper [9], the Complete Gradient Clustering Algorithm was introduced. The main idea of this algorithm is that each cluster is identified by a local maximum of the kernel density estimator of the data distribution. The procedure does not need any assumptions concerning the data and may be applied to a wide range of topics and areas of cluster analysis [3, 9, 10].

The main purpose of this work is to propose an effective technique for forming proper categories of wheat. In the earliest attempts to classify wheat grains, a geometry and a set of parameters were defined. The size, shape and colour of a grain, because of their heritable character, can be used for wheat variety recognition. Published studies have shown that digital image processing techniques, commonly combined with multivariate analysis, give reliable results in the classification process [13, 15, 17]. In this paper, the algorithm proposed in [9] is used to identify wheat varieties based on their main geometric features.

2 Complete Gradient Clustering Algorithm (CGCA)

In this section, the Complete Gradient Clustering Algorithm, for short the CGCA, is briefly described.
The principle of the proposed algorithm is based on the distribution of the data; the implementation of the CGCA requires estimating its density. Each cluster is characterized by a local maximum of the kernel density estimator. As a result, regions with a high density of objects are recognized as clusters, while areas with sparse distributions of objects separate one group from another. Data points are assigned to clusters by an ascending gradient method, i.e. points moving to the same local maximum are put into the same cluster. The algorithm works in an iterative manner until a termination criterion is satisfied.

2.1 Kernel Density Estimation

Suppose that $x_1, x_2, \ldots, x_m$ is a random sample of $m$ points in $n$-dimensional space from an unknown distribution with density $f$. Its kernel estimator can be defined as
$$\hat{f}(x) = \frac{1}{m h^n} \sum_{i=1}^{m} K\!\left(\frac{x - x_i}{h}\right), \qquad (1)$$

where the positive coefficient $h$ is called the smoothing parameter or bandwidth, while the measurable function $K : \mathbb{R}^n \to [0, \infty)$ of unit integral $\int_{\mathbb{R}^n} K(x)\,dx = 1$, unimodal and symmetric with respect to zero, takes the name of a kernel [5, 14]. It is generally accepted that the choice of the kernel $K$ is not as important as the choice of the coefficient $h$, and thanks to this it is possible to take into account primarily the properties of the estimator obtained. Most often the standard normal kernel

$$K(x) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{x^T x}{2}} \qquad (2)$$

is used. It is differentiable up to any order and assumes positive values in the whole domain.

The practical implementation of kernel density estimators requires a proper choice of the bandwidth $h$. In practice, the best value of $h$ is usually taken as the value that minimizes the mean integrated square error. A frequently used bandwidth selection method is based on least-squares cross-validation [5, 14]. The value of $h$ is chosen to minimize the function $M : (0, \infty) \to \mathbb{R}$ given by the rule

$$M(h) = \frac{1}{m^2 h^n} \sum_{i=1}^{m} \sum_{j=1}^{m} \tilde{K}\!\left(\frac{x_j - x_i}{h}\right) + \frac{2}{m h^n}\, K(0), \qquad (3)$$

where $\tilde{K}(x) = K^{*2}(x) - 2K(x)$ and $K^{*2}$ is the convolution square of the function $K$; for the standard normal kernel (2):

$$K^{*2}(x) = \frac{1}{(4\pi)^{n/2}}\, e^{-\frac{x^T x}{4}}. \qquad (4)$$

In this case the influence of the smoothing parameter on particular kernels is the same. This effect can be individualized through a modification of the smoothing parameter, which relies on introducing positive modifying parameters $s_1, s_2, \ldots, s_m$, mapped onto particular kernels, described by the formula

$$s_i = \left(\frac{\hat{f}(x_i)}{\bar{s}}\right)^{-c}, \qquad (5)$$

where $c \in [0, \infty)$, $\hat{f}$ is the kernel estimator in its basic form (1), and $\bar{s}$ denotes the geometric mean of the numbers $\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_m)$. The value of the parameter $c$ determines the intensity of the modification of the smoothing parameter.
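The estimator (1) with the normal kernel (2) and the cross-validation criterion (3) can be sketched as follows. This is an illustrative implementation, not the authors' code; the sample, the grid bounds, and the grid resolution are arbitrary choices made for the example.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal kernel (2): K(x) = (2*pi)^(-n/2) exp(-x^T x / 2)."""
    n = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2 * np.pi) ** (n / 2)

def kde(x, data, h):
    """Basic kernel density estimator (1) evaluated at a single point x."""
    m, n = data.shape
    return gaussian_kernel((x - data) / h).sum() / (m * h ** n)

def lscv_score(h, data):
    """Least-squares cross-validation criterion M(h) from (3), with the
    convolution square of the normal kernel given by (4)."""
    m, n = data.shape
    diff = data[:, None, :] - data[None, :, :]      # pairwise x_j - x_i
    sq = np.sum(diff * diff, axis=-1) / h ** 2      # squared scaled norms
    k2 = np.exp(-sq / 4) / (4 * np.pi) ** (n / 2)   # K*2((x_j - x_i)/h)
    k = np.exp(-sq / 2) / (2 * np.pi) ** (n / 2)    # K((x_j - x_i)/h)
    double_sum = (k2 - 2 * k).sum()                 # K~ = K*2 - 2K
    return (double_sum / (m ** 2 * h ** n)
            + 2 * gaussian_kernel(np.zeros(n)) / (m * h ** n))

rng = np.random.default_rng(0)
sample = rng.normal(size=(120, 2))                  # stand-in data set
grid = np.linspace(0.1, 1.5, 30)                    # coarse search grid for h
h_best = grid[np.argmin([lscv_score(h, sample) for h in grid])]
```

In a practical implementation, the coarse grid search over $h$ would be replaced by a one-dimensional numerical minimizer, and the pilot estimate from (1) would then feed the modifying parameters (5).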
Based on indications derived from the criterion of the mean integrated square error, the value $c = 0.5$ is proposed. Finally, the kernel estimator with modification of the smoothing parameter is defined as
$$\hat{f}(x) = \frac{1}{m h^n} \sum_{i=1}^{m} \frac{1}{s_i^n}\, K\!\left(\frac{x - x_i}{h s_i}\right). \qquad (6)$$

Additional procedures improving the quality of the estimator obtained, such as a linear transformation and support boundary, as well as the general aspects of the theory of statistical kernel estimators, can be found in [5, 6, 14]. Exemplary practical applications are presented in the publications [1, 3, 7, 8, 10].

2.2 Procedures of the CGCA

Consider the data set containing $m$ elements $x_1, x_2, \ldots, x_m$ in $n$-dimensional space. Using the methodology introduced in Subsect. 2.1, the kernel density estimator $\hat{f}$ may be constructed. The idea of the CGCA is based on the approach proposed by Fukunaga and Hostetler [2]. Thus, given the start points

$$x_j^0 = x_j \quad \text{for } j = 1, 2, \ldots, m, \qquad (7)$$

each point is moved in an uphill gradient direction using the iterative formula

$$x_j^{k+1} = x_j^k + b\, \frac{\nabla \hat{f}(x_j^k)}{\hat{f}(x_j^k)} \quad \text{for } j = 1, 2, \ldots, m \text{ and } k = 0, 1, \ldots, \qquad (8)$$

where $\nabla \hat{f}$ denotes the gradient of the kernel estimator $\hat{f}$, and the value of the parameter $b$ is proposed as $h^2/(n+2)$, where the coefficient $h$ is the bandwidth of $\hat{f}$. The algorithm is stopped when the following condition is fulfilled:

$$|D_k - D_{k-1}| \le \alpha D_0, \qquad (9)$$

where $D_0$, $D_{k-1}$ and $D_k$ denote the sums of Euclidean distances between particular elements of the set $x_1, x_2, \ldots, x_m$ before starting the algorithm, as well as after the $(k-1)$-th and $k$-th step, respectively. The positive parameter $\alpha$ is taken arbitrarily; a default value is recommended in [9]. This $k$-th step is the last one and will be denoted hereinafter by $k^*$. Finally, after the $k^*$-th step of the algorithm (7)-(8), the set

$$x_1^{k^*}, x_2^{k^*}, \ldots, x_m^{k^*}, \qquad (10)$$

considered as the new representation of all points $x_1, x_2, \ldots, x_m$, is obtained. Following this, the set of mutual Euclidean distances of the above elements,

$$\left\{ d(x_i^{k^*}, x_j^{k^*}) \right\}_{\substack{i = 1, 2, \ldots, m-1 \\ j = i+1, i+2, \ldots, m}}, \qquad (11)$$
is defined. Using the methodology presented in Subsect. 2.1, the auxiliary kernel estimator $\hat{f}_d$ of the elements of the set (11), treated as a sample of a one-dimensional random variable, is created under the assumption of nonnegative support. Next, the first local minimum of the function $\hat{f}_d$ (i.e. the one obtained for the smallest value of the argument) belonging to the interval $(0, D]$, where $D$ denotes the maximum value of the set (11), is found. This local minimum will be denoted $x_d$, and it can be interpreted as half the distance between potential closest clusters.

Finally, the clusters are created. First, an element of the set (10) is taken; it initially forms a one-element cluster. An element of the set (10) is added to the cluster if the distance between it and any element already belonging to the cluster is less than $x_d$. Every added element is removed from the set (10). When no more elements can be added to the cluster, a new cluster is created. The procedure of assigning elements to clusters is repeated until the set (10) is empty.

The procedures described above constitute the Complete Gradient Clustering Algorithm in its basic form. The values of the parameters used are calculated automatically, using optimization criteria. However, by an appropriate change in the values of these parameters, it is possible to influence the number of clusters, and also the proportion of their appearance in dense areas relative to sparse regions of the data. Namely, lowering (raising) the value of the smoothing parameter $h$ results in raising (lowering) the number of local maxima. A change in the value of this parameter of between -25% and +50% is recommended. Next, raising the intensity $c$ of the smoothing parameter modification results in decreasing the number of clusters in sparse areas of the data and increasing their number in dense regions.
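The gradient ascent (7)-(9) and the subsequent distance-threshold grouping can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopping parameter `alpha` and the fallback threshold `x_d` (a fixed fraction of $h$ instead of the first minimum of the auxiliary estimator $\hat{f}_d$) are illustrative choices.

```python
import numpy as np

def kde_grad(x, data, h):
    """Value and gradient of the Gaussian-kernel estimator (1) at point x."""
    m, n = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * (u * u).sum(axis=1)) / (2 * np.pi) ** (n / 2)
    f = k.sum() / (m * h ** n)
    g = -(u * k[:, None]).sum(axis=0) / (m * h ** (n + 1))
    return f, g

def total_distance(pts):
    """Sum of mutual Euclidean distances, as used in the stopping rule (9)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return d[np.triu_indices(len(pts), k=1)].sum()

def cgca_sketch(data, h, alpha=0.001, x_d=None, max_iter=100):
    """Gradient ascent (7)-(8) with stopping rule (9), then single-linkage
    grouping of the shifted points below an assumed threshold x_d."""
    m, n = data.shape
    b = h ** 2 / (n + 2)                      # step size proposed in the text
    pts = data.astype(float).copy()
    d0 = d_prev = total_distance(pts)
    for _ in range(max_iter):
        new_pts = np.empty_like(pts)
        for j in range(m):
            f, g = kde_grad(pts[j], data, h)  # estimator built on original data
            new_pts[j] = pts[j] + b * g / f   # iteration (8)
        pts = new_pts
        d_cur = total_distance(pts)
        if abs(d_cur - d_prev) <= alpha * d0: # stopping rule (9)
            break
        d_prev = d_cur
    if x_d is None:
        x_d = h / 2   # stand-in for the first minimum of the auxiliary estimator
    labels = np.full(m, -1, dtype=int)
    cluster = 0
    for i in range(m):                        # grow clusters below threshold x_d
        if labels[i] >= 0:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            near = np.where((labels < 0) &
                            (np.linalg.norm(pts - pts[j], axis=1) < x_d))[0]
            labels[near] = cluster
            stack.extend(near.tolist())
        cluster += 1
    return labels

# two well-separated blobs should come out as two clusters
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.15, (30, 2)),
                  rng.normal(5.0, 0.15, (30, 2))])
labels = cgca_sketch(data, h=0.5)
```

Note that the estimator $\hat{f}$ is built once from the original sample and kept fixed while the copies of the points ascend; only their positions are updated.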
Inverse effects occur when this parameter value is lowered. Values of the parameter $c$ between 0 and 1.5 are recommended. Finally, an increase of both parameters $c$ and $h$ can be proposed. Then the additional formula

$$h^* = \left(\frac{3}{2}\right)^{c - 0.5} h \qquad (12)$$

is used for calculating the smoothing parameter $h^*$, where the value of the parameter $h$ is calculated from the criterion of the mean integrated square error. The joint action of both these factors results in a twofold smoothing of the function $\hat{f}$ in the regions where the elements of the set $x_1, x_2, \ldots, x_m$ are sparse. Meanwhile, these factors more or less compensate for each other in dense areas, thereby having a small influence on the detection of clusters located there. Detailed information on the CGCA procedures and their influence on the clustering results is given in [9].

3 Materials and methods

The proposed algorithm has been applied to wheat variety recognition. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences
in Lublin. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was achieved using a soft X-ray technique. It is non-destructive and considerably cheaper than other, more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13×18 cm X-ray KODAK plates. Figure 1 presents the X-ray images of these kernels.

Fig. 1 X-ray photogram (13×18 cm) of kernels

The X-ray photograms were scanned using the Epson Perfection V700 table photo-scanner with a built-in transparency adapter, at 600 dpi resolution and 8-bit gray scale levels. Analysis of the obtained bitmap graphics files was based on the computer software package GRAINS, specially developed for X-ray diagnostics of wheat kernels [12, 16]. To construct the data set, seven geometric parameters of wheat kernels were measured from a total of 210 samples (see Fig. 2): area A, perimeter P, compactness C = 4πA/P², length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. All of these parameters are real-valued and continuous. In our investigations, the data was reduced to two dimensions by applying Principal Component Analysis [4], so that the results could be validated visually.

4 Results and discussion

The data's projection onto the axes of the two greatest principal components, with wheat varieties distinguished symbolically, is presented in Fig. 3. Samples were labeled by numbers: 1-70 for the Kama wheat variety, 71-140 for the Rosa wheat variety, and 141-210 for the Canadian wheat variety. To discuss the clustering
results obtained by the CGCA, mistakenly classified samples are displayed with their labels (see Fig. 3).

Fig. 2 Document window with geometric parameters of a kernel and statistical parameters of its image (millimeters used as the unit of measure)

Using the procedures described in Subsect. 2.2, which allow the elimination of clusters in sparse areas, the CGCA created three clusters corresponding to the Rosa, Kama, and Canadian varieties, containing 69, 65, and 76 elements, respectively. Thus the samples 9 and 38, which belong to the Kama wheat variety, are incorrectly grouped into the cluster associated with the Rosa wheat variety. Furthermore, the samples 125, 136 and 139, which belong to the Rosa wheat variety, and the samples 166, 200 and 202, which belong to the Canadian wheat variety, are mistakenly classified into the cluster associated with the Kama wheat variety. In addition, the samples 20, 27, 28, 30, 40, 60, 61, 64 and 70, which belong to the Kama wheat variety, are mistakenly classified into the cluster associated with the Canadian wheat variety.

It is worth noticing, however, that in the case of samples 9 and 38 the misclassification can be justified: both samples lie close to an area of high density of the Rosa wheat variety samples. The same applies to samples 125, 136, 139 and 166, 200, 202, which are placed close to samples of the Kama wheat variety. Similarly, the mistakenly classified samples 20, 27, 28, 30, 40, 60, 61, 64 and 70 lie very close to samples of the Canadian wheat variety. Thus, taking into consideration the characteristics of the wheat varieties, the CGCA seems to be an effective technique for wheat variety recognition. The clustering results, giving the numbers of samples classified properly and mistakenly into the clusters associated with the Rosa, Kama, and Canadian varieties, are shown in Table 1.
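The two-dimensional projection used for Fig. 3 is a standard PCA step; a minimal sketch follows. The feature matrix here is a random stand-in for the actual 210×7 grain-feature data, which is not reproduced in this paper.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the two greatest principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # eigh: ascending eigenvalues
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    return Xc @ vecs[:, order[:2]]

# hypothetical stand-in for the 210 x 7 matrix of grain features
rng = np.random.default_rng(2)
features = rng.normal(size=(210, 7))
projected = pca_2d(features)
```

The first projected coordinate carries at least as much variance as the second, which is what makes the two-dimensional scatter plot a faithful low-dimensional view for visual validation.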
Fig. 3 Wheat varieties data set on the axes of the two greatest principal components, with the Rosa, Kama, and Canadian wheat varieties marked by distinct symbols

Table 1 Clustering results for the wheat varieties data set

                    Number of elements in clusters
Clusters    Correctly classified   Incorrectly classified   Total
Rosa                 67                      2                 69
Kama                 59                      6                 65
Canadian             67                      9                 76

According to the results of the CGCA, out of 70 kernels of the Rosa wheat variety, 67 were classified properly. Only 2 kernels of the Kama variety were classified mistakenly as the Rosa variety. For the other two varieties, the CGCA created clusters containing 65 elements (the Kama variety) and 76 elements (the Canadian variety). With regard to the Kama variety, 59 kernels were classified correctly, while 6 kernels of the other varieties were incorrectly identified as the Kama variety. For the Canadian variety, 67 kernels were correctly identified, and 9 kernels of the Kama variety were mistakenly identified as the Canadian variety. The results for the Kama and Canadian varieties are not as satisfactory as for Rosa, which implies that these two varieties cannot be as clearly distinguished as the Rosa variety when using the main geometric parameters.
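The correctness percentages reported next follow directly from the per-cluster counts above (each true variety comprises 70 kernels); the arithmetic can be checked in a few lines:

```python
# (correctly classified, incorrectly classified) counts for each cluster,
# as reported in the text; each true variety has 70 kernels
clusters = {"Rosa": (67, 2), "Kama": (59, 6), "Canadian": (67, 9)}

for variety, (correct, incorrect) in clusters.items():
    total = correct + incorrect
    pct = 100.0 * correct / 70   # correctness relative to the 70 true kernels
    print(f"{variety}: cluster size {total}, correctness {pct:.0f}%")
```

Rounded to whole percent this gives 96%, 84% and 96%, matching the figures quoted for the Rosa, Kama and Canadian varieties.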
Table 2 Correctness percentages for the wheat varieties data set

Wheat variety   Correctness (%)
Rosa                  96
Kama                  84
Canadian              96

The correctness percentages of the CGCA are presented in Table 2. The proposed algorithm achieved an accuracy of about 96% for the Rosa wheat variety, 84% for the Kama wheat variety, and 96% for the Canadian wheat variety.

Comparable percentages of classification correctness were obtained when the k-means algorithm with an arbitrarily chosen cluster number of 3 was used. It is worth stressing, however, that this algorithm availed of the a priori assumed correct number of clusters, which in many applications may not be known, or such a theoretically correct number might not even exist at all. The CGCA, by contrast, does not require strict assumptions regarding the desired number of clusters, which allows the number obtained to be better suited to the real data structure. Moreover, in its basic form, the values of its parameters may be calculated automatically, although there exists the possibility of changing them optionally. A feature specific to the CGCA is the possibility of influencing the proportion between the number of clusters in areas where data elements are dense as opposed to sparse regions. In addition, through the detection of one-element clusters, the algorithm allows the identification of outliers, which enables their elimination or assignment to more numerous clusters, thus increasing the homogeneity of the data set.

5 Conclusions

The proposed clustering algorithm, based on kernel estimator methodology, is expected to be an effective technique for wheat variety recognition. It performs comparably to the classical k-means algorithm, yet requires no a priori information about the data. The data, reduced by applying Principal Component Analysis, contained apparent clustering structures according to their classes.
A total of 193 kernels, almost 92% of the whole set, were classified properly. The wheat varieties used in the study showed differences in their main geometric parameters. The Rosa variety is recognized best, whilst the Kama and Canadian varieties are less successfully differentiated. Further research is needed on grain geometric parameters and their ability to identify wheat kernels.
References

1. Charytanowicz M, Kulczycki P (2008) Nonparametric Regression for Analyzing Correlation between Medical Parameters. In: Pietka E, Kawa J (eds) Advances in Soft Computing - Information Technologies in Biomedicine. Springer-Verlag, Berlin Heidelberg
2. Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21
3. Kowalski P, Łukasik S, Charytanowicz M, Kulczycki P (2008) Data-Driven Fuzzy Modeling and Control with Kernel Density Based Clustering Technique. Polish Journal of Environmental Studies 17
4. Krzyśko M, Wołyński W, Górecki T, Skorzybut M (2008) Systemy uczące się [Learning Systems]. WNT, Warszawa
5. Kulczycki P (2005) Estymatory jądrowe w analizie systemowej [Kernel Estimators in Systems Analysis]. WNT, Warszawa
6. Kulczycki P (2007) Estymatory jądrowe w badaniach systemowych [Kernel Estimators in Systems Research]. In: Kulczycki P, Hryniewicz O, Kacprzyk J (eds) Techniki informacyjne w badaniach systemowych [Information Techniques in Systems Research]. WNT, Warszawa
7. Kulczycki P (2008) Kernel estimators in industrial applications. In: Prasad B (ed) Soft Computing Applications in Industry. Springer-Verlag, Berlin
8. Kulczycki P, Charytanowicz M (2005) Bayes Sharpening of Imprecise Information. International Journal of Applied Mathematics and Computer Science 15
9. Kulczycki P, Charytanowicz M (2010) A Complete Gradient Clustering Algorithm Formed with Kernel Estimators. International Journal of Applied Mathematics and Computer Science, in press
10. Kulczycki P, Daniel K (2009) Metoda wspomagania strategii marketingowej operatora telefonii komórkowej [A Method of Supporting the Marketing Strategy of a Mobile Telephony Operator]. Przegląd Statystyczny 56
11. Mirkin B (2005) Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, London
12. Niewczas J, Woźniak W (1999) Application of GRAINS program for characterisation of X-ray images of wheat grains at different moisture content. Xth Seminar Properties of Water in Foods. Warsaw Agricultural University, Department of Food Engineering
13. Niewczas J, Woźniak W, Guc A (1995) Attempt to application of image processing to evaluation of changes in internal structure of wheat grain. International Agrophysics 9
14. Silverman BW (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, London
15. Shouche SP, Rastogi R, Bhagwat SG, Sainis JK (2001) Shape analysis of grains of Indian wheat varieties. Computers and Electronics in Agriculture 33
16. Strumiłło A, Niewczas J, Szczypiński P, Makowski P, Woźniak W (1999) Computer system for analysis of X-ray images of wheat grains. International Agrophysics 13
17. Utku H, Koksel H, Kayhan S (1998) Classification of wheat grains by digital image analysis using statistical filters. Euphytica 100
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
NEW MEXICO Grade 6 MATHEMATICS STANDARDS
PROCESS STANDARDS To help New Mexico students achieve the Content Standards enumerated below, teachers are encouraged to base instruction on the following Process Standards: Problem Solving Build new mathematical
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
Digital Cadastral Maps in Land Information Systems
LIBER QUARTERLY, ISSN 1435-5205 LIBER 1999. All rights reserved K.G. Saur, Munich. Printed in Germany Digital Cadastral Maps in Land Information Systems by PIOTR CICHOCINSKI ABSTRACT This paper presents
Grid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
AP Physics 1 and 2 Lab Investigations
AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
A Review of Anomaly Detection Techniques in Network Intrusion Detection System
A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In
Cluster Analysis: Basic Concepts and Methods
10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all
There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
Cluster analysis Cosmin Lazar. COMO Lab VUB
Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,
Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool
Comparison and Analysis of Various Clustering Metho in Data mining On Education data set Using the weak tool Abstract:- Data mining is used to find the hidden information pattern and relationship between
A Study of Web Log Analysis Using Clustering Techniques
A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Segmentation of stock trading customers according to potential value
Expert Systems with Applications 27 (2004) 27 33 www.elsevier.com/locate/eswa Segmentation of stock trading customers according to potential value H.W. Shin a, *, S.Y. Sohn b a Samsung Economy Research
K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar
Unsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
Chapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
Support Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
An Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia [email protected] Tata Institute, Pune,
Quality Assessment in Spatial Clustering of Data Mining
Quality Assessment in Spatial Clustering of Data Mining Azimi, A. and M.R. Delavar Centre of Excellence in Geomatics Engineering and Disaster Management, Dept. of Surveying and Geomatics Engineering, Engineering
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Several Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
A comparison of various clustering methods and algorithms in data mining
Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
Introduction to Pattern Recognition
Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
An Iterative Image Registration Technique with an Application to Stereo Vision
An Iterative Image Registration Technique with an Application to Stereo Vision Bruce D. Lucas Takeo Kanade Computer Science Department Carnegie-Mellon University Pittsburgh, Pennsylvania 15213 Abstract
Lecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
How To Identify Noisy Variables In A Cluster
Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,
Clustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
E3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
Mathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Neural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm [email protected] Rome, 29
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Rule based Classification of BSE Stock Data with Data Mining
International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification
Knowledge Discovery in Stock Market Data
Knowledge Discovery in Stock Market Data Alfred Ultsch and Hermann Locarek-Junge Abstract This work presents the results of a Data Mining and Knowledge Discovery approach on data from the stock markets
Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification
29 Ninth IEEE International Conference on Data Mining Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification Fei Wang 1,JimengSun 2,TaoLi 1, Nikos Anerousis
Local classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
Clustering Via Decision Tree Construction
Clustering Via Decision Tree Construction Bing Liu 1, Yiyuan Xia 2, and Philip S. Yu 3 1 Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053.
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Determining optimal window size for texture feature extraction methods
IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec
This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
Introduction to time series analysis
Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples
