# Robust Outlier Detection Technique in Data Mining: A Univariate Approach

## Transcription

Singh Vijendra and Pathak Shivani

Faculty of Engineering and Technology, Mody Institute of Technology and Science, Lakshmangarh, Sikar, Rajasthan, India

### ABSTRACT

Outliers are points that differ from, or are inconsistent with, the rest of the data. They may represent novel, abnormal, unusual or noisy information, and they are sometimes more interesting than the majority of the data. As datasets grow in complexity, size and variety, the main challenges of outlier detection are how to catch similar outliers as a group and how to evaluate the outliers. This paper describes an approach that uses univariate outlier detection as a pre-processing step to detect outliers and then applies the K-means algorithm in order to analyse the effect of the outliers on the cluster analysis of the dataset.

Keywords: Outlier, Univariate outlier detection, K-means algorithm.

### 1. INTRODUCTION

Data mining, in general, deals with the discovery of hidden, non-trivial and interesting knowledge from different types of data. As information technologies develop, the number of databases, as well as their dimension and complexity, is growing rapidly. Hence there is a need for automated analysis of large amounts of information. The results of such analysis are then used by a human or a program to make decisions.

Outlier detection is one of the basic problems of data mining. An outlier is an observation that deviates from other observations so much that it arouses suspicion that it was generated by a different and unusual mechanism [6]. An inlier, on the other hand, is an observation that is explained by the underlying probability density function, which represents the probability distribution of the main part of the data [2]. Outliers may be erroneous or real.
Real outliers are observations whose actual values are very different from those observed for the rest of the data and which violate plausible relationships among variables. Erroneous outliers are observations that are distorted by misreporting errors in the data-collection process. Both types of outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods before data analysis is performed [7].

Outliers are sometimes found as a side-product of clustering algorithms. These techniques define outliers as points that do not lie in any of the clusters formed; they thus implicitly define outliers as the background noise in which the clusters are embedded. Another class of techniques defines outliers as points that are neither part of a cluster nor part of the background noise; rather, they are specifically points that behave very differently from the normal data [1].

The problem of detecting outliers has also been studied in the statistics community. In these approaches the user models the data points using a statistical distribution, and points are declared outliers depending on how they appear in relation to the postulated model. The main problem with these approaches is that in many situations the user does not have enough knowledge about the underlying data distribution [5].

Outliers can often be regarded as individuals or groups of clients exhibiting behaviour outside the range of what is considered normal. In regression modelling, outliers can be removed or treated separately to improve accuracy. Identifying them prior to modelling and analysis is therefore important [6], so that the analysis and its results are not distorted by them.
Regression modelling consists of finding the dependence of one random variable, or a group of variables, on another variable or group of variables. In the context of the outlier-based association method, outliers are observations markedly different from other points. When a group of points have some common characteristics, and these common characteristics are outliers, the points are associated with each other [4]. Statistical tests depend on the distribution of the data, on whether or not the distribution parameters are known, on the number of expected outliers, and on the types of expected outliers [3].

### 2. RELATED WORK

A survey of outlier detection methods was given by Hodge and Austin [8], focusing especially on those developed within the computer science community. Supervised outlier detection methods are suitable for data whose characteristics do not change over time; they require training data containing both normal and abnormal objects. There may be multiple normal and/or abnormal classes, and the classification problem is often highly imbalanced. In semi-supervised recognition methods, the normal class is taught, and data points that do not resemble normal data are considered outliers. Unsupervised methods process data with no prior knowledge.

There are four categories of unsupervised outlier detection algorithms. (1) In a clustering-based method, like DBSCAN (a density-based algorithm for discovering clusters in large spatial databases) [9], outliers are by-products of the clustering process and will not be in any resulting cluster. (2) The density-based method of [10] uses a Local Outlier Factor (LOF) to find outliers. If an object is isolated with respect to its surrounding neighbourhood, its outlier degree is high, and vice versa. (3) The distribution-based method [3] defines, for instance, outliers to be those points p such that at most 0.02% of points are within 0.13 σ of p. (4) Distance-based outliers are those objects that do not have enough neighbours [11][12]. The problem of finding outliers can be solved by answering a nearest-neighbour or range query centred at each object O.

Several mathematical methods can also be applied to outlier detection. Principal component analysis (PCA) can be used to detect outliers. PCA computes orthonormal vectors that provide a basis (scores) for the input data. The principal components are then sorted in order of decreasing significance or strength, and the size of the data can be reduced by eliminating the weaker components, that is, those with low variance [13]. The convex hull method finds outliers by peeling off the outer layers of convex hulls [14]. Data points on shallow layers are likely to be outliers.

### 3. UNIVARIATE OUTLIER ANALYSIS

Univariate outliers are cases that have an unusual value for a single variable. One way to identify univariate outliers is to convert all of the scores for a variable to standard scores (z-scores), which express how many standard deviations each value lies from the mean of the variable. If the sample size is small (80 or fewer cases), a case is an outlier if its standard score is ±2.5 or beyond. If the sample size is larger than 80 cases, a case is an outlier if its standard score is ±3.0 or beyond. This method applies to interval-level variables, and to ordinal-level variables that are treated as metric.

### 4. K-MEANS CLUSTERING

In data mining, cluster analysis can be done by various methods. One such method is the K-means algorithm, which partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean. K-means clustering operates on actual observations and creates a single level of clusters. It is often more suitable than hierarchical clustering for large amounts of data. Each observation is treated as an object having a location in space. The partition is formed so that objects within each cluster are as close to each other as possible and as far from objects in other clusters as possible. We can choose from five different distance measures, depending on the kind of data being clustered.

Each cluster in the partition is defined by its member objects and by its centroid, or centre: the point to which the sum of distances from all objects in that cluster is minimized. Cluster centroids are computed differently for each distance measure, so as to minimize the sum with respect to the specified measure. An iterative algorithm minimizes the sum of distances from each object to its cluster centroid, over all clusters. It moves objects between clusters until the sum cannot be decreased further, giving a set of clusters that are as compact and well separated as possible.

### 5. PROPOSED APPROACH

Having reviewed several different ways of detecting outliers, we propose a method that combines two approaches, statistical and clustering. First we apply univariate outlier detection in SPSS, and then the k-means algorithm to group the data into clusters.
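The combination described in this section (z-score screening of one variable, then k-means run with and without the flagged tuples) can be sketched in plain Python. This is only an illustrative sketch, not the authors' SPSS-based workflow: the function names, the toy data, the use of squared Euclidean distance and the default of 10 replicates are all assumptions made for the example.

```python
import math
import random
import statistics


def z_threshold(n):
    """Cut-off from Section 3: 2.5 for 80 or fewer cases, 3.0 otherwise."""
    return 2.5 if n <= 80 else 3.0


def univariate_outliers(values):
    """Indices of values whose absolute standard score reaches the cut-off."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        return []
    cut = z_threshold(len(values))
    return [i for i, v in enumerate(values) if abs((v - mean) / sd) >= cut]


def kmeans_sum(points, k, replicates=10, iters=100, seed=0):
    """Smallest sum of squared point-to-centroid distances over several
    random starts (the replicates guard against poor local minima)."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(replicates):
        centroids = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                clusters[j].append(p)
            # An empty cluster keeps its previous centroid.
            new = [tuple(map(statistics.fmean, zip(*c))) if c else centroids[i]
                   for i, c in enumerate(clusters)]
            if new == centroids:
                break
            centroids = new
        total = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
        best = min(best, total)
    return best


def screen_and_compare(rows, column, k):
    """Flag outliers in one column, cluster with and without them,
    and keep the cleaned data only if its sum is smaller."""
    flagged = set(univariate_outliers([r[column] for r in rows]))
    if not flagged:
        return rows, None, None
    cleaned = [r for i, r in enumerate(rows) if i not in flagged]
    s_with = kmeans_sum(rows, k)
    s_without = kmeans_sum(cleaned, k)
    return (cleaned if s_without < s_with else rows), s_with, s_without


# Hypothetical toy data: two tight groups plus one extreme tuple.
rows = [(1.0 + 0.1 * i, 1.0) for i in range(10)]
rows += [(5.0 + 0.1 * i, 5.0) for i in range(10)]
rows.append((30.0, 3.0))

kept, s_with, s_without = screen_and_compare(rows, column=0, k=2)
print(len(rows), len(kept), s_with, s_without)
```

The screening step looks at a single column, as in the univariate analysis of Section 3, while the clustering step uses all columns of each tuple. A different distance measure would change only the two `math.dist` expressions and the way centroids are recomputed.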
The steps of the algorithm are as follows.

Step 1: Choose a dataset on which outlier detection is to be performed.

Step 2: Apply univariate outlier detection in SPSS to pre-process the data before applying cluster analysis.

Step 3: If an outlier is detected according to the criterion of Section 3 (a standard score of ±2.5 or beyond when the sample has 80 or fewer cases, or ±3.0 or beyond when it has more than 80 cases), run the k-means clustering algorithm on the dataset with and without the tuple containing the outlier, using replicates in order to select proper centroids and so overcome the problem of local minima.

Step 4: Compare the results for the sum of point-to-centroid distances. The main goal of the k-means algorithm is to find clusters that minimize the sum of point-to-centroid distances.

Step 5: If Sum (without outlier) < Sum (with outlier), remove the tuple containing the outlier permanently from the dataset.

### 6. EXPERIMENTAL RESULTS

We used a hypothetical dataset made by introducing some outlier values into the well-known Fisher iris data, which has four features (sepal length, sepal width, petal length and petal width) and 150 tuples. After constructing the dataset we applied univariate outlier detection to it using SPSS, a widely used statistical tool. Results of the univariate outlier analysis are shown in Figure 1.

Figure 1. Plot between sepal length and petal length in the presence of outlier

The plot shows the distribution of points between two features of our data, sepal length and petal length; the point highlighted by the red circle is the outlier. The descriptive univariate analysis applied to the sepal length feature of the data with the outlier gives the results shown in Figure 2, whereas the analysis of the data without the outlier is shown in Figure 3.

Figure 2. Descriptive univariate analysis of the data (with outlier)

Figure 3. Descriptive univariate analysis of data (without outlier)

We can easily observe that in Figure 2 the maximum value is 10, i.e. much greater than the mean, whereas in Figure 3 it is nearer to the mean. Statistically, however, we examine the z-scores of sepal length calculated by the analysis; these scores must lie in the range -3 to +3 in our case, as there are more than 80 cases. We arrange the z-scores in ascending and descending order and search for any value less than -3 or greater than +3. Here we find a value greater than +3 when the z-scores are arranged in descending order, as shown in Figure 4.

Figure 4. Values of z-score, showing value > +3

Now we must delete the outlier entry, save both datasets (with and without the outlier entry), and run the k-means algorithm on each to perform the cluster analysis and calculate the sum of point-to-centroid distances in each case. We then compare the two values to check whether the presence of the outlier increases the sum; if it does, the outlier must be removed. Figure 5 shows the clustering analysis of the dataset with the outlier.

Figure 5. Cluster analysis with outlier

The sum of the distances of all points in the clusters to their centroids is shown in Figure 6.

Figure 6. Value of the sum

Now we calculate the sum for the clustering of the data without outlier entries. Figure 7 shows the clustering analysis of the dataset without the outlier.

Figure 7. Cluster analysis without outlier

The sum of the distances of all points in the clusters to their centroids, shown in Figure 8, is much smaller than the sum for the dataset with outlier entries. Hence we delete the outlier entry permanently from the dataset, because this entry is not useful and distorts our original dataset.

Figure 8. Value of the sum

### 7. CONCLUSION

The conclusion of this paper is that outliers are usually unwanted entries that affect the data in one form or another and distort its distribution. Sometimes it is necessary to keep outlier entries because they play an important role in the data, but in our case, since our main objective is to minimize the value of the sum by choosing optimum centroids, we delete the outlier entries.

### REFERENCES

[1] C. Aggarwal and P. Yu, Outlier Detection for High Dimensional Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Volume 30, Issue 2, pages 37-46, May

[2] Z. He, X. Xu and S. Deng, Discovering Cluster-based Local Outliers. Pattern Recognition Letters, Volume 24, Issue 9-10, June

[3] E. Knorr and R. Ng, A Unified Notion of Outliers: Properties and Computation. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, August

[4] S. Lin and D. Brown, An Outlier-based Data Association Method. In Proceedings of the SIAM International Conference on Data Mining, San Francisco, CA, May

[5] S. Ramaswamy, R. Rastogi and K. Shim, Efficient Algorithms for Mining Outliers from Large Data Sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Volume 29, Issue 2, May

[6] G. Williams, R. Baxter, H. He, S. Hawkins and L. Gu, A Comparative Study of RNN for Outlier Detection in Data Mining. In Proceedings of the 2nd IEEE International Conference on Data Mining, page 709, Maebashi City, Japan, December

[7] B. Ghosh-Dastidar and J.L. Schafer, Outlier Detection and Editing Procedures for Continuous Multivariate Data. ORP Working Papers, September 2003.

[8] Hodge, V. J. and Austin, J. (2004) A Survey of Outlier Detection Methodologies, Kluwer Academic Publishers.

[9] Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases, in Proc. Int'l. Conf. Knowledge Discovery and Data Mining (KDD'96).

[10] Breunig, M. M., Kriegel, H., Ng, R. T., and Sander, J. (2000) LOF: identifying density-based local outliers, in Proc. ACM SIGMOD Int'l. Conf. on Management of Data.

[11] Knorr, E. M. and Ng, R. T. (1998) Algorithms for mining distance-based outliers in large datasets, in Proc. of the 24th VLDB Conf.

[12] Knorr, E. M. and Ng, R. T. (1999) Finding intensional knowledge of distance-based outliers, in Proc. of the 25th VLDB Conf.

[13] Chapra, S. C. and Canale, R. P. (2006) Numerical Methods for Engineers, 5th edition, McGraw Hill.

[14] Johnson, T., Kwok, I. and Ng, R. (1998) Fast computation of 2-dimensional depth contours, in Proc. 4th Int'l. Conf. on Knowledge Discovery and Data Mining, New York, NY.


### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### A Comparative Study of RNN for Outlier Detection in Data Mining

A Comparative Study of RNN for Outlier Detection in Data Mining Graham Williams, Rohan Baxter, Hongxing He, Simon Hawkins and Lifang Gu FirstnameLastname@csiroau http://dataminingcsiroau Enterprise Data

### A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### Anomaly Detection using multidimensional reduction Principal Component Analysis

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 1, Ver. II (Jan. 2014), PP 86-90 Anomaly Detection using multidimensional reduction Principal Component

### Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

### An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: ogino@okinawa-ct.ac.jp

### A Survey on Outliers Detection in Distributed Data Mining for Big Data

J. Basic. Appl. Sci. Res., 5(2)31-38, 2015 2015, TextRoad Publication ISSN 2090-4304 Journal of Basic and Applied Scientific Research www.textroad.com A Survey on Outliers Detection in Distributed Data

### Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

### Open Access Research and Realization of the Extensible Data Cleaning Framework EDCF

Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 2039-2043 2039 Open Access Research and Realization of the Extensible Data Cleaning Framework

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### [EN-024]Clustering radar tracks to evaluate efficiency indicators

ENRI Int. Workshop on ATM/CNS. Tokyo, Japan (EIWAC2010) [EN-024]Clustering radar tracks to evaluate efficiency indicators + R. Winkler A. Temme C. Bösel R. Kruse German Aerospace Center (DLR) Braunschweig

### An Ameliorated Partitioning Clustering Algorithm for Large Data Sets

An Ameliorated Partitioning Clustering Algorithm for Large Data Sets Raghavi Chouhan 1, Abhishek Chauhan 2 MTech Scholar, CSE department, NRI Institute of Information Science and Technology, Bhopal, India

### Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

### Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

### Analysis of Various Techniques to Handling Missing Value in Dataset Rajnik L. Vaishnav a, Dr. K. M. Patel b a

Available online at www.ijiere.com International Journal of Innovative and Emerging Research in Engineering e-issn: 2394-3343 e-issn: 2394-5494 Analysis of Various Techniques to Handling Missing Value

### A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory

A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory Yen-Jen Oyang, Chien-Yu Chen, and Tsui-Wei Yang Department of Computer Science and Information Engineering National Taiwan

### Outlier detection in astronomical data

Outlier detection in astronomical data Yanxia Zhang *a, Ali Luo *a, and Yongheng Zhao *a *a National Astronomical Observatories, Chinese Academy of Sciences, China. ABSTRACT Astronomical data sets have

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images Małgorzata Charytanowicz, Jerzy Niewczas, Piotr A. Kowalski, Piotr Kulczycki, Szymon Łukasik, and Sławomir Żak Abstract Methods

### Public Transportation BigData Clustering

Public Transportation BigData Clustering Preliminary Communication Tomislav Galba J.J. Strossmayer University of Osijek Faculty of Electrical Engineering Cara Hadriana 10b, 31000 Osijek, Croatia tomislav.galba@etfos.hr

### Identification of noisy variables for nonmetric and symbolic data in cluster analysis

Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,

### Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

### Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

### Scalable Cluster Analysis of Spatial Events

International Workshop on Visual Analytics (2012) K. Matkovic and G. Santucci (Editors) Scalable Cluster Analysis of Spatial Events I. Peca 1, G. Fuchs 1, K. Vrotsou 1,2, N. Andrienko 1 & G. Andrienko

### Data Mining Process Using Clustering: A Survey

Data Mining Process Using Clustering: A Survey Mohamad Saraee Department of Electrical and Computer Engineering Isfahan University of Techno1ogy, Isfahan, 84156-83111 saraee@cc.iut.ac.ir Najmeh Ahmadian

### A comparison of various clustering methods and algorithms in data mining

Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

### Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

### Enhancing Quality of Data using Data Mining Method

JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2, ISSN 25-967 WWW.JOURNALOFCOMPUTING.ORG 9 Enhancing Quality of Data using Data Mining Method Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad

### Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

### Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia