Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification


2009 Ninth IEEE International Conference on Data Mining

Fei Wang 1, Jimeng Sun 2, Tao Li 1, Nikos Anerousis 2
1 School of Computing and Information Sciences, Florida International University, Miami, FL
2 IBM T. J. Watson Research Center, Service Department, Yorktown Heights, NY

Abstract — Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges exist in solving this classification problem: 1) ticket descriptions from different classes have highly diverse characteristics, which invalidates most standard distance metrics; 2) high-quality labeled data are very expensive to obtain. To address these challenges, we develop two seemingly independent methods, 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on a real IT service classification application is presented to demonstrate the effectiveness and efficiency of the proposed methods.

I. INTRODUCTION

In IT outsourcing environments, thousands of problem and change requests are generated every day on a diverse set of issues related to all kinds of software and hardware. Those requests need to be resolved correctly and quickly in order to meet service level agreements (SLAs). In this environment, when a device fails or an application needs to be upgraded, the person recovering the failure or applying the patch is likely to be sitting thousands of miles from the affected component. The reliance on outsourced technology support has fundamentally shifted the dependencies between participating organizations.

In such a complex environment, all service requests are handled and tracked through tickets by various problem & change management systems. A ticket is opened with a symptom description and routed to the appropriate Subject Matter Experts (SMEs) for resolution. The solution is documented when the ticket is closed. The job of SMEs is to resolve tickets quickly and correctly in order to meet SLAs. Another important job role is that of Quality Analysts (QAs), whose responsibility is to analyze recent tickets in order to identify opportunities for service-quality improvement. For example, frequent password resets on one system may be due to an inconsistent password period in that system; instead of resetting the password every time, the password period should be adjusted properly. Similarly, a patch or fix for a particular server should sometimes be applied to all servers with the same configuration, instead of creating multiple tickets with exactly the same work order on each of those systems. Identifying such optimization opportunities requires QAs to have a good understanding of the current ticket distribution. In other words, it is important to classify tickets based on their description and resolution accurately and in a timely manner.

In this paper, we address the ticket classification problem, which lies at the core of IT service quality improvement and poses significant practical impact and great challenges:

1) Standard distance metrics do not apply, due to the diverse characteristics of the raw features.
More specifically, the ticket descriptions and solutions are highly diverse and noisy: different SMEs can describe the same problem quite differently, the descriptions are typically short, and, depending on the type of problem, they can vary significantly.

2) There are almost no high-quality labeled data. Tickets are handled by SMEs, who often do not have the incentive or ability to classify the tickets accurately, due to heavy workload and incomplete information. On the other hand, QAs, who have the right ability and incentive, do not have the cycles to manually label all the tickets.

To address these two challenges, we propose a novel hybrid approach that leverages both active learning and metric learning. The contributions of this paper are the following:

- We propose Discriminative Neighborhood Metric Learning (DNML), which learns a domain-specific distance metric using the overall data distribution and a small set of labeled data.
- We propose Active Learning with Median Selection (ALMS), which progressively selects the representative data points that need to be labeled and is naturally a multi-class algorithm.
- We combine the metric learning and active learning steps into a unified process over the data. Moreover, our algorithm can automatically detect the number of classes contained in the data set.
- We demonstrate the effectiveness and efficiency of DNML and ALMS on several datasets in comparison with several existing methods.

The rest of the paper is organized as follows: Section II presents the methods for metric and active learning; Section III demonstrates a practical case study of the proposed methods on an IT ticket classification application; Section IV reviews related work; finally, Section V concludes.

II. METRIC+ACTIVE LEARNING

In this section we introduce our algorithm in detail, beginning with an overview.

A. Algorithm Overview

The basic procedure of our algorithm is to iterate the following steps:

- Learn a distance metric from the labeled data set, then classify the unlabeled data points using the nearest neighbor classifier with the learned metric. We call the data points in the same class a cluster.
- Select the median from each cluster. For each cluster X_i, we partition it into a labeled set X_i^L (whose labels are given initially) and an unlabeled set X_i^U (whose labels are predicted by the nearest neighbor classifier). The median of X_i is then defined as

    m_i = \arg\min_{x \in X_i^U} \|x - c_i\|^2    (1)

  where c_i is the mean of X_i.
- Add the selected points to the labeled set.

Fig. 1 shows a graphical view of the basic algorithm flowchart.

Figure 1. The basic algorithm flowchart, where the red blobs represent the labeled points and the gray blobs correspond to unlabeled points.

B. Discriminative Neighborhood Metric Learning

A good distance metric plays a central role in many data mining and machine learning algorithms (e.g., the nearest neighbor classifier and the k-means algorithm). The Euclidean distance usually cannot satisfy our requirements because of its homogeneous treatment of all the feature dimensions. Therefore, many researchers propose to learn a Mahalanobis distance, which measures the distance between data points x_i and x_j by

    d_m(x_i, x_j) = \sqrt{(x_i - x_j)^\top C (x_i - x_j)}    (2)

where C \in R^{d \times d} is a positive semi-definite covariance matrix used to incorporate the correlations of the different feature dimensions. In this paper, we consider learning a low-rank covariance matrix C such that, under the learned distance metric, the within-class compactness and the between-class scatterness are maximized. Different from linear discriminant analysis [1], which seeks a discriminant subspace in a global sense, our algorithm aims to learn a distance metric with enhanced local discriminability. To define such local discriminability, we first introduce two types of neighborhoods [8]:

Definition 1: Homogeneous Neighborhood. The homogeneous neighborhood of x_i, denoted N_i^o, is the set of the |N_i^o| nearest data points of x_i with the same label, where |N_i^o| is the size of N_i^o.

Definition 2: Heterogeneous Neighborhood. The heterogeneous neighborhood of x_i, denoted N_i^e, is the set of the |N_i^e| nearest data points of x_i with different labels, where |N_i^e| is the size of N_i^e.

Based on the above two definitions, we define the local compactness of point x_i as

    C_i = \sum_{j: x_j \in N_i^o} d_m^2(x_i, x_j)    (3)

and the local scatterness of point x_i as

    S_i = \sum_{k: x_k \in N_i^e} d_m^2(x_i, x_k)    (4)

The local discriminability of the data set X with respect to the distance metric d_m can then be defined as

    J = \frac{\sum_i C_i}{\sum_i S_i} = \frac{\sum_i \sum_{j: x_j \in N_i^o} (x_i - x_j)^\top C (x_i - x_j)}{\sum_i \sum_{k: x_k \in N_i^e} (x_i - x_k)^\top C (x_i - x_k)}    (5)

The goal of our algorithm is to minimize J, which is equivalent to simultaneously minimizing the local compactness and maximizing the local scatterness. Fig. 2 provides an intuitive graphical illustration of the idea behind our algorithm.
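To make these definitions concrete, here is a minimal Python/NumPy sketch (our own illustration, not code from the paper) that builds the neighborhoods of Definitions 1 and 2 and evaluates J of Eq. (5) for a given matrix C; nearness is measured under the Euclidean distance, which is a simplifying assumption of this sketch:

```python
import numpy as np

def neighborhoods(X, y, k_o=3, k_e=3):
    """Definitions 1 and 2: for every x_i, find the k_o nearest points
    with the same label (N_i^o) and the k_e nearest points with a
    different label (N_i^e)."""
    homo, hetero = [], []
    for i in range(len(X)):
        d2 = ((X - X[i]) ** 2).sum(axis=1)
        d2[i] = np.inf                       # exclude x_i itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        homo.append(same[np.argsort(d2[same])[:k_o]])
        hetero.append(diff[np.argsort(d2[diff])[:k_e]])
    return homo, hetero

def local_discriminability(X, y, C, k_o=3, k_e=3):
    """J of Eq. (5): total local compactness over total local scatterness,
    both measured under the Mahalanobis metric induced by C."""
    homo, hetero = neighborhoods(X, y, k_o, k_e)
    compact = sum((X[i] - X[j]) @ C @ (X[i] - X[j])
                  for i in range(len(X)) for j in homo[i])    # Eq. (3)
    scatter = sum((X[i] - X[k]) @ C @ (X[i] - X[k])
                  for i in range(len(X)) for k in hetero[i])  # Eq. (4)
    return compact / scatter
```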
However, minimizing J in Eq. (5) to obtain an optimal C is not an easy task, as there are d(d+1)/2 variables to solve for, given that C is symmetric. Recall that we require C to be low-rank and positive semi-definite; then, by incomplete Cholesky factorization, we can decompose C as

    C = W W^\top    (6)

where W \in R^{d \times r} and r is the rank of C. In this way, we only need to solve for W instead of the entire C.
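Under the factorization of Eq. (6), the learned metric is simply a Euclidean distance after projecting the data by W, since (x_i - x_j)^\top W W^\top (x_i - x_j) = \|W^\top (x_i - x_j)\|^2. A one-function sketch (our own illustration, assuming NumPy):

```python
import numpy as np

def mahalanobis_lowrank(x_i, x_j, W):
    """d_m(x_i, x_j) of Eq. (2) with C = W W^T as in Eq. (6)."""
    return np.linalg.norm(W.T @ (x_i - x_j))   # Euclidean norm in R^r

# In practice one projects the whole data set once (Z = X @ W) and runs
# ordinary Euclidean nearest neighbor search in the r-dimensional space.
```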

Figure 2. The homogeneous and heterogeneous neighborhoods of x_i under (a) the regular Euclidean distance and (b) the learned Mahalanobis distance metric. The dark green blob in the middle of the circle is x_i, and the green and blue blobs correspond to the points in N_i^o and N_i^e. The goal of DNML is to learn a distance metric that pulls the points in N_i^o towards x_i while pushing the points in N_i^e away from x_i.

Combining Eq. (6) and Eq. (5), we can derive the following optimization problem

    \min_W \frac{tr(W^\top M_C W)}{tr(W^\top M_S W)}    (7)

where

    M_C = \sum_i \sum_{j: x_j \in N_i^o} (x_i - x_j)(x_i - x_j)^\top    (8)

    M_S = \sum_i \sum_{k: x_k \in N_i^e} (x_i - x_k)(x_i - x_k)^\top    (9)

are the compactness and scatterness matrices, respectively. Problem (7) is therefore a trace-ratio minimization problem, for which we can make use of the decomposed Newton's method [3], summarized in Table I.

Table I
DECOMPOSED NEWTON'S PROCEDURE FOR SOLVING THE TRACE RATIO PROBLEM

Input: matrices M_C and M_S, precision ε, dimension d
Output: trace ratio value λ* and matrix W
Procedure:
1. Initialize λ_0 = 0, t = 0.
2. Perform an eigenvalue decomposition of M_C − λ_t M_S.
3. Let (β_k(λ), w_k(λ)) be the k-th eigenvalue/eigenvector pair obtained in step 2, and define the first-order Taylor expansion β̂_k(λ) = β_k(λ_t) + β'_k(λ_t)(λ − λ_t), where β'_k(λ_t) = −w_k(λ_t)^⊤ M_S w_k(λ_t).
4. Define f̂_d(λ) to be the sum of the d smallest β̂_k(λ); solve f̂_d(λ) = 0 and set the root to be λ_{t+1}.
5. If |λ_{t+1} − λ_t| < ε, go to step 6; otherwise set t = t + 1 and go to step 2.
6. Output λ* = λ_t, and W as the eigenvectors corresponding to the d smallest eigenvalues of M_C − λ* M_S.
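A compact sketch of the Table I iteration, together with the construction of M_C and M_S from Eqs. (8)-(9); this is our own NumPy rendering (it reuses `neighborhoods` from the earlier sketch), and since the linearized f̂_d(λ) is linear in λ, its root is the standard Newton step:

```python
import numpy as np

def scatter_matrices(X, y, k_o=3, k_e=3):
    """M_C and M_S of Eqs. (8)-(9), built from the labeled data."""
    homo, hetero = neighborhoods(X, y, k_o, k_e)  # from the earlier sketch
    n, d = X.shape
    M_C, M_S = np.zeros((d, d)), np.zeros((d, d))
    for i in range(n):
        for j in homo[i]:
            v = X[i] - X[j]; M_C += np.outer(v, v)
        for k in hetero[i]:
            v = X[i] - X[k]; M_S += np.outer(v, v)
    return M_C, M_S

def trace_ratio_newton(M_C, M_S, d, eps=1e-8, max_iter=100):
    """Table I: decomposed Newton iteration for the trace ratio problem
    min_W tr(W^T M_C W) / tr(W^T M_S W).  Returns (lambda*, W)."""
    lam = 0.0
    for _ in range(max_iter):
        beta, V = np.linalg.eigh(M_C - lam * M_S)   # ascending eigenvalues
        Wd = V[:, :d]                               # d smallest eigenpairs
        f = beta[:d].sum()                          # f_d(lam)
        fprime = -np.trace(Wd.T @ M_S @ Wd)         # sum of -w_k^T M_S w_k
        lam_next = lam - f / fprime                 # root of the linearization
        if abs(lam_next - lam) < eps:
            lam = lam_next
            break
        lam = lam_next
    _, V = np.linalg.eigh(M_C - lam * M_S)
    return lam, V[:, :d]
```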
C. Active Learning with Median Selection

Recently, a novel active learning method called Transductive Experimental Design (TED) [11] was proposed, which aims to select the k most representative points in a data set. Despite the theoretical soundness and empirical success of TED, it still has some limitations:

- Although the name suggests that TED is transductive, it does not make use of any label information contained in the data set. In fact, TED just uses the whole data set (labeled and unlabeled alike) to select the k most representative points, such that the linear reconstruction loss of the whole data set from the selected points is minimized. In this sense, TED is an unsupervised method.
- As the authors analyze in [11], TED tends to select data points with large norms (which, they argue, are hard to predict). However, such points lie on the border of the data distribution, and they could be outliers that would mislead the classification process.

Based on the above analysis, we propose to (1) make use of the label information and (2) select the representative points locally. Specifically, we first learn a distance metric using the DNML method introduced in the last section and then apply the nearest neighbor classifier to classify the unlabeled points. In this way, the whole data set is partitioned into several classes, and for each class we simply select the median point as defined in Eq. (1).

Figure 3. Active learning results on a toy data set with four classes. (a) shows the points selected by transductive experimental design; (b) shows the points selected by our local median selection method.

Fig. 3 illustrates a toy example of the difference between TED and our local median selection method. The data set is generated from 4 Gaussians, each containing 100 data points, and we treat each Gaussian as a class. Initially, we randomly label 10% of the data points and use TED to select the 4 most representative data points, shown as black triangles in Fig. 3(a); these points all lie on the borders of the Gaussians. Fig. 3(b) shows the result of our local median selection method: we first apply DNML to learn a proper distance metric from the labeled points, then use this metric to classify the whole data set, and finally select one median from each class. From the figure we observe that the selected points are representative of each Gaussian.

It is worth mentioning that our algorithm can in fact be viewed as an approximated version of a local TED: we first partition the data set into several local regions using the learned distance metric, and then select exactly ONE representative point in each region.
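A small sketch of the per-cluster median selection of Eq. (1), in our own NumPy rendering (function and argument names are illustrative):

```python
import numpy as np

def select_medians(X, y, is_labeled):
    """One ALMS selection round: for every cluster (class), return the
    index of the unlabeled point closest to the cluster mean -- the
    median m_i of Eq. (1).

    X          : (n, d) data matrix (or its projection X @ W)
    y          : (n,) class of every point, given or predicted by 1-NN
    is_labeled : (n,) boolean mask of the initially labeled points
    """
    medians = []
    for c in np.unique(y):
        cluster = np.where(y == c)[0]               # X_i = X_i^L ∪ X_i^U
        candidates = cluster[~is_labeled[cluster]]  # unlabeled part X_i^U
        if candidates.size == 0:
            continue                                # nothing left to query
        c_mean = X[cluster].mean(axis=0)            # cluster mean c_i
        d2 = ((X[candidates] - c_mean) ** 2).sum(axis=1)
        medians.append(candidates[np.argmin(d2)])   # Eq. (1)
    return medians
```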

As the data mean is the most representative point for a set of data in the sense of Euclidean loss, we select from the candidate set the median, i.e., the point closest to the data mean. The whole algorithm procedure is summarized in Table II.

Table II
THE METRIC+ACTIVE LEARNING ALGORITHM

Inputs: training data, |N_i^o|, |N_i^e|, precision ε, dimension d, number of iteration steps T
Outputs: the selected points and the learned W
Procedure:
for t = 1 : T
  1. Construct M_S and M_C from the training data.
  2. Learn a proper distance metric.
  3. Count the number of classes k in the training data, and apply the learned metric to classify the unlabeled data using the nearest neighbor classifier.
  4. Select the median in each class and add the medians to the training data pool.
end
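Putting the pieces together, a sketch of the Table II loop; it reuses `scatter_matrices`, `trace_ratio_newton`, and `select_medians` from the earlier sketches, and the 1-NN step and the default parameter values are our own illustrative choices:

```python
import numpy as np

def nn_predict(Z, y, is_labeled):
    """Classify the unlabeled points by 1-NN among the labeled ones."""
    y = y.copy()
    labeled = np.where(is_labeled)[0]
    for i in np.where(~is_labeled)[0]:
        d2 = ((Z[labeled] - Z[i]) ** 2).sum(axis=1)
        y[i] = y[labeled[np.argmin(d2)]]
    return y

def metric_active_learning(X, y, is_labeled, T=10, r=4, k_o=3, k_e=3):
    """The Table II loop.  y holds the given labels where is_labeled is
    True; the remaining entries are overwritten each round."""
    W = None
    for _ in range(T):
        labeled = np.where(is_labeled)[0]
        M_C, M_S = scatter_matrices(X[labeled], y[labeled], k_o, k_e)  # step 1
        _, W = trace_ratio_newton(M_C, M_S, r)       # step 2: DNML metric
        Z = X @ W                                    # project into R^r
        y = nn_predict(Z, y, is_labeled)             # step 3: 1-NN classify
        for m in select_medians(Z, y, is_labeled):   # step 4: query medians
            is_labeled[m] = True  # in practice y[m] is then set by the QA
    return W, y, is_labeled
```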
III. TICKET CLASSIFICATION: A CASE STUDY

In this section we present detailed experimental results from applying our proposed active learning scheme to ticket classification. First we describe the basic characteristics of the data set.

A. The Data Set

There are 4182 tickets in total, from 27 classes. We use bag-of-words features, which results in a 3882-dimensional feature space. After eliminating duplicate and null tickets, 2222 tickets remain. The class distribution is shown in Fig. 4(a), from which we can observe that the classes are highly imbalanced and there are many rare classes with only a few data points. We identify a class as a rare class if and only if the number of data points contained in it is less than 20. In our experiments, we eliminate those rare classes, which results in a data set of size 2161 from 5 classes; the resulting class distribution is shown in Fig. 4(b).

Besides rare classes, we also observe that the data set is highly sparse and contains a set of rare features. The original feature distribution is shown in Fig. 5(a), where we accumulate the number of times each feature appears in each class. We identify a feature as a rare feature if and only if its total number of appearances in the data set is less than 10. After eliminating those rare features, we obtain a data set with 669 features, whose distribution is shown in Fig. 5(b). Finally, we also eliminate the data points containing only rare features, which leaves a final data set of 2130 tickets with 669 features.

Figure 4. Class distribution of (a) the original data and (b) the data after eliminating the rare classes.

Figure 5. Feature distribution of (a) the original data and (b) the data with the rare features eliminated.

B. Distance Metric Learning

In this part of the experiments, we first test the effectiveness of our DNML algorithm on the ticket data set: we use our algorithm to obtain a distance metric and then use that metric to perform nearest neighbor classification, yielding the final classification accuracy. This procedure is repeated over multiple independent runs, and we report the average classification accuracies and standard deviations in Fig. 6. The sizes of the homogeneous and heterogeneous neighborhoods are set to 3 manually, and the rank of the covariance matrix C is set to 4. From the figure we observe the superiority of our metric learning method. Specifically, with the learned metric, our DNML method clearly outperforms the original NN method, which validates that DNML can learn a better distance metric.

There are two sets of parameters in our DNML method: the rank r of the covariance matrix C, and the sizes of the homogeneous neighborhood N^o and the heterogeneous neighborhood N^e (denoted n_o and n_e). We therefore also conducted a set of experiments to test the sensitivity of DNML with respect to these parameters. Fig. 7 shows how the algorithm performance varies with respect to the rank of the covariance matrix C, where we randomly label half of the tickets as the training set and use the remaining tickets for testing. We set the sizes of N^o and N^e to 3. The results in Fig. 7 are summarized over multiple independent runs. From the figure we can see that

the final ticket classification results are stable with respect to the choice of the rank of the covariance matrix, except when the rank is too small (i.e., 1 in our case), since in that case too much information is lost. When the rank becomes too large, some noise contained in the data set may be retained, so the performance of our algorithm drops slightly; choices of r ∈ [2, 4] are all reasonable.

Figure 6. Classification accuracy comparison of DNML with different supervised learning methods (NN, NB, RLS, and SVM). The x-axis represents the percentage of randomly labeled tickets, and the y-axis denotes the averaged classification accuracy.

Figure 7. The sensitivity of the performance of our algorithm with respect to the rank r of the covariance matrix C. We set |N_i^o| = |N_i^e| = 3, and half of the data set is labeled as training data.

We also test the sensitivity of our algorithm with respect to the sizes of N_i^o and N_i^e; the results are shown in Fig. 8, where the x-axis and y-axis correspond to the sizes of N_i^o and N_i^e, and the z-axis denotes the classification accuracy averaged over multiple independent runs. Here we assume that the sizes of the homogeneous and heterogeneous neighborhoods are the same for all data points. For each run, we randomly label 50% of the tickets as training data and use the rest as testing data. From Fig. 8 we can clearly see that the whole surface of z = f(x, y) is flat, which means that the performance of our algorithm is not very sensitive to the variation of N_i^o and N_i^e. We can also see that when the neighborhood sizes are small, the algorithm performance is better than when they are large. This is possibly because the distribution of the data set is complicated and the data in different classes are highly overlapped; when we enlarge the neighborhoods to include more data points, the learned distance metric may be corrupted by noisy points, which makes the final classification results inaccurate.

Figure 8. The sensitivity of the performance of our algorithm with respect to the choices of N_i^o and N_i^e; half of the data set is labeled as training data.

C. Integrated Active Learning and Distance Metric Learning

In our implementation, we initially label 2% of the data set and then apply the various active learning methods. Since there are 5 classes in total in the ticket data set, for each method we select 5 points from the unlabeled set in each round. For all the approaches that use DNML, we set |N^o| = |N^e| = 3, and the rank of the covariance matrix is set to 4. Fig. 9 illustrates the results of these algorithms summarized over multiple independent runs, where the x-axis represents the percentage of selected points and the y-axis denotes the averaged classification accuracy together with the standard deviation. From the figure we can clearly see that with our DNML+LMED method the classification accuracy ascends faster than with the other methods.

Figure 9. Classification accuracy vs. the percentage of actively labeled tickets for DNML+LMED, LMED, DNML+Rand, and DNML+TED; the y-axis shows the classification accuracy averaged over multiple independent runs.

IV. RELATED WORK

In this section we briefly review previous work that is closely related to our metric+active learning method.

A. Distance Metric Learning

Distance metric learning plays a central role in real-world applications. According to [10], these approaches can mainly be categorized into two classes: unsupervised and supervised.
Here we mainly review the supervised methods, which learn a distance metric from the data set using some supervised information. Usually this information takes the

form of pairwise constraints indicating whether a pair of data points belongs to the same class (usually referred to as must-link constraints) or to different classes (cannot-link constraints). These algorithms aim to learn a proper distance metric under which the data with must-link constraints are as compact as possible, while the data with cannot-link constraints are far apart from each other. Some typical approaches include the side-information method [9], Relevant Component Analysis (RCA) [5], and Discriminative Component Analysis (DCA) [2]. Our Discriminative Neighborhood Metric Learning (DNML) method can also be viewed as a supervised method; however, we make use of the labeled data together with their labels, which is different from using pairwise constraints.

B. Active Learning

In many real-world problems, unlabeled data are abundant but labeled data are expensive to obtain (e.g., in text classification it is expensive and time-consuming to ask users to label documents manually, whereas it is quite easy to obtain a large number of unlabeled documents by crawling the web). In such a scenario, the learning algorithm can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. Two classical active learning algorithms are Tong and Koller's simple SVM algorithm [6] and Seung et al.'s Query By Committee (QBC) algorithm [4]. However, the simple SVM algorithm is coupled with the Support Vector Machine (SVM) classifier [7] and is only applicable to two-class problems. For the QBC algorithm, one needs to construct a committee of models that represent different regions of the version space and define some measure of disagreement among the committee members, which is usually difficult in real-world applications. Recently, Yu et al. [11] proposed another active learning algorithm called Transductive Experimental Design (TED), which aims to find the most representative points that can optimally reconstruct the whole data set in the sense of Euclidean reconstruction loss. Our median selection strategy is similar to TED, and we analyzed the advantages of our algorithm over TED in Section II-C.

V. CONCLUSIONS

In this paper we presented a novel metric+active learning method for IT service ticket classification. Our method combines the strengths of metric learning and active learning. Finally, experimental results on both benchmark and real ticket data sets were presented to demonstrate the effectiveness of the proposed method.

ACKNOWLEDGEMENT

The work is partially supported by NSF CAREER Award IIS-0546280 and a 2008 IBM Faculty Award.

REFERENCES

[1] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, 1990.
[2] S. Hoi, W. Liu, M. Lyu, and W. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proceedings of CVPR, 2006.
[3] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 2009.
[4] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of COLT, pages 287-294, 1992.
[5] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of ECCV, 2002.
[6] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2001.
[7] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.
[8] F. Wang and C. Zhang. Feature extraction by maximizing the neighborhood margin. In Proceedings of CVPR, 2007.
[9] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505-512, 2003.
[10] L. Yang. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.
[11] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of ICML, pages 1081-1088, 2006.
