Feature Selection: classical and new techniques. Guido Sciavicco, 11 November 2015




Feature Selection: classical and new techniques. Guido Sciavicco, Università degli Studi di Ferrara, 11 November 2015. In collaboration with Dr. Enrico Marzano, CIO, Gap srl (Active Contact System project). 1/27

Contents: What is Feature Selection? Filters; Embedded methods; Wrappers; Wrappers again: Optimization and Search Strategies; Feature Selection for Unsupervised Classification. 2/27

Features - 1 Both supervised and unsupervised knowledge discovery processes share the same basis: a table whose columns are the attributes A1, A2, ..., plus a label column L (the question mark indicates that the label may be unknown).

    A1    A2    ...    L?
    a11   a12   ...    L1?
    a21   a22   ...    L2?
    a31   a32   ...    L3?

3/27

Features - 2 The information we are searching for is hidden in a table (view), in such a way that we can ask ourselves:
1. Given all instances with a certain label, what are the rules that govern the label value in terms of the attribute (or feature) values? - supervised classification
2. Given my unlabeled instances, is it possible to identify clusters to which all instances belong? - unsupervised classification
3. Given my unlabeled instances, and given the result of a clustering process, is it possible to link the clusters to the attributes via a set of rules? - unsupervised to supervised classification
All such questions, as they are posed, implicitly involve all attributes, whether they are meaningful or not. 4/27

Features - 3 A typical classification problem includes hundreds to millions of instances, each described by tens to hundreds of attributes. Attributes are generically collected brute force: since I know neither the rules I am searching for, nor even whether such rules exist, I choose to consider as many characteristics of each instance as I can possibly think of. Having non-meaningful attributes contributing to a classification process produces a slow, difficult to interpret, and sometimes even wrong result. And, given N attributes, the search space in which I have to select my subset of meaningful attributes has size 2^N, which makes it very difficult, if not impossible, to perform a manual selection. 5/27

Features - 4 It is important to stress that it does not matter which algorithm we want to use to perform a classification or a clustering. Every algorithm will perform better if it is run with:
1. fewer features;
2. more correlated features;
3. features that more clearly influence the label.
There are at least three categories of Feature Selection algorithms:
1. Filters;
2. Wrappers;
3. Embedded methods. 6/27

Filters A filter is an algorithm that can be applied to a data set in order to select only a subset of the features, with the following characteristics:
1. A filter is independent of the successive clustering/classification step;
2. A filter can be supervised or unsupervised - following the same nomenclature that we used before: supervised filters can be applied in presence of the label, and unsupervised filters must be used in the other case.
Filters not only improve the performance and the result of the successive classification/clustering, but they can and must be used to perform a more fundamental pre-processing of the original data set. Pre-processing our data set is necessary, and, at some stage, it includes a feature selection step. 7/27

Unsupervised Filters: An Example One of the most fundamental unsupervised filters consists of eliminating features with zero or near-zero variance. Each feature under consideration is eliminated if:
1. It presents a single possible value (zero variance), or
2. It has very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large (near-zero variance).
So, for example, if we have 1000 instances with an attribute named A, which presents the value a 999 times and b just once, A is eliminated by the filter; the result is that my original data set is then projected over the remaining attributes. 8/27
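To make the near-zero-variance filter concrete, here is a minimal sketch in Python/numpy; the two cutoffs (frequency ratio and fraction of unique values) are illustrative assumptions, not values prescribed in the talk.

```python
import numpy as np

def near_zero_variance_mask(X, freq_ratio_cutoff=19.0, unique_cutoff=0.10):
    """Return a boolean mask of the columns of X to KEEP.

    A column is dropped when it has a single value (zero variance), or when
    it has few unique values relative to the number of samples AND the most
    common value dominates the second most common one (near-zero variance).
    Both cutoff values are illustrative assumptions."""
    n_samples, n_features = X.shape
    keep = np.ones(n_features, dtype=bool)
    for j in range(n_features):
        values, counts = np.unique(X[:, j], return_counts=True)
        if len(values) == 1:                       # zero variance
            keep[j] = False
            continue
        counts = np.sort(counts)[::-1]
        freq_ratio = counts[0] / counts[1]         # most common vs. second most common
        unique_pct = len(values) / n_samples
        if freq_ratio > freq_ratio_cutoff and unique_pct < unique_cutoff:
            keep[j] = False                        # near-zero variance
    return keep

# Example from the slide: 1000 instances, attribute A takes value 'a' 999 times, 'b' once.
A = np.array([['a']] * 999 + [['b']])
print(near_zero_variance_mask(A))                  # [False] -> A is eliminated
```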

Supervised Filters: An Example - 1 Another class of filter techniques is called feature weighting. A feature weighting algorithm assigns weights to features (individually) and ranks them based on their relevance. One way to do this is to estimate the relevance of a feature according to how well its values distinguish among instances of the same and of different classes (labels): suppose that A is a feature, and that its values change across the instances pretty much as the label does; then its relevance is high. 9/27
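A classical representative of this family is the Relief algorithm; the following is a simplified sketch (assuming numeric features on comparable scales and binary labels), not the exact weighting scheme used in the talk.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, rng=None):
    """Simplified Relief-style feature weighting. A feature gets a higher
    weight when it separates instances of different classes better than it
    separates instances of the same class."""
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                                   # exclude the instance itself
        same = (y == y[i]) & (dist < np.inf)
        near_hit = np.argmin(np.where(same, dist, np.inf))     # nearest same-class instance
        near_miss = np.argmin(np.where(~same, dist, np.inf))   # nearest different-class instance
        w += np.abs(X[i] - X[near_miss]) - np.abs(X[i] - X[near_hit])
    return w / n_iter

# Toy example: the first feature tracks the label, the second is random noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + 0.05 * rng.standard_normal(200), rng.random(200)])
print(relief_weights(X, y, rng=1))   # first weight clearly larger than the second
```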

Supervised Filters: An Example - 2 Another approach used in filters is based on the selection of a subset of features and the evaluation of the goodness of the entire subset. For example, we can use a subset selection filter to eliminate redundant features by establishing the consistency of a subset w.r.t. the original set. To this end, we can declare A redundant if eliminating it does not change the behaviour of the label over the instances. 10/27
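One possible reading of this criterion in code: a subset is consistent if no two instances agree on all selected features while carrying different labels, and a feature is redundant if removing it preserves consistency. This is a hypothetical sketch, not necessarily the exact filter meant in the talk.

```python
import numpy as np

def is_consistent(X, y, feature_idx):
    """True if no two instances share the same values on the selected
    features while carrying different labels."""
    seen = {}
    for row, label in zip(X[:, feature_idx], y):
        key = tuple(row)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

def redundant_features(X, y):
    """Features whose individual removal leaves the remaining subset
    consistent with the label."""
    all_idx = list(range(X.shape[1]))
    return [j for j in all_idx
            if is_consistent(X, y, [k for k in all_idx if k != j])]

# Toy example: column 2 duplicates column 0, and y equals column 0.
X = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([0, 1, 1, 0])
print(redundant_features(X, y))   # [0, 1, 2]: any single column can be dropped
```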

Wrapper-Based Methods - 1 A wrapper-based method considers the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to one another. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy, thus:
1. A wrapper-based method depends on the model;
2. A wrapper-based method can be supervised or unsupervised - but this time, only supervised wrapper methods are well-known, and unsupervised ones are under investigation; we'll be more precise later! 11/27

Wrapper-Based Methods - 2 A wrapper model is based on the predictive accuracy of a predetermined learning algorithm to determine the quality of the selected features: for each selection, the learning algorithm is asked to check the accuracy of the classifier built over it. Four basic steps:
1. generation: depends on the search strategy;
2. evaluation: depends on the learning algorithm;
3. stopping criterion: depends on the wrapper strategy;
4. validation: usually final-user dependent. 12/27
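As a concrete (and deliberately simple) instance of these four steps, here is a sketch of a wrapper with greedy forward selection as the search strategy and the cross-validated accuracy of a scikit-learn decision tree as the evaluation; the learner, the cv value, and the stopping rule are illustrative choices, not the ones used in the talk.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_wrapper(X, y, learner=None, cv=5):
    """Greedy forward selection: generation = add one more feature,
    evaluation = cross-validated accuracy of the learner on the projection,
    stopping criterion = no candidate improves the current score."""
    learner = learner or DecisionTreeClassifier(random_state=0)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scored = [(cross_val_score(learner, X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(scored)
        if score <= best_score:          # stopping criterion
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Example run on a small benchmark data set (illustrative only).
from sklearn.datasets import load_iris
data = load_iris()
print(forward_wrapper(data.data, data.target))
```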

Supervised Wrapper Methods - 1 In supervised wrapper-based feature selection we can use any supervised learning algorithm as a black box. For a given selection we just need to ask the learning algorithm to return a learning model and its performance. If you recall, there are many performance indicators:
1. accuracy;
2. True Positives, True Negatives, False Positives, False Negatives...;
3. the ROC curve;
4. ...
One parameter that can be used in stepping from a generic wrapper-based method to a specific one is precisely the performance indicator(s) used in the evaluation phase. 13/27
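For reference, all the indicators listed above are a few lines of scikit-learn; the labels and scores below are made-up illustrations.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Illustrative values only: true labels, predicted labels, predicted scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print(accuracy_score(y_true, y_pred))               # accuracy
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                               # TP, TN, FP, FN
print(roc_auc_score(y_true, y_prob))                # area under the ROC curve
```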

Supervised Wrapper Methods - 2 So, for example, we can obtain a wrapper-based feature selection by using:
1. A search strategy to produce a subset of the features (we'll be more precise in a moment!);
2. The decision tree C4.5 as evaluation method;
3. The accuracy of the prediction (correctly classified instances vs. total instances) as evaluation measure.
Also, we can decide to run the decision tree in full-training mode (to speed up the process) or in cross-validation mode (to avoid false accuracy measures due to over-fitting). 14/27
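A sketch of such an evaluation step; scikit-learn's entropy-based DecisionTreeClassifier is used here as a stand-in for C4.5 (it is not the same algorithm), and the full-training vs. cross-validation alternative is exposed as a flag.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_selection(X, y, feature_idx, use_cv=True, cv=10):
    """Accuracy (correctly classified / total) of a decision tree trained on
    the projection of X over the selected features. With use_cv=False the
    tree is scored on its own training data, which is faster but
    over-optimistic, as noted on the slide."""
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    Xp = X[:, feature_idx]
    if use_cv:
        return cross_val_score(tree, Xp, y, cv=cv).mean()
    return tree.fit(Xp, y).score(Xp, y)     # full-training accuracy
```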

Supervised Wrapper Methods - 3 Now we know that the evaluation phase is not a problem: we have literally hundreds of possible choices of supervised learning algorithms. But what about the search strategy? Classical choices include:
1. deterministic: sequential and exponential search methods, or
2. random:
2.1 neural networks;
2.2 particle swarm optimization;
2.3 evolutionary algorithms;
2.4 ... 15/27

EA-based Supervised Wrapper Methods - 1 How can an Evolutionary Algorithm serve as the search method in a wrapper-based feature selection mechanism? To understand this, we need to understand how an EA works in the first place. The basic elements of an EA are:
1. Individual representation and initialization;
2. Fitness/Objective function(s);
3. Evolution strategy. 16/27

EA-based Supervised Wrapper Methods - 2 In the particular case of feature selection, an individual may be represented as

    b_A1, b_A2, ..., b_An, c_1, c_2, ...

where:
1. each b_Ai is a bit (0 = do not select this feature, 1 = select this feature);
2. each c_j is a particular characteristic of the individual that depends on the search method - it may or may not be present. 17/27
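In code, such individuals are just bit vectors. A minimal sketch of the representation, random initialization, and decoding (the population size is arbitrary and the optional characteristics c_j are left out):

```python
import numpy as np

def random_population(pop_size, n_features, rng=None):
    """Each individual is a bit vector b_A1..b_An: 1 = select the feature."""
    rng = np.random.default_rng(rng)
    return rng.integers(0, 2, size=(pop_size, n_features))

def decode(individual):
    """Indices of the selected features (the projection of the data set)."""
    return np.flatnonzero(individual)

population = random_population(pop_size=20, n_features=49, rng=0)
print(decode(population[0]))
```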

EA-based Supervised Wrapper Methods - 3 Typically, we randomly initialize P individuals (P possible selections of features). Following the generic EA methodology, we evaluate each one of them by running a classifier on the projection of the original data set over the selection specified by the individual, and by taking a predetermined quality measure of the run (e.g., the accuracy) as the evaluation. If the EA is multi-objective, we can add another quality measure; typically the two measures should be inversely related. For example, we can add the number of selected features. In this way, judging the relative quality of two individuals is more complex; the set of best individuals is called the Pareto set. 18/27

EA-based Supervised Wrapper Methods - 4 Given a set of individuals, we can only exclude those that do not belong to the Pareto set (or front). An individual I does not belong to the front if and only if there exists an individual J such that:
1. J improves on I under at least one objective (e.g., the accuracy), and
2. J is not worse than I under every other objective (e.g., the number of features).
Under such conditions, I is dominated by J and thus does not belong to the front. 19/27
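This dominance test translates directly into code. In the sketch below both objectives are expressed as quantities to maximize (accuracy and the negated number of selected features), which is just one possible convention.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on every objective and
    strictly better on at least one (objectives to be maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of the individuals not dominated by any other individual."""
    return [i for i, a in enumerate(scores)
            if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)]

# Objectives: (accuracy, -number of selected features), both maximized.
scores = [(0.95, -8), (0.93, -5), (0.95, -12), (0.90, -20)]
print(pareto_front(scores))    # [0, 1]: the third and fourth individuals are dominated
```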

EA-based Supervised Wrapper Methods - 5 From the population P we build the subset Q of Pareto-optimal individuals. Those in Q, plus some individuals in P \ Q (which are randomly chosen in order to avoid local optima), are combined to form a new population P' of individuals that are generally better than those in P under the objective(s). The process is then iterated a predetermined number of times. The last population, and, in particular, the non-dominated individuals of the last generation, are those we consider for the selection. But which one? 20/27
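Putting the pieces together, one possible shape of the loop is sketched below; one-point crossover, bit-flip mutation, the mutation rate, and the way dominated individuals are re-injected are illustrative assumptions, not the exact EA used in the talk (the dominance helpers are repeated from the previous sketch for self-containment).

```python
import numpy as np

def _dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def _pareto_front(scores):
    return [i for i, a in enumerate(scores)
            if not any(_dominates(b, a) for j, b in enumerate(scores) if j != i)]

def evolve(population, fitness, n_generations=50, mutation_rate=0.01, rng=None):
    """Generic multi-objective loop: keep the Pareto set Q, recombine members
    of Q together with a few randomly chosen dominated individuals (to avoid
    local optima), mutate, and iterate. `fitness` maps a bit-vector individual
    to a tuple of objectives to be maximized."""
    rng = np.random.default_rng(rng)
    pop = population.copy()
    n_features = pop.shape[1]
    for _ in range(n_generations):
        scores = [fitness(ind) for ind in pop]
        parents = _pareto_front(scores) + list(rng.choice(len(pop), size=2))
        children = []
        while len(children) < len(pop):
            i, j = rng.choice(parents, size=2)         # pick two parents
            cut = rng.integers(1, n_features)          # one-point crossover
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            flip = rng.random(n_features) < mutation_rate
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    scores = [fitness(ind) for ind in pop]
    return pop[_pareto_front(scores)]                  # non-dominated individuals of the last generation
```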

EA-based Supervised Wrapper Methods - 6 We design an a posteriori decision-making process that takes into account domain-specific elements, and we choose one individual. The experiments tell us that this choice behaves much better than the original (full) data set. A huge number of experiments is possible:
1. by changing the classification model;
2. by changing the objective function(s);
3. by changing the selection mechanism hidden in the EA. 21/27

Unsupervised Classification - 1 We already know what unsupervised classification is. Our data set is not labeled: the purpose is to separate the instances into disjoint subsets in the cleanest possible way. Our purpose now is to eliminate those features that make it difficult for a clustering algorithm to give us a proper result. 22/27

Unsupervised Classification - 2 There are several ways to measure the goodness of a clustering. A very simple one is to measure the ratio between the biggest and the smallest obtained cluster: the heuristic here is to stay below 3. Another way, possible only with certain clustering algorithms, is to measure the likelihood (for example, in the EM algorithm). Now, suppose that we choose a clustering method (C) and a meaningful measure of its performance (M). 23/27
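Both measures are easy to compute. A sketch using scikit-learn's GaussianMixture as the EM clusterer on synthetic data (the data and the number of components are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_size_ratio(labels):
    """Ratio between the biggest and the smallest cluster (heuristic: stay below 3)."""
    sizes = np.bincount(labels)
    sizes = sizes[sizes > 0]
    return sizes.max() / sizes.min()

# EM clustering (GaussianMixture) also exposes a likelihood-based score.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(4, 1, (40, 2))])
em = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = em.predict(X)
print(cluster_size_ratio(labels))      # about 2: acceptable under the <3 heuristic
print(em.score(X))                     # average per-sample log-likelihood
```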

Feature Selection for Unsupervised Classification - 1 To eliminate noisy features we can proceed in precisely the same way as in supervised classification. We can choose evolutionary-based wrapper feature selection, and set the following objectives:
1. The accuracy of the classification model obtained from a selection after the clustering;
2. A measure of the goodness of the clustering. 24/27

Feature Selection for Unsupervised Classification - 2 Therefore, for each selection:
1. We run the clustering algorithm C and we obtain its performance M;
2. We run a classification learning algorithm, where the cluster becomes the class;
3. We set M and the accuracy (or any other measure) to be maximized.
Since there is no direct relation between the number of features and M, we do not set the cardinality of the set of features to be minimized. 25/27
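A sketch of the per-selection evaluation just described, with GaussianMixture standing in for EM and a scikit-learn decision tree standing in for J48 (both are substitutions, and the number of clusters is an assumption):

```python
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_unsupervised_selection(X, feature_idx, n_clusters=2):
    """Objectives for one feature selection in the unsupervised wrapper:
    (1) M = likelihood-based score of the EM clustering on the projection,
    (2) accuracy of a classifier that predicts the cluster as if it were the class."""
    Xp = X[:, feature_idx]
    em = GaussianMixture(n_components=n_clusters, random_state=0).fit(Xp)
    clusters = em.predict(Xp)                     # the cluster becomes the class
    M = em.score(Xp)                              # average log-likelihood
    tree = DecisionTreeClassifier(random_state=0)
    accuracy = cross_val_score(tree, Xp, clusters, cv=5).mean()
    return M, accuracy                            # both to be maximized
```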

Experiments - 1 As for the GAP data set:
1. We have a data set composed of 49 features (search space: 2^49): operational, service, and central data;
2. Each line was classified by the outcome;
3. We ran a wrapper-based feature selection mechanism based on evolutionary computation, to minimize the number of features and maximize the accuracy.
The accuracy increased from 95.09 to 95.45, and the number of features dropped to 8. Therefore, 41 of the features previously collected do not influence the outcome of a session. 26/27

Experiments - 2 Psychology data set:
1. We started with 152 features for 159 unlabeled instances: each instance corresponds to a child who underwent a psychological test (search space 2^152).
2. We ran a wrapper-based feature selection mechanism based on evolutionary computation, to maximize both the likelihood of the clustering (via EM) and the accuracy (via J48).
3. We obtained a set of 10 features, and each child has been classified into one of two categories: the interpretation shows that this choice is perfectly coherent with the semantics of the questions in the test.
The accuracy increased from 83.98 to 96.04, and the number of features dropped to 10. In this case we can conclude that the children examined by the test can be classified into two categories using their answers to 10 of the 152 questions. 27/27