Introduction to Classification, aka Machine Learning



Classification: Definition

- Given a collection of examples (the training set). Each example is represented by a set of features, sometimes called attributes, and each example is to be given a label or class.
- Find a model for the label as a function of the values of the features.
- Goal: previously unseen examples should be assigned a label as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set, used to build the model, and a test set, used to validate it.
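The holdout split described above can be sketched in plain Python; the toy dataset, the 1/3 test fraction, and the fixed random seed are illustrative assumptions:

```python
import random

def train_test_split(examples, test_fraction=1/3, seed=42):
    """Shuffle the labeled examples and split them into a training set
    (used to build the model) and a test set (used to validate it)."""
    rng = random.Random(seed)
    shuffled = examples[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

# 12 toy (features, label) examples
data = [((i, i % 2), "Yes" if i % 3 == 0 else "No") for i in range(12)]
train, test = train_test_split(data)
print(len(train), len(test))   # 8 4
```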

Supervised vs. Unsupervised Learning

- Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation. New data is classified based on the training set.
- Unsupervised learning (includes clustering): the class labels of the training data are unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Examples of Classification Tasks

- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Illustrating Classification Task

A learning algorithm is applied to the training set to learn a model (induction); the model is then applied to the test set, whose labels are unknown (deduction).

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification Techniques

There are a number of different classification algorithms to build a model for classification:

- Decision Tree based methods
- Rule-based methods
- Memory based reasoning, instance-based learning
- Neural Networks
- Genetic Algorithms
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

In this introduction, we illustrate classification tasks using Decision Tree methods. Features can have numeric values (continuous) or a finite set of values (categorical/nominal), including boolean true/false.

Example of a Decision Tree

Example task: given the marital status, refund status, and taxable income of a person, label them as to whether they will cheat on their income tax.

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

- Refund = Yes -> NO
- Refund = No -> test MarSt:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced -> test TaxInc:
    - TaxInc < 80K -> NO
    - TaxInc > 80K -> YES

Another Example of a Decision Tree

The same training data is fit by a different tree, rooted at MarSt instead of Refund:

- MarSt = Married -> NO
- MarSt = Single or Divorced -> test Refund:
  - Refund = Yes -> NO
  - Refund = No -> test TaxInc:
    - TaxInc < 80K -> NO
    - TaxInc > 80K -> YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

A tree induction algorithm learns a decision tree model from the training set (induction); the model is then applied to the test set (deduction). The training and test sets are the same as on the Illustrating Classification Task slide above (Tids 1-10 for training, Tids 11-15 with unknown class for testing).

Apply Model to Test Data

Start from the root of the tree and follow the branches that match the test record.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

- Refund = No, so take the No branch to the MarSt node.
- Marital Status = Married, so take the Married branch to a leaf.
- Assign Cheat = No.
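The example tree can be written directly as nested conditionals. This is a sketch of the tree shown above; mapping an income of exactly 80K to the "< 80K" branch is an assumption, since the figure only shows < 80K and > 80K:

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Apply the example decision tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"                       # Refund = Yes leaf
    if marital_status == "Married":
        return "No"                       # Married leaf
    # Single or Divorced: decide on taxable income (>80K -> Yes)
    return "Yes" if taxable_income > 80_000 else "No"

# the test record from the slide: Refund=No, Married, 80K
print(classify_cheat("No", "Married", 80_000))   # No
```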


Evaluating Classification Methods

- Accuracy: classifier accuracy (predicting the class label); predictor accuracy (guessing the value of a predicted attribute)
- Speed: time to construct the model (training time); time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g. goodness of rules, such as decision tree size or compactness of the classification model

Metrics for Performance Evaluation

Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc. Confusion matrix for a binary classifier (two labels) on a test set:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
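The four counts can be tallied directly from lists of actual and predicted labels; a small sketch, where the example label lists are made up for illustration:

```python
def confusion_matrix(actual, predicted, positive="Yes"):
    """Count TP, FN, FP, TN for a binary classifier on a test set."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive: tp += 1      # a: true positive
            else:             fn += 1      # b: false negative
        else:
            if p == positive: fp += 1      # c: false positive
            else:             tn += 1      # d: true negative
    return tp, fn, fp, tn

actual    = ["Yes", "Yes", "No", "No",  "No", "Yes"]
predicted = ["Yes", "No",  "No", "Yes", "No", "Yes"]
print(confusion_matrix(actual, predicted))   # (2, 1, 1, 2)
```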

Classifier Accuracy Measures

Another widely-used metric: the accuracy of a classifier M is the percentage of the test set that is correctly classified by the model M.

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Example counts for a buy_computer classifier (rows are actual classes, columns are predicted classes):

classes               buy_computer = yes   buy_computer = no   total
buy_computer = yes    6954                 46                  7000
buy_computer = no     412                  2588                3000
total                 7366                 2634                10000

Here accuracy = (6954 + 2588) / 10000 = 95.42%.

Other Classifier Measures

Alternative accuracy measures (e.g., for cancer diagnosis or information retrieval):

- sensitivity = t-pos / pos  (true positive recognition rate)
- specificity = t-neg / neg  (true negative recognition rate)
- precision = t-pos / (t-pos + f-pos)
- recall = t-pos / (t-pos + f-neg)
- accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
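A sketch computing these measures from the four confusion-matrix counts, using the buy_computer counts from the earlier slide. Note that the weighted-average formula for accuracy reduces to (TP + TN) / total:

```python
def classifier_measures(tp, fn, fp, tn):
    """Sensitivity, specificity, precision, recall, and accuracy
    from binary confusion-matrix counts."""
    pos, neg = tp + fn, tn + fp
    sensitivity = tp / pos                  # true positive recognition rate
    specificity = tn / neg                  # true negative recognition rate
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)            # equals sensitivity
    # the slide's weighted-average form of accuracy
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, recall, accuracy

# counts from the buy_computer table above
sens, spec, prec, rec, acc = classifier_measures(tp=6954, fn=46, fp=412, tn=2588)
print(round(acc, 4))   # 0.9542
```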

Multi-Class Classification

- Most classification algorithms solve binary classification tasks, while many tasks are naturally multi-class, i.e. there are more than 2 labels.
- Multi-class problems are solved by training a number of binary classifiers and combining them to get a multi-class result.
- The confusion matrix is extended to the multi-class case.
- The accuracy definition extends naturally to the multi-class case.
- Precision and recall are defined for the binary classifiers trained for each label.

Issues with imbalanced classes

Consider a 2-class problem with labels Yes and No:

- Number of No examples = 990
- Number of Yes examples = 10

If the model predicts everything to be No, accuracy is 990/1000 = 99%. Accuracy is misleading because the model does not detect any Yes example. Precision and recall will be better measures if you are training a classifier to find rare examples.
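The slide's numbers can be checked directly; the always-No "model" below is the degenerate predictor described above:

```python
# 990 No examples, 10 Yes examples, and a model that always predicts No
actual = ["No"] * 990 + ["Yes"] * 10
predicted = ["No"] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
tp = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, predicted))
recall = tp / actual.count("Yes")   # fraction of Yes examples actually found

print(accuracy, recall)   # 0.99 0.0
```

High accuracy, zero recall: the model never finds a single rare Yes example.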

Evaluating the Accuracy of a Classifier

- Holdout method: the given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
- Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets, each of approximately equal size. At the i-th iteration, use D_i as the test set and the others as the training set:

  1: test  train train train train
  2: train test  train train train
  ...
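A minimal sketch of the k-fold partitioning step. This version stripes examples into folds deterministically for clarity; in practice the data would be shuffled first, as the slide's "randomly partition" says:

```python
def k_fold_splits(examples, k=10):
    """Yield (train, test) pairs: at iteration i, fold D_i is the test set
    and the remaining k-1 folds together form the training set."""
    folds = [examples[i::k] for i in range(k)]   # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=5):
    print(len(train), len(test))   # 16 4, five times
```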

Evaluating the Model - Learning Curve

A learning curve shows how accuracy changes with varying training sample size. It requires a sampling schedule for creating the curve.

Classifier Performance: Feature Selection

Overly long training or testing times are a performance issue for classification problems with large numbers of attributes, or with nominal attributes that have large numbers of values. Feature selection techniques aim to:

- reduce the number of features by finding a smaller or minimal set that can accurately classify the problem
- reduce the training and prediction time by eliminating noisy or redundant features

Two main types of techniques:

- Filtering methods apply a statistical or other information measure to the attribute values without running any training and testing.
- Wrapper methods try different combinations of attributes, run cross-validation evaluations, and compare the results.
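A toy illustration of a filtering method. The score used here, the absolute difference in a boolean feature's frequency between the two classes, is an illustrative stand-in for a real statistical measure such as information gain; the point is that it is computed without any training or testing:

```python
def filter_score(examples, feature_index):
    """Filter-style score for one boolean feature: how differently it is
    distributed in the Yes class vs. the No class (no model is trained)."""
    yes = [x for x, label in examples if label == "Yes"]
    no  = [x for x, label in examples if label == "No"]
    p_yes = sum(x[feature_index] for x in yes) / len(yes)
    p_no  = sum(x[feature_index] for x in no) / len(no)
    return abs(p_yes - p_no)    # larger = more informative, 0 = useless

# feature 0 tracks the label perfectly, feature 1 is pure noise
examples = [((1, 0), "Yes"), ((1, 1), "Yes"), ((0, 0), "No"), ((0, 1), "No")]
print(filter_score(examples, 0), filter_score(examples, 1))   # 1.0 0.0
```

A wrapper method would instead retrain and cross-validate the classifier for each candidate feature subset, which is more accurate but far more expensive.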

Classifier Performance: Ensemble Methods

- Construct a set of classifiers from the training data.
- Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.

Examples of ensemble methods:

- Bagging
- Boosting
- Heterogeneous classifiers trained on different feature subsets, sometimes called a mixture of experts

General Idea

- Step 1: From the original training data D, create multiple data sets D_1, D_2, ..., D_t-1, D_t.
- Step 2: Build multiple classifiers C_1, C_2, ..., C_t-1, C_t, one from each data set.
- Step 3: Combine the classifiers into a single ensemble classifier C*.
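The combining step (Step 3) is often a simple majority vote, sketched below; the three threshold "classifiers" are placeholders for real models trained on D_1..D_t:

```python
from collections import Counter

def ensemble_predict(classifiers, record):
    """Combine classifiers C_1..C_t by majority vote over their predictions."""
    votes = [c(record) for c in classifiers]
    return Counter(votes).most_common(1)[0][0]

# three toy classifiers (thresholds are arbitrary, for illustration only)
c1 = lambda x: "Yes" if x > 50 else "No"
c2 = lambda x: "Yes" if x > 80 else "No"
c3 = lambda x: "Yes" if x > 60 else "No"

print(ensemble_predict([c1, c2, c3], 70))   # Yes (2 of 3 classifiers vote Yes)
```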

Examples of Classification Problems

Some NLP problems are widely investigated as supervised classification problems, using a variety of problem instances:

- Text categorization: assigning topic labels to documents
- Word sense disambiguation: assigning a sense to a word, as it occurs in a document
- Semantic role labeling: assigning semantic roles to phrases in a sentence

From the NLTK book, chapter 6:

- Classifying first names according to gender
- Document classification (text categorization)
- Part-of-speech tagging
- Sentence segmentation
- Identifying dialog act types
- Recognizing textual entailment

Text Categorization

Represent each document by the words/tokens/terms it contains, sometimes called unigrams or a bag-of-words. To identify terms from the document text:

- Remove symbols with little meaning.
- Remove words with little meaning: the stop words.
- Stem the meaningful words: remove endings to get the root of the word. For example, from enchanted, enchants, enchantment, and enchanting, get the root word enchant.
- Group words into phrases (optional): proper names or other words that are likely to have a different meaning as a phrase than as individual words.
- After grouping, you may also want to lowercase the terms.
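A toy sketch of this pipeline; the tiny stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins for a real list and a real stemmer such as Porter's:

```python
STOP_WORDS = {"the", "a", "and", "to", "of", "not"}   # tiny illustrative list

def terms(text):
    """Tokenize, drop symbols and stop words, then crudely stem by
    stripping a few common endings (a toy stand-in for a real stemmer)."""
    # replace non-alphanumeric symbols with spaces, lowercase, split
    tokens = "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()
    stemmed = []
    for t in tokens:
        if t in STOP_WORDS:
            continue
        for suffix in ("ments", "ment", "ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(terms("enchanted enchants enchantment enchanting"))
# ['enchant', 'enchant', 'enchant', 'enchant']
```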

Document Features

Use a feature vector to represent all the words in a document: one position for each word in the collection, holding that word's weight (often its frequency).

  Water, water everywhere, and not a drop to drink!  ->  (2, 1, 1, 1, 1, 0, ...)

Another document containing the word drink would have a 1 in the drink position: (0, 0, 0, 0, 1, 0, ...) (both shown with frequency weights). Feature vectors may have thousands of words and are often restricted by a frequency threshold, e.g. keeping only words occurring 5 or more times.
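The slide's example vector can be reproduced with a small bag-of-words sketch; the six-word vocabulary below is an assumption chosen to match the example:

```python
def feature_vector(document, vocabulary):
    """One position per vocabulary word, holding that word's frequency
    in the document (a bag-of-words representation)."""
    # crude tokenization: strip punctuation, lowercase, split on whitespace
    tokens = "".join(ch if ch.isalnum() else " " for ch in document.lower()).split()
    return [tokens.count(word) for word in vocabulary]

vocab = ["water", "everywhere", "not", "drop", "drink", "ocean"]
print(feature_vector("Water, water everywhere, and not a drop to drink!", vocab))
# [2, 1, 1, 1, 1, 0]
```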

Weka demonstration to observe feature vectors: compare a traditional data mining problem with a text mining problem.