Jeff Howbert Introduction to Machine Learning Winter

Similar documents
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Classification Techniques (1)

Foundations of Artificial Intelligence. Introduction to Data Mining

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Machine Learning Logistic Regression

Knowledge Discovery and Data Mining

Classification algorithm in Data mining: An Overview

Data Mining for Knowledge Management. Classification

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Data Mining Classification: Decision Trees

Support Vector Machine (SVM)

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Big Data Analytics CSCI 4030

Content-Based Recommendation

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Supervised Learning (Big Data Analytics)

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Social Media Mining. Data Mining Essentials

Classification and Prediction techniques using Machine Learning for Anomaly Detection.

Classification and Prediction

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Data Mining: Introduction

Data Mining Part 5. Prediction

L25: Ensemble learning

Supervised and unsupervised learning - 1

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

LCs for Binary Classification

Data Mining + Business Intelligence. Integration, Design and Implementation

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Chapter 6. The stacking ensemble approach

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Fingerprinting the datacenter: automated classification of performance crises

Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Decision Trees from large Databases: SLIQ

CSC 411: Lecture 07: Multiclass Classification

Using multiple models: Bagging, Boosting, Ensembles, Forests

MACHINE LEARNING IN HIGH ENERGY PHYSICS

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Part 5. Prediction

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

A Simple Introduction to Support Vector Machines

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Data Mining. Nonlinear Classification

Linear Threshold Units

Spam Detection A Machine Learning Approach

Neural Networks and Support Vector Machines

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Neural Networks Lesson 5 - Cluster Analysis

Data Mining Practical Machine Learning Tools and Techniques

Machine Learning using MapReduce

Classification of Bad Accounts in Credit Card Industry

Data Mining Fundamentals

Distances, Clustering, and Classification. Heatmaps

Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0

Data Mining Techniques

Predicting Student Persistence Using Data Mining and Statistical Analysis Methods

Collaborative Filtering. Radek Pelánek

Machine Learning for Data Science (CS4786) Lecture 1

Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006

Educational Objectives To investigate equilibrium using a lever in two activities.

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Data Mining: Data Preprocessing. I211: Information infrastructure II

A REVIEW ON EFFICIENT DATA ANALYSIS FRAMEWORK FOR INCREASING THROUGHPUT IN BIG DATA. Technology, Coimbatore. Engineering and Technology, Coimbatore.

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA.

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Introduction to Machine Learning Using Python. Vikram Kamath

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Machine Learning Capacity and Performance Analysis and R

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

How To Cluster

Local features and matching. Image classification & object localization

Factor Models for Gender Prediction Based on E-commerce Data

Data Preprocessing. Week 2

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

CSCI567 Machine Learning (Fall 2014)

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Multi-class Classification: A Coding Based Space Partitioning

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Microsoft Azure Machine learning Algorithms

Data Mining: A Preprocessing Engine

Support Vector Machines Explained

Practical Introduction to Machine Learning and Optimization. Alessio Signorini

Mining Big Data. Pang-Ning Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University

Decompose Error Rate into components, some of which can be measured on unlabeled data

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

More Local Structure Information for Make-Model Recognition

Knowledge Discovery and Data Mining

Transcription:

Classification Nearest es Neighbor Jeff Howbert Introduction to Machine Learning Winter 2012 1

Instance based classifiers Set of Stored Cases Atr1... AtrN Class A B B C A C B Store the training samples Use training i samples to predict the class label of unseen samples Unseen Case Atr1... AtrN Jeff Howbert Introduction to Machine Learning Winter 2012 2

Instance based classifiers Examples: Rote learner memorize entire training data perform classification only if attributes of test sample match one of the training samples exactly Nearest neighbor use k closest samples (nearest neighbors) to perform classification Jeff Howbert Introduction to Machine Learning Winter 2012 3

Nearest neighbor classifiers Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck compute distance test sample training samples choose k of the nearest samples Jeff Howbert Introduction to Machine Learning Winter 2012 4

Nearest neighbor classifiers Unknown record Requires three inputs: 1. The set of stored samples 2. Distance metric to compute distance between samples 3. The value of k, the number of nearest neighbors to retrieve Jeff Howbert Introduction to Machine Learning Winter 2012 5

Nearest neighbor classifiers Unknown record To classify unknown record: 1. Compute distance to other training records 2. Identify k nearest neighbors 3. Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote) Jeff Howbert Introduction to Machine Learning Winter 2012 6

Definition of nearest neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor k-nearest neighbors of a sample x are datapoints that have the k smallest distances to x Jeff Howbert Introduction to Machine Learning Winter 2012 7

1-nearest neighbor Voronoi diagram Jeff Howbert Introduction to Machine Learning Winter 2012 8

Nearest neighbor classification Compute distance between two points: Euclidean distance d( x, y) = i ( x i y i Options for determining the class from nearest neighbor list Take majority vote of class labels among the k-nearest neighbors Weight the votes according to distance 2 ) example: weight factor w=1/d 2 Jeff Howbert Introduction to Machine Learning Winter 2012 9

Nearest neighbor classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes Jeff Howbert Introduction to Machine Learning Winter 2012 10

Nearest neighbor classification Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5 m to 1.8 m weight of a person may vary from 90 lb to 300 lb income of a person may vary from $10K to $1M Jeff Howbert Introduction to Machine Learning Winter 2012 11

Nearest neighbor classification Problem with Euclidean measure: High dimensional data curse of dimensionality Can produce counter-intuitive results 111111111110 1 1 1 1 1 1 1 1 1 1 100000000000 0 0 0 0 0 0 0 0 0 0 vs 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 d = 1.4142 d = 1.4142 one solution: normalize the vectors to unit length Jeff Howbert Introduction to Machine Learning Winter 2012 12

Nearest neighbor classification k-nearest neighbor classifier is a lazy learner Does not build model explicitly. Unlike eager learners such as decision tree induction and rule-based systems. Classifying unknown samples is relatively expensive. k-nearest neighbor classifier is a local model, vs. global model of linear classifiers. s Jeff Howbert Introduction to Machine Learning Winter 2012 13

Example: PEBLS PEBLS: Parallel Examplar-Based Learning System (Cost & Salzberg) Works with both continuous and nominal features For nominal features, distance between two nominal values is computed using modified value difference metric (MVDM) Each sample is assigned a weight factor Number of nearest neighbor, k = 1 Jeff Howbert Introduction to Machine Learning Winter 2012 14

10 Example: PEBLS Tid Refund Marital Status Taxable Income Cheat Distance between nominal attribute values: 1 Yes Single 125K No d(single,married) 2 No Married 100K No = 2/4 0/4 + 2/4 4/4 = 1 3 No Single 70K No d(single,divorced) 4 Yes Married 120K No 5 No Divorced 95K Yes = 2/4 1/2 + 2/4 1/2 = 0 6 No Married 60K No d(married,divorced) 7 Yes Divorced 220K No = 0/4 1/2 + 4/4 1/2 = 1 8 No Single 85K Yes d(refund=yes,refund=no) 9 No Married 75K No = 0/3 3/7 + 3/3 4/7 = 6/7 10 No Single 90K Yes Marital Status Class Single Married Divorced Yes 2 0 1 Refund Class Yes No Yes 0 3 d ( V1, V2 ) = i n n 1i 1 n n 2i 2 No 2 4 1 No 3 4 Jeff Howbert Introduction to Machine Learning Winter 2012 15

10 Example: PEBLS Tid Refund Marital Status Taxable Income Cheat X Yes Single 125K No Y No Married 100K No Distance between record X and record Y: where: Δ( X w X =, Y ) = w X w Y d i= 1 d ( X i, Y Number of times X is used for prediction Number of times X predicts correctly i 2 ) w X w X 1 if X makes accurate prediction most of the time > 1 if X is not reliable for making predictions Jeff Howbert Introduction to Machine Learning Winter 2012 16

Decision boundaries in global vs. local models linear regression 15-nearest neighbor 1-nearest neighbor global stable can be inaccurate local accurate unstable What ultimately matters: GENERALIZATION Jeff Howbert Introduction to Machine Learning Winter 2012 17