Classification: k-Nearest Neighbors
Data Mining
Dr. Engin YILDIZTEPE

Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons.
- Tan, P.-N., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Supervised vs. Unsupervised Learning
Supervised learning
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
- New data are classified based on the training set.
Unsupervised learning
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Classification: Definition
In classification there is a categorical target variable (e.g., income bracket) that is partitioned into predetermined classes or categories, such as high income, middle income, and low income.
Classification: Definition
- Training set: the set of tuples used for model construction. Given a collection of records, each record contains a set of attributes, one of which is the class. Each tuple/record is assumed to belong to a predefined class, as determined by the class attribute.
- Test set: used to determine the accuracy of the model. Usually the given data set is divided into training and test sets; the training set is used to build the model and the test set to validate it. The test set must be independent of the training set (otherwise overfitting).

Illustrating Classification Task

Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Records to classify (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification: Definition
- Model construction: find a model for the class attribute as a function of the values of the other attributes. The model is represented as classification rules, decision trees, or mathematical formulae.
- Goal: previously unseen or new records should be assigned a class as accurately as possible.

Process (1): Model Construction
- A classification algorithm is applied to the training data to build the classifier (model).
- Accuracy rate is the percentage of test-set samples that are correctly classified by the model.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
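The rule learned in Process (1) can be written as a small function. A minimal sketch (the function name `predict_tenured` is ours, not from the slides) that checks the rule against every record in the training table:

```python
# Sketch of the classifier from Process (1):
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
# Record layout (name, rank, years, tenured) follows the training table above.

def predict_tenured(rank, years):
    """Apply the learned rule to one record."""
    if rank == "Professor" or years > 6:
        return "yes"
    return "no"

training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# The rule reproduces every label in the training table:
for name, rank, years, label in training_data:
    assert predict_tenured(rank, years) == label
```

Note that perfect accuracy on the training data says nothing yet about generalization; that is what the independent test set in Process (2) is for.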
Process (2): Using the Model in Prediction

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Prediction Problems: Classification vs. Numeric Prediction
Classification
- predicts categorical class labels (discrete or nominal)
- constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Numeric prediction
- models continuous-valued functions, i.e., predicts unknown or missing values

Classification Techniques
- Instance-based classifiers (memory-based reasoning)
- Decision tree based methods
- Rule-based methods
- Neural networks
- Bayes classification methods
- Support vector machines
- Other classification methods

Lazy Learning
- Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
- Lazy: less time in training but more time in predicting.
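Process (2) can be sketched by running the rule from Process (1) over the testing data and then classifying the unseen record. The rule and both tables come from the slides; the helper name `predict_tenured` is ours:

```python
# Evaluate the learned rule on the independent testing data, then
# classify the unseen record (Jeff, Professor, 4).

def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_data)
accuracy = correct / len(test_data)
print(f"test accuracy: {accuracy:.2f}")   # Merlisa is misclassified: 0.75

print("Jeff ->", predict_tenured("Professor", 4))   # prints: Jeff -> yes
```

The rule misclassifies Merlisa (years = 7 but tenured = no), giving a test accuracy of 3/4 = 0.75 even though it fit the training data perfectly, which illustrates why accuracy must be measured on an independent test set.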
Instance-Based Methods (Lazy Learners)
- Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - locally weighted regression
  - case-based reasoning

k-Nearest Neighbor Classifiers
Basic idea: "If it walks like a duck and quacks like a duck, then it's probably a duck." Given a test record, compute its distance to the training records and choose the k nearest of them.

A k-nearest-neighbor classifier requires three things:
- the set of stored records
- a distance metric to compute the distance between records
- the value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
1. Compute its distance to the other training records.
2. Identify the k nearest neighbors.
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).

Definition of k-nearest neighbor: the k-nearest neighbors of a record x are the data points that have the k smallest distances to x (illustrated in the figures for (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors).
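The three-step procedure above can be sketched from scratch. This is a minimal illustration, not a production implementation; the function names and the tiny data set are ours:

```python
# From-scratch sketch of k-NN: compute distances, take the k nearest,
# majority-vote their class labels.
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(training, query, k=3):
    """training: list of (point, label) pairs; query: point to classify."""
    # 1. compute the distance from the query to every training record
    distances = sorted((euclidean(point, query), label)
                       for point, label in training)
    # 2. identify the k nearest neighbors
    k_labels = [label for _, label in distances[:k]]
    # 3. majority vote among their class labels
    return Counter(k_labels).most_common(1)[0][0]

# Small illustrative data set (hypothetical, not from the slides)
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(training, (2, 2), k=3))   # the 3 nearest are all "A"
```

Because all the work happens inside `knn_classify`, "training" is just storing the list, which is exactly the lazy-learning behavior described above.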
Nearest Neighbor Classification
Compute the distance between two points, e.g., the Euclidean distance:

d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.

Determining the class from the nearest-neighbor list:
- take the majority vote of the class labels among the k nearest neighbors.

Scaling issues
- Attributes may have to be scaled (normalized) to prevent the distance measure from being dominated by one of the attributes. Example:
  - height of a person may vary from 1.5 m to 1.8 m
  - weight of a person may vary from 40 kg to 100 kg
  - income of a person may vary from $10K to $1M

k-NN classifiers are lazy learners:
- they do not build models explicitly
- they are not eager learners such as decision tree induction and rule-based systems
- classifying unknown records is relatively expensive.
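The scaling issue can be addressed with min-max normalization, which rescales each attribute to [0, 1] so that, e.g., income in dollars does not dominate height in meters inside the Euclidean distance. A sketch with illustrative values (the sample numbers are ours, within the ranges quoted above):

```python
# Min-max normalization: v' = (v - min) / (max - min), per attribute.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [1.5, 1.6, 1.7, 1.8]                  # meters
incomes = [10_000, 50_000, 500_000, 1_000_000]  # dollars

# Both attributes now lie on the same [0, 1] scale, so neither one
# dominates the Euclidean distance.
print(min_max_normalize(heights))
print(min_max_normalize(incomes))
```

Without this step, a 0.3 m height difference is numerically negligible next to a $490K income difference, so the raw distance would effectively ignore height.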
Example
Training data (two classes, + and -) and a query point:

X1  X2  Class
5   4   +
4   7   +
1   6   +
2   7   +
2   4   +
2   2   +
1   6   +
4   1   +
6   1   +
4   1   +
10  10  -
5   8   -
10  5   -
8   4   -
8   6   -
5   8   -
4   5   -
7   4   -
6   9   -
7   7   -
5   8   -
10  6   -
10  6   -
10  4   -
8   3   ?
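The query point (8, 3) from the table can be classified with 3-NN. The "+" labels are from the slide; we assume the second group of points is the "-" class, since only two classes appear in the example:

```python
# Classify the query point (8, 3) with 3-NN and Euclidean distance.
# Assumption: the second group of training points belongs to class "-".
import math
from collections import Counter

plus = [(5, 4), (4, 7), (1, 6), (2, 7), (2, 4),
        (2, 2), (1, 6), (4, 1), (6, 1), (4, 1)]
minus = [(10, 10), (5, 8), (10, 5), (8, 4), (8, 6), (5, 8), (4, 5),
         (7, 4), (6, 9), (7, 7), (5, 8), (10, 6), (10, 6), (10, 4)]
training = [(p, "+") for p in plus] + [(p, "-") for p in minus]

query = (8, 3)
dists = sorted((math.dist(p, query), label) for p, label in training)
vote = Counter(label for _, label in dists[:3]).most_common(1)[0][0]
print(vote)   # the 3 nearest points (8,4), (7,4), (10,4) are all "-"
```

Under this labeling the three nearest neighbors of (8, 3) lie at distances 1.0, √2, and √5, all in the "-" class, so the majority vote assigns "-" to the query.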