LCs for Binary Classification



Linear Classifiers

A linear classifier performs classification by taking a dot product between the two vectors representing the document and the category, respectively. It therefore consists in a document-like representation of the category c_i, and linear classifiers are thus very efficient at classification time. Methods for learning linear classifiers can be partitioned into two broad classes:
- Incremental methods (IMs), or on-line methods, build a classifier soon after examining the first document, and incrementally refine it as they examine new ones.
- Batch methods (BMs) build a classifier by analyzing the training set Tr all at once.

LCs for Binary Classification

Consider two-class problems in the vector space model (VSM): deciding between two classes, say "sport" and "not sport". How do we define (and find) the separating surface (hyperplane)? How do we test which region a test document is in?

Separation by Hyperplanes

A strong high-bias assumption is linear separability: in 2 dimensions we can separate the classes by a line; in higher dimensions we need hyperplanes. We can find a separating hyperplane by linear programming (or iteratively fit a solution via the perceptron). In 2D the separator can be expressed as ax + by = c.

Linear programming / Perceptron

Find a, b, c such that

  ax + by >= c  for red points
  ax + by <= c  for green points.

Linear programming / Perceptron

  w^T x + b > 0  if x belongs to Sport (y = +1)
  w^T x + b < 0  otherwise (y = -1)

You can write both cases as y ([w b] [x 1]^T) > 0. Thus, we want to compute a vector W such that y_i W^T X_i > 0 for all i, where W = [w b]^T and X_i = [x_i 1]^T.

The Perceptron Algorithm

A simple (but powerful) method is the perceptron:
1. Initialize W = 0.
2. For all training examples (X_i, y_i), i = 1, ..., n: if y_i W^T X_i <= 0, set W = W + y_i X_i.
3. If no errors were made in step 2, stop. Otherwise, repeat step 2.
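The three steps above can be sketched in a few lines of Python (a minimal illustration; the toy data and the `max_epochs` cap are my own additions, not from the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron algorithm: X is an (n, d) array of feature vectors,
    y an (n,) array of labels in {+1, -1}. A constant 1 is appended to
    each x_i so the bias b is folded into W, i.e. X_i = [x_i 1]."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    W = np.zeros(Xa.shape[1])              # step 1: W = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):          # step 2: sweep over examples
            if yi * (W @ xi) <= 0:         # misclassified (or on boundary)
                W = W + yi * xi
                errors += 1
        if errors == 0:                    # step 3: no mistakes, stop
            break
    return W

# Tiny separable example: positives upper-right, negatives lower-left.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W = perceptron(X, y)
```

If the data are linearly separable, the loop terminates with a separating W; otherwise it cycles forever, which is why the epoch cap is needed in practice.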

Linear classifier: Example

Class: "interest" (as in interest rate). Example features of a linear classifier:

  w_i    t_i           w_i     t_i
  0.70   prime        -0.71   dlrs
  0.67   rate         -0.35   world
  0.63   interest     -0.33   sees
  0.60   rates        -0.25   year
  0.46   discount     -0.24   group
  0.43   bundesbank   -0.24   dlr

To classify, find the dot product of the feature vector and the weights:

  Score("rate discount dlrs world") = 0.67 + 0.46 - 0.71 - 0.35 = 0.07 > 0
  Score("prime dlrs") = 0.70 - 0.71 = -0.01 < 0

Linear Classifiers

Many common text classifiers are linear classifiers: Naive Bayes, perceptron, Rocchio, logistic regression, support vector machines (with linear kernel), linear regression, and (simple) perceptron neural networks. Despite this similarity, there are large performance differences between them. For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose? What do you do for non-separable problems? Different training methods pick different hyperplanes. Classifiers more powerful than linear ones often don't perform better. Why?
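The dot-product scoring in this example can be reproduced directly (a small sketch; the dictionary just transcribes the weight table for the "interest" class):

```python
# Feature weights for the class "interest", from the table above.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33,
    "year": -0.25, "group": -0.24, "dlr": -0.24,
}

def score(doc):
    """Dot product between the document's term vector and the weights;
    terms with no learned weight contribute 0."""
    return sum(weights.get(term, 0.0) for term in doc.split())

def classify(doc):
    """Assign the class by the sign of the linear score."""
    return "interest" if score(doc) > 0 else "not interest"
```

For example, score("rate discount dlrs world") gives 0.67 + 0.46 - 0.71 - 0.35 = 0.07, so that document lands on the "interest" side of the hyperplane.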

High Dimensional Data

Documents are zero along almost all axes. Most document pairs are very far apart (i.e., not strictly orthogonal, but they share only very common words and a few scattered others). In classification terms: virtually all document sets are separable, for almost any classification. This is part of why linear classifiers are quite successful in this domain.

Naive Bayes is a linear classifier

Two-class Naive Bayes. We compute:

  log [P(C|d) / P(¬C|d)] = log [P(C) / P(¬C)] + Σ_{w∈d} log [P(w|C) / P(w|¬C)]

Decide class C if the odds ratio is greater than 1, i.e., if the log odds is greater than 0. So the decision boundary is the hyperplane

  β + Σ_{w∈V} α_w n_w = 0

where β = log [P(C) / P(¬C)], α_w = log [P(w|C) / P(w|¬C)], and n_w = # of occurrences of w in d.
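The claim that Naive Bayes is linear can be checked numerically: after estimating the probabilities, the class decision is just the sign of the linear score β + Σ_w α_w n_w. Everything below (the two-document toy corpus and the Laplace smoothing) is an illustrative assumption, not from the slides:

```python
import math
from collections import Counter

# Hypothetical toy corpus for the two classes.
sport = ["great match today", "the team won the match"]
other = ["interest rates rose", "the bank cut rates"]

def counts(docs):
    """Pooled word counts over a list of documents."""
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

c_pos, c_neg = counts(sport), counts(other)
vocab = sorted(set(c_pos) | set(c_neg))
n_pos, n_neg = sum(c_pos.values()), sum(c_neg.values())
V = len(vocab)

# Laplace-smoothed per-word weights: alpha_w = log P(w|C) - log P(w|notC)
alpha = {w: math.log((c_pos[w] + 1) / (n_pos + V))
          - math.log((c_neg[w] + 1) / (n_neg + V)) for w in vocab}
beta = math.log(len(sport) / len(other))   # log prior odds

def log_odds(doc):
    """The NB log odds written as the linear score beta + sum_w alpha_w * n_w."""
    return beta + sum(alpha.get(w, 0.0) * n
                      for w, n in Counter(doc.split()).items())
```

A document is assigned to the "sport" class exactly when this linear score is positive, so the decision boundary is the hyperplane where it equals 0.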

kNN vs. Linear Classifiers

Bias/variance tradeoff (variance roughly corresponds to capacity): kNN has high variance and low bias (infinite memory), while linear classifiers have low variance and high bias (the decision surface has to be linear, a hyperplane).

Consider the question: is an object a tree?
- Too much capacity/variance, low bias: a botanist who memorizes will always say "no" to a new object (e.g., one with a different # of leaves).
- Not enough capacity/variance, high bias: a lazy botanist says "yes" whenever the object is green.
We want the middle ground.

Bias vs. variance: choosing the correct model capacity.