
Lecture 12, Oct 24th 2008

Review

- Linear SVM seeks to find a linear decision boundary that maximizes the geometric margin.
- Using the concept of soft margin, we can achieve a tradeoff between maximizing the margin and faithfully fitting all training examples, thus providing better handling of outliers/noisy training examples and (simple) cases that are not linearly separable.
- By mapping the data from the original input space into a higher dimensional feature space, we obtain nonlinear SVM. Let's see a bit more about this.

Nonlinear SVM

- The basic idea is to map the data into a new feature space such that the data is linearly separable in this space.
- However, we don't need to explicitly compute the mapping. Instead, we use what we call the kernel function (this is referred to as the kernel trick).
- What is a kernel function? A kernel function is k(a, b) = <φ(a), φ(b)>, where φ is a mapping.
- Why use a kernel function? The inner product in the mapped feature space, <φ(a), φ(b)>, is computed using the kernel function applied to the original input vectors, k(a, b). This avoids the computational cost incurred by the mapping.
- We have shown previously that using a kernel in the prediction stage is straightforward, but what about learning w (or, more directly, the α_i's)? Does this make any difference in the SVM learning step? Let's look at the optimization problem of the original SVM.
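To make the kernel trick concrete, here is a minimal sketch (Python; not part of the original slides) checking that the degree-2 polynomial kernel k(a, b) = <a, b>^2 equals the inner product of the explicit feature maps, so φ never has to be computed:

    import numpy as np

    def phi(v):
        # Explicit feature map for the degree-2 polynomial kernel in 2D:
        # phi(v) = (v1^2, v2^2, sqrt(2)*v1*v2)
        return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

    def k(a, b):
        # Kernel computed directly in the original input space
        return np.dot(a, b) ** 2

    a = np.array([1.0, 2.0])
    b = np.array([3.0, 4.0])
    print(np.dot(phi(a), phi(b)))  # ~121.0
    print(k(a, b))                 # 121.0: same value, no explicit mapping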

Soft margin SVM optimization

$$\min_{\mathbf{w},\,b,\,\xi}\quad \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w} \;+\; c\sum_{i=1}^{N}\xi_i$$

subject to:

$$y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \;\ge\; 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\ldots,N$$

Knowing that $\mathbf{w}=\sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i$, we can rewrite the optimization problem in terms of inner products:

$$\min_{\alpha,\,b,\,\xi}\quad \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\,y_i y_j\,\langle\mathbf{x}_i,\mathbf{x}_j\rangle \;+\; c\sum_{i=1}^{N}\xi_i$$

subject to:

$$y_i\Big(\sum_{j=1}^{N}\alpha_j y_j\,\langle\mathbf{x}_j,\mathbf{x}_i\rangle + b\Big) \;\ge\; 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\ldots,N$$

In fact, learning for SVM is typically carried out using only the inner products of the x_i, without using the original vectors. Now we just need to replace the inner products with an appropriate kernel function.
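As a concrete illustration (a sketch, not from the slides): a kernelized learner only ever touches the training data through the Gram matrix K of pairwise kernel values, which stands in for the inner products above.

    import numpy as np

    def gram_matrix(X, kernel):
        # K[i, j] = k(x_i, x_j): this is all a kernelized SVM solver
        # needs from the data; the original vectors never appear again.
        N = len(X)
        K = np.empty((N, N))
        for i in range(N):
            for j in range(N):
                K[i, j] = kernel(X[i], X[j])
        return K

    # Example with an RBF kernel (gamma chosen arbitrarily)
    rbf = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))
    X = np.random.randn(5, 2)
    K = gram_matrix(X, rbf)  # 5x5 matrix of kernel values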

What we have seen so far

- Linear SVM
  - Geometric margin vs. functional margin
  - Problem formulation
- Soft margin SVM
  - Include the slack variables in the optimization
  - c controls the tradeoff
- Nonlinear SVM using kernel functions
  - Kernel functions

Notes on Applying SVM

- Many SVM implementations are available and can be found at www.kernel-machine.org/software.html
- Handling a multiple-class problem with SVM requires transforming the multiclass problem into multiple binary-class problems:
  - One-against-rest
  - Pairwise
  - etc.
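As an illustration of the one-against-rest strategy, here is a minimal sketch (Python; scikit-learn's SVC is an assumed stand-in for any binary SVM, and is not mentioned in the slides):

    import numpy as np
    from sklearn.svm import SVC

    def one_vs_rest_fit(X, y):
        # Train one binary SVM per class: class c vs. everything else
        models = {}
        for c in np.unique(y):
            clf = SVC(kernel="rbf")  # any binary SVM would do
            clf.fit(X, (y == c).astype(int))
            models[c] = clf
        return models

    def one_vs_rest_predict(models, X):
        # Pick the class whose binary SVM is most confident
        classes = sorted(models)
        scores = np.column_stack(
            [models[c].decision_function(X) for c in classes])
        return np.array(classes)[np.argmax(scores, axis=1)]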

Model selection for SVM

- There are a number of model selection questions when applying SVM:
  - Which kernel function to use?
  - What c parameter to use (soft margin)?
- You can choose to use the default options provided by the software, but a more reliable approach is to use cross validation.
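For example, a cross-validated grid search over the kernel and the c parameter might look like this (a sketch using scikit-learn, which the slides do not mention; the grid values are arbitrary):

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # Candidate kernels and soft-margin constants (illustrative values)
    param_grid = {
        "kernel": ["linear", "rbf", "poly"],
        "C": [0.1, 1, 10, 100],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross validation
    # search.fit(X_train, y_train)   # X_train, y_train: your training data
    # print(search.best_params_)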

Strengths vs. weaknesses

Strengths:
- The solution is globally optimal.
- It scales well with high dimensional data.
- It can handle non-traditional data like strings and trees, instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel function for such input, we can apply SVM.

Weaknesses:
- Need to specify a good kernel.
- Training time can be long if you use the wrong software package.

Ensemble Learning

Ensemble Learning

- So far we have designed learning algorithms that take a training set and output a classifier.
- What if we want more accuracy than current algorithms afford?
  - Develop a new learning algorithm
  - Improve existing algorithms
- Another approach is to leverage the algorithms we have via ensemble methods: instead of calling an algorithm just once and using its classifier, call the algorithm multiple times and combine the multiple classifiers.

What is Ensemble Learning?

- Traditional: a single training set S is fed to one learner L, which outputs a single hypothesis h; a new example (x, ?) is then labeled y* = h(x).
- Ensemble method: from S we obtain different training sets and/or learning algorithms L_1, L_2, ..., L_S, learn hypotheses h_1, h_2, ..., h_S, and combine them into h* = F(h_1, h_2, ..., h_S); a new example (x, ?) is then labeled y* = h*(x).

Ensemble Learning

INTUITION: Combining the predictions of multiple classifiers (an ensemble) is more accurate than a single classifier.

Justification:
- It is easy to find quite good rules of thumb, but hard to find a single highly accurate prediction rule.
- If the training set is small and the hypothesis space is large, then there may be many equally accurate classifiers.
- The hypothesis space may not contain the true function, but it has several good approximations.
- Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.

How to generate an ensemble?

- There are a variety of methods developed. We will look at two of them:
  - Bagging
  - Boosting (Adaboost: adaptive boosting)
- Both of these methods take a single learning algorithm (we will call this the base learner) and use it multiple times to generate multiple classifiers.

Bagging: Bootstrap Aggregation (Breiman, 1996)

- Generate a random sample from the training set S by a random re-sampling technique called bootstrapping.
- Repeat this sampling procedure, getting a sequence of T training sets: S_1, S_2, ..., S_T.
- Learn a sequence of classifiers h_1, h_2, ..., h_T, one for each of these training sets, using the same base learner.
- To classify an unknown point x, let each classifier predict (e.g., h_1(x) = 1, h_2(x) = 1, h_3(x) = 0, ..., h_T(x) = 1) and take a simple majority vote to make the final prediction: predict the class that gets the most votes from all the learned classifiers.
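A minimal bagging sketch (Python, with scikit-learn decision trees as an assumed base learner; not part of the original slides):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, T=100, seed=0):
        # Train T classifiers, each on a bootstrap sample of (X, y)
        rng = np.random.default_rng(seed)
        N = len(X)
        classifiers = []
        for _ in range(T):
            idx = rng.integers(0, N, size=N)  # N indices, with replacement
            classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return classifiers

    def bagging_predict(classifiers, X):
        # Simple majority vote over the T predictions (binary 0/1 labels)
        votes = np.stack([h.predict(X) for h in classifiers])
        return (votes.mean(axis=0) > 0.5).astype(int)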

Bootstrapping

    S' = {}
    For i = 1, ..., N (N is the total number of points in S):
        draw a random point from S and add it to S'
    Return S'

- This is a sampling procedure that samples with replacement: each time a point is drawn, it is not removed, which means we can have multiple copies of the same data point in the sample.
- The new training set S' contains the same number of points as the original training set (but may contain repeats).
- On average, about 63.2% of the original points will appear in a given bootstrap sample.
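That coverage figure follows because the probability that a given point appears in a bootstrap sample is 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 for large N. A quick empirical check (an illustrative sketch, not from the slides):

    import random

    def bootstrap(S):
        # Sample len(S) points from S with replacement
        return [random.choice(S) for _ in range(len(S))]

    random.seed(0)
    N = 10000
    S = list(range(N))
    coverage = len(set(bootstrap(S))) / N
    print(coverage)  # ~0.632: fraction of original points that appear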

The true decision boundary

Decision Boundary by the CART Decision Tree Algorithm

Note that the decision tree has trouble representing this decision boundary.

By averaging 100 trees, we achieve a better approximation of the boundary, together with information regarding how confident we are about our predictions.

Empirical Results for Bagging Decision Trees (Freund & Schapire)

Each point represents the results of one data set. Why can bagging improve the classification accuracy?

The Concept of Bias and Variance

(Figure: illustration of a target, with bias and variance labeled.)

Bias/Variance for classifiers

- Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data.
- Variance arises when the classifier overfits the data: minor variations in the training set cause the classifier to overfit differently.
- Clearly you would like to have a low-bias and low-variance classifier! Typically, low-bias classifiers (overfitting) have high variance, and high-bias classifiers (underfitting) have low variance: we have a tradeoff.

Effect of Algorithm Parameters on Bias and Variance

- k-nearest neighbor: increasing k typically increases bias and reduces variance.
- Decision trees of depth D: increasing D typically increases variance and reduces bias.
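A small sketch of this effect for k-nearest neighbor (illustrative Python with scikit-learn; the dataset and k values are arbitrary). As k grows, training accuracy drops toward test accuracy: bias goes up, variance goes down.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for k in [1, 5, 25, 125]:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        # Small k: near-perfect training fit (low bias, high variance);
        # large k: smoother, more biased, but more stable predictions.
        print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))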

Why does bagging work?

- Bagging takes the average of multiple models, which reduces the variance.
- This suggests that bagging works best with low-bias and high-variance classifiers.