Feature engineering. Léon Bottou COS 424 4/22/2010

Transcription:

Feature engineering Léon Bottou COS 424 4/22/2010

Summary: I. The importance of features. II. Feature relevance. III. Selecting features. IV. Learning features.

I. The importance of features

Simple linear models. People like simple linear models with convex loss functions: training has a unique solution, and the models are easy to analyze and easy to debug. Which basis functions Φ? These are also called the features. Many basis functions: poor testing performance. Few basis functions: poor training performance in general, but good training performance if we pick the right ones; the testing performance is then good as well.
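
To make the tradeoff concrete, here is a minimal sketch (my addition, not part of the original slides) that fits a linear model on polynomial basis functions of increasing degree and compares training and test error on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.randn(200)        # toy target
x_train, x_test, y_train, y_test = x[:100], x[100:], y[:100], y[100:]

for degree in (1, 3, 5, 15):
    # The polynomial expansion plays the role of the basis functions Phi.
    phi = PolynomialFeatures(degree)
    model = LinearRegression().fit(phi.fit_transform(x_train), y_train)
    err_train = mean_squared_error(y_train, model.predict(phi.transform(x_train)))
    err_test = mean_squared_error(y_test, model.predict(phi.transform(x_test)))
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")
```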

Explainable models. Modelling for prediction: sometimes one builds a model for its predictions. The model is the operational system. Better prediction = $$$. Modelling for explanations: sometimes one builds a model in order to interpret its structure. The human acquires knowledge from the model, and then designs the operational system. (We need humans because our modelling technology is insufficient.) Selecting the important features: more compact models are usually easier to interpret, but a model optimized for explainability is not optimized for accuracy. Identification problem vs. emulation problem.

Feature explosion. Initial features: the initial pick of features is always an expression of prior knowledge. Images: pixels, contours, textures, etc. Signal: samples, spectrograms, etc. Time series: ticks, trends, reversals, etc. Biological data: DNA, marker sequences, genes, etc. Text data: words, grammatical classes and relations, etc. Combining features: combinations that a linear system cannot represent, such as polynomial combinations, logical conjunctions, decision trees. The total number of features then grows very quickly. Solutions: kernels (with caveats, see later); feature selection (but why should it work at all?)
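
To make the combinatorial growth concrete, here is a small counting illustration (my addition); the numbers of base features and polynomial degrees are arbitrary:

```python
from math import comb

def num_polynomial_features(d, degree):
    """Number of monomials of degree <= `degree` in d variables (constant term included)."""
    return comb(d + degree, degree)

for d in (10, 100, 1000):              # number of initial features
    for degree in (2, 3):              # order of the polynomial combinations
        print(f"{d:5d} base features, degree {degree}: "
              f"{num_polynomial_features(d, degree):,} combined features")
```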

II. Relevant features. Assume we know the distribution $p(X, Y)$. $Y$: output. $X$: input, all features. $X_i$: one feature. $R_i = X \setminus X_i$: all features but $X_i$.

Probabilistic feature relevance. Strongly relevant feature. Definition: $X_i \not\perp Y \mid R_i$. Feature $X_i$ brings information that no other feature contains. Weakly relevant feature. Definition: $X_i \not\perp Y \mid S$ for some strict subset $S$ of $R_i$. Feature $X_i$ brings information that also exists in other features. Feature $X_i$ brings information in conjunction with other features. Irrelevant feature. Definition: neither strongly relevant nor weakly relevant. This is stronger than $X_i \perp Y$; see the XOR example. Relevant feature. Definition: not irrelevant.

Interesting example. Two variables can be useless by themselves but informative together.
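
As a hedged illustration of this XOR-like situation (my own example, not from the slides), the snippet below estimates the mutual information of each variable with the label, alone and jointly, on synthetic data:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, size=10000)
x2 = rng.randint(0, 2, size=10000)
y = x1 ^ x2                                   # XOR label: depends on both features together

print("I(X1; Y)      ~", mutual_info_score(x1, y))            # ~0: useless alone
print("I(X2; Y)      ~", mutual_info_score(x2, y))            # ~0: useless alone
print("I((X1,X2); Y) ~", mutual_info_score(2 * x1 + x2, y))   # ~log 2: informative together
```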

Interesting example. Correlated variables may be useless by themselves.

Interesting example. Strongly relevant variables may be useless for classification.

Bad news. Forward selection: start with the empty set of features $S_0 = \emptyset$. Incrementally add features $X_t$ such that $X_t \not\perp Y \mid S_{t-1}$. This will find all strongly relevant features, but may not find some weakly relevant features (e.g. XOR). Backward selection: start with the full set of features $S_0 = X$. Incrementally remove features $X_t$ such that $X_t \perp Y \mid S_{t-1} \setminus X_t$. This will keep all strongly relevant features, but may eliminate some weakly relevant features (e.g. redundant ones). Finding all relevant features is NP-hard: it is possible to construct a distribution that demands an exhaustive search through all the subsets of features.
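
The sketch below (my addition, on synthetic discrete data) runs the forward selection just described, using plug-in estimates of conditional mutual information as the dependence test; the threshold and data are made up for the example:

```python
import numpy as np

def mutual_info(x, y):
    """Plug-in estimate of I(x; y) in nats for small discrete arrays."""
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            pxy = np.mean((x == vx) & (y == vy))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == vx) * np.mean(y == vy)))
    return mi

def cond_mutual_info(x, y, s_cols):
    """Plug-in estimate of I(x; y | s), where s_cols is a (possibly empty) list of conditioning features."""
    if not s_cols:
        return mutual_info(x, y)
    s = np.stack(s_cols, axis=1)
    total = 0.0
    for row in np.unique(s, axis=0):
        mask = np.all(s == row, axis=1)
        total += mask.mean() * mutual_info(x[mask], y[mask])
    return total

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(5000, 4))
y = X[:, 0] ^ X[:, 1]                    # XOR of the first two features; features 2 and 3 irrelevant

selected, threshold, improved = [], 0.01, True
while improved:
    improved = False
    for i in range(X.shape[1]):
        if i in selected:
            continue
        if cond_mutual_info(X[:, i], y, [X[:, j] for j in selected]) > threshold:
            selected.append(i)
            improved = True
            break

print("forward selection picked:", selected)                         # []: the XOR pair is missed
print("I(X1; Y | X0) =", cond_mutual_info(X[:, 1], y, [X[:, 0]]))    # ~log 2 once X0 is conditioned on
```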

III. Selecting features. How to select relevant features when $p(x, y)$ is unknown but data is available?

Selecting features from data. Training data is limited: restricting the number of features is a capacity control mechanism, and we may want to use only a subset of the relevant features. Notable approaches: feature selection using regularization; feature selection using wrappers; feature selection using greedy algorithms.

$L_0$ structural risk minimization. Algorithm: 1. For $r = 1 \dots d$, find the system $f_r \in S_r$ (models using at most $r$ features) that minimizes the training error. 2. Evaluate each $f_r$ on a validation set. 3. Pick $f^* = \arg\min_r E_{valid}(f_r)$. Note: the NP-hardness remains hidden in step (1).
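
A minimal sketch of this procedure (my addition; since the exact step 1 is NP-hard, the sketch approximates it with scikit-learn's greedy SequentialFeatureSelector, on made-up data with a hypothetical train/validation split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

best_score, best_r, best_support = -np.inf, None, None
for r in range(1, 8):
    # Step 1 (approximated): greedily pick r features by cross-validated score on the training half.
    selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                         n_features_to_select=r).fit(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(selector.transform(X_train), y_train)
    # Step 2: evaluate f_r on the validation set.
    score = model.score(selector.transform(X_valid), y_valid)
    print(f"r = {r}: validation accuracy {score:.3f}")
    # Step 3: keep the best validation performer.
    if score > best_score:
        best_score, best_r, best_support = score, r, selector.get_support()

print("picked r =", best_r, "using features", np.flatnonzero(best_support))
```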

$L_0$ structural risk minimization. Let $E_r = \min_{f \in S_r} E_{test}(f)$. The following result holds (Ng 1998): $E_{test}(f^*) \leq \min_{r=1 \dots d} \left[ E_r + \tilde{O}\!\left(\sqrt{\tfrac{h_r}{n_{train}}}\right) + \tilde{O}\!\left(\sqrt{\tfrac{r \log d}{n_{train}}}\right) \right] + O\!\left(\sqrt{\tfrac{\log d}{n_{valid}}}\right)$. Assume $E_r$ is already quite good for a low number of features $r$, meaning that few features are relevant. Then we can still find a good classifier if $h_r$ and $\log d$ are reasonable: we can filter an exponential number of irrelevant features.

$L_0$ regularisation. $\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) + \lambda \,\mathrm{count}\{w_j \neq 0\}$. This would be the same as $L_0$-SRM. But how can we optimize that?

$L_1$ regularisation. The $L_1$ norm is the first convex $L_p$ norm. $\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) + \lambda \|w\|_1$. Same logarithmic property (Tsybakov 2006): $L_1$ regularization can weed out an exponential number of irrelevant features. See also compressed sensing.
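
For concreteness, here is a small sketch (my addition) of $L_1$-regularized logistic regression used as a feature selector on synthetic data with many irrelevant inputs; the regularization strength is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 5 informative features hidden among 200 mostly irrelevant ones.
X, y = make_classification(n_samples=500, n_features=200, n_informative=5,
                           n_redundant=0, random_state=0)

# The L1 penalty drives most weights exactly to zero (C is the inverse of lambda).
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("nonzero weights with L1:", np.sum(l1_model.coef_ != 0))   # few features survive
print("nonzero weights with L2:", np.sum(l2_model.coef_ != 0))   # all features keep some weight
```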

$L_2$ regularisation. The $L_2$ norm is the same as the maximum margin idea. $\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) + \lambda \|w\|_2^2$. The logarithmic property is lost: this is a rotationally invariant regularizer! SVMs do not have magic properties for filtering out irrelevant features. They perform best when dealing with lots of relevant features.

$L_{1/2}$ regularization? $\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) + \lambda \|w\|_{1/2}$. This is non-convex and therefore hard to optimize. Initialize with the $L_1$ solution, then perform gradient steps. This is surely not optimal, but it gives sparser solutions than $L_1$ regularization! Works better than $L_1$ in practice. But this is a secret!
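
A rough sketch of the heuristic above (my own interpretation: squared loss, a smoothed penalty $\lambda \sum_j \sqrt{|w_j| + \epsilon}$ standing in for $L_{1/2}$, and made-up hyperparameters):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)
n = len(y)
lam, eps, lr = 0.1, 1e-2, 1e-3                       # made-up hyperparameters

# Step 1: initialize with an L1 (Lasso) solution.
w = Lasso(alpha=0.1).fit(X, y).coef_.copy()
print("weights above 1e-2 after L1 init:   ", np.sum(np.abs(w) > 1e-2))

# Step 2: gradient steps on (1/2n)||y - Xw||^2 + lam * sum_j sqrt(|w_j| + eps).
for _ in range(2000):
    grad_loss = X.T @ (X @ w - y) / n
    grad_pen = lam * np.sign(w) / (2.0 * np.sqrt(np.abs(w) + eps))
    w -= lr * (grad_loss + grad_pen)

# The non-convex penalty tends to push small weights further toward zero.
print("weights above 1e-2 after L1/2 steps:", np.sum(np.abs(w) > 1e-2))
```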

Wrapper approaches. Wrappers: assume we have chosen a learning system and algorithm. Navigate feature subsets by adding/removing features. Evaluate on the validation set. Backward selection wrapper: start with all features. Try removing each feature and measure the impact on the validation set. Remove the feature that causes the least harm. Repeat. Notes: there are many variants (forward, backtracking, etc.). Risk of overfitting the validation set. Computationally expensive. Quite effective in practice.
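
A minimal sketch of the backward selection wrapper (my addition, using logistic regression and a hypothetical train/validation split on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=15, n_informative=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

def valid_score(features):
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, features], y_tr)
    return model.score(X_va[:, features], y_va)

features = list(range(X.shape[1]))               # start with all features
while len(features) > 1:
    # Try removing each remaining feature; keep the removal that hurts the validation score least.
    scores = {f: valid_score([g for g in features if g != f]) for f in features}
    least_harmful = max(scores, key=scores.get)
    if scores[least_harmful] < valid_score(features):
        break                                    # every removal hurts: stop
    features.remove(least_harmful)

print("kept features:", sorted(features))
```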

Greedy methods. Algorithms that incorporate features one by one. Decision trees: each decision can be seen as a feature; pruning the decision tree prunes the features. Ensembles: ensembles of classifiers involving few features, such as random forests and boosting.
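
As a hedged illustration (my addition, assuming a recent scikit-learn where the keyword is `estimator`; it was `base_estimator` in older versions), boosting with depth-1 trees picks up features one at a time, and the snippet counts which features the ensemble actually uses:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)

# Each weak learner is a stump, i.e. a single threshold on a single feature.
ensemble = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0).fit(X, y)

used = sorted({stump.tree_.feature[0] for stump in ensemble.estimators_})
print("features used by the 50 stumps:", used)   # typically a small subset of the 30 inputs
```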

Greedy method example. The Viola-Jones face recognizer: lots of very simple features of the form $\sum_{r \in \mathrm{Rects}} \alpha_r \sum_{(i,j) \in r} x[i,j]$, i.e. weighted sums of pixel values over a few rectangles. Quickly evaluated by first precomputing the integral image $X_{i_0 j_0} = \sum_{i \leq i_0} \sum_{j \leq j_0} x[i,j]$. Run AdaBoost with weak classifiers based on these features.
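
A short sketch of the precomputation trick (my addition): the cumulative-sum table lets any rectangle sum be read off from at most four entries.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(32, 32)                              # toy grayscale image

# Integral image: S[i0, j0] = sum of x[i, j] over i <= i0 and j <= j0.
S = x.cumsum(axis=0).cumsum(axis=1)

def rect_sum(S, i0, j0, i1, j1):
    """Sum of x over rows i0..i1 and columns j0..j1 (inclusive), from four table lookups."""
    total = S[i1, j1]
    if i0 > 0:
        total -= S[i0 - 1, j1]
    if j0 > 0:
        total -= S[i1, j0 - 1]
    if i0 > 0 and j0 > 0:
        total += S[i0 - 1, j0 - 1]
    return total

# A Haar-like feature: difference between two adjacent rectangles.
feature = rect_sum(S, 4, 4, 11, 11) - rect_sum(S, 4, 12, 11, 19)
print(feature, "==", x[4:12, 4:12].sum() - x[4:12, 12:20].sum())   # equal up to float error
```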

IV. Feature learning

Feature learning in one slide. Suppose we have a weight on a feature X. Suppose we prefer a closely related feature X + ε.

Feature learning and multilayer models

Feature learning for image analysis. 2D convolutional neural networks. 1989: isolated handwritten digit recognition. 1991: face recognition, sonar image analysis. 1993: vehicle recognition. 1994: zip code recognition. 1996: check reading. [Figure: LeNet-5-style architecture. INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer of 120 units (full connection) → F6: layer of 84 units (full connection) → OUTPUT: 10 (Gaussian connections).]
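
A sketch of this kind of architecture in modern code (my addition, assuming PyTorch; average pooling and a plain linear output layer stand in for the original subsampling and Gaussian-connection layers):

```python
import torch
import torch.nn as nn

# LeNet-5-style convolutional network for 32x32 grayscale inputs.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # C1: 6 feature maps, 28x28
    nn.AvgPool2d(2),                              # S2: 6 maps, 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # C3: 16 maps, 10x10
    nn.AvgPool2d(2),                              # S4: 16 maps, 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),        # C5: 120 units
    nn.Linear(120, 84), nn.Tanh(),                # F6: 84 units
    nn.Linear(84, 10),                            # OUTPUT: 10 classes
)

scores = lenet(torch.randn(8, 1, 32, 32))         # a batch of 8 random "images"
print(scores.shape)                               # torch.Size([8, 10])
```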

Feature learning for face recognition. Note: more powerful but slower than Viola-Jones.

Feature learning revisited. Handcrafted features: result from knowledge acquired by the feature designer; this knowledge was acquired on multiple datasets associated with related tasks. Multilayer features: trained on a single dataset (e.g. CNNs); requires lots of training data, and interesting training data is expensive. Multitask/multilayer features: in the vicinity of an interesting task with costly labels there are related tasks with abundant labels. Example: face recognition and face comparison. More during the next lecture!