Vowpal Wabbit 7 Tutorial. Stephane Ross





Binary Classification and Regression: Input Format

Data in a text file (can be gzipped), one example per line:

Label [weight] |Namespace Feature ... Feature |Namespace ...

Namespace: string (can be empty)
Feature: string[:value] or int[:value]; strings are hashed to get an index; value is 1 by default; features not present have value 0
Weight: 1 by default
Label: use {-1,1} for classification, or any real value for regression

Examples:

1 | 1:0.43 5:2.1 10:0.1
-1 | I went to school
10 | race=white sex=male age:25
0.5 |optflow 1:0.3 2:0.4 |harris 1:0.5 2:0.9
1 0.154 | 53 1100 8567
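To make the layout above concrete, here is a small illustrative reader for lines of the form "label [weight] |Namespace feature[:value] ...". This is only a sketch of the text layout, not VW's actual parser (which also hashes feature strings to indices):

```python
# Illustrative reader for VW's text format (a sketch, not VW's parser):
# "label [weight] |Namespace feature[:value] ... |Namespace ...".

def parse_features(tokens):
    feats = {}
    for tok in tokens:
        name, _, value = tok.partition(":")
        feats[name] = float(value) if value else 1.0  # value is 1 by default
    return feats

def parse_example(line):
    header, *sections = line.split("|")
    head = header.split()
    label = float(head[0])
    weight = float(head[1]) if len(head) > 1 else 1.0  # weight is 1 by default
    namespaces = {}
    for sec in sections:
        # a space right after '|' means the (empty-named) default namespace
        name = "" if sec.startswith(" ") else sec.split()[0]
        tokens = sec.split()[1:] if name else sec.split()
        namespaces[name] = parse_features(tokens)
    return label, weight, namespaces

print(parse_example("0.5 |optflow 1:0.3 2:0.4 |harris 1:0.5 2:0.9"))
```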

Binary Classification and Regression: Training

Train on dataset train.txt:
./vw -d train.txt

-d filepath : load data from filepath
--passes n : iterate over the dataset n times
-c : create a binary cache of the dataset for faster loading next time (required with --passes n for n > 1)

./vw -d train.txt -c --passes 10

General Options: Saving, Loading and Testing Predictors

-f filepath : where to save the final predictor
./vw -d train.txt -c --passes 10 -f predictor.vw

-i filepath : initial predictor to load; a 0 vector when unspecified
-t : test the predictor on the data (no training)
./vw -d test.txt -t -i predictor.vw

-p filepath : where to save predictions
./vw -d test.txt -t -i predictor.vw -p predictions.txt

--readable_model filepath : save the predictor in text format
./vw -d train.txt -c --passes 10 --readable_model p.txt

General Options: Loss Functions

--loss_function loss : loss in {squared, logistic, hinge, quantile}; squared is the default

Train an SVM (labels must be in {-1,1}):
./vw -d train.txt -f svm.vw --loss_function hinge

Train a logistic regressor (labels must be in {-1,1}):
./vw -d train.txt -f lg.vw --loss_function logistic

General Options: L1 and L2 Regularization

--l1 value : default is 0
--l2 value : default is 0

Train a regularized SVM:
./vw -d train.txt -f svm.vw --loss_function hinge --l2 0.1

General Options: Update Rule Options

-l s : scales the learning rate; default 0.5, or 10 for the non-default rules
--initial_t i : default 1 or 0, depending on adaptive GD
--power_t p : default 0.5

For SGD this means the learning rate is α_t = s(i/(i + t))^p. The more complex adaptive update rules have a similar effect.

./vw -d train.txt --sgd -l 0.1 --initial_t 10 --power_t 1
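As a quick sanity check of the schedule α_t = s(i/(i + t))^p, here is a plain-Python sketch, with s, i and p mirroring the -l, --initial_t and --power_t values in the example command above:

```python
# Sketch of the SGD learning-rate schedule alpha_t = s * (i / (i + t)) ** p,
# with s from -l, i from --initial_t and p from --power_t.

def learning_rate(t, s=0.1, i=10.0, p=1.0):
    """Learning rate used on the t-th example (t starts at 0)."""
    return s * (i / (i + t)) ** p

print(learning_rate(0))   # full rate s at the start
print(learning_rate(10))  # halved once t reaches initial_t
print(learning_rate(90))  # keeps decaying as t grows
```

Larger i keeps the rate near s for longer; larger p makes it decay faster.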

General Options: Update Rule Options

The default is the normalized adaptive invariant update rule. Any combination of --adaptive, --invariant and --normalized can be specified:

--adaptive : uses the sum of gradients to decrease the learning rate (Duchi)
--invariant : importance-aware updates (Nikos)
--normalized : updates adjusted for the scale of each feature

Or use --sgd for the standard update rule.

./vw -d train.txt --adaptive -l 1
./vw -d train.txt --adaptive --invariant -l 1
./vw -d train.txt --normalized -l 1

General Options: Other Useful Options

-b n : default n=18; log2 of the number of weight parameters; increase it to reduce collisions from hashing
-q ab : quadratic features between all features in namespaces starting with a and namespaces starting with b
--ignore a : removes features from namespaces starting with a

0.5 |optflow 1:0.3 2:0.4 |harris 1:0.5 2:0.9
./vw -d train.txt -q oo -q oh

Multiclass Classification: Input Format

One example per line:

Label [weight] |Namespace feature ... feature |Namespace ...

Label in {1,2,...,k}

Examples:

1 | 1 5 6
3 | 2 7
2 | 2 4 5 6

Multiclass Classification: Training and Testing

Implemented as a reduction to binary classification; 2 options:

--oaa k : One Against All, k = nb classes
--ect k : Error-Correcting Tournament / Filter Tree, k = nb classes (with --ect, the optional --error n sets the nb of errors tolerated by the code; default n=0)

./vw -d train.txt --oaa k
./vw -d train.txt --ect 10 --error 2 -f predictor.vw
./vw -d test.txt -t -i predictor.vw

Cost-Sensitive Multiclass Classification: Input Format

One example per line:

Label ... Label |Namespace feature ... |Namespace feature ...

Label format: id[:cost], with id in {1,2,...,k}; cost is 1 by default
Can specify only a subset of the labels (when the choice must be made within that subset)

Examples:

1:0.5 2:1.3 3:0.1 | 1 5 6
2:0.1 3:0.5 | 2 6
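A sketch (my illustration, not VW code) of how the id[:cost] label list decodes, with the default cost of 1 applied to bare ids:

```python
# Parse the label part of a cost-sensitive example: "id[:cost] ... id[:cost]".

def parse_cs_labels(label_part):
    labels = {}
    for token in label_part.split():
        class_id, _, cost = token.partition(":")
        labels[int(class_id)] = float(cost) if cost else 1.0  # cost is 1 by default
    return labels

print(parse_cs_labels("1:0.5 2:1.3 3:0.1"))  # costs for classes 1, 2, 3
print(parse_cs_labels("2:0.1 3:0.5"))        # only a subset of the classes
```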

Cost-Sensitive Multiclass Classification: Training and Testing

Implemented as a reduction; 2 options:

--csoaa k : Cost-Sensitive One Against All (reduction to regression), k = nb classes
--wap k : Weighted All Pairs (reduction to weighted binary classification)

./vw -d train.txt --csoaa k
./vw -d train.txt --wap 10 -f predictor.vw
./vw -d test.txt -t -i predictor.vw

Offline Contextual Bandit: Input Format

Cost-sensitive multiclass where the cost is observed for only 1 action per example; the data is collected a priori by an exploration policy. One example per line:

Label |Namespace feature ... |Namespace feature ...

Label format: action:cost:prob, where action is in {1,2,...,k}, cost is the cost observed for this action, and prob is the probability that the action was chosen by the exploration policy in this context.

Examples:

1:0.5:0.25 | 1 5 6
3:2.4:0.5 | 2 6

Offline Contextual Bandit: Input Format

Can specify a subset of the allowed actions when not all are available:

1 2:1.5:0.3 4 | these are the features

Can specify costs for all unobserved actions, for testing the learned policy (only the action with a prob is used/observed for training):

1:0.2 2:1.5:0.3 3:2.5 4:1.3 | these are the features
1:2.2:0.5 4:1.4 | more features

Offline Contextual Bandit: Training and Testing

Implemented as a reduction to cost-sensitive multiclass.

./vw -d train.txt --cb k, with k = nb actions (arms)

Optional: --cb_type {ips,dm,dr} specifies how the cost vectors are generated:
ips : inverse propensity score (unbiased estimate of the costs using prob)
dm : direct method (regression to predict the unknown costs)
dr : doubly robust (a combination of the above 2); this is the default

The default cost-sensitive learner is --csoaa, but --wap can be used instead.

./vw -d train.txt --cb 10 --cb_type ips
./vw -d train.txt --cb 10 --cb_type dm --wap 10 -f predictor.vw
./vw -d test.txt -t -i predictor.vw
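To make the ips choice concrete, here is a sketch (my illustration, not VW's implementation) of an inverse-propensity-score cost vector built from one logged example: the observed cost is divided by the logging probability of the chosen action, and every other action gets 0, which makes the vector an unbiased estimate of the full costs in expectation over the exploration policy:

```python
# IPS cost vector from one bandit example with label "action:cost:prob".

def ips_cost_vector(action, cost, prob, k):
    """Unbiased cost estimates for all k actions (actions are 1-indexed)."""
    costs = [0.0] * k
    costs[action - 1] = cost / prob  # upweight the observed cost by 1/prob
    return costs

# The example labeled "1:0.5:0.25" with k = 4 actions:
print(ips_cost_vector(1, 0.5, 0.25, 4))  # [2.0, 0.0, 0.0, 0.0]
```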

Sequence Predictions: Input Format

Same format as multiclass for each prediction in a sequence; sequences are separated by an empty line in the file:

1 | This
2 | is
1 | a
3 | sequence

1 | Another
3 | sequence
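A sketch of reading this layout (plain Python, not part of VW): lines accumulate into the current sequence until an empty line closes it.

```python
# Group examples into sequences: an empty line ends the current sequence.

def read_sequences(lines):
    sequences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            sequences.append(current)
            current = []
    if current:  # flush a trailing sequence with no blank line after it
        sequences.append(current)
    return sequences

lines = ["1 | This", "2 | is", "1 | a", "3 | sequence", "",
         "1 | Another", "3 | sequence"]
print(read_sequences(lines))  # two sequences: one of length 4, one of length 2
```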

Sequence Predictions: Training and Testing

For SEARN:
./vw -d train.txt -c --passes 10 --searn k --searn_task sequence --searn_beta 0.5 --searn_passes_per_policy 2 -f policy.vw

For DAgger:
./vw -d train.txt -c --passes 10 --searn k --searn_task sequence --searn_as_dagger 0.0001 -f policy.vw

Testing:
./vw -d test.txt -t -i policy.vw

Sequence Predictions: Additional Features from Previous Predictions

--searn_sequencetask_history h : h = nb previous predictions to append; default 1
--searn_sequencetask_bigrams : use bigram features from the history; not used by default
--searn_sequencetask_features f : f = nb previous predictions to pair with the observed features; default 0
--searn_sequencetask_bigram_features : use bigrams on the history paired with the observed features; not used by default

Sequence Predictions: Reduction to Cost-Sensitive Multiclass

Searn/DAgger generates cost-sensitive multiclass examples. The costs come from rollouts of the current policy for each possible prediction at each step.

--searn_rollout_oracle : roll out the expert/oracle instead
--csoaa is the default cost-sensitive learner, but --wap can be used instead.
Can also be used with the contextual bandit learner (--cb), which rolls out only one sampled prediction rather than all of them at each step.

Sequence Predictions: Example

./vw -d train.txt -c --passes 10 --searn 20 --searn_task sequence --searn_as_dagger 0.0001 --searn_sequencetask_history 2 --searn_sequencetask_features 1 --cb 20 --cb_type ips --wap 20 -f policy.vw
./vw -d test.txt -t -i policy.vw

Note the impressive stack of reductions: Structured Prediction -> Contextual Bandit -> Cost-Sensitive Multiclass -> Weighted Binary.

Caution

The default -b 18 is often not enough with so many stacked reductions. The weight vectors of all the reductions are stored in the same weight vector, each accessed through a different offset added to the hash/feature index, so collisions can occur between weight vectors and features, especially with --searn or extreme multiclass problems. Having some collisions is not always bad: it forces the learner to generalize by looking at all the features, since it can't just rely on one. Tweak -b through validation on a held-out set.

Cool Trick to Deal with Collisions

Even when # parameters > # distinct features, collisions can occur. You can copy features many times, hashed to different indices, hoping that 1 copy is collision-free. This is easily done with -q:

./vw -d train.txt -q fc

1 |f These are features |c two three four
-1 |f More features |c two three four

Tweak both -b and the number of copies through validation on held-out data.
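A back-of-the-envelope estimate (my own, not from the tutorial) of why copies help: if one hashed feature collides with probability roughly q, and the copies land independently, then all c copies collide with probability about q^c.

```python
# Estimate: with n distinct features in 2^b slots, a given feature collides
# with probability about q = 1 - (1 - 2**-b) ** (n - 1); c independent
# copies all collide with probability about q**c.

def all_copies_collide(n, b, copies):
    q = 1 - (1 - 2.0 ** -b) ** (n - 1)
    return q ** copies

n, b = 200_000, 18
for c in (1, 2, 3):
    print(c, all_copies_collide(n, b, c))  # drops quickly as copies are added
```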

Much More

--lda : latent Dirichlet allocation
--bfgs : use batch L-BFGS instead of stochastic gradient descent
--rank : matrix factorization
--csoaa_ldf, --wap_ldf : cost-sensitive multiclass with label-dependent features
--active_learning : active learning
Use VW as a library inside another program