Interactive Machine Learning. Maria-Florina Balcan





Machine Learning. Image classification, document categorization, speech recognition, protein classification, branch prediction, fraud detection, spam detection.

Supervised Classification. Example: decide which emails are spam and which are important. Goal: use past emails to produce a good rule for future data.

Example: Supervised Classification. Represent each message by features (e.g., keywords, spelling, etc.); each example is a feature vector with a label. Reasonable rules: predict SPAM if (unknown sender) AND (money OR pills); predict SPAM if 2·money + 3·pills − 5·known > 0. The data are linearly separable.
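As a toy illustration of such a linear rule (a sketch with a hypothetical feature extractor; the weights 2, 3, −5 follow the slide's example):

```python
def extract_features(words, known_sender):
    """Map a message to the three 0/1 features used on the slide."""
    return {
        "money": 1 if "money" in words else 0,
        "pills": 1 if "pills" in words else 0,
        "known": 1 if known_sender else 0,
    }

def predict_spam(f):
    """Predict SPAM when 2*money + 3*pills - 5*known > 0."""
    return 2 * f["money"] + 3 * f["pills"] - 5 * f["known"] > 0

# A message mentioning money and pills from an unknown sender scores 2 + 3 = 5 > 0.
print(predict_spam(extract_features({"cheap", "money", "pills"}, known_sender=False)))  # True
```

A message from a known sender mentioning both words scores 2 + 3 − 5 = 0, so it is not flagged: the "known" feature outweighs the spammy keywords.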

Two Main Aspects in Machine Learning. Algorithm design: how to optimize? Automatically generate rules that do well on observed data. Confidence bounds, generalization guarantees: confidence for rule effectiveness on future data.

Supervised Learning Formalization. Classic models: PAC (Valiant), SLT (Vapnik). X is the feature space; S = {(x, l)} is a set of labeled examples drawn i.i.d. from a distribution D over X and labeled by a target concept c*. Do optimization over S to find a hypothesis h ∈ C. Goal: h has small error over D, where err(h) = Pr_{x∼D}[h(x) ≠ c*(x)]. If c* ∈ C, this is the realizable case; otherwise it is the agnostic case.

Sample Complexity Results. Finite C, realizable: m = O((1/ε)(ln|C| + ln(1/δ))) examples suffice. Infinite C, realizable: m = O((1/ε)(VCdim(C) ln(1/ε) + ln(1/δ))). Agnostic: replace 1/ε with 1/ε².

Supervised Learning, Infinite C, Realizable. Lots of work on tighter bounds [e.g., Haussler, Littlestone, Warmuth 94; Bartlett, Bousquet, Mendelson 05; Giné and Koltchinskii 06]. Lower bounds match the upper bounds on worst-case distributions. Sample complexity is pretty well understood, up to pesky gaps. Powerful algorithms as well (AdaBoost and SVM).

Classic Paradigm Insufficient Nowadays. Modern applications have massive amounts of raw data (protein sequences, billions of webpages, images); only a tiny fraction can be annotated by human experts.

Active Learning. The learning algorithm draws unlabeled examples from the data source, and repeatedly sends the expert/oracle a request for the label of an example and receives a label for that example; finally the algorithm outputs a classifier. The learner can choose specific examples to be labeled; it works harder in order to use fewer labeled examples.
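The interaction loop can be sketched as a generic pool-based active learner; `oracle`, `train`, and `uncertainty` are hypothetical stand-ins for the expert, the supervised learner, and the query-selection strategy:

```python
import random

def active_learning_loop(unlabeled_pool, oracle, train, uncertainty, budget):
    """Generic pool-based active learning: repeatedly pick the example the
    current model is least sure about, ask the expert for its label,
    retrain, and finally output the classifier."""
    labeled = []
    model = None
    pool = list(unlabeled_pool)
    for _ in range(budget):
        if not pool:
            break
        # choose the example whose label we expect to learn the most from
        if model is not None:
            x = max(pool, key=lambda p: uncertainty(model, p))
        else:
            x = random.choice(pool)
        pool.remove(x)
        labeled.append((x, oracle(x)))   # one query to the expert
        model = train(labeled)           # retrain on all labels so far
    return model
```

With an uncertainty measure that prefers points near the current decision boundary, the loop concentrates its label requests where they are most informative.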

Active Learning. Do unlabeled data and interaction help? When, why, and by how much? Foundations were lacking a few years ago. Significant progress recently, mostly on sample complexity. Still lots of exciting open questions.

Outline of My Talk. Brief history: disagreement-based active learning. Aggressive active learning of linear separators. Interactive learning more generally: connections to passive learning (sample complexity and algorithms) and to distributed learning (communication-efficient protocols).

Can adaptive querying help? Noise-Free Case. Useful in practice, e.g., Active SVM [Tong & Koller, ICML 99]. Canonical theoretical example [CAL 92, Dasgupta 04]: thresholds on the line. Active algorithm: sample O(1/ε) unlabeled examples; do binary search. Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold. Active: only O(log 1/ε) labels. Exponential improvement.

Beyond thresholds: for homogeneous linear separators under the uniform distribution, QBC [Freund et al., 97] and the Active Perceptron [Dasgupta, Kalai, Monteleoni 05] achieve similar label savings. In general, the number of queries needed depends on C and P.
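A sketch of the binary-search idea for thresholds on the line (function and variable names are mine, not from the talk): sort the unlabeled pool and locate the label boundary with O(log n) label queries instead of labeling all n points.

```python
def learn_threshold_actively(points, query_label):
    """The target is a threshold on the line (label is + iff x >= some c).
    Sort the unlabeled pool and binary-search for the boundary, using
    O(log n) label queries instead of labeling all n points."""
    xs = sorted(points)
    lo, hi = 0, len(xs)            # invariant: first positive index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]):   # positive: boundary is at or left of mid
            hi = mid
        else:                      # negative: boundary is right of mid
            lo = mid + 1
    # xs[:lo] are negative, xs[lo:] are positive
    return (xs[lo] if lo < len(xs) else None), queries
```

On a pool of 1000 points this uses at most 10 label queries, versus 1000 for passive labeling of the whole pool.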

When Active Learning Helps: Agnostic Case. A², the first algorithm which is robust to noise [Balcan, Beygelzimer, Langford, ICML 06; JCSS 08]. Region-of-disagreement style: pick a few points at random from the current region of uncertainty, query their labels, and throw out a hypothesis if you are statistically confident it is suboptimal. Guarantees for A² [BBL 06, 08]: fall-back guarantees and exponential improvements. C = thresholds, low noise: exponential improvement. C = homogeneous linear separators in R^d, D uniform, low noise: only d² log(1/ε) labels. A lot of subsequent work, e.g., Hanneke 07; Dasgupta, Hsu, Monteleoni 07; Wang 09; Friedman 09; Koltchinskii 10; Balcan, Hanneke, Wortman 08; Beygelzimer, Hsu, Langford, Zhang 10; Hanneke 10; El-Yaniv and Wiener 12; Minsker 12; Ailon, Begleiter, Ezra 12.

A², the first active algorithm robust to noise. Disagreement based: pick a few points at random from the current region of uncertainty, query their labels, and throw out a hypothesis if you are statistically confident it is suboptimal. General guarantees for A² in terms of the disagreement coefficient [Hanneke 07; BBL 06, 08], covering the realizable case and linear separators under the uniform distribution.

When Active Learning Helps. Lots of exciting activity in recent years. Several specific analyses for the realizable case, e.g., linear separators under the uniform distribution: QBC [Freund et al., 97], Active Perceptron [Dasgupta, Kalai, Monteleoni 05]. Generic algorithms that work even in the agnostic case and under various noise conditions: A² [Balcan, Beygelzimer, Langford 06; Hanneke 07], the DKM algorithm [Dasgupta, Hsu, Monteleoni 07; Hanneke 10], [Wang 09], and Koltchinskii's algorithm [Koltchinskii 10].
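A minimal sketch of the disagreement-based idea in the realizable case (a toy finite class of thresholds stands in for C; names are hypothetical): maintain the version space of consistent hypotheses and query a point's label only when surviving hypotheses disagree on it.

```python
def cal_active_learner(hypotheses, stream, query_label):
    """Disagreement-based active learning (realizable case): maintain the
    version space; for each arriving unlabeled point, query its label only
    if surviving hypotheses disagree on it, then prune inconsistent ones."""
    version_space = list(hypotheses)
    labels_used = 0
    for x in stream:
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:              # x is in the region of disagreement
            y = query_label(x)
            labels_used += 1
            version_space = [h for h in version_space if h(x) == y]
        # if all surviving hypotheses agree, the label is determined: no query
    return version_space, labels_used
```

On a stream of 100 points with an 11-threshold class, only a handful of labels are requested; points outside the region of disagreement cost nothing.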

Aggressive Active Learning. This talk: C = homogeneous linear separators in R^d, D log-concave. Realizable case (and bounded noise): exponential improvement, only O(d log 1/ε) labels to find a hypothesis with error ε. Tsybakov noise: polynomial improvement. [Balcan, Broder, Zhang, COLT 07; Balcan, Long, COLT 13]. Log-concave distributions: the log of the density function is concave. A wide class: includes the uniform distribution over any convex set, Gaussian distributions, the logistic distribution, etc. They have played a major role in sampling, optimization, integration, and learning [LV 07, KKMS 05, KLT 09].

Margin Based Active-Learning, Realizable Case. Algorithm: draw m_1 unlabeled examples, label them, and add them to W(1). Iterate k = 2, …, s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W(k).

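The iteration above can be sketched in 2-D as follows (the γ-halving schedule and the perceptron-style consistency step are illustrative choices of mine, not the paper's exact parameters):

```python
import math
import random

def fit_consistent(W, w):
    """Perceptron passes until consistent with the labeled set W (realizable data)."""
    w = list(w)
    for _ in range(1000):
        mistakes = 0
        for (x, y) in W:
            pred = 1 if x[0] * w[0] + x[1] * w[1] >= 0 else -1
            if pred != y:
                w[0] += y * x[0]
                w[1] += y * x[1]
                mistakes += 1
        if mistakes == 0:
            break
    norm = math.hypot(w[0], w[1]) or 1.0
    return (w[0] / norm, w[1] / norm)   # return a unit vector

def margin_based_active(unlabeled, query_label, rounds, m_per_round):
    """Each round: label only points inside the margin band of the current
    separator, then refit a separator consistent with all labels so far."""
    W, w, gamma = [], (1.0, 0.0), 1.0
    pool = list(unlabeled)
    for _ in range(rounds):
        band = [x for x in pool if abs(x[0] * w[0] + x[1] * w[1]) <= gamma]
        for x in band[:m_per_round]:
            W.append((x, query_label(x)))
            pool.remove(x)
        w = fit_consistent(W, w)
        gamma /= 2   # shrink the band (illustrative schedule)
    return w
```

Because later rounds only label points near the current separator, the label budget is spent where the remaining uncertainty lives, mirroring the band-sampling step of the algorithm.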

Margin Based Active-Learning, Realizable Case Theorem If P X log-concave in R d. and iterations w s has error. then after

Linear Separators, Log-Concave Distributions. Fact 1: the disagreement d(u,v) = Pr_x[sign(u·x) ≠ sign(v·x)] between two homogeneous linear separators u and v is proportional to the angle θ(u,v) between them. Proof idea: project the region of disagreement onto the 2-dimensional space spanned by u and v, and use properties of log-concave distributions in 2 dimensions. Fact 2: for any unit vector v, the probability mass of the band {x : |v·x| ≤ γ} is proportional to γ.

Linear Separators, Log-Concave Distributions. Fact 3: if u and v are close in angle, then outside a band around v's boundary of width proportional to θ(u,v), the region E where u and v disagree has small probability mass. Proof idea: again project the region of disagreement onto the space spanned by u and v.

Margin Based Active-Learning, Realizable Case. Recall the iteration for k = 2, …, s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W(k). Proof idea: by induction, all w consistent with W(k) have error at most 1/2^k, so w_k has error at most 1/2^k. For the inductive step we need every w consistent with W(k+1) to have error at most 1/2^{k+1}.

Proof Idea (cont.). Under the uniform distribution, for any w consistent with W(k), bound its error separately inside and outside the band of width γ_{k-1} around w_{k-1}: outside the band, w and w_{k-1} rarely disagree (Fact 3); inside the band, it is enough to ensure a small constant conditional error, which can be done with only Õ(d) labels per round.

Theorem (Passive Learning). Any passive learning algorithm that outputs a hypothesis w consistent with m = O((d + log(1/δ))/ε) examples satisfies err(w) ≤ ε with probability at least 1−δ. This solves an open question for the uniform distribution [Long 95, 03; Bshouty 09]: the first tight bound for poly-time PAC algorithms for an infinite class of functions under a general class of distributions [cf. Ehrenfeucht et al., 1989; Blumer et al., 1989].

High-Level Idea. Run the algorithm online and use the intermediate hypotheses w to track progress. A consistent algorithm performs well even if it periodically builds w using some of the examples and then uses only the borderline cases for the preliminary classifiers for further training. Carefully distribute δ, allowing a higher probability of failure in later stages: once w is already pretty good, it takes longer to get examples that help to further improve it.


Margin Based AL, Bounded Noise. Guarantee: assume P_X is log-concave in R^d, that |P(Y=1|x) − P(Y=−1|x)| ≥ β for all x (for some constant β > 0), and that w* is the Bayes classifier. Then the previous algorithm and proof extend naturally, and we again get an exponential improvement.


Margin Based AL, Summary. Extensions to nearly log-concave distributions and to noisy settings; matching lower bounds. Broadens the class of problems for which AL provides an exponential improvement in 1/ε (without an additional increase in d). Interesting implications for passive learning: the first tight bound for a poly-time PAC algorithm for an infinite class of functions under a general class of data distributions [Balcan-Long, COLT 13]. A new algorithm that tolerates malicious noise at significantly better levels than previously known even for passive learning [KKMS 05, KLS 10]; it is also active, with much better label sample complexity [Awasthi-Balcan-Long 13].

Important direction: richer interactions with the expert, for better accuracy, fewer queries, and more natural interaction.

New Types of Interaction [Balcan-Hanneke COLT 12]. Class conditional query: the learner shows the expert labeler raw data together with a class and asks for an example of that class. Mistake query: the learner shows the expert labeler raw data with the classifier's proposed labels (e.g., dog, cat, penguin, wolf) and asks the expert to point out a mistake.

Class Conditional & Mistake Queries. Used in practice, e.g., Faces in iPhoto, but with a lack of theoretical understanding. Realizable case (folklore): far fewer queries than label requests.

Class Conditional Query. Folklore result (realizable case): run the halving algorithm.
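A sketch of the folklore halving idea with a mistake-style query (a toy finite class of thresholds; `find_mistake` stands in for the expert, who either points out an example the current majority-vote prediction gets wrong or confirms there is none):

```python
def halving_with_mistake_queries(hypotheses, sample, find_mistake):
    """Realizable case: predict with the majority vote of the version space.
    Every mistake the expert reports eliminates at least half of the
    surviving hypotheses, so roughly log2(|C|) queries suffice."""
    version_space = list(hypotheses)
    queries = 0
    while True:
        # majority-vote prediction on each point of the unlabeled sample
        preds = {x: sum(1 if h(x) else -1 for h in version_space) >= 0
                 for x in sample}
        queries += 1
        x = find_mistake(preds)   # an x where preds[x] is wrong, or None
        if x is None:
            return preds, queries
        # the majority was wrong at x: drop every hypothesis that voted with it
        version_space = [h for h in version_space if h(x) != preds[x]]
```

Since the majority is at least half of the version space, each reported mistake removes at least half of the surviving hypotheses, which is exactly why so few queries are needed compared to labeling the sample point by point.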


Formal Model. FindMistake(S, h): using k CCQs, determine whether h makes a mistake on the set S (and if so, return one).

Agnostic Case, Upper Bound. Eliminate h when it makes at least one mistake on some number out of several sets chosen at random from U, rather than eliminating h for a single mistake as in halving.

Agnostic Case, Upper Bound: Proof Sketch. The best h in V survives, since it does not make mistakes on too many of the random sets.

Upper Bound, Agnostic Case. Phase 1: a robust halving algorithm. Phase 2: a simple refining algorithm; we recover the true labels of an unlabeled sample and run ERM.

Class Conditional & Mistake Queries [Balcan-Hanneke, COLT 12]. These techniques were adapted to obtain the first agnostic communication-efficient protocols in a distributed learning setting [Balcan-Blum-Fine-Mansour COLT 12].

Active Learning Discussion, Open Directions. Rates for aggressive AL with optimal query complexity in general settings: in [BL 13] and [BBZ 07], aggressive localization in instance space leads to aggressive localization in concept space; can this be generalized? Computationally efficient algorithms for noisy settings (recent progress in [BF 13], [ABL 13]). More general queries: efficient algorithms for CCQs, more general and natural query types, shallow/deep learning strategies.