67577 Intro. to Machine Learning Fall semester, 2008/9. Lecture 11: PAC II. Lecturer: Amnon Shashua Scribe: Amnon Shashua


The result of the PAC model (also known as the formal learning model) is that if the concept class C is PAC-learnable, then the learning strategy must simply consist of gathering a sufficiently large training sample S of size m > m_0(ɛ, δ), for given accuracy ɛ > 0 and confidence 0 < δ < 1 parameters, and finding a hypothesis h ∈ C which is consistent with S. The learning algorithm is then guaranteed to have a bounded error err(h) < ɛ with probability 1 − δ. The error measurement includes data not seen during the training phase.

This state of affairs also holds (with some slight modifications of the sample complexity bounds) when there is no consistent hypothesis (the unrealizable case). In this case the learner simply needs to minimize the empirical error êrr(h) on the training sample S, and if m is sufficiently large then the learner is guaranteed to have err(h) < Opt(C) + ɛ with probability 1 − δ. The measure Opt(C) is defined as the minimal err(g) over all g ∈ C; note that in the realizable case Opt(C) = 0. More details are in Lecture 10.

The property of bounding the true error err(h) by minimizing the sample error êrr(h) is very convenient. The fundamental question is: under what conditions does this type of generalization property apply? We saw in Lecture 10 that a satisfactory answer can be provided when the cardinality of the concept space is bounded, i.e. |C| < ∞, which happens for a Boolean concept space, for example. In that lecture we proved that

m_0(ɛ, δ) = O((1/ɛ) ln(|C|/δ))

is sufficient for guaranteeing a learning model in the formal sense, i.e., one which has the generalization property described above.

In this lecture and the one that follows we have two goals in mind. The first is to generalize the result from finite concept class cardinality to infinite cardinality; note that the bound above is not meaningful when |C| = ∞. Can we learn, in the formal sense, any non-trivial infinite concept class?
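To make the finite-class bound concrete, here is a minimal sketch. The lecture states the bound only up to O(·), so the constant factor of 1 and the example class size (Boolean conjunctions over 10 literals, |C| = 3^10) are assumptions of the sketch:

```python
import math

def sample_complexity(card_C, eps, delta):
    """Sufficient sample size for PAC learning a finite concept class in the
    realizable case: m0 = (1/eps) * ln(|C| / delta), up to constant factors."""
    return math.ceil((1.0 / eps) * math.log(card_C / delta))

# Hypothetical example: Boolean conjunctions over n = 10 literals, |C| = 3^10.
m = sample_complexity(3 ** 10, eps=0.1, delta=0.05)  # a few hundred examples
```

Note how the dependence on |C| is only logarithmic, which is why the bound is useful for large finite classes but collapses when |C| = ∞.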
(We already saw an example of a PAC-learnable infinite concept class: the class of axes-aligned rectangles.) In order to answer this question we will need a general measure of concept class complexity which will replace the cardinality term |C| in the sample complexity bound m_0(ɛ, δ). It is tempting to assume that the number of parameters which fully describe the concepts of C can serve as such a measure, but we will show that in fact one needs a more powerful measure called the Vapnik-Chervonenkis (VC) dimension. Our second goal is to pave the way and provide the theoretical foundation for the large margin principle algorithm (SVM) we derived in Lectures 6 and 7.

11.1 The VC Dimension

The basic principle behind the VC dimension measure is that although C may have infinite cardinality, the restriction of the application of concepts in C to a finite sample S has a finite outcome.

This outcome is typically governed by an exponential growth with the size m of the sample S, but not always. The point at which the growth stops being exponential is when the complexity of the concept class C has exhausted itself, in a manner of speaking.

We will assume C is a concept class over the instance space X, both of which can be infinite. We also assume that the concept class maps instances in X to {0, 1}, i.e., the input instances are mapped to positive or negative labels. A training sample S is drawn i.i.d. according to some fixed but unknown distribution D, and S consists of m instances x_1, ..., x_m. In our notation we will try to reserve c ∈ C to denote the target concept and h ∈ C to denote some concept. We begin with the following definition:

Definition 1

Π_C(S) = {(h(x_1), ..., h(x_m)) : h ∈ C},

which is a set of vectors in {0, 1}^m. Π_C(S) is a set whose members are m-dimensional Boolean vectors induced by functions of C. These members are often called dichotomies or behaviors on S induced or realized by C. If C makes a full realization then Π_C(S) will have 2^m members. An equivalent description is as a collection of subsets of S:

Π_C(S) = {h ∩ S : h ∈ C},

where each h ∈ C makes a partition of S into two sets, the positive and the negative points. The set Π_C(S) therefore contains subsets of S (the positive points of S under h); a full realization will provide Σ_{i=0}^{m} (m choose i) = 2^m subsets. We will use both descriptions of Π_C(S), as a collection of subsets of S and as a set of vectors, interchangeably.

Definition 2 If |Π_C(S)| = 2^m then S is said to be shattered by C. In other words, S is shattered by C if C realizes all possible dichotomies of S.
Consider as an example a finite concept class C = {c_1, ..., c_4} applied to three instance vectors, with the results:

      x_1  x_2  x_3
c_1    1    1    1
c_2    0    1    1
c_3    1    0    0
c_4    0    0    0

Then,

Π_C({x_1}) = {(0), (1)}                          shattered
Π_C({x_1, x_3}) = {(0,0), (0,1), (1,0), (1,1)}   shattered
Π_C({x_2, x_3}) = {(0,0), (1,1)}                 not shattered

With these definitions we are ready to describe the measure of concept class complexity.

Definition 3 (VC dimension) The VC dimension of C, denoted VCdim(C), is the cardinality d of the largest set S shattered by C. If arbitrarily large sets S can be shattered by C, then VCdim(C) = ∞:

VCdim(C) = max{d : there exists |S| = d with |Π_C(S)| = 2^d}.
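Definitions 1 and 2 (and the small example above) can be checked mechanically. In the sketch below each concept of the finite class is represented as an explicit lookup table from instance to label; that representation is an assumption of the sketch, not part of the lecture:

```python
from itertools import product

def dichotomies(concepts, points):
    """Pi_C(S): the set of label vectors induced on `points` by `concepts`.
    Each concept is given as a dict mapping point -> {0, 1}."""
    return {tuple(c[x] for x in points) for c in concepts}

def is_shattered(concepts, points):
    """S is shattered by C iff |Pi_C(S)| = 2^|S| (Definition 2)."""
    return dichotomies(concepts, points) == set(product((0, 1), repeat=len(points)))

# The finite class from the example above: c_1..c_4 on x_1, x_2, x_3.
C = [
    {"x1": 1, "x2": 1, "x3": 1},  # c_1
    {"x1": 0, "x2": 1, "x3": 1},  # c_2
    {"x1": 1, "x2": 0, "x3": 0},  # c_3
    {"x1": 0, "x2": 0, "x3": 0},  # c_4
]
```

Running `is_shattered` on the three subsets reproduces the shattered / not shattered verdicts of the example.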

The VC dimension of a class of functions C is the point d beyond which samples S with cardinality |S| > d are no longer shattered by C. As long as C shatters S it manifests its full richness, in the sense that one can obtain from S all possible results (dichotomies). Once that ceases to hold, i.e., when |S| > d, it means that C has exhausted its richness (complexity). An infinite VC dimension means that C maintains full richness for all sample sizes. Therefore, the VC dimension is a combinatorial measure of the complexity of a function class.

Before we consider a number of examples of geometric concept classes and their VC dimensions, it is important to clarify the lower and upper bounds (existential and universal quantifiers) in the definition of the VC dimension. The VC dimension is at least d if there exists some sample |S| = d which is shattered by C; this does not mean that all samples of size d are shattered by C. Conversely, in order to show that the VC dimension is at most d, one must show that no sample of size d + 1 is shattered. Naturally, proving an upper bound is more difficult than proving a lower bound on the VC dimension. The following examples are presented in a hand-waving style and are not meant to form rigorous proofs of the stated bounds; they are shown for illustrative purposes only.

Intervals of the real line: The concept class C is governed by two parameters α_1, α_2 in the closed interval [0, 1]. A concept from this class tags an input instance 0 < x < 1 as positive if α_1 ≤ x ≤ α_2 and negative otherwise. The VC dimension is at least 2: select a sample of 2 points x_1, x_2 positioned in the open interval (0, 1). We need to show that there are values of α_1, α_2 which realize all four possible dichotomies (+, +), (−, −), (+, −), (−, +).
This is clearly possible, as one can place the interval [α_1, α_2] such that its intersection with [x_1, x_2] is empty (thus producing (−, −)), or such that it fully includes [x_1, x_2] (thus producing (+, +)), or such that it partially intersects [x_1, x_2] so that x_1 or x_2 is excluded (thus producing the remaining two dichotomies). To show that the VC dimension is at most 2, we need to show that no sample of three points x_1, x_2, x_3 on the line (0, 1) can be shattered. It is sufficient to show that one of the dichotomies is not realizable: the labeling (+, −, +) cannot be realized by any interval [α_1, α_2], because if x_1, x_3 are labeled positive then by definition the interval [α_1, α_2] must fully include the interval [x_1, x_3], and since x_1 < x_2 < x_3 the point x_2 must be labeled positive as well. Thus VCdim(C) = 2.

Axes-aligned rectangles in the plane: We have seen this concept class in Lecture 2: a point in the plane is labeled positive if it lies inside an axes-aligned rectangle. The concept class C is thus governed by 4 parameters. The VC dimension is at least 4: consider a configuration of 4 input points arranged in a cross pattern (recall that we only need to exhibit some sample S that can be shattered). We can place the rectangles (concepts of the class C) such that all 16 dichotomies can be realized (for example, placing the rectangle so as to include the vertical pair of points and exclude the horizontal pair of points would induce the labeling (+, −, +, −)). It is important to note that in this case not all configurations of 4 points can be shattered, but to prove a lower bound it is sufficient to show the existence of a single shattered set of 4 points. To show that the VC dimension is at most 4, we need to prove that no set of 5 points can be shattered. In any set of 5 points there must be some point that is internal, i.e., that is neither the extreme left, right, top, nor bottom point of the five.
If we label this internal point as negative and the remaining 4 points as positive, then there is no axes-aligned rectangle (concept) which could realize this labeling: if the external 4 points are labeled positive then they must lie fully within the concept rectangle, but then the internal point must also be included in the rectangle and thus labeled positive as well.

Separating hyperplanes: Consider first linear half-spaces in the plane. The lower bound on the

VC dimension is 3, since any three (non-collinear) points in R^2 can be shattered, i.e., all 8 possible labelings of the three points can be realized by placing a separating line appropriately. By having one of the points on one side of the line and the other two on the other side we can realize 3 dichotomies, and by placing the line such that all three points are on the same side we realize the 4th. The remaining 4 dichotomies are realized by a sign flip of the four previous cases. To show that the upper bound is also 3, we need to show that no set of 4 points can be shattered. We consider two cases: (i) the four points form a convex region, i.e., all lie on the convex hull defined by the 4 points; (ii) three of the 4 points define the convex hull and the 4th point is internal. In the first case, the labeling which is positive for one diagonal pair and negative for the other pair cannot be realized by a separating line. In the second case, a labeling which is positive for the three hull points and negative for the interior point cannot be realized. Thus the VC dimension is 3, and in general the VC dimension of separating hyperplanes in R^n is n + 1.

Union of a finite number of intervals on the line: This is an example of a concept class with an infinite VC dimension. For any sample of m points on the line, one can place a sufficient number of intervals to realize any labeling.

The examples so far were simple enough that one might get the wrong impression that there is a correlation between the number of parameters required to describe the concepts of the class and the VC dimension. As a counterexample, consider the two-parameter concept class

C = {sign(sin(ωx + θ)) : ω, θ},

which has an infinite VC dimension: one can show that for every set of m points on the line one can realize all possible labelings by choosing a sufficiently large value of ω (which serves as the frequency of the sine function) and an appropriate phase θ.
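The interval example above can also be verified by brute force. In this sketch the grid of candidate interval endpoints (points just below, at, and just above each sample point, plus two endpoints outside (0, 1)) is an assumption; it suffices for the interval class because only the position of the endpoints relative to the sample points matters:

```python
from itertools import product

def interval_labels(a1, a2, xs):
    """Label each x positive (1) iff a1 <= x <= a2; a1 > a2 gives the empty concept."""
    return tuple(1 if a1 <= x <= a2 else 0 for x in xs)

def shattered_by_intervals(xs):
    """True iff every labeling of xs in (0, 1) is realized by some interval [a1, a2]."""
    cand = [p + d for p in xs for d in (-1e-9, 0.0, 1e-9)] + [-1.0, 2.0]
    realized = {interval_labels(a1, a2, xs) for a1 in cand for a2 in cand}
    return all(lab in realized for lab in product((0, 1), repeat=len(xs)))
```

Any 2-point sample is shattered, while no 3-point sample is (the labeling (1, 0, 1) is never realized), matching VCdim = 2.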
We conclude this section with the following claim:

Theorem 1 The VC dimension of a finite concept class |C| < ∞ is bounded from above:

VCdim(C) ≤ log_2 |C|.

Proof: If VCdim(C) = d then there exist at least 2^d functions in C, because every function induces a labeling and there are at least 2^d labelings. Thus, from |C| ≥ 2^d it follows that d ≤ log_2 |C|.

11.2 The Relation between the VC Dimension and PAC Learning

We saw that the VC dimension is a combinatorial measure of concept class complexity, and we would like to have it replace the cardinality term in the sample complexity bound. The first result of interest is to show that if the VC dimension of the concept class is infinite, then the class is not PAC learnable.

Theorem 2 A concept class C with VCdim(C) = ∞ is not learnable in the formal sense.

Proof: Assume the contrary, that C is PAC learnable. Let L be the learning algorithm and m the number of training examples required to learn the concept class with accuracy ɛ = 0.1 and confidence 1 − δ = 0.9. That is, after seeing at least m(ɛ, δ) training examples, the learner generates a concept h which satisfies p(err(h) ≤ 0.1) ≥ 0.9.

Lecture : PAC II -5 Since the VC diension is infinite there exist a saple set S with 2 instances which is shattered by C. Since the foral odel (PAC) applies to any training saple we will use the set S as follows. We will define a probability distribution on the instance space X which is unifor on S (with probability 2 ) and zero everywhere else. Because S is shattered, then any target concept is possible so we will choose our target concept c in the following anner: prob(c t (x i ) = 0) = x i S, 2 in other words, the labels c t (x i ) are deterined by a coin flip. The learner L selects an i.i.d. saple of instances S which due to the structure of D eans that the S S and outputs a consistent hypothesis h C. The probability of error for each x i S is: prob(c t (x i ) h(x i )) = 2. The reason for that is because S is shattered by C, i.e., we can select any target concept for any labeling of S (the 2 exaples) therefore we could select the labels of the points not seen by the learner arbitrarily (by flipping a coin). Regardless of h, the probability of istake is 0.5. The expectation on the error of h is: E[err(h)] = 0 2 + 2 2 = 4. This is because we have 2 points to saple (according to D as all other points have zero probability) fro which the error on half of the is zero (as h is consistent on the training set S) and the error on the reaining half is 0.5. Thus, the average error is 0.25. Note that E[err(h)] = 0.25 for any choice of ɛ, δ as it is based on the saple size. For any saple size we can follow the construction above and generate the learning proble such that if the learner produces a consistent hypothesis the expectation of the error will be 0.25. The result that E[err(h)] = 0.25 is not possible for the accuracy and confidence values we have set: with probability of at least 0.9 we have that err(h) 0. and with probability 0. then err(h) = β where 0. < β. Taking the worst case of β = we coe up with the average error: E[err(h)] 0.9 0. + 0. = 0.9 < 0.25. 
We have therefore arrived at a contradiction, so C cannot be PAC learnable.

We next obtain a bound on the growth of |Π_C(S)| when the sample size |S| = m is much larger than the VC dimension VCdim(C) = d of the concept class. We will need a few more definitions:

Definition 4 (Growth function)

Π_C(m) = max{|Π_C(S)| : |S| = m}

The measure Π_C(m) is the maximum number of dichotomies induced by C over all samples of size m. As long as m ≤ d, we have Π_C(m) = 2^m. The question is what happens to the growth pattern of Π_C(m) when m > d. We will see that the growth becomes polynomial, a fact which is crucial for the learnability of C.

Definition 5 For any natural numbers m, d we have the following recursive definition:

Φ_d(m) = Φ_d(m − 1) + Φ_{d−1}(m − 1)
Φ_d(0) = Φ_0(m) = 1
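The recursion in Definition 5 is easy to compute directly. The sketch below (memoized Python, an illustration rather than part of the lecture) also checks it against the binomial-sum closed form Σ_{i=0}^{d} (m choose i) and against the polynomial bound (em/d)^d derived in this lecture:

```python
from functools import lru_cache
from math import comb, e

@lru_cache(maxsize=None)
def phi(d: int, m: int) -> int:
    """Phi_d(m) per Definition 5:
    Phi_d(m) = Phi_d(m-1) + Phi_{d-1}(m-1), with Phi_d(0) = Phi_0(m) = 1."""
    if m == 0 or d == 0:
        return 1
    return phi(d, m - 1) + phi(d - 1, m - 1)

# For m <= d the recursion gives the full 2^m dichotomy count; for m > d
# it grows only polynomially, e.g. phi(2, m) = 1 + m + m*(m-1)/2 = O(m^2).
```

For instance, phi(3, 3) = 8 = 2^3, while phi(2, 10) = 1 + 10 + 45 = 56, far below 2^10 = 1024.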

By induction on m, d it is possible to prove the following:

Theorem 3

Φ_d(m) = Σ_{i=0}^{d} (m choose i)

Proof: By induction on m, d. For details see Kearns & Vazirani, pp. 56.

For m ≤ d we have Φ_d(m) = 2^m. For m > d we can derive a polynomial upper bound as follows:

(d/m)^d Φ_d(m) = (d/m)^d Σ_{i=0}^{d} (m choose i)
             ≤ Σ_{i=0}^{d} (m choose i) (d/m)^i
             ≤ Σ_{i=0}^{m} (m choose i) (d/m)^i
             = (1 + d/m)^m ≤ e^d,

where the first inequality uses (d/m)^d ≤ (d/m)^i for i ≤ d (since d/m < 1). Dividing both sides by (d/m)^d yields:

Φ_d(m) ≤ e^d (m/d)^d = (em/d)^d = O(m^d).

We need one more result before we are ready to present the main result of this lecture:

Theorem 4 (Sauer's lemma) If VCdim(C) = d, then for any m, Π_C(m) ≤ Φ_d(m).

Proof: By induction on both d, m. For details see Kearns & Vazirani, pp. 55-56.

Taken together, we now have a fairly interesting characterization of how the combinatorial measure of complexity of the concept class C scales with the sample size. When the VC dimension of C is infinite, the growth is exponential: Π_C(m) = 2^m for all values of m. On the other hand, when the concept class has a bounded VC dimension VCdim(C) = d < ∞, the growth pattern undergoes a discontinuity from exponential to polynomial:

Π_C(m) = 2^m for m ≤ d,    Π_C(m) ≤ (em/d)^d for m > d.

As a direct result of this observation, when m >> d the entropy of the outcomes becomes much smaller than m. Recall that from an information-theoretic perspective, the entropy of a random variable Z with discrete values z_1, ..., z_n occurring with probabilities p_i, i = 1, ..., n, is defined as:

H(Z) = Σ_{i=1}^{n} p_i log_2 (1/p_i),

where I(p_i) = log_2 (1/p_i) is a measure of information: it is large when p_i is small (meaning that there is much information in the occurrence of an unlikely event) and vanishes when the event is certain, p_i = 1. The entropy is therefore the expectation of information. Entropy is maximal for a uniform distribution, where H(Z) = log_2 n. In the information theory context, the entropy can be viewed as the number of bits required for coding z_1, ..., z_n.
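As a quick numerical illustration of the entropy formula (a minimal sketch with made-up probability values, not taken from the lecture):

```python
from math import log2

def entropy(ps):
    """H(Z) = sum_i p_i * log2(1 / p_i); a certain event (p_i = 1) or an
    impossible one (p_i = 0) contributes nothing to the sum."""
    return sum(p * log2(1.0 / p) for p in ps if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal: log2(4) = 2 bits
skewed = entropy([0.9, 0.05, 0.03, 0.02])    # far fewer bits on average
```

The uniform distribution attains the maximum log_2 n, while a highly non-uniform distribution (like the realized dichotomies when m >> d) has much smaller entropy and is therefore compressible.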
In coding theory it can be shown that the entropy of a distribution provides a lower bound on the average code length of any uniquely decodable encoding in which one source symbol is mapped to one code word. When the distribution is uniform we need the maximal number of bits, i.e., one cannot compress the data. In the case of a concept class C with VC dimension d, we see that when m ≤ d all possible dichotomies

are realized, and thus one needs m bits (as there are 2^m dichotomies) to represent all the possible outcomes of the sample. However, when m >> d only a small fraction of the 2^m dichotomies can be realized, so the distribution of outcomes is highly non-uniform, and one therefore needs far fewer bits for coding the outcomes of the sample. The technical results which follow are a formal way of expressing rigorously this simple truth: if it is possible to compress, then it is possible to learn. The crucial point is that learnability is a direct consequence of the phase transition (from exponential to polynomial) in the growth of the number of dichotomies realized by the concept class. In the next lecture we will continue with the proof of the double sampling theorem, which derives the sample size complexity as a function of the VC dimension.