Gibbs Sampling and Online Learning Introduction

Statistical Techniques in Robotics (16-831, F14) Lecture #10 (Tuesday, September 30)

Gibbs Sampling and Online Learning Introduction

Lecturer: Drew Bagnell Scribes: Shichao Yang (content adapted from previous scribes: Jiaji Zhou in F12, Sankalp Arora in F11)

1 Sampling

Samples are valuable because they can be used to compute expected values of functions over a probability distribution. They can also be used to compute marginals by counting. Most sampling techniques use the ratios taught in the last class to minimize computation time. This section discusses two techniques: importance sampling and Gibbs sampling. The latter (or, more commonly, some variant of it) is the more frequently used method.

1.1 Importance Sampling

This technique has been discussed earlier in class: we want $E_p f(x)$, but we cannot sample from $p$, so we instead sample $x_i$'s from a different distribution $q$ and weight each $f(x_i)$ by $\frac{p(x_i)}{q(x_i)}$. As far as possible, $q$ should be similar to $p$ in order to make the sampling more precise.

What distribution should we draw from? We can draw from $q = \prod_i q_i(x_i)$, but this usually does not work, since in most cases it is too far from the actual distribution $p$. It is easier to sample from chains and trees, so if a non-tree graph is almost a tree, we can often convert it to a tree by breaking some links, sample from the new tree-based distribution, and use importance sampling to compute the expectations for the non-tree graph. Importance sampling usually only works for graphs that are nearly trees, and in most examples of interest this is not the case. We then need Gibbs sampling. (A minimal code sketch of the reweighting appears after the Gibbs procedure below.)

1.2 Gibbs Sampling

Consider the Gibbs field in Figure 1.

[Figure 1: Gibbs field with maximum clique size 2.]

Gibbs sampling works as follows:

1. Start with some initial configuration (say $[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]$ in the order $x_1$ to $x_{15}$).

2. Pick a node at random (say $x_7$).

3. Resample the node conditioned on its Markov blanket: generate $x_7^+ \sim p(x_7 \mid N(x_7))$, i.e., flip a coin in proportion to the likelihood of getting a 0 or a 1 given the neighbors.

4. Repeat steps 2 and 3 for $T$ time steps.

For a large enough $T$, $x_T$ converges to a sample from $p(x)$. To get the next sample, it is necessary to wait another $T$ time steps, as updates too close to each other may be correlated.
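As a concrete illustration of the reweighting in Section 1.1, here is a minimal sketch (not from the original notes; the particular choices of $p$, $q$, and $f$ are hypothetical). It estimates $E_p f(x)$ by sampling from $q$ and weighting each sample by $p(x_i)/q(x_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target p = N(2, 1): we can evaluate its density but we
# pretend we cannot sample from it directly.
p = lambda x: np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)
# Proposal q = N(0, 1), which we *can* sample from. The closer q is to p,
# the lower the variance of the estimate.
q = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
f = lambda x: x ** 2                 # function whose expectation under p we want

xs = rng.standard_normal(200_000)    # draw x_i ~ q
weights = p(xs) / q(xs)              # importance weights p(x_i) / q(x_i)
estimate = np.mean(weights * f(xs))  # approximates E_p f(x)
print(estimate)                      # should be near E_p[x^2] = 2^2 + 1 = 5
```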
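The four steps above can also be written directly in code. The sketch below is mine, not the scribes': it assumes a hypothetical pairwise binary field over a chain of 15 nodes (maximum clique size 2, as in Figure 1) with $p(x) \propto \exp(J \sum_{(i,j)} \mathbf{1}[x_i = x_j])$ for an arbitrary coupling $J$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 15, 0.8        # 15 binary nodes; J is an arbitrary coupling strength
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}

def resample(x, i):
    """Step 3: draw x_i ~ p(x_i | N(x_i)), its conditional given the
    Markov blanket, for p(x) proportional to exp(J * #agreeing edges)."""
    score0 = sum(J * (0 == x[j]) for j in neighbors[i])
    score1 = sum(J * (1 == x[j]) for j in neighbors[i])
    p1 = np.exp(score1) / (np.exp(score0) + np.exp(score1))
    x[i] = int(rng.random() < p1)  # coin flip proportional to the likelihoods

x = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]   # step 1: initial config
T = 10_000
for _ in range(T):                 # step 4: repeat for T time steps
    resample(x, rng.integers(N))   # step 2: pick a node at random
print(x)                           # approximately a sample from p(x)
```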

If you are computing the expectation of a function under $p$, it is more advantageous to do that directly over the updates than to wait for full samples: for large enough $T$, $\frac{1}{T}\sum_t f(x_t)$ converges to $E_p f(x)$. Note that in general $x$ does not converge; it is the expectation that converges.

There are some other useful tricks. For example, it is better not to include the early $x_t$ when computing $\frac{1}{T}\sum_t f(x_t)$, as the chain is more random initially. We may also have many sequences of $x_t$ running in parallel and finally compute the mean of their estimates.

1.3 Doubts and Answers

You may have the following doubts, as I did before:

How could the expectation converge? Think of the problem in the 2D case ($x = [x_0, x_1]$). Pick out all the steps at which we resample $x_0$ while fixing $x_1$; look at all these $x_0$ and you will find that they converge to samples drawn from the marginal probability $p(x_0)$, and likewise for $x_1$.

The samples are not i.i.d., so why does this work? Adjacent steps are not i.i.d., but if you look only at every $K$-th step (where $K$ generally cannot be too small), the samples are approximately i.i.d.

1.4 Derivation

To understand more deeply why Gibbs sampling converges to the desired distribution, we take a simple example. Think of a robot on a chess board, as in Figure 2. The robot starts from one random cell and can then jump to its neighbors in four directions, but bounces back when reaching the boundary. After running for many steps, what is the final distribution of the robot's position on the board? Is there a single equilibrium distribution, or multiple? If multiple, does it change periodically?

In the equilibrium state, the belief of each cell $bel(x_i)$ does not change. In more detail, the probability that the robot is in the surroundings $N(x_i)$ and moves to $x_i$ is the same as the probability that it is in $x_i$ and moves to its surroundings $N(x_i)$. It is similar to a stable fluid: the fluid coming into a cell is the same as the fluid going out.

[Figure 2: Chess board problem. The robot starts at one cell and can move in four directions, but bounces back when reaching the boundary. We want to find the final distribution over cells after enough steps.]
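To build intuition for the equilibrium question, here is a small simulation of the robot walk (my own sketch; the 8x8 board size and step count are arbitrary, and "bounce back" is interpreted as staying put for that step). Because the move rule is symmetric between neighboring cells, the visit frequencies flatten toward the uniform distribution, matching the stable-fluid picture of equal flow in and out:

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 8                                       # chess board, 8x8 cells
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]     # the four directions

r, c = rng.integers(SIZE), rng.integers(SIZE)  # start from a random cell
visits = np.zeros((SIZE, SIZE))

for _ in range(1_000_000):
    dr, dc = MOVES[rng.integers(4)]
    nr, nc = r + dr, c + dc
    if 0 <= nr < SIZE and 0 <= nc < SIZE:      # bounce back at the boundary:
        r, c = nr, nc                          # an off-board move is rejected
    visits[r, c] += 1

freq = visits / visits.sum()
print(freq.min(), freq.max())  # both approach 1/64: the uniform equilibrium
```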

It is easy to design a Gibbs field analogous to the chess board. Let $x$ represent the state of all cells at a specific time, $x = (x_1, x_2, \ldots, x_i, \ldots, x_{100})$, and let $\tilde{x}$ represent another state that differs from $x$ only in cell $i$. We want to know whether

$$p(x)\,p(\tilde{x} \mid x) = p(\tilde{x})\,p(x \mid \tilde{x})$$

when we draw samples from the field using Gibbs sampling. Say $x_i = 0$ and $\tilde{x}_i = 1$. The transition probability of a Gibbs update is the conditional probability of the resampled cell given the rest, so

$$p(x)\,p(\tilde{x} \mid x) = p(x)\,\frac{p(x_1, \ldots, x_i = 1, \ldots, x_n)}{p(x_1, \ldots, x_i = 0, \ldots, x_n) + p(x_1, \ldots, x_i = 1, \ldots, x_n)} = \frac{p(x)\,p(\tilde{x})}{p(x) + p(\tilde{x})},$$

$$p(\tilde{x})\,p(x \mid \tilde{x}) = p(\tilde{x})\,\frac{p(x_1, \ldots, x_i = 0, \ldots, x_n)}{p(x_1, \ldots, x_i = 0, \ldots, x_n) + p(x_1, \ldots, x_i = 1, \ldots, x_n)} = \frac{p(\tilde{x})\,p(x)}{p(x) + p(\tilde{x})}.$$

The two expressions have the same value, i.e., detailed balance holds. This means $p(x)$ is indeed the equilibrium distribution.
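The identity above can be checked numerically. The sketch below (mine, on a hypothetical 5-cell binary chain small enough to enumerate) computes $p$ exactly and verifies $p(x)\,p(\tilde{x} \mid x) = p(\tilde{x})\,p(x \mid \tilde{x})$ for every pair of states differing in one cell:

```python
import itertools
import numpy as np

N, J = 5, 0.8   # small binary chain so p(x) can be enumerated exactly
def unnorm(x):  # p(x) proportional to exp(J * #agreeing adjacent pairs)
    return np.exp(J * sum(x[i] == x[i + 1] for i in range(N - 1)))

states = list(itertools.product([0, 1], repeat=N))
Z = sum(unnorm(s) for s in states)
p = {s: unnorm(s) / Z for s in states}

for x in states:
    for i in range(N):
        xt = list(x); xt[i] = 1 - xt[i]; xt = tuple(xt)
        # A Gibbs move picks cell i (prob 1/N), then draws the new value
        # from its conditional given the rest of the cells:
        p_xt_given_x = (1 / N) * unnorm(xt) / (unnorm(x) + unnorm(xt))
        p_x_given_xt = (1 / N) * unnorm(x) / (unnorm(x) + unnorm(xt))
        assert np.isclose(p[x] * p_xt_given_x, p[xt] * p_x_given_xt)
print("detailed balance holds for all", len(states) * N, "transitions")
```

Detailed balance immediately implies stationarity: summing $p(x)\,p(\tilde{x} \mid x)$ over $x$ returns $p(\tilde{x})$, so the flow into each state reproduces $p$.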

2 Online Learning Introduction

This section begins with a short discussion of logic and the normative view of the world. Then we will question the nature of induction and its use as a tool for modeling and making decisions about the world. We will explore the limits of the inductive process and, in the process, prove how well different online induction algorithms (e.g., weighted majority, randomized weighted majority) perform in relation to each other through the new concepts of loss and regret.

2.1 Normative View of the World

The normative view is concerned with identifying the best decision to take or inference to make, assuming an ideal decision maker who is fully informed, is able to compute with perfect accuracy, and is fully rational. It assumes that probability-based inference and general decision making can and should be defined only by pure logic. The normative view suggests that any decision maker or inference engine that doesn't follow the aforementioned rules is wrong or invalid. Hence it says nothing about the amount of error induced in a system when approximation techniques (like a finite number of particles in a particle filter) are used for lack of adequate computational resources, time, or knowledge. Succinctly put, the normative view considers known unknowns, not unknown unknowns. This limitation makes the normative view of little use in real life.

2.2 Induction

Our standard justification for the use of probability theory is that it seems like the reasonable thing to do. Logic is certainly reasonable, and probability naturally and uniquely extends our notion of logic to include degree-of-belief type statements. Thus, if one accepts logic, one must accept probability as a way of reasoning about the world.

In all the probabilistic systems discussed so far (e.g., localization and mapping), we can momentarily discard our normative view by looking at their common elements. At the very core, something like mapping, for example, can be seen simply as an induction problem. We take in data about the world, and using prior knowledge, we make inferences, and then predictions using the inferences. In fact, Bayes' rule can be seen as the canonical way to do these operations.

Induction is a method to predict the future based on observations of the past, and to make decisions from limited information. Thus, we can look generally at induction itself in order to arrive at answers to certain fundamental questions behind many robotic learning systems. Questions such as "Why does inductive reasoning work?", "When does it fail?", and "How good can it be?" will tell us about the nature of learning and lead us to unique notions of optimality with regard to worst-case adversarial conditions.

2.2.1 Induction from a Chicken's Perspective

We begin with an illustrative example of how inductive reasoning can fail. Imagine you are a chicken. Every day of your life, the farmer comes to you at a certain time of day and feeds you. On the 1001st day, the farmer walks up to you. What do you predict? What can you predict other than that he will feed you? Unfortunately for you, this is the day the farmer has decided to eat you. Thus, induction fails spectacularly.

You may object to this example and say that the model of the world the chicken was using was wrong, but this is precisely the point of the discussion: perhaps we can never have a model that perfectly aligns with the real world. Perhaps you know vaguely of your impending doom, so you assign a low probability to being eaten each day. You still will (probably) guess incorrectly on that 1001st day, because your guess of being fed will dominate your feeling of doom. Furthermore, for any model you bear, an adversarial or random farmer can always thwart it to force false inductive predictions about your fate each day.

2.2.2 Three views of induction:

1. The David Hume view. No matter what, your model will never capture enough of the world to be correct all the time. Furthermore, induction relies heavily on priors made from assumptions, and potentially small changes in a prior can completely change a future belief. If your model is only as good as your assumptions or priors, and most assumptions are not likely to be valid at all, then induction is fundamentally broken. We find this view, although true, not very satisfying or particularly useful.

2. The Goldilocks/Panglossian/Einstein Position. It just so happens that our world is set up in such a way that induction works. It is the nature of the universe that events can be effectively modeled through induction to a degree that allows us to make consistently useful predictions about it. In other words, we're lucky. Once again, we do not find this view very satisfying or particularly useful.

3. The No Regret view. It is true that you cannot say that induction will work, or work well, for any given situation, but as far as strategies go, it is provably near-optimal. This view is an outgrowth of game theory, computer science, and machine learning, and the thinking is uniquely late-20th-century.