# Markov chains and Markov Random Fields (MRFs)


## 1 Why Markov Models

We now discuss Markov models. These are the simplest statistical models in which we don't assume that all variables are independent; we assume that there are important dependencies, but also conditional independencies that we can take advantage of. Markov models show up in a lot of ways in computer vision. For example:

- In modeling texture, we can represent a patch of an image by the output of a large number of filters. If we assume that some of these filter outputs are dependent on each other (for example, filters at adjacent scales), but that there is conditional independence between different filter outputs, we can get a tractable model for use in texture synthesis.
- The shape of contours can be modeled as a Markov chain. In this approach, the future course of a contour that passes through point $x$ might depend on the tangent of the contour at $x$, but not on the shape of the contour prior to reaching $x$. This oversimplifies, but is implicitly used in many approaches to perceptual grouping and contour segmentation.
- In action recognition, we often model the movements that constitute an action as a Markov chain. This assumption means, for example, that if I want to predict what is about to happen when someone is in the middle of leaving work and getting in their car, the prediction will depend on their current state (where the person and car are right now, and what their current trajectory is) but not on previous states (once you are standing in front of the car, how you will get in doesn't really depend on how you left the building and arrived in front of the car).
- Again, in modeling texture, we might assume that the appearance of a pixel in a textured image patch will depend on some of the surrounding pixels, but given these pixels, it will be independent of the rest of the image.

## 2 Markov Chains

We will start by discussing the simplest Markov model, a Markov chain.
We have already talked about these a little, since diffusion of a single particle can be thought of as a Markov chain. We can also use Markov chains to model contours, and they are used, explicitly or implicitly, in many contour-based segmentation algorithms. One of the key advantages of 1D Markov models is that they lend themselves to dynamic programming solutions.

In a Markov chain, we have a sequence of random variables, which we can think of as describing the state of a system, $x_1, x_2, \ldots, x_n$. What makes them Markov is a conditional independence property that says that each state depends only on the previous state. That is:

$$P(x_n \mid x_1, x_2, \ldots, x_{n-1}) = P(x_n \mid x_{n-1})$$

Diffusion of a single particle offers us a simple example of this. Let $x_i$ be the position of a particle at time $i$. Suppose at each time step, the particle jumps either one unit to the left or right, or stays in the same place. We can see that $x_i$ depends on $x_{i-1}$, but that if we know $x_{i-1}$ the previous values of $x$ are irrelevant. We can also see that this is a stochastic version of the deterministic diffusion we've studied. If there are many particles, we can use the law of large numbers to assume that a fixed fraction of them jump to the left and right. Remember that we saw that the position $x_i$ will have a Gaussian distribution for large $i$, because of the central limit theorem.

Markov chains have some nice properties. One is that their steady state can be found by solving an eigenvalue problem. If a Markov chain involves transitions between a discrete set of states, it's very useful to describe these transitions from state to state using vectors and matrices. Let $p^t_j$ be the probability of being in state $j$ at time $t$. Now let's put these in a vector, so that $p^t = (p^t_1, \ldots, p^t_n)$ gives the probability distribution over all states we can be in. These are probabilities because even if the state at time 0 is deterministic, after that it will be stochastic. Notice that $\sum_j p^t_j = 1$.

Now, the Markov model is completely specified by the probability that if I'm in state $s_j$ at time $t-1$ I'll be in state $s_i$ at time $t$, for all $i$ and $j$. (In principle, these probabilities might also depend on $t$, that is, vary over time, but I'll assume for now that they don't.) We build a matrix with these probabilities, called $A$, with entries $A_{ij}$. Notice that every column of $A$ has to sum to 1 because if I am in state $j$ at time $t-1$ I have to move to exactly one state at time $t$. This is convenient, because we have:

$$p^t = A p^{t-1}$$

This just says that the probability that I wind up in, e.g., state 1 at time $t$ is the sum, over all possible $i$, of the probability that I'm in state $i$ at time $t-1$ and move to state 1. Notice that this is more general than the diffusion processes we solved with convolution.
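
As a small sketch of the update $p^t = A p^{t-1}$, the following pure-Python snippet builds a transition matrix for the jumping particle on five positions and applies it twice. The jump probabilities (0.25 left, 0.25 right, 0.5 stay), the number of positions, and the rule that a particle at either end stays put instead of leaving are illustrative assumptions, not part of the notes; `random_walk_matrix` and `step` are hypothetical helper names.

```python
def random_walk_matrix(n, p_left=0.25, p_right=0.25):
    """Column-stochastic A with A[i][j] = P(state i at t | state j at t-1)."""
    p_stay = 1.0 - p_left - p_right
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] += p_stay
        # Assumed boundary rule: a jump off either end leaves the particle in place.
        left = j - 1 if j > 0 else j
        right = j + 1 if j < n - 1 else j
        A[left][j] += p_left
        A[right][j] += p_right
    return A


def step(A, p):
    """One update p_t = A p_{t-1}."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]


A = random_walk_matrix(5)
p0 = [0.0, 0.0, 1.0, 0.0, 0.0]  # deterministic start in the middle state
p1 = step(A, p0)
p2 = step(A, p1)
column_sums = [sum(A[i][j] for i in range(5)) for j in range(5)]
print(p1, p2, column_sums)
```

Starting from a particle known to be in the middle position, `p1` comes out as (0, 0.25, 0.5, 0.25, 0), and the entries of each distribution keep summing to 1 because every column of `A` sums to 1.
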
This matrix multiplication is only equivalent to a convolution if the matrix is a band around the diagonal, with every row the same, but shifted.

This formulation makes it easy to study the asymptotic behavior of a Markov chain. It is just the limit of:

$$A(A(\cdots A(p^0))) = A^n p^0$$

Notice that because $A$ is stochastic, the elements of $p^t$ always sum to 1. As an example, let's consider the simple Markov chain given by:

`A = [.75 .5; .25 .5];`

Using Matlab, we can see that if we pick $p$ randomly and apply $A$ to it many times we get $[2/3, 1/3]$ as our answer. We can work out that this is a steady state of the system. Suppose we have $p^t = [2/3, 1/3]$. Then the chance that we will wind up in state 1 at time $t+1$ is $\frac{2}{3} \cdot \frac{3}{4} + \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{2} + \frac{1}{6} = \frac{2}{3}$.

We can also see, with Matlab, that this asymptotic state is an eigenvector of $A$. This is because this repeated multiplication is just the power method of computing the eigenvector associated with the largest eigenvalue of $A$. That is, if we keep multiplying $p$ by $A$, the result converges to the leading eigenvector. This is true regardless of the initial condition of $p$.

Proof by example: Suppose $p = a v_1 + b v_2$, where $v_1$ and $v_2$ are eigenvectors of $A$, with associated eigenvalues $\lambda_1$ and $\lambda_2$. Then:

$$A(a v_1 + b v_2) = A(a v_1) + A(b v_2) = a \lambda_1 v_1 + b \lambda_2 v_2$$

And similarly:

$$A^n(a v_1 + b v_2) = a \lambda_1^n v_1 + b \lambda_2^n v_2$$

As $n$ goes to infinity, if $|\lambda_2|$ is smaller than $|\lambda_1|$, the second term becomes insignificant, and the larger eigenvalue dominates. The result converges to a vector in the direction of $v_1$. This also means that the leading eigenvalue of $A$ must be 1; this happens because $A$ is a stochastic matrix.
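
The Matlab experiment described above can be reproduced with a few lines of plain Python. This is only a sketch: the starting vector `[0.1, 0.9]` and the 50 iterations are arbitrary choices, and `apply` is a hypothetical helper name.

```python
def apply(A, p):
    """Matrix-vector product A p for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, p)) for row in A]


A = [[0.75, 0.5],
     [0.25, 0.5]]  # each column sums to 1

p = [0.1, 0.9]  # arbitrary starting distribution (an assumption)
for _ in range(50):
    p = apply(A, p)  # power method: keep multiplying by A

steady = [2 / 3, 1 / 3]
# If `steady` is an eigenvector with eigenvalue 1, A * steady equals steady.
residual = max(abs(x - y) for x, y in zip(apply(A, steady), steady))
print(p, residual)
```

The iterate `p` ends up numerically indistinguishable from $[2/3, 1/3]$. The other eigenvalue of this particular $A$ is 0.25 (the trace is 1.25 and the leading eigenvalue is 1), so the error shrinks by roughly a factor of 4 per multiplication.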

I've skipped some conditions needed to prove this. First, the stochastic process must be ergodic. This means that we can get to any state from any other state. Otherwise, we might start in one state and never be able to reach the distribution given by the leading eigenvector. Second, the transition matrix must have a unique leading eigenvalue. Otherwise, the result could vary among linear combinations of the leading eigenvectors. Or the result might not converge, but could be periodic. For example, if

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

then $(\frac{1}{2}, \frac{1}{2})$ is an eigenvector, but the result can also oscillate. Third, the initial condition must be generic. If it has no component of the leading eigenvector in it, it can't converge to that.

So the leading eigenvector of the transition matrix gives us a probability distribution over the asymptotic state. This can also be interpreted as the expected fraction of time the system spends in each state, over time. This makes sense, since the probability that the system is in a state at time $n$ is the expected amount of time it spends there at time $n$.

## Markov Random Fields

Markov chains provided us with a way to model 1D objects such as contours probabilistically, in a way that led to nice, tractable computations. We now consider 2D Markov models. These are more powerful, but not as easy to compute with. In addition we will consider two additional issues. First, we will consider adding observations to our models. These observations are conditioned on the value of the set of random variables we are modeling. (If we had considered observations with Markov chains, we would have arrived at Hidden Markov Models (HMMs), which are widely used in vision and other fields.)

In MRFs, we also consider a set of random variables which have some conditional independence properties. We introduce some terminology which is slightly different from the 1D case.
We say that the MRF contains a set of sites, which we may index with two values, $i$ and $j$, when we wish to emphasize the 2D structure of the sites. So we might talk about site $S_{i,j}$ and its horizontal neighbor, $S_{i+1,j}$. It may also be convenient to use a single index, referring to sites as $S_i$. In this case, the order of the sites is arbitrary.

We also suppose that we have a set of labels, $L_d$. Labels might take on continuous values, but we'll assume that we have a discrete set of labels. Every site will have a label. So the sites may be thought of as random variables that can take a discrete set of values. A labeling assigns a label to every site. We denote this as $f = \{f_1, \ldots, f_m\}$, so that $f_i$ is the label of site $S_i$.

Next, we assume that our sites have a neighborhood structure. $N_i$ denotes all the sites that are neighbors of $S_i$. That is, $S_j \in N_i$ means $S_j$ and $S_i$ are neighbors. We can define any neighborhood structure that we want, with the constraint that being neighbors is symmetric, that is, that $S_j \in N_i \iff S_i \in N_j$. Also, a site is not its own neighbor. We denote the set of all neighborhoods as $N$.

This neighborhood structure now allows us to define the Markov structure of our distribution. We say that $F$ is an MRF on $S$ with respect to $N$ if and only if:

$$P(f) > 0, \quad \forall f$$

and

$$P(f_i \mid f_{S - \{i\}}) = P(f_i \mid f_{N_i})$$
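
To make the terminology concrete, here is a minimal sketch of sites, a neighborhood structure, and a labeling on a small grid. The 3 by 4 grid, the choice of 4-connected neighbors, the binary label set, and the checkerboard labeling are all assumptions made for illustration; `four_neighbors` is a hypothetical helper name.

```python
def four_neighbors(i, j, rows, cols):
    """Sites adjacent to (i, j) horizontally or vertically, clipped to the grid."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return {(a, b) for a, b in candidates if 0 <= a < rows and 0 <= b < cols}


rows, cols = 3, 4
sites = [(i, j) for i in range(rows) for j in range(cols)]

# N maps every site to the set of its neighbors.
N = {s: four_neighbors(s[0], s[1], rows, cols) for s in sites}

# The two constraints on a neighborhood structure stated in the text.
symmetric = all(s in N[t] for s in sites for t in N[s])
no_self_neighbors = all(s not in N[s] for s in sites)

# A labeling f assigns one label from a discrete set to every site.
labels = [0, 1]
f = {s: labels[(s[0] + s[1]) % 2] for s in sites}
print(symmetric, no_self_neighbors, len(f))
```

Corner sites have two neighbors and interior sites have four under this choice; any other symmetric, self-excluding neighborhood function could be substituted for `four_neighbors`.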

The first condition is needed for some technical reasons. It means that in modeling we can't give any labeling 0 probability, but since the probability can be very small this isn't much of a restriction. The second condition provides the Markov property.

As an example of this, let's consider the problem of segmenting an image into foreground and background. We can assign a site to every pixel in the image. Our label set is binary, indicating foreground or background. Suppose we wish to encode the constraint that foreground regions tend to be compact, by stating that if a pixel is foreground, its neighboring pixels are also likely to be foreground. We can define a simple neighborhood structure based on 4-connected neighbors. That is, $N_{i,j} = \{S_{i-1,j}, S_{i+1,j}, S_{i,j-1}, S_{i,j+1}\}$ (notice how we switch from using one subscript to two). Then with the conditional probabilities available to us, we can encode constraints such as: if all of a pixel's neighbors are foreground, it is probably foreground; if all its neighbors are background, it is probably background; and in other cases, it is fairly likely to be either. (You may also notice that this MRF has no connection yet to the intensities in the image. This will be handled below.)

## MRFs and Gibbs Distributions

Given any set of random variables, there are a number of natural problems to answer. First, given an MRF it is straightforward to determine the probability of any particular labeling. However, figuring out what set of conditional probabilities to use in an MRF is not so simple. For one thing, an arbitrary set of conditional probabilities for different sites and neighborhoods may not be mutually consistent, and it is not obvious how to determine this. Finally, a key problem will be to find the most likely labeling of an MRF. These problems are made easier by the use of Gibbs distributions, which turn out to be equivalent to MRFs, but in some ways are much easier to work with.
In a Gibbs distribution, the cliques capture dependencies between neighborhoods. A set of sites $\{i_1, i_2, \ldots, i_n\}$ forms a clique if every pair of distinct sites in it are neighbors, that is, if $i_k \in N_{i_j}$ for all $k \neq j$. Given a probability distribution defined for a set of sites and labels, we say that it is a Gibbs distribution if the distribution is of the following form:

$$P(f) = \frac{1}{Z} e^{-\frac{U(f)}{T}} \tag{1}$$

in which the energy function, $U(f)$, is of the form:

$$U(f) = \sum_{c \in C} V_c(f)$$

where $C$ is the set of all cliques, and $V_c(f)$ is the clique potential, defined for every clique. That is, $P(f)$ is an exponential function of the sum of potentials that can be defined independently for each clique. This is analogous to the independence structure given by neighborhoods in MRFs. In the above equation, $Z$ is a normalizing value needed to make the probabilities sum to 1:

$$Z = \sum_{f \in F} e^{-\frac{U(f)}{T}}$$

$T$ is a scalar called the temperature. It determines how sharply peaked the distribution is; note that as $T$ becomes very small, the distribution is dominated by its most likely element.
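
For a tiny model, Equation 1 can be evaluated by brute force, which also shows the effect of the temperature. The sketch below assumes a 2 by 2 grid with binary labels, 4-connected pairwise cliques only, and a clique potential of 0 for equal labels and `k = 1` otherwise; these choices, and the names `energy` and `gibbs`, are illustrative and not from the notes.

```python
import itertools
import math

sites = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Pairwise cliques of the 4-connected 2x2 grid.
edges = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((0, 0), (1, 0)), ((0, 1), (1, 1))]
labels = [0, 1]


def energy(f, k=1.0):
    """U(f): sum of pairwise clique potentials, 0 if labels agree, k otherwise."""
    return sum(0.0 if f[a] == f[b] else k for a, b in edges)


def gibbs(T):
    """Return {labeling: P(f)}, enumerating every labeling to compute Z."""
    weights = {}
    for values in itertools.product(labels, repeat=len(sites)):
        f = dict(zip(sites, values))
        weights[values] = math.exp(-energy(f) / T)
    Z = sum(weights.values())
    return {values: w / Z for values, w in weights.items()}


hot = gibbs(T=10.0)
cold = gibbs(T=0.1)
print(max(hot.values()), cold[(0, 0, 0, 0)], cold[(1, 1, 1, 1)])
```

At `T = 0.1` nearly all of the probability sits on the two constant labelings, which tie for the lowest energy, while at `T = 10` the 16 labelings have probabilities within a factor of about 1.5 of each other. The enumeration has a number of terms exponential in the number of sites, so it is only feasible for toy examples like this one.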

The main reason Gibbs distributions are important to us is that they turn out to be equivalent to MRFs. That means that for any MRF, we can write its probability distribution in the form of Equation 1, so in learning or designing an MRF, we can focus on finding the clique potentials. Note that fully specifying this distribution is still hard, since we must determine the value of $Z$. A straightforward way of computing this involves computing a sum with an exponential number of terms. There is a lot of work on approximating this value. However, in many cases we don't need it, because finding the MAP labeling for an MRF only requires finding the labeling that minimizes the energy function $U(f)$, since $Z$ is a constant factor that applies to all labelings. Finding the labeling that minimizes $U(f)$ is still an NP-hard combinatorial optimization problem in most cases, but there are many algorithms that attack this problem.

## MRFs with Observations

So far, we have only considered distributions over labels. This amounts to a means for specifying a prior distribution over labeled images. But we also want to connect this with the information in a specific image. To do this, we'll use examples in which every site is a pixel, so that there is one piece of image information at each site. We'll call the image information $X$, with the information at each pixel given by $X_i$ or $X_{i,j}$. We are then interested in solving problems like:

$$\arg\max_f P(f \mid X)$$

To make this concrete, let's consider an example of image denoising. We consider an MRF in which each pixel is a site, and two pixels are neighbors if they are 4-connected. Suppose every pixel has an intrinsic intensity from 0 to 255, given by its label. $X_i$ is this intensity, with noise added, so that

$$X_i = f_i + e_i$$

where the $e_i$ are iid and drawn from a zero-mean Gaussian distribution with variance $\sigma^2$.
Using Bayes' law we have:

$$P(f \mid X) = \frac{P(X \mid f) P(f)}{P(X)}$$

To compute this we have:

$$P(X \mid f) = \prod_i P(X_i \mid f_i)$$

Note that each $X_i$ is conditionally independent of all other labels, given $f_i$, i.e., that $P(X_i \mid f) = P(X_i \mid f_i)$, and also that the $X_i$ are independent of each other given the labels, i.e., that:

$$P(X \mid f) = \prod_i P(X_i \mid f)$$

Our noise model states that $P(X_i \mid f_i)$ will be a Gaussian distribution with mean $f_i$, so that:

$$P(X_i \mid f_i) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(f_i - X_i)^2}{2\sigma^2}}$$
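
It may help to spell out how these pieces combine into an energy. Taking the negative logarithm of the posterior, and writing the prior in the Gibbs form of Equation 1 with energy $U(f)$, gives the following; here $m$ denotes the number of sites, and collecting the label-independent terms into a single constant is a presentational choice rather than something stated in the notes.

```latex
\begin{aligned}
-\log P(f \mid X)
  &= -\log P(X \mid f) - \log P(f) + \log P(X) \\
  &= \sum_i \frac{(f_i - X_i)^2}{2\sigma^2} + \frac{U(f)}{T}
     + \underbrace{\frac{m}{2}\log(2\pi\sigma^2) + \log Z + \log P(X)}_{\text{does not depend on } f}
\end{aligned}
```

So maximizing $P(f \mid X)$ is the same as minimizing the sum of the per-pixel data terms and the prior energy. With $T = 1$, each data term simply acts as an additional potential on a single-site clique.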

Therefore, we can define the clique potential:

$$V_c(f_i) = \frac{(f_i - X_i)^2}{2\sigma^2}$$

(Note that we can ignore constant factors, since we normalize anyway with $Z$.) These are the unary cliques, which capture the data (pixels). If we want to, we can add to this a prior on different labels. We can also define a pairwise clique potential to encode the prior that neighboring pixels should have similar intensities. The simplest way to do this is to define, for $c = \{i, j\}$:

$$V_c = \begin{cases} 0 & f_i = f_j \\ k & f_i \neq f_j \end{cases}$$

This biases us to have piecewise constant regions in the restored image. On the other hand, if we use:

$$V_c = |f_i - f_j|$$

we penalize according to the total variation in the image.

## Computing MAP estimates of MRFs and CRFs

There are several kinds of computations that we might want to perform with MRFs and CRFs. These include:

- learning the clique potentials and other parameters;
- finding a MAP estimate of the labels, given an MRF/CRF and image information;
- sampling from the conditional distribution (instead of just getting a MAP estimate).

We will focus mainly on the MAP estimate problem, since this is very useful.

Initialization: All these algorithms are iterative, and the results can depend on how the solution (labeling) is initialized. The simplest initialization method is to compute the MAP estimate using only the unary clique potentials. So, for denoising, for example, we would initialize each label to be the corresponding pixel intensity. Of course, there are many other standard heuristics, such as using domain knowledge (i.e., a prior on the labels), or trying many random initializations and picking the one that leads to the best solution.

Iterated Conditional Modes: This is the simplest, greedy algorithm. Visit the sites in some order, and for a given site, $S_i$, choose the label $f_i$ that maximizes $P(f)$ given that all other labels are fixed.
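
The ICM update just described can be sketched for the denoising example as follows. This is an illustration under stated assumptions, not a reference implementation: it uses the quadratic unary potential and the 0-or-`k` pairwise potential defined above, a made-up 3 by 6 noisy image, and the arbitrary values `sigma = 20` and `k = 2`; the function names are hypothetical.

```python
def local_energy(label, i, j, f, X, sigma, k):
    """Sum of the clique potentials that involve site (i, j) if it takes `label`."""
    rows, cols = len(X), len(X[0])
    e = (label - X[i][j]) ** 2 / (2 * sigma ** 2)
    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
        if 0 <= a < rows and 0 <= b < cols and f[a][b] != label:
            e += k
    return e


def total_energy(f, X, sigma, k):
    """U(f): all unary terms plus k for every unequal 4-connected pair."""
    rows, cols = len(X), len(X[0])
    u = 0.0
    for i in range(rows):
        for j in range(cols):
            u += (f[i][j] - X[i][j]) ** 2 / (2 * sigma ** 2)
            if i + 1 < rows and f[i][j] != f[i + 1][j]:
                u += k
            if j + 1 < cols and f[i][j] != f[i][j + 1]:
                u += k
    return u


def icm(X, labels, sigma, k, max_sweeps=10):
    # Initialize from the unary terms alone: the label nearest each observation.
    f = [[min(labels, key=lambda lab: abs(lab - x)) for x in row] for row in X]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(X)):
            for j in range(len(X[0])):
                # Greedy step: best label for this site with all others fixed.
                best = min(labels, key=lambda lab: local_energy(lab, i, j, f, X, sigma, k))
                if best != f[i][j]:
                    f[i][j] = best
                    changed = True
        if not changed:
            break
    return f


# Made-up noisy image: a dark region on the left, a bright one on the right.
X = [[52, 47, 55, 198, 205, 201],
     [49, 90, 51, 202, 160, 199],
     [50, 53, 48, 197, 203, 204]]
labels = list(range(256))
sigma, k = 20.0, 2.0

f_init = [[min(labels, key=lambda lab: abs(lab - x)) for x in row] for row in X]
f_icm = icm(X, labels, sigma, k)
u_init = total_energy(f_init, X, sigma, k)
u_icm = total_energy(f_icm, X, sigma, k)
print(u_init, u_icm)
```

Minimizing the local energy is equivalent to maximizing the probability of the labeling with the other labels held fixed, because every other clique potential is unchanged by the update. Each update therefore never increases `total_energy`, which is why the final energy is no larger than the initial one, but nothing guarantees that the result is the global minimum.
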
Notice that this only requires computing the clique potentials that include $S_i$ for all possible labels, so this requires computation that is linear in the number of labels. ICM is efficient, but can quickly converge to a poor local minimum. It is likely to work well when the unary clique potentials are very strong (note that in the limit, as the unary clique potentials dominate, ICM produces the global optimum).

Simulated Annealing: This method uses stochastic optimization to avoid some of the local minima that occur with ICM. The idea is to randomly change a label, and then accept this change with a probability that depends on the extent to which this change increases or decreases the overall probability of the labeling. For example, given a set of nodes with labels, $f$, we randomly select a site, $S_i$, and consider changing its label to $f'_i$. We let $f'$ denote this new labeling containing $f'_i$. Let $p = \min(1, P(f')/P(f))$. Then we replace $f$ with $f'$ with probability $p$. This results in our always accepting a change that improves the labeling, but also in our sometimes accepting changes that reduce the probability of the labeling. This can allow us to escape from labelings that are locally, but not globally, optimal.

Note that the distribution $P(f)$ becomes very flat for large values of $T$, and more peaked as $T$ decreases. To take advantage of this, we use an annealing schedule in which $T$ begins with a high value, and gradually decreases. This means that at first we make changes to the labeling almost at random, often moving to less probable labelings, and then gradually move more deterministically only to more probable labelings. It can be proven that if the annealing schedule is chosen properly this will converge to the globally optimal solution, but of course since the problem is NP-hard, this must require an annealing schedule that uses an exponential amount of time.

Belief Propagation: Belief propagation allows for exact inference in graphical models that do not contain loops, such as Markov chain models or models with a tree structure. It has been shown that this can also lead to effective inference in models with loops, such as MRFs, but we won't discuss this algorithm.

Graph Cuts: We'll talk more about this in the next class.
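
Returning to simulated annealing, the accept/reject rule described in this section can be sketched as below for the denoising energy. Everything numeric here is an assumption made for illustration: the made-up image, the coarse label set, `sigma = 20`, `k = 2`, the starting temperature, the geometric cooling schedule, the step count, and the random seed. The names `energy` and `anneal` are hypothetical. The ratio $P(f')/P(f)$ is computed as $e^{-(U(f') - U(f))/T}$, since $Z$ cancels.

```python
import math
import random


def energy(f, X, sigma, k):
    """U(f): quadratic unary terms plus k for every unequal 4-connected pair."""
    rows, cols = len(X), len(X[0])
    u = 0.0
    for i in range(rows):
        for j in range(cols):
            u += (f[i][j] - X[i][j]) ** 2 / (2 * sigma ** 2)
            if i + 1 < rows and f[i][j] != f[i + 1][j]:
                u += k
            if j + 1 < cols and f[i][j] != f[i][j + 1]:
                u += k
    return u


def anneal(f0, X, labels, sigma, k, T0=5.0, cooling=0.999, steps=5000, seed=0):
    rng = random.Random(seed)
    f = [row[:] for row in f0]
    u = energy(f, X, sigma, k)
    best, best_u = [row[:] for row in f], u
    T = T0
    for _ in range(steps):
        # Propose a random new label at a random site.
        i, j = rng.randrange(len(X)), rng.randrange(len(X[0]))
        old = f[i][j]
        f[i][j] = rng.choice(labels)
        u_new = energy(f, X, sigma, k)
        # Accept with probability min(1, exp(-(U(f') - U(f)) / T)).
        if u_new <= u or rng.random() < math.exp(-(u_new - u) / T):
            u = u_new
            if u < best_u:
                best, best_u = [row[:] for row in f], u
        else:
            f[i][j] = old  # reject: restore the previous label
        T *= cooling  # assumed geometric annealing schedule
    return best, best_u


X = [[52, 47, 55, 198, 205, 201],
     [49, 90, 51, 202, 160, 199],
     [50, 53, 48, 197, 203, 204]]
labels = list(range(0, 256, 50))  # coarse label set to keep the sketch fast
sigma, k = 20.0, 2.0

f0 = [[min(labels, key=lambda lab: abs(lab - x)) for x in row] for row in X]
u0 = energy(f0, X, sigma, k)
best, best_u = anneal(f0, X, labels, sigma, k)
print(u0, best_u)
```

The sketch keeps track of the lowest-energy labeling seen so far, so the returned energy can never exceed that of the initial labeling. Recomputing the full energy at every step is wasteful; only the clique potentials that include the changed site actually differ, but the full recomputation keeps the example short.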


### Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### Induction Problems. Tom Davis November 7, 2005

Induction Problems Tom Davis tomrdavis@earthlin.net http://www.geometer.org/mathcircles November 7, 2005 All of the following problems should be proved by mathematical induction. The problems are not necessarily

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### We call this set an n-dimensional parallelogram (with one vertex 0). We also refer to the vectors x 1,..., x n as the edges of P.

Volumes of parallelograms 1 Chapter 8 Volumes of parallelograms In the present short chapter we are going to discuss the elementary geometrical objects which we call parallelograms. These are going to

### Diagonalisation. Chapter 3. Introduction. Eigenvalues and eigenvectors. Reading. Definitions

Chapter 3 Diagonalisation Eigenvalues and eigenvectors, diagonalisation of a matrix, orthogonal diagonalisation fo symmetric matrices Reading As in the previous chapter, there is no specific essential

### Model-based Synthesis. Tony O Hagan

Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that

### A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

### Introduction to Algorithms Review information for Prelim 1 CS 4820, Spring 2010 Distributed Wednesday, February 24

Introduction to Algorithms Review information for Prelim 1 CS 4820, Spring 2010 Distributed Wednesday, February 24 The final exam will cover seven topics. 1. greedy algorithms 2. divide-and-conquer algorithms

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### Compression algorithm for Bayesian network modeling of binary systems

Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the

### Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

### Models for Discrete Variables

Probability Models for Discrete Variables Our study of probability begins much as any data analysis does: What is the distribution of the data? Histograms, boxplots, percentiles, means, standard deviations

### Introduction to Deep Learning Variational Inference, Mean Field Theory

Introduction to Deep Learning Variational Inference, Mean Field Theory 1 Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Center for Visual Computing Ecole Centrale Paris Galen Group INRIA-Saclay Lecture 3: recap

### Linear Programming Notes VII Sensitivity Analysis

Linear Programming Notes VII Sensitivity Analysis 1 Introduction When you use a mathematical model to describe reality you must make approximations. The world is more complicated than the kinds of optimization

### 1 Solving LPs: The Simplex Algorithm of George Dantzig

Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

### The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

### Machine Learning I Week 14: Sequence Learning Introduction

Machine Learning I Week 14: Sequence Learning Introduction Alex Graves Technische Universität München 29. January 2009 Literature Pattern Recognition and Machine Learning Chapter 13: Sequential Data Christopher

### Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

### Chapter 6. Orthogonality

6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### Fractions and Decimals

Fractions and Decimals Tom Davis tomrdavis@earthlink.net http://www.geometer.org/mathcircles December 1, 2005 1 Introduction If you divide 1 by 81, you will find that 1/81 =.012345679012345679... The first

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### ECON 459 Game Theory. Lecture Notes Auctions. Luca Anderlini Spring 2015

ECON 459 Game Theory Lecture Notes Auctions Luca Anderlini Spring 2015 These notes have been used before. If you can still spot any errors or have any suggestions for improvement, please let me know. 1

### Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi Lecture - 15 Limited Resource Allocation Today we are going to be talking about

### 3.2 Roulette and Markov Chains

238 CHAPTER 3. DISCRETE DYNAMICAL SYSTEMS WITH MANY VARIABLES 3.2 Roulette and Markov Chains In this section we will be discussing an application of systems of recursion equations called Markov Chains.

### CS2 Algorithms and Data Structures Note 11. Breadth-First Search and Shortest Paths

CS2 Algorithms and Data Structures Note 11 Breadth-First Search and Shortest Paths In this last lecture of the CS2 Algorithms and Data Structures thread we will consider the problem of computing distances

### 10.2 ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS. The Jacobi Method

578 CHAPTER 1 NUMERICAL METHODS 1. ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS As a numerical technique, Gaussian elimination is rather unusual because it is direct. That is, a solution is obtained after

### 22 Matrix exponent. Equal eigenvalues

22 Matrix exponent. Equal eigenvalues 22. Matrix exponent Consider a first order differential equation of the form y = ay, a R, with the initial condition y) = y. Of course, we know that the solution to

### Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.

Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(

### Combinatorics: The Fine Art of Counting

Combinatorics: The Fine Art of Counting Week 7 Lecture Notes Discrete Probability Continued Note Binomial coefficients are written horizontally. The symbol ~ is used to mean approximately equal. The Bernoulli

### Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01 Probability and Statistics for Engineers Winter 2010-2011 Contents 1 Joint Probability Distributions 1 1.1 Two Discrete

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

### 1.2 Solving a System of Linear Equations

1.. SOLVING A SYSTEM OF LINEAR EQUATIONS 1. Solving a System of Linear Equations 1..1 Simple Systems - Basic De nitions As noticed above, the general form of a linear system of m equations in n variables

### Understanding and Applying Kalman Filtering

Understanding and Applying Kalman Filtering Lindsay Kleeman Department of Electrical and Computer Systems Engineering Monash University, Clayton 1 Introduction Objectives: 1. Provide a basic understanding

### 1. LINEAR EQUATIONS. A linear equation in n unknowns x 1, x 2,, x n is an equation of the form

1. LINEAR EQUATIONS A linear equation in n unknowns x 1, x 2,, x n is an equation of the form a 1 x 1 + a 2 x 2 + + a n x n = b, where a 1, a 2,..., a n, b are given real numbers. For example, with x and

### Divide And Conquer Algorithms

CSE341T/CSE549T 09/10/2014 Lecture 5 Divide And Conquer Algorithms Recall in last lecture, we looked at one way of parallelizing matrix multiplication. At the end of the lecture, we saw the reduce SUM

### Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming

### Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

### IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

### 2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering

2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering Compulsory Courses IENG540 Optimization Models and Algorithms In the course important deterministic optimization