# A crash course in probability and Naïve Bayes classification

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish between discrete and continuous variables. The distribution of a discrete random variable: The probabilities of each value it can take. Notation: P(X = x i ). These numbers satisfy: X P (X = x i )=1 i 1 2 Probability theory Probability theory A joint probability distribution for two variables is a table. Marginal Probability If the two variables are binary, how many parameters does it have? Joint Probability Conditional Probability What about joint probability of d variables P(X 1,,X d )? How many parameters does it have if each variable is binary? 4 1

2 Probability theory The Rules of Probability Marginalization: Marginalization Product Rule Product Rule Independence: X and Y are independent if P(Y X) = P(Y) This implies P(X,Y) = P(X) P(Y) Using probability in learning P (Y X) Interested in: 1% X features Y labels 0.5% 0% For example, when classifying spam, we could estimate P(Y Viagara, lottery) We would then classify an example if P(Y X) > 0.5. However, it s usually easier to model P(X Y) Maximum likelihood Fit a probabilistic model P(x θ) to data Estimate θ Given independent identically distributed (i.i.d.) data X = (x 1, x 2,, x n ) Likelihood Log likelihood P (X ) =P (x 1 )P (x 2 ),...,P(x n ) ln P (X ) = nx ln P (x i ) i=1 Maximum likelihood solution: parameters θ that maximize ln P(X θ) 7 2

3 : coin toss Estimate the probability p that a coin lands Heads using the result of n coin tosses, h of which resulted in heads. The likelihood of the data: Log likelihood: Taking a derivative and setting to 0: P (X ) =p h (1 p) n h ln P (X ) =h ln p +(n h)ln(1 ln P (X = h p (n h) (1 p) =0 From the product rule: and: Therefore: P(Y, X) = P(Y X) P(X) P(Y, X) = P(X Y) P(Y) P (Y X) = This is known as Bayes rule Bayes rule P (X Y )P (Y ) P (X) ) p = h n 9 10 P (Y X) = posterior Bayes rule P (X Y )P (Y ) P (X) posterior likelihood prior P(X) can be computed as: P (X) = X Y P (X Y )P (Y ) likelihood prior Maximum a-posteriori and maximum likelihood The maximum a posteriori (MAP) rule: y MAP = arg max Y P (X Y )P (Y ) P (Y X) = arg max Y P (X) If we ignore the prior distribution or assume it is uniform we obtain the maximum likelihood rule: y ML = arg max P (X Y ) Y = arg max P (X Y )P (Y ) Y A classifier that has access to P(Y X) is a Bayes optimal classifier. But is not important for inferring a label 12 3

4 Naïve Bayes classifier We would like to model P(X Y), where X is a feature vector, and Y is its associated label. Task:%Predict%whether%or%not%a%picnic%spot%is%enjoyable% % Training&Data:&& X%=%(X 1 %%%%%%%X 2 %%%%%%%%X 3 %%%%%%%% %%%%%%%% %%%%%%%X d )%%%%%%%%%%Y% Naïve Bayes classifier We would like to model P(X Y), where X is a feature vector, and Y is its associated label. Simplifying assumption: conditional independence: given the class label the features are independent, i.e. P (X Y )=P (x 1 Y )P (x 2 Y ),...,P(x d Y ) n&rows& How many parameters now? How many parameters? Prior: P(Y) k-1 if k classes Likelihood: P(X Y) (2 d 1)k for binary features Naïve Bayes classifier We would like to model P(X Y), where X is a feature vector, and Y is its associated label. Simplifying assumption: conditional independence: given the class label the features are independent, i.e. P (X Y )=P (x 1 Y )P (x 2 Y ),...,P(x d Y ) Naïve Bayes decision rule: y NB = arg max Y Naïve Bayes classifier P (X Y )P (Y ) = arg max Y dy P (x i Y )P (Y ) i=1 If conditional independence holds, NB is an optimal classifier! How many parameters now? dk + k

5 Training a Naïve Bayes classifier Training data: Feature matrix X (n x d) and labels y 1, y n Maximum likelihood estimates: Class prior: Likelihood: ˆP (y) = {i : y i = y} n ˆP (x i y) = ˆP (x i,y) ˆP (y) = {i : X ij = x i,y i = y} /n {i : y i = y} /n classification Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our s, with parameters = (0.5,0.67,0.33) = (0.67,0.33,0.33) This means, for example, that the presence of b is twice as likely in spam (+), compared with ham. The to be classified contains words a and b but not c, and hence is described by the bit vector x = (1,1,0). We obtain likelihoods P(x ) = (1 0.33) = P(x ) = (1 0.33) = The ML classification of x is thus spam classification: training data a? b? c? Class e e e e e e e e ount vectors. (right) The same data set What are the parameters of the model? classification: training data a? b? c? Class e e e e e e e e ount vectors. (right) The same data set What are the parameters of the model? ˆP (y) = {i : y i = y} n ˆP (x i y) = ˆP (x i,y) ˆP (y) = {i : X ij = x i,y i = y} /n {i : y i = y} /n

6 classification: training data a? b? c? Class e e e e e e e e ount vectors. (right) The same data set What are the parameters of the model? P(+) = 0.5, P(-) = 0.5 P(a +) = 0.5, P(a -) = 0.75 P(b +) = 0.75, P(b -)= 0.25 P(c +) = 0.25, P(c -)= 0.25 ˆP (x i y) = ˆP (x i,y) ˆP (y) ˆP (y) = {i : y i = y} n = {i : Xij = xi,yi = y} /n {i : y i = y} /n Comments on Naïve Bayes Usually features are not conditionally independent, i.e. P (X Y ) 6= P (x 1 Y )P (x 2 Y ),...,P(x d Y ) And yet, one of the most widely used classifiers. Easy to train! It often performs well even when the assumption is violated. Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning. 29, When there are few training examples What if you never see a training example where x 1 =a when y=spam? P(x spam) = P(a spam) P(b spam) P(c spam) = 0 What to do? When there are few training examples What if you never see a training example where x 1 =a when y=spam? P(x spam) = P(a spam) P(b spam) P(c spam) = 0 What to do? Add virtual examples for which x 1 =a when y=spam

7 Naïve Bayes for continuous variables Continuous Probability Distributions Need to talk about continuous distributions! The probability of the random variable assuming a value within some given interval from x 1 to x 2 is defined to be the area under the graph of the probability density function between x 1 and x 2. f (x) Uniform f (x) Normal/Gaussian x x x x 1 x 1 x Expectations The Gaussian (normal) distribution Discrete variables Continuous variables Conditional expectation (discrete) Approximate expectation (discrete and continuous) 7

8 Properties of the Gaussian distribution Standard Normal Distribution 99.72% 95.44% 68.26% A random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution. µ 3σ µ 1σ µ 2σ µ µ + 1σ µ + 2σ µ + 3σ x Standard Normal Probability Distribution Gaussian Parameter Estimation Converting to the Standard Normal Distribution z x = µ σ Likelihood function We can think of z as a measure of the number of standard deviations x is from µ. 8

9 Maximum (Log) Likelihood 34 Gaussian models Assume we have data that belongs to three classes, and assume a likelihood that follows a Gaussian distribution Likelihood function: P (X i = x Y = y k )= Gaussian Naïve Bayes 1 (x µik ) 2 p exp 2 ik 2 2 ik Need to estimate mean and variance for each feature in each class

10 Summary Naïve Bayes classifier: ² ² ² What s the assumption Why we make it How we learn it Naïve Bayes for discrete data Gaussian naïve Bayes 37 10

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Bayesian Learning. CSL465/603 - Fall 2016 Narayanan C Krishnan

Bayesian Learning CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

### Bayesian Classification

CS 650: Computer Vision Bryan S. Morse BYU Computer Science Statistical Basis Training: Class-Conditional Probabilities Suppose that we measure features for a large training set taken from class ω i. Each

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Probabilities for machine learning

Probabilities for machine learning Roland Memisevic February 2, 2006 Why probabilities? One of the hardest problems when building complex intelligent systems is brittleness. How can we keep tiny irregularities

### Bayes Rule. How is this rule derived? Using Bayes rule for probabilistic inference: Evidence Cause) Evidence)

Bayes Rule P ( A B = B A A B ( Rev. Thomas Bayes (1702-1761 How is this rule derived? Using Bayes rule for probabilistic inference: P ( Cause Evidence = Evidence Cause Cause Evidence Cause Evidence: diagnostic

### Bayesian Decision Theory. Prof. Richard Zanibbi

Bayesian Decision Theory Prof. Richard Zanibbi Bayesian Decision Theory The Basic Idea To minimize errors, choose the least risky class, i.e. the class for which the expected loss is smallest Assumptions

### Lecture 2: October 24

Machine Learning: Foundations Fall Semester, 2 Lecture 2: October 24 Lecturer: Yishay Mansour Scribe: Shahar Yifrah, Roi Meron 2. Bayesian Inference - Overview This lecture is going to describe the basic

### CS340 Machine learning Naïve Bayes classifiers

CS340 Machine learning Naïve Bayes classifiers Document classification Let Y {1,,C} be the class label and x {0,1} d eg Y {spam, urgent, normal}, x i = I(word i is present in message) Bag of words model

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Learning from Data: Naive Bayes

Semester 1 http://www.anc.ed.ac.uk/ amos/lfd/ Naive Bayes Typical example: Bayesian Spam Filter. Naive means naive. Bayesian methods can be much more sophisticated. Basic assumption: conditional independence.

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1

Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2011 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields

### Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 10, 2014 1 Principle of maximum likelihood Consider a family of probability distributions

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Why the Normal Distribution?

Why the Normal Distribution? Raul Rojas Freie Universität Berlin Februar 2010 Abstract This short note explains in simple terms why the normal distribution is so ubiquitous in pattern recognition applications.

### Naive Bayesian Classifier

POLYTECHNIC UNIVERSITY Department of Computer Science / Finance and Risk Engineering Naive Bayesian Classifier K. Ming Leung Abstract: A statistical classifier called Naive Bayesian classifier is discussed.

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models

Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

### Regime Switching Models: An Example for a Stock Market Index

Regime Switching Models: An Example for a Stock Market Index Erik Kole Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam April 2010 In this document, I discuss in detail

### Artificial Intelligence Mar 27, Bayesian Networks 1 P (T D)P (D) + P (T D)P ( D) =

Artificial Intelligence 15-381 Mar 27, 2007 Bayesian Networks 1 Recap of last lecture Probability: precise representation of uncertainty Probability theory: optimal updating of knowledge based on new information

### Sufficient Statistics and Exponential Family. 1 Statistics and Sufficient Statistics. Math 541: Statistical Theory II. Lecturer: Songfeng Zheng

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Sufficient Statistics and Exponential Family 1 Statistics and Sufficient Statistics Suppose we have a random sample X 1,, X n taken from a distribution

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0

Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0 This primer provides an overview of basic concepts and definitions in probability and statistics. We shall

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Probabilistic Graphical Models Final Exam

Introduction to Probabilistic Graphical Models Final Exam, March 2007 1 Probabilistic Graphical Models Final Exam Please fill your name and I.D.: Name:... I.D.:... Duration: 3 hours. Guidelines: 1. The

### MS&E 226: Small Data

MS&E 226: Small Data Lecture 16: Bayesian inference (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 35 Priors 2 / 35 Frequentist vs. Bayesian inference Frequentists treat the parameters as fixed (deterministic).

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Machine Learning

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 8, 2011 Today: Graphical models Bayes Nets: Representing distributions Conditional independencies

### Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning SS 2005 Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes

### Review. Lecture 3: Probability Distributions. Poisson Distribution. May 8, 2012 GENOME 560, Spring Su In Lee, CSE & GS

Lecture 3: Probability Distributions May 8, 202 GENOME 560, Spring 202 Su In Lee, CSE & GS suinlee@uw.edu Review Random variables Discrete: Probability mass function (pmf) Continuous: Probability density

### MAS108 Probability I

1 QUEEN MARY UNIVERSITY OF LONDON 2:30 pm, Thursday 3 May, 2007 Duration: 2 hours MAS108 Probability I Do not start reading the question paper until you are instructed to by the invigilators. The paper

### ML, MAP, and Bayesian The Holy Trinity of Parameter Estimation and Data Prediction. Avinash Kak Purdue University. January 4, :19am

ML, MAP, and Bayesian The Holy Trinity of Parameter Estimation and Data Prediction Avinash Kak Purdue University January 4, 2017 11:19am An RVL Tutorial Presentation originally presented in Summer 2008

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Set theory, and set operations

1 Set theory, and set operations Sayan Mukherjee Motivation It goes without saying that a Bayesian statistician should know probability theory in depth to model. Measure theory is not needed unless we

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Bayesian Networks Tutorial I Basics of Probability Theory and Bayesian Networks

Bayesian Networks 2015 2016 Tutorial I Basics of Probability Theory and Bayesian Networks Peter Lucas LIACS, Leiden University Elementary probability theory We adopt the following notations with respect

### ECEN 5682 Theory and Practice of Error Control Codes

ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.

### Lecture 2: Introduction to belief (Bayesian) networks

Lecture 2: Introduction to belief (Bayesian) networks Conditional independence What is a belief network? Independence maps (I-maps) January 7, 2008 1 COMP-526 Lecture 2 Recall from last time: Conditional

### General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Spam Filtering. Example: OCR. Generalization and Overfitting

CS 88: Artificial Intelligence Fall 7 Lecture : Perceptrons //7 General Naïve Bayes A general naive Bayes model: C x E n parameters C parameters n x E x C parameters E E C E n Dan Klein UC Berkeley We

### The Optimality of Naive Bayes

The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Clustering - example. Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering

Clustering - example Graph Mining and Graph Kernels Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering x!!!!8!!! 8 x 1 1 Clustering - example

### Expectation Discrete RV - weighted average Continuous RV - use integral to take the weighted average

PHP 2510 Expectation, variance, covariance, correlation Expectation Discrete RV - weighted average Continuous RV - use integral to take the weighted average Variance Variance is the average of (X µ) 2

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Data Analysis and Uncertainty Part 3: Hypothesis Testing/Sampling

Data Analysis and Uncertainty Part 3: Hypothesis Testing/Sampling Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu Topics 1. Hypothesis Testing 1.

### Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

### Joint Probability Distributions and Random Samples. Week 5, 2011 Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage

5 Joint Probability Distributions and Random Samples Week 5, 2011 Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage Two Discrete Random Variables The probability mass function (pmf) of a single

### Math/Stat 370: Engineering Statistics, Washington State University

Math/Stat 370: Engineering Statistics, Washington State University Haijun Li lih@math.wsu.edu Department of Mathematics Washington State University Week 2 Haijun Li Math/Stat 370: Engineering Statistics,

### Finite Markov Chains and Algorithmic Applications. Matematisk statistik, Chalmers tekniska högskola och Göteborgs universitet

Finite Markov Chains and Algorithmic Applications Olle Häggström Matematisk statistik, Chalmers tekniska högskola och Göteborgs universitet PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE

### Examination 110 Probability and Statistics Examination

Examination 0 Probability and Statistics Examination Sample Examination Questions The Probability and Statistics Examination consists of 5 multiple-choice test questions. The test is a three-hour examination

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. b 0, b 1 Q = n (Y i (b 0 + b 1 X i )) 2 i=1 Minimize this by maximizing

### 4. Joint Distributions of Two Random Variables

4. Joint Distributions of Two Random Variables 4.1 Joint Distributions of Two Discrete Random Variables Suppose the discrete random variables X and Y have supports S X and S Y, respectively. The joint

### Markov Chains. Chapter Introduction and Definitions 110SOR201(2002)

page 5 SOR() Chapter Markov Chains. Introduction and Definitions Consider a sequence of consecutive times ( or trials or stages): n =,,,... Suppose that at each time a probabilistic experiment is performed,

### Less naive Bayes spam detection

Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

### ORF 245 Fundamentals of Statistics Chapter 8 Hypothesis Testing

ORF 245 Fundamentals of Statistics Chapter 8 Hypothesis Testing Robert Vanderbei Fall 2015 Slides last edited on December 11, 2015 http://www.princeton.edu/ rvdb Coin Tossing Example Consider two coins.

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Chapter 10: Introducing Probability

Chapter 10: Introducing Probability Randomness and Probability So far, in the first half of the course, we have learned how to examine and obtain data. Now we turn to another very important aspect of Statistics

### Hidden Markov Model. Jia Li. Department of Statistics The Pennsylvania State University. Hidden Markov Model

Hidden Markov Model Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Hidden Markov Model Hidden Markov models have close connection with mixture models. A mixture model

### Random Variables and Their Expected Values

Discrete and Continuous Random Variables The Probability Mass Function The (Cumulative) Distribution Function Discrete and Continuous Random Variables The Probability Mass Function The (Cumulative) Distribution

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Bayesian networks. Data Mining. Abraham Otero. Data Mining. Agenda. Bayes' theorem Naive Bayes Bayesian networks Bayesian networks (example)

Bayesian networks 1/40 Agenda Introduction Bayes' theorem Bayesian networks Quick reference 2/40 1 Introduction Probabilistic methods: They are backed by a strong mathematical formalism. They can capture

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Classification: Naïve Bayes Classifier Evaluation. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Classification: Naïve Bayes Classifier Evaluation Toon Calders ( t.calders@tue.nl ) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Last Lecture Classification

### The basics of probability theory. Distribution of variables, some important distributions

The basics of probability theory. Distribution of variables, some important distributions 1 Random experiment The outcome is not determined uniquely by the considered conditions. For example, tossing a

### Lecture.7 Poisson Distributions - properties, Normal Distributions- properties. Theoretical Distributions. Discrete distribution

Lecture.7 Poisson Distributions - properties, Normal Distributions- properties Theoretical distributions are Theoretical Distributions 1. Binomial distribution 2. Poisson distribution Discrete distribution

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

### Math/Stat 360-1: Probability and Statistics, Washington State University

Math/Stat 360-1: Probability and Statistics, Washington State University Haijun Li lih@math.wsu.edu Department of Mathematics Washington State University Week 3 Haijun Li Math/Stat 360-1: Probability and

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Machine Learning I Week 14: Sequence Learning Introduction

Machine Learning I Week 14: Sequence Learning Introduction Alex Graves Technische Universität München 29. January 2009 Literature Pattern Recognition and Machine Learning Chapter 13: Sequential Data Christopher

### Fundamental Probability and Statistics

Fundamental Probability and Statistics "There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are

### Machine Learning Overview

Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 1. As a scientific Discipline 2. As an area of Computer Science/AI

### Bayesian Methods of Parameter Estimation

Bayesian Methods of Parameter Estimation Aciel Eshky University of Edinburgh School of Informatics Introduction In order to motivate the idea of parameter estimation we need to first understand the notion

### 13.3 Inference Using Full Joint Distribution

191 The probability distribution on a single variable must sum to 1 It is also true that any joint probability distribution on any set of variables must sum to 1 Recall that any proposition a is equivalent

### 2.8 Expected values and variance

y 1 b a 0 y 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0 a b (a) Probability density function x 0 a b (b) Cumulative distribution function x Figure 2.3: The probability density function and cumulative distribution function

### Probability, Conditional Independence

Probability, Conditional Independence June 19, 2012 Probability, Conditional Independence Probability Sample space Ω of events Each event ω Ω has an associated measure Probability of the event P(ω) Axioms

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Introduction to latent variable models

Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### 1 A simple example. A short introduction to Bayesian statistics, part I Math 218, Mathematical Statistics D Joyce, Spring 2016

and P (B X). In order to do find those conditional probabilities, we ll use Bayes formula. We can easily compute the reverse probabilities A short introduction to Bayesian statistics, part I Math 18, Mathematical

### Mathematical Background

Appendix A Mathematical Background A.1 Joint, Marginal and Conditional Probability Let the n (discrete or continuous) random variables y 1,..., y n have a joint joint probability probability p(y 1,...,

### Logic, Probability and Learning

Logic, Probability and Learning Luc De Raedt luc.deraedt@cs.kuleuven.be Overview Logic Learning Probabilistic Learning Probabilistic Logic Learning Closely following : Russell and Norvig, AI: a modern