Discrete random variables

Probability mass function
Given a discrete random variable $X$ taking values in $\mathcal{X} = \{v_1, \dots, v_m\}$, its probability mass function $P : \mathcal{X} \to [0, 1]$ is defined as:
$$P(v_i) = \Pr[X = v_i]$$
and satisfies the following conditions:
- $P(x) \geq 0$
- $\sum_{x \in \mathcal{X}} P(x) = 1$

Discrete random variables

Expected value
The expected value, mean or average of a random variable $x$ is:
$$E[x] = \mu = \sum_{x \in \mathcal{X}} x\,P(x) = \sum_{i=1}^{m} v_i P(v_i)$$
The expectation operator is linear:
$$E[\lambda x + \lambda' y] = \lambda E[x] + \lambda' E[y]$$

Variance
The variance of a random variable is the moment of inertia of its probability mass function:
$$\mathrm{Var}[x] = \sigma^2 = E[(x - \mu)^2] = \sum_{x \in \mathcal{X}} (x - \mu)^2 P(x)$$
The standard deviation $\sigma$ indicates the typical amount of deviation from the mean one should expect for a randomly drawn value of $x$.

Properties of mean and variance
- second moment: $E[x^2] = \sum_{x \in \mathcal{X}} x^2 P(x)$
- variance in terms of expectation: $\mathrm{Var}[x] = E[x^2] - E[x]^2$
- variance and scalar multiplication: $\mathrm{Var}[\lambda x] = \lambda^2 \mathrm{Var}[x]$
- variance of uncorrelated variables: $\mathrm{Var}[x + y] = \mathrm{Var}[x] + \mathrm{Var}[y]$
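
As a quick numerical check of the identities above, here is a minimal sketch in Python (NumPy assumed available; the pmf values are purely illustrative):

import numpy as np

values = np.array([1.0, 2.0, 5.0])        # v_1, ..., v_m
probs  = np.array([0.2, 0.5, 0.3])        # P(v_i), non-negative and summing to 1

mu  = np.sum(values * probs)                  # E[x] = sum_i v_i P(v_i)
var = np.sum((values - mu) ** 2 * probs)      # Var[x] = E[(x - mu)^2]
second_moment = np.sum(values ** 2 * probs)   # E[x^2]

# Var[x] = E[x^2] - E[x]^2
assert np.isclose(var, second_moment - mu ** 2)
print(mu, var)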

Probability distributions

Bernoulli distribution
Two possible values (outcomes): 1 (success), 0 (failure). Parameters: $p$, probability of success. Probability mass function:
$$P(x; p) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$
$$E[x] = p \qquad \mathrm{Var}[x] = p(1 - p)$$
Example: tossing a coin
- head (success) and tail (failure) possible outcomes
- $p$ is probability of head

Bernoulli distribution
Proof of mean
$$E[x] = \sum_{x \in \mathcal{X}} x\,P(x) = \sum_{x \in \{0, 1\}} x\,P(x) = 0 \cdot (1 - p) + 1 \cdot p = p$$

Bernoulli distribution
Proof of variance
$$\mathrm{Var}[x] = \sum_{x \in \mathcal{X}} (x - \mu)^2 P(x) = \sum_{x \in \{0, 1\}} (x - p)^2 P(x) = (0 - p)^2 (1 - p) + (1 - p)^2 p = p^2 (1 - p) + (1 - p)(1 - p) p = (1 - p)(p^2 + p - p^2) = (1 - p)\,p$$
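
A small simulation can confirm these two moments; the sketch below assumes NumPy and uses an arbitrary value of $p$:

import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(n=1, p=p, size=100_000)   # Bernoulli trials (binomial with n = 1)

print(x.mean(), p)               # sample mean     ~ p
print(x.var(),  p * (1 - p))     # sample variance ~ p(1 - p)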

Probability distributions

Binomial distribution
Probability of a certain number of successes in $n$ independent Bernoulli trials. Parameters: $p$ probability of success, $n$ number of trials. Probability mass function:
$$P(x; p, n) = \binom{n}{x} p^x (1 - p)^{n - x}$$
$$E[x] = np \qquad \mathrm{Var}[x] = np(1 - p)$$
Example: tossing a coin
- $n$ number of coin tosses
- probability of obtaining $x$ heads

Pairs of discrete random variables

Probability mass function
Given a pair of discrete random variables $X$ and $Y$ taking values $\mathcal{X} = \{v_1, \dots, v_m\}$ and $\mathcal{Y} = \{w_1, \dots, w_n\}$, the joint probability mass function is defined as:
$$P(v_i, w_j) = \Pr[X = v_i, Y = w_j]$$
with properties:
- $P(x, y) \geq 0$
- $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) = 1$

Properties
- Expected value: $\mu_x = E[x] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x\,P(x, y)$ and $\mu_y = E[y] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} y\,P(x, y)$
- Variance: $\sigma_x^2 = \mathrm{Var}[x] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x - \mu_x)^2 P(x, y)$ and $\sigma_y^2 = \mathrm{Var}[y] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (y - \mu_y)^2 P(x, y)$
- Covariance: $\sigma_{xy} = E[(x - \mu_x)(y - \mu_y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x - \mu_x)(y - \mu_y) P(x, y)$
- Correlation coefficient: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
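
The binomial pmf and its moments can be computed directly from the formula; a sketch using only the Python standard library (math.comb requires Python 3.8+):

import math

def binom_pmf(x, n, p):
    # P(x; p, n) = C(n, x) * p^x * (1 - p)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var  = sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))

print(mean, n * p)            # both 3.0   (E[x] = np)
print(var,  n * p * (1 - p))  # both ~2.1  (Var[x] = np(1 - p))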

Probability distributions

Multinomial distribution (one sample)
Models the probability of a certain outcome for an event with $m$ possible outcomes. Parameters: $p_1, \dots, p_m$, probability of each outcome. Probability mass function:
$$P(x_1, \dots, x_m; p_1, \dots, p_m) = \prod_{i=1}^{m} p_i^{x_i}$$
where $x_1, \dots, x_m$ is a vector with $x_i = 1$ for outcome $i$ and $x_j = 0$ for all $j \neq i$.
$$E[x_i] = p_i \qquad \mathrm{Var}[x_i] = p_i(1 - p_i) \qquad \mathrm{Cov}[x_i, x_j] = -p_i p_j$$

Probability distributions

Multinomial distribution: example
Tossing a die with six faces:
- $m$ is the number of faces
- $p_i$ is the probability of obtaining face $i$

Probability distributions

Multinomial distribution (general case)
Given $n$ samples of an event with $m$ possible outcomes, models the probability of a certain distribution of outcomes. Parameters: $p_1, \dots, p_m$ probability of each outcome, $n$ number of samples. Probability mass function (assumes $\sum_{i=1}^{m} x_i = n$):
$$P(x_1, \dots, x_m; p_1, \dots, p_m, n) = \frac{n!}{\prod_{i=1}^{m} x_i!} \prod_{i=1}^{m} p_i^{x_i}$$
$$E[x_i] = np_i \qquad \mathrm{Var}[x_i] = np_i(1 - p_i) \qquad \mathrm{Cov}[x_i, x_j] = -np_i p_j$$
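
To see the moment formulas (including the negative covariance between counts) at work, here is a short simulation sketch with NumPy and illustrative parameters:

import numpy as np

rng = np.random.default_rng(0)
n, probs = 12, np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(n, probs, size=200_000)   # each row is (x_1, x_2, x_3) with sum n

print(counts.mean(axis=0), n * probs)               # E[x_i]   = n p_i
print(counts.var(axis=0),  n * probs * (1 - probs)) # Var[x_i] = n p_i (1 - p_i)
print(np.cov(counts[:, 0], counts[:, 1])[0, 1],     # Cov[x_1, x_2]
      -n * probs[0] * probs[1])                     # ~ -n p_1 p_2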

Probability distributions

Multinomial distribution: example
Tossing a die:
- $n$ number of times the die is tossed
- $x_i$ number of times face $i$ is obtained
- $p_i$ probability of obtaining face $i$

Conditional probabilities

Conditional probability
The probability of $x$ once $y$ is observed:
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$

Statistical independence
Variables $X$ and $Y$ are statistically independent iff:
$$P(x, y) = P(x) P(y)$$
implying:
$$P(x \mid y) = P(x) \qquad P(y \mid x) = P(y)$$

Basic rules

Law of total probability
The marginal distribution of a variable is obtained from a joint distribution by summing over all possible values of the other variable (sum rule):
$$P(x) = \sum_{y \in \mathcal{Y}} P(x, y) \qquad P(y) = \sum_{x \in \mathcal{X}} P(x, y)$$

Product rule
The definition of conditional probability implies that:
$$P(x, y) = P(x \mid y) P(y) = P(y \mid x) P(x)$$

Bayes rule
$$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$$

Bayes rule
Significance
$$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$$
allows one to invert the statistical connection between effect ($x$) and cause ($y$):
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
The evidence can be obtained using the sum rule from likelihood and prior:
$$P(x) = \sum_{y} P(x, y) = \sum_{y} P(x \mid y) P(y)$$
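
The sum rule, product rule and Bayes rule are enough to turn a prior and a likelihood into a posterior. A minimal sketch in Python (the probabilities are made up purely for illustration):

# Two possible causes y1, y2; x is an observed effect.
prior      = {"y1": 0.01, "y2": 0.99}   # P(y)
likelihood = {"y1": 0.95, "y2": 0.05}   # P(x | y)

# evidence via the sum rule: P(x) = sum_y P(x | y) P(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes rule: P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)   # e.g. {'y1': ~0.161, 'y2': ~0.839}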

Playing with probabilities

Use rules!
- Basic rules allow one to model a certain probability (e.g. cause given effect) given knowledge of some related ones (e.g. likelihood, prior).
- All our manipulations will be applications of the three basic rules.
- Basic rules apply to any number of variables:
$$P(y) = \sum_{x} \sum_{z} P(x, y, z) \quad \text{(sum rule)}$$
$$= \sum_{x} \sum_{z} P(y \mid x, z) P(x, z) \quad \text{(product rule)}$$
$$= \sum_{x} \sum_{z} \frac{P(x \mid y, z) P(y \mid z) P(x, z)}{P(x \mid z)} \quad \text{(Bayes rule)}$$

Playing with probabilities

Example
$$P(y \mid x, z) = \frac{P(x, z \mid y) P(y)}{P(x, z)} \quad \text{(Bayes rule)}$$
$$= \frac{P(x, z \mid y) P(y)}{P(x \mid z) P(z)} \quad \text{(product rule)}$$
$$= \frac{P(x \mid z, y) P(z \mid y) P(y)}{P(x \mid z) P(z)} \quad \text{(product rule)}$$
$$= \frac{P(x \mid z, y) P(z, y)}{P(x \mid z) P(z)} \quad \text{(product rule)}$$
$$= \frac{P(x \mid z, y) P(y \mid z) P(z)}{P(x \mid z) P(z)} \quad \text{(product rule)}$$
$$= \frac{P(x \mid z, y) P(y \mid z)}{P(x \mid z)}$$

Continuous random variables

Probability density function
Instead of the probability of a specific value of $X$, we model the probability that $x$ falls in an interval $(a, b)$:
$$\Pr[x \in (a, b)] = \int_{a}^{b} p(x)\,dx$$
Properties:
- $p(x) \geq 0$
- $\int_{-\infty}^{\infty} p(x)\,dx = 1$

Note
The probability of any single value is zero; the density at a specific value $x_0$ is given by the limit:
$$p(x_0) = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \Pr[x \in [x_0, x_0 + \epsilon)]$$
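
To make the interval-probability definition concrete, here is a minimal numerical sketch in Python; the exponential density $p(x) = \lambda e^{-\lambda x}$ is used only as an illustration because its integrals have a closed form to compare against:

import numpy as np

lam = 2.0
p = lambda x: lam * np.exp(-lam * x)     # density on [0, inf)

a, b = 0.5, 1.5
xs = np.linspace(a, b, 100_001)
dx = xs[1] - xs[0]
approx = np.sum(p(xs)) * dx              # crude Riemann sum for Pr[x in (a, b)]
exact  = np.exp(-lam * a) - np.exp(-lam * b)
print(approx, exact)                     # ~0.318 in both cases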

Properties

Expected value
$$E[x] = \mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

Variance
$$\mathrm{Var}[x] = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\,dx$$

Note
Definitions and formulas for discrete random variables carry over to continuous random variables with sums replaced by integrals.

Probability distributions

Gaussian (or normal) distribution
Bell-shaped curve. Parameters: $\mu$ mean, $\sigma^2$ variance. Probability density function:
$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
$$E[x] = \mu \qquad \mathrm{Var}[x] = \sigma^2$$
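
A quick Monte Carlo sanity check of the Gaussian moments, sketched in Python (NumPy assumed; $\mu$ and $\sigma$ are arbitrary):

import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
samples = rng.normal(mu, sigma, size=200_000)

print(samples.mean(), mu)           # E[x]   = mu
print(samples.var(),  sigma ** 2)   # Var[x] = sigma^2
print(normal_pdf(mu, mu, sigma))    # peak density 1 / (sqrt(2 pi) sigma) ~ 0.199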

Standard normal distribution: $N(0, 1)$
Standardization of a normal distribution $N(\mu, \sigma^2)$:
$$z = \frac{x - \mu}{\sigma}$$

Probability distributions

Beta distribution
Defined in the interval $[0, 1]$. Parameters: $\alpha, \beta$. Probability density function:
$$p(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$
$$E[x] = \frac{\alpha}{\alpha + \beta} \qquad \mathrm{Var}[x] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
where $\Gamma(x + 1) = x\,\Gamma(x)$ and $\Gamma(1) = 1$.

Note
It models the posterior distribution of parameter $p$ of a binomial distribution after observing $\alpha - 1$ independent events with probability $p$ and $\beta - 1$ with probability $1 - p$.
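
The Beta moments above can be checked by sampling; the sketch below also hints at the posterior interpretation (under a flat prior, $\alpha$ = successes + 1 and $\beta$ = failures + 1; the counts are illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 4.0, 8.0              # e.g. 3 successes and 7 failures on top of a flat prior
x = rng.beta(alpha, beta, size=200_000)

print(x.mean(), alpha / (alpha + beta))                                     # E[x]
print(x.var(),  alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))  # Var[x]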

Probability distributions

Multivariate normal distribution
Normal distribution for $d$-dimensional vectorial data. Parameters: $\mu$ mean vector, $\Sigma$ covariance matrix. Probability density function:
$$p(\mathbf{x}; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right)$$
$$E[\mathbf{x}] = \mu \qquad \mathrm{Var}[\mathbf{x}] = \Sigma$$
The squared Mahalanobis distance from $\mathbf{x}$ to $\mu$ is the standard measure of distance to the mean:
$$r^2 = (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)$$

Probability distributions

Dirichlet distribution
Defined for $x \in [0, 1]^m$ with $\sum_{i=1}^{m} x_i = 1$. Parameters: $\alpha = \alpha_1, \dots, \alpha_m$. Probability density function:
$$p(x_1, \dots, x_m; \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{m} \Gamma(\alpha_i)} \prod_{i=1}^{m} x_i^{\alpha_i - 1}$$

$$E[x_i] = \frac{\alpha_i}{\alpha_0} \qquad \mathrm{Var}[x_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)} \qquad \mathrm{Cov}[x_i, x_j] = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}$$
where $\alpha_0 = \sum_{j=1}^{m} \alpha_j$.

Note
It models the posterior distribution of the parameters $p$ of a multinomial distribution after observing each of the mutually exclusive events $\alpha_i - 1$ times.

Probability laws

Expectation of an average
Consider a sample of $X_1, \dots, X_n$ i.i.d. instances drawn from a distribution with mean $\mu$ and variance $\sigma^2$. Consider the random variable $\bar{X}_n$ measuring the sample average:
$$\bar{X}_n = \frac{X_1 + \dots + X_n}{n}$$

Its expectation is computed as (using $E[a(X + Y)] = a(E[X] + E[Y])$):
$$E[\bar{X}_n] = \frac{1}{n}(E[X_1] + \dots + E[X_n]) = \mu$$
i.e. the expectation of an average is the true mean of the distribution.

Probability laws

Variance of an average
Consider the random variable $\bar{X}_n$ measuring the sample average:
$$\bar{X}_n = \frac{X_1 + \dots + X_n}{n}$$
Its variance is computed as (using $\mathrm{Var}[a(X + Y)] = a^2(\mathrm{Var}[X] + \mathrm{Var}[Y])$ for $X$ and $Y$ independent):
$$\mathrm{Var}[\bar{X}_n] = \frac{1}{n^2}(\mathrm{Var}[X_1] + \dots + \mathrm{Var}[X_n]) = \frac{\sigma^2}{n}$$
i.e. the variance of the average decreases with the number of observations (the more examples you see, the more likely you are to estimate the correct average).

Probability laws

Chebyshev's inequality
Consider a random variable $X$ with mean $\mu$ and variance $\sigma^2$. Chebyshev's inequality states that for all $a > 0$:
$$\Pr[|X - \mu| \geq a] \leq \frac{\sigma^2}{a^2}$$
Replacing $a = k\sigma$ for $k > 0$ we obtain:
$$\Pr[|X - \mu| \geq k\sigma] \leq \frac{1}{k^2}$$

Note
Chebyshev's inequality shows that most of the probability mass of a random variable stays within a few standard deviations of its mean.

Probability laws

The law of large numbers
Consider a sample of $X_1, \dots, X_n$ i.i.d. instances drawn from a distribution with mean $\mu$ and variance $\sigma^2$. For any $\epsilon > 0$, its sample average $\bar{X}_n$ obeys:
$$\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| > \epsilon] = 0$$
It can be shown using Chebyshev's inequality and the facts that $E[\bar{X}_n] = \mu$ and $\mathrm{Var}[\bar{X}_n] = \sigma^2/n$:
$$\Pr[|\bar{X}_n - E[\bar{X}_n]| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}$$

Interpretation
The accuracy of an empirical statistic increases with the number of samples.
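
The $\sigma^2/n$ behaviour of the sample average is easy to observe empirically. A sketch with NumPy (the underlying distribution is an arbitrary Gaussian, but any distribution with finite variance behaves the same way):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0

for n in (10, 100, 1000):
    # 50,000 independent sample averages, each over n draws
    means = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)
    print(n, means.mean(), means.var(), sigma ** 2 / n)   # Var[Xbar_n] ~ sigma^2 / n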

Probability laws

Central Limit theorem
Consider a sample of $X_1, \dots, X_n$ i.i.d. instances drawn from a distribution with mean $\mu$ and variance $\sigma^2$.
1. Regardless of the distribution of the $X_i$, for $n \to \infty$ the distribution of the sample average $\bar{X}_n$ approaches a normal distribution.
2. Its mean approaches $\mu$ and its variance approaches $\sigma^2/n$.
3. Thus the normalized sample average:
$$z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$
approaches a standard normal distribution $N(0, 1)$.

Central Limit theorem
Interpretation
- The sum of a sufficiently large sample of i.i.d. random measurements is approximately normally distributed.
- We don't need to know the form of their distribution (it can be arbitrary).
- This justifies the importance of the normal distribution in real-world applications.

Information theory

Entropy
Consider a discrete set of symbols $\mathcal{V} = \{v_1, \dots, v_n\}$ with mutually exclusive probabilities $P(v_i)$. We aim at designing a binary code for each symbol, minimizing the average length of messages. Shannon and Weaver (1949) proved that the optimal code assigns to each symbol $v_i$ a number of bits equal to $-\log P(v_i)$. The entropy of the set of symbols is the expected length of a message encoding a symbol, assuming such an optimal coding:
$$H[\mathcal{V}] = E[-\log P(v)] = -\sum_{i=1}^{n} P(v_i) \log P(v_i)$$

Information theory

Cross entropy
Consider two distributions $P$ and $Q$ over variable $X$. The cross entropy between $P$ and $Q$ measures the expected number of bits needed to code a symbol sampled from $P$ using $Q$ instead:
$$H(P; Q) = E_P[-\log Q(v)] = -\sum_{i=1}^{n} P(v_i) \log Q(v_i)$$
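
For concreteness, a short sketch computing entropy and cross entropy in bits for two illustrative distributions over the same three symbols:

import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])

H_P  = -np.sum(P * np.log2(P))   # entropy of P: optimal code length, in bits
H_PQ = -np.sum(P * np.log2(Q))   # cross entropy: coding symbols from P with Q's code
print(H_P, H_PQ)                 # 1.5 vs ~1.82 bits; H(P; Q) >= H(P)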

Information theory

Relative entropy
Consider two distributions $P$ and $Q$ over variable $X$. The relative entropy or Kullback-Leibler (KL) divergence measures the expected length difference when coding instances sampled from $P$ using $Q$ instead:
$$D_{KL}(P \,\|\, Q) = H(P; Q) - H(P) = -\sum_{i=1}^{n} P(v_i) \log Q(v_i) + \sum_{i=1}^{n} P(v_i) \log P(v_i) = \sum_{i=1}^{n} P(v_i) \log \frac{P(v_i)}{Q(v_i)}$$

Note
The KL divergence is not a distance (metric) as it is not necessarily symmetric.

Information theory

Conditional entropy
Consider two variables $V, W$ with joint distribution $P$. The conditional entropy is the entropy remaining for variable $W$ once $V$ is known:
$$H(W \mid V) = \sum_{v} P(v)\,H(W \mid V = v) = -\sum_{v} P(v) \sum_{w} P(w \mid v) \log P(w \mid v)$$

Information theory

Mutual information
Consider two variables $V, W$ with joint distribution $P$. The mutual information (or information gain) is the reduction in entropy for $W$ once $V$ is known:
$$I(W; V) = H(W) - H(W \mid V) = -\sum_{w} P(w) \log P(w) + \sum_{v} P(v) \sum_{w} P(w \mid v) \log P(w \mid v)$$
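
A sketch making the last three definitions concrete (in bits; the small joint table is invented purely for illustration):

import numpy as np

# KL divergence between the P and Q of the previous example
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])
print(np.sum(P * np.log2(P / Q)))        # D_KL(P || Q) ~ 0.32 bits, >= 0 and not symmetric

# mutual information from a 2x2 joint distribution P(v, w)
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
P_v, P_w = joint.sum(axis=1), joint.sum(axis=0)
H_W = -np.sum(P_w * np.log2(P_w))
# H(W|V) = -sum_{v,w} P(v, w) log P(w | v), with P(w | v) = P(v, w) / P(v)
H_W_given_V = -np.sum(joint * np.log2(joint / P_v[:, None]))
print(H_W - H_W_given_V)                 # I(W; V) = H(W) - H(W|V) >= 0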