Chapter ML:IV. Statistical Learning: Probability Basics, Bayes Classification, Maximum a-posteriori Hypotheses





Area Overview

[Diagram: overview of the area, arranging Mathematics, Statistics, Stochastics, probability theory, mathematical statistics, inferential statistics, descriptive statistics, exploratory data analysis, and data mining as a hierarchy.]

From the area of probability theory: Kolmogorov axioms
From the area of mathematical statistics: Naive Bayes

Definition 1 (Random Experiment, Random Observation)

A random experiment or random trial is a procedure that, at least theoretically, can be repeated infinitely often. It is characterized as follows:

1. Configuration. A precisely specified system that can be reconstructed.
2. Procedure. An instruction for how to execute the experiment based on the configuration.
3. Unpredictability of the outcome.

Random experiments whose configuration and procedure are not designed artificially are called natural random experiments or natural random observations.

Remarks:

- A procedure can be repeated several times using the same system, but also with equivalent, different systems.
- Random experiments are causal in the sense of cause and effect. The randomness of an experiment (the unpredictability of its outcome) is a consequence of the missing information about the causal chain. As a consequence, a random experiment may turn into a deterministic process once new insights become known.

Definition 2 (Sample Space, Event Space)

A set Ω = {ω_1, ω_2, ..., ω_n} is called the sample space of a random experiment if each experiment outcome is associated with at most one element ω ∈ Ω. The elements in Ω are called outcomes.

Let Ω be a finite sample space. Each subset A ⊆ Ω is called an event; an event A occurs iff the experiment outcome ω is a member of A. The set of all events, P(Ω), is called the event space.

Definition 3 (Important Event Types)

Let Ω be a finite sample space, and let A ⊆ Ω and B ⊆ Ω be two events. Then we agree on the following notation:

1. ∅: impossible event
2. Ω: certain event
3. Ā := Ω \ A: complementary event (opposite event) of A
4. |A| = 1: elementary event
5. A ⊆ B: A is a sub-event of B, or A entails B (A ⇒ B)
6. A = B: A ⊆ B and B ⊆ A
7. A ∩ B = ∅: A and B are incompatible (compatible otherwise)
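To make Definitions 2 and 3 concrete, here is a minimal Python sketch (the coin example and all identifiers are illustrative, not from the slides) that builds a finite sample space and its event space P(Ω) as the power set:

```python
from itertools import combinations

omega = frozenset({"heads", "tails"})          # sample space of a coin flip

# Event space P(Omega): the set of all subsets of Omega.
event_space = [frozenset(c) for r in range(len(omega) + 1)
               for c in combinations(omega, r)]
print(len(event_space))                        # 4 = 2^|Omega| events

A = frozenset({"heads"})                       # elementary event, |A| = 1
complement_A = omega - A                       # complementary event of A (Definition 3)
print(complement_A)                            # frozenset({'tails'})
```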

Classical Concept Formation

Empirical law of large numbers: for particular events, the average of the outcomes obtained from a large number of trials is close to the expected value, and it will become closer as more trials are performed.
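A quick simulation illustrates the empirical law of large numbers; this is a hedged sketch with an assumed fair-die setup, tracking the relative frequency of the event "even number" as the number of trials grows:

```python
import random

random.seed(0)                       # fixed seed for reproducibility

hits, trials = 0, 0
for n in (100, 10_000, 1_000_000):   # increasing numbers of trials
    while trials < n:
        hits += random.randint(1, 6) % 2 == 0   # event: an even number is rolled
        trials += 1
    print(n, hits / trials)          # relative frequency approaches 0.5
```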

Definition 4 (Classical / Laplace Probability)

If each elementary event in Ω is assigned the same probability, then the probability P(A) of an event A is defined as follows:

P(A) = |A| / |Ω| = (number of cases favorable for A) / (number of total outcomes possible)
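Under the Laplace assumption the definition is pure counting. A minimal sketch (the die example and the function name are assumptions for illustration):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                 # equiprobable sample space: a fair die

def laplace_probability(A, omega):
    """P(A) = |A| / |Omega|: favorable cases over possible cases."""
    return Fraction(len(A & omega), len(omega))

print(laplace_probability({2, 4, 6}, omega))   # 1/2 for the event "even number"
```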

Remarks:

- A random experiment whose configuration and procedure imply an equiprobable sample space, be it by definition or by construction, is called a Laplace experiment. The probabilities of the outcomes are called Laplace probabilities. Since they are determined by the experiment configuration along with the experiment procedure, they need not be estimated.
- The assumption that a given experiment is a Laplace experiment is called the Laplace assumption. If the Laplace assumption cannot be presumed, the probabilities can only be obtained from a (possibly large) number of trials.
- Strictly speaking, the Laplace probability as introduced above is not a definition but a circular one: the probability concept is defined by means of the concept of equiprobability, i.e., another kind of probability.
- Inspired by the empirical law of large numbers, attempts have been made to develop a frequentist probability concept based on the (fictitious) limit of the relative frequencies [von Mises, 1951]. The attempt failed since such a limit formation is possible only in mathematical settings (infinitesimal calculus), where exact repetitions ad infinitum can be made.

Axiomatic Concept Formation

The principal steps of axiomatic concept formation:

1. Postulate a function that assigns a probability to each element of the event space.
2. Specify the basic, required properties of this function in the form of axioms.

Definition 5 (Probability Measure [Kolmogorov 1933])

Let Ω be a set, called the sample space, and let P(Ω) be the set of all events, called the event space. Then a function P : P(Ω) → ℝ that maps each event A ∈ P(Ω) onto a real number P(A) is called a probability measure if it has the following properties:

1. P(A) ≥ 0 (Axiom I)
2. P(Ω) = 1 (Axiom II)
3. A ∩ B = ∅ implies P(A ∪ B) = P(A) + P(B) (Axiom III)
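The axioms can be checked mechanically for any finite probability measure. The following sketch (the loaded-die distribution is an assumed toy example) induces P from outcome probabilities and verifies Axioms I-III over the whole event space:

```python
from fractions import Fraction
from itertools import combinations

# Assumed outcome probabilities of a loaded die; they sum to 1.
p_outcome = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 8),
             4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 8)}
omega = frozenset(p_outcome)

def P(A):
    """Probability measure induced by the outcome probabilities."""
    return sum(p_outcome[w] for w in A)

events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]        # event space P(Omega)

assert all(P(A) >= 0 for A in events)             # Axiom I
assert P(omega) == 1                              # Axiom II
assert all(P(A | B) == P(A) + P(B)                # Axiom III for disjoint events
           for A in events for B in events if not (A & B))
```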

Axiomatic Concept Formation (continued)

Definition 6 (Probability Space)

Let Ω be a sample space, let P(Ω) be an event space, and let P : P(Ω) → ℝ be a probability measure. Then the tuple (Ω, P), as well as the triple (Ω, P(Ω), P), is called a probability space.

Theorem 7 (Implications of the Kolmogorov Axioms)

1. P(A) + P(Ā) = 1 (from Axioms II, III)
2. P(∅) = 0 (from 1. with A = Ω)
3. Monotonicity law of the probability measure: A ⊆ B ⇒ P(A) ≤ P(B) (from Axioms I, III)
4. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (from Axiom III)
5. Let A_1, A_2, ..., A_k be mutually exclusive (incompatible); then the following holds: P(A_1 ∪ A_2 ∪ ... ∪ A_k) = P(A_1) + P(A_2) + ... + P(A_k)
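Implications 1 and 4 can be verified numerically; a brief sketch under the assumption of a fair die:

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})

def P(A):
    return Fraction(len(A), len(omega))    # Laplace probabilities on a fair die

A, B = frozenset({1, 2, 3}), frozenset({3, 4})
assert P(A) + P(omega - A) == 1            # implication 1: P(A) + P(complement) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)  # implication 4 (| is set union here)
```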

Remarks:

- The three axioms are also called the axiom system of Kolmogorov.
- P(A) is denoted as the probability of the occurrence of A.
- Nothing is said about the distribution of the probabilities P.
- Generally, a function that is equipped with the three properties of a probability measure is called a non-negative, normalized, and additive measure.

Conditional Probability

Definition 8 (Conditional Probability)

Let (Ω, P(Ω), P) be a probability space and let A, B ∈ P(Ω) be two events. Then the probability of the occurrence of event A, given that event B is known to have occurred, is defined as follows:

P(A | B) = P(A ∩ B) / P(B), if P(B) > 0

P(A | B) is called the probability of A under condition B.
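A worked example of the definition, with assumed fair-die events:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    return Fraction(len(A), len(omega))   # Laplace probabilities on a fair die

A = {6}            # event "a six is rolled"
B = {2, 4, 6}      # condition "the outcome is even"

P_A_given_B = P(A & B) / P(B)             # definition of conditional probability
print(P_A_given_B)                        # 1/3: conditioning on B raises P(A) from 1/6
```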

Conditional Probability (continued)

Theorem 9 (Total Probability)

Let (Ω, P(Ω), P) be a probability space, and let A_1, ..., A_k be mutually exclusive events with Ω = A_1 ∪ ... ∪ A_k and P(A_i) > 0, i = 1, ..., k. Then for any B ∈ P(Ω) the following holds:

P(B) = Σ_{i=1}^{k} P(A_i) · P(B | A_i)

Proof:

P(B) = P(Ω ∩ B)
     = P((A_1 ∪ ... ∪ A_k) ∩ B)
     = P((A_1 ∩ B) ∪ ... ∪ (A_k ∩ B))
     = Σ_{i=1}^{k} P(A_i ∩ B)
     = Σ_{i=1}^{k} P(B ∩ A_i)
     = Σ_{i=1}^{k} P(A_i) · P(B | A_i)
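As a worked example of Theorem 9 (the urn numbers are assumed for illustration): two urns A_1, A_2 partition Ω and are chosen with equal probability; B is the event "a white ball is drawn":

```python
from fractions import Fraction

P_A = [Fraction(1, 2), Fraction(1, 2)]           # P(A_i): the partition of Omega
P_B_given_A = [Fraction(2, 3), Fraction(1, 4)]   # P(B | A_i): white-ball chance per urn

# Theorem of total probability: sum over the partition.
P_B = sum(pa * pb for pa, pb in zip(P_A, P_B_given_A))
print(P_B)   # 11/24
```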

Remarks:

- The theorem of total probability states that the probability of an event equals the sum of the probabilities of the sub-events into which the event has been partitioned.
- Considered as a function in the parameter A with constant B, the conditional probability P(A | B) fulfills the Kolmogorov axioms and hence also defines a probability measure, denoted here as P_B.
- Important consequences (deductions) from the conditional probability definition:
  1. P(A ∩ B) = P(B) · P(A | B) (see the multiplication rule in Definition 10)
  2. P(A ∩ B) = P(B ∩ A) = P(A) · P(B | A)
  3. P(B) · P(A | B) = P(A) · P(B | A), hence P(A | B) = P(A ∩ B) / P(B) = P(A) · P(B | A) / P(B)
  4. P(Ā | B) = 1 − P(A | B). Usually, the following inequality must be assumed: P(A | B̄) ≠ 1 − P(A | B).
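Consequence 3 is Bayes' rule. A small sketch with assumed toy numbers (a rare condition A and a test outcome B = "test positive") that combines it with the theorem of total probability:

```python
from fractions import Fraction

P_A = Fraction(1, 100)              # prior P(A)
P_B_given_A = Fraction(95, 100)     # P(B | A)
P_B_given_notA = Fraction(5, 100)   # P(B | complement of A)

P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_notA   # total probability
P_A_given_B = P_A * P_B_given_A / P_B                  # Bayes' rule (consequence 3)
print(P_A_given_B)                  # 19/118, roughly 0.16
```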

Independence of Events

Definition 10 (Statistical Independence of Two Events)

Let (Ω, P(Ω), P) be a probability space, and let A, B ∈ P(Ω) be two events. Then A and B are called statistically independent iff the following equation holds:

P(A ∩ B) = P(A) · P(B) (multiplication rule)

If statistical independence is given and the two inequalities 0 < P(B) < 1 are fulfilled, the following equivalences hold:

P(A ∩ B) = P(A) · P(B)  ⇔  P(A | B) = P(A | B̄)  ⇔  P(A | B) = P(A)
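A check of the multiplication rule and the equivalence P(A | B) = P(A), assuming two fair dice:

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two fair dice

def P(A):
    return Fraction(len(A), len(omega))

A = {w for w in omega if w[0] == 6}           # first die shows a six
B = {w for w in omega if w[1] % 2 == 0}       # second die shows an even number

assert P(A & B) == P(A) * P(B)                # multiplication rule: independent
assert P(A & B) / P(B) == P(A)                # equivalently, P(A | B) = P(A)
```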

Independence of Events (continued)

Definition 11 (Statistical Independence of k Events)

Let (Ω, P(Ω), P) be a probability space, and let A_1, ..., A_k ∈ P(Ω) be events. Then A_1, ..., A_k are called jointly statistically independent at P iff for all subsets {A_{i_1}, ..., A_{i_l}} ⊆ {A_1, ..., A_k} with i_1 < i_2 < ... < i_l and 2 ≤ l ≤ k the multiplication rule holds:

P(A_{i_1} ∩ ... ∩ A_{i_l}) = P(A_{i_1}) · ... · P(A_{i_l})
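The subset condition can be checked exhaustively; a sketch assuming three fair dice, whose coordinate events are jointly independent by construction. Note that pairwise independence alone would not suffice; Definition 11 requires the rule for every subset with 2 ≤ l ≤ k:

```python
from fractions import Fraction
from itertools import combinations, product

omega = set(product(range(1, 7), repeat=3))   # three fair dice

def P(A):
    return Fraction(len(A), len(omega))

A = [{w for w in omega if w[i] <= 3} for i in range(3)]   # A_i: die i shows at most 3

def jointly_independent(events):
    """Multiplication rule for every subset of size 2 <= l <= k."""
    for l in range(2, len(events) + 1):
        for subset in combinations(events, l):
            prod = Fraction(1)
            for E in subset:
                prod *= P(E)
            if P(set.intersection(*subset)) != prod:
                return False
    return True

print(jointly_independent(A))   # True
```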