
Inference of Probability Distributions for Trust and Security Applications
Vladimiro Sassone
Based on joint work with Mogens Nielsen & Catuscia Palamidessi

Outline
- Motivations
- Bayesian vs frequentist approach
- A class of functions to estimate the distribution
- Measuring the precision of an estimation function

Motivations
Inferring the probability distribution of a random variable. Examples of applications in Trust & Security:
- How much we can trust an individual or a set of individuals
- The input distribution in a noisy channel, needed to compute the Bayes risk
- Application of the Bayesian approach to hypothesis testing (anonymity, information flow)
- ...

Setting and assumptions
For simplicity we consider only binary random variables: X = {succ, fail} (honest/dishonest, secure/insecure, ...).
Goal: infer (an approximation of) the probability of success, Pr(succ) = θ.
Means: a sequence of n trials. Observation (evidence): s = #succ, f = #fail = n − s.

Using the evidence to infer θ
The frequentist method: F(n, s) = s/n.
The Bayesian method: assume an a priori probability distribution for θ (representing your partial knowledge about θ, whatever the source may be) and combine it with the evidence, using Bayes' theorem, to obtain the a posteriori distribution.
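
Both estimators are one-liners. A minimal sketch (not part of the talk; the Beta-prior form of the Bayesian estimate anticipates the mean-based algorithm derived on later slides):

```python
# Minimal sketch: two point estimates of theta from s successes in n trials.

def frequentist(n, s):
    """F(n, s) = s/n: the empirical frequency of success."""
    return s / n

def bayesian_mean(n, s, alpha=1.0, beta=1.0):
    """Mean of the Beta(alpha + s, beta + f) a posteriori obtained from a
    Beta(alpha, beta) a priori; alpha = beta = 1 is the uniform prior."""
    return (s + alpha) / (n + alpha + beta)

print(frequentist(1, 1))    # 1.0 -- all mass on one extreme after a single trial
print(bayesian_mean(1, 1))  # 0.666... -- pulled toward the prior mean 1/2
```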

Bayesian vs Frequentist
Criticisms of the frequentist approach:
- Limited applicability: sometimes it is not possible to measure the frequencies (in this talk we consider the case in which this is possible). E.g.: what is the probability that my submitted paper will be accepted?
- Misleading evidence: for small samples (small n) we can be unlucky, i.e. get unlikely results. This is less dramatic for the Bayesian approach, because the a priori distribution reduces the effect of misleading evidence, provided it is close enough to the real distribution.

Bayesian vs Frequentist
Criticism of the Bayesian approach:
- We need to assume an a priori probability distribution; as we usually do not know the real distribution, the assumption can be somewhat arbitrary and differ significantly from reality.
Observe that the two approaches give the same result as n tends to infinity, namely the true distribution:
- Frequentist approach: because of the law of large numbers.
- Bayesian approach: because the a priori washes out for large values of n.

Bayesian vs Frequentist
The surprising thing is that the frequentist approach can be worse than the Bayesian approach even when the trials give a good result, or when we consider the average difference (from the true θ) w.r.t. all possible results.

Example: true θ = 1/2, n = 1.

F(n, s) = s/n = 0 if s = 0, and 1 if s = 1.

The difference from the true distribution is 1/2.

A better function would be

F_c(n, s) = (s + 1)/(n + 2) = 1/3 if s = 0, and 2/3 if s = 1.

The difference from the true distribution is 1/6.

[Plot: the estimates of F and F_c for both outcomes, against the true θ = 1/2.]

Bayesian vs Frequentist
The surprising thing is that the frequentist approach can be worse than the Bayesian approach even when the trials give a good result, or when we consider the average difference (from the true θ) w.r.t. all possible results.

Example: true θ = 1/2, n = 2.

F(n, s) = s/n = 0 if s = 0, 1/2 if s = 1, and 1 if s = 2.

The average distance from the true distribution is 1/4.

Again, a better function would be

F_c(n, s) = (s + 1)/(n + 2) = 1/4 if s = 0, 1/2 if s = 1, and 3/4 if s = 2.

The average distance from the true distribution is 1/8.

[Plot: the estimates of F and F_c for all three outcomes, against the true θ = 1/2.]
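
These averages are easy to check numerically. A small sketch (assuming the true θ = 1/2, as in the example), weighting each outcome s by its binomial probability Pr(s | θ):

```python
from math import comb

def avg_distance(estimator, n, theta):
    """Average of |estimator(n, s) - theta| over s, weighted by Pr(s | theta)."""
    return sum(comb(n, s) * theta**s * (1 - theta)**(n - s)
               * abs(estimator(n, s) - theta)
               for s in range(n + 1))

F   = lambda n, s: s / n              # frequentist estimator
F_c = lambda n, s: (s + 1) / (n + 2)  # "corrected" estimator

print(avg_distance(F,   2, 0.5))   # 0.25  = 1/4, as on the slide
print(avg_distance(F_c, 2, 0.5))   # 0.125 = 1/8
```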

Bayesian vs Frequentist
- We will see that F_c(n, s) = (s + 1)/(n + 2) corresponds to one of the possible Bayesian approaches.
- Of course, if the true θ is different from 1/2, then F_c can be worse than F.
- And, of course, the problem is that we don't know what θ is (the value of θ is exactly what we are trying to find out!).
- However, F_c is still better than F if we consider the average distance w.r.t. all possible θ ∈ [0, 1], assuming that they are all equally likely (i.e. that θ has a uniform distribution).
- In fact we can prove that, under a suitable notion of difference, and for θ uniformly distributed, F_c is the best function of the form G(n, s) = (s + t)/(n + m). (See the sketch below.)
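
A numerical sanity check of that last claim. This sketch assumes the squared distance as the notion of difference (the talk's "suitable notion" is made precise later) and averages the risk over θ uniform on [0, 1] for a few members of the family G:

```python
from math import comb

def uniform_risk(t, m, n, steps=10_000):
    """Average of sum_s Pr(s | theta) * (G(n, s) - theta)^2 over theta
    uniform on [0, 1], approximated by a midpoint Riemann sum."""
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        total += sum(comb(n, s) * theta**s * (1 - theta)**(n - s)
                     * ((s + t) / (n + m) - theta)**2
                     for s in range(n + 1))
    return total / steps

n = 5
for t, m in [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]:
    print((t, m), uniform_risk(t, m, n))
# (t, m) = (1.0, 2.0), i.e. F_c, yields the smallest value
```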

A Bayesian approach
Assumption: θ is the generic value of a continuous random variable Θ whose probability density is a Beta distribution with (unknown) parameters σ, ϕ:

B(σ, ϕ)(θ) = Γ(σ + ϕ) / (Γ(σ) Γ(ϕ)) · θ^(σ−1) (1 − θ)^(ϕ−1)

where Γ is the extension of the factorial function, i.e. Γ(n) = (n − 1)! for n a natural number.

Note that the uniform distribution is a particular case of the Beta distribution, with σ = ϕ = 1. B(σ, ϕ) can be seen as the a posteriori probability density of Θ given by a uniform a priori (principle of maximum entropy) and a trial sequence resulting in σ − 1 successes and ϕ − 1 failures.
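
A quick illustration of those two facts; a sketch using scipy.stats (my choice of library, not the talk's):

```python
from scipy.stats import beta

# Beta(1, 1) is the uniform density on [0, 1]
print(beta.pdf(0.3, 1, 1))   # 1.0

# Starting from the uniform a priori, s successes and f failures
# yield the a posteriori Beta(1 + s, 1 + f)
s, f = 7, 3
posterior = beta(1 + s, 1 + f)
print(posterior.mean())      # (s + 1)/(s + f + 2) = 8/12 = 0.666...
```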

Examples of Beta Distribution
[Plots of B(σ, ϕ) for various values of the parameters.]

Other examples of Beta Distribution
[Further plots of Beta densities.]

The Bayesian Approach
Assume an a priori probability distribution for Θ (representing our partial knowledge about Θ, whatever the source may be) and combine it with the evidence, using Bayes' theorem, to obtain the a posteriori probability distribution:

Pd(θ | s) = Pr(s | θ) Pd(θ) / Pr(s)

(a posteriori = likelihood × a priori / evidence)

One possible definition for the estimation function (algorithm) is the mean of the a posteriori distribution:

A(n, s) = E_{Pd(θ|s)}(Θ) = ∫₀¹ θ Pd(θ | s) dθ
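
These quantities can also be computed by brute force, without appealing to conjugacy. A sketch (uniform a priori assumed) that evaluates Bayes' theorem on a grid:

```python
from math import comb

def posterior_mean(n, s, prior=lambda th: 1.0, steps=10_000):
    """Mean of Pd(theta | s), computed by numerical integration."""
    f = n - s
    likelihood = lambda th: comb(n, s) * th**s * (1 - th)**f
    grid = [(i + 0.5) / steps for i in range(steps)]
    evidence = sum(likelihood(th) * prior(th) for th in grid) / steps
    numer = sum(th * likelihood(th) * prior(th) for th in grid) / steps
    return numer / evidence

print(posterior_mean(10, 7))  # ~0.6667 = (7 + 1)/(10 + 2), matching conjugacy
```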

The Bayesian Approach
Since the distribution of Θ is assumed to be a Beta distribution B(σ, ϕ), it is natural to take as a priori a function of the same class, i.e. B(α, β). In general we don't know the real parameters, hence (α, β) may be different from (σ, ϕ).

The likelihood Pr(s | θ) is a binomial, i.e.

Pr(s | θ) = (s + f choose s) θ^s (1 − θ)^f

The Beta distribution is a conjugate of the binomial, which means that the application of Bayes' theorem gives as a posteriori a function of the same class, and more precisely

Pd(θ | s) = B(α + s, β + f)

The Bayesian Approach
Summarizing, we are considering three probability density functions for Θ:
- B(σ, ϕ): the real distribution of Θ
- B(α, β): the a priori (the distribution of Θ up to our best knowledge)
- B(s + α, f + β): the a posteriori

The result of the mean-based algorithm is:

A_{α,β}(n, s) = E_{B(s+α, f+β)}(Θ) = (s + α)/(s + f + α + β) = (s + α)/(n + α + β)

The Bayesian Approach
The frequentist method can be seen as the limit of the Bayesian mean-based algorithms, for α, β → 0.

Intuitively, the Bayesian mean-based algorithms give the best result for (α, β) = (σ, ϕ).

How can we compare two Bayesian algorithms in general, i.e. independently of the true parameters (σ, ϕ)?
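
A sketch of both observations on concrete numbers (n = 10, s = 7 chosen arbitrarily):

```python
def A(n, s, alpha, beta):
    """Mean-based Bayesian estimate (s + alpha)/(n + alpha + beta)."""
    return (s + alpha) / (n + alpha + beta)

n, s = 10, 7
print(A(n, s, 1, 1))          # 0.666...: the a priori B(1, 1) recovers F_c
for eps in (1e-1, 1e-3, 1e-6):
    print(A(n, s, eps, eps))  # -> 0.7 = s/n: the frequentist limit as alpha, beta -> 0
```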

Measuring the precision of Bayesian algorithms
Define a difference D(A(n, s), θ) (possibly a distance, but not necessarily: it does not need to be symmetric), which is:
- non-negative
- zero iff A(n, s) = θ
- what else?

Consider the expected value D_E(A, n, θ) of D(A(n, s), θ) with respect to the likelihood (the conditional probability of s given θ):

D_E(A, n, θ) = Σ_{s=0}^{n} Pr(s | θ) D(A(n, s), θ)

Risk of A: the expected value R(A, n) of D_E(A, n, θ) with respect to the true distribution of Θ:

R(A, n) = ∫₀¹ Pd(θ) D_E(A, n, θ) dθ
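
The two definitions translate directly into code. A sketch (numerical integration; D and A are passed in as functions, and the true distribution is taken to be a Beta, as on the earlier slides):

```python
from math import comb
from scipy.stats import beta as beta_dist

def D_E(A, n, theta, D):
    """Expected difference of A from theta, over the binomial likelihood."""
    return sum(comb(n, s) * theta**s * (1 - theta)**(n - s)
               * D(A(n, s), theta)
               for s in range(n + 1))

def risk(A, n, sigma, phi, D, steps=2_000):
    """R(A, n) when the true distribution of Theta is B(sigma, phi)."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    return sum(beta_dist.pdf(th, sigma, phi) * D_E(A, n, th, D)
               for th in grid) / steps

squared = lambda x, y: (x - y) ** 2
A11 = lambda n, s: (s + 1) / (n + 2)   # mean-based algorithm with a priori B(1, 1)
print(risk(A11, 5, 1.0, 1.0, squared))
```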

Measuring the precision of Bayesian algorithms
Note that the definition of the risk of A is general, i.e. it is a natural definition for any estimation algorithm (not necessarily Bayesian or mean-based).

What other conditions should D satisfy? It seems natural to require that D be such that R(A, n) has a minimum (for all n) when the a priori distribution coincides with the true distribution.

It is not obvious that such a D exists.

Measuring the precision of Bayesian algorithms
We have considered the following candidates for D(x, y) (all of which can be extended to the n-ary case):

The norms: |x − y|, |x − y|², ..., |x − y|^k, ...

The Kullback-Leibler divergence:

D_KL((y, 1 − y) ‖ (x, 1 − x)) = y log₂(y/x) + (1 − y) log₂((1 − y)/(1 − x))
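
In the binary case the divergence is a two-term sum. A small sketch, also showing the asymmetry that stops it from being a distance:

```python
from math import log2

def d_kl(y, x):
    """D_KL((y, 1 - y) || (x, 1 - x)), in bits."""
    return y * log2(y / x) + (1 - y) * log2((1 - y) / (1 - x))

print(d_kl(0.5, 0.5))   # 0.0: zero iff the two distributions coincide
print(d_kl(0.5, 1/3))   # ~0.085
print(d_kl(1/3, 0.5))   # ~0.082: not symmetric
```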

Measuring the precision of Bayesian algorithms
Theorem. For the mean-based Bayesian algorithms, with a priori B(α, β), the condition is satisfied (i.e. the risk is minimal when α, β coincide with the parameters σ, ϕ of the true distribution) by the following functions:
- the 2nd norm (x − y)²
- the Kullback-Leibler divergence

We find it very surprising that the condition is satisfied by these two very different functions, and not by any of the other norms |x − y|^k for k ≠ 2.
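
A self-contained numerical check of the theorem for the squared distance (true distribution B(2, 2) assumed, a choice of mine for illustration), scanning a few a priori parameters:

```python
from math import comb
from scipy.stats import beta as beta_dist

def risk(alpha, beta, n, sigma, phi, steps=4_000):
    """R(A_{alpha,beta}, n) for D(x, y) = (x - y)^2, true Theta ~ B(sigma, phi)."""
    total = 0.0
    for i in range(steps):
        th = (i + 0.5) / steps
        d_e = sum(comb(n, s) * th**s * (1 - th)**(n - s)
                  * ((s + alpha) / (n + alpha + beta) - th)**2
                  for s in range(n + 1))
        total += beta_dist.pdf(th, sigma, phi) * d_e
    return total / steps

n, sigma, phi = 5, 2.0, 2.0
for alpha, beta in [(1, 1), (2, 2), (3, 3), (2, 3)]:
    print((alpha, beta), risk(alpha, beta, n, sigma, phi))
# (2, 2), i.e. the a priori matching the true distribution, minimizes the risk
```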

[Plots for D(x, y) = (x − y)²: the risk R(A_{α,β}, 5) as a function of α and β, for σ = ϕ = 1; and D_E(A_{α,β}, 5, 1/2), for n = 5 and θ = 1/2.]

For the Kullback-Leibler divergence the plots are similar, but much steeper, and they diverge for α → 0 or β → 0.

Work in progress
Note that for the 2nd norm D(x, y) = (x − y)², the average D_E is a distance. This contrasts with the case of D(x, y) = D_KL(y ‖ x), and makes the former more appealing.

How robust is the theorem that certifies that the 2nd-norm-based D_E is a good distance? In particular:
- Does it extend to the case of multi-valued random variables? Note that in the multi-valued case the likelihood is a multinomial, the conjugate a priori is a Dirichlet, and D is the (squared) Euclidean distance.
- What are the possible applications?

Possible applications (work in progress)
We can use D_E to compare two different estimation algorithms:
- mean-based vs other ways of selecting an estimate from the a posteriori
- Bayesian vs non-Bayesian

In more complicated scenarios there may be different Bayesian mean-based algorithms. Example: noisy channels.

D_E induces a metric on distributions. Bayes' equations define transformations on this metric space from the a priori to the a posteriori. We intend to study the properties of such transformations, in the hope that they will reveal interesting properties of the corresponding Bayesian methods, independent of the a priori.