EMPIRICAL FREQUENCY DISTRIBUTION

INTRODUCTION TO MEDICAL STATISTICS
Mirjana Kujundžić Tiljak

- EMPIRICAL FREQUENCY DISTRIBUTION: observed data
- PROBABILITY DISTRIBUTION: described by mathematical models

- when an empirical distribution approximates a particular probability distribution, theoretical knowledge of that distribution can be used to answer questions about the data
- answering such questions requires the evaluation of probabilities

PROBABILITY (P)
- measures uncertainty
- measures the chance of a given event occurring
- 0 ≤ P ≤ 1
- P = 0: the event cannot occur
- P = 1: the event must occur
- Q = 1 − P: probability of the complementary event (the event not occurring)

PROBABILITY (P)
Various approaches to calculating probability:
- Subjective: personal degree of belief that the event will occur (e.g. that the world will come to an end in the year 2050)
- Frequentist: the proportion of times the event would occur if the experiment were repeated a large number of times (e.g. the number of times we would get a "head" in repeated tosses of a coin)
- A priori: requires knowledge of the theoretical model, the probability distribution, which describes the probabilities of all possible outcomes of the experiment (e.g. genetic theory allows us to describe the probability distribution for eye color in a baby born to a blue-eyed woman and a brown-eyed man by initially specifying all possible genotypes of eye color in the baby and their probabilities)

PROBABILITY (P)
- The addition rule: if two events (A and B) are mutually exclusive, the probability that either one or the other occurs (A or B) is equal to the sum of their probabilities
  Prob(A or B) = Prob(A) + Prob(B)
- The multiplication rule: if two events (A and B) are independent, the probability that both events occur (A and B) is equal to the product of the probability of each
  Prob(A and B) = Prob(A) × Prob(B)
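
The addition and multiplication rules can be checked on small examples. The sketch below is only an illustration: the fair die and the fair coin are assumed examples, not taken from the slides.

```python
from fractions import Fraction

# Addition rule (mutually exclusive events): one roll of a fair six-sided die.
# Assumed example: A = "roll a 1", B = "roll a 2".
p_one = Fraction(1, 6)
p_two = Fraction(1, 6)
p_one_or_two = p_one + p_two        # Prob(A or B) = Prob(A) + Prob(B)

# Multiplication rule (independent events): two tosses of a fair coin.
# Assumed example: A = "head on toss 1", B = "head on toss 2".
p_head = Fraction(1, 2)
p_two_heads = p_head * p_head       # Prob(A and B) = Prob(A) x Prob(B)

# Complementary event: Q = 1 - P.
q_not_two_heads = 1 - p_two_heads

print(p_one_or_two, p_two_heads, q_not_two_heads)
```

Exact fractions are used here so that the sums and products are not blurred by floating-point rounding.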

RANDOM VARIABLES
- random variable: a quantity that can take any one of a set of mutually exclusive values with a given probability
- discrete (discontinuous) random variable: the numerical values are integers
  e.g. number of children in a family: 0, 1, 2, 3, ..., k
- continuous random variable: the numerical values are real numbers
  e.g. body weight 72.35 kg, blood glucose level 7.2 mmol/L

PROBABILITY DISTRIBUTION
- shows the probabilities of all possible values of the random variable
- a theoretical distribution that is expressed mathematically
- has a mean and a variance that are analogous to those of an empirical distribution
- parameters: summary measures (e.g. mean, variance) characterizing the distribution; they are estimated in the sample by the relevant statistics
- depending on whether the random variable is discrete or continuous, the probability distribution is either discrete or continuous

PROBABILITY: DISCRETE DISTRIBUTIONS (Binomial, Poisson)
- a probability can be derived for every possible value of the random variable
- the sum of all such probabilities is 1

PROBABILITY: CONTINUOUS DISTRIBUTIONS (Normal, Chi-squared, t, F)
- only the probability that the random variable, x, takes values in a certain range can be derived
- if the horizontal axis represents the values of x, the curve given by the equation of the distribution can be drawn (the probability density function)
- total area under the curve = 1; it represents the probability of all possible events
- the probability that x lies between two limits is equal to the area under the curve between these values

PROBABILITY
Probability that x lies between two limits?
[Two figure slides]
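
For a continuous distribution, the probability that x lies between two limits is the area under the density curve between them. The following sketch approximates that area numerically and compares it with the value from the cumulative distribution function. The Normal distribution with mean 100 and standard deviation 15, the limits 85 and 130, and the helper name `area_between` are all assumptions chosen for illustration.

```python
from statistics import NormalDist

def area_between(dist, lower, upper, steps=10_000):
    """Approximate the area under dist.pdf between lower and upper
    using the trapezoidal rule."""
    width = (upper - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(upper))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

# Hypothetical variable: x ~ Normal(mean 100, standard deviation 15).
x = NormalDist(mu=100, sigma=15)

numeric = area_between(x, 85, 130)
exact = x.cdf(130) - x.cdf(85)   # same probability from the cumulative distribution

print(round(numeric, 4), round(exact, 4))
```

In practice the cumulative distribution function is used directly; the numerical integration is shown only to make the "area under the curve" interpretation concrete.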

THE NORMAL (GAUSSIAN) DISTRIBUTION
- one of the most important distributions in statistics
- named after the German mathematician C. F. Gauss
- most biological measurements approximately follow the Normal distribution
- it is used in many analytical models

THE NORMAL (GAUSSIAN) DISTRIBUTION
Probability density function:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{a}, \qquad a = -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}$
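
The Normal density, f(x) = exp(a) / (σ√(2π)) with a = −½((x − µ)/σ)², translates directly into code. This is a minimal sketch; the function name `normal_pdf` and the values µ = 72, σ = 10 are hypothetical, and the standard-library `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal probability density at x for mean mu and standard deviation sigma."""
    a = -0.5 * ((x - mu) / sigma) ** 2          # the exponent a from the formula
    return math.exp(a) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters: mean 72, standard deviation 10.
mu, sigma = 72.0, 10.0
for x in (52.0, 72.0, 80.0):
    own = normal_pdf(x, mu, sigma)
    reference = NormalDist(mu, sigma).pdf(x)    # library value for comparison
    print(x, round(own, 6), round(reference, 6))
```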

THE NORMAL (GAUSSIAN) DISTRIBUTION
Completely described by two parameters:
- mean (µ)
- variance (σ²)
X ~ N(µ, σ²)
[Figure slide: the Normal (Gaussian) distribution]

THE NORMAL (GAUSSIAN) DISTRIBUTION
Normal distribution curve:
- area under the curve = 1
- bell-shaped (unimodal)
- symmetrical about its mean
- absolute maximum at x = µ
- shifted to the right if the mean is increased and to the left if the mean is decreased (assuming constant variance)
- flattened as the variance is increased, but more peaked as the variance is decreased (for a fixed mean)

THE NORMAL (GAUSSIAN) DISTRIBUTION
- the mean, median and mode of a Normal distribution are equal
- the probability (P) that a Normally distributed random variable, x, with mean, µ, and standard deviation, σ, lies between:
  (µ − σ) and (µ + σ) is 0.68
  (µ − 1.96σ) and (µ + 1.96σ) is 0.95
  (µ − 2.58σ) and (µ + 2.58σ) is 0.99
- these intervals may be used to define reference intervals
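
The probabilities 0.68, 0.95 and 0.99 can be reproduced from the standard Normal cumulative distribution function. The sketch below uses the standard-library `statistics.NormalDist`; the helper name `prob_within` and the glucose-like values (mean 5.0, standard deviation 0.5) for the reference interval are hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, standard deviation 1

def prob_within(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2.58):
    print(k, round(prob_within(k), 2))

# A 95% reference interval for a hypothetical measurement
# with mean 5.0 and standard deviation 0.5 (made-up values).
mu, sigma = 5.0, 0.5
reference_interval = (mu - 1.96 * sigma, mu + 1.96 * sigma)
print(reference_interval)
```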

THE NORMAL (GAUSSIAN) DISTRIBUTION
[Figure slides: changing µ, constant σ]

THE NORMAL (GAUSSIAN) DISTRIBUTION
[Figure slides: changing σ, constant µ]

THE STANDARD NORMAL DISTRIBUTION
- transformation of an original value (x_i) to a Standardized Normal Deviate (SND), z_i:
  z_i = (x_i − µ)/σ
- z is a random variable that has a Standard Normal distribution
- in a sample: z_i = (x_i − x̄)/s
- mean (µ) = 0; variance (σ²) = 1; N(0, 1)
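
Standardizing a sample with z_i = (x_i − x̄)/s always gives values with mean 0 and standard deviation 1. The sketch below illustrates this on made-up body weights; the data are hypothetical and serve only to show the transformation.

```python
from statistics import mean, stdev

# Hypothetical sample of body weights in kg (made-up values).
weights = [61.5, 72.35, 68.0, 80.2, 75.1, 66.4]

x_bar = mean(weights)   # sample mean
s = stdev(weights)      # sample standard deviation (n - 1 in the denominator)

# Standardized values: z_i = (x_i - x_bar) / s
z = [(x - x_bar) / s for x in weights]

z_mean = mean(z)        # 0 up to floating-point rounding
z_sd = stdev(z)         # 1 up to floating-point rounding
print(z_mean, z_sd)
```

Note that standardizing changes only the location and scale of the values; it does not make a non-Normal sample Normal.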

THE STANDARD NORMAL DISTRIBUTION
- each original value is transformed to its standardized value: x_1 → z_1, x_2 → z_2, x_3 → z_3, ..., x_n → z_n
- question: what are the mean and the standard deviation (s_z) of the z values?
- answer: mean of z = 0, s_z = 1, so Z ~ N(0, 1)
[Figure slide: the Standard Normal distribution]

THE STUDENT'S t-DISTRIBUTION
- W. S. Gosset (pseudonym "Student")
- the parameter that characterizes the t-distribution is the degrees of freedom
- similar in shape to the Normal distribution, but more spread out, with longer tails
- as the degrees of freedom increase, its shape approaches Normality
- useful for calculating confidence intervals and for testing hypotheses about one or two means
[Figure slide: the Student's t-distribution]
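
The statement that the t-distribution has longer tails than the Normal distribution and approaches it as the degrees of freedom increase can be illustrated numerically. The slides do not give the t density, so the sketch below uses the standard textbook formula; the function name `t_pdf` and the chosen degrees of freedom are assumptions for illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom.
    Log-gamma is used so that large df do not overflow."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

normal = NormalDist()  # standard Normal for comparison

for df in (1, 5, 30, 1000):
    # Density at the centre (x = 0) and in the tail (x = 3).
    print(df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))

print("Normal", round(normal.pdf(0), 4), round(normal.pdf(3), 4))
```

With few degrees of freedom the density at x = 3 is clearly larger than the Normal density there (longer tails); with many degrees of freedom the two curves become almost identical.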

THE CHI-SQUARE (χ²) DISTRIBUTION
- a right-skewed distribution taking positive values
- characterized by its degrees of freedom
- its shape depends on the degrees of freedom: it becomes more symmetrical and approaches Normality as they increase
- useful for analysing categorical data
[Figure slide: the Chi-square (χ²) distribution]

THE F-DISTRIBUTION
- skewed to the right
- defined by a ratio: the distribution of a ratio of two estimated variances calculated from Normal data approximates the F-distribution
- characterized by the degrees of freedom of the numerator and of the denominator of the ratio
- useful for comparing two variances, and for comparing more than two means using the analysis of variance

THE LOGNORMAL DISTRIBUTION
- the probability distribution of a random variable whose log (to base 10 or e) follows the Normal distribution
- highly skewed to the right
- if taking logs of raw data that are skewed to the right gives an empirical distribution that is nearly Normal, the data approximate a Lognormal distribution
- geometric mean: a summary measure of location
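
For right-skewed data that are roughly Lognormal, the geometric mean is obtained by averaging the logs and transforming back. The sketch below shows this on made-up values; the data are hypothetical, and `statistics.geometric_mean` (available from Python 3.8) is used only as a cross-check.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical right-skewed measurements (made-up values).
values = [1.2, 1.9, 2.4, 3.1, 4.8, 7.5, 15.0]

# Take natural logs, average them, then back-transform.
logs = [math.log(v) for v in values]
gm = math.exp(mean(logs))

print(round(gm, 3))                       # geometric mean via logs
print(round(geometric_mean(values), 3))   # library cross-check
print(round(mean(values), 3))             # arithmetic mean, pulled up by the long right tail
```

The arithmetic mean is larger than the geometric mean here because the few large values dominate it, which is why the geometric mean is the preferred measure of location for such data.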

THE LOGNORMAL DISTRIBUTION
[Figure slide: the Lognormal distribution]

THE BINOMIAL DISTRIBUTION
- theoretical distribution for a discrete random variable
- definition: Jacob Bernoulli
- two outcomes: success and failure
- n events
- e.g. n = 100 unrelated women undergoing IVF; outcome = success (pregnancy) or failure

THE BINOMIAL DISTRIBUTION
Two parameters describe the Binomial distribution:
- n = number of individuals in the sample (or repetitions of a trial)
- π = the true probability of success for each individual (or in each trial)
X ~ B(n, π)

THE BINOMIAL DISTRIBUTION
- Mean = nπ (the value of the random variable that we expect if we look at n individuals, or repeat the trial n times)
- Variance = nπ(1 − π)
- for small n, the distribution is skewed to the right if π < 0.5 and skewed to the left if π > 0.5
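
The mean nπ and the variance nπ(1 − π) can be verified by summing over the Binomial probabilities. The slides do not state the probability formula, so the sketch below uses the standard one, C(n, k) π^k (1 − π)^(n − k); the function name `binomial_pmf` and the values n = 10, π = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n independent trials,
    each with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

# Hypothetical parameters chosen for illustration.
n, pi = 10, 0.3

probs = [binomial_pmf(k, n, pi) for k in range(n + 1)]
total = sum(probs)                                            # should be 1
mean = sum(k * p for k, p in enumerate(probs))                # should equal n * pi
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))  # n * pi * (1 - pi)

print(round(total, 6), round(mean, 6), round(variance, 6))
```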

THE BINOMIAL DISTRIBUTION
- the distribution becomes more symmetrical as the sample size increases, and approximates the Normal distribution if both nπ and n(1 − π) are greater than 5
- the properties of the Binomial distribution can be used when making inferences about proportions
- the Normal approximation to the Binomial distribution is often used when analyzing proportions

THE BINOMIAL DISTRIBUTION
Example: gene recombination
- chromosomal locus with 2 alleles: A and a
- p = probability of A
- q = 1 − p = probability of a
- P(A) = p, P(a) = q, (p + q = 1)

THE BINOMIAL DISTRIBUTION
Conception: outcome space {AA, Aa, aa}
- P(AA) = P(A) × P(A) = p²
- P(Aa) = P(A) × P(a) = pq
- P(aA) = P(a) × P(A) = qp
- P(aa) = P(a) × P(a) = q²
- P(Aa) and P(aA) together give 2pq
- p² + 2pq + q² = (p + q)² = 1² = 1
[Figure slide: the Binomial distribution]

THE BINOMIAL DISTRIBUTION
Example: probability of genotypes
- frequency of gene A = 0.33
- frequency of gene a = 0.67
- (p + q)² = (0.33 + 0.67)² = 0.33² + 2 × 0.33 × 0.67 + 0.67²
- P(AA) = 0.33² = 0.1089
- P(Aa) = 0.33 × 0.67 = 0.2211
- P(aA) = 0.67 × 0.33 = 0.2211
- P(aa) = 0.67² = 0.4489

THE BINOMIAL DISTRIBUTION
Graphical presentation: probabilities of different genotypes
[Figure: bar chart of the probability (P, axis from 0 to 0.5) of the genotypes AA, Aa and aa]
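
The genotype probabilities for allele frequencies 0.33 and 0.67 follow from expanding (p + q)². A minimal sketch of that arithmetic is given below; the function name `genotype_probabilities` is hypothetical, and the heterozygote entry combines Aa and aA into 2pq.

```python
def genotype_probabilities(p):
    """Genotype probabilities from the expansion (p + q)^2 = p^2 + 2pq + q^2,
    where p is the frequency of allele A and q = 1 - p that of allele a."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

probs = genotype_probabilities(0.33)   # allele frequencies from the example
for genotype, prob in probs.items():
    print(genotype, round(prob, 4))
print("total", round(sum(probs.values()), 4))
```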

THE BINOMIAL DISTRIBUTION
Example: death as an outcome with a Binomial distribution
- lethality of a disease: p = 0.30 (30/100)
- survival probability: q = 0.70
- n = 5
- Binomial expansion: (0.30 + 0.70)^5

Number of deaths    Binomial term    Probability
5 (everybody)       p^5              0.00243
4                   5 p^4 q          0.02835
3                   10 p^3 q^2       0.13230
2                   10 p^2 q^3       0.30870
1                   5 p q^4          0.36015
0 (nobody)          q^5              0.16807
Total                                1.00000

THE POISSON DISTRIBUTION
- Poisson (beginning of the 19th century)
- the Poisson random variable is the count, or number, of events that occur independently and randomly in time or space at some average rate, µ (it takes the value 0 and all positive integers)
- example: the number of hospital admissions per day typically follows the Poisson distribution
- the Poisson distribution can be used to calculate the probability of a certain number of admissions on any particular day

THE POISSON DISTRIBUTION
- the mean (average rate, µ) is the parameter that describes the Poisson distribution
- the mean equals the variance in the Poisson distribution
- unimodal curve, right-skewed if the mean is small, but more symmetrical as the mean increases, when it approximates a Normal distribution
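
The probability of a given number of hospital admissions on a day can be computed from the Poisson probability formula. The slides do not state it, so the sketch below uses the standard form P(k) = e^(−µ) µ^k / k!; the function name `poisson_pmf` and the average rate of 3 admissions per day are hypothetical. The sums also confirm that the mean equals the variance.

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k events when events occur at average rate mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

mu = 3.0  # hypothetical average number of admissions per day

# Probabilities of 0 to 5 admissions on a particular day.
for k in range(6):
    print(k, round(poisson_pmf(k, mu), 4))

# Mean and variance, summed over enough terms that the remaining tail is negligible.
probs = [poisson_pmf(k, mu) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(round(mean, 6), round(variance, 6))
```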