STATS8: Introduction to Biostatistics. Random Variables and Probability Distributions. Babak Shahbaba Department of Statistics, UCI


Random variables In this lecture, we will discuss random variables and their probability distributions. Formally, a random variable X assigns a numerical value to each possible outcome (and event) of a random phenomenon. For instance, we can define X based on the possible genotypes of a bi-allelic gene A as follows: X = 0 for genotype AA, X = 1 for genotype Aa, and X = 2 for genotype aa. Alternatively, we can define a random variable, Y, this way: Y = 0 for genotypes AA and aa, and Y = 1 for genotype Aa.

Random variables After we define a random variable, we can find the probabilities of its possible values based on the probabilities of its underlying random phenomenon. This way, instead of talking about the probabilities of different outcomes and events, we can talk about the probabilities of different values of the random variable. For example, suppose P(AA) = 0.49, P(Aa) = 0.42, and P(aa) = 0.09. Then we can say that P(X = 0) = 0.49, i.e., X is equal to 0 with probability 0.49. Note that the total probability for the random variable is still 1.

Random variables The probability distribution of a random variable specifies its possible values (i.e., its range) and their corresponding probabilities. For the random variable X defined based on genotypes, the probability distribution can be simply specified as follows: P(X = x) = 0.49 for x = 0, 0.42 for x = 1, and 0.09 for x = 2. Here, x denotes a specific value (i.e., 0, 1, or 2) of the random variable.
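
To make this concrete, here is a minimal sketch in plain Python (the code is illustrative and not part of the original slides) that stores this pmf, checks that the probabilities sum to 1, and computes the mean of X:

# Probability distribution of X from the genotype example
pmf = {0: 0.49, 1: 0.42, 2: 0.09}

# The probabilities over the whole range must add up to 1
total = sum(pmf.values())
print(total)                                 # 1.0 (up to floating-point rounding)

# Mean (expected value) of X: sum of each value times its probability
mean_x = sum(x * p for x, p in pmf.items())
print(mean_x)                                # 0.42 + 2 * 0.09 = 0.6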

Discrete vs. continuous random variables We divide random variables into two major groups: discrete and continuous. Discrete random variables can take a countable set of values. These variables can be categorical (nominal or ordinal), such as genotype, or counts, such as the number of patients visiting an emergency room per day. Continuous random variables can take an uncountable number of possible values; for any two possible values of such a random variable, we can always find another value between them.

Probability distribution The probability distribution of a random variable provides the required information to find the probability of its possible values. The probability distributions discussed here are characterized by one or more parameters. The parameters of the probability distributions we assume for random variables are usually unknown. Typically, we use Greek letters such as µ and σ to denote these parameters and distinguish them from known values. We usually use µ to denote the mean of a random variable and σ² to denote its variance.

Discrete probability distributions For discrete random variables, the probability distribution is fully defined by the probability mass function (pmf). This is a function that specifies the probability of each possible value within the range of the random variable. For the genotype example, the pmf of the random variable X is P(X = x) = 0.49 for x = 0, 0.42 for x = 1, and 0.09 for x = 2.

Bernoulli distribution Binary random variables (e.g., healthy/diseased) are abundant in scientific studies. The binary random variable X with possible values 0 and 1 has a Bernoulli distribution with parameter θ. Here, P(X = 1) = θ and P(X = 0) = 1 − θ. For example, P(X = x) = 0.2 for x = 0 and 0.8 for x = 1. We denote this as X ∼ Bernoulli(θ), where 0 ≤ θ ≤ 1.

Bernoulli distribution The mean of a binary random variable, X, with a Bernoulli(θ) distribution is θ. We show this as µ = θ. The variance of a random variable with a Bernoulli(θ) distribution is σ² = θ(1 − θ) = µ(1 − µ). The standard deviation is obtained by taking the square root of the variance: σ = √(θ(1 − θ)) = √(µ(1 − µ)).
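
As a hedged sketch (assuming Python with scipy installed; θ = 0.8 is the example value from the previous slide), these formulas can be checked numerically:

from scipy.stats import bernoulli

theta = 0.8                    # example value; any 0 <= theta <= 1 works
X = bernoulli(theta)

print(X.pmf(0), X.pmf(1))      # 0.2 and 0.8
print(X.mean())                # mu = theta = 0.8
print(X.var())                 # sigma^2 = theta * (1 - theta) = 0.16
print(X.std())                 # sigma = sqrt(0.16) = 0.4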

Binomial distribution A sequence of binary random variables X_1, X_2, ..., X_n is called Bernoulli trials if they all have the same Bernoulli distribution and are independent. The random variable Y representing the number of times the outcome of interest occurs in n Bernoulli trials (i.e., the sum of the Bernoulli trials) has a Binomial(n, θ) distribution. The pmf of a Binomial(n, θ) distribution specifies the probability of each possible value (integers from 0 through n) of the random variable. The theoretical (population) mean of a random variable Y with a Binomial(n, θ) distribution is µ = nθ. The theoretical (population) variance of Y is σ² = nθ(1 − θ).
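
A short sketch with scipy (n = 10 and θ = 0.3 are illustrative values, not taken from the lecture) confirms the pmf, mean, and variance of a Binomial(n, θ) random variable:

from scipy.stats import binom

n, theta = 10, 0.3             # illustrative values
Y = binom(n, theta)

# The pmf over the possible values 0, 1, ..., n sums to 1
print(sum(Y.pmf(k) for k in range(n + 1)))   # ~1.0

print(Y.mean())                # n * theta = 3.0
print(Y.var())                 # n * theta * (1 - theta) = 2.1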

Continuous probability distributions For discrete random variables, the pmf provides the probability of each possible value. For continuous random variables, the number of possible values is uncountable, and the probability of any specific value is zero. For these variables, we are interested in the probability that the value of the random variable is within a specific interval from x_1 to x_2; we show this probability as P(x_1 < X ≤ x_2).

Probability density function For continuous random variables, we use probability density functions (pdf) to specify the distribution. Using the pdf, we can obtain the probability of any interval. [Figure: Probability density function for BMI.]

Probability density function The total area under the probability density curve is 1. The curve (and its corresponding function) gives the probability of the random variable falling within an interval. This probability is equal to the area under the probability density curve over the interval. [Figure: probability density curve for BMI.]
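
To make the "probability equals area under the curve" idea concrete, the sketch below numerically integrates a normal pdf; the BMI parameters (mean 27, standard deviation 5) are made-up placeholders, not values given in the lecture:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

bmi = norm(loc=27, scale=5)           # hypothetical BMI distribution, N(27, 5^2)

# Total area under the density curve is 1
total_area, _ = quad(bmi.pdf, -np.inf, np.inf)
print(total_area)                     # ~1.0

# The area over an interval equals the probability of that interval
area_25_30, _ = quad(bmi.pdf, 25, 30)
print(area_25_30)                     # same value as bmi.cdf(30) - bmi.cdf(25)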

Lower tail probability The probability of observing values less than or equal to a specific value x is called the lower tail probability and is denoted P(X ≤ x). [Figure: the lower tail probability as the area under the density curve to the left of x.]

Upper tail probability The probability of observing values greater than x, P(X > x), is called the upper tail probability and is found by measuring the area under the curve to the right of x. [Figure: the upper tail probability as the area under the density curve to the right of x.]

Probability of intervals The probability of any interval from x_1 to x_2, where x_1 < x_2, can be obtained using the corresponding lower tail probabilities for these two points as follows: P(x_1 < X ≤ x_2) = P(X ≤ x_2) − P(X ≤ x_1). For example, the probability of a BMI between 25 and 30 is P(25 < X ≤ 30) = P(X ≤ 30) − P(X ≤ 25).
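
A minimal sketch of this calculation, reusing the hypothetical N(27, 5²) BMI distribution from the sketch above (the parameters are assumptions for illustration only):

from scipy.stats import norm

bmi = norm(loc=27, scale=5)    # hypothetical BMI distribution

p30 = bmi.cdf(30)              # lower tail probability P(X <= 30)
p25 = bmi.cdf(25)              # lower tail probability P(X <= 25)
print(p30 - p25)               # P(25 < X <= 30)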

Normal distribution Consider the probability distribution function and its corresponding probability density curve we assumed for BMI in the above example. This distribution is known as the normal distribution, which is one of the most widely used distributions for continuous random variables. Random variables with this distribution (or very close to it) occur often in nature.

Normal distribution A normal distribution and its corresponding pdf are fully specified by the mean µ and variance σ². A random variable X with a normal distribution is denoted X ∼ N(µ, σ²). N(0, 1) is called the standard normal distribution. [Figure: probability density curves of N(1, 4) and N(−1, 1).]

The 68-95-99.7% rule The 68-95-99.7% rule for normal distributions specifies that 68% of values fall within 1 standard deviation of the mean: P(µ − σ < X ≤ µ + σ) = 0.68. 95% of values fall within 2 standard deviations of the mean: P(µ − 2σ < X ≤ µ + 2σ) = 0.95. 99.7% of values fall within 3 standard deviations of the mean: P(µ − 3σ < X ≤ µ + 3σ) = 0.997.
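
The rule can be verified directly from the normal cdf; the sketch below uses the standard normal, but any choice of µ and σ gives the same central probabilities:

from scipy.stats import norm

Z = norm(0, 1)                           # standard normal

for k in (1, 2, 3):
    central = Z.cdf(k) - Z.cdf(-k)       # P(mu - k*sigma < X <= mu + k*sigma)
    print(k, round(central, 4))          # 0.6827, 0.9545, 0.9973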

Normal distribution [Figure: two panels showing the 68% central probability (the area within µ ± σ) and the 95% central probability (the area within µ ± 2σ) under the normal density curve.]

Student's t-distribution Another continuous probability distribution that is used very often in statistics is the Student's t-distribution, or simply the t-distribution. [Figure: density curves of N(0, 1), t(4), and t(1).]

Student's t-distribution A t-distribution is specified by only one parameter, called the degrees of freedom, df. The t-distribution with df degrees of freedom is usually denoted t(df) or t_df, where df is a positive real number (df > 0). The mean of this distribution is µ = 0, and the variance is determined by the degrees of freedom parameter: σ² = df/(df − 2), which is defined only when df > 2.
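
A brief sketch (with illustrative df values) checks the variance formula and shows that, as df grows, the t-distribution gets close to the standard normal:

from scipy.stats import norm, t

for df in (4, 10, 100):                      # illustrative degrees of freedom
    print(df, t(df).var(), df / (df - 2))    # the two variance values agree

# For large df, t(df) is close to N(0, 1)
print(t(100).cdf(1.96), norm.cdf(1.96))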

Cumulative distribution function We saw that by using lower tail probabilities, we can find the probability of any given interval. Indeed, all we need in order to find the probability of any interval is a function that returns the lower tail probability at any given value of the random variable: P(X ≤ x). This function is called the cumulative distribution function (cdf) or simply the distribution function.

Quantiles We can use the cdf plot in the reverse direction to find the value of the random variable for a given lower tail probability. [Figure: Left: finding lower tail probabilities from the cdf. Right: finding quantiles.]
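
In code, this reverse lookup is the quantile function, the inverse of the cdf; a short sketch with the standard normal, where scipy calls the quantile function ppf:

from scipy.stats import norm

Z = norm(0, 1)

p = Z.cdf(1.2)          # lower tail probability at x = 1.2
print(Z.ppf(p))         # the quantile recovers 1.2

print(Z.ppf(0.975))     # ~1.96, the value with lower tail probability 0.975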

Scaling and shifting random variables If Y = aX + b, then µ_Y = aµ_X + b, σ²_Y = a²σ²_X, and σ_Y = |a|σ_X. The process of shifting and scaling a random variable to create a new random variable with mean zero and variance one is called standardization. For this, we first subtract the mean µ and then divide the result by the standard deviation σ: Z = (X − µ)/σ. If X ∼ N(µ, σ²), then Z ∼ N(0, 1).
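
A quick simulation sketch of standardization (the N(27, 5²) values are hypothetical, used only for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=27, scale=5, size=100_000)   # draws from a hypothetical N(27, 5^2)

z = (x - 27) / 5          # subtract the mean, then divide by the standard deviation
print(z.mean(), z.var())  # close to 0 and 1, as expected for N(0, 1)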

Adding/subtracting random variables If W = X + Y, then µ_W = µ_X + µ_Y. If the random variables X and Y are independent (i.e., they do not affect each other's probabilities), then we can find the variance of W as follows: σ²_W = σ²_X + σ²_Y. If X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y), then, assuming that the two random variables are independent, we have W = X + Y ∼ N(µ_X + µ_Y, σ²_X + σ²_Y).

Adding/subtracting random variables If we subtract Y from X, then µ_W = µ_X − µ_Y. If the two variables are independent, σ²_W = σ²_X + σ²_Y. Note that we still add the variances. Subtracting Y from X is the same as adding −Y to X.
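
A simulation sketch (with illustrative normal parameters) confirming that means add or subtract while the variances add in both cases:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 3.0, size=200_000)   # X ~ N(2, 3^2), illustrative
y = rng.normal(5.0, 4.0, size=200_000)   # Y ~ N(5, 4^2), independent of X

w_sum = x + y
w_diff = x - y

print(w_sum.mean(), w_sum.var())     # about 7 and 25 (= 9 + 16)
print(w_diff.mean(), w_diff.var())   # about -3 and 25: variances still add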