Stat 704 Data Analysis I
Probability Review
Timothy Hanson
Department of Statistics, University of South Carolina
Course information

Logistics: Tuesday/Thursday 11:40am to 12:55pm in LeConte College 201A.
Instructor: Tim Hanson, LeConte 219C. Office hours: Tuesday/Thursday 10:00-11:00am and by appointment.
Required text: Applied Linear Statistical Models (5th edition), by Kutner, Nachtsheim, Neter, and Li.
Online notes at http://www.stat.sc.edu/~hansont/stat704/stat704.html
Grading: homework 50%, two exams 25% each.
Stat 704 has a co-requisite of Stat 712 (Casella & Berger level mathematical statistics). You need to be taking this, or have taken it already.
A.3 Random Variables

Definition: A random variable is a function that maps an outcome of some random phenomenon to a real number. More formally, a random variable is a map from the sample space of an experiment, S, to some subset of the real numbers R. Restated: a random variable measures the result of a random phenomenon.

Example 1: The height Y of a randomly selected University of South Carolina statistics graduate student.

Example 2: The number of car accidents Y in a month at the intersection of Assembly and Gervais.
cdf, pdf, pmf

Every random variable has a cumulative distribution function (cdf) associated with it: F(y) = P(Y ≤ y).

Discrete random variables have a probability mass function (pmf)

    f(y) = P(Y = y) = F(y) − F(y−) = F(y) − lim_{x↑y} F(x).   (A.11)

Continuous random variables have a probability density function (pdf) such that for a < b,

    P(a ≤ Y ≤ b) = ∫_a^b f(y) dy.

For continuous random variables, f(y) = F′(y).

Question: Are the two examples on the previous slide continuous or discrete?
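The cdf/pdf relationship can be checked numerically. A minimal Python sketch (my own illustration, not part of the course materials), taking the standard normal as the continuous Y:

```python
import math

# Standard normal cdf and pdf as a concrete continuous example
# (a hypothetical choice; the slide's F and f are generic).
def F(y):  # cdf via the error function
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def f(y):  # pdf
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

# P(a <= Y <= b) should equal both F(b) - F(a) and the integral of f over [a, b].
a, b = -1.0, 1.0
n = 100_000
dx = (b - a) / n
integral = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx  # midpoint rule
print(abs(integral - (F(b) - F(a))) < 1e-6)
```

Here P(−1 ≤ Y ≤ 1) ≈ 0.683, the familiar "within one standard deviation" probability.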
A.3 Expected value

The expected value, or mean, of a random variable is, in general, defined as E(Y) = ∫ y dF(y).

For discrete random variables this is

    E(Y) = Σ_{y : f(y) > 0} y f(y).   (A.12)

For continuous random variables this is

    E(Y) = ∫_{−∞}^{∞} y f(y) dy.   (A.14)
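In the discrete case, (A.12) is just a weighted sum. A quick sketch with a made-up pmf (the probabilities are illustrative, not from the slides):

```python
# Hypothetical pmf for a count variable Y (e.g. accidents in a month).
pmf = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # a valid pmf sums to 1

# E(Y) = sum over {y : f(y) > 0} of y * f(y)
EY = sum(y * p for y, p in pmf.items())
print(EY)  # close to 1.6
```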
E(·) is linear

Note: If a and c are constants,

    E(a + cY) = a + c E(Y).   (A.13)

In particular,

    E(a) = a
    E(cY) = c E(Y)
    E(Y + a) = E(Y) + a
A.3 Variance

The variance of a random variable measures the spread of its probability distribution. It is the expected squared deviation about the mean:

    var(Y) = E{[Y − E(Y)]²}   (A.15)

Equivalently,

    var(Y) = E(Y²) − [E(Y)]²   (A.15a)

Note: If a and c are constants,

    var(a + cY) = c² var(Y)   (A.16)

In particular,

    var(a) = 0
    var(cY) = c² var(Y)
    var(Y + a) = var(Y)

Note: The standard deviation of Y is sd(Y) = √var(Y).
Example

Suppose Y is the high temperature in Celsius of a September day in Seattle. Say E(Y) = 20 and var(Y) = 10. Let W be the high temperature in Fahrenheit. Then

    E(W) = E(9/5 Y + 32) = 9/5 E(Y) + 32 = (9/5)(20) + 32 = 68 degrees.

    var(W) = var(9/5 Y + 32) = (9/5)² var(Y) = 3.24(10) = 32.4 degrees².

    sd(W) = √var(W) = √32.4 ≈ 5.7 degrees.
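The Celsius-to-Fahrenheit arithmetic is easy to reproduce with the linearity rules (A.13) and (A.16); a minimal sketch:

```python
EY, varY = 20.0, 10.0      # given: mean and variance in Celsius
a, c = 32.0, 9.0 / 5.0     # W = a + c*Y converts to Fahrenheit

EW = a + c * EY            # E(a + cY) = a + c E(Y)
varW = c ** 2 * varY       # var(a + cY) = c^2 var(Y)
sdW = varW ** 0.5          # sd is the square root of the variance

print(EW)              # 68.0 degrees
print(round(varW, 1))  # 32.4 degrees^2
print(round(sdW, 1))   # 5.7 degrees
```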
A.3 Covariance

For two random variables Y and Z, the covariance of Y and Z is

    cov(Y, Z) = E{[Y − E(Y)][Z − E(Z)]}.

Note

    cov(Y, Z) = E(YZ) − E(Y)E(Z)   (A.21)

If Y and Z have positive covariance, lower values of Y tend to correspond to lower values of Z (and large values of Y with large values of Z). Example: Y is work experience in years and Z is salary in Euros.

If Y and Z have negative covariance, lower values of Y tend to correspond to higher values of Z, and vice-versa. Example: Y is the weight of a car in tons and Z is miles per gallon.
Covariance is linear

If a_1, c_1, a_2, c_2 are constants,

    cov(a_1 + c_1 Y, a_2 + c_2 Z) = c_1 c_2 cov(Y, Z)   (A.22)

Note: by definition, cov(Y, Y) = var(Y).

The correlation coefficient between Y and Z is the covariance scaled to lie between −1 and 1:

    corr(Y, Z) = cov(Y, Z) / √(var(Y) var(Z))   (A.25a)

If corr(Y, Z) = 0 then Y and Z are uncorrelated.
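Both (A.22) and the correlation bound can be verified exactly on a small joint pmf. A sketch (the joint probabilities are invented for illustration):

```python
# Toy joint pmf for (Y, Z); the probabilities are hypothetical.
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.6}

def cov(g, h):
    """cov(g(Y,Z), h(Y,Z)) computed exactly from the joint pmf."""
    Eg = sum(g(y, z) * p for (y, z), p in joint.items())
    Eh = sum(h(y, z) * p for (y, z), p in joint.items())
    return sum((g(y, z) - Eg) * (h(y, z) - Eh) * p for (y, z), p in joint.items())

c = cov(lambda y, z: y, lambda y, z: z)                        # cov(Y, Z)
shifted = cov(lambda y, z: 5 + 2 * y, lambda y, z: 1 - 3 * z)  # cov(5 + 2Y, 1 - 3Z)
print(abs(shifted - 2 * (-3) * c) < 1e-12)  # (A.22): shifts drop out, scales multiply

vy = cov(lambda y, z: y, lambda y, z: y)  # cov(Y, Y) = var(Y)
vz = cov(lambda y, z: z, lambda y, z: z)
corr = c / (vy * vz) ** 0.5
print(-1.0 <= corr <= 1.0)  # correlation always lies in [-1, 1]
```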
Independent random variables

Informally, two random variables Y and Z are independent if knowing the value of one random variable does not affect the probability distribution of the other.

Note: If Y and Z are independent, then Y and Z are uncorrelated: corr(Y, Z) = 0. However, corr(Y, Z) = 0 does not imply independence in general. If Y and Z have a bivariate normal distribution, then cov(Y, Z) = 0 ⇔ Y, Z independent.

Question: what is the formal definition of independence for (Y, Z)?
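The gap between "uncorrelated" and "independent" can be seen with a standard counterexample (my addition, not from the slides): take Y symmetric about 0 and Z = Y².

```python
# Y uniform on {-1, 0, 1}, Z = Y^2: a classic uncorrelated-but-dependent pair.
pmf = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

EY = sum(y * p for y, p in pmf.items())            # 0 by symmetry
EZ = sum(y ** 2 * p for y, p in pmf.items())       # P(Z = 1) = 2/3
EYZ = sum(y * y ** 2 * p for y, p in pmf.items())  # E(Y^3) = 0 by symmetry

cov_yz = EYZ - EY * EZ
print(abs(cov_yz) < 1e-12)  # uncorrelated

# ...yet clearly dependent: knowing Y = 0 forces Z = 0,
# so P(Z = 1 | Y = 0) = 0 while P(Z = 1) = 2/3.
```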
Linear combinations of random variables

Suppose Y_1, Y_2, ..., Y_n are random variables and a_1, a_2, ..., a_n are constants. Then

    E[Σ_{i=1}^n a_i Y_i] = Σ_{i=1}^n a_i E(Y_i).   (A.29a)

That is,

    E[a_1 Y_1 + a_2 Y_2 + ... + a_n Y_n] = a_1 E(Y_1) + a_2 E(Y_2) + ... + a_n E(Y_n).

Also,

    var[Σ_{i=1}^n a_i Y_i] = Σ_{i=1}^n Σ_{j=1}^n a_i a_j cov(Y_i, Y_j).   (A.29b)
For two random variables (A.30a & b),

    E(a_1 Y_1 + a_2 Y_2) = a_1 E(Y_1) + a_2 E(Y_2),
    var(a_1 Y_1 + a_2 Y_2) = a_1² var(Y_1) + a_2² var(Y_2) + 2 a_1 a_2 cov(Y_1, Y_2).

Note: if Y_1, ..., Y_n are all independent (or even just uncorrelated), then

    var[Σ_{i=1}^n a_i Y_i] = Σ_{i=1}^n a_i² var(Y_i).   (A.31)

Also, if Y_1, ..., Y_n are all independent, then

    cov(Σ_{i=1}^n a_i Y_i, Σ_{i=1}^n c_i Y_i) = Σ_{i=1}^n a_i c_i var(Y_i).   (A.32)
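Identity (A.31) can be checked exactly for two independent discrete variables. A sketch with Y_1, Y_2 each uniform on {1, ..., 6} (a hypothetical pair of fair dice; a_1, a_2 are arbitrary):

```python
vals = range(1, 7)
p = 1.0 / 36.0  # independence: the joint pmf is the product of the two uniform pmfs

def E(g):
    # Exact expectation of g(Y1, Y2) under the product joint pmf.
    return sum(g(y1, y2) * p for y1 in vals for y2 in vals)

a1, a2 = 2.0, -1.0
m = E(lambda y1, y2: a1 * y1 + a2 * y2)
v = E(lambda y1, y2: (a1 * y1 + a2 * y2 - m) ** 2)  # var(a1*Y1 + a2*Y2)

var1 = E(lambda y1, y2: (y1 - 3.5) ** 2)  # var of one die = 35/12
print(abs(v - (a1 ** 2 * var1 + a2 ** 2 * var1)) < 1e-9)  # (A.31): cross term vanishes
```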
Important example

Suppose Y_1, ..., Y_n are independent random variables, each with mean µ and variance σ². Define the sample mean as Ȳ = (1/n) Σ_{i=1}^n Y_i. Then

    E(Ȳ) = E((1/n)Y_1 + ... + (1/n)Y_n)
         = (1/n)E(Y_1) + ... + (1/n)E(Y_n)
         = (1/n)µ + ... + (1/n)µ
         = n(1/n)µ = µ.

    var(Ȳ) = var((1/n)Y_1 + ... + (1/n)Y_n)
           = (1/n²)var(Y_1) + ... + (1/n²)var(Y_n)
           = n(1/n²)σ² = σ²/n.

(Casella & Berger pp. 212-214)
A.3 Central Limit Theorem

The Central Limit Theorem takes this a step further. When Y_1, ..., Y_n are independent and identically distributed (i.e. a random sample) from any distribution such that E(Y_i) = µ and var(Y_i) = σ², and n is reasonably large,

    Ȳ ≈ N(µ, σ²/n),

where ≈ is read as "approximately distributed as". Note that E(Ȳ) = µ and var(Ȳ) = σ²/n as on the previous slide. The CLT slaps normality onto Ȳ. Formally, the CLT states

    √n (Ȳ − µ) →D N(0, σ²).

(Casella & Berger pp. 236-240)
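A quick Monte Carlo sketch of the sample-mean facts and the CLT (my own illustration, assuming Uniform(0,1) draws, so µ = 1/2 and σ² = 1/12; seed and sizes are arbitrary):

```python
import random

random.seed(1)  # fixed seed so the check is reproducible
n, reps = 50, 20_000

# Sample means of n iid Uniform(0,1) variables.
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

m = sum(means) / reps
v = sum((x - m) ** 2 for x in means) / reps

print(abs(m - 0.5) < 0.01)           # E(Ybar) = mu = 1/2
print(abs(v - 1 / (12 * n)) < 5e-4)  # var(Ybar) = sigma^2/n = 1/600

# CLT: roughly 68% of the means should fall within one sd of mu.
sd = (1 / (12 * n)) ** 0.5
frac = sum(abs(x - 0.5) < sd for x in means) / reps
print(abs(frac - 0.683) < 0.02)
```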
Section A.4 Gaussian & related distributions

Normal distribution (Casella & Berger pp. 102-106)

A random variable Y has a normal distribution with mean µ and standard deviation σ, denoted Y ∼ N(µ, σ²), if it has the pdf

    f(y) = (1/√(2πσ²)) exp{−(1/2)((y − µ)/σ)²},

for −∞ < y < ∞. Here, µ ∈ R and σ > 0.

Note: If Y ∼ N(µ, σ²) then Z = (Y − µ)/σ ∼ N(0, 1) is said to have a standard normal distribution.
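The pdf formula can be cross-checked against Python's statistics.NormalDist; a sketch (the µ and σ values are arbitrary):

```python
import math
from statistics import NormalDist

mu, sigma = 2.0, 1.5  # arbitrary illustrative parameters

def f(y):
    # The N(mu, sigma^2) pdf exactly as written on the slide.
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

nd = NormalDist(mu, sigma)
ok = all(abs(f(y) - nd.pdf(y)) < 1e-12 for y in [-1.0, 0.0, 2.0, 4.5])
print(ok)  # True
```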
Sums of independent normals

Note: If a and c are constants and Y ∼ N(µ, σ²), then a + cY ∼ N(a + cµ, c²σ²).

Note: If Y_1, ..., Y_n are independent normal such that Y_i ∼ N(µ_i, σ_i²) and a_1, ..., a_n are constants, then

    Σ_{i=1}^n a_i Y_i = a_1 Y_1 + ... + a_n Y_n ∼ N(Σ_{i=1}^n a_i µ_i, Σ_{i=1}^n a_i² σ_i²).

Example: Suppose Y_1, ..., Y_n are iid from N(µ, σ²). Then Ȳ ∼ N(µ, σ²/n). (Casella & Berger p. 215)
A.4 χ² distribution

Definition: If Z_1, ..., Z_ν iid N(0, 1), then

    X = Z_1² + ... + Z_ν² ∼ χ²_ν,

chi-square with ν degrees of freedom. Note: E(X) = ν and var(X) = 2ν.

[Plot of the χ²_1, χ²_2, χ²_3, χ²_4 pdfs]
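The χ² moments can be checked by simulating sums of squared standard normals; a sketch (ν, the seed, and the sample sizes are arbitrary choices of mine):

```python
import random

random.seed(5)
nu, reps = 4, 30_000

# X = Z_1^2 + ... + Z_nu^2 with Z_i iid N(0, 1).
xs = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu)) for _ in range(reps)]

m = sum(xs) / reps
v = sum((x - m) ** 2 for x in xs) / reps

print(abs(m - nu) < 0.1)      # E(X) = nu
print(abs(v - 2 * nu) < 0.5)  # var(X) = 2*nu
```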
A.4 t distribution

Definition: If Z ∼ N(0, 1) independent of X ∼ χ²_ν, then

    T = Z / √(X/ν) ∼ t_ν,

t with ν degrees of freedom. Note that E(T) = 0 for ν ≥ 2 and var(T) = ν/(ν − 2) for ν ≥ 3.

[Plot of the t_1, t_2, t_3, t_4 pdfs]
A.4 F distribution

Definition: If X_1 ∼ χ²_{ν_1} independent of X_2 ∼ χ²_{ν_2}, then

    F = (X_1/ν_1) / (X_2/ν_2) ∼ F_{ν_1,ν_2},

F with ν_1 degrees of freedom in the numerator and ν_2 degrees of freedom in the denominator.

Note: The square of a t_ν random variable is an F_{1,ν} random variable. Proof:

    t_ν² = [Z / √(χ²_ν/ν)]² = Z² / (χ²_ν/ν) = (χ²_1/1) / (χ²_ν/ν) = F_{1,ν}.

Note: E(F) = ν_2/(ν_2 − 2) for ν_2 > 2. The variance is a function of ν_1 and ν_2 and a bit more complicated.

Question: If F ∼ F(ν_1, ν_2), what is 1/F distributed as?
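The t_ν² = F_{1,ν} identity in the proof can be checked draw by draw: build T and F from the same Z and χ²_ν variables and confirm T² = F (a sketch; ν = 7 and the seed are arbitrary):

```python
import random

random.seed(3)
nu = 7

for _ in range(1000):
    z = random.gauss(0.0, 1.0)
    x = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # chi-square, nu df
    t = z / (x / nu) ** 0.5           # t_nu
    f = (z ** 2 / 1) / (x / nu)       # (chi2_1 / 1) / (chi2_nu / nu) = F_{1,nu}
    assert abs(t ** 2 - f) <= 1e-9 * max(1.0, abs(f))

print("t^2 and F agree on every draw")
```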
Relate the plots to E(F) = ν_2/(ν_2 − 2)

[Plot of the F_{2,2}, F_{5,5}, F_{5,20}, F_{5,200} pdfs]
A.6 Normal population inference

A model for a single sample: Suppose we have a random sample Y_1, ..., Y_n of observations from a normal distribution with unknown mean µ and unknown variance σ². We can model these data as

    Y_i = µ + ε_i, i = 1, ..., n, where ε_i ∼ N(0, σ²).

Often we wish to obtain inference for the unknown population mean µ, e.g. a confidence interval for µ or a hypothesis test of H_0: µ = µ_0.
Standardize Ȳ to get a t random variable

Let s² = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)² be the sample variance and s = √s² the sample standard deviation.

Fact: (n−1)s²/σ² = (1/σ²) Σ_{i=1}^n (Y_i − Ȳ)² has a χ²_{n−1} distribution (easy to show using results from linear models).

Fact: (Ȳ − µ)/(σ/√n) has a N(0, 1) distribution.

Fact: Ȳ is independent of s². So then any function of Ȳ is independent of any function of s². Therefore

    [(Ȳ − µ)/(σ/√n)] / √[(1/σ²) Σ_{i=1}^n (Y_i − Ȳ)² / (n−1)] = (Ȳ − µ)/(s/√n) ∼ t_{n−1}.

(Casella & Berger Theorem 5.3.1, p. 218)
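That (Ȳ − µ)/(s/√n) really behaves like t_{n−1} can be spot-checked by simulation: its variance should be near (n−1)/(n−3), not 1. A sketch (µ, σ, n, the seed, and the replication count are all arbitrary):

```python
import random

random.seed(4)
mu, sigma, n, reps = 5.0, 3.0, 10, 20_000

ts = []
for _ in range(reps):
    ys = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)  # sample variance
    ts.append((ybar - mu) / (s2 / n) ** 0.5)         # (Ybar - mu) / (s / sqrt(n))

m = sum(ts) / reps
v = sum((t - m) ** 2 for t in ts) / reps

print(abs(m) < 0.05)         # E(T) = 0 for t_9
print(abs(v - 9 / 7) < 0.1)  # var(T) = nu/(nu - 2) = 9/7 for nu = n - 1 = 9
```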
Building a confidence interval

Let 0 < α < 1, typically α = 0.05. Let t_{n−1}(1 − α/2) be such that P(T ≤ t_{n−1}(1 − α/2)) = 1 − α/2 for T ∼ t_{n−1}.

[Plot of the t_{n−1} density: central area 1 − α between −t_{n−1}(1 − α/2) and t_{n−1}(1 − α/2), with area α/2 in each tail]
Confidence interval for µ

Under the model Y_i = µ + ε_i, i = 1, ..., n, where ε_i ∼ N(0, σ²),

    1 − α = P(−t_{n−1}(1 − α/2) ≤ (Ȳ − µ)/(s/√n) ≤ t_{n−1}(1 − α/2))
          = P(−(s/√n) t_{n−1}(1 − α/2) ≤ Ȳ − µ ≤ (s/√n) t_{n−1}(1 − α/2))
          = P(Ȳ − (s/√n) t_{n−1}(1 − α/2) ≤ µ ≤ Ȳ + (s/√n) t_{n−1}(1 − α/2)).
Confidence interval for µ

So a (1 − α)100% random probability interval for µ is

    Ȳ ± t_{n−1}(1 − α/2) s/√n,

where t_{n−1}(1 − α/2) is the (1 − α/2)th quantile of a t_{n−1} random variable: i.e. the value such that P(T < t_{n−1}(1 − α/2)) = 1 − α/2 where T ∼ t_{n−1}. This, of course, turns into a confidence interval after Ȳ = ȳ and s² are observed, and is no longer random.
Standardizing with Ȳ instead of µ

Note: If Y_1, ..., Y_n iid N(µ, σ²), then

    Σ_{i=1}^n ((Y_i − µ)/σ)² ∼ χ²_n,   and   Σ_{i=1}^n ((Y_i − Ȳ)/σ)² ∼ χ²_{n−1}.

The first is straightforward from properties of normals and the definition of χ²_ν; the second is intuitive but not straightforward to show until linear models...
Confidence interval example

Say we collect n = 30 summer daily high temperatures and obtain ȳ = 77.667 and s = 8.872. To obtain a 90% CI we need α = 0.10, yielding t_29(1 − α/2) = t_29(0.95) = 1.699 (Table B.2), and

    77.667 ± (1.699)(8.872/√30) ⇒ (74.91, 80.42).

Interpretation: With 90% confidence, the true mean high temperature is between 74.91 and 80.42 degrees.
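The interval arithmetic is easy to reproduce, plugging in the slide's tabled critical value t_29(0.95) = 1.699:

```python
n, ybar, s = 30, 77.667, 8.872
t = 1.699  # t_29(0.95), from Table B.2 as quoted on the slide

half = t * s / n ** 0.5          # half-width: t * s / sqrt(n)
lo, hi = ybar - half, ybar + half

print(round(lo, 2), round(hi, 2))  # 74.91 80.42, matching the slide
```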
SAS example: page 645

n = 63 faculty voluntarily attended a summer workshop on case teaching methods (out of 110 faculty total). At the end of the following academic year their teaching was evaluated on a 7-point scale (1 = really bad to 7 = outstanding). proc ttest in SAS gets us a confidence interval for the mean.
SAS code

********************************
* Example 2, p. 645 (Chapter 15)
********************************;
data teaching;
  input rating attend$ @@;
  if attend='Attended';  * only keep those who attended;
datalines;
4.8 Attended 6.4 Attended 6.3 Attended 6.0 Attended 5.4 Attended 5.8 Attended
6.1 Attended 6.3 Attended 5.0 Attended 6.2 Attended 5.6 Attended 5.0 Attended
6.4 Attended 5.8 Attended 5.5 Attended 6.1 Attended 6.0 Attended 6.0 Attended
5.4 Attended 5.8 Attended 6.5 Attended 6.0 Attended 6.1 Attended 4.7 Attended
5.6 Attended 6.1 Attended 5.8 Attended 4.8 Attended 5.9 Attended 5.4 Attended
5.3 Attended 6.0 Attended 5.6 Attended 6.3 Attended 5.2 Attended 6.0 Attended
6.4 Attended 5.8 Attended 4.9 Attended 4.1 Attended 6.0 Attended 6.4 Attended
5.9 Attended 6.6 Attended 6.0 Attended 4.4 Attended 5.9 Attended 6.5 Attended
4.9 Attended 5.4 Attended 5.8 Attended 5.6 Attended 6.2 Attended 6.3 Attended
5.8 Attended 5.9 Attended 6.5 Attended 5.4 Attended 5.9 Attended 6.1 Attended
6.6 Attended 4.7 Attended 5.5 Attended 5.0 NotAttend 5.5 NotAttend 5.7 NotAttend
4.3 NotAttend 4.9 NotAttend 3.4 NotAttend 5.1 NotAttend 4.8 NotAttend 5.0 NotAttend
5.5 NotAttend 5.7 NotAttend 5.0 NotAttend 5.2 NotAttend 4.2 NotAttend 5.7 NotAttend
5.9 NotAttend 5.8 NotAttend 4.2 NotAttend 5.7 NotAttend 4.8 NotAttend 4.6 NotAttend
5.0 NotAttend 4.9 NotAttend 6.3 NotAttend 5.6 NotAttend 5.7 NotAttend 5.1 NotAttend
5.8 NotAttend 3.8 NotAttend 5.0 NotAttend 6.1 NotAttend 4.4 NotAttend 3.9 NotAttend
6.3 NotAttend 6.3 NotAttend 4.8 NotAttend 6.1 NotAttend 5.3 NotAttend 5.1 NotAttend
5.5 NotAttend 5.9 NotAttend 5.5 NotAttend 6.0 NotAttend 5.4 NotAttend 5.9 NotAttend
5.5 NotAttend 6.0 NotAttend
;
proc ttest data=teaching;
  var rating;
run;