Fundamentals of Traffic Operations and Control Topic: Statistics for Traffic Engineers


Fundamentals of Traffic Operations and Control Topic: Statistics for Traffic Engineers Nikolas Geroliminis Ecole Polytechnique Fédérale de Lausanne nikolas.geroliminis@epfl.ch

Role of Statistical Inference in Decision-Making Process
- Real World
- Data Collection
- Estimation of Parameters, Choice of Distribution
- Calculation of Probabilities (using the prescribed distributions and estimated parameters)
- Information for Decision-Making and Design
Statistical Inference: information obtained from the sampled data is used to make generalizations about the populations from which the samples were obtained (sample vs. population).

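Role of Sampling in Statistical Inferences
Population: $-\infty < x < +\infty$, with true parameters $\mu$ and $\sigma^2$
Sample: estimates $\bar{x}$ and $s^2$, where
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$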

Statistical Analysis Used to address the following questions: 1. How many samples are required? 2. What confidence should I have in this estimate? 3. What statistical distribution best describes the observed data mathematically? 4. Has a traffic engineering design resulted in a change in the characteristics of the population?

Distributions
What is meant by distributional form? It is the frequency of specific values occurring within the measured data set.
Consider a traffic stream along a signalized arterial:
- What operational considerations are there for the signal if traffic volume is constant per unit time (i.e., uniform) vs. randomly varying (some other distribution)?
- What design considerations are there for turn bays?

Describing a Distribution
Two types of statistical parameters describe a distribution:
- Central tendency
- Dispersion

Common Statistical Measures
Measures of central tendency
- Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Sample median: $\tilde{x}$ = middle value if odd # of observations; $\tilde{x}$ = average of two middle values if even # of observations
- Mode: most frequent observation

Common Statistical Measures
Measures of dispersion (or variability)
- Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}$
- Sample standard deviation: $s = \sqrt{s^2}$
- Sample coefficient of variation: $\mathrm{cov} = s / \bar{x}$
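The measures of central tendency and dispersion can be checked numerically. The sketch below is only an illustration: the five speeds are hypothetical values (not data from the slides), and it assumes "cov" denotes the coefficient of variation, i.e. the sample standard deviation divided by the sample mean. It also confirms that the definitional and computational forms of the sample variance agree.

```python
import math
import statistics

# Hypothetical spot speeds in mph (illustrative values, not from the slides).
speeds = [40, 45, 50, 55, 60]
n = len(speeds)

mean = sum(speeds) / n
median = statistics.median(speeds)

# Definitional form of the sample variance (divisor n - 1).
var_def = sum((x - mean) ** 2 for x in speeds) / (n - 1)

# Computational form: needs only the sum and the sum of squares.
var_comp = (sum(x * x for x in speeds) - sum(speeds) ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_def)
coeff_var = std_dev / mean  # coefficient of variation ("cov" on the slide)

print(mean, median, var_def, var_comp, round(std_dev, 3), round(coeff_var, 3))
```

The computational form is convenient when only running totals are stored, although it can lose precision when the mean is large relative to the spread.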

Distribution Terms
The mechanism for assigning probabilities to events defined by random variables is to use either a mass function (for discrete variables) or a density function (for continuous variables).
- Probability mass function (p.m.f.)
- Probability density function (p.d.f.)
- Cumulative distribution function (c.d.f.)

p.m.f.
- For discrete data
- Name refers to point masses: probability mass is distributed at discrete points along the measurement axis.

p.d.f.
- For continuous data
- Two conditions must be met:
  $f(x) \ge 0$ for all x
  $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (area under entire graph)
- Thus, the probability of a value being between a and b is the area under the curve between those two points: $P(a \le X \le b) = \int_a^b f(x)\,dx$

p.d.f.
- Name implies that probability density is smeared in a continuous fashion along the entire interval of possible values.
- Contrary to the p.m.f., specific values along the measurement axis of a continuous distribution have a probability of zero.

c.d.f.
- Cumulative probability for some value, $P(X \le x)$
- For a p.m.f., the c.d.f. is obtained by summing the p.m.f. p(x) over all possible values satisfying $X \le x$
- For a p.d.f., the c.d.f. is obtained by integrating f(x) between the limits $-\infty$ and x

Common Traffic Distributions
- Uniform
- Normal
- Poisson
- Negative Exponential

Uniform
Examples (discrete):
- Tossing a coin
- Rolling a six-sided die
Examples (continuous):
- D/D/1 queuing (deterministic arrivals and departures with one departure channel)
- Suppose I take a bus to work, and that a bus arrives at my stop every five minutes. Because of variation in the time I leave my house, I don't always arrive at the bus stop at the same time, so my waiting time, X, for the next bus is a continuous random variable.

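Uniform Distribution
$f(x; A, B) = \begin{cases} \frac{1}{B - A} & A \le x \le B \\ 0 & \text{otherwise} \end{cases}$
In the bus example, the set of possible values of X is the interval [0, 5]. A possible probability density function for X is:
$f(x) = \begin{cases} \frac{1}{5} & 0 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$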
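As a rough illustration of the bus-stop example, the sketch below evaluates probabilities for a waiting time that is uniform on [0, 5] minutes. The function names `uniform_cdf` and `uniform_prob` are hypothetical helpers introduced here, and the two probabilities computed are illustrative questions that do not appear on the slides.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for X uniformly distributed on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi): area under the flat density between lo and hi."""
    return uniform_cdf(hi, a, b) - uniform_cdf(lo, a, b)


# Waiting time X in minutes, uniform on [0, 5] as in the bus example.
p_wait_at_most_2 = uniform_prob(0, 2, 0, 5)
p_wait_1_to_3 = uniform_prob(1, 3, 0, 5)
mean_wait = (0 + 5) / 2  # mean of a uniform distribution is (a + b) / 2

print(p_wait_at_most_2, p_wait_1_to_3, mean_wait)
```

Because the density is constant, each probability is simply the interval length divided by 5.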

Normal
Normal distribution function is continuous; the p.d.f. is:
$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
µ = mean, σ = standard deviation (for population, true)
$\bar{x}$ = mean, s = standard deviation (for sample, estimated)

Normal
What does it mean, conceptually?
- Distribution is centered about its mean
- Spread is a function of the standard deviation
- Mean, median, and mode are numerically equal
- 68.27% of observations will be within 1 std. dev., 95.45% within 2 std. dev., 99.73% within 3 std. dev.
- Values of $-\infty$ to $+\infty$ are theoretically possible, but generally there are practical limits (-4 to 4)

Standard Normal
p.d.f. for the standard normal dist. is:
$f(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}$
To get a standard normal random variable for a measurement from a nonstandard normal dist., use:
$z = \frac{x - \mu}{\sigma}$
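A short sketch of the standardization step and of the 68.27% / 95.45% / 99.73% coverage figures follows. It relies on the standard identity that the standard normal c.d.f. can be written with the error function; the mean of 50 mph and standard deviation of 8 mph are hypothetical values chosen only for illustration.

```python
import math


def standardize(x, mu, sigma):
    """Convert a measurement x from N(mu, sigma^2) to a standard normal z."""
    return (x - mu) / sigma


def std_normal_cdf(z):
    """P(Z <= z) for the standard normal, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage(k):
    """Share of observations within k standard deviations of the mean."""
    return std_normal_cdf(k) - std_normal_cdf(-k)


# Hypothetical population: mean 50 mph, standard deviation 8 mph.
z = standardize(62, mu=50, sigma=8)
print(z, round(std_normal_cdf(z), 4))

for k in (1, 2, 3):
    print(k, round(100 * coverage(k), 2))
```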

Standard Normal Distribution

Poisson
- Discrete distribution
- Commonly referred to as a counting distribution
- Represents the count distribution of random events

Poisson
For a sequence of events to be considered truly random, two conditions must be met:
- Any point in time is as likely as any other for an event to occur (e.g., vehicle arrival)
- The occurrence of an event does not affect the probability of the occurrence of another event (e.g., the arrival of one vehicle at a point in time does not affect the arrival time of any other vehicle)

Poisson
p.m.f. for the Poisson dist. is:
$p(x) = \frac{(\lambda t)^x \, e^{-\lambda t}}{x!}$
p(x) = probability of exactly x vehicles arriving in a time interval t
x = # of vehicles arriving in a specific time interval
λ = average rate of arrival (veh/unit time)
t = selected time interval (duration of each counting period (unit time))

Poisson
p.m.f. also commonly expressed as:
$p(x) = \frac{m^x \, e^{-m}}{x!}$
m = average number of occurrences during a specific time period t (i.e., m = λt)

Poisson Example
A roadway has an average hourly volume of 360 vph. Assuming that the arrival of vehicles is Poisson distributed, estimate the probabilities of having 0, 1, 2, 3, 4, and 5 or more vehicles arrive in a 20-second interval. See board
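One way to work this example numerically is sketched below, using the Poisson p.m.f. in the form p(x) = m^x e^(-m) / x! with m = λt. The helper name `poisson_pmf` is hypothetical; the only inputs are the 360 vph volume and the 20-second interval stated in the example.

```python
import math


def poisson_pmf(x, m):
    """Probability of exactly x arrivals when the expected count is m."""
    return m ** x * math.exp(-m) / math.factorial(x)


q = 360            # hourly volume, veh/h
t = 20             # counting interval, seconds
lam = q / 3600     # average arrival rate, veh/s
m = lam * t        # expected arrivals per 20-second interval

# Probabilities of exactly 0, 1, 2, 3 and 4 arrivals.
probs = [poisson_pmf(x, m) for x in range(5)]

# "5 or more" is the complement of "4 or fewer".
p_5_or_more = 1.0 - sum(probs)

for x, p in enumerate(probs):
    print(x, round(p, 4))
print("5+", round(p_5_or_more, 4))
```

With these inputs the expected count is m = 2 vehicles per interval, so the probabilities of exactly 1 and exactly 2 arrivals are equal.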

Negative Exponential
The assumption of Poisson distributed vehicle arrivals also implies a distribution of the time intervals between the arrivals of successive vehicles (i.e., time headway).
To demonstrate this, let the average arrival rate, λ, be in units of vehicles per second, so that
$\lambda = \frac{q}{3600}$
where q is the hourly volume (veh/h). Substituting into the Poisson equation yields
$p(x) = \frac{(qt/3600)^x \, e^{-qt/3600}}{x!}$

Negative Exponential
Note that the probability of having no vehicles arrive in a time interval of length t (i.e., P(0)) is equivalent to the probability of a vehicle headway, h, being greater than or equal to the time interval t.
$P(0) = P(h \ge t) = \frac{(1)\, e^{-qt/3600}}{1} = e^{-qt/3600}$
This distribution of vehicle headways is known as the negative exponential distribution.

Negative Exponential Example
A roadway has an average hourly volume of 360 vph. Assume that the arrival of vehicles is Poisson distributed. What is the probability that the gap between successive vehicles will be between 8 and 10 seconds? See board
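A possible numerical treatment of this example is sketched below. It uses the headway relation P(h >= t) = e^(-qt/3600), so the probability of a gap between 8 and 10 seconds is the difference of two such terms. The function name `prob_headway_at_least` is a hypothetical helper.

```python
import math


def prob_headway_at_least(t, q):
    """P(h >= t) for Poisson arrivals with hourly volume q (veh/h), t in seconds."""
    return math.exp(-q * t / 3600)


q = 360  # hourly volume, veh/h

# P(8 <= h <= 10) = P(h >= 8) - P(h >= 10)
p_8 = prob_headway_at_least(8, q)
p_10 = prob_headway_at_least(10, q)
p_gap_8_to_10 = p_8 - p_10

print(round(p_8, 4), round(p_10, 4), round(p_gap_8_to_10, 4))
```

Because headway is treated as continuous, it makes no difference whether the interval endpoints are included.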

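Expectation and Variance
Expectation (mean): $\bar{x} = E(x) = \int x f(x)\,dx$
Variance: $\sigma_x^2 = E[(x - \bar{x})^2] = \int (x - E[x])^2 f(x)\,dx = E[x^2] - E[x]^2$

Distribution: pdf; mean; variance
Bernoulli: $P_0 = 1 - p$, $P_1 = p$; $p$; $p(1 - p)$
Binomial: $\frac{n!}{(n-k)!\,k!}\, p^k q^{n-k}$; $np$; $npq$
Poisson: $\frac{\alpha^k e^{-\alpha}}{k!}$; $\alpha$; $\alpha$
Uniform: $\frac{1}{b - a}$; $\frac{a + b}{2}$; $\frac{(b - a)^2}{12}$
Exponential: $\lambda e^{-\lambda x}$; $\frac{1}{\lambda}$; $\frac{1}{\lambda^2}$
Normal: $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - m)^2}{2\sigma^2}}$; $m$; $\sigma^2$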

Sum of Random Variables and Central Limit Theorem
Let $S_n = x_1 + x_2 + \dots + x_n$, where $x_1, x_2, \dots, x_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$. Then
$\lim_{n \to \infty} f(s_n) = N(n\mu, n\sigma^2)$
or
$\lim_{n \to \infty} f(z_n) = N(0, 1)$, where $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$
The sum of n independent, identically distributed random variables with finite variance tends to the normal distribution, no matter what the initial, underlying distribution is. See board for an illustration
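The central limit theorem can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the board example: it sums n = 30 exponential variables (mean 1, standard deviation 1), standardizes each sum as Z_n = (S_n - nµ) / (σ√n), and checks how often Z_n falls within ±1.96, which should be close to 95% if Z_n is approximately standard normal. The seed, n, and the number of trials are arbitrary choices.

```python
import math
import random


def standardized_sum(rng, n, mu, sigma):
    """Draw n exponential(1) values and return Z_n = (S_n - n*mu) / (sigma*sqrt(n))."""
    s_n = sum(rng.expovariate(1.0) for _ in range(n))
    return (s_n - n * mu) / (sigma * math.sqrt(n))


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30                   # number of variables in each sum
trials = 20000           # number of simulated sums

# The exponential distribution with rate 1 has mean 1 and std. dev. 1.
z_values = [standardized_sum(rng, n, mu=1.0, sigma=1.0) for _ in range(trials)]

share_within_196 = sum(abs(z) <= 1.96 for z in z_values) / trials
mean_z = sum(z_values) / trials

print(round(share_within_196, 3), round(mean_z, 3))
```

The exponential distribution is strongly skewed, so the agreement at n = 30 is approximate; increasing n brings the share closer to 95%.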

Approximating a Normal Distribution
Figure 11. Binomial probability distribution with parameters n = 100 and p = 0.07 (shaded) and normal approximation to it (unshaded). [Axes: Probability vs. k = 0 to 14]

Sample Size
How many observations do we need?
- It depends on several things (e.g., confidence bounds, standard deviation of the underlying distribution, and tolerance)
- Larger samples are likely to lead to better estimates of distribution parameters, but:
  data collection is expensive
  we are usually only able to measure a fraction of the possible values in the population
- Therefore, we would like to collect only as much data as will give us our required level of statistical confidence

Sample Sizes
$n = \left(\frac{s \, z_{\alpha/2}}{\varepsilon}\right)^2$
n = minimum number of measured speeds
s = estimated sample standard deviation, mph
$z_{\alpha/2}$ = constant corresponding to the desired confidence level
ε = permitted error in the average speed estimate, mph

Normal Speed data 55 42 53 67 58 65 63 31 51 66 54 49 55 44 49 47 69 76 20 46 62 30 69 56 45 25 64 54 74 44 35 83 64 78 65 45 33 75 48 56 50 66 72 49 63 58 70 37 55 68 29 38 34 47 39 53 64 41 59 89 42 44 51 79 38 54 54 77 58 61

Step 1: Sort Data
Rank all data in ascending order (rank - value):
1 - 20
2 - 25
3 - 29
4 - 30
5 - 31
6 - 33
and so on...

Step 2: Group Data
Suggestion (speed range, interval number: count):
20-29, interval 1: 3
30-39, interval 2: 9
40-49, interval 3: 15
50-59, interval 4: 18
60-69, interval 5: 15
70-79, interval 6: 8
80-89, interval 7: 2

Step 3: Plot Histogram
[Figure: histogram of the number of observations (0 to 20) in each interval (1 to 7)]

Step 4: Plot CDF
[Figure: cumulative distribution (0% to 100%) versus speed (20 to 80)]
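Steps 1 to 4 can be reproduced with a few lines of code. The sketch below uses the 70 speed observations listed earlier and the suggested 10-unit intervals; it computes the values that would be plotted (interval counts and cumulative percentages) rather than drawing the plots, and the variable names are illustrative choices.

```python
speeds = [
    55, 42, 53, 67, 58, 65, 63, 31, 51, 66,
    54, 49, 55, 44, 49, 47, 69, 76, 20, 46,
    62, 30, 69, 56, 45, 25, 64, 54, 74, 44,
    35, 83, 64, 78, 65, 45, 33, 75, 48, 56,
    50, 66, 72, 49, 63, 58, 70, 37, 55, 68,
    29, 38, 34, 47, 39, 53, 64, 41, 59, 89,
    42, 44, 51, 79, 38, 54, 54, 77, 58, 61,
]

# Step 1: rank all data in ascending order.
ranked = sorted(speeds)

# Step 2: group into the suggested intervals 20-29, 30-39, ..., 80-89.
intervals = [(lo, lo + 9) for lo in range(20, 90, 10)]
counts = [sum(lo <= v <= hi for v in speeds) for lo, hi in intervals]

# Step 3: the histogram bar heights are simply the interval counts.
# Step 4: cumulative percentage at the upper end of each interval.
cumulative_pct = []
running_total = 0
for count in counts:
    running_total += count
    cumulative_pct.append(100.0 * running_total / len(speeds))

for (lo, hi), count, pct in zip(intervals, counts, cumulative_pct):
    print(f"{lo}-{hi}: {count:2d} observations, {pct:5.1f}% cumulative")
```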

Sample Size Example
- Want to collect speed data from a freeway segment
- Previous studies determined s = 4 mph (use with caution)
- Want to estimate the population mean (µ) within ± 1 mph at a 99% confidence level
$n = \left(\frac{4 \times 2.58}{1}\right)^2 = 106.5$, so 107 observations are needed

Sample Size Example
Consider an already collected speed data sample:
- Mean = 52.3 mph
- Std. dev. = 6.3 mph
- n = 200
Want to check whether we have an adequate sample size for a 99% confidence level and ε = 1 mph:
$n = \left(\frac{6.3 \times 2.58}{1}\right)^2 \approx 264 > 200$, so not enough observations
How about for a 95% confidence level?
$n = \left(\frac{6.3 \times 1.96}{1}\right)^2 \approx 152 < 200$, OK
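The sample size calculations can be checked with a small helper based on n = (s z / ε)^2. The function name `required_sample_size` is hypothetical. Note one assumption: the helper rounds up to the next whole observation, so it may report one more observation than the rounded figures quoted in the examples.

```python
import math


def required_sample_size(s, z, eps):
    """Return (raw n, n rounded up) for std. dev. s, z-value z and permitted error eps."""
    raw = (s * z / eps) ** 2
    return raw, math.ceil(raw)


# New study: s = 4 mph, 99% confidence (z = 2.58), error of 1 mph.
raw_new, n_new = required_sample_size(4, 2.58, 1)

# Existing sample of 200 speeds with s = 6.3 mph.
collected = 200
raw_99, n_99 = required_sample_size(6.3, 2.58, 1)   # 99% confidence
raw_95, n_95 = required_sample_size(6.3, 1.96, 1)   # 95% confidence

print(round(raw_new, 1), n_new)
print(round(raw_99, 1), n_99, "adequate" if collected >= n_99 else "not enough")
print(round(raw_95, 1), n_95, "adequate" if collected >= n_95 else "not enough")
```

Rounding up is the conservative choice, since a fractional observation cannot be collected.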

Hypothesis Testing
- A theoretical proposition which can be tested statistically
- A statement about an event, the outcome of which is unknown at the time of the prediction, set forth in a way that it can be rejected

Possible Outcomes in the Testing of a Hypothesis
H_0: Null hypothesis
H_1: Alternative hypothesis
Only one of the two hypotheses is true, but we don't know which.

                      Test: H_0 true    Test: H_0 false
Reality: H_0 true     OK                Type I error
Reality: H_0 false    Type II error     OK

Type I error: reject a correct null hypothesis (a false positive, in the usual convention where "positive" means rejecting H_0)
Type II error: fail to reject a false null hypothesis (a false negative)

Hypothesis Testing Steps
1. Formulate a hypothesis (H_0)
2. Design a test procedure by which a decision can be made
3. Use statistics to refine the test procedure, recognizing the tradeoff of Type I error versus Type II error
4. Apply the test
5. Make a decision

Examples
- Before and after study:
  speed reduction of 5 mph (it happened or it didn't)
  accident reduction of 10% (it happened or it didn't)
- Compare two distributions (i.e., do two data samples come from the same distribution?)
- Whether an observed pattern of data fits a particular distribution (Chi-Square Test)
- Significance of coefficients in a regression model (t Test)
- Etc.

Example
Spot speeds observed over a year on a freeway were found to be normally distributed with a mean of 47.25 mph and s.d. = 8.61 mph. However, some new equipment has indicated that the mean speed is 48.63 mph. Is there any evidence that (a) the new equipment is faulty and (b) the new equipment is indicating a speed that is lower than the actual speed?

Test for Significant Difference
Are two samples of data from the same distribution? How much difference is a significant difference?
$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
where all variables are as defined before, with subscripts 1 and 2 referring to samples 1 and 2, respectively.
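A minimal sketch of the two-sample statistic follows, computing z as the difference in sample means divided by the square root of s1²/n1 + s2²/n2. The before-and-after numbers are hypothetical, and the comparison against 1.96 assumes a two-sided test at the 95% confidence level, which the slide does not specify.

```python
import math


def two_sample_z(mean1, s1, n1, mean2, s2, n2):
    """z statistic for the difference between two sample means."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical "before" and "after" speed samples (mph).
z = two_sample_z(mean1=50.0, s1=6.0, n1=36, mean2=47.0, s2=8.0, n2=64)

# Assumed decision rule: two-sided test at the 95% confidence level.
z_critical = 1.96
significant = abs(z) > z_critical

print(round(z, 3), significant)
```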

Distribution Fitting
- How do we determine distributional form?
- How confident can I be that the sample distribution represents the population distribution?

Distribution Fitting
- Plot the data: use a histogram, a graphical representation of a frequency distribution
- Examine the plot: can overlay with theoretical distributions for comparison

Histogram w/theoretical normal curve overlay

Goodness-of-Fit
If the distributions look like a match, proceed to a statistical test.
Statistical Testing
- Different tests have been devised to compare the fit of empirical data with a theoretical distribution
- One of the most common tests is Chi-squared ($\chi^2$)

Chi-squared Test
How does the Chi-squared test work?
- Define categories (or ranges) and assign data to the categories; there should be at least 5 categories and 5 data entries per category
- Compute the expected number of samples for each category based upon the theorized distribution
- Compute the difference between actual observations/class and theoretical distribution observations/class
- Compute the Chi-squared value (see next page)

Chi-squared Statistic
$\chi^2 = \sum_{i=1}^{I} \frac{(f_0 - f_t)^2}{f_t}$
$\chi^2$ = chi-squared value
$f_0$ = observed number or frequency of observations in category i
$f_t$ = theoretical (or other observed) number or frequency of expected observations in category i
i = category index
I = number of categories

Chi-squared Test (cont.)
- Determine the reference Chi-squared value
- Compare the calculated Chi-squared value to the reference value
- If computed value < reference value, do not reject the hypothesis that the empirical data fit the theoretical distribution
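The chi-squared procedure can be sketched as follows, summing (f_0 - f_t)^2 / f_t over the categories and comparing the result with a reference value. The observed and theoretical frequencies are hypothetical numbers chosen only to show the mechanics; the reference value 9.488 is the tabulated chi-squared value for 4 degrees of freedom at a 0.05 significance level, which assumes 7 categories and 2 parameters estimated from the data.

```python
def chi_squared(observed, theoretical):
    """Sum of (f_0 - f_t)^2 / f_t over all categories."""
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    return sum((f0 - ft) ** 2 / ft for f0, ft in zip(observed, theoretical))


# Hypothetical frequencies for 7 categories (each total is 100 observations).
observed = [8, 12, 20, 25, 18, 10, 7]
theoretical = [7, 13, 21, 24, 17, 11, 7]

estimated_parameters = 2  # e.g. a mean and a standard deviation
degrees_of_freedom = len(observed) - 1 - estimated_parameters

chi2 = chi_squared(observed, theoretical)
reference = 9.488  # tabulated value for 4 degrees of freedom, alpha = 0.05

reject = chi2 >= reference
print(round(chi2, 4), degrees_of_freedom, "reject" if reject else "do not reject")
```

In practice the reference value is read from a chi-squared table for the applicable degrees of freedom and significance level.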

Chi-Square Distribution

Example
Consider the spot speed data shown before. The computed mean was 48 mph and the computed standard deviation was 8.6 mph. Consider the following hypothesis:
H_0: The underlying distribution is normal with µ = 48 mph and σ = 8.6 mph.
N = 7 categories; degrees of freedom f = N - 1 - g = 7 - 1 - 2 = 4, where g = 2 is the number of parameters estimated from the data (µ and σ); α = 0.05; reference Chi-squared value = 9.488
Computed Chi-squared value = 1.0209 < 9.488 => cannot reject H_0