Fundamentals of Traffic Operations and Control Topic: Statistics for Traffic Engineers Nikolas Geroliminis Ecole Polytechnique Fédérale de Lausanne nikolas.geroliminis@epfl.ch
Role of Statistical Inference in Decision-Making Process Real World → Data Collection → Estimation of Parameters, Choice of Distribution → Calculation of Probabilities (using the prescribed distributions and estimated parameters) → Information for Decision-Making and Design. Statistical Inference: information obtained from the sampled data is used to make generalizations about the populations from which the samples were obtained (Sample vs. Population).
Role of Sampling in Statistical Inferences Population: $-\infty < x < +\infty$, with true parameters $\mu$ and $\sigma^2$. Sample: $x_1, x_2, \ldots, x_n$, with estimates $\bar{x}$ and $s^2$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Statistical Analysis Used to address the following questions: 1. How many samples are required? 2. What confidence should I have in this estimate? 3. What statistical distribution best describes the observed data mathematically? 4. Has a traffic engineering design resulted in a change in the characteristics of the population?
Distributions What is meant by distributional form? It is the frequency of specific values occurring within the measured data set Considering a traffic stream along a signalized arterial What operational considerations are there for the signal if: traffic volume is constant per unit time (i.e., uniform) vs. randomly varying (some other distribution)? What design considerations are there for turn bays?
Describing a Distribution Two types of statistical parameters that describe a distribution Central tendency Dispersion
Common Statistical Measures Measures of central tendency
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample Median: $\tilde{x}$ = middle value if odd number of observations; $\tilde{x}$ = average of the two middle values if even number of observations
Mode: most frequent observation
Common Statistical Measures Measures of dispersion (or variability)
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}$
Sample Standard Deviation: $s = \sqrt{s^2}$
Sample Coefficient of Variation: $\mathrm{cov} = s / \bar{x}$
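The measures of central tendency and dispersion listed above can be sketched in a few lines of Python. The function name `describe` and the five speed values are illustrative assumptions, not data from the lecture.

```python
import math
import statistics


def describe(sample):
    """Return central-tendency and dispersion measures for a sample."""
    n = len(sample)
    mean = sum(sample) / n

    # Median: middle value (odd n) or average of the two middle values (even n).
    ordered = sorted(sample)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2

    # Sample variance uses the n - 1 divisor.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    std_dev = math.sqrt(variance)

    return {
        "mean": mean,
        "median": median,
        "mode": statistics.mode(sample),  # most frequent observation
        "variance": variance,
        "std_dev": std_dev,
        "cov": std_dev / mean,  # coefficient of variation
    }


# Hypothetical spot speeds in mph (illustrative values only).
speeds = [45, 50, 50, 55, 60]
summary = describe(speeds)
print(summary)
```

For real data sets the standard-library functions `statistics.mean`, `statistics.median`, and `statistics.variance` give the same results; the explicit version is shown only to mirror the formulas.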
Distribution Terms The mechanism for assigning probabilities to events defined by random variables is to use either a mass function (for discrete variables) or a density function (for continuous variables) Probability mass function (p.m.f.) Probability density function (p.d.f.) Cumulative distribution function (c.d.f.)
p.m.f. For discrete data Name refers to point masses Probability mass is distributed in discrete points along measurement axis.
p.d.f. For continuous data Two conditions must be met: (1) $f(x) \ge 0$ for all $x$; (2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (area under entire graph). Thus, the probability of a value being between a and b is the area under the curve between those two points: $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$
p.d.f. Name implies that probability density is smeared in a continuous fashion along entire interval of possible values. Contrary to p.m.f., specific values along measurement axis of continuous distribution have probability of zero
c.d.f. Cumulative probability for some value, $F(x) = P(X \le x)$. For a p.m.f., the c.d.f. is obtained by summing the p.m.f. $p(y)$ over all possible values $y$ satisfying $y \le x$. For a p.d.f., the c.d.f. is obtained by integrating $f$ between the limits $-\infty$ and $x$
Common Traffic Distributions Uniform Normal Poisson Negative Exponential
Uniform Examples (discrete): Tossing a coin Rolling a six-sided die Examples (continuous): D/D/1 queuing (deterministic arrivals and departures with one departure channel) Suppose I take a bus to work, and that every five minutes a bus arrives at my stop. Because of variation in the time I leave my house, I don't always arrive at the bus stop at the same time, so my waiting time, X, for the next bus is a continuous random variable.
Uniform Distribution $f(x; A, B) = \frac{1}{B - A}$ for $A \le x \le B$, and 0 otherwise. In the bus example, the set of possible values of X is the interval [0, 5]. A possible probability density function for X is: $f(x) = \frac{1}{5}$ for $0 \le x \le 5$, and 0 otherwise
Normal Normal distribution function is continuous p.d.f. is: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ $\mu$ = mean, $\sigma$ = standard deviation (for population, true) $\bar{x}$ = mean, $s$ = standard deviation (for sample, estimated)
Normal What does it mean, conceptually? Distribution is centered about its mean Spread is function of standard deviation Mean, median, and mode are numerically equal 68.27% of observations will be within 1 std. dev. of the mean, 95.45% within 2 std. dev., 99.73% within 3 std. dev. Values from $-\infty$ to $+\infty$ are theoretically possible, but generally there are practical limits (roughly 4 standard deviations on either side of the mean)
Standard Normal p.d.f. for standard normal dist. is: $f(z; 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ To get a standard normal random variable for a measurement from a nonstandard normal dist., use: $z = \frac{x - \mu}{\sigma}$
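As a rough check of the 1, 2, and 3 standard deviation percentages quoted for the normal distribution, the standard normal c.d.f. can be written with the error function from Python's `math` module. The helper names `standardize`, `standard_normal_cdf`, and `prob_within` are assumptions made for this sketch.

```python
import math


def standardize(x, mu, sigma):
    """Convert a measurement x from N(mu, sigma) to a standard normal z."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(k):
    """Probability that an observation lies within k std. dev. of the mean."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} std. dev.: {100 * prob_within(k):.2f}%")
```

The loop prints values that round to the percentages stated on the slide (68.27%, 95.45%, 99.73%).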
Standard Normal Distribution
Poisson Discrete distribution Commonly referred to as counting distribution Represents the count distribution of random events
Poisson For a sequence of events to be considered truly random, two conditions must be met Any point in time is as likely as any other for an event to occur (e.g., vehicle arrival) The occurrence of an event does not affect the probability of the occurrence of another event (e.g., the arrival of one vehicle at a point in time does not affect the arrival time of any other vehicle)
Poisson p.m.f. for Poisson dist. is: $p(x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}$ p(x) = probability of exactly x vehicles arriving in a time interval t x = number of vehicles arriving in a specific time interval λ = average rate of arrival (veh/unit time) t = selected time interval (duration of each counting period, in the same unit of time)
Poisson p.m.f. also commonly expressed as: $p(x) = \frac{m^x e^{-m}}{x!}$ m = average number of occurrences during a specific time period t (i.e., m = λt)
Poisson Example A roadway has an average hourly volume of 360 vph. Assuming that vehicle arrivals are Poisson distributed, estimate the probabilities of having 0, 1, 2, 3, 4, and 5 or more vehicles arrive in a 20-second interval. See board
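One way to work this example is sketched below, using m = λt = (360/3600)(20) = 2 vehicles per 20-second interval. The function name `poisson_pmf` is an assumption for illustration; the board solution may be laid out differently.

```python
import math


def poisson_pmf(x, m):
    """Probability of exactly x arrivals when the mean count is m."""
    return m ** x * math.exp(-m) / math.factorial(x)


q = 360              # flow, veh/h (from the example)
t = 20               # counting interval, seconds (from the example)
m = q * t / 3600.0   # mean arrivals per interval = 2.0

# P(0) ... P(4), then P(5 or more) as the complement.
probs = [poisson_pmf(x, m) for x in range(5)]
p_five_or_more = 1.0 - sum(probs)

for x, p in enumerate(probs):
    print(f"P({x}) = {p:.4f}")
print(f"P(5 or more) = {p_five_or_more:.4f}")
```

Under these assumptions the probabilities are about 0.135, 0.271, 0.271, 0.180, and 0.090 for 0 to 4 vehicles, leaving about 0.053 for 5 or more.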
Negative Exponential The assumption of Poisson distributed vehicle arrivals also implies a distribution of the time intervals between the arrivals of successive vehicles (i.e., time headway) To demonstrate this, let the average arrival rate, λ, be in units of vehicles per second, so that $\lambda = \frac{q}{3600}$, where q is the flow in veh/h. Substituting into the Poisson equation yields $p(x) = \frac{(qt/3600)^x\, e^{-qt/3600}}{x!}$
Negative Exponential Note that the probability of having no vehicles arrive in a time interval of length t (i.e., P(0)) is the equivalent of the probability of a vehicle headway, h, being greater than or equal to the time interval t. $P(0) = P(h \ge t) = \frac{(1)\, e^{-qt/3600}}{1} = e^{-qt/3600}$ This distribution of vehicle headways is known as the negative exponential distribution
Negative Exponential Example A roadway has an average hourly volume of 360 vph. Assume that the arrival of vehicles is Poisson distributed. What is the probability that the gap between successive vehicles will be between 8 and 10 seconds? See board
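A possible way to work this example uses $P(h \ge t) = e^{-qt/3600}$ twice: the probability of a gap between 8 and 10 seconds is $P(h \ge 8) - P(h \ge 10)$. The function name `prob_headway_at_least` is an assumption for this sketch.

```python
import math


def prob_headway_at_least(t, q):
    """P(h >= t) for a flow of q veh/h and a time interval of t seconds."""
    return math.exp(-q * t / 3600.0)


q = 360  # flow, veh/h (from the example)

# P(8 <= h <= 10) = P(h >= 8) - P(h >= 10)
p_gap = prob_headway_at_least(8, q) - prob_headway_at_least(10, q)
print(f"P(8 <= h <= 10) = {p_gap:.4f}")
```

With q = 360 vph this gives $e^{-0.8} - e^{-1.0} \approx 0.449 - 0.368 \approx 0.081$.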
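Expectation and Variance
Expectation (Mean): $\bar{x} = E(x) = \int x f(x)\,dx$
Variance: $\sigma_x^2 = E[(x - \bar{x})^2] = \int (x - E[x])^2 f(x)\,dx = E[x^2] - (E[x])^2$
Distribution: p.d.f. (or p.m.f.); mean; variance
Bernoulli: $P_0 = 1 - p$, $P_1 = p$; mean $p$; variance $p(1 - p)$
Binomial: $\frac{n!}{(n-k)!\,k!}\, p^k q^{n-k}$ with $q = 1 - p$; mean $np$; variance $npq$
Poisson: $\frac{\alpha^k e^{-\alpha}}{k!}$; mean $\alpha$; variance $\alpha$
Uniform: $\frac{1}{b - a}$; mean $\frac{a + b}{2}$; variance $\frac{(b - a)^2}{12}$
Exponential: $\lambda e^{-\lambda x}$; mean $\frac{1}{\lambda}$; variance $\frac{1}{\lambda^2}$
Normal: $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - m)^2}{2\sigma^2}}$; mean $m$; variance $\sigma^2$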
Sum of Random Variables and Central Limit Theorem Let $S_n = x_1 + x_2 + \cdots + x_n$, where $x_1, x_2, \ldots, x_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$. Then $\lim_{n \to \infty} f(s_n) = N(n\mu, n\sigma^2)$, or equivalently $\lim_{n \to \infty} f(z_n) = N(0, 1)$, where $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$. The sum of n independent, identically distributed random variables tends to the normal distribution, no matter what the initial, underlying distribution is (provided its variance is finite). See board for an illustration
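A small simulation can illustrate the theorem. The sketch below sums n exponentially distributed values (a strongly skewed distribution with mean 1 and standard deviation 1), standardizes each sum, and checks that the standardized sums have mean near 0 and standard deviation near 1. The choice of the exponential distribution, n = 50, 2000 trials, and the fixed seed are all assumptions made for illustration.

```python
import math
import random
import statistics


def standardized_sums(n, trials, seed=0):
    """Simulate Z_n = (S_n - n*mu) / (sigma * sqrt(n)) for exponential x_i."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    mu, sigma = 1.0, 1.0        # exponential with rate 1: mean 1, std. dev. 1
    z_values = []
    for _ in range(trials):
        s_n = sum(rng.expovariate(1.0) for _ in range(n))
        z_values.append((s_n - n * mu) / (sigma * math.sqrt(n)))
    return z_values


z = standardized_sums(n=50, trials=2000)
print(f"mean of Z_n: {statistics.mean(z):.3f}")
print(f"std. dev. of Z_n: {statistics.stdev(z):.3f}")
```

Plotting a histogram of `z` against the standard normal curve would show the approach to normality more directly than these two summary numbers.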
Approximating a Normal Distribution Figure 11. Binomial probability distribution with parameters n = 100 and p = 0.07 (shaded) and normal approximation to it (unshaded). [Plot: probability versus k, for k = 0 to 14]
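The comparison in the figure can be reproduced numerically. The sketch below evaluates the binomial p.m.f. for n = 100 and p = 0.07 and a normal density with the same mean (np) and variance (npq) at k = 0 to 14; the helper names are assumptions for illustration.

```python
import math

n, p = 100, 0.07
mean = n * p                  # np
variance = n * p * (1 - p)    # npq
std_dev = math.sqrt(variance)


def binomial_pmf(k):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_pdf(x):
    """Normal density with the same mean and variance as the binomial."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))


for k in range(15):
    print(f"k = {k:2d}  binomial = {binomial_pmf(k):.4f}  normal = {normal_pdf(k):.4f}")

max_gap = max(abs(binomial_pmf(k) - normal_pdf(k)) for k in range(15))
```

Because p is small, the binomial distribution is noticeably skewed, so the normal curve matches best near the mean and less well in the tails.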
Sample Size How many observations do we need? It depends on several things (e.g., confidence bounds, standard deviation of the underlying distribution, and tolerance) Larger samples are likely to lead to better estimates of distribution parameters, but: Data collection is expensive We are usually only able to measure a fraction of the possible values in the population Therefore, we would like to collect only as much data as will give us our required level of statistical confidence
Sample Sizes $n = \left(\frac{s\, z_{\alpha/2}}{\varepsilon}\right)^2$ n = minimum number of measured speeds s = estimated sample standard deviation, mph $z_{\alpha/2}$ = constant corresponding to the desired confidence level ε = permitted error in the average speed estimate, mph
Normal Speed data 55 42 53 67 58 65 63 31 51 66 54 49 55 44 49 47 69 76 20 46 62 30 69 56 45 25 64 54 74 44 35 83 64 78 65 45 33 75 48 56 50 66 72 49 63 58 70 37 55 68 29 38 34 47 39 53 64 41 59 89 42 44 51 79 38 54 54 77 58 61
Step 1: Sort Data Rank all data in ascending order (rank: speed): 1: 20, 2: 25, 3: 29, 4: 30, 5: 31, 6: 33, and so on...
Step 2: Group Data Suggestion (speed range, interval number: number of observations):
20-29, interval 1: 3
30-39, interval 2: 9
40-49, interval 3: 15
50-59, interval 4: 18
60-69, interval 5: 15
70-79, interval 6: 8
80-89, interval 7: 2
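The sorting and grouping steps can be sketched as follows, using the 70 speed observations listed earlier. The 10-unit interval width and the lower bound of 20 follow the suggested grouping; the variable names are assumptions for illustration.

```python
# The 70 speed observations from the "Normal Speed data" slide.
speeds = [
    55, 42, 53, 67, 58, 65, 63, 31, 51, 66,
    54, 49, 55, 44, 49, 47, 69, 76, 20, 46,
    62, 30, 69, 56, 45, 25, 64, 54, 74, 44,
    35, 83, 64, 78, 65, 45, 33, 75, 48, 56,
    50, 66, 72, 49, 63, 58, 70, 37, 55, 68,
    29, 38, 34, 47, 39, 53, 64, 41, 59, 89,
    42, 44, 51, 79, 38, 54, 54, 77, 58, 61,
]

# Step 1: rank all data in ascending order.
ranked = sorted(speeds)

# Step 2: count observations in each 10-unit interval starting at 20.
LOWER_BOUND, WIDTH, NUM_INTERVALS = 20, 10, 7
counts = [0] * NUM_INTERVALS
for speed in ranked:
    counts[(speed - LOWER_BOUND) // WIDTH] += 1

for i, count in enumerate(counts):
    low = LOWER_BOUND + i * WIDTH
    print(f"{low}-{low + WIDTH - 1}  interval {i + 1}: {count}")
```

The counts per interval are the heights of the histogram bars, and their running total divided by 70 gives the cumulative percentages.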
Step 3: Plot Histogram [Figure: histogram of the grouped speed data; x-axis: Interval]
Step 4: Plot CDF [Figure: cumulative distribution of the speed data; x-axis: Speed; y-axis: cumulative percentage, 0% to 100%]
Sample Size Example Want to collect speed data from a freeway segment Previous studies determined s = 4 mph (use with caution) Want to estimate population mean (µ) within ± 1 mph at a 99% confidence level $n = \left(\frac{4 \times 2.58}{1}\right)^2 = 106.5$, so 107 observations are needed
Sample Size Example Consider an already collected speed data sample Mean = 52.3 mph Std. dev. = 6.3 mph n = 200 Want to check whether we have an adequate sample size for a 99% confidence level and ε = 1 mph: $n = \left(\frac{6.3 \times 2.58}{1}\right)^2 \approx 264.2 > 200$, so there are not enough observations. How about for a 95% confidence level? $n = \left(\frac{6.3 \times 1.96}{1}\right)^2 \approx 152.5 < 200$, OK
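The two sample-size examples can be checked with a short helper. The function name `required_sample_size` is an assumption for this sketch; it returns the unrounded value of $(s\,z_{\alpha/2}/\varepsilon)^2$, which is then rounded up to a whole number of observations.

```python
import math


def required_sample_size(s, z, eps):
    """Unrounded minimum sample size (s * z / eps)^2.

    s   : estimated standard deviation (mph)
    z   : constant for the desired confidence level (2.58 for 99%, 1.96 for 95%)
    eps : permitted error in the mean estimate (mph)
    """
    return (s * z / eps) ** 2


# First example: s = 4 mph, 99% confidence, +/- 1 mph.
n_first = math.ceil(required_sample_size(4, 2.58, 1))
print(f"first example: {n_first} observations needed")

# Second example: s = 6.3 mph with 200 observations already collected.
collected = 200
for label, z in (("99%", 2.58), ("95%", 1.96)):
    needed = required_sample_size(6.3, z, 1)
    verdict = "OK" if needed <= collected else "not enough observations"
    print(f"{label}: need {needed:.1f}, have {collected} -> {verdict}")
```

Rounding up with `math.ceil` is the conservative choice, since a fractional observation cannot be collected.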
Hypothesis Testing A theoretical proposition which can be tested statistically A statement about an event, the outcome of which is unknown at the time of the prediction, set forth in a way that it can be rejected
Possible Outcomes in the Testing of a Hypothesis
H0: Null hypothesis
H1: Alternative hypothesis
Only one of the two hypotheses is true, but we don't know which.
Reality vs. Test (columns: test concludes H0 is true / test concludes H0 is false)
Reality, H0 true: OK / Type I error
Reality, H0 false: Type II error / OK
Type I error: Reject a correct null hypothesis (false positive)
Type II error: Fail to reject a false null hypothesis (false negative)
Hypothesis Testing Steps Formulate a hypothesis (H 0 ) Design a test procedure by which a decision can be made Use statistics to refine the test procedure, recognizing the tradeoff of Type I error versus Type II error Apply the test Make a decision
Examples Before and after study Speed reduction of 5 mph (it happened, it didn't) Accident reduction of 10% (it happened, it didn't) Compare two distributions (i.e., do two data samples come from the same distribution?) Whether observed pattern of data fits a particular distribution (Chi-Square Test) Significance of coefficients in a regression model (t Test) Etc.
Example Spot speeds observed over a year on a freeway were found to be normally distributed with a mean of 47.25 mph and s.d. = 8.61 mph. However, some new equipment has indicated that the mean speed is 48.63 mph. Is there any evidence that (a) the new equipment is faulty and (b) the new equipment is indicating a speed that is lower than the actual speed?
Test for Significant Difference Are two samples of data from the same distribution? How much difference is a significant difference? $z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ Where all variables are as defined before, with subscripts 1 and 2 referring to samples 1 and 2, respectively.
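The test statistic for the difference between two sample means can be sketched as below. The two sets of summary values are hypothetical and chosen only so the arithmetic is easy to follow; 1.96 is the two-sided critical value for a 95% confidence level.

```python
import math


def z_statistic(mean1, s1, n1, mean2, s2, n2):
    """z for the difference between two sample means."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical "before" and "after" speed samples (illustrative values only).
z = z_statistic(mean1=50.0, s1=6.0, n1=100, mean2=48.0, s2=8.0, n2=100)

Z_CRITICAL_95 = 1.96  # two-sided, 95% confidence level
significant = abs(z) > Z_CRITICAL_95
print(f"z = {z:.2f}, significant at 95%: {significant}")
```

Here the standard error is $\sqrt{0.36 + 0.64} = 1$, so z = 2.0, which exceeds 1.96; with these made-up numbers the difference would be judged significant at the 95% level.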
Distribution Fitting How do we determine distributional form? How confident can I be that the sample distribution represents the population dist.?
Distribution Fitting Plot the data Use a histogram: a graphical representation of a frequency distribution Examine Plot Can overlay with theoretical distributions for comparison
Histogram w/theoretical normal curve overlay
Goodness-of-Fit If distributions look like a match, proceed to statistical test Statistical Testing Different tests have been devised to compare fit of empirical data with theoretical distribution One of the most common tests is: Chi-squared ($\chi^2$)
Chi-squared Test How does Chi-squared test work? Define categories (or ranges) and assign data to the categories There should be at least 5 categories and 5 data entries per category Compute the expected number of samples for each category based upon the theorized distribution Compute difference between actual observations/class and theoretical distribution observations/class Compute Chi-squared value (see next page)
Chi-squared Statistic $\chi^2 = \sum_{i=1}^{I} \frac{(f_0 - f_t)^2}{f_t}$ $\chi^2$ = chi-squared value $f_0$ = observed number or frequency of observations in category i $f_t$ = theoretical (or other observed) number or frequency of expected observations in category i i = category index I = number of categories
Chi-squared Test (cont.) Determine reference Chi-squared value Compare calculated Chi-squared value to reference value If computed value < reference value, do not reject the hypothesis that the empirical data fit the theoretical distribution
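The chi-squared procedure can be sketched as below. The observed and expected frequencies are hypothetical, and the example assumes a theoretical distribution with no parameters estimated from the data, so that five categories give 4 degrees of freedom. The reference value 9.488 is the chi-squared value for 4 degrees of freedom at a 0.05 significance level.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f_o - f_t) ** 2 / f_t for f_o, f_t in zip(observed, expected))


def degrees_of_freedom(num_categories, num_estimated_params):
    """Categories minus one, minus parameters estimated from the data."""
    return num_categories - 1 - num_estimated_params


# Hypothetical category counts (illustrative values only).
observed = [8, 12, 20, 12, 8]
expected = [10, 10, 20, 10, 10]

chi2 = chi_squared_statistic(observed, expected)
dof = degrees_of_freedom(len(observed), num_estimated_params=0)

REFERENCE_4DOF_ALPHA_05 = 9.488  # chi-squared reference value, 4 d.o.f., 0.05 level
do_not_reject = chi2 < REFERENCE_4DOF_ALPHA_05
print(f"chi-squared = {chi2:.3f}, d.o.f. = {dof}, do not reject: {do_not_reject}")
```

For other degrees of freedom or significance levels, the reference value would come from a chi-squared table; it is hard-coded here only to keep the sketch free of external libraries.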
Chi-Square Distribution
Example Consider the spot speed data shown before. The computed mean was 48 mph and the computed standard deviation was 8.6 mph. Consider the following hypothesis: H0: The underlying distribution is normal with µ = 48 mph and σ = 8.6 mph. N = 7 categories, f = N − 1 − g = 7 − 1 − 2 = 4 (number of degrees of freedom, with g = 2 parameters estimated from the data), α = 0.05, reference Chi-squared value = 9.488. Computed Chi-squared value = 1.0209 < 9.488 => cannot reject H0