Exploring Imputation Techniques for Missing Data in Transportation Management Systems




Brian L. Smith
Assistant Professor
University of Virginia
Department of Civil Engineering
P.O. Box 400742
Charlottesville, VA 22904-4742
Phone: 434-243-8585
Fax: 434-982-2851
E-mail: briansmith@virginia.edu

William T. Scherer
Associate Professor
University of Virginia
Department of Systems and Information Engineering
P.O. Box 400747
151 Engineer's Way
Charlottesville, VA 22904
Phone: 434-982-2069
Fax: 434-982-2972
E-mail: wts@virginia.edu

James H. Conklin
Graduate Research Assistant
University of Virginia
Department of Systems and Information Engineering
P.O. Box 400747
151 Engineer's Way
Charlottesville, VA 22904
Phone: 434-924-3641
Fax: 434-982-2972
E-mail: conklin@virginia.edu

Corresponding Author: Brian L. Smith

Word Count: 4366 + 11 Figures * 250 words/figure = 7,116 words

Accepted for Publication and Presentation at the 82nd Annual Meeting of the Transportation Research Board

October 2002

TRB 2003 Annual Meeting CD-ROM. Paper revised from original submittal.

ABSTRACT

Many states have implemented large-scale transportation management systems to improve mobility in urban areas. These systems are highly prone to missing and erroneous data, which result in drastically reduced data sets for analysis and real-time operations. Imputation is the practice of filling in missing data with estimated values. Currently, the transportation industry does not generally utilize imputation as a means of handling missing data. Other disciplines have recognized the importance of addressing missing data and, as a result, methods and software for imputing missing data are becoming widely available. The purpose of this paper is to address the feasibility and applicability of imputing missing traffic data, and to perform a preliminary analysis of several heuristic and statistical imputation techniques. Preliminary results produced excellent performance in our case study and indicate that the statistical techniques are more accurate while maintaining the natural characteristics of the data.

INTRODUCTION

Many states have implemented large-scale transportation management systems (TMSs) to improve mobility in congested urban areas. These systems provide real-time monitoring of traffic conditions to support the implementation of control strategies, and to provide useful information to travelers. At the root of these activities lies the ability to measure the condition of traffic. Without this ability, few of these benefits can be realized from TMSs or other intelligent transportation systems.

TMSs typically use loop detectors embedded in the roadways to collect traffic data. Given the harsh environment in which they operate, loop detectors are highly prone to returning erroneous or missing data. For example, the Mobility Monitoring Program of the Texas Transportation Institute (TTI) reports that after screening erroneous data, TMS data archives can be anywhere from 16% to 93% complete. The median value in this study was 67% (Turner et al. 2001). Clearly, missing data is a significant problem in TMSs and in other systems and applications that rely on archived TMS data.

Currently, most TMSs do not employ statistical techniques for replacing missing data with values that are likely indicative of the conditions (generally referred to as imputing missing data). Instead, records with missing data are typically excluded from analysis. This is in agreement with the AASHTO Guidelines for Traffic Data Programs, which states: "Some current traffic editing programs estimate missing or edit-rejected data. This practice, termed imputation, is not recommended" (AASHTO 1992). AASHTO makes this recommendation based on the stated justification that imputing missing values introduces errors which cannot be quantified (AASHTO 1992). This recommendation, however, assumes that one cannot effectively use existing nearby data (either temporally or spatially) or historical traffic patterns to accurately impute missing values.

In contrast, analysts in other disciplines have recognized the need to directly address the challenge of imputation. This can be seen in the quantity of literature on the topic and in the development of statistical software designed to impute missing data. For example, the software package S-Plus has recently added a complete library of functions for missing data, and SAS has included more missing data functions in its latest version. Allison, a prominent statistician and sociologist, points out that in a data set with 1,000 records and 20 variables, a 5% missing data rate for each variable results in a data set with only about 360 complete records (Allison 2002). If values are missing independently across variables, the expected number of complete records is 1,000 x 0.95^20, or roughly 358. This problem is also evident in the TTI experience discussed above.

The purpose of this paper is to explore the feasibility of using imputation techniques in TMSs. In particular, the research examines heuristic imputation techniques that rely on historical data and on data from surrounding time periods and locations, as well as more classical theoretical statistical imputation techniques.

BACKGROUND INFORMATION

The purpose of this section is to introduce and define the basic concepts and issues addressed in this paper.

Missing Data

Loop detectors, the primary source of real-time and archived traffic data, commonly fail to report data back to TMSs. These failures have many causes, including construction, detector failure, communications network failure, and data archival system failure. Given the nature of TMSs, some level of missing data is unavoidable (Turner et al. 1999). The techniques addressed in this paper can impute traffic data collected by all types of detectors.

Erroneous Data

Erroneous data is another source of missing data. Erroneous data are data reported by detectors that are not physically feasible. For example, if the volume is greater than zero and the speed and occupancy are measured as 0, this indicates a physically infeasible state. The TransGuide system reports that approximately 1% of the data it collects is marked as suspect data (Turner et al. 1999). Depending on how the data is screened, it is likely that a much higher percentage of the data can be considered erroneous. Since erroneous data is not accurate, it is usually not used in analyses. As such, it is typically treated as missing data.

Properties of Traffic Data

Traffic data collected over a short polling interval tends to be very noisy. As the polling or measurement interval is extended, the data smooths out significantly to reveal an underlying signal. Figure 1 illustrates how the underlying flow pattern or signal is much more evident when the data has been aggregated from the 1-minute interval to the 10-minute interval. The Highway Capacity Manual recommends 15-minute intervals for most of its analyses for this very reason (TRB 2000). The high level of variability poses a significant problem when imputing missing data at the 1-minute level. As the 1-minute chart in Figure 1 shows, the volume level can vary by more than 500 vehicles/hour/lane (vphpl), or nearly ¼ of a lane's capacity, from one minute to the next. As a result, this study focuses on imputing 10-minute data.

METRICS FOR EVALUATING IMPUTATION TECHNIQUES

Identifying preferred imputation techniques is a multi-objective problem. It is important to impute each value as accurately as possible, while at the same time maintaining the natural characteristics of the real data. In addition, one must carefully consider the practical implementation and maintenance issues associated with a technique. The following metrics were used to evaluate the imputation techniques.

Quantitative Measures

This set of measures allows one to directly compare the accuracy of multiple imputation techniques.

Error Measures

Error, as it is treated in this research, is the difference between the actual value and the imputed value. The error equation for volume is shown below. The same equation can be used to calculate the error for speed and occupancy.

\[ e_{i,j,t} = V_{i,j,t} - \hat{V}_{i,j,t} \]

where
$e_{i,j,t}$ = the error for detector i in station j at time t
$V_{i,j,t}$ = actual volume for detector i in station j at time t
$\hat{V}_{i,j,t}$ = imputed volume for detector i in station j at time t

Error Distribution - This measure illustrates how the error values (where the error is considered a random variable) are distributed. This distribution ideally should be centered close to zero. A perfect imputation technique would have a mean error of 0 units and a standard deviation of 0 units. This measure can most easily be viewed and compared using box plots. The box surrounding the median (marked by a solid dot) shows the range of values from the 25th to the 75th percentile of the data. The notches indicate the 95% confidence interval about the median, and the whiskers extend 1.5 times the interquartile range (the difference between the 75th percentile value and the 25th percentile value) past the end of the box. The open dots plotted beyond the whiskers are the outlier points.
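The box-plot quantities described above can be computed directly from a set of errors. The following is a minimal sketch, not the authors' code: the function name `boxplot_summary` and the sample errors are made up for illustration, percentiles use NumPy's default linear interpolation, and the 95% notch interval about the median is left out.

```python
import numpy as np

def boxplot_summary(errors):
    """Summarize imputation errors using the quantities a box plot displays."""
    e = np.asarray(errors, dtype=float)
    q1, median, q3 = np.percentile(e, [25, 50, 75])
    iqr = q3 - q1  # interquartile range: 75th minus 25th percentile
    # Whiskers reach at most 1.5 * IQR past the box; points beyond are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = e[(e < low_fence) | (e > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "low_fence": low_fence, "high_fence": high_fence,
        "outliers": outliers,
    }

# Illustrative (made-up) volume errors in vehicles per 10-minute interval.
summary = boxplot_summary([-40, -5, -2, 0, 1, 3, 4, 60])
print(summary["median"], summary["iqr"], summary["outliers"])
```

A technique whose median sits near zero, with a narrow box and few outliers, matches the ideal error distribution described above.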

Mean Absolute Error (MAE) - This measure shows the average error, which indicates on average how far from the actual value a particular technique imputes.

\[ \mathrm{MAE} = \frac{\sum |e_{i,j,t}|}{n} \]

where n = number of imputed values

Mean Absolute Percent Error (MAPE) - This is an average measure of how far away from the actual value a particular technique imputes, scaled to the actual value. The equations below show how to calculate the MAPE for volume. The same equations can be used for speed and occupancy.

\[ \mathrm{MAPE} = \frac{\sum PE_{i,j,t}}{n}, \qquad PE_{i,j,t} = \frac{|e_{i,j,t}|}{V_{i,j,t}} \times 100\% \]

where $PE_{i,j,t}$ = percent error for detector i in station j at time t

Root Mean Squared Error (RMSE) - This is a weighted average of the error that applies a much heavier weight to large errors than to small errors. This is a classic performance metric widely used in model development and analysis.

\[ \mathrm{RMSE} = \sqrt{\frac{\sum e_{i,j,t}^{2}}{n}} \]

Standard Deviation of Errors (SDE) - This is a measure of how much variance there is in the set of errors. A smaller standard deviation means that the errors are tightly clustered around the mean value.

\[ \mathrm{SDE} = \sqrt{\operatorname{var}(e_{i,j,t})} \]

Minimum Absolute Error (MinAE) - This is the minimum absolute error value.

\[ \mathrm{MinAE} = \min(|e_{i,j,t}|) \]

Maximum Absolute Error (MaxAE) - This is the maximum absolute error value.

\[ \mathrm{MaxAE} = \max(|e_{i,j,t}|) \]

Change in Natural Variance

This measure is simply the difference between the variance in the real data and the variance in the imputed data. This is of concern because many imputation techniques reduce the natural variance of the data. This is potentially an undesirable effect and should be minimized as much as possible. For example, simply imputing with the historical mean can significantly reduce the variance of the overall data set. The variance can be a particularly important variable. For example, when calculating any measure that uses percentile values, the variance of the data set is a crucial element.

Percent Change in Variance (PCV) - This measures the difference between the variance of the actual data and the variance of the imputed data. For the purposes of the preliminary analysis, this measure treats the variance as a stationary value. This may not always be the case, as can be seen in Figure 1.

\[ \mathrm{PCV} = \frac{\operatorname{var}(\hat{V}) - \operatorname{var}(V)}{\operatorname{var}(V)} \times 100\% \]

Qualitative Measures

This set of measures is intended to assess the implementation and maintenance issues associated with the use of a particular imputation technique.
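Before turning to the qualitative measures, the quantitative measures defined above (MAE, MAPE, RMSE, SDE, MinAE, MaxAE, and PCV) can be sketched in a few lines. This is an illustrative sketch only: the function name `imputation_metrics` and the sample values are hypothetical, the error is taken as actual minus imputed, and the sample variance (`ddof=1`) is an assumption, since the paper does not say whether sample or population variance was used.

```python
import numpy as np

def imputation_metrics(actual, imputed):
    """Compute the quantitative accuracy measures for one imputation technique."""
    v = np.asarray(actual, dtype=float)
    v_hat = np.asarray(imputed, dtype=float)
    e = v - v_hat          # error = actual value minus imputed value
    abs_e = np.abs(e)
    var_actual = v.var(ddof=1)  # assumption: sample variance
    return {
        "MAE": abs_e.mean(),
        "MAPE": (abs_e / v).mean() * 100.0,  # assumes no actual value is zero
        "RMSE": np.sqrt((e ** 2).mean()),
        "SDE": e.std(ddof=1),
        "MinAE": abs_e.min(),
        "MaxAE": abs_e.max(),
        "PCV": (v_hat.var(ddof=1) - var_actual) / var_actual * 100.0,
    }

# Illustrative (made-up) actual and imputed volumes.
actual = [100.0, 200.0, 400.0, 500.0]
imputed = [110.0, 190.0, 400.0, 450.0]
metrics = imputation_metrics(actual, imputed)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the single 50-vehicle miss dominates the RMSE while barely moving the MAPE, and the negative PCV shows the imputed series has less variance than the actual one.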

Required Input Data - Different imputation techniques require different sets of input data. This metric is a qualitative assessment of the availability of required input data. If a technique requires many different types of data, the probability of all data being available on a consistent basis may be quite low.

Complexity - Some imputation techniques are very simple and easy to understand, while others require a great deal of background knowledge. While some of the more complex techniques may provide good estimates, they may be difficult to use and ultimately be impractical for implementation.

Computational Speed - Given the quantity of missing data and the real-time requirements of a TMS, it is important that the imputation process execute rapidly. If a technique is too computationally intensive, it may be impractical.

SURVEY OF IMPUTATION TECHNIQUES

Techniques for estimating traffic data can be broadly classified in two categories: heuristic or statistical. This section defines a number of imputation techniques that have either been developed or identified by the research team.

To facilitate the discussion, Figure 3 illustrates several names and conventions that will be used for the remainder of this paper. A detector is a sensor measuring volume, speed, and occupancy in a single lane. A station comprises all of the detectors at one location. The lanes are numbered starting with the median lane and ending with the shoulder lane. Station-level measurements are aggregated from the detector-level data as follows:

\[ V_{j,t} = \sum_{i=1}^{p} V_{i,j,t} \]

\[ O_{j,t} = \frac{\sum_{i=1}^{p} O_{i,j,t}}{p} \]

\[ S_{j,t} = \frac{\sum_{i=1}^{p} \left( V_{i,j,t} \, S_{i,j,t} \right)}{\sum_{i=1}^{p} V_{i,j,t}} \]

where
$V_{j,t}$ = station volume for station j at time t
$V_{i,j,t}$ = volume for detector i of station j at time t
$O_{j,t}$ = station occupancy for station j at time t
$O_{i,j,t}$ = occupancy for detector i of station j at time t
$S_{j,t}$ = station speed for station j at time t
$S_{i,j,t}$ = speed for detector i of station j at time t
p = number of detectors in the station

Each of the techniques analyzed in this paper generates imputations for each detector; however, this preliminary analysis only looked at data aggregated to the station level.
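As a small worked example of the station-level aggregation above, the sketch below sums lane volumes, averages lane occupancies, and computes the volume-weighted speed. The function name `aggregate_station` and the three-lane values are hypothetical.

```python
import numpy as np

def aggregate_station(volumes, occupancies, speeds):
    """Aggregate detector (lane) measurements to the station level."""
    v = np.asarray(volumes, dtype=float)
    o = np.asarray(occupancies, dtype=float)
    s = np.asarray(speeds, dtype=float)
    station_volume = v.sum()       # sum of lane volumes
    station_occupancy = o.mean()   # simple average of lane occupancies
    # Volume-weighted speed; undefined when no vehicles were counted.
    if station_volume > 0:
        station_speed = (v * s).sum() / station_volume
    else:
        station_speed = float("nan")
    return station_volume, station_occupancy, station_speed

# Illustrative three-lane station: median lane first, shoulder lane last.
vol, occ, spd = aggregate_station(
    volumes=[30, 50, 20], occupancies=[6.0, 9.0, 3.0], speeds=[65.0, 60.0, 55.0]
)
print(vol, occ, spd)
```

Weighting by volume keeps a lightly used lane from pulling the station speed away from what most vehicles experienced.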

Heuristic Techniques

Historical Average (Hist)

This technique simply uses the historical average value for a given time of day, day of week (weekend vs. weekday), and detector for imputing missing values. This is probably the most common approach to solving the problem of missing data. This technique performs reasonably well under normal conditions; however, it can misrepresent conditions when the traffic is abnormal. Figure 2 provides an example of how historical mean imputation does not adapt to abnormal traffic conditions. At t = 07:00, the volume drops significantly, but the historical mean would not reflect this abnormality. This technique is described mathematically below:

\[ \hat{V}_{i,j,t} = Hv_{i,j,t} \qquad \hat{O}_{i,j,t} = Ho_{i,j,t} \qquad \hat{S}_{i,j,t} = Hs_{i,j,t} \]

where
$\hat{V}_{i,j,t}$ = estimated volume for detector i in station j at time of day t
$\hat{O}_{i,j,t}$ = estimated occupancy for detector i in station j at time of day t
$\hat{S}_{i,j,t}$ = estimated speed for detector i in station j at time of day t
$Hv_{i,j,t}$ = historical volume for detector i in station j at time of day t
$Ho_{i,j,t}$ = historical occupancy for detector i in station j at time of day t
$Hs_{i,j,t}$ = historical speed for detector i in station j at time of day t
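A minimal sketch of historical-average imputation follows, assuming the history is keyed by detector, day type (weekday or weekend), and time of day as described above. The record layout, the detector identifier "141-1", and the function names are hypothetical and not taken from the TMS archive.

```python
from collections import defaultdict
from datetime import datetime

def day_type(ts):
    """Classify a timestamp as 'weekday' or 'weekend'."""
    return "weekend" if ts.weekday() >= 5 else "weekday"

def build_history(records):
    """Average volume per (detector, day type, time of day).

    records: iterable of (detector_id, datetime, volume); volume may be None.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for det, ts, vol in records:
        if vol is None:
            continue  # missing observations do not contribute to the average
        key = (det, day_type(ts), ts.strftime("%H:%M"))
        sums[key][0] += vol
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def impute_historical(det, ts, history):
    """Return the historical average for this slot, or None if there is none."""
    return history.get((det, day_type(ts), ts.strftime("%H:%M")))

# Illustrative records for one detector at 07:00.
sample_records = [
    ("141-1", datetime(2002, 5, 6, 7, 0), 120.0),   # Monday
    ("141-1", datetime(2002, 5, 7, 7, 0), 140.0),   # Tuesday
    ("141-1", datetime(2002, 5, 11, 7, 0), 60.0),   # Saturday
]
history = build_history(sample_records)
estimate = impute_historical("141-1", datetime(2002, 5, 17, 7, 0), history)
print(estimate)  # average of the two weekday observations
```

The same lookup returns the weekday average whether or not traffic is normal on the day being imputed, which is the weakness noted above.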

Weighted Average of Surrounding Detectors with Lane Distribution (SurS)

This technique uses a weighted average of upstream and downstream stations to impute the volume for the station with missing data. The volume is then distributed across the different lanes using historical lane distribution patterns. The occupancies are similarly distributed. To impute the speed, a volume-weighted average is distributed according to the historical distribution. In the following equations, the F factors account for the historical differences in the volumes, speeds, and occupancies between stations. So, in the scenario where a downstream station has much lower volumes due to a major exit between the stations, this factor will help scale up the imputation calculated from the upstream and downstream detectors' values. The lane distribution factor (D) is simply the average percentage of flow in a specific lane at a given time. Lane distribution patterns at a given time of day and a given location are very consistent (Smith and Conklin).

\[ \hat{V}_{i,j,t} = Dv_{i,j,t} \cdot \frac{Fv_{up,j,t} \sum_{m=1}^{q} V_{m,up,t} + Fv_{down,j,t} \sum_{m=1}^{r} V_{m,down,t}}{2} \]

\[ Dv_{i,j,t} = \frac{Hv_{i,j,t}}{\sum_{m=1}^{p} Hv_{m,j,t}} \]

\[ Fv_{k,j,t} = \frac{\sum_{m=1}^{p} Hv_{m,j,t}}{\sum_{m=1}^{s} Hv_{m,k,t}} \]

where
$\hat{V}_{i,j,t}$ = estimated volume for detector i of station j at time of day t
$Dv_{i,j,t}$ = historical lane distribution of volume for detector i of station j
$Hv_{i,j,t}$ = historical volume of detector i of station j at time of day t
$Fv_{k,j,t}$ = historical volume factor between station k and station j at time of day t
j = station being estimated
up = upstream station
down = downstream station
p = number of lanes in station j
q = number of lanes in the upstream station
r = number of lanes in the downstream station
s = number of lanes in the kth station

The occupancy and speed can also be calculated with this same technique. The speed is best calculated using a volume-weighted speed equation. Using the volume-weighted speed ensures that a few vehicles traveling at a very different speed from the rest of the vehicles will not offset the average speed. One of the advantages of this imputation technique is that it is independent of the data in surrounding time periods, so it can be used when data is missing for more than one period at a time.
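The sketch below illustrates one reading of the SurS volume calculation: each neighboring station's current total volume is scaled by the historical ratio between the target station and that neighbor, the two scaled values are averaged, and the result is split across lanes by the historical lane distribution. The equal weighting of the upstream and downstream stations is an assumption, and the function name `surs_volume` and all numbers are hypothetical.

```python
def surs_volume(hist_j, hist_up, hist_down, cur_up, cur_down):
    """Impute per-lane volumes at station j from its neighboring stations.

    hist_j, hist_up, hist_down: historical lane volumes for this time of day
    cur_up, cur_down: current lane volumes at the upstream/downstream stations
    """
    total_j = sum(hist_j)
    f_up = total_j / sum(hist_up)      # historical volume factor, upstream -> j
    f_down = total_j / sum(hist_down)  # historical volume factor, downstream -> j
    # Assumption: the two scaled neighbor volumes are averaged with equal weight.
    station_estimate = (f_up * sum(cur_up) + f_down * sum(cur_down)) / 2.0
    # Split the station estimate across lanes by historical lane share.
    return [h / total_j * station_estimate for h in hist_j]

# Illustrative values: the downstream station historically carries more traffic.
lanes = surs_volume(
    hist_j=[400.0, 500.0, 300.0],
    hist_up=[300.0, 400.0, 300.0],
    hist_down=[500.0, 600.0, 400.0],
    cur_up=[300.0, 400.0, 200.0],
    cur_down=[500.0, 500.0, 350.0],
)
print([round(x, 1) for x in lanes])
```

Because the calculation uses only the current interval at neighboring stations plus historical ratios, it still works when the target station has been silent for many consecutive intervals.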

Average of Surrounding Time Periods (SurT)

This technique averages the values from the 10-minute intervals before and after the missing value. The application of this technique is limited because it can only be used when both the preceding and following periods do not have missing data.

\[ \hat{V}_{i,j,t} = \frac{V_{i,j,t-1} + V_{i,j,t+1}}{2} \qquad \hat{O}_{i,j,t} = \frac{O_{i,j,t-1} + O_{i,j,t+1}}{2} \qquad \hat{S}_{i,j,t} = \frac{S_{i,j,t-1} + S_{i,j,t+1}}{2} \]

where
$\hat{V}_{i,j,t}$ = estimated volume for detector i of station j at time t
$\hat{O}_{i,j,t}$ = estimated occupancy for detector i of station j at time t
$\hat{S}_{i,j,t}$ = estimated speed for detector i of station j at time t
$V_{i,j,t}$ = actual volume for detector i of station j at time t
$O_{i,j,t}$ = actual occupancy for detector i of station j at time t
$S_{i,j,t}$ = actual speed for detector i of station j at time t

Statistical Techniques

The objective of using classical statistical techniques, such as data augmentation or expectation maximization, to impute missing data is to impute values such that the natural characteristics (particularly the mean and variance) of the data are preserved. This research explores the use of Expectation Maximization and Data Augmentation. Although these techniques can be used independently of each other, the EM algorithm is commonly used to generate starting values for Data Augmentation. The EM algorithm is used in this fashion for this research.

Expectation Maximization (EM)

EM is an iterative regression technique in which the variables with missing values are regressed on the available data, along with any additional variables provided as inputs to the algorithm. First, a vector of means and a covariance matrix are calculated using all available data. The means are then imputed for missing values in each variable. These imputed means serve as starting values for the imputation. Next, variables with missing values are regressed on all the other available variables. The imputed mean values are then replaced with estimates calculated from the regression equations. With the new imputations in place, the means and covariances are recalculated. Regression equations and imputations are iteratively calculated until the mean and covariance matrix values converge. S-Plus, a common statistical package, uses the maximum absolute relative change in the parameters as the convergence criterion. This criterion calculates the change in each element of the mean vector and covariance matrix from one iteration to the next. When this difference is less than some user-defined tolerance for each element, the algorithm stops iterating. By default, S-Plus uses the value of 0.001 as the tolerance level.

The basic implementation of the algorithm is as follows (Allison 2002; Little and Rubin 1987):

Step 0: Calculate sample means and covariance matrix, M and Σ, for each variable with all available data.
Step 1: Insert estimate of mean into all variables with missing data.
Step 2: Calculate maximum likelihood estimate of missing values. Replace missing values with new estimates.
Step 3: Estimate means and covariance matrix.
Step 4: If means and covariance matrix have converged: stop, and use current estimates of missing values. Otherwise: repeat Steps 2-4.

The additional variables mentioned previously could include historical averages, data from other stations, or other correlated values. The purpose of these additional variables is to provide more information about the variables with missing data so that a better regression model can be generated. While EM can be used independently to impute missing data, it is used here (as is commonly done) to generate starting values for the missing data points in the Data Augmentation algorithm. As such, no results are shown for the EM algorithm.

Data Augmentation

Data augmentation (DA) is much like EM; however, it adds two random draws from Bayesian posterior distributions to help preserve the characteristics of the data further. The imputed values generated by the EM algorithm are commonly used as starting values for the missing points in data augmentation. The following algorithmic explanation of Data Augmentation is adapted directly from Allison's book Missing Data:

Step 0: Insert starting values for all missing data (outputs from EM are suggested).
Step 1: Use means and covariance matrix to calculate regression coefficients for missing data (one set of coefficients for each pattern of missing data).
Step 2: Using regression coefficients, calculate estimates for missing values. Add a random draw from the residual distribution of the regression equation.
Step 3: Estimate means and covariance matrix using all data points.
Step 4: Make a random draw from the posterior distribution of the means and covariances.
Step 5: Repeat until the random draws in Step 4 converge.

For the application in this research, the following variables were used as inputs to the multiple imputation analysis: current data with missing values, historical values for the current station, and current values for the upstream and downstream stations. Depending on the pattern of missing data, the additional variables may be changed. While the inclusion of additional variables increases the run time and complexity, it can improve the quality of the imputations. In this case, it is important to include the current values from the upstream and downstream stations because, as previously mentioned, using only historical data for imputation can result in poor imputations during abnormal periods.

CLASSIFICATION OF MISSING DATA

Missing data can be found in many different patterns, both temporally and spatially. Naturally, when there are fewer missing records (spatially and temporally), there is more information available to support imputation. For example, if one detector in the middle lane is missing data for one minute, and all of the detectors around it have data, we can make a good estimate of the corresponding missing station volume. If, however, several consecutive stations have not been functional for 3 months, it is much more difficult to determine what the state of the system is. By classifying the missing data patterns, it is possible to better assess the utility of alternative imputation techniques.
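Returning briefly to the EM steps listed under Statistical Techniques, the sketch below follows them literally: start from column means, regress each missing entry on the observed entries in its row using the current means and covariances, recompute, and stop on the maximum absolute relative change. It is an illustration rather than the S-Plus routine or the authors' code; the function name `em_style_impute` is hypothetical, and a full EM for multivariate normal data also adds a conditional-covariance correction when re-estimating the covariance matrix, which is omitted here.

```python
import numpy as np

def em_style_impute(X, tol=1e-3, max_iter=200):
    """Iterative regression imputation following the EM steps in the text."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Steps 0-1: fill missing entries with the means of the available data.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    mu = filled.mean(axis=0)
    sigma = np.cov(filled, rowvar=False)
    for _ in range(max_iter):
        # Step 2: regress missing entries on observed entries, row by row.
        for r in np.flatnonzero(missing.any(axis=1)):
            m, o = missing[r], ~missing[r]
            if not o.any():
                filled[r] = mu  # nothing observed in this row: use the means
                continue
            coef = sigma[np.ix_(m, o)] @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[r, m] = mu[m] + coef @ (filled[r, o] - mu[o])
        # Step 3: re-estimate means and covariance matrix.
        new_mu = filled.mean(axis=0)
        new_sigma = np.cov(filled, rowvar=False)
        # Step 4: maximum absolute relative change in the parameters.
        old = np.concatenate([mu, sigma.ravel()])
        new = np.concatenate([new_mu, new_sigma.ravel()])
        change = np.max(np.abs(new - old) / np.maximum(np.abs(old), 1e-12))
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return filled

# Illustrative data: the second column is an exact linear function of the first.
x = np.arange(10, dtype=float)
data = np.column_stack([x, 2.0 * x + 1.0])
data[[2, 5, 8], 1] = np.nan
completed = em_style_impute(data, tol=1e-6)
print(np.round(completed[[2, 5, 8], 1], 3))
```

In practice the extra columns would be the additional variables named above, such as the historical average for the station and the current values at the upstream and downstream stations.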

Figure 4 illustrates the taxonomy of different missing data patterns. Certain patterns of missing data prohibit the use of some imputation techniques. For example, in the case where there is a Missing Station, the data from surrounding detectors cannot be used to impute the data in the middle lane. The statistical techniques, denoted by gray boxes, can conceptually be adapted for any pattern of missing data by changing the additional variables included as inputs.

PROCEDURE FOR EVALUATING IMPUTATION TECHNIQUES

To measure the effectiveness of the different imputation techniques, records from complete traffic datasets were artificially removed and saved. Then, using the imputation techniques, the removed data was imputed and compared to the real data. To aid in comparison, the same data was used for each analysis. In this case, every third record was removed from the data. By removing data in this fashion, all of the techniques, including surrounding time periods, could be applied to the same data set. Since traffic data is primarily used during the daytime hours, this research limits its analysis to the data spanning from 06:00 to 22:00.

The data used in this research was collected by the Virginia Department of Transportation's (VDOT) Northern Virginia Smart Traffic Center, the TMS for the region, and archived in the Smart Travel Lab at the University of Virginia. This TMS collects volume, speed, and occupancy measurements every minute for 1146 detectors located at 543 locations along northern Virginia's freeways. The data is collected using inductive loop detectors.
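The evaluation procedure described above (remove every third record, impute it, compare with the saved value) can be sketched with the surrounding-time-periods technique as the imputer. The function names and the 10-minute volumes below are hypothetical; the error is computed as actual minus imputed.

```python
def mask_every_third(series, start=1):
    """Remove every third record; return the masked copy and the saved values."""
    masked = list(series)
    removed = {}
    for t in range(start, len(series), 3):
        removed[t] = series[t]  # save the real value for later comparison
        masked[t] = None
    return masked, removed

def surt_impute(series):
    """Fill each gap with the mean of the intervals just before and after it."""
    out = list(series)
    for t in range(1, len(series) - 1):
        before, after = series[t - 1], series[t + 1]
        if series[t] is None and before is not None and after is not None:
            out[t] = (before + after) / 2.0
    return out

# Illustrative 10-minute station volumes (vehicles/hour/lane).
volumes = [1200.0, 1260.0, 1300.0, 1350.0, 1330.0, 1400.0, 1380.0]
masked, removed = mask_every_third(volumes)
imputed = surt_impute(masked)
errors = {
    t: actual - imputed[t]
    for t, actual in removed.items()
    if imputed[t] is not None
}
print(errors)
```

Removing every third record guarantees that each removed record has observed neighbors on both sides, which is why this masking pattern lets SurT be compared on the same records as the other techniques.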

For this research, data was extracted from stations 131, 141, and 151. Imputation was performed on station 141, and stations 131 and 151 were used as additional input data. These stations are located along Route 66 East approaching the Washington DC Beltway. Figure 5 illustrates the layout of these stations. The speed limit at this location is 55 miles per hour. The historical average tables were calculated using data from February 6, 2002 until May 6, 2002. On average, there were 28.9 records used to calculate each historical average value. The current data was collected on May 17, 2002. Figure 6 shows a volume plot for the test data on station 141.

RESULTS

Results for this case study are presented using both the qualitative and quantitative metrics identified earlier in the paper.

Qualitative Metrics

Table 1 presents a preliminary analysis of the qualitative metrics for each imputation technique. These values are subjective and are based on the literature and the basic characteristics of each of the techniques. This table is presented to provide the reader with the authors' assessment of the qualitative metrics for each alternative imputation technique.

Quantitative Metrics

The quantitative results, found in Table 2, are encouraging. They show that it is indeed possible to impute missing values with reasonable results. Figure 7 shows that the error terms are generally centered on or close to zero. A particularly notable result is that the Data Augmentation technique performed very well. This is encouraging because Data Augmentation provides many attractive statistical properties, despite the fact that it is more complex and more difficult to interpret.

Figure 8 shows the Root Mean Squared Error terms for each of the imputation techniques. Data Augmentation performs better than the other techniques for imputing volume and speed. In a statistical test at the 95% confidence level, DA had a lower MAPE than the historical or surrounding station techniques for volume and speed measurements. For the occupancy measurements, DA had the same MAPE as all of the other techniques at the 95% confidence level. Statistical tests showed that the MAPE for DA was not better than that of the SurT technique for volume, speed, or occupancy.

The root mean squared error weights larger errors more heavily than smaller errors, so the lower RMSE for DA indicates that the errors made by this method are generally smaller in magnitude than those of the other techniques. Figure 9 shows that the RMSE for the DA technique is much lower than that of the SurT technique. This indicates that the SurT technique must have several very large errors, because the MAPE for SurT and DA proved to be the same.

On average, Data Augmentation will estimate the missing volumes within 3% of the actual values and the missing speeds within 1% of the actual values. If the average speed were 60 mph, a 5% error would be an error of 3 mph. Over a 10-mile stretch, this would result in a 30 second error in simple travel time estimation. The average and median errors when using Data Augmentation are almost exactly zero, so if an analyst extracted time series data with imputed values for a day, the total volume and the average speeds and occupancies would probably be very close to the real values.

RECOMMENDATIONS

Based on the research presented in this paper, it is clear that the technique used to impute data must be chosen judiciously and with regard to the analyses that will use the imputed data. While the advanced statistical techniques will likely produce results that are well suited to the majority of situations, their complexity may be prohibitive. If the data is to be used for a real-time application such as travel time estimation, one of the heuristic techniques such as SurS may be appropriate and work well. The calculations involved are fast and the results are generally reliable. On the other hand, if the purpose of imputing the missing values is to create a complete dataset to be stored in an archival database for future analysis, it may be beneficial to implement some of the more sophisticated methods. The statistical methods will most likely generate better results, and they will help capture the natural variance in the data. Some statistics, such as the buffer index (a measure of traffic variability), are very sensitive to the variance of the data. As such, it would not be advisable to use historical mean imputation in this case.

CONCLUSIONS

These results indicate that imputing missing values in Transportation Management Systems is feasible. The techniques presented in this research are a first step towards finding a functional imputation system. The results from this research indicate that the more sophisticated statistical techniques may generate better imputations than the simpler heuristic approaches. Further research should be conducted using different input variables, different missing data patterns, and possibly other statistical techniques for imputation.

Based on these results, it is recommended that the transportation profession seriously reconsider the AASHTO policy of not imputing traffic data. This will provide users with as much information as possible. It is also recommended that each imputed record be flagged in a database. This will allow users to choose whether or not they wish to utilize imputed data based on the needs of a particular application or analysis.
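The recommendation to flag each imputed record could take many forms in practice. The following Python sketch shows one possible shape, assuming a simple in-memory record type. The `DetectorRecord` class, its field names, and the helper functions are hypothetical and are not taken from any TMS or archive schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DetectorRecord:
    station: int
    minute: str                    # time of day, e.g. "06:01"
    volume: Optional[float]        # None when the detector reported nothing
    imputed: bool = False          # flag recommended in the text
    method: Optional[str] = None   # which technique produced the value

def fill_and_flag(records: List[DetectorRecord],
                  estimate: Callable[[DetectorRecord], float],
                  method: str) -> List[DetectorRecord]:
    """Fill missing volumes and mark each filled record as imputed."""
    for rec in records:
        if rec.volume is None:
            rec.volume = estimate(rec)
            rec.imputed = True
            rec.method = method
    return records

def observed_only(records: List[DetectorRecord]) -> List[DetectorRecord]:
    """Let an analysis opt out of imputed values."""
    return [rec for rec in records if not rec.imputed]

# Hypothetical usage: the estimator here is a stand-in constant.
records = [
    DetectorRecord(141, "06:00", 1200.0),
    DetectorRecord(141, "06:01", None),
    DetectorRecord(141, "06:02", 1260.0),
]
fill_and_flag(records, lambda rec: 1230.0, method="SurT")
```

Storing the method name alongside the flag is a design choice beyond what the text recommends; it lets an analysis accept values from some techniques and reject others.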

RESOURCES

AASHTO. (1992). "AASHTO guidelines for traffic data programs." American Association of State Highway and Transportation Officials, Washington, D.C.

Allison, P. D. (2002). Missing data. Sage Publications, Thousand Oaks, Calif.

Little, R. J. A., and Rubin, D. B. (2002). Statistical analysis with missing data, 2nd edition. Wiley, New York.

Smith, B. L., and Conklin, J. H. "The Use of Local Lane Distribution Patterns for the Estimation of Missing Data in Transportation Management Systems." Transportation Research Board (Accepted for Publication).

TRB. (2000). "Highway Capacity Manual." The National Research Council, Washington, D.C.

Turner, S. M., Eisele, W. L., Gajewski, B. J., Albert, L. P., and Benz, R. J. (1999). "ITS Data Archiving: Case Study Analyses of San Antonio TransGuide® Data." Texas Transportation Institute, The Texas A&M University System, College Station, Texas.

Turner, S. M., Lomax, T., and Margiotta, R. (2001). "Monitoring Urban Roadways in 2000: Using Archived Operations Data for Reliability and Mobility Measurement." Texas Transportation Institute, College Station, TX.

LIST OF TABLES

TABLE 1: PRELIMINARY QUALITATIVE MEASURES
TABLE 2: PRELIMINARY RESULTS

LIST OF FIGURES

FIGURE 1: FLOW VS. TIME PLOTS
FIGURE 2: HISTORICAL FLOW VS. ACTUAL FLOW
FIGURE 3: ROAD LAYOUT DIAGRAM
FIGURE 4: CLASSIFICATION OF APPLICABLE IMPUTATION TECHNIQUES BY MISSING DATA PATTERN
FIGURE 5: LANE DIAGRAM
FIGURE 6: VOLUME PLOT FOR STATION 141 ON MAY 17, 2002
FIGURE 7: ERROR BOX PLOTS
FIGURE 8: ROOT MEAN SQUARE ERROR
FIGURE 9: ABSOLUTE PERCENT ERROR PLOT

Table 1: Preliminary Qualitative Measures

Technique  Required Input Data                                   Complexity  Computational Speed
Hist       Historical average                                    Low         Very Fast
SurT       Data from 1 min before and after data                 Low         Fast
SurS       Historical lane distribution factors;                 Med         Fast
           historical factors from up and downstream stations;
           values from surrounding detectors
EM         Historical data from station;                         High        Medium
           values from surrounding stations and detectors
           (if available)
DA         Historical data from station;                         Very High   Slow
           values from surrounding stations and detectors
           (if available)

Table 2: Preliminary Results

Measure  Method  Volume        Occupancy           Speed
MAPE     DA      2.6%          5.4%                0.9%
         Hist    3.7%          4.6%                2.5%
         SurS    3.8%          3.8%                3.4%
         SurT    3.2%          3.5%                0.9%
MAE      DA      4.38 vphpl    0.41 % occupancy    0.48 mph
         Hist    5.37          0.45                0.91
         SurS    5.76          0.34                2.00
         SurT    4.72          0.29                0.45
RMSE     DA      66.2 vphpl    6.49 % occupancy    7.32 mph
         Hist    72.1          9.80                18.89
         SurS    72.9          6.97                24.58
         SurT    73.1          6.47                8.80
SDE      DA      18.7 vphpl    1.8 % occupancy     2.08 mph
         Hist    17.7          2.65                5.31
         SurS    20.8          1.82                2.46
         SurT    17.4          1.96                3.93
PCV      DA      -0.5%         1.7%                -1.0%
         Hist    5.3%          -11.8%              -14.3%
         SurS    2.4%          -9.8%               23.4%
         SurT    -2.1%         1.0%                6.2%
MinAE    DA      0.106 vph     0.013 % occupancy   0.007 mph
         Hist    0.138         0.003               0.024
         SurS    0.315         0.002               0.233
         SurT    0.000         0.000               0.049
MaxAE    DA      96.4 vph      8.74 % occupancy    8.37 mph
         Hist    74.5          16.51               27.56
         SurS    70.1          13.08               21.17
         SurT    125.0         15.55               13.61
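To put the speed errors in Table 2 in practical terms, the travel-time arithmetic used in the results discussion (a 5% error at 60 mph over a 10-mile stretch) can be checked with a short Python sketch. The function names are hypothetical, and the calculation assumes the simplest possible travel time estimate, distance divided by a single speed.

```python
def travel_time_seconds(distance_miles: float, speed_mph: float) -> float:
    """Simple travel time: distance over speed, converted from hours to seconds."""
    return distance_miles / speed_mph * 3600.0

def travel_time_error_seconds(distance_miles: float,
                              true_speed_mph: float,
                              speed_error_pct: float) -> float:
    """Travel time from the erroneous speed minus travel time from the true speed."""
    estimated_speed = true_speed_mph * (1.0 + speed_error_pct / 100.0)
    return (travel_time_seconds(distance_miles, estimated_speed)
            - travel_time_seconds(distance_miles, true_speed_mph))

# Speed underestimated by 5% (57 mph instead of 60 mph) over 10 miles.
slow_case = travel_time_error_seconds(10.0, 60.0, -5.0)
# Speed overestimated by 5% (63 mph instead of 60 mph) over 10 miles.
fast_case = travel_time_error_seconds(10.0, 60.0, 5.0)
```

Under these assumptions the error is roughly 32 seconds when the speed is underestimated and roughly 29 seconds when it is overestimated, which is consistent with the 30 second figure quoted in the text. Passing one of the speed MAPE values from Table 2 as `speed_error_pct` gives the corresponding travel time error for a given technique.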

Figure 1: Flow vs. Time Plots.
[Two panels: "Flow vs. Time, 1 Min Interval" and "Flow vs. Time, 10 Min Interval". X-axis: time of day, May 7, 2002. Y-axis: flow (vphpl), 0 to 2000.]

Figure 2: Historical Flow vs. Actual Flow.
[Plot "Historical vs. Actual Flow". Series: Historical, Actual. X-axis: time of day, May 17, 2002. Y-axis: volume (vphpl).]

Figure 3: Road Layout Diagram.
[Diagram labels: Median, Loop Detectors, lanes 1 to 4, Shoulder, Station.]

Figure 4: Classification of Applicable Imputation Techniques by Missing Data Pattern.
[Diagram. Missing data patterns: Missing Detector 1 Minute, Missing Detector > 1 Minute, Missing Station 1 Minute, Missing Station > 1 Minute, Missing Stretch 1 Minute, Missing Stretch > 1 Minute. Techniques shown under the patterns: Historical Average, Weighted Average of Surrounding Stations w/ LD, Average of Surrounding Time Periods, EM, Data Augmentation.]

Legend:
Missing Detector - Missing data for 1 detector for more than 1 minute
Missing Station - Missing data for all detectors in a given station
Missing Stretch - Missing data for all detectors in 2 or more adjacent stations

Figure 5: Lane Diagram.
[Diagram labels: Station 131, Station 141, Station 151; Median; lanes 1 to 4; distances 0.402 miles, 0.473 miles, 0.875 miles.]

Figure 6: Volume plot for Station 141 on May 17, 2002.
[Plot "Flow vs. Time". X-axis: time of day, May 17, 2002. Y-axis: flow (vphpl).]

Figure 7: Error Box Plots.
[Three panels: "Volume Error Terms", "Occupancy Error Terms", "Speed Error Terms". Each panel shows box plots of the error for Hist, DA, SurS, and SurT.]

Figure 8: Root Mean Square Error.
[Three panels: Volume, Occupancy, Speed. X-axis: Method (DA, Hist, SurS, SurT). Y-axis: Root Mean Squared Error.]

Figure 9: Absolute Percent Error Plot.
[Chart "Average Percent Absolute Error". Series: Hist, DA, SurS, SurT. Categories: Volume (vphpl), Occupancy (% occupancy), Speed (mph). Y-axis: Average Absolute Error, 0% to 5%.]