Predict. Learning to. Abstract. Introduction. Overview. Lei Sun. Flight Delay. when people. Flight delays. According to. be compared.

Similar documents
Predicting Flight Delays

Airport Performance Measurement

Analytics That Allow You To See Beyond The Cloud. By Alex Huang, Ph.D., Head of Aviation Analytics Services, The Weather Company, an IBM Business

Employer Health Insurance Premium Prediction Elliott Lui

CSCI567 Machine Learning (Fall 2014)

Supervised Learning (Big Data Analytics)

Cumulative Diagrams: An Example

AGENCY: Federal Aviation Administration (FAA), Department of Transportation.

As Flight Delays at United, American & Delta Jump, Airlines Oppose Airport Proposal for Funding that Could Be Used to Reduce Delays

Making Sense of the Mayhem: Machine Learning and March Madness

Statistical Machine Learning

Linear Threshold Units

QUALITY OF SERVICE INDEX

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Chapter 6. The stacking ensemble approach

Predict Influencers in the Social Network

Aviation Safety: Making a safe system even safer. Nancy Graham Director, Air Navigation Bureau International Civil Aviation Organization

Machine Learning in Spam Filtering

Encrypting Network Traffic

FEES AND CHARGES. For the use of Heliports and Airports in Greenland. G. Season Cards and Flights Using Closed Landing Grounds

x 2 + y 2 = 1 y 1 = x 2 + 2x y = x 2 + 2x + 1

Section 2: Ten Tools for Applying Sociology

AUDIT REPORT AIRCRAFT AND AIR SERVICE MANAGEMENT PROGRAMS DOE/IG-0437 JANUARY 1999

The Economic Impact of Commercial Airports in 2010

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

Dubai International Airport

Predict the Popularity of YouTube Videos Using Early View Data

Cathay Pacific Flight Schedule

Support Vector Machine (SVM)

Do Low Cost Carriers Provide Low Quality Service? *

LCs for Binary Classification

Classification using Logistic Regression

PREPARING FOR TAKEOFF: AIR TRAVEL OUTLOOK FOR 2016 Mining 2015 data to understand when to buy, expected ticket prices, and fare differentials

While flight plan calculations are necessary for safety and regulatory compliance, they also provide airlines with an opportunity for cost

Aviation and the European Union Emissions Trading System 1. U.S. aviation industry and proposed House bill on the EU Aviation Directive

Charter Services. Frequently Asked Questions

Making Sense of Urban Data

Optimization Requirements Document for the Meteorological Data Collection and Reporting System/ Aircraft Meteorological Data Relay System

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Flight delay performance at Hartsfield-Jackson Atlanta International Airport

ICAO Language Proficiency in Ab-Initio Flight Training. Ms. Angela C. Albritton Aviation English Consultant

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

A Flight Bodo Mo i Rana with Wideroe of Norway

ILS Replacement. ACI World Safety Seminar November 2008 Kempinski Hotel Beijing Lufthansa Centre

Beechcraft 1900D: Fuel, Emissions & Cost Savings Operational Analysis

Walt Disney World Walt Disney World Walt Disney World Walt Disney World Walt Disney World Walt Disney World Walt Disney World

WATSONVILLE MUNICIPAL AIRPORT MASTER PLAN CITY OF WATSONVILLE, SANTA CRUZ COUNTY, CALIFORNIA CHAPTER 3. AVIATION FORECASTS REVISED APRIL 2010

CHAPTER 4. o Hotel Results 15 CHAPTER 5. o Car Results: Matrix & Options 19. o Ground and Limo Service 21. o Trip Purchasing & Booking 23

EST.03. An Introduction to Parametric Estimating

No-Annual-Fee Credit Cards in the Airline Industry Industry Analysis from IdeaWorks

Chartering An Aircraft A Consumers Guide

Comment submission: Date for submission of comments: Comments will be received until 5:00 P.M., Eastern Time, on August 18, 2008.

Machine Learning and Pattern Recognition Logistic Regression

LAN Airline Flight Schedule

Language Modeling. Chapter Introduction

Airline Schedule Development

Comparison of machine learning methods for intelligent tutoring systems

FACT SHEET UNMANNED AIRCRAFT SYSTEMS (UAS)

FRA. Comfort. Services. More Freedom, more Time, more Holidays

4 EXISTING AND FORECAST NOISE EXPOSURE MAPS

MKE Air Service Update mitchellairport.com

Prescriptive Analytics. A business guide

Data Review and Analysis Program (DRAP) Flight Data Visualization Program for Enhancement of FOQA

China Airlines Flight Schedule

Discussion Paper 01: Aviation Demand Forecasting

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Understanding the altimeter

Semi-Supervised Support Vector Machines and Application to Spam Filtering

CONNECTING PROGRAM FOR EUROPE QUESTIONS AND ANSWERS

ST A TEMENT OF ALICE M. RIVLIN DIRECTOR CONGRESSIONAL BUDGET OFFICE

A Short Tour of the Predictive Modeling Process

1 Review of Least Squares Solutions to Overdetermined Systems

How to Win at the Track

Module 5: Multiple Regression Analysis

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

MSCA Introduction to Statistical Concepts

Principal components analysis

years in the community

Matrix Airfare Search

Tagging with Hidden Markov Models

How can we discover stocks that will

Nowcasting of significant convection by application of cloud tracking algorithm to satellite and radar images

Providing Flight Training at:

Math 4310 Handout - Quotient Vector Spaces

alternative collection

Please be sure to save a copy of this activity to your computer!

The Artificial Prediction Market

Lecture 3: Finding integer solutions to systems of linear equations

SEA LAND AIR LEARNING TO FLY AT

Machine Learning in Stock Price Trend Forecasting

MIT ICAT. Airline Revenue Management: Flight Leg and Network Optimization Transportation Systems Analysis: Demand & Economics

Aviation Infrastructure Costs

Low Level Windshear Alert System (LLWAS) An integral part of the U.S. FAA Wind-shear safety program

Creating a NL Texas Hold em Bot

What s Wrong with the Airline Industry? Diagnosis and Possible Cures. Statement of

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Exemption No A Regulatory Docket No. FAA

Efficality of Aircraft Flight Patterns in the East

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Transcription:

CS229 Term Project Report Learning to Predict Flight Delay Lei Sun sunlei@stanford.edu Abstract Flight Delay a very common rk when people take airplane. It a frustrating experience not only to the people who have scheduled events after it, but also to the people who have connecting flights, and even to the people who meet arrival at the airport terminals. But with reliable prediction, people can have a head up of how likely their flights are going to be delayed such that they can mitigate the rk days before the flight. In th project, three different machine learning models are developed to predict the likelihood of a flight delay. The result from these three models will be compared. Introduction A flight delay a delay in which an airline flight takes off and/or lands later than its scheduled time. The Federal Aviation Admintration (FAA) considers a flight to be delayed when it 15 minutes later than its scheduled time [1]. Unfortunately, air traffic control a large and complicated system, which depends on many factors. Especially, flight safety a factor that always has the highest priority, and should never be compromed. In year 2011, 23% of the 4,608,956 flight operations in USA are delayed [2]. In another word, one flight out of five delayed. Flight delays are a frustrating inconvenience to passengers. A delayed flight might be costly to passengers by making them late to their personal scheduled events. A passenger who delayed on a multi-plane trip could ms a connecting flight. According to rk management theory, a rk quantified by its probability and its consequence. Flight delay has both high probability and high consequence, thus should be definitely consideredd as high rk. In th project, I am going to develop some machine learning models to predict the flight delay by analyzing some htorical data. Since all the flight informationn and weather forecast known before a flight, then it possible that some prediction can be made to see how likely a flight going to be delayed. Then people can make some migration plan in advance. Data Collection and Overview According to the stattics from The Research and Innovative Technology Admintration Page 1

Initially, we select 11 features that are related to the 5 top reasons lted by RITA: Arrival Departure Day of week * Carrier Airport Vibility Wind Speed Condition (Snow, Rain ) Airport Vibility Wind Speed Condition (Snow, Rain ) Dtance of flight * (RITA), the 5 top reasons that cause flight delay are: Air Carrier reason, Extreme Weather, National Aviation System (NAS) reason, Late- arriving aircraft and Security [3] The figure above shows the percentage of the reason which caused the flight delay. At the first glance, surpringly, weather causes only 4 percent of flight delays. But there another category of weather within the NAS category. Th type of weather slows the operations of the system but does not prevent flying. During 2011, 75.5% of NAS delays were due to weather [3]. RITA also provides detailed on-time performance data within some time range for research purpose. It contains most of the flight information. More importantly, the data set large enough for both training and testing. Unfortunately, the weather data not provided in the data set. But due to its direct and strong relationship with flight delay, I decided to collect all the weather data for each flight record. After trying a couple of approaches, I finally chose to use the API on wunderground.com, although for some small airports, only some less accurate htorical data available, but they should be sufficient for us to build the model. In the following four figures, I showed the percentage of delayed flight versus some selected feature data to vualize the patterns in it, and also we can roughly tell that if it a strong indicator of a flight delay. Due to the high dimension of feature data, I plotted them individually in each figure. According to RITA s study, air carrier one of the main reasons for delay. Although we have 15 carriers in our dataa set. For space reason, only the top 5 carriers with most delay and the top 3 with least delay are shown in the figure below. The worst airlines for flight delay are Frontier, Atlantic, JetBlue, Alaska, and United, while the best airlines are Hawaiian, Southwest and AirTran. Th result very similar to the stattics that I can find from other sources [4]. Page 2

these two weather features, the relationship almost perfectly linear. From our intuition, airport should be another factor that affects the flight delay a lot, because larger airport tends to have heavier traffic thus more likely to cause delay. Our data verifies th assumption that busier airport have higher delay rate, but on the other hand, the worst airports in our data set are not one of the busiest airports in the US, such as New York Airport (JFK) or Los Angeles Airport (LAX). In addition, even common sense can tell us that bad and extreme weather contributes greatly to flight delay. From the htorical weather database, I picked three features that have strong tie with delay: vibility (mile), wind speed (mph) and weather condition (such as rain, snow, cloudy, or hail). The following two figures show the relationship between flight delay versus vibility, and relationship between the delay percentages versus wind speed. For On the other hand, however, two features surpringly show no indication of flight delay. Day of week (Monday, Tuesday ) initially selected due to the reason that the air traffic should be heavier during weekend, which not, however, supported by the data. Also the flight dtance another feature that does not show any pattern and finally removed from the feature set. Model 1: Naïve Bayes The first model I built was Naïve Bayes, since it the most straightforward model and easy to be implemented. In order to build the model, I first dcretized some continuous valued input, such as vibility and wind speed. After some experiments, the Page 3

buckets that achieve best result are shown below, which the same as what used as variables in the figures in the last section: Vibility (Mile) 0-2 2-5 5-7 >7 1 2 3 4 Wind (MPH) 0-10 10-20 20-30 >30 1 2 3 4 There are only 2 classes in output, delayed or not delayed. So for training data, delayed minutes will be dcretized using the official definition: 15 minutes or more considered as delayed. After tuning the model, the best result from 5- fold cross validation 90.37% %. For study purpose I also did some experiment on multi- class model, because it also interesting to see how the model performs when we want it to predict the delay time more precely, such as delayed 30 minutes or 60 minutes instead of just two categories: delayed or not. As shown in the figure below, with more number of output buckets, the performance becomes lower. Model 2: Logtic Regression As dcussed in class, Naïve Bayes model makes a very strong assumption that x s are conditionally independent given y, which also known as Naïve Bayes assumption. In our data, however, we are not confident about it, especially on weather data. For example, foggy days always have lower vibility than sunny days, so they are obviously not independent features. Moreover, since we have enough data, we don t have to make any other strong assumptionss like p(x y) follows dtribution, therefore, logtic a Gaussian regression should be more robust. As a second exerce, I developed a logtic regression model with stochastic gradient ascent rule to see if th model can provide us better prediction over Naïve Bayes. It turned out that the accuracy from model 2 91.88%. It indeed better than that from Naïve Bayes model, although the improvement not as large as I expected. We can also safely draw the conclusion that not all the features in our feature set are perfectly, mutually independent. Th also leads me to look for a better model to describe the data. Model 3: Nonlinear with Kernel Trick One of the limitation of logtic regression model that its decion boundary still linear. We can add some extra features and use a high order polynomial, but there are two difficulties. First, it takes too much time to generate new features due to the high volume of the training data set. Second, we are not sure about what high order polynomial the best fit. Kernel trick a very powerful tool in machine learning domain, and more important, it can solve the two difficulties when using high order polynomial. So in the third model, I d like to build a non-linear regression model with kernel trick. Page 4

From Problem Set 2, we derived an approach to kernelize the perceptron algorithm. Similarly, we will consider a stochastic gradient descent-like implementation of the classic linear algorithm where each update to the parameters θ made using only one training example. It also worth mentioning that the first two models are classification models, while th model a regression model which has continuous valued output. So when we get the continuous-valued late minutes, we can tell the flight delayed (greater than 15) or not. At the same time we can easily evaluate how bad the delay. After all, delay 20 minutes or delay 3 hours totally different experience. From the classic linear regression model, the update rule : Since 0 Then The prediction step, Despite the high dimension of θ, we don t have to update θ explicitly; we just need to keep tracking the value of, each training data passed into the training once. The way I evaluate the performance of th model to calculate the average difference between the actual delay and the prediction on the test data. After tuning, the performance very satfying. The average error 5.37 minutes which means the prediction only about 5 minutes off the actual value on average. Furthermore, when I convert the continuous valued prediction result into two categories as in the first two models, the accuracy (93.2%) the highest among the three models, which also an expected result. Conclusion Flight delay directly related to airport location and weather condition thus can be predicted by machine learning algorithm. Naïve Bayes model performs OK considering its simplicity. Logtic Regression model slightly better because it removes the independence assumption. Kernel based non-linear regression model has the best performance but time complexity higher especially when there are many training data. Reference 1. http://en.wikipedia.org/wiki/flight_delay 2. http://www.transtats.bts.gov/homedrillchart.asp 3. http:/bts.gov/help/aviation/html/understanding 4. http://www.travelandleure.com/articles/best-an d-worst-airlines-for-delays In th model, I used a degree-5 polynomial kernel. The above iteration keeps running until Page 5