Predicting Domestic Flight Delays during the Holiday Season

Kelsey Heng
Published in Analytics Vidhya
Aug 21, 2019

In recent years, air travel has become a commonplace mode of transportation rather than a luxury. Growing passenger numbers have generally led to more competition among airlines: discounted flights, better routes, value-added services. Moreover, rail companies are making huge efforts to improve interstate trains, adding to the competition for consumers.

One service factor that many consumers consider when deciding which airline to fly with is flight delays. While delays due to natural factors and airport security can’t be prevented, airlines can reduce delays due to technical issues. In the past year, almost 1 out of 3 flights were delayed!

In this project, we aimed to predict flight delays in December when many consumers are travelling interstate to visit family for the Christmas holidays.

Building a machine-learning model

Data from the previous year were obtained from the US Bureau of Transportation Statistics to train the model. To select the most suitable model, initial tests were run and a few metrics were compared. The models used were Logistic Regression, Bernoulli Naive Bayes, Decision Tree and Random Forest. KNN and SVC were omitted as my system could not support them.
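As a rough illustration of this kind of first-pass comparison, here is a minimal sketch using scikit-learn. The synthetic dataset from `make_classification`, the 75/25 split and every hyperparameter below are assumptions made for the example; they are not the project's actual data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data (assumption): roughly 1/3 positives.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.67, 0.33], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The four model families named in the article, with illustrative settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "bernoulli_nb": BernoulliNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each candidate and record train and test accuracy side by side.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }

for name, s in scores.items():
    print(f"{name}: train={s['train']:.2f} test={s['test']:.2f}")
```

Keeping the train and test scores next to each other makes it easier to spot a model that looks strong only because it has memorised the training set.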

First up, we look at the ROC curve, which plots the false alarm rate (delay predicted but no delay occurred) against the hit rate (delay predicted and delay occurred). The hit rate describes how good the model is at predicting the positive class when the actual outcome is positive, and the area under the curve (AUC) condenses the whole curve into a single score. The AUC scores of the models differed by no more than 10%.

Initial ROC-AUC curve
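To make the two rates concrete, here is a small pure-Python sketch. The labels and predicted scores are made up for illustration, and the AUC is computed with the pairwise-ranking definition (the chance that a randomly chosen delayed flight gets a higher score than a randomly chosen on-time one), which may differ from how the project's plots were produced.

```python
def hit_and_false_alarm_rate(y_true, y_pred):
    """Return (hit rate, false alarm rate) for binary labels (1 = delayed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(y_true, scores):
    """AUC as the share of (delayed, on-time) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: six flights, their true labels and predicted delay scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

hit_rate, false_alarm_rate = hit_and_false_alarm_rate(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(round(hit_rate, 3), round(false_alarm_rate, 3), round(auc, 3))
# prints 0.667 0.333 0.667
```

The hit rate and false alarm rate depend on the chosen threshold, while the AUC does not, which is why it is convenient for comparing models.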

The second metric evaluated was the recall score, otherwise known as sensitivity. It summarises the proportion of flights that were actually delayed that the model correctly predicted as delayed.

Although the decision tree and random forest models had significantly higher scores, a closer look revealed that they were overfitting, with a difference of ~35% between the training and test scores.
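The check described above can be sketched in a few lines: compute recall on the training and test sets separately and look at the gap. The labels and predictions below are invented to show the mechanics and do not reproduce the project's ~35% figure.

```python
def recall(y_true, y_pred):
    """Share of actually delayed flights (label 1) that were predicted as delayed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Invented example: a model that has memorised its training data.
y_train_true = [1, 1, 1, 0, 0, 1]
y_train_pred = [1, 1, 1, 0, 0, 1]
y_test_true = [1, 1, 0, 1, 0, 1]
y_test_pred = [1, 0, 0, 0, 1, 1]

train_recall = recall(y_train_true, y_train_pred)
test_recall = recall(y_test_true, y_test_pred)
gap = train_recall - test_recall

print(train_recall, test_recall, gap)
# prints 1.0 0.5 0.5
```

A large positive gap like this suggests the model has learned the training set rather than patterns that carry over to unseen flights.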

Improving the model

1. A quick assessment revealed that delayed flights were only about half as numerous as on-time flights. The majority class was undersampled to balance the dataset.
0 denotes no delay, 1 denotes flight being delayed
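A minimal sketch of random undersampling, using only the standard library. The `undersample` helper, the `delayed` field and the toy rows are hypothetical names and data for this example; the project may well have used a library routine instead.

```python
import random
from collections import Counter


def undersample(rows, label_of, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy rows (assumption): 100 delayed flights and 200 on-time flights.
rows = [{"flight_id": i, "delayed": 1 if i % 3 == 0 else 0} for i in range(300)]

balanced = undersample(rows, label_of=lambda r: r["delayed"])
counts = Counter(r["delayed"] for r in balanced)
print(counts[0], counts[1])
# prints 100 100
```

Undersampling should be applied to the training data only, so that the test set still reflects the real ratio of delayed to on-time flights.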

2. The factors contributing to the model were evaluated. Features with zero weight were eliminated, and the model was fine-tuned using RandomizedSearchCV from scikit-learn to improve prediction.

Top 20 features
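The two steps above might look roughly like the sketch below. It assumes an L1-penalised logistic regression as the way to find zero-weight features, a synthetic dataset, and an arbitrary search space scored on recall; none of these choices are confirmed by the project, so treat them as placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in (assumption): 12 features, only 4 of them informative.
X, y = make_classification(
    n_samples=400,
    n_features=12,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# Step 1: an L1 penalty pushes the weights of unhelpful features to exactly zero.
selector = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector.fit(X, y)
keep = np.flatnonzero(selector.coef_[0] != 0)  # indices of non-zero weights
X_kept = X[:, keep]

# Step 2: randomized search over a small, illustrative hyperparameter space.
search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions={"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]},
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=0,
)
search.fit(X_kept, y)

print("kept features:", keep.tolist())
print("best params:", search.best_params_)
print("best CV recall:", round(search.best_score_, 3))
```

RandomizedSearchCV samples `n_iter` combinations instead of trying all of them, which keeps tuning affordable on a modest machine.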

Improved models

The area under the curve (AUC) score summarises how well the model separates real flight delays from false alarms; it remained at ~65% after fine-tuning. The recall score indicates the proportion of flight delays that are predicted correctly. A higher score means that fewer delayed flights are wrongly predicted as being on time, and it improved significantly for the Logistic Regression and Bernoulli Naive Bayes models.

Conclusions and Insights

Logistic Regression might be the best model for predicting flight delays. Despite having high scores, both the decision tree and random forest models were overfitting, with a large discrepancy between their scores on the training and test datasets.

Passengers travelling earlier in the week (Mon-Wed) tend to experience flight delays more often than those travelling on other days. In addition, passengers departing from busy airports such as JFK and SFO have a higher chance of their flight being delayed.
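Insights like these can be sanity-checked by grouping flights and comparing delay rates. The sketch below uses a hypothetical `delay_rate_by` helper and a handful of invented records (including the `BOI` rows) purely to show the calculation; the numbers are not results from the project.

```python
from collections import defaultdict


def delay_rate_by(records, key):
    """Share of delayed flights for each distinct value of `key`."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        delayed[r[key]] += r["delayed"]
    return {k: delayed[k] / totals[k] for k in totals}


# Invented toy records, not project data.
records = [
    {"day": "Mon", "origin": "JFK", "delayed": 1},
    {"day": "Mon", "origin": "SFO", "delayed": 1},
    {"day": "Mon", "origin": "BOI", "delayed": 0},
    {"day": "Fri", "origin": "JFK", "delayed": 1},
    {"day": "Fri", "origin": "BOI", "delayed": 0},
    {"day": "Fri", "origin": "BOI", "delayed": 0},
]

by_day = delay_rate_by(records, "day")
by_origin = delay_rate_by(records, "origin")
print({k: round(v, 2) for k, v in by_day.items()})
# prints {'Mon': 0.67, 'Fri': 0.33}
```

The same helper works for any categorical column, so departure airport, carrier or hour of day can be compared in the same way.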

More details and the code for this project can be found on my GitHub. I can be contacted via LinkedIn if you would like to connect.

Kelsey Heng
Analytics Vidhya

Neuroscience researcher turned analytics consultant. Huge love for data storytelling, turning numbers into fun facts!