Predicting Domestic Flight Delays during the Holiday Season
Air travel has in recent years become a commonplace mode of transportation rather than a luxury. The growing number of air passengers has generally led to increased competition among airlines, in the form of discounted flights, better routes and value-added services. Moreover, rail companies are making huge efforts to improve interstate trains, adding to the competition for consumers.
One service factor that many consumers consider when choosing an airline is flight delays. While delays due to natural factors and airport security can’t be prevented, airlines can reduce delays due to technical issues. In the past year, almost 1 out of 3 flights was delayed!
In this project, we aimed to predict flight delays in December, when many consumers are travelling interstate to visit family for the Christmas holidays.
Building a machine-learning model
Data from the previous year were obtained from the US Bureau of Transportation Statistics to train the model. To select the most suitable model, initial tests were run and a few metrics were compared. The models used were Logistic Regression, Naive Bayes Bernoulli, Decision Tree and Random Forest. KNN and SVC were omitted as my system could not support them.
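As a rough illustration of this comparison step, the sketch below fits the four model types with Scikit-learn and scores each on a held-out test set. The data here are synthetic (generated with make_classification) and stand in for the real flight records; the dataset size, feature count, split ratio and model settings are all assumptions made for the example, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data (assumption): label 1 = delayed,
# label 0 = on time, with roughly one delay for every two on-time flights.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.67, 0.33], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The four model families named in the text, with illustrative settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes_bernoulli": BernoulliNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    delay_probability = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "roc_auc": roc_auc_score(y_test, delay_probability),
        "recall": recall_score(y_test, model.predict(X_test)),
    }

for name, metrics in results.items():
    print(f"{name}: AUC={metrics['roc_auc']:.3f}, recall={metrics['recall']:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair; the numbers printed for synthetic data say nothing about the real flight dataset.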
First up, we look at the ROC curve and the area under it (AUC), which show the trade-off between the false alarm rate (delay predicted but no delay occurred) and the hit rate (delay predicted and delay occurred). The AUC summarises how well the model ranks delayed flights above on-time flights across all decision thresholds. The models' scores differed by no more than 10%.
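One way to build intuition for the AUC: it equals the probability that a randomly chosen delayed flight receives a higher predicted score than a randomly chosen on-time flight. The pure-Python sketch below computes it that way on a handful of made-up labels and scores (the function name and the numbers are hypothetical, chosen only for illustration).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (delayed, on-time) pairs ranked correctly.

    Ties between a delayed and an on-time flight count as half a win.
    """
    delayed = [s for s, label in zip(scores, y_true) if label == 1]
    on_time = [s for s, label in zip(scores, y_true) if label == 0]
    if not delayed or not on_time:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for d in delayed:
        for o in on_time:
            if d > o:
                wins += 1.0
            elif d == o:
                wins += 0.5
    return wins / (len(delayed) * len(on_time))


# Made-up labels (1 = delayed) and predicted delay scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
auc = pairwise_auc(y_true, scores)
print(round(auc, 3))  # 7 of the 9 pairs are ranked correctly -> 0.778
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. For large datasets a library routine such as Scikit-learn's roc_auc_score is the practical choice, since the pairwise loop grows quadratically.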
The second metric evaluated was the recall score. Also known as sensitivity, it is the proportion of flights that were actually delayed that the model correctly predicted as delayed.
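As a small worked example, recall can be computed by counting true positives (delayed flights predicted as delayed) and false negatives (delayed flights predicted as on time). The labels below are invented for illustration.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), with label 1 meaning 'delayed'."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if true_positives + false_negatives == 0:
        raise ValueError("recall is undefined when there are no delayed flights")
    return true_positives / (true_positives + false_negatives)


# Four flights were really delayed; the model caught three of them.
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
score = recall(actual, predicted)
print(score)  # 3 / (3 + 1) = 0.75
```

Note that the false alarm in the example (an on-time flight predicted as delayed) does not affect recall at all; it only shows up in metrics such as precision or the false alarm rate.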
Although the decision tree and random forest models had significantly higher scores, a closer look revealed that both were overfitting, with a difference of ~35% between the training and test scores.
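A simple way to check for this kind of overfitting is to score the same fitted model on both the training and the test split and look at the gap. The sketch below does this for a decision tree on deliberately noisy synthetic data; the helper name recall_gap, the noise level and the depth limit are assumptions for the example, not settings from the project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); flip_y mislabels a share of the
# samples so that memorising the training set does not generalise.
X, y = make_classification(
    n_samples=2000, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def recall_gap(model):
    """Fit the model and return (train recall, test recall, difference)."""
    model.fit(X_train, y_train)
    train_recall = recall_score(y_train, model.predict(X_train))
    test_recall = recall_score(y_test, model.predict(X_test))
    return train_recall, test_recall, train_recall - test_recall


deep_tree = recall_gap(DecisionTreeClassifier(random_state=0))
shallow_tree = recall_gap(DecisionTreeClassifier(max_depth=3, random_state=0))
print("unrestricted tree (train, test, gap):", deep_tree)
print("max_depth=3 tree (train, test, gap):", shallow_tree)
```

An unrestricted tree can usually fit its training data almost perfectly, so a large gap between the two scores is a sign of memorisation rather than learning. Limiting the depth is one common way to narrow the gap.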
Improving the model
1. A quick assessment revealed that there were about half as many delayed flights as on-time flights. Undersampling was done to balance the dataset.
2. Factors contributing to the model were evaluated. Feature engineering was done to eliminate factors with zero weight and the model was fine-tuned using RandomizedSearchCV from Scikit-learn to improve prediction.
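The two steps above can be sketched as follows: randomly drop on-time flights until both classes are the same size, then run a randomized hyperparameter search. The data are synthetic, and the search space (the regularisation strength C of a logistic regression), the number of iterations and the use of recall as the scoring metric are assumptions made for the example rather than the project's actual configuration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): about one delay (1) per two on-time (0).
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.67, 0.33], random_state=0
)

# Step 1: undersample the majority class (on-time flights).
rng = np.random.default_rng(0)
delayed_idx = np.flatnonzero(y == 1)
on_time_idx = np.flatnonzero(y == 0)
kept_on_time = rng.choice(on_time_idx, size=len(delayed_idx), replace=False)
balanced_idx = rng.permutation(np.concatenate([delayed_idx, kept_on_time]))
X_bal, y_bal = X[balanced_idx], y[balanced_idx]

# Step 2: randomized search over an illustrative hyperparameter range.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="recall",
    cv=5,
    random_state=0,
)
search.fit(X_bal, y_bal)
print(search.best_params_, round(search.best_score_, 3))
```

For brevity the sketch balances the whole dataset; when a separate test set is held out, it is generally safer to undersample only the training portion so that the test set keeps the real-world class ratio.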
Improved models
The area under the curve (AUC) score measures how well the model separates flight delays from false alarms; it remained at ~65% after fine-tuning. The recall score indicates the proportion of actual flight delays that were predicted correctly. A higher score means fewer flights are predicted to be on time when a delay actually occurred. Recall improved significantly for the Logistic and Naive Bayes Bernoulli models.
Conclusions and Insights
Logistic Regression might be the best model to predict flight delays. Despite having high scores, both the decision tree and random forest models were overfitting, with a large discrepancy between their scores on the training and test datasets.
Passengers travelling earlier in the week (Mon-Wed) tend to experience flight delays more often than those travelling on other days. In addition, passengers departing from busy airports such as JFK and SFO have a higher chance of their flight being delayed.