A Comparative Study of a Series of Supervised Learning Models for Motorcycle Crash Injury Severity Prediction

Keywords: Motorcycle Crash; Injury Severity; Machine Learning Algorithms; Supervised Learning Models; Random Forest; SHAP Analysis

Motorcycle crashes pose a major public health challenge in Thailand, where motorcyclists account for most traffic fatalities. This study evaluates and compares the predictive performance of four supervised learning models, Decision Tree (DT), K-Nearest Neighbor (KNN), Naïve Bayes (NB), and Random Forest (RF), in predicting motorcycle crash injury severity using data from the Highway Accident Information Management System (2020–2022). After preprocessing, 36 explanatory variables covering roadway conditions, environmental conditions, accident causes, crash characteristics, and vehicle involvement were analyzed. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) and cost-sensitive learning were applied, and models were validated using train–test splits with cross-validation. The Random Forest model achieved the best performance, with an AUC of 0.726, a balanced accuracy of 0.649, and a Matthews Correlation Coefficient (MCC) of 0.308, outperforming the other three algorithms. SHapley Additive exPlanations (SHAP) were used to interpret the RF model, identifying nighttime crashes, large truck involvement, and roadway features (e.g., depressed medians and two-lane roads) as key predictors of severe outcomes. These insights suggest countermeasures such as improving nighttime safety, dedicating truck lanes, and designing safer medians. The novelty of this study lies in integrating model comparison, imbalance-aware metrics, and SHAP interpretability to provide actionable, context-specific policy recommendations for motorcycle safety in Thailand.
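
To illustrate why imbalance-aware metrics such as balanced accuracy and MCC are reported instead of plain accuracy, the following is a minimal Python sketch that computes both from confusion-matrix counts. It is not the study's code: it assumes a binary severity label (severe vs. non-severe), which the abstract does not state, and the function names and toy counts are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary label (assumption: 1 = severe)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity; 0.5 is chance level."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: 90 non-severe crashes, 10 severe crashes.
y_true = [0] * 90 + [1] * 10

# A degenerate classifier that always predicts "non-severe".
always_negative = [0] * 100
counts = confusion_counts(y_true, always_negative)
plain_accuracy = (counts[0] + counts[2]) / len(y_true)

print("accuracy:", plain_accuracy)
print("balanced accuracy:", balanced_accuracy(*counts))
print("MCC:", mcc(*counts))
```

On this toy data the always-negative classifier looks strong on plain accuracy while balanced accuracy and MCC expose it as no better than chance, which is the motivation for reporting the latter two on imbalanced severity classes.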