Suicide attempts among adolescents represent a growing public health crisis in the United States, with prevalence rising from 6.3% in 2009 to 10% in 2023. Despite decades of research, traditional statistical approaches have demonstrated limited success in predicting suicide attempts, often due to their reliance on predefined assumptions and neglect of non-linear interactions among diverse risk factors.
Objective:
This study aimed to evaluate and compare the performance of four supervised machine learning (ML) models—LASSO Logistic Regression, Classification and Regression Trees (CART), Random Forest, and eXtreme Gradient Boosting (XGBoost)—in predicting adolescent suicide attempts using data from the 2017 Youth Risk Behavior Surveillance System (YRBS). A secondary objective was to identify the most influential predictors across models to enhance understanding of suicide risk.
Methods:
Data were drawn from a nationally representative sample of 12,551 high school students who completed the 2017 YRBS. The primary outcome was self-reported suicide attempts in the past 12 months, dichotomized as present or absent. Predictor variables encompassed a wide range of domains, including substance use, emotional and cognitive factors, experiences of violence, school safety, nutritional habits, physical activity, media use, and demographics. Missing data were imputed using K-nearest neighbors with Gower’s distance. The dataset was stratified and split into training (75%) and testing (25%) sets, with 5-fold cross-validation used for model tuning. Sensitivity was prioritized as the primary performance metric due to the low prevalence of suicide attempts (7.4%), with additional metrics including specificity, AUC, accuracy, and Cohen’s kappa.
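To make the imputation step concrete, the following is a minimal sketch of K-nearest-neighbor imputation under Gower's distance for mixed numeric and categorical survey items. It is not the study's implementation, which the abstract does not specify: the function names `gower_distance` and `knn_impute`, the toy rows, and the choice of median/mode as the aggregation rule are all assumptions made for illustration.

```python
import statistics
from collections import Counter

def gower_distance(a, b, ranges):
    """Gower's distance between two records with mixed feature types.

    ranges[i] is the observed range (max - min) of numeric feature i,
    or None when feature i is categorical. Features missing (None) in
    either record are skipped, and the result is averaged over the
    features that both records share.
    """
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue
        if r is None:
            total += 0.0 if x == y else 1.0        # categorical: simple mismatch
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric: range-scaled difference
        used += 1
    return total / used if used else float("inf")

def knn_impute(rows, target_idx, feature_idx, ranges, k=3):
    """Impute one missing value from the k nearest donors (hypothetical helper).

    Donors are rows with the feature observed; numeric features use the
    median of the donors' values, categorical features use the mode.
    """
    target = rows[target_idx]
    donors = [r for i, r in enumerate(rows)
              if i != target_idx and r[feature_idx] is not None]
    donors.sort(key=lambda r: gower_distance(target, r, ranges))
    nearest = [r[feature_idx] for r in donors[:k]]
    if ranges[feature_idx] is None:
        return Counter(nearest).most_common(1)[0][0]
    return statistics.median(nearest)

# Toy records (invented): age, a yes/no item, hours of sleep.
toy_rows = [
    [16.0, "yes", 7.0],
    [17.0, "yes", 6.0],
    [15.0, "no", 8.0],
    [16.0, "yes", None],   # sleep is missing and will be imputed
]
toy_ranges = [2.0, None, 2.0]
imputed_sleep = knn_impute(toy_rows, 3, 2, toy_ranges, k=1)
```

Scaling numeric differences by the feature range keeps every per-feature contribution in [0, 1], which is what lets numeric and categorical items be averaged into a single distance.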
Results:
LASSO Logistic Regression exhibited the highest sensitivity (88.9%) and AUC (94.6%) on the validation data, outperforming the tree-based models. On the testing dataset, LASSO maintained high performance (sensitivity = 89.2%, specificity = 92.3%, AUC = 95.1%). Suicidal ideation and suicide planning were the most influential predictors across all models. Other consistent risk factors included substance use (e.g., injected drugs, synthetic marijuana), feelings of insecurity at school, uncertainty about sexual orientation, and experiences of sexual abuse. Notably, LASSO identified water consumption, race/ethnicity, and sleep patterns as additional significant predictors. The emphasis on sensitivity provided a more meaningful evaluation of model utility in identifying adolescents at risk.
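The reason sensitivity is more informative than accuracy at a prevalence near 7.4% can be shown with a small worked example. The sketch below computes sensitivity, specificity, accuracy, and Cohen's kappa from a 2x2 confusion matrix. The function name `binary_metrics` and all counts are hypothetical and chosen only for illustration; they are not the study's results.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return standard metrics from confusion-matrix counts.

    Assumes at least one actual positive and one actual negative,
    so the sensitivity and specificity denominators are non-zero.
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)      # share of true attempts that were flagged
    specificity = tn / (tn + fp)      # share of non-attempts correctly cleared
    accuracy = (tp + tn) / n          # observed agreement
    # Agreement expected by chance, from the marginal totals.
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((fp + tn) / n) * ((fn + tn) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
    }

# Hypothetical sample of 1,000 students with 74 attempts (7.4% prevalence).
# A model that predicts "no attempt" for everyone looks accurate but finds no one.
always_negative = binary_metrics(tp=0, fn=74, fp=0, tn=926)

# A hypothetical model that flags most true cases at the cost of some false alarms.
screening_model = binary_metrics(tp=60, fn=14, fp=50, tn=876)
```

In this toy setting the always-negative model reaches 92.6% accuracy with zero sensitivity and a kappa of zero, while the second model has a similar accuracy (93.6%) but identifies 60 of the 74 cases.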
Conclusion:
Machine learning models—particularly LASSO—demonstrate strong potential to improve adolescent suicide attempt prediction by incorporating a broader range of behavioral and environmental risk factors than traditional methods. Emphasizing sensitivity over AUC allows for better identification of true positive cases in imbalanced datasets. The study highlights the importance of integrating diverse domains of adolescent experiences into predictive models and supports the use of ML as a valuable tool in developing data-driven, targeted prevention strategies. However, limitations include the use of secondary data lacking detailed mental health history and the pre-pandemic timeframe of the dataset, underscoring the need for model retraining with current data.