Lung cancer is the second most common cancer and a leading cause of cancer-related death in the United States (US). Detecting lung cancer at an early stage can significantly improve survival. Annual screening with low-dose computed tomography (LDCT) is an effective strategy that reduces lung cancer mortality by 20%. However, the prevalence of LDCT screening for lung cancer in the US has remained below 4% over time. The purpose of this study was to develop machine learning models to analyze the interplay among factors associated with LDCT lung cancer screening and to understand the mechanisms behind the underutilization of lung cancer screening.
Method: The study was based on data from the 2018 Behavioral Risk Factor Surveillance System (BRFSS). A total of 2266 participants who met the American Cancer Society screening recommendation (ages 55 to 79 years, with a 20 pack-year smoking history, and either a current smoker or having quit within the past 15 years) were initially included. After handling missing values, 710 samples and 86 variables were included in the decision tree model and the random forest model. The data were randomly split into training (569/710, 80%) and testing (141/710, 20%) sets. The training set was used to develop the decision tree classifier, and the testing set, which was held out from model development, was used to evaluate the performance of the developed model. Gini impurity, or the Gini index, is the probability of incorrectly classifying a randomly chosen sample when it is labeled according to the class distribution in a node; it was used to select the optimal split at each node of the model.
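The split criterion described above can be illustrated with a minimal sketch. This is not the study's implementation: the function names (`gini_impurity`, `weighted_split_impurity`, `best_threshold`) and the toy values are hypothetical, and the sketch assumes a single numeric feature with a binary screened/not-screened label.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Impurity of a candidate split, weighting each child by its share of samples."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Scan midpoints between sorted feature values; return (threshold, impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        score = weighted_split_impurity(left, right)
        if score < best[1]:
            best = (thr, score)
    return best


# Hypothetical toy data (not from BRFSS): a feature value and a 0/1 screening label.
toy_values = [1, 2, 3, 4]
toy_labels = [0, 0, 1, 1]
threshold, impurity = best_threshold(toy_values, toy_labels)
```

A decision tree repeats this search over every candidate feature at each node and keeps the split with the lowest weighted impurity, so a lower value means the child nodes are more homogeneous with respect to screening status.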
Results: To ensure model robustness, the random forest model was run 100 times independently, and its performance metrics were averaged across the runs: accuracy was 67.5%, F1 score was 66.55%, sensitivity was 63.82%, and specificity was 71.29%. The decision tree model was evaluated on the randomly sampled 20% testing set, with 67.78% accuracy, 65.76% F1 score, 62.52% sensitivity, and 73.57% specificity. In the decision tree model, 9 interactive pathways were identified among the following factors: average drinks per month, BMI, diabetes, age at first smoking, years of smoking, years since quitting smoking, gender, last sigmoidoscopy or colonoscopy, last dental visit, general health, insurance, education, and last Pap test.
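For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, specificity, and F1 score follow from the four cells of a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): share of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the study's confusion matrix).
example = classification_metrics(tp=40, fp=20, tn=60, fn=20)
```

Averaging each of these values over repeated random train/test splits, as done for the random forest model, gives a more stable estimate than a single split.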
Conclusions: The findings of this study may be used to develop a protocol for identifying patients who are at high risk for lung cancer but are not sufficiently screened. Lung cancer screening utilization results from the interplay of multiple factors. Lung cancer screening programs in clinical settings should not only focus on patients’ smoking behaviors but also consider other socioeconomic factors. To improve early detection and survival rates of lung cancer, future cancer prevention efforts should focus on increasing screening rates among women and patients with obesity.