More than 75% of adults in the United States are past-year gamblers. Most gamblers do not experience negative consequences, but problem gamblers may experience a range of harms including emotional distress, financial ruin, family conflict, health issues, occupational problems, and criminal behavior.
Lottery gambling is especially popular: 80% of past-year gamblers play the lottery. Widespread availability and low cost mean that lotteries may affect vulnerable people (e.g., people with serious mental health conditions) and contribute to the overall gambling harms experienced by problem gamblers.
Although problem gambling has significant economic and social costs, rates of early detection and treatment remain low. Predictive modeling holds promise for passively identifying problem gamblers. It has been examined in the context of online gambling but has never been evaluated in the context of lotteries. In the current study, we used machine learning to predict problem gambling among lottery players.
Methods:
Data: Retrospective administrative data were merged with survey data gathered from members of a lottery loyalty program in a Midwestern state (n = 5,903). Loyalty program members register to participate and upload their tickets to a lottery website. Administrative data from the participants were supplemented with a survey that included questions on demographics and other gambling participation, as well as a problem gambling instrument.
Measures: The Problem Gambling Severity Index (PGSI) served as the dependent variable; participants scoring above five were classified as problem gamblers (13.7% of the sample). Features (independent variables) in the model included a wide range of demographic variables (e.g., age, race), data on ticket purchases (e.g., ticket type, amounts), and frequency of other gambling (e.g., casino, sports betting).
Analysis: The sample was divided, with 80% used for model development and 20% held out for testing. The random forest (RF) algorithm was used to predict problem gambling status. RF is an ensemble extension of classification and regression tree (CART) analysis that fits many trees, each on a bootstrap sample of the data and using a random subsample of features. This ensemble approach reduces the risk of overfitting and can detect both linear and nonlinear relationships. Candidate models using between two and 48 features per tree were evaluated with tenfold cross-validation and compared on the F1 score. An R package was used for the machine learning analysis.
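The data-splitting and tuning procedure described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code used in the study. The function names (`holdout_split`, `k_folds`, `tune_features_per_tree`) and the scoring callable `f1_for` are hypothetical; in a real analysis `f1_for` would fit a random forest on the training rows and return the F1 score on the validation rows.

```python
import random
from statistics import mean
from typing import Callable, Iterable, List, Sequence, Tuple


def holdout_split(n: int, test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle row indices 0..n-1 and split them into (development, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices: Sequence[int],
            k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Partition indices into k folds and return (train, validation) pairs."""
    folds = [list(indices[i::k]) for i in range(k)]
    pairs = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, validation))
    return pairs


def tune_features_per_tree(
    dev: Sequence[int],
    candidates: Iterable[int],
    f1_for: Callable[[int, List[int], List[int]], float],
    k: int = 10,
) -> int:
    """Return the candidate with the highest mean cross-validated F1."""
    folds = k_folds(dev, k)

    def cv_score(m: int) -> float:
        return mean(f1_for(m, train, val) for train, val in folds)

    return max(candidates, key=cv_score)


# 80/20 split of the 5,903 participants, then tenfold tuning over 2..48.
dev, test = holdout_split(5903)
# Placeholder scorer for demonstration only (assumption): it does not fit
# a model, it just peaks at m = 10 so the selection logic can be exercised.
best = tune_features_per_tree(dev, range(2, 49),
                              lambda m, train, val: -abs(m - 10))
```

The held-out test indices are never passed to the tuning function, which mirrors the design in which the 20% test set is used only once, after the number of features per tree has been chosen.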
Results:
After tuning, the number of features per tree was set at 46, and the model was fit using 10,000 bootstrapped samples. The five most important predictors were video gaming frequency, age, frequency and total of instant-win tickets, and frequency of casino video poker. This model was then applied to the test data (20%). The performance of the model was fair (balanced accuracy = 0.61, F1 score = 0.35), but sensitivity was low (0.25).
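For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, balanced accuracy, and the F1 score are derived from a confusion matrix. It is written in Python with the standard library only. The function name `classification_metrics` is hypothetical, and the counts passed to it are illustrative values chosen for this example, not the study's actual confusion matrix.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute common binary-classification metrics from confusion counts.

    Assumes every denominator is non-zero, i.e. there is at least one actual
    positive, one actual negative, and one predicted positive.
    """
    sensitivity = tp / (tp + fn)   # share of problem gamblers correctly flagged
    specificity = tn / (tn + fp)   # share of non-problem gamblers correctly cleared
    precision = tp / (tp + fp)     # share of flagged players who are problem gamblers
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


# Illustrative counts only (assumption), not the study's data.
metrics = classification_metrics(tp=40, fp=30, fn=120, tn=990)
```

Because balanced accuracy averages sensitivity and specificity, a model can reach a moderate balanced accuracy while still missing most positive cases, which is why the individual rates are worth reporting alongside it.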
Conclusions and Implications:
Our findings were mixed: the model identified non-problem gamblers well but performed poorly at identifying problem gamblers. We plan to evaluate these models in new samples to improve performance and to consider intervention models that can be linked to a detection algorithm, such as a norms-based approach in which data on typical lottery play, along with individual spending data, are shared with at-risk lottery players.