The purpose of the current study was to develop and validate a set of models that use unstructured text narratives from investigations of child maltreatment to detect cases in which a drug or alcohol problem is observed within the family system. Currently, the most reliable data about substance-related problems (SRPs) exist in the written summaries, but these narratives have not been analyzable at a population level. Thus, state policymakers and system administrators are unable to make data-driven decisions informed by the socio-demographic, geographic, and temporal trends of SRPs in the current system of care. Accurate text mining models could serve as an efficient and cost-effective solution to this problem.
Methods: All written summaries of substantiated child maltreatment investigations from 2015 to 2017 (N = 75,843) were obtained from a state child welfare agency. The study team randomly selected 3,000 investigation summaries, then manually reviewed and labeled each document according to whether an SRP was observed. Three-quarters of the manually labeled documents were used to develop a set of text classification models using common machine learning algorithms. These models were then validated using the remaining labeled documents. Lastly, three metrics (accuracy, sensitivity, and specificity) were calculated to allow for cross-model comparisons of performance. We also calculated a set of kappa scores for each model to evaluate the degree to which model classifications were exchangeable with those of our expert human coders.
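The split and the three comparison metrics described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual code: the function names `split_labeled` and `binary_metrics`, the fixed random seed, and the toy label vectors are all assumptions introduced here, and labels are assumed to be coded 1 (SRP observed) or 0 (no SRP observed).

```python
import random


def split_labeled(docs, train_fraction=0.75, seed=0):
    """Randomly split labeled documents into development and validation sets.

    Hypothetical helper: the seed and the exact split procedure are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def binary_metrics(human, model):
    """Compare model labels against human labels (1 = SRP, 0 = no SRP).

    Assumes both classes appear in `human`; otherwise a rate is undefined.
    """
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # true positives
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # true negatives
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # false positives
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of human-coded SRP cases detected
        "specificity": tn / (tn + fp),  # share of non-SRP cases correctly rejected
    }


# Toy data for illustration only; these are not the study's labels.
human_labels = [1, 1, 1, 0, 0, 0, 0, 0]
model_labels = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(human_labels, model_labels)

# Splitting 3,000 placeholder document IDs three-quarters to one-quarter.
development, validation = split_labeled(range(3000))
```

On the toy data, the model detects 2 of the 3 human-coded positives and correctly rejects 4 of the 5 negatives. Reporting sensitivity and specificity alongside accuracy matters when one class is less common, because accuracy alone can hide poor detection of the rarer class.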
Results: The random forest was the most accurate text classification model, correctly classifying 93.2% of labeled documents when compared against the manual classifications of expert human reviewers (specificity = 96.7%; sensitivity = 84.9%). Inter-rater reliability estimates (kappa) between the computer model and human reviewers ranged from .81 to .88, suggesting that model classifications are exchangeable with those of human coders.
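For readers unfamiliar with kappa, the sketch below shows how a chance-corrected agreement score between a model and a human coder can be computed. It assumes Cohen's kappa for two raters; the abstract does not state which kappa variant was used. The function name `cohens_kappa` and the toy labels are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is
    the agreement expected by chance from each rater's label frequencies.
    Undefined (division by zero) if both raters use one identical label throughout.
    """
    n = len(rater_a)
    p_observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    p_expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data for illustration only; these are not the study's labels.
human_labels = [1, 1, 1, 0, 0, 0, 0, 0]
model_labels = [1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(human_labels, model_labels)
```

Unlike raw percent agreement, kappa discounts the agreement two raters would reach by chance, which is why it is commonly used to judge whether one coder can stand in for another.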
Conclusions and Implications: These results provide compelling evidence that text mining procedures can be a cost-effective and efficient solution for extracting meaningful insights from unstructured text data. Although the current study focused on caseworker identification of substance-related problems, the same methodology could readily be used to classify cases based on other dimensions of importance to child welfare administrators and policymakers. Findings from the current study lay the groundwork for exploring the full range of ways in which text mining models may be deployed to enhance child welfare practice and outcomes.