Abstract: Predicting Risk of Child Maltreatment at Birth: A Comparison of the Classification Accuracy of Machine Learning Techniques (Society for Social Work and Research 22nd Annual Conference - Achieving Equal Opportunity, Equity, and Justice)

Schedule:
Saturday, January 13, 2018: 9:06 AM
Marquis BR Salon 9 (ML 2) (Marriott Marquis Washington DC)
Lindsey Palmer, MSW, PhD Student, University of Southern California, Los Angeles, CA
Michael Tsang, MS, Doctoral Student, University of Southern California, Los Angeles, CA
John Prindle, PhD, Research Faculty, University of Southern California, Los Angeles, CA
Tanya Gupta, MS, Research Programmer & Database Analyst, University of Southern California, Los Angeles, CA
Emily Putnam-Hornstein, PhD, Associate Professor, University of Southern California, Los Angeles, CA
Background/Purpose: Child abuse and neglect remains a major public health issue, with national data suggesting that 1 in 7 children will be substantiated as victims of maltreatment by age 18 (Wildeman et al., 2017) and that more than 1 in 3 children will experience an investigated allegation of maltreatment by child protective services (CPS) (Kim et al., 2017). Given the well-documented adverse effects of maltreatment and of the childhood conditions that lead to CPS involvement, there is tremendous interest in moving prevention efforts upstream. A number of jurisdictions (e.g., New Zealand, California, Pennsylvania, Tennessee, Ohio) have begun to explore the possibility of using vital birth records to identify a child's risk of adversities before the child is reported for abuse or neglect or experiences a fatality. While initial prediction models have achieved adequate performance, the role machine learning algorithms can play in developing such tools is still unknown. Complex machine learning models are widely regarded as outperforming logistic and linear regression, but improved performance may come at the cost of interpretable predictions. The purpose of this analysis was to test the predictive performance of state-of-the-art machine learning techniques for classifying risk. We place high importance on the interpretability of these models because their predictions must be communicated to a variety of practitioners and stakeholders.

Methods: Two sources of data from California were linked to create the analytic dataset: vital birth records and CPS records. Birth records from 2001, 2006, 2007, and 2012 were probabilistically linked to CPS records, and children were then classified according to whether or not they were reported for alleged maltreatment before 5 years of age. Four models, decision trees (DTs), generalized additive models (GAMs), random forests (RFs), and deep neural networks (DNNs), were applied to determine their ability to outperform a logistic regression (LR) baseline. DTs and GAMs were selected for their interpretability: DTs can be rendered as flowcharts, and GAMs yield main-effect plots and plots of statistical interactions.
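To make the model comparison concrete, here is a minimal sketch of how such a cross-validated benchmark could be set up in Python with scikit-learn. This is an illustration under assumed tooling, not the authors' implementation: the features and labels are synthetic stand-ins, the hyperparameters are placeholders, and GAMs are omitted because scikit-learn provides no GAM estimator (a separate library such as pygam would be needed).

```python
# Hedged sketch: cross-validated comparison of an LR baseline against DT, RF,
# and DNN classifiers. All data and hyperparameters below are placeholders.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))    # stand-in for birth-record features
y = rng.integers(0, 2, size=1000)  # stand-in for "CPS report before age 5"

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(n_estimators=200),
    "DNN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}

# Report mean accuracy and AUC across folds, mirroring the metrics in Results.
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "roc_auc"])
    print(f"{name}: accuracy={scores['test_accuracy'].mean():.4f}, "
          f"AUC={scores['test_roc_auc'].mean():.4f}")
```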

Results: Using cross-validation, we obtained the following average accuracies and AUCs (CIs in brackets): LR, 71.60% [70.83, 72.65] accuracy and 0.7791 [0.77, 0.78] AUC; DTs, 70.48% [67.63, 72.49] and 0.7624 [0.76, 0.77]; GAMs, 73.39% [67.78, 76.05] and 0.7781 [0.77, 0.79]; RFs, 72.73% [72.05, 73.87] and 0.7890 [0.78, 0.79]; and DNNs, 76.29% [73.57, 79.91] and 0.7867 [0.78, 0.79]. DTs and GAMs both yielded sensible interpretations, placing emphasis on mother's education, health insurance type, and nonimmigrant status as important risk predictors. These predictors emerged in DTs from the splits producing the largest reductions in Gini impurity, and in GAMs from main-effect and interaction plots.
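The DT interpretation step described above can be illustrated as follows; this sketch uses assumed scikit-learn tooling and hypothetical feature names, not the study's variables or data. A feature's importance in a fitted tree comes from the total reduction in Gini impurity attributable to its splits, and the tree itself can be exported as the kind of flowchart the Methods section mentions.

```python
# Hedged sketch: Gini-based importances and a text flowchart from a fitted DT.
# Feature names and data are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["mother_education", "insurance_type", "immigrant_status"]
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# feature_importances_ is each feature's normalized total Gini-impurity decrease.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")

# A plain-text rendering of the decision-tree flowchart.
print(export_text(tree, feature_names=feature_names))
```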

Conclusions/Implications: DNNs appear to have the potential to significantly outperform all other models. Although the confidence interval for DNN accuracy is wide, the DNN intervals for both accuracy and AUC sit generally above those for LR. Among the interpretable models, DTs underperformed LR, but GAMs showed comparable performance; GAMs could therefore potentially be used in place of LR for the sake of interpretability.
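The abstract does not state how its confidence intervals were computed; one plausible approach, shown below purely as an assumed illustration, is a percentile bootstrap over per-fold scores.

```python
# Hedged sketch: percentile-bootstrap CI over per-fold scores (assumed method;
# the abstract does not specify its CI procedure).
import numpy as np

def bootstrap_ci(fold_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of fold scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(fold_scores, dtype=float)
    means = [rng.choice(scores, size=scores.size, replace=True).mean()
             for _ in range(n_boot)]
    return tuple(np.quantile(means, [alpha / 2, 1 - alpha / 2]))

# Hypothetical per-fold DNN AUCs, for illustration only.
print(bootstrap_ci([0.781, 0.792, 0.785, 0.790, 0.786]))
```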