Methods: We used the results of two human-coded SRs: one on the impacts of substance use exposure on children (n=5,858; Landsman et al., 2024) and another on masculinity threats (n=1,017; Rogalin & Addison, 2024). Using ChatGPT-4, we compare the accuracy and Area Under the Receiver Operating Characteristic Curve (AUC) (Huang & Ling, 2005) of four algorithms: random forests (Breiman, 2001), logistic regression, multinomial naive Bayes (Kibriya et al., 2004), and support vector machines (Cortes & Vapnik, 1995). These comparisons test the accuracy and reliability of ChatGPT-4 for replicating our samples. We also evaluate the non-matched results: cases where the human coders included a source but ChatGPT-4 did not, and cases where ChatGPT-4 included irrelevant sources. [In a full submission, we would include the complete set of interactions with ChatGPT-4 for future replication purposes.]
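For illustration, the sketch below shows one way such an accuracy and AUC comparison across the four algorithms could be implemented in Python with scikit-learn. The file name, column names, and TF-IDF feature representation are assumptions for demonstration only; this is not the authors' ChatGPT-4-generated code.

```python
# Minimal sketch of the four-classifier comparison, assuming a CSV with an
# "abstract" text column and a binary "include" label (hypothetical names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("screened_records.csv")  # hypothetical file of screened records

X_train, X_test, y_train, y_test = train_test_split(
    df["abstract"], df["include"], test_size=0.2,
    random_state=42, stratify=df["include"])

# Represent abstracts as TF-IDF features (one plausible choice, not specified in the abstract)
vec = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

models = {
    "random forest": RandomForestClassifier(random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    "multinomial naive Bayes": MultinomialNB(),
    "support vector machine": SVC(probability=True, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    probs = model.predict_proba(X_test_vec)[:, 1]  # probability of "include"
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.4f}, "
          f"AUC={roc_auc_score(y_test, probs):.4f}")
```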
Results: Within the substance use exposure review, ChatGPT-4’s logistic regression model obtained the highest accuracy (90.70%) and AUC (92.97%). In the masculinity threat review, logistic regression likewise obtained the highest accuracy (81.37%) and AUC (91.20%). The model in the substance use exposure review was more likely to correctly identify records for inclusion and unlikely to incorrectly include records that should be excluded. Results differed slightly for the masculinity threat review, and we explore reasons for this (e.g., class imbalance). Overall, the findings suggest the models can adequately replicate human coding.
Conclusions and Implications: The results of this comparative analysis indicate that AI can be used during the abstract screening phase of SRs, with implications for efficiency and information aggregation. It opens the door for social work researchers to sift through large amounts of information more quickly and for review teams to identify relevant research for their reviews more efficiently.