Abstract: An Application of AI in the Systematic Review Process: A Comparative Analysis of ChatGPT's Abstract Screening Efficacy (Society for Social Work and Research 29th Annual Conference)


752P An Application of AI in the Systematic Review Process: A Comparative Analysis of ChatGPT's Abstract Screening Efficacy

Schedule:
Sunday, January 19, 2025
Grand Ballroom C, Level 2 (Sheraton Grand Seattle)
Saige M. Addison, MSW, PhD Student, University of Iowa, Iowa City, IA
Christabel Rogalin, PhD, Associate Professor of Sociology, Purdue University Northwest, Westville, IN
Emily Campion, PhD, Assistant Professor, Management and Entrepreneurship, University of Iowa, Iowa City, IA
Miriam Landsman, PhD, Associate Professor, University of Iowa, Iowa City, IA
Christopher Veeh, PhD, Assistant Professor, University of Iowa, Iowa City, IA
Background and Purpose: Systematic reviews (SRs) are a dominant approach for collecting and synthesizing information. SRs typically adhere to the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines (Page et al., 2021), which outline the process for collecting, selecting, and synthesizing information for these reviews. Social work researchers and journals regularly publish these types of reviews, allowing other scholars to quickly discern the state of the literature. However, the laborious process of conducting systematic reviews is vulnerable to errors and protocol infidelity among screeners. Recent advancements in artificial intelligence (AI) platforms, like ChatGPT-4, suggest that AI could be a partner in this process to increase the efficiency of abstract screening (Chai et al., 2021; van Dijk et al., 2023) and that ChatGPT-4 can contribute to SR development (Najafali et al., 2023; Wang et al., 2023). Although Najafali and colleagues (2023) suggest that ChatGPT-4 is not prepared to conduct full PRISMA SRs, it is well suited to assist with highly structured parts of the review process like abstract screening (Cai et al., 2023). In this study, we assess ChatGPT-4's ability to accurately screen abstracts by comparing its decisions against two SRs conducted using the PRISMA method without AI assistance in abstract screening.
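
To make the screening setup concrete, the sketch below shows one way a single abstract could be submitted to ChatGPT-4 for an include/exclude decision via the OpenAI Python SDK. The criteria text, prompt wording, and function name are illustrative assumptions, not the protocol used in this study.

# A minimal sketch, assuming the OpenAI Python SDK and an API key in the
# environment; the criteria and prompt are placeholders, not the study protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = "Include empirical studies of the review topic; exclude commentaries."

def screen_abstract(abstract: str) -> str:
    """Ask ChatGPT-4 for a one-word include/exclude decision on one abstract."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("You are screening abstracts for a systematic review. "
                         f"Criteria: {CRITERIA} "
                         "Answer with exactly one word: INCLUDE or EXCLUDE.")},
            {"role": "user", "content": abstract},
        ],
        temperature=0,  # deterministic output supports screening fidelity
    )
    return response.choices[0].message.content.strip()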

Methods: We used the results of two human-coded SRs: one on the impacts of substance use exposure on children (n=5,858; Landsman et al., 2024) and another on masculinity threats (n=1,017; Rogalin & Addison, 2024). We compare the accuracy and Area Under the Receiver Operating Characteristic Curve (AUC) (Huang & Ling, 2005) of four algorithms (random forests (Breiman, 2001), logistic regression, multinomial naive Bayes (Kibriya et al., 2004), and support vector machines (Cortes & Vapnik, 1995)) using ChatGPT-4. These models are used to test the accuracy and reliability of ChatGPT-4 in replicating the human screening decisions in our samples. We also evaluate the non-matching results: instances where the human coders included a source that ChatGPT-4 did not, and instances where ChatGPT-4 included irrelevant sources. [In a full submission, we would include the complete set of interactions with ChatGPT-4 for future replication purposes.]
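
As an illustration of the comparison just described, the sketch below fits the four named classifiers on text features and reports accuracy and AUC for each. The TF-IDF features, synthetic corpus, and train/test split are assumptions for demonstration; the abstract does not specify the authors' actual pipeline.

# A hedged sketch, assuming scikit-learn and TF-IDF text features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Placeholder corpus: in practice, load the abstracts and the human coders'
# include (1) / exclude (0) screening decisions.
abstracts = (["child substance exposure outcomes study"] * 30
             + ["unrelated engineering materials paper"] * 30)
labels = [1] * 30 + [0] * 30

X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "multinomial naive Bayes": MultinomialNB(),
    "support vector machine": SVC(probability=True, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of "include"
    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, proba)
    print(f"{name}: accuracy={acc:.2%}, AUC={auc:.2%}")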

Results: Within the substance use exposure review, ChatGPT-4's logistic regression model obtained the highest accuracy (90.70%) and AUC (92.97%). Similarly, in the masculinity threat review, logistic regression obtained the highest accuracy (81.37%) and AUC (91.20%). The model in the substance use exposure review was more likely to correctly identify records for the review, and it was unlikely to incorrectly include records that should have been excluded. However, the results differed slightly for the masculinity threat review, and we explore reasons for this (e.g., class imbalance). Overall, findings suggest the models can adequately replicate human coding.
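
The class-imbalance caveat can be illustrated with a toy calculation: when only a small share of records should be included, a degenerate screener that excludes everything still achieves high accuracy, which is why AUC is reported alongside it. The numbers below are illustrative, not from the study.

# Toy illustration of why accuracy alone misleads under class imbalance.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1] * 5 + [0] * 95   # only 5% of records are truly relevant
y_pred = [0] * 100            # "exclude everything" baseline
scores = [0.5] * 100          # uninformative ranking scores

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks excellent
print(roc_auc_score(y_true, scores))   # 0.5  -- no better than chance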

Conclusions and Implications: The results of this comparative analysis indicate that AI can be used during the abstract screening phase of SRs. This has implications for efficiency and information aggregation, opening the door for social work researchers to sift through large amounts of information more quickly and for review teams to identify relevant research for their reviews more efficiently.