Abstract: Can Artificial Intelligence Write a Good Clinical Note? Using NLP Methods to Determine Clinic Note Quality (Society for Social Work and Research 29th Annual Conference)



Schedule:
Saturday, January 18, 2025
Redwood B, Level 2 (Sheraton Grand Seattle)
Victoria Stanhope, PhD, MSW, Professor; Associate Dean for Faculty Affairs, New York University, New York, NY
Nari Yoo, MA, PhD Candidate, New York University, New York, NY
Elizabeth Matthews, PhD, Assistant Professor, Fordham University, New York, NY
Yuanyuan Hu, MSW, Doctoral Student, New York University, New York, NY
Daniel Baslock, MSW, PhD Candidate, New York University, New York, NY
Samantha Luxmikanthan, Student, New York University, New York, NY
Background. Increasingly, medical and behavioral health settings are leveraging artificial intelligence (AI) to ease the burden of documentation within the electronic health record. Clinicians often report feeling overwhelmed by the need to produce progress notes documenting in-person visits within short timeframes due to billing requirements. To relieve this burden, many agencies are beginning to use AI to generate notes based on audio recordings of clinic visits. This raises questions about the quality of AI-generated notes and about the assumptions underlying machine learning algorithms regarding what constitutes clinically relevant data. Note quality is even more salient now that federal policies mandate that all behavioral health settings make their clinical notes available to clients via patient portals. This study aims to (1) use natural language processing (NLP) methods to develop an algorithm to determine note quality and (2) apply the algorithm to compare note quality between behavioral health notes generated by clinicians and notes generated by AI.

Methods. Domain experts with clinical social work backgrounds developed a rubric for evaluating clinical note quality. The rubric was grounded in the literature on clinical documentation and person-centered care. It consisted of three domains: (1) readability, (2) clinical content, and (3) person-centeredness, each defined for both manual and computational assessment. Clinical notes from one community mental health center (CMHC) and AI-generated notes produced with the OpenAI GPT-4 model were scored manually using the rubric. The manual annotations by domain experts then served to train and validate the quality evaluation algorithm, which was refined for optimal performance. Using Python, we compared word count, the number of large words (more than 6 characters), and the frequency of n-grams using independent-samples t-tests.
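The lexical comparison described above can be sketched with standard Python tooling. The snippet below is purely illustrative and is not the study's actual code: it computes word count, the count of large words (more than 6 characters), and bigram frequencies for a note, then a Welch's independent-samples t statistic for word counts. The toy notes are hypothetical examples, not data from the study.

```python
# Hypothetical sketch of the Methods' lexical comparison (stdlib only).
from collections import Counter
from math import sqrt
from statistics import mean, variance


def features(note: str) -> dict:
    """Word count, large-word count (>6 chars), and bigram frequencies."""
    words = note.lower().split()
    return {
        "word_count": len(words),
        "large_words": sum(1 for w in words if len(w) > 6),
        "bigrams": Counter(zip(words, words[1:])),
    }


def t_statistic(a: list, b: list) -> float:
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Toy notes, purely illustrative:
cmhc = [
    "client discussed family support and coping",
    "client reports improved mood",
]
ai = [
    "the client verbalized feelings of hopelessness and significant distress",
    "the client demonstrated symptoms consistent with generalized anxiety",
]

wc_cmhc = [features(n)["word_count"] for n in cmhc]
wc_ai = [features(n)["word_count"] for n in ai]
print(t_statistic(wc_ai, wc_cmhc))  # positive if AI notes are longer on average
```

In practice a library such as SciPy would also supply the p-value; the manual formula is shown here only to keep the sketch self-contained.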

Results. When comparing the CMHC and AI-generated notes, several key differences were found. Differences in note length were marginally significant, with AI-generated notes containing 8 more words (p = .08). AI-generated notes also contained 12 more large words (p < .001) than CMHC notes, and they tended to use more technical medical language and negative emotion words, such as “hopelessness” and “overwhelmed.” Further, the AI-generated notes included descriptions of the client’s family less frequently than CMHC notes did (p < .001). Finally, AI-generated notes were less likely than CMHC notes to include individualized information and strengths-based approaches.

Conclusion and Implications. This study developed and piloted a rubric and an NLP algorithm to determine clinical note quality based on readability, clinical content, and person-centeredness. In applying this algorithm to both CMHC and AI-generated notes, preliminary findings suggest that it was able to detect differences in note quality across the three domains. The findings confirm that, when undergirded by human expertise, AI can be harnessed not only to generate clinical notes but also to assess their quality. Future research should focus on larger samples and extend these methods to other settings. As AI plays an increasing role in the delivery of healthcare, ensuring quality control that is rooted in human clinical judgment is essential.