METHODS: In an ongoing study of prominent, publicly accessible large language model (LLM) chatbots (e.g., GPT-3.5 and GPT-4, Gemini, Claude, and Llama 2), we sought to specify the necessary and sufficient dimensions of quality and usability. A priori, we identified the following: validity, reliability, bias, safety (especially for LGBTQ+ considerations), and usability. As the study progressed, we added the need to account for rapid updates to chatbots’ models/algorithms as well as their training corpora. An Expert and Accountability Panel consisting of ~10 scientists, service providers, and community members with relevant expertise and lived experience informed and validated the specific assessment procedures as well as the overall methodology.
RESULTS: Validity assessment addresses how consistent chatbot output is with the current/latest science. Effective assessment combines human-driven and machine-driven benchmarking (e.g., BLEU, ROUGE; a minimal scoring sketch follows this section). Validity assessment must also characterize the conditions/triggers/domains that lead to “hallucinations” and their frequency.

Reliability assesses the consistency of chatbot output through deliberate selection of benchmark prompts grounded in human-centered design principles and prompt engineering (also sketched below).

Bias assessment focuses on whether subgroups (e.g., racial/ethnic minoritized groups) are over- or under-represented or unaccounted for in chatbot output, as well as on the perpetuation of stereotypes and marginalization/exclusion.

Safety is assessed by verifying that output is sufficiently conservative when the evidence base is lacking or has equivocal findings/conclusions; systems and user interfaces also need to be assessed for sufficient anonymity and privacy, as well as for unintentional threats to safety (e.g., targeted ads appearing as a result of chatbot queries).

Usability assesses how cost, modality (e.g., app-based, browser-based, plug-in/embedded component), and accessibility facilitate or inhibit use, given a user’s characteristics (e.g., developmental age, language ability), contexts (e.g., urgency, tone, style), and roles (e.g., affected individual, professional).

Finally, in addition to the development and deployment of new AI chatbots, existing chatbots undergo rapid refinement/evolution of their underlying AI approaches/models and training corpora, necessitating clear documentation of essential information, such as version number(s) and characteristics of the training corpus (e.g., range/boundaries of training data with respect to source, content, and dates), at the time(s) of assessment.
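To make the machine-driven benchmarking step concrete, below is a minimal sketch (not the study’s actual pipeline) of scoring a chatbot response against an expert-written reference answer with BLEU and ROUGE. It assumes the open-source nltk and rouge-score Python packages; the reference and response strings are invented placeholders.

```python
# Minimal sketch of machine-driven benchmarking: score a chatbot response
# against an expert-written reference answer using BLEU and ROUGE.
# Assumes `pip install nltk rouge-score`; the example strings are placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = ("PrEP is a highly effective medication that reduces the risk "
             "of acquiring HIV when taken as prescribed.")
response = ("Pre-exposure prophylaxis (PrEP) greatly lowers the chance of "
            "getting HIV if taken consistently.")

# BLEU: n-gram precision of the candidate against the reference(s), 0-1.
# Smoothing avoids zero scores on short texts with sparse n-gram matches.
bleu = sentence_bleu(
    [reference.split()],          # list of tokenized reference texts
    response.split(),             # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, response)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

Because BLEU and ROUGE measure only surface n-gram overlap, not factual correctness, such scores complement rather than replace the human-driven benchmarking described above.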
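Reliability and the documentation of model/corpus provenance could be operationalized along the lines of the sketch below. Here query_chatbot is a hypothetical stand-in for whatever API the assessed chatbot exposes, and the record fields follow the items listed above (version number, training-data boundaries, assessment date); none of this is the study’s actual instrument.

```python
# Sketch of a reliability check: submit the same benchmark prompt repeatedly
# and report the mean pairwise ROUGE-L F1 across responses as a rough
# consistency score. Also shown: a simple record of assessment provenance.
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from statistics import mean

from rouge_score import rouge_scorer

@dataclass
class AssessmentRecord:
    """Provenance to document at the time of assessment."""
    chatbot_name: str
    model_version: str         # version/build identifier, as reported
    training_data_cutoff: str  # boundary of the training corpus, if disclosed
    assessment_date: date

def reliability_score(prompt: str, query_chatbot, n_trials: int = 5) -> float:
    """Mean pairwise ROUGE-L F1 over repeated responses to one prompt."""
    responses = [query_chatbot(prompt) for _ in range(n_trials)]
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    return mean(
        scorer.score(a, b)["rougeL"].fmeasure
        for a, b in combinations(responses, 2)
    )

# Hypothetical usage: pair each score with the provenance in effect that day.
record = AssessmentRecord("ExampleBot", "2024-01-beta", "Sep 2023", date.today())
```

A low mean similarity would flag prompts whose answers drift across runs, and re-running the check whenever the recorded version changes addresses the rapid-update concern noted above.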
CONCLUSIONS & IMPLICATIONS: This study provides concrete examples of the steps that collectively constitute a systematic, structured method to help individuals decide whether and how to use existing and future AI chatbots responsibly. Although originally designed for community members, service providers, and researchers interested in the health and well-being of LGBTQ+ individuals, the method may generalize to other minoritized/marginalized and vulnerable populations.