METHODS: In an ongoing study of prominent, publicly accessible large language model (LLM) chatbots (e.g., GPT-3.5 and GPT-4, Gemini, Claude, and Llama 2), we sought to specify the necessary and sufficient dimensions of quality and usability. A priori, we identified the following: validity, reliability, bias, safety (especially for LGBTQ+ considerations), and usability. As the study progressed, we added the need to account for rapid updates to chatbots’ models/algorithms as well as their training corpora. An Expert and Accountability Panel consisting of ~10 scientists, service providers, and community members with relevant expertise and lived experience informed and validated the specific assessment procedures as well as the overall methodology.
RESULTS: Validity assessment addresses how consistent chatbot output is with the current/latest science. Effective assessment combines human-driven and machine-driven benchmarking (e.g., BLEU, ROUGE; a minimal scoring sketch follows this section). Validity assessment must also characterize the conditions/triggers/domains that lead to “hallucinations” and their frequency.

Reliability assesses the consistency of chatbot output through deliberate selection of benchmark prompts grounded in human-centered design principles and prompt engineering (also sketched below).

Bias assessment focuses on whether subgroups (e.g., racial/ethnic minoritized groups) are over- or under-represented or unaccounted for in chatbot output, as well as on the perpetuation of stereotypes and marginalization/exclusion.

Safety is assessed by verifying that output is sufficiently conservative when the evidence base is lacking or has equivocal findings/conclusions; systems and user interfaces also need to be assessed for sufficient anonymity and privacy, as well as for unintentional threats to safety (e.g., targeted ads appearing as a result of chatbot queries).

Usability assesses how cost, modality (e.g., app-based, browser-based, plug-in/embedded component), and accessibility facilitate or inhibit use, given a user’s characteristics (e.g., developmental age, language ability), contexts (e.g., urgency, tone, style), and roles (e.g., affected individual, professional).

Finally, in addition to the development and deployment of new AI chatbots, existing chatbots undergo rapid refinement/evolution of their underlying AI approaches/models and training corpora, necessitating clear documentation of essential information, such as version number(s) and characteristics of the training corpus (e.g., range/boundaries of training data with respect to source, content, and dates), at the time(s) of assessment.
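To make the machine-driven benchmarking step concrete, below is a minimal sketch (not the study’s actual pipeline) of scoring a chatbot response against an expert-written reference answer with BLEU and ROUGE. It assumes the open-source nltk and rouge-score Python packages; the reference and response strings are invented placeholders.

```python
# Minimal sketch of machine-driven benchmarking: score a chatbot response
# against an expert-written reference answer using BLEU and ROUGE.
# Assumes `pip install nltk rouge-score`; the example strings are placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = ("PrEP is a highly effective medication that reduces the risk "
             "of acquiring HIV when taken as prescribed.")
response = ("Pre-exposure prophylaxis (PrEP) greatly lowers the chance of "
            "getting HIV if taken consistently.")

# BLEU: n-gram precision of the candidate against the reference(s), 0-1.
# Smoothing avoids zero scores on short texts with sparse n-gram matches.
bleu = sentence_bleu(
    [reference.split()],          # list of tokenized reference texts
    response.split(),             # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, response)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

Because BLEU and ROUGE measure only surface n-gram overlap, not factual correctness, such scores complement rather than replace the human-driven benchmarking described above.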
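Reliability and the documentation of model/corpus provenance could be operationalized along the lines of the sketch below. Here query_chatbot is a hypothetical stand-in for whatever API the assessed chatbot exposes, and the record fields follow the items listed above (version number, training-data boundaries, assessment date); none of this is the study’s actual instrument.

```python
# Sketch of a reliability check: submit the same benchmark prompt repeatedly
# and report the mean pairwise ROUGE-L F1 across responses as a rough
# consistency score. Also shown: a simple record of assessment provenance.
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from statistics import mean

from rouge_score import rouge_scorer

@dataclass
class AssessmentRecord:
    """Provenance to document at the time of assessment."""
    chatbot_name: str
    model_version: str         # version/build identifier, as reported
    training_data_cutoff: str  # boundary of the training corpus, if disclosed
    assessment_date: date

def reliability_score(prompt: str, query_chatbot, n_trials: int = 5) -> float:
    """Mean pairwise ROUGE-L F1 over repeated responses to one prompt."""
    responses = [query_chatbot(prompt) for _ in range(n_trials)]
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    return mean(
        scorer.score(a, b)["rougeL"].fmeasure
        for a, b in combinations(responses, 2)
    )

# Hypothetical usage: pair each score with the provenance in effect that day.
record = AssessmentRecord("ExampleBot", "2024-01-beta", "Sep 2023", date.today())
```

A low mean similarity would flag prompts whose answers drift across runs, and re-running the check whenever the recorded version changes addresses the rapid-update concern noted above.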
CONCLUSIONS & IMPLICATIONS: This study provides concrete examples of the steps that collectively constitute a systematic, structured method to help individuals decide whether and how to use existing and future AI chatbots responsibly. Although originally designed for community members, service providers, and researchers interested in the health and well-being of LGBTQ+ individuals, the method may generalize to other minoritized/marginalized and vulnerable populations.