Social work research increasingly uses social media-based recruitment and online surveys to access hard-to-reach and stigmatized study populations (Archer-Kuhn et al., 2021). However, there is growing scholarly concern about survey bots, automated programs designed by bad actors to respond to surveys and farm incentives, which compromise data integrity (Storozuk, 2020). A slew of recent studies in social work and allied social sciences has documented survey bot incursions that have wasted limited research funds and, in some cases, led to the termination of the study (Bybee et al., 2022; Coulter-Thompson et al., 2023; Griffin et al., 2021; Shaw et al., 2022; Sterzing et al., 2018). Sharing effective approaches to dealing with survey bots increases awareness of and capacity to manage this risk and may support the development of new data cleaning strategies to tackle this issue. This study therefore offers a collaborative, systematic coding approach to identifying and excluding bot data from an international online survey sample, using rigorous processes to enhance data trustworthiness.
Methods
A total of 3,681 responses to an online survey of wellbeing and video gaming among sexual and gender diverse participants aged 14-29 (M = 21.95, SD = 4.06) from Canada, the USA, Mexico, the UK, or Australia were subjected to a dichotomous coding process by four coders. Following calibration on 97 random responses, the remaining responses were coded twice independently by counterbalanced coder pairs using nine coding categories: (a) spamming; (b) no qualitative answers; (c) poor qualitative answers; (d) incongruence between IP address and reported country of residence; (e) incongruence between reported age and year of birth; (f) duplicate qualitative answers; (g) duplicate IP addresses; (h) speeding; and (i) suspicious email address. Following the first round, coder pairs met to discuss and resolve disagreements for each category, and remaining disagreements were resolved by the first author. Codes were summed across all categories to produce a fraud probability index ranging from 0 to 9 to guide exclusion decisions. Retained data were compared against excluded data to examine respective demographic characteristics, variance, and scale scores.
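As an illustration of the scoring logic described above, the sketch below shows how a 0 to 9 fraud probability index might be computed from nine reconciled dichotomous flags. The column names and the example cut-off are assumptions for illustration only; the study guided exclusion decisions with the index rather than specifying a fixed threshold.

```python
# Minimal sketch, assuming each response already has one reconciled 0/1 flag
# per coding category. Column names and the threshold are illustrative.
import pandas as pd

FLAG_COLS = [
    "spamming", "no_qual_answers", "poor_qual_answers",
    "ip_country_mismatch", "age_yob_mismatch", "duplicate_qual_answers",
    "duplicate_ip", "speeding", "suspicious_email",
]

def add_fraud_index(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the nine dichotomous codes into a 0-9 fraud probability index."""
    out = df.copy()
    out["fraud_index"] = out[FLAG_COLS].sum(axis=1)
    return out

def flag_for_exclusion(df: pd.DataFrame, threshold: int = 3) -> pd.Series:
    """Mark responses at or above an (illustrative) cut-off for review or exclusion."""
    return df["fraud_index"] >= threshold
```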
Results
A total of 1,670 responses (45.37%) were excluded as suspected bots, with an average fraud index score of 4.5. The coding process produced high levels of inter-rater reliability between each coder pair (95.07% to 97.71%) and across all coding categories (86.82% to 100%). Excluded data overrepresented the USA as the country of residence and intersectional demographics that are typically underrepresented in survey research (e.g., Black transgender women), and were more normally distributed, with less use of the most extreme Likert scale response options.
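The inter-rater reliability figures above are percent agreement between coder pairs. A minimal sketch of how such agreement might be computed per category is shown below; the helper and its column naming convention (e.g., "speeding_coder1") are assumptions, not the study's code.

```python
# Illustrative percent-agreement check between two coders' dichotomous codes.
import pandas as pd

def percent_agreement(df: pd.DataFrame, category: str) -> float:
    """Share of responses on which two coders assigned the same 0/1 code."""
    a, b = df[f"{category}_coder1"], df[f"{category}_coder2"]
    return (a == b).mean() * 100
```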
Conclusion and Implications
The study provides crucial and timely diagnostic information for researchers using online surveys and social media-based recruitment, along with a systematic, collaborative process for distinguishing genuine from fraudulent responses and enhancing data trustworthiness. Our findings emphasize a growing imperative to adapt online survey research practices to be responsive to technological change and survey fraud. It is essential to anticipate survey bot incursions and to plan mitigation and detection strategies that are implemented throughout a study's design, recruitment, and data cleaning and analysis stages.