Background and Purpose: Mining social media data to study the human condition has created new and unique challenges. When analyzing social media data from marginalized communities, algorithms lack the ability to accurately interpret offline context, which may lead to dangerous assumptions about, and implications for, those communities. To combat this challenge, we hired formerly gang-involved youth as domain experts for contextualizing social media data in order to create inclusive, community-informed algorithms. Using data from the Gang Intervention and Computer Science Project, a comprehensive analysis of Twitter data from gang-involved youth in Chicago, we describe the process of involving formerly gang-involved youth in developing a new prototype natural language processing (NLP) system that detects aggression and loss in Twitter data. We offer a contextually driven, interdisciplinary approach between social work and data science that integrates domain insights into the training of social work annotators and the production of algorithms for positive social impact.
Methods: We hired two young men (African American and Latino), 18 years and older, who live in Chicago neighborhoods with high rates of violence to work as domain experts. They initially provided interpretations of 185 randomly sampled tweets from our corpus and later gave more focused insights based on our annotators' questions and challenges. We developed a multistep process that integrates these insights to inform MSW student annotators' interpretations, which provide the training data for our NLP system. Our process includes: 1) identifying, onboarding, and integrating domain experts, 2) initial domain expert interpretations of Twitter data, 3) training and assessing social work student annotator quality, and 4) iterative domain expert involvement and reconciliation of student annotator disagreement.
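As an illustration of steps 3 and 4, the following minimal Python sketch shows one common way to assess agreement between two annotators (Cohen's kappa) and to flag the tweets on which they disagree so that they can be referred to domain experts. The abstract does not state which agreement measure or tooling was used, so the metric, the function names, the tweet identifiers, and the "other" label are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: this metric is used here only as an example; the source
    does not say how annotator quality was measured.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def flag_disagreements(tweet_ids, labels_a, labels_b):
    """Return ids of tweets the two annotators labeled differently."""
    return [t for t, a, b in zip(tweet_ids, labels_a, labels_b) if a != b]


# Hypothetical example data (not from the project corpus).
tweet_ids = ["t1", "t2", "t3", "t4"]
annotator_1 = ["aggression", "loss", "other", "other"]
annotator_2 = ["aggression", "other", "other", "other"]

kappa = cohen_kappa(annotator_1, annotator_2)
to_review = flag_disagreements(tweet_ids, annotator_1, annotator_2)
```

In a workflow like the one described, the flagged tweets in `to_review` would be the ones brought back to domain experts for a focused interpretation before a final label is assigned.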
Findings: We found that incorporating broad domain expert insights in the training of social work annotators and the iterative inclusion of focused domain expert insights throughout the annotation process led to rigorously trained annotators and more robust understandings of social media posts by gang-involved and affiliated youth. Additionally, the involvement of domain experts unearthed contextual insights specific to Chicago neighborhoods with high rates of violence. We provide seven key insights as examples: language, emojis, song lyrics, behavioral/temporal cues, people, neighborhood references, and gang/crew knowledge. We further expand on these seven areas with three case examples of our domain experts and social work student annotators determining context and meaning from tweets.
Conclusions and Implications: Domain experts must be involved in the interpretation of unstructured data, solution creation, and many other aspects of the research process. This goes beyond harvesting and capturing domain expertise. The involvement of domain experts in various areas of social and data science research, including mechanisms for accountability and ethically sound research practices, is critical to creating algorithms trained to support and protect marginalized youth and communities. If the gap persists between the people who create algorithms and the people who experience their direct impacts, we will likely continue to reinforce the very social inequities we hope to ameliorate.