Abstract: The Perils and Promises of Unsupervised Machine Learning in Social Work Research: Comparing Human Coders to Computer-Led Topic Modeling (Society for Social Work and Research 22nd Annual Conference - Achieving Equal Opportunity, Equity, and Justice)

The Perils and Promises of Unsupervised Machine Learning in Social Work Research: Comparing Human Coders to Computer-Led Topic Modeling

Schedule:
Friday, January 12, 2018: 2:07 PM
Marquis BR Salon 13 (ML 2) (Marriott Marquis Washington DC)
* noted as presenting author
Maria Rodriguez, PhD, MSW, Assistant Professor, City University of New York, Hunter College, New York, NY
Heather Storer, PhD, Assistant Professor, Tulane University, New Orleans, LA
Joseph Mienko, PhD, MSW, Senior Research Scientist, University of Washington, Seattle, WA
Background: #WhyIStayed and #WhyILeft are hashtags born in the wake of the much-discussed video of former Baltimore Ravens football player Ray Rice physically assaulting his then fiancée Janay Rice in an Atlantic City elevator. The hashtags describe Twitter users’ individual-level experiences with domestic violence, including the multitude of factors that influenced their decisions to stay in or leave their abusive relationships.

Topic modeling uses an algorithm to partition data into meaningful groups without a preconceived grouping mechanism. The method is most often used to classify text data: an algorithm classifies documents based on the topics signaled by their content. Because topic modeling is not explicitly guided by the researcher, the depth of the resulting classification is unclear.

The current study uses topic modeling to examine a sample of #WhyIStayed and #WhyILeft tweets (N = 3,068). It then compares the topical outputs to the codes developed by human coders in a traditional qualitative analysis of the same sample. 

Methods: The study team used the qualitative analysis software Dedoose to manually code 3,068 tweets containing the #WhyIStayed and #WhyILeft hashtags. This random sample represents 5% of the total tweets collected during the first 30 days of the campaign. The tweets were coded using a traditional thematic analysis approach: all tweets were inductively analyzed through multiple rounds of primary and secondary coding. All codes were organized hierarchically into thematic categories comprising similar constructs. A codebook was used to promote consistency between data analysts.

The same tweet sample was subject to a topic modeling approach using the latent Dirichlet allocation (LDA) method. LDA treats each tweet as a mix of topics, and each topic as a mix of words. In this way, LDA assumes that tweets can have ‘overlapping’ content, meaning one tweet can be classified under two or more topics, much as traditional qualitative coding allows. The LDA analysis was completed using the ‘topicmodels’ package in R.
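To illustrate the document-as-topic-mixture idea described above, here is a minimal sketch of fitting an LDA model to short texts. The study used the ‘topicmodels’ package in R; this sketch instead uses scikit-learn’s `LatentDirichletAllocation` in Python as an analogue, and the example texts are invented placeholders, not tweets from the study’s sample.

```python
# Minimal LDA sketch (illustrative only; the study used R's 'topicmodels').
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents standing in for tweets (not study data).
tweets = [
    "I stayed because I was afraid to leave",
    "I left when I realized I deserved better",
    "fear kept me in the relationship for years",
    "leaving was the hardest and best decision",
]

# LDA operates on word counts, so vectorize the documents first.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

# Fit a small model; the study fit 25 topics on 3,068 tweets.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row of doc_topics is one document's topic mixture (rows sum to 1),
# and each topic is a distribution over words -- the 'overlapping'
# content assumption: a document can load on more than one topic.
doc_topics = lda.fit_transform(counts)

for i, dist in enumerate(doc_topics):
    print(f"tweet {i}: topic mixture {dist.round(2)}")
```

A researcher would then inspect the highest-weight words per topic (via `lda.components_`) to label topics, roughly paralleling how human coders name themes.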

Results & Implications: Manual coding produced a broad range of data regarding the causes, lived experiences, and impacts of enduring an abusive relationship (over 300 unique codes). This analytic process captured the fine-grained details associated with experiencing dating violence, but the inductive analysis was labor-intensive and subject to all of the limitations associated with qualitative data analysis.

The LDA topic modeling approach resulted in 25 topics. While this analytic process was more expedient than manual coding, the results demonstrate that nuances in the narratives offered by the tweets are lost. However, topic modeling does offer a data exploration process for qualitative work. For example, had topic modeling been done first, it might have helped establish the base codebook, modestly speeding up the manual coding process. Implications for social work practice, policy, and future research will be discussed.