Abstract: Using Machine Learning to Code Immigrant-Serving Organizations (Society for Social Work and Research 26th Annual Conference - Social Work Science for Racial, Social, and Political Justice)

679P Using Machine Learning to Code Immigrant-Serving Organizations

Sunday, January 16, 2022
Marquis BR Salon 6, ML 2 (Marriott Marquis Washington, DC)
Cheng Ren, MSSA, PhD Student, University of California, Berkeley, Berkeley, CA
Irene Bloemraad, Professor, University of California, Berkeley, Berkeley, CA
Background/Purpose: Nonprofits provide important human and social services, engage in advocacy, and build community. They are especially important for vulnerable groups, such as the poor or racial minorities. In this study, we focus on immigrant communities, an understudied population in nonprofit scholarship. Migrants face multiple challenges due to noncitizenship, linguistic or cultural barriers, and other vulnerabilities. Researchers who want to study immigrant-serving organizations have, however, faced a significant problem: there is no easy way to identify such nonprofits in datasets derived from IRS records. Two common methods filter organizations by NTEE (National Taxonomy of Exempt Entities) codes or by particular words that signal a homeland, race, or ethnicity, such as “Asian”. These methods are time-consuming and not very comprehensive. We ask: Are there better methods to identify immigrant-serving organizations more accurately and efficiently based on their names? This study explores 1) how machine learning and natural language processing can improve the classification process; and 2) which features of organizational names are easy to recognize and which remain difficult to classify.
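The keyword method described above can be illustrated with a minimal Python sketch. The keyword list and the organization names are hypothetical (the abstract gives only “Asian” as an example keyword), so this shows the general idea rather than the filter used in prior studies.

```python
import re

# Hypothetical keyword list; only "Asian" is mentioned in the abstract.
KEYWORDS = {"asian", "immigrant", "refugee"}

def keyword_match(name, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole word in the name."""
    tokens = set(re.findall(r"[a-z]+", name.lower()))
    return not tokens.isdisjoint(keywords)

# Made-up organization names, for illustration only.
names = [
    "Asian Community Services",
    "Alianza Latina",
    "Smith Foundation",
]
flags = [keyword_match(n) for n in names]
```

In this toy case the filter flags only the first name. “Alianza Latina” is missed because no listed keyword appears in it, which is one way a keyword list can fall short of being comprehensive.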

Methods: There is no large public dataset of immigrant-serving organizations. We build our datasets from several smaller public sources: the Guidestar dataset, the NCCS dataset, and a hand-coded dataset from Bloemraad’s research. These three datasets are merged, cleaned, and divided into a training/test dataset (N=6027, 75% for training and 25% for testing) and a validation dataset (N=400, half of which are positive cases). We apply seven natural language processing (NLP) methods, including bag of words, word2vec, a recurrent neural network, and BERT. The output of each NLP method is then passed to a classification model. We use accuracy to evaluate the NLP methods and to compare them with the existing NTEE and keyword methods. We also apply t-Distributed Stochastic Neighbor Embedding (t-SNE); this dimensionality reduction approach helps visualize which name features are easy for algorithms to recognize.
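As a rough illustration of the simplest pipeline named above (bag of words followed by a classifier, scored by accuracy), here is a small standard-library Python sketch. The abstract does not say which classifier was used, so the multinomial Naive Bayes model is an assumption. The organization names and labels are invented, and the train/test split is fixed by hand rather than drawn as a random 75%/25% split.

```python
import math
import re
from collections import Counter

def tokenize(name):
    """Lowercase a name and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", name.lower())

def train_nb(names, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for name, label in zip(names, labels):
        word_counts[label].update(tokenize(name))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict_nb(model, name):
    """Return the label (0 or 1) with the higher log-probability score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total)
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(name):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented toy data: 1 = immigrant-serving, 0 = other.
train_names = [
    "Vietnamese Mutual Aid Association",
    "Alianza Latina Community Center",
    "Korean American Senior Services",
    "Riverside Youth Soccer League",
    "Smith Family Foundation",
    "Downtown Arts Council",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_names = ["Alianza Vietnamese Services", "Riverside Arts Foundation"]
test_labels = [1, 0]

model = train_nb(train_names, train_labels)
preds = [predict_nb(model, n) for n in test_names]
test_accuracy = accuracy(test_labels, preds)
```

Bag of words treats each name as an unordered set of word counts, so it cannot use word order or context. That limitation is one reason to compare it against the sequence-aware methods listed above.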


Results: The BERT methods achieve the highest accuracy on the test and validation datasets (0.85 and 0.89, respectively), compared with 0.79 and 0.71 for the basic bag-of-words method. The machine learning method recognizes 180 out of 200 immigrant-serving organizations, a much higher recall than the NTEE method (32/200) or the keyword method (83/200). In using NLP, we find that some name features, like country names, remain good classification indicators, as does the use of a non-English language (e.g., organizations with “alianza” in their name). However, NLP methods still have difficulty classifying some organizational names correctly, such as those following the “last name + foundation” pattern.
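The recall comparison can be worked out directly from the counts reported above. The short Python sketch below assumes that the 200 organizations are the positive cases in the validation dataset, which the abstract implies but does not state outright. The function and variable names are illustrative.

```python
def recall(true_positives, actual_positives):
    """Recall = correctly identified positives / all actual positives."""
    return true_positives / actual_positives

# Counts as reported in the abstract; 200 is the assumed number of positives.
ACTUAL_POSITIVES = 200
found = {"machine learning": 180, "NTEE codes": 32, "keywords": 83}

recalls = {method: recall(tp, ACTUAL_POSITIVES) for method, tp in found.items()}
for method, value in recalls.items():
    print(f"{method}: {value:.3f}")
```

This gives a recall of 0.90 for the machine learning method, against 0.16 for the NTEE method and 0.415 for the keyword method.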

Conclusions and Implications: We conclude that these advanced NLP techniques offer a highly efficient and quite accurate way to replace part of the human labor involved in coding immigrant organizations. This can help advance research in significant ways. By making immigrant-serving organizations visible and easy to tally, researchers in social welfare can better assess inequalities in service provision, spatial inequalities in location, and the impact of the nonprofit infrastructure on the lives of vulnerable migrants.