Methods: There is no large public dataset of immigrant-serving organizations. We build our datasets from several smaller public sources: the Guidestar dataset, the NCCS dataset, and a hand-coded dataset from Bloemraad’s research. The three sources are merged, cleaned, and divided into a training/test dataset (N=6,027; 75% for training and 25% for testing) and a validation dataset (N=400, half of which are positive cases). We apply seven natural language processing (NLP) methods, including bag of words, word2vec, recurrent neural networks, and BERT, and pass the output of each to a classification model. We use accuracy to evaluate the NLP methods and to compare them with the existing NTEE and keyword methods. We also apply t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique, to visualize which name features are easy for the algorithms to recognize.
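To illustrate the simplest of the methods named above, the following is a minimal sketch of a bag-of-words representation of organization names together with a 75/25 split. It is not the study's implementation: the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`, `train_test_split`), the tokenization rule, and the example names are all assumptions made for illustration.

```python
import random
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphabetic tokens (assumed rule)."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in name.lower())
    return cleaned.split()


def build_vocabulary(names):
    """Map each distinct token seen in the names to a column index."""
    tokens = sorted({tok for name in names for tok in tokenize(name)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(name, vocab):
    """Return a token-count vector for one name; unseen tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(name)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and hold out test_fraction of them."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Hypothetical labeled names (1 = immigrant-serving), not taken from the study.
examples = [
    ("Alianza Latina Community Center", 1),
    ("Smith Family Foundation", 0),
    ("Korean American Service League", 1),
    ("Riverside Youth Soccer Club", 0),
]
train, test = train_test_split(examples)
vocab = build_vocabulary(name for name, _ in train)
features = [bag_of_words(name, vocab) for name, _ in train]
```

The resulting count vectors would then be passed to a classifier of the reader's choice. Building the vocabulary from the training portion only keeps test names from influencing the features.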
Results:
The BERT methods achieve the highest accuracy on both the test and validation datasets (0.85 and 0.89, respectively), compared with 0.79 and 0.71 for the basic bag-of-words method. The machine learning method recognizes 180 of 200 immigrant-serving organizations, a much higher recall than the NTEE method (32/200) or the keyword method (83/200). We find that some name features, such as country names, remain good classification indicators, as does the use of a non-English language (e.g., organizations with “alianza” in their name). However, NLP methods still have difficulty correctly classifying some organizational names, such as those following the “last name + foundation” pattern.
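The recall figures implied by these counts can be worked out directly. The short sketch below assumes that the 200 organizations are the positive half of the validation dataset; the function name `recall` and the dictionary labels are illustrative only.

```python
def recall(true_positives, total_positives):
    """Share of actual positives that a method identifies."""
    return true_positives / total_positives


# Counts of correctly identified organizations reported in the results.
found = {"machine learning": 180, "keyword": 83, "NTEE": 32}
TOTAL_POSITIVES = 200  # assumed: positive half of the validation dataset

for method, hits in found.items():
    print(f"{method}: {recall(hits, TOTAL_POSITIVES):.3f}")
```

Under that assumption, recall is 0.900 for the machine learning method, 0.415 for the keyword method, and 0.160 for the NTEE method.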
Conclusions and Implications:
We conclude that these advanced NLP techniques offer an efficient and quite accurate way to replace part of the human labor involved in coding immigrant organizations, which can help advance research in significant ways. By making immigrant-serving organizations visible and easy to tally, these methods allow researchers in social welfare to better assess inequalities in service provision, spatial inequalities in location, and the impact of the nonprofit infrastructure on the lives of vulnerable migrants.