Abstract: Data Management Using R (Society for Social Work and Research 22nd Annual Conference - Achieving Equal Opportunity, Equity, and Justice)

Data Management Using R

Schedule:
Thursday, January 11, 2018: 2:00 PM
Marquis BR Salon 8 (ML 2) (Marriott Marquis Washington DC)
Charles Auerbach, PhD, Professor, Yeshiva University, New York, NY
Data management is a crucial part of conducting research, and it needs to be considered before a research project begins. Good data management ensures that the data produced will be clear and accurate. At its core is the reproducibility of research, which makes it possible for other researchers to verify results and advance knowledge.

Before data can be analyzed, it must be in the appropriate format. It is estimated that data management constitutes about 80% of the effort in a data analysis (Dasu & Johnson, 2003). Incorrect data can lead to incorrect conclusions, negating the benefits of data-driven decisions (Hellerstein, 2008). R provides an array of built-in functions and packages for data management.

While analyzing data using an SEM package, it is often necessary to create new variables, collapse or recode categories, and so on. R’s data management functions have significant advantages over proprietary SEM packages such as Mplus. One appeal of R is that data can be manipulated with just a few lines of syntax. For example, the read.table() function is quite flexible: different types of text data, such as comma- or tab-delimited files, can be imported into R by changing a single option (Auerbach & Zeitlin, 2015). R syntax is more logical and follows programming principles more strictly. Furthermore, there is a very lively community to rely on for direction. Because lavaan is an R package, all of R’s data management capabilities can be exploited.
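
As a minimal sketch of the single-option change described above, the example below reads the same records from comma-delimited and tab-delimited text by changing only the sep argument of read.table(). The sample records and object names are hypothetical, and the text argument is used in place of a file path so the example is self-contained.

```r
# Hypothetical sample records stored as comma- and tab-delimited text
csv_text <- "id,age,group\n1,34,A\n2,51,B\n3,27,A"
tsv_text <- "id\tage\tgroup\n1\t34\tA\n2\t51\tB\n3\t27\tA"

# Only the sep option differs between the two calls
from_csv <- read.table(text = csv_text, header = TRUE, sep = ",",
                       stringsAsFactors = FALSE)
from_tsv <- read.table(text = tsv_text, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# Both calls should produce the same data frame
print(from_csv)
print(identical(from_csv, from_tsv))
```

With a real file, the file path would replace the text argument and the rest of the call would stay the same.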

Through demonstration, this workshop will show how to implement the most commonly used data management tasks in R, including built-in functions and add-on packages.

The workshop will begin with a discussion of how to input data directly into an R data frame and then how to use Excel or another spreadsheet to quickly and effectively record data (Auerbach & Zeitlin, 2014). Since Excel is the most commonly used spreadsheet program, entering data into Excel will be demonstrated. Packages for importing data from Excel, Stata, SAS, and SPSS, such as foreign, Hmisc, and xlsx, will be discussed.
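
The two entry routes described above might look like the following sketch. The client records, object names, and file names are hypothetical and are not the workshop's own materials. A temporary CSV file stands in for a spreadsheet saved from Excel, and the package calls are shown only as comments because they need the packages installed and real files to read.

```r
# Route 1: type hypothetical records directly into a data frame
clients <- data.frame(
  id    = 1:4,
  age   = c(34, 51, 27, 45),
  group = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Route 2: record data in a spreadsheet, save it as CSV, and read it in.
# A temporary file stands in for the exported spreadsheet here.
csv_path <- tempfile(fileext = ".csv")
write.csv(clients, csv_path, row.names = FALSE)
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
str(imported)

# Package routes (placeholder file names; packages must be installed):
# stata_data <- foreign::read.dta("study.dta")
# spss_data  <- foreign::read.spss("study.sav", to.data.frame = TRUE)
# excel_data <- xlsx::read.xlsx("study.xlsx", sheetIndex = 1)
```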

We will demonstrate a number of data management functions, including concatenating data sets, merging data sets, adding variables to an existing R data set, sorting a data set, subsetting a data set, aggregating a data set, computing new variables, deleting variables, recoding variables, and renaming variables.
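
The tasks listed above can each be sketched in a line or two of base R. The example below is one possible illustration using hypothetical data sets (wave1, wave2, demo) and a hypothetical maximum score of 15. It is not the workshop's own code, and add-on packages offer alternative ways to do the same things.

```r
# Hypothetical data: two waves of scores plus a demographics table
wave1 <- data.frame(id = 1:3, score = c(10, 14, 9))
wave2 <- data.frame(id = 4:6, score = c(12, 8, 15))
demo  <- data.frame(id = 1:6,
                    age = c(34, 51, 27, 45, 62, 30),
                    site = c("A", "B", "A", "B", "A", "B"),
                    stringsAsFactors = FALSE)

# Concatenate data sets (stack rows with the same columns)
scores <- rbind(wave1, wave2)

# Merge data sets on a shared key
full <- merge(scores, demo, by = "id")

# Compute a new variable and add it to the existing data set
full$score_pct <- full$score / 15 * 100

# Recode a continuous variable into categories
full$age_group <- ifelse(full$age >= 40, "40+", "under 40")

# Rename a variable
names(full)[names(full) == "site"] <- "clinic"

# Sort by score, highest first
full <- full[order(-full$score), ]

# Subset rows that meet a condition
older <- subset(full, age >= 40)

# Aggregate: mean score per clinic
clinic_means <- aggregate(score ~ clinic, data = full, FUN = mean)

# Delete a variable
full$score_pct <- NULL

print(full)
print(clinic_means)
```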

Through demonstration, participants will see the benefits and ease of using R for data management. Although it is not required, attendees are encouraged to bring their laptops to work through the examples as they are shown. Participants will be provided with a Dropbox link to access all the resources mentioned during this presentation.