The first part of this workshop covers the basics of R, the world's leading free and open-source statistical software. The second part reviews the intellectual history of causal inference as it pertains to matching. The third part introduces GenMatch, an application of genetic matching algorithms for the social sciences.
The R project is a GNU (GNU's Not Unix) implementation of the S system, which Chambers et al. developed at Bell Laboratories for the statistical and graphical analysis of data. GNU is the free operating system project sponsored by the Free Software Foundation. R is "copyleft" under the GNU General Public License (GPL), so it may be used, modified and redistributed at will, provided that all derivative products are also copyleft under the GPL. R is mostly used from the command line, but graphical user interfaces are available for different audiences.
According to the Neyman-Rubin potential outcomes framework, the fundamental problem of causal inference is that the researcher never observes a treated person under control conditions and never observes a control person under treatment. Randomization addresses this problem by making the treatment and control groups comparable on average, which is why the randomized clinical trial is the gold standard for causal inference. In a randomized trial, the average treatment effect, or experimental benchmark, is estimated by the difference in mean outcomes between the treatment group and the control group.
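The quantities in this paragraph can be written compactly as follows. This is a sketch in conventional notation, which is an assumption here rather than the workshop's own: $Y_i(1)$ and $Y_i(0)$ denote the outcomes of person $i$ under treatment and under control, and $T_i$ indicates whether person $i$ was treated.

```latex
% Only one potential outcome is ever observed for each person:
Y_i = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)

% Average treatment effect (ATE) and average treatment effect on the treated (ATT):
\mathrm{ATE} = E\left[ Y_i(1) - Y_i(0) \right]
\qquad
\mathrm{ATT} = E\left[ Y_i(1) - Y_i(0) \mid T_i = 1 \right]

% Under randomization, T_i is independent of (Y_i(1), Y_i(0)), so the ATE
% equals the difference in mean observed outcomes between the two groups:
\mathrm{ATE} = E\left[ Y_i \mid T_i = 1 \right] - E\left[ Y_i \mid T_i = 0 \right]
```

Without randomization the last equality generally fails, because the treated and control groups may differ in ways that also affect the outcome.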
However, in many cases an intervention cannot be subjected to a randomized trial for budgetary, logistical or ethical reasons. It is then often necessary to perform an observational study, or quasi-experiment. Indeed, an observational study with a large random sample can generate estimates with better external validity, provided the causal path and the selection bias are well understood.
In an observational study, matching is used to estimate the average treatment effect on the treated. The treatment and control groups must be balanced on all key observable variables that influence selection into treatment or confound the outcome. Propensity scores are typically estimated with a logistic regression of treatment on the observed covariates. GenMatch is an improvement because it is non-parametric and affine invariant (preserves ratios of distances among covariates). It also generates a random population of candidate weights from which to select the optimal solution. To prevent data mining, GenMatch allows users to write their own function to discard bad weights based on a priori knowledge of the intervention and study population.
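The idea of matching on a weighted distance can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not GenMatch: the weights are fixed by hand, whereas GenMatch searches for them, and the covariates are left unstandardized. The function names `weighted_distance` and `match_att` and the toy data are hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    """Distance between two covariate vectors, with one weight per covariate."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def match_att(treated, controls, weights):
    """Estimate the ATT by one-to-one nearest-neighbor matching with replacement.

    treated and controls are lists of (covariates, outcome) pairs.
    Each treated unit is paired with the closest control unit, and the
    estimate is the mean of the treated-minus-matched-control outcomes.
    """
    diffs = []
    for x_t, y_t in treated:
        _, y_c = min(controls, key=lambda c: weighted_distance(x_t, c[0], weights))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)


# Toy data (invented for illustration): covariates are (age, years of education).
treated = [((25.0, 12.0), 10.0), ((40.0, 16.0), 20.0)]
controls = [((24.0, 12.0), 8.0), ((41.0, 16.0), 17.0), ((60.0, 10.0), 30.0)]

att = match_att(treated, controls, weights=(1.0, 1.0))
print(att)  # 2.5
```

Changing the weights changes which control is closest to each treated unit, and therefore changes the balance between the matched groups. That choice of weights is what a genetic search automates.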
The workshop will be primarily a demonstration, with software and handouts provided. Participants will practice writing sample code in small groups, either on their own laptops or on butcher paper. The workshop presenter has completed a full-year course in causal inference taught by one of the authors of GenMatch.