In a Randomized Controlled Trial (RCT), people are selected at random to receive a service, and results for both selected and non-selected people are tracked over time. RCTs are sometimes called the “gold standard” of evaluation. However, ethical, pragmatic, or resource issues often make an RCT a poor choice for an evaluation.

What do people like about RCTs?

RCTs solve one major problem in evaluation: when evaluating a program, it is often hard to tell whether the program caused a specific result, or whether something about the people who participated made the result happen. Researchers call this selection bias; others call it cherry-picking. An example, relevant to our September article on schools, is the academic results for children who exercise school choice to attend what’s considered a good school. It might be that the school really does have better academic programs. Or the better results could come about because the parents who chose that school actively seek out academic opportunities for their children. If students were randomly assigned to schools, this selection bias would not exist.

What do people hate about RCTs?

Some researchers feel very strongly that RCTs are a poor research approach in almost every situation. They don’t allow programs to target the people who most need their work. They are very expensive, diverting resources to research that could otherwise go to direct service. And they generally keep a firewall between researchers and any implementers or participants, so opportunities for shared interpretation and understanding are limited.

What are some alternatives to RCTs?

Some research approaches offer benefits similar to RCTs – compensating for selection bias – while reducing some of the problems. These include:
  • Difference-in-differences (Dif-in-Dif). This approach compares participants’ results to their own data from before the program, and to the change a comparison group experiences over the same period (see the first sketch after this list).[1] Unfortunately, Dif-in-Dif results can be weakened if some broader event affects both groups – for example, no matter what programs you participated in, if you lived in the U.S. from 2007 on, you were likely affected in some way by the financial crisis – making it harder to attribute the change to the program.
  • Propensity Score Matching (PSM). PSM uses multiple factors to find a very closely matched comparison for each person participating in the program. Often, multiple matches are found in order to address incomplete data and strengthen the analysis. For example, in our evaluation of Mercy Corps’ Global Citizen Corps program, we used age, home community, language, educational attainment, and several other factors to find several comparisons for each participating young adult. We needed a very large sample in order to find good matches for every participant. Once we had a good sample, we administered a survey and collected qualitative data from both participants and their comparisons. A sketch of the matching step also follows this list.
  • Regression Discontinuity Design (RDD). RDD tries to isolate whether a program was the specific cause of a change, even when people can’t be randomly assigned to participate. Instead, participants are assigned based on some sort of cut-off (such as a test score, income level, or other factor), which allows a program to reach those most in need of its service. RDD assumes that people on either side of the cut-off have relatively similar experiences outside of the program, so any difference in results is attributed to the program (see the final sketch after this list).
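To make the Dif-in-Dif arithmetic concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration; the estimate is simply the participants’ change over time minus the comparison group’s change over the same period.

```python
# A minimal difference-in-differences sketch. The values are made up for
# illustration; in practice you would use your own pre/post measurements.

# Mean outcome (e.g., household income) before and after the program
treated_pre, treated_post = 410.0, 520.0         # program participants
comparison_pre, comparison_post = 400.0, 455.0   # comparison group (not randomly assigned)

# Change over time within each group
treated_change = treated_post - treated_pre            # 110
comparison_change = comparison_post - comparison_pre   # 55

# The Dif-in-Dif estimate: how much more the participants changed than
# the comparison group did over the same period.
did_estimate = treated_change - comparison_change      # 55

print(f"Estimated program effect (Dif-in-Dif): {did_estimate}")
```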
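Here is a rough sketch of the matching step behind PSM, using scikit-learn’s logistic regression to estimate propensity scores and simple nearest-neighbor matching. The covariates and data below are invented; the actual Global Citizen Corps matching used more factors and a much larger sample.

```python
# A minimal propensity score matching sketch with invented data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Covariates (e.g., age, years of schooling) and a participation indicator
n = 200
X = np.column_stack([rng.integers(15, 25, n),    # age
                     rng.integers(6, 14, n)])    # educational attainment
participated = rng.integers(0, 2, n)             # 1 = program participant

# 1. Estimate each person's propensity (probability) to participate
model = LogisticRegression().fit(X, participated)
scores = model.predict_proba(X)[:, 1]

# 2. For each participant, find the non-participant with the closest score
treated_idx = np.where(participated == 1)[0]
control_idx = np.where(participated == 0)[0]
matches = {}
for i in treated_idx:
    j = control_idx[np.argmin(np.abs(scores[control_idx] - scores[i]))]
    matches[i] = j  # nearest-neighbor match on the propensity score

print(f"Matched {len(matches)} participants to comparisons")
```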
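Finally, a minimal sketch of the RDD idea: people are assigned by a cut-off, and we compare outcomes in a narrow band on either side of it. The cut-off, bandwidth, and data are invented; a real analysis would typically fit regressions on each side of the cut-off rather than compare raw means.

```python
# A minimal regression discontinuity sketch with simulated data.
import numpy as np

rng = np.random.default_rng(1)

cutoff = 50.0                        # e.g., households scoring below 50 on a needs index get the program
scores = rng.uniform(0, 100, 500)
in_program = scores < cutoff

# Simulated outcome: a baseline trend in the score plus a program effect of +8
outcome = 20 + 0.3 * scores + 8 * in_program + rng.normal(0, 2, 500)

# Compare people just on either side of the cut-off (narrow bandwidth),
# who are assumed to be otherwise similar.
bandwidth = 5.0
just_below = outcome[(scores >= cutoff - bandwidth) & (scores < cutoff)]    # received the program
just_above = outcome[(scores >= cutoff) & (scores < cutoff + bandwidth)]    # did not

effect = just_below.mean() - just_above.mean()
print(f"Estimated program effect near the cut-off: {effect:.1f}")
```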
Each of these methods can be used when it is important to try to determine causality. We often try to enrich our analysis with stories and insights from participants. For example, if we are seeing an improvement in test scores or an increase in household income, we might ask participants what factors in their lives led to the change.

Want a periodic dose of heated debate about RCTs and their close methodological relatives? Join the Evaltalk listserv at https://listserv.ua.edu/archives/evaltalk.html.

References:
Barth, Gibbons, and Guo. “Introduction to Propensity Score Matching: A New Device for Program Evaluation.” January 2004. <http://ssw.unc.edu/VRC/Lectures/PSM_SSWR_2004.pdf>
Duflo and Kremer. “Use of Randomization in the Evaluation of Development Effectiveness.”
Imbens, Guido, and Lemieux, Thomas. “Regression Discontinuity Designs: A Guide to Practice.” Journal of Economic Literature 48 (2010): 281-355.
Trochim, William K. Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/index.php. 2006.
[1] Comparison group=NOT randomly assigned; Control group=IS randomly assigned.
