Kevin Munger and I recently had the joy of writing a response to “Nonparametric Identification is not enough, but randomized controlled trials are” by the all-star team of Aronow, Robins, Saarinen, Sävje and Sekhon (ARSSS). You can tell Kevin and I like this paper because, as of this moment, we make up about one quarter of its total citations on Google Scholar. I believe a number of other responses will be coming out soon; I’m very excited to read them.
I’m going to riff a bit more here about our response, “Enough?”. In particular, I want to dig into what I see as the most important perspective we articulate in the paper and some of its further implications. Namely, we argue that the biggest value of experiments is ontological1.
How to respond to a great paper
This was the first question I wrestled with. I love the paper, and I refer to it a lot, but a response isn’t that interesting if you’re just saying “ditto”. We started out with Kevin wanting to talk more about how uniform consistency of the SATE is still insufficient for science (plus various other big-picture philosophy-of-science points) and me wanting to discuss other reasons why RCTs are special (e.g. correct specification through shoe-leather)2. Regardless, we quickly latched onto the most provocative word in the title: “enough”. What does it mean for a method to be “enough”? Musing on this formed the core of our response.
Enough for what?
We started off by trying to think through what it would, conceivably, take for an estimate to constitute generalizable knowledge. We identified four things that we certainly want to be able to generalize over: (1) sample, (2) site, (3) realization of a theory, and (4) time. The first two are fairly well-trodden: convenience samples may make it hard to generalize to populations of interest, and site-selection bias may make it difficult to generalize to the locations that matter for future policy choices. Tal Yarkoni has already talked through the many challenges of generalizing across these kinds of dimensions, in the language of random effects, in Psychology.
The third element is kind of weird. What exactly is the set of treatments we could have defined based on a particular theory we’re trying to test? I have a bit of a preoccupation with this question because my experience running message tests at Facebook was that while psychological mechanisms may be important, so are specifics of wording that are largely unrelated to those mechanisms. If you write text that sounds like a robot, it’s gonna suck even if you’re pulling on the right mechanism. If we want the results to generalize to a claim about whether the theory is good or bad, then we have to be able to abstract away from these specifics.
The final point, on time, is a long-time hobby horse of Kevin’s. One of the main criticisms of this focus is that it isn’t actually any different from other forms of generalizability. We identify one extremely important way that time is different: there’s no ‘shoe-leather’ solution for it. You can randomly sample from a population, from a set of sites, or from potential treatment definitions. You cannot randomly sample time. This is an extraordinarily important difference because it means, ultimately, that you must rely on modeling assumptions that may not be necessary in the other cases (for exactly the reasons ARSSS spell out: randomization is an extremely powerful tool for eliminating such assumptions).
I strongly believe in the power of shoe-leather3. The fact that it is insufficient for temporal generalization should be concerning4. Regardless, our point is just that randomization is good, sure, but it doesn’t get us to where we want to go scientifically. So what does make it good?
Control of what?
I think the best part of the paper focuses on the ontological value of experimentation. That is, experiments create novel states of the world, and this is a powerful way to imagine alternatives to the social reality we currently inhabit. It gives people incentives to attempt cooler and more ambitious changes to the world while measuring what those changes do. But I think this gets at another reason that RCTs are particularly useful.
Observational methods assume a complicated ontology in the course of creating a DAG: how to chunk up the social world into nodes and how to assign values on those nodes to each unit. Put simply, the social world is extremely complex, and this process is subject to extraordinary error. RCTs circumvent this difficulty entirely: their accuracy does not depend on such ontological assumptions. Instead, they impose their own ontology. In the paper, we focused on the ontology of the treatment, but the argument is much broader and deeper.
For example, suppose that we run a field experiment that changes substantial aspects of people’s media diets, and we measure effects on various constructs that we think are interesting: ideology, perhaps, or beliefs about facts. It isn’t just that the RCT imposes the ontology of treatment and control; the RCT provides information on all of these constructs that we create (i.e., causal effects on those constructs). In survey research, a common recommendation is to avoid regressing one survey construct on another5. But the truth is, we do this in all observational research, since we make assumptions about constructs everywhere. With an RCT, we only do this on one side of the equation. The treatment ontology is imposed, but this allows us to weaken the ontological assumptions about the outcomes. We will still have valid causal effects on whatever ontology we choose.
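To make that asymmetry concrete, here is a minimal simulation sketch. The setup, variable names, and numbers are entirely my own toy assumptions, not from the paper: a randomized treatment and two outcome “constructs” that share a latent trait. The point is that a difference in means on a randomized treatment recovers the causal effect on any outcome construct we care to define, while regressing one construct on another picks up whatever latent structure they share.

```python
# Toy sketch (hypothetical names and effect sizes) of the one-sided ontology point:
# with a randomized treatment, difference in means is a valid causal estimate for
# *any* outcome construct we define; regressing one construct on another is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent trait that drives both survey constructs (unobserved confounding).
latent = rng.normal(size=n)

# Randomized treatment: assignment is independent of everything pre-treatment.
treat = rng.binomial(1, 0.5, size=n)

# Two outcome "constructs" we might define; treatment shifts the first by 0.3
# and the second not at all, and both load on the latent trait.
construct_a = 0.3 * treat + latent + rng.normal(size=n)
construct_b = latent + rng.normal(size=n)

# (1) Difference in means on each construct: unbiased for the causal effect,
# however we chose to construct the outcome.
for name, y in [("construct_a", construct_a), ("construct_b", construct_b)]:
    effect = y[treat == 1].mean() - y[treat == 0].mean()
    print(f"effect of treatment on {name}: {effect:.3f}")  # ~0.3 and ~0.0

# (2) Regressing one construct on the other: the shared latent trait produces a
# large "effect" even though neither construct causes the other.
slope = np.cov(construct_a, construct_b)[0, 1] / np.var(construct_b)
print(f"slope of construct_a on construct_b: {slope:.3f}")  # ~0.5, purely confounded
```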
Or maybe in our field experiment, we measure something collected administratively about behavior (e.g. turnout in an election). In this case, both the treatment and the outcome have some ontological precision in their meaning6. What about the heterogeneity we may observe? Well, we don’t rely on the ontology in the same way we do in observational settings. To take an ontology that is currently the subject of political contention, consider measuring heterogeneity by gender identity. There are different ways to construct this ontology, but when we choose a construction, we can simply measure the heterogeneity according to it. If we were instead in an observational world, we would need to care much more about getting this ontology “correct” in some way as it pertains to selection into treatment7. In an RCT, we do not rely on this correctness. We can explore variation with respect to whatever ontology we choose and, perhaps, even make judgments about which ontology better reflects the variation in response.
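Here is a second small sketch along the same lines, again with an assumed toy setup of my own rather than anything from the paper. Because assignment is randomized, it is independent of every pre-treatment attribute, so a difference in means within any pre-treatment grouping we choose is a valid estimate of the effect for that group, however we chunk the units up.

```python
# Toy sketch (assumed setup): under randomization, within-group differences in
# means are valid group-specific effect estimates under *any* grouping we choose.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# A pre-treatment attribute, and two different "ontologies" (groupings) built from it.
attribute = rng.normal(size=n)
grouping_1 = attribute > 0                    # coarse binary split
grouping_2 = np.digitize(attribute, [-1, 1])  # three-way split

treat = rng.binomial(1, 0.5, size=n)

# True effect varies with the underlying attribute.
y = (0.2 + 0.4 * (attribute > 0)) * treat + attribute + rng.normal(size=n)

def subgroup_effects(groups):
    # Difference in means within each group of the chosen ontology.
    return {g: y[(groups == g) & (treat == 1)].mean() - y[(groups == g) & (treat == 0)].mean()
            for g in np.unique(groups)}

print(subgroup_effects(grouping_1))  # ~{False: 0.2, True: 0.6}
print(subgroup_effects(grouping_2))  # valid group-specific effects under this ontology too
```

Neither grouping is “correct” in any deep sense; each just partitions the same randomized comparison differently, which is exactly why the estimates remain valid under either one.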
RCTs give us a solid foundation from which to explore these questions, while observational work cannot do this so easily.
For whom?
Critically, however, the ontological power of experiments is premised on control of the world. It is power not just in the statistical sense of charts and numbers, but in the sense of guns and laws. Wielding this form of power requires responsibility and humility.
Experimentation can be a tool for achieving our country. By trying out new social possibilities, we can provide the foundation on which a better society can be built. Doing this requires venturing past is and into ought. This is fraught territory for scientists, so it must be part of a larger democratic process. As the fabric of society is increasingly constructed of bits and code, we can increasingly implement creative new interventions that are meaningful to people’s lives. I think, in fact, we’re ethically obligated to do so.
By broadening our focus in this framing, we do something that ARSSS could not: draw distinctions between types of experiments and demonstrate the value of experimentation within a larger experimenting society. All experiments enjoy the statistical benefits that ARSSS describe. Not all experiments are similarly powerful ontological tools for considering new social worlds. The experiments that measure up best under this perspective are the ones that attempt ambitious changes to social reality. The experiments that do poorly are, mostly, what we have: survey experiments. These are, surely, not enough.
Read the paper and let us know what you think!
You may have read some about this perspective on Kevin’s blog.
Chris Harshaw has a response to ARSSS that goes more in this direction focused on the importance of the design-based frame of thought (and its rhetorical power).
If you’re reading this blog, I hope you’re already familiar with the beautiful David Freedman paper on this subject, Statistical Models and Shoe Leather. If you aren’t, read it immediately.
I’ve done a little searching, but I’d be interested in some citation archeology / intellectual history. Where does this recommendation originate?
Or at least something like phenomenological precision.
If you’ve ever read into the literature on measurement error of covariates for observational causal inference (it’s a nearly intractable problem), this should make you sweat.