The First Mistake in Crafting Survey Experiments

27 Mar 2018

In the past few years, I’ve been doing a lot of teaching on the topic of survey experiments, motivated by the dearth of accessible guides to conducting survey-experimental research aside from Mutz’s (2011) Population-Based Survey Experiments. The way I teach this course is by first outlining the logic of experimental design in general and then walking participants through how to use survey-experimental manipulations to operationalize variations in potentially causal variables. The starting point is always theory, from which manipulations and outcome questions are derived. The reason for teaching this way is that first-time experimenters frequently have a survey background (or no background at all in empirical research) and believe that the starting point of a survey experiment is the questionnaire. This is a first and often final, fatal mistake. (It does not matter whether this approach takes place in a Word document or in survey software; the error is starting with a questionnaire at all.) Aside from the fact that starting from a questionnaire is an error-prone preparation for fieldwork that is likely to introduce trivial, easily missed typographical errors, the decision to design the survey before designing the experiment is very likely to lead a new experimenter astray.

There are three main reasons why starting from a questionnaire is flawed. The first is that the naive questionnaire-first approach to survey-experimental design presupposes that the first manipulations one constructs are the most suitable for any particular experimental test. For example, I may want to do an experiment on citizens’ opinions toward a specific policy and assess whether framing of that issue affects opinions, so I start by finding articles that generally use two frames that I’m interested in and then plunk those down in a Word document followed by some questions measuring opinions. In doing so, I’ve fixated on a particular style of framing manipulation and also made assumptions about what is - and what is not - held constant across experimental conditions. Why did I choose to use articles as opposed to some other stimulus? Why did I choose these articles? In what ways do they vary aside from use of the frames I was interested in? Are these representative of framing treatments generally? Why have I chosen articles at all, as opposed to framed question wordings or visual stimuli or a question-ordering manipulation? Do these stimuli have characteristics that are appropriate for testing my theory? Oh wait, I haven’t written out a theory so the last of these questions is unanswerable.

Good experimental design starts with a clearly stated set of empirical expectations that are expressed independent of the manipulations being used. If I am interested in the effects of framing on opinion, I need to state that expectation without regard to the particular device I have used to operationalize “framing” and without regard to the particular questions I have used to measure the outcome. We cannot evaluate construct validity of the manipulation and the outcome measure if the theory does not exist independent of the measures. If I start with a questionnaire, I inevitably express my expectations in a way that is constrained by my naive intuition about what feels like a good questionnaire rather than what is a theoretically interesting hypothesis. This initial questionnaire-independent theoretical statement typically also benefits from explication of assumptions being made either in the theory or in its general operationalization. For example, perhaps I envision explicit scope conditions for my theory (such as ambition to speak to specific types of people, specific points in time, or specific issues). Stating these up front is a form of pre-registration of my ideas (with myself) that prevents me from later getting lost in trying to explain whatever results emerge from the study with data independent of design. Starting from an explicit theoretical statement helps to avoid fishing through the data or, worse, fishing through explanations for those data.

The second problem with starting a survey-experimental project with a questionnaire is that it inevitably leads to a questionnaire that is filled with extraneous material. A survey experiment contains at its most essential level only two things: a manipulation of an independent, putatively causal variable and the measurement of an outcome. Most survey experiments thus contain about two items or two versions of a single question. Yet the questionnaire-first approach to survey-experimental design often yields questionnaires with much more material. I think the reason for this stems primarily from thinking about survey experiments as if they were observational surveys. In an observational analysis of survey data, I might be particularly interested in descriptive patterns in the resulting data such as demographic variation in outcomes. This may also be my intention in survey-experimental design but those analyses serve a different purpose from testing my core causal hypothesis - they are frosting or they are fodder for a different empirical project.

When I start from a theoretically motivated empirical expectation, my questionnaire typically generates a minimum number of items - that is, just the essential core. Anything else I add needs to be justified in terms of a specific analysis that each additional item will be used for:

Why am I measuring demographics or media exposure? Is it to assess pre-treatment covariate balance? If so, why am I doing that analysis? What will I learn from it? What will I do in response to evidence of balance or imbalance?
Why am I measuring demographics or media exposure? Is it to assess treatment effect heterogeneity? Why? What kind of heterogeneity do I expect? If I expect heterogeneity, why didn’t I theorize it or why didn’t I manipulate its source? If I plan to search post-hoc for heterogeneity, what will I do with those findings? How - if at all - will I communicate them?
Why am I measuring demographics or media exposure? What is it for? Is it for assessing effect heterogeneity or something else? Do I just want to know about my sample or do I want to characterize whether it is typical of my population of interest? What evidence would constitute typicality or representativeness, or the lack thereof? What will I do in response to evidence of representativeness or lack thereof?
Why am I measuring pretreatment opinions? Is it to improve measurement precision by conducting a within-subjects design? Will it generate consistency biases? Will it affect respondents’ understanding of the study?
Why am I measuring attention to or recall of the stimulus material? Is this a manipulation check? Is it something for my own benefit? Do I intend to instrument for this using random assignment in a two-stage least squares analysis? What will I do if the check “fails”?
Why am I measuring outcomes other than opinion? Is it that I’m worried I won’t find significant results? Is it that I have a more extensive theory than the one I mentally articulated? Is it that I plan to p-hack or perhaps set up some exploratory data analysis for future work? Is it that I might control for post-treatment outcomes or attempt some form of mediation analysis? Have I considered the assumptions necessary for analyzing data in this way?
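
One of the questions above mentions instrumenting for attention using random assignment in a two-stage least squares analysis. As a rough sketch of what that involves, and assuming a single binary assignment indicator, a binary attention measure, and no covariates, the two-stage least squares estimate reduces to a ratio of two differences in means (the Wald estimator). The function name `wald_estimate` and all of the data below are hypothetical, invented only for illustration. The code does not check the usual instrumental-variables assumptions (for example, that assignment affects the outcome only through attention), which is exactly the kind of thing that needs to be thought through before the attention item goes into the questionnaire.

```python
from statistics import mean


def wald_estimate(assigned, attended, outcome):
    """Ratio of the assignment effect on the outcome to the assignment
    effect on attention. With one binary instrument and no covariates,
    this ratio equals the two-stage least squares estimate."""

    def diff_in_means(values):
        # Difference between assigned-to-treatment and assigned-to-control means.
        treated = [v for a, v in zip(assigned, values) if a == 1]
        control = [v for a, v in zip(assigned, values) if a == 0]
        return mean(treated) - mean(control)

    effect_on_outcome = diff_in_means(outcome)
    effect_on_attention = diff_in_means(attended)
    if effect_on_attention == 0:
        raise ValueError("assignment did not change attention; the ratio is undefined")
    return effect_on_outcome / effect_on_attention


# Hypothetical data: eight respondents, four assigned to each condition.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]  # random assignment to the stimulus
attended = [1, 1, 1, 0, 0, 0, 0, 0]  # passed the attention/recall item
outcome = [5, 6, 7, 4, 4, 3, 4, 5]   # opinion on a numeric scale

estimate = wald_estimate(assigned, attended, outcome)
print(estimate)
```

Writing this out before fielding makes the purpose of the attention item explicit: it only earns its place in the questionnaire if an analysis like this one is actually planned and its assumptions are defensible.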

In essence, all of these questions ask the researcher to reflect upon why they are doing things in their survey instrument. This is something they would certainly do if they were designing a survey - time is precious, so it is important to only measure what will actually be used. In survey-experimental work, it is important to be even more stringent about deciding what to measure because ultimately the analysis typically relies only upon the core items of the questionnaire (the manipulation and the outcome question or questions). Other items might serve some purpose but design-based analysis of experiments does not make obvious what those purposes might be. That’s not to say there is no room for such analyses - quite the opposite - but the logic of performing such analyses does not follow from a simple analysis of a two-condition experiment in order to speak to the kind of simple theoretical statement used here.

Analysis aside, survey-experimental stimuli are typically designed to generate maximum variation in the putatively causal variable such that immediate effects can be measured using outcome questions soon thereafter. The greater the number of superfluous items, the more noise is introduced into the design; this is not between-condition noise that generates confounding but rather noise that makes those results more constrained or conditional. Such constraint might be intentional, because we want respondents to be in a given mindset during the experience of treatment. But noise introduced for purposes other than that form of control is not necessarily useful because it may make the results even more local than they otherwise would be (and thus either inflate or diminish the apparent magnitude of results, and we can never know which).

The final problem with a questionnaire-first approach is that it ignores the connection between survey measurement and experimental data analysis. While a particular question might seem - for reasons of face validity - to be the appropriate measure of an outcome concept, decisions about how to measure outcomes need to reflect an intersection between respondent comprehensibility and data quality. Survey-experimental analysis necessarily involves the analysis of a question-operationalized variable or set of variables. If we follow recent advice within political science to mostly analyze experimental data using ordinary least squares regression, then we need to consider how the questionnaire we are designing will lead to an outcome measure that can be sensibly analyzed in this way. In particular, outcomes that might have some intuitive appeal - like rankings or qualitative categorical response questions - do not lend themselves naturally to an OLS-centric analysis (or they lend themselves to analyses that require considerable researcher discretion). Thinking ahead to the analysis, the researcher should select questionnaire items that enable the kind of analysis that is likely to be performed.
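
To make the link between measurement and analysis concrete, here is a minimal sketch that assumes a two-condition experiment with an outcome recorded on a numeric scale. In that case the OLS slope on a 0/1 treatment indicator is simply the difference in condition means, which is why a numeric outcome fits this style of analysis so easily. The helper name `ols_slope_intercept` and the data are hypothetical.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Closed-form OLS for a single regressor: y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: 0 = control frame, 1 = treatment frame.
treated = [0, 0, 0, 0, 1, 1, 1, 1]
opinion = [3, 4, 2, 3, 5, 4, 6, 5]  # numeric opinion scale

slope, intercept = ols_slope_intercept(treated, opinion)

# The same quantity computed directly from the two condition means.
treated_mean = mean([o for t, o in zip(treated, opinion) if t == 1])
control_mean = mean([o for t, o in zip(treated, opinion) if t == 0])
diff_in_means = treated_mean - control_mean

print(slope, intercept, diff_in_means)
```

A ranking or an unordered categorical response has no comparable single number to put in `opinion` without further coding decisions, which is where the researcher discretion mentioned above comes in.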

In the same way, the questionnaire-first approach runs the risk of introducing excessive complexity into designs as further conditions are added in the face of considerations about interesting aspects to vary or control. In the framing example used above, only two or perhaps three conditions are necessary to gain insight into the framing effect. The decision to introduce further conditions should be theoretically motivated rather than introduced by piecemeal variations on the current version of the questionnaire. If we think again about the form of analysis that might take place with the resulting data, a two-condition experiment lends itself to only one obvious OLS-based analysis but a three-condition experiment allows for substantially more variations on a theme:

Outcome as a function of control as baseline, with treatment 1 and treatment 2 introduced as additional factors
Outcome as a function of treatment 1 as baseline, with control and treatment 2 introduced as additional factors
Outcome as a function of treatment 2 as baseline, with control and treatment 1 introduced as additional factors
Outcome as a function of control as baseline and treatment 2 as an additional factor, omitting treatment 1
Outcome as a function of control as baseline and treatment 1 as an additional factor, omitting treatment 2
Outcome as a function of treatment 1 as baseline and treatment 2 as an additional factor, omitting control
Outcome as a function of treatment 2 as baseline and treatment 1 as an additional factor, omitting control
Outcome as a function of control as baseline, merging treatment 1 and treatment 2 as an additional factor
Outcome as a function of merged treatment 1 and treatment 2 as baseline, with control as an additional factor
Outcome as a function of treatment 1 as baseline, merging control and treatment 2 as an additional factor
Outcome as a function of merged control and treatment 2 as baseline, with treatment 1 as an additional factor
Outcome as a function of treatment 2 as baseline, merging control and treatment 1 as an additional factor
Outcome as a function of merged control and treatment 1 as baseline, with treatment 2 as an additional factor

While there is a mathematical equivalence in many of these parameterizations and all are ultimately based upon combinations and comparisons of the three treatment group means, the insights that are immediately and obviously obtained from alternative parameterizations are likely to vary dramatically. The decision about which of these parameterizations is of sole or primary interest is something that should be decided at a theoretical level in order that the design speaks to the anticipated empirical regularities in a straightforward manner. If we start with a questionnaire rather than an experimental design, that decision about what matters may not be made until later or perhaps never at all.
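
As a small illustration of that equivalence, the sketch below assumes a three-condition experiment with a numeric outcome and fits the same data twice using NumPy's least squares solver, once with control as the baseline and once with treatment 1 as the baseline. The helper `fit_with_baseline`, the condition labels, and the data are all hypothetical. Both fits recover the same three group means, but the coefficients that are read off directly answer different questions.

```python
import numpy as np

# Hypothetical outcome data for three conditions of four respondents each.
y = np.array([3, 4, 2, 3, 5, 4, 6, 5, 4, 4, 5, 3], dtype=float)
group = np.array(["control"] * 4 + ["t1"] * 4 + ["t2"] * 4)


def fit_with_baseline(y, group, baseline):
    """OLS of y on indicator variables for every condition except `baseline`.
    The intercept is the baseline mean; each other coefficient is that
    condition's mean minus the baseline mean."""
    others = [str(g) for g in np.unique(group) if g != baseline]
    columns = [np.ones(len(y))] + [(group == g).astype(float) for g in others]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept"] + others, coef))


control_base = fit_with_baseline(y, group, "control")
t1_base = fit_with_baseline(y, group, "t1")

# Reconstruct the treatment 2 mean from each parameterization.
t2_mean_from_control = control_base["intercept"] + control_base["t2"]
t2_mean_from_t1 = t1_base["intercept"] + t1_base["t2"]

print(control_base)
print(t1_base)
```

With control as the baseline, the coefficients compare each treatment to control; with treatment 1 as the baseline, they compare control and treatment 2 to treatment 1. Which of those comparisons is reported first is a theoretical choice, not something the data decide.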

Therefore, rather than begin with a survey questionnaire as the starting point for an experimental project, researchers should always start from a clearly articulated empirical expectation situated within a detailed protocol document or pre-analysis plan (regardless of whether that plan is registered). Starting with a general or abstract approach will help to clarify what is intended to be manipulated and measured and possibly motivate pilot testing or at least reflection upon the range of possible manipulations of the core concepts before defaulting to one that is obvious. This approach will ultimately enable succinct experimental designs that can be clearly communicated to audiences and to one’s future self without the burden of getting bogged down in the messy data that a questionnaire-first survey-experimental design is likely to generate.



Except where noted, this website is licensed under a Creative Commons Attribution 4.0 International License. Views expressed are solely my own, not those of any current, past, or future employer.