Choices, Models and Morals » Lecture 7
A treatment is an intervention (perhaps including just ‘leaving things be’) intended to produce a net beneficial change.
A decision to treat, or to implement a policy, can be modelled using the decision-theoretic framework we discussed in lecture 2: namely, we are advised to choose the treatment that maximises expected net benefit.
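As a minimal formalisation (the notation here is assumed, not fixed by the lecture): write \(a\) for an available treatment, \(s\) for a possible state of the world with probability \(P(s)\), and \(B(a, s)\) for the net benefit of choosing \(a\) when \(s\) obtains. The advice is then to choose

\[
a^{*} = \operatorname{arg\,max}_{a} \sum_{s} P(s)\, B(a, s).
\]

Reading \(B\) as the net benefit of a policy outcome gives the corresponding rule for policy decisions.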
the notion of evidence-based policy and practice ‘… fits well with a rational decision-making model of the policy process’.… Thus, it appears to be rational common sense to see policy as a purposive course of action in pursuit of objectives based upon careful assessment of alternative ways of achieving such objectives and effective implementation of the selected course of action (Sanderson 2002: 5)
Even accepting this framework, it is crucial that we can identify what effects our acts could produce – whether with certainty, or with some degree of probability.
In the policy context, economic theory is supposed to give us information about the causal mechanisms governing the rational interactions between economic agents that drive policy outcomes.
In lecture 5 we discussed the question of whether economic models explain.
This was a pressing question, raised by an inconsistent triad of plausible claims (Reiss 2013: 127): (1) only true theories explain; (2) most economic models are false of, or inaccurate about, their target systems; (3) economic models are explanatory. These claims can be generalized from economic models to scientific models at large.
Reiss (2013: 127–33) gives an extensive defence of (2).
In practice, (3) seems true: the policies governments design are sold to us on the basis of their cost-effectiveness at promoting some desired end, generally based on an underlying rational choice model of the effects of the policy intervention.
the triumphs of modern medicine can easily lead us to overlook many of its ongoing problems. Even today, too much medical decision-making is based on poor evidence. There are still too many medical treatments that harm patients, some that are of little or no proven benefit, and others that are worthwhile but are not used enough. How can this be, when every year, studies into the effects of treatments generate a mountain of results? Sadly, the evidence is often unreliable and, moreover, much of the research that is done does not address the questions that patients need answered.
Part of the problem is that treatment effects are very seldom overwhelmingly obvious or dramatic. Instead, there will usually be uncertainties about how well new treatments work, or indeed whether they do more good than harm. So carefully designed fair tests – tests that set out to reduce biases and take into account the play of chance… – are necessary to identify treatment effects reliably. (Evans, Thornton, et al. 2011: xx)
Central to the practical implementation of a rational choice approach to treatments is the idea of an evidence hierarchy, where different types of evidence are taken to be more or less reliable in justifying causal beliefs – see table 1.
This is supposed to influence clinical practice: ‘Good decisions should be informed by good evidence, which will tell us about the likely consequences of different treatment options’ (Evans, Thornton, et al. 2011: 143).
Table 1: An evidence hierarchy (Step 1 = most reliable).

| Quality of evidence | Type of evidence |
| --- | --- |
| Step 1 | Systematic review of randomized trials or \(n\)-of-1 trials |
| Step 2 | Randomized trial or observational study with dramatic effect |
| Step 3 | Non-randomized controlled cohort/follow-up study |
| Step 4 | Case-series, case-control studies, or historically controlled studies |
| Step 5 | Mechanism-based reasoning |
Why are RCTs at the top of the hierarchy?
A basic account of causal inference in experimental conditions is provided by Mill’s method of difference:
if an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring in the former; the circumstance in which alone the two instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon.…
The Method of Difference has for its foundation, that whatever can not be eliminated, is connected with the phenomenon by a law. Of these methods, that of Difference is more particularly a method of artificial experiment …
It thus appears to be by the Method of Difference alone that we can ever, in the way of direct experience, arrive with certainty at causes. (Mill 1874: bk. III, ch. VIII, §§2–3)
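The method can be put almost algorithmically. Here is a minimal sketch (the instances and circumstance labels are invented for illustration): each instance is a set of circumstances, and the method applies only when the instance exhibiting the phenomenon and the instance lacking it agree in every circumstance save one.

```python
def method_of_difference(instance_with, instance_without):
    """Return the lone circumstance distinguishing an instance that
    exhibits the phenomenon from one that does not, or None if the
    method's strict conditions are not met."""
    difference = instance_with - instance_without
    # Mill requires every other circumstance to be shared.
    if len(difference) == 1 and instance_without <= instance_with:
        return next(iter(difference))
    return None

# An instance in which the phenomenon (say, recovery) occurs, and one
# in which it does not, alike in every circumstance save the treatment:
treated = {"treatment", "same_diet", "same_age_group", "same_exercise"}
untreated = {"same_diet", "same_age_group", "same_exercise"}

print(method_of_difference(treated, untreated))  # -> treatment
```

An RCT approximates this idealisation statistically: no two groups of people share every circumstance, but randomization makes the treatment and control groups alike on average, which is the standard rationale for placing RCTs at the top of the hierarchy.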
In practice many policy evaluations don’t involve an RCT at all.
For example, many policies are justified by pilot studies, but these may not reflect the policy-as-implemented:
The first problem concerns the time needed for the effects of new policies to be manifested … It may take some considerable time for pilot projects to become fully established so as to represent the conditions under which a policy would work when fully implemented. If the policy aims to change attitudes and behaviour or achieve institutional reform, effects may be difficult to achieve during the course of a pilot project.
This problem is exacerbated by political considerations that constrain the length of time for pilot programmes to operate. When policy initiatives arise from political manifesto commitments, policy makers are understandably impatient to receive results that will provide evidential support for decisions to proceed with full implementation (Sanderson 2002: 11)
A key limitation of RCTs is that they detect causation in one fixed population, with its particular frequencies of confounders and its particular underlying traits.
What we need for policy design is evidence that the intervention will remain effective in new circumstances.
The methods recommended by typical evidence-ranking schemes [i.e., RCTs] are very good at establishing efficacy: whether a treatment causes a given outcome in the selected population under the selected circumstances. In evidence-based policy we are interested in effectiveness: What would happen were the treatment to be introduced as and when it would be in the population of interest. How can we move from efficacy to effectiveness? (Cartwright 2008: 130–31)
there are a number of other reasons why a pilot may not be typical of the policy as it would ultimately be implemented. … as Hasluck … points out, ‘… the resources devoted to the pilot may exceed that (sic) available at national roll out. There may also be a greater level of commitment and a “pioneering spirit” amongst staff involved in delivery’ (Sanderson 2002: 12)
Consider the California class-size reduction programme. The plan was backed up by evidence that class-size reduction is effective for improving reading scores from a well-conducted RCT in Tennessee. Yet in California when class sizes were reduced across the state reading scores did not go up. … There’s a conventional explanation. … California rolled out the programme state-wide and over a short period creating a sudden need for new teachers and new classrooms. So large numbers of poorly qualified teachers were hired and not surprisingly the more poorly qualified teachers went to the more disadvantaged schools. Also classes were held in spaces not appropriate and other educational programmes commonly taken to be conducive to learning to read were curtailed for lack of space (Cartwright 2008: 131; see Reiss 2013: 205–6)
RCTs, when implemented successfully, give us knowledge “cheaply” in the sense that they require no specific background knowledge in order to identify a causal effect from the data. But this does come at an eventual cost: if the understanding of the causal structure that is being experimented on in the RCT is very limited, there are no good grounds for believing that a result will also hold in this or that population that differs from the test population.
In a sense an RCT is a “black-box” method of causal inference. A treatment is administered, an outcome observed, with no need for any understanding of what is going on in between and why a treatment produces its outcome. But if there is no knowledge of why a treatment produces a given outcome, the basis for drawing inferences beyond the immediate test population is very narrow. (Reiss 2013: 205)
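A toy simulation makes both the “cheapness” and the black-box character vivid (all numbers below are invented):

```python
import random
from statistics import fmean

random.seed(0)

# Each subject carries an unobserved confounder that shifts the
# outcome. Randomization balances it across the two arms, so a bare
# difference in means estimates the average treatment effect with no
# model of the in-between mechanism at all.
TRUE_EFFECT = 2.0

def outcome(treated, confounder):
    baseline = 5.0 + confounder + random.gauss(0, 1)
    return baseline + (TRUE_EFFECT if treated else 0.0)

treated_arm, control_arm = [], []
for _ in range(10_000):
    confounder = random.gauss(0, 3)   # never observed, never modelled
    if random.random() < 0.5:         # coin-flip assignment
        treated_arm.append(outcome(True, confounder))
    else:
        control_arm.append(outcome(False, confounder))

print(f"estimated effect: {fmean(treated_arm) - fmean(control_arm):.2f}")
```

The effect is recovered without representing the confounder at all; by the same token, nothing in the procedure says what would happen in a population where that confounder is distributed differently.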
The deliberate choice of a level of statistical significance requires that one consider which kind of errors one is willing to tolerate. … In testing whether dioxins have a particular effect or not, an excess of false positives in such studies will mean that dioxins will appear to cause more harm to the animals than they actually do, leading to overregulation of the chemicals. An excess of false negatives will have the opposite result, causing dioxins to appear less harmful than they actually are, leading to underregulation of the chemicals. Thus, in general, false positives are likely to lead to stronger regulation than is warranted (or overregulation); false negatives are likely to lead to weaker regulation than is warranted (or underregulation). Overregulation presents excess costs to the industries that would bear the costs of regulations. Underregulation presents costs to public health and to other areas affected by damage to public health. Depending on how one values these effects, an evaluation that requires the consultation of non-epistemic values, different balances between false positives and false negatives will be preferable (Douglas 2000: 566–67)
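Douglas's trade-off can be made numerical. A minimal sketch, assuming a one-sided \(z\)-test with an invented effect size, noise level, and sample size: lowering the tolerated false-positive rate \(\alpha\) mechanically raises the false-negative rate \(\beta\).

```python
from statistics import NormalDist

Z = NormalDist()
effect, sigma, n = 0.5, 2.0, 50       # assumed true effect and noise
se = sigma / n ** 0.5                 # standard error of the sample mean

for alpha in (0.10, 0.05, 0.01):
    threshold = Z.inv_cdf(1 - alpha) * se     # reject the null above this
    beta = Z.cdf((threshold - effect) / se)   # chance of missing a real effect
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.2f}")
```

Which balance to strike is then a question of how one weighs overregulation against underregulation, not a purely statistical one.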
A single trial can deliver a Type I or Type II error simply by chance; it is much less likely that many independent RCTs will all err in the same direction.
Hence the recent emphasis in policy circles on meta-analyses, in which the results of many RCTs are aggregated to discern the overall pattern of causal effects.
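The simplest aggregation is fixed-effect (inverse-variance) pooling; the trial results below are invented for illustration:

```python
# Each pair stands in for one RCT: (estimated effect, standard error).
trials = [(0.30, 0.15), (0.12, 0.10), (0.25, 0.20), (0.18, 0.08)]

weights = [1 / se ** 2 for _, se in trials]   # precision weights
pooled = sum(w * est for (est, _), w in zip(trials, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5         # smaller than any single trial's

print(f"pooled effect: {pooled:.3f} +/- {1.96 * pooled_se:.3f}")
```

More precise trials get more weight, and the pooled standard error shrinks as trials accumulate; when the trials' populations differ substantially, a random-effects model is typically used instead, which is exactly where the external-validity worries above resurface.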
These can be powerful tools in policy evaluation. Consider the discovery that formally processing children who commit offences through the legal system (rather than diverting them to social programmes or simply releasing them) actually increases subsequent criminality:
juvenile system processing appears to not have a crime control effect, and across all measures appears to increase delinquency. … Given the additional financial costs associated with system processing (especially when compared to doing nothing) and the lack of evidence for any public safety benefit, jurisdictions should review their policies regarding the handling of juveniles (Petrosino, Turpin-Petrosino, and Guckenburg 2010: 6–7)
For more examples of effective/ineffective policies, see 80000hours.org/articles/can-you-guess/.
Consumer price inflation will be my main case but I will briefly discuss GDP and unemployment as comparisons. All three variables are regarded as observable by economists. But, as we will see, measuring them requires making a large number of substantial, and often contentious, assumptions. Making these assumptions requires real commitment on the part of the investigator as regards certain facts relevant to the measurement procedure, the measurement purpose as well as evaluative judgments. (Reiss 2013: 150)
What happens in perceptual processing, according to this account, is that sensory information is interpreted by reference to the perceiver's background theories, the latter serving, in effect, to rule out certain etiologies as implausible causal histories for the present sensory array (Fodor 1984: 31)
Reiss (2013: 150–57) explains in detail how CPI measurement is shaped by a significant array of overt and covert decisions about what to measure and how to measure it.
Most relevant to our previous discussion:
economists tend to be highly theory-driven when they estimate the impact of quality changes on people’s well-being. To give just one example, an increased variety of products is always interpreted as a good thing because consumers can satisfy their preferences more precisely. But greater variety is not necessarily a good thing as it increases the costs of decision-making, among other things. What is the true value of having yet another variety of Cheerios? (Reiss 2013: 155)
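To see how much hangs on such decisions, consider the simplest index computed on invented data: a Laspeyres-style index prices a fixed base-period basket at current prices, so substitution toward cheaper goods and the value of new varieties are settled by assumption rather than by the formula.

```python
# All goods, prices, and quantities below are invented for illustration.
base_prices = {"bread": 1.00, "milk": 0.80, "cereal": 3.00}
base_quantities = {"bread": 10, "milk": 12, "cereal": 2}
current_prices = {"bread": 1.10, "milk": 0.85, "cereal": 3.30}

# Cost of the *base-period* basket at each period's prices.
base_cost = sum(base_prices[g] * q for g, q in base_quantities.items())
current_cost = sum(current_prices[g] * q for g, q in base_quantities.items())

print(f"Laspeyres index: {100 * current_cost / base_cost:.1f}")  # ~108.6
```

Whether a new variety of Cheerios enters this basket as a price change, a quality improvement, or a wholly new good is precisely the kind of judgment the formula leaves open.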