The evaluation pitfalls of popular programmes
Through the fog of wall-to-wall Olympic coverage, a leaked piece of policy evaluation recently made it on to the front pages of the British media. This was research suggesting that a flagship government programme aimed at Troubled Families - costing hundreds of millions of pounds and heralded regularly by government as a great triumph - is in fact having very little impact at all.
While we await more information to see how true this turns out to be, it certainly raises the issue of what we want from an evaluation.
Over my years working in government, the private sector, and now the charity sector, I have come across just about everything in the evaluation world. Some of it is good, with real attempts to get fair data and understand what would have happened without the programme. Some is pretty bad, with anecdotes masquerading as data, or questionnaires full of leading questions that look as if someone had made them up over the kitchen table. Some consists of poor case studies followed by enormous leaps of logic about the programme's effect. Fortunately, there are also a number of high-end randomised controlled trials (RCTs) or econometric studies trying to get to the heart of a programme's additionality and impact.
But even the good evaluations often tell us less than we might think. And this matters most when they are assessing a project or programme that the government, a charity, a funder, or even the public really wants to believe in - to believe that it works well enough that we should continue to invest in it.
The first point to make is that almost any intervention will have some positive impact on some outcomes. So, seeing this positive impact is no reason to run up the flag and rejoice. Charities find this fact very hard to understand. So, too, does government.
That's why you need to do more than check for impact. The key question is: was the cost of achieving that impact worth it and - in a related question - is it the best way of achieving that impact?
At the very least, that means some assessment of the amount of impact that would be needed to justify the cost of delivering the programme. A full economic analysis is ideal, but only if your data is good enough, because without good data you run the risk of producing spurious figures.
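To make the idea of "the amount of impact needed to justify the cost" concrete, here is a minimal break-even sketch. The function name and every figure in it are hypothetical, chosen purely for illustration; none of them comes from any evaluation mentioned here, and a real analysis would also need to deal with discounting, deadweight and the quality of the outcome data.

```python
def break_even_impact(cost_per_participant: float,
                      value_per_outcome: float) -> float:
    """Return the extra outcomes per participant needed for benefits to equal costs.

    cost_per_participant: what it costs to put one person through the programme.
    value_per_outcome: the monetary value placed on one additional good outcome.
    Both are assumptions the evaluator has to justify.
    """
    if value_per_outcome <= 0:
        raise ValueError("value_per_outcome must be positive")
    return cost_per_participant / value_per_outcome


# Hypothetical figures: 1,500 per participant, and each additional
# good outcome valued at 10,000.
required = break_even_impact(cost_per_participant=1500.0,
                             value_per_outcome=10000.0)
print(f"Break-even: {required:.0%} of participants must achieve an outcome "
      "they would not have achieved without the programme")
```

If the evaluation's best estimate of additional impact sits below this threshold, a positive effect on its own is not enough to justify the spending.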
Secondly, we need to ask whether the impact achieved is sustained over time, or whether repeated doses of the intervention are needed. For example, does a course of six mentoring sessions put someone on the path to a better life once and for all, or does its impact fade fast unless you provide further mentoring or some other intervention, such as catch-up
Issues like this have arisen in the evaluation of the National Citizen Service (NCS), a government programme that will cost some £500m a year when fully up and running. We know that young people feel better and more positive after being on NCS - but do only the better motivated go on NCS, how long does the “glow effect” last, does it really change their actual behaviours, and could the same result have been achieved in other ways (like funding voluntary groups to work with the young people)?
The evaluators try to answer some of these questions, but their caveats tend to get lost when an enthusiastic government is telling the story. Even thorough evaluations, such as a recent RCT-based Behavioural Insights Team report on Youth Social Action, suffer from these failings, which makes it dangerous to exaggerate their impact.
And of course the gold standard, the pure RCT, has its own issues. The members of the randomly chosen comparator group will have their own back stories - especially if they are vulnerable, as in the Youth Social Action study. Equally, that child you worked with at the age of 6 will undoubtedly receive more interventions as he or she grows older. How do you allow for that when you're working out the net effect of your intervention?
Just as problematic is the fact that, simply because an RCT shows that a policy succeeds in one place at one time, you can't assume that it can be replicated later on elsewhere. The recent Family Nurse Partnership (FNP) evaluation illustrates this: a policy that seemed to work well in the US did not show the same results in the UK. This was partly because the basic medical service without FNP is a lot better in the UK than it is in the US.
Evaluating the evaluators
So where do we go from here? Certainly we need good evaluation and must continue to invest in it, or we will be spending money inefficiently. Despite the revolt against experts that the recent Brexit debate seemed to reveal, we can't give up on that.
It is essential to be very open about what an evaluation did, how it collected the data, how it allowed for things it could not measure well, what it is assuming about sustainability, and so on. One should then produce a range rather than a misleading point estimate, and leave the possible impact and cost-effectiveness for others to judge.
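As a rough sketch of what reporting a range might look like, the snippet below turns a low and a high estimate of a programme's additional effect into a best-case and worst-case cost per outcome. The function and all the numbers are hypothetical illustrations, not figures from any real evaluation.

```python
import math


def cost_per_outcome_range(total_cost: float,
                           participants: int,
                           effect_low: float,
                           effect_high: float) -> tuple[float, float]:
    """Return (best_case, worst_case) cost per additional outcome.

    effect_low and effect_high bound the additional outcomes per participant
    attributable to the programme (for example, the ends of a confidence
    interval). If the lower bound is zero or negative, the worst case is
    reported as infinite: the data cannot rule out no effect at all.
    """
    if effect_high <= 0 or effect_low > effect_high:
        raise ValueError("need effect_low <= effect_high and effect_high > 0")
    best_case = total_cost / (participants * effect_high)
    if effect_low <= 0:
        worst_case = math.inf
    else:
        worst_case = total_cost / (participants * effect_low)
    return best_case, worst_case


# Hypothetical programme: 1,000,000 spent on 1,000 participants, with the
# additional effect estimated at between 2% and 10% of participants.
best, worst = cost_per_outcome_range(1_000_000, 1000, 0.02, 0.10)
print(f"Cost per additional outcome: between {best:,.0f} and {worst:,.0f}")
```

Publishing both ends of the range, along with the assumptions behind them, lets readers judge for themselves whether the programme looks like value for money.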
Another thing: if you promise to publish an evaluation, do it promptly and don't wait for it to be leaked!