Outcome Evaluation Designs

The following sections provide more technical detail on outcome evaluation designs. Many consider the experimental approach, using randomised controlled trials, to be the ‘gold standard’ for evaluation evidence. In many TDM projects this may not be practical, and quasi-experimental or non-experimental methods will be sufficient.

Experimental approaches

In an experiment, you decide who is in each of the comparison groups, usually by randomly allocating people to groups. The group that receives or participates in your project is the intervention or treatment group; there may be more than one treatment group, each receiving different levels or types of project activities.

A group that does not participate in the project is a control group, which provides the counterfactual case: a sense of what would have happened to the treatment group without the intervention.

An experiment in which the potential participants in the project are randomly allocated to each of these groups is a randomised controlled trial (RCT). Potential participants will differ in characteristics that might affect the success of the project; random allocation into groups controls for these factors, because each characteristic will be equally likely to be present in each group. In other words, because the groups are, on average, alike in all characteristics that might affect the project, except that one group receives the project and the other does not, differences in outcomes can be attributed to the project.
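To make the logic of random allocation concrete, the following sketch (in Python, using entirely made-up participant data and hypothetical variable names) randomly assigns a pool of volunteers to treatment and control groups and then compares the groups’ mean outcomes. Because the only systematic difference between the groups is the allocation itself, a simple comparison of means estimates the project’s effect.

    # Minimal sketch of random allocation and a simple outcome comparison.
    # All data and names are hypothetical, for illustration only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=42)

    n_participants = 200
    participant_ids = np.arange(n_participants)

    # Randomly allocate half of the pool to the treatment group.
    shuffled = rng.permutation(participant_ids)
    in_treatment = np.zeros(n_participants, dtype=bool)
    in_treatment[shuffled[: n_participants // 2]] = True

    # Hypothetical outcome: weekly walking trips measured at follow-up.
    # (In a real RCT these would come from follow-up surveys, not simulation.)
    outcomes = rng.poisson(lam=3, size=n_participants).astype(float)
    outcomes[in_treatment] += 1  # pretend the project adds about one trip per week

    treated = outcomes[in_treatment]
    control = outcomes[~in_treatment]

    # Because allocation was random, a simple comparison of group means
    # (here a two-sample t-test) estimates the project's effect.
    result = stats.ttest_ind(treated, control)
    print(f"Treatment mean: {treated.mean():.2f}, control mean: {control.mean():.2f}")
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")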

Experimental approaches should be considered in the following circumstances:

  • Scope of the project: The project has a small number of well-defined final outcomes and it can be delivered in isolation from other activities that might also affect these outcomes.
  • The intervention context: it must not be possible for the groups to affect each other. For example, members of the treatment group should not be able to pass on information to the control group that might alter the control group’s behaviour. This will be difficult if members of the treatment and control groups live in the same geographic area. If the project is rolled out in different sites, there should not be major differences in the way it is implemented across these sites.
  • Stability of the project: the project should not change in any significant way while the experiment is running, otherwise it will be hard to know exactly what form of the project affected the outcomes. Similarly, it must be possible to limit entry and exit from the project by participants so that the members of the treatment group remain comparable to the control group.
  • It must be ethically appropriate: even if an experimental design is technically possible, it might not be appropriate to use one. Some ethical questions you need to think about include:
  • Will members of the control or experimental groups be exposed to known risks or known harmful outcomes?
  • Will the experiment mean members of the control group are denied services which are known to be beneficial on the basis of existing evidence?
  • Will members of the control group be denied access to services to which they have an historical entitlement?

For TDM projects, an experimental design can be useful for assessing the impact of information provided to customers, as illustrated in Box 1.

Box 1: Randomisation in the Walk in to Work Out project

The Walk in to Work Out evaluation tested whether a ‘self-help intervention’ delivered via written materials could increase active commuting behaviour. Participants identified as thinking about walking or cycling to work, or already doing so irregularly, were selected from three Glasgow workplaces. These workplaces were in the same area of the city, served by a range of public transport links and marked cycle routes. Volunteers in the project trial were randomly split so that:

  • the experimental group received the Walk in to Work Out pack immediately

  • the control group was told the pack would be forwarded six months later (but, in fact, did not receive it within the lifetime of the study)

Follow-up questionnaires were sent to both groups after six and twelve months to measure the effectiveness of the trial. The experimental design provided statistical evidence that individuals who received the pack were twice as likely as those who had not to increase walking to work. Source: Mutrie, N. et al. (2002) ‘Walk in to Work Out’: a randomised controlled trial of a self-help intervention to promote active commuting, Journal of Epidemiology and Community Health, Vol. 56, pp. 407-412.

While many people regard the classic RCT as the ‘gold standard’ for evaluation designs, RCTs are not always desirable or feasible. As an evaluator you may not always have the level of control over project implementation needed to apply an experimental design, and there may be ethical issues with varying treatment options among the groups. In other instances, RCTs may simply be unnecessary to answer the specific key evaluation questions (KEQs) that have to be addressed.

Quasi-experimental design 

This design is also based on the principle of comparing outcomes for a defined group that receives the intervention with a group not receiving the intervention; however, you cannot fully control the composition of your comparison groups. Instead, groups are constructed once data are gathered, by dividing cases according to characteristics of the people (or things) from which the data have been taken, and differences between groups are then adjusted for in the analysis (hence the term ‘statistical control’).

For example, one local government area (LGA) may receive a TDM project that includes an app providing real-time information about travel choices, whereas another LGA gets the same project except that the app is not provided. You might later observe differences in behaviour, but because you did not randomly allocate people to live in each of the LGAs, these differences may be due to factors such as the age distribution of the populations, proximity to workplaces, other transport options, etc. To make a more meaningful comparison you will need to gather data about all these factors and ‘control’ for them using an appropriate statistical modelling technique, examples of which are in Table 1.
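Before turning to Table 1, here is a minimal sketch of what ‘controlling’ for such factors can look like in practice. It uses an ordinary least squares regression in Python (statsmodels) and assumes a hypothetical survey file with made-up column names (weekly_trips, has_app, age, distance_to_work, car_access); it illustrates the idea rather than recommending a particular model.

    # Sketch of statistical control using OLS regression (statsmodels).
    # The file name and column names are hypothetical, for illustration only.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("travel_survey.csv")  # hypothetical survey extract

    # has_app = 1 for residents of the LGA that received the app, else 0.
    # Age, distance to work and car access are included as controls, so the
    # coefficient on has_app reflects the difference in weekly trips that
    # remains after adjusting for those factors.
    model = smf.ols(
        "weekly_trips ~ has_app + age + distance_to_work + car_access",
        data=df,
    ).fit()
    print(model.summary())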

Table 1: Statistical techniques for quasi-experimental designs

For each technique, the description summarises what it does and how to use it in an evaluation.

  • Linear regression (also known as Ordinary Least Squares (OLS) regression): quantifies the difference in an outcome (the dependent variable) associated with differences in other factors (the independent variables).

  • Regression discontinuity: groups are based on whether individuals meet project eligibility criteria or not. That is, groups are constructed using ‘scores’ on some pre-project assessment, e.g. grant applicants who meet all criteria and receive grants, and grant applicants who do not meet all criteria and do not receive grants. Sometimes the latter group is constructed from ‘near misses’: those who nearly met project eligibility criteria but just missed out.

  • Multi-level modelling (also known as hierarchical modelling): recognises that individual outcomes may not be due solely to individual characteristics, but also partly to the groups and sub-groups to which those individuals belong. In other words, the data may be ‘nested’ in different levels, e.g. student learning outcomes may be affected by individual characteristics such as age and sex, but also by characteristics of the schools in which students cluster.

  • Logistic regression: similar to linear regression, but used when the outcome is a binary variable, e.g. passed or failed. It allows us to assess how particular characteristics of individuals affect their probability of having one form of the outcome or the other, e.g. what is the difference in the probability of succeeding between male and female participants?

  • Heckman modelling: used when the samples affected by a project are not randomly selected. It breaks the analysis into two steps: the first assesses the factors that determine whether individuals were ‘selected’ into the project or not, and the second examines the factors that caused selected individuals to have different outcomes from the project.

  • Difference-in-difference modelling (sometimes called a comparative interrupted time series design or a non-equivalent control group pre-test design): compares the outcomes of groups exposed to different conditions (e.g. degree of project intervention) at different times. It is used where some units experience a change in treatment status over time while other units do not, and where an RCT could not be conducted (a worked sketch follows this table).

  • Propensity score matching: statistically creates comparable groups based on the factors that influenced people’s propensity to participate in the project.
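As flagged in the difference-in-difference entry above, the sketch below shows what that approach might look like for the LGA example. It assumes a hypothetical file of monthly cycle counts and made-up column names (daily_cycle_counts, treated, post); under the usual assumption of parallel pre-project trends, the coefficient on the interaction term estimates the project’s effect.

    # Sketch of a difference-in-difference model using statsmodels.
    # The file name, data and column names are hypothetical, for illustration only.
    import pandas as pd
    import statsmodels.formula.api as smf

    counts = pd.read_csv("cycle_counts.csv")  # one row per count site per month

    # treated = 1 for sites in the LGA that received the project, else 0.
    # post    = 1 for months after implementation, else 0.
    # The coefficient on treated:post is the difference-in-difference estimate
    # of the project's effect, assuming parallel pre-project trends.
    did = smf.ols("daily_cycle_counts ~ treated * post", data=counts).fit()
    print(did.params["treated:post"])
    print(did.summary())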

Given the complexity of, and analytical skills required to understand, the choices in Table 1, an alternative to prescribing a preferred analysis as part of your evaluation planning is to specify how decisions about the statistical model will be made, and what kind of checks will be put in place to ensure the right approach is taken. For example, a technical advisory team may be set up to work through the detailed issues involved in choosing an appropriate statistical analysis.

Non-experimental/theory-based approaches 

Experimental and quasi-experimental approaches rely on access to a comparison group that provides a counterfactual: an ‘image’ of what the group getting the project would have looked like had they not received the project.

In many instances, you might not be able to get data on a comparison group; all you have is information from participants in your project. In these situations, you can use several non-experimental approaches to draw conclusions about effectiveness. One of the strengths of these approaches, over experimental and quasi-experimental approaches, is that they can generate evidence about process issues, so that process and outcome evaluations can be conducted simultaneously.

These approaches assess effectiveness by either or both of the following methods: 

  • Checking whether the results support causal attribution. An example of such an approach is actor attribution, whereby participants in the project are asked to explain the results. Similarly, expert comparison compares actual outcomes to expert predictions.
  • Ruling out alternative explanations. An example is general elimination methodology, which identifies alternative explanations for your outcomes and then systematically investigates them to see if they can be ruled out.

Table 2 provides a menu of options, most of which involve a detailed case study of the project. If you find that some of these are relevant to evaluating your TDM project, you should explore them further (as well as experimental and quasi-experimental designs) by looking at the UK Government’s Magenta Book Annex A.

Table 2: Non-experimental methods

For each method, the description summarises how it assesses effectiveness.

Methods to check whether the results support causal attribution:

  • Actor attribution: asks actors how they explain the results.

  • Realist evaluation: specific, hypothesised causal mechanisms, in context, are articulated and evidence is gathered for each. The key formula is: Context + Mechanism = Outcome.

  • Modus operandi: searches for distinguishing features of causal paths, e.g. distinctive concepts or terminology used by participants.

  • Process/contribution tracing: takes a case-based approach, focusing on the use of clues within a case (causal-process observations, CPOs) to adjudicate between alternative possible explanations.

  • Contribution analysis: assesses whether the project is based on a plausible theory of change, whether it was implemented as intended, whether the anticipated chain of early results occurred, and the extent to which other factors influenced the project’s achievements.

  • Bayesian updating: an extension of other theory-based methods such as contribution analysis and process tracing. The probability of a contribution claim being true, given the existence of a piece of evidence, is estimated using a variety of data (a worked sketch follows this table).

  • Most significant change: a participatory monitoring and evaluation method involving the collection of change stories from the field and the selection, by panels of designated stakeholders, of the most significant of these stories in terms of impact. Once the changes have been captured, selected groups discuss the changes and the value of each of them, lifting the most significant to the surface.

  • Qualitative comparative analysis: compares the configurations of different cases to identify the components that produce specific outcomes.

  • Expert comparison: compares actual outcomes to expert predictions.

Methods to rule out alternative explanations:

  • Force field analysis: provides a detailed overview of the variety of forces that may be acting on an organisational change issue.

  • General elimination methodology: identifies alternative explanations and then systematically investigates them to see if they can be ruled out.

  • Key informant interviews: ask experts in these types of projects, or in the community, to identify other possible explanations and/or to assess whether these explanations can be ruled out.

  • Process tracing: rules out alternative explanatory variables at each step of the theory of change.

  • Ruling out technical explanations: identifies and investigates possible ways in which the results might reflect technical or measurement limitations rather than actual causal relationships.

  • Investigation of disconfirming evidence: when data do not fit the expected pattern, treats them not as outliers but as potential clues to other causal factors, and then seeks to explain them.

  • Statistically controlling for extraneous variables: where an external factor is likely to affect the final outcome, takes it into account when looking for congruence.
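To make the Bayesian updating entry above more concrete, the sketch below applies Bayes’ rule to a single, entirely hypothetical contribution claim: a prior probability that the claim is true is revised in light of how likely a piece of evidence would be if the claim were true versus false. The probabilities used are invented for illustration.

    # Worked sketch of Bayesian updating for a contribution claim.
    # All probabilities are hypothetical, for illustration only.

    def update(prior, p_evidence_if_true, p_evidence_if_false):
        """Return P(claim true | evidence) using Bayes' rule."""
        numerator = p_evidence_if_true * prior
        denominator = numerator + p_evidence_if_false * (1 - prior)
        return numerator / denominator

    # Prior belief that the project contributed to the observed change.
    prior = 0.5

    # Evidence: stakeholders independently describe the mechanism predicted by
    # the theory of change. Assume this evidence is fairly likely if the claim
    # is true (0.8) and less likely if it is false (0.3).
    posterior = update(prior, p_evidence_if_true=0.8, p_evidence_if_false=0.3)
    print(f"Posterior probability the contribution claim is true: {posterior:.2f}")

    # Further pieces of evidence can be incorporated by using this posterior
    # as the prior for the next update.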

A theory-based approach is particularly suitable when the evaluation: 

  • Seeks to test the theory behind an intervention. For instance, the success of an intervention to promote cycling might depend on the number and kinds of organisations that need to be involved, the level of investment required, the number and type of engineering and education measures to be implemented, and the number of people and locations that need to be involved in order to achieve the anticipated outcome and impact targets. You could evaluate this by collecting data on each element; the observed outcomes are then explained by (and attributed to) the extent to which inputs, activities and outputs were achieved as planned.
  • Investigates a complex project that has multiple interacting outcomes, consists of different components, or is implemented in different locations.
  • Needs to consider how contextual factors (e.g. people, organisations or socio-economic circumstances) influence the success of a project.
  • Seeks to generate new knowledge that can be used to inform future TDM projects.
  • Is interested in identifying both anticipated and unintended outcomes and how they have been achieved.   

When to take measurements 

A key part of outcome evaluation designs is deciding when to take measures for your indicators. At the very least, some kind of before-and-after plan is necessary to make any assessment of the impact of your project. At either end of the process you might need to consider taking measures at a number of points:

  • Baseline measures provide a reference point from which changes can be gauged. The simplest approach is to take measures just before the TDM project is implemented. However, this may not always be sufficient to provide a counterpoint to post-project measures. Many TDM-relevant data series, such as customer journey numbers and traffic volumes, have both random and seasonal ups and downs, so a single baseline measure may capture the series on one of these ups or downs and produce misleading impressions when compared to post-project measures. To guard against this possibility, baseline measures should be collected over time before implementation so that random fluctuations can be taken into account in the data analysis. Similarly, seasonal variations can be assessed and post-project measures taken at corresponding points in the seasonal cycle (a simple sketch of this kind of seasonally matched comparison follows this list).
  • During the project, measures can be taken to observe how enabling outcomes are emerging and whether they correspond to the pathway laid out in your logic model. 
  • After the project, more than one measure may need to be taken to allow outcomes to fully mature.
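As noted in the baseline bullet above, the sketch below shows one simple way to build a multi-month baseline and compare post-project measures against the same months of the seasonal cycle. It assumes a hypothetical file of monthly patronage counts with made-up column names (month, patronage) and an invented implementation date.

    # Sketch of a seasonally matched before-and-after comparison.
    # The file name, column names and dates are hypothetical, for illustration only.
    import pandas as pd

    counts = pd.read_csv("monthly_patronage.csv", parse_dates=["month"])
    implementation_date = pd.Timestamp("2023-07-01")  # hypothetical go-live date

    before = counts[counts["month"] < implementation_date]
    after = counts[counts["month"] >= implementation_date].copy()

    # Average each calendar month across the baseline period, so random
    # fluctuations are smoothed while the seasonal pattern is retained.
    baseline_by_month = before.groupby(before["month"].dt.month)["patronage"].mean()

    # Compare each post-project month with the baseline for the same calendar
    # month, i.e. the corresponding point in the seasonal cycle.
    after["baseline"] = after["month"].dt.month.map(baseline_by_month)
    after["change_vs_baseline"] = after["patronage"] - after["baseline"]
    print(after[["month", "patronage", "baseline", "change_vs_baseline"]])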