A useful structure for assessing the quality of evaluation is provided by the standards developed by the Joint Committee on Standards for Educational Evaluation in the USA (1994). These standards use four broad assessment criteria: utility, propriety, feasibility and accuracy, which can be used as a guide for designing good quality evaluation.


Utility

This is about ensuring that the evaluation will actually be useful: that the results will have an interested audience and application. This application may be at the level of an individual officer who will use the results of an evaluation exercise to inform personal practice, or it can be at service level to inform policy decisions about whether a particular programme should be introduced across a service.
To have high utility, an evaluation must have identified the intended user(s) of the results, which is sometimes the person or group delivering the service, and is sometimes the funder of the evaluation. The important point here is that it is no good carrying out an evaluation if the results would not have an audience that could make use of the findings.


Feasibility

This criterion is about ensuring that the design of the evaluation is practical and, importantly, cost-effective. Decisions in this area are frequently about balancing the quality of the data that will be obtained by a particular method against the potential cost of that approach. For instance, long-term follow-up of offenders is frequently limited to reconviction data. It would be desirable to assess the long-term impact of other factors such as employment, but the difficulties of obtaining this information on a systematic basis often make it impractical.
Evaluation, even on a small scale, needs time, effort and specialist skills. There is never just one way of undertaking a particular evaluation: it is a matter of choosing the most appropriate approach in the circumstances. Those circumstances include weighing the importance of the evaluation and the potential impact of its results against the costs of obtaining those results. This is frequently a political as well as a practical decision.
Where a new programme or initiative is being developed it is important that evaluation is built in and appropriate resources allocated to this task. Evaluation should be seen to be providing value for money, and the main means of achieving this is through publication. It is important that evaluations and findings are reported accessibly and made available for critique, so that others can learn from what doesn’t work as well as what does.


Propriety

Evaluation must be designed and conducted in a manner that is fair and ethical, and respects the security, dignity and self-worth of participants. It should be sensitive to diversity of interests and values, whether related to sex, culture, religion, race, sexuality, age or a host of other potential features of diversity. These issues apply to all stages of the evaluation process, from designing samples that try to include a sufficient number of relevant minority groups, to undertaking analysis that enables inter-group comparisons (Eichler, 1988, demonstrates the range of aspects of design which could be questioned here). Those who work with offenders have a commitment to anti-discriminatory practice and it is important that this be assessed in all evaluations.

Questions of ethics are not always easy to decide and evaluators should address key ones during the planning of an evaluation. For instance, assurances of confidentiality may have limits: if an interview with an offender raises child protection questions what would the evaluator do? What potential harm could be caused to respondents in interviews with survivors of sexual abuse?


Accuracy

The fourth criterion in assessment of quality relates to the technical adequacy of the evaluation: will the exercise produce valid and reliable data that the user can trust? This question addresses the methodology employed in the evaluation and the success of its implementation. As well as the broad design of the evaluation, this aspect also includes the need for individual instruments to be accurately and fully completed at the appropriate point in time by all the required respondents: confirmation that the data set is comprehensive. Where large numbers of respondents do not provide information, this will affect the results and can introduce bias that should be assessed as part of the process.
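As a purely illustrative sketch of how completeness might be checked, the short Python example below compares instrument return rates across groups; a marked difference between groups would be one warning sign that non-response bias needs further examination. The record structure and group labels are invented for the example and are not taken from this guidance.

```python
# Illustrative only: a simple completeness check on evaluation records.
# Each record notes the respondent's group and whether the instrument
# was returned; the structure and labels here are invented examples.
from collections import Counter

records = [
    {"group": "programme", "returned": True},
    {"group": "programme", "returned": False},
    {"group": "comparison", "returned": True},
    {"group": "comparison", "returned": True},
]

def response_rates(records):
    """Return the proportion of completed instruments per group."""
    totals, completed = Counter(), Counter()
    for r in records:
        totals[r["group"]] += 1
        if r["returned"]:
            completed[r["group"]] += 1
    return {group: completed[group] / totals[group] for group in totals}

# Large gaps between groups suggest the data set may not be comprehensive
# and that possible non-response bias should be assessed.
print(response_rates(records))  # e.g. {'programme': 0.5, 'comparison': 1.0}
```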
Good evaluators use their technical skill and sensitivity to the issues to maximise the integrity of the evaluation process. Where this expertise is not immediately available, researchers and evaluators will often act as consultants to the evaluation, and this guidance offers some assistance with these points. As the experience of the pilot projects demonstrates, the support of a ‘critical friend’ is invaluable to inexperienced evaluators.

Reliability and validity

Reliability and validity are interconnected, in that a measure cannot be valid if it is not reliable. However, it is quite possible for a measure to be reliable but not valid. The analogy often used to illustrate this is that of a clock. A clock which works erratically, so that it is sometimes accurate, sometimes fast and sometimes slow, is unreliable as a measure of time, and thus cannot give a valid indication of the time. A clock which is not changed at the transition from standard time to summer time is reliable: it is always an hour slow; but it does not tell the real time and therefore it is not valid. If you knew that the clock was an hour slow, however, you could make the necessary adjustment and it would then be a valid measure. Measures designed to assess attitude change are among the hardest for which to ensure reliability and validity.
Validity is the extent to which the instrument measures what it was designed to measure; for example, does an intelligence test really measure intelligence, or the ability to complete intelligence tests? There is a substantial literature here, particularly in respect of the discriminatory nature of such instruments, which have frequently been developed using samples of white, middle-class male respondents. Similar questions arise when measures developed in North America are used in Europe. A good example of a validity question particularly relevant to probation evaluations is the use of reconviction data as a proxy measure of re-offending.
Reliability describes the extent to which a measure can be relied upon, its dependability. The question frequently asked here is: does it always measure the same thing? It is about consistency – will an instrument obtain the same result with exactly the same subject if repeated? There are established statistical means of measuring reliability when designing such instruments, and many established instruments will have been developed using such techniques.
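To make the idea of established statistical means of measuring reliability more concrete, the Python sketch below illustrates two widely used checks: internal consistency (Cronbach's alpha) and test-retest correlation. This is an illustrative example rather than part of the guidance: the item scores and function names are invented, and in practice instrument development would rely on much larger samples and dedicated statistical software.

```python
# Illustrative sketch of two common reliability statistics, using invented data.
import statistics

def cronbach_alpha(item_scores):
    """Internal consistency: item_scores is a list of respondents,
    each a list of item scores (e.g. Likert ratings)."""
    k = len(item_scores[0])                        # number of items
    items = list(zip(*item_scores))                # scores grouped by item
    item_vars = sum(statistics.pvariance(item) for item in items)
    total_var = statistics.pvariance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

def test_retest(first, second):
    """Stability over time: correlation between two administrations
    of the same instrument to the same respondents (Python 3.10+)."""
    return statistics.correlation(first, second)

# Invented example data: five respondents answering a four-item scale,
# then total scores from two administrations of the same instrument.
responses = [[4, 3, 4, 4], [2, 2, 3, 2], [5, 4, 4, 5], [3, 3, 2, 3], [1, 2, 1, 2]]
print(round(cronbach_alpha(responses), 2))
print(round(test_retest([14, 9, 18, 11, 6], [15, 9, 17, 12, 7]), 2))
```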

Ethical issues

In the early stages of design and choice of methods it is particularly important to consider anti-discriminatory practice issues and assess whether minority groups will be appropriately and sensitively considered in the evaluation. This will require consideration of the targeting and inclusion of groups within the programme, the relevance of aspects of the work to a range of groups and differential outcomes between these groups. It will also affect decisions about the sampling design within the evaluation.

The instruments to be used should be validated for all the types of respondent in the evaluation. If they are not, questions will arise about the validity of any conclusions drawn from their use. For instance, if an instrument has been validated for use with adults, how appropriate is it to use it for young offenders? There may be no alternative to using an instrument that has not been validated on the appropriate group, but where this happens the design of the evaluation should take account of this and look for differences between groups that may arise from the inappropriateness of the measure rather than different outcomes of the programme.

Many professional organisations such as the British Society of Criminology, the British Sociological Association and the European Evaluation Society have Codes of Practice to guide members undertaking research and evaluation in the field. Codes of practice for a wide range of countries are available on the website of the European Evaluation Society.

There are four generally accepted ethical principles (Beauchamp & Childress, 1994): respect for autonomy, non-maleficence, justice and beneficence.

The first, respect for autonomy, means that participants have a right to refuse to take part in the evaluation, and have the right to be properly informed in making that decision. This is essentially about informed consent and is an issue that can create considerable dilemmas in relation to evaluation in working with offenders. To what extent can offenders ‘consent’ to take part when compliance is a condition of the order?

The second, non-maleficence, requires that the evaluation should not cause harm, either directly or indirectly. This is not just about physical harm. Some of the consequences of taking part in evaluation are not immediately evident and need to be considered carefully, especially where sensitive topics are being evaluated. For instance, an evaluation of a programme for the perpetrators of domestic violence may consider that the most reliable information about the effect of the programme can be obtained from the partners of perpetrators. However, this may create two sorts of harm for the partners: first, physical harm if the perpetrator is unhappy about the partner taking part, and secondly psychological harm if the method requires the partner to recount and relive violent incidents.

The third, justice, is about people being treated fairly and equally within the evaluation. This may not be as straightforward as it first appears: for instance, inappropriate questioning may mean that not all groups have been given an equal voice within the evaluation. Female offenders and black offenders may have a different perspective on the value of a programme from white male offenders, and questions should be designed so that the full range of perspectives can be obtained.

The fourth, beneficence, is about trying to do good within the evaluation, and frequently refers to the overall objective of the evaluation rather than the treatment of individuals within it. Most evaluation work is about trying to improve practice, which will ultimately be of value to all participants.

In order to fulfil the requirements of these principles, it is expected that the evaluator will:
- tell the truth and not try to mislead participants
- respect a participant’s right to privacy and not hound them to take part, or to answer a specific question
- ensure the confidentiality of the individual and be honest about the extent to which this can be assured
- keep any promises made to participants

The ways in which ethical issues are addressed will also need to take account of the context within which the evaluation is taking place, both the legal context and requirements of the country/organisation within which the work is located, and the social and organisational context of the agency within which it is taking place. A good example of the sort of complexities that can arise is given by the San Patrignano pilot evaluation, more details of which can be found here.

Consideration of ethical issues at the design stage will not identify all the potential problems or ways of resolving them, but at least it will lay down some ground rules to assist with the resolution of the issues which will inevitably emerge during the process.