# Inter-rater reliability

**Inter-rater reliability**, or **inter-rater agreement**, is the measurement of agreement between raters. It gives
a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining
the metrics given to human judges, for example by determining whether a particular scale is appropriate for measuring
a particular variable.

There are a number of statistics that can be used to determine inter-rater reliability, and different statistics are appropriate for different types of measurement. Among them are: the joint probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, and the intra-class correlation coefficient.

The joint probability of agreement is probably the simplest and least robust measure. It is the number of items for which the raters assign the same rating (e.g. 1, 2, ... 5), divided by the total number of items rated. It assumes, however, that the data are entirely nominal. Another problem with this statistic is that it does not take into account that agreement may happen solely by chance.
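As a minimal sketch (the ratings below are invented for illustration), the joint probability of agreement can be computed in a few lines of Python:

```python
# Hypothetical ratings from two judges for the same ten segments,
# on a 1...5 scale.
judge_a = [1, 2, 2, 3, 5, 4, 1, 2, 3, 3]
judge_b = [1, 2, 3, 3, 5, 4, 2, 2, 3, 1]

# Joint probability of agreement: the fraction of segments on which
# the two judges gave exactly the same rating.
agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
print(agreement)  # 7 of the 10 ratings match, giving 0.7
```

Note that this score says nothing about how far apart the judges are on the three segments where they disagree, which is one reason it is considered the least robust measure.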


## Kappa statistics

*Main articles: Cohen's kappa, Fleiss' kappa*

Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, are statistics that also take into account the amount of agreement that could be expected to occur by chance. They suffer from the same problem as the joint probability of agreement in that they treat the data as nominal and assume no underlying ordering of the scores.
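The chance correction for two raters can be sketched as follows (the helper function and the yes/no ratings are illustrative, not taken from any particular library):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on nominal data."""
    n = len(ratings_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal rating frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(ratings_a) | set(ratings_b))
    # Kappa: how far observed agreement exceeds chance, scaled to at most 1.
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(cohens_kappa(a, b))  # observed 0.8 vs. chance 0.52 gives kappa ~0.583
```

Here both raters say "yes" 60% of the time, so chance alone would produce agreement 0.6 × 0.6 + 0.4 × 0.4 = 0.52; kappa reports only the agreement beyond that baseline.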

## Correlation coefficients

*Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient*

In respect of inter-rater correlation, either Pearson's *r* correlation coefficient or Spearman's ρ correlation
coefficient can be used to measure pairwise correlation between raters, and the mean can then be taken to give an
average level of agreement for the group. The mean of Spearman's ρ has been used to measure inter-judge
correlation. However, neither Spearman's ρ nor Pearson's *r* takes into account the magnitude of the differences
between scores. For example, in rating on a scale of 1...5, Judge A might assign the scores 1, 2, 1, 3 to four segments while Judge B assigns 2, 3, 2, 4. The correlation coefficient would be 1, indicating perfect correlation,
even though the judges never actually agree.
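The example can be checked directly. The small Pearson's *r* helper below is a self-contained sketch of the standard formula:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

judge_a = [1, 2, 1, 3]
judge_b = [2, 3, 2, 4]  # always exactly one point above Judge A
print(pearson_r(judge_a, judge_b))  # 1.0, despite the judges never agreeing
```

Because Judge B's scores are a constant shift of Judge A's, the correlation is perfect even though the joint probability of agreement for the same data would be 0.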

## Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). This
is defined as "the proportion of variance of an observation due to between-subject variability
in the true scores". The range of the ICC is, as with the other correlation coefficients, between -1.0 and 1.0. The
ICC will be high when there is little variation between the scores given to each segment by the raters, e.g. if all raters
give the same, or similar, scores to each of the segments. The ICC is an improvement over Pearson's *r* and Spearman's ρ,
as it takes into account the variance between ratings for individual segments, along with the correlation
between raters.
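A minimal sketch of a one-way ICC, in the ICC(1,1) form of Shrout and Fleiss (1979), is shown below; the input layout (one row of rater scores per segment) is an assumption made for illustration:

```python
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects ICC, ICC(1,1) in Shrout & Fleiss notation.

    `ratings` is a list of rows, one row per segment (subject), each row
    holding the scores the k raters gave that segment.
    """
    n = len(ratings)     # number of segments
    k = len(ratings[0])  # number of raters per segment
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    # Between-segment and within-segment mean squares (one-way ANOVA).
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

print(icc_oneway([[1, 1], [3, 3], [5, 5]]))  # 1.0: raters agree exactly
print(icc_oneway([[1, 2], [3, 4], [5, 6]]))  # < 1: constant one-point offset
```

Unlike Pearson's *r*, the ICC penalizes the constant one-point offset in the second case, because that offset contributes within-segment variance.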

## Notes

- Cohen, J. (1960). "A coefficient of agreement for nominal scales". *Educational and Psychological Measurement*, Vol. 20, pp. 37–46.
- Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters". *Psychological Bulletin*, Vol. 76, No. 5, pp. 378–382.
- Shrout, P. and Fleiss, J. L. (1979). "Intraclass correlation: uses in assessing rater reliability". *Psychological Bulletin*, Vol. 86, pp. 420–428.
- Everitt, B. (1996). *Making Sense of Statistics in Psychology*. Oxford: Oxford University Press. ISBN 0198523661.

## Further reading

- Gwet, K. (2001). *Handbook of Inter-Rater Reliability*. Gaithersburg: StatAxis Publishing. ISBN 0970806205.
