# Inter-rater reliability

Inter-rater reliability, or inter-rater agreement, is the measurement of agreement between raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the instruments given to human judges, for example by determining whether a particular scale is appropriate for measuring a particular variable.

There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some of the various statistics are: the joint probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, and the intra-class correlation coefficient.

The joint probability of agreement is probably the simplest and least robust measure. It counts the number of items on which the raters assign the same rating (e.g. 1, 2, ..., 5) and divides this by the total number of items rated. This, however, treats the data as entirely nominal. Another problem with this statistic is that it does not take into account agreement that happens solely by chance.
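The joint probability of agreement can be computed in a few lines. A minimal sketch in Python, with illustrative data (the function name and ratings are not from the source):

```python
def joint_probability_of_agreement(ratings_a, ratings_b):
    """Proportion of items on which two raters give the identical rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must score the same items")
    agreements = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return agreements / len(ratings_a)

# Two raters score six items on a 1-5 scale; they agree on 4 of the 6.
rater_a = [1, 2, 5, 3, 4, 4]
rater_b = [1, 2, 4, 3, 4, 5]
print(joint_probability_of_agreement(rater_a, rater_b))  # 4/6 ≈ 0.667
```

Note that a result of 0.667 says nothing about whether the two near-misses (5 vs. 4, 4 vs. 5) were close or far apart, nor how much of the agreement would be expected by chance.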

## Kappa statistics

Main articles: Cohen's kappa, Fleiss' kappa

Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, are statistics that also take into account the amount of agreement that could be expected to occur through chance. They suffer from the same problem as the joint probability of agreement in that they treat the data as nominal and assume no underlying ordering of the scores.
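Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal category frequencies. A minimal sketch for the two-rater case, with illustrative data:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(ratings_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal probabilities of using it, summed over categories.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters sort six items into categories 1 and 2; they agree on 5 of 6,
# but half of that agreement is expected by chance, so kappa is only 2/3.
print(cohens_kappa([1, 1, 2, 2, 2, 1], [1, 1, 2, 2, 1, 1]))  # 2/3 ≈ 0.667
```

The correction for chance is why kappa is usually preferred over the raw joint probability: two raters guessing at random on a two-category scale would agree about half the time, yet their kappa would be near zero.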

## Correlation coefficients

Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient

In respect of inter-rater correlation, either Pearson's r correlation coefficient or Spearman's ρ correlation coefficient can be used to measure pairwise correlation between raters, and the mean can then be taken to give an average level of agreement for the group. The mean of Spearman's ρ has been used to measure inter-judge correlation. However, neither Spearman's ρ nor Pearson's r takes into account the magnitude of the differences between scores. For example, when rating on a scale of 1...5, Judge A might assign the scores 1, 2, 1, 3 to four segments and Judge B might assign 2, 3, 2, 4. The correlation coefficient would be 1, indicating perfect correlation, even though the judges never agree exactly.
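The Judge A/B example above can be verified directly: Judge B's scores are Judge A's shifted up by one, so Pearson's r is exactly 1 despite zero exact agreement. A minimal sketch (the function name is illustrative):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

judge_a = [1, 2, 1, 3]
judge_b = [2, 3, 2, 4]  # each score is judge_a's plus 1
print(pearson_r(judge_a, judge_b))  # ≈ 1.0, yet the judges never agree exactly
```

This is exactly the failure mode described above: correlation measures whether the raters rank and space items consistently, not whether they assign the same values, so a systematic offset between judges is invisible to it.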