The problem with kappa

David Powers

    Research output: Contribution to conference › Paper › peer-review

    63 Citations (Scopus)


    It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
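The measures named in the abstract can be illustrated with a minimal sketch for a binary contingency table. The formulas below are the commonly used textbook definitions and are an assumption here, since the abstract does not state them: "Powers Kappa" is taken to mean Informedness (Recall + Inverse Recall - 1), and "Fleiss Kappa" is taken in its two-rater form, where chance agreement comes from the averaged marginals. The function name `kappa_measures` and the example counts are hypothetical.

```python
from math import sqrt


def kappa_measures(tp, fn, fp, tn):
    """Chance-corrected measures for a 2x2 table (assumed standard definitions)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    prev = (tp + fn) / n  # Prevalence: proportion of real positives
    bias = (tp + fp) / n  # Bias: proportion of predicted positives

    # Cohen: chance agreement from the product of the two marginals.
    e_cohen = prev * bias + (1 - prev) * (1 - bias)
    # Fleiss (two-rater form): chance agreement from the averaged marginals.
    m = (prev + bias) / 2
    e_fleiss = m * m + (1 - m) * (1 - m)

    # Assumption: "Powers Kappa" is Informedness = Recall + Inverse Recall - 1.
    informedness = tp / (tp + fn) + tn / (tn + fp) - 1
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "cohen": (acc - e_cohen) / (1 - e_cohen),
        "fleiss": (acc - e_fleiss) / (1 - e_fleiss),
        "informedness": informedness,
        "mcc": mcc,
    }


# Hypothetical counts: a balanced set, then the same classifier behaviour
# with ten times as many real positives (only the prevalence changes).
balanced = kappa_measures(tp=40, fn=10, fp=20, tn=30)
skewed = kappa_measures(tp=400, fn=100, fp=20, tn=30)
for name in balanced:
    print(f"{name:13s} balanced={balanced[name]:.3f} skewed={skewed[name]:.3f}")
```

Because Informedness depends only on the per-class recall rates, it is the same for both tables in this sketch, while the values computed for Cohen and Fleiss Kappa shift with prevalence. This illustrates the kind of skew sensitivity the abstract describes; it does not reproduce the paper's specific results.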

    Original language: English
    Number of pages: 11
    Publication status: Published - 2012
    Event: 13th Conference of the European Chapter of the Association for Computational Linguistics 2012
    Duration: 24 Apr 2012 → …



