Assessing inter-rater agreement for nominal judgement variables. Gareth McCray


Please cite as: McCray, G. (2013). Assessing inter-rater agreement for nominal judgement variables. Paper presented at the Language Testing Forum, Nottingham, November 15-17. Copyright Gareth McCray, 2013.

Inter-rater agreement: definition of terms

Inter-rater agreement: in the context of this talk, the extent to which raters agree with one another. Inter-rater agreement might also be termed inter-rater reliability or rater concordance.

Nominal variable: a variable which cannot be measured on a scale and has no implied hierarchical relationship among its categories. E.g. What is the first language of the student? Did the student exhibit a specific behaviour (yes/no)?

Judgement variable: a variable which reflects the subjective, yet informed, opinion of a judge about a specific matter under investigation.

Background to the study This study is part of the initial phase of a project attempting to model item difficulty in reading comprehension in the L2 context, which made use of expert judgements. Before statistical modelling could take place, the levels of inter-rater agreement had to be evaluated in order to verify that only variables with statistically significant levels of agreement were included in the model. Judgements were gathered on 8 dichotomous variables for 82 items.

Communicating inter-rater agreement
"However, the reliability of the expert judges was rather low." Compared with what?
"The overall level of agreement was very high." According to which benchmark?

Statement of the Problem (1)
Various methods have been used in language testing to communicate a summary value of inter-rater agreement. This variety of methods makes it difficult to make comparisons between different studies. To allow the levels of agreement of expert judgements to be compared across studies, it would be a good idea to have a robust standard statistic for reporting levels of agreement which has published benchmark levels.

Summary statistics and their problems:
- Raw Agreement Proportion/Percentage: no measure of statistical significance; no account of agreement due to chance.
- Cronbach's Alpha: a measure of internal consistency, not of rater agreement per se.
- G-Theory variance components: difficult to communicate the results.

Statement of the Problem (2)
The fact that the data were dichotomous, and that many variables had high trait prevalence, made choosing a suitable technique for the assessment of inter-rater agreement problematic. I intend to present some methods for the assessment of inter-rater agreement for these data and discuss their advantages and disadvantages.

Evaluative criteria: Interpretability & Communicability; Robustness

Contingency tables
Example contingency table (rows = Rater 1, columns = Rater 2):

    80   5
     5  10

Agreements = 80 + 10 = 90; Disagreements = 5 + 5 = 10

Raw Agreement Proportion (RAP)
Example contingency table (rows = Coin 1, columns = Coin 2):

           Heads  Tails
    Heads    25     22
    Tails    23     30

RAP = (25 + 30) / 100 = 55%
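The RAP calculation above can be sketched in a few lines of Python (a minimal illustration; the `raw_agreement` helper name is mine, not from the talk):

```python
# Raw Agreement Proportion for a square contingency table.
# Rows are one rater's (or coin's) categories, columns the other's;
# agreements sit on the main diagonal.
table = [[25, 22], [23, 30]]  # the Coin 1 x Coin 2 example above

def raw_agreement(table):
    """Share of paired ratings that fall on the main diagonal."""
    total = sum(sum(row) for row in table)
    agreements = sum(table[i][i] for i in range(len(table)))
    return agreements / total

print(raw_agreement(table))  # 0.55
```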

Problems with the RAP
No consideration of agreement due to chance; no test of statistical significance. If we have two raters rating on a dichotomous scale at random, there is a 50% chance of agreement. If we have two raters rating on a 3-point scale (low/medium/high) at random, there is a 33% chance of agreement. The RAP does not allow us to (simply) perform a test of whether the agreement found is statistically significantly better than that which can be attributed to chance alone.

Evaluative criteria: Interpretability & Communicability ✓; Robustness ✗

Cohen's Kappa
Cohen's Kappa (1960) is one of the most widely used indices of rater agreement and overcomes this problem of chance agreement. It typically takes a value between 0 and 1, with 0 being the lowest level of agreement and 1 the highest. Clear benchmarks for the strength of agreement have been constructed for its use, to aid communicability.

Fleiss's (1981) benchmark scale for the Kappa:
  < 0.40           Poor
  0.40 to 0.75     Intermediate to good
  More than 0.75   Excellent

Landis and Koch's (1977) benchmark scale for the Kappa:
  < 0.00           Poor
  0.00 to 0.20     Slight
  0.21 to 0.40     Fair
  0.41 to 0.60     Moderate
  0.61 to 0.80     Substantial
  0.81 to 1.00     Almost perfect

Altman's (1991) benchmark scale for the Kappa:
  < 0.20           Poor
  0.21 to 0.40     Fair
  0.41 to 0.60     Moderate
  0.61 to 0.80     Good
  0.81 to 1.00     Very good
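Cohen's Kappa corrects the observed agreement p_o for the agreement p_e expected by chance from the marginal totals: κ = (p_o − p_e) / (1 − p_e). A minimal sketch in plain Python (function name is mine), reproducing the two Kappa values reported for the trait-prevalence tables later in the talk:

```python
def cohens_kappa(table):
    """Cohen's Kappa for a square two-rater contingency table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the two raters' marginal proportions.
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row[i] * col[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa([[45, 5], [5, 45]]), 3))  # 0.8
print(round(cohens_kappa([[80, 5], [5, 10]]), 3))  # 0.608
```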

Sensitivity to marginal homogeneity (rows = Rater 1, columns = Rater 2)

Example contingency table 1:    Example contingency table 2:
    30  20                          30  30
    20  30                          10  30

Table 1: RAP = 60%; Cohen's Kappa = 0.200*; Krippendorff's Alpha = 0.204*; Phi = 0.200*
Table 2: RAP = 60%; Cohen's Kappa = 0.231*; Krippendorff's Alpha = 0.204*; Phi = 0.250*
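For a 2×2 table [[a, b], [c, d]], Phi is (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). A quick sketch reproducing the two Phi values above from the slide's cell counts (the helper name is mine):

```python
import math

def phi_2x2(table):
    """Phi coefficient for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

print(round(phi_2x2([[30, 20], [20, 30]]), 3))  # 0.2  (example table 1)
print(round(phi_2x2([[30, 30], [10, 30]]), 3))  # 0.25 (example table 2)
```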

Sensitivity to trait prevalence (rows = Rater 1, columns = Rater 2)

Example contingency table 3:    Example contingency table 4:
    45   5                          80   5
     5  45                           5  10

Table 3: RAP = 90%; Cohen's Kappa = 0.800***; Krippendorff's Alpha = 0.801***; Phi = 0.800***
Table 4: RAP = 90%; Cohen's Kappa = 0.608***; Krippendorff's Alpha = 0.610***; Phi = 0.608***

Sensitivity to trait prevalence (rows = Rater 1, columns = Rater 2)

Example contingency table 5:    Example contingency table 6:
    45   5                          80   5
     5  45                           5  10

Table 5: Cohen's Kappa = 0.800***; Fleiss benchmark = Excellent; Landis and Koch benchmark = Substantial; Altman benchmark = Good
Table 6: Cohen's Kappa = 0.608***; Fleiss benchmark = Intermediate to good; Landis and Koch benchmark = Moderate; Altman benchmark = Moderate

Problems with Cohen's Kappa
Cohen's Kappa (and the Phi coefficient) are sensitive to marginal homogeneity, which is undesirable in a measure of agreement. Cohen's Kappa (along with the Phi coefficient and Krippendorff's Alpha) is negatively biased under high trait prevalence. For more robust measures of inter-rater agreement, particularly in cases of high trait prevalence, an alternative statistic is required.

Evaluative criteria: Interpretability & Communicability ✓; Robustness ✗

Gwet's AC1
Gwet's (2008) AC1 is an alternative statistic. Like the Kappa, it takes a value between 0 and 1. It is recommended that the same benchmark scales as those for the Kappa be used for communicating levels of agreement. It is not sensitive to marginal homogeneity and is positively biased under high trait prevalence. It can be extended to multiple raters, it can handle both nominal and ordinal data, and it can deal with missing data.

Evaluative criteria: Interpretability & Communicability ✓; Robustness ✓
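Gwet's AC1 uses the same (p_o − p_e) / (1 − p_e) form as the Kappa, but estimates chance agreement from the average marginal proportions: p_e = (1/(q−1)) Σ_k π_k(1 − π_k), where π_k is the mean of the two raters' marginal proportions for category k. A minimal two-rater sketch (function name mine; see Gwet, 2008 for the general multi-rater estimator):

```python
def gwet_ac1(table):
    """Gwet's AC1 for a square two-rater contingency table."""
    q = len(table)  # number of categories
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(q)) / n
    row = [sum(table[i]) for i in range(q)]
    col = [sum(table[i][j] for i in range(q)) for j in range(q)]
    # pi_k: average of the two raters' marginal proportions for category k.
    pi = [(row[k] + col[k]) / (2 * n) for k in range(q)]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)

# High-prevalence table: the Kappa drops to 0.608, but AC1 stays high.
print(round(gwet_ac1([[80, 5], [5, 10]]), 3))  # 0.866
```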

AC1 and marginal homogeneity (rows = Rater 1, columns = Rater 2)

Example contingency table 7:    Example contingency table 8:
    30  20                          30  30
    20  30                          10  30

Table 7: RAP = 60%; Cohen's Kappa = 0.200*; Gwet's AC1 = 0.200*
Table 8: RAP = 60%; Cohen's Kappa = 0.231*; Gwet's AC1 = 0.200*

AC1 and trait prevalence (rows = Rater 1, columns = Rater 2)

Example contingency table 9:    Example contingency table 10:
    45   5                          80   5
     5  45                           5  10

Table 9: RAP = 90%; Cohen's Kappa = 0.800***; Gwet's AC1 = 0.800***
Table 10: RAP = 90%; Cohen's Kappa = 0.608***; Gwet's AC1 = 0.866***

AC1 under a trait prevalence increase (rows = Rater 1, columns = Rater 2)

Example contingency table 11:   Example contingency table 12:
    30  20                          50  20
    20  30                          20  10

Table 11: RAP = 60%; Cohen's Kappa = 0.2*; Gwet's AC1 = 0.2*
Table 12: RAP = 60%; Cohen's Kappa = 0.04; Gwet's AC1 = 0.31**

Substantive results

Overall measures of inter-rater agreement:
  AC1 = 0.58**; Kappa = 0.34**; RAP = 0.82

Measures of inter-rater agreement by item type:
           SAMC    MAMC    GAPS
  AC1     0.39**  0.52**  0.57**
  Kappa   0.39**  0.47**  0.29**
  RAP     0.81    0.84    0.79

Measures of inter-rater agreement by variable:
  Variable                                         AC1     Kappa   RAP
  Proximity of pieces of information               0.46**  0.40**  0.84
  Competing information in text                    0.27**  0.27**  0.80
  Prominence of necessary information              0.47**  0.07    0.85
  Semantic match between item and text             0.08    0.15    0.75
  Concreteness of necessary information            0.41**  0.05    0.80
  Familiarity of topic                             0.48**  0.06    0.82
  Register of text                                 0.49**  0.00    0.82
  Extent to which outside information is required  0.60**  0.06    0.86

Substantive results against Altman's benchmark

Overall measures of inter-rater agreement:
  AC1 = Moderate; Kappa = Fair; RAP = 0.82

Measures of inter-rater agreement by item type:
           SAMC   MAMC      GAPS
  AC1     Fair   Moderate  Moderate
  Kappa   Fair   Moderate  Fair
  RAP     0.81   0.84      0.79

Measures of inter-rater agreement by variable:
  Variable                                         AC1       Kappa     RAP
  Proximity of pieces of information               Moderate  Moderate  0.84
  Competing information in text                    Fair      Fair      0.80
  Prominence of necessary information              Moderate  Poor      0.85
  Semantic match between item and text             Poor      Poor      0.75
  Concreteness of necessary information            Moderate  Poor      0.80
  Familiarity of topic                             Moderate  Poor      0.82
  Register of text                                 Moderate  Poor      0.82
  Extent to which outside information is required  Moderate  Poor      0.86

Conclusion In order to allow the levels of agreement of expert judgements to be compared across different studies, it would be a very good idea to have a standard statistic for reporting levels of agreement. I would suggest the use of Gwet's AC1, as it provides an interpretable, communicable and robust method for disseminating levels of inter-rater agreement.

References
Gwet, K. L. (2010). The Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29-48.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.