
Factors Affecting Interpretation of Reliability Coefficients

Leslie Russek, PT, PhD, OCS 1

1 Associate Professor, Physical Therapy Department, Clarkson University, Potsdam, NY. This work was initiated while the author was a research physical therapist with Physiotherapy Associates, Glen Burnie, MD and completed at Clarkson University, Potsdam, NY. This research was supported by a demonstration grant from the APTA Section on Research and by Physiotherapy Associates. The protocol for this study was approved by the Institutional Review Board of Physiotherapy Associates. Address correspondence to Leslie Russek, Physical Therapy Department, Clarkson University, Potsdam, NY 13699-5880. E-mail: Lnrussek@clarkson.edu

It is now well understood that to make effective decisions based on patient data, physical therapists need to know the psychometric properties of the clinical measurements they use. 5 What may not be as well understood is that the various reliability coefficients often reported in the literature are not all based on the same statistical concepts. This technical note explores how variability among subjects affects interpretation of various reliability coefficients and how different coefficients, used with the same data, can lead to paradoxical interpretation.

Variability Among Subjects

Researchers studying reliability have sometimes noted a difference between the intraclass correlation coefficient (ICC) calculated from the measurements of small angles, such as knee extension range of motion (ROM), and large angles, such as knee flexion ROM. Brosseau et al 1 state that "To our knowledge, there are no studies that explain the difference between smaller and larger angles when considering intratester reliability, therefore, further research is needed." 1 In another study, researchers hypothesize that "The lower reliability for knee extension could reflect the difficulty therapists have aligning the goniometer in extension." 6 Other researchers have hypothesized that the ICC is lower for knee extension than flexion because "The knee extension arc is limited, and any error might, therefore, be magnified. Determining the anatomical landmarks may be difficult in patients with pathological changes in the knee. Knee extension itself may be highly labile and, therefore, hard to quantify." 11

The problem noted with reliability of small-value measurements is not limited to knee ROM. Researchers studying reliability of shoulder ROM have noted similar discrepancies between the ICC for shoulder flexion or abduction (large angles) and extension (smaller angles). 9 These researchers attributed the poor extension ICC to the fact that "Extension may not be measured frequently in the clinic, which could have contributed to the poor intertester reliability. Although the cause of the poor intertester reliability for measurements of extension cannot be identified from our data..." 9

Although researchers have often explained differences among ICCs between large and small joint angles as being due to physiological factors or measurement error, some or all of the difference in the magnitude of the coefficients may actually be due to differences in variability among subjects rather than variability in measurement error. Portney and Watkins emphasize that the ICC greatly depends on total subject variability and recommend that reliability studies use subjects having a wide range of scores. 8 However, because not all physical therapy measurements have the same amount of variability among subjects, not all reliability research has been able to follow this recommendation.

Paradoxical Results

Different reliability coefficients may give paradoxical results in which a measurement has good reliability according to one coefficient but poor reliability according to another.
In particular, reliability estimates computed using the ICC may seem to contradict reliability estimates computed using the standard error of measurement (SEM). Stratford 13 notes a paradox in which the ICC suggests better reliability of knee flexion than extension, yet the SEM suggests better reliability of extension than flexion. He explains this paradox by saying that a good ICC indicates a measurement better for differentiating one subject from another, while a good SEM shows that a measurement is better for differentiating a measurement in one subject from another measurement in that subject. 13

Journal of Orthopaedic & Sports Physical Therapy 341

This paradox is based on the characteristics of the ICC and SEM. The ICC is based upon the results of an analysis of variance (ANOVA), which partitions the total variability into variability between individuals and variability within an individual (error due to repeated measurements). Because the ICC compares the error from repeated measures with the total variability as a ratio, it varies between 0.0 and 1.0, where 0.0 reflects no reliability and 1.0 reflects perfect reliability. The SEM is derived from the SD of the measurement error and is proportional to the total variability. The SEM provides an estimate of reliability in the units measured (eg, degrees for ROM). 3,8,10,14 Because SEMs are reported in the units of the raw data, the SEMs for different types of measurements cannot be compared to one another as can ICCs. Table 1 provides a summary of the characteristics of the ICC and SEM.

The SEM is clinically useful because it helps us determine the amount a second measurement would need to differ (from an initial measurement) to be confident that the change is not just due to test-retest error. For example, if investigators looking at the intrarater reliability of passive knee extension ROM measured using an electrogoniometer found an SEM of 0.988, the minimum difference to be confident of real patient change would be 2.7 (minimum detectable change is computed as $1.96 \times \sqrt{2} \times \mathrm{SEM}$). 3,10 In this example, a second measurement would need to differ from the first by more than 2.7 for a therapist to be confident that the difference was due to real patient change and not just error in the 2 measurements.

The current work addresses 2 issues related to reliability: (1) how variability among subjects affects interpretation of reliability coefficients and (2) how different coefficients can yield paradoxical results.
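The minimum detectable change calculation in the example above can be sketched in a few lines of Python. This is only an illustration: the function name and the `z` parameter are assumptions introduced here, and the SEM value is the one quoted in the example.

```python
import math


def minimum_detectable_change(sem, z=1.96):
    """Return z * sqrt(2) * SEM (z = 1.96 corresponds to 95% confidence)."""
    return z * math.sqrt(2) * sem


sem_knee_extension = 0.988  # SEM quoted in the example above
mdc = minimum_detectable_change(sem_knee_extension)
print(f"MDC = {mdc:.1f}")  # prints "MDC = 2.7"
```

The factor of the square root of 2 appears because the difference between 2 measurements carries the error of both.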
In spite of warnings that the ICC is sensitive to total subject variability, reliability studies continue to attribute the lower ICC for knee extension to greater physiological variability and measurement error. Few studies contrast ICCs and SEMs when these coefficients appear to show contradictory reliability. The current work uses actual and simulated knee ROM data to illustrate how the ICC and SEM reflect different aspects of reliability, even in the case when the test-retest error is the same. The objective is to provide readers with a clear understanding of the differences between the ICC and SEM, and to illustrate the effect of subject variability on these reliability coefficients.

TABLE 1. Comparison of the intraclass correlation coefficient (ICC) and standard error of measurement (SEM).

                 ICC                                    SEM
Purpose          Comparing differences between groups   Comparing changes in an individual
Units            Normalized (unitless)                  In units of measurement
Range            0-1                                    ≥0
Ideal            Closer to 1                            Closer to 0
Basic equation   $(S - E)/(S + E)$                      $(S + E)\sqrt{1 - r}$

Abbreviations: S, subject variability; E, error variability; r, reliability coefficient.

METHODS

Subjects

A total of 32 therapists in 9 clinics participated in 76 episodes of data collection on 38 subjects. Therapists collected data on between 1 and 6 subjects (mean, 2.4). Therapists had an average of 8.7 years of experience with 4.8 years in their current positions. Forty-nine percent had entry-level Bachelor's degrees and 51% had entry-level or advanced Master's degrees. Among the participating therapists, 82% worked full time.

Subjects were drawn from patients whose evaluation or re-evaluation included ROM measurements of the knee. Potential subjects were excluded if they had acute or severe pain (arbitrarily defined as a pain rating greater than 6/10) to exclude patients whose condition might be exacerbated by the repeated measurements and to minimize the chance that increased pain would impair patient performance during retest measurements.
Informed consent was obtained from each subject. A total of 38 subjects with a variety of orthopedic knee conditions were included. Only measurements from the injured limb were used; in case of bilateral involvement, only the measurement for the left limb was used. The present data were collected as part of a larger study of reliability of routinely collected clinical data. The protocol for this study was approved by the Institutional Review Board of Physiotherapy Associates.

Procedure

The therapist who initially evaluated the patient was the "treating therapist"; the therapist repeating the measurements was the "retest therapist." For clinics with more than 2 therapists, each therapist was given a unique, randomly generated list of names of other therapists in the clinic. Once a treating therapist was identified from the list, he/she was randomly paired with a retest therapist by selecting the next name on the random-pairing list, as described by Rothstein et al. 11 If the identified therapist was unavailable, the next name was selected; the unavailable therapist remained on the list as the next name to be selected. This process generated an approximately equal number of all possible pairings, minimizing pairing bias.

342 J Orthop Sports Phys Ther Volume 34 Number 6 June 2004

Therapists collected any combination of active or passive knee flexion or extension ROM measurements, as they deemed appropriate for that patient. Data were thus collected for active extension (AE), active flexion (AF), passive extension (PE), and passive flexion (PF). Therapists were asked to record range measurements to the nearest degree. Measurements were done in a repeated-measures (test-retest) manner; the treating therapist indicated which measurements had been performed and the retest therapist repeated these measurements, typically within 20 minutes. The order in which measurements were performed was not controlled. To avoid bias, therapists collected their data on separate sheets of paper and were not allowed to watch each other conduct the measurements or to consult with each other prior to completion of both sets of data.

Generation of Simulated Data

Simulated data were generated to more clearly illustrate how the different components of variability contribute to the differences among the reliability coefficients. All joint ranges were based on normally distributed random numbers, rounded to the nearest integer, with means and SDs selected to produce values similar to the actual data (Table 2). Hence, simulated injured flexion (SIF) and simulated injured extension (SIE) values were generated to reflect a distribution similar to the actual injured knee range of data presented here. Simulated noninjured flexion (SNF) and simulated noninjured extension (SNE) values were generated to reflect typical ranges seen in noninjured knees.

TABLE 2. Mean and SDs used to generate simulated data and error values. Simulated retest data were computed by adding error value to simulated initial measurement.

                            SIE    SIF    SNE    SNF    Error
Mean for data generation     -3    100      0    130        0
SD for data generation        8     25      5     10        7

Abbreviations: SIE, simulated injured (knee) extension; SIF, simulated injured (knee) flexion; SNE, simulated noninjured (knee) extension; SNF, simulated noninjured (knee) flexion.

Simulated error (the difference between the simulated initial measurement and the simulated retest measurement) was randomly generated based on a normal distribution. A mean of 0 and SD of 7 provided simulated error values that were similar to those seen in the actual data. The randomly generated difference was added to each test measurement to compute the simulated retest measurement. The same simulated error was used to compute the retest measurement for all simulated data. Consequently, the error between test and retest measurements is identical for each set of simulated data (ie, injured and noninjured knee flexion and extension). The simulated data are shown in Appendix 1.

Data Analysis

Data were analyzed using ICC 1,1. ICC 1,1 was the most appropriate ICC model and form for the experimental design used in this study. 8,12 Model 1 (indicated by the first number) is appropriate when testers are selected at random from a larger pool of testers; here, 2 therapists were selected from the pool of participating therapists. The second number refers to the form, which in this case was a single measurement rather than a mean of several measurements. A separate 1-way analysis of variance (ANOVA) was computed for each actual data set (AF, AE, PF, and PE) and for each simulated data set (SIF, SIE, SNF, and SNE). Using the results of the ANOVA, ICC 1,1 for each data set was estimated using equation 1. 7,8

Equation 1. $\mathrm{ICC}_{1,1} = \dfrac{BMS - WMS}{BMS + (k - 1)\,WMS}$, where BMS is the between mean square and WMS is the within mean square from the 1-way ANOVA. In the current study k, the number of testers (therapists), is 2.

The SEM was computed using equation 2. 8

Equation 2. $\mathrm{SEM} = \sigma\sqrt{1 - \rho}$, where $\sigma$ is the SD and $\rho$ is the reliability coefficient. The SD was computed from the combined test and retest data.
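As a minimal sketch of how equations 1 and 2 could be applied to raw test-retest pairs, the Python below computes the mean squares of a 1-way ANOVA for 2 measurements per subject, then the ICC 1,1 and the SEM. The function names and the 3 pairs of angles are hypothetical, chosen only so the arithmetic is easy to follow, and using the sample SD of the pooled values is an assumption about how the combined SD is obtained.

```python
import math
from statistics import mean, stdev


def one_way_anova(test, retest):
    """Return (BMS, WMS) for n subjects, each measured k = 2 times."""
    n, k = len(test), 2
    grand_mean = mean(list(test) + list(retest))
    subject_means = [(a + b) / 2 for a, b in zip(test, retest)]
    # Between-subjects mean square: spread of subject means around the grand mean.
    bms = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subjects mean square: spread of each measurement around its subject mean.
    wms = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(test, retest, subject_means)) / (n * (k - 1))
    return bms, wms


def icc_1_1(bms, wms, k=2):
    """Equation 1."""
    return (bms - wms) / (bms + (k - 1) * wms)


def standard_error_of_measurement(sd, reliability):
    """Equation 2."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical angles in degrees (not data from this study).
test = [0, 2, 4]
retest = [1, 1, 5]

bms, wms = one_way_anova(test, retest)
icc = icc_1_1(bms, wms)
pooled_sd = stdev(test + retest)  # assumption: sample SD of all pooled values
sem = standard_error_of_measurement(pooled_sd, icc)
print(f"BMS = {bms:.2f}, WMS = {wms:.2f}, ICC = {icc:.2f}, SEM = {sem:.2f}")
```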
The reliability coefficient used in computing the SEM is based on the intended use of the SEM 8; for this study, the corresponding ICC 1,1 value was used as the reliability coefficient because the goal was to compare the ICC and SEM as measures of interrater reliability.

RESULTS

Simulated and actual data are shown in Appendices 1 and 2, respectively. The actual and simulated data are similar for the injured knee, with knee extension angles clustered near 0 while knee flexion angles were distributed through a wide range of values. Simulated flexion and extension measurements were distributed over a wider range for injured knees than for noninjured knees. Tables 3 and 4 show the minimum, maximum, mean, and SD of the actual and simulated data, respectively.

Equation (3) shows the computation of the ICC for the measurements of passive knee extension (PE), using relevant mean-square values from Table 5. The ICCs for the other data sets, computed using the mean-square results in Tables 5 and 6, are given in Table 7.

TABLE 3. Characteristics of actual data. Values are in degrees.

                  PE                        PF                        AE                        AF
           Test  Retest  Error*     Test  Retest  Error*     Test  Retest  Error*     Test  Retest  Error*
Mean       -2.9    -1.6     1.3    102.3   105.6     3.4     -3.5    -1.6     1.9    108.1   110.6     2.5
SD          5.9     4.6     5.3     34.2    29.9     9.3      7.7     7.7     6.8     27.9    26.5     9.4
Minimum     -15     -12     -11       35      38     -20      -20     -14     -18       45      55     -28
Maximum       6       7      12      168     150      28       12      18      16      148     146      14

Abbreviations: PE, passive knee extension; PF, passive knee flexion; AE, active knee extension; AF, active knee flexion.
* Error is the difference between test and retest; mean, SD, minimum and maximum are for the error values.

TABLE 4. Characteristics of simulated data. Values are in degrees.

                         SIE               SIF               SNE               SNF
           Error*    Test  Retest     Test  Retest     Test  Retest     Test  Retest
Mean         -0.1    -3.6    -3.7    102.8   102.7      0.8     0.7    129.8   129.7
SD            7.1     7.7     9.5     30.4    26.9      2.2     7.3      5.6     7.7
Minimum     -12.0   -18.0   -28.0     41.0    47.0     -5.0   -12.0    121.0   109.0
Maximum      19.0    10.0    10.0    160.0   157.0      5.0    17.0    141.0   150.0

Abbreviations: SIE, simulated injured (knee) extension; SIF, simulated injured (knee) flexion; SNE, simulated noninjured (knee) extension; SNF, simulated noninjured (knee) flexion.
* Error is the difference between test and retest; mean, SD, minimum and maximum are for the error values.

Equation 3. $\mathrm{ICC}_{1,1}(\mathrm{PE}) = \dfrac{BMS - WMS}{BMS + (k - 1)\,WMS} = \dfrac{42.63 - 14.24}{42.63 + (2 - 1)\,14.24} = 0.50$

Using the relevant ICC as the reliability coefficient, equation (4) shows the computation of the SEM for knee PE (where the SD used is computed for combined test and retest data). The SEMs for the other data sets are given in Table 7.

Equation 4. $\mathrm{SEM}(\mathrm{PE}) = \sigma\sqrt{1 - \rho} = 5.31\sqrt{1 - 0.50} = 3.76$

The ICCs for extension measurements ranged between 0.15 (SNE) and 0.67 (SIE); the ICCs for actual data were 0.50 and 0.59 for PE and AE, respectively. The ICCs for flexion measurements ranged between 0.46 (SNF) and 0.97 (SIF); the ICCs for actual data were 0.95 and 0.94 for PF and AF, respectively.
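The ICC values reported above can be checked directly from the mean squares listed in Tables 5 and 6. The short Python sketch below applies equation 1 with k = 2 to each data set and then repeats the equation 4 calculation for PE using the unrounded ICC; the dictionary layout and function name are illustrative assumptions, while the numbers are taken from the tables and from equation 4.

```python
import math

# (BMS, WMS) pairs as reported in Tables 5 and 6.
mean_squares = {
    "PE": (42.63, 14.24),
    "PF": (2018.92, 47.61),
    "AE": (94.87, 24.32),
    "AF": (1436.99, 45.80),
    "SIE": (125.53, 24.50),
    "SIF": (1623.72, 24.50),
    "SNE": (33.41, 24.50),
    "SNF": (66.16, 24.50),
}


def icc_1_1(bms, wms, k=2):
    """Equation 1: (BMS - WMS) / (BMS + (k - 1) * WMS)."""
    return (bms - wms) / (bms + (k - 1) * wms)


iccs = {name: icc_1_1(bms, wms) for name, (bms, wms) in mean_squares.items()}
for name, value in iccs.items():
    print(f"{name}: ICC(1,1) = {value:.2f}")

# Equation 4 for PE, with the combined SD of 5.31 quoted in the text.
sem_pe = 5.31 * math.sqrt(1 - iccs["PE"])
print(f"PE: SEM = {sem_pe:.2f}")
```

All 4 simulated data sets share the same within mean square (24.50), so the differences among their ICCs come entirely from the between mean square.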
The SEMs for the knee angle measurements made by physical therapists (PE and AE) were 3.8 and 4.9, respectively. The SEMs for PF and AF were 6.8 and 6.7, respectively (Table 7). The SEMs for simulated knee flexion and extension data were all between 4.9 and 5.0.

DISCUSSION

Variability Among Subjects

The present work indicates that the test-retest ICC was better (higher) for knee flexion than extension for both actual and simulated data. Previous literature also reports higher ICCs for knee flexion than for extension. Brosseau et al 1 found ICCs of 0.43 to 0.52 for knee ROM data with values close to 0 (ie, extension) and 0.82 to 0.88 for knee ROM values not close to 0 (ie, flexion). Clapper and Wolf 2 similarly found higher ICCs for knee flexion (0.95) than knee extension (0.85) and Watkins et al 16 reported a smaller difference between ICCs for knee flexion and knee extension (0.90 and 0.86, respectively). Hayes et al 6 found ICCs of 0.95 to 0.99 for knee flexion and 0.71 to 0.86 for knee extension in knees with osteoarthritis. Rothstein et al 11 reported ICCs of 0.84 to 0.92 for knee flexion (depending on goniometer and trial) and ICCs of 0.59 to 0.80 for knee extension.

Rothstein et al 11 did a post hoc analysis separating those therapists who used similar measurement techniques from those who did not; results showed that the ICC was higher for those who used the same measurement technique (0.74-0.84) relative to the ICC for those using different techniques (0.20-0.69). They concluded that the low ICC for extension could be attributed to differences in patient positioning. 11 However, others have analyzed the importance of patient position and found that it had only slight impact on reliability. 10

The current study shows that physiological or methodological differences between flexion and extension measurements are not necessary to have wide variation in ICC.
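A small simulation can illustrate this point: when the same test-retest error is added to data sets that differ only in subject variability, the ICC changes while the within-subject error does not. The Python below is a sketch under stated assumptions, not a reproduction of the data in Appendix 1: the seed is arbitrary, the sample size of 30 follows the simulated data sets, the generating means and SDs are the SIF and SNE values as printed in Table 2, and the square root of the within mean square is used as a simple stand-in for error in degrees rather than equation 2.

```python
import math
import random


def mean_squares(test, retest):
    """Between- and within-subject mean squares for k = 2 measurements."""
    n = len(test)
    subject_means = [(a + b) / 2 for a, b in zip(test, retest)]
    grand_mean = sum(subject_means) / n
    bms = 2 * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # With 2 measurements, the within mean square is sum(d^2) / (2n).
    wms = sum((a - b) ** 2 for a, b in zip(test, retest)) / (2 * n)
    return bms, wms


rng = random.Random(0)  # arbitrary seed (assumption) so the run is repeatable
n_subjects = 30

# One shared set of errors, mean 0 and SD 7, rounded to whole degrees.
errors = [round(rng.gauss(0, 7)) for _ in range(n_subjects)]

generation = {"SIF": (100, 25), "SNE": (0, 5)}  # (mean, SD) as printed in Table 2
results = {}
for name, (mu, sd) in generation.items():
    test = [round(rng.gauss(mu, sd)) for _ in range(n_subjects)]
    retest = [t + e for t, e in zip(test, errors)]  # same error for every data set
    bms, wms = mean_squares(test, retest)
    results[name] = ((bms - wms) / (bms + wms), math.sqrt(wms))

for name, (icc, within_sd) in results.items():
    print(f"{name}: ICC(1,1) = {icc:.2f}, sqrt(WMS) = {within_sd:.2f}")
```

Because the within mean square depends only on the test-retest differences, it is identical for both data sets; only the between mean square, and therefore the ICC, responds to the spread of the subjects.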
The constant error in the simulated data was a larger proportion of the extension measurements (which were near 0) than the flexion measurements. The simulated data therefore had lower ICCs for extension than for flexion, even though the simulated error did not have sources of variability, such as patient positioning. The low ICCs reported in the literature for knee extension and other small values may therefore reflect the impact of limited subject variability on the computation of the ICC.

Fritz et al 4 compared reliability for ROM in people with injured and uninjured knees and found that the ICC for the injured knee was higher than for the uninjured knee in both flexion and extension. The ICC for the current simulated injured versus noninjured data shows that when injured joints have more intersubject variability, they will have higher ICCs because of the greater intersubject variability rather than because of a difference in actual test-retest error (which was the same for the simulated injured and noninjured data). Reliability measures are, therefore, specific to the patient population tested and should only be applied to that population.

Paradoxical Results

Contrary to what one might expect, the current work found that the smaller (ie, better) SEMs did not correspond to measurements with higher (ie, better) ICCs. In spite of the differences in ICCs in the simulated data presented, the simulated data sets had equal SEMs (the small differences can be attributed to computation and rounding errors) because absolute error was the same for each simulated data set. The SEMs for actual data in the present study were smaller for passive and active knee extension (3.3 and 4.5, respectively) than for flexion (6.9 and 6.0). The smaller SEMs suggest that extension measurements were more reliable than flexion measurements in the present study. Fritz et al 4 found the same pattern of SEMs for knee extension and flexion: the SEM for injured knees was 1.7 for extension and 3.9 for flexion.

TABLE 5. One-way analysis of variance data for actual knee range of motion.

                             Sum of Squares   Mean Square   Degrees of Freedom    n
Passive extension (PE)                                                            27
  Between subjects                  1108.26         42.63                   26
  Within subjects                    384.50         14.24                   27
Passive flexion (PF)                                                              28
  Between subjects                 54510.93       2018.92                   27
  Within subjects                   1333.00         47.61                   28
Active extension (AE)                                                             30
  Between subjects                  2751.35         94.87                   29
  Within subjects                    729.50         24.32                   30
Active flexion (AF)                                                               28
  Between subjects                 38798.63       1436.99                   27
  Within subjects                   1282.50         45.80                   28

TABLE 6. One-way analysis of variance data for simulated knee range of motion data.

                             Sum of Squares   Mean Square   Degrees of Freedom    n
Injured extension (SIE)                                                           30
  Between subjects                  3640.33        125.53                   29
  Within subjects                    735.00         24.50                   30
Injured flexion (SIF)                                                             30
  Between subjects                 47087.73       1623.72                   29
  Within subjects                    735.00         24.50                   30
Noninjured extension (SNE)                                                        30
  Between subjects                   968.73         33.41                   29
  Within subjects                    735.00         24.50                   30
Noninjured flexion (SNF)                                                          30
  Between subjects                  1918.73         66.16                   29
  Within subjects                    735.00         24.50                   30

TABLE 7. Intraclass correlation coefficient (ICC) and standard error of measurement (SEM) for actual and simulated knee range of motion data.

Motion                                   ICC 1,1    SEM     n
Passive extension (PE)                      0.50    3.8    27
Passive flexion (PF)                        0.95    6.8    28
Active extension (AE)                       0.59    4.9    30
Active flexion (AF)                         0.94    6.7    28
Simulated injured extension (SIE)           0.67    5.0    30
Simulated injured flexion (SIF)             0.97    4.9    30
Simulated noninjured extension (SNE)        0.15    4.9    30
Simulated noninjured flexion (SNF)          0.46    5.0    30

Only 2 reports could be found in the literature comparing SEM and ICC data for knee ROM. Fritz et al 4 obtained a similar paradoxical result: higher ICCs in measurements with greater variability compared to data with less variability; lower SEMs in measurements whose value was small (relative to the magnitude of error) compared to measurements with values that were large (relative to the magnitude of error). However, they did not address this discrepancy in their discussion. Stratford and Goldsmith 14 noted a similar phenomenon by computing SEMs from the data presented in Hayes et al 6 to demonstrate that ICCs are better for distinguishing among subjects while SEMs are better for assessing error in repeated measures.

The simulated data presented here illustrate how the different nature of the SEM and ICC results in this paradox. The ICC compares the error due to repeated measures to the total variability in the data (ie, a ratio of variances). Mathematically, this comparison is done by dividing the error from repeated measures ($S - E$) by the total variability ($S + E$). This can be seen in equation 5, which is a simplified version of the ICC computation provided in equation 1.

Equation 5. $\mathrm{ICC}_{1,1} = \dfrac{S - E}{S + E}$, where S is the variability among subjects and E is the error variability.
If S equals BMS and E equals WMS, equation 5 reduces to equation 1 for a single measurement repeated once. When the subject variability is large relative to the error variability (S ≫ E), the ICC approaches 1.0; this is the case with flexion data and simulated data from injured knees. When the error variability is of the same order of magnitude as the subject variability (S ≈ E), the ICC approaches 0.0; this is the case with extension data and simulated data from noninjured knees. In contrast, equation 6 shows that the SEM is proportional to the SD, or total variability in the data (S + E).

Equation 6. $\mathrm{SEM} = (S + E)\sqrt{1 - \mathrm{ICC}}$

Because total variability (subject plus error) was much smaller for knee extension than for flexion, the SEM was smaller for knee extension than for flexion. The SEM and ICC therefore have an almost inverse relationship to one another.¹⁵ The different characteristics of these 2 reliability coefficients may create confusion about measurement error and have implications for clinical decision making.⁵

CONCLUSION

The simulated data here show that some or all of the differences previously noted among ICCs for large-value and small-value data could be due to differences in variability among subjects rather than to differences in physiological variability or measurement error. The ICC and SEM reflect different aspects of reliability: a ratio of variances versus consistency of measurement. As a result, the ICC and SEM can contradict one another: measurements may have good reliability as determined by their SEM but poor reliability as determined by their ICC, and vice versa. As Roebroeck et al¹⁰ have stated, "Reliability is not an absolute quality of a measurement, but is dependent on the way a measurement will be interpreted."

ACKNOWLEDGMENTS

I would like to thank Michael Wooden, PT, MS, OCS, of Physiotherapy Associates for his assistance in recruiting participating clinics, and all of the physical therapists who contributed data for this study.

REFERENCES

1. Brosseau L, Tousignant M, Budd J, et al. Intratester and intertester reliability and criterion validity of the parallelogram and universal goniometers for active knee flexion in healthy subjects. Physiother Res Int. 1997;2:150-166.
2. Clapper MP, Wolf SL. Comparison of the reliability of the Orthoranger and the standard goniometer for assessing active lower extremity range of motion. Phys Ther. 1988;68:214-218.
3. Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74:777-788.
4. Fritz JM, Delitto A, Erhard RE, Roman M. An examination of the selective tissue tension scheme, with evidence for the concept of a capsular pattern of the knee. Phys Ther. 1998;78:1046-1056; discussion 1057-1061.
5. Hayes KW. The effect of awareness of measurement error on physical therapists' confidence in their decisions. Phys Ther. 1992;72:515-525; discussion 526-531.
6. Hayes KW, Petersen C, Falconer J. An examination of Cyriax's passive motion tests with patients having osteoarthritis of the knee. Phys Ther. 1994;74:697-707; discussion 707-709.

7. Hobbs FD, Parle JV, Kenkre JE. Accuracy of routinely collected clinical data on acute medical admissions to one hospital. Br J Gen Pract. 1997;47:439-440.
8. Portney LG, Watkins MP. Foundations of Clinical Research. 2nd ed. Norwalk, CT: Appleton and Lange; 1993.
9. Riddle DL, Rothstein JM, Lamb RL. Goniometric reliability in a clinical setting. Shoulder measurements. Phys Ther. 1987;67:668-673.
10. Roebroeck ME, Harlaar J, Lankhorst GJ. The application of generalizability theory to reliability assessment: an illustration using isometric force measurements. Phys Ther. 1993;73:386-395; discussion 396-401.
11. Rothstein JM, Miller PJ, Roettger RF. Goniometric reliability in a clinical setting. Elbow and knee measurements. Phys Ther. 1983;63:1611-1615.
12. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psych Bull. 1979;86:420-428.
13. Stratford P. Reliability: consistency or differentiating among subjects? Phys Ther. 1989;69:299-300.
14. Stratford PW, Goldsmith CH. Use of the standard error as a reliability index of interest: an applied example using elbow flexor strength data. Phys Ther. 1997;77:745-750.
15. Streiner D, Norman G. Health Measurement Scales. Oxford, England: Oxford University Press; 2000.
16. Watkins MA, Riddle DL, Lamb RL, Personius WJ. Reliability of goniometric measurements and visual estimates of knee range of motion obtained in a clinical setting. Phys Ther. 1991;71:90-96; discussion 96-97.

Appendix

Appendix 1. Simulated knee range of motion data used in analysis. Values are in degrees.

                      SIE             SIF             SNE             SNF
Subject  Error*    Test  Retest    Test  Retest    Test  Retest    Test  Retest
      1       1       9       8      90      91       2       3     121     122
      2       2      11      13     132     130       3       1     124     122
      3      14      12       2      75      89       2      16     122     136
      4       5      10      15     141     136       2       3     136     131
      5       7       3      10      85      92       2       5     125     132
      6       0       4       4     100     100       3       3     132     132
      7       0      10      10     120     120       3       3     128     128
      8       7       9       2     128     121       1       8     128     121
      9       2      13      15     113     111       2       0     130     128
     10       2       1       1     118     116       2       0     140     138
     11       2       7       9      74      76       5       3     125     127
     12       5      17      22     140     135       2       3     135     130
     13       0       2       2      47      47       2       2     129     129
     14       3       9      12     160     157       0       3     136     133
     15       3       2       1      90      87       2       1     130     127
     16       1       2       1      93      92       3       2     128     127
     17       3       4       1      99     102       0       3     126     129
     18      11       8       3      86      97       2      13     128     139
     19       7      13       6      41      48       2       5     128     135
     20      10       2       8      59      69       2      12     129     139
     21      12       3      15     151     139       2      10     121     109
     22       7       6       1      88      81       1       6     121     114
     23       0       7       7     104     104       5       5     129     129
     24       7       2       5     107     100       1       8     136     129
     25       1       4       3     144     145       1       0     126     127
     26       7       2       5     111     104       3      10     136     129
     27       3       9       6      60      63       1       4     136     139
     28      19      10       9      89     108       2      17     131     150
     29       7       2       9     128     121       1       6     136     129
     30      10      18      28     111     101       2      12     141     131

Abbreviations: SIE, simulated injured (knee) extension; SIF, simulated injured (knee) flexion; SNE, simulated noninjured (knee) extension; SNF, simulated noninjured (knee) flexion.
* Difference between test and retest.

Appendix 2. Actual knee range of motion data measured on patients. Values in degrees.
                   PE                      PF                      AE                      AF
Subject    Test  Retest  Error*    Test  Retest  Error*    Test  Retest  Error*    Test  Retest  Error*
      1       0       0       0      80      85       5       0       5       5     142     140       2
      2       5       5       0      50      56       6       2       3       1     105     111       6
      3       5       4       1      54      70      16       0       0       0     130     134       4
      4       0       0       0     113     127      14      10      14       4      68      55      13
      5       5       0       5     112     120       8       0       0       0     145     140       5
      6       6       5      11     128     122       6       5      12       7      45      55      10
      7       0       7       7      35      38       3       7       8       1      50      62      12
      8      10       6       4     120     120       0       2      12      10     107     135      28
      9      10       5       5     123     135      12      12       5       7      95     112      17
     10       0       2       2     115     118       3      10       2      12     125     134       9
     11       0      12      12     100     102       2      20      14       6     118     129      11
     12       8       3       5     112     118       6      10      12       2     102     101       1
     13      10       0      10      78      85       7      15      13       2     108     108       0
     14       8       9       1      40      45       5       8       8      16      70      73       3
     15      15       5      10      60      80      20      18       4      14     110     120      10
     16       5       4       1     125     115      10      15       7       8     111     113       2
     17       0       0       0     120     130      10       0      18      18     118     110       8
     18      12       9       3     113     125      12       0       4       4     120     127       7
     19      10       3       7     121     115       6       0       0       0     108     110       2
     20       4       1       3     105     103       2       0       3       3     139     125      14
     21       5       7       2      78      83       5       0       0       0     124     110      14
     22       0       0       0     152     145       7       0       2       2     108     105       3
     23       5       0       5     168     140      28       8      11       3      95      98       3
     24       0       0       0      96     100       4       4       3       1      70      80      10
     25       0       5       5     145     143       2      15       4      11     148     142       6
     26       0       0       0     147     150       3       4       3       1     135     136       1
     27       6       0       6     100     105       5       0       0       0      89      86       3

Subjects 28 to 30 have missing measurements. Their recorded values, in printed order (Test, Retest, Error*), are:
     28: 74, 83, 9; 2, 0, 2; 143, 146, 3
     29: 0, 5, 5
     30: 0, 0, 0

Abbreviations: PE, passive extension; PF, passive flexion; AE, active extension; AF, active flexion.
* Difference between test and retest.
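The contrast between the ICC and the SEM drawn in the Discussion can be checked numerically on the simulated flexion columns of Appendix 1 (SIF and SNF), which carry identical test-retest differences but very different spread among subjects. The sketch below is illustrative only and rests on two assumptions not confirmed by the text: that equation 1 is the one-way ICC(1,1) of Shrout and Fleiss, (BMS − WMS)/(BMS + (k − 1)WMS), and that the SD in equation 6 is the sample SD of all pooled measurements. The function names are hypothetical.

```python
import math
import statistics

# Appendix 1 flexion columns as (test, retest) pairs, in degrees.
SIF = [(90, 91), (132, 130), (75, 89), (141, 136), (85, 92),
       (100, 100), (120, 120), (128, 121), (113, 111), (118, 116),
       (74, 76), (140, 135), (47, 47), (160, 157), (90, 87),
       (93, 92), (99, 102), (86, 97), (41, 48), (59, 69),
       (151, 139), (88, 81), (104, 104), (107, 100), (144, 145),
       (111, 104), (60, 63), (89, 108), (128, 121), (111, 101)]
SNF = [(121, 122), (124, 122), (122, 136), (136, 131), (125, 132),
       (132, 132), (128, 128), (128, 121), (130, 128), (140, 138),
       (125, 127), (135, 130), (129, 129), (136, 133), (130, 127),
       (128, 127), (126, 129), (128, 139), (128, 135), (129, 139),
       (121, 109), (121, 114), (129, 129), (136, 129), (126, 127),
       (136, 129), (136, 139), (131, 150), (136, 129), (141, 131)]


def mean_squares(pairs):
    """Return (BMS, WMS) from a one-way ANOVA with subjects as groups."""
    n = len(pairs)        # number of subjects
    k = len(pairs[0])     # measurements per subject (2 for test-retest)
    grand = statistics.fmean(x for p in pairs for x in p)
    subject_means = [statistics.fmean(p) for p in pairs]
    bms = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    wms = sum((x - m) ** 2
              for p, m in zip(pairs, subject_means)
              for x in p) / (n * (k - 1))
    return bms, wms


def icc_1_1(pairs):
    """One-way ICC(1,1); assumed here to correspond to equation 1."""
    k = len(pairs[0])
    bms, wms = mean_squares(pairs)
    return (bms - wms) / (bms + (k - 1) * wms)


def sem(pairs):
    """SEM = SD * sqrt(1 - ICC), with SD taken over all pooled measurements (assumption)."""
    sd = statistics.stdev(x for p in pairs for x in p)
    return sd * math.sqrt(1 - icc_1_1(pairs))


if __name__ == "__main__":
    for label, data in (("SIF", SIF), ("SNF", SNF)):
        bms, wms = mean_squares(data)
        print(f"{label}: BMS={bms:.1f} WMS={wms:.1f} "
              f"ICC={icc_1_1(data):.2f} SEM={sem(data):.2f}")
```

Because both columns share the same test-retest differences, their WMS is identical, so any gap between the two ICC values comes from the between-subject spread alone, while the SEM stays nearly the same for both.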