Does item homogeneity indicate internal consistency or item redundancy in psychometric scales?



Similar documents
Richard E. Zinbarg, Northwestern University, The Family Institute at Northwestern University. William Revelle, Northwestern University

Internal Consistency: Do We Really Know What It Is and How to Assess It?

X = T + E. Reliability. Reliability. Classical Test Theory 7/18/2012. Refers to the consistency or stability of scores

Constructing a TpB Questionnaire: Conceptual and Methodological Considerations

Q FACTOR ANALYSIS (Q-METHODOLOGY) AS DATA ANALYSIS TECHNIQUE

Estimate a WAIS Full Scale IQ with a score on the International Contest 2009

Basic Concepts in Classical Test Theory: Relating Variance Partitioning in Substantive Analyses. to the Same Process in Measurement Analyses ABSTRACT

Chapter 3 Psychometrics: Reliability & Validity

Exploring Graduates' Perceptions of the Quality of Higher Education

RESEARCH METHODS IN I/O PSYCHOLOGY

Original Article. Charles Spearman: British Behavioral Scientist

Reporting and Interpreting Scores Derived from Likert-type Scales

Exploratory Factor Analysis

Guided Reading 9th Edition. informed consent, protection from harm, deception, confidentiality, and anonymity.

A PARADIGM FOR DEVELOPING BETTER MEASURES OF MARKETING CONSTRUCTS

Reliability Overview

Validity and Reliability in Social Science Research

This chapter discusses some of the basic concepts in inferential statistics.

Instrument Validation Study. Regarding Leadership Circle Profile. By Industrial Psychology Department. Bowling Green State University

RESEARCH METHODS IN I/O PSYCHOLOGY

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.

MATHEMATICS AS THE CRITICAL FILTER: CURRICULAR EFFECTS ON GENDERED CAREER CHOICES

Exploring Epistemological Beliefs and Conceptual Change in Undergraduate Psychology Students

Designing a Questionnaire

Extending the debate between Spearman and Wilson 1929: When do single variables optimally reproduce the common part of the observed covariances?

Test Reliability Indicates More than Just Consistency

Psychological measurements: their uses and misuses

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Battery Bias

PARTIAL LEAST SQUARES IS TO LISREL AS PRINCIPAL COMPONENTS ANALYSIS IS TO COMMON FACTOR ANALYSIS. Wynne W. Chin University of Calgary, CANADA

Test Bias. As we have seen, psychological tests can be well-conceived and well-constructed, but

The relationship between emotional intelligence and school management

Factorial Invariance in Student Ratings of Instruction

[This document contains corrections to a few typos that were found on the version available through the journal's web page]

Syllabus for Psychology 492 Psychological Measurement Winter, 2006

BRIEF REPORT: Short Form of the VIA Inventory of Strengths: Construction and Initial Tests of Reliability and Validity

Levels of Measurement. 1. Purely by the numbers numerical criteria 2. Theoretical considerations conceptual criteria

Validation of the Core Self-Evaluations Scale research instrument in the conditions of Slovak Republic

Overview of Factor Analysis

Reliability Analysis

2011 Validity and Reliability Results Regarding the SIS

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

History and Purpose of the Principles for the Validation and Use of Personnel Selection Procedures

The Revised Dutch Rating System for Test Quality. Arne Evers. Work and Organizational Psychology. University of Amsterdam. and

ANALYZING TWO ASSUMPTIONS UNDERLYING THE SCORING OF CLASSROOM ASSESSMENTS

The Relationship between Social Intelligence and Job Satisfaction among MA and BA Teachers

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

WHAT IS A JOURNAL CLUB?

Factor Rotations in Factor Analyses.

Statistics, Research, & SPSS: The Basics

Application of a Psychometric Rating Model to

Part III. Item-Level Analysis

Pilot Testing and Sampling. An important component in the data collection process is that of the pilot study, which

How to report the percentage of explained common variance in exploratory factor analysis

Applications of Structural Equation Modeling in Social Sciences Research

Emotionally unstable? It spells trouble for work, relationships and life

General Symptom Measures

Canonical Correlation Analysis

Assessment, Case Conceptualization, Diagnosis, and Treatment Planning Overview

Choosing the Right Type of Rotation in PCA and EFA James Dean Brown (University of Hawai'i at Manoa)

Scale Construction and Psychometrics for Social and Personality Psychology. R. Michael Furr

ARE OBSESSIVE BELIEFS AND INTERPRETATIVE BIAS OF INTRUSIONS PREDICTORS OF OBSESSIVE COMPULSIVE SYMPTOMATOLOGY? A STUDY WITH A TURKISH SAMPLE

Multivariate Analysis of Variance (MANOVA)

SEM Analysis of the Impact of Knowledge Management, Total Quality Management and Innovation on Organizational Performance

Scores, 7: Immediate Recall, Delayed Recall, Yield 1, Yield 2, Shift, Total Suggestibility, Confabulation.

PsyD Psychology ( )

ASSESSMENT: Coaching Efficacy As Indicators Of Coach Education Program Needs

The Personal Learning Insights Profile Research Report

THE ACT INTEREST INVENTORY AND THE WORLD-OF-WORK MAP

Learner Self-efficacy Beliefs in a Computer-intensive Asynchronous College Algebra Course

Multiple Regression: What Is It?

Abstract. Introduction

Evaluating a Fatigue Management Training Program For Coaches

INVESTIGATING BUSINESS SCHOOLS INTENTIONS TO OFFER E-COMMERCE DEGREE-PROGRAMS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter 1 Introduction. 1.1 Introduction

Enhancing Customer Relationships in the Foodservice Industry

Undergraduate Psychology Major Learning Goals and Outcomes

What is Effect Size? effect size in Information Technology, Learning, and Performance Journal manuscripts.

The Self-Regulation Questionnaire (SRQ)

Publishing multiple journal articles from a single data set: Issues and recommendations

Harrison, P.L., & Oakland, T. (2003), Adaptive Behavior Assessment System Second Edition, San Antonio, TX: The Psychological Corporation.

A Reasoned Action Explanation for Survey Nonresponse 1

interpretation and implication of Keogh, Barnes, Joiner, and Littleton's paper Gender,

Multidimensional Constructs in Organizational Behavior Research: An Integrative Analytical Framework

LEARNING OUTCOMES FOR THE PSYCHOLOGY MAJOR

Using a Mental Measurements Yearbook Review to Evaluate a Test

Cognitive Behavior Group Therapy in Mathematics Anxiety

A REVIEW OF SCALE DEVELOPMENT PRACTICES IN NONPROFIT MANAGEMENT AND MARKETING

What Is a Case Study? series of related events) which the analyst believes exhibits (or exhibit) the operation of

Expectancy Value Theory: Motivating Healthcare Workers

Test-Retest Reliability and The Birkman Method Frank R. Larkey & Jennifer L. Knight, 2002

Structural Equation Modelling (SEM)

Quantitative Research: Reliability and Validity

T-test & factor analysis

Statistics. Measurement. Scales of Measurement 7/18/2012

Report on the Ontario Principals' Council Leadership Study

MEASURING INFORMATION QUALITY OF WEB SITES: DEVELOPMENT OF AN INSTRUMENT

Validity, Fairness, and Testing

Transcription:

Does item homogeneity indicate internal consistency or item redundancy in psychometric scales?

By Gregory J. Boyle, Department of Psychology, University of Queensland, St Lucia 4067, Queensland, Australia

Abstract

The term "internal consistency" has been used extensively in classical psychometrics to refer to the reliability of a scale based on the degree of within-scale item intercorrelation, as measured by, say, the split-half method, or more adequately by Cronbach's (1951, Psychometrika, 16, 297-334) alpha, as well as by the KR-20 and KR-21 coefficients. This term is a misnomer, as a high estimate of internal item consistency/item homogeneity may also suggest a high level of item redundancy, wherein essentially the same item is rephrased in several different ways. Internal consistency or item homogeneity is often used for estimating intra-scale reliability, in terms of the item variances and covariances derived from a single occasion of measurement. While it is desirable that items in a psychometric scale measure something in common (i.e. exhibit unidimensionality), Hattie (1985) has indicated that there is still no satisfactory index of unidimensionality. As Hattie (pp. 157-158) pointed out, a unidimensional scale (having an underlying latent trait) is not necessarily reliable, internally consistent or homogeneous. Hattie concluded that the frequent use of Cronbach's alpha coefficient as a measure of unidimensionality is not justified. Hattie further stated that alpha can be high even if there is no general factor, since (1) it is influenced by the number of items and parallel repetitions of items, (2) it increases as the number of factors pertaining to each item increases, and (3) it decreases moderately as the item communalities increase. The subsequent assertion by Ray (1988) that the internal consistency of a psychometric scale should be maximised represents a further restatement of classical itemetric theory, and ignores the previous work of Hattie (1985) and many others, as outlined below.

There is an optimal range of internal consistency/item homogeneity if significant item redundancy is to be avoided (Boyle, 1983, 1985, 1986). According to Kline (1979, p. 3), with item inter-correlations lower than about 0.3, "each part of the test must be measuring something different"; a correlation higher than 0.7, on the other hand, suggests that the test is too narrow and too specific: "if one constructs items that are virtually paraphrases of each other, the results would be high internal consistency and very low validity". Furthermore, according to Kline (1986, p. 3), maximum validity is obtained where test items do not all correlate with each other, but where each correlates positively with the criterion. Such a test would have only low internal-consistency reliability.
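Kline's optimal band is easy to demonstrate numerically. The following is a minimal sketch, not part of the original article (hypothetical Python, assuming NumPy is available; the sample size, item count and correlations are invented for illustration): two ten-item unidimensional scales differ only in mean inter-item correlation, and the near-paraphrase set returns a much higher alpha while sampling a much narrower slice of the construct.

    import numpy as np

    def cronbach_alpha(X):
        # Coefficient alpha from an n_persons x n_items score matrix.
        X = np.asarray(X, dtype=float)
        k = X.shape[1]
        item_vars = X.var(axis=0, ddof=1).sum()
        total_var = X.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    rng = np.random.default_rng(0)

    def simulate_items(n, k, r):
        # k standardized items sharing a single common factor, so every
        # inter-item correlation is approximately r.
        g = rng.standard_normal(n)
        return np.sqrt(r) * g[:, None] + np.sqrt(1 - r) * rng.standard_normal((n, k))

    redundant = simulate_items(5000, 10, 0.8)   # near-paraphrase items
    diverse = simulate_items(5000, 10, 0.3)     # moderate homogeneity
    print(cronbach_alpha(redundant))            # ~0.98: "too narrow and too specific"
    print(cronbach_alpha(diverse))              # ~0.81: broader coverage, still adequate

For standardized items, alpha = k*r_bar/(1 + (k-1)*r_bar), so the higher alpha of the redundant set reflects nothing more than item homogeneity.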

As Cattell (1978) pointed out, a scale comprising many items which are essentially repetitions of each other can appear in factor analysis as a "bloated specific" (as in Guilford's S-O-I model of intellectual structure; cf. Brody & Brody, 1976). Kline (1986, pp. 118-119) further remarked that "high internal consistency can be antithetical to high validity... the importance of internal-consistency reliability has been exaggerated in psychometry (i.e. I agree with Cattell)". According to Hayes, Nelson and Jarrett (1987, p. 972), "a measure could readily have treatment utility without internal consistency... high internal consistency should not necessarily be expected". Likewise, as Allen and Potkay (1983, p. 1088), Lachar and Wirt (1981, p. 616) and McDonald (1981) have all shown, either high or low item homogeneity can be associated with either high or low reliability, despite classical itemetric opinion. According to McDonald (p. 113), "coefficient alpha cannot be used as a reliability coefficient". McDonald (p. 100) has refuted, on mathematical grounds, the commonly held belief that the alpha coefficient measures the internal consistency or item homogeneity of a scale. McDonald (p. 110) stated that "it has never been made clear what is meant by internal consistency or why KR-20 or coefficient alpha can be deemed to measure it", and that confusion pertaining to coefficient alpha has a long history, reviewed by Green, Lissitz and Mulaik (1977). Furthermore, McDonald (p. 111) concluded that "alpha has not been shown to be a quantitative measure of any intelligible and useful psychometric concept, except when computed from items with equal covariances". This conclusion was based on the original use of item homogeneity as an estimate of scale reliability by Gulliksen (1950), which was shown by Lord and Novick (1968) to be valid only when items are tau-equivalent. Accordingly, it may often be more appropriate to regard estimates such as the alpha coefficient as indicators of item redundancy and narrowness of a scale (cf. Boyle, 1985).

Items should be selected which are loaded maximally by the factor representing the scale, but which exhibit moderate to low item inter-correlations, in order to maximise the breadth of measurement of the given factor. Merely adding items to a scale, as classical itemetrics has advocated in accord with the Spearman-Brown formula, ignores the error variance associated with each item, and must be regarded by any contemporary and objective assessment (such as that provided by LISREL congeneric factor analysis; Jöreskog & Sörbom, 1989) as a rather unsophisticated method of increasing scale reliability. Ray (1988) uncritically cited Nunnally (1967), not Nunnally (1978), as well as Cronbach (1951) in restating classical reliability theory. However, Pedhazur (1982, p. 636) has indicated that Nunnally's classical approach to reliability failed to acknowledge that measurement errors are often systematic and non-random. Ray's comments are therefore founded on psychometric views published over 20 years ago!
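The Lord and Novick qualification above can be made concrete with a worked sketch (hypothetical Python with NumPy; the loadings are invented for illustration): for a congeneric one-factor model with loadings lambda and uniquenesses theta, the model-based reliability is omega = (sum lambda)^2 / ((sum lambda)^2 + sum theta), and alpha computed from the model-implied covariance matrix falls below omega unless the loadings, and hence the item covariances, are equal.

    import numpy as np

    # Congeneric one-factor model (standardized items; values invented).
    lam = np.array([0.9, 0.7, 0.5, 0.3])          # unequal loadings: NOT tau-equivalent
    theta = 1 - lam**2                            # unique variances
    sigma = np.outer(lam, lam) + np.diag(theta)   # model-implied covariance matrix
    k = len(lam)

    # Alpha from the covariance matrix vs model-based reliability (omega).
    alpha = (k / (k - 1)) * (1 - np.trace(sigma) / sigma.sum())
    omega = lam.sum()**2 / (lam.sum()**2 + theta.sum())
    print(round(alpha, 3), round(omega, 3))       # 0.677 vs 0.709: alpha understates

    # With equal loadings (tau-equivalent items, equal covariances) they coincide.
    lam_eq = np.full(k, 0.6)
    sigma_eq = np.outer(lam_eq, lam_eq) + np.diag(1 - lam_eq**2)
    alpha_eq = (k / (k - 1)) * (1 - np.trace(sigma_eq) / sigma_eq.sum())
    omega_eq = lam_eq.sum()**2 / (lam_eq.sum()**2 + (1 - lam_eq**2).sum())
    print(round(alpha_eq, 3), round(omega_eq, 3)) # 0.692 vs 0.692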

Ray (1988) claimed that broad validity of a scale is facilitated by the use of subscales. However, this ignores the fact that in many multidimensional psychometric instruments (such as the EPI, EPQ, JEPI, 16PF, CAQ, 8SQ, POMS, DES-IV, MAACL, etc.) each subscale actually measures a discrete factor-analytic dimension. Despite Ray's dogmatic assertions, semantic overlap of items is only one possible influence on observed item inter-correlations, as indicated above in relation to Hattie's (1985) work. As well, Ray made no distinction between state and trait scales (cf. Boyle, 1983, 1985, 1986, 1987). While a reliable trait scale should exhibit high test-retest correlations both for immediate retest (dependability) and for longer-term retest (stability), a reliable state scale should exhibit only a high dependability coefficient, if the scale is truly sensitive to situational variability in mood.

Ray and Pedersen (1986) asserted, on the basis of a highly biased, unrepresentative and very restricted sample of the U.S.A. population, that Eysenck's Psychoticism scale in the EPQ was a "failed experiment", not on the grounds of inadequate validity, but again merely on the basis of dated classical itemetric references. Ray objected to Eysenck's Psychoticism scale because he found that the mean item inter-correlations were only moderate. Yet Ray's results with the EPQ were clearly biased due to severe restriction of variance in his data. Ray (1988) subsequently criticised Smedslund (1987) for not appreciating the virtues of the EPQ, despite having denigrated it in the Ray and Pedersen note (cf. Smedslund, 1988). This amounts to little more than the pot calling the kettle black.

Ray (1988) recommended Comrey's FHID approach to scale construction with the aim of increasing scale reliability. However, he was mistaken as to the actual composition of the item parcels in the CPS (four items counterbalanced for direction of scoring, not three as stated). While it is undoubtedly true that such item-parcel variables are more reliable than single items, nevertheless, for a specified number of items in a scale, less of the pertinent construct is actually measured. Moreover, Cattell (1973, p. 360) has indicated that "the high homogeneity in the FHIDs is carried over with the second-order factor scales, leaving them excessively homogeneous". Hence, Ray's assertions concerning scale construction with item-parcel variables would seem quite inadequate.

Cattell (1973, pp. 357-379; 1978, pp. 289-293; 1982) has argued that generally there is an optimally low level of item homogeneity. Cattell provided a conceptual demonstration of high item validity in the context of zero item homogeneity. Since a scale which is valid must also be reliable, it follows that it is theoretically possible for a scale to be reliable even though its internal consistency is zero. On the other hand, it is well known that even a highly reliable scale is not necessarily valid. Any number of invalid scales can be made more reliable simply by adding further invalid items in accord with the Spearman-Brown prophecy formula, and/or by adding further items which are essentially mere repetitions of items already included in the scale. Ray's (1988) recommendations, if followed, can only result in significant item redundancy and likely contamination of the factor purity of psychometric scales.
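A worked Spearman-Brown example makes this point directly (a sketch with invented values; under parallel-item assumptions the inter-item correlation serves as the single-item reliability): lengthening a scale of homogeneous but weakly valid items drives reliability toward 1.0 while the validity of the total score plateaus.

    import math

    def spearman_brown(r11, k):
        # Reliability of a test lengthened k-fold (Spearman-Brown prophecy).
        return k * r11 / (1 + (k - 1) * r11)

    def total_score_validity(r_iy, r_ii, k):
        # Correlation of a k-item sum with a criterion, given uniform
        # item validity r_iy and uniform inter-item correlation r_ii.
        return math.sqrt(k) * r_iy / math.sqrt(1 + (k - 1) * r_ii)

    r_ii, r_iy = 0.5, 0.2   # highly homogeneous, weakly valid items
    for k in (4, 8, 16, 64):
        print(k, round(spearman_brown(r_ii, k), 2),
              round(total_score_validity(r_iy, r_ii, k), 2))
    # Reliability: 0.80, 0.89, 0.94, 0.98; validity stalls near 0.2/sqrt(0.5) ~ 0.28.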

The advantage of moderate to low item homogeneity is seen in multiple regression analysis, wherein a higher multiple R is produced from predictor variables (items) with only moderate item inter-correlations (a numerical sketch is given at the end of this section). Cattell's behavioural dispersion principle suggests that only when there is considerable item diversity, enabling sampling of behaviours across a wide spectrum of life expressions, can individuals be advantaged equally in responding to the items in a particular psychometric scale. As well, reduced item homogeneity facilitates the maintenance of validity across different cultures. A given item may elicit discrepant responses in different cultural settings. If there is high item homogeneity and most of the items are similar (i.e. there is significant item redundancy; cf. Boyle, 1985), measurement error due to cultural distortions will probably be evident. This problem can be minimised by including a wide diversity of items (i.e. maximising breadth of measurement) in psychometric scale construction.

Cattell indicated that a scale which has high internal consistency is probably contaminated, on the one hand, by a bloated specific factor (such as in Guilford's S-O-I model), wherein over-inclusion of particular items pertaining to a specific dimension gives the impression of a substantive factor, despite its lack of practical significance and evident triviality. On the other hand, psychometric scale contamination occurs through the inclusion of several items predictive of an unwanted common factor. Cattell (1978, p. 289) demonstrated that "a very narrow specific can be blown up to the apparent status of a common factor in any given matrix by entering the experiment with several items that are close variants on the specific variable". In this instance, item homogeneity (internal consistency) is increased by confounding the true factor with a bloated specific. Selection of items with high homogeneity/internal consistency undoubtedly often results in a scale with a contaminated factor structure. To minimise these distorting influences, it is desirable to invoke suppressor action by including items that are loaded positively and negatively on the unwanted dimensions, and which also are loaded significantly on the relevant common factor. In contrast to Ray's (1988) restatement of classical itemetric opinion, Cattell (1973, p. 359) asserted that "in practice the random tendency to opposite loadings on these other factors will reduce the item homogeneity virtually to zero". Item diversity therefore results in reduced item homogeneity and, concomitantly, reduced item inter-correlations, but maximises breadth of measurement of a given construct. However, Cattell cautioned that, since low homogeneity means different specific factors and suppressor action by opposite loadings on unwanted common factors, "a test which (misguidedly) advertises high homogeneity is contaminated either with a bloated specific or by items sharing a common unwanted factor". In summary, high internal consistency/item homogeneity results spuriously from the inadvertent inclusion of essentially similar items in a psychometric scale.
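Returning to the multiple-regression point above: for k equicorrelated predictors, each correlating r_iy with the criterion and r_xx with one another, the standard result R^2 = k*r_iy^2/(1 + (k-1)*r_xx) shows R falling as homogeneity rises for fixed item validity (a sketch with invented values, under the equicorrelation assumption):

    import math

    def multiple_R(k, r_iy, r_xx):
        # Multiple correlation of a criterion with k equicorrelated predictors,
        # each with item-criterion correlation r_iy and inter-item r_xx.
        return math.sqrt(k * r_iy**2 / (1 + (k - 1) * r_xx))

    # Ten items of fixed validity: redundancy (high r_xx) erodes multiple R.
    for r_xx in (0.1, 0.3, 0.5, 0.8):
        print(r_xx, round(multiple_R(10, 0.25, r_xx), 2))
    # 0.1 -> 0.57, 0.3 -> 0.41, 0.5 -> 0.34, 0.8 -> 0.28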

Determination of what should be considered appropriate item homogeneity for a scale is, according to Cattell (1973, pp. 361-362), "far more complex than is commonly considered in classical itemetrics... The complexity is generated on the one side by the natural history of the domain and on the other by the unusual complexity of the purely statistical psychometric laws involved". According to Cattell, to obtain a broad but valid, behaviourally based rather than semantically based scale, test constructors will need "to sift by factor analysis hundreds of items to get those having validity despite high diversity". In this regard, the newer congeneric factor analytic methods using programs such as LISREL (Jöreskog & Sörbom, 1989) will undoubtedly minimise the amount of noise which is so prevalent among the items of many existing psychometric scales designed along classical psychometric lines, wherein internal consistency has been maximised. This traditional itemetric view of intra-class correlation still persists in the contemporary psychometric literature (e.g. Crocker & Algina, 1986, pp. 119-122; Cronbach, 1990, pp. 202-204; Ferguson, 1981, pp. 438-439; also see Boyle, 1987, for a discussion of the limitations of the 1985 AERA/APA/NCME Standards in this regard). However, especially in the non-ability areas of motivation, personality and mood states, moderate to low item homogeneity is actually preferred if one is to ensure broad coverage of the particular constructs being measured.

References

Allen and Potkay, 1983. B.P. Allen and C.R. Potkay, Just as arbitrary as ever: comments on Zuckerman's rejoinder. Journal of Personality and Social Psychology 44 (1983), pp. 1087-1089.
Boyle, 1983. G.J. Boyle, Critical review of state-trait curiosity test development. Motivation and Emotion 7 (1983), pp. 377-397.
Boyle, 1985. G.J. Boyle, Self-report measures of depression: some psychometric considerations. British Journal of Clinical Psychology 24 (1985), pp. 45-59.
Boyle, 1986. G.J. Boyle, Higher-order factors in the Differential Emotions Scale (DES-III). Personality and Individual Differences 7 (1986), pp. 305-310.
Boyle, 1987. G.J. Boyle, Review of the (1985) Standards for educational and psychological testing: AERA, APA and NCME. Australian Journal of Psychology 39 (1987), pp. 235-237.
Brody and Brody, 1976. E.B. Brody and N. Brody, Intelligence: Nature, determinants, and consequences. Academic Press, New York (1976).
Cattell, 1973. R.B. Cattell, Personality and mood by questionnaire. Jossey-Bass, San Francisco, CA (1973).
Cattell, 1978. R.B. Cattell, Scientific use of factor analysis in behavioral and life sciences. Plenum Press, New York (1978).

Cattell, 1982. R.B. Cattell, The psychometry of objective motivation measurement: a response to the critique of Cooper and Kline. British Journal of Educational Psychology 52 (1982), pp. 234-241.
Crocker and Algina, 1986. L. Crocker and J. Algina, Introduction to classical and modern test theory. Holt, Rinehart & Winston, New York (1986).
Cronbach, 1951. L.J. Cronbach, Coefficient alpha and the internal structure of tests. Psychometrika 16 (1951), pp. 297-334.
Cronbach, 1990. L.J. Cronbach, Essentials of psychological testing (5th edn). Harper & Row, New York (1990).
Ferguson, 1981. G.A. Ferguson, Statistical analysis in psychology and education (5th edn). McGraw-Hill, Auckland (1981).
Green et al., 1977. S.B. Green, R.W. Lissitz and S.A. Mulaik, Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement 37 (1977), pp. 827-838.
Gulliksen, 1950. H. Gulliksen, Theory of mental tests. Wiley, New York (1950).
Hattie, 1985. J. Hattie, Methodology review: assessing unidimensionality of tests and items. Applied Psychological Measurement 9 (1985), pp. 139-164.
Hayes et al., 1987. S.C. Hayes, R.O. Nelson and J.B. Jarrett, The treatment utility of assessment: a functional approach to evaluating assessment quality. American Psychologist 42 (1987), pp. 963-974.
Jöreskog and Sörbom, 1989. K.G. Jöreskog and D. Sörbom, LISREL 7: A guide to the program and applications. SPSS Inc., Chicago, IL (1989).
Kline, 1979. P. Kline, Psychometrics and psychology. Academic Press, London (1979).
Kline, 1986. P. Kline, A handbook of test construction: Introduction to psychometric design. Methuen, New York (1986).
Lachar and Wirt, 1981. D. Lachar and R.D. Wirt, A data-based analysis of the psychometric performance of the Personality Inventory for Children (PIC): an alternative to the Achenbach review. Journal of Personality Assessment 45 (1981), pp. 614-616.
Lord and Novick, 1968. F.M. Lord and M.R. Novick, Statistical theories of mental test scores. Addison-Wesley, Reading, MA (1968).
McDonald, 1981. R.P. McDonald, The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology 34 (1981), pp. 100-117.

Nunnally, 1967/1978. J.C. Nunnally, Psychometric theory. McGraw-Hill, New York (1967/1978).
Pedhazur, 1982. E.J. Pedhazur, Multiple regression in behavioral research. Holt, Rinehart & Winston, New York (1982).
Ray, 1988. J.J. Ray, Semantic overlap between scale items may be a good thing: reply to Smedslund. Scandinavian Journal of Psychology 29 (1988), pp. 145-147.
Ray and Pedersen, 1986. J.J. Ray and R. Pedersen, Internal consistency in the Eysenck Psychoticism scale. Journal of Psychology 120 (1986), pp. 635-636.
Smedslund, 1987. J. Smedslund, The epistemic status of inter-item correlations in Eysenck's Personality Questionnaire: the a priori versus the empirical in psychological data. Scandinavian Journal of Psychology 28 (1987), pp. 42-55.
Smedslund, 1988. J. Smedslund, What is measured by a psychological measure? Scandinavian Journal of Psychology 29 (1988), pp. 148-151.
Standards, 1985. Standards for educational and psychological testing. AERA/APA/NCME, American Psychological Association, Washington, DC (1985).