Research Skills for Psychology Majors: Everything You Need to Know to Get Started

Reliability

Reliability and validity, the topics of this and the next chapter, are twins and cannot be completely separated. These two concepts comprise the dual holy grail of research: outside of the central importance of theory, they are crucial to any sort of meaningful research. Without reliability and validity, research is nonsense. Many forms of reliability and of validity have been identified, perhaps to the point that the words themselves have been stretched thin. Reliability refers to consistency of measurement and takes several forms: whether a construct is measured in a manner that is stable over time; whether the items in a test are congruent with each other; whether two supposedly identical tests really are the same; and whether people performing ratings agree with each other. Validity is a deeper concept: does the instrument measure the right construct? Is the experiment constructed so that the independent variable actually represents the theoretical construct? Is it really the independent variable that caused changes in the dependent variable? And more. Four kinds of reliability are usually identified in behavioral research: test-retest, parallel forms, internal consistency, and inter-rater.

Test-Retest Reliability

A good test (or "instrument") is stable over repeated administrations. An IQ test given to the same person three times in one month should reveal the same IQ score, more or less, each time. Likewise, any adequate measure of personality, attitudes, skills, values, or beliefs should come out the same, time after time, unless the person him- or herself actually changes in either a short-term or long-term sense. Some degree of variability is expected due to temporary changes in the person, the situation, or just plain random factors like luck, but in general the scores should be consistent.

The test-retest reliability of a measure is estimated using a reliability coefficient, often a correlation coefficient calculated between the administrations of the test. (Correlation coefficients are described in the SPSS-Basic Analyses chapter; see the sidebar for a quick explanation.) A typical research study of test-retest reliability would administer a test to a sample of 100 people, wait a month, then readminister it to the same people. The correlation between the two administrations is an indicator of the instrument's reliability.

© 2003 W. K. Gabrenya Jr. Version: 1.0
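As a concrete sketch of the correlation-based estimate just described, the snippet below computes Pearson's r between two administrations of a test. The scores and the `pearson_r` helper are invented for illustration; they are not data from the chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical IQ scores for five people tested one month apart
time1 = [98, 105, 112, 121, 130]
time2 = [101, 103, 115, 119, 133]
print(round(pearson_r(time1, time2), 2))  # → 0.98, high test-retest reliability
```

A real study would of course use a much larger sample (the chapter's example uses 100 people), but the computation is the same.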

Parallel Forms Reliability

Forms are alternate versions of the same test; we sometimes use the terms "Form A" and "Form B" to identify such versions of a test. Parallel forms are forms that really do measure the same thing, that is, are equivalent. Parallel forms of a test are developed for situations where we must obtain essentially the same information from people at several different but close-together times, and we don't want their exposure to the test at one time to affect their responses at other times. This carry-over effect can occur due to practice (on a skills-oriented test) or to remembering the items from a previous administration. For example, the author's research on international student adjustment required obtaining psychological adjustment measures from students every week for 10 weeks. At some point, it was expected that the students would stop thinking about the test questions and just answer automatically. To try to avoid this, two forms of the adjustment measure were developed and administered on alternate weeks.

Parallel forms reliability is measured using the correlation coefficient between the forms. In a typical study, a group of 100 students would take all forms of the test. The correlations among the forms represent the extent of parallel forms reliability.

Sidebar: Correlation Coefficient

Correlation refers to the relationship or association between two variables within a sample. For example, we would expect to find a correlation between height and weight in a sample of people: generally, taller people weigh more. The relationship can range from very strong to nonexistent. Height and weight show a fairly strong relationship, although of course there are exceptions such as tall thin people and short wide people. On the other hand, weight and IQ show no relationship. Sometimes, as one variable (height) increases, another variable (weight) also increases, evidencing a positive relationship. Other times we see a negative relationship, such as between smoking cigarettes (packs per week) and life expectancy (years). All of these relationships are examined within groups of people and refer to the general trend present in the group, not to individuals.

The correlation coefficient is a statistical estimate of the strength of the relationship. (A more formal, theoretical explanation of correlation is presented in the SPSS-Basic Analyses chapter.) Correlation coefficients are symbolized by the letter r and range from 1 through 0 to -1. At r = 0 there is no relationship, as with weight and IQ. At r = 1 there is a perfect positive relationship: as one variable increases, the other one increases in perfect lockstep. Such relationships are rare in the real world. At r = -1 there is a perfect negative relationship, also rare. Correlation coefficients can fall anywhere along the scale from 1 to -1. For example, the relationship between people's personality traits and their behaviors tends to be around r = .30 (not so good). The correlation between the SAT-Math and overall university GPA among psychology undergraduates here is r = .53, a moderately good relationship.

Internal Consistency Reliability

Often it is important that the individual items in a test measure the same thing, or almost the same thing. If items are not consistent with each other, the overall test score will reflect several different underlying constructs and won't make any sense. For example, a measure of attitudes toward pizza might include several different qualities of pizza:

1. Pizza tastes good. (Strongly Agree - Somewhat Agree - Neutral - Somewhat Disagree - Strongly Disagree)
2. I feel hungry when I think of pizza.
3. Pizza is inexpensive.
4. Pizza is easy to get.
5. Pizza is popular in New York.

These items should be sufficiently similar to each other so that adding up the item responses produces a total score that focuses on the intended construct. Items 1 and 2 would probably have high internal consistency reliability. Items 1, 2, 3, and 4 would probably have moderately high reliability. Item 5 really does not fit this test and would show little internal consistency reliability with the other items.
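The "fitting in" of items can be sketched numerically with Cronbach's Alpha, a common internal consistency index. The 1-5 responses below are invented to mimic the pizza items, with the last row of scores playing the role of the out-of-place item 5.

```python
def cronbach_alpha(items):
    """Cronbach's Alpha from a list of items, each a list of one score per respondent."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(item) for item in items) / var(totals))

# Invented 1-5 ratings from six respondents; the last item is unrelated to the rest
consistent = [[1, 2, 3, 4, 5, 5], [2, 2, 3, 4, 4, 5], [1, 3, 3, 3, 5, 4]]
misfit = [3, 5, 1, 4, 2, 3]

print(round(cronbach_alpha(consistent), 2))             # → 0.94
print(round(cronbach_alpha(consistent + [misfit]), 2))  # → 0.63
```

Adding the unrelated item drags Alpha down sharply, which is exactly how a misfit like the New York item would reveal itself.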

Internal consistency reliability is measured using special types of correlation coefficients termed Cronbach's Alpha and the Kuder-Richardson Coefficient. When researchers make new tests and examine aspects of the test such as the internal consistency reliability of the items, the procedure is termed item analysis.

Inter-Rater Reliability

A somewhat different sort of reliability is at issue when the same stimulus (person, event, behavior, etc.) must be rated by more than one rater. For example, in studies of the relationship between physical attractiveness and social development, the researchers need to know how attractive each person is. (Research of this kind asks questions such as, "Do prettier people develop better social skills?") How can this rating be done? Calculate the ratio of length of nose to distance between the ears? While some such physical indexes of attractiveness have been developed, the most common way is to assemble a panel of judges to rate the stimuli. (Sounds like figure skating judging, but there is no bribery involved.) The researcher needs to look at the extent to which the raters agree on their ratings. When inter-rater reliability is low, the researcher has to wonder whether it is possible to classify persons on a dimension such as attractiveness, or whether his or her attempts to do so have failed. Inter-rater reliability is measured in several ways, such as the percentage of agreement among the judges or a correlation-like coefficient called Kappa.

A Research Example

Here's a semi-hypothetical example that brings these reliability issues together. Of course, a complete example would require considering theory and reliability's twin, validity. The researcher wants to know if personality traits are related to sexual activity. She cannot perform a true experiment because personality traits cannot be controlled or manipulated, so she must be content with simply measuring the traits and assessing sexual behavior.
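Before following the example further, the two inter-rater indices named above can be sketched for the simplest case of two raters. The photo ratings below are invented; a panel of four raters, as in this example, would call for a multi-rater generalization of Kappa (such as Fleiss' Kappa).

```python
def percent_agreement(r1, r2):
    """Proportion of stimuli on which two raters give the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    n = len(r1)
    po = percent_agreement(r1, r2)  # observed agreement
    # Expected chance agreement from each rater's category base rates
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

# Invented attractiveness ratings of ten photos by two judges
judge1 = ["hi", "hi", "lo", "hi", "lo", "lo", "hi", "lo", "hi", "hi"]
judge2 = ["hi", "lo", "lo", "hi", "lo", "lo", "hi", "lo", "hi", "hi"]

print(percent_agreement(judge1, judge2))      # → 0.9
print(round(cohen_kappa(judge1, judge2), 2))  # → 0.8
```

Kappa is lower than raw agreement because some agreement would occur by chance even if the judges rated randomly.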
The trait of interest is Self-Monitoring, a blend of extraversion, willingness to self-present, self-presentation abilities, and low social anxiety. High Self-Monitors are outgoing people who know how to act in various situations to get what they want, and do so. A measure of sexual behavior is created just for this study, the Sexual Activity Test (SAT). The Self-Monitoring Scale (SMS) already exists, so she will use it as it is. She must, however, construct the SAT from scratch. The details of how she would actually do this are complicated, but in the end she has a 20-item test that assesses various interpersonal sexual behaviors.

The researcher theorizes that, overall, the SAT measures a unitary construct. A unitary construct has one central idea or dimension rather than several sub-dimensions. For example, some psychologists believe that IQ may not be a unitary construct, arguing that there are several different kinds of intelligence. If true, then a single IQ score is meaningless. To determine whether the SAT assesses a unitary construct, the researcher gives the initial version of the test to a large sample of people who will not be in the real study, then performs an item analysis. She looks at the internal consistency reliability of the 20 items to see if they "hang together." (She also does some other things that are beyond our interest here.) Coefficient alpha, a common measure of internal consistency, turns out to be α = .45. This is too low, indicating that the items are not measuring the same thing. She has two choices: (1) give up on the idea that the SAT will assess a unitary construct and try to find the two or more sub-dimensions that represent interpersonal sexual activity; or (2) find the bad items and get rid of them. She chooses the latter.

A bad item is one that is poorly related to the other items, sort of like a human who refuses to fit in to a social group. To find the bad items, she correlates each item with the total score (the average of all the items). These 20 correlations are termed item-total correlations. She looks for items with a poor relationship to the total score and eliminates them from the SAT. Then she recalculates coefficient alpha on the new, smaller test and, hooray, it is now α = .85 (very good). As the Japanese say: the nail that sticks up gets pounded down.

Next, she wants to make sure that the SAT is stable over time. She finds still another sample of people and gives them the SAT twice, one month apart. The correlation between time 1 and time 2, the test-retest reliability, turns out to be r = .70. This is very good given the fact that people do change over time, and some instability in the test is expected for this reason rather than due to the test's qualities.

The researcher wants to do a longitudinal study of Self-Monitoring and sexual behavior so she can figure out whether these two variables change over time, whether the relationship between them changes over time, and possibly which one causes the other. A longitudinal study measures the same thing over time.
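Her item-screening step can be sketched as follows. The four-item data set is an invented stand-in for her 20 items, with the last item deliberately unrelated to the others; its low item-total correlation is what flags it for removal.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

def item_total_correlations(items):
    """Correlate each item with the total score across respondents."""
    n = len(items[0])
    totals = [sum(item[i] for item in items) for i in range(n)]
    return [pearson_r(item, totals) for item in items]

# Invented responses from six people to four items; the last item is the "bad" one
items = [[1, 2, 3, 4, 5, 5], [2, 2, 3, 4, 4, 5], [1, 3, 3, 3, 5, 4], [3, 5, 1, 4, 2, 3]]
for i, r in enumerate(item_total_correlations(items), start=1):
    print(f"item {i}: r = {r:.2f}")  # item 4 stands out with a very low r
```

Dropping the low-correlating item and recomputing alpha is exactly the move that takes her from α = .45 to α = .85 in the narrative above.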
She has some fear that giving the same tests several times will reduce the quality of her results because her research subjects will remember the items from administration to administration and answer automatically, without thinking. She needs at least two versions of each test. To produce these versions, she creates two SMSs by dividing up the items using the item analysis information published some time ago by the brilliant young psychologists Gabrenya and Arkin (1980). She does the same for the SAT. To make sure that she has parallel forms, she gives all four tests (two SMSs and two SATs) to another sample of people and calculates the correlation between the parallel forms. Because she is such a meticulous researcher, and because the earlier work of Gabrenya and Arkin was so fine, she obtains parallel forms reliability coefficients that are good: r = .75 for the SMS and r = .68 for the SAT.

Now, finally, she is ready to perform her research. She selects a sample of 50 males and 50 females in the range of 22-25 years old, all unmarried, and administers her two tests to them four times, once every six months. She gives Form A of the SMS and Form A of the SAT the first time, Form B the second time, Form A the third time, and so on. Unfortunately, she gets very noisy results: the correlations between the SAT and SMS are in the right direction, but low. She concludes that something else is affecting sexual activity and, based on other social psychology research, theorizes that it is the physical attractiveness of the subjects.

She must now evaluate each subject's physical attractiveness. She brings all 100 into her lab and takes professional-quality photos of them from a variety of angles. Then she assembles a panel of two men and two women in the 22-25 age range and has them rate each photo on attractiveness. To make sure they are producing a reliable measure, she looks at the agreement rates among the four raters. Coefficient Kappa comes out to be K = .65, which is good enough. Now she can use the attractiveness ratings to lower the noise (error variance) in her data.

How would such a study actually come out? This particular study has not been performed, but components of it have. Self-Monitoring does predict higher sexual activity, and attractive people do get more dates. Based on other studies, we would predict that SMS would cause sexual activity, not the opposite. Adding attractiveness would undoubtedly strengthen the results of the study.

Reference

Gabrenya, W. K., Jr., & Arkin, R. M. (1980). Self-Monitoring Scale: Factor structure and correlates. Personality and Social Psychology Bulletin, 6, 13-22.