Psychometrics 101 Part 2: Essentials of Test Score Interpretation. Steve Saladin, Ph.D. University of Idaho


Standards for Educational and Psychological Testing, 15.10: "Those responsible for testing programs should provide appropriate interpretations when test score information is released to students, parents, legal representatives, teachers, or the media. The interpretations should describe in simple language what the test covers, what scores mean, common misinterpretations of test scores, and how scores will be used."

"Pay no attention to the man behind the curtain!" (L. Frank Baum, The Wonderful Wizard of Oz)

Where are we going today?
- A score is only as good as the test: reliability and validity
- Not all scales are created equal: nominal to ratio
- A score by any other name: norm-referenced or criterion-referenced
- Yeah, but what does my score mean?

Error, Error Everywhere
No test is perfect and no measurement is perfect; there is always error in any measurement.
Score = Truth + Error

Error, Error Everywhere
Error can be lots of things, including:
- The environment
- The test-taker
- Procedural variations
- The test itself, e.g., the flawed item "John has 13 stickers. Jill gives him 10 more. How many stickers does Jack have?"
Since error makes scores inconsistent or unreliable, a measure of the reliability of scores is important.
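To make Score = Truth + Error concrete, here is a minimal simulation sketch in Python (the true score of 100 and error SD of 5 are illustrative assumptions, not values from any real test) showing how one person's observed scores bounce around a fixed true score across repeated testings:

```python
# Toy simulation of the classical test theory model: observed = true + error.
# The true score (100) and error SD (5) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
true_score = 100
errors = rng.normal(loc=0, scale=5, size=10)  # fresh random error each testing
observed = true_score + errors

print(np.round(observed, 1))                  # scores scattered around 100
print(f"mean of 10 administrations = {observed.mean():.1f}")
```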

Reliability: Test-Retest
- Test a group on two different occasions and correlate the results
- Are results stable over time?
- Does not mean the scores are identical, only that they stay the same relative to each other

Reliability: Inter-Rater
- Two people score the same tests
- Are the results the same?
- Important when there is an element of subjectivity in scoring

Reliability: Internal Consistency
- Correlate the score on each item with the total score
- Are all the items measuring the same thing?
All of these estimates are correlations, so all are subject to the same problems correlations have.
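As a sketch of how two of these estimates are computed (the scores and the 5-item response matrix below are invented, not from any real test), test-retest reliability is just a correlation across occasions, and internal consistency can be summarized with Cronbach's alpha:

```python
# Two common reliability estimates, computed with numpy on made-up data.
import numpy as np

# Test-retest: correlate the same examinees' scores on two occasions.
time1 = np.array([12, 18, 25, 31, 40, 44, 52, 60])
time2 = np.array([14, 17, 27, 30, 42, 43, 55, 58])
test_retest_r = np.corrcoef(time1, time2)[0, 1]

# Internal consistency (Cronbach's alpha): rows = examinees, columns = items.
items = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
])
k = items.shape[1]
sum_item_var = items.var(axis=0, ddof=1).sum()  # sum of item variances
total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
alpha = (k / (k - 1)) * (1 - sum_item_var / total_var)

print(f"test-retest r = {test_retest_r:.2f}, Cronbach's alpha = {alpha:.2f}")
```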

Things that artificially affect reliability estimates
- Length of test: a longer test means higher reliability. With more questions, a differing response on one item has less effect. The boost may be artificial, but shorter tests are inherently less reliable.
- Variability in what is being measured: restriction of range. If everyone scores about the same, you can't see that low scores tend to remain low and high scores tend to remain high.
- Size of sample: a small sample makes it hard to find a relationship.

Some Facts About Correlation
- Small samples may miss a relationship
- Heterogeneous samples may miss a relationship
(Slide shows three example scatterplots with correlations of 0.87, 0.42, and 0.78.)
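Both pitfalls named on the last two slides are easy to simulate (all numbers below are invented): with a true underlying correlation of about .80, a tiny sample gives an unstable estimate, and restricting the range to high scorers shrinks it:

```python
# Simulated illustration of two correlation pitfalls: small samples and
# restricted range. All data are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(0, 1, n)                       # e.g., placement test score
y = 0.8 * x + rng.normal(0, 0.6, n)           # e.g., course outcome, r ~ .80

full_r = np.corrcoef(x, y)[0, 1]

tiny = rng.choice(n, size=12, replace=False)  # a 12-person sample
tiny_r = np.corrcoef(x[tiny], y[tiny])[0, 1]  # unstable from run to run

kept = x > 1.0                                # keep only the top ~16% of x
restricted_r = np.corrcoef(x[kept], y[kept])[0, 1]  # drops to roughly .5

print(f"full sample (n={n}): r = {full_r:.2f}")
print(f"small sample (n=12): r = {tiny_r:.2f}")
print(f"restricted range:    r = {restricted_r:.2f}")
```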

So what's good?
- SAT reports reliabilities of .89-.93 (Test Characteristics of the SAT, http://professionals.collegeboard.com/data-reports-research/sat/data-tables)
- ACCUPLACER reports reliabilities ranging from .84 for Sentence Skills to .90 for Arithmetic (College Board webinar on ACCUPLACER cut scores)
- CLEP College Algebra estimates reliability at 0.90 (Test Information Guide: College-Level Examination Program 2013-2014, College Algebra)
As a rule of thumb, 0.90 is great; 0.80 should be used with caution, if at all.

Reliability & Error in Individual Scores
We can't totally get rid of error, but we can estimate how much is there. Using reliability, you can estimate how much a person's score would vary due to error.
Standard Error of Measurement:
- SEM = SD × √(1 − r)
- An index of the extent to which an individual's scores would vary over multiple administrations
- Gives the range within which the true score is likely to fall
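A minimal sketch of the SEM arithmetic, assuming an IQ-like scale with SD = 15 and reliability r = .90 (illustrative values, not taken from any particular manual):

```python
# SEM = SD * sqrt(1 - r), and the confidence band it implies around a score.
import math

def sem(sd: float, reliability: float) -> float:
    """Expected spread of one person's observed scores around their true score."""
    return sd * math.sqrt(1 - reliability)

sd, r, observed = 15, 0.90, 100
s = sem(sd, r)                                          # 15 * sqrt(0.10) = 4.74
print(f"SEM = {s:.2f}")
print(f"68% band: {observed - s:.0f} to {observed + s:.0f}")      # about 95-105
print(f"95% band: {observed - 2*s:.0f} to {observed + 2*s:.0f}")  # about 91-109
```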

Theoretical distribution of scores
(Slide shows a normal curve marked at −3, −2, −1, 0, +1, +2, and +3 SD: about 34% of scores fall in each band between the mean and ±1 SD, 14% in each band between ±1 and ±2 SD, and 2% beyond ±2 SD, so about 68% fall within ±1 SD and 96% within ±2 SD; the cumulative percentages at −2, −1, 0, +1, and +2 SD are 2%, 16%, 50%, 84%, and 98%.)
1 SEM below to 1 SEM above = 68% confidence
2 SEM below to 2 SEM above = 95% confidence

SEM for some tests
- ACCUPLACER Diagnostic scores: SEM 1.18-1.7 (about 1.3 on average), so the 68% confidence interval for a score of 10 is about 9-11, and the 95% interval is about 7-13
- ACT Composite: SEM 0.91, so the 68% confidence interval for a score of 20 is 19-21, and the 95% interval is 18-22 (ACT Technical Manual)
- WAIS-IV FSIQ: SEM 2.16, so the 68% confidence interval for a score of 100 is 98-102, and the 95% interval is 96-104

Are scores on two tests really different?
Standard Error of the Difference:
- SEdiff = √(SEM₁² + SEM₂²)
- The two scores must be on the same scale
- If the difference ≥ 1 SEdiff, you can be 68% confident it is real
- If the difference ≥ 2 SEdiff, you can be 95% confident it is real
This is the scientific approach; you can also use a common-sense approach.

Are these scores really different?
ACT Composite scores of 18 and 20?
- 68% confident 18 is really 17-19
- 68% confident 20 is really 19-21
- The confidence bands overlap, so the difference could easily be due to error
ACT Composite scores of 20 and 23?
- 68% confident 20 is really 19-21
- 68% confident 23 is really 22-24
- The confidence bands do not overlap, so we are at least 68% sure the difference is not due to error
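Here is the scientific approach in code, using the ACT Composite SEM of .91 from the earlier slide (the helper function is ours). Note that the 1-SEdiff rule and the overlapping-bands rule can disagree on borderline pairs: 18 vs. 20 clears 1 SEdiff even though the 68% bands touch.

```python
# Standard error of the difference between two scores on the same scale.
import math

def se_diff(sem1: float, sem2: float) -> float:
    """SEdiff = sqrt(SEM1^2 + SEM2^2)."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

sd = se_diff(0.91, 0.91)                # two ACT Composite scores: ~1.29
for a, b in [(18, 20), (20, 23)]:
    diff = abs(a - b)
    if diff >= 2 * sd:
        verdict = "95% confident the difference is real"
    elif diff >= sd:
        verdict = "68% confident the difference is real"
    else:
        verdict = "difference could easily be error"
    print(f"{a} vs {b}: diff = {diff}, SEdiff = {sd:.2f} -> {verdict}")
```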

Does Reliability = Validity? NO!
Getting a consistent result means reliability; having that result be meaningful is validity.
Validity is based on the inferences you make from results:
- A test has to be reliable to be valid
- A test does not have to be valid to be reliable

Validity
- Any evidence that a test measures what it says it is measuring
- Any evidence that inferences made from the test are useful and meaningful

Validity: 3 types of evidence
- Content
- Criterion-Related
- Construct

Content Validity
Think of a test as a sample of possible problems/items:
- A 4th grade spelling test should be a representative sample of 4th grade spelling words
- The GRE Quantitative should be a representative sample of the math problems a grad school applicant might be expected to solve
Content validity should be part of the test design:
- Identify how many algebra, trig, calculus, etc. items should be on the test (a table of specifications)
Frequently evaluated by item analysis or expert opinion.

Criterion-Related Validity
How does the test score correlate with some external measure (the criterion)?
- Placement test score and performance in the class
- Admission test score and GPA for the first semester
Sometimes called Predictive or Concurrent Validity.
This is a correlation, so it is affected by error in the test and error in the criterion, and by restriction of range:
- Only top students take the GRE
- Graduate school grades have a restricted range

To use or not to use? It depends on the question.
- What is the impact of the decision?
- What is the cost of using the test? Of not using it?
Decision theory can be a guide to determining incremental validity:
- The net gain from using scores

Construct Validity
Most important for psychological tests, where what you are measuring is abstract or theoretical:
- Intelligence
- Personality characteristics
- Attitudes and beliefs
Usually involves multiple pieces of evidence.

Not all scales are created equal: Nominal Scales
- Simply place each case in a specific category
- Often dichotomous, like sex/gender
- Can be multiple categories, like race or religion
- Provide no quantitative information even if you use numbers for the categories: coding Catholic=1, Baptist=2, Buddhist=3, Atheist=4 does not mean Catholic is worth less than Atheist

Not all scales are created equal: Ordinal Scales
- Place each score in a position that is ordered in some way, typically from low to high
- Percentile ranks, letter grades
- One score is clearly better or higher than another, but only within that group, and the scale provides no information about the group
- No information about how far apart two scores really are

100 students, 100 questions
If scores are evenly distributed (one person at each possible score from 1 to 100), then:
- 10th percentile = 11
- 25th percentile = 26
- 50th percentile = 51
- 60th percentile = 60
- 75th percentile = 75
- 90th percentile = 90

100 students, 100 questions
Now suppose scores still range from 1 to 100, but 20 students scored 1, 20 scored 10, 20 scored 25, 10 scored 50, 20 scored 75, and 10 scored 100:
- 10th percentile = 1
- 25th percentile = 10
- 50th percentile = 25
- 60th percentile = 35
- 75th percentile = 75
- 90th percentile = 90

100 students, 100 questions
Now suppose scores still range from 1 to 100, but one person scored 1, one person scored 100, and the rest form a typical classroom distribution (most scores in the 70s, 80s, and low 90s):
- 10th percentile = 71
- 25th percentile = 78
- 50th percentile = 84
- 60th percentile = 86
- 75th percentile = 89
- 90th percentile = 93

100 students, 100 questions
So depending on the distribution of our 100 scores, always ranging from 1 to 100, we get:

Percentile   Even spread   Clumped   Classroom
10th         11            1         71
25th         26            10        78
50th         51            25        84
60th         60            35        86
75th         75            75        89
90th         90            100       93

To be able to say something about differences in magnitude, you need to go to the next level of measurement.
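The contrast is easy to reproduce. This sketch computes percentiles for the first two distributions with numpy; percentile conventions vary between sources (the slides themselves appear to mix conventions), so individual values can differ by a point or so, but the pattern of the same percentile landing on very different raw scores holds regardless:

```python
# Same percentiles, very different raw scores, depending on the distribution.
import numpy as np

even = np.arange(1, 101)                       # one examinee at each score 1-100
clumped = np.repeat([1, 10, 25, 50, 75, 100],  # the pile-ups from the second slide
                    [20, 20, 20, 10, 20, 10])

for name, scores in [("even", even), ("clumped", clumped)]:
    cuts = np.percentile(scores, [10, 25, 50, 60, 75, 90])
    print(name, np.round(cuts, 1))
# even    -> [10.9 25.8 50.5 60.4 75.2 90.1]
# clumped -> [ 1.  10.  25.  35.  75.  77.5]
```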

Not all scales are created equal: Interval Scales
- Scores are arranged from low to high, but with equal intervals: a 5-point difference means the same thing whether it occurs between scores at the bottom, in the middle, or at the top
- Can be a direct measurement (temperature)
- Usually a conversion of raw scores to standard scores of some kind (IQ, SAT/ACT)
- Allow statistical manipulation: mean, standard deviation, etc.

Not all scales are created equal
SAT Math scores are an interval scale:
- Brenda: 800
- Matt: 600
- Jim: 400
- Valid: Matt is as much better at math than Jim as Brenda is than Matt.
- Not valid: "Brenda is twice as good at math as Jim." Ratio statements like this require a meaningful zero point, which interval scales lack (see below).

Not all scales are created equal: Ratio Scales
- Not only do you have equal intervals, you also have an absolute (meaningful) zero point
- Speed, weight
- Generally meaningless in educational testing: does a 0 on a history test mean you have no knowledge of history?
- Allow you to actually say one score is twice as good as another

Not all scales are created equal
For educational testing, scales will almost always be either ordinal or interval, and often both are provided: ACT Math = 19, 47th percentile.

A score by any other name...
Raw scores are generally meaningless: "Becky got 87 questions right" is probably good if there are 100 questions, but probably not so good if there are 200.
Two major ways to make a score meaningful:
- Norm-referenced scores
- Criterion-referenced scores

Norm-referenced
If I want to know how well someone can do something relative to her/his peers, I want norm-referenced reporting.
Many standardized tests are norm-referenced:
- SAT, ACT, GRE
Grading on the curve means grading based on comparison with the rest of the class (norm-referenced):
- 80% might be a B, an A, a C, or something else
Most psychological tests are norm-referenced:
- MMPI-2, WAIS-IV, WJ-III

Norm-referenced
Scales are typically interval but also report percentile ranks.
To evaluate norm-referenced scores, you need to examine the group the test was normed on:
- Size: bigger is better
- Characteristics: does the norm group match the group you want to use the test with (age, gender, ethnicity, SES, etc.)? If not, is there research to support its use with your group?
- Recency: the older the norms get, the more you need to question them

Criterion-referenced
If you want to know whether someone has a minimal level of skill or knowledge, then you need criterion-referenced scores.
Classroom grading is generally criterion-based:
- 90% right = A, 80% = B, 70% = C, etc.
Certification exams are often criterion-referenced:
- Proctor certification, licensing exams
- Typically reported as percentage correct or pass/fail
For criterion-referenced scores the type of scale is often not important: you either reach the cut point or you don't.

Criterion-referenced
To evaluate a criterion-referenced score, you need to examine how the cut-off was determined:
- Who, what, when, where
And how it was validated:
- Was it validated at all?
- Who, what, when, where

Establishing cut scores
Steps in setting cut scores for a placement test:
- Have faculty agree upon a list of skills necessary for success in the course
- Have faculty review the test content/questions and decide, for each one, whether a student who will be successful in the course should be able to answer it
- Administer the test to enrolled students at the beginning of the semester and compare scores against their final grades
- Decide on the cut score to use
- Use it for placement, then re-evaluate the appropriateness of the placements
(Excerpted from a College Board webinar on setting cut scores for ACCUPLACER)

Cut Scores: Maximize success
(Slide shows a scatterplot of ACCUPLACER Elementary Algebra scores, roughly 60-120, against Pre-Calc grades A-F. A vertical line at the cut score divides students into four groups: true positives, false positives, true negatives, and false negatives. Here the cut is set high, so most students placed into the course succeed, at the price of more false negatives.)

Cut Scores: Maximize opportunity
(Same scatterplot, but with the cut score set lower: more students get the opportunity to take the course, at the price of more false positives.)

Cut Scores
Effectiveness = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives)
You have to weigh effectiveness against cost.
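As a sketch with made-up counts (not from any real placement study), the effectiveness ratio is just the share of placement decisions that turned out to be correct:

```python
# Effectiveness = share of correct placement decisions.
def effectiveness(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts: placed and passed (tp), held back and would have
# failed (tn), placed but failed (fp), held back but would have passed (fn).
print(f"effectiveness = {effectiveness(tp=120, tn=40, fp=25, fn=15):.2f}")  # 0.80
```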

Yeah, but what does my score mean?

Yeah, but what does my score mean?
Start with the test: what is it supposed to measure?
- ACT: test scores reflect what students have learned throughout high school
- SAT: designed to assess your academic readiness for college: what you know and how well you can apply that knowledge
- ACCUPLACER: allows educational institutions to address the challenges of accurate placement and remediation
- CLEP: tests mastery of college-level material in specific areas

Yeah, but what does my score mean?
Start with the test: is it reliable? Is it valid?
- CLEP Intro Psychology: r = 0.89, SEM = 2.89; content developed by college faculty to reflect the content of Intro Psych classes
What are the characteristics of the norm group, and how were the criterion points determined?
- CLEP: a standard-setting panel of 15-20 faculty establishes the performance needed to reflect a grade of C or of B

Yeah, but what does my score mean?
Consider the characteristics of the test taker:
- Do they match the norm group, or the group the test was designed to be used with (local vs. national norms, cultural considerations)?
- Are the results consistent with other information (GPA, other tests)?
- What is her/his goal? Does it make sense even without considering the test result?

Yeah, but what does my score mean?
Consider the score:
- Generally speaking, "average" is about the 25th to the 75th percentile. If the norms are general population, low average may be pretty weak compared to other college students; against local ACT/SAT norms where all HS juniors take the test, or national norms for all college-bound students, low average might be typical for the local tech school.
- A cut score should be considered barely enough to meet the criterion: a 50 on the CLEP Spanish test may earn credit for 101, but the student should probably do some review before tackling 102.

Yeah, but what does my score mean?
Remember, a test result is just a sample of the student's performance:
- Poor performance could be due to lots of things
- Generally, you can be sure it tells you the minimum of what they are capable of, but maybe not the maximum

Examples of Uses of Test Scores and Related Data that Should Be Avoided:
- Using test scores as the sole basis for important decisions affecting the lives of individuals, when other information of equal or greater relevance and the resources for using such information are available.
- Using minimum test scores without proper validation.
- Making decisions about otherwise qualified students based only on small differences in test scores.
- Using scores without appropriate consideration of their validity.
- Providing inadequate or misleading information about the importance of test scores in making judgments or decisions.
- Requiring or recommending that certain tests be taken when the scores are not used or are used to a negligible extent.
- Failing to recognize differences in admission standards and requirements that may exist among different schools or departments within many institutions when providing information to prospective applicants.
(From Guidelines on Uses of College Board Test Scores and Data)

Steve Saladin, Ph.D. University of Idaho Moscow, ID 83844-3140 ssaladin@uidaho.edu