EFSET PLUS-TOEFL CORRELATION STUDY REPORT 02
EXTERNAL VALIDITY OF EFSET PLUS WITH TOEFL iBT

Abstract

This study was carried out to explore the statistical association between EFSET PLUS and TOEFL iBT scores. Three hundred eighty-four volunteer examinees participated in the study. The results suggest moderately strong, positive correlations between EFSET PLUS and TOEFL for both the reading and listening scales and provide solid evidence of convergent validity. The reliabilities for both the EFSET PLUS reading and listening score scales were also very high because of the adaptive nature of the test.
TABLE OF CONTENTS

Introduction
Analysis and Results
Discussion
References
About the Author
INTRODUCTION

This report describes a validation study carried out in fall 2014 for the new EF Standard English Test PLUS (EFSET PLUS). It presents empirical, external validity evidence regarding the relationship between EFSET PLUS proficiency scores and reported Test of English as a Foreign Language (TOEFL iBT) scores.

The TOEFL iBT is an internet-based test of English language proficiency developed and administered by Educational Testing Service (ETS). It is generally recognized as one of the premier tests of English language proficiency in the world. Separate TOEFL component scores are reported for each of the four modalities (reading, listening, writing, and speaking), each using a score scale ranging from 0 to 30. The composite scale is a simple sum of the component scores.

In contrast, EFSET PLUS is a free, online test designed to provide separate measures of English language reading and listening proficiency. The test is professionally developed and administered online with a computer interface that is standardized across computer platforms. The reading and listening sections of EFSET PLUS are adaptively tailored to each examinee's proficiency, providing an efficient and accurate way of assessing language skills. As an interpretive aid, performance scores on EFSET PLUS are directly aligned with the six levels (A1 to C2) of the Council of Europe's Common European Framework of Reference (CEFR) for languages.

In this study, an international sample of non-native English language learners was recruited and screened over a period of months. Three hundred eighty-four examinees who met the study eligibility requirements were administered both the EFSET PLUS reading and listening tests. As part of the eligibility requirements, the examinees were required to upload a digital copy of their TOEFL iBT score report.
Their scores on EFSET PLUS and their reported TOEFL listening and reading scores were then analyzed to investigate the degree of statistical correspondence between the tests. The study results confirm that the EFSET PLUS scores are highly reliable across the corresponding reading and listening score scales and maintain reasonable statistical correspondence (convergent validity) with TOEFL reading and listening scores. This study found that EFSET PLUS scores correlated reasonably well with TOEFL iBT scores, somewhat better with the total TOEFL scores than with the separate reading and listening section scores. This provides fairly solid convergent validity evidence (see Campbell & Fiske, 1959), suggesting that the EFSET PLUS score scales are tapping into some of the same English language skills as TOEFL. The next section of the paper describes the EFSET PLUS examinations and scoring process. It also describes the participant sample used for the validation study. Analysis and results are covered in the subsequent section.
Description of the EFSET PLUS Tests and Score Scales

Separate reading and listening test forms that were statistically equivalent to EFSET PLUS were used for this study, to ensure that there was no learning effect from the publicly available EFSET PLUS. The EFSET tests employ various selected-response item types, including multiple-selection items. A set of items is associated with a specific reading or listening stimulus to comprise a task. In turn, one or more tasks are assembled as a unit to prescribed statistical and content specifications; these units are called modules. The modules can vary in length, depending on the number of items associated with each task. Because of the extra time needed to listen to the task-based aural stimuli, the listening modules tend to have slightly fewer items than the reading modules. In general, the reading modules for this study had from 16 to 24 items; the listening modules each had between 12 and 18 items. In aggregate, each examinee was administered a three-stage test consisting of one module per stage.

The actual test forms for EFSET and EFSET PLUS are administered using an adaptive framework known as computerized adaptive multistage testing, or ca-MST (Luecht & Nungester, 1998; Luecht, 2000; Zenisky, Hambleton & Luecht, 2010; Luecht, 2014a). Ca-MST is a psychometrically powerful and flexible test design that provides each examinee with a test form customized for his or her demonstrated level of language proficiency. For this study, each EFSET examinee was administered a three-stage ca-MST panel with three levels of difficulty at Stage 2 and four levels of difficulty at Stage 3, as shown in Figure 1. The panels are self-adapting. Once assigned to an examinee, each panel has internal routing instructions that create a statistically optimal pathway for that examinee through the panel. The statistical optimization of the routing maximizes the precision of every examinee's final score.
As Figure 1 demonstrates, all examinees assigned a particular panel start with the same module at Stage 1 (M1, a medium-difficulty module). Based on their performance on the M1 module, they are then routed to either module E2, M2 or D2 at Stage 2. The panel routes the lowest-performing examinees to E2 and the highest-performing examinees to D2. All others are routed to M2. Combining performance from both Stages 1 and 2, each examinee is then routed to module VE3, ME3, MD3, or VD3 for the final stage of testing. This type of adaptive routing has been demonstrated to significantly improve the precision of the final score estimates compared to a fixed (non-adaptive) test form of comparable length (Luecht & Nungester, 1998). The cut scores used for routing are established when the panel is constructed and statistically optimize the precision of each pathway through the panel.
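The routing just described can be sketched in a few lines of Python. This is an illustrative sketch only: the module names come from Figure 1, but the proportion-correct routing metric and the cut values in `route_stage2` and `route_stage3` are hypothetical placeholders, not the statistically optimized cuts used operationally.

```python
# Illustrative sketch of ca-MST panel routing (module names from Figure 1;
# cut scores and the proportion-correct metric are hypothetical).

def route_stage2(stage1_score, cuts=(0.35, 0.65)):
    """Route after Stage 1: low performers to E2, high performers to D2,
    everyone else to M2. `stage1_score` is a proportion correct on M1."""
    if stage1_score < cuts[0]:
        return "E2"
    if stage1_score >= cuts[1]:
        return "D2"
    return "M2"

def route_stage3(cumulative_score, cuts=(0.25, 0.50, 0.75)):
    """Route after Stage 2 using combined Stage 1 + Stage 2 performance."""
    modules = ["VE3", "ME3", "MD3", "VD3"]
    for cut, module in zip(cuts, modules):
        if cumulative_score < cut:
            return module
    return modules[-1]

# Example pathway for a mid-level examinee
print(route_stage2(0.5))   # M2
print(route_stage3(0.6))   # MD3
```

In an operational panel the routing statistic would be a provisional theta estimate rather than a raw proportion correct, but the branching structure is the same.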
Figure 1. An Example of a ca-MST Panel (modules arrayed across Stages 1-3 from low to high language proficiency)

All EFSET items are statistically calibrated to the EF reading and listening score scales. The calibration process employs item response theory (IRT) to determine the difficulty of each item relative to all other items. The IRT-calibrated items and tasks for the reading and listening panels used in this study were previously administered to large samples of EF examinees and calibrated using the Rasch calibration software program WINSTEPS (Linacre, 2014). This software is used worldwide for IRT calibrations. The IRT model used for the calibrations is known as the partial-credit model or PCM (Wright & Masters, 1982; Masters, 2010). The partial-credit model can be written as

P_{ix}(\theta) = P(X = x \mid \theta, b_i, \mathbf{d}_i) = \frac{\exp\left[\sum_{k=0}^{x}(\theta - b_i + d_{ik})\right]}{\sum_{j=0}^{m_i}\exp\left[\sum_{k=0}^{j}(\theta - b_i + d_{ik})\right]}    (Equation 1)

where θ is the examinee's proficiency score, b_i denotes an item difficulty or location for item i, and d_{ik} denotes two or more threshold parameters associated with the separations of the category points for items that use three or more score points (k = 0, ..., x_i). All reading items and tasks for the EFSET were calibrated to one IRT scale, θ_R. All listening items and tasks were calibrated to another IRT scale, θ_L.

Using the calibrated PCM items and tasks, a language proficiency score on either the θ_R or θ_L scale can be readily estimated regardless of whether a particular examinee follows an easier or more difficult route through the panel (i.e., the routes or pathways denoted by the arrows in Figure 1). The differences in module difficulty within each panel are automatically managed by a well-established IRT scoring process known as maximum likelihood estimation (MLE).
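As a numerical illustration of Equation 1, the PCM category probabilities can be computed from cumulative sums of the (θ − b_i + d_ik) terms. The parameter values below are invented for illustration; they are not calibrated EFSET item parameters.

```python
import math

def pcm_probabilities(theta, b, d):
    """Category probabilities for one item under the partial-credit model
    (Equation 1). theta: proficiency; b: item difficulty/location;
    d: threshold parameters d_1..d_m (the k = 0 threshold is fixed at 0).
    A constant added to every exponent cancels in the normalization, so
    this cumulative-sum form yields the standard PCM probabilities."""
    thresholds = [0.0] + list(d)
    cumsums, running = [], 0.0
    for d_k in thresholds:                 # sum_{k=0}^{x} (theta - b + d_k)
        running += theta - b + d_k
        cumsums.append(running)
    denom = sum(math.exp(c) for c in cumsums)
    return [math.exp(c) / denom for c in cumsums]

# A hypothetical 3-category (0/1/2 points) item
probs = pcm_probabilities(theta=0.5, b=0.0, d=[0.4, -0.4])
assert abs(sum(probs) - 1.0) < 1e-12   # probabilities sum to 1
```

As expected for the model, raising theta shifts probability mass toward the higher score categories.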
MLE scoring takes the various calibrated item difficulties along each panel route directly into account when estimating each examinee's reading or listening score.

As noted earlier, the EF score scales for reading and listening are aligned to the Council of Europe's Common European Framework of Reference (CEFR) for languages. The CEFR provides a set of conceptual guidelines that describe the expected proficiency of language learners at six levels, A1 to C2 (see Figure 2).

Basic user
- A1 (Beginner): Understands familiar everyday words, expressions and very basic phrases aimed at the satisfaction of needs of a concrete type.
- A2 (Elementary): Understands sentences and frequently used expressions (e.g. personal and family information, shopping, local geography, employment).
Independent user
- B1 (Intermediate): Understands the main points of clear, standard input on familiar matters regularly encountered in work, school, leisure, etc.
- B2 (Upper Intermediate): Understands main ideas of complex text or speech on both concrete and abstract topics, including technical discussions in field of specialisation.
Proficient user
- C1 (Advanced): Understands a wide range of demanding, longer texts, and recognises implicit or nuanced meanings.
- C2 (Mastery): Understands with ease virtually every form of material read, including abstract or linguistically complex text such as manuals, specialised articles and literary works, and any kind of spoken language, including live broadcasts delivered at native speed.

Figure 2. Six CEFR Language Proficiency Levels

The content validity of the EFSET ca-MST modules and panels is well-established and follows state-of-the-art task and test design principles established by world experts on language and adaptive assessment design. The EFSET Technical Background Report (EFSET, 2014) provides a comprehensive overview of the test development process.
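To make the MLE scoring step described above concrete, the sketch below estimates θ by maximizing the log-likelihood over a coarse grid, using the dichotomous Rasch model (the two-category special case of the PCM). The responses and item difficulties are invented; operational scoring would use the full PCM and a Newton-type iteration rather than a grid search.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model
    (the PCM reduces to this for two-category items)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_theta(responses, difficulties, grid=None):
    """Grid-search maximum likelihood estimate of theta.
    responses: 0/1 scored item responses along the examinee's route;
    difficulties: the calibrated b_i for those items. Because the b_i
    enter the likelihood directly, harder and easier routes through a
    panel are automatically placed on the same scale."""
    if grid is None:
        grid = [x / 100.0 for x in range(-400, 401)]   # -4.0 .. 4.0
    def loglik(theta):
        return sum(
            x * math.log(rasch_p(theta, b)) + (1 - x) * math.log(1.0 - rasch_p(theta, b))
            for x, b in zip(responses, difficulties)
        )
    return max(grid, key=loglik)

# 3 of 5 invented items correct, difficulties spread around the scale center
theta_hat = mle_theta([1, 1, 0, 1, 0], [-1.0, -0.5, 0.0, 0.5, 1.0])  # roughly mid-scale
```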
It should be noted that the EFSET and EFSET PLUS alignment to the CEFR levels was established through a formal standard-setting process (Luecht, 2014c; EFSET, 2014).
Validation Study Sample

Examinees were recruited to participate in the online EF validation study. The primary eligibility requirements were: (a) having a valid email address and (b) being able to provide, by digital upload, an official TOEFL iBT score report showing recent reading and listening scores. "Recent" was operationally defined as having taken the TOEFL iBT modules within the past 18 months. All potential examinees completed a brief survey to establish their eligibility and then uploaded a digital copy of their TOEFL iBT score report. Only eligible candidates were allowed to proceed to the next phase and actually take the EFSET PLUS reading and listening forms.

This validation study testing was carried out during fall 2014. The examinees were administered and completed both an EFSET PLUS reading and listening panel. Every examinee who completed both EFSET PLUS panels within the testing window and whose performance demonstrated reasonable effort[1] was compensated with a voucher for 50. Ultimately, there were 384 participants with complete data[2].

Demographically, the sample was composed of 197 (51.3%) women and 187 (48.7%) men. Ages of the examinees ranged from 16 to 33 years, with a standard deviation of 1.95 years. Twenty-nine nationalities were represented in this study. The majority of the study participants (227, or 59.1%) listed their nationality as Brazilian. Other relatively high-percentage nationalities were Chinese (11.5%), Indian (7.8%), and German (3.6%). The remaining participants were from other Asian countries, as well as various European, African and South American nations. Education and English as a second language (ESL) experience of the sample are jointly summarized in Table 1. In general, the sample consisted primarily of well-educated, young Brazilian adults with somewhat extensive ESL experience. The gender mix was about equal.
[1] Examinees who left entire modules blank or who otherwise exhibited an obvious lack of effort were excluded. The application process carefully explained the study participation rules to each examinee.
[2] One examinee had taken the paper-and-pencil TOEFL rather than the newer internet version. That individual was excluded from the study.
Table 1. Language Experience and Educational Information for the Sample (N = 384)

Language Experience                    Percent
Less than 1 yr.
1-3 years                              14.1%
4-6 years                              25.3%
7-9 years                              22.9%
More than 9 yrs.                       34.9%

Degree                                 Percent
Did not finish high/secondary school
High/secondary school                  43.8%
Further education: some college        18.2%
Bachelor's degree                      31.8%
Master's degree                        0.3%
Other degrees                          4.7%

Major Area of Study                    Percent
Sciences
Business                               8.6%
Art and design                         7.6%
Mathematics                            6.3%
Social Sciences                        4.2%
Languages                              3.9%
Humanities                             1.3%
Politics                               0.8%
Electrical engineering                 0.3%
Geoinformatic systems                  0.3%
Law                                    0.3%
Missing or other
TOEFL iBT scores for the 384 participants are summarized in Table 2. Note that one examinee was missing a valid TOEFL reading score; another was missing a valid TOEFL listening score. Therefore, the counts for the TOEFL components are only N_R = N_L = 383. On average, the participants in this study would be classified as having high TOEFL reading and listening proficiency[3], although the ranges of scores definitely cover reasonable spreads of English language knowledge and skill.

Table 2. Summary of TOEFL Performance Statistics (count, mean, standard deviation, minimum, and maximum of the reported TOEFL Reading, Listening, and Total scores)

The EFSET PLUS descriptive statistics on the key proficiency-related variables, estimated reliability coefficients, correlations (observed and disattenuated), and some auxiliary performance comparisons between the validation study participants' EFSET PLUS listening and reading scores and TOEFL iBT scores are presented in the next section.

[3] Based on interpretive information published by ETS.
ANALYSIS AND RESULTS

Descriptive statistics for the EFSET PLUS scores are shown in Table 3 for the 384 examinees who participated in this study. The variables Reading θ_R and Listening θ_L are the two EFSET PLUS proficiency scores. By IRT convention, proficiency score estimates are often denoted by the Greek letter θ ("theta"). Note that in practice, these IRT scores are rescaled to a more convenient and somewhat more interpretable set of scale values (0 to 100). For various technical statistical reasons, that rescaling was not applied for purposes of this study. Here, it is sufficient to note that the score estimates of θ_R and θ_L can be negative or positive[4], where higher positive numbers denote better language proficiency as measured by the EFSET PLUS ca-MST panels. Because of the adaptive multistage test design used for EFSET PLUS, the reliabilities of the EFSET PLUS reading and listening scores are excellent and provide accurate measures across the scales.

Table 3. Descriptive Statistics for EFSET PLUS IRT Proficiency Scores (N = 384; mean, standard deviation, minimum, and maximum of Reading θ_R and Listening θ_L)

As suggested by Table 2 (shown earlier), the sample appeared to be highly proficient in English on average. Table 3 again confirms that finding. Consider that the EFSET PLUS standard deviations for extremely large samples of more than 37,000 examinees were 1.09 for reading and 1.14 for listening. The implication is that, compared to those very large sample statistics, these 384 study participants were, on average, at the 81st percentile for reading and at the 90th percentile for listening[5].

[4] The IRT calibration software, WINSTEPS (Linacre, 2014), scales the EFSET PLUS item banks to have a mean item difficulty parameter estimate (scale location center) of zero.
The examinees' scores are not centered or otherwise standardized to zero and should not be interpreted as z-scores or other normal-curve equivalents.
[5] These comparative results are based on normal-approximation percentiles, using the large-sample means and standard deviations as reasonable estimates of the population distributional parameters.
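The normal-approximation percentile computation described in Footnote 5 can be reproduced with the standard normal CDF. In this sketch, the 1.09 standard deviation is the large-sample reading value quoted in the text, while the 1.0-point offset between the study-sample mean and the large-sample mean is a hypothetical placeholder (the means themselves are not reproduced here).

```python
import math

def normal_percentile(score, mean, sd):
    """Percentile rank of `score` in a normal distribution N(mean, sd),
    via the standard normal CDF expressed with the error function."""
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: a group mean 1.0 theta units above a large-sample
# mean of 0.0, with the large-sample reading SD of 1.09
print(round(normal_percentile(1.0, 0.0, 1.09), 1))   # about the 82nd percentile
```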
An important benefit of the multistage test design used for EFSET PLUS is evident when considering the impact on score accuracy or reliability. The adaptive EFSET PLUS panels (see Figure 1) are specifically designed to provide somewhat more uniform precision across the entire score scale, providing the best possible precision of the estimates of θ_R and θ_L. It is common to report a score reliability coefficient as an omnibus index of score accuracy; one of the most commonly reported types of reliability coefficient is Cronbach's α ("alpha"). Cronbach's α provides a somewhat conservative estimate of the average consistency of scores across the scale (Haertel, 2006). Values above 0.9 are considered to be very good. Because of the adaptive nature of the EFSET PLUS panels, traditional reliability coefficients can only be approximated using what is termed a marginal reliability coefficient. This type of reliability coefficient is computed as

\rho(\hat{\theta}, \theta) = 1 - \frac{E[\sigma^2(\hat{\theta} \mid \theta)]}{\sigma^2(\hat{\theta})}    (Equation 2)

where the numerator of the rightmost term is the average error variance of estimate for the IRT proficiency scores and the denominator of the rightmost term is the variance of the estimated IRT θ scores (Lord & Novick, 1968). Provided that the data fit the IRT model used for calibration and scoring (the PCM in the case of EFSET PLUS), this marginal reliability is usually very comparable to Cronbach's α coefficient. The marginal reliability coefficients for EFSET PLUS reading and listening, based on samples of more than 37,000 examinees, imply excellent reliability across the score scale, a direct and entirely expected outcome of using an adaptive multistage testing design. The reliability coefficients are used to adjust the correlations between EFSET PLUS and TOEFL, as discussed below.
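Equation 2 is easy to compute once per-examinee θ estimates and their standard errors of estimate are available from the IRT scoring step. The numbers below are invented solely to illustrate the computation; they are not EFSET data.

```python
def marginal_reliability(theta_hats, std_errors):
    """Marginal reliability (Equation 2):
    rho = 1 - mean(SE^2) / var(theta_hat)   (Lord & Novick, 1968)."""
    n = len(theta_hats)
    mean_theta = sum(theta_hats) / n
    var_theta = sum((t - mean_theta) ** 2 for t in theta_hats) / (n - 1)
    mean_err_var = sum(se ** 2 for se in std_errors) / n
    return 1.0 - mean_err_var / var_theta

# Illustrative numbers only: scores with variance near 1 and standard
# errors around 0.25 (an adaptive design keeps the SEs fairly uniform)
thetas = [-1.5, -0.8, -0.2, 0.0, 0.3, 0.9, 1.4]
ses = [0.28, 0.25, 0.24, 0.24, 0.24, 0.25, 0.27]
print(round(marginal_reliability(thetas, ses), 2))   # 0.93
```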
Pairwise Pearson product-moment correlations were computed between five score variables: (i) TOEFL iBT reading scores; (ii) TOEFL iBT listening scores; (iii) TOEFL iBT total scores; (iv) EFSET PLUS IRT score estimates of θ_R (reading); and (v) EFSET PLUS IRT score estimates of θ_L (listening). Correlations denote the degree of statistical linear association between pairs of variables. Values near 1.0 indicate an almost perfect linear relationship between the variable pair. Values near zero indicate almost no linear association, and values near -1.0 indicate a nearly perfect inverse relationship (i.e., increasing values on one variable are strongly associated with decreasing values on the second variable). Validity studies such as this often result in moderate, positive correlations (e.g. 0.4 to 0.7). The product-moment correlations between the observed TOEFL and EFSET PLUS scores are shown in the lower triangle of the correlation matrix in Table 4 (i.e., in the unshaded cells below the diagonal of the matrix). There is one correlation for each pairing of the five variables.
Table 4. Correlations Between TOEFL and EFSET PLUS Scores (observed correlations below the diagonal, disattenuated correlations above the diagonal, reliability coefficients on the diagonal; score variables: TOEFL Reading, TOEFL Listening, TOEFL iBT Total, EFSET PLUS Reading θ_R, and EFSET PLUS Listening θ_L)

The correlations in the upper (shaded) section of the matrix in Table 4 are called disattenuated correlations. That is, they estimate the statistical relationships between the five scores if measurement or score estimation errors were eliminated altogether. The disattenuated correlations are computed by dividing each observed product-moment correlation by the square root of the product of the reliability coefficients for the two scores in the pairing (Haertel, 2006, p. 85). Because the reliability coefficients for the TOEFL iBT and EFSET PLUS scores are all relatively high, the disattenuated (true-score) correlations are not much larger than the observed correlations in the lower section of the matrix. It should be further apparent that the EFSET PLUS reading and listening scores are at a comparable level of reliability to the total (composite) TOEFL iBT scores. The most relevant correlations from a validity perspective are the two disattenuated correlations between the TOEFL iBT reading and estimated EFSET PLUS θ_R scores (0.70) and between the TOEFL listening and estimated EFSET PLUS θ_L scores (0.77). Those correlations suggest a fairly strong, positive linear association between the TOEFL iBT and EFSET PLUS scores.

Figures 3 and 4 respectively show the scatter plots for the observed reading and listening scores. The TOEFL iBT scores are plotted relative to the horizontal axis in each plot. The EFSET PLUS scores are plotted relative to the vertical axis. The best-fitting regression line is also shown for each pair of score variables.
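The correction for attenuation used for the upper triangle of Table 4 divides each observed correlation by the square root of the product of the two reliabilities. The sketch below uses illustrative values, not the study's actual Table 4 entries.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def disattenuate(r_xy, rel_x, rel_y):
    """Correction for attenuation (Haertel, 2006): divide the observed
    correlation by the square root of the product of the reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Illustrative values only: an observed r of 0.65 with reliabilities of
# 0.95 and 0.85 disattenuates to about 0.72
print(round(disattenuate(0.65, 0.95, 0.85), 2))   # 0.72
```

Because disattenuation divides by a quantity at most 1.0, the corrected correlation is never smaller than the observed one, and the adjustment is small when both reliabilities are high, exactly as the text notes for Table 4.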
It should be apparent that the EFSET PLUS scores are somewhat more variable than the reported TOEFL iBT scores.
Figure 3. Scatterplot of EFSET PLUS Reading Scores θ_R (vertical) by TOEFL iBT Reading Scores

Figure 4. Scatterplot of EFSET PLUS Listening Scores θ_L (vertical) by TOEFL iBT Listening Scores

It would not be realistic to expect perfect correspondence between EFSET PLUS and TOEFL scores. The tests are different but appear to measure some of the same composites of English reading and listening skills. The fact that there are only moderately high, positive, disattenuated correlations between TOEFL and EFSET PLUS scores may be due to a variety of factors, ranging from some restriction of the variation in the scores due to study eligibility requirements, to the scaling and rounding of the TOEFL section scores to integer values ranging from 0 to 30. Or, the EFSET PLUS tasks and scales may simply be getting at a slightly different constellation of English language traits. In any case, these results provide fairly solid convergent validity evidence.
DISCUSSION

It is important to understand that there are no absolute assessment quality standards for measuring English language proficiency. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) and the International Test Commission Guidelines (ITC, 2008) stress the need for ongoing validity evidence gathering as a responsible measurement practice. In that spirit, test developers do the best they can to provide engaging and fair assessments that yield reasonably reliable and valid scores. TOEFL is a well-known and mature test with a worldwide user base. In contrast, EFSET PLUS is still a relatively new testing program. Both tests produce very accurate scores of reading and listening. Given the adaptive nature of EFSET PLUS, it should not be surprising that its reliabilities for the reading and listening score scales may actually be slightly higher than their TOEFL counterparts and comparable in magnitude to the reliability of TOEFL iBT total scores (see the reliability coefficients in Table 4, for example). None of those results imply that one test is better than the other, however.

Scores on the two tests are definitely related to one another and provide persuasive convergent validity evidence. That is, the correlations reported here provide compelling evidence that TOEFL and EFSET PLUS are converging toward measuring common reading and listening proficiency traits. However, the results do not imply that one test is more valid than the other. Notably, given the moderate, positive correlations reported here, it may seem reasonable to try to establish some type of statistical concordance relationship between the TOEFL and EFSET PLUS score scales. That is, score users might want some method of directly translating between points on the EFSET PLUS and TOEFL score scales. At present, there is no policy intent to provide that type of direct statistical alignment between the TOEFL and EFSET PLUS score scales.
Given the small sample size in this study and the only moderate positive correlations reported, any statistical concordance would not be psychometrically appropriate[6] and could be misleading.

[6] Score or classification concordance tables are sometimes created to show the approximate equivalence of scores on two scales that measure similar, but not necessarily the same, constructs. An example is the well-known concordance between college admissions tests like the ACT Assessment (ACT, Inc.) and the Scholastic Aptitude Test (SAT) in the US. Basing concordance on tests with only moderate correlations can lead to misuse of the scores if some users consider the scores to actually be exchangeable. Concorded scores are not exchangeable (Kolen & Brennan, 2014). A policy decision was therefore made NOT to provide concordance information between TOEFL and EFSET PLUS until additional evidence is gathered.
REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: AERA.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56.

Educational Testing Service. (2011). Reliability and comparability of TOEFL iBT scores. TOEFL iBT Research Insight, Series I, Volume 3. Princeton, NJ: Author.

EF. (2014). EFSET technical background report. London, UK: efset.org.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Washington, DC: American Council on Education/Praeger.

International English Language Testing System. (2013). IELTS researchers: Test performance. Author.

International Test Commission. (2008). International Test Commission guidelines.

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.

Linacre, M. (2014). WINSTEPS Rasch measurement (Version 3.81) [Computer program].

Lord, F. M., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Luecht, R. M. (2000, April). Implementing the computer-adaptive sequential testing (CAST) framework to mass produce high quality computer-adaptive and mastery tests. Symposium paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA.

Luecht, R. M. (2014a). External validity and reliability of EFSET [EF technical report]. London, UK: EF.

Luecht, R. M. (2014b). Computerized adaptive multistage design considerations and operational issues. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications. New York: Taylor & Francis.
ABOUT THE AUTHOR

Richard M. Luecht, PhD, Professor of Educational Research Methodology at the University of North Carolina at Greensboro (UNCG), is the chief psychometric consultant for the EFSET team. He is also a Senior Research Scientist with the Center for Assessment Research and Technology, a not-for-profit psychometric services division of the Center for Credentialing and Education, Greensboro, NC. Ric has published numerous articles and book chapters on technical measurement issues. He has been a technical consultant and advisor for many state departments of education and large-scale testing organizations, including New York, Pennsylvania, Delaware, Georgia, North Carolina, South Carolina, New Jersey, Puerto Rico, The College Board, Educational Testing Service, HUMRRO, the Partnership for Assessment of Readiness for College and Career (PARCC), the National Center and State Collaborative (NCSC), the American Institute of Certified Public Accountants, the National Board on Professional Teaching Standards, Cisco Corporation, the Defense Language Institute, the National Commission on Certification of Physician Assistants, and Education First (EFSET). He has been an active participant at National Council on Measurement in Education (NCME), American Educational Research Association (AERA) and Association of Test Publishers (ATP) meetings, teaching workshops and giving presentations on topics such as assessment engineering and principled assessment design, computer-based testing, multistage testing design and implementation, standard setting, automated test assembly, IRT calibration, scale maintenance and scoring, designing complex performance assessments, diagnostic testing, multidimensional IRT, and language testing.
Before joining UNCG, Ric was the Director for Computerized Adaptive Testing Research and Senior Psychometrician at the National Board of Medical Examiners, where he oversaw psychometric processing for the United States Medical Licensing Examination (USMLE) Steps and numerous subject examinations, as well as being instrumental in the design of systems and technologies for the migration of the USMLE programs to computerized delivery. He has also designed software systems and algorithms for large-scale automated test assembly and devised a computerized adaptive multistage testing implementation framework that is used by a number of large-scale testing programs. His most recent work involves the development of a comprehensive framework and associated methodologies for a new approach to large-scale formative assessment design and implementation called assessment engineering (AE).
Copyright © EF Education First Ltd. All rights reserved.
096 Professional Readiness Examination (Mathematics)
096 Professional Readiness Examination (Mathematics) Effective after October 1, 2013 MI-SG-FLD096M-02 TABLE OF CONTENTS PART 1: General Information About the MTTC Program and Test Preparation OVERVIEW
BASI Manual Addendum: Growth Scale Value (GSV) Scales and College Report
BASI Manual Addendum: Growth Scale Value (GSV) Scales and College Report Growth Scale Value (GSV) Scales Overview The BASI GSV scales were created to help users measure the progress of an individual student
Gamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
Benchmark Assessment in Standards-Based Education:
Research Paper Benchmark Assessment in : The Galileo K-12 Online Educational Management System by John Richard Bergan, Ph.D. John Robert Bergan, Ph.D. and Christine Guerrera Burnham, Ph.D. Submitted by:
Local outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
Data exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.
277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship)
1 Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship) I. Authors should report effect sizes in the manuscript and tables when reporting
Result Analysis of the Local FCE Examination Sessions (2013-2014) at Tomsk Polytechnic University
Doi:10.5901/mjss.2015.v6n3s1p239 Abstract Result Analysis of the Local FCE Examination Sessions (2013-2014) at Tomsk Polytechnic University Oksana P. Kabrysheva National Research Tomsk Polytechnic University,
Comparison of the Cambridge Exams main suite, IELTS and TOEFL
Comparison of the Cambridge Exams main suite, IELTS and TOEFL This guide is intended to help teachers and consultants advise students on which exam to take by making a side-by-side comparison. Before getting
Report on the Scaling of the 2012 NSW Higher School Certificate
Report on the Scaling of the 2012 NSW Higher School Certificate NSW Vice-Chancellors Committee Technical Committee on Scaling Universities Admissions Centre (NSW & ACT) Pty Ltd 2013 ACN 070 055 935 ABN
APEC Online Consumer Checklist for English Language Programs
APEC Online Consumer Checklist for English Language Programs The APEC Online Consumer Checklist For English Language Programs will serve the training needs of government officials, businesspeople, students,
Touchstone Level 2. Common European Framework of Reference for Languages (CEFR)
Touchstone Level 2 Common European Framework of Reference for Languages (CEFR) Contents Introduction to CEFR 2 CEFR level 3 CEFR goals realized in this level of Touchstone 4 How each unit relates to the
2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009
Technical Appendix 2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009 CREDO gratefully acknowledges the support of the State
Center for Advanced Studies in Measurement and Assessment. CASMA Research Report
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 13 and Accuracy Under the Compound Multinomial Model Won-Chan Lee November 2005 Revised April 2007 Revised April 2008
Final Project Report
Centre for the Greek Language Standardizing The Certificate of Attainment in Greek on the Common European Framework of Reference Final Project Report Spiros Papageorgiou University of Michigan Thessaloniki,
Research Report 2013-3. Rating Quality Studies Using Rasch Measurement Theory. By George Engelhard, Jr. and Stefanie A. Wind.
Research Report 2013-3 Rating Quality Studies Using Rasch Measurement Theory By George Engelhard, Jr. and Stefanie A. Wind easurement George Engelhard Jr. is a professor of educational measurement and
FGCU Faculty Salary Compression and Inversion Study
Compression and Inversion Study Submitted to: Steve Belcher Special Assistant for Faculty Affairs Office of Academic Affairs Florida Gulf Coast University Submitted by: The Balmoral Group, LLC Responsible
A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen.
A C T R esearcli R e p o rt S eries 2 0 0 5 Using ACT Assessment Scores to Set Benchmarks for College Readiness IJeff Allen Jim Sconing ACT August 2005 For additional copies write: ACT Research Report
ACT Research Explains New ACT Test Writing Scores and Their Relationship to Other Test Scores
ACT Research Explains New ACT Test Writing Scores and Their Relationship to Other Test Scores Wayne J. Camara, Dongmei Li, Deborah J. Harris, Benjamin Andrews, Qing Yi, and Yong He ACT Research Explains
Report on the Scaling of the 2013 NSW Higher School Certificate. NSW Vice-Chancellors Committee Technical Committee on Scaling
Report on the Scaling of the 2013 NSW Higher School Certificate NSW Vice-Chancellors Committee Technical Committee on Scaling Universities Admissions Centre (NSW & ACT) Pty Ltd 2014 ACN 070 055 935 ABN
Technical Report. Overview. Revisions in this Edition. Four-Level Assessment Process
Technical Report Overview The Clinical Evaluation of Language Fundamentals Fourth Edition (CELF 4) is an individually administered test for determining if a student (ages 5 through 21 years) has a language
Relating the ACT Indicator Understanding Complex Texts to College Course Grades
ACT Research & Policy Technical Brief 2016 Relating the ACT Indicator Understanding Complex Texts to College Course Grades Jeff Allen, PhD; Brad Bolender; Yu Fang, PhD; Dongmei Li, PhD; and Tony Thompson,
1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2
PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand
Guidelines for Best Test Development Practices to Ensure Validity and Fairness for International English Language Proficiency Assessments
John W. Young, Youngsoon So & Gary J. Ockey Educational Testing Service Guidelines for Best Test Development Practices to Ensure Validity and Fairness for International English Language Proficiency Assessments
Policy Capture for Setting End-of-Course and Kentucky Performance Rating for Educational Progress (K-PREP) Cut Scores
2013 No. 007 Policy Capture for Setting End-of-Course and Kentucky Performance Rating for Educational Progress (K-PREP) Cut Scores Prepared for: Authors: Kentucky Department of Education Capital Plaza
Mapping State Proficiency Standards Onto the NAEP Scales:
Mapping State Proficiency Standards Onto the NAEP Scales: Variation and Change in State Standards for Reading and Mathematics, 2005 2009 NCES 2011-458 U.S. DEPARTMENT OF EDUCATION Contents 1 Executive
Assessing the Relative Fit of Alternative Item Response Theory Models to the Data
Research Paper Assessing the Relative Fit of Alternative Item Response Theory Models to the Data by John Richard Bergan, Ph.D. 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax:
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
THE USE OF INTERNATIONAL EXAMINATIONS IN THE ENGLISH LANGUAGE PROGRAM AT UQROO
THE USE OF INTERNATIONAL EXAMINATIONS IN THE ENGLISH LANGUAGE PROGRAM AT UQROO M.C. José Luis Borges Ucan M.C. Deon Victoria Heffington Dr. Alfredo Marín Marín Dr. Caridad Macola Rojo Universidad de Quintana
Lin s Concordance Correlation Coefficient
NSS Statistical Software NSS.com hapter 30 Lin s oncordance orrelation oefficient Introduction This procedure calculates Lin s concordance correlation coefficient ( ) from a set of bivariate data. The
Report on the Scaling of the 2014 NSW Higher School Certificate. NSW Vice-Chancellors Committee Technical Committee on Scaling
Report on the Scaling of the 2014 NSW Higher School Certificate NSW Vice-Chancellors Committee Technical Committee on Scaling Contents Preface Acknowledgements Definitions iii iv v 1 The Higher School
Computer-based testing: An alternative for the assessment of Turkish undergraduate students
Available online at www.sciencedirect.com Computers & Education 51 (2008) 1198 1204 www.elsevier.com/locate/compedu Computer-based testing: An alternative for the assessment of Turkish undergraduate students
Information Security and Risk Management
Information Security and Risk Management by Lawrence D. Bodin Professor Emeritus of Decision and Information Technology Robert H. Smith School of Business University of Maryland College Park, MD 20742
Detection Sensitivity and Response Bias
Detection Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System (Observable)
General Information and Guidelines ESOL International All Modes Entry 3 and Levels 1, 2 and 3 (B1, B2, C1 and C2)
English Speaking Board General Information and Guidelines ESOL International All Modes Entry 3 and Levels 1, 2 and 3 (B1, B2, C1 and C2) Qualification numbers B1 500/3646/4 B2 500/3647/6 C1 500/3648/8
Statistics. Measurement. Scales of Measurement 7/18/2012
Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does
Correlation key concepts:
CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
Simple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
How To Get A Degree From The Brussels School Of International Relations
The UK s European university BRUSSELS SCHOOL OF INTERNATIONAL STUDIES Graduate study BRUSSELS: THE CAPITAL OF EUROPE Home to the main institutions of the European Union and numerous European and international
University and College Preparation with Kaplan International in Canada 2015
University and College Preparation with Kaplan International in Canada 2015 www.kaplaninternational.com Why university and college preparation with Kaplan International in Canada? At Kaplan International,
Evaluating a Retest Policy at a North American University
Journal of the National College Testing Association 2015/Volume 1/Issue 1 Evaluating a Retest Policy at a North American University CINDY L. JAMES, PhD Thompson Rivers University Retesting for admission
Automated Essay Scoring with the E-rater System Yigal Attali Educational Testing Service [email protected]
Automated Essay Scoring with the E-rater System Yigal Attali Educational Testing Service [email protected] Abstract This paper provides an overview of e-rater, a state-of-the-art automated essay scoring
Government of Ireland International Education Scholar & Ambassador Programme 2015/2016
Government of Ireland International Education Scholar & Ambassador Programme 2015/2016 Call for Applications from Brazilian Undergraduate & Postgraduate Students Mary Immaculate College is delighted to
The relationship of CPA exam delay after graduation to institutional CPA exam pass rates
The relationship of CPA exam delay after graduation to institutional CPA exam pass rates ABSTRACT John D. Morgan Winona State University This study investigates the relationship between average delay after
Annual Goals for Math & Computer Science
Annual Goals for Math & Computer Science 2010-2011 Gather student learning outcomes assessment data for the computer science major and begin to consider the implications of these data Goal - Gather student
Connecting English Language Learning and Academic Performance: A Prediction Study
Connecting English Language Learning and Academic Performance: A Prediction Study American Educational Research Association Vancouver, British Columbia, Canada Jadie Kong Sonya Powers Laura Starr Natasha
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
Home Schooling Achievement
ing Achievement Why are so many parents choosing to home school? Because it works. A 7 study by Dr. Brian Ray of the National Home Education Research Institute (NHERI) found that home educated students
A Performance Comparison of Native and Non-native Speakers of English on an English Language Proficiency Test ...
technical report A Performance Comparison of Native and Non-native Speakers of English on an English Language Proficiency Test.......... Agnes Stephenson, Ph.D. Hong Jiao Nathan Wall This paper was presented
The Impact of Management Information Systems on the Performance of Governmental Organizations- Study at Jordanian Ministry of Planning
The Impact of Management Information Systems on the Performance of Governmental Organizations- Study at Jordanian Ministry of Planning Dr. Shehadeh M.A.AL-Gharaibeh Assistant prof. Business Administration
Testing Accommodations For English Learners In New Jersey
New Jersey Department of Education Partnership for Assessment of Readiness for College and Careers (PARCC) Testing Accommodations for English Learners (EL) Revised December 2014 Overview Accommodations
Effects of Different Response Types on Iranian EFL Test Takers Performance
Effects of Different Response Types on Iranian EFL Test Takers Performance Mohammad Hassan Chehrazad PhD Candidate, University of Tabriz [email protected] Parviz Ajideh Professor, University
Impact / Performance Matrix A Strategic Planning Tool
Impact / Performance Matrix A Strategic Planning Tool Larry J. Seibert, Ph.D. When Board members and staff convene for strategic planning sessions, there are a number of questions that typically need to
Overview of Factor Analysis
Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,
For example, estimate the population of the United States as 3 times 10⁸ and the
CCSS: Mathematics The Number System CCSS: Grade 8 8.NS.A. Know that there are numbers that are not rational, and approximate them by rational numbers. 8.NS.A.1. Understand informally that every number
Analysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
PERSONAL INJURY LITIGATION AND THE NON- ENGLISH SPEAKING CLIENT
PERSONAL INJURY LITIGATION AND THE NON- ENGLISH SPEAKING CLIENT By Justin Valentine There are few specific provisions in the CPR dealing with non-english speaking parties and witnesses. This is surprising
Univariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA
We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical
Jon A. Krosnick and LinChiat Chang, Ohio State University. April, 2001. Introduction
A Comparison of the Random Digit Dialing Telephone Survey Methodology with Internet Survey Methodology as Implemented by Knowledge Networks and Harris Interactive Jon A. Krosnick and LinChiat Chang, Ohio
TEAS V National Standard Setting Study 2010 Executive Summary
TEAS V National Standard Setting Study 2010 Executive Summary The purpose of the TEAS V National Standard Setting Study was to develop a set of recommended criterion-referenced cut scores that nursing
GMAC. Predicting Graduate Program Success for Undergraduate and Intended Graduate Accounting and Business-Related Concentrations
GMAC Predicting Graduate Program Success for Undergraduate and Intended Graduate Accounting and Business-Related Concentrations Kara M. Owens and Eileen Talento-Miller GMAC Research Reports RR-06-15 August
Factorial Invariance in Student Ratings of Instruction
Factorial Invariance in Student Ratings of Instruction Isaac I. Bejar Educational Testing Service Kenneth O. Doyle University of Minnesota The factorial invariance of student ratings of instruction across
TExMaT I Texas Examinations for Master Teachers. Preparation Manual. 087 Master Mathematics Teacher EC 4
TExMaT I Texas Examinations for Master Teachers Preparation Manual 087 Master Mathematics Teacher EC 4 Copyright 2006 by the Texas Education Agency (TEA). All rights reserved. The Texas Education Agency
Internal Consistency: Do We Really Know What It Is and How to Assess It?
Journal of Psychology and Behavioral Science June 2014, Vol. 2, No. 2, pp. 205-220 ISSN: 2374-2380 (Print) 2374-2399 (Online) Copyright The Author(s). 2014. All Rights Reserved. Published by American Research
Literacy and numeracy assessments of adult English language learners. An analysis of data from the Literacy and Numeracy for Adults Assessment Tool
Literacy and numeracy assessments of adult English language learners An analysis of data from the Literacy and Numeracy for Adults Assessment Tool This series covers research on teaching and learning in
Stability of School Building Accountability Scores and Gains. CSE Technical Report 561. Robert L. Linn CRESST/University of Colorado at Boulder
Stability of School Building Accountability Scores and Gains CSE Technical Report 561 Robert L. Linn CRESST/University of Colorado at Boulder Carolyn Haug University of Colorado at Boulder April 2002 Center
application materials ll.m. program
application materials ll.m. program Application for Class of 2014-15 boston college law school Newton, Massachusetts 02459 U.S.A. application for admission ll.m. program boston college law school Newton,
application materials ll.m. program
application materials ll.m. program Application for Class of 2016-17 boston college law school 885 Centre Street Newton, Massachusetts 02459 U.S.A. application for admission ll.m. program boston college
Introductory Guide to the Common European Framework of Reference (CEFR) for English Language Teachers
Introductory Guide to the Common European Framework of Reference (CEFR) for English Language Teachers What is the Common European Framework of Reference? The Common European Framework of Reference gives
Understanding Your Praxis Scores
2014 15 Understanding Your Praxis Scores The Praxis Series Assessments are developed and administered by Educational Testing Service (E T S ). Praxis Core Academic Skills for Educators (Core) tests measure
