A multiple objective test assembly approach for exposure control problems in Computerized Adaptive Testing




Psicológica (2010), 31, 335-355.

Bernard P. Veldkamp* (1), Angela J. Verschoor (2) & Theo J.H.M. Eggen (2)

(1) Research Center for Examination and Certification, University of Twente, The Netherlands; (2) CITO, The Netherlands

Overexposure and underexposure of items in the bank are serious problems in operational computerized adaptive testing (CAT) systems. These exposure problems might result in item compromise, or point at a waste of investments. The exposure control problem can be viewed as a test assembly problem with multiple objectives. Information in the test has to be maximized, item compromise has to be minimized, and pool usage has to be optimized. In this paper, a multiple objectives method is developed to deal with both types of exposure problems. In this method, exposure control parameters based on observed exposure rates are implemented as weights for the information in the item selection procedure. The method does not need time-consuming simulation studies, and it can be implemented conditional on ability level. The method is compared with the Sympson-Hetter method for exposure control, with the Progressive method, and with alpha-stratified testing. The results show that the method is successful in dealing with both kinds of exposure problems.

In computerized adaptive testing (CAT), items are selected on-the-fly. Adaptive procedures are used to select items with optimal measurement characteristics at the estimated ability level of examinees. CAT possesses the same advantages as other computer-based testing procedures, like increased flexibility and connection of administrative systems. In addition, in a CAT the test length can be decreased by almost 40 percent without a decrease in measurement precision, and examinees are no longer frustrated by items that are either too difficult or too easy (see e.g. van der Linden & Glas, 2000; Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, & Thissen, 1990).
* E-mail: b.p.veldkamp@gw.utwente.nl

CAT systems are theoretically based on the properties of item response theory (IRT). In IRT, person parameters and item parameters are separated. The item parameters are supposed to be invariant for different values of the person parameters. Therefore, items can be calibrated and the item parameters can be stored in item banks. From these item banks, items that provide most information at the estimated person parameter are selected.

In many large-scale testing programs, paper-and-pencil tests have been replaced by CATs. For example, CAT versions of the Graduate Record Examination (GRE) and the Armed Services Vocational Aptitude Battery (ASVAB) are now available. CITO (National Institute of Educational Measurement) in the Netherlands administers several CATs, like MATHCAT (CITO, 1999), TURCAT (CITO, in press), DSLcat (CITO, 2002) and KindergartenCAT. MATHCAT was developed for diagnosing mathematics deficiencies of college students (Verschoor & Straetmans, 2000), TURCAT tests proficiency in Turkish as a second language, DSLcat tests Dutch as a Second Language, and KindergartenCAT contains tests for measuring the ordering, language, and orientation in time and space abilities of young children (Eggen, 2004).

These CATs, like almost all operational CAT systems, encounter an unevenly distributed use of items in the bank. In general, most item selection procedures favor some items above others, due to superior measurement properties or favorable item characteristics. As a result, some items are overexposed. This might result in item compromise, which undermines the validity of score-based inferences (Wise & Kingsbury, 2000). On the other hand, some items might be underexposed, which is a waste of investments. Therefore, choosing a strategy for controlling the exposure of items to examinees has become an integral part of test development (Davis & Dodd, 2003).

In this paper, a multiple objectives exposure control method is proposed for dealing with problems of both overexposure and underexposure of the items. First, a theoretical background is given. Then, the new method is introduced. The performance of the method is evaluated in two studies.
Finally, recommendations about the use of the new method are given.

THEORETICAL BACKGROUND

One of the first methods developed to deal with exposure control problems is the 5-4-3-2-1 technique (Hetter & Sympson, 1997; McBride & Martin, 1983) applied in the CAT-ASVAB. This randomized procedure was developed to reduce the probability of repeated item sequences in the first five iterations of CAT. Kingsbury and Zara (1989) and Thomasson (1998) developed different randomization methods aimed at reducing overall item exposure. Rotating item pool methods (Ariel, Veldkamp, & van der Linden, 2004; Way, 1998; Way, Steffen, & Anderson, 1998) and CAST (Luecht & Nungester, 1998) were developed to spread the items over different tests by a priori reducing the availability of items for selection. However, in the CAT industry, item-exposure control methods based on the Sympson and Hetter method (1985) are most commonly applied.

Sympson-Hetter methods

Although some variations exist, the general idea underlying these methods can be described as follows. To define these methods, two events have to be distinguished: the event that item $i$ is selected by the CAT algorithm ($S_i$), and the event that item $i$ is administered ($A_i$). The probability that event $A_i$ occurs is the probability that $A_i$ occurs given that $S_i$ has occurred, times the probability that $S_i$ occurs:

\[ P(A_i) = P(A_i \mid S_i) \, P(S_i). \tag{1} \]

To control the item exposure, one could focus on either of the two probabilities. In the Sympson-Hetter methods, exposure control is conducted after an item is selected. The conditional probabilities $P(A_i \mid S_i)$ are used as control parameters. These control parameters guide the probability experiment in which it is determined whether the selected item is administered or temporarily removed from the pool for the person tested. The idea underlying the method is that when $r_{\max}$ is the target value for the maximum exposure rate, the conditional probabilities can be set in such a way that $P(A_i) \le r_{\max}$. The procedure to find appropriate values for the control parameters is quite time consuming. The appropriate values can be found in a series of iterative adjustments.

These Sympson-Hetter methods suffer from several drawbacks. When the population is categorized based on ability, the exposure rates within subgroups might still be high. Time-consuming simulation studies have to be conducted for calculating the exposure control parameters.
Moreover, the procedure for calculating the control parameters does not converge properly, and the claim that $P(A_i) \le r_{\max}$ holds cannot be validated (van der Linden, 2003). Finally, it is also known that the Sympson-Hetter method is hardly effective in dealing with underexposure problems. Underexposure refers to the problem that items in the pool are administered so seldom that the expense of constructing them cannot be justified.
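The probability experiment behind the Sympson-Hetter methods can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the dictionary of control parameters, and the item labels are hypothetical and do not come from the paper, and the candidates are assumed to be ranked by information already.

```python
import random


def administer_sympson_hetter(ranked_items, control, rng):
    """Run the probability experiment down a ranked candidate list.

    ranked_items: item ids ordered from most to least informative.
    control:      dict mapping item id -> P(A_i | S_i), the control parameter.
    rng:          random.Random instance that drives the experiment.
    Returns (administered_item, rejected_items); the first element is None
    when every candidate fails its experiment.
    """
    rejected = []
    for item in ranked_items:
        # The selected item is administered with probability P(A_i | S_i).
        if rng.random() < control[item]:
            return item, rejected
        # Otherwise it is set aside for this examinee only.
        rejected.append(item)
    return None, rejected


def exposure_probability(p_admin_given_selected, p_selected):
    """Equation (1): P(A_i) = P(A_i | S_i) * P(S_i)."""
    return p_admin_given_selected * p_selected
```

With a control parameter of 0 the item is always passed over and with a control parameter of 1 it is always administered, so the control parameters directly scale the selection probability into the exposure probability.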

Several improvements of the original procedure have been developed. Stocking & Lewis (1998) proposed to conduct exposure control conditional on ability level, to overcome the problem of high exposure rates for specific ability levels. They defined the events in (1) conditional on ability level. The new relationship can be described as

\[ P(A_i \mid \theta_j) = P(A_i \mid S_i, \theta_j) \, P(S_i \mid \theta_j), \quad j = 1, \ldots, J, \tag{2} \]

where $J$ defines the number of ability levels to take into account. The time needed to calculate the exposure control parameters increases $J$ times, because control parameters have to be calculated for all $J$ ability levels. When this new procedure is applied, exposure rates within subgroups of the ability scale will also be below the specified level. This modification solves one of the problems of the method, but convergence problems and loss of total test information still exist.

Van der Linden (2003) proposed to modify the Sympson-Hetter method to speed up the iterative adjustment process to find the exposure control parameters. In the Sympson-Hetter method, the exposure parameters are adjusted with the following rule:

\[ P^{(t+1)}(A_i \mid S_i) := \begin{cases} 1 & \text{if } P^{(t)}(S_i) \le r_{\max}, \\ r_{\max} / P^{(t)}(S_i) & \text{if } P^{(t)}(S_i) > r_{\max}, \end{cases} \tag{3} \]

where $t$ is the iteration number, and $r_{\max}$ is the desired target for the exposure rates. The adjustment process can be sped up by changing this rule into

\[ P^{(t+1)}(A_i \mid S_i) := \begin{cases} P^{(t)}(A_i \mid S_i) & \text{if } P^{(t)}(A_i) \le r_{\max}, \\ r_{\max} / P^{(t)}(S_i) - \gamma & \text{if } P^{(t)}(A_i) > r_{\max}, \end{cases} \tag{4} \]

where $\gamma$ is a parameter to increase the size of the adjustment. Although less time is needed for finding exposure control parameters, the process is still generally tedious and time-consuming, particularly if the control parameters have to be set conditionally on a set of realistic ability values for the population of examinees.
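One pass of the original Sympson-Hetter adjustment rule can be written as a small function. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and the selection probabilities are taken as given, whereas in practice they would be estimated from a simulation run at each iteration.

```python
def adjust_control_parameters(p_selected, r_max):
    """Apply the original adjustment rule once to every item.

    p_selected: dict mapping item id -> P(S_i) observed in iteration t.
    r_max:      target value for the maximum exposure rate.
    Returns a dict mapping item id -> P(A_i | S_i) for iteration t + 1.
    """
    new_control = {}
    for item, p_s in p_selected.items():
        if p_s <= r_max:
            # The item is selected rarely enough; always administer it.
            new_control[item] = 1.0
        else:
            # Scale down so that P(A_i | S_i) * P(S_i) equals r_max.
            new_control[item] = r_max / p_s
    return new_control
```

Because changing the control parameters also changes which items are selected in the next simulation run, the rule has to be applied repeatedly, which is where the time cost of the method comes from.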

Barrada, Veldkamp & Olea (2009) modified the Sympson-Hetter approach by varying the exposure control parameters throughout the test administration. To avoid selecting all items with high discriminating power while the estimate of the trait level is still uncertain, low values for $r_{\max}$ are imposed at the beginning of the test. The values of $r_{\max}$ increase during CAT administration. So, highly discriminating items are reserved for the later stages of the test.

Eligibility methods

Recently, van der Linden and Veldkamp (2004, 2007) proposed to formulate the exposure control problem as a problem of constrained test assembly. Like the Sympson-Hetter method, it uses a probabilistic algorithm. However, this method does not need time-consuming simulation studies to find control parameters for the probabilistic experiment. Based on the observed exposure rates, the algorithm determines whether item eligibility constraints are added to the model for selecting the items in CAT. The method consists of several steps. First, a probability experiment is conducted to determine if an item is eligible. Second, ineligibility constraints are added to the test assembly model, and the model is solved. Third, if the addition of ineligibility constraints leads to an infeasible model, the constraints are removed and the relaxed model is solved. The probability of item $i$ being eligible for examinee $(j+1)$ can be expressed in terms of:

$\varepsilon_{ij}$: number of examinees through $j$ for whom item $i$ has been eligible.
$\alpha_{ij}$: number of examinees through $j$ to whom item $i$ has been administered.

For examinee $(j+1)$, item $i$ is eligible with estimated probability

\[ P^{(j+1)}(E_i) = \min\left\{ \frac{\varepsilon_{ij} \, r_{\max}}{\alpha_{ij}}, \, 1 \right\} \tag{5} \]

with $\alpha_{ij} > 0$. For $\alpha_{ij} = 0$, the probability of being eligible is defined to be $P^{(j+1)}(E_i) = 1$. The method proved to perform well in dealing with (over)exposure of popular items in the bank.

Both the (modified) Sympson-Hetter methods and the Eligibility methods mainly focus on overexposure of popular items in the pool.
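The eligibility probability used by the Eligibility methods depends only on two running counts per item, which is why no prior simulation study is needed. A minimal sketch follows; the function and argument names are hypothetical.

```python
def eligibility_probability(n_eligible, n_administered, r_max):
    """Estimated probability that an item is eligible for the next examinee.

    n_eligible:     number of examinees so far for whom the item was eligible.
    n_administered: number of examinees so far to whom it was administered.
    r_max:          target value for the maximum exposure rate.
    """
    if n_administered == 0:
        # An item that has never been administered is always eligible.
        return 1.0
    return min(n_eligible * r_max / n_administered, 1.0)
```

For example, an item that was eligible for 100 examinees and administered to 50 of them gets eligibility probability 0.6 under a target of 0.3, while an item administered to only 10 of them stays fully eligible.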
Although a decrease in the exposure rates of the most popular items results in some increase in the exposure rates of less popular items, only the exposure rates of items with almost as favorable attributes as the most popular items increase. Unpopular items are still hardly selected.

Methods for controlling underexposure

For solving the problem of underexposure, different methods have been developed. Chang & Ying (1999) introduced α-stratified testing. In their approach, item pools are stratified with respect to the values of their discrimination parameters α. The first items are chosen from the stratum with the lowest α values. A second group of items is chosen from the subsequent stratum, and the last items in the test from the stratum with the highest α values. This approach is based on the observation that estimates of the ability parameters are very unstable during the administration of the first few items of a CAT. Because of this, less discriminating items should be used in the earlier stages, while the most discriminating items should be used when estimates have stabilized. The claim is that this approach would lead to a more balanced item exposure distribution and improve item pool utilization. Unfortunately, this method does not impose any bounds on exposure rates. Some observed exposure rates might be much higher than expected (Parshall, Kromrey, & Hogarty, 2000). Besides, the method is highly dependent on item bank properties. Usually, discrimination parameters are not uniformly distributed, or the discrimination and the difficulty parameters might correlate positively.

A different method for solving the problem of underexposure is based on the observation that exposure problems result from the item selection criterion that is applied. When items are selected that maximize Fisher's information criterion, items with high discrimination values tend to be selected more often than the others. One way to reduce both over- and underexposure is to add a random component to the item selection criterion. Revuelta and Ponsoda (1998) elaborated this idea in their Progressive method. When this method is applied, a random value $R_i$ in the interval $[0, H]$, where $H$ is the maximum value of the information function, is assigned to each item in the bank.
Items are selected based on a weighted combination of the random component and Fisher's information criterion:

\[ \left( 1 - \frac{s}{n} \right) R_i + \frac{s}{n} \, I_i(\hat{\theta}), \tag{6} \]

where the weighting factor is determined by the serial position $s$ of the item in the test and the total test length $n$. For selecting the first item, the value of the criterion is dominated by the value of the random component, while for selecting the last item, the random component does not influence the criterion anymore. This method proved to be effective against underexposure. However, it is not conditional on ability level, and it cannot be guaranteed that targets for exposure rates will be met. Another drawback is that items that are completely off target might be presented to a candidate.

Dealing with exposure control problems in CAT is rather complicated. Although several promising methods have been developed, all of them seem to suffer from various drawbacks. Because of this, exposure control problems still exist. In most large-scale testing systems, a rather pragmatic approach is used and a combination of over- and underexposure control methods is implemented. For example, in most CATs developed by CITO, a combination of the Sympson-Hetter method and a generalization of the Progressive method is implemented (Eggen, 2001). By implementing a combination of methods, an attempt is made both to maximize measurement accuracy and to balance item pool usage.

MULTIPLE OBJECTIVITY AND EXPOSURE CONTROL

When an exposure control method is implemented, the test assembly problem can be formulated as an instance of multiple objective decision making (Veldkamp, 1999). The first objective is to assemble tests according to the test specifications. In general, the amount of information in the test is maximized, while a number of constraints on test content, item format, word count or gender orientation of the items have to be met. The second objective in the process is related to exposure of the items. The objective is to obtain an evenly distributed use of items in the bank. The observation that the exposure control problem is a problem of multiple objectives in test assembly is the cornerstone of the method presented in this paper.
The main idea is that exposure control methods should represent this multiple-objectivity. Both objectives can be formulated in mathematical programming terms. The first objective can be formulated as:

\[
\begin{aligned}
\max \quad & \sum_{i=1}^{I} I_i(\theta) \, x_i \\
\text{subject to} \quad
& \sum_{i \in S_j} x_i \le b_j, && \text{(categorical)} \\
& \sum_{i=1}^{I} a_i x_i \le b_j, && \text{(quantitative)} \\
& \sum_{i \in S_e} x_i \le 1, && \text{(inter-item dependencies)} \\
& \sum_{i=1}^{I} x_i = n, && \text{(test length)} \\
& x_i \in \{0, 1\},
\end{aligned}
\tag{7}
\]

where $x_i$ denotes whether item $i$ is selected ($x_i = 1$) or not ($x_i = 0$). The information in the test is maximized. The first general constraint represents constraints like content or item type. The second constraint represents specifications related to quantitative attributes like word count or response times. The third constraint is formulated to deal with dependencies between items like enemies, but also item sets. In this way, the first objective can be obtained.

Formulating the second objective is slightly more complicated. In van der Linden and Veldkamp (2007) it is shown that the following equality holds:

\[ \sum_{i=1}^{I} \varphi_i = n, \tag{8} \]

where $\varphi_i$ is the observed exposure rate of item $i$, and $n$ represents the test length. Because of this, it suffices to minimize the maximum exposure rate to obtain an evenly distributed use of the items in the bank. Therefore, the second objective can be formulated as

\[ \min \max_i \frac{j \varphi_i + x_i}{j + 1}, \tag{9} \]

where $j$ is the number of previously tested examinees.

These two objectives might conflict. To maximize the amount of information in the test, highly discriminating items are often selected. On the other hand, to obtain an evenly distributed use of the bank, these popular items cannot be administered to all candidates. How to deal with these conflicting objectives comes down to the test assembler's preferences.

One method for dealing with multiple objective test assembly problems is to combine both objectives into one single objective function, by using one of the objectives as a weighting function for the other (Veldkamp, 1999). When this method is applied to the exposure control problem, the information can be weighted with some function of the observed item exposure rates. The resulting objective of the test assembly problem can be formulated as:

\[ \max \sum_{i=1}^{I} w(\varphi_i) \, I_i(\theta) \, x_i, \tag{10} \]

where $w(\varphi_i)$ is a weighting function that represents the test assembler's preferences. Several weighting functions can be applied. For example, the function can be based on the observation that the use of popular items can be reduced by temporarily removing them from the pool of available items, until their observed exposure rate is smaller than $r_{\max}$ (see Revuelta & Ponsoda, 1998). This weighting function is shown in Figure 1a. A second example is based on the observation that the use of unpopular items ($\varphi_i \ll r_{\max}$) can be increased by increasing their weights. To boost the use of unpopular items, the weighting function might decrease for increasing exposure rates. This observation results in the weighting function shown in Figure 1b. The third example is related to test fairness. Because excluding some items from administration for some students, as in the first and second weighting functions, might not be considered fair, assigning a small weight to popular items ($\varphi_i > r_{\max}$) reduces the probability that they are selected, but does not make them ineligible. Two weighting functions that combine observations two and three are shown in Figures 1c and 1d. Moreover, the causes of overexposure can be taken into account when the weighting function is defined. The main cause of exposure problems lies in the amount of information provided by the item.
Since the amount of information provided by an item is related to the squared discrimination of the item, a weighting function that takes the amount of information into account can be formulated as:

\[ w(\varphi_i) = a_i^{-2} \quad (\varphi_i > r_{\max}). \tag{11} \]
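Selecting one item under a weighted information criterion of this kind can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: item information is computed with the standard two-parameter logistic formula, items at or below the target rate keep weight 1, and all function names, item labels, and parameter values are hypothetical.

```python
import math


def fisher_information(a, b, theta):
    """Item information under the two-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def weight(a, phi, r_max):
    """Weight 1 up to r_max (an assumption); inverse squared discrimination above."""
    return 1.0 if phi <= r_max else a ** -2


def select_item(items, exposure, theta, r_max, administered=()):
    """Return the id of the item with the largest weighted information.

    items:    dict mapping item id -> (discrimination a, difficulty b).
    exposure: dict mapping item id -> observed exposure rate phi.
    """
    best_item, best_value = None, -math.inf
    for item, (a, b) in items.items():
        if item in administered:
            continue
        value = weight(a, exposure[item], r_max) * fisher_information(a, b, theta)
        if value > best_value:
            best_item, best_value = item, value
    return best_item
```

With two items of equal difficulty and discriminations 3 and 2, the more discriminating item wins as long as its exposure rate is at or below the target; once it is overexposed, its information is divided by 9 and the other item is selected instead.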

Figure 1. Weighting functions (weighting factor on y-axis and observed exposure rate on x-axis).

In all these examples, a distinction is made between items that are overexposed ($\varphi_i > r_{\max}$) and those that are not ($\varphi_i \le r_{\max}$). For both intervals, different weighting functions can be defined, based on a number of observations. However, the question remains which weighting function performs best for which interval. A systematic approach to answering this question is to distinguish between both intervals and to see which function for which interval results in the best exposure control method.

NUMERICAL EXAMPLES

A comparison study was carried out to judge the performance of the multiple objective exposure control method. Several settings of the method were compared with the Sympson-Hetter method, the alpha-stratified method, randomized item selection, and CAT without exposure control. In the first example, different weighting functions were compared. Different methods for exposure control were compared in Example 2.

Example 1. To find the best settings for the multiple objective exposure control method, several functions were implemented. The items in the bank were calibrated with the OPLM, a special version of the 2PLM, where the discrimination parameters are restricted to be integer. The OPLM is the general IRT model underlying all CATs developed by CITO. The item bank consisted of 300 items. The test length of all CATs was set equal to 40 items. Fisher's information criterion was used to select the items. The ability was estimated with the weighted maximum likelihood estimator (Warm, 1989), assuming that the item parameters are known. The initial estimate of the ability was set equal to zero. For all examples, 4 examinees were randomly sampled from a normal distribution. The maximum exposure rate was set equal to $r_{\max} = .3$ in the examples. These settings most closely resembled the CITO context.

To compare the results, the following criteria were applied. The performance of the CAT was evaluated by taking both the bias and the root mean squared error (RMSE) into account:

\[ \text{bias} = \frac{\sum_{p=1}^{P} (\hat{\theta}_p - \theta_p)}{P}, \tag{12} \]

\[ \text{RMSE} = \sqrt{\frac{\sum_{p=1}^{P} (\hat{\theta}_p - \theta_p)^2}{P}}, \tag{13} \]

where $p = 1, \ldots, P$ runs over all persons.

To control for underexposure of the items, three different functions were distinguished for $\varphi_i \le r_{\max}$. The first function does not control for underexposure of the items ($w(\varphi_i) = 1$). The second function tries to control for underexposure by assigning decreasing weights when the observed exposure rate increases. The function is defined such that the weight equals one for items that have not been administered yet ($w(\varphi_i = 0) = 1$), and it decreases linearly, where the weight for items with observed exposure equal to $r_{\max}$ is set equal to a constant ($w(\varphi_i = r_{\max}) = c$, where $c \ll 1$). The third function aims at the causes of underexposure, and relates the weights to the inverse of the squared discrimination.

For overexposure ($\varphi_i > r_{\max}$), four different functions were distinguished in this study. First, overexposure was not allowed ($w(\varphi_i) = 0$). In the second function, a small weight is assigned ($w(\varphi_i) = c$). In the third function, the weight decreases linearly, where the weight for items with observed exposure equal to $r_{\max}$ is set equal to a constant ($w(\varphi_i = r_{\max}) = c$, where $c \ll 1$), and the weight is set equal to zero when the observed exposure rate equals one ($w(\varphi_i = 1) = 0$). The fourth function aims at the causes of overexposure, and relates the weights to the inverse of the squared discrimination. In the examples, the weighting constant was set equal to $c = .4$. When the multiple objective exposure control method is applied, any weighting function is a combination of a function for controlling underexposure and a function for controlling overexposure of the items.

The weighting functions were compared for two different settings, $r_{\max} = .3$. Since 40 items were selected from an item bank of 300 items, the lower bound for $r_{\max}$ equals .133. Resulting bias and RMSE for $r_{\max} = .3$ are shown in Table 1 and Table 2. The exposure rates of the items are shown in Figure 2.

With respect to functions controlling for overexposure, the results were more or less what we had expected. The conditions where no overexposure was allowed resulted in the highest values for the RMSE. The lowest values were obtained when small weights were assigned to overexposed items.
Both adaptive functions ended up somewhere between them. An unexpected effect was that controlling for underexposure resulted in smaller RMSEs. This might be caused by an interaction between the composition of the item pool and the adaptive item selection process.
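A weighting function that combines a linearly decreasing part for items at or below the target rate with a zero weight for overexposed items can be sketched as below, together with the lower bound on the target rate implied by test length and bank size. The function names are hypothetical, and the value of the constant c used in any call is an illustrative assumption, not a setting taken from the study.

```python
def make_weight_function(r_max, c):
    """Build w(phi): linear from 1 at phi = 0 to c at phi = r_max, zero above.

    r_max: target value for the maximum exposure rate.
    c:     weight at phi = r_max (illustrative; choose c much smaller than 1).
    """
    def w(phi):
        if phi > r_max:
            # Overexposed items are excluded from selection.
            return 0.0
        # Linear interpolation between (0, 1) and (r_max, c).
        return 1.0 - (1.0 - c) * phi / r_max
    return w


def lower_bound_r_max(test_length, bank_size):
    """Smallest feasible maximum exposure rate: every item used equally often."""
    return test_length / bank_size
```

For a test of 40 items drawn from a bank of 300, the lower bound evaluates to 40 / 300, roughly .133, so any target below that value cannot be met.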

Table 1. Bias for different combinations of weighting functions for under- and overexposure.

                      Underexposure
Overexposure          w(φ) = 1     w(φ) = linear     w(φ) = a^-2
w(φ) = 0              .            .                 .
w(φ) = c              .            .                 .
w(φ) = linear         .            .                 .
w(φ) = a^-2           .            .                 .

As can be seen in Table 1, the values for the resulting biases hardly differ from zero, and no significant differences between the conditions were found.

Table 2. RMSEs for different combinations of weighting functions for under- and overexposure.

                      Underexposure
Overexposure          w(φ) = 1     w(φ) = linear     w(φ) = a^-2
w(φ) = 0              .98          .94               .96
w(φ) = c              .94          .9                .9
w(φ) = linear         .95          .9                .92
w(φ) = a^-2           .96          .93               .93
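The two evaluation criteria reported in these tables, bias and RMSE, are straightforward to compute from paired lists of estimated and true abilities. A minimal sketch with hypothetical function names:

```python
import math


def bias(estimates, true_values):
    """Mean signed difference between estimated and true abilities."""
    differences = [e - t for e, t in zip(estimates, true_values)]
    return sum(differences) / len(differences)


def rmse(estimates, true_values):
    """Root mean squared difference between estimated and true abilities."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))
```

Errors of opposite sign cancel in the bias but not in the RMSE, which is why a condition can show a bias near zero and still differ from the others in RMSE.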

Figure 2. Observed exposure for different settings of the multiple objective exposure control method, $r_{\max} = .3$ (twelve panels, one per combination of weighting functions).

The observed exposure rates are shown in Figure 2. This figure has to be read in the same way as both tables; the first row of the first column describes the results for the condition of no underexposure control, $w(\varphi_i) = 1$, and no overexposure allowed, $w(\varphi_i) = 0$, etc. For overexposure, the results were clear. The best results with respect to observed exposure rates were obtained when no overexposure was allowed (row 1). Allowing overexposed items to be used (rows 2-4) resulted in high overexposure of some popular items. These results can be explained by checking the weighting functions. Because the weighting functions just weight the information provided by an item, very informative items might still be selected when the difference in weights between overexposed and less popular items is small. The method of decreasing weights (row 3) resulted in the smallest overexposure of the most popular items.

For underexposure, the methods with decreasing weights (columns 2-3) performed best. They performed better than the cases where no underexposure control was applied (column 1). With respect to observed exposure rates, no differences were found due to the way the weights decreased. Taking both RMSE and observed exposure rates into account, the best results were obtained when no overexposure was allowed (row 1) and underexposure was controlled for with linearly decreasing weights (column 2).

Example 2. To evaluate the performance of the multiple objective exposure control method, it was compared with the alpha-stratified method, the Sympson-Hetter method, and the Progressive method in combination with Sympson-Hetter. For the alpha-stratified method we used four strata. Stratum 1 contained 40% of the items in the bank. Stratum 2 also contained 40% of the items. Stratum 3 had 15% of the items. Stratum 4 had only 5% of the items. During the test assembly process, the same percentages of items were selected from the strata.
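The stratification used for the alpha-stratified method can be sketched as sorting the bank by discrimination and cutting it at cumulative fractions; applying the same fractions to the test length gives the number of items drawn from each stratum. The function names and the rounding rule are assumptions made for this illustration.

```python
def stratum_sizes(total, fractions):
    """Split `total` into consecutive stratum sizes following `fractions`."""
    bounds = []
    cumulative = 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        bounds.append(round(cumulative * total))
    bounds.append(total)  # the last stratum takes whatever remains
    sizes, previous = [], 0
    for bound in bounds:
        sizes.append(bound - previous)
        previous = bound
    return sizes


def stratify(discriminations, fractions):
    """Return lists of item indices, from lowest to highest discrimination."""
    order = sorted(range(len(discriminations)), key=lambda i: discriminations[i])
    strata, start = [], 0
    for size in stratum_sizes(len(order), fractions):
        strata.append(order[start:start + size])
        start += size
    return strata
```

With fractions of .40, .40, .15 and .05, a bank of 300 items splits into strata of 120, 120, 45 and 15 items, and a 40-item test draws 16, 16, 6 and 2 items from them in that order.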
To add some benchmarks, both randomized item selection and item selection based on Fisher information without exposure control were added to the example. In this comparison study, the weighting function that performed best with respect to bias, RMSE, and observed exposure rates in the first study was applied. The resulting function combined a linear part to control for underexposure and a weight equal to zero to control for overexposure. For every exposure control method, 4 CATs were simulated. The maximum exposure rates were set equal to r max = .3 in these simulations. The results are shown in Table 3.

Table 3. Performance of different exposure control methods (r max = .3).

Method                       Bias   RMSE
No exposure control          .      .86
Multiple objective method    .      .94
Sympson-Hetter method        .      .98
Alpha-stratified method      .      .9
Progressive method (S-H)     .      .97
Randomized item selection    .      .33

When the results in Table 3 are compared, it can be observed that the different exposure control methods did not result in any bias. Moreover, the multiple objective exposure control method resulted in the smallest RMSE. The observed exposure rates are shown in Figure 3. It can be seen that our implementation of the alpha-stratified method was not very successful in dealing with overexposure. For some items the observed exposure rate exceeded .4. A different stratification might have performed better, although we did not succeed in finding good settings. With respect to underexposure control, the alpha-stratified method performed best. For practical applications, a combination of the alpha-stratified method with the Sympson-Hetter method or the multiple objective method might be recommended. Almost no differences were found between the Sympson-Hetter method and the combination of the progressive method and the Sympson-Hetter method. The progressive method performed slightly better with respect to underexposure. This implementation of the multiple objective exposure control method resulted in the largest number of items at the maximum exposure rate. This also explains why this method resulted in the smallest RMSE.
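The three criteria used in this comparison (bias, RMSE, and observed exposure rates) can be computed from simulation output along the following lines. This is a sketch using the standard textbook definitions, which this section does not spell out, so they are an assumption here. The function names and the tiny data set are hypothetical and unrelated to the values in Table 3.

```python
import math

def bias(true_thetas, estimates):
    """Mean signed error of the ability estimates."""
    errors = [est - true for true, est in zip(true_thetas, estimates)]
    return sum(errors) / len(errors)

def rmse(true_thetas, estimates):
    """Root mean squared error of the ability estimates."""
    squared = [(est - true) ** 2 for true, est in zip(true_thetas, estimates)]
    return math.sqrt(sum(squared) / len(squared))

def observed_exposure_rates(administered_tests, pool_size):
    """Fraction of simulated examinees who saw each item."""
    counts = [0] * pool_size
    for test in administered_tests:
        for item in set(test):  # an item counts once per examinee
            counts[item] += 1
    return [c / len(administered_tests) for c in counts]

# Made-up numbers, for illustration only.
true_thetas = [-1.0, 0.0, 1.0, 2.0]
estimates = [-0.5, -0.5, 1.5, 1.5]
tests = [[0, 1], [0, 2], [0, 1], [0, 3]]  # item indices administered per examinee

sim_bias = bias(true_thetas, estimates)
sim_rmse = rmse(true_thetas, estimates)
rates = observed_exposure_rates(tests, pool_size=4)

# Items whose observed exposure rate exceeds the maximum (r max = .3 as in the text).
overexposed = [i for i, r in enumerate(rates) if r > 0.3]
```

The made-up errors cancel out, so the bias is zero while the RMSE is not. This shows why the two measures are reported side by side.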

Figure 3. Observed exposure rates for multiple objectives (dotted), Sympson-Hetter (dashed), alpha-stratified (thin), and progressive (thick) exposure control. Horizontal axis: items; vertical axis: observed exposure rates.

DISCUSSION

Exposure control is applied to computer adaptive testing programs for several reasons. The most important reason is to prevent item compromise. A second reason is to increase the usage of the item pool. Until now, several exposure control methods have been developed that deal with the problem of overexposure successfully. Underexposure of the items is still a problem in many adaptive testing programs. The multiple objective exposure control method was developed to deal with both kinds of exposure control problems. One of the advantages of the new method is that no time-consuming simulation studies have to be carried out. The new method can be implemented on the fly. During the administration, the additional time for selecting an item with the multiple objective exposure control method was less than a millisecond.

In the first example, it can be observed how the weighting functions influence the resulting tests. For example, the best results for the RMSE are obtained for a weighting function that allowed overexposure of some popular items. In other words, the tradeoff between RMSE and observed exposure rates can be controlled by defining appropriate weighting functions.

The multiple objective exposure control method was described as a deterministic method of exposure control. This implies that any administration of the test directly influences the weights for the next candidates. If such a dependency is undesirable, a probabilistic implementation might be considered. The weighting functions w(φ) determine the probability for an item to be selected. Before any CAT is administered, a probability experiment is carried out for every item to decide whether it is selected for the pool or not. For examinee j+1, item i is eligible, that is, available for selection, with estimated probability

$$P^{(j+1)}(E_i) = w(\varphi_i), \qquad (4)$$

where E_i denotes the event that item i is eligible. In the experiment, a random number u is drawn from the interval [0, 1]. For u < P, the item is eligible; for u > P, the item is not eligible. This probability experiment is comparable to the one described in van der Linden & Veldkamp (2004). However, in this approach the test specialist can define the function that relates the observed exposure rates to the probability of being eligible. The result of this experiment is a subset of the item pool that can be used for test administration.

Finally, since the multiple objective exposure control method is an interactive method where the parameters affecting the exposure control method are updated during the test administration period, some remarks have to be made about practical implementation. In a web-based environment, with testing over the internet, updating the parameters on the fly seems rather straightforward. However, when thousands of examinees participate in a test at the same time, updating the parameters every few minutes instead of continuously might be considered. This will reduce the probability of crashing the web server. When the method is applied in a classroom setting, which is most common for CITO CATs, the exposure rates resulting from different locations can be combined periodically.
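The probability experiment of Equation (4) and the periodic pooling of exposure rates across locations can be sketched as follows. The weighting function shown, which decreases linearly from 1 at φ = 0 to 0 at r max and stays at zero above it, is only an assumed form for illustration. The exact functions are defined elsewhere in the paper, and all function names here are hypothetical.

```python
import random

def linear_weight(phi, r_max=0.3):
    """Assumed weighting function: linear below r_max, zero at or above it."""
    if phi >= r_max:
        return 0.0
    return 1.0 - phi / r_max

def draw_eligible_items(phis, weight_fn, rng):
    """Run the eligibility experiment once for every item in the pool."""
    eligible = []
    for i, phi in enumerate(phis):
        p = weight_fn(phi)   # P(E_i) = w(phi_i), as in Equation (4)
        u = rng.random()     # uniform draw from [0, 1)
        if u < p:            # the case u == p is treated as not eligible here
            eligible.append(i)
    return eligible

def pooled_exposure_rates(counts_per_location, examinees_per_location):
    """Combine per-location administration counts into pool-wide exposure rates."""
    pool_size = len(counts_per_location[0])
    total_examinees = sum(examinees_per_location)
    if total_examinees == 0:
        return [0.0] * pool_size  # before any administration, phi = 0 for all items
    totals = [sum(loc[i] for loc in counts_per_location) for i in range(pool_size)]
    return [t / total_examinees for t in totals]

# Made-up observed exposure rates for a four-item pool.
phis = [0.0, 0.15, 0.3, 0.45]
eligible = draw_eligible_items(phis, linear_weight, random.Random(0))

# Two hypothetical locations with five examinees each and a two-item pool.
pooled = pooled_exposure_rates([[3, 0], [1, 2]], [5, 5])
```

Under this assumed weighting function, an unused item is always eligible and an item at or above r max never is. Items in between are eligible with a probability that shrinks as their observed exposure rate grows.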
When the method is applied to operational CATs, one of the first questions is which weighting function to implement. In the first example, several weighting functions were compared for a given item bank. This example merely illustrates the effects of controlling for underexposure and the effects of allowing overexposure of some of the items. The resulting bias (Table 1), RMSE (Table 2), and observed exposure rates (Figure 2) cannot be generalized beyond this example. However, based on theoretical arguments, a practitioner could choose between controlling for underexposure (w(φ) linear, or the other decreasing weighting function) or not controlling (w(φ) = 1). The same kind of decision needs to be made about how strictly the maximum exposure rate r max has to be imposed. A small simulation study (comparable to the one in Example 1) can be carried out to get a feeling for how the method might work for an operational CAT with a given item bank. Although we in general recommend performing simulation studies before starting any operational CAT, this step is not a necessary requirement for the implementation of the multiple objective exposure control method. The initial observed exposure rates can be set equal to 0 (φ = 0) for all items, and the values of φ can be updated after every test administration.

The multiple objective exposure control method has not been implemented in any commercial software package yet. It is generally applicable to CAT programs based on, for example, the Weighted Deviation Model (Stocking & Swanson, 1993) or the Shadow Test Approach (van der Linden, 2005). For this study, the method was implemented in CAT software developed at CITO in The Netherlands. For operational use, practitioners either have to add a module to their CAT software that calculates the weights for each item given the observed exposure rates and implement these weights in their item selection procedures, or they can contact the authors.

REFERENCES

Ariel, A., Veldkamp, B.P., & van der Linden, W.J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345-36.
Barrada, J.R., Veldkamp, B.P., & Olea, J. (2009). Multiple maximum exposure rates in computerized adaptive testing. Applied Psychological Measurement, 33, 58-73.
Chang, H.-H., & Ying, Z. (1999). α-stratified computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.
CITO (1999). WISCAT. Een computergestuurd toetspakket voor rekenen en wiskunde. [Mathcat: A computerized test package for arithmetic and mathematics]. CITO: Arnhem.
CITO (2002). NT2cat. Een computergestuurd toetspakket voor Nederlands als tweede taal. [DSLcat. A computerized test package for Dutch as a Second Language]. CITO: Arnhem.

CITO (in press). TURCAT. Een computergestuurd toetspakket voor Turks als tweede taal. [TURCAT. A computerized test package for Turkish as a Second Language]. CITO: Arnhem.
Davis, L.L., & Dodd, B. (2003). Item exposure constraints for testlets in the verbal reasoning section of the MCAT. Applied Psychological Measurement, 27, 335-356.
Eggen, T.J.H.M. (2001). Overexposure and underexposure of items in computerized adaptive testing. Measurement and Research Department Reports, 2001-1. Arnhem: Cito.
Eggen, T.J.H.M. (2004). CATs for kids: easy and efficient. Paper presented at the 2004 meeting of the Association of Test Publishers, Palm Springs, CA.
Hetter, R.D., & Sympson, J.B. (1997). Item exposure control in CAT-ASVAB. In W. Sands, B.K. Waters, & J.R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.
Kingsbury, G.G., & Zara, A.R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.
Luecht, R.M., & Nungester, R.J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35, 229-249.
McBride, J.R., & Martin, J.T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D.J. Weiss (Ed.), New horizons in testing (pp. 223-226). New York: Academic Press.
Parshall, C., Harmes, J.C., & Kromrey, J.D. (2000). Item exposure control in computer-adaptive testing: The use of freezing to augment stratification. Florida Journal of Educational Research, 40, 28-52.
Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 38, 311-327.
Stocking, M.L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.
Sympson, J.B., & Hetter, R.D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.
Thomasson, G.L. (1998). CAT item exposure control: New evaluation tools, alternate method and integration into a total CAT program. Paper presented at the annual meeting of the National Council of Measurement in Education, San Diego.
van der Linden, W.J. (2000). Constrained adaptive testing with shadow tests. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. -25). Boston, MA: Kluwer Academic Publishers.
van der Linden, W.J. (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249-265.
van der Linden, W.J., & Glas, C.A.W. (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer Academic Publishers.
van der Linden, W.J., & Veldkamp, B.P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.

van der Linden, W.J., & Veldkamp, B.P. (2007). Conditional item-exposure control in adaptive testing using item-ineligibility probabilities. Journal of Educational and Behavioral Statistics, 32. In press.
Veldkamp, B.P. (1999). Multiple objective test assembly problems. Journal of Educational Measurement, 36, 253-266.
Verschoor, A.J., & Straetmans, G.J.J.N. (2000). MathCAT: A flexible testing system in mathematics education for adults. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 101-116). Boston, MA: Kluwer Academic Publishers.
Warm, T.A. (1989). Weighted maximum likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Way, W.D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.
Way, W.D., Steffen, M., & Anderson, G.S. (1998). Developing, maintaining, and renewing the item inventory to support computer-based testing. Paper presented at the colloquium on computer-based testing: Building the foundation for future assessments, Philadelphia, PA.
Wise, S.L., & Kingsbury, G.G. (2000). Practical issues in developing and maintaining a computerized adaptive testing program. Psicologica, 21, 135-156.

(Manuscript received: 9 August 2007; accepted: 2 July 2009)