The Effects of Testlets on Reliability and Differential Item Functioning *
Educational Sciences: Theory & Practice, 2015 August, 15(4). Copyright 2015 EDAM. Received June 6, 2014; Accepted March 25, 2015; OnlineFirst May 15, 2015.

Gulsen Tasdelen Teker (a) (Sakarya University) and Nuri Dogan (b) (Hacettepe University)

Abstract

In this study, reliability and differential item functioning (DIF) analyses were conducted on testlets displaying local item dependence. The data set was obtained from the answers given by 1500 students to the 20 items of six testlets in the English Proficiency Exam administered by the School of Foreign Languages of a state university in Turkey. One purpose of the study was to determine the influence of tests composed of testlets on reliability, so the reliability coefficients obtained when the testlet effects were considered were compared with those obtained when the testlet effects were not considered. The G theory analyses conducted in this context showed that the G and Phi coefficients estimated without considering the testlet effects were higher than those estimated by considering the testlet effects; that is, reliability was estimated to be relatively higher when the testlet effects were not considered. Two methods were also used to determine the effects of testlets on differential item functioning, and the results were compared. With the DIF-detection method that considered the testlet effect, both the number of items displaying significant DIF and the estimated DIF levels were higher than with the method that did not consider the testlet effect.

Keywords: Testlet, Reliability, Differential item functioning, Generalizability theory, Local item independence

* This study is based on a brief summary of the doctoral dissertation entitled "The Effects of Testlets on Reliability and Differential Item Functioning," prepared in the Educational Measurement and Evaluation Program, Hacettepe University, Turkey.
(a) Corresponding author: Gulsen Tasdelen Teker (PhD), Department of Educational Sciences, Faculty of Education, Sakarya University, Sakarya, Turkey. Research areas: Generalizability theory; Differential item functioning. Email: [email protected]
(b) Assoc. Prof. Nuri Dogan (PhD), Department of Educational Sciences, Faculty of Education, Hacettepe University, Ankara, Turkey. Email: [email protected]
Numerous large-scale international examinations (TOEFL, GRE, PISA, TIMSS, PIRLS, etc.), as well as national examinations in Turkey (KPSS, YDS, ALES, YGS, LYS, etc.), are administered every year. On the basis of the results obtained in these examinations, assorted information is acquired about individuals and nations, and important decisions are made about them. It is therefore indispensable to present evidence on the reliability and validity of such examinations. Writing the items, the units constituting the tests, may be considered one of the most important steps in test preparation. Writing items that measure the intended properties and have the intended psychometric qualities is quite demanding work (Özçelik, 2010). For this reason, groups of items called testlets (Wainer & Kiely, 1987) have frequently been used in standardized tests administered nationally and internationally for the past 20 years because of their advantages in time, cost, or measurement approach for a specific domain. A testlet is a set of items related to a single common stimulus. The performance displayed on each item constituting a testlet depends both on a general ability and on a specific ability related to the content or topic (Rosenbaum, 1988, as cited in Wainer & Lewis, 1990). For instance, a testlet may be built around a reading passage, a lab application, a graph question, or a complex problem (DeMars, 2006). The beneficial results yielded by the use of testlets in educational applications have led to their use in many large-scale tests. Yet, due to certain statistical features of testlets, analyzing them with unidimensional measurement models has caused some unfavorable circumstances to emerge (Fukuhara & Kamata, 2007; Yen, 1993). One of the technical concerns caused by the use of testlets is the violation of the assumption of local item independence (Sireci, Thissen, & Wainer, 1991; Thissen, Steinberg, & Mooney, 1989; Wainer & Kiely, 1987; Wang & Wilson, 2005; Yen, 1993). It is claimed that the common stimulus of the items constituting a testlet creates dependence in individuals' responses to those items. For example, an individual previously informed about the theme of a reading passage would be better able to answer the questions about the text, and would thus be in a more advantageous position than individuals at the same level of ability who were not previously informed about the topic.

Local Item Independence

The assumption of local item independence is an important hypothesis for a number of measurement theories; therefore, its violation, which results in local item dependence, may lead to undesired results. For example, from the perspective of classical test theory, the standard error of measurement may be underestimated, and the reliability thus overestimated, when testlets violate the assumption of local item independence (Sireci et al., 1991; Wainer, 1995; Wainer & Thissen, 1996; Wainer & Wang, 2000; Yen, 1993). From the perspective of item response theory (IRT), the violation of the assumption results in overestimating the information obtained from the test, and thus underestimating the standard error of the ability estimate (Thissen et al., 1989; Yen, 1993). It also causes the item parameters to be misestimated (DeMars, 2006).
Reliability

The aim in measurement studies is to obtain observed scores as close to the true scores as possible. Reliable measurement results, that is, measurement results as close as possible to the true scores, are only obtained when random errors are minimized (Baykul, 2000). Item-based estimation methods are generally used in determining reliability. This poses no problem if the test is composed of items that are independent of each other. However, if the test whose reliability is to be determined is composed of testlets, defined as items depending on a common stimulus, it has long been common knowledge among researchers that traditional reliability estimation methods produce biased estimates (Lee & Frisbie, 1997, 1999; Lee & Park, 2012; Wainer, 1995). Accordingly, a great number of studies in the literature conclude that item-based reliability estimation methods yield higher reliability estimates for tests consisting of testlets than methods that take the testlet structure into account (Hendrickson, 2001; Lee & Frisbie, 1997; Lee & Park, 2012; Sireci et al., 1991; Wainer & Thissen, 1996). Reliability, defined in classical test theory (CTT) as the consistency of the scores obtained through measurements, can vary according to the source to which the error is attributed. However, the errors in the measurement results are considered
as errors coming from only one source of variability, and this emerges as a restriction of CTT. One of the most important assumptions of item response theory (IRT) is local item independence. If researchers employ a different IRT model, the results obtained will change, and researchers will still need to meet the strong assumptions of IRT, namely unidimensionality and local item independence (Lee & Frisbie, 1997).

Generalizability Theory

Generalizability (G) theory is a statistical theory related to the reliability of performance measurements (Shavelson & Webb, 1991). G theory can deal with the problems mentioned above that are encountered when CTT-based and IRT-based methods are used. Since the methods in G theory can consider several sources of error simultaneously, estimations can be made more accurately than in CTT. And since it is not necessary to meet very strong assumptions as in IRT, G theory can be used to estimate the reliability of tests composed of testlets. An important issue in G theory is the type of design and whether it is balanced or unbalanced. In balanced designs, the number of observations is equal at each level of a source of variability (Brennan, 2001a). As an example, consider a test administration with 10 individuals (s), 20 items (i), and 5 testlets (h). Taking the individuals as the object of measurement and the items and testlets as facets, the design can be represented as the sx(i:h) design, in which items are nested in testlets and individuals are crossed with them. If an equal number of items is included in each testlet, for example four items in each of the five testlets, there is a balanced data structure in which all individuals answer a 20-item test. In tests composed of testlets, however, it is frequently observed that the numbers of items in the testlets differ, and this produces unbalanced data sets. The sx(i:h) design mentioned above is then treated differently: the example turns into an unbalanced data set when the numbers of items in the testlets differ, because the numbers of observations at the levels of a facet are no longer equal. Designs can be unbalanced for two reasons: (1) missing data, or (2) real variability in the number of observations at the levels of a facet, as in this example (Brennan, 2001a).
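As a small illustration of this distinction, the sketch below (Python; an illustrative aid, not the study's software, which was G-String/urGENOVA) encodes an unbalanced item-per-testlet structure and computes the average testlet size ňh = n²_i+ / Σ_h n²_i:h that appears later in the D study (Table 5). The item counts used are those of the study's test, reported later in Table 2.

```python
# A minimal sketch of an unbalanced s x (i:h) structure: items nested in
# testlets, persons crossed with both. Values follow Table 2 of this study.
items_per_testlet = [3, 3, 3, 3, 2, 6]   # unbalanced: testlet sizes differ

n_items_total = sum(items_per_testlet)   # n_i+ = 20
# Average testlet size used in the D study (footnote of Table 5):
n_tilde_h = n_items_total**2 / sum(n**2 for n in items_per_testlet)
print(n_items_total, round(n_tilde_h, 2))   # -> 20 5.26
```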
Differential Item Functioning (DIF)

Regardless of the types of items constituting a test, the most important psychometric property it should have is validity. Significant threats to validity are item bias and test bias (Clauser & Mazor, 1998). Fairness toward those taking a test is a property that tests should satisfy. Differential item functioning (DIF) refers to differing performance on an item by individuals who are at equal levels of ability but belong to different groups. It can also be defined as a difference in the probability of answering an item correctly between individuals who are similar in the measured property but belong to different sub-groups, such as gender or socio-economic level (Hambleton, Swaminathan, & Rogers, 1991).

The increase in the use of testlets in standardized educational tests over the last 20 years has led to an increase in studies concerning how to score and analyze them. Traditionally, the effects of testlets were ignored, and each item constituting the testlet was scored as if it were an independent item (Bradlow, Wainer, & Wang, 1999; Lee, Kolen, Frisbie, & Ankenmann, 2001; Sireci et al., 1991; Wainer, 1995; Wainer & Wang, 2000; Yen, 1993). The DIF analyses performed in these studies are conducted at the item level (Sireci et al., 1991; Wainer & Thissen, 1996). Other studies in the literature calculate the sums or averages of the items constituting the testlets and thereby obtain scores at the testlet level (Lee, Dunbar, & Frisbie, 2001; Lee & Frisbie, 1999; Sireci et al., 1991). Another approach is testlet scoring, in which each testlet is considered a single item and scored polytomously; polytomous item response models, especially Bock's nominal response model, are used in this process (Sireci et al., 1991; Thissen et al., 1989; Wainer, 1995; Wainer & Thissen, 1996; Yen, 1993). However, the DIF derived from this approach is at the testlet level rather than the item level; in other words, this method detects differential testlet functioning (DTF), so the specific items leading to DIF cannot be determined (Fukuhara & Kamata, 2011). Since constructing a testlet is demanding, time-consuming work, removing a testlet displaying DTF from the item pool is not desirable; identifying and revising the problematic items in the testlet instead allows the testlet to be re-used. Scoring testlets polytomously acknowledges that the items constituting a testlet are dependent on each other, which is a positive feature, but doing so loses information about the response patterns of the individuals answering those items (Wainer, Bradlow, & Du, 2000).
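For concreteness, the sketch below (an illustrative helper written for this text, not code from the study) shows the testlet-level scoring described above: a person-by-item 0/1 matrix is collapsed into one polytomous score per testlet.

```python
import numpy as np

def testlet_scores(responses, items_per_testlet):
    """Collapse an (n_persons, n_items) 0/1 response matrix into
    testlet-level scores by summing each testlet's items."""
    bounds = np.cumsum([0] + list(items_per_testlet))
    return np.stack([responses[:, a:b].sum(axis=1)
                     for a, b in zip(bounds[:-1], bounds[1:])], axis=1)

# Example: 2 persons, a 3-item testlet followed by a 2-item testlet.
X = np.array([[1, 0, 1, 1, 1],
              [0, 0, 1, 0, 1]])
print(testlet_scores(X, [3, 2]))   # -> [[2 2], [1 1]]
```

Scoring at this level respects the within-testlet dependence but, as noted above, discards the item-level response pattern.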
To eliminate this negative outcome, attaching a random testlet effect to the original IRT models is another strategy (Bradlow et al., 1999; Li, Bolt, & Fu, 2006; Wainer et al., 2000; Wang, Bradlow, & Wainer, 2002). This strategy is called testlet response theory (TRT). Many testlet models treat the testlet effect as a random effect, as is done in TRT (Bradlow et al., 1999; Li et al., 2006; Wainer et al., 2000; Wang et al., 2002). All TRT models add a testlet parameter to the traditional IRT parameters in order to capture the amount of local item dependence, and all of them are adapted from IRT models or from previously suggested TRT models. Li et al. (2006) and DeMars (2006) suggest testlet models based on polytomous item response theory. The bi-factor testlet response theory model is shown in Equation 1:

ln[P(y_ij = 1) / P(y_ij = 0)] = a_i1 θ_j - b_i + a_i2 γ_d(i)j    (1)

In Equation 1, the ability component θ_j represents the first dimension, and the random testlet effect component γ_d(i)j represents the second dimension; both have standard normal distributions and are independent of each other. a_i1 and a_i2 are the discrimination parameters of item i: a_i1 expresses the relation between the ability and item i, whereas a_i2 expresses the relation between the testlet effect and item i. That is to say, a separate discrimination parameter is available for each of the two dimensions, and b_i is the item difficulty parameter. In this testlet model, the testlet effect is interpreted differently than in 2P-TRT and 3P-TRT: if the a_i2 parameter is estimated to be higher than the a_i1 parameter, it may be said that the second dimension, representing the testlet, is more dominant. Both Li et al. (2006) and DeMars (2006) concluded that, for data containing local item dependence caused by testlets, the bi-factor testlet models fit better than the TRT models.

As mentioned previously, there are many models for determining DIF, some based on IRT and some not. However, when IRT-based DIF-detection models are applied to tests containing testlets, the DIF values can be biased (Fukuhara & Kamata, 2007). Hence, models that consider the local item dependence caused by testlets are necessary in order to estimate DIF more accurately in those cases. The model suggested by Fukuhara and Kamata (2011) and used in this study is essentially an extension of the bi-factor multidimensional item response theory model for testlets (bi-factor MIRT) suggested by Li et al. (2006) and DeMars (2006), and can be represented mathematically as in Equation 2:

ln[P(y_ij = 1) / P(y_ij = 0)] = a_i(β_θ G_j + ζ_j - b_i + γ_d(i)j - β_i G_j)    (2)

In Equation 2, β_θ represents the effect of group membership (G_j) on ability θ_j, capturing the average ability difference between the focal (G_j = 1) and reference (G_j = 0) groups; ζ_j is a residual ability term for individual j, so that θ_j = β_θ G_j + ζ_j; and β_i represents the DIF magnitude of item i. As a control for the DIF-detection model based on the bi-factor MIRT model for testlets, this study also employs an IRT-based DIF-detection method that does not consider the testlet effect. This IRT-based DIF model is an extension of the two-parameter logistic IRT model. It is believed that, since the discriminations of the items in the study may differ, this model will determine DIF better than the Rasch TRT model does. This model also considers the group covariates in determining DIF. The mathematical expression of the model is shown in Equation 3:

ln[P(y_ij = 1) / P(y_ij = 0)] = a_i(β_θ G_j + ζ_j - b_i - β_i G_j)    (3)

Equation 3 is the same as Equation 2 except that it omits the random testlet effect (γ_d(i)j).
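As a reading aid, here is a small sketch of the item-correct probabilities implied by Equations 2 and 3 (illustrative Python; the study itself fit these models in WinBUGS, and all parameter values below are hypothetical).

```python
import numpy as np

def p_bifactor_mirt_dif(a_i, b_i, beta_i, beta_theta, zeta_j, gamma_dij, G_j):
    """Equation 2: bi-factor MIRT DIF model (testlet effect considered)."""
    logit = a_i * (beta_theta * G_j + zeta_j - b_i + gamma_dij - beta_i * G_j)
    return 1.0 / (1.0 + np.exp(-logit))

def p_2pl_dif(a_i, b_i, beta_i, beta_theta, zeta_j, G_j):
    """Equation 3: 2PL DIF model (testlet effect omitted)."""
    logit = a_i * (beta_theta * G_j + zeta_j - b_i - beta_i * G_j)
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical values: a focal-group member (G_j = 1) with residual ability 0
# answers an item of difficulty 0 and DIF magnitude 0.5, belonging to a
# testlet whose random effect for this person is 0.3.
print(p_bifactor_mirt_dif(1.2, 0.0, 0.5, -0.2, 0.0, 0.3, 1))
print(p_2pl_dif(1.2, 0.0, 0.5, -0.2, 0.0, 1))
```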
Consequently, at the second stage of this study, an attempt was made to compare the effects of testlets on DIF detection through these two methods.

Purpose of the Study

This study aims to determine the effects of testlets on reliability and on differential item functioning, which is a validity issue, by using a test composed of testlets. Although many theses, articles, and reports on test reliability and differential item functioning are available in Turkey, no studies concerning testlet effects have been encountered. Because local item dependence is a property capable of affecting reliability and DIF analyses, and because it stems from the characteristic properties of testlets, which are frequently used in tests like the ALES and YDS, this is an important topic to examine. Therefore, in discussing reliability and DIF-detection methods, this research attempts to exhibit the differences between methods that consider the testlet effects and methods that do not.
Although there has been an increase in the number of research studies conducted in Turkey on generalizability (G) theory, balanced designs are used in the majority of them. The studies by Nalbantoğlu Yılmaz and Uzun Başusta (2012) and by Nalbantoğlu Yılmaz (2012) are examples of the limited number of studies performed in Turkey using an unbalanced design. Because the G theory analyses in the present study are performed on an unbalanced design, which is more appropriate for real data, it is believed that this study will contribute to the field.

Problem Statement and Sub-problems

In accordance with the purpose stated above, this study seeks an answer to the problem question "Do testlets have any effects on reliability and differential item functioning?" as well as to the sub-problems listed below:

1. Are there any differences between the reliability coefficients estimated at the item and testlet levels without including the testlet effect and the reliability coefficients estimated by including the testlet effect?

2. What are the generalizability theory results for the unbalanced sx(i:h) design, in which items (i) are nested in testlets (h) and individuals (s) are crossed with them?

3. Do the number of items displaying DIF and the estimated level(s) of DIF differ according to gender when the testlet effects are and are not considered in tests composed of testlets?

Method

Since the aim of the study is to determine the effects of testlets on test reliability and differential item functioning, it is a descriptive study.

Study Group

The study group was composed of 1500 students responding to the 20 dichotomously scored (1-0) items of the reading comprehension part of the English Proficiency Exam given at the School of Foreign Languages of a state university. Of the study group, 661 (44.07%) participants were female and 839 (55.93%) were male.

Descriptive Test Statistics

The descriptive statistics for the test considered in the study are shown in Table 1.

Table 1
Descriptive Statistics of the Test

Variable | Value
Number of items | 20
Number of students | 1500
Mean | …
Standard deviation | …
Skewness | …
Kurtosis | …
Mean difficulty | …
Mean discrimination | …
Cronbach's α | .78

According to Table 1, the score distribution was somewhat skewed to the right and somewhat kurtotic. On the other hand, the test was at a medium level of difficulty, and the discrimination indexes of the items, judging from the mean discrimination value, were sufficient in general. The reliability coefficient for the test was estimated at approximately .78. These statistics suggest that the test has no problems requiring further work.
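Since the reliability reported in Table 1 is an item-based internal-consistency estimate, a minimal sketch of its computation may be useful (illustrative Python, not the software used in the study; the data below are hypothetical).

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an (n_persons, n_items) matrix of 0/1 scores."""
    X = np.asarray(responses, dtype=float)
    k = X.shape[1]                               # number of items
    item_var_sum = X.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = X.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1.0)) * (1.0 - item_var_sum / total_var)

# Toy example with hypothetical data (5 persons, 4 items):
X = np.array([[1, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])
print(round(cronbach_alpha(X), 3))
```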
Data Analysis

In determining the reliability of the results of the test composed of testlets, the generalizability theory analyses were performed both without including the testlet effect (at the item and testlet levels) and by including it, using the G-String program (Bloch & Norman, 2011), which runs on urGENOVA (Brennan, 2001b). To determine the items displaying DIF, two methods were used: the DIF-detection method considering the testlet effects, based on the bi-factor multidimensional IRT model for testlets with covariates, and the DIF-detection method not considering the testlet effects, based on the two-parameter logistic IRT model with covariates. These analyses were done using the WinBUGS 1.4 program (Spiegelhalter, Thomas, & Best, 2003). Moreover, opinions were obtained from 10 experts in relation to the items displaying DIF. One of the experts held a doctorate in measurement and evaluation and focused on DIF. One expert was preparing a doctoral dissertation in measurement and evaluation and had research studies focused on DIF.
Three of the experts were preparing their doctoral dissertations in the field of measurement and evaluation. The remaining five experts were in the field of English language teaching: one was a professor, two were studying for their doctoral degrees, and two were English teachers.

Findings

Findings Concerning the First Sub-problem

The data came from 1500 students (s) taking the English Proficiency Test, answering 20 items (i) nested in six testlets (h). Because the number of items in each testlet differs, the researchers worked with an unbalanced data set for the sx(i:h) design. The distribution of items into testlets is shown in Table 2.

Table 2
The Distribution of Items into Testlets

Testlet number | Item numbers in testlet | Number of items in testlet
1 | 1, 2, 3 | 3
2 | 4, 5, 6 | 3
3 | 7, 8, 9 | 3
4 | 10, 11, 12 | 3
5 | 13, 14 | 2
6 | 15, 16, 17, 18, 19, 20 | 6
Total: 20 items

The G theory analyses performed without considering the testlet effects were done in two different ways:

1. The random design (sxi), in which individuals (s) were crossed with items (i), at the item level: a data set composed of 1500 students and 20 items was used.

2. The random design (sxh), in which individuals (s) were crossed with testlets (h), at the testlet level: a data set composed of 1500 students and six testlets was used. Each testlet score was found by averaging the items constituting the testlet.

In the G theory analysis performed considering the testlet effect, the random design [sx(i:h)] was used, in which the answers given by the 1500 students were crossed with the 20 items nested in the six testlets. Since the number of items per testlet differed, the data set was unbalanced, and the analysis took this into account. The values obtained from the analyses are shown in Table 3.

Table 3
Reliability Coefficients

Reliability coefficient | Testlet effect not considered, item level (sxi) | Testlet effect not considered, testlet level (sxh) | Testlet effect considered [sx(i:h)]
G | 0.776 | 0.762 | 0.761
Phi | 0.757 | 0.712 | 0.711

As Table 3 shows, the G coefficient obtained from the generalizability theory analysis performed at the item level on the 20-item data was 0.776, and the Phi coefficient was 0.757. The G coefficient obtained at the testlet level, by averaging the items in each testlet, was 0.762, and the Phi coefficient was 0.712. The G coefficient estimated at the item level was thus about 0.01 higher than the one estimated at the testlet level, and the Phi coefficient estimated at the item level was about 0.05 higher than the one at the testlet level. The lowest reliability values in Table 3 were estimated in the case where the testlet effects were considered. The G and Phi coefficients estimated at the testlet level and those estimated by considering the testlet effects were found to be very close. Yet the testlet-level analysis treated the test as if it consisted of six items; in other words, averages were calculated over the items in the testlets, and the reliability analysis was done with the scores of the six testlets. The analysis including the testlet effects, on the other hand, was conducted with the 20-item data set.
Since reliability would normally be expected to increase with the number of score units, the fact that the reliability values were so close when the numbers of units were so different (6 testlet scores versus 20 items) may be interpreted to mean that one of the estimates is higher than its actual value; it is plausible that the values estimated from the six testlet scores, being so close to those for the 20-item data, were overestimated. Consequently, according to Table 3, it is clear that the G and Phi coefficients estimated without considering the testlet effects are higher than those estimated by considering the testlet effects. The overestimation of reliability when testlet effects are not considered is compatible with the findings of studies in the literature (Hendrickson, 2001; Lee & Frisbie, 1997, 1999; Sireci et al., 1991; Wainer & Thissen, 1996); results similar to those obtained in the cited studies were obtained in this research.
Findings Concerning the Second Sub-problem

The generalizability theory analysis results for the unbalanced sx(i:h) design, in which items (i) are nested in testlets (h) and individuals (s) are crossed with them, are given below as the findings of the generalizability (G) and decision (D) studies.

The G Study Findings: The variance components estimated through the G study performed with the unbalanced sx(i:h) design, and the proportions of total variance they explain, are given in Table 4.

Table 4
G Study Results of the Unbalanced sx(i:h) Design

Source of variance | Sum of squares | df | Mean of squares | Variance component | % of total variance
s | 1305.… | … | … | … | 13.1
h | 403.… | … | … | … | 5.2
i:h | 228.… | … | … | … | 4.3
sh | 1540.… | … | … | … | 1.8
si:h,e | 4003.… | … | … | … | 75.6
Total | 7481.… | … | … | … | 100.0

On examining Table 4, the variance component for the individual (s) main effect explains 13.1% of the total variance, the second highest proportion in the table. This indicates that interpersonal differences can be exhibited, which is a desirable outcome. The variance component for the testlet (h) main effect accounts for 5.2% of the total variance; the smallness of this value indicates that the testlets do not differ greatly in difficulty. The numbers of items included in each testlet differ in this study, and the variance component for the i:h effect, where items are nested in testlets, accounts for 4.3% of the total variance. The i:h component comprises the item (i) main effect and the item x testlet (ih) interaction effect (Brennan, 2001a). The smallness of this value may be interpreted to mean that the difficulties of the items within testlets, as perceived by individuals, do not differ greatly. The variance component for the individual x testlet (sh) interaction accounts for 1.8% of the total variance, the smallest value in Table 4; the differences stemming from individual-testlet interaction are therefore small. The variance component for the residual effect (si:h,e) accounts for 75.6% of the total variance, the highest proportion in Table 4. In the nested sx(i:h) design, it is impossible to obtain the individual x item (si) interaction separately; this interaction therefore forms part of the residual variance. The high residual variance indicates that the effects of the individual x item interaction (si), the individual x item x testlet interaction (sih), or other unexplained sources of variability could be high. Residual variance was also estimated to be high in other studies in the literature (Hendrickson, 2001; Nalbantoğlu Yılmaz, 2012); this may stem from the design being both unbalanced and nested.
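To connect these variance components to the reliability coefficients reported earlier, the sketch below computes G and Phi from variance components using the standard balanced-design formulas for sx(i:h). This is an approximation for illustration only: the study used urGENOVA's unbalanced-design estimation, and the percentage components from Table 4 stand in for the raw variance estimates, which does not change the result because the coefficients are ratios.

```python
def g_and_phi(var_s, var_h, var_ih, var_sh, var_res, n_h, n_i):
    """G (relative) and Phi (absolute) coefficients for a balanced
    s x (i:h) design: n_h testlets, n_i items per testlet."""
    rel_error = var_sh / n_h + var_res / (n_h * n_i)
    abs_error = rel_error + var_h / n_h + var_ih / (n_h * n_i)
    return var_s / (var_s + rel_error), var_s / (var_s + abs_error)

# Percentage variance components from Table 4, with 6 testlets averaging
# 20/6 items each:
g, phi = g_and_phi(13.1, 5.2, 4.3, 1.8, 75.6, n_h=6, n_i=20 / 6)
print(round(g, 3), round(phi, 3))   # close to Table 3's 0.761 and 0.711
```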
The D Study Findings: Since the numbers of items per testlet differed, the D study was conducted by increasing and decreasing the numbers of testlets and of items in each testlet in the unbalanced sx(i:h) design. The G and Phi coefficients obtained are shown in Table 5.

Table 5
The Results of the D Study of the Unbalanced sx(i:h) Design

ni+ | nh | ni:h | ňh | G | Phi
16 | 3 | 4, 5, 7 | 2.84 | 0.710 | …
16 | 6 | 2, 2, 2, 3, 3, 4 | 5.57 | 0.722 | …
16 | 6 | 2, 2, 3, 3, 3, 3 | 5.82 | 0.723 | …
16 | 8 | 2, 2, 2, 2, 2, 2, 2, 2 | 8.00 | 0.726 | …
20 | 3 | 5, 7, 8 | 2.89 | 0.749 | …
20 | 6 | 2, 3, 3, 3, 3, 6 | 5.26 | 0.761 | 0.711
20 | 6 | 2, 3, 3, 4, 4, 4 | 5.71 | 0.762 | …
20 | 8 | 2, 2, 2, 2, 3, 3, 3, 3 | 7.69 | 0.766 | …
40 | 3 | 12, 13, 15 | 2.97 | 0.840 | …
40 | 6 | 4, 4, 6, 6, 9, 11 | 5.22 | 0.854 | …
40 | 6 | 3, 5, 7, 7, 9, 9 | 5.44 | 0.855 | …
40 | 8 | 2, 2, 4, 4, 6, 6, 8, 8 | 6.67 | 0.859 | 0.811

ni+: total number of items; nh: number of testlets; ni:h: numbers of items in the testlets; ňh = n²_i+ / Σ_h n²_i:h

As is clear from Table 5, when the number of testlets was reduced (nh = 3) while the total number of items in the testlets (ni+ = 20) remained constant, with varying numbers of items per testlet, the G coefficient was estimated as 0.749 (ni+ = 20, nh = 3, ňh = 2.89). When the number of testlets was increased (nh = 8) while the total number of items (ni+ = 20) remained constant, again with differing numbers of items per testlet, the G coefficient was estimated as 0.766 (ni+ = 20, nh = 8, ňh = 7.69). Relative to the original design, reducing the number of testlets decreased the G coefficient by 0.012, and increasing the number of testlets increased it by 0.005; the Phi coefficient likewise decreased and increased, respectively. In conclusion, it was found that decreasing or increasing the number of testlets while the total number of items remained constant did not greatly affect the G and Phi coefficients.
The G coefficient obtained by reducing the number of items in each testlet, and thus the total number of items (ni+ = 16), while the number of testlets (nh = 6) remained constant was 0.722 (ni+ = 16, nh = 6, ňh = 5.57). The G coefficient obtained by increasing the number of items in each testlet, and thus the total number of items (ni+ = 40), while the number of testlets (nh = 6) remained constant was 0.855 (ni+ = 40, nh = 6, ňh = 5.44). When the number of testlets stayed constant, decreasing the number of items lowered the G coefficient by 0.039, and increasing the number of items raised it by 0.094, with the Phi coefficient changing in the same directions. In conclusion, it was found that decreasing or increasing the number of items in the testlets while the number of testlets remained constant did not greatly affect the G and Phi coefficients.

On increasing both the number of testlets and the number of items in the testlets (ni+ = 40, nh = 8, ňh = 6.67), the G coefficient was estimated as 0.859 and the Phi coefficient as 0.811. Increasing the numbers of testlets and of items in the testlets together thus produced an increase of 0.098 in the G coefficient and of 0.100 in the Phi coefficient. From this it may be concluded that increasing both the number of testlets and the number of items in the testlets increases the reliability coefficients.

On reducing the difference between the numbers of items in the testlets relative to the initial design, while keeping the number of testlets and the total number of items constant (ni+ = 20, nh = 6, ňh = 5.71), the G coefficient was estimated as 0.762. It may thus be interpreted that reducing the variability between testlets in the number of items, with the number of testlets and the total number of items constant, causes an increase in the G and Phi coefficients. This also holds when the number of testlets is constant and the total number of items is 16 or 40. When the number of items is 16, the G and Phi coefficients estimated where the variability between testlets is small (ni+ = 16, nh = 6, ňh = 5.82) are higher than those estimated where the variability is high (ni+ = 16, nh = 6, ňh = 5.57). When the number of items is 40, the G and Phi coefficients estimated where the variability is small (ni+ = 40, nh = 6, ňh = 5.44) are greater than those estimated where the variability is high (ni+ = 40, nh = 6, ňh = 5.22).
In accordance with the findings obtained from the D study performed on the unbalanced sx(i:h) design, shown in Table 5, increasing the number of testlets and the number of items in the testlets influenced reliability positively. Yet the G and Phi coefficients estimated by keeping the total number of items constant and increasing the number of testlets rose more than those estimated by keeping the number of testlets constant and increasing the number of items in each testlet. This is an important finding, parallel to results in the literature (Hendrickson, 2001; Lee & Frisbie, 1999). That is to say, the increase in the number of testlets contributed more to reliability than the increase in the number of items in the testlets did. The local item dependence observed in testlets can increase as the number of items in a testlet increases; therefore, it is to be expected that an increase in the number of items in a testlet contributes less to reliability than an increase in the number of testlets does.
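A D study of this kind can be sketched numerically as below (illustrative Python using the balanced-design formulas and the percentage variance components of Table 4, so the values only approximate the unbalanced results in Table 5; the candidate designs are hypothetical).

```python
# Percentage variance components from Table 4 (only ratios matter here).
var_s, var_h, var_ih, var_sh, var_res = 13.1, 5.2, 4.3, 1.8, 75.6

def g_phi(n_h, n_i):
    """G and Phi for a balanced s x (i:h) design with n_h testlets
    and n_i items per testlet."""
    rel = var_sh / n_h + var_res / (n_h * n_i)
    ab = rel + var_h / n_h + var_ih / (n_h * n_i)
    return var_s / (var_s + rel), var_s / (var_s + ab)

# Hypothetical candidate designs: (testlets, items per testlet).
for n_h, n_i in [(3, 7), (6, 3), (8, 3), (6, 7)]:
    g, phi = g_phi(n_h, n_i)
    print(f"n_h={n_h}, n_i={n_i}: G={g:.3f}, Phi={phi:.3f}")
```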
Findings Concerning the Third Sub-problem

The DIF-detection study was performed on the testlet-based items considered within the scope of the study. Since the aim was to determine the effects of tests composed of testlets on DIF, two methods were used, one considering the testlet effects and one not considering them. The methods used were those recommended by Fukuhara and Kamata (2011) and were based on IRT. In determining the DIF levels, for both methods a magnitude of 0.426 ≤ |β| < 0.638 was considered significant at the B level, whereas a magnitude of |β| ≥ 0.638 was considered significant at the C level (Vaughn, 2006). The findings obtained from the DIF analyses performed through the WinBUGS program, both considering and not considering the testlet effects, are shown in Table 6.

As is clear from Table 6, significant DIF was determined in five items (items 6, 7, 10, 11, and 14) by the IRT DIF model, and significant DIF was also found in those five items in the analyses performed with the bi-factor MIRT DIF model. In addition, significant DIF was not estimated for two items (items 8 and 16) by the IRT DIF model, but was estimated for those items in the analysis performed with the bi-factor MIRT DIF model. In other words, the bi-factor MIRT DIF model flagged two more items with significant DIF than the IRT DIF model did.

On examining the DIF levels estimated in the gender-based DIF analyses, it becomes evident from Table 6 that there are differences between the levels obtained through the two models. The DIF for items 6, 7, and 10 was estimated at the C, B, and C levels, respectively, in both models. On the other hand, the DIF estimated at the A level in the IRT DIF model for items 8 and 16 was estimated at the B level in the bi-factor MIRT DIF model. In a similar vein, the DIF estimated at the B level in the IRT DIF model for items 11 and 14 was estimated at the C level in the bi-factor MIRT DIF model.

Table 6
The DIF Levels Obtained from the Two Models by Gender

Item | IRT DIF model (testlet effect not considered): level | Advantaged group | Bi-factor MIRT DIF model (testlet effect considered): level | Advantaged group
6 | C*** | Male | C*** | Male
7 | B** | Male | B** | Male
8 | A* | Male | B** | Male
10 | C*** | Female | C*** | Female
11 | B** | Male | C*** | Male
14 | B** | Male | C*** | Male
16 | A* | Male | B** | Male

* |β| < 0.426 (A level, or no DIF); ** 0.426 ≤ |β| < 0.638 (B level DIF); *** |β| ≥ 0.638 (C level DIF)
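The level assignments in Table 6 follow the logit-scale thresholds given above; a small sketch of that classification rule (illustrative Python, not the study's code):

```python
def dif_level(beta):
    """Classify a DIF magnitude (logit scale) into the A/B/C levels
    used in Table 6 (thresholds per Vaughn, 2006)."""
    b = abs(beta)
    if b < 0.426:
        return "A"   # negligible DIF
    if b < 0.638:
        return "B"   # moderate DIF
    return "C"       # large DIF

# Hypothetical magnitudes:
print([dif_level(b) for b in (0.30, 0.50, 0.70)])   # -> ['A', 'B', 'C']
```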
Expert opinion was consulted in order to identify the source of DIF in the items for which significant DIF was estimated in the analyses performed with the bi-factor MIRT DIF model. The following was concluded: The reading passage to which items 6, 7, and 8 (containing DIF in favor of male students) belonged was related to Internet connection, social media, and emotions. Almost all of the experts pointed out that these topics show no gender-based distinctions and that those items were therefore not biased toward male students. The reading passage to which item 10 (containing DIF in favor of female students) and item 11 (containing DIF in favor of male students) belonged was related to machines, industry, chemistry, and engineering. The experts stated that the topics in the text were of interest to male students. In addition, seven of the experts agreed that, because the detail asked about in item 10 was related to cloth, the item might be biased toward female students. A close examination of the expert opinions concerning item 11, which favored male students, showed that half of the experts contended that the topic of the testlet was within the interests of males and that the item might therefore be biased toward male students, while the other half disagreed. The reading passage to which item 14 (containing DIF in favor of male students) belonged was related to mechanical vehicles (such as ferries) and their technical faults; six of the experts therefore pointed out that this item might be biased toward men, while the remaining four experts said they did not expect gender bias in this item. The reading passage to which item 16 (containing DIF in favor of male students) belonged was related to brain drain, countries' economies, and education. The majority of the experts stated that this topic would not be gender biased.

Discussion and Recommendations

First, generalizability theory was employed in estimating the reliability of the test composed of testlets. The reliability was estimated at the item and testlet levels, where testlet effects were not considered, and for the nested design, in which testlet effects were considered. Of the values estimated, the reliability coefficient obtained at the item level was the highest. The G coefficient estimated in one-facet designs in G theory is equal to Cronbach's alpha (Brennan, 2001a). Consequently, when the testlet effects are not considered in tests containing testlets, a biased result, equal to Cronbach's alpha, is obtained. When the testlet effects are not considered, it is even possible for the estimated reliability to reach a value that would otherwise require doubling the number of items (Wainer, Sireci, & Thissen, 1991). The use of testlets in tests is very common owing to their several advantages in the test development process. On the basis of the results obtained in this study, it was concluded that, for the purpose of raising reliability, an increase in the number of testlets is more influential than an increase in the number of items constituting the testlets. This is supported by the G study performed within the framework of G theory in this study (see Table 4); for example, the testlet main effect contributes more to the total variance than the effect of items nested in testlets does. The variation in the number of items per testlet in a test also influences the reliability obtainable from the test results.
If the number of items constituting the testlets differs greatly from one testlet to another in a study, it affects reliability negatively. It was also found that the number of items displaying significant DIF according to the DIF-detection method considering the testlet effects was greater than the number determined by the method not considering the testlet effects. Since testlets violate the assumption of local item independence, the DIF estimated through standard IRT-based methods was smaller than the actual DIF. Moreover, a review of the DIF studies in the literature shows that DIF found to be trivial at the item level in analyses not considering the testlet effects became clear in analyses considering the testlet effects (Fukuhara, 2009; Fukuhara & Kamata, 2011; Sedivy, 2009; Wainer et al., 1991). This does not lead to serious differences when the data sample is large enough; but as the sample and the DIF become smaller, the methods considering the testlet effects display better performance. In conclusion, the increasing use of testlets in test development makes it obligatory to analyze such instruments with appropriate statistical methods in order to interpret the results obtained from these tests accurately.
Recommendations Based on Conclusions

In estimating the reliability of tests containing testlets, estimation methods that take local item dependence into consideration should be used in order to reach unbiased estimates. Since increasing the number of testlets is more effective than increasing the number of items in the testlets, the former may be preferred in order to achieve more reliable measurements from tests composed of testlets; in other words, increasing the number of testlets, each prepared around a common stimulus, is more influential. To increase the reliability of tests containing testlets, the numbers of items should also be kept close from one testlet to another.

Recommendations to Researchers

The method based on the bi-factor MIRT DIF model, which may be used in estimating DIF in tests composed of testlets, can also be used for other tests. Many situations in education can lead to local item dependence (Yen, 1993). For instance, a standardized science test may be composed of such sub-themes as physics, chemistry, and biology, and in such a test the sub-themes can be grouped together and considered as testlets; the methods and analyses used in this study can then be employed in determining DIF. The fact that this research was performed with a single data set, and thus with a small number of items, can be considered a limitation; yet similar situations were encountered in research studies on the effects of testlets on reliability (Hendrickson, 2001; Lee & Frisbie, 1997, 1999; Sireci et al., 1991; Wainer, 1995) and on differential item functioning (Fukuhara & Kamata, 2011; Wainer, 1995). Performing research with different data sets (composed of testlets only, or of independent items and testlets together) and with data sets containing more items will be beneficial in obtaining more reliable results. The analyses in this research used real data; simulations can be conducted to evaluate the testlet effects in differing measurement situations in more detail. To conclude, for tests containing testlets, more unbiased results should be targeted through methods that do not ignore local item dependence when estimating reliability and validity, as has been done in this research.
References

Baykul, Y. (2000). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması [Measurement in education and psychology: Classical test theory and its application]. Ankara: ÖSYM.
Bloch, R., & Norman, G. (2011). G-String 4 user manual (Version 6.1.1). Hamilton, Ontario, Canada: Ralph Bloch & Geoff Norman.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64.
Brennan, R. L. (2001a). Generalizability theory. New York, NY: Springer-Verlag.
Brennan, R. L. (2001b). Manual for urGENOVA (Version 2.1) (Iowa Testing Programs Occasional Paper No. 49). Iowa City, IA: Iowa Testing Programs, University of Iowa.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1).
DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2).
Fukuhara, H. (2009). A differential item functioning model for testlet-based items using a bi-factor multidimensional item response theory model: A Bayesian approach (Doctoral dissertation, Florida State University College of Education). Retrieved from cgi?article=1573&context=etd
Fukuhara, H., & Kamata, A. (2007, November). DIF detection in a presence of locally dependent items. Paper presented at the annual meeting of the Florida Educational Research Association, Tampa, FL.
Fukuhara, H., & Kamata, A. (2011). A bi-factor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35(8).
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hendrickson, A. B. (2001, April). Reliability of scores from tests composed of testlets: A comparison of methods. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets. Educational and Psychological Measurement, 61(6).
Lee, G., & Frisbie, D. A. (1997, March). A generalizability approach to evaluating the reliability of testlet-based test scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Lee, G., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed of testlets. Applied Measurement in Education, 12(3).
Lee, G., Kolen, M. J., Frisbie, D. A., & Ankenmann, R. D. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25.
Lee, G., & Park, I. (2012). A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests. Asia Pacific Education Review, 13(1).
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1).
Nalbantoğlu Yılmaz, F. (2012). Genellenebilirlik kuramında dengelenmiş ve dengelenmemiş desenlerin karşılaştırılması [A comparison of balanced and unbalanced designs in generalizability theory] (Doctoral dissertation, Ankara University, Turkey).
Nalbantoğlu Yılmaz, F., & Uzun Başusta, B. (2012, September). Genellenebilirlik kuramıyla dikiş atma ve alma becerileri istasyonu güvenirliğinin değerlendirilmesi [Evaluating the reliability of a suturing and suture-removal skills station with generalizability theory]. Paper presented at the III.
Ulusal Eğitimde ve Psikolojide Ölçme ve Değerlendirme Kongresi [3rd National Congress on Measurement and Evaluation in Education and Psychology], Bolu, Turkey.
Özçelik, D. A. (2010). Test hazırlama kılavuzu [Test preparation guide]. Ankara: Pegem Akademi.
Sedivy, S. K. (2009). Using traditional methods to detect differential item functioning in testlet data (Doctoral dissertation, University of Wisconsin-Milwaukee).
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28.
Spiegelhalter, D., Thomas, A., & Best, N. (2003). WinBUGS 1.4. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26(3).
Vaughn, B. K. (2006). A hierarchical generalized linear model of random differential item functioning for polytomous items: A Bayesian multilevel approach (Doctoral dissertation, The Florida State University College of Education). Retrieved from fsu.edu/cgi/viewcontent.cgi?article=5595&context=etd
Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. Glas (Eds.), Computerized adaptive testing: Theory and practice. Dordrecht: Kluwer Academic Publishers.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3).
Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27(1).
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28(3).
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15(1).
Wainer, H., & Wang, C. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37.
Wang, W. C., & Wilson, M. (2005). Assessment of differential item functioning in testlet-based items using the Rasch testlet model. Educational and Psychological Measurement, 65(4).
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and application. Applied Psychological Measurement, 26(1).
Yen, W. M. (1993). Scaling performance assessment: Strategies for managing local item dependence. Journal of Educational Measurement, 30.
DISCRIMINANT FUNCTION ANALYSIS (DA)
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
EVALUATION OF THE SKILLS OF K-12 STUDENTS REGARDING THE NATIONAL EDUCATIONAL TECHNOLOGY STANDARDS FOR STUDENTS (NETS*S) IN TURKEY
EVALUATION OF THE SKILLS OF K-12 STUDENTS REGARDING THE NATIONAL EDUCATIONAL TECHNOLOGY STANDARDS FOR STUDENTS (NETS*S) IN TURKEY Adile Aşkım KURT Anadolu University, Education Faculty, Computer &Instructional
Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It
Paper SAS364-2014 Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It Xinming An and Yiu-Fai Yung, SAS Institute Inc. ABSTRACT Item response theory (IRT) is concerned with
I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of
Computer-based testing: An alternative for the assessment of Turkish undergraduate students
Available online at www.sciencedirect.com Computers & Education 51 (2008) 1198 1204 www.elsevier.com/locate/compedu Computer-based testing: An alternative for the assessment of Turkish undergraduate students
Richard E. Zinbarg northwestern university, the family institute at northwestern university. William Revelle northwestern university
psychometrika vol. 70, no., 23 33 march 2005 DOI: 0.007/s336-003-0974-7 CRONBACH S α, REVELLE S β, AND MCDONALD S ω H : THEIR RELATIONS WITH EACH OTHER AND TWO ALTERNATIVE CONCEPTUALIZATIONS OF RELIABILITY
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
Lasso on Categorical Data
Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Test Reliability Indicates More than Just Consistency
Assessment Brief 015.03 Test Indicates More than Just Consistency by Dr. Timothy Vansickle April 015 Introduction is the extent to which an experiment, test, or measuring procedure yields the same results
Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship)
1 Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship) I. Authors should report effect sizes in the manuscript and tables when reporting
Factorial Invariance in Student Ratings of Instruction
Factorial Invariance in Student Ratings of Instruction Isaac I. Bejar Educational Testing Service Kenneth O. Doyle University of Minnesota The factorial invariance of student ratings of instruction across
Impact of Skewness on Statistical Power
Modern Applied Science; Vol. 7, No. 8; 013 ISSN 1913-1844 E-ISSN 1913-185 Published by Canadian Center of Science and Education Impact of Skewness on Statistical Power Ötüken Senger 1 1 Kafkas University,
Time series Forecasting using Holt-Winters Exponential Smoothing
Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
METACOGNITIVE AWARENESS OF PRE-SERVICE TEACHERS
METACOGNITIVE AWARENESS OF PRE-SERVICE TEACHERS Emine ŞENDURUR Kocaeli University, Faculty of Education Kocaeli, TURKEY Polat ŞENDURUR Ondokuz Mayıs University, Faculty of Education Samsun, TURKEY Neşet
Item Response Theory in R using Package ltm
Item Response Theory in R using Package ltm Dimitris Rizopoulos Department of Biostatistics, Erasmus University Medical Center, the Netherlands [email protected] Department of Statistics and Mathematics
A Review of Physical Education Teachers Efficacy in Teaching-Learning Process Gökhan ÇETİNKOL [1], Serap ÖZBAŞ [2]
A Review of Physical Education Teachers Efficacy in Teaching-Learning Process Gökhan ÇETİNKOL [1], Serap ÖZBAŞ [2] [1] Near East University [email protected] [2] Near East University [email protected]
EVALUATION OF PROSPECTIVE SCIENCE TEACHERS COMPUTER SELF-EFFICACY
ISSN 1308 8971 Special Issue: Selected papers presented at WCNTSE EVALUATION OF PROSPECTIVE SCIENCE TEACHERS COMPUTER SELF-EFFICACY a Nurhan ÖZTÜRK, a Esra BOZKURT, b Tezcan KARTAL, c Ramazan DEMİR & d
A STUDY OF THE EFFECTS OF ELECTRONIC TEXTBOOK-AIDED REMEDIAL TEACHING ON STUDENTS LEARNING OUTCOMES AT THE OPTICS UNIT
A STUDY OF THE EFFECTS OF ELECTRONIC TEXTBOOK-AIDED REMEDIAL TEACHING ON STUDENTS LEARNING OUTCOMES AT THE OPTICS UNIT Chen-Feng Wu, Pin-Chang Chen and Shu-Fen Tzeng Department of Information Management,
interpretation and implication of Keogh, Barnes, Joiner, and Littleton s paper Gender,
This essay critiques the theoretical perspectives, research design and analysis, and interpretation and implication of Keogh, Barnes, Joiner, and Littleton s paper Gender, Pair Composition and Computer
PSYCHOLOGY Vol. II - The Construction and Use of Psychological Tests and Measures - Bruno D. Zumbo, Michaela N. Gelin, Anita M.
THE CONSTRUCTION AND USE OF PSYCHOLOGICAL TESTS AND MEASURES Bruno D. Zumbo, Michaela N. Gelin and Measurement, Evaluation, and Research Methodology Program, Department of Educational and Counselling Psychology,
Statistiek II. John Nerbonne. October 1, 2010. Dept of Information Science [email protected]
Dept of Information Science [email protected] October 1, 2010 Course outline 1 One-way ANOVA. 2 Factorial ANOVA. 3 Repeated measures ANOVA. 4 Correlation and regression. 5 Multiple regression. 6 Logistic
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
Implementing the Graduate Management Admission Test Computerized Adaptive Test. Lawrence M. Rudner Graduate Management Admission Council
Implementing the Graduate Management Admission Test Computerized Adaptive Test Lawrence M. Rudner Graduate Management Admission Council Presented at the CAT Models and Monitoring Paper Session, June 7,
Beware of Cheating in Online Assessing Systems
Abstract Massive open online courses (MOOCs) are playing an increasingly important role in higher education around the world, but despite their popularity, the measurement of student learning in these
Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini
NEW YORK UNIVERSITY ROBERT F. WAGNER GRADUATE SCHOOL OF PUBLIC SERVICE Course Syllabus Spring 2016 Statistical Methods for Public, Nonprofit, and Health Management Section Format Day Begin End Building
Evaluating Analytical Writing for Admission to Graduate Business Programs
Evaluating Analytical Writing for Admission to Graduate Business Programs Eileen Talento-Miller, Kara O. Siegert, Hillary Taliaferro GMAC Research Reports RR-11-03 March 15, 2011 Abstract The assessment
How to Get More Value from Your Survey Data
Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2
Relationship of Gender to Faculty Use of Online Educational Tools. Susan Lucas. The University of Alabama
Relationship of Gender to Faculty Use of Online Educational Tools Susan Lucas The University of Alabama Copyright Susan Lucas, 2003 This work is the intellectual property of the author. Permission is granted
WHAT IS A JOURNAL CLUB?
WHAT IS A JOURNAL CLUB? With its September 2002 issue, the American Journal of Critical Care debuts a new feature, the AJCC Journal Club. Each issue of the journal will now feature an AJCC Journal Club
Package EstCRM. July 13, 2015
Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu
Test Bias. As we have seen, psychological tests can be well-conceived and well-constructed, but
Test Bias As we have seen, psychological tests can be well-conceived and well-constructed, but none are perfect. The reliability of test scores can be compromised by random measurement error (unsystematic
Calculator Use on Stanford Series Mathematics Tests
assessment report Calculator Use on Stanford Series Mathematics Tests Thomas E Brooks, PhD Betsy J Case, PhD Tracy Cerrillo, PhD Nancy Severance Nathan Wall Michael J Young, PhD May 2003 (Revision 1, November
Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Battery Bias
Glossary of Terms Ability A defined domain of cognitive, perceptual, psychomotor, or physical functioning. Accommodation A change in the content, format, and/or administration of a selection procedure
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Comparison of Estimation Methods for Complex Survey Data Analysis
Comparison of Estimation Methods for Complex Survey Data Analysis Tihomir Asparouhov 1 Muthen & Muthen Bengt Muthen 2 UCLA 1 Tihomir Asparouhov, Muthen & Muthen, 3463 Stoner Ave. Los Angeles, CA 90066.
Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration
Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the
Party Secretaries in Chinese Higher Education Institutions, Who Are They?
Journal of International Education and Leadership Volume 2 Issue 2 Summer 2012 http://www.jielusa.og/home/ ISSN: 2161-7252 Party Secretaries in Chinese Higher Education Institutions, Who Are They? Hua
User Resistance Factors in Post ERP Implementation
User Resistance Factors in Post ERP Implementation Sayeed Haider Salih 1 e-mail: [email protected] Ab Razak Che Hussin 2 e-mail: [email protected] Halina Mohamed Dahlan 3 e-mail: [email protected] Author(s)
The Relation between Life Long Learning Tendency and Leadership Level of Education Managers
Special Issue 2016-II, pp., 80-88; 01 February 2016 Available online at http://www.partedres.com ISSN: 2148-6123 http://dx.doi.org/10.17275/per.16.spi.2.9 The Relation between Life Long Learning Tendency
Models for Product Demand Forecasting with the Use of Judgmental Adjustments to Statistical Forecasts
Page 1 of 20 ISF 2008 Models for Product Demand Forecasting with the Use of Judgmental Adjustments to Statistical Forecasts Andrey Davydenko, Professor Robert Fildes [email protected] Lancaster
Technical Report. Teach for America Teachers Contribution to Student Achievement in Louisiana in Grades 4-9: 2004-2005 to 2006-2007
Page 1 of 16 Technical Report Teach for America Teachers Contribution to Student Achievement in Louisiana in Grades 4-9: 2004-2005 to 2006-2007 George H. Noell, Ph.D. Department of Psychology Louisiana
Algebra 1 Course Information
Course Information Course Description: Students will study patterns, relations, and functions, and focus on the use of mathematical models to understand and analyze quantitative relationships. Through
Introduction to industrial/organizational psychology
Introduction to industrial/organizational psychology Chapter 1 Industrial/Organizational Psychology: A branch of psychology that applies the principles of psychology to the workplace I/O psychologists
GMAC. Predicting Graduate Program Success for Undergraduate and Intended Graduate Accounting and Business-Related Concentrations
GMAC Predicting Graduate Program Success for Undergraduate and Intended Graduate Accounting and Business-Related Concentrations Kara M. Owens and Eileen Talento-Miller GMAC Research Reports RR-06-15 August
Feifei Ye, PhD Assistant Professor School of Education University of Pittsburgh [email protected]
Feifei Ye, PhD Assistant Professor School of Education University of Pittsburgh [email protected] Validity, reliability, and concordance of the Duolingo English Test Ye (2014), p. 1 Duolingo has developed
Running head: SCHOOL COMPUTER USE AND ACADEMIC PERFORMANCE. Using the U.S. PISA results to investigate the relationship between
Computer Use and Academic Performance- PISA 1 Running head: SCHOOL COMPUTER USE AND ACADEMIC PERFORMANCE Using the U.S. PISA results to investigate the relationship between school computer use and student
Description. Textbook. Grading. Objective
EC151.02 Statistics for Business and Economics (MWF 8:00-8:50) Instructor: Chiu Yu Ko Office: 462D, 21 Campenalla Way Phone: 2-6093 Email: [email protected] Office Hours: by appointment Description This course
Best Practices in Using Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
Qualitative vs Quantitative research & Multilevel methods
Qualitative vs Quantitative research & Multilevel methods How to include context in your research April 2005 Marjolein Deunk Content What is qualitative analysis and how does it differ from quantitative
Introduction to Analysis of Variance (ANOVA) Limitations of the t-test
Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA Limitations of the t-test Although the t-test is commonly used, it has limitations Can only
Comparison of Logging Residue from Lump Sum and Log Scale Timber Sales James O. Howard and Donald J. DeMars
United States Department of Agriculture Forest Service Pacific Northwest Forest and Range Experiment Station Research Paper PNW-337 May 1985 Comparison of Logging Residue from Lump Sum and Log Scale Timber
Effects of Different Response Types on Iranian EFL Test Takers Performance
Effects of Different Response Types on Iranian EFL Test Takers Performance Mohammad Hassan Chehrazad PhD Candidate, University of Tabriz [email protected] Parviz Ajideh Professor, University
An Empirical Study on the Effects of Software Characteristics on Corporate Performance
, pp.61-66 http://dx.doi.org/10.14257/astl.2014.48.12 An Empirical Study on the Effects of Software Characteristics on Corporate Moon-Jong Choi 1, Won-Seok Kang 1 and Geun-A Kim 2 1 DGIST, 333 Techno Jungang
Assessing a theoretical model on EFL college students
ABSTRACT Assessing a theoretical model on EFL college students Yu-Ping Chang Yu Da University, Taiwan This study aimed to (1) integrate relevant language learning models and theories, (2) construct a theoretical
A Bayesian Approach to Person Fit Analysis in Item Response Theory Models
A Bayesian Approach to Person Fit Analysis in Item Response Theory Models Cees A. W. Glas and Rob R. Meijer University of Twente, The Netherlands A Bayesian approach to the evaluation of person fit in
The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables. Kathleen M. Lang* Boston College.
The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables Kathleen M. Lang* Boston College and Peter Gottschalk Boston College Abstract We derive the efficiency loss
ESTIMATION OF THE EFFECTIVE DEGREES OF FREEDOM IN T-TYPE TESTS FOR COMPLEX DATA
m ESTIMATION OF THE EFFECTIVE DEGREES OF FREEDOM IN T-TYPE TESTS FOR COMPLEX DATA Jiahe Qian, Educational Testing Service Rosedale Road, MS 02-T, Princeton, NJ 08541 Key Words" Complex sampling, NAEP data,
Drawing Inferences about Instructors: The Inter-Class Reliability of Student Ratings of Instruction
OEA Report 00-02 Drawing Inferences about Instructors: The Inter-Class Reliability of Student Ratings of Instruction Gerald M. Gillmore February, 2000 OVERVIEW The question addressed in this report is
the metric of medical education
the metric of medical education Reliability: on the reproducibility of assessment data Steven M Downing CONTEXT All assessment data, like other scientific experimental data, must be reproducible in order
