The Early Literacy Skills Assessment (ELSA)
Psychometric Report for Both English and Spanish Versions
High/Scope Early Childhood Reading Institute
5/9/2007
Jacob E. Cheadle, Ph.D.

Table of Contents

1 Introduction
  Summary of Findings
  Methodological Considerations
    Model Specification
    Model Fit
    Testing Factorial Invariance
    Normality and Modeling Observed Measures
    Reliability
  Conceptual Issues: The Approach
  Figures
2 English ELSA
  Sample Descriptive Statistics
    Tables
  Reading Comprehension
    Choosing a Factor Model
    Factor Correlations & Reliability
    Factor Score Distributions
    Change in Children's Scores
    Figures
    Tables
  Phonological Awareness
    Choosing a Factor Model
    Factor Correlations & Reliabilities
    Factor Score Distributions
    Change in Children's Scores
    Figures
    Tables
  Alphabetic Principle
    Choosing a Factor Model
    Factor Correlations & Reliability
    Factor Score Distributions
    Change in Children's Scores
    Figures
    Tables
  Concepts about Print
    Choosing a Factor Model
    Factor Correlations & Reliability
    Factor Score Distributions
    Change in Children's Scores
    Figures
    Tables
  The Full ELSA
    Tables
  Concurrent Validity
    Tables
  Discussion & Conclusion
3 Spanish ELSA
  Sample Descriptive Statistics
    Tables
  Reading Comprehension
    Choosing a Factor Model
    Factor Correlations & Reliability
    Factor Score Distributions
    Assessing Change
    Figures
    Tables
  Phonological Awareness
    Approach & Orientation
    Choosing a Factor Model
    Factor Score Correlations & Reliability
    Factor Score Distributions
    Assessing Change
    Tables
  Alphabetic Principle
    Approach & Orientation
    Choosing a Factor Model
    Factor Score Correlations & Reliability
    Factor Score Distributions
    Assessing Change
    Figures
    Tables
  Concepts about Print
    Approach & Orientation
    Choosing a Factor Model
    Factor Score Correlations & Reliability
    Factor Score Distributions
    Assessing Change
    Figures
    Tables
  The Full ELSA
    Tables
  Comparing the Spanish & English Versions of the ELSA
    Tables
  Discussion & Conclusion
Appendix: Models & Estimation
  English
    Comprehension
    Phonological Awareness
    Alphabetic Principle
    Concepts about Print
  Spanish
    Comprehension
    Phonological Awareness
    Alphabetic Principle
    Concepts about Print

Table of Figures

Figure 1.4-1: General first- and second-order factor models
Figure 1.4-2: Two-wave or pre-post factor model
Figure 2.2-1: Graphical depiction of the first-order, single factor, comprehension CFA
Figure 2.2-2: Graphical depiction of second-order comprehension CFA
Figure 2.2-3: Pretest factor score histograms by scoring method
Figure 2.2-4: Posttest factor score histograms by scoring method
Figure 2.2-5: Pre- and posttest scores from the two-wave model
Figure 2.2-6: Nonparametric relationship between age and children's comprehension scores by scoring method
Figure 2.2-7: Relationship between time between assessments, child age at pretest, and children's comprehension growth
Figure 2.3-1: Pre- and posttest phonological awareness factor scores by scoring method from cross-sectional and longitudinal models
Figure 2.3-2: Nonparametric relationship between age and children's phonological awareness scores by scoring method
Figure 2.3-3: Relationship between time between assessments, child age at pretest, and phonological awareness growth
Figure 2.4-1: Pre- and posttest alphabetic principle factor scores by scoring method from cross-sectional models
Figure 2.4-2: Nonparametric relationship between age and children's pre- and posttest scores by scoring method
Figure 2.4-3: Relationship between time between assessments, child age at pretest, and alphabetic principle growth
Figure 2.5-1: Pre- and posttest concepts about print factor scores by scoring method from cross-sectional and longitudinal models
Figure 2.5-2: Nonparametric relationship between age and children's concepts about print posttest scores by scoring method
Figure 2.5-3: Relationship between time between assessments, child age at pretest, and concepts about print growth
Figure 3.2-1: Basic CFA for reading comprehension
Figure 3.2-2: Basic CFA for reading comprehension
Figure 3.2-3: Factor score distributions across scoring methods, pretest
Figure 3.2-4: Factor score distributions across scoring methods, posttest
Figure 3.2-5: Pre- and posttest scores from the two-wave model
Figure 3.2-6: Nonparametric relationship between children's comprehension scores and age (Lowess curves)
Figure 3.2-7: Relationship between time between assessments, child age at pretest, and children's comprehension growth
Figure 3.3-1: Pre- and posttest factor scores across scoring methods
Figure 3.3-2: Nonparametric relationship between age and phonological awareness by scoring method
Figure 3.3-3: Relationship between time between assessments, child age at pretest, and phonological awareness growth
Figure 3.4-1: Pre- and posttest alphabetic principle factor scores across scoring methods
Figure 3.4-2: Nonparametric relationship between age and alphabetic principle by scoring method
Figure 3.4-3: Relationship between time between assessments, child age at pretest, and alphabetic principle growth
Figure 3.5-1: Pre- and posttest concepts about print factor scores across scoring methods
Figure 3.5-2: Nonparametric relationship between age and concepts about print, by scoring method
Figure 3.5-3: Relationship between time between assessments, child age at pretest, and concepts about print growth
Figure 4.3-1: Factor loadings and fit for an alphabetic principle CFA identified using factor scores from the first-order IRT models
1 Introduction

In this document we explore the statistical properties of the Early Literacy Skills Assessment (ELSA). We have approached the analysis of the ELSA from two different perspectives, with the idea that the instrument is useful to two populations of users, differentiated largely by the scale of their operations. Because the ELSA does not require trained professionals to administer, child-care professionals may use it for small projects within specific centers or across a small number of centers, to the extent that it is a valid and reliable instrument from which meaningful inferences regarding children's early literacy skills can be drawn. Parents may also wish to use the ELSA at the individual family level to acquire information about specific components of their children's skills. Larger projects spanning many facilities may also wish to use the ELSA to assess early childhood reading programs, or simply to take stock of the reading skills of cohorts of children.

We expect that resources and data-analytic skills will vary substantially between large and small scale projects and that, accordingly, use of the ELSA will differ. Large scale projects with heterogeneous samples of children and access to psychometricians or statisticians with strong methodological skills and familiarity with statistical software will probably make use of the item-level data from children's responses to the ELSA differently than smaller projects without these resources. There is little doubt that statistical methodologies like item response theory (IRT; De Boeck and Wilson 2004; van der Linden and Hambleton 1997) modeling with large samples are the preferred way to score children's latent early literacy skills, and we hope that projects with these resources will employ them, but not all projects using the ELSA will have these resources. Because the ELSA can flexibly scale between projects with different needs and resources, we have constructed this assessment of the ELSA in a way that highlights the similarities between scoring methodologies.

The ELSA contains four instruments: comprehension, phonological awareness, alphabetic principle, and concepts about print. For additional details on the ELSA see DeBruin-Parecki (2005).

1.1 Summary of Findings

The ELSA comes in a total of four forms: the Dante and Violet versions, each available in English and Spanish. The results indicate that the ELSA performs many functions reasonably well. It reliably scores children's early literacy skills, it discriminates change in children's skills, and the evidence suggests it may be possible to compare the skills of English and Spanish-speaking children. These good qualities are consistent across the English and Spanish versions. There are some less desirable qualities as well, although as we note below, they are manageable. In terms of implementation, the ELSA is notable because it does not need trained professionals to

assess children, while its shorter length improves the manageability of assessing children and can significantly lower project costs. In some cases, however, constructs are measured with relatively few items, which can limit the ability of the ELSA to discriminate low-scoring children (comprehension, alphabetic principle, and phonological awareness) or high-scoring children (concepts about print). These issues are not insurmountable, nor do they invalidate the ELSA. Overall, the ELSA is a good test. Below we summarize:

Statistical performance of the ELSA. The factor analyses suggested a number of important facts about the ELSA: (1) factor structures for both the English and Spanish versions across the ELSA instruments adequately reproduced the observed variation in the data; (2) factor loadings were typically high, indicating that the items had adequate discrimination; (3) factor structures were similar over time, indicating that the instrument factor models were largely invariant and suggesting that the same constructs are measured on both the pre- and posttests; (4) the Dante and Violet versions performed similarly for the English ELSA,¹ indicating that children scored using either version can be compared; (5) factor models were similar for the English and Spanish versions, suggesting that it may be possible to compare the early literacy skills of English and Spanish-speaking children using the ELSA. These results, which hold across versions, across English and Spanish-speaking samples, and over time, provide considerable statistical support for the ELSA.

Reliability. The ELSA reliability estimates were consistently above .6 and were often .8 or higher across instruments, indicating that both the English and Spanish versions of the ELSA reliably measure children's comprehension, phonological awareness, alphabetic principle, and concepts about print.

Concurrent validity. Although we did not have concurrent measures for all of the ELSA instruments, the alphabetic principle instrument correlated highly with similar items from the Woodcock-Johnson and Pre-CTOPP. In addition, the phonological awareness and concepts about print instruments also correlated over .6 with items reflecting phonological awareness and concepts about print in the Pre-CTOPP, indicating that the 3 of the 4 ELSA instruments for which concurrent measures were available overlapped substantially with previously validated measures.

Estimated posterior factor score distributions. We spent considerable time exploring the distributions of children's estimated scores. The ELSA does not always discriminate children with poor skills well, for either the English or Spanish version of the test, although this poor discrimination could as easily indicate the absence of a skill in a subset of the disadvantaged samples used in this report. In general, floor effects were more severe when using scale scores or cross-sectional estimates, while scoring children using a two-wave pre-post model generally reduced the number of children scored at the absolute floor to trivial numbers. In addition, floor effects were related to children's age, so assessing children 45 months or older should further reduce the effects of poor discrimination for low-scoring children from disadvantaged populations.

Assessing change in children's early literacy skills using the ELSA. Notably, despite some non-normality in the factor and scale score distributions, the ELSA is able to reliably discriminate change in children's scores, and the ability to discriminate change improves when factor scores are estimated using pre-post models.

¹ We did not pursue version differences for the Spanish ELSA because of sample limitations.

Overall, there are many positive features recommending the ELSA. As noted, the statistical properties are good and validated across different samples, versions, and over time. In fact, the only real concern arising from the ELSA is the potential for poor discrimination among low-scoring children. This problem may be exacerbated in disadvantaged samples like those used here, and it is likely to pose fewer issues with more heterogeneous samples for which greater information across items may be available. In addition, when possible, flooring is greatly reduced when children are scored using both pre- and posttest estimation simultaneously. Pre-post models contain substantially more information on children's skill levels than cross-sectional models, which allows for improved scoring of children. Furthermore, we suggest that floor effects are not a serious concern for small projects using level scores, since children with very few correct answers will be scored in the basal category anyway. Floor effects are most problematic for the estimation of mean change in children's scores across a continuous ability/skill distribution. For projects without the ability to scale children's scores using pre- and posttest information, we recommend being cognizant of the extent to which children are scored at the floor, since these effects are not likely to be important for all populations, or for assessments of children older than approximately 45 months of age. If significant proportions of children are scored at the floor, it may be useful to use level scores. Either way, there is a good deal of reliable information on children's early literacy skills in the ELSA.

Taken in sum, these results confirm the reliability of the ELSA as a measure of children's early literacy skills. Furthermore, the consistency of the results supports the general validity of the ELSA constructs for assessing both English and Spanish-speaking populations.

1.2 Methodological Considerations

We use a variety of confirmatory factor analysis (CFA) approaches to understand the properties of the ELSA. Confirmatory factor analysis, a subset of what is commonly referred to as structural equation modeling (SEM) with latent variables (e.g., Bollen 1989), is a type of factor analysis in which factor structures are specified a priori. Because factor structures are specified by the researcher or practitioner, generally according to some theory or hypothesis that posits patterns of relationships between the observed indicators, a variety of tools not available with traditional factor analytic techniques become available. As discussed below, the CFA approach allows explicit factor structures to be analyzed, item-level measurement to be treated flexibly, patterns of factorial invariance to be assessed, and reliability to be measured based upon the ability of the model to accurately score children. Additionally, by imposing constraints upon the mean structure, it is also possible to estimate change or growth in latent variables.
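As context for the sections that follow, the measurement model behind this approach can be stated compactly. The equations below are standard CFA/SEM notation, our restatement rather than formulas taken from the report:

    % First-order model: item j for child i loads on a single factor eta.
    x_{ij} = \lambda_j \eta_i + \varepsilon_{ij}

    % Second-order model: first-order factor k loads on a general factor xi.
    \eta_{ik} = \gamma_k \xi_i + \zeta_{ik}

    % Model-implied covariance structure, compared against the observed one:
    \Sigma(\theta) = \Lambda \Psi \Lambda^{\top} + \Theta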

Model Specification

Model specification using the CFA approach is, in the simplest case, a general factor model with explicit restrictions placed upon the factor loadings. For the ELSA assessment, we typically assess two factor structures:² a single first-order factor solution and a second-order factor solution. The first-order solution, as shown in panel A of Figure 1.4-1, is a single factor solution, identified by all indicators comprising one of the ELSA subtests (e.g., all comprehension items). This model suggests that all items relating to comprehension load on a single common factor. The second-order solution acknowledges that the ELSA subtests capture general constructs which are related to sub-dimensions (e.g., comprehension skills reflect prediction, retelling, and connection to life subdimensions). This solution is conceptually similar to an exploratory factor analysis with a non-orthogonal rotation permitting correlations amongst the factors. It is these correlations between subfactors or first-order factors that identify the general or second-order factor, which is in many cases the factor about which users wish to draw inferences (e.g., children's knowledge of concepts about print). However, there is nothing ad hoc about this approach. Factor models are derived explicitly from the ELSA scoring sheet and, importantly, this explicit model specification approach makes a broader range of statistical machinery available to study the ELSA, as described below, allowing a cohesive picture of the statistical evidence for the reliability and validity of the ELSA to be painted.

² More accurately, we pursue two typical factor patterns, often exploring a number of submodels, not reported here, to better understand the factor structure.

Model Fit

Because specifying an explicit factor structure restricts the degrees of freedom, it is possible to assess model fit globally. Conceptually, model fit can be assessed by comparing two sources of information. The first source of information is the observed covariance structure of the different items. Without imposing any constraints upon the model, the observed indicators are intercorrelated; it is this intercorrelation which suggests that the measurements may reflect some latent or unobserved trait. The second source of information is the expected covariance structure implied by the a priori specified factor model. This structure is implied by relating specific items to specific factors, with some loadings fixed at zero (no relationship with a given factor) and others freely estimated (items allowed to load on a factor). Rather than omitting "any loading below .4" from a table, these loadings are explicitly omitted or estimated in the CFA approach. By comparing these observed and expected covariance structures, it is possible to assess whether the specified model accurately reproduces the observed pattern of associations. Because the expected covariance pattern is nested in the unstructured observed covariance pattern, model fit is assessed with a chi-square difference test.

Because the chi-square difference test is sensitive to sample size, making it possible to detect trivial differences, a number of additional model-fit assessments have been developed. In addition to the chi-square difference test, model fit is assessed using the root mean square error of approximation (RMSEA), which computes average lack of model fit per degree of freedom. Following normal conventions, values of .06 or lower are taken to indicate adequate model fit (Hu and Bentler 1999). Another measure, the comparative fit index (CFI), also known as the Bentler Comparative Fit Index, ranges between 0 and 1, with values above .9 typically considered acceptable. This goodness-of-fit index is interpreted such that a CFI of .95 indicates that the model reproduces 95% of the covariation observed in the data. The final measure used is the Tucker-Lewis index (TLI), which also ranges from 0 to 1, with values closer to 1 indicating better fit. Values less than .9 indicate that the model should be respecified, while values of approximately .95 or better are typically taken to indicate adequate model performance (Hu and Bentler 1999).

Of course, good model fit is not proof that the model generated the data. However, adequate fit means that the hypothesized model is consistent with the observed pattern of relationships amongst items or measurements. This congruence is one sign of validity. Tests or assessments for which conceptually consistent models do not accurately reproduce the observed patterns of relationships in the data are not valid. Measurement must map onto meaning.
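For readers who want to compute the fit indices above from reported chi-square values, a minimal sketch follows. The formulas are the standard ones; the numeric inputs in the example are hypothetical, not values from this report:

    import math

    def rmsea(chi2_m, df_m, n):
        # Average lack of fit per degree of freedom; <= .06 is adequate.
        return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

    def cfi(chi2_m, df_m, chi2_b, df_b):
        # Comparative fit index relative to the baseline (independence) model.
        d_m = max(chi2_m - df_m, 0.0)
        d_b = max(chi2_b - df_b, d_m)
        return 1.0 - d_m / d_b if d_b > 0 else 1.0

    def tli(chi2_m, df_m, chi2_b, df_b):
        # Tucker-Lewis index; values near .95 or better indicate adequate fit.
        return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

    # Hypothetical model and baseline chi-squares for a sample of 1,040 children.
    print(rmsea(45.3, 19, 1040), cfi(45.3, 19, 880.0, 28), tli(45.3, 19, 880.0, 28))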

Testing Factorial Invariance

The ability to estimate exact-fit statistics and goodness-of-fit measures means that proper model specification allows explicit hypotheses to be tested. For psychometric assessments, two characteristics, in addition to general cross-sectional test performance, are important. Foremost is the extent to which the factor model performs similarly over time. Is the factor structure similar for the pre- and posttest? Are the factor loadings similar, indicating that the same factors are being measured? Additionally, because the ELSA comes in two forms (for both the English and Spanish versions), performance of the Dante and Violet versions should be similar, so that researchers and practitioners can compare across versions if need be. These patterns of assessment are possible because, once again, the CFA approach to latent variable modeling posits explicit models, with parameter constraints used to identify the model and salvage degrees of freedom. Comparisons of model fit across waves or test versions are possible by imposing constraints like equality of factor loadings in across-wave factor models or across-sample factor models (e.g., Dante vs. Violet, or English vs. Spanish-speaking samples) in the case of version assessment. A two-wave pre-post second-order factor model would look like the one depicted in Figure 1.4-2, while factorial invariance can be explored by assessing model fit after adding constraints such as λ1 = λ2 and γ1 = γ2. Instruments that are both reliable and valid must measure the same construct over time, which means that the relationship between children's observed and latent scores (e.g., phonological awareness) should be similar. Furthermore, conceptually similar models can be estimated with parameter constraints imposed across test versions (Dante vs. Violet) to assess whether or not the tests function similarly. Since Dante and Violet were designed to measure the same constructs, evidence of similar fit provides further statistical support for the reliability of the ELSA and also means that children's scores can be compared across versions. Researchers may also wish to know if it is possible to compare the skills measured by the ELSA across English and Spanish-speaking samples (see the section comparing the Spanish and English versions of the ELSA below).

Normality and Modeling Observed Measures

The CFA approach we employ, in addition to allowing specific a priori hypothesized models to be assessed cross-sectionally, over time, and across samples, can handle a variety of measurement strategies at the item level of analysis. For example, the comprehension assessment measures are counts, often leading to highly non-normal distributions that violate the assumptions upon which standard factor analysis is based. For alphabetic principle, the items for uppercase letter recognition are a series of dichotomous indicator variables that take on values indicating incorrect/correct. These variations in measurement can lead to problems and incorrect inferences when using standard exploratory factor analysis software in generalist programs like SPSS and Stata. Thus, the approach we take allows variables to be treated as normally distributed, as counts, or as categorical. Conceptually, the analysis of categorical indicators, be they dichotomous or ordered-polytomous, is equivalent to item response theory (IRT) modeling, where constraints on the factor loadings result in the 1-parameter or Rasch model, while freely estimated factor loadings are equivalent to two-parameter models (e.g., van der Linden and Hambleton 1997).

When variables are treated as normal in the following analyses, robust standard errors and model fit statistics (Yuan and Bentler 2000; Muthén & Muthén 2006) are used to adjust for item non-normality. This leads to better fit statistics but lower reliabilities, because the standard errors of the variance estimates increase. In general, the normal analyses are to be discounted relative to the count models, which model the indicators as Poisson processes, or the categorical IRT approaches. Unless otherwise noted, the categorical IRT analyses for dichotomous and ordered-polytomous items use probit and ordered probit equations relating children's unobserved trait (e.g., concepts about print) to the observed measures. Traditional approaches to IRT modeling use maximum likelihood to estimate logit and ordered logit-like equations, while we use a weighted-least-squares estimator in order to make use of the full range of model fit assessments described above.³ Once again, robust estimates are used because the children are clustered in centers, and assessors within those centers, so the observations are not independent. Furthermore, the samples used to study the ELSA are not simple random samples of children. They are instead samples of convenience with unknown degrees of nonrandomness. So once again, we reiterate that the methods we employ adjust reliability estimates downward, so that those provided are smaller than those which would be obtained from more traditional approaches assuming simple random sampling. We suggest that these reliability estimates be interpreted as conservative.

³ In addition, the weighted-least-squares estimators tend to have much lower computation times for more complex models. All CFA analyses were carried out using Mplus version 4.2 (Muthén & Muthén 2006).
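A minimal sketch of the probit item response function used in the categorical analyses described above, assuming SciPy; the parameter values are hypothetical:

    from scipy.stats import norm

    def p_correct(theta, loading, threshold):
        # Normal-ogive (probit) IRT: P(y = 1 | theta) = Phi(loading*theta - threshold).
        # Constraining all loadings equal gives a 1-parameter (Rasch-like) model;
        # freely estimated loadings give the 2-parameter model.
        return norm.cdf(loading * theta - threshold)

    # A child half a standard deviation above the mean on a moderately
    # discriminating item (hypothetical values).
    print(p_correct(theta=0.5, loading=1.2, threshold=0.3))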

Reliability

Cronbach's α, which is a measure of response consistency, is the most common measure used to gauge reliability. The model implied by this measure, however, is a single-factor model with equal factor loadings, which may not be the model that generated the data. Furthermore, Cronbach's α assumes that the items are distributed normally, which is typically not the case, and is certainly not the case with the ELSA, where most items are dichotomous or ordered-polytomous. We report α for mean or summation scores on the ELSA instruments, but in general the most accurate reliability estimates are model derived and are based on taking the ratio of true-score to observed-score variance (see McGrew and Woodcock 2001). Thus, we calculate reliability as

    ρ = σ²_η / (σ²_η + σ²_ε),

where σ²_η is the variance of the estimated latent scores and σ²_ε is the error variance of those estimates. Reliability can be improved by (a) increasing the variance of children's scores, and/or (b) decreasing the uncertainty in the latent score estimates. Notably, because the standard error estimates used to calculate reliability are from robust analyses using a "sandwich" estimator, the results we present are conservative relative to normal techniques which assume simple random sampling.
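A sketch of the model-derived reliability calculation, under our reading of the formula above; the input values are hypothetical:

    def model_reliability(latent_var, error_var):
        # Ratio of true-score variance to total (true + error) variance;
        # more items or more information shrinks error_var and raises reliability.
        return latent_var / (latent_var + error_var)

    # Hypothetical latent score variance and error variance.
    print(model_reliability(0.85, 0.15))  # -> 0.85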

1.3 Conceptual Issues: The Approach

Because the potential uses of the ELSA range between small-scale assessment by parents or individual centers and larger projects with heterogeneous samples, we explore the ELSA from a variety of different perspectives. The analysis is guided by the idea that proper model specification can shed a great deal of light on the statistical properties of the test. Explicit model formulation allows the correspondence between the factor model and the theory of the test to be assessed. If the test is valid, then the observed data should be consistent with a latent variable formulation positing that the observed data are generated by a model consistent with the theoretical structure of the test, since a valid test will produce measurements consistent with the unobserved test construct. In addition, if the test is intended to assess changes in children's skills, then the relationships between the indicator variables and the latent constructs should also be consistent over time. That is, the relationship between children's observed and latent variables should be the same on the pre- and posttest. Furthermore, because the Violet and Dante versions of the test are designed to be comparable and are based upon the same theory of the test, they should be relatively factor invariant as well.⁴ Thus, the CFA approach can provide statistical support for the validity of the ELSA, since an invalid assessment should have poor cross-sectional measurement properties, poor properties over time, and poor properties across assessments. The evidence provided about validity, however, is really a failure to disconfirm the ELSA. Disconfirmation is, of course, a limited form of support, and with multiple avenues of disconfirmation, multiple small failures to disconfirm are taken to be additive.

The following analyses are oriented around statistical models, and these analyses are related to mean or sum scores, denoted scale scores in the following text, so that both populations of ELSA users can gain a greater understanding of the test when different scoring methodologies are employed. Conceptually, the statistical analyses are designed to provide insight into the reliability and validity of the ELSA, to highlight its limitations, and to suggest where more heterogeneous samples would be useful for further understanding the ELSA. By illustrating the congruence between the statistical and scale scoring, assessors with fewer resources can better understand the limitations and strengths of the ELSA under the conditions within which they practice, while larger projects will have guideposts for constructing more detailed analyses of children's early literacy skills.

⁴ Notably, the Dante and Violet versions of the ELSA were given to different samples, so sampling variability should produce at least marginally different measurement characteristics between these tests.

Figures

Figure 1.4-1: General first- and second-order factor models (A: first-order; B: second-order)
Figure 1.4-2: Two-wave or pre-post factor model (Wave 1 and Wave 2, with loadings λ1, λ2 and γ1, γ2)
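Restating the two-wave structure of Figure 1.4-2 in equations, using our notation and following the figure's λ and γ labels:

    % Wave t = 1 (pretest) or t = 2 (posttest); item j, first-order factor k, child i.
    x_{ij}^{(t)} = \lambda_j^{(t)} \eta_{ik}^{(t)} + \varepsilon_{ij}^{(t)}
    \eta_{ik}^{(t)} = \gamma_k^{(t)} \xi_i^{(t)} + \zeta_{ik}^{(t)}

    % Factorial invariance: equal loadings across waves.
    \lambda_j^{(1)} = \lambda_j^{(2)}, \qquad \gamma_k^{(1)} = \gamma_k^{(2)}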

2 English ELSA

The analyses presented below are based upon grouping the Violet and Dante samples together. Since sample sizes are relatively similar for these two versions, they each contribute similar weight to the parameter estimates. Different statistical properties within these two samples should lead to model misfit, so adequate model fit for the combined samples is taken to indicate similar performance. However, we also explicitly explore subsample performance to ensure that our inferences meet a high standard of quality.

One potential limitation of the following analyses regards the samples used. These samples are relatively disadvantaged, leading to questions about how the ELSA performs with more advantaged populations, or how the ELSA performs in the larger population more generally. Where floor effects are estimated with disadvantaged children, ceiling effects may be evident for their most advantaged peers. We are unable to fully explore this. In many cases, more heterogeneous samples with broader coverage of the abilities/skills assessed by the ELSA instruments may lead to better performance of the ELSA. Below we give brief descriptions of the samples used for the analyses.

2.1 Sample Descriptive Statistics

Sample descriptive statistics for Dante and Violet are shown in Table 2.1-1 and Table 2.1-2, respectively. The sample sizes are relatively equal, with a total sample size of 535 for Dante and 505 for Violet. There are more sites in the Violet sample (68) than in the Dante sample (8), while across samples children are clustered in a total of 76 centers. The gender composition across samples is approximately equal, although just over 6% of children scored using Dante, compared to fewer than 4% for Violet, have special needs. Age at assessment and time between assessments are also similar across samples, although there is far less variability in age at assessment for the Violet sample. Greater variability in time between assessments is useful for approaches to estimating change when multiple observations are taken. For change scores, however, homogeneity is useful because it limits variability that can only be taken into account as a nuisance parameter, as compared to approaches like growth curve modeling, which build time formally into the model structure. Averaging over both samples, the average child was approximately 52 months old at the first assessment and 57.7 months old at the second, so that the average child was assessed about 5.5 months after the initial assessment.

Both samples are relatively disadvantaged. For Dante, for example, over 60% of the sample attends either Head Start or a subsidized care program. Only 14% of children in the Dante sample are placed in center-based care. In addition, the sample is over 60% Hispanic or Black. The Violet sample is whiter than Dante, at over 70% white. In addition, over 30% of the sample is in center-based care, although the

remainder of the children are either in Head Start or the Michigan School Readiness Program, indicating that the sample is again relatively disadvantaged.

Tables

Table 2.1-1: Descriptive statistics for the Dante sample

Variable/Characteristic       Mean/%
N                             535
# of Sites                    8
% Female                      47.6
% Special Needs               6.2
Age at Assessment (months)
  Wave 1                      (8.2)
  Wave 2                      (7.5)
Race
  % White                     25.7
  % Black                     34.0
  % Hispanic                  26.3
  % Other                     14.0
Program Type
  Head Start                  31.3
  Subsidized Care
  Other Program
  Center Care
  Home Care                   4.13
  Missing                     8.66

Standard deviations in parentheses.

Table 2.1-2: Descriptive statistics for the Violet sample

Variable/Characteristic              Mean/%
N                                    505
# of Sites                           68
% Female                             50.4
% Special Needs                      3.7
Age at Assessment (months)
  Wave 1                             (4.3)
  Wave 2                             (3.5)
Race
  % White                            70.6
  % Black                            11.7
  % Hispanic                         3.4
  % Asian                            7.8
  % Other                            6.6
Program Type
  Center Care                        33.3
  Head Start                         28.3
  Michigan School Readiness Program  38.4

Standard deviations in parentheses.

2.2 Reading Comprehension

Reading comprehension is composed of eight items, with four items falling under the prediction heading (Q. 2, 6, 11, & 17), two under retelling (Q. 8 & 19), and two under connection to life (Q. 9 & 18). Each of these questions is scored as a count of the number of responses, which complicates the analysis. The difficulty for model assessment arises because each response is highly non-normal, and the more accurately the non-normality is treated, the farther one moves away from traditional approaches to factor analysis and the assessment of model fit. Because of this, we specify Poisson, ordered-categorical (IRT), and normal models, (a) assessing fit when possible, (b) correlating the factor scores to assess the degree of correspondence across approaches, and (c) graphing the expected posterior factor score distribution to assess the degree of non-normality in the estimates of children's latent comprehension score, denoted generically as η. We also explore the relationships between age, time between assessments, and children's scores using nonparametric Lowess curves and fractional-polynomial prediction plots to better understand how the ELSA captures change in children's skills.

In addition, the fact that different items on the test are thought to comprise distinct components of comprehension (i.e., prediction, retelling, and connection to life) suggests an alternative model formulation for assessment of children's comprehension scores. The first and simplest model specification, which is graphically represented in Figure 2.2-1, posits that the responses or test items are a function of children's latent comprehension skills. The second model, presented in Figure 2.2-2, suggests that the relationship between children's comprehension and the observed indicators is mediated through the prediction, retelling, and connection to life subdimensions of children's comprehension. The item-level composition of the comprehension instrument subdimensions is available in the appendix. Conceptually, these models are the same as those described in Figure 1.4-1 for cross-sectional analyses and Figure 1.4-2 for pre-post analyses. We explore similar models for the other ELSA instruments as well.

Choosing a Factor Model

Model fit for the pretest comprehension instrument appears in Table 2.2-1 for the first- and second-order (a) Poisson CFA (i.e., count model), (b) categorical or IRT CFA, and (c) traditional normal CFA (with robust estimates). The second-order factor structure is preferred across model specifications, and the chi-square tests of model fit for the IRT and normal models are nonsignificant, indicating that the model exactly reproduces the observed variation across the combined Dante and Violet assessments. The posttest results appear in Table 2.2-2 and largely confirm the pretest findings. Although the chi-square tests are statistically significant, the RMSEA, CFI, and TLI values for the second-order IRT and normal models are all adequate, suggesting a high degree of correspondence between the observed and model-expected covariance structures.

Model fit over time and across test versions is presented in Table 2.2-3. The BIC values are similar for the Poisson model with and without equal factor loadings, indicating that the factor invariant model is preferred. In addition, model fit for the constrained pre-post IRT and normal models is also adequate, suggesting that models explicitly positing temporal invariance fit the data acceptably. Additionally, model fit for the multigroup model with equal loadings for the Violet and Dante versions of the ELSA is adequate, suggesting that the test performs similarly in both samples. These results confirm that the comprehension instrument performs adequately cross-sectionally, over time, and across test versions. Factor loadings appear in the appendix.

Factor Correlations & Reliability

Correlations between scale scores and factor scores derived from the cross-sectional pre- and posttest models are presented in Table 2.2-4. Notably, the different scoring methods are highly intercorrelated for both the pre- and posttests, which should be comforting to users of the ELSA without the sample sizes or resources necessary for a complete statistical analysis of their children. Correlations over time are modest, indicating that there are differences in the rank ordering of children between waves. Reliabilities appear on the diagonal. The categorical IRT estimates are the most precise, with Cronbach's α and the normal models providing the lowest. The pretest IRT reliability is .89, and .83 for the Poisson CFA, while the respective reliabilities are .91 and .83 for the posttest. By normative standards the reliability of the ELSA is adequate.

Factor Score Distributions

If there is one important limitation of the ELSA, it involves the sometimes small number of items used to identify different constructs. The comprehension instrument contains only 8 items. This can, of course, be a boon, since it can result in significantly lower implementation costs. Pretest factor score distributions by scoring method are presented in Figure 2.2-3. Despite the adequate model fit and score reliabilities across methods, there is some non-normality across methods, with the most non-normal scores derived from scale scoring and from the normal models. The problems are ameliorated substantially for the Poisson CFA and IRT models, although in all cases there is evidence of floor effects. These floor effects can bias children's estimated growth downward, since children are not scored completely into the left tail of the distribution. It is worth noting here, however, that for the purposes of categorizing children, that is, to make inferences on grouped frequency distributions (or level scores), this is not an important limitation of the ELSA, since the floor from these scoring methods will simply comprise the bottom category (those with very low skills) in the grouped distribution.

Posttest factor score distributions by method appear in Figure 2.2-4. Both the scale and normal scores are non-normal, with count-like distributions susceptible to outliers. The problems are ameliorated for the Poisson CFA and especially for the IRT model. Furthermore, as shown in Figure 2.2-5, factor scores from a pre-post IRT model are much more normal in shape, with smaller truncation problems in the right tail, although the variability remains limited in the left tail for the pretest. For projects with reasonable sample sizes, the preferred treatment of the ELSA is to use statistical models drawing on both pre- and posttest information to rescale the earlier scores and to garner richer information about the distribution of scores.

The relationship between the comprehension instrument and age is shown in Figure 2.2-6. Children's scores increase with age, but the most important finding concerns the nonlinearity, prior to approximately 45 months of age, in children's posttest scores. This nonlinearity results from the small floor effect, which disproportionately affects younger children. Children less than 42 months of age are over 2 times more likely to be scored at the absolute floor than children over 51 months old. Furthermore, about 8% (81) of children are scored at the floor when using scale scores, while only 5 are scored there when scores are estimated in the pre-post model. As we see below, these floor effects turn out not to play a significant role in drawing inferences from the ELSA about change in children's comprehension skills.

Change in Children's Scores

The relationships between time between assessments and children's comprehension growth for younger children (age < 46 months), children at the middle of the age distribution (46-55 months), and older children (age > 55 months) are depicted in Figure 2.2-7. These results confirm that age at the time of assessment is related to growth in children's comprehension skills for children in the younger and middle age groups. The test, however, appears to perform less well for the 218 children above 55 months of age at pretest who were appraised later. Change in children's scores from the Poisson factor model and for scale scores (measured as mean responses), for the total sample and with age restrictions, is presented in Table 2.2-5. Although there is evidence of floor effects, the ELSA is sensitive to change in children's scores, whether a scale score or latent score is used. Furthermore, despite the fact that floor effects are most noticeable for the youngest children, growth remains detectable and is in fact larger than for the other age groups across methods, indicating that the floor effects are less important than they seem from the histograms in Figure 2.2-3. Overall, however, measured growth is similar across age groups.
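For projects that want to check how many children sit at the floor, and whether flooring concentrates among younger children, a small sketch assuming pandas and hypothetical data:

    import pandas as pd

    # Hypothetical pretest records: age in months and comprehension scale score.
    df = pd.DataFrame({
        "age_months": [40, 42, 44, 48, 52, 57, 60],
        "scale_score": [0.0, 0.0, 0.3, 0.4, 1.1, 1.6, 2.3],
    })

    # Proportion at the absolute floor, split at 45 months of age.
    df["at_floor"] = df["scale_score"] == 0.0
    print(df.groupby(df["age_months"] > 45)["at_floor"].mean())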

Figures

Figure 2.2-1: Graphical depiction of the first-order, single factor, comprehension CFA (Items 1-8 loading on a single comprehension factor)
Figure 2.2-2: Graphical depiction of second-order comprehension CFA (Items 1-8 loading through the comprehension subdimensions)

Figure 2.2-3: Pretest factor score histograms by scoring method (panels: scale scores, Poisson CFA, categorical (IRT) CFA, normal CFA)
Figure 2.2-4: Posttest factor score histograms by scoring method (panels: scale scores, Poisson CFA, categorical (IRT) CFA, normal CFA)

Figure 2.2-5: Pre- and posttest scores from the two-wave model (panels: pretest and posttest categorical (IRT) scores)
Figure 2.2-6: Nonparametric relationship between age and children's comprehension scores by scoring method (panels: pretest by age at pretest, posttest by age at pretest, posttest by age at posttest; standardized score vs. age in months for scale score, Poisson CFA, IRT, and normal CFA)
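The curves in Figure 2.2-6 are Lowess smooths. A minimal sketch of how such a curve can be computed, assuming statsmodels and hypothetical data:

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    # Hypothetical ages (months) and standardized comprehension scores.
    age = np.array([40, 44, 47, 50, 53, 56, 59, 62])
    score = np.array([-1.2, -0.9, -0.6, -0.1, 0.2, 0.4, 0.6, 0.9])

    # Returns an array of (sorted age, smoothed score) pairs for plotting.
    curve = lowess(score, age, frac=0.8)
    print(curve)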

Figure 2.2-7: Relationship between time between assessments, child age at pretest, and children's comprehension growth (panels: scale score and Poisson CFA; growth vs. time between assessments in months for children aged < 46 months, 46-55 months, and > 55 months at pretest)

Tables

Table 2.2-1: Pretest comprehension model fit
Columns: BIC; chi-square test of model fit (χ², df, p-value, scale); fit indices (CFI, TLI, RMSEA)
  Model 1: First-Order Poisson CFA
  Model 2: 2nd-Order Poisson CFA
  Model 1: First-Order Categorical (IRT) CFA
  Model 2: 2nd-Order Categorical (IRT) CFA
  Model 1: First-Order Normal CFA
  Model 2: 2nd-Order Normal CFA

Table 2.2-2: Posttest comprehension model fit
Columns and models as in Table 2.2-1.

Table 2.2-3: Model fit for (a) pre- and posttest and (b) across Violet and Dante assessment versions
Columns: BIC; chi-square test of model fit (χ², df, p-value, scale); fit indices (CFI, TLI, RMSEA)
  Two-Wave Poisson with Free Loadings
  Two-Wave Poisson with Fixed Loadings
  Two-Wave IRT with Fixed Loadings
  Two-Wave Normal with Fixed Loadings
  Multigroup Two-Wave Normal Model with Fixed Loadings Across Groups*
* Due to the computational complexity, these results are from a first-order two-wave model.

Table 2.2-4: Correlations between scoring methods with reliabilities on the diagonal
Scoring methods: (1) Pretest Scale Score, (2) Pretest Poisson Factor Score, (3) Pretest IRT Factor Score, (4) Pretest Normal Factor Score, (5) Posttest Scale Score, (6) Posttest Poisson Factor Score, (7) Posttest IRT Factor Score, (8) Posttest Normal Factor Score. Diagonal reliabilities include (1) 0.77, (2) 0.83, (3) 0.89, (6) 0.83, and (7) 0.91.
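The nested-model comparisons behind Tables 2.2-1 through 2.2-3 rest on chi-square difference logic. A naive sketch assuming SciPy and hypothetical fit values; note that chi-squares from robust or weighted-least-squares estimators require a scaled difference test rather than this plain version:

    from scipy.stats import chi2

    def chi2_difference(chi2_full, df_full, chi2_restricted, df_restricted):
        # The restricted (e.g., equal-loadings) model is nested in the fuller model.
        delta = chi2_restricted - chi2_full
        ddf = df_restricted - df_full
        return delta, ddf, chi2.sf(delta, ddf)

    # Hypothetical values: free- vs. fixed-loadings two-wave models.
    print(chi2_difference(85.2, 64, 97.8, 71))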

Table 2.2-5: Estimated comprehension growth for the total sample and by age group at pretest

Test                      Pre        Post       Difference   P-Value   N
Total Sample
  Scale Score Means       (0.020)    (0.027)    (0.025)
  Poisson Factor Score    (0.014)    (0.013)    (0.008)
Age ≤ 45 Months
  Scale Score Means       (0.047)    (0.063)    (0.068)
  Poisson Factor Score    (0.035)    (0.032)    (0.020)
Age > 45 Months
  Scale Score Means       (0.021)    (0.029)    (0.027)
  Poisson Factor Score    (0.015)    (0.014)    (0.008)
Age > 50 Months
  Scale Score Means       (0.025)    (0.035)    (0.033)
  Poisson Factor Score    (0.017)    (0.016)    (0.009)
Age > 55 Months
  Scale Score Means       (0.039)    (0.053)    (0.047)
  Poisson Factor Score    (0.026)    (0.023)    (0.013)

Standard errors in parentheses.
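As a simple companion to Table 2.2-5, mean pre-post change can be checked with a paired comparison. This sketch uses hypothetical scores and SciPy; it is not the report's model-based growth estimate, which comes from the factor models themselves:

    import numpy as np
    from scipy import stats

    # Hypothetical pre- and posttest scores for the same five children.
    pre = np.array([0.2, 0.5, 0.1, 0.9, 0.4])
    post = np.array([0.6, 0.8, 0.4, 1.3, 0.7])

    # Mean growth and a paired t-test of the pre-post difference.
    growth = (post - pre).mean()
    t_stat, p_value = stats.ttest_rel(post, pre)
    print(growth, t_stat, p_value)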

2.3 Phonological Awareness

Assessing the phonological awareness component of the ELSA is much simpler than it was for comprehension. Phonological awareness is measured with a series of dichotomous indicators, so there is no need to explore factor models making different assumptions about the underlying distributions of the measured items. Because the items are dichotomous, indicating whether or not the child answered correctly, the general analytical approach is consistent with IRT modeling, although a weighted least squares probit estimator is used. Two model formulations are employed: the first is a first-order factor structure similar to Figure 2.2-1 for reading comprehension, and the second is a second-order factor structure consistent with the comprehension presentation in Figure 2.2-2, with rhyming (Q. 7 & 16), segmentation (Q. 10), and phonemic awareness (Q. 12) subdimensions. Change is assessed by fixing the item thresholds, fixing the loadings of the indicators to 1 for two of the three first-order factors while freely estimating their latent means, and freeing the mean of the general second-order phonological awareness factor. The individual items are mapped onto the factors in the appendix.

Choosing a Factor Model

Model fit statistics for the cross-sectional IRT models are presented in Table 2.3-1. Although the first-order single factor pretest model approaches adequate model fit (CFI = .94, TLI = .931, RMSEA = .034), the model fit indices are better for the second-order model (CFI = .994, TLI = .997, RMSEA = .032). In addition, posttest fit for the second-order model is also adequate (CFI = .998, TLI = .999, RMSEA = .026), although the chi-square difference test is significant across specifications. Model fit is also adequate for the two-wave IRT model with fixed loadings over time (CFI = .982, TLI = .991, RMSEA = .045) and across samples (CFI = .986, TLI = .993, RMSEA = .035), indicating similar performance for the phonological awareness instrument across the pre- and posttest and across test versions. Factor loadings are presented in the appendix.

Factor Correlations & Reliabilities

Correlations and reliability estimates appear in Table 2.3-2. The IRT estimates correlate highly for both the pre- and posttest, while the rank ordering of children over time is relatively consistent. Importantly, whether a misspecified measure of reliability (Cronbach's α) or the IRT-estimated reliability is used, reliability estimates are approximately .8 or higher, indicating that the ELSA reliably scores children's phonological awareness skills on both the pre- and posttest.

Factor Score Distributions

Posterior factor score estimates and scale score distributions are shown in Figure 2.3-1. As with the comprehension instrument, there is evidence of floor effects in the cross-sectional models, with the floor more pronounced for the scale scores. Unlike the comprehension measure, there is some indication of a ceiling effect for the posttest, suggesting that some children at the second assessment are receiving maximal scores.