Pixels vs. Paper: Comparing Online and Traditional Survey Methods in Sport Psychology




JOURNAL OF SPORT & EXERCISE PSYCHOLOGY, 2006, 28, 100-108
© 2006 Human Kinetics, Inc.

Pixels vs. Paper: Comparing Online and Traditional Survey Methods in Sport Psychology

Chris Lonsdale, Ken Hodge, and Elaine A. Rose
University of Otago

All authors were with the School of Physical Education, University of Otago, Dunedin, New Zealand, at the time of this study. C. Lonsdale is now with the Dept. of Sports Sciences and Physical Education, Chinese University of Hong Kong, Shatin, NT, Hong Kong.

The purpose of this study was to compare participant responses to a questionnaire delivered via the Internet with data collected using a traditional paper-and-pencil format distributed via postal mail. Athletes (N = 214, mean age 26.53 years) representing 18 sports from the New Zealand Academy of Sport were randomly assigned to two groups and completed the Athlete Burnout Questionnaire (ABQ; Raedeke & Smith, 2001). There was a noticeable trend (p = .07, two-tailed) toward a better response rate in the online group (57.07%) compared with the postal group (46.63%). Furthermore, online questionnaires were returned faster and contained fewer missing responses. A series of nested, multigroup confirmatory factor analyses indicated that there were no significant group differences in the factor structure or latent mean structures of the ABQ.

Key Words: World Wide Web, Internet, questionnaire, format

Internet-based surveys have become a popular method of data collection in many areas of psychology (see Birnbaum, 2004, for a review). However, if results from these online questionnaire studies are to be compared with findings from paper-and-pencil (PP) questionnaire studies, an evaluation of potential survey format effects is crucial. In the majority of studies conducted thus far, online questionnaires have been found to produce similar mean scores (e.g., Buchanan & Smith, 1999), reliability coefficients (e.g., Buchanan & Smith, 1999; Miller, Neal, Roberts, et al., 2002), and factor structures (e.g., Buchanan & Smith, 1999; Fouladi, McCarthy, & Moller, 2002) compared with data collected using PP versions of the same questionnaire. In contrast, a few studies have reported small, yet statistically significant, differences in mean scores (Fouladi et al., 2002; Miller et al., 2002) and reliability coefficients (Im, Chee, Bender, et al., 2005). However, most authors have regarded these slight differences as nonsystematic and believe the substantial similarities in these data sets outweigh the subtle differences (Fouladi et al., 2002; Im et al., 2005; Miller et al., 2002).

In general, there is mounting evidence that participants respond similarly to online and PP questionnaires. However, the trustworthiness of much of this evidence must be questioned, as many studies have suffered from threats to internal validity. These threats have included assigning response format in a nonrandom manner (e.g., Buchanan & Smith, 1999; Fouladi et al., 2002; Im et al., 2005), providing supervision in one condition but not the other (e.g., Buchanan & Smith, 1999; Fouladi et al., 2002), and comparing the results of an online study with PP studies conducted by other researchers on different samples (e.g., Meyerson & Tryon, 2003). Consequently, there is a need for studies that employ internally valid research designs, such as random assignment to online and PP groups (e.g., Miller et al., 2002) or counterbalanced repeated measurement (e.g., Ferrando & Lorenzo-Seva, 2005).

A second problem with the existing research is that comparisons between the two methods have tended to use "ad-hoc, approximate, or purely descriptive procedures for testing equivalence" (Ferrando & Lorenzo-Seva, 2005, p. 194). More rigorous procedures, such as multigroup confirmatory factor analysis (MG-CFA), which can test the null hypothesis of measurement invariance (e.g., Ferrando & Lorenzo-Seva, 2005; Fouladi et al., 2002; Meyerson & Tryon, 2003), have been employed infrequently. Therefore, while most of the existing research has suggested that survey format (i.e., online vs. PP) does not lead to significant differences in responses, the weakness of most research designs and/or analysis strategies makes it difficult to draw firm conclusions.

Within the subdiscipline of sport psychology, there is a noticeable shortage of research that has compared online and PP survey methods. Searches of the SPORTDiscus and PsycINFO databases (using the search terms "web," "online," or "internet" combined with "athlet*" or "sport*") revealed no relevant published studies. However, a request posted on the SPORTPSY listserv led to the discovery of a PhD dissertation and an MA thesis that examined this issue. Deaner (2002) used her Sport Disengagement Questionnaire to collect data from college athletes via online (n = 147) and PP (n = 253) questionnaires. She reported that online participants viewed the questionnaire as more convenient and that a subjective comparison of factor structures (using exploratory factor analyses) suggested that the two versions of the questionnaire elicited similar response patterns. Mean scores on the measure were not compared. In contrast, Destani (2004) collected data from intramural college sport participants and reported significant differences in mean scores for motivation and affective consequences between the online (n = 65) and PP (n = 169) groups. However, neither study randomly assigned participants to online and PP groups, thus the results may have been confounded by a group membership effect.

The current study, by randomly assigning participants to either (a) a PP (delivered via postal mail) or (b) an online survey format condition, sought to minimize the confounding factors that have limited many previous studies. In addition, more rigorous analyses of measurement equivalence and mean structures were performed so that potential differences in responses between groups could be examined and a more comprehensive evaluation of online and PP survey format effects in sport psychology research could be conducted.

Method

Participants

The participants (N = 214) were athletes affiliated with the New Zealand Academy of Sport (NZAS). The NZAS provides sport science services to elite athletes from across New Zealand. In order to be selected into the NZAS, athletes must either have represented New Zealand at the senior national level or have been identified by their sport's governing body as having the potential to do so in the future. Almost three-quarters (71.96%) of the participants had represented New Zealand at the senior national level. The remaining participants had competed at senior provincial, junior national, or junior provincial levels. The mean age of the participants was 26.53 years (range 18-58 years) and 51.40% were female. Eighteen sports were represented, including athletics (n = 21), basketball (n = 11), lawn bowls (n = 11), cricket (n = 22), equestrian (n = 9), golf (n = 8), paralympics (n = 22), rowing (n = 27), rugby (n = 4), rugby league (n = 3), shooting (n = 23), softball (n = 7), squash (n = 7), swimming (n = 18), and triathlon (n = 21). The majority of participants identified themselves as New Zealand European (85.51%). Other ethnic groups represented were New Zealand Maori (8.88%), Samoan (2.80%), other European (2.34%), Tongan (0.93%), Cook Islander (0.47%), Chinese (0.47%), and other ethnicity (3.27%). Participants could indicate belonging to more than one ethnic group.

Questionnaire

The Athlete Burnout Questionnaire (ABQ; Raedeke & Smith, 2001) is a 15-item measure with three subscales designed to tap three core burnout symptoms: physical/emotional exhaustion (e.g., "I feel overly tired from my sport participation"), sport devaluation (e.g., "The effort I spend in sport would be better spent doing other things"), and reduced sense of accomplishment (e.g., "I am not performing up to my ability in sport"). Using 5-point Likert scales (1 = almost never, 5 = almost always), participants were asked to indicate the frequency of each feeling or thought. Raedeke and Smith (2001) provided strong initial evidence for the validity of the questionnaire, and subsequent studies (e.g., Cresswell & Eklund, 2006) have supported the internal consistency of the subscales as well as the convergent and divergent validity of the measure.

Procedure

The NZAS supplied email and postal addresses for 472 athletes. These potential participants were grouped according to sport and gender and were then randomly assigned from within each Gender × Sport cell to one of two groups: post (n = 236) or online (n = 236). Procedures for survey distribution were based on the Tailored Design Method outlined by Dillman (2000). Athletes in both groups were sent a prenotification email informing them of the purpose of the study and of the upcoming invitation to complete the questionnaire. Thirty-one athletes in the online group and 28 athletes in the post group were found to have nonworking email addresses and were excluded from further analyses. Four days after the prenotification email, the uniform resource locator (URL) for the online version of the survey was sent via email to the 205 online participants. At the same time, PP surveys (with prepaid return envelopes) were sent via post to the 208 post participants. Ten days later a reminder email was sent to any athletes (post and online groups) who had not returned their survey. After a further 10 days a final reminder was sent via email to any participants who had still not returned their survey. Participants did not receive any compensation.
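Stratified random assignment of this kind is simple to reproduce. The following Python sketch illustrates one way to split athletes into the two format conditions within each Gender × Sport cell; the athlete records, field names, and seed are hypothetical illustrations, not the authors' actual data or procedure.

    import random
    from collections import defaultdict

    def assign_conditions(athletes, seed=42):
        """Randomly split athletes into 'post' and 'online' groups
        within each Gender x Sport cell (stratified assignment)."""
        rng = random.Random(seed)
        cells = defaultdict(list)
        for athlete in athletes:
            cells[(athlete["gender"], athlete["sport"])].append(athlete)

        assignments = {}
        for members in cells.values():
            rng.shuffle(members)           # randomize order within the cell
            half = len(members) // 2
            for athlete in members[:half]:
                assignments[athlete["id"]] = "post"
            for athlete in members[half:]:
                assignments[athlete["id"]] = "online"
        return assignments

    # Hypothetical example: four athletes in two Gender x Sport cells
    athletes = [
        {"id": 1, "gender": "F", "sport": "rowing"},
        {"id": 2, "gender": "F", "sport": "rowing"},
        {"id": 3, "gender": "M", "sport": "cricket"},
        {"id": 4, "gender": "M", "sport": "cricket"},
    ]
    print(assign_conditions(athletes))

Stratifying before randomizing guarantees that each sport and gender is represented roughly equally in both conditions, which is why the post and online groups could later be compared on demographics without large imbalances.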

Analysis

The gender and highest level of competition of the post and online groups were compared using Mann-Whitney U tests. Mean age was compared across the groups using an independent-samples t-test. Response rates (the percentage of participants who returned a questionnaire) for the two groups were compared via a Mann-Whitney U test. An independent-samples t-test was used to contrast the mean response times (days = survey return date - survey distribution date) of the two groups. Given the lack of research concerning online questionnaires within an elite sport population, a substantive a priori hypothesis regarding the direction of the potential differences in these tests could not be forwarded. Therefore, significance levels were interpreted using a two-tailed approach. Finally, the number of missing values per respondent on the 15-item questionnaire in the post group was analyzed using a one-sample t-test with a null hypothesis of M = 0. This hypothesis was appropriate because participants who responded via the online survey were prompted when they missed an item and were not able to continue to the next set of items until all previous items were completed. As a result, there were no missing values in the online group. Missing values in the post group were estimated using an expectation maximization algorithm.

Measurement Invariance and Latent Mean Structures Analyses

Single-group CFA models were first examined in each group, online and post, using LISREL 8.71 (Jöreskog & Sörbom, 2004). Measurement invariance of the questionnaire across the groups was then tested using a series of nested model comparisons between a baseline MG-CFA and progressively more constrained models. A lack of significant change in fit from one model to another would indicate invariance on the new constraint (Byrne, 1998). In the final model, latent mean structures were tested by constraining the item intercepts, fixing the latent factor means of the post group to zero, and allowing the latent factor means of the online group to be estimated. Potential differences in latent factor means between the two groups could then be identified by examining the t-values of the parameter estimates for the Kappa matrix in the online group.

Factor analysis procedures are not normally undertaken with small samples. Comrey and Lee's (1992) oft-cited guidelines suggested that samples of 50 are very poor, 100 are poor, 200 are fair, 300 are good, 500 are very good, and 1,000 are excellent. However, more recent research (Hogarty, Hines, Kromrey, Ferron, & Mumford, 2005; MacCallum, Widaman, Zhang, & Hong, 1999) suggests that when the number of factors is low (e.g., 3), the number of items per factor is high (e.g., 4), and factor loadings are high (e.g., .6), CFA using maximum likelihood estimation may be appropriate even for sample sizes of 100 or less (MacCallum et al., 1999). The ABQ fulfills the first two requirements, as it is composed of 15 items measuring three factors. Furthermore, initial research (Raedeke & Smith, 2001) on the instrument reported high factor loadings (.67-.88). As a result, it was deemed appropriate to use CFA procedures in this study (post group n = 97, online group n = 117).

Traditionally, CFI and TLI scores >.90 and RMSEA scores <.08 represent good model fit (e.g., Bentler & Bonett, 1980; Browne & Cudeck, 1992), while RMSEA scores between .08 and .10 suggest marginal fit (Browne & Cudeck, 1992).
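As a concrete illustration of the RMSEA criterion, the index can be computed directly from a model's χ² statistic, its degrees of freedom, and the sample size, via RMSEA = sqrt(max(χ² - df, 0) / (df(N - 1))). The sketch below is a minimal example, not the authors' code; as a check, it uses the online group's single-group CFA values reported later in Table 2.

    import math

    def rmsea(chi_sq, df, n):
        """Root mean square error of approximation:
        sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
        return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))

    # Online group single-group CFA (Table 2): chi2 = 172.35, df = 87, n = 117
    value = rmsea(172.35, 87, 117)
    print(round(value, 2))  # 0.09, i.e., marginal fit by the traditional criteria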

More recently, Hu and Bentler (1999) have proposed alternative cutoff criteria (CFI and TLI >.95, RMSEA <.06); however, Marsh, Hau, and Wen (2004) have warned against the blanket use of these higher cutoff criteria. More research is needed on this topic before firm conclusions can be drawn regarding cutoff criteria, and therefore the traditional criteria were adopted as indicators of good fit, with Hu and Bentler's criteria as evidence of excellent fit.

In terms of assessing change in model fit, Δχ² is the most common measure. However, given that χ² is greatly affected by sample size, this criterion is not ideal (Cheung & Rensvold, 2002). The use of other goodness-of-fit indexes would be preferable, but little research has examined the performance of these measures in MG-CFA designs. Cheung and Rensvold (2002) conducted an initial investigation and recommended that when ΔCFI is larger than .01, the null hypothesis of invariance should be rejected. Both Δχ² and ΔCFI values were used to assess changes in model fit.

Results

Online and post groups were not significantly different in terms of age, t(205) = .347, p = .73 (two-tailed); gender, U = 5445.50, p = .56; or level of competition, U = 5346.00, p = .78. There was no significant difference in the response rates between the online (57.07%) and post (46.63%) groups, U = 2354.5, p = .065 (two-tailed). The time lag between survey distribution and return (response time) for the online group (M = 7.26, SD = 6.90 days) was significantly less, t(212) = 6.70, p < .01 (two-tailed), than for the post group (M = 13.73, SD = 7.19 days). The mean number of missing values per post participant (M = .26, SD = 1.36) was significantly different from zero, t(96) = 1.88, p = .03 (one-tailed).

Subscale means and standard deviations for each group are displayed in Table 1. Alpha coefficients are also listed and indicated that scores on all three subscales were internally consistent (α > .70; Nunnally, 1978). Furthermore, examination of the 95% confidence intervals indicated no significant differences between the groups in terms of subscale alpha coefficients (see Table 1).

Table 1  ABQ Subscale Descriptive Statistics

                                                        α 95% CI boundaries
Subscale                          Mean   SD     α       Lower     Upper

Online group (n = 117)
  Exhaustion                      3.00   .69    .77      .69       .83
  Devaluation                     2.21   .65    .87      .83       .90
  Reduced sense of accomplishment 2.29   .73    .79      .73       .85

Post group (n = 97)
  Exhaustion                      3.20   .67    .78      .70       .84
  Devaluation                     2.05   .69    .90      .86       .93
  Reduced sense of accomplishment 2.23   .74    .82      .76       .87

Note: The ABQ utilizes 5-point Likert scales (1 = almost never, 5 = almost always).
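The alpha coefficients in Table 1 follow the standard Cronbach's alpha formula, α = k/(k - 1) × (1 - Σσ²_item / σ²_total), where k is the number of items. A minimal sketch of this computation is given below; the six-respondent, five-item response matrix is made-up illustrative data, not the study's data (each ABQ subscale has five items).

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_respondents, k_items) score matrix:
        alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Made-up responses from 6 respondents on a 5-item subscale (1-5 Likert)
    scores = [
        [3, 4, 3, 4, 3],
        [2, 2, 3, 2, 2],
        [4, 5, 4, 4, 5],
        [1, 2, 1, 2, 1],
        [3, 3, 4, 3, 3],
        [5, 4, 5, 5, 4],
    ]
    print(round(cronbach_alpha(scores), 2))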

Measurement Invariance and Latent Mean Structures Comparisons

Preliminary analyses suggested that the majority of the ABQ item scores were univariately normally distributed (skewness < 2, kurtosis < 7); however, there was evidence of multivariate non-normality (e.g., Mardia's normalized skewness coefficient = 7.73, p < .001; normalized kurtosis coefficient = 5.55, p < .001) in the data. Alternative methods of estimation (e.g., Satorra-Bentler correction, weighted least squares) are sometimes used when data are non-normally distributed. However, West, Finch, and Curran (1995) concluded that when sample sizes are small, maximum likelihood (ML) estimation is most appropriate for data distributions that are not severely non-normal. Consequently, ML estimation was employed in all CFA models.

Before beginning multigroup analyses, we evaluated CFA models separately in the post and online groups. Fit statistics for the online and post groups suggested mixed support for the hypothesized model (see Table 2 for all single-group and multigroup CFA model fit statistics). For example, based on the criteria outlined earlier, the fit of the online group data should be considered either good (CFI = .91) or marginal (TLI = .89, RMSEA = .09). The fit of the post group data was also mixed (CFI = .92, TLI = .90, RMSEA = .11). However, model fit was not so poor as to warrant empirically driven modifications (Byrne, 1998), and therefore multigroup analyses were conducted using the original hypothesized model.

In the baseline model (Model A), the validity of the three-factor structure of the ABQ across the two groups was assessed. Marginally good fit was found in this model (CFI = .92, TLI = .90, RMSEA = .10). In Model B, factor loadings were constrained to be equal. There was no significant difference (p > .05) in Δχ² or ΔCFI between the two models, suggesting that factor loadings were equivalent across the two groups.

Table 2  CFA Model Fit Statistics

Model (new constraint)   df     χ²      Δdf   Δχ²     CFI   ΔCFI   TLI   RMSEA   RMSEA 90% CI

Online group              87   172.35    -      -     .91     -    .89    .09      .07-.11
Post group                87   213.11    -      -     .92     -    .90    .11      .09-.13
A (baseline)             174   348.72    -      -     .92     -    .90    .10      .08-.11
B (loadings)             186   360.16   12    11.44   .92    .00   .91    .10      .08-.11
C (uniquenesses)         201   369.68   15     9.52   .92    .00   .91    .09      .08-.10
D (variances)            204   370.08    3     1.00   .92    .00   .92    .09      .07-.10
E (covariances)          207   372.34    3     2.26   .92    .00   .92    .09      .07-.10
F (intercepts)           222   393.34   15    21.00   .91    .01   .92    .09      .07-.10

Note: Dashes indicate the statistics were not applicable. Models A through F represent progressively more constrained multigroup CFA models. Each new constraint is listed in parentheses beside the model in which it was first added. No Δχ² values were significant at p < .05, and no ΔCFI exceeded .01.
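The Δχ² tests summarized in Table 2 can be reproduced from the table values alone: the difference in χ² between nested models is itself χ²-distributed, with degrees of freedom equal to the difference in model degrees of freedom. A minimal sketch, not the authors' code, applied to Model A versus Model B from Table 2:

    from scipy.stats import chi2

    def delta_chi_sq_test(chi_a, df_a, chi_b, df_b):
        """Chi-square difference test for nested models: model a is the less
        constrained model, model b the more constrained one.
        Returns (delta_chi2, delta_df, p)."""
        d_chi, d_df = chi_b - chi_a, df_b - df_a
        return d_chi, d_df, chi2.sf(d_chi, d_df)

    # Model A (baseline) vs. Model B (equal loadings), values from Table 2
    d_chi, d_df, p = delta_chi_sq_test(348.72, 174, 360.16, 186)
    print(f"delta chi2 = {d_chi:.2f}, delta df = {d_df}, p = {p:.2f}")
    # p is about .49, so the equal-loadings constraint does not significantly
    # worsen fit; combined with a delta-CFI of .00 (below the .01 cutoff),
    # loading invariance is retained under Cheung and Rensvold's criterion.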

In Model C, factor loadings and error terms (uniquenesses) were constrained to be equal, but the addition of this constraint did not significantly increase the χ² or decrease the CFI. In Model D, factor variance constraints were added, but the Δχ² and ΔCFI values were not significant. In Model E, factor covariances were also constrained, but once again there was not a significant change in the χ² statistic or CFI. There was also no significant change in fit for the final model (Model F), indicating that intercepts were also invariant across the two groups. Furthermore, examination of the Kappa matrix parameter estimates' z-scores revealed that the latent means for devaluation (z = 1.25), exhaustion (z = 1.90), and reduced sense of accomplishment (z = .70) in the post group were not significantly different from the online group.

Discussion

The purpose of this study was to evaluate the influence of survey format (online vs. PP) on responses to a sport psychology questionnaire. In contrast to many previous studies (e.g., Buchanan & Smith, 1999; Deaner, 2002; Destani, 2004), the current study employed random assignment to increase internal validity. Results indicated that online surveys were returned to the researchers faster and, due to the format of the online system, had less missing data than PP versions of the questionnaire. There was a noticeable trend (p = .07, two-tailed) toward a higher response rate in the online group, but further research will be needed before conclusions can be drawn regarding online survey response rates of elite athletes. Researchers in this area will need to be careful not to overgeneralize the findings of individual studies. Indeed, research outside sport psychology has suggested that response rates may vary greatly between different populations, with studies showing higher (e.g., Wygant & Lindorf, 1999), lower (e.g., Birnbaum, 2004), or similar (e.g., Kaplowitz, Hadlock, & Levine, 2004) response rates for online surveys compared with traditional methods.

Sport psychology researchers seeking to draw conclusions across a range of studies using both online and PP self-report questionnaires will be encouraged to learn that the factorial structure of this questionnaire was invariant across groups and that there were no significant group differences in latent means or reliability coefficients for any of the three ABQ subscales. These findings conform to the general pattern of results that has emerged in other areas of psychological research (e.g., Ferrando & Lorenzo-Seva, 2005; Fouladi et al., 2002; Miller et al., 2002) and provide preliminary evidence that online methodologies can be used to collect data that are similar to those obtained via PP methods in sport psychology studies. However, more research, using larger samples where possible, is needed before researchers can assume that online sport psychology surveys are comparable to PP questionnaires for a wider range of topics and populations. Clearly, online questionnaires will not be useful for populations that have limited access to the Internet. Furthermore, research will be needed to determine whether the current results generalize to populations in which response rates are likely to be lower (e.g., more heterogeneous populations in which athletes do not have a strong affiliation with a single association) or higher (e.g., athletes who complete a questionnaire while supervised by a coach or researcher).
However, given the speed of responses, the lack of missing data, the absence of coding errors, and the relatively low cost of administration, online surveys appear to have several potential advantages over postal surveys for certain populations.

References

Bentler, P.M., & Bonett, D.G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.

Birnbaum, M.H. (2004). Human research and data collection via the Internet. Annual Review of Psychology, 55, 803-832.

Browne, M.W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21, 230-258.

Buchanan, T., & Smith, J.L. (1999). Using the Internet for psychological research: Personality testing on the World Wide Web. British Journal of Psychology, 90, 125-144.

Byrne, B.M. (1998). Structural equation modeling with LISREL, PRELIS, and SIMPLIS. Mahwah, NJ: Erlbaum.

Cheung, G.W., & Rensvold, R.B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233-255.

Comrey, A.L., & Lee, H.B. (1992). A first course in factor analysis. Hillsdale, NJ: Erlbaum.

Cresswell, S.L., & Eklund, R.C. (2006). The convergent and divergent validity of burnout measures in sport: A multitrait-multimethod analysis. Journal of Sports Sciences, 24, 209-220.

Deaner, H.R. (2002). Psychometric evaluation of the Sport Disengagement Questionnaire. Unpublished doctoral dissertation, West Virginia University, Morgantown, WV.

Destani, F. (2004). Predicting contextual affective consequences in sport: Utilizing self-determination theory and achievement goal theory. Unpublished master's thesis, San Francisco State University, San Francisco.

Dillman, D.A. (2000). Mail and Internet surveys: The tailored design method. New York: Wiley.

Ferrando, P.J., & Lorenzo-Seva, U. (2005). IRT-related factor analytic procedures for testing the equivalence of paper-and-pencil and Internet-administered questionnaires. Psychological Methods, 10, 193-205.

Fouladi, R.T., McCarthy, C.J., & Moller, N.P. (2002). Paper-and-pencil or online? Evaluating mode effects on measures of emotional functioning and attachment. Assessment, 9, 204-215.

Hogarty, K.Y., Hines, C.V., Kromrey, J.D., Ferron, J.M., & Mumford, K.R. (2005). The quality of factor solutions in exploratory factor analysis: The influence of sample size, communality, and overdetermination. Educational and Psychological Measurement, 65, 202-226.

Hu, L.-t., & Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.

Im, E.-O., Chee, W., Bender, M., Cheng, C.-Y., Tsai, H.-M., Kang, N.M., & Lee, H. (2005). The psychometric properties of pen-and-pencil and Internet versions of the Midlife Women's Symptom Index (MSI). International Journal of Nursing Studies, 42, 167-177.

Jöreskog, K.G., & Sörbom, D. (2004). LISREL 8.71 [Computer software]. Lincolnwood, IL: Scientific Software International.

Kaplowitz, M.D., Hadlock, T.D., & Levine, R. (2004). A comparison of web and mail survey response rates. Public Opinion Quarterly, 68, 94-101.

MacCallum, R.C., Widaman, K.F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84-99.

Marsh, H.W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.

Meyerson, P., & Tryon, W.W. (2003). Validating Internet research: A test of the psychometric equivalence of Internet and in-person samples. Behavior Research Methods, Instruments, & Computers, 35, 614-620.

Miller, E.T., Neal, D.J., Roberts, L.J., Baer, J.S., Cressler, S.O., Metrik, J., & Marlatt, G.A. (2002). Test-retest reliability of alcohol measures: Is there a difference between Internet-based assessment and traditional methods? Psychology of Addictive Behaviors, 16, 56-63.

Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Raedeke, T.D., & Smith, A.L. (2001). Development and preliminary validation of an athlete burnout measure. Journal of Sport & Exercise Psychology, 23, 281-306.

West, S.G., Finch, J.F., & Curran, P.J. (1995). Structural equation models with nonnormal variables: Problems and remedies. In R.H. Hoyle (Ed.), Structural equation modeling: Concepts and applications (pp. 56-75). Thousand Oaks, CA: Sage.

Wygant, S., & Lindorf, R. (1999). Surveying collegiate Net surfers: Web methodology or mythology? Quirk's Marketing Research Review. Retrieved October 2, 2005, from http://www.quirks.com/articles/article_print.asp?arg_articleid=515

Acknowledgments

This study was funded in part by the New Zealand Academy of Sport and was completed while the first author was in receipt of a Postgraduate Scholarship from the University of Otago. The authors wish to acknowledge the technical assistance of Hamish Gould, the helpful advice of David Rowe of East Carolina University, and the thoughtful comments of two anonymous reviewers.

Manuscript submitted: June 27, 2005
Revision accepted: November 13, 2005