Some fallacies and remedies in secondary data analysis for survey data


1 Some fallacies and remedies in secondary data analysis for survey data. Giancarlo Manzi, Department of Economics, Management and Quantitative Methods, Università degli Studi di Milano, Italy. Sonia Stefanizzi, Department of Sociology and Social Research, University of Milan-Bicocca, Italy. Pier Alda Ferrari, Department of Economics, Management and Quantitative Methods, Università degli Studi di Milano, Italy. Conference of European Statistics Stakeholders, November 24-25, 2014, Rome, Sapienza.

2 Fisher's famous quote revisited. "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of" (Fisher, 1938). We revisit this famous quote as follows: "To call in the statistician after the experiment is done may sometimes be convenient: he or she can revive it!" Our particular focus is on Secondary Data Analysis (SDA) fallacies in surveys that emerge during the statistical analysis. Some suggestions for future surveys and remedies for current European surveys are also presented.

3 Overview of the talk. Introduction and motivation. Quality of data and data analysis: coherence and comparability. Issues in conducting SDA: data validity, reflexivity and reliability. SDA statistical remedies: combining survey results so that they borrow strength from each other; use of suitable analysis tools; building improved surveys from existing surveys. An example on European data showing how these fallacies arise when performing statistical analysis, and some points to think about. Conclusion and future steps.

4 Introduction and motivation (1) In a collaborative project between statisticians and sociologists, a series of problems arose when performing SDA on European data. Hence the need to study this topic further. We started with general definitions of data information quality. For example, Kenett & Shmueli (2014) define eight dimensions of info quality: data resolution; data structure; data integration; temporal relevance; generalizability; chronology of data and goal; construct operationalization; and communication.

5 Introduction and motivation (2) Eurostat has established seven dimensions of quality since 2000 (source: Eurostat, 2000). Each quality dimension is listed with its remark:
1. Relevance of statistical concept. A statistical product is relevant if it meets user needs. Thus user needs have to be established at the outset.
2. Accuracy of estimates. Accuracy is the difference between the estimate and the true parameter value. Assessing the accuracy is not always possible, due to financial and methodological constraints.
3. Timeliness and punctuality in disseminating results. In our experience this is perhaps one of the most important user needs. Perhaps this is so because this dimension is so obviously linked to an efficient use of the results.
4. Accessibility and clarity of the information. Results are of high value when they are easily accessible and available in forms suitable to users. The data provider should also assist the users in interpreting the results.
5. Comparability. Reliable comparisons across space and time are often crucial. Recently, new demands on cross-national comparisons have become common. This in turn puts new demands on developing methods for adjusting for cultural differences.
6. Coherence. When originating from a single source, statistics are coherent, in that elementary concepts can be combined in more complex ways. When originating from different sources, and in particular from statistical studies of different periodicities, statistics are coherent insofar as they are based on common definitions, classifications, and methodological standards.
7. Completeness. Domains for which statistics are available should reflect the needs and priorities expressed by users as a collective.

6 Introduction and motivation (3) In this talk we focus on points 5 and 6 above. Data quality can be addressed in connection with coherence and comparability as follows: i. summarizing the most common fallacies linked to survey implementation and result analysis; ii. recalling some statistical tools able to reduce or control bias when analyzing, comparing or combining survey data. Special reference is also made to problems arising in SDA, with an example from the social and economic field (the Eurobarometer survey, EB) in which issues are detected and some possible remedies are sketched.

7 SDA: target switching from original data. SDA: the set of research activities through which data from different surveys, collected under certain assumptions and conceptual frameworks, are used individually for purposes not necessarily coinciding with those that guided the data collection. Examples: Boudon, 1973: through SDA the social researcher can widen the validity of atomic results to the point that he/she is able to modify the original conceptual framework, formulating new interpretative hypotheses which can differ from those of the primary analysis. Ferrari & Salini, 2011: data on European user satisfaction with utilities can be used to reveal the multifaceted importance and quality of different aspects of public services.

8 Issues in SDA: data validity. DV: the correspondence between the characteristics to be detected and the indicators chosen to measure them. Objective (i.e. observable) aspects must lead to latent states of societies and individuals. [Diagram labels: construction of variables; appropriate conceptualization; appropriate measurement; meaning of the real relationship between variables revealed.]

9 Issues in SDA: data reflexivity. Example: performing immigration surveys. Immigration surveys also express the reflexive character of policies. Immigration surveys may turn out to be limited and constrained. Examples of constraints: immigration policies defined only in terms of migrant categories and quotas; official statistics almost exclusively focused on foreigners' legal matters, such as their nationality, residence, duration and purpose of stay, etc. This sometimes leaves aside other important components of migration, such as the social context of the migrants' origin (urban/rural), their social background, the way their migratory experience is articulated, etc.

10 Issues in SDA: data reliability DR: the degree to which data collection procedures are applied in a consistent and coherent way with respect to previously established criteria. The reliability issue occurs both at the level of data production and at the level of data collection, classification and dissemination.

11 Some examples of remedies for SDA fallacies on EU data (1) Methods for blending results from different surveys to attenuate data flaws. 1. Small area estimation, where results from different surveys are blended to attenuate data flaws. Lohr & Brick (2012) explore methods for small domain estimation from two surveys when one survey is believed to be biased with respect to the other. The novelty of their work is that they use methods to adjust estimates before a new companion survey is implemented, i.e. at the stage of constructing a newly planned survey. 2. Meta-analytic approaches. Manzi et al. (2011) use a meta-analytic approach with a hierarchical Bayesian model for small domain statistics. Official survey estimates are integrated with estimates from smaller surveys covering smaller areas. Estimates are averaged with weights proportional to the strength of each survey, so that the bigger surveys dominate the others, but information also comes from smaller but more up-to-date surveys.
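The weighted-averaging idea in point 2 can be illustrated with a deliberately simplified sketch. It uses fixed-effect inverse-variance pooling, which is only a stand-in and not the hierarchical Bayesian model cited above; the function name `pool_estimates` and all numbers are hypothetical.

```python
def pool_estimates(estimates, variances):
    """Inverse-variance weighted average of survey estimates for one small area.

    Assumption: the "strength" of a survey is taken to be 1 / variance of its
    estimate. This is a simple stand-in, not the hierarchical Bayesian model.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate")
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]          # weights sum to 1
    pooled = sum(w * e for w, e in zip(weights, estimates))
    pooled_variance = 1.0 / total                      # variance of the pooled value
    return pooled, pooled_variance, weights


# Hypothetical prevalence estimates: a large official survey (small variance)
# and a smaller, more recent local survey (larger variance).
pooled, pooled_var, weights = pool_estimates([0.20, 0.30], [0.0001, 0.0009])
```

With these made-up inputs the large survey receives most of the weight, while the small survey still moves the pooled value slightly towards its own estimate.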

12 Some examples of remedies for SDA fallacies on EU data (2) Methods for the detection of latent variables which explain hidden structures in the data. 1. Ferrari et al. (2010) use Nonlinear Principal Component Analysis (NPCA) to detect latent constructs and then average them over countries. 2. In Ferrari & Salini (2011) NPCA is proposed together with the Rasch Model (RM) for the assessment of latent concepts such as satisfaction with public services. With this joint use of NPCA and RM, the level of satisfaction is determined individually via NPCA, and in addition the importance of the single satisfaction components (given by the component loadings in NPCA) and the quality of the components (given by the item parameters of RM) are determined.

13 A motivational example: SDA fallacies arising in the EB survey. Analysis of European citizens' attachment/expectations/information with regard to the EU. Data: EB survey. Techniques used: NPCA and multilevel (ML) analysis. Evident problems: An excessive number of "Don't Know" (DK) answers: were the questions erroneously formulated? Sometimes a DK answer makes sense, sometimes not. Should a way to avoid or reduce their presence in the data set be established? Imputation becomes a major issue. Sometimes there is no coherence in the scales of variables of the same type (same questionnaire sections), so recoding is needed.
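A minimal sketch of how an excess of DK answers might be screened before any analysis. The code `"DK"`, the 25% threshold and the function names are assumptions made for illustration; they are not taken from the EB codebook.

```python
DK = "DK"  # hypothetical code for a "Don't Know" answer


def dk_share(responses):
    """Fraction of DK answers per question (each question needs >= 1 answer)."""
    return {
        question: sum(1 for a in answers if a == DK) / len(answers)
        for question, answers in responses.items()
    }


def flag_questions(responses, threshold=0.25):
    """Questions whose DK share exceeds the (arbitrary) threshold."""
    shares = dk_share(responses)
    return sorted(q for q, share in shares.items() if share > threshold)


# Made-up answers to two questions from four respondents.
responses = {
    "q1": [1, 2, DK, 3],
    "q2": [DK, DK, 1, DK],
}
shares = dk_share(responses)
flagged = flag_questions(responses)
```

Flagged questions would then be candidates for a closer look at their wording, or for a deliberate decision on whether DK is a meaningful category or should be imputed.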

14 A motivational example: Post-analysis incoherence detection (1) Question understanding: What does «understand» really mean in this question? Respondents may be puzzled. Consequence: when exploring for latent variables (using NPCA) this question is ambiguously classified.

15 A motivational example: Post-analysis incoherence detection (2) Ambiguous questions/wording: Is this a question about trust in the EU? Or rather about how well-informed citizens are about it (most probably, but not certainly)? Consequence: again, problems when clustering variables.

16 A motivational example: Post-analysis incoherence detection (3) Could the choice of question formulation (verb tense, for example) be decisive in assigning a question to one category rather than another? Sometimes questions have intrinsic double meanings: Are we sure that this question is really suitable for checking how well-informed citizens are about the EU? Is it also trying to investigate their attachment?

17 A motivational example: SDA in action (1) We wanted to evaluate EU citizens' feelings about the EU. An initial set of 44 candidate variables was identified in a series of meetings among the authors. The categories of some variables were recoded in reverse order for homogeneity with the other variables. After performing an NPCA with three and four components on these variables, some variables were excluded because they were not clearly in line with any of the extracted components. Some variables initially assigned to one dimension were moved to another dimension. Final number of variables left: 37.
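The reverse recoding step can be sketched as follows, assuming ordinal answers coded 1..k and missing answers stored as `None`; the function name and the sample answers are hypothetical.

```python
def reverse_code(value, n_categories):
    """Reverse an ordinal code in 1..n_categories; missing (None) stays missing."""
    if value is None:
        return None
    if not 1 <= value <= n_categories:
        raise ValueError(f"code {value} outside 1..{n_categories}")
    return n_categories + 1 - value


# A four-category item whose scale runs in the opposite direction
# to the other items of the same questionnaire section.
answers = [1, 4, None, 2]
recoded = [reverse_code(a, 4) for a in answers]
```

Applying the function twice returns the original code, which gives a quick check that no category was lost in the recoding.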

18 A motivational example: SDA in action (2) 22 variables for the EU attachment/expectation/confidence dimension. 9 variables for the EU evaluation dimension. 6 variables for the level of information about the EU dimension. After some correlation and regression analysis on the sociodemographic variables in the EB data set, 4 individual variables were left. After performing NPCA, individual NPCA scores were obtained separately for each of the three dimensions. Country averages of these scores were obtained, intended to show average country EU attachment/evaluation/information.
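A small sketch of the last step, averaging individual scores by country and ranking the countries. It assumes unweighted averages (no survey weights), and the country codes and scores are invented for illustration.

```python
from collections import defaultdict


def country_means(records):
    """Unweighted mean score per country from (country, score) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for country, score in records:
        sums[country] += score
        counts[country] += 1
    return {country: sums[country] / counts[country] for country in sums}


# Made-up individual scores on one dimension (e.g. attachment).
records = [("IT", 0.5), ("IT", 0.1), ("DE", 0.3), ("DE", 0.1), ("FR", -0.2)]
means = country_means(records)
ranking = sorted(means, key=means.get, reverse=True)  # highest mean first
```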

19 A motivational example: SDA in action (3) We also wanted to know whether the country ranking on citizens' attachment/evaluation/information was related to some contextual country variables, and therefore performed an ML analysis, inserting contextual independent variables to detect determinants of attachment/evaluation/information. Contextual variables were essential economic and social measurements (GDP per capita, public debt, index of deprivation, inactivity rate, etc.). After performing the ML analysis, the ranking was altered with respect to a one-level analysis, and citizens of the so-called PIIGS were not among the least satisfied with the EU.

20 NPCA logic. NPCA: belongs to the nonlinear multivariate analysis family; is the nonlinear counterpart of principal component analysis; provides dimensionality reduction by means of a nonlinear transformation of variables, i.e. by assigning quantitative values to qualitative scales; has a solution which is derived by minimizing a least-squares-type loss function, expressed in terms of optimally quantified variables and scores on objects.

21 NPCA: how it works in general (1) The goal of NPCA is the construction of a p-dimensional Euclidean space in which objects (individuals) are represented. Suppose J categorical variables are observed on N objects (survey respondents). Let $X$ be an $N \times p$ matrix of object scores (to be determined). Let $Y_j$ be the $l_j \times p$ matrix of "quantifications" of the j-th variable ($Y_j$ has to be determined, $j = 1, \dots, J$; $l_j$ is the number of categories for the j-th variable). Let $G_j$ be an $N \times l_j$ indicator matrix with entries $g_{j(it)} = 1$ if object $i$ holds category $t$, and $g_{j(it)} = 0$ otherwise.

22 NPCA: how it works in general (2) The solution of NPCA is determined by minimizing the following loss function: $$\sigma(X, q_1, \dots, q_J, \beta_1, \dots, \beta_J) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSQ}\left(X - G_j q_j \beta_j^{T}\right)$$ where SSQ(H) denotes the sum of squares of the elements of matrix H, $q_j$ is an $l_j$-column vector of single category optimal quantifications for the j-th variable and $\beta_j$ is a p-column vector of weights, so that $Y_j = q_j \beta_j^{T}$.
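A minimal sketch of how such a loss could be evaluated for given quantifications. It assumes the usual Gifi-type form, (1/J) times the sum over variables of SSQ(X - G_j q_j beta_j^T), with G_j the indicator matrix of variable j, q_j its category quantifications and beta_j its weights. The function names are hypothetical, categories are coded from 0, and the alternating least squares minimization itself is not shown.

```python
def indicator_matrix(codes, n_categories):
    """N x l_j indicator matrix: entry (i, t) is 1 if object i holds category t."""
    return [[1 if c == t else 0 for t in range(n_categories)] for c in codes]


def npca_loss(X, G_list, q_list, beta_list):
    """(1/J) * sum_j SSQ(X - G_j q_j beta_j^T), using plain Python lists.

    X: N x p object scores; G_list[j]: N x l_j indicator matrix;
    q_list[j]: l_j category quantifications; beta_list[j]: p weights.
    """
    J = len(G_list)
    total = 0.0
    for G, q, beta in zip(G_list, q_list, beta_list):
        for x_row, g_row in zip(X, G):
            # Quantified value of variable j for this object (row of G_j q_j).
            quantified = sum(g * qt for g, qt in zip(g_row, q))
            total += sum((x - quantified * b) ** 2 for x, b in zip(x_row, beta))
    return total / J


# One three-category variable observed on three objects, one dimension (p = 1).
G = indicator_matrix([0, 1, 2], 3)
q = [-1.0, 0.0, 1.0]
beta = [1.0]
perfect = npca_loss([[-1.0], [0.0], [1.0]], [G], [q], [beta])  # scores match exactly
flat = npca_loss([[0.0], [0.0], [0.0]], [G], [q], [beta])      # scores ignore the data
```

The loss is zero when the object scores reproduce the quantified variable exactly, and grows as the scores drift away from it.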

23 ML: how it works in general (1) Consider the simple regression model $$y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \varepsilon_{ij}$$ performed in J ($j = 1, \dots, J$) different groups (schools, regions, countries, etc.) with individual variable $x$. At the second level a group variable $w_j$ expressing changes from group to group can be important to explain second-level variability, and therefore: $$\beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + \gamma_{11} w_j + u_{1j}$$ Inserting the two equations above in the regression model we get the full level-2 multilevel linear model: $$y_{ij} = (\gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij}) + [u_{0j} + u_{1j} x_{ij} + \varepsilon_{ij}]$$
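The substitution step can be checked numerically with a short sketch. It assumes the standard two-level form, with level-1 model y = b0 + b1*x + eps and level-2 models b0 = g00 + g01*w + u0 and b1 = g10 + g11*w + u1; the function names, coefficient values and random draws are all invented for illustration.

```python
import random


def two_step(x, w, gamma, u0, u1, eps):
    """Compute y via the level-2 equations first, then the level-1 regression."""
    g00, g01, g10, g11 = gamma
    beta0 = g00 + g01 * w + u0   # group-specific intercept
    beta1 = g10 + g11 * w + u1   # group-specific slope
    return beta0 + beta1 * x + eps


def combined(x, w, gamma, u0, u1, eps):
    """Compute y from the combined model: fixed part plus random part."""
    g00, g01, g10, g11 = gamma
    fixed_part = g00 + g01 * w + g10 * x + g11 * w * x
    random_part = u0 + u1 * x + eps
    return fixed_part + random_part


rng = random.Random(0)
gamma = (1.0, 0.5, -0.3, 0.2)  # hypothetical fixed coefficients
max_gap = 0.0
for _ in range(1000):
    x, w = rng.uniform(-2, 2), rng.uniform(-2, 2)
    u0, u1, eps = rng.gauss(0, 1), rng.gauss(0, 0.5), rng.gauss(0, 1)
    gap = abs(two_step(x, w, gamma, u0, u1, eps) - combined(x, w, gamma, u0, u1, eps))
    max_gap = max(max_gap, gap)
```

The two routes agree up to floating-point rounding, which confirms that the cross-level interaction term (w times x) and the random slope term (u1 times x) both arise purely from the substitution.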

24 A motivational example: results (1) ATTACHMENT. [Tables with columns Coefficient, SE, z, p-value, CI; numerical values not reproduced in this transcription.]
Model 0 (only random intercept). Rows: Intercept. Random effects (RE): first-level variance (variability between citizens); second-level variance (variability between countries). Deviance.
Model 1 (only individual explanatory variables). Rows: Intercept. Individual variables: Age education; Age: years; Age: 55 years or more; Job: medium status; Job: high status; Community: small or medium town; Community: big town. Random effects (RE): first-level variance (variability between citizens).

25 A motivational example: results (2) [Table with columns Coefficient, SE, z, p-value, CI; numerical values not reproduced in this transcription.]
Model 2 (full model: individual and contextual explanatory variables). Rows: Intercept. Individual variables: Age education; Age: years; Age: 55 years or more; Job: medium status; Job: high status; Community: small or medium town; Community: big town. Contextual variables: Public deficit (2013); Household deprivation (2013). Random effects (RE): first-level variance (variability between citizens); second-level variance (variability between countries). Deviance.

26 A motivational example: results (3) Two-step analysis on EU attachment. First case: residuals of the null model (no explanatory variables). Second case: residuals of the individual model (only individual explanatory variables). Third case: residuals of the full model (individual and contextual explanatory variables).

27 Some focus points for discussion (1) More integration between disciplines (Statistics and Sociology, for example). For example, the European Social Survey is pushing towards more integrated work for the improvement of survey results. Statistics is useful for interpreting survey respondents' answers, for example to unveil citizens' attitudes towards the EU. Some classic and consolidated questions in EU questionnaires about citizens' attachment to the EU may turn out to be obsolete: statistical techniques help in detecting flaws for future surveys. In our work, the dimensions traditionally used by EU policy makers to analyze the level of Europeanization (evaluation, information and attachment) have shown many problems.

28 Some focus points for discussion (2) When doing SDA, researchers are focused only on their particular problems. Comparability, harmonization and quality: in practice these problems are not sufficiently highlighted, or are treated superficially ("This new wave does not contain this question contained in the previous wave"). Statistical analysis helps in formulating new proposals for an improved survey in the course of its implementation. Statistical techniques should not be used for the benefit of statistics only, but should be contextualized to give an answer to epistemological problems.

29 Some suggestions. Questions in questionnaires should be as objective as possible. When planning new surveys, use results that emerged from statistical analysis in other studies/surveys (meta-analytic approach). Construct new surveys starting from the fallacies that emerged from statistical analysis (Lohr's example). Analyses should be contextualized with reference to different areas. A meta-data codebook with rules that also come from previous statistical analysis should accompany traditional meta-data (example from the ML results: are Greeks really angry with the EU?).

30 References
Boudon, R. (1973) Equality, Opportunity, and Social Inequality. New York: Wiley.
Eurostat (2000) Assessment of the Quality in Statistics. Eurostat/A4/Quality/00/General Standard Report, April 4-5, Luxembourg.
Ferrari, P. A., Annoni, P., Manzi, G. (2010) Evaluation and Comparison of European Countries: Public Opinion on Services, Qual Quant, 44.
Ferrari, P. A., Salini, S. (2011) Complementary Use of Rasch Models and Nonlinear Principal Components Analysis in the Assessment of the Opinion of Europeans about Utilities, J Classif, 28.
Fisher, R. A. (1938) Presidential address, First Indian Statistical Congress. Sankhya, 4, 14-17.
Kenett, R. S., Shmueli, G. (2014) On information quality, J Roy Stat Soc A Sta, 177.
Lohr, S. L., Brick, J. M. (2012) Blending domain estimates from two victimization surveys with possible bias, Can J Stat, 40(4).
Manzi, G., Spiegelhalter, D. J., Turner, R. M., Flowers, J., Thompson, S. G. (2011) Modelling bias in combining small area prevalence estimates from multiple surveys, J Roy Stat Soc A Sta, 174.