Small area model-based estimators using big data sources

Small area model-based estimators using big data sources Monica Pratesi 1 Stefano Marchetti 2 Dino Pedreschi 3 Fosca Giannotti 4 Nicola Salvati 5 Filomena Maggino 6 1,2,5 Department of Economics and Management, University of Pisa 3 Department of Informatics, University of Pisa, 4 National Research Council of Italy 6 Department of Statistics and informatics, University of Florence NTTS conference, Bruxelles, 4-7 March 2013

Outline 1 Motivation 2 Social Mining 3 Small Area Estimation 4 Model-Based Estimators of Poverty Indicators 5 Using Big Data in Small Area Estimation

Motivation Part I Motivation

Motivation Motivation The timely, accurate monitoring of social indicators, such as poverty or inequality, at a fine grained spatial and temporal scale is a challenging task for official statistics, albeit a crucial tool for understanding social phenomena and policy making Statistical methods and social mining from big data represent today a concrete opportunity for understanding social complexity The aim is to improve our ability to measure, monitor and, possibly, predict social performance, well-being, deprivation poverty, exclusion, inequality, and so on, at a fine-grained spatial and temporal scale To succeed we need a statistical framework that allow to make inference with the big data

Social Mining Part II Social Mining

Social Mining Social Mining Social Mining aims to provide the analytical methods and associated algorithms needed to understand human behaviour by means of automated discovery of the underlying patterns, rules and profiles from the massive datasets of human activity records produced by social sensing We live in a time with unprecedented opportunities of sensing, storing, analyzing (micro)-data, at mass level, recording human activities at extreme detail Shopping patterns and lifestyle Relationships and social ties Desires, opinions, sentiments Movements

Social Mining Social Mining Figure: What happen in 60 seconds?

Social Mining Big Data and New Question to Ask Figure: Text analysis of Tweets in the USA in the 24 hours

Social Mining Big Data as Proxy of Human Mobility Figure: Impact of systematic mobility on access patterns

Social Mining Big Data as Proxy of Human Mobility Work-Home Home-Work Figure: Discovering individual systematic movements

Social Mining Big Data as Proxy of Human Mobility Figure: By GPS data is possible to identify new borders of mobility that are different from administrative borders

SAE Model-Based Estimators of Poverty Indicators Part III Model-Based Estimators of Poverty Indicators

SAE Model-Based Estimators of Poverty Indicators Introduction to Small Area Estimation Population of interest (or target population): population for which the survey is designed direct estimators should be reliable for the target population Domain: sub-population of the population of interest, they could be planned or not in the survey design Geographic areas (e.g. Regions, Provinces, Municipalities, Health Service Area) Socio-demographic groups (e.g. Sex, Age, Race within a large geographic area) Other sub-populations (e.g. the set of firms belonging to a industry subdivision) we don t know the reliability of direct estimators for the domains that have not been planned in the survey design

SAE Model-Based Estimators of Poverty Indicators Introduction to Small Area Estimation Often direct estimators are not reliable for some domains of interest In these cases we have two choices: oversampling over that domains applying statistical techniques that allow for reliable estimates in that domains Small Domain or Small Area Geographical area or domain where direct estimators do not reach a minimum level of precision Small Area Estimator (SAE) An estimator created to obtain reliable estimate in a Small Area

SAE Model-Based Estimators of Poverty Indicators Introduction to Small Area Estimation Modern small area estimation approach is based on model-based methods Statistical models link the variable of interest with covariate information unit level models when auxiliary variables are known for out f sample units area level models when auxiliary variables are known only at area level Unit level models are based on mixed models or M-quantile models Area level models are based on the Fay and Herriot model

SAE Model-Based Estimators of Poverty Indicators Small Area Estimation: Unit Level Model Let y ij be the target variable for unit j in area i Let x ij be the auxiliary vector of p auxiliary variables for unit j in area i Mixed effect model y ij = x T ij β + u i + ε ij u i N(0, σ 2 u) ε ij N(0, σ 2 ε) M-quantile models Q(y ij x ij ) qij = x T ij β(q ij ) y ij = x T ij β(θ i ) + ε ij Parameters are estimated respectively with restricted maximum likelihood and iterative weighted least squares By the use of these models estimates are more accurate because we take into account the between area variation

SAE Model-Based Estimators of Poverty Indicators Small Area Estimation: Fay-Harriot Model Based on mixed models Fay and Harriot modeled direct estimates with area level auxiliary variables Let θ i be a target parameter and theta ˆ i its direct estimate in area i Let x i be the auxiliary vector of p auxiliary variables for area i ˆθ i = θ i + ε i ε i N(0, σ 2 ε i ) θ i = x i β + u i u i N(0, σ 2 u) The Fay and Harriot model is than ˆθ i = x T i β + u i + ε i Model parameters can be estimated by maximum likelihood methods (σ 2 ε i are considered known) By the use of the auxiliary information the accuracy of the estimates is improved

SAE Model-Based Estimators of Poverty Indicators Model-Based Estimators of Poverty Indicators Denoting by t the poverty line different poverty measures are defined by using (Foster et al., 1984) ( t yij ) αi(yij F α,ij = t) i = 1,..., N t The population distribution function in small area j can be decomposed as follows [ ] F α,j = N 1 j F α,ij + F α,ij i s j i r j Setting α = 0 defines the Head Count Ratio whereas setting α = 1 defines the Poverty Gap. HCR and PG can be estimated with unit level models.

SAE Model-Based Estimators of Poverty Indicators Model-Based Estimators of Poverty Indicators By the use of HCR and PG estimators we can estimate 13 on 19 Laeken indicators The mean squared error of these estimator can be estimated by bootstrap techniques (Molina and Rao, 2010; Marchetti et al., 2012) The use of Fay and Herriot models for binary or count outcome is currently under development The use of well-specified models and the availability of auxiliary variable is a key point to allow the use of small area estimation methods

Using Big Data in Small Area Estimation Part IV Using Big Data in Small Area Estimation

Using Big Data in Small Area Estimation Small Area Estimation and Big Data Our aim is to use the huge source of data coming from human activities - the big data - to make accurate inference at a small area level. We identified three possible approaches: 1 Use big data as covariate in small area models 2 Use survey data to remove self-selection bias from estimates obtained using big data 3 Use big data to validate small area estimates

Using Big Data in Small Area Estimation Use Big Data as Covariate in Small Area Models Big data often provide unit level data Outcome variable have to be linked to auxiliary variable in order to use unit level data in a small area model Due to technical challenges and law restriction it is unfeasible at this stage to have unit level big data that can be linked with administrative archive or census or survey data Big data can be aggregate at area level and then used in an area level model ˆθ i = d T i β + u i + ε i d i is a vector of p variables gathered from big data sources

Using Big Data in Small Area Estimation Use Survey Data to Remove Self-Selection Bias from Estimates Obtained Using Big Data An option is to use big data directly to measure poverty and social exclusion It is realistic to think the big data are not representative of the whole population of interest (self-selection problem) Using a quality survey we can check difference in the distribution of common variables between big data and survey data If there aren t common variables we can use known correlated data to check difference in the distribution Given this difference we can compute weights that allow the reduction of bias due to the self-selection of the big data

Using Big Data in Small Area Estimation Use Big Data to Validate Small Area Estimates Poverty and deprivation measures obtained from big data can be compared with similar measures obtained from survey data If there is accordance between big data estimates and survey data estimates then there is a double checked evidence of the level of poverty and deprivation If there is discrepancy there is need of further investigation

Using Big Data in Small Area Estimation Essential Bibliography Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika, 75, 761-71. Chambers, R., Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika 93 (2), 255-68. Foster, J., Greer, J., Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica 52, 761-766. Giannotti, F. and Pedreschi, D. (2008). Mobility, Data Mining and Privacy, Springer. Giannotti, F., Pedreschi, D., Pentland, A., Lukowicz, P., Kossmann, D., Crowley, J. and Helbing, D. (2012). A planetary nervous system for social mining and collective awareness. European Physics Journal - Special Topics 214, 49-75 Marchetti, S., Tzavidis, N. and Pratesi, M. (2012). Non-parametric bootstrap mean squared error Estimation for M-quantile Estimators of Small Area Averages, Quantiles and Poverty Indicators. Computional Statistics and Data Analysis, 56(10), 2889-2902. Molina, I., Rao, J. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics.