BIG DATA AND OFFICIAL STATISTICS Filomena Maggino, Monica Pratesi
What are the risks, needs, and challenges of big data in the context of measuring wellbeing?
«Data are widely available, what is scarce is the ability to extract wisdom from them» (Hal Varian, Google chief economist) http://www.economist.com/node/15557443
challenge risk need
Risk: losing the way
BIG: "the more we have, the better it is" / meaningful mass of information
"Big" should represent an opportunity for transversal reading (this, in a nutshell, is the idea behind the multipurpose project at ISTAT)
Need: system
Exploiting all data sources in order to describe a consistent frame of a community's wellbeing, through a transversal and horizontal approach, creating a big and heterogeneous patrimony from which an overall view can be generated
Challenge: heterogeneity
BIG: heterogeneity of its components
Not [only] integration of different sources but [also] building and re-building paths of transversal meaning
The definition of new indicators of countries' progress and wellbeing has introduced new data needs.
BIG DATA
Instruments to manage big data
In order to avoid indigestible mixtures, a consistent conceptual framework is needed
conceptual framework + big data + analytic instruments = measuring a country's wellbeing
In this perspective, we need to take into account the conceptual dimensions describing a country's progress and communities' wellbeing
1. Wellbeing
   quality of life:
      o living conditions
      o subjective wellbeing
   quality of society:
      o social cohesion (participation, trust, social relations, identity)
2. Equity
      o distribution of wellbeing
      o inequalities, regional disparities
      o social exclusion
3. Sustainability
   Relationship between the previous levels, the environment and the future
The conceptual dimensions need to be observed and analyzed at the micro level (individual / household) (*)
(*) See Stiglitz J. E., A. Sen & J.-P. Fitoussi, eds. (2009), Report by the Commission on the Measurement of Economic Performance and Social Progress, Paris. http://www.stiglitz-senfitoussi.fr/en/index.htm
Our aim is to introduce BIG DATA and their potential informative load into the dimension of social indicators in the field of official statistics
Our challenge is to construct complex indicators able to (i) monitor communities' wellbeing and (ii) support the definition of better policies by introducing new descriptions captured by big data.
Our challenge is to construct complex indicators that meet the required characteristics
Identifying indicators
An indicator should be able to:
   o define and describe
   o observe unequivocally and stably
   o record with as low a degree of distortion as possible
   o adhere to the principle of objectivity
   o adequately reflect the conceptual model
   o meet current and potential users' needs
   o be observed through realistic efforts and costs
   o reflect the length of time between its availability and the event or phenomenon it describes
   o be analyzed in order to record differences and disparities
   o be spread
(I) METHODOLOGICAL SOUNDNESS
(II) INTEGRITY
(III) SERVICEABILITY
(IV) ACCESSIBILITY
In other words, our goal is to extract consistent knowledge, new insights and meaningful pictures of our societies' progress and wellbeing from BIG DATA.
Introduction to Small Area Estimation
   o Population of interest (or target population): the population for which the survey is designed; direct estimators should be reliable for the target population
   o Domains: sub-populations of the population of interest; they may or may not be planned in the survey design
      o Geographic areas (e.g. Regions, Provinces, Municipalities, Health Service Areas)
      o Socio-demographic groups (e.g. Sex, Age, Race within a large geographic area)
      o Other sub-populations (e.g. the set of firms belonging to an industry subdivision)
   o We do not know the reliability of direct estimators for the domains that have not been planned in the survey design
Introduction to Small Area Estimation
   o Often direct estimators are not reliable for some domains of interest
   o In these cases we have two choices:
      o oversampling in those domains
      o applying statistical techniques that allow for reliable estimates in those domains
   o Small Domain or Small Area: a geographical area or domain where direct estimators do not reach a minimum level of precision
   o Small Area Estimator (SAE): an estimator created to obtain reliable estimates in a Small Area
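As an illustration of the Small Area definition above, here is a minimal Python sketch that computes a direct estimator (the sample mean) for each domain and flags the domains whose coefficient of variation exceeds a threshold. Simple random sampling within each domain, the 20% threshold, the toy data, and the names direct_estimate and flag_small_areas are assumptions made for this example; they are not taken from the slides.

```python
import math
from statistics import mean, stdev


def direct_estimate(values):
    """Return (sample mean, coefficient of variation) for one domain.

    Assumes simple random sampling, a non-zero mean, and ignores the
    finite population correction (illustrative assumptions).
    """
    n = len(values)
    est = mean(values)
    se = stdev(values) / math.sqrt(n)  # standard error of the sample mean
    return est, se / abs(est)


def flag_small_areas(samples, max_cv=0.2):
    """Map each domain to True when its direct estimator is too imprecise.

    max_cv is a hypothetical precision threshold chosen for this sketch.
    """
    flags = {}
    for domain, values in samples.items():
        if len(values) < 2:
            # No variance can be estimated: treat the domain as a small area.
            flags[domain] = True
            continue
        _, cv = direct_estimate(values)
        flags[domain] = cv > max_cv
    return flags


# Toy data: a well-sampled domain and a domain with only three observations.
samples = {
    "Province A": [10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 10, 12],
    "Province B": [4, 20, 9],
}
flags = flag_small_areas(samples)
```

In practice the variance estimator would follow the actual survey design (weights, strata, clusters) rather than the simple random sampling formula used here.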
Small Area Estimation and Big Data
Our aim is to use the huge source of data coming from human activities (the big data) to make accurate inference at a small area level
We identified three possible approaches:
1. Use big data as covariates in small area models
2. Use survey data to remove self-selection bias from estimates obtained using big data
3. Use big data to validate small area estimates
Use Big Data as Covariates in Small Area Models
   o Big data often provide unit level data
   o The outcome variable has to be linked to auxiliary variables in order to use unit level data in a small area model
   o Due to technical challenges and legal restrictions, it is not feasible at this stage to have unit level big data that can be linked with administrative archives, census or survey data
   o Big data can be aggregated at area level and then used in an area level model, with d_i a vector of p variables gathered from big data sources
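The slide does not preserve the model formula, so the following is only a sketch under an explicit assumption: that the area level model is of the Fay-Herriot type, where the direct estimate for area i equals d_i'beta + u_i + e_i, with u_i an area effect of unknown variance and e_i a sampling error of known variance psi_i. The function names, the moment (Prasad-Rao type) variance estimator, and the toy numbers are all choices made for this example. The resulting estimate shrinks each direct estimate towards the regression prediction built from the big data covariates.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def wls(D, y, w):
    """Weighted least squares coefficients (rows D, responses y, weights w)."""
    p = len(D[0])
    A = [[sum(wi * d[j] * d[k] for d, wi in zip(D, w)) for k in range(p)]
         for j in range(p)]
    b = [sum(wi * d[j] * yi for d, yi, wi in zip(D, y, w)) for j in range(p)]
    return solve(A, b)


def fay_herriot_eblup(direct, psi, D):
    """Return (EBLUP per area, shrinkage factors gamma, beta).

    direct: direct estimates; psi: known sampling variances;
    D: rows d_i of area level covariates (here, aggregated big data).
    """
    m, p = len(D), len(D[0])
    # Moment estimate of the area-effect variance from OLS residuals.
    beta_ols = wls(D, direct, [1.0] * m)
    rss = sum((y - dot(d, beta_ols)) ** 2 for d, y in zip(D, direct))
    DtD = [[sum(d[j] * d[k] for d in D) for k in range(p)] for j in range(p)]
    leverage = [dot(d, solve(DtD, d)) for d in D]
    correction = sum(v * (1.0 - h) for v, h in zip(psi, leverage))
    sigma2_u = max(0.0, (rss - correction) / (m - p))
    # Re-estimate beta with weights 1 / (sigma2_u + psi_i).
    beta = wls(D, direct, [1.0 / (sigma2_u + v) for v in psi])
    # Shrink each direct estimate towards its regression prediction.
    gamma = [sigma2_u / (sigma2_u + v) for v in psi]
    eblup = [g * y + (1.0 - g) * dot(d, beta)
             for g, y, d in zip(gamma, direct, D)]
    return eblup, gamma, beta


# Toy data: intercept plus one hypothetical big data covariate per area.
D = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.9], [1.0, 1.4], [1.0, 1.8], [1.0, 2.3]]
direct = [1.6, 1.7, 3.1, 3.5, 5.0, 5.3]
psi = [0.10, 0.40, 0.05, 0.30, 0.08, 0.50]
eblup, gamma, beta = fay_herriot_eblup(direct, psi, D)
```

Areas with a large sampling variance psi_i receive a small gamma and therefore lean more on the big data covariates; areas with precise direct estimates stay close to them.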
Use Survey Data to Remove Self-Selection Bias from Estimates Obtained Using Big Data
   o An option is to use big data directly to measure poverty and social exclusion
   o It is realistic to think that the big data are not representative of the whole population of interest (self-selection problem)
   o Using a quality survey we can check the differences in the distribution of common variables between big data and survey data
   o If there are no common variables, we can use known correlated data to check the differences in the distributions
   o Given these differences, we can compute weights that reduce the bias due to the self-selection of the big data
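One simple way to turn these distribution differences into weights is post-stratification on a common categorical variable. The sketch below is an assumption about how the weighting could be done, not the authors' stated procedure: each big data unit is weighted by the ratio between the survey share and the big data share of its category. The variable (age group), the shares, and the function names are hypothetical.

```python
from collections import Counter


def shares(categories):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def poststratification_weights(big_categories, survey_shares):
    """Weight each big data unit by survey share / big data share.

    survey_shares is assumed to come from a quality survey (ideally
    design-weighted) and to cover every category seen in the big data.
    """
    big_shares = shares(big_categories)
    return [survey_shares[c] / big_shares[c] for c in big_categories]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data: young people are over-represented in the big data source.
big_age = ["young", "young", "young", "old"]
big_value = [10.0, 12.0, 14.0, 30.0]
survey_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(big_age, survey_shares)
naive = sum(big_value) / len(big_value)       # ignores self-selection
adjusted = weighted_mean(big_value, weights)  # re-balanced to survey shares
```

The adjustment only removes the part of the bias explained by the chosen common variable; selection on unobserved characteristics remains.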
Use Big Data to Validate Small Area Estimates
   o Poverty and deprivation measures obtained from big data can be compared with similar measures obtained from official survey data
   o If the big data estimates and the survey data estimates agree, then there is double-checked evidence of the level of poverty and deprivation
   o If there is a discrepancy, further investigation is needed
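The comparison described above can be sketched as a per-area check of the difference between the two estimates relative to their combined uncertainty. The sketch assumes that a standard error is available for both sources and that the two estimates are independent; the 1.96 cut-off, the function name compare_estimates, and the numbers are illustrative only.

```python
import math


def compare_estimates(survey, big, z_crit=1.96):
    """Compare estimates area by area.

    survey and big map area -> (estimate, standard error).
    Returns area -> {"z": standardized difference, "agree": bool}.
    """
    report = {}
    for area in survey.keys() & big.keys():
        s_est, s_se = survey[area]
        b_est, b_se = big[area]
        # Standardized difference, assuming independent estimates.
        z = (b_est - s_est) / math.sqrt(s_se ** 2 + b_se ** 2)
        report[area] = {"z": z, "agree": abs(z) <= z_crit}
    return report


# Toy poverty rates with standard errors for two areas.
survey = {"A": (0.20, 0.03), "B": (0.15, 0.02)}
big = {"A": (0.22, 0.02), "B": (0.30, 0.03)}
report = compare_estimates(survey, big)
```

Areas flagged as not agreeing are the ones that would call for further investigation.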