Small area model-based estimators using big data sources

Monica Pratesi (1), Stefano Marchetti (2), Dino Pedreschi (3), Fosca Giannotti (4), Nicola Salvati (5), Filomena Maggino (6)
(1, 2, 5) Department of Economics and Management, University of Pisa
(3) Department of Informatics, University of Pisa
(4) National Research Council of Italy
(6) Department of Statistics and Informatics, University of Florence
NTTS conference, Brussels, 4-7 March 2013

Outline
1. Motivation
2. Social Mining
3. Small Area Estimation
4. Model-Based Estimators of Poverty Indicators
5. Using Big Data in Small Area Estimation

Part I: Motivation

Motivation
- The timely, accurate monitoring of social indicators, such as poverty or inequality, at a fine-grained spatial and temporal scale is a challenging task for official statistics, yet it is a crucial tool for understanding social phenomena and for policy making.
- Statistical methods and social mining from big data today represent a concrete opportunity for understanding social complexity.
- The aim is to improve our ability to measure, monitor and, possibly, predict social performance, well-being, deprivation, poverty, exclusion, inequality, and so on, at a fine-grained spatial and temporal scale.
- To succeed we need a statistical framework that allows us to make inference with big data.

Part II: Social Mining

Social Mining
- Social Mining aims to provide the analytical methods and associated algorithms needed to understand human behaviour by means of automated discovery of the underlying patterns, rules and profiles from the massive datasets of human activity records produced by social sensing.
- We live in a time of unprecedented opportunities for sensing, storing and analyzing (micro-)data at mass level, recording human activities in extreme detail:
  - Shopping patterns and lifestyle
  - Relationships and social ties
  - Desires, opinions, sentiments
  - Movements

Social Mining
Figure: What happens in 60 seconds?

Big Data and New Questions to Ask
Figure: Text analysis of tweets in the USA over 24 hours

Big Data as Proxy of Human Mobility
Figure: Impact of systematic mobility on access patterns

Big Data as Proxy of Human Mobility
Figure: Discovering individual systematic movements (panels: Work-Home, Home-Work)

Big Data as Proxy of Human Mobility
Figure: With GPS data it is possible to identify new mobility borders that differ from administrative borders

Part III: Model-Based Estimators of Poverty Indicators

Introduction to Small Area Estimation
- Population of interest (or target population): the population for which the survey is designed. Direct estimators should be reliable for the target population.
- Domain: a sub-population of the population of interest, which may or may not be planned in the survey design:
  - Geographic areas (e.g. Regions, Provinces, Municipalities, Health Service Areas)
  - Socio-demographic groups (e.g. Sex, Age, Race within a large geographic area)
  - Other sub-populations (e.g. the set of firms belonging to an industry subdivision)
- We do not know the reliability of direct estimators for the domains that have not been planned in the survey design.

Introduction to Small Area Estimation
- Direct estimators are often not reliable for some domains of interest. In these cases we have two choices:
  - oversampling in those domains
  - applying statistical techniques that allow for reliable estimates in those domains
- Small Domain or Small Area: a geographical area or domain where direct estimators do not reach a minimum level of precision.
- Small Area Estimator (SAE): an estimator created to obtain reliable estimates in a Small Area.

Introduction to Small Area Estimation
- The modern small area estimation approach is based on model-based methods.
- Statistical models link the variable of interest with covariate information:
  - unit level models, when auxiliary variables are known for out-of-sample units
  - area level models, when auxiliary variables are known only at area level
- Unit level models are based on mixed models or M-quantile models.
- Area level models are based on the Fay and Herriot model.

Small Area Estimation: Unit Level Model
- Let $y_{ij}$ be the target variable for unit $j$ in area $i$.
- Let $x_{ij}$ be the vector of $p$ auxiliary variables for unit $j$ in area $i$.
- Mixed effect model:
  $y_{ij} = x_{ij}^T \beta + u_i + \varepsilon_{ij}, \quad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$
- M-quantile models:
  $Q_{q_{ij}}(y_{ij} \mid x_{ij}) = x_{ij}^T \beta(q_{ij})$
  $y_{ij} = x_{ij}^T \beta(\theta_i) + \varepsilon_{ij}$
- Parameters are estimated by restricted maximum likelihood (mixed effect model) and by iterative weighted least squares (M-quantile models).
- With these models the estimates are more accurate because the between-area variation is taken into account.
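As an illustration of how the mixed effect model above is typically used, the following sketches the standard predictor of a small area mean. The symbols $N_i$ (population size of area $i$), $s_i$ and $r_i$ (sampled and non-sampled units of area $i$), and the hats on $\beta$ and $u_i$ (estimated and predicted values) are assumptions introduced here; they do not appear on the slide.

```latex
% Predictor of the mean of area i under the unit level mixed model:
% observed values for sampled units, model predictions for the rest.
\hat{\bar{y}}_i = N_i^{-1} \left[ \sum_{j \in s_i} y_{ij}
  + \sum_{j \in r_i} \left( x_{ij}^T \hat{\beta} + \hat{u}_i \right) \right]
```

The second sum is the reason unit level models need the auxiliary variables $x_{ij}$ for out-of-sample units.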

Small Area Estimation: Fay-Herriot Model
- Based on mixed models, Fay and Herriot modeled direct estimates with area level auxiliary variables.
- Let $\theta_i$ be a target parameter and $\hat{\theta}_i$ its direct estimate in area $i$.
- Let $x_i$ be the vector of $p$ auxiliary variables for area $i$.
  $\hat{\theta}_i = \theta_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_{\varepsilon_i}^2)$
  $\theta_i = x_i^T \beta + u_i, \quad u_i \sim N(0, \sigma_u^2)$
- The Fay and Herriot model is then
  $\hat{\theta}_i = x_i^T \beta + u_i + \varepsilon_i$
- Model parameters can be estimated by maximum likelihood methods (the $\sigma_{\varepsilon_i}^2$ are considered known).
- The use of the auxiliary information improves the accuracy of the estimates.
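To make the area level model concrete, here is a minimal sketch of the best linear unbiased predictor under the Fay and Herriot model, restricted to an intercept plus a single covariate. It assumes that the area effect variance is already known and passed in as `sigma2_u`, whereas the slide estimates it by maximum likelihood; the function name and argument names are hypothetical.

```python
def fay_herriot_blup(direct, x, psi, sigma2_u):
    """Sketch of the Fay-Herriot predictor with an intercept and one covariate.

    direct   : direct estimates, one per area
    x        : area level covariate values, one per area
    psi      : known sampling variances of the direct estimates
    sigma2_u : variance of the area effects (assumed known in this sketch)
    """
    # GLS weights: inverse of the total variance of each direct estimate.
    w = [1.0 / (sigma2_u + p) for p in psi]

    # Weighted normal equations for beta = (b0, b1), solved in closed form.
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, direct))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, direct))
    det = sw * swxx - swx ** 2
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det

    estimates = []
    for yi, xi, p in zip(direct, x, psi):
        # Shrinkage factor: close to 1 when the direct estimate is precise.
        gamma = sigma2_u / (sigma2_u + p)
        synthetic = b0 + b1 * xi
        estimates.append(gamma * yi + (1.0 - gamma) * synthetic)
    return (b0, b1), estimates


# Illustrative (made-up) numbers for four areas.
beta, theta = fay_herriot_blup(
    direct=[0.20, 0.35, 0.15, 0.30],
    x=[1.0, 2.0, 0.5, 1.8],
    psi=[0.004, 0.010, 0.002, 0.008],
    sigma2_u=0.003,
)
print(beta, theta)
```

Each predicted value is a weighted average of the direct estimate and the regression-synthetic estimate, so areas with noisy direct estimates borrow more strength from the covariate.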

Model-Based Estimators of Poverty Indicators
- Denoting by $t$ the poverty line, different poverty measures are defined following Foster et al. (1984). On this slide $i$ indexes units and $j$ indexes small areas:
  $F_{\alpha,ij} = \left( \frac{t - y_{ij}}{t} \right)^{\alpha} I(y_{ij} \le t), \quad i = 1, \dots, N$
- The population poverty measure in small area $j$ can be decomposed into the contributions of the sampled units ($s_j$) and the non-sampled units ($r_j$):
  $F_{\alpha,j} = N_j^{-1} \left[ \sum_{i \in s_j} F_{\alpha,ij} + \sum_{i \in r_j} F_{\alpha,ij} \right]$
- Setting $\alpha = 0$ defines the Head Count Ratio (HCR), whereas setting $\alpha = 1$ defines the Poverty Gap (PG).
- HCR and PG can be estimated with unit level models.
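The poverty measures of Foster et al. (1984) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that a unit whose income equals the poverty line counts as poor. The second function mirrors the decomposition into sampled and non-sampled units, assuming that predicted incomes for the non-sampled units are already available from some unit level model.

```python
def fgt(incomes, poverty_line, alpha):
    """FGT poverty measure: alpha=0 gives the Head Count Ratio, alpha=1 the Poverty Gap."""
    total = sum(
        ((poverty_line - y) / poverty_line) ** alpha
        for y in incomes
        if y <= poverty_line  # assumption: income equal to the line counts as poor
    )
    return total / len(incomes)


def area_fgt(sampled, predicted_nonsampled, poverty_line, alpha):
    """Area level measure: observed incomes for sampled units plus
    model-predicted incomes for non-sampled units (predictions supplied by the caller)."""
    return fgt(list(sampled) + list(predicted_nonsampled), poverty_line, alpha)


incomes = [5, 10, 20, 40]
print(fgt(incomes, poverty_line=10, alpha=0))  # Head Count Ratio: 0.5
print(fgt(incomes, poverty_line=10, alpha=1))  # Poverty Gap: 0.125
```

In the example two of the four units are at or below the line, so the HCR is 0.5; only the unit with income 5 has a positive gap (0.5 of the line), which averaged over four units gives 0.125.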

Model-Based Estimators of Poverty Indicators
- Using the HCR and PG estimators we can estimate 13 of the 19 Laeken indicators.
- The mean squared error of these estimators can be estimated by bootstrap techniques (Molina and Rao, 2010; Marchetti et al., 2012).
- The use of Fay and Herriot models for binary or count outcomes is currently under development.
- Well-specified models and the availability of auxiliary variables are key to the use of small area estimation methods.

Part IV: Using Big Data in Small Area Estimation

Small Area Estimation and Big Data
Our aim is to use the huge source of data coming from human activities (the big data) to make accurate inference at a small area level. We identified three possible approaches:
1. Use big data as covariates in small area models
2. Use survey data to remove self-selection bias from estimates obtained using big data
3. Use big data to validate small area estimates

Use Big Data as Covariate in Small Area Models
- Big data often provide unit level data.
- The outcome variable has to be linked to the auxiliary variables in order to use unit level data in a small area model.
- Due to technical challenges and legal restrictions, it is not feasible at this stage to have unit level big data that can be linked with administrative archives, census data or survey data.
- Big data can be aggregated at area level and then used in an area level model:
  $\hat{\theta}_i = d_i^T \beta + u_i + \varepsilon_i$
  where $d_i$ is a vector of $p$ variables gathered from big data sources.
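The aggregation step can be sketched as follows: unit level records are averaged within each area to obtain the area level vector of big data covariates. The record layout (a dictionary with an `"area"` key) and the variable names in the example are hypothetical, and the mean is only one possible aggregate.

```python
from collections import defaultdict


def aggregate_by_area(records, variables):
    """Average the chosen variables within each area.

    records   : iterable of dicts, each with an "area" key (hypothetical layout)
    variables : names of the numeric fields to aggregate
    Returns a dict mapping area -> list of area means, in the order of `variables`.
    """
    sums = defaultdict(lambda: [0.0] * len(variables))
    counts = defaultdict(int)
    for rec in records:
        area = rec["area"]
        counts[area] += 1
        for k, name in enumerate(variables):
            sums[area][k] += rec[name]
    return {area: [s / counts[area] for s in sums[area]] for area in sums}


# Made-up mobility records for two areas.
records = [
    {"area": "A", "daily_trips": 2, "km_travelled": 10.0},
    {"area": "A", "daily_trips": 4, "km_travelled": 30.0},
    {"area": "B", "daily_trips": 6, "km_travelled": 50.0},
]
print(aggregate_by_area(records, ["daily_trips", "km_travelled"]))
```

Because only area level averages leave this step, no record needs to be linked to an individual in a survey or administrative archive.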

Use Survey Data to Remove Self-Selection Bias from Estimates Obtained Using Big Data
- An option is to use big data directly to measure poverty and social exclusion.
- It is realistic to assume that big data are not representative of the whole population of interest (self-selection problem).
- Using a quality survey, we can check for differences in the distribution of common variables between big data and survey data.
- If there are no common variables, we can use known correlated data to check for differences in the distribution.
- Given these differences, we can compute weights that reduce the bias due to the self-selection of the big data.
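One simple way to build such weights is post-stratification on a common categorical variable. This is a stand-in chosen for illustration, not necessarily the weighting scheme the authors have in mind. The sketch assumes that the survey gives reliable population shares for each category; all names and numbers are hypothetical.

```python
from collections import Counter


def poststratification_weights(big_data_groups, survey_shares):
    """Weight for each category = survey share / share observed in the big data."""
    counts = Counter(big_data_groups)
    n = len(big_data_groups)
    return {g: survey_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, groups, weights):
    """Mean of `values`, with each record weighted by its category weight."""
    num = sum(weights[g] * v for v, g in zip(values, groups))
    den = sum(weights[g] for g in groups)
    return num / den


# Made-up example: the young are over-represented in the big data source.
groups = ["young"] * 8 + ["old"] * 2
values = [1.0] * 8 + [0.0] * 2          # outcome observed in the big data
shares = {"young": 0.3, "old": 0.7}     # population shares taken from the survey

weights = poststratification_weights(groups, shares)
print(sum(values) / len(values))              # unweighted mean
print(weighted_mean(values, groups, weights))  # reweighted mean
```

The unweighted mean reflects the composition of the big data source, while the reweighted mean reflects the composition reported by the survey. The correction removes bias only to the extent that self-selection is explained by the variable used for weighting.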

Use Big Data to Validate Small Area Estimates
- Poverty and deprivation measures obtained from big data can be compared with similar measures obtained from survey data.
- If the big data estimates and the survey data estimates agree, the level of poverty and deprivation is supported by two independent sources of evidence.
- If they disagree, further investigation is needed.

Essential Bibliography
- Breckling, J. and Chambers, R. (1988). M-quantiles. Biometrika 75, 761-771.
- Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika 93(2), 255-268.
- Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica 52, 761-766.
- Giannotti, F. and Pedreschi, D. (2008). Mobility, Data Mining and Privacy. Springer.
- Giannotti, F., Pedreschi, D., Pentland, A., Lukowicz, P., Kossmann, D., Crowley, J. and Helbing, D. (2012). A planetary nervous system for social mining and collective awareness. European Physical Journal Special Topics 214, 49-75.
- Marchetti, S., Tzavidis, N. and Pratesi, M. (2012). Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small area averages, quantiles and poverty indicators. Computational Statistics and Data Analysis 56(10), 2889-2902.
- Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics.