REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION




Pilar Rey del Castillo
May 2013

Introduction

The exploitation of the vast amount of data originating from ICT tools and referring to a wide variety of economic and social activities is both a challenge and an opportunity for official statistics. But it still remains to be discovered how to extract significant value for the production of statistical figures from the diversity of data available. This paper proposes to start exploring some Big Data sources, following a straightforward path to achieve results. The next section provides a preliminary insight into the issues at stake, and the one after it presents some ideas for starting a road map from Eurostat.

Some defining features of Big Data

The most conspicuous feature of Big Data, compared with traditional statistical sources, is that they do not come from a prior design aimed at obtaining specific statistics, but become available as traces of human activity. This attribute makes it difficult to use traditional statistical methods and tools such as probabilistic sampling, statistical classifications and so on, rendering the Generic Statistical Business Process Model useless and inapplicable. The outcomes of a recent survey conducted among executives of a wide range of industries around the world are revealing: although a great part of the respondents agreed that data had become an important factor for their business, many companies were struggling with basic aspects of data management, still attempting to exploit it effectively [1]. This confirms that extracting useful information from this kind of data is a non-obvious and rather difficult task that should be carefully planned. Although the appearance of some statistical figures based on Big Data may suggest that the information is found in the data directly ready for publication, one should consider the huge amounts of data that were analysed and processed beforehand to achieve those results. Nevertheless, the attraction of the potential reduction of respondent burden and costs, together with the general framework of improving the productivity of the European Statistical System (ESS), puts increasing pressure on the use of Big Data as a source of statistical information.

Experience with the use of another source that shares with Big Data this feature of not being designed for statistical purposes, namely administrative data, may illuminate the road map, by comparing the features they have in common and those that make a difference. The main aspects of administrative data we are interested in considering are the following:

a) Methods to obtain statistical information from administrative data usually depend on the specific data, making it difficult to establish general rules or to use generic production process models.

b) Administrative data are not structured as statistical data are, that is, they do not use statistical classifications and definitions, but they still show a certain structure related to the purpose of their creation. This means that some work of translating, linking or harmonizing the structures (units, definitions, classifications...) must always be done.

c) Sampling procedures are not used to obtain the reporting units, but there is frequently an idea of their representativeness of the population of interest (sometimes all the population units are included).

d) The volume of administrative data is not usually a problem, and they may be treated with the statistical procedures used for other typical sources.

e) The ways in which they are increasingly being used to produce statistical figures can be classified as (i) totally replacing statistical sources, (ii) partially replacing statistical sources, completing the information by means of record linkage, matching or other procedures, and (iii) providing completely new statistical figures that may complement the available statistical information from other perspectives. The first two ways may in theory yield significant reductions in costs and respondent burden, but they frequently imply new tasks of translating, linking or harmonizing which are not necessary when completely new statistical figures are produced. An example of this last case could be the figures on registered unemployment.

As for Big Data, and concerning the same corresponding features:

a) Owing to the heterogeneity of the Big Data available, methods to produce statistical information must be developed ad hoc for each case, exactly as for administrative data.

b) Some Big Data have a certain structure related to the source of information, and some are just unstructured text strings. Good metadata are not usually available, and in most cases the work of harmonizing or translating into statistical structures would seemingly be enormous.

c) Apart from not using sampling procedures, Big Data frequently come from private companies, and their representativeness and coverage of the populations of interest for official statistics are difficult to assess.

d) The name Big Data refers precisely to the huge volume. This dimension has an impact on storage and processing, falling frequently outside the scope of traditional statistical tools.

e) The way Big Data could be used to produce statistical figures raises a crucial issue. It seems unlikely that Big Data able to totally or partially replace statistical sources will be found in the short term, for the reasons explained in the previous points, and following that path may be too expensive in time and resources. A sound approach would thus be to start searching for sources that could provide completely new and independent statistical figures, not adapted to traditional statistical structures but offering new perspectives. For example, instead of looking for sources to substitute the Household Budget Survey (HBS), one could try to build indicators of its trends over time. When improvements in this area are achieved, the new set of statistics available will provide a valuable basis for re-designing the products and the production process of official statistics.
There may be opportunities to tackle the specific problems of Big Data by using suitable tools:

1. An apparently critical problem is the volume of the Big Data available: there is a need to move away from exclusive dependence on statistical methods that cannot handle this volume of information, and to adopt a more diverse set of tools. This can be addressed through the use of algorithms specially developed for this purpose, such as data mining methods. These algorithms have the required computational efficiency and are scalable, that is, they have the ability to handle a growing amount of work in a capable manner, or to be enlarged to accommodate that growth [2]. The state of the art provides a great variety of data mining tools for different objectives: classification, clustering, regression, association, feature extraction... A first stage of exploration using data mining procedures should usually be carried out to learn about the unknown structure of the data and the possible outcomes, later combining this with traditional statistical procedures. The type of Big Data and its form determine the type of data mining tool to be used. Thus the statistical production process from Big Data should have an exploratory analysis as its first step; a combination of data mining and traditional statistical procedures may follow to produce the best results. A minimal sketch of such an exploratory first step is given below.
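
The following sketch assumes the raw records arrive as unstructured text strings and uses scikit-learn; the records, the number of clusters and all parameter values are illustrative placeholders, not a prescription:

```python
# Exploratory clustering of unstructured text records: a minimal sketch.
# All data and parameters are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder records standing in for a (much larger) Big Data extract.
records = [
    "payment card transaction grocery store",
    "search query flu symptoms fever",
    "payment card transaction petrol station",
    "search query flu vaccine near me",
    "payment card transaction grocery market",
    "search query stock market index today",
]

# Turn free text into a sparse numeric matrix (feature extraction).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(records)

# Cluster to learn about the unknown structure of the data.
k = 2  # illustrative; in practice chosen after inspecting several values
km = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Inspect the terms that best characterise each cluster.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for c in range(k):
    top = ", ".join(terms[i] for i in order[c, :4])
    print(f"cluster {c}: {top}")
```

The clusters found in such a pass suggest which statistical structures the source could support; traditional procedures would then be applied to the promising segments.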

2. Another important concern is the representativeness and validity of the statistics produced. The use of probabilistic sampling in traditional statistics provides a theoretical framework that ensures confidence in the figures produced, with accuracy assessed through sampling errors. Most of the Big Data available cannot be adapted to this framework, and other procedures should be devised. This seems to be an important weakness of the use of Big Data, and efforts should be focused on it. Meanwhile, experiences of successful uses of Big Data could be investigated so as to follow a similar approach. Two well-known examples are briefly considered here. The first refers to the estimates of the incidence of flu in different countries and regions around the world derived from Google searches for flu-related topics [3]; these estimates have been found to match traditional flu activity indicators very closely. Similarly, a recent article in BBC News [4] reported that Google searches for finance-related terms may predict moves in markets, and that an investment strategy based on these search volume data between 2004 and 2011 would have made a profit of 326%.

These examples have two important features in common (apart from being Google products) that may help with the problems of representativeness, coverage and validity. The first is that both of them estimate changes or movements over time, not absolute figures. A well-established statistical principle is that it is more reliable to estimate changes (over time or space) than absolute figures, because some biases and errors cancel out when computing the change: perhaps the first attempts to use Big Data should aim at producing estimates of changes or evolutions. The other relevant feature shared by both examples is the criterion used to evaluate the results. What is estimated are proxy variables that perform well in following the movements of a phenomenon of interest; that is, the performance is assessed in terms of its similarity to other available figures measuring the same or an analogous thing. In the same way, the performance of Big Data could be evaluated, in a first instance, by its similarity or agreement with other available measures, rather than by a sampling-error criterion. This makes sense from a data mining perspective, where the equivalent of fitting a model is tuning an algorithm so that it fits the real world. The sketch below illustrates this evaluation criterion.
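
A minimal sketch of the criterion, comparing period-on-period changes of a Big Data proxy index with those of a traditional indicator; both series are illustrative placeholders, not real data:

```python
# Evaluating a Big Data proxy by agreement with an existing indicator:
# compare period-on-period changes, not absolute levels.
import numpy as np

# Illustrative placeholder series (e.g. twelve monthly observations).
proxy = np.array([100, 103, 101, 106, 110, 108, 112, 115, 113, 118, 121, 119], dtype=float)
reference = np.array([200, 204, 201, 210, 216, 213, 220, 226, 222, 231, 236, 233], dtype=float)

# Relative period-on-period changes; level biases common to all periods cancel out.
proxy_chg = np.diff(proxy) / proxy[:-1]
ref_chg = np.diff(reference) / reference[:-1]

# Agreement of movements as a first quality measure, in place of sampling errors.
corr = np.corrcoef(proxy_chg, ref_chg)[0, 1]
same_sign = np.mean(np.sign(proxy_chg) == np.sign(ref_chg))
print(f"correlation of changes: {corr:.3f}")
print(f"share of periods moving in the same direction: {same_sign:.0%}")
```

The same comparison can be repeated against several alternative indicators; consistently high agreement across them strengthens the case for the proxy.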

When many different statistical figures are produced from different and independent Big Data sources following these principles, the coherence and agreement among them may be an argument supporting the validity and representativeness of the whole system.

3. Although Big Data may not be structured as statistical data are, they may have the same type of structure (or lack of structure) across countries. This would have the advantage of making the process of harmonization between countries unnecessary, which is of special interest for transnational statistics.

4. There are other concerns about Big Data that seem similar to the case of statistical sources, such as the appearance of diverse types of problems or errors: noise, incompleteness, missing data, reporting errors, outliers... Data editing (cleaning, checking, imputing...) consumes considerable time and resources in traditional statistical processing, and similar methods could be used here. It is likely that some errors (reporting, incompleteness...) occur less frequently in Big Data because of the absence of human intervention at their origin, although machine or system failures may happen as well, producing other errors. A new type of problem that does not occur with statistical sources but may emerge in Big Data is imprecision (for instance, vague or categorical measures such as high, medium, low...): it may be attacked using other data mining tools such as fuzzy and rough sets. Some data mining procedures are interesting because they are robust, in the sense of being tolerant of erroneous data or of departures from data assumptions. In any case, all these methods should be developed on an ad hoc basis; a minimal sketch of one such robust editing step follows.
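
The sketch flags outliers with a median/MAD rule (tolerant of erroneous values, unlike rules based on the mean and standard deviation) and imputes missing values with the median; the data and the threshold are illustrative, and the rule is one common robust choice among several:

```python
# Robust data editing sketch: outlier flagging with a median/MAD rule
# and median imputation of missing values. Data and threshold are illustrative.
import numpy as np

values = np.array([12.0, 11.5, np.nan, 12.3, 11.8, 95.0, 12.1])  # placeholder data

ok = ~np.isnan(values)
median = np.median(values[ok])
mad = np.median(np.abs(values[ok] - median))  # median absolute deviation

# Flag values more than 5 scaled MADs from the median (machine-failure spikes etc.).
scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation under normality
outlier = ok & (np.abs(values - median) > 5 * scale)

# Impute missing values and replace flagged outliers with the robust centre.
edited = values.copy()
edited[~ok | outlier] = median
print("flags:", outlier)
print("edited:", edited)
```
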
A final remark is that the opportunity offered by Big Data rests on the reduction of respondent burden and costs, which may sometimes be obtained quickly. Hence, before engaging in a complex process to make a source reliable for statistics, a careful analysis of the potential gains should be made. In other words, the reductions in costs, burden and production time provided by the use of Big Data may balance a possible decrease in accuracy, or in quality in general.

A possible road map to exploit Big Data

This section sketches out a few actions that Eurostat may promote as a first step towards exploiting Big Data sources. These actions are to:

1. Identify possible Big Data sources. These may be private or public, internet or non-internet, the interest lying especially in sources that have international scope and are appropriate for producing indicators of trends or changes in different economic and social activities. Access to these data and possible problems (confidentiality, ownership...) should be studied as well.

2. Gather information on practices in European countries regarding the use of Big Data for producing statistics, classifying the methods and tools and the outcomes produced. This would provide information on alternative approaches.

3. Launch pilot research projects to produce statistical figures from the identified Big Data sources. An example of a possible research exercise using a non-internet Big Data source is the production of an indicator of the evolution of household budgets from the transaction records of a department store. It may be obtained using association rules as a first step, later computing weighted indices; the results can then be checked against the outcomes of alternative sources such as the annual HBS. A rough sketch of such a computation is given below.
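
What follows is only a sketch of what such a pilot computation could look like, under strong simplifying assumptions: baskets are reduced to sets of product categories, one-to-one association rules are mined by simple counting, and a base-weighted (Laspeyres-type) index is computed; all data, thresholds and category names are invented for illustration only:

```python
# Pilot-exercise sketch: simple association rules over transaction baskets,
# then a base-weighted expenditure index. All figures are illustrative.
from itertools import combinations
from collections import Counter

baskets = [  # placeholder department-store baskets (sets of product categories)
    {"food", "cleaning"}, {"food", "clothing"}, {"food", "cleaning"},
    {"clothing", "household"}, {"food", "cleaning", "household"},
]

n = len(baskets)
item_count = Counter(i for b in baskets for i in b)
pair_count = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

# One-to-one rules A -> B whose support and confidence exceed illustrative thresholds.
for pair, c in pair_count.items():
    a, b = tuple(pair)
    for x, y in ((a, b), (b, a)):
        support, confidence = c / n, c / item_count[x]
        if support >= 0.4 and confidence >= 0.6:
            print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")

# Base-weighted index of household expenditure by category (base period = 100).
base = {"food": 100.0, "cleaning": 40.0, "clothing": 60.0, "household": 50.0}
current = {"food": 110.0, "cleaning": 38.0, "clothing": 66.0, "household": 52.0}
weights = {k: v / sum(base.values()) for k, v in base.items()}
index = 100 * sum(weights[k] * current[k] / base[k] for k in base)
print(f"expenditure index (base=100): {index:.1f}")
```

In practice the rules would be mined from millions of transaction records, and the weights would be base-period budget shares estimated from the same data before comparison with the HBS.
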
References

[1] Big Data: Lessons from the Leaders, The Economist Intelligence Unit Limited, 2012.

[2] André B. Bondi, "Characteristics of scalability and their impact on performance", Proceedings of the 2nd International Workshop on Software and Performance, Ottawa, Ontario, Canada, 2000, ISBN 1-58113-195-X.

[3] http://www.google.org/flutrends/intl/en_gb/about/how.html

[4] http://www.bbc.co.uk/news/science-environment-22293693