REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION




Pilar Rey del Castillo
May 2013

Introduction

The exploitation of the vast amount of data originating from ICT tools and referring to a wide variety of economic and social activities is both a challenge and an opportunity for official statistics. But it still remains to be discovered how to extract significant value for the production of statistical figures from the diversity of data available. This paper proposes to start exploring some Big Data sources, following a straightforward path to achieve results. The next section provides a preliminary insight into the issues at stake, and the one after it presents some ideas for starting a road map from Eurostat.

Some defining features of Big Data

The most conspicuous feature of Big Data, compared with traditional statistical sources, is that they do not come from a prior design aimed at obtaining specific statistics, but become available as traces of human activity. This attribute makes it difficult to use traditional statistical methods and tools such as probabilistic sampling, statistical classifications and so on, rendering the Generic Statistical Business Process Model useless and inapplicable. The outcomes of a recent survey conducted among executives of a wide range of industries around the world are revealing: although a great part of the respondents agreed that data had become an important factor for their business, many companies were struggling with basic aspects of data management, still attempting to exploit it effectively [1]. This confirms that extracting useful information from this kind of data is a non-obvious and rather difficult task that should be carefully planned. Although the appearance of some statistical figures based on Big Data may suggest that the information is found in the data directly ready for publication, one should consider the huge amounts of data that were analysed and processed beforehand to achieve those results. Nevertheless, the attraction of the potential reduction of respondent burden and costs, together with the general framework of improving the productivity of the European Statistical System (ESS), puts increasing pressure on the use of Big Data as a source of statistical information.

Experience with the use of another source that shares with Big Data this feature of not being designed for statistical purposes, namely administrative data, may illuminate the road map, by comparing the features they have in common and those that make a difference. The main aspects of administrative data we are interested in considering are the following:

a) Methods to obtain statistical information from administrative data usually depend on the specific data, making it difficult to establish general rules or to use generic production process models.

b) Administrative data are not structured as statistical data are, that is, they do not use statistical classifications and definitions, but they still show a certain structure related to the purpose of their creation. This means that some work of translating, linking or harmonizing the structures (units, definitions, classifications...) must always be done.

c) Sampling procedures are not used to obtain the reporting units, but there is frequently an idea of their representativeness of the population of interest (sometimes all the population units are included).

d) The volume of administrative data is not usually a problem, and they may be treated with the statistical procedures used for other typical sources.

e) The ways in which they are increasingly being used to produce statistical figures can be classified as (i) totally replacing statistical sources, (ii) partially replacing statistical sources, completing the information by means of record linkage, matching or other procedures, and (iii) providing completely new statistical figures that may complement the available statistical information from other perspectives. The first two ways may in theory yield significant reductions in costs and respondent burden, but they frequently imply new tasks of translating, linking or harmonizing which are not necessary when completely new statistical figures are produced. An example of this last case could be the figures on registered unemployment.

As for Big Data, and concerning the same corresponding features:

a) Owing to the heterogeneity of the Big Data available, methods to produce statistical information must be developed ad hoc for each case, exactly as for administrative data.

b) Some Big Data have a certain structure related to the source of information, and some are just unstructured text strings. Good metadata are not usually available, and in most cases the work of harmonizing or translating into statistical structures would seemingly be enormous.

c) Apart from not using sampling procedures, Big Data frequently come from private companies, and their representativeness and coverage of the populations of interest for official statistics are difficult to assess.

d) The name Big Data refers precisely to the huge volume. This dimension has an impact on storage and processing, falling frequently outside the scope of traditional statistical tools.

e) The way Big Data could be used to produce statistical figures raises a crucial issue. It seems unlikely that Big Data able to totally or partially replace statistical sources will be found in the short term, for the reasons explained in the previous points, and following that path may be too expensive in time and resources. A sound approach would thus be to start searching for sources that could provide completely new and independent statistical figures, not adapted to traditional statistical structures but offering new perspectives. For example, instead of looking for sources to substitute the Household Budget Survey (HBS), one could try to build indicators of its trends over time. When improvements in this area are achieved, the new set of statistics available will provide a valuable basis for re-designing the products and the production process of official statistics.
There may be opportunities to tackle the specific problems of Big Data by using suitable tools:

1. An apparently critical problem is the volume of the Big Data available: there is a need to move away from exclusive dependence on statistical methods that cannot handle this volume of information, and to adopt a more diverse set of tools. This can be addressed through the use of algorithms specially developed for this purpose, such as data mining methods. These algorithms have the required computational efficiency and are scalable, that is, they have the ability to handle a growing amount of work in a capable manner, or to be enlarged to accommodate that growth [2]. The state of the art provides a great variety of data mining tools for different objectives: classification, clustering, regression, association, feature extraction... A first stage of exploration using data mining procedures should usually be carried out to learn about the unknown structure of the data and the possible outcomes, later combining this with traditional statistical procedures. The type of Big Data and its form determine the type of data mining tool to be used. Thus the statistical production process from Big Data should have an exploratory analysis as its first step; a combination of data mining and traditional statistical procedures may follow to produce the best results. A minimal sketch of such an exploratory first step is given below.
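
The following sketch assumes the raw records arrive as unstructured text strings and uses scikit-learn; the records, the number of clusters and all parameter values are illustrative placeholders, not a prescription:

```python
# Exploratory clustering of unstructured text records: a minimal sketch.
# All data and parameters are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder records standing in for a (much larger) Big Data extract.
records = [
    "payment card transaction grocery store",
    "search query flu symptoms fever",
    "payment card transaction petrol station",
    "search query flu vaccine near me",
    "payment card transaction grocery market",
    "search query stock market index today",
]

# Turn free text into a sparse numeric matrix (feature extraction).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(records)

# Cluster to learn about the unknown structure of the data.
k = 2  # illustrative; in practice chosen after inspecting several values
km = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Inspect the terms that best characterise each cluster.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for c in range(k):
    top = ", ".join(terms[i] for i in order[c, :4])
    print(f"cluster {c}: {top}")
```

The clusters found in such a pass suggest which statistical structures the source could support; traditional procedures would then be applied to the promising segments.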

2. Another important concern is the representativeness and validity of the statistics produced. The use of probabilistic sampling in traditional statistics provides a theoretical framework that ensures confidence in the figures produced, with accuracy assessed through sampling errors. Most of the Big Data available cannot be adapted to this framework, and other procedures should be devised. This seems to be an important weakness of the use of Big Data, and efforts should be focused on it. Meanwhile, experiences of successful uses of Big Data could be investigated so as to follow a similar approach. Two well-known examples are briefly considered here. The first refers to the estimates of the incidence of flu in different countries and regions around the world derived from Google searches for flu-related topics [3]; these estimates have been found to match traditional flu activity indicators very closely. Similarly, a recent article in BBC News [4] reported that Google searches for finance-related terms may predict moves in markets, and that an investment strategy based on these search volume data between 2004 and 2011 would have made a profit of 326%.

These examples have two important features in common (apart from being Google products) that may help with the problems of representativeness, coverage and validity. The first is that both of them estimate changes or movements over time, not absolute figures. A well-established statistical principle is that it is more reliable to estimate changes (over time or space) than absolute figures, because some biases and errors cancel out when computing the change: perhaps the first attempts to use Big Data should aim at producing estimates of changes or evolutions. The other relevant feature shared by both examples is the criterion used to evaluate the results. What is estimated are proxy variables that perform well in following the movements of a phenomenon of interest; that is, the performance is assessed in terms of its similarity to other available figures measuring the same or an analogous thing. In the same way, the performance of Big Data could be evaluated, in a first instance, by its similarity or agreement with other available measures, rather than by a sampling-error criterion. This makes sense from a data mining perspective, where the equivalent of fitting a model is tuning an algorithm so that it fits the real world. The sketch below illustrates this evaluation criterion.
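
A minimal sketch of the criterion, comparing period-on-period changes of a Big Data proxy index with those of a traditional indicator; both series are illustrative placeholders, not real data:

```python
# Evaluating a Big Data proxy by agreement with an existing indicator:
# compare period-on-period changes, not absolute levels.
import numpy as np

# Illustrative placeholder series (e.g. twelve monthly observations).
proxy = np.array([100, 103, 101, 106, 110, 108, 112, 115, 113, 118, 121, 119], dtype=float)
reference = np.array([200, 204, 201, 210, 216, 213, 220, 226, 222, 231, 236, 233], dtype=float)

# Relative period-on-period changes; level biases common to all periods cancel out.
proxy_chg = np.diff(proxy) / proxy[:-1]
ref_chg = np.diff(reference) / reference[:-1]

# Agreement of movements as a first quality measure, in place of sampling errors.
corr = np.corrcoef(proxy_chg, ref_chg)[0, 1]
same_sign = np.mean(np.sign(proxy_chg) == np.sign(ref_chg))
print(f"correlation of changes: {corr:.3f}")
print(f"share of periods moving in the same direction: {same_sign:.0%}")
```

The same comparison can be repeated against several alternative indicators; consistently high agreement across them strengthens the case for the proxy.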

When many different statistical figures are produced from different and independent Big Data sources following these principles, the coherence and agreement among them may be an argument supporting the validity and representativeness of the whole system.

3. Although Big Data may not be structured as statistical data are, they may have the same type of structure (or lack of structure) across countries. This would have the advantage of making the process of harmonization between countries unnecessary, which is of special interest for transnational statistics.

4. There are other concerns about Big Data that seem similar to the case of statistical sources, such as the appearance of diverse types of problems or errors: noise, incompleteness, missing data, reporting errors, outliers... Data editing (cleaning, checking, imputing...) consumes considerable time and resources in traditional statistical processing, and similar methods could be used here. It is likely that some errors (reporting, incompleteness...) occur less frequently in Big Data because of the absence of human intervention at their origin, although machine or system failures may happen as well, producing other errors. A new type of problem that does not occur with statistical sources but may emerge in Big Data is imprecision (for instance, vague or categorical measures such as high, medium, low...): it may be attacked using other data mining tools such as fuzzy and rough sets. Some data mining procedures are interesting because they are robust, in the sense of being tolerant of erroneous data or of departures from data assumptions. In any case, all these methods should be developed on an ad hoc basis; a minimal sketch of one such robust editing step follows.
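
The sketch flags outliers with a median/MAD rule (tolerant of erroneous values, unlike rules based on the mean and standard deviation) and imputes missing values with the median; the data and the threshold are illustrative, and the rule is one common robust choice among several:

```python
# Robust data editing sketch: outlier flagging with a median/MAD rule
# and median imputation of missing values. Data and threshold are illustrative.
import numpy as np

values = np.array([12.0, 11.5, np.nan, 12.3, 11.8, 95.0, 12.1])  # placeholder data

ok = ~np.isnan(values)
median = np.median(values[ok])
mad = np.median(np.abs(values[ok] - median))  # median absolute deviation

# Flag values more than 5 scaled MADs from the median (machine-failure spikes etc.).
scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation under normality
outlier = ok & (np.abs(values - median) > 5 * scale)

# Impute missing values and replace flagged outliers with the robust centre.
edited = values.copy()
edited[~ok | outlier] = median
print("flags:", outlier)
print("edited:", edited)
```
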
A final remark is that the opportunity offered by Big Data rests on the reduction of respondent burden and costs, which may sometimes be obtained quickly. Hence, before engaging in a complex process to make a source reliable for statistics, a careful analysis of the potential gains should be made. In other words, the reductions in costs, burden and production time provided by the use of Big Data may balance a possible decrease in accuracy, or in quality in general.

A possible road map to exploit Big Data

This section sketches out a few actions that Eurostat may promote as a first step towards exploiting Big Data sources. These actions are to:

1. Identify possible Big Data sources. These may be private or public, internet or non-internet, the interest lying especially in sources that have international scope and are appropriate for producing indicators of trends or changes in different economic and social activities. Access to these data and possible problems (confidentiality, ownership...) should be studied as well.

2. Gather information on practices in European countries regarding the use of Big Data for producing statistics, classifying the methods and tools and the outcomes produced. This would provide information on alternative approaches.

3. Launch pilot research projects to produce statistical figures from the identified Big Data sources. An example of a possible research exercise using a non-internet Big Data source is the production of an indicator of the evolution of household budgets from the transaction records of a department store. It may be obtained using association rules as a first step, later computing weighted indices; the results can then be checked against the outcomes of alternative sources such as the annual HBS. A rough sketch of such a computation is given below.
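
What follows is only a sketch of what such a pilot computation could look like, under strong simplifying assumptions: baskets are reduced to sets of product categories, one-to-one association rules are mined by simple counting, and a base-weighted (Laspeyres-type) index is computed; all data, thresholds and category names are invented for illustration only:

```python
# Pilot-exercise sketch: simple association rules over transaction baskets,
# then a base-weighted expenditure index. All figures are illustrative.
from itertools import combinations
from collections import Counter

baskets = [  # placeholder department-store baskets (sets of product categories)
    {"food", "cleaning"}, {"food", "clothing"}, {"food", "cleaning"},
    {"clothing", "household"}, {"food", "cleaning", "household"},
]

n = len(baskets)
item_count = Counter(i for b in baskets for i in b)
pair_count = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

# One-to-one rules A -> B whose support and confidence exceed illustrative thresholds.
for pair, c in pair_count.items():
    a, b = tuple(pair)
    for x, y in ((a, b), (b, a)):
        support, confidence = c / n, c / item_count[x]
        if support >= 0.4 and confidence >= 0.6:
            print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")

# Base-weighted index of household expenditure by category (base period = 100).
base = {"food": 100.0, "cleaning": 40.0, "clothing": 60.0, "household": 50.0}
current = {"food": 110.0, "cleaning": 38.0, "clothing": 66.0, "household": 52.0}
weights = {k: v / sum(base.values()) for k, v in base.items()}
index = 100 * sum(weights[k] * current[k] / base[k] for k in base)
print(f"expenditure index (base=100): {index:.1f}")
```

In practice the rules would be mined from millions of transaction records, and the weights would be base-period budget shares estimated from the same data before comparison with the HBS.
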
References

[1] Big Data: Lessons from the Leaders, The Economist Intelligence Unit Limited, 2012.

[2] André B. Bondi, "Characteristics of scalability and their impact on performance", Proceedings of the 2nd International Workshop on Software and Performance, Ottawa, Ontario, Canada, 2000, ISBN 1-58113-195-X.

[3] http://www.google.org/flutrends/intl/en_gb/about/how.html

[4] http://www.bbc.co.uk/news/science-environment-22293693