Annex: Concept Note
Friday Seminar on Emerging Issues
Big Data for Policy, Development and Official Statistics
New York, 22 February 2013

How is Big Data different from just very large databases? [1]

Traditionally, data processing for analytic purposes followed a fairly static blueprint: create modest amounts of structured data with stable data models. Data integration and processing tools are used to extract, transform and load the data from enterprise applications and administrative databases to a staging area, where data quality checks and data normalization (hopefully) occur and the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine occurs on a scheduled basis, usually daily, weekly, monthly or annually, though sometimes more frequently.

From there, data warehouse administrators create and schedule regular reports to run against the normalized data stored in the warehouse or some other dissemination facility; these reports are distributed to a wide range of users in government, business, the media and the community at large. Administrators also create dashboards and other limited visualization tools for executives and management. Analysts, meanwhile, use data analytics tools and engines to run more advanced analytics against the warehouse or other dissemination facility, or, more often, against sample data migrated to a local data mart due to size limitations. Non-expert users perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence tools. Data volumes in traditional data warehouses rarely exceed multiple terabytes (and even that much is uncommon), as large volumes of data strain warehouse resources and degrade performance.

[1] Most of the text in this section comes from a Big Data Manifesto from the Wikibon Community by Jeff Kelly; see http://wikibon.org/wiki/v/big_data:_hadoop,_business_analytics_and_beyond

The changing nature of Big Data

The advent of the Web, mobile devices and other technologies such as sensor networks has caused a fundamental change to the nature of data. Big Data has important, distinct qualities that differentiate it from traditional institutional data. Data are no longer centralized, highly structured and easily manageable, but are highly distributed, loosely structured (if structured at all), and increasingly large in volume.
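Before turning to those qualities in detail, a minimal, hypothetical sketch may help anchor the contrast with the traditional batch blueprint described above. The file name, column names and cleaning rules below are invented for illustration, and SQLite merely stands in for an enterprise warehouse:

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from an application's CSV export.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: basic data-quality checks and normalization in "staging".
        cleaned = []
        for row in rows:
            try:
                record = (row["order_id"].strip(),
                          row["order_date"].strip(),       # assumed ISO 8601
                          round(float(row["amount"]), 2))  # numeric amounts only
            except (KeyError, ValueError):
                continue                                   # drop rows that fail checks
            cleaned.append(record)
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: append the modeled rows into the warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales "
                    "(order_id TEXT, order_date TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        # In practice this job would run on a schedule, e.g. nightly.
        load(transform(extract("orders_export.csv")))

Every stage of this flow assumes stable structure and a batch schedule, which is precisely what the qualities listed below undermine.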
[Figure: the changing nature of data. Source: Microsoft]

Specifically:

- Volume: The amount of data created both inside corporations and outside the firewall via the web, mobile devices, IT infrastructure and other sources is increasing exponentially each year.

- Type: The variety of data types is increasing, notably unstructured text-based data and semi-structured data such as social media data, location-based data and log-file data.

- Speed: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, are increasing thanks to the digitization of transactions, the emergence of sensor networks, mobile computing and the sheer number of internet and mobile device users.

Broadly speaking, Big Data is generated by a range of sources, including:

- Mobile Devices: There are over 5 billion mobile phones in use worldwide. Each call, text and instant message is logged as data. Mobile devices, particularly smart phones and tablets, also make it easier to use social media and other data-generating applications. Mobile devices also collect and transmit location data.
- Internet Transactions: Billions of online purchases, funds transfers, stock trades and other transactions happen every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit card issuers, credit agencies and others.

- Networked Devices and Sensors: Electronic devices of all sorts, including servers and other IT hardware, smart energy meters, and temperature and other sensors, create semi-structured log data that record every action.

- Social Networking and Media: There are currently over 700 million Facebook users, 250 million Twitter users and 156 million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points, structured, semi-structured and unstructured, sometimes called Data Exhaust.

Source: The Informatica Blog

New approaches to Big Data processing and analytics

Traditional data warehouses and other data management tools are not designed for processing and analyzing Big Data in a time- or cost-efficient manner.
Namely, data must be organized into relational tables (neat rows and columns) before a traditional enterprise data warehouse can ingest it. Due to the time and manpower required, applying such structure to vast amounts of unstructured data is impractical. Further, scaling up a traditional enterprise data warehouse to accommodate potentially petabytes of data would require unrealistic financial investments in new and often (depending on the vendor) proprietary hardware. Data warehouse performance would also suffer, due to a single choke point for loading data. Therefore, new ways of processing and analyzing Big Data are required.

There are a number of approaches to processing and analyzing Big Data, but most share some common characteristics. Namely, they take advantage of commodity hardware to enable scale-out, parallel processing techniques; employ non-relational data storage capabilities in order to process unstructured and semi-structured data; and apply advanced analytics and data visualization technology to Big Data to convey insights to end-users.

[Figure: Big Data processing approaches. Source: Wikibon 2012]

In order to take full advantage of Big Data, however, enterprises must take further steps. Namely, they must employ staff with the knowledge and skills to deploy advanced analytics techniques on the processed data to reveal meaningful insights. People with such knowledge and skills are now often described as Data Scientists; they perform this sophisticated work in one of a handful of languages or approaches, including Hadoop, SAS and R.
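As a toy illustration of the scale-out, parallel processing idea, the sketch below splits loosely structured text into chunks, counts words in each chunk in parallel worker processes (stand-ins for commodity nodes), and merges the partial results. The sample data and worker count are invented, and this is only the shape of the computation, not how Hadoop itself is programmed:

    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(lines):
        # "Map" step: count words in one chunk of loosely structured text.
        counts = Counter()
        for line in lines:
            counts.update(line.lower().split())
        return counts

    def word_count(lines, workers=4):
        # Partition the input so each worker process gets a roughly equal share.
        chunks = [lines[i::workers] for i in range(workers)]
        with Pool(workers) as pool:   # worker processes stand in for cluster nodes
            partials = pool.map(map_chunk, chunks)
        # "Reduce" step: merge the partial counts into a single result.
        total = Counter()
        for part in partials:
            total.update(part)
        return total

    if __name__ == "__main__":
        sample = ["big data is big", "data about data"] * 1000
        print(word_count(sample).most_common(3))

Frameworks such as Hadoop apply the same map-and-merge pattern across many machines and redistribute work when nodes fail; commodity hardware makes adding capacity a matter of adding nodes rather than buying a larger server.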
The results of this analysis can then be operationalized via Big Data applications, either homegrown or off-the-shelf. Vendors are also developing business intelligence-style applications that allow non-power users to interact with Big Data directly.

The context of Official Statistics

National Statistical Offices (NSOs) have started to explore how best to harness the phenomenon of Big Data in their mission to supply quality statistics for improving economic performance, social well-being and environmental sustainability. Some of the issues raised [2] are:

- Should NSOs expand their business operations to take on the opportunities of using Big Data for official government purposes?
- Should NSOs take on a new mission as a trusted third party whose role would be to certify the statistical quality of many of these newly emerging private sector sources?
- Should NSOs become a clearing house for statistics from non-traditional sources that meet their quality standards?
- Should NSOs use non-traditional sources to supplement (and perhaps replace) their official series?
- How might NSOs acquire people with the knowledge and skills to take effective advantage of Big Data for official statistics purposes?

[2] These issues are being considered by the High-Level Group for Strategic Developments in Business Architecture in Statistics, which reports to the Conference of European Statisticians.

For example, the Billion Prices Project collects price information over the internet and computes a price index to estimate inflation. The index is published daily with a three-day lag, as opposed to the official inflation numbers, which are published monthly with an even longer lag. A quick turn-around allows for early detection of inflation trends and may allow policy makers to tailor policies in a much more timely manner. If governments wanted to, they could already let Big Data play a role in providing some information on areas that are currently the responsibility of national statistical offices.
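A deliberately simplified sketch of how such a web-scraped index can be computed is shown below: for products observed on two consecutive days, take the geometric mean of the price relatives, a Jevons-type elementary index. The products and prices are invented, and the actual Billion Prices Project methodology (sampling, matching, aggregation) is considerably richer:

    import math

    def jevons_index(prices_prev, prices_curr):
        # Geometric mean of price relatives p_t / p_(t-1) over matched products.
        matched = [p for p in prices_curr if p in prices_prev]
        if not matched:
            raise ValueError("no products matched across the two days")
        log_relatives = [math.log(prices_curr[p] / prices_prev[p]) for p in matched]
        return math.exp(sum(log_relatives) / len(matched))

    # Hypothetical prices scraped for the same items on two consecutive days.
    monday  = {"milk_1l": 1.00, "bread": 2.50, "rice_1kg": 1.80}
    tuesday = {"milk_1l": 1.02, "bread": 2.50, "rice_1kg": 1.83}

    print(f"Day-on-day index: {jevons_index(monday, tuesday):.4f}")  # ~1.0122

Chaining such day-on-day indices yields a daily price level that can be compared against the monthly official series.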
The attraction of Big Data lies in the sheer amount of data that could be available in, or near, real time. Potentially, Big Data could be used as intelligence to respond better to emergency situations. Satellite imaging and information gathered from mobile devices can be used in both developed and developing countries.

Big Data presents an opportunity for the official statistical community to better meet its mission of disseminating timely, quality statistics. Building on the experiences of the private and public sectors, NSOs, and national and international statistical systems more generally, have an opportunity to expand into an area that could provide a new range of relevant information in a timely manner.

The use of Big Data has a number of upsides but also many challenges, related to security, privacy, analysis and interpretation. Analyses and results emerging from the use of Big Data should be properly checked and documented for their quality, validity and limitations. Practical challenges include using commercial infrastructure (capacity and computational power) to store, mine and analyse Big Data, and developing the appropriate enterprise architectures within statistical organizations. In moving from traditional data collection to the procurement and use of Big Data, the statistical community will also need to address the skill gap around Big Data administration and Big Data analytics, or data science.

In order for Big Data to truly gain mainstream adoption and achieve its full potential for official statistical purposes, it is critical that the statistical community does not ignore Big Data, but instead recognizes the use of Big Data as part of its information management model, prepares an inventory of the state of play and formulates the implications for official statistics.