Big Data: What Can Official Statistics Expect?



Big Data: What Can Official Statistics Expect? Peter Hackl Österreichische Statistiktage 2015

Outline
- Data Needs in Official Statistics
- Alternative Data Sources
- Historical Facts
- Some Initiatives in Detail
- Big Data: Concepts
- Big Data: Potentials and Challenges
- Conclusions
Oct 2015 Hackl, ÖSG Statistiktage

Data Sources in Official Statistics
- Sample survey
  - Systematic use of statistical methodology
  - Direct control over data collection
  - High cost, quality issues (non-response, survey errors)
  - Response burden
- Census
  - Allows results for small geographic areas, population sub-groups
  - High cost
- Administrative bodies: data for specific purposes, containing information on a complete group of units, updated continuously
  - Tax data; credit card data; social insurance data; counts of births, deaths, etc.
  - Quality issues
- Alternative sources, e.g., insurance companies, retail businesses, etc.

Scanner Data
- Barcode scanning: transaction data generated by retailers at point-of-sale terminals
  - Billing of retail sales
  - Documentation of transactions
  - Basis for accounting, warehousing, sales forecasts, analyses, etc.
- Basis for price indices, e.g., CPI?

Scanner Data in Official Statistics
- Advantages
  - Reduction of response burden on enterprises
  - Productivity gains for NSIs
  - Improved quality of price statistics
- Issues
  - Methodological issues, e.g., treatment of rebates, potential biases
  - Investment costs
  - Partnership with data providers
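To make the idea of scanner data as a basis for price indices more concrete, here is a minimal sketch. It aggregates transactions into unit values per item and period and then takes an unweighted geometric mean of price relatives (a Jevons-type index). The record layout, the item codes and the choice of index formula are assumptions for illustration only; the slides do not prescribe a method, and the sketch ignores rebates.

```python
from collections import defaultdict
from math import exp, log

# Hypothetical scanner records: (period, item_code, quantity, turnover)
transactions = [
    ("2015-01", "A", 10, 20.0),
    ("2015-01", "A", 5, 10.0),
    ("2015-01", "B", 4, 12.0),
    ("2015-02", "A", 8, 17.6),
    ("2015-02", "B", 6, 18.0),
]

def unit_values(records):
    """Return {period: {item: total turnover / total quantity}}."""
    totals = defaultdict(lambda: [0.0, 0.0])  # (period, item) -> [quantity, turnover]
    for period, item, qty, turnover in records:
        totals[(period, item)][0] += qty
        totals[(period, item)][1] += turnover
    result = defaultdict(dict)
    for (period, item), (qty, turnover) in totals.items():
        result[period][item] = turnover / qty
    return result

def jevons_index(prices, base, current):
    """Unweighted geometric mean of price relatives over items sold in both periods."""
    matched = sorted(set(prices[base]) & set(prices[current]))
    if not matched:
        raise ValueError("no items matched between the two periods")
    log_sum = sum(log(prices[current][i] / prices[base][i]) for i in matched)
    return 100.0 * exp(log_sum / len(matched))

prices = unit_values(transactions)
index = jevons_index(prices, "2015-01", "2015-02")
print(round(index, 2))
```

Only items observed in both periods enter the index, which is one of the places where the potential biases mentioned on the slide can arise.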

Scanner Data in Official Statistics
- ESS Task Force "Multi-purpose consumer price statistics" with subproject "Scanner Data"; support of EU member states by Eurostat
- Eurostat is working on guidelines on obtaining and using scanner data
- European NSIs using scanner data in estimating the CPI
  - The Netherlands
  - Sweden
  - Norway
  - Switzerland
- 17 EU member states are working on the use of scanner data for the production of CPIs; 10 of them experiment with scanner data
- NBS China, Statistics South Africa, and others have projects

Alternative Data Sources in Use
Type of data: areas of potential use in official statistics
- Scanner data: CPI, price statistics, economic statistics
- Mobile phone call/text times and positions: tourism statistics, population and migration statistics
- Traffic sensors: transport statistics
- Smart energy meter data: population, housing statistics
- Satellite images, remote sensor data: agriculture, forestry, fishery, environment statistics
- Social media data: labour statistics, population and migration statistics, income and consumption, health

Mobile Phone Data
- Mobile phone data are information on calls and transmissions of text (SMS)
  - positions
  - times
- Potential use for
  - tourism statistics: tourism flows (inbound, outbound, domestic; same-day)
  - population, migration and mobility statistics
- Eurostat: Feasibility Study on the Use of Mobile Positioning Data for Tourism, 2012-14
  - Participants from Estonia, Finland, France, and Germany
  - Technical, financial, and legal aspects, methodological and quality issues
  - Partnership with mobile network operators
- Projects by Istat, ONS, Slovenia, New Zealand

Other Alternative Data Sources
- Road traffic sensors
  - Traffic loops, traffic webcams, toll payment systems, etc.
  - Statistics Finland: transport statistics, models for commuting times
  - CBS: transport statistics, traffic statistics
- Smart energy meter data
  - ONS: population, migration, mobility
- Satellite images, remote sensing data
  - Agriculture, forestry, fisheries, and environment statistics
  - Projects by ABS, StatCan
- Social media data
  - Statistics on health, income and consumption, labour, population and migration, tourism
  - Projects by ABS, INEGI (Mexico)

Internet Data
- Social media data
  - Facebook, Twitter, etc.
  - Blogs, comments, etc.
- Internet searches
- Emails, text messages
- Business data, e-commerce
  - Prices of books, CDs, electronics, photo equipment, toys, etc. (cf. sites like Amazon or Geizhals)
  - Prices for flights, hotels, rental cars, etc.
- Internet of things
  - Sensors: home automation, security, cars
- Data from computer systems: logs

Global Pulse Initiative
- Initiative of UN Secretary-General Ban Ki-moon, 2009
- Data innovation projects on global issues ranging from public health to climate change, food security to employment
- Network of labs in New York, Kampala, and Jakarta, in collaboration with UN agencies, governments, academic and private sector partners
- Accelerating discovery, development and adoption of Big Data innovations for sustainable development and humanitarian action
- July 2015: publication of 20 case studies, e.g.,
  - Nowcasting food prices in Indonesia using social media signals
  - Using mobile phone activity data for disaster management during floods; cf. the EU-funded project Bridge
  - Estimating migration flows using online search data
- www.unglobalpulse.org/big-data-development-case-studies

Alternative Data: Some Issues
- Partnership with data owners
  - Little experience with owners from the private sector, global companies
  - Sustainability
- Considerable investments
- Methodological issues
  - Representativity, quality
- New tools and skills
  - IT tools for handling large amounts of data, internet data
  - Data scientists, i.e., experts with skills in statistics, data engineering, high performance computing, data warehousing, etc.
- Legal issues: access to data, personal data protection

Big Data in Official Statistics
- Various NSIs started to experiment with alternative data sources: ABS, CBS, ISTAT, INEGI, et al.
- Since about 2010, the notion "Big Data" has come into use in official statistics
  - Google search for "Big Data official statistics": 101 million results
- Initiatives at the UN level
- Initiatives at the EU level

Big Data at the UN Level
- 2009: Global Pulse Initiative of UN Secretary-General Ban Ki-moon
  - Big Data innovations for sustainable development
  - Case studies on the use of Big Data and analytics
- Mar 2013: frame programme to the UNSC 2013
  - Seminar on Emerging Issues: Big Data for Policy, Development and Official Statistics
  - Chief statisticians (India, Australia, NL, SA, et al.), J. Goodnight (SAS), H. Varian (Google), M. Wood (Amazon, Chief Data Scientist)
- May 2014: with mandate of the UNSC 2014
  - Establishment of the UN GWG on Big Data for Official Statistics
- Oct 2014, Beijing: UNSD & NBS China, International Conference on Big Data for Official Statistics

Big Data at the EU Level
- Oct 2012, St Petersburg: HL-Seminar on Streamlining Statistical Production and Services
  - Need for "a document explaining the issues surrounding the use of Big Data in the official statistics community"
- June 2013, Geneva: Conference of European Statisticians (CES)
  - Report of the Task Team of the HL Group for the Modernisation of Statistical Production and Services: "What does Big Data mean for official statistics?"
  - Task Team: experts from ABS, StatCan, Istat, CBS, Eurostat, UNECE
  - www1.unece.org/stat/platform/display/hlgbas
- Sept 2013, DGINS: Scheveningen Memorandum
  - ESS action plan and roadmap by mid-2014

Big Data at the EU Level, cont'd
- June 2014: Big Data Roadmap and Action Plan
  - ESS BIGD project
- Mar 2015, Brussels: New Techniques and Technologies for Statistics
  - Satellite Workshop on Big Data

Other Big Data Events
- June 2014, Vienna: Q2014, several sessions on Big Data
- Sept 2014: UN Climate Summit
  - Presentation of winning projects of the Big Data Climate Challenge organized by UN Global Pulse
- Oct 2014, Beijing: International Conference on Big Data for Official Statistics
- Oct 2014, Da Nang: IAOS 2014, papers on Big Data
- Mar 2015, Brussels: New Techniques and Technologies for Statistics 2015, papers on Big Data
- Mar 2015, Rome: Big Data in Official Statistics
- Apr/May 2015, Washington: UNECE Workshop on Statistical Data Collection: Riding the Wave of the Data Deluge, papers on Big Data

Other Big Data Events, cont'd
- Aug 2015: ISI World Statistics Congress (WSC)
  - Keynote speakers on Big Data
  - Various sessions on Big Data
- Oct 2015: 2nd Global Conference on Big Data for Official Statistics, Abu Dhabi
  - Organized by the UN Global Working Group (GWG) on Big Data for Official Statistics

Initiatives on Big Data for Official Statistics
- UN GWG on Big Data for Official Statistics
- The HLG Big Data Project
- The ESS BIGD Project
- Survey on Big Data
- The CROS Website Big Data

UN GWG on Big Data for Official Statistics
- Established in May 2014 based on a decision of the UNSC 2014
- Aims
  - Complement regional achievements
  - Provision of strategic vision, direction and coordination of a global programme
  - Promotion of practical use
- Group members
  - six developed countries: Denmark, Italy, Netherlands; Australia, Mexico, USA
  - six developing countries: Bangladesh, China, Colombia, Morocco, Philippines, Tanzania
  - seven international organizations: OECD, UNECE, UNSD, World Bank, etc.
- Guidelines, handbooks, pilot projects
- Conferences on Big Data for Official Statistics

Beijing Conference, Oct 2014
- International Conference on Big Data for Official Statistics
  - Organized by UNSD and NBS China
  - Inaugural meeting of the UN GWG on Big Data for Official Statistics
  - http://unstats.un.org/unsd/trade/events/2014/beijing/
- Programme
  - Terms of Reference of the UN GWG
  - Programme of work and deliverables
  - Reports on projects
    - Satellite imagery data: replacing agricultural surveys; ABS (Siu-Ming Tam), NBS China, INEGI (Mexico), Colombia
    - Social media data: various projects, e.g., estimation of job vacancy rates, by CBS, INEGI (Mexico), ISTAT, NBS China
    - Positioning and tracking data (mobile phones, GPS, vehicle tracking systems): applications for statistics on tourism, transport, daytime mobility, estimation of population census

UN GWG: Terms of Reference
1. To provide a strategic vision, direction and coordination for a global programme on Big Data for official statistics
2. To promote practical use of Big Data sources, including cross-border data, while building on existing precedents and finding solutions for the many existing challenges, including methodological, legal, privacy, security, and IT issues
3. To promote capacity building, training, sharing of experience
4. To foster communication and advocacy of use of Big Data
5. To build public trust in the use of private sector Big Data for official statistics

UN GWG: Task Teams
1. Advocacy and communication
2. Big Data and SDG indicators
3. Access and partnerships
4. Training, skills, capacity building
5. Cross-cutting issues (quality framework)
6. Mobile phone data
7. Satellite imagery
8. Social media data

Abu Dhabi Conference, Oct 2015
- 2nd Global Conference on Big Data for Official Statistics
  - Organized by UNSD, NBS of the UAE, ABS, and GCC-Stat
  - 2nd Meeting of UN GWG
- Objectives
  - First steps towards developing guidance which will support
    - training on Big Data issues
    - initiatives for Big Data projects
    - moving Big Data from pilots to production
  - Big Data for SDG indicators framework

The HLG Big Data Project
- Oct 2012, St Petersburg: Seminar of the UNECE High Level Group on Streamlining Statistical Production and Services (HLG)
  - Need for "a document explaining the issues surrounding the use of Big Data in the official statistics community"
- Task Team
  - experts from ABS, StatCan, Istat, CBS, Eurostat, UNECE
  - coordinator: UNECE Secretariat
- June 2013, Geneva: Conference of European Statisticians (CES)
  - Report of the Task Team of the HLG for the Modernisation of Statistical Production and Services: "What does Big Data mean for official statistics?"
  - www1.unece.org/stat/platform/display/hlgbas

HLG Big Data Project: Objectives
- To identify, examine and provide guidance for statistical organizations to identify the main possibilities offered by Big Data and to act upon the main strategic and methodological issues that Big Data poses for the official statistics industry
- To demonstrate the feasibility of efficient production of both novel products and 'mainstream' official statistics using Big Data sources, and the possibility to replicate these approaches across different national contexts
- To facilitate the sharing across organizations of knowledge, expertise, tools and methods for the production of statistics using Big Data sources

HLG Big Data Project: Output
- Wiki space "Big Data in Official Statistics"
  - Classification of types of Big Data
  - Big Data inventory
  - Sandbox, a technical platform to store and analyse large-scale datasets
  - Links and resources
- Other achievements
  - Survey "Skills necessary for people working with Big Data in Statistical Organisations"
- Conferences, workshops, etc.
  - New Techniques and Technologies for Statistics (NTTS) Conference 2013, 2015
  - Workshop on Big Data, Brussels (Mar 2015)

Big Data Sandbox
- Established within the HLG Big Data Project, launched 2014
- Web-accessible environment for storage and analysis of large-scale datasets
- For testing and exploring the use of Big Data for statistical production
- Sandbox infrastructure
  - Distributed computational environment, 28 machines, in Dublin
  - Big Data software tools like Hadoop, MapReduce, Pig and Hive, etc.
- Projects
  - 7 experiment teams
  - 4 to 6 methodologists and IT experts from different countries
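As a rough illustration of the MapReduce programming model named among the Sandbox tools, the following sketch runs the three phases (map, shuffle, reduce) in plain Python on a single machine. It does not use Hadoop or any Sandbox API; the traffic-loop readings and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical traffic-loop readings: (sensor_id, hour, vehicle_count)
readings = [
    ("loop-1", 8, 120),
    ("loop-2", 8, 95),
    ("loop-1", 9, 80),
    ("loop-2", 9, 60),
    ("loop-1", 8, 30),
]

def map_phase(record):
    """Emit one (key, value) pair per reading: the hour and the count."""
    _sensor_id, hour, count = record
    yield (hour, count)

def shuffle(pairs):
    """Group all values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)

mapped = chain.from_iterable(map_phase(r) for r in readings)
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)  # {8: 245, 9: 140}
```

In a distributed environment the map and reduce calls would run on different machines; the logic of each phase stays the same.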

Big Data Sandbox: Themes
Experiment teams are working on the following themes
- Consumer price indices: use of scanner data
- Mobile telephone data: statistics on tourism, daily commuting, etc.; data from Orange
- Smart meters: statistics on power consumption; real data from Ireland, synthetic data from Canada
- Traffic loops: traffic statistics; traffic loops data from The Netherlands
- Social media data: tourism flows; Twitter data from Mexico
- Job portals data: statistics on job vacancies; job advertisements
- Web scraping: test of different approaches for automatically collecting data from web sources
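One possible approach to automatically collecting data from web sources is to parse product listings out of HTML. The sketch below uses Python's standard `html.parser` module on an embedded page fragment; the markup, the class names `name` and `price`, and the products are invented for illustration, and real pages will differ and may restrict automated access.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
PAGE = """
<ul>
  <li class="offer"><span class="name">Camera X</span><span class="price">199.90</span></li>
  <li class="offer"><span class="name">Toy Y</span><span class="price">12.50</span></li>
</ul>
"""

class OfferParser(HTMLParser):
    """Collect (name, price) pairs from spans with class 'name' and 'price'."""

    def __init__(self):
        super().__init__()
        self.offers = []
        self._field = None    # which field the next text node belongs to
        self._current = {}    # fields collected for the offer being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the spans of interest
        self._current[self._field] = data.strip()
        self._field = None
        if {"name", "price"} <= self._current.keys():
            self.offers.append((self._current["name"], float(self._current["price"])))
            self._current = {}

parser = OfferParser()
parser.feed(PAGE)
offers = parser.offers
print(offers)
```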

Big Data Inventory
- Established within the HLG Big Data Project
- unece.org/stat/platform/display/bdi/unece+big+data+inventory+home
- The BD Inventory reports the following projects
  - Satellite images used for agriculture, forestry, fisheries, and environment statistics; ABS
  - Social media data for statistics on education, health, income and consumption, labour, population and migration; ABS
  - Internet price data from commercial transactions for CPI; Eurostat
  - Mobile phone call/text times and positions for tourism statistics; Eurostat
  - Commercial transaction data for ICT statistics; Istat
  - Mobile phone call/text times and positions for population and migration statistics; Istat, New Zealand, Slovenia

Priorities
- Establishment of priority areas
  - Quality
  - Partnership
  - Privacy
  - Skills
  - Methodology and technology
- Task teams to produce guidelines on quality, partnership, privacy

The ESS BIGD Project
- Activity within the ESS
- Aims
  - Implementation of the ESS Big Data Action Plan
  - Integration of Big Data sources into the production of European and national statistics
- Roadmap
  - Short term 2015-2016: analysis of legislation, strategy, ethics, communication
  - Medium term 2016-2020: pilots, partnerships, IT architecture, skills
  - Long term >2020: full integration into official statistics
- Big Data pilots 2016-2019
- Activity at Eurostat
  - Pilots exploring: mobile phone data, flight reservation systems, Google search data

Big Data Pilots
Within the ESS BIGD project, 2016-2019
Data sources (type of data): statistical domains
- Mobile communication (mobile phone data): tourism statistics, population statistics
- WWW (web searches, websites of businesses, commerce, real estate, job advertisements): labour, employment, migration, price statistics; business registers
- Sensors (traffic loops, smart meters, vessel identification, satellite images, webcams): transport, energy, emission, agricultural statistics
- Process-generated data (flight reservation systems, supermarket cashier data, loyalty programs, financial transactions, e-government, mobile payments): transport, air emission, consumption statistics
- Crowdsourcing (VGI websites): land use

Survey on Big Data
- UNSD/UNECE Survey on Big Data in statistical organizations in Sep 2014, report Oct 2014
- 78 NSIs, 28 international organizations
- Response: 32 NSIs, 3 international organizations
- 37% already work with Big Data, 43% are planning to
- 57 Big Data projects
- Potential areas for BD use: economic & financial (48%), demographic & social (44%), price (38%), labour (21%), etc.
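The response rates implied by these counts can be worked out directly. The short calculation below uses only the figures on the slide and assumes that 78 and 28 are the numbers of organizations that were approached.

```python
# Counts taken from the slide; their reading as "approached" vs "responded" is an assumption.
approached = {"NSIs": 78, "international organizations": 28}
responded = {"NSIs": 32, "international organizations": 3}

# Response rate per group, in percent
rates = {group: 100.0 * responded[group] / approached[group] for group in approached}

# Overall response rate across both groups, in percent
overall = 100.0 * sum(responded.values()) / sum(approached.values())

for group, rate in rates.items():
    print(f"{group}: {rate:.1f}%")
print(f"overall: {overall:.1f}%")
```

Under that assumption the rates are about 41% for NSIs, about 11% for international organizations and about 33% overall.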

The CROS Website Big Data
- Established by Eurostat within the Collaboration in Research and Methodology for Official Statistics (CROS)
- Information related to Big Data in the context of official statistics
- Strategic documents, e.g., the ESS Big Data Action Plan and Roadmap 1.0
- Relevant resources
  - introductory material
  - training courses
  - conference papers on Big Data
- Information on initiatives
  - projects in the ESS
  - European and international meetings on Big Data

How to Define Big Data?
- Huge masses of digital data are the result of
  - Modern technological, social and economic developments, including the growth of smart devices and infrastructure
  - The growing availability and efficiency of the internet
  - The appeal of social networking sites
  - The prevalence and ubiquity of IT systems
- A suitable and generally applicable definition has to cope with
  - The complexities of the structure and dynamics of the corresponding datasets
  - The challenges in developing suitable software tools for data analytics
  - The diversity of potential uses of the masses of available data

Big Data: Definitions
- Wikipedia: Big Data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
- Gartner: Big Data are data sources with a high volume, velocity and variety of data, which require new tools and methods to capture, curate, manage, and process them in an efficient way.

Big Data: A Popular Definition
- Definition by its characteristics along the dimensions (the 3 Vs)
  - Volume: refers to the number of data records, their attributes and linkages
  - Velocity: refers to the speed at which data are produced and changed, and to the pressure of managing large streams of real-time data
  - Variety: refers to the diversity of sources, formats, media, content
- More Vs
  - Variability: inconsistency of the data across time
  - Veracity: ability to trust that the data are accurate
  - Complexity: need to link multiple data sources

Big Data: The 3 (or More) Vs
Do not capture
- the enormous scope of the corresponding data sets
- the extensive potential for making use of these data
- the highly relevant aspect that Big Data are so large and complex that traditional database management tools and data processing applications are not feasible and efficient means

Types of Big Data Sources
Report of the Task Team of the HLG to the CES, 2013: Big Data come from various sources, such as
- Transactional data
  - E.g., scanner data, credit card transactions
- Sensor data
  - Satellite imaging, environmental sensors, road sensors
- Personal tracking data
  - E.g., from tracking devices such as mobile telephones, GPS
- Social media data
  - Tracks of human behaviour, e.g., online searches, online page viewing
  - Documentation of opinion, e.g., comments posted in social media
- Administrative data
  - E.g., tax data, medical records, insurance records, bank records

Big Data vs. Administrative Data
- The structure of administrative data is clear; Big Data usually do not have a clear structure
- Relevant metadata are usually available for administrative data, but not for Big Data
- The volume of administrative data may be big

Big Data: The View of Official Statistics
The potential of using Big Data to solve problems depends on
- what the problem is
- what sources of Big Data may contribute to the solution
- whether any inherent biases or measurement errors with those sources make them unsuitable for the solution

Expectations on Big Data
Expectations of official statistics in using Big Data
- Reduction of response burden
- Improved timeliness
- More detailed breakdowns
- Improved accuracy
- New indicators
- Reduction of costs of statistical production

Big Data: Challenges
- The report of the Task Team of the HLG to the CES mentions the following challenges
  - Legislative, i.e., with respect to the access and use of data
  - Privacy, i.e., managing public trust and acceptance of (private) data re-use and its link to other sources
  - Financial, i.e., potential costs of sourcing data vs. benefits
  - Management, e.g., policies and directives about the management and protection of the data
  - Methodological, i.e., data quality and suitability of statistical methods
  - Technological, i.e., issues related to information technology
- Similarly, priority areas of the HLG Big Data project: partnership, privacy, methodology and technology, skills, quality

Methodological Issues
- Challenges and issues depend on
  - Type of data
  - Use of data
- Illustrated by
  - Satellite images for agricultural statistics
  - Mobile positioning data for tourism flow statistics
  - Web scraping data for tourism accommodation statistics
  - Scanner data for price statistics

Satellite Images
- Agricultural statistics
  - Land cover, crop yield
  - Agricultural census
- Actors: Australia, Mexico, Colombia; Abu Dhabi, China
- Issues
  - Interpretation of satellite images, classifications
  - Sustainability
  - Quality: accuracy, relevance, etc.
- Classification of
  - Land use: agriculture, forest, grassland, mixed use, non-agricultural use, other uses
  - Agricultural use: type of crops, etc.
- INTERIMAGE: allows object extraction and the computation of spectral, geometric and topological features, texture

Satellite Images, cont'd
- Agricultural statistics
  - Updating the farm register: Istat
- Approach
  - Extraction of relevant information obtained by web scraping techniques from various hubs, e.g., regional websites, commercial organizations, etc.
- Issues
  - Unique identification of farms; a farm may be referenced in different hubs under different names
  - For the same farm, information derived from different hubs may be discordant
  - Assessment of the quality of the input, the results
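The identification issue, where one farm appears under different names in different hubs, is a record linkage problem. The following sketch shows one simple way to approach it with string similarity from Python's standard `difflib` module. It is not the method used by Istat; the farm names, the list of tokens dropped during normalization and the threshold of 0.85 are all assumptions.

```python
from difflib import SequenceMatcher

# Illustrative tokens that carry no identifying information (an assumption).
STOPWORDS = {"azienda", "agricola", "az", "agr", "srl"}

def normalize(name):
    """Lower-case, strip punctuation and drop uninformative tokens."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(tok for tok in cleaned.split() if tok not in STOPWORDS)

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link(records_a, records_b, threshold=0.85):
    """Pair each name from hub A with its best match in hub B, if similar enough."""
    links = []
    for name_a in records_a:
        best = max(records_b, key=lambda name_b: similarity(name_a, name_b))
        if similarity(name_a, best) >= threshold:
            links.append((name_a, best))
    return links

# Hypothetical farm names as they might appear in two different hubs
hub_a = ["Azienda Agricola Rossi", "Fattoria Bianchi S.r.l.", "Podere Neri"]
hub_b = ["Az. Agr. Rossi", "Fatoria Bianchi", "Cascina Verdi"]

links = link(hub_a, hub_b)
for pair in links:
    print(pair)
```

Names without a sufficiently similar counterpart, such as "Podere Neri" here, stay unlinked and would need clerical review or additional attributes such as addresses.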

Tourism Statistics
Potential data sources for tourism statistics
- Mobile positioning data, e.g., tourism flow statistics
- Other mobile phone data, e.g., log data
- Internet, social media, e.g., for tourism accommodation statistics
- Public transport data
- Electronic traffic loops, cameras
- Credit card data

Tourism Flow Statistics
Eurostat feasibility study on the use of mobile positioning data for tourism statistics
- Call for Tender (2012): tourism flow statistics
- Consortium: six agencies from Estonia, Germany, France, Finland
- May 2014: Prague Workshop discussed access, legal basis, methodological issues; prospects
- Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics. Consolidated Report, June 2014

Tourism Flow Statistics, cont'd
Mobile positioning data: Call Detail Records (CDR)
- One record for each contact (call, SMS, data session) between mobile device and telecom provider through a phone mast
- Contains ID of the mobile device, date and time of the contact, kind of communication (call, SMS, data), location of phone mast, receiver of the call (call, SMS)
Indicators derived from mobile positioning data
- Tourism flows: destinations, durations of stay
- Domestic tourism
- Same-day, domestic and inbound
- Inbound flows, based on roaming data (country code from SIM card)
- Number of overnight stays, covering not only stays in registered accommodations
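To make the step from CDRs to an indicator concrete, here is a minimal sketch of how inbound overnight stays might be counted. It is not the method of the Eurostat feasibility study. The record layout follows the fields listed above, but the field names, the `inbound_nights` function and the rule "a night is counted when a roaming device is seen on two consecutive calendar days" are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class CDR(NamedTuple):
    device_id: str       # ID of the mobile device
    timestamp: datetime  # date and time of the contact
    kind: str            # "call", "sms" or "data"
    mast_region: str     # location of the phone mast (here: a region code)
    sim_country: str     # country code from the SIM card


def inbound_nights(records, host_country):
    """Count nights per roaming device: one night for every calendar day
    on which the device is also observed on the following day."""
    days_seen = defaultdict(set)
    for r in records:
        if r.sim_country != host_country:  # inbound = foreign SIM card
            days_seen[r.device_id].add(r.timestamp.date())
    return {
        device: sum(1 for d in days if d + timedelta(days=1) in days)
        for device, days in days_seen.items()
    }


# Hypothetical records: A stays two nights, B is a same-day visitor,
# C is a domestic subscriber and is ignored for inbound flows.
records = [
    CDR("A", datetime(2015, 7, 1, 10, 0), "call", "Tyrol", "DE"),
    CDR("A", datetime(2015, 7, 2, 9, 30), "data", "Tyrol", "DE"),
    CDR("A", datetime(2015, 7, 3, 8, 15), "sms", "Tyrol", "DE"),
    CDR("B", datetime(2015, 7, 1, 12, 0), "call", "Vienna", "FR"),
    CDR("C", datetime(2015, 7, 1, 12, 0), "call", "Vienna", "AT"),
]
nights = inbound_nights(records, host_country="AT")
```

A device with zero nights corresponds to a same-day visit under this rule. The sketch ignores days without any phone contact, which is one source of the classification biases discussed on the next slide.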

Tourism Flow Statistics, cont'd
Issues
- Representativity: over- and under-coverage issues related to the use habits of mobile phones (during travels), the costs of roaming service, etc.
- Classification problems might cause biases, e.g., over-coverage of the same-day trips
- No information on: purpose of the trip, usual environment, type of accommodation, means of transport, expenditures
- Assessment of the quality of the mobile positioning data, the statistical processes, the results
Main conclusion of the Eurostat feasibility study
- Mobile positioning data may complement currently used methods

More on the Use of Mobile Positioning Data
Other projects on tourism flow statistics
- CBS, in cooperation with Vodafone
- Estonia
CDRs can similarly be used for
- Statistics on short-term migration, commuting
- Long-term migration statistics
- Population statistics
- Transport statistics (passengers)
Issues are, among others,
- Representativity
- Classifications

Mobile Phone Log Data
Log data on the use of mobile phones
- Pilots by the CBS, 2011 and 2012; data used for
  - Mobility statistics
  - ICT use statistics
Respondents provide
- Data produced by a special app in the phone; see next slide
- Background data: age, sex, income, region, composition of the family/group
Background data allow controlling the sample and weighting
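How background data can be used for weighting may be illustrated by post-stratification: each respondent receives a weight equal to the population share of his or her stratum divided by its sample share. This is a minimal sketch of that general technique, not the procedure used in the CBS pilots; the strata, the counts and the function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_counts):
    """Weight per stratum = population share / sample share.
    Strata absent from the sample get no weight and stay unrepresented."""
    n = len(sample_strata)
    total = sum(population_counts.values())
    sample_counts = Counter(sample_strata)
    return {
        s: (population_counts[s] / total) / (sample_counts[s] / n)
        for s in sample_counts
    }


def weighted_mean(values, strata, weights):
    """Weighted mean of the observed values using the stratum weights."""
    total_weight = sum(weights[s] for s in strata)
    return sum(v * weights[s] for v, s in zip(values, strata)) / total_weight


# Hypothetical pilot: young respondents are over-represented in the sample
strata = ["young", "young", "young", "old"]
km_per_day = [10.0, 10.0, 10.0, 30.0]
population = {"young": 50, "old": 50}

weights = poststratification_weights(strata, population)
estimate = weighted_mean(km_per_day, strata, weights)
```

The unweighted mean of the four values is 15, while the weighted estimate gives both age groups their population share. Weighting corrects only for the background variables used; it cannot remove bias from characteristics that are not observed.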

Mobile Phone Log Data, cont'd
An app installed on the phone (or mobile device) registers
- Every five minutes
- All or certain specific actions, including time and location (GPS)
Information triggered by the app
- purpose of the journey
- mode of transport
- price paid
- type of accommodation, restaurant visits, satisfaction, activities, etc.
Issues
- Representativity: not easy to find respondents
- Assessment of the quality

Tourism Accommodation Statistics
Production of tourism accommodation statistics
- Internet search using a web crawler
- Available data for each unit
  - Name, address
  - Other characteristics: number of rooms, prices, tourist tax, available facilities, guest review scores, job vacancies, Chamber of Commerce registration number
- Research project by CBS in 2012-13
Issues
- Representativity
- Assessment of the quality

Tourism Accommodation Statistics, cont'd
Similar web-crawler technology can be used for statistics on
- Airfare prices
- Prices of consumer electronics
- ICT usage
- Labour market

Scanner Data
Use for estimating price indices
- CPI: 17 EU member states and others, ESS Task Force
- Regional breakdowns of CPI, PPP
Scanner data contain information on prices, quantities
Issues
- Representativity, biases in CPI
- EAN-codes not harmonized with COICOP-codes
- Scanner data only for a few COICOP groups
- Treatment of rebates
- Differences between prices from scanners and prices reported by price collectors
- Assessment of the quality of scanner data, of the CPIs
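Because scanner data carry both turnover (or prices) and quantities, an elementary price index can be computed directly from them. The sketch below is one possible approach, not the method of any particular statistical office: it assumes transactions are given as (EAN, turnover, quantity) tuples, derives a unit value per EAN and period, and combines the matched items with the Jevons formula (geometric mean of price ratios). All names and figures are hypothetical.

```python
from math import exp, log


def unit_values(transactions):
    """Unit value per EAN: total turnover divided by total quantity sold."""
    turnover, quantity = {}, {}
    for ean, amount, qty in transactions:
        turnover[ean] = turnover.get(ean, 0.0) + amount
        quantity[ean] = quantity.get(ean, 0) + qty
    return {ean: turnover[ean] / quantity[ean]
            for ean in turnover if quantity[ean] > 0}


def jevons_index(base_prices, current_prices):
    """Geometric mean of price ratios over items sold in both periods,
    scaled so that the base period equals 100."""
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no items are sold in both periods")
    mean_log_ratio = sum(
        log(current_prices[e] / base_prices[e]) for e in matched
    ) / len(matched)
    return 100.0 * exp(mean_log_ratio)


# Hypothetical transactions: (EAN, turnover, quantity)
base = [("A", 100.0, 50), ("B", 40.0, 10)]
current = [("A", 110.0, 50), ("A", 110.0, 50), ("B", 44.0, 10), ("C", 5.0, 1)]

index = jevons_index(unit_values(base), unit_values(current))
```

Item C appears only in the current period and is dropped by the matching step, which illustrates how product turnover affects such an index. Rebates change the turnover and therefore the unit value, so their treatment has to be decided before this step.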

Other Price Indices
Web-scraping of on-line prices, e.g.,
- Billion Prices Project (BPP) at the MIT
- Nowcasting food prices in Indonesia (Global Pulse Initiative)
Issues
- Representativity
- Assessment of the quality of data, of results
- Scraping techniques specific for commodities
Crowd-sourced mobile app price data collection, e.g., data collections
- where the data collectors determine foods and markets and retailers to cover
- where the data collection covers specific markets, outlets and commodities
Premise: food price indices for Argentina, China, India, Nigeria, et al.; combines web-scraping and crowd-sourcing

Issues: A Summary
Issues are specific for data sources and data use
Representativity
- Unknown Big Data population
- Coverage of Big Data population deviates from target population, resulting in over- and under-coverage
Quality of data
- Relevance
- Classification problems
- Measurement bias
- Lack of metadata
Quality of statistical processes
- Combination of data from different sources
- Matching problems

Issues: A Summary, cont'd
Quality of statistical output
- Availability of relevant metadata
- Comparability over time, across regions
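The coverage problem named in the summary can be stated in terms of sets: units in the Big Data source that do not belong to the target population are over-coverage, and target units missing from the source are under-coverage. The following minimal sketch assumes, unrealistically for most Big Data sources, that both populations can be enumerated by unit identifier; the function name and the identifiers are hypothetical.

```python
def coverage_rates(source_units, target_units):
    """Return (over-coverage rate, under-coverage rate).
    Over-coverage: share of source units outside the target population.
    Under-coverage: share of target units missing from the source."""
    source, target = set(source_units), set(target_units)
    if not source or not target:
        raise ValueError("both populations must be non-empty")
    over = len(source - target) / len(source)
    under = len(target - source) / len(target)
    return over, under


# Hypothetical unit identifiers
target_population = {"u1", "u2", "u3", "u4", "u5"}
big_data_population = {"u1", "u2", "u3", "x9"}

over, under = coverage_rates(big_data_population, target_population)
```

When the Big Data population is unknown, as the summary notes, these sets cannot be formed directly and the rates have to be estimated, for example by linking a sample of source units to a register.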

Outline
- Data Needs in Official Statistics
- Alternative Data Sources
- Historical Facts
- Some Initiatives in Detail
- Big Data: Concepts
- Big Data: Potentials and Challenges
- Conclusions (current section)

The Status
Enormous interest in Big Data
- Conferences, workshops, publications, etc.
- Projects like Global Pulse Initiative, HLG Big Data Project, the ESS BIGD Project
- National initiatives like ABS Big Data Flagship Project

Expectations
Use of Big Data may have effects like
- Improved timeliness
- More detailed breakdowns
- Improved accuracy
- Reduction of costs of statistical production
- Reduction of response burden
- New indicators
However, potentials of Big Data depend on
- the data source
- the use of the data
The notion "Big Data" is misleading: there is no common methodological approach for the various types of data

The Future of Big Data in Official Statistics
New data sources and the new availability of data need to be used in official statistics
Making use of these opportunities requires preparation by the NSIs
- New skills, e.g., statistical methods, IT tools for handling large datasets
- New methodological issues, i.e., data quality, suitable statistical methods, metadata
- Preparation of the statistical environment, e.g., legislation, partnerships, budget, privacy

The End