HLG - Big Data Sandbox for Statistical Production




HLG - Big Data Sandbox for Statistical Production
Learning to produce meaningful statistics from big data
Tatiana Yarmola, (ex) Intern at the UNECE Statistical Division
INEGI, December 3, 2013

Big Data: volume, velocity, variety
Big Data potential:
- more relevant and timely statistics
- can reproduce a variety of statistical indices
- reduce survey expenses
- reduce response burden
- and more
Big Data: http://vimeo.com/62289901

High Level Group
High Level Group for the Modernisation of Statistical Production and Services (HLG)
- Created by the Conference of European Statisticians in 2010
- 10 heads of national and international statistical organisations

The Challenges
- New competitors & changing expectations
- Rapid changes in the environment
- Increasing cost & difficulty of acquiring data
- Reducing budget
- Riding the big data wave
- Competition for skilled resources

These challenges are too big for statistical organizations to tackle on their own.
We need to work together.
Who was involved in GSIM?
Get involved! Anyone is welcome to contribute!
More information:
- HLG Wiki: http://www1.unece.org/stat/platform/display/hlgbas
- LinkedIn group: Business architecture in statistics

Standards-based Modernisation
[Diagram. Recoverable labels: 2013 Project (twice), GSBPM, Reviewed in 2013 (twice), Mapped in 2013]

Big Data: The new project for 2014

What does Big Data mean for official statistics?*
- Sources: administrative, commercial, from sensors and tracking devices, social media, mobile telephones, etc.
- Challenges: legislative, privacy, financial, management, methodological, technological
- NSOs have infrastructures to address accuracy, consistency, and interpretability
- 3 broad areas of experimentation:
  - Combining Big Data with official statistics
  - Replacing official statistics by Big Data
  - Filling new data gaps with Big Data
* http://www1.unece.org/stat/platform/pages/viewpage.action?pageid=77170614

Collaboration and Activities on Big Data
The Big Data Project specifies key priority areas to be tackled as a collaborative activity in agile fashion:
- No systems in place: develop collaboratively
- Common issues, common solutions
- Learn to address current phenomena
- Agree on a common classification of Big Data types
- Use experience with administrative resources
- Mechanism for sharing: inventory of big data activities
  http://www1.unece.org/stat/platform/display/msis/Big+Data+Inventory

Big Data Project
- Objectives: guidance, feasibility, facilitation of sharing
- Scope: the role of Big Data in modernisation of official statistics
- Work package 1: Issues and methodology
- Work package 2: Shared computing environment ("sandbox") and practical applications
- Work package 3: Training and dissemination
- Work package 4: Project management and coordination
http://www1.unece.org/stat/platform/display/msis/final+project+proposal%3a+the+role+of+big+data+in+the+modernisation+of+statistical+production

WP2: The Big Data Sandbox
- To produce strong, well-justified and internationally applicable recommendations on appropriate tools, methods and environments for processing and analyzing different types of Big Data
- To report on the feasibility of establishing a shared approach for using Big Data sources that are multinational or for which similar sources are available in different countries
- To enable efficient sharing of tools, methods, datasets and results

Sandbox Task Team
Name (Country/organization):
- Tomaž Špeh (Slovenia)
- Carlo Vaccari (Italy)
- Martin Ralphs (UK)
- Tanya Yarmola (UNECE)
- Andrew Murray (Ireland)
- Pilar Rey del Castillo (Eurostat)
- Vittorio Perduca
- Marco Puts (Netherlands)
- Brian Studman (Australia)
Deliverables: a detailed specification to present as an annex to the project proposal to the HLG at their November meeting.
Goals of this group:
- Define precisely what concept(s) is/are to be proved
- Clarify the value of the exercise in terms of the HLG and international work
- Explore alternative scenarios regarding choice of:
  - tools, environment
  - datasets
  - statistics to be produced
  - methods of producing them
- Outline costs and timeline for work during 2014
- Explore funding options, including infrastructural support and expertise for the project during 2014

Deliverables
- web-based Big Data Sandbox + specifications + virtual machine releases + training materials
- relevant datasets
- proof of concept: datasets -> manipulate -> statistic
- common methodology of producing statistics from Big Data
- communicate results

Example of a proof of concept
"Big Data and Official Statistics" by Piet Daas et al.
http://www.cros-portal.eu/sites/default/files//ntts2013fullpaper_76.pdf

Platform and Tools Criteria
- Ease and efficiency
- Open source or low cost
- Ease of integration with other tools
- Integration into existing statistical production architectures
- Availability of documentation
- Availability of online tutorials and support
- The existence of an active and knowledgeable user community

Platform Recommendations
- Processing environment: Hortonworks Hadoop distribution
- Processing tools/software: the Pentaho Business Analytics Suite Enterprise Edition
  - Pentaho Business Analytics Suite Enterprise Edition provides a unified, interactive, and visual environment for data integration, data analysis, data mining, visualization, and other capabilities.

Pentaho Big Data Overview

choose -> prepare -> visualize & explore
http://www.pentaho.com/product/businessvisualizationanalytics#visual-analysis

Platform Recommendations (continued)
- Processing tools/software (continued):
  - Pentaho's Data Integration component is fully compatible with Hortonworks Hadoop and allows 'drag and drop' development of MapReduce jobs.
  - R, RHadoop, and possibly additional tools
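
For readers unfamiliar with the MapReduce jobs mentioned above, the pattern can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps on one machine; it does not use Hadoop or Pentaho, and the record layout (region, amount) and all function names are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit (key, value) pairs for one input record."""
    region, amount = record  # hypothetical record layout
    yield region, amount


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Toy input, invented for illustration only.
records = [("north", 10), ("south", 5), ("north", 7), ("south", 1)]
totals = run_job(records)
print(totals)
```

In a real Hadoop cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle; the sketch only shows how the three steps fit together.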

Datasets Criteria
- Ease of locating and obtaining data from providers
- Cost of obtaining data (if any)
- Stability (or expected stability) over time
- Availability of data that can be used by several countries, or data whose format is at least broadly homogeneous across countries
- The existence of ID variables which enable the merging of big data sets with traditional statistical data sources
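
The last criterion, ID variables that allow merging, can be illustrated with a minimal Python sketch. The field name `unit_id`, the function `merge_on_id` and all rows are hypothetical and exist only for this example; they are not taken from the project.

```python
def merge_on_id(big_data_rows, survey_rows, id_field="unit_id"):
    """Link big-data rows to survey rows that share the same ID value.

    Rows without a matching survey record are dropped (an inner join).
    """
    survey_by_id = {row[id_field]: row for row in survey_rows}
    merged = []
    for row in big_data_rows:
        match = survey_by_id.get(row[id_field])
        if match is not None:
            # Combine the fields of both sources into one record.
            merged.append({**match, **row})
    return merged


# Toy rows, invented for illustration only.
big_data_rows = [
    {"unit_id": "A1", "transactions": 120},
    {"unit_id": "B2", "transactions": 45},
    {"unit_id": "Z9", "transactions": 3},  # no survey counterpart
]
survey_rows = [
    {"unit_id": "A1", "sector": "retail"},
    {"unit_id": "B2", "sector": "transport"},
]

linked = merge_on_id(big_data_rows, survey_rows)
print(linked)
```

Without a shared identifier such as `unit_id`, this kind of exact linkage is not possible, which is why the criterion matters when selecting datasets.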

Datasets Specifications
One or more from each of the categories below to be installed in the sandbox and experimented with for the creation of appropriate corresponding statistics:
- Transactional sources (from banks/telecommunications providers/retail outlets)
- Sensor data sources
- Social network sources, image or video-based sources, other less-explored sources

Statistic Criteria
- conform to quality criteria for official statistics
- reliable and comparable across countries
- correspond in a systematic and predictable way with existing mainstream products, such as price statistics, household budget indicators, etc.
Produce one or several:
1. statistics that correspond closely and in a predictable way with a mainstream statistic produced by most statistical organizations
2. short-term indicators of specific variables, or cross-sectional statistics which permit the study of the detailed relationships between variables
3. statistics that represent a new, non-traditional output
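
One simple way to check whether a candidate statistic "corresponds closely and in a predictable way" with a mainstream statistic is to compute the Pearson correlation between the two series. The sketch below is only one possible check, not a method prescribed by the project; the function name and both series are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up values: the candidate is constructed as exactly half of the
# official series, so the two move together perfectly.
official_index = [100.0, 101.0, 103.0, 104.0, 107.0]
candidate_index = [50.0, 50.5, 51.5, 52.0, 53.5]

r = pearson(official_index, candidate_index)
print(round(r, 6))
```

A coefficient near 1 or -1 indicates a strong linear relationship; on real data a high correlation alone would not establish that the candidate meets the quality criteria for official statistics.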

More details on Big Data Project and Sandbox Specifications at:
http://www1.unece.org/stat/platform/display/msis/final+project+proposal%3a+the+role+of+big+data+in+the+modernisation+of+statistical+production
Thank You