ESS event: Big Data in Official Statistics
Antonino Virgillito, Istat
About me
Head of Unit, Web and BI Technologies, IT Directorate of Istat
Project manager and technical coordinator for web technologies, mobile applications, BI & ETL
Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
PhD in Computer Engineering; field of research: large-scale distributed systems
Trainer on Hadoop/MapReduce
Abstract
The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
Strong IT know-how is required for configuration and management
They should be accessible through common statistical software
This talk gives a reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs
Outline
Motivations for Big Data technologies
Overview of Big Data tools
Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA TECHNOLOGIES
Background: What does Big mean?
Background
Size: Big means tera-, petabytes and growing
Processing: a complex statistical method can become intractable even with data sets of reasonable size
Quality: Big Data is often loosely structured and highly noisy
Tools and Techniques: Size
Real Big Data begins where your usual tools fail
Distributed file systems: clusters of commodity hardware that can scale indefinitely, simply by adding new nodes at runtime
They overcome physical storage limitations and should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques: Processing
MapReduce: a programming paradigm that enables programs to be executed in parallel on a cluster
Not tied to a single programming language; interfaces exist for all common languages and tools
Tools and Techniques: Quality
Pre-processing for cleaning and organizing data
Big Data is often unstructured, but unstructured data is not necessarily big
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies
If you want to get something out of real big data, you have to deal with this complexity
Perspectives
Moss, the IT Guy: "I can set up the infrastructure and the data and help you with the tools."
Colleen, the Statistical Analyst: "OK, but I want to use my own tools and methods. I don't want to touch this distributed stuff."
Moss: "Deal. I don't want to write programs for every analysis she makes."
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for Statistical Analysis of Big Data
What are the basic tools?
What is the best tool for the job?
How do these tools integrate with the common elements of an IT architecture?
Big Data IT Tools: the Common Denominator
Distributed Storage and Processing: Hadoop
Distributed storage platform, the de-facto standard for Big Data processing
Open source project supported and/or adopted by most major vendors
Virtually unlimited scalability in storage, memory and processing power
Hadoop Principle
"I'm one big data set"
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely, simply by adding new nodes
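The splitting idea can be sketched in a few lines of plain Python. This is a toy model, not Hadoop code: the block size, node names and round-robin placement are illustrative assumptions, and real HDFS additionally replicates each block (typically three times) for fault tolerance.

```python
# Toy model of HDFS-style storage: a file is cut into fixed-size
# blocks, and the blocks are scattered over the nodes of a cluster.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[bytes]]:
    """Assign blocks to nodes round-robin (real HDFS also replicates them)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

blocks = split_into_blocks(b"0123456789" * 10, block_size=32)   # a 100-byte "file"
placement = place_blocks(blocks, ["node1", "node2", "node3"])
```

Adding a name to the `nodes` list immediately gives the toy cluster more capacity, which is exactly the scalability point the slide makes.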
The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism
Programs are structured into a two-phase execution:
Map: data elements are classified into categories
Reduce: an algorithm is applied to all the elements of the same category
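The two phases can be illustrated with the classic word count, sketched here in plain Python. This is a sequential simulation meant only to show the paradigm: on a real cluster the map calls and the per-category reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: classify each data element into a category (here, the word itself)."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(grouped):
    """Reduce: apply an algorithm (here, sum) to all elements of one category."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    grouped = defaultdict(list)
    for key, value in map_phase(lines):   # "shuffle": group pairs by category
        grouped[key].append(value)
    return reduce_phase(grouped)

counts = word_count(["big data big tools", "big data"])
```

Because the reduce step only ever sees the values of a single category, each category can be processed on a different machine without any coordination, which is what makes the paradigm parallelizable.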
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
MapReduce and Hadoop
MapReduce works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written to HDFS
Scalability principle: perform the computation where the data is
MapReduce Applications
Counts and aggregations: the natural target, often a one-line aggregation algorithm
Collecting & combining: it all began there, with inverted index computation at Google
Machine learning, cross-correlation
Graph analysis: "People you may know"
Geographical data: in Google Maps, finding the nearest feature to a given address or location
Pre-processing of unstructured data, including binary files: the NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
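The inverted-index case mentioned above fits the same map/categorize/reduce mold: map emits (word, document id) pairs, and reduce collects the ids per word. A hypothetical plain-Python sketch (the document ids and texts are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each (doc_id, text) to (word, doc_id) pairs, then group by word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():        # map: emit (word, doc_id)
            index[word].add(doc_id)      # shuffle/reduce: collect ids per word
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index({
    "doc1": "big data in official statistics",
    "doc2": "big data tools",
})
```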
Data Analysis with Hadoop
Colleen, the Statistical Analyst: "I finally loaded those elephant-size data sets into Hadoop! Cool! Now how can I analyze them?"
Moss, the IT Guy: "It's simple! Write a MapReduce program in Java!" Colleen: "No." Moss: "OK, I'll do that for you... No."
MapReduce programs can be written in various programming languages
Several tools are also available that translate high-level analysis languages into MapReduce programs
Tools for Data Analysis with Hadoop
High-level languages for data manipulation: Pig and Hive, layered on top of MapReduce and HDFS
Statistical software
Using Hadoop from Statistical Software
R: the rhdfs and rmr packages issue HDFS commands and write MapReduce jobs
SAS: SAS In-Memory Statistics and SAS/ACCESS make data stored in Hadoop appear as native SAS datasets, using the Hive interface
SPSS: transparent integration with Hadoop data
Apache Pig
Tool for querying data on Hadoop clusters, widely used in the Hadoop world
Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Data manipulation scripts are written in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations
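To give a feel for the aggregations Pig targets, here is a group-and-average expressed in plain Python on invented data. In Pig Latin the same operation would be a GROUP followed by a FOREACH ... GENERATE, which the interpreter compiles to a MapReduce job; the Python below is only a sequential stand-in for that semantics.

```python
from collections import defaultdict

def average_by_key(rows):
    """Group (key, value) rows by key and compute the average value per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical records: (region, price)
rows = [("north", 10.0), ("south", 20.0), ("north", 30.0)]
averages = average_by_key(rows)
```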
Pig Example
(Slide shows a real Pig script used at Twitter, alongside its Java equivalent)
Hadoop-MapReduce Limitations
Not usable in transactional applications: HDFS is an append-only file system; it can insert and delete, but cannot update
Not suited to real-time analysis: MapReduce jobs run in batch mode, so you cannot expect low response latency
Not suited for interactive, real-time operations and/or random-access reads and writes
NoSQL Databases
NoSQL: "Not Only SQL"
Distributed storage platforms that allow lower-latency processing
Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase (based on Hadoop, sits on top of HDFS) and Cassandra (a fully distributed platform, not based on Hadoop)
Both use a column-oriented model: data is organized in families of key:value pairs
Variable schema, where each row can be slightly different; optimized for sparse data
Can be accessed from R
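The column-oriented, variable-schema idea can be pictured with nested Python dicts. This is a toy model: the column-family names and row keys are invented, and real HBase or Cassandra add versioning, persistence and distribution on top of this shape.

```python
# Toy column-family store: row key -> column family -> {column: value}.
# Rows need not share the same columns, so sparse data costs nothing to store.

table = {
    "row1": {"info": {"name": "Colleen", "role": "analyst"}},
    "row2": {"info": {"name": "Moss"}, "skills": {"java": "expert"}},
}

def get(table, row, family, column, default=None):
    """Look up one cell; absent families or columns simply yield the default."""
    return table.get(row, {}).get(family, {}).get(column, default)
```

Note how "row1" has no "skills" family at all: in a column-oriented model the missing columns are simply not stored, rather than held as NULLs in a fixed relational schema.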
Big Data Tools in the IT Architecture
Hadoop is not a DB/DW replacement: it sits beside traditional data technologies in a modern IT architecture
The outcome of Big Data processing can be stored in a traditional DB/DW
Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
(Diagram: multi-structured big data flows into Hadoop, which does the initial processing and cleaning, and into a NoSQL DB, which keeps multi-structured historical data online and accessible; analysis results land in the DB/DW; analysis tools, statistical software, Visual Analytics and BI sit on top)
ADOPTING BIG DATA TECHNOLOGIES
Hadoop Deployment Options
In-house: maximum control of configuration and costs; high complexity
Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
Appliance: easy, but costly
IT Skills for Big Data Tools
Data analyst: uses statistical tools and Visual Analytics; derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources (R, SAS, SPSS, BI and Visual Analytics, Excel)
Data scientist: has strong IT foundations and can develop her own algorithms using both statistical tools and Hadoop
Data engineer: designs the IT architecture for collecting and processing data; designs and develops MR jobs or Pig scripts (MapReduce, Pig, Java)
Data integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs (SQL, ETL)
System manager: sets up and manages the physical infrastructure (Linux)
Suggestions for ESS
Training on data science for statisticians and on Big Data engineering for IT staff
Eurostat establishing repositories of Big Data and allowing NSIs to access them
Implementation of standard methods and tools in a Hadoop-compliant version
Setup of a "statistical cloud": a Hadoop cluster shared by NSIs
Possible agreements with providers of IT solutions (Google, etc.)
Conclusions
Big Data tools make sense when you really have serious size issues to deal with: there is not much use for a 2-node Hadoop cluster
No value in jumping on the Big Data bandwagon for its own sake: high costs
You can still be a data scientist... However, Big Data engineering provides new opportunities
Collect more data; ask bigger questions
Questions