ESS event: Big Data in Official Statistics
Antonino Virgillito, Istat
About me
Head of Unit, Web and BI Technologies, IT Directorate of Istat
Project manager and technical coordinator for web technologies, mobile applications, BI & ETL
Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
PhD in Computer Engineering; field of research: large-scale distributed systems
Trainer on Hadoop/MapReduce
Abstract
The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
Strong IT know-how is required for configuration and management
They should be accessible through common statistical software
This talk gives a reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs
Outline
Motivations for Big Data technologies
Overview of Big Data tools
Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA TECHNOLOGIES
Background: What does Big mean?
Background
Size: Big means tera-, petabytes and growing
Processing: a complex statistical method can become intractable even with data sets of reasonable size
Quality: Big Data is often loosely structured and highly noisy
Tools and Techniques: Size
Real Big Data begins where your usual tools fail
Distributed file systems: clusters of commodity hardware that can scale indefinitely, simply by adding new nodes at runtime
They overcome physical storage limitations and should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques: Processing
MapReduce: a programming paradigm that enables programs to be executed in parallel on a cluster
Not tied to a single programming language; interfaces exist for all common languages and tools
Tools and Techniques: Quality
Pre-processing for cleaning and organizing data
Big Data is often unstructured, but unstructured data is not necessarily big
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies
If you want to get something out of real big data, you have to deal with this complexity
Perspectives
Moss, the IT Guy: "I can set up the infrastructure and the data and help you with the tools."
Colleen, the Statistical Analyst: "OK, but I want to use my own tools and methods. I don't want to touch this distributed stuff."
Moss: "Deal. I don't want to write programs for every analysis she makes."
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for Statistical Analysis of Big Data
What are the basic tools?
What is the best tool for the job?
How do these tools integrate with the common elements of an IT architecture?
Big Data IT Tools: the Common Denominator
Distributed Storage and Processing: Hadoop
Distributed storage platform, the de-facto standard for Big Data processing
Open source project supported and/or adopted by most major vendors
Virtually unlimited scalability in storage, memory and processing power
Hadoop Principle
"I'm one big data set"
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely, simply by adding new nodes
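The splitting idea can be sketched in a few lines of plain Python. This is a toy model, not Hadoop code: the block size, node names and round-robin placement are illustrative assumptions, and real HDFS additionally replicates each block (typically three times) for fault tolerance.

```python
# Toy model of HDFS-style storage: a file is cut into fixed-size
# blocks, and the blocks are scattered over the nodes of a cluster.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[bytes]]:
    """Assign blocks to nodes round-robin (real HDFS also replicates them)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

blocks = split_into_blocks(b"0123456789" * 10, block_size=32)   # a 100-byte "file"
placement = place_blocks(blocks, ["node1", "node2", "node3"])
```

Adding a name to the `nodes` list immediately gives the toy cluster more capacity, which is exactly the scalability point the slide makes.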
The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism
Programs are structured into a two-phase execution:
Map: data elements are classified into categories
Reduce: an algorithm is applied to all the elements of the same category
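The two phases can be illustrated with the classic word count, sketched here in plain Python. This is a sequential simulation meant only to show the paradigm: on a real cluster the map calls and the per-category reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: classify each data element into a category (here, the word itself)."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(grouped):
    """Reduce: apply an algorithm (here, sum) to all elements of one category."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    grouped = defaultdict(list)
    for key, value in map_phase(lines):   # "shuffle": group pairs by category
        grouped[key].append(value)
    return reduce_phase(grouped)

counts = word_count(["big data big tools", "big data"])
```

Because the reduce step only ever sees the values of a single category, each category can be processed on a different machine without any coordination, which is what makes the paradigm parallelizable.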
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
MapReduce and Hadoop
MapReduce works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written to HDFS
Scalability principle: perform the computation where the data is
MapReduce Applications
Counts and aggregations: the natural target, often a one-line aggregation algorithm
Collecting & combining: it all began there, with inverted index computation at Google
Machine learning, cross-correlation
Graph analysis: "People you may know"
Geographical data: in Google Maps, finding the nearest feature to a given address or location
Pre-processing of unstructured data, including binary files: the NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
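The inverted-index case mentioned above fits the same map/categorize/reduce mold: map emits (word, document id) pairs, and reduce collects the ids per word. A hypothetical plain-Python sketch (the document ids and texts are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each (doc_id, text) to (word, doc_id) pairs, then group by word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():        # map: emit (word, doc_id)
            index[word].add(doc_id)      # shuffle/reduce: collect ids per word
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index({
    "doc1": "big data in official statistics",
    "doc2": "big data tools",
})
```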
Data Analysis with Hadoop
Colleen, the Statistical Analyst: "I finally loaded those elephant-size data sets into Hadoop! Cool! Now how can I analyze them?"
Moss, the IT Guy: "It's simple! Write a MapReduce program in Java!" Colleen: "No." Moss: "OK, I'll do that for you... No."
MapReduce programs can be written in various programming languages
Several tools are also available that translate high-level analysis languages into MapReduce programs
Tools for Data Analysis with Hadoop
High-level languages for data manipulation: Pig and Hive, layered on top of MapReduce and HDFS
Statistical software
Using Hadoop from Statistical Software
R: the rhdfs and rmr packages issue HDFS commands and write MapReduce jobs
SAS: SAS In-Memory Statistics and SAS/ACCESS make data stored in Hadoop appear as native SAS datasets, using the Hive interface
SPSS: transparent integration with Hadoop data
Apache Pig
Tool for querying data on Hadoop clusters, widely used in the Hadoop world
Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Data manipulation scripts are written in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations
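To give a feel for the aggregations Pig targets, here is a group-and-average expressed in plain Python on invented data. In Pig Latin the same operation would be a GROUP followed by a FOREACH ... GENERATE, which the interpreter compiles to a MapReduce job; the Python below is only a sequential stand-in for that semantics.

```python
from collections import defaultdict

def average_by_key(rows):
    """Group (key, value) rows by key and compute the average value per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical records: (region, price)
rows = [("north", 10.0), ("south", 20.0), ("north", 30.0)]
averages = average_by_key(rows)
```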
Pig Example
(Slide shows a real Pig script used at Twitter, alongside its Java equivalent)
Hadoop-MapReduce Limitations
Not usable in transactional applications: HDFS is an append-only file system; it can insert and delete, but cannot update
Not suited to real-time analysis: MapReduce jobs run in batch mode, so you cannot expect low response latency
Not suited for interactive, real-time operations and/or random-access reads and writes
NoSQL Databases
NoSQL: "Not Only SQL"
Distributed storage platforms that allow lower-latency processing
Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase (based on Hadoop, sits on top of HDFS) and Cassandra (a fully distributed platform, not based on Hadoop)
Both use a column-oriented model: data is organized in families of key:value pairs
Variable schema, where each row can be slightly different; optimized for sparse data
Can be accessed from R
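The column-oriented, variable-schema idea can be pictured with nested Python dicts. This is a toy model: the column-family names and row keys are invented, and real HBase or Cassandra add versioning, persistence and distribution on top of this shape.

```python
# Toy column-family store: row key -> column family -> {column: value}.
# Rows need not share the same columns, so sparse data costs nothing to store.

table = {
    "row1": {"info": {"name": "Colleen", "role": "analyst"}},
    "row2": {"info": {"name": "Moss"}, "skills": {"java": "expert"}},
}

def get(table, row, family, column, default=None):
    """Look up one cell; absent families or columns simply yield the default."""
    return table.get(row, {}).get(family, {}).get(column, default)
```

Note how "row1" has no "skills" family at all: in a column-oriented model the missing columns are simply not stored, rather than held as NULLs in a fixed relational schema.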
Big Data Tools in the IT Architecture
Hadoop is not a DB/DW replacement: it sits beside traditional data technologies in a modern IT architecture
The outcome of Big Data processing can be stored in a traditional DB/DW
Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
(Diagram: multi-structured big data flows into Hadoop, which does the initial processing and cleaning, and into a NoSQL DB, which keeps multi-structured historical data online and accessible; analysis results land in the DB/DW; analysis tools, statistical software, Visual Analytics and BI sit on top)
ADOPTING BIG DATA TECHNOLOGIES
Hadoop Deployment Options
In-house: maximum control of configuration and costs; high complexity
Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
Appliance: easy, but costly
IT Skills for Big Data Tools
Data analyst: uses statistical tools and Visual Analytics; derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources (R, SAS, SPSS, BI and Visual Analytics, Excel)
Data scientist: has strong IT foundations and can develop her own algorithms using both statistical tools and Hadoop
Data engineer: designs the IT architecture for collecting and processing data; designs and develops MR jobs or Pig scripts (MapReduce, Pig, Java)
Data integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs (SQL, ETL)
System manager: sets up and manages the physical infrastructure (Linux)
Suggestions for ESS
Training on data science for statisticians and on Big Data engineering for IT staff
Eurostat establishing repositories of Big Data and allowing NSIs to access them
Implementation of standard methods and tools in a Hadoop-compliant version
Setup of a "statistical cloud": a Hadoop cluster shared by NSIs
Possible agreements with providers of IT solutions (Google, etc.)
Conclusions
Big Data tools make sense when you really have serious size issues to deal with: there is not much use for a 2-node Hadoop cluster
No value in jumping on the Big Data bandwagon for its own sake: high costs
You can still be a data scientist... However, Big Data engineering provides new opportunities
Collect more data; ask bigger questions
Questions