Programme of the ESTP training course on BIG DATA: EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS

Rome, 5-9 May 2014, Istat, Piazza Indipendenza 4, Room Vanoni

The course takes a laboratory approach to managing the very large datasets that are emerging as primary sources feeding up-to-date statistical processes. Students will be introduced to the appropriate use of technology for managing the ETL processes that result from collecting and feeding data from large structured and unstructured data sources. The course also provides a collection of methods and techniques to integrate the sources, to compare the archives against reference metadata sets, and to discover and eventually resolve source anomalies. Attendees will be introduced to the theoretical fundamentals underlying each presented methodology, and will finally move to a real implementation using innovative techniques and algorithms.

Day 1, 5 May 2014 - Old and new data manipulation paradigms

9.00-9.15 (15') Opening
9.15-9.45 (30') Too big to ignore: a matter of balance. Evolution in data management; scenario.
9.45-10.15 (30') The need for alternative computing paradigms. - Antonino Virgillito
10.15-11.00 (45') Classification of data sources.
11.15-11.45 (30') The Internet of Things.
11.45-12.30 (45') Case study: synthesising a Big Data driven framework. - Diego Zardetto
12.30-13.00 (30') Sharing experiences, expectations and critical aspects. - Giulio Barcaroli
13.00-13.30 (30') International activities on Big Data in Official Statistics. - Carlo Vaccari
14.30-15.00 (30') XML as integration paradigm. Service Oriented Architecture.
15.00-15.30 (30') XML-enabled databases. Non-relational databases.
15.45-16.15 (30') Handling XML sources. Non-structured XML tables.
16.15-16.45 (30') Dealing with XSD schemas. Structured XML tables.
16.45-17.15 (30') Merging XML data in the business process: the Resource Description Framework.
17.15-17.30 (15') Conclusions
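The Day 1 afternoon sessions on handling XML sources can be previewed with a minimal sketch: flattening an XML feed into tabular rows, as a loader step of an ETL process might do. The fragment below is made-up sample data, not course material, and uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# A made-up XML fragment standing in for a structured source feed.
doc = """
<enterprises>
  <enterprise id="001"><name>Acme</name><employees>120</employees></enterprise>
  <enterprise id="002"><name>Beta</name><employees>45</employees></enterprise>
</enterprises>
"""

root = ET.fromstring(doc)
# Flatten the XML tree into (id, name, employees) tuples.
rows = [
    (e.get("id"), e.findtext("name"), int(e.findtext("employees")))
    for e in root.findall("enterprise")
]
print(rows)  # [('001', 'Acme', 120), ('002', 'Beta', 45)]
```

The same flattening step is where an XSD schema (covered in the 16.15 session) would be applied to validate the source before loading.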
Day 2, 6 May 2014 - A roadmap toward Big Data

9.00-9.15 (15') Opening
9.15-10.00 (45') The MapReduce programming model. - Antonino Virgillito
10.00-11.00 (60') The World of Hadoop. - Antonino Virgillito
11.15-12.15 (60') NoSQL databases.
12.15-12.45 (30') Robust concurrent computing architectures and the Byzantine agreement problem. Single Point of Control. Single Point of Failure.
12.45-13.30 (45') Using Big Data technologies (part one): massive computing. - Antonino Virgillito
14.30-15.30 (60') Using Big Data technologies (part two): dealing with unstructured data; examples and applications.
15.45-16.30 (45') Implementing the MapReduce programming model on a parallel-enabled database: aggregating functions.
16.30-17.15 (45') Profiling the MapReduce model on a real enterprise infrastructure. Implementing and evaluating simple MapReduce algorithms.
17.15-17.30 (15') Conclusions
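The MapReduce model covered in the Day 2 morning and lab sessions can be sketched in plain Python: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. This toy word count is an illustration only, with made-up input splits, and does not use Hadoop itself.

```python
from itertools import groupby
from operator import itemgetter

# Toy corpus; each string plays the role of one input split.
splits = ["big data big", "data processing"]

# Map phase: emit (key, 1) pairs from every word of every split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle phase: group intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```

In Hadoop, the sort-and-group step is performed by the framework between the map and reduce tasks; only the two phase functions are written by the programmer.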
Day 3, 7 May 2014 - Big Data in Official Statistics

9.00-9.15 (15') Opening
9.15-10.00 (45') Introduction to Big Data in Official Statistics: the concept of Big Data; overview of Big Data sources. - Antonino Virgillito
10.00-11.00 (60') Methodological issues in using Big Data for Official Statistics. - Giulio Barcaroli
11.15-12.15 (60') IT issues in using Big Data for Official Statistics.
12.15-13.30 (75') Using mobile phones for analyzing mobility of city users. - Antonino Virgillito
14.30-15.30 (60') Improving Labor Force Survey estimates by the effective usage of Google Trends.
15.45-16.45 (60') Internet as a data source: web scraping and text mining for estimating ICT usage by enterprises and public institutions.
16.45-17.15 (30') Privacy, Security and Safety: recipes for securing data, recipes for disclosure control, trusted computing.
17.15-17.30 (15') Conclusions
Day 4, 8 May 2014 - Improving data availability and processing efficiency

9.00-9.15 (15') Opening
9.15-10.00 (45') Data location and partitioning. Indexing. Problem splitting. Actor systems. Storage virtualisation.
10.00-11.00 (60') Examples of improving data location and partitioning. Effective usage of indexes.
11.15-12.15 (60') Improving database (serial) operations. Code profiling. Bulk operations. Pipelined functions. Sustained data streaming. Partition swapping.
12.15-13.00 (45') External tables in performing fast bulk operations. Application of a pipelined function to an ETL process. Managing changes of a big micro-data set.
13.00-13.30 (30') Quasi real-time analytics. - Diego Zardetto
14.30-15.30 (60') Fundamentals of parallel computing: definitions, metrics, workload, critical aspects. Distributed vs Symmetric Multi-Processing.
15.45-16.30 (45') Parallel database operations. Scheduled concurrent tasks. Parallel-enabled pipelined functions. Parallel queries. Embedded relational objects, aggregating functions.
16.30-17.15 (45') Self-made parallelism vs controlled tasks; benefits of parallel data streaming. Multipath data querying. Embedded relational objects. Design of central aggregating functions.
17.15-17.30 (15') Conclusions
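The Day 4 themes of partitioning, concurrent tasks and central aggregating functions can be illustrated with a minimal sketch: each concurrent task aggregates one data partition, and a final step combines the partial results. The partitions below are made-up data, and the thread pool stands in for the database-side parallelism discussed in the afternoon sessions.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up micro-data: one list per partition, standing in for
# the partitions of a large table.
partitions = [list(range(0, 1000)),
              list(range(1000, 2000)),
              list(range(2000, 3000))]

def partial_sum(part):
    # Each concurrent task aggregates its own partition only.
    return sum(part)

# Schedule one task per partition, then combine the partial
# results in a central aggregating step.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)
print(total)  # 4498500
```

The same split/aggregate shape underlies parallel queries in a parallel-enabled database: the engine runs the per-partition scans concurrently and the query coordinator performs the final aggregation.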
Day 5, 9 May 2014 - The analysis of massive datasets

9.00-9.15 (15') Opening
9.15-10.15 (60') Geometric interpretation of data structures and the introduction of regular languages and expressions.
10.15-11.00 (45') Getting involved with regular expressions.
11.15-12.00 (45') Mapping techniques for studying anomalies in structured data: probabilistic ranking of event patterns.
12.00-12.45 (45') Stochastic characterisation of unstructured data sets.
12.45-13.30 (45') Characteristics of a Big Data analysis framework: a distributed approach.
14.30-15.30 (60') Inference techniques used for Official Statistics (part 1). - Diego Zardetto
15.45-16.45 (60') Inference techniques used for Official Statistics (part 2). - Diego Zardetto
16.45-17.00 (15') Where can we go from here? Golden rules.
17.00-17.30 (30') Final remarks. - Giulio Barcaroli, Antonino Virgillito
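The Day 5 morning sessions connect regular expressions with anomaly discovery in structured data: records whose fields fail to match a reference pattern are flagged for inspection. A minimal sketch, with a hypothetical identifier format (two uppercase letters followed by six digits) and made-up records:

```python
import re

# Hypothetical reference pattern for an identifier field:
# two uppercase letters followed by exactly six digits.
ID_PATTERN = re.compile(r"[A-Z]{2}\d{6}")

records = ["IT000123", "FR004511", "it-99", "DE12345"]

# Compare each record against the reference pattern; anything
# that does not match in full is a candidate source anomaly.
anomalies = [r for r in records if not ID_PATTERN.fullmatch(r)]
print(anomalies)  # ['it-99', 'DE12345']
```

In practice, the reference patterns come from the metadata sets against which the archives are compared, as described in the course introduction.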