The Sandbox 2015 Report
1 United Nations Economic Commission for Europe, Statistical Division. Workshop on the Modernisation of Official Statistics, November 24-25, 2015. The Sandbox project: The Sandbox 2015 Report. Antonino Virgillito and Carlo Vaccari, project coordinators, UNECE / ISTAT
2 INTRODUCTION ACTIVITIES FUTURE OF SANDBOX OUTCOMES
3 Our presentation: Introduction; Activities (Wikistats, Twitter, Enterprise Web Sites, Comtrade); Future; Conclusions
4 What is the Sandbox: A shared computing environment developed in partnership with the Irish Central Statistics Office and the Irish Centre for High-End Computing (ICHEC). A unique platform where participating organisations can engage in collaborative research activities. Open to all producers of official statistics.
5 Sandbox background: the 2014 Big Data project. 2014: four Work Packages. Work Package 1: Issues and Methodology. Work Package 2: Shared computing environment ('sandbox') and practical application. Work Package 3: Training and dissemination. Work Package 4: Project management and coordination. Sources explored: Social Media, Mobile Phones, Prices, Smart Meters, Job Vacancies, Web Scraping, Traffic Loops
6 Sandbox 2015 goals: publish a set of international statistics based on Big Data before the end of the year. Choose 2-3 Big Data sources; collect the data; compute multi-national statistics; organize a press conference. Conclude the 2014 experiments on the Sandbox.
7 The Sandbox 2015. Installed software: Hadoop (Hortonworks Data Platform), R, RStudio, RHadoop, Spark, Elasticsearch, and growing. Hardware: 4 data/compute nodes (2 x 10-core Intel Xeon CPUs, 128 GB RAM, 4 x 4 TB disks), 56 Gbit InfiniBand network, 2 service/login nodes (similar hardware to the data nodes), 10 Gbit connection to the Internet
8 Participants: more than 40 people from 22 entities. International organisations: Eurostat, OECD, UNECE, UNSD. Countries: at, ch, de, es, fr, hu, ie, it, mx, nl, pl, rs, ru, se, si, tr, uae, uk, us. Active participants: about half of the total. ICHEC assisted the task team with the testing and evaluation of Hadoop work-flows and the associated data analysis software.
9 Work Groups: four main activities, one multinational group per task. Wikistats (Wikipedia hourly page views): use of an alternative data source. Twitter (social media data): comparison of experiences in tweet collection and analysis. Enterprise websites (the Web as a data source): web scraping and business registers. Comtrade (UN global trade data): use of Big Data tools on a traditional data source.
10 INTRODUCTION ACTIVITIES Wikistats FUTURE OF SANDBOX OUTCOMES
11 Wikistats: Description. Statistics on hourly page views of articles in the several language versions of Wikipedia: download and process the dumps to obtain manageable data. Wikipedia page views are a potential source for many domains (tourism, culture, economy): 1. UNESCO heritage sites; 2. Popularity and touristic potential of cities and beaches.
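Each line of the raw page-view dumps pairs a project code and an article title with an hourly view count. A minimal parser sketch is below; the four-field layout follows the public Wikimedia pagecounts format, and the sample values are invented:

```python
def parse_pagecounts_line(line):
    """Parse one line of a Wikipedia pagecounts dump:
    '<project> <title> <views> <bytes>' -> (project, title, views)."""
    project, title, views, _bytes = line.split(" ")
    return project, title, int(views)

# Invented sample line in the pagecounts layout
print(parse_pagecounts_line("en Vatican_City 3521 88112045"))
```

Parsing every hourly file this way is what makes the raw dumps "manageable data" for the later aggregation steps.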
12 Wikistats: Data Characteristics. Wikipedia is the seventh most-visited website (Alexa ranking). A public source (contents and metadata). Widely used digital traces left by people in their activities (44% and 69% of EU).
13 Wikistats: Activities. Pre-processing: shell and Pig scripts to filter data and change the time-series format. Extraction: MapReduce, shell and Python to filter articles and aggregate over time (hourly to daily, weekly and monthly). Analytics: R and RStudio for web scraping, selection of articles and data analysis.
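The hourly-to-daily aggregation step can be sketched in plain Python (the actual pipeline used MapReduce, shell and Pig; the record layout and figures here are assumptions):

```python
from collections import defaultdict

def aggregate_daily(hourly_records):
    """Sum hourly page-view counts into daily totals per (language, article).

    Each record is (language, article, 'YYYYMMDD-HH', views), a simplified
    stand-in for the parsed Wikipedia pagecounts data.
    """
    daily = defaultdict(int)
    for lang, article, ts, views in hourly_records:
        day = ts.split("-")[0]  # 'YYYYMMDD-HH' -> 'YYYYMMDD'
        daily[(lang, article, day)] += views
    return dict(daily)

records = [
    ("it", "Città_del_Vaticano", "20130313-18", 1200),
    ("it", "Città_del_Vaticano", "20130313-19", 3400),
    ("it", "Città_del_Vaticano", "20130314-01", 800),
]
print(aggregate_daily(records))
```

The same grouping key extended to the week or month gives the weekly and monthly series mentioned above.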
14 Wikistats: Outputs. More than one billion views!
15 Wikistats: Outputs. See the peak for Vatican City in March 2013, for Pope Francis' election.
16 Wikistats: Findings. Relevance: good for culture and regional statistics; new topics uncovered so far; many other potential topics; unprecedented temporal detail. Technology: experience gained with scripts, Pig and RStudio; try NoSQL, Spark, Elasticsearch; problems with Java integration; a resource-scheduling system is needed.
17 Wikistats: Findings. Source: good potential, needs more investigation; data available, adequate IT needed; no privacy concerns, even for individuals; continuity substantial but not guaranteed; more mobile users, which needs investigation. Quality: accuracy (bots excluded), categorization; timeliness good (data available a few hours later); comparability in space and between languages; crowd-sourced, so completeness can improve. Can improve statistics: new phenomena.
18 Wikistats: How far are we from something that can be published? a) Concerning the results of the experiment: almost ready to publish the first experimental statistics for a set of languages (encoding issues remain). b) Concerning statistics based on this data source in general: results from the data source (page views) were well received by domain experts; no accuracy issues, but interpretability issues need validation; new statistics, like an index of consumption of digital cultural products, require methodological development and conceptual frameworks.
19 Wikistats: What case has been made for the future of the Sandbox? The Sandbox is fundamental for experimenting with this data source (desktop computers are simply not able to process it). The present and future existence of the Sandbox is important for developing the work further, with tools that can increase process efficiency (HBase, Elasticsearch). The Sandbox also makes the system available to other statisticians: the code is available on GitHub, but not the data. In the Sandbox, where the data is available, statisticians can re-use the code and build other applications on the same data.
20 INTRODUCTION ACTIVITIES Twitter FUTURE OF SANDBOX OUTCOMES
21 Twitter: Description. Geolocated tweets. 2015: two approaches: public stream (Mexico, UK) and purchased data (UK). Source instability due to technological changes. Comparison of data preprocessing. Data analysis: distinct targets, methods and tools.
22 Twitter: Data Characteristics.
UK (ONS): period April-August 2014 (API) and August-October (GNIP); method: public Twitter API and data purchased from GNIP; Twitter active users: 5.6%; geo-referenced tweets: 1.57% (London); NoSQL storage: CSV, MongoDB; total records: ~106 million collected, ~81.4 million used.
Mexico (INEGI): period February 2014 to today (running); method: public Twitter API; geo-referenced tweets: 1.0% (Mexico City); NoSQL storage: Elasticsearch; total records: ~150 million collected, ~74 million used.
23 Twitter: Outputs. Procedures for collecting geolocated tweets from the public stream: a Mexican version and a Sandbox version. Data collection in progress on the Sandbox for tweets geolocated in Rome (Italy), for a future study on the Jubilee.
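The geographic filter at the heart of such a collector can be sketched as a bounding-box test on each tweet's point coordinates (the box values below are illustrative, not the ones used for the Rome collection; the `coordinates` field follows the Twitter streaming API's GeoJSON point layout):

```python
# Approximate bounding box around Rome, illustrative values
ROME_BBOX = (12.35, 41.80, 12.65, 42.00)  # (min_lon, min_lat, max_lon, max_lat)

def in_bbox(tweet, bbox=ROME_BBOX):
    """Keep only tweets whose point coordinates fall inside the bounding box."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords:
        return False  # tweet is not geolocated
    lon, lat = coords
    return bbox[0] <= lon <= bbox[2] and bbox[1] <= lat <= bbox[3]

stream = [
    {"id": 1, "coordinates": {"coordinates": [12.49, 41.90]}},  # Rome
    {"id": 2, "coordinates": {"coordinates": [2.35, 48.85]}},   # Paris
    {"id": 3, "coordinates": None},                              # not geolocated
]
kept = [t["id"] for t in stream if in_bbox(t)]
print(kept)  # -> [1]
```

In practice the Twitter streaming API also accepts a bounding box as a request parameter, so the same box can drive both the collection request and a post-hoc check.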
24-26 Twitter: Outputs (figures).
27 Twitter: Outputs. Number of tweets generated in the Rome area (each bar = 12 hours), filtered to tweets containing the words "Paris" or "Parigi".
28 Twitter: Findings. Technology: Elasticsearch is a good tool for collecting and indexing the data; Kibana is a good tool for visualizing it. Source: continuity depends on technology outside NSOs' control, and even small technological changes can break collection. Rules for data collection rest on a fragile legal basis; acquiring the data is preferable (UK cost: 25,000/year).
29 Twitter: How far are we from something that can be published? For people mobility, we are close to official usage in Mexico. Sentiment analysis of Twitter data is close to official use in the Netherlands. What case has been made for the future of the Sandbox? Procedures for collecting data from Twitter are ready on the Sandbox, together with tools for analyzing those data. Data collection is already active on Rome geolocated tweets, ready for analysis of tourist mobility during the Jubilee.
30 INTRODUCTION ACTIVITIES Enterprise Web Sites FUTURE OF SANDBOX OUTCOMES
31 Enterprise Web Sites: Description. Objective: use the web sites of enterprises as a source for statistics. Issues: (1) Obtaining the URLs of the web sites; different approaches were tested: obtain them from registries/surveys, or from search engines. (2) Implementing a method for scraping data from the web sites; different approaches were tested for both the scraping and analysis phases. (3) Computing the statistics: number of enterprises that advertise job vacancies, broken down by NACE activity and region.
32 Enterprise Web Sites: Data Characteristics. Table comparing Sweden and Slovenia: number of websites, websites used in the experiment, and web pages.
33 Enterprise Web Sites: Activities. Development of an application for scraping job-vacancy data starting from the URLs of enterprises: not tied to a language or method, developed in Python, available in the Sandbox. Spider/Downloader: start from a list of URLs, follow the links and download the content of the employment pages. Determinator/Classifier: implement a method for detecting and classifying job-vacancy advertisements in the scraped content.
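The link-following core of the Spider can be sketched with the standard-library HTML parser; the keyword list used to spot employment pages is illustrative, not the project's actual list:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of anchor tags that look like employment pages."""
    KEYWORDS = ("job", "career", "vacanc", "employment")  # illustrative

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if any(k in href.lower() for k in self.KEYWORDS):
            self.links.append(href)

html = '<a href="/about">About</a> <a href="/careers/open">Careers</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # -> ['/careers/open']
```

In the real application the extracted links feed the Downloader, which fetches each candidate employment page for classification.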
34 Enterprise Web Sites: Activities. Developed two different methods for implementing the Determinator module (detection of job vacancies): a machine-learning approach (92% accuracy on Swedish data) and a keyword/phrase-based approach (80% accuracy on Slovenian data). Created statistics from the scraped data. Tested a technology stack for scraping and analyzing web sites at large scale: Nutch, Elasticsearch, Kibana.
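The keyword/phrase-based variant of the Determinator can be sketched as follows; the phrase list is an illustrative English one, not the list used on the Slovenian data:

```python
import re

# Illustrative job-vacancy indicator phrases (the real lists were per-language)
VACANCY_PHRASES = [
    r"\bjob vacanc(y|ies)\b",
    r"\bwe are hiring\b",
    r"\bopen positions?\b",
    r"\bcareers?\b",
]
PATTERN = re.compile("|".join(VACANCY_PHRASES), re.IGNORECASE)

def has_vacancy(page_text):
    """Flag a scraped page as advertising job vacancies if any phrase matches."""
    return PATTERN.search(page_text) is not None

print(has_vacancy("Careers: we are hiring a data engineer"))  # -> True
print(has_vacancy("Our company history since 1950"))          # -> False
```

The machine-learning variant replaces this hand-built pattern with a classifier trained on labelled pages, which is how the higher accuracy on the Swedish data was reached.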
35 Enterprise Web Sites: Outputs. A program to scrape and analyze web sites, deployed in the Sandbox and usable in general. Statistics on the number of job vacancies per NACE group, as calculated from the scraped data. The distribution was compared with that obtained from the survey, and the distribution using survey weights was compared with the distribution using calibrated weights (by number of employees).
36 Enterprise Web Sites: Outputs. Slovenian data.
37 Enterprise Web Sites (figure).
38 Enterprise Web Sites: Findings. Privacy: retrieving and sharing the URLs of web sites was not as easy as expected; it is still not clear whether URLs collected as microdata in the Slovenian survey can be used in the Sandbox. Methodology: comparison with the distribution of job vacancies per NACE revealed coherence with the survey results, indicating that the approach is solid. Promising results from the machine-learning approach to identifying job vacancies.
39 Enterprise Web Sites: How far are we from something that can be published? The main obstacle to publishing results is that lists of enterprise URLs do not exist in most countries (i.e., it is impossible to determine the population). As an experimental activity, we started getting the first results only recently; six further months of activity would be needed to reach solid results. However, the prototype is producing promising outcomes in two countries. Reliable statistics will most probably be based on multiple sources (job portals, administrative data from employment agencies, ...). What case has been made for the future of the Sandbox? The prototype IT tool for detecting job vacancies is available in the Sandbox and can be used by other countries with small modifications of its parameters.
40 INTRODUCTION ACTIVITIES Comtrade FUTURE OF SANDBOX OUTCOMES
41 Comtrade: Description. UN Comtrade has compiled the official trade statistics database since 1962; it contains billions of records. Given the interest in measuring economic globalization through trade, trade data have been used to analyse the interlinkages between economies. This project exploits the capability of the Sandbox tools to process large amounts of data, in order to analyse the trade network on a wider scale, identify regional global value chain networks and analyse their properties.
42 Comtrade: Data Characteristics. A dump of the Comtrade database. Each record represents an import/export flow between two countries (reporter and partner). The data stored in the Sandbox cover the years 2000 to 2014, in the Harmonized System classification, for all available reporters. The total is around 25 million records.
43 Comtrade: Activities. Data acquisition: extraction from the UN Comtrade database and transfer via FTP to the Sandbox; data loaded into HDFS and made available in Hive; data cleaned up before it could be fed into Hive (eliminating quotes, commas, etc.). Tools: Pig, Hive. Data preparation: imputation of gaps, present because not all countries regularly report to UN Comtrade for every reporter-period combination; remapping of codes. Tools: Pig, Hive.
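The clean-up step (stripping quotes and embedded commas so records split cleanly on a delimiter when loaded into Hive) could be sketched like this; the field layout of the sample record is an assumption:

```python
import csv
import io

def clean_for_hive(raw_csv, delimiter="\t"):
    """Re-emit quoted CSV as delimiter-separated text with no quotes and no
    embedded commas, so Hive can split fields on the delimiter."""
    out = io.StringIO()
    for row in csv.reader(io.StringIO(raw_csv)):
        cleaned = [field.replace('"', "").replace(",", " ") for field in row]
        out.write(delimiter.join(cleaned) + "\n")
    return out.getvalue()

# Invented Comtrade-like record: year, reporter, partner, commodity, value
raw = '2011,"Italy","France","8703, motor cars",1234567\n'
print(clean_for_hive(raw))
```

The actual project did this with Pig and Hive; the point is the same — text records must be delimiter-safe before being declared as an external Hive table.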
44 Comtrade: Activities. Quality analysis: measured the coherence between symmetric flows and detected the number of missing values; tested two different approaches. Tools: Hive + Python, RHadoop. Data visualization and network analysis: built several visualizations of the trade networks using different tools; computed several metrics of graph characteristics and analysed differences and trends in the graph indicators; tested two different approaches. Tools: Gephi, D3, Spark, Hive + R.
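The coherence check compares each reported flow with its mirror: country A's reported exports to B against B's reported imports from A. A minimal sketch, with invented figures:

```python
def mirror_discrepancy(flows):
    """For each (exporter, importer) pair, compare the export value reported
    by the exporter with the import value reported by the importer.
    Returns (absolute difference, percentage difference), or None if the
    mirror flow is missing."""
    result = {}
    for (reporter, partner, flow), value in flows.items():
        if flow != "export":
            continue
        mirror = flows.get((partner, reporter, "import"))
        if mirror is None:
            result[(reporter, partner)] = None  # missing mirror flow
            continue
        diff = value - mirror
        result[(reporter, partner)] = (diff, 100.0 * diff / mirror)
    return result

# (reporter, partner, flow) -> trade value; figures are invented
flows = {
    ("ITA", "FRA", "export"): 105.0,
    ("FRA", "ITA", "import"): 100.0,
    ("ITA", "DEU", "export"): 80.0,  # DEU did not report imports
}
print(mirror_discrepancy(flows))
```

Pairs with a missing mirror are exactly the gaps the data-preparation step later imputes; the absolute-vs-percentage distinction matches the quality-analysis output on slide 46.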
45 Comtrade: Outputs - Visualizations. Trade of intermediate goods by geographical region, made with Gephi and with D3.
46 Comtrade: Outputs - Quality Analysis. Differences in symmetric import-export flows (absolute vs percentage).
47 Comtrade: Outputs - Network Analysis. Distribution of PageRank, computed with Spark; distribution of local clustering, computed with Hive + R. Import vs export, 2005 vs 2014, different network indicators, different commodities.
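The PageRank computation over the trade network can be illustrated with a plain-Python power iteration on a toy graph (the experiment itself used Spark; the country edges below are invented):

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank on a directed graph given as (src, dst) edges."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for n, targets in out_links.items():
            if not targets:
                continue  # dangling node: ignored in this sketch
            share = rank[n] / len(targets)
            for t in targets:
                contrib[t] += share
        rank = {n: (1 - damping) / len(nodes) + damping * contrib[n]
                for n in nodes}
    return rank

# Toy trade network: an edge means "exports to" (invented)
edges = [("ITA", "FRA"), ("FRA", "ITA"), ("DEU", "ITA"), ("DEU", "FRA")]
ranks = pagerank(edges)
print(sorted(ranks, key=ranks.get)[0])  # -> DEU (receives no flows)
```

On the real data each (flow, year, commodity) combination yields its own network, which is how the import-vs-export and year-on-year comparisons above were produced.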
48 Comtrade: Outputs - Network Analysis. Countries that appeared at least once in the top ten ranks after applying the PageRank algorithm to each network.
49 Comtrade: Findings. Relevance: a comprehensive analysis of global value chains through trade networks in all economic sectors is crucial to better understanding international trade, and new approaches such as those we experimented with are needed. Technology: 8 different tools/languages were used to work with the data; starting from data in a basic text format made it easy to switch from one tool to another; processing the data in the Sandbox provided evident advantages in processing time and manageable data size compared with the current tool used at UNSD (a relational database).
50 Comtrade: Findings. Methodology: the methodology used to prepare and analyse trade data is not new, but the Sandbox environment enables comprehensive analysis of the whole data set. Novel methods for "automatic" detection of network clusters through machine-learning approaches are on the agenda and should be tested before the end of the project.
51 Comtrade: How far are we from something that can be published? The objective was not to produce statistics; however, analyses such as those produced in this experiment are normally published by international research centres and institutions. What case has been made for the future of the Sandbox? It is possible to use big data tools and technologies (Hadoop, Pig, Hive, Spark and Gephi) to process and analyse large volumes of trade data. The ease of setting up the data environment, the computing power and the availability of built-in libraries for network analysis may change the way trade analysts work.
52 INTRODUCTION ACTIVITIES FUTURE OF SANDBOX OUTCOMES
53 Beyond the Big Data Project. Extend access to the Sandbox beyond the current project (December 2015), based on strong interest from a number of statistical organisations. ICHEC is willing to continue providing the Sandbox as a service to the international statistical community, on a non-profit basis. Users will be required to pay an annual subscription to cover the costs of technical support, hardware upgrades and software installation. Below: some use cases for the future Sandbox.
54 The Sandbox: Use Cases. Running experiments and pilots: the Sandbox can be used for experiments involving creating and evaluating new software, developing new methodologies and exploring the potential of new data sources. This use case extends the current role of the Sandbox beyond Big Data and encompasses all types of data sources.
55 The Sandbox: Use Cases. Testing: setting up and testing statistical pre-production processes is also possible in the Sandbox, including simulating complete workflows and process interactions. The environment could also be used for testing other kinds of software, beyond Big Data tools.
56 The Sandbox: Use Cases. Training: the Sandbox can be used as a platform for supporting training courses. It can run special software for high-performance computing that cannot be installed or run on standard computers. Non-confidential demonstration datasets can be uploaded and shared, facilitating shared training activities across organisations. The Sandbox environment also gives statisticians opportunities for self-learning, e-learning and learning by doing.
57 The Sandbox: Use Cases Supporting the implementation of the Common Statistical Production Architecture (CSPA) The sandbox can be used as a statistical laboratory where researchers can jointly develop and test new CSPA-compliant software 57
58 The Sandbox: Use Cases. Data Hub: the Sandbox also provides a shared data repository (subject to confidentiality constraints). It can be used to share non-confidential data sets that cover multiple countries, as well as public-use micro-data sets.
59 The Sandbox: Use Cases. Big Data Technologies for Statistics: parallel processing and NoSQL appear able to effectively complement traditional software such as RDBMSs and conventional file systems. Big Data technologies have proven usable for many standard processing tasks on statistical data.
60 What you pay: a subscription fee of 10k per year. What you get: access to the Sandbox, shared tools, datasets and other resources for your staff; access to international collaboration projects and opportunities; technical support to keep the Sandbox infrastructure up to date and relevant to your needs; support for coordinating experiments between countries and for knowledge sharing.
61 How to Subscribe. Expression of interest coordinated by UNECE. Each subscriber will have a seat on the Strategic Advisory Board, a new group which will oversee Sandbox operations and collectively decide, in consultation with ICHEC, on subscription levels and priorities for expenditure on software and hardware. Any organisation producing official statistics can subscribe to the Sandbox; other organisations may be considered on a case-by-case basis, subject to the approval of the Strategic Advisory Board. Because Sandbox activities are closely linked to the work of the HLG-MOS, the UNECE has been asked to facilitate contacts between the official statistical community and ICHEC, and to support the functioning of the Strategic Advisory Board.
62 INTRODUCTION ACTIVITIES FUTURE OF SANDBOX OUTCOMES
63 Summary of Activities: Outcomes. Publishable products: yes, as "alternative" statistics; already published in some countries; further work required; yes, as research. Sandbox value per activity: storage and collection, use of tools, and sharing of methods.
64 Technology. Proved the use of big data technology to process traditional statistical data more efficiently, a novel trend in statistical organizations. Proved a practical advantage of Sandbox tools over traditional ones, even when dealing with medium-size data. Different tools could be used in the same experiment on the same dataset.
65 Use of Tools: Hive, Pig, RHadoop, Elasticsearch, Spark, Python, R, graph visualization.
66 Sharing. Consolidated the importance and the advantages of working cooperatively. Shared knowledge on tools, methods and solutions. Gathered lessons learned from activities carried out outside the project as well. Great value from the Sandbox approach: a common environment ready immediately, with no need for installations, configurations, etc.
67 Usable outcomes. Datasets: wikistats, comtrade, tweets. Tools: web scraping, Twitter collection, wikistats preprocessing. Training material on tools and sources.
68 Sources. It is difficult to find quality sources: public data is limited in terms of expected quality and/or requires a lot of processing, and privacy issues are always present, even in apparently public sources (e.g. limitations on the use and sharing of scraped data). Two kinds of actions are needed: negotiations and agreements with providers, and political action at the legislative level.
69 Big Data Features. Statistics based on big data sources will be different from what we have today: new sources can cover aspects of reality that are not covered by traditional ones. Methods should adapt: accept different definitions of quality, and a wider interpretation of results (e.g. consider distortions due to events). We should learn to accept the inherent instability of sources in the short and long term.
70 Value of the Sandbox. The Sandbox represents the fastest route available to statistical organizations for getting started with Big Data. It offers several features that facilitate the approach to Big Data and data science: an infrastructure for big data processing, ready to use at a low subscription cost; software already installed and tested; shared datasets instantly available; tools and material for capacity building. It is driven by the community: a worldwide network of organizations with a catalogue of initiatives; a place for collaboration on methods and products, not only related to Big Data; a unique container of experience in the management of statistical data.
71 Concluding Remarks. Huge thanks to all participants and to the HLG. We feel we have made great steps forward from where we started two years ago. Our hope is not to waste the experience (tools, methods, knowledge, networking) and to connect the Sandbox with national experiences and with other sandboxes.
72 THANK YOU!
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Concept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
PEPPERDATA IN MULTI-TENANT ENVIRONMENTS
..................................... PEPPERDATA IN MULTI-TENANT ENVIRONMENTS technical whitepaper June 2015 SUMMARY OF WHAT S WRITTEN IN THIS DOCUMENT If you are short on time and don t want to read the
GSBPM. Generic Statistical Business Process Model. (Version 5.0, December 2013)
Generic Statistical Business Process Model GSBPM (Version 5.0, December 2013) About this document This document provides a description of the GSBPM and how it relates to other key standards for statistical
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
ESS event: Big Data in Official Statistics
ESS event: Big Data in Official Statistics v erbi v is 1 Parallel sessions 2A and 2B LEARNING AND DEVELOPMENT: CAPACITY BUILDING AND TRAINING FOR ESS HUMAN RESOURCES FACILITATOR: JOSÉ CERVERA- FERRI 2
Client Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Introduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure June 12, 2013 2 Agenda Let s talk about Data Infrastructure, how we did it, what we learned and how we ve failed Some Context
Sentimental Analysis using Hadoop Phase 2: Week 2
Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Big Data Analytics Roadmap Energy Industry
Douglas Moore, Principal Consultant, Architect June 2013 Big Data Analytics Energy Industry Agenda Why Big Data in Energy? Imagine Overview - Use Cases - Readiness Analysis - Architecture - Development
Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack
Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack HIGHLIGHTS Real-Time Results Elasticsearch on Cisco UCS enables a deeper
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala
Statistical data editing near the source using cloud computing concepts
Distr. GENERAL WP.20 6 May 2011 ENGLISH ONLY UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE (UNECE) CONFERENCE OF EUROPEAN STATISTICIANS EUROPEAN COMMISSION STATISTICAL OFFICE OF THE EUROPEAN UNION (EUROSTAT)
Hadoop & SAS Data Loader for Hadoop
Turning Data into Value Hadoop & SAS Data Loader for Hadoop Sebastiaan Schaap Frederik Vandenberghe Agenda What s Hadoop SAS Data management: Traditional In-Database In-Memory The Hadoop analytics lifecycle
How To Understand The Data Collection Of An Electricity Supplier Survey In Ireland
COUNTRY PRACTICE IN ENERGY STATISTICS Topic/Statistics: Electricity Consumption Institution/Organization: Sustainable Energy Authority of Ireland (SEAI) Country: Ireland Date: October 2012 CONTENTS Abstract...
Big Data Analytics Platform @ Nokia
Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform
Big Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
NetView 360 Product Description
NetView 360 Product Description Heterogeneous network (HetNet) planning is a specialized process that should not be thought of as adaptation of the traditional macro cell planning process. The new approach
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline
Case Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo [email protected] HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
