Adaptive Performance Optimization for Distributed Big Data Server - Application in Sky Surveying




Proposer
Name: FREZOULS Benoit
Organization: CNES
Address: 18 avenue Edouard Belin
Postal code, city: TOULOUSE
Email: benoit.frezouls@cnes.fr
Laboratory director: Jean-Michel.Fourneau@prism.uvsq.fr

Subject description

In a few years, the volume of data will be fifty times greater than it is today, and this affects most sectors, e.g. business, web, health, urban and scientific applications. In this regard, Big Data is a real challenge requiring new work and new proposals. Today, the Big Data market is led by worldwide information technology companies such as Google, Facebook, Yahoo and IBM, and Big Data is definitely a hot topic in database research.

At the European level, the Commission has outlined a new strategy on Big Data, supporting and accelerating the transition towards a data-driven economy in Europe. Managing and analyzing these data is indeed crucial to enable businesses and organizations to better understand their operations and to optimize various processes in order to be more competitive. Many research programs address this topic in H2020. Some are related to specific application domains, such as the COST action Big-Sky-Earth [1], related to universe and earth observation science. In France, the CNRS started an interdisciplinary challenge dedicated to big scientific data, MASTODONS [2].

Big Data has also attracted a lot of interest in research and academic programs, in particular at the University Paris Saclay (UPSay). The Master DataScale of UPSay, located at the University of Versailles, is part of this program. The ADAM group is involved in the MASTODONS GAIA project [3][4] and investigates query processing within a distributed server for sky data surveying. This thesis falls within this line of research.


Problem statement and objectives

The Big Data phenomenon is radically changing the terms of data management because it introduces new problems due to the volume, the transfer speed (velocity) and the diversity of types (variety) of data. In fact, the challenges of Big Data are related to these three "V"s. Today, parallel shared-nothing architectures using commodity machines have been established as the de facto standard in Big Data management, and Map/Reduce as a new programming paradigm. However, the original Map/Reduce framework lacks database functionality, such as a query service and query optimization. This has led to a long stream of research addressing this issue, but most of the proposed solutions remain limited. For instance, advanced load balancing is a weakness of Map/Reduce that has not been addressed yet. We aim to contribute in this area.

Motivated by an application domain in astrometry, the objective of this thesis is to dynamically adapt the execution plan of Map/Reduce tasks to the data distribution, whether at the level of the input or of the intermediate data flows. The questions are:
- Is it possible to build an optimized execution plan in Map/Reduce, as DBMSs attempt to do?
- Which building blocks from the state of the art could be used?
- How can a given plan be estimated or evaluated, and how can it be replaced?
- Can the optimizer be self-tuned to adapt to the data and workload profiles?
- What is the actual gain and payoff of the implemented solution?

The intuition is to monitor statistics on the input and the intermediate data flows in order to learn and predict the cost of each task, and to automatically trigger some physical optimization, such as partitioning or local indexing. Many other techniques will be explored, including cache-based optimization for multi-query scenarios, using heuristics only or combined with statistics.
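As an illustration of this intuition, the following minimal Python sketch collects a statistic (the number of objects per grid cell), measures the resulting load imbalance across reducers, and only triggers a repartitioning when the imbalance exceeds a threshold. Everything in it is an assumption made for illustration and not the method of the thesis: the 2D points, the equi-sized grid, the function names, the threshold value of 1.5, and the greedy "heaviest cell first" heuristic are all hypothetical choices.

```python
from collections import Counter
import heapq


def cell_key(x, y, cell_size):
    """Map a 2D point to the key of the equi-sized grid cell containing it."""
    return (int(x // cell_size), int(y // cell_size))


def collect_statistics(points, cell_size):
    """Count objects per grid cell (the monitored statistic)."""
    return Counter(cell_key(x, y, cell_size) for x, y in points)


def hash_assignment(counts, num_reducers):
    """Baseline plan: assign each cell to a reducer by hashing its key."""
    return {cell: hash(cell) % num_reducers for cell in counts}


def balanced_assignment(counts, num_reducers):
    """Greedy heuristic (an assumption, not the thesis method): place the
    heaviest cells first, each on the currently least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for cell, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[cell] = r
        heapq.heappush(heap, (load + n, r))
    return assignment


def reducer_loads(counts, assignment, num_reducers):
    """Number of objects each reducer would receive under an assignment."""
    loads = [0] * num_reducers
    for cell, n in counts.items():
        loads[assignment[cell]] += n
    return loads


def imbalance(loads):
    """Ratio of the heaviest reducer load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


def choose_assignment(counts, num_reducers, threshold=1.5):
    """Keep the baseline plan unless the monitored imbalance exceeds the
    (hypothetical) threshold; otherwise switch to the balanced plan."""
    base = hash_assignment(counts, num_reducers)
    if imbalance(reducer_loads(counts, base, num_reducers)) <= threshold:
        return base
    return balanced_assignment(counts, num_reducers)
```

A caller would compute `counts = collect_statistics(points, cell_size)` on the input or on a sample of it, then use `choose_assignment(counts, num_reducers)` as the partitioning table. Note that this sketch only moves whole cells between reducers: a single cell heavier than the mean load cannot be balanced this way and would require finer partitioning, which is not shown here.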
Overview of the state of the art:

The Map/Reduce framework is increasingly used to process large volumes of data in both commercial and scientific applications, in order to overcome the limitations of centralized architectures. Map/Reduce is scalable in terms of the number of computational nodes [5] and is inherently fault-tolerant. Map/Reduce has gained popularity because users can parallelize applications without explicitly handling any parallelization details. In the Map/Reduce paradigm, a core problem is the design of a relevant partitioning function between map and reduce tasks. This function assigns a key to every object in the map task, and objects with the same key are distributed to the same reducer in the shuffling process. The reducer performs the operation locally over the objects that have been assigned to it.

When dealing with spatial objects, Map/Reduce naturally uses spatial partitioning [6], where every object that falls in a given cell is sent to the relevant reducer. A grid with equi-sized cells is often used [7], [10]. Since the spatial distribution of real-world data is often skewed, using a grid results in an unbalanced partitioning. As shown in [8], an unbalanced partitioning decreases performance in Hadoop (the most popular implementation of Map/Reduce) [9]. Also, as all data must be loaded in main memory before being processed, the treatment of dense cells would require costly disk swaps or, in the worst case, fail. It is worth noting that imbalance can be encountered in many other contexts, and many works address it for specific purposes (e.g. for entity resolution in [11]). However, to the best of our knowledge, there is no global and generic approach to cope with it. In particular, as far as we know, the envisaged approach, driven by the data and its usage, has never been investigated.

Application domain and dataset:

The Gaia space mission, launched by the European Space Agency at the end of 2013, will provide an unprecedented census of our galaxy in size, scope, and accuracy, encompassing astrometry for over one billion objects in the sky. The main goal of the Gaia archive is to provide a comprehensive repository of the rich data products to be generated by Gaia, together with a range of access mechanisms and associated helper applications that enable effective access to the Gaia data by the end-user science community. The Gaia mission will pose several challenges to current data management technologies, mainly due to the unprecedented amount of data that will be produced (around 1 PB).
In this context, basic operations like data searching, analytics or visualization are becoming increasingly difficult and in many cases almost impossible. Simple database queries can now return results so large that they are extremely slow to handle, extremely hard to analyze, and impossible to visualize with available tools. This thesis will exploit the catalogs that are used and/or will be produced by the Gaia mission. We can distinguish three categories: the repository of data collected from the campaigns prior to Gaia, the simulated dataset, and the actual measurements produced by the Gaia satellite, expected from the end of [...].

Work Plan:
- 1st semester: State of the art and familiarization with the corpus and tools
- 2nd semester: Survey report and first proposal
- 2nd year: Refinement and validation, first paper or demo submission
- 3rd year: Publication and writing of the manuscript


References:
1. Big Data Era in Sky and Earth Observation (BIG-SKY-EARTH).
2. MASTODONS Challenge.
3. Gaia, l'origine et l'évolution de notre Galaxie : validation des données. Défi MASTODONS - Grandes masses de données scientifiques.
4. A. Brown, F. Arenou, N. Hambly, F. v. Leeuwen, X. Luri, J.-C. Malapert, W. O'Mullane, D. Tapiador, and N. Walton. Gaia data access scenarios summary. Technical Report GAIA-C9-TN-LEI-AB-026, ESA European Space Agency, 2012.
5. Hadoop project.
6. A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop GIS: a high performance spatial data warehousing system over MapReduce. Proceedings of the VLDB Endowment, 6(11).
7. A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG_Hadoop: computational geometry in MapReduce. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM.
8. C. Doulkeridis and K. Nørvåg. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal, 23(3).
9. Y. Kwon, K. Ren, M. Balazinska, B. Howe, and J. Rolia. Managing skew in Hadoop. IEEE Data Eng. Bull., 36(1):24-33.
10. T. Seidl, S. Fries, and B. Boden. MR-DSJ: Distance-based self-join for large-scale vector data analysis with MapReduce. In 15. GI-Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW 2013), Magdeburg, Germany. Bonn, Germany, GI.
11. L. Kolb, A. Thor, and E. Rahm. Load balancing for MapReduce-based entity resolution. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE), April 2012.

Proposed host laboratory: PRiSM - Parallélisme, réseaux, systèmes, modélisation

Candidate profile: Master's degree in computer science. For example (non-exhaustive list):


- Research Master COSY (des COncepts aux SYstèmes), jointly accredited by Université Versailles Saint-Quentin and Telecom SudParis; future Master DataScale of Université Paris Saclay
- Master IAC (Information, Apprentissage, Cognition), Université Paris 11
- Master DAC (Données, Apprentissage et Connaissances), Université Paris 6, jointly accredited with Telecom ParisTech

Thesis supervisor
Name: Zeitouni Karine
Email: Karine.Zeitouni@prism.uvsq.fr

CNES supervisor of the thesis
Name: Frezouls Benoit
Department: DCT/PS/SN
Email: benoit.frezouls@cnes.fr

