HPC technology and future architecture
Visual Analysis for Extremely Large-Scale Scientific Computing
KGT2 Internal Meeting
INRIA, France
Benoit Lange (benoit.lange@inria.fr)
Toàn Nguyên (toan.nguyen@inria.fr)

Outline
- The VELaSSCo project
  - General information
  - Members of the consortium
- Motivations of the project
- Objectives of the project
  - Target data
  - Develop a Big Data platform
- The VELaSSCo architecture
  - Big Data, what does it mean?
  - Data of Big Data
  - What are the challenges of Big Data
  - Grid vs Cloud
  - Big Data needs a distributed system
Benoit Lange - VELaSSCo - KGT2 - benoit.lange@inria.fr - Xi'An 7 May 2015

The VELaSSCo project
General information
- VELaSSCo is an EC-funded project that deals with end-user visualization of huge simulation data (Big Data).
- 3-year project (2014-2016)
- By 2020, the most crucial simulation results, such as those from the aeronautic or automotive industries, will no longer fit on a single machine or server.
- How to store, access, simplify and manipulate billions of records to extract the relevant information?
- How to represent information in a feasible and flexible way?
- How to visualise and interactively inspect the huge quantity of information these simulations produce, taking end-users' needs into account?

The VELaSSCo project
Members of the consortium
- Expertise: Big Data infrastructure, data analytics, visualization
- End-users / Beneficiaries
- Topics covered:
  - Big Data issues
  - HPC and Big Data
  - Handling, formatting, storage
  - Data access, extraction, reduction
  - Platforms
  - FEM models, DEM models, LB models
  - End-user testing
  - Usability verification
  - Reactivity
- Members by country:
  - Spain: ATOS, CIMNE
  - United Kingdom: UNEDIN
  - Norway: SINTEF, JOTNE
  - France: INRIA
  - Germany: FRAUNHOFER

Motivations of the project
[Figure: simulation workflow and the place of VELaSSCo]
- Pre-processing: geometry description, preparation of analysis data
- Calculation: computer analysis
- Post-processing: visualization of results
- Figure labels: VELaSSCo; pre- and post-processor

Motivations of the project
Simulation data are naturally linked to High Performance Computing.
Simulation has already entered the Big Data area:
- traditional supercomputer manufacturers such as SGI
- companies oriented to a massive number of customers, such as Amazon, which offer very attractive solutions for simulation software vendors (Elastic Compute Cloud, EC2; Simple Storage Service, S3)
- well-known simulation suites such as Matlab or OpenFOAM (precisely through Amazon services)
How big is current simulation data? Some examples:
- weather and climate (400 PB/year, now)
- nuclear and fusion energy (2 PB/time step now, and 200 PB/time step by 2020)
- high-energy physics, materials, chemistry, biology, fluid dynamics

Objectives of the project
Target data
- Total size: 50 GB to 1 PB (DEM); 30 GB to 50 TB (FEM)
- Partitions: 1 to 10,000
- Particles / elements: 10 million (DEM); 8 million to 1 billion (FEM)
- Time-steps: 1 billion (DEM); 40 to 25,000 (FEM)
- Variables per node: 10 variables (DEM); 2-8 scalars, 1-2 vectors, 1 tensor (?) (FEM)

Objectives of the project
The huge amount of data produced by HPC solvers can no longer be stored on a single machine, so the following are mandatory:
- Distributed post-processing
- Distributed visualization
A failing calculation node is a problem in HPC, so the data must be stored redundantly: Big Data.
The main objective of the VELaSSCo project is to build the VELaSSCo Platform, a system that performs distributed post-processing operations and visualization of very large simulations. To address this objective, VELaSSCo brings together Simulation and Big Data.
Develop a platform that targets most IT systems.

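The need for data redundancy mentioned on this slide can be illustrated with a minimal sketch. It assumes a simple round-robin replica placement, which is not the policy of any particular system (HDFS, for instance, defaults to a replication factor of 3 and uses its own placement strategy). The function names, node names and partition ids below are hypothetical and chosen only for illustration.

```python
def place_replicas(partition_ids, nodes, replication=3):
    """Assign each partition to `replication` distinct nodes, round-robin.

    Illustrative placement only; real systems use more elaborate policies.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, pid in enumerate(partition_ids):
        # Consecutive nodes (modulo the cluster size) hold the copies.
        placement[pid] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_nodes):
    """Return the partitions that still have at least one live replica."""
    failed = set(failed_nodes)
    return {pid for pid, holders in placement.items() if set(holders) - failed}


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, 6 simulation partitions, 2 copies each.
    nodes = ["n0", "n1", "n2", "n3"]
    placement = place_replicas(range(6), nodes, replication=2)
    print(sorted(readable_after_failure(placement, ["n1"])))
    print(sorted(readable_after_failure(placement, ["n0", "n1"])))
```

With two copies per partition, a single node failure leaves every partition readable, while two simultaneous failures can make some partitions unavailable. This is the trade-off behind choosing a replication factor.
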
The VELaSSCo architecture
Big Data, what does it mean?
- "Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey Global Institute)
- "Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)

The VELaSSCo architecture
Data of Big Data
- Usages:
  - Data cleaning
  - Data transformation
  - Data analysis
  - Data search
  - Data computation
  - Data visualization
- Heterogeneous data:
  - From sensors
  - Images
  - Media
  - Textual
  - Networks

The VELaSSCo architecture
What are the challenges of Big Data?
- Scale
  - Data volume
  - Distribution of computation and storage between different locations
  - Size of network and storage system
- Complexity
  - A wide variety of acquisitions
  - A large set of dimensions
  - Fuzzy data
- Heterogeneity
  - Scientific collaboration between several domains
  - Specific data formats
  - Complex workflow computation

The VELaSSCo architecture
Grid vs Cloud
- Grids
  - Owned by the scientific community
  - Batch computation
  - Computation time
  - Widely distributed
- Clouds
  - Mainly owned by industry
  - Simultaneous computations
  - CPU time
  - Can be distributed
  - Heterogeneous system
I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop, 2008. GCE '08, pages 1-10, Nov 2008.

The VELaSSCo architecture
Grid vs Cloud
- Business model: project-oriented (Grids); consumption basis (Clouds)
- Architecture: five layers (Grids); four layers, abstract resources, can be implemented over a grid (Clouds)
- Resources management: batch scheduling, shared file system (Grids); shared by all users, specific FS (Clouds)
- Programming model: workflow tools (Grids); Map-Reduce (Clouds)
- Application model: any (Grids); any, with difficulties for HPC problems (Clouds)
- Security model: strict security (Grids); strong security (Clouds)
I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop, 2008. GCE '08, pages 1-10, Nov 2008.

The VELaSSCo architecture
Big Data needs a distributed system
- The most suitable computational model for Big Data: MapReduce
  - Designed for large distributed systems
  - A simple programming model
  - Based on a specific FS
  - Designed to scale up
  - High availability
  - Handles node failures
  - Batch computation
- But this model has evolved (Hadoop 2.0)
  - More complex computation
  - Management of resources
  - A data-oriented operating system

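The "simple programming model" of MapReduce can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle and reduce phases, not Hadoop code and not the VELaSSCo implementation. The record layout (time step, node id, scalar value) and the sample values are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical simulation results, split into two partitions.
# Each record is (time_step, node_id, scalar_value).
PARTITIONS = {
    0: [(0, 1, 0.5), (0, 2, 1.5), (1, 1, 0.7)],
    1: [(0, 3, 2.0), (1, 3, 0.2), (1, 4, 0.9)],
}


def map_phase(records):
    """Emit one (key, value) pair per record: key = time step, value = scalar."""
    for time_step, _node_id, value in records:
        yield time_step, value


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values of one key into a single result (here, the maximum)."""
    return key, max(values)


def run_job(partitions):
    # Each partition is mapped independently; on a cluster these calls
    # would run in parallel on the nodes that hold the data.
    mapped = [pair for records in partitions.values() for pair in map_phase(records)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(PARTITIONS))  # {0: 2.0, 1: 0.9}
```

The user writes only `map_phase` and `reduce_phase`. Distribution, grouping and recovery from node failures are left to the framework, which is what makes the model attractive for batch post-processing of partitioned simulation results.
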
The VELaSSCo architecture
Simplified version
[Figure: simplified architecture diagram with alternative components ("or") leading to Visualisation]

The VELaSSCo architecture
[Figure: detailed architecture diagram]
- Visualization client
- VELaSSCo.Platform.Access.lib
- VELaSSCo.Engine.Layer (YARN)
  - Query Manager Module (asynchronous)
  - Monitoring: availability, resources, load, etc.
  - Graphics: compression / streaming, GPU struct
  - Analytics: LOD, D2C, iso, stream, stats
- VELaSSCo.Data.Layer
  - Storage Module
  - RT
  - Batch
  - Data Query
- VELaSSCo.Platform.DataIngestion.lib
  - Simulation
  - Ingestion & Processing (Flume)
- Storage back-ends: HBase, Hive, FS, Phoenix
- HadoopAbstractFileSystem: HDFS, NFS, EDM plug-in
- EDM
- Legend: existing software; to develop; results / data flow; queries flow; consortium; open; commercial version

Conclusions
- A Big Data platform for engineering data (FEM and DEM simulations) with support for visualisation tools:
  - GID (CIMNE)
  - ifx (Fraunhofer IGD)
- Support for real-time queries
- A Big Data architecture for any IT system
  - For example, it can coexist with an HPC cluster
  - Extensible (supports plug-ins)
- A database engine, based on widely used technologies such as Hadoop-HBase and ISO 10303 STEP, that can organise and store a diverse range of large-scale simulation data sets for collaborative use.
- An innovative approach, adopting Big Data best practices, to handle large-scale simulation data sets that have to be stored on multiple servers.
- A framework equipped with advanced in-situ processing tools to analyse the output of parallel distributed simulation solvers.
- An analysis platform to analyse and visualize large-scale data sets interactively, building on leading-edge graphics hardware.

Thank you for your attention.
More information is available at http://www.velassco.eu
You can contact me at: benoit.lange@inria.fr