Big Data and Cloud Computing for GHRSST




Big Data and Cloud Computing for GHRSST
Jean-François Piollé (jfpiolle@ifremer.fr), Frédéric Paul, Olivier Archer
CERSAT / Institut Français de Recherche pour l'Exploitation de la Mer

Facing the data deluge

- Today's LTSRF archive: 49 TB
- Increasing number of operational satellites, forthcoming Chinese / Indian programs
- Increasing sensor spatial and temporal resolution

Challenges:
- How to allow a high revisit rate over historical (and present) data?
- How to perform data-intensive processing?
- How to afford a large online archive?
- How to transfer data to users? How to store data locally?

Storage bottleneck, processing bottleneck, network bottleneck: can new big data and cloud computing technologies help with that?
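The network bottleneck can be made concrete with a back-of-envelope calculation: moving the 49 TB archive cited above to a user site takes weeks even on fast links. A minimal sketch; the link speeds and the 70% payload efficiency are illustrative assumptions, not measured figures.

```python
# Back-of-envelope estimate of the network bottleneck: time to move the
# 49 TB LTSRF archive over a link. Link speeds and the 70% payload
# efficiency are illustrative assumptions.

def transfer_days(volume_tb, link_mbit_s, efficiency=0.7):
    """Days needed to transfer volume_tb over a link of link_mbit_s."""
    volume_bits = volume_tb * 1e12 * 8                    # TB -> bits (decimal)
    seconds = volume_bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 86400

for mbit in (100, 1000, 10000):                           # 100 Mbit/s .. 10 Gbit/s
    print(f"{mbit:>5} Mbit/s: {transfer_days(49, mbit):6.1f} days")
```

Even at a sustained gigabit the full archive is close to a week of transfer, which is why moving processing to the data, rather than data to the user, is the recurring theme of the talk.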

How to cope with data volume?

- Usage of high-resolution data is fine for case studies: limited amount of data
- Current solution for long time series: generation of high-level fusion products (L3 / L4)
  - involves data transformation: averaging, smoothing, ...
  - suitable for some applications only
- What about more data-intensive applications?
  - highest spatial and temporal resolution
  - feature detection (fronts, eddies, ...)
  - data merging and synergy

Are massive, central, static and one-way archive centers still relevant? [Diagram: the data center holds terabytes of data, while the user retrieves only megabytes of processing results.]


Main aspects to consider

A layered stack: data analysis, user services, cloud computing, virtualization, workflow management, data organization and format, file system, storage (hardware).

- Big data: a very confusing term. How to deal with data volume growth and complexity to extract fast, relevant information? It covers design approaches and strategies for large volumes of data, and the associated issues with data management, organization, storage and processing.
- Cloud computing: also very confusing. In our context it means offering a flexible remote processing capability: virtualization plus dynamic allocation of resources.

Storage

- Online storage on disk is required for archives: restoration from tape runs at only 500 GB / day.
- Which technologies to consider? Big data centers (Google, Facebook, ...) rely on cheap hardware; the weaker reliability is balanced by duplication/redundancy.
- Storage is strongly inter-related with the file system (e.g. management of redundancy, distribution, ...).
- The connection strategy with the processing nodes must be considered: a data-intensive architecture takes data topology into account and distributes jobs "closest to the data".
- Goal: good processing and network performance while keeping a low budget.
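The 500 GB/day tape restoration rate quoted above is the core argument for keeping the archive online on disk. A quick check against the 49 TB archive size cited earlier in the talk:

```python
# Rough restoration-time check: how long the tape robot needs at the
# quoted 500 GB/day rate. The 49 TB figure is the LTSRF archive size
# cited earlier in the talk.

TAPE_RATE_GB_PER_DAY = 500

def restore_days(volume_tb):
    """Days needed to restore volume_tb from tape."""
    return volume_tb * 1000 / TAPE_RATE_GB_PER_DAY

print(f"49 TB archive: {restore_days(49):.0f} days from tape")   # ~98 days
```

Three months to pull the archive back from tape rules tape out as primary storage for any reprocessing campaign, hence the disk-cluster file systems compared on the next slide.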

File systems

Parallel and distributed: the disk cluster is seen as one large virtual space. Candidates evaluated:

- Lustre: complex administration (scalability, ...); no redundancy; bad fault tolerance
- MooseFS: simple administration; scalability; reliability and robustness (redundancy implemented through replicates, and soon parity bits); no quota support (coming soon)
- GlusterFS: complex maintenance and administration; bad reliability; not suitable for large numbers of files
- HDFS (Hadoop): high performance for streaming and massive distributed processing; requires a specific API for data access; Hadoop is optimized for key/value data structures, not image/swath-type structures

Cloud computing

Providing remote access and resources to users. Previous solutions:

- ssh to a server: limited, unmanageable allocation of resources; strong security issues
- ssh to a supercomputer: an expensive solution for data-intensive applications (which need no communication between processing nodes); strong environment constraints (specific system/software/libraries); often not at the same location as the data centers
- Grid technology: quite complex to use; strong environment constraints (specific system/software/libraries)

Cloud computing

- Virtualization: deploy user-dedicated, customized system environments (OS, libraries, software, ...), i.e. a remote machine close to the user's familiar environment.
- Cloud computing: management of resources, allocation/deployment of virtual servers; sustainability of processing environments.
  - IaaS: infrastructure
  - PaaS: platform (server + tools for processor integration, scheduling of reprocessing tasks, ...)
  - SaaS: software
- Private/public clouds:
  - public clouds (Amazon S3, ...): expensive (to be revised according to Ken), not adapted to large volumes of data, concerns about sustainability
  - private cloud: restricted to within the institute
  - hybrid clouds: a private cloud with controlled access for external users; security issues remain to be solved

CERSAT Nephelae platform

- Cloud computing: OpenStack, inherited from Nebula (Eucalyptus was also tried); access through ssh; remote desktop possible
- Virtualization: KVM; Ubuntu / CentOS with Matlab and scientific Python
- Workflow management: PBS Pro, Torque/Maui; data topology not taken into account
- Data organization and format: NetCDF4 conversion effort for existing datasets; 15.8 TB for GHRSST
- File system: MooseFS, full replication
- Storage (hardware): 400 TB, 414 cores
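The platform figures above imply a real cost for "full replication" on MooseFS: replicas come out of the quoted 400 TB of raw disk. A minimal sketch, assuming a replication factor of 2 (the slide does not quantify "full replication"):

```python
# What full replication costs in usable space on the Nephelae MooseFS
# cluster. The replication factor of 2 is an assumption; the 400 TB raw
# capacity and the 15.8 TB GHRSST share are the figures from the slide.

RAW_TB = 400
REPLICAS = 2          # assumed meaning of "full replication"
GHRSST_TB = 15.8

usable_tb = RAW_TB / REPLICAS
print(f"usable capacity : {usable_tb:.0f} TB")
print(f"GHRSST share    : {GHRSST_TB / usable_tb:.1%} of usable space")
```

Under that assumption only half the raw capacity holds unique data, which is the trade-off the storage slide describes: cheap hardware whose weaker reliability is bought back with redundancy.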

Feedback and experiences: engineering perspective

1. Cost of commercial solutions, and lack of optimization of the storage vs. processing strategy
2. Reliability of the file systems (so as not to lose any data) varies from one file system to another; a longer assessment (and mistakes) is needed
3. Virtualization and input/output performance: a drop of about 50%, about to be solved
4. Still completely unaddressed: using storage topology to distribute processing to the closest node
5. Access security issues when opening the platform to external users
6. Immature stability status of most components; lack of documentation
7. Lack of available expertise for our specific requirements
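Item 4 in the list above, distributing processing to the node closest to the data, can be sketched as a greedy placement loop: prefer a node that already holds a replica of the job's input file, and fall back to the least-loaded node otherwise. All names and the replica map below are illustrative, not the actual CERSAT scheduler.

```python
# Sketch of locality-aware job placement (the missing piece noted in
# item 4). Node names, file names and the replica map are hypothetical.

def place_jobs(jobs, replicas, nodes):
    """jobs: {job: input_file}; replicas: {file: set of nodes}; nodes: list."""
    load = {n: 0 for n in nodes}
    placement = {}
    for job, infile in jobs.items():
        local = replicas.get(infile, set()) & set(nodes)
        candidates = local if local else set(nodes)      # fall back to any node
        best = min(candidates, key=lambda n: load[n])    # least-loaded candidate
        placement[job] = best
        load[best] += 1
    return placement

replicas = {"sst_001.nc": {"node1", "node3"}, "sst_002.nc": {"node2"}}
jobs = {"front_detect_1": "sst_001.nc", "front_detect_2": "sst_002.nc"}
print(place_jobs(jobs, replicas, ["node1", "node2", "node3"]))
```

This is essentially what HDFS's MapReduce scheduler does natively; the point of item 4 is that with PBS/Torque on MooseFS, nothing in the stack provides it.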

Feedback and experiences: usage perspective

1. Used for reprocessing campaigns: deployment of an external partner's processor on a platform matching the developer's requirements. Reprocessing also allowed us to save the processing environments and replay parts of the reprocessing later under the exact same conditions: a continuous reprocessing capability.
2. Sandbox for various project contributors using and sharing the same data: product intercomparison and merging; tests of new algorithms, perturbing initial conditions or settings.
3. Systematic analysis of a dataset: detection of features in SST images; conversion to NetCDF4.

The batch processing tools we implemented (which take a list of data files as input) were a great help.
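The batch tools credited above take a list of data files as input; the core of such a helper is splitting that list into job-sized chunks, one chunk per cluster job. A minimal sketch, with hypothetical function and file names (not the actual CERSAT tools):

```python
# Minimal sketch of a list-of-files batch helper: split a file list into
# fixed-size chunks, one chunk per PBS/Torque job. Names are illustrative.

def chunk_filelist(files, files_per_job):
    """Yield successive sub-lists of at most files_per_job entries."""
    for i in range(0, len(files), files_per_job):
        yield files[i:i + files_per_job]

files = [f"sst_{d:03d}.nc" for d in range(10)]
batches = list(chunk_filelist(files, 4))
print(len(batches), "jobs:", [len(b) for b in batches])   # 3 jobs: [4, 4, 2]
```

Each chunk would then be passed to one submitted job, which is what makes systematic dataset-wide analyses such as the SST feature detection above practical on a batch queue.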

Questions for GHRSST

- These technologies are quite new and unstable: limited real expertise is available and technical challenges remain to be tackled, especially for scientific data, but many initiatives are popping up (physics, space agencies, ...).
- Is this a new paradigm for data centers? It will only help with some applications: it is not an answer to everything (traditional technologies still work)! It is a complementary tool to current data center services.
- GHRSST should be concerned about the capability building around its data heritage and about the user services for the exploitation of past data, from the user's perspective (not the data producer's).
- What are the experiences and prospects at the main GHRSST data nodes (PODAAC, NODC)? It is necessary to share, and possibly homogenize or interconnect, the available services.
- Should these aspects be part of the GHRSST strategic plan?