Big Data and Cloud Computing for GHRSST
Jean-François Piollé (jfpiolle@ifremer.fr), Frédéric Paul, Olivier Archer
CERSAT / Institut Français de Recherche pour l'Exploitation de la Mer (Ifremer)
Facing the data deluge
- Today's LTSRF archive: 49 TB
- Increasing number of operational satellites; forthcoming Chinese and Indian programs
- Increasing sensor spatial and temporal resolution
Challenges
- How to allow a high revisit rate over historical (and present) data?
- How to perform data-intensive processing?
- How to afford a large online archive?
- How to transfer data to users?
- How to store data locally?
=> storage bottleneck, processing bottleneck, network bottleneck
Can new big data and cloud computing technologies help with this?
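As a rough illustration of the network bottleneck, consider pushing the full archive to a single user; assuming a sustained 100 Mbit/s link (an illustrative figure, not from the slides), the transfer takes about a month and a half:

    archive_bytes = 49 * 10**12           # today's LTSRF archive: 49 TB
    link_bytes_per_s = 100e6 / 8          # assumed sustained 100 Mbit/s link
    days = archive_bytes / link_bytes_per_s / 86400
    print(f"{days:.0f} days")             # ~45 days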
How to cope with data volume?
- Usage of high-resolution data is fine for case studies: limited amount of data
- Current solution for long time series: generation of high-level fusion products (L3 / L4)
  - involves data transformation: averaging, smoothing, ...
  - suitable for some applications only
- What about more data-intensive applications?
  - highest spatial and temporal resolution
  - feature detection (fronts, eddies, ...); see the sketch below
  - data merging and synergy
Are massive, central, static and one-way archive centers still relevant?
[Diagram: data center holding TBytes of data; user-side processings amounting to MBytes]
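To give an idea of such a data-intensive task, here is a minimal sketch of front detection as thresholding of the local SST gradient magnitude with numpy; the threshold, pixel size and synthetic field are illustrative assumptions, not an operational algorithm:

    import numpy as np

    def detect_fronts(sst, threshold=0.5, pixel_km=1.0):
        """Flag pixels whose SST gradient magnitude (K/km) exceeds a
        threshold; the values here are illustrative, not operational."""
        gy, gx = np.gradient(sst, pixel_km)   # K per km along each axis
        return np.hypot(gx, gy) > threshold   # boolean front mask

    # Toy usage: a smooth field with a sharp 2 K front at mid-swath
    sst = np.tile(np.linspace(285.0, 290.0, 100), (100, 1))
    sst[:, 50:] += 2.0
    print(detect_fronts(sst).sum(), "front pixels detected")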
Main aspects to consider
Layers involved, from data analysis down to hardware:
- Data analysis
- User services
- Cloud computing
- Virtualization
- Workflow management
- Data organization and format
- File system
- Storage (hardware)
"Big data": a very confusing term
- How to deal with data volume growth and complexity, to extract relevant information fast?
- An approach to design, and strategies, for large volumes of data
- Issues with data management, organization, storage and processing
"Cloud computing": also very confusing
- In our context: offering flexible remote processing capability
- Virtualization + dynamic allocation of resources
Storage
- Online storage on disk is required for archives: restoration from tape runs at about 500 GB/day, so restoring the full 49 TB archive would take over three months
- Which technologies should be considered?
- Big data centers (Google, Facebook, ...) rely on cheap hardware: weaker reliability is balanced by duplication/redundancy
- Strongly inter-related with the file system (e.g. management of redundancy, distribution, ...)
- Connection strategy with processing nodes to be considered: a data-intensive architecture should take data topology into account and distribute jobs "closest to the data" (see the sketch below), keeping processing and network performance high on a low budget
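A minimal sketch of such locality-aware dispatch, assuming a hypothetical replica catalogue mapping each granule to the storage nodes that hold a copy (all names and structures here are illustrative):

    # Hypothetical replica catalogue: granule -> nodes holding a copy
    replicas = {
        "sst_20100601.nc": ["node03", "node07"],
        "sst_20100602.nc": ["node01", "node03"],
    }
    node_load = {"node01": 0, "node03": 0, "node07": 0}  # pending jobs

    def dispatch(granule):
        """Run the job on the least-loaded node that already stores the
        data, avoiding a copy over the network."""
        candidates = replicas.get(granule, list(node_load))
        node = min(candidates, key=lambda n: node_load[n])
        node_load[node] += 1
        return node

    for granule in replicas:
        print(granule, "->", dispatch(granule))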
File systems
Parallel and distributed, for large volumes: the disk cluster is seen as one virtual space.
Lustre
- Scalability
- Complex administration (scalability, ...)
- No redundancy, bad fault tolerance
MooseFS
- Simple administration
- Reliability and robustness (redundancy implemented through replicates, and soon parity bits)
- No quota (coming soon)
GlusterFS
- Complex maintenance and administration
- Bad reliability
- Not suitable for large numbers of files
HDFS (Hadoop)
- Performance for streaming and massive distributed processing
- Requires a specific API for data access
- Hadoop is optimized for key/value data structures, not image/swath-type structures
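To illustrate the "specific API" point: HDFS is not a POSIX file system, so a granule has to be staged out (or read through the Hadoop libraries) before ordinary NetCDF tools can open it. A minimal sketch using the standard hadoop fs command line, with hypothetical paths:

    import subprocess

    # Stage a granule out of HDFS to local disk before processing it
    # with ordinary tools; both paths are illustrative.
    subprocess.check_call(
        ["hadoop", "fs", "-get", "/ghrsst/sst_20100601.nc", "/tmp/sst_20100601.nc"]
    )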
Cloud computing
Providing remote access and resources to users. Previous solutions:
- SSH to a server: limited, unmanageable allocation of resources; strong security issues
- SSH to a supercomputer: an expensive solution for data-intensive applications (which need no communication between processing nodes); strong environment constraints (specific system/software/libraries); often not at the same location as the data centers
- Grid technology: quite complex to use; strong environment constraints (specific system/software/libraries)
Cloud computing
- Virtualization => deploy user-dedicated, customized system environments (OS, libraries, software, ...) => a remote machine close to the user's familiar environment
- Cloud computing => management of resources, allocation/deployment of virtual servers
  - IaaS: infrastructure
  - PaaS: platform (server + tools for processor integration, scheduling of reprocessing tasks, ...)
  - SaaS: software
  => sustainability of processing environments
- Private/public clouds
  => public clouds (Amazon S3, ...): expensive (to be revised according to Ken), not adapted to large volumes of data, concerns about sustainability
  => private cloud: restricted to use within the institute
  => hybrid clouds: a private cloud with controlled access for external users; security issues remain to be solved
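As an example of the IaaS layer, a minimal sketch of provisioning a customized processing server through the classic OpenStack nova command line, driven from Python; the flavor, image and instance names are hypothetical placeholders:

    import subprocess

    # Boot a customized processing VM on the cloud; the flavor, image
    # and name below are illustrative placeholders, not real settings.
    subprocess.check_call([
        "nova", "boot",
        "--flavor", "m1.large",      # CPU/RAM profile
        "--image", "ubuntu-scipy",   # user-customized system image
        "ghrsst-reproc-01",          # instance name
    ])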
CERSAT Nephelae platform
- Data analysis: Matlab, scientific Python
- User services: access through SSH; remote desktop possible
- Cloud computing: OpenStack, inherited from Nebula (Eucalyptus also tried)
- Virtualization: KVM; Ubuntu / CentOS guest systems
- Workflow management: PBS Pro, Torque, Maui (data topology not taken into account)
- Data organization and format: netCDF4; conversion effort for existing datasets (15.8 TB for GHRSST)
- File system: MooseFS, full replication
- Storage (hardware): 400 TB, 414 cores
Feedback and experiences: engineering perspective
1. Cost of commercial solutions, and lack of optimization of the storage vs processing strategy
2. File system reliability (not losing any data) varies from one file system to another; a longer assessment (and mistakes) is needed
3. Virtualization and input/output performance: a drop of about 50%, about to be solved
4. Still completely to be addressed: using storage topology to distribute processing to the closest node
5. Access security issues when opening the platform to external users
6. Stability status of most components; lack of documentation
7. Lack of available expertise for our specific requirements
Feedback and experiences: usage perspective
1. Used for reprocessing campaigns
=> deployment of an external partner's processor on a platform matching the developer's requirements
=> also allowed us to save the processing environments and replay parts of the reprocessing later under exactly the same conditions
=> continuous reprocessing capability
2. Sandbox for various project contributors using and sharing the same data
=> product intercomparison and merging
=> testing of new algorithms, perturbing initial conditions or settings
3. Systematic analysis of a dataset
=> detection of features in SST images
=> conversion to NetCDF4
Great help from the batch processing tools we implemented, which take a list of data files as input; see the sketch below.
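A minimal sketch of what such a list-driven batch tool can look like, chunking a file list into Torque jobs submitted through qsub (which accepts a job script on stdin); the chunk size, processor script and directives are illustrative assumptions, not our actual tool:

    import subprocess

    def submit_batch(file_list, processor="process_granule.sh", chunk=50):
        """Split a list of granules into chunks and submit one Torque job
        per chunk; all names here are illustrative."""
        for i in range(0, len(file_list), chunk):
            cmds = "\n".join(f"{processor} {f}" for f in file_list[i:i + chunk])
            script = f"#!/bin/sh\n#PBS -l nodes=1\n{cmds}\n"
            subprocess.run(["qsub"], input=script, text=True, check=True)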
Questions for GHRSST
- These technologies are still new and unstable: limited real expertise is available and technical challenges are yet to be tackled, especially for scientific data, but many initiatives are popping up (physics, space agencies, ...)
- Is it a new paradigm for data centers?
  - It will only help with some applications: not an answer to everything (traditional technologies still work!)
  - A complementary tool to current data center services
- GHRSST should be concerned with the capability building around its data heritage and with the user services for the exploitation of past data
  => from the user perspective (not the data producer's)
- What are the experiences and prospects at the main GHRSST data nodes (PODAAC, NODC)?
  => necessary to share, and possibly homogenize or interconnect, the available services
- Should these aspects be part of the GHRSST strategic plan?