The Ophidia framework: toward cloud- based big data analy;cs for escience Sandro Fiore, Giovanni Aloisio, Ian Foster, Dean Williams Sandro Fiore, Ph.D. CMCC Scientific Computing and Operations Division sandro.fiore@cmcc.it!
The big challenge is to model this complex system! Several complex processes to be simulated Several interacting processes Great range of time scales to be analyzed Great range of spatial scales to be considered Need interdisciplinar sciences (physics, chemistry, biology, geology, ) Inherently non-linear governing equations Need sophisticated numerics Need huge computational resources and large volumes of data can be produced Warren M. Washington NCAR Scien;fic Grand Challenges Workshop Series: Challenges in Climate Change Science and the Role of Compu;ng at the Extreme Scale DOE Workshop (ASCR- BER) November 6-7, 2008
ESGF & the CMIP5 data archive!
Climate change domain: the current scientific workflow and the ESGF use case! Workflow: search, locate, download, analyze, display results J. Chen, A. Choudhary, S. Feldman, B. Hendrickson, C.R. Johnson, R. Mount, V. Sarkar, V. White, D. Williams. Synergistic Challenges in Data- Intensive Science and Exascale Computing, DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science, March, 2013.
The Ophidia Project Ophidia is a research effort carried out at the Euro Mediterranean Centre on Climate Change (CMCC) to address big data challenges, issues and requirements for climate change data analytics. S. Fiore, A. D Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, Ophidia: toward bigdata analytics for escience, ICCS2013 Conference, Procedia Elsevier, Barcelona, June 5-7, 2013
Requirements and needs focus on: Time series analysis Data subsetting Model intercomparison Multimodel means Massive data reduction Data transformation (through array-based primitives) Param. Sweep experiments (same task applied on a set of data) Climate change signal Maps generation Ensemble analysis Data analytics worflow support But also Performance re-usability Data analytics requirements and use cases! extensibility
Ophidia Architecture Declarative language Front end Compute layer Standard interfaces Analytics Framework I/O layer Array-based primitives I/O server instance Storage layer System catalog New storage model Partitioning/hierarchical data mng
Array based primitives! Primitives provide array-based transformation A comprehensive set of primitives have been already implemented ( 100) By definition, a primitive is applied to a single fragment They come in the form of plugins (I/O server extensions) So far, Ophidia primitives perform data reduction, sub-setting, predicates evaluation, statistical analysis, compression, and so forth. Support is provided both for byte-oriented and bit-oriented arrays Plugins can be nested to get more complex functionalities Compression is provided as a primitive too Libraries like PetsC, GSL, C Math have been integrated
Array based primitives: OPH_BOXPLOT oph_boxplot(measure, "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output)
Array based primitives: OPH_AGGREGATE oph_aggregate(measure,"oph_avg ) Single chunk or fragment (input) Ver%cal aggrega%on Single chunk or fragment (output)
Array based primitives: nesting feature oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output) subarray(measure, 1,18)
The analytics framework: datacube operators (about 50)! OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-agnostic OPH_APPLY(datacube_in, datacube_out, array_based_primitive) OPH_DUPLICATE(datacube_ in, datacube_out) OPH_SUBSET(datacube_in, subset_string, datacube_out) OPH_MERGE(datacube_in, merge_param, datacube_out) OPH_SPLIT(datacube_in, split_param, datacube_out) OPH_INTERCOMPARISON (datacube_in1, datacube_in2, datacube_out) OPH_DELETE(datacube_in) Creates the datacube_out by applying the array-based primitive to the datacube_in Creates a copy of the datacube_in in the datacube_out Creates the datacube_out by doing a sub-setting of the datacube_in by applying the subset_string Creates the datacube_out by merging groups of merge_param fragments from datacube_in Creates the datacube_out by splitting into groups of split_param fragments each fragment of the datacube_in Creates the datacube_out which is the element-wise difference between datacube_in1 and datacube_in2 Removes the datacube_in Data Access (sequential and parallel operators) Metadata management (sequential and parallel operators) Data processing (parallel operators, MPI & OpenMP based) Import/Export (parallel operators) OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-oriented OPH_EXPORT_NC Exports the datacube_in data into the (datacube_in, file_out) file_out NetCDF file. OPH_IMPORT_NC Imports the data stored into the file_in (file_in, datacube_out) NetCDF file into the new datacube_in datacube Operators Data access OPH_INSPECT_FRAG Inspects the data stored in the (datacube_in, fragment_in) fragment_in from the datacube_in OPH_PUBLISH(datacube_in) Publishes the datacube_in fragments into HTML pages Operators Metadata OPH_CUBE_ELEMENTS Provides the total number of the (datacube_in) elements in the datacube_in OPH_CUBE_SIZE Provides the disk space occupied by the (datacube_in) datacube_in OPH_LIST(void) Provides the list of available datacubes. OPH_CUBEIO(datacube_in) Provides the provenance information related to the datacube_in OPH_FIND(search_param) Provides the list of datacubes matching the search_param criteria
The analytics framework: datacube operators!
Metadata operator (provenance management)
EUBrazil Cloud Connect! The main objec%ve is the crea%on of a federated e- infrastructure for research using a user- centric approach. To achieve this, we need to pursue three objec%ves: Adapta;on of exis%ng applica%ons to tackle new scenarios emerging from coopera%on between Europe and Brazil relevant to both regions. Integra%on of frameworks and programming models for scien;fic gateways and complex workflows. Federa%on of resources, to build up a general- purpose infrastructure comprising exis;ng and heterogeneous resources Addi%onally, EUBrazilCC will: perform an ac%ve dissemina;on campaign, analyse innova;on, foster the involvement of Brazilian ins%tu%ons in cloud standards defini;on, and bring the EU Cloudscape series to broader interna%onal audience. EU Coordinator Ignacio Blanquer- Espert, iblanque@dsic.upv.es Universitat Politècnica de València, Spain BR Coordinator Francisco Vilar Brasileiro, fubica@dsc.ufcg.edu.br Universidade Federal de Campina Grande, Brazil!
Use Case on Biodiversity and Climate Change! Objec;ve: Understand the impact of climate change on terrestrial biodiversity through two workflows based on Earth observa%on and ground level data. Technical Challenge: Integrate parallel data analysis with other processing workflows in a geographically distributed environment. Interna;onal Added Value: Integra%on of biodiversity data and modelling with mul%spectral and remote sensing data for studying the cross- correla%on of biodiversity and climate change. Climate & Biodiversity Clearing- house Parallel Data Analysis Species- Link CMCC CIMP5 Imaging Data Federated Infrastructure & Pla\orm
Cloud-based deployment scenarios! Deployment A I/ O S C VM Instance Legend Applica%on Image Data Image Deployment B VM Instance S I/ O I/ O I/ O C VM Instance 1 VM Instance VM C Instance n C Data IO Component Compute Component Server Component Deployment C VM Instance 1 VM Instance 1 VM Instance S VM Instance VM Instance C n C C I/ O I/ O I/ O VM Instance VM C Instance m C
References! Papers [1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, Scientific big data analytics challenges at large scale, Big Data and Extreme-scale Computing (BDEC), April 30 to May 01, 2013, Charleston, USA (position paper). [2] S. Fiore, G. Aloisio, I. Foster, D. N. Williams, A software infrastructure for big data analytics, Big Data and Extreme-scale Computing (BDEC), February 26-28, 2014, Fukuoka, Japan (position paper). [3] S. Fiore, A. D'Anca, C. Palazzo, I. Foster, Dean N. Williams, Giovanni Aloisio, Ophidia: Toward Big Data Analytics for escience, ICCS 2013, June 5-7, 2013 Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385. [4] S. Fiore, C. Palazzo, A. D Anca, I. Foster, D. N. Williams, G. Aloisio, A big data analytics framework for scientific data management, Workshop on Big Data and Science: Infrastructure and Services, IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8. [5] S. Fiore, A. D'Anca, D. Elia, C. Palazzo, I. Foster, D. Williams, G. Aloisio, "Ophidia: A Full Software Stack for Scientific Data Analytics, proc. of the 2014 International Conference on High Performance Computing & Simulation (HPCS 2014), July 21 25, 2014, Bologna, Italy, pp. 343-350, ISBN: 978-1-4799-5311-0 For more info, please contact: - Dr. Sandro Fiore (sandro.fiore@cmcc.it) - Prof. Giovanni Aloisio (giovanni.aloisio@unisalento.it)
Questions?!