The Ophidia framework: toward big data analytics for climate change
Dr. Sandro Fiore, Leader, Scientific data management research group, Scientific Computing Division @ CMCC
Prof. Giovanni Aloisio, Director, Scientific Computing Division @ CMCC
Data analytics requirements
A set of requirements has been discussed with our scientists to identify the key data analytics needs. Preliminary requirements and needs focus on:
Time series analysis
Data reduction (e.g. by aggregation)
Data subsetting
Model intercomparison
Multi-model means
Data transformation (through array-based primitives)
Parameter sweep experiments (same task applied to a set of data)
Comparison between historical data and future scenarios
Maps generation
Ensemble analysis
Data analytics workflow support
But also: performance, re-usability, extensibility
The Ophidia Project
Ophidia is a research effort carried out at the Euro-Mediterranean Centre on Climate Change (CMCC) to address big data challenges, issues and requirements for climate change data analytics.
S. Fiore, A. D'Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, "Ophidia: toward big data analytics for eScience", ICCS 2013 Conference, Procedia Computer Science, Elsevier, Barcelona, June 5-7, 2013.
Ophidia Architecture 1.0
Front end: declarative language, standard interfaces
Compute layer: analytics framework
I/O layer: I/O server instances, array-based primitives
Storage layer: system catalog, new storage model, partitioning/hierarchical data management
Storage model (dimension-independent) & implementation: array-based support, hierarchical storage, parallel I/O
The analytics framework: datacube operators (about 50)
Operator classes: Data Access (sequential and parallel operators), Metadata management (sequential and parallel operators), Data processing (parallel operators, MPI & OpenMP based), Import/Export (parallel operators).

Data processing operators (domain-agnostic):
OPH_APPLY(datacube_in, datacube_out, array_based_primitive): creates datacube_out by applying the array-based primitive to datacube_in
OPH_DUPLICATE(datacube_in, datacube_out): creates a copy of datacube_in in datacube_out
OPH_SUBSET(datacube_in, subset_string, datacube_out): creates datacube_out by subsetting datacube_in according to subset_string
OPH_MERGE(datacube_in, merge_param, datacube_out): creates datacube_out by merging groups of merge_param fragments from datacube_in
OPH_SPLIT(datacube_in, split_param, datacube_out): creates datacube_out by splitting each fragment of datacube_in into groups of split_param fragments
OPH_INTERCOMPARISON(datacube_in1, datacube_in2, datacube_out): creates datacube_out as the element-wise difference between datacube_in1 and datacube_in2
OPH_DELETE(datacube_in): removes datacube_in

Data processing operators (domain-oriented):
OPH_EXPORT_NC(datacube_in, file_out): exports the datacube_in data into the file_out NetCDF file
OPH_IMPORT_NC(file_in, datacube_out): imports the data stored in the file_in NetCDF file into the new datacube_out datacube

Data access operators:
OPH_INSPECT_FRAG(datacube_in, fragment_in): inspects the data stored in fragment_in from datacube_in
OPH_PUBLISH(datacube_in): publishes the datacube_in fragments into HTML pages

Metadata operators:
OPH_CUBE_ELEMENTS(datacube_in): provides the total number of elements in datacube_in
OPH_CUBE_SIZE(datacube_in): provides the disk space occupied by datacube_in
OPH_LIST(void): provides the list of available datacubes
OPH_CUBEIO(datacube_in): provides the provenance information related to datacube_in
OPH_FIND(search_param): provides the list of datacubes matching the search_param criteria
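To make the fragment-oriented semantics of OPH_MERGE and OPH_SPLIT concrete, here is a minimal sketch in Python. The datacube-as-list-of-fragments model is an illustrative assumption, not the actual Ophidia storage layout, and the function names are hypothetical stand-ins for the real operators.

```python
# Illustrative model: a datacube is a list of fragments,
# each fragment a list of (id, measure_array) rows.

def oph_merge(fragments, merge_param):
    """Merge groups of merge_param fragments into single fragments."""
    return [
        [row for frag in fragments[i:i + merge_param] for row in frag]
        for i in range(0, len(fragments), merge_param)
    ]

def oph_split(fragments, split_param):
    """Split each fragment into split_param smaller fragments."""
    out = []
    for frag in fragments:
        size = (len(frag) + split_param - 1) // split_param  # ceil division
        out.extend(frag[i:i + size] for i in range(0, len(frag), size))
    return out

cube = [[(1, [1.95, 8.64]), (2, [14.81, 18.14])],
        [(1, [6.87, 10.99]), (2, [12.85, 16.93])]]
merged = oph_merge(cube, 2)   # 2 fragments -> 1 fragment of 4 rows
split = oph_split(merged, 2)  # back to 2 fragments of 2 rows each
```

Note that the two operators are inverses of each other at the fragment level, which is how the real operators let users re-balance a datacube's partitioning.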
The analytics framework: datacube operators
operator=oph_apply;datacube_input=cube_in;datacube_output=cube_out;query=array_based_primitive
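The submission string above is a semicolon-separated list of key=value pairs. A minimal parser sketch in Python (the dict-based representation is illustrative; the actual server-side parsing is not part of this slide):

```python
def parse_submission(s):
    """Parse an Ophidia-style 'key=value;key=value' submission string."""
    return dict(pair.split("=", 1) for pair in s.strip(";").split(";") if pair)

args = parse_submission(
    "operator=oph_apply;datacube_input=cube_in;"
    "datacube_output=cube_out;query=array_based_primitive")
```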
Array-based primitives
The array data type alone is not enough to provide scientific data management capabilities; primitives are needed as well. A set of array-based primitives has been implemented:
A primitive is applied to a single array
They come in the form of plugins (I/O server extensions)
So far, Ophidia provides a wide set of plugins (about 100) to perform data reduction (by aggregation), sub-setting, predicate evaluation, statistical analysis, compression, and so forth
Support is provided both for byte-oriented and bit-oriented arrays
Plugins can be nested to obtain more complex functionality
Compression is provided as a primitive too
Array-based primitives: OPH_MATH (SIGN)
oph_math(measure, 'OPH_SIGN', 'OPH_DOUBLE')
Single chunk or fragment (input) → Single chunk or fragment (output)
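The SIGN variant maps every element of the array to its sign. A rough Python analogue of what the primitive computes on one array (the function name is a hypothetical stand-in, not the plugin's C entry point):

```python
def oph_math_sign(measure):
    """Element-wise sign, mimicking oph_math(..., OPH_SIGN, ...) on one array."""
    return [(x > 0) - (x < 0) for x in measure]

oph_math_sign([-2.5, 0.0, 3.1])  # -> [-1, 0, 1]
```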
Array-based primitives: OPH_BOXPLOT
oph_boxplot(measure, 'OPH_DOUBLE')
Single chunk or fragment (input) → Single chunk or fragment (output)
Array-based primitives: nesting feature
oph_boxplot(oph_subarray(oph_uncompress(measure), 1, 18), 'OPH_DOUBLE')
Single chunk or fragment (input) → subarray(measure, 1, 18) → Single chunk or fragment (output)
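The nested call composes three primitives, innermost first: uncompress, then subarray extraction, then boxplot. A rough Python analogue of that composition, under stated assumptions: the zlib packing of doubles stands in for Ophidia's real compression codec, and the quartile rule is deliberately simplified.

```python
import struct
import zlib

def oph_uncompress(blob):
    """Unpack a compressed array of doubles (zlib codec assumed, for illustration)."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"{len(raw) // 8}d", raw))

def oph_subarray(arr, start, end):
    """Extract elements start..end (1-based, inclusive, as in the slide's 1,18)."""
    return arr[start - 1:end]

def oph_boxplot(arr):
    """Crude five-number summary: min, Q1, median, Q3, max."""
    s = sorted(arr)
    n = len(s)
    return (s[0], s[n // 4], s[n // 2], s[3 * n // 4], s[-1])

# Nesting = ordinary function composition on a single array.
measure = zlib.compress(struct.pack("20d", *range(20)))
result = oph_boxplot(oph_subarray(oph_uncompress(measure), 1, 18))
```

The key design point is that each plugin consumes and produces a plain array, so any chain of primitives can run inside the I/O server in one pass.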
Array-based primitives: oph_aggregate
oph_aggregate(measure, 'oph_avg')
Single chunk or fragment (input) → vertical aggregation → Single chunk or fragment (output)
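"Vertical" here means across the tuples (rows) of a fragment: the arrays are combined element by element into one output array. A minimal Python sketch of the oph_avg case (the list-of-rows fragment model is illustrative only):

```python
def oph_aggregate_avg(fragment):
    """Vertical aggregation: element-wise average across all tuples (rows)
    of a fragment, producing a single output array."""
    n = len(fragment)
    return [sum(col) / n for col in zip(*fragment)]

fragment = [[1.0, 2.0, 3.0],
            [3.0, 4.0, 5.0]]
oph_aggregate_avg(fragment)  # -> [2.0, 3.0, 4.0]
```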
Ophidia primitives
Mathematical primitives: oph_math, oph_mul_scalar, oph_sum_array, oph_sum_scalar, oph_sum_array_r, oph_gsl_sd, oph_gsl_complex_get_abs, oph_gsl_complex_get_arg, oph_gsl_complex_get_imag, oph_gsl_complex_get_real, oph_gsl_complex_to_polar, oph_gsl_complex_to_rect, oph_gsl_dwt, oph_gsl_fft, oph_gsl_idwt, oph_gsl_ifft, oph_gsl_quantile, oph_gsl_sort, oph_gsl_stats, oph_gsl_histogram, oph_gsl_boxplot, oph_petsc_vec_norm, oph_petsc_vec, oph_petsc_vec_r, oph_sum_scalar2, oph_mul_scalar2, oph_compare.
Data transformations: oph_to_bin, oph_to_bit, oph_permute, oph_reverse, oph_dump, oph_convert_d, oph_shift, oph_rotate, oph_concat, oph_predicate, oph_roll_up, oph_get_subarray, oph_get_subarray2, oph_get_subarray3, oph_mask, oph_cast, oph_drill_down (stored procedure), oph_reduce, oph_reduce2, oph_reduce3, oph_aggregate_operator, oph_operator, oph_aggregate_stats, oph_extract.
Compression: oph_compress, oph_uncompress, oph_compress2, oph_uncompress2, oph_id_to_index, oph_size_array, oph_count_array, oph_find, oph_find2.
Bit measures processing: oph_bit_aggregate, oph_bit_import, oph_bit_export, oph_bit_size, oph_bit_count, oph_bit_dump, oph_bit_operator, oph_bit_not, oph_bit_shift, oph_bit_rotate, oph_bit_reverse, oph_bit_subarray, oph_bit_subarray2, oph_bit_concat, oph_bit_find, oph_bit_reduce.
Integration of numerical libraries: math, GNU GSL, PETSc.
Integration of some CDO operators.
The Ophidia Terminal
The Ophidia terminal provides an effective and lightweight way to interact with the Ophidia server:
Bash-like environment (command interpreter)
Terminal with history management, auto-completion, specific environment variables and commands with integrated help
Easy installation as a single executable relying on a small number of well-known open-source libraries
More than 15 KLOC
Simple enough for a novice and at the same time powerful enough for an expert
Variables and commands
PRE-DEFINED VARIABLES (user-defined variables possible):
OPH_TERM_PS1 = color of the prompt
OPH_USER = username
OPH_PASSWD = password
OPH_SERVER_HOST = server address
OPH_SERVER_PORT = server port
OPH_SESSION_ID = current session
OPH_EXEC_MODE = execution mode
OPH_NCORES = number of cores
OPH_CWD = current working directory
OPH_DATACUBE = current datacube
OPH_TERM_VIEWER = output renderer
OPH_TERM_IMGS = save and/or auto-open images

PRE-DEFINED COMMANDS (user-defined aliases possible):
help = get the description of a command or variable
history = manage the history of commands
env = list environment variables
setenv = set or change the value of a variable
unsetenv = unset an environment variable
getenv = print the value of a variable
quit / exit = quit Oph_Term
clear = clear screen
update = update local XML files
resume = resume session
view = view jobs output
alias = list aliases
setalias = set or change the value of an alias
unsetalias = unset an alias
getalias = print the value of an alias
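A minimal sketch of how the variable commands interact, as a toy Python model (illustrative only; the real terminal is a compiled executable, and the variable values shown are hypothetical):

```python
class OphTermEnv:
    """Toy model of Oph_Term variable handling (setenv/getenv/unsetenv/env)."""

    def __init__(self):
        self.vars = {}

    def setenv(self, name, value):
        """Set or change the value of a variable."""
        self.vars[name] = value

    def getenv(self, name):
        """Return the value of a variable (empty string if unset)."""
        return self.vars.get(name, "")

    def unsetenv(self, name):
        """Unset an environment variable."""
        self.vars.pop(name, None)

    def env(self):
        """List environment variable names."""
        return sorted(self.vars)

term = OphTermEnv()
term.setenv("OPH_NCORES", "4")                       # hypothetical values
term.setenv("OPH_SERVER_HOST", "server.example.org")
```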
Terminal: (system) metadata information
Terminal: looking into the storage back-end
Terminal: provenance (II)
Preliminary performance insights on a massive in-memory data reduction workflow (½ Terabyte datacube)
Input: 64 fragments, each holding 10^4 tuples × 10^5 elements (512 GB in total).
Ideal path: from the n-D base cuboid directly to the 0-D apex cuboid (a single 1-tuple × 1-element fragment).
Real path through 4 Ophidia operators (max reduction):
REDUCE ALL MAX (each 10^5-element array collapses to 1 element): 2.30 sec, 512 GB → 5.12 MB
AGGREGATE ALL MAX (the 10^4 tuples of each fragment collapse to 1 tuple): 0.17 sec, 5.12 MB → 1 KB
MERGE ALL (the 64 fragments merge into 1 fragment of 64 tuples): 0.90 sec, 1 KB → 1 KB
AGGREGATE ALL MAX (the 64 tuples collapse to the single apex value): 0.06 sec, 1 KB → 8 bytes
TOTAL: 3.43 sec
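The four-step max-reduction path above can be sketched on a toy cube. This is a minimal Python model with tiny fragments standing in for the 512 GB cube; the list-based layout is illustrative only.

```python
import random

# Toy cube: 4 fragments, each 3 tuples x 5 elements (stand-in for
# 64 fragments of 10^4 tuples x 10^5 elements).
random.seed(0)
cube = [[[random.random() for _ in range(5)] for _ in range(3)]
        for _ in range(4)]

# Step 1 - REDUCE ALL MAX: each array collapses to its max (1 element).
reduced = [[[max(arr)] for arr in frag] for frag in cube]

# Step 2 - AGGREGATE ALL MAX: all tuples of a fragment collapse to one.
aggregated = [[[max(t[0] for t in frag)]] for frag in reduced]

# Step 3 - MERGE ALL: all fragments merge into a single fragment.
merged = [[t for frag in aggregated for t in frag]]

# Step 4 - AGGREGATE ALL MAX: the merged tuples collapse to one apex value.
apex = max(t[0] for t in merged[0])

# Same result as reducing the whole cube at once (the "ideal path").
assert apex == max(x for frag in cube for arr in frag for x in arr)
```

The point of the real path is that each step is an existing parallel operator, so the full reduction is expressed as a short chain rather than a custom program.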
Summary
The Ophidia framework provides:
an internal storage model for multidimensional data
about 100 array-based primitives
about 50 operators for data analytics and metadata management
a Web Service (WS-I+) based server interface
a client application available as user interface: a complete bash-like environment, production-level, lightweight, well-documented
APIs available at 5 different levels of the stack: server front-end, I/O server, storage device, primitives, operators
Performance evaluation is still ongoing
Beta release available in December
Website, documentation, GitHub repository
References and contacts
[1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, "Scientific big data analytics challenges at large scale", Big Data and Extreme-scale Computing (BDEC), April 30 - May 1, 2013, Charleston, USA (position paper).
[2] S. Fiore, A. D'Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, "Ophidia: Toward Big Data Analytics for eScience", ICCS 2013, June 5-7, 2013, Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385.
[3] S. Fiore, C. Palazzo, A. D'Anca, I. Foster, D. N. Williams, G. Aloisio, "A big data analytics framework for scientific data management", Workshop on Big Data and Science: Infrastructure and Services, IEEE International Conference on Big Data 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8.
[4] S. Fiore, A. D'Anca, D. Elia, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, "Ophidia: a full software stack for scientific data analytics", Workshop on Big Data Principles, Architectures & Applications, HPCS 2014, Bologna, Italy, July 21-25, 2014.
Contacts: Sandro Fiore, Giovanni Aloisio, CMCC and University of Salento, sandro.fiore@cmcc.it, giovanni.aloisio@cmcc.it
Questions?!