The Ophidia framework: toward big data analy7cs for climate change



Similar documents
Data analy(cs workflows for climate

The Ophidia framework: toward cloud- based big data analy;cs for escience Sandro Fiore, Giovanni Aloisio, Ian Foster, Dean Williams

Data-Intensive Science and Scientific Data Infrastructure

The ORIENTGATE data platform

Data Semantics Aware Cloud for High Performance Analytics

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

The ORIENTGATE data platform

FEATURE COMPARISON BETWEEN WINDOWS SERVER UPDATE SERVICES AND SHAVLIK HFNETCHKPRO

Lustre * Filesystem for Cloud and Hadoop *

SAP HANA In-Memory Database Sizing Guideline

Oracle BI 11g R1: Build Repositories

Crystal Server Upgrade Guide SAP Crystal Server 2013

A Service for Data-Intensive Computations on Virtual Clusters

ORACLE BUSINESS INTELLIGENCE WORKSHOP

Filesystems Performance in GNU/Linux Multi-Disk Data Storage

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Legal Notes. Regarding Trademarks KYOCERA MITA Corporation

Best Practices for Sharing Imagery using Amazon Web Services. Peter Becker

Drupal CMS for marketing sites

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis

DiskPulse DISK CHANGE MONITOR

MicroStrategy Course Catalog

SharePoint Server 2010 Capacity Management: Software Boundaries and Limits

Luna Decision Overview. Copyright Hicare Research 2014 All Rights Reserved

Basic ShadowProtect Troubleshooting

Chapter 6, The Operating System Machine Level

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Submitting UITests at the Command Line

Quick Start SAP Sybase IQ 16.0

Windows OS File Systems

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

Topology Aware Analytics for Elastic Cloud Services

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Violin Symphony Abstract

INSTALLATION MINIMUM REQUIREMENTS. Visit us on the Web

Overview

HPC Wales Skills Academy Course Catalogue 2015

The THREDDS Data Repository: for Long Term Data Storage and Access

Remote Backup Software

File Management Utility User Guide

Mambo Running Analytics on Enterprise Storage

Anwendungsintegration und Workflows mit UNICORE 6

Using the Windows Cluster

Microsoft Dynamics NAV 2015 Hardware and Server Requirements. Microsoft Dynamics NAV Windows Client Requirements

CatDV Pro Workgroup Serve r

Integrating VoltDB with Hadoop

Oracle Business Intelligence Server Administration Guide. Version December 2006

Plug-In for Informatica Guide

Netwrix Auditor for Active Directory

aaps algacom Account Provisioning System

Drupal Performance Tuning

Maximising the utility of OpeNDAP datasets through the NetCDF4 API

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Building Views and Charts in Requests Introduction to Answers views and charts Creating and editing charts Performing common view tasks

Harvard Data Visualization Project

American International Journal of Research in Science, Technology, Engineering & Mathematics

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Diagram 1: Islands of storage across a digital broadcast workflow

Simba XMLA Provider for Oracle OLAP 2.0. Linux Administration Guide. Simba Technologies Inc. April 23, 2013

Using Microsoft Windows Authentication for Microsoft SQL Server Connections in Data Archive

9.1 SAS/ACCESS. Interface to SAP BW. User s Guide

Introduction to Gluster. Versions 3.0.x

Optimize the execution of local physics analysis workflows using Hadoop

Introduction to Linux and Cluster Basics for the CCR General Computing Cluster

Current Status of FEFS for the K computer

Open Source Backup with Amanda

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Netwrix Auditor for Windows Server

Oracle BI Cloud Service : What is it and Where Will it be Useful? Francesco Tisiot, Principal Consultant, Rittman Mead OUG Ireland 2015, Dublin

Configuring an Oracle Business Intelligence Enterprise Edition Resource in Metadata Manager

DBMS / Business Intelligence, SQL Server

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

WEB2CS INSTALLATION GUIDE

Flexible Identity Federation

MailEnable Installation Guide

From GWS to MapReduce: Google s Cloud Technology in the Early Days

NETWRIX EVENT LOG MANAGER

Overview. Edvantage Security

Scala Storage Scale-Out Clustered Storage White Paper

LAE Enterprise Server Installation Guide

StreamServe Persuasion SP5 Upgrading instructions

Decision Support System Software Asset Management (SAM)

Reduction of Data at Namenode in HDFS using harballing Technique

Hardwarekrav. 30 MB. Memory: 1 GB. Additional software Microsoft.NET Framework 4.0.

Knowledgent White Paper Series. Developing an MDM Strategy WHITE PAPER. Key Components for Success

Set Up Panorama. Palo Alto Networks. Panorama Administrator s Guide Version 6.0. Copyright Palo Alto Networks

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

How To Set Up Safetica Insight 9 (Safetica) For A Safetrica Management Service (Sms) For An Ipad Or Ipad (Smb) (Sbc) (For A Safetaica) (

Transcription:

The Ophidia framework: toward big data analy7cs for climate change Dr. Sandro Fiore Leader, Scientific data management research group Scientific Computing Division @ CMCC Prof. Giovanni Aloisio Director, Scientific Computing Division @ CMCC

Data analytics requirements! A set of requirements have been discussed with our scientists to identify the key data analytics needs. Preliminary requirements and needs focus on: Time series analysis Data reduc6on (e.g. by aggrega6on) Data subse<ng Model intercomparison Mul6model means Data transforma6on (through array- based primi6ves) Param. Sweep experiments (same task applied on a set of data) Comparison between historical data and future scenarios Maps genera6on Ensemble analysis Data analy6cs worflow support But also Performance, re- usability, extensibility

The Ophidia Project Ophidia is a research effort carried out at the Euro Mediterranean Centre on Climate Change (CMCC) to address big data challenges, issues and requirements for climate change data analytics S. Fiore, A. D Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, Ophidia: toward bigdata analytics for escience, ICCS2013 Conference, Procedia Elsevier, Barcelona, June 5-7, 2013

Ophidia Architecture 1.0 Declarative language Front end Compute layer Standard interfaces Analytics Framework I/O layer Array-based primitives I/O server instance Storage layer System catalog New storage model Partitioning/hierarchical data mng

Storage model (dimension-independent) & implementation! Array-based support and hierarchical storage! Parallel I/O

The analytics framework: datacube operators (about 50)! OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-agnostic OPH_APPLY(datacube_in, datacube_out, array_based_primitive) OPH_DUPLICATE(datacube_ in, datacube_out) OPH_SUBSET(datacube_in, subset_string, datacube_out) OPH_MERGE(datacube_in, merge_param, datacube_out) OPH_SPLIT(datacube_in, split_param, datacube_out) OPH_INTERCOMPARISON (datacube_in1, datacube_in2, datacube_out) OPH_DELETE(datacube_in) Creates the datacube_out by applying the array-based primitive to the datacube_in Creates a copy of the datacube_in in the datacube_out Creates the datacube_out by doing a sub-setting of the datacube_in by applying the subset_string Creates the datacube_out by merging groups of merge_param fragments from datacube_in Creates the datacube_out by splitting into groups of split_param fragments each fragment of the datacube_in Creates the datacube_out which is the element-wise difference between datacube_in1 and datacube_in2 Removes the datacube_in Data Access (sequential and parallel operators) Metadata management (sequential and parallel operators) Data processing (parallel operators, MPI & OpenMP based) Import/Export (parallel operators) OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-oriented OPH_EXPORT_NC Exports the datacube_in data into the (datacube_in, file_out) file_out NetCDF file. OPH_IMPORT_NC Imports the data stored into the file_in (file_in, datacube_out) NetCDF file into the new datacube_in datacube Operators Data access OPH_INSPECT_FRAG Inspects the data stored in the (datacube_in, fragment_in) fragment_in from the datacube_in OPH_PUBLISH(datacube_in) Publishes the datacube_in fragments into HTML pages Operators Metadata OPH_CUBE_ELEMENTS Provides the total number of the (datacube_in) elements in the datacube_in OPH_CUBE_SIZE Provides the disk space occupied by the (datacube_in) datacube_in OPH_LIST(void) Provides the list of available datacubes. OPH_CUBEIO(datacube_in) Provides the provenance information related to the datacube_in OPH_FIND(search_param) Provides the list of datacubes matching the search_param criteria

The analytics framework: datacube operators!

The analytics framework: datacube operators!

The analytics framework: datacube operators! operator=oph_apply;datacube_input=cube_in;datacube_output=cube_out;query=array_based_primi/ve

Array based primitives! The array data type support is not enough to provide scientific data management capabilities primitives are needed as well. A set of array-based primitives have been implemented A primitive is applied to a single array They come in the form of plugins (I/O server extensions) So far, Ophidia provides a wide set of plugins (about 100) to perform data reduction (by aggregation), sub-setting, predicates evaluation, statistical analysis, compression, and so forth. Support is provided both for byte-oriented and bit-oriented arrays Plugins can be nested to get more complex functionalities Compression is provided as a primitive too

Array based primitives: OPH_MATH ( SIGN ) oph_math(measure, OPH_SIGN, "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output)

Array based primitives: OPH_BOXPLOT oph_boxplot(measure, "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output)

Array based primitives: nesting feature oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output) subarray(measure, 1,18)

Array based primitives: oph_aggregate oph_aggregate(measure,"oph_avg ) Single chunk or fragment (input) Ver6cal aggrega6on Single chunk or fragment (output)

Ophidia primitives Mathematical primitives: oph_math, oph_mul_scalar, oph_sum_array, oph_sum_scalar, oph_sum_array_r, oph_gsl_sd, oph_gsl_complex_get_abs, oph_gsl_complex_get_arg, oph_gsl_complex_get_imag, oph_gsl_complex_get_real, oph_gsl_complex_to_polar, oph_gsl_complex_to_rect, oph_gsl_dwt, oph_gsl_fft, oph_gsl_idwt, oph_gsl_ifft, oph_gsl_quantile, oph_gsl_sort, oph_gsl_stats, oph_gsl_histogram, oph_gsl_boxplot, oph_petsc_vec_norm, oph_petsc_vec, oph_petsc_vec_r, oph_sum_scalar2, oph_mul_scalar2, oph_compare, Data transformations: oph_to_bin, oph_to_bit, oph_permute, oph_reverse, oph_dump, oph_convert_d, oph_shift, oph_rotate, oph_concat, oph_predicate, oph_roll_up, oph_get_subarray, oph_get_subarray2, oph_get_subarray3, oph_mask, oph_cast, oph_drill_down (stored procedure), oph_reduce, oph_reduce2, oph_reduce3, oph_aggregate_operator, oph_operator, oph_aggregate_stats, oph_extract, Compression: oph_compress, oph_uncompress, oph_compress2, oph_uncompress2, oph_id_to_index, oph_size_array, oph_count_array, oph_find, oph_find2, Bit measures processing: oph_bit_aggregate, oph_bit_import, oph_bit_export, oph_bit_size, oph_bit_count, oph_bit_dump, oph_bit_operator, oph_bit_not, oph_bit_shift, oph_bit_rotate, oph_bit_reverse, oph_bit_subarray, oph_bit_subarray2, oph_bit_concat, oph_bit_find, oph_bit_reduce, Integration of numerical libraries: math, GNU GSL, PetsC. Integration of some CDO operators

The Ophidia Terminal The Ophidia terminal provides an effec6ve and lightweight way to interact with the Ophidia server Bash- like environment (commands interpreter) Terminal with history management, auto- comple6on, specific environment variables and commands with integrated help Easy installa6on as an only one executable using a small number of well- known and open- source libraries More than 15 KLOC Simple enough for a novice and at the same 6me powerful enough for an expert

Variables and commands PRE-DEFINED VARIABLES (user-defined variables possible) OPH_TERM_PS1 = "color of the prompt OPH_USER = "username OPH_PASSWD = "password OPH_SERVER_HOST = "server address OPH_SERVER_PORT = "server port OPH_SESSION_ID = "current session OPH_EXEC_MODE = "execu6on mode OPH_NCORES = "number of cores OPH_CWD = "current working directory OPH_DATACUBE = "current datacube OPH_TERM_VIEWER = "output renderer OPH_TERM_IMGS = "save and/or auto- open images" PRE-DEFINED COMMANDS (user-defined aliases possible) help = "get the descrip6on of a command or variable history = "manage the history of commands env = "list environment variables setenv = "set or change the value of a variable unsetenv = "unset an environment variable getenv = "print the value of a variable quit = "quit Oph_Term exit = "quit Oph_Term clear = "clear screen update = "update local XML files resume = "resume session view = "view jobs output alias = "list aliases setalias = "set or change the value of an alias unsetalias = "unset an alias getalias = "print the value of an alias"

Terminal: (system) metadata information

Terminal: looking into the storage back-end

Terminal: provenance (II)

½ Terabyte Datacube Preliminary performance insights! on a massive in-memory data reduction workflow n D base cuboid Ideal path 0 D apex cuboid FRAGMENT1 10^4 TUPLE x 10^5 ELEMENTS FRAGMENT2 10^4 TUPLE x 10^5 ELEMENTS ID FRAGMENT3 MEASURE 10^4 TUPLE x 10^5 ELEMENTS ID FRAGMENT4 MEASURE 10^4 TUPLE x 10^5 ELEMENTS ID MEASURE 1 ID FRAG64 MEASURE 10^4 TUPLE x 10^5 ELEMENTS 1 1,95 8,64 10,47 16,11 1 2 ID MEASURE 1 1,95 8,64 10,47 16,11 2 14,81 18,14 19,93 24,35 2 1 1,95 8,64 16,11 2 14,81 18,14 19,93 24,35 2 14,81 18,14 24,35 10 10 10 6,87 10,99 12,85 16,93 10^ 6,87 10,99 12,85 16,93 3 D base cuboid Ideal path 0 D apex cuboid x1 FRAGMENT1 1 TUPLE x 1 ELEM. ID MEASURE 1 24,35 x64 10^4 6,87 10,99 16,93

FRAGMENT1 10^4 TUPLE x 10^5 ELEMENTS FRAGMENT2 10^4 TUPLE x 10^5 ELEMENTS ID FRAGMENT3 MEASURE 10^4 TUPLE x 10^5 ELEMENTS ID FRAGMENT4 MEASURE 10^4 TUPLE x 10^5 ELEMENTS ID MEASURE 1 ID FRAG64 MEASURE 10^4 TUPLE x 10^5 ELEMENTS 1 1,95 8,64 10,47 16,11 1 2 ID MEASURE 1 1,95 8,64 10,47 16,11 2 14,81 18,14 19,93 24,35 2 1 1,95 8,64 16,11 2 14,81 18,14 19,93 24,35 2 14,81 18,14 24,35 10 10 10 6,87 10,99 12,85 16,93 10^ 6,87 10,99 12,85 16,93 x64 x64 10^4 Preliminary performance insights on a! massive in-memory data reduction workflow 6,87 10,99 16,93 3 D base cuboid REDUCE ALL MAX FRAGMENT1 FRAGMENT2 10^4 TUPLE FRAGMENT3 10^4 TUPLE FRAGMENT4 x 1 ELEM. x 1 ELEM. 10^4 10^4 TUPLE FRAGMENT ID ID TUPLE x MEASURE x 1 1 ELEM. ELEM. 64 MEASURE 10^4 ID TUPLE MEASURE x 1 ELEM. ID MEASURE 1 1 ID MEASURE 1 1 2 2 1 16,11 2 2 2 24,35 10 10 4 10 4 10^ 4 10^4 16,93 AGGREGATE ALL MAX Ideal path x1 FRAGMENT1 1 TUPLE x 1 ELEM. 2,30 sec 0,17 sec 0,90 sec 0,06 sec 3,43 sec 512GB 5,12 MB 1KB 1KB 8Bytes AGGREGATE ALL MAX Real path through 4 Ophidia operators MERGE ALL FRAGMENT 1 FRAGMENT 2 1 TUPLE FRAGMENT x 1 ELEM. 3 1 TUPLE FRAGMENT x 1 ELEM. 4 ID 1 TUPLE MEASURE x 1 ELEM. 1 TUPLE FRAGMENT x 1 ELEM. 64 ID MEASURE ID MEASURE ID 1 TUPLE MEASURE x 1 ELEM. 1 1 ID MEASURE 1 1 24,35 x64 1 24,35 0 D apex cuboid x1 ID MEASURE 1 24,35 REDUCE AGGREGATE MERGE AGGREGATE TOTAL FRAGMENT1 64 TUPLE x 1 ELEM. ID 1 2 64 MEASURE 19,78 12,87 24,35

Summary The Ophidia framework provides: an internal storage model for multidimensional data about 100 array-based primitives about 50 operators for data analytics and metadata management a Web Service (WS-I + ) based server interface Client application available as user interface Complete bash-like environment, production-level, lightweight, well-documented APIs available at 5 different levels of the stack server front-end, I/O server, storage device, primitives, operators Performance evaluation still ongoing Beta release available in December Website, documentation, github repository

References and contacts! [1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, Scientific big data analytics challenges at large scale, Big Data and Extreme-scale Computing (BDEC), April 30 to May 01, 2013, Charleston, USA (position paper). [2] S. Fiore, A. D'Anca, C. Palazzo, I. Foster, Dean N. Williams, Giovanni Aloisio, Ophidia: Toward Big Data Analytics for escience, ICCS 2013, June 5-7, 2013 Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385. [3] S. Fiore, C. Palazzo, A. D Anca, I. Foster, D. N. Williams, G. Aloisio, A big data analytics framework for scientific data management, Workshop on Big Data and Science: Infrastructure and Services, IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8. [4] S.Fiore, A. D Anca, D. Elia, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, Ophidia: a full software stack for scientific data analytics, Workshop on Big Data Principles, Architectures & Applications, HPCS2014, Bologna, USA, July 21-25, 2014. Contacts: Sandro Fiore, Giovanni Aloisio CMCC and University of Salento sandro.fiore@cmcc.it, giovanni.aloisio@cmcc.it,

Questions?!