The Ophidia framework: toward cloud- based big data analy;cs for escience Sandro Fiore, Giovanni Aloisio, Ian Foster, Dean Williams



Similar documents
The Ophidia framework: toward big data analy7cs for climate change

Data analy(cs workflows for climate

The ORIENTGATE data platform

Big Data Research at DKRZ

Return on Experience on Cloud Compu2ng Issues a stairway to clouds. Experts Workshop Nov. 21st, 2013

NASA's Strategy and Activities in Server Side Analytics

The ORIENTGATE data platform

CMIP6 Data Management at DKRZ

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

NASA s Big Data Challenges in Climate Science

Data Semantics Aware Cloud for High Performance Analytics

Integration strategy

NERSC Data Efforts Update Prabhat Data and Analytics Group Lead February 23, 2015

High Performance Computing in Horizon February 26-28, 2014 Fukuoka Japan

CMIP5 Data Management CAS2K13

Network for Sustainable Ultrascale Computing (NESUS)

Percipient StorAGe for Exascale Data Centric Computing

Data Centric Systems (DCS)

Introduc)on of Pla/orm ISF. Weina Ma

PART 1. Representations of atmospheric phenomena

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

A Service for Data-Intensive Computations on Virtual Clusters

HPC technology and future architecture

The THREDDS Data Repository: for Long Term Data Storage and Access

Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner

Webcasting vs. Web Conferencing. Webcasting vs. Web Conferencing

NCDC Strategic Vision

CLOUD BASED N-DIMENSIONAL WEATHER FORECAST VISUALIZATION TOOL WITH IMAGE ANALYSIS CAPABILITIES

Data-Intensive Science and Scientific Data Infrastructure

Software Defined Networking for Extreme- Scale Science: Data, Compute, and Instrument Facilities

High Performance Computing OpenStack Options. September 22, 2015

CEDA Storage. Dr Matt Pritchard. Centre for Environmental Data Archival (CEDA)

Cloud JPL Science Data Systems

RevoScaleR Speed and Scalability

The Study of a Hierarchical Hadoop Architecture in Multiple Data Centers Environment

How To Manage Cloud Service Provisioning And Maintenance

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Elastic Management of Cluster based Services in the Cloud

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Next-Generation Networking for Science

Big Data Services at DKRZ

OT- Med: Objec,va Terra - Mediterraneum. Joël Guiot

Data Requirements from NERSC Requirements Reviews

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

The Business Case for Virtualization Management: A New Approach to Meeting IT Goals By Rich Corley Akorri

HPC Programming Framework Research Team

Interactive Data Visualization with Focus on Climate Research

Ibis: Scaling Python Analy=cs on Hadoop and Impala

GeoKettle: A powerful open source spatial ETL tool

Software services competence in research and development activities at PSNC. Cezary Mazurek PSNC, Poland

Maximising the utility of OpeNDAP datasets through the NetCDF4 API

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Big Data and the Earth Observation and Climate Modelling Communities: JASMIN and CEMS

MicroStrategy Course Catalog

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Lustre * Filesystem for Cloud and Hadoop *

Knowledgent White Paper Series. Developing an MDM Strategy WHITE PAPER. Key Components for Success

European Data Infrastructure - EUDAT Data Services & Tools

Enterprise Content Management (ECM)

Making a Smooth Transition to a Hybrid Cloud with Microsoft Cloud OS

Transcription:

The Ophidia framework: toward cloud- based big data analy;cs for escience Sandro Fiore, Giovanni Aloisio, Ian Foster, Dean Williams Sandro Fiore, Ph.D. CMCC Scientific Computing and Operations Division sandro.fiore@cmcc.it!

The big challenge is to model this complex system! Several complex processes to be simulated Several interacting processes Great range of time scales to be analyzed Great range of spatial scales to be considered Need interdisciplinar sciences (physics, chemistry, biology, geology, ) Inherently non-linear governing equations Need sophisticated numerics Need huge computational resources and large volumes of data can be produced Warren M. Washington NCAR Scien;fic Grand Challenges Workshop Series: Challenges in Climate Change Science and the Role of Compu;ng at the Extreme Scale DOE Workshop (ASCR- BER) November 6-7, 2008

ESGF & the CMIP5 data archive!

Climate change domain: the current scientific workflow and the ESGF use case! Workflow: search, locate, download, analyze, display results J. Chen, A. Choudhary, S. Feldman, B. Hendrickson, C.R. Johnson, R. Mount, V. Sarkar, V. White, D. Williams. Synergistic Challenges in Data- Intensive Science and Exascale Computing, DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science, March, 2013.

The Ophidia Project Ophidia is a research effort carried out at the Euro Mediterranean Centre on Climate Change (CMCC) to address big data challenges, issues and requirements for climate change data analytics. S. Fiore, A. D Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, Ophidia: toward bigdata analytics for escience, ICCS2013 Conference, Procedia Elsevier, Barcelona, June 5-7, 2013

Requirements and needs focus on: Time series analysis Data subsetting Model intercomparison Multimodel means Massive data reduction Data transformation (through array-based primitives) Param. Sweep experiments (same task applied on a set of data) Climate change signal Maps generation Ensemble analysis Data analytics worflow support But also Performance re-usability Data analytics requirements and use cases! extensibility

Ophidia Architecture Declarative language Front end Compute layer Standard interfaces Analytics Framework I/O layer Array-based primitives I/O server instance Storage layer System catalog New storage model Partitioning/hierarchical data mng

Array based primitives! Primitives provide array-based transformation A comprehensive set of primitives have been already implemented ( 100) By definition, a primitive is applied to a single fragment They come in the form of plugins (I/O server extensions) So far, Ophidia primitives perform data reduction, sub-setting, predicates evaluation, statistical analysis, compression, and so forth. Support is provided both for byte-oriented and bit-oriented arrays Plugins can be nested to get more complex functionalities Compression is provided as a primitive too Libraries like PetsC, GSL, C Math have been integrated

Array based primitives: OPH_BOXPLOT oph_boxplot(measure, "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output)

Array based primitives: OPH_AGGREGATE oph_aggregate(measure,"oph_avg ) Single chunk or fragment (input) Ver%cal aggrega%on Single chunk or fragment (output)

Array based primitives: nesting feature oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE ) Single chunk or fragment (input) Single chunk or fragment (output) subarray(measure, 1,18)

The analytics framework: datacube operators (about 50)! OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-agnostic OPH_APPLY(datacube_in, datacube_out, array_based_primitive) OPH_DUPLICATE(datacube_ in, datacube_out) OPH_SUBSET(datacube_in, subset_string, datacube_out) OPH_MERGE(datacube_in, merge_param, datacube_out) OPH_SPLIT(datacube_in, split_param, datacube_out) OPH_INTERCOMPARISON (datacube_in1, datacube_in2, datacube_out) OPH_DELETE(datacube_in) Creates the datacube_out by applying the array-based primitive to the datacube_in Creates a copy of the datacube_in in the datacube_out Creates the datacube_out by doing a sub-setting of the datacube_in by applying the subset_string Creates the datacube_out by merging groups of merge_param fragments from datacube_in Creates the datacube_out by splitting into groups of split_param fragments each fragment of the datacube_in Creates the datacube_out which is the element-wise difference between datacube_in1 and datacube_in2 Removes the datacube_in Data Access (sequential and parallel operators) Metadata management (sequential and parallel operators) Data processing (parallel operators, MPI & OpenMP based) Import/Export (parallel operators) OPERATOR NAME OPERATOR DESCRIPTION Operators Data processing Domain-oriented OPH_EXPORT_NC Exports the datacube_in data into the (datacube_in, file_out) file_out NetCDF file. OPH_IMPORT_NC Imports the data stored into the file_in (file_in, datacube_out) NetCDF file into the new datacube_in datacube Operators Data access OPH_INSPECT_FRAG Inspects the data stored in the (datacube_in, fragment_in) fragment_in from the datacube_in OPH_PUBLISH(datacube_in) Publishes the datacube_in fragments into HTML pages Operators Metadata OPH_CUBE_ELEMENTS Provides the total number of the (datacube_in) elements in the datacube_in OPH_CUBE_SIZE Provides the disk space occupied by the (datacube_in) datacube_in OPH_LIST(void) Provides the list of available datacubes. OPH_CUBEIO(datacube_in) Provides the provenance information related to the datacube_in OPH_FIND(search_param) Provides the list of datacubes matching the search_param criteria

The analytics framework: datacube operators!

Metadata operator (provenance management)

EUBrazil Cloud Connect! The main objec%ve is the crea%on of a federated e- infrastructure for research using a user- centric approach. To achieve this, we need to pursue three objec%ves: Adapta;on of exis%ng applica%ons to tackle new scenarios emerging from coopera%on between Europe and Brazil relevant to both regions. Integra%on of frameworks and programming models for scien;fic gateways and complex workflows. Federa%on of resources, to build up a general- purpose infrastructure comprising exis;ng and heterogeneous resources Addi%onally, EUBrazilCC will: perform an ac%ve dissemina;on campaign, analyse innova;on, foster the involvement of Brazilian ins%tu%ons in cloud standards defini;on, and bring the EU Cloudscape series to broader interna%onal audience. EU Coordinator Ignacio Blanquer- Espert, iblanque@dsic.upv.es Universitat Politècnica de València, Spain BR Coordinator Francisco Vilar Brasileiro, fubica@dsc.ufcg.edu.br Universidade Federal de Campina Grande, Brazil!

Use Case on Biodiversity and Climate Change! Objec;ve: Understand the impact of climate change on terrestrial biodiversity through two workflows based on Earth observa%on and ground level data. Technical Challenge: Integrate parallel data analysis with other processing workflows in a geographically distributed environment. Interna;onal Added Value: Integra%on of biodiversity data and modelling with mul%spectral and remote sensing data for studying the cross- correla%on of biodiversity and climate change. Climate & Biodiversity Clearing- house Parallel Data Analysis Species- Link CMCC CIMP5 Imaging Data Federated Infrastructure & Pla\orm

Cloud-based deployment scenarios! Deployment A I/ O S C VM Instance Legend Applica%on Image Data Image Deployment B VM Instance S I/ O I/ O I/ O C VM Instance 1 VM Instance VM C Instance n C Data IO Component Compute Component Server Component Deployment C VM Instance 1 VM Instance 1 VM Instance S VM Instance VM Instance C n C C I/ O I/ O I/ O VM Instance VM C Instance m C

References! Papers [1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, Scientific big data analytics challenges at large scale, Big Data and Extreme-scale Computing (BDEC), April 30 to May 01, 2013, Charleston, USA (position paper). [2] S. Fiore, G. Aloisio, I. Foster, D. N. Williams, A software infrastructure for big data analytics, Big Data and Extreme-scale Computing (BDEC), February 26-28, 2014, Fukuoka, Japan (position paper). [3] S. Fiore, A. D'Anca, C. Palazzo, I. Foster, Dean N. Williams, Giovanni Aloisio, Ophidia: Toward Big Data Analytics for escience, ICCS 2013, June 5-7, 2013 Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385. [4] S. Fiore, C. Palazzo, A. D Anca, I. Foster, D. N. Williams, G. Aloisio, A big data analytics framework for scientific data management, Workshop on Big Data and Science: Infrastructure and Services, IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8. [5] S. Fiore, A. D'Anca, D. Elia, C. Palazzo, I. Foster, D. Williams, G. Aloisio, "Ophidia: A Full Software Stack for Scientific Data Analytics, proc. of the 2014 International Conference on High Performance Computing & Simulation (HPCS 2014), July 21 25, 2014, Bologna, Italy, pp. 343-350, ISBN: 978-1-4799-5311-0 For more info, please contact: - Dr. Sandro Fiore (sandro.fiore@cmcc.it) - Prof. Giovanni Aloisio (giovanni.aloisio@unisalento.it)

Questions?!