Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery

Center for Information Services and High Performance Computing (ZIH)
Richard Grunzke*, Jens Krüger, Sandra Gesing, Sonja Herres-Pawlis, Alexander Hoffmann, Alvaro Aguilera, Wolfgang E. Nagel
richard.grunzke@tu-dresden.de

Data Life Cycles
  Data from creation, management, analysis, utilization and archiving
  Focus on generating insights based on data
  Data exploration as the additional paradigm of science
  Copyright: KIT

Data Life Cycles: Big Data and HPC
  Large-scale simulations with HPC
    Result data can be in the petabyte range
  Instruments such as high-throughput microscopes
    0.85 GB/s, 2 petabytes monthly
  Big Data and growing rapidly
  HPC to extract information for knowledge gain
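The two figures quoted for the microscope are consistent with each other. The following back-of-the-envelope check is a minimal sketch that assumes decimal units (1 PB = 1,000,000 GB), a 30-day month and uninterrupted acquisition at the stated rate; none of these assumptions appear on the slide.

```python
# Sanity check of the slide's figures: 0.85 GB/s sustained for one month.
# Assumptions (not from the slides): decimal units, 30-day month, no downtime.
RATE_GB_PER_S = 0.85
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30
GB_PER_PB = 1_000_000


def monthly_volume_pb(rate_gb_per_s: float, days: int = DAYS_PER_MONTH) -> float:
    """Return the data volume in petabytes produced at a constant rate."""
    return rate_gb_per_s * SECONDS_PER_DAY * days / GB_PER_PB


volume = monthly_volume_pb(RATE_GB_PER_S)
print(f"{volume:.2f} PB per month")
```

Under these assumptions the result is roughly 2.2 PB, which matches the rounded "2 petabytes monthly" on the slide.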

Data Life Cycles: Complexity
  Infrastructures ever more complex
  Data sources: detectors, simulations, distributed sensors, ...
  Data management: storage hierarchy, geographical distribution, transfers, protocols, HPC and user access, AAI, ...
  HPC: heterogeneous architectures, cores, nodes, OS, network, ...
  Data sinks: scratch, home, repository, archive, ...
  Usage: ssh, batch systems, tools, clients, formats, data sharing, visualization, ...

Data Life Cycles: Complexity
  Users expected to learn all this?
  Few will even attempt as they want to concentrate on their science
  Many potential new HPC users would not begin
  Users do better science faster via accessible HPC and Big Data
  Driving and sustaining force behind HPC

Data Life Cycles: Complexity
  As complexity increases, productivity decreases
  Maintaining usefulness via abstraction to hide complexity and automation to avoid manual tasks
    Frameworks and libraries
    Modeling and simulation approaches
    Automated parallelization and error detection
    Graphically aided performance analysis and optimization
    Computing and workflow middlewares
    Data and metadata management systems
    Science gateways and virtual research environments
    Visualization

Data Life Cycles: Data Sources
  Instruments
    Detectors in particle accelerators
    High-throughput microscopes
    Distributed sensors measuring properties of wind power stations
  Computing Resources
    Large scale simulations
    Results of high-throughput data analysis

Data Life Cycles: Data Management
  Storage hierarchy: Ramdisk, SSD, HDD, SAN, NAS, Tape
  Parallel file systems with focus on storing data in form of files
    GPFS, Lustre, pNFS, HDFS, ...
  Distributed data management systems with advanced features
    iRODS, dCache, XtreemFS, UNICORE, ...

Data Life Cycles: Metadata
  Metadata as information about data to organize it based on content
  Higher level functionality on top of data management
  Easy discovery of data fundamental for its usefulness
  Highly complex situation with many standards and systems
  Copyright 2009-2010 Jenn Riley. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License http://creativecommons.org/licenses/by-nc-sa/3.0/us/

Data Life Cycles: Metadata Management
  Centralized metadata catalog
    + Consistent uniform view
    + Directly searchable
    - Potential bottleneck
    - Single point of failure
    - Archiving complex
    AMGA Metadata Service, DSpace, Fedora Commons, ISOcat, ...
  Systems with metadata in close proximity to data
    + More failure-resistant and more scalable
    + More suitable for long-term archiving
    - Central component for searchability necessary
    - No uniform view
    - Possibly more files
    HDF5, NeXus, NetCDF, ...
  Systems with a combined proximity approach
    + Combination of earlier approaches
    - More complex
    - Possibly more files
    - Consistency management
    UNICORE Metadata Service
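The trade-off between a centralized catalog and metadata kept close to the data can be illustrated with a small sketch. It is not the API of any system named on the slide: the record names, the `.meta.json` sidecar convention and all function names are hypothetical, and sidecar files stand in for formats such as HDF5 or NetCDF, which embed metadata inside the data file itself.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: file name -> metadata (illustrative only).
RECORDS = {
    "run_001.dat": {"instrument": "microscope", "sample": "A"},
    "run_002.dat": {"instrument": "detector", "sample": "B"},
}


def search_catalog(catalog: dict, key: str, value: str) -> list:
    """Centralized approach: one index answers queries directly."""
    return sorted(name for name, meta in catalog.items() if meta.get(key) == value)


def write_sidecars(directory: Path, records: dict) -> None:
    """Proximity approach: store each file's metadata next to the data file."""
    for name, meta in records.items():
        (directory / name).write_bytes(b"")  # empty placeholder data file
        (directory / (name + ".meta.json")).write_text(json.dumps(meta))


def search_sidecars(directory: Path, key: str, value: str) -> list:
    """Without a central index, a search has to scan every sidecar file."""
    hits = []
    for sidecar in directory.glob("*.meta.json"):
        meta = json.loads(sidecar.read_text())
        if meta.get(key) == value:
            hits.append(sidecar.name[: -len(".meta.json")])
    return sorted(hits)


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    write_sidecars(data_dir, RECORDS)
    central_hits = search_catalog(RECORDS, "instrument", "microscope")
    sidecar_hits = search_sidecars(data_dir, "instrument", "microscope")

print(central_hits, sidecar_hits)
```

Both searches return the same file, but the sidecar variant touches every metadata file to do so, which is why the slide lists a central component for searchability as necessary. It also doubles the number of files, while the catalog variant concentrates all queries, and all risk, in one place.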

Data Life Cycles: Computing Management
  Supercomputers, clusters, architectures, CPUs, RAM, operating systems, racks, nodes, interconnects, batch systems
  Abstraction of highly complex computing resources
    User-driven: user directly initiates tasks
      UNICORE, Globus Toolkit, gLite, ...
    Workflow-driven: user creates and submits workflow
      gUSE, UNICORE, ...
    Data-driven: tasks automatically executed by pre-defined rules
      iRODS, UNICORE, ...
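The data-driven mode, in which tasks run automatically once pre-defined rules match, can be sketched as a tiny rule engine. This is an illustration only and not the rule language of iRODS or UNICORE: the `Rule` class, the event fields and the 100 GB threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A pre-defined rule: run `action` whenever `condition` matches an event."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]


def apply_rules(event: dict, rules: List[Rule]) -> List[str]:
    """Run every matching rule, in order, without any user interaction."""
    return [rule.action(event) for rule in rules if rule.condition(event)]


# Hypothetical rules; the size threshold is illustrative only.
RULES = [
    Rule(
        "checksum-on-ingest",
        lambda e: e["type"] == "ingest",
        lambda e: f"checksum {e['path']}",
    ),
    Rule(
        "replicate-large-files",
        lambda e: e["type"] == "ingest" and e["size_gb"] > 100,
        lambda e: f"replicate {e['path']}",
    ),
]

event = {"type": "ingest", "path": "/data/run_001.dat", "size_gb": 250}
triggered = apply_rules(event, RULES)
print(triggered)
```

Here the arrival of a data set is the only trigger. In the user-driven and workflow-driven modes the same actions would be started by an explicit submission instead.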

Data Life Cycles: Workflow Management
  Higher level functionality based on computing management
  Workflow as chaining together of multiple applications
  Support for dependencies, loops, sequential and parallel execution
  UNICORE, GWES, gUSE, BIS-Grid, Kepler, ...
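A workflow in this sense is a dependency graph of applications. The sketch below is a minimal illustration using Python's standard `graphlib` module (Python 3.9 or later) and is not how any of the listed workflow systems is implemented. The step names are hypothetical. Steps grouped into the same stage have no dependency on each other and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "prepare_structure": set(),
    "molecular_dynamics": {"prepare_structure"},
    "docking": {"prepare_structure"},
    "collect_results": {"molecular_dynamics", "docking"},
}


def execution_stages(workflow: dict) -> list:
    """Group steps into stages; all steps within one stage are independent."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)  # mark the stage as finished to unlock successors
    return stages


stages = execution_stages(WORKFLOW)
for number, steps in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(steps)}")
```

A real workflow engine would submit each ready step to a batch system and call `done` only when the job finishes. Loops, which the slide also mentions, need more than a plain acyclic graph and are left out of this sketch.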

Data Life Cycles: Data Sinks
  Data stored according to re-use probability
    Scratch file system
    Home directory
    Digital data repository
    Long-term archive

Data Life Cycles: Utilization
  User interfaces important for acceptance among scientists
  Flexibility vs usability
  Command-line-based access: highly customizable and scriptable
    UNICORE, Globus Toolkit, gLite, ...
  Rich-client-based access: local software installation required
    UNICORE, Taverna Workbench, ...
  Web-based access: always up-to-date, single point of entry to infrastructures
    UNICORE Portal, Galaxy, WS-PGRADE, Apache Airavata, Vine Toolkit, ...

Data Life Cycles: MoSGrid Science Gateway
  HPC and workflow enabled science gateway for molecular simulations
  Built in BMBF project
  350 users
  3 chemical application domains
  70 workflows with 90 applications
  Extended in two EU projects and being ported to US XSEDE infrastructure
  Further follow-up funding proposals submitted
  Molecular Dynamics, Docking, Quantum Chemistry
  J. Krüger*, R. Grunzke*, S. Gesing*, et al.: The MoSGrid Science Gateway - A Complete Solution for Molecular Simulations, Journal of Chemical Theory and Computation, 2014.

Data Life Cycles: VAVID
  HPC and workflow enabled science gateway for car crash simulations and wind turbine sensor data
  BMBF project based on the MoSGrid idea
  Duration of 3 years

Summary
  Challenge of quickly rising data and computing demands
  Increasing complexity of data-intensive HPC needs to be managed to maintain and increase relevancy to users
    Done by abstraction and automation
    Data, computing, metadata, workflow management
    Science gateways for productivity
  Important goals
    Federated security
    Big Data
    Resilience
    Usability
    Sustainability
  Balance required between opposing goals

Thanks for Listening!