Center for Information Services and High Performance Computing (ZIH)
Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery
Richard Grunzke*, Jens Krüger, Sandra Gesing, Sonja Herres-Pawlis, Alexander Hoffmann, Alvaro Aguilera, Wolfgang E. Nagel
richard.grunzke@tu-dresden.de
Data Life Cycles
  Data from creation through management, analysis, and utilization to archiving
  Focus on generating insights based on data
  Data exploration as an additional paradigm of science
  Copyright: KIT
Data Life Cycles: Big Data and HPC
  Large-scale simulations with HPC
    Result data can be in the petabyte range
  Instruments such as high-throughput microscopes
    0.85 GB/s, 2 petabytes monthly
  Big Data, and growing rapidly
  HPC to extract information for knowledge gain
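The monthly volume quoted for the microscope follows from its data rate. A minimal sketch of that arithmetic, assuming continuous acquisition, a 30-day month, and decimal units (1 PB = 1,000,000 GB); the function name and defaults are illustrative and not taken from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_volume_pb(rate_gb_per_s: float, days: int = 30,
                      gb_per_pb: float = 1_000_000) -> float:
    """Return the data volume in petabytes produced at a sustained rate.

    Assumes the instrument records without interruption for `days` days.
    """
    return rate_gb_per_s * SECONDS_PER_DAY * days / gb_per_pb


# 0.85 GB/s sustained for 30 days, in petabytes
volume = monthly_volume_pb(0.85)
```

Under these assumptions the result is about 2.2 PB, which is consistent with the 2 petabytes monthly on the slide.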
Data Life Cycles: Complexity
  Infrastructures ever more complex
  Data sources: detectors, simulations, distributed sensors, ...
  Data management: storage hierarchy, geographical distribution, transfers, protocols, HPC and user access, AAI, ...
  HPC: heterogeneous architectures, cores, nodes, OS, network, ...
  Data sinks: scratch, home, repository, archive, ...
  Usage: ssh, batch systems, tools, clients, formats, data sharing, visualization, ...
Data Life Cycles: Complexity
  Are users expected to learn all this?
    Few will even attempt it, as they want to concentrate on their science
    Many potential new HPC users would not even begin
  Users do better science faster via accessible HPC and Big Data
    Driving and sustaining force behind HPC
Data Life Cycles: Complexity
  As complexity increases, productivity decreases
  Maintaining usefulness via abstraction to hide complexity and automation to avoid manual tasks
    Frameworks and libraries
    Modeling and simulation approaches
    Automated parallelization and error detection
    Graphically aided performance analysis and optimization
    Computing and workflow middlewares
    Data and metadata management systems
    Science gateways and virtual research environments
    Visualization
Data Life Cycles: Data Sources
  Instruments
    Detectors in particle accelerators
    High-throughput microscopes
    Distributed sensors measuring properties of wind power stations
  Computing resources
    Large-scale simulations
    Results of high-throughput data analysis
Data Life Cycles: Data Management
  Storage hierarchy: RAM disk, SSD, HDD, SAN, NAS, tape
  Parallel file systems with a focus on storing data in the form of files
    GPFS, Lustre, pNFS, HDFS, ...
  Distributed data management systems with advanced features
    iRODS, dCache, XtreemFS, UNICORE, ...
Data Life Cycles: Metadata
  Metadata as information about data, used to organize it based on content
  Higher-level functionality on top of data management
  Easy discovery of data is fundamental to its usefulness
  Highly complex situation with many standards and systems
  Copyright 2009-2010 Jenn Riley. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License http://creativecommons.org/licenses/by-nc-sa/3.0/us/
Data Life Cycles: Metadata Management
  Centralized metadata catalog
    + Consistent uniform view
    + Directly searchable
    - Potential bottleneck
    - Single point of failure
    - Archiving complex
    Examples: AMGA Metadata Service, DSpace, Fedora Commons, ISOcat, ...
  Systems with metadata in close proximity to data
    + More failure-resistant and more scalable
    + More suitable for long-term archiving
    - Central component necessary for searchability
    - No uniform view
    - Possibly more files
    Examples: HDF5, NeXus, NetCDF, ...
  Systems with a combined proximity approach
    + Combination of the earlier approaches
    - More complex
    - Possibly more files
    - Consistency management
    Examples: UNICORE Metadata Service
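The trade-off between metadata stored next to the data and a central searchable component can be illustrated with a small sketch. It assumes a hypothetical convention in which each data file has a JSON sidecar file named with the suffix `.meta.json`, plus an in-memory index rebuilt by scanning the directory tree. This is an illustration of the combined approach in general, not the format or API of the UNICORE Metadata Service or any other listed system.

```python
import json
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Store metadata in a JSON file next to the data file."""
    sidecar = data_file.with_name(data_file.name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(metadata, sort_keys=True))
    return sidecar


def build_index(root: Path) -> dict:
    """Scan all sidecar files below root into one searchable mapping.

    This is the central component; it can be rebuilt from the sidecars
    at any time, so losing it does not lose metadata.
    """
    index = {}
    for sidecar in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        data_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
        index[str(sidecar.with_name(data_name))] = json.loads(sidecar.read_text())
    return index


def search(index: dict, key: str, value) -> list:
    """Return the paths of all data files whose metadata has key == value."""
    return sorted(path for path, meta in index.items() if meta.get(key) == value)
```

The sketch also shows the listed drawbacks: every data file gains a second file, and the index becomes stale whenever a sidecar changes without a rescan, which is the consistency management problem.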
Data Life Cycles: Computing Management
  Supercomputers, clusters, architectures, CPUs, RAM, operating systems, racks, nodes, interconnects, batch systems
  Abstraction of highly complex computing resources
    User-driven: the user directly initiates tasks
      UNICORE, Globus Toolkit, gLite, ...
    Workflow-driven: the user creates and submits a workflow
      gUSE, UNICORE, ...
    Data-driven: tasks are executed automatically according to predefined rules
      iRODS, UNICORE, ...
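The data-driven mode, in which arriving data triggers tasks according to predefined rules, can be sketched as follows. The rule structure, rule names, and file patterns are hypothetical and chosen only for illustration; they do not reproduce the iRODS rule language or the UNICORE data-oriented processing configuration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A predefined rule: a condition on a file path and a task to run."""
    name: str
    matches: Callable[[str], bool]
    action: Callable[[str], str]


def on_new_file(path: str, rules: list) -> list:
    """Run the action of every rule whose condition matches the new file."""
    return [rule.action(path) for rule in rules if rule.matches(path)]


# Hypothetical rules; the actions only return a description of the task
RULES = [
    Rule("extract-metadata",
         lambda p: p.endswith(".h5"),
         lambda p: f"extract metadata from {p}"),
    Rule("archive-results",
         lambda p: "/results/" in p,
         lambda p: f"archive {p}"),
]

triggered = on_new_file("/data/results/run.h5", RULES)
```

In a real system the actions would submit jobs or start transfers; here they return strings so the control flow stays visible. The user never initiates a task directly, which is the difference from the user-driven and workflow-driven modes.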
Data Life Cycles: Workflow Management
  Higher-level functionality based on computing management
  Workflow as the chaining together of multiple applications
  Support for dependencies, loops, and sequential and parallel execution
  UNICORE, GWES, gUSE, BIS-Grid, Kepler, ...
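How a workflow system derives sequential and parallel execution from declared dependencies can be sketched with the Python standard library (`graphlib`, Python 3.9 or later). The task names are hypothetical, and the sketch covers only acyclic dependencies; loops, which the listed systems also support, are not modeled here.

```python
from graphlib import TopologicalSorter


def execution_stages(dependencies: dict) -> list:
    """Group tasks into stages; tasks within one stage can run in parallel.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical workflow: two simulations share one input and feed one analysis
WORKFLOW = {
    "prepare_input": set(),
    "simulate_a": {"prepare_input"},
    "simulate_b": {"prepare_input"},
    "analyze": {"simulate_a", "simulate_b"},
}

stages = execution_stages(WORKFLOW)
```

For this workflow the stages are the input preparation, then both simulations together, then the analysis. A workflow engine would submit each stage's tasks to the computing management layer instead of merely listing them.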
Data Life Cycles: Data Sinks
  Data stored according to re-use probability
    Scratch file system
    Home directory
    Digital data repository
    Long-term archive
Data Life Cycles: Utilization
  User interfaces important for acceptance among scientists
  Flexibility vs. usability
  Command-line-based access: highly customizable and scriptable
    UNICORE, Globus Toolkit, gLite, ...
  Rich-client-based access: local software installation required
    UNICORE, Taverna Workbench, ...
  Web-based access: always up to date, single point of entry to infrastructures
    UNICORE Portal, Galaxy, WS-PGRADE, Apache Airavata, Vine Toolkit, ...
Data Life Cycles: MoSGrid Science Gateway
  HPC- and workflow-enabled science gateway for molecular simulations
  Built in a BMBF project
  350 users
  3 chemical application domains
    Molecular Dynamics, Docking, Quantum Chemistry
  70 workflows with 90 applications
  Extended in two EU projects and being ported to the US XSEDE infrastructure
  Further follow-up funding proposals submitted
  J. Krüger*, R. Grunzke*, S. Gesing*, et al.: The MoSGrid Science Gateway - A Complete Solution for Molecular Simulations, Journal of Chemical Theory and Computation, 2014.
Data Life Cycles: VAVID
  HPC- and workflow-enabled science gateway for car crash simulations and wind turbine sensor data
  BMBF project based on the MoSGrid idea
  Duration of 3 years
Summary
  Challenge of quickly rising data and computing demands
  Increasing complexity of data-intensive HPC needs to be managed to maintain and increase its relevance to users
  Done by abstraction and automation
    Data, computing, metadata, and workflow management
    Science gateways for productivity
  Important goals
    Federated security
    Big Data
    Resilience
    Usability
    Sustainability
  Balance required between opposing goals
Thanks for Listening!