DIY Parallel Data Analysis



Similar documents
Towards Scalable Visualization Plugins for Data Staging Workflows

Roadmap for Many-core Visualization Software in DOE. Jeremy Meredith Oak Ridge National Laboratory

Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization Hank Childs + D. Pugmire, D. Camp, C. Garth, G.

Parallel Analysis and Visualization on Cray Compute Node Linux

HPC & Visualization. Visualization and High-Performance Computing

The Scalable Data Management, Analysis, and Visualization (SDAV) Institute

Visualizing Large, Complex Data

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System

Hank Childs, University of Oregon

NEXT-GENERATION. EXTREME Scale. Visualization Technologies: Enabling Discoveries at ULTRA-SCALE VISUALIZATION

In-situ Visualization

Visualization and Data Analysis

Parallel I/O on Mira Venkat Vishwanath and Kevin Harms

SciDAC Visualization and Analytics Center for Enabling Technology

Min Si. Argonne National Laboratory Mathematics and Computer Science Division

Parallel Large-Scale Visualization

Performance Engineering of the Community Atmosphere Model

Collaborative modelling and concurrent scientific data analysis:

HPC enabling of OpenFOAM R for CFD applications

Data Centric Systems (DCS)

Introduction to Visualization with VTK and ParaView

Using open source and commercial visualization packages for analysis and visualization of large simulation dataset

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Software and Performance Engineering of the Community Atmosphere Model

Open Source CFD Solver - OpenFOAM

The Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT): A Vision for Large-Scale Climate Data

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

Large-Data Software Defined Visualization on CPUs

Visualization Infrastructure and Services at the MPCDF

Performance Monitoring of Parallel Scientific Applications

VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O Systems

Post-processing and Visualization with Open-Source Tools. Journée Scientifique Centre Image April 9, Julien Jomier

ParaView Catalyst: Enabling In Situ Data Analysis and Visualization

How To Visualize At The Dlr

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Cosmological simulations on High Performance Computers

Parallel Visualization for GIS Applications

Kommunikation in HPC-Clustern

Uintah Framework. Justin Luitjens, Qingyu Meng, John Schmidt, Martin Berzins, Todd Harman, Chuch Wight, Steven Parker, et al

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp

9700 South Cass Avenue, Lemont, IL URL: fulin

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

Jean-Pierre Panziera Teratec 2011

Petascale Visualization: Approaches and Initial Results

Postdoctoral Researcher, Data Sciences Group

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

How To Visualize Performance Data In A Computer Program

Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process

Open source, open science

Visualisatie BMT. Introduction, visualization, visualization pipeline. Arjan Kok Huub van de Wetering

Big Data Management in the Clouds and HPC Systems

The Visualization Pipeline

Petascale Software Challenges. William Gropp

HPC Deployment of OpenFOAM in an Industrial Setting

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

MAQAO Performance Analysis and Optimization Tool

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

ABSTRACT FOR THE 1ST INTERNATIONAL WORKSHOP ON HIGH ORDER CFD METHODS

Network for Sustainable Ultrascale Computing (NESUS)

Kriterien für ein PetaFlop System

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning

Basin simulation for complex geological settings

Scientific Visualization with Open Source Tools. HM 2014 Julien Jomier

An Integrated Simulation and Visualization Framework for Tracking Cyclone Aila

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

Parallel Computing for Option Pricing Based on the Backward Stochastic Differential Equation

High performance computing systems. Lab 1

Implementing MPI-IO Shared File Pointers without File System Support

Source Code Transformations Strategies to Load-balance Grid Applications

Resource Allocation Avoiding SLA Violations in Cloud Framework for SaaS

Transcription:

I have had my results for a long time, but I do not yet know how I am to arrive at them. Carl Friedrich Gauss, 1777-1855 DIY Parallel Data Analysis APTESC Talk 8/6/13 Image courtesy pigtimes.com Tom Peterka tpeterka@mcs.anl.gov Mathematics and Computer Science Division

Postprocessing Scientific Data Analysis in HPC Environments Examples: 2D statistical graphics using R 3D scientific visualization using ParaView Scientific visualization using VisIt 2

Run-time Scientific Data Analysis in HPC Environments Examples: GLEAN, ADIOS, ParaView Coprocessing Library R, ParaView, VisIt Analyze Examples: GLEAN, ADIOS, DIY 3

Scientific Data Analysis Today Big science = big data, and Big data analysis => big science resources Data analysis is data intensive. Data intensity = data movement. Parallel = data parallel (for us) Big data => data decomposition Task parallelism, thread parallelism, while important, are not part of this work Most analysis algorithms are not up to the challenge Either serial, or Communication and I/O are scalability killers 4

Data Analysis Comes in Many Flavors Visual Particle tracing of thermal hydraulics flow Topological Morse-Smale Complex of combustion Statistical Information entropy analysis of astrophysics Geometric Voronoi tessellation of cosmology

You Have Two Choices to Parallelize Data Analysis By hand or With tools void ParallelAlgorithm() { MPI_Send(); MPI_Recv(); MPI_Barrier(); MPI_File_write(); } void ParallelAlgorithm() { LocalAlgorithm(); DIY_Merge_blocks(); DIY_File_write() } 6

DIY helps the user write data-parallel analysis algorithms by decomposing a problem into blocks and communicating items between blocks. Features Parallel I/O to/from storage Domain decomposition Network communication Utilities Library Written in C++ with C bindings Autoconf build system (configure, make, make install) Lightweight: libdiy.a 800KB Maintainable: ~15K lines of code, including examples!"#$%&'"() *"+$&%",&'"()-.((% 4%&+56-789:;;;6-</== >&3&*"8?6-*"+@' /)&%0+"+-1"23&30 @.16-A+$B%(?6-C5$%%6-*.D E@F @OA M8&N E&'& P3"'8- M8+$%'+ E@F E8K(#L(+"'"() J%(K9")H /++"H)#8)' =(##$)"K&'"() 78"H52(3 I%(2&% G>@ Q'"%"'"8+ >&3&%%8% =(#L38++"() E&'&'0L8 =38&'"() >&3&%%8%!(3' DIY usage and library organization 7

1. Separate analysis ops from data ops 2. Group data items into blocks 3. Assign blocks to processes 4. Group blocks into neighborhoods Nine Things That DIY Does 5. Support multiple multiple instances of 2, 3, and 4 6. Handle time 7. Communicate between blocks in various ways 8. Read data and write results 9. Integrate with other libraries and tools ()*+,!"#$6/!$0/ /01!"1) 2$"34(*.4**5!&!%!'!'!7!8 -$.!"+$/ $0*+4!8!9!: /01+$!$#0*.1) 2$"34(*.4**5!"#$!"#$%&'(('( )"#$%&'(('( *"#$%&'((!"#$%&'()*%+$#,$-$#./$#,$'$/#/'*$#,$01$2%3456#75##8+ 8

Writing a DIY Program Documentation README for installation User s manual with description, examples of custom datatypes, complete API reference Tutorial Examples Block I/O: Reading data, writing analysis results Static: Merge-based, Swap-based reduction, Neighborhood exchange Time-varying: Neighborhood exchange Spare thread: Simulation and analysis overlap MOAB: Unstructured mesh data model VTK: Integrating DIY communication with VTK filters R: Integrating DIY communication with R stats algorithms Multimodel: multiple domains and communicating between them 9

Particle Tracing Particle tracing of ¼ million particles in a 20483 thermal hydraulics dataset results in strong scaling to 32K processes and an overall improvement of 2X over earlier algorithms 10

Information Entropy Computation of information entropy in 126x126x512 solar plume dataset shows 59% strong scaling efficiency. 11

Morse-Smale Complex Computation of Morse-Smale complex in 11523 Rayleigh-Taylor instability data set results in 35% end-to-end strong scaling efficiency, including I/O. 12

In Situ Voronoi Tessellation For 128 3 particles, 41 % strong scaling for total tessellation time, including I/O; comparable to simulation strong scaling. 13

Further Reading DIY Peterka, T., Ross, R., Kendall, W., Gyulassy, A., Pascucci, V., Shen, H.-W., Lee, T.-Y., Chaudhuri, A.: Scalable Parallel Building Blocks for Custom Data Analysis. Proceedings of Large Data Analysis and Visualization Symposium (LDAV'11), IEEE Visualization Conference, Providence RI, 2011. Peterka, T., Ross, R.: Versatile Communication Algorithms for Data Analysis. 2012 EuroMPI Special Session on Improving MPI User and Developer Interaction IMUDI'12, Vienna, AT. DIY applications Peterka, T., Ross, R., Nouanesengsey, B., Lee, T.-Y., Shen, H.-W., Kendall, W., Huang, J.: A Study of Parallel Particle Tracing for Steady-State and Time-Varying Flow Fields. Proceedings IPDPS'11, Anchorage AK, May 2011. Gyulassy, A., Peterka, T., Pascucci, V., Ross, R.: The Parallel Computation of Morse-Smale Complexes. Proceedings of IPDPS'12, Shanghai, China, 2012. Nouanesengsy, B., Lee, T.-Y., Lu, K., Shen, H.-W., Peterka, T.: Parallel Particle Advection and FTLE Computation for Time-Varying Flow Fields. Proeedings of SC12, Salt Lake, UT. Peterka, T., Kwan, J., Pope, A., Finkel, H., Heitmann, K., Habib, S., Wang, J., Zagaris, G.: Meshing the Universe: Integrating Analysis in Cosmological Simulations. Proceedings of the SC12 Ultrascale Visualization Workshop, Salt Lake City, UT. Chaudhuri, A., Lee-T.-Y., Zhou, B., Wang, C., Xu, T., Shen, H.-W., Peterka, T., Chiang, Y.-J.: Scalable Computation of Distributions from Large Scale Data Sets. Proceedings of 2012 Symposium on Large Data Analysis and Visualization, LDAV'12, Seattle, WA. 14

The purpose of computing is insight, not numbers. Richard Hamming, 1962 Acknowledgments: Facilities Argonne Leadership Computing Facility (ALCF) Oak Ridge National Center for Computational Sciences (NCCS) Funding DOE SDMAV Exascale Initiative DOE Exascale Codesign Center DOE SciDAC SDAV Institute http://www.mcs.anl.gov/~tpeterka/software.html https://svn.mcs.anl.gov/repos/diy/trunk Tom Peterka tpeterka@mcs.anl.gov APTESC Talk 8/6/13 Mathematics and Computer Science Division