Real-Time Data Pipeline and Analysis using SPOT and HipGISAXS. Alexander Hexemer and Craig E. Tull, Lawrence Berkeley National Lab




Real-Time Data Pipeline and Analysis using SPOT and HipGISAXS. Alexander Hexemer and Craig E. Tull, Lawrence Berkeley National Lab. OLCF Users Meeting, June 24, 2015 @ OLCF

Data Challenge: up to 2 Petabytes/year of raw data. [chart: data rate in GB/month]

Hard X-ray Scattering: recover morphology from scattering. Hard X-ray energy: 10 keV; contrast from electron density difference. BL 7.3.3. [figure: scattering data to morphology]

X-ray Scattering: measure structure sizes from 1 Å to 100's of nm

Fast Detectors + in situ experiments

Data Collection: Robot
- Epson G6 SCARA robot with barcode reader
- Each puck is a kinematic mount for repeatable, secure transfers to the measurement stage

Data Flow: data collection, transfer to NERSC, on-the-fly calibration and processing, real-time access via web portal. Combines GIXSGUI and dpdak with analysis and modeling on NERSC supercomputers: HipGISAXS simulation and HipRMC fitting (start with a random system, move a particle at random, compare with data), plus autotuning.

HiPIES: High Performance Interactive Environment for Scattering. Same code for client and supercomputer. On-demand file loading of large data sets, local or cloud. Automation: data reduction, peak finding, arc finding, background subtraction, time-line compression, etc.

Printing Organic Photovoltaics & Real-time Analysis. Show that on-the-fly data analysis of complex systems is possible by combining:
- state-of-the-art materials science questions (OPV or Nafion)
- state-of-the-art X-ray detectors and instrumentation
- advanced mathematical algorithms and software
- fast data movement and visualization
- runs on some of the fastest computers in the world
A. Hexemer (LBNL/CAMERA), C.E. Tull (LBNL), J. Deslippe (NERSC), R.S. Canon (NERSC), E. Dart (ESnet), I. Foster (ANL), J.A. Sethian (LBNL/CAMERA), G. Shipman (ORNL), J. Wells (ORNL), K. Kleese van Dam (PNNL), T.P. Russell (UMass), E. Gomez (PennState). Facilities: ALS (BES), NERSC (ASCR), ANL (ASCR), OLCF (ASCR), ESnet (ASCR), CAMERA (ASCR)

Nafion Ionomer: polymer-based proton-conductive membranes used as fuel-cell membranes; Teflon backbone plus side groups. Pavel V. Komarov, doi:10.3762/bjnano.4.65

Tensile Test of Nafion, a polymeric fuel-cell membrane

Reverse Monte Carlo fitting: start with a random system (a polymer system with filler particles), move a particle at random, and compare the FFT + mask of the configuration against data from the experiment.
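The loop on this slide (random start, random particle move, compare against masked scattering data) can be sketched as a toy stand-in for HipRMC. Everything below is illustrative, not the HipRMC implementation: the "scattering pattern" is just |FFT|^2 of a binary image, and only improving moves are kept (real RMC also accepts some worsening moves Metropolis-style).

```python
import numpy as np

rng = np.random.default_rng(0)

def scattering(img):
    """Model 'scattering pattern': |FFT|^2 of the real-space configuration."""
    return np.abs(np.fft.fft2(img)) ** 2

def chi2(img, target, mask):
    """Misfit between simulated and measured patterns over the valid-pixel mask."""
    return np.sum(mask * (scattering(img) - target) ** 2)

def rmc_fit(target, mask, fill=0.2, steps=500, size=32):
    # start with a random system at the requested fill fraction
    img = (rng.random((size, size)) < fill).astype(float)
    cost = chi2(img, target, mask)
    for _ in range(steps):
        # move a particle at random: swap one occupied and one empty site
        occ = np.argwhere(img == 1.0)
        emp = np.argwhere(img == 0.0)
        a = tuple(occ[rng.integers(len(occ))])
        b = tuple(emp[rng.integers(len(emp))])
        img[a], img[b] = 0.0, 1.0
        new_cost = chi2(img, target, mask)
        if new_cost <= cost:   # compare: keep improving moves only (toy rule)
            cost = new_cost
        else:
            img[a], img[b] = 1.0, 0.0  # revert the move
    return img, cost
```

In practice the slide's 20 restarts from different random initial conditions would each run this loop independently, which is what makes the fit embarrassingly parallel across nodes.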

Reverse Monte Carlo Results: typical structure from fitting. [figure: data, fit, TEM (stained)]

How about real-time? Experiment: 1 frame per second, 600 frames total per sample, plus a 15 min sample change: 25 min total per sample. Time to fit a single frame per node on Titan and Edison (actually 20x the same data frame with different initial conditions, since we need statistics): 12-20 min. Possible!
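The feasibility claim is simple arithmetic, worth making explicit: acquisition plus sample change gives a 25 min window per sample, and the worst-case 20 min per-frame fit (one node per in-flight frame, on our reading of the slide) finishes inside it.

```python
# Numbers from the slide: 600 frames at 1 frame/s (10 min) + 15 min sample change.
frames, frame_rate_hz, change_min = 600, 1.0, 15
window_min = frames / frame_rate_hz / 60 + change_min
print(f"window per sample: {window_min:.0f} min")   # 25 min, as stated

# Fitting one frame takes 12-20 min per node (20 restarts for statistics),
# so with one node per in-flight frame even the worst case keeps up.
worst_fit_min = 20
print("real-time feasible:", worst_fit_min < window_min)
```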

R2R OPV Manufacturing (UMass Amherst) Goal: Low-Cost Materials, Low-Cost Manufacturing, Low-Cost Installation

Slot die printer

In situ slot die printer for OPVs

Organic Photovoltaic Processing using Roll-to-Roll. One of the revolutionary appeals of roll-to-roll manufacturing of organic photovoltaics (OPV) is the potential to achieve energy recovery times as low as 10 days (R. Sondergaard et al., Materials Today, Volume 15, Issues 1-2, January-February 2012, Pages 36-49). Roll-to-roll printer placed inside the beamline: conjugated polymer and small molecule, printer head, X-rays, edge-on packing. GIWAXS data we collected during printing. CETull@lbl.gov - 01 October 2014

High Performance GISAXS (part of CAMERA now): scattering simulation examples. Includes the latest theory and takes advantage of the latest architectures: multi-CPU/GPU. Scientists don't care where they do the calculation! www.camera.lbl.gov CETull@lbl.gov - 24 June 2015

Time evolution of GIWAXS patterns: during drying; later stages after drying

Real time? Titan: 1 K20X GPU/node
- 500 nodes, 20 agents, 20 generations: total time = 3110.00 s (avg. generation time = 155.50 s)
- 2000 nodes, 50 agents, 20 generations: total time = 2071.60 s (avg. generation time = 103.58 s)
- 8000 nodes, 80 agents, 20 generations: total time = 865.60 s (avg. generation time = 43.28 s)
Printing demo experiments created 36,000 frames in 3 days (1/2 year on TITAN to analyze).


Value, Volume, Velocity, Variety:
- Reciprocal space (scattering): 100,000's of images per day; HipGISAXS/HipRMC parallel scattering
- Real space (tomography): 100's of TB per year; Arec3d, QuantCT, ImgRec, GridRec
- Spectroscopy (MicroXAS): 100's of TFlops of GPU/CPU; ShirleyXAS (MSD), BerkeleyGW (NERSC)
"Past limitations of detector technology have been largely solved through recent DOE investment and that the present bottleneck for research throughput is the lack of availability of appropriate analysis software and modeling tools."1 (1 Advanced Data Analysis and Modeling Tools for Scattering Methods Workshop, Sept. 2010)
Light source scientists need accessible, scalable software systems: hybrid HPC/HTC/network resources; PB's of data per year soon (COSMIC); advanced algorithms & analysis; advanced simulation & theory; real-time ptychographic feedback; advanced reconstruction & visualizations.

The definition of "facility" must expand to provide scientific knowledge, not files: experimental data, science expertise, simulations, theory, algorithms, scalable software, data management, applied math, visualization, large collaborations, advanced networks, HPC & HTC, facility data systems, science gateways, user support.

SPOT Suite: integration of ALS, ESnet, and NERSC into a proto-super-facility. Computing Research Division, Advanced Light Source, Materials Science Division, ESnet, NERSC.
- CRD brings large-scale data science experience & advanced computational science expertise.
- ALS represents each data "theme", provides algorithm expertise, and allows at-scale tests using real data, real executables, and critical science.
- CRD, MSD, NERSC supply HPC-ready simulation, advanced algorithms (CAMERA), and visualization (VisIt).
- NERSC, ESnet provide raw capacity and experience to solve problems at scale (e.g. the Science DMZ).

SPOT Suite LDRD: multi-division team from CRD, ALS, ESnet, MSD, NERSC. Leverages prior experience and tools (ASCR, HEP, BES): Spade (IceCube, Daya Bay), Science Data Gateways (NERSC), Shirley-XAS (MSD), domain applications (ALS).
Initial ALS beamlines: 8.3.2 Micro-Tomography (D. Parkinson), 7.3.3 SAXS/WAXS (A. Hexemer), 12.3.4 Micro-Diffraction (N. Tamura).
Data Demos (funded by ASCR/BES Data Pilots, C. Susut-Bennett/E. Lessner): APS - Advanced Photon Source (ANL), LCLS - Linac Coherent Light Source (SLAC), NSLS - National Synchrotron Light Source (BNL).
Pseudo-production running at ALS beamlines, 24/7 operation (as of 26 Apr 2015 and since updated): 176,293 → 180,758 datasets; 155 → 157 beamline users; 1,050 → 1,257 TB data stored; 2,379,754 → >2.8 million jobs at NERSC.
Portal interfaces for: data monitoring & control, job monitoring, job inspection, data browsing and searching, browser-based remote visualization, metadata, provenance, algorithms, theory-simulation, etc.

Remote experiments now a reality. 25 Mar 2014: UK scientists conduct a remote experiment using the new BL 7.3.3 robot and SPOT, and are able to assess experimental data on a train to Zurich via the mobile interface. From Alessandro Sepe (as2237@cam.ac.uk): "Actually, I did not feel any difference between a standard beamtime and this NERSC remotely accessed beamtime, which is quite an extraordinary result."

Data Demos: can this model be extended to illustrate the Super-Facility?
Printing BCP Organic Photovoltaics & Real-time Analysis: advanced mathematical algorithms for data analysis, and large-scale simulations coupled to data analysis, are becoming critical to scientific understanding of BES light-source data. Many Reverse Monte Carlo (RMC) and simulation codes are HPC-scale, even exascale, and heavily optimized for particular (e.g. GPU) architectures. We want to eliminate boundaries between the scientist and the world's best algorithms running on the best architecture for each code.
Multi-Facility Imaging and Analysis Pipelines for Large X-ray Data: replication at new light-source facilities: APS (tomography), LCLS (crystallography), NSLS (3D chemical map). Evaluate common file formats & software for tomography; integrate VisIt high-quality remote visualization; integrate QuantCT filtering and segmentation; integrate post-segmentation analysis results.

APS-NERSC Demo: Tomography, TomoPy, and Data Exchange HDF5
The APS tomography beamline is developing a standardized Data Exchange format and reconstruction. Can we adapt a new format & integrate a new application? How does TomoPy scale? Does the increased performance justify the added data transfer time?
- Semaphore for data readiness; Spade (GridFTP) in push mode; no suitcasing
- Data Exchange format; wrap TomoPy; digest the new HDF5 Data Exchange format; image/volume displays unchanged
- Separate APS/ALS disk & CPU allocations; Spade to bridge SPOT workflows for a) TomoPy & b) harvesting images/metadata
- I/O profiling & burst buffer performance tests

"Data sets from different user groups dealing with completely different topics - covering all possible data collection schemes and sizes." "Co-designing Big Iron for Big Data Nick Wright, Chris Daley Profiling TomoPy and testing effect of Burst Buffers on Alva CETull@lbl.gov - 24 June 2015 Science DMZ as a Service - Creating Science Super-Facilities with GENI

NSLS-NERSC Demo: 3-D Chemical Mapping using Transmission X-rays
NSLS-II will be optimized to deliver ultra-high brightness and flux and exceptional beam stability. Apply nano-scale imaging and spectroscopic technology to improve battery design. Computationally intensive mapping and reconstruction can be accelerated by remote scientific data facilities; 3D visualization allows web-based image navigation.
- Semaphore for data readiness; Spade (GridFTP) in push mode; no suitcasing
- Data Exchange format; port/wrap 3DXANES; adapt image/volume displays to 3 species
- Separate NSLS/ALS disk & CPU allocations; Spade to bridge SPOT workflows for a) 3DXANES & b) harvesting images/metadata
Ran live at SC14.

Running 3DXANES at NERSC required specialized visualization for 3D chemistry. A hand-made sample with powders of LiFePO4, FePO4 and Fe. 3D spectroscopic imaging: a collection of 2D absorption-contrast projections at different angles with a series of monochromatic X-ray energies, generating a 4D data set. The focus is determining the chemical composition at each voxel position in a reconstructed 3D sample object. Workflow for each individual voxel position of the 3D sample:
- Center shift correction: align the centers of the projection images
- Volumetric reconstruction: recover the 3D sample at each energy level
- Sample fitting: reassemble the 4D data to form XANES spectra over the energy levels and perform least-squares fitting
[figure: composition maps for Fe, FePO4, LiFePO4]
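The per-voxel "sample fitting" step can be sketched compactly: treat each voxel's spectrum over the energy series as a linear combination of reference spectra (stand-ins here for Fe, FePO4, LiFePO4) and solve by least squares. This is a generic linear-combination fit under that assumption, not the actual 3DXANES code; the function and argument names are illustrative.

```python
import numpy as np

def fit_voxels(volume4d, refs):
    """Per-voxel linear-combination fit of XANES spectra.

    volume4d: (E, nx, ny, nz) reconstructed absorption at each energy level.
    refs:     (E, n_species) reference spectra, one column per species.
    Returns   (n_species, nx, ny, nz) composition coefficients per voxel.
    """
    E = volume4d.shape[0]
    spectra = volume4d.reshape(E, -1)              # one column per voxel
    # Solve refs @ coeffs ~= spectra for all voxels at once.
    coeffs, *_ = np.linalg.lstsq(refs, spectra, rcond=None)
    return coeffs.reshape((refs.shape[1],) + volume4d.shape[1:])
```

Because every voxel is fit independently, this step parallelizes trivially across the reconstructed volume, which is what made running it at a remote data facility attractive.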

LCLS-NERSC Demo: Photosynthesis and Nano-crystallography
LCLS (SLAC) is a femtosecond-pulse hard X-ray XFEL. LCLS has capable near-time analysis on-site, but LCLS-II will exceed local resources. Can we run near-time analysis at NERSC during a real experiment? Can we handle the data rates?
- SLAC/ESnet/NERSC network path; round-robin DTN load balance
- REST interface for data readiness; Spade (bbcp) in pull mode; no suitcasing of raw XTC streams
- Port & wrap psana/cctbx; digest XTC data; adapt 2D image displays
- Separate LCLS/ALS disk & CPU allocations; Spade to bridge SPOT workflows for a) Avg/Std/Max & b) cctbx Fast (chunk0) & Full (all chunks)

"SPOT was like an extra pair of hands working in the background." - N. Sauter, 6/22/15
June: 72 TB of data @ CXI. Dry run challenges: XTC, psana, network & queuing.
July: 55 TB of data @ XPP. Full system running (Spade, bbcp, DAQ REST, XTC, psana, cctbx, SPOT, HDF5).

Recent work: LCLS run, May 2015. SLAC LCLS run @ XPP beamline: 88 TB of data generated in 5 days.
1. Spade using bbcp to pull data from SLAC
2. Spade reading Igor's RESTful API to determine "file done"
3. Invoking psana on Edison via "psana(infile, outfile, configfile)"-like semantics
4. Scraping images into SPOT's display & associated metadata into MongoDB
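Steps 1-3 above form a simple poll-transfer-launch loop, which can be sketched as below. This is a hedged illustration of the control flow only: the `is_done` check stands in for the RESTful "file done" query, and the `bbcp`/`psana_wrapper` command lines are hypothetical, not the actual Spade or LCLS invocations. The command runner is injectable so the loop can be exercised without the real tools.

```python
import subprocess

def transfer_when_done(run_files, is_done, run_cmd=subprocess.run):
    """For each (remote, local, out) triple whose file the DAQ reports done,
    pull it and launch analysis on it. Returns the local paths launched."""
    launched = []
    for remote, local, out in run_files:
        if not is_done(remote):                 # step 2: "file done" check via REST
            continue
        run_cmd(["bbcp", remote, local])        # step 1: pull the data (illustrative CLI)
        run_cmd(["psana_wrapper", local, out])  # step 3: invoke psana-like analysis
        launched.append(local)                  # step 4 (scraping into SPOT) omitted
    return launched
```

Driving the loop on a timer against the readiness API gives the pseudo-real-time behavior described for the May 2015 run.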

Conclusions of the Data Challenges. Starting from 24/7 real-time results for 2 ALS beamlines, in 9 months we were able to:
- establish 3 new facility data pipelines in real time or pseudo-real time
- incorporate 3 new facility-developed codes (psana - LCLS, TomoPy - APS, 3DXANES - NSLS)
- adopt 3 new data file formats
- integrate Spade and Globus Online
- interface our workflow to OLCF/Titan
The Data Demos challenge accelerated our goals by 2-3 years, but only in heroic demo mode!

ALS-NERSC-Titan Super-Facility Demo: Real-time Analysis for Scattering Data
Is on-the-fly data analysis of complex systems possible?
- state-of-the-art materials science questions (OPV or Nafion)
- state-of-the-art X-ray detectors and instrumentation
- advanced mathematical algorithms and software
- fast data movement and visualization
- run on some of the fastest computers in the world
Implementation:
- Typical ALS chain + dpdak + HipRMC on a NERSC reservation
- Co-schedule ALS beamtime & a Titan reservation
- Typical ALS chain + dpdak => Spade => Globus Online => ORNL (FW)
- Sentry launches HipGISAXS on 8000 nodes and feeds results back in real time
- HipGISAXS runs live during beamtime (see next)
- Integrated Super-Facility Dashboard (not production quality)

GISAXS Printing Demo Dashboard

OPV Printing Demo Sept 11, 2014: structures evolve & develop during drying

Super-Facility Demo: Is real-time possible for SAXS/WAXS and GISAXS?
Reverse Monte Carlo (Titan & Edison): experiment: 1 frame per second, 600 frames total per sample + 15 min sample change = 25 min total; analysis: 12-20 minutes to fit a single frame per node. Real-time is possible!
GISAXS with HipGISAXS (Titan, 1 K20X GPU/node): the complexity of the material description led to intractably large calculations. Printing demo experiments created 36,000 frames in 3 days => ~1/2 TITAN-year to analyze. We need much better analysis and selection algorithms.

Photon science is poised for a new, transformative era of discovery. BES facilities are facing a data tsunami that will either bury us or yield unprecedented scientific output. The BES Facility Data Demos were extraordinarily ambitious: a loosely coupled, workflow-driven architecture allowed agile integration of new formats, algorithms, and visualizations, and the ability to quickly contribute attracted many participants. Super-facilities are the necessary next step in the evolution of BES light-source facilities and must be strategically supported: delivering scientific insight for all light-source users, with a concept generalizable to other science facilities.

BES Facilities Data Demos Participants:
ANL (APS, Globus*): Rachana Ananthakrishnan*, Ian Foster*, Francesco De Carlo, Arthur Glowacki, Doga Gursoy, Nicholas Schwarz
BNL (NSLS*, CSC): Wei Xu*, Shun Yao, Dantong Yu
LBNL (ALS, CRD, ESnet, NERSC*, PBD/CCI): Aaron Brewster, Shane Canon*, Eli Dart, Jack Deslippe*, Abdellilah Essiari, Enrique Gomez, Alex Hexemer, Dinesh Kumar, Jon Dugan, Sherry Li, Feng Liu, Harinarayan Krishnan, Dmitry Morozov, Dula Parkinson, Simon Patton, Prabhat*, Thomas Russell, Abhinav Sarje, Polite Stewart, Taghrid Salmak, Nicholas Sauter, Eric Schaible, Craig E. Tull, Singanallur Venkatakrishnan, Chenhui Zhu, Daniela Ushizima
ORNL (CADES*, OLCF): Galen M. Shipman*, Don Maxwell, James Rogers, Jack C. Wells
SLAC (LCLS*, SCA): Amber Boehnlein, Antonio Ceseracciu, Igor Gaponenko*, Wilko Kroeger, Christopher O'Grady, Amedeo Perazzo*
See http://spot.nersc.gov/docs.php for references, papers, presentations.