AMLIGHT, Simulation Datasets, and Global Data Sharing


Jean-Bernard Minster (1,2,4,6), John J. Helly (1,2), Steven M. Day (3,4), Raul Castro Escamilla (5), Philip Maechling (4), Thomas H. Jordan (4), Amit Chourasia (2,4), Mustapha Mokrane (6)
1 SIO, 2 SDSC, 3 SDSU, 4 SCEC, 5 CICESE, 6 ICSU-WDS
AMLIGHT, Big Data, Big Network, CICESE

Open data
Many countries have adopted an open data policy, at least for research and education (e.g. US, France, UK, ZA).
This often includes the output of numerical models and simulations.
But because laws differ between countries, large international organizations discuss principles instead of policy.

Data Sharing Policy
ICSU World Data Centers (1958-2007)
Federation of Astronomical and Geophysical Data Analysis Services (1958-2007)
Full and open access to data
Long-term data stewardship and curation

Data Sharing Principles
Group on Earth Observations (GEO, 130+ nations) / Global Earth Observation System of Systems (GEOSS), 2010-present
Equitable, unimpeded access to data for research and education
Long-term data preservation
Many exceptions (national security, privacy laws, commercial protection, ecological protection)

Data Sharing Policy
ICSU World Data System Data Policy (2008-present)
Full and open access to data
Long-term data stewardship

WDS Data Policy

Research Data Alliance and WDS (RDA/WDS, 2013)
Include socio-economic, health, and other data in policy discussions
Explore data publishing concepts and issues
Collaboration with publishers

This works for observational data in the natural sciences, especially environmental data, that can never be acquired again.
Perhaps also for socio-economic and human health data sets (with caveats, such as aggregation).

The Environmental Information System Tree (Francis Bretherton)
[Figure: tree diagram. Labels: Observations & Data Collection (International Networks, Measurement Systems, National Supplements); Archive; Quality Assurance; Integration & Validation; Models & Analysis Centers; Synthesized Core Products; Distribution & Use; End Users; Private Sector; Under Development. Legend: End user (public), End user (private), Distribution (full & open), Distribution (proprietary), Public data, Data buy.]

What about numerical simulation outputs?
Issues are many, and difficult, e.g.:
Volume (can be enormous)
Quality (how is it measured and controlled?)
Metadata (what should be included?)
Costs (is it cheaper to re-compute?)
Needs (longitudinal studies vs. one-off studies)
Requirements for data assimilation
Examples: weather prediction, climate simulations, earthquake simulations, earthquake prediction algorithms
This calls for a broad discussion.

Minimalist Metadata (automatic capture)
Code version
HW platform (e.g. CPU, GPU, word length, etc.)
SW platform (e.g. compiler, options)
Input and runtime options (workflow?)
Other (author, etc.; Dublin Core)
Even then, output might not be duplicated in a future rerun.
Many numerical outputs become obsolete.
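The automatic capture listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function name capture_run_metadata, the field names, and the example values for code version and compiler are all hypothetical, not a schema used by the authors. The time step and step count in the usage example are the M8 values quoted in this talk.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def capture_run_metadata(code_version, compiler, compiler_options, run_options):
    """Collect a minimal provenance record for one simulation run.

    Field names are illustrative only (hypothetical schema).
    """
    return {
        "code_version": code_version,
        "hardware": {
            "machine": platform.machine(),
            "processor": platform.processor(),
            # Native word length of the interpreter build, in bits.
            "word_length_bits": 64 if sys.maxsize > 2**32 else 32,
        },
        "software": {
            "os": platform.platform(),
            "compiler": compiler,
            "compiler_options": compiler_options,
        },
        "run_options": run_options,
        # Falls back to "unknown" when no user name is set in the environment.
        "author": os.environ.get("USER", "unknown"),
        "created": datetime.now(timezone.utc).isoformat(),
    }


record = capture_run_metadata(
    code_version="hypothetical-1.0",       # assumed placeholder
    compiler="hypothetical-compiler",      # assumed placeholder
    compiler_options=["-O2"],              # assumed placeholder
    run_options={"time_step_s": 0.0023, "n_steps": 160000},
)
print(json.dumps(record, indent=2))
```

Writing such a record next to every output file costs almost nothing, but as noted above it still does not guarantee that a rerun reproduces the output exactly.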

Example: TeraShake Simulation (2004)

Example: M8 Simulation (2010)

TeraShake vs. M8 comparison

                      TeraShake             M8                                    Notes
Dimensions            600 x 300 x 80 km     810 x 405 x 85 km
# cells               2 x 10^9              436 x 10^9
Time step             0.011 sec.            0.0023 sec.
# steps (duration)    20,000 (180 sec.)     160,000 (368 sec.)
# cores               240 (DataStar)        223,074 (CPU), 16,600 (GPU)
Wall clock            5 days                24 hours (CPU) *, 5 hours (GPU) **
Checkpoints           Every 1,000th step    Every 20,000th step
Checkpoints, each     150 Gbytes            32 Tbytes                             Cannot transfer
Checkpoints, total    3 Tbytes              192 Tbytes ***

* 220 Tflop/s
** 2.3 Pflop/s
*** Every 4 hrs
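The checkpoint totals in this comparison can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 Tbyte = 1,000 Gbytes) and reads the "every 4 hrs" note as applying to the 24-hour M8 CPU run; both are assumptions, and the function name is hypothetical.

```python
def total_checkpoint_gbytes(n_checkpoints, gbytes_each):
    """Total checkpoint storage in Gbytes for equally sized checkpoints."""
    return n_checkpoints * gbytes_each


# TeraShake: 20,000 steps, one checkpoint every 1,000th step, 150 Gbytes each.
terashake_count = 20_000 // 1_000
terashake_total = total_checkpoint_gbytes(terashake_count, 150)

# M8 (CPU run): 24 hours of wall clock, one checkpoint every 4 hours
# (assumed reading of the footnote), 32 Tbytes = 32,000 Gbytes each.
m8_count = 24 // 4
m8_total = total_checkpoint_gbytes(m8_count, 32_000)

print(f"TeraShake: {terashake_count} checkpoints, {terashake_total / 1000:g} Tbytes")
print(f"M8:        {m8_count} checkpoints, {m8_total / 1000:g} Tbytes")
```

Under these assumptions the results match the 3 Tbytes and 192 Tbytes quoted for the two runs.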

TeraShake vs. M8 comparison

Surface velocity vector field
  TeraShake: all nodes, every step: 1.1 TB
  M8: every other node, every 20th step: 4.4 TB (out of 352 TB)

Total volume velocity field, all nodes, all steps
  TeraShake: 432 Tbytes
  M8: 384 Pbytes

Volume velocity field, decimated
  TeraShake: all nodes, every 10th step: 45 Tbytes **
  M8: every other node, every 20th step: 4.8 Pbytes
  Notes: resolution OK for visualization
  ** No longer usefully readable, because of tape read errors

Typical viz. movie
  TeraShake: < 100 Gbytes
  M8: < 100 Gbytes
  Notes: interactive viz. possible
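The effect of decimation on storage can be illustrated with a small calculation. The sketch below assumes that keeping every k-th node along each of d spatial axes and every m-th time step divides the volume by k^d x m; the function name and that model are assumptions, not the authors' stated procedure. It is applied only to the two cases where the decimation pattern is unambiguous from the comparison: the M8 surface field (two spatial axes) and the TeraShake volume field (all nodes kept).

```python
def decimated_size(full_size, node_stride, step_stride, n_spatial_axes):
    """Size after keeping every node_stride-th node per axis and every
    step_stride-th time step (idealized model, no file-format overhead)."""
    return full_size / (node_stride ** n_spatial_axes * step_stride)


# M8 surface field: 352 TB in full, every other node on a 2-D surface,
# every 20th step.
m8_surface_tb = decimated_size(352, node_stride=2, step_stride=20, n_spatial_axes=2)

# TeraShake volume field: 432 Tbytes in full, all nodes, every 10th step.
terashake_volume_tb = decimated_size(432, node_stride=1, step_stride=10, n_spatial_axes=3)

print(f"M8 surface field, decimated:       {m8_surface_tb:.1f} TB")
print(f"TeraShake volume field, decimated: {terashake_volume_tb:.1f} TB")
```

The first result reproduces the 4.4 TB quoted for the M8 surface field; the second comes out at 43.2 TB, close to the 45 Tbytes listed for TeraShake.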

So what to save?
Possible strategy:
Only save enough to allow interactive (user- or purpose-specific) visualization, and use checkpoints to restart a partial calculation. This works for one-off simulations (e.g. 1-day weather, single earthquake). AMLIGHT permits that.
Save selected individual visualizations that characterize the run (small data sets). AMLIGHT makes it easy.
For long-term longitudinal research, such as climate research or earthquake prediction algorithms, some output may require long-term curation by a trusted repository. This must be discussed on a case-by-case basis. AMLIGHT makes the data repository look proximal.
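A rough transfer-time estimate shows why visualization-sized products are attractive to move over a network while full checkpoints are not. The sustained rate of 10 Gbit/s used below is purely an assumption for illustration and is not a figure for AMLIGHT; the estimate also ignores protocol overhead and uses decimal units. The sizes are those quoted in this talk for a typical visualization movie and for the M8 checkpoints.

```python
ASSUMED_LINK_GBIT_PER_S = 10.0  # assumption for illustration only


def transfer_hours(size_gbytes, link_gbit_per_s=ASSUMED_LINK_GBIT_PER_S):
    """Idealized transfer time in hours (no protocol overhead, decimal units)."""
    seconds = size_gbytes * 8.0 / link_gbit_per_s
    return seconds / 3600.0


products_gbytes = {
    "typical visualization movie (upper bound)": 100,
    "one M8 checkpoint": 32_000,
    "all M8 checkpoints": 192_000,
}

for name, size in products_gbytes.items():
    print(f"{name}: {transfer_hours(size):.2f} h")
```

At the assumed rate, a movie moves in about a minute and a half, a single M8 checkpoint takes several hours, and the full checkpoint set takes well over a day.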

TeraShake Visualization
Emmett McQuinn, Amit Chourasia
http://visservices.sdsc.edu/projects/scec/vectorviz/glyphsea/movies/GlyphSea-720p-cbr6.mp4

M8 Visualization
http://visservices.sdsc.edu/projects/scec/m8/1.0/movies/m8-2.0-vmag-MachCone-1600m-12020-65000-20stepintervalGlyphSea_1280.mov