AMLIGHT, Simulation Datasets, and Global Data Sharing
Jean-Bernard Minster (1,2,4,6), John J. Helly (1,2), Steven M. Day (3,4), Raul Castro Escamilla (5), Philip Maechling (4), Thomas H. Jordan (4), Amit Chourasia (2,4), Mustapha Mokrane (6)
1 SIO, 2 SDSC, 3 SDSU, 4 SCEC, 5 CICESE, 6 ICSU-WDS
AMLIGHT, Big Data, Big Network, CICESE
Open data
  Many countries have adopted an open data policy, at least for research and education (e.g. US, France, UK, ZA).
  This often includes the output of numerical models and simulations.
  But, because national laws differ, large international organizations discuss principles instead of policy.
Data Sharing Policy
  ICSU World Data Centers (1958-2007)
  Federation of Astronomical and Geophysical Data Analysis Services (1958-2007)
  Full and open access to data
  Long-term data stewardship and curation
Data Sharing Principles
  Group on Earth Observations (GEO, 130+ nations) / Global Earth Observation System of Systems (GEOSS), 2010-present
  Equitable, unimpeded access to data for research and education
  Long-term data preservation
  Many exceptions (national security, privacy laws, commercial protection, ecological protection)
Data Sharing Policy
  ICSU World Data System Data Policy (2008-present)
  Full and open access to data
  Long-term data stewardship
WDS Data Policy
Research Data Alliance and WDS (RDA/WDS, 2013)
  Include socio-economic, health, and other data in policy discussions
  Explore data publishing concepts and issues
  Collaboration with publishers
This works for observational data in the natural sciences, especially environmental data, that can never be acquired again.
Perhaps also for socio-economic and human health data sets (with caveats, such as aggregation).
The Environmental Information System Tree (Francis Bretherton)
  [Figure: tree diagram. Labels: Observations & Data Collection (International Networks, Measurement Systems, National Supplements); Archive; Quality Assurance; Integration & Validation; Models & Analysis Centers; Synthesized Core Products; Distribution & Use; End Users. Legend: End user (public), End user (private), Distribution (full & open), Distribution (proprietary), Public data, Data buy, Private Sector, Under Development.]
What about numerical simulation outputs?
  Issues are many, and difficult, e.g.:
  Volume (can be enormous)
  Quality (how is it measured and controlled?)
  Metadata (what should be included?)
  Costs (is it cheaper to re-compute?)
  Needs (longitudinal studies vs. one-off studies)
  Requirements for data assimilation
  Examples: weather prediction, climate simulations, earthquake simulations, earthquake prediction algorithms
  This calls for a broad discussion.
Minimalist Metadata (automatic capture)
  Code version
  HW platform (e.g. CPU, GPU, word length)
  SW platform (e.g. compiler, options)
  Input and runtime options (workflow?)
  Other (author, etc.; Dublin Core)
  Even then, a future rerun might not reproduce the output exactly.
  Many numerical outputs become obsolete.
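As an illustration of automatic capture, the fields listed on this slide could be recorded at launch time. The following is a minimal Python sketch using only the standard library; the helper name `capture_run_metadata` and all field names are assumptions made for illustration, not a schema used by the TeraShake or M8 workflows.

```python
import json
import platform
import struct
import sys
from datetime import datetime, timezone


def capture_run_metadata(code_version, run_options, author):
    """Collect a minimal provenance record for one simulation run.

    Field names are illustrative assumptions, not an established schema.
    """
    return {
        "code_version": code_version,  # e.g. a release tag or commit hash
        "hardware": {
            "machine": platform.machine(),      # CPU architecture string
            "processor": platform.processor(),  # may be empty on some systems
            "word_length_bits": struct.calcsize("P") * 8,  # pointer size
        },
        "software": {
            "os": platform.platform(),
            "python": platform.python_version(),
            # Compiler that built this Python interpreter; a compiled solver
            # would record its own compiler and flags instead.
            "compiler": platform.python_compiler(),
            "interpreter": sys.executable,
        },
        "run_options": dict(run_options),  # input and runtime options
        "creator": author,                 # Dublin Core style element
        "date": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = capture_run_metadata(
        code_version="v0.0-example",  # hypothetical value
        run_options={"dt_s": 0.011, "n_steps": 20000},
        author="example author",
    )
    print(json.dumps(record, indent=2))
```

Writing the record as JSON next to the output files keeps it readable without the simulation code. As the slide notes, such a record does not by itself guarantee that a rerun reproduces the output.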
Example: TeraShake Simulation (2004)
Example: M8 Simulation (2010)
TeraShake vs. M8 comparison
                      TeraShake           M8                                  Notes
  Dimensions          600x300x80 km       810x405x85 km
  # cells             2 x 10^9            436 x 10^9
  Time step           0.011 s             0.0023 s
  # steps (duration)  20,000 (180 s)      160,000 (368 s)
  # cores             240 (Datastar)      223,074 (CPU); 16,600 (GPU)
  Wall clock          5 days              24 hours (CPU) *; 5 hours (GPU) **
  Checkpoints         Every 1,000th step  Every 20,000th step
  Checkpoints, each   150 Gbytes          32 Tbytes                           Cannot transfer
  Checkpoints, total  3 Tbytes            192 Tbytes                          Every 4 hrs
  * 220 Tflop/s
  ** 2.3 Pflop/s
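The checkpoint totals in this comparison can be checked with simple arithmetic. The sketch below assumes decimal units (1 Tbyte = 1000 Gbytes) and assumes that the M8 checkpoint count follows the "Every 4 hrs" note applied to the 24-hour CPU run. Both are readings of the table, not statements from the authors. Counting by steps instead (160,000 / 20,000) would give a different number of checkpoints.

```python
GBYTES_PER_TBYTE = 1000  # assumption: decimal units


def checkpoint_total_gbytes(n_checkpoints, gbytes_each):
    """Total checkpoint volume in Gbytes."""
    return n_checkpoints * gbytes_each


# TeraShake: 20,000 steps, one checkpoint every 1,000th step, 150 Gbytes each.
terashake_count = 20_000 // 1_000
terashake_total_tb = (
    checkpoint_total_gbytes(terashake_count, 150) / GBYTES_PER_TBYTE
)

# M8 CPU run: 24 hours of wall clock, one checkpoint every 4 hours
# (assumed reading of the "Every 4 hrs" note), 32 Tbytes each.
m8_count = 24 // 4
m8_total_tb = (
    checkpoint_total_gbytes(m8_count, 32 * GBYTES_PER_TBYTE) / GBYTES_PER_TBYTE
)

# M8 simulated duration: number of steps times the time step, in seconds.
m8_duration_s = round(160_000 * 0.0023, 1)

if __name__ == "__main__":
    print("TeraShake:", terashake_count, "checkpoints,", terashake_total_tb, "Tbytes")
    print("M8:", m8_count, "checkpoints,", m8_total_tb, "Tbytes")
    print("M8 simulated duration:", m8_duration_s, "s")
```

Under these assumptions the results agree with the 3 Tbytes, 192 Tbytes, and 368 s entries in the table.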
TeraShake vs. M8 comparison
  Surface velocity vector field
    TeraShake: all nodes, every step: 1.1 TB
    M8: every other node, every 20th step: 4.4 TB (out of 352 TB)
  Total volume velocity field, all nodes, all steps
    TeraShake: 432 Tbytes
    M8: 384 Pbytes
  Volume velocity field, decimated
    TeraShake: all nodes, every 10th step: 45 Tbytes **
    M8: every other node, every 20th step: 4.8 Pbytes
    Notes: resolution OK for visualization
  Typical viz. movie
    TeraShake: < 100 Gbytes
    M8: < 100 Gbytes
    Notes: interactive viz. possible
  ** No longer usefully readable, because of tape read errors
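The reduction achieved by decimation can be read directly from the quoted volumes. The sketch below computes the ratio of full to saved volume for each pair of figures on this slide. The helper name `decimation_ratio` is hypothetical, and reading "every other node" as a stride of 2 in two grid directions is an assumption that happens to match the M8 figures, not something the slide states.

```python
from fractions import Fraction


def decimation_ratio(full_volume, saved_volume):
    """Exact ratio of full to saved data volume (same units for both)."""
    return Fraction(str(full_volume)) / Fraction(str(saved_volume))


# M8 surface field: 4.4 TB saved out of 352 TB.
m8_surface = decimation_ratio(352, 4.4)

# M8 volume field: 4.8 Pbytes saved out of 384 Pbytes.
m8_volume = decimation_ratio(384, 4.8)

# TeraShake volume field: 45 Tbytes saved out of 432 Tbytes.
terashake_volume = decimation_ratio(432, 45)

# Assumed interpretation: stride 2 in two grid directions, every 20th step.
assumed_m8_factor = 2 * 2 * 20

if __name__ == "__main__":
    print("M8 surface ratio:", m8_surface)
    print("M8 volume ratio:", m8_volume)
    print("TeraShake volume ratio:", float(terashake_volume))
    print("Assumed M8 factor:", assumed_m8_factor)
```

Both M8 ratios come out to 80, and the TeraShake ratio is 9.6, close to the factor of 10 expected from keeping every 10th step.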
So what to save?
  Possible strategy:
  Only save enough to allow interactive (user- or purpose-specific) visualization, and use checkpoints to restart a partial calculation. This works for one-off simulations (e.g. 1-day weather, single earthquake). AMLIGHT permits that.
  Save selected individual visualizations that characterize the run (small data sets). AMLIGHT makes it easy.
  For long-term longitudinal research, such as climate research or earthquake prediction algorithms, some output may require long-term curation by a trusted repository. This must be discussed on a case-by-case basis. AMLIGHT makes the data repository look proximal.
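The idea of keeping only sparse checkpoints and recomputing a window of output on demand can be sketched with a toy model. In the Python sketch below, a simple integer recurrence stands in for one solver time step and an in-memory dictionary stands in for checkpoint files. All function names are hypothetical, and nothing here reflects the actual TeraShake or M8 codes.

```python
def step(state):
    """Toy deterministic update standing in for one solver time step."""
    return (state * 1103515245 + 12345) % 2**31


def run(initial_state, n_steps, checkpoint_every):
    """Run the whole simulation, keeping only sparse checkpoints."""
    checkpoints = {0: initial_state}
    state = initial_state
    for i in range(1, n_steps + 1):
        state = step(state)
        if i % checkpoint_every == 0:
            checkpoints[i] = state
    return checkpoints


def recompute_window(checkpoints, first_step, last_step):
    """Regenerate states for first_step..last_step from the nearest
    checkpoint at or before first_step."""
    start = max(s for s in checkpoints if s <= first_step)
    state = checkpoints[start]
    window = {start: state} if start == first_step else {}
    for i in range(start + 1, last_step + 1):
        state = step(state)
        if i >= first_step:
            window[i] = state
    return window


if __name__ == "__main__":
    sparse = run(initial_state=42, n_steps=100, checkpoint_every=25)
    window = recompute_window(sparse, first_step=60, last_step=70)
    print(sorted(window))  # step numbers regenerated from the checkpoint at 50
```

The cost of regenerating a window is the number of steps from the chosen checkpoint to the end of the window, so the checkpoint interval trades storage against recomputation time. This only works if the solver is deterministic on the platform used for the restart.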
TeraShake Visualization
  Emmett McQuinn, Amit Chourasia
  http://visservices.sdsc.edu/projects/scec/vectorviz/glyphsea/movies/GlyphSea-720p-cbr6.mp4
M8 Visualization
  http://visservices.sdsc.edu/projects/scec/m8/1.0/movies/m8-2.0-vmag-MachCone-1600m-12020-65000-20stepintervalGlyphSea_1280.mov