Data Formats for Long-term Archiving of Climate Model Data at WDC Climate and DKRZ
Michael Lautenschlager and Jörg Wegner
WDC Climate / Max-Planck-Institute for Meteorology, Hamburg
MPG e-science Seminar, October 25th-26th, 2007, Berlin, Germany
DKRZ: Earth system model development; simulations of past, present and future climate
WDC Climate: long-term data archiving; inter-disciplinary data dissemination
Diagram of Climate System
Diagram of the Hamburg IPCC Climate Model ECHAM5/MPI-OM
Forcing of Climate Projections for IPCC AR4
Near-surface temperature change for the scenarios A1B and B1. Shown is the difference of the 30-year means, 2071-2100 minus 1961-1990.
Comparison of the present-day sea ice cover in March and September (top) with the climate projection for the scenario A1B in 2100 (bottom). Snow cover over land is also shown.
Spatial resolution of the North Atlantic sector in ECHAM5/MPI-OM
Data volumes in climate projections:
IPCC AR4: ECHAM5[T63L19]/MPI-OM produces 23 GB/year
Climate projection over 240 years (1860-2100): 5.5 TB and approx. 2 months of computer run time
Future: ECHAM5[T106L31] produces 44 GB/year
Climate projection over 240 years (1860-2100): 106 TB, i.e. the complexity is approx. 20 * T63
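The T63 figures can be cross-checked with a little arithmetic. The sketch below assumes a per-year rate of 23 GB (the reading that reproduces the stated total of about 5.5 TB) and decimal units (1 TB = 1000 GB); the function name is illustrative and not part of any DKRZ tool.

```python
def projection_volume_tb(gb_per_year: float, first_year: int, last_year: int) -> float:
    """Total output volume in TB for a run from first_year to last_year."""
    years = last_year - first_year  # 1860-2100 is counted as 240 years on the slide
    return gb_per_year * years / 1000.0  # assumption: 1 TB = 1000 GB

t63_total = projection_volume_tb(23.0, 1860, 2100)
print(f"T63L19 projection: {t63_total:.2f} TB")
```

The result rounds to the 5.5 TB quoted for the 240-year projection.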
Extrapolated HLRE2 linear archive increase (10 times HLRE)
Compute server architectures: C90: Cray C90 / HLRE: NEC SX-6 / MPP: SUN-Cluster / HLRE2: new system
(HLRE: Höchstleistungsrechnersystem für die Erdsystemforschung, i.e. high-performance computing system for Earth system research)
DKRZ System Diagram (diagram labels: GFS, local systems, DS, CS, NW, remote systems)
HLRE system architecture at DKRZ (diagram: 24 SX-6 compute nodes connected via IXS; GFS disk, 70 TB; HSM with cache, 17 TB; DBMS disk, 30 TB; Oracle9i; archive and backup servers; SUN application server; tape drive types 9840C, 9940B, T10000 and LTO2; components linked through the LAN)
Data classes
- Test: data from model code development; life cycle: weeks to months
- Project: data from scientific model evaluation and research projects (DKRZ resources at project level); life cycle: 3-5 years
- Final results: data products for international projects (IPCC) and scientific publications; life cycle: 10 years and longer
Data hierarchy levels
- Temp(orary): scratch discs at the compute server
- Work: fixed disc space at project level for evaluation
- Arch(ive): tape storage space (single copy) with expiration date, for project data beyond the available disc space
- Docu(mentation): documented long-term tape archive (security copy) for data products; focus on interdisciplinary data utilisation; data are fixed and no longer subject to change
Tape space distribution to archive classes at DKRZ, beginning of 2007:
- part of the work space is on tape because GFS is too small
- the docu domain consists of WDCC
- no expiration dates in the arch domain; parts of the arch domain belong to docu but are not yet documented
Data documentation requirements are met by using the WDCC infrastructure
- CERA-2 metadata model, developed in 1999
- Catalogue interface: cera.wdc-climate.de
- Input interface: input.wdc-climate.de
- CERA-2 metadata content is complete with respect to browsing, discovering and using climate data which are stored in the database system or outside it in flat files
- The WDCC matches international description standards such as ISO 19115, Dublin Core and GCMD, and is integrated in international data federations
- The data storage structure stores climate time series per variable in BLOB data tables. This allows web-based data catalogue search and data access in small data granules.
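The per-variable BLOB storage idea can be illustrated with a toy database. This is a minimal sketch only: it uses an in-memory SQLite database as a stand-in for the production system, and the table name `temp2`, the column names and the packing as big-endian 32-bit floats are all assumptions made for the example, not the actual WDCC schema.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")  # stand-in for the production database
# Hypothetical layout: one table per variable, one BLOB per stored time step.
conn.execute("CREATE TABLE temp2 (time_step INTEGER PRIMARY KEY, field BLOB)")

def pack_field(values):
    """Pack a flattened 2-D field as big-endian 32-bit floats."""
    return struct.pack(f">{len(values)}f", *values)

def unpack_field(blob):
    """Inverse of pack_field."""
    return list(struct.unpack(f">{len(blob) // 4}f", blob))

# Store three toy time steps of a 4-point field.
for step in range(3):
    field = [float(step + i) for i in range(4)]
    conn.execute("INSERT INTO temp2 VALUES (?, ?)", (step, pack_field(field)))

# Access one small granule: a single time step of a single variable.
(blob,) = conn.execute("SELECT field FROM temp2 WHERE time_step = 1").fetchone()
granule = unpack_field(blob)
print(granule)
```

Because each row is addressable on its own, a client can fetch one time step without reading the whole series, which is the point of the small data granules.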
CERA Data Model (diagram blocks: Reference, Contact, Coverage, Status, Entry, Parameter, Distribution, Data Org, Local Adm., Data Access, Spatial Reference)
Coloured columns correspond to BLOB data tables in WDCC. Collections of matrix rows represent storage in model raw data files (complete model output, stored time step by time step).
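The relation between the two layouts is a regrouping of the same matrix. The sketch below uses invented variable names and values: each row stands for one storage time step of the raw model output, and the function regroups the rows into one time series per variable, as in the column-wise tables.

```python
from collections import defaultdict

# Toy raw output: each row holds all variables for one storage time step.
raw_output = [
    {"time": 0, "temp": 280.1, "precip": 0.0, "wind": 5.2},
    {"time": 1, "temp": 280.4, "precip": 1.3, "wind": 4.8},
    {"time": 2, "temp": 279.9, "precip": 0.2, "wind": 6.1},
]

def to_variable_series(rows):
    """Regroup row-wise model output into one time series per variable."""
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["time"]):
        for name, value in row.items():
            if name != "time":
                series[name].append(value)
    return dict(series)

columns = to_variable_series(raw_output)
print(columns["temp"])
```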
Data infrastructure integrates data stewardship in the long-term archive
- Bit-stream preservation
- Quality assurance
- Usability enabling
Long-term archive data stewardship
Bit-stream preservation
- Secondary tape copies on different tapes and technology at a separate location
- Copy to new tapes after the maximum number of tape accesses is reached (refreshment)
Quality assurance
- Semantic examinations: behaviour of a numerical model compared to observations and to other models; part of the scientific evaluation process
- Syntactic examinations: formal aspects of data archiving and assurance that data archiving is free of errors as far as possible
  - Consistency between metadata and climate data
  - Completeness of climate data
  - Standard range of values
  - Spatial and temporal data arrangement
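Three of the syntactic examinations listed above lend themselves to simple automated checks. The following is a minimal sketch under stated assumptions: the function names, the metadata key `n_values` and the temperature limits are hypothetical and chosen for illustration, not taken from the WDCC quality assurance procedure.

```python
def check_completeness(time_steps, expected_count):
    """Return the time-step indices missing from 0..expected_count-1."""
    return sorted(set(range(expected_count)) - set(time_steps))

def check_value_range(values, lower, upper):
    """Return indices of values outside the standard range [lower, upper]."""
    return [i for i, v in enumerate(values) if not lower <= v <= upper]

def check_metadata_consistency(metadata, values):
    """Compare the documented number of values with the number actually stored."""
    return metadata.get("n_values") == len(values)

# Toy near-surface temperature series in kelvin; the limits are illustrative only.
temps = [281.3, 279.8, 1.0e20, 283.1]
missing = check_completeness([0, 1, 3], expected_count=4)
out_of_range = check_value_range(temps, lower=150.0, upper=350.0)
consistent = check_metadata_consistency({"n_values": 4}, temps)
print(missing, out_of_range, consistent)
```

Each check returns the offending positions rather than a bare pass/fail flag, so a failed examination points directly at the records to inspect.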
Long-term archive data stewardship (continued)
Usability enabling
- Complete and searchable documentation of climate data entities (database tables and flat files) in the catalogue system of the WDCC
- WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables)
- Archive technology transfer must be backward compatible to keep old data technically readable
- Data processing tools and data format access libraries must be migrated to new architectures
Standard Data Formats (SDFs) at WDC Climate
Requirements
- Self-descriptive (use metadata)
- Machine independent
- Should contain compression or packing
Benefits
- SDFs support long-term data preservation
- SDFs support data exchange and dissemination
- SDFs allow for application of standardized data processing tools and packages
Data Formats at WDCC
Climate model output: GRIB 1, GRIB 2, NetCDF 3.x, NetCDF 4.x
Tools:
- convert: cdo, cdat, xconv, IDV
- manipulate: cdo, cdat, nco, ncl
- visualize: cdat, grads, ncview, GMT
GRIB 1: GRIdded Binary
Message structure:
- 'GRIB'
- Section 0 - length of message, edition number
- Section 1 - product description section
- Section 2 - grid description section (repeated)
- Section 3 - bit map section
- Section 4 - binary data section
- '7777'

Example record listings:

    ds8 55 % grib -ginfo zzz.grb
    Rec : Position   Size : V PDS GDS BMS   BDS : Code Level : LType GType
      1 :        0  36948 : 1  28  32   0 36876 :  133 20000 :   100     4
      2 :    36948  36948 : 1  28  32   0 36876 :  133 20000 :   100     4
      3 :    73896  36948 : 1  28  32   0 36876 :  133 20000 :   100     4
      4 :   110844  36948 : 1  28  32   0 36876 :  133 20000 :   100     4
      5 :   147792  36948 : 1  28  32   0 36876 :  133 20000 :   100     4

    ds8 56 % grib -gdsinfo zzz.grb
    Rec : GDS NV PVPL Typ : xsize ysize  Lat1 Lon1   Lat2   Lon2   dx dy
      1 :  32  0  255   4 :   192    96 88572    0 -88572 358125 1875 48

Properties:
- compressed data -> small file size
- every 2d field (record) is a GRIB file -> UNIX commands for catenating
- library support for FORTRAN & C
- strong restrictions for header information
- header information coded (numbers)
- need of tables for decoding
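Because every record is a complete message framed by 'GRIB' and '7777', a file can be walked record by record using only the message length from Section 0. The sketch below is a toy illustration, not a GRIB decoder: it relies on the GRIB 1 indicator layout (4 bytes 'GRIB', 3-byte big-endian message length, 1-byte edition number), and the payload is dummy bytes standing in for Sections 1 to 4. The function names are hypothetical.

```python
def build_toy_grib1(payload: bytes) -> bytes:
    """Wrap dummy payload bytes in a GRIB1-style envelope."""
    total = 8 + len(payload) + 4  # indicator section + payload + end marker
    return b"GRIB" + total.to_bytes(3, "big") + bytes([1]) + payload + b"7777"

def scan_grib1(stream: bytes):
    """Yield (position, size, edition) for each message in a byte stream."""
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + 4] != b"GRIB":
            raise ValueError(f"no GRIB indicator at byte {pos}")
        size = int.from_bytes(stream[pos + 4:pos + 7], "big")
        edition = stream[pos + 7]
        if stream[pos + size - 4:pos + size] != b"7777":
            raise ValueError(f"missing end marker for message at byte {pos}")
        yield pos, size, edition
        pos += size

# Catenating records is plain byte concatenation, as with the UNIX cat command.
stream = build_toy_grib1(b"\x00" * 20) + build_toy_grib1(b"\x00" * 20)
records = list(scan_grib1(stream))
print(records)
```

The position and size columns this produces play the same role as the Position and Size columns in the record listing: each record starts where the previous one ends.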
GRIB 2: General Regularly-distributed Information in Binary form
- Section 0 - 'GRIB' indicator section
- Section 1 - identification section*
- Section 2 - local use section (optional)
- Section 3 - grid definition section* (repeated)
- Section 4 - product definition section* (repeated)
- Section 5 - data representation section (repeated)
- Section 6 - bit map section
- Section 7 - data section
- Section 8 - end section '7777'
* Sections 1, 3 and 4 represent the GRIB1 product description section. This split, combined with the option of iterating sections and the concept of templates, makes the main difference to GRIB1 and keeps GRIB2 very flexible.
Concept of templates: you can define your own templates for grid definition, product definition and data representation.
GRIB 2 example
A 500 hPa height field forecast on a Northern Hemisphere polar stereographic grid, produced by a particular numerical model at forecast hours 12, 24, 36 and 48. These four fields could be represented by a single GRIB2 message by repeating the sequence of Sections 4 to 7 four times, making the appropriate forecast time changes in the Product Definition Section in each iteration of the sequence.
- Section 0: Indicator Section
- Section 1: Identification Section
- Section 2: Local Use Section (optional)
- Section 3: Grid Definition Section
- Repetition 1
  - Section 4: Product Definition Section (hour = 12)
  - Section 5: Data Representation Section
  - Section 6: Bit-Map Section
  - Section 7: Data Section
- Repetition 2
  - Section 4: Product Definition Section (hour = 24)
  - Section 5: Data Representation Section
  - Section 6: Bit-Map Section
  - Section 7: Data Section
- Repetition 3
  - Section 4: Product Definition Section (hour = 36)
  - Section 5: Data Representation Section
  - Section 6: Bit-Map Section
  - Section 7: Data Section
- Repetition 4
  - Section 4: Product Definition Section (hour = 48)
  - Section 5: Data Representation Section
  - Section 6: Bit-Map Section
  - Section 7: Data Section
- Section 8: End Section
Note that since the Grid Definition Section is not repeated, it remains in effect for all four forecast hours.
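The repetition rule in this example can be written down as a small generator of the section sequence. This is a sketch of the layout only (no encoding of actual GRIB2 bytes); the function and variable names are illustrative.

```python
SECTION_NAMES = {
    0: "Indicator Section",
    1: "Identification Section",
    2: "Local Use Section (optional)",
    3: "Grid Definition Section",
    4: "Product Definition Section",
    5: "Data Representation Section",
    6: "Bit-Map Section",
    7: "Data Section",
    8: "End Section",
}

def grib2_layout(forecast_hours):
    """List the section sequence of one message holding several fields on one grid."""
    layout = [(n, SECTION_NAMES[n]) for n in (0, 1, 2, 3)]
    for hour in forecast_hours:
        # Sections 4 to 7 repeat once per field; only the forecast time changes.
        layout.append((4, f"{SECTION_NAMES[4]} (hour = {hour})"))
        layout.extend((n, SECTION_NAMES[n]) for n in (5, 6, 7))
    layout.append((8, SECTION_NAMES[8]))
    return layout

layout = grib2_layout([12, 24, 36, 48])
for number, name in layout:
    print(f"Section {number}: {name}")
```

The Grid Definition Section appears once in the output, while Sections 4 to 7 appear four times, matching the listing above.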
NetCDF 3.x: NETwork Common Data Form
File structure:
- dimensions (1 unlimited possible)
- variables & attributes
- global attributes
- data

    netcdf simple_xy {
    dimensions:
        x = 6 ;
        y = 12 ;
    variables:
        int data(x, y) ;

    // global attributes:
        :CDO = "Climate Data Operators version 0.9.5 " ;
        :source = "ECHAM5.2" ;
    data:

     data =
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
      12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
      24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
      36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
      48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
      60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71 ;
    }

Properties:
- no compression for data -> file size bigger than GRIB1
- data stored n-dimensional
- library support for FORTRAN & C
- file sizes of 2 GByte and more with NetCDF 3.6
- no restrictions for header information
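The data block of the `simple_xy` example lists the 72 values with the last dimension (y) varying fastest. The sketch below shows that index mapping in plain Python and renders a minimal CDL text for the same variable; it does not use the netCDF library, and the helper names `value_at` and `to_cdl` are made up for this illustration.

```python
NX, NY = 6, 12  # dimension sizes from the simple_xy example

# Values as listed in the data block: the last dimension (y) varies fastest.
flat = list(range(NX * NY))

def value_at(i, j):
    """Return data(i, j) from the flat list using row-major indexing."""
    return flat[i * NY + j]

def to_cdl(name, nx, ny, values):
    """Render a minimal CDL text for one 2-D integer variable."""
    body = ", ".join(str(v) for v in values)
    return (
        f"netcdf {name} {{\n"
        f"dimensions:\n\tx = {nx} ;\n\ty = {ny} ;\n"
        f"variables:\n\tint data(x, y) ;\n"
        f"data:\n\n data = {body} ;\n}}\n"
    )

cdl = to_cdl("simple_xy", NX, NY, flat)
print(value_at(1, 0), value_at(5, 11))
```

Text of this form is what the `ncgen` utility takes as input to produce a binary NetCDF file.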
NetCDF 4.x, HDF5: the NetCDF-4/HDF5 format
With version 4.0 of netCDF, another new data format was introduced: the netcdf-4/hdf5 format. This format is HDF5, with full use of the new dimension scales, creation ordering, and other features of HDF5 added in its version 1.8.0 release.
- Multiple unlimited dimensions
- Groups to organize data
- New types, including compound types and variable length arrays
- Parallel I/O

    netcdf4 "example" {
      group "/" {
        group "group1" {
          dataset "set1" { dimension variables data }
          dataset "set2" { dimension variables data }
        }
        group "group2" {... }
      }
    }

(diagram annotation: netcdf3.x file)
Tools
- nco: for NetCDF, http://nco.sourceforge.net/
- ncl: for NetCDF3, NetCDF4, GRIB1, GRIB2, HDF4, http://www.ncl.ucar.edu/
- ncview: for NetCDF, http://meteora.ucsd.edu/~pierce/ncview_home_page.html
- cdo: for GRIB1, NetCDF, ieg, EXTRA, Service, http://www.mpimet.mpg.de/fileadmin/software/cdo/
- cdat: for GRIB1 (with GrADS control file), NetCDF, HDF, http://www-pcmdi.llnl.gov/software/
- xconv: for NetCDF, GRIB, http://badc.nerc.ac.uk/help/software/xconv/
- IDV: for GRIB, NetCDF, http://www.unidata.ucar.edu/software/
- THG HDF5 tools: for HDF, http://www.hdfgroup.org/products/hdf5_tools/
- GrADS: for GRIB1 (with control file), netcdf3.x, http://grads.iges.org/grads/grads.html