Parallel storage, mining and visualization of environmental data archives
Mikhail Zhizhin, Dmitry Medvedev, Alexey Poyda, Dmitry Mishin and Sergei Berezin
Space Research Institute and Geophysical Center, Russian Academy of Sciences
Collaboration with Microsoft Research
2006-2009: Environmental Scenario Search Engine (ESSE)
Site: Geophysical Center, Moscow State University and MSR Cambridge
PIs: Mikhail Zhizhin (RAS), Eric Kihn (NOAA) and Vassily Lyutsarev (MSRC)
PhDs: Alexey Poyda (Moscow State University), Dmitry Mishin and Dmitry Medvedev (RAS)
Summary: ESSE is an interactive search engine for data mining in environmental data archives. Unlike conventional text-based search engines, it searches inside the numeric datasets, using fuzzy logic to describe transitions between environmental states.
2007-2009: Climate Induced Vegetation Change Analysis Tool (CLIVT)
Site: Space Research Institute and MSR
PIs: Eugeny Lupian and Mikhail Zhizhin (RAS)
PhDs: Maria Medvedeva, Alexey Poyda, Dmitry Medvedev, Dmitry Voytsehovsky
Summary: In the CLIVT project we bring together large archives of satellite images and historical data on vegetation and climate for the territory of Northern Eurasia, and develop a new technique to study the relations between ecosystem dynamics and climate change.
Joint Research Center IKI-MSR in Moscow
Framework Agreement signed at Moscow State University on March 17th, 2009
Administrative structure and research project agreements will be elaborated in June 2009
Main directions of research: The Parties wish to collaborate on research concerning global change of climate, ecology and space environment in their interrelation, which will require satellite and ground-based sensor observations together with data-intensive high-performance computing for environmental monitoring, modeling and data mining
What are the challenges?
- Repetitive tasks to design (very large) databases for new data products
- Interactive access times for any projection of the data array
- Never delete/overwrite data; lineage, accreditation, quality and type flags
- Multilayer (catalog, inventory, order, process) distributed metadata storage [STANDARD]
- Semantically rich common data model [STANDARD] and query language [STANDARD] for (environmental) scientific datasets
- Functionally rich data services supporting data extraction, processing and mining implemented at the data server [STANDARD]
- Distributed algorithms to balance network/database load
- Data export, modeling, visualization and ingest workflow; reference web services for basic datasets and models
- Clever and seamless integration of MS Virtual Earth, Google Maps, WMS and scientific visualization libraries
- Parallel visualization algorithms (GIS?), applications and viewers for very large images, maps and video streams on tiled displays
Data processing, analysis and visualization workflow Virtual Observatory XML metadata and portal REST and SOAP templates OGSA-DAI Grid data services ActiveStorage NetCDF and NcML NetCDF API Metadata WMS,WCS Virtual Earth KML and tile servers MM5 and WRF mesoscale weather models Matlab
Virtual Observatory XML metadata search engine
- Open Source middleware VxOware
- Tiers: 1) Web application; 2) REST services; 3) native XML database backend + native object stores with indexing (documents, images)
- XML: multiple catalog-level metadata schemas, e.g. FGDC, ECHO, SPASE, NGDC Ordering Extensions
- Distributed metadata search over VO federation using REST services
Virtual Observatory for Metadata: A Complete Data Environment is More than Just the Bits
[Diagram labels]
- Virtual Observatory: Web application or portal; Web service API for Data Sources
- Metadata store: FGDC records (FGDC catalog XML) with Ordering Extensions (OE) XML; Wiki Documents (User Guide); Presentations (Slideshow)
- SEARCH in metadata; search result: ResourceID_1, ResourceID_2, ResourceID_3
- Data Request: REST or SOAP API, OGSA-DAI client toolkit, OGSA-DAI Resource and Activities
- Plugins: CLASS, SPIDR, ActiveStorage
- Services: Visualization service, Inventory service, Order service; CLASS products
Ordering Extensions XML schema: station map XML element Data order web form XSLT
Why OGSA-DAI service container?
- Standard tool in the Grid community
- Supports distributed workflow (in version 3.*)
- Built-in support for asynchronous transactions
- Compatible with Web (Axis) and Grid (OMII, UNICORE, GT4)
- We looked at alternatives such as OPeNDAP and WCS; documentation of our analysis is available
- Problem 1: it is very complex. Solution: REST wrapper
- Problem 2: it supports only File, SQL and XML data types and queries. Solution: implement additional data sources and functions for data in multidimensional arrays
ESSE / OGSA-DAI extensions Provide catalog and inventory level metadata about a data source Support multidimensional array data model (in addition to SQL/XML/BLOB) Handle SOAP and REST requests for data export Have local data processing and fuzzy logic data mining functions Provide persistent storage for the data processing and environmental models output (as a new dataset) Can be chained into asynchronous distributed data processing workflow
OGSA-DAI Data Order Flow
[Diagram labels]
- Step 1: OE Web Form (Servlet, XSLT), get Data Types
- Step 2: Client sends the Process Document via SOAP
- Step 3: XML Result or Error Message
- Data Server: OGSA-DAI Adapter, Storage; operations Get Data, Process, Mine; SQL, XML, Granule, Time Series
- Data types: Time-series (Sunspot number); Grids (NCEP Reanalysis); Stations (Ionospheric Soundings); Swath (AVHRR); Profiles (Ocean Profile); Maps (Nighttime lights); More?
Environmental Scenario Search Engine
Time series as a trajectory in the two-dimensional phase space ($P$: pressure, $T$: temperature)
State $S_1$, corresponding to the red (upper-right) region, is the fuzzy expression: $S_1 = (\text{Very Large } P) \text{ and } (\text{Very Large } T)$
State $S_2$, corresponding to the cyan (lower-left) region, is: $S_2 = (\text{Very Small } P) \text{ and } (\text{Very Small } T)$
Combining the descriptions of the states with the time shift operator $\mathrm{shift}_{dt}$, we can write the following symbolic expression for the Environmental Scenario "very low temperature and pressure after very high temperature and pressure": $(\mathrm{shift}_{dt=1}\, S_1) \text{ and } S_2$
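The scenario expression built from S1, S2 and the shift operator can be illustrated with a minimal Python sketch. Everything beyond the expression itself is an assumption made for illustration: the membership functions (a linear ramp squared as the "very" hedge), the use of min as the fuzzy "and", and the normalisation of each series by its own minimum and maximum. The slides do not specify how ESSE implements these.

```python
from typing import List


def _unit(x: float, lo: float, hi: float) -> float:
    """Linearly map x from [lo, hi] onto [0, 1], clamping outside values."""
    if hi <= lo:  # degenerate range: no information, return the midpoint
        return 0.5
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def very_large(x: float, lo: float, hi: float) -> float:
    # Assumed membership: squared ramp, so only values near hi score high.
    return _unit(x, lo, hi) ** 2


def very_small(x: float, lo: float, hi: float) -> float:
    # Assumed membership: mirror image of very_large.
    return (1.0 - _unit(x, lo, hi)) ** 2


def fuzzy_and(a: float, b: float) -> float:
    # Minimum t-norm as the fuzzy "and" (one common choice, assumed here).
    return min(a, b)


def scenario_scores(pressure: List[float], temperature: List[float],
                    dt: int = 1) -> List[float]:
    """Score (shift_dt S1) and S2 at every time step.

    scores[i] is high when S1 held at step i - dt and S2 holds at step i.
    """
    p_lo, p_hi = min(pressure), max(pressure)
    t_lo, t_hi = min(temperature), max(temperature)
    s1 = [fuzzy_and(very_large(p, p_lo, p_hi), very_large(t, t_lo, t_hi))
          for p, t in zip(pressure, temperature)]
    s2 = [fuzzy_and(very_small(p, p_lo, p_hi), very_small(t, t_lo, t_hi))
          for p, t in zip(pressure, temperature)]
    scores = [0.0] * len(s1)
    for i in range(dt, len(s1)):
        scores[i] = fuzzy_and(s1[i - dt], s2[i])
    return scores


# Synthetic example: a pressure/temperature peak at step 1, a trough at step 2.
pressure = [1000.0, 1030.0, 980.0, 1005.0]
temperature = [10.0, 30.0, -5.0, 12.0]
scores = scenario_scores(pressure, temperature, dt=1)
best_step = scores.index(max(scores))
```

Ranking the time steps by score gives the candidate events; in this synthetic series the best match is the step immediately after the peak.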
Web editor for a multi-state environmental scenario Search results
Parallel Active Data Storage
- Open Source software developed in collaboration with MSR Cambridge
- Data provided by NCAR and NGDC NOAA
- Common Data Model and API compatible with Unidata CDM for NetCDF/HDF
- Scalable parallel storage and processing engine based on MS SQL Server
- Capable of storing terabytes of gridded output of numerical weather models and raw meteorological station reports
- Special client library with API and an OGSA-DAI plugin. OGSA-DAI receives a CDM object from ActiveStorage and transforms it into different ES formats such as NcML, NetCDF, HDF...
Common Data Model (CDM)
[Class diagram]
- Dataset: name
- Group: name
- Attribute: name, value, datatype
- Dimension: name, length
- Variable: name, shape, datatype
- DataType: char, byte, short, int, long, float, double, String
Common Data Model (CDM) is an ES standard used in OPeNDAP, NetCDF-4 and HDF5 as a general representation of multivariate numeric arrays. Sub-models such as grids (geophysical fields), points (observatories) and trajectories (ships, airplanes, satellites) are supported
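The CDM classes and fields listed on this slide can be written down as plain Python dataclasses. This is only an illustrative sketch: the containment relations (a Group holding dimensions, variables, attributes and nested groups), the dimension order and the dimension lengths in the example are assumptions, and this is not the ActiveStorage or Unidata API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

# Primitive data types named on the slide.
DATA_TYPES = ("char", "byte", "short", "int", "long", "float", "double", "String")


@dataclass
class Attribute:
    name: str
    value: Any
    datatype: str


@dataclass
class Dimension:
    name: str
    length: int


@dataclass
class Variable:
    name: str
    datatype: str
    dimensions: List[Dimension]
    attributes: List[Attribute] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.datatype not in DATA_TYPES:
            raise ValueError(f"unknown CDM data type: {self.datatype}")

    @property
    def shape(self) -> Tuple[int, ...]:
        # The shape follows from the lengths of the variable's dimensions.
        return tuple(d.length for d in self.dimensions)


@dataclass
class Group:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)  # assumed nesting


@dataclass
class Dataset:
    name: str
    root: Group


# Hypothetical example: a 4-D 'air' variable on a 73 x 144 grid.
# The dimension names, order and the time/level lengths are illustrative.
time = Dimension("time", 4)
level = Dimension("level", 1)
lat = Dimension("lat", 73)
lon = Dimension("lon", 144)
air = Variable("air", "short", [time, level, lat, lon],
               [Attribute("units", "degK", "String")])
dataset = Dataset("example", Group("root", [time, level, lat, lon], [air]))
```

Deriving `shape` from the dimensions, rather than storing it separately, keeps the two from drifting apart.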
Database schema to map CDM into ActiveStorage
Data retrieval scheme: single MS SQL Server
1. Call the client library with the array coordinates (x_min, y_min) and (x_max, y_max) as call parameters
2. The client library issues commands to the SQL Server database
3. The database selects the requested data parts from the appropriate chunks
4. The database returns the data parts to the client library
5. The client library merges the data parts and returns the whole array to the user
The database engine performs only the basic array selection and subsetting. The client library does all the rest (merging chunks, type conversion, etc.)
Two versions of the client library: .NET and Java
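The division of labour in this retrieval scheme, with the database subsetting individual chunks and the client merging the parts, can be sketched in Python for a 2-D array. The in-memory dictionary stands in for the SQL Server chunk table, and the function names, the chunk size and the row/column argument order are hypothetical; the actual ActiveStorage schema and client API are not shown on the slides.

```python
from typing import Dict, List, Tuple

Chunks = Dict[Tuple[int, int], List[List[int]]]


def make_chunks(array: List[List[int]], chunk: int) -> Chunks:
    """Split a 2-D array into chunk x chunk tiles keyed by (tile_row, tile_col)."""
    chunks: Chunks = {}
    for r0 in range(0, len(array), chunk):
        for c0 in range(0, len(array[0]), chunk):
            chunks[(r0 // chunk, c0 // chunk)] = [
                row[c0:c0 + chunk] for row in array[r0:r0 + chunk]
            ]
    return chunks


def select_part(chunks: Chunks, key: Tuple[int, int],
                r_lo: int, r_hi: int, c_lo: int, c_hi: int) -> List[List[int]]:
    """'Server' side: basic subsetting of one chunk (local, half-open bounds)."""
    return [row[c_lo:c_hi] for row in chunks[key][r_lo:r_hi]]


def get_array(chunks: Chunks, chunk: int,
              y_min: int, x_min: int, y_max: int, x_max: int) -> List[List[int]]:
    """'Client' side: request parts chunk by chunk and merge them (inclusive bounds)."""
    result = [[0] * (x_max - x_min + 1) for _ in range(y_max - y_min + 1)]
    for cr in range(y_min // chunk, y_max // chunk + 1):
        for cc in range(x_min // chunk, x_max // chunk + 1):
            # Global extent of the overlap between the request and this chunk.
            r_lo, r_hi = max(y_min, cr * chunk), min(y_max + 1, (cr + 1) * chunk)
            c_lo, c_hi = max(x_min, cc * chunk), min(x_max + 1, (cc + 1) * chunk)
            part = select_part(chunks, (cr, cc),
                               r_lo - cr * chunk, r_hi - cr * chunk,
                               c_lo - cc * chunk, c_hi - cc * chunk)
            # Copy the part into its place in the output array.
            for i, row in enumerate(part):
                result[r_lo - y_min + i][c_lo - x_min:c_hi - x_min] = row
    return result


# Synthetic 10 x 10 array whose values encode their own position.
array = [[r * 100 + c for c in range(10)] for r in range(10)]
chunks = make_chunks(array, 4)
window = get_array(chunks, 4, 2, 3, 6, 8)  # rows 2..6, columns 3..8
```

Only the chunks that overlap the requested window are touched, which is what keeps access times independent of the total array size.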
Distributed queries: MS SQL database cluster
[Diagram: one client library connected to several SQL Server databases]
Portions of the global array can be stored on several database servers to increase performance
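Storing portions of the global array on several servers can be sketched as follows in Python. The round-robin placement policy, the `Server` class and the thread-based fan-out are assumptions chosen for illustration; the slides do not say how ActiveStorage assigns chunks to servers or schedules the queries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Hashable, Iterable, List


def partition(chunk_keys: Iterable[Hashable], n_servers: int) -> Dict[Hashable, int]:
    """Assign chunk keys to servers round-robin (assumed placement policy)."""
    return {key: i % n_servers for i, key in enumerate(sorted(chunk_keys))}


class Server:
    """Stand-in for one database server holding a subset of the chunks."""

    def __init__(self) -> None:
        self.chunks: Dict[Hashable, Any] = {}

    def fetch(self, keys: List[Hashable]) -> Dict[Hashable, Any]:
        return {k: self.chunks[k] for k in keys}


def fetch_distributed(servers: List[Server], placement: Dict[Hashable, int],
                      keys: Iterable[Hashable]) -> Dict[Hashable, Any]:
    """Group the requested keys by server, query the servers in parallel, merge."""
    by_server: Dict[int, List[Hashable]] = {}
    for key in keys:
        by_server.setdefault(placement[key], []).append(key)
    merged: Dict[Hashable, Any] = {}
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(servers[s].fetch, ks) for s, ks in by_server.items()]
        for future in futures:
            merged.update(future.result())
    return merged


# Eight chunks spread over three hypothetical servers.
keys = [(i, j) for i in range(2) for j in range(4)]
placement = partition(keys, 3)
servers = [Server() for _ in range(3)]
for key in keys:
    servers[placement[key]].chunks[key] = "data-%d-%d" % key
parts = fetch_distributed(servers, placement, [(0, 0), (1, 3), (0, 2)])
```

Because each server is sent one request covering all of its chunks, the number of round trips grows with the number of servers rather than the number of chunks.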
NetCDF API for ActiveStorage in MATLAB
% Import the ActiveStorage NetCDF connector and the MS SQL Server JDBC driver
import ru.wdcb.mdb.NcConnector
import com.microsoft.sqlserver.jdbc.SQLServerDriver
% JDBC connection string for the ncep_01 database
s = 'jdbc:sqlserver://localhost:1433;databaseName=ncep_01;user=guest;password=guest';
connector = NcConnector();
ncid = connector.nc_open(s, 0);
varid = connector.nc_inq_varid(ncid, 'air');
% Time series: 80000 values along the first dimension at one grid point
origin = [0 0 10 10]; count = [80000 1 1 1]; stride = [1 1 1 1];
A = connector.nc_get_vars_short(ncid, varid, origin, count, stride);
plot(A, 'DisplayName', 'A', 'YDataSource', 'A'); figure
% One 73 x 144 grid at the first index of the first two dimensions
origin = [0 0 0 0]; count = [1 1 73 144]; stride = [1 1 1 1];
B = connector.nc_get_vars_shortm(ncid, varid, origin, count, stride);
B = reshape(B, [73 144]);
imagesc(B); figure(gcf);
NCEP/NCAR Weather Reanalysis
- Continually updating gridded data set
- Global Circulation Model output
- 74 weather parameters
- 5000 NetCDF files, 30-500 MB each
- Time coverage: 1948-2008, 6-hourly values (4 times daily)
- Grids: regular grid, 2.5 x 2.5 degrees; T62 Gaussian grid, 192 x 94 points
NCDC Meteorological Observations Records
- Sources: fixed ground stations, ships, mobile stations, buoys
- Time coverage: 1901-2008
- 30 million sensors
- 470 000 ASCII files packed with gzip
- 50 GB packed; 400 GB unpacked
- 1.7 billion observations
[Figure: map of the meteorological stations in the database]
Integration of remote sensing and climate data in CLIVT
- Regular cell-grid for data integration: NDVI averaging for a 2.5 x 2.5 cell-grid by land cover types (land cover map GLC2000)
- Multi-annual NDVI time-series by land cover types, e.g. NDVI for Evergreen Needleleaf Forest
- Multi-annual time-series of meteorological data (air temperature)
- Integrated analysis of the NDVI and meteorological time-series
[Figure panels: NDVI and air temperature maps for the 1st decade of June 1999, the 2nd decade of June 1999 and the 3rd decade of June 2007; plot with NDVI axis from 0 to 1 and a second axis from -25 to 25]
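The per-cell NDVI averaging by land cover type can be sketched in a few lines of Python. The input layout (one tuple of latitude, longitude, land cover class and NDVI per pixel), the class names and the cell-indexing convention are assumptions for illustration and are not taken from the CLIVT toolbox.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CELL_DEG = 2.5  # cell size of the regular integration grid, in degrees

Pixel = Tuple[float, float, str, float]  # (lat, lon, land_cover, ndvi)
Key = Tuple[Tuple[int, int], str]        # ((cell_row, cell_col), land_cover)


def cell_index(lat: float, lon: float) -> Tuple[int, int]:
    """Index of the 2.5 x 2.5 degree cell containing the point (assumed convention)."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def average_ndvi(pixels: Iterable[Pixel]) -> Dict[Key, float]:
    """Mean NDVI per (grid cell, land cover type)."""
    sums: Dict[Key, list] = defaultdict(lambda: [0.0, 0])
    for lat, lon, cover, ndvi in pixels:
        acc = sums[(cell_index(lat, lon), cover)]
        acc[0] += ndvi  # running sum
        acc[1] += 1     # pixel count
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical pixels; the land cover class names are illustrative.
pixels = [
    (55.1, 37.2, "evergreen_needleleaf_forest", 0.6),
    (56.0, 37.0, "evergreen_needleleaf_forest", 0.8),
    (55.5, 36.0, "cropland", 0.4),
    (60.0, 37.2, "evergreen_needleleaf_forest", 0.5),
]
means = average_ndvi(pixels)
```

Repeating this for every ten-day composite yields one NDVI time series per cell and land cover type, which can then be compared with meteorological series on the same 2.5 degree grid.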
Web technologies for visualization of different data types with geolocation
- KML and GeoRSS
- Web services for CDM data sources
- OGC Web Map Services WMS/WFS/WCS
- MS Virtual Earth
- Google Maps
VisualESSE plugin for NASA World Wind desktop client CodePlex Open Source project http://www.codeplex.com/visualesse
MS Virtual Earth, OGC Web Map Service and NcML grid overlays OGC WMS web map image with transparency control Stable world nighttime lights by NGDC NOAA NcML grid extracted from ActiveStorage Current surface temperature by NWS NOAA
Reanalysis and forecast weather data fusion Related to a selected pushpin 50 years of weather history from NCEP/NCAR Reanalysis database 1 week weather forecast from NWS database
Fuzzy search and Virtual Earth mapping of environmental events Search for events at given locations Select a set of fuzzy scenarios from the VO library and a time interval (history and forecast) XSL transform of the search engine XML output into KML Map the KML: any location, any events, any time window
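The conversion of search results into KML can be sketched in Python. Note two substitutions: the slide describes an XSL transform, whereas this sketch builds the KML with the standard `xml.etree.ElementTree` module, and the input format (an `events` element with `event` children carrying `name`, `lat`, `lon` and `time` attributes) is hypothetical, since the search engine's real XML output is not shown.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _q(tag: str) -> str:
    """Qualify a tag name with the KML namespace."""
    return "{%s}%s" % (KML_NS, tag)


def events_to_kml(xml_text: str) -> str:
    """Turn a (hypothetical) search result document into a KML string."""
    ET.register_namespace("", KML_NS)  # serialise KML as the default namespace
    kml = ET.Element(_q("kml"))
    document = ET.SubElement(kml, _q("Document"))
    for event in ET.fromstring(xml_text).iter("event"):
        placemark = ET.SubElement(document, _q("Placemark"))
        ET.SubElement(placemark, _q("name")).text = event.get("name")
        timestamp = ET.SubElement(placemark, _q("TimeStamp"))
        ET.SubElement(timestamp, _q("when")).text = event.get("time")
        point = ET.SubElement(placemark, _q("Point"))
        # KML coordinates are written as longitude,latitude.
        ET.SubElement(point, _q("coordinates")).text = "%s,%s" % (
            event.get("lon"), event.get("lat"))
    return ET.tostring(kml, encoding="unicode")


# Hypothetical search result with two events.
sample = (
    '<events>'
    '<event name="cold snap" lat="55.75" lon="37.6" time="1999-06-01T00:00:00Z"/>'
    '<event name="heat wave" lat="59.9" lon="30.3" time="2007-06-21T00:00:00Z"/>'
    '</events>'
)
kml_text = events_to_kml(sample)
```

Each event becomes a time-stamped placemark, so a KML-aware map client can filter the events by time window as well as by location.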
UIC SAGE 3.0 ported on MS Windows
- Fully functional, not only local display
- PsTools utilities instead of rsh
- Uses Windows built-in security
- Existing applications: JuxtaView, bitplayer, mplayer
- Library for .NET interoperability
- WorldWind for SAGE
http://www.codeplex.com/winsage
MultiViewer application
[Diagram: a UI Controller and six rendering clients, numbered 1 to 6, driving the display tiles]
- Each node performs data fetching, processing and rendering
- Better utilization of videocluster resources
Transparent Data Cube
All HPC components from the CLIVT toolbox can run on the same parallel cluster. At the IKI Computing Center in Moscow we use a 12-node cluster with WCC MPI for MM5, MS SQL Server databases for ActiveStorage and a 12-display videowall for MultiViewer. We call this parallel installation for storage, modeling and visualization the Transparent Data Cube.
Directions for further research Continue analysis of climate-biosphere interactions Sun-Earth connections, including climate, ionosphere, magnetosphere, cosmic rays Data-intensive and cloud computing on Microsoft HPC/Azure platform in remote sensing, environmental databases and sensor networks Tiled display / Virtual Earth / Deep Zoom / SAGE visualization platform / World Wide Telescope Multispectral micro-remote sensing for art conservation