A SciDB-based Framework for Efficient Satellite Data Storage and Query based on Dynamic Atmospheric Event Trajectory


Luboš Krčál
Nanyang Technological University, Singapore
Czech Technical University in Prague, Czech Republic

Shen-Shyang Ho
Nanyang Technological University, Singapore
ssho@ntu.edu.sg

ABSTRACT
Current research in climate informatics focuses mainly on the development of novel (machine learning, data mining, or statistical) techniques to analyze climate data (e.g. model, in-situ, or satellite) or to make predictions based on these climate data. One important component missing from this analysis workflow is data management that allows efficient and flexible data retrieval, (ease of) reproducibility, and (ease of) reuse of techniques on user-defined data subsets or other data. In this paper, we describe our preliminary investigation on the utilization of the distributed array-based database management system, SciDB, to support data-driven climate science research. We focus on modeling and generating indices that allow effective execution of various spatiotemporal queries on satellite data. Moreover, we demonstrate fast and accurate data retrieval based on user-specified trajectories from the SciDB database containing tropical cyclone trajectories and the complete ten-year QuikSCAT ocean surface wind fields satellite data. Our preliminary work indicates the feasibility of the array-based technology for multiple satellite data storage, query, and analysis. Towards this end, a successful deployment of SciDB-based data storage can facilitate the use of data from multiple satellites for climate and weather research.

Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Management - Database Applications, Scientific databases

General Terms
Experimentation, Design, Measurement

Keywords
SciDB, scientific database, arrays, multidimensional arrays, index, indexing, spatiotemporal, satellite data, QuikSCAT

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data 2015, Seattle, WA, USA
Copyright 2015 ACM

1. INTRODUCTION
Natural phenomena such as haze, hurricanes, and blizzards that evolve over time usually do not have well-defined boundaries. Their features may be captured by multiple satellites. To process and extract information from large-scale satellite data, one needs a data-intensive architecture for distributed storage and computation resources. Such an architecture allows end users such as scientists to run their computation tasks effectively with shared computational resources and intermediate results, but without data replication. All the data from multiple sources are processed and analyzed regardless of their origin.

Satellite data is most conveniently represented using multidimensional arrays, exploiting its multidimensional nature. For our investigation, we use the open-source distributed, array-based database, SciDB [2], as a platform for our spatiotemporal data management framework. SciDB conforms with the data-intensive architecture, providing a highly effective computational and data storage platform.
Moreover, it provides standard extension points, i.e., user-defined data types, operators and functions.

To represent spatiotemporal features and to lay the groundwork for implementing spatiotemporal predicates and operators, we focus on redimensioning-based indices. These simple indices, combined with a distributed and parallel array platform, provide a very efficient representation for various spatiotemporal queries. Since SciDB provides a computational platform, all spatiotemporal operator results can be further processed within the database. This includes preprocessing, data analysis, visualization, etc.

In this paper, we describe our preliminary work on the utilization of the distributed array-based SciDB database management system to support satellite data-driven climate science research by providing scientists with previously unavailable accurate and efficient Earth science satellite sensor data query and retrieval based on user-defined criteria to study and analyze atmospheric events. In particular, we ingested the complete ten-year QuikSCAT ocean surface wind fields satellite data into the SciDB database system, generated several redimension-based indices, performed trajectory queries (space-time neighborhood of cyclone trajectories), processed the results and visualized them. All of this is done exclusively with array operators within SciDB.

Our preliminary work indicates the feasibility of the array-based technology for multiple satellite data storage, query, and analysis. Towards this end, a successful deployment of SciDB-based satellite data storage can facilitate the use of data from multiple satellites for climate and weather research.

The paper is organized as follows. In Section 2, we provide a brief overview of previous work on satellite data retrieval and on using SciDB for satellite data analysis. In Section 3, we provide background related to SciDB array-based databases, QuikSCAT satellite data, cyclone trajectory data, and motivation based on scientific applications. In Section 4, we describe and discuss in detail our SciDB-based framework for satellite data storage and query based on dynamic atmospheric event trajectory. In Section 5, we go through our implementation of a query to retrieve QuikSCAT data given a tropical cyclone trajectory. In Section 6, we briefly discuss the time complexity of the query. In Section 7, we conclude with some future research and development directions.

2. RELATED WORK
EarthDB [7, 8] is a solution based on SciDB focusing on the analysis of NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) data. It is a set of tools consisting of a preprocessor from HDF (Hierarchical Data Format) to CSV (comma-separated values) to SciDB's dense load format, a parallel loader, benchmarking scripts and a visualization component. EarthDB could potentially have been reused to load HDF data for our implementation. However, we decided not to reuse it because it is highly adapted to MODIS data. Furthermore, because EarthDB uses CSV and the dense load format as intermediate formats for data loading, its loading performance is inferior to direct HDF preprocessing into SciDB's binary load format with subsequent direct parallel loading.

Ho et al. [4] proposed a spatiotemporal join query to extract satellite data given tropical cyclone trajectories. A search based on the tropical cyclone trajectory of interest is performed on a partition tree that indexes the satellite swath data.
Based on the returned indices, relevant data files are retrieved from NASA data centers. Then, the data files are opened to extract and subset the relevant swath data. A system prototype was developed [9, 10] based on [4] and moving objects database technology [3]. The main issue for that system is that the HDF satellite data are stored in NASA data centers, so data of interest to a user have to be retrieved, extracted and subset locally.

3. BACKGROUND

3.1 SciDB Array Database
SciDB [1] is a new open-source data management and analytics system organized around a multidimensional array data model. An array DBMS is a generalization of a relational DBMS. SciDB has a distributed architecture with separate storage (shared nothing), which allows end users to move scientific analysis and processing from a local environment to a data-driven environment. This eliminates the need to transfer all the data to their local machines and to rely on their own, quite often limited, computation power. There are several interfaces, including built-in languages: AFL (Array Functional Language) and AQL (Array Query Language); programmatic interfaces: Python, R, Julia; and an extensions API (user-defined types, functions, aggregates and operators) in C++.

The data representation in SciDB [12] is based on n-dimensional arrays. These can be seen as an extension of standard relational databases. In relational databases, all the rows can be seen as values along one dimension, where each value consists of a fixed number of attributes (columns). Array databases instead have rows, columns and further dimensions to define a cell. Within each cell, there can be an arbitrary, but fixed, number of attributes. Each dimension has contiguous integer indices, and each cell within the same array type has the same structure: attributes of given data types with null and default properties. Arrays are implicitly indexed on all the dimensions.
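As an illustration of the array data model described above, the following minimal Python sketch models a bounded, sparse n-dimensional array in which every non-empty cell holds one record with the same fixed set of attributes. The names `SparseArray` and `WindCell` are hypothetical and are not part of SciDB or SciDB-Py; the sketch only mirrors the concepts (integer dimensions, fixed attributes per cell, empty cells).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class WindCell:
    """Attributes stored in one cell; every cell of the array shares this structure."""
    wind_speed: Optional[int]   # nullable attribute
    wind_dir: Optional[int]     # nullable attribute


class SparseArray:
    """Toy n-dimensional sparse array: integer coordinates -> one attribute record."""

    def __init__(self, bounds: Tuple[Tuple[int, int], ...]):
        self.bounds = bounds                      # inclusive (low, high) per dimension
        self.cells: Dict[Tuple[int, ...], WindCell] = {}

    def _check(self, coords):
        if len(coords) != len(self.bounds):
            raise ValueError("wrong number of dimensions")
        for c, (lo, hi) in zip(coords, self.bounds):
            if not lo <= c <= hi:
                raise IndexError(f"coordinate {c} outside [{lo}, {hi}]")

    def put(self, coords, cell):
        self._check(coords)
        self.cells[tuple(coords)] = cell

    def get(self, coords):
        self._check(coords)
        return self.cells.get(tuple(coords))      # None for an empty cell


# A bounded 3-D array with dimensions [0:5], [0:1], [0:3]; only one cell is filled.
arr = SparseArray(((0, 5), (0, 1), (0, 3)))
arr.put((2, 1, 3), WindCell(wind_speed=12, wind_dir=270))
```

A relational table corresponds to the special case of a single dimension (the row number), which is why the array model can be viewed as a generalization.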
Note that compared to the original ideas in [12], the nested-array feature has been dropped, so the current model is somewhat simplified.

Arrays are stored across individual nodes using chunks, i.e., small subarrays of fixed dimensions. Chunk dimensions are user-specified and greatly impact the overall performance. Figure 1 shows a three-dimensional array with all the dimensions bounded. The dashed hyperrectangles represent individual chunks.

Figure 1: Three-dimensional sparse array of bounded dimensions (Dimension 1 [0:5], Dimension 2 [0:1], Dimension 3 [0:3]). Dashed hyperrectangles and different shades of gray represent individual chunks.

Highly parallel operations on arrays allow for very effective data processing. Most of the operations used for spatiotemporal indexing and queries have linear time complexity with an expected linear parallel speedup in SciDB. The multidimensional array model provides good support for spatiotemporal operations on redimension-based array indices. It is necessary to note, though, that some operations that are not native to arrays, or not easily implementable on arrays, are less effective than if implemented outside SciDB in an iterative language. This has been addressed in [11]. As other users have noted, the multidimensional array model, together with a fully functional underlying language, requires a complete change of mindset compared to iterative programming or to standard relational databases.

3.2 QuikSCAT Satellite Data
The QuikSCAT (Quick Scatterometer) satellite, which collected data from 1999 to 2009, carried a specialized microwave radar that measured near-surface ocean wind speed and direction under all weather and cloud conditions [6]. The satellite orbited Earth with an 1800 km wide measurement swath on the Earth's surface. It took about 101 minutes to complete one full orbital revolution. The scatterometer was able to provide measurements of a particular region twice per day. There are 5 data products available to the public from the Physical Oceanography Distributed Active Archive Center (PO.DAAC). There are 25 fields in the data structure for the Level 2B data. We load and store all the QuikSCAT data in our SciDB instance.

The Level 2B data consist of rows of ocean wind vectors in 25 km and 12.5 km wind vector cells (WVC). A WVC is analogous to a pixel representing a square of dimension 25 km or 12.5 km. Complete coverage of the Earth's circumference requires 1624 WVCs at 25 km and 3248 WVCs at 12.5 km. The 1800 km swath width corresponds to 72 WVCs at 25 km or 144 WVCs at 12.5 km. To alleviate the problem of measurements outside the swath, the Level 2B data contains an additional 4 WVCs at 25 km spatial resolution and 8 WVCs at 12.5 km spatial resolution.

3.3 Tropical Cyclone Trajectory
The National Hurricane Center website contains tropical cyclone reports which include comprehensive information on tropical cyclones occurring in the North Atlantic Ocean and the Eastern Pacific Ocean. In particular, it contains post-analysis six-hourly best track locations and intensities for tropical cyclones from 1958 to the current year. The Joint Typhoon Warning Center at the Naval Maritime Forecast Center provides tropical cyclone best track information for the Southern Hemisphere, Northern Indian Ocean and Western North Pacific Ocean from 1945 to the present. The tropical cyclone trajectories between 1999 and 2009 were collected from the websites of the two centers.

3.4 Scientific Applications
Published scientific journal papers show that query and retrieval capability is extremely important to scientists who retrieve specific sensor data of specific atmospheric events for statistical analysis. Two query examples derived from these published papers, both requiring search, retrieval, and analysis of satellite data containing cyclone features, are listed below:

1. Retrieve TRMM precipitation data for tropical cyclones that attained tropical storm intensity or higher over the western North Pacific and the South China Sea between longitudes 100°E and 180°. 138 sensor datasets for 61 tropical cyclones were retrieved [5].
2. Retrieve TRMM precipitation data for tropical cyclones over a multi-year period beginning in December 1997. Sensor datasets for 563 tropical cyclones were retrieved [14].

Moreover, such a capability supports scientists in investigations where large amounts of problem-specific, event-specific data are needed. An example of a tropical storm characteristics investigation using a QuikSCAT ocean surface wind data subset is as follows. Wind structure is one of the important factors controlling the intensity change (intensification or weakening) of tropical storms [13]. One can retrieve ocean surface wind fields measured by QuikSCAT that are associated with tropical storms in the Atlantic Ocean between 2000 and 2009. The search criteria can be further refined to find the wind measurements associated with a group of cases satisfying certain characteristics, such as storm paths where the translation speed is greater than 5 m/s, or ocean surface vector winds for hurricanes reaching category 4 or 5. The storm translation speed and category are determined from the tropical cyclone trajectories.

4. SYSTEM DESIGN

4.1 System Overview
We focus on integrating most of the functionality directly into SciDB; however, at least a minimal set of supporting tools is necessary. In our case, we have the following components. Each of the components uses SciDB-Py (the Python interface to SciDB) to interact with the database.

• Data preprocessors: fetch and preprocess HDF files into SciDB's binary format in parallel.
• QuikSCAT loader: manages the parallel loading process of QuikSCAT data.
• Trajectory loader: loads CSV trajectories of cyclone data from different sources.
• Indexing controller: the indexing process is done entirely and incrementally in SciDB.
• Query controller: executes queries and optionally processes data for visualization within SciDB.

We use SciDB version 14.12 together with SciDB-Py.

4.2 QuikSCAT Data Loading and Storage
The whole data loading process runs entirely in parallel. At first, disjoint binary data chunks are generated from raw QuikSCAT data in HDF (Hierarchical Data Format). HDF is used to store and organize large data, supporting multidimensional array data, group hierarchy and more. Due to the sheer size of the QuikSCAT data (about 500 GB), we first transform the data from HDF into SciDB's binary format, which is the fastest non-internal format to load into SciDB. The binary data chunks are then distributed to all the SciDB instances, where they are loaded in parallel. Data loading in this fashion is parallelized up to the number of SciDB instances. The process is also incremental.

The format of the raw data arrays depends on which attributes we want to save, but the general array scheme is the following:

    /* two dimensional array definition */
    QuikSCAT_D2
      <time : datetime not null>
      [swath = 0:*, 307, 0,
       along = 0:3247, 3248, 0]

    /* three dimensional array definition */
    QuikSCAT_D3
      <lat : int16 not null,
       lon : uint16 not null,
       wind_speed : int16 null,
       wind_dir : uint16 null,
       /* more attributes */>
      [swath = 0:*, 2, 0,
       along = 0:3247, 3248, 0,
       cross = 0:151, 152, 0]
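The dimension specifications in the schema above can be cross-checked against the swath geometry of Section 3.2. The following Python sketch, with hypothetical helper names `parse_dim` and `extent`, parses the three dimension specifications of QuikSCAT_D3 (each read as low:high, chunk length, overlap), computes the number of cells per chunk, and checks that the extent of the cross dimension equals the 12.5 km swath width in WVCs plus the 8 additional cells. It is plain string processing and does not talk to SciDB.

```python
from math import prod


def parse_dim(spec: str):
    """Parse 'name = low:high, chunk, overlap' into a dict ('*' means unbounded)."""
    name, rest = spec.split("=")
    bounds, chunk, overlap = (part.strip() for part in rest.split(","))
    low, high = bounds.split(":")
    return {
        "name": name.strip(),
        "low": int(low),
        "high": None if high.strip() == "*" else int(high),
        "chunk": int(chunk),
        "overlap": int(overlap),
    }


def extent(dim):
    """Number of cells along a bounded dimension, None if the dimension is unbounded."""
    return None if dim["high"] is None else dim["high"] - dim["low"] + 1


# Dimensions of QuikSCAT_D3 as listed in the schema.
d3 = [parse_dim(s) for s in (
    "swath = 0:*, 2, 0",
    "along = 0:3247, 3248, 0",
    "cross = 0:151, 152, 0",
)]

# Cells in one chunk: product of the chunk lengths of all dimensions.
cells_per_chunk = prod(d["chunk"] for d in d3)

# 1800 km swath in 12.5 km cells, plus the 8 extra cells outside the nominal swath.
wvc_12_5km = int(1800 / 12.5) + 8
```

With these chunk lengths, a chunk spans two complete swaths (all along and cross positions), so a chunk of QuikSCAT_D3 is identified by the swath number alone.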

Note that swath, along and cross are the three dimensions. Swath corresponds to the orbit number, along corresponds to the position along the track of the orbit, and cross corresponds to the position within a single scan (cross track). Dimensions are specified by four numbers: the minimum and maximum of the dimension's domain (the maximum can be unbounded, denoted by *), the chunk size along this dimension (can be unknown, i.e. determined by SciDB, denoted by ?) and the overlap (number of cells overlapping adjacent chunks along this dimension).

Also, note that time depends only on the swath and along dimensions and not on cross. Time therefore has lower dimensionality than latitude, longitude, wind speed, wind direction, and other attributes that also depend on the cross dimension. We need two arrays to store data of two different dimensionalities, even though the dimensions of one array are a subset of the dimensions of the other.

4.3 Indexing Spatiotemporal Data

4.3.1 Latitude-Longitude-Time Index
The Latitude-Longitude-Time index is a form of redimension index. The processing is based on the redimension operator of SciDB, where attributes of one array become dimensions of another (new) array. The new array's dimensions are usually rescaled compared to the range of the original attributes. Conflicting cells (i.e. cells that are the target of multiple source array cells) can be treated by concatenating the values into a list and then computing an aggregate function along the list of conflicting cells. Note that the aggregation can be done without explicitly storing the list of conflicting cells.

The redimensioning process is depicted in Figure 2. The dimensions of the original raw data array, swath and along, are turned into attributes, the cross dimension is discarded, and the attributes latitude, longitude and time are discretized and turned into dimensions by the redimension operation.
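The redimension step described above can be sketched in plain Python as follows. This is not SciDB's redimension operator, only an illustration of the idea: the attributes latitude, longitude and time are discretized into integer coordinates, the former dimensions swath and along become attributes, and source cells that map to the same target cell are collected in a conflict list. The grid resolution (0.25 degree cells, hourly time bins), the function names and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


def discretize(lat_deg, lon_deg, hours, cell_deg=0.25):
    """Map attribute values to integer dimension coordinates (assumed resolution)."""
    lat_i = int((lat_deg + 90.0) / cell_deg)       # shift latitude to a 0-based range
    lon_i = int((lon_deg % 360.0) / cell_deg)
    return lat_i, lon_i, int(hours)


def redimension(raw_cells):
    """raw_cells: iterable of (swath, along, lat_deg, lon_deg, hours).

    Returns a mapping (lat, lon, time) -> list of conflicting (swath, along) pairs.
    The cross dimension is not carried over, matching the description in the text.
    """
    conflicts = defaultdict(list)
    for swath, along, lat, lon, hours in raw_cells:
        conflicts[discretize(lat, lon, hours)].append((swath, along))
    return conflicts


# Hypothetical raw rows; the first two fall into the same target cell.
raw = [
    (1862, 1860, 10.10, 300.05, 5000.2),
    (1862, 1861, 10.20, 300.10, 5000.2),
    (1863,  341, 45.00, 120.00, 5001.7),
]
update_array = redimension(raw)
```

The conflict lists are kept explicit here only for readability; an aggregate can consume the conflicting values one at a time without materializing the list.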
Since many cells end up assigned the same latitude, longitude and time, most of them with different values of swath and along, we need to run an aggregate along the auxiliary dimension. This yields another array without conflicts, where the attributes swath_start, swath_end, along_start and along_end define a range on the swath and along dimensions in the original raw data array.

The index is an array with dimensions representing latitude and longitude from the QuikSCAT_D3 array and time from QuikSCAT_D2 (see the array schemes for QuikSCAT data). The granularity of the target dimensions is determined by the use cases or by individual levels in the hierarchical indices (see Section 4.3.3). The four attributes of the index, swath_start, swath_end, along_start and along_end, define a range in the data array, i.e. the values are pointers into the original data. Every index cell covers a range in the data swath, possibly with overlaps.

It is possible that multiple source cells end up indexed into a single target cell. As mentioned previously, an aggregate function is used. This aggregate function returns the range union over the swath and along dimensions. For example, if there is an incoming data point from swath=1863, along=341, while the current values for that index cell are swath_start=1862, along_start=1860, swath_end=1862, along_end=2894, then the index cell's pointers are updated to swath_start=1862, along_start=1860, swath_end=1863, along_end=341.

Index generation is fully incremental. However, to maximize parallelism, it is better to process swaths that are further apart. The chunks accessed in the modified index arrays are then more or less random, whereas processing consecutive swaths would create many write dependencies, because a substantial amount of data is written to the same physical chunk.
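The range-union aggregate can be sketched as follows. The sketch assumes that positions are ordered lexicographically as (swath, along) pairs, which is one reading that is consistent with the worked example in the text; the names `Range` and `range_union` are hypothetical.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    swath_start: int
    along_start: int
    swath_end: int
    along_end: int


def range_union(current: Optional[Range], swath: int, along: int) -> Range:
    """Extend the (swath, along) range of an index cell by one incoming point.

    Positions are compared lexicographically as (swath, along) pairs (assumption).
    """
    point = (swath, along)
    if current is None:                      # first point seen for this index cell
        return Range(swath, along, swath, along)
    start = min((current.swath_start, current.along_start), point)
    end = max((current.swath_end, current.along_end), point)
    return Range(start[0], start[1], end[0], end[1])


# The example from the text: the incoming point lies after the current range end.
cell = Range(swath_start=1862, along_start=1860, swath_end=1862, along_end=2894)
cell = range_union(cell, swath=1863, along=341)
```

Because the update only needs the current range and the incoming point, it can be applied incrementally as new swaths are indexed.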
The array schema of the Latitude-Longitude-Time index for QuikSCAT data is as follows:

    /* Index array with pointers to projected data */
    LatLongTime_Index
      <swath_begin : uint32,
       swath_end : uint32,
       row_begin : uint16,
       row_end : uint16>
      [lat = 0:720, ?, 0,
       long = 0:1440, ?, 0,
       time = 0:96432, ?, 0]

4.3.2 Cartesian Index
The Cartesian index is an extension of the Latitude-Longitude-Time index into Cartesian coordinates. The main idea of this index is to allow for faster and more convenient spatial queries. Since SciDB uses Cartesian coordinates, it is more effective to keep index data in Cartesian coordinates as well. Distance-based queries, predicates and operators can be effectively truncated based on the dimensions only. This eliminates the need to read chunks that would turn out to be unneeded during further processing of a query. An example of a Cartesian index with support for data projections is as follows:

    /* Index array with pointers to projected data */
    Cartesian_Index
      <swath_start : int32 not null,
       swath_end : int32 not null,
       row_start : int16 not null,
       row_end : int16 not null,
       /* earliest and latest time points covered by this cell */
       time_start : datetime not null,
       time_end : datetime not null,
       /* direction of time flow within the cell (along swath path) */
       time_angle : float not null,
       /* 3d angle -- normal to the plane of projection */
       polar_angle : float not null,
       azimuthal_angle : float not null,
       /* projection pointer */
       projection_ptr : int64 not null>
      [x = 0:1024, ?, 0,
       y = 0:1024, ?, 0,
       z = 0:1024, ?, 0,
       time = 0:87600, ?, 0]

Figure 2: Array redimensioning scheme used during incremental generation of the Latitude-Longitude-Time index. The raw QuikSCAT data array (dimensions SWATH, ALONG, CROSS; attributes time, latitude, longitude, other attributes) is redimensioned into an update array (dimensions LATITUDE, LONGITUDE, TIME, CONFLICTS (list); attributes swath, along), which is aggregated along the CONFLICTS dimension into the index array (dimensions LATITUDE, LONGITUDE, TIME; attributes swath_start, swath_end, along_start, along_end).

4.3.3 Hierarchical Structure of Indices
Our implementation of the Latitude-Longitude-Time index consists of multiple levels of indexing arrays. Each indexing array has a different granularity, i.e., its cells cover different extents of latitude, longitude and/or time. Due to the uneven distribution of spatiotemporal data from satellites, some array locations may contain denser or sparser data. For example, the data density of scans around the equator is different from the density close to the poles. Note that for swath data this biased distribution does not occur when using Cartesian indices. With hierarchical indices, queries can be executed successively on more detailed (lower-level) indices where necessary.

When generating indices on multiple levels, we do not need to use online heuristics to determine where to use more levels. Doing so would require complicated re-indexing and possibly re-reading of the raw data. Instead, since we are working with Earth-observing satellite data, we can determine the boundaries for individual levels statically prior to indexing.

4.3.4 Indexing Data by Values and Indices Storing Data
All indices are capable of indexing not only the raw data ranges (i.e. availability of data), but also the values or statistical aggregates (e.g. min, max, count). Values may also be indexed by an additional array dimension.
Including value indexing allows for additional data-dependent operators, i.e., pre-filtering that can be partially done in the index, effectively reducing the amount of raw data retrieved.

Each index entry can also contain data projected onto a plane perpendicular to the sphere. This does not have to be a complete projection of the data; it can instead consist of value aggregates and a bitmask representing a rough outline of the underlying complete data. This is a form of regridding with a variable projection plane suitable for spheres.

5. USE CASE SCENARIO: SELECT QUIKSCAT DATA GIVEN TROPICAL CYCLONE TRAJECTORY
We demonstrate the spatiotemporal query capabilities on an example of data selection along a moving object: select QuikSCAT data along a given cyclone trajectory, with a lat-lon radius of 1.0 degree from the cyclone eye and a time span of [-3,+21] hours.

The cyclone trajectory is loaded into SciDB as a list of points in the following format: <lat, lon, time>[i]. The trajectory is preprocessed into CSV and loaded directly from CSV into SciDB. Points of the trajectory are interpolated so that hyperrectangles on the interpolated points completely cover the neighborhood. Note that all the processing is done within SciDB, which also demonstrates the computational capabilities and the ease of integration of our solution.

5.1 Retrieving Data Regions - Pointers into the Original Data
Data is selected from a Latitude-Longitude-Time index array given the mask (as a list of hyperrectangles) using the cross_between operator. Figure 3 shows an example of a trajectory mask, which is used as an argument for the index query.

Figure 3: Trajectory and a corresponding mask, a sequence of hyperrectangles on the latitude, longitude and time dimensions.

Retrieved regions are processed to simplify the raw data retrieval. We want the region specification to be as simple as possible. Overlapping regions are joined together into a single, longer region.
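The construction of the trajectory mask described at the beginning of this section can be sketched in Python as follows. The paper performs this step inside SciDB; the sketch below is a stand-alone illustration with hypothetical function names, a simple linear interpolation, and an assumed rule that consecutive interpolated points lie at most one radius apart. It ignores longitude wrap-around and works in raw degrees and hours rather than in discretized index coordinates.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]          # (lat_deg, lon_deg, time_hours)
Box = Tuple[float, float, float, float, float, float]


def interpolate(track: List[Point], max_step_deg: float) -> List[Point]:
    """Linearly insert points so consecutive points are at most max_step_deg apart."""
    out: List[Point] = [track[0]]
    for (lat0, lon0, t0), (lat1, lon1, t1) in zip(track, track[1:]):
        gap = max(abs(lat1 - lat0), abs(lon1 - lon0))
        steps = max(1, int(-(-gap // max_step_deg)))      # ceiling division
        for k in range(1, steps + 1):
            f = k / steps
            out.append((lat0 + f * (lat1 - lat0),
                        lon0 + f * (lon1 - lon0),
                        t0 + f * (t1 - t0)))
    return out


def build_mask(track, radius_deg=1.0, span_hours=(-3.0, 21.0)) -> List[Box]:
    """One (lat_low, lon_low, time_low, lat_high, lon_high, time_high) box per point."""
    boxes = []
    for lat, lon, t in interpolate(track, max_step_deg=radius_deg):
        boxes.append((lat - radius_deg, lon - radius_deg, t + span_hours[0],
                      lat + radius_deg, lon + radius_deg, t + span_hours[1]))
    return boxes


# Two hypothetical six-hourly best-track points.
track = [(20.0, 300.0, 0.0), (22.5, 298.0, 6.0)]
mask = build_mask(track)
```

Each box has the same field order as the low/high attributes of the region array used in the index query, so a list like `mask` (after discretization) could serve as the second argument of the selection.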
Regions spanning a very long swath area (crossing distinct physical chunks in the database) are split. Only the basic functions sort, filter and uniq along swath and along are used for the region simplification. Regions can then be adjusted based on the visualization or data processing objective required by the user. Figure 4 shows the regions retrieved from the index, with each region specifying an area in the original QuikSCAT swath. The regions are stored in a list in the following format: <swath_start, swath_end, along_start, along_end>[i].
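The region simplification can be sketched as follows. This is an illustration in plain Python rather than the sort, filter and uniq array operators used in SciDB; the function names are hypothetical, regions are represented as inclusive ranges over lexicographically ordered (swath, along) positions, and the chunk length of 2 swaths and the maximum along index of 3247 are taken from the QuikSCAT_D3 schema.

```python
def merge_regions(regions):
    """regions: list of ((swath, along) start, (swath, along) end), inclusive."""
    merged = []
    for start, end in sorted(set(regions)):        # sort + uniq
        if merged and start <= merged[-1][1]:      # overlaps the previous region
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_by_swath_chunk(region, swath_chunk=2, along_max=3247):
    """Cut a region at swath-chunk boundaries so each part stays in one chunk."""
    (s0, a0), (s1, a1) = region
    parts = []
    while s0 // swath_chunk < s1 // swath_chunk:
        last_swath = (s0 // swath_chunk + 1) * swath_chunk - 1
        parts.append(((s0, a0), (last_swath, along_max)))
        s0, a0 = last_swath + 1, 0
    parts.append(((s0, a0), (s1, a1)))
    return parts


# Hypothetical regions: the first two overlap and are joined into one.
regions = [((1862, 1860), (1862, 2894)),
           ((1862, 2500), (1863, 341)),
           ((1870, 10), (1870, 90))]
simple = [p for r in merge_regions(regions) for p in split_by_swath_chunk(r)]
```

After this step every region lies within a single chunk of the raw data array, which is the property exploited during raw data retrieval.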

Figure 4: Swath regions as retrieved from the index query.

Raw QuikSCAT data is retrieved from the data arrays based on the regions. Regions spanning at most 1 chunk allow for fast data retrieval without inter-chunk data dependency. This is very effective for the data processing, as no data have to be transferred between the individual nodes.

5.2 Processing Raw Data for Visualisation
For the purpose of visualization, the following steps are taken:

First, redimension the raw data from [swath, along, cross] to [i, lat, lon] dimensions. Here, i represents the i-th swath section regridded to latitude and longitude dimensions, i.e., it can be seen as an image. Each swath section contributes to exactly one image. It may happen during the regridding process that multiple raw data points end up in the same cell, depending on the target resolution. Such conflicting data are aggregated using avg() into their respective cells.

We then smooth the data and fill in small gaps. A new auxiliary array is created using the regrid operator (which creates a single cell from a grid of subarrays) with an average aggregate, followed by xgrid (the opposite of regrid). The purpose of this array is to fill in a neighborhood of cells in the original array, as SciDB cannot directly compute window aggregates into empty cells. The original array is then merged into the auxiliary array, and a window aggregate (average) is computed to fill in the gaps in the original data. The auxiliary array is then merged into the original data array, effectively filling in the gaps.

We then interpolate cyclone centers based on swath time. Note that this computation would be simple with iterative processing. However, SciDB only natively supports array processing. We used the following sequence of steps to retrieve the interpolated cyclone centers:

• Every image corresponds to a swath range. Assign a first and last timestamp to every image, then compute the average timestamp for the region. Note that this is a simplification that results in a potentially non-negligible error compared to the real cyclone center.
• Retrieve the cyclone trajectory data with the timestamp attribute. Redimension it so that the point in the trajectory is an attribute p.
• Merge the image and trajectory data, both with timestamp data. This gives a list (an array with a single dimension).
• Perform a sort (on timestamp) followed by a cumulate operation on max(p), then reverse the list and do another cumulate, this time on min(p). The resulting list now has pointers p to both the preceding and the following point of the trajectory.
• Join in the location and time data from the trajectory and interpolate the centers.
• Filter out the trajectory-related data, keeping only the image-related data.

Note that a recently published framework for iterative processing within SciDB [11] could be used as a platform for some of the computation steps. To find the cyclone centers, one can alternatively analyze the data (wind speed and direction) to determine the center, similarly to [4, 10].

Finally, we filter the data based on their distance from the cyclone center. Some examples of the resulting images are shown in Figure 5.

5.3 Example of Simple Index Query
This is an example of a single query that retrieves data pointers to QuikSCAT swath data from the index. For clarity, the schemas of the arrays participating in the query are listed as well. The query selects index data from the index array quikscat_index_lat_long_time based on the given mask interpolated_regions_array (a list of hyperrectangles).

    /* Region array schema */
    interpolated_regions_array
      <lat : float, lon : float, time : int64,
       lat_low : int64, lon_low : int64, time_low : int64,
       lat_high : int64, lon_high : int64, time_high : int64>
      [i0 = 0:97, 1000, 0]

    /* Index (single level) schema */
    quikscat_index_lat_long_time
      <swath_begin : uint32,
       swath_end : uint32,
       row_begin : uint16,
       row_end : uint16>
      [lat = 0:720, 180, 0,
       long = 0:1440, 180, 0,
       time = 0:96432, 30, 0]

    /* Trajectory selection query on index */
    project(
      unpack(
        apply(
          cross_between(
            quikscat_index_lat_long_time,
            project(interpolated_regions_array,
                    lat_low, lon_low, time_low,
                    lat_high, lon_high, time_high)),
          swath_low, int64(swath_begin),
          swath_high, int64(swath_end),
          along_low, int64(row_begin),
          along_high, int64(row_end)),
        idx),
      swath_low, along_low, swath_high, along_high)

    /* Resulting regions array */
      <swath_low : int64,
       along_low : int64,
       swath_high : int64,
       along_high : int64>
      [idx = 0:*, , 0]

Figure 5: Processed images clipped according to the distance from the interpolated cyclone center.

Table 1: Timing results of a query - Select QuikSCAT Data Along Trajectory. Columns: Cyclone (Isabel in all four runs), Data points, Radius [deg], Span [hours], Index query time [sec].

6. DISCUSSION: TIME COMPLEXITY
Individual operations on the index arrays that do not require data transfer between nodes, i.e., lookup, cross_between and filtering, are all very fast. The theoretical time complexities of the array operations used are at most linear (in the size of the array), with an expected linear parallel speedup. This assumes asymptotically lower initial and coordination overhead, sufficient network speed for data retrieval (if needed) and fine enough chunk granularity (i.e. not all data in a single big chunk). Note that, assuming the chunk allocation scheme (i.e. the node on which each chunk is placed) respects the neighborhood by allocating nearby chunks to different physical nodes, we estimate that the physical locations of chunks retrieved after the operations on the index arrays will be more or less random.
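The semantics of the cross_between selection used in the index query, one of the linear-time operations discussed above, can be illustrated with the following Python sketch. It is not the SciDB implementation: it keeps an index cell if its coordinates fall inside at least one hyperrectangle of the mask and does so with a single pass over all cells, whereas a real array engine can additionally skip whole chunks that do not intersect any hyperrectangle. The sample index cells and the mask are hypothetical.

```python
def cross_between(index_cells, boxes):
    """Return index cells whose coordinates fall inside at least one box.

    index_cells: dict mapping (lat, lon, time) -> attribute tuple
    boxes: iterable of (lat_low, lon_low, time_low, lat_high, lon_high, time_high),
           with inclusive bounds
    """
    boxes = list(boxes)
    selected = {}
    for (lat, lon, t), attrs in index_cells.items():     # one pass over the array
        for lat_lo, lon_lo, t_lo, lat_hi, lon_hi, t_hi in boxes:
            if (lat_lo <= lat <= lat_hi
                    and lon_lo <= lon <= lon_hi
                    and t_lo <= t <= t_hi):
                selected[(lat, lon, t)] = attrs
                break                                     # one matching box is enough
    return selected


# Attribute order: swath_begin, swath_end, row_begin, row_end.
index = {
    (400, 1200, 5000): (1862, 1862, 1860, 2894),
    (540,  480, 5001): (1863, 1863,  341,  341),
}
hits = cross_between(index, [(396, 1196, 4997, 404, 1204, 5021)])
```

For a fixed mask, the work grows linearly with the number of index cells examined, and the outer loop is independent per cell, so it can be divided among nodes chunk by chunk.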
For a random allocation of the chunks of both index and data arrays to physical machines, the speedup is linear in expectation; with high probability, the speedup satisfies $S \geq c \cdot n$ for some desired constant $0 < c < 1$.

Table 1 shows the timing results of the data retrieval query described in detail in Section 5. Note that the time scales mainly with the number of data points of the trajectory. This is because each data point is likely to hit a brand new chunk in both the index arrays and the raw data arrays. Increasing the radius (space) 2 times, i.e., increasing the total area 4 times, has minimal effect on the time. Depending on the level of the index, the probability of hitting additional index chunks that need to be retrieved increases with lower-level indices (denser mesh). The same goes for the span (time). If we had prior knowledge of our queries, we could adapt the chunking scheme of the index array to accommodate more spatial or more temporal data in a single chunk, thus increasing performance for spatially or temporally intensive queries, respectively. However, in our data retrieval query example we kept the ratio balanced.

We run SciDB on a single physical server: Intel Xeon E v3 2.6GHz, 20M Cache, 8.00GT/s QPI, 8 cores; 8x16GB RAM, 2133 MT/s; 4x1TB 7.2K RPM SATA 6Gbps. There are 4 virtual machines, each with 2 cores, 1 hard drive and 16GB RAM, running Ubuntu server. Timing was measured as an average of 3 runs with a cold start of the virtual machines. Note that the main server was kept running, which might have introduced some bias. However, the variation between individual measurements was negligible.

7. CONCLUSIONS AND FUTURE WORK
In this paper, we described our preliminary work on the utilization of the distributed array-based SciDB database management system to support satellite data-driven climate science research by providing scientists with previously unavailable accurate and efficient Earth science satellite sensor data query and retrieval based on user-defined criteria to study and analyze atmospheric events such as tropical cyclones and mesoscale convective systems (MCS). In particular, we ingested the complete ten-year QuikSCAT ocean surface wind fields satellite data into the SciDB database system for fast and accurate data retrieval based on tropical cyclone trajectories. Moreover, all the processing is done in SciDB. SciDB was used as an all-in-one solution.
Except for the transformation of the QuikSCAT data into SciDB's binary format, we used the array database to perform all the steps:

- Store QuikSCAT data: overall, 500 GB of compressed data was stored in 8 virtual SciDB instances.
- Create spatiotemporal indices with incremental construction, various grid sizes, various chunking schemes, and other configuration variations.
- Perform spatiotemporal queries with a focus on a demonstrative use case: selection of QuikSCAT data within a radius and span of a cyclone trajectory.
- Process the results within SciDB and visualize them.

We will focus our future research on experimentation with different indices, including multidimensional hierarchical indices (based on multidimensional trees) and indices containing data aggregates, data samples, and so on. We will also implement a spatiotemporal library (indices, functions, operators, predicates) using a combination of multiple extension points, such as user-defined types, functions, and operators, together with convenience functions written in a higher-level language, such as the Python interface for SciDB.

8. ACKNOWLEDGEMENT

This research was supported in part by AcRF Grant RG-18/

REFERENCES

[1] P. G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM.
[2] P. Cudré-Mauroux. SciDB: an open-source, array-oriented database management system. Massachusetts Institute of Technology / University of Fribourg, Switzerland.
[3] R. Güting and M. Schneider. Moving Objects Databases. Morgan Kaufmann Publishers.
[4] S.-S. Ho, W. Tang, W. T. Liu, and M. Schneider. A framework for moving sensor data query and retrieval of dynamic atmospheric events. In Scientific and Statistical Database Management. Springer.
[5] Y. M. Kodama and T. Yamada. Detectability and configuration of tropical cyclone eyes over the western North Pacific in TRMM PR and IR observations. Monthly Weather Review, 133.
[6] T. Lungu and P. Callahan. QuikSCAT science data product user's manual: Overview and geophysical data products. D Rev A, version, 3:91.
[7] G. Planthaber, M. Stonebraker, and J. Frew. EarthDB: scalable analysis of MODIS data using SciDB. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. ACM.
[8] G. L. Planthaber Jr. MODBase: A SciDB-powered system for large-scale distributed storage and analysis of MODIS earth remote sensing data. PhD thesis, Massachusetts Institute of Technology.
[9] M. Schneider, S.-S. Ho, M. Agrawal, T. Chen, H. Liu, and G. Viswanathan. A moving objects database infrastructure for hurricane research: Data integration and complex object management. In Earth Science Technology Forum.
[10] M. Schneider, S.-S. Ho, T. Chen, A. Khan, G. Viswanathan, W. Tang, and W. T. Liu. Moving objects database technology for ad-hoc querying and satellite data retrieval of dynamic atmospheric events. In Earth Science Technology Forum.
[11] E. Soroush, M. Balazinska, S. Krughoff, and A. Connolly. Efficient iterative processing in the SciDB parallel array engine. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, page 39. ACM.
[12] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The architecture of SciDB. In Scientific and Statistical Database Management. Springer.
[13] W. Tang and W. T. Liu. Dependence of hurricane asymmetry and intensification on translation speed revealed by a decade of QuikSCAT measurements. In NASA Ocean Vector Wind Science Team Meeting, Boulder, Colorado.
[14] C. Yokoyama and Y. N. Takayabu. A statistical study on rain characteristics of tropical cyclones using TRMM satellite data. Monthly Weather Review, 136, 2008.


More information

Scientific Data Management and Dissemination

Scientific Data Management and Dissemination Federal GIS Conference February 9 10, 2015 Washington, DC Scientific Data Management and Dissemination John Fry Solution Engineer, Esri jfry@esri.com Agenda Background of Scientific Data Management through

More information

Open Source Visualisation with ADAGUC Web Map Services

Open Source Visualisation with ADAGUC Web Map Services Open Source Visualisation with ADAGUC Web Map Services Maarten Plieger Ernst de Vreede John van de Vegte, Wim Som de Cerff, Raymond Sluiter, Ian van der Neut, Jan Willem Noteboom 1 ADAGUC project Cooperative

More information

SQL Server 2012 Business Intelligence Boot Camp

SQL Server 2012 Business Intelligence Boot Camp SQL Server 2012 Business Intelligence Boot Camp Length: 5 Days Technology: Microsoft SQL Server 2012 Delivery Method: Instructor-led (classroom) About this Course Data warehousing is a solution organizations

More information

Efficiently Integrating MapReduce-based Computing into a Hurricane Loss Projection Model

Efficiently Integrating MapReduce-based Computing into a Hurricane Loss Projection Model Efficiently Integrating MapReduce-based Computing into a Hurricane Loss Projection Model Fausto C. Fleites 1, Steve Cocke 2, Shu-Ching Chen 1, Shahid Hamid 3 1 School of Computing and Information Sciences

More information

Reporting Services. White Paper. Published: August 2007 Updated: July 2008

Reporting Services. White Paper. Published: August 2007 Updated: July 2008 Reporting Services White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 Reporting Services provides a complete server-based platform that is designed to support a wide

More information

Present Status of Coastal Environmental Monitoring in Korean Waters. Using Remote Sensing Data

Present Status of Coastal Environmental Monitoring in Korean Waters. Using Remote Sensing Data Present Status of Coastal Environmental Monitoring in Korean Waters Using Remote Sensing Data Sang-Woo Kim, Young-Sang Suh National Fisheries Research & Development Institute #408-1, Shirang-ri, Gijang-up,

More information

Chapter Overview. Seasons. Earth s Seasons. Distribution of Solar Energy. Solar Energy on Earth. CHAPTER 6 Air-Sea Interaction

Chapter Overview. Seasons. Earth s Seasons. Distribution of Solar Energy. Solar Energy on Earth. CHAPTER 6 Air-Sea Interaction Chapter Overview CHAPTER 6 Air-Sea Interaction The atmosphere and the ocean are one independent system. Earth has seasons because of the tilt on its axis. There are three major wind belts in each hemisphere.

More information

2 Associating Facts with Time

2 Associating Facts with Time TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points

More information

Cloud Computing @ JPL Science Data Systems

Cloud Computing @ JPL Science Data Systems Cloud Computing @ JPL Science Data Systems Emily Law, GSAW 2011 Outline Science Data Systems (SDS) Space & Earth SDSs SDS Common Architecture Components Key Components using Cloud Computing Use Case 1:

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

IDL. Get the answers you need from your data. IDL

IDL. Get the answers you need from your data. IDL Get the answers you need from your data. IDL is the preferred computing environment for understanding complex data through interactive visualization and analysis. IDL Powerful visualization. Interactive

More information

Visualizing Data: Scalable Interactivity

Visualizing Data: Scalable Interactivity Visualizing Data: Scalable Interactivity The best data visualizations illustrate hidden information and structure contained in a data set. As access to large data sets has grown, so has the need for interactive

More information

Digital Remote Sensing Data Processing Digital Remote Sensing Data Processing and Analysis: An Introduction and Analysis: An Introduction

Digital Remote Sensing Data Processing Digital Remote Sensing Data Processing and Analysis: An Introduction and Analysis: An Introduction Digital Remote Sensing Data Processing Digital Remote Sensing Data Processing and Analysis: An Introduction and Analysis: An Introduction Content Remote sensing data Spatial, spectral, radiometric and

More information

Load Distribution in Large Scale Network Monitoring Infrastructures

Load Distribution in Large Scale Network Monitoring Infrastructures Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu

More information

Introduction to GIS (Basics, Data, Analysis) & Case Studies. 13 th May 2004. Content. What is GIS?

Introduction to GIS (Basics, Data, Analysis) & Case Studies. 13 th May 2004. Content. What is GIS? Introduction to GIS (Basics, Data, Analysis) & Case Studies 13 th May 2004 Content Introduction to GIS Data concepts Data input Analysis Applications selected examples What is GIS? Geographic Information

More information

NetCDF and HDF Data in ArcGIS

NetCDF and HDF Data in ArcGIS 2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop NetCDF and HDF Data in ArcGIS Nawajish Noman Kevin Butler Esri UC2013. Technical Workshop. Outline NetCDF

More information

Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

More information

A Dynamic Load Balancing Strategy for Parallel Datacube Computation

A Dynamic Load Balancing Strategy for Parallel Datacube Computation A Dynamic Load Balancing Strategy for Parallel Datacube Computation Seigo Muto Institute of Industrial Science, University of Tokyo 7-22-1 Roppongi, Minato-ku, Tokyo, 106-8558 Japan +81-3-3402-6231 ext.

More information

Structure? Integrated Climate Data Center How to use the ICDC? Tools? Data Formats? User

Structure? Integrated Climate Data Center How to use the ICDC? Tools? Data Formats? User Integrated Climate Data Center? Data Formats?? Tools???? visits Structure???? User Contents Which Data Formats do we offer? What is the Structure of our data center? Which Tools do we provide? Our Aims

More information