A SciDB-based Framework for Efficient Satellite Data Storage and Query based on Dynamic Atmospheric Event Trajectory

Luboš Krčál
Nanyang Technological University, Singapore
Czech Technical University in Prague, Czech Republic

Shen-Shyang Ho
Nanyang Technological University, Singapore
ssho@ntu.edu.sg

ABSTRACT
Current research in climate informatics focuses mainly on the development of novel (machine learning, data mining, or statistical) techniques to analyze climate data (e.g., model, in-situ, or satellite) or to make predictions based on these climate data. One important component missing from this analysis workflow is data management that allows efficient and flexible data retrieval, easy reproducibility, and easy reuse of techniques on user-defined data subsets or other data. In this paper, we describe our preliminary investigation on the utilization of the distributed array-based database management system, SciDB, to support data-driven climate science research. We focus on modeling and generating indices that allow effective execution of various spatiotemporal queries on satellite data. Moreover, we demonstrate fast and accurate data retrieval based on user-specified trajectories from the SciDB database containing tropical cyclone trajectories and the complete ten-year QuikSCAT ocean surface wind fields satellite data. Our preliminary work indicates the feasibility of the array-based technology for multiple satellite data storage, query, and analysis. Towards this end, a successful deployment of SciDB-based data storage can facilitate the use of data from multiple satellites for climate and weather research.

Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Management - Database Applications, Scientific databases

General Terms
Experimentation, Design, Measurement

Keywords
SciDB, scientific database, arrays, multidimensional arrays, index, indexing, spatiotemporal, satellite data, QuikSCAT

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data 2015, Seattle, WA, USA
Copyright 2015 ACM

1. INTRODUCTION
Natural phenomena such as haze, hurricanes, and blizzards that evolve over time usually do not have well-defined boundaries. Their features may be captured by multiple satellites. To process and extract information from large-scale satellite data, one needs a data-intensive architecture with distributed storage and computation resources. Such an architecture allows end users such as scientists to run their computation tasks effectively with shared computational resources and intermediate results, but without data replication. All the data from multiple sources are processed and analyzed regardless of their origin. Satellite data is most conveniently represented using multidimensional arrays, which exploit its multidimensional nature. For our investigation, we use the open-source, distributed, array-based database SciDB [2] as a platform for our spatiotemporal data management framework. SciDB conforms to the data-intensive architecture, providing a highly effective computational and data storage platform.
Moreover, it provides standard extension points, i.e., user-defined data types, operators and functions. To represent spatiotemporal features and lay the ground for the implementation of spatiotemporal predicates and operators, we focus on redimension-based indices. These simple indices, combined with a distributed and parallel array platform, provide a very efficient representation for various spatiotemporal queries. Since SciDB provides a computational platform, all spatiotemporal operator results can be further processed within the database. Such processing includes preprocessing, data analysis, visualization, etc.

In this paper, we describe our preliminary work on the utilization of the distributed array-based SciDB database management system to support satellite data-driven climate science research by providing scientists with previously unavailable accurate and efficient Earth science satellite sensor data query and retrieval based on user-defined criteria to study and analyze atmospheric events. In particular, we ingested the complete ten-year QuikSCAT ocean surface wind fields satellite data into the SciDB database system, generated several redimension-based indices, performed trajectory queries (space-time neighborhood of cyclone trajectories), processed the results and visualized them. All of this is done exclusively with array operators within SciDB. Our preliminary work indicates the feasibility of the array-based technology for multiple satellite data storage, query, and analysis. Towards this end, a successful deployment of SciDB-based satellite data storage can facilitate the use of data from multiple satellites for climate and weather research.

The paper is organized as follows. In Section 2, we provide a brief overview of previous work on satellite data retrieval and on using SciDB for satellite data analysis. In Section 3, we provide background on SciDB array-based databases, QuikSCAT satellite data, cyclone trajectory data, and the motivating scientific applications. In Section 4, we describe and discuss in detail our SciDB-based framework for satellite data storage and query based on dynamic atmospheric event trajectories. In Section 5, we go through our implementation of a query that retrieves QuikSCAT data given a tropical cyclone trajectory. In Section 6, we briefly discuss the time complexity of the query. In Section 7, we conclude with some future research and development directions.

2. RELATED WORK
EarthDB [7, 8] is a solution based on SciDB focusing on the analysis of NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) data. It is a set of tools consisting of a preprocessor from HDF (Hierarchical Data Format) to CSV (comma-separated values) to SciDB's dense load format, a parallel loader, benchmarking scripts and a visualization component. EarthDB could potentially have been reused to load HDF data for our implementation. However, we decided not to reuse it because it is highly tailored to MODIS data. Furthermore, because EarthDB uses CSV and the dense load format as intermediate formats for data loading, its loading performance is inferior to direct HDF preprocessing into SciDB's binary load format followed by direct parallel loading.

Ho et al. [4] proposed a spatiotemporal join query to extract satellite data given tropical cyclone trajectories. A search based on the tropical cyclone trajectory of interest is performed on a partition tree that indexes the satellite swath data.
Based on the returned indices, relevant data files are retrieved from NASA data centers. Then, the data files are opened to extract and subset the relevant swath data. A system prototype was developed [9, 10] based on [4] and moving objects database technology [3]. The main issue with that system is that the HDF satellite data are stored in NASA data centers, so data of interest to a user have to be retrieved, extracted and subset locally.

3. BACKGROUND

3.1 SciDB Array Database
SciDB [1] is a new open-source data management and analytics system organized around a multi-dimensional array data model. An array DBMS is a generalization of a relational DBMS. SciDB has a distributed architecture with separate storage (shared nothing), which allows end users to move scientific analysis and processing from a local environment to a data-driven environment. This eliminates the need to transfer all the data to their local machines and to rely on their own, often limited, computation power. There are several interfaces, including built-in languages, AFL (Array Functional Language) and AQL (Array Query Language); programmatic interfaces for Python, R and Julia; and an extension API (user-defined types, functions, aggregates and operators) in C++.

The data representation in SciDB [12] is based on n-dimensional arrays. These can be seen as an extension of standard relational databases. In relational databases, all the rows can be seen as values along one dimension, where each value consists of a fixed number of attributes (columns). Array databases instead have rows, columns and further dimensions to define a cell. Within each cell, there can be an arbitrary, but fixed, number of attributes. Each dimension has contiguous integer indices, and each cell within the same array type has the same structure: attributes of given data types with null and default properties. Arrays are implicitly indexed on all the dimensions.
Note that compared to the original ideas in [12], the nested arrays feature has been dropped, so the current model is somewhat simplified. Arrays are stored across individual nodes using chunks, which are small subarrays of fixed dimensions. Chunk dimensions are user-specified and greatly impact the overall performance. Figure 1 shows a three-dimensional array with all the dimensions bounded. The dashed hyperrectangles represent individual chunks.

Figure 1: Three-dimensional sparse array of bounded dimensions (Dimension 1 [0:5], Dimension 2 [0:1], Dimension 3 [0:3]). Dashed hyperrectangles and different shades of gray represent individual chunks. In this case, chunks are of size [...].

Highly parallel operations on arrays allow for very effective data processing. Most of the operations used for spatiotemporal indexing and queries have linear time complexity with an expected linear parallel speedup in SciDB. The multidimensional array model provides good support for spatiotemporal operations on redimension-based array indices. It is necessary to note, though, that some operations that are not native to arrays, or not easily implementable on arrays, are less effective than if implemented outside SciDB in any iterative language. This has been addressed in [11]. As other users have noted, the multidimensional array model, together with a fully functional underlying language, requires a complete change of mindset compared to iterative programming or to standard relational databases.

3.2 QuikSCAT Satellite Data
The QuikSCAT (Quick Scatterometer) satellite, which collected data from 1999 to 2009, carried a specialized microwave radar that measured near-surface ocean wind speed and direction under all weather and cloud conditions [6]. The satellite orbited Earth with an 1800 km wide measurement swath on the Earth's surface. It took about 101 minutes to complete one full orbital revolution. The scatterometer was able to provide measurements of a particular region twice per day. There are 5 data products available to the public from the Physical Oceanography Distributed Active Archive Center (PO.DAAC). There are 25 fields in the data structure for the Level 2B data. We load and store all the QuikSCAT data in our SciDB instance. The Level 2B data consist of rows of ocean wind vectors in 25 km and 12.5 km wind vector cells (WVC). A WVC is analogous to a pixel representing a square of dimension 25 km or 12.5 km. Complete coverage of the Earth's circumference requires 1624 25 km WVCs or 3248 12.5 km WVCs. The 1800 km swath width corresponds to 72 25 km WVCs or 144 12.5 km WVCs. To accommodate measurements outside the nominal swath, the Level 2B data contain an additional 4 WVCs at 25 km spatial resolution and 8 WVCs at 12.5 km spatial resolution.

3.3 Tropical Cyclone Trajectory
The National Hurricane Center website contains tropical cyclone reports, which include comprehensive information on tropical cyclones occurring in the North Atlantic Ocean and the Eastern Pacific Ocean. In particular, it contains post-analysis six-hourly best track locations and intensities for tropical cyclones from 1958 to the current year. The Joint Typhoon Warning Center at the Naval Maritime Forecast Center provides tropical cyclone best track information for the Southern Hemisphere, Northern Indian Ocean and Western North Pacific Ocean from 1945 to the present. The tropical cyclone trajectories between 1999 and 2009 were collected from the websites of the two centers.

3.4 Scientific Applications
Published scientific journal papers show that query and retrieval capability is extremely important to scientists who retrieve specific sensor data of specific atmospheric events for statistical analysis. Two query examples derived from these published papers, both requiring search, retrieval, and analysis of satellite data containing cyclone features, are listed below:
1.
Retrieve TRMM precipitation data for tropical cyclones that attained tropical storm intensity or higher over the western North Pacific and the South China Sea between longitudes 100°E and 180°. 138 sensor datasets for 61 tropical cyclones retrieved [5].
2. Retrieve TRMM precipitation data for tropical cyclones from December 1997 to December [...]. [...] sensor datasets for 563 tropical cyclones retrieved [14].

Moreover, such a capability supports scientists in investigations where large amounts of problem-specific, event-specific data are needed. An example of a tropical storm characteristics investigation using a QuikSCAT ocean surface wind data subset is as follows. Wind structure is one of the important factors controlling the intensity change (intensification or weakening) of tropical storms [13]. One can retrieve ocean surface wind fields measured by QuikSCAT that are associated with tropical storms in the Atlantic Ocean between 2000 and 2009. The search criteria can be further specified to find the wind measurements for a group of cases satisfying particular characteristics, such as storm paths where the translation speed is greater than 5 m/s, or ocean surface vector winds for hurricanes reaching category 4 or 5. The storm translation speed and category are determined from the tropical cyclone trajectories.

4. SYSTEM DESIGN

4.1 System Overview
We focus on integrating most of the functionality directly into SciDB; however, at least a minimal set of supporting tools is necessary. In our case, we have the following components. Each of them uses SciDB-Py (the Python interface to SciDB) to interact with the database.
Data preprocessors: fetch and preprocess HDF files into SciDB's binary format in parallel.
QuikSCAT loader: manages the parallel loading process of QuikSCAT data.
Trajectory loader: loads CSV trajectories of cyclone data from different sources.
Indexing controller: drives the indexing process, which is done entirely and incrementally in SciDB.
Query controller: executes queries and optionally processes data for visualization within SciDB.
The SciDB version is 14.12; the SciDB-Py version is [...].

4.2 QuikSCAT Data Loading and Storage
The whole data loading process runs in parallel. First, disjoint binary data chunks are generated from raw QuikSCAT data in HDF (Hierarchical Data Format). HDF is used to store and organize large data, supporting multidimensional array data, group hierarchies and more. Due to the sheer size of the QuikSCAT data (about 500 GB), we first transform the data from HDF into SciDB's binary format, which is the fastest non-internal format to load into SciDB. The binary data chunks are then distributed to all the SciDB instances, where they are loaded in parallel. This is the fastest way to load data into SciDB: loading is parallelized up to the number of SciDB instances. The process is also incremental. The format of the raw data arrays depends on which attributes we want to save, but the general array scheme is the following:

    /* two dimensional array definition */
    QuikSCAT_D2
    <time : datetime not null>
    [swath = 0:*, 307, 0,
     along = 0:3247, 3248, 0]

    /* three dimensional array definition */
    QuikSCAT_D3
    <lat : int16 not null,
     lon : uint16 not null,
     wind_speed : int16 null,
     wind_dir : uint16 null,
     /* more attributes */>
    [swath = 0:*, 2, 0,
     along = 0:3247, 3248, 0,
     cross = 0:151, 152, 0]

Note that swath, along and cross are the three dimensions. Swath corresponds to the orbit number, along corresponds to the position along the track of the orbit, and cross corresponds to the position within a single scan (cross track). Each dimension is specified by four numbers: the minimum and maximum of the dimension's domain (the maximum can be unbounded, denoted by *), the chunk size along this dimension (can be left unknown, i.e., determined by SciDB, denoted by ?) and the overlap (the number of cells shared with adjacent chunks along this dimension). Also, note that time depends only on the swath and along dimensions and not on cross. Time therefore has lower dimensionality than latitude, longitude, wind speed, wind direction, and the other attributes that also depend on the cross dimension. We need two arrays to store data of the two different dimensionalities, even though one set of dimensions is a subset of the other.

4.3 Indexing Spatiotemporal Data

4.3.1 Latitude-Longitude-Time Index
The Latitude-Longitude-Time index is a form of redimension index. The processing is based on the redimension operator of SciDB, where attributes of one array become dimensions of another (new) array. The new array's dimensions are usually rescaled compared to the range of the original attributes. Conflicting cells (i.e., cells that are the target of multiple source array cells) can be handled by concatenating the values into a list and then computing an aggregate function along the list of conflicting cells. Note that the aggregation can be done without explicitly storing the list of conflicting cells. The redimensioning process is depicted in Figure 2. The dimensions of the original raw data array, swath and along, are turned into attributes, the cross dimension is discarded, and the attributes latitude, longitude and time are discretized and turned into dimensions by the redimension operation.
Since many cells end up assigned the same latitude, longitude and time, most of them with different values of swath and along, we need to run an aggregate along the auxiliary dimension. This yields another array without conflicts, where the attributes swath_start, swath_end, along_start and along_end define a range on the swath and along dimensions in the original raw data array. The index is an array with dimensions representing latitude and longitude from the QuikSCAT_D3 array and time from the QuikSCAT_D2 array (see the array schemes for QuikSCAT data). The granularity of the target dimensions is determined by the use cases or by the individual levels in the hierarchical indices (see Section 4.3.3). The four attributes of the index, swath_start, swath_end, along_start and along_end, define a range in the data array, i.e., the values are pointers into the original data. Every index cell covers a range in the data swath, possibly with overlaps. It is possible that multiple source cells end up indexed into a single target cell. As mentioned previously, an aggregate function is used in that case. This aggregate function returns the range union over the swath and along dimensions. For example, if there is an incoming data point from swath=1863, along=341, while the current values for that index cell are swath_start=1862, along_start=1860, swath_end=1862, along_end=2894, then the index cell's pointers are updated to swath_start=1862, along_start=1860, swath_end=1863, along_end=341. Index generation is fully incremental. However, to maximize parallelism, it is better to process swaths that are further apart. The chunks accessed in the modified index arrays are then more or less random. Processing consecutive swaths instead may create many write dependencies, because a substantial amount of data is written to the same physical chunk.
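The range-union aggregate described above can be sketched in plain Python. This is a minimal illustration, not the aggregate used inside SciDB: the function names are hypothetical, and the lexicographic ordering of (swath, along) positions is an assumption inferred from the worked example in the text.

```python
from typing import Iterable, Optional, Tuple

# A position in the raw data array. Positions are compared lexicographically:
# first by swath, then by along-track row (assumption based on the example).
Position = Tuple[int, int]
Range = Tuple[Position, Position]  # inclusive (start, end)


def range_union(current: Optional[Range], point: Position) -> Range:
    """Extend an index cell's pointer range so that it covers `point`."""
    if current is None:
        return (point, point)
    start, end = current
    return (min(start, point), max(end, point))


def aggregate_cell(points: Iterable[Position]) -> Optional[Range]:
    """Fold all source cells mapped to one index cell into a single range."""
    result: Optional[Range] = None
    for p in points:
        result = range_union(result, p)
    return result


# Worked example from the text: the cell currently points to
# swath 1862, rows 1860..2894, and a point (swath=1863, along=341) arrives.
cell = ((1862, 1860), (1862, 2894))
updated = range_union(cell, (1863, 341))
print(updated)  # ((1862, 1860), (1863, 341))
```

Because the update only needs the current start and end, the aggregate can be applied incrementally without storing the list of conflicting cells.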
The array schema of the Latitude-Longitude-Time index for QuikSCAT data is as follows:

    /* Index array with pointers to projected data */
    LatLongTime_Index
    <swath_begin : uint32,
     swath_end : uint32,
     row_begin : uint16,
     row_end : uint16>
    [lat = 0:720, ?, 0,
     long = 0:1440, ?, 0,
     time = 0:96432, ?, 0]

Figure 2: Array redimensioning scheme used during incremental generation of the Latitude-Longitude-Time index. Raw QuikSCAT Data Array (SWATH, ALONG and CROSS dimensions; attributes Time, Latitude, Longitude and others) -> Redimension -> Redimensioned Update Array (LATITUDE, LONGITUDE, TIME and CONFLICTS (list) dimensions; attributes Swath, Along) -> Aggregate along CONFLICTS dimension -> Index Array (LATITUDE, LONGITUDE and TIME dimensions; attributes Swath_start, Swath_end, Along_start, Along_end).

4.3.2 Cartesian Index
The Cartesian index is an extension of the Latitude-Longitude-Time index into Cartesian coordinates. The main idea of this index is to allow for faster and more convenient spatial queries. Since SciDB uses Cartesian coordinates, it is more effective to keep index data in Cartesian coordinates as well. Distance-based queries, predicates and operators can be truncated effectively based on the dimensions alone. This eliminates the need to read chunks that would turn out to be unneeded during further processing of a query. An example of a Cartesian index with support for data projections is as follows:

    /* Index array with pointers to projected data */
    Cartesian_Index
    <swath_start : int32 not null,
     swath_end : int32 not null,
     row_start : int16 not null,
     row_end : int16 not null,
     /* earliest and latest time points covered by this cell */
     time_start : datetime not null,
     time_end : datetime not null,
     /* direction of time flow within the cell (along swath path) */
     time_angle : float not null,
     /* 3d angle -- normal to the plane of projection */
     polar_angle : float not null,
     azimuthal_angle : float not null,
     /* projection pointer */
     projection_ptr : int64 not null>
    [x = 0:1024, ?, 0,
     y = 0:1024, ?, 0,
     z = 0:1024, ?, 0,
     time = 0:87600, ?, 0]

4.3.3 Hierarchical Structure of Indices
Our implementation of the Latitude-Longitude-Time index consists of multiple levels of indexing arrays. Each indexing array has a different granularity, i.e., its cells cover a different extent of latitude, longitude and/or time. Due to the uneven distribution of spatiotemporal data from satellites, some array locations may contain denser or sparser data. For example, the data density of scans around the equator is different from the density close to the poles. Note that for swath data this biased distribution does not occur when using Cartesian indices. With hierarchical indices, queries can be executed successively on more detailed (lower-level) indices where necessary. When generating indices on multiple levels, we do not need online heuristics to determine where to use more levels. Such heuristics would require complicated re-indexing and possibly re-reading the raw data. Instead, since we are working with Earth-observing satellite data, we can determine the boundaries for the individual levels statically prior to indexing.

4.3.4 Indexing Data by Values and Indices Storing Data
All indices are capable of indexing not only the raw data ranges (i.e., the availability of data), but also the values or statistical aggregates (e.g., min, max, count). Values may also be indexed by an additional array dimension.
Including value indexing allows for additional data-dependent operators, i.e., pre-filtering that can be partially done in the index, effectively reducing the amount of raw data retrieved. Each index entry can also contain data projected onto a plane perpendicular to the sphere. This does not have to be a complete projection of the data; it can instead consist of value aggregates and a bitmask representing a rough outline of the underlying complete data. This is a form of regridding with a variable projection plane suitable for spheres.

5. USE CASE SCENARIO: SELECT QUIKSCAT DATA GIVEN TROPICAL CYCLONE TRAJECTORY
We demonstrate the spatiotemporal query capabilities on an example of data selection along a moving object: select QuikSCAT data along a given cyclone trajectory, with a lat-lon radius of 1.0 degree from the cyclone eye and a time span of [-3, +21] hours. The cyclone trajectory is loaded into SciDB as a list of points in the format <lat, lon, time>[i]. The trajectory is preprocessed into CSV and loaded directly from CSV into SciDB. Points of the trajectory are interpolated so that the hyperrectangles around the interpolated points completely cover the neighborhood. Note that all the processing is done within SciDB, which also demonstrates the computational capabilities and the ease of integration of our solution.

5.1 Retrieving Data Regions: Pointers into the Original Data
Data is selected from a Latitude-Longitude-Time index array given the mask (as a list of hyperrectangles) using the cross_between operator. Figure 3 shows an example of a trajectory mask, which is used as an argument for the index query.

Figure 3: Trajectory and a corresponding mask, a sequence of hyperrectangles on the latitude, longitude and time dimensions.

Retrieved regions are processed to simplify the raw data retrieval. We want the region specification to be as simple as possible. Overlapping regions are joined together into a single, longer region.
Regions spanning a very long swath area (crossing distinct physical chunks in the database) are split. Only the basic functions sort, filter and uniq, applied along swath and along, are used for the region simplification. Regions can then be adjusted based on the visualization or data processing objective required by the user. Figure 4 shows the regions retrieved from the index, with each region specifying an area in the original QuikSCAT swath. The regions are stored in a list in the format <swath_start, swath_end, along_start, along_end>[i].
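The region simplification described above (sort, remove duplicates, join overlapping regions, split regions that cross physical chunks) can be illustrated outside SciDB with a short Python sketch. The function names are hypothetical, and two points are assumptions rather than statements from the text: regions are treated as inclusive ranges of (swath, along) positions in lexicographic order, and splits happen at the swath chunk boundary of the QuikSCAT_D3 schema (chunk length 2).

```python
from typing import List, Tuple

Position = Tuple[int, int]          # (swath, along), ordered lexicographically
Region = Tuple[Position, Position]  # inclusive (start, end)

ALONG_MAX = 3247    # last along-track index in the QuikSCAT_D3 schema
SWATH_CHUNK = 2     # swath chunk length in the QuikSCAT_D3 schema


def merge_regions(regions: List[Region]) -> List[Region]:
    """Sort regions, drop duplicates, and join overlapping ones."""
    merged: List[Region] = []
    for start, end in sorted(set(regions)):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if necessary.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def split_by_chunk(region: Region, chunk: int = SWATH_CHUNK) -> List[Region]:
    """Cut a region at swath-chunk boundaries so each piece stays in one chunk."""
    (s0, a0), (s1, a1) = region
    pieces: List[Region] = []
    start = (s0, a0)
    while start[0] // chunk < s1 // chunk:
        last_swath = (start[0] // chunk + 1) * chunk - 1
        pieces.append((start, (last_swath, ALONG_MAX)))
        start = (last_swath + 1, 0)
    pieces.append((start, (s1, a1)))
    return pieces


def simplify(regions: List[Region]) -> List[Region]:
    """Merge overlapping regions, then split the result per swath chunk."""
    return [piece for r in merge_regions(regions) for piece in split_by_chunk(r)]
```

After this step every region touches a single swath chunk, which matches the goal of retrieving raw data without inter-chunk dependencies.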

Figure 4: Swath regions as retrieved from the index query.

Raw QuikSCAT data is retrieved from the data arrays based on the regions. Regions spanning at most 1 chunk allow for fast data retrieval without inter-chunk data dependency. This is very effective for the data processing, as no data have to be transferred between the individual nodes.

5.2 Processing Raw Data for Visualisation
For the purpose of visualization, the following steps are taken.

First, we redimension the raw data from [swath, along, cross] to [i, lat, lon] dimensions. Here, i represents the i-th swath section regridded to latitude and longitude dimensions, i.e., it can be seen as an image. Each swath section contributes to exactly one image. During the regridding process, multiple raw data points may end up in the same cell, depending on the target resolution. Such conflicting data are aggregated into their respective cells using avg().

We then smooth the data and fill in small gaps. A new auxiliary array is created using the regrid operator (which creates a single cell from a grid of subarrays) with the average aggregate, followed by xgrid (the opposite of regrid). The purpose of this array is to fill in a neighborhood of cells in the original array, as SciDB cannot directly compute window aggregates into empty cells. The original array is then merged into the auxiliary array, and a window aggregate (average) is computed to fill in the gaps in the original data. The auxiliary array is then merged into the original data array, effectively filling in the gaps.

We then interpolate cyclone centers based on swath time. Note that this computation would be simple with iterative processing. However, SciDB only natively supports array processing. We used the following sequence of steps to retrieve the interpolated cyclone centers:
Every image corresponds to a swath range. Assign a first and a last timestamp to every image, then compute the average timestamp for the region. Note that this is a simplification that may result in a non-negligible error compared to the real cyclone center.
Retrieve the cyclone trajectory data with the timestamp attribute. Redimension it so that the index of the point in the trajectory is an attribute p.
Merge the image and trajectory data, both with timestamp data. This gives a list (an array with a single dimension).
Perform a sort (on timestamp) followed by a cumulate operation on max(p), then reverse the list and do another cumulate, this time on min(p). The resulting list now has pointers p to both the preceding and the following point of the trajectory.
Join in the location and time data from the trajectory and interpolate the centers.
Filter out the trajectory-related data, keeping only the image-related data.

Note that a recently published framework for iterative processing within SciDB [11] could be used as a platform for some of the computation steps. To find the cyclone centers, one can alternatively analyze the data (wind speed and direction) to determine the center, similarly to [4, 10].

Finally, we filter data based on their distance from the cyclone center. Some examples of the resulting images are shown in Figure 5.

5.3 Example of Simple Index Query
This is an example of a single query that retrieves data pointers to QuikSCAT swath data from the index. For clarity, the schemas of the arrays participating in the query are listed as well. The query selects index data from an index array quikscat_index_lat_long_time based on the given mask interpolated_regions_array (a list of hyperrectangles).

    /* Region array schema */
    interpolated_regions_array
    <lat : float,
     lon : float,
     time : int64,
     lat_low : int64,
     lon_low : int64,
     time_low : int64,
     lat_high : int64,
     lon_high : int64,
     time_high : int64>
    [i0 = 0:97, 1000, 0]
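The region array above holds one hyperrectangle per interpolated trajectory point. The following Python sketch shows one way such rows could be derived from six-hourly best-track fixes; it is an illustration only, since the paper builds the mask inside SciDB. Several things are assumptions and not stated in the text: the helper names, a grid of 0.25 degree cells (suggested by lat=0:720 and long=0:1440), hourly time steps counted from 1999-01-01 (suggested by time=0:96432, which equals 4018 days of hours), longitudes in the range 0 to 360, and an hourly interpolation step.

```python
import math
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple

# Assumptions inferred from the index schema, not stated in the text:
CELLS_PER_DEGREE = 4            # lat=0:720, long=0:1440 -> 0.25 degree cells
EPOCH = datetime(1999, 1, 1)    # time=0:96432 -> hourly steps from this date
LAT_MAX, LON_MAX, TIME_MAX = 720, 1440, 96432

Fix = Tuple[float, float, datetime]   # (lat in [-90, 90], lon in [0, 360), time)


class Region(NamedTuple):
    lat: float
    lon: float
    time: int
    lat_low: int
    lon_low: int
    time_low: int
    lat_high: int
    lon_high: int
    time_high: int


def clamp(value: int, upper: int) -> int:
    return max(0, min(upper, value))


def interpolate(track: List[Fix], step_hours: int = 1) -> List[Fix]:
    """Linearly interpolate between consecutive best-track fixes."""
    out: List[Fix] = []
    for (lat0, lon0, t0), (lat1, lon1, t1) in zip(track, track[1:]):
        hours = int((t1 - t0).total_seconds() // 3600)
        n = max(1, hours // step_hours)
        for k in range(n):
            f = k / n
            out.append((lat0 + f * (lat1 - lat0),
                        lon0 + f * (lon1 - lon0),
                        t0 + f * (t1 - t0)))
    out.append(track[-1])
    return out


def region_for(fix: Fix, radius_deg: float = 1.0,
               hours_before: int = 3, hours_after: int = 21) -> Region:
    """Hyperrectangle in index coordinates around one trajectory point.

    Longitude wrap-around at 0/360 degrees is not handled in this sketch.
    """
    lat, lon, t = fix
    hour = int((t - EPOCH).total_seconds() // 3600)
    return Region(
        lat, lon, hour,
        clamp(math.floor((lat - radius_deg + 90.0) * CELLS_PER_DEGREE), LAT_MAX),
        clamp(math.floor((lon - radius_deg) * CELLS_PER_DEGREE), LON_MAX),
        clamp(hour - hours_before, TIME_MAX),
        clamp(math.ceil((lat + radius_deg + 90.0) * CELLS_PER_DEGREE), LAT_MAX),
        clamp(math.ceil((lon + radius_deg) * CELLS_PER_DEGREE), LON_MAX),
        clamp(hour + hours_after, TIME_MAX),
    )


def build_mask(track: List[Fix]) -> List[Region]:
    return [region_for(fix) for fix in interpolate(track)]


# Two hypothetical fixes six hours apart.
track = [(20.0, 290.0, EPOCH + timedelta(hours=100)),
         (23.0, 287.0, EPOCH + timedelta(hours=106))]
mask = build_mask(track)
```

Whether an hourly step is dense enough for the hyperrectangles to cover the whole neighborhood depends on the storm's translation speed and the chosen radius; a faster storm or a smaller radius would call for a finer step.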

    /* Index (single level) schema */
    quikscat_index_lat_long_time
    <swath_begin : uint32,
     swath_end : uint32,
     row_begin : uint16,
     row_end : uint16>
    [lat = 0:720, 180, 0,
     long = 0:1440, 180, 0,
     time = 0:96432, 30, 0]

    /* Trajectory selection query on index */
    project(
      unpack(
        apply(
          cross_between(
            quikscat_index_lat_long_time,
            project(interpolated_regions_array,
                    lat_low, lon_low, time_low,
                    lat_high, lon_high, time_high)),
          swath_low, int64(swath_begin),
          swath_high, int64(swath_end),
          along_low, int64(row_begin),
          along_high, int64(row_end)),
        idx),
      swath_low, along_low, swath_high, along_high)

    /* Resulting regions array */
    <swath_low : int64,
     along_low : int64,
     swath_high : int64,
     along_high : int64>
    [idx = 0:*, , 0]

Figure 5: Processed images clipped according to the distance from the interpolated cyclone center.

Table 1: Timing results of a query - Select QuikSCAT Data Along Trajectory
    Cyclone                  Isabel  Isabel  Isabel  Isabel
    Data points              [...]
    Radius [deg]             [...]
    Span [hours]             [...]
    Index query time [sec]   [...]

6. DISCUSSION: TIME COMPLEXITY
Individual operations on the index arrays that do not require data transfer between nodes, i.e., lookup, cross_between and filtering, are all very fast. The theoretical time complexities of the array operations used are at most linear (in the size of the array), with an expected linear parallel speedup. This assumes asymptotically lower initial and coordination overhead, sufficient network speed for data retrieval (if needed) and fine enough chunk granularity (i.e., not all data in a single big chunk). Note that, assuming the chunk allocation scheme (i.e., the scheme deciding on which node each chunk is placed) respects the neighborhood by allocating nearby chunks to different physical nodes, we estimate that the physical locations of chunks retrieved after the operations on the index arrays will be more or less random.
For a random allocation of the chunks to physical machine of both index and data arrays, the speedup is linear in expectancy with high probability, the speedup S c n for some desired constant 0 c 1. Table 1 shows the timing results of the data retrieval query described in detail in Section 5: Note that the time scales mainly with the number of data points of the trajectory. This is due to the fact that each data point is more likely to hit a brand new chunk in both the index arrays and the raw data arrays. Increasing the radius (space) 2 times, i.e., the total area is increased 4 time, has minimal effect on the time. Based on

the level of the index, the probability of hitting additional index chunks that need to be retrieved increases with lower-level indices (denser mesh). The same goes for the span (time). If we had prior knowledge of our queries, we could adapt the chunking scheme of the index array to accommodate more spatial or more temporal data in a single chunk, thus improving performance for spatially or temporally intensive queries, respectively. However, in our data retrieval query example we kept the ratio balanced.

We ran SciDB on a single physical server: Intel Xeon E v3 2.6GHz, 20M Cache, 8.00GT/s QPI, 8 cores; 8x16GB RAM, 2133 MT/s; 4x1TB 7.2K RPM SATA 6Gbps. There are 4 virtual machines, each with 2 cores, 1 hard drive, and 16GB RAM, running Ubuntu server ( kernel). Timing was measured as the average of 3 runs with a cold start of the virtual machines. Note that the main server was kept running, which might have introduced some bias. However, the variation between individual measurements was negligible.

7. CONCLUSIONS AND FUTURE WORK
In this paper, we described our preliminary work on using the distributed array-based SciDB database management system to support satellite data-driven climate science research. The system provides scientists with previously unavailable accurate and efficient query and retrieval of Earth science satellite sensor data, based on user-defined criteria, to study and analyze atmospheric events such as tropical cyclones and mesoscale convective systems (MCS). In particular, we ingested the complete ten-year QuikSCAT ocean surface wind fields satellite data into the SciDB database system for fast and accurate data retrieval based on tropical cyclone trajectories. Moreover, all the processing was done in SciDB, which was used as an all-in-one solution.
Except for the transformation of the QuikSCAT data into SciDB's binary format, we used the array database to perform all the steps:
- Store QuikSCAT data: overall, 500 GB of compressed data was stored in 8 virtual SciDB instances.
- Create spatiotemporal indices with incremental construction, various grid sizes, various chunking schemes, and other configuration variations.
- Perform spatiotemporal queries with a focus on a demonstrative use case: selection of QuikSCAT data within a radius and span of a cyclone trajectory.
- Process the results within SciDB and visualize them.

We will focus our future research on experimentation with different indices, including multidimensional hierarchical indices (based on multidimensional trees), indices containing data aggregates, data samples, and so on. We will also implement a spatiotemporal library (indices, functions, operators, predicates) using a combination of multiple extension points, such as user-defined types, functions, operators, and convenience functions written in a higher-level language, i.e., the Python interface for SciDB.

8. ACKNOWLEDGEMENT
This research was supported in part by AcRF Grant RG- 18/

9. REFERENCES
[1] P. G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM.
[2] P. Cudré-Mauroux. SciDB: an open-source, array-oriented database management system. Massachusetts Institute of Technology / University of Fribourg, Switzerland.
[3] R. Güting and M. Schneider. Moving Objects Databases. Morgan Kaufmann Publishers.
[4] S.-S. Ho, W. Tang, W. T. Liu, and M. Schneider. A framework for moving sensor data query and retrieval of dynamic atmospheric events. In Scientific and Statistical Database Management. Springer.
[5] Y. M. Kodama and T. Yamada. Detectability and configuration of tropical cyclone eyes over the western North Pacific in TRMM PR and IR observations. Monthly Weather Review, 133.
[6] T.
Lungu and P. Callahan. QuikSCAT science data product user's manual: Overview and geophysical data products. D Rev A, version, 3:91.
[7] G. Planthaber, M. Stonebraker, and J. Frew. EarthDB: scalable analysis of MODIS data using SciDB. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. ACM.
[8] G. L. Planthaber Jr. MODBASE: A SciDB-powered system for large-scale distributed storage and analysis of MODIS earth remote sensing data. PhD thesis, Massachusetts Institute of Technology.
[9] M. Schneider, S.-S. Ho, M. Agrawal, T. Chen, H. Liu, and G. Viswanathan. A moving objects database infrastructure for hurricane research: Data integration and complex object management. In Earth Science Technology Forum.
[10] M. Schneider, S.-S. Ho, T. Chen, A. Khan, G. Viswanathan, W. Tang, and W. T. Liu. Moving objects database technology for ad-hoc querying and satellite data retrieval of dynamic atmospheric events. In Earth Science Technology Forum.
[11] E. Soroush, M. Balazinska, S. Krughoff, and A. Connolly. Efficient iterative processing in the SciDB parallel array engine. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, page 39. ACM.
[12] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The architecture of SciDB. In Scientific and Statistical Database Management. Springer.
[13] W. Tang and W. T. Liu. Dependence of hurricane asymmetry and intensification on translation speed revealed by a decade of QuikSCAT measurements. In NASA Ocean Vector Wind Science Team Meeting, Boulder, Colorado.
[14] C. Yokoyama and Y. N. Takayabu. A statistical study on rain characteristics of tropical cyclones using TRMM satellite data. Monthly Weather Review, 136, 2008.