A New Strategy of Storage & Retrieval for Massive Remote Sensing Data




A New Strategy of Storage & Retrieval for Massive Remote Sensing Data Based on Embedded Database Files

1,2,3 YUAN Fa-jie, 1,2,3 GAO Wei, 2,3,4* HUANG Xiang-zhi, 1 HUANG Fang, 2,3 YU Tao, 2,3,5 ZHU Yan-na

1 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 State Key Laboratory of Remote Sensing Science, Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing 100101, China
3 Demonstration Center for Spaceborne Remote Sensing, Beijing 100101, China
4 Zhejiang Provincial Key Lab of GIS, Zhejiang University, Hangzhou 310028, China
5 Geomatics School, Liaoning Technical University, Fuxin 123000, China
yuan2961634811@163.com, hxz_gis_china@hotmail.com

Abstract

Drawing on existing storage management systems for massive remote sensing images at home and abroad, this paper proposes a new distributed storage strategy and parallel retrieval mode based on lightweight embedded database files. To meet the demands of massive remote sensing data management, the original image data is first cut into standard tiles according to the Five-layer-Fifteen-level data organization scheme; the tiles are then stored in a distributed fashion in the disk directories of the storage nodes, and the metadata of these standard tiles is likewise stored distributedly, in lightweight embedded database files on the corresponding storage nodes rather than in a large-scale relational database. A large number of retrieval efficiency tests were run against this storage scheme. The results indicate that retrieval based on embedded database files is far more efficient than retrieval based on popular large-scale client/server relational database engines or on a pure file system. The comprehensive scheme proposed here provides a new strategy for the storage and management of petabyte-scale remote sensing data.
Key words: Massive Remote Sensing Data, Embedded Database, Distributed Storage, Parallel Retrieval

International Journal of Advancements in Computing Technology (IJACT), Volume 4, Number 18, October 2012. doi:10.4156/ijact.vol4.issue18.47

1. Introduction

With the rapid development of satellite observing systems, remote sensing image data is growing explosively. As a high-resolution satellite information processing system operates, the volume of stored remote sensing raster image data can reach hundreds of terabytes or even petabytes. How to store and manage massive remote sensing data more efficiently, establish a unified standard, and achieve rapid sharing and distribution of remote sensing information has become one of the key issues of spatial information science [1].

Scholars both at home and abroad have done much work on the storage and management of massive images and have introduced several database management systems for them. Representative foreign systems include Google Earth/Maps [1,2], developed by Google Inc.; the geographical science software World Wind [3] and the EOS earth observation system [4], released by NASA; Bing Maps, an online mapping service platform released by Microsoft [5]; and the EyeStroNext data management system designed by GeoEye [6]. In recent years China has also begun research on and construction of image data storage and management systems, among them Map World, the China Centre for Resources Satellite Data and Application (CCRSDA), and the National Satellite Ocean Application Service (NSOAS). Analyzing the storage modes above: Google Earth/Maps, World Wind, EOS, and GeoEye are all based on file system management; Bing Maps and TerraServer are based on large-scale relational database management; and the storage structures of Map World, CCRSDA, and NSOAS are hybrids of the file system and large-scale relational databases [1-7].

Referring to the technologies of these massive remote sensing data storage management systems, this paper designs a new storage strategy based on lightweight embedded database files. The basic idea is that the original images are first cut into a series of standard tiles according to a segmentation standard that combines the spatial characteristics and application requirements of remote sensing images; the standard tiles are then stored in disk file directories on the distributed storage nodes in accordance with the storage specification, while the tile attribute information (metadata), which serves as an index, is stored in a corresponding SQLite embedded database file on the same node. This scheme exploits the advantages of embedded database files and the parallelism of the distributed system, achieving low-cost storage and efficient retrieval.

2. Brief introduction of embedded databases

The name "embedded database" comes from its unique mode of operation. An embedded database has no stand-alone database engine; it is usually integrated with an embedded operating system and a specific application, and data access is performed directly by the program through the appropriate API [10]. Compared with common databases, an embedded database is small, full-featured, portable, and robust. This research uses SQLite embedded database files. SQLite, developed by D. Richard Hipp as a small C library, is an embedded relational database management system. Its architecture can be divided into eight major subsystems [8], as shown in figure 1. SQLite supports most of SQL92, including multiple tables, indices, transactions, triggers, and views.
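As a minimal illustration of this zero-configuration model (using Python's standard sqlite3 module, which is not part of the paper's system), an embedded database can be created, populated, indexed, and queried without any server process:

```python
import sqlite3

def demo_embedded_db():
    """Create an embedded SQLite database, insert tile metadata, query it.

    ":memory:" is used so the example leaves no file behind; a complete
    database would otherwise live in a single cross-platform disk file.
    """
    conn = sqlite3.connect(":memory:")  # zero configuration, no server
    conn.execute("CREATE TABLE tiles (row INTEGER, col INTEGER, level INTEGER)")
    conn.executemany("INSERT INTO tiles VALUES (?, ?, ?)",
                     [(241, 599, 5), (242, 599, 5)])
    conn.execute("CREATE INDEX idx_rc ON tiles(row, col)")  # indices supported
    (count,) = conn.execute("SELECT COUNT(*) FROM tiles WHERE level = 5").fetchone()
    conn.close()
    return count
```

The same pattern, with the database in an ordinary file, is all the storage nodes in this paper's scheme require.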
In addition, SQLite is a compact library: with all features enabled, the library size can be less than 350 KB, depending on the target platform and compiler optimization settings. SQLite also has the following features: zero configuration, with no setup or administration needed; a complete database stored in a single cross-platform disk file; no external dependencies, giving strong cross-platform portability; ACID support; faster performance than popular client/server database engines for most common operations; support for terabyte-sized databases and gigabyte-sized strings and blobs; and free use for any purpose [8].

Figure 1. Block diagram of SQLite

3. Distributed strategy based on embedded database files

A simple distributed prototype system was designed to illustrate the principle of this research, as shown in figure 2. The system is based on the client/server mode and consists of three logically independent layers. The first layer, the presentation layer, is composed of a number of clients providing a graphical user interface (GUI) and is mainly used for uploading and searching the original data. The second layer is the operational layer, consisting of a central node server and a preprocessor; it is mainly used for block preprocessing of original images, data transfer, transaction processing, order monitoring, data flow control, and so on. The third layer is the data layer, composed of several PCs serving as storage nodes, and is used for the management and maintenance of data; the original data, the tile data, and the attribute information that acts as an index are all stored in this layer.

Figure 2. Distributed strategy based on the embedded database files

4. Storage strategy based on embedded database files

4.1 Data pretreatment on the basis of application requirements

4.1.1 The necessity of block preprocessing of original image data

At present, image publishing services and standardized production are the most important application directions for remote sensing images. A study of data storage and management technologies shows that multi-resolution pyramid image blocking is an effective way to realize real-time zooming display and high-speed serving of remote sensing images. In practice, the growth of computer memory always lags behind the growth of data volume; reading all the data into memory and running complex algorithms over it is infeasible on an ordinary PC and causes slow system response. Image blocking is therefore also necessary from the standpoint of making effective use of computing resources.

4.1.2 Preprocessing method based on block technology

According to the service demands of remote sensing data, this system adopts a block standard named Five-layer-Fifteen-level [9], shown in table 1.
The basic idea of this method is to cut the original remote sensing data into standard block (tile) data organized as a pyramid over a grid of longitude and latitude, according to the required resolution, and then to generate the pyramid layers from bottom to top by sampling. Each of the pyramid layers has three levels; the resolution ratio of the levels within a layer is 5:2.5:1, the ratio between layers is 10:1, and each tile has a size of 1000 × 1000 pixels. Finally, all tiles are identified by row and column number instead of latitude and longitude. Compared with existing hierarchical segmentations such as World Wind's, this scheme requires no projection; the block size at each level is a simple bounded floating-point number, and position can be determined from simplified latitude and longitude coordinates, which facilitates data organization and management and ensures the seamless splicing of multiple images when displayed on the sphere.
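The mapping from geographic position to tile row and column can be sketched as follows. The grid origin and per-level tile sizes used here are illustrative assumptions only; the exact Five-layer-Fifteen-level parameters are defined in [9]:

```python
def tile_row_col(lat_deg, lon_deg, tile_size_deg):
    """Return the (row, column) of the tile containing a point.

    Assumes a regular latitude/longitude grid with its origin at
    (-90, -180) and square tiles tile_size_deg degrees on a side;
    each tile is 1000 x 1000 pixels and is identified by (row, column)
    rather than by latitude/longitude, as described above.
    """
    row = int((lat_deg + 90.0) / tile_size_deg)
    col = int((lon_deg + 180.0) / tile_size_deg)
    return row, col
```

For example, under a hypothetical 1-degree tile size, a point at (30.5 N, 104.0 E) falls in tile row 120, column 284.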

4.2 Storage procedure

Table 1. Five-layer-Fifteen-level image segmentation standard

The system uses a storage method combining disk directories with embedded database files: image data and image attribute information are stored separately, with the original images and tiles in disk directories and the attribute data of originals and tiles in the embedded databases. Both kinds of data are stored in a distributed structure. The basic steps are as follows:

1. The client uploads original image data to the center node server.

2. The center node server sends the original data to the disk directory of the storage node for original data, and at the same time stores the attribute data of the original data in the embedded database file kept in that disk directory. The center node also sends a message to the preprocessor so that it can download the original data; after registration and correction, the preprocessor cuts the original data into standard tiles according to the Five-layer-Fifteen-level image segmentation standard, then notifies the center node server and uploads the tiles to the center node.

3. The center node server sends the standard tiles to the storage nodes according to a storage rule based on each tile's row and column number. The rule is:

Mod(TileRowNUM, TileColumnNUM) % MaxNode

If the number of storage nodes is n (n <= MaxNode), the number of remainders assigned to each node is N, which lies between (INT)MaxNode/n and (INT)MaxNode/n + 1, and adjacent remainder numbers in the same disk directory of a node differ by n.

For example, suppose the maximum number of storage nodes is set to 255 but the actual number of storage nodes is eight; each node is then assigned 31 or 32 remainders. The first storage node is assigned remainders 0, 8, 16, ..., 248; the second is assigned 1, 9, 17, ..., 249; and by the same rule the eighth storage node is assigned 7, 15, 23, ..., 247.

4. Each disk directory of a storage node contains one or more folders named by remainder number, called IPMod-directories, each holding a SQLite embedded database. When tile data is stored, the attribute information (metadata) of the tiles is stored in that embedded database file. The metadata of each tile in this study mainly includes the serial number of the original image, satellite type, sensor type, date, tile level, row and column number of the tile, and tile type (the four bands, azimuth angle, zenith angle, day, and quick view).
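The remainder-based assignment of step 3 can be sketched as follows. Note one assumption: the text does not spell out how row and column are combined inside Mod(), so a simple sum is used here for illustration:

```python
def storage_node_for_tile(row, col, n_nodes, max_node=255):
    """Assign a tile to a storage node by its remainder number.

    (row + col) is an ASSUMED combination for the unspecified
    Mod(TileRowNUM, TileColumnNUM); remainder r is held by node
    r % n_nodes.
    """
    remainder = (row + col) % max_node
    return remainder % n_nodes

def remainders_of_node(node_id, n_nodes, max_node=255):
    """All IPMod-directory numbers assigned to one storage node.

    With max_node = 255 and n_nodes = 8, node 0 holds remainders
    0, 8, ..., 248 (32 of them) and node 7 holds 7, 15, ..., 247 (31).
    """
    return [r for r in range(max_node) if r % n_nodes == node_id]
```

This reproduces the worked example in the text: with 255 possible remainders spread over eight nodes, each node receives 31 or 32 remainders, spaced eight apart.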

4.3 Storage and naming convention of tiles

To facilitate tile data retrieval, the tile storage path and the tile naming specification must be associated with the metadata of each tile stored in the embedded database files; that is, a mapping must be established between the tile storage path, the tile name, and the attribute fields of the data table. Taking HJ-1 satellite data as an example, the storage path specification for tiles is: disk directory \ IPMod-directory \ tile level \ sensor type \ year \ month \ day \ satellite type \ serial number of original data \ tile type; the storage tree structure is shown in figure 3. The tile naming convention is: satellite _ sensor _ time _ product serial number of original data _ tile level _ row number _ column number _ tile type . image extension.

Figure 3. Tile storage directory

5. Research of retrieval schemes

5.1 Design of retrieval schemes

Under the storage structure above, all standard tiles and attribute data are stored on a specific computer cluster in distributed mode. In addition, tile data is stored level by level according to the naming convention, so tile storage addresses follow a predictable rule. On this basis, this research examines two traditional retrieval schemes and runs retrieval efficiency tests against our scheme based on SQLite embedded database files, in order to show that our scheme is considerably more efficient.

Scenario 1: Retrieval based on a large-scale client/server relational database such as Oracle, DB2, or Microsoft SQL Server, which is the current mainstream approach. The basic idea is that retrieval is performed over image paths and other metadata stored in database tables: one simply composes SQL from the search conditions, and all matching records are retrieved. Take Oracle as an example. Although it supports a distributed structure, realizing a distributed database in an actual project is not easy, owing to complex installation, configuration, and maintenance of the operating environment and very expensive software. Most organizations adopting this retrieval scheme therefore in practice use distributed storage only for the image data, while storing the image attribute data centrally in a database table on a stand-alone or single-core machine. The retrieval tests in this study likewise use conventional centralized storage of metadata for this scenario.
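Scenario 2 below walks the Section 4.3 directory tree and parses such file names. As a concrete sketch of that mapping (the metadata field names here are hypothetical; the authors' actual schema is not given field by field):

```python
import os

def tile_file_name(m):
    """Build a tile file name from its metadata, following the example
    HJ1A_CCD1_20111212_L20000521090_5_241_599-Azimuth.Tif: satellite _
    sensor _ time _ serial _ level _ row _ column - tile type . extension.
    """
    return "{sat}_{sen}_{date}_{ser}_{lvl}_{row}_{col}-{typ}.tif".format(
        sat=m["satellite"], sen=m["sensor"], date=m["date"], ser=m["serial"],
        lvl=m["level"], row=m["row"], col=m["col"], typ=m["tile_type"])

def tile_storage_path(m, disk_dir, ipmod_dir):
    """Build the storage path: disk directory \\ IPMod-directory \\ level
    \\ sensor \\ year \\ month \\ day \\ satellite \\ serial \\ tile type
    (figure 3).
    """
    date = str(m["date"])
    return os.path.join(disk_dir, ipmod_dir, str(m["level"]), m["sensor"],
                        date[:4], date[4:6], date[6:8],
                        m["satellite"], m["serial"], m["tile_type"])
```

Because both functions are pure string operations over the metadata record, the same mapping can run in either direction: building a path when a tile is stored, or parsing a file name back into metadata during retrieval.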

Scenario 2: Retrieval based entirely on the file system. Referring to the storage structure of figure 3, the basic idea is that retrieval starts from the root directory and proceeds through the subdirectories: whenever a subdirectory matches the search criteria, the search recurses down level by level until it reaches the files. This resembles a tree traversal, with folders as internal nodes and files as leaves. In this study, for example, the user enters retrieval conditions that mainly include the longitude/latitude range, satellite type, sensor type, tile level, time, and tile type. The center node server computes the tile row and column numbers and the IPs of the storage nodes from the longitude and latitude, then notifies the corresponding nodes; each storage node checks whether each directory name matches the retrieval conditions, traverses level by level to the bottom, and obtains the file names in the bottom directory, such as HJ1A_CCD1_20111212_L20000521090_5_241_599-Azimuth.Tif. It then only needs to judge whether the tile's row and column numbers lie in the search range to determine whether the tile meets the retrieval requirements.

Scenario 3: Distributed multi-machine parallel retrieval based on SQLite embedded database files, which is low-cost and easy to implement compared with a distributed large-scale relational database. As shown in figure 4, the core retrieval structure comprises a center node server, which receives query requests from users, and many backend storage nodes that perform the actual retrieval. Retrieval is implemented by maintaining index files for the massive tile data instead of traversing large numbers of disk directories. The basic search procedure is as follows:

1. The client submits a query request to the center node server.

2. The center node server composes the corresponding SQL statement from the query conditions, then notifies each storage node and delivers the SQL over the TCP/IP protocol. A complex SQL statement might be: SELECT * FROM TABLENAME WHERE (ROW >= TILE1ROWNUM AND ROW <= TILE2ROWNUM) AND (COLUMN >= TILE1COLUMNNUM AND COLUMN <= TILE2COLUMNNUM) AND (DATE BETWEEN 20090101 AND 20111230) AND (SATELLITE = 'HJ1A') AND (SENSOR = 'CCD2' OR SENSOR = 'CCD1') AND (TYPE = 'BAND1' OR TYPE = 'AZIMUTH' OR TYPE = 'ZENITH') AND (TILELEVEL = 5).

3. The storage nodes search their own embedded database files in their disk directories according to the SQL, and the full paths of the images are derived from the retrieved metadata in accordance with the storage specification. To suppress the impact of communication speed as much as possible, result records are returned to the center node with paged feedback: the center node server fetches the next page from the retrieval result set when a user requests it.

4. The center node collects the result records from all storage nodes and presents them to the users page by page.

Figure 4. Retrieval framework based on the distributed multi-machine parallel method

5.2 Comparative analysis of experimental test results

5.2.1 Experimental environment

The network is a 100 Mbps LAN. We tested twice, first with four storage nodes and then with eight; in both cases there were one central node server and four clients. All machines have the same hardware and software configuration, as follows:
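The fan-out/merge pattern of steps 2-4 can be sketched as follows. In the real system the SQL travels over TCP/IP to remote nodes; this simplified sketch instead opens several local SQLite files in parallel, and omits paging:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query_node(db_path, sql, params=()):
    """One storage node: run the retrieval SQL against its embedded
    database file and return all matching records."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

def parallel_retrieve(db_paths, sql, params=()):
    """Center node: deliver the same SQL to every storage node in
    parallel and collect all result records before presenting them."""
    with ThreadPoolExecutor(max_workers=len(db_paths)) as pool:
        per_node = pool.map(lambda p: query_node(p, sql, params), db_paths)
    return [row for rows in per_node for row in rows]
```

Because each node scans only its own (small) database file, adding nodes shrinks the per-node work, which is the source of the speedup reported in Section 5.2.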

CPU: Intel(R) Core(TM) 2 Quad Q8300 @ 2.5 GHz; physical memory: 4 GB; hard disk: 500 GB, 7200 RPM; operating system: Windows XP Professional, 32-bit.

5.2.2 Test results

Affected by SQL statement complexity and systematic error, retrieval times varied across runs by about 3 seconds. The times reported below cover only CPU computation and writing the result into the cache, not display time. The results are as follows:

Table 2. Scenario 1 test results based on the Oracle database
Table 3. Test results of a single SQLite database file
Table 4. Scenario 3 parallel test results with four storage nodes
Table 5. Scenario 3 parallel test results with eight storage nodes
Figure 5. Comparison of test efficiency
Figure 6. Single-storage-node comparison of test efficiency

5.2.3 Results analysis

The Scenario 1 test results are shown in table 2. The relationship between the number of records (in ten thousands) and the retrieval time is clearly nearly linear, with slope k ≈ 2. Once the data reaches a certain scale, retrieval efficiency may therefore become a bottleneck; to improve database retrieval speed at that point, many indices must be deployed and the hardware configuration upgraded substantially, which raises equipment cost. Clearly, a retrieval system based on a stand-alone or single-core machine cannot meet the efficiency requirements for millions or even billions of records.

Scenario 2 is simple to operate, requires no database support, and is very cheap to build. Its efficiency is high for precise retrieval conditions that include file names, but as the number of data files grows, retrieval becomes increasingly slow, especially for fuzzy retrieval, where the processor must sometimes traverse the whole subdirectory space. As shown in figure 3, even if each disk directory has only one IPMod-directory, traversing all directories can take nearly a million steps, which costs considerable time. Tests show that when the retrieval data on each storage node reaches one hundred thousand records, retrieval takes several minutes, which is unsatisfactory.

The Scenario 3 results are shown in tables 4 and 5. Figure 5 shows that its efficiency is higher than that of the other two solutions when retrieving large quantities of data, improving from minutes to seconds, and the overall efficiency of parallel search increases with the number of nodes. Owing to limited resources it was difficult to reach the design scale of a truly massive data test, so efficiency could only be tested up to a certain extent. Figure 6 compares the retrieval efficiency of a single SQLite file with the per-node retrieval efficiency under the multi-node parallel test; this comparison is made for theoretical analysis only.
In this comparison each node was assigned only one IPMod-directory, i.e., one SQLite database file per node. The retrieval times are very close when the number of records retrieved per storage node equals the number retrieved from the single SQLite database file. Accordingly, we can infer that, when each storage node holds only one SQLite database file, the per-node retrieval efficiency under multi-node parallel retrieval is approximately equal to the retrieval efficiency of a single SQLite file (ignoring system error). It follows that if two hundred million tile records were stored in this system under the distributed architecture above with its full 255 storage nodes, each embedded database file would on average be assigned about eight hundred thousand records, because the Mods of the tiles are continuous. Table 3 shows that retrieving all eight hundred thousand records from one file takes about 20 seconds, whereas retrieving two hundred million tile records under Scenario 1 would take minutes or even hours, memory overflow aside. Experiment and theoretical analysis together show that the retrieval framework of Scenario 3 is better than that of Scenario 2, and that the retrieval efficiency of Scenario 3 is much higher than that of Scenarios 1 and 2.

6. Conclusion

According to the characteristics of massive remote sensing image data and the needs of image management, this paper, building on previous work, puts forward a distributed storage architecture based on embedded database files together with a retrieval scheme based on the multi-machine parallel mode. Testing of the prototype system shows that storing massive remote sensing image data in distributed lightweight embedded database files instead of a large-scale relational database is feasible.
In an actual project, this strategy may need expansion and optimization, such as increasing the number of central servers and storage nodes, changing the distributed strategy, or increasing data redundancy, in order to optimize the system's storage and retrieval. The main contributions of this study are as follows: (1) it introduces, for the first time, embedded databases into the storage and management of massive remote sensing data, and designs a distributed structure on a simple principle for preliminary study; (2) the retrieval scheme based on embedded database files is much more efficient than one based on a large-scale client/server relational database or on a pure file system; (3) by breaking away from large-scale relational databases, the SQLite-file-based scheme greatly reduces system cost, so the result of this research has significant commercial value; (4) the comprehensive scheme presented provides new technical support for the storage and management of massive remote sensing data. As a next step we will further research and test the security, stability, and data recovery capabilities of the scheme, as well as its capacity for concurrent access by many users.

7. Acknowledgements

This work is supported by the Civil Aerospace Twelfth Five-Year Project (Grant no. Y1K00200KJ), the National Nature Science Foundation of China (Grant no. 41105017), and the integrated geo-spatial information technology and its application to resource and environmental management towards the GEOSS (Grant no. FP-7 2009). I would also like to offer my deep appreciation to Professor Gu Xing-fa, my tutor, for his precious guidance in my academic life.

8. References

[1] LÜ Xuefeng, CHENG Chengqi, "Review of data storage and management technologies for massive remote sensing data", Science China Technological Sciences, Science China Press, vol. 54, no. 12, pp. 3220-3232, 2011.
[2] GIBIN M, Singleton A, "An exploratory cartographic visualization of London through the Google Maps API", Applied Spatial Analysis and Policy, vol. 1, no. 2, pp. 85-97, 2008.
[3] BELL D G, Kuehnel F, Maxwell C, "NASA World Wind: open source GIS for mission operations", In Proceedings of the IEEE Aerospace Conference, pp. 1-9, 2007.
[4] ESFANDIARI M, Ramapriyan H, "Earth Observing System (EOS) Data and Information System (EOSDIS) Evolution Update and Future", In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2007), Barcelona: IEEE, pp. 40005-40008, 2007.
[5] MICHAEL Levin, "Certain Finite Dimensional Maps and Their Application To Hyperspaces", Israel Journal of Mathematics, Hebrew University Magnes Press, vol. 105, no. 1, pp. 257-26, 1998.
[6] YU Haiyang, CHENG Gang, "Object Oriented Land Cover Classification Using ALS and GeoEye Imagery over Mining Area", Transactions of Nonferrous Metals Society of China, Science Press, vol. 21, pp. s733-s737, 2011.
[7] MITCHELL A, Ramapriyan H, "Evolution of Web services in EOSDIS: search and order metadata registry (ECHO)", In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2009), pp. 371-374, 2009.
[8] SQLite home page, http://www.sqlite.org/arch.html
[9] GU Xingfa, YU Tao, XIE Donghai, "A hierarchically organized method based on the latitude and longitude grid data" (in Chinese), publication number CN102346923.
[10] Wei-Chih Hsu, Tsan-Ying Yu, "E-mail Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection", JCIT, vol. 5, no. 8, pp. 78-88, 2010.
[11] Shifei Ding, Li Xu, Hong Zhu, Liwen Zhang, "Research and Progress of Cluster Algorithms based on Granular Computing", JDCTA: International Journal of Digital Content Technology and its Applications, vol. 4, no. 5, pp. 96-104, 2010.
[12] Ding Shifei, Qian Jun, Xu Li, Zhao Xiangwei, Jin Fengxiang, "A Clustering Algorithm Based on Information Visualization", JDCTA: International Journal of Digital Content Technology and its Applications, vol. 5, no. 1, pp. 26-31, 2011.