5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing 100044; 2. China Information Technology Security Evaluation Center, Beijing 100085) Abstract: With the development of national large-scale railway construction, massive railway information data emerge rapidly, and then how to store and manage these data effectively becomes very significant. This paper puts forward a method based on distributed computing technology to store and manage massive railway information data, builds massive railway information data storage platform by using the Linux cluster technology. This system consists of three levels including data access layer, data management layer, application interface layer, enjoying safety and reliability, low operation cost, fast processing speed, easy expansibility characteristics, which shall satisfy the massive railway information data storage requirement. Keywords: Massive Railway Information Data Storage; Hadoop Distributed Technology; Cluster System 0 Introduction The Medium and Long-Term Railway Network Plan has established the goal to finish railway construction in 2020, when the high standard and large-scale railway construction will appear in full swing and will generate huge amounts of railway information data. These data, being vast and complex, diverse, heterogeneous, dynamic, relate to various aspects such as railway geographic information, railway construction, railway operation and maintenance, and railway dispatching. However, the current situation is lack of the unified collection and storage criterion or standard, leading to the data island phenomenon. How to store and manage massive railway information data and how to make more efficient use of these data, become one of the key even the bottleneck projects in railway department, and that s what this paper is about. Traditional methods to deal with massive data mostly use distributed high performance computing and grid computing technology[1], which consume expensive computing resources, need tedious programming to realize effective segmentation of massive data and reasonable distribution of computing tasks. Fortunately, the new development of Hadoop distributed technology can solve these problems better[2]. Basing on Linux cluster technology and using the Hadoop distributed technology, this platform effectively processes the massive amounts of railway information data and stores them in the distributed database, which designs and implements an easily extended and effective massive data storage management system. In the section 2 of this article, the author gives an introduction to the massive data storage platform architecture. The section 3 focuses on the key technologies of system implementation needed. The section 4 illustrates the implementation and performance characteristics of the system. Finally, the section 5 presents the conclusions. Foundations: National Natural Science Foundation of China Grant (NO.61172072, NO.61271308), Beijing Natural Science Foundation Grant (NO.4112045), Research Fund for the Doctoral Program of Higher Education of China Grant (NO.20100009110002), Beijing Science and Technology Program Grant (NO.Z121100000312024). Brief author introduction:shan Xu, (1989-), Male, Computer Network. Correspondance author: WANG Genying, (1958-), Male, Associate Professor,Computer Network. E-mail: gywang@bjtu.edu.cn - 1 -
40 1 Platform Architecture 1.1 Platform overall framework 45 According to the actual needs, we employ the MVC three-tier framework for system design. The platform is divided into data resource layer, business logic layer and presentation layer. Each layer has a clear division of responsibilities, high cohesion and low coupling, making the structure more clear and easier to maintain. The platform overall Framework is shown in figure 1. Fig. 1 Platform Overall Framework 50 55 Data resource layer, composed of several storage nodes, storing and managing massive railway information data, is the foundation of the whole platform. Business logic layer, provides the parallel loading storage of the massive railway information data and the management and support services to ensure the normal operation of the system, is the core of the whole platform. Presentation layer, provide users with user-friendly interface, convenient for users to query the railway information data and extend system, which meets users directly. 1.2 Network topology of the platform 60 The platform adopts the distributed, hierarchical structure in the design, which stores the resources in multiple nodes in different parts of the cluster server. The main service node schedules and manages the distributed server cluster nodes in an unified way. With the data volume increasing and the complex application requirements changing, this platform can be easily extended, and the existing relational database can also be integrated into the platform[3]. And through the de-isomerization process, the platform and the existing relational database can jointly provide storage services for users, which serves the users transparently massive railway information data by storage and management functions. - 2 -
65 Fig. 2 Network Topology of the Platform 1.3 Overall functional design of the platform 70 According to the system function, the system can be divided into three layers, including data access layer, data management layer and application interface layer, as is shown in figure 3. Fig. 3 Overall Functional Design of the Platform 75 80 85 Data Access Layer, with the support of the upper data management, connects the storage nodes distributed in different parts through the local area network (LAN) and the Internet to form a distributed cluster system, which provides shielding for different sources of various databases, and provides database access by service function, only to provide a convenient management and deployment chance. Data Management Layer, after the huge amounts of data parallel processing, stores the processed data in the distributed database of this system, while it provides management support service to guarantee the system run normally. It is formed of six function modules, including system management module, log management module, parallel load module, load balancing module, storage module, parallel query module, backup recovery module. System management module is used for software management, including running state monitoring, remote system deployment and autonomous running and maintenance, etc. Log management module is used for software operation log management, including the system trajectories, key events and state records, - 3 -
90 95 100 etc.. Load balancing serves load balancing and fault tolerance of management of storage node. Parallel loading storage module provides parallel loading and storage for huge amounts of data. Parallel query module provides parallel query, user-defined function such as transaction processing. Backup recovery module provides system stored data backup management, backup storage, backup restore, etc[3]. Application Interface Layer, according to the actual business types, provides different application service interface, and provides various application service on the basis of user authentication to meet the needs of different users. 2 The key Technologies of the Platform Development According to the front introduction of Hadoop and functional design of the platform, the most important part is the data management layer. When to manage it, parallel loading storage module of the data become the core of the entire platform. The Hadoop distributed technology provides models and methods of data storage and data processing for the platform. We use the Hadoop distributed file system to store a huge number of source data, and use MapReduce distributed computing model to deal with these data, then use HBase distributed database to store the processed data, in order to realize the storage and management of massive railway information data. Hadoop storage architecture is shown in the figure 4[4]. 105 Fig. 4 Hadoop Storage System Architecture Diagram 2.1 The Hadoop distributed file system 110 115 HDFS is the storage basis of distributed computing, which has high fault tolerance and can be deployed on cheap hardware equipment, to store massive data[4]. HDFS uses a structural model of master-slave (Master / Slave). A HDFS cluster consists of a NameNode and several DataNodes. NameNode plays as the primary server, which manages the file system namespace and the client operation access to the file. DataNode manages stored data. From internal point, the file is divided into a plurality of data blocks, and the plurality of data blocks are stored in a set of DataNode. The NameNode execute file system namespace operations, such as opening, closing, renaming file or directory. It is also responsible for mapping data block to the specific DataNode. DataNode is responsible for handling the file system client's file read and write requests, and carries on the data block create, delete, and copy job under the unified dispatching of the NameNode. The HDFS architecture is shown in figure 5. - 4 -
120 Fig. 5 HDFS Architecture Diagram HDFS is of high throughput of data reading and writing, provides a basis to massive storage for railway information data, stored as an unprocessed set of source data in the Hadoop distributed file system. In our platform, we use HDFS to store these large amounts of source data. 125 130 2.2 The MapReduce distributed computing model MapReduce is a summary of task decomposition and results. Map is to break the task down into multiple tasks, and Reduce is to sum the results of the breakdown multitasking together to get the final result. Calculation process can be concluded as the Map (in_key in_value) - > list (inter_key inter_value) and Reduce (inter_key, list (inter_value)) - > list (out_value). In this platform, we will firstly read vast railway information data from the HDFS and divide them into M pieces to operate in parallel Map, and secondly form the state intermediate pair < k, value >, and then operate in Group on k value, just form new < k list (value) > tuple, and then break the tuples into R segments to operate in parallel Reduce, and finally, the processed results shall be stored in the distributed database. Calculation model is shown as below. 135 140 145 Fig. 6 MapReduce Computing Model The implementation of the MapReduce computing model in this platform is composed of JobTracker running on the primary node singly and TaskTracker running on nodes of each cluster[5]. The primary node is responsible for all the tasks scheduling, which constitutes the whole work, and these tasks are distributed in different nodes. The master node monitors their execution, and reruns the failing tasks; sub-node is responsible for the tasks assigned by the master node. When a Job is submitted, the JobTracker receives operation and configuration information, it will distribute the configuration information to the sub-nodes, and schedule the task and monitor TaskTracker s execution. Our platform use this way to achieve the massive railway information - 5 -
data processing. 150 155 160 2.3 HBase distributed database HBase is of high-reliability, high performance, oriented column, scalable distributed storage system[6]. The data line includes three basic types, Row Key, Timestamp and Column Family. Each line includes a sortable line keywords that uniquely identifies the data row in a table. An optional time stamp, each data operation has an associated timestamp. One or more column clusters, each column clusters can consist of any number of columns, they can have data or not. Vast amounts of railway information data after the MapReduce computation can use the value of k as a line keyword for distributed storage, implements massive data storage and management functions. Railway information data storage sample as shown in table 1. Line keywords represent railway geographic, construction, operation and maintenance, and dispatching information. Timestamp means the time it cost to operate the data. As is shown in the table, at time t3 Jinan to Qingdao direction at 73km happens Roadbed damages, and in the moment of t6 shows the dispatching information that the G88 train starts from Shanghai to Beijing at 15:00. Tab. 1 Railway information data storage sample Row Key Timestamp Column Family Location Value Geographic t1 Construction t3 Jinan Jinan-Qingdao 73km Roadbed damage t2 Maintenance t4 Beijing Dispatching t5 Shanghai Shanghai-Beijing G18 15:00 3 The Platform Test and Result Analysis 165 3.1 Platform performance test When testing the system, the data files are divided into different order of magnitude to get rule numeration, and time consuming of the single machine and of Hadoop cluster shall get comparative analysis. Test results is shown as below. 170 175 Fig. 7 Clustering Performance Test Results We can see from the diagram that when the system deals with 1GB data, the elapsed time the cluster takes is about 4 time of single machine, which is because the distributed architecture of cluster costs some time when the system initialization and intermediate files generated and passed. When the data quantity is small, the Hadoop cluster cannot play out the advantages of distributed computing. As the amount of input file data, Hadoop cluster advantages of distributed parallel computing plays out gradually. When the amount of entering data increases to 15GB from 5GB, single machine processing time increases significantly, while processing time of cluster system increases in a tiny amplitude. When data volumes get close to 20GB, cluster system takes about a - 6 -
180 185 190 195 200 205 quarter time of single one. Data test shows that with the amount of data increasing, the cluster saves more time than single machine, which embodies the advantage of the Hadoop cluster on the large amount of data processing speed. 3.2 The result analysis Through performance tests, this platform not only can efficiently store and manage massive data, but also has the following features: 1High safety and reliability. System will save file in different server in the form of multiple copies to ensure the security and integrity of data. 2Data processing speed is fast. System makes documents distribution to different local compute nodes to process the data, which reduces data transfer amount and improves the speed of data processing through the MapReduce model. 3Operation cost is low. Using distributed computing architecture, server performance requirements are lower, just to reduce the cost. 4God extensibility. System adopts parallel expansion method, which can extend cluster scale and storage capacity at any time according to need. 4 Conclusion This paper bases on the Linux cluster technology, uses Hadoop distributed technology, employs the HDFS distributed file system, Map/Reduce distributed computing model and HBase distributed database technology to deal with huge amounts of data, and designs and develops the vast railway information data storage platform based on Hadoop. By doing a lot of ordinary test on cheap computers, it meets the requirements of railway information efficient storage and management of massive data. This platform has the characteristics of high safety and reliability, fast data processing speed, low running cost, good scalability, which will provide certain reference value to railway department for data storage. Acknowledgements This work has been supported by the National Natural Science Foundation of China under Grant 61172072, 61271308, the Beijing Natural Science Foundation under Grant 4112045, the Research Fund for the Doctoral Program of Higher Education of China under Grant 20100009110002, the Beijing Science and Technology Program under Grant Z121100000312024. References 210 215 220 [1] Dean J, Ghemawat S. MapReduce:Simplified Data Processing on Large Clusters[J]. Communications of the ACM, 2008, 51(1):107-113. [2] Apache Hadoop.Welcome to Apache Hadoop[OL].[2012]. http://hadoop.apache.org/core/ [3] S. Papadimitriou and J. Sun. DisCo: distributed co-clustering with Map-Reduce[J]. IEEE ICDM, 2008, 08(1): 512-521. [4] H.C.Yang, A.Dasdan, R.L.Hsiao,and D.S.Parker. Mapreduce-merge: simplified rela-tional data processing on large clusters[j]. SCMD, 2007, 07(01):1029-1040. [5] Li Ying-an. Research on Parallelization of Clustering Algorithm Based on MapReduce[D]. Guang Zhou: Zhongshan University, 2010. [6] DUO Xue-song, ZHANG Jing and GAO Qiang. A Massive Data Management System based on the Hadoop[J]. Microcomputer Information, 2010, 26(05-I):202-204. - 7 -
225 230 235 一 种 海 量 铁 路 信 息 数 据 的 存 储 方 法 单 旭 1, 王 根 英 1, 刘 林 (1. 北 京 交 通 大 学 通 信 与 信 息 系 统 北 京 市 重 点 实 验 室, 北 京 100044; 2. 中 国 信 息 安 全 测 评 中 心, 北 京 100085) 摘 要 : 随 着 国 家 大 规 模 铁 路 建 设 的 展 开, 海 量 铁 路 信 息 数 据 飞 速 涌 现 出 来, 如 何 有 效 的 存 储 和 管 理 这 些 数 据 显 得 极 为 重 要 本 文 提 出 了 一 种 基 于 分 布 式 计 算 技 术 进 行 存 储 和 管 理 海 量 铁 路 信 息 数 据 的 方 法, 采 用 Linux 集 群 技 术, 构 建 了 海 量 铁 路 信 息 数 据 存 储 平 台. 本 系 统 由 数 据 访 问 层 数 据 管 理 层 应 用 接 口 层 三 个 层 次 组 成, 具 有 安 全 可 靠 运 行 成 本 低 处 理 速 度 快 易 扩 展 性 等 特 点, 能 够 满 足 海 量 铁 路 信 息 数 据 的 存 储 要 求. 关 键 词 : 海 量 数 据 存 储 ; 铁 路 信 息 数 据 ; 分 布 式 计 算 ; 集 群 系 统 中 图 分 类 号 :[U2-9] 2-8 -