A Hadoop-based Platform for Massive Medical Data Storage




A Hadoop-based Platform for Massive Medical Data Storage

WANG Heng* (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876)

* Brief author introduction: WANG Heng (1989-), male, Master's student, wireless eHealth. E-mail: 18810540297@163.com

Abstract: With the rapid and healthy development of information technology in healthcare, medical data are being generated in huge volumes. However, existing platforms for storing patient data cannot keep pace with this ever-increasing volume of medical data. It is therefore important to develop an effective storage platform to manage and store these massive medical data. Cloud computing offers low cost, high scalability, availability, and fault tolerance, which addresses many of the problems faced in storing and analyzing patient medical data. Based on distributed computing technology, this paper proposes a novel approach for mass medical data storage and management: a massive medical data storage platform built on Hadoop using Linux cluster technology. Extensive experimental results demonstrate its efficiency, reliability, high scalability, and fault tolerance.

Key words: mass data storage; medical data; Hadoop; HBase; distributed computing

0 Introduction

We are paying more and more attention to disease prevention, inspection, and diagnosis. As a result, more relevant data [1] are collected for disease tracking and other purposes. Previous medical information platforms were mainly based on the Hospital Information System [2]. Such a platform collects all kinds of information about patients, and the data are managed by hospitals. Similar systems are EMR (Electronic Medical Records) [3] and PHR (Personal Health Records) [4], which store patients' medical data. With the development of the Internet and social networks, these electronic medical records can be shared with doctors in other hospitals or healthcare organizations. However, the health care data held in EMR or PHR systems are becoming larger and larger. These data are generated at different hospitals or organizations, may be distributed heterogeneously across different places, and may have different formats. Currently, many health information systems and platforms store and manage these distributed data with an RDBMS [5]. However, with the geometric growth of structured and unstructured data on such platforms, traditional relational database solutions have limitations in storing and managing these large and heterogeneous medical data. As a result, how to store and manage a large amount of medical data emerges as an important issue.

Traditional large-scale data processing mostly relies on distributed high-performance computing and grid computing. It requires expensive computing resources and tedious programming for large-scale data segmentation and the rational allocation of computing tasks. The Hadoop framework, by contrast, provides a solution to the problem of massive data storage [6,7]. Hadoop is an open-source distributed computing framework. It can build highly reliable and scalable parallel and distributed systems by running applications on a cluster composed of a large number of commodity hardware devices.

Hadoop Distributed File System (HDFS) [8], the MapReduce programming model, and the HBase distributed database [9] are its three core technologies. The high performance of Hadoop-based systems motivates us to design a novel data storage system for processing massive medical data in a cost-effective way. Our experimental results also show the high scalability and robust fault tolerance of the Hadoop-based system.

The rest of this paper is organized as follows. Section 1 describes the architecture of the proposed massive medical data storage platform. Section 2 presents the functions of each system component and describes how the platform is deployed and implemented. Section 3 evaluates the system performance and verifies the efficiency of the proposed platform. Finally, Section 4 concludes this paper.

1 Framework of the massive medical data storage platform

Medical data are massive, which imposes many challenges for effective data storage and processing. We propose a platform to accommodate massive medical data that may be large scale, geographically distributed, and heterogeneous. Fig. 1 shows the framework of the proposed platform. In this framework, the user is located at the top level. Below the user, the platform consists of three parts: the storage layer, the management layer, and the application layer.

Fig. 1 The architecture of the massive medical data storage platform (application layer: User Access, Authorization, API Interface, Services; management layer: Master with Query and Concurrency, Slaves with Storage and Jobs, Coordination; storage layer: Storage Device, Data Access Interface, Storage Monitor)

At the top layer of the framework, a user could be a member of the medical staff, a patient, a system manager, or any other person using the system. Medical staff access patients' physiological data, analyze patient information, and give diagnoses through the platform. Patients can retrieve their own records from the platform for various purposes, such as tracking and preventing disease. The data manager administrates the data, for example by specifying data access privileges, and is able to share the data according to certain policies. There may be many users accessing the data simultaneously. Thus, the data system should be application agnostic, so that different applications can access the system simultaneously through the web portal.

1.1 Storage Layer

As shown in Fig. 1, there is a storage device for storing medical information. The storage device runs on clusters of commodity hardware; it continues working even if a node fails, which is key to keeping the service available at all times. The data access interface provides services with access to the data sources; for example, it can provide different database access services. Medical data such as electronic medical records and medical image data require terabytes or petabytes of storage. SQL-like databases are restricted in the capacity they can provide and thus cannot handle such massive storage needs. This motivated us to adopt horizontally scalable, distributed, non-relational data stores, known as NoSQL databases. Their support for highly concurrent reads and writes, mass storage, easy expansion, and low cost makes them suitable for the current requirements [10]. This layer uses the non-relational database HBase to store data. HBase runs on top of a distributed file system and is designed for data storage and analysis.

1.2 Management Layer

The proposed architecture follows a master-slave design, with a coordination manager mediating between the master and the slaves. The master has two main components: Query and Concurrency. Query is an important component of the system; it provides the application layer with parallel queries over massive data and contains the metadata of the file system and the data locations in the database required for each query. Concurrency is mainly used for scheduling, distributing, and managing the tasks on the slaves by coordinating with the query manager. The slave side has two parts: Storage and Jobs. Storage is responsible for processing and maintaining the data stored in the distributed file system. Jobs is responsible for concurrency management on the slave side, including instantiating and monitoring individual tasks. Coordination is responsible for handling and managing requests and responses in the case of multiple master-slave communications.

1.3 Application Layer

The function of this layer is to provide access to users under user authorization. Users can access services through the network. All of the components in this layer run within the Web framework. The layer has four components: User Access, API Interface, Authorization, and Services.

2 Development of the massive medical data storage platform

The main parts of the medical data storage platform are the Storage Layer and the Management Layer, which provide services and data support for the entire platform. The main technology used to realize these two layers is Hadoop distributed technology, whose three core components are HDFS, MapReduce, and HBase. MapReduce is a high-performance computing model of Hadoop: an application is divided into many small fragments of work, each of which can be executed or re-executed on any node in the cluster. HDFS is used to store data on the compute nodes and provides high aggregate bandwidth across the cluster. On top of HDFS, Hadoop offers an open-source, distributed, column-oriented database named HBase. HBase can be used to build massive storage clusters on low-cost servers [11]. The data analysis framework based on Hadoop is shown in Fig. 2.

Fig. 2 The data analysis framework based on Hadoop (original data is extracted into data blocks stored in HDFS, with HBase on top returning results)
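To make the relationship between the layers in Fig. 2 concrete, the following minimal Java sketch shows how an HBase client or daemon configuration ties HBase to HDFS and to the ZooKeeper coordination service. The host names and port are assumptions, not values from the paper, and in a real deployment these properties would normally be set in hbase-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class StorageLayerConfig {
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            // HBase persists its tables as files inside HDFS (assumed NameNode address)
            conf.set("hbase.rootdir", "hdfs://namenode:9000/hbase");
            // run HBase in fully distributed mode on the cluster
            conf.set("hbase.cluster.distributed", "true");
            // ZooKeeper quorum used for master/region-server coordination (assumed host)
            conf.set("hbase.zookeeper.quorum", "master");
            return conf;
        }
    }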

2.1 Hadoop Distributed File System (HDFS)

HDFS is the Hadoop distributed file system, designed to run on commodity hardware. Compared with other distributed file systems, HDFS has high fault tolerance; it improves the throughput of data access and is suitable for large-scale data storage. In HDFS, data are organized into files and directories. Files are divided into uniformly sized blocks that are distributed across the cluster nodes. HDFS has a master/slave architecture. As shown in Fig. 3, an HDFS cluster consists of a NameNode running on the master server and a number of DataNodes running on the slave servers. The NameNode maintains the file namespace (metadata, directory structure, lists of files, the list of blocks for each file, the location of each block, file attributes, authorization, and authentication). The DataNodes create, delete, or replicate the actual data blocks based on instructions received from the NameNode, and they periodically report back to the NameNode with the complete list of blocks they are storing. Only one NameNode runs in the cluster, which simplifies the system architecture.

Fig. 3 The architecture of HDFS

Both the NameNode and the DataNodes are designed to work on commodity hardware running a Linux operating system. However, a common problem with clusters of commodity hardware is a higher probability of hardware failure and thereby data loss. The simplest way to solve this problem is a replication mechanism: the system stores redundant copies so that when a failure occurs there is always another copy available, improving reliability and making the system fault-tolerant. By default, HDFS stores three copies of each data block. The usual replication strategy for a data center is to place two replicas on machines in the same rack and one replica on a machine in another rack. This strategy limits the data traffic between racks. Meanwhile, in order to minimize data-access latency, HDFS tries to read the data from the nearest replica.
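As an illustration of how an application writes to and reads from HDFS (this is a minimal sketch, not code from the paper), the example below uses the standard Hadoop FileSystem API; the NameNode address, file path, and record contents are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // "fs.default.name" on Hadoop 1.x

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/medical/raw/patient-0001.txt");

            // write a small record file; HDFS splits large files into blocks automatically
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("patient-0001,2013-05-01,blood_pressure=100");
            }
            fs.setReplication(file, (short) 3);   // default replication factor is 3

            // read it back; HDFS serves the read from the nearest replica
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }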

2.2 MapReduce

MapReduce is a simple software framework. It can process massive data in parallel and with fault tolerance, and MapReduce applications can run on large clusters consisting of thousands of machines. The basic idea of the MapReduce framework is divide and conquer: partition a large problem into smaller sub-problems that are independent, so that they can be tackled in parallel by different slaves. The MapReduce framework divides the work into two phases, the map phase and the reduce phase, separated by data transfer between nodes in the cluster. First, the input data set of a job is split into several separate fixed-sized pieces called input splits, and Hadoop creates one map task for each split. Map output is a set of records in the form of key-value pairs. The records for any given key, possibly spread across many nodes, are aggregated at the node running the reducer for that key. The reduce stage produces another set of key-value pairs as the final output, based on a user-defined reduce function. The calculation model is shown in Fig. 4.

Fig. 4 The calculation model of MapReduce

Similar to HDFS, the MapReduce framework also uses a master/slave architecture: a JobTracker (master) and a number of TaskTrackers (slaves). The JobTracker is responsible for scheduling and managing the TaskTrackers, for example assigning map tasks and reduce tasks to idle TaskTrackers; it is also responsible for monitoring the execution of tasks. TaskTrackers run the tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job.
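The following self-contained Java job is a minimal sketch of the map and reduce phases described above, counting how many records exist per patient. The input layout (one comma-separated record per line with the patient ID in the first field) is an assumption made for illustration, not the paper's actual data format.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RecordCount {

        // map phase: emit (patientId, 1) for every input record
        public static class RecordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text patientId = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                // assumed layout: "patientId,timestamp,measurement"
                String[] fields = value.toString().split(",");
                if (fields.length > 0) {
                    patientId.set(fields[0]);
                    ctx.write(patientId, ONE);
                }
            }
        }

        // reduce phase: sum the counts grouped by patient ID
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            // on Hadoop 1.x releases this would be "new Job(conf, ...)" instead
            Job job = Job.getInstance(new Configuration(), "record count");
            job.setJarByClass(RecordCount.class);
            job.setMapperClass(RecordMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop splits the input file into input splits, runs one RecordMapper task per split, groups the emitted (patientId, 1) pairs by key, and runs SumReducer on each group, matching the flow shown in Fig. 4.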

2.3 HBase

HBase is a distributed, column-oriented NoSQL database. It is a highly reliable, high-performance, scalable distributed storage system. An important difference from relational databases is that there is no fixed schema: a record A might have 10 attributes while record B has only 7 attributes, even though they are in the same table. Data in HBase are organized in labeled tables. Tables are made of rows and columns, and every column in HBase belongs to a particular column family. Rows in an HBase table are sorted by row key, which is a byte array and serves as the table's primary key; all table accesses go through the row key. Each data operation also has an associated timestamp, which is assigned automatically by HBase when the data is inserted. Tab. 1 illustrates the data structure with a simple example. The structure consists of two main column families: Basic Information and Vital Signs. Each record, as defined by HBase, has a unique key and a timestamp defining the creation or update time. The Basic Information family consists of three or more columns: name, gender, date of birth, address, and so on. The Vital Signs family likewise consists of many columns: blood pressure, heartbeat, and so on. Because of the flexibility of HBase tables, we can insert more column families, especially for unstructured data such as graphs and medical imaging.

Tab. 1 A simple example of an HBase table

Row Key    | Timestamp | Basic Information               | Vital Signs
           |           | Name | Gender | Date of Birth   | Blood pressure | Blood pressure | Heart
Patient ID | T8        | Tom  | Male   | 1989.07.03      |                |                |
           | T5        |      |        |                 | 100            | 5.8            | 80
           | T1        |      |        |                 | 96             |                | 72

HBase also adopts a master/slave architecture. It consists of three parts: the HMaster, the HRegionServers, and the client. The HMaster is responsible for assigning regions to the HRegionServers and for recovering from HRegionServer failures. Each HRegionServer is responsible for serving client read and write requests. The client is responsible for looking up the addressing information needed to locate the right HRegionServer.
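As a sketch of how a patient table like the one in Tab. 1 could be created and populated through the classic (pre-1.0) HBase Java client: the short family names "info" and "vitals", the column qualifiers, and the row key are illustrative assumptions, not the exact identifiers used by the platform, and newer HBase clients replace HTable/HBaseAdmin with Connection/Table and Put.addColumn.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PatientTableExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // create the table with two column families, mirroring Tab. 1
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("patient"));
            desc.addFamily(new HColumnDescriptor("info"));    // "Basic Information"
            desc.addFamily(new HColumnDescriptor("vitals"));  // "Vital Signs"
            if (!admin.tableExists("patient")) {
                admin.createTable(desc);
            }
            admin.close();

            // insert one record; HBase assigns a timestamp to each cell automatically
            HTable table = new HTable(conf, "patient");
            Put put = new Put(Bytes.toBytes("patient-0001"));   // row key
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Tom"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes("Male"));
            put.add(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"), Bytes.toBytes("80"));
            table.put(put);

            // all reads go through the row key
            Result result = table.get(new Get(Bytes.toBytes("patient-0001")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
            table.close();
        }
    }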

3 Experiments and performance evaluation

In order to validate the performance of the proposed Hadoop-based massive medical data storage platform, we implemented the platform. This section describes the hardware components of the platform and designs several testing scenarios as case studies. Fig. 5 shows the deployment model of the massive medical data storage platform.

Fig. 5 Deployment model of the massive medical data storage platform

This architecture assigns one node as the master for managing the Hadoop cluster. On the slave side, the system can add as many slaves as needed, because Hadoop scales linearly. The master runs the NameNode, the HBase Master, and the JobTracker. Each data node (slave) runs an HRegionServer together with a DataNode and a TaskTracker. The data nodes are efficient in data storage management and job execution. Thrift was used as the client API to communicate with HBase, for its simplicity and noticeable benefits; after successfully establishing a connection with the HBase server, the Restlet Framework is used as the web platform. The machines have an Intel(R) Core 2 Quad 2.66 GHz CPU, 8 GB of memory, and a 320 GB hard drive; the slaves have the same specification except for a 160 GB disk. The tests measure the system's query performance and scalability through two experiments.

3.1 Experiments on Data Query

We run the data query operation at a concurrency rate of 100 under different data storage levels of 1000 records, 1 million records, 100 million records, and 1 billion records. The data query performance is shown in Tab. 2.

Tab. 2 Comparison of data query performance between Hadoop and RDBMS

Data Query     | Time of Hadoop (s)                    | Time of RDBMS (s)
               | Minimum | Average | Maximum | 90%     |
Table of 1K    | 0.012   | 0.148   | 4.581   | 0.2     | 13.45
Table of 1M    | 0.033   | 0.182   | 7.564   | 0.24    | 19.00
Table of 100M  | 0.065   | 0.354   | 15.62   | 0.67    | 97.26
Table of 1B    | 0.055   | 0.671   | 24.48   | 0.89    | 315.2

From the table we can see that our Hadoop framework outperforms the RDBMS on data query at all levels of data storage.
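For reference, the hypothetical client-side sketch below shows the kind of measurement behind Tab. 2: it issues 1000 random reads against the patient table at a concurrency of 100 and reports the minimum, average, maximum, and 90th-percentile latency. The table name and row-key scheme are assumptions, and the paper's own tests went through the Thrift API rather than the native Java client used here.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class QueryLatencyTest {
        public static void main(String[] args) throws Exception {
            final Configuration conf = HBaseConfiguration.create();
            ExecutorService pool = Executors.newFixedThreadPool(100);   // concurrency rate of 100
            List<Future<Long>> futures = new ArrayList<Future<Long>>();

            for (int i = 0; i < 1000; i++) {                            // 1000 input requests
                final String rowKey = "patient-" + i;                   // assumed row-key scheme
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        // HTable is not thread-safe, so each request opens its own handle
                        HTable table = new HTable(conf, "patient");
                        long start = System.nanoTime();
                        table.get(new Get(Bytes.toBytes(rowKey)));
                        long elapsed = System.nanoTime() - start;
                        table.close();
                        return elapsed;
                    }
                }));
            }

            List<Long> latencies = new ArrayList<Long>();
            for (Future<Long> f : futures) {
                latencies.add(f.get());
            }
            pool.shutdown();
            Collections.sort(latencies);

            long sum = 0;
            for (long v : latencies) sum += v;
            System.out.printf("min=%.3fs avg=%.3fs max=%.3fs p90=%.3fs%n",
                    latencies.get(0) / 1e9,
                    sum / (double) latencies.size() / 1e9,
                    latencies.get(latencies.size() - 1) / 1e9,
                    latencies.get((int) (latencies.size() * 0.9) - 1) / 1e9);
        }
    }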

3.2 Experiments on Throughput

The number of requests served per second is measured as the number of data nodes increases. The experiments are conducted with 1000 input requests, a concurrency rate of 100, and a 100 Mbps transfer rate. Each time, 5 data nodes (200,000 records) are added to the cluster. As shown in Fig. 6, although the data volume grows progressively, the throughput remains at approximately 200 requests per second, meaning that the massive medical data storage platform maintains its performance level. This confirms the system's scalability.

Fig. 6 Requests per second versus number of machines

4 Conclusion

In this paper, we design and develop a Hadoop-based platform for storing and processing massive medical data in an effective way. Integrating the Hadoop distributed file system, the MapReduce parallel-computing model, and the HBase database, we implemented the proposed approach, and substantial tests were conducted on an experimental system that includes a large number of commodity hardware devices. Performance testing on the experimental platform shows that the system has good performance, scalability, and fault tolerance. Meanwhile, the proposed model offers a flexible and portable platform for application development and user access. Finally, the promising output across the conducted tests is a good indicator of the usability of the proposed system.

References

[1] Wei Liu and Dedao Gu. Research on construction of smart medical system based on the social security card [A]. EMEIT 2011: 2011 International Conference on Electronic and Mechanical Engineering and Information Technology [C]. Harbin, P.R. China, 2011. 4697-4700.
[2] Boqiang Liu, Xiaomei Li, Zhongguo Liu, et al. Design and implementation of information exchange between HIS and PACS based on HL7 standard [A]. ITAB 2008: International Conference on Information Technology and Application in Biomedicine [C]. Shenzhen, P.R. China, 2008. 552-555.
[3] Yuwen Shuilim, Yang Xiaoping, Li Huiling. Research on the EMR storage model [A]. IFCSTA'09: International Forum on Computer Science-Technology and Application [C]. Chongqing, P.R. China, 2009. 222-226.
[4] Faro, A., Giordano, D., Lavasidis, I., Spampinato, C. A web 2.0 telemedicine system integrating TV-centric services and Personal Health Records [A]. ITAB 2010: 2010 10th IEEE International Conference on Information Technology and Applications in Biomedicine [C]. Corfu, Greece, 2010. 1-4.
[5] Miquel, M., Tchounikine, A. Software components integration in medical data warehouses: a proposal [A]. CBMS 2002: Proceedings of the 15th IEEE Symposium on Computer-Based Medical Systems [C]. Maribor, Slovenia, 2002. 361-364.
[6] Hongyong Yu, Deshuai Wang. Research and implementation of massive health care data management and analysis based on Hadoop [A]. ICCIS 2012: 2012 4th International Conference on Computational and Information Sciences [C]. Chongqing, P.R. China, 2012. 514-517.
[7] Dalia Sobhy, Yasser El-Sonbaty, Mohamad Abou Elnasr. MedCloud: Healthcare cloud computing system [A]. ICITST 2012: 2012 International Conference for Internet Technology and Secured Transactions [C]. London, 2012. 161-166.
[8] The Apache Software Foundation. Apache Hadoop HDFS homepage [OL]. [2012]. http://hadoop.apache.org/hdfs/.
[9] Lars George. HBase: The Definitive Guide, 1st edition [M]. O'Reilly Media, 2011.
[10] J. Han, E. Haihong, G. Le, and J. Du. Survey on NoSQL database [A]. ICPCA 2011: 2011 6th International Conference on Pervasive Computing and Applications [C]. Port Elizabeth, South Africa, 2011. 363-366.
[11] The Apache Software Foundation. Apache Hadoop HBase homepage [OL]. [2013]. http://hbase.apache.org/.