Virtual file system on NoSQL for processing high volumes of HL7 messages

Digital Healthcare Empowering Europeans
R. Cornet et al. (Eds.)
2015 European Federation for Medical Informatics (EFMI).
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License.
doi: /

Eizen KIMURA¹ and Ken ISHIHARA
Dept. Medical Informatics of Medical School of Ehime University

Abstract. The Standardized Structured Medical Information Exchange (SS-MIX) is intended to be the standard repository for HL7 messages and depends on a local file system. However, its scalability is limited. We implemented a virtual file system using NoSQL to incorporate modern computing technology into SS-MIX and to allow the system to integrate local patient IDs from different healthcare systems into a universal system. We discuss its implementation using the database MongoDB and describe its performance in a case study.

Keywords. HL7, Data Science, Distributed Computing, NoSQL, SS-MIX

Introduction

Big data analysis in the medical domain could break new ground in the management of lifestyle-related diseases and increase the speed of drug development. The US government recently launched the Big Data Research and Development Initiative, with the NIH announcing that more than 200 terabytes of genomic data from the 1000 Genomes Project will be available on Amazon Web Services [1]. In Japan, by contrast, the National Database (NDB) already contains more than 6.9 billion health insurance claims, yet there is still no concrete plan to develop a big data analysis framework. In 2006, the Ministry of Health, Labour and Welfare introduced the Standardized Structured Medical Information Exchange project to promote the exchange of health information among institutions [2].
¹ Corresponding Author: Eizen Kimura, Medical School of Ehime Univ.; [email protected]

The design concept of SS-MIX aims at simplicity by making use of standard file systems and storing HL7 messages in a standard directory structure. However, its development started before the Internet era, and it is intended for use with local file systems. The idea was for a clinic or hospital to be able to provide patient data on a portable storage device (e.g., CD-ROM, USB memory stick), enabling a patient to take the data to another institution. SS-MIX also lacks a distributed data processing scheme and has limited scalability. Moreover, Japan has no nationwide patient ID system; instead, there are institution-specific IDs, regional patient IDs, clinical research registration IDs, and so on. A scheme needs to be developed to aggregate these IDs and medical records and to enable cross-sectional analysis. One way forward would be to preserve the simplicity of SS-MIX but add capabilities including large-scale storage, high-speed search, distributed processing, and the ability to aggregate multiple patient IDs into unique nationwide IDs. Google has already built a distributed data storage system called BigTable [3]. By separating user data based on metadata, it offers

an individualised user experience, although all of the resources of each user are accumulated on a single cloud storage system [4]. In the present study, we leverage cloud technology to aggregate all patient medical records in SS-MIX, improve its search and distributed processing performance, and share medical information with stakeholders securely. To this end, we applied the same metadata scheme used in BigTable.

Panel: Convert HL7 messages to BSON for MongoDB store

Original Raw HL7 Message:

    MSH ^~\& XXXX C PRIORITYHEALTH PRIORITYHEALTH ORU^R01 Q T P 2.3
    PID ^^^Priority Health LASTNAME^FIRSTNAME^INIT M
    PD ^PCPLAST^PCPFIRST^M^^^^^NPI
    OBR 1 185L29839X64489JLPF~X64489^ACC_NUM JLPF^Lipid Panel - C 1694^DOCLAST^DOCFIRST^^MD
    OBX 1 NM JHDL^HDL Cholesterol (CAD) 1 62 CD:289^mg/dL >40^>40 "" "" F ^^^""
    OBX 2 NM JTRIG^Triglyceride (CAD) 1 72 CD:289^mg/dL ^35^150 "" "" F ^^^""
    OBX 3 NM JVLDL^VLDL-C (calc - CAD) 1 14 CD:289^mg/dL "" "" F ^^^""
    OBX 4 NM JLDL^LDL-C (calc - CAD) CD:289^mg/dL 0-100^0^100 H "" F ^^^""
    OBX 5 NM JCHO^Cholesterol (CAD) CD:289^mg/dL ^90^200 H "" F ^^^""

Serialized XML Message:

    <?xml version="1.0" encoding="utf-8"?>
    <HL7Message>
      <MSH>
        <MSH_0>MSH</MSH_0> <MSH_1>^~\&</MSH_1> <MSH_2>XXXX</MSH_2> <MSH_3>C</MSH_3>
      <OBX>
        <OBX_0>OBX</OBX_0> <OBX_1>1</OBX_1> <OBX_2>NM</OBX_2>
        <OBX_3> <OBX_3_0>JHDL</OBX_3_0> <OBX_3_1>HDL Cholesterol (CAD)</OBX_3_1> </OBX_3>
        <OBX_4>1</OBX_4> <OBX_5>62</OBX_5>
        <OBX_6> <OBX_6_0>CD:289</OBX_6_0> <OBX_6_1>mg/dL</OBX_6_1> </OBX_6>
        <OBX_7> <OBX_7_0>>40</OBX_7_0> <OBX_7_1>>40</OBX_7_1> </OBX_7>
        <OBX_8>""</OBX_8> <OBX_9/> <OBX_10>""</OBX_10> <OBX_11>F</OBX_11>
        <OBX_12/> <OBX_13/> <OBX_14> </OBX_14> <OBX_15/> <OBX_16/>
        <OBX_17> <OBX_17_0/> <OBX_17_1/> <OBX_17_2/> <OBX_17_3>""</OBX_17_3> </OBX_17>
        <OBX_18/>
      </OBX>
      <OBX>
        <OBX_0>OBX</OBX_0> <OBX_1>2</OBX_1> <OBX_2>NM</OBX_2>
        <OBX_3> <OBX_3_0>JTRIG</OBX_3_0> <OBX_3_1>Triglyceride (CAD)</OBX_3_1> </OBX_3>
        <OBX_4>1</OBX_4> <OBX_5>72</OBX_5>
        <OBX_6> <OBX_6_0>CD:289</OBX_6_0> <OBX_6_1>mg/dL</OBX_6_1> </OBX_6>
        <OBX_7> <OBX_7_0>35-150</OBX_7_0> <OBX_7_1>35</OBX_7_1> <OBX_7_2>150</OBX_7_2> </OBX_7>

BSON Message:

    {"HL7Message"=>
      {"MSH"=>
        {"MSH_0"=>"MSH", "MSH_1"=>"^~\\&", "MSH_2"=>"XXXX", "MSH_3"=>"C", "MSH_4"=>"PRIORITYHEALTH",
       "OBX"=>
        [{"OBX_0"=>"OBX", "OBX_1"=>"1", "OBX_2"=>"NM",
          "OBX_3"=>{"OBX_3_0"=>"JHDL", "OBX_3_1"=>"HDL Cholesterol (CAD)"},
          "OBX_4"=>"1", "OBX_5"=>"62",
          "OBX_6"=>{"OBX_6_0"=>"CD:289", "OBX_6_1"=>"mg/dL"},
          "OBX_7"=>{"OBX_7_0"=>">40", "OBX_7_1"=>">40"},
          "OBX_8"=>"\"\"", "OBX_9"=>nil, "OBX_10"=>"\"\"", "OBX_11"=>"F",
          "OBX_12"=>nil, "OBX_13"=>nil, "OBX_14"=>" ", "OBX_15"=>nil, "OBX_16"=>nil,
          "OBX_17"=>{"OBX_17_0"=>nil, "OBX_17_1"=>nil, "OBX_17_2"=>nil, "OBX_17_3"=>"\"\""},
          "OBX_18"=>nil},
         {"OBX_0"=>"OBX", "OBX_1"=>"2", "OBX_2"=>"NM",
          "OBX_3"=>{"OBX_3_0"=>"JTRIG", "OBX_3_1"=>"Triglyceride (CAD)"},
          "OBX_4"=>"1", "OBX_5"=>"72",
          "OBX_6"=>{"OBX_6_0"=>"CD:289", "OBX_6_1"=>"mg/dL"},
          "OBX_7"=>{"OBX_7_0"=>"35-150", "OBX_7_1"=>"35", "OBX_7_2"=>"150"},

Panel: Import BSON Messages into MongoDB Sharding Clusters
Components: Config Server (mongod), Routing Server (mongos), Sharding Nodes i (mongod)

Panel: Mongo Map Reduce Framework
Stages: Query, Mapping, Shuffling, Reducing, Final Result

Question: The average value of cholesterol of male adults (age: )

Select HL7 records satisfying the following conditions: sex is male (PID 8) AND birth date is between 1936/04/01 and 1972/03/31 AND OBX has a JHDL entry.
  1. Retrieve the cholesterol laboratory result from the OBX 5th field.
  2. Count the number of results and sum the cholesterol values.
  3. Calculate the average value of the collected cholesterol values.

    // Map: emit one (sum, count) pair for every JHDL result in a matched document.
    var map = function() {
      var obx = this['HL7Message']['OBX'];
      for (var idx in obx) {
        if (obx[idx]['OBX_3']['OBX_3_0'] == "JHDL") {
          var key = "JHDL";
          var value = { sum: parseInt(obx[idx]['OBX_5'], 10), count: 1 };
          emit(key, value);
        }
      }
    };

    // Reduce: add up the partial sums and counts, skipping non-numeric results.
    var reduce = function(key, values) {
      var reducedVal = { sum: 0, count: 0 };
      values.forEach(function(value) {
        if (!isNaN(value.sum)) {
          reducedVal.sum += value.sum;
          reducedVal.count += value.count;
        }
      });
      return reducedVal;
    };

    // Finalize: derive the average from the reduced sum and count.
    var finalize = function(key, reducedVal) {
      return {
        sum: reducedVal.sum,
        count: reducedVal.count,
        average: reducedVal.sum / reducedVal.count
      };
    };

    var res = db.somecoll.mapReduce(map, reduce, {
      finalize: finalize,
      out: { replace: "map_reduce_example" },
      query: {
        "HL7Message.PID.PID_8": "M",
        "HL7Message.PID.PID_7": { "$gte": , "$lte": },
        "HL7Message.OBX.OBX_3.OBX_3_0": "JHDL"
      }
    });

Fig. 1 Storing and MapReduce HL7 messages on virtual file system

1. Methods

The virtual file system uses MongoDB (version 2.4.9) as the NoSQL backend. MongoDB offers distributed processing on multiple nodes via sharding [3]; our sharding cluster consists of 10 nodes. One node runs the routing server process (mongos), which mediates the interaction between the sharding nodes and clients; one runs the config server (mongod), as labelled in Fig. 1; and the remaining nodes process the distributed data. Each node is deployed on Science Cloud [5] and runs CentOS 5.7 (64 bit) on an Intel Xeon X5675 chip at 3.97 GHz with 12 cores, 96 GB RAM, and 10 Gbps x 2 networking. The General Parallel File System (GPFS) [6] was built on the RAID6 system and consists of 600 sets of 3-TB/7200-rpm hard disks; we linked this system to the 10 nodes using 10 Gbps connections. Because MongoDB uses Binary JavaScript Object

Notation (BSON) as an internal representation [7], we developed a tool that converts raw HL7 ver 2.x messages into BSON format and then stores the converted messages under an HL7Message node of the document (Figs. 1, 2). It splits the HL7 message at every separator, arranges the contents in accordance with the hierarchical structure of the BSON document, and assigns consecutive numbers to each element. It also extracts the patient ID, institutional ID, and the type of message from the original HL7 message and arranges this metadata, simulating SS-MIX standard storage, under the SS-MIX node of the BSON document (Fig. 1).

The virtual file system that simulates SS-MIX storage is developed using Filesystem in Userspace (FUSE) [8]. It mounts a virtual SS-MIX storage system on the host and then translates file system requests into MongoDB queries and MongoDB responses into file system responses. The FUSE module was developed using Ruby and the FUSEfs module. It simulates the file system hierarchy using the metadata under the SS-MIX node in the BSON document (Fig. 1). The tool for aggregating patient IDs adds an entry containing a universal patient ID and a new institution ID to the existing metadata under the SS-MIX node. This makes it possible to search all medical records for any given patient across healthcare facilities.

The system must be able to efficiently register data from nationwide healthcare systems in real time. To test this, we performed various evaluations. First, to assess the relationship between the number of sharding nodes and the performance of data registration, we calculated the average processing time over five runs. We simulated a case in which each of 40 clients sent a request containing 100 HL7 messages and repeated this 500 times. Thus, we determined the registration time to process 2 million HL7 messages (40 x 100 x 500).
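The conversion step described in the Methods (splitting a message at its separators and numbering the pieces consecutively) can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names `parse_field` and `hl7_to_document`, the sample message, and the rule that a segment becomes a list only when it repeats are assumptions made for illustration. Only the `SEG_n` / `SEG_n_m` key naming follows Fig. 1, and the SS-MIX metadata node is not covered.

```python
FIELD_SEP = "|"      # HL7 v2 field separator
COMPONENT_SEP = "^"  # HL7 v2 component separator


def parse_field(segment_id, index, raw):
    """Return None for an empty field, a dict of numbered components, or the raw string."""
    if raw == "":
        return None
    if COMPONENT_SEP in raw:
        parts = raw.split(COMPONENT_SEP)
        return {f"{segment_id}_{index}_{i}": (part or None) for i, part in enumerate(parts)}
    return raw


def hl7_to_document(message):
    """Convert a raw HL7 v2 message into a nested dict keyed like Fig. 1 (assumed layout)."""
    doc = {}
    for line in message.strip().splitlines():  # splitlines() also handles the HL7 "\r" terminator
        fields = line.split(FIELD_SEP)
        seg_id = fields[0]
        segment = {f"{seg_id}_0": seg_id}
        for i, raw in enumerate(fields[1:], start=1):
            if seg_id == "MSH" and i == 1:
                # This field holds the encoding characters themselves; keep it verbatim.
                segment[f"{seg_id}_{i}"] = raw
            else:
                segment[f"{seg_id}_{i}"] = parse_field(seg_id, i, raw)
        # Assumption: a segment type that repeats (e.g. OBX) is stored as a list.
        if seg_id not in doc:
            doc[seg_id] = segment
        elif isinstance(doc[seg_id], list):
            doc[seg_id].append(segment)
        else:
            doc[seg_id] = [doc[seg_id], segment]
    return {"HL7Message": doc}


# Hypothetical sample message; field positions follow PID-7 (birth date) and PID-8 (sex).
sample = "\r".join([
    "MSH|^~\\&|XXXX|C",
    "PID|1||12345||LASTNAME^FIRSTNAME||19500101|M",
    "OBX|1|NM|JHDL^HDL Cholesterol (CAD)|1|62|CD:289^mg/dL|>40^>40",
    "OBX|2|NM|JTRIG^Triglyceride (CAD)|1|72|CD:289^mg/dL|35-150^35^150",
])

doc = hl7_to_document(sample)
print(doc["HL7Message"]["OBX"][0]["OBX_3"])
```

A nested mapping of this shape could then be handed to a MongoDB driver, which takes care of the BSON encoding; the dotted paths used in the query of Fig. 1 (for example `HL7Message.PID.PID_8`) address exactly these keys.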
We repeated this process using a different number of nodes, from one to eight. Next, we investigated how the number of concurrent connections and the number of bulk-transferred HL7 messages affected registration performance. With the eight-node sharding configuration, we measured the average number of registrations while varying the following conditions: the number of concurrent connections (from 1 to 40 clients) and the number of HL7 messages per inquiry (1000, 2000, 4000, and 8000). Each condition was repeated 500 times.

To evaluate the performance of distributed data processing, we prepared a MapReduce scenario that collected laboratory data on high-density lipoprotein (HDL) cholesterol levels of men aged years in April. First, the system performed a query to narrow down the HL7 records to only those that matched our conditions (gender, PID-8; birth date, PID-7; laboratory test result whose OBX-3 is JHDL). In the Map process, the system extracts the JHDL laboratory test results from the OBX-5 field of the OBX segments in the previously matched HL7 messages. In the Reduce process, it counts the number of laboratory tests and sums the laboratory test result values from every node. In the finalizing process, it calculates the average value for HDL cholesterol from the previously collected values. We conducted this process 10 times, changing the number of nodes and the number of HL7 messages involved, and determined the average processing time.

2. Results

The average size of the HL7 messages was 824 bytes, and that of the BSON-converted messages was 3568 bytes. When 100 million HL7 messages were stored in MongoDB, its physical volume was Gb. Figure 3 shows the relationship between the number of nodes and data registration performance. Registration performance increased up to four nodes, after which it remained constant. Figure 4 shows the relationship between

the number of concurrent connections and the number of bulk-transferred HL7 messages. As the number of concurrent connections increased, the registration performance improved, up to 34 simultaneous connections. At that point, registration peaked at 7664 messages per second and thereafter reached a plateau. The number of bulk-transferred messages had no impact on the overall performance. As long as every node held fewer than 30 million messages, the MapReduce processing time was inversely proportional to the number of sharding nodes. The processing time was measured as t = /x (s) (R2 = 0.987), where x is the number of nodes, and it shows O(n)-order performance scaling.

Fig. 2 SS-MIX schema on MongoDB
Fig. 3 Performance of bulk message transfer
Fig. 4 Performance of MongoDB sharding
Fig. 5 MapReduce processing time

3. Discussion

MongoDB uses sharding keys to keep O(n)-order search performance as a whole when sharding nodes are added. We had to use an assigned, evenly distributed ID rather than the patient ID as the sharding key, because patient IDs are known to vary considerably in distribution. This method shows high scalability in processing cross tabulations because it reduces the need to cross-reference data across other nodes. As MongoDB depends on memory-mapped files [9], its performance is reduced greatly when its contents exceed the capacity of the server memory. According to our tests, its performance was degraded once 30 million messages were stored in a single node. MongoDB is a document-oriented NoSQL database that uses the BSON format as its internal storage representation and allows indexing of all document contents. Hence, we believe that MongoDB is suitable as the NoSQL infrastructure for structured documents such as HL7 CDA R2. Our system converts raw HL7 messages into BSON

format, and it proved to be scalable. Assuming a server akin to what we used in the present study, 55 nodes will be sufficient to process one billion HL7 records from all healthcare institutions in Japan. Our system may conduct a MapReduce process in minutes and handle real-time streaming of laboratory results to detect anomalies, such as signs of infectious disease spread. However, in our tests, registration performance eventually reached a plateau at the routing server. Hence, we will have to increase the number of routing server nodes to avoid this problem in the future.

A previous study [10] used a system configuration similar to ours. The main difference is that it stores data files on a legacy file system, builds metadata indexes for the files, and stores those indexes in MongoDB. It provides a virtual file system that shows only the files selected by queries against the metadata. Our system, by contrast, stores the data directly in MongoDB and adds the metadata for simulating a virtual file system, which overcomes the performance limitation of a legacy file system. A healthcare setting can mount its data through a virtual file system, separated from the data of other healthcare settings. Although our approach does not need a pre-existing file system, providing the virtual file system was required to ensure compatibility with legacy applications that expect SS-MIX storage to be a file system.

HL7 has been developing an innovative standards framework, Fast Healthcare Interoperability Resources (FHIR), for sharing medical information [11]. In its specification, FHIR adopts JSON as a standard representation format, which has side-by-side compatibility with BSON. Therefore, FHIR documents can become targets of parallel distributed processing as soon as they are stored in MongoDB. We will verify whether FHIR is a suitable format for distributed processing in cloud computing.
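To make the JSON and BSON compatibility argument concrete, the following minimal Python sketch parses a heavily trimmed, hypothetical FHIR-style Observation and addresses its values with dotted paths of the kind used in MongoDB queries. The sample resource, the reuse of the local code JHDL inside it, and the helper `get_path` are illustrative assumptions and are not part of the system described above.

```python
import json

# Hypothetical, heavily trimmed FHIR-style Observation (illustration only).
fhir_json = """
{
  "resourceType": "Observation",
  "code": {"coding": [{"code": "JHDL", "display": "HDL Cholesterol"}]},
  "valueQuantity": {"value": 62, "unit": "mg/dL"}
}
"""


def get_path(document, dotted_path):
    """Follow a dotted path such as 'code.coding.0.code' through dicts and lists."""
    node = document
    for key in dotted_path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # numeric path segments index into arrays
        else:
            node = node[key]
    return node


# json.loads yields plain nested dicts and lists, the same shape a document store holds.
observation = json.loads(fhir_json)
print(get_path(observation, "valueQuantity.value"))
print(get_path(observation, "code.coding.0.code"))
```

Because the parsed resource is already a nested mapping, no intermediate decomposition step like the one needed for HL7 ver 2.x messages is required before storage.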
Acknowledgements: Data processing and other research was performed using the NICT Science Cloud at the National Institute of Information and Communications Technology (NICT) as a collaborative research project. This work was supported by MEXT KAKENHI Grant Number

References

[1] Office of Science and Technology Policy. Obama Administration Unveils Big Data Initiative: Announces $200 Million in New R&D Investments. 2012; Available from:
[2] Kimura M, Nakayasu K, Ohshima Y, Fujita N, Nakashima N, Jozaki H, et al. SS-MIX: A Ministry Project to Promote Standardized Healthcare Information Exchange. Methods of Information in Medicine. 2011;50(2):131.
[3] Chodorow K. Scaling MongoDB: O'Reilly Media, Inc.;
[4] Cooper J. How Entities and Indexes are Stored. 2009; Available from:
[5] Murata KT, Watari S, Nagatsuma T, Kunitake M, Watanabe H, Yamamoto K, et al. A Science Cloud for Data Intensive Sciences. Data Science Journal. 2013;12:WDS139-WDS46.
[6] Schmuck FB, Haskin RL. GPFS: A Shared-Disk File System for Large Computing Clusters. FAST. 2002;2:19.
[7] Cattell R. Scalable SQL and NoSQL data stores. ACM SIGMOD Record. 2011;39(4):
[8] Szeredi M. Filesystem in Userspace. 2013; Available from:
[9] Parker Z, Poe S, Vrbsky SV. Comparing NoSQL MongoDB to an SQL DB. Proceedings of the 51st ACM Southeast Conference; Savannah, Georgia: ACM; p
[10] Jacobi MR, editor. Applied Parallel Metadata Indexing. Conference: 4th Annual Computing and Information Technology Student Mini-Showcase; 2012: Los Alamos National Laboratory (LANL).
[11] HL7. FHIR: Fast healthcare interoperability resources [cited /17]; Available from: