A method for handling multi-institutional HL7 data on Hadoop in the cloud

1 A method for handling multi-institutional HL7 data on Hadoop in the cloud
Masamichi Ishii *1, Yoshimasa Kawazoe *1, Akimichi Tatsukawa *2, Kazuhiko Ohe *2
*1 Department of Planning, Information and Management, The University of Tokyo Hospital, Japan
*2 Department of Medical Informatics and Economics, Graduate School of Medicine, The University of Tokyo, Japan
AGENDA
1. Introduction
2. Technology Brief
3. Implementation Processes & Outcomes
4. Conclusion

2 Many complaints
EMR systems are widespread, but clinicians (who are often also researchers) voice many complaints, because:
- They need access to information for clinical research and for evidence-based medicine.
- Yet they get less in return than the effort they put into keying data into EMRs.

3 Purposes of querying clinical data
We classified and subtotaled 195 retrieval request sheets submitted by clinicians working for The University of Tokyo between 2006 and [year missing].
Figure 1. Purposes of querying clinical data

4 If direct retrieval of the clinical data were available...

5 Slowdown of on-premise DWH
- The bigger the data in a data warehouse (DWH) becomes, the longer a query job takes; the time grows linearly with the data size.
- Improving performance means scaling up the IT infrastructure, which leads to greater costs, perhaps exponentially so.
- Most DWHs are based on RDB technology. This requires ongoing maintenance as well as a thorough understanding of the relevant table schemas before you can launch your query jobs.

6 SS-MIX growing into clinical Big Data
In Japan, an increasing number of medical institutions exchange medical records via SS-MIX. If the on-premises SS-MIX storages were integrated, they would become clinical Big Data.
SS-MIX: the Standardized Structured Medical-record Information eXchange, the Japanese de facto standard for HL7-format medical records.

7 If multi-institutional SS-MIX storages were integrated, what would you want from them?
Rapid, direct retrieval of clinical Big Data would become available.

8 How to make the most of secondary use of clinical records
Challenges
Our goal: to provide an IT infrastructure that makes it easier for clinicians to retrieve what they want from clinical Big Data directly, by themselves.

9 Technology Brief
- Cloud computing technology combined with a distributed processing architecture provides a convenient, cost-effective, and scalable environment.
- You can share clinical data among different medical institutions.
- Our prototype system is designed to meet these challenges.
We describe how the effective use of Hadoop and Pig can bring great benefits to clinicians in your medical institution who are eager to carry out EBM or epidemiological studies.

10 Components of IT infrastructure
- We selected Hadoop as the distributed processing framework.
- To ensure future scalability, we built the Hadoop cluster in the cloud.
- To assure direct access to Big Data, we selected Apache Pig, which runs on Hadoop, as the data retrieval component.

11 What's Pig?
"Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets." (quoted from the Apache Pig project)
However, Pig has no built-in function for parsing HL7 messages, which are semi-structured data.

12 Newly developed utilities
We developed a set of specific utilities to optimise clinical data search and retrieval in minimum time:
(1) Data migration tool: merges HL7 files and converts them into files optimised for distributed processing in Hadoop.
(2) HL7 message tabulating tool: User Defined Functions (UDFs) for parsing HL7.
These tools help users to store, manage and retrieve HL7 data on Hadoop in the cloud.

13 (1) Data migration tool
Efficient query execution in Hadoop depends on file size. The average size of an HL7 message (2 to 4 KB) is too small for distribution on Hadoop (default block size 64 MB).
Flow: SS-MIX storage (HL7 message files) -> Data migration tool -> ADT for HDFS, PPR for HDFS, OUL for HDFS, RDE for HDFS, RAS for HDFS
The data migration tool:
- Merges the files according to data type
- Converts each HL7 file into one line (line feed code -> tag; <EOF> -> line feed code)
- Makes personal information anonymous
- Encodes each file in UTF-8
- Adds key values
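
The merge-and-flatten step might look roughly like the Python sketch below. It is an illustration only, not the authors' tool: the function names, the `<CR>` placeholder used to join segments, and the in-memory list of (data type, raw bytes) pairs are all assumptions, and anonymisation and key values are left out.

```python
from typing import Dict, Iterable, List, Tuple

SEGMENT_TAG = "<CR>"  # hypothetical tag standing in for in-message line breaks


def flatten_message(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Turn one multi-line HL7 message into a single line of text."""
    text = raw.decode(source_encoding)
    # HL7 v2 segments normally end with CR; tolerate CRLF and LF as well.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    segments = [s for s in normalised.split("\n") if s]
    return SEGMENT_TAG.join(segments)


def merge_by_type(files: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Group (data_type, raw_message) pairs into one merged text per data type.

    Each message becomes one LF-terminated line, so many small messages
    end up in a single large file per data type.
    """
    merged: Dict[str, List[str]] = {}
    for data_type, raw in files:
        merged.setdefault(data_type, []).append(flatten_message(raw))
    return {t: "\n".join(lines) + "\n" for t, lines in merged.items()}


if __name__ == "__main__":
    # Made-up messages, for illustration only.
    sample = [
        ("OUL", b"MSH|^~\\&|LAB\rPID|0001||123^^^^PI\rOBX|1|NM|HbA1c||6.8\r"),
        ("ADT", b"MSH|^~\\&|ADM\rPID|0001||123^^^^PI\r"),
        ("OUL", b"MSH|^~\\&|LAB\rPID|0001||456^^^^PI\r"),
    ]
    for data_type, body in merge_by_type(sample).items():
        print(data_type, len(body.splitlines()), "line(s)")
```

Writing each merged text to one file per data type would then give Hadoop a few large inputs instead of many tiny ones.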

14 (1) Data migration tool (adding key values)
HL7 filename (SS-MIX) convention:
Patient ID _ Transaction date/time _ Data type _ Placer order number _ Date/time of message _ Diagnosis and treatment department _ Condition flag
Example (only partly preserved): OUL ^ 脳外 ^L
HL7 message (contents), stored as one line with the key values in front:
<HT> <HT>OUL-01<HT> <HT> <HT>08^diabetes^L<HT>1<HT> MSH ^~\& OUL^R22^OUL_R22 PID ^^^^PI 匿名 ^ 患者名 ^^^^^L^I PV1 O 32^^^^^C 32 SPM ^ 全血 ( 添加物入り)^JC10^84^ 血漿 ^99Z13 OBR _ E518^ 血糖尿糖 ^99Z ORC SC _01 OBX OBX <LF>
Partial view of the same line:
2013041912150012345<HT>08^diabetes^L<HT>1<HT> MSH ^~\& OUL^R22^OUL_R22 PID 0001 9999994776^^^^PI 匿名 ^ 患者名 ^^^^^L^I PV1 0001 O 32^^^^^C 32 SPM 0001 019^ 全血 ( 添加物入り)^JC10^84^ 血漿
LEGEND: <HT>: horizontal tabulation (0x09); <LF>: line feed (0x0a)
Japanese terms in the sample: 匿名 = anonymous; 患者名 = patient name; 脳外 = neurosurgery; 全血 (添加物入り) = whole blood (with additives); 血漿 = plasma; 血糖尿糖 = blood glucose / urine glucose
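
As a rough illustration of adding key values, the Python sketch below splits a filename on underscores into the seven parts named in the convention and puts them, tab-separated, in front of the one-line message. The names `SsmixKeys`, `parse_filename` and `to_record` are hypothetical, and the sketch assumes that no individual part contains an underscore.

```python
from typing import List, NamedTuple


class SsmixKeys(NamedTuple):
    patient_id: str
    transaction_datetime: str
    data_type: str
    placer_order_number: str
    message_datetime: str
    department: str
    condition_flag: str


def parse_filename(name: str) -> SsmixKeys:
    """Split an SS-MIX style filename into its seven '_'-separated parts."""
    parts: List[str] = name.split("_")
    if len(parts) != len(SsmixKeys._fields):
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    return SsmixKeys(*parts)


def to_record(name: str, flattened_message: str) -> str:
    """Prefix a one-line HL7 message with its key values, tab-separated."""
    keys = parse_filename(name)
    return "\t".join([*keys, flattened_message]) + "\n"


if __name__ == "__main__":
    # All values below are made up for illustration.
    fname = "0000000001_20130419_OUL-01_000000000000001_20130419121500123_08_1"
    print(repr(to_record(fname, "MSH|^~\\&|LAB<CR>PID|0001||0000000001^^^^PI")))
```

Keeping the keys as leading tab-separated columns lets a query filter on them without parsing the HL7 message itself.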

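15 (2) HL7 message tabulating tool
Input: SS-MIX storage (HL7 message files)
MSH ^~\& OUL^R22^OUL_R22
PID ^^^^PI 匿名 ^ 患者名 ^^^^^L^I
PV1 O 32^^^^^C 32
SPM ^ 全血 ( 添加物入り)^JC10^84^ 血漿 ^99Z13
OBR _ E518^ 血糖尿糖 ^99Z
ORC SC _01
OBX 0001 NM 3D ^グリコヘモグロビンA1c 全血 ( 添加物入り)^JC10^ _84^ヘモグロビンA1c(JDS)^99Z % ^%^99Z F
SPM ^ 血清 ^JC10^85^ 血清 ^99Z13
OBR _ E564^ 生化学免疫 ^99Z
ORC SC _01
OBX 0001 NM 3C ^クレアチニン血清 ^JC10^ _85^クレアチニン(Cre) ^99Z mg/dl^mg/dl^99z F
Japanese terms in the sample: グリコヘモグロビンA1c = glycohemoglobin A1c; ヘモグロビンA1c = hemoglobin A1c; 血清 = serum; 生化学免疫 = biochemistry / immunology; クレアチニン = creatinine
Pig script (creates relations and schemas):
    -- Register the jar that holds the HL7 parsing UDFs
    REGISTER /usr/lib/pig/p4udf.jar;
    -- Bind the UDF to the list of HL7 fields to extract
    DEFINE NSSMIX p4udf.normalizessmix('pid_3_1 OBR_7 OBX_3_1 OBX_3_2 OBX_5');
    -- Load the merged file: key value columns first, then the one-line HL7 message
    a = LOAD 'SSMIX2/OML UTF-8.ssmix2.log' AS (aa,bb,cc,dd,ee,ff,ssmix:chararray);
    -- Parse each message into a bag of tuples
    b = FOREACH a GENERATE NSSMIX(ssmix) AS MEISAI;
Output:
{ ( , , 3D , グリコヘモグロビンA1c 全血 ( 添加物入り), 4.9 ), ( , , 3C , クレアチニン血清, 1.11 ) }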
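
To make the idea of tabulating concrete, here is a small Python sketch of what a function like the UDF might do. It is not the authors' implementation: it assumes standard HL7 v2 delimiters ("|" between fields, "^" between components, CR between segments), and it assumes that one row is emitted per OBX segment, combined with the most recent PID and OBR. The names `field` and `tabulate` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, str, str]


def field(segment: List[str], index: int, component: int = 0) -> str:
    """Return field `index` (1-based, HL7 style) of a split segment.

    `component` is 1-based too; 0 means the whole field. Missing positions
    yield an empty string. (MSH numbering differs and is not handled here.)
    """
    if index >= len(segment):
        return ""
    value = segment[index]
    if component == 0:
        return value
    parts = value.split("^")
    return parts[component - 1] if component <= len(parts) else ""


def tabulate(message: str, segment_sep: str = "\r") -> List[Row]:
    """Emit one (PID-3-1, OBR-7, OBX-3-1, OBX-3-2, OBX-5) row per OBX segment."""
    latest: Dict[str, List[str]] = {}
    rows: List[Row] = []
    for line in message.split(segment_sep):
        if not line:
            continue
        seg = line.split("|")
        latest[seg[0]] = seg  # remember the most recent PID, OBR, ...
        if seg[0] == "OBX":
            pid = latest.get("PID", [])
            obr = latest.get("OBR", [])
            rows.append((
                field(pid, 3, 1),
                field(obr, 7),
                field(seg, 3, 1),
                field(seg, 3, 2),
                field(seg, 5),
            ))
    return rows


if __name__ == "__main__":
    # Made-up message, for illustration only.
    msg = "\r".join([
        "MSH|^~\\&|LAB",
        "PID|0001||P1^^^^PI",
        "OBR|0001|||E518^glucose^99Z|||201304191215",
        "OBX|0001|NM|3D^HbA1c^JC10||4.9|%",
    ])
    for row in tabulate(msg):
        print(row)
```

Each row is a flat tuple, which is the shape a Pig relation needs before it can be filtered or joined.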

16 Implementation Process & Outcomes
1) Setting up the Hadoop cluster on a cloud service
2) Investigating query pattern requirements
3) Merging HL7 messages
4) Adding key values to each HL7 message
5) Migrating merged files to HDFS
6) Querying clinical data with Pig scripts
7) Preliminary evaluation

17 1) Setting up the Hadoop Cluster on a Cloud Service
Figure 2. Hadoop Cluster
- Job/Name Node: cdh-00
- Data Nodes (HDFS): cdh-01 to cdh-10 (10 nodes in a cloud), each running CentOS V5.6 with 6 virtual CPUs, 16 GB memory and a 100 GB virtual disk
- Client PC to launch Pig scripts: pig-01

18 2) Investigating Query Pattern Requirements
We classified and subtotaled 195 retrieval request sheets submitted by clinicians working for The University of Tokyo.
Figure 3. Frequency of using Data Type
- Patient property (ADT): 138
- Diagnosis (PPR): 67
- Dosing (RDE): 39
- Lab. results (OUL): 35
Figure 4. Frequency of data types used in the search requests

19 3) Merging HL7 messages, 4) Adding key values to each HL7 message, 5) Migrating merged files to HDFS
Data flow diagram (overview):
- HIS/EHR data: patient information; medication and injection; laboratory results; x-ray, CT and MRI reports, etc.; diagnosis; pathological diagnosis; endoscopic diagnosis; physiological diagnostics (electrocardiogram)
- SS-MIX server -> SS-MIX storage (standardized, HL7 V2.5)
- 3) 4) Data migration tool -> gateway server -> 5) file transport / upload to HDFS in the cloud computing service
- Multi-institutional medical storage: HL7 V2.5 data from Institution A, Institution B, Institution C, ... n, each stored on HDFS
The merged files consist of:
- 450,000 patient properties (ADT)
- 3.5 million disease diagnosis records (PPR)
- 10 million laboratory test result records (OUL)
- 2 million drug dosing records (RDE)
recorded between June 2010 and January [year missing].

20 6) Querying clinical data with Pig scripts
Sample (benchmark query): retrieve the IDs of patients who were diagnosed with type 2 diabetes mellitus, who visited as outpatients in 2010, and whose lab test results satisfied (HbA1c >= 6.5 %) and (CRE < 2.0 mg/dl). The two specimen collection dates/times must be within seven days of each other.
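
The selection logic of this benchmark query can be sketched in plain Python as below. This is not the Pig script used in the evaluation: the record layout (`LabResult`), the pre-computed ID sets and the function name `benchmark_query` are assumptions made for illustration, and "within seven days" is taken to include exactly seven days.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple, Set


class LabResult(NamedTuple):
    patient_id: str
    test: str            # "HbA1c" or "CRE"
    value: float
    collected: datetime  # specimen collection date/time


def benchmark_query(
    diabetic_ids: Set[str],
    outpatient_2010_ids: Set[str],
    labs: Iterable[LabResult],
    window: timedelta = timedelta(days=7),
) -> List[str]:
    """Return IDs of diabetic 2010 outpatients with a qualifying lab pair."""
    candidates = diabetic_ids & outpatient_2010_ids
    hba1c: Dict[str, List[datetime]] = {}
    cre: Dict[str, List[datetime]] = {}
    for r in labs:
        if r.patient_id not in candidates:
            continue
        if r.test == "HbA1c" and r.value >= 6.5:
            hba1c.setdefault(r.patient_id, []).append(r.collected)
        elif r.test == "CRE" and r.value < 2.0:
            cre.setdefault(r.patient_id, []).append(r.collected)
    hits: List[str] = []
    for pid in sorted(hba1c.keys() & cre.keys()):
        # Keep the patient if any HbA1c/CRE pair was collected within the window.
        if any(abs(a - b) <= window for a in hba1c[pid] for b in cre[pid]):
            hits.append(pid)
    return hits


if __name__ == "__main__":
    # Made-up records, for illustration only.
    labs = [
        LabResult("A", "HbA1c", 7.0, datetime(2010, 5, 1)),
        LabResult("A", "CRE", 1.0, datetime(2010, 5, 5)),
        LabResult("B", "HbA1c", 7.0, datetime(2010, 5, 1)),
        LabResult("B", "CRE", 1.0, datetime(2010, 6, 1)),
    ]
    print(benchmark_query({"A", "B"}, {"A", "B"}, labs))
```

In Pig the same steps would typically be expressed as filters on each relation followed by joins on the patient ID.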

21 7) Preliminary evaluation
We defined benchmark queries and compared query performance in the cloud environment with performance in an on-premise DWH. Queries executed against the cloud-based HL7 storage were significantly faster than queries executed by the on-premise DWH.
Figure 5. Result of sample benchmark query

22 Conclusion: outcomes and lessons learnt
- A Hadoop cloud-based system designed to share HL7 messages between medical organisations has the potential to speed up data-retrieval queries.
- The system offers potential for efficient, fast data retrieval, and substantial benefits to clinicians seeking information on specific medical data.

23 Acknowledgement
This research was funded by the Japan Society for the Promotion of Science (JSPS) through the Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program), initiated by the Council for Science and Technology Policy (CSTP).

24 The End (FIN)
