Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management

1 Kiran George, 2 Chien-In Henry Chen
1 Corresponding Author, Computer Engineering Program, California State University, 800 N. State College Blvd., Fullerton, CA 92381, kgeorge@fullerton.edu
2 Department of Electrical Engineering, Wright State University, Dayton, Ohio 45435, henry.chen@wright.edu

Abstract

With the recent mandate requiring all medical records to be computerized, health care providers face the difficult challenge of quickly and efficiently accessing and sorting through extremely large databases of electronic medical records. To minimize database query response time, an expedited query application on a high-performance system is needed. High performance computing (HPC) allows us to ask questions on a scale that was not previously possible; it is a form of computing that uses parallel processing techniques to run programs more efficiently and quickly on multiple processing units. In this paper, a high-performance computing application, expedited electronic medical record query (E2MRQ), on a GPU cluster for the electronic medical record (EMR) databases used in hospitals around the country is presented. This application is capable of quickly and efficiently searching through the vast medical databases used by healthcare professionals. Performance measurements of the E2MRQ application on the GPU-based HPC cluster, including timing for data transfer within the cluster and timing performance for basic data operations, are presented.

Keywords: Electronic medical records; high performance computing; Graphics Processing Units; database query; cluster computing; parallel processing

1. Introduction

The Health Information Technology for Economic and Clinical Health (HITECH) Act, enacted as part of the American Recovery and Reinvestment Act (ARRA) of 2009 [1], mandates that all medical and health-care providers switch to a digital medical information system, ensuring that all medical records in the United States will be converted into electronic medical records (EMRs) by 2014. Health Information Exchange (HIE), the electronic mobilization of healthcare and medical information across an organization, community, or hospital system, enables doctors and nurses to pass medical records and other clinical information among health care information systems while maintaining the integrity of the information being moved. The overall goal of HIE is to speed up the access and retrieval of medical information so as to provide effective and efficient health care. With patients' EMRs stored in a database, doctors will be able to quickly bring up a patient's medical history and, through the use of complex algorithms, predict whether a patient will develop a certain medical condition based on that history. However, with all EMRs stored in a database, data search can become remarkably time consuming. To make record retrieval as quick and efficient as possible, a large-scale high-performance computing (HPC) cluster can be used to vastly improve database query response time. Research has consistently shown that clusters built from graphics processing units (GPUs) can significantly outperform comparable CPU-based systems [2][3] while reducing space, heat, power, and cooling requirements. This makes GPU clusters much more favorable than CPU clusters of similar cost.
However, the challenges involved in setting up such a system are: a) having the correct dataflow, i.e., making sure the correct information and data are sent to the correct node, GPU, etc.; b) communication between the nodes in the cluster, i.e., making sure all nodes are in sync and processing the correct information, as well as sending the necessary information to and from the head node; and c) making sure that the system is time- and cost-efficient so that it maintains a favorable cost-performance ratio.

International Journal of Advancements in Computing Technology (IJACT), Volume 7, Number 1, January 2015

In this paper, a high-performance computing application, expedited electronic medical record query (E2MRQ), on a GPU cluster for the EMR databases used in hospitals around the country is presented. This application is capable of quickly and efficiently searching through the vast medical databases used by healthcare professionals. Monetary savings could be realized because timely availability of information to care providers would increase efficiency and accuracy in healthcare; expediting the processing of large data sets would also mean that health patterns could be recognized more quickly across broader groups of patients.

2. High-Performance Computing (HPC) and Cluster Architecture

High-performance computing (HPC) has gained considerable popularity in recent years. Its use of parallel processing and cluster computing for running computationally intensive applications more efficiently, reliably, and quickly has attracted scientists, engineers, teachers, and the military. With demand for higher processing speed and power constantly on the rise, HPC systems will soon garner the interest of businesses in every field. The overarching goal of HPC is to solve complex problems with more accuracy and higher speed by offloading computation-intensive tasks to specialized hardware accelerators such as NVIDIA Tesla Graphics Processing Units (GPUs) and Xilinx FPGAs, while still being exceptionally efficient. These specialized processors can be configured using the NVIDIA Compute Unified Device Architecture (CUDA) and hardware description languages (HDLs) such as VHDL/Verilog, respectively.
GPUs have a large number of programmable cores (single-instruction, multiple-data (SIMD) architecture), high on-chip memory bandwidth, support for floating-point operations, and ease of programmability through high-level languages and application programming interfaces (APIs). All these properties make the GPU valuable as a common off-the-shelf (COTS) hardware accelerator. Currently, GPUs are used in a wide array of applications such as medical imaging [4-6], audio and video processing [7][8], and data mining [9][10].

Figure 1(a). Dataflow block diagram of the GPU cluster

An HPC system utilizes parallel processing, cluster computing, or a combination of the two. Parallel processing uses one or more processing units (CPUs or GPUs) to process information, while cluster computing links one or more computers together. The GPU cluster utilized for this application comprises a head (master) node and six compute (slave) nodes (Figure 1; Table 1). Along with the nodes are a Mellanox InfiniScale IV IS5031 QDR 18-port InfiniBand switch, an IOGEAR GCL1808KITU 8-port LCD combo KVM switch, and an APC SUA3000RMT2U 3000 VA / 2700 W Smart-UPS, all in an APC AR3107 48U enclosure. Each node is a SUPERMICRO SYS-7046GT-TRF 4U server containing two quad-core Intel Xeon E5520 Nehalem processors (2.26 GHz, 8M cache), 12 GB of DDR3 memory, and two NVIDIA Tesla C2050 GPUs (3 GB GDDR5). The motherboards of all the nodes support 4 full-bandwidth PCIe Gen 2 x16 slots, 2 PCIe Gen 2 x4 slots, 1 PCIe Gen 1 x4 slot, and 2 PCI 33 MHz slots.

Figure 1(b). GPU cluster comprising a head node and 6 compute nodes in an APC AR3107 48U enclosure

Table 1. Components and Capability of the GPU Cluster
Head and compute node systems: SUPERMICRO SYS-7046GT-TRF 4U server (OS: CentOS; motherboard supports 4 full-bandwidth PCIe Gen 2 x16 slots, 2 PCIe Gen 2 x4 slots, 1 PCIe Gen 1 x4 slot, and 2 PCI 33 MHz slots)
CPU for nodes: Quad-core Intel Xeon E5520 Nehalem (8M cache; 2.26 GHz; 5.86 GT/s Intel QPI)
CPUs per node: 2
Head and compute node memory (GB): 12
GPU for compute nodes: NVIDIA Tesla C2050 (448 cores; 1.15 GHz; memory bandwidth: 144 GB/sec; dedicated memory: 3 GB GDDR5)
GPUs per compute node: 2
Interconnect: Mellanox InfiniBand QDR (ConnectX-2 VPI InfiniBand)
No. of compute nodes: 6
2.1. Processors

NVIDIA Tesla C2050 GPUs are utilized in all the nodes. These cards have 14 multiprocessors, each with 32 CUDA cores, for a total of 448 CUDA cores per GPU. Each core has a clock frequency of 1.15 GHz. The cards are capable of reaching up to 515 gigaflops of double-precision floating-point performance and up to 1.03 teraflops of single-precision floating-point performance. Each card comes with 3 GB of GDDR5 memory clocked at 1.5 GHz. The C2050 has a 384-bit memory interface and a memory bandwidth of 144 GB/s [11]. The CPUs used in both head and compute nodes are a pair of Intel Xeon E5520 Nehalem 2.26 GHz quad-core processors [12], which connect to the PCI Express interfaces over a QuickPath Interconnect (QPI) running at 5.86 GT/s.

Table 2. Operating System and Other Software on the Cluster
Operating System: CentOS 5.5
GPU Software: NVIDIA Developer Drivers 295.41
Compute Software: NVIDIA CUDA and SDK 4.2
InfiniBand Software: Mellanox OFED 1.5.3
MPI Software: MVAPICH 1.2.0
DBMS Software: PostgreSQL 8.4.13

2.2. Software (Table 2)

NVIDIA CUDA is the software architecture that allows the Tesla and other NVIDIA cards to be used for parallel processing from high-level languages such as C. The Mellanox OFED package provides the drivers and tools for the Mellanox InfiniBand adapters and the InfiniBand network, and MVAPICH provides the Message Passing Interface (MPI) software. MPI is the component of the E2MRQ application that passes information from the head node to the compute nodes and vice versa. PostgreSQL is an object-relational database management system (DBMS) that uses B-trees to store its elements. The data that E2MRQ processes comes from a PostgreSQL DBMS. Although PostgreSQL is installed on all of the nodes for code compilation, only the head node contains the database; thus it is the only node that performs the necessary DBMS operations.

3. E2MRQ Application

The E2MRQ application is designed to enhance database query performance by integrating GPUs into the toolset of the DBMS. GPUs have proven capable of handling massively parallel matrix manipulations. The basic process is as follows: the data set is pulled from its storage, conceptualized as matrices, and processed on the GPU. With E2MRQ, the application pulls a large data set from a PostgreSQL DBMS and then distributes the data from the head node to the compute nodes using MPI over the InfiniBand interconnect. The compute nodes then pass the data from the CPU to the GPU, where it is processed. Following that, the GPUs pass the processed data back to the respective CPUs in the nodes, and then (using MPI and InfiniBand) each node sends its data back to the head node.
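The head-node-to-compute-node distribution described above can be sketched as follows. This is a minimal serial illustration under our own naming (ROWS, COLS, NODES, and partition() are not from the paper); it shows only the row-block arithmetic, with plain memcpy() standing in for the MPI_Send() and cudaMemcpy() transfers the real application performs.

```c
#include <assert.h>
#include <string.h>

#define ROWS  1024   /* rows in the full table                 */
#define COLS  1024   /* columns in the full table              */
#define NODES 4      /* compute nodes (a power of 2)           */

/* Each node receives a contiguous block of whole rows; with a
 * row-major, one-dimensional layout, one block is a single
 * contiguous copy (or one MPI_Send) of rows_per_node * COLS
 * elements starting at the node's row offset.                 */
static void partition(const float *table, float *node_buf, int node)
{
    int rows_per_node = ROWS / NODES;             /* 256 rows each   */
    size_t elems = (size_t)rows_per_node * COLS;  /* elements/node   */
    memcpy(node_buf, table + (size_t)node * elems, elems * sizeof *table);
}
```

With four nodes, each block holds 256 of the 1024 rows across all 1024 columns, matching the example given in Section 3.1.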
3.1. Process (Figure 2)

The PostgreSQL database table is set up on the head node before processing. Since the head node is the only node that carries the PostgreSQL database, E2MRQ retrieves the database and saves it into that node's memory as a single-dimensional array. For GPU memory management, the tables had to be generated on the host (CPU) side as single-dimensional arrays. Before that, however, MPI is used to divide the information among the compute nodes that will be used for processing. For example, if four compute nodes are used, then the head node will distribute the array, via MPI_Send(), in four equal-sized portions to the compute nodes. Each compute node will then have an array spanning all 1024 columns but only 256 of the rows.

Figure 2. Data Management Analysis

After the data has been successfully passed to the nodes, the array is copied to the GPUs using the CUDA memory-copy operation cudaMemcpy(), which transfers it to device (GPU) memory. This happens only once, at the beginning of the application. After this initial memory copy, all of the functions performed on the GPU use the data that was copied to device memory. A second memory copy is done from device to host memory after each function, both for further CPU analysis and for sending the data back to the head node. The functions in the E2MRQ application that retrieve data from a PostgreSQL database and process it across one or several GPUs include: a) search for a key value in the database; b) sum a vector (column or row); c) sort the database based on a vector; and d) pattern search.

1) Key search within column

The application begins by searching for a pre-defined key value in a column of the database. As the database has been divided into equal groups of rows, each node searches the same column of the database.
A separate, equally sized array is allocated to store the results of the search. When the key is found in any cell, the result array is marked true at that same location. After the processing, the result array is copied to the host.

2) Column and Row Summation

Summation of a column or row is frequently used to assess user counts, inventories, and location data. In this function, the values within a column or row are added together and the result is returned in a dynamic floating-point array data structure. The process starts by dividing the array in half, adding the two halves element by element, and storing the result in the first half. The second half is then discarded and the procedure repeated until all of the values have been accumulated into the first element of the array. When the summation is completed, the first compute node retrieves the partial sums from the other compute nodes using MPI and adds them to obtain the final result.

3) Radix Sort

In this function, the portion of the database is loaded into device memory and sorted using the Thrust library [13] from the CUDA software stack.
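The halving scheme used for the summation in 2) can be illustrated serially. This is a sketch under our own naming; on the GPU each pass of the outer loop is performed by many threads at once, whereas here the inner loop plays that role one element at a time.

```c
#include <assert.h>

/* Repeatedly fold the second half of the array onto the first
 * half; after log2(n) passes the total sits in element 0.      */
static float halving_sum(float *v, int n)  /* n must be a power of 2 */
{
    for (int half = n / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];
    return v[0];
}
```

As described above, each compute node would produce such a partial sum over its own rows, and the partial sums would then be combined across nodes via MPI.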
4) Quick Sort

In this function, the host CPU performs the Quicksort algorithm on the array. This sort is done to provide a comparison point for the Radix Sort on the GPU.

5) Pattern Search

During the execution of this function, the database is loaded into device memory and a pattern search with key characteristics is conducted. The characteristics are searched one at a time: the first characteristic is searched for throughout the array and, when found, its location is recorded; the next characteristic is then searched for at the adjacent index. This process continues until all the characteristics are found.

Table 3. Timing Results for Basic Integer Operations
Operation: Timing (ms)
Key search within column: 0.095
Column Summation: 0.568
Row Summation: 0.345
Radix Sort: 3.042
Quick Sort: 0.288
Pattern Search (w/ five key characteristics): 0.831

Table 4. Comparison of Timing Results for Integer Queries
Integer Queries: Transfer Time (ms) | GPU Time (ms) | Total Time (ms)
[14]: 25.9 | 56.6 | 82.2
E2MRQ: 5.09 | 22.36 | 27.45

3.2. Performance

1) Data Transfer and Timing

The timing for transferring 1024×1024 data elements and processing them from the head node to the compute node(s) is given in Figure 3. The time it takes for the head node to transfer the data to host RAM is 1.90 sec. Next, depending on the number of nodes used in each configuration, the transfer time to the memory of the compute nodes varies from 3.19 ms for a single-node configuration to 3.66 ms for a four-node configuration. The computation time required for a pattern search with five key characteristics also varies with the number of nodes in the configuration, ranging from 1.83 ms for a single node with two GPUs to 0.83 ms for a four-node configuration with 8 GPUs.

2) Timing Performances for Basic Data Operations

The timing results of the basic operations of the E2MRQ application on the four-node cluster configuration are given in Table 3.
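The adjacent-index pattern search described in 5) above can be sketched serially as follows. The function name and integer data are our own; on the GPU, many candidate starting indices would be tested in parallel, and the five key characteristics of Table 3 correspond to k = 5.

```c
#include <assert.h>

/* Scan for the first key characteristic; once found, require each
 * remaining characteristic at the immediately adjacent index.
 * Returns the starting index of the first full match, or -1.     */
static int pattern_search(const int *data, int n, const int *keys, int k)
{
    for (int i = 0; i + k <= n; i++) {
        int j = 0;
        while (j < k && data[i + j] == keys[j])
            j++;
        if (j == k)
            return i;          /* all k characteristics found in a row */
    }
    return -1;
}
```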
Tests were conducted to obtain the time to search through the entire database comprising 1024×1024 data elements. A comparison of the E2MRQ application with a similar project [14], which also had searching as its primary objective, is discussed next. The systems utilized in both projects are similar, except that in [14] only a single GPU is used for computation, whereas the cluster utilized for the E2MRQ application had 4 compute nodes, each with two GPUs. Both [14] and the E2MRQ application copied the necessary data to the graphics cards before the computations were executed. As the integer query is the only comparison where the two projects truly line up, Table 4 compares the timing results for the E2MRQ application implemented on a single GPU. As can be observed, the E2MRQ application provided over 66% improvement in performance when processing the array (a total time of 27.45 ms versus 82.2 ms).

Figure 3. GPU cluster block diagram illustrating the data flow and timing

4. Conclusion

Performance measurement of a GPU-based HPC cluster utilized for an electronic medical record database query application was presented; timing for data transfer within the cluster and timing performance for basic data operations were evaluated. The data transfer time and the computation time required for a pattern search with five key characteristics improved with the number of compute nodes utilized to execute a query. Furthermore, the E2MRQ application demonstrated over 66% improvement in timing compared to a similar project involving integer queries.
5. Future Work

This application can be further expanded upon by: a) allowing it to work on any number of nodes; currently, the application only runs on a number of nodes that is a power of 2; b) processing queries with an FPGA to see if there are any performance benefits.

6. References

[1] American Medical Software: http://americanmedical.com/2011/04/ehr-what-does-it-mean-and-do-we-have-to-go-there/
[2] J. Canny and H. Zhao, "Big data analytics with small footprint: squaring the cloud," Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2013.
[3] M. C. Altinigneli, C. Plant, and C. Böhm, "Massively parallel expectation maximization using graphics processing units," Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2013.
[4] T. Idzenga, E. Gaburov, W. Vermin, J. Menssen, and C. De Korte, "Fast 2-D ultrasound strain imaging: the benefits of using a GPU," IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 61, no. 1, pp. 207-213, 2014.
[5] M. Tanter and M. Fink, "Ultrafast imaging in biomedical ultrasound," IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 61, no. 1, pp. 102-109, 2014.
[6] J. R. Ferreira, M. Costa Oliveira, and A. Lage Freitas, "Performance evaluation of medical image similarity analysis in a heterogeneous architecture," IEEE 27th International Symposium on Computer-Based Medical Systems, pp. 159-164, 2014.
[7] S. Momcilovic, A. Ilic, N. Roma, and L. Sousa, "Dynamic load balancing for real-time video encoding on heterogeneous CPU+GPU systems," IEEE Transactions on Multimedia, vol. 16, no. 1, 2014.
[8] J. A. Belloch, B. Bank, L. Savioja, A. Gonzalez, and V. Valimaki, "Multi-channel IIR filtering of audio signals using a GPU," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6692-6696, 2014.
[9] B. Liu, C. Yu, D. Z. Wang, R. C. C. Cheung, and H. Yan, "Design exploration of geometric biclustering for microarray data analysis in data mining," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 10, pp. 2540-2550, 2014.
[10] M. Hyun Jo and W. W. Ro, "DPM: Data Partitioning Method for pipelined MapReduce on GPU," 18th IEEE International Symposium on Consumer Electronics, pp. 1-3, 2014.
[11] NVIDIA Tesla C2050 GPU: http://www.nvidia.com/docs/io/43395/nv_ds_tesla_c2050_c2070_jul10_lores.pdf
[12] Intel Xeon E5520 CPU: http://ark.intel.com/products/40200/intel-xeon-processor-e5520-8m-cache-2_26-ghz-5_86-GTs-Intel-QPI
[13] Thrust Library: http://code.google.com/p/thrust/
[14] P. Bakkum and K. Skadron, "Accelerating SQL database operations on a GPU with CUDA," GPGPU Conference, 2010.