Development of Real-time Big Data Analysis System and a Case Study on the Application of Information in a Medical Institution




pp. 93-102, http://dx.doi.org/10.14257/ijseia.2015.9.7.10

Development of a Real-time Big Data Analysis System and a Case Study on the Application of Information in a Medical Institution

Mi-Jin Kim and Yun-Sik Yu
Convergence of IT Devices Institute Busan (CIDI), Dong-Eui University
agicap@deu.ac.kr, ysyu@deu.ac.kr

Abstract

Individual data might not be thought of as important for business purposes. However, Big Data analytics use cases are increasing, because once individual data is collected in large volumes it becomes a valuable aggregate from which hidden information can be found. Hadoop, one of the conventional Big Data analytics technologies, is to date a widely accepted technology for analyzing structured and unstructured Big Data. However, because Hadoop is a batch processing system, response-time latency grows with larger data, which makes it difficult to perform real-time analysis of massive amounts of high-speed event data under current business and market conditions. In this paper, open source CEP (Complex Event Processing) technologies are used as an alternative for rapidly changing businesses: we develop a real-time analytics system that can analyze thousands of event streams per second without latency, designed to be applicable to medical-institution ERP systems.

Keywords: Real-time analysis system, Big Data, CEP

1. Introduction

In the Smart era, social networks, IoT and life-log data are key elements of the transition into the Big Data era. Smart devices produce large amounts of data, which is collected in distributed files and then processed into useful information. Big Data is an intelligence-service technology that processes massive amounts of unstructured data and extracts information that previously could not be understood or analyzed. This technology will raise the bar of service quality for users and change the role and value of experts in the near future [1].
Social media and Internet companies including Google, Facebook, Amazon and Yahoo are attempting to operate public services such as transport systems, marine resource systems, security systems and medical systems using Big Data, proving the effect of Big Data-based social analysis, while companies in the private sector are hurrying to develop Big Data-based business models for improved management and marketing efficiency, as well as increased profitability and more efficient processes. Individual data might not be thought of as important for business purposes; however, Big Data analytics use cases are increasing, because once individual data is collected in large volumes it becomes a valuable aggregate from which hidden information can be found. Hadoop, one of the conventional Big Data analytics technologies, is to date a widely accepted technology for analyzing structured and unstructured Big Data. However, because Hadoop is a batch processing system, response-time latency grows with larger data, which makes it difficult to perform real-time analysis of massive amounts of high-speed event data under current business and market conditions. In addition, the need for massive Big Data is increasing at various corporate levels; in reality, however, access to Big Data has been limited because of the high cost of analytics systems. ISSN: 1738-9984 IJSEIA Copyright (c) 2015 SERSC

In this paper, open source CEP (Complex Event Processing) technologies are used as an alternative for rapidly changing businesses: we develop a real-time analytics system that can analyze thousands of event streams per second without latency, designed to be applicable to medical-institution ERP systems. The system, which enables efficient patient management and administrative management at a medical institution, was implemented on the ERP systems of small and medium-sized hospitals to analyze patient-oriented information and equipment data.

2. Related Works

2.1. Big Data

Big Data refers to technology for extracting value and analyzing results from massive structured and unstructured data sets beyond the capacity of conventional database management tools to collect, save, manage and analyze [2]. Broadly, the attributes of Big Data are called the 3Vs: Volume, Velocity, and Variety; more recently, Value or Complexity is also added. From the perspective of using Big Data, one of its most important attributes is low cost, because Big Data was born in the hope of storing and processing data at lower cost than conventional systems allow [3]. Big Data technologies are largely divided into data collection, storage, processing, analytics, presentation, and management (infrastructure and business) technologies. Looking at the requirements of each field: in collection, massive data-loading time dominates the total time as data keeps growing; in storage, saving and managing the data is costly; and in processing and analytics, there are concerns about insufficient timeliness due to long processing times, as well as high processing and operating costs.

2.2. Hadoop
Hadoop is a solution focused on distributed processing technology, and is currently the most popular choice for Big Data processing. It is a Java-based Apache open source framework that processes massive data using a relatively simple programming model [4]. Hadoop is used as a core technology by Yahoo and Facebook, and is built into many other companies' own solutions. Hadoop is composed of the distributed file system HDFS (Hadoop Distributed File System) and the distributed processing system MapReduce. HDFS is an open source implementation of the GFS (Google File System) model, and therefore shares the same characteristics as GFS. HDFS divides massive files into chunks (64 MB by default), distributing three replicas of each chunk across the data nodes; metadata recording which data node holds each chunk is saved on the Namenode. Like MapReduce, HDFS follows a master/slave structure, with the Namenode as master and multiple Datanodes as slaves. As shown in Figure 1, the Namenode manages the file-system metadata and controls data I/O between the Client and the Datanodes. The Datanodes save the actual data, performing data I/O directly with the Client as well as block replication.
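The map/shuffle/reduce flow described above can be illustrated with a minimal in-memory sketch (plain Python, not the Hadoop API): map emits key-value pairs from each input line, a shuffle groups the pairs by key, and reduce aggregates each group — here, a word count over log lines.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as Hadoop does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group; here, summing the 1s gives a word count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["error disk full", "warning disk slow", "error disk full"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["error"])  # 2
print(counts["disk"])   # 3
```

In real Hadoop, the map and reduce functions run in parallel across the Datanodes and the shuffle moves data over the network; the sketch only shows the programming model.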

Figure 1. HDFS Structure

The MapReduce framework is a programming model for distributed, parallel processing in which previously time-consuming jobs can be completed in a short time as batches. It supports custom key-value extraction from massive data and performs high-speed processing based on binary search and suitable hash algorithms. With these characteristics, Hadoop is used in fields that store and analyze massive multimedia data as well as log data [5].

2.3. CEP (Complex Event Processing)

CEP is a complex event processing technology that extracts meaningful data in real time from events arriving from various event sources and then performs the corresponding actions [6]. Event data here means stream data: continuous, massive, unending inputs in which time order matters. It is impossible to process and analyze such stream data in real time with a conventional relational database. CEP is an event-data processing solution that provides real-time analysis of such streams; that is, it can process hundreds of thousands to millions of high-speed events in memory without first saving them to a database, file, or Hadoop. A CEP engine performs the processing: when you register the events generated by various systems and the event patterns to be extracted, the engine filters, aggregates, collects, and joins the event streams, and then detects the desired patterns through pattern matching. Basically, CEP-based analysis proceeds in the order Visibility -> Understanding -> Insight: visualized real-time internal data lets you understand the connectivity and patterns between events, from which the intuition needed for the corporate environment can be acquired, allowing a business to respond in real time [7].
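The filtering and aggregation a CEP engine performs can be sketched roughly as follows (a stdlib Python illustration, not any specific engine's API; the heart-rate event type and threshold are invented for the example): events stream through a sliding window, a registered pattern filters them, and an aggregate fires an action when a threshold is crossed.

```python
from collections import deque

class SlidingWindowCEP:
    """Minimal in-memory CEP-style stream processor (illustrative only)."""

    def __init__(self, window_size, threshold, on_alert):
        self.window = deque(maxlen=window_size)  # keeps only the newest events
        self.threshold = threshold
        self.on_alert = on_alert

    def push(self, event):
        # Filtering: only events matching the registered pattern enter the window.
        if event.get("type") != "heart_rate":
            return
        self.window.append(event)
        # Aggregation: once the window is full, compute the average and
        # fire the registered action when the threshold is crossed.
        if len(self.window) == self.window.maxlen:
            avg = sum(e["value"] for e in self.window) / len(self.window)
            if avg > self.threshold:
                self.on_alert(avg)

alerts = []
cep = SlidingWindowCEP(window_size=3, threshold=120, on_alert=alerts.append)
for value in [110, 115, 130, 140, 150]:
    cep.push({"type": "heart_rate", "value": value})
cep.push({"type": "noise", "value": 999})  # filtered out, never enters the window
print(alerts)
```

Because everything stays in memory and each event is handled as it arrives, there is no batch latency — the property the paper contrasts with Hadoop.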

Figure 2. CEP Architecture

2.4. Hadoop and CEP Approaches to Big Data

Big Data methodologies have mainly been Hadoop-oriented approaches from the storage perspective. More recently, however, the choice of approach to Big Data has become more important: the batch-analysis perspective of the Hadoop ecosystem remains important, but the CEP-based perspective of real-time distributed processing is growing increasingly important, as noted in the Gartner report. The differences between the analytics mechanisms of the conventional DB, Hadoop-based batch processing, and in-memory real-time processing of various high-speed event streams are shown in Figure 3 [8]. In this study, we adopted CEP technology with an in-memory analytics mechanism in which data is saved after analysis.

Figure 3. Analysis Processing Mechanisms of Hadoop and CEP

2.5. Definition and Characteristics of NoSQL

Unlike a conventional relational database (RDBMS), NoSQL refers to a differently designed database [9]. NoSQL uses a consistency model that is less limiting than the conventional RDBMS, so far more data can be saved, which makes it more effective for handling Big Data. NoSQL is also characterized by horizontal scalability, a new kind of database design meant to overcome the limitations of the conventional relational database. In terms of the CAP theorem for distributed systems, NoSQL adopts partition-tolerant AP or CP configurations for distributed processing, while relational databases adopt CA. As a distributed database for serving massive amounts of data to users in a cloud environment, NoSQL focuses on availability and instant response. Therefore,

NoSQL is much more efficient than a conventional DBMS in connection-pool management and fault tolerance. Many enterprises use various kinds of NoSQL to provide users with cloud environments, including Google's Bigtable, Amazon's Dynamo, and MongoDB. In particular, Cassandra [10] is a distributed database that combines the characteristics of Google's Bigtable with those of Amazon's Dynamo; each server belongs to a ring-connected cluster, supporting fault tolerance and high availability. Figure 4 shows the results of a survey of NoSQL products and consumer preferences: Cassandra is the most popular thanks to its superior scalability, followed by CouchDB, MongoDB, and HBase.

Figure 4. Results of NoSQL Product and Consumer Preferences Survey

NoSQL products are being actively advanced. Because the applicable fields differ by product, as summarized in Table 1, Cassandra was chosen for this study.

Table 1. Applicable Fields of NoSQL Products

- CouchDB (Erlang/Apache): master data that is not frequently accumulated or changed.
- Redis (C/C++/BSD): management of frequently changing information; stock information, analysis, real-time data collection, real-time communications, etc.
- MongoDB (C++/AGPL): massive databases and frequently changing information; a candidate when intending to replace a conventional RDBMS.
- Membase (Erlang & C/Apache 2.0): fields requiring short delay times and high concurrency; data is saved to memory first, with disk as secondary storage; online gaming, etc.
- Cassandra (Java/Apache): fields with more writes than reads; information processing in banking and financial institutions; real-time information analytics.
- HBase (Java/Apache): real-time massive data processing.

A Cassandra installation is divided into multiple networked data centers.
A node is one running Cassandra server process on one machine, and one logical Cassandra store is composed of such connected nodes. The structure of Cassandra's logical storage is shown in Figure 5. Although Cassandra is in fact distributed across multiple machines, it is designed so that the cluster appears to an end user as a single instance. Cassandra partitions data across the cluster, assigning data to each node of the ring. Nodes can be added to and removed from the cluster; when a new node joins, it requires a seed node that provides the cluster information — that is, the seed node tells the new node how the ring of nodes is connected. One or more seed nodes can be designated in a cluster.
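The ring structure described above can be sketched with a toy consistent-hash ring (stdlib Python; real Cassandra uses configurable partitioners and virtual nodes, so this only illustrates the idea): keys are hashed onto a ring, and each key is owned by the first node clockwise from its hash, so adding a node remaps only the keys falling into its range.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring in the spirit of Cassandra's token ring."""

    def __init__(self):
        self.tokens = []   # sorted hash positions on the ring
        self.owners = {}   # token -> node name

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, name):
        token = self._hash(name)
        bisect.insort(self.tokens, token)
        self.owners[token] = name

    def node_for(self, key):
        # First token clockwise from the key's hash, wrapping around the ring.
        h = self._hash(key)
        i = bisect.bisect_right(self.tokens, h) % len(self.tokens)
        return self.owners[self.tokens[i]]

ring = HashRing()
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node)

# Each record key maps deterministically to one node of the ring.
owner = ring.node_for("patient:1001")
assert owner in {"node-a", "node-b", "node-c"}

# Adding a node only remaps the keys that now fall in its range.
ring.add_node("node-d")
```

The node and key names here are hypothetical; the point is the cluster-membership behavior the paper describes — nodes can join and leave while keys stay evenly assigned.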

Figure 5. Cassandra Logical Storage Structure

3. System Composition and Design

CEP technology is currently emerging across business fields as the simplest and strongest way to implement real-time business intelligence based on timely analysis, providing new value such as real-time monitoring with early alarms and production-floor management by processing and analyzing various events. In this study, we aimed to provide a systematic and organized business environment for efficient patient management and administrative management at hospitals by exploiting the advantages of CEP, in consideration of the low cost of Big Data, and by building a real-time Big Data analytics system integrated with medical-institution ERP systems, for which there are as yet few cases. In addition, the main adaptor and data publisher/customizing functions were implemented so that UI screens can be identified and developed according to the needs of each hospital. The real-time analytics system architecture is composed as shown in Figure 6. Incoming and outgoing real-time data of various kinds pass through the Event Adaptor, which converts internal and external event types such as data protocols and formats. While mapping events onto the streams used for real-time analysis, the Event Collector also saves the incoming events from each system to NoSQL, which enables time-series analysis as well. For the NoSQL DB, batch event processing was done through Cassandra. In the Big Data Analyzer, the data collected by the Event Collector is analyzed by the open source real-time analysis engine (CEP) and the Hive-based batch layer, and then mapped into a reporting-ready form. The Event Generator converts the real-time analysis results into the types and protocols the user wants, and also saves the analyzed results to the database in real time.
The Reporter provides Web-based analysis functions that give the user a visualized view of the CEP-based real-time analytics system, including a Dashboard, alarms, and analysis scheduling for the real-time results.
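The Adaptor -> Collector -> Analyzer -> Generator flow described above can be sketched as a chain of stages (a hypothetical stdlib Python sketch; the component names follow the paper, while the JSON format, event fields, and metric are invented for illustration):

```python
import json

def event_adaptor(raw):
    # Convert an external protocol/format (here, JSON text) to an internal event.
    return json.loads(raw)

def event_collector(event, store):
    # Map the event onto an analysis stream and keep a copy for time-series use
    # (standing in for the Cassandra write described in the paper).
    store.append(event)
    return event

def big_data_analyzer(events):
    # Stand-in for the CEP engine: average a vital sign over collected events.
    values = [e["value"] for e in events if e["type"] == "heart_rate"]
    return {"metric": "avg_heart_rate", "value": sum(values) / len(values)}

def event_generator(result):
    # Convert the analysis result into the form the Reporter consumes.
    return f"{result['metric']}={result['value']:.1f}"

store = []
for raw in ['{"type": "heart_rate", "value": 72}',
            '{"type": "heart_rate", "value": 88}']:
    event_collector(event_adaptor(raw), store)
report = event_generator(big_data_analyzer(store))
print(report)  # avg_heart_rate=80.0
```

In the real system each stage is a separate component, the Analyzer runs continuously over the stream, and the Reporter renders the generated results on the Dashboard.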

Figure 6. CEP-based Real-Time Analytics System Architecture

4. System Implementation

The processing sequence for an incoming event source is as follows. As shown in Figure 7, an event source extracted by the Legacy or Batch layer is transferred through the Data Adapter in real time. The real-time processing layer then processes the delivered event source; the analysis results are saved to the data repository, and monitoring is enabled when necessary through an external-system API call.

Figure 7. Processing Sequence for an Incoming Event Source

To summarize the processing of the entire system: as shown in Figure 8, Legacy saves the log details generated by its own processes, while performing real-time analysis on the major tasks among them. The Batch layer collects data through aggregation and extracts data through analysis jobs; depending on the extracted results, data is either stored in the middle repository or passed on as a real-time analysis request. The real-time processing layer handles the requests from the Legacy and Batch layers, requesting data as needed, processing it, and saving the results to the middle repository; for the processed results, monitoring-system API calls are made. The Dashboard visualizes the data according to the user's request and saves the Dashboard-processed data to the middle repository.

Figure 8. Processing of the Entire System

Figure 9 is a screenshot of the Web-based real-time analysis provided by the Reporter for selected items, after setting the Output Adaptor on the items needed for analysis among the event-stream data (randomized diagnosis data was entered) on a hospital's UI development screen.

Figure 9. UI Development Screen and Web-based Real-Time Analysis Screen

Figure 10 is a screenshot of monitoring the analysis of all data under the Output Adaptor settings on the Dashboard. Through the filtering functions, Server, Adaptor, Event, and Hour can also be set so that only the necessary parts are analyzed. Figure 11 is a screenshot of the analysis with the Server filter set to 210.109.9.114, the Output Adaptor set to Event, the Event set to event_5, and the Hour set to 13:00-18:00.

Figure 10. Screenshot of Monitoring the Analysis of All Data

Figure 11. Filter Functions for the Monitoring Screen

5. Conclusion

According to a McKinsey report, value in the medical field is closely connected to saving national health-care expenditures and medical expenses, as well as to enabling clinical trials. In this regard, this study laid the groundwork for handling massive amounts of high-speed event data in rapidly changing environments by developing a CEP-based real-time analytics system applied to a hospital ERP system, which can broaden the use of Big Data — especially in the health and medical fields — and make a socio-economic impact. The Big Data analytics system was designed to be applicable to small and medium-sized hospitals as well, providing faster and more accurate information that meets the specific needs of each hospital and enabling them to create economic value through efficient patient management and administrative management. Further R&D is still needed on the use of data at hospitals, as well as on system extensions applicable to corporate entities in shipping and logistics.

References

[1] 7 Top Mega Trends in IT Industry for 2013, The Federation of Korean Information Industries, Big Data Policy, (2013), p. 1.
[2] J. Gantz and D. Reinsel, "Extracting Value from Chaos", IDC IVIEW, (2011) June, p. 6.
[3] P. Russom, "Big Data Analytics", TDWI Research, Fourth Quarter, (2011), p. 6.

[4] K.-W. Park, K.-J. Ban, S.-H. Song and E.-K. Kim, "Cloud-based Intelligent Management System for Photovoltaic Power Plants", Korea Institute of Electronic Communication Sciences, vol. 7, no. 3, (2012) June, pp. 591-596.
[5] http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html.
[6] D. C. Luckham and B. Frasca, "Complex Event Processing in Distributed Systems", Computer Systems Laboratory Technical Report CSL-TR-98-754, Stanford University, Stanford, (1998).
[7] http://blog.naver.com/tibcokorea/20207252684.
[8] http://www.dbguide.net/knowledge.db?cmd=specialist_view&boarduid=180895&boardconfiguid=108&boardStep=0&categoryUid, Lee Ho-cheol, No. 249, (2014) July 29.
[9] Wikipedia, "NoSQL", http://en.wikipedia.org/wiki/nosql.
[10] NoSQL Database Comparison, (2011).

Authors

Mi-Jin Kim
Feb. 2004: Bachelor's degree in Computer Engineering, Dongeui University
Aug. 2008: Master's degree in Computer Education, Graduate School of Education, Pukyong National University
Feb. 2011: Completed doctoral coursework in Computer Engineering & Applications, Dongeui University
Sept. 2014 - Present: Senior Researcher, Convergence of IT Devices Institute Busan, Dongeui University

Yun-Sik Yu
Feb. 1990: Doctoral degree in Physics, Pusan National University
Mar. 1983 - Present: Professor, Department of Radiological Science/IT Convergence, Dongeui University
Mar. 2008 - Present: Director, Convergence of IT Devices Institute Busan, Dongeui University