Big Data Analytics in Healthcare Temporal Clinical Events Patterns Mining




Svetla Boytcheva 1, Galia Angelova 1, Zhivko Angelov 2, Dimitar Tcharaktchiev 3

1 Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Bulgaria
2 Adiss Lab Ltd., Sofia, Bulgaria
3 Medical University Sofia, University Specialised Hospital for Active Treatment of Endocrinology, Bulgaria

Emails: svetla.boytcheva@gmail.com, galia@lml.bas.bg, angelov@adiss-bg.com, dimitardt@gmail.com

Keywords: Medical Informatics, Big Data, Text mining, Temporal Information, Data mining, MLCS.

1. Motivation

Analysis of temporal relations between events in outpatient records is important for testing a range of hypotheses: risk factor analysis, treatment effect assessment, comparative analysis of treatment with different medications and dosages, and monitoring of disease complications. Such analysis is currently also used in epidemiology to identify complex relations between seemingly unrelated diseases (so-called comorbidity) and in research on rare diseases. Considerable effort has been reported on the visualization of electronic health records and on the analysis of periodic data for a single patient or the search for patterns in a cohort of patients [4, 5, 14, 17, 19, 27]. A method for temporal event matrix representation and a learning framework that discovers complex latent event patterns or diabetes mellitus complications is proposed in [19]. Patnaik et al. [4, 5] report one of the first attempts at mining patient history data at Big Data scale, processing over 1.6 million patient histories. They demonstrate a system called EMRView for mining precedence relationships to identify and visualize partial order information. The latter research addresses three tasks: mining parallel episodes, tracking serial extensions, and learning partial orders.
The task of searching for patterns in sequences of events is one of the fundamental tasks in computer science and dynamic programming, as is one of its variants, the problem of finding the longest common subsequence (LCS) of two strings [20, 21, 25]. This task is important for medical informatics because it is used in bioinformatics for multi-omics data analysis [24, 28]. It also has many other applications, such as data compression, file comparison, and syntactic pattern recognition. Unfortunately, the classical algorithms were developed mainly for two strings, and only some generalizations are considered for three or more strings. The LCS problem for two strings has a solution with time complexity $O(mn)$ [25], where $m$ and $n$ are the lengths of the two strings. Hunt and Szymanski [21] improve the solution to time complexity $O((r + n) \log n)$, where $n$ is the length of both strings and $r$ is the number of possible common pairs of symbols in them. Masek and Paterson [30] present a solution inspired by the Four Russians method for speeding up algorithms and report $O(n^2 / \log n)$ complexity. More recently, several slight improvements of LCS were reported for specific applications in bioinformatics [23]. The multiple longest common subsequence (MLCS) problem is an NP-complete task. Wang et al. [15] propose a parallel algorithm based on the dominant points approach for solving the MLCS problem, which limits the search to a small subset of dominant points rather than searching the whole set. Chen et al. [23] describe a method based on successor tables, and in 2010 Wang et al. [16] explore different heuristics in an A* search model for solving the MLCS problem. Yang et al. [12] use a beam search model and propose an anytime algorithm that can solve the problem in reasonable time for massive data, while also addressing issues such as memory constraints for data tabulation.

Han et al. [29] propose one of the fundamental approaches for searching frequent patterns in time series, based on statistical methods and the construction of an extended prefix tree, the frequent pattern tree (FP-tree). Karthik et al. [22] study periodicity in sequences using methods based on the FP-tree. In frequent pattern mining of temporal events it is important to filter events with similar importance and features; this relationship can be specified by temporal constraints. There are different mining approaches; for instance, one may be interested in sequences leading to a certain target event [6]. Frequent pattern mining of temporal events is a major task in temporal data mining, with applications in telecommunications, social networks, climate change prediction, earthquake prediction, and other areas. Guyet and Quiniou [7] propose a recursive depth-first algorithm, QTIPrefixSpan, that explores the extensions of a temporal pattern. They continue their research on extracting temporal sequences with quantitative temporal intervals, using different models with a hypercube representation, and develop a version of the EM algorithm for candidate generation [8]. Patnaik et al. present a streaming algorithm for mining frequent episodes over a window of recent events in the stream [5]. Monroe et al. [18] present a system that lets the user iteratively narrow the pattern mining process toward the desired target by means of visual tools, with application to electronic health records (EHR). Yang et al. [10] describe another application of temporal event sequence mining in medical informatics, for mining patient histories.
They develop a model-based method for discovering common progression stages in general event sequences.

2. Project Setup

The main goal of our research is to examine the comorbidity of diseases and its relationship (causality) with different treatments, i.e. how the treatment of one disease can affect the co-existence of other diseases. This is a challenging task because the number of diagnoses (more than 10,000) and of medications (approx. 6,500) is huge. Thus the number of possible variations of diagnoses and the corresponding treatment is above $10^{500}$ for one patient over a one-year period. That is why we examine chronic and acute diseases [9] separately and combine the patterns afterwards. According to World Health Organization reports, chronic diseases constitute a major cause of mortality, so their study is of high importance for medicine. To solve this task we split it into several subtasks: finding the most frequent patterns in chronic diseases and, for each frequent pattern, finding patterns of treatment and exploring their relationship with the periodic occurrence of various acute conditions. Since the method is to be applied to big data, some sacrifices have to be made. We propose a method based on FP-trees and dominant points that mines candidate frequent patterns separately for the prefixes and suffixes of dominant points. We explore different approaches for assigning scores to dominant points and for candidate generation, and compare the time and space results of all versions of the algorithm. We apply the proposed algorithm to time sequence mining with qualitative and quantitative time intervals. In the first case we use only the order of the events; in the second we also take into account the distance between consecutive events.
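The difference between the qualitative and the quantitative view of a patient history can be illustrated with a minimal Python sketch. The record layout (pairs of visit date and ICD-10 code), the function names, and the choice of days as the distance unit are assumptions made only for illustration; they are not taken from the described system.

```python
from datetime import date


def qualitative(events):
    """Order-only view: sort by date and keep just the event codes."""
    return [code for _, code in sorted(events)]


def quantitative(events):
    """Order plus distance: pair each event with the gap in days since the previous one."""
    result = []
    previous = None
    for day, code in sorted(events):
        gap = 0 if previous is None else (day - previous).days
        result.append((code, gap))
        previous = day
    return result


# Hypothetical visit history of one patient: (visit date, ICD-10 code).
history = [
    (date(2014, 3, 10), "E11"),
    (date(2014, 1, 5), "I10"),
    (date(2014, 3, 24), "J20"),
]

print(qualitative(history))   # ['I10', 'E11', 'J20']
print(quantitative(history))  # [('I10', 0), ('E11', 64), ('J20', 14)]
```

In the qualitative encoding two patients with the same order of diagnoses are indistinguishable, whereas the quantitative encoding also separates them by how far apart the visits were.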

3. Materials

We deal with a repository of anonymous outpatient records (AOR) provided by the Bulgarian National Health Insurance Fund (NHIF) in XML format, in Bulgarian. Most of the data necessary for health management are structured in XML tags, but some fields contain important free-text explanations about the patient: Anamnesis, Status, Clinical examinations and Therapy. From the XML tags we extract (Figure 1) the patient ID, the code of the doctor's medical specialty from 00 to 56 (SimpCode), the region of practice (RZOK), and the date/time and ID of the outpatient record (NoAL). The XML tags also contain the main diagnosis and additional diagnoses with their codes according to the International Classification of Diseases, 10th Revision (ICD-10) [3].

Figure 1. Structured Event Data

For treatment information extraction we use a text mining tool, because drugs, dosage, frequency and route are mainly included in the Therapy section, but the Anamnesis sometimes contains sentences that discuss current or previous treatment. We developed a drug extractor based on text mining algorithms that use regular expressions to describe linguistic patterns [2]. There are more than 80 different patterns for matching text units to ATC drug names/codes [1] and NHIF drug codes, medication name, dosage and frequency. Currently, the extractor has been extended and handles 2,239 drug names included in the NHIF nomenclatures. Table 1 presents the three collections of clinical texts used for training and tests in our experiments, containing data for patients suffering from Schizophrenia (ICD-10 code F20), Hyperprolactinaemia (ICD-10 code E22.1), and Diabetes Mellitus (ICD-10 codes E10-E15).
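To give a feeling for regular-expression-based medication extraction, the following Python sketch matches one simple linguistic pattern of the form "drug name, dose, unit, optional frequency". It is only an illustration: the three drug names, the single pattern, the English sample text, and the function name extract_drugs are hypothetical, while the extractor described above works on Bulgarian text with more than 80 patterns.

```python
import re

# Hypothetical mini-nomenclature (the described extractor handles 2,239 NHIF drug names).
DRUG_NAMES = ["metformin", "olanzapine", "bromocriptine"]

# One illustrative pattern: <drug> <dose><unit> [<N> times daily]
PATTERN = re.compile(
    r"(?P<drug>" + "|".join(map(re.escape, DRUG_NAMES)) + r")\s+"
    r"(?P<dose>\d+(?:[.,]\d+)?)\s*(?P<unit>mg|g|ml)"
    r"(?:\s+(?P<freq>\d+)\s*(?:x|times)\s*(?:daily|per day))?",
    re.IGNORECASE,
)


def extract_drugs(text):
    """Return one dictionary per medication mention found in the text."""
    results = []
    for m in PATTERN.finditer(text):
        results.append({
            "drug": m.group("drug").lower(),
            "dose": float(m.group("dose").replace(",", ".")),
            "unit": m.group("unit").lower(),
            # The frequency part is optional in the pattern.
            "times_per_day": int(m.group("freq")) if m.group("freq") else None,
        })
    return results


therapy = "Therapy: Metformin 850 mg 2 times daily; Olanzapine 10mg in the evening."
for mention in extract_drugs(therapy):
    print(mention)
```

Keeping the drug names in a separate list and compiling them into the pattern makes it easy to swap in a full nomenclature without touching the rest of the expression.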
Table 1. Clinical Text Collections

| Set | Outpatient Records | Patients | Avg AOR per Patient per Year | Diagnosis (ICD-10) | Period | Size (GB) |
| S1 | 1,682,429 | 114,945 | 4.879 | F20 | 3 years | 4 |
| S2 | 288,977 | 9,777 | 6.744 | E22.1 | 3 years | 1 |
| S3 | 6,327,503 | 435,953 | 14.514 | E10-E15 | 1 year | 18 |

4. Problem definition

Each collection is processed independently of the other two. For each collection, all extracted events are stored in comma-separated format. For a collection $C \in \{S1, S2, S3\}$, the set of all distinct patients $P = \{p_1, p_2, \ldots, p_N\}$ is extracted. For each patient $p_i \in P$ an event sequence of tuples is generated: $E(p_i) = (\langle e_1, t_1 \rangle, \langle e_2, t_2 \rangle, \ldots, \langle e_{k_i}, t_{k_i} \rangle)$, where $e_j$ is an event, $t_j$ is its time, and $i = 1, \ldots, N$. The empty sequence of events will be denoted by the empty set $\emptyset$.
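The step from comma-separated events to one sequence of tuples per patient can be sketched in Python as follows. The column layout (patient ID, ISO date, event code), the sample rows, and the function name build_event_sequences are assumptions for illustration; the actual export format of the collections is not specified here.

```python
import csv
import io
from collections import defaultdict


def build_event_sequences(csv_text):
    """Group comma-separated event rows into one time-ordered sequence per patient."""
    sequences = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        patient_id, timestamp, event = row
        sequences[patient_id].append((event, timestamp))
    # ISO-formatted dates sort chronologically when compared as strings.
    return {p: sorted(seq, key=lambda pair: pair[1]) for p, seq in sequences.items()}


# Hypothetical rows: patient ID, visit date, ICD-10 code.
rows = """p1,2014-03-10,E11
p2,2014-02-01,F20
p1,2014-01-05,I10
"""

sequences = build_event_sequences(rows)
print(sequences["p1"])         # [('I10', '2014-01-05'), ('E11', '2014-03-10')]
print(sequences.get("p3", []))  # [] plays the role of the empty sequence
```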

We define the set of all possible event sequences in each collection as $S_E = \{E(p_i) \mid p_i \in P\}$, where $E(p_i)$ is the event sequence of patient $p_i$ and $P$ is the set of patients in the collection. Then from each $E(p_i)$ the diagnosis sequence of tuples $D(p_i)$ is extracted.

A sequence $Q = (\langle q_1, \tau_1 \rangle, \ldots, \langle q_m, \tau_m \rangle)$ is said to be an event subsequence of the event sequence $E = (\langle e_1, t_1 \rangle, \ldots, \langle e_k, t_k \rangle)$ if there exists a sequence of indices $1 \le j_1 < j_2 < \cdots < j_m \le k$ such that the times and events match, i.e. $q_l = e_{j_l}$ and $\tau_l = t_{j_l}$ for $l = 1, \ldots, m$. This will be denoted by $Q \subseteq E$.

Let $Q$ be an event subsequence and $S_Q = \{E \in S_E \mid Q \subseteq E\}$. The cardinality $|S_Q|$ is called the support of the event subsequence $Q$ in $S_E$, i.e. $supp(Q) = |S_Q|$.

The length of an event sequence $E$ is the number of tuples in it. It will be denoted by $|E|$, i.e. $|E| = k$.

A frequent pattern (FP) in the set $S_E$ is a subsequence $Q$ whose support is above the given threshold $minsup$, i.e. $supp(Q) > minsup$.

A multiple longest common subsequence (MLCS) of the set $S_E$ is a subsequence $Q$ that satisfies all of the following conditions:
(i) $Q$ is a frequent pattern of $S_E$;
(ii) $\nexists Q'$ such that $Q'$ is a frequent pattern of $S_E$ and $|Q'| > |Q|$.

Splitting a sequence $E$ at an event $e_j$ of the sequence generates two subsequences: the prefix subsequence $pref(E, e_j)$ and the suffix subsequence $suf(E, e_j)$.

We define the set of all events occurring in the set $S_E$ and, respectively, the set of all diagnoses occurring in it.

References

1) Anatomical Therapeutic Chemical (ATC) Classification System, http://atc.thedrugsinfo.com/
2) Svetla Boytcheva. Shallow Medication Extraction from Hospital Patient Records. In: Koutkias, V., J. Nies, S. Jensen, N. Maglaveras, and R. Beuscart (Eds.), Studies in Health Technology and Informatics, Vol. 166, IOS Press, pp. 119-128.
3) International Classification of Diseases and Related Health Problems, 10th Revision. http://apps.who.int/classifications/icd10/browse/2015/en
4) Debprakash Patnaik, Laxmi Parida, Patrick Butler, Benjamin J. Keller, Naren Ramakrishnan, David A. Hanauer. Experiences with Mining Temporal Event Sequences from Electronic Medical Records: Initial Successes and Some Challenges. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11), San Diego, CA, pages 360-368, Aug 2011.
5) D. Patnaik, S. Laxman, and B. Chandramouli. Efficient Episode Mining of Dynamic Event Streams. In Proceedings of the IEEE International Conference on Data Mining (ICDM'12), Brussels, Belgium, pages 605-614, Dec 2012.
6) X. Sun, M. Orlowska and X. Zhou. Finding Event Oriented Patterns in Long Temporal Sequences. In Proceedings of the 7th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003), Springer Lecture Notes in Computer Science (LNCS 2637), pages 15-26, Seoul, Korea, April 2003.
7) Thomas Guyet and René Quiniou. Mining temporal patterns with quantitative intervals. In Zighed D., Ras Z., Tsumoto S. (Eds.), The 4th International Workshop on Mining Complex Data, IEEE ICDM Workshop, 10 pages, 2008.
8) Thomas Guyet and René Quiniou. Extracting temporal patterns from interval based sequences. International Joint Conference on Artificial Intelligence, 2011.
9) Chronic diseases, World Health Organization, retrieved 2012-11-26, http://www.who.int/topics/chronic_diseases/en/
10) Jaewon Yang, Julian McAuley, Jure Leskovec, Paea LePendu, and Nigam Shah. 2014. Finding progression stages in time evolving event sequences. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14). ACM, New York, NY, USA, 783-794. DOI=http://dx.doi.org/10.1145/2566486.2568044
11) Jiaoyun Yang, Yun Xu, and Yi Shang. An efficient parallel algorithm for longest common subsequence problem on GPUs. Proceedings of the World Congress on Engineering, Vol. 1, 2010.
12) Jiaoyun Yang, Yun Xu, Yi Shang, Guoliang Chen. A Space Bounded Anytime Algorithm for the Multiple Longest Common Subsequence Problem. Published by the IEEE Computer Society, vol. 26, no. 11, Nov. 2014, pp. 2599-2609. DOI: http://doi.ieeecomputersociety.org/10.1109/tkde.2014.2304464
13) François Nicolas, Eric Rivals. Longest common subsequence problem for unoriented and cyclic strings. Theoretical Computer Science, Volume 370, Issues 1-3, 12 February 2007, Pages 1-18, ISSN 0304-3975, http://dx.doi.org/10.1016/j.tcs.2006.10.002 (http://www.sciencedirect.com/science/article/pii/s0304397506006980)
14) David Gotz, Fei Wang, Adam Perer. A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data. Journal of Biomedical Informatics, Volume 48, April 2014, Pages 148-159.
15) Qingguo Wang, Dmitry Korkin and Yi Shang. Efficient Dominant Point Algorithms for the Multiple Longest Common Subsequence (MLCS) Problem. Proceedings of the Twenty First International Joint Conference on Artificial Intelligence, Pasadena, USA, pp. 1494-1499, July 2009.
16) Q. Wang, M. Pan, Y. Shang, and D. Korkin. A Fast Heuristic Search Algorithm for Finding the Longest Common Subsequence of Multiple Strings. Proc. 24th AAAI Conf. Artificial Intelligence, pp. 1287-1292, 2010.
17) Taowei David Wang, Catherine Plaisant, Alexander J. Quinn, Roman Stanchak, Shawn Murphy, and Ben Shneiderman. 2008. Aligning temporal data by sentinel events: discovering patterns in electronic health records. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 457-466. DOI=http://dx.doi.org/10.1145/1357054.1357129
18) M. Monroe, Rongjian Lan, Hanseung Lee, C. Plaisant, B. Shneiderman. Temporal Event Sequence Simplification. IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2227-2236, Dec. 2013, doi: 10.1109/TVCG.2013.200, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6634100&isnumber=6634084
19) N. Lee, A.F. Laine, Jianying Hu, Fei Wang, Jimeng Sun, S. Ebadollahi. Mining electronic medical records to explore the linkage between healthcare resource utilization and disease severity in diabetic patients. In: Healthcare Informatics, Imaging and Systems Biology (HISB), 2011 First IEEE International Conference on, 26-29 July 2011, pages 250-257.
20) J. W. Hunt and M. D. McIlroy. An Algorithm for Differential File Comparison. Computing Science Technical Report 41 (Bell Telephone Laboratories).
21) James W. Hunt and Thomas G. Szymanski. 1977. A fast algorithm for computing longest common subsequences. Commun. ACM 20, 5 (May 1977), 350-353. DOI=http://dx.doi.org/10.1145/359581.359603
22) G. M. Karthik, Ramachandra V. Pujeri. Constraint Based Periodic Pattern Mining in Multiple Longest Common Subsequences. Indian Journal of Science and Technology, DOI: 10.17485/ijst/2013/v6i8/36343
23) Chen Y, Wan A, Liu W. A fast parallel algorithm for finding the longest common sequence of multiple biosequences. BMC Bioinformatics. 2006;7(Suppl 4):S4. doi:10.1186/1471-2105-7-S4-S4.
24) Korkin D, Goldfarb L. Multiple genome rearrangement: a general approach via the evolutionary genome graph. Bioinformatics. 2002;18 Suppl 1:S303-11.
25) D. S. Hirschberg. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 6 (June 1975), 341-343. DOI=http://dx.doi.org/10.1145/360825.360861
26) Heiko Goeman, Michael Clausen. A new practical linear space algorithm for the longest common subsequence problem. Kybernetika, vol. 38 (2002), issue 1, pp. [45]-66.
27) Alexander Rind, Taowei David Wang, Wolfgang Aigner, Silvia Miksch, Krist Wongsuphasawat, Catherine Plaisant and Ben Shneiderman (2013). Interactive Information Visualization to Explore and Query Electronic Health Records. Foundations and Trends in Human Computer Interaction, Vol. 5, No. 3, pp. 207-298. http://dx.doi.org/10.1561/1100000039
28) David Sankoff, Joseph B. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison Wesley (August 1983), ISBN-13: 978-0201078091.
29) Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). ACM, New York, NY, USA, 1-12. DOI=http://dx.doi.org/10.1145/342009.335372
30) William J. Masek, Michael S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, Volume 20, Issue 1, February 1980, Pages 18-31, ISSN 0022-0000, http://dx.doi.org/10.1016/0022-0000(80)90002-1 (http://www.sciencedirect.com/science/article/pii/0022000080900021)