Temporal Data Mining in Hospital Information Systems: Analysis of Clinical Courses of Chronic Hepatitis

Vol. 1, No. 1, Issue 1, Page 11 of 19. Copyright 2007, TSI Press. Printed in the USA. All rights reserved.

Shoji Hirano and Shusaku Tsumoto
Department of Medical Informatics, Shimane University, School of Medicine
89-1 Enya-cho, Izumo, Shimane, Japan
{hirano,
Received 1 January 2007; revised 2 February 2007; accepted 3 March 2007

Abstract

This paper presents a new approach to finding interesting knowledge in temporal data on chronic diseases, based on a combination of advanced sequence comparison techniques and a cluster analysis procedure. First, we briefly introduce the cluster analysis system for temporal data that we have developed. Second, we apply it to the analysis of platelet (PLT) count data on chronic viral hepatitis patients. Third, we show the results of a PLT value-based temporal analysis, conducted on the basis of the cluster analysis results, which aims at finding the years for reaching F4 (liver fibrosis stage four), the years elapsed between stages, and their relationships with virus types and fibrotic stages. The results conveyed two interesting findings: (1) the temporal courses of PLT could be grouped into several patterns exhibiting similar average PLT levels and increase/decrease trends, and (2) liver fibrosis might proceed faster in some exacerbating cases.

Keywords

Temporal Data Mining, Multiscale Matching, Clustering, Chronic Hepatitis, KDD Process

1. INTRODUCTION

The steady operation of hospital information systems over the past two decades has given them a new role: archiving temporal data on the long-term condition of patients, in addition to their basic function of providing the information necessary for daily clinical services. Such archives of longitudinal time-series data can be used as a new source for retrospective studies of chronic diseases, which may lead to the discovery of novel knowledge useful for diagnosis or treatment.
However, large-scale, cross-patient analysis of time-series medical data is a challenging task because of the multidimensionality and temporal irregularity of the data, which are caused by the variety of laboratory tests and by changes in patient conditions over time, and because of the difficulty of determining observation scales appropriate for capturing both short-term and long-term events. Therefore, practical application of data mining methods to longitudinal medical time-series data is still limited.

In this paper, we present a new approach to finding interesting knowledge in temporal data on chronic diseases, based on a combination of advanced sequence comparison techniques and a cluster analysis procedure. First, we briefly introduce the cluster analysis system for temporal data that we have developed. Second, we apply it to the analysis of platelet (PLT) count data on chronic viral hepatitis patients. Platelet count has been receiving considerable interest as an index of liver dysfunction, because thrombopoietin [1], a hematopoietic factor that facilitates the production of platelets, is produced in the liver. Matsumura et al. reported that the PLT count correlated with the fibrotic stage [2]: PLT counts differed significantly among patients at different fibrotic stages, with the PLT count becoming smaller as liver fibrosis proceeds [2]. However, few studies have investigated the temporal relationships between the decrease patterns of PLT and the progress of liver fibrosis using time-series data of individual patients. Our cluster analysis results indicate that the temporal courses of PLT can be grouped into several patterns, each of which shows similarity in average PLT level and increase/decrease trends. Third, we show the results of a PLT value-based temporal analysis aimed at finding the years for reaching F4 (fibrosis stage 4), the years elapsed between stages, and their relationships with virus types and fibrotic stages.
This value-based analysis was conducted based on the observation of quickly decreasing patterns revealed through the cluster analysis. The results of the value-based analysis suggest that liver fibrosis may proceed faster in exacerbating cases.

2. CLUSTER ANALYSIS SYSTEM

The cluster analysis system we have developed consists of two components, sequence comparison and clustering, so that it can use advanced sequence comparison methods capable of handling the temporal irregularity of medical data. The sequence comparison part implements two methods: dynamic time warping (DTW) [5,6] and modified multiscale structure matching (MMSM) [8]. The clustering part employs two methods: conventional hierarchical clustering (HC) [4] and rough set-based clustering (RC) [7]. The sequence comparison part performs pairwise comparison for all possible pairs of time series and produces a dissimilarity matrix. The clustering part then groups the time series according to the given dissimilarity matrix.

Figure 1 provides a screenshot of the system. The left window shows the dendrogram that is generated when HC is used as the clustering method. The right window shows the constitution of the clusters, as well as the number of cases in each cluster. When a user selects a cluster, the sequences that belong to it are visualized. The two windows are linked internally: when a user specifies a cutting point on the dendrogram, the corresponding cluster constitution and sequences are displayed interactively.

3. CLUSTER ANALYSIS OF TIME-SERIES PLT COUNT DATA

Data Sets

We employed the chronic hepatitis dataset [3], which was provided as a common dataset for ECML/PKDD Discovery Challenge 2002 and The dataset contained time-series data on laboratory examinations collected at a university hospital in Japan. The subjects were 771 patients with Type B and Type C chronic viral hepatitis who received hospital laboratory examinations during the period from 1982 to A total of 720 patients received at least one examination on platelet count.
Out of these 720 cases, 222 were removed from the analysis because their biopsy information was not available, and an additional 10 were removed because of their short examination periods (less than 2 weeks). Consequently, a total of 488 series were used for analysis.

Figure 1. Cluster analysis system for time-series.

Experimental Procedure

The procedure of the cluster analysis was as follows.

1. Sequence rebuild: Rearrange the PLT data of each patient into one-week intervals by linear interpolation.

2. Dataset split by virus type and administration of interferon (IFN) therapy: Split the dataset into Type B and Type C cases, and further split the Type C cases into those with IFN therapy and those without. We call these subsets the Type B subset, the Type C with IFN subset and the Type C without IFN subset. The number of cases in each subset was as follows: Type B = 193, Type C with IFN = 196, Type C without IFN = 99. The following procedures were applied independently to each subset.

3. Creation of a dissimilarity matrix: Compare two PLT sequences by the modified multiscale matching. Apply this process to every possible pair of sequences in the subset to fill in the dissimilarity matrix. In order to perform a comprehensive comparison, we set the parameters for multiscale matching as follows: number of scales = 150, starting scale = 0.1, scale interval = 0.5. The weight for the replacement cost was set to 0.2 according to a preparatory experiment.

4. Cluster analysis: Generate dendrograms by agglomerative hierarchical clustering and perform cluster analysis. We employed group average as the cluster merge criterion.

Figure 2 shows the three dendrograms obtained from the Type B, Type C with IFN and Type C without IFN subsets, respectively. We manually determined the cutting points on the dendrograms so that the clusters represent the global structure of the data while retaining the meaningful features of the sequences. Consequently, we obtained 16, 23 and 6 clusters for the three subsets, respectively. A horizontal line on each dendrogram represents the cutting point.

Table 1 provides the constitution of the clusters stratified by fibrotic stage. The three sub-tables correspond, from left to right, to the Type B, Type C with IFN and Type C without IFN subsets. Each row in a table represents one cluster. The leftmost column contains the cluster number. The subsequent five columns contain the number of cases in the cluster stratified by fibrotic stage (F0-F4). The rightmost column contains the total number of cases in the cluster. The tables implied that clusters could be roughly classified into two categories: (1) clusters containing high-stage (progressed) cases, and (2) clusters containing low-stage (early) cases.

Figure 2. Dendrograms for PLT sequences. Left: Type B, Middle: Type C with IFN, Right: Type C without IFN.

Table 1. Cluster constitutions w.r.t. fibrotic stages. Small clusters (less than 3 cases) were omitted. Left: Type B, Center: Type C with IFN, Right: Type C without IFN. Columns in each sub-table: Cls, # of Cases / Fibrosis Stage (F0, F1, F2, F3, F4), Total.
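The experimental procedure above (weekly re-sampling, a pairwise dissimilarity matrix, then group-average agglomerative clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system described in this paper: modified multiscale matching is not reproduced, so plain dynamic time warping (the other comparison method mentioned in Section 2) is swapped in as the dissimilarity; the manual choice of a cutting point is replaced by a fixed target number of clusters; and all function names and the toy data are hypothetical.

```python
from itertools import combinations


def resample_weekly(days, values):
    """Linearly interpolate observations onto a 7-day grid (procedure step 1).

    Assumes `days` is strictly increasing and has at least two entries.
    """
    out, j = [], 0
    for t in range(days[0], days[-1] + 1, 7):
        # Advance to the observation interval that contains day t.
        while j < len(days) - 2 and days[j + 1] < t:
            j += 1
        w = (t - days[j]) / (days[j + 1] - days[j])
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


def dtw(a, b):
    """Plain dynamic time warping cost; a stand-in for multiscale matching."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for k, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[k], prev[k - 1], cur[k - 1]))
        prev = cur
    return prev[-1]


def dissimilarity_matrix(series):
    """Compare every pair of sequences (procedure step 3)."""
    n = len(series)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dtw(series[i], series[j])
    return d


def group_average_clusters(d, n_clusters):
    """Agglomerative clustering with the group-average criterion (step 4).

    Merging stops at `n_clusters`, a simple substitute for choosing a
    cutting point on the dendrogram by hand.
    """
    clusters = [[i] for i in range(len(d))]

    def avg(c1, c2):
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: avg(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters


# Hypothetical toy data: (examination days, PLT values) for four patients.
raw = [
    ([0, 14, 28], [12.0, 11.0, 10.0]),
    ([0, 21, 28], [12.5, 11.0, 10.5]),
    ([0, 14, 28], [22.0, 22.5, 22.0]),
    ([0, 7, 28], [21.0, 21.5, 21.0]),
]
weekly = [resample_weekly(days, values) for days, values in raw]
clusters = group_average_clusters(dissimilarity_matrix(weekly), n_clusters=2)
print(clusters)
```

In this toy run the two low, decreasing series and the two flat, higher series end up in separate clusters, which mirrors the kind of level-and-trend grouping discussed in the text. The quadratic pairwise comparison is acceptable for subsets of a few hundred sequences but would need a faster scheme for much larger datasets.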

Due to space limitations, we mainly describe the results for the Type C with IFN subset. According to the middle table in Table 1, there were two remarkable clusters containing many progressed cases (F4 or F3): cluster 5 (8/11) and cluster 8 (25/40). Additionally, there were three other remarkable clusters containing many early-stage (F0-F2) cases: clusters 11 (34/46), 12 (33/42) and 23 (18/19).

Figure 3 provides examples of sequences grouped into clusters 5 and 8. Each figure is composed of 16 sub-windows, and each sub-window contains one sequence. The two horizontal lines in each sub-window represent the normal high ( /µl) and normal low ( /µl) ranges, respectively. In cluster 5, most of the sequences showed decreasing or flat courses below the normal low range, indicating the severe states of the patients. Sequences in cluster 8 exhibited similar courses, but with slightly higher values than those in cluster 5. Figure 4 provides sequences grouped into clusters 11, 12 and 23. In contrast to clusters 5 and 8, sequences in these clusters showed flat courses that stayed within the normal range. Clusters 11, 12 and 23 differentiate the global PLT levels: low, middle and high, respectively.

Figure 3. Clusters containing many cases of progressed-stage (F4 or F3) (Type C with IFN). Left: cluster 5. Center and Right: cluster 8 (32 cases selected by MID order).

Figure 4. Clusters containing many cases of early-stage (F0, F1 or F2) (Type C with IFN). Left: cluster 11. Center: cluster 12. Right: cluster 23. (16 cases selected by MID order).

Other interesting courses were found in clusters 4, 6 and 10, which demonstrated clearly decreasing or increasing patterns, as shown in Figure 5. The left panel of Figure 5 provides the sequences in cluster 4 (F1=3, F2=1). Although 3 of the 4 cases were at stage F1, their PLT counts continued decreasing and finally fell below the normal low level in a relatively short period. The center panel of Figure 5 shows the sequences in cluster 6 (F1=1, F3=1, F4=1). The global levels were lower than those in cluster 4, which might be due to the F3 and F4 cases. The right panel provides the sequences in cluster 10 (F1=1, F3=1, F4=3), which represent recovery courses after IFN therapy.

Figure 5. Clusters containing remarkably increase/decrease cases (Type C with IFN). Left: cluster 4. Center: cluster 6. Right: cluster 10.

We observed similarly interesting patterns in the other two subsets. Below we summarize the findings.

1. In both type B and C, some clusters contained relatively large numbers of progressed cases. The PLT count in these cases commonly showed decreasing or flat courses below the normal low level. Some F1 and F2 cases showed a level as low as that of F4 cases. (Type B cluster 7, Type C with IFN cluster 5).

2. In both type B and C, some clusters contained relatively large numbers of early-stage cases. The PLT count in these cases commonly showed flat courses within the normal range (Type B clusters 5, 15, 16, Type C with IFN clusters 11, 12, 23). F4 cases might retain the normal range; however, the number of such cases in a cluster decreased as the global PLT level of the cluster became higher. (Type B cluster: 16>15>5, Type C with IFN cluster: 11=12>23).

3. In type C, there were remarkable cases, including F1 and F2 cases, in which the PLT count continuously decreased and finally fell below the normal range. (Type C with IFN clusters 4 and 6, Type C without IFN cluster 1). In type C without IFN, the decreasing trend was observed rather frequently. (Type C without IFN clusters 1 and 3).

4. In type C with IFN, there were F4 cases in which PLT levels increased toward the normal range after IFN administration.

4. ANALYSIS OF YEARS FOR REACHING F4 AND ELAPSED YEARS BETWEEN STAGES BASED ON THE PLT COUNTS

The stage of liver fibrosis is usually determined by liver biopsy, which is an invasive examination. In recent years, platelet count has been receiving considerable attention as a non-invasive index reflecting liver dysfunction, which may be associated with the fibrotic stage in chronic hepatitis. Several researchers have reported relationships between platelet counts and fibrotic stages [2,9]. For example, Matsumura et al. [2] reported the following values: F1: 20.3±5.2 (×10^4/µl), F2: 16.0±4.9, F3: 13.0±4.0, F4: 11.8±4.1 and in LC 11.8±4.1. Our results of cluster analysis corresponded to these differences.
Additionally, through visual inspection of the clustered sequences, we observed that there might be several types of temporal courses of PLT values.

Matsumura et al. [2] also reported the progress speed of liver fibrosis examined in patients with Type C chronic hepatitis in Japan. They used the date of blood transfusion, which could be associated with F0, and the dates and results of liver biopsies to calculate the progress speed. The result was about 0.12±0.15 stage/year. In order to investigate the temporal characteristics of the PLT count, we tried to utilize the time-series data. We set the goal of this study as analyzing the progress speed of liver fibrosis without information about blood transfusion. As a preliminary stage, we attempted to calculate (1) the years required for reaching the F4 stage, and (2) the years elapsed between stages, by combining the fibrotic stages predicted from the PLT level with those observed by liver biopsy.

Here we made an assumption: if the PLT level of a patient is continuously lower than the normal range for at least 6 months, and after that never stays in the normal range for more than 6 months, then the patient is at F4. Based on this assumption, we first examined whether and when a patient reached F4. Then, by subtracting the dates and stages obtained by biopsy from these, we calculated the elapsed years.

As a pre-process, we selected the cases for analysis according to the following procedure.

1. Exclude from the analysis cases that met any of the following three conditions: (1) No biopsy - biopsy information was not available. (2) Short sequence - the number of examinations was less than 2 or the duration of examination was shorter than 2 years. (3) Inhomogeneous sequence - the deviation of examination intervals was larger than 1 year.

2. Rearrange the sampling intervals of each sequence into one week. The starting date of re-sampling was selected independently for each case, based on two criteria: (1) it was the day of the week on which the patient most frequently received examinations, and (2) it was the closest such date to the first examination. If examination data were missing, we inserted a predicted value by linearly interpolating the nearest examination results. In the following procedures we used these rearranged sequences.

3. Smooth each sequence in order to remove short-term changes. We performed convolution with a discrete Gaussian kernel with a support width of 6 months (26 weeks; σ=2.8).

4. From the head of a sequence, search for the first point that satisfies both of the following conditions: (a) The PLT level became continuously lower than the normal range for the next 6 months. The duration of IFN therapy was not included therein, as the therapy might induce a short-term decrease of PLT. (b) The recovered PLT level could not continuously maintain the normal range for 6 months.

5. If such a point is found, let it be the date of declination from the normal range. Otherwise, the case was considered to keep the normal PLT range and was removed from the analysis.

Table 2 shows the result of sequence classification by the above procedure. A total of 97 cases classified as 'declinated' were the subject of analysis.

Table 2. Result of sequence classification. Judging criteria for declination are: (1) PLT becomes continuously lower than the normal range over 6 months, (2) Recovered PLT level cannot continuously maintain the normal range for 6 months. Both criteria should be satisfied. Categories: No biopsy, Short, Inhomogeneous, Available, Declinated, Normal, Total.

Table 3 summarizes the calculated years for reaching F4 (first examination date basis) for the 97 declinated cases in Table 2. The cases were stratified by virus type and fibrotic stage. Note that years=0 if the date of declination was earlier than the date of the first examination.

Table 3. Years for reaching F4 (first-exam basis) stratified by virus types and fibrotic stages. Summary for 97 declination cases in Table 2*. Columns: Type, Fibrotic Stage, Cases, Years for reaching F4 [first-exam basis] (years): Mean, Median, SD. Rows: B, C IFN and C w/o IFN, each with a subtotal, and Total.
*Fibrotic stages in the second column are based on biopsy. Years for reaching F4 were the years from the first exam to the date of declination, under the assumption that the fibrotic stage at the date of declination was F4. If the date of declination was the same as or before the first exam, years were treated as 0.

For each of the type B, C with IFN and C without IFN groups, we performed statistical tests (ANOVA) aimed at detecting differences in the mean years for reaching F4 with respect to the biopsy-based fibrotic stages. The result for Type C IFN was p=0.012 (< 0.05), indicating that significant differences in years exist among fibrotic stages. However, this was primarily due to one exceptionally long case in F0; tests after removing this case yielded p=0.291, indicating that there was no significant difference in the years for reaching F4 among fibrotic stages. The results for Type B and Type C w/o IFN were p=0.357 and p=0.613 respectively, indicating no significant differences. Kruskal-Wallis tests yielded the same conclusion. Between-group comparison of the Type B, Type C with IFN and Type C w/o IFN groups resulted in p= .

Years for reaching F4 in Table 3 were calculated as the years between the first date of PLT examination and the date of PLT declination. Therein we assumed that the fibrotic stage at the first examination was the same as that at the first biopsy. However, the date of the first biopsy and the date of the first PLT examination were generally different; in some cases they were several years apart. This implies that the stages might also be different. Therefore, we calculated the years for reaching F4 on a biopsy basis, that is, the years from the date of the first biopsy to the date of PLT declination. Additionally, based on the assumption that the stage at PLT declination should be F4, we calculated the elapsed years between stages by the following formula:

years between stages = (date of declination - date of first biopsy) / (4 - fibrotic stage at biopsy)

If declination occurred before the first biopsy, years were treated as 0. Table 4 summarizes the results.

As we did for the first-exam basis results, for each of the type B, C with IFN and C without IFN groups, we performed statistical tests with ANOVA aimed at detecting differences in the mean years for reaching F4 w.r.t. the fibrotic stages. The results were p=0.421, (<0.05), for each group respectively.
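The smoothing step, the declination search and the elapsed-years formula described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation. The function names, the 27-tap truncation of the Gaussian kernel, the renormalisation at the sequence ends, the 365.25-day year and the `NORMAL_LOW` threshold are all assumptions made for this example (the normal-range value is not given in the text). The exclusion of IFN therapy periods is omitted, and condition (b) is read here as "no later run of 26 consecutive weeks at or above the normal low level".

```python
import math
from datetime import date


def gaussian_smooth(weekly, sigma=2.8, half_width=13):
    """Smooth a weekly series with a truncated discrete Gaussian kernel.

    half_width=13 gives 27 taps, roughly the 26-week support in the text;
    renormalising the weights near the ends is an assumption of this sketch.
    """
    offsets = range(-half_width, half_width + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    out = []
    for i in range(len(weekly)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= i + k < len(weekly):
                num += w * weekly[i + k]
                den += w
        out.append(num / den)
    return out


def find_declination(weekly, normal_low, window=26):
    """Return the first week index meeting conditions (a) and (b), or None."""
    below = [v < normal_low for v in weekly]
    for i in range(len(below) - window + 1):
        # (a) below the normal range for the next `window` weeks
        if not all(below[i:i + window]):
            continue
        # (b) no later run of `window` consecutive weeks in the normal range
        run, recovered = 0, False
        for b in below[i + window:]:
            run = 0 if b else run + 1
            if run >= window:
                recovered = True
                break
        if not recovered:
            return i
    return None


def years_to_f4(declination, first_biopsy):
    """Years from first biopsy to declination; 0 if declination came first."""
    return max(0.0, (declination - first_biopsy).days / 365.25)


def years_between_stages(declination, first_biopsy, stage_at_biopsy):
    """(years for reaching F4) / (4 - stage at biopsy); None for F4 cases."""
    if stage_at_biopsy >= 4:
        return None
    return years_to_f4(declination, first_biopsy) / (4 - stage_at_biopsy)


NORMAL_LOW = 15.0  # hypothetical threshold; the source does not give the value
series = [20.0] * 30 + [10.0] * 40  # toy weekly PLT values, not patient data
week = find_declination(gaussian_smooth(series), NORMAL_LOW)
print(week)
print(years_between_stages(date(2000, 1, 1), date(1996, 1, 1), 2))
```

Returning `None` for F4 cases mirrors the text's remark that elapsed years cannot be measured for them, so callers can drop those cases before computing group means.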
In Type C IFN there appeared to be a significant difference among stages; however, this was primarily due to one exceptionally long case in F0. Tests after removing this case yielded p=0.970, indicating that there was no significant difference in the years for reaching F4 even in the biopsy-date basis measurement. Kruskal-Wallis tests resulted in the same conclusion.

Similarly, for each of the type B, C with IFN and C without IFN groups, we performed statistical tests with ANOVA aimed at detecting differences in the mean elapsed years between stages w.r.t. the fibrotic stages. In this test we removed F4 cases, as we could not measure the elapsed years for them. For the same reason, we excluded F4 cases when calculating values such as the mean and SD in Table 4. The results of ANOVA were p=0.836, 0.425, 0.340, indicating that there were no significant differences among stages, including F0, for all three groups.

Table 4. Years for reaching F4 (biopsy basis) and years between stages stratified by virus type and fibrotic stages. Summary for 97 declination cases in Table 2*. Columns: Type, Fibrotic Stage, Cases, Years for reaching F4 [biopsy basis] (years): Mean, Median, SD; Years between stages (years/stage): Mean, Median, SD. Rows: B, C IFN and C w/o IFN, each with a subtotal, and Total.
*Fibrotic stages in the second column are based on biopsy. Years for reaching F4 were the years from the first biopsy to the date of declination, under the assumption that the fibrotic stage at the date of declination was F4. If the date of declination was the same as or before the first biopsy, years were treated as 0. Years between stages were calculated by (years for reaching F4)/(4-stage at biopsy).

In summary, with this limited analysis, no significant difference was observed in the years for reaching F4 or in the years elapsed between stages with respect to fibrotic stage, virus type or administration of IFN. However, it is interesting that the elapsed years between stages were 1-2 years/stage in almost all groups. If we simply invert this into a progress speed for comparison with other sources, the result is about 1/1.32=0.76 stage/year for the Type C w/o IFN cases, for example. This is faster than the speed reported in [2] (0.12±0.15 stage/year), implying that liver fibrosis might proceed faster.

It should be noted that the results of this analysis should not be generalized, because (1) we assumed that a patient reached F4 when the PLT level continuously stayed below the normal range over a long time, (2) we selected only exacerbating cases in which PLT continuously decreased, and (3) we did not take into account patient background information such as history of drinking. However, we consider that our approach of measuring the elapsed years between stages, by combining fibrotic stages obtained from biopsy with those inferred from the PLT level, can lead to interesting findings.

5. CONCLUSIONS

In this paper we have introduced a cluster analysis system for time-series medical data and reported the results of a temporal analysis of PLT data in chronic hepatitis patients. The results revealed that the temporal courses of PLT might be classified into several patterns according to their levels and trends, which might be further related to fibrotic stages. The results also suggest that, in some exacerbating cases, liver fibrosis may proceed a few times faster than the natural courses.
In the future, we will validate the clinical reasonability of the results and the usefulness of the system on other datasets.

ACKNOWLEDGEMENTS

This work was supported in part by the Grant-in-Aid for Scientific Research on Priority Area (# ), Development of the Active Mining System in Medicine Based on Rough Sets, by the Ministry of Education, Culture, Science and Technology of Japan.

REFERENCES

[1] H. Miyazaki, Future Prospect of Thrombopoietin. Jpn J. Transfusion Medicine, Vol. 46, No. 3, pp ,
[2] H. Matsumura, M. Moriyama, I. Goto, N. Tanaka, H. Okubo, and Y. Arakawa, Natural course of progression of liver fibrosis in patients with chronic liver disease type C in Japan - a study of 527 patients at one establishment in Japan. J. Viral Hepat, Vol. 7, pp ,
[3] URL:
[4] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, Fourth Edition. Arnold Publishers,
[5] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules. CSLI Publications,
[6] S. Chu, E. J. Keogh, D. Hart, and M. J. Pazzani, Iterative Deepening Dynamic Time Warping for Time Series. In Proc. the Second SIAM Int'l Conf. Data Mining, pp ,
[7] S. Hirano and S. Tsumoto (2003), An Indiscernibility-Based Clustering Method with Iterative Refinement of Equivalence Relations - Rough Clustering. Journal of Advanced Computational Intelligence and Intelligent Informatics, Vol. 7, No. 2, pp ,
[8] S. Tsumoto, S. Hirano, and K. Takabayashi, Development of the Active Mining System in Medicine Based on Rough Sets. Journal of Japan Society for Artificial Intelligence, Vol. 20, 2, pp ,

AUTHOR INFORMATION

Shoji Hirano received the Ph.D. degree in electronics in 2001 from Himeji Institute of Technology, Japan. He joined the Department of Medical Informatics, Shimane Medical University as a research associate in April 2001, and serves as an associate professor since July His research interests include data mining, rough sets, image processing, and medical informatics.
He received the Best Paper Award at the Fourth Biannual World Automation Congress in 2000, and the Annual Conference Award at the 19th Annual Conference of Japanese Society for Artificial Intelligence (JSAI) in He is a member of the IEEE, JSAI and Japan Society for Fuzzy Theory and Intelligent Informatics.

Shusaku Tsumoto graduated from Osaka University, School of Medicine in He received his Ph.D. (Computer Science) on the application of rough sets to medical data mining from Tokyo Institute of Technology in 1997 and became a Professor at the Department of Medical Informatics, Shimane University in His interests include approximate reasoning, data mining, fuzzy sets, granular computing, knowledge acquisition, mathematical theory of data mining, medical informatics and rough sets (alphabetical order). He served as President of the International Rough Set Society from 2000 to 2005 and as a PC chair of RSCTC2000, IEEE ICDM2002, RSCTC2004 and ISMIS