Pattern Discovery on Australian Medical Claims Data: a systematic approach

Ah Chung Tsoi, Senior Member, IEEE, Shu Zhang, Markus Hagenbuchner, Member, IEEE

Abstract: The national health insurance system in Australia records details on the medical services and claims provided to its population. An effective method for the discovery of temporal behavioral patterns in this dataset is proposed in this paper. The method consists of a two step approach which is applied recursively to the dataset. First, a clustering algorithm is used to segment the data into classes. Then, hidden Markov models (HMMs) are employed to find the underlying temporal behavioral patterns. These steps are applied recursively to features extracted from the dataset until convergence. The main objective is to minimize the misclassification of patient profiles into the various classes. This results in a hierarchical tree model consisting of a number of classes; each class groups similar patient temporal behavioral patterns together. The capabilities of the proposed method are demonstrated through an application to a subset of the Australian national health insurance dataset. It is shown that the proposed method not only clusters the data into various categories of interest, but also automatically marks the periods in which similar temporal behavioral patterns occurred.

Index Terms: H.2.8.d Data mining, H.3.3 Information Search and Retrieval, H.2.8.b Clustering, classification, and association rules.

I. INTRODUCTION

Australia's population of about 20 million is covered by a universal publicly funded medical insurance scheme called Medicare. A government agency called the Health Insurance Commission (HIC)

administers the Medicare program. The functions of the HIC include, among many other tasks, the detailed recording of medical transactions, mainly for internal accounting purposes. Access to the Australian Medicare program is granted to all Australian residents and certain categories of visitors; the covered services include medical and hospital services. Where an eligible person incurs medical expenses in respect of a professional service, Medicare will pay benefits for that particular service as outlined in the Medical Benefit Schedule (MBS), which essentially indicates the amount of rebate that the patient can claim under various circumstances 1. Medical transactions for the entire population have been collected since the HIC was set up. Each medical transaction record contains much valuable information, e.g., the name and address of the patient, age, gender, the medical service provider's name and practice location, the type of medical service provided, the date the service was provided, the date payment was made, the date the claim was processed, the way the payment was claimed, the Medicare office in which the claim was made, the officer who served the customer, etc. This information is collected primarily for accounting purposes. It is noted that medical treatments for which no claim to the HIC is made are not recorded. In general, the total number of Medicare transaction records is approximately 375,000,000 records per annum, which equates to about 13 GB of data. From a data mining perspective, the Medicare transaction record database presents a treasure trove for many possible pattern discovery and data mining exercises. We have obtained seven quarters of de-identified 2 medical transaction record data from the HIC. Our aim is to detect similar temporal behavioral patterns amongst patients in the dataset. In this paper, we will use hidden Markov models (HMMs) as a suitable approach for temporal behavioral pattern discovery. HMMs are commonly applied to pattern recognition tasks since they allow a formal representation of a stochastic dynamic process, a systematic analysis of the data, and prediction based on such models.

1 Details of the MBS and the various situations in which a patient can claim a rebate can be found on the HIC web site.

2 The records are de-identified so as to remove any possibility of identifying a person, providing anonymity to patients.

Overall, the method proposed in this paper consists of three steps:

Step 1 Feature extraction
Step 2 Segmentation into clusters
Step 3 Pattern recognition

Because of the richness and the format of the data, the first task is to decide on the extraction of the features most suitable for the task at hand. Feature extraction is performed by applying content filtering techniques, in that we filter those items in the dataset which are representative of the underlying dynamical behavior of the system. Step 2 segments the dataset into clusters of similarly behaving patients. We propose to segment the data into suitable age cohorts, and then to cluster each age cohort using a popular clustering technique, viz., the K-means clustering algorithm. Based on the result of the segmentation process, Step 3 can then be applied to the task of pattern discovery. We propose the use of HMMs as a suitable approach for this task. These three steps yield a set of classes, each grouping similarly behaving temporal patterns together. We will call the application of one cycle of this three step process a coarse classification method, as it yields a set of classes from the dataset. The K-means clustering algorithm clusters the dataset according to a criterion based on the entire patient profile, while the HMM groups the dataset into similar temporal behavioral patterns. Hence the K-means clustering algorithm and the HMM group the dataset quite differently. Since our primary aim is to group patients with similar temporal behavioral patterns together, we could have used the HMM alone on the features extracted from the dataset, without first using the K-means clustering algorithm. However, we found from preliminary experiments that this produces a large number of classes, some of which do not appear to be assigned correctly when inspected visually. The K-means clustering algorithm was introduced to first cluster the dataset according to overall behavior, as it considers the characteristics of the entire patient profile. In other words, the K-means clustering algorithm acts as a filtering process, grouping similar overall patient behaviors together first. The HMM is then applied to these similar overall behaviors to further separate those with similar temporal behavioral patterns into subclasses.

We find from our initial experiments that this two step process yields a smaller number of subclasses. Further, profiles within the same subclass bear resemblance to one another when inspected visually. As will be shown later, the recursive application of Step 2 and Step 3 allows for a further refinement of the coarse classification of the data by the K-means clustering algorithm. This yields a hierarchical tree model consisting of a number of layers, each layer being described by classes of patients with similar temporal behavioral patterns. The main innovative idea in this paper is the provision of a practical method for the unsupervised grouping of a dataset into a hierarchical tree model, in which each layer of the tree consists of clusters (classes) of similar temporal behavioral patterns. The subclasses representing the successive layers of the hierarchical tree model bear similar temporal behavioral characteristics to the classes in the preceding layer, their parent classes. The structure of this paper is as follows: the feature extraction approach presented in Section II generates individual patient profiles as output. In Section III we use a K-means clustering algorithm to coarsely segment the data into clusters. Such segmentation allows us to employ a hidden Markov model (HMM) to classify the data in each cluster into similar temporal behavioral patterns. This is addressed in Section IV. In Section V we show that K-means clustering and HMMs applied recursively to the dataset produce a fine decomposition of the data into classes, and result in a generalization performance which is superior to the initial approach of deploying the K-means clustering algorithm and HMM methodology alone, without the hierarchical decomposition. Experimental results presented in Section VI show that the proposed approach is effective in classifying patient records into classes such that similar profiles are grouped together. Finally, conclusions are presented in Section VII.

II. FEATURE EXTRACTION AND CONSTRUCTION OF PROFILES

Given the vast amount of data, a way is needed to extract some pertinent features of the data to allow for further processing by a pattern discovery algorithm. In addition, a segmentation of the data can be employed to support the mechanism of pattern discovery.

It is recognized that a patient's medical record changes dramatically with age. For example, a patient engages medical services differently during childhood years than at an older age. Also, female patients of child bearing age utilize medical services differently than male patients of the same age. Moreover, a patient's medical record can be influenced by other factors such as location of residence (e.g., rural or metropolitan area), seasonal changes, and others. We analyzed the data and found that a patient's age and gender influence the transactions most strikingly. In order to reduce the effect of dependencies on age we decided to sort patients into age cohorts. As a consequence, we partitioned the data into groups of similar age, such as those shown in Table I.

TABLE I
COHORT GROUPS AND SIZES. FOR EACH AGE COHORT (0-3 INFANTS, PRE-SCHOOL AGE, SCHOOL AGE, YOUNG ADULT, ADULT, MATURE ADULT, MID-LIFE, RETIREMENT AGE, AND >71 ELDERLY), THE TABLE LISTS, SEPARATELY FOR FEMALE AND MALE PATIENTS, THE NUMBER OF PATIENTS, THE AVERAGE NUMBER OF CLAIMS FILED, AND THE AVERAGE VALUE OF A CLAIM. ON AVERAGE, FEMALE PATIENTS FILED 9.1 CLAIMS AND MALE PATIENTS 8.36 CLAIMS.

From Table I we find that both the cost and the frequency of medical services used tend to increase with the age of a patient. Moreover, a striking observation is that female patients older than 16 years of age use medical services considerably more frequently than their male counterparts.

This difference is plainly associated with the physiological differences between the sexes and their ongoing consequences (e.g., child bearing). In contrast, more boys (male children younger than 16) seek medical services than girls of the same age. This may be explained by the general observation that boys are more exploratory outdoors, thus incurring a higher exposure to risky activities than girls. Further analysis of the data showed that a patient's behavior changes not only with age and gender but also with the address of the patient (e.g., whether a patient is from a rural or metropolitan area). These two observations guide us to the following feature extraction process: first, segment the dataset into age cohorts so that patients with similar behaviors are grouped together. Secondly, in order to provide generalization capability, we have chosen to consider patients from addresses in metropolitan, semi-rural, and rural areas. This selection of patients provides possibilities for generalization of the models built, as it includes diversity of data for the same age cohort. This paper will utilize the cohort of patients who are between 45 and 55 years of age for demonstration purposes and for subsequent experiments 3. We further mark those patients depending on their location of residence, which is either a metropolitan area or a rural region. This cohort group is used to present the modeling methodology. For the task of extracting features, it is possible to use a number of indicators such as the number and type of services used, the way a payment is made, the provider's specialization, etc. For demonstration purposes, we decided to use the total benefits paid as a feature, since it not only reflects a patient's behavior but also encodes most types of service implicitly. Most patients do not see a medical service provider daily. This motivates us to consider a rolling time window through the data, rather than individual days, as a basic time unit. Hence, we will extract features by applying a rolling time window of fixed size to the total benefits paid during a period. The result is a temporal profile of a patient. In order to avoid border effects, profiling is performed on one year of data, rather than the full seven quarters. This is because some patients may not claim for the expenses until some time after having received the services from the medical service providers.

3 We could have divided the age cohort further into male and female patient profiles. However, we decided not to present such results as it adds complexity to the presentation without adding much value to the proposed methodology.

Had we used all seven quarters of data, this would have introduced border effects. Considering the total benefits within a time window, instead of working on each day as a unit of processing, has an intuitive appeal. Often a patient is required to undertake a course of treatments over a short period of time, e.g., 14 days or less. During this period, the patient may need to see a medical service provider a number of times, incurring medical expenses. Hence, by summing the total benefits paid over a time window, the underlying nature of the course of treatment is captured. Mathematically this simple step can be represented as follows: we are given a longitudinal set of medical records of a patient over a total of D days; in our case D = 365 days (one year). We are only interested in the benefits paid by the HIC on servicing this person, hence only the benefits paid relevant to the person are extracted from the longitudinal medical records. Let these be denoted by z_1, z_2, ..., z_D. Note that we use the day when the patient visits the medical service provider as the day on which the benefit is paid, rather than the actual date when the benefit is paid. This is because the medical service provider may choose to bulk bill, so that claims may be sent in batches to the HIC; this would distort the profiles. If there is no visit to a medical service provider on day i, then z_i = 0. Given a time window of W days, the total benefits paid over this time window is:

y_t = \sum_{i=t}^{t+W-1} z_i, \quad t = 1, 2, \ldots, D - W + 1 \quad (1)

It is possible to slide the time window to the next day, perform the same computation, and obtain another point on the profile. Thus, over D days, a total of D - W + 1 points on the profile are obtained. We have experimented with different values of the fixed time window W, and found that using a period of 14 days helps to achieve good results. Thus, each profile will be of dimension D - W + 1 = 352. An example of the effect of using a time window is shown in Figure 1. The profile on the left shows a patient's profile which uses individual days as the basic time unit. The same profile using a sliding time window of W = 14 is shown on the right. It is observed that profiles consist of segments of activities over a period of time.
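As a concrete illustration of Eq. (1), the following minimal sketch (in Python with numpy; the function and variable names are ours, not the paper's) computes the rolling-window profile from one year of daily benefit totals:

```python
import numpy as np

def window_profile(z, W=14):
    """Rolling-window profile of Eq. (1): y_t is the sum of the daily
    benefits z_t, ..., z_{t+W-1}. Input z holds one entry per day
    (0 on days with no visit); output has D - W + 1 entries."""
    z = np.asarray(z, dtype=float)
    c = np.concatenate(([0.0], np.cumsum(z)))  # cumulative sums, O(D)
    return c[W:] - c[:-W]                      # all window sums at once

# Example: D = 365 days of synthetic daily benefits, so the profile
# has 365 - 14 + 1 = 352 points, matching the dimension in the text.
rng = np.random.default_rng(0)
z = np.where(rng.random(365) < 0.05, rng.uniform(20.0, 120.0, 365), 0.0)
y = window_profile(z, W=14)
assert y.shape == (352,)
```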

Fig. 1. A patient's profile. The horizontal axis shows the number of the time window; the vertical axis displays y_t, the sum of benefits paid in a time window. The profiles are from the same patient, using W = 1 (left) and W = 14 (right).

The area under the curve for the segment [t_1, t_2] is given by:

A_{t_1,t_2} = \sum_{t=t_1}^{t_2} y_t = \sum_{t=0}^{t_2-t_1} \sum_{i=1}^{W} z_{t_1+t+i-1}, \quad t_1, t_2 = 0, 1, 2, \ldots, 352 \quad (2)

Thus the area enclosed by the segment will be a defining feature 4.

In summary, this feature extraction approach creates a one-dimensional temporal pattern of fixed dimension for each patient in the dataset. These profiles can now be grouped into homogeneous clusters such that profiles within a cluster share similar properties. A simple approach to the clustering of data is the application of a K-means clustering algorithm.

III. CLUSTERING

The feature extraction step produced a 352-dimensional numerical data vector for each patient in the data set. Because of the time consuming aspects of performing the calculation on the entire age cohort, we created a smaller subset of profiles totaling 104,417 patients aged between 45 and 55, 32,476 of whom reside in one of the 10 major Australian cities; all other patients live in rural or semi-rural areas. We randomly selected 73,942 profiles (approximately two thirds of the total number of patients) to serve as a training set; all remaining profiles serve as a testing dataset. The feature vectors can differ significantly in value, and hence produce points in the data space which are located in well separated areas. As a consequence, it can be assumed that profiles which share certain properties form clusters.

4 It will be shown in Section VI that this is close to the mean value of the output of the HMM.
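The segment-area feature of Eq. (2) is then just a partial sum of the window profile. Continuing the sketch above (0-based array indexing is our convention, whereas the text counts windows from 1):

```python
import numpy as np

def segment_area(y, t1, t2):
    """Eq. (2): area under the profile over the segment [t1, t2],
    i.e. the sum of the window totals y_t for t1 <= t <= t2
    (here t1 and t2 are 0-based array positions)."""
    return float(np.sum(y[t1:t2 + 1]))
```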

There are a number of algorithms which are capable of segmenting data into clusters in the input space. Perhaps the simplest algorithm for such a task is the well known K-means clustering algorithm [3]. In short, the K-means clustering algorithm works as follows [3]: given a set of unlabeled data y_1, y_2, ..., y_n, where each y_i is a d-dimensional vector, we wish to compute the classes into which this set of data can be grouped. We use a coherence criterion to group the data points together:

J = \sum_{k=1}^{K} \sum_{y \in D_k} \| y - m_k \|^2 \quad (3)

where D = \bigcup_{k=1}^{K} D_k is a possible partition of the n data points into K partitions with D_i \cap D_j = \emptyset for i \neq j, \| \cdot \| denotes the Euclidean norm, and m_k denotes the mean of the k-th cluster. Note that this criterion measures the total distortion from the mean over the entire duration of the profile. The criterion does not pay attention to the underlying temporal differences within the profiles. Thus two profiles with significantly different temporal behaviors may give rise to the same distortion measure. A simple K-means clustering algorithm is given as follows [3]:

Step 1 Initialize a partition. Randomly choose K points y_1, y_2, ..., y_K as centers. Every other point y_j is assigned to cluster D_i if it is closer to the center y_i than to any other center.
Step 2 Compute the mean of each cluster D_i, i = 1, 2, ..., K, as follows:

m_i = \frac{1}{|D_i|} \sum_{y_k \in D_i} y_k \quad (4)

Step 3 For k = 1, 2, ..., n compute d_j^2 = \| y_k - m_j \|^2, j = 1, 2, ..., K. Assign y_k to cluster D_i where i = \arg\min \{ d_1^2, d_2^2, ..., d_K^2 \}.
Step 4 Exit if there is no update; otherwise repeat Step 2 through Step 4.

There are many variants of the K-means clustering algorithm. We will use this simple K-means clustering algorithm, as all we wish to obtain is an approximate clustering of profiles. Other clustering algorithms could be used instead, e.g., the fuzzy K-means algorithm [3], the self organizing map method [3], Gaussian mixtures [4], or EM clustering [5], [6]. We have experimented with various values of K. For example, when applying the K-means algorithm to the training set of 73,942 profiles with K = 10, we find 10 clusters to be sufficient.
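A compact sketch of this plain K-means procedure (our own illustrative code, assuming the profiles are stacked in an (n, d) numpy array):

```python
import numpy as np

def kmeans(Y, K, max_iter=100, seed=0):
    """Plain K-means following Steps 1-4 above. Y is an (n, d) array
    of profiles; returns cluster labels and the K cluster means."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=K, replace=False)].copy()  # Step 1
    labels = None
    for _ in range(max_iter):
        # Step 3: squared Euclidean distance of every profile to every center.
        d2 = np.stack([((Y - c) ** 2).sum(axis=1) for c in centers], axis=1)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # Step 4: no update, stop
        labels = new_labels
        for k in range(K):                     # Step 2: cluster means, Eq. (4)
            if np.any(labels == k):
                centers[k] = Y[labels == k].mean(axis=0)
    return labels, centers
```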

The six largest clusters contain more than T = 300 profiles each and are shown in Table II. Clusters containing fewer than 300 profiles are ignored at this stage 5; their profiles, however, are grouped into the Others category. The reasons for this step will be addressed in greater detail later in this paper.

5 The figure of T = 300 is based on the intuition that an HMM is a stochastic model and requires a large set of training data to estimate its parameters; a training set of fewer than 300 profiles is considered too small to reliably estimate the parameters of an HMM. In practice, the results obtained stay approximately the same when the value of T is varied over a range. Hence, the algorithm appears to be relatively insensitive to the choice of T.

TABLE II
THE SIX LARGEST CLUSTERS FOUND BY THE K-MEANS CLUSTERING ALGORITHM (CLUSTER1 THROUGH CLUSTER6, PLUS THE OTHERS CATEGORY). THE NUMBER OF PROFILES IN EACH CLUSTER, AND THE MINIMUM / MAXIMUM VALUE OF TOTAL BENEFITS PAID OVER A 14 DAY PERIOD FOR THE PROFILES IN A CLUSTER, ARE SHOWN.

Six typical profiles from each cluster are shown in Figure 2. The profiles from different clusters differ significantly in terms of activity and benefit values paid. In the current context, cluster 6 contains the profiles which feature the greatest activity and the largest amounts of benefit paid, while in contrast, profiles in cluster 1 feature little activity. The K-means clustering algorithm by itself is not particularly useful for the task of temporal pattern discovery. This is because the K-means clustering algorithm has a number of shortfalls: The algorithm is not very efficient on sparse vectors, since the Euclidean norm often becomes zero. Thus, the K-means algorithm does not reliably cluster profiles which show little activity.

Fig. 2. Typical profiles in each cluster (one panel per cluster, Cluster 1 to Cluster 6). The vertical scales differ in each cluster. In this diagram we are only interested in portraying the variations in the shape of the samples in each cluster, rather than their actual magnitudes.

The algorithm is not particularly reliable in detecting overlapping clusters; the optimal border zone between overlapping clusters may not be found. Moreover, the clustering of data does not equate to the discovery of temporal patterns which may be encoded in the input data. For the task at hand, a context sensitive classification of the data set is desired. Nevertheless, the K-means clustering algorithm is fast and effective in the segmentation of the input space into clusters of various degrees of interest. Building on this, a temporal pattern discovery algorithm such as an HMM can be trained to detect sequences of events embedded in the profiles. This will be addressed in the following section.

IV. HIDDEN MARKOV MODELS

It is noted that the profiles resemble signatures. We have a large number of profiles, and wish to find out how unseen profiles can be classified. In this type of situation, HMMs are well known to be a very good tool, as they have been widely deployed in speech processing [2], image processing, and many other areas.

This motivates us to study whether it is possible to classify unseen profiles with HMMs. This section briefly describes the HMM, following closely the development in [2]; it is presented here to keep the explanation of our proposed method self contained. There are a number of variants of HMMs [1], e.g., the discrete HMM, the continuous observation HMM, and the input-output HMM. Here we describe the hidden Markov model with continuous observations [2]. The sample HMM given in Figure 3 will help in understanding the mathematical formulation.

Fig. 3. Example HMM with an initial state π, hidden states x_i, and state transitions a_{ij}.

It is assumed that the observations y_1, y_2, ..., y_T are generated by a multivariate probability density function. For simplicity, we will assume that this is a Gaussian mixture:

f_{y|x}(\xi \mid i) = \sum_{m=1}^{M} c_{im} N(\xi; \mu_{im}, C_{im}) \quad (5)

where N(\xi; \mu_{im}, C_{im}) denotes a Gaussian probability density function with mean \mu_{im} and covariance matrix C_{im}. The notation f_{y|x}(\cdot) denotes the probability density of the observation y given the hidden state x. The constants c_{im} are known as mixing coefficients. In order for this to be a probability density function, we must have \sum_{m=1}^{M} c_{im} = 1 for 1 \le i \le S, where S is the size of the alphabet (the dimension of the state space). It is further assumed that the observation probability density functions are generated by a hidden state x, where x is an S dimensional vector. This state follows the evolution equation:

x(t+1) = A x(t) \quad (6)

where A is the state transition matrix, with initial condition x(0) = π.
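As an illustration of Eq. (5), a minimal sketch of the emission density for one state (ours, not the paper's code; since the profile values y_t here are scalar benefit totals, each covariance matrix C_im reduces to a variance):

```python
import numpy as np

def emission_density(xi, c, mu, var):
    """Eq. (5): f(xi | state i) as an M-component Gaussian mixture.
    c, mu and var are length-M arrays holding the mixing coefficients,
    means and variances of state i; the observation xi is scalar."""
    dens = np.exp(-0.5 * (xi - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return float(np.dot(c, dens))  # sum_m c_im * N(xi; mu_im, C_im)
```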

The parameters of the model are then M = \{ S, \pi, A, \{ f_{y|x}(\xi \mid i), 1 \le i \le S \} \}. The problem of HMM estimation can be divided into two sub-problems [2]:

1) Given a series of training observations for a given entity, say, a label, how do we train an HMM to represent this label? This problem becomes one of finding a procedure for estimating an appropriate state transition matrix A and an observation probability density function f_{y|x} for each state.
2) Given a trained HMM, how do we find the likelihood that it produced an incoming observation sequence?

We will not derive the HMM estimation algorithm here, as it is readily available in, e.g., [2], [7], [8], but we summarize the training algorithm as follows. Let

\nu(i; t, l) = P(x(t) = i, y(t) \text{ produced in accordance with mixture density } l \mid M) = \frac{\alpha(y_1^t, i) \, \beta(y_{t+1}^T \mid i)}{\sum_{j=1}^{S} \alpha(y_1^t, j) \, \beta(y_{t+1}^T \mid j)} \cdot \frac{c_{il} N(\xi; \mu_{il}, C_{il})}{\sum_{m=1}^{M} c_{im} N(\xi; \mu_{im}, C_{im})} \quad (7)

where y_1^t denotes the sequence y_1, y_2, ..., y_t, and \alpha and \beta are quantities, respectively called the forward and backward probability sequences, which are associated with the estimation of the parameters and are defined as follows:

\alpha(y_1^t, i) = P(y_1^t, x(t) = i \mid M), \qquad \beta(y_{t+1}^T \mid i) = P(y_{t+1}^T \mid x(t) = i, M) \quad (8)

Both \alpha and \beta can be estimated efficiently from the given set of training data as follows:

\alpha(y_1^{t+1}, i) = \sum_{j=1}^{S} \alpha(y_1^t, j) \, a(i \mid j) \, b(y(t+1) \mid i) \quad (9)

and

\beta(y_{t+1}^T \mid i) = \sum_{j=1}^{S} \beta(y_{t+2}^T \mid j) \, a(j \mid i) \, b(y(t+1) \mid j) \quad (10)

where a(i \mid j) denotes the probability of a transition from state j to state i, and b(y(t) \mid i) = f_{y|x}(y(t) \mid i) denotes the emission density of state i.
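In code, the recursions (9) and (10) amount to one forward and one backward sweep over the sequence. The sketch below is ours; it uses the convention A[i, j] = P(state j at t+1 | state i at t) and assumes the emission densities b[i, t] = f(y(t) | state i) have been precomputed (e.g., with the mixture sketch above). For long sequences one would additionally rescale each step to avoid numerical underflow:

```python
import numpy as np

def forward_backward(A, b, pi):
    """Forward and backward passes of Eqs. (9)-(10).
    A:  (S, S) transition matrix, A[i, j] = P(next state j | state i).
    b:  (S, T) emission densities, b[i, t] = f(y(t) | state i).
    pi: (S,)  initial state distribution.
    Returns alpha[t, i] and beta[t, i]."""
    S, T = b.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * b[:, 0]
    for t in range(T - 1):                      # Eq. (9)
        alpha[t + 1] = (alpha[t] @ A) * b[:, t + 1]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):              # Eq. (10)
        beta[t] = A @ (b[:, t + 1] * beta[t + 1])
    return alpha, beta
```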

If we let

\nu(i; \cdot, l) = \sum_{t=1}^{T} \nu(i; t, l) \quad (11)

where

\nu(j, t) = \frac{\alpha(y_1^t, j) \, \beta(y_{t+1}^T \mid j)}{P(y \mid M)}, \quad t = 1, 2, \ldots, T \quad (12)

then we have:

c_{il} = \frac{\nu(i; \cdot, l)}{\sum_{m=1}^{M} \nu(i; \cdot, m)} \quad (13)

\mu_{il} = \frac{\sum_{t=1}^{T} \nu(i; t, l) \, y(t)}{\nu(i; \cdot, l)} \quad (14)

C_{il} = \frac{\sum_{t=1}^{T} \nu(i; t, l) \, [y(t) - \mu_{il}][y(t) - \mu_{il}]^T}{\nu(i; \cdot, l)} \quad (15)

These equations, derived using an expectation maximization algorithm [2], will converge to a set of consistent parameters. As for the problem of finding class labels given a set of observations, we will use the Viterbi algorithm [2], which can be described as follows:

Step 1 Initialization:
\delta_0(i) = 1 for i = 1, and \delta_0(i) = 0 for i > 1.

Step 2 Recursion, for 1 \le t \le T and 1 \le j \le n:
\delta_t(j) = \max_{1 \le i \le n} \delta_{t-1}(i) \, a_{ij} \, b_j(y_t)
\Psi_t(j) = \arg\max_{1 \le i \le n} [\delta_{t-1}(i) \, a_{ij}]

Step 3 Termination (maximum probability P^* of y and best exiting state i_T^*):
P^* = \max_{1 \le i \le n} \delta_T(i)
i_T^* = \arg\max_{1 \le i \le n} \delta_T(i)

Step 4 Backtracking of the state sequence, for t = T-1, T-2, ..., 1:
i_t^* = \Psi_{t+1}(i_{t+1}^*)
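The Viterbi recursion translates directly into code. In this sketch (ours) the fixed initial state of Step 1 is generalized to an initial distribution pi; setting pi = (1, 0, ..., 0) recovers Step 1 exactly:

```python
import numpy as np

def viterbi(A, b, pi):
    """Viterbi decoding, Steps 1-4 above. Same conventions as the
    forward/backward sketch; returns the most likely state path and
    its probability."""
    S, T = b.shape
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = pi * b[:, 0]                    # Step 1: initialization
    for t in range(1, T):                      # Step 2: recursion
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()              # Step 3: termination
    for t in range(T - 2, -1, -1):             # Step 4: backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path, float(delta[-1].max())
```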

In order to understand how the HMM performs in the context of this paper, we apply HMMs to the dataset described earlier.

A. Applying HMM

As described in Section III, the set of patient profiles has been grouped into 6 individual clusters. Here we apply one HMM to each of the six clusters of data. Each HMM is trained on 90% of the data in its cluster, while ensuring that the size of the training set does not exceed a given limit (here 2,000). This step effectively splits the profiles of each cluster into a training set and a data pool. There are a number of reasons for this step: first, it balances the sizes of the training sets so that each HMM is trained on a similarly sized set of data. However, a more important reason will become clear in Section V, where HMMs are applied and trained recursively on the data. There, the data pool allows for an optimization of the generalization performance, while providing an automated mechanism which determines the minimal number of training data required to obtain an optimal solution. Table III shows the sizes of the data sets used in this section. The data in the training sets are used to train the HMMs.

TABLE III
SIZE OF TRAINING SET AND DATA POOL. THE CLUSTERS WERE OBTAINED THROUGH THE K-MEANS ALGORITHM.

Cluster     Total     Training    Data Pool
Cluster1    61,325    2,000       59,325
Cluster2    7,280     2,000       5,280
(the remaining rows, for Cluster3 through Cluster6, are not legible in this transcription)
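The split just described (90% of each cluster, capped at 2,000 profiles) can be sketched as follows; the helper name and the use of numpy arrays are our own assumptions:

```python
import numpy as np

def split_cluster(profiles, cap=2000, frac=0.9, seed=0):
    """Split one cluster into a training set (90% of the cluster, but
    no more than `cap` profiles) and a data pool (the remainder)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(profiles))
    n_train = min(int(frac * len(profiles)), cap)
    return profiles[idx[:n_train]], profiles[idx[n_train:]]
```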

Once trained, the HMMs are evaluated on the combined training set and data pool 6. The classification results for the trained HMMs are shown in Table IV. For example, 46,119 patterns from cluster 1 are misclassified as patterns from cluster 2. The diagonal elements in Table IV give the number of patterns for which the classification by the HMMs agrees with the clustering produced by the K-means clustering algorithm. Thus, it is noted that the HMMs do not agree very well with the K-means clustering at this stage, as Table IV contains many non-zero and relatively large off-diagonal elements.

TABLE IV
CLASSIFICATION RESULTS BY THE HIDDEN MARKOV MODELS. THE LABEL A IS USED TO DENOTE PROFILES IN CLUSTER 1 IN TABLE III, B FOR CLUSTER 2, ETC. ROWS (A THROUGH F, PLUS OTHERS) GIVE THE K-MEANS GROUPING; COLUMNS (A THROUGH F) GIVE THE HMM CLASSIFICATION.

When comparing the HMMs' classification results with those obtained from the K-means clustering algorithm, the HMM classification appears perceptually more agreeable. From Figure 4 it is observed that the K-means clustering algorithm and the HMMs grouped the same set of profiles differently. On a global scale it is noted that the K-means clustering algorithm grouped each profile as a whole (using the Euclidean distance as a coherence criterion). The HMMs, on the other hand, group profiles together based on the time evolution of the profiles, and hence the grouping of profiles is influenced by the context of events. Thus, the classification of profiles is the result of considering the context in which benefit is paid.

6 We chose to use the combined training set and data pool in this evaluation step to facilitate further refinement of the modeling process through cycling, as will be described in a later section.

For example, the HMM assesses whether a benefit paid is out of the ordinary by considering the context within which the benefit was paid. As such, HMMs may classify large payments of benefit as not atypical if the context within which such a payment is made is commonly observed. Similarly, an HMM may classify relatively small amounts of benefit paid as atypical if the context within which such a payment is made is rarely observed in the dataset.

Fig. 4. The same profiles as in Figure 2, classified by the trained hidden Markov models (one panel per class, Class A to Class F).

This approach neglected an important issue: clusters produced by the K-means clustering algorithm which contained fewer than 300 profiles were ignored, and not used for the training of HMMs. This is in recognition of the fact that probabilistic methods such as HMMs require large amounts of training patterns in order to be trained successfully.

However, neglecting profiles from small clusters can have a negative side effect, since it is the particularly interesting atypical profiles which are normally grouped in small clusters. Hence, we suggest an approach which utilizes profiles from clusters which are too small for the training of HMMs. It is possible to combine profiles from the small clusters with the rest of the dataset during the classification phase. The effect is indicated by the row Others in Table IV. By doing this, all profiles from the training set are classified and hence can be used for further processing. It is recognized that such a procedure may not produce a particularly good classification result for some of the profiles. This is due to the fact that the K-means clustering algorithm discovered more clusters than there are HMMs to work with, and hence the separation of the profiles may not be optimal. An approach which solves this dilemma is addressed in the next two sections. In addition, an interesting observation is that the HMMs classified just 14,229 profiles as patterns from class A, and hence reduced the number of patterns in the original cluster 1 significantly. Similarly, the HMMs resized the other classes. The fact that the HMMs re-arrange the way patterns are classified suggests re-applying the HMMs to these new sets of data. For example, we can re-train an HMM on the set of 14,229 profiles which belong to class A. Similarly, an HMM can be trained on the patterns from each of the other classes. It can be assumed that such a procedure will reduce the number of off-diagonal, or misclassified, patterns. This idea is elaborated in the following section.

V. RECURSION

In order to reduce the number of misclassifications, we propose to apply HMMs recursively to the data set as follows:

Step 1 For each class, randomly choose 90% of the profiles, but no more than N 7, and place them in the active training set. Place the remaining profiles in a data pool. There is one data pool for each class.

7 In this paper we use N = 2,000.

Step 2 Train one HMM on each active training set.
Step 3 Classify all data by the newly trained HMMs. The result is a new classification of the data.
Step 4 Cycle through from Step 1 if the number of misclassifications is greater than zero, or until a maximum number of recursions has been performed.

This approach of iteratively re-training HMMs has a number of advantages: In Step 2, only a relatively small set of patterns is used to train the HMMs, which reduces the computational time required considerably. Since recursion is performed through Step 1, the HMMs are eventually presented with all profiles in the training set. But since recursion stops when no profiles are misclassified, an HMM may not necessarily need to be trained on all profiles. Hence, this approach provides a tool to minimize training requirements on large sets containing redundant data. Finally, the number of misclassified profiles is minimized. A code sketch of this loop is given at the end of this subsection.

Executing the HMMs recursively produces Figure 5, which illustrates that the number of misclassified profiles approaches zero with the number of iterations. Here, training is stopped after 50 iterations, at which point we observed only 216 misclassified profiles. Details of the classification result are shown in Table V.

Fig. 5. The convergence of the sum of off-diagonal elements through the iterative training of the HMMs. The horizontal axis gives the number of iterations; the vertical axis the sum of the off-diagonal elements.
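The sketch below makes Steps 1 to 4 concrete. It is an illustration under stated assumptions, not the authors' code: train_hmm and classify are hypothetical helpers standing in for the HMM estimation of Section IV and for maximum-likelihood classification, and split_cluster is the sketch given earlier:

```python
def retrain_until_stable(classes, max_rounds=50, cap=2000):
    """Iterative HMM re-training, Steps 1-4 of Section V.
    `classes` maps a class label to an array of profiles.
    `train_hmm(training_set)` fits one HMM; `classify(models, classes)`
    re-assigns every profile to the highest-likelihood HMM and reports
    how many profiles changed class (both helpers are hypothetical)."""
    models = {}
    for _ in range(max_rounds):
        # Step 1: per class, 90% of the profiles capped at `cap`.
        train_sets = {lab: split_cluster(p, cap=cap)[0]
                      for lab, p in classes.items()}
        # Step 2: one HMM per active training set.
        models = {lab: train_hmm(t) for lab, t in train_sets.items()}
        # Step 3: re-classify all profiles with the new HMMs.
        classes, n_misclassified = classify(models, classes)
        # Step 4: stop once no profile changes class.
        if n_misclassified == 0:
            break
    return models, classes
```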

In this set of experimental results it is observed that the number of misclassified profiles converges within fewer than 50 iterations. It is possible to interrupt training earlier, as the number of misclassified profiles can converge to zero at an earlier iteration. Table V, as an example, presents the confusion matrix obtained after 47 iterations. The sum of the off-diagonal elements, i.e., the number of misclassifications, has improved significantly when compared with Table IV.

TABLE V
FINAL RECOGNITION RESULT AFTER TRAINING THE HIDDEN MARKOV MODELS ITERATIVELY (CONFUSION MATRIX OVER THE CLASSES A THROUGH F).

Note that the recursive training of the HMMs is valid, as we do not have any a priori information as to which profile belongs to which cluster. In other words, we do not have any ground truth data to guide the grouping of the data. Our task of finding a good grouping of the data is achieved by recursively training the HMMs, as described in this section, since this minimizes the off-diagonal values of the confusion matrix. In other words, the data set is grouped into classes with minimum misclassification. Intuitively, this should produce groups of profiles which are most similar to one another. The recursive application of the HMMs re-classified the data into different classes. However, the number of HMMs is influenced by the clustering process. It is possible that within each of the current classes, more than one temporal behavioral pattern exists. Hence, it is possible to diversify the classification by partitioning the profiles of particular classes. For example, the 996 profiles belonging to class A after having applied the HMMs recursively can again be segmented into individual clusters by applying the K-means clustering algorithm.

Such an approach allows us to refine the classification of patterns in class A. Once we have the new partitioning of this subset, it is then possible to re-train HMMs on the new clusters. The idea and its effects will become clearer in the following section.

A. Refining the clustering

The outcome of the previous section is a set of profiles grouped into a fixed number of classes. What we wish to explore is the possibility of refining the classification by building sub-classes. We achieve this by applying the K-means clustering algorithm (or any other clustering algorithm) to each of the identified classes. Once we divide a class of profiles into sub-clusters using the K-means clustering algorithm, we can train an HMM on each sub-cluster and iterate through the previous steps to obtain a set of sub-classes which constitutes the class. We can repeat this process for all classes. The outcome of this step is that we obtain a number of sub-classes for each class of profiles, which allows for a refined labeling of the profiles. In other words, we obtain a hierarchical tree of classes. A flowchart of the proposed approach is given in Figure 6.

Fig. 6. The procedures involved in the proposed algorithm, illustrating the recursive approach: cluster the data; if more than one cluster results, train HMMs until convergence and classify the data with them; if more than one class results, recurse into each class, otherwise stop.

It is shown that this approach allows the K-means clustering algorithm to be applied to each sub-group recursively, until no further clustering of the data is possible.

Fig. 7. A flowchart of the iterative K-means clustering and hidden Markov model refinement process. At the top level, 6 clusters and 50 HMM iterations yield the 6 classes A to F. At sublevel 1, class A splits into 2 clusters (21 iterations, 2 classes), class B into 6 clusters (6 classes), class C into 4 clusters (4 classes), classes D and E into 3 clusters each (3 classes each), and class F remains a single cluster.

For example, the result of re-applying the clustering algorithm to each of the six classes is illustrated in Figure 7. In Figure 7, Top Level refers to the results (the classes A to F) obtained earlier in this section. Sublevel 1 refers to the results obtained when re-applying K-means clustering, and HMM training and classification, to each of the 6 classes. For example, it is shown that the application of the K-means clustering algorithm to the data belonging to class A produced 2 clusters. Similarly, the K-means clustering algorithm separated the data belonging to class B into 6 clusters, and so on. Such clustering of data within a class allows us to refine the classification of the data in that class. Consequently, HMMs can then be trained recursively on each of the new clusters. This is indicated in Figure 7 by the number of iterations executed when training the HMMs. It is possible for a recursively trained HMM to respond with fewer classes than clusters. Evidently, the recursive application of HMMs and the K-means clustering algorithm allows us to find more finely separated sub-classes, and hence allows for a refined classification of the data. This refinement of the classification is achieved through a tree-like diversification of the partitioning of the data. The recursion can continue until either the K-means clustering algorithm or the HMMs are unable to separate the patterns into more clusters and classes. A set of experiments which visualizes the effect of the proposed approach is given in the following section.
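Before turning to those experiments, the recursive refinement of Figure 6 can be summarized as a short recursive function. This is an illustrative sketch under our own assumptions: kmeans_classes is a hypothetical helper that runs K-means and groups clusters smaller than the threshold T into an Others category, and retrain_until_stable is the sketch given in Section V:

```python
def build_tree(profiles, min_size=300):
    """Recursive refinement of Figure 6: cluster, re-train HMMs until
    stable, then recurse into each resulting class. `min_size`
    mirrors the T = 300 small-cluster threshold."""
    clusters = kmeans_classes(profiles, min_size=min_size)  # K-means step
    if len(clusters) < 2:            # no further clustering possible
        return profiles              # leaf of the hierarchical tree
    models, classes = retrain_until_stable(clusters)        # HMM step
    if len(classes) < 2:             # the HMMs collapse to one class
        return profiles
    # recurse into every class; the result is a tree of sub-classes
    return {lab: build_tree(p, min_size) for lab, p in classes.items()}
```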

VI. EXPERIMENTAL RESULTS

We continued to apply the K-means clustering algorithm and the HMMs recursively on the dataset until the stopping criterion was reached. The final clustering of the data is illustrated in Figures 8 to 12. Each of these figures corresponds to one of the six classes produced at the 47th iteration of the top level. Figure 8 gives an example by illustrating the clustering of the data belonging to class A.

Fig. 8. Applying the K-means clustering algorithm and HMMs iteratively to the 996 profiles in class A. The values in the brackets give the class label and the number of profiles in the sub-class.

It is shown that the recursive application of the K-means clustering algorithm and HMMs discovered 9 subclasses for class A, denoted A1, A2, ..., A9, and that the algorithm reached the stopping condition after recursing at most 6 times, which corresponds to Level 5 in Figure 8. A total of 71 classes are found when combining the results for all classes. Some of the properties of the profiles found in each sub-cluster are visualized in Figure 13, which presents the maximum, minimum, and average magnitude of the total benefit values found in profiles within any given class. It is shown that the value ranges overlap considerably between classes, and hence it can be concluded that the total benefit paid to a patient over the course of a year does not contribute significantly to the clustering result.

Fig. 9. Applying the K-means clustering algorithm and HMMs iteratively to profiles in class B. The values in the brackets give the class label and the number of profiles in the sub-class.

Fig. 10. Applying the K-means clustering algorithm and HMMs iteratively to profiles in class C. The values in the brackets give the class label and the number of profiles in the sub-class.

Fig. 11. Applying the K-means clustering algorithm and HMMs iteratively to profiles in class D. The values in the brackets give the class label and the number of profiles in the sub-class.

Fig. 12. Applying the K-means clustering algorithm and HMMs iteratively to profiles in class E. The values in the brackets give the class label and the number of profiles in the sub-class.

Each of the 71 classes is represented by an HMM. The HMM output is modeled by a Gaussian mixture, which is formed by combining a number of Gaussian functions. We find that in our case the HMM gives us a model which requires only one Gaussian function at the output. Figure 14 shows the magnitude of the mean and the corresponding variance of the output of each HMM. We notice that the mean values are distinct from class to class. For example, the HMMs representing profiles in class A generally produced the smallest mean values. In fact, after sorting the mean values, the subclasses are still segmented into non-overlapping sections of the parent classes A to F.

Fig. 13. Benefits paid for each of the 71 classes, grouped by the parent classes A to F. The horizontal axis lists the classes and class labels; the vertical axis shows the benefits paid. A zoom into the first 23 classes is also shown.

From this, we find that a main contributing factor leading to the separation of profiles is the mean value as obtained by the HMMs. Another observation which can be made from Figure 13 and Figure 14 is that profiles which display large amounts of benefits paid generally generate a large mean and variance. This indicates that large benefit payments correspond to patients requiring frequent medical services. This means that patients making isolated instances of large claims can be atypical, as such cases are typically not observed in the dataset.
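With one HMM per class, classifying a profile (seen or unseen) reduces to scoring it under all 71 models and choosing the best. A minimal sketch, with loglik as an assumed scoring helper (e.g., the log of P(y | M) obtained from the forward pass sketched in Section IV):

```python
import numpy as np

def classify_profile(y, models):
    """Assign a profile to the class whose HMM explains it best.
    `models` maps the 71 class labels to trained HMMs; `loglik` is a
    hypothetical helper returning log P(y | M) for one model."""
    labels = list(models)
    scores = [loglik(models[lab], y) for lab in labels]
    return labels[int(np.argmax(scores))]
```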

Fig. 14. The variance (top) and mean (bottom) values of the 71 HMMs, plotted against the class labels A to F.

A. Classification of the 36 profiles using the developed methodology

In this section we show that the refined HMMs can provide a better explanation of patient behavior. In Section III, we used 36 sample profiles to demonstrate the grouping of clusters using the K-means clustering algorithm (see Figure 2). In Section IV-A, the same 36 profiles were classified by the HMMs obtained from the first iteration at the top level (see Figure 4). It is interesting to see how these profiles are classified using the methodology developed here, i.e., by the final tree of HMMs. The results are shown in Figure 15. It is observed that the final hierarchical tree of HMMs provides a finer classification of the profiles. Take class C in Figure 4 for example: The first five profiles of class C in Figure 4 are classified into sub-class C1 shown in Figure 15. It is clear these patients had frequent visits to doctors within the specified 12 months; most of their fortnightly claims are under $300. The sixth and seventh profiles of class C in Figure 4 are classified into sub-class C12 in Figure 15, where each patient had one fortnightly claim reaching about $800.

The last three profiles of class C in Figure 4 are classified into sub-class C13 in Figure 15. All of them had a sudden change in their medical behavior which incurred about $800 in benefits. The overall classification of all data in the training set is illustrated by Table VI. Note that the sizes of most classes differ from those seen in Figures 8 to 12. This is due to the fact that the final classification is performed on the entire training set, as opposed to the training phase, during which classification is performed only on the set of training data associated with the top level class.

B. Classification of profiles from the test set

The algorithm has been demonstrated to be efficient in grouping temporally similar patient behaviors together. In this section we wish to investigate the generalization capabilities of the approach. More specifically, the algorithm was trained on a relatively small sub-set of patient profiles; it is important to find out how well data which were not used for training are classified. For this, we utilize 30,475 profiles, none of which has been used in the generation of the HMM models. These 30,475 patients are from the same age cohort and domain as the training patterns. We compare the classification results of the HMMs generated after the first training iteration with those of the HMMs obtained after the training procedure converged. We will refer to the former set of HMMs as the initial HMM set, and to the latter as the final HMM set. The general recognition results of the initial and the final HMM sets are given in Tables VII and VIII respectively. An obvious observation is that the initial HMM set provides us with an aggregate recognition, which means that some classes contain a large number of profiles, such as class D, where 7,824 profiles are grouped. A large group can imply an approximate clustering, since the number of classes is too restricted to allow a more efficient separation of profiles. This problem is not observed in Table VIII, where the algorithm has converged to a final set of HMMs. Overall, the classification of the test patterns closely reflects the results obtained on the training set, in that the class sizes are proportional to those observed in Table V and Table VI.


More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

Hardware Implementation of Probabilistic State Machine for Word Recognition

Hardware Implementation of Probabilistic State Machine for Word Recognition IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

More information

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images Małgorzata Charytanowicz, Jerzy Niewczas, Piotr A. Kowalski, Piotr Kulczycki, Szymon Łukasik, and Sławomir Żak Abstract Methods

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Automatic parameter regulation for a tracking system with an auto-critical function

Automatic parameter regulation for a tracking system with an auto-critical function Automatic parameter regulation for a tracking system with an auto-critical function Daniela Hall INRIA Rhône-Alpes, St. Ismier, France Email: Daniela.Hall@inrialpes.fr Abstract In this article we propose

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python. Vikram Kamath Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

More information

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Advanced Signal Processing and Digital Noise Reduction

Advanced Signal Processing and Digital Noise Reduction Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Time series clustering and the analysis of film style

Time series clustering and the analysis of film style Time series clustering and the analysis of film style Nick Redfern Introduction Time series clustering provides a simple solution to the problem of searching a database containing time series data such

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n Principles of Data Mining Pham Tho Hoan hoanpt@hnue.edu.vn References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Norbert Schuff Professor of Radiology VA Medical Center and UCSF Norbert.schuff@ucsf.edu

Norbert Schuff Professor of Radiology VA Medical Center and UCSF Norbert.schuff@ucsf.edu Norbert Schuff Professor of Radiology Medical Center and UCSF Norbert.schuff@ucsf.edu Medical Imaging Informatics 2012, N.Schuff Course # 170.03 Slide 1/67 Overview Definitions Role of Segmentation Segmentation

More information

Clustering through Decision Tree Construction in Geology

Clustering through Decision Tree Construction in Geology Nonlinear Analysis: Modelling and Control, 2001, v. 6, No. 2, 29-41 Clustering through Decision Tree Construction in Geology Received: 22.10.2001 Accepted: 31.10.2001 A. Juozapavičius, V. Rapševičius Faculty

More information

Time series Forecasting using Holt-Winters Exponential Smoothing

Time series Forecasting using Holt-Winters Exponential Smoothing Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract

More information

A hidden Markov model for criminal behaviour classification

A hidden Markov model for criminal behaviour classification RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

More information

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES * CLUSTERING LARGE DATA SETS WITH MIED NUMERIC AND CATEGORICAL VALUES * ZHEUE HUANG CSIRO Mathematical and Information Sciences GPO Box Canberra ACT, AUSTRALIA huang@cmis.csiro.au Efficient partitioning

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Coding and decoding with convolutional codes. The Viterbi Algor

Coding and decoding with convolutional codes. The Viterbi Algor Coding and decoding with convolutional codes. The Viterbi Algorithm. 8 Block codes: main ideas Principles st point of view: infinite length block code nd point of view: convolutions Some examples Repetition

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Stock Trading by Modelling Price Trend with Dynamic Bayesian Networks

Stock Trading by Modelling Price Trend with Dynamic Bayesian Networks Stock Trading by Modelling Price Trend with Dynamic Bayesian Networks Jangmin O 1,JaeWonLee 2, Sung-Bae Park 1, and Byoung-Tak Zhang 1 1 School of Computer Science and Engineering, Seoul National University

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Visualization methods for patent data

Visualization methods for patent data Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information