Leveraging Hierarchy in Medical Codes for Predictive Modeling

Transcription

1 Leveraging Hierarchy in Medical Codes for Predictive Modeling Anima Singh Massachusetts Institute of Technology 32 Vassar Street Massachusetts, USA Girish Nadkarni Mount Sinai Hospital New York, NY Erwin Bottinger Mount Sinai Hospital New York, NY John Guttag Massachusetts Institute of Technology 32 Vassar Street Massachusetts, USA ABSTRACT ICD-9 codes are among the most important patient information recorded in electronic health records. They have been shown to be useful for predictive modeling of different adverse outcomes in patients, including diabetes and heart failure. An important characteristic of ICD-9 codes is the hierarchical relationships among different codes. Nevertheless, the most common feature representation used to incorporate ICD-9 codes in predictive models disregards the structural relationships. In this paper, we explore different methods to leverage the hierarchical structure in ICD-9 codes with the goal of improving performance of predictive models. We compare methods that leverage hierarchy by 1) incorporating the information during feature construction, 2) using a learning algorithm that addresses the structure in the ICD-9 codes when building a model, or 3) doing both. We propose and evaluate a novel feature engineering approach to leverage hierarchy, while simultaneously reducing feature dimensionality. Our experiments indicate that significant improvement in predictive performance can be achieved by properly exploiting ICD-9 hierarchy. Using two clinical tasks: predicting chronic kidney disease progression (Task-CKD), and predicting incident heart failure (Task-HF), we show that methods that use hierarchy outperform the conventional approach in F-score (0.44 vs 0.36 for Task-HF and 0.40 vs 0.37 for Task- CKD) and relative risk (4.6 vs 3.3 for Task-HF and 5.9 vs 3.8 for Task-CKD). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. BCB 14, September 20 23, 2014, Newport Beach, CA, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM /14/09...$ Categories and Subject Descriptors I.2.1 [Artificial Intelligence]: Applications and Expert Systems Medicine and Science; G.3 [Probability and Statistics]: [Multivariate statistics] General Terms Machine Learning in Healthcare and Medicine Keywords predictive modeling, ICD-9 codes, feature hierarchy 1. INTRODUCTION Electronic health records (EHRs) contain information that can be used to make clinically useful predictions about the future trajectory of a patient s health. Important parts of this information are recorded using ICD-9 codes. ICD-9 is the most widely used classification system for describing diagnoses and procedures [1]. It consists of a collection of trees which represent relationships between the ICD-9 codes. A node at a higher level of a tree represents a more general concept than the nodes at a lower level. Figure 1 shows an excerpt from one of the trees that form the ICD-9 classification system. The most commonly used representation for ICD-9 codes in predictive models ignores the hierarchical structure and maps each individual code to a separate binary feature to indicate presence or absence of a diagnosis or procedure [17]. We will refer to this approach as Raw Binary. In this work, we demonstrate that the ICD-9 hierarchy is a source of potentially useful information that can improve the performance of predictive models. The fundamental assumption in leveraging the ICD-9 hierarchy is that neighboring nodes share similar relationship with the outcome of interest. Under this assumption, Top-level Binary, a less commonly used approach aggregates all the codes within a tree to a single binary feature. A rule of thumb is to use Raw Binary when the number of samples is much greater than the feature dimensionality, since finer diagnostic levels can lead to better discrimination. On the other hand, if feature dimensionality is large relative to the number of samples, Top-level Binary is preferred to avoid overfitting.

2 The issue of overfitting is particularly important when learning predictive models for medical applications. Medical datasets are high-dimensional and have a small sample size. Since predictive models are developed for a subpopulation of patients with specific attributes (e.g. over 50 years old) and specific clinical conditions (e.g. with chronic kidney disease), the number of patients of interest is usually significantly smaller than the patient population in an EHR. These characteristics present a challenge in learning generalizable predictive models for medical applications. While Top-level Binary aims to reduce overfitting by reducing feature dimensionality, it fails to take advantage of the specificity in diagnoses and procedures that can enhance performance. With the advent of ICD-10, which has more than 69,000 distinct codes to provide greater details and specificity [3], the challenge of high dimensionality will get worse. Learning supervised predictive models involves two components: 1) feature construction and 2) model building. One can leverage the structure in the data by incorporating the information during feature construction, or by using a learning algorithm that addresses the structure in the data when building a model, or doing both. In this paper, we explore these different approaches to leveraging hierarchy in ICD-9 codes. We evaluate the methods by developing predictive models using EHR data from a large hospital in New York for two real-world clinical applications: 1) predicting whether a patient with chronic kidney disease will decline quickly or slowly (Task-CKD), and 2) predicting whether a patient at high risk of heart failure will be admitted to the hospital within a year with congestive heart failure (Task-HF). We formulate each task as a binary classification problem. We make the following contributions in this paper: We propose a novel feature engineering approach to represent hierarchical categorical variables such as ICD- 9 codes. Our approach preserves predictive information and reduces feature dimensionality to prevent overfitting. We compare different methods to leverage hierarchy. Our results show that significant improvement in predictive performance can be achieved by properly exploiting hierarchy. More specifically, we show that methods that use hierarchical information outperform the conventional approach in F-score (0.44 vs 0.36 for Task-HF and 0.40 vs 0.37 for Task-CKD) and relative risk (4.6 vs 3.3 for Task-HF and 5.9 vs 3.8 for Task- CKD). Our results also demonstrate the importance of coupling an appropriate feature representation with an appropriate learning algorithm to maximize the performance boost by leveraging the ICD-9 hierarchy. The rest of the paper is organized as follows: Section 2 presents related work. In Section 3, we discuss the ICD-9 hierarchy. In Section 4, we provide notations and definitions used in the paper. Section 5 presents the different methods. In Section 6, we present the two clinical tasks. Section 7 presents the experimental setup along with the results showing the performance of different methods. Section 8 presents discussion of results. Finally, Section 9 provides summary and conclusions. 2. RELATED WORK ICD-9 codes have been shown to be useful for identifying phenotypes in phenome-wide association studies [2, 4, 14] and for predictive modeling of adverse outcomes in patients. Some of the work in predictive modeling that use ICD-9 codes include: detection of epidemics [19], early detection of diabetes [7] and heart failure [17], and risk stratification of hospital-associated infections [20]. These works use either Raw Binary or Top-level Binary representation for ICD-9 codes. Recently, there has been interest in methods to leverage ICD-9 hierarchy. Perotte et al. [12] use a hierarchy-based SVM that exploits hierarchy in the codes for automated ICD-9 code assignment to discharge summary notes. Their task is to use features derived from the discharge summary notes to generate appropriate ICD-9 codes. In contrast, the focus of this paper is to leverage hierarchy when using ICD-9 codes as features for predictive modeling. Most of the research on leveraging structural information in predictors has been on developing learning algorithms. Supervised group lasso [23, 15] and overlapping group lasso [21] are the most widely used learning methods that take into account group structure. More specifically, they use the group information in features to enforce regularization constraints. These methods have been used to develop predictive models using gene expression data as features. These learning methods use the cluster structure in genes, where a cluster consist of genes with co-ordinated functions [10, 21]. Overlapping group methods can also be used for variables with hierarchical structures [24, 6]. To the best of our knowledge, these methods have never been used in the context of leveraging hierarchy in ICD-9 codes. 3. ICD-9 HIERARCHY The ICD-9 codes are organized in a hierarchy where an edge represents an is-a relationship between a parent and its children. Hence, the codes become more specific as we go down the hierarchy. When leveraging the ICD-9 hierarchy for learning predictive models, we assume that the sister nodes have a correlated relationship with the outcome of interest. There are about 17,000 diagnostic and 1200 procedure codes in the ICD-9 hierarchy. Each three digit diagnostic code and each two digit procedure code is associated with a hierarchy tree. In this paper, we refer to it as a top-level diagnostic or procedure code. Figure 1 shows an excerpt from the hierarchy within the top-level diagnostic code 428 that represents heart failure. 4. DEFINITIONS In this section, we provide notations and definitions used in the paper. n represents a node in a hierarchy tree. corresponds to an ICD-9 code. Each node n represents the parent of node n in the hierarchy. T n represents the subtree of the hierarchy rooted at node n. N n represents the set of nodes within T n. t represents a node corresponding to a top-level code. Since t is a top-level code, t = null. T t represents the

3 428.0 Conges5ve heart failure Le; heart failure 428 Heart Failure Systolic heart failure. distinct values of the variable into a single variable. Propagated Binary uses structural relationships between each distinct value within the hierarchy. Lastly, we present our proposed Probability-based features, a novel feature representation that captures hierarchy. Top-level Binary Unspecified Acute systolic heart failure Chronic systolic heart failure. Figure 1: An excerpt from the hierarchy in diagnostic code 428 that represents heart failure. tree corresponding to the top-level code t. Figure 1 shows the hierarchy T 428 for t = 428. D = {(x i, y i) x i R f, y i { 1, 1}} N i=1 represents the dataset, where f = dimension of the feature space and N = number of examples. y i = 1 if the patient corresponding to example i suffers a positive (adverse) outcome, and y i = 1 otherwise. 5. METHODS In this section, we describe feature representations and model learning algorithms that disregard hierarchy, followed by those that leverage hierarchy. 5.1 Ignoring Hierarchy Feature Representation: Raw Binary A conventional way to use ICD-9 codes (and other categorical variables) is to treat each node n as a separate binary feature. For an example i, we set the value of the feature corresponding to node n to 1 if the example is assigned the ICD-9 code represented by the node. Otherwise, the value is set to 0. This feature representation ignores any information about the hierarchical relationships between the diagnoses. It also suffers from high dimensionality since we use one feature for every distinct value of the variable, and is therefore susceptible to overfitting Model Learning Algorithm: Logistic Regression with L2 norm A commonly used learning algorithm for classification tasks is an L2-regularized logistic regression model that solves the following optimization problem: min w N log(1 + exp( y i(w T x i + c)))) + λ w 2 2 (1) i=1 While the L2 regularization helps prevent overfitting, it does not use any structural relationships between features. We refer to this algorithm as L Leveraging Hierarchy Feature Representation We present three different feature representations that use hierarchical information in the ICD-9 codes. First, we describe Top-level Binary that uses hierarchy to cluster all Top-level Binary, another commonly used representation, uses a separate binary feature to represent hierarchy T t for each t. Given an example i, we set the value corresponding to t to 1 if the example is assigned any node in N t. If none of the codes in N t are assigned to the example, the feature value is set to 0. E.g. regardless of whether an example is given a diagnosis of Malignant essential hypertension or Benign essential hypertension, Toplevel Binary sets the feature corresponding to node 401 to 1. As a result, Top-level Binary has a low dimensionality. Propagated Binary This representation uses a separate binary feature for each node n. If example i is assigned a code represented by node n, the feature values corresponding to the node and all of its ancestors are set to 1. For example, if an example is given a diagnosis of , the feature values corresponding to nodes , and 250 are all set to 1. Propagated Binary uses hierarchical information by setting all the ancestor nodes of an assigned diagnostic code to 1. It does not reduce feature dimensionality. Probability-based Features The key idea of our approach is to 1) map each ICD-9 code to a conditional probability of outcome using the training data and 2) then use the probabilities to construct features. More specifically, given a top-level code t, we transform each distinct code n within T t using a function g : N t [0, 1]. The function g(n) = P (y = 1 T n) maps node n to the conditional probability of a positive outcome given T n. Our approach leverages hierarchy when estimating P (y = 1 T n) in two ways: 1. In addition to considering the examples that are assigned the code corresponding to n, we also consider examples that are assigned any of the descendant nodes. E.g, for n = 428.2, instead of computing P (y = ), we compute P (y = 1 T ). 2. We consider the probabilities of the more data-rich ancestor nodes for robust probability estimation for nodes lower in the tree using shrinkage, as we describe later. Training Data Training Labels Hierarchy info Compute Probabili:es Step 1 β Probability Es-mates Data point x Feature Construc:on Step 2 Feature vector Figure 2: Block diagram of our proposed approach for feature construction. β is the shrinkage parameter used for probability estimation. We use a two-step process to generate our feature representation (Figure 2).

4 Step 1: Probability estimation To estimate P (y = 1 T n), we first compute the maximum likelihood (ML) estimate ˆP (y = 1 T n) using the training data. ˆP (y = 1 T n) = θ(y = 1, Tn) θ(t n) where, θ(t n) is the number of training examples that are assigned any codes in N n, and θ(y = 1, T n) is the number of training examples with a positive outcome that are assigned any codes in N n. Next, we adjust the estimated probabilities by shrinking the ML probability estimates towards the estimate of the parent node. P (y = 1 T n) = θ(tn) ˆP (y = 1 T n) + α P (y = 1 T n ) θ(t n) + α (3) where α = β θ(t t) and β is the shrinkage parameter. Since a top-level node t does not have a parent node, we set P (y = 1 T t) = ˆP (y = 1 T t). The shrinkage parameter β determines the weighting factor for the ML estimate of T n and the shrunk estimate of T n to obtain P (y = 1 T n). If β is 0, the shrunk estimate equals the ML estimate. The value of β is optimized in the training stage along with the parameters of the classifier (Section 7). Once the probabilities are estimated using β, we use them to construct a feature vector for each example. The computational cost of this step is O(N t Nt ) ) where N = number of training examples and N t = the number of nodes within the top-level code t. Step 2: Feature Construction In this representation, we use a separate feature to represent the hierarchy T t for each t. For an example i and a top-level code t, we first consider nodes in N t that are assigned to the example. Let Nt i represent this set. An example could be assigned more than one node in N t. E.g., within the hierarchy of top-level node 428, an example might be given a diagnosis of Left heart failure and Chronic systolic heart failure. Next, we set the feature value of t to max({ P (y = 1 T n) n Nt i}). The intuition behind choosing the maximum is to consider the node that puts the patient at highest risk of the outcome (In our preliminary investigations, we also considered other aggregate functions such as sum and average. Taking the maximum of the probabilities yielded the best performance). If example i is not assigned any nodes in N t, we set the feature value of t to ˆP (y = 1 not T t), the probability of a positive outcome given that an example is not assigned any nodes in N t. (2) The computational cost of constructing a probabilitybased feature representation for an example x i from the Raw Binary representation is O( t Nt ). We refer to this feature representation as Probabilitybased. Since this representation uses one feature per toplevel ICD-9 code, it has the same dimensionality as Toplevel Binary Model Learning Algorithm Logistic Regression with Overlapping Group Lasso Given a training data set D, the logistic overlapping group lasso [24] solves the following optimization problem: min w N k log(1 + exp( y i(w T x i + c)))) + λ w Gj 2 (4) i=1 j=1 where w R f is divided into k overlapping groups w G1, w G2,..., w Gk. The overlapping group lasso can encode the hierarchical relationship between the ICD-9 codes by defining groups with particular overlapping group patterns. Given a node n, representing an ICD-9 code, a group G is a set containing n and its descendants. This overlapping group pattern leads to hierarchical selection of variables via regularization [24, 22]. In other words, an ICD-9 code is selected in a model only if its parent node is selected. We refer to this algorithm as OvGroup. 6. CLINICAL TASKS We evaluated the approaches described above on two clinical applications 1) predicting whether a patient with chronic kidney disease (CKD) will decline quickly or slowly and 2) predicting whether a patient at high risk of heart failure will be admitted to the hospital within a year with congestive heart failure (HF). We represent the the outcome of each task as a binary variable where a positive label corresponds to an adverse outcome. For each task, we use all the diagnoses and procedures recorded as ICD-9 codes along with the information about a patient s vitals (e.g. blood pressure) and labs tests (e.g. blood sugar level). Table 1 lists the features used. 6.1 Predicting CKD progression Chronic kidney disease is an increasingly prevalent condition associated with a high degree of mortality and morbidity. Clinical decision making for CKD is challenging in part because of the variability in rates of disease progression among patients [18]. Severity of CKD progression is measured using estimated glomerular filtration rate (egfr) computed from serum creatinine level. Lower values of egfr correspond to increased severity. We consider the task of predicting whether a patient in the early stages of CKD will suffer a sharp drop in their egfr value in the next two years. Specifically, we assign a positive label for an example if egf R 15 units, where egf R is defined as the difference between the average egfr in the current 1 year window and the average egfr in the next two years. The threshold of 15 units was derived by computing the top quintile of egf R in the training data. In the rest of the paper, we will refer to this task as Task-CKD. 6.2 Predicting incident HF Incident heart failure is defined as the first occurrence of hospitalization that includes a diagnosis of congestive heart

5 failure [9]. Approximately 60% of men and 45% of women die within 5 years of incident HF [8]. Accurate prediction of incident HF can allow clinicians to explore therapy options that can reduce complications associated with HF [5]. In this paper, we consider patients who will eventually be hospitalized with HF. This allows us to focus on the challenging task of predicting when a patient will be admitted. In particular, we investigate the task of predicting whether a patient will be admitted with incident HF within a year. In the rest of the paper, we will refer to this task as Task-HF. Since we use data from patients who eventually is hospitalized with HF, the learned model is not applicable to a general population. However, the data is still useful to illustrate the potential advantage of leveraging structural information in the ICD-9 codes. Table 1: Predictors used for each clinical task and the number of binary features associated with the given set of predictors. The numeric features are discretized into four bins based on the quartiles of the corresponding predictor and then mapped into binary features. For vital signs and lab values, the mean, median, range, min, max and standard deviation of each of the predictors are computed and then binarized. For egfr, the recent slope in egfr measurements was also computed. Predictors Task-CKD Task-HF ICD-9 diagnosis codes ICD-9 procedure codes Age 4 4 Systolic blood pressure Diastolic blood pressure HbA1c Low density lipo-protein - 24 egfr EXPERIMENTS AND RESULTS This section describes the experiments used to evaluate different combinations of the feature representation and model learning methods described in Section 5. Table 2 shows the various combinations we evaluate in this paper. Top-level Binary and Probability-based features group all distinct nodes within a hierarchy tree into a single feature. Since the resulting representation does not have any group structure to exploit, we only consider L2 for model building. Table 2: Definitions of different combinations of feature representation and model learning algorithms. Hierarchy is leveraged in feature construction, model building, or both. RawL2 does not use hierarchy. Feature Representation Learning algorithm L2 OvGroup Raw Binary RawL2 RawOvGroup Toplevel Binary TopL2 - Propagated Binary PbL2 PbOvGroup Probability-based ProbL2-7.1 Experimental Setup We use data from de-identified electronic health records of patients treated at Mount Sinai Hospital in New York City. We divide the patients into training and test patients with an 80/20 split. For each patient, we use a 1 year sliding window approach to generate multiple examples (Figure 3). Table 3 shows the summary of the data used for each task. Pa#ent p 1 year 1 year 1 year Example 1 Example 2 Example 3... Time Figure 3: A schematic summarizing how multiple examples are generated per patient. Each rectangle represents a one year window (an example). For Task-CKD we only consider examples that have 1) at least two egfr measurements within the example time period and 2) at least two egfr measurements each year for two years in the future. Since patients egfr measurements are not measured at regular intervals, we are left with 974 examples that satisfy the criteria. As a result, Task-CKD has fewer examples than Task-HF despite having more patients. Table 3: The summary of data used for Task-CKD and Task- HF. The value in the parentheses is the number of positive examples. Task-CKD Task-HF No. of patients No. of Examples 974 (195) 1649 (326) No. of Training Examples 780 (156) 1320 (261) No. of Test Examples 194 (39) 329 (65) We evaluated the performance of each of the methods in Table 2 for Task-HF and Task-CKD. For each task, we generate 100 different training and test splits. For each split the training set is used for parameter selection and model training. All the methods, besides ProbL2, have one parameter, i.e. the regularization parameter λ. ProbL2 has an additional parameter β, the shrinkage parameter. Parameter selection is carried out using 5-fold cross-validation. The test set is used for evaluation only. 7.2 Performance Evaluation To evaluate the ability of the different methods to discriminate between examples with a positive outcome and those without a positive outcome, we compute F-score: the harmonic mean of recall and precision [13]. To compute the recall and precision, we consider the examples with predicted probability greater than the overall probability of the outcome in the training population as positive. AUROC: a rank order statistic based on predicted probability of outcome [16] We evaluate the performance of all other methods with respect to RawL2, since it does not use any structural information in the ICD-9 codes. Figure 4 and Table 4 show the results. Our results show that overall the methods that take into account the hierarchical structure boost the performance of the predictive model for both Task-HF and Task-CKD,

6 Table 4: Average performance of the different methods for Task-CKD and Task-HF. The value within the parentheses is the standard deviation. The * indicates that the average is significantly (p < 0.05) different from the average performance of RawL2 as measured using a paired t-test. The methods that perform the best are highlighted in gray. Task-CKD Approach # of features F-score AUROC RawL2 TopL2 PbL2 ProbL2 RawOvGroup PbOvGroup (a) Task CKD RawL (0.05) 0.66 (0.04) TopL (0.04) 0.66 (0.03) PbL * (0.06) 0.64 (0.04) ProbL (0.05) 0.67 (0.03) RawOvGroup (0.05) 0.66 (0.03) PbOvGroup * (0.06) 0.69* (0.03) Task-HF Approach # of features F-score AUROC RawL (0.04) 0.65 (0.03) TopL * (0.04) 0.68* (0.03) PbL * (0.05) 0.69* (0.04) ProbL * (0.03) 0.71* (0.03) RawOvGroup * (0.05) 0.67 (0.04) PbOvGroup * (0.04) 0.70* (0.03) although the amount of performance improvement varies between the different methods. For Task-HF, ProbL2 and PbOvGroup perform the best, significantly outperforming RawL2. For Task-CKD, PbOvGroup yields the highest improvement over RawL2. The performance improvement in F-score and AUROC obtained by leveraging hierarchy has the potential to have a clinical impact. Our results show that PbOvGroup can identify 60% of the examples where the patients suffer an adverse outcome compared to 38% and 46% by RawL2 for Task-HF and Task-CKD respectively. Moreover, PbOvGroup achieves this boost in recall without sacrificing precision. It is unclear from our results if one method has better performance than the other. However, the advantage of exploiting the ICD-9 hierarchy for the two tasks is clear. To further convey the ability of the methods to risk stratify, we divide the test examples into quintiles (often done in clinical studies) based on the predicted probability of outcome. The examples in the 1 st quintile (Q1) are predicted to be at lowest risk and the ones in the 5 th quintile (Q5) are predicted to be at highest risk of adverse outcome. The relative risk of outcome in Q5 versus Q1 measures how well we can discriminate low risk patients from high risk patients. Each of ProbL2 and PbOvGroup has higher relative risk compared to RawL2 for both Task-CKD and Task-HF. Our results also demonstrate the importance of coupling an appropriate feature representation with an appropriate learning algorithm. RawOvGroup and PbOvGroup use the same learning algorithm but different feature representations. On the other hand, PbL2 and PbOvGroup use different learning algorithms but the same feature representation. PbOvGroup outperforms RawOvGroup and PbL2 for both Task-CKD and Task-HF showing that leveraging ICD-9 hierarchy in both feature representation and learning algorithm yields the best results. RawL2 TopL2 PbL2 ProbL2 RawOvGroup PbOvGroup Relative Risk of outcome in Q5 vs Q1 (b) Task HF Relative Risk of outcome in Q5 vs Q1 Figure 4: Relative risk of outcome in examples in the 5th quintile vs the 1st quintile. 7.3 Feature Analysis To better understand why PbOvGroup and ProbL2 outperform RawL2, we analyze the weights of the features assigned by the model. Roughly speaking, the larger the magnitude of the weight, the more important the feature is for the task at hand. For brevity, we show the results for Task-HF only Feature Ranking For each method we rank each top-level ICD-9 code based on the magnitude of the average weight assigned by logistic regression across 100 splits. For RawL2 and PbOvGroup, we associate a top-level code with the highest rank assigned to any of the codes within its hierarchy. Table 5: Rank assigned by RawL2 to the highest ranking top-level ICD-9 codes for ProbL2 and PbOvGroup. Highest ranking top-level ICD-9 codes Rank in RawL2 564 Functional digestive disorders Symptoms involving skin 5 and other integumentary tissue 715 Osteoarthrosis and allied disorders Other and unspecified disorders 25 of joint 786 Symptoms involving respiratory 39 system and other chest symptoms Table 5 shows the intersection of the 10 highest ranking top-level codes shared by ProbL2 and PbOvGroup, along with their rank in RawL2. The ICD-9 codes 719 and 786 that are ranked within the highest 10 top-level codes using both ProbL2 and PbOvGroup, are assigned ranks of 25 and 39 by RawL2. That 786 received a low rank in RawL2 is particularly telling. Chest pains and respiratory problems are important clinical indications of heart failure that have many different ICD-9 codes associated with them. By ignoring hierarchy,

7 RawL2 was unable to identify this important constellation of risk factors as a top predictor. This could be one of the reasons for its underperformance relative to methods that use hierarchy Feature Weights Next, we analyze the weights assigned by different models to ICD-9 codes within a hierarchy. Figure 5 (a) and (b) show the weights assigned by RawL2 and PbOvGroup to the features associated with the ICD-9 code 250 Diabetes (We compare RawL2 and PbOvGroup since both representations have one feature per node within the hierarchy). Figure 5 (c) shows the number of examples that are assigned the raw ICD-9 code. Consider the ICD-9 codes 250 and The number of examples that are assigned the raw ICD-9 codes 250 and are small. However, the most frequently assigned code is a subcategory of both 250 and RawL2 is unable to leverage this information. Therefore, while it assigns a high positive weight to , it assigns a low negative and a low positive weight to 250 and respectively. In contrast, PbOvGroup is able to leverage the hierarchy and consistently assign high positive weights to all three ICD-9 codes. 8. DISCUSSION Our results demonstrate that methods that properly leverage ICD-9 hierarchy can significantly improve performance over a method (RawL2) that does not use hierarchy. This is true especially for datasets with sample size significantly lower than the feature dimensionality, and if the sister nodes in the ICD-9 hierarchy have a correlated relationship with the outcome of interest. The underperformance of RawL2 can be attributed to 1) overfitting due to high feature dimensionality and 2) its inability to leverage relationships between ICD-9 codes within a hierarchy. TopL2, which aggregates all the codes within a hierarchy tree to a single binary feature, yields some performance boost over RawL2 by using associations between codes within a hierarchy tree using a single binary feature to reduce feature dimensionality. However, as discussed earlier, TopL2 does not take advantage of the specificity in lower level ICD- 9 codes. We use partial correlation to quantify the additional predictive information in the lower level codes in the ICD-9 hierarchy. Partial correlation measures the correlation of two variables after removing the effect of other variables [11]. Using the Propagated Binary features, we compute the partial correlation between each code and the outcome, controlling for the parent code. Table 6 shows the average and the maximum absolute partial correlation over all nodes with partial correlation significantly (p < 0.05) greater than zero. ProbL2 and PbOvGroup are able to exploit this residual information to yield performance boost over TopL2. 9. SUMMARY AND CONCLUSIONS In this paper, we examined different ways to leverage the hierarchy in ICD-9 codes to improve predictive performance. We compare methods that leverage hierarchy by 1) incorporating the structural information during feature construction, 2) using a learning algorithm that consider the struc- Table 6: The average and the maximum absolute partial correlation between an ICD-9 code and the outcome, given its parent ICD-9 code over all nodes with partial correlation significantly greater than zero. Average Maximum Task-CKD Task-HF ture in features when learning a model, and 3) doing both. We present a novel probability-based feature construction approach that is simple and effective in exploiting the structural relationships between the ICD-9 codes. Our experiments, performed on two clinical tasks, show that we can achieve significant improvement in predictive performance by properly exploiting the ICD-9 hierarchy when learning models from a high dimensional dataset. 10. ACKNOWLEDGMENTS We would like to thank Quanta Computers Inc. for its financial support. We would also like to thank Stephen B. Ellis and Rajiv Nadukuru for their assistance with data access and data preparation. We are also grateful to Omri Gottesman for his guidance with hypothesis generation and study design. 11. REFERENCES [1] Centers for Disease Control and Prevention (CDC). International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) [2] A phewas approach in studying hla-drb1*1501. Genes and Immunity, 14, [3] CMS ICD-10-CM. Centers for Medicare and Medicaid Services, [4] J. C. Denny, M. D. Ritchie, M. A. Basford, J. M. Pulley, L. Bastarache, K. Brown-Gentry, D. Wang, D. R. Masys, D. M. Roden, and D. C. Crawford. Phewas: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics, 26(9): , [5] J. E. Ho, C. Liu, and A. L. et al. Galectin-3, a marker of cardiac fibrosis, predicts incident heart failure in the community. Journal of the American College of Cardiology, 60(14): , [6] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res., 12: , July [7] R. Krishnan, N. Razavian, and S. N. et al. Early detection of diabetes from health claims. NIPS workshop in Machine Learning for Clinical Data Analysis and Healthcare, [8] D. Lloyd-Jones, R. J. Adams, and T. M. B. et al. Heart disease and stroke statistics 2010 update: A report from the american heart association. Circulation, 121(7):e46 e215, [9] L. R. Loehr and W. D. R. et al. Association of multiple anthropometrics of overweight and obesity with incident heart failure: The atherosclerosis risk in

8 0.15 (a) Normalized Feature Weights (b) Normalized Feature Weights Number of Samples ICD 9 Codes (c) Number of Samples ICD 9 Codes ICD 9 Codes Figure 5: (a) Box plots derived from the 100 splits for weights assigned by RawL2 to features associated with ICD-9 code 250 representing diabetes for Task-HF. The circle indicates the median, while the top and the bottom of the box corresponds to 75 th and 25 th percentile respectively. (b) Box plots derived from the 100 splits assigned by PbOvGroup for Task-HF. The ICD-9 codes are arranged according depth-first traversal of the hierarchy tree. (c) The number of samples that are assigned the given raw diagnostic code. communities study. Circulation: Heart Failure, 2(1):18 24, [10] S. Ma, X. Song, and J. Huang. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics, 8(1):1 17, [11] R. Muirheard. Aspects of Multivariate Statistical Theory. Wiley, [12] A. Perotte, R. Pivovarov, and K. N. et al. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, [13] D. M. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2:37 63, [14] M. Ritchie, J. Denny, D. Crawford, and A. R. et al. Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. American Journal of Human Genetics, 86(4): , [15] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2): , [16] E. Steyerberg, A. Vickers, N. Cook, and et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epimemiology, [17] J. Sun, J. Hu, and D. L. et al. Combining knowledge and data driven insights for identifying risk factors using electronic health records. American Medical Informatics Association, [18] N. Tangri, L. Stevens, J. Griffith, and et al. A predictive model for progression of chronic kidney disease to kidney failure. JAMA, 305(15): , [19] F. Tsui, M. Wagner, V. Dato, and C. Chang. Value of ICD-9-coded chief complaints for detection of epidemics. Journal of the American Medical Informatics Association, [20] J. Wiens and J. Guttag. Patient risk stratification for hospital-associated c. diff as a time-series classification task. Neural Information Processing Systems (NIPS), [21] L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. Neural Information Processing Systems (NIPS), [22] L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9): , Sept [23] M. Yuan, M. Yuan, Y. Lin, and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49 67, [24] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 2009.