Patient Risk Prediction Model via Top-k Stability Selection




Patient Risk Prediction Model via Top-k Stability Selection

Jiayu Zhou, Jimeng Sun, Yashu Liu, Jianying Hu, Jieping Ye
Computer Science and Engineering, Arizona State University; IBM T.J. Watson Research Lab

Abstract

The patient risk prediction model aims at assessing the risk of a patient in developing a target disease based on his/her health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed in order to characterize patient profiles. This wealth of data provides unprecedented opportunities for data mining researchers to address important biomedical questions. Practical data mining challenges include: How to correctly select and rank those features based on their prediction power? What predictive model performs the best in predicting a target disease using those features? In this paper, we propose top-k stability selection, which generalizes a powerful sparse learning method for feature selection by overcoming its limitation on parameter selection. In particular, our proposed top-k stability selection includes the original stability selection method as a special case given k = 1. Moreover, we show that top-k stability selection is more robust by utilizing more information from the selection probabilities than the original stability selection, and provides stronger theoretical properties. In a large set of real clinical prediction datasets, the top-k stability selection methods outperform many existing feature selection methods, including the original stability selection. We also compare three competitive classification methods (SVM, logistic regression and random forest) to demonstrate the effectiveness of the features selected by our proposed method in the context of clinical prediction applications. Finally, through several clinical applications on predicting heart failure related symptoms, we show that top-k stability selection can successfully identify important features that are clinically meaningful.

1 Introduction

Electronic health records (EHRs) capture longitudinal patient information in digital form from diverse sources: demographics, diagnoses, medications, lab results and physician notes. The adoption of EHRs has increased dramatically in recent years in response to the government's EHR incentive program. Many healthcare institutions and practitioners are looking for secondary uses of EHR data, among which clinical decision support and care management systems are the most prominent applications. As more EHR data become available, a large number of features can be constructed that may be potentially useful for risk prediction, which has important applications in both clinical decision support and care management systems. The challenge of obtaining high-quality features is how to rank all available features based on their predictive power for a specific condition. For that, appropriate feature selection is required. The need for feature selection is further amplified by the challenge of obtaining high-quality training data. The form of training data depends on the specific tasks and the source data quality. Because of the highly noisy nature of EHR data, clinical experts often have to be involved in the annotation process in order to obtain reliably labeled training data. As a result, in many cases only limited labeled data can be obtained. Because of the small sample size, over-fitting becomes one of the biggest concerns for predictive modeling in the healthcare domain.
The key to avoiding over-fitting is to construct robust and parsimonious models, where recent developments in sparse learning [9, 11, 17, 20, 25] can offer many insights. Another reason why feature selection is particularly important in building risk prediction models for healthcare is the need for interpretability and actionability. For both clinical decision support and care management support, it is often not sufficient to simply have a model that produces a risk score, even if it is highly accurate. The decision makers need to understand the leading risk factors so that proper interventions can be taken to address the risk. Parsimonious models that achieve high accuracy with a relatively small number of succinct features clearly have the advantage of offering better interpretability. This again points to feature selection as the centerpiece of the discussion. We focus on addressing this central issue of feature selection for risk prediction in this paper. In many embedded feature selection algorithms, there is a tuning parameter that determines the amount of regularization and, correspondingly, how many features are included in the model.

Examples are the regularization factors of sparsity-induced penalties, e.g., Lasso regression [25], sparse logistic regression [11, 14] and group Lasso [9, 17], and the iteration number of orthogonal matching pursuit [20]. The choice of such tuning parameters is data-dependent and therefore there is no general guidance on how the parameters should be chosen. Typically, a cross-validation process is used to estimate the tuning parameters from the training data. Due to over-fitting, the parameters estimated via cross-validation usually include too many irrelevant features in the model [18], especially in high-dimensional problems, and thus lead to sub-optimal performance. Moreover, chances are that not all relevant features are selected given a particular tuning parameter, and therefore the information about relevant features may be contained in the models given by more than one tuning parameter.

To address the aforementioned problems, we propose to introduce and extend a recently developed sparse learning method, stability selection [19], which is currently not widely used in the data mining community despite its success in bioinformatics, especially in genome-related biomarker selection problems where the sample size is much smaller than the feature dimension (n << p) [6, 22, 23, 26]. Stability selection, based on subsampling/bootstrapping, provides a general method to perform model selection using information from a set of regularization parameters. The stability ranking score gives a probability, which makes it naturally interpretable. However, the existing stability selection uses a rather arbitrary and limited maximum selection criterion, which may lead to suboptimal results. To tackle this limitation of stability selection, we propose a general top-k stability selection algorithm and establish stronger theoretical properties. The original selection method can be shown to be a special case of our proposed top-k stability selection given k = 1. We also show theoretically that the proposed top-k stability selection is more robust by utilizing more information from the selection probabilities when k > 1 than the original stability selection, and is less sensitive to the input parameters. Overall, the top-k stability selection method has the following advantages over the original stability selection:

- More stable ranking: the feature ranking is more robust against data perturbation and less sensitive to the input parameters.
- Improved theoretical guarantee: improved theoretical guarantees can be established that provide a stronger bound on the false positive error rate.

We design experiments to compare top-k stability selection given different k, and systematically compare our proposed method against six different feature selection methods in the context of clinical prediction. Our top-k stability selection method outperforms the traditional feature selection methods on all datasets. Moreover, the proposed top-k stability selection achieves better performance than the original stability selection on both synthetic and real datasets, which empirically confirms the superiority of this generalization. Finally, we compare three competitive classification methods using the features selected by top-k stability selection (SVM, random forest and logistic regression), and discuss several practical considerations in how to choose the classifier in the context of clinical prediction applications. Logistic regression gives the best prediction performance in terms of AUC, while random forest is shown to be less sensitive to its input parameter value.
The rest of the paper is organized as follows: Section 2 describes the motivating application and shows how to convert EHR data into features for risk prediction. Section 3 presents the idea of stability selection, and discusses the top-k stability selection extension and its theoretical properties. Section 4 presents a systematic evaluation of different methods for risk prediction. We conclude in Section 5.

2 Motivating Application

Risk prediction has important applications in both clinical decision support and care management systems. For risk prediction to provide actionable insight, it often requires building a predictive model (e.g., a classifier) for a specific disease condition. For example, based on the EHR data, one may want to assess the risk scores of a patient developing different disease conditions, such as diabetes complications [24] and heart failure [7, 27], so that a proper intervention and care plan can be designed accordingly. The common pipeline for building any predictive model involves the following steps. Feature construction prepares the input features and the target variable for a specific task; in this step, the labeled training data sets are constructed from EHR data. Feature selection ranks the input features and keeps only the most relevant ones for subsequent model building. Model building uses the selected features and the target variable to construct a predictive model; depending on the task, the model can be either classification or regression, and in this paper we focus on classification tasks with binary target variables. Model validation evaluates the models against a diverse set of performance metrics.

In the presence of an unbalanced class distribution, the Area Under the Curve (AUC) is typically employed. Feature selection, model building and validation are familiar concepts in the data mining field. Feature construction is often application dependent. Next we illustrate the common strategies for constructing features in clinical applications. EHR data document detailed patient encounters over time, often including diagnoses, medications, lab results and clinical notes. We systematically construct features from different data sources, recognizing that longitudinal data on even a single variable (e.g., blood pressure) can be represented in a variety of ways. The objective of our feature construction effort is to capture sufficient clinical nuance for the subsequent risk prediction task. We model features as longitudinal sequences of observable measures such as demographics, diagnoses, medications, labs, vitals, and symptoms. As clinical events are recorded over time, for a given time window (the observation window), we summarize those events into feature variables. As shown in Figure 1, at the index date we want to construct features for a specific patient. For that, we summarize all clinical events in the observation window right before the index date into features about this patient.

Figure 1: Illustration of longitudinal EHR data.

The sparseness of the data will vary among patients, and we expect the sequencing and clustering of what is observed to also vary. Different types of clinical events arise with different frequencies and in different orders. Therefore, we construct summary statistics for different types of event sequences based on the feature characteristics: for static features such as gender and ethnicity, we use a single static value to encode the feature. For temporal numeric features such as lab measures, we use summary statistics such as point estimates (mean, variance) and trend statistics to represent the features. For temporal discrete features such as diagnoses and medications, we use the event frequency, e.g., the number of occurrences of a specific International Classification of Diseases code. For complex variables, like medication prescribing, we model medication use as a time-dependent variable and also express medication usage (i.e., the percent of days pills may have been used) at different time intervals before the index date, taking account of relevant changes (i.e., intensification, de-intensification, switching, multiple therapies) for a given condition. A minimal sketch of this window-based summarization is given below.
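The following Python/pandas sketch illustrates the window-based summarization described above for a single patient. The table layout (columns date, type, code, value), the event types and the one-year observation window are illustrative assumptions, not the exact representation used in the paper.

```python
import pandas as pd

def construct_features(events, index_date, window_days=365):
    """Summarize one patient's EHR events inside the observation window
    that ends at the index date (Section 2). `events` is assumed to be a
    DataFrame with columns [date, type, code, value]."""
    # keep only events inside the observation window right before the index date
    win = events[(events["date"] < index_date) &
                 (events["date"] >= index_date - pd.Timedelta(days=window_days))]
    feats = {}
    # temporal discrete features (diagnoses, medications): event frequencies
    counts = win[win["type"].isin(["diagnosis", "medication"])].groupby("code").size()
    feats.update({f"count_{c}": int(v) for c, v in counts.items()})
    # temporal numeric features (labs, vitals): point estimates and a simple trend
    for code, grp in win[win["type"] == "lab"].groupby("code"):
        vals = grp.sort_values("date")["value"].astype(float)
        feats[f"{code}_mean"] = vals.mean()
        feats[f"{code}_var"] = vals.var()
        feats[f"{code}_trend"] = vals.iloc[-1] - vals.iloc[0]  # crude trend statistic
    return feats
```

Static features such as gender or ethnicity would simply be copied into the feature dictionary, and the richer medication-usage variables (percent of days covered, switching, intensification) would require additional bookkeeping along the same lines.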
3 Top-k Stability Selection

Stability selection is a feature selection method recently proposed to address the issue of selecting the proper amount of regularization in existing embedded feature selection algorithms [19]. In this section we introduce a more general top-k stability selection, which (1) is more robust against data perturbation and less sensitive to parameters compared to the original stability selection, and (2) possesses an improved theoretical guarantee on the false positives among the selected features.

3.1 Algorithm. In this section we describe the detailed process of stability selection using sparse logistic regression, and introduce top-k stability selection. Sparse logistic regression serves as an embedded feature selection algorithm that simultaneously performs feature selection and classification. Given a binary class label y ∈ {+1, −1} and an input vector x ∈ R^p, where p is the dimension of the feature space, logistic regression assumes that the posterior probability of a class y is a logistic sigmoid function acting on a linear combination of the features x [1], which is given by

P(y | x) = 1 / (1 + exp(−y(w^T x + b))),

where w ∈ R^p and b are model parameters. Given training data T = {x_i, y_i}_{i=1}^n, the training of logistic regression minimizes the empirical loss function

ℓ(w, b) = −(1/n) Σ_{i=1}^n log P(y_i | x_i) = (1/n) Σ_{i=1}^n log(1 + exp(−y_i (w^T x_i + b))).

Sparse logistic regression adds a sparsity-inducing ℓ1-norm regularization on the model parameter w [14], which solves the following optimization problem:

min_{w,b} ℓ(w, b) + λ ||w||_1,

where λ is the regularization parameter that controls the complexity (sparsity) of the model. Let F be the index set of features, and let f ∈ F denote a feature. For a feature f, if w_f ≠ 0 then the feature is included in the model. Sparse logistic regression is sensitive to the parameter λ, and therefore it is hard to control the exact number of features included in the model. Given a set of regularization parameter values Λ and an iteration number β, stability selection iteratively performs the following procedure.

Let D_j be a random subsample from T of size ⌊n/2⌋ drawn without replacement. For a given λ ∈ Λ, let ŵ_j^λ be the optimal solution of sparse logistic regression on D_j. Denote Ŝ^λ(D_j) = {f : ŵ_{j,f}^λ ≠ 0} as the set of features selected by ŵ_j^λ. The procedure is repeated β times and the selection probability Π̂_f^λ is computed as

Π̂_f^λ = (1/β) Σ_{j=1}^β 1{f ∈ Ŝ^λ(D_j)},

where 1{·} is the indicator function. The stability score of f is given by:

(3.1)  S_SS(f) = max_{λ ∈ Λ} Π̂_f^λ.

There are two methods of choosing features from the stability score: the first is to set a threshold π_thr and select the features whose stability score exceeds this threshold. The second, given the number of features we desire to include in the model, is to choose the top features ranked by stability score, as in filter-based feature selection methods; this is the method used in our experiments. Stability selection resembles a mixture of multiple models with different regularization parameters by using the maximum selection probability as in Eq. (3.1). The choice of the maximum selection probability as the stability score is questioned in some discussions of the paper [19]. In this paper, we address this issue by computing the stability score in an alternative way; specifically, we take the average of the top-k selection probabilities:

(3.2)  S_SS^k(f) = (1/k) Σ_{λ : Π̂_f^λ ranks in the top k} Π̂_f^λ.

It follows that the original method of computing the stability score in Eq. (3.1) is a special case of this more general form in Eq. (3.2) given k = 1. Intuitively, as k increases the selection score uses information from more parameters. In this sense, given a large k, stability selection boosts features that are more stable compared to others in the given parameter range Λ; features will have high scores if they perform well in many parameter settings. On the other hand, stability selection with a small k boosts features that are sensitive to the parameters: a feature may have a high score even if it performs well in only a few parameter settings. To achieve the best performance, the selection of k is data-dependent and can be determined by using the standard cross-validation technique on the training data. Since the scores for different k can be easily calculated given the selection probabilities, this cross-validation comes at very low cost. The algorithm for computing top-k stability selection is given in Algorithm 1. Though the paper focuses on top-k stability selection for classification problems, it can also be applied to feature selection in regression settings, for example, using Lasso in top-k stability selection.

Algorithm 1: Top-k Stability Selection
Input: dataset T = {x_i, y_i}_{i=1}^n, iteration number β, parameter set Λ
Output: top-k stability scores S_SS^k(f) for all f
 1: for j = 1 to β do
 2:   subsample D_j from T without replacement
 3:   for λ ∈ Λ do
 4:     compute the sparse logistic regression on D_j using parameter λ and obtain the result ŵ_j^λ
 5:     store the indices of the selected features: Ŝ^λ(D_j) = {f : ŵ_{j,f}^λ ≠ 0}
 6:   end for
 7: end for
 8: for f ∈ F do
 9:   compute the selection probability for all λ ∈ Λ: Π̂_f^λ = (1/β) Σ_{j=1}^β 1{f ∈ Ŝ^λ(D_j)}
10:   compute the top-k stability score: S_SS^k(f) = (1/k) Σ_{λ : Π̂_f^λ ranks in the top k} Π̂_f^λ
11: end for
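For readers who prefer code to pseudo-code, the following is a minimal Python sketch of Algorithm 1. It uses scikit-learn's ℓ1-penalized LogisticRegression in place of the SLEP solver used in the paper, and the mapping C = 1/λ between scikit-learn's inverse regularization strength and the λ of the objective above is only approximate because of differences in loss scaling; β, k and the λ grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def topk_stability_scores(X, y, lambdas, beta=100, k=4, seed=0):
    """Algorithm 1: return the top-k stability score S_SS^k(f) for every feature f."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((len(lambdas), p))   # selection counts per (lambda, feature)
    for _ in range(beta):
        # subsample D_j of size n/2 without replacement
        # (assumes both classes are present in each subsample)
        idx = rng.choice(n, size=n // 2, replace=False)
        for li, lam in enumerate(lambdas):
            clf = LogisticRegression(penalty="l1", C=1.0 / lam,
                                     solver="liblinear", max_iter=1000)
            clf.fit(X[idx], y[idx])
            counts[li] += np.abs(clf.coef_.ravel()) > 1e-8  # features with nonzero weight
    probs = counts / beta                   # selection probabilities Pi_hat
    # average of the k largest selection probabilities per feature
    return np.sort(probs, axis=0)[-k:, :].mean(axis=0)
```

Features are then ranked by the returned scores or thresholded at π_thr; with k = 1 the function reduces to the original stability selection score of Eq. (3.1).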
Another open question in stability selection is how to obtain the set of parameters Λ. Stability selection is said to be insensitive to the selected regularization parameters [19]. Practically, however, the parameter values should not be too large, or very few features are selected in each iteration; nor should they be very small, in which case the selection probability will be very high for all features. The selection of the parameters also depends on the data set used. Some datasets may be very sensitive to the chosen parameter, in terms of the sparsity of the sparse logistic regression model. In this paper we use a heuristic algorithm to compute a proper set of parameters Λ. The algorithm first uniformly samples a set of parameters from the entire range and finds the minimum λ_min and maximum λ_max that yield roughly p/3 selected features and an almost empty selection, respectively, and we then uniformly choose 8 parameters in this range. In the case that the data is sensitive to the parameter, so that λ_min and λ_max reduce to the same value λ_s, the algorithm iterates the above search on [λ_s − ε, λ_s + ε], where ε is a small value that is reduced in each iteration. A sketch of this heuristic is given below.
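One way to realize this parameter-grid heuristic is sketched below (Python, scikit-learn). The probe range, the p/3 target and the use of a simple linear grid are assumptions made for illustration; the paper's refinement of the interval [λ_s − ε, λ_s + ε] when λ_min and λ_max collapse to a single value is omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def n_selected(lam, X, y):
    """Number of features with nonzero weight under l1-penalty lambda."""
    clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear",
                             max_iter=1000).fit(X, y)
    return int(np.sum(np.abs(clf.coef_) > 1e-8))

def lambda_grid(X, y, n_grid=8, n_probe=50):
    p = X.shape[1]
    probes = np.logspace(-3, 1, n_probe)   # assumed probe range; must bracket both regimes
    nnz = np.array([n_selected(l, X, y) for l in probes])
    lam_min = probes[nnz <= p / 3].min()   # smallest lambda keeping the model reasonably sparse
    lam_max = probes[nnz >= 1].max()       # largest lambda that still selects something
    return np.linspace(lam_min, lam_max, n_grid)
```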

One of the key technical contributions of this paper is to establish, in the next section, an upper bound on the average number of falsely selected features for the proposed top-k stability selection.

3.2 Theoretical Analysis. In this section, we show the theoretical upper bound on the average number of falsely selected features in top-k stability selection. The error bound of the original stability selection in [19] is a special case of our established bound when k = 1. We show that the error bound in our paper is guaranteed to be at least as tight as the bound when k = 1, and we show that the proposed top-k stability selection is more robust by utilizing more information from the selection probabilities.

Let w ∈ R^p be the ground truth of the sparse model, and let S be the set of non-zero components of w, i.e., S = {f : w_f ≠ 0}. Let T be the total of n samples {1, 2, ..., n} and let D be a random subsample of T with size ⌊n/2⌋ drawn without replacement. For simplicity, we denote the selected features Ŝ^λ(T) by Ŝ^λ except when the dependence on the data needs to be clarified. The set of zero components of w is denoted by N = {f : w_f = 0} and the estimated set of zeros under the model with regularization parameter λ is N̂^λ. It follows that Ŝ^Λ = ∪_{λ ∈ Λ} Ŝ^λ and N̂^Λ = ∩_{λ ∈ Λ} N̂^λ. The average number of features selected by models built on Λ is u_Λ = E[|Ŝ^Λ|]. The selection probability of a subset of features K ⊆ {1, 2, ..., p} under the model with λ is Π̂_K^λ = P(K ⊆ Ŝ^λ(D)), where the probability P is with respect to the randomness of the subsampling. By averaging the top-k selection probabilities of features, we define the top-k stable features as follows:

Definition 1. (Top-k Stable Features) For a cut-off threshold π_thr with 0 < π_thr < 1 and a set of regularization parameters Λ, the set of top-k stable features is defined as

Ŝ_k^stable = { f : (1/k) Σ_{i=1}^k Π̂_f^(i) ≥ π_thr },

where Π̂_f^(i) = max_{λ ∈ Λ_i} Π̂_f^λ, Λ_i is defined as Λ \ {λ_1, λ_2, ..., λ_{i−1}} for i = 1, ..., k, λ_i = argmax_{λ ∈ Λ_i} Π̂_f^λ, and Λ_1 = Λ.

Note that a top-k stable feature is defined with respect to a single feature. The definition and the error control can easily be extended to the case of a subset of features K ⊆ {1, 2, ..., p}. Let V be the set of falsely selected features of top-k stability selection, V = N ∩ Ŝ_k^stable. The average number of falsely selected features by top-k stability selection, E[|V|], is guaranteed to be controlled by the following theorem:

Theorem 3.1. (Top-k Error Control) Assume that the original feature selection method is not worse than random guessing, i.e.,

E[|S ∩ Ŝ^Λ|] / E[|N ∩ Ŝ^Λ|] ≥ |S| / |N|,

and that for each λ ∈ Λ the distribution of {1{f ∈ Ŝ^λ}, f ∈ N} is exchangeable. Then

(3.3)  E[|V|] ≤ (1/k) Σ_{i=1}^k u_{Λ,i}² / ((2π_thr − 1) p),

where 1/2 < π_thr < 1, u_{Λ,i}² = E[u_{Λ_i}²] for i = 1, 2, ..., k, and Λ_i is defined the same as in the definition of top-k stable features.

The proof of Theorem 3.1 is given in the supplemental materials [29]. Obviously, the error bound E[|V|] ≤ u_Λ² / ((2π_thr − 1) p) in [19] is a special case of our bound in Eq. (3.3) given k = 1. Each u_{Λ,i}, i = 1, 2, ..., k, measures the average of u_{Λ_i} with respect to all features. Since u_{Λ,i}² ≤ u_Λ² for i = 1, 2, ..., k, we have (1/k) Σ_{i=1}^k u_{Λ,i}² ≤ u_Λ², which says that the upper bound on the error of top-k stability selection is guaranteed to be at least as tight as the original bound. Here u_{Λ,i} expresses the average number of features selected when Λ excludes the first i − 1 parameters that maximize the selection probabilities of a feature. By utilizing the information from the u_{Λ,i}², top-k stability selection is more robust in terms of the average number of falsely selected features, compared with the k = 1 stability selection.

4 Experiments

In the experiments we study the empirical performance of feature selection algorithms and classification methods on both synthetic and real clinical prediction datasets. The implementation of top-k stability selection uses the Lasso and the sparse logistic regression from the SLEP package [15].

4.1 Simulation. We use synthetic datasets to study the proposed top-k selection probability in the regression setting.
We start by generating the true model w ∈ R^p. There are 5 non-zero elements at random locations, generated from the distribution N(0, 1). We then generate a data matrix of samples X ∈ R^{n×p} with elements drawn i.i.d. from N(0, 1). The response is generated as y = Xw + ϵ, where ϵ ∼ N(0, 0.5). We apply top-k stability selection to the data (X, y), and calculate the number of falsely selected features |V| = |N ∩ Ŝ_k^stable| and the ratio |V| / |Ŝ_k^stable| using k ∈ {1, 2, 4, 8, 10}. We repeat the experiments and report the averaged results in Figure 2.
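This synthetic study can be reproduced in spirit with a few lines of Python. Since several sizes are not fully specified above, the dimensions, sparsity level, number of repetitions and λ grid below are illustrative assumptions, and scikit-learn's Lasso stands in for the SLEP implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5                        # illustrative sizes, not the paper's exact values
w = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
w[support] = rng.standard_normal(s)           # nonzero entries ~ N(0, 1)
X = rng.standard_normal((n, p))
y = X @ w + 0.5 * rng.standard_normal(n)      # noise with standard deviation 0.5

lambdas, beta = np.logspace(-2, 0, 8), 50
counts = np.zeros((len(lambdas), p))
for _ in range(beta):
    idx = rng.choice(n, size=n // 2, replace=False)
    for li, lam in enumerate(lambdas):
        coef = Lasso(alpha=lam, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts[li] += np.abs(coef) > 1e-8
probs = counts / beta                         # selection probabilities per (lambda, feature)

for k in (1, 2, 4, 8):
    scores = np.sort(probs, axis=0)[-k:, :].mean(axis=0)   # top-k stability scores
    selected = np.flatnonzero(scores >= 0.6)                # threshold pi_thr = 0.6
    false_pos = np.setdiff1d(selected, support)             # V = N intersected with S_stable
    print(k, len(selected), len(false_pos))
```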

Figure 2: Comparisons of top-k stability selection with k = 1, 2, 4, 8, 10. When k increases, we observe improved empirical performance using top-k stability selection in terms of the number of false positive features included in the model (left) and their ratio (right), which is consistent with the analysis in Theorem 3.1. (Both panels plot against the threshold on the selection probability.)

We observe that for a given threshold on the stability score, top-k stability selection performs better when k > 1. When k increases, we observe improved empirical performance using top-k stability selection in terms of both the number and the ratio of false positive features included in the model, which is consistent with the discussion in Theorem 3.1.

4.2 Clinical Application Setting. The individual and societal impact of heart failure (HF) is staggering. One in five US citizens over age 40 is expected to develop HF during their lifetime. It is currently the leading cause of hospitalization among Medicare beneficiaries. The Framingham risk criteria [16], the most commonly used diagnostic criteria for HF, are signs and symptoms that are often buried in clinical notes and may not appear in the structured fields of the EHR. Framingham criteria include acute pulmonary edema (APEdema), ankle edema, dyspnea on ordinary exertion, etc. Despite their importance, Framingham criteria are not systematically documented in structured EHR. Processing clinical notes requires manual annotation or building customized natural language processing (NLP) applications, both of which are expensive and time-consuming. Plenty of structured EHR data, on the other hand, are available, including diagnoses, medications, and lab results. Our goal here is to construct features from structured EHR data and to use only a small number of labeled annotations of Framingham criteria to build predictive models, so that we can identify individuals that are likely to have Framingham criteria even though they are not documented. The reasons for identifying potentially missing Framingham criteria are (1) to enable timely diagnosis of HF, and (2) to use the Framingham criteria as additional features for predicting future onset of HF. In this paper we construct 9 datasets targeting different Framingham criteria from the EHR data of a hospital collected over 5 years. The sample size of each dataset is given in Table 1, where each sample corresponds to exactly one patient in the study. The number of features is 32, including 79 diagnosis features, 36 medication features, and 7 lab test features.

Table 1: The sample sizes of the EHR data sets.

Dataset# | Task Name        | Training (+/−) | Testing (+/−)
1        | APEdema          | 424 / 576      | 424 / 576
2        | AnkleEdema       | 25 / 794       | 26 / 795
3        | ChestPain        | 89 / 9         | 8 / 9
4        | DOExertion       | 65 / 935       | 65 / 935
5        | Neg-Hepatomegaly | 537 / 463      | 537 / 463
6        | Neg-JVDistention | 232 / 767      | 233 / 768
7        | Neg-NightCough   | 7 / 929        | 7 / 93
8        | Neg-PNDyspnea    | 869 / 3        | 87 / 3
9        | Neg-S3Gallop     | 552 / 447      | 553 / 448

4.3 Feature Selection Comparison. In this experiment we evaluate the performance of top-k stability selection given different k against existing feature selection methods, including Fisher Score [5], Relief [10, 12], Gini Index [8], Info Gain [3], χ² [13], and Minimum-Redundancy Maximum-Relevance (mRMR) [21].
Note that for the feature selection methods that only take categorical inputs, we can still use continuous features by applying discretization techniques [4]. For feature selection algorithms other than stability selection (SS), we use the implementations from an existing package [28]. For each feature selection algorithm, we compute the feature ranking on the training data. We then use the top t features to build a classification model using logistic regression and evaluate the model on the testing data, where we vary t from 2 to 20. To study the effects of top-k stability selection when k changes, we treat k = 1, 2, 4, 8 as different feature selection algorithms in the evaluation. Since the class distribution of the data sets is not balanced, we focus on the Area Under the Curve (AUC) metric. Given a specific feature number, for all feature selection methods we select that number of top-ranked features, build classification models using logistic regression on the training data, and evaluate the models on the testing data. We repeat this procedure multiple times, and the average AUC over the repetitions is used to rank the feature selection algorithms. We report the mean rank and standard deviation over the 9 datasets in Table 2, and highlight the top 3 feature selection algorithms according to the mean rank. One important observation is that the stability selection algorithms perform better than all other feature selection algorithms. The performance results also demonstrate the potential of top-k stability selection with k ≥ 2. These results are consistent with our theoretical analysis in Section 3.2. The evaluation protocol is sketched below.
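The evaluation loop for a single dataset and a single feature ranking can be sketched as follows (Python with scikit-learn). The feature scores, the split and the value of t are whatever the surrounding experiment supplies; this is an illustration of the protocol rather than the authors' exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_of_ranking(scores, X_train, y_train, X_test, y_test, t):
    """Keep the t highest-scoring features, fit logistic regression on the
    training split, and report AUC on the testing split."""
    top = np.argsort(scores)[::-1][:t]
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, top], y_train)
    p_pos = clf.predict_proba(X_test[:, top])[:, 1]
    return roc_auc_score(y_test, p_pos)

# e.g. aucs = [auc_of_ranking(s, Xtr, ytr, Xte, yte, t) for t in range(2, 21, 2)]
```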

Table 2: Average AUC ranks of the feature selection algorithms over the 9 datasets (mean rank ± standard deviation).

Feature #  | 4         | 6         | 8         | 10        | 12        | 14        | 16        | 18        | 20
Relief     | . ± .     | . ± .     | . ± .     | . ± .     | . ± .     | . ± .     | . ± .     | . ± .     | . ± .
Fisher     | 7. ± 2.   | 7.2 ± .5  | 6.2 ± 2.4 | 6. ± 2.6  | 5.8 ± 2.6 | 5.6 ± 2.2 | 4.9 ± 2.8 | 5. ± 2.   | 5.6 ± .5
Gini       | 7. ± .5   | 7. ± 2.   | 6. ± .8   | 6.8 ± .8  | 6.9 ± 2.  | 6.4 ± 2.8 | 5.9 ± 2.  | 6.6 ± 2.  | 6. ± 2.
InfoGain   | 4.7 ± 2.3 | 6.2 ± 2.  | 5.7 ± .7  | 5.7 ± .7  | 5.8 ± .   | 5.9 ± .3  | 6. ± 2.4  | 7. ± .6   | 6.8 ± .7
ChiSquare  | 5.3 ± 2.2 | 6.8 ± .8  | 6.4 ± 2.5 | 6.6 ± .6  | 6.2 ± .4  | 5.8 ± 2.  | 6.7 ± .4  | 7.2 ± .5  | 7.2 ± .6
mRMR       | 5.8 ± 2.6 | 5.7 ± .7  | 7. ± .5   | 7.3 ± .7  | 7.8 ± .5  | 7.6 ± .6  | 7.2 ± .7  | 6.9 ± 2.4 | 6.4 ± 3.4
SS k = 1   | 5. ± 3.   | 4. ± 2.2  | 4.8 ± 2.5 | 4.4 ± 2.5 | 4.6 ± 2.7 | 4.2 ± 3.6 | 4.7 ± 3.2 | 3.8 ± 2.4 | 4.8 ± 2.3
SS k = 2   | 3.8 ± 3.  | 3. ± 2.4  | 2.8 ± 2.  | 2.4 ± .9  | 3.2 ± 2.  | 3.7 ± 2.  | 3.2 ± 2.7 | 3.3 ± .9  | 3. ± .7
SS k = 4   | 3. ± .8   | 2.3 ± .6  | 2.8 ± 2.3 | 2.2 ± .6  | 2.6 ± .6  | 2.8 ± .6  | 2.9 ± .4  | 2.3 ± .3  | 2.3 ± .
SS k = 8   | 3.3 ± .9  | 2.6 ± .2  | 3.2 ± 2.4 | 3.6 ± .9  | 2.2 ± .5  | 3. ± .8   | 3.6 ± .9  | 2.9 ± .7  | 2.8 ± .8

Table 3: Classification performance on the EHR datasets in terms of AUC. We compare three classifiers: support vector machine (SVM), random forest (RanF) and logistic regression (LReg).

Dataset | 1         | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9
SVM     | 62.2 ± 3. | 7.9 ±     | 68.8 ± .2 | 75.2 ± .7 | 59. ± .4  | 64.3 ± .3 | 67. ± .5  | 64.5 ±    | 56. ± 2.
RanF    | 69.6 ± .  | 73.7 ± .9 | 69.7 ± .7 | 75.7 ±    | 59.3 ± 2. | 63.9 ±    | 66. ± .   | 65.3 ± .3 | 52.3 ± 5.6
LReg    | 7. ± .    | 75. ± .9  | 69.7 ±    | 76.6 ± .9 | 65.2 ± .3 | 67. ± .9  | 67.6 ±    | 66.7 ± .7 | 63.9 ± .9

Sparse Learning Comparison. In our experiments, stability selection uses sparse logistic regression, which is the most popular sparse learning method for classification and also serves as an embedded feature selection method itself via the ℓ1-norm penalty. This motivates us to investigate the performance of directly applying sparse logistic regression to the data sets. We vary the regularization parameter of sparse logistic regression and evaluate the performance in terms of AUC. To compare the performance, we perform top-4 stability selection on the same training data and use the same number of features as selected by the sparse logistic regression model to build models using logistic regression. We show the results on the AnkleEdema and DOExertion datasets in Figure 3. Since different regularization parameters may give the same number of features, there may be more than one value at each feature number. We observe that when the same number of features is selected, the approach using stability selection outperforms sparse logistic regression. Also note that sparse logistic regression is sensitive to the parameter λ, whose estimation requires cross-validation. In the approach of logistic regression after stability selection, however, the performance is not very sensitive to the number of features selected. We observe similar patterns for different values of k.

4.4 Classification Methods Comparison. In this experiment we study the performance of popular classifiers on the EHR data sets. We include SVM, random forest (RanF), and logistic regression (LReg) in the study. For SVM, 5-fold cross-validation is used to estimate the best parameter c. In random forest we fix the tree number to 50. When building the classification models, we include the top 20 features ranked by stability selection with k = 4. The performance of the classifiers on the 9 datasets is given in Table 3. We observe that on all data sets logistic regression performs better than the other classifiers in terms of AUC.

Figure 3: The AUC of sparse logistic regression (Sparse LReg) and of logistic regression after top-4 stability selection (SS k = 4) on randomly split training/testing datasets (panels: AnkleEdema and DOExertion). In each iteration we vary the regularization parameter of the Sparse LReg and choose the same number of features in SS. Note that different regularization parameters may give the same number of features.
Sensitivity analysis. We perform experiments to study the sensitivity of the methods with respect to their parameters. We randomly split the AnkleEdema dataset into training and testing sets (we observe similar patterns on the other datasets). We perform stability selection with k = 4 on the training data, and choose the top 20 features to build the classification models using different parameters. We repeat the procedure multiple times and report the AUC. For SVM we vary the parameter c such that log(c) ranges from −4 to +6. For random forest we vary the tree number from 5 to 100. For logistic regression, we vary the ℓ2-norm regularization term in the range log(λ) ∈ [−4, 6]. The results are given in Figure 4. We find that the performance of SVM and logistic regression on this dataset is very sensitive to the parameter. For logistic regression, we obtain better performance for a smaller parameter value, which corresponds to less regularization; this is because when a small number of features is included in the model, it is not necessary to add ℓ2-norm regularization. For SVM, however, cross-validation is necessary in order to select the best parameter.

Unlike SVM and LReg, random forest is not sensitive to the tree number parameter, and its performance remains almost the same once the tree number is large enough.

Figure 4: The sensitivity of the classification methods with respect to their parameters. SVM (top) is sensitive to the parameter c; random forest (middle) is not sensitive to the tree number parameter; logistic regression (bottom) is sensitive to the regularization parameter.

4.5 Top-k Stability Score Case Study. In this experiment we study the top-k stability scores given different k. We show the distribution of stability scores for the DOExertion dataset in Figure 5. We presented the top features to clinical experts and confirmed their clinical validity in many cases. For example, G1-58 is actually shortness of breath (a symptom diagnosis), which includes DOExertion as a special case. G3-33 are sympathomimetic agents, used as nasal decongestants to help breathing. Moreover, we observe that the absolute value of the stability score shrinks when k increases. The shrinkage is expected because we average the selection probabilities over more regularization parameters. In the stability score distribution given k = 1, we find that the average stability score decreases continuously. As k increases, we find that some steep drops in the stability score emerge. In Figure 5 (k = 2), there is a drop between feature G1-44 and G3-4. When we increase to k = 8, we find a more significant drop after the first three features. Such drop information can potentially be used as guidance for the number of features to be included, or equivalently the choice of threshold in stability selection. We are also aware that the rank of features can differ for different k. The feature G3-6, cardio-selective beta blockers, has a subtle connection to DOExertion, since it is often prescribed to HF patients, and DOExertion is a common symptom among HF patients. In particular, G3-6 ranks low given k = 1; its rank improves to 6 and 4, respectively, when k increases to 4 and 8. The improvement in rank implies that the feature's selection probability may not be the highest for any single regularization parameter, but across all parameters in Λ the feature's selection probabilities are consistently high. Features like G3-6 in this case are considered to be stable with respect to the parameters, which can only be identified with top-k stability selection.

Figure 5: Average stability scores for the DOExertion dataset; the panels show the score distributions for k = 1, 2, 4 and 8 over the top-ranked features.
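The rank changes discussed above follow directly from the definition in Eq. (3.2). The toy numbers below are hypothetical and only illustrate the mechanism: feature A is selected almost surely for one value of λ but rarely elsewhere, while feature B is selected fairly often for every λ.

```python
import numpy as np

# hypothetical selection probabilities over an 8-value lambda grid
probs_A = np.array([0.95, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05])  # peaked
probs_B = np.array([0.80, 0.78, 0.75, 0.74, 0.72, 0.70, 0.65, 0.60])  # consistent

def topk_score(probs, k):
    return np.sort(probs)[-k:].mean()

for k in (1, 2, 4, 8):
    print(k, topk_score(probs_A, k), topk_score(probs_B, k))
# k = 1 ranks A above B (0.95 vs 0.80); for every k >= 2 the order flips
# (e.g. k = 4 gives 0.475 vs 0.7675), which mirrors how a consistently
# selected feature such as G3-6 climbs in rank as k grows.
```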
5 Conclusion

The goal of disease-specific risk prediction is to assess the risk of a patient in developing a target disease based on his/her health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed in order to characterize patient profiles. In this paper, we propose top-k stability selection, which generalizes a powerful sparse learning feature selection method by overcoming its limitation on parameter selection and providing stronger theoretical properties. In a large set of real clinical prediction datasets, the top-k stability selection methods outperform the original stability selection and many existing feature selection methods. We compare three competitive classification methods to demonstrate the effectiveness of the features selected by our proposed method. We also show that in several clinical applications on predicting heart failure related symptoms, top-k stability selection can successfully identify important features that are clinically meaningful.

Recently the data mining and machine learning community has devoted considerable research effort to structured sparsity. In our future work, we plan to combine stability selection and structured sparsity and apply them to research on EHRs. Top-k stability selection is based on the Lasso and is therefore vulnerable to strongly correlated features. We plan to develop methods that are more robust against such ill-conditioned design matrices.

Acknowledgments

We would like to thank Dr. Steven Steinhubl and Dr. Walter F. Stewart from Geisinger for facilitating the testing of our algorithm. We also want to thank Roy Byrd from IBM for providing symptom annotations on the data. This work was supported in part by NSF IIS-0953662, NSF CCF-1025177 and NIH R01 LM010730.

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[2] S. Boucheron and O. Bousquet. Concentration inequalities. In Advanced Lectures in Machine Learning, pages 208–240. Springer, 2004.
[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In ICML, pages 194–202, 1995.
[5] R. Duda, P. Hart, and D. Stork. Pattern Classification and Scene Analysis (2nd ed.), 1995.
[6] H. Eleftherohorinou, C. Hoggart, V. Wright, M. Levin, and L. Coin. Pathway-driven gene stability selection of two rheumatoid arthritis GWAS identifies and validates new susceptibility genes in receptor mediated signalling pathways. Human Molecular Genetics, 20(17):3494–3506, 2011.
[7] G. Fonarow, K. Adams, W. Abraham, C. Yancy, W. Boscardin, et al. Risk stratification for in-hospital mortality in acutely decompensated heart failure. Journal of the American Medical Association, 293(5):572, 2005.
[8] C. Gini. Variabilità e mutabilità. Reprinted in Memorie di Metodologica Statistica (E. Pizetti and T. Salvemini, eds.). Rome: Libreria Eredi Virgilio Veschi, 1912.
[9] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440. ACM, 2009.
[10] K. Kira and L. Rendell. A practical approach to feature selection. In International Workshop on Machine Learning, pages 249–256, 1992.
[11] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.
[12] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, 2008.
[13] H. Liu and R. Setiono. Chi2: feature selection and discretization of numeric attributes. In International Conference on Tools with Artificial Intelligence, pages 388–391, 1995.
[14] J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In KDD, pages 547–556. ACM, 2009.
[15] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[16] P. McKee, W. Castelli, P. McNamara, and W. Kannel. The natural history of congestive heart failure: the Framingham study. New England Journal of Medicine, 285(26):1441–1446, 1971.
[17] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70(1):53–71, 2008.
[18] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[19] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B, 72(4):417–473, 2010.
[20] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE, 1993.
[21] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
[22] S. Ryali, T. Chen, K. Supekar, and V. Menon. Estimation of functional connectivity in fMRI data using stability selection-based sparse partial correlation with elastic net penalty. NeuroImage, 59:3852–3861, 2012.
[23] D. Stekhoven, L. Hennig, G. Sveinbjörnsson, I. Moraes, M. Maathuis, and P. Bühlmann. Causal stability ranking. 2012.
[24] M. Stern, K. Williams, D. Eddy, and R. Kahn. Validation of prediction of diabetes by the Archimedes model and comparison with other predicting models. Diabetes Care, 31(8):1670, 2008.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, pages 267–288, 1996.
[26] M. Vounou, E. Janousova, R. Wolz, J. Stein, P. Thompson, D. Rueckert, and G. Montana. Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer's disease. NeuroImage, 2012.
[27] J. Wu, J. Roy, and W. Stewart. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Medical Care, 48(6):S106, 2010.
[28] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. ASU feature selection repository, featureselection.asu.edu, 2010.
[29] J. Zhou. Supplementary material, http://www.public.asu.edu/~jzhou29/papers/jzhousdm13_appendix.pdf.

Patient Risk Prediction Model via Top-k Stability Selection
Appendix A: Proof of Theorem 3.1
Jiayu Zhou, Jimeng Sun, Yashu Liu, Jianying Hu, Jieping Ye (Jiayu.Zhou@asu.edu)

Before proceeding to the proof of Theorem 3.1, we first gain some insight by analyzing the simultaneous selection probability obtained from random splitting. Let D_1 and D_2 be two disjoint subsets of T generated by random splitting, each with sample size ⌊n/2⌋. The simultaneously selected set is then given by Ŝ^smlt,λ = Ŝ^λ(D_1) ∩ Ŝ^λ(D_2). The corresponding simultaneous selection probability for any set K ⊆ {1, ..., p} is defined as Π̂_K^smlt,λ = P(K ⊆ Ŝ^smlt,λ).

Lemma 5.1. For any subset K ⊆ {1, ..., p}, a lower bound for the average of the top-k maximum simultaneous selection probabilities is given by

(1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ≥ (2/k) Σ_{i=1}^k Π̂_K^(i) − 1.

Proof: According to the Lemma in [19], Π̂_K^smlt,λ ≥ 2Π̂_K^λ − 1 for any set K ⊆ {1, ..., p} and any λ. Letting λ_1, ..., λ_k denote the parameters attaining the top-k selection probabilities Π̂_K^(1), ..., Π̂_K^(k), we therefore have

(1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ≥ (1/k) Σ_{i=1}^k Π̂_K^smlt,λ_i ≥ (1/k) Σ_{i=1}^k (2Π̂_K^{λ_i} − 1) = (2/k) Σ_{i=1}^k Π̂_K^(i) − 1.

This completes the proof of Lemma 5.1.

Lemma 5.2. For a subset of features K ⊆ {1, ..., p}, if P(K ⊆ Ŝ^{Λ_i}) ≤ ε_i for i = 1, 2, ..., k, where Λ_i is defined as in the definition of top-k stable features and Ŝ is estimated from ⌊n/2⌋ samples, then

P( (1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ≥ ξ ) ≤ (1/(kξ)) Σ_{i=1}^k ε_i².

Proof: Let D_1, D_2 ⊆ {1, ..., n} be two subsamples of T with size ⌊n/2⌋ generated by random splitting. Define B_K^λ = 1{K ⊆ Ŝ^λ(D_1)} · 1{K ⊆ Ŝ^λ(D_2)}; the simultaneous selection probability is given by Π̂_K^smlt,λ = E[B_K^λ] = E[E[B_K^λ | T]], where the outer expectation is with respect to the random splitting. Hence for i = 1, 2, ..., k we have

Π̂_K^smlt,(i) = max_{λ ∈ Λ_i} E[B_K^λ] = max_{λ ∈ Λ_i} E[E[B_K^λ | T]].

The inequality P(K ⊆ Ŝ^{Λ_i}) ≤ ε_i for sample size ⌊n/2⌋ implies that, for i = 1, 2, ..., k,

max_{λ ∈ Λ_i} E[B_K^λ] ≤ ε_i².

Thus

E[ (1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ] = E[ (1/k) Σ_{i=1}^k max_{λ ∈ Λ_i} E[B_K^λ | T] ] ≤ (1/k) Σ_{i=1}^k ε_i².

Using the Markov-type inequality [2], we have

ξ P( (1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ≥ ξ ) ≤ E[ (1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ] ≤ (1/k) Σ_{i=1}^k ε_i²,

and thus P( (1/k) Σ_{i=1}^k Π̂_K^smlt,(i) ≥ ξ ) ≤ (1/(kξ)) Σ_{i=1}^k ε_i². This completes the proof of the lemma.

Proof of Theorem 3.1 (Top-k Error Control): From Theorem 1 in [19] we have P(f ∈ Ŝ^Λ) ≤ u_Λ / p for f ∈ N. It follows immediately that for all f ∈ N and i = 1, 2, ..., k we have P(f ∈ Ŝ^{Λ_i}) ≤ u_{Λ_i} / p. Using Lemma 5.2, we have

P( (1/k) Σ_{i=1}^k Π̂_f^smlt,(i) ≥ ξ ) ≤ (1/(kξ)) Σ_{i=1}^k (u_{Λ_i} / p)².

By Lemma 5.1, it follows that

P( (1/k) Σ_{i=1}^k Π̂_f^(i) ≥ π_thr ) ≤ P( (1/k) Σ_{i=1}^k Π̂_f^smlt,(i) ≥ 2π_thr − 1 ) ≤ (1/k) Σ_{i=1}^k u_{Λ_i}² / (p² (2π_thr − 1)).

Therefore we have

E[|V|] = Σ_{f ∈ N} P( (1/k) Σ_{i=1}^k Π̂_f^(i) ≥ π_thr ) ≤ (1/k) Σ_{i=1}^k u_{Λ,i}² / (p (2π_thr − 1)),

where u_{Λ,i}² = E[u_{Λ_i}²]. This completes the proof.