Patient Risk Prediction Model via Top-k Stability Selection


Jiayu Zhou, Jimeng Sun, Yashu Liu, Jianying Hu, Jieping Ye

Abstract. The patient risk prediction model aims at assessing the risk of a patient in developing a target disease based on his/her health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed in order to characterize patient profiles. This wealth of data provides unprecedented opportunities for data mining researchers to address important biomedical questions. Practical data mining challenges include: How to correctly select and rank those features based on their prediction power? What predictive model performs best in predicting a target disease using those features? In this paper, we propose top-k stability selection, which generalizes a powerful sparse learning method for feature selection by overcoming its limitation on parameter selection. In particular, our proposed top-k stability selection includes the original stability selection method as a special case given k = 1. Moreover, we show that top-k stability selection is more robust by utilizing more information from selection probabilities than the original stability selection, and provides stronger theoretical properties. In a large set of real clinical prediction datasets, the top-k stability selection methods outperform many existing feature selection methods, including the original stability selection. We also compare three competitive classification methods (SVM, logistic regression and random forest) to demonstrate the effectiveness of the features selected by our proposed method in the context of clinical prediction applications. Finally, through several clinical applications on predicting heart-failure-related symptoms, we show that top-k stability selection can successfully identify important features that are clinically meaningful.
Computer Science and Engineering, Arizona State University; IBM T.J. Watson Research Lab

1 Introduction

Electronic health records (EHRs) capture longitudinal patient information in digital form from diverse sources: demographics, diagnoses, medications, lab results and physician notes. The adoption of EHRs has increased dramatically recently in response to the government's EHR incentive program. Many healthcare institutions and practitioners are looking for secondary uses of EHR data, among which clinical decision support and care management systems are the most prominent applications. As more EHR data becomes available, a large number of features can be constructed that may be potentially useful for risk prediction, which has important applications in both clinical decision support and care management systems. The challenge of getting high quality features is how to rank all available features based on their predictive power for a specific condition. For that, appropriate feature selection is required. The need for feature selection is further amplified by the challenge of obtaining high quality training data. The form of training data depends on the specific tasks and the source data quality. Because of the highly noisy nature of EHR data, clinical experts often have to be involved in the annotation process in order to obtain reliably labeled training data. As a result, in many cases only limited labeled data can be obtained. Because of the small sample size, over-fitting becomes one of the biggest concerns for predictive modeling in the healthcare domain. The key to avoiding over-fitting is to construct robust and parsimonious models, where recent developments in sparse learning [9, 7, 2, 25] can offer many insights. Another reason why feature selection is particularly important in building risk prediction models for healthcare is the need for interpretability and actionability.
For both clinical decision support and care management support, it is often not sufficient to simply have a model that produces a risk score, even if it is highly accurate. The decision makers need to understand what the leading risk factors are so that proper interventions can be taken to address the risk. Parsimonious models that achieve high accuracy with a relatively small number of succinct features clearly have the advantage of offering better interpretability. This again points to feature selection as the centerpiece of the discussion. We focus on addressing this central issue of feature selection for risk prediction in this paper. In many embedded feature selection algorithms, there is a tuning parameter that determines the amount of regularization and, correspondingly, how many features are included in the model. Examples are the regularization factors of sparsity-inducing penalties, e.g., Lasso regression [25], sparse logistic regression [4], and group Lasso [9, 7], and the iteration number of orthogonal matching pursuit [2]. The choice of such tuning parameters is data-dependent, and therefore there is no general guidance on how the parameters should be chosen. Typically a cross-validation process is used to estimate the tuning parameters from the training data. Due to overfitting, the parameters estimated via cross validation usually include too many irrelevant features in the model [8], especially in high dimensional problems, and thus lead to sub-optimal performance. Moreover, chances are that not all relevant features are selected given a particular tuning parameter, and therefore the information about relevant features may be contained in the models given by more than one tuning parameter. To address the aforementioned problems, we propose to introduce and extend a recently developed sparse learning method, stability selection [9], which is currently not widely used in the data mining community despite its success in bioinformatics, especially in genome-related biomarker selection problems where the sample size is much smaller than the feature dimension (n ≪ p) [6, 22, 23, 26]. Stability selection, based on subsampling/bootstrapping, provides a general method to perform model selection using information from a set of regularization parameters. The stability ranking score gives a probability, which makes it naturally interpretable. However, the existing stability selection uses a rather arbitrary and limited maximum selection criterion, which may lead to suboptimal results. To tackle this limitation of stability selection, we propose a general top-k stability selection algorithm and establish stronger theoretical properties. The original selection method can be shown to be a special case of our proposed top-k stability selection given k = 1.
We also show theoretically that the proposed top-k stability selection is more robust by utilizing more information from selection probabilities when k > 1 than the original stability selection, and is less sensitive to the input parameters. Overall, the top-k stability selection method has the following advantages over the original stability selection:

More stable ranking: The feature ranking is more robust against data perturbation and less sensitive to the input parameters.

Improved theoretical guarantee: Improved theoretical guarantees can be established that provide a stronger bound on the false positive error rate.

We design experiments to compare top-k stability selection given different k, and systematically compare our proposed method against six different feature selection methods in the context of clinical prediction. Our top-k stability selection method outperforms the traditional feature selection methods on all datasets. Moreover, the proposed top-k stability selection achieves better performance than the original stability selection on both synthetic and real datasets, which empirically confirms the superiority of this generalization. Finally, we compare three competitive classification methods using the features selected by top-k stability selection (SVM, random forest and logistic regression), and discuss several practical considerations in how to choose the classifiers in the context of clinical prediction applications. Logistic regression gives the best prediction performance in terms of AUC, while random forest is shown to be less sensitive to the input parameter value. The rest of the paper is organized as follows: Section 2 describes the motivating application and shows how to convert EHR data into features for risk prediction. Section 3 presents the idea of stability selection, and discusses the top-k stability selection extension and its theoretical properties. Section 4 presents a systematic evaluation of different methods for risk prediction. We conclude in Section 5.
2 Motivating Application

Risk prediction has important applications in both clinical decision support and care management systems. For risk prediction to provide actionable insight, it often requires building a predictive model (e.g., a classifier) for a specific disease condition. For example, based on the EHR data, one may want to assess the risk scores of a patient developing different disease conditions, such as diabetes complications [24] and heart failure [7, 27], so that a proper intervention and care plan can be designed accordingly. The common pipeline for building any predictive model involves the following steps:

Feature construction prepares the input features and the target variable for a specific task. In this step, the labeled training data sets are constructed from EHR data.

Feature selection ranks the input features and keeps only the most relevant ones for subsequent model building.

Model building uses the selected features and the target variable to construct a predictive model. Depending on the task, the model can be either classification or regression. In this paper, we focus on classification tasks with binary target variables.

Model validation evaluates the models against a diverse set of performance metrics. In the presence of an unbalanced class distribution, the Area Under the Curve (AUC) is typically employed.

Feature selection, model building and validation are familiar concepts in the data mining field. Feature construction is often application dependent. Next we illustrate the common strategies for constructing features in clinical applications. EHR data documents detailed patient encounters in time, which often include diagnoses, medications, lab results and clinical notes. We systematically construct features from different data sources, recognizing that longitudinal data on even a single variable (e.g., blood pressure) can be represented in a variety of ways. The objective of our feature construction effort is to capture sufficient clinical nuance for the subsequent risk prediction task. We model features as longitudinal sequences of observable measures such as demographics, diagnoses, medications, labs, vitals, and symptoms. As clinical events are recorded over time, for a given time window (the observation window), we summarize those events into feature variables. As shown in Figure 1, at the index date, we want to construct features for a specific patient. For that, we summarize all clinical events in the observation window right before the index date into features about this patient.

Figure 1: Illustration of longitudinal EHR data.

The sparseness of the data will vary among patients and we expect the sequencing and clustering of what is observed to also vary. Different types of clinical events arise with different frequencies and in different orders. Therefore, we construct summary statistics for different types of event sequences based on the feature characteristics:

For static features such as gender and ethnicity, we use a single static value to encode the feature.

For temporal numeric features such as lab measures, we use summary statistics such as point estimates (mean, variance) and trend statistics to represent the features.
For temporal discrete features such as diagnoses and medications, we use the event frequency (e.g., the number of occurrences of a specific International Classification of Diseases code).

For complex variables, like medication prescribing, we model medication use as a time-dependent variable and also express medication usage (i.e., percent of days pills may have been used) at different time intervals before the index date, taking account of relevant changes (i.e., intensification, de-intensification, switching, multiple therapies) for a given condition.

3 Top-k Stability Selection

Stability selection is a feature selection method recently proposed to address the issue of selecting the proper amount of regularization in existing embedded feature selection algorithms [9]. In this section we introduce a more general top-k stability selection, which (1) is more robust against data perturbation and less sensitive to parameters compared to the original stability selection; and (2) possesses an improved theoretical guarantee on the false positives of the features selected.

3.1 Algorithm

In this section we describe the detailed process of stability selection using sparse logistic regression, and introduce top-k stability selection. Sparse logistic regression serves as an embedded feature selection algorithm that simultaneously performs feature selection and classification. Given a binary class label y ∈ {+1, −1} and an input vector x ∈ R^p, where p is the dimension of the feature space, logistic regression assumes that the posterior probability of a class y is a logistic sigmoid function acting on the linear combination of features x, given by: P(y|x) = 1 / (1 + exp(−y(wᵀx + b))), where w ∈ R^p and b are model parameters. Given training data T = {x_i, y_i}, i = 1, ..., n, the training of logistic regression minimizes the empirical loss function: ℓ(w, b) = −(1/n) Σ_i log P(y_i|x_i) = (1/n) Σ_i log(1 + exp(−y_i(wᵀx_i + b))).
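As a quick illustration of the posterior and loss above, here is a minimal NumPy sketch; the data and weights are synthetic placeholders, not the paper's implementation:

```python
import numpy as np

def posterior(w, b, X, y):
    # P(y | x) = 1 / (1 + exp(-y (w^T x + b))) for each sample
    return 1.0 / (1.0 + np.exp(-y * (X @ w + b)))

def logistic_loss(w, b, X, y):
    # empirical loss: (1/n) sum_i log(1 + exp(-y_i (w^T x_i + b)))
    margins = y * (X @ w + b)
    return np.mean(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=100))

# the loss under the generating weights should beat the all-zero model
assert logistic_loss(w_true, 0.0, X, y) < logistic_loss(np.zeros(5), 0.0, X, y)
```

The all-zero model has loss log 2 regardless of the data, which makes it a convenient baseline for a sanity check.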
Sparse logistic regression adds a sparsity-inducing ℓ1-norm regularization on the model parameter w [4], which solves the following optimization problem: min_{w,b} ℓ(w, b) + λ‖w‖₁, where λ is the regularization parameter that controls the complexity (sparsity) of the model. Let F be the index set of features, and let f denote a feature. For a feature f, if w_f ≠ 0 then the feature is included in the model. Sparse logistic regression is sensitive to the parameter λ and therefore it is hard to control the exact number of features included in the model. Given a set of regularization parameter values Λ and an iteration number β, stability selection iteratively performs the following procedure. Let D_j be a random subsample from T of size ⌊n/2⌋ drawn without replacement. For a given λ ∈ Λ, let ŵ_j^λ be the optimal solution of sparse logistic regression on D_j. Denote Ŝ^λ(D_j) = {f : ŵ_{j,f}^λ ≠ 0} as the features selected by ŵ_j^λ. The procedure is repeated β times and the selection probability Π̂_f^λ is computed as (1/β) Σ_{j=1}^β 1{f ∈ Ŝ^λ(D_j)}, where 1{·} is the indicator function. The stability score of f is given by:

(3.1)  S^SS(f) = max_{λ∈Λ} Π̂_f^λ.

There are two methods of choosing features from the stability score: the first is to set a threshold π_thr and select the features whose stability scores exceed this threshold. Second, given the number of features we desire to include in the model, we can choose the top features ranked by stability score, as in filter-based feature selection methods; this is the method used in our experiments. Stability selection resembles a mixture of multiple models with different regularization parameters by using the maximum selection probability, as in Eq. (3.1). The choice of the maximum selection probability as the stability score is questioned in some discussions of the paper [9]. In this paper, we address this issue by computing the stability score in an alternative way; specifically, we choose the average of the top-k selection probabilities:

(3.2)  S^SS_k(f) = (1/k) Σ_{λ : Π̂_f^λ ranks top-k} Π̂_f^λ.

It follows that the original method of computing the stability score in Eq. (3.1) is a special case of this more general form in Eq. (3.2) given k = 1. Intuitively, as k increases the selection score uses information from more parameters. In this sense, given a large k, stability selection boosts features that are more stable compared to others in the given parameter range Λ. Features will have high scores if they perform well under many parameter settings. On the other hand, stability selection with a small k boosts features that are sensitive to the parameters. A feature may have a high score even if it performs well only in a few parameter settings.
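Given the matrix of selection probabilities Π̂_f^λ (features by regularization values), the two scores in Eq. (3.1) and Eq. (3.2) reduce to a max versus a top-k average along the λ axis. A minimal NumPy sketch, with variable names of our own choosing:

```python
import numpy as np

def topk_stability_scores(P, k):
    """P: (n_features, n_lambdas) selection-probability matrix.
    Returns Eq. (3.2) scores; k = 1 recovers the original score, Eq. (3.1)."""
    # sort each row in descending order and average the k largest probabilities
    top = np.sort(P, axis=1)[:, ::-1][:, :k]
    return top.mean(axis=1)

P = np.array([[0.9, 0.2, 0.1],   # "peaky" feature: high under one lambda only
              [0.7, 0.7, 0.7]])  # "stable" feature: moderately high everywhere
print(topk_stability_scores(P, k=1))  # max per row -> [0.9, 0.7]
print(topk_stability_scores(P, k=3))  # full average -> [0.4, 0.7]
```

Note how the ranking of the two features flips between k = 1 and k = 3, which is exactly the behavior described above: a large k favors the feature that is stable across the parameter range.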
To achieve the best performance, the selection of k is data-dependent and can be determined by using the standard cross-validation technique on the training data. Since the scores for different k can be easily calculated given the selection probabilities, the cross-validation can be performed at very low cost. The algorithm for computing top-k stability selection is given in Algorithm 1. Though this paper focuses on top-k stability selection for classification problems, it can also be applied to feature selection in regression settings, for example, by using Lasso in top-k stability selection.

Algorithm 1: Top-k Stability Selection
Input: dataset T = {x_i, y_i}, i = 1, ..., n; iteration number β; parameter set Λ
Output: top-k stability scores S^SS_k(f) for all f
 1: for j = 1 to β do
 2:   subsample D_j from T without replacement
 3:   for λ ∈ Λ do
 4:     compute the sparse logistic regression on D_j using parameter λ and obtain the result ŵ_j^λ
 5:     store the indices of the selected features: Ŝ^λ(D_j) = {f : ŵ_{j,f}^λ ≠ 0}
 6:   end for
 7: end for
 8: for f ∈ F do
 9:   compute the selection probability for all λ ∈ Λ: Π̂_f^λ = (1/β) Σ_j 1{f ∈ Ŝ^λ(D_j)}
10:   compute the top-k stability score: S^SS_k(f) = (1/k) Σ_{λ : Π̂_f^λ ranks top-k} Π̂_f^λ
11: end for

Another open question in stability selection is how to obtain the set of parameters Λ. Stability selection is said to be insensitive to the selected regularization parameters [9]. Practically, however, the parameter values should not be too large, or very few features are selected in each iteration; nor should they be very small, in which case the selection probability will be very high for all the features. The selection of the parameters also depends on the data set used. Some datasets may be very sensitive to the parameter chosen, in terms of the sparsity of the sparse logistic regression model. In this paper we use a heuristic algorithm to compute a proper set of parameters Λ. The algorithm first uniformly samples a set of parameters from the entire range, finds the minimum λ_min and maximum λ_max that give us p/3 features and very few features, respectively, and then uniformly chooses 8 parameters in this range.
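A compact sketch of Algorithm 1, substituting scikit-learn's L1-penalized LogisticRegression for the SLEP solver used in the paper; the lambda grid, β, and toy dataset here are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def topk_stability_selection(X, y, lambdas, beta=50, k=4, seed=0):
    """Return top-k stability scores S_k for each feature (Algorithm 1 sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Pi[f, l] accumulates how often feature f is selected under lambdas[l]
    Pi = np.zeros((p, len(lambdas)))
    for _ in range(beta):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample w/o replacement
        for l, lam in enumerate(lambdas):
            clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
            clf.fit(X[idx], y[idx])
            Pi[:, l] += (np.abs(clf.coef_.ravel()) > 1e-8)
    Pi /= beta  # selection probabilities
    # average of the k largest selection probabilities per feature (Eq. 3.2)
    return np.sort(Pi, axis=1)[:, -k:].mean(axis=1)

# toy data: only the first 3 of 20 features are informative
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)
scores = topk_stability_selection(X, y, lambdas=[0.5, 1.0, 2.0, 5.0], beta=20)
print(np.argsort(scores)[::-1][:3])  # the informative features should rank on top
```

Note the inversion C = 1/λ: scikit-learn parameterizes regularization strength by its reciprocal, so larger λ values in the grid correspond to sparser fits.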
In the case that the data is sensitive to the parameter, where λ_min and λ_max reduce to the same value λ_s, the algorithm iterates the above search on [λ_s − ε, λ_s + ε], where ε is a small value that is reduced in each iteration. One of the key technical contributions of this paper is to establish, in the next section, the upper bound on the average number of falsely selected features for the proposed top-k stability selection.

3.2 Theoretical Analysis

In this section, we show the theoretical upper bound on the average number of falsely selected features in top-k stability selection. The error bound of the original stability selection in [9] is a special case of our established bound when k = 1. We show that our error bound is guaranteed to be at least as tight as the bound when k = 1, and that the proposed top-k stability selection is more robust by utilizing more information from the selection probabilities. Let w ∈ R^p be the ground truth of the sparse model, and S be the set of non-zero components of w, i.e., S = {f : w_f ≠ 0}. Let T be the total n samples {1, 2, ..., n} and D be a random subsample of T of size ⌊n/2⌋ drawn without replacement. For simplicity, we denote the selected features Ŝ^λ(T) by Ŝ^λ except when the dependence on the data needs to be clarified. The set of zero components of w is denoted by N = {f : w_f = 0}, and the estimated set of zeros under the model with regularization parameter λ is N̂^λ. It follows that Ŝ^Λ = ∪_{λ∈Λ} Ŝ^λ and N̂^Λ = ∪_{λ∈Λ} N̂^λ. The average number of features selected by models built on Λ is u_Λ = E|Ŝ^Λ|. The selection probability of a subset of features K ⊆ {1, 2, ..., p} under the model with λ is Π̂_K^λ = P(K ⊆ Ŝ^λ(D)), where the probability P is with respect to the randomness of the subsampling. By averaging the top-k selection probabilities of features, we define the top-k stable features as follows:

Definition 1. (Top-k Stable Features) For a cut-off threshold π_thr with 0 < π_thr < 1 and a set of regularization parameters Λ, the set of top-k stable features is defined as

Ŝ^stable_k = {f : (1/k) Σ_{i=1}^k Π̂_f^{λ_i} ≥ π_thr},

where Π̂_f^{λ_i} = max_{λ∈Λ_i} Π̂_f^λ, Λ_i is defined as Λ \ {λ_1, λ_2, ..., λ_{i−1}} for i = 1, ..., k, λ_i = argmax_{λ∈Λ_i} Π̂_f^λ, and Λ_1 = Λ.

Note that the top-k stable features are defined with respect to a single feature. The definition and error control can be easily extended to the case of a subset of features K ⊆ {1, 2, ..., p}. Let V be the set of falsely selected features of top-k stability selection, V = N ∩ Ŝ^stable_k. The average number of falsely selected features by top-k stability selection, E|V|, is guaranteed to be controlled by the following theorem:

Theorem 3.1.
(Top-k Error Control) Assume that the original feature selection method is not worse than random guessing, i.e., E|S ∩ Ŝ^Λ| / E|N ∩ Ŝ^Λ| ≥ |S| / |N|, and that for each λ ∈ Λ the distribution of {1{f ∈ Ŝ^λ}, f ∈ N} is exchangeable. Then

(3.3)  E|V| ≤ ( (1/k) Σ_{i=1}^k u²_{Λ,i} ) / ( (2π_thr − 1) p ),

where 1/2 < π_thr < 1, u²_{Λ,i} = (E_f[u_{Λ_i}])², i = 1, 2, ..., k, and Λ_i is defined as in the definition of top-k stable features.

The proof of Theorem 3.1 is given in the supplemental materials [29]. Obviously, the error bound E|V| ≤ u²_Λ / ((2π_thr − 1) p) in [9] is a special case of our bound in Eq. (3.3) given k = 1. Each u_{Λ,i}, i = 1, 2, ..., k, measures the average of u_{Λ_i} with respect to all features. Since u²_{Λ,i} ≤ u²_Λ for i = 1, 2, ..., k, we have (1/k) Σ_i u²_{Λ,i} ≤ u²_Λ, which says that the upper bound on the error of top-k stability selection is guaranteed to be at least as tight as the original bound. Here u²_{Λ,i} expresses the average number of features selected from Λ excluding the top i − 1 λ's that maximize the selection probability of the features. By utilizing the information from the u²_{Λ,i}'s, top-k stability selection is more robust in terms of the average number of falsely selected features, compared with k = 1 stability selection.

4 Experiments

In the experiments we study the empirical performance of feature selection algorithms and classification methods on both synthetic and real clinical prediction datasets. The implementation of top-k stability selection uses the Lasso and the sparse logistic regression from the SLEP package [5].

4.1 Simulation

We use synthetic datasets to study the proposed top-k selection probability in the regression setting. We start by generating the true model w, with 50 non-zero elements at random locations drawn from a standard normal distribution. We then generate a data matrix X with elements generated i.i.d. from a standard normal distribution. The response is generated using y = Xw + ϵ, where ϵ ~ N(0, 0.5). We apply top-k stability selection on the data (X, y), and calculate the number of falsely selected features |V| = |N ∩ Ŝ^stable_k| and the ratio |V| / |Ŝ^stable_k| using k ∈ {1, 2, 4, 8, 10}. We repeat the experiments and report the averaged results in Figure 2.

Figure 2: Comparison of top-k stability selection with k = 1, 2, 4, 8, 10, plotting the number of falsely selected features (left) and the ratio of falsely selected features (right) against the threshold of the selection probability. As k increases, we observe improved empirical performance of top-k stability selection in terms of both the number of false positive features included in the model (left) and the ratio (right), which is consistent with the analysis in Theorem 3.1.

We observe that, for a given threshold on the stability score, top-k stability selection performs better when k > 1. As k increases, we observe improved empirical performance in terms of both the number and the ratio of false positive features included in the model, which is consistent with the discussion of Theorem 3.1.

4.2 Clinical Application Setting

The individual and societal impact of heart failure (HF) is staggering. One in five US citizens over age 40 is expected to develop HF during their lifetime. It is currently the leading cause of hospitalization among Medicare beneficiaries. The Framingham criteria [6], the most commonly used diagnostic criteria for HF, are signs and symptoms that are often buried in clinical notes and may not appear in the structured fields of the EHR. Framingham criteria include acute pulmonary edema (APEdema), ankle edema, dyspnea on ordinary exertion, etc. Despite their importance, Framingham criteria are not systematically documented in structured EHRs. Processing clinical notes requires manual annotation for building customized natural language processing (NLP) applications, which is both expensive and time-consuming. Plenty of structured EHR data, on the other hand, is available, including diagnoses, medications, and lab results.
Our goal here is to construct features from structured EHR data and to use only a small number of labeled annotations of Framingham criteria to build predictive models, so that we can identify individuals that are likely to exhibit Framingham criteria even though they are not documented. The reasons for identifying potentially missing Framingham criteria are (1) to enable timely diagnosis of HF, and (2) to use the Framingham criteria as additional features for predicting future onset of HF. In this paper we construct 9 datasets targeting different Framingham criteria from the EHR of approximately 4,000 patients over 5 years from a hospital. The sample sizes of the data are given in Table 1, where each sample corresponds to exactly one patient in the study. The features include diagnosis features, medication features, and lab test features.

Table 1: The sample sizes (training/testing) of the EHR data sets, one per task: APEdema, AnkleEdema, ChestPain, DOExertion, Neg-Hepatomegaly, Neg-JVDistention, Neg-NightCough, Neg-PNDyspnea, Neg-S3Gallop.

4.3 Feature Selection Comparison

In this experiment we evaluate the performance of top-k stability selection given different k against existing feature selection methods, including Fisher Score [5], Relief [2], Gini Index [8], Information Gain [3], χ² [3], and Minimum-Redundancy Maximum-Relevance (mRMR) [2]. Note that for the feature selection methods that only take categorical inputs, we can still use continuous features by applying discretization techniques [4]. For feature selection algorithms other than stability selection (SS), we use the implementations from an existing package [28]. For each feature selection algorithm, we compute the feature ranking on the training data. We then use the top t features to build a classification model using logistic regression and evaluate the model on the testing data, where we vary t from 2 to 20. To study the effects of top-k stability selection as k changes, we treat k = 1, 2, 4, 8 as different feature selection algorithms in the evaluation.
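The evaluation protocol just described (rank features on the training split, keep the top t, fit logistic regression, score AUC on the test split) can be sketched as follows; the ranking function is pluggable, and the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_with_top_t_features(ranking, t, X_tr, y_tr, X_te, y_te):
    """ranking: feature indices ordered best-first (from any selector)."""
    cols = ranking[:t]
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 30))
y = (X[:, 0] - X[:, 1] + 0.8 * rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# a stand-in ranking (e.g., from top-k stability selection): true features first
ranking = np.array([0, 1] + list(range(2, 30)))
for t in (2, 10, 20):
    print(t, round(auc_with_top_t_features(ranking, t, X_tr, y_tr, X_te, y_te), 3))
```

In this harness the selector and the classifier are decoupled, so any of the methods listed above can be dropped in by swapping the `ranking` array.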
Since the class distribution of the data sets is not balanced, we focus on the Area Under the Curve (AUC) metric. Given a specific feature number, for each feature selection method we select that number of top-ranked features, build classification models using logistic regression on the training data, and evaluate the models on the testing data. We repeat this procedure 10 times, and the average AUC over the 10 runs is used to rank the feature selection algorithms. We report the mean rank and standard deviation over the 9 datasets in Table 2. We highlight the top 3 feature selection algorithms according to the mean rank. One important observation is that the stability selection algorithms perform better than all other feature selection algorithms. The performance results also demonstrate the potential of top-k stability selection with k ≥ 2. These results are consistent with our theoretical analysis in Section 3.2.

Table 2: Average AUC ranks (mean ± standard deviation) of the feature selection algorithms over the 9 datasets, for each feature number, comparing Relief, Fisher, Gini, InfoGain, ChiSquare, mRMR, and SS with k = 1, 2, 4, 8.

Table 3: Classification performance on the EHR datasets in terms of AUC, comparing three classifiers: support vector machine (SVM), random forest (RanF) and logistic regression (LReg).

Sparse Learning Comparison. In our experiments, stability selection uses sparse logistic regression, which is the most popular sparse learning method for classification and also serves as an embedded feature selection method itself via the ℓ1-norm penalty. This motivates us to investigate the performance of directly applying sparse logistic regression to the data sets. We vary the regularization parameter of sparse logistic regression and evaluate the performance in terms of AUC. For comparison, we perform top-4 stability selection on the same training data and use the same number of features as selected in the sparse logistic regression model to build models using logistic regression. We show the results for the AnkleEdema and DOExertion datasets in Figure 3. Since different regularization parameters may give the same number of features, there may be more than one value at each feature number. We observe that when the same number of features is selected, the approach using stability selection outperforms sparse logistic regression. Also note that sparse logistic regression is sensitive to the parameter, the estimation of which requires cross-validation. In the approach of logistic regression after stability selection, however, the performance is not very sensitive to the number of features selected. We observe similar patterns for different k values.
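The sparse-learning comparison above can be sketched as follows: an L1-penalized model is used directly, then a plain logistic regression is refit on the same number of top-ranked features. The ranking here is a stand-in derived from the sparse fit itself (the paper uses top-k stability selection), and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 30))
y = (X[:, 0] - X[:, 1] + 0.8 * rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

for C in (0.02, 0.1, 0.5):
    # embedded selection: L1-penalized logistic regression used directly
    sparse = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_tr, y_tr)
    n_sel = int(np.sum(np.abs(sparse.coef_) > 1e-8))
    auc_sparse = roc_auc_score(y_te, sparse.decision_function(X_te))
    # refit: plain logistic regression on the same number of top-ranked features
    ranking = np.argsort(-np.abs(sparse.coef_.ravel()))
    cols = ranking[:max(n_sel, 1)]
    refit = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc_refit = roc_auc_score(y_te, refit.decision_function(X_te[:, cols]))
    print(C, n_sel, round(auc_sparse, 3), round(auc_refit, 3))
```

Varying C traces out how the embedded selector's AUC depends on the regularization strength, which is the sensitivity the text attributes to sparse logistic regression.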
4.4 Classification Methods Comparison

In this experiment we study the performance of popular classifiers on the EHR data sets. We include SVM, random forest (RanF), and logistic regression (LReg) in the study. For SVM, 5-fold cross validation is used to estimate the best parameter c. In random forest we fix the tree number at 50. When building the classification models, we include the top 20 features ranked by stability selection (k = 4). The performance of the classifiers on the 9 datasets is given in Table 3. We observe that on all data sets, logistic regression performs better than the other classifiers in terms of AUC.

Figure 3: The AUC of sparse logistic regression (Sparse LReg) and of logistic regression after top-4 stability selection (SS k = 4), over randomly split training/testing datasets, plotted against the feature number for the AnkleEdema and DOExertion datasets. In each iteration we vary the regularization parameter of the Sparse LReg and choose the same number of features for SS. Note that different regularization parameters may give the same number of features.

Sensitivity analysis. We perform experiments to study the sensitivity of the methods with respect to their parameters. We randomly split the AnkleEdema dataset into training and testing sets (we observe similar patterns on the other datasets). We perform stability selection (k = 4) on the training data, and choose the top 20 features to build the classification models using different parameters. We repeat the procedure 10 times and report the AUC. For SVM we vary the parameter c such that log(c) ranges from −4 to +6. For random forest we vary the tree number from 5 to 100. For logistic regression, we vary the ℓ2-norm regularization term over the range log(λ) ∈ [−4, 6]. The results are given in Figure 4. We find that the performance of SVM and logistic regression on this dataset is very sensitive to the parameter. For logistic regression, we obtain better performance with a smaller parameter value, which yields less regularization.
This is because when a small number of features is included in the model, it is not necessary to add ℓ2-norm regularization. For SVM, however, cross-validation is necessary in order to select the best parameter. Unlike SVM and LReg, random forest is not sensitive to the tree number parameter, and its performance remains almost the same once the tree number is large enough.

Figure 4: The sensitivity of the classification methods with respect to varying parameters. SVM (top) is sensitive to the parameter c; random forest (middle) is not sensitive to the tree number parameter; logistic regression (bottom) is sensitive to the regularization parameter.

4.5 Top-k Stability Score Case Study

In this experiment we study the top-k stability scores given different k. We show the distribution of the stability scores for the DOExertion dataset in Figure 5. We presented the top features to clinical experts and confirmed their clinical validity in many cases. For example, G1-58 is shortness of breath (a symptom diagnosis), which includes DOExertion as a special case. G3-33 are sympathomimetic agents, used as nasal decongestants to help breathing. Moreover, we observe that the absolute value of the stability score shrinks as k increases. The shrinkage is expected because we average the selection probabilities over more regularization parameters. In the stability score distribution given k = 1, we find that the average stability score decreases continuously. As k increases, we find that steep drops in the stability score emerge. In Figure 5 (k = 2), there is a drop between features G1-44 and G3-4. When we increase to k = 8, we find a more significant drop after the first three features. Such drop information can potentially be used as guidance for the number of features to include, or equivalently the choice of threshold in stability selection. We are also aware that the rank of a feature can differ for different k. The feature G3-6, a cardioselective beta blocker, has a subtle connection to DOExertion, since it is often prescribed to HF patients, and DOExertion is a common symptom among HF patients. G3-6 ranks low given k = 1; its rank improves steadily (to 6 and finally 4) as k increases to 2, 4 and 8.
The improvement in rank implies that the feature's selection probability may not be the highest under any single regularization parameter, but that across all parameters in Λ its selection probabilities are consistently high. Such features (like G3-6 in this case) are considered stable with respect to the parameters, and can only be identified with top-k stability selection.

Figure 5: Average stability scores of DOExertion, shown for k = 1, 2, 4 and 8.

5 Conclusion

The goal of disease-specific risk prediction is to assess the risk of a patient in developing a target disease based on his/her health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed to characterize patient profiles. In this paper, we propose top-k stability selection, which generalizes a powerful sparse learning feature selection method by overcoming its limitation on parameter selection and providing stronger theoretical properties.
In a large set of real clinical prediction datasets, the top-k stability selection methods outperform the original stability selection and many existing feature selection methods. We compare three competitive classification methods to demonstrate the effectiveness of the features selected by our proposed method. We also show, in several clinical applications on predicting heart failure related symptoms, that top-k stability selection can successfully identify important features that are clinically meaningful.

Recently the data mining and machine learning community has concentrated research effort on structured sparsity. In future work, we plan to combine stability selection with structured sparsity and apply it to research on EHRs. Top-k stability selection is based on the Lasso and is therefore vulnerable to strongly correlated features. We plan to develop methods that are more robust against such ill-conditioned design matrices.

Acknowledgments

We would like to thank Dr. Steven Steinhubl and Dr. Walter F. Stewart from Geisinger for facilitating the testing of our algorithm. We also want to thank Roy Byrd from IBM for providing symptom annotations on the data. This work was supported in part by NSF IIS, CCF-2577 and NIH RLM73.

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[2] S. Boucheron and O. Bousquet. Concentration inequalities. In Advanced Lectures in Machine Learning. Springer, 2004.
[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In ICML, pages 194-202, 1995.
[5] R. Duda, P. Hart, and D. Stork. Pattern Classification and Scene Analysis (2nd ed.).
[6] H. Eleftherohorinou, C. Hoggart, V. Wright, M. Levin, and L. Coin. Pathway-driven gene stability selection of two rheumatoid arthritis GWAS identifies and validates new susceptibility genes in receptor mediated signalling pathways. Hum. Mol. Genet., 2011.
[7] G. Fonarow, K. Adams, W. Abraham, C. Yancy, W. Boscardin, et al. Risk stratification for in-hospital mortality in acutely decompensated heart failure. Journal of the American Medical Association, 293(5):572, 2005.
[8] C. Gini. Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini T). Rome: Libreria Eredi Virgilio Veschi, 1912.
[9] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML. ACM, 2009.
[10] K. Kira and L. Rendell. A practical approach to feature selection. In Intl. Workshop on Machine Learning, 1992.
[11] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(6):957-968, 2005.
[12] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, 2008.
[13] H. Liu and R. Setiono. Chi2: feature selection and discretization of numeric attributes. In Intl. Conf. on Tools with Artificial Intelligence, 1995.
[14] J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In KDD. ACM, 2009.
[15] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[16] P. McKee, W. Castelli, P. McNamara, and W. Kannel. The natural history of congestive heart failure: the Framingham study. New England Journal of Medicine, 285(26):1441-1446, 1971.
[17] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 70(1):53-71, 2008.
[18] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.
[19] N. Meinshausen and P. Bühlmann. Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 72(4):417-473, 2010.
[20] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Asilomar Conference on Signals, Systems and Computers. IEEE, 1993.
[21] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
[22] S. Ryali, T. Chen, K. Supekar, and V. Menon. Estimation of functional connectivity in fMRI data using stability selection-based sparse partial correlation with elastic net penalty. NeuroImage, 59, 2012.
[23] D. Stekhoven, L. Hennig, G. Sveinbjörnsson, I. Moraes, M. Maathuis, and P. Bühlmann. Causal stability ranking. 2012.
[24] M. Stern, K. Williams, D. Eddy, and R. Kahn. Validation of prediction of diabetes by the Archimedes model and comparison with other predicting models. Diabetes Care, 31(8):1670, 2008.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 58(1):267-288, 1996.
[26] M. Vounou, E. Janousova, R. Wolz, J. Stein, P. Thompson, D. Rueckert, and G. Montana. Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer's disease. NeuroImage, 2011.
[27] J. Wu, J. Roy, and W. Stewart. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Medical Care, 48(6):S106, 2010.
[28] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. ASU feature selection repository, featureselection.asu.edu, 2010.
[29] J. Zhou. jzhou29/papers/jzhousdm3_appendix.pdf.

Patient Risk Prediction Model via Top-k Stability Selection
Appendix A: Proof of Theorem 3.1
Jiayu Zhou, Jimeng Sun, Yashu Liu, Jianying Hu, Jieping Ye
[email protected]

Before proceeding to the proof of Theorem 3.1, we first gain some insight by analyzing the simultaneous selection probability arising from random splittings. Let $D_1$ and $D_2$ be two disjoint subsets of $T$ generated by random splitting, each with sample size $\lfloor n/2 \rfloor$. The simultaneously selected set is then given by
$$\hat{S}^{\mathrm{smlt},\lambda} = \hat{S}^{\lambda}(D_1) \cap \hat{S}^{\lambda}(D_2).$$
The corresponding simultaneous selection probability for any set $K \subseteq \{1,\dots,p\}$ is defined as $\hat{\Pi}^{\mathrm{smlt},\lambda}_K = P\big(K \subseteq \hat{S}^{\mathrm{smlt},\lambda}\big)$.

Lemma 5.1. For any subset $K \subseteq \{1,\dots,p\}$, a lower bound for the average of the top-$k$ maximum simultaneous selection probabilities is given by
$$\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K \;\ge\; \frac{2}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\lambda}_K \;-\; 1.$$

Proof: According to the Lemma in [19], $\hat{\Pi}^{\mathrm{smlt},\lambda}_K \ge 2\hat{\Pi}^{\lambda}_K - 1$ for any set $K \subseteq \{1,\dots,p\}$. We therefore have
$$\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K \;\ge\; \frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \big(2\hat{\Pi}^{\lambda}_K - 1\big) \;=\; \frac{2}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\lambda}_K \;-\; 1.$$
This completes the proof of Lemma 5.1.

Lemma 5.2. For a subset of features $K \subseteq \{1,\dots,p\}$, if $P\big(K \subseteq \hat{S}^{\Lambda_i}\big) \le \epsilon_i$ for $i = 1, 2, \dots, k$, where $\Lambda_i$ is defined as in the definition of top-$k$ stable features and $\hat{S}$ is estimated from $\lfloor n/2 \rfloor$ samples, then
$$P\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K \ge \xi\Big) \;\le\; \frac{1}{\xi k}\sum_{i=1}^{k} \epsilon_i^2.$$

Proof: Let $D_1, D_2 \subseteq \{1,\dots,n\}$ be two subsamples of $T$ with size $\lfloor n/2 \rfloor$ generated by random splitting. Define the event $B_\lambda = \big\{K \subseteq \hat{S}^{\lambda}(D_1) \cap \hat{S}^{\lambda}(D_2)\big\}$; the simultaneous selection probability is then $\hat{\Pi}^{\mathrm{smlt},\lambda}_K = E\big(\mathbb{1}_{B_\lambda} \mid T\big)$, where the expectation $E$ is with respect to the random splitting. Since $B_\lambda \subseteq B_{\Lambda_i} \equiv \big\{K \subseteq \hat{S}^{\Lambda_i}(D_1) \cap \hat{S}^{\Lambda_i}(D_2)\big\}$ for every $\lambda \in \Lambda_i$, it follows immediately that for $i = 1, 2, \dots, k$,
$$\max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K = \max_{\lambda \in \Lambda_i} E\big(\mathbb{1}_{B_\lambda} \mid T\big) \;\le\; E\big(\mathbb{1}_{B_{\Lambda_i}} \mid T\big).$$

The inequality $P\big(K \subseteq \hat{S}^{\Lambda_i}\big) \le \epsilon_i$ for sample size $\lfloor n/2 \rfloor$ implies, the two subsamples being disjoint, that
$$P\big(B_{\Lambda_i}\big) = P\big(K \subseteq \hat{S}^{\Lambda_i}(D_1)\big)\, P\big(K \subseteq \hat{S}^{\Lambda_i}(D_2)\big) \;\le\; \epsilon_i^2.$$
Thus
$$E\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K\Big) \;\le\; \frac{1}{k}\sum_{i=1}^{k} E\big(E\big(\mathbb{1}_{B_{\Lambda_i}} \mid T\big)\big) \;=\; \frac{1}{k}\sum_{i=1}^{k} P\big(B_{\Lambda_i}\big) \;\le\; \frac{1}{k}\sum_{i=1}^{k} \epsilon_i^2.$$
Using the Markov-type inequality [2], we have
$$\xi\, P\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K \ge \xi\Big) \;\le\; E\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K\Big),$$
thus
$$P\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_K \ge \xi\Big) \;\le\; \frac{1}{\xi k}\sum_{i=1}^{k} \epsilon_i^2.$$
This completes the proof of the lemma.

Proof of Theorem 3.1 (Top-k Error Control): From Theorem 1 in [19] we have
$$P\big(j \in \hat{S}^{\Lambda}\big) \le u_{\Lambda}/p \quad \text{for } j \in N;$$
it follows immediately that for all $j \in N$ and $i = 1, 2, \dots, k$ we have $P\big(j \in \hat{S}^{\Lambda_i}\big) \le u_{\Lambda_i}/p$. Using Lemma 5.2 with $\epsilon_i = u_{\Lambda_i}/p$, we have
$$P\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_j \ge \xi\Big) \;\le\; \frac{1}{\xi k}\sum_{i=1}^{k} \frac{u_{\Lambda_i}^2}{p^2}.$$
By Lemma 5.1, it follows that
$$P\big(\hat{\Pi}_j \ge \pi_{thr}\big) \;\le\; P\Big(\frac{1}{k}\sum_{i=1}^{k} \max_{\lambda \in \Lambda_i} \hat{\Pi}^{\mathrm{smlt},\lambda}_j \ge 2\pi_{thr} - 1\Big) \;\le\; \frac{1}{k}\sum_{i=1}^{k} \frac{u_{\Lambda_i}^2}{p^2\,(2\pi_{thr}-1)}.$$
Therefore we have
$$E(V) = \sum_{j \in N} P\big(\hat{\Pi}_j \ge \pi_{thr}\big) \;\le\; \frac{1}{k}\sum_{i=1}^{k} \frac{u_{\Lambda,i}^2}{p\,(2\pi_{thr}-1)},$$
where $u_{\Lambda,i}^2 = E[u_{\Lambda_i}]^2$. This completes the proof.
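The two elementary probability facts used above can be checked numerically. The following sketch (an illustration of ours, not part of the paper) verifies that for two selection events of probability $\pi$ each, the joint probability under independence, $\pi^2$ (the case of two disjoint halves of a random split), always satisfies the Bonferroni-style lower bound $2\pi - 1$ that underlies Lemma 5.1:

```python
# Sanity check (illustrative, not from the paper) of the inequality behind
# Lemma 5.1. For two selection events A, B with P(A) = P(B) = pi, Bonferroni
# gives P(A and B) >= 2*pi - 1. When A and B are independent, as for two
# disjoint halves of a random split, P(A and B) = pi**2, and the bound still
# holds because pi**2 - (2*pi - 1) = (pi - 1)**2 >= 0.

def bound_holds(pi):
    simultaneous_independent = pi ** 2   # P(A and B) under independence
    lower_bound = 2 * pi - 1             # Bonferroni lower bound
    return simultaneous_independent >= lower_bound

for pi in [0.0, 0.3, 0.5, 0.75, 0.9, 1.0]:
    assert bound_holds(pi)
print("pi**2 >= 2*pi - 1 holds for all tested pi")
```

Note that for small $\pi$ the bound $2\pi - 1$ is negative and therefore vacuous; it only becomes informative for features whose selection probability exceeds 1/2, which is why the threshold $\pi_{thr}$ in stability selection is taken above 1/2.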


More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research

More information

Auditing EMR System Usage. You Chen Jan, 17, 2013 [email protected]

Auditing EMR System Usage. You Chen Jan, 17, 2013 You.chen@vanderbilt.edu Auditing EMR System Usage You Chen Jan, 17, 2013 [email protected] Health data being accessed by hackers, lost with laptop computers, or simply read by curious employees Anomalous Usage You Chen,

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Integrated airline scheduling

Integrated airline scheduling Computers & Operations Research 36 (2009) 176 195 www.elsevier.com/locate/cor Integrated airline scheduling Nikolaos Papadakos a,b, a Department o Computing, Imperial College London, 180 Queen s Gate,

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Manufacturing and Fractional Cell Formation Using the Modified Binary Digit Grouping Algorithm

Manufacturing and Fractional Cell Formation Using the Modified Binary Digit Grouping Algorithm Proceedings o the 2014 International Conerence on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Manuacturing and Fractional Cell Formation Using the Modiied Binary

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])

More information

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013 Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

More information

Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

Speaker First Plenary Session THE USE OF BIG DATA - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD Optum Labs Cambridge, MA, USA Statistical Methods and Machine Learning ISPOR International

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

8. Exception Handling

8. Exception Handling 8. Exception Handling February 2011 NII52006-10.1.0 NII52006-10.1.0 Introduction This chapter discusses how to write programs to handle exceptions in the Nios II processor architecture. Emphasis is placed

More information

Big Data Analytics and Healthcare

Big Data Analytics and Healthcare Big Data Analytics and Healthcare Anup Kumar, Professor and Director of MINDS Lab Computer Engineering and Computer Science Department University of Louisville Road Map Introduction Data Sources Structured

More information