Decision Trees

1 Introduction to tree-based methods

1.1 The general setting

Supervised learning problem. Data: {(y_i, x_i) : i = 1, ..., n}, where y is the response, x is a p-dimensional input vector, and n is the sample size. The response y can be binary or categorical (classification, or decision, trees) or continuous (regression trees). The predictor vector x can be a mixture of continuous and categorical variables.

A tree-based method (or recursive partitioning) recursively partitions the predictor space to model the relationship between the response y and the predictors (x_1, ..., x_p).

1.2 Advantages of trees

- It is nonparametric, requiring few statistical assumptions.
- It can be applied to various data structures, including both ordered and categorical variables, in a simple and natural way. In particular, recursive partitioning is exceptionally efficient in handling categorical predictors.
- It ameliorates the curse of dimensionality. When p > 2, conventional nonparametric smoothing techniques become computationally infeasible. As p increases, parametric models encounter problems too, such as variable selection, transformations, and interaction handling.
- It performs stepwise variable selection, complexity reduction, and (implicit) interaction handling in an automatic manner.

- It is invariant under all monotone transformations of individual ordered predictors.
- It is efficient in handling missing values and provides variable importance rankings.
- The output gives easily understood and interpreted information. Interpretability is one of the main advantages of decision trees compared to black-box methods such as neural networks.
- The hierarchical (binary) tree structure automatically and optimally groups data, which makes it an excellent tool in medical prognosis/diagnosis.
- It provides a natural platform for handling heterogeneity in the data by allowing different models to be fit to different groups (treed models).
- Tradeoff: tree models are robust yet unstable (small perturbations of the data can change the fitted tree substantially).

1.3 A brief history of tree modeling

- Morgan and Sonquist (1963): Automatic Interaction Detection (AID).
- Breiman, Friedman, Olshen and Stone (1984): Classification And Regression Trees (CART). Addressed tree size selection (pruning) and many other issues such as missing values and variable importance. Greatly advanced the use of tree methods in various application fields.
- Extensions: Freund and Schapire (1996) and Friedman (2001): boosting. Breiman (1996): bagging. Breiman (2001): random forests.

1.4 An example and terminology

The stage C prostate cancer example. The dataset contains information about 146 stage C prostate cancer patients. The main clinical endpoint of interest is whether the disease recurs after initial surgical removal of the prostate, and the time interval to that progression (if any).

The endpoint of this example is pgstat, which takes the value 1 if the disease has progressed and 0 if not. Below is a short description of the variables. The data form a matrix of 146 rows and 8 columns corresponding to the following 8 variables:

- pgtime = time to progression in years
- pgstat = status at last follow-up: 1 = progressed, 0 = censored
- age = age at diagnosis
- eet = early endocrine therapy: 1 = no, 2 = yes
- g2 = % of cells in G2 phase, from flow cytometry
- grade = tumor grade (1, 2, 3, 4)
- gleason = Gleason score (competing grading system, 3-10)
- ploidy = diploid/tetraploid/aneuploid DNA pattern

The file stagec.r shows how to construct a classification tree predicting pgstat from the last 6 variables (age, eet, g2, grade, gleason, ploidy); a sketch of such a fit in R appears after the references below.

Terminology: node, root node, parent node, child node, split, leaf (terminal) node, internal node, and path.

- The tree is built from the root node (top) to the leaf/terminal nodes (bottom).
- A record first enters the root node. A test (split) is applied to determine to which child node it should go next. The process is repeated until the record arrives at a leaf (terminal) node.
- The path from the root to a leaf node provides an expression of a rule.

1.5 References

[1] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees, Chapman and Hall. Sections 2.2 and 2.7.
[2] Atkinson, E. J. and Therneau, T. M. An Introduction to Recursive Partitioning Using the RPART Routines. Mayo Foundation, February 11,
[3] Statistical Learning from a Regression Perspective by Berk (2008). Sections
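The course file stagec.r is not reproduced in this transcription. The following is a minimal, hedged sketch of what such a fit might look like with the rpart package, assuming the copy of the stagec data frame that ships with rpart and treating pgstat as a factor:

```r
# Minimal sketch (not the course's stagec.r): fit a classification tree for pgstat.
# Assumption: the stagec data frame shipped with the rpart package matches the
# variable description above.
library(rpart)
data(stagec, package = "rpart")
stagec$pgstat <- factor(stagec$pgstat, levels = c(0, 1), labels = c("No", "Prog"))

fit <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec, method = "class")

print(fit)                 # splits, node counts, and class predictions
plot(fit, margin = 0.1)    # tree skeleton
text(fit, use.n = TRUE)    # label splits and show class counts in each node
```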

Decision Trees (continued)

2 Growing a Large Tree

We will follow the CART methodology to develop tree models, which consists of the following three major steps:

1. Grow a large initial tree, T_0;
2. Iteratively truncate branches of T_0 to obtain a sequence of optimally pruned (nested) subtrees;
3. Select the best tree size based on validation provided by either a test sample or cross-validation (CV).

To illustrate, we consider decision trees with binary responses; namely, y_i = 1 when an event of interest occurs to subject i, and 0 otherwise. In this section, we focus on how to grow a large tree. The four elements needed in the initial tree-growing procedure are

1. a set of binary questions to induce a split;
2. a goodness-of-split criterion to evaluate a split;
3. a stop-splitting rule;
4. a rule for assigning every terminal node to a class (0 or 1).

We will discuss each of these elements in the sections that follow.

2.1 Possible Number of Splits

The first problem in tree construction is determining the number of candidate partitions that must be examined at each node. An exhaustive (greedy) search algorithm considers all possible partitions of all input variables at every node of the tree. However, the number of candidate splits tends to increase rapidly when there are too many variables or when there are too many levels in one or more variables, which makes an exhaustive search prohibitively expensive.

Examples

1. Suppose x is an ordinal variable with four levels 1, 2, 3, and 4. What is the total number of possible splits, considering only binary ones?

Solution: the binary splits are 1-234, 12-34, and 123-4, so there are 3 of them. Note that there are L - 1 possible splits for an ordinal variable with L levels.

2. Suppose x is a numerical variable with 100 distinct values. What is the total number of possible splits?

Solution: the formula above for ordinal variables also applies to numerical variables, where L now denotes the number of distinct values in the observed sample. Total number of splits = 100 - 1 = 99.

3. Suppose x is a nominal variable with four categories a, b, c, d. What is the total number of possible binary splits?

Solution: ab-cd, ac-bd, ad-bc, abc-d, abd-c, acd-b, a-bcd. Total number of binary splits = 7. Note that the total number of possible binary splits is 2^(L-1) - 1.

Reducing the Number of Possible Partitions for Nominal Variables

For a categorical predictor that has many levels {b_1, ..., b_L}, one way to reduce the number of splits is to order the levels as {b_(l_1), ..., b_(l_L)} according to the occurrence rate within the node,

    p{1 | b_(l_1)} <= p{1 | b_(l_2)} <= ... <= p{1 | b_(l_L)},

and then treat the predictor as an ordinal input. (See CART, p. 101.)
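For illustration, here is a minimal R sketch (with simulated data, not from the course) that contrasts the exhaustive count 2^(L-1) - 1 with the L - 1 ordered splits obtained after ranking the levels by their within-node occurrence rate:

```r
# Minimal sketch (simulated data): reduce the number of nominal splits by
# ordering the levels by the within-node event rate, then treating the
# predictor as ordinal.
set.seed(1)
y <- rbinom(200, 1, 0.4)                       # binary response
x <- sample(letters[1:6], 200, replace = TRUE) # nominal predictor, L = 6 levels

L <- length(unique(x))
c(all_binary_splits = 2^(L - 1) - 1,           # exhaustive count of binary splits
  ordered_splits    = L - 1)                   # count after ordering the levels

rate <- tapply(y, x, mean)                     # p(1 | level) within this node
ordered_levels <- names(sort(rate))            # rank levels by occurrence rate
x_ord <- factor(x, levels = ordered_levels, ordered = TRUE)

# Candidate splits are now of the form {first k ordered levels} vs. the rest,
# for k = 1, ..., L - 1.
ordered_levels
```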

2.2 Node-Impurity-Based Splitting Criteria

In general, the impurity i(t) of node t can be defined as a nonnegative function of p{0 | t} and p{1 | t}, where p{0 | t} and p{1 | t} denote the proportions of the cases in node t belonging to classes 0 and 1, respectively. More formally, i(t) = φ(p_1), where p_1 = p{y = 1 | t} and the impurity function φ(·) is largest when both classes are equally mixed together and smallest when the node contains only one class. Hence, it has the following properties:

1. φ(p) >= 0;
2. φ(p) attains its minimum 0 when p = 0 or p = 1;
3. φ(p) attains its maximum when p = 1 - p = 1/2;
4. φ(p) = φ(1 - p), i.e., φ(p) is symmetric about p = 1/2.

Common choices of φ include the minimum error, the entropy function, and the Gini index.

The Minimum or Bayes Error

    φ(p) = min(p, 1 - p).

This measure corresponds to the misclassification rate when majority vote is used. The minimum error is rarely used in practice because it does not sufficiently reward purer nodes (CART, p. 99).

The Entropy Function

    φ(p) = -p log(p) - (1 - p) log(1 - p).

Quinlan (1993) first proposed using the reduction of entropy as a goodness-of-split criterion. Ripley (1996) showed that the entropy-reduction criterion is equivalent to using the likelihood-ratio chi-square statistic for association between the branches and the target categories.

The Gini Index

    φ(p) = p(1 - p).

Breiman et al. (1984) proposed using the reduction of the Gini index as a goodness-of-split criterion. It has been observed that this rule has an undesirable end-cut preference problem (Breiman et al., 1984, Ch. 11): it gives preference to splits that result in two child nodes of extremely unbalanced sizes. To resolve this problem, a modification called the delta splitting method has been adopted in both the THAID (Morgan and Messenger, 1973) and CART programs.

Because of the above concerns, from now on "the impurity" refers to the entropy criterion unless stated otherwise.

Computation of i(t)

The computation of impurity is simple when the occurrence rate p{y = 1 | t} in node t is available. In many applications, such as prospective studies, this occurrence rate can be estimated empirically from the data. At other times (e.g., retrospective studies), additional prior information may be required to estimate the occurrence rate.

For a given split s, we have the following 2 x 2 table according to the split and the response:

                  response
    node          0        1        total
    left (t_L)    n_11     n_12     n_1·
    right (t_R)   n_21     n_22     n_2·
    total         n_·1     n_·2     n

In prospective studies, p = p{y = 1 | t_L} and 1 - p = p{y = 0 | t_L} can be estimated by n_12/n_1· and n_11/n_1·, respectively. Hence

    i(t_L) = -(n_12/n_1·) log(n_12/n_1·) - (n_11/n_1·) log(n_11/n_1·).
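For concreteness, here is a minimal R sketch (not from the course materials) of the three impurity functions φ(p), together with the entropy of a child node computed from hypothetical counts:

```r
# Minimal sketch: the three impurity functions phi(p) discussed above,
# evaluated at the class-1 proportion p of a node.
min_error <- function(p) pmin(p, 1 - p)
entropy   <- function(p) ifelse(p %in% c(0, 1), 0, -p * log(p) - (1 - p) * log(1 - p))
gini      <- function(p) p * (1 - p)

p <- c(0, 0.1, 0.3, 0.5)
rbind(min_error = min_error(p), entropy = entropy(p), gini = gini(p))
# All three are 0 at p = 0 and maximal at p = 0.5.

# Entropy of the left child from a 2x2 split table (hypothetical counts n11, n12):
n11 <- 30; n12 <- 10
i_tL <- entropy(n12 / (n11 + n12))
i_tL
```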

In fact, it can be shown that the above entropy criterion is proportional to the maximized log-likelihood associated with t_L. In light of this fact, many node-splitting criteria originate from the maximum of certain likelihood functions. The importance of this observation will be appreciated later.

Goodness-of-Split Measure

Let s be any candidate split and suppose s divides t into t_L and t_R such that the proportions of the cases in t going into t_L and t_R are p_L and p_R, respectively. Define the reduction in node impurity as

    Δi(s, t) = i(t) - [p_L i(t_L) + p_R i(t_R)],

which provides a goodness-of-split measure for s. The best split s* for node t provides the maximum impurity reduction, i.e.,

    Δi(s*, t) = max_{s ∈ S} Δi(s, t).

Then t is split into t_L and t_R according to the split s*, and the search procedure for the best split is repeated on t_L and t_R separately. A node becomes a terminal node when prespecified terminal-node conditions are satisfied.

2.3 Alternative Splitting Criteria

There are two alternative splitting criteria: the twoing rule and the χ2 test.

The twoing rule is an alternative measure of the goodness of a split:

    (p_L p_R / 4) [ Σ_{j=0,1} | p{y = j | t_L} - p{y = j | t_R} | ]^2.

For a binary response, the twoing rule coincides with the use of the Gini index, which has the end-cut preference problem.

The Pearson chi-square test statistic measures the difference between the observed cell frequencies and the expected cell frequencies (under the independence assumption). The p-value associated with the χ2 test may be used as a goodness-of-split measure.
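To make these measures concrete, here is a self-contained R sketch with hypothetical counts (entropy is re-defined here so the snippet stands alone):

```r
# Minimal sketch (hypothetical counts): for one candidate split summarized by
# its 2x2 table, compute the entropy-based impurity reduction delta_i(s, t)
# and the twoing criterion.
entropy <- function(p) ifelse(p %in% c(0, 1), 0, -p * log(p) - (1 - p) * log(1 - p))

n11 <- 30; n12 <- 10   # left child:  class 0, class 1
n21 <- 15; n22 <- 45   # right child: class 0, class 1
nL <- n11 + n12; nR <- n21 + n22; n <- nL + nR
pL <- nL / n;  pR <- nR / n

# Entropy-based goodness of split: i(t) - [pL i(tL) + pR i(tR)]
delta_i <- entropy((n12 + n22) / n) - pL * entropy(n12 / nL) - pR * entropy(n22 / nR)

# Twoing criterion: (pL pR / 4) * (sum_j |p(j|tL) - p(j|tR)|)^2
twoing <- (pL * pR / 4) *
  (abs(n11 / nL - n21 / nR) + abs(n12 / nL - n22 / nR))^2

c(delta_i = delta_i, twoing = twoing)
```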

2.4 Input variables with different numbers of possible splits

There are more splits to consider on a variable with more levels. Therefore, the maximum possible value of the goodness-of-split measure tends to become large as the number of possible splits, m, increases. For example, there is only one split for a binary input variable, whereas there are 511 possible binary splits for a nominal input variable with 10 levels. Thus, all commonly used splitting criteria (e.g., the Gini index, entropy, and the Pearson χ2 test) favor variables with a large number of possible splits. This problem has been identified as the variable selection bias problem (Loh, 2002).

- No adjustment is available for the Gini index.
- The information gain ratio can be used to adjust entropy (Quinlan, 1993):

      information gain ratio = ΔEntropy / (# input levels in parent node).

- A Bonferroni-type adjustment can be used to adjust the χ2 test (Kass, 1980). The Kass adjustment multiplies the p-value by m, the number of possible splits; see the sketch after the references below.
- To identify an unbiased split, Loh (2002) proposed a residual-based method that selects the most important variable first and then applies a greedy search only on that variable to find the best cutpoint.

2.5 References

[1] Statistical Learning from a Regression Perspective by Berk (2008). Section 3.3.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees, Chapman and Hall. Chapters 2 and 4.
[3] Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, Vol. 29, pp.
[4] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12:
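As an illustration of the Kass-style adjustment, here is a minimal R sketch with a hypothetical 2 x 2 split table (the value of m is an arbitrary example):

```r
# Minimal sketch (hypothetical counts): chi-square p-value for one candidate
# split, with a Kass-style Bonferroni adjustment for the number of possible
# splits m on the splitting variable.
split_table <- matrix(c(30, 10,    # left child:  class 0, class 1
                        15, 45),   # right child: class 0, class 1
                      nrow = 2, byrow = TRUE)

p_raw <- chisq.test(split_table, correct = FALSE)$p.value
m     <- 9                          # e.g., an ordinal variable with 10 distinct values
p_adj <- min(1, m * p_raw)          # Bonferroni: multiply the p-value by m
c(raw = p_raw, adjusted = p_adj)
```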

Decision Trees (continued)

3 Tree Pruning

3.1 Motivation

Why do we need tree pruning?

- We want a final tree model that can generalize well to new data.
- A decision tree can be grown until every node is pure, so that the resubstitution misclassification rate is 0.
- A small tree with only a few branches may fail to adapt to enough of the signal.

A Simulated Example from CART (page 60)

(Table omitted in this transcription: number of terminal nodes versus the estimated and true misclassification rates.)

Note that, in the above table:

- The difference between the estimated misclassification rate and the true misclassification rate grows after a certain number of nodes.
- The optimal number of terminal nodes is 10 in this example. Trees with fewer than 10 nodes under-fit the data; trees with more than 10 nodes over-fit the data.

Balance Between Bias and Variance

Tree complexity can be measured by the number of leaves, the number of splits, or the depth. A well-fitted tree has low bias (i.e., it adapts to enough of the signal) and low variance (i.e., it does not adapt to noise). The determination of tree complexity usually involves a balance between bias and variance. An under-fitted tree, which has insufficient complexity, has high bias and low variance. On the other hand, an over-fitted tree has low bias and high variance.

3.2 Before CART: Top-Down Pruning by Stopping Rules

Instead of growing a large tree, one may use a set of stopping rules to decide when to declare a terminal node. Strategies for stopping the growth of a tree include: (1) limit the depth of the tree, (2) set a minimum number of cases in a terminal node, and (3) set a minimum statistical significance that a split has to reach. (These rules correspond roughly to growth-control parameters in tree software; see the sketch below.)

Problems with Top-Down Pruning

- Stopping rules are often subjective.
- Both underfitting and overfitting problems may occur.
- Treatment: bottom-up pruning procedures.
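For reference, the stopping rules above map loosely onto rpart's growth-control parameters. A hedged sketch follows; the numeric values are arbitrary illustrations, rpart's cp threshold plays the role of rule (3) rather than a formal significance level, and mydata and y are placeholders:

```r
# Minimal sketch: stopping rules expressed as rpart growth-control parameters.
# The numeric values are arbitrary illustrations, not recommendations.
library(rpart)

ctrl <- rpart.control(
  maxdepth  = 4,      # (1) limit the depth of the tree
  minbucket = 10,     # (2) minimum number of cases in a terminal node
  cp        = 0.01    # (3) minimum (complexity) improvement a split must achieve
)

# fit <- rpart(y ~ ., data = mydata, method = "class", control = ctrl)
```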

3.3 Misclassification Cost and Cost-Complexity Measure

A few notes:

- CART first grows a large initial tree T_0 using loose stopping rules and then selects a subtree of T_0 as the best tree structure.
- Unfortunately, evaluating all possible subtrees is not computationally feasible even for moderately sized trees, because the number of subtrees grows much faster than the number of terminal nodes in the initial tree.
- To narrow down the choices of subtrees from which the best-sized subtree is to be selected, CART employs the idea of iteratively pruning off the weakest link to obtain a nested set of best subtrees with sizes ranging from 1 to the size of T_0.
- In CART, the complexity of a tree model is determined by the total number of terminal nodes it has.

Some tree terminology:

- Descendant and ancestor: a node t is called a descendant of a higher node h if there is a connected path down the tree leading from h to t. If t is a descendant of h, then h is an ancestor of t.
- Set and size of terminal nodes: let T̃ denote the set of all terminal nodes of T and T − T̃ the set of all internal nodes of T. Furthermore, let |·| denote cardinality, i.e., the number of elements in a set. Therefore, |T̃| represents the number of terminal nodes of tree T. Note that, for a binary tree, |T| = 2|T̃| − 1.
- Subtree of T: a tree T_1 is a subtree of T if T_1 has the same root node as T and every node h of T_1 is a node of T. We denote the subtree relation by T_1 ≤ T.
- Branch: a tree T_h is called a branch of T if T_h is a tree with root node h ∈ T and all descendants of h in T are descendants of h in T_h.
- Pruning a branch: pruning a branch T_h from a tree T consists of deleting from T all descendants of h, that is, cutting off all of T_h except its root node. The pruned tree is denoted by T − T_h.

Illustration of concepts: page 31 (Source: LeBlanc and Crowley, JASA, 1993).

To evaluate branches, define the goodness-of-fit of a tree as

    R(T) = Σ_{t ∈ T̃} R(t),

where R(t) measures the quality (or goodness-of-fit) of node t. Because our ultimate goal is to classify objects, R(t) is commonly chosen as the misclassification rate.

Two types of errors:

- False positive error: a case with true response value 0 (−) is falsely classified as 1 (+).
- False negative error: a case with true response value 1 (+) is falsely classified as 0 (−).

The two types of errors may need to be weighted with different costs.

Modifying Majority Voting by Incorporating Misclassification Cost

The class membership (0 or 1) of a node now depends on whether the total cost of the false positive errors is higher or lower than that of the false negative errors. Let c(i | j) denote the cost associated with misclassifying a true class j as class i. Node t is assigned to class j (j = 0, 1) if that assignment has the smaller misclassification cost, i.e.,

    c(j | 1 − j) p(y = 1 − j | t) ≤ c(1 − j | j) p(y = j | t),

or, equivalently, in terms of the cases in node t,

    Σ_{i ∈ t : y_i = 1 − j} c(j | y_i) ≤ Σ_{i ∈ t : y_i = j} c(1 − j | y_i).

Example: consider a node with 44 preterm (1) and 356 full-term (0) babies. Using the simple majority-vote principle, the node would be classified as full term. However, in order to minimize the error of misclassifying preterm babies as term babies, we may define the costs to be c(1 | 0) = 1 and c(0 | 1) = 10. What class will the node be assigned to? (A worked arithmetic check appears in the sketch below.)

The goodness-of-fit measure R(T) alone is not sufficient for determining which subtree is better, especially because larger trees typically have smaller values of R(T). To develop a better measure of the performance (predictive ability) of tree T, we need to penalize the misclassification cost by the tree's size, i.e., |T̃|. Define the cost-complexity measure of tree T as

    R_α(T) = R(T) + α|T̃|,

where α ≥ 0 is the complexity parameter, used to penalize large trees.

3.4 CART: Cost-Complexity Pruning

If the complexity parameter α is 0, then the initial tree T_0 is best, i.e., it has the smallest cost-complexity measure; if the complexity parameter goes to infinity, then the tree containing the root node only is best. Note that as the complexity parameter increases from 0, there will be a link (internal node) h that first becomes ineffective. What do we mean by ineffective? The node h as a terminal node is better than the branch T_h, i.e.,

    R_α(h) ≤ R_α(T_h),   or   R(h) + α · 1 ≤ R(T_h) + α|T̃_h|,   or   α ≥ [R(h) − R(T_h)] / (|T̃_h| − 1).

Let α_h = [R(h) − R(T_h)] / (|T̃_h| − 1), which is the threshold that changes the internal node (link) h into a terminal node. Compute this threshold for every link (internal node). The link corresponding to the smallest threshold is identified as the weakest link h*.
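Returning to the preterm/full-term example above, one way to check the class assignment is by direct arithmetic; a minimal R sketch:

```r
# Worked check of the example above: 44 preterm (class 1) and 356 full-term
# (class 0) babies, with costs c(1|0) = 1 and c(0|1) = 10.
n1 <- 44; n0 <- 356
c10 <- 1    # cost of calling a true full-term baby preterm
c01 <- 10   # cost of calling a true preterm baby full term

cost_assign_0 <- n1 * c01   # assign node to class 0: misclassify the 44 preterm cases
cost_assign_1 <- n0 * c10   # assign node to class 1: misclassify the 356 full-term cases
c(assign_0 = cost_assign_0, assign_1 = cost_assign_1)
# 440 vs. 356, so the cost-weighted rule assigns the node to class 1 (preterm),
# reversing the majority vote.
```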

Denote the pruned subtree, after truncating T_{h*}, as T_1 = T_0 − T_{h*}, and repeat the same procedure by considering all internal nodes of T_1 and pruning off its weakest link to obtain T_2.

The Algorithm:

    Let j = 0 and T = T_0.
    While |T̃| ≥ 2, do:
        For every h ∈ T − T̃, compute α_h = [R(h) − R(T_h)] / (|T̃_h| − 1).
        Set j = j + 1.
        Let α_j = min_h α_h and let h* be the corresponding link.
        Let T_j = T − T_{h*}.
        Set T = T_j.
    enddo

The pruning algorithm results in a nested sequence of optimally pruned subtrees T_0 > T_1 > ... > T_m, where T_m denotes the tree with the root node only, and a corresponding sequence of thresholds satisfying 0 = α_0 < α_1 < ... < α_m. CART shows that for α ∈ [α_k, α_{k+1}), k = 0, ..., m, tree T_k is the smallest subtree that minimizes the cost-complexity measure R_α(T).
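In rpart, the nested sequence produced by this algorithm is summarized in the complexity-parameter (cp) table. A hedged sketch follows, continuing the stage C example; rpart reports a rescaled version of the complexity parameter (its cp), so the correspondence with α here is only up to scaling, and the cp value passed to prune() is an arbitrary illustration:

```r
# Minimal sketch: rpart's cp table lists the nested sequence of pruned subtrees.
library(rpart)
data(stagec, package = "rpart")
stagec$pgstat <- factor(stagec$pgstat, labels = c("No", "Prog"))

fit <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec, method = "class")

printcp(fit)                      # one row per subtree in the nested sequence
pruned <- prune(fit, cp = 0.05)   # weakest-link pruning at a chosen threshold
pruned
```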

Insert: LeBlanc's 1993 JASA paper.

Decision Trees (continued)

4 Tree Size Selection

As the third step in the CART algorithm, we now need to identify one (or several) optimally sized tree from the subtree sequence as the final tree model. This step is equivalent to selecting the best tree size. A natural approach is to choose the subtree that optimizes an estimate of a performance measure. However, the resubstitution estimate based on the training sample tends to be over-optimistic because of the very adaptive nature of decision trees. We need validation methods (test sample and cross-validation) to develop a more honest estimate of performance.

4.1 The Test Sample Method

Step 1: Split the data randomly into two sets: the learning sample L_1 (66.66%) and the test sample L_2 (33.33%).

- The learning sample is also called the training sample, and the test sample is sometimes called the validation sample.
- The above ratio (2:1) is quoted from CART. A different ratio may be applied depending on the total sample size. For example, when a huge amount of data is available, one may apply a larger proportion for the test sample, e.g., a 1:1 ratio for the learning and test samples.
- Stratified sampling may be applied. Stratification may be based on the outcome variable or on important input variables.

Step 2: Using the training sample L_1 only, grow a large initial tree and then prune it back to obtain a nested sequence of subtrees T_0 > ... > T_M.

Step 3: Send the test sample L_2 down each subtree and compute the test-sample misclassification cost R^ts(T_m) for each subtree T_m, m = 0, 1, ..., M. The subtree having the smallest misclassification cost is then selected as the best subtree, denoted T*. That is,

    R^ts(T*) = min_m R^ts(T_m).

Once the best subtree T* is determined, R^ts(T*) is used as an estimate of the misclassification cost.
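A hedged R sketch of Steps 1-3 with rpart and the stagec data follows; the 2:1 split, the seed, and the control settings are arbitrary illustrations:

```r
# Minimal sketch of the test-sample method (stagec data as shipped with rpart).
library(rpart)
data(stagec, package = "rpart")
stagec$pgstat <- factor(stagec$pgstat, labels = c("No", "Prog"))

set.seed(123)
idx   <- sample(nrow(stagec), size = round(2/3 * nrow(stagec)))
learn <- stagec[idx, ]    # L1: learning (training) sample
test  <- stagec[-idx, ]   # L2: test (validation) sample

# Step 2: grow a large tree on L1 only (xval = 0 turns off internal CV here).
fit <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
             data = learn, method = "class",
             control = rpart.control(cp = 0, minsplit = 5, xval = 0))

# Step 3: evaluate each pruned subtree in the nested sequence on L2.
cps <- fit$cptable[, "CP"]
rts <- sapply(cps, function(cp) {
  sub  <- prune(fit, cp = cp)
  pred <- predict(sub, newdata = test, type = "class")
  mean(pred != test$pgstat)          # test-sample misclassification rate
})
best <- prune(fit, cp = cps[which.min(rts)])
rts
```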

Advantages and disadvantages of the test sample method:

- Very straightforward.
- Does not use all the data: the sequence of subtrees, from which the best tree model is selected, is based on 2/3 of the data. The estimate of the misclassification cost R(T) is based on 1/3 of the data; hence the test-sample estimator of tree performance has high variance when we don't have much data.

4.2 Cross-Validation (CV)

- Often applied when the sample size is moderate or small. (Even when the sample size seems large, we may still not have data to waste if the target variable is sparse or if there are a large number of input variables.)
- Does not waste data.
- One of the resampling techniques: it generates samples from the one sample at hand. Other resampling techniques include the bootstrap and the jackknife.

V-fold cross-validation:

Step 1: The whole sample L is randomly divided into V subsets L_v, v = 1, ..., V. The sample sizes of the V subsets should all be equal, or as nearly equal as possible.

- The vth learning sample is L^(v) = L − L_v, v = 1, ..., V. The vth subset L_v is used as the test sample corresponding to L^(v).
- The value of V needs to be reasonably large so that the fraction of the data in each learning sample, (V − 1)/V, is close to 1. CART's suggestion is V = 10, in which case each learning sample contains 90% of the data and each test sample contains 10% of the data.

- Stratified sampling may be used to ensure balance for important variables.

Step 2: For a fixed v = 1, ..., V, grow a large initial tree and prune it back using only L^(v). The pruning procedure provides a nested sequence of optimally pruned subtrees T^(v)_0 > T^(v)_1 > ... > T^(v)_M. Also grow and prune a tree based on all the data to obtain the nested sequence of best pruned subtrees T_0 > T_1 > ... > T_M and a corresponding sequence of complexity parameters 0 = α_0 < α_1 < ... < α_M < α_{M+1} = ∞.

Step 3: Now we want to select the best subtree from the subtree sequence T_0 > T_1 > ... > T_M based on the minimum misclassification cost. How do we achieve this through V-fold cross-validation? Let's first review an important property of the CART pruning procedure.

Theorem (Theorem 3.10 in CART, page 71): For m ≥ 1, T_m is the smallest subtree that minimizes the cost-complexity measure R_α(T) for any complexity parameter α such that α_m ≤ α < α_{m+1}.

The above theorem implies that we can get the optimally pruned subtree for any penalty α from the efficient pruning algorithm. Define

    α'_m = sqrt(α_m α_{m+1}),   m = 0, 1, ..., M,

so that for each m, α'_m is the geometric midpoint of the interval [α_m, α_{m+1}). Here, {α_m : m = 0, 1, ..., M + 1} are obtained by applying the cost-complexity pruning algorithm to the entire sample L. Note that for each v = 1, ..., V, we have the optimally pruned subtree T^(v)(α'_m) for complexity parameter α'_m, m = 1, ..., M. Now we want to find the complexity parameter α'_m that minimizes the average of the estimated misclassification cost over v = 1, ..., V.

Fix the value of v: for each m = 1, ..., M, L_v is sent down the tree T^(v)(α'_m). The quantity

    R^CV_v(T^(v)(α'_m)) = Σ_{t ∈ T̃^(v)(α'_m)} Σ_{i: (x_i, y_i) ∈ t ∩ L_v} R(i)

is calculated, where R(i) is the misclassification cost for observation i that belongs to L_v and falls into terminal node t.

Sum over v: the above quantity is summed over the V subsamples to obtain

    R^CV(T(α'_m)) = Σ_{v=1}^{V} R^CV_v(T^(v)(α'_m)).

The best pruned subtree can then be defined as the subtree T(α*) which minimizes the cross-validated estimate of the misclassification cost:

    R^CV(T(α*)) = min_m R^CV(T(α'_m)).

4.3 The 1-SE Rule

There is one problem with methods based on honest estimates of the misclassification cost (test sample and cross-validation): the estimate of the misclassification cost (or prediction error) tends to decrease rapidly as the tree size increases from the root node, and then there is a wide flat valley, with the estimated misclassification cost rising slowly as the number of terminal nodes gets large (see the figure on page 79 of CART). Breiman et al. (1984) note that there may be considerable variability in the minimum misclassification cost. CART proposes an ad hoc fix, namely the 1-SE rule.

The 1-SE rule is designed

- to keep the tree as simple as possible without sacrificing much accuracy, and
- to reduce instability in tree selection.

CART selects the smallest subtree T** such that R̂(T**) is less than one standard error greater than R̂(T*), where R̂ denotes either R^ts or R^CV, and T* denotes the best subtree from the corresponding validation method. Namely, among all subtrees with R̂ less than R̂(T*) + SE(R̂(T*)), T** has the smallest size. (A sketch of the 1-SE rule using rpart's built-in cross-validation appears after the references below.)

4.4 References

[1] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees, Chapman and Hall. Sections
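As a hedged sketch, rpart's built-in V-fold cross-validation reports xerror and xstd in the cp table, which is enough to apply both the minimum-error and 1-SE selections. The data and settings below are illustrations, continuing the stage C example:

```r
# Minimal sketch: selecting cp by the 1-SE rule from rpart's built-in
# cross-validation (cptable columns: CP, nsplit, rel error, xerror, xstd).
library(rpart)
data(stagec, package = "rpart")
stagec$pgstat <- factor(stagec$pgstat, labels = c("No", "Prog"))

set.seed(1)
fit <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec, method = "class",
             control = rpart.control(cp = 0, xval = 10))   # 10-fold CV, per CART

tab    <- fit$cptable
best   <- which.min(tab[, "xerror"])                       # T*: smallest CV error
thresh <- tab[best, "xerror"] + tab[best, "xstd"]          # R(T*) + 1 SE
one_se <- which(tab[, "xerror"] <= thresh)[1]              # smallest tree within 1 SE

pruned_min <- prune(fit, cp = tab[best,   "CP"])
pruned_1se <- prune(fit, cp = tab[one_se, "CP"])
plotcp(fit)   # CV error vs. cp, with a reference line at the 1-SE threshold
```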
