Asymmetric Gradient Boosting with Application to Spam Filtering
Jingrui He, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA, [email protected]
Bo Thiesson, Microsoft Research, One Microsoft Way, Redmond, WA, USA, [email protected]

ABSTRACT
In this paper, we propose a new asymmetric boosting method, Boosting with Different Costs. Traditional boosting methods assume the same cost for misclassified instances from different classes, and in this way focus on good performance with respect to overall accuracy. Our method is more generic, and is designed to be more suitable for problems where the major concern is a low false positive (or negative) rate, such as email spam filtering. Experimental results on a large scale email spam data set demonstrate the superiority of our method over state-of-the-art techniques.

CEAS 2007 - Fourth Conference on Email and Anti-Spam, August 2-3, 2007, Mountain View, California, USA

1. INTRODUCTION
Classification is a very important field in machine learning, and has been well studied over the past years. Many different classification methods have been proposed, such as logistic regression, decision trees, neural networks, SVMs, etc. Recent developments in classification methodology combine individual classifiers into a powerful ensemble classifier, which predicts the label of a particular instance using a weighted vote over the ensemble. Boosting methods are ensemble methods in which one adds new base classifiers to the ensemble in a sequential way (e.g., Freund & Schapire, 1997). The new classifiers are trained on the basis of reweighted versions of the data, where instances that are not predicted well by the current ensemble are given a higher weight. This simple strategy of combining classifiers has been shown to improve results dramatically in many cases, and has thus been the focus of much subsequent research; see, e.g., Friedman, Hastie & Tibshirani (2000), Mason, Baxter, Bartlett & Frean (1999), and Friedman (2001), to mention a few.

The loss function is one of the key factors in a standard classification framework. In boosting, this loss function affects the reweighting of the data used to train the next classifier in the ensemble. Most loss functions are symmetric in the sense that they assume the same cost for misclassified instances with different class labels. In many application domains, however, this assumption is not appropriate. For example, in an email spam filtering setting it is much more concerning to lose a piece of good mail (ham) than it is to receive a few unsolicited mails (spam).

In this paper we focus on email spam filtering. Recently, Androutsopoulos, Paliouras & Michelakis (2004) and Carreras & Marquez (2001) introduced boosting with symmetric cost functions for building spam filtering classifiers. We propose a new asymmetric boosting method, Boosting with Different Costs (BDC). This method is generally applicable to any problem where it is important to achieve good classification at low false positive (or low false negative) rates. Despite the slightly mind-bending connotation, we will follow tradition and denote the misclassification of good emails as false positives. For an email spam filter, false positives are therefore more expensive than false negatives; in other words, misclassifying good email as spam is more concerning than the other way around. To accommodate this asymmetry, a commonly used method is stratification, with more emphasis on ham than spam emails during the training of the classifier.
For boosting, this stratification can be achieved by multiplying the loss associated with a ham email by a utility value greater than one. One possible problem with this simple method is that all the spam emails are de-emphasized in the same way, no matter whether they are important for constructing the classifier or not. If the training data are noisy, either due to mis-labeling or due to very uncharacteristic spam examples, which is often the case, we will not be able to differentiate between the noisy and the characteristic spam examples. In this way, the performance of the resulting classifier may be largely affected by the noise.

The proposed BDC method is designed to solve this problem. It is based on the MarginBoost framework (Mason, Baxter, Bartlett & Frean, 1999). In our approach, we design two different cost functions, one for ham and one for spam. For ham, the cost gradually increases as an instance moves away from the decision boundary on the wrong side of its label, while for spam, the cost remains essentially unchanged once an instance is far enough from the decision boundary. As will become clear in Section 3, the effect of this difference in the cost functions is that we are trading off difficult spam for the ability to better classify all ham and easier, more characteristic spam.
We are in this way skewing the classifier towards better classification at low false positive rates.

As straw men for comparisons, we will consider symmetric boosting with logistic loss, with and without stratification, as well as logistic regression, which is one of the most popular machine learning algorithms for spam filtering; see, e.g., Yih, Goodman & Hulten (2006). As noticed by Friedman, Hastie & Tibshirani (2000) and Collins, Schapire & Singer (2000), when using logistic loss, boosting and logistic regression are quite similar. To further bolster this boosting view on logistic regression, and in this way better understand similarities and differences to our BDC method, we demonstrate how logistic regression can be considered as a special implementation of the MarginBoost framework in Mason et al. (1999). Our spam filtering application involves a large number of features, and it turns out that if we use decision stumps as the base classifiers, boosting is similar to logistic regression with a smart feature selection method. In addition, boosting with decision trees of depth two, for example, can be used to add more complicated co-occurrence features to existing models, which is not possible for logistic regression without an explosion in the number of parameters.

To summarize, the possible advantages of BDC include:
1. BDC is tailored for spam filtering by introducing different cost functions for ham and spam.
2. Unlike stratification, spam is de-emphasized only when hard to classify correctly. If the training data is noisy, BDC will filter out noisy spam, and focus on all the other instances (ham and characteristic spam). It thus produces a classifier with better performance at low false positive rates.
3. Like other boosting algorithms, BDC can be used to select features, which saves space for storing the model and gives faster runtime when deployed; it can also be used to add more complicated features to existing models without an explosion in the number of parameters.

The rest of the paper is organized as follows. In Section 2, we briefly review the MarginBoost framework (Mason et al., 1999), and view logistic regression as a special implementation of this framework. In Section 3, we introduce the cost functions used in BDC, present the associated boosting algorithm, and discuss its behavior in the low false positive region. Section 4 studies the parameter setting in BDC. To prove the effectiveness of BDC in spam filtering, we conduct experiments on a large scale email spam data set, and summarize the results in Section 5. In Section 6 we discuss related work, followed by conclusions in Section 7.

2. MARGINBOOST
MarginBoost is a special variation of a more general class of boosting algorithms based on gradient descent in function space. A detailed derivation of this general class of boosting algorithms can be found in Mason et al. (1999) and Friedman (2001). In this section, we will first review the MarginBoost algorithm, as described in Mason et al. (1999), who also coined this name. Then, we will consider particular specializations of MarginBoost and demonstrate their relations to logistic regression.
2.1 The Framework
In a binary classification problem, we are given a training set $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of $m$ labeled data points generated according to some unknown distribution, where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$. We wish to construct a voted combination of classifiers, or strong classifier,

$$F(x) = \sum_{t=1}^{T} w_t f_t(x) \qquad (1)$$

where $f_t(x) : \mathbb{R}^d \to \{\pm 1\}$ are base classifiers, and $w_t \in \mathbb{R}$ are the weights for each base classifier in the combination. A data point $(x, y)$ is classified according to the sign of $F(x)$, and its margin is defined by $yF(x)$. A positive value for the margin therefore corresponds to a correct classification and a negative value corresponds to a misclassification.

To obtain the classifier, MarginBoost minimizes the sample average of some suitable cost function of the margin, $C : \mathbb{R} \to \mathbb{R}$. That is, we find a classifier $F$ that minimizes the loss functional

$$S(F) = \frac{1}{m} \sum_{i=1}^{m} C(y_i F(x_i)) \qquad (2)$$

We emphasize here that MarginBoost does not perform traditional parameter optimization. As formulated in Friedman (2001), we instead consider $F$ evaluated at each point $x$ to be a parameter, and seek to minimize (2) directly with respect to $F$ evaluated at each $x$. Taking a numerical approach, the minimization of (2) can be achieved by gradient descent in function space. In other words, given the current classifier $F_t$ at iteration $t$ of the algorithm, we are looking for a new direction $f_{t+1}$ such that $S(F_t + \epsilon f_{t+1})$ decreases most rapidly, for small values of $\epsilon$. In function space, the desired direction is simply the negative functional derivative of $S$ at $F_t$, $-\nabla S(F_t)(x)$, where

$$\nabla S(F_t)(x) = \frac{\partial S(F_t + \epsilon 1_x)}{\partial \epsilon}\Big|_{\epsilon = 0} = \frac{1}{m} \sum_{i=1}^{m} \delta(x = x_i)\, y_i\, C'(y_i F_t(x_i)) \qquad (3)$$

where $1_x$ is the indicator function at $x$, $\delta(x = x_i) = 1$ iff $x = x_i$, and $C'(z)$ is the derivative of the cost function with respect to the margin $z$. This derivative is defined only at the data points and does not generalize to other values. We achieve generalization (or smoothing) by restricting the direction of $f_{t+1}$ to a base classifier in some fixed parameterized class $\mathcal{F}$. In general it will therefore not be possible to choose $f_{t+1} = -\nabla S(F_t)(x)$, so instead we find the base classifier that has the greatest inner product with $-\nabla S(F_t)(x)$.
Letting $\langle \cdot, \cdot \rangle$ denote the inner product, and inserting the expression for $\nabla S(F_t)$ from (3), the new direction $f_{t+1}$ should in this case maximize

$$\langle -\nabla S(F_t), f_{t+1} \rangle = \frac{1}{m} \sum_{i=1}^{m} -\nabla S(F_t)(x_i)\, f_{t+1}(x_i) = -\frac{1}{m^2} \sum_{i=1}^{m} y_i\, f_{t+1}(x_i)\, C'(y_i F_t(x_i)) \qquad (4)$$

For later discussions, it is important to notice from (4) that the new base classifier $f_{t+1}$ is found by considering this classifier's weighted margins for all the data points, $y_i f_{t+1}(x_i)$, where the weights are in proportion to minus the derivative of the cost function evaluated at the current margins, $D_t(i) \propto -C'(y_i F_t(x_i))$.

Given the new base classifier $f_{t+1}$, we can now use a line search to find the best coefficient $w_{t+1}$ for $f_{t+1}$. The iterative process of adding $w_{t+1} f_{t+1}$ to the strong classifier in (1) can be terminated when the inner product between the negative functional derivative of $S$ at $F_t$ and any base classifier is less than or equal to zero. Another stopping criterion commonly used in practice is to set a fixed maximum number of iterations.

2.2 MarginBoost Specializations
As has been noted in the literature (e.g., Mason et al., 1999 and Friedman, 2001), many important boosting algorithms can be reformulated in the MarginBoost framework. For example, if we use the exponential loss $C(yF(x)) = \exp(-yF(x))$, and $w_{t+1}$ is obtained by a line search, we get AdaBoost (Freund & Schapire, 1997); if we use the logistic loss $C(yF(x)) = \ln(1 + \exp(-yF(x)))$, and $w_{t+1}$ is chosen by a single Newton-Raphson step, we get LogitBoost (Friedman et al., 2000). In general, the MarginBoost framework will work for any differentiable cost function, and in most situations, a monotonically decreasing cost function of the margin is a sensible choice.

In order to demonstrate that linear logistic regression can also be viewed as a special implementation of MarginBoost, we will concentrate on the logistic loss function. Let us first consider the class of linear base classifiers of the form $\mathcal{F} = \{f(x) = a'x + b,\ a \in \mathbb{R}^d,\ b \in \mathbb{R},\ \|a\|^2 + b^2 = 1\}$, where $a$ is the coefficient vector, $b$ is the intercept, and $'$ denotes transpose. The new direction $f_{t+1}$ is obtained by choosing a base classifier in $\mathcal{F}$ that maximizes (4) with a logistic loss for $C$:

$$\langle -\nabla S(F_t), f_{t+1} \rangle \propto \frac{1}{m} \sum_{i=1}^{m} y_i (a_{t+1}' x_i + b_{t+1}) \frac{\exp(-y_i F_t(x_i))}{1 + \exp(-y_i F_t(x_i))}$$

This constrained optimization yields a classifier in $\mathcal{F}$ with coefficients and intercept

$$a_{t+1} \propto \sum_{i=1}^{m} y_i x_i \frac{\exp(-y_i F_t(x_i))}{1 + \exp(-y_i F_t(x_i))} \qquad (5)$$

$$b_{t+1} \propto \sum_{i=1}^{m} y_i \frac{\exp(-y_i F_t(x_i))}{1 + \exp(-y_i F_t(x_i))} \qquad (6)$$

Iteration $t+1$ therefore updates the strong classifier as $F_{t+1}(x) = F_t(x) + w_{t+1}(a_{t+1}' x + b_{t+1})$, where $a_{t+1}$ and $b_{t+1}$ are determined as in (5) and (6), and $w_{t+1}$ is a small positive parameter, which is chosen by line search or based on simple heuristics.

It is an interesting observation that for a logistic regression model, $P(y \mid x) = 1 / (1 + \exp(-y(a'x + b)))$ with parameters $a$ and $b$, the gradient of the log-likelihood equals exactly the same expressions as on the right-hand sides of (5) and (6). Hence, the particular boosting algorithm described above is equivalent to logistic regression with a gradient ascent parameter optimization method.
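To make this equivalence concrete, the following small numerical sketch (not from the paper; NumPy, with randomly generated data standing in for a real corpus) computes the MarginBoost direction in (5) and (6) under the logistic loss and checks that it matches a finite-difference estimate of the gradient of the logistic regression log-likelihood at the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))            # stand-in feature matrix
y = rng.choice([-1.0, 1.0], size=m)    # labels in {-1, +1}

a, b = rng.normal(size=d), 0.1         # current strong classifier F_t(x) = a'x + b
F = X @ a + b                          # F_t evaluated at the data points

# MarginBoost direction under the logistic loss, eqs. (5)-(6): the per-instance
# weight is -C'(y_i F_t(x_i)) = exp(-y_i F_t(x_i)) / (1 + exp(-y_i F_t(x_i))).
w = np.exp(-y * F) / (1.0 + np.exp(-y * F))
a_next = X.T @ (y * w)                 # eq. (5), up to normalization
b_next = np.sum(y * w)                 # eq. (6), up to normalization

# Gradient of the logistic regression log-likelihood
# l(a, b) = sum_i -ln(1 + exp(-y_i (a'x_i + b))), estimated by finite differences.
def loglik(a, b):
    return -np.sum(np.log1p(np.exp(-y * (X @ a + b))))

eps = 1e-6
grad_a = np.array([(loglik(a + eps * np.eye(d)[j], b) - loglik(a, b)) / eps
                   for j in range(d)])
grad_b = (loglik(a, b + eps) - loglik(a, b)) / eps

# The two directions agree (up to numerical error), as claimed above.
print(np.allclose(a_next, grad_a, atol=1e-3), np.isclose(b_next, grad_b, atol=1e-3))
```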
Another interesting observation follows from choosing decision stumps as the class of base classifiers. For the spam data set that we will consider later in this paper, all features are binary, i.e., $x \in \{0, 1\}^d$. In this case, the learned decision stump at iteration $t$ can be defined as $f_t(x) = c\, x_{j_t} + d$, where $x_{j_t}$ is the most discriminating feature at that iteration, and $c, d \in \mathbb{R}$. The voted combination $F(x)$ will therefore also result in a linear classifier of the form $F(x) = a'x + b$, where $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$. Since this boosted classifier and logistic regression minimize the same convex cost function, upon convergence the two classifiers would therefore again be the same, although they are obtained in two different ways. It is also noteworthy that by making use of a rougher stopping criterion, MarginBoost may end up with a sparse coefficient vector $a$. In this case, MarginBoost implicitly performs feature selection, which reduces both the storage space and the response time for the deployed classifier. (A logistic regression model is often regularized by adding a term to the model in the form of a penalizing distribution on the parameters. This type of regularization will not fit into the MarginBoost framework, because we cannot in that case express a suitable cost function of the margin only.)

3. BOOSTING WITH DIFFERENT COSTS
Existing cost functions used in the MarginBoost framework (including the one used in logistic regression) are symmetric in the sense that they assume the same cost function for misclassified instances from different classes. However, in spam filtering, the cost for misclassifying ham is much higher than that for misclassifying spam, which calls for cost functions that are asymmetric for the two classes. In this section, we therefore present the Boosting with Different Costs method. It uses two different cost functions for ham and spam, with an emphasis on classifying ham correctly. In each round of boosting, moderately misclassified spam (where the absolute value of the margin is small) have large weights, whereas extremely misclassified spam (where the absolute value of the margin is large) have small weights, since they tend to be noise. In contrast, all misclassified ham have large weights. In this way, the base classifier trades off correct classification of extremely misclassified spam for better classification of all ham and the remaining spam, thus ensuring a low false positive rate of the strong ensemble classifier.

3.1 Cost Functions in BDC
Suppose that a data point is labeled as $y_i = -1$ for ham, and as $y_i = +1$ for spam.
The cost functions used in BDC are as follows:

Ham: $C_h(yF(x)) = \ln(1 + \exp(-yF(x))) \qquad (7)$

Spam: $C_s(yF(x)) = \dfrac{\alpha}{1 + \exp(\gamma\, yF(x))} \qquad (8)$

where $\alpha$ and $\gamma$ are two positive parameters that control the shape of the cost function for spam relative to that for ham. Note that the two cost functions here are designed for spam filtering. In general, BDC could work with any two cost functions that satisfy the differentiability condition mentioned in Section 2.2.

Figure 1 compares the two costs as functions of the margin ($\alpha = 1$ and $\gamma = 1$). From this figure, we can see that when the margin is positive, which corresponds to correct classification, both functions output a small cost. On the other hand, when the margin is negative, the cost for ham increases at a higher and higher rate as the margin becomes more negative, until it approximates a linear function for extremely misclassified instances. In contrast, the cost for spam remains almost unchanged at very negative margins. That is, if a spam message is extremely misclassified, a further decrease in the margin will not increase the cost.

[Figure 1: Cost functions for ham and spam]

For iteration $t$ of BDC, as in MarginBoost, we need to calculate the weight for each training instance. As mentioned in Subsection 2.1, this weight is in proportion to the negative first derivative of the corresponding cost function, i.e.,

Ham: $D_t(i) \propto -C_h'(y_i F(x_i)) = \dfrac{\exp(-y_i F(x_i))}{1 + \exp(-y_i F(x_i))} \qquad (9)$

Spam: $D_t(i) \propto -C_s'(y_i F(x_i)) = \dfrac{\alpha \gamma \exp(\gamma\, y_i F(x_i))}{(1 + \exp(\gamma\, y_i F(x_i)))^2} \qquad (10)$

In Figure 2, we compare the negative first derivatives as functions of the margin ($\alpha = 1$ and $\gamma = 1$). From the figure, we can see that misclassified ham always have a large weight, while misclassified spam get a large weight only when the margin is close to zero.

[Figure 2: Weight functions for ham and spam]

The advantages of the weighing scheme in BDC are two-fold. Firstly, in spam filtering, the training set is always noisy to some extent, and the outliers tend to be difficult to classify as their given labels. If a training instance is extremely misclassified (the margin is very negative), it is quite likely to be an outlier. In existing boosting algorithms, as more and more base classifiers are combined into the strong classifier, the outliers get larger and larger weights, and the subsequent base classifiers end up fitting the noise. From this perspective, since BDC assigns small weights to extremely misclassified spam, it is able to discard some noisy spam. Secondly, in spam filtering, people would rather receive a few pieces of spam than lose a single piece of ham. In BDC, the weight of misclassified ham is always high, which ensures that the combined classifier will have a low false positive rate.
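The two cost functions and the induced instance weights are easy to write down directly from (7)-(10); the sketch below (NumPy; $\alpha = \gamma = 1$ chosen purely for illustration, matching Figures 1 and 2) reproduces the qualitative behavior just described: the ham weight stays close to one for very negative margins, while the spam weight peaks near the decision boundary and vanishes for extremely misclassified spam.

```python
import numpy as np

def cost_ham(margin):
    # Eq. (7): C_h(yF(x)) = ln(1 + exp(-yF(x)))
    return np.log1p(np.exp(-margin))

def cost_spam(margin, alpha=1.0, gamma=1.0):
    # Eq. (8): C_s(yF(x)) = alpha / (1 + exp(gamma * yF(x)))
    return alpha / (1.0 + np.exp(gamma * margin))

def weight_ham(margin):
    # Eq. (9): -C_h'(yF(x)) = exp(-yF(x)) / (1 + exp(-yF(x)))
    return np.exp(-margin) / (1.0 + np.exp(-margin))

def weight_spam(margin, alpha=1.0, gamma=1.0):
    # Eq. (10): -C_s'(yF(x)) = alpha*gamma*exp(gamma*yF(x)) / (1 + exp(gamma*yF(x)))**2
    e = np.exp(gamma * margin)
    return alpha * gamma * e / (1.0 + e) ** 2

margins = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(weight_ham(margins))    # large (close to 1) for badly misclassified ham
print(weight_spam(margins))   # peaks at margin 0, nearly 0 at margin -5
```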
3.2 BDC Algorithm
Based on the MarginBoost framework, we summarize the BDC algorithm in Figure 3. Initially, each training instance is assigned the same weight. In each iteration, the weak learner selects a base classifier that maximizes the inner product in (4). After the coefficient $w_{t+1}$ is chosen by line search, the weights of the training instances are updated based on the negative first derivative of the ham and spam cost functions $C_h$ and $C_s$ with respect to the margin of the newly combined classifier. The iteration process is stopped when the inner product in (4) is less than or equal to zero, or when we reach a prespecified maximum number of iterations $T$.

  Let $D_0(i) = 1/m$ for $i = 1, \ldots, m$. Let $F_0(x) = 0$.
  For $t = 0$ to $T - 1$ do
    Let $f_{t+1} = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^{m} y_i f(x_i) D_t(i)$.
    If $\sum_{i=1}^{m} y_i f_{t+1}(x_i) D_t(i) \le 0$ then
      Return $F_t$
    End if
    Choose $w_{t+1}$ by line search.
    Let $F_{t+1} = F_t + w_{t+1} f_{t+1}$
    For $i = 1, \ldots, m$ do
      Let $D_{t+1}(i) = \begin{cases} -C_h'(y_i F_{t+1}(x_i)) & \text{for } y_i = -1 \text{ (ham)} \\ -C_s'(y_i F_{t+1}(x_i)) & \text{for } y_i = +1 \text{ (spam)} \end{cases}$
    End for
    Let $D_{t+1}(i) = D_{t+1}(i) / \sum_{i'=1}^{m} D_{t+1}(i')$ for $i = 1, \ldots, m$.
  End for
  Return $F_{t+1}$

Figure 3: BDC algorithm
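The loop in Figure 3 is short enough to sketch in full. The version below is a minimal illustration, not the authors' implementation: it assumes a hypothetical weak learner fit_base(X, y, D) that returns a base classifier f (callable on X, returning +/-1 predictions) chosen to maximize the weighted margin sum in (4), it replaces the line search for $w_{t+1}$ with a coarse grid search, and it uses the label convention ham = -1, spam = +1 from Section 3.1.

```python
import numpy as np

def bdc_train(X, y, fit_base, T=100, alpha=1.0, gamma=1.0):
    """Sketch of the BDC loop of Figure 3 (illustration only).

    X: (m, d) feature matrix; y: labels, -1 for ham and +1 for spam.
    fit_base(X, y, D) -> f, a hypothetical weak learner returning a base
    classifier with f(X) in {-1, +1}, chosen to maximize sum_i y_i f(x_i) D(i).
    """
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                 # D_0(i) = 1/m
    F = np.zeros(m)                         # F_0 evaluated at the training points
    ensemble = []                           # list of (w_t, f_t) pairs

    for _ in range(T):
        f = fit_base(X, y, D)
        pred = f(X)
        if np.sum(y * pred * D) <= 0:       # stopping criterion from Figure 3
            break

        def total_cost(w):                  # crude substitute for the line search
            margins = y * (F + w * pred)
            return np.where(y == -1,
                            np.log1p(np.exp(-margins)),                      # C_h, eq. (7)
                            alpha / (1.0 + np.exp(gamma * margins))).sum()   # C_s, eq. (8)

        w = min(np.linspace(0.01, 1.0, 50), key=total_cost)
        F = F + w * pred
        ensemble.append((float(w), f))

        # Reweight: D_{t+1}(i) proportional to -C'(y_i F_{t+1}(x_i)), per class.
        margins = y * F
        D = np.where(y == -1,
                     np.exp(-margins) / (1.0 + np.exp(-margins)),            # eq. (9)
                     alpha * gamma * np.exp(gamma * margins)
                     / (1.0 + np.exp(gamma * margins)) ** 2)                 # eq. (10)
        D = D / D.sum()                     # normalize, as in Figure 3
    return ensemble
```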
3.3 BDC at the Low False Positive Region
A commonly used performance measure in spam filtering is the ROC curve, which is obtained by varying the threshold of the classifier. In the ROC curve, false negative rates in the low false positive region are of particular interest. Compared to boosting algorithms with symmetric cost functions, or symmetric boosting, BDC tends to have lower false negative rates in the low false positive region.

Figure 4 is an illustrative example of the process causing this phenomenon. The ticks on the axis correspond to the training instances. The +/- above each tick labels an instance as spam (+) or ham (-). According to some decision threshold on the axis, instances to the left of this threshold are classified as ham and instances to the right as spam. The further an instance is to one side of the threshold, the more confidence the classifier has in the decision. The dashed line perpendicular to the axis in Figure 4(a) marks the decision threshold for $F_t(x)$. The left-most + in this figure therefore represents a noisy spam message. The numbers following FP (FN) are the false positive rates (false negative rates), in terms of the number of false positive (false negative) instances, if the decision threshold of the classifier is drifted to the corresponding position on the axis.

If we apply BDC, the misclassified ham (the instance marked as - on the right-hand side of the threshold) will have a large weight, but the noisy spam (the instance marked as + far out on the left-hand side of the threshold) will have a small weight. After one base classifier is added to $F_t(x)$, the classifier is able to better classify ham and all but the noisy spam (Figure 4(b)). In contrast, if we apply symmetric boosting, the misclassified spam will have the largest weight of all instances, and the classifier may end up correctly classifying this instance or just lowering the confidence in the misclassification. This improvement may come at a slight expense in the confidence of correctly classified instances (Figure 4(c)). Comparing Figure 4(b) and Figure 4(c), at FP value zero, the FN value using BDC decreases from two to one, while using symmetric boosting it stays at two. This suggests the advantage of BDC over symmetric boosting in the low FP region, which is a desirable property in spam filtering. On the other hand, at FP value four, the FN value using BDC is one, whereas using symmetric boosting it is zero, which suggests that BDC may not be as good as symmetric boosting in the high FP region.

[Figure 4: a) Current classifier. b) After one iteration in BDC. c) After one iteration in symmetric boosting.]

4. PARAMETER STUDY IN BDC
In BDC, we need to set the values for the positive parameters $\alpha$ and $\gamma$ in the cost function for spam. From (8), we can see that $\alpha$ is the maximum cost for spam, and that the cost for a spam message with $F(x) = 0$ is $\alpha/2$. By setting the value of $\alpha$, we are implicitly performing stratification. The smaller $\alpha$ is, the less important spam is compared to ham.

If we approximate $C_s(yF(x))$ in (8) by a first order Taylor expansion around $yF(x) = 0$, we have

$$C_s(yF(x)) \approx \alpha\,(0.5 - 0.25\,\gamma\, yF(x)) \qquad (11)$$

Thus, with $\alpha$ fixed at a certain value, the $\gamma$ parameter controls the slope of the cost. The larger $\gamma$ is, the faster the cost changes with the margin around $yF(x) = 0$, and vice versa.

We study the effect of the two parameters on the performance of BDC using three noisy data sets, constructed from the synthetic data set illustrated in Figure 5. In this data set, the inner ball and the outer circle represent the positive and negative classes, respectively. The noisy data sets are constructed by assigning an incorrect class label to each of the data points with a small probability. We have gradually increased the noise level, and generated three data sets with noise probabilities 0.03, 0.05, and 0.1.

[Figure 5: Synthetic data.]

For different settings of the parameters $\alpha$ and $\gamma$, we generate the corresponding ROC curves by varying the threshold of the classifier. Since we are only interested in classification results at low false positive rates, we only compare FN rates at FP rates of 3%, 5%, and 10% in the ROC curves, which provides a gauge of how the ROC curve behaves in the low false positive region. The averaged results for 4-fold cross validation are summarized in Figure 6.

[Figure 6: In each of the six experiments, the three curves correspond to: upper curve: FN at FP 3%; middle curve: FN at FP 5%; lower curve: FN at FP 10%. The first row illustrates the effect of $\alpha$ with $\gamma = 1$, and the second row the effect of $\gamma$ with $\alpha = 1$. The three columns correspond to the three data sets with noise probabilities 0.03, 0.05, and 0.1, respectively.]

The first row of experiments shows FN rates with varying $\alpha$ and fixed $\gamma = 1$ for the three noisy data sets, and the second row shows similar results with varying $\gamma$ and fixed $\alpha = 1$.
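For reference, the FN-at-fixed-FP numbers used above (and again in Section 5) can be read off directly from classifier scores by choosing the decision threshold that keeps the false positive rate on ham below the target; a minimal sketch, again with ham = -1 and spam = +1:

```python
import numpy as np

def fn_rate_at_fp(scores, y, fp_target):
    """FN rate with the threshold set so that the FP rate on ham is at most fp_target.

    scores: real-valued outputs F(x), larger meaning more spam-like;
    y: labels, +1 for spam and -1 for ham; fp_target: e.g. 0.03, 0.05, 0.10.
    """
    ham_scores = np.sort(scores[y == -1])[::-1]      # descending
    k = int(np.floor(fp_target * ham_scores.size))   # how many ham we may misclassify
    thr = ham_scores[k] if k < ham_scores.size else -np.inf
    return float(np.mean(scores[y == +1] <= thr))    # spam falling below the threshold
```

Sweeping fp_target over a grid and plotting the resulting FN rates traces out the low false positive portion of the ROC curve used throughout the paper.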
From the figure we see that the performance of BDC is close to optimal over a wide range of parameter values. In other words, the results are not very sensitive to specific parameter values as long as they are within a reasonable range. This observation suggests a simple way of selecting parameter values in real applications: using a hold-out test set or via cross validation, perform a simple line search for one parameter while fixing the other, and iterate this process until performance does not improve further. The line search could be as simple as sampling a number of parameter values within an allowable range and then picking the best one. In our empirical experiments, this simple scheme always converges to a good parameter setting.
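The alternating search just described can be written in a few lines. The sketch below is only an illustration: train_and_validate(alpha, gamma) is a hypothetical helper that trains BDC with the given parameters and returns the validation FN rate at the FP rate of interest (for instance via fn_rate_at_fp above), and the grids are arbitrary.

```python
def search_alpha_gamma(train_and_validate, alpha_grid, gamma_grid,
                       max_rounds=5, tol=1e-4):
    """Coordinate-wise line search over the BDC parameters (illustration only)."""
    alpha, gamma = alpha_grid[0], gamma_grid[0]
    best = train_and_validate(alpha, gamma)
    for _ in range(max_rounds):
        improved = False
        for a in alpha_grid:                 # line search over alpha, gamma fixed
            score = train_and_validate(a, gamma)
            if score < best - tol:
                alpha, best, improved = a, score, True
        for g in gamma_grid:                 # line search over gamma, alpha fixed
            score = train_and_validate(alpha, g)
            if score < best - tol:
                gamma, best, improved = g, score, True
        if not improved:                     # stop when neither sweep helps
            break
    return alpha, gamma, best

# Example call with hypothetical names:
# best_alpha, best_gamma, fn = search_alpha_gamma(validate_bdc,  # user-supplied
#                                                 alpha_grid=[0.1, 0.3, 1.0, 3.0, 10.0],
#                                                 gamma_grid=[0.1, 0.3, 1.0, 3.0, 10.0])
```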
5. EXPERIMENTAL RESULTS
In this section, we use a large email spam corpus to compare BDC with state-of-the-art spam filtering techniques. Both the training and test sets are from the Hotmail Feedback Loop data, which is collected by polling over 100,000 Hotmail volunteers daily. In this feedback loop, each user is provided with a special copy of a message that was addressed to him, and is asked to label this message as ham or spam. The training set consists of messages received between July 1st, 2005 and August 9th, 2005. We randomly picked 5,000 messages from each day, which results in a total of 200,000 messages used for training. The test set contains messages received between December 1st, 2005 and December 16th, 2005. 1,000 messages were drawn from each day, resulting in a collection of 16,000 messages. From each message, we extracted features consisting of subject keywords and body keywords that occurred at least three times in the training set. The total number of binary features is over 2 million. Please refer to Yih, Goodman & Hulten (2006) for a detailed description of a similar corpus.

The methods used for comparison include: logistic regression (LR), regularized logistic regression, regularized logistic regression with stratification, LogitBoost (LB), and LogitBoost with stratification. (Different from LogitBoost in Friedman et al. (2000), the weight $w_t$ is here obtained by line search.) We regularized LR the standard way by multiplying the likelihood with a Normal prior $N(0, \sigma^2)$ on the parameters in the model. To determine the parameter values in each method ($\sigma$ for regularization in LR, the stratification factor for LR and LB, and $\alpha$, $\gamma$ in BDC), we split the training set into two sets of 100,000 messages each: one for training the classifier with different parameter settings, and the other for validating the performance of the trained classifier. The values leading to the lowest FN rates in the low FP region are assigned to the parameters. With the chosen parameter values, we train the classifiers again, this time using the whole training set, and compare the results for the different methods on the test set.

With BDC, LogitBoost, and LogitBoost with stratification, we have considered decision trees of depth one (decision stumps) and depth two (which generates features similar to co-occurrence features) as the base classifiers. We have also tried two criteria for stopping the iteration procedure in the boosting algorithms: 1) setting the maximum number of base classifiers, and 2) setting the minimum decrease in the loss function between two consecutive iteration steps. The results using these two criteria are quite similar, so we only present the results obtained with the first stopping criterion. Experiments with decision stumps are stopped after including 400 base classifiers in the ensemble, and experiments with decision trees of depth two are stopped after including 200 base classifiers. Finally, recall from Subsection 2.2 that although LB and LR optimize parameters in different ways, the models obtained by LB with decision stumps are similar to the models obtained by LR with a smart selection of the most dominant features (< 400).

Figures 7 and 8 compare the ROC curves in the low false positive region, from zero to 0.2, for all of the considered methods, obtained by varying the threshold of the associated classifier. The two figures show the resulting curves for the boosting methods with decision stumps and with decision trees of depth two as base classifiers, respectively. The curves for the logistic regression methods are the same in these figures.

[Figure 7: ROC curves for BDC, LB, and LR methods. The base classifiers in the BDC and LB methods are decision stumps.]

[Figure 8: ROC curves for BDC, LB, and LR methods. The base classifiers in the BDC and LB methods are decision trees of depth two.]

From Figures 7 and 8, we can make the following conclusions: 1) As is desirable for spam filtering, in the low false positive region, BDC with simple base classifiers (decision stumps) outperforms BDC with complicated ones (decision trees of depth 2), and actually performs the best of all the methods reported here; 2) in both of the base classifier settings, BDC is among the best methods in the low false positive region; 3) the improvement of BDC over LB or LB with stratification is more obvious with complicated decision trees than with simple ones, showing that BDC is more resistant to overfitting than LB based methods; 4) for both LR and LB, stratification always improves over the original methods in the low false positive region.

Considering the storage space and the runtime for a deployed classifier, the size of the classifier trained by BDC is a lot smaller than that trained by the LR based methods. For instance, with 400 decision stumps, the number of coefficients used in the classifier trained by BDC is at most 400 (different base classifiers may involve the same feature), compared with more than 2 million coefficients used in the classifier trained by the LR based methods.

Finally, it is worth mentioning that the above results are for a spam classifier solely based on text features, and do not reflect the performance of a full commercial system. In commercial systems, a text based classifier is only one component of a larger system involving IP blocklists, IP safelists, user supplied blocklists and safelists, etc., where each component helps drive the spam system to even better performance. Better performance obviously creates more satisfied users, but it also results in significant storage savings for the provider. For example, the throughput on Hotmail is well over one billion messages per day, with an average spam message size of 4Kb. If messages classified as spam (at a fixed low false positive rate) are deleted right away, storage of more than 40Gb of daily spam messages is avoided per percent improvement.

6. RELATED WORK
Recently, boosting has been applied to spam filtering. In Carreras and Marquez (2001), the authors consider several variants of the AdaBoost algorithm with confidence-rated predictions, and compare with Naive Bayes and induction of decision trees, reaching the conclusion that the boosting-based methods clearly outperform the baseline learning algorithms on the PU corpus. The authors of Androutsopoulos et al. (2004) also apply LogitBoost, as well as Support Vector Machines, Naive Bayes, and Flexible Bayes, on four email spam data sets, so as to examine various aspects of the design of learning-based anti-spam filters.

On the issue of good classification of spam at low false positive (spam) rates, Yih, Goodman and Hulten (2006) propose a two-stage cascade classifier. The first-stage classifier is trained on all data, and classifications as ham by this classifier are not further contested. Classifications as spam, on the other hand, are sent to the second-stage classifier, which is trained from the cases classified as spam during the training of the first-stage classifier. The suggested cascading technique can be combined with any type of classifier, including BDC, and may even result in further improvement.

Viola and Jones (2002) also suggest a cascade of classifiers. Starting with very simple classifiers at early stages of the cascade and gradually increasing the complexity, the classifier is very fast, because most cases in the domain they consider are easy negatives. Their classifier further ensures good performance at low false positive regions by training each classifier in the cascade with an asymmetric variation of AdaBoost that weighs false negatives differently than false positives during training, very much in the spirit of stratification.

Recently, there has also been great interest in algorithms for ranking. In particular, Rudin (2006) has derived a very flexible boosting-style algorithm, designed to perform well at the top of the ranking list at the expense of possible misrankings further down the list. A parameterized cost function controls this tradeoff. We may consider the asymmetric spam classification problem as a special instance of this framework. By pushing ham to the top of the list at the expense of rankings further down the list, this skewed ranking is in effect concentrating on the low false positive region of the ROC.
Finally, the proposed BDC combines boosting and cost-sensitive learning in a natural way. A closely related work is AdaCost, proposed in Fan et al. (1999). It uses the cost of misclassifications to update the training distribution on successive boosting rounds. The purpose is to reduce the cumulative misclassification cost more than AdaBoost does. The difference between AdaCost and BDC is that the former relies on a pre-specified cost for each training instance, while the latter provides a reasonable way of obtaining those costs.

7. CONCLUSION
In this paper, we have proposed a new asymmetric boosting method, Boosting with Different Costs, and applied it to email spam filtering. In BDC we define two cost functions, one for ham and one for spam, according to practical needs. In each round of boosting, misclassified ham always have large weights. Moderately misclassified spam also have large weights, but extremely misclassified spam have small weights. In this way, BDC is able to focus on the ham messages and the more characteristic spam messages at the expense of the more difficult or noisy spam messages. BDC thereby improves the false negative rates in the low false positive region of the ROC curve, as demonstrated for the spam filtering data investigated in this paper.

Acknowledgements
We thank Paul Viola, Joshua Goodman, and Wen-tau Yih for helpful discussions regarding this paper.

References
Androutsopoulos, I., Paliouras, G., & Michelakis, E. (2004). Learning to filter unsolicited commercial e-mail (Technical Report 2004/2). National Centre for Scientific Research "Demokritos".
Carreras, X., & Márquez, L. (2001). Boosting trees for anti-spam email filtering. Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing. Tzigov Chark, Bulgaria.
Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
Fan, W., Stolfo, S. J., Zhang, J., & Chan, P. K. (1999). AdaCost: misclassification cost-sensitive boosting. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 97-105). Morgan Kaufmann.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189-1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 337-407.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space (Technical Report). RSISE, Australian National University.
Rudin, C. (2006). Ranking with a p-norm push. Proceedings of the Nineteenth Annual Conference on Computational Learning Theory.
Viola, P., & Jones, M. (2002). Fast and robust classification using asymmetric AdaBoost and a detector cascade. Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Yih, W., Goodman, J., & Hulten, G. (2006). Learning at low false positive rates. Proceedings of the Third Conference on Email and Anti-Spam.