Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
|
|
|
- Marvin Small
- 10 years ago
- Views:
Transcription
1 Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision trees Introduced by Breiman (2001) to reduce instability of decision tree:... unstable learning algorithms- algorithms whose output classifier undergoes major changes in response to small changes in the training data. Decision-tree, neural network, and rule learning algorithms are all unstable. Dietterich (2000) 1
2 Prediction Error = Irreducible error + statistical bias + variance Hastie, Tibshirani and Friedman, Springer, 2001, p. 37. Improve predictive performance by reducing bias and variance Dietterich and Kong (1995) Decision Tree and Prediction Error Low statistical bias but moderate to high variance Dietterich and Kong (1995) Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction. Hastie, Tibshirani and Friedman, Springer, 2001, p
3 Random Forests (Breiman 2001) and Randomness Random observations: bagging (Breiman, 1996) Data {X,Y} T bootstrap samples {X t =1, Y t =1 } {X t =..., Y t =... } {X T, Y T } Random Forests and Randomness Random features: Increase noise robustness (Breiman, 2001) X 1 X M X... At each node, select optimal splitting variable out of m randomly selected features of M X 1 X m X... 3
4 Noise Robustness Robust to many input variables with each one containing only a small amount of information Breiman (2001) Data {X,Y} T bootstrap samples {X t =1, Y t =1 } {X t =..., Y t =... } {X T, Y T } X X m X X m X X m X X m X X m X X m X X m X X m X X m Tree t =1 (S t =1,Y t =1 ) y t=1 (x) Tree t = (S t =,Y t = ) y t= (x) Tree t =T (S t =T,Y t =T ) y t=1 (x) 4
5 RF and Classification Put input vector down the T trees in the forest. Each tree votes for the predicted class. Majority voting: Classify instance into class having the most votes over all trees T in the forest RF and Feature Importance Importance of unique feature m a) Calculate performance on instances left out of t-th decision tree (Out-Of-Bag) b) Randomly permute m-th feature in the OOB data and apply respective t-th decision tree a- b, average over all decision trees containing feature m Anita and Prinzie standardize & Dirk Van den Poel 5
6 Research Question Could generalizing the Random Forests framework to other methods improve their predictive performance? RF METHOD bagging random feature Stability of decision tree => reduced variance Noise robustness Stability of method Ensemble and accuracy (Dietterich, 2000) Hughes phenomenon => better accuracy Noise robustness 6
7 Hughes Phenomenon (Hughes, 1968) The density of a fixed amount of data points decreases as the number of dimensions increases, inhibiting the ability to make reliable estimates of the probability distribution optimal number of features for given number of observations on increasing the number of features as input to a method over a given threshold, the accuracy decreases Random Utility Models and MNL Multinomial logit dominant RU model due to closedform solution MNL and multicollinearity MNL not well-suited for large feature spaces Need for feature selection in MNL 7
8 MNL RF Stability Theory Large feature space Feature importance Yes Yes No Biased Yes No Yes Unbiased MNL RMNL RF Stability Theory Yes Yes Large feature space Feature importance Yes Unbiased 8
9 RMNL RNB bagging random feature Stability of even stable classifier Ensemble and accuracy (Dietterich, 2000) Multicollinearity => better accuracy Noise robustness Ensemble and accuracy (Dietterich, 2000) Cf. AODE (Webb, Boughton, & Wang, 2005) Alleviate conditional independence: subsets Cf. SBC (Langley and Sage, 1994) Random MultiNomial Logit 9
10 Random MNL RF => RMNL 1. Random Forests 2. RMNL Forest of T decision trees selecting optimal split at each node out of m randomly selected features of M, estimated on T bootstrap samples S T Forest of R MNLs with m randomly selected features out of M, estimated on R bootstrap samples S R Data {X,Y} R bootstrap samples {X r =1, Y r =1 } {X r =..., Y r =... } {X R, Y R } X 1 X M X... X 1 X M X... X 1 X M X... m randomly selected features m randomly selected features m randomly selected features X 1 X m X... X 1 X m X... X 1 X m X... MNL r =1 (S r =1,Y r =1 ) p r=1 (x) MNL r=... (S r=...,y r=... ) p r= (x) MNL R (S R,Y R ) p R (x) 10
11 Random MNL RMNL and Classification Deliver input vector to R MNLs and infer predicted class by adjusted Majority Voting Random MNL RMNL s importance of feature m a) Calculate performance on instances left out of r-th MNL (Out-Of-Bag) b) Randomly permute m-th feature in the OOB data, apply respective r-th MNL, calculate peformance a-b, average over all MNLs containing feature m and standardize 11
12 Random Naive Bayes Random Naive Bayes Naive Bayes Class-conditional independence: Effect of variable value on class is independent of the values of other variables Biased posterior probabilities but accurate and fast! 12
13 RMNL RNB bagging random feature Stability of even stable classifier Ensemble and accuracy (Dietterich, 2000) Multicollinearity => better accuracy Noise robustness Ensemble and accuracy (Dietterich, 2000) Cf. AODE (Webb, Boughton, & Wang, 2005) Alleviate conditional independence: subsets Cf. SBC (Langley and Sage, 1994) Random Naive Bayes RF => RNB Forest of B NBs with m randomly selected features out of M, estimated on B bootstrap samples S B Class-conditional independence within subset of m randomly chosen variables out of feature space M 13
14 Data {X,Y} B bootstrap samples {X b =1, Y b =1 } {X b =..., Y b =... } {X B, Y B } Y m randomly selected features Y Y X 1 X X m X 1 X X m X 1 X X m NB b =1 (S b =1,Y b =1 ) NB b=... (S b=...,y b=... ) p b=1 (x) p b=... (x) p B (x) NB B (S B,Y B ) Random Naive Bayes RNB Classification: adjusted Majority Voting Feature importances on oob data 14
15 Application... Customer Intelligence Application NPTB Model for CROSS-SELL Analysis Predict in what product category a customer will buy next product Scanner data from Belgian homeappliance retailer. N_train= 37,276 N_test = 37,110 15
16 Cleaning, communication, entertainment and cooking needs 9 product categories Application NULL Predictors X: Monetary value, purchase depth & width # Home appliances acquired at retailer Socio-demo Brand loyalty Price sensitivity # Home-appliances returned Dominant mode of payment Special life-event experience ORDER of acquisition of appliances Time to first-acquisition or repeated acquisition (DURATION) 89 dimensions 16
17 Application More details can be found in: PRINZIE, A., & VAN DEN POEL, D. (2007), Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models, Decision Support Systems, 44 (1), Application Predictive Model Evaluation Test set instead of oob data wpcc and AUC s.t. with 17
18 Evaluation wpcct PCCt AUCt RF MNL RMNL NB RNB SVM Estimation RF on all M=441 features T=500 Balanced RF Optimize number of random features m to split each node on. Default m: M=21, stepsize 1/3 => [7,399] m*=336 18
19 Estimation MNL and Expert Feature Selection MNL with M=89 did not converge Expert Feature Selection (wrapper) on NULL, Order and Duration features (Prinzie and Van den Poel, Decision Support Systems, 44, 2007, 28-45) MNL Expert Feature Selection Best3_5+DURATION K > Cr pro =!f k 2 = k=1 Morrison, JMR (1969) 19
20 Estimation RMNL on all M=89 features R=100 Optimize number of random features m to estimate R MNLs on. Default m: M=9, stepsize 1/3 => [3,84] m*=48 Estimation Combining fit Members All 100 MNLs with fixed m value have performance better than Cr pro Combining only very accuracte classifiers might improve the performance of the ensemble even more. Dietterich (1997) 20
21 Estimation RMNL combining MNLs with 10% highest wpcc for given m R=10 Optimize number of random features m to estimate R MNLs on. Default m: M=9, stepsize 1/3 => [3,84] m*=48 Estimation NB Laplace estimation, M=441 Fayad and Irani (1993) supervised discretisation 21
22 Estimation RNB on all M=441 features B=100 Optimize number of random features m to estimate B NBs on. Default m: M=21, stepsize 1/3 => [7,399] m*=42 Estimation RNB combining NBs with 10% highest wpcc, M=441 B=10 Optimize number of random features m to estimate B MNLs on. Default m: M=21, stepsize 1/3 => [7,399] m*=77 22
23 Estimation SVM Libsvm is used (Chang & Lin, 2001) Optimal (C,ɣ) is determined by grid search using five-fold cross validation Validation RF wpcct PCCt AUCt MNL RMNL _ NB RNB_ SVM
24 Validation RF MNL/RMNL NB/RNB SVM Validation Statistical significance of differences in k class-specific AUCs Non-parametric test (Delong, Delong and Clarke-Pearson,1998) RMNL_10 versus RF, MNL, NB, RNB_10 and SVM: all significant at α=0.05 except for RMNL_10-SVM, k=4 and k=5. 24
25 Validation Statistical significance of differences in k class-specific AUCs RNB_10 versus NB: all k AUCs are statistically significant different at α=0.05. Top-20 Features in NPTB Model 25
26 Conclusion Generalization of Random Forests to other methods is valuable! Introduction of Random MultiNomial Logit and Random Naive Bayes as two possible generalizations of Random Forests Random MNL RMNL accommodates 2 weaknesses of MNL: 1) The curse of dimensionality 2) The sensitivity to multicollinearity and as a consequence the biased feature estimates and 1) improves accuracy 2) provides feature importances 26
27 Random NB Random Naive Bayes 1) Alleviates the class-conditional independence assumption 2) Improves accuracy 3) Provides feature importances Random Forests, Random MNL and Random Naive Bayes: 1) Random observation selection 2) Random feature selection 3) Tuning parameter: m 4) No need for test/validation data set 5) Generalization error and feature importance estimates on oob data 27
28 Future Research Generalize Random Forests to other methods potentially benefiting from random observation selection and random feature selection: supervised: classification or regression unsupervised: cluster ensembles Future Research Ensemble with flexible m across base methods Multi-target problems 28
29 Software Independent runs => easily to parallellize Given results, priority is given to RMNL Acknowledgement Leo Breiman Symposium on Data Mining, May 10 th, 2004, Gent, Belgium. 29
30 PRINZIE, A., & VAN DEN POEL, D., Random Forests for multiclass classification: Random MultiNomial Logit, Expert Systems with Applications, 34(3), 2008, DOI: /j.eswa PRINZIE, A., & VAN DEN POEL, D., Random Multiclass Classification: Generalizing RF to RMNL and RNB, Dexa 2007, Lecture Notes in Computer Science, 4653, l / 30
31 Questions, suggestions, Copyright 2008 Dirk Van den Poel & Anita Prinzie All Rights Reserved 31
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Model Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Applied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix [email protected] http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
CS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Chapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
Package trimtrees. February 20, 2015
Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Multinomial Logistic Regression Ensembles
Multinomial Logistic Regression Ensembles Abstract This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
On the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Class Imbalance Learning in Software Defect Prediction
Class Imbalance Learning in Software Defect Prediction Dr. Shuo Wang [email protected] University of Birmingham Research keywords: ensemble learning, class imbalance learning, online learning Shuo Wang
Package acrm. R topics documented: February 19, 2015
Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
from Larson Text By Susan Miertschin
Decision Tree Data Mining Example from Larson Text By Susan Miertschin 1 Problem The Maximum Miniatures Marketing Department wants to do a targeted mailing gpromoting the Mythic World line of figurines.
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Statistics W4240: Data Mining Columbia University Spring, 2014
Statistics W4240: Data Mining Columbia University Spring, 2014 Version: January 30, 2014. The syllabus is subject to change, so look for the version with the most recent date. Course Description Massive
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Package bigrf. February 19, 2015
Version 0.1-11 Date 2014-05-16 Package bigrf February 19, 2015 Title Big Random Forests: Classification and Regression Forests for Large Data Sets Maintainer Aloysius Lim OS_type
Machine Learning Methods for Demand Estimation
Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior
New Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,
COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
Expert Systems with Applications
Expert Systems with Applications 36 (2009) 5445 5449 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Customer churn prediction
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Monday Morning Data Mining
Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Supervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/
Ensemble Methods Adapted from slides by Todd Holloway h8p://abeau
Employer Health Insurance Premium Prediction Elliott Lui
Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than
CRM at Ghent University
CRM at Ghent University Customer Relationship Management at Department of Marketing Faculty of Economics and Business Administration Ghent University Prof. Dr. Dirk Van den Poel Updated April 6th, 2005
Data mining on the EPIRARE survey data
Deliverable D1.5 Data mining on the EPIRARE survey data A. Coi 1, M. Santoro 2, M. Lipucci 2, A.M. Bianucci 1, F. Bianchi 2 1 Department of Pharmacy, Unit of Research of Bioinformatic and Computational
Hong Kong Stock Index Forecasting
Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Data Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
L25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Big Data Analytics for SCADA
ENERGY Big Data Analytics for SCADA Machine Learning Models for Fault Detection and Turbine Performance Elizabeth Traiger, Ph.D., M.Sc. 14 April 2016 1 SAFER, SMARTER, GREENER Points to Convey Big Data
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Mining Wiki Usage Data for Predicting Final Grades of Students
Mining Wiki Usage Data for Predicting Final Grades of Students Gökhan Akçapınar, Erdal Coşgun, Arif Altun Hacettepe University [email protected], [email protected], [email protected]
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC
Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Enhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
Feature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions
SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0
Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)
Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of
Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008
Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis
ElegantJ BI White Paper The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis Integrated Business Intelligence and Reporting for Performance Management, Operational
KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES
HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within
REPORT DOCUMENTATION PAGE
REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
Fortgeschrittene Computerintensive Methoden
Fortgeschrittene Computerintensive Methoden Einheit 3: mlr - Machine Learning in R Bernd Bischl Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch Institut für Statistik LMU München SoSe 2014
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing
Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing Niharika Sharma 1, Arvinder Kaur 2, Sheetal Gandotra 3, Dr Bhawna Sharma 4 B.E. Final Year Student, Department of Computer
DATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge
DATA MINING SPECIES DISTRIBUTION AND LANDCOVER Dawn Magness Kenai National Wildife Refuge Why Data Mining Random Forest Algorithm Examples from the Kenai Species Distribution Model Pattern Landcover Model
