A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION

Frédéric Pichon (1), Florence Aligne (1), Gilles Feugnet (1) and Janet Martha Blatny (2)

(1) Thales Research & Technology, Campus Polytechnique, 1 avenue Augustin Fresnel, Palaiseau cedex, France; (2) FFI, P.O. Box 25, N-2027 Kjeller, Norway.

INTRODUCTION

The aim of the TWOBIAS EU FP7 project ( ) is to develop a modular demonstrator of a stationary, reliable, vehicle-portable, low false alarm rate Two Stage Rapid Biological Surveillance and Alarm System for Airborne Threats (TWOBIAS). The TWOBIAS concept is built around two successive stages: a first detect-to-warn stage dedicated to biological threat detection using Biological Detection Units (BDUs), and a second detect-to-treat stage devoted to pathogenic agent identification using Biological Identification Units (BIUs). In the first stage, the main challenge of the TWOBIAS system is to improve on the state of the art in biological threat detection by combining orthogonal detector techniques with a mathematical framework and high-level information processing approaches. This paper describes the approach developed for the first stage of the TWOBIAS project, as well as preliminary results suggesting its potential usefulness.

APPROACH

The BDUs may be seen as classifiers in the sense of pattern classification [DUD 00]. Indeed, for each air sample, each BDU either raises an alarm or does not, depending on whether it believes that the air sample contains a pathogenic biological agent. Hence, each BDU is a classifier in that it assigns each air sample to one of two groups (biological pathogenic agents and non biological pathogenic agents). There exists a large body of literature dedicated to the use of multiple classifiers, also called classifier ensembles, which are now widely acknowledged as a useful means to address difficult pattern classification problems [KIT 98, ROG 94].
The main idea underlying classifier ensembles is to take advantage of the potential complementarities of different classifiers, so as to obtain potentially better classification performance. One of the central issues is then how to combine the classifier outputs, and several fusion strategies have been proposed for this purpose. The simplest fusion scheme is majority voting [XU 92]: the class assigned to a given pattern (object, instance) is the class that appears most often among the individual decisions of the classifiers. When classifiers provide confidence scores on the actual class of a given pattern in the form of probability distributions, one technique consists in taking the average of these distributions and deciding that the class of the pattern is the one of maximum probability [XU 92]. Other schemes such as [ELO 10] and [ELO 04] involve the notion of discounting [SHA 76, SME 93], that is, weakening the decisions provided by the classifiers according to their reliability. In these schemes, the decisions of the classifiers are first discounted and then combined using an appropriate operator, called Dempster's rule of combination [DEM 67]. In other works, classifier outputs are combined using more advanced combination operators, for instance triangular norm based combination rules [DEN 08, PIC 10], as is the case in [QUO 11].

To decide which fusion strategy to use to combine a set of classifiers, one may proceed as in [QUO 11], that is, one may learn the best fusion strategy. This learning amounts to looking for the fusion strategy that optimizes a given performance criterion, such as the error rate, over a set of data. This is the approach that we have followed to take advantage of the BDUs' complementary orthogonal detector techniques. As will be seen in the results part of this paper, this approach allows us to obtain a detection system that provides a more consistent global response and that improves on the detection performance of each individual BDU.

The rest of this paper is organized as follows. First, we provide a brief description of the BDUs considered in TWOBIAS, as well as of the first TWOBIAS experiment, which yielded the set of data that is needed in our approach. Then, we recall some basic performance measures that can be used to evaluate a classifier.
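As an illustration of the two simplest fusion schemes described in the Approach section (majority voting, and averaging of probability distributions followed by a maximum-probability decision), here is a minimal Python sketch. The function names, the 0/1 encoding of "no alarm"/"alarm", and the tie-breaking rules are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def majority_vote(decisions: Sequence[int]) -> int:
    """Return the class that appears most often among the individual decisions.

    Assumption of this sketch: a tie is resolved in favour of the larger
    label, i.e. the alarm class when labels are 0 (no alarm) and 1 (alarm).
    """
    counts = Counter(decisions)
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    return max(winners)


def average_fusion(distributions: Sequence[Sequence[float]]) -> int:
    """Average the classifiers' probability distributions and return the
    index of the class of maximum average probability.

    Assumption of this sketch: a tie is resolved in favour of the lowest index.
    """
    n_classes = len(distributions[0])
    mean = [
        sum(dist[k] for dist in distributions) / len(distributions)
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=lambda k: mean[k])
```

For instance, with three detectors whose alarms are `[1, 0, 1]`, `majority_vote` decides class 1 (alarm). With an odd number of binary detectors, as is the case with three BDUs, no tie can occur in the vote.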
We proceed with a description of the experiments that we performed to evaluate our approach, as well as of the results that we obtained. Finally, we conclude this paper with some perspectives.

TWOBIAS BIOLOGICAL DETECTION UNITS AND FIRST EXPERIMENT

At the time of performing this work, three BDUs were available. These BDUs can be expected to be somewhat complementary since they rely on different techniques and settings: one relies on flame emission spectroscopy and the other two rely on laser induced fluorescence, but are tuned differently. Hereafter, these BDUs will be denoted by BDU1, BDU2, and BDU3.

The first TWOBIAS experiment was conducted in a facility of the Direction Générale de l'armement. Several disseminations of biological agents simulating biological pathogenic agents took place during this experiment. The outputs of the three BDUs, i.e., the alarms raised by the BDUs, were recorded during those disseminations. Offline, we also analyzed the BIU outputs, which tell us whether a (simulant of a) biological pathogenic agent was actually present or not in the air at a given time. This is important since those BIU outputs can be used as references regarding whether a given time-stamped air sample should have raised an alarm or not, and thus they can be used to evaluate the performances of the BDUs. Those various recordings and analyses yielded a data set of about records, each record corresponding to an air sample and being a 5-tuple containing a date, the outputs (i.e., presence or absence of alarm) of the three BDUs, and the expected output (estimated using the BIU outputs), also called label, for this sample.

PERFORMANCE MEASURES IN BINARY CLASSIFICATION

Let us assume in this section that we have at our disposal a set of air samples whose true classes are known, i.e., for each of these samples, we know whether or not it contains a (simulant of a) biological pathogenic agent. Besides, we also have access to the class predicted by a classifier for each of those samples. From those pieces of information, we can compute the number of:

- True Positives (TP), which are the samples correctly classified as presence of simulant, i.e., correct detections;
- True Negatives (TN), which are the samples correctly classified as absence of simulant;
- False Positives (FP), which are the samples incorrectly classified as presence of simulant, i.e., the false alarms;
- False Negatives (FN), which are the samples incorrectly classified as absence of simulant.

Using this terminology, the error rate of the classifier over the set of samples is simply defined as:

$$\text{error rate} = \frac{FP + FN}{TP + TN + FP + FN}.$$

The error rate is a natural and sensible performance criterion in general. However, it is not really adapted to imbalanced data. Two other common performance criteria in binary classification are called recall (also known as sensitivity) and precision. They are defined as follows:

$$\text{recall} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{precision} = \frac{TP}{TP + FP}.$$

Recall is the proportion of samples for which the classifier rightly raised an alarm (i.e., samples correctly identified as presence of simulant), out of all samples that actually should have raised an alarm. Precision, on the other hand, is the proportion of samples for which the classifier rightly raised an alarm, out of all samples for which it raised an alarm.
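The following Python sketch shows how the counts TP, TN, FP, FN and the three criteria above (error rate, recall, precision) can be computed from reference labels and predicted classes. The function names and the 0/1 encoding (1 for presence of simulant or alarm, 0 otherwise) are assumptions of this example, as is the convention of returning 0.0 when a ratio is undefined.

```python
from typing import Sequence, Tuple


def confusion_counts(
    labels: Sequence[int], predictions: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Count (TP, TN, FP, FN) for 0/1 reference labels and 0/1 predictions."""
    tp = tn = fp = fn = 0
    for label, prediction in zip(labels, predictions):
        if label == 1 and prediction == 1:
            tp += 1  # correct detection
        elif label == 0 and prediction == 0:
            tn += 1  # correct absence of alarm
        elif label == 0 and prediction == 1:
            fp += 1  # false alarm
        else:
            fn += 1  # missed detection
    return tp, tn, fp, fn


def error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of misclassified samples."""
    return (fp + fn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 if no sample should have raised an alarm."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); returns 0.0 if no alarm was raised."""
    return tp / (tp + fp) if tp + fp else 0.0
```

On a toy set of 3 positive and 7 negative samples, a detector that never raises an alarm obtains an error rate of 0.3 although its recall is 0, which illustrates why the error rate alone can be misleading on imbalanced data.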

Another useful and often used indicator of performance is the $F_1$-measure (or $F_1$-score). It can be interpreted as an average¹ of the precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

where the $F_1$-score reaches its best value at 1 and its worst value at 0. A more general form exists and is called the $F_\beta$-measure. It is defined as follows:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

or, equivalently, by

$$F_\beta = \frac{(1 + \beta^2) \cdot TP}{(1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP}.$$

The $F_2$-measure (i.e., $\beta = 2$), for instance, weights recall higher than precision, whereas the $F_{0.5}$-measure puts more emphasis on precision than recall. The $F_\beta$-measure measures the effectiveness of detection with respect to a user who attaches $\beta$ times as much importance to recall as to precision.

EXPERIMENTS

As discussed above, the difficulty in this classifier fusion-based approach is to determine the best strategy to combine the detector outputs, out of a given set of fusion strategies. A second issue is to evaluate the performance of this approach. In order to perform these two tasks, a standard method consists in splitting the available labeled data set into two disjoint sets, called the training set and the test set (this is known in the pattern classification literature as the hold-out method). The training set is used to learn the best fusion strategy, which basically amounts to selecting the strategy that optimizes a given performance criterion, such as the $F_1$-score, on the training set. The test set is used to evaluate independently the performance of this best fusion strategy, according to the same performance criterion as the one used for the training phase.

This basic method nonetheless has the drawback that the hold-out estimate of the performance of the best fusion strategy may be misleading if we happen to get an unfortunate split. To overcome this drawback, other methods can be used. In particular, the method known as random subsampling performs n splits of the data set. Each split randomly selects a (fixed) number of examples without replacement.
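The evaluation protocol described in this section (select on the training examples the fusion strategy that maximizes an F_β-score, score that strategy on the test examples, and average over several random splits) can be sketched in Python as follows. This is only an illustrative sketch: the record layout, the function names, the default parameter values and the three toy strategies are assumptions of this example, and the toy strategies merely stand in for the fusion schemes actually considered in the paper.

```python
import random
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Record = Tuple[Tuple[int, ...], int]       # (detector alarms, reference label)
Strategy = Callable[[Sequence[int]], int]  # maps the alarms to a fused decision


def f_beta(records: Sequence[Record], strategy: Strategy, beta: float) -> float:
    """F_beta-score of a fusion strategy over labelled records.

    Uses the count-based form (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP) and
    returns 0.0 when the score is undefined (assumption of this sketch).
    """
    tp = fp = fn = 0
    for alarms, label in records:
        decision = strategy(alarms)
        if decision == 1 and label == 1:
            tp += 1
        elif decision == 1 and label == 0:
            fp += 1
        elif decision == 0 and label == 1:
            fn += 1
    denominator = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denominator if denominator else 0.0


def random_subsampling(
    records: Sequence[Record],
    strategies: Dict[str, Strategy],
    beta: float,
    n_splits: int = 20,
    train_fraction: float = 0.4,
    seed: int = 0,
) -> float:
    """Average test F_beta-score of the strategy relearnt on each random split."""
    rng = random.Random(seed)
    n_train = int(train_fraction * len(records))
    scores = []
    for _ in range(n_splits):
        # A random permutation draws the training examples without replacement.
        shuffled = rng.sample(list(records), len(records))
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Learning: keep the strategy with the best criterion on the training set
        # (the first one listed wins in case of a tie).
        best = max(strategies.values(), key=lambda s: f_beta(train, s, beta))
        # Evaluation: same criterion, computed on the disjoint test set.
        scores.append(f_beta(test, best, beta))
    return mean(scores)


# Hypothetical toy strategies over binary alarms.
def majority(alarms: Sequence[int]) -> int:
    return int(2 * sum(alarms) > len(alarms))


def any_alarm(alarms: Sequence[int]) -> int:
    return int(any(alarms))


def all_alarms(alarms: Sequence[int]) -> int:
    return int(all(alarms))


STRATEGIES: Dict[str, Strategy] = {
    "majority": majority,
    "any_alarm": any_alarm,
    "all_alarms": all_alarms,
}
```

Calling `random_subsampling(records, STRATEGIES, beta=1.0)` on a list of `(alarms, label)` records returns the averaged test $F_1$-score; calling it once per value of β gives one estimate per performance criterion.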
For each data split, the best fusion strategy is relearnt from scratch with the training examples and its performance is evaluated with the test examples. The overall performance of the fusion-based approach is then evaluated as the average of the separate estimates.

In this work, the set of fusion strategies that we considered consists of all the schemes mentioned in the Approach section of this paper. Besides, we used the random subsampling method described above, with n = 20 (for each split, the learning data represented 2/5 of the entire available data) and in conjunction with several performance criteria: the $F_\beta$-scores, for β = 0.2, 0.5, 0.7, 1, 2. In other words, we applied the random subsampling method five times (once for each performance criterion).

The following figure synthesizes these experiments: the values 0.2, 0.5, 0.7, 1 and 2 on the x-axis correspond to the five performance measures considered, and the y-axis provides the average values (and standard deviation) taken by these performance measures over the n repetitions for the different detection systems studied in this work, i.e., the three BDUs as well as our fusion approach (called Fusion in the figure). Let us recall that the higher these values, the better the performance.

As can be seen in this figure, our approach always yields the best detection system and it is at worst as good as BDU1 (this happens for the $F_{0.7}$-measure). In addition, we remark that, depending on the value of β, some BDUs are better than the others. Our approach offers some robustness in this respect since it is always the best detection system. The approach also seems to take advantage of the quality of BDU1 and BDU3 for the $F_{0.2}$-measure, yielding an $F_{0.2}$-score that is better than the $F_{0.2}$-scores of these two BDUs and is in itself good. Similarly, for the $F_2$-measure, the approach seems to take advantage of the quality of BDU1 and BDU2. Overall, these experiments clearly show the robustness as well as the increase of performance that a detection system based on the fusion of individual systems offers.

CONCLUSION

This study has shown that it is possible to propose a detection system that combines the alarms raised by the TWOBIAS BDUs and that exhibits better detection performance as well as better robustness than any of these BDUs. In addition, the performances of this detection system are in themselves promising. It is foreseen that the performances could be further enhanced by considering some improvements to the proposed detection system. In particular, it seems worthwhile to include the temporal dimension in the fusion process. In addition, it seems interesting to better leverage in the fusion process the various uncertainties that can emerge in such a system, in particular the BDUs' own confidence in their outputs. Future work will be dedicated to these improvements as well as to testing this approach on the data obtained during the recent TWOBIAS underground experiment.

¹ It is actually the harmonic mean of precision and recall.

REFERENCES

[DEM 67] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38.
[DEN 08] T. Denoeux. Conjunctive and disjunctive combination of belief functions induced by non-distinct bodies of evidence. Artificial Intelligence, 172.
[DUD 00] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons.
[ELO 10] Z. Elouedi, E. Lefevre, and D. Mercier. Discountings of a belief function using a confusion matrix. In 22nd IEEE International Conference on Tools with Artificial Intelligence, volume 1, Arras, France, North-Holland.
[ELO 04] Z. Elouedi, K. Mellouli, and Ph. Smets. Assessing sensor reliability for multisensor data fusion within the Transferable Belief Model. IEEE Transactions on Systems, Man and Cybernetics B, 34(1).
[KIT 98] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3).
[PIC 10] F. Pichon and T. Denoeux. The unnormalized Dempster's rule of combination: a new justification from the least commitment principle and some extensions. Journal of Automated Reasoning, 45(1):61-87.
[QUO 11] B. Quost, M.-H. Masson, and T. Denoeux. Classifier fusion in the Dempster-Shafer framework using optimized t-norm based combination rules. International Journal of Approximate Reasoning, 52(3), 2011.
[ROG 94] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5).
[SHA 76] G. Shafer. A mathematical theory of evidence. Princeton University Press, Princeton, N.J.
[SME 93] Ph. Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9(1):1-35.
[XU 92] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22, 1992.


More information

Machine Learning. Topic: Evaluating Hypotheses

Machine Learning. Topic: Evaluating Hypotheses Machine Learning Topic: Evaluating Hypotheses Bryan Pardo, Machine Learning: EECS 349 Fall 2011 How do you tell something is better? Assume we have an error measure. How do we tell if it measures something

More information

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

L13: cross-validation

L13: cross-validation Resampling methods Cross validation Bootstrap L13: cross-validation Bias and variance estimation with the Bootstrap Three-way data partitioning CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU

More information

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

More information

A DECISION TREE BASED PEDOMETER AND ITS IMPLEMENTATION ON THE ANDROID PLATFORM

A DECISION TREE BASED PEDOMETER AND ITS IMPLEMENTATION ON THE ANDROID PLATFORM A DECISION TREE BASED PEDOMETER AND ITS IMPLEMENTATION ON THE ANDROID PLATFORM ABSTRACT Juanying Lin, Leanne Chan and Hong Yan Department of Electronic Engineering, City University of Hong Kong, Hong Kong,

More information

Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images

Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images 1 Enes ATEŞ, 2 *Tahir Emre KALAYCI, 1 Aybars UĞUR 1 Faculty of Engineering, Department of Computer Engineering Ege

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection.

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Random Forest Based Imbalanced Data Cleaning and Classification

Random Forest Based Imbalanced Data Cleaning and Classification Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

More information

Contradiction measures and specificity degrees of basic belief assignments

Contradiction measures and specificity degrees of basic belief assignments Contradiction measures and specificity degrees of basic belief assignments Florentin Smarandache Math. & Sciences Dept. University of New Mexico, 200 College Road, Gallup, NM 87301, U.S.A. Email: smarand@unm.edu

More information

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline

More information

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

More information

An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients

An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients Celia C. Bojarczuk 1, Heitor S. Lopes 2 and Alex A. Freitas 3 1 Departamento

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Stabilization by Conceptual Duplication in Adaptive Resonance Theory

Stabilization by Conceptual Duplication in Adaptive Resonance Theory Stabilization by Conceptual Duplication in Adaptive Resonance Theory Louis Massey Royal Military College of Canada Department of Mathematics and Computer Science PO Box 17000 Station Forces Kingston, Ontario,

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Threshold Logic. 2.1 Networks of functions

Threshold Logic. 2.1 Networks of functions 2 Threshold Logic 2. Networks of functions We deal in this chapter with the simplest kind of computing units used to build artificial neural networks. These computing elements are a generalization of the

More information

A Novel Solution on Alert Conflict Resolution Model in Network Management

A Novel Solution on Alert Conflict Resolution Model in Network Management A Novel Solution on Alert Conflict Resolution Model in Network Management Yi-Tung F. Chan University of Wales United Kingdom FrankChan2005@gmail.com Ramaswamy D.Thiyagu University of East London United

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique

A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing

A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing Maryam Daneshmandi mdaneshmandi82@yahoo.com School of Information Technology Shiraz Electronics University Shiraz, Iran

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Performance of Probability Transformations Using Simulated Human Opinions

Performance of Probability Transformations Using Simulated Human Opinions Performance of Probability Transformations Using Simulated Human Opinions Donald J. Bucci, Sayandeep Acharya, Timothy J. Pleskac, and Moshe Kam Department of Electrical and Computer Engineering, Drexel

More information

Pedestrian Detection with RCNN

Pedestrian Detection with RCNN Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University mcc17@stanford.edu Abstract In this paper we evaluate the effectiveness of using a Region-based Convolutional

More information

Evaluation and Credibility. How much should we believe in what was learned?

Evaluation and Credibility. How much should we believe in what was learned? Evaluation and Credibility How much should we believe in what was learned? Outline Introduction Classification with Train, Test, and Validation sets Handling Unbalanced Data; Parameter Tuning Cross-validation

More information

Decision Algorithms in Fire Detection Systems

Decision Algorithms in Fire Detection Systems SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 8, No. 2, May 2011, 155-161 UDK: 654.924.5:004.21; 614.842.4:004.021 Decision Algorithms in Fire Detection Systems Jovan D. Ristić 1, Dragana B. Radosavljević

More information

Spam Filter Optimality Based on Signal Detection Theory

Spam Filter Optimality Based on Signal Detection Theory Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited

More information

Binary and Ranked Retrieval

Binary and Ranked Retrieval Binary and Ranked Retrieval Binary Retrieval RSV(d i,q j ) {0,1} Does not allow the user to control the magnitude of the output. In fact, for a given query, the system may return under-dimensioned output

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

Florida International University - University of Miami TRECVID 2014

Florida International University - University of Miami TRECVID 2014 Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,

More information

Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition

Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition Paulo Marques 1 Instituto Superior de Engenharia de Lisboa / Instituto de Telecomunicações R. Conselheiro Emídio

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Counting the cost Data Mining Practical Machine Learning Tools and Techniques Slides for Section 5.7 In practice, different types of classification errors often incur different costs Examples: Loan decisions

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Evaluation in Machine Learning (1) Evaluation in machine learning. Aims. 14s1: COMP9417 Machine Learning and Data Mining.

Evaluation in Machine Learning (1) Evaluation in machine learning. Aims. 14s1: COMP9417 Machine Learning and Data Mining. Acknowledgements Evaluation in Machine Learning (1) 14s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 19, 2014 Material derived

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Introduction To Ensemble Learning

Introduction To Ensemble Learning Educational Series Introduction To Ensemble Learning Dr. Oliver Steinki, CFA, FRM Ziad Mohammad July 2015 What Is Ensemble Learning? In broad terms, ensemble learning is a procedure where multiple learner

More information

Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information