A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION
Frédéric Pichon 1, Florence Aligne 1, Gilles Feugnet 1 and Janet Martha Blatny 2

1 Thales Research & Technology, Campus Polytechnique, 1 avenue Augustin Fresnel, Palaiseau cedex, France; 2 FFI, P.O. Box 25, N-2027 Kjeller, Norway.

INTRODUCTION

The aim of the TWOBIAS EU FP7 project ( ) is to develop a modular demonstrator of a stationary, reliable, vehicle-portable, low false alarm rate Two Stage Rapid Biological Surveillance and Alarm System for Airborne Threats (TWOBIAS). The TWOBIAS concept is built around two successive stages: a first detect-to-warn stage dedicated to biological threat detection using Biological Detection Units (BDUs), and a second detect-to-treat stage devoted to pathogenic agent identification using Biological Identification Units (BIUs). In the first stage, the main challenge of the TWOBIAS system is to improve on the state of the art in biological threat detection by using orthogonal detector techniques together with a mathematical framework and high level information processing approaches. This paper describes the approach developed for the first stage of the TWOBIAS project, as well as preliminary results suggesting its potential usefulness.

APPROACH

The BDUs may be seen as so-called classifiers in pattern classification [DUD 00]. Indeed, for each air sample, each BDU either raises an alarm or does not, depending on whether it judges that the air sample contains a pathogenic biological agent. Hence, each BDU is a classifier in that it assigns each air sample to one of two groups (biological pathogenic agents and non biological pathogenic agents). There exists a large body of literature dedicated to the use of multiple classifiers, also called classifier ensembles, which is now widely acknowledged as a useful means to address difficult pattern classification problems [KIT 98, ROG 94].
The main idea underlying classifier ensembles is to exploit the potential complementarity of different classifiers so as to obtain better classification performance. A central issue is then how to combine the classifier outputs, and several fusion strategies have been proposed for this purpose. The simplest fusion scheme is majority voting [XU 92]: the class assigned to a given pattern (object, instance) is the class that occurs most often among the individual decisions of the classifiers. When classifiers provide confidence scores on the actual class of a given pattern in the form of probability distributions, one technique consists in averaging these distributions and assigning the pattern to the class of maximum probability [XU 92]. Other schemes, such as [ELO 10] and [ELO 04], involve the notion of discounting [SHA 76, SME 93], that is, weakening the decisions provided by the classifiers according to their reliability. In these schemes, the decisions of the classifiers are first discounted and then combined using an appropriate operator (called Dempster's rule of combination [DEM 67]). In other works, classifier outputs are combined using more advanced combination operators, for instance triangular norm based combination rules [DEN 08, PIC 10], as is the case in [QUO 11].

To decide which fusion strategy to use to combine a set of classifiers, one may proceed as in [QUO 11], that is, one may learn the best fusion strategy. This learning amounts to looking for the fusion strategy that optimizes a given performance criterion, such as the error rate, over a set of data. This is the approach that we have followed to take advantage of the complementary, orthogonal detector techniques of the BDUs. As will be seen in the results part of this paper, this approach allows us to obtain a detection system that provides a more consistent global response and that improves on the detection performance of each individual BDU.

The rest of this paper is organized as follows. First, we provide a brief description of the BDUs considered in TWOBIAS, as well as of the first TWOBIAS experiment, which yielded the set of data that is needed in our approach. Then, we recall some basic performance measures that can be used to evaluate a classifier.
We proceed with a description of the experiments that we performed to evaluate our approach, as well as of the results that we obtained. Finally, we conclude this paper with some perspectives.

TWOBIAS BIOLOGICAL DETECTION UNITS AND FIRST EXPERIMENT

At the time of performing this work, three BDUs were available. These BDUs can be expected to be somewhat complementary since they rely on different techniques and settings: one relies on flame emission spectroscopy, and the other two rely on laser induced fluorescence but are tuned differently. Hereafter, these BDUs will be denoted by BDU1, BDU2, and BDU3.

The first TWOBIAS experiment was conducted in a facility of the Direction Générale de l'armement. Several disseminations of biological agents simulating biological pathogenic agents took place during this experiment. The outputs of the three BDUs, i.e., the alarms raised by the BDUs, were recorded during those dispersions. Offline, we also analyzed the BIU outputs, which tell us whether a (simulant of a) biological pathogenic agent was actually present in the air at a given time. This is important since those BIU outputs can be used as references regarding whether a given time-stamped air sample should have raised an alarm, and thus they can be used to evaluate the performance of the BDUs. These recordings and analyses yielded a data set of about records, each record corresponding to an air sample and being a 5-tuple containing a date, the outputs (i.e., presence or absence of alarm) of the three BDUs, and the expected output (estimated using the BIU outputs), also called the label, for this sample.

PERFORMANCE MEASURES IN BINARY CLASSIFICATION

Let us assume in this section that we have at our disposal a set of air samples whose true classes are known, i.e., for each of these samples, we know whether or not it contains a (simulant of a) biological pathogenic agent. We also have access to the class predicted for each of those samples by a classifier. From those pieces of information, we can compute the number of:
- True Positives (TP), which are the samples correctly classified as presence of simulant, i.e., correct detections;
- True Negatives (TN), which are the samples correctly classified as absence of simulant;
- False Positives (FP), which are the samples incorrectly classified as presence of simulant, i.e., the false alarms;
- False Negatives (FN), which are the samples incorrectly classified as absence of simulant.

Using this terminology, the error rate of the classifier over the set of samples is simply defined as:

$$\text{error rate} = \frac{FP + FN}{TP + TN + FP + FN}.$$

The error rate is a natural and sensible performance criterion in general. However, it is not well suited to imbalanced data: when samples containing simulant are rare, a classifier that never raises an alarm still achieves a low error rate. Two other common performance criteria in binary classification are called recall (also known as sensitivity) and precision. They are defined as follows:

$$\text{recall} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{precision} = \frac{TP}{TP + FP}.$$
Recall is the proportion, among all samples that actually should have raised an alarm, of samples for which the classifier rightly raised an alarm (i.e., samples correctly identified as presence of simulant). Precision, on the other hand, is the proportion, among all samples for which the classifier raised an alarm, of samples for which it rightly did so.
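The measures defined above can be computed directly from the four counts. The following is a minimal sketch only: the function names, the 0/1 encoding of labels and alarms, the convention of returning 0.0 when a denominator is zero, and the toy data are all assumptions made for illustration and do not come from the TWOBIAS experiment.

```python
from typing import Sequence, Tuple


def confusion_counts(labels: Sequence[int], alarms: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, TN, FP, FN); 1 means simulant present (label) or alarm raised."""
    pairs = list(zip(labels, alarms))
    tp = sum(1 for y, a in pairs if y == 1 and a == 1)
    tn = sum(1 for y, a in pairs if y == 0 and a == 0)
    fp = sum(1 for y, a in pairs if y == 0 and a == 1)
    fn = sum(1 for y, a in pairs if y == 1 and a == 0)
    return tp, tn, fp, fn


def error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of samples that were misclassified."""
    return (fp + fn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    # Assumed convention: 0.0 when no sample should have raised an alarm.
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    # Assumed convention: 0.0 when the classifier never raised an alarm.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical imbalanced data: 5 samples with simulant, 95 without.
labels = [1] * 5 + [0] * 95
silent_alarms = [0] * 100          # a detector that never raises an alarm
tp, tn, fp, fn = confusion_counts(labels, silent_alarms)
print(error_rate(tp, tn, fp, fn))  # 0.05
print(recall(tp, fn))              # 0.0
```

In this toy case the silent detector has an error rate of only 5% although it misses every release, which illustrates why the error rate alone can be misleading on imbalanced data.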
Another useful and often used indicator of performance is the $F_1$-measure (or $F_1$-score). It can be interpreted as an average (footnote 1) of the precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

where an $F_1$-score reaches its best value at 1 and its worst at 0. A more general form exists and is called the $F_\beta$-measure. It is defined as follows

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

or, equivalently, by

$$F_\beta = \frac{(1 + \beta^2)\, TP}{(1 + \beta^2)\, TP + \beta^2\, FN + FP}.$$

The $F_2$-measure (i.e., $\beta = 2$), for instance, weights recall higher than precision, whereas the $F_{0.5}$-measure puts more emphasis on precision than recall. The $F_\beta$-measure measures the effectiveness of detection with respect to a user who attaches $\beta$ times as much importance to recall as to precision.

EXPERIMENTS

As discussed above, the difficulty in this classifier fusion-based approach is to determine the best strategy to combine the detector outputs, out of a given set of fusion strategies. A second issue is to evaluate the performance of this approach. To perform these two tasks, a standard method consists in splitting the available labeled data set into two disjoint sets, called the training set and the test set (this is known in the pattern classification literature as the hold-out method). The training set is used to learn the best fusion strategy, which basically amounts to selecting the strategy that optimizes a given performance criterion, such as the $F_1$-score, on the training set. The test set is used to evaluate independently the performance of this best fusion strategy, according to the same performance criterion as the one used for the training phase. This basic method nonetheless has the drawback that the hold-out estimate of the performance of the best fusion strategy may be misleading if we happen to get an unfortunate split. To overcome this drawback, other methods can be used. In particular, the method known as random subsampling performs n data splits of the data set. Each split randomly selects a (fixed) number of examples without replacement.
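A rough sketch of this selection procedure is given below, under stated assumptions. The three candidate strategies (majority vote, "any" and "all"), the function names, the default split size and number of splits, and the toy records are hypothetical; they only illustrate the mechanics and are not the fusion schemes or the data used in this work.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# A record pairs the alarms of the three BDUs with the expected label.
Record = Tuple[Tuple[int, int, int], int]
Fusion = Callable[[Sequence[int]], int]

# Hypothetical candidate fusion strategies (illustration only).
STRATEGIES: Dict[str, Fusion] = {
    "majority": lambda a: int(2 * sum(a) > len(a)),
    "any": lambda a: int(any(a)),
    "all": lambda a: int(all(a)),
}


def f_beta(records: Sequence[Record], fuse: Fusion, beta: float) -> float:
    """F-beta score of a fusion strategy, computed from TP, FP and FN."""
    tp = fp = fn = 0
    for alarms, label in records:
        decision = fuse(alarms)
        if decision and label:
            tp += 1
        elif decision and not label:
            fp += 1
        elif label:
            fn += 1
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def select_best(train: Sequence[Record], beta: float) -> str:
    """Name of the strategy with the highest F-beta score on the training set."""
    return max(STRATEGIES, key=lambda name: f_beta(train, STRATEGIES[name], beta))


def random_subsampling(records: List[Record], beta: float, n_splits: int = 20,
                       train_fraction: float = 0.4, seed: int = 0) -> float:
    """Average test F-beta over n_splits random train/test splits."""
    rng = random.Random(seed)
    n_train = int(len(records) * train_fraction)
    scores = []
    for _ in range(n_splits):
        shuffled = rng.sample(records, len(records))  # drawn without replacement
        train, test = shuffled[:n_train], shuffled[n_train:]
        best = select_best(train, beta)               # relearnt for every split
        scores.append(f_beta(test, STRATEGIES[best], beta))
    return mean(scores)


# Toy records, invented for this example.
toy = [((1, 1, 0), 1), ((1, 0, 0), 0), ((0, 0, 0), 0), ((1, 1, 1), 1), ((0, 1, 0), 1)]
print(select_best(toy, beta=0.5))  # majority
print(select_best(toy, beta=2.0))  # any
```

On the toy records, the selected strategy changes with $\beta$: a precision-oriented criterion favours the majority vote, whereas a recall-oriented one favours raising an alarm as soon as any detector does.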
For each data split, the best fusion strategy is relearnt from scratch with the training examples and its performance is evaluated with the test examples. The overall performance of the fusion-based approach is then evaluated as the average of the separate estimates.

Footnote 1: It is actually the harmonic mean of precision and recall.

In this work, the fusion strategies that we considered are all the schemes mentioned in the Approach section of this paper. In addition, we used the random subsampling method
described in the previous paragraph, with n=20 (for each split, the training data represented 2/5 of the entire available data) and in conjunction with several performance criteria: the $F_\beta$-scores, for $\beta$ = 0.2, 0.5, 0.7, 1, 2. In other words, we applied the random subsampling method five times (once for each performance criterion).

The following figure synthesizes these experiments: the values 0.2, 0.5, 0.7, 1 and 2 on the x-axis correspond to the five performance measures considered, and the y-axis provides the average values (and standard deviation) taken by these performance measures over the n repetitions for the different detection systems studied in this work, i.e., the three BDUs as well as our fusion approach (called Fusion in the figure). Let us recall that the higher these values, the better the performance. As can be seen in this figure, our approach always yields the best detection system, and it is at worst as good as BDU1 (this happens for the $F_{0.7}$-measure). In addition, we remark that, depending on the value of $\beta$, some BDUs are better than others. Our approach offers some robustness in this respect since it is always the best detection system. The approach also seems to take advantage of the quality of BDU1 and BDU3 for the $F_{0.2}$-measure, yielding an $F_{0.2}$-score that is better than the $F_{0.2}$-scores of these two BDUs and is in itself good. Similarly, for the $F_2$-measure, the approach seems to take advantage of the quality of BDU1 and BDU2. Overall, these experiments clearly show the robustness as well as the increase in performance that a detection system based on the fusion of individual systems offers.

CONCLUSION

This study has shown that it is possible to propose a detection system that combines the alarms raised by the TWOBIAS BDUs and that exhibits better detection performance as well as better robustness than any of these BDUs. In addition, the performances of this detection
system are in themselves promising. It is foreseen that the performance could be further enhanced by considering some improvements to the proposed detection system. In particular, it seems worthwhile to include the temporal dimension in the fusion process. In addition, it seems interesting to better leverage in the fusion process the various uncertainties that can emerge in such a system, in particular the BDUs' own confidence in their outputs. Future work will be dedicated to these improvements as well as to testing this approach on the data obtained during the recent TWOBIAS underground experiment.

REFERENCES

[DEM 67] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38.
[DEN 08] T. Denoeux. Conjunctive and disjunctive combination of belief functions induced by non-distinct bodies of evidence. Artificial Intelligence, 172.
[DUD 00] R. O. Duda, P. E. Hart and D. G. Stork. Pattern classification. John Wiley & Sons.
[ELO 10] Z. Elouedi, E. Lefevre and D. Mercier. Discountings of a belief function using a confusion matrix. In 22nd IEEE International Conference on Tools with Artificial Intelligence, volume 1, Arras, France, North-Holland.
[ELO 04] Z. Elouedi, K. Mellouli and Ph. Smets. Assessing sensor reliability for multisensor data fusion within the Transferable Belief Model. IEEE Transactions on Systems, Man and Cybernetics B, 34(1).
[KIT 98] J. Kittler, M. Hatef, R. Duin and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3).
[PIC 10] F. Pichon and T. Denoeux. The unnormalized Dempster's rule of combination: a new justification from the least commitment principle and some extensions. Journal of Automated Reasoning, 45(1):61-87.
[QUO 11] B. Quost, M.-H. Masson and T. Denoeux. Classifier fusion in the Dempster-Shafer framework using optimized t-norm based combination rules. International Journal of Approximate Reasoning, 52(3), 2011.
[ROG 94] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5).
[SHA 76] G. Shafer. A mathematical theory of evidence. Princeton University Press, Princeton, N.J.
[SME 93] Ph. Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9(1):1-35.
[XU 92] L. Xu, A. Krzyzak and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22, 1992.
Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images 1 Enes ATEŞ, 2 *Tahir Emre KALAYCI, 1 Aybars UĞUR 1 Faculty of Engineering, Department of Computer Engineering Ege
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationAddressing the Class Imbalance Problem in Medical Datasets
Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,
More informationA Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks
A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences
More informationSYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis
SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationDecision Algorithms in Fire Detection Systems
SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 8, No. 2, May 2011, 155-161 UDK: 654.924.5:004.21; 614.842.4:004.021 Decision Algorithms in Fire Detection Systems Jovan D. Ristić 1, Dragana B. Radosavljević
More informationCOMPUTATIONAL METHODS FOR A MATHEMATICAL THEORY OF EVIDENCE
COMPUTATIONAL METHODS FOR A MATHEMATICAL THEORY OF EVIDENCE Jeffrey A. Barnett USC/lnformation Sciences Institute ABSTRACT: Many knowledge-based expert systems employ numerical schemes to represent evidence,
More informationDissecting the Learning Behaviors in Hacker Forums
Dissecting the Learning Behaviors in Hacker Forums Alex Tsang Xiong Zhang Wei Thoo Yue Department of Information Systems, City University of Hong Kong, Hong Kong inuki.zx@gmail.com, xionzhang3@student.cityu.edu.hk,
More informationPedestrian Detection with RCNN
Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University mcc17@stanford.edu Abstract In this paper we evaluate the effectiveness of using a Region-based Convolutional
More informationConsolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance
Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz, Ibai Gurrutxaga, and José I. Martín Dept. of Computer
More informationFlorida International University - University of Miami TRECVID 2014
Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,
More informationMonotonicity Hints. Abstract
Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology
More informationOpen World Target Identification Using the Transferable Belief Model
Open World Target Identification Using the Transferable Belief Model Matthew Roberts EADS Innovation Works Newport, UK E-mail: matthew.roberts@eads.com Gavin Powell EADS Innovation Works Newport, UK E-mail:
More informationSystem Aware Cyber Security
System Aware Cyber Security Application of Dynamic System Models and State Estimation Technology to the Cyber Security of Physical Systems Barry M. Horowitz, Kate Pierce University of Virginia April, 2012
More informationDissimilarity representations allow for building good classifiers
Pattern Recognition Letters 23 (2002) 943 956 www.elsevier.com/locate/patrec Dissimilarity representations allow for building good classifiers El_zbieta Pezkalska *, Robert P.W. Duin Pattern Recognition
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationMeasuring Intrusion Detection Capability: An Information-Theoretic Approach
Measuring Intrusion Detection Capability: An Information-Theoretic Approach Guofei Gu, Prahlad Fogla, David Dagon, Boris Škorić Wenke Lee Philips Research Laboratories, Netherlands Georgia Institute of
More informationEvidential multinomial logistic regression for multiclass classifier calibration
Evidential multinomial logistic regression for multiclass classifier calibration Philippe Xu Franck Davoine Thierry Denœux Sorbonne universités Université de technologie de Compiègne CNRS, Heudiasyc UMR
More informationSynthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition
Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition Paulo Marques 1 Instituto Superior de Engenharia de Lisboa / Instituto de Telecomunicações R. Conselheiro Emídio
More informationConstrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm
Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification
More informationA Secure Online Reputation Defense System from Unfair Ratings using Anomaly Detections
A Secure Online Reputation Defense System from Unfair Ratings using Anomaly Detections Asha baby PG Scholar,Department of CSE A. Kumaresan Professor, Department of CSE K. Vijayakumar Professor, Department
More informationDirect Marketing When There Are Voluntary Buyers
Direct Marketing When There Are Voluntary Buyers Yi-Ting Lai and Ke Wang Simon Fraser University {llai2, wangk}@cs.sfu.ca Daymond Ling, Hua Shi, and Jason Zhang Canadian Imperial Bank of Commerce {Daymond.Ling,
More informationOperations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp
More informationTowards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial
More informationSpam Filter Optimality Based on Signal Detection Theory
Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationBOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
More informationUsing News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,
More informationA Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
More informationQuestion 2 Naïve Bayes (16 points)
Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More information