A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION


Frédéric Pichon 1, Florence Aligne 1, Gilles Feugnet 1 and Janet Martha Blatny 2
1 Thales Research & Technology, Campus Polytechnique, 1 avenue Augustin Fresnel, Palaiseau cedex, France; 2 FFI, P.O. Box 25, N-2027 Kjeller, Norway.

INTRODUCTION

The aim of the TWOBIAS EU FP7 project is to develop a modular demonstrator of a stationary, reliable, vehicle-portable, low false alarm rate Two Stage Rapid Biological Surveillance and Alarm System for Airborne Threats (TWOBIAS). The TWOBIAS concept is built around two successive stages: a first detect-to-warn stage dedicated to biological threat detection using Biological Detection Units (BDUs), and a second detect-to-treat stage devoted to pathogenic agent identification using Biological Identification Units (BIUs). In the first stage, the main challenge of the TWOBIAS system is to improve the state of the art in biological threat detection by using orthogonal detector techniques together with a mathematical framework and high-level information processing approaches. This paper describes the approach developed for the first stage of the TWOBIAS project, as well as preliminary results suggesting its potential usefulness.

APPROACH

The BDUs may be seen as so-called classifiers in pattern classification [DUD 00]. Indeed, for each air sample, each BDU either raises an alarm or not, depending on whether it considers that the air sample contains a pathogenic biological agent. Hence, each BDU is a classifier in that it assigns each air sample to one of two groups (biological pathogenic agents and non biological pathogenic agents). There exists a large body of literature dedicated to the use of multiple classifiers, also called classifier ensembles, which are now widely acknowledged as a useful means to address difficult pattern classification problems [KIT 98, ROG 94].
The main idea underlying classifier ensembles is to take advantage of the potential complementarities of different classifiers, so as to obtain potentially better classification performance. One of the central issues is then how to combine the classifier outputs, and several fusion strategies have been proposed for this purpose. The simplest fusion scheme is majority voting [XU 92]. This scheme basically amounts to deciding that the class of a given pattern (object, instance) is the class that occurs most often among the individual decisions of the classifiers. When classifiers provide confidence scores on the actual class of a given pattern in the form of probability distributions, one technique consists in taking the average of these distributions and deciding that the class of the pattern is the one of maximum probability [XU 92]. Other schemes such as [ELO 10] and [ELO 04] involve the notion of discounting [SHA 76, SME 93], that is, weakening the decisions provided by the classifiers according to their reliability. In these schemes, the decisions of the classifiers are first discounted before being combined using an appropriate operator (called Dempster's rule of combination [DEM 67]). In other works, classifier outputs are combined using more advanced combination operators, for instance triangular norm based combination rules [DEN 08, PIC 10], as is the case in [QUO 11].

To decide which fusion strategy to use to combine a set of classifiers, one may proceed as in [QUO 11], that is, one may learn the best fusion strategy. This learning amounts to looking for the fusion strategy that optimizes a given performance criterion, such as the error rate, over a set of data. This is the approach that we have followed to take advantage of the BDUs' complementary orthogonal detector techniques. As will be seen in the results part of this paper, this approach allows us to obtain a detection system that provides a more consistent global response and that improves on the detection performance of each individual BDU.

The rest of this paper is organized as follows. First, we provide a brief description of the BDUs considered in TWOBIAS, as well as of the first TWOBIAS experiment, which yielded the set of data that is needed in our approach. Then, we recall some basic performance measures that can be used to evaluate a classifier.
We proceed with a description of the experiments that we performed to evaluate our approach, as well as of the results that we obtained. Finally, we conclude this paper with some perspectives.

TWOBIAS BIOLOGICAL DETECTION UNITS AND FIRST EXPERIMENT

At the time of performing this work, three BDUs were available. These BDUs can be expected to be somewhat complementary since they rely on different techniques and settings: one relies on flame emission spectroscopy and the other two rely on laser-induced fluorescence but are tuned differently. Hereafter, these BDUs will be denoted by BDU1, BDU2, and BDU3.

The first TWOBIAS experiment was conducted in a facility of the Direction Générale de l'armement. Several disseminations of biological agents simulating biological pathogenic agents took place during this experiment. The outputs of the three BDUs, i.e., the alarms raised by the BDUs, were recorded during those disseminations. Offline, we also analyzed the BIU outputs, which tell us whether a (simulant of a) biological pathogenic agent was actually present or not in the air at a given time. This is important since the BIU outputs can thus be used as references regarding whether a given timestamped air sample should have raised an alarm or not, and hence to evaluate the performances of the BDUs. These recordings and analyses yielded a data set of about records, each record corresponding to an air sample and being a 5-tuple containing a date, the outputs (i.e., presence or absence of alarm) of the three BDUs, and the expected output (estimated using the BIU outputs), also called label, for this sample.

PERFORMANCE MEASURES IN BINARY CLASSIFICATION

Let us assume in this section that we have at our disposal a set of air samples whose true classes are known, i.e., for each of these samples, we know whether or not it contains a (simulant of a) biological pathogenic agent. Besides, we also have access to the class predicted for each of those samples by a classifier. From those pieces of information, we can compute the number of:
True Positives (TP), which are the samples correctly classified as presence of simulant, i.e., correct detections;
True Negatives (TN), which are the samples correctly classified as absence of simulant;
False Positives (FP), which are the samples incorrectly classified as presence of simulant, i.e., the false alarms;
False Negatives (FN), which are the samples incorrectly classified as absence of simulant.

Using this terminology, the error rate of the classifier over the set of samples is simply defined as:

\[ \text{error rate} = \frac{FP + FN}{TP + TN + FP + FN}. \]

The error rate is a natural and sensible performance criterion in general. However, it is not well suited to imbalanced data. Two other common performance criteria in binary classification are called recall (also known as sensitivity) and precision. They are defined as follows:

\[ \text{recall} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{precision} = \frac{TP}{TP + FP}. \]
Recall is the proportion of samples for which the classifier rightfully raised an alarm (i.e., samples correctly identified as presence of simulant), of all samples that actually should have raised an alarm. On the other hand, precision is the proportion of samples for which the classifier rightfully raised an alarm, of all samples for which it raised an alarm.
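As a minimal sketch of the measures described in this section, the following Python snippet counts TP, TN, FP and FN from two lists of binary values and derives the error rate, recall and precision. The function names (`confusion_counts`, `basic_measures`) and the toy labels are hypothetical illustrations and are not taken from the TWOBIAS data; returning 0.0 when a denominator is zero is a convention assumed here.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, TN, FP and FN. 1 = simulant present / alarm raised, 0 = otherwise."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["FN" if truth == 1 else "TN"] += 1
    return counts


def basic_measures(counts: Dict[str, int]) -> Dict[str, float]:
    """Error rate, recall and precision computed from the four counts."""
    tp, tn, fp, fn = counts["TP"], counts["TN"], counts["FP"], counts["FN"]

    def ratio(num: int, den: int) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "error_rate": ratio(fp + fn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }


# Toy example (hypothetical values): reference labels and the alarms of one detector.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
alarms = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(labels, alarms)
measures = basic_measures(counts)
print(counts)
print(measures)
```

With imbalanced data (few samples containing simulant), the error rate can stay low even for a detector that misses most releases, which is why recall and precision are reported alongside it.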
Another useful and often used indicator of performance is the $F_1$ measure (or $F_1$ score). It can be interpreted as an average (1) of the precision and recall:

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \]

where an $F_1$ score reaches its best value at 1 and its worst at 0. A more general form exists and is called the $F_\beta$ measure. It is defined as follows:

\[ F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \]

or, equivalently, by

\[ F_\beta = \frac{(1 + \beta^2)\, TP}{(1 + \beta^2)\, TP + \beta^2\, FN + FP}. \]

The $F_2$ measure (i.e., $\beta = 2$), for instance, weights recall higher than precision, whereas the $F_{0.5}$ measure puts more emphasis on precision than recall. The $F_\beta$ measure measures the effectiveness of detection with respect to a user who attaches $\beta$ times as much importance to recall as to precision.

EXPERIMENTS

As discussed above, the difficulty in this classifier fusion-based approach is to determine the best strategy to combine the detector outputs, out of a given set of fusion strategies. A second issue is to evaluate the performances of this approach. In order to perform these two tasks, a standard method consists in splitting the available labeled data set into two disjoint sets, which are called the training set and the test set (this is known in the pattern classification literature as the holdout method). The training set is used to learn the best fusion strategy (learning the best fusion strategy basically amounts to selecting the strategy that optimizes a given performance criterion, such as the $F_1$ score, on the training set), and the test set is used to evaluate independently the performance of this best fusion strategy, according to the same performance criterion as the one used for the training phase.

This basic method nonetheless has the drawback that the holdout estimate of the performance of the best fusion strategy may be misleading if we happen to get an unfortunate split. To overcome this drawback, other methods can be used. In particular, the method known as random subsampling performs n splits of the data set. Each split randomly selects a (fixed) number of examples without replacement.
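The selection of a fusion strategy by random subsampling can be sketched in Python as follows. This is only an illustration under stated assumptions: the three candidate strategies (majority vote, logical OR, logical AND over the three alarms) are simple stand-ins and do not reproduce the averaging or belief-function-based schemes considered in the paper; the names `STRATEGIES`, `f_beta`, `random_subsampling`, the default parameter values and the toy records are hypothetical. For each split, the strategy with the best $F_\beta$ on the training part is selected and scored on the test part, and the test scores are averaged.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# One record: (alarms of the three BDUs, reference label), all values 0 or 1.
Record = Tuple[Tuple[int, int, int], int]
Strategy = Callable[[Sequence[int]], int]

# Hypothetical stand-in fusion strategies (not the full set used in the paper).
STRATEGIES: Dict[str, Strategy] = {
    "majority": lambda alarms: int(sum(alarms) * 2 > len(alarms)),
    "any": lambda alarms: int(any(alarms)),
    "all": lambda alarms: int(all(alarms)),
}


def f_beta(records: List[Record], strategy: Strategy, beta: float) -> float:
    """F_beta of a fusion strategy, using the TP/FN/FP form of the measure."""
    tp = fp = fn = 0
    for alarms, label in records:
        pred = strategy(alarms)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif label == 1:
            fn += 1
    b2 = beta * beta
    den = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / den if den else 0.0  # assumed convention for 0/0


def random_subsampling(records: List[Record], beta: float, n_splits: int = 20,
                       train_fraction: float = 0.4, seed: int = 0) -> float:
    """Average test F_beta of the strategy re-selected on each random split."""
    rng = random.Random(seed)
    n_train = int(len(records) * train_fraction)
    scores = []
    for _ in range(n_splits):
        shuffled = records[:]
        rng.shuffle(shuffled)  # random split without replacement
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Learn: keep the strategy with the best F_beta on the training part.
        best = max(STRATEGIES, key=lambda name: f_beta(train, STRATEGIES[name], beta))
        # Evaluate: score that strategy on the disjoint test part.
        scores.append(f_beta(test, STRATEGIES[best], beta))
    return sum(scores) / len(scores)


# Toy records (hypothetical values, repeated to get a slightly larger set).
toy: List[Record] = [((1, 1, 0), 1), ((1, 0, 0), 0), ((0, 0, 0), 1),
                     ((1, 1, 1), 1), ((0, 1, 0), 0)]
print(f_beta(toy, STRATEGIES["majority"], beta=1.0))
print(random_subsampling(toy * 4, beta=1.0))
```

Fixing the seed makes the splits reproducible, which is convenient when the same splits should be reused to compare several detection systems.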
For each data split, the best fusion strategy is relearnt from scratch with the training examples and its performance is evaluated with the test examples. The overall performance of the fusion-based approach is then evaluated as the average of the separate estimates.

In this work, the set of fusion strategies that we considered comprises all the schemes mentioned in the Approach section of this paper. Besides, we used the random subsampling method described in the previous paragraph, with n = 20 (for each split, the learning data represented 2/5 of the entire available data) and in conjunction with several performance criteria: the $F_\beta$ scores, for $\beta$ = 0.2, 0.5, 0.7, 1, 2. In other words, we applied the random subsampling method five times (once for each performance criterion).

(1) It is actually the harmonic mean of precision and recall.

The following figure synthesizes these experiments: the values 0.2, 0.5, 0.7, 1 and 2 on the x-axis correspond to the five performance measures considered, and the y-axis provides the average values (and standard deviations) taken by these performance measures over the n repetitions for the different detection systems studied in this work, i.e., the three BDUs as well as our fusion approach (called Fusion in the figure). Let us recall that the higher these values, the better the performances.

As can be seen in this figure, our approach always yields the best detection system and it is at worst as good as BDU1 (this happens for the $F_{0.7}$ measure). In addition, we remark that, depending on the value of $\beta$, some BDUs are better than the others. Our approach offers some robustness in this respect since it is always the best detection system. The approach also seems to take advantage of the quality of BDU1 and BDU3 for the $F_{0.2}$ measure, to yield an $F_{0.2}$ score that is better than the $F_{0.2}$ scores of these two BDUs and is in itself good. Similarly, for the $F_2$ measure, the approach seems to take advantage of the quality of BDU1 and BDU2. Overall, these experiments clearly show the robustness as well as the increase of performance that a detection system based on the fusion of individual systems offers.

CONCLUSION

This study has shown that it is possible to propose a detection system that combines the alarms raised by the TWOBIAS BDUs and that exhibits better detection performances as well as better robustness than any of these BDUs. In addition, the performances of this detection
system are in themselves promising. It is foreseen that the performances could be further enhanced by considering some improvements to the proposed detection system. In particular, it seems worthwhile to include the temporal dimension in the fusion process. In addition, it seems interesting to better leverage in the fusion process the various uncertainties that can emerge in such a system, in particular the BDUs' own confidence in their outputs. Future work will be dedicated to these improvements as well as to testing this approach on the data obtained during the recent TWOBIAS underground experiment.

REFERENCES

[DEM 67] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38.
[DEN 08] T. Denoeux. Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence. Artificial Intelligence, 172.
[DUD 00] R. O. Duda, P. E. Hart and D. G. Stork. Pattern classification. John Wiley & Sons.
[ELO 10] Z. Elouedi, E. Lefevre, and D. Mercier. Discountings of a belief function using a confusion matrix. In 22nd IEEE International Conference on Tools with Artificial Intelligence, volume 1, Arras, France, North-Holland.
[ELO 04] Z. Elouedi, K. Mellouli, and Ph. Smets. Assessing sensor reliability for multisensor data fusion within the Transferable Belief Model. IEEE Transactions on Systems, Man and Cybernetics B, 34(1).
[KIT 98] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3).
[PIC 10] F. Pichon and T. Denoeux. The unnormalized Dempster's rule of combination: a new justification from the least commitment principle and some extensions. Journal of Automated Reasoning, 45(1):61–87.
[QUO 11] B. Quost, M.-H. Masson and T. Denoeux. Classifier fusion in the Dempster-Shafer framework using optimized t-norm based combination rules. International Journal of Approximate Reasoning, 52(3), 2011.
[ROG 94] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5).
[SHA 76] G. Shafer. A mathematical theory of evidence. Princeton University Press, Princeton, N.J.
[SME 93] Ph. Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9(1):1–35.
[XU 92] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22, 1992.
More informationContradiction measures and specificity degrees of basic belief assignments
Contradiction measures and specificity degrees of basic belief assignments Florentin Smarandache Math. & Sciences Dept. University of New Mexico, 200 College Road, Gallup, NM 87301, U.S.A. Email: smarand@unm.edu
More informationSYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis
SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline
More informationTan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status
Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of
More informationAn innovative application of a constrainedsyntax genetic programming system to the problem of predicting survival of patients
An innovative application of a constrainedsyntax genetic programming system to the problem of predicting survival of patients Celia C. Bojarczuk 1, Heitor S. Lopes 2 and Alex A. Freitas 3 1 Departamento
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationStabilization by Conceptual Duplication in Adaptive Resonance Theory
Stabilization by Conceptual Duplication in Adaptive Resonance Theory Louis Massey Royal Military College of Canada Department of Mathematics and Computer Science PO Box 17000 Station Forces Kingston, Ontario,
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationThreshold Logic. 2.1 Networks of functions
2 Threshold Logic 2. Networks of functions We deal in this chapter with the simplest kind of computing units used to build artificial neural networks. These computing elements are a generalization of the
More informationA Novel Solution on Alert Conflict Resolution Model in Network Management
A Novel Solution on Alert Conflict Resolution Model in Network Management YiTung F. Chan University of Wales United Kingdom FrankChan2005@gmail.com Ramaswamy D.Thiyagu University of East London United
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationA new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationA Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing
A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing Maryam Daneshmandi mdaneshmandi82@yahoo.com School of Information Technology Shiraz Electronics University Shiraz, Iran
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationOperations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 3058573 koda@sk.tsukuba.ac.jp
More informationPerformance of Probability Transformations Using Simulated Human Opinions
Performance of Probability Transformations Using Simulated Human Opinions Donald J. Bucci, Sayandeep Acharya, Timothy J. Pleskac, and Moshe Kam Department of Electrical and Computer Engineering, Drexel
More informationPedestrian Detection with RCNN
Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University mcc17@stanford.edu Abstract In this paper we evaluate the effectiveness of using a Regionbased Convolutional
More informationEvaluation and Credibility. How much should we believe in what was learned?
Evaluation and Credibility How much should we believe in what was learned? Outline Introduction Classification with Train, Test, and Validation sets Handling Unbalanced Data; Parameter Tuning Crossvalidation
More informationDecision Algorithms in Fire Detection Systems
SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 8, No. 2, May 2011, 155161 UDK: 654.924.5:004.21; 614.842.4:004.021 Decision Algorithms in Fire Detection Systems Jovan D. Ristić 1, Dragana B. Radosavljević
More informationSpam Filter Optimality Based on Signal Detection Theory
Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited
More informationBinary and Ranked Retrieval
Binary and Ranked Retrieval Binary Retrieval RSV(d i,q j ) {0,1} Does not allow the user to control the magnitude of the output. In fact, for a given query, the system may return underdimensioned output
More informationAddressing the Class Imbalance Problem in Medical Datasets
Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,
More informationFlorida International University  University of Miami TRECVID 2014
Florida International University  University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, MeiLing Shyu 1, ShuChing Chen 2, HsinYu Ha 2, Ming Ma 1, Winnie Chen 4,
More informationSynthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition
Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition Paulo Marques 1 Instituto Superior de Engenharia de Lisboa / Instituto de Telecomunicações R. Conselheiro Emídio
More informationData Mining Practical Machine Learning Tools and Techniques
Counting the cost Data Mining Practical Machine Learning Tools and Techniques Slides for Section 5.7 In practice, different types of classification errors often incur different costs Examples: Loan decisions
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationEvaluation in Machine Learning (1) Evaluation in machine learning. Aims. 14s1: COMP9417 Machine Learning and Data Mining.
Acknowledgements Evaluation in Machine Learning (1) 14s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 19, 2014 Material derived
More informationNumerical Field Extraction in Handwritten Incoming Mail Documents
Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 MontSaintAignan, France Laurent.Heutte@univrouen.fr
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors ChiaHui Chang and ZhiKai Ding Department of Computer Science and Information Engineering, National Central University, ChungLi,
More informationIntroduction To Ensemble Learning
Educational Series Introduction To Ensemble Learning Dr. Oliver Steinki, CFA, FRM Ziad Mohammad July 2015 What Is Ensemble Learning? In broad terms, ensemble learning is a procedure where multiple learner
More informationUsing News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,
More informationIntroduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.
Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More information