

A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION

Frédéric Pichon 1, Florence Aligne 1, Gilles Feugnet 1 and Janet Martha Blatny 2

1 Thales Research & Technology, Campus Polytechnique, 1 avenue Augustin Fresnel, Palaiseau cedex, France; 2 FFI, P.O. Box 25, N-2027 Kjeller, Norway.

INTRODUCTION

The aim of the TWOBIAS EU FP7 project is to develop a modular demonstrator of a stationary, reliable, vehicle-portable, low false alarm rate Two Stage Rapid Biological Surveillance and Alarm System for Airborne Threats (TWOBIAS). The TWOBIAS concept is built around two successive stages: a first detect-to-warn stage dedicated to biological threat detection using Biological Detection Units (BDUs), and a second detect-to-treat stage devoted to pathogenic agent identification using Biological Identification Units (BIUs). In the first stage, the main challenge of the TWOBIAS system is to improve on the state of the art in biological threat detection by combining orthogonal detector techniques with a mathematical framework and high-level information processing approaches. This paper describes the approach developed for the first stage of the TWOBIAS project, as well as preliminary results suggesting its potential usefulness.

APPROACH

The BDUs may be seen as so-called classifiers in pattern classification [DUD 00]. Indeed, for each air sample, each BDU either raises an alarm or not, depending on whether it believes that the air sample contains a pathogenic biological agent. Hence, each BDU is a classifier in that it assigns each air sample to one of two groups (biological pathogenic agents and non biological pathogenic agents). There exists a large body of literature dedicated to the use of multiple classifiers, also called classifier ensembles, which is now widely acknowledged as a useful means to address difficult pattern classification problems [KIT 98, ROG 94].
The main idea underlying classifier ensembles is to take advantage of the potential complementarities of different classifiers, so as to obtain better classification performance. One of the central issues is then how to combine the classifier outputs, and several fusion strategies have been proposed for this purpose. The simplest fusion scheme is majority voting [XU 92]: the class assigned to a given pattern (object, instance) is the class that occurs most often among the individual decisions of the classifiers. When classifiers provide confidence scores on the actual class of a given pattern in the form of probability distributions, one technique consists in averaging these distributions and assigning the pattern to the class of maximum probability [XU 92]. Other schemes, such as [ELO 10] and [ELO 04], involve the notion of discounting [SHA 76, SME 93], that is, weakening the decisions provided by the classifiers according to their reliability. In these schemes, the decisions of the classifiers are first discounted and then combined using an appropriate operator (called Dempster's rule of combination [DEM 67]). In other works, classifier outputs are combined using more advanced combination operators, for instance triangular norm based combination rules [DEN 08, PIC 10], as is the case in [QUO 11].

To decide which fusion strategy to use to combine a set of classifiers, one may proceed as in [QUO 11], that is, one may learn the best fusion strategy. This learning amounts to looking for the fusion strategy that optimizes a given performance criterion, such as the error rate, over a set of data. This is the approach that we have followed to take advantage of the BDUs' complementary orthogonal detector techniques. As will be seen in the results part of this paper, this approach allows us to obtain a detection system that provides a more consistent global response and that improves on the detection performance of each individual BDU.

The rest of this paper is organized as follows. First, we provide a brief description of the BDUs considered in TWOBIAS, as well as of the first TWOBIAS experiment, which yielded the set of data that is needed in our approach. Then, we recall some basic performance measures that can be used to evaluate a classifier.
We proceed with a description of the experiments that we performed to evaluate our approach, as well as of the results that we obtained. Finally, we conclude this paper with some perspectives.

TWOBIAS BIOLOGICAL DETECTION UNITS AND FIRST EXPERIMENT

At the time of performing this work, three BDUs were available. These BDUs can be expected to be somewhat complementary since they rely on different techniques and settings: one relies on flame emission spectroscopy and the other two rely on laser-induced fluorescence but are tuned differently. Hereafter, these BDUs will be denoted by BDU1, BDU2, and BDU3.

The first TWOBIAS experiment was conducted in a facility of the Direction Générale de l'Armement. Several disseminations of biological agents simulating biological pathogenic agents took place during this experiment. The outputs of the three BDUs, i.e., the alarms raised by the BDUs, were recorded during those disseminations. Offline, we also analyzed the BIU outputs, which tell us whether a (simulant of a) biological pathogenic agent was actually present or not in the air at a given time. This is important since the BIU outputs can thus be used as references regarding whether a given time-stamped air sample should have raised an alarm or not, and hence to evaluate the performances of the BDUs. These recordings and analyses yielded a data set of about records, each record corresponding to an air sample and being a 5-tuple containing a date, the outputs (i.e., presence or absence of alarm) of the three BDUs, and the expected output (estimated using the BIU outputs), also called label, for this sample.

PERFORMANCE MEASURES IN BINARY CLASSIFICATION

Let us assume in this section that we have at our disposal a set of air samples whose true classes are known, i.e., for each of these samples, we know whether or not it contains a (simulant of a) biological pathogenic agent. Besides, we also have access to the class predicted by a classifier for each of those samples. From those pieces of information, we can compute the number of:

- True Positives (TP), which are the samples correctly classified as presence of simulant, i.e., correct detections;
- True Negatives (TN), which are the samples correctly classified as absence of simulant;
- False Positives (FP), which are the samples incorrectly classified as presence of simulant, i.e., the false alarms;
- False Negatives (FN), which are the samples incorrectly classified as absence of simulant.

Using this terminology, the error rate of the classifier over the set of samples is simply defined as:

\[ \text{error rate} = \frac{FP + FN}{TP + TN + FP + FN}. \]

The error rate is a natural and sensible performance criterion in general. However, it is not well suited to imbalanced data. Two other common performance criteria in binary classification are called recall (also known as sensitivity) and precision. They are defined as follows:

\[ \text{recall} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{precision} = \frac{TP}{TP + FP}. \]
Recall is the proportion of samples for which the classifier rightfully raised an alarm (i.e., samples correctly identified as presence of simulant), among all samples that actually should have raised an alarm. Precision, on the other hand, is the proportion of samples for which the classifier rightfully raised an alarm, among all samples for which it raised an alarm.
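As an illustration, these measures can be computed directly from recorded alarms and reference labels. The sketch below is not the TWOBIAS implementation: the names `ConfusionCounts`, `confusion_counts`, `error_rate`, `recall` and `precision` are hypothetical, the formulas are the usual textbook ones (misclassified fraction, TP/(TP+FN), TP/(TP+FP)), and returning 0.0 when a denominator is zero is an assumed convention.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of correct and incorrect decisions of a binary detector."""
    tp: int  # alarm raised, simulant present
    tn: int  # no alarm, simulant absent
    fp: int  # alarm raised, simulant absent (false alarm)
    fn: int  # no alarm, simulant present (missed detection)


def confusion_counts(labels: Sequence[bool], alarms: Sequence[bool]) -> ConfusionCounts:
    """Compare detector alarms with reference labels, sample by sample."""
    if len(labels) != len(alarms):
        raise ValueError("labels and alarms must have the same length")
    pairs = list(zip(labels, alarms))
    return ConfusionCounts(
        tp=sum(1 for label, alarm in pairs if label and alarm),
        tn=sum(1 for label, alarm in pairs if not label and not alarm),
        fp=sum(1 for label, alarm in pairs if not label and alarm),
        fn=sum(1 for label, alarm in pairs if label and not alarm),
    )


def error_rate(c: ConfusionCounts) -> float:
    """Fraction of misclassified samples."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.fp + c.fn) / total if total else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of simulant-present samples that raised an alarm."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0  # assumed convention for 0/0


def precision(c: ConfusionCounts) -> float:
    """Fraction of raised alarms that were justified."""
    raised = c.tp + c.fp
    return c.tp / raised if raised else 0.0  # assumed convention for 0/0


# Made-up imbalanced example: 5 samples with simulant, 95 without,
# and a detector that never raises an alarm.
labels = [True] * 5 + [False] * 95
silent = confusion_counts(labels, [False] * 100)
print(error_rate(silent), recall(silent))
```

On this made-up data the silent detector has an error rate of only 0.05 while its recall is 0, which illustrates why the error rate alone is not well suited to imbalanced data.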

Another useful and often used indicator of performance is the F1-measure (or F1-score). It can be interpreted as an average (1) of precision and recall:

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \]

where the F1-score reaches its best value at 1 and its worst value at 0. A more general form exists and is called the Fβ-measure. It is defined as follows:

\[ F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \]

or, equivalently, by

\[ F_\beta = \frac{(1 + \beta^2) \, TP}{(1 + \beta^2) \, TP + \beta^2 \, FN + FP}. \]

The F2-measure (i.e., β = 2), for instance, weights recall higher than precision, whereas the F0.5-measure puts more emphasis on precision than recall. The Fβ-measure measures the effectiveness of detection with respect to a user who attaches β times as much importance to recall as to precision.

EXPERIMENTS

As discussed above, the difficulty in this classifier fusion-based approach is to determine the best strategy to combine the detector outputs, out of a given set of fusion strategies. A second issue is to evaluate the performances of this approach. In order to perform these two tasks, a standard method consists in splitting the available labeled data set into two disjoint sets, which are called the training set and the test set (this is known in the pattern classification literature as the hold-out method). The training set is used to learn the best fusion strategy, which basically amounts to selecting the strategy that optimizes a given performance criterion, such as the F1-score, on the training set. The test set is then used to evaluate independently the performance of this best fusion strategy, according to the same performance criterion as the one used for the training phase.

This basic method nonetheless has the drawback that the hold-out estimate of the performance of the best fusion strategy may be misleading if we happen to get an unfortunate split. To overcome this drawback, other methods can be used. In particular, the method known as random subsampling performs n data splits of the data set. Each split randomly selects a (fixed) number of examples without replacement.
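A minimal sketch of this evaluation protocol is given below, under several assumptions. Only majority voting is taken from the schemes cited in this paper; the `any_alarm` and `all_alarm` rules are hypothetical stand-ins for the other candidate strategies, the records are synthetic, and all function names are illustrative. The Fβ-score is computed from the standard count-based formula, and the defaults (20 splits, 2/5 of the data for training) mirror the settings reported later in this section. For each split, the strategy with the best Fβ-score on the training part is selected and then scored on the test part, and the test scores are averaged.

```python
import random
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence, Tuple

Record = Tuple[Tuple[bool, ...], bool]   # (alarms of the three BDUs, reference label)
Strategy = Callable[[Sequence[bool]], bool]


def f_beta(labels: Sequence[bool], alarms: Sequence[bool], beta: float) -> float:
    """F-beta score from counts; returns 0.0 when undefined (assumed convention)."""
    tp = sum(1 for l, a in zip(labels, alarms) if l and a)
    fp = sum(1 for l, a in zip(labels, alarms) if not l and a)
    fn = sum(1 for l, a in zip(labels, alarms) if l and not a)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Candidate fusion strategies. Majority voting is cited in the paper;
# the two others are hypothetical placeholders for this sketch.
STRATEGIES: Dict[str, Strategy] = {
    "majority_vote": lambda alarms: 2 * sum(alarms) > len(alarms),
    "any_alarm": any,
    "all_alarm": all,
}


def score(strategy: Strategy, records: Sequence[Record], beta: float) -> float:
    """F-beta score of a fusion strategy over a set of labeled records."""
    labels = [label for _, label in records]
    fused = [strategy(outputs) for outputs, _ in records]
    return f_beta(labels, fused, beta)


def random_subsampling(records: List[Record], beta: float, n_splits: int = 20,
                       train_fraction: float = 0.4, seed: int = 0) -> Tuple[float, float]:
    """Average and standard deviation of the test score over random splits."""
    rng = random.Random(seed)
    n_train = int(len(records) * train_fraction)
    test_scores = []
    for _ in range(n_splits):
        shuffled = rng.sample(records, len(records))  # permutation, no replacement
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Relearn the best strategy from scratch on the training part only.
        best = max(STRATEGIES, key=lambda name: score(STRATEGIES[name], train, beta))
        test_scores.append(score(STRATEGIES[best], test, beta))
    return mean(test_scores), stdev(test_scores)


def synthetic_records(n: int, seed: int = 1) -> List[Record]:
    """Made-up data: each BDU detects with prob. 0.8 and false-alarms with prob. 0.1."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        label = rng.random() < 0.2
        p_alarm = 0.8 if label else 0.1
        outputs = tuple(rng.random() < p_alarm for _ in range(3))
        records.append((outputs, label))
    return records


records = synthetic_records(500)
for beta in (0.2, 0.5, 0.7, 1, 2):
    avg, sd = random_subsampling(records, beta)
    print(f"beta={beta}: mean F-beta={avg:.3f} (std {sd:.3f})")
```

Selecting the strategy on the training part and scoring it on the disjoint test part keeps the reported figure an independent estimate; averaging over several splits reduces the dependence on any single, possibly unfortunate, split.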
For each data split, the best fusion strategy is relearnt from scratch on the training examples and its performance is evaluated on the test examples. The overall performance of the fusion-based approach is then estimated as the average of the separate estimates.

In this work, the set of fusion strategies that we considered comprises all the schemes mentioned in the Approach section of this paper. Besides, we used the random subsampling method described in the previous paragraph, with n = 20 (for each split, the learning data represented 2/5 of the entire available data) and in conjunction with several performance criteria: the Fβ-scores, for β = 0.2, 0.5, 0.7, 1, 2. In other words, we applied the random subsampling method five times (once for each performance criterion).

(1) The F1-measure is actually the harmonic mean of precision and recall.

The following figure synthesizes these experiments: the values 0.2, 0.5, 0.7, 1 and 2 on the x-axis correspond to the five performance measures considered, and the y-axis provides the average values (and standard deviation) taken by these performance measures over the n repetitions for the different detection systems studied in this work, i.e., the three BDUs as well as our fusion approach (called Fusion in the figure). Let us recall that the higher these values, the better the performances.

As can be seen in this figure, our approach always yields the best detection system, being at worst as good as BDU1 (this happens for the F0.7-measure). In addition, we remark that, depending on the value of β, some BDUs are better than others. Our approach offers some robustness in this respect since it is always the best detection system. The approach also seems to take advantage of the quality of BDU1 and BDU3 for the F0.2-measure, yielding an F0.2-score that is better than the F0.2-scores of these two BDUs and is in itself good. Similarly, for the F2-measure, the approach seems to take advantage of the quality of BDU1 and BDU2. Overall, these experiments clearly show the robustness as well as the increase in performance that a detection system based on the fusion of individual systems offers.

CONCLUSION

This study has shown that it is possible to propose a detection system that combines the alarms raised by the TWOBIAS BDUs and that exhibits better detection performances as well as better robustness than any of these BDUs. In addition, the performances of this detection system are in themselves promising. It is foreseen that the performances could be further enhanced by considering some improvements to the proposed detection system. In particular, it seems worthwhile to include the temporal dimension in the fusion process. It also seems worthwhile to better leverage in the fusion process the various uncertainties that can emerge in such a system, in particular the BDUs' own confidence in their outputs. Future work will be dedicated to these improvements, as well as to testing this approach on the data obtained during the recent TWOBIAS underground experiment.

REFERENCES

[DEM 67] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38.
[DEN 08] T. Denoeux. Conjunctive and disjunctive combination of belief functions induced by non-distinct bodies of evidence. Artificial Intelligence, 172.
[DUD 00] R. O. Duda, P. E. Hart and D. G. Stork. Pattern classification. John Wiley & Sons.
[ELO 10] Z. Elouedi, E. Lefevre and D. Mercier. Discountings of a belief function using a confusion matrix. In 22nd IEEE International Conference on Tools with Artificial Intelligence, volume 1, Arras, France, North-Holland.
[ELO 04] Z. Elouedi, K. Mellouli and Ph. Smets. Assessing sensor reliability for multisensor data fusion within the Transferable Belief Model. IEEE Transactions on Systems, Man and Cybernetics B, 34(1).
[KIT 98] J. Kittler, M. Hatef, R. Duin and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3).
[PIC 10] F. Pichon and T. Denoeux. The unnormalized Dempster's rule of combination: a new justification from the least commitment principle and some extensions. Journal of Automated Reasoning, 45(1):61-87.
[QUO 11] B. Quost, M.-H. Masson and T. Denoeux. Classifier fusion in the Dempster-Shafer framework using optimized t-norm based combination rules. International Journal of Approximate Reasoning, 52(3), 2011.
[ROG 94] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5).
[SHA 76] G. Shafer. A mathematical theory of evidence. Princeton University Press, Princeton, N.J.
[SME 93] Ph. Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9(1):1-35.
[XU 92] L. Xu, A. Krzyzak and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22, 1992.