Discovering Criminal Behavior by Ranking Intelligence Data

UNIVERSITY OF AMSTERDAM
Faculty of Science

Discovering Criminal Behavior by Ranking Intelligence Data

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the field of Artificial Intelligence, track Forensic Intelligence, written during an internship at the Netherlands Forensic Institute, Knowledge and Expertise Centre for Intelligent Data Analysis, Digital Technology & Biometrics Department.

External Supervisor: Cor J. Veenman
Internal Supervisor: Marcel Worring

August 2010

Declaration of Authorship

I declare that this thesis, titled Discovering Criminal Behavior by Ranking Intelligence Data, and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

Three things cannot be long hidden: the sun, the moon, and the truth.

Buddha

Abstract

Intelligence data consists of a small number of known examples of criminal behavior and a large number of unknown samples, a small fraction of which represents undiscovered criminal behavior. An intelligence-based investigation is aimed at the discovery of these samples. As resources are limited, the goal of such an investigation is to obtain a subset of the data with a high probability of criminal behavior. This requires the use of ranking methods and a comparison based on performance measures that relate to this goal, such as the average precision. We assess the effect of the imbalance inherent in these datasets and test whether random under-sampling can be used to reduce computational complexity. Furthermore, the effect of the label noise that results from labeling unknown samples as normal behavior is researched. We then implement several methods for ranking intelligence data and assess their performance. Next to widely used two-class classification methods, we investigate the applicability of one-class classification methods and of support vector machines using model selection criteria that more closely relate to the goal of an intelligence-based investigation.

Acknowledgements

This is where I could write my thanks to all those people who were there at those moments the going got tough and I nearly succumbed under the pressure of writing the whole thing. But to be honest, that never happened. The people I am going to thank here were there long before those moments could occur and kept my spirits up well enough at all times. That definitely deserves my gratitude.

First of all, I would like to thank my girlfriend Linda, for always being there for me and keeping me relaxed with her carefree attitude. My mother Ellen, who got me this far by pointing out the importance of good education and her never faltering support. My parents-in-law, for their positive view on life and generous nature which always cheers me up. My grandmother, who will be sitting front row at my graduation, hours before it commences. My dear friends, Amer, Amy, Arike, Ben, Colin, Fleur, Jasper, Krishan, Lea, Pim, Rick, Rolf, Steven and Tabor, for their support and good times together, recharging my energy in between all the hard work. My colleagues at Kecida, Andre, Asli, Chrissie, Gert, Guido, Maarten, Menno, Nienke and Xandra, for all the inspiring talks and shared knowledge. My supervisors, Cor and Marcel, for their useful feedback, eye-opening discussions and continuous support. And finally, I would like to thank Sara, who introduced me to the field of AI, what feels like such a long time ago.

Contents

Declaration of Authorship
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abbreviations
Notation

1 Introduction
  1.1 Research Questions
  1.2 Approach
  1.3 Outline

2 Properties of Intelligence Data
  2.1 Intelligence Data
  2.2 Imbalance
  2.3 Noise
    2.3.1 Label Noise
    2.3.2 Attribute Noise and Missing Values

3 Discovering Criminal Behavior
  3.1 Ranking
    3.1.1 ROC Curve
    3.1.2 Precision-Recall Curve
  3.2 Ranking with Two-Class Classification Methods
    3.2.1 k-Nearest Neighbors
    3.2.2 Logistic Regression
    3.2.3 Fisher's Linear Discriminant
    3.2.4 Quadratic Discriminant
    3.2.5 Support Vector Machines
  3.3 Ranking with One-Class Classification Methods
    3.3.1 Gaussian Data Description
    3.3.2 Support Vector Data Description
  3.4 Ranking with Ranking Methods
    3.4.1 SVM-Rank
    3.4.2 SVM-Precision
    3.4.3 SVM-ROCArea
    3.4.4 SVM-MAP
  3.5 Performance Estimation

4 Data
  4.1 Artificial Data
    4.1.1 Gaussian Data
    4.1.2 Stretched Parallel Data
    4.1.3 Spherical Data
  4.2 IILS Data

5 Experiments
  5.1 Setup
  5.2 Training Imbalance
  5.3 Undiscovered Criminal Behavior
  5.4 Ranking Performance

6 Conclusions
  6.1 Future Work

Bibliography

List of Figures

2.1 Over- and Under-Sampling
3.1 Example ROC Curve
3.2 Comparison Between ROC and Precision-Recall Curves
3.3 An Example of k-NN Classification
3.4 Maximum Margin Classification
4.1 Artificial Gaussian Intelligence Dataset
4.2 Artificial Stretched Parallel Intelligence Dataset
4.3 Artificial Spherical Intelligence Dataset
5.1 Learning Curves for Gaussian Dataset
5.2 Learning Curves for Stretched Dataset
5.3 Learning Curves for Spherical Dataset
5.4 Learning Curves for IILS Dataset
5.5 Noise in Gaussian Data
5.6 Noise in Stretched Data
5.7 Noise in Spherical Data
5.8 Noise in IILS Data

List of Tables

3.1 Confusion Matrix of Ranked Results Based on a Decision Point
5.1 Size and Composition of Intelligence Datasets with Added Artificial Noise
5.2 Difference Between Measured AP and Expected AP
5.3 Performance of All Ranking Methods on Gaussian Data
5.4 Performance of All Ranking Methods on Stretched Data
5.5 Performance of All Ranking Methods on Spherical Data
5.6 Performance of All Ranking Methods on IILS Data

Abbreviations

ACC     Accuracy
AP      Average Precision
AUC     Area Under the (ROC) Curve
EM      Expectation Maximization
FLD     Fisher's Linear Discriminant
FN      False Negative
FP      False Positive
FPR     False Positive Rate
GDD     Gaussian Data Description
IILS    Illegal Irregular Living Situation
k-NN    k-Nearest Neighbor
LPOCV   Leave-Pair-Out Cross-Validation
MAP     Mean Average Precision
MAR     Missing At Random
MCAR    Missing Completely At Random
MNAR    Missing Not At Random
QDC     Quadratic Discriminant Classifier
ROC     Receiver Operator Characteristic
SVDD    Support Vector Data Description
SVM     Support Vector Machine
TN      True Negative
TP      True Positive
TPR     True Positive Rate

Notation

X                     An intelligence dataset
X+                    Positive samples representing criminal behavior
X−                    Negative samples representing normal behavior
|X|                   The size of dataset X, the number of samples
D(X)                  The dimensionality of dataset X, the number of features
X_tr                  A training set, used for learning a model
X_val                 A validation set, used for parameter tuning
X_te                  A test set, used for performance estimation
x = (x_1, ..., x_n)   A single sample in dataset X, with n = D(X) and features x_1, ..., x_n
C                     A classification method
rank(x)               The position of x in a ranked result
y                     True class label
ŷ                     Predicted class label
c(x)                  Ranking score of a data point x after classification
dp                    Decision point on a ranking of samples
Y                     A ranked list of samples resulting from a ranking method
θ                     Parameters of a distribution model
p(a|b)                Conditional probability of a given b
I_n                   n-by-n identity matrix
N(µ, Σ)               Gaussian distribution with mean µ and covariance matrix Σ
φ                     Noise on the majority class (undiscovered criminal behavior)
ψ                     Noise on the minority class

Dedicated to Henny, my grandmother, who has been looking forward to my graduation ever since I started studying at the university

Chapter 1

Introduction

In our ever more digitally focused society, the current tendency is to record and store all possible sorts of data. As data collection mechanisms have improved vastly over the years, certain features are nowadays recorded, in many different settings, for every single entity in a population. A credit card company, for example, records for every transaction the date, time, sender, receiver, method and card validity. A municipality, for example, records for every address within its borders the type of housing, surface area and number of rooms. As these features are recorded for every entity in a population, data related to entities that exhibit deviant or criminal behavior is also recorded. However, recording this data does not imply automatic detection of criminal behavior. To achieve this, data analysts combine elaborate methods from the fields of artificial intelligence, computer science, linguistics and mathematics to obtain relevant information regarding criminal behavior from data. As a direct consequence of the increase in data recording and storage, these analysts are more often faced with large datasets.

An intelligence-based investigation is aimed at finding correlations between features of the data and the prevalence of a specific type of crime, using an intelligence dataset with known occurrences of that type of crime. A model based on these correlations can be used to assess the risk that unknown or new entities exhibit forms of criminal behavior. Further investigation into these entities is invasive and requires resources. Initiating such an investigation is only feasible if the predictive value of the model is high. A typical investigation in such a large dataset is therefore not focused on classifying everything correctly. It is more important to discover a few new cases of criminal behavior without having to address many false hits. A system operating on an intelligence dataset should offer the possibility to prioritize cases, based on a ranking of entities that relates to the probability of them exhibiting forms of criminal behavior. An intelligent ranking makes it possible to deploy the available resources to investigate the top n highest ranked entities.

Luckily, many crimes occur only infrequently, as only a small percentage of the population exhibits forms of criminal behavior. Recording of data regarding normal behavior is therefore far more common than recording of data regarding criminal behavior. However, not all occurrences of criminal behavior are identified as such. This results in specific characteristics that define an intelligence dataset. It is imbalanced, as it contains many samples of which only a small subset is known to represent criminal behavior. The remaining samples are unknown, but only a small percentage of these will represent criminal behavior. Using these samples as examples of normal behavior is therefore possible, but results in a small amount of label noise.

Some supervised learning methods can be used to obtain a ranking of intelligence data. Two-class and one-class classification methods rank samples based on a confidence score. Other methods from the field of information retrieval directly optimize a ranking by using rank-based model selection criteria. The application of ranking methods that use model selection criteria related to the goal of an intelligence-based investigation is novel and has not been researched before.

Methods used for ranking intelligence data also need to handle the specific properties these intelligence datasets have in common. The problems of data imbalance and label noise from labeling unknown samples have individually received much attention, and several solutions have been proposed. But any possible effects resulting from their combined occurrence have only been marginally researched. Van Hulse et al. have recently performed a comprehensive study into classification performance on imbalanced noisy data [1], but they have only used datasets from the domain of software measurement. It is unknown whether the effect on intelligence data is comparable. Methods assigning costs to imbalanced data in a classification task exist [2], but no mention is made of noise and these methods are not focused on ranking.

1.1 Research Questions

It is clear that intelligence datasets often involve huge amounts of imbalanced noisy data. The discovery of patterns in this type of data that signify criminal or deviant behavior is central to an intelligence-based investigation. As resources are often limited, the discovery of small subsets with high chances of containing criminal behavior is most important. Currently available and generally accepted methods focus only on parts of this problem. This leads us to pose the following central research question: How can criminal behavior be discovered in intelligence data? This central research question can be divided into a few subquestions.

To what extent do imbalance and label noise hinder the discovery of criminal behavior? In other fields of research, label noise and imbalanced data have a detrimental effect on performance. The effect of their co-occurrence in intelligence data is unknown.

Which methods are best at ranking samples in intelligence data? Many different classification methods exist, and different classification scenarios ask for different methods. Classification methods operating on intelligence data are required to return a ranking of results. It is unknown whether classification methods fit this task.

Are model selection criteria from the field of information retrieval helpful in the efficient discovery of criminal behavior? Ranking results is important with intelligence data. Methods that optimize for such a ranked result exist, but it is unknown how they perform in the discovery of criminal behavior.

1.2 Approach

There is no fixed solution that works for every data analysis problem. No single method performs well on every type of data. Only a careful observation of the characteristics of intelligence datasets will provide footholds for addressing this type of data. As indicated, intelligence data is imbalanced and contains unknown samples. We first assess the effects of these properties on the performance of several selected ranking methods. We test whether sampling methods are suited for coping with imbalance, and we observe the effect of noise by artificially increasing the amount of label noise in either class. We compare the performance of several one-class and two-class classification methods in ranking intelligence data, with the goal of finding methods that perform systematically well on different intelligence datasets. Finally, we test the applicability of ranking methods that use model selection criteria from the field of information retrieval, which directly optimize for a ranking and are more closely related to the goal of an intelligence-based investigation.

1.3 Outline

This thesis is outlined as follows. The next chapter describes the properties of intelligence data in general and goes deeper into the subjects of imbalance and noise. Chapter 3 revolves around the process of discovering criminal behavior in intelligence data. Chapter 4 describes the specific datasets we used for our experiments, which are described in chapter 5.

Chapter 2

Properties of Intelligence Data

2.1 Intelligence Data

An intelligence dataset contains entities that can represent, for example, persons, cars, companies or houses. Descriptive features are available for every individual entity in a population. These descriptive features can be very broad, but also very specific. A single feature such as age is not sufficient to serve as an indicator that the individual in question is involved in burglaries. But a combination of several general or specific features that are known for every entity in the population might indeed serve as a good indicator. If we combine age with, for example, neighborhood of residence, income, race and history of violence, we might observe a single group of individuals to have a significantly higher probability of being or becoming a burglar. It is entities with features such as these that form an intelligence dataset.

To formalize, an intelligence dataset X contains samples from two classes: X+ containing all positive samples representing criminal behavior, and X− containing all samples representing unknown behavior. |X| represents the size of dataset X in terms of the number of samples. With D(X) we denote the dimensionality of dataset X, i.e. the number of features. Individual samples x ∈ X are represented as (x_1, ..., x_n), with features x_1, ..., x_n ∈ R and n = D(X), and they are labeled with y ∈ {−1, 0, 1}. These labels denote the class to which a sample belongs. We define that ∀x ∈ X+: y = 1 and ∀x ∈ X−: y = 0, representing that the true class is unknown. As we assume only a small percentage of criminal behavior in X−, we can set the labels such that ∀x ∈ X−: y = −1. This does result in label noise, which is discussed in section 2.3.1.

Our goal is to obtain a ranking of the dataset. To train the ranking method, we consider X_tr to represent the training set; X_val represents an optional validation set, used for

tuning parameters specific to the ranking method used; and X_te represents the separate test set, used for performance estimation of a ranking method. All three sets should be independently drawn from the same distribution. This separation of data is required to make an unbiased assessment of the performance of a ranking method [3].

2.2 Imbalance

An important recurring property of intelligence datasets is the low frequency of certain criminal behavior. The prevalence of a specific type of crime might seem high, but relative to the same crime not occurring in similar situations, this rate is luckily always low. That means that if data is collected for every instance in a specific situation, far more data is collected on normal instances than on crime-related instances. X− is also called the majority class and X+ the minority class, as |X−| >> |X+| [4]. Our method should therefore cope with this imbalance.

To cope with the computational costs involved with some modern-day ranking methods, intelligence datasets might need to be reduced in size by a few orders of magnitude. Maintaining the naturally occurring class frequency in such a scenario would leave us with less than a single example of criminal behavior from which to build our model, a less than satisfactory approach. Weiss and Provost show that when reducing the training set size, the naturally occurring positive class fraction |X+| / (|X+| + |X−|) is indeed not always the best fraction for learning [5].

Several methods for reducing dataset size in imbalanced datasets are discussed by Weiss [6]. One method describes learning only from examples of the minority class, X_tr = X+. This one-class classification has been extensively researched by Tax [7] and is described in section 3.3.

The most commonly used method to deal with imbalanced data is sampling. Weiss describes the difference between over- and under-sampling and their drawbacks [6]. Over-sampling duplicates minority-class samples, while under-sampling reduces majority-class samples. A pictorial representation is shown in figure 2.1. In doing this, both methods achieve a more balanced class distribution, but at a cost. Over-sampling increases the chance of overfitting by repeating examples, and does not actually generate novel data points; for several methods, over-sampling is simply not a solution. Under-sampling removes majority-class samples from the training data, resulting in a potential loss of useful information. The extent to which this detrimental effect occurs depends entirely on the characteristics of the underlying data and the ranking method.
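As a minimal sketch of random under-sampling, assuming a NumPy feature matrix X and a label vector y following the convention of section 2.1 (y = +1 for X+ and y = −1 for the majority class labeled as normal); the function name and the ratio parameter are our own illustration:

```python
import numpy as np

def undersample(X, y, ratio=1.0, seed=0):
    """Random under-sampling: keep all minority samples (y == +1) and a
    random subset of majority samples (y == -1), so that the majority
    class is at most `ratio` times the size of the minority class."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]
```

Keeping every minority sample reflects that positive examples are scarce and too valuable to discard.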

Figure 2.1: Over-sampling (left) and under-sampling (right).

2.3 Noise

In a machine learning framework, a distinction is made between two types of noise. Label noise occurs when, for some x ∈ X, y ≠ y_true. Attribute noise occurs when, for some x ∈ X, some feature x_i ≠ x_{i,true}.

2.3.1 Label Noise

Zhu and Wu state that label or class noise can have a few possible sources [8]. One cause of label noise is human labeling error, which can occur if the labels are acquired by a manual process. Next to that, contradictory examples arise when two examples in the training set have the same attribute values but come with different labels. It is debatable whether this should be considered label noise, as overlapping class distributions can result in two correctly labeled examples with the same attribute values but originating from different classes.

In an intelligence dataset, examples of criminal behavior are available only because a criminal investigation has proven guilt in the court cases that underpin all of the positive examples. Therefore, label noise usually does not occur in the minority class. However, a judicial system is not infallible and sometimes wrong accusations are made. Yet, discovering wrongly accused individuals falls outside the scope of this thesis, and for now we assume no label noise occurs in the minority class.

The majority class consists of unknown samples of which only a fraction represents undiscovered criminal behavior. As most two-class ranking methods require negative examples as input, we label the majority class as normal behavior. This results in label noise in the majority class, with the undiscovered samples being the source of this noise. The

percentage of label noise is relatively low in this type of data, due to the fact that the relative occurrence of criminal behavior compared to normal behavior is always low.

Be aware that when we discuss label noise in class X−, we mean to say that some samples in the dataset that are truly from class X+ are labeled with −1. The noise is thus actually only in the labels of x ∈ X−, as the ground truth of these labels is actually +1. Other literature speaks of label noise from class X−, meaning that some instances that truly originate from class X− are mislabeled and have been given the label y = +1. This might lead to confusion if not addressed carefully, as these situations are the exact opposite of each other. To avoid this, we will only use the former meaning. Thus, when we speak of label noise in the majority class, we mean to say that some samples from the minority class have been given the wrong label and are labeled as majority. In the setting of an intelligence dataset, this means that some instances of criminal behavior have not been recognized and are mislabeled as normal behavior.

To address the issue of label noise, several different approaches have been proposed. Li et al. propose a probabilistic Kernel Fisher method that optimizes the projection direction of noisy data [9]. They achieve better classification performance on noisy datasets when compared to the original classification methods they build upon. A drawback is the computational complexity of this method, which prevents its use with large datasets. Anyfantis et al. look specifically at the presence of label noise in imbalanced datasets [10] and note a detrimental effect on classification performance. However, they only perform experiments with equal levels of noise in the minority and majority class, and their results are therefore not representative for intelligence data. Weiss does take imbalanced datasets into account and notes that relatively high levels of noise are required for a classification method to become error prone [6]. Although the experiments are performed on imbalanced data, no mention is made of classification results when the noise occurs in only a single class.

A more extensive study into noisy imbalanced data has been presented by Van Hulse and Khoshgoftaar [1]. They focus solely on the effect of label noise on the classification of imbalanced datasets and experiment with a broad range of modern-day classification methods and sampling techniques. In addition, the effects of different levels of label noise occurring in different ratios among the classes are observed. They show that noise in the majority class has a substantially less detrimental effect on classification performance than noise in the minority-class labels, especially when the total amount of noise is low. They have tested several different sampling methods and show that the choice of a sampling method is only relevant in high-noise or minority-class noise settings. In the scenario most important to intelligence datasets, with low

amounts of noise only in the majority class, there is no significant difference between sampling methods. One of the simplest methods, random under-sampling, yields competitive performance while maintaining low computational costs.

A few issues remain unresolved by Van Hulse and Khoshgoftaar. Although they experiment with different noise distributions for both classes, they simultaneously alter the size and balance of the training data due to the creation of the noise. In real-world noisy datasets, the amount of noise has no relation to the number of available positive examples for training or the balance between positive and negative examples, and should therefore be left out of the equation. More difficulties arise when relating their results to our type of data, as the minimum amount of noise they have used in their experiments is 10%. As indicated, the relative frequency of a single type of crime occurring, in comparison to that crime not occurring, is low. The amount of noise is therefore much lower when labeling unknown samples as normal behavior in intelligence data. Next to that, their experiments exhibit less imbalance than is typical for intelligence data.

2.3.2 Attribute Noise and Missing Values

Attribute noise occurs when a single attribute has erroneous, incomplete or missing values [8], often caused by data corruption or measurement errors [11]. Zhu and Wu find attribute noise to have a detrimental effect on classification performance in terms of accuracy, whether it occurs in the training set, the test set or both. This effect is found to be linear in the level of noise.

A frequently occurring difficulty related to attribute noise is missing values. Schafer and Graham note that although missing values are most often a nuisance and not the focus of inquiry, they can create quite some difficulty, as most data analysis procedures were not designed to incorporate them and will fail to operate in the presence of missing attribute values [12]. Neglecting them is therefore not an option, and they should be treated with care.

For any method that deals with missing data, it is important to observe whether the fact that a value is missing is related to the data. Rubin describes three different types of missing data with regard to the underlying process that caused their absence [13]:

- Missing Completely At Random (MCAR): the probability that a feature value is missing does not depend on the data.

- Missing At Random (MAR): the probability that a feature value is missing depends on the values of the observed features, but not on the missing feature value.

- Missing Not At Random (MNAR): the probability that a feature value is missing depends on the value of the missing feature itself.

For a more thorough explanation of the individual differences between these types, we refer to Schafer and Graham [12]. As missing data in intelligence datasets can have many causes, it is important to pinpoint the extent to which data is missing and whether elaborate methods are required to repair the missing values.

Two simple methods for dealing with missing data are case deletion and single imputation. Case deletion simply involves throwing away all samples that contain missing attribute values. This method is only valid if the data is MCAR, and is only feasible if enough samples are available and the frequency of missing values is low. As intelligence datasets usually contain only a small number of positive examples, case deletion is not always the best choice.

Single imputation using the expectation maximization (EM) algorithm is described by Schneider [14]. If the number of samples in X is larger than the number of features, |X| > D(X), the EM algorithm can be used to compute the maximum likelihood estimates of the mean and covariance of the data. Missing values are imputed with their conditional expectation values given the available values and the estimated mean and covariance [14]. Single imputation is again only possible if the missing values are MCAR. Simple tricks such as substituting missing values with zeros or the mean of the population should be avoided, as they distort estimated variances and correlations.
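The distortion caused by naive mean substitution is easy to demonstrate; a small illustration on synthetic MCAR data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Make 30% of the values missing completely at random (MCAR).
missing = rng.random(x.size) < 0.3
observed = x[~missing]

# Mean substitution: every missing value becomes the observed mean.
imputed = x.copy()
imputed[missing] = observed.mean()

print(round(x.var(), 2))        # close to 1.0, the true variance
print(round(imputed.var(), 2))  # close to 0.7, shrunk toward zero
```

The imputed constants contribute no spread of their own, so the estimated variance shrinks roughly by the fraction of missing values.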

Chapter 3

Discovering Criminal Behavior

The goal of an intelligence-based investigation is to discover criminal behavior. Ranking methods serve as tools to expose undiscovered criminal behavior in these datasets. But as a criminal investigation into individuals is invasive and requires resources, care should be taken in evaluating methods. We will describe several ranking methods and their applicability to ranking intelligence data.

3.1 Ranking

In an optimal ranking we observe

\forall x^+ \in X^+, \forall x^- \in X^-: \; rank(x^+) < rank(x^-), \qquad (3.1)

with rank(x) representing the position of x in a ranked list.

A ranking of intelligence data can be obtained in several ways. When using a classification method C for obtaining a ranking, we define the problem as finding a function c that maps samples x ∈ X with input features (x_1, ..., x_n), n = D(X), to ranking scores c(x) ∈ R. The result is a ranking of samples Y based on their ranking scores. Other methods directly optimize a ranking of samples and are defined as finding a function h: X → Y that ranks a dataset X into a ranking Y, ordering samples according to their likelihood of representing criminal behavior.

To assess the extent to which a ranking method is able to reach the goal of an intelligence-based investigation, an optimal ranking, a suitable performance measure is required that is independent of imbalance and noise. This performance measure should represent the

correctness of the ranking. At some point, performance measures for ranking methods require the computation of values from the confusion matrix (or contingency table) shown in table 3.1.

           x ∈ X+                x ∈ X−
ŷ = y     True Positive (TP)    True Negative (TN)
ŷ ≠ y     False Negative (FN)   False Positive (FP)

Table 3.1: Confusion Matrix of Ranked Results Based on a Decision Point

Such a performance measure indicates one or several decision points dp on the ranking scores c, for which we can calculate the predicted label ŷ as follows:

\hat{y} = \begin{cases} -1 & \text{if } c < dp \\ +1 & \text{if } c \geq dp \end{cases} \qquad (3.2)

Most classification techniques optimize for some performance measure related to the generalization performance on the training set, such as the accuracy. But problems arise as a direct consequence of the imbalance inherent in the dataset. As Kubat and Matwin point out [15], the performance of a classifier C operating on an imbalanced classification problem cannot be expressed in terms of accuracy. In heavily imbalanced datasets, the error on the majority class will outweigh any errors made on the minority class. The accuracy of a classifier C operating on a dataset X is defined as the fraction of the total number of samples that is classified correctly:

\text{Accuracy} = \frac{TP + TN}{|X|} \qquad (3.3)

Accuracy is also less well suited as a performance measure for a ranking method, as it requires a fixed decision point. For ranked results, a performance measure is required that integrates over all possible decision points.

A suitable performance measure should reflect the extent to which a ranking method is able to reach the goal of an intelligence-based investigation. Examples of criminal behavior should be ranked high, resulting in a high probability of encountering criminal behavior at the top of the ranking. A single-valued measure that encapsulates this goal is required to compare ranking methods. Two graphical representations of performance measures that operate on a ranking method help to obtain a better understanding and to deduce such a measure. The ROC curve shows the relation between the true positive rate and the false positive rate, and the precision-recall curve shows the relation between precision and recall.
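Before turning to these curves, a minimal sketch of how a single decision point on a ranking yields the counts of table 3.1 and the accuracy of equation 3.3 (array names are hypothetical; labels follow the y ∈ {−1, +1} convention):

```python
import numpy as np

def confusion_at(c, y, dp):
    """Confusion-matrix counts at decision point dp on ranking scores c,
    following equation 3.2: predict +1 if c >= dp, otherwise -1."""
    y_hat = np.where(c >= dp, 1, -1)
    tp = int(np.sum((y_hat == 1) & (y == 1)))
    fp = int(np.sum((y_hat == 1) & (y == -1)))
    fn = int(np.sum((y_hat == -1) & (y == 1)))
    tn = int(np.sum((y_hat == -1) & (y == -1)))
    return tp, fp, fn, tn

def accuracy(c, y, dp):
    tp, fp, fn, tn = confusion_at(c, y, dp)
    return (tp + tn) / len(y)  # equation 3.3
```

On a heavily imbalanced dataset, predicting −1 for every sample already yields an accuracy close to 1, which is exactly why accuracy is uninformative here.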

3.1.1 ROC Curve

In a two-class ranking scenario, the receiver operator characteristic (ROC) curve represents the relationship between the true positive rate (TPR) and the false positive rate (FPR) at any value of the decision point of the ranking method [16]. The TPR and FPR are defined as

\text{FPR} = \frac{FP}{FP + TN} \qquad (3.4)

\text{TPR} = \frac{TP}{TP + FN} \qquad (3.5)

The curve is often displayed by plotting the true positive rate against the false positive rate. Figure 3.1 shows an example of an ROC curve.

Figure 3.1: An Example ROC Curve.

The ROC curve represents the trade-off between correctly recognized samples of criminal behavior and samples of normal behavior that are recognized as criminal behavior. It provides a graphical representation of the overall performance of a ranking method, while compensating for imbalanced data by showing the results as ratios. A measure derived from the ROC curve, the area under the curve (AUC), is defined as the integral over the whole length of the ROC curve. It represents the probability that a randomly chosen positive

example is ranked higher than a randomly chosen negative example. In other words, it represents the fraction of correctly ranked pairs:

\text{AUC}(C) = \frac{1}{|X^+| \, |X^-|} \sum_{i=1}^{|X^+|} \sum_{j=1}^{|X^-|} \delta(x^+_i, x^-_j), \qquad (3.6)

where

\delta(x^+_i, x^-_j) = \begin{cases} 0 & \text{if } rank(x^+_i) > rank(x^-_j) \\ 1 & \text{if } rank(x^+_i) < rank(x^-_j) \end{cases} \qquad (3.7)

By utilizing the AUC as a performance statistic, a single real-valued number is retrieved that represents the performance of the ranking method while taking the imbalance into account [17]. Ranking methods exist that directly optimize the AUC.

The AUC is a performance measure that handles imbalanced data, but its usefulness as a performance measure for ranking intelligence data is debatable. Although the AUC gives a good indication of overall accuracy while taking imbalance into account, a single example proves its shortcoming on intelligence data [18]. Figure 3.2(a) shows two hypothetical ROC curves A and B with the same AUC. Although these ROC curves have the same AUC value, one result is clearly preferred from an intelligence perspective. ROC curve A starts in the origin and increases linearly. This means that with imbalanced data, for every true positive example we retrieve, a much larger number of false positives comes along and clutters the result. Curve B, however, starts at a true positive rate of 50%. This means that half of all the criminal behavior has been ranked at the top of the list, before any example of normal behavior. The other half is more difficult to find, resulting in a less steep curve and the same AUC as curve A. In an intelligence-based investigation, this results in a very high probability of success when investigating the highest ranked samples. Clearly the ranking method resulting in curve B is preferred.

As intelligence data is often heavily imbalanced, a small increase in the false positive rate means a large increase in the number of false positives. Therefore, instead of a good performance over the whole length of the ROC curve, it is often much more interesting to have a good performance on a small subset. If there is a high probability of crime within the first 100 or 1000 samples of the ranking result, a criminal investigation into all of these probable perpetrators becomes feasible. The presence of such a subset is hidden in the leftmost part of the ROC curve, where false positive rates are low. A high starting point indicates that such a subset exists. But to compare the performance of individual models, a quantifiable performance measure is required that captures the presence of such a subset in a single value. Certain performance measures used in document ranking

systems and information retrieval systems achieve this goal by placing more importance on higher-ranked correct results than on lower-ranked correct results.

Figure 3.2: Comparison Between ROC Curves (a) and Precision-Recall Curves (b).

3.1.2 Precision-Recall Curve

Precision and recall are measures from the field of information retrieval and are defined at a single decision point as

\text{Precision} = \frac{TP}{TP + FP} \qquad (3.8)

\text{Recall} = \frac{TP}{TP + FN} \qquad (3.9)

Information retrieval is concerned with systems returning specific query results as a ranked list in which higher-ranked positive results are valued more [19]. Take for example an Internet search engine. End-users will most likely only look at the first few pages of a search result; returning a positive hit at page 100 is not very valuable. The main goal of these systems is not to return all possible positive results with as few false positives as possible, but to return the best results first, with as few false positives as possible in the first few pages. The search engine aims at a high precision over the first n results. An intelligence-based investigation has the same goal. Although individual search results from a search engine differ in their relevance score, while there is no difference between individual examples of criminal behavior, in both scenarios there is a clear demand for a high precision on a small subset of the population.

Precision is defined by values at a single operating point. It is often more intuitively defined as precision@n, where n represents the operating point and stands for the first n returned results. TP + FP is set equal to n, and precision@n is calculated as in equation 3.8. Recall@n is similarly defined, with TP + FP again equal to n, and then calculated as in equation 3.9.

If we plot the precision at all possible recall values, we obtain the precision-recall curve. This curve gives an intuitive representation of the level of compliance with the goal we are after. The precision-recall curves shown in figure 3.2(b) originate from the same ranking as the ROC curves shown in figure 3.2(a).

As indicated, the research into possible perpetrators is costly, both in resources and in impact, as the capacity available for individual investigations is limited and such an investigation is privacy-invasive. Evaluating the precision at all recall values gives insight into the prevalence of a subset with high precision. If we set n to correspond to the available capacity, a ranking method can be optimized for precision@n to obtain the most desirable results. The capacity is often low, but can be quite variable in different investigations. The average precision is therefore used as a generalization of this performance measure and serves as a good indication of whether methods are suitable for use in an intelligence-based investigation [20]. Average precision (AP) is defined in equation 3.10 as the average precision value over all recall values:

\text{AP} = \frac{1}{|X^+|} \sum_{i=1}^{|X|} \text{Precision@}i \cdot \beta_i, \qquad (3.10)

where

\beta_i = \begin{cases} 0 & \text{if } x_i \in X^- \\ 1 & \text{if } x_i \in X^+ \end{cases} \qquad (3.11)

In this equation, X is the ranked result of a classification method C, with x_1 being the highest-ranked result, i.e. the one with the highest value of c(x). AP is equal to the area under the precision-recall curve. Figure 3.2(b) shows that curve B has a higher AP and is therefore preferred in an intelligence-based investigation. It is proven that neither optimal performance in terms of accuracy nor optimal performance in terms of AUC can guarantee optimal performance in terms of average precision [18].
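Both measures follow directly from these definitions; a minimal sketch computing them from ranking scores (ties in the scores are ignored for simplicity; names are our own):

```python
import numpy as np

def auc_and_ap(c, y):
    """AUC (eq. 3.6) and average precision (eq. 3.10) of a ranking.
    c: ranking scores, higher means more likely criminal.
    y: true labels, +1 for X+ and -1 for X-."""
    order = np.argsort(-c)          # highest-ranked sample first
    pos = (y[order] == 1)
    n_pos, n_neg = pos.sum(), (~pos).sum()

    # AUC: each positive at position i is correctly ordered with respect
    # to every negative ranked below it (eq. 3.7 counts these pairs).
    neg_above = np.cumsum(~pos)     # negatives ranked at or before position i
    auc = (n_neg - neg_above[pos]).sum() / (n_pos * n_neg)

    # AP: mean of precision@i taken over the positive positions only.
    precision_at = np.cumsum(pos) / np.arange(1, len(y) + 1)
    ap = precision_at[pos].mean()
    return auc, ap
```

A ranking like curve B, which places half the positives at the very top, yields a much higher AP than curve A at the same AUC, matching the discussion of figure 3.2.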

3.2 Ranking with Two-Class Classification Methods

In an intelligence-based investigation, we only distinguish between normal and criminal behavior. This allows the use of two-class classification methods for ranking results and the use of performance measures that are suitable for two-class methods. To be able to rank results, only classification methods can be used that return some form of confidence score instead of only a predicted class label. Different classification methods implement this differently, according to the underlying mechanisms of the method. This results in different scales being used by different methods for the ranking of results. Performance measures comparing these methods should therefore do so based only on the order of the ranked results, and not on the individual confidence scores.

Several classification methods exist that have the ability to return a confidence score in a two-class scenario. Generative classification methods are distribution estimators. Given data points from a single class X^ℓ_tr with ℓ ∈ {+, −}, this group of classification methods tries to find the parameters θ^ℓ of a model that maximize the probability of observing the data, given the model. Which model is used depends on the classification method. The parameters of the model can be estimated by maximum likelihood estimation as in equation 3.12 [21]:

\theta^\ell_{ML} = \underset{\theta^\ell}{\mathrm{argmax}} \; p(X^\ell_{tr} \mid \theta^\ell) \qquad (3.12)

or by maximum a posteriori estimation as in equation 3.13:

\theta^\ell_{MAP} = \underset{\theta^\ell}{\mathrm{argmax}} \; p(X^\ell_{tr} \mid \theta^\ell) \, p(\theta^\ell) \qquad (3.13)

These classification methods rank data points based on the likelihood ratio given in equation 3.14:

\Lambda(x) = \frac{p(x \mid \theta^+)}{p(x \mid \theta^-)} \qquad (3.14)

Discriminative classification methods are boundary estimators. They approximate a boundary in the multidimensional feature hyperspace. Data points are labeled according to which side of the boundary they lie on; the distance to the boundary is assigned as the confidence score of the classification, and the side of the boundary determines the sign of the confidence score.
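As a concrete illustration of generative ranking, a minimal sketch with one Gaussian model per class, computing the likelihood ratio of equation 3.14 in log space for numerical stability (function and variable names are our own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_scores(X_pos, X_neg, X_te):
    """Rank test samples by log(Lambda(x)) = log p(x|theta+) - log p(x|theta-),
    with theta estimated per class by maximum likelihood (eq. 3.12)."""
    mu_p = X_pos.mean(axis=0)
    mu_n = X_neg.mean(axis=0)
    cov_p = np.cov(X_pos, rowvar=False, bias=True)  # ML covariance estimate
    cov_n = np.cov(X_neg, rowvar=False, bias=True)
    return (multivariate_normal.logpdf(X_te, mu_p, cov_p)
            - multivariate_normal.logpdf(X_te, mu_n, cov_n))
```

Sorting the test samples by this score, descending, gives the ranking Y.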

In the following subsections, several methods are described in detail and their applicability to ranking intelligence data is discussed.

3.2.1 k-Nearest Neighbors

k-Nearest Neighbors (k-NN) is an intuitively very simple classification method. It assigns to a new data point the class label that is most common among its k nearest neighbors in the training data. Intuitively this is a strong point: if an entity shows behavior very similar to that of known criminals, we would expect a definite increase in the probability of observing criminal behavior for this entity as well. Confidence scores can be determined in k-NN ranking by the ratio of occurrence of both classes within the k nearest neighbors. In 1-NN, the confidence score is determined by the difference in distance between the single nearest neighbor from each class. Which value of k is optimal depends on several properties of the data. An example of k-NN classification for k = 1 and k = 5 is shown in figure 3.3.

Figure 3.3: An Example of k-NN Classification.

When measuring how near two data points are, a dissimilarity function d(x, x′) is used. Often the squared Euclidean distance is used, represented in equation 3.15:

d(x, x') = (x - x')^T (x - x') \qquad (3.15)

But as the squared Euclidean distance does not take differences in feature ranges into account, sometimes the Mahalanobis distance is used, as represented in equation 3.16:

d(x, x') = (x - x')^T \Sigma^{-1} (x - x') \qquad (3.16)

where Σ represents the covariance matrix of the training data.

As the method does not try to estimate underlying data distributions, it is very flexible in representing complex non-linear boundaries in the underlying feature space. With small sample sizes it is, however, also prone to overfitting, as it relies heavily on the individual training data points. A few outliers or wrongly labeled samples can easily distort the classification boundaries. This effect is most notable in 1-NN, as can be seen in figure 3.3. Imbalanced training data will also bias the classifier towards the most common class, as these samples have a higher probability of occurring within the k nearest neighbors simply because they are more numerous. Although the algorithm is intuitively simple to understand, it is computationally expensive, as it requires keeping all training data in memory and compares new data points to every training data point at the time of classification.

The applicability of k-NN to ranking intelligence data is unclear. Although the classification method shows insufficient ability to cope with properties common in intelligence data, label noise and imbalance, it does show flexibility in representing complex boundaries.

3.2.2 Logistic Regression

In a logistic regression model, the sigmoid function is used to represent the probability that a new data point belongs to either class. For example, for a data point x ∈ X_te we can calculate

p(x \in X^+ \mid x) = \sigma(b + x^T w). \qquad (3.17)

In this equation, as b + x^T w increases, so does the probability that x belongs to class X+. The (D(X) − 1)-dimensional decision boundary is defined as the hyperplane b + x^T w = 0 in the feature space and is represented by weights w that determine the orientation of the hyperplane and a scalar b representing the offset of the hyperplane [21]. The logistic sigmoid function σ(a) is defined as

\sigma(a) = \frac{1}{1 + e^{-a}}. \qquad (3.18)

When learning a logistic regression model, the parameters b and w are estimated by maximum likelihood estimation using gradient ascent [21].

3.2.3 Fisher's Linear Discriminant

The concept behind Fisher's Linear Discriminant (FLD) is that the data is projected onto a single dimension, while minimizing the within-class variance and maximizing the between-class variance [3][21]. Suppose we have two classes X+ ⊆ X_tr and X− ⊆ X_tr with sizes N+ and N− respectively. We make the assumption that the classes are each described by a high-dimensional Gaussian distribution. The means of each class X^ℓ with ℓ ∈ {+, −} are estimated using maximum likelihood by

\mu^\ell = \frac{1}{N^\ell} \sum_{x \in X^\ell} x, \qquad (3.19)

and the covariance matrices of each class are estimated as

\Sigma^\ell = \frac{1}{N^\ell} \sum_{x \in X^\ell} (x - \mu^\ell)(x - \mu^\ell)^T. \qquad (3.20)

We can now define the between-class covariance matrix as

\Sigma_B = (\mu^- - \mu^+)(\mu^- - \mu^+)^T \qquad (3.21)

and the within-class covariance matrix as

\Sigma_W = \sum_{x \in X^+} (x - \mu^+)(x - \mu^+)^T + \sum_{x \in X^-} (x - \mu^-)(x - \mu^-)^T. \qquad (3.22)

The Fisher criterion is defined as the ratio of the between-class variance and the within-class variance and can be written as

J(w) = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}. \qquad (3.23)

Here, the vector w with D(w) = D(X) represents the linear projection vector that projects data points x onto a single dimension z:

z = w^T x \qquad (3.24)

We can find the maximum of J(w) by differentiating equation 3.23 with respect to w, and find w to be proportional to

w \propto \Sigma_W^{-1} (\mu^- - \mu^+). \qquad (3.25)

We finally project all samples onto their corresponding z values using equation 3.24. The values of z obtained by this projection serve as confidence scores on which to base a ranking.

The derivation shown here requires Σ_W to be invertible. If this is not the case, for example when the dimensionality of the data is larger than the number of samples in a class, a more complex derivation is required. This is beyond the scope of this thesis, but we refer the interested reader to [21] for a more thorough explanation.

FLD is a very fast method, as it is computationally inexpensive. It is also robust against noise and imbalance, as it only regards the means and covariances of the data points. However, it will only prove strong in classifying data that is close to linearly separable. It is unclear why intelligence data should exhibit that property, but given that FLD handles noise and imbalance well, we should consider its applicability to ranking intelligence data.

3.2.4 Quadratic Discriminant

The Quadratic Discriminant Classifier (QDC) assumes a Gaussian distribution for each class and estimates the individual class means and covariances by maximum likelihood estimation as in equations 3.19 and 3.20. When given new examples from the test set, it calculates the ratio of posterior probabilities of a data point x belonging to either class using a Bayesian framework:

\frac{p(X^+ \mid x)}{p(X^- \mid x)} = \frac{p(x \mid X^+)}{p(x \mid X^-)} \cdot \frac{p(X^+)}{p(X^-)}. \qquad (3.26)

The class-conditional densities for either class ℓ ∈ {+, −} are calculated according to [3] as

p(x \mid X^\ell) = \frac{1}{(2\pi)^{D/2} \, |\Sigma^\ell|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu^\ell)^T (\Sigma^\ell)^{-1} (x - \mu^\ell) \right\}. \qquad (3.27)

A ranking of samples is obtained by ordering the posterior probability ratios of the individual samples.

The power of QDC is its ability to represent non-linear decision boundaries. Compensating for prior odds also deals with imbalanced data. But keeping Occam's razor [22] in mind, we know that a more complex data description is not always better. If the data is sparse, a non-linear approximation easily overfits the training data. Intelligence data is often sparse in the positive class, as there are few examples of criminal behavior but often many descriptive features. The applicability of QDC to ranking intelligence data is therefore unclear.
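The FLD ranking derived above reduces to a few lines of linear algebra; a minimal sketch (the sign of w is flipped relative to equation 3.25 so that positive samples receive higher scores; names are our own):

```python
import numpy as np

def fld_scores(X_pos, X_neg, X_te):
    """Fisher's linear discriminant: compute w from equations 3.19-3.25
    and use the projections z = w^T x (eq. 3.24) as ranking scores."""
    mu_p = X_pos.mean(axis=0)
    mu_n = X_neg.mean(axis=0)
    # Within-class scatter, equation 3.22.
    S_w = ((X_pos - mu_p).T @ (X_pos - mu_p)
           + (X_neg - mu_n).T @ (X_neg - mu_n))
    # Solve S_w w = (mu+ - mu-) instead of inverting S_w explicitly.
    w = np.linalg.solve(S_w, mu_p - mu_n)
    return X_te @ w
```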

3.2.5 Support Vector Machines

Support Vector Machines (SVMs) are classification methods that maximize the margin between the classes, first described by Vapnik [23] and later generalized by Mangasarian [24][25]. A practical tutorial on their application is written by Hsu et al. [26]. It is intuitive that maximizing the margin between classes increases generalization performance. Figure 3.4(a) shows data points from two classes separated by a decision boundary that maximizes the margin.

Figure 3.4: Maximum Margin Classification: (a) maximizing the margin, (b) with slack variables. Figures taken from Barber [21].

The data points that lie closest to the decision boundary are the support vectors and determine the location of the decision boundary. We model the decision boundary as

w^T \phi(x) + b = 0, \qquad (3.28)

where φ(x) denotes a fixed feature-space transformation [3] and w and b are parameters of a linear decision surface in the transformed space. To maximize the margin, we minimize

C \sum_{n=1}^{N} \xi_n + \|w\|^2, \qquad (3.29)

where the ξ_n are slack variables that represent the penalty for misclassified samples (ξ > 1) and samples that end up inside the margin (0 < ξ ≤ 1). Figure 3.4(b) shows maximum margin classification using slack variables. C > 0 represents a tunable parameter that controls the trade-off between the misclassification penalty and the size of the margin. We now set the constraints

\forall x_n: \; y_n \left( w^T \phi(x_n) + b \right) \geq 1 - \xi_n, \quad n = 1, \ldots, N, \quad \xi_n \geq 0, \qquad (3.30)

with the true class label y defined as

y = \begin{cases} -1 & \text{if } x_n \in X^- \\ +1 & \text{if } x_n \in X^+ \end{cases} \qquad (3.31)

Finding the optimal margin comes down to minimizing equation 3.29 subject to the constraints in equation 3.30. This is an example of a quadratic programming problem. Samples can subsequently be ranked according to their distance to the optimal margin, with the sample that is farthest away from the margin on the positive side ranked first and the sample that is farthest away on the negative side ranked last.

SVMs have shown consistently good performance over many different types of data. There is no reason to believe this is not the case with intelligence data.

3.3 Ranking with One-Class Classification Methods

The applicability of one-class classifiers is described by Tax et al. [7][27][28] and Juszczak [29]. Methods for one-class classification model only a single class and are therefore applicable in scenarios where only a single class is well defined. One-class classification methods have proven to perform well on datasets from other fields of study where imbalanced, noisy data is common, for example on interstitial disease detection in chest radiography [30], seizure analysis from intracranial EEG [31], intrusion detection [32] and forensic case data [33].
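As a preview of what such a one-class ranking looks like in practice, a minimal sketch trained on positive examples only; it uses scikit-learn's OneClassSVM, which with an RBF kernel is closely related to the SVDD described in section 3.3.2 below (the parameter values are placeholders, not tuned):

```python
from sklearn.svm import OneClassSVM

def one_class_scores(X_pos, X_te, nu=0.1, gamma="scale"):
    """Model only the criminal class X+ and rank unknown samples by how
    well they fit that model (higher score = closer to the positives)."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_pos)
    return model.decision_function(X_te)
```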

Intelligence data shows characteristics that signify the applicability of one-class ranking methods. It is logical that modeling only the class of criminal behavior deals with the problems of label noise and imbalance. However, throwing away all information in the negative class might also have a detrimental effect on ranking performance. A careful performance evaluation of a selection of possibly useful one-class methods should show whether they perform competitively with the previously mentioned methods.

3.3.1 Gaussian Data Description

The Gaussian Data Description (GDD) is one of the simplest methods available. It estimates the mean µ and covariance matrix Σ of the target class by maximum likelihood. New samples are simply ranked according to their posterior probability given the Gaussian model.

3.3.2 Support Vector Data Description

The Support Vector Data Description (SVDD) is a one-class SVM variant described by Tax [34]. It represents a class by the minimal-volume hypersphere around the data, parameterized by its center c and radius R. Optimizing the parameters of this sphere is similar to the optimization of an SVM. We minimize

C \sum_{n=1}^{N} \xi_n + R^2 \qquad (3.32)

constrained by

\forall n: \; \|x_n - c\|^2 \leq R^2 + \xi_n, \quad n = 1, \ldots, N, \quad \xi_n \geq 0. \qquad (3.33)

This is again solved using quadratic programming. Samples can again be ranked according to their distance to the optimal margin.

3.4 Ranking with Ranking Methods

We have seen in the previous sections that most methods minimize the misclassification rate on training examples, which results in a performance that is optimal in terms of accuracy. An intelligence-based investigation is not aimed at classifying everything correctly, but at finding a subset of entities that have a high probability of exhibiting criminal

behavior. This underpins the need for rank-based optimization criteria for use in powerful ranking methods like SVMs. It has to be noted that accuracy is closely linked to other performance measures, and optimizing for accuracy might also yield optimal performance on other performance measures.

3.4.1 SVM-Rank

An SVM method that optimizes for ranking is described by Joachims [35]. The concept behind SVM-Rank is based on an optimization problem similar to that of normal SVMs, which optimize the number of individual data points that end up on the correct side of the decision boundary as in equation 3.30. SVM-Rank, however, optimizes the number of pairs of data points (x_m, x_n), with m, n = 1, ..., N, that are ranked correctly. To this end we introduce extra slack variables ξ_{m,n} and rewrite the optimization problem as

C \sum_{m,n} \xi_{m,n} + \|w\|^2, \qquad (3.34)

constrained by

\forall (x_m, x_n): \; w^T \left( \phi(x_m) - \phi(x_n) \right) + b \geq 1 - \xi_{m,n}, \quad m, n = 1, \ldots, N, \quad \xi_{m,n} \geq 0, \qquad (3.35)

which can again be solved using quadratic programming. As this method directly optimizes a ranking of results, its applicability to intelligence data is unquestionable. It is, however, unclear how this method deals with noise and imbalance.

3.4.2 SVM-Precision

Joachims describes an SVM method, named SVM-Multi, for optimizing non-linear performance measures derived from the confusion matrix shown in table 3.1 [36]. It builds upon the approach of normal SVMs, but instead of optimizing for each individual data point, or for pairs of data points as SVM-Rank does, it uses a multivariate prediction rule and optimizes over all data points at the same time, which allows the use of multivariate loss functions. The optimization problem is formulated as

C\xi + \|w\|^2, \qquad (3.36)

constrained by

\forall \bar{t}' \in \bar{T} \setminus \{\bar{t}\}: \; w^T \left( \Psi(\bar{x}, \bar{t}) - \Psi(\bar{x}, \bar{t}') \right) \geq \Delta(\bar{t}', \bar{t}) - \xi. \qquad (3.37)

In these constraints, x̄ ∈ X̄ represents a tuple of n feature vectors x̄ = (x_1, ..., x_n) and t̄ ∈ T̄ represents a tuple of n labels t̄ = (t_1, ..., t_n), with T̄ ⊆ {−1, +1}^n. The function Ψ describes the match between x̄ and t̄ and is calculated here as

\Psi(\bar{x}, \bar{t}') = \sum_{i=1}^{n} t'_i \, x_i. \qquad (3.38)

The loss function Δ represents the non-linear multivariate loss function that is derived from the confusion matrix. SVM-Precision is a special case of SVM-Multi, where

\Delta(\bar{t}', \bar{t}) = 1 - \frac{TP}{TP + FP}. \qquad (3.39)

To optimize for precision@n, we require TP + FP = n and only include label tuples t̄′ for which this requirement holds. This method directly optimizes an SVM ranking method for a specific precision value, which is very close to the goal of an individual intelligence-based investigation. If the capacity of investigative resources is known, a choice can be made for n and the ranking can be optimized for the current capacity.

3.4.3 SVM-ROCArea

The method SVM-ROCArea optimizes the AUC of the ranking by minimizing the number of swapped pairs of a positive and a negative example. For a detailed description of its implementation, built upon SVM-Multi, we refer to Joachims [36]. Although optimizing the AUC is not a direct goal of an intelligence-based investigation, it is a measure that is invariant to imbalance and widely used. Optimizing the AUC directly is therefore interesting even for intelligence data.

3.4.4 SVM-MAP

Yue et al. describe an SVM method for optimizing MAP [37] that is based on structural SVMs as described by Tsochantaridis [38] and built upon SVM-ROCArea [36]. (In the literature, the term mean average precision is defined as the mean of several average precision values for different queries. But in an intelligence-based investigation only a single query is posed, and therefore the mean average precision equals the average precision.) Equivalent

    \forall \bar{t}' \in \bar{T} \setminus \{\bar{t}\} : \; w^T (\Psi(\bar{x}, \bar{t}) - \Psi(\bar{x}, \bar{t}')) \ge \Delta(\bar{t}', \bar{t}) - \xi.    (3.37)

In these constraints, \bar{x} \in \bar{X} represents a tuple of n feature vectors \bar{x} = (x_1, \dots, x_n), and \bar{t} \in \bar{T} represents a tuple of n labels \bar{t} = (t_1, \dots, t_n) with \bar{T} \subseteq \{-1, +1\}^n. The function Ψ describes the match between \bar{x} and \bar{t} and is calculated here as

    \Psi(\bar{x}, \bar{t}) = \sum_{i=1}^{n} t_i x_i.    (3.38)

The loss function Δ represents the non-linear multivariate loss function derived from the confusion matrix. SVM-Precision is a special case of SVM-Multi with the precision-based loss

    \Delta(\bar{t}', \bar{t}) = 1 - \frac{TP}{TP + FP}.    (3.39)

To optimize for precision@n, we require TP + FP = n and only include tuples \bar{t}' for which this requirement holds. This method directly optimizes an SVM ranking for a specific precision value, which is very close to the goal of an individual intelligence based investigation: if the capacity of investigative resources is known, a choice can be made for n and the ranking can be optimized for the current capacity.

3.4.3 SVM-ROCArea

The method SVM-ROCArea optimizes the AUC of the ranking by minimizing the number of swapped pairs of a positive and a negative example. For a detailed description of its implementation built upon SVM-Multi, we refer to Joachims [36]. Although optimizing the AUC is not a direct goal of an intelligence based investigation, it is a measure that is invariant to imbalance and widely used. Optimizing the AUC directly is therefore interesting even for intelligence data.
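As an aside, the quantity SVM-ROCArea maximizes can be computed directly: the AUC equals the fraction of positive-negative pairs that are ordered correctly. A minimal MATLAB sketch (scores and y are hypothetical variables holding the ranking scores and the +1/-1 labels):

    % AUC as the fraction of correctly ordered (positive, negative) pairs;
    % ties count half, so 0.5 corresponds to a random ranking.
    sp = scores(y == +1);                         % scores of positive samples
    sn = scores(y == -1);                         % scores of negative samples
    correct = 0;
    for i = 1:numel(sp)
        correct = correct + sum(sp(i) > sn) + 0.5 * sum(sp(i) == sn);
    end
    auc = correct / (numel(sp) * numel(sn));      % 1 means a perfect ranking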

3.4.4 SVM-MAP

Yue et al. describe an SVM method for optimizing MAP [37] that is based on structural SVMs as described by Tsochantaridis [38] and built upon SVM-ROCArea [36]. (In the literature, the term mean average precision denotes the mean of several average precision values over different queries; an intelligence based investigation poses only a single query, so the mean average precision equals the average precision here.) Equivalently to SVM-Multi, SVM-MAP optimizes equation 3.36 constrained by equation 3.37. The combined feature representation Ψ is now calculated as

    \Psi(\bar{x}, \bar{t}) = \frac{1}{|X^+| \, |X^-|} \sum_{x^+ \in X^+} \sum_{x^- \in X^-} t_{\pm} \left( \phi(\bar{x}, x^+) - \phi(\bar{x}, x^-) \right),    (3.40)

with

    t_{\pm} = \begin{cases} +1 & \text{if } \mathrm{rank}(x^+) < \mathrm{rank}(x^-) \\ -1 & \text{if } \mathrm{rank}(x^+) > \mathrm{rank}(x^-) \end{cases}    (3.41)

The algorithm described in [37] iteratively optimizes MAP by finding the most violated constraint.

Average Precision (AP) performance can be seen as a generalization of performance in terms of precision@n over different values of n. Optimizing for precision@n is very close to the goal of an intelligence based investigation with a known investigative capacity of n. When a concrete number for n is not available, or as a measure to compare different methods on their general performance on different types of intelligence data, AP is a suitable generalization. Optimizing for MAP performance (equivalent to AP on a single query) therefore seems ideal for discovering criminal behavior in intelligence datasets.

3.5 Performance Estimation

If a large amount of data is available and every class is well represented, a simple hold-out method is sufficient to provide an accurate performance estimate: the data is split into a separate training and test set only once, the model is learned on the training set and the performance measure is estimated on the test set. This simple approach is almost never suited for intelligence data though, as the number of examples of criminal behavior is always low.

If the available amount of data is limited, the estimation of any performance measure is often based on the average performance over several draws of the available data. Different drawing methods exist, but two aspects of such an estimate are important to consider for any drawing method. The difference between the average performance and the expected performance is the bias of the performance estimate. The variance is the extent to which individual performance estimates differ from each other. There is a trade-off between bias and variance: often, methods that exhibit a low bias show a high variance and vice versa.
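Because AP is the measure an intelligence based investigation ultimately cares about, a minimal MATLAB sketch of its computation, written as a helper function (saved as averagePrecision.m) that later sketches reuse; scores and y are again the ranking scores and +1/-1 labels:

    function ap = averagePrecision(scores, y)
    % Average precision: the mean of precision@k over the ranks k at which
    % the true positives (y == +1) are retrieved.
    [~, order] = sort(scores, 'descend');
    rel  = double(y(order) == +1);            % relevance per ranked position
    rel  = rel(:);                            % force column orientation
    prec = cumsum(rel) ./ (1:numel(rel))';    % precision@k for every cutoff k
    ap   = sum(prec .* rel) / sum(rel);       % average over positive positions
    end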

In general, k-fold cross-validation is used to make optimal use of the available data when estimating a performance measure without a high bias or variance in the estimate. All available data is split into k equal folds. If stratification is applied, the original class frequency is maintained in each fold. Subsequently, in k runs, each fold is held out as a test set once, with the rest of the data serving as training data. If classification methods require parameter tuning, the same process is repeated n times within each fold: the training data is again split into n equal folds, with each fold acting as a validation set once and the rest as training data for the parameter tuning. Once the optimal parameter has been found, all data used during validation is then used as training data for learning the model. Kohavi shows that, in general, 10-fold stratified cross-validation results in the best estimate in terms of both bias and variance [39].

Airola et al. show that 10-fold stratified cross-validation exhibits large variance in small-sample studies when estimating the AUC [40]. They propose the use of leave-pair-out cross-validation (LPOCV) as an AUC estimator with slightly lower variance, but at a much higher computational cost. Pahikkala et al. show a computationally efficient method for performing LPOCV for a regularized least squares algorithm [41], but it is not applicable to other classification algorithms.

Isaksson et al. also note that 10-fold cross-validation is unreliable for small sample sizes [42]. The variance between individual estimates is too large to make any valid assumption based on the cross-validation estimate of the performance. They show that for large sample sizes the estimate becomes more reliable, yet for sample sizes in the hundreds there is currently no way to quantify the uncertainty of the estimates, and the large variance is as good as it gets. It is however still the most used validation method, and without making any statements about the significance of obtained results, it still gives a good indication of how different methods compare to each other based on some performance measure.
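A minimal MATLAB sketch of the stratified fold assignment used throughout (y is the +1/-1 label vector; no toolbox functions are assumed):

    % Assign every sample to one of k folds while preserving the class ratio:
    % shuffle each class separately and deal its samples round-robin.
    k = 10;
    fold = zeros(numel(y), 1);
    for c = [-1, +1]
        idx = find(y == c);
        idx = idx(randperm(numel(idx)));              % shuffle within class
        fold(idx) = mod(0:numel(idx) - 1, k)' + 1;    % round-robin fold ids
    end
    for f = 1:k                                       % each fold tests once
        testIdx  = (fold == f);
        trainIdx = ~testIdx;
        % train on X(trainIdx, :), evaluate on X(testIdx, :)
    end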

Chapter 4

Data

To represent intelligence data in our experiments, we have chosen to use both artificial and real data. By using artificial data, we are able to control the parameters of the imbalance and noise. In addition, we know the ground truth behind the label noise, which allows us to measure its detrimental effect on measured performance. Using a real dataset next to the artificial datasets allows us to observe the performance of our methods in practice. Although a single dataset is not representative of intelligence data in general, we are still able to observe the effects of the specific properties of intelligence data on ranking performance.

4.1 Artificial Data

To artificially represent intelligence data, we chose three different representations that use quite basic Gaussian data distributions but emulate the characteristics common in real intelligence data: noise and imbalance. All artificial datasets are created using the pattern recognition toolbox for Matlab, PRTools 4.1 [43].

All three datasets are built out of two Gaussian distributions. Normal behavior is modeled by the Gaussian distribution N_-(µ_-, Σ_-) and criminal behavior by N_+(µ_+, Σ_+). To emulate imbalance and noise in the artificial datasets, we represent X_- by taking N_- - φ samples from the distribution N_- and φ samples from the distribution N_+, such that |X_-| = N_-. Criminal behavior X_+ is represented by N_+ - ψ samples from N_+ and ψ samples from N_-, such that |X_+| = N_+. The parameter φ controls the amount of noise in the majority class, representing undiscovered criminal behavior, and ψ the amount of noise in the minority class. We assume that noise in the minority class usually does not occur in real intelligence data, but to gain a better understanding of the effect of label noise in imbalanced datasets, we include the scenario where noise occurs in the minority class.

The number of features, representing the dimensionality of the dataset, is chosen to be D(X) = 10.

4.1.1 Gaussian Data

The first dataset, gauss, consists of data from two simple Gaussian distributions with slightly different means and unit covariance:

    \mu_- = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0), \quad \mu_+ = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0),    (4.1)

    \Sigma_\bullet = I_{10}, \quad \bullet \in \{+, -\}.    (4.2)

A scatter plot of the first two dimensions of the data is shown in figure 4.1. A few things are interesting to observe here. The imbalance is immediately clear from the plot, and the difficulty of estimating a decision boundary is evident. The noise, however, seems negligible.

Figure 4.1: Artificial Gaussian Intelligence Dataset (scatter plot of the first two features).
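A minimal MATLAB sketch of drawing such an imbalanced, noisy gauss sample (the sizes N_- = 30,000, N_+ = 100 and φ = 900 anticipate the values used in chapter 5; no toolbox functions are assumed):

    % Draw the gauss dataset with phi undiscovered criminal samples hidden
    % inside the class labeled as normal behavior.
    Nneg = 30000; Npos = 100; phi = 900; psi = 0;
    muN  = zeros(1, 10);                               % mean of N-
    muC  = [1 1 1 zeros(1, 7)];                        % mean of N+
    draw = @(n, mu) randn(n, 10) + repmat(mu, n, 1);   % unit covariance
    Xneg = [draw(Nneg - phi, muN); draw(phi, muC)];    % contains hidden criminals
    Xpos = [draw(Npos - psi, muC); draw(psi, muN)];    % pure when psi = 0
    X = [Xneg; Xpos];
    y = [-ones(Nneg, 1); +ones(Npos, 1)];              % labels as observed, not truth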

4.1.2 Stretched Parallel Data

The second artificial dataset for use in our experiments is a stretched parallel dataset, stretch. The means of the classes are the same except for the first two features. Both classes also have unit variance, except for feature 2, which has a variance of 40. The first two features are also rotated by 45 degrees. This results in

    \mu_- = (3, 3, 0, 0, 0, 0, 0, 0, 0, 0), \quad \mu_+ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),    (4.3)

    \Sigma_\bullet = \begin{pmatrix} R \, \mathrm{diag}(1, 40) \, R^T & 0 \\ 0 & I_8 \end{pmatrix}, \quad \bullet \in \{+, -\},    (4.4)

where R is the rotation matrix over 45 degrees.

A scatter plot of this dataset showing the first two features is shown in figure 4.2. Remember that all other features show the same distribution for both classes. This dataset represents a type of class distribution that is difficult to separate for some classification methods. The imbalance can also clearly be seen.

Figure 4.2: Artificial Stretched Parallel Intelligence Dataset (scatter plot of the first two features).
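A minimal MATLAB sketch of building this covariance matrix and drawing from it (a sketch under the stated assumptions, not the exact PRTools construction used for the experiments):

    % Variance 40 along the second feature, rotated by 45 degrees in the
    % plane of the first two features; the other eight features stay standard.
    R2    = [cosd(45) -sind(45); sind(45) cosd(45)];   % 2-D rotation matrix
    Sigma = blkdiag(R2 * diag([1 40]) * R2', eye(8));  % full 10 x 10 covariance
    L     = chol(Sigma, 'lower');                      % for sampling
    Xneg  = randn(29100, 10) * L' + repmat([3 3 zeros(1, 8)], 29100, 1);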

4.1.3 Spherical Data

The third and final artificial dataset we use is a spherical dataset, sphere, where both classes have the same means but the normal class has a larger variance σ_-^2 in the first two features:

    \mu_\bullet = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0), \quad \bullet \in \{+, -\},    (4.5)

    \Sigma_+ = I_{10}, \qquad \Sigma_- = \mathrm{diag}(\sigma_-^2, \sigma_-^2, 1, \dots, 1), \quad \sigma_-^2 > 1.    (4.6)

Figure 4.3 shows a scatter plot of the first two features of this dataset. Due to the equal mean vectors, linear models are unable to correctly model this dataset. The imbalance is again clearly visible.

Figure 4.3: Artificial Spherical Intelligence Dataset (scatter plot of the first two features).

4.2 IILS Data

The real-world intelligence dataset iils we use in our experiments represents the problem of finding new examples of Illegal Irregular Living Situations (IILS) in the municipality of Rotterdam. An illegal irregular living situation occurs, for example, when more people than allowed reside at a single address. This intelligence dataset consists of 292,100 examples (addresses), of which 300 are confirmed IILS. The rest is unknown, but expected to contain only roughly 0.5% IILS. An intelligence based investigation is aimed at discovering these examples.

For each address, several features are available.

Some features are available for every address individually: the number of households enlisted at the address, the number of persons enlisted at the address, the number of persons enlisted at the address that are over 18, the number of rooms, the total surface area, the period in which it was built, the type of building, the type of municipal land use plan, the type of owner and the level of administrative over-occupation. Next to that, many features are available at the neighborhood level, with the entire municipality partitioned into 65 neighborhoods. Within a single neighborhood, the values of these features are identical for every sample.

Some features are categorical features, without an ordering and with n possible predefined feature values. To be able to use these features with numeric-oriented ranking methods, they are rewritten as n separate binary features, thereby increasing the dimensionality of the dataset; a sketch of this rewrite follows below.

Labeling all unknown examples as normal behavior makes two-class ranking methods available for use, but at the cost of creating label noise. It is unknown whether attribute noise is present in the form of measurement errors. Missing values do occur, and as these are in all probability caused by human error, we assume they are MCAR and delete 12 positive examples along with a number of unknown examples. This leaves us with 284,645 unknown samples, which we label as normal behavior, and 288 true examples of IILS.
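A minimal MATLAB sketch of the categorical rewrite mentioned above (catcol is a hypothetical column of integer category codes for one categorical feature):

    % Expand one categorical feature into n binary indicator features, one
    % per observed category value, so numeric ranking methods can use it.
    vals = unique(catcol);                      % the n distinct category values
    B = zeros(numel(catcol), numel(vals));
    for j = 1:numel(vals)
        B(:, j) = (catcol == vals(j));          % 1 where the sample has value j
    end
    % B replaces the original column, raising the dimensionality by n - 1.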

As this dataset is a typical example of an intelligence dataset, it is ideally suited to test our hypotheses.

Chapter 5

Experiments

5.1 Setup

To answer the research questions posed in section 1.1, we conduct several experiments to gain an understanding of our data and answer the subquestions. All experiments are conducted in MathWorks MATLAB R2010a, using the pattern recognition toolbox for Matlab, PRTools 4.1 [43]. For one-class ranking experiments, we make use of DDtools 1.7.3, the Data Description Toolbox for Matlab [44]. The experiments using SVM are performed with SVMlight [45]. Experiments with SVM-Rank are performed with SVMrank [35]. Experiments with SVM-P@n and SVM-ROCArea are performed with SVMperf [36]. Note that the SVM-P@n ranking methods were never tested, so their results should be taken with a grain of salt. We chose to include them anyway, as optimization of precision is relevant to our data. We modified SVMperf slightly to allow for values of n > N_+ by removing a hard-coded limit. Experiments with SVM-MAP are performed with SVMmap [37].

5.2 Training Imbalance

We limit the size of the negative class in the training set by performing random under-sampling. To determine how small we can make the sample size without observing a detrimental effect on performance, we construct learning curves by experimenting with several different amounts of sampling; a sketch of the under-sampling step follows below. We perform our experiments on all datasets described in the previous chapter. As parameters for the artificial datasets, we set N_- = 30,000, N_+ = 100 and φ = 900. As in real intelligence data, we keep ψ = 0. We use two selected two-class discriminative classification methods to construct a ranking: Fisher's Linear Discriminant (FLD) and the Quadratic Discriminant Classifier (QDC).
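A minimal MATLAB sketch of that under-sampling step (Xtr and ytr are hypothetical training data and labels; m is the majority sample size being varied):

    % Random under-sampling: keep all positive examples and a random subset
    % of at most m negative examples from the training set.
    pos  = find(ytr == +1);
    neg  = find(ytr == -1);
    neg  = neg(randperm(numel(neg)));        % shuffle the majority class
    keep = neg(1:min(m, numel(neg)));        % retain at most m negatives
    sel  = [pos; keep];
    Xsub = Xtr(sel, :);                      % reduced, less imbalanced set
    ysub = ytr(sel);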

The result for one-class methods is trivial, as these methods ignore the negative class altogether. Assessing the effect of sampling on the performance of the ranking methods is computationally too expensive with the available hardware, but it is deemed comparable to that of the two-class methods used.

Figure 5.1: Learning Curves for Gaussian Dataset ((a) AUC, (b) AP; FLD and QDC).

We perform our experiments using 10-fold stratified cross-validation, and within each fold we perform random under-sampling on the majority class in the training set to cut it down to a specific size |X_tr^-|. We perform the experiment with increasing values of |X_tr^-| for 200 < |X_tr^-| < 10,000. For each value of |X_tr^-| we repeat the experiment 10 times. We measure the performance by calculating the average AUC and AP and measuring the variance over all of the 100 individual folds. Within each fold, the training set and test set are scaled: the mean of the training set is shifted to the origin and the total variances of all features are scaled to unit variance. To prevent information leakage, the scaling vector obtained from the training set is subsequently applied to the test set.
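A minimal MATLAB sketch of this leakage-safe scaling (Xtr and Xte are hypothetical training and test matrices):

    % Scale with statistics estimated on the training set only, then apply
    % the very same shift and scaling vector to the test set.
    mu = mean(Xtr, 1);
    sd = std(Xtr, 0, 1);
    sd(sd == 0) = 1;                          % guard constant features
    Xtr = (Xtr - repmat(mu, size(Xtr, 1), 1)) ./ repmat(sd, size(Xtr, 1), 1);
    Xte = (Xte - repmat(mu, size(Xte, 1), 1)) ./ repmat(sd, size(Xte, 1), 1);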

Figure 5.2: Learning Curves for Stretched Dataset ((a) AUC, (b) AP; FLD and QDC).

The learning curves resulting from this experiment are shown in figures 5.1, 5.2, 5.3 and 5.4. The first thing to note is that the negative sample size has only a marginal effect on ranking performance. Sampling only a fraction of the majority class is enough to maintain a steady performance; sampling more does not result in better performance, but does increase computational complexity. In all three artificial datasets, sampling more than the minimum tested value of 200 examples does not increase performance. On the IILS dataset, a slight detrimental effect of under-sampling is seen in both AUC and AP when sampling fewer than 2,000 examples from the majority class, and optimal performance occurs at a majority sample size of around 2,000 or more.

Figure 5.3: Learning Curves for Spherical Dataset ((a) AUC, (b) AP; FLD and QDC).

From these results we can conclude that sampling from the majority class around ten times the training size of the minority class results in optimal performance, both in terms of AUC and AP, on these datasets. To reduce computational complexity in further experiments, we will use a fixed majority sample size of 3,000 in our training sets for all subsequent experiments. We keep this sample size constant and use it in every fold. As indicated, the large variance that is clearly visible is caused by performing 10-fold stratified cross-validation on a small sample size. Although this prevents us from reporting very exact results, we are still able to observe general trends.

Figure 5.4: Learning Curves for IILS Dataset ((a) AUC, (b) AP; FLD and QDC).

5.3 Undiscovered Criminal Behavior

To assess the effects of label noise on ranking performance, we conduct several experiments with different levels of noise φ. To gain a better understanding of label noise in imbalanced data, we include the scenario where different levels of noise ψ occur in the minority class. We construct our datasets according to table 5.1, with N_- = 30,000 and N_+ = 300 for the artificial datasets, and N_- = 284,645 and N_+ = 288 for the IILS dataset. We set the levels of noise to experiment with to 0, 50, 100 and 150. In our experiments with noise in the majority class, we set ψ = 0 and increment φ. In our experiments with noise in the minority class, we set φ = 0 and increment ψ. For our experiments with noise in both classes, we increment φ and ψ together such that φ = ψ.

Each entry below gives the number of samples drawn from the source distribution (artificial data) or original label set (IILS data) indicated in the column header:

                 Artificial data                         IILS data
                 X_-               X_+                   new X_-            new X_+
    Noise in     N_-       N_+     N_-    N_+            X_-       X_+      X_-    X_+
    X_-          N_- - φ   φ       0      N_+            N_-       φ        0      N_+ - φ
    X_+          N_-       0       ψ      N_+ - ψ        N_- - ψ   0        ψ      N_+ - ψ
    both         N_- - φ   φ       ψ      N_+ - ψ        N_- - ψ   φ        ψ      N_+ - φ

Table 5.1: Size and Composition of Intelligence Datasets with Added Artificial Noise.

It is important to note that an unknown amount of noise is already present in the majority class of the IILS dataset. The artificially added noise has to come from our already low number of confirmed cases of criminal behavior. Therefore, in the scenario with noise in the majority class, we can only experiment with N_+ = 133, as the rest is used to generate extra noise in the majority class.

We perform each experiment using 10-fold stratified cross-validation. We use random under-sampling to reduce the size of the majority class in the training set to 3,000 in each fold. Within each fold, the training set and test set are again scaled: the mean of the training set is shifted to the origin and the total variances of all features are scaled to unit variance. The test set is scaled with the scaling vector obtained from the training set. We measure the performance by calculating the average AUC and AP and measuring the variance over all of the 100 individual folds.

As we perform random under-sampling after artificially inserting noise in the majority class, the training set will most likely contain few to no noisy examples. The learned models will not differ much from the noiseless setting. But as the test set will contain added noisy examples in the majority class, we expect to observe a detrimental effect on measured performance. This effect should be most visible in the AP, as noisy samples in the majority class represent undiscovered criminal behavior and should be ranked high by the model. Especially as noise levels increase, the fraction of undiscovered criminal behavior in the top ranked examples should increase, leading to a decrease in measured AP performance. Generating artificial noise in the minority class or in both classes is expected to have a severe detrimental effect on performance, as the already limited number of examples of criminal behavior is heavily distorted. The results are shown in figures 5.5, 5.6, 5.7 and 5.8.

As we artificially add noise to the majority class, we can keep track of the original labels. By doing so, we can calculate the expected AP by testing on a test set with the corrected labels. This is in contrast to the measured AP, which is measured on a test set with noisy labels. As the added noise represents criminal behavior, we can calculate the extent to which a method is able to discover this undiscovered criminal behavior by calculating the expected AP.
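The bookkeeping behind this comparison is small; a minimal MATLAB sketch using the averagePrecision helper sketched in section 3.5 (scores are the ranking scores on the test set, ynoisy the labels with undiscovered criminal behavior marked -1, and ytrue the corrected ground-truth labels):

    % Measured AP: undiscovered criminal behavior still carries a -1 label,
    % so a method that ranks it high is, paradoxically, penalized.
    measuredAP = averagePrecision(scores, ynoisy);
    % Expected AP: the same ranking scored against the corrected labels,
    % crediting the hidden criminal behavior that was ranked high.
    expectedAP = averagePrecision(scores, ytrue);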

Figure 5.5: Noise in Gaussian Data (LogReg, FLD, GDD and SVDD; (a)-(c) AUC and (d)-(f) AP for noise in the majority class, the minority class and both classes).

Figure 5.6: Noise in Stretched Data (LogReg, FLD, GDD and SVDD; (a)-(c) AUC and (d)-(f) AP for noise in the majority class, the minority class and both classes).

Figure 5.7: Noise in Spherical Data (LogReg, FLD, GDD and SVDD; (a)-(c) AUC and (d)-(f) AP for noise in the majority class, the minority class and both classes).

Figure 5.8: Noise in IILS Data (LogReg, FLD, GDD and SVDD; (a)-(c) AUC and (d)-(f) AP for noise in the majority class, the minority class and both classes).

We expect to observe an increase in expected AP due to two factors. Firstly, undiscovered criminal behavior ending up high in the ranking is now labeled correctly as criminal behavior, and secondly, the total amount of criminal behavior, consisting of known and undiscovered criminal behavior, increases linearly with the amount of added noise, resulting in more positive samples. The expected AP resulting from experiments with noise in the majority class is shown in table 5.2.

Table 5.2: Difference Between Measured AP and Expected AP on (a) Gaussian, (b) Stretched, (c) Spherical and (d) IILS data (measured and expected AP per noise level for LogReg, FLD, GDD and SVDD).

These results show that measuring the AP in settings with undiscovered criminal behavior yields a conservative estimate of the AP. The detrimental effect on measured AP performance with increasing amounts of noise is due to the fact that the method is able to partly fulfill its goal: undiscovered criminal behavior ends up high in the ranking.

Calculating the expected AP has revealed this effect to be quite severe, even with relatively low amounts of label noise.

When noise occurs in the minority class, we observe a severe degradation of performance. This leads us to conclude that, although noise in the minority class is usually non-existent in intelligence data, we must be vigilant if there is reason to believe not all examples labeled as criminal behavior truly represent criminal behavior.

5.4 Ranking Performance

We finally measure the performance of all previously mentioned ranking methods on intelligence data. As artificial datasets we use the datasets described in section 4.1. We set N_- = 30,000, N_+ = 100 and φ = 900. As in real intelligence data, we keep ψ = 0. We also perform this experiment on the IILS dataset, with N_- = 284,645 and N_+ = 288. No extra noise is added to the IILS dataset.

We perform 10 times 10-fold stratified cross-validation and measure the performance by averaging the results from all folds and calculating the variance over all 100 folds. The SVM variants require parameter tuning for the slack trade-off parameter c. To this end, within each fold, we again perform 10-fold cross-validation on the training set to estimate the optimal value for c; a sketch of this procedure follows below. As optimization criterion for the parameter tuning we use the same criterion the classification method optimizes. For example, for SVM-MAP we calculate the AP and choose the value of c that results in optimal performance in terms of AP. For each outer fold, we train the SVM with the value of c that has shown optimal performance in the cross-validation loop on the training set.

As performance measures we calculate the AUC, AP and precision@n with n = 10, 100 and 1,000. Note that the precision@1000 is upper bounded by N_+/1000. The results are shown in tables 5.3, 5.4, 5.5 and 5.6.

A few things are interesting to observe here. First of all, a simple linear discriminative method like FLD performs well in terms of both AUC and AP in all settings where the means of the data distributions do not overlap. Logistic Regression shows similarly good performance. Given that these methods are also among the least computationally expensive, they should serve as a starting point for an analysis in any intelligence based investigation.

SVM-ROCArea shows a better performance in terms of AUC compared to most other SVM variants on all data where the means of the data distributions do not overlap. This shows that for an optimal performance in terms of AUC, SVM-ROCArea is a suited ranking method.
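A minimal MATLAB sketch of the tuning loop described above (trainRank and scoreRank are hypothetical wrappers around the external SVM tools; foldIn is an inner stratified fold assignment as sketched at the end of chapter 3):

    % Select the trade-off parameter c by inner 10-fold cross-validation,
    % scoring candidates with the criterion the method itself optimizes (AP here).
    cGrid = 10 .^ (-3:3);
    apByC = zeros(size(cGrid));
    for ci = 1:numel(cGrid)
        scores = zeros(size(ytr));
        for f = 1:10
            te = (foldIn == f);
            tr = ~te;
            model = trainRank(Xtr(tr, :), ytr(tr), cGrid(ci));
            scores(te) = scoreRank(model, Xtr(te, :));
        end
        apByC(ci) = averagePrecision(scores, ytr);   % validation AP for this c
    end
    [~, best]  = max(apByC);
    finalModel = trainRank(Xtr, ytr, cGrid(best));   % retrain on all training data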

Table 5.3: Performance of all Ranking Methods on Gaussian Data (AUC, AP and P@10, P@100 and P@1000 for 1-NN, LogReg, FLD, QDC, SVM, SVM-Rank, SVM-P@n, SVM-ROCArea, SVM-MAP, GDD and SVDD).

Table 5.4: Performance of all Ranking Methods on Stretched Data (same methods and measures as table 5.3).

SVM-MAP does not show an improvement in AP compared to the other SVM variants. It seems SVM-MAP is not able to maintain its competitive AP performance in the presence of noise and imbalance. The one-class ranking methods we used in our experiments only perform competitively on data that is not linearly separable.

Performance in terms of AP on the artificial data is in some settings of this experiment worse than shown in table 5.2. This shows again that an increased amount of noise in the majority class (φ = 900) has a quite severe detrimental effect on measured AP. Measured performance on the IILS data is better than in table 5.2, as we are able to use all positive examples instead of only 133.

Two factors contribute to a conservative estimate of the AP. Undiscovered criminal behavior ends up high in the ranking, but its being labeled as normal behavior results in worse measured performance in terms of AP.

Table 5.5: Performance of all Ranking Methods on Spherical Data (same methods and measures as table 5.3).

Table 5.6: Performance of all Ranking Methods on IILS Data (same methods and measures as table 5.3).

Next to that, when building a model to be used in practice, all positive examples can be used for training the model, resulting in more predictive power.

Chapter 6

Conclusions

In an intelligence based investigation, the goal is the discovery of criminal behavior in large amounts of data, based on a few known examples of criminal behavior. Such an investigation is not aimed at classifying everything correctly, but rather at discovering a few new cases of criminal behavior. To this end, ranking methods are used to obtain a ranking of examples, which enables the deployment of available resources to investigate the top n ranked results.

We have shown that random under-sampling can be used to decrease the number of negative examples used for training a ranking method. Keeping a number of negative examples in the order of ten times the number of positive samples is sufficient to obtain optimal performance, both in terms of AUC and AP.

By labeling unknown examples as negative examples, two-class ranking methods can be trained. This results in an imbalanced dataset with a small amount of label noise in the majority class. This label noise does not affect the training of a model, as almost all noise is discarded during random under-sampling. But we have shown a relatively severe effect on AP performance estimates, as noisy examples representing undiscovered criminal behavior end up high in the ranked result. This leads us to conclude that the measured AP of a ranking method on an intelligence dataset with a small amount of noise in the majority class is a conservative estimate of the AP to be encountered in practice.

When label noise occurs in the minority class representing criminal behavior, a severe degradation of performance is observed. Although intelligence data usually only contains confirmed examples of criminal behavior, care should be taken if the correctness of these labels is debatable.

We have further shown that simple linear discriminative methods such as FLD and Logistic Regression show competitive performance on intelligence data, both in terms of AUC and AP.

SVMs that use optimization criteria more closely related to the goal of an intelligence based investigation are not able to outperform these simple methods. SVM-MAP fails altogether in achieving an optimal AP performance on intelligence data.

6.1 Future Work

We have shown that even a relatively small amount of label noise in the majority class results in a lower measured AP. Further experimentation with different datasets and parameters might reveal the exact relationship between the amount of noise and the under-estimation of the AP. The results could be used to give an estimate of the expected AP in practice when undiscovered criminal behavior is sought.

To be able to compare the performance of models on different datasets, in future work the Balanced Average Precision (BAP) [46] can be used as a normalized AP measure that is independent of the dataset balance.

Bibliography

[1] Jason van Hulse and Taghi Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data Knowl. Eng., 68(12).
[2] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recogn., 40(12).
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. corr. 2nd printing edition.
[4] Sofia Visa and Anca Ralescu. Issues in mining imbalanced data sets - a review paper. In Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pages 67-73.
[5] Gary M. Weiss and Foster Provost. The effect of class distribution on classifier learning: An empirical study.
[6] Gary M. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1):7-19.
[7] David M. J. Tax. One-class classification. PhD thesis, Technische Universiteit Delft.
[8] Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev., 22(3).
[9] Yunlei Li, Lodewyk F. A. Wessels, Dick de Ridder, and Marcel J. T. Reinders. Classification in the presence of class noise using a probabilistic kernel Fisher method. Pattern Recogn., 40(12).
[10] D. Anyfantis, M. Karagiannopoulos, Sotiris B. Kotsiantis, and Panayiotis E. Pintelas. Robustness of learning techniques in handling class noise in imbalanced datasets. In Christos Boukis, Aristodemos Pnevmatikakis, and Lazaros Polymenakos, editors, IFIP, volume 247. Springer.

[11] Aleksander Kolcz and Gordon V. Cormack. Genre-based decomposition of class noise. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA. ACM.
[12] J. Schafer and J. Graham. Missing data: Our view of the state of the art. Psychological Methods, 7.
[13] D. B. Rubin. Inference and missing data. Biometrika, 63.
[14] T. Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. J. Climate, 14.
[15] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proc. 14th International Conference on Machine Learning. Morgan Kaufmann.
[16] Tom Fawcett. An introduction to ROC analysis. Pattern Recogn. Lett., 27(8).
[17] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7).
[18] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML '06: Proceedings of the 23rd international conference on Machine learning, New York, NY, USA. ACM.
[19] David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series on Information Retrieval. Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands, second edition.
[20] Kazuaki Kishida. Property of average precision and its generalization: an examination of evaluation indicator for information retrieval. Technical report, National Institute of Informatics.
[21] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press.
[22] Pedro Domingos. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery, 3.
[23] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience.
[24] O. L. Mangasarian. Generalized support vector machines. In Advances in Large Margin Classifiers. MIT Press.

[25] O. L. Mangasarian. Data mining via support vector machines. In IFIP Conference on System Modelling and Optimization. Kluwer Academic Publishers.
[26] C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
[27] David M. J. Tax and Robert P. W. Duin. Combining one-class classifiers. In Proc. Multiple Classifier Systems. Springer Verlag.
[28] D. M. J. Tax and R. P. W. Duin. Characterizing one-class datasets. In Proceedings of the 16th annual symposium of the pattern recognition association of South Africa, pages 21-26.
[29] Piotr Juszczak. Learning to recognise. A study on one-class classification and active learning. PhD thesis, Technische Universiteit Delft.
[30] Y. Arzhaeva, D. M. J. Tax, and B. van Ginneken. Improving computer-aided diagnosis of interstitial disease in chest radiographs by combining one-class and two-class classifiers. In J. M. Reinhardt and J. P. W. Pluim, editors, SPIE Medical Imaging. SPIE, Bellingham, WA.
[31] Andrew B. Gardner, Abba M. Krieger, George Vachtsevanos, and Brian Litt. One-class novelty detection for seizure analysis from intracranial EEG. J. Mach. Learn. Res., 7.
[32] Giorgio Giacinto, Roberto Perdisci, Mauro Del Rio, and Fabio Roli. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Inf. Fusion, 9(1):69-82.
[33] Frederic Ratle, Mikhail Kanevski, Anne-Laure Terrettaz-Zufferey, and Olivier Ribaux. A comparison of one-class classifiers for novelty detection in forensic case data.
[34] David M. J. Tax and Robert P. W. Duin. Support vector data description. Mach. Learn., 54(1):45-66.
[35] Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM.
[36] Thorsten Joachims. A support vector method for multivariate performance measures. In ICML '05: Proceedings of the 22nd international conference on Machine learning. ACM.

[37] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM.
[38] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6.
[39] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI '95: Proceedings of the 14th international joint conference on Artificial intelligence, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[40] Antti Airola, Tapio Pahikkala, Willem Waegeman, Bernard De Baets, and Tapio Salakoski. A comparison of AUC estimators in small-sample studies. In Saso Dzeroski, Pierre Geurts, and Juho Rousu, editors, International workshop on Machine Learning in Systems Biology, volume 3, pages 15-23.
[41] T. Pahikkala, A. Airola, J. Boberg, and T. Salakoski. Exact and efficient leave-pair-out cross-validation for ranking RLS. In T. Honkela, M. Pöllä, M.-S. Paukkeri, and O. Simula, editors, Proceedings of the 2nd international and interdisciplinary conference on adaptive knowledge representation and reasoning, pages 1-8.
[42] A. Isaksson, M. Wallman, H. Goransson, and M. Gustafsson. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters, 29(14).
[43] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, and S. Verzakov. PRTools 4.1, a Matlab toolbox for pattern recognition. Delft University of Technology.
[44] D. M. J. Tax. DDtools, the Data Description Toolbox for Matlab.
[45] Thorsten Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell, MA, USA.
[46] J. C. van Gemert, C. J. Veenman, and J. M. Geusebroek. Episode-constrained cross-validation in video concept retrieval. IEEE Transactions on Multimedia, 11(4).


More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

CSE 473: Artificial Intelligence Autumn 2010

CSE 473: Artificial Intelligence Autumn 2010 CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Decision Support Systems

Decision Support Systems Decision Support Systems 50 (2011) 602 613 Contents lists available at ScienceDirect Decision Support Systems journal homepage: www.elsevier.com/locate/dss Data mining for credit card fraud: A comparative

More information

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random [Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

Mimicking human fake review detection on Trustpilot

Mimicking human fake review detection on Trustpilot Mimicking human fake review detection on Trustpilot [DTU Compute, special course, 2015] Ulf Aslak Jensen Master student, DTU Copenhagen, Denmark Ole Winther Associate professor, DTU Copenhagen, Denmark

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-17-AUC

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Johan Perols Assistant Professor University of San Diego, San Diego, CA 92110 [email protected] April

More information

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto [email protected] @alexcpsec @MLSecProject

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto [email protected] @alexcpsec @MLSecProject Agenda Security Monitoring: We are doing it wrong Machine Learning

More information

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR) 2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit Statistics in Retail Finance Chapter 7: Fraud Detection in Retail Credit 1 Overview > Detection of fraud remains an important issue in retail credit. Methods similar to scorecard development may be employed,

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano

More information

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS

SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Bachelor of Science with Honors Research Distinction in Electrical

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of Bar-Ilan University Ramat-Gan,

More information

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information