Discovering Criminal Behavior by Ranking Intelligence Data




UNIVERSITY OF AMSTERDAM Faculty of Science Discovering Criminal Behavior by Ranking Intelligence Data by 5889081 A thesis submitted in partial fulfillment for the degree of Master of Science in the field of Artificial Intelligence track: Forensic Intelligence written during an internship at the Netherlands Forensic Institute Knowledge and Expertise Centre for Intelligent Data Analysis Digital Technology & Biometrics Department External Supervisor: Cor J. Veenman Internal Supervisor: Marcel Worring August 2010

Declaration of Authorship

I, , declare that this thesis titled, Discovering Criminal Behavior by Ranking Intelligence Data, and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

Three things cannot be long hidden: the sun, the moon, and the truth. Buddha

Abstract

Intelligence data consists of a small number of known examples of criminal behavior and a large number of unknown samples. A small fraction of these unknown samples represents undiscovered criminal behavior. An intelligence based investigation is aimed at the discovery of these samples. As resources are limited, the goal of such an investigation is to obtain a subset of the data with a high probability of criminal behavior. This requires the use of ranking methods and a comparison based on performance measures that relate to this goal, such as the average precision. We assess the effect of the imbalance inherent in these datasets and investigate whether random under-sampling can be used to reduce computational complexity. Furthermore, we study the effect of the label noise that results from labeling unknown samples as normal behavior. We then implement several ranking methods for ranking intelligence data and assess their performance. Next to widely used two-class classification methods, we examine the applicability of one-class classification methods and of support vector machines that use model selection criteria more closely related to the goal of an intelligence based investigation.

Acknowledgements

This is where I could write my thanks to all those people who were there at those moments the going got tough and I nearly succumbed under the pressure of writing the whole thing. But to be honest, that never happened. The people I am going to thank here were there long before those moments could occur and kept my spirits up well enough at all times. That definitely deserves my gratitude. First of all, I would like to thank my girlfriend Linda, for always being there for me and keeping me relaxed with her carefree attitude. My mother Ellen, who got me this far by pointing out the importance of good education and her never faltering support. My parents-in-law, for their positive view on life and generous nature which always cheers me up. My grandmother, who will be sitting front row at my graduation, hours before it commences. My dear friends, Amer, Amy, Arike, Ben, Colin, Fleur, Jasper, Krishan, Lea, Pim, Rick, Rolf, Steven and Tabor, for their support and good times together, recharging my energy in between all the hard work. My colleagues at Kecida, Andre, Asli, Chrissie, Gert, Guido, Maarten, Menno, Nienke and Xandra, for all the inspiring talks and shared knowledge. My supervisors, Cor and Marcel, for their useful feedback, eye-opening discussions and continuous support. And finally, I would like to thank Sara, who introduced me to the field of AI, what feels like such a long time ago.

Contents

Declaration of Authorship
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abbreviations
Notation

1 Introduction
  1.1 Research Questions
  1.2 Approach
  1.3 Outline
2 Properties of Intelligence Data
  2.1 Intelligence Data
  2.2 Imbalance
  2.3 Noise
    2.3.1 Label Noise
    2.3.2 Attribute Noise and Missing Values
3 Discovering Criminal Behavior
  3.1 Ranking
    3.1.1 ROC Curve
    3.1.2 Precision-Recall Curve
  3.2 Ranking with Two-Class Classification Methods
    3.2.1 k-Nearest Neighbors
    3.2.2 Logistic Regression
    3.2.3 Fisher's Linear Discriminant
    3.2.4 Quadratic Discriminant
    3.2.5 Support Vector Machines
  3.3 Ranking with One-Class Classification Methods
    3.3.1 Gaussian Data Description
    3.3.2 Support Vector Data Description
  3.4 Ranking with Ranking Methods
    3.4.1 SVM-Rank
    3.4.2 SVM-Precision
    3.4.3 SVM-ROCArea
    3.4.4 SVM-MAP
  3.5 Performance Estimation
4 Data
  4.1 Artificial Data
    4.1.1 Gaussian Data
    4.1.2 Stretched Parallel Data
    4.1.3 Spherical Data
  4.2 IILS Data
5 Experiments
  5.1 Setup
  5.2 Training Imbalance
  5.3 Undiscovered Criminal Behavior
  5.4 Ranking Performance
6 Conclusions
  6.1 Future Work
Bibliography

List of Figures

2.1 Over- and under-sampling
3.1 Example ROC Curve
3.2 Comparison Between ROC and Precision-Recall Curves
3.3 An Example of k-NN Classification
3.4 Maximum Margin Classification
4.1 Artificial Gaussian Intelligence Dataset
4.2 Artificial Stretched Parallel Intelligence Dataset
4.3 Artificial Spherical Intelligence Dataset
5.1 Learning Curves for Gaussian Dataset
5.2 Learning Curves for Stretched Dataset
5.3 Learning Curves for Spherical Dataset
5.4 Learning Curves for IILS Dataset
5.5 Noise in Gaussian Data
5.6 Noise in Stretched Data
5.7 Noise in Spherical Data
5.8 Noise in IILS Data

List of Tables

3.1 Confusion Matrix of Ranked Results Based on a Decision Point
5.1 Size and Composition of Intelligence Datasets with Added Artificial Noise
5.2 Difference Between Measured AP and Expected AP
5.3 Performance of all Ranking Methods on Gaussian Data
5.4 Performance of all Ranking Methods on Stretched Data
5.5 Performance of all Ranking Methods on Spherical Data
5.6 Performance of all Ranking Methods on IILS Data

Abbreviations

ACC     Accuracy
AP      Average Precision
AUC     Area Under the (ROC) Curve
EM      Expectation Maximization
FLD     Fisher's Linear Discriminant
FN      False Negative
FP      False Positive
FPR     False Positive Rate
GDD     Gaussian Data Description
IILS    Illegal Irregular Living Situation
k-NN    k-Nearest Neighbor
LPOCV   Leave-Pair-Out Cross-Validation
MAP     Mean Average Precision
MAR     Missing At Random
MCAR    Missing Completely At Random
MNAR    Missing Not At Random
QDC     Quadratic Discriminant Classifier
ROC     Receiver Operator Characteristic
SVDD    Support Vector Data Description
SVM     Support Vector Machine
TN      True Negative
TP      True Positive
TPR     True Positive Rate

Notation

X                   An intelligence dataset
X+                  Positive samples representing criminal behavior
X−                  Negative samples representing normal behavior
|X|                 The size of dataset X, the number of samples
D(X)                The dimensionality of dataset X, the number of features
X^tr                A training set, used for learning a model
X^val               A validation set, used for parameter tuning
X^te                A test set, used for performance estimation
x = (x_1, ..., x_n) A single sample in dataset X with n = D(X) and features x_1, ..., x_n
C                   A classification method
rank(x)             The position of x in a ranked result
y                   True class label
ŷ                   Predicted class label
c(x)                Ranking score of a data point x after classification
dp                  Decision point on a ranking of samples
Y                   A ranked list of samples resulting from a ranking method
θ                   Parameters of a distribution model
p(a|b)              Conditional probability of a given b
I_n                 n-by-n identity matrix
N(µ, Σ)             Gaussian distribution with mean µ and covariance matrix Σ
φ                   Noise on majority class (undiscovered criminal behavior)
ψ                   Noise on minority class

Dedicated to Henny, my grandmother who has been looking forward to my graduation ever since I started studying at the university

Chapter 1 Introduction

In our ever more digitally focused society, the current tendency is to record and store all possible sorts of data. As data collection mechanisms have improved vastly over the years, in many different settings certain features are nowadays recorded for every single entity in a population. A credit card company, for example, records for every transaction the date, time, sender, receiver, method and card validity. A municipality records for every address within its perimeter the type of housing, surface area and number of rooms. As these features are recorded for every entity in a population, data related to entities that exhibit deviant or criminal behavior is also recorded. However, recording this data does not imply automatic detection of criminal behavior. To achieve this, data analysts combine elaborate methods from the fields of artificial intelligence, computer science, linguistics and mathematics to obtain relevant information regarding criminal behavior from data. As a direct consequence of the increase in data recording and storage, these analysts are more often faced with large datasets.

An intelligence based investigation is aimed at finding correlations between features of the data and the prevalence of a specific type of crime, in an intelligence dataset with known occurrences of that type of crime. A model based on these correlations can be used to assess the risk that unknown or new entities exhibit forms of criminal behavior. Further investigation into these entities is invasive and requires resources. Initiating such an investigation is only feasible if the predictive value of the model is high. A typical investigation in such a large dataset is therefore not focused on classifying everything correctly. It is more important to discover a few new cases of criminal behavior without having to address many false hits. A system operating on an intelligence dataset should offer the possibility to prioritize cases based on a ranking of entities that relates to the probability of them exhibiting forms of criminal behavior. An intelligent ranking allows the available resources to be deployed to investigate the top n highest ranked entities.

Luckily, many crimes occur only infrequently, as only a small percentage of the population exhibits forms of criminal behavior. Recording of data regarding normal behavior is therefore far more common than recording of data regarding criminal behavior. However, not all occurrences of criminal behavior are identified as such. This results in specific characteristics that define an intelligence dataset. It is imbalanced, as it contains many samples of which only a small subset is known to represent criminal behavior. The rest of the samples are unknown, but only a small percentage of these will represent criminal behavior. Using these samples as examples of normal behavior is therefore possible, but results in a small amount of label noise.

Some supervised learning methods can be used to obtain a ranking of intelligence data. Two-class and one-class classification methods rank samples based on a confidence score. Other methods from the field of information retrieval directly optimize a ranking by using rank-based model selection criteria. The application of ranking methods that use model selection criteria related to the goal of an intelligence based investigation is novel and has not been researched before.

Methods used for ranking intelligence data also need to handle the specific properties these intelligence datasets have in common. The problems of data imbalance and label noise from labeling unknown samples have individually received much attention, and several solutions have been proposed. But any possible effects resulting from their combined occurrence have only been marginally researched. Van Hulse et al. have recently performed a comprehensive study into classification performance on imbalanced noisy data [1], but they have only used datasets from the domain of software measurement. It is unknown if the effect on intelligence data is comparable. Methods assigning cost to imbalanced data in a classification task exist [2], but no mention is made of noise and these methods are not focused on ranking.

1.1 Research Questions

It is clear that intelligence datasets often involve huge amounts of imbalanced noisy data. The discovery of patterns in this type of data that signify criminal or deviant behavior is central to an intelligence based investigation. As resources are often limited, the discovery of small subsets with high chances of discovering criminal behavior is most important. Currently available and generally accepted methods focus only on parts of this problem. This leads us to pose the following central research question: How can criminal behavior be discovered in intelligence data? This central research question can be divided into a few subquestions.

To what extent do imbalance and label noise hinder the discovery of criminal behavior? In other fields of research, label noise and imbalanced data result in a detrimental effect on performance. The effect of their co-occurrence in intelligence data is unknown.

Which methods are best at ranking samples in intelligence data? Many different classification methods exist and different classification scenarios ask for different methods. Classification methods operating on intelligence data are required to return a ranking of results. It is unknown if classification methods fit this task.

Are model selection criteria from the field of information retrieval helpful in the efficient discovery of criminal behavior? Ranking results is important with intelligence data. Methods that optimize for such a ranked result exist, but it is unknown how they perform in the discovery of criminal behavior.

1.2 Approach

There is no fixed solution that works for every data analysis problem. No single method performs well on every type of data. Only a careful observation of the characteristics of intelligence datasets will provide footholds for addressing this type of data. As indicated, intelligence data is imbalanced and contains unknown samples. We first assess the effects of these properties on the performance of several selected ranking methods. We test if sampling methods are suited for coping with imbalance, and observe the effect of noise by artificially increasing the amount of label noise in either class. We compare the performance of several one-class and two-class classification methods in ranking intelligence data, with the goal of finding methods that perform systematically well on different intelligence datasets. Finally, we test the applicability of ranking methods that use model selection criteria from the field of information retrieval, which directly optimize for a ranking and are more closely related to the goal of an intelligence based investigation.

1.3 Outline

This thesis is outlined as follows. The following chapter describes the properties of intelligence data in general and goes deeper into the subjects of imbalance and noise. Chapter 3 revolves around the discovery process of criminal behavior in intelligence data. This is followed by chapter 4, in which we describe the specific datasets used for our experiments, which are described in chapter 5.

Chapter 2 Properties of Intelligence Data

2.1 Intelligence Data

An intelligence dataset contains entities that can represent, for example, persons, cars, companies or houses. Descriptive features are available for every individual entity in a population. These descriptive features can be very broad, but also very specific. A single feature such as age is not sufficient to serve as an indicator that the individual in question is involved in burglaries. But a combination of several general or specific features that are known for every entity in the population might indeed serve as a good indicator. If we combine age with, for example, neighborhood of residence, income, race and history of violence, we might observe a single group of individuals to have a significantly higher probability of being or becoming a burglar. It is entities with features such as these that form an intelligence dataset.

To formalize, an intelligence dataset X contains samples from two classes: X+ containing all positive samples representing criminal behavior, and X− containing all samples representing unknown behavior. |X| represents the size of dataset X in terms of the number of samples. With D(X), we denote the dimensionality of the dataset X, i.e. the number of features. Individual samples x ∈ X are represented as (x_1, ..., x_n), with features x_1, ..., x_n ∈ R, and they are labeled with y ∈ {−1, 0, 1}, where n = D(X). These labels denote the class to which a sample belongs. We define that ∀x ∈ X+ : y = 1 and ∀x ∈ X− : y = 0, representing that the true class is unknown. As we assume only a small percentage of criminal behavior in X−, we can set the labels y such that ∀x ∈ X− : y = −1. This does result in label noise, which is discussed in section 2.3.1.

Our goal is to obtain a ranking of the dataset. To train the ranking method, we consider X^tr to represent the training set, X^val an optional validation set, used for tuning parameters specific to the ranking method used, and X^te the separate test set, used for performance estimation of a ranking method. All three sets should be independently drawn from the same distribution. This separation of data is required to make an unbiased assessment of the performance of a ranking method [3].

2.2 Imbalance

An important recurring property of intelligence datasets is the low frequency of certain criminal behavior. The prevalence of a specific type of crime might seem high, but relative to the same crime not occurring in similar situations, this rate is luckily always low. That means that if data is collected for every instance in a specific situation, far more data is collected on normal instances than on crime-related instances. X− is also called the majority class and X+ the minority class, as |X−| >> |X+| [4]. Our method should therefore cope with this imbalance.

To cope with the computational costs involved with some modern-day ranking methods, intelligence datasets might need to be reduced in size by a few orders of magnitude. Maintaining the naturally occurring class frequency in such a scenario would leave us with less than a single example of criminal behavior from which to build our model, a less than satisfactory approach. Weiss and Provost show that when reducing the size of the training set X^tr, the naturally occurring positive class fraction |X+| / (|X+| + |X−|) is indeed not always the best fraction for learning [5].

Several methods for reducing dataset size in imbalanced datasets are discussed by Weiss [6]. One method describes learning only from examples of the minority class, X^tr ⊆ X+. This one-class classification has been extensively researched by Tax [7] and is described in section 3.3. The most commonly used method to deal with imbalanced data is sampling. Weiss describes the difference between over- and under-sampling and their drawbacks [6]. Over-sampling duplicates minority-class samples while under-sampling reduces the number of majority-class samples. A pictorial representation is shown in figure 2.1. In doing this, both methods achieve a more balanced class distribution, but at a cost. Over-sampling increases the chance of overfitting by repeating examples, and does not actually generate novel data points. For several methods over-sampling is simply no solution. Under-sampling removes majority-class samples from the training data, resulting in a potential loss of useful information. The extent to which this detrimental effect occurs depends entirely on the characteristics of the underlying data and the ranking method.
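To make random under-sampling concrete, the following is a minimal sketch (not part of the thesis) that keeps every minority sample and draws a reduced set of majority samples; NumPy, the function name and the balancing ratio are illustrative assumptions.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Randomly under-sample the majority class (y == -1).

    Keeps every minority sample (y == +1) and draws
    ratio * (number of minority samples) majority samples
    without replacement.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)     # minority: known criminal behavior
    neg = np.flatnonzero(y == -1)    # majority: assumed normal behavior
    n_neg = min(len(neg), int(np.ceil(ratio * len(pos))))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    keep = np.concatenate([pos, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

With ratio = 1.0 the sketch produces a balanced training set; larger ratios keep proportionally more majority samples.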

[Figure 2.1: Over-sampling (left) and under-sampling (right).]

2.3 Noise

In a machine learning framework, a distinction is made between two types of noise. Label noise occurs when, for some x ∈ X, y ≠ y_true. Attribute noise occurs when, for some x ∈ X, some x_i ≠ x_{i,true}.

2.3.1 Label Noise

Zhu and Wu state that label or class noise can have a few possible sources [8]. One cause of label noise is human labeling error, which can occur if the labels are acquired by a manual process. Next to that, contradictory examples arise when two examples in the training set have the same attribute values but come with different labels. It is debatable if this should be considered label noise, as overlapping class distributions can result in two correctly labeled examples with the same attribute values but originating from different classes.

In an intelligence dataset, examples of criminal behavior are available only because a criminal investigation has proven guilt in the court cases that underpin all of the positive examples. Therefore usually no label noise occurs in the minority class. However, a judicial system is not infallible and sometimes wrong accusations are made. Yet, discovering wrongly accused individuals falls outside the scope of this paper and for now we assume no label noise occurs in the minority class.

The majority class consists of unknown samples of which only a fraction represents undiscovered criminal behavior. As most two-class ranking methods require negative examples as input, we label the majority class as normal behavior. This results in label noise in the majority class, with the undiscovered samples being the source of this noise.

The percentage of label noise is relatively low in this type of data, due to the fact that the relative occurrence of criminal behavior compared to normal behavior is always low.

Be aware that when we discuss label noise in class X−, we mean to say that some samples in the dataset that are truly from class X+ are labeled with −1. The noise is thus actually only in the labels of x ∈ X−, as the ground truth of these labels is actually +1. Other literature speaks of label noise from class X−, meaning that some instances that truly originate from class X− are mislabeled and have been given the label y = +1. This might lead to confusion if not addressed carefully, as these situations are the exact opposite of each other. To avoid this, we will only use the former way. Thus when we speak of label noise in the majority class, we mean to say that some samples from the minority class have been given the wrong label and are labeled as majority. In the setting of an intelligence dataset, this means that some instances of criminal behavior have not been recognized and are mislabeled as normal behavior.

To address the issue of label noise, several different approaches have been proposed. Li et al. propose a probabilistic Kernel Fisher method that optimizes the projection direction of noisy data [9]. They achieve better classification performance on noisy datasets when compared to the original classification methods they build upon. A drawback is the computational complexity of this method, which prevents it from being used with large datasets. Anyfantis et al. look specifically at the presence of label noise in imbalanced datasets [10] and note a detrimental effect on classification performance. However, they only perform experiments with equal levels of noise in the minority and majority class, and their results are therefore not representative for intelligence data. Weiss does take imbalanced datasets into account and notes that relatively high levels of noise are required for a classification method to become error prone [6]. Although experimenting on imbalanced data, no mention is made of classification results if the noise occurs in only a single class.

A more extensive study into noisy imbalanced data has been presented by van Hulse and Khoshgoftaar [1]. They focus solely on the effect of label noise on classification of imbalanced datasets and experiment with a broad range of modern-day classification methods and sampling techniques. Next to that, the effects of different levels of label noise occurring in different ratios among the classes have been observed. They show that noise in the majority class has a substantially less detrimental effect on classification performance than noise in the minority class labels, especially when the total amount of noise is low. They have tested several different sampling methods and show that the choice of a sampling method is only relevant in high-noise or minority-class noise settings.

In the scenario most important to intelligence datasets, with low amounts of noise only in the majority class, there is no significant difference between sampling methods. One of the simplest methods, random under-sampling, yields a competitive performance while maintaining low computational costs.

A few issues remain unresolved by van Hulse and Khoshgoftaar. Although they experiment with different noise distributions for both classes, they simultaneously alter the size and balance of the training data due to the creation of the noise. In real-world noisy datasets, the amount of noise has no relation to the number of available positive examples for training or the balance between positive and negative examples, and should therefore be left out of the equation. More difficulties arise when relating their results to our type of data, as the minimum amount of noise they have used in their experiments is 10%. As indicated, the relative frequency of a single type of crime occurring in comparison to that crime not occurring is low. The amount of noise is therefore much lower when labeling unknown samples as normal behavior in intelligence data. Next to that, there is less imbalance in their experiments.

2.3.2 Attribute Noise and Missing Values

Attribute noise occurs when a single attribute has erroneous, incomplete or missing values [8], often caused by data corruption or measurement errors [11]. Zhu and Wu find attribute noise to have a detrimental effect on classification performance in terms of accuracy, whether it occurs in the training set, the test set or both. This effect is found to be linear in the level of noise.

A frequently occurring difficulty related to attribute noise is missing values. Schafer and Graham note that although missing values are most often a nuisance and not the focus of inquiry, they can create quite some difficulty, as most data analysis procedures were not designed to incorporate them and will fail to operate in the presence of missing attribute values [12]. Neglecting them is therefore not an option, and they should be treated with care. For any method that deals with missing data it is important to observe if the fact that a value is missing is related to the data. Rubin describes three different types of missing data with regard to the underlying process that caused their absence [13]:

Missing Completely At Random (MCAR) occurs when the probability that a feature value is missing does not depend on the data.

Missing At Random (MAR) occurs when the probability that a feature value is missing depends on the values of the observed features, but not on the missing feature value.

Missing Not At Random (MNAR) occurs when the probability that a feature value is missing depends on the value of the missing feature.

For a more thorough explanation of the individual differences between these types we refer to Schafer and Graham [12]. As missing data in intelligence datasets can have many causes, it is important to pinpoint the extent to which data is missing and whether elaborate methods are required to repair the missing values. Two simple methods for dealing with missing data are case deletion and single imputation.

Case deletion simply involves throwing away all samples that contain missing attribute values. This method is only valid if the data is MCAR, and is only feasible if enough samples are available and the frequency of missing values is low. As intelligence datasets usually already have a small number of positive examples, case deletion is not always the best choice.

Single imputation using the expectation maximization (EM) algorithm is described by Schneider [14]. If the number of samples in X is larger than the number of features, |X| > D(X), the EM algorithm can be used to compute the maximum likelihood estimates of the mean and covariance of the data. Missing values are imputed with their conditional expectation values given the available values and the estimated mean and covariance [14]. Single imputation is again only possible if the missing values are MCAR. Simple tricks such as substituting missing values with zeros or the mean of the population should be avoided, as they distort estimated variances and correlations.
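As an illustration of the imputation step described above, the following is a simplified sketch (not from the thesis) that fills missing values with their conditional expectations under a Gaussian model estimated from the data. Iterating the mean/covariance estimation and the imputation is a simplification of the full EM algorithm of Schneider [14], which also corrects the covariance for the uncertainty of the imputed values; NumPy and all names are illustrative assumptions.

```python
import numpy as np

def impute_conditional_expectation(X, n_iter=10):
    """Fill NaN entries with E[x_missing | x_observed] under a Gaussian model."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # start from column-mean imputation
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
    return X
```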

Chapter 3 Discovering Criminal Behavior

The goal of an intelligence based investigation is to discover criminal behavior. Ranking methods serve as tools to expose undiscovered criminal behavior from these datasets. But as a criminal investigation into individuals is invasive and requires resources, care should be taken in evaluating methods. We will describe several ranking methods and their applicability to ranking intelligence data.

3.1 Ranking

In an optimal ranking we observe

∀x+ ∈ X+, ∀x− ∈ X− : rank(x+) < rank(x−),    (3.1)

with rank(x) representing the position of x in a ranked list. A ranking of intelligence data can be obtained in several ways. When using a classification method C for obtaining a ranking, we define the problem as finding a function c : X → R that maps samples x ∈ X with input features (x_1, ..., x_n), n = D(X), to ranking scores c(x) ∈ R. The result is a ranking of samples Y based on their ranking scores. Other methods directly optimize a ranking of samples and are defined as finding a function h : X → Y that ranks a dataset X into a ranking Y that orders samples according to their likelihood of representing criminal behavior.

To assess the extent to which a ranking method is able to reach the goal of an intelligence based investigation, an optimal ranking, a suitable performance measure is required that is independent of imbalance and noise. This performance measure should represent the correctness of the ranking.

Chapter 3. Discovering Criminal Behavior x X + x X ŷ = y True Positive True Negative (TP) (TN) ŷ y False Negative False Positive (FN) (FP) Table 3.1: Confusion Matrix of Ranked Results Based on a Decision Point correctness of the ranking. At some point, performance measures for ranking methods require the computation of values from the confusion matrix (or contingency table) 3.1. Such a performance measure indicates one or several decision points dp on the ranking scores c for which we can calculate the predicted label ŷ as follows 1 if c < dp ŷ = 1 if c dp. (3.2) Most classification techniques optimize for some performance measure related to the generalization performance on the training set, such as the accuracy. But problems arise as a direct consequence of the imbalance inherent in the dataset. As Kubat and Matwin point out [15], the performance of a classifier C operating on an imbalanced classification problem can not be expressed in terms of accuracy. In heavily imbalanced datasets, the error on the majority class will outweigh any errors made on the minority class. The accuracy of a classifier C operating on a dataset X is defined as the fraction of the total amount of samples that is classified correctly. Accuracy = TP + TN X (3.3) Accuracy is also less well suited as a performance measure of a ranking method as it requires a fixed decision point. For ranked results, a performance measure is required that integrates over all possible decision points. A suitable performance measure should reflect the extent to which a ranking method is able to reach the goal of an intelligence based investigation. Examples of criminal behavior should be ranked high, resulting in a high probability of encountering criminal behavior at the top of the ranking. A single valued measure that encapsulates this goal is required to compare ranking methods. Two graphical representations of performance measures that operate on a ranking method help to obtain a better understanding and deduct such a measure. The ROC curve shows the relation between the true positive rate and the false positive rate and the precision-recall curve shows the relation between precision and recall. Netherlands Forensic Institute 11 University of Amsterdam

3.1.1 ROC Curve

In a two-class ranking scenario, the receiver operator characteristic (ROC) curve represents the relationship between the true positive rate (TPR) and the false positive rate (FPR) at any value of the decision point of the ranking method [16]. The TPR and FPR are defined as

False Positive Rate (FPR) = FP / (FP + TN)    (3.4)

True Positive Rate (TPR) = TP / (TP + FN)    (3.5)

It is often displayed by plotting the true positive rate against the false positive rate. Figure 3.1 shows an example of an ROC curve.

[Figure 3.1: An Example ROC Curve.]

It represents the trade-off between correctly recognized samples of criminal behavior versus samples of normal behavior that are recognized as criminal behavior. It provides a graphical representation of the overall performance of a ranking method, while compensating for imbalanced data by showing the results as a ratio. A measure derived from the ROC curve, the area under the curve (AUC), is defined as the integral over the whole length of the ROC curve. It represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. In other words, it represents the fraction of correctly ranked pairs:

AUC(C) = 1 / (|X+| · |X−|) · Σ_{i=1}^{|X+|} Σ_{j=1}^{|X−|} δ(x+_i, x−_j),    (3.6)

where

δ(x+_i, x−_j) = 0 if rank(x+_i) > rank(x−_j);  1 if rank(x+_i) < rank(x−_j).    (3.7)

By utilizing the area under the curve (AUC) as a performance statistic, a single real-valued number is retrieved that represents the performance of the ranking method while taking the imbalance into account [17]. Ranking methods exist that directly optimize the AUC.

The AUC is a performance measure that handles imbalanced data, but its usefulness as a performance measure in ranking intelligence data is debatable. Although the AUC gives a good indication of overall accuracy while taking imbalance into account, a single example proves its shortcoming on intelligence data [18]. Figure 3.2(a) shows two hypothetical ROC curves A and B with the same AUC. Although these ROC curves have the same AUC value, one result is clearly preferred from an intelligence perspective. ROC curve A starts in the origin and increases linearly. This means that with imbalanced data, for every true positive example we retrieve, a much larger number of false positives comes along and clutters the result. Curve B, however, starts at a true positive rate of 50%. This means that half of all the criminal behavior has been ranked at the top of the list, before any example of normal behavior. The other half is more difficult to find, resulting in a less steep curve and the same AUC as curve A. In an intelligence based investigation, this results in a very high probability of success when investigating the highest ranked samples. Clearly the ranking method resulting in curve B is preferred.

As intelligence data is often heavily imbalanced, a small increase in the false positive rate means a large increase in the number of false positives. Therefore, instead of a good performance over the whole length of the ROC curve, it is often much more interesting to have a good performance on a small subset. If there is a high probability of crime within the first 100 or 1000 samples of the ranking result, a criminal investigation into all of these probable perpetrators would become feasible. The presence of such a subset is hidden in the leftmost part of the ROC curve, where false positive rates are low. A high starting point indicates that such a subset exists. But to compare the performance of individual models, a quantifiable performance measure is required that captures the presence of such a subset in a single value. Certain performance measures used in document ranking systems and information retrieval systems achieve this goal by placing more importance on higher ranked correct results than on lower ranked correct results.

Chapter 3. Discovering Criminal Behavior 1 0.9 0.8 A B 1 0.9 0.8 B targets accepted (TPr) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Precision 0.7 0.6 0.5 0.4 0.3 0.2 0.1 A 0 0 0.2 0.4 0.6 0.8 1 outliers accepted (FPr) (a) ROC Curve 0 0 0.2 0.4 0.6 0.8 1 Recall (b) Precision-Recall Curve Figure 3.2: Comparison Between ROC and Precision-Recall Curves. systems and information retrieval systems achieve this goal by placing more importance to higher ranked correct results than on lower ranked correct results. 3.1.2 Precision-Recall Curve Precision and recall are measures from the field of information retrieval and are defined at a single decision point as Precision = TP TP + FP (3.8) Recall = TP TP + FN (3.9) Information retrieval is concerned with systems returning specific query results as a ranked list in which higher ranked positive results are valued more [19]. Take for example an Internet search engine. End-users will most likely only look at the first few pages of a search result. Returning a positive hit at page 100 is not very valuable. The main goal of these systems is not to return all possible positive results with as little as possible false positives, but to return the best results at first with as little as possible false positives in the first few pages. The search engine aims at a high precision at the first n results. An intelligence based investigation has the same goal. Although individual search results from a search engine also differ in their relevance score while there is no difference between individual examples of criminal behavior, in both scenarios there is a clear demand for a high precision at a small subset of the population. Netherlands Forensic Institute 14 University of Amsterdam

Precision is defined by values at a single operating point. It is often more intuitively defined as precision@n, where n represents the operating point and stands for the first n returned results. TP + FP is set equal to n, and the precision@n is calculated as in equation 3.8. Recall@n is similarly defined, with TP + FP again equal to n, and then calculated as in equation 3.9.

If we plot the precision at all possible recall values, we obtain the precision-recall curve. This curve gives an intuitive representation of the level of compliance with the goal we are after. The precision-recall curves shown in figure 3.2(b) originate from the same rankings as the ROC curves shown in figure 3.2(a).

As indicated, the research into possible perpetrators is costly, both in resources and in impact, as the capacity available for individual investigations is limited and such an investigation is privacy-invasive. Evaluating the precision at all recall values gives insight into the prevalence of a subset with high precision. If we set n to correspond to the available capacity, a ranking method can be optimized for precision@n to obtain the most desirable results. The capacity is often low but can be quite variable in different investigations. The average precision is therefore used as a generalization of this performance measure and serves as a good indication of whether methods are suitable for use in an intelligence based investigation [20]. Average precision (AP) is defined in equation 3.10 as the average of the precision values over all recall values:

AP = 1 / |X+| · Σ_{i=1}^{|X|} Precision@i · β_i,    (3.10)

where

β_i = 0 if x_i ∈ X−;  1 if x_i ∈ X+.    (3.11)

In this equation, X is the ranked result of a classification method C, with x_1 being the highest ranked result, having the highest value for c(x). The AP is equal to the area under the precision-recall curve. Figure 3.2(b) shows that curve B has a higher AP and is therefore preferred in an intelligence based investigation. It is proven that neither optimal performance in terms of accuracy nor optimal performance in terms of AUC can guarantee optimal performance in terms of average precision [18].
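As an illustration of equations 3.10 and 3.11, the following minimal sketch (not from the thesis) computes the average precision of a ranking directly from scores and labels; NumPy and the names are assumptions.

```python
import numpy as np

def average_precision(scores, y):
    """Average precision of a ranking: mean of precision@i over the
    positions i at which a positive sample is retrieved (eqs. 3.10-3.11)."""
    order = np.argsort(-scores)              # highest score ranked first
    hits = (y[order] == 1).astype(float)     # beta_i in eq. 3.11
    precision_at_i = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return (precision_at_i * hits).sum() / hits.sum()

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1])
y      = np.array([  1,  -1,   1,  -1,  -1])
print(average_precision(scores, y))          # (1/1 + 2/3) / 2 = 0.833...
```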

3.2 Ranking with Two-Class Classification Methods

In an intelligence based investigation, we only distinguish between normal and criminal behavior. This allows the use of two-class classification methods for ranking results and the use of performance measures that are suitable for two-class methods. To be able to rank results, only classification methods can be used that return some form of confidence score instead of only a predicted class label. Different classification methods implement this differently, according to the underlying mechanisms of the method. This results in different scales being used by different methods for the ranking of results. Performance measures comparing these methods should therefore do so based only on the order of the ranked results and not on the individual confidence scores of the ranked results.

Several classification methods exist that have the ability to return a confidence score in a two-class scenario. Generative classification methods are distribution estimators. Given data points from a single class X^tr_ℓ with ℓ ∈ {+, −}, this group of classification methods tries to find the parameters θ_ℓ of a model that maximize the probability of observing the data, given the model. Which model is used depends on the classification method. The parameters of the model can be estimated by maximum likelihood estimation as in equation 3.12 [21],

θ_ℓ,ML = argmax_{θ_ℓ} p(X^tr_ℓ | θ_ℓ),    (3.12)

or by maximum a posteriori estimation as in equation 3.13,

θ_ℓ,MAP = argmax_{θ_ℓ} p(X^tr_ℓ | θ_ℓ) p(θ_ℓ).    (3.13)

These classification methods rank data points based on the likelihood ratio given in equation 3.14:

Λ(x) = p(x | θ+) / p(x | θ−)    (3.14)

Discriminative classification methods are boundary estimators. They approximate a boundary in the multidimensional feature hyperspace. Data points are labeled according to which side of the boundary they lie on; the distance to the boundary is assigned as the confidence score of the classification, and the side of the boundary determines the sign of the confidence score.
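As a concrete instance of generative ranking with the likelihood ratio of equation 3.14, the sketch below (not from the thesis) fits one Gaussian per class by maximum likelihood and ranks test samples by the log of the ratio; the use of SciPy and the function names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_scores(X_train_pos, X_train_neg, X_test):
    """Rank test samples by log p(x | theta+) - log p(x | theta-),
    the log of the likelihood ratio of eq. 3.14, with each class modeled
    as a Gaussian fitted by maximum likelihood."""
    mu_pos, cov_pos = X_train_pos.mean(axis=0), np.cov(X_train_pos, rowvar=False)
    mu_neg, cov_neg = X_train_neg.mean(axis=0), np.cov(X_train_neg, rowvar=False)
    log_pos = multivariate_normal.logpdf(X_test, mean=mu_pos, cov=cov_pos)
    log_neg = multivariate_normal.logpdf(X_test, mean=mu_neg, cov=cov_neg)
    return log_pos - log_neg   # higher = more likely criminal behavior
```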

In the following few subsections, several methods are described in detail and their applicability to ranking intelligence data is discussed.

3.2.1 k-Nearest Neighbors

k-Nearest Neighbors (k-NN) is an intuitively very simple classification method. It assigns to a new data point the class label that is most common among its k nearest neighbors in the training data. Intuitively this is a strong point: if an entity shows behavior very similar to that of known criminals, we would expect a definite increase in the probability of observing criminal behavior with this entity as well. Confidence scores can be determined in k-NN ranking by the ratio of occurrence of both classes within the k nearest neighbors. In 1-NN, the confidence scores are determined by the difference in distance between the single nearest neighbor from each class. What value for k is optimal depends on several properties of the data. An example of k-NN classification for k = 1, 5 is shown in figure 3.3.

[Figure 3.3: An Example of k-NN Classification.]

When measuring how near two data points are, a dissimilarity function d(x, x') is used. Often the squared Euclidean distance is used, represented in equation 3.15:

d(x, x') = (x − x')^T (x − x')    (3.15)
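The following is a minimal sketch (not from the thesis) of the k-NN ranking score described above: the fraction of positive samples among the k nearest training neighbors under the squared Euclidean distance of equation 3.15; NumPy and the names are assumptions.

```python
import numpy as np

def knn_scores(X_train, y_train, X_test, k=5):
    """Confidence score for each test sample: the fraction of positive
    (criminal) samples among its k nearest training neighbors."""
    scores = np.empty(len(X_test))
    for i, x in enumerate(X_test):
        d = ((X_train - x) ** 2).sum(axis=1)    # squared Euclidean distance
        nearest = np.argsort(d)[:k]
        scores[i] = (y_train[nearest] == 1).mean()
    return scores
```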

But as the squared Euclidean distance does not take differences in feature ranges into account, sometimes the Mahalanobis distance is used, as represented in equation 3.16:

d(x, x') = (x − x')^T Σ^{−1} (x − x'),    (3.16)

where Σ represents the covariance matrix of the training data.

As the method does not try to estimate underlying data distributions, it is very flexible in representing complex non-linear boundaries in the underlying feature space. With small sample sizes, it is however also prone to overfitting, as it relies heavily on the individual training data points. A few outliers or wrongly labeled samples can easily distort the classification boundaries. This effect is most notable in 1-NN, as can be seen in figure 3.3. Imbalanced training data will also bias the classifier towards the most common class, as these samples will then have a higher probability of occurring within the nearest k neighbors, just because they are more numerous. Although the algorithm is intuitively simple to understand, it is computationally expensive, as it requires keeping all training data in memory and compares new data points to every training data point at the time of classification. The applicability of k-NN to ranking intelligence data is unclear. Although the classification method shows insufficient ability to cope with properties common in intelligence data, label noise and imbalance, it does show flexibility in representing complex boundaries.

3.2.2 Logistic Regression

In a logistic regression model, the sigmoid function is used to represent the probability that a new data point belongs to either class. For example, for a data point x ∈ X^te we can calculate

p(x ∈ X+ | x) = σ(b + x^T w).    (3.17)

In this equation, as b + x^T w increases, so does the probability that x belongs to class X+. The (D(X) − 1)-dimensional decision boundary is defined as the hyperplane b + x^T w = 0 in the feature space and is represented by weights w that determine the orientation of the hyperplane and a scalar b representing the offset of the hyperplane [21]. The logistic sigmoid function σ(a) is defined as

σ(a) = 1 / (1 + e^{−a}).    (3.18)

When learning a logistic regression model, the parameters b and w are estimated by maximum likelihood estimation using gradient ascent [21].
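To illustrate, here is a minimal sketch (not from the thesis) that estimates w and b by gradient ascent on the log-likelihood and uses σ(b + xᵀw) from equation 3.17 as the ranking score; the learning rate, iteration count and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Maximum likelihood estimation of (w, b) by gradient ascent on the
    log-likelihood; y uses the thesis convention +1 / -1."""
    t = (y == 1).astype(float)          # 1 for criminal, 0 for normal
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (t - p)          # gradient of the log-likelihood
        grad_b = (t - p).sum()
        w += lr * grad_w / len(X)
        b += lr * grad_b / len(X)
    return w, b

def logistic_scores(X_test, w, b):
    """Ranking score: estimated probability of criminal behavior (eq. 3.17)."""
    return sigmoid(X_test @ w + b)
```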

3.2.3 Fisher's Linear Discriminant

The concept behind Fisher's Linear Discriminant (FLD) is that the data is projected onto a single dimension, while minimizing the within-class variance and maximizing the between-class variance [3] [21]. Suppose we have two classes X+ ⊆ X^tr and X− ⊆ X^tr with sizes N+ and N− respectively. We make the assumption that the classes are each described by a high-dimensional Gaussian distribution. The mean of each class X^ℓ with ℓ ∈ {+, −} is estimated using maximum likelihood by

µ^ℓ = 1/N^ℓ · Σ_{x ∈ X^ℓ} x,    (3.19)

and the covariance matrix of each class is estimated as

Σ^ℓ = 1/N^ℓ · Σ_{x ∈ X^ℓ} (x − µ^ℓ)(x − µ^ℓ)^T.    (3.20)

We can now define the between-class covariance matrix as

Σ_B = (µ^− − µ^+)(µ^− − µ^+)^T    (3.21)

and the within-class covariance matrix as

Σ_W = Σ_{x ∈ X+} (x − µ^+)(x − µ^+)^T + Σ_{x ∈ X−} (x − µ^−)(x − µ^−)^T.    (3.22)

The Fisher criterion is defined as the ratio of the between-class variance and the within-class variance and can be written as

J(w) = (w^T Σ_B w) / (w^T Σ_W w).    (3.23)

Here, the vector w with D(w) = D(X) represents the linear projection vector that projects data points x onto a single dimension z.

z = w^T x    (3.24)

We can find the maximum of J(w) by differentiating 3.23 with respect to w, and find w to be proportional to

w ∝ Σ_W^{−1} (µ^− − µ^+).    (3.25)

We finally project all samples onto their corresponding z values using 3.24. The values for z obtained by this projection serve as confidence scores on which to base a ranking. The derivation shown here requires Σ_W to be invertible. If this is not the case, for example when the dimensionality of the data is larger than the number of samples in a class, a more complex derivation is required. This is beyond the scope of this thesis, but we refer the interested reader to [21] for a more thorough explanation.

FLD is a very fast method as it is computationally inexpensive. It is also robust against noise and imbalance, as it only regards the means and covariances of the data points. However, it will only prove strong in classifying data that is close to linearly separable. It is unclear why intelligence data should exhibit that property, but given that FLD handles noise and imbalance well, we should consider its applicability to ranking intelligence data.

3.2.4 Quadratic Discriminant

The Quadratic Discriminant Classifier (QDC) assumes a Gaussian distribution for each class and estimates the individual class means and covariances by maximum likelihood estimation as in equations 3.19 and 3.20. When given new examples from the test set, it calculates the ratio of the posterior probabilities of the data points x belonging to either class using a Bayesian framework:

p(x+ | x) / p(x− | x) = (p(x | X+) / p(x | X−)) · (p(x+) / p(x−)).    (3.26)

The class-conditional densities for either class ℓ ∈ {+, −} are calculated according to [3] as

p(x | X^ℓ) = 1 / ((2π)^{D/2} |Σ^ℓ|^{1/2}) · exp{ −1/2 · (x − µ^ℓ)^T (Σ^ℓ)^{−1} (x − µ^ℓ) }.    (3.27)

A ranking of samples is obtained by ordering the posterior probability ratios of the individual samples. The power of QDC is its ability to represent non-linear decision boundaries. Compensating for the prior odds also deals with imbalanced data. But keeping Occam's Razor [22] in mind, we know that a more complex data description is not always better. If the data is sparse, a non-linear approximation easily overfits the training data. Intelligence data is often sparse in the positive class, as there are few examples of criminal behavior but often a lot of descriptive features. The applicability of QDC to ranking intelligence data is therefore unclear.
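For the FLD ranking derived in section 3.2.3, the following minimal sketch (not from the thesis) computes the projection direction of equation 3.25 and ranks samples by z = wᵀx; NumPy and the names are assumptions, and the sign of w is chosen here so that positive samples receive higher scores.

```python
import numpy as np

def fld_scores(X_train, y_train, X_test):
    """Fisher's Linear Discriminant as a ranking method: project samples
    onto w (eq. 3.25) and use z = w^T x as the confidence score.
    Eq. 3.25 fixes w only up to scale; the sign here makes positive
    (criminal) samples project to higher scores."""
    X_pos, X_neg = X_train[y_train == 1], X_train[y_train == -1]
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # within-class scatter (eq. 3.22)
    S_w = ((X_pos - mu_pos).T @ (X_pos - mu_pos)
           + (X_neg - mu_neg).T @ (X_neg - mu_neg))
    w = np.linalg.solve(S_w, mu_pos - mu_neg)
    return X_test @ w
```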

A ranking of samples is obtained by ordering the posterior probability ratios of the individual samples.

The power of QDC is its ability to represent non-linear decision boundaries. Compensating for prior odds also deals with imbalanced data. But keeping Occam's Razor [22] in mind, we know that a more complex data description is not always better. If the data is sparse, a non-linear approximation easily overfits the training data. Intelligence data is often sparse in the positive class, as there are few examples of criminal behavior but often a lot of descriptive features. The applicability of QDC to ranking intelligence data is therefore unclear.

3.2.5 Support Vector Machines

The Support Vector Machine (SVM) is a classification method that maximizes the margin between the classes, first described by Vapnik [23] and later generalized by Mangasarian [24] [25]. A practical tutorial on its application is written by Hsu et al. [26]. It is intuitive that maximizing the margin between classes increases generalization performance. Figure 3.4(a) shows data points from two classes separated by a decision boundary that maximizes the margin.

Figure 3.4: Maximum Margin Classification. (a) Maximizing the Margin. (b) With Slack Variables. (Figures taken from Barber [21].)

The data points that lie closest to the decision boundary are the support vectors and determine the location of the decision boundary. We model the decision boundary as

w^T φ(x) + b = 0,    (3.28)

where φ(x) denotes a fixed feature-space transformation [3] and w and b are parameters of a linear decision surface in the transformed space. To maximize the margin, we minimize

C ∑_{n=1}^{N} ξ_n + ½ ||w||²,    (3.29)

where ξ_n are slack variables that represent the penalty for misclassified samples (ξ > 1) and samples that end up inside the margin (0 < ξ ≤ 1). Figure 3.4(b) shows maximum margin classification using slack variables. C > 0 represents a tunable parameter that controls the trade-off between misclassification penalty and the size of the margin. We now set the constraints

∀ x_n : y_n (w^T φ(x_n) + b) ≥ 1 - ξ_n,  n = 1, ..., N,  ξ_n ≥ 0,    (3.30)

with the true class label y_n defined as

y_n = -1 if x_n ∈ X^-,  +1 if x_n ∈ X^+.    (3.31)

Finding the optimal margin comes down to minimizing 3.29 subject to the constraints 3.30. This is an example of a quadratic programming problem. Samples can subsequently be ranked according to their distance to the optimal margin, with the sample that is farthest away from the margin on the positive side ranked first and the sample that is farthest away on the negative side ranked last.

SVMs have shown consistently good performance over many different types of data. There is no reason to believe this is not the case with intelligence data.

3.3 Ranking with One-Class Classification Methods

The applicability of one-class classifiers is described by Tax et al. [7] [27] [28] and Juszczak [29]. Methods for one-class classification model only a single class and are therefore applicable in scenarios where only a single class is well defined. One-class classification methods have proven to perform well on datasets from other fields of study where imbalanced, noisy data is common, for example on interstitial disease detection in chest radiography [30], seizure analysis from intracranial EEG [31], intrusion detection [32] and forensic case data [33].

Intelligence data shows characteristics that signify the applicability of one-class ranking methods. It is logical that modeling only the class of criminal behavior deals with the problems of label noise and imbalance. However, throwing away all information in the negative class might also have a detrimental effect on ranking performance. A careful performance evaluation of a selection of possibly useful one-class methods should show whether they perform competitively with the previously mentioned methods.

3.3.1 Gaussian Data Description

The Gaussian Data Description (GDD) is one of the simplest methods available. It estimates the mean µ and covariance matrix Σ of the target class by maximum likelihood. New samples are simply ranked according to their posterior probability given the Gaussian model.

3.3.2 Support Vector Data Description

The Support Vector Data Description (SVDD) is a one-class SVM variant described by Tax [34]. It represents a class by the minimal-volume hypersphere around the data, parameterized by its center c and radius R. Optimizing the parameters for this sphere is similar to the optimization of an SVM. We minimize

C ∑_{n=1}^{N} ξ_n + R²    (3.32)

constrained by

∀ n : ||x_n - c||² ≤ R² + ξ_n,  n = 1, ..., N,  ξ_n ≥ 0.    (3.33)

This is again solved using quadratic programming. Samples can again be ranked according to their distance to the optimal margin.

3.4 Ranking with Ranking Methods

We have seen in the previous sections that most methods minimize the misclassification rate of training examples, which results in a performance that is optimal in terms of accuracy. An intelligence based investigation is not aimed at classifying everything right, but at finding a subset of entities that have a high probability of exhibiting criminal behavior.

This underpins the need for rank-based optimization criteria for use in powerful ranking methods like SVMs. It has to be noted that accuracy is closely linked to other performance measures, and optimizing for accuracy might also yield optimal performance on other performance measures.

3.4.1 SVM-Rank

An SVM method that optimizes for ranking is described by Joachims [35]. The concept behind SVM-Rank is based on an optimization problem similar to that of normal SVMs, which optimize the number of individual data points that end up on the correct side of the decision boundary as in equation 3.29. SVM-Rank however optimizes the number of pairs of data points (x_n, x_m), with n, m = 1, ..., N, that are ranked correctly. To this end we introduce extra slack variables ξ_{m,n} and rewrite the optimization problem as minimizing

C ∑ ξ_{m,n} + ½ ||w||²,    (3.34)

constrained by

∀ (x_m, x_n) : w^T (φ(x_m) - φ(x_n)) + b ≥ 1 - ξ_{m,n},  m, n = 1, ..., N,  ξ_{m,n} ≥ 0,    (3.35)

which can again be solved using quadratic programming. As this method directly optimizes a ranking of results, its applicability to intelligence data is unquestionable. It is however unclear how this method deals with noise and imbalance.

3.4.2 SVM-Precision

Joachims describes an SVM method for optimizing non-linear performance measures derived from the confusion matrix shown in table 3.1, named SVM-Multi [36]. It builds upon the approach of normal SVMs, but instead of optimizing for each individual data point, or for pairs of data points as SVM-Rank does, it uses a multivariate prediction rule and optimizes for all data points at the same time, which allows the use of multivariate loss functions. The optimization problem is formulated as minimizing

Cξ + ½ ||w||²,    (3.36)

constrained by

∀ t̄' ∈ T̄ \ {t̄} : w^T (Ψ(x̄, t̄) - Ψ(x̄, t̄')) ≥ Δ(t̄', t̄) - ξ.    (3.37)

In these constraints, x̄ ∈ X̄ represents a tuple of n feature vectors x̄ = (x_1, ..., x_n) and t̄ ∈ T̄ represents a tuple of n labels t̄ = (t_1, ..., t_n) with T̄ ⊆ {-1, +1}^n. The function Ψ describes the match between x̄ and t̄ and is calculated here as

Ψ(x̄, t̄) = ∑_{i=1}^{n} t_i x_i.    (3.38)

The loss function Δ represents the non-linear multivariate loss function that is derived from the confusion matrix. SVM-Precision is a special case of SVM-Multi, where

Δ(t̄', t̄) = TP / (TP + FP).    (3.39)

To optimize for precision@n, we require TP + FP = n and only include t̄' ≠ t̄ for which this requirement holds. This method directly optimizes an SVM ranking method for a specific precision value, which is very close to the goal of an individual intelligence based investigation. If the capacity of investigative resources is known, a choice can be made for n and the ranking can be optimized for the current capacity.

3.4.3 SVM-ROCArea

The method SVM-ROCArea optimizes the AUC of the ranking by minimizing the number of swapped pairs of a positive and a negative example. For a detailed description of its implementation built upon SVM-Multi, we refer to Joachims [36]. Although optimizing the AUC is not a direct goal of an intelligence based investigation, it is a measure that is invariant to imbalance and widely used. Optimizing the AUC directly is therefore interesting even for intelligence data.

3.4.4 SVM-MAP

Yue et al. describe an SVM method for optimizing MAP [37] that is based on structural SVMs as described by Tsochantaridis [38] and built upon SVM-ROCArea [36]. (In the literature the term mean average precision is defined as the mean of several average precision values for different queries. But in an intelligence based investigation only a single query is proposed, and therefore the mean average precision equals the average precision.)

Equivalent to SVM-Multi, SVM-MAP optimizes equation 3.36 constrained by 3.37. The combined feature representation Ψ is now calculated as

Ψ(x̄, t̄) = (1 / (|X^+| |X^-|)) ∑_{x^+ ∈ X^+} ∑_{x^- ∈ X^-} [ t_± (φ(x̄, x^+) - φ(x̄, x^-)) ],    (3.40)

with

t_± = +1 if rank(x^+) < rank(x^-),  -1 if rank(x^+) > rank(x^-).    (3.41)

The algorithm described in [37] iteratively optimizes MAP by finding the most violated constraint. Average Precision (AP) performance can be seen as a generalization of the performance in terms of precision@n for different values of n. Optimizing for precision@n is very close to the goal of an intelligence based investigation with a known investigative capacity of n. When a concrete number for n is not available, or as a measure to compare different methods on their general performance on different types of intelligence data, AP is a suited generalization. Optimizing for MAP performance (equivalent to AP on a single query) therefore seems ideal for discovering criminal behavior in intelligence datasets.

3.5 Performance Estimation

If a large amount of data is available and every class is well represented, a simple hold-out method is sufficient to provide an accurate performance estimate. The data is split into a separate training and test set only once, the model is learned on the training set and the performance measure is estimated on the test set. This simple approach is almost never suited for intelligence data though, as the number of examples of criminal behavior is always low.

If the available amount of data is limited, the estimation of any performance measure is often based on the average performance over several draws of the available data. Different drawing methods exist, but two aspects of such an estimate are important to consider for any drawing method. The difference between the average performance and the expected performance is the bias of the performance estimate. The variance is the extent to which individual performance estimates differ from each other. There is a trade-off between bias and variance: often, methods that exhibit a low bias show a high variance and vice versa.
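The rank-based measures that recur throughout this chapter, precision@n and (average) precision, are easily computed from any ranking before such an estimation procedure is applied. A minimal sketch, assuming 0/1 NumPy label arrays (an illustration, not the PRTools routines used in the experiments):

```python
import numpy as np

def precision_at_n(scores, labels, n):
    # Fraction of positives among the n highest-ranked samples.
    order = np.argsort(-scores)
    return labels[order[:n]].mean()

def average_precision(scores, labels):
    # Single-query AP: the mean of precision@k over the ranks k at which
    # a positive sample is retrieved.
    ranked = labels[np.argsort(-scores)]
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    return np.sum((hits / ranks) * ranked) / ranked.sum()
```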

In general, k-fold cross-validation is used to make optimal use of the available data in estimating a performance measure without a high bias or variance in the estimate. All available data is split into k equal folds. If stratification is applied, the original class frequency is maintained in each fold. Subsequently, in k runs, each fold is held out as a test set once, with the rest of the data serving as training data. If classification methods require parameter tuning, the same process is repeated n times within each fold: the training data is then again split into n equal folds, with each fold acting as a validation set once and the rest as training data for the parameter tuning. Once the optimal parameter has been found, all data used during validation is then used as training data for learning the model. Kohavi shows that in general 10-fold stratified cross-validation results in the best estimate in terms of both bias and variance [39].

Airola et al. show that 10-fold stratified cross-validation exhibits large variance in small-sample studies when estimating the AUC [40]. They propose the use of leave-pair-out cross-validation (LPOCV) as an AUC estimator with slightly lower variance, but at a much higher computational cost. Pahikkala et al. show a computationally efficient method for performing LPOCV for a regularized least squares algorithm [41], but it is not applicable to other classification algorithms.

Isaksson et al. also note that 10-fold cross-validation is unreliable for small sample sizes [42]. The variance between individual estimates is too large to draw any valid conclusion based on the cross-validation estimate of the performance. They show that for large sample sizes the estimate becomes more reliable, yet for sample sizes in the hundreds there is currently no way to bound the uncertainty of the estimates, and the large variance is as good as it gets. It is however still the most used validation method, and without making any statements about the significance of obtained results, it still gives a good indication of how different methods compare to each other based on some performance measure.
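As a reference for the estimation procedure used in the experiments later on, a minimal sketch of stratified k-fold cross-validation (NumPy; the thesis experiments use the PRTools implementation). The fit and score arguments are assumed callables supplied by the caller, for example an FLD trainer and the average_precision helper sketched earlier.

```python
import numpy as np

def stratified_folds(labels, k, rng):
    # Assign every sample to one of k folds while preserving the class
    # frequencies of the 0/1 label vector in each fold.
    folds = np.empty(len(labels), dtype=int)
    for cls in (0, 1):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % k
    return folds

def cross_validate(X, y, k, fit, score, seed=0):
    # Hold each fold out once and average the chosen performance measure.
    rng = np.random.default_rng(seed)
    folds = stratified_folds(y, k, rng)
    results = []
    for f in range(k):
        train, test = folds != f, folds == f
        model = fit(X[train], y[train])
        results.append(score(model, X[test], y[test]))
    return np.mean(results), np.var(results)
```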

Chapter 4

Data

To represent intelligence data in our experiments, we have chosen to use both artificial and real data. By using artificial data, we are able to control the parameters of the imbalance and the noise. Next to that, we know the ground truth behind the label noise, which allows us to measure its detrimental effect on measured performance. Using a real dataset next to the artificial datasets allows us to observe the performance of our methods in practice. Although a single dataset is not representative of intelligence data in general, we are still able to observe the effects of the specific properties of intelligence data on ranking performance.

4.1 Artificial Data

To artificially represent intelligence data, we chose three different representations that use quite basic Gaussian distributions but emulate the characteristics common in real intelligence data: noise and imbalance. All artificial datasets are created using the pattern recognition toolbox for Matlab, PRTools 4.1 [43].

All three datasets are built out of two Gaussian distributions. Normal behavior is modeled by the Gaussian distribution N^-(µ^-, Σ^-) and criminal behavior by N^+(µ^+, Σ^+). To emulate imbalance and noise in the artificial datasets, we represent X^- by taking N^- - φ samples from the distribution N^- and φ samples from the distribution N^+, such that |X^-| = N^-. Criminal behavior X^+ is represented by N^+ - ψ samples from N^+ and ψ samples from N^-, such that |X^+| = N^+. The parameter φ controls the amount of noise in the majority class, representing undiscovered criminal behavior, and ψ the amount of noise in the minority class. We assume that noise in the minority class usually does not occur in real intelligence data, but to gain a better understanding of the effect of label noise in imbalanced datasets, we include the scenario where noise occurs in the minority class.

The number of features, which represents the dimensionality of the dataset, is chosen to be D(X) = 10.

4.1.1 Gaussian Data

The first dataset, gauss, consists of data from a simple Gaussian distribution with slightly different means and unit covariance:

µ^- = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  µ^+ = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0),    (4.1)

Σ^± = I_10,  ± ∈ {+, -}.    (4.2)

A scatter plot of the first two dimensions of the data is shown in figure 4.1. A few things are interesting to observe here. The imbalance is immediately clear from the plot and the difficulty in estimating a decision boundary is evident. The noise, however, seems negligible.

4.1.2 Stretched Parallel Data

Another artificial dataset for use in our experiments is a stretched parallel dataset, stretch. The means of the classes are the same except for the first two features. Both classes also have unit variance except for feature 2, which has a variance of 40. The first two features are also rotated by 45 degrees. This results in

µ^- = (3, 3, 0, 0, 0, 0, 0, 0, 0, 0),  µ^+ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),    (4.3)

Σ^± = [ 20.5  19.5  0  ...  0
        19.5  20.5  0  ...  0
          0     0   1
                        ...
          0     0           1 ],  ± ∈ {+, -}.    (4.4)

A scatter plot of this dataset showing the first two features is shown in figure 4.2. Remember that all other features show the same distribution for both classes. This dataset represents a type of class distribution that is difficult to separate for some classification methods. The imbalance can also clearly be seen.

Figure 4.1: Artificial Gaussian Intelligence Dataset (scatter plot of features 1 and 2).

4.1.3 Spherical Data

The third and final artificial dataset we use is a spherical dataset, sphere, where both classes have the same means but a different variance in the first two features:

µ^± = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  ± ∈ {+, -},    (4.5)

Σ^+ = I_10,  Σ^- = diag(10, 10, 1, ..., 1).    (4.6)
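The three artificial datasets follow directly from the definitions in equations 4.1 to 4.6. The following is a minimal NumPy sketch of their generation (the experiments themselves use PRTools; the example class sizes are those used later in chapter 5, and the label noise controlled by φ and ψ is added separately as described in the previous section):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10

def draw(mu, Sigma, n):
    return rng.multivariate_normal(mu, Sigma, size=n)

# 'gauss' (equations 4.1-4.2): shifted means, unit covariance for both classes.
mu_neg = np.zeros(D)
mu_pos = np.r_[1.0, 1.0, 1.0, np.zeros(D - 3)]
gauss_neg, gauss_pos = draw(mu_neg, np.eye(D), 30_000), draw(mu_pos, np.eye(D), 100)

# 'stretch' (equations 4.3-4.4): both classes share the rotated, stretched covariance.
Sigma_s = np.eye(D)
Sigma_s[:2, :2] = [[20.5, 19.5], [19.5, 20.5]]
stretch_neg = draw(np.r_[3.0, 3.0, np.zeros(D - 2)], Sigma_s, 30_000)
stretch_pos = draw(np.zeros(D), Sigma_s, 100)

# 'sphere' (equations 4.5-4.6): equal means, larger variance on the first
# two features of the majority class.
Sigma_neg = np.eye(D)
Sigma_neg[0, 0] = Sigma_neg[1, 1] = 10.0
sphere_neg = draw(np.zeros(D), Sigma_neg, 30_000)
sphere_pos = draw(np.zeros(D), np.eye(D), 100)
```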

Figure 4.2: Artificial Stretched Parallel Intelligence Dataset (scatter plot of features 1 and 2).

Figure 4.3 shows a scatter plot of the first two features of this dataset. Due to the equal mean vectors, linear models are unable to correctly model this dataset. The imbalance is again clearly visible.

4.2 IILS Data

The real-world intelligence dataset iils we use in our experiments represents the problem of finding new examples of Illegal Irregular Living Situations (IILS) in the municipality of Rotterdam. An illegal irregular living situation occurs, for example, when more people than allowed reside at a single address. This intelligence dataset consists of 292,100 examples (addresses), of which 300 are confirmed IILS. The rest is unknown, but is expected to contain only roughly 0.5% IILS. An intelligence based investigation is aimed at discovering these examples.

For each address, several features are available.

Figure 4.3: Artificial Spherical Intelligence Dataset (scatter plot of features 1 and 2).

Some features are available for every address individually: the number of households enlisted at the address, the number of persons enlisted at the address, the number of persons over 18 enlisted at the address, the number of rooms, the total surface area, the period in which it was built, the type of building, the type of municipal land use plan, the type of owner and the level of administrative over-occupation. Next to that, many features are available at the neighborhood level, with the entire municipality partitioned into 65 neighborhoods; within a single neighborhood, the values of these features are identical for every sample.

Some features are categorical, without an ordering and with n possible predefined feature values. To be able to use these features with numeric-oriented ranking methods, they are rewritten as n separate binary features, thereby increasing the dimensionality of the dataset.

Labeling all unknown examples as normal behavior makes two-class ranking methods available for use, but at the cost of creating label noise. It is unknown whether attribute noise is present in the form of measurement errors. Missing values do occur, and as these are in all probability caused by human error, we assume they are MCAR and delete 12 positive examples along with a number of unknown examples. This leaves us with 284,645 unknown samples, which we label as normal behavior, and 288 true examples of IILS.
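The rewriting of a categorical feature into n binary features described above can be sketched as follows; the feature name and category values are hypothetical examples, not the actual IILS coding.

```python
import numpy as np

def one_hot(column, categories):
    # Rewrite one categorical feature with n predefined values as n
    # separate binary features (one column per category).
    return (np.asarray(column)[:, None] == np.asarray(categories)).astype(float)

# Hypothetical example for a 'type of owner' feature with three categories:
owner = ["corporation", "private", "municipality", "private"]
encoded = one_hot(owner, ["corporation", "private", "municipality"])
# encoded has shape (4, 3), with exactly one 1 per row.
```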

As this dataset is a typical example of an intelligence dataset, it is ideally suited to test our hypotheses.

Chapter 5

Experiments

5.1 Setup

To answer the research questions posed in section 1.1, we conduct several experiments to get an understanding of our data and to solve the subquestions. All experiments are conducted in MathWorks Matlab R2010a, using the pattern recognition toolbox for Matlab, PRTools 4.1 [43]. For the one-class ranking experiments, we make use of DDtools 1.7.3, the Data Description Toolbox for Matlab [44]. The experiments using SVM are performed with SVMlight [45]. Experiments with SVM-Rank are performed with SVMrank [35]. Experiments with SVM-P@n and SVM-ROCArea are performed with SVMperf [36]. Note that the SVM-P@n ranking methods were never tested (see http://svmlight.joachims.org/svm_perf.html), so these results should be taken with a grain of salt. We chose to include them anyway, as optimization of precision is relevant to our data. We modified SVMperf slightly to allow for values of n > N^+ by removing a hard-coded limit. Experiments with SVM-MAP are performed with SVMmap [37].

5.2 Training Imbalance

We limit the size of the negative class in the training set by performing random under-sampling. To determine how small we can make the sample size without observing a detrimental effect on performance, we construct learning curves by experimenting with several different amounts of sampling. We perform our experiments on all datasets described in the last chapter. As parameters for the artificial datasets, we set N^- = 30,000, N^+ = 100 and φ = 900. As in real intelligence data, we keep ψ = 0. We use two selected two-class discriminative classification methods to construct a ranking: Fisher's Linear Discriminant (FLD) and the Quadratic Discriminant Classifier (QDC).

The result for one-class methods is trivial, as these methods ignore the negative class altogether. Measuring the effect of sampling on the performance of the ranking methods is computationally too expensive with the available hardware, but it is deemed comparable to that of the two-class methods used.

Figure 5.1: Learning Curves for Gaussian Dataset. (a) AUC and (b) AP for FLD and QDC as a function of the number of sampled majority-class training examples (0 to 10,000).

We perform our experiments using 10-fold stratified cross-validation, and within each fold we perform random under-sampling on the majority class in the training set X_tr to cut it down to a specific size. We perform the experiment with increasing majority sample sizes from 200 up to 10,000. For each sample size we repeat the experiment 10 times. We measure the performance by calculating the average AUC and AP and measuring the variance over all of the 100 individual folds. Within each fold, the training set and test set are scaled: the mean of the training set is shifted to the origin and the total variances of all features are scaled to unit variance.

Figure 5.2: Learning Curves for Stretched Dataset. (a) AUC and (b) AP for FLD and QDC as a function of the number of sampled majority-class training examples (0 to 10,000).

To prevent information leakage, the scaling vector obtained from the training set is subsequently applied to the test set.

The learning curves resulting from this experiment are shown in figures 5.1, 5.2, 5.3 and 5.4. The first thing to note is that the negative sample size has only a marginal effect on ranking performance. Sampling only a fraction of the majority class is enough to maintain a steady performance; sampling more does not result in better performance, but does increase computational complexity. In all three artificial datasets, sampling more than the minimum tested value of 200 examples does not increase performance. On the IILS dataset, a slight detrimental effect of under-sampling is seen in both AUC and AP when sampling fewer than 2,000 examples from the majority class, and optimal performance occurs at a majority sample size of around 2,000 or more.

Figure 5.3: Learning Curves for Spherical Dataset. (a) AUC and (b) AP for FLD and QDC as a function of the number of sampled majority-class training examples (0 to 10,000).

From these results we can conclude that sampling around ten times the training size of the minority class from the majority class results in optimal performance, both in terms of AUC and AP, on these datasets. To reduce computational complexity, we will use a fixed majority sample size of 3,000 in the training sets of all subsequent experiments, keeping this sample size constant in every fold. As indicated, the large variance that is clearly visible is caused by performing 10-fold stratified cross-validation on a small sample size. Although this prevents us from reporting very exact results, we are still able to observe general trends.
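The per-fold preprocessing used in these experiments, random under-sampling of the majority class followed by scaling fitted on the training set only, can be sketched as follows (NumPy; a simplified stand-in for the PRTools routines actually used):

```python
import numpy as np

def undersample_majority(X_tr, y_tr, n_neg, rng):
    # Keep all positive training samples and a random subset of n_neg negatives.
    pos = np.flatnonzero(y_tr == 1)
    neg = rng.choice(np.flatnonzero(y_tr == 0), size=n_neg, replace=False)
    keep = np.concatenate([pos, neg])
    return X_tr[keep], y_tr[keep]

def scale_train_test(X_tr, X_te):
    # Shift the training mean to the origin and scale every feature to unit
    # variance; apply the same shift and scale to the test set so that no
    # information leaks from the test data into the preprocessing.
    mean, std = X_tr.mean(axis=0), X_tr.std(axis=0)
    std[std == 0] = 1.0
    return (X_tr - mean) / std, (X_te - mean) / std
```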

Figure 5.4: Learning Curves for IILS Dataset. (a) AUC and (b) AP for FLD and QDC as a function of the number of sampled majority-class training examples (0 to 10,000).

5.3 Undiscovered Criminal Behavior

To assess the effects of label noise on ranking performance we conduct several experiments with different levels of noise φ. To gain a better understanding of label noise in imbalanced data, we include the scenario where different levels of noise ψ occur in the minority class. We construct our datasets according to table 5.1, with N^- = 30,000 and N^+ = 300 for the artificial datasets and N^- = 284,645 and N^+ = 288 for the IILS dataset. We set the levels of noise to experiment with to 0, 50, 100 and 150. In our experiments with noise in the majority class we set ψ = 0 and increment φ. In our experiments with noise in the minority class we set φ = 0 and increment ψ. For our experiments with noise in both classes we increment φ and ψ together such that φ = ψ.
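For the artificial data, the construction of a dataset with a controlled amount of label noise can be sketched as follows. In this sketch draw_neg and draw_pos are assumed to be sampling functions for the two Gaussian distributions (for example those of the sketch in section 4.1), and the generating labels are kept so that the expected AP can be computed later on.

```python
import numpy as np

def build_noisy_dataset(draw_neg, draw_pos, N_neg, N_pos, phi, psi):
    # Assigned label 0: N_neg - phi genuine negatives plus phi samples that
    # really stem from the positive distribution (undiscovered criminal
    # behavior). Assigned label 1: N_pos - psi genuine positives plus psi
    # genuine negatives.
    X = np.vstack([draw_neg(N_neg - phi), draw_pos(phi),
                   draw_pos(N_pos - psi), draw_neg(psi)])
    y_assigned = np.r_[np.zeros(N_neg), np.ones(N_pos)]
    y_true = np.r_[np.zeros(N_neg - phi), np.ones(phi),
                   np.ones(N_pos - psi), np.zeros(psi)]
    return X, y_assigned, y_true
```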

              Artificial Data                      IILS
              X^-              X^+                new X^-            new X^+
Noise in      N^-      N^+     N^-    N^+         X^-      X^+       X^-    X^+
X^-           N^- - φ  φ       0      N^+         N^-      φ         0      133
X^+           N^-      0       ψ      N^+ - ψ     N^- - ψ  0         ψ      N^+ - ψ
both          N^- - φ  φ       ψ      N^+ - ψ     N^- - ψ  φ         ψ      N^+ - φ

Table 5.1: Size and Composition of Intelligence Datasets with Added Artificial Noise. For the artificial data the columns give the number of samples drawn from the distributions N^- and N^+; for the IILS data they give the number of samples taken from the original label sets X^- and X^+.

It is important to note that an unknown amount of noise is already present in the majority class of the IILS dataset. The artificially added noise has to come from our already low number of confirmed cases of criminal behavior. Therefore, in the scenario with noise in the majority class, we can only experiment with N^+ = 133, as the rest is used to generate extra noise in the majority class.

We perform each experiment using 10-fold stratified cross-validation. We use random under-sampling to reduce the size of the majority class in the training set to 3,000 in each fold. Within each fold, the training set and test set are again scaled: the mean of the training set is shifted to the origin and the total variances of all features are scaled to unit variance. The test set is scaled with the scaling vector obtained from the training set. We measure the performance by calculating the average AUC and AP and measuring the variance over all of the 100 individual folds.

As we perform random under-sampling after artificially inserting noise in the majority class, the training set will most likely contain few to no noisy examples. The learned models will not differ much from the noiseless setting. But as the test set will contain added noisy examples in the majority class, we expect to observe a detrimental effect on measured performance. This effect should be most visible in the AP, as noisy samples in the majority class represent undiscovered criminal behavior and should be ranked high by the model. Especially as noise levels increase, the fraction of undiscovered criminal behavior in the top ranked examples should increase, leading to a decrease in measured AP performance. Generating artificial noise in the minority class or in both classes is expected to have a severe detrimental effect on performance, as the already limited number of examples of criminal behavior is heavily distorted. The results are shown in figures 5.5, 5.6, 5.7 and 5.8.

As we artificially add noise to the majority class, we can keep track of the original labels. By doing so, we can calculate the expected AP by testing on a test set with the corrected labels. This is in contrast to the measured AP, which is measured on a test set with noisy labels. As the added noise represents criminal behavior, we can calculate the extent to which a method is able to discover this undiscovered criminal behavior by calculating the expected AP.

Figure 5.5: Noise in Gaussian Data. Panels (a)-(c) show the AUC and panels (d)-(f) the AP for noise levels 0 to 150 in the majority class, the minority class and both classes, for LogReg, FLD, GDD and SVDD.

Figure 5.6: Noise in Stretched Data. Same layout as figure 5.5.

Figure 5.7: Noise in Spherical Data. Same layout as figure 5.5.

Figure 5.8: Noise in IILS Data. Same layout as figure 5.5.

We expect to observe an increase in expected AP due to two factors: firstly, undiscovered criminal behavior ending up high in the ranking is now labeled correctly as criminal behavior, and secondly, the total amount of criminal behavior, consisting of known and undiscovered criminal behavior, increases linearly with the amount of added noise, resulting in more positive samples. The expected AP resulting from the experiments with noise in the majority class is shown in table 5.2.

(a) Gaussian Data
Method   Noise   Measured AP      Expected AP
LogReg   0       0.207 ± 0.059    0.207 ± 0.059
         50      0.199 ± 0.065    0.219 ± 0.062
         100     0.175 ± 0.057    0.217 ± 0.057
         150     0.179 ± 0.065    0.241 ± 0.069
FLD      0       0.209 ± 0.059    0.209 ± 0.059
         50      0.199 ± 0.066    0.220 ± 0.062
         100     0.176 ± 0.057    0.218 ± 0.057
         150     0.179 ± 0.065    0.241 ± 0.069
GDD      0       0.025 ± 0.011    0.025 ± 0.011
         50      0.027 ± 0.011    0.030 ± 0.010
         100     0.026 ± 0.011    0.032 ± 0.010
         150     0.026 ± 0.011    0.035 ± 0.011
SVDD     0       0.026 ± 0.013    0.026 ± 0.013
         50      0.027 ± 0.012    0.030 ± 0.011
         100     0.024 ± 0.011    0.030 ± 0.010
         150     0.026 ± 0.012    0.035 ± 0.014

(b) Stretched Data
Method   Noise   Measured AP      Expected AP
LogReg   0       0.690 ± 0.076    0.690 ± 0.076
         50      0.625 ± 0.082    0.710 ± 0.074
         100     0.587 ± 0.081    0.735 ± 0.059
         150     0.542 ± 0.078    0.736 ± 0.055
FLD      0       0.693 ± 0.075    0.693 ± 0.075
         50      0.630 ± 0.081    0.714 ± 0.072
         100     0.591 ± 0.083    0.739 ± 0.059
         150     0.544 ± 0.078    0.740 ± 0.055
GDD      0       0.094 ± 0.039    0.094 ± 0.039
         50      0.095 ± 0.040    0.106 ± 0.041
         100     0.092 ± 0.032    0.115 ± 0.039
         150     0.092 ± 0.037    0.126 ± 0.035
SVDD     0       0.017 ± 0.008    0.017 ± 0.008
         50      0.017 ± 0.006    0.019 ± 0.006
         100     0.017 ± 0.008    0.022 ± 0.008
         150     0.017 ± 0.008    0.023 ± 0.007

(c) Spherical Data
Method   Noise   Measured AP      Expected AP
LogReg   0       0.013 ± 0.005    0.013 ± 0.005
         50      0.014 ± 0.004    0.016 ± 0.005
         100     0.014 ± 0.005    0.017 ± 0.004
         150     0.014 ± 0.006    0.019 ± 0.005
FLD      0       0.013 ± 0.005    0.013 ± 0.005
         50      0.014 ± 0.004    0.016 ± 0.005
         100     0.014 ± 0.005    0.017 ± 0.004
         150     0.014 ± 0.006    0.019 ± 0.005
GDD      0       0.052 ± 0.015    0.052 ± 0.015
         50      0.055 ± 0.019    0.063 ± 0.019
         100     0.054 ± 0.020    0.066 ± 0.018
         150     0.057 ± 0.024    0.076 ± 0.023
SVDD     0       0.024 ± 0.011    0.024 ± 0.011
         50      0.025 ± 0.010    0.028 ± 0.011
         100     0.023 ± 0.008    0.029 ± 0.008
         150     0.027 ± 0.014    0.035 ± 0.012

(d) IILS Data
Method   Noise   Measured AP      Expected AP
LogReg   0       0.019 ± 0.016    0.019 ± 0.016
         50      0.021 ± 0.024    0.024 ± 0.021
         100     0.020 ± 0.022    0.025 ± 0.016
         150     0.017 ± 0.015    0.029 ± 0.017
FLD      0       0.019 ± 0.018    0.019 ± 0.018
         50      0.015 ± 0.014    0.020 ± 0.015
         100     0.017 ± 0.017    0.022 ± 0.013
         150     0.017 ± 0.018    0.027 ± 0.015
GDD      0       0.009 ± 0.014    0.009 ± 0.014
         50      0.008 ± 0.014    0.009 ± 0.011
         100     0.007 ± 0.010    0.010 ± 0.007
         150     0.010 ± 0.020    0.012 ± 0.011
SVDD     0       0.002 ± 0.001    0.002 ± 0.001
         50      0.002 ± 0.001    0.002 ± 0.001
         100     0.002 ± 0.001    0.003 ± 0.001
         150     0.002 ± 0.001    0.003 ± 0.002

Table 5.2: Difference Between Measured AP and Expected AP

These results show that measuring the AP in settings with undiscovered criminal behavior will yield a conservative estimate of the AP. The detrimental effect on measured AP performance with increasing amounts of noise is due to the fact that the method is able to partly fulfill its goal: undiscovered criminal behavior ends up high in the ranking.
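The distinction between the measured and the expected AP amounts to scoring the same ranking twice, once with the noisy labels as observed and once with the corrected generating labels. A minimal sketch (the average_precision helper repeats the one from the earlier sketch):

```python
import numpy as np

def average_precision(scores, labels):
    ranked = np.asarray(labels)[np.argsort(-scores)]
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    return np.sum((hits / ranks) * ranked) / ranked.sum()

def measured_and_expected_ap(scores, y_assigned, y_true):
    # Measured AP: labels as observed in the noisy test set.
    # Expected AP: the same ranking scored with the corrected labels.
    return average_precision(scores, y_assigned), average_precision(scores, y_true)
```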

Calculating the expected AP has revealed this effect to be quite severe, even with relatively low amounts of label noise. When noise occurs in the minority class, we observe a severe degradation of performance. This leads us to conclude that although noise on the minority class is usually non-existent in intelligence data, we must be vigilant if there is reason to believe that not all examples of criminal behavior truly represent criminal behavior.

5.4 Ranking Performance

We finally measure the performance of all previously mentioned ranking methods on ranking intelligence data. As artificial datasets we use the datasets described in section 4.1. We set N^- = 30,000, N^+ = 100 and φ = 900. As in real intelligence data, we keep ψ = 0. We also perform this experiment on the IILS dataset, with N^- = 284,645 and N^+ = 288. No extra noise is added to the IILS dataset. We perform 10 times 10-fold stratified cross-validation and measure the performance by averaging the results from all folds and calculating the variance over all 100 folds.

The SVM variants require parameter tuning for the slack trade-off parameter C. To this end, within each fold, we again perform 10-fold cross-validation on the training set to estimate the optimal value for C. As optimization criterion for the parameter tuning we use the same criterion the classification method itself optimizes; for example, for SVM-MAP we calculate the AP and choose the value of C that results in optimal performance in terms of AP. For each outer fold, we train the SVM with the value of C that has shown optimal performance in the cross-validation loop on the training set.

As performance measures we calculate the AUC, AP and precision@n with n = 10, 100, 1000. Note that the precision@1000 is upper bounded by N^+ / 1000. The results are shown in tables 5.3, 5.4, 5.5 and 5.6.

A few things are interesting to observe here. First of all, a simple linear discriminative method like FLD performs well in terms of both AUC and AP in all settings where the means of the data distributions do not overlap. Logistic Regression shows similarly good performance. Given that these methods are also among the least computationally expensive, they should serve as a starting point for an analysis in any intelligence based investigation.

SVM-ROCArea shows a better performance in terms of AUC than most other SVM variants on all data where the means of the data distributions do not overlap. This shows that for optimal performance in terms of AUC, SVM-ROCArea is a suited ranking method.

Method        AUC            AP             Precision@10   Precision@100  Precision@1000
1-NN          0.721 ± 0.071  0.017 ± 0.015  0.017 ± 0.010  0.011 ± 0.001  0.007 ± 0.001
LogReg        0.864 ± 0.047  0.052 ± 0.044  0.055 ± 0.015  0.031 ± 0.001  0.009 ± 0.001
FLD           0.865 ± 0.047  0.053 ± 0.044  0.058 ± 0.015  0.032 ± 0.001  0.009 ± 0.001
QDC           0.655 ± 0.085  0.009 ± 0.004  0.003 ± 0.005  0.003 ± 0.002  0.006 ± 0.002
SVM           0.713 ± 0.141  0.020 ± 0.018  0.021 ± 0.046  0.016 ± 0.013  0.007 ± 0.002
SVM-Rank      0.867 ± 0.048  0.043 ± 0.030  0.035 ± 0.054  0.030 ± 0.013  0.009 ± 0.001
SVM-P@10      0.815 ± 0.137  0.043 ± 0.036  0.052 ± 0.061  0.028 ± 0.016  0.008 ± 0.002
SVM-P@100     0.843 ± 0.108  0.044 ± 0.033  0.053 ± 0.063  0.030 ± 0.014  0.009 ± 0.002
SVM-P@1000    0.531 ± 0.194  0.014 ± 0.021  0.015 ± 0.036  0.007 ± 0.011  0.004 ± 0.003
SVM-ROCArea   0.866 ± 0.049  0.045 ± 0.031  0.042 ± 0.054  0.030 ± 0.013  0.009 ± 0.001
SVM-MAP       0.852 ± 0.081  0.044 ± 0.033  0.054 ± 0.064  0.030 ± 0.014  0.009 ± 0.001
GDD           0.655 ± 0.085  0.009 ± 0.004  0.003 ± 0.005  0.003 ± 0.002  0.006 ± 0.002
SVDD          0.618 ± 0.094  0.008 ± 0.003  0.000 ± 0.006  0.004 ± 0.002  0.005 ± 0.002

Table 5.3: Performance of all Ranking Methods on Gaussian Data

Method        AUC            AP             Precision@10   Precision@100  Precision@1000
1-NN          0.709 ± 0.079  0.026 ± 0.030  0.024 ± 0.013  0.017 ± 0.001  0.006 ± 0.001
LogReg        0.971 ± 0.017  0.157 ± 0.070  0.167 ± 0.013  0.075 ± 0.000  0.010 ± 0.000
FLD           0.974 ± 0.014  0.160 ± 0.071  0.170 ± 0.012  0.077 ± 0.000  0.010 ± 0.000
QDC           0.877 ± 0.052  0.067 ± 0.052  0.075 ± 0.016  0.034 ± 0.001  0.009 ± 0.001
SVM           0.963 ± 0.022  0.144 ± 0.074  0.150 ± 0.102  0.069 ± 0.014  0.010 ± 0.000
SVM-Rank      0.971 ± 0.016  0.137 ± 0.062  0.153 ± 0.104  0.076 ± 0.013  0.010 ± 0.000
SVM-P@10      0.958 ± 0.024  0.132 ± 0.068  0.146 ± 0.100  0.067 ± 0.016  0.010 ± 0.000
SVM-P@100     0.961 ± 0.022  0.127 ± 0.061  0.141 ± 0.106  0.069 ± 0.015  0.010 ± 0.000
SVM-P@1000    0.941 ± 0.049  0.113 ± 0.067  0.123 ± 0.101  0.062 ± 0.019  0.010 ± 0.001
SVM-ROCArea   0.970 ± 0.016  0.134 ± 0.058  0.150 ± 0.101  0.076 ± 0.014  0.010 ± 0.000
SVM-MAP       0.957 ± 0.037  0.123 ± 0.062  0.130 ± 0.106  0.067 ± 0.018  0.010 ± 0.001
GDD           0.858 ± 0.058  0.062 ± 0.051  0.076 ± 0.016  0.032 ± 0.001  0.009 ± 0.001
SVDD          0.583 ± 0.095  0.013 ± 0.017  0.022 ± 0.008  0.007 ± 0.002  0.004 ± 0.002

Table 5.4: Performance of all Ranking Methods on Stretched Data

SVM-MAP does not show an improvement in AP compared to the other SVM variants. It seems SVM-MAP is not able to maintain its competitive AP performance in the presence of noise and imbalance. The one-class ranking methods we used in our experiments only perform competitively on data that is not linearly separable.

Performance in terms of AP on the artificial data is in some settings in this experiment worse than shown in table 5.2. This shows again that an increased amount of noise in the majority class (φ = 900) has a quite severe detrimental effect on measured AP. Measured performance on the IILS data is better than in table 5.2, as we are able to use all positive examples instead of only 133. Two factors contribute to a conservative estimate of the AP. Undiscovered criminal behavior ends up high in the ranking, but its being labeled as normal behavior results in worse measured performance in terms of AP.

Method        AUC            AP             Precision@10   Precision@100  Precision@1000
1-NN          0.582 ± 0.090  0.007 ± 0.003  0.000 ± 0.006  0.004 ± 0.002  0.005 ± 0.002
LogReg        0.486 ± 0.077  0.005 ± 0.002  0.002 ± 0.004  0.002 ± 0.001  0.003 ± 0.001
FLD           0.486 ± 0.077  0.005 ± 0.002  0.001 ± 0.004  0.002 ± 0.001  0.003 ± 0.001
QDC           0.830 ± 0.042  0.025 ± 0.013  0.019 ± 0.012  0.023 ± 0.001  0.008 ± 0.001
SVM           0.507 ± 0.075  0.005 ± 0.004  0.002 ± 0.014  0.002 ± 0.004  0.003 ± 0.001
SVM-Rank      0.495 ± 0.082  0.005 ± 0.005  0.002 ± 0.014  0.002 ± 0.004  0.003 ± 0.002
SVM-P@10      0.506 ± 0.088  0.005 ± 0.003  0.002 ± 0.014  0.003 ± 0.005  0.003 ± 0.002
SVM-P@100     0.514 ± 0.085  0.005 ± 0.002  0.001 ± 0.010  0.002 ± 0.005  0.003 ± 0.002
SVM-P@1000    0.504 ± 0.086  0.004 ± 0.002  0.000 ± 0.000  0.002 ± 0.005  0.003 ± 0.001
SVM-ROCArea   0.492 ± 0.079  0.005 ± 0.004  0.002 ± 0.014  0.002 ± 0.004  0.003 ± 0.001
SVM-MAP       0.496 ± 0.080  0.004 ± 0.001  0.000 ± 0.000  0.001 ± 0.003  0.003 ± 0.001
GDD           0.827 ± 0.043  0.024 ± 0.013  0.019 ± 0.012  0.022 ± 0.001  0.008 ± 0.001
SVDD          0.643 ± 0.085  0.018 ± 0.016  0.028 ± 0.011  0.015 ± 0.002  0.005 ± 0.002

Table 5.5: Performance of all Ranking Methods on Spherical Data

Method        AUC            AP             Precision@10   Precision@100  Precision@1000
1-NN          0.820 ± 0.047  0.014 ± 0.009  0.019 ± 0.044  0.018 ± 0.013  0.011 ± 0.002
LogReg        0.907 ± 0.031  0.033 ± 0.018  0.078 ± 0.077  0.035 ± 0.017  0.016 ± 0.003
FLD           0.927 ± 0.023  0.031 ± 0.017  0.069 ± 0.081  0.033 ± 0.017  0.016 ± 0.003
QDC           0.862 ± 0.032  0.001 ± 0.000  0.014 ± 0.035  0.017 ± 0.011  0.010 ± 0.002
SVM           0.925 ± 0.021  0.033 ± 0.020  0.079 ± 0.082  0.033 ± 0.019  0.016 ± 0.003
SVM-Rank      0.929 ± 0.021  0.031 ± 0.018  0.082 ± 0.083  0.030 ± 0.016  0.016 ± 0.003
SVM-P@10      0.916 ± 0.028  0.031 ± 0.020  0.085 ± 0.089  0.030 ± 0.017  0.015 ± 0.003
SVM-P@100     0.914 ± 0.031  0.031 ± 0.020  0.083 ± 0.084  0.030 ± 0.018  0.015 ± 0.003
SVM-P@1000    0.780 ± 0.089  0.017 ± 0.015  0.047 ± 0.067  0.022 ± 0.014  0.008 ± 0.003
SVM-ROCArea   0.933 ± 0.019  0.032 ± 0.019  0.090 ± 0.083  0.030 ± 0.017  0.016 ± 0.003
SVM-MAP       0.920 ± 0.024  0.031 ± 0.020  0.076 ± 0.080  0.033 ± 0.019  0.015 ± 0.003
GDD           0.853 ± 0.037  0.013 ± 0.009  0.015 ± 0.036  0.019 ± 0.012  0.010 ± 0.003
SVDD          0.695 ± 0.045  0.004 ± 0.004  0.007 ± 0.026  0.003 ± 0.006  0.003 ± 0.002

Table 5.6: Performance of all Ranking Methods on IILS Data

Next to that, when building a model to be used in practice, all positive examples can be used for training the model, resulting in more predictive power.
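The nested tuning loop for the SVM trade-off parameter C described in section 5.4 can be sketched as follows. This sketch uses scikit-learn's LinearSVC purely for illustration; the experiments themselves were run with the SVMlight, SVMrank, SVMperf and SVMmap implementations, and the arguments candidates, folds and score (for example the average precision helper from earlier sketches) are assumed to be supplied by the caller.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tune_C(X_tr, y_tr, candidates, folds, score):
    # Inner cross-validation loop: for every candidate value of C, average
    # the chosen optimization criterion over the inner folds and keep the
    # best-performing value.
    best_C, best_val = None, -np.inf
    for C in candidates:
        vals = []
        for f in np.unique(folds):
            tr, va = folds != f, folds == f
            model = LinearSVC(C=C).fit(X_tr[tr], y_tr[tr])
            vals.append(score(model.decision_function(X_tr[va]), y_tr[va]))
        if np.mean(vals) > best_val:
            best_C, best_val = C, np.mean(vals)
    return best_C
```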

Chapter 6

Conclusions

In an intelligence based investigation, the goal is the discovery of criminal behavior in large amounts of data, based on a few known examples of criminal behavior. Such an investigation is not aimed at classifying everything correctly, but rather at discovering a few new cases of criminal behavior. To this end, ranking methods are used to obtain a ranking of examples, which enables the deployment of the available resources to investigate the top n ranked results.

We have shown that random under-sampling can be used to decrease the number of negative examples used for training a ranking method. Keeping a number of negative examples in the order of ten times the number of positive samples is sufficient to obtain optimal performance, both in terms of AUC and AP.

By labeling unknown examples as negative examples, two-class ranking methods can be trained. This results in an imbalanced dataset with a small amount of label noise in the majority class. This label noise does not affect the training of a model, as almost all noisy examples are discarded during random under-sampling. But we have shown a relatively severe effect on AP performance estimates, as noisy examples representing undiscovered criminal behavior end up high in the ranked result. This leads us to conclude that the measured AP of a ranking method on an intelligence dataset with a small amount of noise in the majority class is a conservative estimate of the AP to be encountered in practice. When label noise occurs in the minority class representing criminal behavior, a severe degradation of performance is observed. Although intelligence data usually contains only confirmed examples of criminal behavior, care should be taken if the correctness of these labels is debatable.

We have further shown that simple linear discriminative methods such as FLD and Logistic Regression show competitive performance on intelligence data, both in terms of AUC and AP.

SVMs that use optimization criteria more closely related to the goal of an intelligence based investigation are not able to outperform these simple methods. SVM-MAP fails altogether in achieving optimal AP performance on intelligence data.

6.1 Future Work

We have shown that even a relatively small amount of label noise in the majority class results in a lower measured AP. Further experimentation with different datasets and parameters might reveal the exact relationship between the amount of noise and the under-estimation of the AP. The results could be used to give an estimate of the expected AP in practice when undiscovered criminal behavior is sought.

To be able to compare the performance of models on different datasets, in future work the Balanced Average Precision (BAP) [46] can be used as a normalized AP measure that is independent of the dataset balance.

Bibliography

[1] Jason van Hulse and Taghi Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data Knowl. Eng., 68(12):1513-1542, 2009. ISSN 0169-023X.

[2] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recogn., 40(12):3358-3378, 2007. ISSN 0031-3203.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007. ISBN 0387310738.

[4] Sofia Visa and Anca Ralescu. Issues in mining imbalanced data sets - a review paper. In Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pages 67-73, 2005.

[5] Gary M. Weiss and Foster Provost. The effect of class distribution on classifier learning: An empirical study, 2001.

[6] Gary M. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1):7-19, 2004.

[7] David M. J. Tax. One-class classification. PhD thesis, Technische Universiteit Delft, 2001.

[8] Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev., 22(3):177-210, 2004. ISSN 0269-2821.

[9] Yunlei Li, Lodewyk F. A. Wessels, Dick de Ridder, and Marcel J. T. Reinders. Classification in the presence of class noise using a probabilistic kernel Fisher method. Pattern Recogn., 40(12):3349-3357, 2007. ISSN 0031-3203.

[10] D. Anyfantis, M. Karagiannopoulos, Sotiris B. Kotsiantis, and Panayiotis E. Pintelas. Robustness of learning techniques in handling class noise in imbalanced datasets. In Christos Boukis, Aristodemos Pnevmatikakis, and Lazaros Polymenakos, editors, IFIP, volume 247, pages 21-28. Springer, 2007. ISBN 978-0-387-74160-4.

[11] Aleksander Kolcz and Gordon V. Cormack. Genre-based decomposition of email class noise. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 427-436, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

[12] J. Schafer and J. Graham. Missing data: Our view of the state of the art. Psychological Methods, 7:147-177, 2002.

[13] D. B. Rubin. Inference and missing data. Biometrika, 63:581-592, 1976.

[14] T. Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. J. Climate, 14:853-871, 2001.

[15] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proc. 14th International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, 1997.

[16] Tom Fawcett. An introduction to ROC analysis. Pattern Recogn. Lett., 27(8):861-874, 2006. ISSN 0167-8655.

[17] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, July 1997.

[18] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 233-240, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2.

[19] David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series of Information Retrieval. Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands, second edition, 2004.

[20] Kazuaki Kishida. Property of average precision and its generalization: an examination of evaluation indicator for information retrieval. Technical report, National Institute of Informatics, 2005.

[21] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2010. http://www.cs.ucl.ac.uk/staff/d.barber/brml.

[22] Pedro Domingos. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery, 3:409-425, 1999.

[23] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998. ISBN 0471030031.

[24] O. L. Mangasarian. Generalized support vector machines. In Advances in Large Margin Classifiers, pages 135-146. MIT Press, 1998.

[25] O. L. Mangasarian. Data mining via support vector machines. In IFIP Conference on System Modelling and Optimization, pages 23-27. Kluwer Academic Publishers, 2001.

[26] C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003.

[27] David M. J. Tax and Robert P. W. Duin. Combining one-class classifiers. In Proc. Multiple Classifier Systems, pages 299-308. Springer Verlag, 2001.

[28] D. M. J. Tax and R. P. W. Duin. Characterizing one-class datasets. In Proceedings of the 16th annual symposium of the pattern recognition association of South Africa, pages 21-26, 2005.

[29] Piotr Juszczak. Learning to recognise. A study on one-class classification and active learning. PhD thesis, Technische Universiteit Delft, 2006.

[30] Y. Arzhaeva, D. M. J. Tax, and B. van Ginneken. Improving computer-aided diagnosis of interstitial disease in chest radiographs by combining one-class and two-class classifiers. In J. M. Reinhardt and J. P. W. Pluim, editors, SPIE Medical Imaging, volume 6144. SPIE, Bellingham, WA, 2006.

[31] Andrew B. Gardner, Abba M. Krieger, George Vachtsevanos, and Brian Litt. One-class novelty detection for seizure analysis from intracranial EEG. J. Mach. Learn. Res., 7:1025-1044, 2006. ISSN 1532-4435.

[32] Giorgio Giacinto, Roberto Perdisci, Mauro Del Rio, and Fabio Roli. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Inf. Fusion, 9(1):69-82, 2008. ISSN 1566-2535.

[33] Frederic Ratle, Mikhail Kanevski, Anne-Laure Terrettaz-Zufferey, and Olivier Ribaux. A comparison of one-class classifiers for novelty detection in forensic case data, 2008.

[34] David M. J. Tax and Robert P. W. Duin. Support vector data description. Mach. Learn., 54(1):45-66, 2004. ISSN 0885-6125.

[35] Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142. ACM, 2002. ISBN 1-58113-567-X.

[36] Thorsten Joachims. A support vector method for multivariate performance measures. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 377-384. ACM, 2005. ISBN 1-59593-180-5.

[37] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 271-278. ACM, 2007. ISBN 978-1-59593-597-7.

[38] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005. ISSN 1532-4435.

[39] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI '95: Proceedings of the 14th international joint conference on Artificial intelligence, pages 1137-1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-363-8.

[40] Antti Airola, Tapio Pahikkala, Willem Waegeman, Bernard De Baets, and Tapio Salakoski. A comparison of AUC estimators in small-sample studies. In Saso Dzeroski, Pierre Geurts, and Juho Rousu, editors, International workshop on Machine Learning in Systems Biology, volume 3, pages 15-23, September 2009.

[41] T. Pahikkala, A. Airola, J. Boberg, and T. Salakoski. Exact and efficient leave-pair-out cross-validation for ranking RLS. In T. Honkela, M. Pöllä, M.-S. Paukkeri, and O. Simula, editors, Proceedings of the 2nd international and interdisciplinary conference on adaptive knowledge representation and reasoning, pages 1-8, 2008.

[42] A. Isaksson, M. Wallman, H. Goransson, and M. Gustafsson. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters, 29(14):1960-1965, October 2008. ISSN 0167-8655.

[43] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, and S. Verzakov. PRTools4.1, a Matlab toolbox for pattern recognition. Delft University of Technology, 2007.

[44] D. M. J. Tax. DDtools, the Data Description Toolbox for Matlab, December 2009. Version 1.7.3.

[45] Thorsten Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 079237679X.

[46] J. C. van Gemert, C. J. Veenman, and J. M. Geusebroek. Episode-constrained cross-validation in video concept retrieval. IEEE Transactions on Multimedia, 11(4):780-785, 2009.