Lowering False Alarm rates in Motion Detection Scenarios using Machine Learning TIM LENNERYD


Master of Science Thesis
Stockholm, Sweden 2012

2D1021, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering, 270 credits
Royal Institute of Technology, year 2012
Supervisor at CSC was Hedvig Kjellström
Examiner was Danica Kragic

TRITA-CSC-E 2012:024
ISRN-KTH/CSC/E--12/024--SE
ISSN

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE Stockholm, Sweden
URL:

Abstract

Camera motion detection is a form of intruder detection that may cause high false alarm rates, especially in home environments, where movement from, for example, pets and windows may be the cause. This article explores reducing the frequency of such false alarms by applying machine learning techniques, in the specific scenario where only data describing the detected motion is available, rather than the full image. This article introduces two competitive unsupervised learning algorithms: the first a vector quantization algorithm for filtering out false alarms from window sources, the second a self-organizing map for filtering out smaller events such as pets, by scaling event size based on the distance to the camera. Initial results show that the two algorithms can provide the functionality needed, but that they need to be more robust to be used well in an unsupervised live situation. The majority of the results have been obtained using simulated data rather than live data, due to issues with obtaining live data at the time of the project; tests on live data are left as future work.

Referat

Reducing false alarms in motion detection through the use of machine learning

Camera-based motion detection is a form of intrusion alarm that can give rise to a high frequency of false alarms, especially in home environments, where pets and windows can be contributing causes. This article explores the possibility of reducing the false alarm frequency through the use of machine learning techniques. The specific situation examined is one where only data about the detected motion is used, instead of the whole image. This article introduces two algorithms based on competitive unsupervised learning. The first is a vector quantization algorithm for filtering out false alarms from window sources, and the second is a self-organizing map for filtering events based on their size, where the size is scaled depending on the distance from the camera. Initial results show that the algorithms can provide the desired functionality, but that they need to be more robust in order to be used well without supervision in real situations. The majority of the results have been obtained from simulated data rather than real data, since it has been difficult to obtain real data during the course of the project. Tests with real data are therefore an important item of future work.

Contents

1 Introduction
  1.1 The Scenario
  1.2 Anomaly Detection
  1.3 Classification
  1.4 Related Work
2 Theory
  2.1 Preliminaries
    2.1.1 Deriving the Distance between the Pet and Camera
    2.1.2 Deriving the Diagonal Length of the Pet's Bounding Box as a Limit
  2.2 Competitive Learning
3 Method and Implementation
  3.1 Simulation
  3.2 Visualizing the Results
  3.3 Keeping track of windows using Vector Quantization
  3.4 A Self-Organizing Map as a Height map for pet size
  3.5 Thresholds
4 Results and Conclusions
  4.1 Window Adjustment Filter
  4.2 Pet Filtering
  4.3 Conclusions
5 Future Work
Bibliography
Appendices
A Other Considered Methods
  A.1 One Class and Two Class Support Vector Machines
  A.2 Clustering
    A.2.1 Computational Complexity
    A.2.2 Advantages and Disadvantages
  A.3 Nearest Neighbor
    A.3.1 Computational Complexity
    A.3.2 Advantages and Disadvantages
  A.4 Neural networks
    A.4.1 Supervised and Semi-supervised Neural Networks
    A.4.2 Computational Complexity
    A.4.3 Advantages and Disadvantages

Chapter 1

Introduction

1.1 The Scenario

A company is providing intrusion detection alarms for houses and apartments. These alarms are motion based, with cameras taking pictures and using motion detection algorithms to provide a bounding box around the detected motion. This box, together with some supporting information, is then sent to the algorithms developed and presented in this article. The constraint below, decided on by the author and the company in cooperation, limits the focus of the algorithms to be developed. By not using the full picture, the algorithms have to make do with less information than if the picture was available, something that influences both the choice of algorithms and the results. There are a number of reasons why this decision was made, the most important being to obtain a lower-dimensional feature space, to reduce privacy concerns about machine analysis of private pictures, and to reduce the computational complexity.

Constraint 1: The algorithms may not assume that they have access to any of the pictures taken by the camera; the only data available will be the surrounding data and the bounding box information.

Currently the company only uses very simplistic filtering within the routers connected to the cameras to avoid the most obvious false alarms. This filter consists of a few tests, as can be seen below:

- Disregard movement if the object has very inconsistent movements, that is, when the velocity of the detected movement changes very rapidly between images.
- Disregard movement if the size of the object changes rapidly and inconsistently between images.
- Disregard movement if the velocity or size of the object is far too small to be anything but a false alarm.
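The three simple router tests above can be sketched as follows. This is a hypothetical illustration only: the `BoxEvent` fields and every threshold value are assumptions made for the example, not the company's actual filter parameters.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class BoxEvent:
    """One motion detection event (hypothetical field layout)."""
    cx: float    # bounding box centre x, in pixels
    cy: float    # bounding box centre y, in pixels
    area: float  # bounding box area, in square pixels
    vx: float    # velocity x component, pixels per frame
    vy: float    # velocity y component, pixels per frame

# Illustrative thresholds, chosen arbitrarily for the sketch.
MAX_VELOCITY_JUMP = 3.0  # max allowed ratio between consecutive speeds
MAX_SIZE_JUMP = 2.5      # max allowed ratio between consecutive areas
MIN_SPEED = 0.5          # anything slower is treated as noise
MIN_AREA = 25.0          # anything smaller is treated as noise

def is_false_alarm(prev: BoxEvent, curr: BoxEvent) -> bool:
    """Apply the three simple tests to a pair of consecutive events."""
    speed_prev = hypot(prev.vx, prev.vy)
    speed_curr = hypot(curr.vx, curr.vy)
    # Test 3: velocity or size far too small to be anything but a false alarm.
    if speed_curr < MIN_SPEED or curr.area < MIN_AREA:
        return True
    # Test 1: very inconsistent movement between images.
    if speed_prev > 0 and max(speed_curr / speed_prev,
                              speed_prev / speed_curr) > MAX_VELOCITY_JUMP:
        return True
    # Test 2: size changes rapidly and inconsistently between images.
    if prev.area > 0 and max(curr.area / prev.area,
                             prev.area / curr.area) > MAX_SIZE_JUMP:
        return True
    return False
```

Note how conservative these tests are: a walking human easily passes all three, which is why the misclassification risk for break-ins stays low.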

With this project they hope to include more intelligent detection of false alarms, by applying learning algorithms to the different situations present in the camera environments. They currently neither use nor log the bounding box motion data on the server, but by sending the motion data on to the server together with the pictures taken, they will have data for the learning algorithms to work on. Since this is a situation where misclassifying a break-in attempt as a false alarm is very damaging, the system will need to minimize the number of such misclassifications while still lowering the number of false alarms from the main factors. More formally, the task is to minimize the type I errors (false positives), that is, the number of false alarms, while keeping the type II errors (false negatives) at a reasonable level.

Since the current filters used by the company are so basic, this condition holds for those filters. There is very little risk of misclassifying a break-in attempt as a false alarm with the above filters, since humans are very unlikely to appear small enough to be disregarded, or to have inconsistent movements between camera pictures. The current filters, however, only manage to catch very specific types of false alarms, such as quick light effects, camera distortions and the like. By applying more filters with more computing power, the hope is that this can be vastly improved upon.

The company has noted that there are two separate sources providing high numbers of false alarms, with a third possibly providing a lower but still significant amount. The first, windows within the vision of a camera, provides false alarms because any movement outside the window registers as detected movement, and there is currently no way for the camera to distinguish between these movements and those occurring within the protected house, so it has to raise an alarm.
Pets make up the second source of false alarms, since any detected movement of a pet may very well cause a false alarm, depending on its velocity and size. As such, the company cannot yet sell their product to clients with pets, since the false alarm rate would be too high. The third, lesser source is that of movements outside windows translating into movements inside the room through shadows and light. Examples of this include shadows from trees moving in the wind, cars driving by outside and bouncing light into the room, and other quick light phenomena. The actual movement of the sun, clouds and such does not fit in here, since such movements are too slow to get past the threshold mentioned above. While the prototype algorithms realistically will not completely solve the above issues, the prototype should strive to minimize false alarms from these sources while still keeping the type II errors minimized.

Since some algorithms cope better with certain types of data while others have big problems, it is useful to spend some time considering the nature of the data the scenario provides. The system provides some data, such as the time, which camera, the start position, the end position and the velocity vector of the bounding box containing the movement. From this it is possible to derive certain other values that can be of importance, such as size, by calculating the area of the bounding box using the start and the end positions. This size value is then important in the consideration of whether the event is a false alarm or not, since small boxes could indicate pets within the house, or something else that is just too small to be a human. All cameras will be trained individually, so there is no reason to add information about which camera sent the event to the anomaly detection algorithm; it would not give the algorithm any more information to work with. However, the rest of the information can be of importance. The data is multivariate, since one data instance holds a number of different values, both scalars and vectors.

Time: Time of detection (scalar)
Start position: The top left corner of the bounding box (2D vector)
End position: The bottom right corner of the bounding box (2D vector)
Size: The size of the box, either as a diagonal or the area (scalar)
Velocity: The current velocity of the detected object (2D vector)

The scalars are one-dimensional and the vectors above lie on a two-dimensional plane, that of the picture taken, so the total number of dimensions used by one data instance is eight. This is regardless of how many dimensions the algorithms presented in this article actually use, since there is always the option of completely ignoring some dimensions in the feature space.

1.2 Anomaly Detection

Anomaly detection, also called outlier detection, is a heavily researched subject with many widely differing proposed algorithms for both general use and very specific situations. There are also a couple of notable definitions quoted by Hodge and Austin [21]: the first presented by Grubbs (1969), with an extension presented by Barnett and Lewis (1994).

Grubbs: "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs."

Barnett and Lewis: "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data."
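As a concrete illustration of the data instance described in section 1.1 (time, start and end position, size, velocity), a minimal record type could look as follows. The class name and the flattening into an eight-dimensional feature vector are a hypothetical sketch, not part of the thesis implementation.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MotionEvent:
    time: float                    # time of detection (scalar)
    start: Tuple[float, float]     # top left corner of the bounding box (2D)
    end: Tuple[float, float]       # bottom right corner of the box (2D)
    velocity: Tuple[float, float]  # current velocity of the object (2D)

    @property
    def area(self) -> float:
        """Derived size: area of the box from the start and end corners."""
        return abs(self.end[0] - self.start[0]) * abs(self.end[1] - self.start[1])

    def features(self) -> Tuple[float, ...]:
        """Flatten to the eight dimensions counted in the text."""
        return (self.time, *self.start, *self.end, self.area, *self.velocity)
```

An algorithm that ignores some dimensions would simply select a subset of the tuple returned by `features()`.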
By using a few simple assumptions, seen below, the task defined in section 1.1 can be cast as an anomaly detection problem.

1. The probability of a break-in is much smaller than that of a false alarm, regardless of the source of the false alarm.

2. Any false alarm due to movement seen outside of the window will by necessity be confined to the edges of the window.

3. Movements by pets generally conform to certain patterns; for example, pets generally keep to the floor or to certain preferred furniture while at home alone.

If the class of normal everyday events that should be considered false alarms is based on these assumptions, then movements differing in perceivable ways can be found with algorithms that detect anomalies. This article will refer both to anomaly detection and outlier detection, but within the context of this article we do not in any way differentiate between the definitions or functions of the two expressions. To help with separating the different algorithm classes, Hodge and Austin [21] define three approaches, described below, depending on what is to be modeled and what knowledge is available.

Type 1: Determine the outliers with no prior knowledge of the data. This is essentially a learning approach analogous to unsupervised clustering. The approach processes the data as a static distribution, pinpoints the most remote points and flags them as potential outliers. It is noted also that this approach requires all data to be available before processing. As part of a type 1 approach, two different techniques, diagnosis and accommodation, are commonly employed. Diagnosis detects the outlying points in the data and may remove them from future iterations, gradually pruning the data and fitting the model until no outliers are found. Accommodation incorporates outliers and employs a robust classification method that can withstand such isolated outliers [21].

Type 2: Model both normality and abnormality. This approach is analogous to supervised classification and requires pre-labeled data, tagged as normal or abnormal. Hodge and Austin [21] continue by referring to this type of approach as normal/abnormal classification, using either one normal class or several, depending on what is needed. They also note that these classifiers are best suited to static data unless an incremental classifier, such as an evolutionary neural network, is used, since the classification needs to be rebuilt if the distribution shifts.

Type 3: Model only normality, or in a few cases model abnormality. Authors generally name this technique novelty detection or novelty recognition. It is analogous to a semi-supervised recognition or detection task, and can be considered semi-supervised as the normal class is taught but the algorithm learns to recognize abnormality. The approach needs pre-classified data but only learns from data marked normal.

Figure 1.1: Example of Point Anomalies. Points noted by circles have been classified as normal, while points noted by a square are classified as point anomalies.

While type 3 approaches may seem similar to type 2 approaches, the difference lies in the fact that by only labeling the normal class, one can avoid the corner cases where it is uncertain whether a data instance belongs to the normal class or not. Instead of a normal/abnormal separation, type 3 approaches may present a separation between the normal class and those data instances the approach cannot reliably classify as normal. While the above definition of approaches is very useful, there is also a need to define different types of anomalies. Chandola et al. [13] define and refer to three different categories of anomalies, which we describe briefly below:

Point Anomalies: If an individual data instance can be considered anomalous with respect to the rest of the data, then that instance is termed a point anomaly. Point anomalies are the simplest type of anomaly and the focus of much anomaly detection research. This is also the type of anomaly detection we will be focusing on in this article, with the scenario and the prototype. Figure 1.1 shows a simple point anomaly example where the squares are classified as anomalous since, looking at the whole dataset, they are the few points that are markedly different in position from the rest.

Contextual Anomalies: If a data instance is anomalous in a specific context, but not otherwise, then it is termed a contextual anomaly (or conditional anomaly).

Figure 1.2: Example of Collective Anomalies. Points noted by circles have been classified as normal, while points noted by squares are classified as anomalies. If a single point had been at the position of the squares, the point would not have been deemed a collective anomaly, but since there are several of them on a line, this is deemed anomalous.

If time is a contextual attribute (an attribute orienting the data instance within the dataset) in a dataset, then an event or occurrence at an unusual time might be a contextual anomaly, if the occurrence would be normal at other times. For example, the safe of a bank being opened in the middle of the night when the bank is closed, as opposed to at very specific times during the day when the bank is open and procedures are followed.

Collective Anomalies: If a collection of related data instances is anomalous with respect to the entire data set, it is termed a collective anomaly. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous. The easiest way to show collective anomalies is with an example. Figure 1.2 shows how a single point in the center would not be classified as anomalous, but with the concentrated distribution of points in the center differing from the rest of the dataset, the classifier classifies the collection as anomalous.

In this article the focus lies almost exclusively on point anomalies; contextual and collective anomalies will be completely disregarded when choosing algorithms. The reason for this is that the point anomaly definition fits well with what the scenario is looking to achieve. While collective or contextual anomaly detection could very well be used to find and distinguish between false alarms and real alarms, the algorithms easily become more complex without necessarily finding the simpler point anomalies.
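A minimal illustration of point anomaly detection of the kind shown in figure 1.1 is given below: points whose distance to the dataset centroid exceeds a multiple of the mean distance are flagged. This is a deliberately simple distance-based rule for illustration only, not one of the algorithms developed in this article, and the factor `k` is an arbitrary choice.

```python
from math import dist  # Euclidean distance, Python 3.8+

def point_anomalies(points, k=2.0):
    """Return indices of points far from the centroid of the whole dataset."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n
                     for i in range(len(points[0])))
    distances = [dist(p, centroid) for p in points]
    mean_d = sum(distances) / n
    # A point is a point anomaly if it is much farther out than average.
    return [i for i, d in enumerate(distances) if d > k * mean_d]
```

Note that this is a type 1 approach in the terminology above: it needs the whole dataset before processing and uses no labels.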

1.3 Classification

The process of classification uses a model (classifier) that takes labeled data instances as training data, and adjusts the model to correctly classify as many of the training instances as possible into one of the available data classes [13]. After these adjustments have been made, data similar to that used for training is used to test how well the system generalizes to data it has not seen. Anomaly detection by classification operates similarly, by first training the model using one or several normal classes and then testing the system by asking whether particular data instances can be classified as one of the normal classes, or are anomalous. Chandola et al. [13] present an assumption that anomaly detection algorithms based on classification operate under:

Assumption: A classifier that can distinguish between the normal and anomalous classes can be learned in the given feature space.

Within multi-class anomaly detection it is assumed that a data instance is defined as anomalous only if it cannot be reliably placed in one of the available normal classes. In one-class anomaly detection a boundary around the normal class is formed within the given feature space, and any data instance that does not appear within that boundary is classified as an anomaly. It is essentially the same in the multi-class case, except that the data instance is deemed anomalous only if it does not appear within the boundary of any of the classes. There exists a reduction from the outlier detection problem to that of classification [1], which allows the use of active learning techniques with outlier detection problems. While a formal reduction is in many cases not needed to apply traditional machine learning techniques, as well as those detailed later in this article, it is in any case useful to note its existence.
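The one-class boundary described above can be sketched with a deliberately simple stand-in for a real one-class classifier: fit a centroid and radius on normal data only, then classify anything outside the learned boundary as anomalous. The spherical boundary and the `margin` padding are assumptions made for this sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+

class OneClassBoundary:
    """Learn a spherical boundary around the normal training data."""

    def fit(self, normal_points, margin=1.2):
        n = len(normal_points)
        self.centroid = tuple(sum(p[i] for p in normal_points) / n
                              for i in range(len(normal_points[0])))
        # Radius: farthest normal training point, padded by a margin.
        self.radius = margin * max(dist(p, self.centroid)
                                   for p in normal_points)
        return self

    def is_anomaly(self, point) -> bool:
        """Anomalous iff the point falls outside the learned boundary."""
        return dist(point, self.centroid) > self.radius
```

The multi-class case would simply fit one such boundary per normal class and flag an instance only when it falls outside all of them.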
1.4 Related Work

Research done on anomaly detection can be separated into a number of sections, based on focus. There are several surveys, articles and books that discuss a number of different techniques with widely differing foundations. These broader reviews consider a large number of algorithms together with a number of domains. Chandola et al. touch on classification, clustering, nearest neighbor, statistical, information theoretic and spectral algorithms, and consider them for each of the domains cyber-intrusion, fraud detection, medical anomaly detection, image processing, textual anomaly detection and sensor networks [13]. Hodge and Austin present a survey similar to the above, but with a slightly slimmer scope, in that it does not go through the various application domains for anomaly detection techniques but focuses instead on the techniques themselves and their variants [21]. Hodge and Austin define three fundamental approach types to the problem of outlier detection, based on what knowledge is available as well as what is being modeled.

Markou and Singh have published a very extensive review of statistical approaches that introduces a number of principles useful in novelty detection and related problems. Among the considered statistical approaches are Hidden Markov Models (HMM), k-nearest neighbor (kNN) and k-means clustering [32]. Markou and Singh have also reviewed neural networks extensively, discussing multi-layer perceptrons, support vector machines, auto-associators, Hopfield networks and radial basis function approaches among others, to give a good outline of the available algorithms within the neural network class [33]. Naturally, there are also a number of articles focusing on the individual techniques mentioned by the broader surveys, many of them used as sources by those surveys. Stefano et al. consider the use of an added reject option in a one-class neural classifier, with the reject option depending on a reliability evaluator based on the classifier's architecture [36]. This reject option allows the system to reject a sample rather than classifying it with low reliability (essentially refusing to choose rather than chancing it). Abe et al. have developed the idea of reducing the problem of outlier detection to a classification problem, which can then be solved using active learning techniques [1]. Gwadera et al. consider the use of machine learning together with sliding windows to detect suspicious sequences of events in an event stream, where they set up dynamic thresholds for the number of suspicious events that are allowed before an alarm is raised [17]. Ma and Perkins also consider temporal sequences, presenting an on-line novelty detection framework for temporal sequences using support vector machines [29] [30]. Also on the subject of SVMs, Mika et al. discuss how to use SVMs to create a boosting algorithm, showing by equivalent mathematical programs that this can be done [34].
Kohonen has presented a very extensive book focusing on self-organizing maps, detailing the variants of the algorithm and mathematical considerations among other things; it has been well received and is referred to by all the wider surveys considering self-organizing maps [28]. Ando has presented an information theoretic analysis on the subject of minority and outlier detection [5]. This analysis is abstract for the most part and focuses on clustering; an algorithm is also presented and evaluated within the analysis. Aggarwal and Yu discuss challenges specific to high dimensional data, such as distance measures not being meaningful, and present some solutions to the problems [3]. There are also a number of articles dealing with the domain of Wireless Sensor Networks (WSNs). Much of this research discusses the specific challenges of WSNs; Branch et al. [9] and Janakiram et al. [24], for example, discuss limited battery power, limited computational power and high error probability, and how such things influence the choice of algorithms.

Chapter 2

Theory

2.1 Preliminaries

The preliminary theory consists of deriving a constant that we may call d_0, and a formula that can be used to scale the diagonal of a bounding box depending on where in the image the box occurs. These derivations depend on the assumption that there exists a horizontal base-plane in the image that acts as a floor. By assuming this base-plane exists, all detected movements will follow this floor plane, and the size changes in the bounding box and its diagonal will therefore be predictable.

Assumption 1: There exists a ground-plane within the image, defined as the floor, on which any movement will occur.

2.1.1 Deriving the Distance between the Pet and Camera

Figure 2.1 shows in detail our efforts to derive the distance d between the camera and the pet, or in other words, to find the ratio h/d to use as a scaling factor. The values below are assumed to have been provided, either by the camera or by the user through some interface outside the scope of this article.

α: camera tilt angle
β: camera field of view
h: camera height from the floor, in cm
l: pet length, in cm
p: picture height in pixels (vertical resolution)
P_d: height from the bottom of the picture to the detected movement box, in pixels

Figure 2.1: The defined angles used in the derivation of a bounding box scaling factor, with the variables defined in section 2.1.1. The camera image plane can be seen, as well as how the pet is projected onto the plane.

To find the distance d, the angle v can be used, but that angle must first be derived. If P_d = 0, that is, if the pet is detected at the very bottom of the picture, then the angle v is simply α + (β/2). But whenever P_d is more than zero, the angle v will need an appropriate value subtracted, to account for the small slice that should not be counted. The angle that should be subtracted is β · (P_d/p), where P_d/p is the ratio at which the β angle should be divided to give v the correct value. This can be easily visualized if one considers P_d to be 1/2 of p. This would mean that β should be split into two equal pieces, exactly as is done by the focus line shown in the figure, and v would then be identical to α. By this reasoning, the angle v will be:

v = α + β/2 − β · (P_d/p) = α + β · (1/2 − P_d/p)   (2.1)

By using the definition of the sine function, equation 2.2 can be defined as below. This is done by using equation 2.1 for v as the angle and the height h as the opposite side in the triangle, leaving the distance d between the pet and the camera as the hypotenuse.

d = h / sin(v) = h / sin(α + β · (1/2 − P_d/p))   (2.2)
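Equations 2.1 and 2.2 can be checked numerically with a short sketch. Angles are in radians and the example values in the sanity checks are arbitrary; the function name is an invention for this illustration.

```python
from math import sin, radians

def camera_to_pet_distance(alpha, beta, h, P_d, p):
    """Distance d between camera and pet, per equation 2.2.

    alpha: camera tilt angle, in radians
    beta:  camera field of view, in radians
    h:     camera height above the floor, in cm
    P_d:   height from picture bottom to the movement box, in pixels
    p:     picture height, in pixels
    """
    v = alpha + beta * (0.5 - P_d / p)  # equation 2.1
    return h / sin(v)                   # equation 2.2
```

Two sanity checks follow the reasoning in the text: with P_d = 0 the angle reduces to α + β/2, and with P_d = p/2 the angle v becomes identical to α.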

The ratio between height and distance (h/d) then becomes:

f = h/d = sin(α + β · (1/2 − P_d/p))   (2.3)

This scaling factor f can then be used to scale the diagonals for the respective positions in the image, and we do not need to perform any further calculations here.

2.1.2 Deriving the Diagonal Length of the Pet's Bounding Box as a Limit

The given values remain the same as in the previous section, and figure 2.1 will again be of interest. The given height h, together with a field of view length we call b, allows for the formulation of equation 2.4 below, by way of figure 2.2. The definition of the tangent function is used, with h being the adjacent side, b/2 the opposite side and β/2 the given angle.

tan(β/2) = (b/2) / h, which gives b = 2h · tan(β/2)   (2.4)

Under the assumption that we are using a camera based on the pinhole principle [12], the ratio of the pet length to the field of view length b will remain the same both within the picture and out in the real room space. This can be used to our advantage by defining P_h to be the length of the pet in pixels, giving us equation 2.5:

Length of Pet / Field of View Length = l/b = P_h/p, which gives P_h = l · p / b   (2.5)

Since we have already defined a formula for b in 2.4, we can simply plug this formula in to get equation 2.6:

Figure 2.2: Represents the relationship between the height of the camera and the horizontal length that can be seen with the field of view angle β. The camera is pointed straight down, placing the image plane parallel to the ground.

P_h = l · p / (2h · tan(β/2))   (2.6)

Assumption: If the pet length is l, then the diagonal of the box can be approximated as √2 · l.

While this assumption is not a good one, it gives a starting point. The approximation can later be modified if it is deemed too crude. With this approximation, and by using the expression for P_h instead of l, the diagonal of the box in pixels within the picture can be expressed as:

d_0 = √2 · l · p / (2h · tan(β/2))   (2.7)

This holds given that the box is positioned directly under the camera, regardless of whether the camera can see the box or not; if the box could be seen, it would have a diagonal of d_0 pixels. This gives a constant value d_0 that can be used to scale the diagonal to any height in the picture, provided we know enough to use the derivation in the previous section, finding the distance between the camera and the pet. When the position P_d > 0, the diagonal is scaled by multiplying expression 2.7 with the ratio found in formula 2.3 in the previous section, giving equation 2.8:

d_r = d_0 · f = d_0 · sin(α + β · (1/2 − P_d/p))   (2.8)

Any calculated diagonal can then be compared with d_r to see whether it is small enough to be considered a pet and therefore be ignored, or if it should be cause for alarm. That is:

f(d) = 1 if d ≥ d_r, −1 if d < d_r   (2.9)

2.2 Competitive Learning

The competitive learning paradigm is generally used with artificial neural networks. It can be used for any of the approaches described in section 1.2. In this paradigm, nodes compete for the right to represent a particular input, and whichever node is closest earns the right to learn from the input. The learning in this case usually consists of moving the winning node slightly closer to the input in terms of the feature space. In two or three dimensions this simply means that the winning node will be moved closer to the x, y, z position of the input.
In non-categorical data, the Euclidean Distance (equation 2.10) below is also often used to measure distance between the input x and the node y and by then comparing the distances identifying a winner. There are also a number of other distance measures used for different 12

19 2.2. COMPETITIVE LEARNING situations, such as the computationally expensive Mahalanobis distance (equation A.2) mentioned in section A.3 and the Manhattan distance measure. Manhattan distance is also often called taxicab geometry since it measures distance along cartesian axes, just as a taxicab would measure distance between a point x and a point y in a city. distance = n ( x i y i ) 2 (2.10) Two of the most widely used algorithms, the Vector Quantization algorithm for neural networks and the Self-Organizing Map operates unsupervised and can therefore be classified as a type 1 approach. The Self-Organizing Map was first introduced by Teuovo Kohonen using the vector quantization algorithm with unsupervised learning to produce a low-dimensional representation of the input space [28]. Hastie explains the behavior of the Self-Organizing Map as follows [20]: Constrained version of K-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space Since Self-Organizing maps uses a neighborhood function, that is, allows the nodes close to the winning nodes to learn a little from the input as well, the map created by the SOM algorithm preserves the topology of the input data. If this neighborhood function is not used, that is, if the winner takes it all strategy is used, then the system according to Hastie will be analogous to a k-means clustering system [20]. The look of the neighborhood function determines how the topology is preserved and which nodes get updates. In many cases, the neighborhood function will return a wide neighborhood to start with, to give the whole map the general shape. By starting with a wide neighborhood, the chance is small that a part of the map is completely void of updates and remain in its start state. Gradually as learning goes on the function returns a smaller neighborhood, which translates to a more finely tuned topology over a smaller section of the map at the time. 
This can be easily visualized by considering a three-dimensional surface, where billowing hills are the result of a wider neighborhood being used and smaller, sharper peaks are the result of a smaller neighborhood. The concept of a learning rate δ, often used in machine learning, controls the time the system takes to converge. If a high δ value is used, for example δ = 1.0, fluctuations and divergence may occur, but the lower the δ value the slower the convergence. In essence, the δ value controls how much the system may learn from a single training pattern. Depending on the complexity of the network, as well as the availability of data, a system might be run for a number of iterations (epochs) through the training set to reach convergence. In the case of on-line classification, there might be no need to run several iterations, since convergence might not be needed.
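The interplay between the learning rate and the neighborhood function can be sketched as follows. The Gaussian neighborhood used here is a common choice but an assumption on our part, not necessarily the function used in the thesis.

```python
import math

def neighborhood_weight(winner, node, sigma):
    # Gaussian neighborhood: nodes close to the winner learn almost as
    # much as the winner, distant nodes barely at all. A large sigma gives
    # the wide, map-shaping updates used early in training; shrinking
    # sigma over time gives the finer local tuning described above.
    d2 = sum((w - n) ** 2 for w, n in zip(winner, node))
    return math.exp(-d2 / (2 * sigma ** 2))

def update(node, x, delta, weight):
    # delta is the learning rate: how much one training pattern may teach.
    return [n + delta * weight * (xi - n) for n, xi in zip(node, x)]
```

With delta = 0.5 and full neighborhood weight, a node at the origin moves halfway toward the input; a node far from the winner receives a near-zero weight and barely moves.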

Kohonen [28] mentions some ways of speeding up the SOM calculations using pointers to tentative winners, which reduce the number of comparison operations from quadratic, when performing learning through exhaustive search, to linear when using the pointers. While that would speed up the SOM, it is still not as quick as the Hopfield net, due to the learning procedure involved both in training a SOM and in querying the system [21].

Chapter 3

Method and Implementation

3.1 Simulation

There are several ways of obtaining the data needed for a learning system; the most effective method from the system's point of view is to use real-world data. In many cases this is infeasible, however, due to the impossibility of collecting the amount of data needed as well as the cost of acquiring such data. Simulating data is a cheaper solution if real-world data is not available, but the actual simulation requires some work if the generated data is to be accurate to any degree. To provide data useful for the scenario outlined in this article, the simulation needs to be defined both by a number of parameters and by a number of rules and assumptions. Each of these assumptions will separate the generated data somewhat from real-world data, but the assumptions also lower the time and complexity of programming the simulation, which is direly needed in this case to allow the focus of the article to lie on the learning algorithm rather than the simulation. The actual programming of this simulation will happen in several steps, consisting first of defining the basic assumptions used and thereafter stepwise improving the assumptions to provide a better modeling of the data.

Assumption 1: The three-dimensional space used will have a right-handed basis, meaning that when x points to the right, y will point straight up and z will point straight out of the paper.

Assumption 2: Input parameters to the simulation will be given in three dimensions, with the camera placed at the Cartesian coordinate system position (0, 0, 0), and it will point straight ahead along the z axis for simplified calculations.

Assumptions one and two mainly define how the conversion from three-dimensional space to two-dimensional screen space will occur. Carlbom and Paciorek present information about how to project the three dimensions down to two dimensions in the same way that certain cameras do [12].
The fact that cameras use similar techniques allows the simulation to work in the simpler-to-define three dimensions while still achieving the important perspective effects: something being smaller further away from the camera despite being the same size in three dimensions, and shapes changing when projected down depending on their position relative to the camera. Similar calculations to those presented by Carlbom and Paciorek can be found in many books and lecture notes dealing with computer graphics, since the perspective projection transformation is so vital to that field of computer science. Since the cameras used in the live situation will provide coordinates in two dimensions with proper perspective and positioning, the simulation will need to provide two-dimensional points as well; otherwise the differences between the simulation and the live situation will be too large to present any meaningful data.

Assumption 3: Movements outside defined windows will primarily be parallel to the window, with few exceptions. The size of the shapes moving will be arbitrary.

Assumption three mainly provides a starting point for the learning algorithms. The assumption is that the better part of the movements recorded are of people and cars walking and driving by the window, and thus these movements are usually parallel to the window. With this basic assumption made, changes can be made later on to provide a more accurate representation of such movements.

Assumption 4: Movements made by pets within the room will have an arbitrary direction and velocity. The movement events will be defined by the size of the pet.

There are a number of ways that pets can move within a house, with varying speed, positions of rest and general movement. These cannot all be simulated, and even simulating a single one of these continuous movements realistically is time-consuming and complex.
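A minimal perspective projection consistent with assumptions 1 and 2 (camera at the origin, looking along the z axis) might be sketched as below; the focal parameter is an illustrative stand-in for the full camera model used in the actual simulation.

```python
def project(point3d, focal=1.0):
    # Perspective division: distant points shrink toward the image center,
    # which gives the size-with-distance effect described above.
    x, y, z = point3d
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z, focal * y / z)
```

An object at depth z = 2 projects to half the screen-space size of the same object at z = 1, which is exactly the perspective behavior the simulation needs to reproduce.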
Therefore the first basic assumption is that movements recorded from pets are not connected, and that pets move more or less randomly. This does not fit very well with reality, but if such arbitrary movements can be classified with some reliability, then more regular movements should hopefully be easier to classify. Regardless, this assumption can be improved upon at a later date if the simulation is kept. The simulation generates the events used by the system, and it employs some programmatic techniques (mainly inheritance) that provide easier implementation of events, both events needed by the system and ones outside the scope of this article. By employing a time step and querying a normal distribution for a random value, the simulation checks whether a new event should be generated. Each event type to be simulated has a slightly different distribution connected to it, and that distribution decides the ratio of events and at what times during the day the events should be concentrated. When an event has been generated, it is appended to an output file that is used as input by the learning system.
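The generation step can be sketched as a Gaussian daily profile per event type; the parameter names and the exact mechanism below are illustrative assumptions, not the thesis's actual distributions.

```python
import math
import random

def event_probability(hour, peak_hour, spread, scale):
    # Gaussian daily profile: events of a given type cluster around that
    # type's peak hour of the day.
    return scale * math.exp(-((hour - peak_hour) ** 2) / (2 * spread ** 2))

def step(hour, peak_hour, spread, scale, rng=random):
    # One simulation time step: generate an event with the profile's
    # probability at the current hour.
    return rng.random() < event_probability(hour, peak_hour, spread, scale)
```

Each simulated event type would carry its own (peak_hour, spread, scale) triple, giving different event ratios and different busy hours per type.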

3.2 Visualizing the Results

Data from the live scenario consist of either scalars or two-dimensional data, which means that they can easily be visualized using two-dimensional graphics in any mathematical program. The simulation can also return data in three dimensions, the points before they have been projected into the two-dimensional screen space. Therefore it can be useful to provide a top-down view of a defined room for these three-dimensional points. In the top-down projection the y axis is simply ignored, giving an R³ → R² projection. Figure 3.1 shows an example of how the top-down projection looks with some example data. Since the reverse projection, R² → R³, is not easily done, the top-down view only works for simulated data. To provide a data visualization useful for both the simulation and the live scenario, we might wish to present the two-dimensional screen projection of the bounding boxes defining a detected movement event. Adding windows to this screen-space projection is done by projecting the three-dimensional coordinates of the window in the simulation case, or by using given two-dimensional screen-space coordinates (see section 3.4). This visualization, as shown by figure 3.2, is especially useful for reviewing the effect of window event classifications, since it shows windows within the two-dimensional screen space used by the live scenario. Other events, such as pet events, may not be as apparent since they are not as constrained by the room geometry.

Figure 3.1: Top-down view example using only the window classifier. Data points are the center points of detected movement bounding boxes, with black circles being considered normal and red squares anomalous.
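The top-down view amounts to dropping the height axis; a sketch of this R³ → R² mapping:

```python
def top_down(point3d):
    # Top-down view: ignore the y (height) axis entirely, keeping the
    # floor-plane coordinates (x, z).
    x, y, z = point3d
    return (x, z)
```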

Figure 3.2: 2D perspective projection example using only the window classifier. Red squares show unrelated events that are there merely to sidetrack the window filter, black circles are successfully classified window false alarm events, and red circles are window events that the window filter has not managed to classify as such.

While the above visualizations lay the foundations by showing training and test data and the classifications for those sets, additions to these foundations allow for a more informative visualization. To appropriately show the effect and behavior of the competitive learning algorithm used by the window classifier (see section 3.3), the nodes used in the algorithm need to be visualized. Since the nodes work only in two dimensions, the screen-space projection visualization above is useful for adding the nodes. Figure 3.2 also shows the starting positions of the competitive learning nodes as dots and the end positions of the nodes after training on a set as stars. For the window classifier, visualizing the individual nodes makes sense, but the nodes used by the Self-Organizing Map for pet sizes in section 3.4 never move in the two dimensions visualized by the figures above. The nodes do however keep a value for the bounding box diagonal that can be used as a third dimension, making a height map a powerful visualization tool for the pet classifier. If the scale of the map is chosen to be the same as in the screen-space projection, the previous visualization (figure 3.2) and the height map can be presented together to show a more complete picture, such as figure 3.3.

Figure 3.3: 2D perspective projection with corresponding SOM height map. The height map shows how the scaled pet sizes have been modified from the default scaling by the algorithm to accommodate pets above the floor plane. Lighter areas allow a larger pet.

3.3 Keeping track of windows using Vector Quantization

Assumption 1: Windows are the only sources of detected movement that can cause false alarms, and only from movements outside the windows, such as pedestrians and cars.

Assumption 2: Initial window coordinates are given to the system either by the user or by the system in some way outside the scope of this article.

For the sake of discussion, let us assume that the above assumptions hold. Then the easiest option for eliminating false alarms would be to simply ignore any events contained within any windows, provided that the window positions are known. By having the user fill in where in the image the windows are, events within the defined areas could then be ignored, and as long as the camera angle and position remain the same, the system would know which events to ignore.
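Under these two assumptions, the naive window filter reduces to a point-in-rectangle test; a sketch, with hypothetical box and event representations:

```python
def inside(box, point):
    # Axis-aligned containment test in screen space; box is (x0, y0, x1, y1).
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1

def naive_filter(event_center, windows):
    # Naive approach: an event is a false alarm if its center falls inside
    # any user-defined window region. This breaks as soon as a camera
    # adjustment shifts the windows in the image.
    return any(inside(w, event_center) for w in windows)
```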

Figure 3.4: The window events have no offset, as can be seen by the fact that the windows have not moved, and most events are classified as false alarms (black circles). Red squares represent events that cannot reliably be classified as false alarms.

Figure 3.5: The events have been offset by a change in camera angle, and due to this the window filters have moved to accommodate the change. The start positions of the filters are represented by red dots, and the end positions by blue stars.

There is a major problem with this naive approach, namely the assumption that the camera angle will remain constant. While cameras may not move drastically from day to day, the company has stated that the cameras will turn the lens toward the roof when deactivated, to preserve privacy. The angle they return to may then differ somewhat, which in turn means that the filter initially provided by the user may be slightly misplaced, possibly resulting in false alarms from the windows. If an identical event occurs before and after a camera adjustment, then with the filter properly positioned the event will be properly classified as a false alarm and ignored. But after the adjustment, the filter may be slightly misplaced and the system may classify the event as a real alarm, despite the event being identical to a previously classified one. To solve this inconsistency with camera movement, one option is to try to track the position of the window using the events generated by it. To effectively track a given window, only movement events close to the window should be considered,

something that can be done by applying distance limit parameters. This should automatically remove events from windows other than the chosen window, provided that the distance limits are small enough and that the camera movement is not too large. To perform a simple tracking, the filters may be pulled somewhat in the right direction, to always try to keep the filter in the center of the closest window, provided one filter exists for every window. Since only the filter closest to any specific window should be moved for a given input, by thinking of the windows as isolated nodes, many of the different competitive learning algorithms can be applied to this problem. This works since competitive learning algorithms, as mentioned in section 2.2, compete for the right to process and learn from a subset of the possible input space. By using competitive learning with a winner-takes-all strategy, as has been done in figures 3.4 and 3.5, the nodes will only learn from events generated by their own specific window, if the distance limit parameters are appropriate. An issue with tracking the window in this way occurs if the first assumption above does not hold. If there are unrelated events, say from pets, they have the ability to influence the behavior of the window tracking. Such influence would make the window filter unreliable, since any event, at any distance from the window or the current nodes, would pull the nodes away from the window, spreading and warping the filter, as can be seen in figure 3.6. The severity of this issue can be reduced in various ways, the simplest being to create some additional requirements for when the window nodes may learn from an event. The system should always be able to classify an event, regardless of whether it is allowed to learn from it or not.
Setting a maximum Euclidean distance (defined in equation 2.10) allowed between the original window center and the event lets the filter stay reasonably close to the original window position, while still allowing flexibility and limiting the effects of interference from unrelated events. Obviously there are other solutions available for this particular problem, for example filtering with other filters beforehand so that the assumption above does hold for most reasonable cases, but the solution proposed above is far simpler and less computationally expensive. If the results are acceptable in this situation, then using the simpler solution is often the best choice.

Figure 3.6: Without constraining the filter using maximum distance limits, the filters may use inputs from too many unrelated events and as such become unstable and unreliable. The middle window filter has moved far away from its position. Black circles are false alarms, red squares are unrelated or real alarms.

Figure 3.7: If the limiting parameters are used, the results after training on the data set used in figure 3.6 are greatly improved. The filters stay reasonably close to their original positions, and while the trainer still has some problems with the rightmost window, this is expected due to the event density.

Following is a list of the parameters used in the implementation of the Vector Quantization algorithm defined in pseudo-code below (algorithm 1):

numnodes: Number of nodes used by the learning system. In a winner-takes-all strategy, each node represents a window within the image. If learning is allowed for nodes close to the winner as well, a number of nodes could together represent a window.

delta: The learning rate of the system. This controls how much a single data point influences the system.

singlewinner: Whether the winner-takes-all strategy should be used. If the single-winner strategy is not used, all nodes in an area may converge at a specific point, which may or may not be useful depending on the situation.

neighborhoodsize: Only relevant if the single-winner strategy is not used. This variable describes the size of the neighborhood around a winning node whose members also get updated.

maxdist: The maximum Euclidean distance from a node within which a data point can affect it, even if the node is the winner. Used together with maxdistwin to constrain the filter to a window.

maxdistwin: The maximum Euclidean distance that a node can move from a window centroid, calculated from the initial positions defined by the user. Used together with maxdist to constrain the filter to a window.

Algorithm 1 Vector Quantization algorithm

{Initiating the nodes}
for all node in nodes do
    window ← random(windows)
    node ← randomPosWithin(window)
end for

{Learning and classifying events}
for all event in events do
    closestNode ← minEucDist(event, nodes)
    D ← eucDist(closestNode, event)
    closestWin ← minEucDist(event, windows)
    Dwin ← eucDist(closestWin, event)
    diff ← event − closestNode
    if D < maxDist and Dwin < maxDistWin then
        closestNode ← closestNode + diff · delta
        if not singleWinner then
            for all node in neighborhood(closestNode) do
                D ← eucDist(node, event)
                node ← node + diff · delta / D
            end for
        end if
    end if
    {Classifying the event}
    if event isWithinWindow(closestNode) then
        return true
    else
        return false
    end if
end for
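The learning step of algorithm 1 can be sketched in Python for the winner-takes-all case. The distance limits follow the pseudo-code, but the data representation (tuples of coordinates) is an illustrative assumption.

```python
import math

def euc(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def vq_step(event, nodes, window_centers, delta, max_dist, max_dist_win):
    # One winner-takes-all learning step: find the winning node, and move
    # it toward the event only if the event lies close to both the node
    # and the nearest original window centroid. These two distance limits
    # keep unrelated (e.g. pet) events from dragging the filter away.
    i = min(range(len(nodes)), key=lambda k: euc(event, nodes[k]))
    d = euc(event, nodes[i])
    d_win = min(euc(event, c) for c in window_centers)
    if d < max_dist and d_win < max_dist_win:
        x, y = nodes[i]
        nodes[i] = (x + delta * (event[0] - x), y + delta * (event[1] - y))
    return i  # index of the winning node, used for classification
```

Classification would then check whether the event lies within the window represented by the winning node, as in the pseudo-code.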

3.4 A Self-Organizing Map as a Height Map for Pet Size Thresholds

After the window events have been dealt with, it is of interest to consider pet events, since they make up the second largest false alarm source in some situations. Pet events occur when pets move within the field of view of the cameras and therefore get detected. A reasonable starting assumption, like the one in section 3.3, is that only pet events will cause false alarms.

Assumption 1: The only sources of false alarms are pets being detected within the field of view of the camera.

Assumption 2: Pets generally move along the floor plane, but they may move in an arbitrary yet predictable manner. They may for example have favorite spots diverging from the floor plane, such as on top of a sofa or a table, depending on the pet.

Assumption 3: The length of the pet is given to the system either by the user or by the system in some way outside the scope of this article.

In general, the main difference between pets and their owners is their size. Pets do have a different shape than humans, but this shape may not always be visible in the picture due to angles and positions, and therefore the bounding box may not be that different from the bounding box of a moving human. Pets are in general smaller, which can be used to differentiate between humans and pets. A naive approach to filtering out pet-related events would then be to simply classify any event where the bounding box is smaller than or equal to a pet as a false alarm. While this would work for some events, it would cause more problems than it solves, due to the simple fact that without applying any scaling, a box only just fitting a cat could just as well be a human further away from the camera. Therefore the problem becomes two-fold, provided that the user supplies some data about the camera and the pet.
First the event needs to be scaled depending on where in the image the movement takes place. After the scaling has been done, the resulting values can be compared with threshold values to decide whether they should be classified as pet-related false alarms or as real alarms. Later sections detail the math behind the scaling operation. The scaling operation uses information about the length of the pet, the height of the camera, and the tilt and field-of-view angles of the camera. With this information a scaled diagonal of the bounding box can be calculated, depending on at which height in the picture the bottom corners of the box are positioned. Since self-organizing maps are topology preserving, they can create height maps where the height corresponds to the threshold values. Further, since scaling the event position coordinates (x, y) to fit the map can be done in constant time, the time complexity of classifying an event is O(1). During training, if the pet has a spot it likes, for example on a couch, the corresponding nodes in the map will learn to allow larger diagonal sizes to accommodate the difference from the floor plane, which is the norm. The self-organizing map will in essence become a height map, as can be seen in figure 3.8, where the height value used is the scaled diagonal allowed at that position. The initial normal class consists of the scaled diagonals allowed at the different nodes, but after training the normal class also includes the changes made to the map to accommodate the anomalies.

Figure 3.8: Training the SOM using 200 simulated training points. Black circles close to the lower edge are already small enough to be allowed due to the scaling, and as such do not cause any changes to the map. The red squares represent data points not conforming to the scaling, and the map tries to accommodate these anomalies.

Below, pseudo-code of the proposed algorithm has been included for completeness.

Algorithm 2 Self-Organizing Map for pet size thresholds

{Initiate map with default thresholds}
for all col in columns do
    for all row in rows do
        x ← scale(col)
        y ← scale(row)
        matrix(x, y) ← scaledDiagonal(x, y)
    end for
end for

{Learning and classification phase}
for all event in events do
    diagonal ← event.diag
    x ← scale(event.x)
    y ← scale(event.y)
    storedDiag ← matrix(x, y)
    diff ← diagonal − storedDiag
    if diff > 0 then
        if not singleWinner then
            for all node in neighborhood(x, y) do
                dist ← sqrt((x − node.x)² + (y − node.y)²)
                change ← diff · delta / (dist + ε)
                matrix(node.x, node.y) ← matrix(node.x, node.y) + change
            end for
        else
            matrix(x, y) ← storedDiag + diff · delta
        end if
    end if
    {Classifying the event}
    if diff > 0 then
        return true
    else
        return false
    end if
end for
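A minimal Python sketch of the learning step in algorithm 2. The grid representation and the damping term in the denominator are assumptions on our part (the pseudo-code divides by dist + ε), and the scaledDiagonal initialization is taken as given.

```python
import math

def som_step(grid, x, y, diagonal, delta=0.3, radius=2.0, damping=1.0):
    # grid maps (col, row) -> currently allowed scaled diagonal.
    # If the event's diagonal exceeds the local threshold, raise the
    # winning node and its neighborhood, so that repeated anomalies (a
    # pet's favorite spot on a couch, say) join the normal class over time.
    diff = diagonal - grid[(x, y)]
    if diff > 0:
        for (cx, cy) in grid:
            dist = math.hypot(x - cx, y - cy)
            if dist <= radius:
                # The update falls off with distance; damping = 1.0 stands
                # in for the dist + epsilon term of the pseudo-code.
                grid[(cx, cy)] += diff * delta / (dist + damping)
    return diff > 0  # larger than the local threshold -> real alarm
```

Note that the map only ever rises, matching the one-directional learning discussed in section 4.2: an event below the local threshold is classified as a false alarm and causes no change.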

Chapter 4

Results and Conclusions

Before going further into the individual and combined results of this project, one thing must be clearly stated. Due to factors outside of the author's direct control, no live data was available for training and testing, which was not originally intended. All of the results and conclusions are therefore based on data provided by the simulation detailed in section 3.1. This affects both the available results and the discussion regarding them, as well as what conclusions can be drawn and the suggestions for future work.

4.1 Window Adjustment Filter

Even though the window filter has some naive elements in its implementation, it can be seen in the following figures that the filter handles both skewed distributions and fairly large window offsets reasonably well, despite the naive distance limits explained previously. Datasets that consist of only window-related events, with no unrelated events such as pet events or other random events, can be seen in figures 3.4 and 3.5. There it can be seen that the error rate is close to zero for the training case, and that the test case for that distribution has similar results. What is of most importance are the results on the testing set, that is, the results on data that has not yet been seen by the system. To accurately measure the results, the test and training sets should have few discernible differences in terms of distribution. With this in mind, we will mostly be considering the testing sets.

Favored Distributions

A distribution favoring a classifier is one where unrelated events make up a smaller part of the dataset than related events, allowing for less interference from such events. If a distribution is not favored, the opposite is true. Then the classifier has fewer relevant events to work with and has to cope with more interference from unrelated events.
Figures 4.1b and 4.1c show that in general, on distributions favoring the window classifier, a moderate learning rate (δ) seems to be most effective, with a success rate of 97% for the slightly offset windows. This also fits well with what is generally known: the learning rate should not be too high, and the default learning rate that many people use falls within a similar moderate range. Since in this scenario the only time there is a need for learning is if the windows have been offset, it makes sense that figure 4.1a shows that the best value for delta is very low (0.02). This simply means that the system is already in the best state it can be, and further learning will only cause the system to overtrain. Even with heavily offset windows, such as those in figure 4.2, the system manages a degree of success, topping out at 72%. As can be seen in the figure, the two leftmost windows have fewer problems than the rightmost window, which has to do with the fact that the rightmost window is on another wall in the simulated house.

Figure 4.1: (a) No window offset. (b) Slight window offset. (c) Heavy window offset. Success rates using data sets favoring the window classifier. The three diagrams show how the learning rate influences the resulting classifications. In sets where the windows are offset, like 4.1c, a very low learning rate will cause a low success rate since the system cannot react to the change quickly enough. Note the difference in scales between the diagrams.

Figure 4.2: A classification using δ = 0.14, the best choice of learning rate from figure 4.1c, showing heavily offset windows. Red squares show unrelated events that are there merely to sidetrack the window filter, black circles are successfully classified window false alarm events, and red circles are window events that the window filter has not managed to classify as such.

By being on the other wall, the angle to the camera is different, and therefore events may be positioned differently in the two-dimensional space. Add to that the fact that since the window is on the other wall, it is thinner than the other windows, which is also reflected in the filter. The same can be seen in similar datasets when there is a heavy offset in that direction. When using neutral or non-favoring distributions with the window classifier, it can be seen that the filters are negatively affected in some cases. Without using any limiting constraints, the filter may end up in the situation shown in figure 3.6, but this is a very extreme case with no constraints active. A more realistic example is given by figures 4.4 and 4.5, showing the projection view and the top-down view of a non-favored distribution classified by the window classifier. In the first figure, it can be seen at the rightmost window that the filter has problems classifying the rightmost points. This is most likely a result of nearby unrelated events influencing the node, coupled with the difficulty of the rightmost window. While the result of classifying the distribution discussed above is good (87%) for the chosen learning rate (δ = 0.08), figure 4.3 shows that this result is highly dependent on the learning rate chosen. The success rate takes a sharp dive right after the top value before settling at a very stable 75%.

Figure 4.3: Diagram showing the effect of different learning rates (δ) on a dataset not favoring the window classifier. There is a heavy offset, which can be seen in figures 4.4 and 4.5, which use the same dataset.

Figure 4.4: For visibility, the unrelated events have been removed from this plot. The events are still there to affect the classifier, and can be seen in figure 4.5, which is the top-down view of the same data set. Black circles represent correctly classified window events, and red circles represent incorrectly classified window events.

Figure 4.5: A top-down view showing the window classification on a heavily offset, unfavored dataset. Here black circles correspond to successfully classified events, and red squares correspond to what has been classified as unrelated events.

To avoid completely cluttering this section of the article with figures, the best results from the various datasets and classifications have been combined into table 4.1. As can be seen, a high success rate can be achieved for both slight and heavy offsets, depending on the chosen learning rate. The table does however show the fact mentioned above, that the window filter has difficulties with unfavorable distributions in certain situations. These problems tend to involve certain windows and window configurations, such as the one used in all the figures in this section. Some possible solutions to these problems are discussed in sections 4.3 and 5.

Data Set Distribution            δ-value   Success Rate (%)
Window favored, no offset
Window favored, no offset
Window favored, slight offset
Window favored, heavy offset
Pet favored, no offset
Pet favored, slight offset
Pet favored, heavy offset
Pet favored, heavy offset

Table 4.1: Window classifier specific results from various dataset distributions at a specific δ value.

4.2 Pet Filtering

Due to the random nature of how the simulation generates pet-related events, it is hard to create scenarios that are truly lifelike. Therefore the focus of this section is to show how the filter looks after training, without deeply considering the dataset the filter is trained on. Depending on whether the calculated default values are used or the height map starts at zero, the resulting map varies widely. For all the figures within this section, a learning rate of δ = 0.3 has been used. This decision was made to allow the filter to learn fairly quickly, to lower the need for an extended training period and large datasets. Since the system learns in only one direction (raising the map), there is no risk of a node fluctuating between two points. The only adverse effect a high learning rate might have is that the system might learn too much from a single occurrence. While this is an important factor, as it might lead to over-training, given the lack of real, lifelike data we are considering other factors and leaving this to be revisited in the future, when real data has been obtained. All the simulated pet-favored datasets have a high concentration of events along the lower edge, as seen in figure 4.6, with a lower concentration in the rest of the image. This is an effect of the camera projection and some simplifications made for it in the simulation. At the bottom of the image the largest diagonal sizes are allowed, since the closer a pet is, the larger the bounding box becomes, and as seen along the lower edge these events are allowed (classified as false alarms) after only one iteration. As the iterations continue, the height difference between these normal cases and the anomalies higher up in the image increases, to the point where only a few anomalies are shown and only one point is left that cannot yet be classified as a pet-related false alarm.
There are few cases in the live scenario where forty iterations would be used, since using so many iterations would cause a very high coupling between the specific training set and the SOM output, which would degrade the ability to generalize on unseen but similar data. Therefore, for further tests and comparisons only five iterations are used, which might hurt the test with no default values slightly in terms of raw success rate, but but will show how the different cases perform under identical conditions. If the default values are used during training, convergence will be much quicker since in theory there is no need to teach the system about what diagonal lengths should be allowed along the floor plane. Instead, only anomalies need to be learned, when pets move on top of furniture or stairs and as such leave the normality of the floor plane. The default values may also add robustness since the resulting map will most likely be much smoother than a map starting from a flat default value of zero. Examples of this can be seen by comparing the two figures 4.7 and 4.8, where the later is much more jagged, with obvious dips wherever the training data points do not reach. The former, by contrast, is very smooth with only a few peaks for anomalous points. The two figures have been created with the same axis values 32

to allow for a better comparison.

Figure 4.6: (a) Iterations: 1; (b) Iterations: 40. At early iterations much of the figure appears light, since the difference between the peaks is small. The figure darkens considerably, with only a few lighter peaks, as the iterations continue, meaning that those events are highly anomalous. Black circles represent events classified as false alarms, and red squares represent real alarms.

By comparing the curve in figure 4.7 created by the default values with the more jagged curve in figure 4.8, it can be seen that they are somewhat similar, showing that the default values can in fact provide improved generalization on datasets similar to those used for the pet filter. This similarity can also be seen in figure 4.9, where the raised maps have been placed in a ninety-degree sideways view for ease of comparison. While this result is by no means complete proof of the filter's effectiveness, it can be seen as a proof of concept for certain situations, and it remains to be shown whether the default values can provide similar results in a live situation with a camera in any reasonable position. This initial result rests on the assumptions mentioned in section 3.1, and as such cannot be taken as fact by the company until they have been verified in the various live situations that can occur. The author has therefore decided not to include a table of exact results for different iteration values, since those values would have little meaning in the live scenario.

Figure 4.7: A raised view of the map shown in figure 4.6, showing the points of the map raised beyond the normal case after five iterations.

Figure 4.8: If a flat default value of zero is used, the filter can still create a workable map, but this map will take longer to reach a similar success rate and will most likely end up with a much more jagged look. This figure uses the same dataset as figure 4.6, but with five iterations.

Figure 4.9: A ninety-degree sideways view of the map, showing the effect of the curving created by the use of the calculated default values (a) versus the use of a flat default value of zero (b). Five iterations have been used on the same dataset as in previous figures. Also note the differing y-axis scaling.

4.3 Conclusions

Both filters show some positive results, such as they are. The lack of data from a live situation unfortunately prevents many concrete conclusions regarding the effectiveness of the filters. It also means there is no effective way to measure the type II error rate (real alarms being misclassified), something that was wished for early on in section 1.1. This needs to be remedied if either of the algorithms is to be used in the live situation described. Integrating any code of some complexity into a working application without thorough testing may well lead to unforeseen consequences. Below is a (possibly incomplete) list of possible consequences:

- Higher type II error rate than expected due to misclassifying real alarms as false alarms.
- Long training phase due to low event detection frequency in the live situation.
- Overlearning, or lack of ability to generalize, after either a long training period or continuous learning.
- Incorrect assumptions about the nature of the live scenario, such as in the derived formulas in sections and .
- Camera and image related issues affecting the ability to provide the needed values for the filters, such as the α angle.
- Time and memory requirements to successfully train on live data might be higher than expected.
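Once labeled live data becomes available, the type I and type II error rates mentioned above can be measured with a simple tally. The event representation below is hypothetical, chosen only for illustration.

```python
def alarm_error_rates(events):
    """Each event is a pair (is_real_alarm, filter_suppressed) of booleans.

    Type II error: a real alarm suppressed (classified as a false alarm).
    Type I error: a false alarm let through (raised as real).
    Returns (type1_rate, type2_rate); 0.0 when a class is absent.
    """
    real = [e for e in events if e[0]]
    false = [e for e in events if not e[0]]
    type2 = sum(1 for e in real if e[1]) / len(real) if real else 0.0
    type1 = sum(1 for e in false if not e[1]) / len(false) if false else 0.0
    return type1, type2
```

Even this trivial measurement requires ground-truth labels, which is exactly what the missing live data would have provided.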

The window filter is simple to implement and simple to modify, but its versatility is relatively low. At the moment the filter only detects rectangles, though this could easily be extended in the future to include arbitrary shapes. Without arbitrary shapes, the window filter is too limited to provide all the functionality one might need in a motion detection filter, such as the ability to define an area that should be excluded from detection. While the window filter will try to keep track of these shapes through the events given to it, if many unrelated events occur or the camera movement is too large for the filter to accommodate, the filter will not know how to handle the situation. A more robust filter, better suited to the live scenario, would need contingency plans for dealing with such situations, some of which are discussed in chapter 5.

The pet filter, with its more advanced design, is much slower to train, since it needs to learn what constitutes a normal state for the environment, even if the default values help by giving some information about normality. With the default values, the pet filter goes from the slow convergence typical of SOMs (and neural networks) to a much quicker convergence, since only anomalies need to be trained. However, if the default values are faulty, their introduction may add type II errors to the system, so the default values will need thorough testing at the least to make sure this is not the case. The fact that the pet filter, with correct training and choice of parameters, can perform the same work as the window filter in addition to its pet-specific functionality makes it a more attractive choice for continued development.
By putting more time into developing the pet filter, there might be no reason to implement the window filter at all, focusing development efforts and saving both time and money.

Regarding the choice of algorithms for the two filters, the author feels that competitive learning algorithms have many attractive features for these types of implementations. Many other available algorithms, such as those presented in the appendix, also have attractive features, and standard implementations of these could also yield favorable results. The main feature considered by the author at the time was the ability to handle unsupervised learning: while there are situations where supervised or semi-supervised learning could be used in the live scenario, in many cases this would require human interaction to decide whether or not any particular event is a false alarm. That would make training the system a very tedious chore, making training speed a much more important factor than it otherwise would be.

As the project continued, it became more and more apparent that the initial constraint mentioned in section 1.1 should have been considered more thoroughly: the algorithms may not assume access to any of the pictures taken by the camera; the only data available is the surrounding data and the bounding box information.

Constraining the system in this way was thought to bring both benefits and limitations. Among the benefits considered were quicker learning and smaller space requirements due to smaller data instances, easier simulation of data instances, no risk of sidetracking deep into the subject of image analysis, and reuse of code already developed by the company. The primary limitation considered when the decision was made was that without image data, there was no way to make shape-specific choices in event detection and classification. While this was true, as the project continued it was found that this was not the most problematic consequence of excluding the image from the data instances. Without images, the data instances became so simplistic that it was very hard to distinguish between a valid movement and a break-in attempt, and as a result the learning became more erratic. This was explored when testing a support vector solution (see section A.1): with only the information given in the scenario, there was no way for the SVM to pinpoint the combination of features that should be considered anomalous. That is, the classifier could not separate the normal and anomalous classes in the given feature space.

After the failed SVM implementation, the choice was made to focus on fewer features, which allows more basic assumptions to be made about the data. The resulting implementations are those described in this article, and as can be seen they make very basic assumptions, for example that all events happening within an area defined as a window are to be considered false alarm events, without any consideration of size or velocity. While this means that the window might be opened without an alarm being raised, if the intruder moves away from the window the alarm will trigger, since the detected movement is no longer close to the window.
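The basic window rule can be illustrated with a containment test. Whether the event's center or its whole bounding box must fall inside the window is not pinned down here, so the center-point rule below is an assumption made for the sketch.

```python
def is_window_event(event_box, window_rect):
    """Classify a motion event as a window-related false alarm if the
    center of its bounding box lies inside the window rectangle.

    Boxes and rectangles are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    Size and velocity are deliberately ignored, matching the basic rule.
    """
    cx = (event_box[0] + event_box[2]) / 2.0
    cy = (event_box[1] + event_box[3]) / 2.0
    wx1, wy1, wx2, wy2 = window_rect
    return wx1 <= cx <= wx2 and wy1 <= cy <= wy2
```

An intruder stepping away from the window immediately fails this test, which is why the alarm still triggers once movement leaves the window area.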
To conclude, the algorithms have shown some positive results that the company may wish to explore and improve upon, but the introductory constraint disallowing image data should have been considered more carefully before the decision was made. With such image data, the company might have gotten more immediate practical use from this project. However, since the simulation would likely have been impossible to create with any accuracy if image data were used, this would have required live data to be collected either during the project or beforehand.


Chapter 5 Future Work

As has already been mentioned in detail above and in earlier sections of this article, the most important task for the future is to run tests on live data. With live data available, the algorithms can be expanded upon and taken in any direction the company wishes. The company could then decide whether to pursue several quick and simple filters that each focus on one aspect of the feature space, or to improve upon one or two larger algorithms that provide the full functionality they feel they need.

From a more theoretical standpoint, under the assumption that live data is available, consideration should be given to measuring training and test performance and error rates, not only for the company's specific needs but also to obtain comparable numbers for implementations with similar functionality but differing machine learning techniques. Since competitive learning was one option among many, it makes sense to consider alternatives; with live data available, the choice might have been different. To further widen the scope, image data could be added to the feature space. Image data would, however, likely require reconsidering the machine learning techniques as well, since the current implementations are specifically designed not to have access to such features.

For the window filter, general robustness is important for future work. The window filter can perform, but it lacks the robustness needed for long, continuous, unsupervised use. By implementing measures such as scaling of the limits described in section 3.3, some problematic situations could be prevented, such as a distant window being moved outside its allowed range, since the current movement limit does not take into account the distance between the camera and the window.
Implementing such scaling, in ways similar to the pet size scaling described in this article, has potential, but this has not been explored further. To further increase robustness, the window filter needs contingency plans for situations with larger movements than the filter can handle, for example if the camera is intentionally moved by the owner to a different viewing angle and position.

To increase the functionality of the window filter, arbitrary shapes for windows could be introduced. Most likely such shapes would be defined as two-dimensional polygons, with points and lines connecting them, but they could also be defined by painting a surface on the camera image. Arbitrary shapes would allow the window filter to provide functionality similar to the pet filter's, if on a more basic level: for example, if an animal has a favorite spot above the floor plane, simply remove that position from the motion detection and apply the pet diagonal scaling test to the rest.

While the window filter is fairly straightforward in its filtering process (if an event is within an area, ignore it), the pet filter is not. This makes live data even more important for the pet filter than for the window filter, in order to visualize the sizes that the pet filter will consider false alarms at any given position in the image. Such a visualization would benefit the company: it could take the form of an overlay on a presented camera image, where touching anywhere in the image displays the allowed pet size at that position. Creating this would require both graphical user interface development and improvements to the pet filter implementation, but it would be a useful tool for the company commercially and for studying the implementation's effectiveness.

Currently the pet filter never lowers the height map created by the SOM. Introducing lowering of the height map, if done at a flat rate based on time, can be considered adding a very basic decay to the system. Adding decay is worth exploring, since systems without decay may become over-saturated; with decay, some temporal information is retained, since older information still has bearing on decisions, but to a lesser degree than new information [33].
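Arbitrary polygonal window shapes could reuse the standard even-odd (ray casting) containment test. This sketch is illustrative and not part of the thesis implementation.

```python
def point_in_polygon(x, y, poly):
    """Even-odd (ray casting) test: cast a ray from (x, y) toward +x and
    count edge crossings; an odd count means the point is inside.
    `poly` is a list of (x, y) vertices in order, without repetition.
    """
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Edge straddles the horizontal line through the query point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

An event center passing this test for a user-drawn polygon would then be treated exactly like an event inside a rectangular window today.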
Furthermore, there is the option of allowing correctly classified false alarms to lower the closest node(s) to find a better fit, but doing so may have unintended side effects that would need to be explored if this option is considered.

Given available technology, the assumptions behind the scaling mechanic could be replaced by a camera with depth capabilities. The distance to any object could then be reliably calculated and the scaling done properly, without relying on the floor plane assumptions. Depth could then be added to the feature space, as could other data such as color, infrared and sound if available.

Bibliography

[1] Naoki Abe, Bianca Zadrozny, and John Langford. Outlier detection by active learning. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[2] Shigeo Abe. Multiclass Support Vector Machines. Springer, London, second edition.

[3] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 37–46.

[4] S. Albrecht, J. Busch, M. Kloppenburg, F. Metze, and P. Tavan. Generalized radial basis function networks for classification and novelty detection: self-organization of optimal Bayesian decision. Neural Networks, Volume 13 Issue 10.

[5] Shin Ando. Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection. ICDM 2007, Seventh IEEE International Conference on Data Mining, pages 13–22.

[6] Stephen D. Bay and Mark Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38.

[7] C.M. Bishop. Novelty detection and neural network validation. IEE Proceedings - Vision, Image and Signal Processing, Volume 141 Issue 4.

[8] Rafal Bogacz, Malcolm W. Brown, and Christophe Giraud-Carrier. High capacity neural networks for familiarity discrimination. ICANN 99, Ninth International Conference on Artificial Neural Networks, Volume 2.

[9] Joel Branch, Boleslaw Szymanski, Chris Giannella, Ran Wolff, and Hillol Kargupta. In-network outlier detection in wireless sensor networks. Proceedings of the 26th IEEE International Conference on Distributed Computing Systems (ICDCS 06).

[10] Tom Brotherton, Tom Johnson, and George Chadderdon. Generalized radial basis function networks for classification and novelty detection: self-organization of optimal Bayesian decision. Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, Volume 2.

[11] Simon Byers and Adrian E. Raftery. Nearest neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, Issue 93.

[12] Ingrid Carlbom and Joseph Paciorek. Planar geometric projections and viewing transformations. ACM Computing Surveys, Volume 10 No 4.

[13] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys, Volume 41 Issue 3.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. Available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[15] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[16] Paul A. Crook, Stephen Marsland, Gillian Hayes, and Ulrich Nehmzow. A tale of two filters - on-line novelty detection. IEEE International Conference on Robotics and Automation, Volume 4.

[17] Robert Gwadera, Mikhail J. Atallah, and Wojciech Szpankowski. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. Third IEEE International Conference on Data Mining, pages 67–74.

[18] Greg Hamerly and Charles Elkan. Alternatives to the k-means algorithm that find better clusterings. Proceedings of the Eleventh International Conference on Information and Knowledge Management.

[19] John A. Hartigan. Clustering Algorithms. Wiley, New York, London.

[20] Trevor Hastie. The Elements of Statistical Learning. Springer, New York.

[21] Victoria J. Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, Volume 22 Issue 2.

[22] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[23] Byungho Hwang and Sungzoon Cho. Characteristics of autoassociative MLP as a novelty detector. IJCNN 99, International Joint Conference on Neural Networks, Volume 5.

[24] D. Janakiram, Adi Mallikarjuna Reddy V, and A V U Phani Kumar. Outlier detection in wireless sensor networks using Bayesian belief networks. First International Conference on Communication System Software and Middleware.

[25] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning. Software available at http://svmlight.joachims.org/.

[26] Thorsten Joachims. Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[27] Tapas Kanungo, David M. Mount, Ruth Silverman, Nathan S. Netanyahu, Angela Y. Wu, and Christine Piatko. The analysis of a simple k-means clustering algorithm. Proceedings of the Sixteenth Annual Symposium on Computational Geometry.

[28] Teuvo Kohonen. Self-Organizing Maps. Springer, third edition.

[29] Junshui Ma and Simon Perkins. Online novelty detection on temporal sequences. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[30] Junshui Ma and Simon Perkins. Time series novelty detection using one-class support vector machines. Proceedings of the International Joint Conference on Neural Networks, Volume 3.

[31] Larry M. Manevitz and Malik Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, Issue 2.

[32] Markos Markou and Sameer Singh. Novelty detection: a review - part 1: statistical approaches. Signal Processing, Volume 83 Issue 12.

[33] Markos Markou and Sameer Singh. Novelty detection: a review - part 2: neural network approaches. Signal Processing, Volume 83 Issue 12.

[34] Sebastian Mika, Gunnar Rätsch, Bernhard Schölkopf, and Klaus-Robert Müller. Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, Issue 13.

[36] Claudio De Stefano, Carlo Sansone, and Mario Vento. To reject or not to reject: That is the question - an answer in case of neural classifiers. IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, Volume 30 No 1.

[37] Yanxin Wang, Johnny Wong, and Andrew Miner. Anomaly intrusion detection using one class SVM. Proceedings from the Fifth Annual IEEE SMC.

[38] Kai Zhang, James T. Kwok, and Bahram Parvin. Prototype vector machine for large scale semi-supervised learning. Proceedings of the 26th International Conference on Machine Learning.

Appendix A Other Considered Methods

A.1 One-Class and Two-Class Support Vector Machines

Support Vector Machines (SVMs) are powerful constructs, and they can be used both for supervised learning, via two-class or multi-class SVMs [2], and for unsupervised learning, via one-class SVMs [35][31]. SVMs utilize the idea that even if a dataset is not linearly separable in the current dimensional space, it can become linearly separable when transformed into a higher-dimensional space [22]. The transformation into that high-dimensional space is performed using one of a number of available kernel functions. Each kernel provides somewhat different results and behavior, and kernels are still a very active area of study. There are, however, a few kernels that are usually presented by introductory articles and books, and as such are quite widely used.

Given a training set containing training instances and labels $(x_i, y_i)$, $i = 1, \ldots, l$, where each training instance $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$, the SVM solves the following quadratic optimization problem in order to maximize the margin between the two classes of data instances [22][14][35]:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{A.1}$$

While equation A.1 is complex and for many hard to understand, full understanding of SVM theory is not required to utilize the algorithms, especially if libraries such as those mentioned below are available. In the equation, $\phi(x_i)$ transforms the training instance $x_i$ into the higher-dimensional space using the chosen kernel, generally one of those shown below. $\phi$ is therefore a feature map from the input space to an inner product space, $\mathbb{R}^n \to F$. The kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ can then utilize what is commonly known as the kernel trick to avoid explicitly mapping the values, by only computing the dot product [2].
The constant $C$ in A.1 is the penalty parameter, since it induces a penalty on the system for each error the system allows. This is done simply by summing all the errors $\xi_i$

and multiplying the sum by $C$, before adding it to the expression that the quadratic optimization problem solver tries to minimize.

Linear: $K(x_i, x_j) = x_i^T x_j$

Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

Radial Basis Function: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, $\gamma > 0$

Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

In the formulas above, $\gamma$, $r$ and $d$ are parameters of the specific kernels. The reasoning behind why these kernels are the most widely used will not be explained further here. In general the Radial Basis Function (RBF) kernel is the most widely used, since it resembles a Gaussian bell curve in the requested number of dimensions, and for reasons outside the scope of this article the Gaussian bell curve is a very useful tool in many situations.

There exist a number of libraries containing implementations that can be used for convenience, such as LibSVM [15] and SVMlight [25]. These libraries are widely used, since they help alleviate the complexity of solving the quadratic optimization problem efficiently and correctly. The LibSVM library was first developed in the year 2000, and since then it has grown with a number of different algorithm implementations and a large number of extensions and ports to different programming languages. It contains implementations of the following: C-support vector classification (C-SVC), ν-support vector classification (ν-SVC), ε-support vector regression (ε-SVR), ν-support vector regression (ν-SVR), distribution estimation (one-class SVM), and multi-class classification. Due to the large algorithm support as well as the test and training sets in this library, many academic articles and projects use it in a variety of situations [37][31][26][38]. The one-class SVM is an unsupervised approach (type 1) and the C-SVC is a supervised approach (type 3), where the classes are defined as normal and anomalous.
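For reference, the four kernel functions listed above can be written out directly. The parameter defaults below are arbitrary illustrations, not recommended settings.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(a, b) + r) ** d

def rbf_kernel(a, b, gamma=1.0):
    # exp(-gamma * ||a - b||^2): equals 1 when a == b, decays with distance.
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def sigmoid_kernel(a, b, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(a, b) + r)
```

In a full SVM these functions would populate the Gram matrix passed to the quadratic optimization solver; they never materialize the high-dimensional map $\phi$ itself, which is the point of the kernel trick.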
It is generally known that it is easier to construct an accurate learner if a supervised approach can be taken, since the learner then has information about what the user wants the output to look like (that is, the correct classification for any given point). The learner can then adjust parameters and values to achieve higher accuracy. This is not the case with unsupervised approaches, since the system has no way of knowing whether it has classified something correctly or not. In the scope of this scenario, the one-class SVM has a parameter that is proportional to the fraction of instances the system will classify as outliers. This parameter decides whether 1%, 10% or 50% of the dataset will be classified as outliers, depending on its value [35]. Which points are classified as outliers depends on other parameters, such as the kernel used, any kernel parameters, and the distribution of the training data.

For the one-class SVM to perform well, the outliers detected must be the events that constitute a real alarm, rather than a movement outside a window or any

other sort of false alarm. More mathematically, there must be a large enough difference between these outliers and other, more normal events. Due to the small number of dimensions available, there may well be occasions where the differences are too small, which greatly hinders the application of a mathematical model such as the one-class SVM. The two-class SVM fares somewhat better, but has similar problems depending on the parameters used and the peculiarities of the training data.

At the start of the project, when the SVM approach was first considered, the idea of scaling the diagonals had not yet been conceived or implemented. As such, the SVM implementation in this project was focused on detecting anomalous motion detection events rather than anomalous diagonal sizes. An option for future work is therefore to test the performance of SVMs on the updated scope of the article, that of detecting anomalous diagonal sizes as in section 3.4.

As implementation and testing of the support vector machines continued, in both the one-class and the two-class case, another problem was noticed. When training the system, if no data instances show real alarms, that is, the events the system should find, then the learners may well find other outliers and thus completely miss the most important events. The most likely reason for this behavior is that the boundary created by the algorithm maximizes the margin between a number of support vectors, and in the one-class SVM these support vectors will be defined as outliers [35]. Depending on the parameter mentioned above, the system will find a number of support vectors that are defined as outliers by virtue of being closer to the origin than the general population of data points.
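The role of the outlier-fraction parameter can be illustrated without the SVM machinery by thresholding anomaly scores at a quantile. This mimics only the parameter's effect, not the one-class SVM itself; the function and its name are illustrative inventions.

```python
def flag_outlier_fraction(scores, nu=0.1):
    """Flag roughly a fraction `nu` of instances (those with the highest
    anomaly scores) as outliers.

    Ties at the threshold may flag slightly more than `nu`, just as the
    SVM parameter is only an approximate bound on the outlier fraction.
    """
    k = max(1, int(round(nu * len(scores))))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s >= threshold for s in scores]
```

As in the SVM case, the fraction is fixed in advance regardless of whether the flagged points are actually the real alarms, which is precisely the weakness discussed above.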
Since the support vectors placed during training maintain the boundary and the margin, if no real alarms are present in the data, then those points will not be among those chosen as support vectors. The hyperplane will then be placed elsewhere, perhaps within what should be considered the normal class, and the classification will not be what we wish for in this specific scenario.

A.2 Clustering

Approaching the problem of anomaly detection by way of clustering can be called a type 1 approach, following the definition in section 1.2. Hodge and Austin [21] claim that unsupervised clustering algorithms need all data to be available when training is done, and that the data needs to be static, since the system is analogous to a batch-processing system. Chandola et al. present two assumptions that help us grasp the nature of clustering-based algorithms [13]:

Assumption 1: Normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster.

Assumption 2: Normal data instances lie close to their closest cluster centroid, while anomalies are far away from their closest cluster centroid.

While the first assumption seems more intuitive than the second at first glance, it requires that all normal data instances belong to a cluster, which can, depending on the algorithm chosen, cause problems with clusters of very low density being formed. Many of the more well-known algorithms fall under the second assumption, for example Self-Organizing Maps (which can be considered both a clustering algorithm and a neural network) and k-means clustering [27][19]. Kohonen has thoroughly explored Self-Organizing Maps as a solution for detecting anomalies in a semi-supervised mode [28]. By first clustering the data, then measuring the distance to the closest cluster centroid and using that distance as an anomaly score, these algorithms make use of assumption two to provide a ranking of possible outliers. To use these algorithms in a semi-supervised mode, the training data is first clustered using the two steps above. Instances are then taken from the test data, their anomaly scores are calculated in the same way and compared with the anomaly scores of the clusters, giving the system a comparative value for deciding whether a data instance should be considered anomalous. Since the training data is labeled as normal, defining normal clusters, the system operates in a semi-supervised mode.

A.2.1 Computational Complexity

The achievable computational complexity of training a clustering-based anomaly detection algorithm depends heavily on the underlying clustering algorithm, since that is where all the essential work lies. If pairwise distance computation between all data pairs is required, then training has quadratic complexity.
There are, however, a number of heuristic techniques that can be used to achieve linear or near-linear complexity, for example k-means or algorithms that use approximations [19][27][18].

A.2.2 Advantages and Disadvantages

Techniques based on clustering can operate in an unsupervised mode (type 1) as well as a semi-supervised mode, the latter allowing for a faster algorithm as well as a better result in general. The testing phase is also quick with clustering-based algorithms, since any data instance need only be tested for membership in a small number of clusters. However, if the chosen clustering algorithm is not able to capture the clusters within the data, then the performance of the anomaly detection algorithm will suffer greatly, due to its high dependency on such clusters. As mentioned above, some clustering algorithms require that every normal data instance belong to a cluster, which can lead to large clusters with low density and large false positive errors. Chandola et al. note that performing the clustering can be expensive, with $O(n^2 m)$ complexity, where $n$ is the number of data instances and $m$ the dimensionality.
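The two-step scheme described above, cluster first and then score by distance to the nearest centroid, reduces to a few lines once centroids are available from any clustering algorithm. The threshold rule here is an assumption made for illustration.

```python
import math

def anomaly_score(point, centroids):
    """Distance to the closest cluster centroid (assumption 2):
    normal instances score low, anomalies score high."""
    return min(math.dist(point, c) for c in centroids)

def is_anomalous(point, centroids, threshold):
    """Flag a test instance whose score exceeds a threshold learned
    from the scores of the (normal) training data."""
    return anomaly_score(point, centroids) > threshold
```

In the semi-supervised mode, the threshold would be derived from the anomaly scores observed on the labeled-normal training data, for instance their maximum or a high quantile.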

A.3 Nearest Neighbor

According to Chandola et al., nearest neighbor based anomaly detection techniques work under an assumption similar to those of the other algorithms discussed in this article [13].

Assumption: Normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors.

Based on this assumption, it is intuitive to consider some way of measuring the distance between two data instances, to determine whether one instance is far from another. This distance can also be thought of as a similarity measure: if the distance between two instances is small, then the differences between them should also be small. In one-dimensional data this can simply be the difference between the value of the first point and the second point, but in high-dimensional space a more complex calculation is needed to determine how similar one instance is to another. For continuous attributes that are not too complex, the Euclidean distance (see equation 2.10) is a well-used distance measure, but there are other choices if the data is more complex, for example the Mahalanobis distance (see equation A.2) mentioned by Hodge and Austin [21].

$\sqrt{(\vec{x} - \mu)^T C^{-1} (\vec{x} - \mu)}$ (A.2)

The Mahalanobis distance (A.2) above calculates the distance from a point to a center µ while accounting for correlations within the dataset. These correlations are given by a covariance matrix C, and since they are taken into account, the Mahalanobis distance can give a better result in cases where the Euclidean distance has problems. The Mahalanobis distance is, however, considerably more computationally expensive than the Euclidean distance, since it requires a pass through the dataset to build the covariance matrix and identify any attribute correlations. The Euclidean distance, by comparison, only compares one point with another using the vector difference.
The algorithm often termed k-Nearest Neighbor is frequently viewed as a black-box prediction engine, since it is highly unstructured and requires little knowledge about the data. Hastie describes the nearest-neighbor technique as one of the best performers on real data problems and considers it to work reasonably well on low-dimensional problems, but also notes that it does not handle high-dimensional spaces well, due to the bias-variance trade-off and its time complexity [20]. The basic nearest neighbor algorithm functions as a type 3 approach, that is, a semi-supervised approach where the training set contains the normal class, and any subsequent query is either close enough to its neighbors to be classified as normal, or too far, in which case the point is classified as an outlier. Hodge and Austin [21] do, however, refer to type 1 approaches using variations of the nearest neighbor algorithm [11].
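The type 3 behavior described above can be sketched in a few lines: score a query by its distance to the k-th closest training point and compare against a threshold. The data, k, and the threshold value below are hypothetical choices for illustration:

```python
import numpy as np

def knn_score(query, train, k=3):
    """Anomaly score: distance from the query to its k-th nearest
    training point (large when the query sits in a sparse region)."""
    dists = np.linalg.norm(train - query, axis=1)
    return float(np.sort(dists)[k - 1])

rng = np.random.default_rng(0)
train = rng.normal(0, 1, (200, 2))  # the labeled normal class
threshold = 1.0                     # would be tuned on held-out normal data

print(knn_score(np.array([0.0, 0.0]), train) < threshold)  # dense region
print(knn_score(np.array([8.0, 8.0]), train) < threshold)  # far from all points
```

Note that the whole training set is scanned for every query, which is exactly the memory and time cost discussed in the next subsection.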

A.3.1 Computational Complexity

The nearest neighbor class of algorithms, being similar to the clustering class, suffers from similar problems when it comes to computational complexity: the complexity is directly proportional to both the dimensionality m and the number of data instances n in the training set [21]. Many algorithms based on k-nearest neighbor end up with a complexity of O(n²m) when training, and O(nk) when querying whether a point is part of the class or not. There have, however, been a number of optimizations of the k-nearest neighbor algorithm that claim to bring the average case down to close to linear [6] using pruning. Both Chandola et al. [13] and Hastie [20] also mention other pruning-based optimizations, which allow both training and querying to be sped up. Such pruning also eases the second problem of the nearest neighbor algorithm: the need to keep the training set in memory, or otherwise available, to answer queries. Since the decision boundary in the basic k-nearest neighbor algorithm depends solely on the training points and their distances to the query point, the training set must be kept in memory, which can take considerable space. Pruning the training set allows the system to keep only the points needed to maintain the boundary.

A.3.2 Advantages and Disadvantages

The basic nearest-neighbor algorithms are very effective in many low-dimensional cases, able to function and predict well as a black box without much information and with few parameters to set. The algorithm itself is very simple, and a number of optimizing pruning rules can be applied. The downside is that, without optimization, the algorithm has a high time complexity both while training and while querying, as well as a high space requirement due to the need to keep the training set in memory even after training.
The algorithm does not support on-line learning out of the box, but once a sufficient training set has been collected, queries and additions to the training set can be performed in an on-line fashion.

A.4 Neural networks

Neural networks come in many different shapes and as such have very varied uses. They are in general non-parametric, and they generalize well to patterns they have not seen [21]. There are options for both supervised and unsupervised approaches, so various neural networks can be used as any of the approaches mentioned in section 1.2. Some supervised and semi-supervised networks are discussed below, along with some notes regarding the self-organizing map. The self-organizing map is also discussed in section

A.4.1 Supervised and Semi-supervised Neural Networks

The most basic of the supervised neural network approaches, the Multi-Layered Perceptron (MLP), has a number of issues that in many cases make it unfit for anomaly detection. The MLP is a feed-forward network that uses hyper-planes to separate data into classes; it interpolates well but extrapolates poorly, and can therefore have problems classifying new data in regions where no data has been seen previously [21][20]. Some variations exploit the fact that MLPs do not extrapolate well in order to detect novel data, as Bishop [7] does when using an MLP to monitor oil pipeline flows. The MLP classifier also requires a number of passes through the training data for the weights in the system to settle before it is ready to accept queries, which can make it unfit for situations where training performance requirements are high or where the training data set is large. As an extension of the MLP classifier there exists the Radial Basis Function (RBF) classifier, which uses hyper-ellipsoids instead of hyper-planes to separate the data [4] and therefore tends to converge more quickly thanks to the more powerful hyper-ellipsoids. The RBF classifier can be used as a type 3 approach to novelty detection, and can also be adapted to allow incremental additions to classes and data [10]. Chandola [13] mentions the Replicator neural network, a multi-layer feed-forward network that has been used for one-class anomaly detection. By using the same number of input and output nodes as there are features (dimensions, in our case) in the data, and by training the system to compress the data within three hidden layers, the data can then be reconstructed by the system when similar data is presented to the input during testing.
If the data is similar enough to a previously trained data instance, the output will be that previously trained instance, and by using the reconstruction error between input and output as an anomaly score, it can be decided whether a data instance is close enough to the previously trained data to be considered normal. This replicator network is similar to the auto-associative neural network (AANN) explored by Hwang and Cho [23] and mentioned by Hodge and Austin [21], among others: another feed-forward perceptron-based network that can be used where type 3 approaches are needed. The AANN functions by decreasing the number of available hidden nodes during training, which introduces a bottleneck. The bottleneck forces the system to rely on as few hidden nodes as possible, which reduces redundancy and makes the system focus on the key features in the data. The AANN then tries to recreate its inputs during testing, and if a given input is far from the previously trained data instances, the recreation error will be high and an anomaly will have been found.

A third supervised neural network with the auto-associative property is the Hopfield network, a fully connected recurrent network using only +1/−1 unit states [21][32]. Hopfield nets are modeled on the way the human brain stores memories and discerns familiarity. They differ from the networks above in that training inputs are applied to all nodes simultaneously, instead of spreading slowly through several iterations, which accounts for the computational efficiency of Hopfield nets where training is concerned. Because of this, Hopfield nets also deal well with high-dimensional inputs and with large training sets. [16] presents a novelty detection algorithm fit for type 3 approaches using a Hopfield net, where the energy calculated from the net is used to determine whether a data instance is novel, and if so it is classified as anomalous. This works since Bogacz et al. [8] show that the energy for a pattern that has been learned by the Hopfield network is −(N/2) plus noise, where N is the number of neurons in the network, while the energy for a novel random pattern is zero plus a similar noise term. Based on this, [16] mentions that a threshold of E < −(N/4) is normally used for classifying patterns in this way.

A.4.2 Computational Complexity

The running time during training for neural networks depends on various parameters, but mainly on the learning rate, which dictates how much a single training input influences the weights. If the learning rate is high, fluctuations may occur; if it is low, the system will be slow to converge. Depending on the type of neural network and the number of nodes and layers, the ideal learning rate and the number of runs (epochs) needed for convergence will vary. The MLP is generally relatively slow to converge; the RBF improves matters somewhat by using the more powerful hyper-ellipsoids instead of the rigid hyper-planes [20]. The Hopfield net is, as mentioned above, very fast at learning new patterns, since all the weights of the network are updated simultaneously when a new pattern is learned.
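The one-shot Hebbian learning and the energy-based novelty test can be sketched as follows. This is a minimal illustration under standard assumptions (Hebbian weights normalized by N, energy E = −½ sᵀWs); the network size and pattern count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                      # number of neurons
patterns = rng.choice([-1, 1], size=(5, N))  # patterns to store

# Hebbian learning: one simultaneous weight update per pattern, no epochs
W = (patterns.T @ patterns).astype(float) / N
np.fill_diagonal(W, 0.0)

def energy(s):
    """Hopfield energy of state s; low for stored patterns."""
    return -0.5 * s @ W @ s

stored = patterns[0]
novel = rng.choice([-1, 1], size=N)

# a stored pattern should fall near -N/2, a novel one near zero,
# so the -N/4 threshold separates familiar from novel
print(energy(stored) < -N / 4)  # expected: familiar
print(energy(novel) < -N / 4)   # expected: novel
```

The single matrix product per pattern is what makes training so cheap compared with the iterative weight updates of the MLP and RBF networks.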
Kohonen [28] mentions some ways of speeding up the SOM calculations using pointers to tentative winners, which reduce the number of comparison operations from quadratic, when learning through exhaustive search, to linear when using the pointers. While this speeds up the SOM, it is still not as quick as the Hopfield net, due to the learning procedure involved in training a SOM as well as in querying the system [21].

A.4.3 Advantages and Disadvantages

Novelty or anomaly detection algorithms have been presented for all of the neural networks mentioned above, and as such all can claim a measure of success. Many of the auto-associative networks share the same advantage: they can present a stored piece of data when given a similar or incomplete input mapping to that data. Replicator networks, Hopfield networks and similar auto-associative networks map certain key features to a specific output, and when given those key features, or similar ones, the system remembers that it has seen the features before and outputs the remembered data, thus giving a good measure for testing the distance between previously trained data and a newly given input. This makes auto-associative networks useful for novelty and anomaly detection, given that one has information about the normal class (a type 3 approach).

Neural networks have relatively few parameters to consider, but those parameters may need to be fine-tuned to provide the best performance. In many cases the entire training data set has to be iterated through a number of times for the neural network to converge; one of the exceptions is the Hopfield network, which for this reason is deemed one of the better novelty detection solutions. The fact that the energy calculation of the Hopfield network can be used to perform queries makes Hopfield networks even better in this regard. Self-Organizing Maps have the big advantage that they can be used as an unsupervised approach (type 1), where they provide a powerful tool thanks to their ability to reduce high-dimensional inputs to one or two dimensions. SOMs can therefore be used in combination with other algorithms after the dimension reduction has been done. Regardless of this, their computational complexity can limit their uses, especially if a complex distance measure is used.
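As a concrete illustration of this dimension reduction, the minimal one-dimensional SOM below maps 3-D inputs onto a line of ten units, so that each input can be summarized by a single unit index. The grid size, learning-rate schedule and neighborhood width are all hypothetical choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))         # 3-D inputs in the unit cube
n_units = 10
weights = rng.random((n_units, 3))  # one 3-D prototype per map unit
steps = 2000

for t in range(steps):
    x = data[rng.integers(len(data))]
    # winner search: exhaustive comparison over all units
    # (the step Kohonen's pointer trick speeds up)
    bmu = np.linalg.norm(weights - x, axis=1).argmin()
    lr = 0.5 * (1 - t / steps)                # decaying learning rate
    sigma = max(1.0, 3.0 * (1 - t / steps))   # shrinking neighborhood
    # units near the winner on the 1-D grid are pulled toward x as well
    h = np.exp(-((np.arange(n_units) - bmu) ** 2) / (2 * sigma ** 2))
    weights += lr * h[:, None] * (x - weights)

# after training, any 3-D input reduces to a single index in [0, n_units)
print(np.linalg.norm(weights - data[0], axis=1).argmin())
```

The resulting unit index (or pair of indices for a 2-D grid) is the low-dimensional representation that other algorithms can then operate on.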


Segmentation of building models from dense 3D point-clouds Segmentation of building models from dense 3D point-clouds Joachim Bauer, Konrad Karner, Konrad Schindler, Andreas Klaus, Christopher Zach VRVis Research Center for Virtual Reality and Visualization, Institute

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

Visualization of Breast Cancer Data by SOM Component Planes

Visualization of Breast Cancer Data by SOM Component Planes International Journal of Science and Technology Volume 3 No. 2, February, 2014 Visualization of Breast Cancer Data by SOM Component Planes P.Venkatesan. 1, M.Mullai 2 1 Department of Statistics,NIRT(Indian

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

AN EXPERT SYSTEM TO ANALYZE HOMOGENEITY IN FUEL ELEMENT PLATES FOR RESEARCH REACTORS

AN EXPERT SYSTEM TO ANALYZE HOMOGENEITY IN FUEL ELEMENT PLATES FOR RESEARCH REACTORS AN EXPERT SYSTEM TO ANALYZE HOMOGENEITY IN FUEL ELEMENT PLATES FOR RESEARCH REACTORS Cativa Tolosa, S. and Marajofsky, A. Comisión Nacional de Energía Atómica Abstract In the manufacturing control of Fuel

More information

11.1. Objectives. Component Form of a Vector. Component Form of a Vector. Component Form of a Vector. Vectors and the Geometry of Space

11.1. Objectives. Component Form of a Vector. Component Form of a Vector. Component Form of a Vector. Vectors and the Geometry of Space 11 Vectors and the Geometry of Space 11.1 Vectors in the Plane Copyright Cengage Learning. All rights reserved. Copyright Cengage Learning. All rights reserved. 2 Objectives! Write the component form of

More information

Prentice Hall Algebra 2 2011 Correlated to: Colorado P-12 Academic Standards for High School Mathematics, Adopted 12/2009

Prentice Hall Algebra 2 2011 Correlated to: Colorado P-12 Academic Standards for High School Mathematics, Adopted 12/2009 Content Area: Mathematics Grade Level Expectations: High School Standard: Number Sense, Properties, and Operations Understand the structure and properties of our number system. At their most basic level

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem

MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem David L. Finn November 30th, 2004 In the next few days, we will introduce some of the basic problems in geometric modelling, and

More information

Anomaly Detection in Predictive Maintenance

Anomaly Detection in Predictive Maintenance Anomaly Detection in Predictive Maintenance Anomaly Detection with Time Series Analysis Phil Winters Iris Adae Rosaria Silipo [email protected] [email protected] [email protected] Copyright

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. [email protected] J. Jiang Department

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Blender Notes. Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine

Blender Notes. Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine Blender Notes Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine The Blender Game Engine This week we will have an introduction to the Game Engine build

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Automatic Labeling of Lane Markings for Autonomous Vehicles

Automatic Labeling of Lane Markings for Autonomous Vehicles Automatic Labeling of Lane Markings for Autonomous Vehicles Jeffrey Kiske Stanford University 450 Serra Mall, Stanford, CA 94305 [email protected] 1. Introduction As autonomous vehicles become more popular,

More information

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances 240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Sentiment analysis using emoticons

Sentiment analysis using emoticons Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Digital image processing

Digital image processing 746A27 Remote Sensing and GIS Lecture 4 Digital image processing Chandan Roy Guest Lecturer Department of Computer and Information Science Linköping University Digital Image Processing Most of the common

More information

Solutions to Exercises, Section 5.1

Solutions to Exercises, Section 5.1 Instructor s Solutions Manual, Section 5.1 Exercise 1 Solutions to Exercises, Section 5.1 1. Find all numbers t such that ( 1 3,t) is a point on the unit circle. For ( 1 3,t)to be a point on the unit circle

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent

More information

An introduction to 3D draughting & solid modelling using AutoCAD

An introduction to 3D draughting & solid modelling using AutoCAD An introduction to 3D draughting & solid modelling using AutoCAD Faculty of Technology University of Plymouth Drake Circus Plymouth PL4 8AA These notes are to be used in conjunction with the AutoCAD software

More information

Self Organizing Maps for Visualization of Categories

Self Organizing Maps for Visualization of Categories Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, [email protected]

More information

VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS

VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS VEHICLE TRACKING USING ACOUSTIC AND VIDEO SENSORS Aswin C Sankaranayanan, Qinfen Zheng, Rama Chellappa University of Maryland College Park, MD - 277 {aswch, qinfen, rama}@cfar.umd.edu Volkan Cevher, James

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Performance Level Descriptors Grade 6 Mathematics

Performance Level Descriptors Grade 6 Mathematics Performance Level Descriptors Grade 6 Mathematics Multiplying and Dividing with Fractions 6.NS.1-2 Grade 6 Math : Sub-Claim A The student solves problems involving the Major Content for grade/course with

More information

Video Conferencing Display System Sizing and Location

Video Conferencing Display System Sizing and Location Video Conferencing Display System Sizing and Location As video conferencing systems become more widely installed, there are often questions about what size monitors and how many are required. While fixed

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

T O B C A T C A S E G E O V I S A T DETECTIE E N B L U R R I N G V A N P E R S O N E N IN P A N O R A MISCHE BEELDEN

T O B C A T C A S E G E O V I S A T DETECTIE E N B L U R R I N G V A N P E R S O N E N IN P A N O R A MISCHE BEELDEN T O B C A T C A S E G E O V I S A T DETECTIE E N B L U R R I N G V A N P E R S O N E N IN P A N O R A MISCHE BEELDEN Goal is to process 360 degree images and detect two object categories 1. Pedestrians,

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information