SFI Working Paper 2013-02-007 (Santa Fe Institute, www.santafe.edu). To appear in IEEE Computational Intelligence and Data Mining (CIDM) 2013.
Interpreting Individual Classifications of Hierarchical Networks

Will Landecker, Michael D. Thomure, Luís M. A. Bettencourt, Melanie Mitchell, Garrett T. Kenyon, and Steven P. Brumby

Department of Computer Science, Portland State University: {landeckw,thomure,mm}@cs.pdx.edu
Santa Fe Institute: bettencourt@santafe.edu
Los Alamos National Laboratory: {gkenyon,brumby}@lanl.gov

Abstract

Hierarchical networks are known to achieve high classification accuracy on difficult machine-learning tasks. For many applications, a clear explanation of why the data was classified a certain way is just as important as the classification itself. However, the complexity of hierarchical networks makes them ill-suited for existing explanation methods. We propose a new method, contribution propagation, that gives per-instance explanations of a trained network's classifications. We give theoretical foundations for the proposed method, and evaluate its correctness empirically. Finally, we use the resulting explanations to reveal unexpected behavior of networks that achieve high accuracy on visual object-recognition tasks using well-known data sets.

I. INTRODUCTION

In machine learning, a model is learned from structure in the data. When using machine-learning techniques, one hopes that the learned structure generalizes well to yet-unseen data. However, many machine-learning techniques result in black-box models, making it difficult to understand the nature of the learned structure. With interpretable methods that explain the interactions between data and model, we can ensure that our learned models have learned relevant structure from the data, rather than spurious statistics.

This paper focuses on a family of hierarchical networks in which the operations computed at each layer alternate between some type of pattern matching (e.g., convolution, radial basis function) and subsampling (e.g., mean, maximum). Such networks have been used extensively in the machine-learning community, and are known to achieve high classification accuracy on a variety of tasks such as phoneme recognition [1], text document classification [2], and visual object recognition [3]. Due to their complexity, such networks are often treated as black boxes, giving accurate solutions to difficult problems without explaining how their solutions were found. However, for hierarchical networks to be used for applications in which each classification carries high risk (e.g., medical diagnosis or financial decisions), it may be necessary for these networks to explain the evidence for each classification. More generally, for any classification method, it is important to verify that the network's results are due not to anomalies or accidental correlations in particular data sets, but to features of the data that make sense for the general task at hand.

Given an instance $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ classified as $\hat{y}(\mathbf{x}) \in \{-1, 1\}$, we seek a method to explain the relative importance of each input $x_i$ to the instance's classification. Poulin et al. [4] offer such an explanation method for the case of additive classifiers. For an additive classifier $\hat{y}(\mathbf{x}) = \operatorname{sgn}\left[\sum_{i=1}^{n} f_i(x_i) + b\right]$, Poulin et al. define the contribution of feature $x_i$ to be $f_i(x_i)$. However, the classifications of hierarchical networks cannot be explained with this method: even if the features calculated by the network are classified by an additive classifier, the high-level features themselves are generally nonadditive functions of the network's low-level inputs.

We address this limitation with a new method called contribution propagation. In short, we calculate contributions of high-level features as in Poulin et al. [4], and then work backward from the features to the network's input, determining the relative contribution of each node to its parent nodes.
This process gives us a well-founded explanation for individual classifications: given a single classified datum, our method explains how important each part of the datum (i.e., each entry in the input vector) was to that classification.

In what follows, we review the literature related to this research, and in Section II we introduce notation and review the architecture of hierarchical networks. Section III gives the details of the contribution-propagation method. Section IV discusses the implementation details of our network, and we give experimental results in Section V. We conclude with directions for further research and open questions.

A. Related Work

Rule extraction refers to a family of methods that explain the general rules used by a network by analyzing the network's internal weights [5]. Similarly, one can visualize the network topology in order to understand the overall strategies that might be used [6]. The extracted rules are not specific to the model's decision for any particular datum. Rather, they describe the general behavior of the model over a large set of data.
Thus, rule extraction does not provide the type of information we seek: a fine-grained explanation of what caused a single datum's classification.

In an attempt to provide such explanations, recent research has treated the entire classification system as a single black box in order to provide classifier-independent explanations [7]. These methods avoid dealing with the internal calculations of the system at the expense of sampling from an exponentially large set of possible inputs. For classifiers with large input spaces, as is often the case with hierarchical networks, these methods introduce unwanted dependencies on sampling.

The gradient approach is one way to explain individual classifications [8]. Given an instance $\mathbf{x} = (x_1, x_2, \ldots)$ and its classification $\hat{y}(\mathbf{x}) = \operatorname{sgn}[f(\mathbf{x})]$, this approach defines the importance of feature $x_i$ to be $\partial f / \partial x_i$. Intuitively, $x_i$ is important if $f$ is sensitive to small changes in $x_i$. However, the gradient approach fails to tell us how much each feature contributed to the output. Consider the simple example of a linear classifier $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, where $\mathbf{w} = (2, -2)$. An input $\mathbf{x} = (x_1, x_2) = (3, 1)$ gives output $f(\mathbf{x}) = 2 \cdot 3 + (-2) \cdot 1 = 6 - 2 = 4$. A different input $\mathbf{x}' = (x'_1, x'_2) = (-1, -3)$ gives output $f(\mathbf{x}') = 2 \cdot (-1) + (-2) \cdot (-3) = -2 + 6 = 4$. The gradient approach would tell us that our function $f$ is equally sensitive to all inputs in both cases ($|\partial f / \partial x_i| = 2$). However, this analysis fails to tell us that in the first case, the positive output of $f$ is due to the first feature $x_1$; and in the second case, the positive output of $f$ is due to the second feature $x_2$.

The contribution approach of Poulin et al. [4], on the other hand, perfectly captures this information. Using this approach, we would determine that for the first example, the contributions of $x_1$ and $x_2$ are 6 and -2, respectively. For the second example, the contributions of $x'_1$ and $x'_2$ are -2 and 6, respectively. Thus contributions (which we explore in more detail below) give us fine-grained explanations of the degree to which each input contributed to the output, while the gradient approach loses much useful information about the classification.
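To make the contrast concrete, the following is a minimal sketch (ours, not the authors' code; the function names are illustrative only) that computes both explanations for the worked example above, $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$ with $\mathbf{w} = (2, -2)$:

    def gradient_explanation(w, x):
        """Gradient approach [8]: importance of x_i is df/dx_i = w_i,
        which is the same no matter which input x is given."""
        return list(w)

    def contribution_explanation(w, x):
        """Additive contributions [4]: feature i contributes w_i * x_i."""
        return [wi * xi for wi, xi in zip(w, x)]

    w = (2.0, -2.0)
    for x in [(3.0, 1.0), (-1.0, -3.0)]:
        f = sum(wi * xi for wi, xi in zip(w, x))   # f(x) = 4 for both inputs
        print(x, f, gradient_explanation(w, x), contribution_explanation(w, x))
    # Gradients are (2, -2) in both cases; contributions are (6, -2) for the
    # first input and (-2, 6) for the second, isolating the responsible feature.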
II. PRELIMINARIES

A. Notation

We use italicized letters to denote real numbers and bold letters to denote real vectors, such as $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$. The L2 norm of a vector, denoted $\|\mathbf{x}\|$, is calculated as $\sqrt{\sum_{i=1}^{n} x_i^2}$. A node is a real-valued function. Capital letters $X$ refer to the identity of a node, the function calculated by that node, or the value returned by the node's function. The intended use will be clear from context. For a given hierarchical network, $X_i^\ell$ refers to the $i$th node in the $\ell$th layer. The vector $\mathbf{X}^\ell = (X_1^\ell, X_2^\ell, \ldots)$ denotes all nodes in layer $\ell$. We write $X^\ell$ to mean some node in the $\ell$th layer of the network, or merely $X$ to denote any particular node in the network. The children of node $X$, denoted $\operatorname{ch}(X)$, are the set of nodes that are the inputs to node $X$. In a slight abuse of notation, we sometimes treat $\operatorname{ch}(X)$ as a vector; the intended use will be clear by context. Similarly, the parents of node $X$, denoted $\operatorname{pa}(X)$, are the set (or vector) of nodes that receive $X$ as input. In what follows, we restrict the inputs of any node to come from only the preceding layer. Thus $X^{\ell-1} \in \operatorname{ch}(X^\ell)$ and $X^{\ell+1} \in \operatorname{pa}(X^\ell)$. A network has $L$ layers ($\ell \in [1, \ldots, L]$). Thus the output of the network, often called the feature vector, is $\mathbf{X}^L$. For convenience, the input to the network is denoted simply $\mathbf{x}$. Lastly, let $\mathbf{x} \in \mathbb{R}^n$, and for any node $X_i^1$ in the network's first layer, $\operatorname{ch}(X_i^1) \subseteq \mathbf{x}$.

B. Hierarchical Networks

Hierarchical networks comprise a large family of machine-learning models such as HMAX [3], [9], convolutional neural networks [2], and others. In order to make our analysis concrete, in this work we focus on the well-known family of hierarchical networks described by HMAX. However, it should be noted that our contribution-propagation method is applicable to the more general class of feed-forward hierarchical networks. We briefly review the architecture of HMAX-like networks here. Details of our implementation are given in the Methods section.

In a trained network, a node $X_k^\ell$ in an odd-numbered layer computes the radial basis function (RBF)

$$X_k^\ell = \exp\left(-\beta \left\| \operatorname{ch}(X_k^\ell) - \mathbf{P}_k \right\|^2\right),$$

where $\mathbf{P}_k$ and $\beta$ are parameters of the model. We refer to $\mathbf{P}_k$ as the kernel of node $X_k^\ell$. Following Serre et al. [3], we refer to these layers as S (simple cell) layers, with S1 being the first S layer, and so on. Nodes $X_h^{\ell+1}$ in even layers compute a maximum of their inputs:

$$X_h^{\ell+1} = \max_{X_k^\ell \in \operatorname{ch}(X_h^{\ell+1})} X_k^\ell.$$

Again following Serre et al. [3], we refer to these layers as C (complex cell) layers. Figure 1 shows a hierarchical network with two S layers and two C layers, whose input $\mathbf{x}$ consists of the gray-scale pixel values of an image. The output of the network (i.e., the output of the C2 layer) is the feature vector $\mathbf{X}^L$, which is given to a trainable classifier (in this case, a linear support vector machine (SVM)).
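To make the S- and C-layer computations concrete, here is a minimal sketch (ours, with numpy assumed and a small one-dimensional toy input; not the authors' implementation) of the two node types:

    import numpy as np

    def s_node(children, P, beta):
        """S (simple cell) node: X = exp(-beta * ||ch(X) - P||^2)."""
        children = np.asarray(children, dtype=float)
        return float(np.exp(-beta * np.sum((children - P) ** 2)))

    def c_node(children):
        """C (complex cell) node: X = max over its children."""
        return float(np.max(children))

    P = np.array([0.2, 0.5, 0.3])               # a stored kernel
    s1 = s_node([0.2, 0.4, 0.3], P, beta=1.0)   # input close to P: near 1
    s2 = s_node([0.9, 0.0, 0.8], P, beta=1.0)   # input far from P: near 0
    print(c_node([s1, s2]))                     # the C node keeps the max, s1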
III. METHODS

Given an instance $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, its corresponding feature vector $\mathbf{X}^L$ (i.e., the final layer of the network), and the feature vector's binary classification $\hat{y}(\mathbf{X}^L) = \operatorname{sgn}[f(\mathbf{X}^L)] \in \{-1, 1\}$, we ask: what portion of the value $f(\mathbf{X}^L)$ came from input $x_i$? In this work we assume that the score $f(\mathbf{X}^L)$ is an estimate of the classifier's confidence in its classification of $\mathbf{X}^L$ [10]. Rather than answer this question for the entire classifier and network as one large black box, our approach is to analyze each node (as well as the classifier) sequentially. Working from the classifier back to the inputs, we determine the contributions of the nodes in layer $L$ to the classifier, then the contributions of the nodes in layer $L-1$ to the nodes in layer $L$, and so on until we have calculated the contribution of the inputs (e.g., the contribution of each pixel in the image of Figure 1). This process, which we call contribution propagation, is ideal for interpreting the classifications made by complex networks, which are too large and complicated to interpret directly, yet in which each individual calculation (i.e., calculating the value output by any single node in the network) is simple.

It is crucial to note that our contribution-propagation algorithm is used only in explaining a model's classification. The classification itself is performed by a trained hierarchical model of the user's choosing. Separation between the classification and explanation algorithms allows our explanation method to be used in a variety of settings with a variety of models.

Pseudocode for our algorithm is given in Figure 2. In the remainder of this section, we describe the contribution-propagation algorithm at a relatively high level. We then discuss how to apply the algorithm to the particular type of hierarchical network discussed in Section II-B. This involves deriving equations specific to the linear SVM, RBF, and maximum operator.

Fig. 1. A hierarchical network taking an image as input. Only a small subset of each layer is shown. Each small square is a node in the network. Computation flows from top (Image) to bottom (SVM). Dashed arrows illustrate the local connectivity of the network: a small subset of each layer (gray group of nodes) is fed as input to a single node (black) in the following layer. The vector consisting of each C2 output is the feature vector used for training and testing the SVM.

A. The Contribution-Propagation Algorithm

The purpose of the contribution-propagation algorithm is to provide an interpretable explanation of which components of the input were responsible (and to what degree) for a given datum's classification. When the classification is the result of many simple calculations, as is the case with hierarchical networks, we form an overall description of an input's importance (i.e., "this pixel was important to the classification") by analyzing the internal calculations of the network (i.e., "node $X_i^\ell$ was important to node $X^{\ell+1}$"). The central idea of contribution propagation is that a node was important to the classification if it was important to its parents, and its parents were important to the classification. Mathematically,

$$C(X_i^\ell) \;\stackrel{\text{def}}{=}\; \sum_{X^{\ell+1} \in \operatorname{pa}(X_i^\ell)} C(X_i^\ell \mid X^{\ell+1}) \, C(X^{\ell+1}). \qquad (1)$$

In Equation 1, $C(X_i^\ell)$ is the contribution of node $X_i^\ell$ to the classification; similarly, $C(X^{\ell+1})$ is the contribution of node $X^{\ell+1}$ to the classification. Finally, $C(X_i^\ell \mid X^{\ell+1})$ is the partial contribution of node $X_i^\ell$ to its parent node $X^{\ell+1}$. Informally, this value represents how important $X_i^\ell$ was to $X^{\ell+1}$. We will give explicit definitions for these terms in later sections; for the moment, we complete our general discussion of the algorithm.

The contribution-propagation algorithm starts by calculating the contribution $C(X_i^L)$ of each top-level feature $X_i^L$ to the classifier. The algorithm then iteratively descends through the layers of the network, determining the contribution of each node to its parent nodes. This process is described in Figure 2.

    // Given instance x = (x_1, x_2, ..., x_n)
    for ℓ = L, ..., 1 do
        for all i do
            Calculate C(X_i^ℓ)    // calculate contribution of each network node
    for all j do
        Calculate C(x_j)          // calculate contribution of each input
    return (C(x_1), C(x_2), ..., C(x_n))

Fig. 2. The contribution-propagation algorithm. $C(X_i^\ell)$ is the total contribution of node $X_i^\ell$. Recall that $\mathbf{x} = (x_1, \ldots, x_n)$ is the input to the network; thus $C(x_j)$ is the total contribution of input $x_j$.

In order to complete the description of contribution propagation, we need only derive the equations for $C(X_k^L)$ (the contribution of the network's output nodes) and $C(X_i^\ell \mid X^{\ell+1})$ (the partial contribution of internal nodes to their parents).
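The following is a minimal sketch of this top-level loop, corresponding to Figure 2 and Equation 1. The net object here is hypothetical (not from the paper's code): we assume it exposes layers (a list of node lists, input layer first, layer L last), parents(node), and partial_contribution(child, parent), the $C(\text{child} \mid \text{parent})$ terms derived in the next subsections.

    def propagate_contributions(net, classifier_contribs):
        """classifier_contribs maps each top-layer node to its contribution
        C(X^L) from the additive classifier. Returns a dict mapping every
        node (and network input) to its total contribution C."""
        C = dict(classifier_contribs)
        # Descend from layer L-1 to the input, applying Eq. (1):
        # C(X_i) = sum over parents p of C(X_i | p) * C(p).
        for layer in reversed(net.layers[:-1]):
            for node in layer:
                C[node] = sum(net.partial_contribution(node, p) * C[p]
                              for p in net.parents(node))
        return C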
We have previously stated that $C(X_i^\ell \mid X^{\ell+1})$ represents how important $X_i^\ell$ was to $X^{\ell+1}$. To make this idea more concrete and well-founded, we impose the following constraints:

$$\sum_{X_i^\ell \in \operatorname{ch}(X^{\ell+1})} C(X_i^\ell \mid X^{\ell+1}) = 1 \qquad (2)$$

and

$$0 \le C(X_i^\ell \mid X^{\ell+1}) \le 1. \qquad (3)$$

In other words, we require that $C(X_i^\ell \mid X^{\ell+1})$ be a distribution over $\operatorname{ch}(X^{\ell+1})$. The meaning of this distribution is the fraction of the value $X^{\ell+1}$ that is due to $X_i^\ell$. With this in mind, we can see that each individual summand from Equation 1, $C(X_i^\ell \mid X^{\ell+1})\, C(X^{\ell+1})$, is the portion of the contribution of $X^{\ell+1}$ that is due to $X_i^\ell$. Thus Equation 1 in its entirety tells us that a node's contribution $C(X_i^\ell)$ is equal to all of the contributions in the parent's layer for which $X_i^\ell$ is responsible.
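As a small illustration (ours; the nested-dictionary layout is purely hypothetical), a routine that checks whether a table of partial contributions satisfies Equations 2 and 3:

    def is_valid_partial_contribution(partials, tol=1e-9):
        """partials: {parent: {child: C(child | parent)}}. Over the children
        of each parent, values must lie in [0, 1] and sum to 1."""
        for parent, dist in partials.items():
            if any(c < -tol or c > 1 + tol for c in dist.values()):
                return False                      # violates Eq. (3)
            if abs(sum(dist.values()) - 1.0) > tol:
                return False                      # violates Eq. (2)
        return True

    print(is_valid_partial_contribution({"p": {"a": 0.7, "b": 0.3}}))  # True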
We now derive $C(X_i^\ell \mid X^{\ell+1})$ for RBF and maximum functions (the two types of nodes in our network). We will conclude with the definition of $C(X_k^L)$, the contribution of the network outputs. This will complete our derivation of the contribution-propagation algorithm.

B. Contribution to Radial Basis Function

Consider a node $X^{\ell+1}$ in an S layer. For convenience, let $\mathbf{X} = \operatorname{ch}(X^{\ell+1})$ and $\mathbf{P} = \mathbf{P}^{\ell+1}$. Then the function computed by $X^{\ell+1}$ is the RBF

$$X^{\ell+1} = \exp\left(-\beta \|\mathbf{X} - \mathbf{P}\|^2\right). \qquad (4)$$

Our goal is to define a function $C_{\text{RBF}}(X_i \mid X^{\ell+1})$ which is a distribution over $\operatorname{ch}(X^{\ell+1})$ and which accurately describes the degree to which $X_i$ was responsible for the value of $X^{\ell+1}$.

Equation 4 is a measure of distance between the vectors $\mathbf{X}$ and $\mathbf{P}$. A closer distance yields a larger RBF value, and a farther distance yields a smaller value. Thus we want $C_{\text{RBF}}(X_i \mid X^{\ell+1})$ to be higher when $X_i$ is closer to $P_i$, meaning when $(X_i - P_i)^2$ is smaller. Moreover, the distance calculated by the RBF is tuned by the function $s(x) = \exp(-\beta x)$. Thus, we define $C_{\text{RBF}}(X_i \mid X^{\ell+1})$ as:

$$C_{\text{RBF}}(X_i \mid X^{\ell+1}) \;\stackrel{\text{def}}{=}\; \frac{\exp\left(-\beta (X_i - P_i)^2\right)}{Z}, \qquad (5)$$

where $Z$ is a normalization term. Recalling the constraints in Equations 2 and 3, we note that the constraint of Equation 3 implies that $Z \ge \exp(-\beta (X_i - P_i)^2)$. To make $C_{\text{RBF}}(X_i \mid X^{\ell+1})$ satisfy the constraint in Equation 2, it must be normalized so that $\sum_i C_{\text{RBF}}(X_i \mid X^{\ell+1}) = 1$. Thus we simply set the denominator $Z$ in Equation 5 equal to the sum over all children of the RBF, and we arrive at the full definition:

$$C_{\text{RBF}}(X_i \mid X^{\ell+1}) \;\stackrel{\text{def}}{=}\; \frac{\exp\left(-\beta (X_i - P_i)^2\right)}{\sum_{X_k \in \operatorname{ch}(X^{\ell+1})} \exp\left(-\beta (X_k - P_k)^2\right)}. \qquad (6)$$

Equation 6 agrees with our intuition: those children $X_i$ that are close to their target $P_i$ have a higher contribution.

C. Contribution to Maximum Function

A node $X^{\ell+1}$ in a C layer calculates the maximum of its children,

$$X^{\ell+1} = \max_{X_i \in \operatorname{ch}(X^{\ell+1})} X_i. \qquad (7)$$

As before, our goal is to define the function $C_{\text{MAX}}(X_i \mid X^{\ell+1})$ which is a distribution over $X_i$ and which accurately describes the degree to which $X_i$ was responsible for the value of $X^{\ell+1}$. To this end, we note that an input to a max function was important to the function if that input was itself the maximum value. Thus we view the max function as a type of switch, and we define

$$C_{\text{MAX}}(X_i \mid X^{\ell+1}) \;\stackrel{\text{def}}{=}\; \begin{cases} 1/r & \text{if } X_i \in M \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$

where $M = \{X_i \in \operatorname{ch}(X^{\ell+1}) : X_i = X^{\ell+1}\}$, and $M$ contains $r$ elements. That is, we divide the contribution equally among those inputs that shared the maximum value; those children that were not the maximum did not contribute.

D. Contribution to an Additive Classifier

Given a feature vector $\mathbf{X}^L$ classified with an additive classifier as $\hat{y}(\mathbf{X}^L) = \operatorname{sgn}\left[\sum_i f_i(X_i^L) + b\right]$, we define the contribution of feature $X_i^L$ to be $f_i(X_i^L)$, as in Poulin et al. [4]. We denote the feature's contribution by writing

$$C(X_i^L) \;\stackrel{\text{def}}{=}\; f_i(X_i^L). \qquad (9)$$

Here we assume the additive classifier is a linear SVM [11]. Given the set $\mathcal{V}$ of support vectors, a linear SVM calculates $\hat{y}(\mathbf{X}^L) = \operatorname{sgn}\left[\sum_{\mathbf{V} \in \mathcal{V}} \alpha_{\mathbf{V}} \langle \mathbf{V}, \mathbf{X}^L \rangle + b\right] = \operatorname{sgn}\left[\sum_i X_i^L \left(\sum_{\mathbf{V} \in \mathcal{V}} \alpha_{\mathbf{V}} V_i\right) + b\right]$, where $\langle \cdot, \cdot \rangle$ denotes the dot product. In this case, Equation 9 becomes

$$C_{\text{SVM}}(X_i^L) \;\stackrel{\text{def}}{=}\; X_i^L \sum_{\mathbf{V} \in \mathcal{V}} \alpha_{\mathbf{V}} V_i. \qquad (10)$$

That is, Equation 10 gives the contribution of node $X_i^L$ to the SVM's classification of feature vector $\mathbf{X}^L$. A positive value gives the amount to which node $X_i^L$ contributes to a positive classification (or away from a negative classification); similarly, a negative value gives the degree of $X_i^L$'s contribution toward a negative (or away from a positive) classification.
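Collecting the three derivations, here is a minimal sketch (ours, with numpy assumed) of Equations 6, 8, and 10 as vectorized functions; for the SVM, the weights $w_i = \sum_{\mathbf{V}} \alpha_{\mathbf{V}} V_i$ are assumed precomputed into a single vector:

    import numpy as np

    def c_rbf(children, P, beta):
        """Eq. (6): larger weight where the child X_i is closer to P_i."""
        scores = np.exp(-beta * (np.asarray(children) - np.asarray(P)) ** 2)
        return scores / scores.sum()

    def c_max(children):
        """Eq. (8): contribution split equally among the maximal children."""
        children = np.asarray(children, dtype=float)
        mask = (children == children.max())
        return mask / mask.sum()

    def c_svm(features, w):
        """Eq. (10): per-feature contribution to the SVM score."""
        return np.asarray(features) * np.asarray(w)

    print(c_rbf([0.2, 0.9], P=[0.2, 0.5], beta=5.0))  # first child dominates
    print(c_max([0.3, 0.8, 0.8]))                     # [0.0, 0.5, 0.5]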
IV. HIERARCHICAL NETWORK IMPLEMENTATION

We implement a four-layer network (Figure 1), based on the network of Serre et al. [3]. The input image is preprocessed to form a 256 x 256 gray-scale image with local contrast enhancement. An S1 kernel is an 11 x 11-pixel Gabor filter. Using Equation 4, we apply a battery of Gabors at 8 orientations, 2 phases, and 4 scales, with β = 1.0 for all S1 nodes. For each Gabor configuration, we subsample by centering an S1 node at every other pixel, resulting in a set of 64 S1 output maps, each of size 128 x 128. A C1 node pools over the two phases and a 5 x 5 spatial neighborhood of S1 outputs. We again subsample in the same way, resulting in 32 C1 output maps, each of size 64 x 64. For an S2 node, the input is a 7 x 7 neighborhood of C1 nodes at all orientations, but at a single scale. The input vector and the kernel of each S2 node are each scaled to unit length ($\|\mathbf{X}\| = \|\mathbf{P}\| = 1$). We set β = 5.0 for every S2 node. For each kernel, there is a corresponding S2 node centered at every other C1 node, resulting in multiple 32 x 32 S2 output maps, one for each kernel and scale. Finally, a C2 node applies a max operation to all locations and all scales of a single kernel's S2 map. Thus the output of the C2 layer is a vector with one component per S2 kernel. This feature vector is passed to the linear SVM. We use the SVMlight package [11] with an unbiased SVM (b = 0). This allows a simpler derivation of our method without impacting the accuracy of the network.
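The layer sizes above follow from subsampling by a factor of two at each S layer. A small sketch of that arithmetic (our reading of this section, not the authors' code; the 1000-kernel count is the one used in Section V-B):

    image_size = 256
    s1_maps = 8 * 2 * 4            # orientations x phases x scales = 64 maps
    s1_size = image_size // 2      # S1 node at every other pixel: 128 x 128
    c1_maps = s1_maps // 2         # the two phases are pooled away: 32 maps
    c1_size = s1_size // 2         # subsampled again: 64 x 64
    n_kernels = 1000               # imprinted S2 kernels (Section V-B)
    s2_size = c1_size // 2         # S2 node at every other C1 node: 32 x 32
    c2_features = n_kernels        # one C2 output per kernel: max over all
                                   # positions and scales of its S2 maps
    print(s1_maps, s1_size, c1_maps, c1_size, s2_size, c2_features)
    # -> 64 128 32 64 32 1000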
V. RESULTS

We apply our contribution-propagation method in order to explain the classifications of hierarchical networks that are trained on different tasks. The first experiment is intentionally simple and controlled, and demonstrates that our approach accurately explains the network's classifications. The second experiment, which uses a well-known, real-world data set, shows how contribution propagation gives new insight into the network's performance.

A. Simple Shapes

In our first experiment, we use a simple artificial visual classification task to verify that our method's explanations are correct and understandable. Each training image contains a simple shape, either an "L" shape (Figure 3B, positive class) or an inverted "L" shape (Figure 3C, negative class). Noise is added by rotating the shape uniformly randomly within ±5 degrees and translating the shape to a random location, and 1/f noise is added to the background. The noise ensures that the learned classifier is nontrivial. Figure 3A shows the two learned S2 kernels around the vertex of the "L" and inverted "L" shapes.

We input 20 training images (10 positive and 10 negative) to the network, and use the resulting feature vectors to train the SVM. Our test images contain 9 possible shapes (Fig. 3D), including both an "L" and an inverted "L", each placed at a random position in a 3 x 3 grid and rotated randomly within ±5 degrees. Again, 1/f noise is added to the background. Note that because both the positive and negative objects are present in the test image, we do not expect one classification over the other. The test images were designed to illustrate the authenticity of the contribution-propagation algorithm rather than to test the classification accuracy of the hierarchical network (accuracy will be addressed in the next section). All test images in this toy example were very near the decision boundary, which is reasonable given that both the positive and negative classes are present in each test image.

Using contribution propagation, we explain a test image's classification using false color as follows. First, we trace down through the layers of the network (Fig. 3E-H) using the algorithm presented in Figure 2. Thus Figure 3E shows the contribution of each node in the S2 layer, Figure 3F shows the contribution of each node in the C1 layer, and so on. This results in the calculation of the contribution $C(x_i)$ of each pixel $x_i$ (Fig. 3H). Pixels drawn in red contributed to a positive classification ("L"); pixels drawn in blue contributed to a negative classification (inverted "L"); pixels that did not contribute to the classification are drawn in green.

We see in this visualization (Figure 3H) that our contribution-propagation algorithm correctly colored the pixels surrounding the area matching the learned kernels (red around the "L" shape, and blue around the inverted "L" shape). Our algorithm thus explains the classification of "undecided": there was a nearly equal pull between the pixels surrounding the "L" (toward positive classification) and those surrounding the inverted "L" (toward negative classification). This pixel-level explanation of how the image is interpreted by the network and classifier was provided automatically by our contribution-propagation algorithm, and gives evidence for the correctness of our algorithm.
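The false-color rendering can be sketched as follows (the exact colormap is our assumption; the paper specifies only that positive contributions are drawn red, negative blue, and near-zero green):

    import numpy as np

    def contributions_to_rgb(C):
        """C: 2-D array of per-pixel contributions C(x_i). Returns an
        H x W x 3 float image with red = positive, blue = negative."""
        C = np.asarray(C, dtype=float)
        scale = max(np.abs(C).max(), 1e-12)   # avoid dividing by zero
        Cn = C / scale                        # normalize to [-1, 1]
        rgb = np.zeros(C.shape + (3,))
        rgb[..., 0] = np.clip(Cn, 0.0, 1.0)   # red: toward positive class
        rgb[..., 2] = np.clip(-Cn, 0.0, 1.0)  # blue: toward negative class
        rgb[..., 1] = 1.0 - np.abs(Cn)        # green: little contribution
        return rgb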
B. Real-World Images

Next, we use the Caltech101 data set [12] to train the network and a linear, unbiased SVM in a binary classification task using the categories of chair (positive class, corresponding to red in the visualizations) and dalmatian (negative class, corresponding to blue). The categories contain 60 images each. Using 10 splits for cross-validation, we randomly choose 30 training images and 30 test images from each category. Following Serre et al. [3], the network imprints 1000 S2 kernels randomly from the S2 inputs of the training set, and the SVM is trained on the resulting network's output for each training image. Test images are classified with an average accuracy of 94%, with a 3% standard deviation (a biased SVM achieved 93% accuracy with 1.2% standard deviation).

In Figure 4, we see that some images (A, C) are correctly classified primarily due to the pixels of the object itself (B, D). However, our method also reveals some surprising behavior of the network and classifier (F, H): it appears that some images were correctly classified due to features extracted primarily from the image's background. In Figure 4F, this may be less surprising, as the background is quite similar to the dalmatian. However, in Figure 4H, it is unclear why the background (dark red) was taken as evidence for the presence of a chair (or, possibly, the absence of a dalmatian). Such an unexpected explanation offered by the contribution-propagation method can be useful to the user who is trying to create a system that will generalize well; the user can see that, at least in some cases, the network is basing its classification on features that are not relevant to the general task, due to either deficiencies in the network or spurious correlations in the data set.

A natural question is how often a correct classification is surprising (that is to say, a correctly classified image where the background appears to contribute more than the object). Formulating a metric to define such a surprising classification is beyond the scope of the present work. However, a subjective visual inspection of the classified images reveals 5 of the 60 classifications of test images to be of this nature.

As a final test of the classification explanations provided by the contribution-propagation algorithm, we edit an image to include both a dalmatian and a chair (Figure 4I). This image was classified by the network as negative (dalmatian), and the contribution-propagation algorithm explains this classification. Figure 4J shows that, although there were features associated with the chair class on the right side of the image (yellow, light red), the features extracted from the pixels belonging to the dalmatian were weighted more heavily (deep blue).

Some readers may feel that the small number of training images used may cast doubt on the validity of the trained classifier. However, note that researchers often benchmark their computer-vision systems by measuring performance on the Caltech101 dataset using 30 images per class as training data [13]. Thus our experiment was designed to mimic a benchmarking process familiar to many computer-vision researchers.
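The evaluation protocol can be sketched as follows (our reconstruction, not the authors' code; scikit-learn's LinearSVC with fit_intercept=False stands in for the unbiased SVMlight classifier, and pos_feats/neg_feats are placeholder lists of C2 feature vectors, one per image, produced by the HMAX pipeline):

    import numpy as np
    from sklearn.svm import LinearSVC

    def evaluate(pos_feats, neg_feats, n_splits=10, n_train=30):
        """10 random splits; 30 training and 30 test images per class.
        Returns mean and standard deviation of test accuracy."""
        rng = np.random.default_rng(0)
        accs = []
        for _ in range(n_splits):
            Xtr, ytr, Xte, yte = [], [], [], []
            for label, feats in ((1, pos_feats), (-1, neg_feats)):
                idx = rng.permutation(len(feats))
                Xtr += [feats[i] for i in idx[:n_train]]             # train
                Xte += [feats[i] for i in idx[n_train:2 * n_train]]  # test
                ytr += [label] * n_train
                yte += [label] * n_train
            svm = LinearSVC(fit_intercept=False)   # unbiased SVM, b = 0
            svm.fit(Xtr, ytr)
            accs.append(svm.score(Xte, yte))
        return np.mean(accs), np.std(accs)         # reported: 0.94, 0.03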
Fig. 3. (Figure best viewed in color.) Visualizing the binary classification of images containing simple shapes. Panels (A)-(D) show the training and testing of the network: (A) imprinted kernels; (B) positive training images; (C) negative training images; (D) test image. Panels (E)-(H) explain the classification of (D) through the layers of the network, showing the contributions $C(\mathbf{X}^{S2})$, $C(\mathbf{X}^{C1})$, $C(\mathbf{X}^{S1})$, and $C(\mathbf{x})$, respectively; (I) is the color legend for the visualizations in (E)-(H). The contribution of the nodes in each layer verifies the correctness of our algorithm. Two S2 kernels are used (shaded squares, A). A linear SVM is trained on images containing either an "L" shape (B, positive class) or an inverted "L" (C, negative class). We present a test image (D), and use contribution propagation to visualize the contributions at all layers (E-H). Note that the image is drawn in the background of (E-H) in order to better explain the contribution of each region. Colors correspond to the pixel's contribution, as shown in the legend (I). These visualizations give evidence for the correctness of our contribution-propagation algorithm.

Fig. 4. (Figure best viewed in color.) Visualizations from the classification of chairs (positive) vs. dalmatians (negative), from the Caltech101 database. Panels (A, C, E, G, I) show test images; panels (B, D, F, H, J) explain their classifications. Colors correspond to the legend in Figure 3I: positive contribution (toward "chair") is denoted with red, and negative contribution (toward "dalmatian") with blue. Some images (A, C) are correctly classified because of the contribution of pixels that belong to the object being classified (B, blue on dalmatian; D, red on chair). Other images that contain confusing patterns in the background (E, G) are still correctly classified, but partially due to the contribution of background pixels (F, blue on background; H, red on background). An image manipulated to contain both objects (I) is classified as "dalmatian", and this classification is intuitively explained by the contribution-propagation algorithm (J).
In this light, the surprising results presented in Figure 4 hint at an important question for the computer-vision community: does high performance on this dataset indicate a system's capacity for object recognition, or merely for learning spurious statistical (background) cues? The prevalence of this dataset makes this question all the more pressing.

VI. CONCLUSIONS

We presented a novel method for explaining the classifications of hierarchical networks with additive classifiers, having reviewed why traditional approaches such as sensitivity analysis fail to give satisfactory explanations. Our method extends the contribution-based explanations of Poulin et al. [4], and determines the contribution of each input based on the internal calculations of the network. We empirically validate our method's explanations with a simple object-recognition task using artificial data. We apply our method to a binary classification task using a well-known set of natural images, revealing surprising artifacts in the way that some images are classified. In particular, we see that some images are correctly classified because of the contribution of pixels belonging to the image's background (Figure 4 F, H). This behavior is surprising when the task is completed with high accuracy, but it is also very useful to the user of the machine-learning algorithm. Such information provided by our method can help the user to tune the algorithm for better generalizability, as well as to create data sets without artifacts in the background.

In order to visualize our method's explanations, we classify with an unbiased SVM. Although the lack of bias may, in some cases, lead to lower classification accuracy, it did not for any task we implemented. We stress that the lack of bias was chosen merely for visualization, and that the contribution-propagation method can be implemented with biased functions as well. In future work, the contribution-propagation method may also be applied to different types of hierarchical networks, for example those using convolutions or different subsampling methods.

ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant Nos. 1018967 and 0749348, as well as the Laboratory Directed Research and Development program at Los Alamos National Laboratory (Project 20090006DR). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] R. Rifkin, J. Bouvrie, K. Schutte, S. Chikkerur, M. Kouh, T. Ezzat, and T. Poggio, "Phonetic classification using hierarchical, feed-forward, spectro-temporal patch-based architectures," Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2007-019, 2007.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[3] T. Serre, A. Oliva, and T. Poggio, "A feedforward architecture accounts for rapid categorization," Proceedings of the National Academy of Sciences, vol. 104, no. 15, 2007.
[4] B. Poulin, R. Eisner, D. Szafron, P. Lu, R. Greiner, D. Wishart, A. Fyshe, B. Pearcy, C. Macdonell, and J. Anvik, "Visual explanation of evidence in additive classifiers," in Proceedings of the 18th Conference on Innovative Applications of Artificial Intelligence. IAAI, July 2006.
[5] L. Fu, "Rule generation from neural networks," IEEE Transactions on Systems, Man and Cybernetics, vol. 24, no. 8, 1994.
[6] T. Masquelier and S. Thorpe, "Unsupervised learning of visual features through spike timing dependent plasticity," PLoS Computational Biology, vol. 3, no. 2, 2007.
[7] E. Štrumbelj and I. Kononenko, "An efficient explanation of individual classifications using game theory," Journal of Machine Learning Research, vol. 11, no. 1, 2010.
[8] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, "How to explain individual classification decisions," Journal of Machine Learning Research, vol. 11, pp. 1803-1831, 2010.
[9] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, vol. 2, no. 11, 1999.
[10] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," in International Conference on Machine Learning, 2008.
[11] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. MIT Press, 1999.
[12] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories," in IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[13] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007.