arxiv: v3 [cs.lg] 19 Apr 2016
|
|
|
- Letitia Kennedy
- 9 years ago
- Views:
Transcription
1 Show, ttend and Tell: Neural Image Caption Generation with Visual ttention arxiv: v3 [cs.lg] 19 pr 2016 Kelvin Xu Jimmy Lei Ba Ryan Kiros Kyunghyun Cho aron Courville Ruslan Salakhutdinov Richard S. Zemel Yoshua Bengio bstract Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-theart performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO. 1. Introduction utomatically generating captions of an image is a task very close to the heart of scene understanding one of the primary goals of computer vision. Not only must caption generation models be powerful enough to solve the computer vision challenges of determining which objects are in an image, but they must also be capable of capturing and expressing their relationships in a natural language. For this reason, caption generation has long been viewed as a difficult problem. It is a very important challenge for machine learning algorithms, as it amounts to mimicking the remarkable human ability to compress huge amounts of salient visual infomation into descriptive language. Despite the challenging nature of this task, there has been a recent surge of research interest in attacking the image caption generation problem. ided by advances in training neural networks (Krizhevsky et al., 2012) and large classification datasets (Russakovsky et al., 2014), recent work [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] Figure 1. Our model learns a words/image alignment. The visualized attentional maps (3) are explained in section 3.1 & Input Image 14x14 Feature Map LSTM 2. Convolutional 3. RNN with attention Feature Extraction over the image bird flying over a body of water 4. Word by word generation has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representation of images and recurrent neural networks to decode those representations into natural language sentences (see Sec. 2). One of the most curious facets of the human visual system is the presence of attention (Rensink, 2000; Corbetta & Shulman, 2002). Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. Using representations (such as those from the top layer of a convnet) that distill information in image down to the most salient objects is one effective solution that has been widely adopted in previous work. Unfortunately, this has one potential drawback of losing information which could be useful for richer, more descriptive captions. Using more low-level representation can help preserve this information. However working with these features necessitates a powerful mechanism to steer the model to information important to the task at hand. In this paper, we describe approaches to caption generation that attempt to incorporate a form of attention with
2 Figure 2. ttention over time. s the model generates each word, its attention changes to reflect the relevant parts of the image. soft (top row) vs hard (bottom row) attention. (Note that both models generated the same captions in this example.) Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicated the corresponding word) two variants: a hard attention mechanism and a soft attention mechanism. We also show how one advantage of including attention is the ability to visualize what the model sees. Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient part of an image while generating its caption. The contributions of this paper are the following: We introduce two attention-based image caption generators under a common framework (Sec. 3.1): 1) a soft deterministic attention mechanism trainable by standard back-propagation methods and 2) a hard stochastic attention mechanism trainable by maximizing an approximate variational lower bound or equivalently by REINFORCE (Williams, 1992). We show how we can gain insight and interpret the results of this framework by visualizing where and what the attention focused on. (see Sec. 5.4) Finally, we quantitatively validate the usefulness of attention in caption generation with state of the art performance (Sec. 5.3) on three benchmark datasets: Flickr8k (Hodosh et al., 2013), Flickr30k (Young et al., 2014) and the MS COCO dataset (Lin et al., 2014). 2. Related Work In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recurrent neural networks and inspired by the successful use of sequence to sequence training with neural networks for machine translation (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014). One major reason image caption generation is well suited to the encoder-decoder framework (Cho et al., 2014) of machine translation is because it is analogous to translating an image to a sentence. The first approach to use neural networks for caption generation was Kiros et al. (2014a), who proposed a multimodal log-bilinear model that was biased by features from the image. This work was later followed by Kiros et al. (2014b) whose method was designed to explicitly allow a natural way of doing both ranking and generation. Mao et al. (2014) took a similar approach to generation but replaced a feed-forward neural language model with a recurrent one. Both Vinyals et al. (2014) and Donahue et al. (2014) use LSTM RNNs for their models. Unlike Kiros et al. (2014a) and Mao et al. (2014) whose models see the image at each time step of the output word sequence, Vinyals et al. (2014) only show the image to the RNN at the beginning. long
3 with images, Donahue et al. (2014) also apply LSTMs to videos, allowing their model to generate video descriptions. ll of these works represent images as a single feature vector from the top layer of a pre-trained convolutional network. Karpathy & Li (2014) instead proposed to learn a joint embedding space for ranking and generation whose model learns to score sentence and image similarity as a function of R-CNN object detections with outputs of a bidirectional RNN. Fang et al. (2014) proposed a three-step pipeline for generation by incorporating object detections. Their model first learn detectors for several visual concepts based on a multi-instance learning framework. language model trained on captions was then applied to the detector outputs, followed by rescoring from a joint image-text embedding space. Unlike these models, our proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows our model to go beyond objectness and learn to attend to abstract concepts. Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery (Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011), Mitchell et al. (2012), Elliott & Keller (2013)). The second approach was based on first retrieving similar captioned images from a large database then modifying these retrieved captions to fit the query (Kuznetsova et al., 2012; 2014). These approaches typically involved an intermediate generalization step to remove the specifics of a caption that are only relevant to the retrieved image, such as the name of a city. Both of these approaches have since fallen out of favour to the now dominant neural network methods. There has been a long line of previous work incorporating attention into neural networks for vision related tasks. Some that share the same spirit as our work include Larochelle & Hinton (2010); Denil et al. (2012); Tang et al. (2014). In particular however, our work directly extends the work of Bahdanau et al. (2014); Mnih et al. (2014); Ba et al. (2014). 3. Image Caption Generation with ttention Mechanism 3.1. Model Details In this section, we describe the two variants of our attention-based model by first describing their common framework. The main difference is the definition of the φ function which we describe in detail in Section 4. We denote vectors with bolded font and matrices with capital letters. In our description below, we suppress bias terms for readability. z t h t-1 Ey t-1 h t-1 z t Ey t-1 h t-1 z t Ey t-1 i input gate input modulator c f h t-1 z t Ey t-1 o memory cell forget gate output gate Figure 4. LSTM cell, lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate) ENCODER: CONVOLUTIONL FETURES Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words. y = {y 1,..., y C }, y i R K where K is the size of the vocabulary and C is the length of the caption. We use a convolutional neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D-dimensional representation corresponding to a part of the image. a = {a 1,..., a L }, a i R D In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer. This allows the decoder to selectively focus on certain parts of an image by selecting a subset of all the feature vectors DECODER: LONG SHORT-TERM MEMORY NETWORK We use a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997) that produces a caption by generating one word at every time step conditioned on a context vector, the previous hidden state and the previously generated words. Our implementation of LSTM h t
4 closely follows the one used in Zaremba et al. (2014) (see Fig. 4). Using T s,t : R s R t to denote a simple affine transformation with parameters that are learned, i t f t o t g t σ = σ σ T D+m+n,n tanh Ey t 1 h t 1 ẑ t (1) c t = f t c t 1 + i t g t (2) h t = o t tanh(c t ). (3) Here, i t, f t, c t, o t, h t are the input, forget, memory, output and hidden state of the LSTM, respectively. The vector ẑ R D is the context vector, capturing the visual information associated with a particular input location, as explained below. E R m K is an embedding matrix. Let m and n denote the embedding and LSTM dimensionality respectively and σ and be the logistic sigmoid activation and element-wise multiplication respectively. In simple terms, the context vector ẑ t (equations (1) (3)) is a dynamic representation of the relevant part of the image input at time t. We define a mechanism φ that computes ẑ t from the annotation vectors a i, i = 1,..., L corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight α i which can be interpreted either as the probability that location i is the right place to focus for producing the next word (the hard but stochastic attention mechanism), or as the relative importance to give to location i in blending the a i s together. The weight α i of each annotation vector a i is computed by an attention model fatt for which we use a multilayer perceptron conditioned on the previous hidden state h t 1. The soft version of this attention mechanism was introduced by Bahdanau et al. (2014). For emphasis, we note that the hidden state varies as the output RNN advances in its output sequence: where the network looks next depends on the sequence of words that has already been generated. e ti =fatt(a i, h t 1 ) (4) exp(e ti ) α ti = L k=1 exp(e tk). (5) Once the weights (which sum to one) are computed, the context vector ẑ t is computed by ẑ t = φ ({a i }, {α i }), (6) through two separate MLPs (init,c and init,h): c 0 = f init,c ( 1 L h 0 = f init,h ( 1 L L a i ) i L a i ) In this work, we use a deep output layer (Pascanu et al., 2014) to compute the output word probability given the LSTM state, the context vector and the previous word: p(y t a, y t 1 1 ) exp(l o (Ey t 1 + L h h t + L z ẑ t )) (7) Where L o R K m, L h R m n, L z R m D, and E are learned parameters initialized randomly. 4. Learning Stochastic Hard vs Deterministic Soft ttention In this section we discuss two alternative mechanisms for the attention model fatt: stochastic attention and deterministic attention Stochastic Hard ttention We represent the location variable s t as where the model decides to focus attention when generating the t th word. s t,i is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features. By treating the attention locations as intermediate latent variables, we can assign a multinoulli distribution parametrized by {α i }, and view ẑ t as a random variable: i p(s t,i = 1 s j<t, a) = α t,i (8) ẑ t = i s t,i a i. (9) We define a new objective function L s that is a variational lower bound on the marginal log-likelihood log p(y a) of observing the sequence of words y given image features a. The learning algorithm for the parameters W of the models can be derived by directly optimizing L s : L s = s p(s a) log p(y s, a) log p(s a)p(y s, a) s = log p(y a) (10) where φ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The details of φ function are discussed in Sec. 4. The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed L s W = s [ log p(y s, a) p(s a) + W log p(s a) log p(y s, a) W ]. (11)
5 Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw. Equation 11 suggests a Monte Carlo based sampling approximation of the gradient with respect to the model parameters. This can be done by sampling the location s t from a multinouilli distribution defined by Equation 8. L s W 1 N s t Multinoulli L ({α i }) N [ log p(y s n, a) + W n=1 log p(y s n, a) log p( sn a) W ] (12) moving average baseline is used to reduce the variance in the Monte Carlo estimator of the gradient, following Weaver & Tao (2001). Similar, but more complicated variance reduction techniques have previously been used by Mnih et al. (2014) and Ba et al. (2014). Upon seeing the k th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log likelihoods with exponential decay: b k = 0.9 b k log p(y s k, a) To further reduce the estimator variance, an entropy term on the multinouilli distribution H[s] is added. lso, with probability 0.5 for a given image, we set the sampled attention location s to its expected value α. Both techniques improve the robustness of the stochastic attention learning algorithm. The final learning rule for the model is then the following: L s W 1 N N [ log p(y s n, a) + W n=1 λ r (log p(y s n, a) b) log p( sn a) W H[ s n ] ] + λ e W where, λ r and λ e are two hyper-parameters set by crossvalidation. s pointed out and used in Ba et al. (2014) and Mnih et al. (2014), this is formulation is equivalent to the REINFORCE learning rule (Williams, 1992), where the reward for the attention choosing a sequence of actions is a real value proportional to the log likelihood of the target sentence under the sampled attention trajectory. In making a hard choice at every point, φ ({a i }, {α i }) from Equation 6 is a function that returns a sampled a i at every point in time based upon a multinouilli distribution parameterized by α Deterministic Soft ttention Learning stochastic attention requires sampling the attention location s t each time, instead we can take the expectation of the context vector ẑ t directly, E p(st a)[ẑ t ] = L α t,i a i (13) i=1 and formulate a deterministic attention model by computing a soft attention weighted annotation vector φ ({a i }, {α i }) = L i α ia i as introduced by Bahdanau et al. (2014). This corresponds to feeding in a soft α
6 weighted context into the system. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard backpropagation. Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Equation 10 under the attention location random variable s t from Sec The hidden activation of LSTM h t is a linear projection of the stochastic context vector ẑ t followed by tanh non-linearity. To the first order Taylor approximation, the expected value E p(st a)[h t ] is equal to computing h t using a single forward prop with the expected context vector E p(st a)[ẑ t ]. Considering Eq. 7, let n t = L o (Ey t 1 +L h h t +L z ẑ t ), n t,i denotes n t computed by setting the random variable ẑ value to a i. We define the normalized weighted geometric mean for the softmax k th word prediction: i exp(n t,k,i) p(st,i=1 a) NW GM[p(y t = k a)] = j i exp(n t,j,i) p(st,i=1 a) = exp(e p(s t a)[n t,k ]) j exp(e p(s t a)[n t,j ]) The equation above shows the normalized weighted geometric mean of the caption prediction can be approximated well by using the expected context vector, where E[n t ] = L o (Ey t 1 + L h E[h t ] + L z E[ẑ t ]). It shows that the NWGM of a softmax unit is obtained by applying softmax to the expectations of the underlying linear projections. lso, from the results in (Baldi & Sadowski, 2014), NW GM[p(y t = k a)] E[p(y t = k a)] under softmax activation. That means the expectation of the outputs over all possible attention locations induced by random variable s t is computed by simple feedforward propagation with expected context vector E[ẑ t ]. In other words, the deterministic attention model is an approximation to the marginal likelihood over the attention locations DOUBLY STOCHSTIC TTENTION By construction, i α ti = 1 as they are the output of a softmax. In training the deterministic version of our model we introduce a form of doubly stochastic regularization, where we also encourage t α ti 1. This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. In our experiments, we observed that this penalty was important quantitatively to improving overall BLEU score and that qualitatively this leads to more rich and descriptive captions. In addition, the soft attention model predicts a gating scalar β from previous hidden state h t 1 at each time step t, such that, φ ({a i }, {α i }) = β L i α ia i, where β t = σ(f β (h t 1 )). We notice our attention weights put more emphasis on the objects in the images by including the scalar β. Concretely, the model is trained end-to-end by minimizing the following penalized negative log-likelihood: L C L d = log(p (y x)) + λ (1 α ti ) 2 (14) 4.3. Training Procedure Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rate algorithms. For the Flickr8k dataset, we found that RM- SProp (Tieleman & Hinton, 2012) worked best, while for Flickr30k/MS COCO dataset we used the recently proposed dam algorithm (Kingma & Ba, 2014). To create the annotations a i used by our decoder, we used the Oxford VGGnet (Simonyan & Zisserman, 2014) pretrained on ImageNet without finetuning. In principle however, any encoding function could be used. In addition, with enough data, we could also train the encoder from scratch (or fine-tune) with the rest of the model. In our experiments we use the feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened (i.e L D) encoding. s our implementation requires time proportional to the length of the longest sentence per update, we found training on a random group of captions to be computationally wasteful. To mitigate this problem, in preprocessing we build a dictionary mapping the length of a sentence to the corresponding subset of captions. Then, during training we randomly sample a length and retrieve a mini-batch of size 64 of that length. We found that this greatly improved convergence speed with no noticeable diminishment in performance. On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDI Titan Black GPU. In addition to dropout (Srivastava et al., 2014), the only other regularization strategy we used was early stopping on BLEU score. We observed a breakdown in correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection. In our experiments with soft attention, we also used Whetlab 1 (Snoek et al., 2012; 2014) in our Flickr8k experiments. Some of the intuitions we gained from hyperparameter regions it explored were especially important in our Flickr30k and COCO experiments. We make our code for these models based in Theano 1 i t
7 Table 1. BLEU-1,2,3,4/METEOR metrics compared to other methods, indicates a different split, ( ) indicates an unknown metric, indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using lexnet BLEU Dataset Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR Google NIC(Vinyals et al., 2014) Σ Flickr8k Log Bilinear (Kiros et al., 2014a) Soft-ttention Hard-ttention Flickr30k COCO Google NIC Σ Log Bilinear Soft-ttention Hard-ttention CMU/MS Research (Chen & Zitnick, 2014) a MS Research (Fang et al., 2014) a BRNN (Karpathy & Li, 2014) Google NIC Σ Log Bilinear Soft-ttention Hard-ttention (Bergstra et al., 2010) publicly available upon publication to encourage future research in this area. 5. Experiments We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation Data We report results on the popular Flickr8k and Flickr30k dataset which has 8,000 and 30,000 images respectively as well as the more challenging Microsoft COCO dataset which has 82,783 images. The Flickr8k/Flickr30k dataset both come with 5 reference sentences per image, but for the MS COCO dataset, some of the images have references in excess of 5 which for consistency across our datasets we discard. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000. Results for our attention-based architecture are reported in Table We report results with the frequently used BLEU metric 2 which is the standard in the caption generation literature. We report BLEU from 1 to 4 with- 2 We verified that our BLEU evaluation code matches the authors of Vinyals et al. (2014), Karpathy & Li (2014) and Kiros et al. (2014b). For fairness, we only compare against results for which we have verified that our BLEU evaluation code is the same. With the upcoming release of the COCO evaluation server, we will include comparison results with all other recent image captioning models. out a brevity penalty. There has been, however, criticism of BLEU, so in addition we report another common metric METEOR (Denkowski & Lavie, 2014), and compare whenever possible Evaluation Procedures few challenges exist for comparison, which we explain here. The first is a difference in choice of convolutional feature extractor. For identical decoder architectures, using more recent architectures such as GoogLeNet or Oxford VGG Szegedy et al. (2014), Simonyan & Zisserman (2014) can give a boost in performance over using the lexnet (Krizhevsky et al., 2012). In our evaluation, we compare directly only with results which use the comparable GoogLeNet/Oxford VGG features, but for METEOR comparison we note some results that use lexnet. The second challenge is a single model versus ensemble comparison. While other methods have reported performance boosts by using ensembling, in our results we report a single model performance. Finally, there is challenge due to differences between dataset splits. In our reported results, we use the predefined splits of Flickr8k. However, one challenge for the Flickr30k and COCO datasets is the lack of standardized splits. s a result, we report with the publicly available splits 3 used in previous work (Karpathy & Li, 2014). In our experience, differences in splits do not make a substantial difference in overall performance, but we note the differ- 3 deepimagesent/
8 ences where they exist Quantitative nalysis In Table 4.2.1, we provide a summary of the experiment validating the quantitative effectiveness of attention. We obtain state of the art performance on the Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state of the art performance METEOR on MS COCO that we speculate is connected to some of the regularization techniques we used and our lower level representation. Finally, we also note that we are able to obtain this performance using a single model without an ensemble Qualitative nalysis: Learning to attend By visualizing the attention component learned by the model, we are able to add an extra layer of interpretability to the output of the model (see Fig. 1). Other systems that have done this rely on object detection systems to produce candidate alignment targets (Karpathy & Li, 2014). Our approach is much more flexible, since the model can attend to non object salient regions. The 19-layer OxfordNet uses stacks of 3x3 filters meaning the only time the feature maps decrease in size are due to the max pooling layers. The input image is resized so that the shortest side is 256 dimensional with preserved aspect ratio. The input to the convolutional network is the center cropped 224x224 image. Consequently, with 4 max pooling layers we get an output dimension of the top convolutional layer of 14x14. Thus in order to visualize the attention weights for the soft model, we simply upsample the weights by a factor of 2 4 = 16 and apply a Gaussian filter. We note that the receptive fields of each of the 14x14 units are highly overlapping. s we can see in Figure 2 and 3, the model learns alignments that correspond very strongly with human intuition. Especially in the examples of mistakes, we see that it is possible to exploit such visualizations to get an intuition as to why those mistakes were made. We provide a more extensive list of visualizations in ppendix for the reader. 6. Conclusion We propose an attention based approach that gives state of the art performance on three benchmark datasets using the BLEU and METEOR metric. We also show how the learned attention can be exploited to give more interpretability into the models generation process, and demonstrate that the learned alignments correspond very well to human intuition. We hope that the results of this paper will encourage future work in using visual attention. We also expect that the modularity of the encoder-decoder approach combined with attention to have useful applications in other domains. cknowledgments The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following organizations for research funding and computing support: the Nuance Foundation, NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFR. The authors would also like to thank Nitish Srivastava for assistance with his ConvNet package as well as preparing the Oxford convolutional network and Relu Patrascu for helping with numerous infrastructure related problems. References Ba, Jimmy Lei, Mnih, Volodymyr, and Kavukcuoglu, Koray. Multiple object recognition with visual attention. arxiv: , December Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arxiv: , September Baldi, Pierre and Sadowski, Peter. The dropout learning algorithm. rtificial intelligence, 210:78 122, Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, rnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. Submited to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for image caption generation. arxiv preprint arxiv: , Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October Corbetta, Maurizio and Shulman, Gordon L. Control of goaldirected and stimulus-driven attention in the brain. Nature reviews neuroscience, 3(3): , Denil, Misha, Bazzani, Loris, Larochelle, Hugo, and de Freitas, Nando. Learning where to attend with deep architectures for image tracking. Neural Computation, Denkowski, Michael and Lavie, lon. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ECL 2014 Workshop on Statistical Machine Translation, 2014.
9 Donahue, Jeff, Hendrikcs, Lisa nne, Guadarrama, Segio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arxiv: v2, November Elliott, Desmond and Keller, Frank. Image description using visual dependency representations. In EMNLP, Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arxiv: , November Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8): , Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of rtificial Intelligence Research, pp , Karpathy, ndrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arxiv: , December Kingma, Diederik P. and Ba, Jimmy. dam: Method for Stochastic Optimization. arxiv: , December Kiros, Ryan, Salahutdinov, Ruslan, and Zemel, Richard. Multimodal neural language models. In International Conference on Machine Learning, pp , 2014a. Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Unifying visual-semantic embeddings with multimodal neural language models. arxiv: , November 2014b. Krizhevsky, lex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In NIPS Kulkarni, Girish, Premraj, Visruth, Ordonez, Vicente, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, lexander C, and Berg, Tamara L. Babytalk: Understanding and generating simple image descriptions. PMI, IEEE Transactions on, 35(12): , Kuznetsova, Polina, Ordonez, Vicente, Berg, lexander C, Berg, Tamara L, and Choi, Yejin. Collective generation of natural image descriptions. In ssociation for Computational Linguistics. CL, Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and Choi, Yejin. Treetalk: Composition and compression of trees for image descriptions. TCL, 2(10): , Larochelle, Hugo and Hinton, Geoffrey E. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, pp , Li, Siming, Kulkarni, Girish, Berg, Tamara L, Berg, lexander C, and Choi, Yejin. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning. CL, Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C Lawrence. Microsoft coco: Common objects in context. In ECCV, pp Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, lan. Deep captioning with multimodal recurrent neural networks (m-rnn). arxiv: , December Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, lyssa, Goyal, mit, Berg, lex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daumé III, Hal. Midge: Generating image descriptions from computer vision detections. In European Chapter of the ssociation for Computational Linguistics, pp CL, Mnih, Volodymyr, Hees, Nicolas, Graves, lex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In NIPS, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to construct deep recurrent neural networks. In ICLR, Rensink, Ronald. The dynamic representation of scenes. Visual cognition, 7(1-3):17 42, Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, ndrej, Khosla, ditya, Bernstein, Michael, Berg, lexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, Simonyan, K. and Zisserman,. Very deep convolutional networks for large-scale image recognition. CoRR, abs/ , Snoek, Jasper, Larochelle, Hugo, and dams, Ryan P. Practical bayesian optimization of machine learning algorithms. In NIPS, pp , Snoek, Jasper, Swersky, Kevin, Zemel, Richard S, and dams, Ryan P. Input warping for bayesian optimization of nonstationary functions. arxiv preprint arxiv: , Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, lex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: simple way to prevent neural networks from overfitting. JMLR, 15, Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to sequence learning with neural networks. In NIPS, pp , Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, nguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, ndrew. Going deeper with convolutions. arxiv preprint arxiv: , Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan R. Learning generative models with visual attention. In NIPS, pp , Tieleman, Tijmen and Hinton, Geoffrey. Lecture rmsprop. Technical report, Vinyals, Oriol, Toshev, lexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: neural image caption generator. arxiv: , November 2014.
10 Weaver, Lex and Tao, Nigel. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UI 2001, pp , Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4): , Yang, Yezhou, Teo, Ching Lik, Daumé III, Hal, and loimonos, Yiannis. Corpus-guided sentence generation of natural images. In EMNLP, pp CL, Young, Peter, Lai, lice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TCL, 2:67 78, Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arxiv preprint arxiv: , September Neural Image Caption Generation with Visual ttention
11 Neural Image Caption Generation with Visual ttention. ppendix Visualizations from our hard (a) and soft (b) attention model. White indicates the regions where the model roughly attends to (see section 5.4). man and a woman playing frisbee in a field. (a) man and a woman playing frisbee in a field. (b) woman is throwing a frisbee in a park. Figure 6.
12 giraffe standing in a field with trees. (a) giraffe standing in the field with trees. (b) large white bird standing in a forest. Figure 7.
13 dog is laying on a bed with a book. (a) dog is laying on a bed with a book. (b) dog is standing on a hardwood floor. Figure 8.
14 woman is holding a donut in his hand. (a) woman is holding a donut in his hand. (b) woman holding a clock in her hand. Figure 9.
15 stop sign with a stop sign on it. (a) stop sign with a stop sign on it. (b) stop sign is on a road with a mountain in the background. Figure 10.
16 man in a suit and a hat holding a remote control. (a) man in a suit and a hat holding a remote control. (b) man wearing a hat and a hat on a skateboard. Figure 11.
17 little girl sitting on a couch with a teddy bear. (a) little girl sitting on a couch with a teddy bear. (b) little girl sitting on a bed with a teddy bear.
18 man is standing on a beach with a surfboard. (a) man is standing on a beach with a surfboard. (b) person is standing on a beach with a surfboard.
19 man and a woman riding a boat in the water. (a) man and a woman riding a boat in the water. (b) group of people sitting on a boat in the water. Figure 12.
20 man is standing in a market with a large amount of food. (a) man is standing in a market with a large amount of food. (b) woman is sitting at a table with a large pizza. Figure 13.
21 giraffe standing in a field with trees. (a) giraffe standing in a field with trees. (b) giraffe standing in a forest with trees in the background. Figure 14.
22 group of people standing next to each other. (a) group of people standing next to each other. (b) man is talking on his cell phone while another man watches. Figure 15.
Generating Natural Language Descriptions for Semantic Representations of Human Brain Activity
Generating Natural Language Descriptions for Semantic Representations of Human Brain Activity Eri Matsuo Ichiro Kobayashi Ochanomizu University [email protected] [email protected] Shinji Nishimoto
Compacting ConvNets for end to end Learning
Compacting ConvNets for end to end Learning Jose M. Alvarez Joint work with Lars Pertersson, Hao Zhou, Fatih Porikli. Success of CNN Image Classification Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,
Image Captioning A survey of recent deep-learning approaches
Image Captioning A survey of recent deep-learning approaches Kiran Vodrahalli February 23, 2015 The task We want to automatically describe images with words Why? 1) It's cool 2) Useful for tech companies
arxiv:1411.4555v1 [cs.cv] 17 Nov 2014
Show and Tell: A Neural Image Caption Generator Oriol Vinyals Google [email protected] Alexander Toshev Google [email protected] Samy Bengio Google [email protected] Dumitru Erhan Google [email protected]
CS 1699: Intro to Computer Vision. Deep Learning. Prof. Adriana Kovashka University of Pittsburgh December 1, 2015
CS 1699: Intro to Computer Vision Deep Learning Prof. Adriana Kovashka University of Pittsburgh December 1, 2015 Today: Deep neural networks Background Architectures and basic operations Applications Visualizing
Pedestrian Detection with RCNN
Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University [email protected] Abstract In this paper we evaluate the effectiveness of using a Region-based Convolutional
Image Caption Generator Based On Deep Neural Networks
Image Caption Generator Based On Deep Neural Networks Jianhui Chen CPSC 503 CS Department Wenqiang Dong CPSC 503 CS Department Minchen Li CPSC 540 CS Department Abstract In this project, we systematically
Learning to Process Natural Language in Big Data Environment
CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview
CAP 6412 Advanced Computer Vision
CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Jan 26, 2016 Today Administrivia A bigger picture and some common questions Object detection proposals, by Samer
Bert Huang Department of Computer Science Virginia Tech
This paper was submitted as a final project report for CS6424/ECE6424 Probabilistic Graphical Models and Structured Prediction in the spring semester of 2016. The work presented here is done by students
Image and Video Understanding
Image and Video Understanding 2VO 710.095 WS Christoph Feichtenhofer, Axel Pinz Slide credits: Many thanks to all the great computer vision researchers on which this presentation relies on. Most material
Efficient online learning of a non-negative sparse autoencoder
and Machine Learning. Bruges (Belgium), 28-30 April 2010, d-side publi., ISBN 2-93030-10-2. Efficient online learning of a non-negative sparse autoencoder Andre Lemme, R. Felix Reinhart and Jochen J. Steil
Introduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Deep Learning Barnabás Póczos & Aarti Singh Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey
Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks
Volume 143 - No.11, June 16 Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks Mohamed Akram Zaytar Research Student Department of Computer Engineering Faculty
Manifold Learning with Variational Auto-encoder for Medical Image Analysis
Manifold Learning with Variational Auto-encoder for Medical Image Analysis Eunbyung Park Department of Computer Science University of North Carolina at Chapel Hill [email protected] Abstract Manifold
Object Detection in Video using Faster R-CNN
Object Detection in Video using Faster R-CNN Prajit Ramachandran University of Illinois at Urbana-Champaign [email protected] Abstract Convolutional neural networks (CNN) currently dominate the computer
Lecture 6: CNNs for Detection, Tracking, and Segmentation Object Detection
CSED703R: Deep Learning for Visual Recognition (206S) Lecture 6: CNNs for Detection, Tracking, and Segmentation Object Detection Bohyung Han Computer Vision Lab. [email protected] 2 3 Object detection
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Convolutional Feature Maps
Convolutional Feature Maps Elements of efficient (and accurate) CNN-based object detection Kaiming He Microsoft Research Asia (MSRA) ICCV 2015 Tutorial on Tools for Efficient Object Detection Overview
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks Matthew D. Zeiler Department of Computer Science Courant Institute, New York University [email protected] Rob Fergus Department
Supporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
arxiv:1502.08029v5 [stat.ml] 1 Oct 2015
Describing Videos by Exploiting Temporal Structure arxiv:1502.08029v5 [stat.ml] 1 Oct 2015 Li Yao Université de Montréal [email protected] Nicolas Ballas Université de Montréal [email protected]
Network Morphism. Abstract. 1. Introduction. Tao Wei
Tao Wei [email protected] Changhu Wang [email protected] Yong Rui [email protected] Chang Wen Chen [email protected] Microsoft Research, Beijing, China, 18 Department of Computer Science and Engineering,
Image Classification for Dogs and Cats
Image Classification for Dogs and Cats Bang Liu, Yan Liu Department of Electrical and Computer Engineering {bang3,yan10}@ualberta.ca Kai Zhou Department of Computing Science [email protected] Abstract
Lecture 6: Classification & Localization. boris. [email protected]
Lecture 6: Classification & Localization boris. [email protected] 1 Agenda ILSVRC 2014 Overfeat: integrated classification, localization, and detection Classification with Localization Detection. 2 ILSVRC-2014
arxiv:1312.6034v2 [cs.cv] 19 Apr 2014
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Karen Simonyan Andrea Vedaldi Andrew Zisserman Visual Geometry Group,
Recurrent Neural Networks
Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time
Deformable Part Models with CNN Features
Deformable Part Models with CNN Features Pierre-André Savalle 1, Stavros Tsogkas 1,2, George Papandreou 3, Iasonas Kokkinos 1,2 1 Ecole Centrale Paris, 2 INRIA, 3 TTI-Chicago Abstract. In this work we
Deep Learning using Linear Support Vector Machines
Yichuan Tang [email protected] Department of Computer Science, University of Toronto. Toronto, Ontario, Canada. Abstract Recently, fully-connected and convolutional neural networks have been trained
Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation
Neural Networks for Machine Learning Lecture 13a The ups and downs of backpropagation Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed A brief history of backpropagation
Taking Inverse Graphics Seriously
CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best
Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016
Module 5 Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016 Previously, end-to-end.. Dog Slide credit: Jose M 2 Previously, end-to-end.. Dog Learned Representation Slide credit: Jose
InstaNet: Object Classification Applied to Instagram Image Streams
InstaNet: Object Classification Applied to Instagram Image Streams Clifford Huang Stanford University [email protected] Mikhail Sushkov Stanford University [email protected] Abstract The growing
Implementation of Neural Networks with Theano. http://deeplearning.net/tutorial/
Implementation of Neural Networks with Theano http://deeplearning.net/tutorial/ Feed Forward Neural Network (MLP) Hidden Layer Object Hidden Layer Object Hidden Layer Object Logistic Regression Object
arxiv:1506.03365v2 [cs.cv] 19 Jun 2015
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop Fisher Yu Yinda Zhang Shuran Song Ari Seff Jianxiong Xiao arxiv:1506.03365v2 [cs.cv] 19 Jun 2015 Princeton
PhD in Computer Science and Engineering Bologna, April 2016. Machine Learning. Marco Lippi. [email protected]. Marco Lippi Machine Learning 1 / 80
PhD in Computer Science and Engineering Bologna, April 2016 Machine Learning Marco Lippi [email protected] Marco Lippi Machine Learning 1 / 80 Recurrent Neural Networks Marco Lippi Machine Learning
CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies
CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies Hang Li Noah s Ark Lab Huawei Technologies We want to build Intelligent
arxiv:1502.01852v1 [cs.cv] 6 Feb 2015
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun arxiv:1502.01852v1 [cs.cv] 6 Feb 2015 Abstract Rectified activation
A Look Into the World of Reddit with Neural Networks
A Look Into the World of Reddit with Neural Networks Jason Ting Institute of Computational and Mathematical Engineering Stanford University Stanford, CA 9435 [email protected] Abstract Creating, placing,
IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...
IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic
An Early Attempt at Applying Deep Reinforcement Learning to the Game 2048
An Early Attempt at Applying Deep Reinforcement Learning to the Game 2048 Hong Gui, Tinghan Wei, Ching-Bo Huang, I-Chen Wu 1 1 Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
Fast R-CNN. Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th. Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083.
Fast R-CNN Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083. ECS 289G 001 Paper Presentation, Prof. Lee Result 1 67% Accuracy
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
An Introduction to Neural Networks
An Introduction to Vincent Cheung Kevin Cannons Signal & Data Compression Laboratory Electrical & Computer Engineering University of Manitoba Winnipeg, Manitoba, Canada Advisor: Dr. W. Kinsner May 27,
Denoising Convolutional Autoencoders for Noisy Speech Recognition
Denoising Convolutional Autoencoders for Noisy Speech Recognition Mike Kayser Stanford University [email protected] Victor Zhong Stanford University [email protected] Abstract We propose the use of
Opinion Mining with Deep Recurrent Neural Networks
Opinion Mining with Deep Recurrent Neural Networks Ozan İrsoy and Claire Cardie Department of Computer Science Cornell University Ithaca, NY, 14853, USA oirsoy, [email protected] Abstract Recurrent
Università degli Studi di Bologna
Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
GPU-Based Deep Learning Inference:
Whitepaper GPU-Based Deep Learning Inference: A Performance and Power Analysis November 2015 1 Contents Abstract... 3 Introduction... 3 Inference versus Training... 4 GPUs Excel at Neural Network Inference...
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Applications of Deep Learning to the GEOINT mission. June 2015
Applications of Deep Learning to the GEOINT mission June 2015 Overview Motivation Deep Learning Recap GEOINT applications: Imagery exploitation OSINT exploitation Geospatial and activity based analytics
A Hybrid Neural Network-Latent Topic Model
Li Wan Leo Zhu Rob Fergus Dept. of Computer Science, Courant institute, New York University {wanli,zhu,fergus}@cs.nyu.edu Abstract This paper introduces a hybrid model that combines a neural network with
Programming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH. Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane
OPTIMIZATION OF NEURAL NETWORK LANGUAGE MODELS FOR KEYWORD SEARCH Ankur Gandhe, Florian Metze, Alex Waibel, Ian Lane Carnegie Mellon University Language Technology Institute {ankurgan,fmetze,ahw,lane}@cs.cmu.edu
An Introduction to Deep Learning
Thought Leadership Paper Predictive Analytics An Introduction to Deep Learning Examining the Advantages of Hierarchical Learning Table of Contents 4 The Emergence of Deep Learning 7 Applying Deep-Learning
Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks
1 Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks Tomislav Hrkać, Karla Brkić, Zoran Kalafatić Faculty of Electrical Engineering and Computing University of
arxiv:1409.1556v6 [cs.cv] 10 Apr 2015
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION Karen Simonyan & Andrew Zisserman + Visual Geometry Group, Department of Engineering Science, University of Oxford {karen,az}@robots.ox.ac.uk
Gated Neural Networks for Targeted Sentiment Analysis
Gated Neural Networks for Targeted Sentiment Analysis Meishan Zhang 1,2 and Yue Zhang 2 and Duy-Tin Vo 2 1. School of Computer Science and Technology, Heilongjiang University, Harbin, China 2. Singapore
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
Machine Learning. 01 - Introduction
Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge
Three New Graphical Models for Statistical Language Modelling
Andriy Mnih Geoffrey Hinton Department of Computer Science, University of Toronto, Canada [email protected] [email protected] Abstract The supremacy of n-gram models in statistical language modelling
Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning
: Intro. to Computer Vision Deep learning University of Massachusetts, Amherst April 19/21, 2016 Instructor: Subhransu Maji Finals (everyone) Thursday, May 5, 1-3pm, Hasbrouck 113 Final exam Tuesday, May
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report
Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles [email protected] and lunbo
Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition
Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition Marc Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, Yann LeCun Courant Institute of Mathematical Sciences,
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Transform-based Domain Adaptation for Big Data
Transform-based Domain Adaptation for Big Data Erik Rodner University of Jena Judy Hoffman Jeff Donahue Trevor Darrell Kate Saenko UMass Lowell Abstract Images seen during test time are often not from
Pedestrian Detection using R-CNN
Pedestrian Detection using R-CNN CS676A: Computer Vision Project Report Advisor: Prof. Vinay P. Namboodiri Deepak Kumar Mohit Singh Solanki (12228) (12419) Group-17 April 15, 2016 Abstract Pedestrian detection
Learning and transferring mid-level image representions using convolutional neural networks
Willow project-team Learning and transferring mid-level image representions using convolutional neural networks Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic 1 Image classification (easy) Is there
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
CNN Based Object Detection in Large Video Images. WangTao, [email protected] IQIYI ltd. 2016.4
CNN Based Object Detection in Large Video Images WangTao, [email protected] IQIYI ltd. 2016.4 Outline Introduction Background Challenge Our approach System framework Object detection Scene recognition Body
Bayesian Dark Knowledge
Bayesian Dark Knowledge Anoop Korattikara, Vivek Rathod, Kevin Murphy Google Research {kbanoop, rathodv, kpmurphy}@google.com Max Welling University of Amsterdam [email protected] Abstract We consider the
Fast R-CNN Object detection with Caffe
Fast R-CNN Object detection with Caffe Ross Girshick Microsoft Research arxiv code Latest roasts Goals for this section Super quick intro to object detection Show one way to tackle obj. det. with ConvNets
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
Neural Networks and Support Vector Machines
INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
Temporal Difference Learning in the Tetris Game
Temporal Difference Learning in the Tetris Game Hans Pirnay, Slava Arabagi February 6, 2009 1 Introduction Learning to play the game Tetris has been a common challenge on a few past machine learning competitions.
Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks
This version: December 12, 2013 Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks Lawrence Takeuchi * Yu-Ying (Albert) Lee [email protected] [email protected] Abstract We
Improving Deep Neural Network Performance by Reusing Features Trained with Transductive Transference
Improving Deep Neural Network Performance by Reusing Features Trained with Transductive Transference Chetak Kandaswamy, Luís Silva, Luís Alexandre, Jorge M. Santos, Joaquim Marques de Sá Instituto de Engenharia
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 GENIVI is a registered trademark of the GENIVI Alliance in the USA and other countries Copyright GENIVI Alliance
A picture is worth five captions
A picture is worth five captions Learning visually grounded word and sentence representations Grzegorz Chrupała (with Ákos Kádár and Afra Alishahi) Learning word (and phrase) meanings Cross-situational
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
HE Shuncheng [email protected]. March 20, 2016
Department of Automation Association of Science and Technology of Automation March 20, 2016 Contents Binary Figure 1: a cat? Figure 2: a dog? Binary : Given input data x (e.g. a picture), the output of
Bayesian probability theory
Bayesian probability theory Bruno A. Olshausen arch 1, 2004 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using probability. The foundations
Transfer Learning for Latin and Chinese Characters with Deep Neural Networks
Transfer Learning for Latin and Chinese Characters with Deep Neural Networks Dan C. Cireşan IDSIA USI-SUPSI Manno, Switzerland, 6928 Email: [email protected] Ueli Meier IDSIA USI-SUPSI Manno, Switzerland, 6928
Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach
Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach Xavier Glorot (1) [email protected] Antoine Bordes (2,1) [email protected] Yoshua Bengio (1) [email protected]
Design call center management system of e-commerce based on BP neural network and multifractal
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):951-956 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Design call center management system of e-commerce
3D Model based Object Class Detection in An Arbitrary View
3D Model based Object Class Detection in An Arbitrary View Pingkun Yan, Saad M. Khan, Mubarak Shah School of Electrical Engineering and Computer Science University of Central Florida http://www.eecs.ucf.edu/
Deep learning applications and challenges in big data analytics
Najafabadi et al. Journal of Big Data (2015) 2:1 DOI 10.1186/s40537-014-0007-7 RESEARCH Open Access Deep learning applications and challenges in big data analytics Maryam M Najafabadi 1, Flavio Villanustre
Lecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
Unsupervised and Transfer Learning Challenge: a Deep Learning Approach
JMLR: Workshop and Conference Proceedings 27:97 111, 2012 Workshop on Unsupervised and Transfer Learning Unsupervised and Transfer Learning Challenge: a Deep Learning Approach Grégoire Mesnil 1,2 [email protected]
2 Reinforcement learning architecture
Dynamic portfolio management with transaction costs Alberto Suárez Computer Science Department Universidad Autónoma de Madrid 2849, Madrid (Spain) [email protected] John Moody, Matthew Saffell International
MSCA 31000 Introduction to Statistical Concepts
MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Deep Convolutional Inverse Graphics Network
Deep Convolutional Inverse Graphics Network Tejas D. Kulkarni* 1, William F. Whitney* 2, Pushmeet Kohli 3, Joshua B. Tenenbaum 4 1,2,4 Massachusetts Institute of Technology, Cambridge, USA 3 Microsoft
Course: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 [email protected] Abstract Probability distributions on structured representation.
Convolutional Neural Networks with Intra-layer Recurrent Connections for Scene Labeling
Convolutional Neural Networks with Intra-layer Recurrent Connections for Scene Labeling Ming Liang Xiaolin Hu Bo Zhang Tsinghua National Laboratory for Information Science and Technology (TNList) Department
